Deep Learning Methods for Short, Informal, and Multilingual Text Analytics

Year:
2020
Supervisor:
Dr. Asim Karim
Student:
Muhammad Haroon Shakeel
Department:
Computer Science
Abstract:
The popularity of social media platforms and knowledge-sharing websites has tremendously increased the amount of user-generated textual content. Such content is usually short and often written informally (e.g., improper grammar, self-created abbreviations, and varying spellings). It is also influenced by local languages and frequently mixes multiple languages mid-utterance, a phenomenon known as code-switching. Traditional text analytics and natural language processing (NLP) approaches perform poorly on short, informal, and multilingual text compared to well-written longer documents because of the limited context and language resources available for learning. In recent years, deep learning has produced improved results for many NLP tasks. However, these approaches have some major shortcomings: (1) they are tailored for specific problem settings (e.g., short text or informal languages) and do not generalize well to other settings, (2) they do not exploit multiple perspectives and resources for effective learning, and (3) they are hampered by small training datasets.

In this research, we present methods and models for effective classification of user-generated text, with a specific application to English and Roman Urdu short and informal text. We present a novel multi-cascaded deep learning model (McM) for robust classification of noisy and clean short text. McM incorporates three independent CNN and LSTM (with and without soft attention) cascades for feature learning. Each cascade is responsible for capturing a specific aspect of natural language. The CNN-based cascade extracts n-gram information. The LSTM-based cascade with soft attention "highlights" the task-specific vital words. The third, LSTM-based cascade captures long-term dependencies in the text. Each cascade is locally supervised and trained independently. The deep representations learned by the cascades are forwarded to a discriminator for final prediction. As a whole, the architecture is both deep and wide, and is versatile enough to incorporate learned and linguistic features for robust text classification.
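To make the cascade layout concrete, the following is a minimal PyTorch sketch of this kind of multi-cascaded design, not the thesis implementation: the class name, layer counts, and dimensions are illustrative assumptions, single-layer cascades stand in for whatever depth the actual model uses, and the local supervision heads are shown only as extra logits that a training loop could attach losses to.

```python
import torch
import torch.nn as nn

class McMSketch(nn.Module):
    """Sketch of a multi-cascaded text classifier: a CNN cascade, an
    attention-LSTM cascade, and a plain-LSTM cascade, each with its own
    locally supervised softmax head, feeding a shared discriminator."""

    def __init__(self, vocab_size, emb_dim=300, hidden=128, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Cascade 1: CNN over word embeddings captures n-gram features.
        self.conv = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
        # Cascade 2: LSTM whose hidden states are pooled by soft attention.
        self.lstm_att = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.att_score = nn.Linear(hidden, 1)
        # Cascade 3: plain LSTM for long-term dependencies.
        self.lstm_plain = nn.LSTM(emb_dim, hidden, batch_first=True)
        # One softmax head per cascade for local supervision.
        self.heads = nn.ModuleList([nn.Linear(hidden, n_classes)
                                    for _ in range(3)])
        # Discriminator over the concatenated cascade representations.
        self.discriminator = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):                     # x: (batch, seq_len) token ids
        e = self.embed(x)                     # (batch, seq, emb)
        # CNN cascade: max-pool convolved n-gram feature maps over time.
        c = torch.relu(self.conv(e.transpose(1, 2))).max(dim=2).values
        # Attention cascade: weight hidden states by learned scores.
        h, _ = self.lstm_att(e)               # (batch, seq, hidden)
        alpha = torch.softmax(self.att_score(h), dim=1)
        a = (alpha * h).sum(dim=1)
        # Plain-LSTM cascade: last hidden state summarizes the sequence.
        _, (h_n, _) = self.lstm_plain(e)
        p = h_n[-1]
        reps = [c, a, p]
        local_logits = [head(r) for head, r in zip(self.heads, reps)]
        final_logits = self.discriminator(torch.cat(reps, dim=1))
        return final_logits, local_logits
```

Under this reading, a training loop would minimize the sum of cross-entropy losses on `final_logits` and the three `local_logits`, which matches the idea that each cascade is locally supervised while the discriminator fuses all three views for the final prediction.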

We evaluate the effectiveness and generality of our model on three different text analytics problems. First, we show the efficacy of our model for the problem of paraphrase detection. This is a binary classification problem in which pairs of texts are labeled as either positive (paraphrase) or negative (non-paraphrase). While deep models produce a richer text representation, they require large amounts of data for training. Getting additional pairs of texts annotated in a crowd-sourced setting is costly. Thus, for this particular task, we also develop a novel data augmentation strategy, which reliably generates additional paraphrase and non-paraphrase annotations from existing ones. The augmentation procedure involves several steps and a parameter through which the degree of augmentation can be tuned. We evaluate our model and data augmentation strategy on three benchmark datasets representing both noisy and clean English texts. Our model achieves higher predictive performance on all three datasets, surpassing all previously published results on them.
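The abstract does not spell out the augmentation steps, so the sketch below is only one plausible reading: it assumes paraphrase relations propagate through shared anchor texts (two paraphrases of the same text are treated as paraphrases of each other, while a paraphrase and a non-paraphrase of the same text are treated as a non-paraphrase pair), and the `degree` parameter is a hypothetical stand-in for the tunable degree-of-augmentation parameter mentioned above.

```python
from collections import defaultdict
from itertools import combinations, islice

def augment_pairs(pairs, degree=2):
    """Derive extra labeled pairs from existing annotations (a sketch).

    pairs:  iterable of (text_a, text_b, label), label 1 = paraphrase,
            0 = non-paraphrase.
    degree: caps how many new pairs of each kind an anchor contributes.
    """
    pos, neg = defaultdict(set), defaultdict(set)
    for a, b, y in pairs:
        (pos if y == 1 else neg)[a].add(b)
        (pos if y == 1 else neg)[b].add(a)
    new = []
    for anchor in pos:
        # Two paraphrases of the same anchor -> assumed paraphrase pair.
        positives = combinations(sorted(pos[anchor]), 2)
        new.extend((x, y, 1) for x, y in islice(positives, degree))
        # Paraphrase of the anchor vs. non-paraphrase of the anchor
        # -> assumed non-paraphrase pair.
        negatives = ((x, y) for x in sorted(pos[anchor])
                     for y in sorted(neg[anchor]))
        new.extend((x, y, 0) for x, y in islice(negatives, degree))
    return new

extra = augment_pairs([("s1", "s2", 1), ("s1", "s3", 1), ("s1", "s4", 0)])
# -> [("s2", "s3", 1), ("s2", "s4", 0), ("s3", "s4", 0)]  (order may vary)
```

Raising `degree` trades annotation volume against reliability: pairs derived through longer chains of shared anchors are noisier, which is presumably why the thesis exposes the degree of augmentation as a tunable parameter.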

Second, we show the usefulness of McM for the task of multi-class classification of bilingual SMS. Our goal is to achieve this without any prior knowledge of the language, code-switching indication, language translation, normalization of lexical variations, or language transliteration. For this purpose, we develop and make publicly available a large-scale 12-class dataset. The texts in this dataset contain English as well as Roman Urdu, a distinct informal style of written communication that uses the English alphabet to write Urdu. Our model achieves greater robustness than the baseline model on this dataset.

Third, we demonstrate the utility of the proposed model for the task of sentiment classification in code-switched tweets. For this purpose, we develop another short-text dataset, namely MultiSenti, which is code-switched between Roman Urdu and English. The proposed McM model outperforms three baseline models on the MultiSenti dataset in terms of predictive accuracy. We also study the feasibility of adapting language resources from English and of learning domain-specific word embeddings in Roman Urdu for multilingual sentiment classification.
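As one concrete illustration of the embedding-learning direction (a sketch only; the toy corpus and hyperparameters below are assumptions, not the thesis setup), domain-specific Roman Urdu embeddings could be trained directly on code-switched text with gensim:

```python
from gensim.models import Word2Vec

# Hypothetical Roman Urdu / English code-switched corpus, one tokenized
# tweet per entry (real training would use a MultiSenti-scale corpus).
corpus = [
    ["yeh", "movie", "bohat", "achi", "thi"],
    ["service", "bilkul", "bekar", "hai", "yar"],
]

# Train skip-gram embeddings on the raw code-switched stream, with no
# normalization, transliteration, or language identification step.
model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                 min_count=1, sg=1, epochs=10, workers=2)

vec = model.wv["achi"]   # 100-dim vector for a Roman Urdu token
```

Training on the unnormalized code-switched stream lets frequent spelling variants of the same Roman Urdu word acquire nearby vectors through shared contexts, which is in the spirit of avoiding the normalization and transliteration steps listed above.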

This research highlights the power of multi-perspective feature learning and data augmentation for short and informal text classification, and takes us a step closer to language-independent text analytics.