Aug 3 2021 3:00 pm
Data Augmentation to improve Hate Speech Classification in Roman Urdu
Dr. Asim Karim
Zoom Meetings (Online)
MS Thesis defense
Automated hate speech detection has become essential in an era of exponential growth in the number of users of social media platforms such as Twitter and Facebook. Detecting hate speech automatically in low-resource languages, and in Roman Urdu in particular, is difficult because data is scarce. Machine learning models generally perform better the more data they are trained on, and data augmentation is known to improve the performance of deep learning models. In this work we propose several data augmentation techniques: synonym replacement, random swapping, random deletion, and random insertion. We also discuss how the α parameter (the percentage of words in a sentence changed by each augmentation) should be set. These techniques were applied to two Roman Urdu data sets, RUSHOLD and RUT. With the proposed efficient data augmentation techniques, we achieved an F1-score of 95% on the RUT data set, setting a new benchmark for toxic comment classification in Roman Urdu.
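The four augmentation operations named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: the `SYNONYMS` table, the function names, the rule of changing at least one word, and the use of α as a per-word deletion probability are all assumptions made for illustration.

```python
import random

# Hypothetical toy synonym table. A real Roman Urdu synonym resource is
# assumed here; the abstract does not say which one was used.
SYNONYMS = {
    "acha": ["behtar", "umda"],
    "bura": ["kharab"],
}

def n_changes(words, alpha):
    """Number of words to change: alpha is the fraction of the sentence (assumed >= 1)."""
    return max(1, int(alpha * len(words)))

def synonym_replacement(words, alpha, rng):
    """Replace up to n randomly chosen words that have an entry in SYNONYMS."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n_changes(words, alpha)]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_insertion(words, alpha, rng):
    """Insert a synonym of a random known word at a random position, n times."""
    out = list(words)
    pool = [w for w in words if w in SYNONYMS]
    if not pool:
        return out
    for _ in range(n_changes(words, alpha)):
        synonym = rng.choice(SYNONYMS[rng.choice(pool)])
        out.insert(rng.randint(0, len(out)), synonym)
    return out

def random_swap(words, alpha, rng):
    """Swap two randomly chosen positions, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n_changes(words, alpha)):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, alpha, rng):
    """Drop each word with probability alpha, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= alpha]
    return kept or [rng.choice(words)]

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    sentence = "yeh film bohat acha hai".split()
    for op in (synonym_replacement, random_insertion, random_swap, random_deletion):
        print(op.__name__, " ".join(op(sentence, 0.2, rng)))
```

In this sketch, a larger α changes more of each sentence, so the augmented copies are more varied but also more likely to drift from the meaning of the original label.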