Data Augmentation to improve Hate Speech Classification in Roman Urdu

Event date:

Aug 3 2021 3:00 pm

Supervisor

Dr. Asim Karim

Student

Ubaid Azam

Venue

Zoom Meetings (Online)

Event

MS Thesis defense

Abstract

Automated hate speech detection has become essential as the number of users on social media platforms such as Twitter and Facebook grows exponentially. Detecting hate speech automatically in low-resource languages, and in Roman Urdu in particular, is difficult because data is scarce. Machine learning models generally perform better when trained on more data, and data augmentation is known to improve the performance of deep learning models. In this work, we propose several data augmentation techniques: synonym replacement, random swapping, random deletion, and random insertion. We also discuss how the α parameter (the percentage of words in a sentence changed by each augmentation) should be set. These techniques were applied to two Roman Urdu data sets, RUSHOLD and RUT. With the proposed data augmentation techniques, we achieved an F1-score of 95% on the RUT data set, setting the highest benchmark for toxic comment classification in Roman Urdu.
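To make the four operations and the role of α concrete, here is a minimal Python sketch. It is not the implementation from the thesis. It assumes whitespace tokenization and one common convention in which α times the sentence length gives the number of edits, with α used as a per-word drop probability for deletion. The function names, the `synonyms` dictionary, and the sample sentence are hypothetical and only illustrative, since the abstract does not say which synonym resource was used for Roman Urdu.

```python
import random

def n_changes(words, alpha):
    # Assumed convention: alpha * sentence length edits, and at least one.
    return max(1, int(alpha * len(words)))

def random_swap(words, alpha, rng):
    # Swap two randomly chosen positions, repeated n_changes times.
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_changes(words, alpha)):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, alpha, rng):
    # Drop each word with probability alpha; never return an empty sentence.
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > alpha]
    return kept if kept else [rng.choice(words)]

def synonym_replacement(words, alpha, rng, synonyms):
    # Replace up to n_changes words that have an entry in `synonyms`.
    words = list(words)
    candidates = [i for i, w in enumerate(words) if w in synonyms]
    rng.shuffle(candidates)
    for i in candidates[:n_changes(words, alpha)]:
        words[i] = rng.choice(synonyms[words[i]])
    return words

def random_insertion(words, alpha, rng, synonyms):
    # Insert a synonym of a random known word at a random position.
    words = list(words)
    sources = [w for w in words if w in synonyms]
    if not sources:
        return words
    for _ in range(n_changes(words, alpha)):
        new_word = rng.choice(synonyms[rng.choice(sources)])
        words.insert(rng.randint(0, len(words)), new_word)
    return words

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    sentence = "yeh film bohat achi hai".split()  # made-up neutral example
    synonyms = {"achi": ["umda"], "film": ["movie"]}  # hypothetical lexicon
    print(random_swap(sentence, 0.2, rng))
    print(random_deletion(sentence, 0.2, rng))
    print(synonym_replacement(sentence, 0.2, rng, synonyms))
    print(random_insertion(sentence, 0.2, rng, synonyms))
```

In this sketch, a larger α changes more of each sentence, which adds variety but raises the risk that an augmented sentence no longer matches its original label.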

Meeting Link: https://lums-edu-pk.zoom.us/j/92740718278?pwd=b2JuUXQ0ZlMrSzByVDh6NDcxcUpiZz09

Meeting ID: 927 4071 8278

Passcode: 671653