Plagiarism Detection in the Urdu Language
Abstract:
In the modern era of technology, a great deal of research and development has been carried out for languages across the globe. This work includes the creation of resources and tools that help the speakers of those languages in various ways. Urdu, the 11th most widely spoken language in the world, has considerable potential for resource development, yet very little work has been done on checking the authenticity of Urdu text. To fill this gap, this research primarily aimed to create a corpus that can set a standard for detecting plagiarized content in the Urdu language. After studying prior work, we employed three obfuscation techniques, following in the footsteps of the PAN research, which created such a corpus for the English language. The techniques are Random Shuffling, Translation-based Obfuscation, and POS-based Shuffling, applied to the UPPC and COUNTER corpora. Using these obfuscation techniques, we have successfully created 6 different corpora from the aforementioned datasets. We have also employed the BLEU score as a measure to evaluate the credibility of the created datasets.
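As a rough illustration of the random shuffling idea and of how a BLEU-style measure reacts to it, here is a minimal Python sketch. It is not the procedure used in this research: the function names `shuffle_obfuscate` and `ngram_precision`, the `window` and `seed` parameters, and the whitespace tokenization are all assumptions made for the example, and only the clipped n-gram precision component of BLEU is computed rather than the full score.

```python
import random
from collections import Counter


def shuffle_obfuscate(text, window=4, seed=None):
    """Shuffle tokens inside consecutive windows (hypothetical helper).

    Assumes whitespace tokenization, which is only an approximation for Urdu.
    """
    rng = random.Random(seed)
    tokens = text.split()
    out = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        rng.shuffle(chunk)  # reorder words locally, keep the same words
        out.extend(chunk)
    return " ".join(out)


def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision, the building block of BLEU (hypothetical helper)."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return clipped / total


source = "یہ ایک سادہ مثال ہے"
obfuscated = shuffle_obfuscate(source, window=3, seed=0)
print(obfuscated)
print(ngram_precision(obfuscated, source, n=1))
print(ngram_precision(obfuscated, source, n=2))
```

Because shuffling keeps every word, the unigram precision stays at 1.0, while higher-order n-gram precision can only stay the same or drop as the word order diverges from the source. This is why an n-gram overlap measure gives a rough sense of how heavily a passage has been obfuscated.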
Committee:
Dr. Agha Ali Raza (Advisor)
Dr. Asim Karim
Zoom link: https://lums-edu-pk.zoom.us/j/95241422118?pwd=NTZRUXk1UGdEU2RvSlQ4T3BlQVBxQT09
Meeting ID: 952 4142 2118
Passcode: 894508