Paraphrase Detection in the Urdu Language
Paraphrase detection is one of the most challenging tasks in Natural Language Processing (NLP). A paraphrase of a text or sentence conveys the same meaning with a different sentence structure and word order. Over the last few years, deep learning methods have become increasingly prevalent in solving NLP problems such as text classification, machine translation, text similarity detection, and image captioning. Deep learning models have likewise been applied to paraphrase detection in resource-rich languages such as English. However, little work has been done in other languages, mainly because relevant datasets are scarce. Since deep learning models require large datasets for training, paraphrase detection in these languages remains challenging and has become a focal point for research.
In this research, we intend to work on paraphrase detection in one such low-resource language, Urdu. We plan to use mT5, a multilingual variant of Google’s T5 model, together with the Urdu Paraphrase Plagiarism Corpus (UPPC) to determine whether two Urdu documents are paraphrases of each other. We will frame the problem as a binary classification task for now, and we aim to extend it to multi-class classification. We also aim to develop our own paraphrase generation model and to run the detection model on the paraphrases it generates, which should give a more realistic estimate of the detector's effectiveness. The proposed model can then be used to detect plagiarism in Urdu documents.
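Because T5-style models are text-to-text, a binary paraphrase classifier is usually set up by turning each document pair into an input string and each class into a short target string. The following is a minimal sketch of that formatting step only, not the authors' implementation. The task prefix, the `text1:`/`text2:` markers, the label strings, and the names `PairExample`, `to_text_to_text`, and `decode_prediction` are all assumptions made for illustration; the actual UPPC field names and prompt format may differ.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical target strings; the source does not specify the label vocabulary.
LABELS = {0: "not_paraphrase", 1: "paraphrase"}
LABEL_TO_ID = {text: idx for idx, text in LABELS.items()}


@dataclass
class PairExample:
    source_text: str     # original Urdu document
    candidate_text: str  # document suspected of paraphrasing it
    label: int           # 1 = paraphrase, 0 = not a paraphrase


def to_text_to_text(example: PairExample) -> Tuple[str, str]:
    """Format one document pair as an (input, target) string pair."""
    # Assumed prompt layout: a task prefix followed by both texts.
    model_input = (
        f"paraphrase detection: text1: {example.source_text} "
        f"text2: {example.candidate_text}"
    )
    return model_input, LABELS[example.label]


def decode_prediction(generated: str) -> int:
    """Map generated text back to a class id; unrecognized output counts as 0."""
    return LABEL_TO_ID.get(generated.strip().lower(), 0)
```

In such a setup, the input and target strings would be tokenized with the mT5 tokenizer and used for fine-tuning, and `decode_prediction` would be applied to the generated text at inference time. Extending to the multi-class setting would then mostly mean adding entries to `LABELS`.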