Abstractive News Summarization for Low Resource Language Urdu
Abstract
Abstractive summarization has traditionally been studied mainly for high-resource languages such as English, largely because low-resource languages lack sufficiently large datasets. This study presents UrSum, a comprehensive and diverse dataset of professionally annotated article-summary pairs, the first of its kind for Urdu, extracted using a set of carefully crafted heuristics. To the best of our knowledge, UrSum is the largest abstractive summarization dataset ever collected for Urdu in terms of the number of samples. We adapt mT5, a massively multilingual pre-trained text-to-text transformer model, to create UrT5 and fine-tune it on UrSum to study the low-resource summarization task. Fine-tuning UrT5small achieves ROUGE-1 of 35.45, ROUGE-2 of 18.02, and ROUGE-L of 30.40; fine-tuning UrT5base achieves ROUGE-1 of 36.21, ROUGE-2 of 17.65, and ROUGE-L of 30.58.
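For readers unfamiliar with the reported metrics, the following is a minimal sketch of how ROUGE-N and ROUGE-L F1 scores can be computed from n-gram overlap and the longest common subsequence. It is an illustration only, not the evaluation code used in this study: the function names are hypothetical, whitespace tokenization is an assumption (Urdu text may call for a dedicated tokenizer), and whether the reported figures are F1 scores is also an assumption. The sketch returns values in the range 0 to 1, whereas the scores above appear to be on a 0 to 100 scale.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, ref_total, cand_total):
    """Harmonic mean of precision and recall; 0.0 when there is no overlap."""
    if overlap == 0 or ref_total == 0 or cand_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference, candidate, n=1):
    """ROUGE-N F1 using clipped n-gram overlap (assumes whitespace tokens)."""
    ref = ngram_counts(reference.split(), n)
    cand = ngram_counts(candidate.split(), n)
    overlap = sum((ref & cand).values())  # Counter & keeps the minimum counts
    return f1(overlap, sum(ref.values()), sum(cand.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Sentence-level ROUGE-L F1 based on the longest common subsequence."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    lcs = lcs_length(ref_tokens, cand_tokens)
    return f1(lcs, len(ref_tokens), len(cand_tokens))
```

In this sketch, ROUGE-1 and ROUGE-2 reward shared unigrams and bigrams respectively, while ROUGE-L rewards words that appear in the same order even when they are not adjacent.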
Committee:
- Dr. Asim Karim (Advisor)
- Dr. Agha Ali Raza