CATEGORY-BASED SENTIMENT ANALYSIS OF SINDHI NEWS HEADLINES USING MACHINE LEARNING, DEEP LEARNING, AND TRANSFORMER MODELS

Dr. RAKESH; BHAVITHA; J.SATHWIKA; K.SATHWIKA

doi:10.64751/ijdim.2025.v4.n4(1).pp15-20

Keywords:

Sentiment Analysis (SA); Sindhi Language; Low-Resource Languages; Natural Language Processing (NLP); Machine Learning (ML); Deep Learning (DL); Transformer Models; XLM-RoBERTa; Explainable AI (XAI); LIME; Text Classification; News Headlines Dataset (SNHD)

Abstract

The rapid growth of digital content has made sentiment analysis (SA) an essential tool for gauging public opinion from text. Despite significant progress in natural language processing (NLP), low-resource languages, particularly Sindhi, remain underexplored because they lack computational tools and annotated datasets. This study addresses that gap by introducing the Sindhi News Headlines Dataset (SNHD), a novel corpus annotated for both SA and category classification across eight categories: Crime, Economy, Entertainment, Health, Politics, Science & Technology, Social, and Sports. We conduct a comparative analysis of machine learning (ML), deep learning (DL), and transformer-based models on both the SA and category classification tasks. We also apply an Explainable Artificial Intelligence (XAI) technique, Local Interpretable Model-Agnostic Explanations (LIME), to gain insight into model decision-making. Experimental results show that traditional ML models outperform DL and transformer-based models on SNHD. Specifically, a Support Vector Machine with a Radial Basis Function kernel (SVM-RBF) achieves the highest performance for SA (0.74 for both accuracy and weighted F-score), while the Ridge Classifier (RC) delivers the best results for category classification (0.84 for both accuracy and weighted F-score). Among transformer models, XLM-RoBERTa performs strongly on category classification (0.82 for both accuracy and weighted F-score). These findings establish a benchmark for future research in Sindhi NLP and highlight the potential of hybrid approaches for tackling the challenges of low-resource languages. This work provides a foundational resource for NLP researchers seeking to advance computational methods for Sindhi and similar underrepresented languages.
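The results above are reported as accuracy and weighted F-score. As a minimal sketch of how these two metrics are commonly computed, assuming the weighted F-score is the support-weighted average of per-class F1 (the abstract does not spell out the exact definition), the following pure-Python example may help. The function names and the toy labels are hypothetical and are not drawn from SNHD.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical toy labels for illustration only (not SNHD data).
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos"]

print("accuracy:", round(accuracy(y_true, y_pred), 4))
print("weighted F1:", round(weighted_f1(y_true, y_pred), 4))
```

Weighting by support means frequent classes contribute more to the score, which matters when sentiment or category labels are imbalanced; accuracy and weighted F-score can therefore be close, as in the reported results, without being identical.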