Lightweight Contextual Encoding and Ensemble Classification for Multilingual Text Disambiguation

C. Bagath Basha; M. Ramana Kumar; Edem Gunotham; Syed Khasim; Chirra Ashritha

doi:10.64751/ijdim.2026.v5.n2(2).784

Keywords:

Multilingual Language Identification, Transformer-based Models, Natural Language Processing (NLP), MiniLM, Text Classification, Machine Learning, Code-Mixed Text

Abstract

The rapid expansion of global connectivity has resulted in the widespread use of thousands of languages across digital platforms, with many users frequently communicating in more than one language. Despite this linguistic diversity, a significant portion of multilingual content remains inaccurately classified due to the limitations of existing language identification techniques. Manual labeling is time-consuming and error-prone, particularly for short, informal, or code-mixed text, and conventional algorithms often struggle to capture the deeper semantic and contextual relationships inherent in multilingual data. To address these challenges, this study proposes a transformer-based multilingual language identification framework built on Natural Language Processing (NLP) techniques. A multilingual dataset is first preprocessed through tokenization, stopword removal, and lemmatization. Exploratory Data Analysis (EDA) is then conducted to identify patterns and data distributions. Semantic features are extracted using MiniLM, a lightweight transformer model that generates contextual embeddings. These embeddings are passed to multiple machine learning classifiers, namely Decision Tree (DT), K-Nearest Neighbors (KNN), Gaussian Naive Bayes (GNB), and Random Forest (RF), to perform classification. Random Forest is employed as the primary model due to its robustness in handling high-dimensional data and its superior predictive performance among the classifiers considered. By integrating transformer-based embeddings with classical machine learning techniques, the proposed framework handles short texts, informal language, and multilingual variations. The system is implemented as a Flask-based web application, enabling real-time classification and interactive use.
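
The embed-then-classify stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the following are assumptions made for illustration: the toy sentences and labels, the function names (`embed`, `build_classifiers`, `train_all`, `identify`), the hyperparameters, and the use of scikit-learn. So that the example runs offline, `embed` is a stand-in that hashes character n-grams into 384-dimensional vectors. In a setup closer to the one described, it would be replaced by a MiniLM encoder, for example `SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2").encode(texts)` from the sentence-transformers package. The abstract does not state which MiniLM checkpoint was used.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy training data, invented for illustration only (not the paper's dataset).
TRAIN = [
    ("the weather is nice today", "en"),
    ("where is the train station", "en"),
    ("i would like a cup of tea", "en"),
    ("el clima está agradable hoy", "es"),
    ("dónde está la estación de tren", "es"),
    ("quisiera una taza de té", "es"),
    ("das wetter ist heute schön", "de"),
    ("wo ist der bahnhof", "de"),
    ("ich möchte eine tasse tee", "de"),
]

# Stand-in encoder (assumption): hashed character n-grams, NOT MiniLM.
# The output width of 384 is chosen to match common MiniLM sentence encoders.
_hasher = HashingVectorizer(
    analyzer="char_wb",
    ngram_range=(1, 3),
    n_features=384,
    alternate_sign=False,
    norm="l2",
)


def embed(texts):
    """Map a list of strings to a dense (n_samples, 384) feature matrix."""
    return _hasher.transform(texts).toarray()


def build_classifiers(seed=0):
    """Create the four classifier types named in the abstract."""
    return {
        "DT": DecisionTreeClassifier(random_state=seed),
        "KNN": KNeighborsClassifier(n_neighbors=3),
        "GNB": GaussianNB(),
        "RF": RandomForestClassifier(n_estimators=100, random_state=seed),
    }


def train_all(samples):
    """Fit every classifier on embeddings of the (text, label) samples."""
    texts, labels = zip(*samples)
    features = embed(list(texts))
    models = build_classifiers()
    for model in models.values():
        model.fit(features, list(labels))
    return models


def identify(models, text, primary="RF"):
    """Predict the language label of one text with the chosen classifier."""
    return str(models[primary].predict(embed([text]))[0])


models = train_all(TRAIN)
prediction = identify(models, "wo ist die schule")
```

Here `prediction` holds one of the labels seen in training. Keeping the encoder behind a single `embed` function means the classifiers do not change when the stand-in is swapped for real MiniLM embeddings, and the same `identify` call is what a web endpoint would invoke for real-time classification.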