Lightweight Contextual Encoding and Ensemble Classification for Multilingual Text Disambiguation
DOI:
https://doi.org/10.64751/ijdim.2026.v5.n2(2).784
Keywords:
Multilingual Language Identification, Transformer-based Models, Natural Language Processing (NLP), MiniLM, Text Classification, Machine Learning, Code-Mixed Text
Abstract
The rapid expansion of global connectivity has led to thousands of languages being used across digital platforms, with many users routinely communicating in more than one language. Despite this linguistic diversity, a significant portion of multilingual content is still classified inaccurately because of the limitations of existing language identification techniques. Traditional manual approaches are time-consuming and error-prone, particularly for short, informal, or code-mixed text. Moreover, conventional algorithms often struggle to capture the deeper semantic and contextual relationships present in multilingual data. To address these challenges, this study proposes a transformer-based multilingual language identification framework built on Natural Language Processing (NLP) techniques. The process begins with a multilingual dataset that undergoes preprocessing steps such as tokenization, stopword removal, and lemmatization. Exploratory Data Analysis (EDA) is then conducted to identify patterns and data distributions. Semantic features are extracted using Miniature Language Model (MiniLM), a lightweight transformer model that generates contextual embeddings. These embeddings are passed to multiple machine learning classifiers, namely Decision Tree (DT), K-Nearest Neighbors (KNN), Gaussian Naive Bayes (GNB), and Random Forest (RF), to perform classification. Random Forest is employed as the primary model because of its robustness on high-dimensional data and its superior predictive performance. By combining transformer-based embeddings with classical machine learning techniques, the proposed framework handles short texts, informal language, and multilingual variations. The system is implemented as a Flask-based web application that supports real-time classification and interactive use.
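The embedding-plus-classifier step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`minilm_embedder`, `train_language_classifier`, `predict_language`), the checkpoint `paraphrase-multilingual-MiniLM-L12-v2`, and the hyperparameters are all assumptions, since the abstract specifies only "MiniLM" and "Random Forest". Preprocessing, EDA, the other three classifiers, and the Flask layer are omitted.

```python
from typing import Callable, Sequence

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# An embedder maps a batch of texts to a 2-D array (one row per text).
Embedder = Callable[[Sequence[str]], np.ndarray]


def minilm_embedder(
    model_name: str = "paraphrase-multilingual-MiniLM-L12-v2",
) -> Embedder:
    """Return a function that encodes texts with a MiniLM sentence model.

    The checkpoint name is an assumption; the abstract only says "MiniLM".
    """
    # Imported lazily so the rest of the sketch runs without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(list(texts))


def train_language_classifier(
    texts: Sequence[str],
    labels: Sequence[str],
    embed: Embedder,
    n_estimators: int = 200,  # hypothetical setting, not from the paper
    seed: int = 0,
) -> RandomForestClassifier:
    """Fit a Random Forest on the embeddings of the training texts."""
    features = np.asarray(embed(texts))
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    clf.fit(features, list(labels))
    return clf


def predict_language(
    clf: RandomForestClassifier, texts: Sequence[str], embed: Embedder
) -> list:
    """Predict one language label per input text."""
    return clf.predict(np.asarray(embed(texts))).tolist()
```

Under these assumptions, one would call `embed = minilm_embedder()`, then `clf = train_language_classifier(train_texts, train_labels, embed)`, and finally `predict_language(clf, ["some new text"], embed)`. Passing the embedder in as a parameter keeps the classifier code independent of the specific transformer, so another encoder could be substituted for comparison.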
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.






