Deep Sentinel-XT: Robust Multimodal Meme Understanding for Hate Speech Identification

Authors

  • E. Prashanthi
  • Veerapareddy Madhava
  • Singamsetty Siva Sai Nithin
  • Surla Upendra
  • Shaik Kaif

DOI:

https://doi.org/10.64751/ijdim.2026.v5.n2.pp34-43

Keywords:

Multimodal Hate Speech Detection, Meme Classification, ViT, XLNet, Feature Fusion, SLIM Classifier

Abstract

The rapid growth of social media has led to an exponential increase in meme-based communication, with recent studies indicating that nearly 45% of harmful memes evade traditional text-only moderation systems. Additionally, manual moderation processes handle less than 30% of multimodal content efficiently, highlighting the urgent need for automated multimodal hate speech detection (MHSD) systems. Manual moderation is time-consuming, inconsistent, and not scalable to large volumes of social media data, and traditional systems often fail to capture contextual dependencies between image and text, leading to high false-positive and false-negative rates. To address these challenges, this work proposes an MHSD system for memes that integrates advanced deep learning architectures for joint image-text understanding. The system architecture begins with a meme dataset containing both visual and textual components. Image features are extracted using a Vision Transformer (ViT), which captures global visual representations through self-attention mechanisms. Simultaneously, textual content undergoes NLP preprocessing followed by feature extraction using the XLNet Transformer, enabling bidirectional contextual learning of meme text. The extracted image and text features are then processed in parallel and fused within a multimodal framework. For classification, multiple baseline models operating on the combined visual-textual embeddings within a single unified architecture are implemented: a Multimodal Parallel Logistic Regression Classifier (LRC), a Decision Tree Classifier (DTC), and K-Nearest Neighbours (KNN). Finally, a proposed Multimodal Parallel Supersparse Linear Integer Model (SLIM) Classifier is introduced to enhance interpretability, sparsity, and classification performance by learning optimized linear itemset-based relationships across both modalities. Experimental results demonstrate that the proposed multimodal SLIM-based approach achieves superior accuracy, robustness, and contextual understanding compared with the existing classifiers, making it highly effective for real-world MHSD in memes.
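
The sketch below illustrates the kind of ViT + XLNet feature-extraction and fusion pipeline the abstract describes. It is not the authors' implementation: it assumes public Hugging Face checkpoints (google/vit-base-patch16-224-in21k, xlnet-base-cased), mean pooling of transformer outputs, plain concatenation as the fusion step, and an L1-regularised logistic regression as a stand-in for the proposed SLIM classifier, whose exact integer-programming formulation is not given in the abstract.

```python
# Minimal sketch of a ViT + XLNet fusion pipeline for meme classification.
# Assumptions (not from the paper): public Hugging Face checkpoints, mean
# pooling, concatenation fusion, and an L1-regularised logistic regression
# as a stand-in for the Supersparse Linear Integer Model (SLIM) classifier.

import numpy as np
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import ViTImageProcessor, ViTModel, XLNetTokenizer, XLNetModel

vit_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").eval()
xlnet_tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
xlnet = XLNetModel.from_pretrained("xlnet-base-cased").eval()


@torch.no_grad()
def image_features(image: Image.Image) -> np.ndarray:
    """Global visual representation: mean-pooled ViT patch embeddings."""
    inputs = vit_processor(images=image, return_tensors="pt")
    hidden = vit(**inputs).last_hidden_state        # (1, num_patches + 1, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()    # (768,)


@torch.no_grad()
def text_features(text: str) -> np.ndarray:
    """Contextual text representation: mean-pooled XLNet token embeddings."""
    inputs = xlnet_tokenizer(text, return_tensors="pt", truncation=True)
    hidden = xlnet(**inputs).last_hidden_state       # (1, num_tokens, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()     # (768,)


def fuse(image: Image.Image, text: str) -> np.ndarray:
    """Early fusion: concatenate the visual and textual embeddings."""
    return np.concatenate([image_features(image), text_features(text)])


def train_classifier(fused_vectors: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """Sparse linear classifier on fused embeddings (hypothetical SLIM stand-in).

    The L1 penalty drives many weights to zero, approximating the sparsity
    and interpretability goals the abstract attributes to the SLIM classifier.
    """
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1, max_iter=1000)
    clf.fit(fused_vectors, labels)
    return clf
```

Concatenation is the simplest fusion strategy consistent with the "processed in parallel and fused" description; a faithful reproduction of the paper's SLIM classifier would replace the logistic-regression stand-in with an integer-constrained sparse linear model.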


Published

2026-04-01

How to Cite

E. Prashanthi, Veerapareddy Madhava, Singamsetty Siva Sai Nithin, Surla Upendra, & Shaik Kaif. (2026). Deep Sentinel-XT: Robust Multimodal Meme Understanding for Hate Speech Identification. International Journal of Data Science and IoT Management System, 5(2), 34-43. https://doi.org/10.64751/ijdim.2026.v5.n2.pp34-43
