Semantic Fusion and Cross-Modal Intelligence for Discovering Covert Harmful Patterns in Multimedia Content
DOI: https://doi.org/10.64751/ijdim.2026.v5.n2(2).783

Keywords: Vision Transformer (ViT), XLNet, Natural Language Processing (NLP), Computer Vision, Supersparse Linear Integer Model (SLIM), K-Nearest Neighbors (KNN)

Abstract
The widespread expansion of social media has significantly increased the use of memes as a common form of communication, with millions shared every day. Although many memes are intended for entertainment, a considerable portion contains implicit or explicit hate speech, creating major challenges for content moderation. Identifying such content is difficult because memes are multimodal: meaning emerges from the combination of visual and textual elements rather than from either modality alone. Conventional methods depend primarily on human moderation or text-based analysis. Human moderation, however, is time-consuming, subjective, and does not scale, while text-only approaches fail to interpret visual context, sarcasm, symbolism, and hidden intent, resulting in lower accuracy and higher misclassification rates. To overcome these challenges, this study introduces a multimodal framework that combines visual and textual features for improved hate speech detection. Visual representations are extracted with a Vision Transformer (ViT), while textual features are derived with XLNet, a generalized autoregressive language model, allowing deeper semantic and contextual comprehension. These features are fused into a single representation and classified using several machine learning models, including the Supersparse Linear Integer Model (SLIM), Logistic Regression Classifier (LRC), Decision Tree Classifier (DTC), and K-Nearest Neighbors (KNN), for comparative evaluation. The proposed approach improves detection performance, reduces false positives, and strengthens contextual interpretation. It also supports scalable, real-time deployment, contributing to safer digital environments and advancing multimodal artificial intelligence research.
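The fusion-and-classification step described above can be sketched as follows. This is a minimal illustration only: the 768-dimensional embedding sizes match the base ViT and XLNet configurations, but the feature vectors here are synthetic stand-ins, and the from-scratch KNN is one of the several classifiers the study compares, not the paper's actual implementation.

```python
import numpy as np

# Embedding sizes of the base ViT and XLNet models (assumed here;
# real features would come from pretrained encoders).
VIT_DIM, XLNET_DIM = 768, 768

def fuse(vit_feat, xlnet_feat):
    """Early fusion: concatenate visual and textual feature vectors."""
    return np.concatenate([vit_feat, xlnet_feat], axis=-1)

def knn_predict(train_X, train_y, query, k=3):
    """Minimal K-Nearest Neighbors vote over fused feature vectors."""
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return int(np.bincount(train_y[nearest]).argmax())

# Synthetic toy data: two well-separated clusters of fused vectors,
# labels 0 = benign meme, 1 = hateful meme.
rng = np.random.default_rng(0)
train_X = np.vstack([
    rng.normal(0.0, 1.0, (3, VIT_DIM + XLNET_DIM)),  # benign cluster
    rng.normal(5.0, 1.0, (3, VIT_DIM + XLNET_DIM)),  # hateful cluster
])
train_y = np.array([0, 0, 0, 1, 1, 1])

# A query drawn from the "hateful" region is classified by its neighbors.
query = fuse(rng.normal(5.0, 1.0, VIT_DIM), rng.normal(5.0, 1.0, XLNET_DIM))
print(knn_predict(train_X, train_y, query))  # → 1
```

Concatenation is the simplest fusion strategy; the same fused vector can be handed to SLIM, logistic regression, or a decision tree for the comparative evaluation the abstract describes.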
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.