Automated Detection of Synthetic Social Media Profiles Using Ensemble Learning and Explainable Feature Analysis
DOI: https://doi.org/10.64751/

Abstract
The exponential growth of social media platforms has revolutionized digital communication, creating unprecedented opportunities for connectivity, commerce, and information dissemination. However, this digital transformation has been accompanied by a parallel surge in fraudulent activities, particularly through the proliferation of fake social media accounts. These synthetic profiles pose multifaceted threats including disinformation campaigns, identity theft, financial fraud, cyberbullying, and manipulation of public opinion. Instagram, with over two billion active users globally, has become a prime target for malicious actors seeking to exploit the platform's visual-centric nature and extensive reach. Traditional detection mechanisms relying on manual reporting systems and static rule-based algorithms have proven inadequate in addressing the sophisticated, evolving tactics employed by modern fraud networks. These conventional approaches suffer from high false-positive rates, delayed response times, and an inability to adapt to emerging patterns of fraudulent behavior. This research proposes a comprehensive machine learning-based framework for the automated detection of fake Instagram accounts through systematic analysis of profile metadata and behavioral indicators. The study employs a Random Forest classifier, an ensemble learning algorithm chosen for its robustness, accuracy, and interpretability in handling complex, non-linear relationships within heterogeneous datasets. Our approach utilizes a carefully curated dataset comprising sixteen discriminative features, including username characteristics (length, special character usage, name similarity), profile completeness indicators (profile picture presence, biography length, external URL inclusion), engagement metrics (follower count, following count, follower-to-following ratio), and activity patterns (post frequency, account age, story activity).
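The approach above can be illustrated with a minimal scikit-learn sketch. The feature columns and the labeling rule here are synthetic placeholders standing in for a few of the sixteen profile-metadata features; they are not the study's dataset or its actual decision logic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 400

# Synthetic stand-ins for a handful of the sixteen profile-metadata features
# described above (random placeholders, not real Instagram data).
X = np.column_stack([
    rng.integers(3, 30, n),        # username length
    rng.integers(0, 2, n),         # profile picture present (0/1)
    rng.integers(0, 150, n),       # biography length
    rng.integers(0, 5000, n),      # follower count
    rng.integers(0, 5000, n),      # following count
    rng.integers(0, 500, n),       # post count
])
# Toy labeling rule: a high following-to-follower ratio marks an account "fake".
y = ((X[:, 4] + 1) / (X[:, 3] + 1) > 2).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```

A Random Forest suits this kind of heterogeneous feature table because it handles mixed scales and non-linear interactions without extensive preprocessing.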
The methodology encompasses a complete machine learning pipeline: data acquisition through Instagram's public API and web scraping tools (Instaloader, BeautifulSoup), feature engineering and normalization, model training with stratified k-fold cross-validation, hyperparameter optimization, and performance evaluation using multiple metrics (accuracy, precision, recall, F1-score, ROC-AUC). The trained model achieved exceptional performance with 95.2% accuracy, 94.8% precision, 93.6% recall, and 94.2% F1-score on the held-out test dataset, demonstrating superior capability in distinguishing genuine from fraudulent accounts. To enhance practical applicability and accessibility, we developed a production-ready web application using the Flask framework. This user-friendly interface enables real-time account verification by accepting Instagram usernames, automatically extracting profile metadata, processing features through the trained model, and displaying instant classification results with confidence scores. The system incorporates persistent storage mechanisms, logging all predictions to CSV files for longitudinal analysis, model monitoring, and continuous improvement through periodic retraining cycles. Feature importance analysis revealed that follower-to-following ratio, posting frequency, biography completeness, and username authenticity were the most influential predictors, providing valuable insights for platform administrators and cybersecurity professionals. The system addresses critical challenges including class imbalance through stratified sampling, feature noise through normalization and ensemble averaging, scalability through parallel processing optimization, and interpretability through transparent feature importance visualization. This research contributes to the cybersecurity domain by delivering a scalable, accurate, and interpretable solution for fake profile detection. 
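The evaluation stage of the pipeline can be sketched as follows. The data is synthetic and the toy decision rule is an assumption for illustration; only the procedure (stratified k-fold cross-validation, held-out precision/recall/F1, and impurity-based feature importances) mirrors what the abstract describes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))                 # 16 features, mirroring the study
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy decision rule

clf = RandomForestClassifier(n_estimators=200, random_state=0)

# Stratified k-fold cross-validation preserves the class ratio in every fold,
# which matters under the class imbalance the abstract mentions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_f1 = cross_val_score(clf, X, y, cv=cv, scoring="f1")
print(f"CV F1: {cv_f1.mean():.3f} +/- {cv_f1.std():.3f}")

# Held-out evaluation with the same metric family reported in the abstract.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(f"precision={precision_score(y_te, pred):.3f} "
      f"recall={recall_score(y_te, pred):.3f} "
      f"f1={f1_score(y_te, pred):.3f}")

# Impurity-based feature importances (which sum to 1) provide the transparent,
# per-feature explanation the abstract highlights.
top = np.argsort(clf.feature_importances_)[::-1][:4]
print("top features:", top)
```

In a deployment like the one described, the same `predict` call would sit behind a Flask endpoint that scrapes a profile's metadata, builds the feature vector, and logs each prediction for later retraining.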
The framework's modular architecture facilitates integration with existing platform security infrastructure, while its explainable nature builds trust among stakeholders. Future enhancements may incorporate deep learning architectures, natural language processing for content analysis, cross-platform generalization, and blockchain-based identity verification, positioning this work as a foundational step toward comprehensive social media ecosystem integrity.
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.