Multi model Deep Fake Detection using Vision Transformers (ViT) and Hybrid Deep Learning Techniques
DOI:
https://doi.org/10.64751/ijdim.2026.v5.n2.pp98-105
Keywords:
Deepfake Detection, Vision Transformer, ViT, Audio Deepfake, MFCC, FaceForensics++, ASVspoof, Multi-Modal Detection, Explainable AI, Attention Maps, GAN, Media Forensics, Flask, Synthetic Media
Abstract
The rapid democratization of synthetic media generation, powered by Generative Adversarial Networks, diffusion models, and neural vocoders, has created a global misinformation crisis in which manipulated audio and video are increasingly indistinguishable from authentic recordings. Existing deepfake detection systems are fragmented by modality: video-only pipelines miss audio synthesis attacks, while speech anti-spoofing systems ignore visual manipulation. This paper presents a unified Multi-Model Deepfake Detection System that integrates Vision Transformer (ViT) architectures for video analysis with a Mel-frequency Cepstral Coefficient (MFCC)-based Convolutional Neural Network (CNN) for audio analysis, deployed through a Flask web application that non-technical users can access without installing anything. For video detection, the google/vit-base-patch16-224 model (12 transformer layers, 12 attention heads, hidden size D=768) is fine-tuned on FaceForensics++ and the DeepFake Detection Challenge (DFDC) dataset; 224x224 face crops are divided into 16x16 patches, linearly projected to embeddings, processed through multi-head self-attention layers, and classified via a two-class linear head. Frame-level predictions are aggregated by majority voting over frames sampled at 1 fps. For audio detection, Librosa extracts 13 MFCC coefficients over 300 frames at 16 kHz, producing (300,13) tensors that are classified by a TensorFlow CNN trained on ASVspoof 2019. Evaluated on FaceForensics++ and DFDC (video) and ASVspoof 2019 LA (audio), the system achieves 95.1% video accuracy (AUC-ROC 0.987) and 94.2% audio accuracy (AUC-ROC 0.982). Attention map visualization from the final ViT transformer layer provides spatial explainability by highlighting manipulated facial regions. GPU inference completes in 2.5 seconds for video and 0.8 seconds for audio. All 42 test cases pass (a 100% pass rate) across unit, integration, system, and performance testing, and mean user satisfaction is 4.4/5.0.
The open-source modular architecture supports future integration of cross-modal audio-visual consistency analysis.
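The pipeline arithmetic described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The function names (`num_patches`, `majority_vote`, `fix_length`) are hypothetical, and two behaviours are assumptions the abstract does not specify: ties in the frame vote resolve to "fake", and MFCC matrices shorter than 300 frames are zero-padded while longer ones are truncated.

```python
from collections import Counter


def num_patches(image_size: int = 224, patch_size: int = 16) -> int:
    """Count the non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def majority_vote(frame_labels: list[str], tie_label: str = "fake") -> str:
    """Aggregate per-frame labels into one video-level label.

    tie_label is an assumption: the abstract does not say how ties are resolved.
    """
    if not frame_labels:
        raise ValueError("at least one frame label is required")
    ranked = Counter(frame_labels).most_common()
    # A tie exists when the two most frequent labels have equal counts.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]


def fix_length(
    mfcc_frames: list[list[float]],
    target_frames: int = 300,
    n_mfcc: int = 13,
) -> list[list[float]]:
    """Truncate or zero-pad an MFCC matrix to (target_frames, n_mfcc).

    Zero-padding is an assumption; the abstract only states the final shape.
    """
    clipped = [row[:] for row in mfcc_frames[:target_frames]]
    padding = [[0.0] * n_mfcc for _ in range(target_frames - len(clipped))]
    return clipped + padding
```

With the abstract's settings, `num_patches()` gives 14 x 14 = 196 patches per face crop; ViT-Base prepends one classification token, so the encoder processes 197 tokens per frame. A call such as `majority_vote(["real", "fake", "fake"])` labels the video as fake, and `fix_length` returns 300 rows of 13 coefficients regardless of the clip length.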
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.






