Audio Vista: Whisper-Embedded Interpretable Multi-Task Framework for Urban Acoustic Intelligence with GUI Deployment
DOI:
https://doi.org/10.64751/ijdim.2026.v5.n2(1).pp161-170

Keywords:
UrbanSound, Role-based authentication, Whisper-base encoder, Rotor noise compensation, Multi-task urban sound classifier

Abstract
Urban environments produce complex acoustic landscapes with overlapping sound events such as sirens, horns, and traffic noise, complicating automated monitoring for smart cities, public safety, and noise pollution control. Traditional sound classification systems rely on hand-crafted features such as MFCCs and log-Mel spectrograms fed into SVMs, Random Forests, or shallow CNNs, achieving only modest accuracies. These approaches emerged from early-2010s research on environmental audio and evolved through competitions that established datasets such as UrbanSound8K as standards. However, traditional systems face critical limitations: hand-engineered features fail to capture semantic audio understanding amid co-occurring sounds and abnormal noise conditions; single-task models ignore label correlations; black-box deep networks lack the interpretability needed for regulatory use; and command-line interfaces exclude non-experts, hindering real-world deployment. No integrated GUI exists for multi-task urban sound classification that combines transformer representations with interpretable models, leaving a gap in accessible, transparent AI for urban monitoring. This research addresses these needs through a Whisper-powered multi-task urban sound classifier GUI, an end-to-end Tkinter application with role-based authentication (LMDB + SHA-256). It leverages OpenAI's Whisper-base encoder for state-of-the-art feature extraction: hidden states are mean-pooled over time for audio files organized in class folders, yielding one fixed-length vector per file. These features train four interpretable classifiers, the Boosted Rules Classifier (BRC), Hierarchical Structural (HS) Tree Classifier, Sparse Linear Integer Model (SLIM) Classifier, and Marginal Shrinkage Linear Trees (MSLT), on two tasks: primary sound categories and internal subcategories.
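The mean-pooling step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-frame hidden states that would come from the Whisper-base encoder (512-dimensional in the base model) are stood in by random arrays, and the class labels are hypothetical.

```python
import numpy as np

WHISPER_BASE_DIM = 512  # hidden size of the Whisper-base encoder


def mean_pool(hidden_states: np.ndarray) -> np.ndarray:
    """Collapse per-frame encoder states of shape (T, D) into one fixed-length vector (D,)."""
    return hidden_states.mean(axis=0)


def build_dataset(per_file_states, labels):
    """Stack one pooled vector per audio file into a feature matrix X and label array y."""
    X = np.stack([mean_pool(h) for h in per_file_states])
    y = np.asarray(labels)
    return X, y


# Stand-ins for encoder outputs of three clips with different frame counts.
rng = np.random.default_rng(0)
states = [rng.standard_normal((t, WHISPER_BASE_DIM)) for t in (1500, 900, 1200)]
X, y = build_dataset(states, ["siren", "horn", "traffic"])
print(X.shape)  # (3, 512)
```

Because pooling averages over the time axis, clips of any duration map to vectors of the same length, which is what lets a single tabular classifier consume them.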
The system's significance lies in democratizing interpretable AI: Whisper provides superior representations without fine-tuning, interpretable models ensure trust via rule-based decisions, and the GUI enables non-technical deployment. It advances smart city applications by enabling secure, visual multi-task classification of urban sounds, bridging transformer power with human-understandable analytics for environmental intelligence.
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.