End-to-End CNN-RNN Model for Robust Speech Command Recognition
DOI:
https://doi.org/10.64751/ijdim.2026.v5.n2(1).pp182-191
Keywords:
Environmental Audio Classification, Deep Learning, Convolutional Neural Network, MFCC, Long Short-Term Memory.
Abstract
Environmental audio signals contain rich spectral and temporal characteristics that enable automatic classification of diverse sound events. In real-world scenarios, accurate identification of sounds such as human activities, falls, abnormal environmental noises, machine faults, and acoustic anomalies is essential for applications in safety monitoring, healthcare, surveillance, and intelligent systems. Traditional machine learning methods, including K-Nearest Neighbors (KNN), Decision Tree Classifier (DTC), Adaptive Boosting Classifier (AdaBoost), and Linear Discriminant Analysis (LDA), rely on handcrafted features and shallow architectures. While effective on simple datasets, they struggle with complex, noisy audio and fail to capture high-level temporal dependencies, and their limited generalization across varying acoustic environments motivates more advanced approaches. To address these challenges, this work proposes a hybrid deep learning model that combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN), specifically a Long Short-Term Memory (LSTM) network. The system extracts a broad feature set using Librosa: MFCC, Mel Spectrogram, Chroma, Zero Crossing Rate (ZCR), Root Mean Square (RMS) energy, Spectral Contrast, Spectral Bandwidth, Spectral Centroid, and Tonnetz. Together these features represent both the frequency and the temporal characteristics of audio signals. The CNN learns hierarchical spectral patterns, while the LSTM captures sequential and long-term dependencies, so the combined model accounts for both sound content and its evolution over time. Additionally, a Flask-based API supports real-time classification by allowing external systems to send audio inputs and receive predictions in response. Experimental results show that the proposed model improves accuracy, robustness, and generalization compared to the traditional methods.
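To make two of the listed features concrete, the following is a minimal sketch of frame-wise Zero Crossing Rate and RMS energy. It is an illustration only, not the authors' pipeline: the abstract states that the features are extracted with Librosa, whereas this sketch uses only the Python standard library so that the arithmetic is visible. The function names, the 16 kHz sample rate, the 400-sample frame length, and the 160-sample hop are all assumptions chosen for the example, and Librosa's own framing (for instance, its padding behaviour) may produce different values at the signal edges.

```python
import math


def frame_signal(samples, frame_length, hop_length):
    """Split samples into overlapping frames, without padding (assumed framing)."""
    last_start = len(samples) - frame_length
    return [
        samples[start:start + frame_length]
        for start in range(0, last_start + 1, hop_length)
    ]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def rms(frame):
    """Root mean square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def frame_features(samples, frame_length=400, hop_length=160):
    """Return one (zcr, rms) pair per frame; parameter values are illustrative."""
    frames = frame_signal(samples, frame_length, hop_length)
    return [(zero_crossing_rate(f), rms(f)) for f in frames]


# Synthetic input: 0.1 s of a 440 Hz sine tone at an assumed 16 kHz sample rate.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1600)]
features = frame_features(tone)
```

Each entry of `features` is one time step of a two-dimensional feature sequence. In a model of the kind the abstract describes, per-frame vectors like these (alongside MFCC, Chroma, and the other listed features) would form the time-ordered input that the CNN and LSTM layers consume.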
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.






