End-to-End CNN-RNN Model for Robust Speech Command Recognition

Authors

  • P Mahipal Reddy
  • Kotha Harshitha
  • Palithepu Rajkumar
  • Puli Purnathejitha
  • Bitla Saketh

DOI:

https://doi.org/10.64751/ijdim.2026.v5.n2(1).pp182-191

Keywords:

Environmental Audio Classification, Deep Learning, Convolutional Neural Network, MFCC, Long Short-Term Memory.

Abstract

Environmental audio signals contain rich spectral and temporal characteristics that enable automatic classification of diverse sound events. In real-world scenarios, accurate identification of sounds associated with human activities, falls, abnormal environmental noises, machine faults, and acoustic anomalies is essential for applications in safety monitoring, healthcare, surveillance, and intelligent systems. Traditional machine learning methods, including K-Nearest Neighbors (KNN), Decision Tree Classifiers (DTC), Adaptive Boosting (AdaBoost), and Linear Discriminant Analysis (LDA), rely on handcrafted features and shallow architectures. While effective on simple datasets, they struggle with complex, noisy audio and fail to capture high-level temporal dependencies; their limited generalization across varying acoustic environments motivates more advanced approaches. To address these challenges, this work proposes a hybrid deep learning model combining a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN), specifically Long Short-Term Memory (LSTM). The system extracts a comprehensive feature set using Librosa, including MFCC, Mel Spectrogram, Chroma, Zero Crossing Rate (ZCR), Root Mean Square (RMS) energy, Spectral Contrast, Spectral Bandwidth, Spectral Centroid, and Tonnetz. Together, these features capture both the frequency and temporal characteristics of audio signals. The CNN learns hierarchical spectral patterns, while the LSTM captures sequential and long-term dependencies, enabling the model to represent both sound content and its temporal evolution. Additionally, a Flask-based API supports real-time classification, allowing external systems to send audio inputs and receive predictions instantly. Experimental results show that the proposed model significantly improves accuracy, robustness, and generalization compared to the traditional methods.
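To make two of the listed features concrete, the sketch below computes frame-wise Zero Crossing Rate (ZCR) and Root Mean Square (RMS) energy with plain NumPy, following their standard definitions. This is an illustration only: the paper's pipeline uses Librosa's implementations (e.g. `librosa.feature.zero_crossing_rate` and `librosa.feature.rms`), and the function names and frame parameters here are assumed for the example, not taken from the paper.

```python
import numpy as np

def frame_signal(x, frame_length=2048, hop_length=512):
    """Slice a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(x) - frame_length) // hop_length
    idx = np.arange(frame_length)[None, :] + hop_length * np.arange(n_frames)[:, None]
    return x[idx]

def zero_crossing_rate(x, frame_length=2048, hop_length=512):
    """Fraction of adjacent-sample sign changes within each frame."""
    frames = frame_signal(x, frame_length, hop_length)
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def rms(x, frame_length=2048, hop_length=512):
    """Root-mean-square energy of each frame."""
    frames = frame_signal(x, frame_length, hop_length)
    return np.sqrt(np.mean(frames ** 2, axis=1))

# Demo: one second of a 440 Hz sine at 22.05 kHz.
sr = 22050
t = np.arange(sr) / sr
x = 0.5 * np.sin(2 * np.pi * 440 * t)

# A 440 Hz sine crosses zero 880 times/s, so ZCR per sample is
# roughly 880 / 22050 ~ 0.04; RMS of a 0.5-amplitude sine is
# 0.5 / sqrt(2) ~ 0.354.
print(zero_crossing_rate(x)[0], rms(x)[0])
```

In the full system these per-frame values would be stacked with the MFCC, Mel Spectrogram, Chroma, and other Librosa features into the sequence that the CNN-LSTM consumes.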

Published

2026-04-09

How to Cite

P Mahipal Reddy, Kotha Harshitha, Palithepu Rajkumar, Puli Purnathejitha, & Bitla Saketh. (2026). End-to-End CNN-RNN Model for Robust Speech Command Recognition. International Journal of Data Science and IoT Management System, 5(2(1)), 182-191. https://doi.org/10.64751/ijdim.2026.v5.n2(1).pp182-191
