End-to-End CNN-RNN Model for Robust Speech Command Recognition

Authors

  • P Mahipal Reddy
  • Kotha Harshitha
  • Palithepu Rajkumar
  • Puli Purnathejitha
  • Bitla Saketh

DOI:

https://doi.org/10.64751/ijdim.2026.v5.n2(1).pp182-191

Keywords:

Environmental Audio Classification, Deep Learning, Convolutional Neural Network, MFCC, Long Short-Term Memory.

Abstract

Environmental audio signals contain rich spectral and temporal characteristics that enable automatic classification of diverse sound events. In real-world scenarios, accurate identification of sounds associated with human activities, falls, abnormal environmental noises, machine faults, and acoustic anomalies is essential for applications in safety monitoring, healthcare, surveillance, and intelligent systems. Traditional machine learning methods, including K-Nearest Neighbors (KNN), Decision Tree Classifiers (DTC), Adaptive Boosting (AdaBoost), and Linear Discriminant Analysis (LDA), rely on handcrafted features and shallow architectures. While effective on simple datasets, they struggle with complex, noisy audio and fail to capture high-level temporal dependencies; their limited generalization across varying acoustic environments motivates more advanced approaches. To address these challenges, this work proposes a hybrid deep learning model combining a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN), specifically Long Short-Term Memory (LSTM). The system extracts a comprehensive feature set using Librosa, including MFCC, Mel Spectrogram, Chroma, Zero Crossing Rate (ZCR), Root Mean Square (RMS) energy, Spectral Contrast, Spectral Bandwidth, Spectral Centroid, and Tonnetz. Together, these features capture both the frequency and temporal characteristics of audio signals. The CNN learns hierarchical spectral patterns, while the LSTM captures sequential and long-term dependencies, enabling the model to represent both sound content and its temporal evolution. Additionally, a Flask-based API supports real-time classification, allowing external systems to send audio inputs and receive predictions instantly. Experimental results show that the proposed model significantly improves accuracy, robustness, and generalization compared to the traditional methods.
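To make two of the listed features concrete, the sketch below computes frame-wise Zero Crossing Rate (ZCR) and Root Mean Square (RMS) energy with plain NumPy, following their standard definitions. This is an illustration only: the paper's pipeline uses Librosa's implementations (e.g. `librosa.feature.zero_crossing_rate` and `librosa.feature.rms`), and the function names and frame parameters here are assumed for the example, not taken from the paper.

```python
import numpy as np

def frame_signal(x, frame_length=2048, hop_length=512):
    """Slice a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(x) - frame_length) // hop_length
    idx = np.arange(frame_length)[None, :] + hop_length * np.arange(n_frames)[:, None]
    return x[idx]

def zero_crossing_rate(x, frame_length=2048, hop_length=512):
    """Fraction of adjacent-sample sign changes within each frame."""
    frames = frame_signal(x, frame_length, hop_length)
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def rms(x, frame_length=2048, hop_length=512):
    """Root-mean-square energy of each frame."""
    frames = frame_signal(x, frame_length, hop_length)
    return np.sqrt(np.mean(frames ** 2, axis=1))

# Demo: one second of a 440 Hz sine at 22.05 kHz.
sr = 22050
t = np.arange(sr) / sr
x = 0.5 * np.sin(2 * np.pi * 440 * t)

# A 440 Hz sine crosses zero 880 times/s, so ZCR per sample is
# roughly 880 / 22050 ~ 0.04; RMS of a 0.5-amplitude sine is
# 0.5 / sqrt(2) ~ 0.354.
print(zero_crossing_rate(x)[0], rms(x)[0])
```

In the full system these per-frame values would be stacked with the MFCC, Mel Spectrogram, Chroma, and other Librosa features into the sequence that the CNN-LSTM consumes.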

Published

2026-04-09

How to Cite

P Mahipal Reddy, Kotha Harshitha, Palithepu Rajkumar, Puli Purnathejitha, & Bitla Saketh. (2026). End-to-End CNN-RNN Model for Robust Speech Command Recognition. International Journal of Data Science and IoT Management System, 5(2(1)), 182-191. https://doi.org/10.64751/ijdim.2026.v5.n2(1).pp182-191
