VISION-LANGUAGE INTEGRATION FOR AUTOMATED IMAGE CAPTIONING USING CONVOLUTIONAL AND RECURRENT NEURAL NETWORKS

Authors

  • S.Vijay Kumar
  • Bijigiri Pavan Kalyan

DOI:

https://doi.org/10.64751/

Abstract

Image captioning bridges computer vision and natural language processing by generating meaningful textual descriptions from visual content. This research presents a vision-language framework for automated image caption generation using Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks. The CNN component extracts high-level visual features from input images, and the LSTM network decodes these features into grammatically coherent sentences. The model is trained on a large annotated dataset that pairs images with corresponding captions, allowing it to learn semantic relationships between objects and linguistic structures. Experimental results demonstrate that the proposed model achieves high accuracy and fluency in caption generation, outperforming traditional template-based and rule-driven methods. This approach showcases the effectiveness of deep learning in understanding and describing visual information, making it valuable for applications in accessibility, content retrieval, and human-computer interaction.
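The CNN-encoder / LSTM-decoder pipeline described in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: the abstract does not specify the CNN backbone, vocabulary, layer sizes, or how image features enter the LSTM, so all of these are assumptions here. A single convolution layer with global average pooling stands in for the CNN, the projected image feature is fed as the LSTM's first input (as in Show-and-Tell style models), and captions are produced by greedy decoding. The toy vocabulary, dimensions, and helper names (`cnn_encode`, `lstm_step`, `greedy_caption`) are hypothetical, and the weights are random and untrained, so the emitted words are arbitrary; only the data flow is meaningful.

```python
import math
import random

# Hypothetical, untrained CNN-encoder / LSTM-decoder sketch (stdlib only).
# Sizes, vocabulary and random weights are illustrative assumptions.

def conv2d_relu(image, kernel):
    """Valid 2-D convolution (cross-correlation) followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            s = sum(image[i + a][j + b] * kernel[a][b]
                    for a in range(kh) for b in range(kw))
            row.append(max(0.0, s))
        out.append(row)
    return out

def cnn_encode(image, kernels):
    """One feature per kernel: global average pooling of its feature map."""
    feats = []
    for k in kernels:
        fmap = conv2d_relu(image, k)
        feats.append(sum(map(sum, fmap)) / (len(fmap) * len(fmap[0])))
    return feats

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, P):
    """Standard LSTM cell (biases omitted for brevity)."""
    xh = x + h  # concatenate current input with previous hidden state
    i = [sigmoid(z) for z in matvec(P["Wi"], xh)]   # input gate
    f = [sigmoid(z) for z in matvec(P["Wf"], xh)]   # forget gate
    o = [sigmoid(z) for z in matvec(P["Wo"], xh)]   # output gate
    g = [math.tanh(z) for z in matvec(P["Wg"], xh)]  # candidate cell state
    c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c

def greedy_caption(image, kernels, P, emb, vocab, max_len=8):
    hidden = len(P["Wi"])
    h, c = [0.0] * hidden, [0.0] * hidden
    # Assumption: projected image features are the LSTM's first input.
    h, c = lstm_step(matvec(P["Wimg"], cnn_encode(image, kernels)), h, c, P)
    word, caption = "<start>", []
    for _ in range(max_len):
        h, c = lstm_step(emb[word], h, c, P)
        scores = matvec(P["Wout"], h)
        word = vocab[max(range(len(vocab)), key=scores.__getitem__)]  # greedy argmax
        if word == "<end>":
            break
        caption.append(word)
    return caption

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
vocab = ["<start>", "<end>", "a", "dog", "runs", "on", "grass"]
E, H, K = 4, 6, 3  # embedding size, hidden size, number of CNN kernels
emb = {w: [rng.uniform(-1, 1) for _ in range(E)] for w in vocab}
P = {name: rand_matrix(H, E + H, rng) for name in ("Wi", "Wf", "Wo", "Wg")}
P["Wimg"] = rand_matrix(E, K, rng)          # image feature -> embedding space
P["Wout"] = rand_matrix(len(vocab), H, rng)  # hidden state -> vocabulary scores
kernels = [rand_matrix(3, 3, rng) for _ in range(K)]
image = [[rng.random() for _ in range(8)] for _ in range(8)]  # toy 8x8 grayscale

caption = greedy_caption(image, kernels, P, emb, vocab)
print(caption)
```

In a real system the random matrices would typically be replaced by a pretrained CNN and decoder weights learned by minimizing cross-entropy over the image-caption pairs; beam search is a common alternative to the greedy argmax used here.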


Published

2025-11-04

How to Cite

S.Vijay Kumar, & Bijigiri Pavan Kalyan. (2025). VISION-LANGUAGE INTEGRATION FOR AUTOMATED IMAGE CAPTIONING USING CONVOLUTIONAL AND RECURRENT NEURAL NETWORKS. International Journal of Data Science and IoT Management System, 4(4), 282–289. https://doi.org/10.64751/