VISION-LANGUAGE INTEGRATION FOR AUTOMATED IMAGE CAPTIONING USING CONVOLUTIONAL AND RECURRENT NEURAL NETWORKS
DOI:
https://doi.org/10.64751/Abstract
Image captioning bridges computer vision and natural language processing by generating meaningful textual descriptions from visual content. This research presents a vision-language framework for automated image caption generation using Convolutional Neural Networks (CNN) and Long ShortTerm Memory (LSTM) networks. The CNN component extracts high-level visual features from input images, while the LSTM network decodes these features into grammatically coherent sentences. The model is trained on a large annotated dataset that aligns images with corresponding captions to learn semantic relationships between objects and linguistic structures. Experimental results demonstrate that the proposed model achieves high accuracy and fluency in caption generation, outperforming traditional template-based and rule-driven methods. This approach showcases the effectiveness of deep learning in understanding and describing visual information, making it valuable for applications in accessibility, content retrieval, and human-computer interaction
Downloads
Published
Issue
Section
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.






