Automatic Image Captioning Using Different Variants of the Long Short-Term Memory (LSTM) Deep Learning Model

Ritwik Kundu, Shaurya Singh, Geraldine Amali, Mathew Mithra Noel, Umadevi K. S.
DOI: 10.4018/978-1-6684-6001-6.ch008

Abstract

Today's world is full of digital images; however, their context is unavailable most of the time. Image captioning is therefore quintessential for conveying the content of an image. Besides generating accurate captions, an image captioning model must also be scalable. In this chapter, two variants of long short-term memory (LSTM), namely stacked LSTM and BiLSTM, are combined with convolutional neural networks (CNN) to implement an encoder-decoder model for generating captions. The bilingual evaluation understudy (BLEU) metric is used to evaluate the performance of these two bi-layered models. The study found that the two models performed comparably. Some captions received low BLEU scores, indicating that the predicted caption was dissimilar to the actual caption, whereas very high BLEU scores indicated that the model predicted captions nearly indistinguishable from human-written ones. Furthermore, the bidirectional LSTM model proved more computationally intensive and required more training time than the stacked LSTM model, owing to its more complex architecture.
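The chapter evaluates its models with the full BLEU metric. As a minimal sketch of the underlying idea, the toy function below computes only clipped unigram precision with a brevity penalty (real BLEU also aggregates higher-order n-grams); the function name and example captions are illustrative, not from the chapter.

```python
import math
from collections import Counter

def unigram_bleu(candidate: str, reference: str) -> float:
    """Toy BLEU: clipped unigram precision times a brevity penalty.
    Real BLEU combines precisions for 1- to 4-grams; this sketch
    keeps only the 1-gram term to show the mechanics."""
    cand = candidate.split()
    ref = reference.split()
    ref_counts = Counter(ref)
    # Clip each candidate word's count by its count in the reference,
    # so repeating a correct word does not inflate the score.
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty discourages captions shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

# A caption identical to the reference scores 1.0;
# a dissimilar caption scores near 0.
print(unigram_bleu("a dog runs on grass", "a dog runs on grass"))  # 1.0
print(unigram_bleu("the cat", "a dog runs on grass"))              # 0.0
```

This mirrors the chapter's observation: low scores flag captions dissimilar to the ground truth, while scores near 1.0 indicate near-human captions.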

Background

The complexity of images varies widely: some can be described by a single word, while others require multiple phrases. The authors of this chapter have conducted a detailed analysis of related studies from around the world in order to understand the different models used in similar applications of ML (machine learning) and DL (deep learning). The following table summarizes some of the existing work in this domain:

Key Terms in this Chapter

Feature Extraction: The process used to convert raw data to numerical format to make it easier to process the data while retaining the information in the original dataset.

Machine Learning: A subfield of Artificial Intelligence that enables systems to learn and improve from previous experience and data without the need for explicit programming.

Deep Learning: A subfield of Machine learning that enables computers to learn and enhance themselves with the help of neural networks.

Computer Vision: A subfield of Artificial Intelligence that allows computers to derive meaningful insights from visual inputs like images, videos, etc.

Image Captioning: The process of creating a meaningful and coherent sequence of words that best describes an input image.

Sequence Prediction: A popular problem in the field of Machine Learning that deals with predicting the next value in a given sequence based on all the values that have occurred in the sequence until now.
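Caption generation is itself a sequence prediction task: the decoder predicts the next word from the words produced so far. As a minimal illustration of the concept (a toy bigram counter, not the chapter's LSTM; all names and the sample corpus are made up), the sketch below predicts the most likely next word from observed transitions:

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count word -> next-word transitions; a statistical toy
    stand-in for what an LSTM decoder learns."""
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            model[a][b] += 1
    return model

def predict_next(model, word):
    """Return the most frequent successor of `word` in the corpus."""
    return model[word].most_common(1)[0][0]

corpus = ["a dog runs fast", "a dog barks", "a cat sleeps"]
model = train_bigrams(corpus)
print(predict_next(model, "a"))  # "dog" follows "a" most often
```

An LSTM replaces these raw counts with a learned hidden state, letting the prediction depend on the whole prefix rather than only the previous word.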

Natural Language Processing: A subfield of Artificial Intelligence and Computer Science that deals with how computers perceive natural languages. It involves giving computers the capability to process large amounts of natural language data.

Artificial Intelligence: The field of computer science that aims to give computers the ability to have human-like intelligence.
