Image Captioning Using Deep Learning

Bhavana D. (Koneru Lakshmaiah Education Foundation, India), K. Chaitanya Krishna (Koneru Lakshmaiah Education Foundation, India), Tejaswini K. (Koneru Lakshmaiah Education Foundation, India), N. Venkata Vikas (Koneru Lakshmaiah Education Foundation, India), and A. N. V. Sahithya (Koneru Lakshmaiah Education Foundation, India)
DOI: 10.4018/978-1-7998-6870-5.ch026

Abstract

The task of an image caption generator is to extract the features and activities of an image and generate human-readable captions that describe the objects in the image. Describing the contents of an image requires knowledge of both natural language processing and computer vision. The features are extracted using a convolutional neural network, which makes use of transfer learning through the Xception model. Xception stands for "extreme inception"; its feature extraction base consists of 36 convolution layers and gives accurate results when compared with other CNNs. A recurrent neural network is used to describe the image and generate accurate sentences: the feature vector extracted by the CNN is fed to an LSTM. The Flickr8k dataset, in which the data is properly labeled, is used to train the network. Given an input image, the model is able to generate captions that closely describe the activities carried out in the image. Further, the authors use BLEU scores to validate the model.
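As a minimal sketch of the feature-extraction step described above, the snippet below uses a pre-trained Xception network with its classification head removed to turn an image into a single feature vector that can later be fed to the LSTM decoder. The file name is illustrative, and average pooling over the final feature map (giving a 2048-dimensional vector) is an assumed design choice.

import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing import image

# Load Xception without its classification head; pooling="avg" yields one
# 2048-dimensional feature vector per image (transfer learning: the ImageNet
# weights are reused as a fixed feature extractor).
encoder = Xception(weights="imagenet", include_top=False, pooling="avg")

def extract_features(img_path):
    # Xception expects 299x299 RGB inputs.
    img = image.load_img(img_path, target_size=(299, 299))
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return encoder.predict(x)  # shape: (1, 2048)

features = extract_features("example.jpg")  # illustrative path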

1. Introduction

People communicate through language, whether written or spoken, and they frequently use this language to describe the visual world around them. Pictures and signs are another means of communication and understanding for physically challenged individuals. Automatically generating descriptions of an image in appropriate sentences is a very difficult and challenging task (Vinyals et al., 2015), but it can help visually impaired individuals and have a great impact on their understanding of the pictures on the internet. A good description of an image is often said to be "visualizing a picture in the mind", and forming such a mental image can play a critical part in sentence generation. Moreover, humans can describe a picture after only a quick look at it. Progress toward the complex goal of human-like recognition will come from examining existing natural image descriptions.

This task of automatically generating captions and describing an image is significantly harder than image classification and object recognition. The description of an image must capture not only the objects in the image, but also the relations between the objects, their attributes, and the activities shown (Bhavana et al., n.d.). Most previous work in visual recognition has concentrated on labeling images with a fixed set of classes or categories, which has led to large progress in this field. However, closed vocabularies of visual concepts make for a convenient but simplified model of the problem, and such concepts appear highly restrictive when compared with the enormous reasoning power that humans possess. A natural language such as English is therefore needed to express this semantic knowledge; that is, a language model is necessary for visual understanding.

Figure 1. Model based on Neural Networks (Vinyals et al., 2015)

To produce a description from an image, most previous attempts have proposed combining existing solutions to the sub-problems above. In contrast, we design a single model that takes an image as input and is trained to generate a sequence of words, each drawn from a dictionary, that describes the image adequately, as shown in Figure 1. The connection between visual meaning and descriptions relates to the text summarization problem in natural language processing (NLP) (Ahammad, Rajesh, Neetha et al., 2019). The main objective of text summarization is selecting or generating an abstract for a document. In image captioning, for any given image we want to generate a caption that describes the different features of that image (Sunitha et al., 2018).

This paper proposes a novel method for generating descriptions from images. For this task, we use the Flickr8k dataset, consisting of 8,000 images with five descriptions per image; that is, each image is paired with five natural language captions. In this work, we use a CNN together with an RNN. A pre-trained convolutional neural network (CNN), originally trained for image classification, acts as the image encoder, and its last hidden layer is used as input to a recurrent neural network (RNN), which acts as the decoder that generates sentences. In some cases the generated sentence loses track of, or mispredicts, the content of the original image: such a sentence is produced from descriptions that are common in the dataset and is only weakly related to the input image. The challenge of image captioning is therefore to design a model that can fully use the image information to obtain richer, more human-like descriptions. Image captioning is the process of generating a textual description of an image. The image is given as input to a deep neural network encoder (a CNN), which produces a "thought vector" that captures the features and nuances of the image, and an RNN decoder is used to translate these features and objects into a sequential, meaningful description of the image.
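The sketch below illustrates one common way to realize this encoder-decoder design in Keras, assuming the 2048-dimensional Xception feature vector from the earlier snippet; the vocabulary size, maximum caption length, and 256-unit layer sizes are illustrative values, not figures taken from the chapter.

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8000   # assumed size of the caption vocabulary after cleaning
max_length = 34     # assumed maximum caption length in tokens

# Image branch: project the CNN feature vector into the decoder space.
img_input = Input(shape=(2048,))
img_drop = Dropout(0.5)(img_input)
img_dense = Dense(256, activation="relu")(img_drop)

# Text branch: embed the partial caption and run it through an LSTM.
seq_input = Input(shape=(max_length,))
seq_embed = Embedding(vocab_size, 256, mask_zero=True)(seq_input)
seq_drop = Dropout(0.5)(seq_embed)
seq_lstm = LSTM(256)(seq_drop)

# Merge both branches and predict the next word of the caption.
merged = add([img_dense, seq_lstm])
hidden = Dense(256, activation="relu")(merged)
output = Dense(vocab_size, activation="softmax")(hidden)

model = Model(inputs=[img_input, seq_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")

At inference time, a caption would be generated word by word, feeding each predicted word back into the sequence input until an end token or max_length is reached, and the resulting sentences can be scored with BLEU against the five reference captions per image.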
