A Deep Structured Model for Video Captioning


V. Vinodhini, B. Sathiyabhama, S. Sankar, Ramasubbareddy Somula
Copyright: © 2020 | Pages: 13
DOI: 10.4018/IJGCMS.2020040103

Abstract

Video captions help people understand content in a noisy environment or when the sound is muted, and they help people with impaired hearing follow the content much better. Captions not only support content creators and translators but also boost search engine optimization. Advanced areas such as computer vision and human-computer interaction play a vital role here, driven by the rapid growth of deep learning techniques, and numerous surveys on deep learning models cover different methods, architectures, and metrics. Generating video subtitles nevertheless remains challenging, particularly with respect to activity recognition in video. This paper proposes a deep structured model that recognizes activities, classifies them, and captions them automatically within a single architecture. The first step separates the foreground from the background: a 3D convolutional neural network (CNN) model is built, and a Gaussian mixture model is used to remove the backdrop. Classification is performed using long short-term memory (LSTM) networks, and a hidden Markov model (HMM) is used to generate high-quality data. A nonlinear activation function then performs the normalization step, and finally the video caption is generated in natural language.
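A minimal sketch of this pipeline, assuming OpenCV's MOG2 background subtractor as the Gaussian mixture model and PyTorch for the 3D CNN and LSTM, might look as follows. The layer sizes, class count, and vocabulary size are illustrative placeholders rather than the authors' configuration, and the HMM refinement and final natural-language decoder are omitted.

```python
# Sketch of the described pipeline (assumed libraries: OpenCV, NumPy, PyTorch).
# All sizes below are illustrative placeholders, not the paper's configuration.
import cv2
import numpy as np
import torch
import torch.nn as nn

class Conv3DFeatures(nn.Module):
    """3D CNN that turns a clip of foreground frames into a feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, clip):               # clip: (batch, 1, frames, H, W)
        return self.fc(self.conv(clip).flatten(1))

class ActivityCaptioner(nn.Module):
    """LSTM over clip features: classifies the activity and emits caption tokens."""
    def __init__(self, feat_dim=256, hidden=512, num_classes=20, vocab_size=5000):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classify = nn.Linear(hidden, num_classes)
        self.to_word = nn.Linear(hidden, vocab_size)

    def forward(self, feats):              # feats: (batch, time, feat_dim)
        out, _ = self.lstm(feats)
        return self.classify(out[:, -1]), self.to_word(out)

# Gaussian mixture background subtraction on a raw frame (OpenCV MOG2).
subtractor = cv2.createBackgroundSubtractorMOG2()
frame = (np.random.rand(240, 320, 3) * 255).astype("uint8")   # dummy video frame
foreground_mask = subtractor.apply(frame)                     # per-pixel foreground mask
```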
Article Preview

Introduction

Human beings rely on automation to lead comfortable lives, and automation has driven excellence in many fields. As the saying goes, "necessity is the mother of invention": humans began to train machines to interact closely with people and to monitor the world through technologies such as identification and credentials, satellite imagery, social network analysis, and surveillance cameras. Artificial intelligence (AI) and deep learning form one of the fastest-growing research fields, widely applied to tasks such as predicting natural disasters, personalized virtual assistants, and self-driving cars, from which elderly and physically challenged people can particularly benefit (Chen et al., 2019).

Deep learning simulates human-like decision making: a model is trained rather than explicitly programmed. Such decision making can be applied to real-time problem solving because it handles high-dimensional data at scale. One such research effort is video captioning. Today's world sees an exponential increase in multimodal data such as video, audio, text, and images; the most shared content on social networks is video, and advances in mobile phone cameras now include options to shoot in fast and slow motion (Dilawari et al., 2019).

In recent years, several research works have addressed captioning video in natural language. Deep learning is the currently emerging area of machine learning; it handles sensitive data and deals effectively with live datasets. The convolutional neural network (CNN) is a deep learning architecture that propagates errors backward to refine its choices: it assigns learnable weights and biases to different objects and can thereby differentiate the content of an image (Barati et al., 2019).
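As a hypothetical illustration of these ideas, the PyTorch sketch below defines a small image classifier whose convolutional filters carry learnable weights and biases, and runs one training step in which the loss gradient is propagated backward to update them. The architecture, input size, and class count are assumptions for the example only, not the paper's settings.

```python
# Minimal CNN classifier sketch (PyTorch assumed); sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)   # assumes 32x32 inputs

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# One training step: the loss gradient propagates back through the weights
# and biases of every convolutional filter, which are then updated.
model = SmallCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
images = torch.randn(4, 3, 32, 32)             # dummy image batch
labels = torch.randint(0, 10, (4,))             # dummy class labels
loss = F.cross_entropy(model(images), labels)
loss.backward()
optimizer.step()
```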

The objective is to generate, without human intervention, a simple sentence describing a given video. The present research works toward a new dense video captioning approach that can utilize any number of modalities for event description (Lee et al., 2019). When the aim is to help hearing-impaired people understand the content of a video, the task is challenging because the model must express the actions and reactions in the video; the main issue is that description generation becomes complicated when accents vary (Liu et al., 2019). Popular datasets include the Montreal Video Annotation Dataset, the MPII Movie Description Corpus, the Microsoft Research Video Description Corpus, MSR Video to Text, and the YouTube2Text video corpus.

A caption is the translated conversation rendered as simple words at the bottom of the picture display. Captions make video more accessible to viewers in numerous ways; for example, they allow non-native English speakers to comprehend the message, and they can be enabled or disabled at the viewer's convenience. Learners are commonly grouped as visual, auditory, reading/writing, and kinesthetic, and captions benefit all of these groups. The proposed method was chosen because it is simple and shows high correlation compared with other methods; the loss is handled through a new activation function, and all dependencies are addressed, resulting in improved caption prediction.
