1. Introduction
Artificial Neural Networks (ANNs) have become popular in recent years and are used for various applications such as classification, clustering, pattern recognition, and prediction. They are more competitive and yield better results than conventional machine learning (ML) techniques (Dave et al., 2014). ANNs are widely used in applications such as image recognition, natural language processing, speech recognition, and machine translation (Izeboudjen et al., 2014). The important advantages of ANNs are self-learning, fault tolerance, the ability to capture non-linearity, and advances in input-output mapping (Wang et al., 2018). They also ease the modeling of complex natural systems with large numbers of inputs (Mahanta, 2017). The motivation behind ANNs is that they can be compared to the human brain performing a given task of interest. For example, the human brain is capable of remembering objects and recognizing the semantics behind them (Haykin, 2009). The same idea can be extended to ANNs in the development of object detection applications.
Object detection applications help in identifying objects using object models that are known a priori. Labeling objects is one of the important challenges in object detection. A given image contains different objects of interest, and labeling each object requires an intelligence mechanism: a set of correct labels must be assigned to the objects in the image. The term detection can cover functions such as identification, categorization, and discrimination. Recent studies on object detection (Babenko et al., 2014; Wan et al., 2014; Zou et al., 2019) identify the different labels of objects in an image. However, other essential elements, such as motion and whether objects are living or non-living, are not identified and tagged. The motivation of this paper lies in identifying such essential elements in a given image. Tagging living/non-living objects (and the presence/absence of motion) in the input image could be of great value in security-related applications for threat identification.
The Intelligent Sensing and Caption Generation (ISCG) system proposed in this work detects life and motion in addition to detecting objects within the image. Multiple metrics are available to measure the quality of generated sentences; here, the authors propose a new metric that measures the intelligence level of the captioning model. The intelligence of the model is measured from the set of words used in the generated caption. The proposed method looks for verbs that describe whether an action is being performed in the image. It also looks for words that indicate whether an object is inanimate, and thereby whether the object is living or not. Each criterion, when satisfied, results in a score being assigned to the model. Combining the scores for all criteria gives the intelligence level of the system.
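To make the scoring idea concrete, below is a minimal sketch of how such an intelligence score could be computed from a generated caption. It is only illustrative: it assumes NLTK for part-of-speech tagging, a small hypothetical lexicon of living-entity nouns (LIVING_NOUNS), and equal weights for the action and life criteria; none of these details are specified by the paper.

```python
# Minimal sketch of the intelligence score described above.
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk

# Hypothetical lexicon of living entities; the paper does not specify
# how this vocabulary is constructed.
LIVING_NOUNS = {"man", "woman", "boy", "girl", "dog", "cat", "bird", "horse"}

def intelligence_score(caption: str) -> float:
    """Assign 0.5 for an action verb and 0.5 for a living entity."""
    tokens = nltk.word_tokenize(caption.lower())
    tagged = nltk.pos_tag(tokens)

    # Criterion 1: a non-copular verb suggests an action/motion in the image.
    has_action = any(
        tag.startswith("VB") and word not in {"is", "are", "be", "was", "were"}
        for word, tag in tagged
    )

    # Criterion 2: a noun from the living-entity lexicon suggests life.
    has_living = any(
        tag.startswith("NN") and word in LIVING_NOUNS
        for word, tag in tagged
    )

    # Equal weights are an assumption, not the paper's weighting.
    return 0.5 * has_action + 0.5 * has_living

# Example: both criteria satisfied -> score 1.0
print(intelligence_score("A dog is running on the grass"))
```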
The main contributions of the paper are listed as follows:
- An ISCG model based on a CNN and an LSTM for the recognition of different entities in an image.
- An intelligence score assigned to each image based on the entities discovered in it.
- A comparison of the model with state-of-the-art methods on the benchmark Flickr8K dataset, where it generates better captions than the other methods.
The rest of the paper is organized as follows. Section 2 discusses related work pertaining to object detection. Section 3 presents the proposed ISCG system, followed by a discussion of the estimation of scores for intelligent object detection; finally, the results are discussed in the last section.