Introduction
With the explosive growth of data on the Internet, information retrieval technology has emerged. The most common form is single-modal text retrieval (Wang et al., 2014), which is widely applied in information retrieval systems (Luo, Meng, Quan & Tu, 2016) and structured databases. However, as image content on the Internet continues to grow, text and images have become the most common data modalities online. Single-modal text retrieval can no longer meet users' needs, and cross-modal retrieval between text and images has become a popular research topic in the field of information retrieval. It has been widely applied to visual question answering (Lin & Parikh, 2016), image description (Jia, Gavves, Fernando & Tuytelaars, 2015), text-image retrieval (Ma, Lu, Shang & Li, 2015), etc.
Budikova et al. classify and discuss fusion strategies for large-scale image retrieval. Adnan and Akbar (2019) reviewed existing information extraction technologies and their sub-tasks, limitations, and challenges on various kinds of unstructured data (including text and images), and emphasized the impact of unstructured big data on information extraction technology. With the rise of deep learning, most existing text-image retrieval approaches can be roughly divided into two categories: (1) Coarse-grained matching methods, which aim to map the whole image and the full text into a common space. Kiros, Salakhutdinov and Zemel (2014) used a Convolutional Neural Network (CNN) to extract image features and a Long Short-Term Memory (LSTM) network to extract text features. Similar methods (Faghri et al.; Gu et al.) also used CNNs to extract image features, in which only global image features are analyzed. These methods have difficulty identifying object-level semantic concepts and matching text exactly to image features. (2) Fine-grained matching methods, which aim to extract local features of the image and the text and match them by local similarity. Lee et al. proposed object-level features to align image regions with words. Although such methods capture some local semantics of image regions, global features such as background and environment information are not considered.
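The coarse-grained strategy above can be illustrated with a minimal sketch: image and sentence features are projected into a shared space and ranked by cosine similarity. The feature dimensions, the random projection matrices, and the function names here are illustrative assumptions, not the cited papers' actual models.

```python
import numpy as np

# Illustrative dimensions: a 4096-d CNN image feature and a 1024-d LSTM
# sentence feature, both projected into a shared 512-d embedding space.
rng = np.random.default_rng(0)
W_img = rng.normal(size=(4096, 512)) * 0.01  # image projection (stand-in for a learned matrix)
W_txt = rng.normal(size=(1024, 512)) * 0.01  # text projection (stand-in for a learned matrix)

def embed(x, W):
    """Project a feature vector into the common space and L2-normalize it."""
    v = x @ W
    return v / (np.linalg.norm(v) + 1e-8)

def similarity(img_feat, txt_feat):
    """Cosine similarity between image and sentence embeddings (retrieval score)."""
    return float(embed(img_feat, W_img) @ embed(txt_feat, W_txt))

img = rng.normal(size=4096)   # stand-in for a CNN image feature
txt = rng.normal(size=1024)   # stand-in for an LSTM sentence feature
score = similarity(img, txt)  # higher score = better text-image match
```

In a real system the projection matrices are learned (e.g., with a ranking loss), and retrieval sorts all candidate sentences by this score for a query image, or vice versa.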
Saliency features selectively focus on certain regions of the image, which are usually entity objects. The effective extraction of such saliency features is of great value in object segmentation (Pahuja, Majumder, Chakraborty & Babu, 2019) and object detection (Hou, Cheng, Hu, Borji, Tu & Torr, 2017). These salient objects often match the focus of human observation and the key information in the text description. Compared with global image features, fusing meaningful object-level semantic features can bring more valuable information for matching sentences.
Given the disadvantages of the above two kinds of text-image retrieval methods and the advantages of saliency features, fusing saliency features with global image features can directly mitigate these defects. Therefore, this paper proposes a text-image retrieval network that fuses object-level salient image features into global image features. The proposed method of exploring the correspondence between images and texts is characterized by two main aspects: first, it finds object-level salient regions of the image through the convolutional layers of a pre-trained network; second, it combines the salient features with global image features to ensure that object-level semantic concepts are captured without losing the global image context, thus making the image feature representation closer to the text.
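The fusion step described above can be sketched as follows, assuming salient region features have already been extracted from a convolutional layer. The mean pooling, the weighted sum, and the weight `alpha` are illustrative choices for exposition, not the paper's exact fusion design.

```python
import numpy as np

def fuse_features(global_feat, region_feats, alpha=0.5):
    """Fuse object-level salient region features with the global image feature.

    A minimal sketch: region features (e.g., taken from a convolutional layer
    of a pre-trained network) are mean-pooled into one salient descriptor and
    combined with the global descriptor by a weighted sum, then L2-normalized.
    """
    salient = region_feats.mean(axis=0)                  # pool the region features
    fused = alpha * global_feat + (1 - alpha) * salient  # combine salient + global
    return fused / (np.linalg.norm(fused) + 1e-8)        # normalize for cosine matching

rng = np.random.default_rng(1)
global_feat = rng.normal(size=512)        # stand-in for a global image descriptor
region_feats = rng.normal(size=(5, 512))  # stand-in for 5 salient object regions
fused = fuse_features(global_feat, region_feats)
```

The fused vector keeps the same dimensionality as the global feature, so it can drop into the common embedding space without changing the downstream matching pipeline.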
Experiments on the MSCOCO 5K and Flickr30K test sets show that the proposed method achieves results comparable to recent work and also improves the recall rate in text-image retrieval. The experimental results further indicate that the proposed method demonstrates stronger robustness.