Text-Image Retrieval With Salient Features

Xia Feng, Zhiyi Hu, Caihua Liu, W. H. Ip, Huiying Chen
Copyright: © 2021 |Pages: 13
DOI: 10.4018/JDM.2021100101

Abstract

In recent years, deep learning has achieved remarkable results in the text-image retrieval task. However, most existing methods consider only global image features and ignore vital local information, which leads to poor matching between text and image. Considering that object-level image features can help match text with images, this article proposes a text-image retrieval method that fuses a salient image feature representation. Fusing salient features at the object level improves the understanding of image semantics and thus improves text-image retrieval performance. The experimental results show that the proposed method is comparable to the latest methods, and the recall of some retrieval results exceeds that of current work.

Introduction

With the explosive growth of data on the Internet, information retrieval technology has become indispensable. The most common form is single-modality text retrieval (Wang, L. et al., 2014), which is often applied in information retrieval systems (Luo, Meng, Quan & Tu, 2016) and structured databases. However, as more and more image information appears on the Internet, text and images have become the most common data modalities. Single-modality text retrieval can no longer meet users' needs, and cross-modal retrieval between text and image has become a popular research topic in information retrieval. It has been widely used in visual question answering (Lin & Parikh, 2016), image description (Jia, Gavves, Fernando & Tuytelaars, 2015), text-image retrieval (Ma, Lu, Shang & Li, 2015), etc.

Budikova et al. classify and discuss fusion strategies for large-scale image retrieval. Adnan and Akbar (2019) reviewed existing information extraction techniques, their sub-tasks, and their limitations and challenges on various kinds of unstructured data (including text and images), and emphasized the impact of unstructured big data on information extraction technology. With the rise of deep learning, most existing text-image retrieval approaches can be roughly divided into two categories: (1) Coarse-grained matching methods, which map the whole image and the full text into a common space. Kiros, Salakhutdinov and Zemel (2014) used a Convolutional Neural Network (CNN) to extract image features and a Long Short-Term Memory (LSTM) network to extract text features. Similar methods by Faghri et al. and Gu et al. used only CNNs to extract image features, so only global image features are analyzed. These methods have difficulty identifying object-level semantic concepts and matching text exactly to image features. (2) Fine-grained matching methods, which extract local features of the image and the text and match them by local similarity. Lee et al. proposed object-level features to align image regions with words. Although some local semantics of image regions are captured, global information such as background and environment is not considered.
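
To make the coarse-grained category concrete, the following is a minimal sketch of a joint-embedding model in the spirit of these methods (not the cited papers' exact architectures; the ResNet-50 backbone, dimensions, and module names are illustrative assumptions): a CNN encodes the whole image, an LSTM encodes the full sentence, and both are projected into a common space scored by cosine similarity.

```python
# Illustrative coarse-grained joint embedding (assumed backbone and dimensions,
# not the exact architectures of the cited works).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class JointEmbedding(nn.Module):
    def __init__(self, vocab_size, word_dim=300, embed_dim=1024):
        super().__init__()
        cnn = models.resnet50(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # global pooled image feature
        self.img_fc = nn.Linear(2048, embed_dim)               # project image into the common space
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, embed_dim, batch_first=True)

    def forward(self, images, captions):
        img = self.cnn(images).flatten(1)                      # (B, 2048) whole-image feature
        img = F.normalize(self.img_fc(img), dim=-1)
        _, (h, _) = self.lstm(self.embed(captions))            # last hidden state summarizes the sentence
        txt = F.normalize(h[-1], dim=-1)
        return img @ txt.t()                                   # cosine similarity matrix (images x texts)
```

Such a model is typically trained with a ranking loss so that matching image-text pairs score higher than mismatched ones.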

Saliency features selectively focus on certain regions of the image, usually entity objects. Effective extraction of such saliency features is of great value in object segmentation (Pahuja, Majumder, Chakraborty & Babu, 2019) and object detection (Hou, Cheng, Hu, Borji, Tu & Torr, 2017). These salient objects often correspond to the focus of human observation and to the key information in the text description. Compared with global image features alone, fusing meaningful object-level semantic features can bring more valuable information for matching sentences.

Given the shortcomings of the above two kinds of text-image retrieval methods and the advantages of saliency features, fusing saliency features with global image features can mitigate these limitations in a targeted way. Therefore, this paper proposes a text-image retrieval network that fuses object-level salient image features into the global image features. The proposed method of exploring the correspondence between images and texts has two main aspects: first, finding object-level salient regions of the image through the convolutional layers of a pre-trained network; second, combining salient features with global image features so that object-level semantic concepts are captured without losing the global image context, thus bringing the image feature representation closer to the text.
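
As a rough illustration of the second aspect only (the fusion step), the sketch below assumes salient-region features have already been extracted; the mean pooling, concatenation, and module names are assumptions for exposition, not the paper's exact network.

```python
# Minimal fusion sketch: combine pooled object-level salient-region features with the
# global image feature before projecting into the common embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SalientFusion(nn.Module):
    def __init__(self, global_dim=2048, region_dim=2048, embed_dim=1024):
        super().__init__()
        self.fc = nn.Linear(global_dim + region_dim, embed_dim)

    def forward(self, global_feat, region_feats):
        # global_feat:  (B, global_dim)     whole-image feature
        # region_feats: (B, R, region_dim)  features of R salient object regions
        pooled = region_feats.mean(dim=1)                 # aggregate object-level cues
        fused = torch.cat([global_feat, pooled], dim=-1)  # keep the global context
        return F.normalize(self.fc(fused), dim=-1)        # image embedding used to match text
```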

Experiments on the MSCOCO 5K and Flickr30K test sets show that the proposed method achieves results comparable to the latest work and also improves recall in text-image retrieval. The experimental results indicate that the proposed method is more robust.
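
For reference, recall here refers to the Recall@K metric commonly reported on these benchmarks; a simple sketch is given below (assuming, for brevity, one ground-truth text per image, whereas the benchmarks pair each image with five captions).

```python
# Recall@K over an image-to-text similarity matrix; ground truth is the diagonal.
import numpy as np

def recall_at_k(similarity, k):
    """similarity[i, j] is the score between image i and text j."""
    ranks = (-similarity).argsort(axis=1)                  # texts sorted by descending score
    hits = [i in ranks[i, :k] for i in range(similarity.shape[0])]
    return float(np.mean(hits))                            # fraction of queries with a hit in the top K
```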
