Multimedia Feature Mapping and Correlation Learning for Cross-Modal Retrieval

Xu Yuan (School of Software Technology, Dalian University of Technology, China), Hua Zhong (School of Software Technology, Dalian University of Technology, China), Zhikui Chen (School of Software Technology, Dalian University of Technology, China), Fangming Zhong (School of Software Technology, Dalian University of Technology, China) and Yueming Hu (College of Natural Resources and Environment, South China Agricultural University, China)
Copyright: © 2018 | Pages: 17
DOI: 10.4018/IJGHPC.2018070103


This article describes how, with the rapid growth of multimedia content on the Internet, the need for effective cross-modal retrieval has attracted much attention recently. Many related works focus only on the semantic mapping of modalities in linear space and use low-level hand-crafted features as the modality representation, ignoring both the latent semantic correlations of modalities in non-linear space and the extraction of high-level modality features. To address these issues, the authors first use convolutional neural networks and a topic model to obtain high-level semantic features of the various modalities. Subsequently, they propose a supervised learning algorithm based on kernel partial least squares that can capture semantic correlations across modalities. Finally, the joint model of the different modalities is learned from the training set. Extensive experiments are conducted on three benchmark datasets: Wikipedia, Pascal and MIRFlickr. The results show that the proposed approach achieves better retrieval performance than several state-of-the-art approaches.
Article Preview


Cross-modal multimedia retrieval has become a widespread concern over the last few years owing to the explosive growth of multimedia information on the Internet. Multimedia data is typically multimodal: it is derived from different channels, and data of different modalities can express the same semantics. For example, texts often serve as the semantic description of associated images or videos. The massive collections of images, texts and videos pose several challenges to multimedia retrieval. Most conventional systems, such as search engines (Google or Yahoo), support the retrieval of single-modality data only, which limits the use of multimodal data. How to use these multimodal data effectively for smart retrieval remains a challenge.

The key step in the cross-modal retrieval task, in which an image or video is retrieved by a text query, is to reduce the semantic gap across modalities. A number of cross-modal retrieval approaches (Chen, Wang, Wang, & Zhang, 2012; Rasiwasia et al., 2010; Tang, Deng, & Gao, 2015; Zhang, Zhong, Yang, Chen, & Bu, 2016; Wang, Yang, & Meinel, 2015; Wang et al., 2016; Yu, Cong, Qin, & Wan, 2012; Zhuang, Wang, Wu, Zhang, & Lu, 2013) have been devoted to addressing the semantic gap in recent years. Our work is mainly concerned with the semantic gap between image and text.

Recently, the academic community has explored several models to bridge the semantic gap. The most popular technique is arguably canonical correlation analysis (CCA) (Rasiwasia et al., 2010), which aims to obtain a common space by maximizing the correlations between feature vectors of different modalities. Another typical approach is partial least squares (PLS) (Sharma & Jacobs, 2011), which has also attracted much attention. Besides CCA and PLS, other methods have been proposed to reduce the semantic gap. Yu et al. (2012) used statistical correlation based on the topic model for image and text queries. Zhai, Peng, and Xiao (2012) proposed a joint model that exploits negative and positive correlation for cross-modal retrieval. Wang, He, Wang, Wang, and Tan (2013) applied a penalty to the projection matrices and mapped multimodal data into a common latent subspace for feature matching.
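
For reference, the objectives behind CCA and PLS can be written in their standard textbook form. The notation here is an assumption introduced for illustration and is not taken from the article: x and y denote centered image and text feature vectors, \Sigma_{xx} and \Sigma_{yy} their covariance matrices, \Sigma_{xy} their cross-covariance matrix, and w_x, w_y the projection directions for the first pair of latent components.

```latex
% CCA: find directions whose projections are maximally correlated
\rho_{\mathrm{CCA}} = \max_{w_x,\, w_y}
  \frac{w_x^{\top} \Sigma_{xy}\, w_y}
       {\sqrt{w_x^{\top} \Sigma_{xx}\, w_x}\;\sqrt{w_y^{\top} \Sigma_{yy}\, w_y}}

% PLS (basic form): find unit-norm directions whose projections have maximal covariance
\max_{\|w_x\| = \|w_y\| = 1} \; w_x^{\top} \Sigma_{xy}\, w_y
```

The two objectives differ only in normalization: CCA divides by the variances of the projections, so it is insensitive to feature scaling, whereas PLS also favors directions along which each modality has high variance. Both are linear in x and y.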

The above methods consider only the semantic mapping of modalities in linear space, neglecting both the latent semantic correlations of modalities in highly non-linear space and high-level semantic features. However, correlations across modalities may be non-linear, so a non-linear space may be more appropriate than a linear one for mining the semantic correlations of different modalities. Applying a multimodal correlation model directly in non-linear space raises a series of problems, such as the selection of non-linear mapping functions and the curse of dimensionality in high-dimensional feature space. Additionally, the low-level hand-crafted features used in these methods, such as scale-invariant feature transform (SIFT) or GIST for image representation, do not contain enough semantic information, which results in weak semantic representation. Hence, constructing a joint high-level semantic model is crucial for cross-modal retrieval.
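
The two problems named above, choosing an explicit non-linear mapping and working in a high-dimensional feature space, are what kernel functions are commonly used to sidestep. The following minimal sketch illustrates the general idea only; it is not the authors' method, and this preview does not state which kernel they use. The degree-2 polynomial kernel, the function names and the toy vectors are all assumptions chosen for illustration.

```python
import math

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot ** 2

def explicit_map(x):
    """Explicit non-linear feature map for 2-D inputs that matches poly2_kernel."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

def inner(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def gram_matrix(samples, kernel):
    """Pairwise kernel values; kernel methods need only this matrix."""
    return [[kernel(a, b) for b in samples] for a in samples]

# Hypothetical toy feature vectors (not data from the article).
samples = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]

# Gram matrix computed without ever constructing the mapped features.
K = gram_matrix(samples, poly2_kernel)

# The same values obtained the expensive way, via the explicit mapping.
mapped = [explicit_map(s) for s in samples]
K_explicit = [[inner(u, v) for v in mapped] for u in mapped]
```

Both matrices agree up to floating-point error, which shows that the kernel evaluates inner products in the mapped space while operating only on the original vectors. A kernelized correlation model works from the Gram matrix alone, so the mapping never has to be written down and its dimensionality never has to be handled directly.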
