Constructing and Utilizing Video Ontology for Accurate and Fast Retrieval

Kimiaki Shirahama (Kobe University, Japan) and Kuniaki Uehara (Kobe University, Japan)
Copyright: © 2013 | Pages: 17
DOI: 10.4018/978-1-4666-2940-0.ch012

This paper examines video retrieval based on the Query-By-Example (QBE) approach, where shots relevant to a query are retrieved from large-scale video data based on their similarity to example shots. This involves two crucial problems. The first is that similarity in features does not necessarily imply similarity in semantic content. The second is the high computational cost of computing the similarity of a huge number of shots to the example shots. The authors have developed a method that can filter out a large number of shots irrelevant to a query, based on a video ontology, that is, a knowledge base about the concepts displayed in a shot. The method utilizes various concept relationships (e.g., generalization/specialization, sibling, part-of, and co-occurrence) defined in the video ontology. In addition, although the video ontology assumes that shots are accurately annotated with concepts, accurate annotation is difficult due to the diversity of forms and appearances of the concepts. Dempster-Shafer theory is used to account for the uncertainty in determining the relevance of a shot based on its inaccurate annotation. Experimental results on TRECVID 2009 video data validate the effectiveness of the method.
Chapter Preview

A popular current ontology for video retrieval is the Large-Scale Concept Ontology for Multimedia (LSCOM) (Naphade et al., 2006). LSCOM defines a standardized set of 1,000 concepts, such as Person, Car, and Building. Concepts are selected based on their utility for classifying content in videos, their coverage in responding to a variety of user queries, their feasibility for automatic detection, and the availability (observability) of large annotated data sets. As exemplified by the high-level feature extraction (concept detection) task in TRECVID (Smeaton, Over, & Kraaij, 2006), considerable research effort has been spent on developing methods to accurately detect LSCOM concepts in shots. However, a crucial problem with LSCOM is that it provides only a list of concepts. For LSCOM concepts to be utilized in video retrieval, they need to be organized into a meaningful structure, which then makes it possible to select concepts related to a given query. In what follows, we describe the contributions of this paper in the areas of concept structuring (concept relationship extraction) and the selection of concepts related to the query.

A number of researchers have studied the extraction of relationships among LSCOM concepts. Koskela et al. (2007) proposed a method that computes the degree of concept relationships by analyzing the result of clustering shots based on features. Two concepts are considered to be related if shots annotated with those concepts are distributed across the same clusters. Yan et al. (2006) proposed a method for extracting dependencies among concepts using a probabilistic graphical model. Wei and Ngo (2008) developed a method that extracts relationships among concepts using WordNet, a large lexical database in which synonym sets (synsets) of nouns, verbs, adjectives, and adverbs are interlinked based on their meanings (Fellbaum, 1998). In addition, the method of Wei and Ngo (2008) extracts co-occurrence relationships among concepts using annotated shots. Weng and Chuang (2008) developed a method that extracts co-occurrence relationships among concepts and inter-shot relationships by conducting the chi-square test on annotated shots.
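To make the idea of a chi-square test on annotated shots concrete, the following is a minimal sketch, not the procedure of Weng and Chuang (2008). It assumes each shot is represented as a set of concept labels; the toy annotations, the function names, and the choice of a 0.05 significance level are all illustrative assumptions.

```python
def cooccurrence_counts(shots, a, b):
    """Count shots containing both concepts, only a, only b, or neither."""
    both = sum(1 for s in shots if a in s and b in s)
    only_a = sum(1 for s in shots if a in s and b not in s)
    only_b = sum(1 for s in shots if b in s and a not in s)
    neither = len(shots) - both - only_a - only_b
    return both, only_a, only_b, neither


def chi_square_2x2(both, only_a, only_b, neither):
    """Pearson chi-square statistic for a 2x2 contingency table of shot counts."""
    n = both + only_a + only_b + neither
    row_a, row_not_a = both + only_a, only_b + neither
    col_b, col_not_b = both + only_b, only_a + neither
    if 0 in (row_a, row_not_a, col_b, col_not_b):
        return 0.0  # one concept never varies, so no association is measurable
    diff = both * neither - only_a * only_b
    return n * diff * diff / (row_a * row_not_a * col_b * col_not_b)


# Hypothetical toy annotations: each shot is the set of concepts labeled in it.
shots = [
    {"Car", "Road"}, {"Car", "Road"}, {"Car", "Road"}, {"Car"},
    {"Kitchen"}, {"Kitchen"}, {"Person"}, {"Person"}, {"Building"}, {"Building"},
]

CRITICAL_VALUE = 3.841  # chi-square, 1 degree of freedom, significance level 0.05

stat = chi_square_2x2(*cooccurrence_counts(shots, "Car", "Road"))
related = stat > CRITICAL_VALUE
```

The statistic only indicates that two concepts are not independent. It does not say whether they attract or repel each other, so a practical system would also check the sign of `both * neither - only_a * only_b`, and would need far more than ten shots for the test to be reliable.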

As these examples demonstrate, most existing methods adopt an inductive approach that automatically extracts relationships among concepts using annotated shots and external resources (e.g., WordNet). However, the concept relationships extracted by this inductive approach are very coarse, as they only indicate the degree of interrelatedness between concepts. For example, Car has a high correlation with Road, whereas Kitchen has a very low correlation with Outdoor. In the field of ontology engineering, ontologies are constructed using several other types of concept relationships, such as generalization/specialization, part-of, attribute-of, and self-defined relationships. It is difficult to extract such relationships automatically. We therefore adopt a deductive approach whereby concept relationships and properties are defined manually, based on the design patterns of general ontologies. This allows us to define various concept relationships and draw effective inferences. For example, suppose the concept Hand is defined as a part of Person. If Hand is detected in a shot, we can infer that Person also appears in the shot. Zha, Mei, Wang, and Hua (2007) used a deductive approach to define generalization/specialization (hierarchical) relationships among concepts. However, they extracted concept properties and co-occurrence relationships using an inductive approach, and consequently did not consider part-of, attribute-of, or other types of relationships.
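The Hand/Person inference can be sketched as a small graph traversal. This is an illustrative sketch only, not the authors' ontology or reasoner: the relation tables and the function name `infer_concepts` are hypothetical, and it assumes that detecting a part implies its whole and that detecting a subconcept implies its superconcept.

```python
from collections import deque

# Hypothetical, manually defined relationships (child -> parent).
PART_OF = {"Hand": "Person", "Face": "Person", "Wheel": "Car"}
IS_A = {"Tower": "Building", "Car": "Vehicle"}


def infer_concepts(detected):
    """Return the detected concepts plus every concept implied by the relations."""
    implied = set(detected)
    queue = deque(detected)
    while queue:
        concept = queue.popleft()
        for relation in (PART_OF, IS_A):
            parent = relation.get(concept)
            if parent is not None and parent not in implied:
                implied.add(parent)
                queue.append(parent)  # follow chains such as Wheel -> Car -> Vehicle
    return implied


hand_shot = infer_concepts({"Hand"})
wheel_shot = infer_concepts({"Wheel"})
```

Because the relations are typed, the traversal can chain a part-of edge into a generalization edge, which a single undifferentiated correlation score between concepts cannot express.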

Several methods exist for selecting concepts that are related to a given query. These can be classified into four types: ontology-based, text-based, corpus-based, and visual-based approaches. Ontology-based methods select concepts that are related to words in the text description of the query using an external knowledge base (typically WordNet). The relatedness between a concept and a word is measured using a lexical similarity measure, such as Resnik’s measure (Snoek et al., 2007) or the Lesk semantic relatedness score (Natsev et al., 2007). Text-based methods select concepts based on the extent to which their text descriptions match the text description of a query. This typically involves a document retrieval approach, such as the vector space model (Snoek et al., 2007). Corpus-based methods select concepts that significantly co-occur with words in the text description of a query. Co-occurrence patterns can be extracted in advance from annotated video collections (Wei & Ngo, 2008) and from external resources such as Flickr (Ngo et al., 2009). Visual-based methods select concepts that have high detection scores in example shots (Natsev et al., 2007; Snoek et al., 2007, 2009).
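As an illustration of the text-based approach, the sketch below ranks concepts by the cosine similarity between term-frequency vectors of the query and of each concept description. It is a simplified stand-in for the vector space model, not the implementation of any cited work: the concept descriptions, the stop-word list, and the function names are assumptions, and raw term frequencies are used in place of TF-IDF weights.

```python
import math
from collections import Counter

# Hypothetical stop words and concept descriptions, for illustration only.
STOPWORDS = {"a", "an", "the", "or", "of", "in", "on", "at", "one", "more"}
CONCEPT_DESCRIPTIONS = {
    "Car": "a car or automobile driving on a road",
    "Building": "a building or other man-made structure",
    "Person": "one or more people visible in the shot",
}


def tokenize(text):
    """Lowercase, strip simple punctuation, and drop stop words."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_concepts(query, descriptions, top_k=2):
    """Return up to top_k concept names whose descriptions match the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), name) for name, d in descriptions.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:top_k] if score > 0]


selected = select_concepts("a car driving down a road at night", CONCEPT_DESCRIPTIONS)
```

A query with no word overlap yields an empty selection, which is the weakness that ontology-based and corpus-based methods try to address by relating words that do not match literally.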

In contrast to existing concept selection methods, we select concepts related to a query using our own video ontology. The advantage is that we can use the concept relationships defined in our video ontology to filter out irrelevant shots. For example, our video ontology defines Tower as a subconcept of Building. Thus, if Tower is detected in a shot but Building is not, we can infer that Tower has been incorrectly detected in this shot. In contrast to this use of concept relationships, many existing methods use a linear combination of detection scores for the selected concepts (Natsev et al., 2007; Ngo et al., 2009; Snoek et al., 2007, 2009; Wei & Ngo, 2008). To the best of our knowledge, no existing method deals with the uncertainty in determining the relevance of a shot to a query based on error-prone concept detection results. We present a first study on this issue using Dempster-Shafer theory (DST).
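For readers unfamiliar with DST, the sketch below shows Dempster's rule of combination on a two-element frame, relevant versus irrelevant. It is the generic textbook rule, not the authors' formulation, which this preview does not describe. The mapping from a detection score to a mass function (`mass_from_score`) and the reliability values are purely hypothetical.

```python
from itertools import product

RELEVANT = frozenset({"relevant"})
IRRELEVANT = frozenset({"irrelevant"})
UNKNOWN = RELEVANT | IRRELEVANT  # mass here means "cannot tell either way"


def mass_from_score(score, reliability):
    """Hypothetical mapping: trust a detector score only up to its reliability."""
    return {
        RELEVANT: reliability * score,
        IRRELEVANT: reliability * (1.0 - score),
        UNKNOWN: 1.0 - reliability,
    }


def combine(m1, m2):
    """Dempster's rule: multiply masses, keep intersections, renormalize conflict."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory evidence
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict and cannot be combined")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


# Two error-prone detectors voting on the same shot (scores are made up).
fused = combine(mass_from_score(0.8, 0.5), mass_from_score(0.6, 0.5))
belief = fused[RELEVANT]                        # evidence committed to relevance
plausibility = fused[RELEVANT] + fused[UNKNOWN]  # evidence not ruling it out
```

Unlike a linear combination of scores, the fused result keeps an explicit share of mass on `UNKNOWN`, so a shot supported only by unreliable detections ends up with a wide gap between belief and plausibility instead of a falsely confident score.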
