Temporal-Based Video Event Detection and Retrieval

Min Chen (University of Montana, USA)
DOI: 10.4018/978-1-61692-859-9.ch010


The rapid proliferation of video data archives has increased the need for automatic video content analysis and semantic video retrieval. Because temporal information is critical in conveying video content, this chapter proposes an effective temporal-based event detection framework to support high-level video indexing and retrieval. Its core is a temporal association mining process that systematically captures characteristic temporal patterns to help identify and define interesting events. The framework effectively tackles the challenges caused by loose video structure and class imbalance. One of its unique characteristics is strong generality and extensibility: it can explore representative event patterns with little human intervention. The temporal information and event detection results can then be fed into our proposed distributed video retrieval system to support high-level semantic querying, selective video browsing, and event-based video retrieval.
Chapter Preview


With the proliferation of multimedia data and ever-growing demand for multimedia applications, new challenges emerge in managing and accessing large audio-visual collections efficiently and effectively. Discovering events in video streams improves the access to and reuse of large video collections. Events are real-world occurrences that unfold over space and time, and they play important roles both in classic areas of multimedia and in new experiential applications such as eChronicles, life logs, and event-centric media managers (Westermann & Jain, 2007). However, with current technologies, little or no metadata is associated with the events captured in videos, making it very difficult to search a large collection for instances of a particular pattern or event (Xie, Sundaram, & Campbell, 2008).

To address this need, semantic event classification, the process of mapping video streams to pre-defined semantic event categories, has been an active area of research with notable recent progress. Westermann and Jain (2007) and Xie et al. (2008) provide extensive surveys. In essence, most existing event detection frameworks involve two main steps (Leonardi, Migliorati, & Prandini, 2004): video content processing (also called video syntactic analysis) and a decision-making process. In the first step, the video clip is segmented into analysis units (mostly shots, which are unbroken sequences of frames taken by a single camera), and their representative features, ranging from low-level and mid-level features to feature aggregations (Xie et al., 2008), are extracted. While good features are deemed important, finding the "optimal features" remains an open problem, and some prefer a featureless approach that leaves the task of determining the relative importance of input dimensions to the learner. The second step then extracts the semantic index from the feature descriptors. In the literature, generative models such as hidden Markov models (HMMs), dynamic Bayesian networks (DBNs), and linear dynamic systems are commonly used to capture events that unfold in time. Generally speaking, the events detected by these methods are semantically meaningful and usually significant to users. The major disadvantage, however, is that many of these methods rely on specific artifacts (so-called domain knowledge or a priori information) (Chen, Chen, Shyu, & Wickramaratna, 2006), which hinders the generalization and extensibility of the framework. In addition, current techniques for video semantic analysis and representation are mostly shot-based (Chen & Zhang, 2007). However, events are inherently related to the concept of time (Westermann & Jain, 2007), and a single analysis unit taken apart from its context is therefore normally less capable of conveying semantics (Zhu, Wu, Elmagarmid, Feng, & Wu, 2005).
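To make the point about context concrete, the following minimal Python sketch shows one simple way to attach temporal context to each shot: it concatenates a shot's feature vector with those of its neighbors inside a fixed window. This is an illustration only and not the temporal association mining process of this chapter; the function name `temporal_windows`, the `radius` parameter, and the zero-padding at the clip boundaries are all assumptions introduced here.

```python
from typing import List, Sequence


def temporal_windows(
    shots: Sequence[Sequence[float]], radius: int
) -> List[List[float]]:
    """Concatenate each shot's features with those of its temporal neighbors.

    `shots` is a time-ordered sequence of equal-length feature vectors,
    one per shot (hypothetical input format). `radius` is the number of
    neighboring shots taken on each side. Positions that fall outside
    the clip are filled with zeros (an assumed padding policy).
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")
    if not shots:
        return []

    dim = len(shots[0])
    pad = [0.0] * dim
    windows: List[List[float]] = []
    for i in range(len(shots)):
        window: List[float] = []
        # Walk from the earliest neighbor to the latest, keeping temporal order.
        for j in range(i - radius, i + radius + 1):
            window.extend(shots[j] if 0 <= j < len(shots) else pad)
        windows.append(window)
    return windows


# Three shots with one-dimensional features, one neighbor on each side.
example = temporal_windows([[1.0], [2.0], [3.0]], radius=1)
```

Each resulting vector describes a shot together with what precedes and follows it, so a downstream classifier sees the shot in context rather than in isolation. A fixed window is the simplest possible choice; it does not by itself discover which temporal patterns are characteristic of an event.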

In this chapter, we propose an automatic process for developing an extensible framework that covers event pattern discovery, representation, and usage. The framework fully utilizes contextual correlation and temporal dependencies to improve event detection and retrieval accuracy. The main contributions of this framework are summarized as follows:
