Event Detection in Sports Video Based on Generative-Discriminative Models

Guoliang Fan (Oklahoma State University, USA) and Yi Ding (Oklahoma State University, USA)
Copyright: © 2011 | Pages: 23
DOI: 10.4018/978-1-60960-024-2.ch009


Semantic event detection is an active research topic in video mining. The major challenge is the semantic gap between low-level features and high-level semantics. In this chapter, we advance a new sports video mining framework in which a hybrid generative-discriminative approach is used for event detection. Specifically, we propose a three-layer semantic space by which event detection is converted into two inter-related statistical inference procedures that involve semantic analysis at different levels. The first infers mid-level semantic structures from low-level visual features via generative models; these structures serve as building blocks for high-level semantic analysis. The second uses discriminative models to detect, from the mid-level semantic structures, the high-level semantics that are of direct interest to users. In this framework, semantics can be represented and detected explicitly at different levels. Using generative and discriminative approaches in two different stages proves effective and appropriate for event detection in sports video. Experimental results on a set of American football video data show that the proposed framework offers promising results compared with traditional approaches.
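The two-stage idea described above can be illustrated with a minimal sketch. It is not the chapter's implementation, which this preview does not show. As assumptions, a per-label Gaussian stands in for the generative stage and a small logistic regression stands in for the discriminative stage. The mid-level labels (wide_view, close_up), the one-dimensional low-level feature, the toy data, and all function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical mid-level labels; not taken from the chapter.
MID_LABELS = ["wide_view", "close_up"]

def fit_gaussians(samples):
    """Stage 1 (generative): fit one Gaussian p(x | label) per mid-level label."""
    params = {}
    for label, xs in samples.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs) + 1e-6  # avoid zero variance
        params[label] = (mean, var)
    return params

def log_likelihood(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def infer_mid_level(features, params):
    """Label each shot with its most likely mid-level label (equal priors assumed)."""
    return [max(params, key=lambda lab: log_likelihood(x, *params[lab]))
            for x in features]

def to_vector(mid_sequence):
    """Summarize a mid-level label sequence as label frequencies."""
    counts = Counter(mid_sequence)
    return [counts[lab] / len(mid_sequence) for lab in MID_LABELS]

def score(vector, w, b):
    """Logistic output p(event | mid-level vector)."""
    z = sum(wi * vi for wi, vi in zip(w, vector)) + b
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(vectors, targets, lr=0.5, epochs=2000):
    """Stage 2 (discriminative): logistic regression trained by stochastic gradient descent."""
    w, b = [0.0] * len(vectors[0]), 0.0
    for _ in range(epochs):
        for v, t in zip(vectors, targets):
            g = score(v, w, b) - t  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * vi for wi, vi in zip(w, v)]
            b -= lr * g
    return w, b

def detect_event(clip_features, params, w, b):
    """Full pipeline: low-level features -> mid-level labels -> high-level event."""
    vector = to_vector(infer_mid_level(clip_features, params))
    return score(vector, w, b) > 0.5

# Toy low-level feature per shot (a made-up "field color ratio").
params = fit_gaussians({"wide_view": [0.7, 0.8, 0.9], "close_up": [0.1, 0.2, 0.3]})

# Toy clips: each is a list of per-shot features; target 1 marks a made-up "highlight".
clips = [[0.15, 0.25, 0.8], [0.1, 0.2, 0.22], [0.85, 0.75, 0.2], [0.9, 0.8, 0.7]]
targets = [1, 1, 0, 0]
w, b = train_logistic([to_vector(infer_mid_level(c, params)) for c in clips], targets)

print(detect_event([0.12, 0.3, 0.28, 0.9], params, w, b))
print(detect_event([0.8, 0.9, 0.75, 0.2], params, w, b))
```

The point of the split is that the second stage never sees the raw features: it operates only on the mid-level labels produced by the first stage, so either stage can be replaced without changing the other.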
Chapter Preview


The goal of video mining is to discover knowledge, patterns, and events in video data stored in databases, data warehouses, or other online repositories (S.-F. Chang, 2002; Mei, Ma, Zhou, Ma, & Zhang, 2005). Specifically, semantic event detection is an active research field driven by the ever-increasing needs of numerous multimedia and online database applications. Its benefits range from efficient browsing and summarization of video content to easier video access and retrieval. According to their production and editing styles, videos can be classified into two major categories, scripted and non-scripted (Xiong, Zhou, Tian, Rui, & Huang, 2006), which are usually associated with different video mining tasks. Scripted videos (e.g., news and movies) are produced or edited according to a pre-defined script or plan, for which we can build a Table-of-Content (TOC) to facilitate viewing or editing of the video data (Rui, Huang, & Mehrotra, 1998). In non-scripted videos (e.g., meetings, sports, and surveillance), events usually occur spontaneously in a relatively fixed setting. Detecting highlights or events of interest is therefore the main task for non-scripted videos. In our research, we focus on sports video and use American football video as a case study.

Sports video mining has been widely studied due to its great commercial value (L. Duan, Xu, Tian, Xu, & Jin, 2005; Gong, Sin, Chuan, Zhang, & Sakauchi, 1995; Kokaram, et al., 2006; Xie, Chang, Divakaran, & Sun, 2002). Although sports videos are considered non-scripted, they usually have a relatively well-defined structure (such as the field scene) or repetitive patterns (such as a certain play type), which can help us enhance their “scriptedness” and develop effective tools for retrieval, searching, browsing, and indexing. Currently, there are two kinds of approaches to sports video mining: structure-based (Kokaram, et al., 2006; Xie, Chang, Divakaran, & Sun, 2004) and event-based (Assfalg, Bertini, Colombo, Bimbo, & Nunziati, 2003; T. Wang, et al., 2006). The former uses either supervised or unsupervised learning methods to recognize basic semantic structures (such as the canonical view in a baseball game or the play/break in a soccer game). These structures can serve as an intermediate representation to support semantics-oriented video retrieval, but usually cannot deliver high-level semantics directly. The latter provides a better understanding of the video content by detecting and extracting events of interest or highlights; these can be very specific and task-dependent, and detecting them usually requires sufficient and representative training data. Because the two approaches are complementary in nature, researchers have investigated how to integrate them in one unified computational framework. For example, a mid-level representation framework was proposed for semantic sports video analysis involving both temporal structures and an event hierarchy (L. Y. Duan, Xu, Chua, Q. Tian, & Xu, 2003), and a mosaic-based generic scene representation was developed from video shots and used to mine both events and structures (Mei, et al., 2005). The advantage of this kind of video representation lies in its expandability and its openness to versatile and flexible video mining tasks.
