Content-Based Video Scene Clustering and Segmentation


Hong Lu, Xiangyang Xue
Copyright: © 2011 |Pages: 14
DOI: 10.4018/978-1-60960-024-2.ch010

Abstract

With the amount of video data increasing rapidly, automatic methods are needed to deal with large-scale video data sets in various applications. In content-based video analysis, a common and fundamental preprocessing step for these applications is video segmentation. Based on the segmentation results, video can be represented hierarchically by frames, shots, and scenes, from low level to high level. Because of the huge number of frames, it is not practical to represent video content at the frame level. At the next level of the hierarchy, a shot is defined as an unbroken sequence of frames from one camera; however, the content of a single shot is trivial and can hardly convey valuable semantic information. A scene, on the other hand, is a group of consecutive shots that focuses on an object or objects of interest, and it can serve as a semantic unit for further processing such as story extraction and video summarization. In this chapter, we survey methods for video scene segmentation. Specifically, there are two kinds of scenes. The first considers only the visual similarity of video shots, and clustering methods are used to obtain such scenes. The second considers both the visual similarity and the temporal constraints of video shots, i.e., shots with similar content that do not lie too far apart in temporal order. We also present our proposed methods for scene clustering and scene segmentation based on Gaussian mixture models, graph theory, sequential change detection, and spectral methods.

1. Introduction

In recent years, with the amount of video data increasing rapidly, automatic methods have become necessary for dealing with large-scale video data sets in various applications. A common and fundamental preprocessing step for these applications is video segmentation. Based on the segmentation results, video can be represented hierarchically by frames, shots, and scenes, from low level to high level. Frames are not appropriate for representing video content because of their huge number. At the shot level, a shot is defined as an unbroken sequence of frames from one camera; however, the content of a single shot is trivial and can hardly convey valuable semantic information, and the number of shots is still large for analysis and processing. A scene, on the other hand, is a group of consecutive shots that focuses on an object or objects of interest, and it can represent a semantic unit. Therefore, effective scene segmentation is crucial for further analysis and understanding of video content.

There are two types of scenes. The first is defined as a group of shots with similar visual content and can be obtained by scene clustering. The second is defined as a collection of consecutive shots depicting a dramatic event; such a scene can be obtained by scene segmentation with the incorporation of suitable temporal correlation and prior knowledge.

Much work has been done on scene clustering. Specifically, Yeung et al. (Yeung, 1998) proposed a shot-based representation structure in which a video is represented as a scene transition graph (STG). In the graph, each node represents a shot, and the links between shots reflect transitions characterized by the visual features and temporal characteristics of the video. A hierarchical clustering method is then used to find the closed subgraphs of the entire graph, and these subgraphs are regarded as scenes. Rasheed et al. (Rasheed, 2005) constructed a weighted undirected graph, the shot similarity graph (SSG), in which each node represents a shot and each edge between two nodes (shots) represents their similarity based on color and motion information. The SSG is split into subgraphs by using normalized cuts. Rasheed et al. (Rasheed, 2003) adopted a two-pass algorithm for scene boundary detection, focusing on feature films and TV shows and utilizing motion, shot length, and color similarity. Potential boundaries are first detected based on the color feature; the over-segmented scenes are then merged based on motion.
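The graph-based idea above can be sketched in a few lines: build a weighted shot-similarity graph whose edge weights combine visual similarity with a temporal decay, then split it with the standard spectral relaxation of the normalized cut (thresholding the Fiedler vector of the normalized Laplacian). This is a minimal illustrative sketch, not the authors' exact formulation; the feature vectors, the exponential similarity, and the decay constant are all assumptions.

```python
import numpy as np

def shot_similarity_graph(shot_feats, decay=5.0):
    """Build a weighted shot-similarity graph (SSG).

    shot_feats: (n_shots, d) array of per-shot feature vectors
    (e.g. mean color histograms -- illustrative, not the published features).
    Edge weight = visual similarity damped by temporal distance, so that
    visually similar shots far apart in time are only weakly linked.
    """
    n = len(shot_feats)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            visual = np.exp(-np.linalg.norm(shot_feats[i] - shot_feats[j]))
            temporal = np.exp(-abs(i - j) / decay)
            W[i, j] = visual * temporal
    return W

def two_way_ncut(W):
    """Approximate a 2-way normalized cut via the Fiedler vector
    (eigenvector of the 2nd-smallest eigenvalue) of the
    symmetric normalized Laplacian -- the standard spectral relaxation."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    return (fiedler > np.median(fiedler)).astype(int)
```

In practice the cut would be applied recursively, with a stopping criterion on the cut value, to obtain more than two scenes.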

Some work has also been done on scene segmentation. Specifically, Kender et al. (Kender, 1998) computed video coherence between shots using a short-term memory-based model, in which local minima of coherence are taken as scene boundaries, permitting robust and flexible segmentation. However, due to the variance of shot lengths, the method is sensitive to the buffer size. Furthermore, to deal with different genres of video, Li et al. (Li, 2004) adopted sequential change detection to perform video scene segmentation, employing nonparametric density estimation and an adaptive threshold so that the method suits various genres. However, this method simply computes the log-likelihood to determine scene boundaries, and it may therefore fall into unsatisfactory local minima, leading to poor performance.
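The sequential change-detection idea can be illustrated with a toy version: slide along the shot stream, compare the feature statistics of the shots just before and just after each candidate point, and declare a boundary where the dissimilarity exceeds an adaptive threshold. This is a simplified sketch assuming mean-feature comparison and a mean-plus-k-sigma threshold; the published method uses nonparametric density estimation and likelihood scores instead.

```python
import numpy as np

def detect_scene_boundaries(shot_feats, window=4, k=1.5):
    """Toy sequential change detection over a stream of shot features.

    At each shot t, compare the mean feature of the preceding `window`
    shots with that of the following `window` shots; a boundary is
    declared where the dissimilarity exceeds an adaptive threshold
    (mean + k * std of all scores). Window size and k are illustrative.
    """
    n = len(shot_feats)
    scores = np.zeros(n)
    for t in range(window, n - window):
        past = shot_feats[t - window:t].mean(axis=0)
        future = shot_feats[t:t + window].mean(axis=0)
        scores[t] = np.linalg.norm(past - future)
    thresh = scores.mean() + k * scores.std()
    boundaries = [t for t in range(n) if scores[t] > thresh]
    return boundaries, scores
```

The adaptive threshold is what lets the same detector run on different genres without hand-tuning an absolute cutoff, which is the motivation cited in (Li, 2004).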

Hidden Markov Models (HMMs) are well known for their ability to incorporate temporal information. Thus, in (Huang, 2005), an HMM is used for video classification and segmentation, since video has both spatial and temporal dimensions. By using audio and visual features, segmentation performance is improved, but the method becomes more complicated. (Zhai, 2006) proposes a general Markov Chain Monte Carlo (MCMC) framework to address scene segmentation across different video genres. The posterior probability of the number of scenes is first computed; the corresponding boundaries are then determined, based on the model priors and the data likelihood, to complete the segmentation. The model parameters are updated through hypothesis ratio tests in the MCMC process, and the samples are used to generate the final results. This approach can effectively handle different genres of video; however, MCMC itself has some problems. One is how to select the initial values of the parameters; the other is the convergence of the iterative process. To obtain a good result, multiple restarts are required, so the computational cost is high.
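Once an HMM has been fitted, segmentation reduces to decoding the most likely state sequence for the observed features, which is done with the Viterbi algorithm; state changes in the decoded path mark the boundaries. Below is a generic Viterbi sketch in log space; the per-shot log-likelihoods and the two-state "sticky" transition matrix in the usage test are illustrative assumptions, not the parameters of (Huang, 2005).

```python
import numpy as np

def viterbi(obs_loglik, log_trans, log_init):
    """Generic Viterbi decoding: most likely hidden-state sequence.

    obs_loglik: (T, S) per-shot log-likelihood of each state
    log_trans:  (S, S) log transition matrix, log_trans[i, j] = log P(j | i)
    log_init:   (S,)   log initial-state probabilities
    """
    T, S = obs_loglik.shape
    delta = np.full((T, S), -np.inf)   # best log-score ending in each state
    back = np.zeros((T, S), dtype=int) # backpointers for path recovery
    delta[0] = log_init + obs_loglik[0]
    for t in range(1, T):
        for s in range(S):
            cand = delta[t - 1] + log_trans[:, s]
            back[t, s] = np.argmax(cand)
            delta[t, s] = cand[back[t, s]] + obs_loglik[t, s]
    # Backtrack from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

A sticky transition matrix (high self-transition probability) encodes the prior that scenes persist over many shots, which is exactly the temporal information HMMs are praised for above.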
