Multi-Sensored Vision for Autonomous Production of Personalized Video Summary

Fan Chen (Université catholique de Louvain, Belgium), Damien Delannay (Université catholique de Louvain, Belgium), Christophe De Vleeschouwer (Université catholique de Louvain, Belgium) and Pascaline Parisot (Université catholique de Louvain, Belgium)
Copyright: © 2011 | Pages: 19
DOI: 10.4018/978-1-60960-024-2.ch007

This chapter provides a survey of the major research efforts that have exploited computer vision tools to extend the content production industry towards automated infrastructures allowing contents to be produced, stored, and accessed at low cost and in a personalized and dedicated way.
Chapter Preview


Media consumption is evolving towards increasingly user-centric adaptation of contents, to meet the needs of users who have different expectations in terms of storytelling and heterogeneous constraints in terms of access devices. Individuals and organizations want to access dedicated contents through a personalized service that provides what they are interested in, at the time they want it, and through the distribution channel of their choice.

Hence, democratic and personalized production of multimedia content is one of the most exciting challenges that content providers will have to face in the near future. In this chapter, we explain how it is possible to address this challenge by building on computer vision tools to automate the collection and distribution of audiovisual contents.

In a typical application scenario, as depicted in Figure 1, the sensor network for media acquisition is composed of cameras (and microphones) which, for example, cover a basketball court. Distributed analysis and interpretation of the scene are exploited to decide what to show or not to show about the event, so as to produce a video composed of a valuable subset of the streams provided by each individual camera, or interpolated from multiple cameras. The process involves numerous integrated technologies and methodologies, including but not limited to automatic scene analysis, camera viewpoint selection and control, and generation of summaries through automatic organization of stories. Considering the problem in a multi-camera environment not only mitigates the difficulty of scene understanding caused by reflections, occlusions and shadows in the single-view case, but also offers higher flexibility in producing visually pleasing video reports. Finally, multi-camera autonomous production/summarization can provide practical solutions to a wide range of applications, such as personalized access to local sport events through a web portal or a mobile handset (APIDIS, 2008; Papaoulakis, 2008), cost-effective and fully automated production of content dedicated to small audiences, e.g. souvenir DVDs, university lectures and conferences (Rui, 2001; Al-Hames, 2007), and interactive browsing and automated summarization for video surveillance (Yamasaki, 2008).

Figure 1. Vision of autonomous production of personalized video summaries

From a technical perspective, this chapter will present a unified framework for cost-effective and autonomous generation of video contents from multi-sensored data. It will first investigate the automatic extraction of intelligent contents from a network of sensors distributed around the scene at hand. Here, intelligence refers to the identification of salient segments within the audiovisual content, using distributed scene analysis algorithms. Second, it will explain how that knowledge can be exploited to automate the production and personalize the summarization of video contents.

In more detail, to identify salient segments in the raw video content, multi-camera analysis is considered, with an emphasis on people detection methods relying on the fusion of the foreground likelihood information computed in each view. We will observe that multi-view analysis can overcome traditional hurdles such as occlusions, shadows and changing illumination. This is in contrast with single-sensor signal analysis, which is often subject to interpretation ambiguities, due to the lack of an accurate model of the scene and to coincidental adverse scene configurations (Delannay, 2009).
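As a rough illustration of fusing foreground likelihoods across views, the sketch below combines per-camera likelihood maps that are assumed to have already been projected onto a common ground-plane grid. It is not the detection method of Delannay (2009): the function name `fuse_ground_likelihoods`, the geometric-mean fusion rule, the threshold and the toy values are all assumptions made for this example.

```python
from math import prod


def fuse_ground_likelihoods(view_maps, threshold=0.5):
    """Fuse per-view foreground likelihoods defined on a shared ground grid.

    view_maps: one 2D list per camera, same shape, values in [0, 1].
    Assumption: each map is already projected onto the ground plane.
    Returns (fused_map, detections), where detections are (row, col) cells.
    """
    n_views = len(view_maps)
    rows, cols = len(view_maps[0]), len(view_maps[0][0])
    # Geometric mean: a cell scores high only if all views agree on it.
    fused = [
        [prod(m[r][c] for m in view_maps) ** (1.0 / n_views) for c in range(cols)]
        for r in range(rows)
    ]
    detections = [
        (r, c) for r in range(rows) for c in range(cols) if fused[r][c] >= threshold
    ]
    return fused, detections


# Toy data: each camera sees one spurious blob (e.g. a shadow) that the
# other camera does not confirm, plus one blob that both cameras confirm.
camera_a = [[0.9, 0.8, 0.1], [0.1, 0.1, 0.1]]
camera_b = [[0.9, 0.1, 0.1], [0.1, 0.1, 0.9]]
fused, detections = fuse_ground_likelihoods([camera_a, camera_b])
print(detections)  # [(0, 0)]
```

In this toy case only the cell supported by both cameras survives the threshold, which is the intuition behind using several views to reject shadows and occlusion artifacts.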

To produce semantically meaningful and perceptually comfortable video summaries based on the extraction or interpolation of images from the raw content, our proposed framework introduces three fundamental concepts, i.e. “completeness”, “smoothness” and “fineness”, to abstract the semantic and narrative requirements of video contents. Based on those concepts, as a key contribution, we formulate the selection of camera viewpoints and that of temporal segments in the summary as two independent optimization problems. In short, those problems define and trade off the above concepts as a function of the computer vision analysis outcomes, in a way that is easily parameterized by individual user preferences. Interestingly, the solution to the viewpoint selection problem is augmented by Markov regularization mechanisms (Chen, 2009a; Chen, 2010), while the formulation of the summarization problem builds on a generic resource allocation framework (Chen, 2009b).
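To make the trade-off between per-frame quality and smoothness concrete, the sketch below selects one camera per frame by dynamic programming. It is a simplified stand-in, not the formulation of Chen (2009a; 2010): `scores[t][k]` is a hypothetical per-frame benefit of camera `k` (which could, for instance, combine completeness and fineness), and `switch_penalty` is a hypothetical user parameter that discourages camera changes.

```python
def select_viewpoints(scores, switch_penalty):
    """Pick one camera per frame, maximizing total score minus switch costs.

    scores[t][k]: hypothetical benefit of showing camera k at frame t.
    switch_penalty: cost paid each time the selected camera changes.
    Returns the list of selected camera indices, one per frame.
    """
    n_cams = len(scores[0])
    best = list(scores[0])  # best[k]: best total for a path ending on camera k
    back = []               # back[t-1][k]: predecessor of camera k at frame t
    for t in range(1, len(scores)):
        new_best, pointers = [], []
        for k in range(n_cams):
            # Staying on the same camera is free; switching costs a penalty.
            prev = max(
                range(n_cams),
                key=lambda j: best[j] - (switch_penalty if j != k else 0.0),
            )
            cost = switch_penalty if prev != k else 0.0
            new_best.append(best[prev] - cost + scores[t][k])
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    # Trace the best path backwards from the best final camera.
    k = max(range(n_cams), key=lambda j: best[j])
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return path


# Toy scores for two cameras over three frames.
toy_scores = [[1.0, 0.0], [0.0, 0.6], [1.0, 0.0]]
print(select_viewpoints(toy_scores, switch_penalty=0.0))  # [0, 1, 0]
print(select_viewpoints(toy_scores, switch_penalty=1.0))  # [0, 0, 0]
```

With no penalty the selection follows the best camera in each frame. A larger penalty keeps the view stable at the expense of a slightly lower per-frame score, which is how a user preference for smoother editing could be expressed in such a scheme.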
