On the Applicability of Speaker Diarization to Audio Indexing of Non-Speech and Mixed Non-Speech/Speech Video Soundtracks


Robert Mertens (International Computer Science Institute, University of California, Berkeley, USA), Po-Sen Huang (Beckman Institute, University of Illinois at Urbana-Champaign, USA), Luke Gottlieb (International Computer Science Institute, University of California, Berkeley, USA), Gerald Friedland (International Computer Science Institute, University of California, Berkeley, USA), Ajay Divakaran (SRI International Sarnoff, USA) and Mark Hasegawa-Johnson (Beckman Institute, University of Illinois at Urbana-Champaign, USA)
DOI: 10.4018/jmdem.2012070101

A video’s soundtrack is usually highly correlated with its content. Hence, audio-based techniques have recently emerged as a means for video concept detection complementary to visual analysis. Most state-of-the-art approaches rely on the manual definition of predefined sound concepts such as “engine sounds” or “outdoor/indoor sounds.” These approaches come with three major drawbacks: manual definitions do not scale because they are highly domain-dependent; manual definitions are highly subjective with respect to annotators; and a large part of the audio content is omitted, since the predefined concepts are usually found in only a fraction of the soundtrack. This paper explores how unsupervised audio segmentation systems such as speaker diarization can be adapted to automatically identify low-level sound concepts similar to annotator-defined concepts, and how these concepts can be used for audio indexing. Speaker diarization systems are designed to answer the question “Who spoke when?” by finding segments in an audio stream that exhibit similar properties in feature space, i.e., that sound similar. Using a diarization system, the entire content of an audio file is analyzed and similar sounds are clustered. This article provides an in-depth analysis of the statistical properties of similar acoustic segments identified by the diarization system in a predefined document set, and of the theoretical fitness of this approach for discerning one document class from another. It also discusses how diarization can be tuned to better reflect the acoustic properties of general sounds as opposed to speech, and introduces a proof-of-concept system for multimedia event classification that works with diarization-based indexing.

1. Introduction

To tackle the challenge of indexing multimedia data, a multitude of approaches have been proposed in the last decade (see Lew et al., 2006; Snoek & Worring, 2009, for an overview). Most videos contain both audio and visual information; many approaches, however, focus only on the visual part. Recently, audio has begun to play its part in multimodal multimedia analysis and can be leveraged to complement results from visual analysis techniques, increasing the effectiveness of multimedia detection and retrieval approaches. Audio information can be engaged to these ends in two essentially different ways. Since the late 1990s (Wactlar et al., 1996), speech recognition has been used for video analysis. Detecting sound concepts that describe a video’s content is another way of using audio information for video content detection. Manually defined low-level acoustic concepts such as “people laughing” or “indoor sound” convey useful information about a video's content, and such concepts can be automatically recognized once a system is trained to detect them (Li et al., 2010; Jiang et al., 2010). This approach does, however, usually involve manual concept definition. Manual concept definition has two drawbacks: it introduces a human bias, since human annotators are likely to identify these concepts based on different properties of sound than a machine algorithm would, and it usually leads to rather abstract concepts. This paper explores the applicability of a speaker diarization engine to extracting low-level acoustic concepts from domain-specific training data. A speaker diarization engine clusters segments of an audio stream that have similar properties. It thus automatically extracts acoustic concepts defined by a machine algorithm, namely the diarization engine itself.
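The clustering behind such a diarization engine can be illustrated with a minimal sketch. This is a hypothetical simplification, not the ICSI system itself: each segment is modeled as a single full-covariance Gaussian over acoustic features such as MFCC frames, and pairs of segments are greedily merged whenever the Bayesian Information Criterion (BIC) favors a joint model over two separate ones. All function names and the penalty weight `lam` are illustrative assumptions.

```python
import numpy as np

def gaussian_log_likelihood(X):
    """Log-likelihood of frames X under an ML-fit full-covariance Gaussian."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(d)  # regularized covariance
    _, logdet = np.linalg.slogdet(cov)
    # For a maximum-likelihood Gaussian the data log-likelihood collapses to:
    return -0.5 * n * (logdet + d * (1.0 + np.log(2.0 * np.pi)))

def delta_bic(X, Y, lam=1.3):
    """BIC merge score for two frame sets; negative means merging is preferred."""
    d = X.shape[1]
    n = len(X) + len(Y)
    # Penalty for the extra parameters of keeping two models instead of one.
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    separate = gaussian_log_likelihood(X) + gaussian_log_likelihood(Y)
    merged = gaussian_log_likelihood(np.vstack([X, Y]))
    return (separate - merged) - penalty

def agglomerative_diarization(segments, lam=1.3):
    """Greedily merge acoustic segments until no merge lowers the BIC.
    `segments` is a list of (n_frames, n_features) arrays, e.g. MFCC frames.
    Returns one cluster id per input segment."""
    clusters = {i: np.asarray(s, dtype=float) for i, s in enumerate(segments)}
    assign = {i: i for i in range(len(segments))}  # segment index -> cluster id
    while len(clusters) > 1:
        best, pair = 0.0, None
        ids = list(clusters)
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                score = delta_bic(clusters[ids[a]], clusters[ids[b]], lam)
                if score < best:
                    best, pair = score, (ids[a], ids[b])
        if pair is None:  # no remaining merge improves the BIC; stop
            break
        i, j = pair
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
        for seg, c in assign.items():
            if c == j:
                assign[seg] = i
    return [assign[k] for k in range(len(segments))]
```

Segments drawn from the same acoustic source incur only a small likelihood loss when merged, so the BIC penalty dominates and they end up in one cluster; acoustically distinct segments stay separate.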
To examine whether this assumption holds, we have generated and examined diarization data from the TRECVID 2011 development data set (NIST, 2011), released by NIST as training data for the TRECVID Multimedia Event Detection (MED) 2011 challenge. It contains randomly selected videos exemplifying fifteen different categories of high-level concepts, such as “woodworking project” and “wedding ceremony,” and thus represents a suitable data set for the analysis presented in this paper. The data set not only delivers a wealth of low-level features that can be detected by the diarization approach, but also separates the data into higher-level classes, so that we can explore whether higher-level classes can be predicted from the presence or absence of certain low-level features. The analysis of the distribution of speaker segments discovered in the videos from different categories implies that speaker segments are not randomly distributed but are useful in predicting whether a video belongs to a certain class of event. It shows that speaker diarization generates low-level audio concepts that are helpful in higher-level event classification and detection. An in-depth analysis of the application of the speaker diarization engine to a group of sounds, selected from freesound.org and combined into a single sound file, examines the performance of speaker diarization on a number of examples from five selected sound categories. This analysis was also used to tune the parameters of the diarization system. Finally, this paper presents an event detection system that utilizes speaker diarization for indexing audio content. For comparison purposes, we have evaluated the system's performance on data from the NIST TRECVID 2010 MED corpus.
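As a rough illustration of how diarization output could serve as an index for event classification (a hypothetical sketch, not the system evaluated in this paper), each video can be summarized by the fraction of its soundtrack spent in each discovered audio cluster, and an unseen video assigned to the event class whose average histogram is most similar. The cosine similarity and nearest-centroid rule below are illustrative choices, not the paper's method.

```python
import numpy as np

def cluster_histogram(labels, durations, n_clusters):
    """Fraction of a soundtrack's total duration spent in each audio cluster.
    `labels` are diarization cluster ids per segment; `durations` in seconds."""
    hist = np.zeros(n_clusters)
    for lab, dur in zip(labels, durations):
        hist[lab] += dur
    return hist / hist.sum()

def nearest_centroid_classify(train_hists, train_classes, query_hist):
    """Assign the event class whose mean cluster histogram is most similar
    (cosine similarity) to the query video's histogram."""
    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    classes = sorted(set(train_classes))
    centroids = {
        c: np.mean([h for h, y in zip(train_hists, train_classes) if y == c],
                   axis=0)
        for c in classes
    }
    return max(classes, key=lambda c: cosine(centroids[c], query_hist))
```

The histogram representation captures exactly the presence-or-absence intuition described above: if a low-level cluster (say, applause-like audio) occupies much of the soundtrack in one event class but not another, the centroids separate and the classifier can exploit it.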

The remainder of this paper is organized as follows: Section 3 briefly explains the ICSI speaker diarization system. Section 4 presents an experiment applying diarization to the NIST TRECVID MED 2011 data set and a discussion of the experiment’s findings. Section 5 presents an in-depth analysis of the performance of speaker diarization on a synthetically compiled sound set and discusses the implications for diarization tuning. Section 6 introduces a diarization-based event detection system and presents an evaluation of this system’s performance on the NIST TRECVID 2010 MED corpus. The paper concludes with a discussion of the relevance of the findings of this paper as well as perspectives for further research.
