Cocktail Party Problem: Source Separation Issues and Computational Methods

Tariqullah Jan (Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK) and Wenwu Wang (Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK)
Copyright: © 2011 | Pages: 19
DOI: 10.4018/978-1-61520-919-4.ch003


The cocktail party problem is a classical scientific problem that has been studied for decades. Humans have a remarkable ability to segregate target speech from the complex auditory mixture found in a cocktail party environment. Computational modeling of this mechanism is, however, extremely challenging. This chapter presents an overview of several recent techniques for the source separation issues associated with this problem, including independent component analysis/blind source separation, computational auditory scene analysis, model-based approaches, non-negative matrix factorization, and sparse coding. As an example, a multistage approach for source separation is included. The application areas of cocktail party processing are explored, and potential future research directions are discussed.
Chapter Preview


The term cocktail party problem (CPP) was coined by Cherry (1953). It describes the ability of the human auditory system, in a cocktail party environment, to focus listening attention on a single speaker while multiple conversations, background interference, and noise are present simultaneously. Researchers from a variety of fields have attempted to tackle this problem (Bregman, 1990; Arons, 1992; Yost, 1997; Feng et al., 2000; Bronkhorst, 2000). Despite all this work, the CPP remains an open problem and demands further research effort. Figure 1 illustrates the cocktail party effect using a simplified scenario with two simultaneous conversations in a room environment.

Figure 1. A simplified scenario of the cocktail party problem with two speakers and two listeners (microphones)

Because a solution to the CPP would have many practical applications, engineers and scientists have devoted considerable effort to understanding the mechanism of the human auditory system, in the hope of designing a machine that works similarly. However, no machine produced so far can perform as well as humans in a real cocktail party environment. Studies of the human auditory system could help explain the cocktail party phenomenon and offer hope of designing a machine that approaches a normal human's listening ability.

It has been observed that people with perceptive hearing loss suffer from insufficient speech intelligibility (Kocinski, 2008). They find it difficult to pick out the target speech, particularly when interfering sounds are present nearby. Amplifying the signal is not sufficient to increase the intelligibility of the target speech, because all the signals (both target and interference) are amplified together. For this application scenario, it is highly desirable to produce a machine that can deliver clean target speech to hearing-impaired listeners.
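The amplification argument can be checked numerically. The sketch below is only an illustration under stated assumptions: the target and interference are synthetic noise signals, and the names `snr_db`, `gain`, `snr_before`, and `snr_after` are hypothetical and do not come from the chapter. It shows that applying one common gain to both signals leaves the target-to-interference ratio unchanged.

```python
import numpy as np


def snr_db(target, interference):
    """Target-to-interference power ratio in decibels."""
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(interference ** 2))


# Synthetic stand-ins for a target talker and an interfering sound (assumption).
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
interference = 0.5 * rng.standard_normal(16000)

# A plain amplifier applies the same gain to everything it receives.
gain = 4.0
snr_before = snr_db(target, interference)
snr_after = snr_db(gain * target, gain * interference)

print(f"before: {snr_before:.2f} dB, after: {snr_after:.2f} dB")
```

The gain cancels in the power ratio, so the two values agree up to floating-point rounding. Improving the ratio requires treating target and interference differently, which is the role of source separation.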

Scientists have attempted to analyze and simplify the complicated CPP; see, for example, the overview in (Haykin, 2005). A variety of methods have been proposed for this problem. For example, the computational auditory scene analysis (CASA) approach attempts to capture the processing of the human auditory system in a mathematical model that can be implemented computationally (Wang & Brown, 2006; Wang, 2005). Blind source separation (BSS) has also been widely used to address this problem (Wang et al., 2005; Araki et al., 2003; Olsson et al., 2006; Makino et al., 2005). BSS approaches are based on the independent component analysis (ICA) technique, which assumes that the source signals coming from different speakers are statistically independent (Hyvarinen et al., 2001; Lee, 1998). Non-negative matrix factorization (NMF) and its extension, non-negative tensor factorization (NTF), have also been applied to speech and music separation problems (Smaragdis, 2004; Virtanen, 2007; Schmidt & Olsson, 2006; Schmidt & Laurberg, 2008; Wang, 2009). Another interesting approach is the sparse representation of the sources, in which the source signals are assumed to be sparse, so that at any given time instant only one source in the mixture is active while the others are relatively insignificant (Pearlmutter et al., 2004; Bofill et al., 2001; Zibulevsky & Pearlmutter, 2001). Some model-based approaches have also been employed to address this problem (Todros et al., 2004; Radfar et al., 2007). The following sections provide a detailed review of these techniques for addressing the cocktail party problem, in particular for audio source separation, which is a key issue in creating an artificial cocktail party machine.
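As a rough illustration of the independence assumption behind ICA-based BSS, the sketch below separates two synthetic sources from two mixtures, matching the two-speaker, two-microphone layout of Figure 1. It is not the method developed in this chapter. It assumes an instantaneous (non-convolutive) mixture, whereas real rooms introduce reverberation, and it uses a textbook symmetric FastICA iteration with a tanh nonlinearity. The mixing matrix, the Laplacian toy sources, and all function and variable names are hypothetical choices made for the example.

```python
import numpy as np


def sym_decorrelate(w):
    """Return (W W^T)^(-1/2) W, which makes the rows of W orthonormal."""
    d, e = np.linalg.eigh(w @ w.T)
    return e @ np.diag(1.0 / np.sqrt(d)) @ e.T @ w


def fast_ica(x, n_iter=200, tol=1e-8, seed=0):
    """Symmetric FastICA with a tanh nonlinearity.

    x has shape (n_mixtures, n_samples). The returned estimates have the
    same shape and are defined only up to permutation, sign, and scale.
    """
    rng = np.random.default_rng(seed)
    x = x - x.mean(axis=1, keepdims=True)

    # Whiten the mixtures so that their covariance is the identity.
    d, e = np.linalg.eigh(np.cov(x))
    z = e @ np.diag(1.0 / np.sqrt(d)) @ e.T @ x

    n, t = z.shape
    w = sym_decorrelate(rng.standard_normal((n, n)))
    for _ in range(n_iter):
        g = np.tanh(w @ z)
        g_prime_mean = (1.0 - g ** 2).mean(axis=1)
        # Fixed-point update: E{z g(w^T z)} - E{g'(w^T z)} w for each row w.
        w_new = sym_decorrelate((g @ z.T) / t - np.diag(g_prime_mean) @ w)
        # Converged when each new row is parallel to the previous one.
        delta = np.max(np.abs(np.abs(np.diag(w_new @ w.T)) - 1.0))
        w = w_new
        if delta < tol:
            break
    return w @ z


# Two independent heavy-tailed toy sources (assumption, not real speech).
rng = np.random.default_rng(1)
sources = rng.laplace(size=(2, 10000))

# Hypothetical instantaneous mixing matrix: one row per microphone.
mixing = np.array([[1.0, 0.6],
                   [0.5, 1.0]])
mixtures = mixing @ sources

estimates = fast_ica(mixtures)

# Absolute correlation between each true source and each estimate.
cross = np.abs(np.corrcoef(np.vstack([sources, estimates]))[:2, 2:])
best_match = cross.argmax(axis=1)
print(np.round(cross, 3))
```

Each row of `cross` should contain one entry close to 1, which indicates that each source has been recovered by one of the estimates, in some order and with arbitrary sign and scale. Under convolutive mixing this simple model no longer holds, and ICA is commonly applied per frequency bin instead.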
