Unstructured Environmental Audio: Representation, Classification and Modeling


Selina Chu (University of Southern California, USA), Shrikanth Narayanan (University of Southern California, USA) and C.-C. Jay Kuo (University of Southern California, USA)
Copyright: © 2011 |Pages: 21
DOI: 10.4018/978-1-61520-919-4.ch001


Recognizing environmental sounds is a basic audio signal processing problem. The authors’ work focuses on characterizing unstructured environmental sounds in order to understand and predict the context surrounding an agent or device. Most research on audio recognition has focused primarily on speech and music, and far less attention has been paid to the challenges and opportunities of unstructured audio. The authors investigate key issues in characterizing unstructured environmental sounds, such as the development of appropriate feature extraction algorithms and learning techniques for modeling the background of an environment.
Chapter Preview


Unstructured audio is an important aspect of building systems capable of understanding their surrounding environment through audio and other modalities of information, e.g., visual, sonar, and global positioning. Consider, for example, applications in robotic navigation, assistive robotics, and other mobile device-based services, where context-aware processing is often desired. Human beings use both vision and hearing to navigate and respond to their surroundings, a capability that is still quite limited in machine processing. The first step toward achieving multi-modal recognition is the ability to process unstructured audio and recognize audio scenes (or environments).

By audio scenes, we refer to locations with distinct acoustic characteristics, such as a coffee shop, park, or quiet hallway. Differences in acoustic characteristics can be caused by the physical environment or by the activities of humans and nature. To enhance a system's context awareness, we need to incorporate and adequately utilize such audio information. A stream of audio data carries a significant wealth of information, enabling a system to capture a semantically richer description of its environment. Moreover, to obtain a more complete description of a scene, fusing audio with other sensory information can be advantageous, for example, in disambiguating environment and object types. To use any of these capabilities, we must first determine the current ambient context.

Most research on environmental sounds has centered on the recognition of specific events or sounds. To date, only a few systems have been proposed that model raw environmental audio without first extracting specific events or sounds. In this work, our focus is not on the analysis and recognition of discrete sound events, but rather on characterizing the general unstructured acoustic environment as a whole. Unstructured environment characterization is still in its infancy: current algorithms have difficulty handling such conditions, and a number of issues and challenges remain. We briefly describe some of the issues that make learning from unstructured audio particularly challenging:

  • One of the main issues is the lack of suitable audio features for environmental sounds. Audio signals have traditionally been characterized by Mel-frequency cepstral coefficients (MFCCs) or other time-frequency representations, such as the short-time Fourier transform and the wavelet transform. Our study found that these traditional features do not perform well on environmental sounds. MFCCs have been shown to work relatively well for structured sounds, such as speech and music, but their performance degrades in the presence of noise. Environmental sounds encompass a large variety of signals, including components with strong temporal-domain signatures, such as the chirping of insects and the sound of rain. Such sounds are in fact noise-like, with broad spectra, and are not effectively modeled by MFCCs.
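To make the MFCC pipeline discussed above concrete, the sketch below implements a simplified version of the standard steps (framing, windowing, power spectrum, mel filterbank, log compression, DCT) in plain NumPy. The frame sizes, filter counts, and sample rate are illustrative assumptions, not settings from the chapter.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_ceps=13):
    """Simplified MFCC extraction (illustrative, not production-grade)."""
    # 1. Frame and window the signal
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop: i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hanning(n_fft)
    # 2. Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3. Triangular mel filterbank (mel scale warps frequency perceptually)
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 4. Log mel energies, then DCT-II to decorrelate; keep n_ceps coefficients
    logmel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return logmel @ dct.T
```

Note that step 4 discards phase and fine temporal structure within each frame, which is one intuition for why noise-like environmental sounds with strong time-domain signatures are poorly captured by this representation.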

  • Modeling the background audio of complex environments is challenging because the audio, in most cases, is constantly changing. The question is therefore what constitutes the background and how to model it. We can define the background of an ambient auditory scene as something recurring and noise-like, made up of various sound sources that change over time, e.g., traffic and passers-by on a street. In contrast, the foreground can be viewed as something unanticipated, a deviation from the background model, e.g., a passing ambulance with its siren. The difficulty lies in identifying foreground events in the presence of background noise, given that the background itself changes at a rate that varies across environments. If we create fixed models with too much prior knowledge, those models may become too specific and generalize poorly to new sounds.
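One simple way to realize the background-as-recurring, foreground-as-deviation view sketched above is a running Gaussian model per feature dimension that adapts only on frames it labels as background, so slow changes (traffic density, crowd level) are absorbed while sudden deviations are flagged. This is a hedged illustration of the idea, not the authors' semi-supervised framework; the adaptation rate and threshold are assumed values.

```python
import numpy as np

class AdaptiveBackground:
    """Running Gaussian background model per feature dimension.

    Frames far from the current background estimate are flagged as
    foreground; the model keeps adapting on background-labeled frames,
    so a slowly drifting background is tracked rather than flagged.
    """
    def __init__(self, dim, alpha=0.05, k=3.0):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.alpha = alpha   # adaptation rate (assumed value)
        self.k = k           # deviation threshold in std units (assumed value)

    def update(self, frame):
        # Normalized distance of the frame from the background model
        dist = np.sqrt(np.mean((frame - self.mean) ** 2 / self.var))
        is_foreground = bool(dist > self.k)
        if not is_foreground:
            # Adapt statistics only on frames labeled as background
            self.mean = (1 - self.alpha) * self.mean + self.alpha * frame
            self.var = (1 - self.alpha) * self.var + self.alpha * (frame - self.mean) ** 2
            self.var = np.maximum(self.var, 1e-6)  # numerical floor
        return is_foreground
```

Freezing adaptation on foreground frames is the design choice that keeps a long siren from being absorbed into the background model, at the cost of never adapting if the threshold is set too low.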

In this chapter, we address these problems. The remainder of the chapter is organized as follows: We first review related and previous work. In the next section, the matching pursuit (MP) algorithm is described and MP-based features are presented. The following section reports on a listening test studying human ability to recognize acoustic environments. In the background modeling section, we present a framework that utilizes semi-supervised learning to model the background and detect foreground events. Concluding remarks are drawn in the last section.
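As a minimal reference for the MP algorithm mentioned in the outline: matching pursuit greedily decomposes a signal by repeatedly selecting the dictionary atom most correlated with the current residual and subtracting its projection. The sketch below assumes an orthonormal dictionary stored as rows of a matrix; the chapter's actual dictionary and feature derivation are not reproduced here.

```python
import numpy as np

def matching_pursuit(x, dictionary, n_iter=10):
    """Greedy MP decomposition of x over unit-norm dictionary atoms (rows)."""
    residual = x.astype(float).copy()
    atoms, coeffs = [], []
    for _ in range(n_iter):
        corr = dictionary @ residual          # correlation with each atom
        i = int(np.argmax(np.abs(corr)))      # best-matching atom
        atoms.append(i)
        coeffs.append(corr[i])
        residual = residual - corr[i] * dictionary[i]  # remove its projection
    return atoms, coeffs, residual
```

The selected atom indices (e.g., their frequency and scale parameters in a time-frequency dictionary) can then serve as sparse descriptors of the signal, which is the general idea behind MP-based features.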
