Audition: From Sound to Sounds

Tjeerd C. Andringa (University of Groningen, Netherlands)
Copyright: © 2011 | Pages: 27
DOI: 10.4018/978-1-61520-919-4.ch004


This chapter addresses the functional requirements that auditory systems, both natural and artificial, must meet to deal with the complexities of uncontrolled real-world input. The demand to function in uncontrolled environments has severe implications for machine audition. The natural system has addressed this demand by adapting its function flexibly to changing task demands. Intentional processes and the concept of perceptual gist play an important role in this. Hearing and listening are seen as complementary processes. The process of hearing detects the existence and general character of the environment and its main and most salient sources. In combination with task demands, these processes allow the pre-activation of knowledge about expected sources and their properties. Consecutive listening phases, in which the relevant subsets of the signal are analyzed, provide the level of detail required by task and system demands. This form of processing requires a signal representation that can be reasoned about. A representation based on source physics is suitable and has the advantage of being situation independent. The demand to determine physical source properties from the signal imposes restrictions on the signal processing. When these restrictions are not met, systems are limited to controlled domains. Novel signal representations are needed to couple the information in the signal to knowledge about the sources in the signal.
Chapter Preview


This chapter addresses machine audition and natural audition by carefully analyzing the difficulties and roles of audition in real-world conditions. The reason for this focus is my experience with the development of a verbal aggression detection system (van Hengel and Andringa, 2007). This system was first deployed in 2004 and 2005 in the inner city of Groningen (the Netherlands), by the company Sound Intelligence, and helps the police to prioritize camera feeds. It is the first commercial sound recognition application for a complex target in uncontrolled (city) environments.

Furthermore, the system is both a prototypical and a rather idiosyncratic example of machine audition. It is prototypical because it must function, like its natural counterpart, in realistic and therefore complex social environments. Inner cities are complex because they are full of people who speak, shout, play, laugh, tease, murmur, sell, run, fall, kick, break, whistle, sing, and cry. The same environment contains birds that sing, dogs that bark, cars that pass or whose doors slam, police cars and ambulances that pass with wailing sirens and screeching tires, pubs that play music, wind that whines, rain that clatters, builders who build, and many, many other rare or common sound events.

What makes the system idiosyncratic is simple: it must ignore all these sounds. The simplest way of doing this is to make it deaf. However, there is one type of sound that should not be ignored: verbal aggression. Of the 2,678,400 seconds in a 31-day month, the system is interested in about 10 seconds of verbal aggression and has to ignore the other 2.7 million seconds or so. Fortunately, the situation is not as bleak as it seems. The police observers may graciously allow the system some false alarms, as long as the majority of them are informative and justifiable. This means that the system is allowed to select no more than about 50 seconds per month, which corresponds to 0.002% of the time. Ignoring almost everything, while remaining vigilant for the occasional relevant event, is an essential property of the natural auditory system. It requires the (subconscious) processing of perceptual information up to the point of estimated irrelevancy. That is exactly what the system aims to do.
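The proportions quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration and is not part of the detection system: the function and variable names are hypothetical, and it assumes that the figure of 2,678,400 seconds refers to a 31-day month, which is the period that number corresponds to.

```python
# Illustrative check of the time budget quoted in the text.
# All names are hypothetical; the 31-day period is an assumption
# inferred from the figure 2,678,400.

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def seconds_in_days(days: int) -> int:
    """Total number of seconds in the given number of days."""
    return days * SECONDS_PER_DAY


def share_percent(selected_seconds: float, total_seconds: int) -> float:
    """Selected time expressed as a percentage of the total time."""
    return 100.0 * selected_seconds / total_seconds


month_seconds = seconds_in_days(31)

# About 50 seconds per month may be selected (alarms, including false ones).
alarm_budget_pct = share_percent(50, month_seconds)

# About 10 seconds per month are actual verbal aggression.
target_pct = share_percent(10, month_seconds)

print(f"seconds in 31 days: {month_seconds:,}")
print(f"alarm budget: {alarm_budget_pct:.4f}% of the time")
print(f"target events: {target_pct:.5f}% of the time")
```

Under these assumptions the alarm budget comes out just under 0.002% of the time, in line with the figure given above, and the target events themselves occupy about a fifth of that budget.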

After a considerable period of optimization the system worked (and works) adequately. However, it has one major restriction: the system is not easily extended or adapted to other tasks and environments. Every migration to a new city or new operating environment requires some expert time to readjust the system. Although this is a restriction the system has in common with other applications of machine learning and pattern recognition, it is qualitatively different from the performance of human audition. In general, the comparison between a natural and an artificial system is not favorable for the artificial system. In fact, it is quite a stretch to refer to the function of the verbal aggression detection system as similar to audition: I consider the comparison degrading to the richness, versatility, robustness, and helpfulness of the natural auditory system.

My experiences with the development of the verbal aggression detection system have led me to reconsider my approach to machine audition. This chapter focuses on the functional demands of audition, both natural and artificial, because I consider the functional level the level where most progress can be made. The functional level is beneficial both for theories about the natural system and for the design of technology that can function on par with its natural counterpart.

Working with police observers, who are not at all interested in the technology itself, but only in whether or not it actually helps them, was also revealing. Expectation management was essential to ensure a favorable evaluation and the eventual definitive deployment of the first system. This is why I use a common-sense definition of audition as a starting point and why I aim to develop systems that comply with common-sense expectations. Only these systems will be truly impressive for the end user.
