Probabilistic Methods in Automatic Speech Recognition

Paul De Palma
DOI: 10.4018/978-1-4666-5888-2.ch024

Chapter Preview



Work in automatic speech recognition has a long history. It began at about the same time that researchers first developed compilers for the early high-level programming languages. Bell Labs, RCA Research, and MIT’s Lincoln Labs all used new ideas in acoustic phonetics to work on the recognition of single digits, syllables, and certain vowel sounds. Work continued through the 1960s in the United States, Japan, and the Soviet Union, using pattern-recognition techniques. This work received a big boost from the development of linear predictive coding in the 1970s, a technique for representing a compressed version of an acoustic signal. In all cases, however, the effort was to develop systems that could recognize single words. Two developments in the 1980s gave ASR its modern shape. The first was Defense Advanced Research Projects Agency funding for research into large vocabulary continuous speech recognition (LVCSR). LVCSR systems have vocabularies in the 20,000 to 60,000 word range and process continuous speech from multiple speakers. The second was the adaptation of statistical techniques, most notably hidden Markov models, to the speech recognition problem.

Though current ASR falls far short of the bar established in the 1968 movie 2001: A Space Odyssey, in which a malevolent but graciously conversational computer takes control of a spacecraft, converting acoustic waveforms to text is useful in a wide variety of contexts. The most obvious community to benefit from voice-augmented computers is the sight-impaired, for whom it might be argued that the evolution from textual input to point-and-click devices has made interacting with a computer somewhat more difficult. The rapid development of mobile computing devices is shifting the user paradigm away from a person sitting at a desk, and possibly away from point-and-click itself. Speech is a natural replacement; but the movement from desktop to mobile computing does not exhaust the possibilities for a spoken interface. Anytime a computer user’s eyes or hands are not available, as in situations where equipment or objects must be handled, the point-and-click model has clear limitations. From automobile mechanics, to industrial device operators, to medical technicians, to airplane pilots, to surgeons, any application that requires hands, eyes, and a computer is a candidate for speech recognition (and synthesis, of course).

Key Terms in this Chapter

Viterbi Algorithm: A dynamic programming technique that stores intermediate results, eliminating the need for repeated identical computations. Often used in the decoder.

Language Model: Expresses the probability that a sequence of words would be uttered by a speaker of English (for example). Called the prior probability, p(x), in Bayesian inference.
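As an illustrative sketch, a language model can assign a probability to a word sequence from counts over a corpus. The tiny corpus and the bigram approximation below are my assumptions for illustration, not material from the chapter:

```python
from collections import Counter

# Hypothetical toy corpus; a real LVCSR language model is trained
# on vastly more text.
corpus = "i want to go i want to eat i want to go".split()

unigrams = Counter(corpus)                    # word counts
bigrams = Counter(zip(corpus, corpus[1:]))    # adjacent word-pair counts

def bigram_prob(sentence):
    """p(w1..wn) ~= p(w1) * product of p(wi | wi-1), the bigram approximation."""
    words = sentence.split()
    p = unigrams[words[0]] / len(corpus)
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

# In this corpus "i want to go" gets a higher prior than "i want to eat",
# so a decoder would prefer it when the acoustic evidence is ambiguous.
```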

Acoustic Model: Expresses the probability that a string of symbols is associated with an acoustic signal. Called the likelihood, p(y|x), in Bayesian inference.

Large Vocabulary Continuous Speech Recognition (LVCSR): Handles vocabulary in the 20,000 to 60,000 word range.

Decoder: Chooses the most likely word string among candidate word strings given an acoustic signal.

Bayesian Inference: A statistical technique that computes the probability of an event occurring given some evidence, using previously computed probabilistic models of the world under discussion. Expressed as Bayes’ Rule: p(x|y) = p(y|x) p(x) / p(y).
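A minimal sketch of Bayes’ Rule applied to choosing between candidate transcriptions. The candidate strings and every probability below are invented for illustration; in a real system the likelihoods come from the acoustic model and the priors from the language model:

```python
def posterior(likelihood, prior, evidence):
    """Bayes' Rule: p(x|y) = p(y|x) * p(x) / p(y)."""
    return likelihood * prior / evidence

# Hypothetical candidates for one acoustic signal, with made-up
# acoustic-model scores p(y|x) and language-model priors p(x).
candidates = {
    "recognize speech": {"likelihood": 0.30, "prior": 0.020},
    "wreck a nice beach": {"likelihood": 0.35, "prior": 0.001},
}

# p(y) is the same for every candidate, so it can be dropped when
# only the argmax matters; rank by the unnormalized score.
best = max(candidates,
           key=lambda w: candidates[w]["likelihood"] * candidates[w]["prior"])
print(best)  # → recognize speech: the prior outweighs the small likelihood gap
```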

Hidden Markov Model: A statistical technique that computes the probabilistic relationship between an observed sequence, say an acoustic signal, and a non-observed (or hidden) sequence, say a word string.
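The two definitions above can be sketched together: a toy HMM whose hidden states are decoded with the Viterbi algorithm. The states, observation symbols, and probabilities are invented for illustration and stand in for phones and acoustic features:

```python
# Toy HMM: hidden states stand in for phones, observations for
# acoustic symbols. All names and probabilities are illustrative.
states = ["ih", "iy"]
start_p = {"ih": 0.6, "iy": 0.4}
trans_p = {"ih": {"ih": 0.7, "iy": 0.3},
           "iy": {"ih": 0.4, "iy": 0.6}}
emit_p = {"ih": {"a1": 0.5, "a2": 0.5},
          "iy": {"a1": 0.1, "a2": 0.9}}

def viterbi(obs):
    """Return the most likely hidden state sequence for obs.

    Dynamic programming: V[t][s] stores the best score of any path
    ending in state s at time t, so each sub-computation is done
    once rather than once per full path.
    """
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for t in range(1, len(obs)):
        col = {}
        for s in states:
            score, path = max(
                (V[t - 1][prev][0] * trans_p[prev][s] * emit_p[s][obs[t]],
                 V[t - 1][prev][1] + [s])
                for prev in states)
            col[s] = (score, path)
        V.append(col)
    return max(V[-1].values())[1]

print(viterbi(["a1", "a2", "a2"]))  # → ['ih', 'iy', 'iy']
```

A real decoder works the same way at far larger scale, with word strings as the hidden sequence and log probabilities to avoid underflow.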

Automatic Speech Recognition (ASR): A set of techniques that attempt to transform spoken words to printed words.

Continuous Speech: Normal conversational speech. To be distinguished from single words or digits.
