Abstract
Speech recognition and synthesis technology has advanced to the point where voice input and output are now feasible for Web-based applications over the Internet. This article describes applications, standards, and architectures for a speech-enabled Web, or SpeechWeb. The ready availability of mobile devices, such as cell phones and PDAs that have wireless Internet access but lack a conventional desktop keyboard, mouse, and large display, makes voice input and output very compelling. For devices with small screens and keyboards, and for hands-free or eyes-free situations, voice input and output is essential for enabling user interaction with the device and making it more user friendly.
Key Terms in this Chapter
Speech Recognition: The process of interpreting human speech for transcription or as a method of interacting with a computer, using a computer equipped with a source of speech input, such as a microphone.
SpeechWeb: A collection of hyperlinked applications that are distributed over the Internet and are accessible through spoken commands and queries input on remote end-user devices.
Speech Synthesis Markup Language (SSML): A standard markup language for controlling how synthesized speech is rendered to the user, including aspects such as pronunciation, volume, pitch, and rate.
Voice Extensible Markup Language (VoiceXML): A standard that is used for defining dialogs and for specifying the exchange of information between a user and a speech application.
Speech Recognition Grammar Specification (SRGS): A standard that allows applications to specify the words and phrases that a speech recognizer should listen for in the user's speech.
Speech Synthesis: The artificial production of human speech. Speech synthesis systems are also called text-to-speech systems in reference to their ability to convert text into speech.
Speech Application Language Tags (SALT): A standard that enables multimodal and telephony access to the Web by providing access to information, applications, and Web services from PCs, telephones, and PDAs.