A Case Study of Applying Decision Theory in the Real World: POMDPs and Spoken Dialog Systems

Jason D. Williams (AT&T Labs, USA)
DOI: 10.4018/978-1-60960-165-2.ch014


Spoken dialog systems present a classic example of planning under uncertainty. Speech recognition errors are ubiquitous and impossible to detect reliably, so the state of the conversation can never be known with certainty. Despite this, the system must choose actions that make progress toward a long-term goal. As such, decision theory, and in particular partially observable Markov decision processes (POMDPs), presents an attractive approach to building spoken dialog systems. Initial work on “toy” dialog systems validated the benefits of the POMDP approach; however, it also found that straightforward application of POMDPs could not scale to real-world problems. Subsequent work by a number of research teams has scaled up planning and belief monitoring, incorporated high-fidelity user simulations, and married commercial development practices with automatic optimization. Today, statistical dialog systems are being fielded by research labs for public use. This chapter traces the history of POMDP-based spoken dialog systems, and sketches avenues for future work.
Chapter Preview


Spoken dialog systems (SDSs) are a widespread commercial technology with a broad range of applications. For example, currently deployed telephone-based SDSs enable callers to check their bank balance, get airline gate information, or find the status of a train. In a car, an SDS enables drivers to change the music, check traffic conditions or get driving directions. SDSs on mobile devices enable people to find a business, send a message, dial a contact, or set a social-networking status. Analysts estimate the total market for SDSs is in the billions of US dollars per year (Kowalke, 2008).

Although widespread, spoken dialog systems remain challenging to build. First, to hear and understand users, spoken dialog systems use Automatic Speech Recognition (ASR) which is prone to errors. Despite years of research, ASR is still imperfect, and yields the wrong answer 20–30% of the time for non-trivial tasks. As a result, a dialog system can never know the user’s true intentions – i.e., to the dialog system, the state of the world is partially observable. Moreover, dialog is a temporal process that requires careful planning: early decisions affect the long-term outcome, and there are important trade-offs between confirming current hypotheses (“Flying to Boston, is that right?”), gathering more information (“When would you like to travel?”), and committing to the current hypothesis (“Ok, issuing a ticket from New York to Boston for flight 103 on March 15.”).

In industry, these two issues are addressed through hand-crafted heuristics. Directed questions (“Please say the time you would like to depart.”), confirmations (“Eleven thirty, is that right?”), and local accept/reject decisions (“Sorry, I didn’t understand. What time was that?”) help reduce uncertainty; and dialog plans – carefully designed by experts – are highly constrained. Although these techniques are sufficient for certain commercial applications, their scope and robustness are inherently limited. Increasing automation by only a few percent would have real commercial impact. Moreover, increasing robustness is an important step toward moving new applications of spoken dialog systems out of the research lab into widespread use, in domains such as robotics (Roy, Pineau, & Thrun, 2000, Doshi & Roy, 2008b), eldercare (Mihailidis, Boger, Candido, & Hoey, 2008), handheld device interaction (Johnston et al., 2002), situated interaction (Bohus & Horvitz, 2009), and others.

With this in mind, researchers at several laboratories have turned to decision theory as a framework for building spoken dialog systems. With sequential decisions and a partially observable state, dialog systems present a classic example of decision-making under uncertainty, for which partially-observable Markov decision processes (POMDPs) are an attractive method. Initial work at several research laboratories applying POMDPs to toy spoken dialog systems in 2000–2005 suggested that POMDPs were indeed capable of achieving significantly better performance than the traditional approach of hand-crafting dialog control. However, this early work also identified numerous barriers to commercial use.
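The core idea behind these POMDP-based systems is that the dialog system maintains a probability distribution (a belief state) over the user's possible goals, updating it with Bayes' rule after each noisy ASR observation rather than committing to the single top recognition hypothesis. The sketch below illustrates this with a hypothetical two-city example; the goal set, the ASR confusion probabilities, and the assumption that the user's goal stays fixed during the dialog (an identity transition model) are all illustrative assumptions, not taken from the chapter.

```python
# Toy belief update over user goals in a spoken dialog POMDP (illustrative only).
# The user's goal is assumed fixed during the dialog, so the POMDP belief update
# b'(s) ∝ O(o|s) * sum_s' T(s|s',a) b(s') reduces to Bayes' rule over the observation.

# Hypothetical ASR confusion model: P(recognized word | true user goal).
# With 20-30% error rates, "boston" is sometimes misrecognized as "austin".
OBS_MODEL = {
    "boston": {"boston": 0.75, "austin": 0.25},
    "austin": {"boston": 0.30, "austin": 0.70},
}

def belief_update(belief, observation):
    """Return the normalized posterior b'(goal) ∝ P(observation | goal) * b(goal)."""
    posterior = {goal: OBS_MODEL[goal][observation] * p
                 for goal, p in belief.items()}
    total = sum(posterior.values())
    return {goal: p / total for goal, p in posterior.items()}

# Start from a uniform belief, then hear "boston" twice in a row.
b = {"boston": 0.5, "austin": 0.5}
b = belief_update(b, "boston")
b = belief_update(b, "boston")
print(b)  # belief in "boston" rises with each consistent observation
```

Because the belief accumulates evidence across turns, two noisy recognitions of "boston" yield far more certainty than either one alone; a policy can then choose between confirming, re-asking, or committing based on this probability rather than on a single error-prone hypothesis.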

Since that early pioneering work, the research community has made substantial progress. Current approaches are now capable of handling a virtually unbounded number of possible dialog states, system actions, and observations, yet perform on-line inference in real-time, and perform off-line planning quickly. Methods have been developed for incorporating business rules into the policy, encoding structured domain knowledge into the state, and automatically learning transition dynamics from unlabelled data. Together these techniques have enabled POMDPs to scale to real-world dialog systems, producing better robustness to speech recognition errors, better task completion rates, and shorter dialogs.

This chapter has three broad goals. First, the next two sections aim to present the spoken dialog task and explain why POMDPs are an attractive solution compared to current practice in industry and related approaches in research. Second, the following section details how POMDPs have been adapted to the requirements of this real-world task. Third, the final section identifies open problems for POMDP-based dialog systems, and suggests avenues for future research.
