This chapter addresses a number of advances in formulating spoken document retrieval for the National Gallery of the Spoken Word (NGSW) and the U.S.-based Collaborative Digitization Program (CDP). After an overview of the audio stream content of the NGSW and CDP audio corpus, an overall system diagram is presented, with a discussion of the critical tasks associated with effective audio information retrieval: advanced audio segmentation, speech recognition model adaptation for acoustic background noise and speaker variability, and information retrieval using natural language processing for text query requests, including document and query expansion. We then present our experimental online system, “SpeechFind,” which allows audio retrieval from the NGSW and CDP corpus. Finally, a number of research challenges and new directions are discussed in order to address the overall task of robust phrase searching in unrestricted audio corpora.
The focus of this chapter is to provide an overview of the SpeechFind online spoken document retrieval system, including its subtasks, corpus enrollment, and online search and retrieval engines (Hansen, Huang, Zhou, Seadle, Deller, Gurijala, et al., 2005), as applied to NGSW (http://www.ngsw.org) and the Collaborative Digitization Program (CDP, http://cdpheritage.org). The field of spoken document retrieval requires an interdisciplinary effort, with contributions from electrical engineers (speech recognition), computer scientists (natural language processing), historians, library archivists, and so forth. As such, we provide a summary of acronyms and definitions of terms at the end of this chapter to assist those interested in spoken document retrieval for audio archives.
Reliable speech recognition for spoken document/information retrieval is challenging when data are recorded across different media, equipment, and time periods. NGSW is the first large-scale repository of its kind, consisting of speeches, news broadcasts, and recordings of significant historical content. The U.S. National Science Foundation recently established an initiative to provide a better transition of library services to digital format. As part of this Phase-II Digital Libraries Initiative, researchers from Michigan State University (MSU) and the University of Texas at Dallas (UTD, formerly at the University of Colorado at Boulder) have teamed to establish a fully searchable, online WWW database of spoken word collections that span the 20th century. The database draws primarily from the holdings of MSU’s Vincent Voice Library (VVL), which includes more than 60,000 hours of recordings.
In the field of robust speech recognition, a variety of challenging problems persist, such as reliable speech recognition across wireless communications channels, recognition of speech across changing speaker conditions (e.g., emotion and stress [Bou-Ghazale & Hansen, 2000; Hansen, 1996; Sarikaya & Hansen, 2000] and accent [Angkititrakul & Hansen, 2006; Arslan & Hansen, 1997]), and recognition of speech from unknown or changing acoustic environments. The ability to achieve effective performance in changing speaker conditions for large vocabulary continuous speech recognition (LVCSR) remains a challenge, as demonstrated in recent DARPA evaluations focused on broadcast news (BN) vs. previous results from the Wall Street Journal (WSJ) corpus.
One natural solution to audio stream search is to perform forced transcription for the entire dataset and simply search the synchronized text stream. While this may be a manageable task for BN (consisting of about 100 hours), the initial offering for NGSW will be 5,000 hours (with a potential total of more than 60,000 hours), and it will simply not be possible to achieve accurate forced transcription, since text data will generally not be available. Other studies have also considered Web-based spoken document retrieval (SDR) (Fujii & Itou, 2003; Hansen, Zhou, Akbacak, Sarikaya, & Pellom, 2000; Zhou & Hansen, 2002). Transcript generation of broadcast news can also be conducted in an effort to obtain near real-time closed captioning (Saraclar, Riley, Bocchieri, & Goffin, 2002). Instead of generating exact transcripts, some studies have considered summarization and topic indexing (Hori & Furui, 2000; Maskey & Hirschberg, 2003; Neukirchen, Willett, & Rigoll, 1999) or, more specifically, topic detection and tracking (Walls, Jin, Sista, & Schwartz, 1999), and others have considered lattice-based search (Saraclar & Sproat, 2004). Some of these ideas are related to speaker clustering (Moh, Nguyen, & Junqua, 2003; Mori & Nakagawa, 2001), which is needed to improve acoustic model adaptation for BN transcription generation. Language model adaptation (Langzhou, Gauvain, Lamel, & Adda, 2003) and multiple/alternative language modeling (Kurimo, Zhou, Huang, & Hansen, 2004) have also been considered for SDR. Finally, cross-lingual and multilingual studies have also been performed for SDR (Akbacak & Hansen, 2006; Navratil, 2001; Wang, Meng, Schone, Chen, & Lo, 2001).
Key Terms in this Chapter
LVCSR: Large Vocabulary Continuous Speech Recognition
Word Error Rate (WER): A performance measure for speech recognition that includes substitution errors (i.e., misrecognition of one word for another), deletion errors (i.e., words missed by the recognition system), and insertion errors (i.e., words introduced into the text output by the recognition system).
Mel Frequency Cepstral Coefficients (MFCC): A standard set of features used to parameterize speech for acoustic models in speech recognition.
NGSW: The National Gallery of the Spoken Word, a consortium of universities supported by the U.S. National Science Foundation (NSF) Digital Libraries Initiative to establish the first nationally recognized, fully searchable online audio archive.
Broadcast News (BN): An audio corpus consisting of recordings from TV and radio broadcasts, used for the development and performance assessment of speech recognition systems.
Out-of-Vocabulary (OOV): In speech recognition, the available vocabulary must first be defined. OOV refers to words contained in the input audio signal that are not part of the available vocabulary lexicon and therefore will always be misrecognized by automatic speech recognition.
Managing Gigabytes (MG): One of two general-purpose systems available for text search and indexing. See the textbook by Witten, Moffat, and Bell (1999) for an extended discussion.
SDR: Spoken Document Retrieval
Collaborative Digitization Program (CDP): A consortium of libraries, universities, and archives working together to establish best practices for transitioning materials (e.g., audio, image, etc.) to digital format.
ASR: Automatic Speech Recognition
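The WER and OOV definitions above can be made concrete with a short sketch. The Python code below is only an illustration and is not taken from the SpeechFind system: the function names `word_error_rate` and `oov_rate` are hypothetical, whitespace tokenization is assumed, and WER is normalized by the number of reference words, which is the conventional choice but is not stated in the definition above.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference word count.

    Assumes whitespace tokenization and the conventional normalization by
    the number of reference words; the minimum error count is found with a
    word-level edit distance.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = minimum errors aligning the reference words seen so far
    # with the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missed
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def oov_rate(words, lexicon):
    """Return the fraction of input words that are absent from the lexicon."""
    words = list(words)
    if not words:
        raise ValueError("words must not be empty")
    return sum(1 for w in words if w not in lexicon) / len(words)


if __name__ == "__main__":
    ref = "the gallery of the spoken word"
    hyp = "the gallery of spoken words today"
    print(word_error_rate(ref, hyp))
    print(oov_rate("the speechfind system".split(), {"the", "system"}))
```

In this example the hypothesis drops one word, substitutes one, and adds one, so three errors are counted against six reference words. Because insertions are counted, WER computed this way can exceed 1.0 when the recognizer outputs many extra words.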