This chapter addresses a number of advances in formulating spoken document retrieval for the National Gallery of the Spoken Word (NGSW) and the U.S.-based Collaborative Digitization Program (CDP). After an overview of the audio stream content of the NGSW and CDP corpora, an overall system diagram is presented, together with a discussion of the critical tasks for effective audio information retrieval: advanced audio segmentation, speech recognition model adaptation for acoustic background noise and speaker variability, and information retrieval using natural language processing for text queries, including document and query expansion. Our experimental online system, entitled “SpeechFind,” is presented, which allows for audio retrieval from the NGSW and CDP corpora. Finally, a number of research challenges and new directions are discussed in order to address the overall task of robust phrase searching in unrestricted audio corpora.
The focus of this chapter is to provide an overview of the SpeechFind online spoken document retrieval system, including its subtasks, corpus enrollment, and online search and retrieval engines (Hansen, Huang, Zhou, Seadle, Deller, Gurijala, et al., 2005, http://www.ngsw.org) and the Collaborative Digitization Program (CDP, http://cdpheritage.org). The field of spoken document retrieval requires an interdisciplinary effort, drawing on researchers from electrical engineering (speech recognition) and computer science (natural language processing) as well as historians, library archivists, and others. As such, we provide a summary of acronyms and definitions of terms at the end of this chapter to assist those interested in spoken document retrieval for audio archives.
Reliable speech recognition for spoken document/information retrieval is challenging when data are recorded across different media, equipment, and time periods. NGSW is the first large-scale repository of its kind, consisting of speeches, news broadcasts, and recordings of significant historical content. The U.S. National Science Foundation recently established an initiative to provide a better transition of library services to digital format. As part of this Phase-II Digital Libraries Initiative, researchers from Michigan State University (MSU) and the University of Texas at Dallas (UTD, formerly at the Univ. of Colorado at Boulder) have teamed to establish a fully searchable, online WWW database of spoken word collections that span the 20th century. The database draws primarily from the holdings of MSU’s Vincent Voice Library (VVL), which includes more than 60,000 hours of recordings.
In the field of robust speech recognition, a variety of challenging problems persist, such as reliable speech recognition across wireless communications channels, recognition of speech across changing speaker conditions (e.g., emotion and stress [Bou-Ghazale & Hansen, 2000; Hansen, 1996; Sarikaya & Hansen, 2000] and accent [Angkititrakul & Hansen, 2006; Arslan & Hansen, 1997]), and recognition of speech from unknown or changing acoustic environments. Achieving effective performance under changing speaker conditions for large vocabulary continuous speech recognition (LVCSR) remains a challenge, as demonstrated in recent DARPA evaluations focused on broadcast news (BN) versus previous results from the Wall Street Journal (WSJ) corpus.
One natural solution to audio stream search is to perform forced transcription for the entire dataset and simply search the time-synchronized text stream. While this may be a manageable task for BN (consisting of about 100 hours), the initial offering for NGSW will be 5,000 hours (with a potential of more than 60,000 total hours), and accurate forced transcription will simply not be possible since text data will generally not be available. Other studies have also considered Web-based spoken document retrieval (SDR) (Fujii & Itou, 2003; Hansen, Zhou, Akbacak, Sarikaya, & Pellom, 2000; Zhou & Hansen, 2002). Transcript generation of broadcast news can also be conducted in an effort to obtain near real-time closed-captioning (Saraclar, Riley, Bocchieri, & Goffin, 2002). Instead of generating exact transcripts, some studies have considered summarization and topic indexing (Hori & Furui, 2000; Maskey & Hirschberg, 2003; Neukirchen, Willett, & Rigoll, 1999), or more specifically, topic detection and tracking (Walls, Jin, Sista, & Schwartz, 1999), and others have considered lattice-based search (Saraclar & Sproat, 2004). Some of these ideas are related to speaker clustering (Moh, Nguyen, & Junqua, 2003; Mori & Nakagawa, 2001), which is needed to improve acoustic model adaptation for BN transcription generation. Language model adaptation (Langzhou, Gauvain, Lamel, & Adda, 2003) and multiple/alternative language modeling (Kurimo, Zhou, Huang, & Hansen, 2004) have also been considered for SDR. Finally, cross- and multilingual-based studies have also been performed for SDR (Akbacak & Hansen, 2006; Navratil, 2001; Wang, Meng, Schone, Chen, & Lo, 2001).
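To make the transcript-search idea above concrete, the following is a minimal sketch (not part of the SpeechFind system) of keyword lookup over time-aligned transcripts. It assumes a hypothetical format in which each document is a list of (word, start-time) pairs, as a forced aligner might produce, and builds an inverted index mapping each word to the documents and time offsets where it occurs:

```python
from collections import defaultdict

def build_index(transcripts):
    """Build an inverted index: word -> list of (doc_id, start_time).

    `transcripts` maps a document id to its time-aligned transcript,
    given as a list of (word, start_time_in_seconds) pairs.
    """
    index = defaultdict(list)
    for doc_id, words in transcripts.items():
        for word, start in words:
            index[word.lower()].append((doc_id, start))
    return index

def search(index, query):
    """Return (doc_id, start_time) hits for each term in the query."""
    hits = []
    for term in query.lower().split():
        hits.extend(index.get(term, []))
    return hits

# Toy example with two hypothetical recordings:
transcripts = {
    "speech_001": [("four", 0.0), ("score", 0.4), ("and", 0.9), ("seven", 1.1)],
    "speech_002": [("we", 0.0), ("choose", 0.3), ("to", 0.6), ("go", 0.7)],
}
index = build_index(transcripts)
print(search(index, "seven"))  # -> [('speech_001', 1.1)]
```

Returning time offsets rather than whole documents is what lets a retrieval front end jump playback directly to the matching point in the audio; the scalability problem noted above is that producing the time-aligned transcripts in the first place requires accurate recognition or reference text, neither of which is available for most of the NGSW holdings.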