Speech/Text Alignment in Web-Based Language Learning
Sheng-Wei Lee (National Chi-Nan University, Taiwan, R.O.C.), Hao-Tung Lin (National Chi-Nan University, Taiwan, R.O.C.) and Herng-Yow Chen (National Chi-Nan University, Taiwan, R.O.C.)
Copyright: © 2005
With the extensive development of network technology and the explosive progress of computing, instructors and students no longer need to meet in the same place to share a classroom experience. In recent years, many efforts have been devoted to developing effective online learning systems, including tele-teaching with automated support and the presentation of vivid classroom experiences. For automated teaching support, the primary issue is a robust, ubiquitous computing application that supports capturing everything, including teaching behaviors, multimedia authoring, and content generation. In practice, the ability to record every event, such as mouse movements, clicks, and typing, during a tele-presentation or a class is the key to reconstructing vivid classroom experiences. Most previous studies focused on issues such as explicit recording for synchronized replay (Abowd, 1999; Muller & Ottmann, 2000), but few examined the special properties of speech and a pre-prepared transcript. The explicitly recorded media streams can be audio, video, slides, and whiteboard, and all of these streams are time dependent: each must be synchronized with a global clock when played back.

However, in language lectures or broadcast news programs, temporal relations already exist between speech and text and need not be recorded explicitly. Such an existing but hidden relationship is regarded as an implicit relation; a relation recorded explicitly by an automated supporting tool is an explicit relation. Explicit relations mean that the correlations between different media are pre-orchestrated as a scheduled scenario or can easily be captured by a recording tool.
The Synchronized Multimedia Integration Language (SMIL) enables simple authoring of interactive audiovisual presentations (W3C, 2004); an SMIL-based document defines the playback scenario, including the temporal, spatial, and content information needed to present multiple media. In contrast, implicit correlations are usually hidden and therefore cannot easily be determined by a simple detecting process; further computational analysis is needed to discover them. Suppose that the implicit relation between content and speech can be analyzed and that the structure of the recorded speech can also be extracted. We can then design a friendly navigation interface for such multimedia documents, one much more convenient than the traditional VCR-like navigation mechanism. Speech/text alignment is the tool that analyzes this implicit temporal relation between speech and text content. The proposed Web-based Synchronized Multimedia Lectures (WSML) system (Chu & Chen, 2002; Chu, Hsu, & Chen, 2001) can exploit the combination of implicit relations (analysis) and explicit relations (capture) to provide an effective integrated presentation for English as a Second Language (ESL) lectures and broadcast news programs.
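To illustrate how an implicit temporal relation might be recovered, the following minimal sketch aligns a pre-prepared transcript against time-stamped word hypotheses from a speech recognizer, using Levenshtein-style dynamic programming. This is only one possible realization of speech/text alignment, not the WSML system's actual algorithm; the function name and the sample data are hypothetical.

```python
def align(transcript, recognized):
    """Align transcript words to (word, time) pairs from a recognizer.

    Uses Levenshtein-style dynamic programming so that recognition
    errors (substituted, inserted, or deleted words) do not derail the
    alignment. Returns a list of (transcript_word, time_or_None),
    where None marks a transcript word with no reliable timestamp.
    """
    n, m = len(transcript), len(recognized)
    # dp[i][j] = minimal edit cost aligning transcript[:i] to recognized[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if transcript[i - 1].lower() == recognized[j - 1][0].lower() else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,  # match or substitution
                           dp[i - 1][j] + 1,        # transcript word unspoken
                           dp[i][j - 1] + 1)        # extra recognized word

    # Trace back through the cost table, assigning each exactly matched
    # transcript word the timestamp of its recognized counterpart.
    times = [None] * n
    i, j = n, m
    while i > 0 and j > 0:
        sub = 0 if transcript[i - 1].lower() == recognized[j - 1][0].lower() else 1
        if dp[i][j] == dp[i - 1][j - 1] + sub:
            if sub == 0:
                times[i - 1] = recognized[j - 1][1]
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return list(zip(transcript, times))


# Illustrative data: the recognizer misheard "morning" as "mourning",
# so that transcript word receives no timestamp.
transcript = "good morning class".split()
recognized = [("good", 0.0), ("mourning", 0.4), ("class", 0.9)]
print(align(transcript, recognized))
# → [('good', 0.0), ('morning', None), ('class', 0.9)]
```

With word-level timestamps attached to the transcript in this way, a navigation interface can jump the audio to any sentence the learner clicks, which is exactly the convenience the VCR-like mechanism lacks.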