Exploiting Captions for Multimedia Data Mining

Neil C. Rowe (U.S. Naval Postgraduate School, USA)
DOI: 10.4018/978-1-60566-014-1.ch073

Captions are text that describes some other information; they are especially useful for describing nontext media objects (images, audio, video, and software). Captions are valuable metadata for managing multimedia, since they help users better understand and remember media (McAninch, Austin, & Derks, 1992-1993) and permit better indexing of media. Captions are essential for effective data mining of multimedia data, since only a small amount of the text in typical documents with multimedia (1.2% in a survey of random World Wide Web pages; Rowe, 2002) describes the media objects. Thus standard Web browsers do poorly at finding media without knowledge of captions. Multimedia information is increasingly common in documents as computer technology improves in speed and ability to handle it, and people need multimedia for a variety of purposes like illustrating educational materials and preparing news stories. Captions are also valuable because nontext media rarely specify internally the creator, date, or spatial and temporal context, and cannot convey linguistic features like negation, tense, and indirect reference. Furthermore, experiments with users of multimedia-retrieval systems show a wide range of needs (Sutcliffe, Hare, Doubleday, & Ryan, 1997), but with a focus on media meaning rather than appearance (Armitage & Enser, 1997). This suggests that content analysis of media is unnecessary for many retrieval situations, which is fortunate, because content analysis is often considerably slower and less reliable than caption analysis. But using captions requires finding them and understanding them. Many captions are not clearly identified, and the mapping from captions to media objects is rarely easy. Nonetheless, the restricted semantics of media and captions can be exploited.

Finding, Rating, and Indexing Captions


Much text in a document near a media object is unrelated to that object, and even text explicitly associated with an object often does not describe it (like “JPEG picture here” or “Photo39573”). Thus, we need clues to distinguish and rate a variety of caption possibilities and the words within them, allowing that there may be more than one caption for an object or more than one object for a caption. Free commercial media search engines (like images.google.com, multimedia.lycos.com, and www.altavista.com/image) use a few simple clues to index media, but their accuracy is significantly lower than that for indexing text. For instance, Rowe (2005) reported that none of five major image search engines could find pictures for “President greeting dignitaries” in 18 tries. So research is exploring a broader range of caption clues and types (Mukherjea & Cho, 1999; Sclaroff, La Cascia, Sethi, & Taycher, 1999).
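
As a rough illustration of one such clue, the sketch below flags text that consists only of format or placeholder words, like the two nondescriptive examples above. The word list and the function name looks_descriptive are assumptions made for this example; they are not taken from the chapter or from any search engine, and a real rating scheme would combine many more clues.

```python
import re

# Hypothetical list of words that say nothing about what a media object depicts.
PLACEHOLDER_WORDS = {
    "jpeg", "jpg", "gif", "png", "picture", "image", "photo", "here", "click",
}

def looks_descriptive(text):
    """Return True if the text has at least one word that is not a placeholder.

    Digits and punctuation are ignored, so 'Photo39573' reduces to 'photo'.
    """
    words = re.findall(r"[A-Za-z]+", text.lower())
    content_words = [w for w in words if w not in PLACEHOLDER_WORDS and len(w) > 2]
    return len(content_words) >= 1

if __name__ == "__main__":
    for candidate in ["JPEG picture here", "Photo39573", "President greeting dignitaries"]:
        print(candidate, "->", looks_descriptive(candidate))
```

A yes/no filter like this is only a first pass; it discards obvious non-captions so that later rating can concentrate on the remaining candidates.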

Sources of Captions

Some captions are explicitly attached to media objects when they are added to a digital library or database. On Web pages, the HTML “alt” attribute and “caption” tag also explicitly associate text with media objects. Clickable text links to media files are another good source of captions since the text must explain the link. The name of a media file itself can be a short caption (like “socket_wrench.gif”). Less-explicit captions use conventions like centering or font changes to set off text. Titles and headings preceding a media object can sometimes serve as captions, as they generalize over a block of information. Paragraphs above, below, or next to media can also be captions, especially short paragraphs.
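
The more explicit of these sources can be collected mechanically. The following is a minimal sketch, assuming a simple Web page and using only the Python standard library; the class name CaptionCandidateParser, the helper names, and the list of media extensions are hypothetical choices for this example. It gathers “alt” text, link text pointing at media files, and words from media file names, and it does not attempt the less-explicit layout clues.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed set of media file extensions; extend as needed.
MEDIA_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".wav", ".mp3", ".mpg", ".avi"}

def filename_words(url):
    """Turn a media file name like 'socket_wrench.gif' into 'socket wrench'."""
    stem, _ext = os.path.splitext(os.path.basename(urlparse(url).path))
    return stem.replace("_", " ").replace("-", " ").strip()

def is_media(url):
    """Return True if the URL path ends in one of the assumed media extensions."""
    return os.path.splitext(urlparse(url).path)[1].lower() in MEDIA_EXTENSIONS

class CaptionCandidateParser(HTMLParser):
    """Collects (media_url, source, text) caption candidates from HTML."""

    def __init__(self):
        super().__init__()
        self.candidates = []
        self._link_href = None   # media link currently open, if any
        self._link_text = []     # text pieces seen inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            src = attrs["src"]
            if attrs.get("alt"):
                self.candidates.append((src, "alt", attrs["alt"].strip()))
            self.candidates.append((src, "filename", filename_words(src)))
        elif tag == "a" and attrs.get("href") and is_media(attrs["href"]):
            self._link_href = attrs["href"]
            self._link_text = []

    def handle_data(self, data):
        if self._link_href is not None:
            self._link_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._link_href is not None:
            text = " ".join("".join(self._link_text).split())
            if text:
                self.candidates.append((self._link_href, "link text", text))
            self.candidates.append(
                (self._link_href, "filename", filename_words(self._link_href)))
            self._link_href = None

def caption_candidates(html):
    """Parse an HTML string and return its caption candidates in page order."""
    parser = CaptionCandidateParser()
    parser.feed(html)
    parser.close()
    return parser.candidates

if __name__ == "__main__":
    page = ('<p><img src="images/socket_wrench.gif" alt="A socket wrench"> '
            '<a href="clips/speech.wav">President greeting dignitaries</a></p>')
    for candidate in caption_candidates(page):
        print(candidate)
```

Keeping the source label with each candidate lets a later rating step weight the sources differently, since file names and link text are not equally reliable as descriptions.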

Other captions are embedded directly into the media, like characters drawn on an image (Lienhart & Wernicke, 2002) or explanatory words at the beginning of audio. These require specialized processing like optical character recognition to extract. Captions can be attached through a separate channel of video or audio, as with the “closed captions” associated with television broadcasts that aid hearing-impaired viewers and students learning languages. “Annotations” can function like captions though they tend to emphasize analysis or background knowledge.

Key Terms in this Chapter

Controlled vocabulary: A limited menu of words from which metadata like captions must be constructed.

Metadata: Information describing another data object such as its size, format, or description.

HTML: Hypertext Markup Language, the base language of pages on the World Wide Web.

Caption: Text describing a media object.

Deixis: A linguistic expression whose understanding requires understanding something besides itself, as with a caption.

Media Search Engine: A Web search engine designed to find media (usually images) on the Web.

Web Search Engine: A Web site that finds other Web sites whose contents match a set of keywords, using a large index to Web pages.

“Alt” String: The text value of the HTML “alt” attribute, used for attaching text to a media object.

Data Mining: Searching for insights in large quantities of data.
