From the late 1990s to the early 2000s, the availability of powerful computing capability, large storage devices, and high-speed networking, and especially the advent of the Internet, led to phenomenal growth of digital multimedia content in terms of size, diversity, and impact. As its name suggests, “multimedia” refers to a collection of data of multiple types, which includes not only “traditional multimedia” such as images and videos, but also emerging media such as 3D graphics (like VRML objects) and Web animations (like Flash animations). Furthermore, relevant techniques have been developed for a growing number of applications, ranging from document editing software to digital libraries and many Web applications. For example, most people who have used Microsoft Word have inserted pictures and diagrams into their documents, and many have watched online video clips such as movie trailers on Web sites such as YouTube.com. Multimedia data are now available in every corner of the digital world.
With the huge volume of multimedia data, finding and accessing the multimedia documents that satisfy people’s needs accurately and efficiently becomes a nontrivial problem. This problem is referred to as multimedia information retrieval. The core of multimedia information retrieval is to compute the degree of relevance between users’ information needs and multimedia data. A user’s information need is expressed as a query, which can take various forms such as a line of free text like “Find me the photos of George Washington,” a few keywords like “George Washington photo,” a media object like a sample picture of George Washington, or a combination of these. Multimedia data, on the other hand, are represented using a certain form of summarization, typically called an index, which is directly matched against queries. Similar to a query, the index can take a variety of forms, including keywords and visual features such as color histograms and motion vectors, depending on the data and task characteristics.
For textual documents, mature information retrieval (IR) technologies have been developed and successfully applied in commercial systems such as Web search engines. In comparison, research on multimedia retrieval is still at an early stage. Unlike textual data, which can be well represented by term vectors that are descriptive of data semantics, multimedia data lack an effective, semantic-level representation that can be computed automatically, which makes multimedia retrieval a much harder research problem. On the other hand, the diversity and complexity of multimedia data offer new opportunities for the retrieval task to benefit from techniques in other research areas. In fact, research on multimedia retrieval has been initiated and pursued by researchers from the areas of multimedia databases, computer vision, natural language processing, human-computer interaction, and so forth. Overall, it is currently a very active research area that has many interactions with other areas.
In the coming sections, we give an overview of the techniques for multimedia information retrieval, followed by a review of the applications and challenges in this area. Future trends are then discussed, and some important terms in this area are defined at the end of this chapter.
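The idea of matching a query against an index can be illustrated for the textual case, where term vectors serve as the index. The following is a minimal sketch only: the whitespace tokenizer, the raw term-frequency weighting, the cosine measure, and the three sample documents are all assumptions chosen for illustration, not a description of any particular IR system.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a bag-of-words term-frequency vector (a very simple textual index)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (doc_id, score) pairs sorted by decreasing relevance to the query."""
    q = term_vector(query)
    scored = [(doc_id, cosine_similarity(q, term_vector(text)))
              for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical text descriptions attached to three media items.
documents = {
    "doc1": "portrait photo of george washington",
    "doc2": "washington state mountain photo",
    "doc3": "movie trailer video clip",
}
ranking = rank("george washington photo", documents)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Here the keyword query is turned into the same kind of vector as the indexed descriptions, so relevance reduces to a vector comparison. No comparably simple semantic representation is available for raw images or video, which is the difficulty noted above.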
Although various techniques have been proposed in the literature, there are three major approaches to multimedia retrieval, namely the text-based approach, the content-based approach, and the hybrid approach. Their main difference lies in the type of index used for retrieval: the first approach uses text (keywords) as the index, the second uses low-level features extracted from the multimedia data, and the last uses a combination of text and low-level features. As a result, they differ from each other in many other aspects, ranging from feature extraction to similarity measures.
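The difference between the three approaches can be sketched as a single scoring function whose weight selects the index being used. Everything in this sketch is an assumption made for illustration: the keyword-overlap score, the distance-based feature score, the linear weighting, the `text_weight` parameter, and the three toy items with three-dimensional feature vectors.

```python
def keyword_score(query_terms, keywords):
    """Fraction of distinct query terms found among an item's text keywords."""
    terms = set(query_terms)
    if not terms:
        return 0.0
    return len(terms & set(keywords)) / len(terms)


def feature_score(query_features, item_features):
    """Similarity in (0, 1] derived from the Euclidean distance between feature vectors."""
    dist = sum((q - x) ** 2 for q, x in zip(query_features, item_features)) ** 0.5
    return 1.0 / (1.0 + dist)


def hybrid_score(query, item, text_weight=0.5):
    """Weighted sum of the text-based and content-based scores.

    text_weight=1.0 gives a purely text-based ranking,
    text_weight=0.0 a purely content-based one.
    """
    t = keyword_score(query["terms"], item["keywords"])
    f = feature_score(query["features"], item["features"])
    return text_weight * t + (1.0 - text_weight) * f


def best_match(query, items, text_weight):
    """Name of the highest-scoring item under the given weighting."""
    return max(items, key=lambda name: hybrid_score(query, items[name], text_weight))


# Hypothetical items: keywords plus a toy low-level feature vector.
items = {
    "a": {"keywords": ["tiger", "zoo"], "features": [0.1, 0.1, 0.8]},  # right text, different look
    "b": {"keywords": ["sunset"], "features": [0.8, 0.1, 0.1]},        # wrong text, same look
    "c": {"keywords": ["tiger"], "features": [0.7, 0.2, 0.1]},         # right text, similar look
}
query = {"terms": ["tiger"], "features": [0.8, 0.1, 0.1]}

content_best = best_match(query, items, text_weight=0.0)
hybrid_best = best_match(query, items, text_weight=0.5)
print(content_best, hybrid_best)
```

In this toy data the content-based ranking prefers the visually identical but unrelated item, while the hybrid ranking prefers the item that agrees with the query on both text and features. A linear combination is only one possible way of fusing the two kinds of evidence.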
Key Terms in this Chapter
Index: In the area of information retrieval, “index” is the representation or summarization of a data item that is used for matching with queries to obtain the similarity between the data and the query, or matching with the indexes of other data items. For example, keywords are frequently used indexes of textual documents, and color histogram is a common index of images. Indexes can be manually assigned or automatically extracted. The text description of an image is usually manually given, but its color histogram can be computed by programs.
Content-Based Retrieval: An important retrieval method for multimedia data, which uses the low-level features (automatically) extracted from the data as the indexes to match with queries. Content-based image retrieval is a good example. The specific low-level features used depend on the data type: color, shape, and texture features are common features for images, while kinetic energy and motion vectors are used to describe video data. Correspondingly, a query can also be represented in terms of features so that it can be matched against the data.
Multimedia Document: A multimedia document is a natural extension of a conventional textual document in the multimedia area. It is defined as a digital document that is composed of one or multiple media elements of different types (text, image, video, etc.) as a logically coherent unit. A multimedia document can be a single picture or a single MPEG video file, but more often it is a complicated document such as a Web page consisting of both text and images.
Cross-Media Retrieval: As an extension of traditional multimedia retrieval methods, cross-media retrieval can be regarded as a unified multimedia retrieval approach that tries to break through the boundaries between the modalities of different media objects. For example, when a user submits a “tiger” image, the system returns “tiger”-related media objects of different modalities, such as the sound of a tiger roaring and a video showing a tiger catching other animals.
High-Dimensional Index: For content-based multimedia retrieval, the low-level features extracted from media objects such as images and audio are usually multi- or high-dimensional. A high-dimensional index is a scheme that efficiently and effectively organizes and orders the features of a large number of multimedia objects. Its aim is to improve the performance of similarity search over large multimedia databases by significantly reducing the search region.
Multimodality: Multiple types of media data, or multiple aspects of a data item. Its emphasis is on the existence of more than one type (or aspect) of data. For example, a clip of digital broadcast news video has multiple modalities, including the audio, video frames, closed captions (text), and so forth.
Multimedia Information Retrieval (System): Storage, indexing, search, and delivery of multimedia data such as images, videos, sounds, 3D graphics, or their combination. By definition, it includes work on, for example, extracting descriptive features from images, reducing high-dimensional indexes into low-dimensional ones, defining new similarity metrics, efficiently delivering the retrieved data, and so forth. Systems that provide all or part of the above functionalities are multimedia retrieval systems. The Google image search engine is a typical example of such a system. A video-on-demand site that allows people to search movies by their titles is another example.
Information Retrieval (IR): The research area that deals with the storage, indexing, and organization of information items, typically textual documents, and with searching for and accessing them. Although its definition includes multimedia retrieval (since information items can be multimedia), conventional IR refers to the work on textual documents, including retrieval, classification, clustering, filtering, visualization, summarization, and so forth. Research on IR started nearly half a century ago and has grown fast in the past 20 years with the efforts of librarians, information experts, and researchers in artificial intelligence and other areas. A system for the retrieval of textual data is an IR system, such as all the commercial Web search engines.
Query-by-Example (QBE): A method of forming queries that contain one or more media object(s) as examples with the intention of finding similar objects. A typical example of QBE is the function of “See Similar Pages” provided in the Google search engine, which supports finding Web pages similar to a given page. Using an image to search for visually similar images is another good example.
Multimedia Database: A database system that is dedicated to the storage, management, and access of one or more media types, such as text, image, video, sound, diagram, etc. For example, an image database such as Corel Image Gallery that stores a large number of pictures and allows users to browse them or search them by keywords can be regarded as a multimedia database. An electronic encyclopedia such as Microsoft Encarta Encyclopedia, which consists of tens of thousands of multimedia documents with text descriptions, photos, video clips, and animations, is another typical example of a multimedia database.
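To make the terms Index, Content-Based Retrieval, and Query-by-Example defined above more concrete, the following sketch computes a color histogram as the index of an image and ranks a small collection against an example image. It is an illustration under stated assumptions: images are represented as plain lists of (R, G, B) tuples rather than decoded files, the quantization into two bins per channel and the histogram-intersection measure are choices made here, and the image names and pixel values are hypothetical.

```python
def color_histogram(pixels, bins_per_channel=2):
    """Quantize each 8-bit RGB channel into bins and return a normalized histogram."""
    step = 256 // bins_per_channel

    def bin_of(value):
        # Clamp so that values near 255 never fall outside the last bin.
        return min(value // step, bins_per_channel - 1)

    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        idx = (bin_of(r) * bins_per_channel + bin_of(g)) * bins_per_channel + bin_of(b)
        hist[idx] += 1.0
    total = len(pixels)
    return [count / total for count in hist] if total else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for normalized histograms; 1 means identical histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def query_by_example(example_pixels, index):
    """Rank indexed images by color similarity to the example image."""
    q = color_histogram(example_pixels)
    scored = [(name, histogram_intersection(q, h)) for name, h in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical "images", each only eight pixels large.
RED, DARK_RED = (220, 30, 30), (200, 10, 40)
BLUE, GREEN = (20, 40, 230), (30, 200, 40)
database = {
    "sunset": [RED] * 6 + [BLUE] * 2,
    "ocean": [BLUE] * 7 + [GREEN] * 1,
    "forest": [GREEN] * 6 + [BLUE] * 2,
}

# Indexing step: the histogram is extracted automatically from each image.
index = {name: color_histogram(pixels) for name, pixels in database.items()}

# Query step: the example image is converted to the same feature and matched.
example = [DARK_RED] * 5 + [BLUE] * 3
results = query_by_example(example, index)
for name, score in results:
    print(name, score)
```

The query is never compared with the raw pixels of the stored images, only with their precomputed indexes. The linear scan over all histograms is what a high-dimensional index would aim to avoid in a large database.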