While the technology for mining text documents in large databases could be said to be relatively mature, the same cannot be said for mining other important data types such as speech, music, images and video. Multimedia data mining attracts considerable attention from researchers, but it is still at the experimental stage (Hsu, Lee & Zhang, 2002). Nowadays, the most effective way to search multimedia archives is to search the metadata of the archive, which is normally created manually by humans. This is already uneconomic or, in an increasing number of application areas, quite impossible, because these data are being collected much faster than any group of humans could meaningfully label them, and the pace is accelerating, amounting to a veritable explosion of non-text data. Some driver applications are emerging from heightened security demands in the 21st century, postproduction of digital interactive television, and the recent deployment of a planetary sensor network overlaid on the internet backbone.
Although it is said that a picture is worth a thousand words, computer scientists know that the amount of information contained in an image, compared with a text document, is often far greater than this. Providing text labels for image data is problematic because appropriate labeling depends heavily on the queries users will wish to perform, and those queries are difficult to anticipate at the time of labeling. For example, a simple image of a red ball might best be labeled as sports equipment, a toy, a red object, a round object, or even a sphere, depending on the nature of the query. These difficulties with text metadata have led researchers to concentrate on techniques from the fields of pattern recognition and computer vision that work on the image content itself. Although pattern recognition, computer vision, and image data mining are quite different fields, they share a large number of common functions (Hsu, Lee & Zhang, 2002).
An interesting commercial application of pattern recognition is a system to semi-automatically annotate video streams to provide content for digital interactive television. A similar idea was behind the MIT MediaLab Hypersoap project (The Hypersoap Project, 2007; Agamanolis & Bove, 1997). In this system, users touch images of objects and people on a television screen to bring up information and advertising material related to the object. For example, a user might select a famous actor, and a page would appear describing the actor and the films in which they have appeared; the viewer might then be offered the opportunity to purchase copies of those films. In the case of Hypersoap, the metadata for the video was created manually. Automatic face recognition and tracking would greatly simplify the task of labeling video in post-production, which is the major cost component of producing such interactive video.
With the rapid development of computer networks, some web-based image mining applications have emerged. SIMBA (Siggelkow, Schael, & Burkhardt, 2001) is a content-based image retrieval system that performs queries based on image appearance over a database of about 2,500 images. RIYA (RIYA Visual Search) is a visual search engine that attempts to retrieve images whose content is relevant to the input context. In 2007, Google added face detection to its image search engine (Google Face Search). For example, the URL http://images.google.com/images?q=bush&imgtype=face will return faces associated with the name “Bush”, including many images of recent US presidents. While the application appears to work well, it does not actually identify the face images; instead, it relies on the associated text metadata to determine identity.
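To make the idea of appearance-based querying concrete, the following minimal Python/NumPy sketch ranks database images against a query image using global color-histogram similarity. This is an illustration only, not the actual feature set used by SIMBA or any of the systems above; the image names and the histogram-intersection measure are chosen purely for the example.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Normalized joint RGB histogram of an H x W x 3 uint8 image."""
    hist, _ = np.histogramdd(
        image.reshape(-1, 3),
        bins=(bins, bins, bins),
        range=((0, 256), (0, 256), (0, 256)),
    )
    hist = hist.ravel()
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical color distributions."""
    return np.minimum(h1, h2).sum()

def query_by_appearance(database, query_image, top_k=3):
    """Rank database images by color similarity to the query image.

    `database` maps an image name to an H x W x 3 uint8 array.
    """
    q = color_histogram(query_image)
    scores = [(name, histogram_intersection(q, color_histogram(img)))
              for name, img in database.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]
```

A global histogram ignores spatial layout, which is why real systems augment it with texture, shape, or invariant features, but it already supports the basic query-by-appearance interaction described above.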
None of the above systems supports the input of a face image as a query to retrieve other images of the same person. A robust face recognition method is needed for such systems. We now focus on the crucial technology underpinning such a data mining service: automatically recognizing faces in image and video databases.
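One classical way to support query-by-example face retrieval is to embed face images in a low-dimensional feature space and return the nearest enrolled faces. The sketch below uses PCA, the well-known eigenface approach, purely as an illustration of this pipeline; it is not necessarily the method discussed in what follows, and it assumes faces have already been detected, cropped to a fixed size, and aligned. The class and method names are invented for the example.

```python
import numpy as np

class EigenfaceIndex:
    """Query-by-example face retrieval via PCA ("eigenfaces")."""

    def __init__(self, faces, labels, n_components=8):
        # Flatten each aligned grayscale face into a vector.
        X = np.stack([f.ravel().astype(float) for f in faces])
        self.labels = list(labels)
        self.mean = X.mean(axis=0)
        # Principal components of the centered training faces.
        _, _, vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.components = vt[:n_components]
        # Low-dimensional embeddings of the enrolled faces.
        self.embeddings = (X - self.mean) @ self.components.T

    def query(self, face, top_k=3):
        """Return (label, distance) pairs for the most similar faces."""
        e = (face.ravel().astype(float) - self.mean) @ self.components.T
        dists = np.linalg.norm(self.embeddings - e, axis=1)
        order = np.argsort(dists)[:top_k]
        return [(self.labels[i], float(dists[i])) for i in order]
```

Robustness in practice hinges on the steps this sketch assumes away: detecting the face, normalizing pose and illumination, and choosing features that remain discriminative under real-world variation.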