Computer Vision for Multimedia Applications: Methods and Solutions


Jinjun Wang (NEC Laboratories America, Inc., USA), Jian Cheng (Chinese Academy of Sciences, China) and Shuqiang Jiang (Chinese Academy of Sciences, China)
Indexed In: SCOPUS
Release Date: October, 2010|Copyright: © 2011 |Pages: 354
ISBN13: 9781609600242|ISBN10: 160960024X|EISBN13: 9781609600266|DOI: 10.4018/978-1-60960-024-2


Although a number of methods for solving computer vision tasks exist, these methods are often task-specific and can seldom be generalized over a wide range of applications. In addition, many computer vision algorithms have not been thoroughly studied.

Computer Vision for Multimedia Applications: Methods and Solutions includes the latest developments in computer vision methods applicable to various problems in multimedia computing. This publication presents discussions on new ideas, as well as problems in computer vision and multimedia computing. It will serve as an important reference in multimedia and computer vision for academicians, researchers, and academic libraries.

Topics Covered

The many academic areas covered in this publication include, but are not limited to:

  • 3D Modeling
  • Broadcasting technologies
  • Computer vision in human computer interaction
  • Content-based multimedia retrieval
  • Image Synthesis
  • Motion analysis in multimedia
  • Multimedia content adaptation in wireless environments
  • Multimedia visual content representation
  • Object analysis in multimedia
  • Video segmentation

Reviews and Testimonials

We believe that a book addressing the problem of applying computer vision to multimedia computing is highly necessary. To a limited extent, the book can present existing works that pioneer the field; to a larger extent, the book will make people in related areas aware of the situation and inspire them to develop systems with better performance.

– Jinjun Wang, NEC Laboratories America, Inc., USA; Jian Cheng, Chinese Academy of Sciences, China; and Shuqiang Jiang, Chinese Academy of Sciences, China

Table of Contents and List of Contributors



Multimedia research focused mainly on images in the early 1990s, and multimedia then became nearly synonymous with video in the mid-1990s. Other modalities, such as audio and text, began to be added after 2000. Today, the ever-increasing variety and decreasing cost of sensors enable the use of additional media such as GPS location data, infrared, motion sensor information, optical sensor data, and biological and physiological sensor signals. There have been sporadic discussions on the merits and demerits of using single or joint media, and it is now accepted that the specific needs of each application define the set of media, whether single or multiple.

To process different types of media, multimedia computing research has borrowed methods from other domains, trimming and fusing them to fulfill the needs of individual multimedia applications. For example, identifying audio keywords in sports video mainly uses the MFCC features popular in the speech recognition field, while analyzing the text streams in news video borrows many algorithms from the natural language processing area. Among all the studied types of media, video is still the most fundamental one. This defines a coupled relationship between multimedia computing and computer vision, because the latter provides the techniques necessary to go from sequences of two-dimensional images to structural descriptions of the visual content: edge/texture detection and SIFT/SURF local descriptors for low-level video feature extraction, image segmentation and object tracking for mid-level video keyword creation, and image understanding and object/scene recognition for high-level video semantic annotation.
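To make the low-level end of this pipeline concrete, the following is a minimal illustrative sketch (not taken from the book) of edge detection with the classic Sobel operator, operating on a grayscale image stored as a plain 2-D list of intensities:

```python
# Minimal Sobel edge-magnitude sketch (illustrative; not from the book).
# A 3x3 horizontal and vertical kernel are convolved with the image and
# combined into a gradient magnitude, a basic low-level visual feature.

def sobel_magnitude(img):
    """Return the gradient-magnitude map of a 2-D grayscale image."""
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal Sobel kernel
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical Sobel kernel
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):                    # skip the 1-pixel border
        for x in range(1, w - 1):
            gx = sum(kx[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(ky[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

# A vertical step edge: the response peaks along the boundary columns.
img = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edges = sobel_magnitude(img)
```

Production systems would of course use an optimized library rather than nested Python loops, but the structure is the same: local filtering first, then higher-level grouping and recognition built on top.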

Recovering an unknown structural representation from video is an inverse problem, and there is no standard formulation of how this inverse problem should be solved. In other words, although an abundance of methods exists for solving various well-defined computer vision tasks, these methods are often task-specific and can seldom be generalized over a wide range of applications. In addition, many existing computer vision algorithms are still in the state of basic research, so it is not straightforward how computer vision methods can be applied to multimedia applications. At the same time, little literature is available that discusses the disciplines of computer vision for multimedia applications. This motivated us to edit a book addressing problems related to applying computer vision to multimedia computing. The book includes many of the latest developments in computer vision methods applicable to various problems in multimedia computing, and lists successful examples of multimedia modeling, systems, and applications where computer vision techniques play an indispensable role. The book is organized as follows:

Section 1 focuses on introducing computer vision in human-computer interaction. Multimedia systems are primarily designed with the human being as the user. This human-centered design is a distinctive feature and defines the necessity of human-computer interaction (HCI) functionality. Recent developments in Internet media, streaming, gaming, and related fields have further demanded the next generation of HCI, i.e., Human Multimedia Interaction (HMI), where users may convey messages and emotions to the computer, or even to other users, with the help of multimedia. Human beings perceive the world as three-dimensional, while most multimedia documents convey only one-dimensional audio and/or two-dimensional visual signals. To let an automatic system perceive the world the way a human does, it is necessary to recover unknowns from information insufficient to fully specify the solution; computer vision research addresses exactly this inverse problem. Although HCI using other modalities, such as human speech, has already been made possible, modeling the visual world in all its rich complexity with existing computer vision techniques is far more challenging than modeling the vocal tract that produces spoken sounds. Human beings can perceive objects, shapes, shading, and other three-dimensional structure from video with ease. Despite decades of research by perceptual psychologists trying to understand how the human visual system works, and by computer vision researchers trying to develop mathematical techniques for recovering the three-dimensional shape and appearance of objects in imagery, many people still believe it remains challenging for a computer to interpret visual content at the level of a child.
Fortunately, with physics-based and probabilistic models, existing computer vision techniques can already accomplish many tasks with moderate success, and HMI tools and services based on these works are emerging. Section 1 of this book presents several representative works on face modeling, landmark recognition, and expression analysis for HMI applications.

Sections 2 and 3 discuss computer vision techniques for multimedia content analysis, summarization, and retrieval. The problem arises with the explosive growth of multimedia content, for which automatic understanding is in increasing demand. Generally speaking, multimedia content consists of heterogeneous multi-modal media such as images, video, audio, and text. By perception, these media can be roughly classified into two categories: visual content and audio content. As an inter-discipline of computer science, vision, and statistics, computer vision has been investigated for nearly thirty years. A set of relatively mature theories and algorithms has been developed, which provides a substantial foundation for multimedia content analysis, and some issues in multimedia content analysis are directly or indirectly derived from the computer vision field. In these sections, we selected several works that focus on three aspects: visual feature extraction, object detection/tracking/recognition, and video structuring. Feature extraction is the basis of multimedia content analysis; the low-level visual features extracted for multimedia content analysis are usually borrowed from the computer vision community, such as the color histogram, Gabor/Tamura texture descriptors, and Fourier shape descriptors. Object detection, tracking, and recognition technologies deal with capturing and recognizing instances of semantic objects of a certain class; some well-researched object categories, including faces, hands, cars, and pedestrians, have uses in many multimedia applications such as image/video retrieval and video surveillance. Video structuring aims at identifying semantic units in the multimedia data at different hierarchical levels; it is usually domain-specific, such as shots and stories in news video and events in sports video. Video structuring relies on analyzing temporal transitions or motion patterns, which poses challenging topics in applying techniques available for static images.
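As an illustration of the simplest feature named above, the color histogram, here is a minimal sketch (illustrative only, not from the book) that quantizes each RGB channel into a few bins and concatenates the counts into a fixed-length descriptor:

```python
# Minimal color-histogram feature sketch (illustrative; not from the book).
# Each RGB channel is quantized into bins_per_channel bins; the per-channel
# counts are concatenated and L1-normalized into a fixed-length descriptor
# that can be compared across images for retrieval.

def color_histogram(pixels, bins_per_channel=4):
    """pixels: iterable of (r, g, b) tuples with values in 0..255."""
    step = 256 // bins_per_channel
    hist = [0] * (3 * bins_per_channel)
    n = 0
    for r, g, b in pixels:
        hist[r // step] += 1
        hist[bins_per_channel + g // step] += 1
        hist[2 * bins_per_channel + b // step] += 1
        n += 1
    # L1-normalize so images of different sizes remain comparable
    return [c / (3 * n) for c in hist] if n else hist

# Two reddish pixels and one bluish pixel
feat = color_histogram([(250, 10, 10), (240, 0, 0), (5, 5, 250)])
```

Descriptors like this are typically compared with histogram intersection or a chi-squared distance; more discriminative variants bin jointly over all three channels or over perceptual color spaces.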

Section 4 discusses multimedia authentication. Multimedia signals in electronic form are easy to reproduce and manipulate, especially with the availability of versatile multimedia processing software and the wide coverage of the Internet. The existing HMI tools and content-based retrieval engines have also eased the way for large-scale multimedia applications. However, abuses of these technologies pose threats to multimedia security management and multimedia copyright protection. Multimedia authentication aims to confirm the genuineness or truth of the structure and/or content of multimedia documents, but the unique characteristics of multimedia data make traditional authentication methods based on physical clues inapplicable. Nowadays, there are mainly two approaches to multimedia authentication: cryptography and digital watermarking. The cryptographic method, also called the digital signature technique, depends on the multimedia content and certain secret information known only to the signer. The digital signature cannot be forged, and the authenticator can verify multimedia data by examining whether its content matches the information contained in the digital signature. The digital watermarking method modifies the multimedia bitstream to embed codes, called watermarks, without changing the meaning of the content. The embedded watermark may represent either a specific producer identification label or content-based codes generated by applying certain rules; the authenticator examines the watermarks to verify the integrity of the data. For both approaches, however, since multimedia data are usually distributed and re-interpreted by many interim entities, the authentication information may become distorted or discarded. Hence it is challenging to guarantee trustworthiness between the origin source and the final recipient.
In this section, we show two representative works related to multimedia watermarking and forgery detection.
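The signature-based approach can be sketched with Python's standard hmac module. This is a simplification under stated assumptions: a keyed HMAC stands in for a true public-key digital signature, and the key name and content bytes are hypothetical placeholders, not anything from the book:

```python
import hashlib
import hmac

# Sketch of content-based authentication (illustrative; a keyed HMAC
# stands in for a public-key digital signature, and "content" would in
# practice be the multimedia bitstream or robust features derived from it).

SECRET_KEY = b"known-only-to-the-signer"   # hypothetical shared secret

def sign(content: bytes) -> str:
    """Signer: derive an authentication code from the content."""
    return hmac.new(SECRET_KEY, content, hashlib.sha256).hexdigest()

def verify(content: bytes, signature: str) -> bool:
    """Authenticator: recompute the code and compare in constant time."""
    return hmac.compare_digest(sign(content), signature)

original = b"frame-data..."
tag = sign(original)
ok = verify(original, tag)              # content unmodified -> accepted
tampered = verify(b"frame-dataX", tag)  # any bit change -> rejected
```

Note that such an exact-match code is fragile in precisely the way the paragraph above describes: legitimate transcoding by an interim entity changes the bitstream and breaks verification, which is why content-based (rather than bit-exact) authentication codes are studied.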

Section 5 presents several biologically inspired methods for multimedia computing. Current Internet-scale multimedia databases, highly distributed and orderless, bring difficulties for many multimedia applications. The solutions to these problems may lie in nature. Many species exhibit collective movement patterns that are highly organized compared to the seemingly random behaviors of individuals. This shows that aggregate behavior in these animals may have special group-level properties that go beyond the ability of an individual, and evidence shows that such group behaviors are not coordinated by a centralized leader. This implies an intrinsic mechanism among aggregates that overcomes individuals' drawbacks and yields results that might be impossible for individuals to attain. This fact has inspired many algorithms and methods in which, through cooperation among much simpler group members, more elegant and complicated tasks can be accomplished. Biologically inspired algorithms have shown promising results in many domains, including distributed covering and searching, optimization, distributed localization and estimation, group pattern modeling, and group formation. These algorithms have also found uses in multimedia applications due to the similarity between the multimedia computing environment and the structure of animal aggregation societies. In this section, we include several works that utilize biologically inspired algorithms to solve a family of multimedia problems.
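A canonical example of such cooperation is particle swarm optimization. The sketch below is illustrative only (not from the book, with commonly used but arbitrary coefficient choices): a swarm of simple particles minimizes a function with no coordination beyond a shared best-known position:

```python
import random

# Minimal particle swarm optimization sketch (illustrative; not from the
# book). Each particle blends inertia, attraction to its own best position,
# and attraction to the swarm's best position; no central leader plans the
# search, yet the swarm converges on the minimum of f(x) = x^2.

def pso(f, lo, hi, n_particles=20, iters=100, seed=0):
    rng = random.Random(seed)
    pos = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = pos[:]                  # each particle's best position so far
    gbest = min(pos, key=f)         # the swarm's best position so far
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            vel[i] = (0.7 * vel[i]                      # inertia
                      + 1.5 * r1 * (pbest[i] - pos[i])  # cognitive pull
                      + 1.5 * r2 * (gbest - pos[i]))    # social pull
            pos[i] += vel[i]
            if f(pos[i]) < f(pbest[i]):
                pbest[i] = pos[i]
            if f(pos[i]) < f(gbest):
                gbest = pos[i]
    return gbest

best = pso(lambda x: x * x, -10.0, 10.0)   # converges near x = 0
```

The same cooperative scheme generalizes to higher-dimensional search spaces, which is how such methods are applied to multimedia problems like parameter tuning and distributed resource allocation.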

In summary, multimedia computing and computer vision are different but closely related domains. Both emerged around the 1970s, when computers could first manage the processing of large data sets such as images, and after decades of development rich literature exists in both. In this book, we formally present the problems related to applying computer vision to multimedia computing. Never before has multimedia technology so urgently demanded computer vision algorithms and methods to support its applications. On the other side, the field of computer vision is still characterized as diverse (task-specific) and immature (in the state of basic research), and this is expected to endure for the near future. Hence, at this particular historic moment, we believe that a book addressing the problem of applying computer vision to multimedia computing is highly necessary. To a limited extent, the book presents existing works that pioneer the field; to a larger extent, it will make people in related areas aware of the situation and inspire them to develop systems with better performance.

Author(s)/Editor(s) Biography

Jinjun Wang received the B.E. and M.E. degrees from Huazhong University of Science and Technology, China, in 2000 and 2003, respectively, and the Ph.D. degree from Nanyang Technological University, Singapore, in 2006. From 2006 to 2009, Dr. Wang was with NEC Laboratories America, Inc. as a postdoctoral research scientist, and in 2010 he joined Epson Research and Development, Inc. as a senior research scientist. His research interests include pattern classification, image/video enhancement and editing, content-based image/video annotation and retrieval, and semantic event detection. He has published over 30 journal and conference papers in those areas and has six US patents pending. Dr. Wang has served as a Technical Program Committee member of major international multimedia conferences, including ACM MM'08, IEEE PCM'09, IEEE MMM'09/'10, and IEEE 3D-TV'09/'10, and as a peer reviewer for many journals and conferences.
Jian Cheng is currently an associate professor at the Institute of Automation, Chinese Academy of Sciences. He received the B.S. and M.S. degrees in Mathematics from Wuhan University in 1998 and 2001, respectively, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, in 2004. From 2004 to 2006, he worked as a postdoctoral researcher at Nokia Research Center, and then joined the National Laboratory of Pattern Recognition, Institute of Automation. His current research interests include image and video search and machine learning. He has authored or co-authored more than 40 academic papers in these areas, and was awarded the LU JIAXI Young Talent Prize in 2010. Dr. Cheng has served as a Technical Program Committee member for international conferences such as ACM Multimedia 2009 (content), IEEE Conference on Computer Vision and Pattern Recognition (CVPR'08), IEEE International Conference on Multimedia and Expo (ICME'08), Pacific-Rim Conference on Multimedia (PCM'08), and IEEE International Conference on Computer Vision (ICCV'07). He has also co-organized a special issue of the Pattern Recognition journal and several special sessions at PCM 2008, ICME 2009, and PCM 2010.
Shuqiang Jiang is an associate professor. He received the Ph.D. degree from ICT, CAS, China, in 2005. He is currently a faculty member at the Digital Media Research Center, Institute of Computing Technology, Chinese Academy of Sciences, and is also with the Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences. His research interests include multimedia processing and semantic understanding, pattern recognition, and computer vision. He has published over 60 technical papers in the area of multimedia. He is a member of IEEE and ACM. He served as General Special Session Co-Chair of the Pacific-Rim Conference on Multimedia (PCM 2008), and as a Technical Program Committee member in many prestigious multimedia conferences, including the IEEE Conference on Computer Vision and Pattern Recognition, the IEEE International Conference on Computer Vision, ACM Multimedia, the International Conference on Multimedia and Expo (ICME), and the Pacific-Rim Conference on Multimedia (PCM).


Editorial Board

  • Dr. Hanjalic Alan, Delft University of Technology, Netherlands
  • Dr. Xu Changsheng, Institute of Automation, Chinese Academy of Sciences, China
  • Dr. Lu Hanqing, Chinese Academy of Sciences, China
  • Dr. Ebroul Izquierdo, Queen Mary University of London, UK
  • Dr. Jin Jesse S., University of Newcastle, Australia
  • Dr. Pietikainen Matti, University of Oulu, Finland 
  • Dr. Tian Qi, Microsoft Research Asia, China
  • Dr. Gao Wen, Peking University, China