Video Event Understanding
Nikolaos Gkalelis (Information Technologies Institute, Centre for Research and Technology Hellas, Greece), Vasileios Mezaris (Information Technologies Institute, Centre for Research and Technology Hellas, Greece), Michail Dimopoulos (Information Technologies Institute, Centre for Research and Technology Hellas, Greece) and Ioannis Kompatsiaris (Information Technologies Institute, Centre for Research and Technology Hellas, Greece)
Copyright: © 2015
High-level events can be conceived as dynamic objects that pace our everyday activities and index our memories. This definition reflects the compositional nature of an event (it consists of actions, actors, objects, locations, times and other components, with possible relations among them) and implies that its perception depends on the cultural and personal perspective of the observer (Brown, 2005). For this reason, event understanding technologies are generally expected to offer human users effective organization of multimedia content and natural-language descriptions of it. However, event understanding is much more challenging than detecting simple human actions (Turaga, Chellappa, Subrahmanian, & Udrea, 2008) or other semantic concepts (Mezaris, Papadopoulos, Briassouli, Kompatsiaris, & Strintzis, 2008).
The need for event models that describe real-life events in video signals has recently been acknowledged as an essential step towards effective large-scale multimedia content analysis, indexing and search (Gupta & Jain, 2011). Westermann and Jain (2007) define a set of aspects that an event model should satisfy, such as media independence and model interoperability, among others. In the following we briefly review a representative fraction of the related work. In Scherp, Franz, Saathoff, and Staab (2009), the pattern-oriented ontology approach of DOLCE and DUL is used to define the event-model F so that several event aspects (Westermann & Jain, 2007) are addressed. Gupta and Jain (2011) present the event-model E*, which extends the event-model E (Westermann & Jain, 2007) using a graph-based design and the ABC and DOLCE ontologies to provide formal definitions of event aspects. Gkalelis, Mezaris, and Kompatsiaris (2010a, 2010b) present a joint content-event model that additionally provides a mechanism for the automatic (or semi-automatic) association and enrichment of event descriptions with multimedia content, so that video event analysis technologies can be directly exploited for populating the model. Combining event models and video annotation tools, Agius and Angelides (2006) proposed COSMOS-7, an MPEG-7 compliant scheme for modeling events, objects and the spatiotemporal relationships among them, and based on it they designed COSMOSIS to enable annotation of video content. Several other video annotation tools that support the generation of event-based video descriptions have been presented, such as Vannotator (http://homepages.inf.ed.ac.uk/rbf/CAVIAR/), and others.
Key Terms in this Chapter
Multimedia Indexing: The problem of preprocessing multimedia content so that the multimedia content items (images, videos, shots, scenes, visual objects, etc.) can be efficiently and accurately retrieved at the required granularity levels.
Ad Hoc Multimedia Event Detection (Ad Hoc MED): A task where an event detection system must learn a set of events which are not known a priori (at the time of designing the event detection system).
Model Vector: The vector created from the ordered concatenation of the degrees of confidence (DoCs) retrieved using a set of pre-trained semantic concept detectors. Intuitively, each component of the model vector expresses the degree of confidence that the respective semantic concept is depicted in the video keyframe.
Semantic Concept Detector: The software and/or hardware implementation of a logical procedure that receives as input a video keyframe and provides as output a DoC regarding the presence of the respective semantic concept in the keyframe.
Bag-of-Words (BoW): This method exploits a word vocabulary to represent a document, image or video keyframe as a histogram of word occurrences. A word vocabulary is generated by clustering a large set of feature vectors and treating each cluster centre as one word. The traditional BoW method has been extended in several ways, e.g., using a spatial pyramid technique to encode spatial information in the BoW model.
Degree of Confidence (DoC): A real number in the range [0,1] that expresses the reliability of an estimate. For instance, when a concept detector is applied to a video keyframe, the derived DoC expresses our confidence in the hypothesis that the respective concept is depicted in the keyframe.
Multimedia Event Recounting (MER): A textual, human-understandable description of the key semantic entities of the detected event in a particular video. Ideally, a human user should be able to match the MER of a video to the specific event and video that the MER refers to.
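As a rough illustration of two of the representations defined above, the following Python sketch builds a BoW histogram and a model vector for a single keyframe. It assumes that the word vocabulary (the cluster centres) has already been learned and that each semantic concept detector can be treated as a function returning a DoC in [0,1]. The toy vocabulary, the feature vectors, the concept names and the stub detectors are all hypothetical and do not come from the chapter.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def nearest_word(feature: Vector, vocabulary: Sequence[Vector]) -> int:
    """Return the index of the vocabulary word (cluster centre) closest to `feature`."""
    return min(range(len(vocabulary)),
               key=lambda i: math.dist(feature, vocabulary[i]))


def bow_histogram(features: Sequence[Vector],
                  vocabulary: Sequence[Vector]) -> List[int]:
    """Count how many feature vectors of a keyframe fall on each vocabulary word."""
    counts = [0] * len(vocabulary)
    for feature in features:
        counts[nearest_word(feature, vocabulary)] += 1
    return counts


def model_vector(keyframe: object,
                 detectors: Sequence[Callable[[object], float]]) -> List[float]:
    """Concatenate, in detector order, the DoC each concept detector returns."""
    docs = [detector(keyframe) for detector in detectors]
    for doc in docs:
        if not 0.0 <= doc <= 1.0:
            raise ValueError("a DoC must lie in the range [0, 1]")
    return docs


# Hypothetical 2-D vocabulary with three words (assumed to come from clustering).
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
# Hypothetical local feature vectors extracted from one keyframe.
features = [(0.1, 0.2), (0.9, 1.1), (1.2, 0.8), (4.8, 5.1)]
histogram = bow_histogram(features, vocabulary)

# Stub detectors: each one simply looks up a precomputed DoC for its concept.
keyframe = {"person": 0.9, "cake": 0.7, "car": 0.05}
concepts = ("person", "cake", "car")
detectors = [lambda kf, c=c: kf[c] for c in concepts]
vector = model_vector(keyframe, detectors)

print(histogram)  # [1, 2, 1]
print(vector)     # [0.9, 0.7, 0.05]
```

In this sketch the order of the detectors fixes the meaning of each model vector component, so the same detector order would have to be used for every keyframe being compared.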