Towards Fusion of Textual and Visual Modalities for Describing Audiovisual Documents

Manel Fourati (Laboratory MIR@CL, University of Sfax, Sfax, Tunisia), Anis Jedidi (Laboratory MIR@CL, University of Sfax, Sfax, Tunisia), Hanen Ben Hassin (Laboratory MIR@CL, University of Sfax, Sfax, Tunisia) and Faiez Gargouri (Laboratory MIR@CL, University of Sfax, Sfax, Tunisia)
DOI: 10.4018/IJMDEM.2015040104

Audiovisual documents offer a wide range of content descriptions through descriptors drawn from different media types. The extraction of these descriptions has received increasing attention, but a lack of semantic description persists, and this lack affects the retrieval process. To address this problem, this paper describes an automatic, semantic description of cinematic audiovisual documents. This description is based not only on the audiovisual stream in the post-production phase but also on the documentation in the pre-production phase, using textual and visual modalities. In this context, extracting the text superimposed on the image is essential for extracting content descriptions. This process is based mainly on a neural network classifier. Moreover, an effective OCR engine (Tesseract) is adapted for text recognition. Experimental results confirmed promising performance on two databases, namely “ICDAR 2011” and our own database created from the Internet Movie Database (IMDb).

1. Introduction

Due to the proliferation of TV channels and technological advances in computer science, a huge number of audiovisual documents are being transferred to the Web. This development affects the various platforms used for exchange and backup. Users still need to search, retrieve, exchange and analyze the content conveyed within audiovisual resources. In this context, several studies have shown that organization techniques such as indexing are essential for audiovisual documents (Essid & Fevotte, 2013). Indexing is the process of linking audiovisual content with its description, which is why the description of audiovisual documents is important and represents a major challenge. In the literature, several works address the extraction of content descriptions from multimedia documents (images and videos). In (Lozano Espinosa, 2000), the authors underline the importance of low-level analysis for extracting descriptions such as the key frames of shots, by calculating the average of all the frames of the video sequence. They treat the audiovisual stream as visual data, using image and signal processing tools. The text in multimedia documents is also an important source of description, because it is actually present in the documents' content. The text in an image or video can be used to extract several descriptions. For this reason, we are interested in extracting the text superimposed on the image. Some works have proposed methods for extracting annotations and descriptions of audiovisual documents based on the documentary process; the most significant are (Thi, 2003) and (Troncy, 2003). They extract strictly documentary metadata, expressed in the MPEG-7 language. The documentary elements are then instantiated as descriptors classified under the audiovisual descriptors. They follow a backward description process to build the necessary steps for the description.
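
The frame-averaging idea attributed above to (Lozano Espinosa, 2000) can be illustrated with a minimal sketch. It assumes that each frame of a shot is a flat list of grayscale pixel values, and that the key frame is the frame closest to the pixel-wise average; this selection rule, like the function names, is an assumption made for illustration and is not taken from the cited work.

```python
def mean_frame(frames):
    """Pixel-wise average of equally sized grayscale frames (flat lists)."""
    count = len(frames)
    return [sum(pixels) / count for pixels in zip(*frames)]


def squared_distance(frame_a, frame_b):
    """Sum of squared pixel differences between two frames."""
    return sum((a - b) ** 2 for a, b in zip(frame_a, frame_b))


def key_frame_index(frames):
    """Index of the frame closest to the average frame (assumed rule)."""
    average = mean_frame(frames)
    return min(
        range(len(frames)),
        key=lambda i: squared_distance(frames[i], average),
    )


# Toy shot: four 2x2 frames flattened to four pixels each.
shot = [
    [0, 0, 0, 0],
    [10, 10, 10, 10],
    [20, 20, 20, 20],
    [90, 90, 90, 90],
]
print(key_frame_index(shot))
```

In practice the frames would come from a video decoder and the distance would likely be computed on colour histograms or other low-level features rather than raw pixels.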

In this context, several tools for describing audiovisual documents have been implemented, such as ANVIL and VideoAnnEx, which generate annotations following documentary patterns (XML or MPEG-7). Although these tools are interesting, users' needs remain unsatisfied because of the lack of a standard representation and the lack of semantic description. Indeed, the descriptors proposed by existing standards are far from sufficient to extract and structure the semantics conveyed in an audiovisual document. To overcome this problem, semantic search engines extract data from the video content and represent it as key elements; one of these is Voxalead (Law-To, Grefenstette, & Gauvain, 2009), which extracts the semantics of the speech content and generates an XML file containing the words identified in the audio signal. The work of (Aubert, Prié, & Schmitt, 2012) was devoted to the description of audiovisual documents and the development of the Advene platform, which allows semantic annotations to be created semi-automatically through hypervideos generated in a non-normalized XML format.

In general, interest in the potential of cinematic documents has increased. In his work, P. Stockinger (Stockinger, 2003) focuses on the importance of describing and structuring film content. He defines different semiotic descriptions, namely textual, pragmatic, keyword, etc. To satisfy the need to exploit the Audiovisual Archives (AAR), (Stockinger, 2011) mentions the use of a tool called ‘Interview’, which supports description and indexing in order to enrich an audiovisual document with annotations exported in a non-normalized XML format.

In the web environment, the W3C standards provide a set of Semantic Web technologies. The Semantic Web environment provides a wide range of description languages that can be useful for describing audiovisual resources along different dimensions: filmic, semantic, thematic, temporal, and environmental. In his work, Isaac (Isaac, 2005) uses Semantic Web languages and tools to describe audiovisual documents. He uses a combination of ontologies represented in OWL-DL (Web Ontology Language, Description Logics) and inference rules, allowing a structured description and a more complete search of audiovisual sequences. The description is based on a predefined theme: medicine. The annotations are expressed in RDF (Resource Description Framework) so that all resources can be distributed and reused in other applications. He follows a top-down approach to the description.
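
To make the idea of RDF annotations more concrete, the following minimal sketch serializes a description of one audiovisual sequence as N-Triples. The `http://example.org/av#` namespace, the property names (`hasTheme`, `startSeconds`, `endSeconds`), the `Sequence` class, and the `Cardiology` concept are all hypothetical and chosen only for illustration; they are not taken from (Isaac, 2005) or from any published ontology.

```python
# Hypothetical namespace, used only for illustration.
EX = "http://example.org/av#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def decimal_literal(value):
    """Format a number as a typed xsd:decimal literal."""
    return f'"{value:.1f}"^^<{XSD_DECIMAL}>'


def describe_sequence(seq_id, theme, start_s, end_s):
    """Return N-Triples lines describing one audiovisual sequence."""
    subject = iri(EX + seq_id)
    triples = [
        (subject, iri(RDF_TYPE), iri(EX + "Sequence")),
        (subject, iri(EX + "hasTheme"), iri(EX + theme)),
        (subject, iri(EX + "startSeconds"), decimal_literal(start_s)),
        (subject, iri(EX + "endSeconds"), decimal_literal(end_s)),
    ]
    return [f"{s} {p} {o} ." for s, p, o in triples]


for line in describe_sequence("seq1", "Cardiology", 12.0, 47.5):
    print(line)
```

Because each annotation is a standalone triple identified by IRIs, other applications can merge and reuse such descriptions, which is the property the paragraph above attributes to RDF.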
