1. Introduction
Today, an enormous amount of visual information [Goldman, 2014; Hopkins, 2014; West, 2014; Ranker, 2014] is generated every second and stored in digital formats such as images and videos. The boom in technical advancements in mobile phone cameras [Chen, 2014; Li, 2014; Ojala, 2014], televisions [Cesar, 2014; Trundle, 2014; Blanchfield, 2014], and the Internet [Hürst, 2015; Ahsan, 2014; Sankaran, 2014] adds to this repository of multimedia information. According to statistics, 4 billion videos are viewed per day on YouTube, one of the most popular video portals [Yu, 2015; Cheng, 2008], and about 60 million videos are uploaded every minute. Such an enormous volume of data creates an urgent need for efficient video indexing and retrieval algorithms [Saravanan, 2015; Souza, 2014; Chen, 2014]. The relevance of video content is better represented by effectively extracting descriptive features. Perceptual features such as color, intensity, object shape, and texture are widely used for image classification and video indexing [Mehmood, 2015; Gupta, 2015]; however, they cannot provide an exact description of the video content. Text, images, speech, music, and so on are the typical contents of a video. Of these, text carries the most significant semantic information and thus contributes greatly to video content understanding and analysis. With relatively reliable OCR techniques, video text analysis [Nguyen, 2014; Lu, 2014] is a promising and widely researched approach to video content processing: video text provides high-level semantic information about the content along with distinctive visual characteristics. However, the detection and recognition of video text remain challenging due to variations in size, color, font, and alignment.
Being highly compact and structured, video text provides valuable indexing information such as scene locations, speaker names, program introductions, sports scores, special announcements, dates, and times. There are two types of text in video: (1) caption/graphic/artificial text [Lu, 2014; Castillo, 2013; Wang, 2012], which is superimposed on the video at editing time, and (2) scene text [Zhu, 2015; Weinman, 2014], which occurs naturally in the camera's field of view during capture. Text detection and extraction in video must deal with problems peculiar to video, namely low contrast, low resolution, color bleeding, text movement, and blurring.
Several approaches have been proposed in recent years for the automatic extraction of text from digital video. They fall into three major categories: connected component (CC) based [Zhang, 2008; Yi, 2007], edge based [Huang, 2014; Kumar, 2012], and texture based [Shivakumara, 2014; Prakash, 2014; Shekar, 2014]. CC-based methods group small components into successively larger ones until all regions in the image are identified. Kim [Kim, 1996] proposed a method in which the image is segmented using color clustering by a color histogram in the RGB space; non-text components, such as long horizontal lines and isolated text segments, are then filtered out heuristically. Kim [Kim, 2008] proposed a static text region detection algorithm that prevents motion compensation errors in frame rate conversion (FRC). Based on the observation that the color of text is spatio-temporally consistent and the orientation of the text boundary is preserved across consecutive frames, the algorithm can reliably extract static text regions.
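The CC grouping idea above can be illustrated with a minimal 4-connected flood-fill labeling pass over a binarized frame. This is a generic sketch of connected component labeling, not the exact procedure of any method cited here; the data layout (nested lists of 0/1) and choice of 4-connectivity are illustrative assumptions.

```python
from collections import deque

def label_components(binary, h, w):
    """Label 4-connected foreground components in an h-by-w binary image.

    binary: nested lists of 0/1 (1 = candidate text pixel).
    Returns a label map (0 = background) and the number of components found.
    """
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if binary[y][x] == 1 and labels[y][x] == 0:
                # Start a new component and flood-fill its neighbors.
                count += 1
                labels[y][x] = count
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 1
                                and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count
```

In a full pipeline, each labeled component would then be measured (size, aspect ratio, position) so that non-text components can be filtered out and neighboring components merged into text lines.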
Jain [Jain, 1998] suggested applying connected component based techniques such as bit dropping, color clustering, multivalued image decomposition, and foreground image generation. The 24-bit color image is reduced to a 6-bit image and then quantized by a color clustering algorithm. Each clustered region then undergoes a text localization process, and the results are merged into one output image. The algorithm does not work well, however, when the color histogram is sparse.
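One plausible realization of the bit-dropping step is to keep only the two most significant bits of each RGB channel, packing them into a 6-bit color code. This mapping is an assumption for illustration; the exact reduction scheme in [Jain, 1998] may differ.

```python
def bit_drop(r, g, b):
    """Map a 24-bit RGB pixel (8 bits per channel) to a 6-bit code by
    keeping only the top 2 bits of each channel.
    NOTE: an assumed mapping, not necessarily the one used in [Jain, 1998]."""
    return ((r >> 6) << 4) | ((g >> 6) << 2) | (b >> 6)

def quantize_image(pixels):
    """Apply bit dropping to a flat list of (r, g, b) tuples. The result
    has at most 64 distinct colors, which makes the subsequent color
    clustering step far cheaper than clustering in full 24-bit space."""
    return [bit_drop(r, g, b) for (r, g, b) in pixels]
```

The point of the reduction is that a 64-color histogram is dense enough to cluster reliably; this is also why the overall approach degrades when the original color histogram is sparse.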