Key Frame Extraction from Video: Framework and Advances

Sergii Mashtalir (Kharkiv National University of Radio Electronics, Kharkiv, Ukraine) and Olena Mikhnova (Kharkiv National University of Radio Electronics, Kharkiv, Ukraine)
Copyright: © 2014 | Pages: 12
DOI: 10.4018/ijcvip.2014040105

Abstract

This paper provides a comprehensive overview of key frame extraction techniques. Such techniques usually have three phases: shot boundary detection as a pre-processing phase; the main phase of key frame detection, in which visual, structural, audio and textual features are extracted from each frame and then processed and analyzed with artificial intelligence methods; and a post-processing phase that removes any duplicates from the resulting sequence of key frames. Evaluation techniques and available test video collections are also reviewed. Finally, conclusions are drawn about the drawbacks of the examined procedure and the main trends in its development.

Introduction

The importance of key frame extraction has grown considerably as intelligent access to multimedia has gained popularity. Key frame extraction can serve as a basis for enhancing indexing and searching capabilities. It is also a main tool for video summarization, which lets users quickly get acquainted with multi-hour video content. For summarization purposes, any video can be decomposed into a sequence of images, an audio track, and a textual part. Each component matters for processing, but we focus only on the sequence of images. In contrast to video skimming, where the initial material is shortened into a dynamic representative clip, summarization extracts static meaningful frames, which are selected according to chosen features and analyzed by intelligent methods.

Two types of key frames can be selected: least common content (Yang & Wei, 2011) and best representatives (Fayka et al., 2010). The choice usually depends on the type of content to be analyzed. If a video has a variety of scenes and high variance in its feature data, best representatives are the better choice; if the video content is very similar throughout, choosing least common frames gives much better results. Longer shots are also thought to be more important than shorter ones, as they hold users’ attention longer. Frames that appear earlier in the timeline are likewise considered more important than similar frames that appear later (Yang & Wei, 2011).
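The distinction between the two key frame types can be illustrated with a minimal sketch. It assumes that each frame is already described by a numeric feature vector and interprets "best representative" as the frame closest to the mean feature vector and "least common" as the frame farthest from it. This interpretation, the Euclidean distance, and the function names are illustrative assumptions, not the definitions used by the cited authors.

```python
import math


def centroid(features):
    """Mean feature vector of a non-empty list of equal-length vectors."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_representative(features):
    """Index of the frame closest to the centroid (assumed reading)."""
    c = centroid(features)
    return min(range(len(features)), key=lambda i: distance(features[i], c))


def least_common(features):
    """Index of the frame farthest from the centroid (assumed reading)."""
    c = centroid(features)
    return max(range(len(features)), key=lambda i: distance(features[i], c))


# Toy data: three similar frames and one outlier (hypothetical values).
features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0]]
representative_index = best_representative(features)
outlier_index = least_common(features)
```

With this toy data the outlier frame (index 3) is selected as least common, while one of the mutually similar frames is selected as the best representative.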

The first key frame extraction techniques emerged in the 1990s. The simplest methods, which extracted the first frame of each shot, were later modified and improved for better content orientation. Current techniques extract a sequence of meaningful frames from different fragments of a video: key frames are taken not only from the beginning of each shot, but also from the middle and the end. The whole video can be fragmented into scenes, subscenes, and shots for further analysis and key frame extraction. Shots are the most popular units from which key frames are selected. Shot or scene boundary detection is sometimes referred to as temporal video segmentation. Frames, the atoms of a video sequence, are also obtained by temporal segmentation. In contrast to temporal segmentation, spatial segmentation of each frame involves detecting objects or regions of interest. This information is also very important for content-based video summarization.
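The purely positional strategy mentioned above (first, middle and last frame of each shot) can be sketched as follows. The sketch assumes that shot boundaries are already known as a sorted list of starting frame indices; the function name and the input format are hypothetical.

```python
def positional_key_frames(shot_starts, total_frames):
    """Return the first, middle and last frame index of every shot.

    shot_starts: sorted frame indices at which each shot begins (assumed given).
    total_frames: number of frames in the whole video.
    """
    key_frames = []
    boundaries = list(shot_starts) + [total_frames]
    for start, end in zip(boundaries, boundaries[1:]):
        last = end - 1
        middle = (start + last) // 2
        # A set avoids repeated indices when a shot has only one or two frames.
        key_frames.append(sorted({start, middle, last}))
    return key_frames


# Three shots: frames 0-9, frame 10 alone, and frames 11-19.
shots = positional_key_frames([0, 10, 11], 20)
```

Such a selection ignores frame content entirely, which is why it serves only as a baseline for the content-oriented methods.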

As a rule, the pre-processing phase of a key frame extraction procedure consists of shot boundary detection (SBD) or scene change identification. Such temporal segmentation of a video sequence is usually performed by analyzing changes in color or texture, object edges, motion, or other selected features (Liang et al., 2012; Smeaton et al., 2010). After the video has been divided into segments, key frames in each segment are chosen using artificial intelligence methods applied to image and video information represented by audio, textual, visual and structural features. The last phase, typical of almost all key frame extraction methods, consists of removing duplicates among the extracted frames.
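The three phases can be sketched end to end under strong simplifying assumptions: each frame is represented by a normalized color histogram, a shot boundary is declared when the L1 difference between consecutive histograms exceeds a threshold, the main phase is reduced to picking the middle frame of each shot, and duplicates are key frames whose histograms are closer than a second threshold. The thresholds, function names and the placeholder main phase are all hypothetical and do not reproduce any specific method from the cited works.

```python
def hist_diff(h1, h2):
    """L1 distance between two normalized histograms of equal length."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_shot_starts(histograms, threshold=0.5):
    """Phase 1: start a new shot wherever consecutive frames differ strongly."""
    starts = [0]
    for i in range(1, len(histograms)):
        if hist_diff(histograms[i - 1], histograms[i]) > threshold:
            starts.append(i)
    return starts


def select_key_frames(histograms, starts):
    """Phase 2 (placeholder): take the middle frame of each shot."""
    bounds = starts + [len(histograms)]
    return [(s + e - 1) // 2 for s, e in zip(bounds, bounds[1:])]


def remove_duplicates(histograms, key_frames, threshold=0.5):
    """Phase 3: keep a key frame only if it differs from all kept ones."""
    kept = []
    for k in key_frames:
        if all(hist_diff(histograms[k], histograms[j]) > threshold for j in kept):
            kept.append(k)
    return kept


def extract_key_frames(histograms, cut_threshold=0.5, dup_threshold=0.5):
    """Run the three phases in sequence on a list of frame histograms."""
    if not histograms:
        return []
    starts = detect_shot_starts(histograms, cut_threshold)
    candidates = select_key_frames(histograms, starts)
    return remove_duplicates(histograms, candidates, dup_threshold)


# Toy video: scene A, then scene B, then scene A again (hypothetical data).
A = [1.0, 0.0, 0.0]
B = [0.0, 1.0, 0.0]
video = [A, A, A, B, B, A, A]
result = extract_key_frames(video)
```

In this toy video the third shot repeats the first scene, so its candidate key frame is dropped in the post-processing phase and only frames 1 and 3 remain.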

Despite researchers’ efforts, many duplicates are sometimes observed in the results. This does not necessarily mean that the last, post-processing phase is done badly; the fault may lie with the whole extraction procedure. Some researchers (Fayka et al., 2010) argue that a well-developed last phase can fine-tune the results of the first one. Conversely, if the first phase is well developed, the last one may be omitted (Liang et al., 2012). Authors who suggest such methods justify them by savings in processing time, but usually not by quality. The more phases are incorporated for checking, verifying and sorting the extracted key frames, the better the resulting quality often is.

This paper reviews a number of novel approaches from around the world. Approaches proposed before 2009 are briefly reviewed by Mikhnova (2012). The authors aim to provide a comprehensive overview of developments in the field from 2009 to the present. The next section examines the main phase of key frame extraction from different points of view. The section after it discusses the pre-processing and post-processing phases of the procedure. Finally, evaluation techniques applied to extracted frames are listed, and open video test collections are briefly described. The paper concludes with a short summary, conclusions and a bibliography.
