A Framework Model for Integrating Social Media, the Web, and Proprietary Services Into YouTube Video Classification Process

Mohamad Hammam Alsafrjalani (University of Miami, Miami, USA)
DOI: 10.4018/IJMDEM.2019040102


Online video streaming has gained ubiquity in disparate educational, governmental, and corporate environments. This ubiquity elicits new challenges for the video classification used to promote related videos and block unwanted content: incorporating contextual information, rapidly developing ad-hoc query modules, and keeping pace with contemporary contextual information. In this article, the authors present a model for incorporating contextual information into the video classification process. To illustrate the model, the authors propose a framework comprising video classification, web search engine, social media platform, and third-party classification modules. The modules make the framework flexible and adaptive to different contextual environments (educational, governmental, and corporate). Additionally, the model emphasizes a standardized module interface to enable the framework's extensibility and the rapid development of future modules.
1. Introduction

The importance of video classification has increased in the last decade due to the ubiquity and exponential growth of crowd-generated videos: more than 80% of internet traffic is projected to be video streaming (Cisco, 2017). Demand for classifying videos is driven by various corporate, government, and educational policies that require the identification of harmful videos (e.g., phishing and spam), hate- and crime-promoting content, pornography, and/or cyberbullying, all of which can spread through online video streaming platforms. Conversely, the classification information can be used to promote related content, generate targeted advertising, and retain more users (Duverger & Steffes, 2012; Xu, Zhang, et al., 2008).

To classify YouTube videos, machine learning algorithms, e.g., convolutional neural networks (CNNs) (Karpathy et al., 2014), are used to extract key features from the videos' frames (e.g., Roach, Mason, & Pawlewski, 2001), text (e.g., Brezeale & Cook, 2006), audio (e.g., Z. Liu, Wang, & Chen, 1998), or a combination of these data (e.g., Qi, Gu, Jiang, Chen, & Zhang, 2000). This information is used to train models offline on a set of pre-downloaded videos; once trained, the models are used to predict the class of new, unknown videos. However, these approaches suffer from lengthy training times, high computational complexity, and the need for a priori access to training videos.

To reduce time and computational complexity, other research has targeted key frames and segments (Lu, Drew, & Au, 2001) or the video's text (Huang, Fu, & Chen, 2010), such as the title, description, and/or comments. However, these approaches can only predict the classes of new videos from features that were available at training time; they cannot account for new, unprecedented classes (e.g., a new online challenge). Furthermore, given the exponential growth in the number of video creators on social media (over 1 million YouTube channels (YouTube, 2017) and 8 billion Facebook videos watched per day (Tang et al., 2017)), the number of new, subjective video classes will grow beyond what a single model can determine. The subjectivity of the information technology (IT) policies deployed at different entities (government agencies, corporate offices, and educational institutions) further exacerbates the relativity of the classification information. Because these policies reflect each entity's demands, they must adapt to the contextual environment in which the video is being streamed.

Generic video classification information (sports, news, entertainment, …) must be accompanied by contextual information. Although the generic classification provides valid information, awareness of the contextual environment impacts the IT decision about these videos. For instance, a video classified as a commercial advertisement by a pre-trained classification model could also be classified as an online challenge based on recent events on the web and social media. Because those recent events, and possibly the category itself, were unavailable when the model was trained, the trained model would fail to predict the online challenge as a category of the video. If the recent events, i.e., the contextual information, are obtained as part of a classification framework, the IT policies can yield more accurate decisions.
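The standardized module interface described above can be illustrated with a minimal sketch. All class names and labels here (ClassificationModule, PretrainedClassifier, SocialMediaContextModule, the "online challenge" rule) are hypothetical illustrations, not the authors' implementation: each module exposes the same classify method, and the framework unions the labels from a pre-trained model with those from contextual modules, so a new contextual source can be plugged in without retraining.

```python
from abc import ABC, abstractmethod

class ClassificationModule(ABC):
    """Hypothetical standardized interface that every module implements."""

    @abstractmethod
    def classify(self, video_metadata: dict) -> set:
        """Return the set of class labels this module assigns to the video."""

class PretrainedClassifier(ClassificationModule):
    """Stands in for an offline-trained model (e.g., a CNN over key frames)."""

    def classify(self, video_metadata):
        # A real module would run model inference; here we return a fixed label.
        return {"commercial advertisement"}

class SocialMediaContextModule(ClassificationModule):
    """Stands in for a module that queries social media for recent context."""

    def classify(self, video_metadata):
        # A real module would query platform or web search APIs for trends;
        # this toy rule mimics detecting a trending online challenge.
        if "challenge" in video_metadata.get("title", "").lower():
            return {"online challenge"}
        return set()

def classify_video(video_metadata, modules):
    """Union the labels from all modules; IT policy acts on the combined set."""
    labels = set()
    for module in modules:
        labels |= module.classify(video_metadata)
    return labels

video = {"title": "Ice Bucket Challenge compilation"}
print(classify_video(video, [PretrainedClassifier(), SocialMediaContextModule()]))
```

Because every module conforms to the same interface, adding a web search or third-party module is a matter of implementing one method, which is what enables the rapid development of future modules that the model emphasizes.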
