1. Introduction
Concurrent with the growing use of online social networks is a significant increase in the number of videos uploaded to the Internet. For example, the number of video hours watched daily on YouTube has reached one billion, with more than 70% of watching time spent on mobile devices (YouTube, 2019). Many videos are uploaded to share experiences, knowledge, and entertainment. When a video is posted, users who have viewed similar content in the past may receive a notification of the new upload. However, if a video was uploaded some time ago, users must search with retrieval engines to locate it. Even plain video retrieval that seeks to locate an object, event, or action is a challenging task due to the complexity of query building, the utility gap (Hanjalic, 2013), and subjectivity. Content-based video retrieval requires efficient index structures that support both spatial and temporal queries. Analyzing videos, extracting features, indexing content, and classifying video data represent an increasingly important and active research area (Ashgar, Hussain, & Manton, 2014). The problem remains, however, that the gap between what a user seeks when initiating a query and what the retrieval system is capable of returning currently hinders the broader use of these retrieval tools.
Face search/retrieval is one type of video retrieval that has many applications: locating a video in which a single person appears, two people appear together (as in a sporting game or event), a person appears after another person, or some other temporal constraint holds. For surveillance, it may be important to determine whether two people are exchanging an item. For crime-scene investigations using surveillance footage, it may be possible to track people before and after an event. Querying faces in videos requires face detection, face recognition, and then the retrieval of video clips based on the user query. In particular, modern methods that utilize deep learning for face recognition (Ashgar et al., 2014; Schroff, Kalenichenko, & Philbin, 2015; Taigman, Yang, Ranzato, & Wolf, 2014) are proving to be nearly as accurate as human perception. Nonetheless, indexing video content, even just for face searches, is quite challenging due to the large volume of data. As indicated earlier, the number of online videos and the number of people who appear in these videos continue to grow significantly.
In this paper, we propose a new technique, FaceTimeMap, for indexing videos for face searches using a bitmap index (Shrestha, Chung, & Aygun, 2019). The bitmap index has recently been used for column-based retrieval in big-data systems (Chen et al., 2015). Since the bitmap index was originally designed to select the set of records that satisfy a value in the domain of an attribute, there is no clear strategy for applying it to temporal querying. We utilize a multi-level bitmap index built from two types of matrices. In the first bitmap matrix, a bit is set if a person appears in a video. The second level of the bitmap index is built for each video, where a video is represented as a sequence of intervals. In the second-level matrix, a bit is set for a person if he or she appears in that interval. Whenever a query based on the appearance or temporal ordering of faces is submitted to the system, our retrieval engine first finds the relevant videos using the first level of the index, and then checks intervals only for those relevant videos based on the user query. Three types of queries are considered in this paper: (a) co-appearance: intervals where multiple people (or faces) appear simultaneously in a scene of a video; (b) next-appearance: intervals where a person appears right after another person disappears; and (c) eventual (or prior)-appearance: videos where a person appears sometime after (or before) another person disappears.
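To make the two-level scheme concrete, the following is a minimal sketch in Python, not the paper's implementation. It assumes a fixed number of intervals per video and uses Python integers as bit vectors; all class and method names (`TwoLevelBitmapIndex`, `mark`, `co_appearance`, `next_appearance`) are illustrative placeholders. Co-appearance reduces to a bitwise AND of two interval bitmaps, and next-appearance to a shift-and-mask, after the first level has narrowed the search to videos containing both people.

```python
class TwoLevelBitmapIndex:
    """Illustrative two-level bitmap index for face appearance queries."""

    def __init__(self, num_intervals):
        self.num_intervals = num_intervals
        self.video_level = {}      # person -> set of video ids (first-level index)
        self.interval_level = {}   # (video, person) -> int used as interval bitmap

    def mark(self, person, video, interval):
        """Set the bit for `person` appearing in `interval` of `video`."""
        self.video_level.setdefault(person, set()).add(video)
        key = (video, person)
        self.interval_level[key] = self.interval_level.get(key, 0) | (1 << interval)

    def co_appearance(self, p1, p2):
        """Intervals where p1 and p2 appear simultaneously (bitwise AND)."""
        results = {}
        # First level: restrict to videos that contain both people.
        for video in self.video_level.get(p1, set()) & self.video_level.get(p2, set()):
            both = self.interval_level[(video, p1)] & self.interval_level[(video, p2)]
            if both:
                results[video] = [i for i in range(self.num_intervals) if both >> i & 1]
        return results

    def next_appearance(self, p1, p2):
        """Intervals i+1 where p1 is present at i, absent at i+1, and p2 present at i+1."""
        results = {}
        mask = (1 << self.num_intervals) - 1
        for video in self.video_level.get(p1, set()) & self.video_level.get(p2, set()):
            b1 = self.interval_level[(video, p1)]
            b2 = self.interval_level[(video, p2)]
            hits = (b1 << 1) & ~b1 & b2 & mask   # shift p1 forward one interval
            if hits:
                results[video] = [i for i in range(self.num_intervals) if hits >> i & 1]
        return results
```

For example, if "alice" appears in intervals 2–3 of video "v1" and "bob" appears in intervals 3–4, a co-appearance query returns interval 3 of "v1", while a next-appearance query for "bob" after "alice" returns interval 4. Eventual/prior-appearance queries can be answered similarly by comparing the positions of the lowest and highest set bits of the two bitmaps.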