This work presents a method for semi-automatic ground truth annotation for benchmarking face detection in video. We illustrate a solution that allows an image processing and pattern recognition expert to label and annotate facial patterns in video sequences at a rate of 7500 frames per hour. We extend these ideas to a semi-automatic face annotation methodology in which all object patterns are categorized into 4 classes in order to make the analysis of evaluation results more flexible. We present a strict guide on how to speed up the manual annotation process by a factor of 30 and illustrate it with sample test video sequences that consist of more than 100000 frames, 950 individuals and 75000 facial images. Experimental evaluation of face detection using the semi-automatically labeled ground truth data demonstrates the effectiveness of the present approach for both the learning and test stages.