Introduction
Object detection aims to recognize and localize each object instance with a bounding box. As a classical problem in computer vision, it is widely used in autonomous vehicles (Kim et al., 2017) and assistive robots (Martinez-Martin & Del Pobil, 2017). Traditional object detection methods are generally based on the scale-invariant feature transform (SIFT) (Lowe, 2004) and the histogram of oriented gradients (HOG) (Dalal & Triggs, 2005). These methods extract hand-crafted object features and sweep a window across the image to find the regions with the maximum class-specific response. However, they perform well only on a constrained set of object categories and are sensitive to noise, which limits their range of application.
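The sliding-window scheme described above can be sketched in NumPy. The `hog_like_features` helper below is a simplified, hypothetical stand-in for a real HOG descriptor (a single gradient-orientation histogram with no cell/block structure); it serves only to illustrate sweeping a window over the image and keeping the location with the strongest template response:

```python
import numpy as np

def hog_like_features(patch, n_bins=9):
    """Coarse gradient-orientation histogram (simplified stand-in for HOG)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi                 # unsigned orientation in [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def sliding_window_detect(image, template_feat, win=16, stride=4):
    """Sweep a window over the image; return the box (x, y, w, h) whose
    normalized features have the highest dot product with the template."""
    best_score, best_box = -np.inf, None
    h, w = image.shape
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            feat = hog_like_features(image[y:y + win, x:x + win])
            score = float(feat @ template_feat)
            if score > best_score:
                best_score, best_box = score, (x, y, win, win)
    return best_box, best_score
```

Exhaustively scoring every window position is exactly what makes these classical detectors slow, and a descriptor this coarse is easily confused by noise, which motivates the learned approaches discussed next.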
Recently, deep learning (DL) (Schmidhuber, 2015) has been widely applied to object detection. Deep-learning-based object detection methods can be grouped into region-free methods (Redmon et al., 2016; Liu et al., 2016) and region-based methods (Girshick et al., 2014; Ren et al., 2017). Region-free methods frame object detection as a regression problem, mapping an image directly to spatially separated bounding boxes and associated class probabilities. These methods improve detection speed but still lag behind in accuracy. Region-based methods first select candidate bounding boxes from region proposals, and then a region-wise subnetwork classifies and refines these candidates.
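The regression view taken by region-free detectors can be illustrated with a toy decoder. The grid layout assumed below (one box plus C class scores per cell, box center as an offset within its cell) is a simplified sketch loosely modeled on YOLO, not the exact output format of any published detector:

```python
import numpy as np

def decode_grid_predictions(pred, conf_thresh=0.5):
    """Decode a toy S x S x (5 + C) prediction grid into boxes.

    Each cell predicts (x_off, y_off) of a box center within the cell,
    (w, h) relative to the whole image, an objectness confidence, and
    C class scores. Cells below the confidence threshold are discarded.
    """
    S = pred.shape[0]
    boxes = []
    for i in range(S):          # grid row
        for j in range(S):      # grid column
            x_off, y_off, w, h, conf = pred[i, j, :5]
            if conf < conf_thresh:
                continue
            cx = (j + x_off) / S            # box center in [0, 1] image coords
            cy = (i + y_off) / S
            cls = int(np.argmax(pred[i, j, 5:]))
            boxes.append((float(cx), float(cy), float(w), float(h),
                          float(conf), cls))
    return boxes
```

Because the whole grid is produced by a single forward pass, decoding is cheap, which is why region-free detectors are fast; the accuracy gap comes from the network, not from this post-processing step.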
Regions with CNN features (R-CNN) (Girshick et al., 2014) pioneered the introduction of deep learning into object detection. R-CNN uses selective search to generate many region proposals, and convolutional neural networks (CNNs) (Krizhevsky et al., 2012) classify the objects in these proposals. It made significant improvements in detecting more general object categories. However, selective search takes a long time to compute proposals, and feature computation in R-CNN is expensive, as it repeatedly applies deep convolutional networks to thousands of warped region proposals per image (He et al., 2014). Hence, its detection speed is slow and its efficiency is low. Fast R-CNN (Girshick, 2015) improves the training and testing speed and the detection accuracy of R-CNN by enabling end-to-end detector training on shared convolutional features. However, some issues remain: (1) although Fast R-CNN reduces the running time of the detection network, region proposal computation becomes the bottleneck; (2) the feature map output by the last layer of VGG16 (Chen, Krishna, Emer, & Sze, 2017) is too coarse for classifying small object instances; (3) neighboring regions may overlap each other significantly.
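The shared-feature idea behind Fast R-CNN can be illustrated by RoI max pooling: each proposal is cropped from a single feature map computed once per image and max-pooled to a fixed output size, so proposals of any shape yield same-length features. This is a minimal NumPy sketch with bin rounding simplified relative to the original layer; the coordinates are assumed to already be in feature-map units:

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """RoI max pooling in the spirit of Fast R-CNN.

    Crops the region (x0, y0, x1, y1) from a shared 2-D feature map and
    max-pools it into a fixed output_size grid, giving every proposal a
    fixed-length feature regardless of its original width and height.
    """
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    out_h, out_w = output_size
    h, w = region.shape
    ys = np.linspace(0, h, out_h + 1).astype(int)    # bin boundaries (rows)
    xs = np.linspace(0, w, out_w + 1).astype(int)    # bin boundaries (cols)
    out = np.empty(output_size)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out
```

Note that a small object may occupy only one or two cells of the downsampled feature map before pooling, which is precisely issue (2) above: the last VGG16 feature map is too coarse to describe such instances well.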