Introduction
Semantic segmentation is a challenging computer vision problem in which a class label is assigned to each pixel of an image. This provides much richer scene information than traditional image classification, but it is also inherently a more difficult task: while image classification requires detecting only the primary object, semantic segmentation requires precise, localized pixel-wise predictions. Annotating training images for supervised semantic segmentation is correspondingly more expensive and time-consuming, and as a result most public segmentation datasets are small compared to those for image classification (e.g., ImageNet) (Deng, Todorovic, & Latecki, 2015).
Current semantic segmentation methods show promising results on multiple datasets, yet consistent errors in their output are easy to find. For example, RefineNet (Lin et al., 2017) misclassifies ground as sidewalk so often on the PASCAL-Context dataset (Mottaghi et al., 2014) that a pixel classified as sidewalk has a 60% probability of having the true label ground. In this case, pixel accuracy would actually improve if every sidewalk pixel were simply relabeled as ground. Rather than lose all ability to classify sidewalks, however, we would prefer to exploit our knowledge of this confusion to reason about the appropriate labels to assign and the confidence we should place in them.
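Statistics of this kind can be read directly off a classifier's confusion matrix over a dataset. The following sketch uses made-up pixel counts (not the paper's actual data) chosen so that the sidewalk/ground conditional happens to come out to 0.60, echoing the figure above:

```python
import numpy as np

# Hypothetical pixel-count confusion matrix over three labels.
# Rows = true label, columns = predicted label. Counts are invented
# for illustration only.
labels = ["ground", "sidewalk", "sofa"]
counts = np.array([
    [800, 120,  5],   # true ground
    [ 30,  80,  2],   # true sidewalk
    [  4,   0, 90],   # true sofa
], dtype=float)

# P(true = t | predicted = p): normalize each column of the matrix.
p_true_given_pred = counts / counts.sum(axis=0, keepdims=True)

# Probability that a pixel predicted "sidewalk" is truly "ground".
p = p_true_given_pred[labels.index("ground"), labels.index("sidewalk")]
print(f"P(truth=ground | pred=sidewalk) = {p:.2f}")  # -> 0.60
```

Column-normalizing (rather than row-normalizing) is what yields the "given the prediction, how is the truth distributed" view used in the motivating example.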
We categorize labeling errors into two types. An “in-context” labeling error occurs when a pixel is assigned an incorrect label drawn from the set of labels actually present in the image (e.g., a foreground object label is assigned to the wrong location). An “out-of-context” labeling error is an invalid label assignment from outside the image’s actual label set (e.g., an indoor object label is assigned to a pixel in an outdoor scene). Examples of both error types are shown in Figure 1.
Given that a classifier will have similar performance characteristics on related images, an analysis of the classification errors and label confusions across a dataset should support a secondary (post-processing) refinement stage to re-estimate the output label likelihoods. These more informed classifications should greatly reduce the incidence of in-context errors and make out-of-context errors less likely. The proposed approach is based on these motivations.
Our method is derived from a direct marginalization of p(l|d), the probability of label l given input data d, which decomposes the formulation into classifier output label probabilities and learned classifier/truth confusions. The framework treats the confidence levels of the original classifier output as partial evidence and incorporates the confusion information to determine the final probability of observing each object label at each location in an image. In this sense, the confusion approach can be viewed as a form of context-aware re-estimation. We present a robust method for computing the confusion probabilities and outline several label priors used to bias the refinement. Upper-bound performance of the framework with these priors is reported on three challenging datasets to justify the approach.
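The excerpt does not spell out the decomposition, but one natural reading of "marginalization of p(l|d) into classifier outputs and confusions" is p(l|d) = Σ_c p(l|c) p(c|d), where p(c|d) is the classifier's per-pixel softmax and p(l|c) is the learned probability of true label l given classifier output c. A minimal numpy sketch under that assumption (all numbers hypothetical):

```python
import numpy as np

# Hypothetical per-pixel classifier output p(c|d): a softmax over
# three labels, e.g. [ground, sidewalk, sofa].
p_c_given_d = np.array([0.1, 0.7, 0.2])

# Assumed learned confusion p(l|c): rows = classifier label c,
# columns = true label l; each row sums to 1.
p_l_given_c = np.array([
    [0.95, 0.04, 0.01],
    [0.60, 0.39, 0.01],   # "sidewalk" outputs are often truly "ground"
    [0.05, 0.00, 0.95],
])

# Marginalize over the classifier label:
#   p(l|d) = sum_c p(l|c) * p(c|d)
p_l_given_d = p_c_given_d @ p_l_given_c
print(p_l_given_d)  # refined per-pixel label distribution
```

With these illustrative numbers the refined distribution shifts probability mass from sidewalk to ground, which is exactly the kind of confusion-aware re-estimation the motivating example calls for; the actual method additionally incorporates label priors to bias this refinement.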
The rest of this paper is organized as follows. Section II describes related work in semantic segmentation. Section III describes our framework for classifier refinement using label confusion probabilities and priors. Section IV presents experimental results showing the performance improvements of the approach on multiple datasets, followed by a conclusion in Section V.
Figure 1. Example contextual errors of RefineNet on PASCAL-Context. In-Context: The sofa back in the left image is classified as floor, which is incorrect even though the label correctly appears elsewhere in the image. Out-of-context: A portion of the right image is classified as motorbike, which is not a valid label anywhere in the image.
Related Work
Convolutional Neural Networks (CNNs) such as those of Lin et al. (2017), Long et al. (2015), Chen et al. (2016), and Lin et al. (2016) have achieved unprecedented results in semantic segmentation. However, they also introduce a tension between precise localization and the inclusion of broader image context (Garcia-Garcia, Orts-Escolano, Oprea, Villena-Martinez, & Garcia-Rodriguez, 2017). Many recent innovations in network architecture have been motivated by this issue.