Introduction
Image classification is a common task in computer vision, and various techniques for image classification have been proposed over the years. With the dramatic increase in the popularity of mobile electronic devices equipped with cameras, such as smartphones, there is a growing number of real-world applications for image classification. Nevertheless, some of these real-world applications aim to classify images captured in an uncontrolled manner and in complex environments, conditions under which existing image classification techniques may not perform well.
Recently, the municipality of the Indian metropolitan city of Hyderabad has established an e-Governance workflow for municipal tasks such as repairing street lights, cleaning city streets, and cleaning local parks. One of the more important tasks is the collection of waste from the streets: sanitation teams are supposed to collect the trash from dumpsters located around the city on a daily basis. Supervisors allocated to different regions of the city use smartphones to capture images of the dumpsters through a mobile application. The captured images (along with associated metadata) are submitted to an online server and can be accessed publicly through a portal. The city uses the processed information to submit reports and to penalize third-party contractors for any service-level agreement infractions that are identified. However, the manual reports provided by the supervisors include incorrect information regarding the cleanliness of the dumpsters. As a result, the municipality is very interested in automated means of analyzing the images captured by the supervisors in order to validate the manual feedback and take corrective action if required. The task is therefore to perform binary classification of the dumpster images captured by mobile phones (see Figure 1) into one of the following classes: 'clean' (if trash is not visible from the bin opening) and 'unclean' (if trash is visible from the bin opening).
Figure 1. An example of (a) 'Clean' dumpster vs. (b) 'Unclean' dumpster. The region that discriminates between the two cleanliness states (the bin opening) is marked in white.
Following the definition of the task, we are interested in learning a classifier with a feature set that can discriminate between clean and unclean bin openings. Namely, the classifier needs to discriminate between two different states of the dumpster, where the object of interest is the dumpster itself. In comparison, conventional image classification techniques are usually suited to discriminating between two different classes of objects or scenes, and generally fail to achieve adequate accuracy when applied to the task at hand, mainly due to the background clutter present in the images and the challenging imaging conditions (as we observed in several experiments with such systems). Therefore, it makes sense to adopt a multi-stage approach in which dumpster detection and localization are performed prior to the classification. One of the significant benefits of dumpster detection is that it removes the noisy background and focuses on the region of interest, thus aiding the classification.
In this work we propose an efficient image classification pipeline that is able to achieve accurate classification of the dumpster images from the complex urban environment, despite evident background clutter and challenging imaging conditions. This is achieved by utilizing a multi-stage approach, where the first stage is the efficient detection of the dumpster, and the second stage is the classification of its cleanliness state.
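The two-stage structure described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the names `crop` and `classify_dumpster`, the `(top, left, height, width)` box format, and the choice to return `None` when no dumpster is detected are all assumptions made for illustration. The detector and the region classifier are passed in as callables so that the control flow stays independent of any particular model.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Image = Sequence[Sequence[float]]   # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]     # hypothetical format: (top, left, height, width)


def crop(image: Image, box: Box) -> List[List[float]]:
    """Return the sub-image covered by the bounding box."""
    top, left, height, width = box
    return [list(row[left:left + width]) for row in image[top:top + height]]


def classify_dumpster(
    image: Image,
    detect: Callable[[Image], Optional[Box]],
    classify_region: Callable[[Image], str],
) -> Optional[str]:
    """Stage 1 localizes the dumpster; stage 2 labels only the cropped region."""
    box = detect(image)
    if box is None:
        # Assumption: without a detected dumpster, no cleanliness label is produced.
        return None
    return classify_region(crop(image, box))
```

Because the second stage only ever sees the cropped region, background clutter outside the detected box cannot influence the cleanliness decision, which is the benefit the multi-stage design aims for.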
The proposed system utilizes various commonly used building blocks and representations that can also be found in current state-of-the-art image classification schemes. It is mainly the unconventional structure of the proposed system that enables it to achieve accurate classification despite the challenging environment. Namely, we first identify the region of interest that captures the dumpster at the detection stage, and then classify it at the classification stage that follows. For efficient detection, we utilize the Bag-of-Words (BoW) (Sivic & Zisserman, 2003) representation based on Scale-Invariant Feature Transform (SIFT) (Lowe, 2004) features. For classification, we extract Histogram of Oriented Gradients (HOG) (Dalal & Triggs, 2005) features and train a non-linear Support Vector Machine (SVM) (Cortes & Vapnik, 1995) classifier.
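To make the two representations named above more concrete, the following sketch shows simplified versions of their core computations. It is illustrative only and makes several assumptions: the function names `bow_histogram` and `orientation_histogram` are hypothetical, the local descriptors and the codebook are taken as given (in practice they would come from SIFT extraction and clustering), and the orientation histogram covers a single cell with hard bin assignment, omitting the bin interpolation and block normalization of the full HOG descriptor. SVM training is not shown.

```python
import math
from typing import List, Sequence


def bow_histogram(descriptors: Sequence[Sequence[float]],
                  codebook: Sequence[Sequence[float]]) -> List[float]:
    """Assign each descriptor to its nearest codeword and return normalized counts."""
    counts = [0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)),
                      key=lambda k: math.dist(d, codebook[k]))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)


def orientation_histogram(cell: Sequence[Sequence[float]],
                          bins: int = 9) -> List[float]:
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    # Central differences are only defined for interior pixels.
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            # Unsigned orientation in [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(bins - 1, int(angle / bin_width))] += magnitude
    return hist
```

In this simplified form, an image region is summarized either by how often each visual word occurs (BoW) or by the distribution of its edge directions (HOG-style); a vector of either kind could then be passed to a classifier such as an SVM.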