Article Preview
Top1. Introduction
Localization is crucial for people’s life and many applications like navigation, robotics, augmented reality, etc. Though the global positioning system (GPS) can solve the problem in the most of situations, there are still some cases that GPS cannot handle well. Many image-based localization methods are proposed to deal with these cases. This paper proposes a novel localization system named Dis-Retrieval to estimate position from a monocular RGB image.
Our proposed system takes a monocular RGB image as input and outputs a position where this image is token. The core of our system is a deep convolutional neural network (CNN) model, which can map an image to a d-dimension space where feature pairs of distance-closer image pairs in real word have smaller Euclidean distances. As Figure 1 illustrates, the structure of our system is similar with content-based image retrieval system (Zhang, Zhao, and Han, 2009). When a query image is coming, our system firstly uses CNN feature extraction model to extract a feature vector from query image, then uses it to match features of labeled images for k nearest images, finally uses k images’ position information to estimate query image’s position. Similar with content-based image retrieval system, it can operate in real time speeded by hash technology (Gionis, Indyk, and Motwani, 2000).
Figure 1. Structure of proposed localization system for monocular image
Before introducing our main contribution, we first simply talk about motivations of this paper. By now, there are mainly two types of methods to solve the image-based localization problem. Methods of first type are traditional methods which estimate position by using traditional image features (like SIFT (Lowe, 2004)) to match images. Methods of this type are easily influenced by noise factors in images such as light, camera’s angle, pedestrians, cars, etc. Methods of second type are CNN-feature-based methods, which use CNN to solve the problem. Methods of this type can easily handle the influence of light, camera’s angle, pedestrians and cars through data-driven learning. But they are less extendable, that is one trained model can just process one scene or one place. For example, we train a model using labeled images of A University: if we want to have a localization model of B University, we have to train another model using labeled images of B University; even when we want to add some labeled images of A University, we also have to fine tune old model of A University. Maybe you can force one model to process two universities, but what if there are 1000 universities? The model’s parameters are limited, so it is hard to extend methods of this type to many places. Our proposed system overcomes these two difficulties through combining traditional content-based image retrieval structure with CNN feature extraction model.