Introduction
Computing an accurate camera pose, i.e., position and orientation with respect to the reconstructed structure of a scene, has a broad range of applications in both outdoor and indoor environments, such as autonomous navigation, object tracking, augmented reality (AR), cognitive path finding, multi-agent coordination, and critical event management (accidents, fires, traffic congestion, etc.). Most current localization systems running on mobile phones rely on radio signals such as Wi-Fi and GPS. However, the availability of these signals depends on the scale of the scene and on site conditions: GPS is only usable outdoors, while the accuracy of Wi-Fi-based localization degrades in complicated indoor scenes. In contrast to signal-based localization systems, image-based techniques can handle more complex and diverse scenarios. In addition, the tremendous progress in Structure-from-Motion (SfM) techniques enables the camera pose to be estimated accurately with respect to a reconstructed 3D scene.
Most recent 3D-model-based approaches focus on determining position in large-scale outdoor environments and can provide camera poses with the desired accuracy. Nevertheless, the effectiveness of these systems is limited by two main problems: (1) occlusions in enclosed environments, and (2) drift and deformation of the reconstructed model. The prerequisite of 3D camera pose estimation is correctly associating the 2D background features detected in an image with the 3D points reconstructed from the background structure. Any object moving in an indoor setting can cause occlusion, which may result in wrong feature associations. Meanwhile, some essential features belonging to the static background might be lost because they are sheltered by various obstructions. Most importantly, even if all the 2D-to-3D correspondences are correct, a drifting 3D model structure would still introduce large localization errors.
To address these two problems, we formulate obstruction removal for 3D localization as a problem of separating the foreground from the background. Furthermore, we propose a new synthetic framework that combines a state-of-the-art localization technique with refined SfM and an occlusion-removal component. The basic structure of our system is similar to recent work on 3D structure-based localization: establish 2D-to-3D correspondences by relating 2D features detected in a query image to 3D points previously created by SfM, and compute the pose of the query camera from the geometric constraints formed by these 2D-to-3D matches.
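As a rough illustration of the pose step, the sketch below recovers a camera pose from already-established 2D-to-3D matches with a linear DLT. This is our own simplified example, not the paper's actual solver: it assumes noiseless correspondences, known intrinsics, and a non-degenerate point configuration, whereas a practical system would run a minimal solver (e.g., P3P) inside RANSAC to reject wrong matches caused by occlusion.

```python
import numpy as np

def estimate_pose_dlt(pts3d, pts2d, K):
    """Hypothetical sketch: recover camera pose (R, t) from n >= 6
    noiseless 2D-to-3D correspondences via the Direct Linear Transform.
    pts3d: (n, 3) world points; pts2d: (n, 2) pixels; K: (3, 3) intrinsics.
    """
    n = len(pts3d)
    # normalize pixel coordinates with the known intrinsics
    x = (np.linalg.inv(K) @ np.hstack([pts2d, np.ones((n, 1))]).T).T
    # build the 2n x 12 DLT system A p = 0 for the 3x4 matrix P = [R|t]
    A = []
    for (X, Y, Z), (u, v, _) in zip(pts3d, x):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    P = Vt[-1].reshape(3, 4)          # null vector, up to scale and sign
    # resolve the sign: points must have positive depth
    if P[2] @ np.append(pts3d[0], 1.0) < 0:
        P = -P
    # extract the nearest rotation and the scale-corrected translation
    U, s, Vt2 = np.linalg.svd(P[:, :3])
    R = U @ Vt2
    t = P[:, 3] / s.mean()
    return R, t
```

In the full pipeline, the correspondences fed to this step come from matching query-image descriptors against the descriptors attached to the SfM point cloud.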
Unlike other state-of-the-art localization techniques, our proposed framework seeks to improve system effectiveness and localization accuracy in crowded indoor environments. This paper rests on two essential assumptions. (1) Recovering important features occluded by deforming obstacles can reduce the sparse "noise" caused by ambient occlusion. (2) Re-triangulating the failed SIFT feature matches in incremental SfM can reduce the loss of correct matches that would otherwise lead to drift and deformation of the 3D structure. Go Decomposition (GoDec) (Zhou & Tao, 2011), a classic low-rank and sparse matrix decomposition algorithm, is adopted in our work to filter out the negative influence of moving obstructions across multiple frames of a static structure. Moreover, an SfM approach with re-triangulation (RT) (Wu, 2013) is integrated into our system to guarantee the geometric accuracy of the reconstructed indoor scene. Most importantly, the experimental results show that the proposed localization system remains robust in complex indoor environments, especially buildings with moving people and other large moving objects.
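To make the GoDec idea concrete, the following is a naive sketch of the low-rank-plus-sparse iteration (our own simplified version: it uses a truncated SVD in place of the bilateral random projections that Zhou & Tao (2011) use for efficiency). Stacking vectorized frames as the columns of X, the low-rank part L captures the static background while the sparse part S absorbs the moving occluders:

```python
import numpy as np

def godec(X, rank, card, iters=20):
    """Naive GoDec sketch: alternately fit X ~ L + S, where L has rank
    at most `rank` and S has at most `card` nonzero entries."""
    S = np.zeros_like(X)
    for _ in range(iters):
        # low-rank update: best rank-r approximation of X - S
        U, s, Vt = np.linalg.svd(X - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # sparse update: keep only the `card` largest-magnitude residuals
        R = X - L
        S = np.zeros_like(X)
        idx = np.unravel_index(np.argsort(np.abs(R), axis=None)[-card:], X.shape)
        S[idx] = R[idx]
    return L, S
```

In our setting, the nonzeros of S flag pixels belonging to moving obstructions, so the corresponding background features can be recovered from L instead of being lost.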
Well-studied approaches such as optical flow estimation (Mémin & Pérez, 1998) and Markov Random Fields (MRF) (Huang et al., 2007) can also be applied to segment moving pixels. However, they cannot handle moving-object occlusions with large displacements or obscured contours. Furthermore, because currently available data sets built for location awareness concentrate only on outdoor or static indoor scenes, we created our own data sets to evaluate the performance of our approach in crowded indoor environments.