Low-Complexity Stereo Matching and Viewpoint Interpolation in Embedded Consumer Applications

Lu Zhang (IMEC, Belgium), Ke Zhang (IMEC, Belgium), Jiangbo Lu (Advanced Digital Sciences Center, Singapore), Tian-Sheuan Chang (National Chiao-Tung University, Taiwan) and Gauthier Lafruit (IMEC, Belgium)
Copyright: © 2012 | Pages: 24
DOI: 10.4018/978-1-61350-326-3.ch016

Abstract

Viewpoint interpolation is the process of synthesizing plausible in-between views, so-called virtual camera views, from a couple of surrounding fixed camera views. To make viewpoint interpolation possible in low/moderate-power consumer applications, a further quality/complexity trade-off study is required to reconcile algorithmic quality with architectural performance. In essence, the inter-dependencies between the different algorithmic steps in the processing chain are thoroughly analyzed, aiming at an overall quality-performance model that pinpoints which algorithmic functionalities can be simplified with only minor degradation of the global input-output quality, while maximally reducing their implementation complexity w.r.t. arithmetic and line buffer requirements. Compared to state-of-the-art CPU and GPU platforms running at clock speeds of several GHz, our low-power 100 MHz FPGA implementation achieves speedups of one to two orders of magnitude without compromising visual quality. It reaches over 100 frames per second for high-quality VGA stereo matching with a 64-disparity search range, thereby enabling viewpoint interpolation in low-power, embedded applications.
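The throughput quoted in the abstract can be cross-checked with simple arithmetic. The sketch below assumes the common convention that one disparity estimation is one matching-cost evaluation for one pixel at one candidate disparity, so that Million Disparity Estimations per second (MDE/s) = width × height × disparity range × frame rate / 10^6. The function names are hypothetical. Dividing the same product by the 100 MHz clock gives the average number of estimations that must be completed per clock cycle, which suggests that a considerable degree of parallelism is needed, assuming the whole datapath runs at that clock.

```python
def mde_per_second(width: int, height: int, disparities: int, fps: int) -> float:
    """Million Disparity Estimations per second (MDE/s).

    Assumption: one disparity estimation = one matching-cost evaluation
    for one pixel at one candidate disparity.
    """
    return width * height * disparities * fps / 1_000_000


def estimations_per_cycle(width: int, height: int, disparities: int,
                          fps: int, clock_hz: int) -> float:
    """Average number of disparity estimations required per clock cycle."""
    return width * height * disparities * fps / clock_hz


# VGA (640x480), 64-disparity search range, 100 frames per second, 100 MHz clock
print(mde_per_second(640, 480, 64, 100))                      # 1966.08
print(estimations_per_cycle(640, 480, 64, 100, 100_000_000))  # 19.6608
```

Under this convention, "over 100 frames per second" at VGA resolution with 64 disparities corresponds to roughly 2000 MDE/s.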

Introduction

Figure 1 shows a typical eye-gaze correcting video conferencing application, where virtual camera viewpoint interpolation restores direct eye contact between tele-conference participants by interpolating the surrounding views of each participant, captured by cameras placed all around the display (see Figure 1 (bottom)). This principle can be extended towards rendering multiple, adjacent, interpolated viewpoints for auto-stereoscopic, shutter-glasses-free 3D displays, where the depth impression is obtained by rendering, for each pixel, up to ten different images in different viewing cones (see Figure 1 (top)), two of which are captured by the viewer's eyes. Ultimately, all these views are calculated through viewpoint interpolation from a single pair of cameras capturing stereoscopic content.
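To illustrate how a virtual view can be derived from a stereo pair once disparities are known, here is a minimal sketch of disparity-based viewpoint interpolation on one rectified scanline. It assumes the virtual camera lies on the baseline at a fraction alpha between the left (alpha = 0) and right (alpha = 1) cameras, so that a left pixel at column x with disparity d moves to column x - alpha·d. The function name, the z-buffer rule (larger disparity means nearer, and nearer wins) and the background-favouring hole filling are illustrative assumptions, not the rendering method of this chapter.

```python
import math


def interpolate_scanline(left_row, left_disp, alpha):
    """Forward-warp one rectified left scanline to a virtual viewpoint.

    alpha = 0 reproduces the left view; alpha = 1 approximates the right view.
    A left pixel at column x with disparity d lands at column x - alpha * d.
    """
    width = len(left_row)
    warped = [None] * width
    best_disp = [-1.0] * width  # disparity of the pixel currently stored per column
    for x in range(width):
        d = left_disp[x]
        xv = math.floor(x - alpha * d + 0.5)  # round to the nearest target column
        # z-buffer test: a larger disparity means a nearer surface, which wins
        if 0 <= xv < width and d > best_disp[xv]:
            warped[xv] = left_row[x]
            best_disp[xv] = d

    # Fill disocclusion holes: they reveal background, so copy the nearest
    # valid pixel (to the left or right) that has the smaller disparity.
    valid = [v is not None for v in warped]
    filled = list(warped)
    for x in range(width):
        if valid[x]:
            continue
        lft = next((i for i in range(x - 1, -1, -1) if valid[i]), None)
        rgt = next((i for i in range(x + 1, width) if valid[i]), None)
        candidates = [i for i in (lft, rgt) if i is not None]
        if candidates:
            src = min(candidates, key=lambda i: best_disp[i])
            filled[x] = warped[src]
    return filled


# A foreground object (disparity 2) in front of a background (disparity 0)
row = [10, 20, 30, 40, 50, 60]
disp = [0, 0, 2, 2, 0, 0]
print(interpolate_scanline(row, disp, 0.5))  # [10, 30, 40, 50, 50, 60]
```

In the example, the foreground pixels shift one column to the left, hide the background value 20, and leave a hole that is filled with background (50) rather than foreground. Sweeping alpha over several values would produce the adjacent views needed by a multi-view display.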

Figure 1.

Interpolation of left/right camera views into a rendered virtual viewpoint for eye-gaze correction in video teleconferencing (bottom), possibly augmented with auto-stereoscopic 3D displays, where each pixel projects multi-directional viewing cones, two of which are captured by the viewer's eyes (top)

An essential DSP kernel in this process is the extraction of depth from the stereo cameras. Although we humans experience no difficulty in perceiving depth from our binocular view of the outside world, this depth extraction, also called stereo matching, is an extremely complex processing step. It has only recently been ported to embedded platforms (Woodfill, 2004; van der Horst, 2006), and then at the expense of the quality of the extracted depth image (also called dense depth map), in order to approach real-time performance.
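To make the notion of stereo matching concrete, the following is a minimal sketch of the simplest local method: winner-takes-all block matching with a sum-of-absolute-differences (SAD) cost over a square window. It is a textbook baseline chosen for illustration, not the algorithm implemented in this chapter; the function name and parameters are hypothetical. The nested loops make the workload explicit: every pixel is evaluated at every candidate disparity over a full window, so the cost grows with image size, search range and window size.

```python
def sad_disparity(left, right, num_disp, radius=1):
    """Winner-takes-all block matching with a sum-of-absolute-differences cost.

    left, right: rectified grey-level images as lists of rows.
    A left pixel (y, x) is compared with the right pixel (y, x - d) for
    d = 0 .. num_disp - 1, using a (2*radius+1) x (2*radius+1) window.
    """
    h, w = len(left), len(left[0])

    def px(img, y, x):
        # replicate border pixels for window positions outside the image
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    disp = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            best_cost, best_d = None, 0
            for d in range(min(num_disp, x + 1)):  # keep x - d >= 0
                cost = sum(
                    abs(px(left, y + dy, x + dx) - px(right, y + dy, x + dx - d))
                    for dy in range(-radius, radius + 1)
                    for dx in range(-radius, radius + 1)
                )
                if best_cost is None or cost < best_cost:  # ties keep the smaller d
                    best_cost, best_d = cost, d
            disp[y][x] = best_d
    return disp


# The left image is the right image shifted 2 pixels to the right
right_row = [5, 17, 3, 29, 11, 23, 2, 31, 13, 7]
left_row = [5, 5] + right_row[:-2]
left = [list(left_row) for _ in range(3)]
right = [list(right_row) for _ in range(3)]
print(sad_disparity(left, right, num_disp=4)[1][3:9])  # [2, 2, 2, 2, 2, 2]
```

Only interior columns are printed, because near the image borders the replicated pixels break the exact shift and the recovered disparity can differ.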

Figure 2 shows that we achieve competitive, real-time processing (over 100 frames per second at VGA resolution, including frame buffer access latency) while preserving high quality, as confirmed by the very low Bad Pixel Error Rate (BPER) reported in Figure 2(d). Following the definition of (Scharstein, 2002), the BPER is the percentage of pixels whose calculated disparity differs from the ground-truth disparity by more than a given threshold (cf. Figure 2(b)), measured on the test images of http://vision.middlebury.edu/stereo/. The black arrows refer to the presented FPGA solution, and the grey arrows correspond to a comparable GPU solution from one of the co-authors, which runs at a clock speed one order of magnitude higher yet reaches a 15 times lower frame rate (or, equivalently, a 15 times lower throughput in Million Disparity Estimations per second, MDE/s) at a marginally higher quality (lower BPER: 7.65% versus 8.2%).
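The bad-pixel measure can be sketched as follows, following the definition of Scharstein (2002). The sketch assumes a threshold of one disparity level, a common choice in the Middlebury evaluation, and an optional mask selecting the pixels to evaluate (for example, only non-occluded pixels); the function name and the mask parameter are hypothetical.

```python
def bad_pixel_error_rate(computed, ground_truth, threshold=1.0, mask=None):
    """Percentage of evaluated pixels whose absolute disparity error exceeds threshold.

    computed, ground_truth: disparity maps as lists of rows.
    mask: optional map of booleans selecting the pixels to evaluate.
    """
    bad = total = 0
    for y, (row_c, row_t) in enumerate(zip(computed, ground_truth)):
        for x, (dc, dt) in enumerate(zip(row_c, row_t)):
            if mask is not None and not mask[y][x]:
                continue  # pixel excluded from the evaluation
            total += 1
            if abs(dc - dt) > threshold:
                bad += 1
    return 100.0 * bad / total if total else 0.0


computed = [[1, 2, 3], [4, 5, 6]]
truth = [[1, 3, 5], [4, 7, 6]]
print(bad_pixel_error_rate(computed, truth))  # 2 of 6 pixels are bad, i.e. one third
```

An error of exactly one level (the second pixel) is not counted as bad. The mean absolute disparity error of the same example would be 5/6, or about 0.83 levels, which is a different figure of merit from the BPER.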

Figure 2.

Frame rate (a) (frames per second, fps) and quality (a, d) figures of merit (Bad Pixel Error Rate, BPER; cf. definition in (b)) on different platforms (CPU, GPU and the proposed FPGA implementation). The arrows compare FPGA (black arrow) and GPU (grey arrow) implementations of the same or similar reference stereo matching code from two authors of this chapter.