This chapter introduces a system for acquiring synchronized multi-view color and depth (RGB-D) video data using multiple off-the-shelf Microsoft Kinect sensors, together with methods for reconstructing temporally coherent 3D animation from the multi-view RGB-D video data. The acquisition system is cost-effective and synchronizes the cameras entirely in software. It is shown that the data acquired by this framework can be registered in a global coordinate system and then used to reconstruct a 360-degree 3D animation of a dynamic scene. In addition, a number of algorithms are presented that reconstruct a temporally coherent representation of a 3D animation without using any template model or a priori assumption about the underlying surface. It is shown that, despite some limitations imposed by the hardware on synchronous data acquisition, a reasonably accurate reconstruction of the animated 3D geometry can be obtained and used in a number of applications.
Introduction
Temporally coherent, time-varying dynamic scene geometry has been employed in a number of applications, including 3D animation for digital entertainment productions, electronic games, 3D television, motion analysis, and gesture recognition. The first step in obtaining temporally coherent 3D video is to capture the shape, appearance, and motion of a dynamic real-world object. One or more video cameras are employed for this acquisition. Unfortunately, the data obtained by these cameras has no temporal consistency, as there is no relationship between the consecutive frames of a video stream. In addition, for multi-view video, all the cameras have to be synchronized so that temporal correspondences can be extracted at each frame of the video. This synchronization is typically achieved by means of a hardware-based camera trigger, which acts as an external synchronizer. To reconstruct a temporally coherent 3D animation from the acquired synchronized data, the spatial relationship between the cameras has to be established, along with temporal matching over the complete video data.
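The two requirements named above, temporal matching across cameras and a common spatial frame, can be illustrated with a minimal sketch. It is not the method of this chapter: it assumes that every frame carries a capture timestamp in seconds, that frames are grouped by nearest timestamp within a tolerance, and that each camera's pose is already known as a rotation matrix and a translation vector. The names `nearest_index`, `group_frames`, and `to_global` are hypothetical.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Return the index of the value in sorted `timestamps` closest to `t`."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Choose the closer of the two neighbours around the insertion point.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def group_frames(streams, tolerance):
    """Group frames across cameras by nearest timestamp (assumed scheme).

    streams:   dict of camera id -> sorted list of capture timestamps (s).
    tolerance: largest accepted time offset to the reference camera (s).
    Returns one dict (camera id -> frame index) per reference frame for
    which every other camera has a frame within the tolerance.
    """
    ref_id = next(iter(streams))  # the first camera acts as the reference
    groups = []
    for ref_idx, t in enumerate(streams[ref_id]):
        group = {ref_id: ref_idx}
        for cam_id, stamps in streams.items():
            if cam_id == ref_id:
                continue
            if not stamps:
                break  # this camera has no frames at all
            j = nearest_index(stamps, t)
            if abs(stamps[j] - t) > tolerance:
                break  # no frame close enough; drop this reference frame
            group[cam_id] = j
        else:
            groups.append(group)
    return groups


def to_global(point, rotation, translation):
    """Map a 3D point from camera to global coordinates: R * p + t."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


if __name__ == "__main__":
    streams = {
        "cam0": [0.000, 0.033, 0.066],
        "cam1": [0.002, 0.034, 0.120],
    }
    print(group_frames(streams, tolerance=0.010))
    # 90-degree rotation about the z axis, then a shift of 1 along z.
    rot_z = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
    print(to_global((1, 0, 0), rot_z, (0, 0, 1)))
```

In this sketch a reference frame is discarded whenever any camera lacks a frame within the tolerance, which trades completeness for consistency. A real system may instead interpolate or accept a larger offset.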
This chapter presents a system for acquiring synchronized dynamic 3D data using multiple RGB-D cameras, along with three new methods for capturing spatio-temporal coherence between RGB-D images captured from multiple RGB-D video cameras. Synchronized multi-view video (MVV) data is used in a number of applications, e.g. motion capture, dynamic scene reconstruction, and free-viewpoint video. Traditionally, MVV recordings are acquired using synchronized color (RGB) cameras and are later processed for use in these applications (Aguiar et al., 2008; Carranza et al., 2003; Starck et al., 2007; Theobalt et al., 2007; Vlasic et al., 2008). The acquisition setups used in these earlier works consisted of a dedicated system for capturing synchronous, high-quality RGB MVV recordings, which were then used to reconstruct a dynamic 3D scene representation.
One of the earlier works in this area was presented by Carranza et al. (2003), who used eight multi-view recordings to reconstruct the motion and shape of a moving subject and applied the result to free-viewpoint video reconstruction. Theobalt et al. (2007) extended this work so that, in addition to capturing shape and motion, they also captured the surface reflectance properties of a dynamic object. Starck et al. (2007) presented a high-quality surface reconstruction method that could capture detailed moving geometry from multi-view video recordings. Later, de Aguiar et al. (2008) and Vlasic et al. (2008) presented new methods for reconstructing high-quality dynamic scenes from multi-view video recordings. Both methods first obtained the shape of the real-world object using a laser scanner and then deformed that shape to reconstruct the 3D animation. Ahmed et al. (2008) presented a method for dynamic scene reconstruction with time-coherent information that does not use any template geometry. Unlike one of the methods presented in this chapter, however, their method did not explicitly include multiple matching criteria for extracting time coherence.