Optical Flow Prediction for Blind and Non-Blind Video Error Concealment Using Deep Neural Networks

Arun Sankisa (Northwestern University, Evanston, USA), Arjun Punjabi (Northwestern University, Evanston, USA) and Aggelos K. Katsaggelos (Northwestern University, Evanston, USA)
DOI: 10.4018/IJMDEM.2019070102

Abstract

A novel optical flow prediction model using an adaptable deep neural network architecture for blind and non-blind error concealment of videos degraded by transmission loss is presented. The two-stream network model is trained by separating the horizontal and vertical motion fields which are passed through two similar parallel pipelines that include traditional convolutional (Conv) and convolutional long short-term memory (ConvLSTM) layers. The ConvLSTM layers extract temporally correlated motion information while the Conv layers correlate motion spatially. The optical flows used as input to the two-pipeline prediction network are obtained through a flow generation network that can be easily interchanged, increasing the adaptability of the overall end-to-end architecture. The performance of the proposed model is evaluated using real-world packet loss scenarios. Standard video quality metrics are used to compare frames reconstructed using predicted optical flows with those reconstructed using “ground-truth” flows obtained directly from the generator.

1. Introduction

Applications of neural networks in signal processing range from object classification (Krizhevsky et al., 2012), recognition (Simonyan et al., 2015; He et al., 2016), inpainting and reconstruction (Cai et al., 2015; Pathak et al., 2016), and solving inverse problems (Lucas et al., 2018) in still images to classification (Ng et al., 2015), action recognition (Simonyan et al., 2014), frame prediction, and optical-flow estimation (Ilg et al., 2017; Revaud et al., 2015; Weinzaepfel et al., 2013) in videos. Research has shown that deep convolutional neural networks (CNNs) (Krizhevsky et al., 2012; LeCun et al., 1990; LeCun et al., 1998) are naturally suited for extracting spatial correlations embedded in natural images, while multi-dimensional long short-term memory (LSTM) (Hochreiter et al., 1997; Sundermeyer et al., 2010; Sutskever et al., 2014) modules, a variant of recurrent neural networks (RNNs), have been applied where the underlying data exhibits temporal dependencies. More recently, specialized models that combine both frameworks into a single class, the Convolutional LSTM (ConvLSTM), have been proposed for domains with both spatial and temporal coherence that require efficient handling of 3D volumetric time-sequence information, such as weather/precipitation forecasting (Shi et al., 2015) and, more generally, videos (Patraucean et al., 2016). Although future frame synthesis and optical flow estimation are widely researched, decoder-like motion-compensated error concealment using dense optical flows in videos has not been actively studied. Most current approaches extract motion through semantic segmentation or implicitly in the textural domain through pixel or voxel flows. The model we present in this work explicitly uses available motion-field information to predict future motion and reconstruct a degraded frame, a behavior that mimics traditional video codecs.

Communication networks have been another area of active research, owing in large part to the growing presence of smart mobile devices capable of capturing and sharing high-definition images and videos. Wireless technology has advanced with newer generations of transmission technology such as 5G (Rost et al., 2016; Chih-Lin et al., 2016; Agiwal et al., 2016) and with content-aware resource allocation, packet prioritization and scheduling (e.g., Nasralla et al., 2018; Pahalawatta et al., 2007; Maani et al., 2008; Sankisa et al., 2016), but packet loss is inevitable when transmission takes place over shared resources. This is exacerbated by the prevalence of cloud computing (Rost et al., 2014) and the advent of telco cloud (Soares, 2015; Zhiqun, 2013), Machine-to-Machine (M2M) communication and the Internet of Things (IoT) (Palattella et al., 2016), which further contribute to overloading the available bandwidth. The work presented in this paper addresses problems encountered in resource-constrained, error-prone transmission environments. In general, an encoder (usually on the sending/server side) takes individual frames from video sequences and partitions them into groups of blocks (GOBs) or slices to perform coding and compression using motion vectors. Each GOB/slice is then packaged into transmission units called packets for communication. If packet loss is encountered at the decoder (at the receiver), motion vectors in the preceding blocks or frames are used to predict the lost vectors. The predicted motion vectors are then combined with previously received/reconstructed frames to conceal the errors from lost packets in the current frame.
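
The decoder-side concealment steps described above can be illustrated with a minimal NumPy sketch. It is not the model proposed in this paper: the lost vectors are simply copied from the co-located positions of the preceding flow field, where the paper uses a learned predictor. The function names, the dense per-pixel flow layout (height x width x 2, storing dx then dy), the backward-flow convention, and the nearest-neighbour sampling are all assumptions made for illustration.

```python
import numpy as np


def predict_lost_flow(prev_flow, cur_flow, lost_mask):
    """Fill flow vectors at lost pixels with the co-located vectors
    of the preceding flow field (a zero-order hold, assumed here)."""
    predicted = cur_flow.copy()
    predicted[lost_mask] = prev_flow[lost_mask]
    return predicted


def conceal(prev_frame, cur_frame, flow, lost_mask):
    """Replace lost pixels of cur_frame with motion-compensated pixels
    taken from prev_frame.

    Assumed convention: flow[y, x] = (dx, dy) points from pixel (x, y)
    in the current frame to its source location in the previous frame.
    """
    h, w = lost_mask.shape
    ys, xs = np.nonzero(lost_mask)
    # Nearest-neighbour source coordinates, clamped to the frame borders.
    src_x = np.clip(np.rint(xs + flow[ys, xs, 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[ys, xs, 1]).astype(int), 0, h - 1)
    out = cur_frame.copy()
    out[ys, xs] = prev_frame[src_y, src_x]
    return out


# Toy usage: a 4x4 scene that moves one pixel to the right per frame.
prev_frame = np.arange(16, dtype=float).reshape(4, 4)
true_frame = np.roll(prev_frame, 1, axis=1)

lost_mask = np.zeros((4, 4), dtype=bool)
lost_mask[1:3, 1:4] = True            # pixels carried by the lost packet

received = true_frame.copy()
received[lost_mask] = 0.0             # lost pixels arrive empty

prev_flow = np.zeros((4, 4, 2))
prev_flow[..., 0] = -1.0              # source lies one pixel to the left
cur_flow = prev_flow.copy()
cur_flow[lost_mask] = 0.0             # flow is lost together with the pixels

flow = predict_lost_flow(prev_flow, cur_flow, lost_mask)
concealed = conceal(prev_frame, received, flow, lost_mask)
```

In this toy case the motion is constant, so copying the preceding vectors recovers the lost region exactly; with real, time-varying motion a stronger predictor for the lost vectors would be needed.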
