1. Introduction
Applications of neural networks in signal processing have tackled problems ranging from object classification (Krizhevsky et al., 2012), recognition (Simonyan et al., 2015; He et al., 2016), inpainting and reconstruction (Cai et al., 2015; Pathak et al., 2016), and solving inverse problems (Lucas et al., 2018) in still images, to classification (Ng et al., 2015), action recognition (Simonyan et al., 2014), and frame prediction and optical-flow estimation (Ilg et al., 2017; Revaud et al., 2015; Weinzaepfel et al., 2013) in videos. Research has shown that deep convolutional neural networks (CNNs) (Krizhevsky et al., 2012; LeCun et al., 1990; LeCun et al., 1998) are naturally suited to extracting the spatial correlations embedded in natural images, while multi-dimensional long short-term memory (LSTM) modules (Hochreiter et al., 1997; Sundermeyer et al., 2010; Sutskever et al., 2014), a variant of recurrent neural networks (RNNs), have been applied where the underlying data exhibits temporal dependencies. More recently, specialized models that combine both frameworks into a single class, the convolutional LSTM (ConvLSTM), have been proposed for domains with both spatial and temporal coherence that require efficient handling of 3D volumetric time-sequence information, such as weather/precipitation forecasting (Shi et al., 2015) and, more generally, video (Patraucean et al., 2016). Although future-frame synthesis and optical-flow estimation are widely studied in recent work, decoder-side motion-compensated error concealment using dense optical flows in videos has received little attention. Most current approaches extract motion through semantic segmentation or implicitly in the textural domain through pixel or voxel flows. The model presented in this work explicitly uses available motion-field information to predict future motion and reconstruct a degraded frame, mimicking the behavior of traditional video codecs.
Communication networks have been another area of active research, owing in large part to the growing presence of smart mobile devices capable of capturing and sharing high-definition images and videos. Wireless technology has advanced with newer generations of transmission technology such as 5G (Rost et al., 2016; Chih-Lin et al., 2016; Agiwal et al., 2016) and with content-aware resource allocation, packet prioritization, and scheduling (e.g., Nasralla et al., 2018; Pahalawatta et al., 2007; Maani et al., 2008; Sankisa et al., 2016), but packet loss is inevitable when transmission occurs over shared resources. This is further exacerbated by the prevalence of cloud computing (Rost et al., 2014) and the advent of the telco cloud (Soares, 2015; Zhiqun, 2013), machine-to-machine (M2M) communication, and the Internet of Things (IoT) (Palattella et al., 2016), all of which further strain the available bandwidth. The work presented in this paper addresses problems encountered in resource-constrained, error-prone transmission environments. In general, an encoder (usually on the sending/server side) takes individual frames from a video sequence and partitions them into groups of blocks (GOBs) or slices to perform coding and compression using motion vectors. Each GOB/slice is then packaged into transmission units called packets for communication. If packet loss is detected at the decoder (on the receiver side), motion vectors from the preceding blocks or frames are used to predict the lost vectors. The predicted motion vectors are then combined with previously received/reconstructed frames to conceal the errors caused by lost packets in the current frame.
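The concealment step described above can be sketched in a few lines. The following is a minimal, hypothetical NumPy illustration, not the codec's actual implementation: the block size, array shapes, and the function name `conceal_lost_blocks` are assumptions made for the example. A block whose packet was lost reuses the co-located motion vector from the previous frame, and its pixels are reconstructed by motion compensation from the previously reconstructed frame:

```python
import numpy as np

BLOCK = 16  # block size in pixels (illustrative choice)

def conceal_lost_blocks(prev_frame, prev_mvs, cur_mvs, lost_mask):
    """Sketch of motion-compensated error concealment.

    prev_frame : (H, W) previously reconstructed frame
    prev_mvs   : (H//BLOCK, W//BLOCK, 2) per-block motion vectors
                 (dy, dx) from the previous frame
    cur_mvs    : same shape, motion vectors received for the
                 current frame (contents irrelevant where lost)
    lost_mask  : (H//BLOCK, W//BLOCK) bool, True where the packet
                 carrying that block was lost
    """
    H, W = prev_frame.shape
    out = np.zeros_like(prev_frame)
    # For lost blocks, substitute the co-located MV from the previous frame.
    mvs = np.where(lost_mask[..., None], prev_mvs, cur_mvs)
    for by in range(H // BLOCK):
        for bx in range(W // BLOCK):
            dy, dx = mvs[by, bx]
            # Clamp the reference block so it stays inside the frame.
            y0 = int(np.clip(by * BLOCK + dy, 0, H - BLOCK))
            x0 = int(np.clip(bx * BLOCK + dx, 0, W - BLOCK))
            # Motion-compensate from the previous reconstructed frame.
            # (A real decoder would also add the received residual for
            # correctly received blocks; that step is omitted here.)
            out[by*BLOCK:(by+1)*BLOCK, bx*BLOCK:(bx+1)*BLOCK] = \
                prev_frame[y0:y0+BLOCK, x0:x0+BLOCK]
    return out
```

With zero motion vectors this degenerates to simple frame-copy concealment; the model proposed later in the paper replaces the naive "reuse the previous motion vector" heuristic with a learned prediction of the motion field.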