Introduction
In practical applications, speech signals are often corrupted by acoustic noise, which degrades the quality and intelligibility of the speech. Therefore, speech enhancement systems are required in many situations to improve speech quality and intelligibility, or the performance of speech coding and speech recognition systems (Veisi & Sameti, 2013).
Current speech enhancement techniques can be broadly divided into two categories: single-channel and multi-channel approaches (Low et al., 2013). The single-channel approach processes the speech signal received from a single microphone in a certain domain, such as the time, frequency, or wavelet domain, and includes many classical speech enhancement methods such as spectral subtraction (Boll, 1979), the Wiener filter (Lim & Oppenheim, 1978), and MMSE estimation (Martin, 2005). On the other hand, the multi-channel approach (Jarrett et al., 2014; Tavakoli et al., 2016) requires more than one microphone to exploit spatial information and separate the signal of interest from other interferences. Spatial filtering (or beamforming) is commonly used to form a beam towards the target signal so as to suppress interferences arriving from other directions. Compared with single-channel techniques, multi-channel techniques can usually achieve better performance, but at the cost of higher computational complexity and larger device size. Therefore, single-microphone speech enhancement techniques are still of wide interest in many applications (Veisi & Sameti, 2013).
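As a concrete illustration of the single-channel approach, the following is a minimal sketch of magnitude spectral subtraction in the short-time Fourier transform (STFT) domain. It assumes the first few frames of the recording contain noise only, and the function name, over-subtraction factor, and spectral floor are illustrative choices rather than the exact formulation of Boll (1979).

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=10, alpha=2.0, beta=0.02):
    """Basic magnitude spectral subtraction (illustrative sketch).

    Assumes the first `noise_frames` STFT frames are speech-free, so
    their average magnitude serves as the noise estimate. `alpha` is
    an over-subtraction factor and `beta` a spectral floor; both are
    example values, not tuned settings.
    """
    f, t, Zxx = stft(noisy, fs=fs, nperseg=512)
    mag, phase = np.abs(Zxx), np.angle(Zxx)

    # Noise magnitude estimate from the leading (assumed noise-only) frames.
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Subtract the scaled noise estimate; floor the result to avoid
    # negative magnitudes, a common source of "musical noise" artifacts.
    clean_mag = np.maximum(mag - alpha * noise_mag, beta * mag)

    # Recombine with the noisy phase and invert the STFT.
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return enhanced
```

In practice, the noise estimate would be updated during speech pauses rather than taken only from the leading frames, but the sketch captures the core subtract-and-floor operation that the classical method performs.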
In most current speech enhancement systems, clean speech is recovered only from the signals collected by acoustic microphones. Such systems are greatly affected by acoustic noise and suffer performance degradation in low-SNR situations. The multi-stream (MS) approach (Erzin, 2012; Estellers et al., 2012; Graciarena et al., 2003; Nemala et al., 2013), which exploits heterogeneous information from different kinds of sensors or feature extraction methods, has been successfully applied to automatic speech recognition (ASR) for many years and has proven to be an effective way to improve the recognition accuracy and robustness of ASR systems. Because noise and mismatch do not affect different data streams in the same way, MS recognizers can usually outperform single-stream ones in varied and unpredictable noisy environments by properly selecting and fusing complementary data streams (a simple fusion rule is sketched below). Beyond ASR, heterogeneous information is also widely used to enhance the performance of many other kinds of systems (Rychlý et al., 2015; Wu et al., 2014). However, the multi-stream approach is seldom employed in speech enhancement systems. The major difficulty in applying it to speech enhancement is that the speech waveform cannot be directly recovered from many kinds of data streams, for example, the visual information of lip movements. Therefore, most current speech enhancement techniques use only the noisy acoustic speech from microphones to recover the clean speech.
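To make the fusion idea concrete, the sketch below shows one common score-level fusion rule used in multi-stream recognition: a weighted sum of per-stream log-likelihoods. The function, the reliability weights, and the example scores are all hypothetical; in a real MS system the weights would typically be estimated from per-stream reliability cues such as SNR.

```python
import numpy as np

def fuse_stream_scores(log_likelihoods, weights):
    """Score-level fusion of multiple data streams (illustrative sketch).

    `log_likelihoods` is a list of (num_classes,) arrays, one per
    stream; `weights` are per-stream reliability weights. A linear
    combination of stream log-likelihoods is one common fusion rule
    in multi-stream ASR.
    """
    fused = sum(w * ll for w, ll in zip(weights, log_likelihoods))
    return int(np.argmax(fused))

# Hypothetical example: an acoustic stream and a visual stream scoring
# three candidate classes, with the acoustic stream weighted higher.
acoustic = np.log(np.array([0.6, 0.3, 0.1]))
visual = np.log(np.array([0.4, 0.5, 0.1]))
best = fuse_stream_scores([acoustic, visual], weights=[0.7, 0.3])
print(best)  # index of the class preferred after fusion
```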