Speech Enhancement Using Heterogeneous Information

Yan Xiong (Department of Computer Science, Guangdong University of Education, Guangdong, China), Fang Xu (School of Electronic and Information Engineering, South China University of Technology, Guangdong, China), Qiang Chen (Department of Computer Science, Guangdong University of Education, Guangdong, China) and Jun Zhang (School of Electronic and Information Engineering, South China University of Technology, Guangdong, China)
Copyright: © 2018 | Pages: 14
DOI: 10.4018/IJGHPC.2018070104

Abstract

This article describes how to use heterogeneous information in speech enhancement. In most current speech enhancement systems, clean speech is recovered only from the signals collected by acoustic microphones, which are strongly affected by acoustic noise. Heterogeneous information from different kinds of sensors, usually called "multi-stream" information, is seldom used in speech enhancement because the speech waveform cannot be recovered from the signals provided by many kinds of sensors. In this article, the authors propose a new model-based multi-stream speech enhancement framework that can make use of the heterogeneous information provided by signals from different kinds of sensors, even when some of them are not directly related to the speech waveform. A new speech enhancement scheme using acoustic and throat microphone recordings is then proposed based on this framework. Experimental results show that the proposed scheme outperforms several single-stream speech enhancement methods in different noisy environments.

Introduction

In practical applications, speech signals are often corrupted by acoustic noise, which degrades the quality and intelligibility of the speech. Speech enhancement systems are therefore required in many situations to improve speech quality and intelligibility, or the performance of speech coding and speech recognition systems (Veisi & Sameti, 2013).

Current speech enhancement techniques can be broadly divided into two categories: single-channel and multi-channel approaches (Low et al., 2013). The single-channel approach processes the speech signal received from a single microphone in a particular domain, such as the time, frequency or wavelet domain. It includes many classical speech enhancement methods, such as spectral subtraction (Boll, 1979), the Wiener filter (Lim & Oppenheim, 1978) and MMSE estimation (Martin, 2005). The multi-channel approach (Jarrett et al., 2014; Tavakoli et al., 2016), on the other hand, requires more than one microphone in order to exploit spatial information and separate the signal of interest from other interference. Spatial filtering (or beamforming) is commonly used to form a beam towards the target signal so as to suppress interference from other directions. Compared with single-channel techniques, multi-channel techniques can usually achieve better performance, but at the cost of higher computational complexity and larger device sizes. Therefore, single-microphone speech enhancement techniques are still of wide interest in many applications (Veisi & Sameti, 2013).
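As an illustration of the single-channel, frequency-domain family mentioned above, the following is a minimal sketch of power spectral subtraction applied to one signal frame. It is not the scheme proposed in this article, and it is a simplified textbook form rather than a reproduction of Boll (1979). The function names, the `floor` parameter and the assumption that a per-bin noise power estimate is already available (for example, averaged over speech-free frames) are all hypothetical choices made for this example.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform, O(n^2); adequate for a short demo frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse of dft(); returns the real part of each reconstructed sample."""
    n = len(X)
    return [(sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def spectral_subtraction(noisy_frame, noise_power, floor=0.01):
    """Power spectral subtraction on a single frame (illustrative sketch).

    noisy_frame: real-valued samples of one frame.
    noise_power: assumed per-bin noise power estimate, same length as the frame
                 and symmetric across bins so that the output stays real.
    floor:       hypothetical spectral floor, as a fraction of the noisy power,
                 that keeps the estimated clean power from going negative.
    """
    spectrum = dft(noisy_frame)
    enhanced = []
    for y, n_pow in zip(spectrum, noise_power):
        y_pow = abs(y) ** 2
        # Subtract the noise power estimate, but never go below the floor.
        clean_pow = max(y_pow - n_pow, floor * y_pow)
        gain = math.sqrt(clean_pow / y_pow) if y_pow > 0 else 0.0
        # Scale the magnitude only; the noisy phase is reused unchanged.
        enhanced.append(gain * y)
    return idft(enhanced)


# Usage: a 16-sample sinusoid with an assumed flat noise power estimate.
frame = [math.sin(2 * math.pi * 3 * t / 16) for t in range(16)]
enhanced_frame = spectral_subtraction(frame, [0.5] * 16)
```

Because only the magnitude of each bin is modified and the noisy phase is kept, the quality of the result depends entirely on the noise estimate taken from the same acoustic channel, which is the limitation in low SNR conditions that motivates the use of additional data streams.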

In most current speech enhancement systems, clean speech is recovered only from the signals collected by acoustic microphones. These systems are strongly affected by acoustic noise and suffer from performance degradation in low SNR situations. The multi-stream (MS) approach (Erzin, 2012; Estellers et al., 2012; Graciarena et al., 2003; Nemala et al., 2013), which uses heterogeneous information from different kinds of sensors or feature extraction methods, has been applied successfully to automatic speech recognition (ASR) for many years and has proven to be an effective way to improve the recognition accuracy and robustness of ASR systems. Because noise and mismatches do not affect different data streams in the same way, MS recognizers can usually outperform single-stream ones in varied and unpredictable noisy environments by properly choosing and fusing complementary data streams. Besides ASR, heterogeneous information is also widely used to enhance performance in many other kinds of systems (Rychlý et al., 2015; Wu et al., 2014). However, the multi-stream approach is seldom employed in speech enhancement systems. The major difficulty in applying it to speech enhancement is that the speech waveform cannot be directly recovered from many kinds of data streams, for example the visual information of lip movements. Therefore, most current speech enhancement techniques use only the noisy acoustic speech from the microphones to recover the clean speech.
