Introduction
Over the years, broadcasting stations have accumulated large amounts of unlabeled audio content for their programs. With the help of information technology, these valuable resources can be saved, indexed, and retrieved for later use. Efficient information retrieval calls for labels that attach meaning to the data. According to an insider from China Radio International (CRI), as of July 2018, the total audio content held by CRI exceeded 55 terabytes, corresponding to about 530,000 hours of audio playback. It is practically impossible for humans to accomplish such a tedious, time-consuming annotation task without assistance from automatic or semi-automatic labeling techniques; as a result, the potential of this audio content cannot be fully realized.
Audio segmentation and classification are key techniques for the successful completion of audio (data) labeling, or audio annotation, in that the classification results provide a starting point for efficient annotation. This topic has attracted researchers mainly from the AI (artificial intelligence) and signal processing communities (Castán et al., 2015).
The primary goal of automatic audio segmentation is to provide boundaries that delimit portions of audio with homogeneous acoustic content (Shin, Chang, & Kim, 2010). Audio classification, in turn, aims to identify the semantic meaning of each portion derived from segmentation. This is by no means an easy task for audio streams such as broadcast news, which contain both single-type classes and mixed-type classes (e.g., speech with music and speech with noise) (Xie, Fu, Feng, & Luo, 2011; Cheong, Oh, & Lee, 2004).
This article presents an audio classification solution for broadcast news that couples audio segmentation and classification, a strategy known as segmentation-by-classification (see Background). A Dual-CNN (Dual-Convolutional Neural Network) is introduced to classify clips of fixed length. Unlike existing approaches, it can exploit both a small amount of labeled data and a large amount of unlabeled data for training the CNNs. A novel smoothing method, SEG-smoothing, is then applied to the classification result, yielding portions of audio with homogeneous acoustic content. To evaluate the proposed approach, a series of experiments involving the Dual-CNN and alternative methods was conducted on datasets from Beijing People's Broadcasting Station and GTZAN. The results verify that, in terms of classification accuracy and segmentation error rate, the Dual-CNN outperforms the alternatives.
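To make the segmentation-by-classification idea concrete, the sketch below shows, under simplifying assumptions, how per-clip labels can be turned into homogeneous segments: fixed-length clips are first classified, the label sequence is then smoothed, and runs of identical labels are merged into segments. Note that the majority-vote smoothing here is a generic illustrative stand-in, not the article's SEG-smoothing method, and the function names are hypothetical.

```python
from collections import Counter

def smooth_labels(labels, window=3):
    """Sliding-window majority vote over a sequence of clip labels.
    (Illustrative stand-in; the article's SEG-smoothing is more elaborate.)"""
    smoothed = []
    for i in range(len(labels)):
        lo = max(0, i - window // 2)
        hi = min(len(labels), i + window // 2 + 1)
        smoothed.append(Counter(labels[lo:hi]).most_common(1)[0][0])
    return smoothed

def labels_to_segments(labels, clip_len=1.0):
    """Merge consecutive identical clip labels into (start, end, label)
    segments, with times in seconds for clips of length clip_len."""
    segments = []
    for i, lab in enumerate(labels):
        if segments and segments[-1][2] == lab:
            # Extend the current segment to cover this clip.
            segments[-1] = (segments[-1][0], (i + 1) * clip_len, lab)
        else:
            segments.append((i * clip_len, (i + 1) * clip_len, lab))
    return segments

# A noisy label sequence with one spurious "music" clip inside a speech run:
clips = ["speech"] * 3 + ["music"] + ["speech"] * 2 + ["music"] * 4
print(labels_to_segments(smooth_labels(clips, window=3)))
# → [(0.0, 6.0, 'speech'), (6.0, 10.0, 'music')]
```

Smoothing removes the isolated misclassified clip, so the final segmentation contains only two homogeneous segments instead of four; this is precisely why a post-classification smoothing stage is needed before segment boundaries are emitted.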
The remainder of the article is organized as follows. Related work is presented in Section Background. This is followed by a detailed introduction to the Dual-CNN in Section The Dual-CNN Approach. The next section describes a smoothing method for audio segmentation. An array of experiments and related analysis evaluating the Dual-CNN is given in Section Evaluation. We conclude our work and identify future research directions in Section Conclusions and Future Work.
Background
In this section we present the related work that most inspired our research. We first introduce the two predominant categories of segmentation systems, segmentation-and-classification and segmentation-by-classification, and review the state of the art in related fields. This is followed by a discussion of deep learning techniques, namely CNNs and autoencoders, applied to audio classification; our work combines both techniques to facilitate audio segmentation and classification in the broadcasting domain. For a comprehensive and fair comparison, we investigate five approaches that are either classical in the field or share the most features with our proposed work.
Segmentation and Classification
Audio segmentation/classification systems can be divided into two classes depending on how segmentation is performed (Castán, Ortega, Miguel, & Lleida, 2014).