1. Introduction
Thanks to recent advances in digital signal processing technologies and in the capabilities of data storage and transmission, the production and distribution of music material have grown at an unprecedented rate in recent years. However, this rapid proliferation also creates an ironic dilemma: how to locate the desired music among innumerable options, and how to ensure that only authorized users can access it. This dilemma has motivated research into automatic techniques for classifying, identifying, or indexing music data, such as melody extraction (Durey & Clements, 2002; Akeroyd et al., 2001), instrument recognition (Herrera et al., 2000; Eronen, 2003), genre classification (Tzanetakis & Cook, 2002; Li et al., 2003), song identification (Venkatachalam et al., 2004; Haitsma & Kalker, 2002), and mood classification (Liu & Zhang, 2003; Yang & Lee, 2004), in order to facilitate music retrieval.
When people listen to music, the singing voice usually draws more of listeners’ attention than other musical attributes such as tempo, tonality, or instrumentation. Information on singers, therefore, is key to organizing, searching, and retrieving music. Although identifying singers based on their voice characteristics might be an effortless task for most people, performing this task using machine learning is more difficult. One of the challenges in designing and building a robust automatic singer identification (SID) system lies in training the system to discriminate among the multiple sounds that occur simultaneously in music. Examples of musical sounds that may overlap in time include background noise, background vocals, instrumental accompaniment, and simultaneous singing. Although a number of studies on automatic SID from acoustic features have been reported, most systems to date identify the singers in recordings of solo performances (Kim & Whitman, 2002; Berenzweig et al., 2002; Liu & Huang, 2002; Zhang, 2003; Bartsch & Wakefield, 2004; Tsai et al., 2004; Maddage et al., 2004; Tsai & Wang, 2004; Fujihara et al., 2005; Mesaros & Astola, 2005; Tsai & Wang, 2006; Shen et al., 2006; Nwe & Li, 2007; Mesaros et al., 2007; Nwe & Li, 2008; Fujihara et al., 2010). Very little research has considered the more realistic case of automatically identifying more than one singer in a music recording.
Although Tsai & Wang (2004) investigated automatic detection and tracking of singers in music recordings, their study only considered singing by multiple singers whose performances do not overlap in time. To make SID more practical and adaptable, this research attempts to develop a system that automatically identifies simultaneous singers in music recordings in which the singing voices overlap in time. We refer to this problem as overlapping singer identification (OSID). To the best of our knowledge, no prior literature is devoted to this problem, except for our preliminary work (Tsai et al., 2008).