Predictive Analytics in Digital Signal Processing: A Convolutive Model for Polyphonic Instrument Identification and Pitch Detection Using Combined Classification


Josh Weese
DOI: 10.4018/978-1-4666-5063-3.ch010


Pitch detection and instrument identification can be achieved with relatively high accuracy for monophonic music signals; however, accurately classifying polyphonic music signals remains an unsolved research problem. Pitch and instrument classification is a subset of Music Information Retrieval (MIR) and automatic music transcription, both of which have numerous research and real-world applications. Several areas of research are covered in this chapter, including the fast Fourier transform, onset detection, convolution, and filtering. Polyphonic signals with many different voices and frequencies can be exceptionally complex. This chapter presents a new model for representing the spectral structure of polyphonic signals: Uniform MAx Gaussian Envelope (UMAGE). The new spectral envelope closely approximates the distribution of frequency components in the spectrum while resisting rapid oscillation, and it generalizes well without losing the representation of the original spectrum.
Chapter Preview

1. Introduction

With the rapid development of technology, music has shifted from an “in-person” experience to a digital one, thanks to radio, the Internet, CDs, MP3 players, and the like. Due to improvements in technology, the science that is music is readily available to the general population. While technology has provided us with large amounts of music, ready at the click of a button, it also limits how we can access it. As Marc Leman (2008) describes in “Embodied Music Cognition and Mediation Technology,” music is accessed merely by the title of the song, artist, and composer, but not by how it sounds or feels. Projects such as the Music Genome Project aim to expand how we listen to music by analyzing musical features. The Music Genome Project (About The Music Genome Project, 2013) uses up to 450 distinct musical characteristics, set by music analysts, so that individuals may listen not only to specific genres of music but also to music that matches their own unique taste. However, Pandora, an online customizable radio service, does not use automated information retrieval (About The Music Genome Project, 2013).

Constructing identifiable features for music automatically remains a challenging problem. While some properties apply to particular instruments, styles, or genres, those properties may not apply to music globally. First, we must understand the basis of Music Information Retrieval (MIR). A music signal in its raw form (the time domain) is rather complex. Additional information can be extracted by transforming the signal from the time domain to the frequency or time/frequency domain using the Fast Fourier Transform (FFT) or the Short Time Fourier Transform (STFT). These are some of the most common algorithms for transforming signals from one domain to the other and back. This work focuses on the FFT and the frequency domain. Signals are generally transformed to allow a different level of analysis of the data (i.e., moving from studying the signal in the time domain to studying it in the frequency domain). Further data transformation is achieved through convolution and filtering. Convolution in Digital Signal Processing (DSP) involves a system that applies some function, or impulse response, to an input signal to produce an output signal. Note also that convolution in the time domain maps to multiplication in the frequency domain. Convolution relates directly to filtering, which attempts to reduce or eliminate specific frequencies or ranges of frequencies from the original signal. This reduces the noise and complexity of the signal and simplifies the analysis of properties like timbre (harmonic structure, or the frequencies present).
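The relationship between convolution and multiplication can be checked numerically. The following is a minimal sketch, not the chapter's implementation: it uses a naive discrete Fourier transform in place of an FFT (both produce the same values), and the test signal, the moving-average impulse response, and all function names are assumptions made for illustration.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (an FFT computes the same values faster)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse DFT: back from the frequency domain to the time domain."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def circular_convolve(x, h):
    """Circular convolution computed directly in the time domain."""
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]


# Hypothetical input: a slow 2-cycle sine plus a faster 12-cycle sine, 32 samples.
N = 32
signal = [math.sin(2 * math.pi * 2 * t / N) + 0.5 * math.sin(2 * math.pi * 12 * t / N)
          for t in range(N)]

# Hypothetical impulse response: a 4-point moving average (a crude low-pass filter).
impulse = [0.25] * 4 + [0.0] * (N - 4)

# Route 1: convolve in the time domain.
time_result = circular_convolve(signal, impulse)

# Route 2: multiply the two spectra bin by bin, then transform back.
product = [a * b for a, b in zip(dft(signal), dft(impulse))]
freq_result = [v.real for v in idft(product)]

# Largest disagreement between the two routes (floating-point error only).
max_diff = max(abs(a - b) for a, b in zip(time_result, freq_result))
```

Both routes yield the same output up to rounding error. Because the moving average has a smaller frequency response at bin 12 than at bin 2, it also illustrates filtering: the faster component is attenuated more than the slower one.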

Timbre can also be described as the color of a sound, or how music sounds. The definition can be subjective, as there is no definitive way to represent timbre. One way to model timbre is with the spectral envelope, i.e., the best-fit line through the harmonic and inharmonic structure of a signal (its spectral structure). A common approach is to generate the power spectrum (squared magnitude), i.e., the strengths of the frequencies present in a signal. Different ways of creating the spectral envelope can be seen in Figure 1.
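As an illustration of the power spectrum, the sketch below computes the squared magnitude of each frequency bin for a synthetic tone. The tone, its partial amplitudes, and the function name are hypothetical choices for this example, and a naive DFT stands in for the FFT.

```python
import cmath
import math


def power_spectrum(x):
    """Squared magnitude of each DFT bin, for bins 0..N/2 of a real signal."""
    n = len(x)
    spectrum = []
    for k in range(n // 2 + 1):
        bin_value = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(bin_value) ** 2)
    return spectrum


# Hypothetical tone: a fundamental at bin 4 with weaker harmonics at bins 8 and 12.
N = 64
partials = {4: 1.0, 8: 0.5, 12: 0.25}  # bin index -> amplitude (assumed values)
tone = [sum(a * math.sin(2 * math.pi * k * t / N) for k, a in partials.items())
        for t in range(N)]

power = power_spectrum(tone)

# The three strongest bins are the partials; their relative strengths sketch the timbre.
strongest = sorted(sorted(range(len(power)), key=lambda k: power[k], reverse=True)[:3])
```

Here the power concentrates in the three partial bins and falls off with each harmonic; a spectral envelope would be a smooth curve drawn over those peaks.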

Figure 1. Common methods of creating spectral envelopes. Reused with written permission (Schwarz & Rodet, 1999).


Cepstrum (squared magnitude of the Fourier transform of the logarithm of the spectrum), discrete cepstrum, and LPC (Linear Predictive Coding) envelopes are graphed against the original spectrum of an arbitrary signal in Figure 1. The major drawback of the discrete cepstrum envelope is that it is not resilient to noise. It correctly links the peaks of all the partials together; however, it gives no notion of the residual noise between partials (Schwarz & Rodet, 1999). The cepstrum and LPC envelopes apply well to signals with noise, although neither accurately links the peaks of the partials together. The LPC envelope can also be too smooth if too low an order is used (Schwarz & Rodet, 1999).
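A cepstrum-based envelope can be sketched as follows. This is an illustrative approximation under stated assumptions, not the method of Schwarz and Rodet or of this chapter: it uses the real cepstrum (the inverse transform of the log-magnitude spectrum), a common variant of the definition above, and the test signal, the cutoff of 6, the log floor, and all function names are hypothetical.

```python
import cmath
import math
import random


def dft(x, inverse=False):
    """Naive DFT; with inverse=True, the inverse transform (scaled by 1/N)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def cepstral_envelope(x, cutoff):
    """Smooth the log-magnitude spectrum by keeping only low-quefrency cepstral terms."""
    n = len(x)
    # Small floor avoids log(0) in empty bins (an assumption of this sketch).
    log_mag = [math.log(abs(v) + 1e-9) for v in dft(x)]
    cepstrum = dft(log_mag, inverse=True)
    # Symmetric lifter: keep quefrencies below the cutoff and their mirror images.
    liftered = [c if (q < cutoff or q > n - cutoff) else 0.0
                for q, c in enumerate(cepstrum)]
    envelope = [v.real for v in dft(liftered)]
    return log_mag, envelope


def roughness(values):
    """Sum of squared differences between neighboring bins (circular)."""
    n = len(values)
    return sum((values[(k + 1) % n] - values[k]) ** 2 for k in range(n))


# Hypothetical signal: harmonics of bin 4 with decaying amplitude, plus light noise.
N = 64
rng = random.Random(0)
signal = [sum(math.sin(2 * math.pi * 4 * h * t / N) / h for h in range(1, 6))
          + rng.uniform(-0.05, 0.05)
          for t in range(N)]

log_mag, envelope = cepstral_envelope(signal, cutoff=6)
```

Discarding the high-quefrency terms removes the rapid peak-to-valley variation of the log spectrum, which is why the resulting envelope is smooth but, as noted above, does not pass exactly through the peak of each partial.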
