9.1. Structure Of Proposed Perceptual Audio Coder
The structure of the proposed high-quality perceptual audio encoder is shown in Figure 1 (He et al., 2008b). Input PCM audio samples are fed into the encoder. The time-to-frequency mapping creates a sub-sampled representation of the audio samples using the DWPT. The psychoacoustic model calculates the masking thresholds, which are later used to control the quantization and coding. A bit allocation strategy assigns bits to each sub-band sample according to its perceptual importance. Typically, more bits are reserved for low-frequency samples, which are perceptually more important. Quantization is performed so that the quantization noise stays below the audible threshold, which is the condition for transparent audio coding. The bit allocation information is transmitted together with the encoded audio as ancillary data (side information), which the audio decoder uses to reconstruct the PCM audio samples. Lossless coding, usually Huffman coding, is employed to further remove redundancy from the quantized values. The frame packing block packs the output of the quantization and coding block together with the side information and yields the encoded audio stream.
Figure 1. Structure of perceptual audio encoder
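The relation between a noise budget and the quantizer resolution can be illustrated with a minimal sketch. It assumes a plain uniform quantizer and the standard high-resolution approximation that its noise power is about step²/12. This is not the quantizer of the proposed codec, and the function names, the random test signal and the budget value are hypothetical choices made only for illustration.

```python
import math
import random

def step_for_noise_budget(allowed_noise_power):
    """Step size whose modeled noise power (step**2 / 12) equals the budget."""
    return math.sqrt(12.0 * allowed_noise_power)

def quantize(values, step):
    """Map each value to the index of the nearest multiple of step."""
    return [round(v / step) for v in values]

def dequantize(indices, step):
    """Reconstruct values from quantizer indices."""
    return [i * step for i in indices]

def noise_power(values, decoded):
    """Mean squared error between original and reconstructed values."""
    return sum((v - d) ** 2 for v, d in zip(values, decoded)) / len(values)

random.seed(0)
coeffs = [random.uniform(-1.0, 1.0) for _ in range(4096)]  # stand-in sub-band samples
budget = 1e-4                                              # hypothetical allowed noise power

step = step_for_noise_budget(budget)
decoded = dequantize(quantize(coeffs, step), step)
measured = noise_power(coeffs, decoded)

print(f"step = {step:.5f}, measured noise power = {measured:.3e}")
```

The measured noise power should land close to the budget rather than exactly on it, because step²/12 is only a model of the error. An encoder that must guarantee the bound would have to measure the actual distortion and shrink the step where needed.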
Figure 2 shows the decoder of the proposed audio coding scheme. The encoded audio stream is fed into the frame unpacking block, which unpacks the compressed audio stream into the quantized samples and the side information. In the de-quantization and decoding block, Huffman decoding is performed first, followed by de-quantization using the side information extracted by the frame unpacking block. The output is the audio samples in the wavelet domain, which are then transformed into the time domain by the inverse time/frequency mapping block to form the decoded PCM audio samples.
Figure 2. Structure of perceptual audio decoder
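The decoding order described above (Huffman decoding first, then de-quantization) can be sketched as a small round trip. This is an illustration under stated assumptions: the code is built from the symbol counts of the data itself, whereas MPEG-1 Layer III selects among predefined Huffman tables, and the uniform de-quantizer, the step size and the sample values are hypothetical.

```python
import heapq
from collections import Counter

def build_huffman_code(symbols):
    """Return {symbol: bitstring} built from the symbol frequencies."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie breaker, {symbol: partial code}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)
        w1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (w0 + w1, tie, merged))
        tie += 1
    return heap[0][2]

def encode(symbols, code):
    """Concatenate the codewords of all symbols into one bit string."""
    return "".join(code[s] for s in symbols)

def decode(bits, code):
    """Walk the bit string and emit a symbol whenever a codeword completes."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out

def dequantize(indices, step):
    """Uniform de-quantizer (assumed here for simplicity)."""
    return [i * step for i in indices]

quantized = [0, 0, 1, 0, -1, 0, 0, 2, 0, 1, 0, 0, -1, 0, 0, 0]
code = build_huffman_code(quantized)
bits = encode(quantized, code)

recovered = decode(bits, code)              # step 1: Huffman decoding
coeffs = dequantize(recovered, step=0.05)   # step 2: de-quantization

print(len(bits), "bits instead of", 2 * len(quantized), "with a fixed 2-bit code")
```

Because the Huffman stage is lossless, `recovered` equals `quantized` exactly. The only loss in the chain comes from the quantizer.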
The time/frequency mapping block and the psychoacoustic model block are described in chapter 7 as part of the proposed psychoacoustic model, so only the quantization and coding block is explained in the following section.
9.2. Quantization And Huffman Coding
The quantization and Huffman coding employed in the proposed audio codec are similar to those of the MPEG-1 Layer III standard. The inputs to the quantization and Huffman coding block are the spectral values (wavelet coefficients) of the frame, the maximum number of bits available for Huffman coding, the critical band partition table, and the allowed distortion in each critical band (also called a scalefactor band in audio coding).
The maximum number of bits available for Huffman coding for one frame (called a granule in audio coding) is defined as

$$\text{max\_bits} = \frac{\text{bit\_rate} \times \text{granul\_size}}{\text{sampling\_frequency}} \qquad (9.1)$$

where bit_rate is the actual bit rate, granul_size is the number of spectral values in one granule (1024 in our case), and sampling_frequency is the sampling frequency, which is 44.1 kHz for CD-quality audio.
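As a worked example of the per-granule bit budget, the sketch below assumes the budget is the bit rate multiplied by the granule size and divided by the sampling frequency, which is the only dimensionally consistent combination of the three quantities named above. The function name, the rounding down to whole bits and the example bit rates are assumptions made for illustration.

```python
def max_bits_per_granule(bit_rate, granule_size=1024, sampling_frequency=44100):
    """Bits available for one granule, rounded down to a whole number of bits.

    bit_rate is in bits per second, sampling_frequency in Hz.
    """
    return (bit_rate * granule_size) // sampling_frequency

# Hypothetical target bit rates; the source does not state which rates were used.
for rate in (64_000, 128_000):
    print(rate, "bit/s ->", max_bits_per_granule(rate), "bits per granule")
```

At 128 kbit/s this gives 2972 bits for the 1024 spectral values of a granule, or fewer than 3 bits per value on average, which is why the bits must be distributed according to perceptual importance.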
The allowed distortion in each scalefactor band is calculated as

$$x_{min}(sb) = \frac{thrn(sb)}{bw(sb)} \qquad (9.2)$$

where sb is the scalefactor band index, thrn(sb) is the masking threshold estimated by the proposed psychoacoustic model, and bw(sb) is the bandwidth of each scalefactor band, which can be read from Table 1.
Table 1. Scalefactor band partition (sampling frequency 44.1 kHz)
Scalefactor band (sb) | Bandwidth (bw), in spectral values | Start index | End index |
1 | 4 | 1 | 4 |
2 | 4 | 5 | 8 |
3 | 8 | 9 | 16 |
4 | 4 | 17 | 20 |
5 | 4 | 21 | 24 |
6 | 8 | 25 | 32 |
7 | 4 | 33 | 36 |
8 | 4 | 37 | 40 |
9 | 8 | 41 | 48 |
10 | 16 | 49 | 64 |
11 | 8 | 65 | 72 |
12 | 8 | 73 | 80 |
13 | 16 | 81 | 96 |
14 | 16 | 97 | 112 |
15 | 16 | 113 | 128 |
16 | 16 | 129 | 144 |
17 | 16 | 145 | 160 |
18 | 32 | 161 | 192 |
19 | 64 | 193 | 256 |
20 | 32 | 257 | 288 |
21 | 32 | 289 | 320 |
22 | 64 | 321 | 384 |
23 | 128 | 385 | 512 |
24 | 256 | 513 | 768 |
25 | 256 | 769 | 1024 |
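The partition in Table 1 is fully determined by the bandwidth column, since each band starts where the previous one ends. The sketch below rebuilds the start and end indices from the bandwidths and looks up the scalefactor band of a given coefficient. The helper names are hypothetical. The `allowed_distortion` function additionally assumes that the allowed distortion is the masking threshold divided by the bandwidth, by analogy with MPEG-1 Layer III; treat that form as an assumption rather than as the chapter's definition.

```python
from bisect import bisect_right

# Bandwidth column of Table 1 (number of spectral values per scalefactor band).
BANDWIDTHS = [4, 4, 8, 4, 4, 8, 4, 4, 8, 16, 8, 8, 16, 16, 16, 16, 16,
              32, 64, 32, 32, 64, 128, 256, 256]

def build_partition(bandwidths):
    """Return rows (sb, bw, start, end) with 1-based, inclusive indices."""
    rows, start = [], 1
    for sb, bw in enumerate(bandwidths, start=1):
        rows.append((sb, bw, start, start + bw - 1))
        start += bw
    return rows

PARTITION = build_partition(BANDWIDTHS)
STARTS = [row[2] for row in PARTITION]

def band_of(index):
    """Scalefactor band (1-based) that contains the 1-based coefficient index."""
    if not 1 <= index <= PARTITION[-1][3]:
        raise ValueError("coefficient index out of range")
    return bisect_right(STARTS, index)

def allowed_distortion(thresholds):
    """Per-band allowed distortion, ASSUMING threshold / bandwidth."""
    return [thr / bw for thr, bw in zip(thresholds, BANDWIDTHS)]

print(PARTITION[18])   # row for scalefactor band 19
print(band_of(512), band_of(513))
```

The bandwidths sum to 1024, so the 25 bands cover one granule exactly, with no gaps or overlaps.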