My Second Brain Repository: Human speech recognition techniques (Cepstrum and MFCC based)

1. Introduction

The movie ‘Iron Man’ features an artificial intelligence computer, "Jarvis", that comprehends human language perfectly, as shown in Fig. 1. Jarvis can be commanded by voice, without a separate input device such as a keyboard or mouse. Speech recognition is the most basic technology for realizing a system like Jarvis. To create a machine that recognizes human speech, it is necessary to understand the principle of voice generation and the characteristics of voice signals. This article describes Mel-frequency cepstral coefficients (MFCCs), the most widely used features for analyzing the characteristics of human speech signals.

Fig 1. Jarvis in the Iron Man movie

2. The generation of speech sound

Speech sound originates from vibrations of the vocal cords, as shown in Fig. 2. As a person breathes out, air passes quickly through the airway. The air flowing between the vocal cords makes them vibrate because of the Bernoulli effect. The opening between the pair of vocal cords is the glottis, where the sound waves are first generated. The average vibration frequency of the vocal cords is about 120 Hz for men and about 230 Hz for women. This is the fundamental frequency, and the harmonic components appear at integer multiples of it.

Fig 2. Structure of vocal organ
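As a small numeric illustration of the harmonic structure described above, the sketch below lists the frequencies that sit at integer multiples of the fundamental. The function name `harmonic_frequencies` and the 1 kHz upper limit are assumptions made for this example; 120 Hz and 230 Hz are the average values quoted in the text.

```python
def harmonic_frequencies(f0_hz, max_hz):
    """Return f0, 2*f0, 3*f0, ... for every multiple that does not exceed max_hz."""
    count = int(max_hz // f0_hz)
    return [k * f0_hz for k in range(1, count + 1)]

# Average fundamentals quoted in the text; the 1 kHz limit is arbitrary.
male_harmonics = harmonic_frequencies(120.0, 1000.0)
female_harmonics = harmonic_frequencies(230.0, 1000.0)

print(male_harmonics)
print(female_harmonics)
```

A lower fundamental packs more harmonics into the same band, which is why the spectral ripples of a low-pitched voice are more closely spaced.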

The sound waves generated at the vocal cords and glottis pass through the vocal tract and are shaped into a speech signal, as shown in Fig. 3. The vocal tract is the path the sound travels, made up of the throat, mouth, oral cavity, nasal cavity, tongue, and so on. Depending on the shape of the vocal tract, the speech sound is produced in syllable units. In terms of digital signal processing, the shape of the vocal tract can be regarded as a kind of transfer function, which we call the vocal tract transfer function. The excitation signal generated at the glottis is the source of the voice, but it is not important for speech recognition. Rather, the main purpose of speech recognition is to identify the state of the vocal tract, from which the shape of the oral cavity and the position of the tongue can be estimated. In other words, it is essential to extract the frequency envelope of the speech signal and to remove the excitation signal.

Fig 3. The process of speech sound generation

3. Characteristics of speech signal

Speech sound is generated by applying the excitation signal, produced by vibrations of the vocal cords, to the vocal tract transfer function. In the field of speaker recognition, the excitation signal can be an important cue to the identity of the speaker. In the field of speech recognition, however, only the envelope of the speech signal, which is determined by the vocal tract transfer function, is important. Therefore, in order to recognize speech, it is necessary to extract the envelope of the speech signal and to suppress the characteristics of the individual speaker as much as possible.

Figure 4 shows one frame of a speech signal in the time domain and in the frequency domain. The frequency-domain representation is obtained by an FFT operation. As shown in Fig. 4, there are many ripples in the frequency-domain signal. These ripples are caused by vibrations of the vocal cords and, for speech recognition, are treated as noise components that interfere with recognition. Therefore, it is necessary to focus on the spectral envelope of the speech signal in order to increase the recognition rate.

Fig 4. Speech signal in time domain and frequency domain
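The ripples described above can be reproduced with a toy signal. The sketch below builds one frame from ten harmonics of a fundamental and inspects its FFT magnitude. It is not real speech: the sampling rate, frame length, 1/k amplitude roll-off, and the 125 Hz fundamental (chosen so that each harmonic lands exactly on an FFT bin) are all assumptions made for illustration.

```python
import numpy as np

fs = 8000      # sampling rate in Hz (assumed)
n = 512        # frame length in samples (assumed)
f0 = 125.0     # fundamental in Hz, chosen so harmonics fall on FFT bins

t = np.arange(n) / fs
# Toy voiced frame: ten harmonics of f0 with amplitudes falling off as 1/k.
frame = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 11))

# Frequency-domain view of the windowed frame.
magnitude = np.abs(np.fft.rfft(frame * np.hamming(n)))
freqs = np.fft.rfftfreq(n, d=1.0 / fs)

# The strongest spectral line sits at the fundamental.
strongest_hz = freqs[np.argmax(magnitude)]

# Compare the spectrum on the harmonics with the spectrum halfway between them.
bin_step = int(round(f0 * n / fs))              # bins between adjacent harmonics
harmonic_bins = np.arange(1, 10) * bin_step
between_bins = harmonic_bins + bin_step // 2
ripple_ratio = magnitude[harmonic_bins].min() / magnitude[between_bins].max()

print(strongest_hz, ripple_ratio)
```

The large ratio between the harmonic peaks and the valleys between them is the ripple that the later processing steps try to remove.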

4. Speech signal analysis using Cepstrum

First of all, a person's voice is generated by the vibration of the vocal cords (the excitation signal) and becomes speech that expresses language as it passes through the vocal tract. From the viewpoint of digital signal processing, the excitation signal of the vocal cords can be assumed to pass through a digital filter described by the vocal tract transfer function, as shown in Fig. 5.

Fig 5. Speech generation from the perspective of digital filter
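The filter view in Fig. 5 can be sketched in a few lines. The example below is a toy model under stated assumptions, not a realistic voice: the excitation is an idealized impulse train, and the vocal tract is a single damped resonance at 700 Hz. The helper names `impulse_train` and `resonator_impulse_response`, and all numeric parameters other than the 120 Hz pitch, are hypothetical.

```python
import numpy as np

def impulse_train(f0_hz, fs_hz, n_samples):
    """Idealized excitation: one unit pulse per pitch period."""
    period = int(round(fs_hz / f0_hz))
    x = np.zeros(n_samples)
    x[::period] = 1.0
    return x

def resonator_impulse_response(center_hz, bandwidth_hz, fs_hz, n_taps):
    """Impulse response of one damped resonance, standing in for the vocal tract."""
    n = np.arange(n_taps)
    decay = np.exp(-np.pi * bandwidth_hz * n / fs_hz)
    return decay * np.cos(2 * np.pi * center_hz * n / fs_hz)

fs = 8000                                              # sampling rate in Hz (assumed)
x = impulse_train(120.0, fs, 1024)                     # excitation x(n)
h = resonator_impulse_response(700.0, 100.0, fs, 256)  # filter h(n)

# Speech-like output: the excitation passed through the filter.
s = np.convolve(x, h)[:len(x)]
```

Each pulse of the excitation launches one copy of `h`, so the output repeats at the pitch period while its spectral shape is set by the filter.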

Thus, the speech signal s(n) is the convolution of the excitation signal x(n) with the impulse response h(n) of the vocal tract transfer function, which in the frequency domain becomes a multiplication of X(f) and H(f):

s(n) = x(n) * h(n)   <=>   S(f) = X(f) H(f)

Here, h(n) is what we want to extract for speech recognition, because the shape of the lips and the position of the tongue can be estimated from it. However, it is difficult to separate the two signals from S(f), which is their product. The cepstrum addresses this by taking the logarithm of both sides, which converts the multiplication into an addition: log|S(f)| = log|X(f)| + log|H(f)|. Two signals combined additively are much easier to separate, because the slowly varying envelope and the rapidly varying excitation ripples end up in different regions of the cepstrum. As shown in Fig. 6, the excitation signal and the spectral envelope can be separated successfully in this way. Then ‘liftering’, the cepstral-domain counterpart of ‘filtering’ in general digital signal processing, can be used to extract only the envelope region of the cepstrum.

Fig 6. Separation of envelope and excitation signal by using cepstrum
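The two identities behind the cepstrum, convolution becoming multiplication and the logarithm turning that product into a sum, can be checked numerically. In the sketch below, random sequences stand in for the excitation and the vocal tract response; they are arbitrary placeholders, not speech, and the FFT length of 128 is an assumption chosen to be at least as long as the convolution result.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)   # placeholder for the excitation x(n)
h = rng.standard_normal(16)   # placeholder for the vocal tract response h(n)
s = np.convolve(x, h)         # s(n) = x(n) * h(n), length 64 + 16 - 1 = 79

# Zero-pad every transform to the same length (>= len(s)) so that the
# DFT product corresponds to linear, not circular, convolution.
n_fft = 128
X = np.fft.rfft(x, n_fft)
H = np.fft.rfft(h, n_fft)
S = np.fft.rfft(s, n_fft)

# Convolution in time is multiplication in frequency.
product_ok = np.allclose(S, X * H)

# Taking the log magnitude turns the product into a sum.
sum_ok = np.allclose(np.log(np.abs(S)),
                     np.log(np.abs(X)) + np.log(np.abs(H)))

print(product_ok, sum_ok)
```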

The entire process of speech signal analysis using the cepstrum is illustrated in Fig. 7. As described above, it consists of Hamming windowing, a DFT (FFT) for the frequency-domain transformation, a log operation, and an inverse DFT (IFFT). The result is a set of cepstrum coefficients that can be used for pattern matching or classification.

Fig 7. Cepstrum process of speech signal
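The chain in Fig. 7 can be sketched as follows, assuming the common "real cepstrum" definition (inverse FFT of the log magnitude spectrum). The test signal is a toy impulse train with a 64-sample period passed through a single resonance; the helper names `real_cepstrum` and `lifter_low`, the cutoff of 30 coefficients, and all signal parameters are assumptions for illustration.

```python
import numpy as np

def real_cepstrum(frame):
    """Hamming window -> FFT -> log magnitude -> inverse FFT."""
    windowed = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.fft(windowed)) + 1e-12)  # floor avoids log(0)
    return np.fft.ifft(log_mag).real

def lifter_low(cepstrum, cutoff):
    """Keep quefrencies below `cutoff` (and their mirror image); zero the rest."""
    kept = np.zeros_like(cepstrum)
    kept[:cutoff] = cepstrum[:cutoff]
    if cutoff > 1:
        kept[-(cutoff - 1):] = cepstrum[-(cutoff - 1):]
    return kept

# Toy voiced signal: impulse train (period 64 samples) through one resonance.
fs, period, n = 8000, 64, 512
excitation = np.zeros(4 * n)
excitation[::period] = 1.0
taps = np.arange(256)
h = np.exp(-np.pi * 200.0 * taps / fs) * np.cos(2 * np.pi * 700.0 * taps / fs)
signal = np.convolve(excitation, h)
frame = signal[n:2 * n]       # one frame taken after the start-up transient

cep = real_cepstrum(frame)

# Excitation: the largest high-quefrency peak marks the pitch period.
pitch_period = 30 + int(np.argmax(cep[30:n // 2]))

# Envelope: transform the low-quefrency part back to a smooth log spectrum.
envelope_log_spectrum = np.fft.fft(lifter_low(cep, 30)).real

print(pitch_period, fs / pitch_period)
```

The low-quefrency coefficients describe the envelope, while the isolated peak further out recovers the pitch period of the excitation.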

5. Speech signal analysis using Mel-Frequency Cepstrum Coefficients (MFCC)

We have already seen that cepstral analysis by itself yields features that can be used to recognize speech. Why, then, do we need Mel-frequency cepstrum coefficients (MFCC)? The only difference between the cepstrum and MFCC is that the MFCC process includes a Mel-filter bank.

The Mel-filter bank is designed to mimic sound scale perception of the human ear. Therefore, the Mel-filter bank allocates fewer sub-filters in the high-frequency range and a larger number of sub-filters in the lower-frequency range as shown in Fig. 8. A detailed description of the Mel-Scale and Mel-filter banks is available at the following link:

http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/

Fig 8. Structure of Mel-filter bank
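A triangular Mel-filter bank of the kind shown in Fig. 8 can be sketched as below. The conversion formula m = 2595 log10(1 + f/700) is one common convention for the Mel scale and is assumed here; the function names, the choice of 20 filters, the 512-point FFT, and the 8 kHz sampling rate are likewise assumptions for this example.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Common Mel-scale convention (assumed): m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, fs_hz):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs_hz / 2.0), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bins = np.floor((n_fft + 1) * hz_edges / fs_hz).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):      # rising slope
            bank[i, k] = (k - left) / (center - left)
        for k in range(center, right):     # peak and falling slope
            bank[i, k] = (right - k) / (right - center)
    return bank, hz_edges

bank, hz_edges = mel_filter_bank(n_filters=20, n_fft=512, fs_hz=8000)

# Count filter centers in the lower and upper halves of the 0-4 kHz band.
centers_hz = hz_edges[1:-1]
low_count = int(np.sum(centers_hz < 2000.0))
high_count = int(np.sum(centers_hz >= 2000.0))
print(low_count, high_count)
```

Multiplying `bank` by the power spectrum of a frame gives one energy value per filter. Because the edges are equally spaced in Mel, the filters are narrow and dense at low frequencies and wide and sparse at high frequencies.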

The entire MFCC process is illustrated in Fig. 9. As mentioned above, it is almost identical to the cepstrum process; the only addition is the Mel-filter bank. In the MFCC process, the 'IDFT (IFFT)' step can be replaced by a 'Discrete Cosine Transform (DCT)'. A disadvantage of the IDFT (IFFT) is that its result is complex-valued, so an imaginary term appears again. In contrast, the DCT produces only real coefficients and is faster to compute. In addition, because the DCT coefficients are ordered from low to high frequency, the envelope components, which are concentrated in the low-order coefficients, can be separated effectively.

Fig 9. Entire process of MFCC
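The DCT step can be illustrated on its own. The sketch below assumes the unnormalized DCT-II, a common choice for MFCCs (the text above only says "DCT"), and applies it to a made-up, smoothly varying vector of 26 log filter-bank energies. The function name `dct_ii`, the input values, and the choice of keeping 13 coefficients are assumptions.

```python
import numpy as np

def dct_ii(values):
    """Unnormalized DCT-II: C[k] = sum_n v[n] * cos(pi * k * (2n + 1) / (2N))."""
    v = np.asarray(values, dtype=float)
    size = len(v)
    k = np.arange(size)[:, None]
    n = np.arange(size)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * size))
    return basis @ v

# Hypothetical smooth log energies from 26 Mel filters (not measured data).
log_energies = np.log(1.0 + np.linspace(10.0, 1.0, 26))

coeffs = dct_ii(log_energies)
mfcc = coeffs[:13]            # keep only the low-order coefficients

# The DCT output stays real, while a general inverse FFT returns complex values.
dct_is_real = np.isrealobj(coeffs)
ifft_is_complex = np.iscomplexobj(np.fft.ifft(log_energies))

# Share of the squared coefficient magnitude held by the first 13 coefficients.
low_order_share = np.sum(mfcc ** 2) / np.sum(coeffs ** 2)
print(dct_is_real, ifft_is_complex, low_order_share)
```

For a smooth input like this one, nearly all of the energy falls into the low-order coefficients, which is why truncating the DCT output keeps the envelope and discards fine detail.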

Finally, what are the advantages of using MFCC in speech recognition? For human speech recognition, we need to find the characteristics shared by everyone who speaks the same language, and recognition becomes difficult when the pronunciation characteristics of each individual are reflected in the features. Thanks to the Mel-filter bank in the MFCC process, subtle differences in pronunciation that stem from individual characteristics can be assumed to be largely smoothed out. In addition, because the high-frequency range is represented only coarsely, surrounding high-frequency noise picked up by the microphone during recording tends to be suppressed as well. Therefore, although the cepstrum alone can be used for human speech recognition, MFCC is used in more applications.

6. Conclusion

1. The speech sound is generated by applying an excitation signal generated by vibrations of the vocal cords to the vocal tract transfer function.

2. Both cepstrum and MFCC can be used for human speech recognition.

3. The only difference between cepstrum and MFCC is that there is a Mel-filter bank in the MFCC process.

4. Thanks to the Mel-filter bank in the MFCC process, subtle differences in pronunciation that stem from individual characteristics can be largely smoothed out. The filter bank can also suppress high-frequency noise picked up during voice recording. Because of these features, MFCC can considerably improve the speech recognition rate.

End.

Friday, January 6, 2017
