I recently posted a question about the MFCC algorithm to the Audio Engineering Society on LinkedIn.
The question was:
Question about speech recognition by using MFCC algorithm
MFCC (Mel-Frequency Cepstral Coefficients) is one of the most widely used feature extraction methods for speech recognition. MFCC mimics how humans perceive sound and decomposes the speech signal into the Mel-frequency domain.
Here is the question. When implementing a real speech recognition system, the listener that recognizes the voice is not a person but a microphone. The voice is captured by the microphone, and the signal processing is performed by a CPU or DSP. In this process, the use of a Mel-filter bank seems to distort the original signal.
In my opinion, converting the voice signal received through the microphone to the frequency domain or to the cepstrum, without using a Mel-filter bank, could be a way of using all the meaningful information. What are the advantages of converting to the Mel-frequency domain instead of processing the original audio signal acquired through the microphone? Please let me know about this. Thank you.
It has been two weeks since I posted the question, and no one has replied. Thanks to that, though, I started looking for the answer myself, and I am leaving a summary below for others.
First of all, a person's voice begins as the vibration of the vocal cords (the excitation signal) and becomes speech that expresses language as it passes through the vocal tract. From a signal-processing viewpoint, the excitation signal can be modeled as passing through a filter described by the vocal tract transfer function.
Thus, the speech signal s(n) is the convolution of the excitation signal x(n) with the vocal tract impulse response h(n). In the frequency domain this becomes a multiplication: S(f) = X(f)H(f).
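The relation between convolution in time and multiplication in frequency can be checked numerically. The following is a minimal sketch using NumPy. The random signals are only stand-ins for x(n) and h(n), not real speech data, and the variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)   # stand-in for the excitation signal x(n)
h = rng.standard_normal(16)   # stand-in for the vocal tract impulse response h(n)

# Time domain: s(n) is the linear convolution of x(n) and h(n).
s = np.convolve(x, h)

# Frequency domain: zero-pad every FFT to len(s) so that the
# product X(f) H(f) corresponds to linear (not circular) convolution.
n_fft = len(s)
S = np.fft.fft(s, n_fft)
X = np.fft.fft(x, n_fft)
H = np.fft.fft(h, n_fft)

# Largest deviation between S(f) and X(f) H(f); should be near machine precision.
max_error = np.max(np.abs(S - X * H))
print("max |S - X*H| =", max_error)
```

The zero-padding matters: without it, the product of the two FFTs would correspond to a circular convolution and would not match s(n).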
Here, h(n) is what we want to extract for speech recognition, because the shape of the lips and the position of the tongue can be estimated from it. However, it is difficult to separate the two components from S(f), which is their product. This is where the cepstrum is useful: taking the logarithm of the magnitude on both sides turns the multiplication into an addition, log|S(f)| = log|X(f)| + log|H(f)|. Two components combined additively are much easier to separate. The vocal tract response varies slowly across frequency while the excitation varies quickly, so after an inverse transform they tend to occupy different regions of the cepstrum.
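As a rough illustration of this additivity, the sketch below computes the real cepstrum (the inverse FFT of the log magnitude spectrum) of s(n), x(n), and h(n) and checks that the first equals the sum of the other two. The noise excitation, the decaying-cosine impulse response, and the cutoff `n_keep` are all assumptions chosen for the example, not values from a real recognizer.

```python
import numpy as np

def real_cepstrum(signal, n_fft):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(signal, n_fft)
    return np.fft.ifft(np.log(np.abs(spectrum))).real

rng = np.random.default_rng(0)
x = rng.standard_normal(256)                       # stand-in excitation x(n)
n = np.arange(32)
h = np.exp(-n / 10.0) * np.cos(0.2 * np.pi * n)    # stand-in impulse response h(n)

s = np.convolve(x, h)      # speech-like signal s(n)
n_fft = len(s)             # long enough that S(f) = X(f) H(f) holds exactly

c_s = real_cepstrum(s, n_fft)
c_x = real_cepstrum(x, n_fft)
c_h = real_cepstrum(h, n_fft)

# Because log|S| = log|X| + log|H|, the cepstra add up as well.
additivity_error = np.max(np.abs(c_s - (c_x + c_h)))
print("max |c_s - (c_x + c_h)| =", additivity_error)

# Simple low-quefrency lifter: keep the first n_keep coefficients and
# their mirror images (the real cepstrum of a real signal is symmetric).
n_keep = 20                # hypothetical cutoff, chosen for illustration
lifter = np.zeros(n_fft)
lifter[:n_keep] = 1.0
lifter[-(n_keep - 1):] = 1.0
c_low = c_s * lifter       # slowly varying (envelope-like) part of log|S(f)|
```

The lifter at the end is the usual way to exploit the additive form: the low-quefrency coefficients are commonly taken as an approximation of the vocal tract contribution, while the excitation mostly lives at higher quefrencies.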
So why do we need MFCC? MFCC uses a Mel-filter bank that allocates more, narrower filters to the low-frequency range and fewer, wider filters to the high-frequency range. In this process, it can be assumed that subtle differences in pronunciation that depend on the individual speaker are smoothed out. For speech recognition, we need to find the characteristics shared by everyone who speaks the same language, and recognition becomes more difficult when each individual's pronunciation characteristics are reflected in the features. In addition, because the wide high-frequency filters average over broad bands, fine detail in that range is suppressed, which can also reduce the effect of high-frequency ambient noise picked up when the voice is recorded through a microphone.
Therefore, the plain cepstrum can be used for speech recognition, but MFCC is the more commonly used of the two.
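To see how the filters are distributed, the sketch below places filter edges at equal steps on the Mel scale and converts them back to Hz. It assumes one common Mel formula, mel = 2595 log10(1 + f/700); other variants exist. The 8 kHz sampling rate and the 20 filters are arbitrary choices for illustration.

```python
import numpy as np

def hz_to_mel(f_hz):
    """One common Hz-to-Mel mapping (an assumption; other variants exist)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

fs = 8000          # assumed sampling rate in Hz
n_filters = 20     # assumed number of Mel filters

# n_filters triangular filters need n_filters + 2 edge points,
# equally spaced on the Mel scale between 0 Hz and fs / 2.
mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filters + 2)
hz_edges = mel_to_hz(mel_edges)

centers = hz_edges[1:-1]                    # center frequency of each filter
bandwidths = hz_edges[2:] - hz_edges[:-2]   # base width of each triangle

# Count how many filter centers fall in the lower vs upper half of the band.
low_count = int(np.sum(centers < fs / 4))
high_count = n_filters - low_count

print("centers (Hz):", np.round(centers, 1))
print("bandwidths (Hz):", np.round(bandwidths, 1))
print("filters below", fs // 4, "Hz:", low_count, "above:", high_count)
```

Under these assumptions, most filter centers land in the lower half of the band and the filter widths grow steadily with frequency, which is the uneven resolution described above.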