How to calculate the MFCC?

Hello. I recently decided to learn about voice recognition. After some brief research I came across MFCCs (mel-frequency cepstral coefficients), so I decided to study them and found this material through a Google search:
- http://aquarius.ime.eb.br/~apolin/papers/Carlos_uff_2007.pdf
- http://www.practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/#eqn1

From this reading I understood the following:

1 - First we need to calculate the FFT (Fast Fourier Transform) of a signal to obtain the frequencies present in that signal (which in this case is the sound).

2 - Apply the pre-emphasis filter to eliminate frequency instability.

3 - Window the signal, splitting the voice signal into small parts of 20 to 40 ms each.

4 - Calculate the MFCC of each signal window.

I don't know whether the steps above are correct, so please correct me if they are not.

About the calculations, my questions are about the values used and the meaning of the variables:

  • Pre-emphasis filter
    $H(z) = 1 - a z^{-1}, \quad 0.9 \le a \le 1.0$

What are z and a, and why does a have to be between 0.9 and 1.0?

  • Windowing
    $h(n) = 0.54 - 0.46 \cos\left(\frac{2 \pi n}{N - 1}\right)$

What is n? As I understand it, N is the total number of samples; again, correct me if I am wrong.

  • MFCC
    $P(i) = \sum_{k=0}^{N/2} |S(k,m)|^2 \, H_i\left(k \cdot \frac{2\pi}{N}\right)$

Here I admit that I understood absolutely nothing; I would be grateful for a good explanation.

And one last question (there must be many more, but I can't remember them right now):

If I understand correctly, the result of the MFCC calculation is a vector of values, so for recognition I just have to calculate the MFCCs of two signals and compare these vectors?

I am very much a layman in physics and not very good at math, so forgive me if I'm wrong or if my questions are too basic.

1 answer

  • Pre-emphasis filter - As the name suggests, this filter emphasizes the higher frequencies. It is especially useful when the signal (voice) carries some kind of noise: the lower frequencies, where the noise probably is, are attenuated while the higher frequencies gain emphasis (amplitude increase). In the equation, z^-1 stands for a one-sample delay of the input signal (voice, music, etc.), so in the time domain the filter is y[n] = x[n] - a*x[n-1]. The closer a is to 1, the stronger the emphasis on the higher frequencies.

  • Windowing - Yes, N is the total number of samples and n is the index of the current sample (the current loop iteration). Pseudocode to build the window from the equation:

code:

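double PI = 3.14159265358979;
int N = 2048;
double janela[N];  // Hamming window coefficients
for (int n = 0; n < N; n++) {
    // the denominator is (N - 1); without the parentheses this would compute n/N and then subtract 1
    janela[n] = 0.54 - 0.46 * cos(2 * PI * n / (N - 1));
}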
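
The two steps above (pre-emphasis and windowing) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this example, the value a = 0.97 is just one commonly used choice inside the 0.9 to 1.0 range, and the constant test frame is hypothetical.

```python
import math

def pre_emphasis(x, a=0.97):
    """Apply y[n] = x[n] - a * x[n-1]; the first sample is passed through unchanged."""
    # a = 0.97 is an assumed value; anything in 0.9..1.0 fits the formula in the question
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]

def hamming(N):
    """Hamming window h(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)) for n = 0..N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

# Hypothetical test frame: a constant (0 Hz) signal, which pre-emphasis almost cancels
frame = [1.0] * 8
emphasized = pre_emphasis(frame)
window = hamming(len(frame))

# Windowing is a sample-by-sample multiplication of the frame by the window
windowed = [s * w for s, w in zip(emphasized, window)]
```

Because the test frame contains only the lowest possible frequency, every sample after the first drops to 1 - a, which shows the attenuation of low frequencies described above.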

About the equation:

$P(i) = \sum_{k=0}^{N/2} |S(k,m)|^2 \, H_i\left(k \cdot \frac{2\pi}{N}\right)$

S(k, m) is the spectrum, that is, the output of your FFT function, where k indexes the frequency component and m the frame (window). The first part of the equation is equivalent to abs(fft(voz.*janela, N)).^2; that is to say, |S(k,m)|² == abs(fft(voz.*janela, N)).^2

The next part of the equation, Hi(k*(2Pi/N)), represents the triangular filter bank that is multiplied by the squared magnitude of the spectrum. Note that the filters in this bank should be spaced according to the mel scale.
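
As an illustration of what such a filter bank could look like, here is a minimal Python sketch. It rests on several assumptions that the answer does not state: the common Hz-to-mel conversion m = 2595*log10(1 + f/700), 26 filters, and the bin mapping floor((N + 1) * f / fs). The function names are hypothetical as well.

```python
import math

def hz_to_mel(f):
    # Assumed conversion formula; other mel definitions exist
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters, N, fs):
    """Return num_filters triangular filters, each a list of N//2 + 1 weights (one per FFT bin)."""
    # num_filters + 2 points equally spaced on the mel scale between 0 Hz and fs/2
    mel_max = hz_to_mel(fs / 2.0)
    mel_points = [i * mel_max / (num_filters + 1) for i in range(num_filters + 2)]
    # Convert each point back to Hz, then to an FFT bin index
    bins = [int(math.floor((N + 1) * mel_to_hz(m) / fs)) for m in mel_points]
    filters = []
    for i in range(1, num_filters + 1):
        # Assumes N is large enough that left < center < right for every filter
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        H = [0.0] * (N // 2 + 1)
        for k in range(left, center):
            H[k] = (k - left) / (center - left)    # rising edge
        for k in range(center, right + 1):
            H[k] = (right - k) / (right - center)  # peak of 1 at center, then falling edge
        filters.append(H)
    return filters

filters = mel_filterbank(num_filters=26, N=2048, fs=44100)
```

Each entry of `filters` plays the role of one Hi: a triangle that is narrow at low frequencies and wider at high frequencies, because the center points are equally spaced in mel rather than in hertz.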

More details on the argument k*(2Pi/N) in the second part of the equation: k is the index of the current iteration (the spectral component) and N is still the total number of samples. This notation is uncommon in articles that define how to work with triangular filter banks (I had actually never seen it used to define the filters), and I honestly don't know whether the author just wanted to make it look complicated, haha. The expression k*(2Pi/N) gives the frequency that corresponds to each spectral component, but it makes everything harder to follow because it expresses the periodicity of each component in radians (2*pi). I won't go into the merits; look at my answer here to understand how the FFT maps the corresponding frequencies onto each component of the spectrum. Once you have read it, you will better understand what the author wanted to represent.

So Hi is evaluated at the frequency of each component of the spectrum. The author wrote it this way so that you can later select the frequency bands of your spectrum and then simply apply the triangular filter. But let's test it: imagine your FFT has a size of 2048 and your audio is sampled at 44100 Hz. What are the periodicity and the frequency of the first component? The easy way (not the way described in the equation above):

1/2048 = 4.8828e-04

That is to say, component 1 has a periodicity of 4.8828e-04 (cycles per sample). Converting it to hertz:

4.8828e-04 *  44100 = 21.5332

Checking whether the article formula matches:

1*2*pi/2048 = 0.0031 

The first component of the spectrum has a periodicity of 0.0031 radians (per sample), but in that form it is hard to see whether this is true, so let me convert radians back to cycles per sample:

0.0031 * 1/(2*pi) = 4.9338e-04

Nice, the periodicity is very close to the result of my equation (the small difference comes only from rounding 2*pi/2048 to 0.0031; without rounding the two are identical). Now to hertz:

4.9338e-04 * fs = 21.7581
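
The same check can be written as a short Python sketch without the intermediate rounding. The function names are hypothetical and chosen only for this illustration.

```python
import math

def bin_frequency_hz(k, N, fs):
    """Frequency in Hz of FFT component k, computed directly as k * fs / N."""
    return k * fs / N

def bin_frequency_hz_via_radians(k, N, fs):
    """Same frequency, going through the article's k * (2*pi/N) radians per sample."""
    omega = k * (2 * math.pi / N)              # radians per sample
    cycles_per_sample = omega / (2 * math.pi)  # back to cycles per sample
    return cycles_per_sample * fs

N, fs = 2048, 44100
direct = bin_frequency_hz(1, N, fs)
via_radians = bin_frequency_hz_via_radians(1, N, fs)
print(direct, via_radians)
```

Both routes give about 21.53 Hz for the first component, which confirms that the article's formula and the easy way describe the same mapping.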

Don't forget the sigma (Σ): at every iteration you add up the products, starting from index 0 and going up to N/2 ...
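
To make the summation concrete, here is a minimal Python sketch of P(i) for a single filter. It is only an illustration: it uses a naive DFT instead of a real FFT so that it needs no external library, and the 16-sample cosine frame and the small triangular filter are hypothetical values picked for the example.

```python
import cmath
import math

def power_spectrum(frame):
    """|S(k)|^2 for k = 0..N/2, using a naive O(N^2) DFT (an FFT gives the same values)."""
    N = len(frame)
    spectrum = []
    for k in range(N // 2 + 1):
        s = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        spectrum.append(abs(s) ** 2)
    return spectrum

def filter_energy(power, H):
    """P(i) = sum over k = 0..N/2 of |S(k)|^2 * H_i(k) for one filter H_i."""
    return sum(p * h for p, h in zip(power, H))

# Hypothetical frame: a 16-sample cosine that falls exactly on component k = 2
N = 16
frame = [math.cos(2 * math.pi * 2 * n / N) for n in range(N)]
power = power_spectrum(frame)

# Hypothetical triangular filter centred on k = 2 (one weight per k = 0..8)
H = [0.0, 0.5, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
P = filter_energy(power, H)
```

Since all the energy of this frame sits on component 2, where the filter weight is 1, P ends up equal to the power of that single component. In a full MFCC computation the same sum would be repeated for every filter and every frame.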

If I understand correctly, the result of the MFCC calculation is a vector of values, so for recognition I just have to calculate the MFCCs of two signals and compare these vectors?

A: Simply put, the answer is yes...

  • More questions: 1 - Do I have to compute the FFT of each window and run it through the equation? 2 - Could you explain the second part of the equation a little more?

  • @Alexsanderss Yes, you have to apply the FFT to each frame and use the equations. I expanded the answer to address your question; check it again.
