Musical similarity from MPEG-7 descriptor patterns

I am doing research in the area of musical similarity for music recommendation and created a test database with 1,000 songs. I would like to create a playlist with 10 songs similar to a song chosen among these thousand, using MPEG-7 descriptors.

I extracted 30 seconds of each song so that the MPEG-7 descriptors yield a feature pattern per song. I do not want to classify the songs into genres, only to create a playlist whose rhythm is similar to that of the selected song. If a slow song is selected, the playlist should contain slow songs (the ten most similar).

I tried extracting some descriptors, such as Audio Spectrum Centroid and Audio Power, each of which gave me a vector. However, comparing these vectors with the Euclidean distance in MATLAB did not give a very good similarity measure. How can I build this prototype using the MPEG-7 descriptors?

For both the Audio Spectrum Centroid and the Audio Power descriptors, the 30-second excerpt yields a feature vector of 1,000 positions, because the audio features are extracted every 30 ms. My question is precisely whether I am using the right descriptors for this type of work.

<AudioDescriptor xsi:type="AudioSpectrumCentroidType">
    <SeriesOfScalar hopSize="PT30N1000F" totalNumOfSamples="1002">
        <Raw> 0.0 -1.6024935 -0.6072393 -0.7425593 -0.8901987 -1.1543454 -1.1027017 -0.64731646 -0.96495366 -1.022632 -0.8076392 -0.993545 -0.66203004 -0.93275607 -1.1654149 -0.9243456 -1.2580872 -1.339062 -1.5594536 -1.5959411 -1.7814989 -1.6296328 -1.5468938 -1.0133578 -1.250789 -1.0111073......</Raw> 
    </SeriesOfScalar>
</AudioDescriptor> 

This is one of the extracted features, the Audio Spectrum Centroid, which indicates the center of gravity of the spectrum. But I’m not sure which descriptors to use or how to use them.

  • Where is your programming question?

  • That’s right. The concept is very interesting, but it would be best to show your code for "I tried to extract some descriptors" and "using the calculation of Euclidean distance".

  • Please edit the question to include this information.

  • Hello. Your question is interesting, and whether it falls within the scope of the site is under discussion here.

  • I don’t know much about the audio processing domain, so I won’t risk writing an answer. But I don’t think Euclidean distance is appropriate for features like the ones you mentioned. Have you tried using the normalized scalar product (also called cosine similarity) to measure the similarity between your vectors? Here is a suggested calculation in MATLAB: http://stackoverflow.com/a/14340447/2896619

  • Hello Luiz, I will look into that, yes... thank you very much. My other doubt is which descriptors to use for this type of project.

  • You’re welcome. About the descriptors, I imagine that the Fast Fourier Transform and/or eigenvectors are also used. You can check this Wikipedia link: http://en.wikipedia.org/wiki/Musical_similarity

  • I would suggest trying dimensionality reduction (using PCA) before calculating the distance. You may get some improvement. That is, if the descriptors are good, of course.


1 answer

The subject is complex, and it is hard to answer without going into some detail. I will try to address the points raised in the question in a simple way. Here goes:

Something extremely relevant when comparing rhythms is knowing how many beats per minute (bpm) the audio has. To perform this kind of analysis you need a larger window: 30 milliseconds is too short to measure bpm, since even at 200 bpm consecutive beats are 300 ms apart. Knowing the bpm helps a lot in telling whether a piece has a slow or a more hectic style, but this information alone is still not enough to achieve good results. A good list of time- and frequency-domain descriptors would be:

Descriptors in the time domain:

  • RMS (Root Mean Square) - measures the energy of the audio signal; it can be useful for determining whether the analyzed audio has high intensity or not.
  • Zero Crossing Rate - counts how many times the audio signal crosses the x axis (changes sign); it relates to how quickly the signal varies.
  • Low Energy Rate - the percentage of captured frames whose RMS is below the average RMS, that is, the percentage of frames that have less power than the average. From it you can tell whether the analyzed audio spends most of its time at higher intensity or not.
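The three time-domain descriptors above can be sketched in plain Python as follows. This is only a minimal illustration, not the MPEG-7 reference extraction: the function names, the block size and the hop are assumptions made for the example, and the signal is taken to be a list of float samples.

```python
import math

def split_into_blocks(signal, block_size, hop):
    """Cut the signal into blocks of `block_size` samples, advancing `hop` samples each time."""
    return [signal[i:i + block_size]
            for i in range(0, len(signal) - block_size + 1, hop)]

def rms(block):
    """Root mean square of one block: a measure of its energy."""
    return math.sqrt(sum(x * x for x in block) / len(block))

def zero_crossing_rate(block):
    """Fraction of consecutive sample pairs whose signs differ."""
    if len(block) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(block, block[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(block) - 1)

def low_energy_rate(rms_per_block):
    """Fraction of blocks whose RMS is below the average RMS of all blocks."""
    mean_rms = sum(rms_per_block) / len(rms_per_block)
    return sum(1 for r in rms_per_block if r < mean_rms) / len(rms_per_block)
```

RMS and Zero Crossing Rate are computed once per block, while the Low Energy Rate is a single number for the whole excerpt, computed from the list of per-block RMS values.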

Descriptors in the Spectrum Domain:

  • Spectral Entropy - calculates the entropy of the audio signal's spectrum. Entropy is nothing more than a measure of the disorder in a given system. To illustrate: imagine that you keep a collection of toy cars in a shoe box, all of them lined up by color. If you put a few of them out of order, you get one measure of entropy. If you then shake the box hard, all the cars end up in total disorder, which gives another (higher) measure of entropy. This feature should be useful for checking how organized a given audio signal is.
  • Spectral Flux - measures how the spectrum changes over time. Recall that audio signals are analyzed in windows (blocks); the Spectral Flux is calculated block by block: the previous block's spectrum is subtracted from the current one, and the result indicates how fast the signal is changing.
  • Spectral Irregularity - measures the irregularity of the spectrum, that is, how jagged (serrated) it is. It works similarly to the Spectral Flux, but it calculates differences inside the captured block (between neighboring spectral components) rather than from block to block.
  • Spectral Centroid - related to the brightness of the signal. It is the center of gravity of the spectrum (the magnitude-weighted mean frequency), so it shows how the energy is balanced between low and high frequencies.
  • Spectral Rolloff - the frequency below which a given fraction of the spectrum's energy is concentrated (85% is a common choice). It indicates how far the spectrum is skewed toward the high frequencies.
  • Spectral Skewness - calculates the degree of asymmetry of the spectrum; this descriptor returns how asymmetrical a given frame is. This kind of information captures timbral characteristics of the analyzed audio.
  • Spectral Kurtosis - where Spectral Skewness calculates the asymmetry of the spectrum, Kurtosis calculates its degree of flatness (how peaked or flat the distribution is).
  • MFCC (Mel-Frequency Cepstral Coefficients) - a compact set of coefficients that capture essential information and help recognize patterns.
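As a rough illustration of the spectral descriptors, here is a sketch of four of them (centroid, rolloff, flux and entropy) in plain Python. The function names are made up for the example, the 85% rolloff fraction is an assumed default, and the naive DFT is only there to keep the sketch self-contained; a real implementation would use an FFT library and usually a window function.

```python
import cmath
import math

def magnitude_spectrum(block):
    """Naive DFT magnitudes for bins 0..N/2 (slow, but needs no external library)."""
    n = len(block)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(block)))
            for k in range(n // 2 + 1)]

def spectral_centroid(mags, sample_rate, n):
    """Magnitude-weighted mean frequency in Hz; `n` is the block length in samples."""
    total = sum(mags)
    if total == 0:
        return 0.0
    return sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total

def spectral_rolloff(mags, sample_rate, n, fraction=0.85):
    """Frequency in Hz below which `fraction` of the spectral energy lies."""
    energy = [m * m for m in mags]
    threshold = fraction * sum(energy)
    cumulative = 0.0
    for k, e in enumerate(energy):
        cumulative += e
        if cumulative >= threshold:
            return k * sample_rate / n
    return (len(mags) - 1) * sample_rate / n

def spectral_flux(mags, previous_mags):
    """Euclidean distance between the spectra of two consecutive blocks."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mags, previous_mags)))

def spectral_entropy(mags):
    """Shannon entropy (in bits) of the normalized power spectrum."""
    power = [m * m for m in mags]
    total = sum(power)
    if total == 0:
        return 0.0
    return -sum((p / total) * math.log2(p / total) for p in power if p > 0)
```

A pure sine wave concentrates all its energy in one bin, so its entropy is close to zero and its centroid and rolloff both sit at the sine's frequency; noisy or dense material spreads the energy and raises the entropy.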

Now that you know what each descriptor does, assemble one feature vector per audio excerpt, summarizing the per-block values by their mean and standard deviation (Std). In this case we will have:

  • 1st to 13th: the MFCC coefficients

  • 14th: Mean Centroid

  • 15th: Std (standard deviation) Centroid

  • 16th: Mean Irregularity

  • 17th: Std Irregularity

  • 18th: Mean Entropy

  • 19th: Std Entropy

  • 20th: Mean Flux

  • 21st: Std Flux

  • 22nd: Mean Kurtosis

  • 23rd: Std Kurtosis

  • 24th: Mean Rolloff

  • 25th: Std Rolloff

  • 26th: Mean Skewness

  • 27th: Std Skewness

  • 28th: Mean RMS (Root Mean Square)

  • 29th: Std RMS (Root Mean Square)

  • 30th: Mean ZCR (Zero Crossing Rate)

  • 31st: Std ZCR (Zero Crossing Rate)

  • 32nd: LER (Low Energy Rate)
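The layout of these 32 positions can be sketched as below. The dictionary keys and function names are hypothetical, chosen only for this example, and the sketch assumes you already have 13 MFCC values for the excerpt plus a list of per-block values for each of the other descriptors (how the MFCCs are aggregated over blocks is not specified here; averaging them is one possibility).

```python
import statistics

# Order of the mean/std pairs, matching positions 14 to 31 of the list above.
DESCRIPTOR_ORDER = ["centroid", "irregularity", "entropy", "flux", "kurtosis",
                    "rolloff", "skewness", "rms", "zcr"]

def mean_and_std(values):
    """Summarize the per-block values of one descriptor as [mean, std]."""
    return [statistics.mean(values), statistics.pstdev(values)]

def build_base_vector(mfcc_values, per_block, low_energy_rate):
    """Return the 32-position vector: 13 MFCCs, 9 mean/std pairs, then the LER.

    mfcc_values     -- 13 numbers for the excerpt (aggregation is an assumption)
    per_block       -- dict mapping each name in DESCRIPTOR_ORDER to a list of block values
    low_energy_rate -- a single number for the whole excerpt
    """
    if len(mfcc_values) != 13:
        raise ValueError("expected 13 MFCC values")
    vector = list(mfcc_values)
    for name in DESCRIPTOR_ORDER:
        vector.extend(mean_and_std(per_block[name]))
    vector.append(low_energy_rate)
    return vector
```

Keeping the order fixed in one place matters more than the order itself: every song in the database must use the same positions for the same descriptors, otherwise distances between vectors are meaningless.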

After the 32nd, 5 more descriptors are still missing: the rhythmic ones taken from the Beat Histogram. In total you will have a vector with 37 positions, each of them describing something!
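Which five Beat Histogram values are meant is not spelled out here, so the sketch below does not try to reproduce them. It only illustrates the underlying idea of a rhythmic descriptor: estimating a single tempo from the autocorrelation of an energy envelope (for example, the per-block RMS values over several seconds). The function name, the 40 to 200 bpm search range and the envelope input are assumptions made for the example.

```python
def estimate_bpm(envelope, blocks_per_second, min_bpm=40, max_bpm=200):
    """Estimate tempo as the lag with the strongest autocorrelation of the envelope.

    envelope          -- one energy value per analysis block, covering several seconds
    blocks_per_second -- how many envelope values there are per second of audio
    """
    mean = sum(envelope) / len(envelope)
    centered = [e - mean for e in envelope]

    # Convert the bpm search range into a range of lags measured in blocks.
    min_lag = max(1, int(round(blocks_per_second * 60 / max_bpm)))
    max_lag = min(len(centered) - 1, int(round(blocks_per_second * 60 / min_bpm)))
    if max_lag < min_lag:
        raise ValueError("envelope is too short for the requested bpm range")

    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the envelope with a copy of itself shifted by `lag` blocks.
        score = sum(a * b for a, b in zip(centered, centered[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return 60.0 * blocks_per_second / best_lag
```

A simple estimator like this can easily land on half or double the perceived tempo, which is one reason to keep several rhythmic values rather than a single bpm number.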

To keep this from getting too long: besides the Euclidean distance, try other similarity-search algorithms such as KNN, LSH or DP, or learned models such as Random Forests and neural networks (MLP).
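As a starting point, a nearest-neighbor search for the ten most similar songs could look like the sketch below. It assumes every song has already been reduced to one feature vector of equal length; the function names are invented for the example. The z-score step is an addition of this sketch, not something stated above: positions on different scales (a rolloff in Hz next to a rate between 0 and 1) would otherwise let the largest numbers dominate the Euclidean distance.

```python
import math
import statistics

def zscore_columns(vectors):
    """Rescale every position to zero mean and unit standard deviation across all songs."""
    columns = list(zip(*vectors))
    means = [statistics.mean(c) for c in columns]
    # A constant column has std 0; divide by 1.0 instead to avoid a division by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(x - m) / s for x, m, s in zip(v, means, stds)] for v in vectors]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(vectors, query_index, k=10):
    """Return the indices of the k songs closest to the song at `query_index`."""
    normalized = zscore_columns(vectors)
    query = normalized[query_index]
    distances = [(euclidean(query, v), i)
                 for i, v in enumerate(normalized) if i != query_index]
    return [i for _, i in sorted(distances)[:k]]
```

With 1,000 songs a brute-force scan like this is fast enough; approximate methods such as LSH only start to pay off on much larger collections.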

  • A masterclass, thank you very much!
