How to extract features from sound/audio data for supervised machine learning classification?

I'm a bit confused about how data in audio files can be processed for an ML classification model. I have several .wav files that contain dogs barking and cats "meowing". The pipeline is the following:

  1. Load the data
  2. Compute the FFT of the data for the desired window
  3. Apply MFCC filtering
  4. Back-transform using DCT
  5. Create a "spectrogram" for the window
  6. Train a model?

What I don't understand is:

  1. If I have .wav files of different lengths, let's say 1 second and 0.8 seconds, I will end up with a different number of windows: with a window size of 0.1 seconds, the first file will have 10 windows and the second will have 8. How can I feed this information to the learning algorithm consistently?

  2. Does the algorithm learn from the entire .wav file or window by window?

  3. If the algorithm learns from each window, will each window have a different prediction value?

Thank you.

Zahi Azmi

Posted 2018-04-24T14:22:29.517

Reputation: 49

Answers

Your pipeline is roughly right; here is a clarified version (see the sketch after the list):

1. Load the audio
2. Convert the audio to a `spectrogram` using STFT
3. Apply mel-scale filtering to get `mel-spectrogram`
4. Transform using DCT to get `MFCC`
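
A minimal sketch of these four steps using librosa (the file name and parameter values are illustrative assumptions, not from the question):

```python
import numpy as np
import librosa

# 1. Load the audio (file name is a placeholder)
y, sr = librosa.load("dog_bark.wav", sr=22050)

# 2. STFT -> power spectrogram
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512)) ** 2

# 3. Mel-scale filtering -> mel-spectrogram
mel = librosa.feature.melspectrogram(S=S, sr=sr, n_mels=40)

# 4. Log + DCT -> MFCC
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=13)

print(mfcc.shape)  # (13, n_frames): one MFCC vector per STFT frame
```

Note that `librosa.feature.mfcc` can also compute steps 2-4 in a single call directly from the waveform.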

You now have a time sequence of MFCC frames. It is typical for the classifier to operate on a time window of such frames. The window length can be relatively small (say 4-10 frames) or span the entire audio clip. Smaller windows are often processed with some overlap.
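
For example, a plain-numpy sketch of stacking MFCC frames into fixed-size overlapping windows (the function name and its defaults are my own, not a library API):

```python
import numpy as np

def frames_to_windows(mfcc, window=8, hop=4):
    """Stack an (n_mfcc, n_frames) matrix into overlapping windows."""
    n_frames = mfcc.shape[1]
    windows = [mfcc[:, start:start + window]
               for start in range(0, n_frames - window + 1, hop)]
    return np.stack(windows)  # (n_windows, n_mfcc, window)

demo = frames_to_windows(np.random.randn(13, 43))  # ~1 s clip of MFCCs
print(demo.shape)  # (9, 13, 8): nine half-overlapping 8-frame windows
```

This also addresses your question 1: every window is a fixed-size array, so clips of different lengths simply produce different numbers of windows.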

Learning from, and outputting predictions for, the entire file is the easiest option: it is a standard classification problem.
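
As an illustration, one common way to get a fixed-length feature vector per file is to summarize the MFCC sequence with statistics over time. A hedged sketch with scikit-learn and stand-in data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def summarize(mfcc):
    # Collapse the variable-length time axis into a fixed-length vector,
    # so clips of 0.8 s and 1.0 s yield features of identical shape.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Stand-in data: two clips of different lengths (13 MFCCs x n frames)
file_mfccs = [np.random.randn(13, 43), np.random.randn(13, 34)]
labels = [0, 1]  # e.g. 0 = cat, 1 = dog

X = np.array([summarize(m) for m in file_mfccs])  # (n_files, 26)
y = np.array(labels)

clf = RandomForestClassifier(n_estimators=100).fit(X, y)
```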

If labels are only available for the entire file ("weak labeling") but you want a prediction per window, this is usually handled with Multi-Instance Learning.
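
A common simple baseline (not a full MIL implementation; the names here are mine) is to train on windows that inherit the file's label and then aggregate the per-window probabilities, e.g. by taking the maximum:

```python
import numpy as np

def predict_file(clf, windows):
    """windows: (n_windows, n_features) feature matrix from one clip;
    clf: any fitted classifier with predict_proba (e.g. from above)."""
    window_probs = clf.predict_proba(windows)[:, 1]  # per-window P(dog)
    return window_probs.max()  # "any window barked" => dog clip
```

This also answers your question 3: each window does get its own prediction, and an aggregation rule like the one above turns them into a single file-level label.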

jonnor

Posted 2018-04-24T14:22:29.517

Reputation: 777