## How to extract features from audio data for supervised machine-learning classification?


I'm a bit confused about how data in audio files can be processed for a ML classification model. I have several .wav files that contain dogs barking and cats "meowing". The pipeline is the following:

1. Compute the FFT for the desired window
2. Apply MFCC filtering
3. Back-transform using DCT
4. Create a "spectrogram" for the window
5. Train a model?

What I don't understand is:

1. If I have .wav files of different lengths, let's say 1 second and 0.8 seconds, I will end up with different numbers of windows: with a window size of 0.1 seconds, the first file has 10 windows and the second has 8. How can I feed this information to the learning algorithm consistently?
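For concreteness, the window counts I mean work out like this (the 16 kHz sample rate is just an assumption for illustration, and the windows are non-overlapping):

```python
import numpy as np

sr = 16000                       # assumed sample rate (Hz)
win = int(0.1 * sr)              # 0.1 s window -> 1600 samples

for dur in (1.0, 0.8):
    y = np.zeros(int(dur * sr))  # stand-in signal of that duration
    n_windows = len(y) // win    # number of non-overlapping windows
    print(f"{dur} s -> {n_windows} windows")
```

This prints 10 windows for the 1.0 s file and 8 for the 0.8 s file, so the two files produce feature matrices of different widths.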

2. Does the algorithm learn from the entire .wav file or window by window?

3. If the algorithm learns from each window, will each window have a different prediction value?

Thank you.


Your pipeline is roughly right; here is a clarified version:

1. Convert the audio to a spectrogram using the STFT
2. Apply mel-scale filtering to get a mel spectrogram
3. Apply the DCT to the log mel spectrogram to get MFCCs
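A minimal sketch of this clarified pipeline using `scipy` and a hand-rolled mel filterbank. The sample rate, FFT size (512), number of mel filters (26), and number of kept coefficients (13) are illustrative choices, not the only valid ones:

```python
import numpy as np
from scipy.signal import stft
from scipy.fft import dct

def mel_filterbank(sr, n_fft, n_mels=26):
    """Triangular filters spaced evenly on the mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                      # rising edge of triangle i
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                      # falling edge of triangle i
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(y, sr, n_fft=512, n_mfcc=13):
    # 1. STFT -> power spectrogram, shape (n_fft//2 + 1, n_frames)
    _, _, Z = stft(y, fs=sr, nperseg=n_fft)
    power = np.abs(Z) ** 2
    # 2. Mel filtering -> log mel spectrogram (small offset avoids log(0))
    log_mel = np.log(mel_filterbank(sr, n_fft) @ power + 1e-10)
    # 3. DCT along the frequency axis -> keep the first n_mfcc coefficients
    return dct(log_mel, type=2, axis=0, norm="ortho")[:n_mfcc]

coeffs = mfcc(np.random.default_rng(0).standard_normal(16000), 16000)
print(coeffs.shape)  # (n_mfcc, n_frames), one MFCC vector per window
```

Note the output has one column per STFT frame (window), which is exactly why your question 1 comes up: files of different lengths give matrices of different widths, and you need a strategy (fixed-length crops/padding, per-window labels, or pooling over frames) to present them to a classifier consistently.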