I'm a bit confused about how the data in audio files can be processed for an ML classification model. I have several .wav files containing dogs barking and cats meowing. The pipeline is the following (a rough code sketch follows the list):
- Load the data
- Convert the data to the frequency domain with an FFT for the desired window
- Apply MFCC filtering
- Back-transform using DCT
- Create a "spectrogram" for the window
- Train a model?
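
A minimal sketch of that pipeline, assuming librosa and illustrative parameters (the file name `dog_bark.wav`, a 16 kHz sample rate, 0.1 s windows, and 13 coefficients are my assumptions, not fixed requirements); `librosa.feature.mfcc` internally performs the windowed FFT, mel filtering, and DCT steps listed above:

```python
import librosa

# Load one .wav file (file name and sample rate are illustrative assumptions)
y, sr = librosa.load("dog_bark.wav", sr=16000)

# 0.1 s analysis window, non-overlapping (hop_length == n_fft)
n_fft = int(0.1 * sr)

# MFCCs: windowed FFT -> mel filter bank -> log -> DCT
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=n_fft, hop_length=n_fft)

print(mfcc.shape)  # (13 coefficients, number_of_windows)
```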
What I don't understand is:
If I have .wav files of different lengths, let's say 1 second and 0.8 seconds, I will end up with a different number of windows: with a window size of 0.1 seconds, the first file will have 10 windows and the second will have 8. How can I feed this information to the learning algorithm consistently?
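
To make the mismatch concrete, here is a small sketch (assuming a hypothetical 16 kHz sample rate and non-overlapping 0.1 s windows) showing how the two clip lengths give 10 vs. 8 windows, and how zero-padding every clip to a common duration is one common way to equalize the window count; this is just an illustration, not necessarily the right fix for your case:

```python
import numpy as np

sr = 16000                          # assumed sample rate (hypothetical)
win = int(0.1 * sr)                 # 0.1 s non-overlapping window = 1600 samples

clip_a = np.zeros(int(1.0 * sr))    # stand-in for the 1.0 s file
clip_b = np.zeros(int(0.8 * sr))    # stand-in for the 0.8 s file

print(len(clip_a) // win)           # 10 windows
print(len(clip_b) // win)           # 8 windows

# One common option: pad (or truncate) every clip to the same duration
# before windowing, so all files yield the same number of windows.
target_len = int(1.0 * sr)
clip_b_padded = np.pad(clip_b, (0, target_len - len(clip_b)))
print(len(clip_b_padded) // win)    # 10 windows, same as clip_a
```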
Does the algorithm learn from the entire .wav file or window by window?
If the algorithm learns from each window, will each window have a different prediction value?