## How to reduce dimensionality of audio data that comes in form of matrices and vectors?

I'm working on a project that involves distinguishing different types of sounds (such as screams, singing, and bangs) from each other. We've run our data through a reasonable number of different transformations (e.g., spectrograms, chromagrams, MFCCs, etc.), but since most of our features are 2-dimensional matrices (some are actually 1-dimensional vectors), we'd like to reduce this information in some way so that the machine learning we're hoping to do takes a "feasible" amount of time. However, I don't quite know enough about math and statistics to make an educated decision on this.

Our data consists of small sound files from ~1-10 seconds long. There are recordings of screams, singing, bangs (and other man-made noises), and birds (and other natural noises). We are hoping to be able to differentiate and identify each source type from the others. See https://github.com/BenSandeen/surveillance_sound_classifier/blob/master/Project.ipynb for the different plots we have made to guide our selection of features to use. Focus mainly on the 3x3 plots, as that's where the comparisons are being made. These plots are primarily time vs. frequency, with amplitude represented by color.

I was thinking that maybe we could "collapse" each matrix down to a vector by somehow choosing some representative frequency/amplitude-related feature at each time slice (we're using Short-Time Fourier Transforms to analyze the sounds), giving us a vector of scalars of some length. However, this could make accounting for different lengths of sounds difficult. Would it be reasonable to just fill in the shorter sounds' vectors with zeros where they have no useful data? That would effectively make these sounds a projection onto some lower-dimensional space. Then maybe we could just use dot products to compare the vectors: if they're parallel, they'll have a large dot product, but if they're nearly perpendicular, they'll have a dot product near zero.
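Here's a rough sketch of what I mean by the zero-padding and dot-product comparison (the vectors here are made up purely for illustration):

```python
import numpy as np

def pad_to_length(v, n):
    # Zero-fill a shorter feature vector out to length n
    out = np.zeros(n)
    out[:len(v)] = v
    return out

def cosine_similarity(a, b):
    # Normalized dot product: near 1 for parallel vectors,
    # near 0 for nearly perpendicular ones
    n = max(len(a), len(b))
    a, b = pad_to_length(a, n), pad_to_length(b, n)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

short_sound = np.array([1.0, 2.0, 3.0])           # shorter recording's features
long_sound = np.array([1.0, 2.0, 3.0, 0.5, 0.1])  # longer recording's features
print(cosine_similarity(short_sound, long_sound))
```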

Alternatively, I was thinking that something like the trace of our matrices, or finding their characteristic polynomials, might be a useful direction to pursue. I've read a bit about PCA, but I don't understand it well enough to know whether it's what I'm looking for.

Can anyone think of any other ways of handling and reducing this data? For what it's worth, we're currently planning on using scikit-learn (sorry that I can't use more than 2 links yet) to perform our machine learning.

Would you be interested in an audio fingerprinting technique like the one used in Shazam? https://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf

– Laurent Duval – 2016-03-15T12:25:52.667

@LaurentDuval I've read that fantastic paper, but no, that's not what we're looking for. We're hoping to have our system "listen" to arbitrary sound recordings (not studio recordings) and analyze them. – Ben Sandeen – 2016-03-15T15:22:31.333

In passing, this post talked about bird sound classification, just for the record http://dsp.stackexchange.com/questions/28612/best-similarity-measure-for-audio-classification/28614#28614

– Laurent Duval – 2016-03-15T15:31:35.673

Interesting ideas on fingerprinting, but that isn't going to work here, as you don't have a definitive sound you're working with.

To me, this seems more like an image processing problem. So instead of thinking about how to reduce dimensionality, you could try combining your data into "images" that digitally represent the sounds you're looking at, and then apply standard image processing techniques. I say this waving my hands around vaguely, as I don't know of specific ones, but I have enough understanding of edge detection, feature detection, and map/reduce-type methodologies to think they might provide templates for what you're after.

Just unroll the matrices into a large set of features:

If all of the matrices are the same size, you can just unroll them into a large feature set; e.g., a 20 x 20 matrix turns into 400 features. You can do this with multiple different feature matrices. It's up to the learning algorithm to infer each feature's meaning, so don't overthink the lack of human readability. Take a look at the tutorials on digit recognition in scikit-learn and you will see that the "image pixel matrices" have been similarly unrolled.
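As a toy sketch (random matrices standing in for your spectrogram/MFCC feature matrices):

```python
import numpy as np

# Stand-ins for per-sound feature matrices (e.g., 20 x 20 spectrogram patches)
matrices = [np.random.rand(20, 20) for _ in range(3)]

# Unroll each matrix into a flat 400-element feature vector; stacking them
# gives the (n_samples, n_features) array scikit-learn estimators expect
X = np.array([m.ravel() for m in matrices])
print(X.shape)  # (3, 400)
```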

You can then employ PCA or a nonlinear dimensionality reduction scheme to select a subset of the feature space. Though it might go against your intuition, the dimensionality reduction will likely improve your classification algorithm.
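A minimal scikit-learn sketch of the PCA step (random data standing in for your unrolled features; the number of components is an arbitrary choice you'd tune):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 400))  # 100 sounds, 400 unrolled features each

# Project the features onto the top 50 principal components
pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (100, 50)
```

`pca.explained_variance_ratio_` tells you how much variance each component retains, which can guide the choice of component count.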

Accounting for different lengths of sounds is a difficult problem, because stretching a recording in time will alter its frequency content and thus affect your Fourier transform. I suggest defining some characteristic sound length of $n$ time frames, then subsampling long sounds down to $n$ frames and oversampling short sounds up to $n$ frames.
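One way to sketch that, using nearest-neighbor frame selection (an assumption on my part; linear interpolation would be a reasonable alternative):

```python
import numpy as np

def to_characteristic_length(frames, n_frames):
    # Resample a (time, frequency) feature matrix to a fixed number of time
    # frames: long sounds get subsampled, short sounds get frames repeated
    idx = np.linspace(0, frames.shape[0] - 1, n_frames).round().astype(int)
    return frames[idx]

long_sound = np.random.rand(500, 128)  # 500 STFT frames
short_sound = np.random.rand(60, 128)  # 60 STFT frames
print(to_characteristic_length(long_sound, 100).shape)   # (100, 128)
print(to_characteristic_length(short_sound, 100).shape)  # (100, 128)
```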

Hope this helps!