Classifying non-labeled data with high dimensionality



Disclaimer: I am a novice in the world of machine learning, so please excuse my ignorance.

My dataset consists of features such as age, days since last visit, etc. The data is medical in nature. None of it is geometrical; it is just data pertaining to particular clients.

The goal is to classify my dataset into three labels. The dataset is not labeled, meaning I'm dealing with an unsupervised learning problem. My dataset consists of ~20,000 records, but this will increase linearly over time. The data is nearly all floats, with some strings that can easily be converted into floats. Using this cheat sheet for selecting a solution from the scikit-learn site, KMeans clustering seems like a potential solution, but I've been reading that high dimensionality can render KMeans unhelpful. I'm not married to a particular implementation either. I currently have a KMeans clustering implementation using TensorFlow in Python, but I am open to alternatives.

My question is: what are some solutions I could explore further that might be better suited to my particular situation?


Posted 2018-06-07T15:50:16.833

Reputation: 175

Can you describe the dataset in more detail rather than just saying it is "floats"? I mean, do the values have any geometrical significance, like 3D points? It would be helpful to describe what the data is about and what the sample data looks like, rather than just mentioning "records". – riemann77 – 2018-06-08T16:53:37.523

@thecomplexitytheorist I've updated the question, I hope that helps. – Tory – 2018-06-08T17:46:44.500

You can start by finding the correlations between the features; this really helps in understanding the data. Then you can use a subset of those features based on their correlation properties. Based on your description, I would try DBSCAN first. – riemann77 – 2018-06-08T18:33:42.370
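As a rough illustration of the correlation check suggested in the comment above, here is a minimal sketch using NumPy only. The data is synthetic and the feature names (`age`, `days_since_last_visit`, `age_in_months`) as well as the 0.9 threshold are assumptions made for the example, not taken from the real dataset.

```python
import numpy as np

# Synthetic stand-in for the real records (hypothetical feature names).
rng = np.random.default_rng(0)
age = rng.uniform(20, 90, size=500)
days_since_last_visit = rng.uniform(0, 365, size=500)
# Deliberately redundant feature: almost a linear function of age.
age_in_months = age * 12 + rng.normal(0, 1, size=500)

feature_names = ["age", "days_since_last_visit", "age_in_months"]
X = np.column_stack([age, days_since_last_visit, age_in_months])

# Pearson correlation between columns (rowvar=False: each column is a feature).
corr = np.corrcoef(X, rowvar=False)


def highly_correlated_pairs(corr, names, threshold=0.9):
    """Return (name_a, name_b, r) for every feature pair with |r| >= threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


redundant = highly_correlated_pairs(corr, feature_names)
print(redundant)
```

One feature from each strongly correlated pair could then be dropped before clustering, since it adds a dimension without adding much information.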



I would recommend having a look at Finding Groups in Data, which is a very readable introduction to clustering methods. It gives a good overview of a number of different algorithms, both partitioning and hierarchical. As far as I remember, source code for the various algorithms is available on the web somewhere.

I am sure you will find a fitting algorithm for your problem in there.

Oliver Mason


Reputation: 3 755


This is supposed to be a comment but I haven't got enough reputation to do that.

In addition to what @thecomplexitytheorist has said, I recommend taking a closer look at your data first, using dimensionality reduction and visualisation methods such as PCA and t-SNE. A better understanding of the data can save you a lot of work.
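A minimal sketch of that exploration step, assuming scikit-learn is available. The data here is synthetic (`make_blobs`), and the sample size, feature count and `perplexity` value are placeholders chosen for the example rather than recommendations.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real records: 300 rows, 20 numeric features.
X, _ = make_blobs(n_samples=300, n_features=20, centers=3, random_state=0)

# Put every feature on the same scale so no single one dominates distances.
X_scaled = StandardScaler().fit_transform(X)

# Linear projection onto the two directions of largest variance.
pca = PCA(n_components=2, random_state=0)
X_pca = pca.fit_transform(X_scaled)
print("variance explained by 2 components:", pca.explained_variance_ratio_.sum())

# Non-linear embedding; perplexity is a tunable guess, not a recommended value.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_tsne = tsne.fit_transform(X_scaled)

print(X_pca.shape, X_tsne.shape)
```

Either 2D array can be fed to a scatter plot to see whether the records form visible groups. Note that t-SNE preserves local neighbourhoods rather than global distances, so the spacing between groups in its output should not be over-interpreted.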

Then you can choose which clustering algorithm to use. For example, KMeans or DBSCAN as a start.
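For a first comparison of the two, something along these lines might serve as a starting point. It is only a sketch on synthetic data, assuming scikit-learn; `n_clusters=3` mirrors the three labels mentioned in the question, while `eps=1.0` and `min_samples=5` are untuned placeholders that would need adjusting for real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real records: 600 rows, 10 numeric features.
X, _ = make_blobs(n_samples=600, n_features=10, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# KMeans needs the number of clusters up front.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X_scaled)

# Silhouette ranges from -1 to 1; higher means tighter, better separated clusters.
kmeans_silhouette = silhouette_score(X_scaled, kmeans_labels)
print("KMeans silhouette:", kmeans_silhouette)

# DBSCAN infers the number of clusters from density and labels outliers as -1.
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X_scaled)
n_dbscan_clusters = len(set(dbscan_labels.tolist()) - {-1})
n_noise = int(np.sum(dbscan_labels == -1))
print("DBSCAN clusters:", n_dbscan_clusters, "noise points:", n_noise)
```

The main practical difference is that KMeans always returns exactly the requested number of clusters, whereas DBSCAN may return more, fewer, or mostly noise depending on `eps`, so it cannot be forced to produce exactly three groups.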

Kevin. Fang


Reputation: 313