Spatial clustering methods in R for data with large variance


I am quite new to clustering methods and would like to do spatial clustering on GDP per capita in the UK.

I tried using dbscan(), but it gave me one big cluster containing more than 80% of the data points, while each of the remaining clusters has very few points. I believe this is because dbscan() clusters based on density, so it cannot separate my data points, whose values vary hugely across the UK map (summary() of the data is provided below).

 Min.   :       0
 1st Qu.:   23894
 Median :   45884
 Mean   :  206663
 3rd Qu.:  150746
 Max.   :23526994

Does anyone have any advice on methods for clustering spatial data with huge variance?


Posted 2020-11-01T15:33:57.460

Reputation: 11



Could you provide a few rows of your data and explain your task in more detail? It sounds like you're trying to cluster on both spatial (lat+long) and non-spatial (GDP) features. You might find that clustering on lat+long alone, or a simple choropleth map, is more appropriate. If you still wish to use lat+long+GDP, note that GDP is on a very different scale from lat+long: your values run into the millions, while coordinates within the UK differ by only a few degrees, so GDP will dominate the distance function unless you rescale. The exact rescaling should depend on how much you value spatial similarity versus GDP similarity. To fix the problem of one big cluster and many small ones, try adjusting the parameters (lowering eps and/or increasing minPts), both of which make the density requirement stricter. Finally, consider trying different distance metrics (I'm not sure if this is possible in R): haversine is common for lat+long, and Euclidean is common for non-spatial data.
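One way the rescaling and mixed-metric idea might look in R is sketched below. This is only an illustration under several assumptions: the data are synthetic, the column names `lon`, `lat` and `gdp` are made up, the `log1p()` transform is my own suggestion for taming the right skew visible in your summary(), and the weight `w_geo` and the `eps`/`minPts` values are arbitrary starting points rather than recommendations. The distance matrices use base R only; the final step assumes the `dbscan` package, whose `dbscan()` accepts a precomputed `dist` object.

```r
# Synthetic stand-in for the real data (column names are assumptions).
set.seed(42)
n <- 200
pts <- data.frame(
  lon = runif(n, -6, 1.5),                    # rough UK longitude range
  lat = runif(n, 50, 58),                     # rough UK latitude range
  gdp = rlnorm(n, meanlog = 11, sdlog = 1.5)  # heavily right-skewed values
)

# Pairwise great-circle (haversine) distance in km, base R only.
haversine_km <- function(lon, lat, radius = 6371) {
  phi    <- lat * pi / 180
  lambda <- lon * pi / 180
  dphi    <- outer(phi, phi, "-")
  dlambda <- outer(lambda, lambda, "-")
  a <- sin(dphi / 2)^2 + outer(cos(phi), cos(phi)) * sin(dlambda / 2)^2
  2 * radius * asin(pmin(sqrt(a), 1))         # clamp guards against rounding
}

# Spatial distances: haversine on lon/lat.
d_geo <- haversine_km(pts$lon, pts$lat)

# GDP distances: log-transform (log1p copes with zeros), standardise,
# then take absolute differences via Euclidean dist() on one column.
gdp_z <- as.numeric(scale(log1p(pts$gdp)))
d_gdp <- as.matrix(dist(gdp_z))

# Put both matrices on a comparable scale (divide by the median pairwise
# distance), then blend them. w_geo is a hypothetical weight: closer to 1
# favours spatial similarity, closer to 0 favours GDP similarity.
rescale <- function(d) d / median(d[upper.tri(d)])
w_geo <- 0.5
d_mix <- w_geo * rescale(d_geo) + (1 - w_geo) * rescale(d_gdp)

# Cluster on the precomputed distances if the dbscan package is installed.
if (requireNamespace("dbscan", quietly = TRUE)) {
  fit <- dbscan::dbscan(as.dist(d_mix), eps = 0.3, minPts = 5)
  print(table(fit$cluster))                   # cluster 0 is noise
}
```

If one cluster still swallows most of the points, lowering `eps` or raising `minPts` is the first thing to try. A k-nearest-neighbour distance plot (for example `dbscan::kNNdistplot()`) is a common way to choose `eps`.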


Posted 2020-11-01T15:33:57.460

Reputation: 183