## Clustering geo location coordinates (lat,long pairs)

31

13

What is the right approach and clustering algorithm for geolocation clustering?

I'm using the following code to cluster geolocation coordinates:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans2, whiten

coordinates = np.array([
    [lat, long],
    [lat, long],
    ...
    [lat, long]
])
# whiten() rescales each column to unit variance before clustering
centroids, labels = kmeans2(whiten(coordinates), 3, iter=20)
plt.scatter(coordinates[:, 0], coordinates[:, 1], c=labels)
plt.show()


Is it right to use K-means for geolocation clustering, given that it uses Euclidean distance rather than the Haversine formula as its distance function?
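To make the difference between the two distance functions concrete, here is a minimal sketch. The function names and the Earth-radius constant are assumptions chosen for illustration, not part of the code above.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius in kilometres


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # min() guards against rounding pushing the value slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def euclidean_deg(lat1, lon1, lat2, lon2):
    """Plain Euclidean distance on the raw degree values."""
    return math.hypot(lat2 - lat1, lon2 - lon1)


# Two points on the equator, one degree apart across the antimeridian
a, b = (0.0, 179.5), (0.0, -179.5)
print(haversine_km(*a, *b))   # roughly one degree of arc, in km
print(euclidean_deg(*a, *b))  # 359.0, as if the points were far apart
```

For nearby points away from the poles and the ±180° line the two measures mostly agree on which points are neighbours, which is why Euclidean k-means often looks fine on city-scale data.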

You can also take a look at this similar question: https://datascience.stackexchange.com/questions/10063/for-which-real-world-data-sets-does-dbscan-surpass-k-means

VividD 2017-05-11T11:41:06.187

4

K-means should be right in this case. Since k-means groups objects based solely on the Euclidean distance between them, you will get back clusters of locations that are close to each other.

To find the optimal number of clusters, you can try making an 'elbow' plot of the within-cluster sum of squared distances. This may be helpful (http://nbviewer.ipython.org/github/nborwankar/LearnDataScience/blob/master/notebooks/D3.%20K-Means%20Clustering%20Analysis.ipynb)
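A rough sketch of how the elbow values could be computed with `scipy.cluster.vq.kmeans2`. The sample coordinates are made up, the helper `within_cluster_ss` is a hypothetical name, and treating raw latitude/longitude as Euclidean is only a reasonable approximation for a small area away from the poles and the ±180° line.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Made-up sample: three tight groups of points in a small region
rng = np.random.default_rng(0)
centers = np.array([[48.85, 2.35], [48.95, 2.60], [48.70, 2.20]])
coordinates = np.vstack(
    [c + rng.normal(scale=0.01, size=(50, 2)) for c in centers]
)


def within_cluster_ss(data, k):
    """Sum of squared distances from each point to its assigned centroid."""
    centroids, labels = kmeans2(data, k, iter=20, minit="++")
    return float(np.sum((data - centroids[labels]) ** 2))


# One value per candidate k; plot these against k and look for the bend
sse_by_k = {k: within_cluster_ss(coordinates, k) for k in range(1, 6)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 4))
```

Because k-means starts from random centroids, it is worth repeating each k a few times and keeping the lowest value.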

3

How are points close to each other at the wrap-around point handled?

casperOne 2014-07-18T17:00:53.687

You need to find an algorithm that takes a pre-computed distance matrix or lets you supply a distance function that it can call when it needs to compute distances. Otherwise it won't work.

Spacedman 2014-07-28T13:45:35.470

The elbow plot may not help you at all because there might be no elbow. Also make sure to try several runs of k-means with the same number of clusters, because you might get different results.

Grasshopper 2017-05-12T10:05:37.800

34

K-means is not the most appropriate algorithm here.

The reason is that k-means is designed to minimize variance. This is, of course, appealing from a statistical and signal processing point of view, but your data is not "linear".

Since your data is in latitude, longitude format, you should use an algorithm that can handle arbitrary distance functions, in particular geodetic distance functions. Hierarchical clustering, PAM, CLARA, and DBSCAN are popular examples of this.
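As one possible illustration, scikit-learn's `DBSCAN` accepts `metric="haversine"` when the input is `[lat, lon]` in radians. The sample points, the 50 km radius, and `min_samples=2` are assumptions picked only for this sketch.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius

# Made-up [lat, lon] points in degrees: four straddling the +-180 line,
# three around London
coords_deg = np.array([
    [0.0, 179.9],
    [0.1, -179.9],
    [0.0, -179.95],
    [-0.1, 179.95],
    [51.50, -0.12],
    [51.51, -0.10],
    [51.49, -0.13],
])

eps_km = 50.0  # neighbourhood radius, an arbitrary choice for this example
db = DBSCAN(
    eps=eps_km / EARTH_RADIUS_KM,  # haversine distances are in radians
    min_samples=2,
    metric="haversine",
    algorithm="ball_tree",
)
labels = db.fit_predict(np.radians(coords_deg))
print(labels)  # one label per point; -1 would mark noise
```

The points on either side of the ±180° line end up in the same cluster because the distance is measured on the sphere, not on the raw degree values.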

The problems with k-means are easy to see when you consider points close to the ±180 degrees wrap-around. Even if you hacked k-means to use Haversine distance, the update step, which recomputes each mean, would produce badly wrong results. In the worst case, k-means will never converge!
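A tiny sketch of the update-step problem, using two made-up longitudes. The circular mean shown for comparison is a standard way of averaging angles and is not something k-means does; both function names are hypothetical.

```python
import math


def naive_mean_lon(lons_deg):
    """Arithmetic mean of longitudes, which is what a k-means update computes."""
    return sum(lons_deg) / len(lons_deg)


def circular_mean_lon(lons_deg):
    """Mean direction of the longitudes treated as angles on a circle."""
    s = sum(math.sin(math.radians(x)) for x in lons_deg)
    c = sum(math.cos(math.radians(x)) for x in lons_deg)
    return math.degrees(math.atan2(s, c))


lons = [179.0, -179.0]          # two points 2 degrees apart across the date line
naive = naive_mean_lon(lons)    # 0.0, i.e. the opposite side of the globe
circ = circular_mean_lon(lons)  # on the +-180 line, between the two points
print(naive, circ)
```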

Can you suggest a more appropriate clustering method for geo-location data?

Alex Spurling 2016-10-25T21:50:43.077

Did you notice the third paragraph?

Anony-Mousse 2016-10-25T21:54:31.240

2

I am probably very late with my answer, but if you are still dealing with geo clustering, you may find this study interesting. It compares two fairly different approaches to classifying geographic data: K-means clustering and latent class growth modeling (LCGM).

The authors concluded that the end results were similar overall, and that there were some aspects where LCGM outperformed K-means.

2

You can use HDBSCAN for this. The Python package has support for haversine distance, which will properly compute distances between lat/lon points.

As the docs mention, you will need to convert your points to radians first for this to work. The following pseudocode should do the trick:

import numpy as np
import hdbscan
# the haversine metric expects [lat, lon] in radians, so convert from degrees
points = np.radians(np.array([[lat1, lon1], [lat2, lon2], ...]))
clusterer = hdbscan.HDBSCAN(min_cluster_size=N, metric='haversine')
cluster_labels = clusterer.fit_predict(points)


1

GPS coordinates can be converted directly to a geohash. Geohash divides the Earth into "buckets" of different sizes depending on the number of characters in the code (short geohash codes cover large areas, longer codes cover smaller areas). Geohash is therefore a clustering method with adjustable precision.
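A small pure-Python sketch of the idea: encode each point, then group points that share the same code. The function name `geohash_encode`, the sample points, and the chosen precision are assumptions for illustration; a maintained geohash library would normally be used instead.

```python
from collections import defaultdict

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet


def geohash_encode(lat, lon, precision=6):
    """Encode a lat/lon pair (degrees) as a geohash of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


# Made-up points: two a few metres apart, one far away
points = [(57.64911, 10.40744), (57.64920, 10.40750), (35.6762, 139.6503)]
buckets = defaultdict(list)
for lat, lon in points:
    buckets[geohash_encode(lat, lon, precision=5)].append((lat, lon))
print(dict(buckets))
```

One caveat: two points that sit on opposite sides of a cell boundary get different codes even if they are very close, so neighbouring cells may need to be checked as well.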

0

Using the k-means algorithm to cluster the locations is a bad idea. Your locations can be spread across the world, and you cannot predict the number of clusters in advance. Moreover, if you set the number of clusters to 1, all locations will be grouped into a single cluster. I am using hierarchical clustering for this.
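A minimal sketch of what that could look like with SciPy, cutting the tree at a distance threshold so the number of clusters does not have to be chosen up front. The sample points, the helper `haversine_matrix_km`, average linkage, and the 100 km cut are all assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def haversine_matrix_km(coords_deg):
    """Pairwise great-circle distances (km) for an (n, 2) array of [lat, lon] degrees."""
    lat, lon = np.radians(coords_deg).T
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Made-up points: three around London, three around Tokyo
coords = np.array([
    [51.5074, -0.1278],
    [51.52, -0.10],
    [51.48, -0.15],
    [35.6762, 139.6503],
    [35.70, 139.70],
    [35.65, 139.60],
])

dist = haversine_matrix_km(coords)
# linkage() expects the condensed (upper-triangle) form of the matrix
tree = linkage(squareform(dist, checks=False), method="average")
# Merge clusters until they are more than 100 km apart
labels = fcluster(tree, t=100.0, criterion="distance")
print(labels)
```

The full distance matrix grows with the square of the number of points, so this only suits moderately sized data sets.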

0

Java Apache commons-math does this pretty easily.

List<Cluster<T>> cluster(Collection<T> points)


0

Go with k-means clustering, as HDBSCAN will take forever. I tried it for one of my projects and ended up using k-means instead, with the desired results.