## Clustering geo location coordinates (lat,long pairs)

62

32

What is the right approach and clustering algorithm for geolocation clustering?

I'm using the following code to cluster geolocation coordinates:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans2, whiten

coordinates= np.array([
[lat, long],
[lat, long],
...
[lat, long]
])
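# kmeans2 returns (centroids, labels); whiten rescales each column to unit variance
centroids, labels = kmeans2(whiten(coordinates), 3, iter=20)
plt.scatter(coordinates[:, 0], coordinates[:, 1], c=labels)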
plt.show()


Is it right to use K-means for geolocation clustering, given that it uses Euclidean distance rather than the Haversine formula as its distance function?
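For reference, the two distance notions contrasted in the question can be sketched as follows. This is a minimal illustration, not part of the original code: the function names and the 6371 km mean Earth radius are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding pushing the value slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def naive_degree_distance(lat1, lon1, lat2, lon2):
    """Euclidean distance on raw degrees, which is what k-means works with."""
    return math.hypot(lat2 - lat1, lon2 - lon1)
```

For two points on either side of the antimeridian, (0, 179) and (0, -179), `haversine_km` gives roughly 222 km, while the raw-degree distance is 358 "degrees", as if the points were almost a full circle apart.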

You can also take a look at this similar question: https://datascience.stackexchange.com/questions/10063/for-which-real-world-data-sets-does-dbscan-surpass-k-means

– VividD – 2017-05-11T11:41:06.187

I think the feasibility of k-means depends on where your data are. If your data are spread all over the world, it won't work, because the distance is not Euclidean, as other users have already said. But if your data are more local, k-means would be good enough, as the geometry is locally Euclidean. – Juan Ignacio Gil – 2018-05-31T08:34:00.893

12

K-means should be right in this case. Since k-means groups objects based solely on the Euclidean distance between them, you will get back clusters of locations that are close to each other.

To find the optimal number of clusters, you can try making an 'elbow' plot of the within-group sum of squared distances. This may be helpful
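The numbers behind such an elbow plot can be sketched as follows. This is only an illustration under stated assumptions: the data are made-up local coordinates treated as planar, the k-means here is a bare-bones Lloyd's loop written for the example (not the `kmeans2` routine from the question), and the names `kmeans_wcss` and `elbow_curve` are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans_wcss(points, k, iterations=20, seed=0):
    """Run basic Lloyd's k-means and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def elbow_curve(points, max_k, restarts=5):
    """Lowest WCSS over several restarts, for each k from 1 to max_k."""
    return [
        min(kmeans_wcss(points, k, seed=s) for s in range(restarts))
        for k in range(1, max_k + 1)
    ]


# Made-up data: three tight blobs of (lat, long) pairs in a small region.
_rng = random.Random(42)
blob_centers = [(48.85, 2.35), (48.95, 2.60), (48.70, 2.55)]
points = [
    (_rng.gauss(lat, 0.01), _rng.gauss(lon, 0.01))
    for lat, lon in blob_centers
    for _ in range(30)
]

for k, wcss in enumerate(elbow_curve(points, 6), start=1):
    print(k, round(wcss, 4))
```

Plotting the printed values against k gives the elbow plot; the "elbow" is the k after which the curve flattens. Taking the best of several restarts per k reduces the effect of unlucky initializations.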

How are points close to each other on the wrap-around point handled? – casperOne – 2014-07-18T17:00:53.687

You need to find an algorithm that takes a pre-computed distance matrix or allows you to supply a distance function that it can call when it needs to compute distances. Otherwise it won't work. – Spacedman – 2014-07-28T13:45:35.470

The elbow plot may not help you at all because there might be no elbow. Also make sure to try several runs of k-means with the same cluster number because you might get different results. – Grasshopper – 2017-05-12T10:05:37.800

This is a poor idea since all points will be clustered, which is rarely a good idea in mapping. – Richard – 2019-08-12T20:40:48.893

65

K-means is not the most appropriate algorithm here.

The reason is that k-means is designed to minimize variance. This is, of course, appealing from a statistical and signal processing point of view, but your data is not "linear".

Since your data is in latitude, longitude format, you should use an algorithm that can handle arbitrary distance functions, in particular geodetic distance functions. Hierarchical clustering, PAM, CLARA, and DBSCAN are popular examples of this.
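As an illustration of an algorithm that accepts an arbitrary distance function, here is a minimal DBSCAN sketch with the distance passed in as a parameter. It is a brute-force O(n²) version written for this example, not a library implementation; the sample points, `eps`, `min_samples`, and the 6371 km Earth radius are assumptions. (For real workloads, scikit-learn's `DBSCAN` accepts `metric='haversine'` on coordinates in radians.)

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))


def dbscan(points, eps, min_samples, distance):
    """Minimal DBSCAN: returns one label per point, -1 meaning noise."""
    n = len(points)
    # Brute-force neighbourhoods; each point counts as its own neighbour.
    neighbours = [
        [j for j in range(n) if distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels


# Made-up points: three straddling the antimeridian, two in London, one isolated.
points = [
    (-17.00, 179.90), (-17.00, -179.90), (-17.05, 179.95),
    (51.50, -0.12), (51.52, -0.10),
    (0.0, 90.0),
]
labels = dbscan(points, eps=50.0, min_samples=2, distance=haversine_km)
print(labels)
```

Because the distance is geodetic, the three points on either side of the antimeridian end up in the same cluster, and the isolated point is left as noise instead of being forced into a cluster.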

This recommends OPTICS clustering.

The problems of k-means are easy to see when you consider points close to the ±180 degrees wrap-around. Even if you hacked k-means to use Haversine distance, the update step that recomputes the mean would produce a badly wrong result. In the worst case, k-means will never converge!
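The wrap-around problem in the update step can be seen with two longitudes alone. The sketch below compares the plain arithmetic mean, which is what the k-means update computes, with a circular mean taken on the unit circle; the function names are made up for this illustration.

```python
import math


def arithmetic_mean_lon(lons):
    """Plain average of longitudes in degrees, as in the k-means update step."""
    return sum(lons) / len(lons)


def circular_mean_lon(lons):
    """Mean longitude computed on the unit circle, in degrees."""
    x = sum(math.cos(math.radians(lon)) for lon in lons)
    y = sum(math.sin(math.radians(lon)) for lon in lons)
    return math.degrees(math.atan2(y, x))


lons = [179.0, -179.0]  # two points only 2 degrees apart, across the antimeridian
print(arithmetic_mean_lon(lons))  # 0.0, on the opposite side of the globe
print(circular_mean_lon(lons))    # at the antimeridian (magnitude 180)
```

The arithmetic mean places the "centre" of these two neighbouring points at longitude 0, half a world away from both of them.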

Can you suggest a more appropriate clustering method for geo-location data? – Alex Spurling – 2016-10-25T21:50:43.077

Did you notice the third paragraph? – Has QUIT--Anony-Mousse – 2016-10-25T21:54:31.240

9

GPS coordinates can be directly converted to a geohash. Geohash divides the Earth into "buckets" of different sizes based on the number of digits (short geohash codes cover big areas and longer codes cover smaller areas). Geohash is an adjustable-precision clustering method.
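A minimal sketch of the idea, assuming the standard geohash scheme (alternating longitude/latitude bisection, five bits per base-32 character). The helper names `geohash_encode` and `bucket_points` and the sample points are made up for this illustration.

```python
from collections import defaultdict

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=6):
    """Encode a (lat, lon) pair in degrees as a geohash of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between axes, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def bucket_points(points, precision=5):
    """Group points by geohash prefix; a shorter prefix means bigger buckets."""
    buckets = defaultdict(list)
    for lat, lon in points:
        buckets[geohash_encode(lat, lon, precision)].append((lat, lon))
    return dict(buckets)


# Made-up points: two close together, one far away.
points = [(57.64911, 10.40744), (57.6492, 10.4075), (40.7128, -74.0060)]
buckets = bucket_points(points, precision=5)
print(buckets)
```

Changing `precision` adjusts the bucket size, so the same function gives coarser or finer groupings without recomputing any pairwise distances.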

This seems to suffer from the same 180 degree wrap-around problem that K-Means does per the Wikipedia article linked in the answer. – Norman H – 2018-06-01T04:07:11.423

Yep! Plus codes are much better https://plus.codes/

– Brian Spiering – 2018-06-02T18:28:57.607

One benefit to this solution is that as long as you calculate the geohash once, repeated comparison operations will go much more quickly. – Norman H – 2018-06-07T14:15:29.497

Geohash will have issues with bucket-edge cases - two very close points will be put in different buckets based on the arbitrary edges of each bucket. – Dan G – 2019-06-12T21:08:09.857

8

I am probably very late with my answer, but if you are still dealing with geo clustering, you may find this study interesting. It deals with comparison of two fairly different approaches to classifying geographic data: K-means clustering and latent class growth modeling.

One of the images from the study:

The authors concluded that the end results were similar overall, and that there were some aspects where LCGM outperformed K-means.

6

You can use HDBSCAN for this. The Python package supports the haversine distance, which properly computes distances between lat/lon points.

As the docs mention, you will need to convert your points to radians first for this to work. The following pseudocode should do the trick:

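import numpy as np
import hdbscan
# the haversine metric expects [lat, lon] in radians, so convert from degrees first
points = np.radians(np.array([[lat1, lon1], [lat2, lon2], ...]))
clusterer = hdbscan.HDBSCAN(min_cluster_size=N, metric='haversine')
cluster_labels = clusterer.fit_predict(points)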


Is there a way to put a constraint on the Haversine distance saying "Hey, put only those points in the cluster where the pairwise distance is less than 10 km"? – CKM – 2020-04-10T14:04:22.860

0

Java Apache commons-math does this pretty easily.

List<Cluster<T>>    cluster(Collection<T> points)


-1

Using the k-means algorithm to cluster the locations is a bad idea. Your locations can be spread across the world and you can't predict the number of clusters. On top of that, if you set the number of clusters to 1, all the locations will be grouped into one single cluster. I am using OPTICS clustering for the same task. It worked like a charm.

-2

Go with k-means clustering, as HDBSCAN will take forever. I tried it for one of my projects and ended up using k-means, with the desired results.