Clustering geo location coordinates (lat,long pairs)

Question

What is the right approach and clustering algorithm for geolocation clustering?

I'm using the following code to cluster geolocation coordinates:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.vq import kmeans2, whiten

    coordinates = np.array([
        [lat, long],
        [lat, long],
        ...
        [lat, long]
    ])
    x, y = kmeans2(whiten(coordinates), 3, iter=20)
    plt.scatter(coordinates[:, 0], coordinates[:, 1], c=y)
    plt.show()

Is it right to use K-means for geolocation clustering, given that it uses Euclidean distance rather than the Haversine formula as its distance function?

You can also take a look at this similar question: datascience.stackexchange.com/questions/10063/… — VividD
– VividD, Commented May 11, 2017 at 11:41
I think the feasibility of k-means depends on where your data are. If your data are spread all over the world, it won't work, because the distance is not Euclidean, as other users have already pointed out. But if your data are more local, k-means would be good enough, as the geometry is locally Euclidean. — Juan Ignacio Gil
– Juan Ignacio Gil, Commented May 31, 2018 at 8:34

Zephyr · Accepted Answer · 2020-08-02 14:02:48Z

11

K-means should be right in this case. Since k-means groups objects based solely on the Euclidean distance between them, you will get back clusters of locations that are close to each other.

To find the optimal number of clusters, you can try making an 'elbow' plot of the within-group sum of squared distances.
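As a rough sketch of the elbow approach, the snippet below computes the within-cluster sum of squares for several values of k using the same `kmeans2` function as the question. The helper name `within_cluster_ss`, the synthetic coordinates, and the use of `minit="++"` with a fixed `seed` (which needs a reasonably recent SciPy) are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
from scipy.cluster.vq import kmeans2


def within_cluster_ss(points, k, seed=0):
    """Run k-means once and return the within-cluster sum of squared distances."""
    centroids, labels = kmeans2(points, k, iter=20, minit="++", seed=seed)
    return float(np.sum((points - centroids[labels]) ** 2))


# Synthetic stand-in for [lat, long] rows: three tight, local groups.
rng = np.random.default_rng(0)
centers = np.array([[48.85, 2.35], [48.95, 2.55], [48.70, 2.20]])
coordinates = np.vstack(
    [c + 0.01 * rng.standard_normal((50, 2)) for c in centers]
)

ks = list(range(1, 7))
wcss = [within_cluster_ss(coordinates, k) for k in ks]
for k, value in zip(ks, wcss):
    print(k, round(value, 4))
# Plot ks against wcss (for example plt.plot(ks, wcss, "o-")) and look for the bend.
```

If the curve has no clear bend, the plot may not settle the choice of k, and repeating the runs with different seeds is a cheap sanity check.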

answered Jul 17, 2014 at 12:34 by mike1886; edited Aug 2, 2020 at 14:02 by Zephyr

How are points close to each other on the wrap-around point handled?
– casperOne, Commented Jul 18, 2014 at 17:00

You need to find an algorithm that takes a pre-computed distance matrix or allows you to supply a distance function that it can call when it needs to compute distances. Otherwise it won't work.
– Spacedman, Commented Jul 28, 2014 at 13:45

The elbow plot may not help you at all because there might be no elbow. Also make sure to try several runs of k-means with the same cluster number because you might get different results.
– Grasshopper, Commented May 12, 2017 at 10:05

This is a poor idea since all points will be clustered, which is rarely a good idea in mapping.
– Richard, Commented Aug 12, 2019 at 20:40

Zephyr · Accepted Answer · 2020-08-02 14:02:53Z

K-means is not the most appropriate algorithm here.

The reason is that k-means is designed to minimize variance. This is, of course, appealing from a statistical and signal processing point of view, but your data is not "linear".

Since your data is in latitude, longitude format, you should use an algorithm that can handle arbitrary distance functions, in particular geodetic distance functions. Hierarchical clustering, PAM, CLARA, and DBSCAN are popular examples of this.
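As one possible illustration of clustering with a geodetic distance, the sketch below builds a pairwise Haversine distance matrix with NumPy and feeds it to SciPy's hierarchical clustering. The function names, the mean Earth radius constant, and the choice of complete linkage with a distance cut-off are assumptions made for this example; other linkages or algorithms from the list above would work on the same matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant)


def haversine_matrix(latlon_deg):
    """Pairwise great-circle distances in km for rows of [lat, lon] in degrees."""
    pts = np.radians(np.asarray(latlon_deg, dtype=float))
    lat = pts[:, 0][:, None]
    lon = pts[:, 1][:, None]
    dlat = lat - lat.T
    dlon = lon - lon.T
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(lat.T) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def cluster_by_distance(latlon_deg, max_km):
    """Complete-linkage clusters in which no pair is farther apart than max_km."""
    condensed = squareform(haversine_matrix(latlon_deg), checks=False)
    tree = linkage(condensed, method="complete")
    return fcluster(tree, t=max_km, criterion="distance")


points = [
    [48.8566, 2.3522],    # Paris
    [48.8606, 2.3376],    # Paris, about 1 km away
    [40.7128, -74.0060],  # New York
]
print(cluster_by_distance(points, max_km=10))
```

Because the distance matrix is O(n²) in memory, this approach suits thousands of points rather than millions.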

This recommends OPTICS clustering.

The problems with k-means are easy to see when you consider points close to the ±180 degrees wrap-around. Even if you hacked k-means to use Haversine distance, the update step that recomputes the mean would produce badly wrong centers. In the worst case, k-means will never converge!
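To make the wrap-around problem concrete, here is a small sketch comparing the arithmetic mean that the k-means update step would compute with a mean taken on 3D unit vectors. Both helper names are hypothetical, and the vector-based mean is only one common workaround, not something the answer itself proposes.

```python
import numpy as np


def naive_mean(latlon_deg):
    """Arithmetic mean of [lat, lon] rows, as a plain k-means update would do."""
    return np.mean(np.asarray(latlon_deg, dtype=float), axis=0)


def spherical_mean(latlon_deg):
    """Mean direction: average the 3D unit vectors, then convert back to degrees."""
    lat, lon = np.radians(np.asarray(latlon_deg, dtype=float)).T
    x = np.mean(np.cos(lat) * np.cos(lon))
    y = np.mean(np.cos(lat) * np.sin(lon))
    z = np.mean(np.sin(lat))
    mean_lat = np.degrees(np.arctan2(z, np.hypot(x, y)))
    mean_lon = np.degrees(np.arctan2(y, x))
    return mean_lat, mean_lon


# Two points about 2 degrees of longitude apart, on either side of the date line.
pair = [[10.0, 179.0], [10.0, -179.0]]
print(naive_mean(pair))      # longitude lands near 0, on the opposite side of the globe
print(spherical_mean(pair))  # longitude lands at the date line, between the two points
```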

Can you suggest a more appropriate clustering method for geo-location data? — Alex Spurling
– Alex Spurling, Commented Oct 25, 2016 at 21:50

Brian Spiering · Accepted Answer · 2018-02-02 05:39:58Z

10

GPS coordinates can be directly converted to a geohash. Geohash divides the Earth into "buckets" whose size depends on the number of characters in the code (short Geohash codes cover big areas and longer codes cover smaller areas). Geohash is an adjustable-precision clustering method.
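A minimal sketch of the idea, assuming the standard Geohash encoding (interleaved longitude and latitude bits, base-32 alphabet): points are grouped by a shared code prefix. The function names `geohash_encode` and `bucket_by_geohash` are made up for this example, and in practice an existing geohash library would likely be used instead.

```python
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair (degrees) as a geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate between the axes, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def bucket_by_geohash(points, precision=5):
    """Group (lat, lon) pairs by their geohash at the given precision."""
    buckets = defaultdict(list)
    for lat, lon in points:
        buckets[geohash_encode(lat, lon, precision)].append((lat, lon))
    return dict(buckets)


points = [(48.8566, 2.3522), (48.8606, 2.3376), (40.7128, -74.0060)]
for code, members in bucket_by_geohash(points, precision=3).items():
    print(code, members)
```

Lowering `precision` merges neighbouring buckets into larger ones, which is what makes the granularity adjustable.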

answered Dec 5, 2017 at 20:47 by Brian Spiering; edited Feb 2, 2018 at 5:39

This seems to suffer from the same 180 degree wrap-around problem that K-Means does per the Wikipedia article linked in the answer.
– Norman H, Commented Jun 1, 2018 at 4:07

Yep! Plus codes are much better: plus.codes
– Brian Spiering, Commented Jun 2, 2018 at 18:28

One benefit to this solution is that as long as you calculate the geohash once, repeated comparison operations will go much more quickly.
– Norman H, Commented Jun 7, 2018 at 14:15

Geohash will have issues with bucket-edge cases: two very close points will be put in different buckets based on the arbitrary edges of each bucket.
– Dan G, Commented Jun 12, 2019 at 21:08

VividD · Accepted Answer · 2017-05-11 11:31:11Z

I am probably very late with my answer, but if you are still dealing with geo clustering, you may find this study interesting. It deals with comparison of two fairly different approaches to classifying geographic data: K-means clustering and latent class growth modeling.

One of the images from the study:

The authors concluded that the end results were similar overall, and that there were some aspects where LCGM outperformed K-means.

Vivek Payasi · Accepted Answer · 2023-01-12 18:50:07Z

You can use HDBSCAN for this. The Python package supports haversine distance, which will properly compute distances between lat/lon points.

As the docs mention, you will need to convert your points to radians first for this to work. The following pseudocode should do the trick:

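    import numpy as np
    import hdbscan

    # Rows of [latitude, longitude] in degrees; lat1, lon1, ... and N are placeholders.
    points = np.array([[lat1, lon1], [lat2, lon2], ...])
    # The haversine metric expects radians.
    rads = np.radians(points)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=N, metric='haversine')
    cluster_labels = clusterer.fit_predict(rads)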

Is there a way to put a constraint on the Haversine distance, saying "Hey, put only those points in the cluster where the pairwise distance is less than 10 km"? — CKM
– CKM, Commented Apr 10, 2020 at 14:04

Zephyr · Accepted Answer · 2020-08-02 13:23:58Z

0

Java Apache commons-math does this pretty easily.

List<Cluster<T>> cluster(Collection<T> points)

answered Jul 24, 2017 at 5:12 by Jeryl Cook; edited Aug 2, 2020 at 13:23 by Zephyr

RzvnK · Accepted Answer · 2021-07-14 21:20:13Z

@CKM There is a parameter in the HDBSCAN package, cluster_selection_epsilon, which allows you to set the acceptable distance for neighboring points in the same cluster (just like epsilon in DBSCAN).

Alternatively, you can use DBSCAN and set the eps parameter to 10 km / 6371.0088 km (the Earth's mean radius), which expresses the 10 km limit in radians. This does not mean, though, that every pair of points in a cluster will be closer than that distance. Two border points can be far apart yet reachable from each other through a chain of core points. The limit only ensures that each core point has at least N neighboring points within that distance (N is the minimum number of points required in each cluster).
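The DBSCAN variant could look roughly like the sketch below, which assumes scikit-learn's `DBSCAN` with `metric="haversine"` and a ball tree. The wrapper name `dbscan_haversine`, the sample coordinates, and `min_samples=2` are illustrative choices, not values from the answer.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, as used in the answer


def dbscan_haversine(latlon_deg, eps_km, min_samples):
    """Cluster [lat, lon] rows (degrees) with DBSCAN on great-circle distance."""
    rads = np.radians(np.asarray(latlon_deg, dtype=float))
    model = DBSCAN(
        eps=eps_km / EARTH_RADIUS_KM,  # convert km to radians on the unit sphere
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",
    )
    return model.fit_predict(rads)


points = [
    [48.8566, 2.3522], [48.8606, 2.3376], [48.8530, 2.3499],        # Paris
    [40.7128, -74.0060], [40.7306, -73.9866], [40.7061, -73.9969],  # New York
    [64.1466, -21.9426],                                            # isolated point
]
labels = dbscan_haversine(points, eps_km=10, min_samples=2)
print(labels)  # a label of -1 marks points treated as noise
```

Unlike k-means, this leaves isolated points unassigned instead of forcing them into the nearest cluster.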

Rugved Mahamune · Accepted Answer · 2020-05-22 14:08:19Z

Using the k-means algorithm to cluster the locations is a bad idea. Your locations can be spread across the world, and you can't predict the number of clusters in advance. On top of that, if you set the number of clusters to 1, all the locations will be grouped into one single cluster. I am using OPTICS clustering for the same task. It worked like a charm.

Vivek Khetan · Accepted Answer · 2018-02-03 17:27:03Z

Go with K-means clustering, as HDBSCAN will take forever. I tried it for one of my projects and ended up using K-means, with the desired results.
