Knn distance plot for determining eps of DBSCAN

Question

I would like to use the k-NN distance plot to figure out which eps value I should choose for the DBSCAN algorithm. Based on this page:

The idea is to calculate, the average of the distances of every point to its k nearest neighbors. The value of k will be specified by the user and corresponds to MinPts. Next, these k-distances are plotted in an ascending order. The aim is to determine the “knee”, which corresponds to the optimal eps parameter.

Using Python with numpy/sklearn, I have the following points, with the following distances for 6-NN:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    nbrs = NearestNeighbors(n_neighbors=len(X)).fit(X)
    distances, indices = nbrs.kneighbors(X)

    # Indices
    # [[0 1 2 3 4 5]
    #  [1 0 2 3 4 5]
    #  [2 1 0 3 4 5]
    #  [3 4 5 0 1 2]
    #  [4 3 5 0 1 2]
    #  [5 4 3 0 1 2]]

    # Distances
    # [[ 0.          1.          2.23606798  2.82842712  3.60555128  5.        ]
    #  [ 0.          1.          1.41421356  3.60555128  4.47213595  5.83095189]
    #  [ 0.          1.41421356  2.23606798  5.          5.83095189  7.21110255]
    #  [ 0.          1.          2.23606798  2.82842712  3.60555128  5.        ]
    #  [ 0.          1.          1.41421356  3.60555128  4.47213595  5.83095189]
    #  [ 0.          1.41421356  2.23606798  5.          5.83095189  7.21110255]]

Then I computed the average distance:

    distances.mean()
    # 2.9269575028354495

The problem is that I don't understand how to reproduce their plot in Python, with the distances on the y-axis and the points, ordered by distance, on the x-axis.

Thanks for your help.

![enter image description here](i.sstatic.net/KFDbs.png) Why does my neighboring point graph have this shape? Please help me!
– Dung Le, Commented Oct 10, 2017 at 0:37

Has QUIT--Anony-Mousse · Accepted Answer · 2016-02-09 19:34:15Z

9

You:

- take the last column of that matrix,
- sort it in descending order,
- plot index against distance,
- hope to see a knee (if the distance function does not work well, there might be none).
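The steps above can be sketched as follows. This is only an illustration under a few assumptions: the helper names `k_distances` and `plot_k_distances` are hypothetical, the distance is Euclidean, and the brute-force loop is meant for small data only. With scikit-learn, the same values should come from the last column of `kneighbors` when `n_neighbors=k + 1`, because a query on the training data returns each point itself at distance 0 (the first column in the question's output).

```python
import math


def k_distances(points, k):
    """Return each point's distance to its k-th nearest neighbour, sorted descending."""
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point (the point itself is skipped).
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(d[k - 1])  # distance to the k-th nearest neighbour
    return sorted(result, reverse=True)


def plot_k_distances(points, k):
    """Plot array index (x-axis) against the sorted k-distance (y-axis)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    dists = k_distances(points, k)
    plt.plot(range(len(dists)), dists)
    plt.xlabel("points sorted by distance")
    plt.ylabel(f"{k}-NN distance")
    plt.show()


# The six points from the question, with k=2 chosen purely for illustration.
X = [(-1, -1), (-2, -1), (-3, -2), (1, 1), (2, 1), (3, 2)]
curve = k_distances(X, k=2)
```

Calling `plot_k_distances(X, k=2)` would draw the curve; with only six points no knee should be expected, so a realistic data set and a k matching MinPts are needed before reading an eps value off the plot.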


On the same plot, do I do this for different k, or only one k per plot as in the example? And what do you mean by "index"?
– Marc Lamberti, Commented Feb 9, 2016 at 20:53

Using 6-NN when you only have 6 points is of course nonsense. Do it for an appropriate k. "Index" as in "array index", because you need 2D to plot.
– Has QUIT--Anony-Mousse, Commented Feb 9, 2016 at 20:57

And I only use the last column of the distance matrix? Because in the example they talk about averaging distances.
– Marc Lamberti, Commented Feb 9, 2016 at 22:26

That post is incorrect there and in at least one other place (you don't need to set a seed).
– Has QUIT--Anony-Mousse, Commented Feb 9, 2016 at 22:46

You only have one k. Why don't you use the DBSCAN paper instead of mashing up various low-quality websites?
– Has QUIT--Anony-Mousse, Commented Feb 9, 2016 at 22:53
