Added dbscan tag
Kasra Manshaei

I would like to use the k-NN distance plot to figure out which eps value to choose for the DBSCAN algorithm. Based on this page: http://www.sthda.com/english/wiki/dbscan-density-based-clustering-for-discovering-clusters-in-large-datasets-with-noise-unsupervised-machine-learning :

The idea is to calculate, the average of the distances of every point to its k nearest neighbors. The value of k will be specified by the user and corresponds to MinPts. Next, these k-distances are plotted in an ascending order. The aim is to determine the “knee”, which corresponds to the optimal eps parameter.

Using Python with numpy/sklearn, I have the following points, with the following distances for 6-NN:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    nbrs = NearestNeighbors(n_neighbors=len(X)).fit(X)
    distances, indices = nbrs.kneighbors(X)

    # Indices
    [[0 1 2 3 4 5]
     [1 0 2 3 4 5]
     [2 1 0 3 4 5]
     [3 4 5 0 1 2]
     [4 3 5 0 1 2]
     [5 4 3 0 1 2]]

    # Distances
    [[ 0.          1.          2.23606798  2.82842712  3.60555128  5.        ]
     [ 0.          1.          1.41421356  3.60555128  4.47213595  5.83095189]
     [ 0.          1.41421356  2.23606798  5.          5.83095189  7.21110255]
     [ 0.          1.          2.23606798  2.82842712  3.60555128  5.        ]
     [ 0.          1.          1.41421356  3.60555128  4.47213595  5.83095189]
     [ 0.          1.41421356  2.23606798  5.          5.83095189  7.21110255]]

Then I computed the average distance:

    distances.mean()
    2.9269575028354495
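Note that `distances.mean()` collapses the whole 6x6 matrix, zero self-distances included, into a single scalar, whereas the quoted description asks for one value per point. A minimal sketch of a per-point calculation follows. It assumes `k = 2` as a stand-in for MinPts, which is an illustrative choice and not something taken from the page:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
k = 2  # assumed MinPts for this toy example

# Calling kneighbors() without arguments queries the training points and
# leaves each point out of its own neighbor list, so there is no zero
# self-distance column to drop.
distances, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors()

avg_k_dist = distances.mean(axis=1)  # average distance to the k neighbors, one value per point
kth_dist = distances[:, -1]          # distance to the k-th neighbor, one value per point
```

Either `avg_k_dist` or `kth_dist` gives one number per point, which is what a sorted k-distance curve needs.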

The problem is that I don't understand how to reproduce their plot in Python, with the distances on the y-axis and the points, sorted by distance, on the x-axis.
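One possible way to draw such a plot is sketched here under stated assumptions rather than as a definitive recipe. The helper names `k_distance_curve` and `plot_k_distance` are hypothetical, `k = 2` is an assumed MinPts for the toy data, and the curve uses the distance to the k-th neighbor, a common variant. Replacing `distances[:, -1]` with `distances.mean(axis=1)` gives the averaged version the quote describes.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def k_distance_curve(X, k):
    """Return each point's distance to its k-th nearest neighbor, sorted ascending."""
    nbrs = NearestNeighbors(n_neighbors=k).fit(X)
    # No argument: each training point is queried without counting itself.
    distances, _ = nbrs.kneighbors()
    return np.sort(distances[:, -1])


def plot_k_distance(curve, k):
    """Plot the sorted k-distances; the knee of this curve is a candidate eps."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    plt.plot(np.arange(1, len(curve) + 1), curve, marker="o")
    plt.xlabel("Points sorted by distance")
    plt.ylabel(f"{k}-NN distance")
    plt.show()


X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
k = 2  # assumed MinPts for this toy example
curve = k_distance_curve(X, k)
```

Calling `plot_k_distance(curve, k)` then draws the curve. With only six points there is no real knee to read off, so this only shows the mechanics.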

Thanks for your help.

Marc Lamberti

Knn distance plot for determining eps of DBSCAN
