sklearn clustering algorithms: DBSCAN

Algorithm

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)


The idea is that if a particular point belongs to a cluster, it should be near to lots of other points in that cluster.


It is an unsupervised clustering method: the number of clusters does not need to be specified in advance.
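As a quick illustration (a minimal sketch, not from the original article; the toy points reappear in the example near the end of this post), the number of clusters is something you read off the result afterwards rather than pass in:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
labels = DBSCAN(eps=3, min_samples=2).fit_predict(X)   # note: no n_clusters argument

# Count the clusters DBSCAN found on its own (label -1 marks noise, not a cluster).
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)   # 2 for this toy data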

How the algorithm works


A visualization of the algorithm can be found in reference [2].

The DBSCAN algorithm views clusters as areas of high density separated by areas of low density.


It works like this: First we choose two parameters, a positive number epsilon and a natural number minPoints. We then begin by picking an arbitrary point in our dataset. If there are more than minPoints points within a distance of epsilon from that point (including the original point itself), we consider all of them to be part of a "cluster". We then expand that cluster by checking all of the new points and seeing if they too have more than minPoints points within a distance of epsilon, growing the cluster recursively if so.

Eventually, we run out of points to add to the cluster. We then pick a new arbitrary point and repeat the process. Now, it's entirely possible that a point we pick has fewer than minPoints points in its epsilon ball, and is also not a part of any other cluster. If that is the case, it's considered a "noise point" not belonging to any cluster.

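To make the procedure above concrete, here is a small self-contained sketch in plain NumPy (this code is not from the original article or from sklearn; the function name dbscan_sketch and the brute-force O(n^2) distance computation are illustrative choices, and a point is treated as a core point when its epsilon-neighbourhood, itself included, contains at least minPoints points, which matches sklearn's min_samples convention):

import numpy as np

def dbscan_sketch(X, eps, min_points):
    """Toy DBSCAN for illustration only. Returns cluster labels; -1 marks noise."""
    n = len(X)
    labels = np.full(n, -1)              # -1 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    # Pairwise Euclidean distances and epsilon-neighbourhoods (each point is in its own neighbourhood).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]

    cluster_id = 0
    for i in range(n):                   # pick the next unvisited point
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_points:
            continue                     # not a core point; stays noise unless a cluster claims it later
        labels[i] = cluster_id           # i is a core point: start a new cluster and grow it
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:          # claim unassigned points (this is how border points get labelled)
                labels[j] = cluster_id
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_points:
                    queue.extend(neighbors[j])   # j is also a core point: keep expanding
        cluster_id += 1
    return labels

# On the toy data used later in this post this should agree with
# sklearn's DBSCAN(eps=3, min_samples=2), i.e. roughly [0 0 0 1 1 -1].
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)
print(dbscan_sketch(X, eps=3, min_points=2))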

(There's a slight complication worth pointing out: say minPoints=4, and you have a point with three points in its epsilon ball, including itself. Say the other two points belong to two different clusters, and each has 4 points in their epsilon balls. Then both of these dense points will "fight over" the original point, and it's arbitrary which of the two clusters it ends up in. To see what I mean, try out "Example A" with minPoints=4, epsilon=1.98. Since DBSCAN considers the points in an arbitrary order, the middle point can end up in either the left or the right cluster on different runs. This kind of point is known as a "border point").

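This order dependence can be reproduced with sklearn itself. The sketch below is not from the original article: the 1-D points, eps=1.6 and min_samples=4 are made-up values standing in for "Example A". It builds two dense groups plus one point midway between them that is within eps of a core point of each group but is not a core point itself; with sklearn's implementation, which expands clusters in the order the points are given, that border point is claimed by whichever group is expanded first:

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense 1-D groups and a single point midway between them.
left   = [[0.0], [0.5], [1.0], [1.5]]
right  = [[4.5], [5.0], [5.5], [6.0]]
border = [[3.0]]   # reachable from a core point of each group, but not a core point itself

def companions_of_border(points, border_idx, eps=1.6, min_samples=4):
    """Return the points that end up in the same cluster as the border point."""
    X = np.array(points)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return sorted(X[labels == labels[border_idx]].ravel().tolist())

# The border point sits at index 4 in both orderings.
print(companions_of_border(left + border + right, border_idx=4))   # expected: grouped with the left points
print(companions_of_border(right + border + left, border_idx=4))   # expected: grouped with the right points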

Implementation

Official reference documentation: sklearn.cluster.DBSCAN

class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean', metric_params=None)

Parameters:

eps [float, default=0.5]

Corresponds to epsilon in the algorithm description above.

min_samples [int, default=5]

Corresponds to minPoints in the algorithm description above.

metric [string, or callable, default=’euclidean’]

The metric to use when calculating the distance between points (the default is the Euclidean distance). If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances. If the distances between points have already been precomputed, pass metric='precomputed'; X is then taken to be a distance matrix and must be square (a short sketch of this usage follows the parameter list).

metric_params [dict, default=None]

Additional keyword arguments to pass to the metric function.
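As a small illustration of the metric parameter (this snippet is not from the original article; the toy points are the same ones used in the example further down), clustering on a precomputed square distance matrix should give the same result as letting DBSCAN compute Euclidean distances itself:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])

# Default behaviour: Euclidean distances are computed internally.
labels_default = DBSCAN(eps=3, min_samples=2).fit_predict(X)

# Equivalent: precompute the square (n_samples, n_samples) distance matrix
# and tell DBSCAN that X is already a distance matrix.
D = pairwise_distances(X, metric='euclidean')
labels_precomputed = DBSCAN(eps=3, min_samples=2, metric='precomputed').fit_predict(D)

print(labels_default)       # e.g. [ 0  0  0  1  1 -1]
print(labels_precomputed)   # should match the line above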

Attributes:

core_sample_indices_ [ndarray of shape (n_core_samples,)]

Indices of core samples.

components_ [ndarray of shape (n_core_samples, n_features)]

Copy of each core sample found by training.

labels_ [ndarray of shape (n_samples)]

Cluster labels for each point in the dataset passed to fit(). Noisy samples are given the label -1.
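These attributes can be read directly off a fitted estimator. A minimal sketch (not from the original article; the four 2-D points and the eps/min_samples values are made up for illustration):

import numpy as np
from sklearn.cluster import DBSCAN

# Three nearby points forming one cluster, plus one far-away noise point.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]])
db = DBSCAN(eps=1.5, min_samples=3).fit(X)

print(db.labels_)                # one label per sample; the outlier should get -1
print(db.core_sample_indices_)   # indices of the core samples within X
print(db.components_)            # coordinates of those core samples (a copy)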


A simple example

>>> from sklearn.cluster import DBSCAN
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3],
...               [8, 7], [8, 8], [25, 80]])
>>> clustering = DBSCAN(eps=3, min_samples=2).fit(X)
>>> clustering.labels_
array([ 0,  0,  0,  1,  1, -1])
>>> clustering
DBSCAN(eps=3, min_samples=2)

Algorithm demo

See the official demo example in the sklearn documentation.

References:

[1] sklearn official documentation
[2] visualizing-dbscan-clustering
[3] 风弦鹤's blog: DBSCAN clustering algorithm, machine learning (theory + illustrations + Python code)
