DBSCAN clusters data by density and, by default, measures the distance between samples with the Euclidean distance. While studying the source code, I found that DBSCAN's Euclidean distance computation goes through sklearn.neighbors.NearestNeighbors and then sklearn.metrics.pairwise.euclidean_distances. This post walks through that call chain and briefly discusses how sklearn.metrics.pairwise.euclidean_distances optimizes the Euclidean distance formula and where that optimization falls short.
class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)
The class's methods include fit(X, y=None, sample_weight=None), among others.
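To make the parameters concrete, here is a minimal usage sketch (the toy data and the eps/min_samples values are my own choices, not from the scikit-learn documentation):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9],   # one dense group
              [8.0, 8.0], [8.1, 8.2], [7.9, 7.8],   # another dense group
              [50.0, 50.0]])                          # an isolated outlier
db = DBSCAN(eps=0.5, min_samples=2, metric='euclidean').fit(X)
print(db.labels_)  # e.g. [0 0 0 1 1 1 -1]; label -1 marks noise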
When the class is instantiated, the default is metric='euclidean', i.e. the Euclidean distance. The source code shows that DBSCAN uses sklearn.neighbors.NearestNeighbors to compute the distances between samples and compare them against eps.
# Excerpt from Lib\site-packages\sklearn\cluster\_dbscan.py
class DBSCAN(ClusterMixin, BaseEstimator):
    def fit(self, X, y=None, sample_weight=None):
        ......
        neighbors_model = NearestNeighbors(
            radius=self.eps, algorithm=self.algorithm, leaf_size=self.leaf_size,
            metric=self.metric, metric_params=self.metric_params,
            p=self.p, n_jobs=self.n_jobs)
        neighbors_model.fit(X)
        # This has worst case O(n^2) memory complexity
        neighborhoods = neighbors_model.radius_neighbors(X, return_distance=False)
        ......
As this excerpt shows, DBSCAN first instantiates the NearestNeighbors class, calls fit() on the training set X, and then uses radius_neighbors() to find all neighbors within the given radius eps.
class sklearn.neighbors.NearestNeighbors(*, n_neighbors=5, radius=1.0, algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=None)
The class's methods include fit(X, y=None) and radius_neighbors(X=None, radius=None, return_distance=True, sort_results=False), among others.
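The neighborhood query that DBSCAN performs internally can be reproduced directly with this class (a small sketch with made-up data; the radius value 0.5 is an arbitrary stand-in for eps):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0, 0.0], [0.3, 0.3], [5.0, 5.0]])
nn = NearestNeighbors(radius=0.5, metric='euclidean').fit(X)
# return_distance=False mirrors the internal DBSCAN call: for each sample,
# only the indices of the neighbors within the radius are returned.
neighborhoods = nn.radius_neighbors(X, return_distance=False)
print(neighborhoods)  # array of index arrays; every point is its own neighbor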
When metric='euclidean' and p=None (the defaults of the sklearn.cluster.DBSCAN constructor), the radius_neighbors() call made by the DBSCAN instance actually goes through the pairwise_distances_chunked function in sklearn.metrics.pairwise (leading eventually to sklearn.metrics.pairwise.euclidean_distances, discussed in the following sections).
# Excerpt from Lib\site-packages\sklearn\neighbors\_unsupervised.py
from ._base import NeighborsBase
from ._base import KNeighborsMixin
from ._base import RadiusNeighborsMixin

class NearestNeighbors(KNeighborsMixin, RadiusNeighborsMixin, NeighborsBase):
    def fit(self, X, y=None):
        return self._fit(X)
# Excerpt from Lib\site-packages\sklearn\neighbors\_base.py
from ..metrics import pairwise_distances_chunked

class NeighborsBase(MultiOutputMixin, BaseEstimator, metaclass=ABCMeta):
    def _fit(self, X, y=None):
        ......
        self.effective_metric_ = self.metric
        ......
        self._fit_X = X
        ......
        return self

class RadiusNeighborsMixin:
    def radius_neighbors(self, X=None, radius=None, return_distance=True,
                         sort_results=False):
        ......
        elif self._fit_method == 'brute':
            # for efficiency, use squared euclidean distances
            if self.effective_metric_ == 'euclidean':
                radius *= radius
                kwds = {'squared': True}
            else:
                kwds = self.effective_metric_params_
            reduce_func = partial(self._radius_neighbors_reduce_func,
                                  radius=radius, return_distance=return_distance)
            chunked_results = pairwise_distances_chunked(
                X, self._fit_X, reduce_func=reduce_func,
                metric=self.effective_metric_, n_jobs=self.n_jobs, **kwds)
        ......
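The radius *= radius plus squared=True branch above is simply the observation that comparing squared distances against eps² selects the same neighbors while skipping the square root. A quick sketch of the idea with made-up numbers:

import numpy as np

eps = 0.5
a = np.array([0.0, 0.0])
b = np.array([0.3, 0.3])

sq_dist = np.dot(a - b, a - b)   # squared Euclidean distance, no sqrt needed
print(sq_dist <= eps * eps)      # True
print(np.sqrt(sq_dist) <= eps)   # same decision, but with an extra sqrt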
As shown in the previous section, the radius_neighbors() executed by DBSCAN actually calls the pairwise_distances_chunked() function in sklearn.metrics.pairwise, and when metric='euclidean' the Euclidean distances are ultimately computed by the euclidean_distances() function, which is examined in the next section.
# Excerpt from Lib\site-packages\sklearn\metrics\pairwise.py
def pairwise_distances_chunked(X, Y=None, *, reduce_func=None, metric='euclidean',
                               n_jobs=None, working_memory=None, **kwds):
    ......
    for sl in slices:
        ......
        D_chunk = pairwise_distances(X_chunk, Y, metric=metric, n_jobs=n_jobs, **kwds)
        if ((X is Y or Y is None)
                and PAIRWISE_DISTANCE_FUNCTIONS.get(metric, None)
                is euclidean_distances):
            # zeroing diagonal, taking care of aliases of "euclidean",
            # i.e. "l2"
            D_chunk.flat[sl.start::_num_samples(X) + 1] = 0
        ......
        yield D_chunk

def pairwise_distances(X, Y=None, metric="euclidean", *, n_jobs=None,
                       force_all_finite=True, **kwds):
    ......
    elif metric in PAIRWISE_DISTANCE_FUNCTIONS:
        func = PAIRWISE_DISTANCE_FUNCTIONS[metric]
    ......
    return _parallel_pairwise(X, Y, func, n_jobs, **kwds)

# dict whose values are the distance functions for each metric name
PAIRWISE_DISTANCE_FUNCTIONS = {......, 'euclidean': euclidean_distances, ......}

def _parallel_pairwise(X, Y, func, n_jobs, **kwds):
    if Y is None:
        Y = X
    X, Y, dtype = _return_float_dtype(X, Y)
    if effective_n_jobs(n_jobs) == 1:
        return func(X, Y, **kwds)  # here func is euclidean_distances
    ......

def euclidean_distances(X, Y=None, *, Y_norm_squared=None, squared=False,
                        X_norm_squared=None):
    ......  # detailed in the next section, sklearn.metrics.pairwise.euclidean_distances
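As a rough sketch of how the chunked interface is used (my own minimal example; the reduce_func below simply collects, for every row of a distance chunk, the column indices within a radius, loosely imitating what _radius_neighbors_reduce_func does):

import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.array([[0.0, 0.0], [0.3, 0.3], [5.0, 5.0]])
radius = 0.5

def reduce_func(D_chunk, start):
    # keep, per row of the chunk, the indices whose distance is within the radius
    return [np.flatnonzero(row <= radius) for row in D_chunk]

neighborhoods = []
for chunk in pairwise_distances_chunked(X, X, reduce_func=reduce_func, metric='euclidean'):
    neighborhoods.extend(chunk)
print(neighborhoods)  # [array([0, 1]), array([0, 1]), array([2])]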
sklearn.metrics.pairwise.euclidean_distances(X, Y=None, *, Y_norm_squared=None, squared=False, X_norm_squared=None)
euclidean_distances is a function that takes two arrays of vectors, X and Y, computes the Euclidean distance between every pair of (row) vectors from X and Y, and returns the distance matrix. If only X is passed and Y=None (equivalent to Y=X), it returns the matrix of pairwise distances among the rows of X themselves. According to the official documentation, the Euclidean distance between each pair of vectors x and y is computed as:
dist(x, y) = sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y))
Take two points in the plane, $a=\begin{bmatrix}x_a\\y_a\end{bmatrix}$ and $b=\begin{bmatrix}x_b\\y_b\end{bmatrix}$, as an example. $dot(a,a)=x_a^2+y_a^2$, $dot(a,b)=x_ax_b+y_ay_b$, and $dot(b,b)=x_b^2+y_b^2$, so $dist(a,b)=\sqrt{x_a^2+y_a^2-2(x_ax_b+y_ay_b)+x_b^2+y_b^2}=\sqrt{(x_a-x_b)^2+(y_a-y_b)^2}$. This formulation is adopted for the sake of computational efficiency: if each vector's own dot product dot(x, x) or dot(y, y) is computed in advance, the time spent computing distances between different vectors shrinks (the original wording is quoted below). For instance, for every i, computing $dist(\vec{x}_1, \vec{x}_i)$ involves the intermediate value $dot(\vec{x}_1, \vec{x}_1)$; if this dot product is precomputed and cached, all distances involving $\vec{x}_1$ can be obtained faster. These precomputed dot products correspond to the keyword arguments X_norm_squared and Y_norm_squared of the function.
This formulation has two advantages over other ways of computing distances. First, it is computationally efficient when dealing with sparse data. Second, if one argument varies but the other remains unchanged, then dot(x, x) and/or dot(y, y) can be pre-computed.
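A quick numerical check of the expanded formula against the textbook definition (arbitrary vectors of my own choosing):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, -1.0])

# expanded form used by euclidean_distances
expanded = np.sqrt(np.dot(x, x) - 2 * np.dot(x, y) + np.dot(y, y))
# textbook definition: square root of the sum of squared coordinate differences
direct = np.sqrt(np.sum((x - y) ** 2))
print(expanded, direct)  # both print 5.385164807134504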
Therefore, given a data set X, calling euclidean_distances(X) yields a distance matrix M (note that in this case Y=X): the element in row i, column j of M is the Euclidean distance between $\vec{x}_i$ and $\vec{x}_j$, and in theory M should be symmetric.
$$X=\begin{bmatrix}\vec{x}_1\\\vec{x}_2\\\vdots\\\vec{x}_m\end{bmatrix}=\begin{bmatrix}x_{11}&x_{12}&\cdots&x_{1n}\\x_{21}&x_{22}&\cdots&x_{2n}\\\vdots&\vdots&\ddots&\vdots\\x_{m1}&x_{m2}&\cdots&x_{mn}\end{bmatrix}$$
For example, the distance between $\vec{x}_1$ and $\vec{x}_2$, dist($\vec{x}_1$, $\vec{x}_2$), corresponds to both the $m_{12}$ and the $m_{21}$ element of M.
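A small example of the resulting matrix M (toy data of my own):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])
M = euclidean_distances(X)   # Y=None, so distances among the rows of X themselves
print(M)
# [[ 0.  5. 10.]
#  [ 5.  0.  5.]
#  [10.  5.  0.]]
print(np.allclose(M, M.T))   # True for this well-conditioned example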
However, this algorithm trades accuracy for speed. The official documentation points out the drawback: although the computation is efficient, it is not the most precise, and the distance matrix returned by the function may well not be exactly symmetric.
However, this is not the most precise way of doing this computation, because this equation potentially suffers from “catastrophic cancellation”. Also, the distance matrix returned by this function may not be exactly symmetric as required by, e.g., scipy.spatial.distance functions.
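The cancellation is easy to reproduce by evaluating the expanded formula directly in floating point (an illustration of my own, deliberately using points that are very close together but far from the origin):

import numpy as np

x = np.array([1e8, 1e8])
y = np.array([1e8 + 1e-4, 1e8 + 1e-4])

# The three dot products are all around 2e16, so the true squared distance of
# about 2e-8 lies far below the float64 resolution at that magnitude and is
# almost entirely cancelled away.
expanded = np.sqrt(max(np.dot(x, x) - 2 * np.dot(x, y) + np.dot(y, y), 0.0))
# The direct formula works on the difference vector and keeps its precision.
direct = np.sqrt(np.sum((x - y) ** 2))
print(expanded, direct)  # expanded is 0.0 or wildly off; direct is about 1.4142e-04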
Compare scipy.spatial.distance.euclidean(u, v, w=None), which computes the Euclidean distance between two 1-D arrays u and v as $\|u-v\|_2=\left(\sum_i w_i\,|u_i-v_i|^2\right)^{1/2}$, where w holds the weights of the components and each $w_i$ defaults to 1.0.
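For comparison, a small example of the SciPy function, with and without weights (values of my own choosing):

import numpy as np
from scipy.spatial.distance import euclidean

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 0.0, -1.0])

print(euclidean(u, v))                     # sqrt(29), about 5.3852
print(euclidean(u, v, w=[1.0, 1.0, 0.0]))  # last coordinate ignored: sqrt(13), about 3.6056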
References (scikit-learn version 1.1.3):
(1) sklearn.neighbors.NearestNeighbors, https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html?highlight=nearestneighbors#sklearn.neighbors.NearestNeighbors
(2) sklearn.metrics.pairwise.euclidean_distances, https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html#sklearn.metrics.pairwise.euclidean_distances
(3) sklearn.cluster.DBSCAN, https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN
(4) scipy.spatial.distance.euclidean, https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html#scipy.spatial.distance.euclidean