The DBSCAN Algorithm in Python

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a density-based clustering algorithm and one of the most widely used.

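DBSCAN sorts data points into three categories: core points, border points, and noise points. It relies on two parameters, a neighborhood radius (eps in the code below) and a minimum neighbor count (min_samples). A point is a core point if at least min_samples points, itself included, lie within distance eps of it. Every point inside a core point's eps-neighborhood joins that core point's cluster, and the cluster keeps growing through any neighbors that are themselves core points; a point reached through such a chain is density-reachable from the starting core point. A border point is not a core point (its own neighborhood holds fewer than min_samples points) but falls within the eps-neighborhood of at least one core point, so it still belongs to a cluster. A noise point is neither: it is not a core point and lies in no core point's neighborhood.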
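To make the three categories concrete, here is a minimal sketch that only classifies points and does not yet build clusters. The helper name classify_points and the toy data are made up for illustration, and the sketch assumes Euclidean distance and that a point counts as its own neighbor.

```python
import numpy as np

def classify_points(X, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' (hypothetical helper)."""
    # Pairwise Euclidean distances, shape (n, n)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # within[i, j] is True when j lies in the eps-neighborhood of i (i included)
    within = dist <= eps
    is_core = within.sum(axis=1) >= min_samples
    # Border: not core, but inside the neighborhood of at least one core point
    near_core = (within & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

# Four points on a unit square, one point hanging off a corner, one far away
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [1.0, 1.0], [2.0, 1.0], [10.0, 10.0]])
kinds = classify_points(X, eps=1.1, min_samples=3)
print(kinds.tolist())
```

With these settings the four square corners come out as core points, the point at (2, 1) is a border point because its only neighbor is the core point at (1, 1), and the far-away point is noise.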

Below is a simple Python implementation:

from typing import List
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np

class DBSCAN:
    def __init__(self, eps: float, min_samples: int) -> None:
        self.eps = eps
        self.min_samples = min_samples

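    def _region_query(self, X: np.ndarray, point_idx: int, eps: float) -> List[int]:
        """
        Return the indices of all points within eps of point_idx (itself included).
        """
        # euclidean_distances returns an (n_samples, 1) array; take its single column
        D = euclidean_distances(X, X[[point_idx], :])[:, 0]
        return np.where(D <= eps)[0].tolist()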

    def fit(self, X: np.ndarray) -> np.ndarray:
        n_samples = X.shape[0]
        # -1 marks noise / not yet assigned; cluster ids start at 0
        labels = np.full(n_samples, -1, dtype=int)
        visited = np.zeros(n_samples, dtype=bool)
        cluster_idx = 0

        for i in range(n_samples):
            if visited[i]:
                continue
            visited[i] = True
            neighbors = self._region_query(X, i, self.eps)
            if len(neighbors) < self.min_samples:
                # Not a core point: stays noise unless a later cluster claims it
                continue
            labels[i] = cluster_idx
            # neighbors grows while we iterate, which is what expands the cluster
            for j in neighbors:
                if not visited[j]:
                    visited[j] = True
                    neighbors_j = self._region_query(X, j, self.eps)
                    if len(neighbors_j) >= self.min_samples:
                        neighbors.extend(neighbors_j)
                if labels[j] == -1:
                    labels[j] = cluster_idx
            # Advance the id only after a cluster has actually been built
            cluster_idx += 1

        return labels
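A hand-written implementation like this is easiest to trust when it is checked against a reference. The sketch below runs scikit-learn's own DBSCAN (imported under the alias SklearnDBSCAN to avoid clashing with the class above) on a small toy dataset that was made up for this illustration. scikit-learn reports noise as -1 and, like the code above, counts a point as part of its own neighborhood.

```python
import numpy as np
from sklearn.cluster import DBSCAN as SklearnDBSCAN

# Two tight groups plus one isolated point (toy data chosen for illustration)
X = np.array([
    [0.0, 0.0], [0.0, 0.5], [0.5, 0.0],
    [5.0, 5.0], [5.0, 5.5], [5.5, 5.0],
    [20.0, 20.0],
])

reference = SklearnDBSCAN(eps=1.0, min_samples=3).fit(X)
print(reference.labels_)               # cluster id per point, -1 for noise
print(reference.core_sample_indices_)  # indices of the core points
```

Cluster ids are arbitrary, so when comparing two implementations it is safer to check that the same points are grouped together than to compare the raw label values.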

Example usage:

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=1000, centers=3, random_state=42)

dbscan = DBSCAN(eps=
