Machine Learning -- Density-Based Clustering (DBSCAN)

Density-Based Clustering

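    Density-based clustering algorithms assume that the cluster structure can be determined from how tightly the samples are distributed. Typically, such an algorithm examines the connectivity between samples from the perspective of sample density, and keeps expanding clusters from connectable samples until it obtains the final clustering.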

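   DBSCAN is a well-known density-based clustering algorithm. It uses a pair of "neighborhood" parameters $(\epsilon, MinPts)$ to characterize how tightly the samples are distributed. Given a sample set $D = \{ x_{1}, x_{2}, ..., x_{m} \}$, define the following concepts: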

  1. $\epsilon$-neighborhood: for a sample $x_{j} \in D$, its $\epsilon$-neighborhood contains the samples in $D$ whose distance from $x_{j}$ is at most $\epsilon$, i.e. $N_{\epsilon}(x_{j}) = \{ x_{i} \in D \mid dist(x_{i}, x_{j}) \leq \epsilon \}$;
  2. Core object: if the $\epsilon$-neighborhood of $x_{j}$ contains at least $MinPts$ samples, i.e. $|N_{\epsilon}(x_{j})| \geq MinPts$, then $x_{j}$ is a core object;
  3. Directly density-reachable: if $x_{j}$ lies in the $\epsilon$-neighborhood of $x_{i}$ and $x_{i}$ is a core object, then $x_{j}$ is directly density-reachable from $x_{i}$;
  4. Density-reachable: for $x_{i}$ and $x_{j}$, if there is a sequence of samples $p_{1}, p_{2}, ..., p_{n}$ with $p_{1} = x_{i}$ and $p_{n} = x_{j}$ such that each $p_{i+1}$ is directly density-reachable from $p_{i}$, then $x_{j}$ is density-reachable from $x_{i}$;
  5. Density-connected: for $x_{i}$ and $x_{j}$, if there is a sample $x_{k}$ such that both $x_{i}$ and $x_{j}$ are density-reachable from $x_{k}$, then $x_{i}$ and $x_{j}$ are density-connected.
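
To make the first three definitions concrete, here is a minimal sketch on a made-up 1-D sample set. The helper names (`neighborhood`, `is_core`, `directly_reachable`), the points, and the parameter values are illustrative assumptions and do not come from the original post.

```python
# Toy illustration of ε-neighborhood, core object and direct density-reachability.
# The points, eps and min_pts below are made-up values chosen for illustration.
points = [0.0, 1.0, 2.0, 9.0]
eps, min_pts = 1.5, 2

def neighborhood(j):
    """Indices i with dist(x_i, x_j) <= eps (x_j itself is included)."""
    return [i for i, x in enumerate(points) if abs(x - points[j]) <= eps]

def is_core(j):
    """x_j is a core object if its ε-neighborhood holds at least min_pts samples."""
    return len(neighborhood(j)) >= min_pts

def directly_reachable(j, i):
    """True if x_j is directly density-reachable from x_i."""
    return is_core(i) and j in neighborhood(i)

print(neighborhood(1))                            # [0, 1, 2]
print([is_core(j) for j in range(len(points))])   # [True, True, True, False]
print(directly_reachable(0, 1))                   # True
print(directly_reachable(3, 1))                   # False
```

Note that direct density-reachability is not symmetric in general: the isolated point at index 3 is not a core object, so nothing is directly density-reachable from it except itself being in its own neighborhood without meeting the core condition.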


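Algorithm description:

Input: sample set D = {x1, x2, ..., xm};
    neighborhood parameters (ε, MinPts).
Process:
Initialize the set of core objects: Ω = Ø
for j = 1, 2, ..., m do
    Determine the ε-neighborhood N(xj) of sample xj;
    if |N(xj)| >= MinPts then
        Add sample xj to the core-object set Ω
    end if
end for
Initialize the number of clusters: k = 0
Initialize the set of unvisited samples: Γ = D
while Ω != Ø do
    Record the current set of unvisited samples: Γold = Γ;
    Pick a random core object o ∈ Ω and initialize the queue Q = <o>;
    Γ = Γ\{o};
    while Q != Ø do
        Take the first sample q out of queue Q;
        if |N(q)| >= MinPts then
            Let Δ = N(q) ∩ Γ;
            Add the samples in Δ to queue Q;
            Γ = Γ\Δ;
        end if
    end while
    k = k + 1, generate cluster Ck = Γold\Γ;
    Ω = Ω\Ck
end while
Output:
Cluster partition C = {C1, C2, ..., Ck}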
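
The inner `while` loop is a breadth-first expansion: starting from a seed core object, it collects every sample that is density-reachable from it, and only samples whose neighborhood holds at least MinPts members push the frontier further. Below is a minimal sketch of just that step, assuming the neighborhoods have already been computed. The function name `expand_cluster` and the toy neighborhoods are hypothetical.

```python
from collections import deque

def expand_cluster(seed, neighborhoods, min_pts, unvisited):
    """Collect all samples density-reachable from the core object `seed`.

    neighborhoods: dict mapping index -> set of indices in its ε-neighborhood
    unvisited:     set of indices not yet assigned (mutated in place, like Γ)
    """
    cluster = {seed}
    unvisited.discard(seed)
    queue = deque([seed])
    while queue:
        q = queue.popleft()
        # Only core objects propagate the cluster further
        if len(neighborhoods[q]) >= min_pts:
            delta = neighborhoods[q] & unvisited   # Δ = N(q) ∩ Γ
            queue.extend(delta)
            unvisited -= delta                     # Γ = Γ \ Δ
            cluster |= delta
    return cluster

# Hypothetical neighborhoods for five samples: 0-1-2-3 form a chain, 4 is isolated
neighborhoods = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2, 3}, 3: {2, 3}, 4: {4}}
unvisited = set(neighborhoods)
cluster = expand_cluster(1, neighborhoods, min_pts=3, unvisited=unvisited)
print(sorted(cluster), sorted(unvisited))  # [0, 1, 2, 3] [4]
```

Samples 0 and 3 join the cluster as border points: they are reachable from a core object but are not core objects themselves, so the expansion stops at them.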

Dataset:

data.txt
0.697,0.46
0.774,0.376
0.634,0.264
0.608,0.318
0.556,0.215
0.403,0.237
0.481,0.149
0.437,0.211
0.666,0.091
0.243,0.267
0.245,0.057
0.343,0.099
0.639,0.161
0.657,0.198
0.36,0.37
0.593,0.042
0.719,0.103
0.359,0.188
0.339,0.241
0.282,0.257
0.748,0.232
0.714,0.346
0.483,0.312
0.478,0.437
0.525,0.369
0.751,0.489
0.532,0.472
0.473,0.376
0.725,0.445
0.446,0.459

Implementation:


import matplotlib.pyplot as plt
import numpy as np
import random

# Euclidean distance between two vectors
def calDist(X1 , X2 ):
    total = 0  # avoid shadowing the built-in sum()
    for x1 , x2 in zip(X1 , X2):
        total += (x1 - x2) ** 2
    return total ** 0.5

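# Get the ε-neighborhood of a point (as a list of indices into dataSet)
def getNeibor(data , dataSet , e):
    res = []
    for i in range(np.shape(dataSet)[0]):
        if calDist(data , dataSet[i]) <= e:
            res.append(i)
    return res

# Density-based clustering (DBSCAN)
def DBSCAN(dataSet , e , minPts):
    coreObjs = {}  # core objects: index -> indices in its ε-neighborhood
    C = {}  # cluster id -> list of sample indices
    n = np.shape(dataSet)[0]
    # Find all core objects
    for i in range(n):
        neibor = getNeibor(dataSet[i] , dataSet , e)
        if len(neibor) >= minPts:
            coreObjs[i] = neibor
    oldCoreObjs = coreObjs.copy()
    k = 0  # number of clusters so far
    notAccess = list(range(n))  # indices of the unvisited samples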
    while len(coreObjs)>0:
        OldNotAccess = []
        OldNotAccess.extend(notAccess)
        cores = list(coreObjs.keys())
        # Pick a core object at random; random.randint includes both endpoints,
        # so the upper bound must be len(cores) - 1 to stay in range
        randNum = random.randint(0, len(cores) - 1)
        core = cores[randNum]
        queue = []
        queue.append(core)
        notAccess.remove(core)
        while len(queue)>0:
            q = queue[0]
            del queue[0]
            if q in oldCoreObjs.keys() :
                delta = [val for val in oldCoreObjs[q] if val in notAccess]  # Δ = N(q) ∩ Γ
                queue.extend(delta)  # add the samples in Δ to queue Q
                notAccess = [val for val in notAccess if val not in delta]  # Γ = Γ \ Δ
        k += 1
        C[k] = [val for val in OldNotAccess if val not in notAccess]
        for x in C[k]:
            if x in coreObjs.keys():
                del coreObjs[x]
    return C
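
# Load comma-separated points from a text file, one sample per line
def loadDataSet(fileName):
    dataMat = []
    with open(fileName) as fr:  # the file is closed automatically
        for line in fr:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            dataMat.append(list(map(float, line.split(','))))
    return dataMat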

def draw(C , dataSet):
    color = ['r', 'y', 'g', 'b', 'c', 'k', 'm']
    for i in C.keys():
        X = []
        Y = []
        datas = C[i]
        for j in range(len(datas)):
            X.append(dataSet[datas[j]][0])
            Y.append(dataSet[datas[j]][1])
        plt.scatter(X, Y, marker='o', color=color[i % len(color)], label=i)
    plt.legend(loc='upper right')
    plt.show()

dataSet = loadDataSet("data.txt")
C = DBSCAN(dataSet , 0.11 , 5)
draw(C , dataSet)
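
Samples that are not density-reachable from any core object end up in no cluster, and DBSCAN treats them as noise. Here is a small sketch for listing them, assuming `C` has the shape returned by `DBSCAN` above (cluster id mapped to a list of sample indices). The helper name `findNoise` and the toy values are hypothetical.

```python
def findNoise(C, n):
    """Return the indices in range(n) that belong to no cluster in C."""
    clustered = set()
    for members in C.values():
        clustered.update(members)
    return [i for i in range(n) if i not in clustered]

# Toy cluster assignment, made up for illustration
C_example = {1: [0, 2, 3], 2: [4, 6]}
print(findNoise(C_example, 8))  # [1, 5, 7]
```

With the script above, `findNoise(C, len(dataSet))` would list the points that `draw` leaves unplotted.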

Reference:

    Zhou Zhihua, Machine Learning (机器学习)

    https://blog.csdn.net/chenge_j/article/details/72357471
