密度聚类(二)OPTICS和python实现

上一节写的DBSCAN算法的一个缺点是无法对密度不同的样本集进行很好的聚类,就如下图中所示,是DBSCAN获得的聚类结果,第二个图中紫色的点是异常点,由于黄色的样本集密度小,与另外2个样本集的区别很大,这个时候DBSCAN的缺点就显现出来了。
密度聚类(二)OPTICS和python实现_第1张图片
于是有人提出了另外一个算法叫做Ordering points to identify the clustering structure(OPTICS),这个算法有效的解决了密度不同导致的聚类效果不好的问题。OPTICS通过计算出一个可达距离列表来进行聚类,首先计算出所有的核心点,然后随机选择一个核心点,计算其附近点的可达距离并从小到大进行排序,如果附近的点存在核心点则递归式地继续计算附近核心点的周围点的可达距离并排序存在列表中,全部完成后画出可达距离列表,观察变化特别大的位置即为类别变化的地方,具体图片可参考博客最后的程序运行图。

OPTICS的参数和DBSCAN一样,也是 { ϵ , M i n P t s } \left \{ \epsilon ,MinPts \right \} {ϵ,MinPts}
引出核心距离定义:
对于核心点,距离其第 M i n P t s MinPts MinPts 近的点与之的距离

在这里插入图片描述
可达距离,对于核心点p,o到p的可达距离定义为o到p的距离和p的核心距离的最大值,即公式
在这里插入图片描述
在我实现的代码中,我将UNDEFINED定义为-1了,从原论文中也没发现确切的定义。

从wiki中扒出来一个OPTICS的伪代码:

 OPTICS(DB, eps, MinPts)
    for each point p of DB
       p.reachability-distance = UNDEFINED
    for each unprocessed point p of DB
       N = getNeighbors(p, eps)
       mark p as processed
       output p to the ordered list
       if (core-distance(p, eps, Minpts) != UNDEFINED)
          Seeds = empty priority queue
          update(N, p, Seeds, eps, Minpts)
          for each next q in Seeds
             N' = getNeighbors(q, eps)
             mark q as processed
             output q to the ordered list
             if (core-distance(q, eps, Minpts) != UNDEFINED)
                update(N', q, Seeds, eps, Minpts)

其中update()函数如下:

 update(N, p, Seeds, eps, Minpts)
    coredist = core-distance(p, eps, MinPts)
    for each o in N
       if (o is not processed)
          new-reach-dist = max(coredist, dist(p,o))
          if (o.reachability-distance == UNDEFINED) // o is not in Seeds
              o.reachability-distance = new-reach-dist
              Seeds.insert(o, new-reach-dist)
          else               // o in Seeds, check for improvement
              if (new-reach-dist < o.reachability-distance)
                 o.reachability-distance = new-reach-dist
                 Seeds.move-up(o, new-reach-dist)

下面给出python3.6实现代码:

# -*- coding: gbk -*-

import numpy as np
import matplotlib.pyplot as plt
import copy
from sklearn.datasets import make_moons
from sklearn.datasets.samples_generator import make_blobs
import random
import time


class OPTICS():
    def __init__(self, epsilon, MinPts):
        self.epsilon = epsilon
        self.MinPts = MinPts

    def dist(self, x1, x2):
        return np.linalg.norm(x1 - x2)

    def getCoreObjectSet(self, X):
        N = X.shape[0]
        Dist = np.eye(N) * 9999999
        CoreObjectIndex = []
        for i in range(N):
            for j in range(N):
                if i > j:
                    Dist[i][j] = self.dist(X[i], X[j])
        for i in range(N):
            for j in range(N):
                if i < j:
                    Dist[i][j] = Dist[j][i]
        for i in range(N):
            # 获取对象周围小于epsilon的点的个数
            dist = Dist[i]
            num = dist[dist < self.epsilon].shape[0]
            if num >= self.MinPts:
                CoreObjectIndex.append(i)
        return np.array(CoreObjectIndex), Dist

    def get_neighbers(self, p, Dist):
        N = []
        dist = Dist[p].reshape(-1)
        for i in range(dist.shape[0]):
            if dist[i] < self.epsilon:
                N.append(i)
        return N

    def get_core_dist(self, p, Dist):
        dist = Dist[p].reshape(-1)
        sort_dist = np.sort(dist)
        return sort_dist[self.MinPts - 1]

    def resort(self):
        '''
        根据self.ReachDist对self.Seeds重新升序排列
        '''
        reachdist = copy.deepcopy(self.ReachDist)
        reachdist = np.array(reachdist)
        reachdist = reachdist[self.Seeds]
        new_index = np.argsort(reachdist)
        Seeds = copy.deepcopy(self.Seeds)
        Seeds = np.array(Seeds)
        Seeds = Seeds[new_index]
        self.Seeds = Seeds.tolist()

    def update(self, N, p, Dist, D):

        for i in N:
            if i in D:
                new_reach_dist = max(self.get_core_dist(p, Dist), Dist[i][p])
                if i not in self.Seeds:
                    self.Seeds.append(i)
                    self.ReachDist[i] = new_reach_dist
                else:
                    if new_reach_dist < self.ReachDist[i]:
                        self.ReachDist[i] = new_reach_dist
                self.resort()

    def fit(self, X):

        length = X.shape[0]
        CoreObjectIndex, Dist = self.getCoreObjectSet(X)
        self.Seeds = []
        self.Ordered = []
        D = np.arange(length).tolist()
        self.ReachDist = [-0.1] * length

        while (len(D) != 0):
            p = random.randint(0, len(D) - 1)  # 随机选取一个对象
            p = D[p]
            self.Ordered.append(p)
            D.remove(p)

            if p in CoreObjectIndex:
                N = self.get_neighbers(p, Dist)
                self.update(N, p, Dist, D)

                while(len(self.Seeds) != 0):
                    q = self.Seeds.pop(0)
                    self.Ordered.append(q)
                    D.remove(q)
                    if q in CoreObjectIndex:
                        N = self.get_neighbers(q, Dist)
                        self.update(N, q, Dist, D)
        return self.Ordered, self.ReachDist

    def plt_show(self, X, Y, ReachDist, Ordered, name=0):
        if X.shape[1] == 2:
            fig = plt.figure(name)
            plt.subplot(211)
            plt.scatter(X[:, 0], X[:, 1], marker='o', c=Y)
            plt.subplot(212)
            ReachDist = np.array(ReachDist)
            plt.plot(range(len(Ordered)), ReachDist[Ordered])
        else:
            print('error arg')


if __name__ == '__main__':
    # 111111
    center = [[1, 1], [-1, -1], [1, -1]]
    cluster_std = 0.35
    X1, Y1 = make_blobs(n_samples=300, centers=center,
                        n_features=2, cluster_std=cluster_std, random_state=1)
    optics1 = OPTICS(epsilon=2, MinPts=5)
    Ordered, ReachDist = optics1.fit(X1)
    optics1.plt_show(X1, Y1, ReachDist, Ordered, name=1)
    # 2222222
    center = [[1, 1], [-1, -1], [2, -2]]
    cluster_std = [0.35, 0.1, 0.8]
    X2, Y2 = make_blobs(n_samples=300, centers=center,
                        n_features=2, cluster_std=cluster_std, random_state=1)
    optics2 = OPTICS(epsilon=2, MinPts=5)
    Ordered, ReachDist = optics2.fit(X2)
    optics2.plt_show(X2, Y2, ReachDist, Ordered, name=2)
    # 33333333
    X3, Y3 = make_moons(n_samples=500, noise=0.1)
    optics3 = OPTICS(epsilon=2, MinPts=5)
    Ordered, ReachDist = optics3.fit(X2)
    optics3.plt_show(X3, Y3, ReachDist, Ordered, name=3)
    plt.show()

密度聚类(二)OPTICS和python实现_第2张图片
上图为原数据和类别,下图为可达距离列表画出来的图,从图中可以看出3个趋势变化很类似的曲线,距离变化最剧烈的地方即为聚类结束的地方。
密度聚类(二)OPTICS和python实现_第3张图片
该图中3个样本集的密度差距很大,而OPTICS算法还是很好的进行了聚类,相比于DBSCAN要好很多
密度聚类(二)OPTICS和python实现_第4张图片
对于月牙形状的样本分布,OPTICS的表现不是很好,可能是因为2个样本在接触的地方密度相近。

OPTICS相比于DBSCAN而言,对参数的设置很不敏感,因为它是计算一个可达距离列表排序后查看变化最快的地方进行聚类的。

参考文献:
https://en.wikipedia.org/wiki/OPTICS_algorithm

你可能感兴趣的:(机器学习)