Clustering Algorithms: A Summary of the OPTICS Algorithm

DBSCAN has several shortcomings, and the OPTICS algorithm was introduced to address them.

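Background:

In DBSCAN, the neighborhood radius ϵ and the density threshold M have to be chosen by hand, and the clustering result is very sensitive to both hyperparameters: different initial settings can produce completely different results. To address this, researchers proposed a new clustering algorithm, OPTICS. It is also a density-based clustering algorithm, but unlike DBSCAN it is designed to be much less sensitive to the initial hyperparameter settings.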

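Basic concepts:
core_distance: the core distance, i.e. the smallest radius at which a point's neighborhood contains at least M points
reach_distance: the reachability distance of a point o with respect to a core point p, i.e. the larger of p's core distance and the distance between p and o
For the details, see this blog post: (link)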
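To make the two definitions concrete, here is a minimal pure-Python sketch. The function names `core_distance` and `reachability_distance`, the toy points `pts`, and the convention that a point counts as its own neighbor are assumptions made for this illustration; this is not the sklearn implementation.

```python
import math


def core_distance(points, i, min_pts, eps=math.inf):
    """Distance from points[i] to its min_pts-th nearest neighbor.

    Assumption: the point counts as its own neighbor. Returns math.inf
    when points[i] is not a core point within radius eps.
    """
    dists = sorted(math.dist(points[i], q) for q in points)
    if len(dists) < min_pts:
        return math.inf
    d = dists[min_pts - 1]
    return d if d <= eps else math.inf


def reachability_distance(points, o, p, min_pts, eps=math.inf):
    """Reachability distance of points[o] with respect to points[p]."""
    cd = core_distance(points, p, min_pts, eps)
    if math.isinf(cd):
        return math.inf  # undefined when points[p] is not a core point
    return max(cd, math.dist(points[p], points[o]))


# Hypothetical toy data: three nearby points and one far-away point.
pts = [(0.0, 0.0), (0.5, 0.0), (0.0, 1.0), (5.0, 5.0)]

print(core_distance(pts, 0, min_pts=3))
print(reachability_distance(pts, 1, 0, min_pts=3))
print(core_distance(pts, 3, min_pts=3, eps=2.0))
```

Point 1 is only 0.5 away from point 0, but its reachability distance with respect to point 0 is still point 0's core distance. This "floor" is what smooths the reachability values inside a dense region.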

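The core idea of OPTICS
Objects in a denser cluster lie close to each other in the cluster ordering.
The smallest reachability distance of an object gives the shortest path connecting that object to a dense cluster (this is why the reachability distance of a sample is defined as the smallest of its reachability distances with respect to the individual core points).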
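The two statements above can be sketched as a small O(n^2) ordering routine: always process next the unprocessed point with the smallest reachability distance found so far, and keep only the minimum reachability distance per point. This is a simplified illustration, not sklearn's implementation; `optics_order` and the toy data `pts` are names invented here.

```python
import heapq
import math


def optics_order(points, min_pts, eps=math.inf):
    """Return (ordering, reachability) for a small list of points."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    def core_dist(i):
        # Distance to the min_pts-th nearest neighbor (the point itself included).
        if n < min_pts:
            return math.inf
        d = sorted(dist[i])[min_pts - 1]
        return d if d <= eps else math.inf

    reach = [math.inf] * n
    processed = [False] * n
    ordering = []

    for start in range(n):
        if processed[start]:
            continue
        heap = [(math.inf, start)]  # entries are (reachability, point index)
        while heap:
            _, i = heapq.heappop(heap)
            if processed[i]:
                continue  # stale entry: a smaller value was already handled
            processed[i] = True
            ordering.append(i)
            cd = core_dist(i)
            if math.isinf(cd):
                continue  # a non-core point cannot extend the ordering
            for j in range(n):
                if processed[j] or dist[i][j] > eps:
                    continue
                new_reach = max(cd, dist[i][j])
                if new_reach < reach[j]:  # keep the smallest value seen
                    reach[j] = new_reach
                    heapq.heappush(heap, (new_reach, j))
    return ordering, reach


# Hypothetical toy data: two tight squares far apart from each other.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
ordering, reach = optics_order(pts, min_pts=3)
print(ordering)
print([round(reach[i], 2) for i in ordering])
```

In the printed reachability sequence, the points of each square appear next to each other with small values, and the single large value marks the jump from one square to the other. Those "valleys" are what the reachability plot in the script below visualizes.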

The code is given below:

Clustering is performed with the OPTICS class and the cluster_optics_dbscan function from the sklearn library.

import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.cluster import DBSCAN
import numpy as np

G = gridspec.GridSpec(3, 2)
# ---------------------------------data-------------------------------
n_points_per_cluster = 250
C1 = [-5, -2] + .8 * np.random.randn(n_points_per_cluster, 2)  # randn draws a (250, 2) matrix of standard normal samples
C2 = [4, -1] + .1 * np.random.randn(n_points_per_cluster, 2)
C3 = [1, -2] + .2 * np.random.randn(n_points_per_cluster, 2)
C4 = [-2, 3] + .3 * np.random.randn(n_points_per_cluster, 2)
C5 = [3, -2] + 1.6 * np.random.randn(n_points_per_cluster, 2)
C6 = [5, 6] + 2 * np.random.randn(n_points_per_cluster, 2)
X = np.vstack((C1, C2, C3, C4, C5, C6))  # stack the six clusters row-wise into one (1500, 2) array
# -----------------------plot the raw data----------------------------------------
ax = plt.subplot(G[0, 0])
ax.scatter(X[:, 0], X[:, 1])
ax.set_title('Scatter Picture for Original Data')
# ------------------DBSCAN------------------------------
clustering_module_dbs = DBSCAN(eps=.6, min_samples=20).fit(X)  # fit() returns the fitted estimator
clustering_classif_dbs = clustering_module_dbs.labels_  # cluster label per sample (-1 = noise); no second fit needed
ax2 = plt.subplot(G[1, 0])
ax2.scatter(X[:, 0], X[:, 1], c=clustering_module_dbs.labels_)
ax2.set_title('DBSCAN Scatter Picture for Eps=.6 Min_sample=20')
# -----------------OPTICS--------------------------------
clustering_module_opt = OPTICS(min_samples=50, min_cluster_size=.05, xi=.05).fit(X)  # run fit()
clustering_classif_opt = clustering_module_opt.labels_  # same labels fit_predict() would return, without refitting

space = np.arange(len(X))
reachability = clustering_module_opt.reachability_[clustering_module_opt.ordering_]
labels = clustering_module_opt.labels_[clustering_module_opt.ordering_]
colors = ['g.', 'r.', 'b.', 'y.', 'c.']

ax3 = plt.subplot(G[0, 1])
# plot the reachability profile, one color per cluster label
for klass, color in zip(range(0, 5), colors):
    # zip() pairs each cluster label with a plot style and stops at the shorter input,
    # so clusters beyond the fifth (if any) are not drawn here
    Xk = space[labels == klass]
    Rk = reachability[labels == klass]
    ax3.plot(Xk, Rk, color, alpha=0.3)
ax3.plot(space[labels == -1], reachability[labels == -1], 'k.', alpha=0.3)  # noise # labels == -1
# horizontal lines at the two eps cut levels used further down (eps=2 and eps=.5)
ax3.plot(space, np.full_like(space, 2., dtype=float), linestyle='-.', color='b', alpha=.5)
ax3.plot(space, np.full_like(space, .5, dtype=float), linestyle='-', color='b', alpha=.5)
ax3.set_ylabel('Reachability (epsilon distance)')
ax3.set_title('Reachability Plot')
# ------------remove noise points------------------------------------
print("OPTICS found {} cluster(s)".format(clustering_module_opt.labels_.max() + 1))
ax4 = plt.subplot(G[1, 1])  # G[1, :] would overlap the DBSCAN panel at G[1, 0]
ax4.scatter(X[:, 0][clustering_classif_opt != -1], X[:, 1][clustering_classif_opt != -1],
            c=clustering_classif_opt[clustering_classif_opt != -1])  # drop the noise points
ax4.set_title('OPTICS Scatter Picture without Noise Points')

# ---------------------eps=0.5------eps=2----------------------------------------------------
clustering_05 = cluster_optics_dbscan(reachability=clustering_module_opt.reachability_,
                                      core_distances=clustering_module_opt.core_distances_,
                                      ordering=clustering_module_opt.ordering_,
                                      eps=.5)  # returns DBSCAN-style labels: cluster index per sample, -1 = noise
clustering_20 = cluster_optics_dbscan(reachability=clustering_module_opt.reachability_,
                                      core_distances=clustering_module_opt.core_distances_,
                                      ordering=clustering_module_opt.ordering_,
                                      eps=2)
colors1 = ['g', 'm', 'y', 'c']  # for eps=2.
colors2 = ['g', 'greenyellow', 'olive', 'r', 'b', 'c']  # for eps=.5

ax5 = plt.subplot(G[2, 0])
for c, color in zip(range(6), colors2):
    Xi = X[clustering_05 == c]
    ax5.plot(Xi[:, 0], Xi[:, 1], '+', alpha=0.9, color=color)
ax5.plot(X[:, 0][clustering_05 == -1], X[:, 1][clustering_05 == -1], 'k+', alpha=0.1)  # noise points with marker '+'
ax5.set_title('OPTICS for eps=.5 with noise points')

ax6 = plt.subplot(G[2, 1])
for c, color in zip(range(4), colors1):
    Xi = X[clustering_20 == c]
    ax6.plot(Xi[:, 0], Xi[:, 1], '+', alpha=0.9, color=color)
ax6.plot(X[:, 0][clustering_20 == -1], X[:, 1][clustering_20 == -1], 'k+', alpha=0.1)  # noise points with marker '+'
ax6.set_title('OPTICS for eps=2 with noise points')
# --------show fig-----------------------------------
plt.show()

Output:
(Figure 1: the plots produced by the script above)
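The eps-based extraction that the script delegates to cluster_optics_dbscan can be pictured as a single pass over the cluster ordering: a point whose reachability distance exceeds eps either starts a new cluster (if it is a core point at that eps) or is noise, and every other point joins the current cluster. The sketch below is a pure-Python illustration of that idea under these assumptions, not sklearn's actual code; `extract_dbscan_labels` and the toy arrays are invented for the example.

```python
import math


def extract_dbscan_labels(reachability, core_distances, ordering, eps):
    """Assign DBSCAN-style labels (-1 = noise) by cutting the ordering at eps."""
    labels = [-1] * len(ordering)
    cluster_id = -1
    for i in ordering:
        if reachability[i] > eps:
            if core_distances[i] <= eps:
                cluster_id += 1  # a core point that is hard to reach starts a new cluster
                labels[i] = cluster_id
            # otherwise the point stays labeled as noise
        else:
            labels[i] = cluster_id  # reachable within eps: joins the current cluster
    return labels


# Hypothetical values for seven points, already listed in cluster order.
ordering = [0, 1, 2, 3, 4, 5, 6]
reachability = [math.inf, 0.3, 0.4, 3.0, 0.2, 0.3, 5.0]
core_distances = [0.3, 0.3, 0.4, 0.2, 0.2, 0.3, 6.0]

labels_eps1 = extract_dbscan_labels(reachability, core_distances, ordering, eps=1.0)
labels_eps4 = extract_dbscan_labels(reachability, core_distances, ordering, eps=4.0)
print(labels_eps1)
print(labels_eps4)
```

With the smaller eps the jump at point 3 splits the data into two clusters; with the larger eps the same jump falls below the cut and the two groups merge. This is how one OPTICS run can yield DBSCAN-like results for several eps values.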
