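pyclustering is a Python library for cluster analysis. This post walks through its kmeans module.
I have recently been using the k-means algorithm in some research, and one idea was to swap out the distance function used by k-means. sklearn does not expose an interface for that, and my own implementation did not perform well. I eventually found the pyclustering library, so I am recording my notes on using it here.
The k-means training procedure is as described in the blog post.
Packages used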
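from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.cluster.kmeans import kmeans
from pyclustering.utils.metric import distance_metric, type_metric  # used in the metric examples below
import numpy as np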
1. Initialize the centroids:
initial_centers = kmeans_plusplus_initializer(x, cluster_num).initialize()
Here x is the data and cluster_num is the number of clusters.
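To give a feel for what a k-means++ style initializer does, here is a minimal NumPy sketch of the seeding idea. It is not pyclustering's implementation, which may differ in detail; the function name `kmeanspp_seed` and the `rng` parameter are my own assumptions for illustration.

```python
import numpy as np

def kmeanspp_seed(x, cluster_num, rng=None):
    """Pick cluster_num rows of x as initial centers (k-means++ style seeding)."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    # First center: a uniformly random data point.
    centers = [x[rng.integers(len(x))]]
    for _ in range(cluster_num - 1):
        # Squared distance from every point to its nearest already-chosen center.
        diff = x[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diff ** 2).sum(axis=2).min(axis=1)
        # Sample the next center with probability proportional to that distance,
        # so far-away points are more likely to be picked.
        centers.append(x[rng.choice(len(x), p=d2 / d2.sum())])
    return np.array(centers)

x = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
initial_centers = kmeanspp_seed(x, 2, rng=0)
```

An already-chosen point has distance zero to itself and therefore zero probability of being picked again, so the centers are distinct as long as the data contains no duplicate rows.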
2. Instantiate the kmeans class:
kmeans_instance = kmeans(x, initial_centers, metric=metric)
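metric is the distance metric. The default is Euclidean distance (pyclustering uses the squared form, which yields the same assignments); details follow in the section on metrics.
3. Train: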
kmeans_instance.process()
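Conceptually, training alternates between assigning points to their nearest center and moving each center. The sketch below shows that loop in plain NumPy with a pluggable distance function. It is a simplified illustration under my own assumptions (mean update, `lloyd_kmeans` and `manhattan` are hypothetical names), not the code that `process()` actually runs.

```python
import numpy as np

def lloyd_kmeans(x, centers, dist, max_iter=100):
    """Plain Lloyd iterations: assign points to the nearest center, then move centers."""
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(max_iter):
        # Assignment step: distance from every point to every center.
        d = np.array([[dist(p, c) for c in centers] for p in x])
        labels = d.argmin(axis=1)
        # Update step: each center becomes the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            x[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return np.abs(a - b).sum()

x = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centers = lloyd_kmeans(x, x[[0, 2]], manhattan)
# labels  -> [0, 0, 1, 1]
# centers -> [[0.0, 0.5], [10.0, 10.5]]
```

Only the assignment step depends on `dist`, which is why swapping the distance function is enough to change the clustering behaviour.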
4. Assign points to clusters:
clusters = kmeans_instance.get_clusters()
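This groups the training data x into clusters, returned as lists of point indices. For example, if points a, b, c belong to classes 1, 1, 0 respectively, the returned index lists are [[2], [0, 1]].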
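The index-list format is easy to check on a toy case. The following NumPy-only sketch converts a per-point label array into index lists and back. It only illustrates the data layout and does not call pyclustering; the order in which the library numbers its clusters is not something this example can tell you.

```python
import numpy as np

labels = np.array([1, 1, 0])  # classes of points a, b, c

# Group point indices by label: one index list per cluster.
clusters = [np.flatnonzero(labels == k).tolist() for k in range(labels.max() + 1)]
# clusters == [[2], [0, 1]]

# And back: rebuild the per-point label array from the index lists.
recovered = np.empty(len(labels), dtype=int)
for k, members in enumerate(clusters):
    recovered[members] = k
# recovered -> [1, 1, 0]
```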
5. Get the centroids:
cs = kmeans_instance.get_centers()
6. Predict:
There are several ways to predict, each suited to a different scenario.
- First, predict directly with the trained instance
label = kmeans_instance.predict(x)
- From the clusters obtained earlier (this labels the training data only)
label = np.array([0]*len(x))
for i, sub in enumerate(clusters):
    label[sub] = i
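- From the centroids; wrapped here as a function, where metric is the distance function
def Clu_predict(x, cs, class_num, metric=distance_metric(type_metric.EUCLIDEAN)):
    # differences[i, j] is the distance from point i to centroid j
    differences = np.zeros((len(x), class_num))
    for index_point in range(len(x)):
        differences[index_point] = [metric(x[index_point], c) for c in cs]
    # assign each point to its nearest centroid
    label = np.argmin(differences, axis=1)
    return label
Note that this is slow, because it loops over points in Python; writing your own vectorized matrix computation is recommended.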
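For the Euclidean case, the matrix version can be written with NumPy broadcasting. This is a minimal sketch, assuming x and cs are 2-D arrays with one point per row; the name `predict_euclidean` is hypothetical and the function is not part of pyclustering.

```python
import numpy as np

def predict_euclidean(x, cs):
    """Label each row of x with the index of its nearest centroid (Euclidean)."""
    x = np.asarray(x, dtype=float)
    cs = np.asarray(cs, dtype=float)
    # Broadcasting builds an (n_points, n_centers) matrix of squared distances;
    # the square root is skipped because it does not change the argmin.
    d2 = ((x[:, None, :] - cs[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

cs = np.array([[0.0, 0.0], [10.0, 10.0]])
x_new = np.array([[1.0, 1.0], [9.0, 8.0], [-2.0, 0.5]])
label = predict_euclidean(x_new, cs)
# label -> [0, 1, 0]
```

The intermediate array has shape (n_points, n_centers, n_features), so for very large inputs it may be worth processing x in chunks.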
7. Metrics:
- Using a metric provided by the library, with Manhattan distance as an example:
manhattan_metric = distance_metric(type_metric.MANHATTAN)
kmeans_instance = kmeans(x, initial_centers, metric=manhattan_metric)
To use a different one, just replace the name after type_metric. The distances provided by the library are:
class type_metric(IntEnum):
    """!
    @brief Enumeration of supported metrics in the module for distance calculation between two points.
    """
    ## Euclidean distance, for more information see function 'euclidean_distance'.
    EUCLIDEAN = 0
    ## Square Euclidean distance, for more information see function 'euclidean_distance_square'.
    EUCLIDEAN_SQUARE = 1
    ## Manhattan distance, for more information see function 'manhattan_distance'.
    MANHATTAN = 2
    ## Chebyshev distance, for more information see function 'chebyshev_distance'.
    CHEBYSHEV = 3
    ## Minkowski distance, for more information see function 'minkowski_distance'.
    MINKOWSKI = 4
    ## Canberra distance, for more information see function 'canberra_distance'.
    CANBERRA = 5
    ## Chi square distance, for more information see function 'chi_square_distance'.
    CHI_SQUARE = 6
    ## Gower distance, for more information see function 'gower_distance'.
    GOWER = 7
    ## User defined function for distance calculation between two points.
    USER_DEFINED = 1000
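- Using a custom distance, with cosine distance as an example:
def cosine_distance(a, b):
    a_norm = np.linalg.norm(a)
    b_norm = np.linalg.norm(b)
    # cosine similarity of the two vectors (undefined if either has zero norm)
    similarity = np.dot(a, b) / (a_norm * b_norm)
    dist = 1. - similarity
    return dist
metric = distance_metric(type_metric.USER_DEFINED, func=cosine_distance)
A custom distance only needs to compute the distance between two points.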
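A pairwise function like this is called once per point and center, which can be slow on large data. For prediction outside the library, the same cosine distance can be computed for all pairs at once. The sketch below is NumPy-only; the name `cosine_distance_matrix` and the `eps` guard against zero vectors are my own assumptions, not something pyclustering provides.

```python
import numpy as np

def cosine_distance_matrix(x, cs, eps=1e-12):
    """Cosine distance between every row of x and every row of cs."""
    x = np.asarray(x, dtype=float)
    cs = np.asarray(cs, dtype=float)
    # Normalise rows; eps guards against division by zero for all-zero vectors.
    xn = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)
    cn = cs / np.maximum(np.linalg.norm(cs, axis=1, keepdims=True), eps)
    # Dot products of unit vectors are cosine similarities.
    return 1.0 - xn @ cn.T

x = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
cs = np.array([[2.0, 0.0], [0.0, 1.0]])
d = cosine_distance_matrix(x, cs)
# d[0] -> [0, 1]   (parallel to the first center, orthogonal to the second)
# d[1] -> [1, 0]
label = d.argmin(axis=1)
```

With this guard, an all-zero vector ends up at distance 1 from every center instead of producing NaN; whether that is the right convention depends on the data.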