Implementing K-Means in NumPy

Reference: Zheng's blog: link

K-Means steps:

1. Initialize k cluster centers.

2. Compute the distance from each sample to every cluster center and assign the sample to the nearest cluster (E-step).

3. Recompute each cluster center as the mean of the samples assigned to it (M-step).

4. Repeat steps 2 and 3 until the cluster centers no longer change.
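The four steps above can be sketched in a few lines of fully vectorized NumPy. This is a minimal, self-contained illustration (the array names, k=3, and the random data are made up for this sketch, not part of the class below):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 2))                             # 200 samples, 2 features
centers = X[rng.choice(len(X), 3, replace=False)]    # step 1: pick k=3 initial centers

for _ in range(100):
    # step 2 (E-step): distance of every sample to every center via broadcasting
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # shape (200, 3)
    labels = dists.argmin(axis=1)                    # nearest-center assignment
    # step 3 (M-step): new center = mean of its assigned samples
    # (keep the old center if a cluster happens to become empty)
    new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centers[k] for k in range(3)])
    if np.allclose(new_centers, centers):            # step 4: stop when centers are stable
        break
    centers = new_centers

print(centers.shape)  # (3, 2)
```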

Complexity analysis:

O(kndp), where k is the number of clusters, n the number of samples, d the feature dimension, and p the number of iterations.

Code:

import numpy as np


class KMeans:
    def __init__(self, k, max_iter=1000, stop_var=1e-3, dist_type='l1'):
        self.cluster_num = k
        self.max_iter = max_iter
        self.stop_var = stop_var
        self.variance = 10 * stop_var
        self.dist_type = dist_type
        self.centers = None
        self.dists = None
        self.labels = None

    def fit(self, samples):
        self.init_centers(samples)
        for _iter in range(self.max_iter):
            self.update_dists(samples)
            self.update_centers(samples)
            if self.variance < self.stop_var:
                print('Current iter:', _iter)
                break

    def init_centers(self, samples):  # initialize centers by sampling k distinct rows
        # valid row indices are 0..N-1, so draw from samples.shape[0] directly
        init_row = np.random.choice(samples.shape[0], self.cluster_num, replace=False)
        self.centers = samples[init_row]

    def update_dists(self, samples):  # assign each sample to its nearest cluster (E-step)
        labels = np.empty(samples.shape[0], dtype=int)
        dists = np.empty((samples.shape[0], self.cluster_num))
        for i, sample in enumerate(samples):
            if self.dist_type == 'l1':
                dist = self.l1_distance(sample)
            elif self.dist_type == 'l2':
                dist = self.l2_distance(sample)
            else:
                raise ValueError('wrong dist_type')
            labels[i] = np.argmin(dist)
            dists[i] = dist  # fill the preallocated row instead of vstacking
        if self.dists is not None:
            self.variance = np.sum(np.abs(self.dists - dists))
        self.dists = dists
        self.labels = labels

    def update_centers(self, samples):  # recompute each cluster center (M-step)
        centers = np.empty((0, samples.shape[1]))
        for i in range(self.cluster_num):
            idx = (self.labels == i)
            center_samples = samples[idx]
            if len(center_samples) > 0:
                center = np.mean(center_samples, axis=0)
            else:  # TODO: handle empty clusters better; for now keep the old center
                center = self.centers[i]
            centers = np.vstack((centers, center[np.newaxis, :]))
        self.centers = centers

    def l1_distance(self, sample):  # Manhattan (L1) distance from sample to every center
        return np.sum(np.abs(self.centers - sample), axis=1)

    def l2_distance(self, sample):  # Euclidean (L2) distance from sample to every center
        return np.sqrt(np.sum(np.square(self.centers - sample), axis=1))


if __name__ == '__main__':
    samples = np.random.rand(1000, 4)  # N=1000 samples, each with d=4 features
    print(samples.shape)
    num_cluster = 5  # number of clusters
    kmeans = KMeans(num_cluster)
    kmeans.fit(samples)
    print(kmeans.centers)

Note:

Normalize each feature dimension before running k-means. K-means is distance-based, so a feature with a large value range dominates the distance while features with small ranges barely contribute, even though in practice all dimensions should carry equal weight.
