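The problem to solve
One problem with the k-means algorithm is that the initial centers are chosen at random, so the clustering result also varies from run to run. The usual workaround is to repeat the whole clustering process several times and keep the run with the best clustering quality. The k-means++ algorithm addresses the choice of initial points directly. This post briefly summarizes and implements it; the code is still far from polished and is for reference only. Criticism is welcome.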
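Algorithm steps
a). Pick one point at random from the data set as the first center and add it to the center set centers.
b). For each point i in the data set, compute its distance to every point in centers and keep the smallest one as d[i]; once all points are processed, compute sum(d[i]).
c). Draw a random value random between 0 and sum(d[i]), then walk through the points doing random -= d[i] until random drops below 0. The i at which this happens is the next center; add that point to centers. A point is therefore picked with probability d[i] / sum(d[i]).
d). Repeat b and c until all centers have been chosen.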
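Step c can be illustrated in isolation. The sketch below is only an illustration: the helper name pick_index and the sample values are made up for this example, and it stops as soon as the running value reaches zero or below, so that rounding error cannot skip every point.

```python
def pick_index(d, r):
    """Subtract the weights d[0], d[1], ... from r and return the first
    index at which r is no longer positive."""
    for i, weight in enumerate(d):
        r -= weight
        if r <= 0:
            return i
    # Only reachable if r was larger than sum(d); fall back to the last index.
    return len(d) - 1


# d[0] = 0 means point 0 is already a center, so it gets no share of the range.
d = [0.0, 1.0, 3.0]
print(pick_index(d, 0.5))  # 0.5 -> 0.5 -> -0.5, stops at index 1
print(pick_index(d, 2.5))  # 2.5 -> 2.5 -> 1.5 -> -1.5, stops at index 2
```

With r drawn uniformly from 0 to sum(d) = 4, point 1 owns a quarter of the range and point 2 owns three quarters, so points far from the existing centers are more likely to be picked.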
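Algorithm analysis
Choosing the initial points resembles weighted reservoir sampling: the weight of each point depends on its distance to the nearest center already chosen (the original k-means++ formulation uses the squared distance as the weight). The complexity of this implementation is O(k*k*m*n), where k is the number of cluster centers, m is the number of samples in the data set, and n is the dimensionality of each sample: for each of the k centers, every one of the m points is compared against up to k already-chosen centers, and each distance costs O(n).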
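The extra factor of k comes from recomputing every point's distance to all chosen centers in each round. One possible way to avoid it, sketched here under assumptions, is to cache each point's nearest distance and update it only against the newest center, which brings the cost down to O(k*m*n). The names init_centers_cached and sq_euclidean are hypothetical, and the sketch uses squared Euclidean distance as the weight rather than the measure used in this post's own code.

```python
import random


def sq_euclidean(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def init_centers_cached(points, k, dist=sq_euclidean, rng=random):
    """Pick k initial centers, caching each point's distance to its nearest center."""
    n_points = len(points)
    centers = [points[rng.randrange(n_points)]]
    # min_dist[i] is the distance from points[i] to the nearest chosen center.
    min_dist = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        total = sum(min_dist)
        if total == 0:
            # Every point coincides with a center; fall back to a uniform pick.
            idx = rng.randrange(n_points)
        else:
            r = rng.uniform(0, total)
            # Fallback (farthest point) in case rounding leaves r above zero.
            idx = max(range(n_points), key=min_dist.__getitem__)
            for i, d in enumerate(min_dist):
                r -= d
                if d > 0 and r <= 0:
                    idx = i
                    break
        centers.append(points[idx])
        # Only the newest center can shrink a cached distance.
        min_dist = [min(d, dist(p, points[idx])) for d, p in zip(min_dist, points)]
    return centers
```

Passing rng=random.Random(0) makes a run reproducible, and any other distance function can be supplied through the dist parameter.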
Algorithm implementation
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
@author: xyl
This is an example for kmeans++ centers initialization
"""
import math
import random


def initCenters(points, pNum, cNum):
    """
    init k centers at beginning
    points: data set to be clustered
    pNum: number of points in data set
    cNum: number of points to be selected
    """
    centers = []  # points selected for initial centers
    firstCenterIndex = random.randint(0, pNum - 1)
    centers.append(points[firstCenterIndex])
    for cIndex in range(1, cNum):
        minDists = []  # min distance to the chosen centers, rebuilt in every round
        total = 0.0
        for pIndex in range(pNum):
            dist = nearest(points[pIndex], centers, cIndex)
            minDists.append(dist)
            total += dist
        # pick point i with probability minDists[i] / total
        remaining = random.uniform(0, total)
        chosen = pNum - 1  # fallback if rounding leaves remaining above zero
        for pIndex in range(pNum):
            remaining -= minDists[pIndex]
            if remaining > 0:
                continue
            chosen = pIndex
            break
        centers.append(points[chosen])
    return centers


def nearest(point, centers, cIndex):
    """
    compute min distance of point and centers
    point: point in data set
    centers: selected centers
    cIndex: number of centers already selected
    """
    minDist = float("inf")
    for index in range(cIndex):
        dist = distance(point, centers[index])
        if minDist > dist:
            minDist = dist
    return minDist


def distance(point, center):
    """
    compute cosine distance (1 - cosine similarity) between two points
    point: point in data set
    center: point selected as center
    """
    dim = len(point)
    if dim != len(center):
        raise ValueError("point and center must have the same dimension")
    a = 0.0
    b = 0.0
    c = 0.0
    for index in range(dim):
        a += point[index] * center[index]
        b += math.pow(point[index], 2)
        c += math.pow(center[index], 2)
    b = math.sqrt(b)
    c = math.sqrt(c)
    if b == 0.0 or c == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    # the cosine similarity a/(b*c) grows as two points get closer, so
    # 1 - similarity is used as the distance; clamp at 0 against rounding
    return max(0.0, 1.0 - a / (b * c))


def test():
    points = [
        [1, 2, 1, 2, 3, 4, 5],
        [1, 2, 1, 3, 1, 4, 5],
        [1, 2, 3, 2, 2, 4, 5],
        [2, 2, 1, 2, 2, 4, 1],
        [1, 2, 1, 1, 3, 1, 5],
        [1, 2, 4, 2, 3, 1, 1],
        [1, 3, 1, 2, 3, 1, 2],
        [1, 4, 1, 1, 3, 2, 1],
        [1, 1, 1, 2, 3, 4, 1],
        [1, 1, 1, 1, 3, 4, 1],
    ]
    print(initCenters(points, len(points), 4))


if __name__ == "__main__":
    test()

Reservoir sampling
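You can look this algorithm up online (Baidu will do); it is a fairly classic interview question. What I want to mention here is its two applications in ClouderaML: distributed reservoir sampling and weighted distributed reservoir sampling. Some algorithms look dull on paper, but applied to a concrete practical scenario they can still be eye-opening. For the original article, see Algorithms Every Data Scientist Should Know: Reservoir Sampling.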
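For readers who have not met the algorithm, here is a minimal single-machine sketch of the two variants. It is not ClouderaML's implementation: the function names are hypothetical, the unweighted version follows the classic "Algorithm R", and the weighted version assumes the key-based scheme in which each item gets the key u ** (1 / w) for a uniform random u and the k largest keys are kept.

```python
import heapq
import random


def reservoir_sample(stream, k, rng=random):
    """Uniformly sample k items from a stream of unknown length in one pass."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item  # happens with probability k / (i + 1)
    return reservoir


def weighted_reservoir_sample(stream, k, rng=random):
    """Sample k items from a stream of (item, weight) pairs without replacement,
    favoring items with larger weights."""
    heap = []  # min-heap of (key, position, item); position breaks ties
    for i, (item, weight) in enumerate(stream):
        if weight <= 0:
            continue  # items without positive weight are never sampled
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, i, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, i, item))
    return [item for _, _, item in heap]


if __name__ == "__main__":
    print(reservoir_sample(range(1000), 5))
    print(weighted_reservoir_sample([("a", 1.0), ("b", 5.0), ("c", 0.5)], 2))
```

Both functions keep only k items in memory, which is what makes the idea attractive for data that arrives as a stream or is split across machines.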