【机器学习实战04】k-均值聚类算法

1、聚类定义

聚类是一种无监督学习,它将相似的对象归为一类,簇内的对象越相似,聚类的效果越好。k-均值首先发现k个不同的簇,且每个簇的中心采用簇中所含值的均值计算而成。

2、开发机器学习应用程序的步骤

(1)收集数据:收集各种样本数据,为了节省时间,可以使用公开的可用数据源
(2)准备输入数据:确保数据格式符合要求,本书采用的格式是Python语言的List。
(3)数据分析:人工分析以前得到的数据,确保数据集中没有垃圾数据。
(4)训练算法,这一步才是机器学习真正的开始,对于监督学习,我们把之前的格式化数据输入到算法中,得到分类模型,方便后续处理。对于无监督学习,不存在目标变量,因此也不需要训练算法!!!
(5)测试算法:这一步使用第4步机器学习的模型。这一步主要测试算法的工作效率。
(6)使用算法:将机器学习算法转换为应用程序,执行实际任务。

3、K-均值聚类

首先,随机确定k个初始点作为质心。然后为每个点找距离其最近的质心,将其分配到该质心所对应的簇中。完成这一步之后,更新每个簇的质心为该簇所有点的平均值。迭代以上过程,直到簇分配结果不再改变,完成K-均值聚类。

4、k-均值伪代码

创建k个点作为起始质心(随机选择)
当任意一个点的簇分配结果发生改变时
        对数据集中的每个数据点
                对每个质心
                        计算质心与数据点之间的距离
                将数据点分配到距其最近的簇
         对每一个簇,计算簇中所有点的均值并将均值作为质心

5、算法实现

k-均值聚类支持函数

#encoding:utf-8

from numpy import *
from math import *

#加载数据集
def loadDataSet(fileName):      #将一个文本文件导入到列表中
    dataMat = []                #创建一个dataMat空列表
    fr = open(fileName)
    for line in fr.readlines():  #一行一行的读取文本文件
        curLine = line.strip().split('\t')
        fltLine = map(float,curLine) #将文本文件转化为字符型
        dataMat.append(fltLine)
    return dataMat

#计算两个向量的欧式距离  
def distEclud(vecA, vecB):      
    return sqrt(sum(power(vecA - vecB, 2))) 

#随机构建初始质心
def randCent(dataSet, k):               #构建一个包含k个随机质心的集合
        #数据列数
    n = shape(dataSet)[1]
    #初始化质心
    centroids = mat(zeros((k,n)))
    for j in range(n):              
            #数据集中每一维的最小和最大值,保证随机选取的质心在边界之内
        minJ = min(dataSet[:,j]) 
        rangeJ = float(max(dataSet[:,j]) - minJ)
        centroids[:,j] = mat(minJ + rangeJ * random.rand(k,1))
    return centroids


#数据导入
data = loadDataSet('testSet.txt')
print "data:"
print data

print "\n"

#将数据按照矩阵方式输出
dataMat = mat(data)
print "dataMat:"
print dataMat

print "\n"

#矩阵中的最大值和最小值
print "矩阵中的最大值和最小值:"
min1 = min(dataMat[:,0])
print "min(dataMat[:,0]):"
print min1

min2 = min(dataMat[:,1])
print "min(dataMat[:,1]):"
print min2

max1 = max(dataMat[:,0])
print "max(dataMat[:,0]):"
print max1

max2 = max(dataMat[:,1])
print "max(dataMat[:,1]):"
print max2

print "\n"

#k个随机质心的集合,随机质心必须在整个数据集的边界之内,也就是在最小值和最大值之间
centroids = randCent(dataMat, 2)
print "质心:"
print centroids

print "\n"

#计算距离
distance = distEclud(dataMat[0],dataMat[1])
print "distance:"
print distance

Output:

data(列表形式):
[[1.658985, 4.285136], [-3.453687, 3.424321], [4.838138, -1.151539], [-5.379713, -3.362104], [0.972564, 2.924086], [-3.567919, 1.531611], [0.450614, -3.302219], [-3.487105, -1.724432], [2.668759, 1.594842], [-3.156485, 3.191137], [3.165506, -3.999838], [-2.786837, -3.099354], [4.208187, 2.984927], [-2.123337, 2.943366], [0.704199, -0.479481], [-0.39237, -3.963704], [2.831667, 1.574018], [-0.790153, 3.343144], [2.943496, -3.357075], [-3.195883, -2.283926], [2.336445, 2.875106], [-1.786345, 2.554248], [2.190101, -1.90602], [-3.403367, -2.778288], [1.778124, 3.880832], [-1.688346, 2.230267], [2.592976, -2.054368], [-4.007257, -3.207066], [2.257734, 3.387564], [-2.679011, 0.785119], [0.939512, -4.023563], [-3.674424, -2.261084], [2.046259, 2.735279], [-3.18947, 1.780269], [4.372646, -0.822248], [-2.579316, -3.497576], [1.889034, 5.1904], [-0.798747, 2.185588], [2.83652, -2.658556], [-3.837877, -3.253815], [2.096701, 3.886007], [-2.709034, 2.923887], [3.367037, -3.184789], [-2.121479, -4.232586], [2.329546, 3.179764], [-3.284816, 3.273099], [3.091414, -3.815232], [-3.762093, -2.432191], [3.542056, 2.778832], [-1.736822, 4.241041], [2.127073, -2.98368], [-4.323818, -3.938116], [3.792121, 5.135768], [-4.786473, 3.358547], [2.624081, -3.260715], [-4.009299, -2.978115], [2.493525, 1.96371], [-2.513661, 2.642162], [1.864375, -3.176309], [-3.171184, -3.572452], [2.89422, 2.489128], [-2.562539, 2.884438], [3.491078, -3.947487], [-2.565729, -2.012114], [3.332948, 3.983102], [-1.616805, 3.573188], [2.280615, -2.559444], [-2.651229, -3.103198], [2.321395, 3.154987], [-1.685703, 2.939697], [3.031012, -3.620252], [-4.599622, -2.185829], [4.196223, 1.126677], [-2.133863, 3.093686], [4.668892, -2.562705], [-2.793241, -2.149706], [2.884105, 3.043438], [-2.967647, 2.848696], [4.479332, -1.764772], [-4.905566, -2.91107]]

dataMat(矩阵形式):
[[ 1.658985  4.285136]
 [-3.453687  3.424321]
 [ 4.838138 -1.151539]
 [-5.379713 -3.362104]
 [ 0.972564  2.924086]
 [-3.567919  1.531611]
 [ 0.450614 -3.302219]
 [-3.487105 -1.724432]
 [ 2.668759  1.594842]
 [-3.156485  3.191137]
 [ 3.165506 -3.999838]
 [-2.786837 -3.099354]
 [ 4.208187  2.984927]
 [-2.123337  2.943366]
 [ 0.704199 -0.479481]
 [-0.39237  -3.963704]
 [ 2.831667  1.574018]
 [-0.790153  3.343144]
 [ 2.943496 -3.357075]
 [-3.195883 -2.283926]
 [ 2.336445  2.875106]
 [-1.786345  2.554248]
 [ 2.190101 -1.90602 ]
 [-3.403367 -2.778288]
 [ 1.778124  3.880832]
 [-1.688346  2.230267]
 [ 2.592976 -2.054368]
 [-4.007257 -3.207066]
 [ 2.257734  3.387564]
 [-2.679011  0.785119]
 [ 0.939512 -4.023563]
 [-3.674424 -2.261084]
 [ 2.046259  2.735279]
 [-3.18947   1.780269]
 [ 4.372646 -0.822248]
 [-2.579316 -3.497576]
 [ 1.889034  5.1904  ]
 [-0.798747  2.185588]
 [ 2.83652  -2.658556]
 [-3.837877 -3.253815]
 [ 2.096701  3.886007]
 [-2.709034  2.923887]
 [ 3.367037 -3.184789]
 [-2.121479 -4.232586]
 [ 2.329546  3.179764]
 [-3.284816  3.273099]
 [ 3.091414 -3.815232]
 [-3.762093 -2.432191]
 [ 3.542056  2.778832]
 [-1.736822  4.241041]
 [ 2.127073 -2.98368 ]
 [-4.323818 -3.938116]
 [ 3.792121  5.135768]
 [-4.786473  3.358547]
 [ 2.624081 -3.260715]
 [-4.009299 -2.978115]
 [ 2.493525  1.96371 ]
 [-2.513661  2.642162]
 [ 1.864375 -3.176309]
 [-3.171184 -3.572452]
 [ 2.89422   2.489128]
 [-2.562539  2.884438]
 [ 3.491078 -3.947487]
 [-2.565729 -2.012114]
 [ 3.332948  3.983102]
 [-1.616805  3.573188]
 [ 2.280615 -2.559444]
 [-2.651229 -3.103198]
 [ 2.321395  3.154987]
 [-1.685703  2.939697]
 [ 3.031012 -3.620252]
 [-4.599622 -2.185829]
 [ 4.196223  1.126677]
 [-2.133863  3.093686]
 [ 4.668892 -2.562705]
 [-2.793241 -2.149706]
 [ 2.884105  3.043438]
 [-2.967647  2.848696]
 [ 4.479332 -1.764772]
 [-4.905566 -2.91107 ]]

矩阵中的最大值和最小值:
min(dataMat[:,0]):
[[-5.379713]]
min(dataMat[:,1]):
[[-4.232586]]
max(dataMat[:,0]):
[[ 4.838138]]
max(dataMat[:,1]):
[[ 5.1904]]

质心:
[[-0.63033973  3.67867827]
 [ 4.73957277 -1.20849738]]

distance:
5.18463281668

所有支持函数正常运行之后,就可以准备实现完整的k-均值聚类算法了。

k-均值聚类算法:

#encoding:utf-8

from numpy import *
from math import *

#加载数据集
def loadDataSet(fileName):      #将一个文本文件导入到列表中
    dataMat = []                #创建一个dataMat空列表
    fr = open(fileName)
    for line in fr.readlines():  #一行一行的读取文本文件
        curLine = line.strip().split('\t')
        fltLine = map(float,curLine) #将文本文件转化为字符型
        dataMat.append(fltLine)
    return dataMat

#计算两个向量的欧式距离  
def distEclud(vecA, vecB):      
    return sqrt(sum(power(vecA - vecB, 2))) 

#随机构建初始质心
def randCent(dataSet, k):               #构建一个包含k个随机质心的集合
        #数据列数
    n = shape(dataSet)[1]
    #初始化质心
    centroids = mat(zeros((k,n)))
    for j in range(n):              
            #数据集中每一维的最小和最大值,保证随机选取的质心在边界之内
        minJ = min(dataSet[:,j]) 
        rangeJ = float(max(dataSet[:,j]) - minJ)
        centroids[:,j] = mat(minJ + rangeJ * random.rand(k,1))
    return centroids

 #k-均值算法:四个参数(数据集、簇的数目、距离、创建初始质心)
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
        #数据点的行数
    m = shape(dataSet)[0]
    #用于记录数据点到质心的距离平方
    clusterAssment = mat(zeros((m,2))) 
    #中心点                               
    centroids = createCent(dataSet, k)
    #聚类结束标志
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        #遍历每条数据
        for i in range(m):
                #设置两个变量,分别存放数据点到质心的距离,及数据点属于哪个质心
            minDist = inf; minIndex = -1
            #遍历每个质心
            for j in range(k):
                distJI = distMeas(centroids[j,:],dataSet[i,:])
                if distJI < minDist:
                        #将数据归为最近的质心
                    minDist = distJI; minIndex = j
            #簇分配结果发生改变,更新标志
            if clusterAssment[i,0] != minIndex: clusterChanged = True
            clusterAssment[i,:] = minIndex,minDist**2
        print centroids
        #更新质心
        for cent in range(k):#recalculate centroids
            ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]#get all the point in this cluster
            centroids[cent,:] = mean(ptsInClust, axis=0) #assign centroid to mean 
    return centroids, clusterAssment

#数据导入
data = loadDataSet('testSet.txt')
print "data:"
print data

print "\n"

#将数据按照矩阵方式输出
dataMat = mat(data)
print "dataMat:"
print dataMat

print "\n"

centroids, clusterAssment = kMeans(dataMat,4)
print centroids,clusterAssment

Output:

data:
[[1.658985, 4.285136], [-3.453687, 3.424321], [4.838138, -1.151539], [-5.379713, -3.362104], [0.972564, 2.924086], [-3.567919, 1.531611], [0.450614, -3.302219], [-3.487105, -1.724432], [2.668759, 1.594842], [-3.156485, 3.191137], [3.165506, -3.999838], [-2.786837, -3.099354], [4.208187, 2.984927], [-2.123337, 2.943366], [0.704199, -0.479481], [-0.39237, -3.963704], [2.831667, 1.574018], [-0.790153, 3.343144], [2.943496, -3.357075], [-3.195883, -2.283926], [2.336445, 2.875106], [-1.786345, 2.554248], [2.190101, -1.90602], [-3.403367, -2.778288], [1.778124, 3.880832], [-1.688346, 2.230267], [2.592976, -2.054368], [-4.007257, -3.207066], [2.257734, 3.387564], [-2.679011, 0.785119], [0.939512, -4.023563], [-3.674424, -2.261084], [2.046259, 2.735279], [-3.18947, 1.780269], [4.372646, -0.822248], [-2.579316, -3.497576], [1.889034, 5.1904], [-0.798747, 2.185588], [2.83652, -2.658556], [-3.837877, -3.253815], [2.096701, 3.886007], [-2.709034, 2.923887], [3.367037, -3.184789], [-2.121479, -4.232586], [2.329546, 3.179764], [-3.284816, 3.273099], [3.091414, -3.815232], [-3.762093, -2.432191], [3.542056, 2.778832], [-1.736822, 4.241041], [2.127073, -2.98368], [-4.323818, -3.938116], [3.792121, 5.135768], [-4.786473, 3.358547], [2.624081, -3.260715], [-4.009299, -2.978115], [2.493525, 1.96371], [-2.513661, 2.642162], [1.864375, -3.176309], [-3.171184, -3.572452], [2.89422, 2.489128], [-2.562539, 2.884438], [3.491078, -3.947487], [-2.565729, -2.012114], [3.332948, 3.983102], [-1.616805, 3.573188], [2.280615, -2.559444], [-2.651229, -3.103198], [2.321395, 3.154987], [-1.685703, 2.939697], [3.031012, -3.620252], [-4.599622, -2.185829], [4.196223, 1.126677], [-2.133863, 3.093686], [4.668892, -2.562705], [-2.793241, -2.149706], [2.884105, 3.043438], [-2.967647, 2.848696], [4.479332, -1.764772], [-4.905566, -2.91107]]


dataMat:
[[ 1.658985  4.285136]
 [-3.453687  3.424321]
 [ 4.838138 -1.151539]
 [-5.379713 -3.362104]
 [ 0.972564  2.924086]
 [-3.567919  1.531611]
 [ 0.450614 -3.302219]
 [-3.487105 -1.724432]
 [ 2.668759  1.594842]
 [-3.156485  3.191137]
 [ 3.165506 -3.999838]
 [-2.786837 -3.099354]
 [ 4.208187  2.984927]
 [-2.123337  2.943366]
 [ 0.704199 -0.479481]
 [-0.39237  -3.963704]
 [ 2.831667  1.574018]
 [-0.790153  3.343144]
 [ 2.943496 -3.357075]
 [-3.195883 -2.283926]
 [ 2.336445  2.875106]
 [-1.786345  2.554248]
 [ 2.190101 -1.90602 ]
 [-3.403367 -2.778288]
 [ 1.778124  3.880832]
 [-1.688346  2.230267]
 [ 2.592976 -2.054368]
 [-4.007257 -3.207066]
 [ 2.257734  3.387564]
 [-2.679011  0.785119]
 [ 0.939512 -4.023563]
 [-3.674424 -2.261084]
 [ 2.046259  2.735279]
 [-3.18947   1.780269]
 [ 4.372646 -0.822248]
 [-2.579316 -3.497576]
 [ 1.889034  5.1904  ]
 [-0.798747  2.185588]
 [ 2.83652  -2.658556]
 [-3.837877 -3.253815]
 [ 2.096701  3.886007]
 [-2.709034  2.923887]
 [ 3.367037 -3.184789]
 [-2.121479 -4.232586]
 [ 2.329546  3.179764]
 [-3.284816  3.273099]
 [ 3.091414 -3.815232]
 [-3.762093 -2.432191]
 [ 3.542056  2.778832]
 [-1.736822  4.241041]
 [ 2.127073 -2.98368 ]
 [-4.323818 -3.938116]
 [ 3.792121  5.135768]
 [-4.786473  3.358547]
 [ 2.624081 -3.260715]
 [-4.009299 -2.978115]
 [ 2.493525  1.96371 ]
 [-2.513661  2.642162]
 [ 1.864375 -3.176309]
 [-3.171184 -3.572452]
 [ 2.89422   2.489128]
 [-2.562539  2.884438]
 [ 3.491078 -3.947487]
 [-2.565729 -2.012114]
 [ 3.332948  3.983102]
 [-1.616805  3.573188]
 [ 2.280615 -2.559444]
 [-2.651229 -3.103198]
 [ 2.321395  3.154987]
 [-1.685703  2.939697]
 [ 3.031012 -3.620252]
 [-4.599622 -2.185829]
 [ 4.196223  1.126677]
 [-2.133863  3.093686]
 [ 4.668892 -2.562705]
 [-2.793241 -2.149706]
 [ 2.884105  3.043438]
 [-2.967647  2.848696]
 [ 4.479332 -1.764772]
 [-4.905566 -2.91107 ]]


[[ 3.23128394  0.95068522]
 [ 1.44392461 -0.39902729]
 [-1.62230418 -2.52007168]
 [-3.51187771  3.79667864]]
[[ 2.95373358  2.32801413]
 [ 2.5935345  -2.92880329]
 [-3.01169468 -3.01238673]
 [-2.46154315  2.78737555]]
[[ 2.6265299   3.10868015]
 [ 2.80293085 -2.7315146 ]
 [-3.38237045 -2.9473363 ]
 [-2.46154315  2.78737555]]
[[ 2.6265299   3.10868015]
 [ 2.80293085 -2.7315146 ]
 [-3.38237045 -2.9473363 ]
 [-2.46154315  2.78737555]] [[  0.00000000e+00   2.32019150e+00]
 [  3.00000000e+00   1.39004893e+00]
 [  1.00000000e+00   6.63839104e+00]
 [  2.00000000e+00   4.16140951e+00]
 [  0.00000000e+00   2.76967820e+00]
 [  3.00000000e+00   2.80101213e+00]
 [  1.00000000e+00   5.85909807e+00]
 [  2.00000000e+00   1.50646425e+00]
 [  0.00000000e+00   2.29348924e+00]
 [  3.00000000e+00   6.45967483e-01]
 [  1.00000000e+00   1.74010499e+00]
 [  2.00000000e+00   3.77769471e-01]
 [  0.00000000e+00   2.51695402e+00]
 [  3.00000000e+00   1.38716420e-01]
 [  1.00000000e+00   9.47633071e+00]
 [  2.00000000e+00   9.97310599e+00]
 [  0.00000000e+00   2.39726914e+00]
 [  3.00000000e+00   3.10242360e+00]
 [  1.00000000e+00   4.11084375e-01]
 [  2.00000000e+00   4.74890795e-01]
 [  0.00000000e+00   1.38706133e-01]
 [  3.00000000e+00   5.10240996e-01]
 [  1.00000000e+00   1.05700176e+00]
 [  2.00000000e+00   2.90181828e-02]
 [  0.00000000e+00   1.31601105e+00]
 [  3.00000000e+00   9.08203769e-01]
 [  1.00000000e+00   5.02608557e-01]
 [  2.00000000e+00   4.57942717e-01]
 [  0.00000000e+00   2.13786618e-01]
 [  3.00000000e+00   4.05632356e+00]
 [  1.00000000e+00   5.14171888e+00]
 [  2.00000000e+00   5.56237495e-01]
 [  0.00000000e+00   4.76142736e-01]
 [  3.00000000e+00   1.54414110e+00]
 [  1.00000000e+00   6.10930460e+00]
 [  2.00000000e+00   9.47660177e-01]
 [  0.00000000e+00   4.87745774e+00]
 [  3.00000000e+00   3.12703929e+00]
 [  1.00000000e+00   6.45118831e-03]
 [  2.00000000e+00   3.01415411e-01]
 [  0.00000000e+00   8.84955695e-01]
 [  3.00000000e+00   7.98870968e-02]
 [  1.00000000e+00   5.23673430e-01]
 [  2.00000000e+00   3.24171404e+00]
 [  0.00000000e+00   9.32523506e-02]
 [  3.00000000e+00   9.13705455e-01]
 [  1.00000000e+00   1.25766593e+00]
 [  2.00000000e+00   4.09563895e-01]
 [  0.00000000e+00   9.46987842e-01]
 [  3.00000000e+00   2.63836399e+00]
 [  1.00000000e+00   5.20371222e-01]
 [  2.00000000e+00   1.86796790e+00]
 [  0.00000000e+00   5.46768776e+00]
 [  3.00000000e+00   5.73153563e+00]
 [  1.00000000e+00   3.12040332e-01]
 [  2.00000000e+00   3.93986735e-01]
 [  0.00000000e+00   1.32864695e+00]
 [  3.00000000e+00   2.38032454e-02]
 [  1.00000000e+00   1.07872914e+00]
 [  2.00000000e+00   4.35369355e-01]
 [  0.00000000e+00   4.55502856e-01]
 [  3.00000000e+00   1.96212809e-02]
 [  1.00000000e+00   1.95213538e+00]
 [  2.00000000e+00   1.54154401e+00]
 [  0.00000000e+00   1.26364010e+00]
 [  3.00000000e+00   1.33108375e+00]
 [  1.00000000e+00   3.02422139e-01]
 [  2.00000000e+00   5.58860689e-01]
 [  0.00000000e+00   9.52516316e-02]
 [  3.00000000e+00   6.25129762e-01]
 [  1.00000000e+00   8.41875177e-01]
 [  2.00000000e+00   2.06159470e+00]
 [  0.00000000e+00   6.39227291e+00]
 [  3.00000000e+00   2.01200372e-01]
 [  1.00000000e+00   3.51030769e+00]
 [  2.00000000e+00   9.83287604e-01]
 [  0.00000000e+00   7.06014703e-02]
 [  3.00000000e+00   2.59901305e-01]
 [  1.00000000e+00   3.74491207e+00]
 [  2.00000000e+00   2.32143993e+00]]

6、二分k-均值聚类

二分-K均值是为了解决k-均值的用户自定义输入簇值k所延伸出来的自己判断k数目,其基本思路是:
为了得到k个簇,将所有点的集合分裂成两个簇,从这些簇中选取一个继续分裂,如此下去,直到产生k个簇。

#encoding :utf-8
from numpy import *
from math import *

#加载数据集
def loadDataSet(fileName):      #将一个文本文件导入到列表中
    dataMat = []                #创建一个dataMat空列表
    fr = open(fileName)
    for line in fr.readlines():  #一行一行的读取文本文件
        curLine = line.strip().split('\t')
        fltLine = map(float,curLine) #将文本文件转化为字符型
        dataMat.append(fltLine)
    return dataMat

#计算两个向量的欧式距离  
def distEclud(vecA, vecB):      
    return sqrt(sum(power(vecA - vecB, 2))) 

#随机构建初始质心
def randCent(dataSet, k):               #构建一个包含k个随机质心的集合
        #数据列数
    n = shape(dataSet)[1]
    #初始化质心
    centroids = mat(zeros((k,n)))
    for j in range(n):              
            #数据集中每一维的最小和最大值,保证随机选取的质心在边界之内
        minJ = min(dataSet[:,j]) 
        rangeJ = float(max(dataSet[:,j]) - minJ)
        centroids[:,j] = mat(minJ + rangeJ * random.rand(k,1))
    return centroids


 #k-均值算法:四个参数(数据集、簇的数目、距离、创建初始质心)
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
        #数据点的行数
    m = shape(dataSet)[0]
    #用于记录数据点到质心的距离平方
    clusterAssment = mat(zeros((m,2))) 
    #中心点                               
    centroids = createCent(dataSet, k)
    #聚类结束标志
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        #遍历每条数据
        for i in range(m):
                #设置两个变量,分别存放数据点到质心的距离,及数据点属于哪个质心
            minDist = inf; minIndex = -1
            #遍历每个质心
            for j in range(k):
                distJI = distMeas(centroids[j,:],dataSet[i,:])
                if distJI < minDist:
                        #将数据归为最近的质心
                    minDist = distJI; minIndex = j
            #簇分配结果发生改变,更新标志
            if clusterAssment[i,0] != minIndex: clusterChanged = True
            clusterAssment[i,:] = minIndex,minDist**2
        print centroids
        #更新质心
        for cent in range(k):#recalculate centroids
            ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]#get all the point in this cluster
            centroids[cent,:] = mean(ptsInClust, axis=0) #assign centroid to mean 
    return centroids, clusterAssment

def biKmeans(dataSet, k, distMeas=distEclud):
    m = shape(dataSet)[0]
    clusterAssment = mat(zeros((m,2)))
    #创建一个初始簇
    centroid0 = mean(dataSet, axis=0).tolist()[0]
    centList =[centroid0] #create a list with one centroid
    #计算初始误差 
    for j in range(m):#calc initial Error
        clusterAssment[j,1] = distMeas(mat(centroid0), dataSet[j,:])**2
    while (len(centList) < k):
        lowestSSE = inf
        #遍历每个簇,尝试划分每个簇  
        for i in range(len(centList)):
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:,0].A==i)[0],:]#get the data points currently in cluster i
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
            #划分后的误差平方和
            sseSplit = sum(splitClustAss[:,1])#compare the SSE to the currrent minimum
            #剩余的误差之和
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:,0].A!=i)[0],1])
            print "sseSplit, and notSplit: ",sseSplit,sseNotSplit
            if (sseSplit + sseNotSplit) < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        #更新分配结果
        bestClustAss[nonzero(bestClustAss[:,0].A == 1)[0],0] = len(centList) #change 1 to 3,4, or whatever
        bestClustAss[nonzero(bestClustAss[:,0].A == 0)[0],0] = bestCentToSplit
        print 'the bestCentToSplit is: ',bestCentToSplit
        print 'the len of bestClustAss is: ', len(bestClustAss)
        centList[bestCentToSplit] = bestNewCents[0,:].tolist()[0]#replace a centroid with two best centroids 
        centList.append(bestNewCents[1,:].tolist()[0])
        clusterAssment[nonzero(clusterAssment[:,0].A == bestCentToSplit)[0],:]= bestClustAss#reassign new clusters, and SSE
    return mat(centList), clusterAssment

#数据导入
data = loadDataSet('testSet.txt')
print "data:"
print data

print "\n"

#将数据按照矩阵方式输出
dataMat = mat(data)
print "dataMat:"
print dataMat

print "\n"

mat, clusterAssment = biKmeans(dataMat, 4)
print mat, clusterAssment

Output:

data:
[[1.658985, 4.285136], [-3.453687, 3.424321], [4.838138, -1.151539], [-5.379713, -3.362104], [0.972564, 2.924086], [-3.567919, 1.531611], [0.450614, -3.302219], [-3.487105, -1.724432], [2.668759, 1.594842], [-3.156485, 3.191137], [3.165506, -3.999838], [-2.786837, -3.099354], [4.208187, 2.984927], [-2.123337, 2.943366], [0.704199, -0.479481], [-0.39237, -3.963704], [2.831667, 1.574018], [-0.790153, 3.343144], [2.943496, -3.357075], [-3.195883, -2.283926], [2.336445, 2.875106], [-1.786345, 2.554248], [2.190101, -1.90602], [-3.403367, -2.778288], [1.778124, 3.880832], [-1.688346, 2.230267], [2.592976, -2.054368], [-4.007257, -3.207066], [2.257734, 3.387564], [-2.679011, 0.785119], [0.939512, -4.023563], [-3.674424, -2.261084], [2.046259, 2.735279], [-3.18947, 1.780269], [4.372646, -0.822248], [-2.579316, -3.497576], [1.889034, 5.1904], [-0.798747, 2.185588], [2.83652, -2.658556], [-3.837877, -3.253815], [2.096701, 3.886007], [-2.709034, 2.923887], [3.367037, -3.184789], [-2.121479, -4.232586], [2.329546, 3.179764], [-3.284816, 3.273099], [3.091414, -3.815232], [-3.762093, -2.432191], [3.542056, 2.778832], [-1.736822, 4.241041], [2.127073, -2.98368], [-4.323818, -3.938116], [3.792121, 5.135768], [-4.786473, 3.358547], [2.624081, -3.260715], [-4.009299, -2.978115], [2.493525, 1.96371], [-2.513661, 2.642162], [1.864375, -3.176309], [-3.171184, -3.572452], [2.89422, 2.489128], [-2.562539, 2.884438], [3.491078, -3.947487], [-2.565729, -2.012114], [3.332948, 3.983102], [-1.616805, 3.573188], [2.280615, -2.559444], [-2.651229, -3.103198], [2.321395, 3.154987], [-1.685703, 2.939697], [3.031012, -3.620252], [-4.599622, -2.185829], [4.196223, 1.126677], [-2.133863, 3.093686], [4.668892, -2.562705], [-2.793241, -2.149706], [2.884105, 3.043438], [-2.967647, 2.848696], [4.479332, -1.764772], [-4.905566, -2.91107]]


dataMat:
[[ 1.658985  4.285136]
 [-3.453687  3.424321]
 [ 4.838138 -1.151539]
 [-5.379713 -3.362104]
 [ 0.972564  2.924086]
 [-3.567919  1.531611]
 [ 0.450614 -3.302219]
 [-3.487105 -1.724432]
 [ 2.668759  1.594842]
 [-3.156485  3.191137]
 [ 3.165506 -3.999838]
 [-2.786837 -3.099354]
 [ 4.208187  2.984927]
 [-2.123337  2.943366]
 [ 0.704199 -0.479481]
 [-0.39237  -3.963704]
 [ 2.831667  1.574018]
 [-0.790153  3.343144]
 [ 2.943496 -3.357075]
 [-3.195883 -2.283926]
 [ 2.336445  2.875106]
 [-1.786345  2.554248]
 [ 2.190101 -1.90602 ]
 [-3.403367 -2.778288]
 [ 1.778124  3.880832]
 [-1.688346  2.230267]
 [ 2.592976 -2.054368]
 [-4.007257 -3.207066]
 [ 2.257734  3.387564]
 [-2.679011  0.785119]
 [ 0.939512 -4.023563]
 [-3.674424 -2.261084]
 [ 2.046259  2.735279]
 [-3.18947   1.780269]
 [ 4.372646 -0.822248]
 [-2.579316 -3.497576]
 [ 1.889034  5.1904  ]
 [-0.798747  2.185588]
 [ 2.83652  -2.658556]
 [-3.837877 -3.253815]
 [ 2.096701  3.886007]
 [-2.709034  2.923887]
 [ 3.367037 -3.184789]
 [-2.121479 -4.232586]
 [ 2.329546  3.179764]
 [-3.284816  3.273099]
 [ 3.091414 -3.815232]
 [-3.762093 -2.432191]
 [ 3.542056  2.778832]
 [-1.736822  4.241041]
 [ 2.127073 -2.98368 ]
 [-4.323818 -3.938116]
 [ 3.792121  5.135768]
 [-4.786473  3.358547]
 [ 2.624081 -3.260715]
 [-4.009299 -2.978115]
 [ 2.493525  1.96371 ]
 [-2.513661  2.642162]
 [ 1.864375 -3.176309]
 [-3.171184 -3.572452]
 [ 2.89422   2.489128]
 [-2.562539  2.884438]
 [ 3.491078 -3.947487]
 [-2.565729 -2.012114]
 [ 3.332948  3.983102]
 [-1.616805  3.573188]
 [ 2.280615 -2.559444]
 [-2.651229 -3.103198]
 [ 2.321395  3.154987]
 [-1.685703  2.939697]
 [ 3.031012 -3.620252]
 [-4.599622 -2.185829]
 [ 4.196223  1.126677]
 [-2.133863  3.093686]
 [ 4.668892 -2.562705]
 [-2.793241 -2.149706]
 [ 2.884105  3.043438]
 [-2.967647  2.848696]
 [ 4.479332 -1.764772]
 [-4.905566 -2.91107 ]]


[[-0.61154259 -0.99840509]
 [ 0.21655089  0.84736297]]
[[-0.77465184 -2.93442862]
 [ 0.47379212  2.62599895]]
[[-0.54735726 -2.93692713]
 [ 0.2978695   2.76065064]]
[[-0.2897198  -2.83942545]
 [ 0.08249337  2.94802785]]
sseSplit, and notSplit:  792.916856537 0.0
the bestCentToSplit is:  0
the len of bestClustAss is:  80
[[-1.675842   -3.5709562 ]
 [-0.44613055 -1.66674358]]
[[-3.17656652 -2.99858519]
 [ 2.90100553 -2.66351205]]
[[-3.38237045 -2.9473363 ]
 [ 2.80293085 -2.7315146 ]]
sseSplit, and notSplit:  83.5874695564 326.284075201
[[ 0.17931125  3.21151684]
 [-2.49413705  4.77613273]]
[[ 1.6577805   2.93121792]
 [-2.84303986  2.97924629]]
[[ 2.6265299   3.10868015]
 [-2.46154315  2.78737555]]
sseSplit, and notSplit:  66.36683512 466.632781336
the bestCentToSplit is:  0
the len of bestClustAss is:  40
[[-1.61809661 -4.16618929]
 [-2.19748766 -3.31337473]]
[[-1.2569245  -4.098145  ]
 [-3.61853111 -2.81946867]]
sseSplit, and notSplit:  19.619284753 377.270301893
[[-4.25497296  3.22644812]
 [-3.07524324  4.4511578 ]]
[[-3.28879656  2.53721789]
 [ 1.06125497  3.06729526]]
[[-2.64677572  2.78993217]
 [ 2.31553173  3.07737886]]
[[-2.46154315  2.78737555]
 [ 2.6265299   3.10868015]]
sseSplit, and notSplit:  66.36683512 83.5874695564
[[ 1.60206889 -3.5436413 ]
 [ 3.03830692 -1.88925585]]
[[ 2.3728161 -3.548637 ]
 [ 3.2330456 -1.9143922]]
[[ 2.44798442 -3.43588358]
 [ 3.3353505  -1.67496113]]
[[ 2.47787177 -3.37608915]
 [ 3.406612   -1.53444757]]
sseSplit, and notSplit:  31.6296069802 358.885318066
the bestCentToSplit is:  1
the len of bestClustAss is:  40
[[-3.38237045 -2.9473363 ]
 [-2.46154315  2.78737555]
 [ 2.80293085 -2.7315146 ]
 [ 2.6265299   3.10868015]] [[  3.00000000e+00   2.32019150e+00]
 [  1.00000000e+00   1.39004893e+00]
 [  2.00000000e+00   6.63839104e+00]
 [  0.00000000e+00   4.16140951e+00]
 [  3.00000000e+00   2.76967820e+00]
 [  1.00000000e+00   2.80101213e+00]
 [  2.00000000e+00   5.85909807e+00]
 [  0.00000000e+00   1.50646425e+00]
 [  3.00000000e+00   2.29348924e+00]
 [  1.00000000e+00   6.45967483e-01]
 [  2.00000000e+00   1.74010499e+00]
 [  0.00000000e+00   3.77769471e-01]
 [  3.00000000e+00   2.51695402e+00]
 [  1.00000000e+00   1.38716420e-01]
 [  2.00000000e+00   9.47633071e+00]
 [  0.00000000e+00   9.97310599e+00]
 [  3.00000000e+00   2.39726914e+00]
 [  1.00000000e+00   3.10242360e+00]
 [  2.00000000e+00   4.11084375e-01]
 [  0.00000000e+00   4.74890795e-01]
 [  3.00000000e+00   1.38706133e-01]
 [  1.00000000e+00   5.10240996e-01]
 [  2.00000000e+00   1.05700176e+00]
 [  0.00000000e+00   2.90181828e-02]
 [  3.00000000e+00   1.31601105e+00]
 [  1.00000000e+00   9.08203769e-01]
 [  2.00000000e+00   5.02608557e-01]
 [  0.00000000e+00   4.57942717e-01]
 [  3.00000000e+00   2.13786618e-01]
 [  1.00000000e+00   4.05632356e+00]
 [  2.00000000e+00   5.14171888e+00]
 [  0.00000000e+00   5.56237495e-01]
 [  3.00000000e+00   4.76142736e-01]
 [  1.00000000e+00   1.54414110e+00]
 [  2.00000000e+00   6.10930460e+00]
 [  0.00000000e+00   9.47660177e-01]
 [  3.00000000e+00   4.87745774e+00]
 [  1.00000000e+00   3.12703929e+00]
 [  2.00000000e+00   6.45118831e-03]
 [  0.00000000e+00   3.01415411e-01]
 [  3.00000000e+00   8.84955695e-01]
 [  1.00000000e+00   7.98870968e-02]
 [  2.00000000e+00   5.23673430e-01]
 [  0.00000000e+00   3.24171404e+00]
 [  3.00000000e+00   9.32523506e-02]
 [  1.00000000e+00   9.13705455e-01]
 [  2.00000000e+00   1.25766593e+00]
 [  0.00000000e+00   4.09563895e-01]
 [  3.00000000e+00   9.46987842e-01]
 [  1.00000000e+00   2.63836399e+00]
 [  2.00000000e+00   5.20371222e-01]
 [  0.00000000e+00   1.86796790e+00]
 [  3.00000000e+00   5.46768776e+00]
 [  1.00000000e+00   5.73153563e+00]
 [  2.00000000e+00   3.12040332e-01]
 [  0.00000000e+00   3.93986735e-01]
 [  3.00000000e+00   1.32864695e+00]
 [  1.00000000e+00   2.38032454e-02]
 [  2.00000000e+00   1.07872914e+00]
 [  0.00000000e+00   4.35369355e-01]
 [  3.00000000e+00   4.55502856e-01]
 [  1.00000000e+00   1.96212809e-02]
 [  2.00000000e+00   1.95213538e+00]
 [  0.00000000e+00   1.54154401e+00]
 [  3.00000000e+00   1.26364010e+00]
 [  1.00000000e+00   1.33108375e+00]
 [  2.00000000e+00   3.02422139e-01]
 [  0.00000000e+00   5.58860689e-01]
 [  3.00000000e+00   9.52516316e-02]
 [  1.00000000e+00   6.25129762e-01]
 [  2.00000000e+00   8.41875177e-01]
 [  0.00000000e+00   2.06159470e+00]
 [  3.00000000e+00   6.39227291e+00]
 [  1.00000000e+00   2.01200372e-01]
 [  2.00000000e+00   3.51030769e+00]
 [  0.00000000e+00   9.83287604e-01]
 [  3.00000000e+00   7.06014703e-02]
 [  1.00000000e+00   2.59901305e-01]
 [  2.00000000e+00   3.74491207e+00]
 [  0.00000000e+00   2.32143993e+00]]

你可能感兴趣的:(机器学习实战,Python,机器学习实战,python,机器学习)