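K-Means is a clustering algorithm that finds K clusters in a given data set. It is called K-means because it finds K distinct clusters, and the center of each cluster is computed as the mean of the values in that cluster. The number of clusters K is chosen by the user, and each cluster is described by its centroid, that is, the center of all the points in the cluster. The biggest difference between clustering and classification is that in classification the target classes are known, whereas in clustering they are not.
As a result, K-Means produces mutually separated, roughly spherical clusters in which the mean points tend to converge to the cluster centers. The clusters are usually expected to be of similar size, so that assigning each observation to its nearest cluster center (that is, the mean point) is a reasonable assignment.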
Given two samples $X=(x_{1},x_{2},...,x_{n})$ and $Y=(y_{1},y_{2},...,y_{n})$, where $n$ is the number of features, the Euclidean distance between the vectors X and Y is:
$$dist_{ed}(X,Y)=\|X-Y\|_{2}=\sqrt{(x_{1}-y_{1})^{2}+...+(x_{n}-y_{n})^{2}}$$
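As a quick check of the formula, here is a minimal sketch in Python 3. The helper name `euclidean_distance` and the sample points are illustrative assumptions, not part of the original article.

```python
from math import sqrt


def euclidean_distance(x, y):
    """Euclidean distance between two points with the same number of features."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of features")
    # Sum the squared per-feature differences, then take the square root.
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# The differences are (3, 4), so the distance is sqrt(9 + 16).
print(euclidean_distance([1, 2], [4, 6]))  # 5.0
```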
$$SSE=\sum_{i=1}^{K} \sum_{x \in C_{i}}\left(C_{i}-x\right)^{2}$$
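Here $C_{i}$ denotes the center of the $i$-th cluster (the same symbol is also used for the cluster itself). For every $x$ that belongs to cluster $C_{i}$, we compute the squared Euclidean distance between $x$ and the center; summing this over all samples gives the objective function of k-means. Making the samples within a cluster as similar as possible therefore means minimizing the SSE.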
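The objective can be evaluated directly once every point has been assigned to a cluster. The following is a minimal sketch, assuming the clusters are given as lists of points; the function name `sse` and the toy data are hypothetical.

```python
def sse(clusters, centers):
    """Sum of squared Euclidean distances from each point to its cluster center.

    `clusters[i]` is the list of points assigned to `centers[i]`.
    """
    total = 0.0
    for points, center in zip(clusters, centers):
        for point in points:
            total += sum((c - x) ** 2 for c, x in zip(center, point))
    return total


# Two hypothetical clusters with hand-picked centers.
clusters = [[[1, 2], [3, 2]], [[8, 8], [8, 10]]]
centers = [[2, 2], [8, 9]]
print(sse(clusters, centers))  # 4.0 (each of the four points is 1 away from its center)
```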
$$\begin{aligned} \frac{\partial}{\partial C_{k}} SSE &=\frac{\partial}{\partial C_{k}} \sum_{i=1}^{K} \sum_{x \in C_{i}}\left(C_{i}-x\right)^{2} \\ &=\sum_{i=1}^{K} \sum_{x \in C_{i}} \frac{\partial}{\partial C_{k}}\left(C_{i}-x\right)^{2} \\ &=\sum_{x \in C_{k}} 2\left(C_{k}-x\right)=0 \end{aligned}$$
Only the term with $i=k$ depends on $C_{k}$, so all other terms vanish. Solving for the center gives:
$$\sum_{x \in C_{k}} 2\left(C_{k}-x\right)=0 \Rightarrow m_{k} C_{k}=\sum_{x \in C_{k}} x \Rightarrow C_{k}=\frac{1}{m_{k}} \sum_{x \in C_{k}} x$$
Here m is the number of points in the cluster. Notice that this solution for C is exactly the mean of the points in the cluster. The mean of several points is straightforward: given a set of points $X_{1},...,X_{m}$, where $X_{i}=(x_{i1},x_{i2},...,x_{in})$, their mean vector is:
$$C=\left(\frac{x_{11}+...+x_{m1}}{m},...,\frac{x_{1n}+...+x_{mn}}{m}\right)$$
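The claim that the mean minimizes the squared error can be checked numerically. This is a small sketch assuming Python 3 (true division); the names `mean_point` and `squared_error` and the sample points are illustrative, not from the original article.

```python
def mean_point(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    m = len(points)
    # zip(*points) groups the values of each coordinate together.
    return [sum(coords) / m for coords in zip(*points)]


def squared_error(center, points):
    """Sum of squared Euclidean distances from `center` to every point."""
    return sum(
        sum((c - x) ** 2 for c, x in zip(center, point)) for point in points
    )


points = [[1, 2], [3, 2], [5, 5]]     # illustrative data
center = mean_point(points)           # [3.0, 3.0]
best = squared_error(center, points)  # 5 + 1 + 8 = 14.0

# Nudging the center in any direction should only increase the error.
for dx, dy in [(0.5, 0), (-0.5, 0), (0, 0.5), (0, -0.5)]:
    moved = [center[0] + dx, center[1] + dy]
    assert squared_error(moved, points) > best

print(center, best)  # [3.0, 3.0] 14.0
```

The check only probes four directions, so it illustrates the result of the derivation rather than proving it.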
function K-Means(input data, number of centers K)
from collections import defaultdict
from random import uniform
from math import sqrt
def point_avg(points):
    """
    Accepts a list of points, each with the same number of dimensions.
    NB. points can have more dimensions than 2.
    Returns a new point which is the center of all the points.
    """
    dimensions = len(points[0])
    new_center = []
    for dimension in range(dimensions):
        dim_sum = 0  # sum of this coordinate over all points
        for p in points:
            dim_sum += p[dimension]
        # average of each dimension
        new_center.append(dim_sum / float(len(points)))
    return new_center
def update_centers(data_set, assignments):
    """
    Accepts a dataset and a list of assignments; the indexes
    of both lists correspond to each other.
    Compute the center for each of the assigned groups.
    Return `k` centers where `k` is the number of unique assignments.
    """
    new_means = defaultdict(list)
    centers = []
    # group the points by the cluster index they are assigned to
    for assignment, point in zip(assignments, data_set):
        new_means[assignment].append(point)
    # the new center of each group is the mean of its points
    for points in new_means.values():
        centers.append(point_avg(points))
    return centers
def assign_points(data_points, centers):
    """
    Given a data set and a list of centers, assign each point to the
    index of the center that is closest to it.
    Return a list of center indexes, one per point in the data set;
    that is, if there are N points in `data_points` the list we
    return will have N elements. Also, if there are Y points in
    `centers` there will be at most Y unique values within the
    returned list.
    """
    assignments = []
    for point in data_points:
        shortest = float('inf')  # positive infinity
        shortest_index = 0
        for i in range(len(centers)):
            val = distance(point, centers[i])
            if val < shortest:
                shortest = val
                shortest_index = i
        assignments.append(shortest_index)
    return assignments
def distance(a, b):
    """Euclidean distance between points `a` and `b`."""
    dimensions = len(a)
    _sum = 0
    for dimension in range(dimensions):
        difference_sq = (a[dimension] - b[dimension]) ** 2
        _sum += difference_sq
    return sqrt(_sum)
def generate_k(data_set, k):
    """
    Given `data_set`, which is an array of arrays,
    find the minimum and maximum for each coordinate, a range.
    Generate `k` random points between the ranges.
    Return an array of the random points within the ranges.
    """
    centers = []
    dimensions = len(data_set[0])
    min_max = defaultdict(int)

    # record the smallest and largest value seen for each coordinate
    for point in data_set:
        for i in range(dimensions):
            val = point[i]
            min_key = 'min_%d' % i
            max_key = 'max_%d' % i
            if min_key not in min_max or val < min_max[min_key]:
                min_max[min_key] = val
            if max_key not in min_max or val > min_max[max_key]:
                min_max[max_key] = val

    # draw each coordinate uniformly from its observed range
    for _k in range(k):
        rand_point = []
        for i in range(dimensions):
            min_val = min_max['min_%d' % i]
            max_val = min_max['max_%d' % i]
            rand_point.append(uniform(min_val, max_val))
        centers.append(rand_point)

    return centers
def k_means(dataset, k):
    k_points = generate_k(dataset, k)
    assignments = assign_points(dataset, k_points)
    old_assignments = None
    # repeat until no point changes cluster
    while assignments != old_assignments:
        new_centers = update_centers(dataset, assignments)
        old_assignments = assignments
        assignments = assign_points(dataset, new_centers)
    return list(zip(assignments, dataset))
# points = [
# [1, 2],
# [2, 1],
# [3, 1],
# [5, 4],
# [5, 5],
# [6, 5],
# [10, 8],
# [7, 9],
# [11, 5],
# [14, 9],
# [14, 14],
# ]
# print(k_means(points, 3))