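Clustering is widely used in machine learning, data mining, pattern recognition, image analysis, and bioinformatics. It partitions similar objects into groups, or subsets, by static classification, so that the members of one subset share some attributes; a common example is a short spatial distance (usually Euclidean distance) between them in a coordinate system.
In business, clustering helps market analysts discover distinct customer groups in a customer database and characterize each group by its purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, to classify genes, and to gain insight into the structure inherent in populations. Clustering also helps identify similar regions in earth-observation databases, group automobile insurance policyholders, and group the houses in a city by type, value, and geographic location. It can also be used to categorize documents on the Web for information discovery.
K-Means is a clustering algorithm that finds K clusters in a given data set. It is called K-means because it finds K distinct clusters and computes the center of each cluster as the mean of the values in that cluster. The number of clusters K is specified by the user, and each cluster is described by its centroid, the center of all points in the cluster. The biggest difference between clustering and classification is that in classification the target classes are known in advance, whereas in clustering they are not.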
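K-Means is an iterative algorithm: it repeatedly assigns each point to its nearest cluster center and then recomputes the centers. The iteration stops when any of the following conditions is met:
1) No (or only a minimal number of) objects are reassigned to a different cluster.
2) No (or only a minimal number of) cluster centers change.
3) The sum of squared errors reaches a local minimum.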
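The result is a set of well-separated, roughly spherical clusters whose mean points converge toward the cluster centers. The clusters are usually expected to be of roughly similar size, so that assigning each observation to its nearest cluster center (that is, its mean point) is a reasonable assignment.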
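The three stopping conditions listed above can be checked roughly as follows. This is a minimal sketch, not part of the original implementation: the function name `has_converged`, its parameters, and the tolerance `tol` are assumptions made for illustration, and "minimal number" is simplified here to "none" for condition 1 and "no movement larger than `tol`" for condition 2.

```python
def has_converged(old_assign, new_assign, old_centers, new_centers,
                  old_sse, new_sse, tol=1e-6):
    """Return True if any of the three stopping conditions holds.

    All names and the tolerance are illustrative assumptions.
    """
    # 1) No point was reassigned to a different cluster.
    if old_assign == new_assign:
        return True
    # 2) No center moved by more than tol along any coordinate.
    max_shift = max(
        (abs(a - b)
         for c_old, c_new in zip(old_centers, new_centers)
         for a, b in zip(c_old, c_new)),
        default=0.0,
    )
    if max_shift <= tol:
        return True
    # 3) The sum of squared errors has stopped decreasing.
    return abs(old_sse - new_sse) <= tol
```

In practice a single condition is often enough; checking whether the assignments changed is the simplest of the three.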
Clustering groups similar objects together, and this similarity (or distance) is measured by Euclidean distance.
Given two samples $X=(x_{1},x_{2},...,x_{n})$ and $Y=(y_{1},y_{2},...,y_{n})$, where $n$ is the number of features, the Euclidean distance between the vectors $X$ and $Y$ is:
$$dist_{ed}(X,Y)=||X-Y||_{2}=\sqrt[2]{(x_{1}-y_{1})^{2}+...+(x_{n}-y_{n})^{2}}$$
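As a quick check of the distance formula, here is a small sketch in plain Python. The function name `euclidean_distance` is an assumption made for this example; it simply sums the squared coordinate differences and takes the square root.

```python
from math import sqrt


def euclidean_distance(x, y):
    """Euclidean distance between two points with the same number of features."""
    if len(x) != len(y):
        raise ValueError("both points must have the same number of features")
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# (1 - 4)^2 + (2 - 6)^2 = 9 + 16 = 25, and sqrt(25) = 5
print(euclidean_distance([1, 2], [4, 6]))  # 5.0
```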
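K-means partitions the data into clusters so that differences within a cluster are small and differences between clusters are large. How can this goal be stated mathematically? The usual objective function is the sum of squared errors (SSE), which closely resembles the residual sum of squares used as the loss function in linear regression:
$$SSE=\sum_{i=1}^{K} \sum_{x \in C_{i}}\left(C_{i}-x\right)^{2}$$
Here $C_{i}$ denotes the center of cluster $i$ (and, under the summation sign, the set of points in that cluster). For every sample $x$ that belongs to cluster $C_{i}$, we compute the squared Euclidean distance between $x$ and the center; summing these values over all samples gives the K-means objective function. Keeping the samples within each cluster similar means minimizing the SSE.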
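The objective can be computed directly from its definition. The sketch below assumes points and centers are lists of numbers and that `assignments[j]` holds the index of the center for `points[j]`; the function name `sse` and the sample data are made up for illustration.

```python
def sse(points, centers, assignments):
    """Sum of squared distances from each point to its assigned center."""
    total = 0.0
    for point, label in zip(points, assignments):
        center = centers[label]
        # Squared Euclidean distance between the point and its center.
        total += sum((c - p) ** 2 for c, p in zip(center, point))
    return total


points = [[1, 2], [2, 1], [5, 4], [5, 5]]
centers = [[1.5, 1.5], [5.0, 4.5]]
assignments = [0, 0, 1, 1]
# 0.5 + 0.5 + 0.25 + 0.25
print(sse(points, centers, assignments))  # 1.5
```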
A function's extrema can be found by differentiation, so let us take the partial derivative of the SSE with respect to a center $C_{k}$. Only the $i=k$ term of the outer sum depends on $C_{k}$, so:
$$\begin{aligned} \frac{\partial}{\partial C_{k}} SSE &=\frac{\partial}{\partial C_{k}} \sum_{i=1}^{K} \sum_{x \in C_{i}}\left(C_{i}-x\right)^{2} \\ &=\sum_{i=1}^{K} \sum_{x \in C_{i}} \frac{\partial}{\partial C_{k}}\left(C_{i}-x\right)^{2} \\ &=\sum_{x \in C_{k}} 2\left(C_{k}-x\right)=0 \end{aligned}$$
$$\sum_{x \in C_{k}} 2\left(C_{k}-x\right)=0 \Rightarrow m_{k} C_{k}=\sum_{x \in C_{k}} x \Rightarrow C_{k}=\frac{1}{m_{k}} \sum_{x \in C_{k}} x$$
Here $m_{k}$ is the number of points in cluster $k$. In other words, the solution for $C_{k}$ is the mean of the points in the cluster. The mean of several points is straightforward: given a set of points $X_{1},...,X_{m}$, where $X_{i}=(x_{i1},x_{i2},...,x_{in})$, their mean vector is:
$$C=\left(\frac{x_{11}+...+x_{m1}}{m},...,\frac{x_{1n}+...+x_{mn}}{m}\right)$$
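The result of the derivation can be checked numerically: the per-coordinate mean of a cluster should give a smaller SSE than any nearby point. This is only a sanity check on one made-up cluster, not a proof, and the names `mean_point` and `cluster_sse` are assumptions for this example.

```python
def mean_point(points):
    """Per-coordinate mean of a list of points."""
    m = len(points)
    # zip(*points) groups the values of each coordinate across all points.
    return [sum(coords) / m for coords in zip(*points)]


def cluster_sse(points, center):
    """Sum of squared distances from every point to one center."""
    return sum((c - p) ** 2 for point in points for c, p in zip(center, point))


cluster = [[1, 2], [2, 1], [3, 1]]
centroid = mean_point(cluster)
best = cluster_sse(cluster, centroid)

# Moving the center slightly in any direction increases the SSE.
for dx, dy in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]:
    shifted = [centroid[0] + dx, centroid[1] + dy]
    assert cluster_sse(cluster, shifted) > best

print(centroid, best)
```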
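function K-Means(input data, number of centers K)
    get the dimension Dim and the count N of the input data
    randomly generate K points of dimension Dim
    while (the algorithm has not converged)
        for each of the N points: determine which cluster it belongs to
        for each of the K centers:
            1. find all data points that belong to its cluster
            2. move its coordinates to the mean of those data points
    end
    output the result
end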
from collections import defaultdict
from random import uniform
from math import sqrt


def point_avg(points):
    """
    Accepts a list of points, each with the same number of dimensions.
    NB. points can have more dimensions than 2
    Returns a new point which is the center of all the points.
    """
    dimensions = len(points[0])
    new_center = []
    for dimension in range(dimensions):
        dim_sum = 0  # dimension sum
        for p in points:
            dim_sum += p[dimension]
        # average of each dimension
        new_center.append(dim_sum / float(len(points)))
    return new_center


def update_centers(data_set, assignments):
    """
    Accepts a dataset and a list of assignments; the indexes
    of both lists correspond to each other.
    Compute the center for each of the assigned groups.
    Return `k` centers where `k` is the number of unique assignments.
    """
    new_means = defaultdict(list)
    centers = []
    for assignment, point in zip(assignments, data_set):
        new_means[assignment].append(point)
    # Visit the groups in ascending index order so the order of the
    # centers is stable; a group that received no points yields no center.
    for index in sorted(new_means):
        centers.append(point_avg(new_means[index]))
    return centers


def assign_points(data_points, centers):
    """
    Given a data set and a list of center points, assign each data
    point to the index of the center that is closest to it.
    Return a list of center indexes, one per data point; that is,
    if there are N points in `data_points` the list we return will
    have N elements. Also, if there are Y points in `centers` there
    will be at most Y unique values within the returned list.
    """
    assignments = []
    for point in data_points:
        shortest = float('inf')  # positive infinity
        shortest_index = 0
        for i in range(len(centers)):
            val = distance(point, centers[i])
            if val < shortest:
                shortest = val
                shortest_index = i
        assignments.append(shortest_index)
    return assignments


def distance(a, b):
    """
    Return the Euclidean distance between points `a` and `b`.
    """
    dimensions = len(a)
    _sum = 0
    for dimension in range(dimensions):
        difference_sq = (a[dimension] - b[dimension]) ** 2
        _sum += difference_sq
    return sqrt(_sum)


def generate_k(data_set, k):
    """
    Given `data_set`, which is an array of arrays,
    find the minimum and maximum for each coordinate, a range.
    Generate `k` random points between the ranges.
    Return an array of the random points within the ranges.
    """
    centers = []
    dimensions = len(data_set[0])
    min_max = defaultdict(int)

    for point in data_set:
        for i in range(dimensions):
            val = point[i]
            min_key = 'min_%d' % i
            max_key = 'max_%d' % i
            if min_key not in min_max or val < min_max[min_key]:
                min_max[min_key] = val
            if max_key not in min_max or val > min_max[max_key]:
                min_max[max_key] = val

    for _k in range(k):
        rand_point = []
        for i in range(dimensions):
            min_val = min_max['min_%d' % i]
            max_val = min_max['max_%d' % i]
            rand_point.append(uniform(min_val, max_val))
        centers.append(rand_point)

    return centers


def k_means(dataset, k):
    """
    Cluster `dataset` into at most `k` groups and return a list of
    (cluster index, point) pairs.
    """
    k_points = generate_k(dataset, k)
    assignments = assign_points(dataset, k_points)
    old_assignments = None
    # Repeat until no point changes cluster between two iterations.
    while assignments != old_assignments:
        new_centers = update_centers(dataset, assignments)
        old_assignments = assignments
        assignments = assign_points(dataset, new_centers)
    return list(zip(assignments, dataset))


# points = [
#     [1, 2],
#     [2, 1],
#     [3, 1],
#     [5, 4],
#     [5, 5],
#     [6, 5],
#     [10, 8],
#     [7, 9],
#     [11, 5],
#     [14, 9],
#     [14, 14],
# ]
# print(k_means(points, 3))
https://zhuanlan.zhihu.com/p/75477709