Machine Learning Notes: Clustering


Concept

  1. Clustering is the process of finding similarity groups in data, called clusters
    • Data instances that are similar or near to each other are grouped in one cluster
    • Data instances that are (very) different or far away from each other should be in different clusters
    • Clusters are unlabelled, and no a priori grouping of the data instances is given
    • Thus, clustering is often known as unsupervised learning
Approaches
  • K-means
    $label_i = \arg\min_j \sqrt{\sum_{i=1}^{n}(x_i - a_j)^2}$
    $a_j = \frac{1}{N(C_j)}\sum_{i \in C_j} x_i$
    The K-means algorithm assumes the sample set is $T = \{X_1, X_2, X_3, ..., X_m\}$.
    Using the Euclidean distance formula [1], the algorithm proceeds as follows:
  • Determine the value of K (the number of clusters)
  • Randomly choose K initial centroids
  • Repeat:
    • Assign each data point to the nearest centroid
    • Update the centroids based on the data partitioning
  • Until the stopping criterion is met
    [Figure 1]
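The steps above can be sketched in plain NumPy. This is an illustrative implementation, not the original post's code; the function name `kmeans` and the toy two-blob data are assumptions for the demo.

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose K initial centroids from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each data point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stopping criterion: centroids barely move between iterations
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids

# Toy sample set T: two well-separated blobs of 50 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(3.0, 0.1, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

With well-separated blobs, every point in a blob ends up with the same label, matching the convergence picture described above.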
    However, how do we determine whether a clustering is good (according to K-means)?
  • Minimise the Sum of Squared Error (SSE) from data points to their corresponding centroids
    $SSE = \sum_{j=1}^{k}\sum_{x \in C_j} dist(x, m_j)^2$
  • $C_j$ denotes the $j^{th}$ cluster, $m_j$ is the centroid of cluster $C_j$, and $dist(x, m_j)$ denotes the distance between data point $x$ and its centroid.
  • Hence, the stopping criterion for the iterative estimation of the centroids is often based on the change in SSE
    • Very small changes in SSE indicate convergence.
    • Sometimes, a fixed number of iterations is used instead.
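The SSE definition above translates directly into code. A minimal sketch (the helper name `sse` and the toy data are illustrative, not from the original post):

```python
import numpy as np

def sse(X, labels, centroids):
    # Sum of squared Euclidean distances from each point to its assigned centroid
    diffs = X - centroids[labels]
    return float((diffs ** 2).sum())

# Toy check: two clusters, centroids placed at the cluster means;
# every point is exactly distance 1 from its centroid, so SSE = 4 * 1^2 = 4
X = np.array([[0.0, 0.0], [0.0, 2.0], [5.0, 0.0], [5.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [5.0, 1.0]])
print(sse(X, labels, centroids))  # 4.0
```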

Example:

  • Step 1: random initialisation of centroids
    [Figure 2]
  • Step 2: assign each data point to the nearest centroid
    [Figure 3]
  • Step 3: recalculate the centroids
    [Figure 4]
  • Repeat steps 2 and 3:
    [Figure 5]
  • Until convergence
    [Figure 6]
	import numpy as np
	import matplotlib.pyplot as plt
	# make_blobs moved to sklearn.datasets in recent scikit-learn versions
	from sklearn.datasets import make_blobs
	from sklearn.cluster import KMeans

	# Generate 2000 two-dimensional samples around four centres
	X, y = make_blobs(n_samples=2000, n_features=2,
	                  centers=[[-1, -1], [0, 0], [1, 1], [2, 2]],
	                  cluster_std=[0.4, 0.2, 0.2, 0.2], random_state=9)
	plt.scatter(X[:, 0], X[:, 1], marker='+')
	plt.show()

	# Cluster into K=4 groups and colour the points by predicted label
	y_pred = KMeans(n_clusters=4, random_state=9).fit_predict(X)
	plt.scatter(X[:, 0], X[:, 1], c=y_pred)
	plt.show()
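After fitting, scikit-learn exposes the achieved SSE as the `inertia_` attribute, which can be compared across K values (the "elbow" heuristic). A sketch assuming scikit-learn's current API; the data generation mirrors the example above:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=500, n_features=2,
                  centers=[[-1, -1], [0, 0], [1, 1], [2, 2]],
                  cluster_std=[0.4, 0.2, 0.2, 0.2], random_state=9)

# SSE (inertia_) shrinks as K grows; look for the K where the decrease flattens
sse_by_k = {k: KMeans(n_clusters=k, random_state=9, n_init=10).fit(X).inertia_
            for k in range(2, 7)}
```

Plotting `sse_by_k` against K and picking the bend is one common way to address the "have to know K" limitation noted in the summary below.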

Summary

  • Generally fast (although an iterative process)
  • Still one of the most popular clustering algorithms
    • The fuzzified version is often more robust
  • Requires the number of clusters K to be known in advance; different K values produce different results
    • For some cases, K is not easy to choose
  • Provides a local solution
    • Results depend on the initialisation
  • Sensitive to outliers

[1] Euclidean distance formula
Usually, the 2D Euclidean distance is the straight-line distance between two points.
2D formula: $distance = \sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$
3D formula: $distance = \sqrt{(x_1-x_2)^2+(y_1-y_2)^2+(z_1-z_2)^2}$
Following this pattern, the n-dimensional formula is:
$distance = \sqrt{\sum_{i=1}^{n}(x_{i1}-x_{i2})^2}$, where $x_{i1}$ and $x_{i2}$ are the $i^{th}$ coordinates of the two points.
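A quick numeric check of the formulas above, using only the standard library (the helper name `euclidean` is illustrative; `math.dist` is the built-in equivalent):

```python
import math

def euclidean(p, q):
    # n-dimensional Euclidean distance: square root of summed squared
    # coordinate differences, matching the general formula above
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((0, 0), (3, 4)))        # 5.0 (2D case: 3-4-5 triangle)
print(euclidean((1, 2, 3), (4, 6, 3)))  # 5.0 (3D case)
# Agrees with the standard library
assert euclidean((0, 0), (3, 4)) == math.dist((0, 0), (3, 4))
```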

[2] chaowu1993. (2018). KMeans原理与源码实现 (KMeans principles and source-code implementation). Retrieved from https://blog.csdn.net/weixin_40479663/article/details/82974625
