机器学习经典算法

K-Means Clustering

作为非监督学习的典型代表，K-Means 可以通过对于数据点的聚类来完成分类，其计算过程如下：

按照设定的 K 值随机初始化 K 个聚类的中心点
将所有的数据点根据离这 K 个中心点的距离进行分类
计算每一个分类中的数据点的均值，并将这 K 个均值点作为新的中心点，重新进行聚类
上述聚类过程循环多次，算法最终会按照设定的收敛条件收敛到指定的 K 个中心点以完成聚类

在计算机相关应用中，可以利用 K-Means 来完成图像的分割，并在此基础上对分割后的图像进行操作：

criteria = (cv2.TERM_CRITERIA_MAX_ITER + cv2.TERM_CRITERIA_EPS, 100, 0.2)

# then perform K-Means clustering
# the number 10 after the criteria is asking the algorithm to try 10 different initial centers
k = 3
retval, labels, centers = cv2.kmeans(pixel_vals, k, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)

# convert data into 8-bit values
centers = np.uint8(centers)
segmented_data = centers[labels.flatten()]

# reshape data into the original image dimensions
segmented_image = segmented_data.reshape((image.shape))

# display the image with colors according to each pixel's class centers intensity
plt.imshow(segmented_data)

层次聚类 Hierarchical Clustering

K-Means 算法在预先知道分类的数量 K 时，使用起来一般更加有效，但由于 K-Means 算法中类的中心的定义方式使得其更加倾向于基于圆形、球形或超球面进行分类，例如针对下面这个明显的月牙形数据分布的分类来说，K-Means 就无法做出合适的分类。

K-means would fail in tasks like this

此时，如果我们对于评价是否归属一个分类的距离有一个大致的概念时，可以选择层次聚类的方式：层次聚类如其名称所述，通过衡量两个特征点之间的距离，分层级的不断扩大归属于同一个聚类的点的数量，最终当算法计算出来的分类取得设定值时停止聚类。根据层次聚类中距离的定义方式，其又可以进一步分解为多种不同的算法。

在 Scikit-Learn 的聚类算法中层次聚类的实施算法采用的是 Agglomerative clustering，其采用从底向上的方式进行聚类，算法开始时会假定每一个输入数据都属于一个单独的分类，在此基础上根据选定的判断标准逐渐将相同的类别进行融合。

The Agglomerative Clustering object performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together. The linkage criteria determines the metric used for the merge strategy:

Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.

Maximum or complete linkage minimizes the maximum distance between observations of pairs of clusters.

Average linkage minimizes the average of the distances between all observations of pairs of clusters.

在 Scikit-Learn 中使用层次聚类的代码如下：

from sklearn import cluster
# Specify the parameter for linkage method, 'ward' is default, 
# and can also use 'complete' and 'average'
cluster = cluster.AgglomerativeClustering(n_clusters=3, linkage='ward')
# labels will be an array with the same length of the data which represents which cluster each point belongs to.
labels = cluster.fit_predict(X)

层次聚类的较为突出的缺点：

对于异常值敏感，因此应该尽量确保数据的质量
计算复杂度较高，O(n²)

DBSCAN

DBSCAN 算法的全称是 Density Based Spatial Clustering Applications with Noise，这个聚类算法最终不会将所有的数据划归到某一个分类当中，仅会对足够紧凑的数据集进行聚类，而将其他零散数据视作异常值，因此对于数据噪声不再敏感。

在 Scikit-Learn 中使用 DBSCAN 聚类的代码如下：

from sklearn import cluster
# Specify the parameter for distance and minimum number of points in a cluster
dbscan = cluster.DBSCAN(eps=5, min_samples=5)
# labels will be an array with the same length of the data which represents which cluster each point belongs to.
dbscan.fit(X)
# The result can be found via `dbscan.labels_` with -1 being noise

在 DBSCAN 中无需事先指定分类的数量，并且对于异常数据不敏感，其主要缺点为：

在分类过程中假设位于边界的点与两个不同分类的距离相同，那么算法会采取先到先得的方式，因而可能无法准确分类边界点

Gaussian Mixture Model Clustering

高斯混合模型聚类假定同一类中的数据点服从同一个概率（高斯）分布，在两个类的重合部分，会分别计算样本点属于其中某一个类的概率，进而将其划归到概率更高的类中。

算法实现的步骤如下：

指定潜在的分类的数量 K，进而建立 K 个高斯分布，也即指定每一个分布的均值和方差，在实际实现过程中可以先对数据集执行 K-Means 聚类，在聚类的基础上再计算相应的均值和方差
按照当前的分布的对数据进行分类
以最大化期望为目标，在前述分类的基础上对于之前建立的多个分布进行评估
评估算法的收敛性，如果未能收敛则从第二步开始执行

通过 Scikit-Learn 执行高斯混合模型聚类的步骤如下：

from sklearn import mixture

# Specify the parameters for the clustering
gmm = mixture.GaussianMixture(n_components=3)

# Fit the model
gmm.fit(X)
# clustering contains an array representing class labels each data belongs to
clustering = gmm.predict(X)

聚类的结果评估

在聚类完成后，我们还需要通过一些客观的指标对于聚类的结果进行评价，其主要的评价指标的来源如下：

数据的外部标注
数据的内部评价指标：同一类的聚合程度 Compactness 和类间的分隔程度 Seperability

具体的计算过程可以参考 Scikit-Learn 的相应文档部分。

参考阅读

Clustering in Scikit-Learn

机器学习经典算法 - 聚类算法