聚类实现

1、kmeans

k-means算法的思路很简单,一句话描述就是:随机选定K个中心点,将样本归入距离其最近的中心点代表的簇,重新计算K个簇的中心,重复上述过程,直到误差减小值低于某个阈值。

但是我们常常忽视k-means算法的适用范围,即它在什么情况下表现好,什么情况下表现不好。我们知道,所有机器学习的算法的效果好坏都是和数据相关的,不存在一种放之四海而皆准的算法。那么k-means适用于怎样的数据呢?或者说使用k-means能产生好的簇划分的数据应该满足何种条件呢?

当我们使用簇内误差平方和作为优化目标,隐含的假设就是簇应该是凸的、各向同性的(通俗地说就是球形簇),且各个簇的尺寸(样本数量)和密度(方差)应该相当。

下面我们看一下当数据不满足假设时k-means的表现:
(https://scikit-learn.org/stable/modules/clustering.html)

import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

plt.figure(figsize=(12, 12))

n_samples = 1500
random_state = 177
#默认centers=3,即产生3个簇
X, y = make_blobs(n_samples=n_samples, random_state=random_state)

# Incorrect number of clusters
y_pred = KMeans(n_clusters=2, random_state=random_state).fit_predict(X)

plt.subplot(221)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.title("Incorrect Number of Blobs")

# Anisotropicly distributed data
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
X_aniso = np.dot(X, transformation)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_aniso)

plt.subplot(222)
plt.scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)
plt.title("Anisotropicly Distributed Blobs")

# Different variance
X_varied, y_varied = make_blobs(n_samples=n_samples,
                                cluster_std=[1.0, 2.5, 0.5],
                                random_state=random_state)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_varied)

plt.subplot(223)
plt.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)
plt.title("Unequal Variance")

# Unevenly sized blobs
# np.vstack:按垂直方向(行顺序)堆叠数组构成一个新的数组
X_filtered = np.vstack((X[y == 0][:500], X[y == 1][:100], X[y == 2][:10]))
y_pred = KMeans(n_clusters=3,
                random_state=random_state).fit_predict(X_filtered)

plt.subplot(224)
plt.scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)
plt.title("Unevenly Sized Blobs")

plt.show()

图一展示的是K选取不当时的结果,关于K的选取一会会再提到。

图二是簇形状不规则的情形(非球形),此时k-means的聚类结果表现并不理想。

图三是各个簇方差不同的情况,可以看到,方差大的簇更难“留住样本”,划分结果显示方差小的簇可能会“抢夺”方差大的簇的样本,使划分结果表现不好。

图四是各个簇包含样本数不同的情况,可以看到中心点倾向于向样本多的方向偏移,从而导致尺寸大的簇中部分样本点被划入错误的簇(图中实际上就出现了一个簇中有两个中心点的情况,这会分走簇的一大部分样本)。

除了上述四种情况,k-means还有一个局限,那就是维度灾难。维度灾难对几乎所有机器学习算法来说都是棘手的问题,对于基于距离定义相似度的k-means算法更是如此,因此当我们处理高维数据的聚类时,通常要先采用适当的降维方法(如PCA等)然后再对降维后的数据进行聚类。

下面是一个手写数字识别数据集上的聚类问题,我们先使用PCA降维然后运用k-means进行聚类:
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py

from time import time
import numpy as np
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

np.random.seed(42)

digits = load_digits()
data = scale(digits.data)

n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target

sample_size = 300

print("n_digits: %d, \t n_samples %d, \t n_features %d"
      % (n_digits, n_samples, n_features))


print(82 * '_')
print('init\t\ttime\tinertia\thomo\tcompl\tv-meas\tARI\tAMI\tsilhouette')


def bench_k_means(estimator, name, data):
    t0 = time()
    estimator.fit(data)
    print('%-9s\t%.2fs\t%i\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f'
          % (name, (time() - t0), estimator.inertia_,
             metrics.homogeneity_score(labels, estimator.labels_),
             metrics.completeness_score(labels, estimator.labels_),
             metrics.v_measure_score(labels, estimator.labels_),
             metrics.adjusted_rand_score(labels, estimator.labels_),
             metrics.adjusted_mutual_info_score(labels,  estimator.labels_,
                                                average_method='arithmetic'),
             metrics.silhouette_score(data, estimator.labels_,
                                      metric='euclidean',
                                      sample_size=sample_size)))

bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10),
              name="k-means++", data=data)

bench_k_means(KMeans(init='random', n_clusters=n_digits, n_init=10),
              name="random", data=data)

# in this case the seeding of the centers is deterministic, hence we run the
# kmeans algorithm only once with n_init=1
pca = PCA(n_components=n_digits).fit(data)
bench_k_means(KMeans(init=pca.components_, n_clusters=n_digits, n_init=1),
              name="PCA-based",
              data=data)
print(82 * '_')

# #############################################################################
# Visualize the results on PCA-reduced data

reduced_data = PCA(n_components=2).fit_transform(data)
kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
kmeans.fit(reduced_data)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02     # point in the mesh [x_min, x_max]x[y_min, y_max].

# Plot the decision boundary. For that, we will assign a color to each
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.title('K-means clustering on the digits dataset (PCA-reduced data)\n'
          'Centroids are marked with white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

out:
n_digits: 10,    n_samples 1797,     n_features 64
__________________________________________________________________________________
init        time    inertia homo    compl   v-meas  ARI AMI silhouette
k-means++   0.55s   69432   0.602   0.650   0.625   0.465   0.621   0.146
random      0.40s   69694   0.669   0.710   0.689   0.553   0.686   0.147
PCA-based   0.09s   70804   0.671   0.698   0.684   0.561   0.681   0.118
__________________________________________________________________________________

2、 Gaussian mixture model

我们在iris数据集上使用各种GMM协方差类型绘制训练的预测标签,并给出测试数据,将diagonal、spherical、tied、full按性能递增顺序进行比较。

尽管人们期望全协方差在一般情况下表现最好,但它很容易对小数据集进行过度拟合,并且不能很好地推广到保留的测试数据(diagonal指每个分量分布有各自不同对角协方差矩阵,spherical指每个分量分布有各自不同的简单协方差矩阵, tied指所有分量分布有相同的标准协方差矩阵,full指每个分量分布有各自不同的标准协方差矩阵)。

https://scikit-learn.org/stable/modules/mixture.html#mixture

import matplotlib.pyplot as plt

import numpy as np

from sklearn import datasets
from sklearn.mixture import [GaussianMixture](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture "View documentation for sklearn.mixture.GaussianMixture")
from sklearn.model_selection import [StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold "View documentation for sklearn.model_selection.StratifiedKFold")

print(__doc__)

colors = ['navy', 'turquoise', 'darkorange']

def make_ellipses(gmm, ax):
    for n, color in enumerate(colors):
        if gmm.covariance_type == 'full':
            covariances = gmm.covariances_[n][:2, :2]
        elif gmm.covariance_type == 'tied':
            covariances = gmm.covariances_[:2, :2]
        elif gmm.covariance_type == 'diag':
            covariances = [np.diag](https://docs.scipy.org/doc/numpy/reference/generated/numpy.diag.html#numpy.diag "View documentation for numpy.diag")(gmm.covariances_[n][:2])
        elif gmm.covariance_type == 'spherical':
            covariances = [np.eye](https://docs.scipy.org/doc/numpy/reference/generated/numpy.eye.html#numpy.eye "View documentation for numpy.eye")(gmm.means_.shape[1]) * gmm.covariances_[n]
        v, w = [np.linalg.eigh](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.eigh.html#numpy.linalg.eigh "View documentation for numpy.linalg.eigh")(covariances)
        u = w[0] / [np.linalg.norm](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html#numpy.linalg.norm "View documentation for numpy.linalg.norm")(w[0])
        angle = [np.arctan2](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arctan2.html#numpy.arctan2 "View documentation for numpy.arctan2")(u[1], u[0])
        angle = 180 * angle / [np.pi](https://docs.scipy.org/doc/numpy/reference/constants.html#numpy.pi "View documentation for numpy.pi")  # convert to degrees
        v = 2. * [np.sqrt](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sqrt.html#numpy.sqrt "View documentation for numpy.sqrt")(2.) * [np.sqrt](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sqrt.html#numpy.sqrt "View documentation for numpy.sqrt")(v)
        ell = [mpl.patches.Ellipse](https://matplotlib.org/api/_as_gen/matplotlib.patches.Ellipse.html#matplotlib.patches.Ellipse "View documentation for matplotlib.patches.Ellipse")(gmm.means_[n, :2], v[0], v[1],
                                  180 + angle, color=color)
        ell.set_clip_box(ax.bbox)
        ell.set_alpha(0.5)
        ax.add_artist(ell)
        ax.set_aspect('equal', 'datalim')

iris = [datasets.load_iris](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris "View documentation for sklearn.datasets.load_iris")()

# Break up the dataset into non-overlapping training (75%) and testing
# (25%) sets.
skf = [StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold "View documentation for sklearn.model_selection.StratifiedKFold")(n_splits=4)
# Only take the first fold.
train_index, test_index = next(iter(skf.split(iris.data, iris.target)))

X_train = iris.data[train_index]
y_train = iris.target[train_index]
X_test = iris.data[test_index]
y_test = iris.target[test_index]

n_classes = len([np.unique](https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html#numpy.unique "View documentation for numpy.unique")(y_train))

# Try GMMs using different types of covariances.
estimators = {cov_type: [GaussianMixture](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture "View documentation for sklearn.mixture.GaussianMixture")(n_components=n_classes,
              covariance_type=cov_type, max_iter=20, random_state=0)
              for cov_type in ['spherical', 'diag', 'tied', 'full']}

n_estimators = len(estimators)

[plt.figure](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.figure.html#matplotlib.pyplot.figure "View documentation for matplotlib.pyplot.figure")(figsize=(3 * n_estimators // 2, 6))
[plt.subplots_adjust](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots_adjust.html#matplotlib.pyplot.subplots_adjust "View documentation for matplotlib.pyplot.subplots_adjust")(bottom=.01, top=0.95, hspace=.15, wspace=.05,
                    left=.01, right=.99)

for index, (name, estimator) in enumerate(estimators.items()):
    # Since we have class labels for the training data, we can
    # initialize the GMM parameters in a supervised manner.
    estimator.means_init = [np.array](https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html#numpy.array "View documentation for numpy.array")([X_train[y_train == i].mean(axis=0)
                                    for i in range(n_classes)])

    # Train the other parameters using the EM algorithm.
    estimator.fit(X_train)

    h = [plt.subplot](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplot.html#matplotlib.pyplot.subplot "View documentation for matplotlib.pyplot.subplot")(2, n_estimators // 2, index + 1)
    make_ellipses(estimator, h)

    for n, color in enumerate(colors):
        data = iris.data[iris.target == n]
        [plt.scatter](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html#matplotlib.pyplot.scatter "View documentation for matplotlib.pyplot.scatter")(data[:, 0], data[:, 1], s=0.8, color=color,
                    label=iris.target_names[n])
    # Plot the test data with crosses
    for n, color in enumerate(colors):
        data = X_test[y_test == n]
        [plt.scatter](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html#matplotlib.pyplot.scatter "View documentation for matplotlib.pyplot.scatter")(data[:, 0], data[:, 1], marker='x', color=color)

    y_train_pred = estimator.predict(X_train)
    train_accuracy = [np.mean](https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html#numpy.mean "View documentation for numpy.mean")(y_train_pred.ravel() == y_train.ravel()) * 100
    [plt.text](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.text.html#matplotlib.pyplot.text "View documentation for matplotlib.pyplot.text")(0.05, 0.9, 'Train accuracy: %.1f' % train_accuracy,
             transform=h.transAxes)

    y_test_pred = estimator.predict(X_test)
    test_accuracy = [np.mean](https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html#numpy.mean "View documentation for numpy.mean")(y_test_pred.ravel() == y_test.ravel()) * 100
    [plt.text](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.text.html#matplotlib.pyplot.text "View documentation for matplotlib.pyplot.text")(0.05, 0.8, 'Test accuracy: %.1f' % test_accuracy,
             transform=h.transAxes)

    [plt.xticks](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.xticks.html#matplotlib.pyplot.xticks "View documentation for matplotlib.pyplot.xticks")(())
    [plt.yticks](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.yticks.html#matplotlib.pyplot.yticks "View documentation for matplotlib.pyplot.yticks")(())
    [plt.title](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.title.html#matplotlib.pyplot.title "View documentation for matplotlib.pyplot.title")(name)

[plt.legend](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html#matplotlib.pyplot.legend "View documentation for matplotlib.pyplot.legend")(scatterpoints=1, loc='lower right', prop=dict(size=12))

[plt.show](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.show.html#matplotlib.pyplot.show "View documentation for matplotlib.pyplot.show")()

3、DBSCAN

DBSCAN算法将簇看作由低密度区域分隔的高密度区域。与KMeans算法不同,DBSCAN不需要确定聚类的数量,而是基于数据推测聚类的数目,它能够针对任意形状产生聚类。

DBSCAN算法需要首先确定两个参数:

  • epsilon:在一个点周围邻近区域的半径
  • minPts:邻近区域内至少包含点的个数

根据以上两个参数,结合epsilon-neighborhood的特征,可以把样本中的点分成三类:

  • 核点(core point):满足NBHD(p,epsilon)>=minPts,则为核样本点
  • 边缘点(border point):NBHD(p,epsilon)
  • 离群点(Outlier):既不是核点也不是边缘点,则是不属于这一类的点

下面我们看一下DBSCAN算法在不规则形状的数据集上的表现,与k-means形成对比:

https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py

import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler


# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.3,
                            random_state=12)

X = StandardScaler().fit_transform(X)

transformation = [[0.80834549, -0.43667341], [-0.40887718, 0.85253229]]
X = np.dot(X, transformation)
# #############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels,
                                           average_method='arithmetic'))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))

# #############################################################################
# Plot result
import matplotlib.pyplot as plt

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

你可能感兴趣的:(聚类实现)