Clustering Algorithms in scikit-learn: K-Means

K-means

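Algorithm steps:
1. Given the number of clusters k, choose k points from the dataset X as the initial centroids.
Repeat steps 2 and 3 until the distance between the centroids before and after an update falls below a set threshold:
2. Assign each point in X to its nearest centroid.
3. Recompute each centroid as the mean of all the points assigned to it.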
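The steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation; the function name `kmeans`, its arguments, and the toy data are assumptions made for the example.

```python
import numpy as np


def kmeans(X, k, tol=1e-4, max_iter=300, seed=0):
    """Plain k-means on an (n_samples, n_features) array (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points;
        # a centroid that lost all of its points is left where it was.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids have moved less than the threshold.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels


# Toy data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centroids, labels = kmeans(X, k=2)
print(np.round(centroids, 2))
```

On data this cleanly separated, the two centroids should settle near (0, 0) and (5, 5); on harder data the result depends on the initial choice, which is why the initialization issue discussed next matters.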
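The algorithm above raises three issues:
1. Input: text, images and other raw inputs must first be turned into vectors, which is a feature extraction problem.
2. Choice of initial centroids: the k-means++ initialization [1] can be used; it tends to pick initial centroids that are far apart from one another.
3. Distance: Euclidean distance is used.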
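To make the k-means++ idea concrete, here is a rough sketch of its seeding step: each new centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen. The function name `kmeans_plus_plus` and the toy data are assumptions for illustration; scikit-learn's own implementation differs in details.

```python
import numpy as np


def kmeans_plus_plus(X, k, seed=0):
    """Pick k initial centroids by squared-distance sampling (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_samples = len(X)
    # First centroid: one data point chosen uniformly at random.
    centroids = [X[rng.integers(n_samples)]]
    # Squared distance from every point to its nearest chosen centroid.
    closest_sq = np.sum((X - centroids[0]) ** 2, axis=1)
    for _ in range(1, k):
        # Far-away points are more likely to be picked; points that are
        # already centroids have distance 0 and cannot be picked again.
        probs = closest_sq / closest_sq.sum()
        idx = rng.choice(n_samples, p=probs)
        centroids.append(X[idx])
        new_sq = np.sum((X - X[idx]) ** 2, axis=1)
        closest_sq = np.minimum(closest_sq, new_sq)
    return np.array(centroids)


# Toy data: three blobs of 100 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.5, size=(100, 2)) for loc in (0.0, 5.0, 10.0)])
init = kmeans_plus_plus(X, k=3)
print(init)
```

The sampling is random, so the seeds are not guaranteed to land in different blobs, only made much more likely to.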

Mini Batch K-Means

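To cut computation time, the Mini Batch K-Means method was proposed [2]:
1. Given the number of clusters k, choose k points from the dataset X as the initial centroids.
Repeat steps 2 to 4 until the distance between the centroids before and after an update falls below a set threshold:
2. Randomly draw b points from X.
3. Assign these b points to their nearest centroids.
4. Recompute each centroid from all the points assigned to it so far (these b points plus every point assigned to it in earlier batches).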
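A minimal NumPy sketch of these steps follows. It keeps a per-centroid count so that each centroid remains the running mean of every point assigned to it so far. For simplicity it runs a fixed number of batches instead of checking a threshold; that choice, the function name `mini_batch_kmeans`, and the toy data are assumptions of this example.

```python
import numpy as np


def mini_batch_kmeans(X, k, batch_size=100, n_iter=200, seed=0):
    """Mini-batch k-means with running-mean centroid updates (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)  # number of points assigned to each centroid so far
    for _ in range(n_iter):
        # Step 2: draw a random mini-batch of b points.
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Step 3: assign each batch point to its nearest centroid.
        dists = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        # Step 4: move each centroid towards its new points. With a step size
        # of 1 / count, the centroid stays the running mean of all points
        # assigned to it in this and earlier batches.
        for x, j in zip(batch, nearest):
            counts[j] += 1
            centroids[j] += (x - centroids[j]) / counts[j]
    return centroids


# Toy data: two blobs of 500 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(500, 2)),
               rng.normal(5.0, 0.5, size=(500, 2))])
centroids = mini_batch_kmeans(X, k=2)
print(np.round(centroids, 2))
```

Each iteration touches only `batch_size` points rather than the whole dataset, which is where the time saving comes from.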

Mini Batch K-Means is more efficient than K-Means, but its clustering quality is generally somewhat worse.
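One way to check this trade-off on your own data is to fit both estimators and compare wall-clock time and inertia. The snippet below is a small sketch on synthetic blobs; the dataset, sizes and seeds are arbitrary choices for illustration, and timings will vary by machine and scikit-learn version.

```python
from time import perf_counter

from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data: three well-separated Gaussian blobs (illustrative only).
X, _ = make_blobs(n_samples=30000, centers=[[-5, -5], [0, 0], [5, 5]],
                  cluster_std=0.7, random_state=0)

results = {}
for name, model in [
    ("KMeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
    ("MiniBatchKMeans", MiniBatchKMeans(n_clusters=3, n_init=3,
                                        batch_size=1000, random_state=0)),
]:
    t0 = perf_counter()
    model.fit(X)
    results[name] = model
    print("%-16s %.3fs  inertia=%.1f"
          % (name, perf_counter() - t0, model.inertia_))

# How closely do the two labelings agree (1.0 means identical partitions)?
agreement = adjusted_rand_score(results["KMeans"].labels_,
                                results["MiniBatchKMeans"].labels_)
print("Agreement between the two labelings (ARI): %.3f" % agreement)
```

On easy data like this the two partitions are nearly identical; the quality gap tends to show up on larger, noisier, higher-dimensional data.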

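Pros and cons

Pros: fast convergence.
Cons: because it is built on centroids, K-Means tends to produce clusters that are convex in feature space, so it performs poorly on elongated or irregularly shaped clusters.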
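The convexity limitation is easy to reproduce. The sketch below runs K-Means on the two interleaved half-moons from `make_moons`, a standard non-convex toy dataset chosen here for illustration, and scores the result against the true grouping with the adjusted Rand index.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescent-shaped clusters: neither is convex.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# 1.0 would mean the moons were recovered exactly; K-Means instead cuts
# the plane with a straight boundary, so the score stays well below that.
ari = adjusted_rand_score(y_true, labels)
print("Adjusted Rand index on two moons: %.3f" % ari)
```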

Parameters in sklearn

[class sklearn.cluster.KMeans]
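n_clusters=8: the number of clusters, i.e. the number of centroids.
init='k-means++': {'k-means++', 'random' or an ndarray}
    'k-means++': choose the initial points with the k-means++ algorithm, which speeds up convergence.
    'random': choose the initial points at random.
    ndarray: user-supplied initial points; the shape must be (n_clusters, n_features).
n_init=10: the number of times the algorithm is run with different initial centroids; the run with the lowest inertia is kept.
max_iter=300: the maximum number of iterations in a single run.
tol=0.0001: relative convergence tolerance; iteration stops once the centroids change by less than this tolerance between two consecutive iterations.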
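precompute_distances='auto': {'auto', True, False}; trades memory for speed. 'auto': do not precompute the distance matrix when n_samples * n_clusters > 12 million (roughly 100 MB in double precision); True: always precompute it; False: never precompute it. Note: this parameter was deprecated in scikit-learn 0.23 and removed in 1.0.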
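verbose=0: int; controls whether detailed progress information is printed.
random_state=None: int, RandomState instance or None; the seed of the random number generator, which affects the centroid initialization.
copy_x=True: a parameter found in many scikit-learn interfaces; it controls whether the input data is copied so that the user's original data is not modified.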
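n_jobs=1: int; the number of parallel jobs.
    -1: use all CPUs.
    1: no parallelism.
    -2: when n_jobs < 0, (n_cpus + 1 + n_jobs) CPUs are used, so n_jobs=-2 uses all CPUs but one.
    Note: this parameter was deprecated in scikit-learn 0.23 and removed in 1.0.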
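algorithm='auto': 'auto', 'full' or 'elkan'
    'full': the classical EM-style algorithm.
    'elkan': a variant that uses the triangle inequality; more efficient, but in the release described here it does not support sparse data.
    'auto': use 'elkan' for dense data and 'full' for sparse data.
    Note: from scikit-learn 1.1 onwards, 'full' is called 'lloyd', and 'auto' and 'full' are deprecated.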
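As a rough usage sketch, the snippet below fits `KMeans` with only those parameters that exist in both older and current releases, then reads back the fitted attributes. The toy data and the variable names (`true_centers`, `km_fixed`) are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three Gaussian blobs of 100 points each in 2-D.
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [4.0, 4.0], [0.0, 4.0]])
X = np.vstack([c + rng.normal(scale=0.4, size=(100, 2)) for c in true_centers])

km = KMeans(
    n_clusters=3,      # number of centroids
    init="k-means++",  # seeding strategy
    n_init=10,         # restarts; the run with the lowest inertia is kept
    max_iter=300,      # iteration cap per run
    tol=1e-4,          # convergence tolerance
    random_state=0,    # fixes the seeding for reproducibility
)
km.fit(X)

print(km.cluster_centers_)       # array of shape (n_clusters, n_features)
print(km.labels_[:10])           # cluster index of each training point
print(km.inertia_)               # sum of squared distances to nearest centroid
print(km.n_iter_)                # iterations used by the best run
print(km.predict([[3.8, 4.1]]))  # assign a new point to a cluster

# init can also be an ndarray of shape (n_clusters, n_features);
# with fixed starting points a single run (n_init=1) is enough.
km_fixed = KMeans(n_clusters=3, init=true_centers, n_init=1).fit(X)
print(km_fixed.cluster_centers_)
```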

[class sklearn.cluster.MiniBatchKMeans]
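batch_size=100: the size of each mini-batch.
compute_labels=True: whether to compute the label assignments and the inertia for the complete dataset once the mini-batch optimization in fit has converged.
max_no_improvement=10: similar to patience in early stopping; updating stops once max_no_improvement consecutive mini-batches bring no improvement in the smoothed inertia (the objective: the sum of squared distances from the points to their centroids).
init_size=None: the number of samples randomly drawn to speed up the initialization (sometimes at the expense of accuracy); it must be larger than n_clusters.
reassignment_ratio=0.01: float; controls the fraction of the maximum number of counts for a centroid to be reassigned. A higher value means that centroids with few assigned points are reassigned more easily, so the model takes longer to converge but should converge to a better clustering.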
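The following sketch shows these parameters in use, plus the incremental `partial_fit` interface, which suits data that arrives in chunks. The data, chunk size and parameter values are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Toy data: two Gaussian blobs of 2000 points each, shuffled so that
# consecutive chunks contain points from both blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(2000, 2)) for c in centers])
X = rng.permutation(X)

mbk = MiniBatchKMeans(
    n_clusters=2,
    batch_size=256,            # points drawn per mini-batch
    init_size=768,             # points sampled for initialization (> n_clusters)
    n_init=3,                  # candidate initializations to try
    max_no_improvement=10,     # early-stopping patience, in mini-batches
    reassignment_ratio=0.01,   # how readily low-count centroids are reassigned
    compute_labels=False,      # skip the final full-data labeling pass in fit
    random_state=0,
)
mbk.fit(X)

# With compute_labels=False, labels are obtained on demand with predict.
labels = mbk.predict(X)
print(np.bincount(labels))

# Incremental alternative: feed the data chunk by chunk.
stream = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0)
for start in range(0, len(X), 500):
    stream.partial_fit(X[start:start + 500])
print(stream.cluster_centers_)
```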

Example code

Clustering text documents using K-means

# Author: Peter Prettenhofer 
#         Lars Buitinck
# License: BSD 3 clause

from __future__ import print_function

from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn import metrics

from sklearn.cluster import KMeans, MiniBatchKMeans

import logging
from optparse import OptionParser
import sys
from time import time

import numpy as np


# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

# parse commandline arguments
op = OptionParser()
op.add_option("--lsa",
              dest="n_components", type="int",
              help="Preprocess documents with latent semantic analysis.")
op.add_option("--no-minibatch",
              action="store_false", dest="minibatch", default=True,
              help="Use ordinary k-means algorithm (in batch mode).")
op.add_option("--no-idf",
              action="store_false", dest="use_idf", default=True,
              help="Disable Inverse Document Frequency feature weighting.")
op.add_option("--use-hashing",
              action="store_true", default=False,
              help="Use a hashing feature vectorizer")
op.add_option("--n-features", type=int, default=10000,
              help="Maximum number of features (dimensions)"
                   " to extract from text.")
op.add_option("--verbose",
              action="store_true", dest="verbose", default=False,
              help="Print progress reports inside k-means algorithm.")

# This listing defines no module docstring, so only the option help is printed.
op.print_help()


def is_interactive():
    return not hasattr(sys.modules['__main__'], '__file__')

# work-around for Jupyter notebook and IPython console
argv = [] if is_interactive() else sys.argv[1:]
(opts, args) = op.parse_args(argv)
if len(args) > 0:
    # op.error() prints the message and exits by itself,
    # so no explicit sys.exit() call is needed afterwards.
    op.error("this script takes no arguments.")


# #############################################################################
# Load some categories from the training set
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
# Uncomment the following to do the analysis on all the categories
# categories = None

print("Loading 20 newsgroups dataset for categories:")
print(categories)

dataset = fetch_20newsgroups(subset='all', categories=categories,
                             shuffle=True, random_state=42)

print("%d documents" % len(dataset.data))
print("%d categories" % len(dataset.target_names))
print()

labels = dataset.target
true_k = np.unique(labels).shape[0]

print("Extracting features from the training dataset using a sparse vectorizer")
t0 = time()
if opts.use_hashing:
    if opts.use_idf:
        # Perform an IDF normalization on the output of HashingVectorizer
        hasher = HashingVectorizer(n_features=opts.n_features,
                                   stop_words='english', alternate_sign=False,
                                   norm=None, binary=False)
        vectorizer = make_pipeline(hasher, TfidfTransformer())
    else:
        vectorizer = HashingVectorizer(n_features=opts.n_features,
                                       stop_words='english',
                                       alternate_sign=False, norm='l2',
                                       binary=False)
else:
    vectorizer = TfidfVectorizer(max_df=0.5, max_features=opts.n_features,
                                 min_df=2, stop_words='english',
                                 use_idf=opts.use_idf)
X = vectorizer.fit_transform(dataset.data)

print("done in %fs" % (time() - t0))
print("n_samples: %d, n_features: %d" % X.shape)
print()

if opts.n_components:
    print("Performing dimensionality reduction using LSA")
    t0 = time()
    # Vectorizer results are normalized, which makes KMeans behave as
    # spherical k-means for better results. Since LSA/SVD results are
    # not normalized, we have to redo the normalization.
    svd = TruncatedSVD(opts.n_components)
    normalizer = Normalizer(copy=False)
    lsa = make_pipeline(svd, normalizer)

    X = lsa.fit_transform(X)

    print("done in %fs" % (time() - t0))

    explained_variance = svd.explained_variance_ratio_.sum()
    print("Explained variance of the SVD step: {}%".format(
        int(explained_variance * 100)))

    print()


# #############################################################################
# Do the actual clustering

if opts.minibatch:
    km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
                         init_size=1000, batch_size=1000, verbose=opts.verbose)
else:
    km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1,
                verbose=opts.verbose)

print("Clustering sparse data with %s" % km)
t0 = time()
km.fit(X)
print("done in %0.3fs" % (time() - t0))
print()

print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(labels, km.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, km.labels_, sample_size=1000))

print()


if not opts.use_hashing:
    print("Top terms per cluster:")

    if opts.n_components:
        original_space_centroids = svd.inverse_transform(km.cluster_centers_)
        order_centroids = original_space_centroids.argsort()[:, ::-1]
    else:
        order_centroids = km.cluster_centers_.argsort()[:, ::-1]

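    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is available from 1.0 onwards.
    terms = vectorizer.get_feature_names_out()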
    for i in range(true_k):
        print("Cluster %d:" % i, end='')
        for ind in order_centroids[i, :10]:
            print(' %s' % terms[ind], end='')
        print()
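
The script above downloads the 20 newsgroups data before it can run. The same TF-IDF plus K-Means pipeline and the same evaluation metrics can be tried offline on a toy corpus; the six documents and their labels below are made up purely for illustration.

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: three "space" and three "graphics" documents.
docs = [
    "nasa rocket launch orbit",
    "nasa rocket orbit moon",
    "rocket launch orbit moon nasa",
    "graphics card image render",
    "graphics image render software",
    "graphics card software image",
]
true_labels = [0, 0, 0, 1, 1, 1]

# TF-IDF rows are L2-normalized by default, as in the full script.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
km.fit(X)

# Metrics that compare against known labels (cluster numbering is ignored).
print("Homogeneity: %0.3f" % metrics.homogeneity_score(true_labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(true_labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(true_labels, km.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(true_labels, km.labels_))
# Silhouette needs no ground truth: it uses only X and the predicted labels.
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, km.labels_))
```

The first four metrics require ground-truth labels, which real clustering tasks usually lack; the silhouette coefficient is the only one of the five that can be computed without them.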

References

[1] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding", Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics (2007)
[2] D. Sculley, "Web-scale k-means clustering", Proceedings of the 19th International Conference on World Wide Web (2010)
