Implementing the K-means Algorithm with Spark MLlib: A Worked Example (Python)

Clustering - spark.mllib

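Clustering is an unsupervised learning problem in which we aim to group subsets of entities with one another based on some notion of similarity. Clustering is often used for exploratory analysis and/or as a component of a hierarchical supervised learning pipeline (in which a distinct classifier or regression model is trained for each cluster).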

The spark.mllib package supports the following models:
K-means
Gaussian mixture
Power iteration clustering (PIC)
Latent Dirichlet allocation (LDA)
Bisecting k-means
Streaming k-means

K-means

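K-means is one of the most commonly used clustering algorithms; it clusters the data points into a predefined number of clusters. The spark.mllib implementation includes a parallelized variant of the k-means++ method called kmeans||.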
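To make the seeding idea concrete, here is a minimal sketch of plain, sequential k-means++ initialization in NumPy. It is not the parallel kmeans|| variant that spark.mllib uses, and the function name kmeans_pp_init, the seed, and the toy data are assumptions made purely for illustration. It also assumes the data contains at least k distinct points.

```python
import numpy as np


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers with sequential k-means++ seeding (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    # First center: chosen uniformly at random.
    centers = [points[rng.integers(n)]]
    # Squared distance from every point to its nearest chosen center so far.
    d2 = np.sum((points - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        # Next center: sampled with probability proportional to d2,
        # so points far from the existing centers are favoured.
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(points[idx])
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return np.array(centers)


# Hypothetical toy data: two tight groups far apart.
points = np.array([[0.0, 0.0], [0.1, 0.0], [9.0, 9.0], [9.1, 9.0]])
centers = kmeans_pp_init(points, k=2)
```

Because an already chosen point has distance zero to itself, it cannot be drawn a second time, so the k centers are always distinct rows of the input.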
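The implementation in spark.mllib has the following parameters:
k is the number of desired clusters.
maxIterations is the maximum number of iterations to run.
initializationMode specifies either random initialization or initialization via k-means||.
runs is the number of times to run the k-means algorithm (k-means is not guaranteed to find a globally optimal solution; when run multiple times on a given dataset, the algorithm returns the best clustering result). Note that this parameter has had no effect since Spark 2.0.0.
initializationSteps determines the number of steps in the k-means|| algorithm.
epsilon determines the distance threshold within which we consider k-means to have converged.
initialModel is an optional set of cluster centers used for initialization. If this parameter is supplied, only one run is performed.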
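As a rough illustration of how these parameters are passed, the sketch below calls KMeans.train from pyspark.mllib.clustering with keyword arguments. The helper names parse_line and train_kmeans and the specific values (20 iterations, 2 initialization steps, epsilon of 1e-4) are assumptions for the example, not recommendations, and runs is left out since it has no effect in recent Spark versions.

```python
def parse_line(line):
    """Turn one comma-separated text line into a list of floats."""
    return [float(x) for x in line.split(",")]


def train_kmeans(sc, path, k):
    """Train an MLlib k-means model on the text file at `path` (hypothetical helper)."""
    # Imported here so parse_line stays usable without a Spark installation.
    from pyspark.mllib.clustering import KMeans

    data = sc.textFile(path).map(parse_line).cache()
    return KMeans.train(
        data,
        k,
        maxIterations=20,                # upper bound on iterations
        initializationMode="k-means||",  # or "random"
        initializationSteps=2,           # steps of the k-means|| seeding
        epsilon=1e-4,                    # convergence threshold on center movement
    )
```

Given a SparkContext sc, a call such as model = train_kmeans(sc, inputFile, 3) returns a KMeansModel; model.clusterCenters then holds the centers, and model.predict(point) returns the cluster index for a single point.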

from __future__ import print_function

import sys

import numpy as np
from pyspark.sql import SparkSession


def parseVector(line):
    return np.array([float(x) for x in line.split(',')])


def closestPoint(p, centers):
    bestIndex = 0
    closest = float("+inf")
    for i in range(len(centers)):
        tempDist = np.sum((p - centers[i]) ** 2)
        if tempDist < closest:
            closest = tempDist
            bestIndex = i
    return bestIndex


if __name__ == "__main__":

#    if len(sys.argv) != 4:
#        print("Usage: kmeans <file> <k> <convergeDist>", file=sys.stderr)
#        exit(-1)

    print("""WARN: This is a naive implementation of KMeans Clustering and is given
       as an example! Please refer to examples/src/main/python/ml/kmeans_example.py for an
       example on how to use ML's KMeans implementation.""", file=sys.stderr)

    spark = SparkSession\
        .builder\
        .appName("PythonKMeans")\
        .getOrCreate()
    inputFile = "hdfs://node1:8020/mv_training/training_set.txt"

    # Read each line as a string, then parse it into a NumPy vector of floats
    # so that the distance arithmetic in closestPoint works.
    lines = spark.read.text(inputFile).rdd.map(lambda r: r[0])
    data = lines.map(parseVector).cache()
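    # argv[1] is ignored because inputFile is hard-coded above;
    # argv[2] is the number of clusters and argv[3] the convergence threshold.
    K = int(sys.argv[2])
    convergeDist = float(sys.argv[3])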

    kPoints = data.takeSample(False, K, 1)
    tempDist = 1.0

    while tempDist > convergeDist:
        closest = data.map(
            lambda p: (closestPoint(p, kPoints), (p, 1)))
        pointStats = closest.reduceByKey(
            lambda p1_c1, p2_c2: (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
        newPoints = pointStats.map(
            lambda st: (st[0], st[1][0] / st[1][1])).collect()

        tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)

        for (iK, p) in newPoints:
            kPoints[iK] = p

    print("Final centers: " + str(kPoints))
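    # Pair each row with the index of its nearest center. Compare as floats:
    # casting to int would truncate fractional coordinates.
    closest = data.map(lambda p: (p, closestPoint(np.array(p, dtype=float), np.array(kPoints, dtype=float))))
    closest.repartition(1).saveAsTextFile("file:///root/kmeans_result")

    spark.stop()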

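The final cluster centers end up in kPoints. closest pairs each input row with the index of the cluster it belongs to, and that result is saved to a file.
Official documentation: Clustering - spark.mllib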
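To see what a single pass of the loop computes, the following sketch performs one assign-and-update step on a small in-memory list instead of an RDD. The names closest_point and one_iteration and the toy points are hypothetical; the dictionary of (sum, count) pairs plays the role that reduceByKey plays in the Spark version.

```python
import numpy as np


def closest_point(p, centers):
    """Index of the center nearest to p (squared Euclidean distance)."""
    return int(np.argmin([np.sum((p - c) ** 2) for c in centers]))


def one_iteration(points, centers):
    """One assign-and-update step, done locally instead of on an RDD."""
    sums = {}  # cluster index -> (sum of points, count), like reduceByKey
    for p in points:
        i = closest_point(p, centers)
        s, n = sums.get(i, (np.zeros_like(p), 0))
        sums[i] = (s + p, n + 1)
    # Clusters that received no points keep their old center.
    new_centers = [c.copy() for c in centers]
    for i, (s, n) in sums.items():
        new_centers[i] = s / n
    # Total squared movement of the centers: the value compared to convergeDist.
    shift = sum(np.sum((centers[i] - new_centers[i]) ** 2) for i in sums)
    return new_centers, shift


# Hypothetical toy data: two pairs of points around two starting centers.
points = [np.array(v) for v in ([0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0])]
centers = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]
centers, shift = one_iteration(points, centers)
```

With this data the centers move to (0, 1) and (10, 11) and shift is 2.0; a second call would leave them unchanged, giving a shift of 0, which is when the while loop in the listing stops.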
