A Source-Code Walkthrough of Spark's Distributed K-Means Algorithm

Spark's K-Means implementation uses the K-Means|| scheme.

1. What is K-Means||?

K-Means|| was devised to address the slow initialization of K-Means++ on large datasets. The main idea is to change the sampling rule of each pass over the data: instead of picking a single sample per pass as K-Means++ does, each pass samples on the order of k points, and the sampling is repeated O(log n) times (n being the number of samples). The sampled candidates are then clustered down to k points, which serve as the initial cluster centers for the standard K-Means iterations. In practice you do not even need O(log n) passes; about 5 rounds of sampling are usually enough to obtain good initial centers. A minimal local sketch of the idea follows.

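To make the idea concrete, here is a purely local sketch of the oversampling loop (plain Scala, not the Spark code; the toy data, the 2 * k sampling factor, and the 5 passes are illustrative assumptions):

    import scala.util.Random

    object KMeansParallelIdea {
      type Point = Array[Double]

      // squared Euclidean distance between two points
      def sqDist(a: Point, b: Point): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

      // a point's "cost": squared distance to its nearest current center
      def cost(p: Point, centers: Seq[Point]): Double =
        centers.map(c => sqDist(p, c)).min

      def main(args: Array[String]): Unit = {
        val rng    = new Random(42)
        val data   = Seq.fill(1000)(Array(rng.nextDouble() * 10, rng.nextDouble() * 10))
        val k      = 3
        val passes = 5                    // a handful of passes is usually enough

        // start from a single random point
        var centers = Seq(data(rng.nextInt(data.size)))
        for (_ <- 0 until passes) {
          val sumCosts = data.map(cost(_, centers)).sum
          // keep each point with probability proportional to its cost,
          // expecting roughly 2 * k new candidates per pass
          val chosen = data.filter(p => rng.nextDouble() < 2.0 * k * cost(p, centers) / sumCosts)
          centers = centers ++ chosen
        }
        // the oversampled candidates would then be reduced to k centers,
        // e.g. by a weighted local k-means++ as Spark does
        println(s"candidate centers after oversampling: ${centers.size}")
      }
    }
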
2. A typical way to call K-Means in Spark:

    import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.sql.SQLContext

    val model = new KMeans()
      // Set the number of clusters, i.e. the value of k
      .setK(numClusters)
      .setMaxIterations(numIterations)
      .run(parsedData)

    // centers holds the coordinates of the k cluster centers
    val centers = model.clusterCenters
    val k = model.k
    centers.foreach(println)
    // Save the trained model
    model.save(sc, "dir")
    // Load the model back
    val sameModel = KMeansModel.load(sc, "dir")
    // Check that the loaded model gives a sensible prediction
    print(sameModel.predict(Vectors.dense(9, 9)))
    // Read the saved center coordinates back with Spark SQL and display them
    val sqlContext = new SQLContext(sc)
    sqlContext.read.parquet("path").show()

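The snippet above assumes that sc, numClusters, numIterations and parsedData already exist. A minimal sketch of how parsedData might be prepared (the file path and its whitespace-separated format are assumptions) looks like this:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors

    val conf = new SparkConf().setAppName("KMeansExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // one point per line, coordinates separated by spaces, e.g. "1.0 2.0"
    val parsedData = sc.textFile("data/kmeans_data.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    val numClusters = 2
    val numIterations = 20
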
Next, let's take a quick walk through the source code behind this calling sequence.

1. setK / setMaxIterations

These simply set k and the maximum number of iterations; run is then called.

2. run

Each vector x is given an initial weight of 1.0, and then runWithWeight is called.

  def run(data: RDD[Vector]): KMeansModel = {
    val instances: RDD[(Vector, Double)] = data.map {
      case (point) => (point, 1.0)
    }
    runWithWeight(instances, None)
  }

3. runWithWeight

Computes the L2 norm of each x, i.e. sqrt(x1² + x2² + ... + xd²), zips the norms with the original dataset into an RDD of (x, norm, weight) tuples, where x is the vector, norm is its L2 norm and weight defaults to 1.0, persists that RDD, and then proceeds to runAlgorithmWithWeight (a quick sanity check of the norm computation appears right after the code).

 private[spark] def runWithWeight(
     data: RDD[(Vector, Double)],
     instr: Option[Instrumentation]): KMeansModel = {

   // Compute the L2 norm of each vector
   val norms = data.map { case (v, _) =>
     Vectors.norm(v, 2.0)
   }
   // Zip the norms with the original dataset so each point carries its norm and weight
   val zippedData = data.zip(norms).map { case ((v, w), norm) =>
     new VectorWithNorm(v, norm, w)
   }
   // Persist the zipped RDD
   if (data.getStorageLevel == StorageLevel.NONE) {
     zippedData.persist(StorageLevel.MEMORY_AND_DISK)
   }
   val model = runAlgorithmWithWeight(zippedData, instr)
   zippedData.unpersist()

   model
 }

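As a quick sanity check of the norm computation above (Vectors.norm is public API in org.apache.spark.mllib.linalg):

    import org.apache.spark.mllib.linalg.Vectors

    val v = Vectors.dense(3.0, 4.0)
    println(Vectors.norm(v, 2.0))   // 5.0 = sqrt(3^2 + 4^2)
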
4. runAlgorithmWithWeight

Picks the initial centers according to the initialization mode; the default is the parallel initializer initKMeansParallel. After initialization it creates a Double accumulator, broadcasts the initialized center array centers (of type Array[VectorWithNorm]) to the executors, runs the Lloyd iterations, and finally returns the trained model. A short illustration of the accumulator and broadcast primitives follows, then the method itself.

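Here is a minimal illustration of the two Spark primitives the method relies on, a Double accumulator and a broadcast variable (this snippet assumes an active SparkContext named sc):

    val acc = sc.doubleAccumulator("cost")        // sum written by executors, read on the driver
    val bc  = sc.broadcast(Array(1.0, 2.0, 3.0))  // read-only value shipped to every executor

    sc.parallelize(1 to 10).foreach { i =>
      acc.add(i * bc.value.head)                  // executors add into the accumulator
    }
    println(acc.value)                            // 55.0, read back on the driver
    bc.destroy()
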
 private def runAlgorithmWithWeight(
     data: RDD[VectorWithNorm],
     instr: Option[Instrumentation]): KMeansModel = {

   val sc = data.sparkContext

   val initStartTime = System.nanoTime()

   val distanceMeasureInstance = DistanceMeasure.decodeFromString(this.distanceMeasure)
   // First check whether an initial model was supplied; if so, simply take its cluster centers.
   // Otherwise pick the initial centers according to the initialization mode; the default is the parallel initializer initKMeansParallel
   val centers = initialModel match {
     case Some(kMeansCenters) =>
       kMeansCenters.clusterCenters.map(new VectorWithNorm(_))
     case None =>
       if (initializationMode == KMeans.RANDOM) {
         initRandom(data)
       } else {
         initKMeansParallel(data, distanceMeasureInstance)
       }
   }
   val initTimeInSeconds = (System.nanoTime() - initStartTime) / 1e9
   logInfo(f"Initialization with $initializationMode took $initTimeInSeconds%.3f seconds.")

   var converged = false
   var cost = 0.0
   var iteration = 0

   val iterationStartTime = System.nanoTime()

   instr.foreach(_.logNumFeatures(centers.head.vector.size))

   // Run Lloyd's algorithm until convergence or until maxIterations is reached
   while (iteration < maxIterations && !converged) {
     // A Double accumulator for the total cost (see the Spark docs if accumulators are new to you)
     val costAccum = sc.doubleAccumulator
     // Broadcast the current center array to the executors
     val bcCenters = sc.broadcast(centers)

     // Compute the per-cluster statistics for the new centers; this work runs on the executors.
     // mapPartitions operates on a whole partition at a time: the effect is equivalent to map,
     // but a full partition is processed (and held in memory) at once, trading space for time compared with map.
     // collected: key-value pairs where the key is the cluster index and the value is
     // (sum(vector * weight) over the cluster's points, sum(weight))
     val collected: scala.collection.Map[Int, (Vector, Double)] = data.mapPartitions { points =>
       // Fetch the broadcast centers
       val thisCenters = bcCenters.value
       // Dimensionality of the data
       val dims = thisCenters.head.vector.size
       // An array with one entry per center; sums(j) accumulates the weighted sum (point * weight) of cluster j's points
       val sums = Array.fill(thisCenters.length)(Vectors.zeros(dims))

       // clusterWeightSum holds the total weight of each cluster
       // cluster center =
       //     sample1 * weight1/clusterWeightSum + sample2 * weight2/clusterWeightSum + ...
       val clusterWeightSum = Array.ofDim[Double](thisCenters.length)

       // For each point in this partition:
       points.foreach { point =>
         // 1. Find the nearest center to this point; returns the index of that center and the point's cost (distance) to it
         val (bestCenter, cost) = distanceMeasureInstance.findClosest(thisCenters, point)
         // 2. Add distance * point weight to the cost accumulator
         costAccum.add(cost * point.weight)
         // 3. Fold the point into that cluster's sum ( sums(bestCenter) += point.weight * point.vector )
         distanceMeasureInstance.updateClusterSum(point, sums(bestCenter))
         // 4. Update that cluster's weight sum
         clusterWeightSum(bestCenter) += point.weight
       }
       // Drop clusters whose weight sum is 0
       clusterWeightSum.indices.filter(clusterWeightSum(_) > 0)
         .map(j => (j, (sums(j), clusterWeightSum(j)))).iterator
     }.reduceByKey { (sumweight1, sumweight2) =>
       axpy(1.0, sumweight2._1, sumweight1._1)
       (sumweight1._1, sumweight1._2 + sumweight2._2)
     }.collectAsMap()

     if (iteration == 0) {
       instr.foreach(_.logNumExamples(costAccum.count))
       instr.foreach(_.logSumOfWeights(collected.values.map(_._2).sum))
     }
     // Compute the new centers
     val newCenters: scala.collection.Map[Int, VectorWithNorm] = collected.mapValues { case (sum, weightSum) =>
       distanceMeasureInstance.centroid(sum, weightSum)
     }
     // Destroy the broadcast of the old centers
     bcCenters.destroy()

     // Update the cluster centers and costs
     converged = true
     newCenters.foreach { case (j, newCenter) =>
       // Convergence check: if any center has not converged, set the flag to false so the outer
       // while loop keeps running. A center counts as converged when its new position lies within
       // distance epsilon of its previous position.
       if (converged &&
         !distanceMeasureInstance.isCenterConverged(centers(j), newCenter, epsilon)) {
         converged = false
       }
       centers(j) = newCenter
     }

     cost = costAccum.value
     iteration += 1
   }

   val iterationTimeInSeconds = (System.nanoTime() - iterationStartTime) / 1e9
   logInfo(f"Iterations took $iterationTimeInSeconds%.3f seconds.")

   if (iteration == maxIterations) {
     logInfo(s"KMeans reached the max number of iterations: $maxIterations.")
   } else {
     logInfo(s"KMeans converged in $iteration iterations.")
   }

   logInfo(s"The cost is $cost.")

   new KMeansModel(centers.map(_.vector), distanceMeasure, cost, iteration)
 }
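
For intuition, the same Lloyd iteration can be written as a local, single-machine sketch. It mirrors the structure of the loop above but drops accumulators, broadcasts and per-point weights; data, initCenters, maxIterations and epsilon are whatever the caller supplies:

    type Point = Array[Double]

    def sqDist(a: Point, b: Point): Double =
      a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

    def lloyd(data: Seq[Point], initCenters: Seq[Point],
              maxIterations: Int, epsilon: Double): Seq[Point] = {
      var centers   = initCenters
      var converged = false
      var iteration = 0
      while (iteration < maxIterations && !converged) {
        // 1. assign each point to its nearest center (the findClosest step)
        val byCluster = data.groupBy(p => centers.indices.minBy(j => sqDist(p, centers(j))))
        // 2. new center = mean of the points assigned to it (all weights equal here)
        val newCenters = centers.indices.map { j =>
          byCluster.get(j) match {
            case Some(pts) => pts.transpose.map(_.sum / pts.size).toArray
            case None      => centers(j)          // an empty cluster keeps its old center
          }
        }
        // 3. converged once every center has moved by less than epsilon
        converged = centers.zip(newCenters).forall { case (c, n) =>
          math.sqrt(sqDist(c, n)) < epsilon
        }
        centers = newCenters
        iteration += 1
      }
      centers
    }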

4.1 initKMeansParallel

A detailed look at the initialization method. This is where the particular K-Means variant shows itself: the center-update iterations are much the same across variants; what differs is how the initial centers are chosen.

 /**
   * Initialize a set of cluster centers using the k-means|| algorithm by Bahmani et al.
   * (Bahmani et al., Scalable K-Means++, VLDB 2012). This is a variant of k-means++ that tries
   * to find dissimilar cluster centers by starting with a random center and then doing
   * passes where more centers are chosen with probability proportional to their squared distance
   * to the current cluster set. It results in a provable approximation to an optimal clustering.
   *
   * The original paper can be found at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf.
   */
  private[clustering] def initKMeansParallel(data: RDD[VectorWithNorm],
      distanceMeasureInstance: DistanceMeasure): Array[VectorWithNorm] = {
    // Initialize empty centers and point costs.
    var costs = data.map(_ => Double.PositiveInfinity)

    // The first center is chosen uniformly at random
    val seed = new XORShiftRandom(this.seed).nextInt()
    val sample = data.takeSample(false, 1, seed)
    // Could be empty if data is empty; fail with a better message early:
    require(sample.nonEmpty, s"No samples available from $data")

    val centers = ArrayBuffer[VectorWithNorm]()
    var newCenters = Seq(sample.head.toDense)
    centers ++= newCenters

    // Each step samples roughly 2 * k points; a point is picked with probability proportional to its squared
    // distance to the nearest center chosen so far. Each step only computes distances against the newly added centers.
    // initializationSteps defaults to 2
    var step = 0
    val bcNewCentersList = ArrayBuffer[Broadcast[_]]()
    while (step < initializationSteps) {
      val bcNewCenters = data.context.broadcast(newCenters)
      bcNewCentersList += bcNewCenters
      val preCosts = costs
      // A point's cost is its distance to the nearest center so far: compute it against the new centers and take the min with the previous cost
      costs = data.zip(preCosts).map { case (point, cost) =>
        math.min(distanceMeasureInstance.pointCost(bcNewCenters.value, point), cost)
      }.persist(StorageLevel.MEMORY_AND_DISK)
      // Sum of all point costs
      val sumCosts = costs.sum()
      // Asynchronously remove the broadcast data
      bcNewCenters.unpersist()
      // Mark the previous cost RDD as non-persistent and drop its blocks from memory and disk; removal is asynchronous (non-blocking) by default
      preCosts.unpersist()

      // mapPartitionsWithIndex: each task iterates over the whole contents of one partition.
      // Keep each point with a probability that grows with its cost (its distance to the nearest center so far),
      // which matches the spirit of K-Means++ initialization
      val chosen: Array[VectorWithNorm] = data.zip(costs).mapPartitionsWithIndex { (index, pointCosts) =>
        val rand = new XORShiftRandom(seed ^ (step << 16) ^ index)
        pointCosts.filter { case (_, c) => rand.nextDouble() < 2.0 * c * k / sumCosts }.map(_._1)
      }.collect()
      newCenters = chosen.map(_.toDense)
      centers ++= newCenters
      step += 1
    }

    // Release the memory held by costs
    costs.unpersist()
    // Destroy the broadcast variables
    bcNewCentersList.foreach(_.destroy())

    // Deduplicate by vector; centers is an ArrayBuffer[VectorWithNorm], and each element carries the vector plus its norm and weight
    val distinctCenters = centers.map(_.vector).distinct.map(new VectorWithNorm(_))

    if (distinctCenters.size <= k) {
      distinctCenters.toArray
    } else {
      // At this point we may have more than k distinct candidate centers; weight each candidate by the number of
      // points in the dataset that map to it, then run a local k-means++ on the weighted candidates to pick k of them
      val bcCenters = data.context.broadcast(distinctCenters)
      val countMap = data
        .map(distanceMeasureInstance.findClosest(bcCenters.value, _)._1)
        .countByValue()

      bcCenters.destroy()

      val myWeights = distinctCenters.indices.map(countMap.getOrElse(_, 0L).toDouble).toArray
      LocalKMeans.kMeansPlusPlus(0, distinctCenters.toArray, myWeights, k, 30)
    }
  }

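From user code, the initialization behaviour walked through above is controlled via KMeans's public setters (the values below are illustrative):

    import org.apache.spark.mllib.clustering.KMeans

    val kmeans = new KMeans()
      .setK(3)
      .setInitializationMode(KMeans.K_MEANS_PARALLEL)  // or KMeans.RANDOM
      .setInitializationSteps(5)                       // number of k-means|| passes
      .setMaxIterations(20)
      .setEpsilon(1e-4)                                // per-center convergence threshold
      .setSeed(42L)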