------------------------------ Contents --------------------------------------------
K-means theory
Matlab implementation
Spark source code analysis
Spark source code
Spark experiment
------------------------------------------------------------------------------------
For the theory behind K-means clustering, see http://cs229.stanford.edu/notes/cs229-notes7a.pdf
It boils down to three steps:
1. Initialize K cluster centers.
2. Compute the distance from every data point to each of the K centers.
3. Using those distances, find better positions for the K centers (assign each point to its nearest center and recompute each center as the mean of its assigned points), then repeat from step 2 until the centers stop moving.
A minimal sketch of this loop follows the list.
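As a concrete reference point for the Matlab and Spark versions below, here is a minimal, self-contained sketch of that loop in plain Scala. Everything in it (the object name LloydSketch, seeding the centers with the first k points, the fixed iteration count) is an illustrative simplification, not library code:

object LloydSketch {
  type Point = Array[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def kmeans(points: Seq[Point], k: Int, iters: Int): Array[Point] = {
    // Step 1: initialize K cluster centers (here simply the first k points)
    var centers = points.take(k).toArray
    for (_ <- 0 until iters) {
      // Step 2: assign every point to the index of its nearest center
      val assignment = points.groupBy(p => centers.indices.minBy(i => sqDist(p, centers(i))))
      // Step 3: move each center to the mean of the points assigned to it
      centers = centers.indices.map { i =>
        assignment.get(i) match {
          case Some(ps) => ps.transpose.map(_.sum / ps.length).toArray
          case None     => centers(i) // empty cluster: keep the old center
        }
      }.toArray
    }
    centers
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(Array(1.0, 1.0), Array(1.2, 0.8), Array(5.0, 5.0), Array(5.2, 4.8))
    // converges to roughly [1.1, 0.9] and [5.1, 4.9]
    kmeans(data, 2, 10).foreach(c => println(c.mkString("[", ", ", "]")))
  }
}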
Question 1: How do we determine these K initial centers? Do we pick them at random, or with an algorithm?
Answer: Either works. We can choose the K points at random, or we can use an algorithm; K-means++ is the usual choice.
For the details of K-means++ see https://en.wikipedia.org/wiki/K-means%2B%2B and http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf. The procedure runs as follows:
1. Randomly choose one point from the input data set as the first cluster center C1.
2. Compute every data point's distance to C1 and store it in D(x).
3. Choose the point with the largest D(x) as the second cluster center C2. (This farthest-point rule is a simplification; real K-means++ samples the next center with probability proportional to D(x)^2, which is exactly what steps 5-7 below implement.)
4. Compute every data point's distance to C1 and to C2; each point is assigned to whichever center is closer. At this stage all the data fall into two clusters: the points attached to C1 and the points attached to C2.
Question 2: What do we do when K is greater than 2? Do we again collect each point's distance to its nearest of C1 and C2 in D(x) and pick whichever point has the largest distance?
Answer: That would work. But an algorithm always wants an edge over its rivals, and getting an edge means taking more into account. Taking more into account leads to the following:
5. Compute each data point's distance to its nearest cluster center (continuing from step 4: points in cluster C1 use their distance to C1, points in cluster C2 use their distance to C2), then add all these distances up to get Sum.
6. Draw a random value temp uniformly from (0, Sum), then repeatedly set temp = temp - D(i) for i = 1, 2, ...; the moment temp < 0, stop and take that point i as the next cluster center. (Far-away points contribute large D(i) terms, so they are proportionally more likely to be the point at which temp crosses zero.)
7. Keep going like this until the K-th center has been found.
A sketch of this weighted-sampling step appears below.
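Here is a minimal sketch of that subtraction trick in plain Scala. It is an illustration rather than library code, and note one deliberate difference from the narration above: it weights points by D(x)^2, as the K-means++ paper specifies, instead of by the raw distance D(x):

import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object KMeansPlusPlusSketch {
  type Point = Array[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // First center: uniformly at random (step 1). Each later center: sampled with
  // probability proportional to D(x)^2, the squared distance to the nearest
  // center chosen so far, using the subtraction trick of steps 6 and 7.
  def chooseCenters(points: IndexedSeq[Point], k: Int, rnd: Random): IndexedSeq[Point] = {
    val centers = ArrayBuffer(points(rnd.nextInt(points.length)))
    while (centers.length < k) {
      val d2 = points.map(p => centers.map(c => sqDist(p, c)).min) // D(x)^2 for every point
      var temp = rnd.nextDouble() * d2.sum                         // random value in (0, Sum)
      var i = 0
      while (i < points.length - 1 && temp >= d2(i)) {
        temp -= d2(i) // temp = temp - D(i)^2; stop at the point where it would turn negative
        i += 1
      }
      centers += points(i) // point i becomes the next cluster center
    }
    centers.toIndexedSeq
  }

  def main(args: Array[String]): Unit = {
    val data = IndexedSeq(Array(0.0, 0.0), Array(0.1, 0.2), Array(9.0, 9.0), Array(9.2, 8.8))
    // with high probability the two chosen centers come from the two distant groups
    chooseCenters(data, 2, new Random(42)).foreach(c => println(c.mkString("[", ", ", "]")))
  }
}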
Now let's work through a hand-written Matlab implementation.
The data and code are on Baidu Cloud: http://pan.baidu.com/s/1mh9ldnM (password: dcz4)
% Import the data
filename = 'C:\Users\andrew\Desktop\kmeans\IrisData.txt';
delimiter = '\t';
startRow = 2;
formatSpec = '%f%f%f%f%s%[^\n\r]';
fileID = fopen(filename,'r');
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, 'HeaderLines', startRow-1, 'ReturnOnError', false);
VarName1 = dataArray{:, 1};
Sepalwidth = dataArray{:, 2};   % sepal width
Petallength = dataArray{:, 3};  % petal length
Petalwidth = dataArray{:, 4};   % petal width
Species = dataArray{:, 5};      % species

% Look at how the samples are distributed
figure(1)
plot(Petalwidth,Petallength,'o')
xlabel('Petal width')
ylabel('Petal length')
title('Distribution before clustering')

% To make this clearer, mark each species with its own symbol
% rows 1 to 50:    I. setosa
% rows 51 to 100:  I. versicolor
% rows 101 to 150: I. virginica
figure(2)
hold on
plot(Petalwidth(1:50),Petallength(1:50),'rs','LineWidth',2,'MarkerEdgeColor','k','MarkerFaceColor','g','MarkerSize',10)
plot(Petalwidth(51:100),Petallength(51:100),'mo','LineWidth',2,'MarkerEdgeColor','k','MarkerFaceColor',[.49 1 .63],'MarkerSize',10)
plot(Petalwidth(101:end),Petallength(101:end),'rh','LineWidth',2,'MarkerEdgeColor','k','MarkerFaceColor','r','MarkerSize',10)
xlabel('Petal width')
ylabel('Petal length')
title('Actual class distribution')
hold off

% Call our hand-written kmeans and see how it does
X = [Petalwidth,Petallength];
[cid,nr,centers] = mykmeans(X,3);
figure(3)
hold on
for i = 1:length(Petalwidth)
    if cid(i) == 1
        plot(Petalwidth(i),Petallength(i),'rs','LineWidth',2,'MarkerEdgeColor','k','MarkerFaceColor','g','MarkerSize',10)
    end
    if cid(i) == 2
        plot(Petalwidth(i),Petallength(i),'mo','LineWidth',2,'MarkerEdgeColor','k','MarkerFaceColor',[.49 1 .63],'MarkerSize',10)
    end
    if cid(i) == 3
        plot(Petalwidth(i),Petallength(i),'rh','LineWidth',2,'MarkerEdgeColor','k','MarkerFaceColor','r','MarkerSize',10)
    end
end
title('Clustering result')
hold off

% centers =
%
%     2.0478    5.6261
%     0.2460    1.4620
%     1.3593    4.2926
function [cid,nr,centers] = mykmeans(X,k)
% Arguments:
%   X       : data set (n*m)
%   k       : number of cluster centers requested
% Returns:
%   cid     : the cluster each data point is assigned to
%   nr      : the number of points in each cluster
%   centers : the cluster centers

%% Pick the initial k cluster centers
[n,m] = size(X);
cid = zeros(1,n);
nr = zeros(1,k);
% choose k centers at random from the input data set
temp = randperm(n);
a = temp(1:k);
nc = X(a',:);

%% Optimize the k centers (Lloyd iterations)
maxIter = 100;
iter = 1;
while iter < maxIter
    % assign every point to its nearest center
    for i = 1:n
        dist = sum((repmat(X(i,:),k,1)-nc).^2,2);
        [~,ind] = min(dist);
        cid(i) = ind;
    end
    % recompute each center as the mean of its assigned points
    for i = 1:k
        ind = find(cid==i);
        nc(i,:) = mean(X(ind,:));
        nr(i) = length(ind);
    end
    iter = iter + 1;
end

%% One refinement pass: move individual points if that lowers the adjusted cost
tempMaxiter = 2;
tempIter = 1;
tempMove = 1;
while tempIter < tempMaxiter && tempMove ~= 0
    tempMove = 0;
    for i = 1:n
        dist = sum((repmat(X(i,:),k,1)-nc).^2,2);
        r = cid(i);
        % weight each distance by nr/(nr+1) so that small clusters are not starved
        dadj = nr./(nr+1).*dist';
        [~,ind] = min(dadj);
        if ind ~= r
            cid(i) = ind;
            ic = find(cid==ind);
            nc(ind,:) = mean(X(ic,:));
            tempMove = 1;
        end
    end
    tempIter = tempIter + 1;
end
centers = nc;
end
Spark MLlib source code analysis

Spark's K-means implementation consists of the KMeans class and its companion object,
the VectorWithNorm class (a vector paired with its precomputed norm, in service of the
fastSquaredDistance function), and the KMeansModel class with its companion object.

Their roles:

KMeans class:
    Holds the training parameters (number of clusters: k; maximum iterations: maxIterations;
    number of parallel runs: runs, default 1; initialization algorithm: initializationMode;
    random seed: seed; convergence threshold: epsilon; and so on)
    Trains the model (run): initializes the centers (either randomly or with the k-means||
    variant of K-means++) and computes the cluster centers (runAlgorithm)
    Parallel initialization (initKMeansParallel)

KMeans companion object:
    train methods for the different input signatures
    Finding the closest cluster center for each point (findClosest) and computing a
    point's cost (pointCost)
    Fast distance computation (fastSquaredDistance)

KMeansModel class:
    Computes the sum of distances from every sample to its nearest center (computeCost)
    Saves the model (save)
    Prediction (predict)

KMeansModel companion object:
    Loads a model (load)
    Handles the current model format version (object SaveLoadV1_0)

A sketch of how these pieces fit together from a caller's point of view follows.
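This is only a sketch (ApiFlowSketch and trainAndInspect are illustrative names, and `points` is assumed to be an already-parsed RDD[Vector]); the full runnable experiment appears at the end of this post:

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object ApiFlowSketch {
  // KMeans.train builds a configured KMeans instance, whose run() initializes the
  // centers and hands off to runAlgorithm; the returned KMeansModel then handles
  // prediction, costing, and saving.
  def trainAndInspect(points: RDD[Vector]): Unit = {
    val model: KMeansModel = KMeans.train(points, 3, 20) // k = 3, maxIterations = 20
    println(model.predict(points.first()))               // cluster index of the first point
    println(model.computeCost(points))                   // WSSSE over the whole data set
  }
}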
Spark source code:
KMeans class
@Since("0.8.0") class KMeans private ( private var k: Int, private var maxIterations: Int, private var runs: Int, private var initializationMode: String, private var initializationSteps: Int, private var epsilon: Double, private var seed: Long) extends Serializable with Logging { /** * Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, runs: 1, * initializationMode: "k-means||", initializationSteps: 5, epsilon: 1e-4, seed: random}. */ //构造函数: //默认情况下:分2类,20次迭代,1个并行,输出化模型选择KMeans.K_MEANS_PARALLEL,初始steps为5 //epsilon = 0.0001, @Since("0.8.0") def this() = this(2, 20, 1, KMeans.K_MEANS_PARALLEL, 5, 1e-4, Utils.random.nextLong()) /** * Number of clusters to create (k). */ @Since("1.4.0") def getK: Int = k /** * Set the number of clusters to create (k). Default: 2. */ @Since("0.8.0") def setK(k: Int): this.type = { this.k = k this } /** * Maximum number of iterations allowed. */ @Since("1.4.0") def getMaxIterations: Int = maxIterations /** * Set maximum number of iterations allowed. Default: 20. */ @Since("0.8.0") def setMaxIterations(maxIterations: Int): this.type = { this.maxIterations = maxIterations this } /** * The initialization algorithm. This can be either "random" or "k-means||". */ @Since("1.4.0") def getInitializationMode: String = initializationMode /** * Set the initialization algorithm. This can be either "random" to choose random points as * initial cluster centers, or "k-means||" to use a parallel variant of k-means++ * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||. */ @Since("0.8.0") def setInitializationMode(initializationMode: String): this.type = { KMeans.validateInitMode(initializationMode) this.initializationMode = initializationMode this } /** * :: Experimental :: * Number of runs of the algorithm to execute in parallel. */ @Since("1.4.0") @deprecated("Support for runs is deprecated. This param will have no effect in 2.0.0.", "1.6.0") def getRuns: Int = runs /** * :: Experimental :: * Set the number of runs of the algorithm to execute in parallel. We initialize the algorithm * this many times with random starting conditions (configured by the initialization mode), then * return the best clustering found over any run. Default: 1. */ @Since("0.8.0") @deprecated("Support for runs is deprecated. This param will have no effect in 2.0.0.", "1.6.0") def setRuns(runs: Int): this.type = { internalSetRuns(runs) } // Internal version of setRuns for Python API, this should be removed at the same time as setRuns // this is done to avoid deprecation warnings in our build. private[mllib] def internalSetRuns(runs: Int): this.type = { if (runs <= 0) { throw new IllegalArgumentException("Number of runs must be positive") } if (runs != 1) { logWarning("Setting number of runs is deprecated and will have no effect in 2.0.0") } this.runs = runs this } /** * Number of steps for the k-means|| initialization mode */ @Since("1.4.0") def getInitializationSteps: Int = initializationSteps /** * Set the number of steps for the k-means|| initialization mode. This is an advanced * setting -- the default of 5 is almost always enough. Default: 5. */ @Since("0.8.0") def setInitializationSteps(initializationSteps: Int): this.type = { if (initializationSteps <= 0) { throw new IllegalArgumentException("Number of initialization steps must be positive") } this.initializationSteps = initializationSteps this } /** * The distance threshold within which we've consider centers to have converged. 
*/ @Since("1.4.0") def getEpsilon: Double = epsilon /** * Set the distance threshold within which we've consider centers to have converged. * If all centers move less than this Euclidean distance, we stop iterating one run. */ @Since("0.8.0") def setEpsilon(epsilon: Double): this.type = { this.epsilon = epsilon this } /** * The random seed for cluster initialization. */ @Since("1.4.0") def getSeed: Long = seed /** * Set the random seed for cluster initialization. */ @Since("1.4.0") def setSeed(seed: Long): this.type = { this.seed = seed this } // Initial cluster centers can be provided as a KMeansModel object rather than using the // random or k-means|| initializationMode private var initialModel: Option[KMeansModel] = None /** * Set the initial starting point, bypassing the random initialization or k-means|| * The condition model.k == this.k must be met, failure results * in an IllegalArgumentException. */ @Since("1.4.0") def setInitialModel(model: KMeansModel): this.type = { require(model.k == k, "mismatched cluster count") initialModel = Some(model) this } /** * Train a K-means model on the given set of points; `data` should be cached for high * performance, because this is an iterative algorithm. */ @Since("0.8.0") //训练模型的run方法 //官方提示说,因为迭代最好选择将数据进行缓存 def run(data: RDD[Vector]): KMeansModel = { if (data.getStorageLevel == StorageLevel.NONE) { logWarning("The input data is not directly cached, which may hurt performance if its" + " parent RDDs are also uncached.") } // Compute squared norms and cache them. //计算2范数,并且缓存 val norms = data.map(Vectors.norm(_, 2.0)) norms.persist() val zippedData = data.zip(norms).map { case (v, norm) => new VectorWithNorm(v, norm) } //调用runAlgorithm函数,实现K-Means算法, val model = runAlgorithm(zippedData) norms.unpersist() // Warn at the end of the run as well, for increased visibility. if (data.getStorageLevel == StorageLevel.NONE) { logWarning("The input data was not directly cached, which may hurt performance if its" + " parent RDDs are also uncached.") } model//返回模型 } /** * Implementation of K-Means algorithm. 
*/ private def runAlgorithm(data: RDD[VectorWithNorm]): KMeansModel = { val sc = data.sparkContext val initStartTime = System.nanoTime() // Only one run is allowed when initialModel is given val numRuns = if (initialModel.nonEmpty) { if (runs > 1) logWarning("Ignoring runs; one run is allowed when initialModel is given.") 1 } else { runs } //聚类中心初始化 val centers = initialModel match { case Some(kMeansCenters) => { Array(kMeansCenters.clusterCenters.map(s => new VectorWithNorm(s))) } case None => { if (initializationMode == KMeans.RANDOM) { initRandom(data) } else { initKMeansParallel(data) } } } val initTimeInSeconds = (System.nanoTime() - initStartTime) / 1e9 logInfo(s"Initialization with $initializationMode took " + "%.3f".format(initTimeInSeconds) + " seconds.") val active = Array.fill(numRuns)(true) val costs = Array.fill(numRuns)(0.0) var activeRuns = new ArrayBuffer[Int] ++ (0 until numRuns) var iteration = 0 val iterationStartTime = System.nanoTime() // Execute iterations of Lloyd's algorithm until all runs have converged //使用Lloyd算法 //可以参考分析理论部分的5和6 //同时可以对比一下matlab的算法 //這里多了缓存,并行度的变量 while (iteration < maxIterations && !activeRuns.isEmpty) { type WeightedPoint = (Vector, Long) def mergeContribs(x: WeightedPoint, y: WeightedPoint): WeightedPoint = { axpy(1.0, x._1, y._1) (y._1, x._2 + y._2) } val activeCenters = activeRuns.map(r => centers(r)).toArray val costAccums = activeRuns.map(_ => sc.accumulator(0.0)) val bcActiveCenters = sc.broadcast(activeCenters) // Find the sum and count of points mapping to each center //理论分析的第五部分,对每个中心到各自的样本的距离进行计算 val totalContribs = data.mapPartitions { points => val thisActiveCenters = bcActiveCenters.value val runs = thisActiveCenters.length val k = thisActiveCenters(0).length val dims = thisActiveCenters(0)(0).vector.size val sums = Array.fill(runs, k)(Vectors.zeros(dims)) val counts = Array.fill(runs, k)(0L) points.foreach { point => (0 until runs).foreach { i => val (bestCenter, cost) = KMeans.findClosest(thisActiveCenters(i), point) costAccums(i) += cost val sum = sums(i)(bestCenter) axpy(1.0, point.vector, sum) counts(i)(bestCenter) += 1 } } //contribs = ((并行度,那个中心),某个聚类中心到各自样本之和,那个并行度下的那个聚类中心) val contribs = for (i <- 0 until runs; j <- 0 until k) yield { ((i, j), (sums(i)(j), counts(i)(j))) } contribs.iterator }.reduceByKey(mergeContribs).collectAsMap() bcActiveCenters.unpersist(blocking = false) // Update the cluster centers and costs for each active run for ((run, i) <- activeRuns.zipWithIndex) { var changed = false var j = 0 while (j < k) { val (sum, count) = totalContribs((i, j)) if (count != 0) { scal(1.0 / count, sum) val newCenter = new VectorWithNorm(sum) if (KMeans.fastSquaredDistance(newCenter, centers(run)(j)) > epsilon * epsilon) { changed = true } centers(run)(j) = newCenter } j += 1 } if (!changed) { active(run) = false logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations") } costs(run) = costAccums(i).value } activeRuns = activeRuns.filter(active(_)) iteration += 1 } val iterationTimeInSeconds = (System.nanoTime() - iterationStartTime) / 1e9 logInfo(s"Iterations took " + "%.3f".format(iterationTimeInSeconds) + " seconds.") if (iteration == maxIterations) { logInfo(s"KMeans reached the max number of iterations: $maxIterations.") } else { logInfo(s"KMeans converged in $iteration iterations.") } val (minCost, bestRun) = costs.zipWithIndex.min logInfo(s"The cost for the best run is $minCost.") new KMeansModel(centers(bestRun).map(_.vector)) } /** * Initialize `runs` sets of cluster centers at random. 
*/ //随机的寻找K个聚类中心 private def initRandom(data: RDD[VectorWithNorm]) : Array[Array[VectorWithNorm]] = { // Sample all the cluster centers in one pass to avoid repeated scans val sample = data.takeSample(true, runs * k, new XORShiftRandom(this.seed).nextInt()).toSeq Array.tabulate(runs)(r => sample.slice(r * k, (r + 1) * k).map { v => new VectorWithNorm(Vectors.dense(v.vector.toArray), v.norm) }.toArray) } /** * Initialize `runs` sets of cluster centers using the k-means|| algorithm by Bahmani et al. * (Bahmani et al., Scalable K-Means++, VLDB 2012). This is a variant of k-means++ that tries * to find with dissimilar cluster centers by starting with a random center and then doing * passes where more centers are chosen with probability proportional to their squared distance * to the current cluster set. It results in a provable approximation to an optimal clustering. * * The original paper can be found at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf. */ //用K-Means++理论 private def initKMeansParallel(data: RDD[VectorWithNorm]) : Array[Array[VectorWithNorm]] = { // Initialize empty centers and point costs. val centers = Array.tabulate(runs)(r => ArrayBuffer.empty[VectorWithNorm]) var costs = data.map(_ => Array.fill(runs)(Double.PositiveInfinity)) // Initialize each run's first center to a random point. val seed = new XORShiftRandom(this.seed).nextInt() val sample = data.takeSample(true, runs, seed).toSeq val newCenters = Array.tabulate(runs)(r => ArrayBuffer(sample(r).toDense)) /** Merges new centers to centers. */ def mergeNewCenters(): Unit = { var r = 0 while (r < runs) { centers(r) ++= newCenters(r) newCenters(r).clear() r += 1 } } // On each step, sample 2 * k points on average for each run with probability proportional // to their squared distance from that run's centers. Note that only distances between points // and new centers are computed in each iteration. 
var step = 0 while (step < initializationSteps) { val bcNewCenters = data.context.broadcast(newCenters) val preCosts = costs costs = data.zip(preCosts).map { case (point, cost) => Array.tabulate(runs) { r => math.min(KMeans.pointCost(bcNewCenters.value(r), point), cost(r)) } }.persist(StorageLevel.MEMORY_AND_DISK) val sumCosts = costs .aggregate(new Array[Double](runs))( seqOp = (s, v) => { // s += v var r = 0 while (r < runs) { s(r) += v(r) r += 1 } s }, combOp = (s0, s1) => { // s0 += s1 var r = 0 while (r < runs) { s0(r) += s1(r) r += 1 } s0 } ) bcNewCenters.unpersist(blocking = false) preCosts.unpersist(blocking = false) val chosen = data.zip(costs).mapPartitionsWithIndex { (index, pointsWithCosts) => val rand = new XORShiftRandom(seed ^ (step << 16) ^ index) pointsWithCosts.flatMap { case (p, c) => val rs = (0 until runs).filter { r => rand.nextDouble() < 2.0 * c(r) * k / sumCosts(r) } if (rs.length > 0) Some((p, rs)) else None } }.collect() mergeNewCenters() chosen.foreach { case (p, rs) => rs.foreach(newCenters(_) += p.toDense) } step += 1 } mergeNewCenters() costs.unpersist(blocking = false) // Finally, we might have a set of more than k candidate centers for each run; weigh each // candidate by the number of points in the dataset mapping to it and run a local k-means++ // on the weighted centers to pick just k of them val bcCenters = data.context.broadcast(centers) val weightMap = data.flatMap { p => Iterator.tabulate(runs) { r => ((r, KMeans.findClosest(bcCenters.value(r), p)._1), 1.0) } }.reduceByKey(_ + _).collectAsMap() bcCenters.unpersist(blocking = false) val finalCenters = (0 until runs).par.map { r => val myCenters = centers(r).toArray val myWeights = (0 until myCenters.length).map(i => weightMap.getOrElse((r, i), 0.0)).toArray LocalKMeans.kMeansPlusPlus(r, myCenters, myWeights, k, 30) } finalCenters.toArray } }KMeans同名对象
/**
 * Top-level methods for calling K-means clustering.
 */
@Since("0.8.0")
object KMeans {

  // Initialization mode names
  @Since("0.8.0")
  val RANDOM = "random"
  @Since("0.8.0")
  val K_MEANS_PARALLEL = "k-means||"

  /**
   * Trains a k-means model using the given set of parameters.
   *
   * @param data Training points as an `RDD` of `Vector` types.
   * @param k Number of clusters to create.
   * @param maxIterations Maximum number of iterations allowed.
   * @param runs Number of runs to execute in parallel. The best model according to the cost
   *             function will be returned. (default: 1)
   * @param initializationMode The initialization algorithm. This can either be "random" or
   *                           "k-means||". (default: "k-means||")
   * @param seed Random seed for cluster initialization. Default is to generate seed based
   *             on system time.
   */
  @Since("1.3.0")
  // data, k, maxIterations, runs (default 1), initializationMode, seed:
  // the 6-parameter variant
  def train(
      data: RDD[Vector],
      k: Int,
      maxIterations: Int,
      runs: Int,
      initializationMode: String,
      seed: Long): KMeansModel = {
    new KMeans().setK(k)
      .setMaxIterations(maxIterations)
      .internalSetRuns(runs)
      .setInitializationMode(initializationMode)
      .setSeed(seed)
      .run(data)
  }

  /**
   * Trains a k-means model using the given set of parameters.
   *
   * @param data Training points as an `RDD` of `Vector` types.
   * @param k Number of clusters to create.
   * @param maxIterations Maximum number of iterations allowed.
   * @param runs Number of runs to execute in parallel. The best model according to the cost
   *             function will be returned. (default: 1)
   * @param initializationMode The initialization algorithm. This can either be "random" or
   *                           "k-means||". (default: "k-means||")
   */
  @Since("0.8.0")
  // data, k, maxIterations, runs (default 1), initializationMode:
  // the 5-parameter variant
  def train(
      data: RDD[Vector],
      k: Int,
      maxIterations: Int,
      runs: Int,
      initializationMode: String): KMeansModel = {
    new KMeans().setK(k)
      .setMaxIterations(maxIterations)
      .internalSetRuns(runs)
      .setInitializationMode(initializationMode)
      .run(data)
  }

  /**
   * Trains a k-means model using specified parameters and the default values for unspecified.
   */
  // data, k, maxIterations: the 3-parameter variant
  @Since("0.8.0")
  def train(
      data: RDD[Vector],
      k: Int,
      maxIterations: Int): KMeansModel = {
    train(data, k, maxIterations, 1, K_MEANS_PARALLEL)
  }

  /**
   * Trains a k-means model using specified parameters and the default values for unspecified.
   */
  // data, k, maxIterations, runs (default 1): the 4-parameter variant
  @Since("0.8.0")
  def train(
      data: RDD[Vector],
      k: Int,
      maxIterations: Int,
      runs: Int): KMeansModel = {
    train(data, k, maxIterations, runs, K_MEANS_PARALLEL)
  }

  /**
   * Returns the index of the closest center to the given point, as well as the squared distance.
   */
  // Find the cluster center closest to the given point; returns (bestIndex, bestDistance)
  private[mllib] def findClosest(
      centers: TraversableOnce[VectorWithNorm],
      point: VectorWithNorm): (Int, Double) = {
    var bestDistance = Double.PositiveInfinity
    var bestIndex = 0
    var i = 0
    centers.foreach { center =>
      // Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary
      // distance computation.
      var lowerBoundOfSqDist = center.norm - point.norm
      lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist
      if (lowerBoundOfSqDist < bestDistance) {
        val distance: Double = fastSquaredDistance(center, point)
        if (distance < bestDistance) {
          bestDistance = distance
          bestIndex = i
        }
      }
      i += 1
    }
    (bestIndex, bestDistance)
  }

  /**
   * Returns the K-means cost of a given point against the given cluster centers.
   */
  // The squared distance from a point to its closest cluster center
  private[mllib] def pointCost(
      centers: TraversableOnce[VectorWithNorm],
      point: VectorWithNorm): Double =
    findClosest(centers, point)._2

  /**
   * Returns the squared Euclidean distance between two vectors computed by
   * [[org.apache.spark.mllib.util.MLUtils#fastSquaredDistance]].
   */
  // The squared distance between two vectors, computed from their cached norms
  private[clustering] def fastSquaredDistance(
      v1: VectorWithNorm,
      v2: VectorWithNorm): Double = {
    MLUtils.fastSquaredDistance(v1.vector, v1.norm, v2.vector, v2.norm)
  }

  // Validate the initialization mode
  private[spark] def validateInitMode(initMode: String): Boolean = {
    initMode match {
      case KMeans.RANDOM => true
      case KMeans.K_MEANS_PARALLEL => true
      case _ => false
    }
  }
}

VectorWithNorm class

/**
 * A vector with its norm for fast distance computation.
 *
 * @see [[org.apache.spark.mllib.clustering.KMeans#fastSquaredDistance]]
 */
// a small helper class that pairs a vector with its precomputed norm
private[clustering]
class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable {

  def this(vector: Vector) = this(vector, Vectors.norm(vector, 2.0))

  def this(array: Array[Double]) = this(Vectors.dense(array))

  /** Converts the vector to a dense vector. */
  def toDense: VectorWithNorm = new VectorWithNorm(Vectors.dense(vector.toArray), norm)
}
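Two tricks in findClosest and fastSquaredDistance deserve spelling out. First, |a - b|^2 = |a|^2 + |b|^2 - 2*(a . b), so caching every vector's norm (the whole point of VectorWithNorm) reduces each squared distance to a single dot product; the real MLUtils.fastSquaredDistance adds precision safeguards around this identity. Second, |a - b| >= | |a| - |b| |, so the squared difference of two cached norms is a lower bound on the squared distance, and findClosest can skip any center whose bound already exceeds the best distance found so far. A minimal sketch of both tricks in plain Scala (no Spark dependencies; FastDistanceDemo is an illustrative name):

object FastDistanceDemo {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

  // |a - b|^2 from the two cached norms and one dot product
  def fastSquaredDistance(a: Array[Double], na: Double, b: Array[Double], nb: Double): Double =
    na * na + nb * nb - 2.0 * dot(a, b)

  // Mirror of findClosest: try the cheap lower bound before the real distance
  def findClosest(centers: Seq[(Array[Double], Double)],
                  point: (Array[Double], Double)): (Int, Double) = {
    var bestIndex = 0
    var bestDistance = Double.PositiveInfinity
    for (((c, cn), i) <- centers.zipWithIndex) {
      val lb = cn - point._2
      if (lb * lb < bestDistance) { // this center can still win: compute the real distance
        val d = fastSquaredDistance(c, cn, point._1, point._2)
        if (d < bestDistance) { bestDistance = d; bestIndex = i }
      }
    }
    (bestIndex, bestDistance)
  }

  def main(args: Array[String]): Unit = {
    val centers = Seq(Array(0.2, 1.5), Array(1.4, 4.3), Array(2.0, 5.6)).map(c => (c, norm(c)))
    val p = (Array(1.3, 4.0), norm(Array(1.3, 4.0)))
    println(findClosest(centers, p)) // (1, ~0.1): the middle center wins
  }
}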
KMeansModel class
class KMeansModel @Since("1.1.0") (@Since("1.0.0") val clusterCenters: Array[Vector])
  extends Saveable with Serializable with PMMLExportable {

  /**
   * A Java-friendly constructor that takes an Iterable of Vectors.
   */
  @Since("1.4.0")
  def this(centers: java.lang.Iterable[Vector]) = this(centers.asScala.toArray)

  /**
   * Total number of clusters.
   */
  @Since("0.8.0")
  def k: Int = clusterCenters.length

  /**
   * Returns the cluster index that a given point belongs to.
   */
  @Since("0.8.0")
  def predict(point: Vector): Int = {
    KMeans.findClosest(clusterCentersWithNorm, new VectorWithNorm(point))._1
  }

  /**
   * Maps given points to their cluster indices.
   */
  @Since("1.0.0")
  def predict(points: RDD[Vector]): RDD[Int] = {
    val centersWithNorm = clusterCentersWithNorm
    val bcCentersWithNorm = points.context.broadcast(centersWithNorm)
    points.map(p => KMeans.findClosest(bcCentersWithNorm.value, new VectorWithNorm(p))._1)
  }

  /**
   * Maps given points to their cluster indices.
   */
  @Since("1.0.0")
  def predict(points: JavaRDD[Vector]): JavaRDD[java.lang.Integer] =
    predict(points.rdd).toJavaRDD().asInstanceOf[JavaRDD[java.lang.Integer]]

  /**
   * Return the K-means cost (sum of squared distances of points to their nearest center) for this
   * model on the given data.
   */
  @Since("0.8.0")
  def computeCost(data: RDD[Vector]): Double = {
    val centersWithNorm = clusterCentersWithNorm
    val bcCentersWithNorm = data.context.broadcast(centersWithNorm)
    data.map(p => KMeans.pointCost(bcCentersWithNorm.value, new VectorWithNorm(p))).sum()
  }

  private def clusterCentersWithNorm: Iterable[VectorWithNorm] =
    clusterCenters.map(new VectorWithNorm(_))

  @Since("1.4.0")
  override def save(sc: SparkContext, path: String): Unit = {
    KMeansModel.SaveLoadV1_0.save(sc, this, path)
  }

  override protected def formatVersion: String = "1.0"
}
KMeansModel companion object
object KMeansModel extends Loader[KMeansModel] {

  @Since("1.4.0")
  override def load(sc: SparkContext, path: String): KMeansModel = {
    KMeansModel.SaveLoadV1_0.load(sc, path)
  }

  private case class Cluster(id: Int, point: Vector)

  private object Cluster {
    def apply(r: Row): Cluster = {
      Cluster(r.getInt(0), r.getAs[Vector](1))
    }
  }

  private[clustering] object SaveLoadV1_0 {

    private val thisFormatVersion = "1.0"

    private[clustering] val thisClassName = "org.apache.spark.mllib.clustering.KMeansModel"

    def save(sc: SparkContext, model: KMeansModel, path: String): Unit = {
      val sqlContext = SQLContext.getOrCreate(sc)
      import sqlContext.implicits._
      val metadata = compact(render(
        ("class" -> thisClassName) ~ ("version" -> thisFormatVersion) ~ ("k" -> model.k)))
      sc.parallelize(Seq(metadata), 1).saveAsTextFile(Loader.metadataPath(path))
      val dataRDD = sc.parallelize(model.clusterCenters.zipWithIndex).map { case (point, id) =>
        Cluster(id, point)
      }.toDF()
      dataRDD.write.parquet(Loader.dataPath(path))
    }

    def load(sc: SparkContext, path: String): KMeansModel = {
      implicit val formats = DefaultFormats
      val sqlContext = SQLContext.getOrCreate(sc)
      val (className, formatVersion, metadata) = Loader.loadMetadata(sc, path)
      assert(className == thisClassName)
      assert(formatVersion == thisFormatVersion)
      val k = (metadata \ "k").extract[Int]
      val centroids = sqlContext.read.parquet(Loader.dataPath(path))
      Loader.checkSchema[Cluster](centroids.schema)
      val localCentroids = centroids.rdd.map(Cluster.apply).collect()
      assert(k == localCentroids.length)
      new KMeansModel(localCentroids.sortBy(_.id).map(_.point))
    }
  }
}
Spark experiment

Note: to keep the parsing simple, the input file here contains only columns 3 and 4 of the Matlab data set (petal length and petal width), as plain numbers.
package Cluster

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}

object myKmean {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("KMeansExample").setMaster("local")
    val sc = new SparkContext(conf)

    // Load and parse the data
    val data = sc.textFile("C:\\Users\\andrew\\Desktop\\kmeans\\IrisData.txt")
    println(data)
    data.collect.foreach(println)
    val parsedData = data.map(s => Vectors.dense(s.split('\t').map(_.toDouble)))
    parsedData.collect.foreach(println)

    // Cluster the data into three classes using KMeans.
    // Use the constant rather than a hand-typed string: the original "K-means||"
    // (capital K) is not a recognized mode name, though this Spark version happens
    // to fall back to k-means|| initialization for anything other than "random".
    val initMode = KMeans.K_MEANS_PARALLEL // i.e. "k-means||"
    val numClusters = 3
    val numIterations = 20
    val clusters = KMeans.train(parsedData, numClusters, numIterations, 1, initMode)
    clusters.clusterCenters.foreach(println)
    // Cluster center coordinates:
    // [5.595833333333332,2.0374999999999988]
    // [1.462,0.24599999999999994]
    // [4.269230769230769,1.3423076923076924]

    // Evaluate clustering by computing Within Set Sum of Squared Errors
    val WSSSE = clusters.computeCost(parsedData)
    println("Within Set Sum of Squared Errors = " + WSSSE)
    // Within Set Sum of Squared Errors = 31.371358974359016

    // Save and load the model
    clusters.save(sc, "target/org/apache/spark/KMeansExample/KMeansModel")
    val sameModel = KMeansModel.load(sc, "target/org/apache/spark/KMeansExample/KMeansModel")

    sc.stop()
  }
}

Comparing with the Matlab cluster centers:
Matlab (columns: petal width, petal length):

centers =

    2.0478    5.6261
    0.2460    1.4620
    1.3593    4.2926

Spark (columns: petal length, petal width):

[5.595833333333332,2.0374999999999988]
[1.462,0.24599999999999994]
[4.269230769230769,1.3423076923076924]

Up to row order and the swapped column order (the Matlab script clusters [Petalwidth, Petallength], while the Spark input file stores petal length first), the two implementations find essentially the same three centers. The setosa cluster (0.2460, 1.4620) matches exactly; the other two differ slightly (e.g. petal width 2.0478 vs. 2.0375) because the two runs start from different initializations and K-means only converges to a local optimum.