ALS (Alternating Least Squares)
---------------------------------------------------------------------
Principles
Matlab: principal component analysis with ALS
Spark source code
SparkML experiment
---------------------------------------------------------------------
For the application of ALS in recommender systems, see Reference 1.
Readers interested in the distributed implementation may also want to read Reference 2.
Principles
The material below is drawn from Reference 1 to explain how ALS is applied to the recommendation problem. A ratings matrix R (viewers' ratings of movies) can be factored into U (a viewer feature matrix) and V (a movie feature matrix).
Suppose there are 5 viewers and 5 movies; then R is a 5×5 matrix. Suppose the latent dimension d consists of three attributes (personality, education level, hobbies); then U holds a 3-dimensional feature vector for each viewer, and V a 3-dimensional feature vector for each movie.
A careful reader will notice that R is only approximately equal to U*V. Why approximately? Because three attributes (personality, education level, hobbies) cannot capture everything behind a person's rating of a movie; region, for example, also plays a role. But we can approximate in the spirit of principal component analysis (I am deliberately avoiding a purely mathematical treatment so this stays easy to follow). This is the core of ALS: a ratings matrix can be approximated by the product of two small matrices (ALS is an important tool for the NNMF problem in the presence of missing data).
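In symbols (a restatement of the example, with the shapes as assumed above):

$$R \approx U\,V, \qquad R \in \mathbb{R}^{5\times 5},\quad U \in \mathbb{R}^{5\times 3},\quad V \in \mathbb{R}^{3\times 5}$$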
So how do we judge the quality of these two matrices?
Ideally, the factorization would be exact:

$$R = U\,V$$

In practice, however, we solve numerically, so the computed factorization carries error, and its quality is judged by the following expression:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{(u,v)\ \mathrm{observed}} \left(r_{uv} - \hat{r}_{uv}\right)^2}$$

where RMSE is the root mean square error, $\hat{r}_{uv}$ is the predicted rating of user $u$ for movie $v$, $r_{uv}$ is the observed rating, and $N$ is the number of observed ratings. The prediction $\hat{r}_{uv}$ is expressed as:

$$\hat{r}_{uv} = \mathbf{u}_u^{\top}\,\mathbf{v}_v$$
To recap why it is called "alternating": procedurally, we fix V and optimize U, then fix U and optimize V, then optimize U again, and so on until convergence.
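Each alternating step is a regularized least-squares problem with a closed-form solution. As a sketch, assuming the plain L2-regularized squared-error objective (the notation here is mine, not from the original figures):

$$\min_{U,V} \sum_{(u,v)\ \mathrm{observed}} \left(r_{uv} - \mathbf{u}_u^{\top}\mathbf{v}_v\right)^2 + \lambda\left(\sum_u \lVert\mathbf{u}_u\rVert^2 + \sum_v \lVert\mathbf{v}_v\rVert^2\right)$$

With V fixed, each user's factor vector decouples from the others and is solved by

$$\mathbf{u}_u \leftarrow \left(V_u^{\top}V_u + \lambda I\right)^{-1} V_u^{\top}\mathbf{r}_u$$

where $V_u$ stacks the factor vectors of the movies user $u$ rated and $\mathbf{r}_u$ the corresponding ratings; the update for $\mathbf{v}_v$ with U fixed is symmetric. This row-wise decoupling is exactly what the distributed implementation below exploits.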
Both gradient descent and least squares can solve each of these subproblems; I will not walk through parameter-fitting code here (see the earlier posts on least squares and gradient descent). Below, Matlab is used to demonstrate completing the missing data with PCA, optimized by ALS, to predict movie ratings:
Data: http://pan.baidu.com/s/1kUWo2iF (password: iie8)
% Purpose:
% Use PCA, optimized with ALS, to recover the latent ratings; the data is the
% viewer/movie ratings from above.
load Data.txt
R = Data;
[coeff1,score1,latent,tsquared,explained,mu1] = pca(R,'algorithm','als');
%% Output arguments
% coeff1: principal component coefficients (loadings)
% 0.2851 -0.5043 0.8124 -0.0266
% 0.9230 -0.0764 -0.3655 0.0830
% -0.1822 -0.4338 -0.1826 0.8602
% -0.0890 -0.2844 -0.0782 -0.0986
% 0.1602 0.6861 0.4085 0.4927
% score1: principal component scores
% 3.1434 -2.0913 -0.1917 -0.0505
% -3.1122 0.5615 -0.1839 -0.2997
% -4.9612 -0.4934 -0.0388 0.2334
% 3.3091 1.5365 -0.4941 0.1154
% 1.6210 0.4868 0.9086 0.0014
% latent: principal component variances
%14.4394
% 1.8826
% 0.2854
% 0.0400
% tsquared: Hotelling's T-squared statistic for each observation in X
%3.2000
% 3.2000
% 3.2000
% 3.2000
% 3.2000
% explained: vector of the percentage of total variance explained by each principal component
% 86.7366
% 11.3084
% 1.7145
% 0.2405
% mu1: estimated mean of each variable
% 5.2035 3.8730 4.6740 4.7043 5.0344
%% Reconstruct the matrix (prediction)
p = score1*coeff1' + repmat(mu1,5,1)
%7.0000 7.0000 5.0000 5.0393 4.0000
%3.8915 1.0000 4.7733 4.8655 4.6982
%4.0000 -0.6348 6.0000 5.2662 4.0000
% 4.9677 7.0000 3.5939 4.0000 6.4738
%6.1583 5.0000 4.0027 4.3503 6.0000
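The reconstruction on the last line is just the PCA model written out. With the score matrix $S$ (score1), the loading matrix $C$ (coeff1), and the row vector of column means $\mu$ (mu1):

$$\hat{R} = S\,C^{\top} + \mathbf{1}\,\mu$$

Entries that were observed come back essentially unchanged, while the entries ALS treated as missing are filled in by the low-rank model, which is exactly the latent-rating prediction we wanted.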
Next, let's look at the part of Spark's ALS that optimizes the parameters. Because the "viewer" and "movie" data are organized into blocks (the source code names the two sides User and products; to stay consistent with the source, the analysis below uses those names), this part deserves particular attention:
The official reference is Reference 3. Its core idea is to partition the User data and the products data into blocks in order to reduce data communication.
To make the communication advantage of blocking concrete, consider the extreme case where every user and every movie is its own block. Tracing the alternating process (fix V, optimize U; fix U, optimize V; ... until convergence) through that setup shows that blocking the data really does cut the communication cost.
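A minimal plain-Scala sketch of the out-link idea (an illustration only, not the actual Spark implementation; assigning blocks by user % numUserBlocks is an assumption for the example): for each user block, precompute which product blocks need each user's feature vector, so that during an iteration a vector is shipped at most once per product block instead of once per rating.

object OutLinkSketch {
  case class Rating(user: Int, product: Int, rating: Double)

  def main(args: Array[String]): Unit = {
    val numUserBlocks = 2
    val numProductBlocks = 2
    val ratings = Seq(
      Rating(0, 0, 5.0), Rating(0, 3, 3.0),
      Rating(1, 1, 4.0), Rating(2, 0, 1.0), Rating(2, 3, 2.0))
    // userBlock -> (user -> product blocks that need this user's feature vector)
    val outLinks: Map[Int, Map[Int, Set[Int]]] =
      ratings.groupBy(_.user % numUserBlocks).map { case (userBlock, rs) =>
        userBlock -> rs.groupBy(_.user).map { case (user, userRs) =>
          user -> userRs.map(_.product % numProductBlocks).toSet
        }
      }
    for ((userBlock, users) <- outLinks; (user, productBlocks) <- users)
      println(s"user block $userBlock: user $user -> product blocks " +
        productBlocks.mkString(", "))
  }
}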
Source code
/**
* A more compact class than Tuple3[Int, Int, Double] for representing a rating.
*/
@Since("0.8.0")
case class Rating @Since("0.8.0") (
@Since("0.8.0") user: Int,
@Since("0.8.0") product: Int,
@Since("0.8.0") rating: Double)
/**
* Alternating Least Squares matrix factorization.
*
* ALS attempts to estimate the ratings matrix `R` as the product of two lower-rank matrices,
* `X` and `Y`, i.e. `X * Yt = R`. Typically these approximations are called 'factor' matrices.
* The general approach is iterative. During each iteration, one of the factor matrices is held
* constant, while the other is solved for using least squares. The newly-solved factor matrix is
* then held constant while solving for the other factor matrix.
*
* This is a blocked implementation of the ALS factorization algorithm that groups the two sets
* of factors (referred to as "users" and "products") into blocks and reduces communication by only
* sending one copy of each user vector to each product block on each iteration, and only for the
* product blocks that need that user's feature vector. This is achieved by precomputing some
* information about the ratings matrix to determine the "out-links" of each user (which blocks of
* products it will contribute to) and "in-link" information for each product (which of the feature
* vectors it receives from each user block it will depend on). This allows us to send only an
* array of feature vectors between each user block and product block, and have the product block
* find the users' ratings and update the products based on these messages.
*
* For implicit preference data, the algorithm used is based on
* "Collaborative Filtering for Implicit Feedback Datasets", available at
* [[http://dx.doi.org/10.1109/ICDM.2008.22]], adapted for the blocked approach used here.
*
* Essentially instead of finding the low-rank approximations to the rating matrix `R`,
* this finds the approximations for a preference matrix `P` where the elements of `P` are 1 if
* r > 0 and 0 if r <= 0. The ratings then act as 'confidence' values related to strength of
* indicated user preferences rather than explicit ratings given to items.
*/
@Since("0.8.0")
class ALS private (
private var numUserBlocks: Int,
private var numProductBlocks: Int,
private var rank: Int,
private var iterations: Int,
private var lambda: Double,
private var implicitPrefs: Boolean,
private var alpha: Double,
private var seed: Long = System.nanoTime()
) extends Serializable with Logging {
/**
* Constructs an ALS instance with default parameters: {numBlocks: -1, rank: 10, iterations: 10,
* lambda: 0.01, implicitPrefs: false, alpha: 1.0}.
*/
@Since("0.8.0")
def this() = this(-1, -1, 10, 10, 0.01, false, 1.0)
/** If true, do alternating nonnegative least squares. */
private var nonnegative = false
/** storage level for user/product in/out links */
private var intermediateRDDStorageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK
private var finalRDDStorageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK
/** checkpoint interval */
private var checkpointInterval: Int = 10
/**
* Sets the number of blocks used to parallelize computation for both users and products (e.g., setting this to 2 gives 2 user blocks and 2 product blocks). numBlocks = -1 means auto-configure the number of blocks; the default is -1.
*/
@Since("0.8.0")
def setBlocks(numBlocks: Int): this.type = {
require(numBlocks == -1 || numBlocks > 0,
s"Number of blocks must be -1 or positive but got ${numBlocks}")
this.numUserBlocks = numBlocks
this.numProductBlocks = numBlocks
this
}
/**
* Sets the number of user blocks to parallelize the computation.
*/
@Since("1.1.0")
def setUserBlocks(numUserBlocks: Int): this.type = {
require(numUserBlocks == -1 || numUserBlocks > 0,
s"Number of blocks must be -1 or positive but got ${numUserBlocks}")
this.numUserBlocks = numUserBlocks
this
}
/**
* Sets the number of product blocks to parallelize the computation.
*/
@Since("1.1.0")
def setProductBlocks(numProductBlocks: Int): this.type = {
require(numProductBlocks == -1 || numProductBlocks > 0,
s"Number of product blocks must be -1 or positive but got ${numProductBlocks}")
this.numProductBlocks = numProductBlocks
this
}
/** Sets the rank (number of features) of the feature matrices to compute. Default: 10. */
@Since("0.8.0")
def setRank(rank: Int): this.type = {
require(rank > 0,
s"Rank of the feature matrices must be positive but got ${rank}")
this.rank = rank
this
}
/** Sets the number of iterations to run. Default: 10. */
@Since("0.8.0")
def setIterations(iterations: Int): this.type = {
require(iterations >= 0,
s"Number of iterations must be nonnegative but got ${iterations}")
this.iterations = iterations
this
}
/** Sets the regularization parameter, lambda. Default: 0.01. */
@Since("0.8.0")
def setLambda(lambda: Double): this.type = {
require(lambda >= 0.0,
s"Regularization parameter must be nonnegative but got ${lambda}")
this.lambda = lambda
this
}
/** Sets whether to use implicit preference. Default: false. */
@Since("0.8.1")
def setImplicitPrefs(implicitPrefs: Boolean): this.type = {
this.implicitPrefs = implicitPrefs
this
}
/**
* Sets the constant used in computing confidence in implicit ALS. Default: 1.0.
*/
@Since("0.8.1")
def setAlpha(alpha: Double): this.type = {
this.alpha = alpha
this
}
/** Sets a random seed to have deterministic results. */
@Since("1.0.0")
def setSeed(seed: Long): this.type = {
this.seed = seed
this
}
/**
* Set whether the least-squares problems solved at each iteration should have
* nonnegativity constraints.
*/
@Since("1.1.0")
def setNonnegative(b: Boolean): this.type = {
this.nonnegative = b
this
}
/**
* :: DeveloperApi ::
* Sets storage level for intermediate RDDs (user/product in/out links). The default value is
* `MEMORY_AND_DISK`. Users can change it to a serialized storage, e.g., `MEMORY_AND_DISK_SER` and
* set `spark.rdd.compress` to `true` to reduce the space requirement, at the cost of speed.
*/
@DeveloperApi
@Since("1.1.0")
def setIntermediateRDDStorageLevel(storageLevel: StorageLevel): this.type = {
require(storageLevel != StorageLevel.NONE,
"ALS is not designed to run without persisting intermediate RDDs.")
this.intermediateRDDStorageLevel = storageLevel
this
}
/**
* :: DeveloperApi ::
* Sets storage level for final RDDs (user/product used in MatrixFactorizationModel). The default
* value is `MEMORY_AND_DISK`. Users can change it to a serialized storage, e.g.
* `MEMORY_AND_DISK_SER` and set `spark.rdd.compress` to `true` to reduce the space requirement,
* at the cost of speed.
*/
@DeveloperApi
@Since("1.3.0")
def setFinalRDDStorageLevel(storageLevel: StorageLevel): this.type = {
this.finalRDDStorageLevel = storageLevel
this
}
/**
* Set period (in iterations) between checkpoints (default = 10). Checkpointing helps with
* recovery (when nodes fail) and StackOverflow exceptions caused by long lineage. It also helps
* with eliminating temporary shuffle files on disk, which can be important when there are many
* ALS iterations. If the checkpoint directory is not set in [[org.apache.spark.SparkContext]],
* this setting is ignored.
*/
@DeveloperApi
@Since("1.4.0")
// Sets how often (in iterations) to checkpoint.
def setCheckpointInterval(checkpointInterval: Int): this.type = {
this.checkpointInterval = checkpointInterval
this
}
/**
* Run ALS with the configured parameters on an input RDD of [[Rating]] objects.
* Returns a MatrixFactorizationModel with feature vectors for each user and product.
*/
@Since("0.8.0")
// Recall the Rating class given at the start; elements of this RDD have the form:
//case class Rating @Since("0.8.0") (
// @Since("0.8.0") user: Int,
// @Since("0.8.0") product: Int,
// @Since("0.8.0") rating: Double)
def run(ratings: RDD[Rating]): MatrixFactorizationModel = {
val sc = ratings.context
// Block count: by default, the larger of the default parallelism and half the number
// of partitions of ratings; if set explicitly, numUserBlocks is used.
val numUserBlocks = if (this.numUserBlocks == -1) {
math.max(sc.defaultParallelism, ratings.partitions.length / 2)
} else {
this.numUserBlocks
}
// Same default as above; if set explicitly, numProductBlocks is used.
val numProductBlocks = if (this.numProductBlocks == -1) {
math.max(sc.defaultParallelism, ratings.partitions.length / 2)
} else {
this.numProductBlocks
}
val (floatUserFactors, floatProdFactors) = NewALS.train[Int](
ratings = ratings.map(r => NewALS.Rating(r.user, r.product, r.rating.toFloat)),
rank = rank,
numUserBlocks = numUserBlocks,
numItemBlocks = numProductBlocks,
maxIter = iterations,
regParam = lambda,
implicitPrefs = implicitPrefs,
alpha = alpha,
nonnegative = nonnegative,
intermediateRDDStorageLevel = intermediateRDDStorageLevel,
finalRDDStorageLevel = StorageLevel.NONE,
checkpointInterval = checkpointInterval,
seed = seed)
val userFactors = floatUserFactors
.mapValues(_.map(_.toDouble))
.setName("users")
.persist(finalRDDStorageLevel)
val prodFactors = floatProdFactors
.mapValues(_.map(_.toDouble))
.setName("products")
.persist(finalRDDStorageLevel)
if (finalRDDStorageLevel != StorageLevel.NONE) {
userFactors.count()
prodFactors.count()
}
new MatrixFactorizationModel(rank, userFactors, prodFactors)
}
/**
* Java-friendly version of [[ALS.run]].
*/
@Since("1.3.0")
def run(ratings: JavaRDD[Rating]): MatrixFactorizationModel = run(ratings.rdd)
}
/**
* Top-level methods for calling Alternating Least Squares (ALS) matrix factorization.
*/
@Since("0.8.0")
object ALS {
/**
* Train a matrix factorization model given an RDD of ratings by users for a subset of products.
* The ratings matrix is approximated as the product of two lower-rank matrices of a given rank
* (number of features). To solve for these features, ALS is run iteratively with a configurable
* level of parallelism.
*
* @param ratings RDD of [[Rating]] objects with userID, productID, and rating
* @param rank number of features to use
* @param iterations number of iterations of ALS
* @param lambda regularization parameter
* @param blocks level of parallelism to split computation into
* @param seed random seed for initial matrix factorization model
*/
@Since("0.9.1")
def train(
ratings: RDD[Rating],
rank: Int,
iterations: Int,
lambda: Double,
blocks: Int,
seed: Long
): MatrixFactorizationModel = {
new ALS(blocks, blocks, rank, iterations, lambda, false, 1.0, seed).run(ratings)
}
/**
* Train a matrix factorization model given an RDD of ratings by users for a subset of products.
* The ratings matrix is approximated as the product of two lower-rank matrices of a given rank
* (number of features). To solve for these features, ALS is run iteratively with a configurable
* level of parallelism.
*
* @param ratings RDD of [[Rating]] objects with userID, productID, and rating
* @param rank number of features to use
* @param iterations number of iterations of ALS
* @param lambda regularization parameter
* @param blocks level of parallelism to split computation into
*/
@Since("0.8.0")
def train(
ratings: RDD[Rating],
rank: Int,
iterations: Int,
lambda: Double,
blocks: Int
): MatrixFactorizationModel = {
new ALS(blocks, blocks, rank, iterations, lambda, false, 1.0).run(ratings)
}
/**
* Train a matrix factorization model given an RDD of ratings by users for a subset of products.
* The ratings matrix is approximated as the product of two lower-rank matrices of a given rank
* (number of features). To solve for these features, ALS is run iteratively with a level of
* parallelism automatically based on the number of partitions in `ratings`.
*
* @param ratings RDD of [[Rating]] objects with userID, productID, and rating
* @param rank number of features to use
* @param iterations number of iterations of ALS
* @param lambda regularization parameter
*/
@Since("0.8.0")
def train(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double)
: MatrixFactorizationModel = {
train(ratings, rank, iterations, lambda, -1)
}
/**
* Train a matrix factorization model given an RDD of ratings by users for a subset of products.
* The ratings matrix is approximated as the product of two lower-rank matrices of a given rank
* (number of features). To solve for these features, ALS is run iteratively with a level of
* parallelism automatically based on the number of partitions in `ratings`.
*
* @param ratings RDD of [[Rating]] objects with userID, productID, and rating
* @param rank number of features to use
* @param iterations number of iterations of ALS
*/
@Since("0.8.0")
def train(ratings: RDD[Rating], rank: Int, iterations: Int)
: MatrixFactorizationModel = {
train(ratings, rank, iterations, 0.01, -1)
}
/**
* Train a matrix factorization model given an RDD of 'implicit preferences' given by users
* to some products, in the form of (userID, productID, preference) pairs. We approximate the
* ratings matrix as the product of two lower-rank matrices of a given rank (number of features).
* To solve for these features, we run a given number of iterations of ALS. This is done using
* a level of parallelism given by `blocks`.
*
* @param ratings RDD of (userID, productID, rating) pairs
* @param rank number of features to use
* @param iterations number of iterations of ALS
* @param lambda regularization parameter
* @param blocks level of parallelism to split computation into
* @param alpha confidence parameter
* @param seed random seed for initial matrix factorization model
*/
@Since("0.8.1")
def trainImplicit(
ratings: RDD[Rating],
rank: Int,
iterations: Int,
lambda: Double,
blocks: Int,
alpha: Double,
seed: Long
): MatrixFactorizationModel = {
new ALS(blocks, blocks, rank, iterations, lambda, true, alpha, seed).run(ratings)
}
/**
* Train a matrix factorization model given an RDD of 'implicit preferences' of users for a
* subset of products. The ratings matrix is approximated as the product of two lower-rank
* matrices of a given rank (number of features). To solve for these features, ALS is run
* iteratively with a configurable level of parallelism.
*
* @param ratings RDD of [[Rating]] objects with userID, productID, and rating
* @param rank number of features to use
* @param iterations number of iterations of ALS
* @param lambda regularization parameter
* @param blocks level of parallelism to split computation into
* @param alpha confidence parameter
*/
@Since("0.8.1")
def trainImplicit(
ratings: RDD[Rating],
rank: Int,
iterations: Int,
lambda: Double,
blocks: Int,
alpha: Double
): MatrixFactorizationModel = {
new ALS(blocks, blocks, rank, iterations, lambda, true, alpha).run(ratings)
}
/**
* Train a matrix factorization model given an RDD of 'implicit preferences' of users for a
* subset of products. The ratings matrix is approximated as the product of two lower-rank
* matrices of a given rank (number of features). To solve for these features, ALS is run
* iteratively with a level of parallelism determined automatically based on the number of
* partitions in `ratings`.
*
* @param ratings RDD of [[Rating]] objects with userID, productID, and rating
* @param rank number of features to use
* @param iterations number of iterations of ALS
* @param lambda regularization parameter
* @param alpha confidence parameter
*/
@Since("0.8.1")
def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double, alpha: Double)
: MatrixFactorizationModel = {
trainImplicit(ratings, rank, iterations, lambda, -1, alpha)
}
/**
* Train a matrix factorization model given an RDD of 'implicit preferences' of users for a
* subset of products. The ratings matrix is approximated as the product of two lower-rank
* matrices of a given rank (number of features). To solve for these features, ALS is run
* iteratively with a level of parallelism determined automatically based on the number of
* partitions in `ratings`.
*
* @param ratings RDD of [[Rating]] objects with userID, productID, and rating
* @param rank number of features to use
* @param iterations number of iterations of ALS
*/
@Since("0.8.1")
def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int)
: MatrixFactorizationModel = {
trainImplicit(ratings, rank, iterations, 0.01, -1, 1.0)
}
}
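To tie the setters and the train methods together, here is a hedged usage sketch (assuming ratings: RDD[Rating] as in run above; the parameter values are illustrative, not tuned):

// Builder style, via the setters on the ALS class:
val model1 = new ALS()
  .setRank(10)
  .setIterations(10)
  .setLambda(0.01)
  .setBlocks(2)
  .setSeed(42L)
  .run(ratings)
// Equivalent one-shot call through the companion object:
val model2 = ALS.train(ratings, 10, 10, 0.01, 2, 42L)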
The MatrixFactorizationModel class
/**
* Model representing the result of matrix factorization.
*
* Note: If you create the model directly using the constructor, please be aware that fast
* prediction requires cached user/product features and their associated partitioners.
*
* @param rank rank of the factorization
* @param userFeatures RDD of tuples, each pairing a userId with the features computed for that user
* @param productFeatures RDD of tuples, each pairing a productId with the features computed for that product
*/
@Since("0.8.0")
class MatrixFactorizationModel @Since("0.8.0") (
@Since("0.8.0") val rank: Int,
@Since("0.8.0") val userFeatures: RDD[(Int, Array[Double])],
@Since("0.8.0") val productFeatures: RDD[(Int, Array[Double])])
extends Saveable with Serializable with Logging {
require(rank > 0)
validateFeatures("User", userFeatures)
validateFeatures("Product", productFeatures)
/** Validates factors and warns users if there are performance concerns. */
private def validateFeatures(name: String, features: RDD[(Int, Array[Double])]): Unit = {
require(features.first()._2.length == rank,
s"$name feature dimension does not match the rank $rank.")
if (features.partitioner.isEmpty) {
logWarning(s"$name factor does not have a partitioner. "
+ "Prediction on individual records could be slow.")
}
if (features.getStorageLevel == StorageLevel.NONE) {
logWarning(s"$name factor is not cached. Prediction could be slow.")
}
}
/** Predicts the rating of one user for one product. */
@Since("0.8.0")
def predict(user: Int, product: Int): Double = {
val userVector = userFeatures.lookup(user).head
val productVector = productFeatures.lookup(product).head
blas.ddot(rank, userVector, 1, productVector, 1)
}
/**
* Returns the approximate numbers of distinct users and products in the given usersProducts;
* this method is based on countApproxDistinct.
*
* @param usersProducts RDD of (user, product) pairs.
* @return approximate numbers of distinct users and products.
*/
private[this] def countApproxDistinctUserProduct(usersProducts: RDD[(Int, Int)]): (Long, Long) = {
val zeroCounterUser = new HyperLogLogPlus(4, 0)
val zeroCounterProduct = new HyperLogLogPlus(4, 0)
val aggregated = usersProducts.aggregate((zeroCounterUser, zeroCounterProduct))(
(hllTuple: (HyperLogLogPlus, HyperLogLogPlus), v: (Int, Int)) => {
hllTuple._1.offer(v._1)
hllTuple._2.offer(v._2)
hllTuple
},
(h1: (HyperLogLogPlus, HyperLogLogPlus), h2: (HyperLogLogPlus, HyperLogLogPlus)) => {
h1._1.addAll(h2._1)
h1._2.addAll(h2._2)
h1
})
(aggregated._1.cardinality(), aggregated._2.cardinality())
}
/**
* Predicts the ratings of many users for their products.
* The output RDD has an element per each element in the input RDD (including all duplicates)
* unless a user or product is missing from the training set.
* @param usersProducts RDD of (user, product) pairs.
* @return RDD of Ratings.
*/
@Since("0.9.0")
def predict(usersProducts: RDD[(Int, Int)]): RDD[Rating] = {
// Previously the partitions of ratings are only based on the given products.
// So if the usersProducts given for prediction contains only few products or
// even one product, the generated ratings will be pushed into few or single partition
// and can't use high parallelism.
// Here we calculate approximate numbers of users and products. Then we decide the
// partitions should be based on users or products.
val (usersCount, productsCount) = countApproxDistinctUserProduct(usersProducts)
if (usersCount < productsCount) {
val users = userFeatures.join(usersProducts).map {
case (user, (uFeatures, product)) => (product, (user, uFeatures))
}
users.join(productFeatures).map {
case (product, ((user, uFeatures), pFeatures)) =>
Rating(user, product, blas.ddot(uFeatures.length, uFeatures, 1, pFeatures, 1))
}
} else {
val products = productFeatures.join(usersProducts.map(_.swap)).map {
case (product, (pFeatures, user)) => (user, (product, pFeatures))
}
products.join(userFeatures).map {
case (user, ((product, pFeatures), uFeatures)) =>
Rating(user, product, blas.ddot(uFeatures.length, uFeatures, 1, pFeatures, 1))
}
}
}
/**
* Java-friendly version of [[MatrixFactorizationModel.predict]].
*/
@Since("1.2.0")
def predict(usersProducts: JavaPairRDD[JavaInteger, JavaInteger]): JavaRDD[Rating] = {
predict(usersProducts.rdd.asInstanceOf[RDD[(Int, Int)]]).toJavaRDD()
}
/**
* Recommends products to a user.
*
* @param user the user to recommend products to
* @param num how many products to return. The number returned may be less than this.
* @return [[Rating]] objects, each of which contains the given user ID, a product ID, and a
* "score" in the rating field. Each represents one recommended product, and they are sorted
* by score, decreasing. The first returned is the one predicted to be most strongly
* recommended to the user. The score is an opaque value that indicates how strongly
* recommended the product is.
*/
@Since("1.1.0")
def recommendProducts(user: Int, num: Int): Array[Rating] =
MatrixFactorizationModel.recommend(userFeatures.lookup(user).head, productFeatures, num)
.map(t => Rating(user, t._1, t._2))
/**
* Recommends users to a product; that is, finds the users most interested in the product.
*
* @param product the product to recommend users to
* @param num how many users to return; the number returned may be less than this
* @return [[Rating]] objects, each of which contains a user ID, the given product ID, and a
* "score" in the rating field. Each represents one recommended user, and they are sorted
* by score, decreasing; the first returned is the user predicted to be most strongly
* interested in the product.
*/
@Since("1.1.0")
def recommendUsers(product: Int, num: Int): Array[Rating] =
MatrixFactorizationModel.recommend(productFeatures.lookup(product).head, userFeatures, num)
.map(t => Rating(t._1, product, t._2))
protected override val formatVersion: String = "1.0"
/**
* Saves this model to the given path.
*
* This saves:
* - human-readable (JSON) model metadata to path/metadata/
* - Parquet formatted data to path/data/
*
* The model may be loaded using [[Loader.load]].
*
* @param sc Spark context used to save model data.
* @param path Path specifying the directory in which to save this model.
* If the directory already exists, this method throws an exception.
*/
@Since("1.3.0")
override def save(sc: SparkContext, path: String): Unit = {
MatrixFactorizationModel.SaveLoadV1_0.save(this, path)
}
/**
* Recommends the top products for all users.
*
* @param num how many products to return for every user.
* @return [(Int, Array[Rating])] objects, where every tuple contains a userID and an array of
* rating objects which contains the same userId, recommended productID and a "score" in the
* rating field. Semantics of score is same as recommendProducts API
*
*/
@Since("1.4.0")
def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
MatrixFactorizationModel.recommendForAll(rank, userFeatures, productFeatures, num).map {
case (user, top) =>
val ratings = top.map { case (product, rating) => Rating(user, product, rating) }
(user, ratings)
}
}
/**
* Recommends the top users for all products.
*
* @param num how many users to return for every product.
* @return [(Int, Array[Rating])] objects, where every tuple contains a productID and an array
* of rating objects which contains the recommended userId, same productID and a "score" in the
* rating field. Semantics of score is same as recommendUsers API
*/
@Since("1.4.0")
def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
MatrixFactorizationModel.recommendForAll(rank, productFeatures, userFeatures, num).map {
case (product, top) =>
val ratings = top.map { case (user, rating) => Rating(user, product, rating) }
(product, ratings)
}
}
}
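A hedged usage sketch of the instance API above (model is assumed to be a trained MatrixFactorizationModel):

val score = model.predict(1, 2)            // predicted rating of user 1 for product 2
val top5 = model.recommendProducts(1, 5)   // top-5 product recommendations for user 1
val fans = model.recommendUsers(2, 3)      // top-3 user recommendations for product 2

The companion object below supplies the shared recommend/recommendForAll machinery and the save/load implementation.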
@Since("1.3.0")
object MatrixFactorizationModel extends Loader[MatrixFactorizationModel] {
import org.apache.spark.mllib.util.Loader._
/**
* Makes recommendations for a single user (or product).
*/
private def recommend(
recommendToFeatures: Array[Double],
recommendableFeatures: RDD[(Int, Array[Double])],
num: Int): Array[(Int, Double)] = {
val scored = recommendableFeatures.map { case (id, features) =>
(id, blas.ddot(features.length, recommendToFeatures, 1, features, 1))
}
scored.top(num)(Ordering.by(_._2))
}
/**
* Makes recommendations for all users (or products).
* @param rank rank
* @param srcFeatures src features to receive recommendations
* @param dstFeatures dst features used to make recommendations
* @param num number of recommendations for each record
* @return an RDD of (srcId: Int, recommendations), where recommendations are stored as an array
* of (dstId, rating) pairs.
*/
private def recommendForAll(
rank: Int,
srcFeatures: RDD[(Int, Array[Double])],
dstFeatures: RDD[(Int, Array[Double])],
num: Int): RDD[(Int, Array[(Int, Double)])] = {
val srcBlocks = blockify(rank, srcFeatures)
val dstBlocks = blockify(rank, dstFeatures)
val ratings = srcBlocks.cartesian(dstBlocks).flatMap {
case ((srcIds, srcFactors), (dstIds, dstFactors)) =>
val m = srcIds.length
val n = dstIds.length
val ratings = srcFactors.transpose.multiply(dstFactors)
val output = new Array[(Int, (Int, Double))](m * n)
var k = 0
ratings.foreachActive { (i, j, r) =>
output(k) = (srcIds(i), (dstIds(j), r))
k += 1
}
output.toSeq
}
ratings.topByKey(num)(Ordering.by(_._2))
}
/**
* Blockifies features to use Level-3 BLAS.
*/
private def blockify(
rank: Int,
features: RDD[(Int, Array[Double])]): RDD[(Array[Int], DenseMatrix)] = {
val blockSize = 4096 // TODO: tune the block size
val blockStorage = rank * blockSize
features.mapPartitions { iter =>
iter.grouped(blockSize).map { grouped =>
val ids = mutable.ArrayBuilder.make[Int]
ids.sizeHint(blockSize)
val factors = mutable.ArrayBuilder.make[Double]
factors.sizeHint(blockStorage)
var i = 0
grouped.foreach { case (id, factor) =>
ids += id
factors ++= factor
i += 1
}
(ids.result(), new DenseMatrix(rank, i, factors.result()))
}
}
}
/**
* Loads a model from the given path.
*
* The model should have been saved by [[Saveable.save]].
*
* @param sc Spark context used for loading model files.
* @param path Path specifying the directory to which the model was saved.
* @return Model instance
*/
@Since("1.3.0")
override def load(sc: SparkContext, path: String): MatrixFactorizationModel = {
val (loadedClassName, formatVersion, _) = loadMetadata(sc, path)
val classNameV1_0 = SaveLoadV1_0.thisClassName
(loadedClassName, formatVersion) match {
case (className, "1.0") if className == classNameV1_0 =>
SaveLoadV1_0.load(sc, path)
case _ =>
throw new IOException("MatrixFactorizationModel.load did not recognize model with" +
s"(class: $loadedClassName, version: $formatVersion). Supported:\n" +
s" ($classNameV1_0, 1.0)")
}
}
private[recommendation]
object SaveLoadV1_0 {
private val thisFormatVersion = "1.0"
private[recommendation]
val thisClassName = "org.apache.spark.mllib.recommendation.MatrixFactorizationModel"
/**
* Saves a [[MatrixFactorizationModel]], where user features are saved under `data/users` and
* product features are saved under `data/products`.
*/
def save(model: MatrixFactorizationModel, path: String): Unit = {
val sc = model.userFeatures.sparkContext
val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._
val metadata = compact(render(
("class" -> thisClassName) ~ ("version" -> thisFormatVersion) ~ ("rank" -> model.rank)))
sc.parallelize(Seq(metadata), 1).saveAsTextFile(metadataPath(path))
model.userFeatures.toDF("id", "features").write.parquet(userPath(path))
model.productFeatures.toDF("id", "features").write.parquet(productPath(path))
}
def load(sc: SparkContext, path: String): MatrixFactorizationModel = {
implicit val formats = DefaultFormats
val sqlContext = SQLContext.getOrCreate(sc)
val (className, formatVersion, metadata) = loadMetadata(sc, path)
assert(className == thisClassName)
assert(formatVersion == thisFormatVersion)
val rank = (metadata \ "rank").extract[Int]
val userFeatures = sqlContext.read.parquet(userPath(path)).rdd.map {
case Row(id: Int, features: Seq[_]) =>
(id, features.asInstanceOf[Seq[Double]].toArray)
}
val productFeatures = sqlContext.read.parquet(productPath(path)).rdd.map {
case Row(id: Int, features: Seq[_]) =>
(id, features.asInstanceOf[Seq[Double]].toArray)
}
new MatrixFactorizationModel(rank, userFeatures, productFeatures)
}
private def userPath(path: String): String = {
new Path(dataPath(path), "user").toUri.toString
}
private def productPath(path: String): String = {
new Path(dataPath(path), "product").toUri.toString
}
}
}
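Rounding out the persistence API, a hedged save/load round trip (the path is illustrative):

model.save(sc, "hdfs:///tmp/alsModel")   // writes metadata/ as JSON and data/ as Parquet
val restored = MatrixFactorizationModel.load(sc, "hdfs:///tmp/alsModel")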
SparkML experiment
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.{SparkConf, SparkContext}
object myAls {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Als example").setMaster("local[2]")
val sc = new SparkContext(conf)
Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)
val trainData = sc.textFile("/root/application/upload/train.data")
val parseTrainData = trainData.map(_.split(',') match {
case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})
val testData = sc.textFile("/root/application/upload/test.data")
val parseTestData = testData.map(_.split(',') match {
case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})
parseTrainData.foreach(println)
val model = new ALS().setBlocks(2).run(ratings = parseTrainData)
val userProducts =parseTestData.map{
case Rating(user,product,rate) =>
(user,product)
}
val predictions = model.predict(userProducts).map{
case Rating(user,product,rate) =>
((user,product),rate)
}
predictions.foreach(println)
/** ((4,1),1.7896680396660953)
((4,3),1.0270402568376826)
((4,5),0.1556322625035942)
((2,4),0.33505846168235803)
((2,1),0.5416217248274381)
((2,3),0.4346857699980956)
((2,5),0.4549716283423277)
((1,4),1.2289770624608378)
((3,4),1.8560000519252107E-5)
((3,2),3.3417571983500647)
((5,4),-0.049730215285125445)
((5,1),3.9938137663334397)
((5,3),4.041703646645967)
*/
// The predictions are not great.
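// A hedged follow-up (not in the original experiment): join the predictions
// with the observed test ratings and compute the RMSE defined earlier.
val ratesAndPreds = parseTestData.map { case Rating(user, product, rate) =>
  ((user, product), rate)
}.join(predictions)
val rmse = math.sqrt(ratesAndPreds.map { case (_, (r, p)) =>
  val err = r - p
  err * err
}.mean())
println(s"test RMSE = $rmse")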
sc.stop()
}
}
References
1. http://cs229.stanford.edu/proj2014/Christopher%20Aberger,%20Recommender.pdf
2. http://www.grappa.univ-lille3.fr/~mary/cours/stats/centrale/reco/paper/MatrixFactorizationALS.pdf
3. http://yifanhu.net/PUB/cf.pdf