Recommender Systems: Implementing Item-Based Collaborative Filtering

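1. Introduction

Unlike Mahout, Spark does not currently offer separate item-based (ItemCF) and user-based (UserCF) collaborative filtering; it only ships the model-based collaborative filtering algorithm ALS (model-based CF). ALS performs poorly on some specific problems (scenarios with a small number of users, and scenarios where the number of items is clearly smaller than the number of users), and Spark does not provide the range of recommendation algorithms that Mahout does. To benefit from Spark's speed while meeting some business requirements, this article builds ItemCF on Spark. Spark's newer DataFrame type also makes the algorithm clearer to develop and easier to understand and implement.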

2. Common Similarity Formulas

2.1 Co-occurrence similarity (CoOccurrence)

$$w(x,y)=\frac{|N(x)\cap N(y)|}{|N(x)|}$$

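  • The denominator is the number of users interested in item x, and the numerator is the number of users interested in both item x and item y. The formula can therefore be read as the probability that a user interested in x is also interested in y (similar to association rules).
  • This formula has a problem: if item y is popular and liked by many people, w(x, y) becomes large, even close to 1, so every item ends up highly similar to popular items. The formula is therefore corrected as follows:
Improved similarity formula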

$$w(x,y)=\frac{|N(x)\cap N(y)|}{\sqrt{|N(x)||N(y)|}}$$

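  • This formula penalizes the weight of item y, which reduces the chance that a popular item appears similar to many other items (a form of normalization).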
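
The effect of the normalization can be checked on a toy case. The following is a minimal sketch, assuming each item is represented by the set of user ids interested in it; the object and method names are illustrative and not part of the article's code.

```scala
object CooccurrenceSketch {
  // Plain co-occurrence: |N(x) ∩ N(y)| / |N(x)|
  def plain(usersOfX: Set[Int], usersOfY: Set[Int]): Double =
    if (usersOfX.isEmpty) 0.0
    else (usersOfX intersect usersOfY).size.toDouble / usersOfX.size

  // Normalized co-occurrence: |N(x) ∩ N(y)| / sqrt(|N(x)| * |N(y)|)
  def normalized(usersOfX: Set[Int], usersOfY: Set[Int]): Double =
    if (usersOfX.isEmpty || usersOfY.isEmpty) 0.0
    else (usersOfX intersect usersOfY).size /
      math.sqrt(usersOfX.size.toDouble * usersOfY.size)
}
```

With a niche item liked by users {1, 2} and a popular item liked by users 1 to 100, the plain formula gives 1.0 while the normalized one gives 2 / sqrt(200), roughly 0.14, which is the penalty on popular items described above.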

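2.2 Euclidean similarity (Euclidean Similarity)

Euclidean similarity is derived from Euclidean distance: the smaller the distance, the higher the similarity, and the larger the distance, the lower the similarity.

Definition of Euclidean distance

In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" (straight-line) distance between two points in Euclidean space. With this distance, Euclidean space becomes a metric space. The associated norm is called the Euclidean norm; older literature refers to the metric as the Pythagorean metric.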
$$d_{X,Y}=\sqrt{\sum_{i=1}^n(x_i-y_i)^2}$$
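
As a rough illustration, the distance can be computed directly from the definition. The article does not state how the distance is turned into a similarity, so the mapping 1 / (1 + d) used below is an assumption, chosen only because it decreases as the distance grows; the names are illustrative.

```scala
object EuclideanSketch {
  // Euclidean distance between two rating vectors of equal length.
  def distance(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.length == y.length, "vectors must have the same length")
    math.sqrt(x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum)
  }

  // Assumed mapping from distance to similarity: 1 / (1 + d), in (0, 1].
  def similarity(x: Seq[Double], y: Seq[Double]): Double =
    1.0 / (1.0 + distance(x, y))
}
```

For the points (0, 0) and (3, 4) the distance is 5, so this assumed similarity is 1/6; identical vectors get a similarity of 1.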

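2.3 Pearson similarity

The Pearson correlation coefficient is the correlation coefficient from probability theory, with values in [-1, +1]. A value greater than zero means the two variables are positively correlated; a value less than zero means they are negatively correlated.

Definition of the Pearson correlation coefficient

The Pearson correlation coefficient of two variables is defined as their covariance divided by the product of their standard deviations: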
$$\rho_{X,Y}=\frac{cov(X,Y)}{\sigma_{x}\sigma_{y}}=\frac{E((X-\mu_x)(Y-\mu_y))}{\sigma_{x}\sigma_{y}}=\frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^2)-E^2(X)}\sqrt{E(Y^2)-E^2(Y)}}$$
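
The last form of the definition only needs running sums, which is what makes it convenient for aggregation. Below is a minimal sketch that multiplies numerator and denominator by n squared so that plain sums can be used instead of expectations; the object name is illustrative, and a constant vector (zero variance) yields NaN here rather than being handled specially.

```scala
object PearsonSketch {
  // Pearson correlation from sums:
  // [n * sum(xy) - sum(x) * sum(y)] /
  //   (sqrt(n * sum(x^2) - sum(x)^2) * sqrt(n * sum(y^2) - sum(y)^2))
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.length == y.length && x.nonEmpty, "need two non-empty vectors of equal length")
    val n = x.length.toDouble
    val sumX = x.sum
    val sumY = y.sum
    val dot = x.zip(y).map { case (a, b) => a * b }.sum
    val sumSqX = x.map(a => a * a).sum
    val sumSqY = y.map(b => b * b).sum
    val numerator = n * dot - sumX * sumY
    val denominator =
      math.sqrt(n * sumSqX - sumX * sumX) * math.sqrt(n * sumSqY - sumY * sumY)
    numerator / denominator
  }
}
```

For example, (1, 2, 3) against (2, 4, 6) gives a correlation of 1, and (1, 2, 3) against (3, 2, 1) gives -1, up to floating-point rounding.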

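2.4 Cosine similarity (Cosine Similarity)

Cosine similarity uses the cosine of the angle that two points in multidimensional space form with a chosen reference point. The value lies in [-1, 1]: the larger the cosine, the smaller the angle, the closer the two directions, and the higher the similarity.

Definition of the cosine between vectors

The cosine of the angle formed by two points in multidimensional space and the chosen reference point.

Cosine formula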

$$sim_{X,Y}=\frac{XY}{||X||\,||Y||}=\frac{\sum_{i=1}^n(x_iy_i)}{\sqrt{\sum_{i=1}^n(x_i)^2}*\sqrt{\sum_{i=1}^n(y_i)^2}}$$
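In the formula, x_i is the rating that the i-th user gave to item x, and y_i is defined likewise for item y.
This formula considers only the users' ratings, so highly rated items tend to rank first while other information about the items is ignored. An improved cosine similarity formula is as follows:

Improved cosine formula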

$$sim_{X,Y}=\frac{XY\,num_{X\cap Y}}{||X||\,||Y||\,num_{X}\,\log_{10}(10+num_{Y})}$$
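The improved formula takes into account the number of individuals the two vectors have in common, the size of vector X, and the size of vector Y. Note that it is not symmetric:
$$sim_{X,Y}\neq sim_{Y,X}$$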
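
The asymmetry is easy to see on a small example. The sketch below assumes each item is a map from user id to rating, and that the dot product and norms are summed only over users who rated both items (an assumption made here to keep the example short); num_X and num_Y are taken as the total number of raters of each item. All names are illustrative.

```scala
object CosineSketch {
  // Each item is represented as a map from userId to that user's rating.
  type Ratings = Map[Int, Double]

  // Dot product, norms and count over the users who rated both items.
  private def parts(x: Ratings, y: Ratings): (Double, Double, Double, Int) = {
    val common = (x.keySet intersect y.keySet).toSeq
    val dot = common.map(u => x(u) * y(u)).sum
    val normX = math.sqrt(common.map(u => x(u) * x(u)).sum)
    val normY = math.sqrt(common.map(u => y(u) * y(u)).sum)
    (dot, normX, normY, common.size)
  }

  def cosine(x: Ratings, y: Ratings): Double = {
    val (dot, normX, normY, _) = parts(x, y)
    if (normX == 0.0 || normY == 0.0) 0.0 else dot / (normX * normY)
  }

  // dot * num(X ∩ Y) / (||X|| * ||Y|| * num(X) * log10(10 + num(Y)))
  def improvedCosine(x: Ratings, y: Ratings): Double = {
    val (dot, normX, normY, numCommon) = parts(x, y)
    if (normX == 0.0 || normY == 0.0) 0.0
    else dot * numCommon / (normX * normY * x.size * math.log10(10.0 + y.size))
  }
}
```

With x rated by users 1 and 2 and y rated by users 1, 2 and 3 (identical ratings on the common users), the plain cosine is 1 in both directions, while the improved value is 1 / log10(13) from x to y but (2/3) / log10(12) from y to x.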

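2.5 Tanimoto similarity (Jaccard coefficient)

Tanimoto similarity, also known as the Jaccard coefficient, is an extension of cosine similarity and is commonly used to compute document similarity. It ignores rating values and considers only how many individuals the two sets share. Here ||·|| denotes the number of elements in a set.
$$sim(x,y)=\frac{||X\cap Y||}{||X||+||Y||-||X\cap Y||}$$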
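
A minimal sketch of this coefficient on sets of user ids follows; the object name is illustrative, and returning 0.0 for two empty sets is a convention chosen here, not something the article specifies.

```scala
object JaccardSketch {
  // |X ∩ Y| / (|X| + |Y| - |X ∩ Y|), i.e. intersection size over union size.
  def jaccard(x: Set[Int], y: Set[Int]): Double = {
    val inCommon = (x intersect y).size
    val union = x.size + y.size - inCommon
    if (union == 0) 0.0 else inCommon.toDouble / union
  }
}
```

For {1, 2, 3} and {2, 3, 4} the intersection has 2 elements and the union has 4, so the coefficient is 0.5.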

3. Formula for Predicting User Ratings

$$pred_{u,p}=\frac{\sum_{i\in ratedItems(u)}sim(i,p)\,r_{u,i}}{\sum_{i\in ratedItems(u)}sim(i,p)}$$
Here u is a user, p is an item, ratedItems(u) is the set of items that user u has rated, sim is the similarity between items, and r is the user's rating of an item.
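
The prediction is a similarity-weighted average of the user's own ratings. The sketch below assumes the user's ratings and the similarities to the target item p are both given as maps keyed by item id; items without a known similarity to p are skipped, and `None` is returned when the weights sum to zero. These choices and all names are illustrative.

```scala
object PredictionSketch {
  // userRatings: itemId -> rating given by user u
  // simToTarget: itemId -> sim(itemId, p) for the target item p
  def predict(userRatings: Map[Int, Double], simToTarget: Map[Int, Double]): Option[Double] = {
    // Keep (similarity, rating) pairs for rated items with a known similarity to p.
    val terms = userRatings.toSeq.collect {
      case (item, rating) if simToTarget.contains(item) => (simToTarget(item), rating)
    }
    val weightSum = terms.map(_._1).sum
    if (weightSum == 0.0) None
    else Some(terms.map { case (s, r) => s * r }.sum / weightSum)
  }
}
```

If the user rated item 1 with 5 and item 2 with 3, and their similarities to p are 0.8 and 0.2, the predicted rating is (0.8 * 5 + 0.2 * 3) / (0.8 + 0.2) = 4.6.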

4. Building the ItemCF Model

Case class definitions
// Item information
case class Item(itemId: Int, itemName: String)
// User-item-rating triple
case class ItemRating(userId: Int, itemId: Int, rating: Double)
// User information
case class User(userId: Int, userName: String)
Similarity measures
/**
  * SIMILARITY MEASURES
  */
object SimilarityMeasures {

  /**
    * The Co-occurrence similarity between two vectors A, B is
    * |N(i) ∩ N(j)| / sqrt(|N(i)||N(j)|)
    */
  def cooccurrence(numOfRatersForAAndB: Long, numOfRatersForA: Long, numOfRatersForB: Long): Double = {
    numOfRatersForAAndB / math.sqrt(numOfRatersForA * numOfRatersForB)
  }

  /**
    * The correlation between two vectors A, B is
    * cov(A, B) / (stdDev(A) * stdDev(B))
    *
    * This is equivalent to
    * [n * dotProduct(A, B) - sum(A) * sum(B)] /
    * sqrt{ [n * norm(A)^2 - sum(A)^2] [n * norm(B)^2 - sum(B)^2] }
    */
  def correlation(size: Double, dotProduct: Double, ratingSum: Double,
                  rating2Sum: Double, ratingNormSq: Double, rating2NormSq: Double): Double = {

    val numerator = size * dotProduct - ratingSum * rating2Sum
    val denominator = scala.math.sqrt(size * ratingNormSq - ratingSum * ratingSum) *
      scala.math.sqrt(size * rating2NormSq - rating2Sum * rating2Sum)

    numerator / denominator
  }

  /**
    * Regularize correlation by adding virtual pseudocounts over a prior:
    * RegularizedCorrelation = w * ActualCorrelation + (1 - w) * PriorCorrelation
    * where w = # actualPairs / (# actualPairs + # virtualPairs).
    */
  def regularizedCorrelation(size: Double, dotProduct: Double, ratingSum: Double,
                             rating2Sum: Double, ratingNormSq: Double, rating2NormSq: Double,
                             virtualCount: Double, priorCorrelation: Double): Double = {

    val unregularizedCorrelation = correlation(size, dotProduct, ratingSum, rating2Sum, ratingNormSq, rating2NormSq)
    val w = size / (size + virtualCount)

    w * unregularizedCorrelation + (1 - w) * priorCorrelation
  }

  /**
    * The cosine similarity between two vectors A, B is
    * dotProduct(A, B) / (norm(A) * norm(B))
    */
  def cosineSimilarity(dotProduct: Double, ratingNorm: Double, rating2Norm: Double): Double = {
    dotProduct / (ratingNorm * rating2Norm)
  }

  /**
    * The improved cosine similarity between two vectors A, B is
    * dotProduct(A, B) * num(A ∩ B) / (norm(A) * norm(B) * num(A) * log10(10 + num(B)))
    */
  def improvedCosineSimilarity(dotProduct: Double, ratingNorm: Double, rating2Norm: Double
                               , numAjoinB: Long, numA: Long, numB: Long): Double = {
    dotProduct * numAjoinB / (ratingNorm * rating2Norm * numA * math.log10(10 + numB))
  }

  /**
    * The Jaccard Similarity between two sets A, B is
    * |Intersection(A, B)| / |Union(A, B)|
    */
  def jaccardSimilarity(usersInCommon: Double, totalUsers1: Double, totalUsers2: Double): Double = {
    val union = totalUsers1 + totalUsers2 - usersInCommon
    usersInCommon / union
  }
}
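Similarity computation
// Imports used by the functions and the main object in this section
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import scala.collection.mutable.ListBuffer
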
def computeAlgorithm(ratings:Dataset[ItemRating], spark:SparkSession): Option[DataFrame] = {
    import spark.implicits._
    val defaultParallelism = spark.sparkContext.defaultParallelism
    val numRatersPerItem = ratings.groupBy("itemId").count().alias("nor").coalesce(defaultParallelism)

    val ratingsWithSize = ratings.join(numRatersPerItem, "itemId").coalesce(defaultParallelism)
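    /**
     * Self-join on userId, producing
     * (userId, item1, rating1, nor1, item2, rating2, nor2, product, rating1Pow, rating2Pow)
     * product: the rating of item1 multiplied by the rating of item2
     * rating1Pow = pow(rating1, 2): the squared rating of item1 (rating2Pow likewise)
     */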
    ratingsWithSize.join(ratingsWithSize, "userId")
      .toDF("userId", "item1", "rating1", "nor1", "item2", "rating2", "nor2")
      .selectExpr("userId"
        , "item1", "rating1", "nor1"
        , "item2", "rating2", "nor2"
        , "rating1 * rating2 as product"
        , "pow(rating1, 2) as rating1Pow"
        , "pow(rating2, 2) as rating2Pow")
      .coalesce(defaultParallelism).createOrReplaceTempView("joined")
    spark.sql("select * from joined").show(false)
    /**
     * For each pair of items (item1, item2),
     * compute the required intermediate data; note the WHERE restriction.
     * size: the number of users who rated both item1 and item2
     */
    val sparseMatrix = spark.sql(
      """
        |SELECT item1
        |, item2
        |, count(userId) as size
        |, sum(product) as dotProduct
        |, sum(rating1) as ratingSum1
        |, sum(rating2) as ratingSum2
        |, sum(rating1Pow)  as  ratingSumOfSq1
        |, sum(rating2Pow)  as  ratingSumOfSq2
        |, first(nor1)  as nor1
        |, first(nor2)  as nor2
        |FROM joined
        |WHERE item1 < item2
        |GROUP BY item1, item2
      """.stripMargin).coalesce(defaultParallelism).cache()
    //  Compute the item-to-item similarities
    val sim = sparseMatrix.map(row => {
      val size = row.getAs[Long](2)
      val dotProduct = row.getAs[Double](3)
      val ratingSum1 = row.getAs[Double](4)
      val ratingSum2 = row.getAs[Double](5)
      val ratingSumOfSq1 = row.getAs[Double](6)
      val ratingSumOfSq2 = row.getAs[Double](7)
      val numRaters1 = row.getAs[Long](8)
      val numRaters2 = row.getAs[Long](9)
      // Core algorithm
      // 1. Co-occurrence similarity of item1 and item2
      val cooc = SimilarityMeasures.cooccurrence(size, numRaters1, numRaters2)
      // 2. Correlation between the rating vectors of item1 and item2
      val corr = SimilarityMeasures.correlation(size, dotProduct, ratingSum1, ratingSum2, ratingSumOfSq1, ratingSumOfSq2)
      // 3. Regularized correlation. A prior correlation must lie in [-1, 1];
      //    0 (no prior correlation) is used here instead of the out-of-range value 15.
      val PRIOR_COUNT = 10
      val PRIOR_CORRELATION = 0
      val regCorr = SimilarityMeasures.regularizedCorrelation(size, dotProduct, ratingSum1, ratingSum2,
        ratingSumOfSq1, ratingSumOfSq2, PRIOR_COUNT, PRIOR_CORRELATION)
      // 4. Cosine similarity of item1 and item2
      val cosSim = SimilarityMeasures.cosineSimilarity(dotProduct, scala.math.sqrt(ratingSumOfSq1), scala.math.sqrt(ratingSumOfSq2))
      // 5. Improved cosine similarity of item1 and item2 (both norms come from the sums of squares)
      val impCosSim = SimilarityMeasures.improvedCosineSimilarity(dotProduct, math.sqrt(ratingSumOfSq1), math.sqrt(ratingSumOfSq2), size, numRaters1, numRaters2)
      // 6. Tanimoto (Jaccard) similarity
      val jaccard = SimilarityMeasures.jaccardSimilarity(size, numRaters1, numRaters2)
      (row.getInt(0), row.getInt(1), cooc, corr, regCorr, cosSim, impCosSim, jaccard)
    }).toDF("itemId_01", "itemId_02", "cooc", "corr", "regCorr", "cosSim", "impCosSim", "jaccard")
    
    //compute item to item similarity
    //recommendForItems(spark, sim, maxRecommendNum, "impCosSim")
    Option(sim)
  }
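
To see what the self-join and the GROUP BY produce without starting Spark, the same aggregation can be sketched on plain Scala collections. This is only an illustration under the assumption that the input fits in memory; the `Rating` and `PairStats` types and the object name are hypothetical, and only `size` and `dotProduct` are computed.

```scala
object PairStatsSketch {
  final case class Rating(userId: Int, itemId: Int, rating: Double)
  final case class PairStats(size: Long, dotProduct: Double)

  // For every pair (item1, item2) with item1 < item2, count the users who
  // rated both items and sum the products of their two ratings.
  def pairStats(ratings: Seq[Rating]): Map[(Int, Int), PairStats] = {
    val products = for {
      (_, byUser) <- ratings.groupBy(_.userId).toSeq // "join on userId"
      a <- byUser
      b <- byUser
      if a.itemId < b.itemId                          // "WHERE item1 < item2"
    } yield ((a.itemId, b.itemId), a.rating * b.rating)

    // "GROUP BY item1, item2"
    products.groupBy(_._1).map { case (pair, rows) =>
      pair -> PairStats(rows.size.toLong, rows.map(_._2).sum)
    }
  }
}
```

With user 1 rating items 10 and 20 as 4 and 5, and user 2 rating items 10, 20 and 30 as 2, 3 and 1, the pair (10, 20) has size 2 and dot product 4 * 5 + 2 * 3 = 26, while (10, 30) and (20, 30) each have size 1.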
Similar items for each item
/**
   * Recommend MAX_RECOMMEND_NUM similar items for each item
   * @param similarityMeasure name of the similarity column to use
   */
 def recommendForItems(spark: SparkSession, sim: DataFrame, similarityMeasure: String): Unit = {
    import spark.implicits._
    sim.createOrReplaceTempView("t_item_similarity")
    val resultDF = spark.sql(
      s"""
         |SELECT t.itemId_01
         |,  t.itemId_02
         |,  t.$similarityMeasure
         |FROM t_item_similarity t
         |WHERE NOT isnan(t.$similarityMeasure)
      """.stripMargin)
      .map(row => {
        var newVal = row.getDouble(2)
        // Round finite values to 8 decimal places; BigDecimal cannot represent NaN or Infinity
        if (!newVal.isNaN && !newVal.isInfinity) {
          val bg = BigDecimal(newVal)
          newVal = bg.setScale(8, BigDecimal.RoundingMode.HALF_UP).doubleValue()
        }
        (row.getInt(0), (row.getInt(1), newVal))
      })
      })
      .rdd
      .aggregateByKey(ListBuffer[(Int, Double)]())((u, v) => {
        u += v
        u.sortBy(_._2).reverse.take(MAX_RECOMMEND_NUM)
      }, (u1, u2) => {
        u1 ++= u2
        u1.sortBy(_._2).reverse.take(MAX_RECOMMEND_NUM)
      })
      .toDF("item_id", "recommends")
    //resultDF: similarity item save to db
  }
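
The aggregation step keeps, for each key, only the highest-scoring candidates. A minimal sketch of that top-N selection on a plain sequence is shown below; it assumes NaN scores should be discarded, and the object and method names are illustrative rather than part of the article's code.

```scala
object TopNSketch {
  // Keep the n candidates with the highest score, best first, ignoring NaN scores.
  def topN(candidates: Seq[(Int, Double)], n: Int): List[(Int, Double)] =
    candidates
      .filterNot { case (_, score) => score.isNaN }
      .sortBy { case (_, score) => -score }
      .take(n)
      .toList
}
```

For the candidates (1, 0.2), (2, 0.9), (3, NaN) and (4, 0.5) with n = 2, the result is (2, 0.9) followed by (4, 0.5).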
Items a user may like
/**
   * Recommend MAX_RECOMMEND_NUM items for the given users
   *
   * @param users the set of users
   * @return the recommendation table
   */
  def recommendForUsers(spark:SparkSession, ratings:Dataset[ItemRating], users: Dataset[User], similarities: Option[DataFrame]):DataFrame= {
    import spark.implicits._
    val defaultParallelism = spark.sparkContext.defaultParallelism
    val similarityMeasure = "impCosSim"
    //  similarityMeasure is the name of the similarity column; impCosSim selects the improved cosine similarity
    val sim = similarities.get.select("itemId_01", "itemId_02", similarityMeasure)
    val rits = ratings

    val project: DataFrame = users
      .selectExpr("userId as user", "userName")
      //  Sub-projection: the left table is much smaller than the right one, so use a left join
      .join(rits, $"user" <=> rits("userId"), "left")
      .drop($"user")
      //  Select the items and ratings related to these users
      .select("userId", "itemId", "rating")

    // Get the similarity between the items each user is interested in and other items
    project.join(sim, $"itemId" <=> sim("itemId_01"))
      .selectExpr("userId"
        , "itemId_01 as relatedItem"
        , "itemId_02 as otherItem"
        , similarityMeasure
        , s"$similarityMeasure * rating as simProduct")
      .coalesce(defaultParallelism)
      .createOrReplaceTempView("t_tempTable")

    val resultDF = spark.sql(
      s"""
         |SELECT userId
         |,  otherItem
         |,  sum(simProduct) / sum($similarityMeasure) as rating
         |FROM t_tempTable
         |GROUP BY userId, otherItem
      """.stripMargin)
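      .map(row => {
        var newVal = row.getDouble(2)
        // Round finite values to 8 decimal places; BigDecimal cannot represent NaN or Infinity
        if (!newVal.isNaN && !newVal.isInfinity) {
          val bg = BigDecimal(newVal)
          newVal = bg.setScale(8, BigDecimal.RoundingMode.HALF_UP).doubleValue()
        }
        (row.getInt(0), (row.getInt(1), newVal))
      })
      // Drop predictions that are NaN
      .filter(p => !p._2._2.isNaN)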
      .rdd
      .aggregateByKey(ListBuffer[(Int, Double)]())((u, v) => {
        u += v
        u.sortBy(_._2).reverse.take(MAX_RECOMMEND_NUM)
      }, (u1, u2) => {
        u1 ++= u2
        u1.sortBy(_._2).reverse.take(MAX_RECOMMEND_NUM)
      })
      .toDF("user_id", "recommends")
    resultDF
  }
/**
 * Copyright (c) 2023-2088 XXLB All Rights Reserved
 * Project:itemcf
 * Package:com.spark.ml.recommender.itemcf
 * Version 1.0
 * Main entry point: item-based recommendation, producing recommendations for the given users
 * @author libiao
 * @since 2023-01-20 17:49
 */
 object ItemCFRecommender {
  val MAX_RECOMMEND_NUM = 20

  def main(args: Array[String]): Unit = {
    //Create a Spark config
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("ItemRecommender")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    try {
      val resource = getClass.getResource("/u.data.txt")
      val userResource = getClass.getResource("/u.user.txt")
      import spark.implicits._
      //Load the data
      val dataStream = spark.sparkContext.textFile(resource.getPath)
      val movieRatinDs = dataStream.map(item => {
        //Fields of each rating record are separated by whitespace
        val attr = item.split("\\s+")
        ItemRating(attr(0).toInt, attr(1).trim.toInt, attr(2).trim.toDouble)
      }).toDS()

      val userDataStream = spark.sparkContext.textFile(userResource.getPath)
      val userDS = userDataStream.map(item => {
        //Fields of each user record are separated by |
        val attr = item.split("\\|")
        User(attr(0).toInt, attr(3).trim)
      }).toDS()
      //compute similarity
      val similarities = computeAlgorithm(movieRatinDs, spark)
      //recommend for users
      recommendForUsers(spark, movieRatinDs, userDS, similarities)
    } catch {
      case e: Exception => throw e
    } finally {
      spark.stop()
    }
  }
}
Cassandra load and save helpers
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

import scala.collection.mutable
import scala.reflect.runtime.{universe => ru}
import scala.util.matching.Regex

/**
 * Copyright (c) 2020-2088 LEE POSSIBLE All Rights Reserved
 * Project:spark-ml
 * Package:com.spark.ml.recommender.itemcf
 * Version 1.0
 */
object CommonUtils {

  def getFileds[T: ru.TypeTag]: List[String] ={
    val calssSymbol = ru.typeOf[T].typeSymbol.asClass
    val fields = calssSymbol.typeSignature.members.collect {
      case m: ru.MethodSymbol if m.isCaseAccessor => m
    }
    fields.map(_.name.toString).toList
  }

  def getFieldMap[T: ru.TypeTag]: mutable.Map[String, String]={
    val maps = mutable.Map[String, String]()
    val fields = getFileds[T]
    for (field <- fields){
      val snakeName = convertFieldNameToSnakeCase(field)
      maps += (field -> snakeName)
    }
    maps
  }

  def convertFieldNameToSnakeCase(fieldName: String): String= {
    val regex: Regex = "([a-z])([A-Z]+)".r
    regex.replaceAllIn(fieldName, "$1_$2").toLowerCase
  }
}
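
The camelCase to snake_case mapping can be checked in isolation. The sketch below assumes the intended pattern is one lowercase letter followed by a run of uppercase letters, with an underscore inserted between the two groups; the object and method names are illustrative.

```scala
object SnakeCaseSketch {
  // Group 1: a lowercase letter; group 2: the uppercase run that follows it.
  private val boundary = "([a-z])([A-Z]+)".r

  // Insert "_" at each lower-to-upper boundary, then lowercase everything.
  def toSnakeCase(fieldName: String): String =
    boundary.replaceAllIn(fieldName, "$1_$2").toLowerCase
}
```

Under this assumption `userId` becomes `user_id`, `itemName` becomes `item_name`, and a name without uppercase letters such as `rating` is returned unchanged.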

object RDCassandraObject{
  def saveToCass(saveDF: DataFrame, keyspace: String, tableName: String): Unit = {
    saveDF.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> keyspace, "table" -> tableName))
      .mode(SaveMode.Append)
      .option("spark.cassandra.output.consistency.level", "ONE")
      .save()
  }

  def readFromCass(spark: SparkSession, keyspace: String, tableName: String): DataFrame = {
    val userItemDF = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> keyspace, "table" -> tableName))
      .load()
      .toDF("userId", "itemId", "rating")
    userItemDF
  }

  /**
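   * Read a DataFrame whose Cassandra column names are the snake_case form of the class fields (e.g. column: user_id, class field: userId)
   * @param spark
   * @param keyspace
   * @param tableName
   * @tparam T case class whose fields map to the table columns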
   * @return
   */
  def readFromCass[T: scala.reflect.runtime.universe.TypeTag](spark: SparkSession, keyspace: String, tableName: String):DataFrame = {
    val customMap = CommonUtils.getFieldMap[T]
    val userItemDF = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> keyspace, "table" -> tableName))
      .load()
      .selectExpr(customMap.map(kv => s"${kv._2} as ${kv._1}").toSeq: _*)

    userItemDF.printSchema

    userItemDF
  }
}

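5. Similarity Results

The data comes from MovieLens. The MovieLens dataset is a movie rating dataset; it contains rating information, obtained from IMDB and The Movie DataBase, for 1,700 movies rated by 1,000 users.

5.1 Computed item-to-item similarities

The following shows the movie-to-movie similarities computed with co-occurrence similarity, cosine similarity, and the improved cosine similarity (please test the other measures yourself), using Star Wars (1977) as the test movie (only the top 20 results are shown).

The cosine similarity results do not look very satisfying. This may be because cosine similarity depends only on user ratings (it is better suited to recommending 5-star movies and ignores things like genre), or my implementation may contain a mistake; corrections are welcome in the comments. One contributing factor is visible in the code: the norms are summed only over users who rated both movies, so any pair with a single common rater gets a cosine of exactly 1.0.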

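5.2 Co-occurrence similarity results

| movie1 | movie2 | cooccurrence |
| --- | --- | --- |
| Star Wars (1977) | Return of the Jedi (1983) | 0.8828826458931883 |
| Star Wars (1977) | Raiders of the Lost Ark (1981) | 0.7679353753201742 |
| Star Wars (1977) | Empire Strikes Back, The (1980) | 0.7458505006229118 |
| Star Wars (1977) | Godfather, The (1972) | 0.7275434127191666 |
| Star Wars (1977) | Fargo (1996) | 0.7239858668831711 |
| Star Wars (1977) | Independence Day (ID4) (1996) | 0.723845113716724 |
| Star Wars (1977) | Silence of the Lambs, The (1991) | 0.7025515983155468 |
| Star Wars (1977) | Indiana Jones and the Last Crusade (1989) | 0.6920306174608959 |
| Star Wars (1977) | Pulp Fiction (1994) | 0.6885437675802282 |
| Star Wars (1977) | Star Trek: First Contact (1996) | 0.6850249237265413 |
| Star Wars (1977) | Back to the Future (1985) | 0.6840536741086217 |
| Star Wars (1977) | Fugitive, The (1993) | 0.6710463728397225 |
| Star Wars (1977) | Rock, The (1996) | 0.6646215466055597 |
| Star Wars (1977) | Terminator, The (1984) | 0.6636319257721421 |
| Star Wars (1977) | Forrest Gump (1994) | 0.6564951869930893 |
| Star Wars (1977) | Terminator 2: Judgment Day (1991) | 0.653467518885383 |
| Star Wars (1977) | Princess Bride, The (1987) | 0.6534487891771482 |
| Star Wars (1977) | Alien (1979) | 0.648232034779792 |
| Star Wars (1977) | E.T. the Extra-Terrestrial (1982) | 0.6479990753086882 |
| Star Wars (1977) | Monty Python and the Holy Grail (1974) | 0.6476896799641126 |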

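5.3 Cosine similarity results

| movie1 | movie2 | cosSim |
| --- | --- | --- |
| Star Wars (1977) | Infinity (1996) | 1.0 |
| Star Wars (1977) | Mostro, Il (1994) | 1.0 |
| Star Wars (1977) | Boys, Les (1997) | 1.0 |
| Star Wars (1977) | Stranger, The (1994) | 1.0 |
| Star Wars (1977) | Love Is All There Is (1996) | 1.0 |
| Star Wars (1977) | Paris Was a Woman (1995) | 1.0 |
| Star Wars (1977) | Damsel in Distress, A (1937) | 1.0 |
| Star Wars (1977) | Pie in the Sky (1995) | 1.0 |
| Star Wars (1977) | Century (1993) | 1.0 |
| Star Wars (1977) | Angel on My Shoulder (1946) | 1.0 |
| Star Wars (1977) | Here Comes Cookie (1935) | 1.0 |
| Star Wars (1977) | Power 98 (1995) | 1.0 |
| Star Wars (1977) | Lady of Burlesque (1943) | 1.0 |
| Star Wars (1977) | Volcano (1996) | 1.0 |
| Star Wars (1977) | Unforgettable Summer, An (1994) | 1.0 |
| Star Wars (1977) | Innocents, The (1961) | 1.0 |
| Star Wars (1977) | Sleepover (1995) | 1.0 |
| Star Wars (1977) | Jupiter's Wife (1994) | 1.0 |
| Star Wars (1977) | My Life and Times With Antonin Artaud (En compagnie d'Antonin Artaud) (1993) | 1.0 |
| Star Wars (1977) | Bent (1997) | 1.0 |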

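5.4 Improved cosine similarity results

| movie1 | movie2 | impCosSim |
| --- | --- | --- |
| Star Wars (1977) | Return of the Jedi (1983) | 0.6151374130038775 |
| Star Wars (1977) | Raiders of the Lost Ark (1981) | 0.5139215764696529 |
| Star Wars (1977) | Fargo (1996) | 0.4978221397190352 |
| Star Wars (1977) | Empire Strikes Back, The (1980) | 0.47719131109655355 |
| Star Wars (1977) | Godfather, The (1972) | 0.4769568086870377 |
| Star Wars (1977) | Silence of the Lambs, The (1991) | 0.449096021012343 |
| Star Wars (1977) | Independence Day (ID4) (1996) | 0.4334888029282058 |
| Star Wars (1977) | Pulp Fiction (1994) | 0.43054394420596026 |
| Star Wars (1977) | Contact (1997) | 0.4093441266211224 |
| Star Wars (1977) | Indiana Jones and the Last Crusade (1989) | 0.4080635382244593 |
| Star Wars (1977) | Back to the Future (1985) | 0.4045977014813726 |
| Star Wars (1977) | Star Trek: First Contact (1996) | 0.40036290288050874 |
| Star Wars (1977) | Fugitive, The (1993) | 0.3987919640908379 |
| Star Wars (1977) | Princess Bride, The (1987) | 0.39490206690864144 |
| Star Wars (1977) | Rock, The (1996) | 0.39100622194841916 |
| Star Wars (1977) | Monty Python and the Holy Grail (1974) | 0.3799595474408077 |
| Star Wars (1977) | Terminator, The (1984) | 0.37881311350029406 |
| Star Wars (1977) | Forrest Gump (1994) | 0.3755685058241706 |
| Star Wars (1977) | Terminator 2: Judgment Day (1991) | 0.37184317281514295 |
| Star Wars (1977) | Jerry Maguire (1996) | 0.370478212770262 |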
