Spark Study Notes: A Recommender System (Collaborative Filtering to Recommend Artists to Users)

This is the second project in Advanced Analytics with Spark: starting from simple records of user, artist, and play count, we build a system that recommends artists to users.


(1) Getting the data

miaofu@miaofu-Virtual-Machine:~/user_artist_data$ wget http://www.iro.umontreal.ca/~lisa/datasets/profiledata_06-May-2005.tar.gz
--2016-09-12 14:14:10--  http://www.iro.umontreal.ca/~lisa/datasets/profiledata_06-May-2005.tar.gz
Resolving www.iro.umontreal.ca (www.iro.umontreal.ca)... 132.204.26.36
Connecting to www.iro.umontreal.ca (www.iro.umontreal.ca)|132.204.26.36|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 135880312 (130M) [application/x-gzip]
Saving to: 'profiledata_06-May-2005.tar.gz'

profiledata_06-May-2005.tar.gz                100%[================================================================================================>] 129.58M  1.40MB/s   in 90s  

2016-09-12 14:15:44 (1.44 MB/s) - 'profiledata_06-May-2005.tar.gz' saved [135880312/135880312]

miaofu@miaofu-Virtual-Machine:~/user_artist_data$ ls
profiledata_06-May-2005.tar.gz
miaofu@miaofu-Virtual-Machine:~/user_artist_data$ tar -zxvf profiledata_06-May-2005.tar.gz 
profiledata_06-May-2005/
profiledata_06-May-2005/artist_data.txt
profiledata_06-May-2005/README.txt
profiledata_06-May-2005/user_artist_data.txt
profiledata_06-May-2005/artist_alias.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data$ ls
profiledata_06-May-2005  profiledata_06-May-2005.tar.gz
miaofu@miaofu-Virtual-Machine:~/user_artist_data$ cd profiledata_06-May-2005/
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ ls
artist_alias.txt  artist_data.txt  README.txt  user_artist_data.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ ls -l -h
total 464M
-rw-r--r-- 1 miaofu miaofu 2.8M May  6  2005 artist_alias.txt
-rw-r--r-- 1 miaofu miaofu  54M May  6  2005 artist_data.txt
-rw-r--r-- 1 miaofu miaofu 1.3K May 11  2005 README.txt
-rw-r--r-- 1 miaofu miaofu 407M May  5  2005 user_artist_data.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ vi user_artist_data.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ vi artist_alias.txt 
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ vi artist_data.txt 
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ vi README.txt 
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ pwd
/home/miaofu/user_artist_data/profiledata_06-May-2005
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ hadoop fs -put /home/miaofu/user_artist_data/profiledata_06-May-2005/
artist_alias.txt      artist_data.txt       README.txt            user_artist_data.txt  
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ hadoop fs -put /home/miaofu/user_artist_data/profiledata_06-May-2005/
artist_alias.txt      artist_data.txt       README.txt            user_artist_data.txt  
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ hadoop fs -put /home/miaofu/user_artist_data/profiledata_06-May-2005/user_artist_data.txt 
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ hadoop fs ls /user/miaofu/
ls: Unknown command
Did you mean -ls?  This command begins with a dash.
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ hadoop fs -ls /user/miaofu/
Found 4 items
drwxr-xr-x   - miaofu supergroup          0 2016-07-25 21:03 /user/miaofu/follower
drwxr-xr-x   - miaofu supergroup          0 2016-07-25 20:14 /user/miaofu/grep-temp-31205318
drwxr-xr-x   - miaofu supergroup          0 2016-08-27 22:02 /user/miaofu/linkage
-rw-r--r--   2 miaofu supergroup  426761761 2016-09-12 14:22 /user/miaofu/user_artist_data.txt

Now put the other two data files into HDFS as well:

miaofu@miaofu-Virtual-Machine:~$ cd user_artist_data/profiledata_06-May-2005/
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ ls
artist_alias.txt  artist_data.txt  README.txt  user_artist_data.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ pwd
/home/miaofu/user_artist_data/profiledata_06-May-2005
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ hadoop fs -put artist_alias.txt artist_data.txt 
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ hadoop fs -ls .
Found 5 items
-rw-r--r--   2 miaofu supergroup    2932731 2016-09-12 16:18 artist_data.txt
drwxr-xr-x   - miaofu supergroup          0 2016-07-25 21:03 follower
drwxr-xr-x   - miaofu supergroup          0 2016-07-25 20:14 grep-temp-31205318
drwxr-xr-x   - miaofu supergroup          0 2016-08-27 22:02 linkage
-rw-r--r--   2 miaofu supergroup  426761761 2016-09-12 14:22 user_artist_data.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ hadoop fs -put artist_alias.txt 
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ hadoop fs -ls .
Found 6 items
-rw-r--r--   2 miaofu supergroup    2932731 2016-09-12 16:19 artist_alias.txt
-rw-r--r--   2 miaofu supergroup    2932731 2016-09-12 16:18 artist_data.txt
drwxr-xr-x   - miaofu supergroup          0 2016-07-25 21:03 follower
drwxr-xr-x   - miaofu supergroup          0 2016-07-25 20:14 grep-temp-31205318
drwxr-xr-x   - miaofu supergroup          0 2016-08-27 22:02 linkage
-rw-r--r--   2 miaofu supergroup  426761761 2016-09-12 14:22 user_artist_data.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ 


(2) An overview of the code

Last login: Mon Sep 12 13:38:36 on ttys006
miaofudeMacBook-Pro:aas miaofu$ ls
LICENSE			ch03-recommender	ch06-lsa		ch09-risk		common
README.md		ch04-rdf		ch07-graph		ch10-genomics		pom.xml
ch02-intro		ch05-kmeans		ch08-geotime		ch11-neuro		simplesparkproject
miaofudeMacBook-Pro:aas miaofu$ 
miaofudeMacBook-Pro:aas miaofu$ vi ch03-recommender/src/main/scala/com/cloudera/datascience/recommender/RunRecommender.scala 
miaofudeMacBook-Pro:aas miaofu$ cat ch
ch02-intro/       ch03-recommender/ ch04-rdf/         ch05-kmeans/      ch06-lsa/         ch07-graph/       ch08-geotime/     ch09-risk/        ch10-genomics/    ch11-neuro/
miaofudeMacBook-Pro:aas miaofu$ cat ch03-recommender/src/main/scala/com/cloudera/datascience/recommender/RunRecommender.scala 
/*
 * Copyright 2015 and onwards Sanford Ryza, Juliet Hougland, Uri Laserson, Sean Owen and Joshua Wills
 *
 * See LICENSE file for further information.
 */

package com.cloudera.datascience.recommender

import scala.collection.Map
import scala.collection.mutable.ArrayBuffer
import scala.util.Random
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.recommendation.{ALS, ALSModel}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions._

object RunRecommender {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()

    val base = "hdfs:///user/ds/"
    val rawUserArtistData = spark.read.textFile(base + "user_artist_data.txt")
    val rawArtistData = spark.read.textFile(base + "artist_data.txt")
    val rawArtistAlias = spark.read.textFile(base + "artist_alias.txt")

    val runRecommender = new RunRecommender(spark)
    runRecommender.preparation(rawUserArtistData, rawArtistData, rawArtistAlias)
    runRecommender.model(rawUserArtistData, rawArtistData, rawArtistAlias)
    runRecommender.evaluate(rawUserArtistData, rawArtistAlias)
    runRecommender.recommend(rawUserArtistData, rawArtistData, rawArtistAlias)
  }

}

class RunRecommender(private val spark: SparkSession) {

  import spark.implicits._

  def preparation(
      rawUserArtistData: Dataset[String],
      rawArtistData: Dataset[String],
      rawArtistAlias: Dataset[String]): Unit = {

    val userArtistDF = rawUserArtistData.map { line =>
      val Array(user, artist, _*) = line.split(' ')
      (user.toInt, artist.toInt)
    }.toDF("user", "artist")

    userArtistDF.agg(min("user"), max("user"), min("artist"), max("artist")).show()

    val artistByID = buildArtistByID(rawArtistData)
    val artistAlias = buildArtistAlias(rawArtistAlias)

    val (badID, goodID) = artistAlias.head
    artistByID.filter($"id" isin (badID, goodID)).show()
  }

  def model(
      rawUserArtistData: Dataset[String],
      rawArtistData: Dataset[String],
      rawArtistAlias: Dataset[String]): Unit = {

    val bArtistAlias = spark.sparkContext.broadcast(buildArtistAlias(rawArtistAlias))

    val trainData = buildCounts(rawUserArtistData, bArtistAlias).cache()

    val model = new ALS().
      setImplicitPrefs(true).
      setRank(10).
      setRegParam(0.01).
      setAlpha(1.0).
      setMaxIter(5).
      setUserCol("user").
      setItemCol("artist").
      setRatingCol("count").
      setPredictionCol("prediction").
      fit(trainData)

    trainData.unpersist()

    model.userFactors.select("features").show(truncate = false)

    val userID = 2093760

    val existingArtistIDs = trainData.
      filter($"user" === userID).
      select("artist").as[Int].collect()

    val artistByID = buildArtistByID(rawArtistData)

    artistByID.filter($"id" isin (existingArtistIDs:_*)).show()

    val topRecommendations = makeRecommendations(model, userID, 5)
    topRecommendations.show()

    val recommendedArtistIDs = topRecommendations.select("artist").as[Int].collect()

    artistByID.filter($"id" isin (recommendedArtistIDs:_*)).show()

    model.userFactors.unpersist()
    model.itemFactors.unpersist()
  }

  def evaluate(
      rawUserArtistData: Dataset[String],
      rawArtistAlias: Dataset[String]): Unit = {

    val bArtistAlias = spark.sparkContext.broadcast(buildArtistAlias(rawArtistAlias))

    val allData = buildCounts(rawUserArtistData, bArtistAlias)
    val Array(trainData, cvData) = allData.randomSplit(Array(0.9, 0.1))
    trainData.cache()
    cvData.cache()

    val allArtistIDs = allData.select("artist").as[Int].distinct().collect()
    val bAllArtistIDs = spark.sparkContext.broadcast(allArtistIDs)

    val mostListenedAUC = areaUnderCurve(cvData, bAllArtistIDs, predictMostListened(trainData))
    println(mostListenedAUC)

    val evaluations =
      for (rank     <- Seq(5,  30);
           regParam <- Seq(1.0, 0.0001);
           alpha    <- Seq(1.0, 40.0))
      yield {
        val model = new ALS().
          setImplicitPrefs(true).
          setRank(rank).setRegParam(regParam).
          setAlpha(alpha).setMaxIter(20).
          setUserCol("user").setItemCol("artist").
          setRatingCol("count").setPredictionCol("prediction").
          fit(trainData)

        val auc = areaUnderCurve(cvData, bAllArtistIDs, model.transform)

        model.userFactors.unpersist()
        model.itemFactors.unpersist()

        (auc, (rank, regParam, alpha))
      }

    evaluations.sorted.reverse.foreach(println)

    trainData.unpersist()
    cvData.unpersist()
  }

  def recommend(
      rawUserArtistData: Dataset[String],
      rawArtistData: Dataset[String],
      rawArtistAlias: Dataset[String]): Unit = {

    val bArtistAlias = spark.sparkContext.broadcast(buildArtistAlias(rawArtistAlias))
    val allData = buildCounts(rawUserArtistData, bArtistAlias).cache()
    val model = new ALS().
      setImplicitPrefs(true).
      setRank(10).setRegParam(1.0).setAlpha(40.0).setMaxIter(20).
      setUserCol("user").setItemCol("artist").
      setRatingCol("count").setPredictionCol("prediction").
      fit(allData)
    allData.unpersist()

    val userID = 2093760
    val topRecommendations = makeRecommendations(model, userID, 5)

    val recommendedArtistIDs = topRecommendations.select("artist").as[Int].collect()
    val artistByID = buildArtistByID(rawArtistData)
    artistByID.join(spark.createDataset(recommendedArtistIDs).toDF("id"), "id").
      select("name").show()

    model.userFactors.unpersist()
    model.itemFactors.unpersist()
  }

  def buildArtistByID(rawArtistData: Dataset[String]): DataFrame = {
    rawArtistData.flatMap { line =>
      val (id, name) = line.span(_ != '\t')
      if (name.isEmpty) {
        None
      } else {
        try {
          Some((id.toInt, name.trim))
        } catch {
          case _: NumberFormatException => None
        }
      }
    }.toDF("id", "name")
  }

  def buildArtistAlias(rawArtistAlias: Dataset[String]): Map[Int,Int] = {
    rawArtistAlias.flatMap { line =>
      val Array(artist, alias) = line.split('\t')
      if (artist.isEmpty) {
        None
      } else {
        Some((artist.toInt, alias.toInt))
      }
    }.collect().toMap
  }

  def buildCounts(
      rawUserArtistData: Dataset[String],
      bArtistAlias: Broadcast[Map[Int,Int]]): DataFrame = {
    rawUserArtistData.map { line =>
      val Array(userID, artistID, count) = line.split(' ').map(_.toInt)
      val finalArtistID = bArtistAlias.value.getOrElse(artistID, artistID)
      (userID, finalArtistID, count)
    }.toDF("user", "artist", "count")
  }

  def makeRecommendations(model: ALSModel, userID: Int, howMany: Int): DataFrame = {
    val toRecommend = model.itemFactors.
      select($"id".as("artist")).
      withColumn("user", lit(userID))
    model.transform(toRecommend).
      select("artist", "prediction").
      orderBy($"prediction".desc).
      limit(howMany)
  }

  def areaUnderCurve(
      positiveData: DataFrame,
      bAllArtistIDs: Broadcast[Array[Int]],
      predictFunction: (DataFrame => DataFrame)): Double = {

    // What this actually computes is AUC, per user. The result is actually something
    // that might be called "mean AUC".

    // Take held-out data as the "positive".
    // Make predictions for each of them, including a numeric score
    val positivePredictions = predictFunction(positiveData.select("user", "artist")).
      withColumnRenamed("prediction", "positivePrediction")

    // BinaryClassificationMetrics.areaUnderROC is not used here since there are really lots of
    // small AUC problems, and it would be inefficient, when a direct computation is available.

    // Create a set of "negative" products for each user. These are randomly chosen
    // from among all of the other artists, excluding those that are "positive" for the user.
    val negativeData = positiveData.select("user", "artist").as[(Int,Int)].
      groupByKey { case (user, _) => user }.
      flatMapGroups { case (userID, userIDAndPosArtistIDs) =>
        val random = new Random()
        val posItemIDSet = userIDAndPosArtistIDs.map { case (_, artist) => artist }.toSet
        val negative = new ArrayBuffer[Int]()
        val allArtistIDs = bAllArtistIDs.value
        var i = 0
        // Make at most one pass over all artists to avoid an infinite loop.
        // Also stop when number of negative equals positive set size
        while (i < allArtistIDs.length && negative.size < posItemIDSet.size) {
          val artistID = allArtistIDs(random.nextInt(allArtistIDs.length))
          // Only add new distinct IDs
          if (!posItemIDSet.contains(artistID)) {
            negative += artistID
          }
          i += 1
        }
        // Return the set with user ID added back
        negative.map(artistID => (userID, artistID))
      }.toDF("user", "artist")

    // Make predictions on the rest:
    val negativePredictions = predictFunction(negativeData).
      withColumnRenamed("prediction", "negativePrediction")

    // Join positive predictions to negative predictions by user, only.
    // This will result in a row for every possible pairing of positive and negative
    // predictions within each user.
    val joinedPredictions = positivePredictions.join(negativePredictions, "user").
      select("user", "positivePrediction", "negativePrediction").cache()

    // Count the number of pairs per user
    val allCounts = joinedPredictions.
      groupBy("user").agg(count(lit("1")).as("total")).
      select("user", "total")
    // Count the number of correctly ordered pairs per user
    val correctCounts = joinedPredictions.
      filter($"positivePrediction" > $"negativePrediction").
      groupBy("user").agg(count("user").as("correct")).
      select("user", "correct")

    // Combine these, compute their ratio, and average over all users
    val meanAUC = allCounts.join(correctCounts, "user").
      select($"user", ($"correct" / $"total").as("auc")).
      agg(mean("auc")).
      as[Double].first()

    joinedPredictions.unpersist()

    meanAUC
  }

  def predictMostListened(train: DataFrame)(allData: DataFrame): DataFrame = {
    val listenCounts = train.groupBy("artist").
      agg(sum("count").as("prediction")).
      select("artist", "prediction")
    allData.
      join(listenCounts, Seq("artist"), "left_outer").
      select("user", "artist", "prediction")
  }

}
miaofudeMacBook-Pro:aas miaofu$ 


(3) Working in spark-shell

Launch spark-shell:

miaofu@miaofu-Virtual-Machine:~$ spark-shell
16/09/14 10:19:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/

Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.7.0_95)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
16/09/14 10:19:14 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/09/14 10:19:14 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/09/14 10:19:22 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/09/14 10:19:22 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/09/14 10:19:26 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/09/14 10:19:26 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
SQL context available as sqlContext.


Load the data

scala>     val base = "hdfs:///user/ds/"
base: String = hdfs:///user/ds/

scala>     val base = "hdfs:///user/miaofu/"
base: String = hdfs:///user/miaofu/

scala>     val rawArtistData = spark.read.textFile(base + "artist_data.txt")
<console>:27: error: object read is not a member of package spark
             val rawArtistData = spark.read.textFile(base + "artist_data.txt")
                                       ^
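
This error is expected: the shell above is Spark 1.6, which has no SparkSession, so the name spark resolves to the org.apache.spark package (hence "object read is not a member of package spark"). The Dataset-based code in RunRecommender targets Spark 2.x; in the 1.6 shell we fall back to sc.textFile and the RDD API instead.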

scala>     val rawArtistData = sc.textFile(base + "artist_data.txt")
rawArtistData: org.apache.spark.rdd.RDD[String] = hdfs:///user/miaofu/artist_data.txt MapPartitionsRDD[1] at textFile at <console>:29

scala> val head = rawArtistData.take(10)
head: Array[String] = Array(1134999	06Crazy Life, 6821360	Pang Nakarin, 10113088	Terfel, Bartoli- Mozart: Don, 10151459	The Flaming Sidebur, 6826647	Bodenstandig 3000, 10186265	Jota Quest e Ivete Sangalo, 6828986	Toto_XX (1977, 10236364	U.S Bombs -, 1135000	artist formaly know as Mat, 10299728	Kassierer - Musik für beide Ohren)

scala> head.foreach(println)
1134999	06Crazy Life
6821360	Pang Nakarin
10113088	Terfel, Bartoli- Mozart: Don
10151459	The Flaming Sidebur
6826647	Bodenstandig 3000
10186265	Jota Quest e Ivete Sangalo
6828986	Toto_XX (1977
10236364	U.S Bombs -
1135000	artist formaly know as Mat
10299728	Kassierer - Musik für beide Ohren

Clean the data

scala> def f1(line:String)={
     | val g = line.split("\t")
     | (g(0).toInt,g(1).trim)
     | }
f1: (line: String)(Int, String)

scala> head.foreach(println)
1134999	06Crazy Life
6821360	Pang Nakarin
10113088	Terfel, Bartoli- Mozart: Don
10151459	The Flaming Sidebur
6826647	Bodenstandig 3000
10186265	Jota Quest e Ivete Sangalo
6828986	Toto_XX (1977
10236364	U.S Bombs -
1135000	artist formaly know as Mat
10299728	Kassierer - Musik für beide Ohren
scala> head.map(f1).foreach(println)
(1134999,06Crazy Life)
(6821360,Pang Nakarin)
(10113088,Terfel, Bartoli- Mozart: Don)
(10151459,The Flaming Sidebur)
(6826647,Bodenstandig 3000)
(10186265,Jota Quest e Ivete Sangalo)
(6828986,Toto_XX (1977)
(10236364,U.S Bombs -)
(1135000,artist formaly know as Mat)
(10299728,Kassierer - Musik für beide Ohren)


This naive parse has problems: in the full file a few lines are malformed (the name after the tab may be missing, or the ID may not be a valid number), so f1 would throw an exception on them, and map() gives us no way to simply drop such records. This is where flatMap and Option come in.

Option is a Scala type that represents a value which may or may not be present; it has two subclasses, Some and None.

scala> val myMap: Map[String, String] = Map("key1" -> "value")
myMap: Map[String,String] = Map(key1 -> value)

scala> val value1: Option[String] = myMap.get("key1")
value1: Option[String] = Some(value)

scala> val value2: Option[String] = myMap.get("key2")
value2: Option[String] = None

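Option also provides getOrElse, which returns the wrapped value, or a supplied default when the Option is None; we will rely on it later when substituting canonical artist IDs. A throwaway illustration (output shown without the REPL's res counter):

scala> myMap.getOrElse("key2", "no value")
res: String = no value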

Why do we need flatMap?

map() requires that exactly one value be returned for every input, so it cannot be used here. Another approach would be to drop unparseable lines with filter(), but that would duplicate the parsing logic. flatMap() is the right tool when each input element maps to zero, one, or more results: it simply flattens the collection of results produced for each input into one larger RDD. It works with Scala collections, and also with Scala's Option type. Option represents a value that may be absent, a bit like a collection containing either one element or zero, where one corresponds to Some and zero to None. So although the function inside the flatMap below could just as well return an empty List or a one-element List, using Some and None is simpler and clearer.
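
A tiny illustration of how flatMap and Option cooperate (a throwaway example, unrelated to the dataset): values wrapped in Some survive, while None simply disappears from the result.

scala> Seq("1", "oops", "3").flatMap(s => scala.util.Try(s.toInt).toOption)
res: Seq[Int] = List(1, 3)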

With this in mind, we rewrite the parsing function f1 above as follows.

scala> val artistByID = rawArtistData.flatMap { line =>
     |       val (id, name) = line.span(_ != '\t')
     |       if (name.isEmpty) {
     |         None
     |       } else {
     |         try {
     |           Some((id.toInt, name.trim))
     |         } catch {
     |           case _: NumberFormatException => None
     |         }
     |       }
     |     }
artistByID: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[2] at flatMap at <console>:31

scala> artistByID
res9: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[2] at flatMap at <console>:31

scala> val rawArtistAlias  = sc.textFile(base+"artist_alias.txt")
rawArtistAlias: org.apache.spark.rdd.RDD[String] = hdfs:///user/miaofu/artist_alias.txt MapPartitionsRDD[4] at textFile at <console>:29

scala> val artistAlias = rawArtistAlias.flatMap { line =>
     |       val Array(artist, alias) = line.split('\t')
     |       if (artist.isEmpty) {
     |         None
     |       } else {
     |         Some((artist.toInt, alias.toInt))
     |       }
     |     }.collect().toMap
artistAlias: scala.collection.immutable.Map[Int,Int] = Map(1208690 -> 1003926, 2012757 -> 4569, 6949139 -> 1085752, 1109727 -> 1239120, 6772751 -> 1244705, 2070533 -> 1021544, 1157679 -> 2194, 9969617 -> 5630, 2034496 -> 1116214, 6764342 -> 40, 1272489 -> 1278238, 2108744 -> 1009267, 10349857 -> 1000052, 2145319 -> 1020463, 2126338 -> 2717, 10165456 -> 1001169, 6779368 -> 1239506, 10278137 -> 1001523, 9939075 -> 1329390, 2037201 -> 1274155, 1248585 -> 2885, 1106945 -> 1399, 6811322 -> 1019016, 9978396 -> 1784, 6676961 -> 1086433, 2117821 -> 2611, 6863616 -> 1277013, 6895480 -> 1000993, 6831632 -> 1246136, 1001719 -> 1009727, 10135633 -> 4250, 7029291 -> 1034635, 6967939 -> 1002734, 6864694 -> 1017311, 1237279 -> 1029752, 6793956 -> 1283231, 1208609 -> 1000699, 6693428 -> 1100258, 685174...

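As a quick sanity check, mirroring the preparation() step in RunRecommender, we can pick one alias pair and look up both IDs in artistByID. Which pair head returns, and therefore which names come back, depends on the data, so this is just a sketch with the output omitted:

scala> val (badID, goodID) = artistAlias.head
scala> artistByID.lookup(badID).headOption
scala> artistByID.lookup(goodID).headOption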

Build the training set

scala> import org.apache.spark.mllib.recommendation._
import org.apache.spark.mllib.recommendation._

scala> val bArtistAlias = sc.broadcastcast(artistAlias)
<console>:36: error: value broadcastcast is not a member of org.apache.spark.SparkContext
         val bArtistAlias = sc.broadcastcast(artistAlias)
                               ^

scala> val bArtistAlias = sc.broadcast(artistAlias)
bArtistAlias: org.apache.spark.broadcast.Broadcast[scala.collection.immutable.Map[Int,Int]] = Broadcast(3)
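
Broadcasting wraps the alias Map so that it is shipped to each executor once and cached there, instead of being re-serialized with every task closure that references it.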

scala> 

scala> val rawUserArtistData = sc.textFile(base + "user_artist_data.txt")
rawUserArtistData: org.apache.spark.rdd.RDD[String] = hdfs:///user/miaofu/user_artist_data.txt MapPartitionsRDD[7] at textFile at <console>:32

scala> val trainData = rawUserArtistData.map{line => 
     | val Array(userID,artistID,count) = line.split(" ").map(_.toInt)
     | val finalArtistID =
     | bArtistAlias.value.getOrElse(artistID,artistID)
     | Rating(userID,finalArtistID,count)
     | }.cache()
trainData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating] = MapPartitionsRDD[8] at map at <console>:40
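
Rating is MLlib's case class of (user: Int, product: Int, rating: Double); here the artist plays the role of the product, and the raw play count serves as the implicit-feedback rating.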

scala> trainData.count()
[Stage 1:==============>                                            (1 + 3) / 4]16/09/14 12:47:18 WARN MemoryStore: Not enough space to cache rdd_8_1 in memory! (computed 172.8 MB so far)
16/09/14 12:47:18 WARN MemoryStore: Not enough space to cache rdd_8_0 in memory! (computed 172.8 MB so far)
16/09/14 12:47:21 WARN MemoryStore: Not enough space to cache rdd_8_2 in memory! (computed 172.8 MB so far)
res0: Long = 24296858                                                           

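count() is the first action on trainData, so it forces the lazy map to run and populates the cache; the MemoryStore warnings show that several of the ~173 MB partitions do not fit in the executor's storage memory, an early sign of the memory pressure that surfaces during training below.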

Train the model

scala> val model = ALS.trainImplicit(trainData,10,5,0.01,1.0)
[Stage 2:==============>                                            (1 + 3) / 4]16/09/14 12:48:29 WARN MemoryStore: Not enough space to cache rdd_8_2 in memory! (computed 172.8 MB so far)
16/09/14 12:48:29 WARN MemoryStore: Not enough space to cache rdd_8_0 in memory! (computed 172.8 MB so far)
16/09/14 12:48:30 WARN MemoryStore: Not enough space to cache rdd_8_1 in memory! (computed 172.8 MB so far)
[Stage 3:>                                                          (0 + 4) / 4]16/09/14 12:49:12 ERROR Executor: Managed memory leak detected; size = 115703758 bytes, TID = 12
16/09/14 12:49:13 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 12)
Because my personal machine has limited memory, the job fails at this point. I'll stop here for now and come back once I have a machine with a larger configuration.
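
For reference, the arguments to ALS.trainImplicit above are (ratings, rank = 10, iterations = 5, lambda = 0.01, alpha = 1.0). Once the job has enough memory to complete, MLlib's MatrixFactorizationModel can produce per-user recommendations directly; a minimal sketch, reusing the trainData built above and the example user ID 2093760 from RunRecommender (not run here):

scala> val model = ALS.trainImplicit(trainData, 10, 5, 0.01, 1.0)
scala> val topRecommendations = model.recommendProducts(2093760, 5)
scala> topRecommendations.foreach(println)

Each printed Rating holds (user, product, rating), where product is the recommended artist ID and rating is the model's score; the IDs can be mapped back to names with artistByID.lookup.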

