This is the second project from Advanced Analytics with Spark: recommending artists to users from simple records of (user, artist, play count). The dataset consists of three files: user_artist_data.txt, with space-separated lines of user ID, artist ID and play count; artist_data.txt, mapping artist IDs to artist names; and artist_alias.txt, mapping misspelled or variant artist IDs to the ID of the canonical entry.
(1) Fetch the data
miaofu@miaofu-Virtual-Machine:~/user_artist_data$ wget http://www.iro.umontreal.ca/~lisa/datasets/profiledata_06-May-2005.tar.gz
--2016-09-12 14:14:10-- http://www.iro.umontreal.ca/~lisa/datasets/profiledata_06-May-2005.tar.gz
Resolving www.iro.umontreal.ca (www.iro.umontreal.ca)... 132.204.26.36
Connecting to www.iro.umontreal.ca (www.iro.umontreal.ca)|132.204.26.36|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 135880312 (130M) [application/x-gzip]
Saving to: 'profiledata_06-May-2005.tar.gz'
profiledata_06-May-2005.tar.gz 100%[================================================================================================>] 129.58M 1.40MB/s in 90s
2016-09-12 14:15:44 (1.44 MB/s) - 'profiledata_06-May-2005.tar.gz' saved [135880312/135880312]
miaofu@miaofu-Virtual-Machine:~/user_artist_data$ ls
profiledata_06-May-2005.tar.gz
miaofu@miaofu-Virtual-Machine:~/user_artist_data$ tar -zxvf profiledata_06-May-2005.tar.gz
profiledata_06-May-2005/
profiledata_06-May-2005/artist_data.txt
profiledata_06-May-2005/README.txt
profiledata_06-May-2005/user_artist_data.txt
profiledata_06-May-2005/artist_alias.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data$ ls
profiledata_06-May-2005 profiledata_06-May-2005.tar.gz
miaofu@miaofu-Virtual-Machine:~/user_artist_data$ cd profiledata_06-May-2005/
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ ls
artist_alias.txt artist_data.txt README.txt user_artist_data.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ ls -l -h
total 464M
-rw-r--r-- 1 miaofu miaofu 2.8M May 6 2005 artist_alias.txt
-rw-r--r-- 1 miaofu miaofu 54M May 6 2005 artist_data.txt
-rw-r--r-- 1 miaofu miaofu 1.3K May 11 2005 README.txt
-rw-r--r-- 1 miaofu miaofu 407M May 5 2005 user_artist_data.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ vi user_artist_data.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ vi artist_alias.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ vi artist_data.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ vi README.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ pwd
/home/miaofu/user_artist_data/profiledata_06-May-2005
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ hadoop fs -put /home/miaofu/user_artist_data/profiledata_06-May-2005/
artist_alias.txt artist_data.txt README.txt user_artist_data.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ hadoop fs -put /home/miaofu/user_artist_data/profiledata_06-May-2005/
artist_alias.txt artist_data.txt README.txt user_artist_data.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ hadoop fs -put /home/miaofu/user_artist_data/profiledata_06-May-2005/user_artist_data.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ hadoop fs ls /user/miaofu/
ls: Unknown command
Did you mean -ls? This command begins with a dash.
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ hadoop fs -ls /user/miaofu/
Found 4 items
drwxr-xr-x - miaofu supergroup 0 2016-07-25 21:03 /user/miaofu/follower
drwxr-xr-x - miaofu supergroup 0 2016-07-25 20:14 /user/miaofu/grep-temp-31205318
drwxr-xr-x - miaofu supergroup 0 2016-08-27 22:02 /user/miaofu/linkage
-rw-r--r-- 2 miaofu supergroup 426761761 2016-09-12 14:22 /user/miaofu/user_artist_data.txt
Then put the other two data files into HDFS as well:
miaofu@miaofu-Virtual-Machine:~$ cd user_artist_data/profiledata_06-May-2005/
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ ls
artist_alias.txt artist_data.txt README.txt user_artist_data.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ pwd
/home/miaofu/user_artist_data/profiledata_06-May-2005
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ hadoop fs -put artist_alias.txt artist_data.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ hadoop fs -ls .
Found 5 items
-rw-r--r-- 2 miaofu supergroup 2932731 2016-09-12 16:18 artist_data.txt
drwxr-xr-x - miaofu supergroup 0 2016-07-25 21:03 follower
drwxr-xr-x - miaofu supergroup 0 2016-07-25 20:14 grep-temp-31205318
drwxr-xr-x - miaofu supergroup 0 2016-08-27 22:02 linkage
-rw-r--r-- 2 miaofu supergroup 426761761 2016-09-12 14:22 user_artist_data.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ hadoop fs -put artist_alias.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$ hadoop fs -ls .
Found 6 items
-rw-r--r-- 2 miaofu supergroup 2932731 2016-09-12 16:19 artist_alias.txt
-rw-r--r-- 2 miaofu supergroup 2932731 2016-09-12 16:18 artist_data.txt
drwxr-xr-x - miaofu supergroup 0 2016-07-25 21:03 follower
drwxr-xr-x - miaofu supergroup 0 2016-07-25 20:14 grep-temp-31205318
drwxr-xr-x - miaofu supergroup 0 2016-08-27 22:02 linkage
-rw-r--r-- 2 miaofu supergroup 426761761 2016-09-12 14:22 user_artist_data.txt
miaofu@miaofu-Virtual-Machine:~/user_artist_data/profiledata_06-May-2005$
(2) An overview of the code
Last login: Mon Sep 12 13:38:36 on ttys006
miaofudeMacBook-Pro:aas miaofu$ ls
LICENSE ch03-recommender ch06-lsa ch09-risk common
README.md ch04-rdf ch07-graph ch10-genomics pom.xml
ch02-intro ch05-kmeans ch08-geotime ch11-neuro simplesparkproject
miaofudeMacBook-Pro:aas miaofu$
miaofudeMacBook-Pro:aas miaofu$ vi ch03-recommender/src/main/scala/com/cloudera/datascience/recommender/RunRecommender.scala
miaofudeMacBook-Pro:aas miaofu$ cat ch
ch02-intro/ ch03-recommender/ ch04-rdf/ ch05-kmeans/ ch06-lsa/ ch07-graph/ ch08-geotime/ ch09-risk/ ch10-genomics/ ch11-neuro/
miaofudeMacBook-Pro:aas miaofu$ cat ch03-recommender/src/main/scala/com/cloudera/datascience/recommender/RunRecommender.scala
/*
* Copyright 2015 and onwards Sanford Ryza, Juliet Hougland, Uri Laserson, Sean Owen and Joshua Wills
*
* See LICENSE file for further information.
*/
package com.cloudera.datascience.recommender
import scala.collection.Map
import scala.collection.mutable.ArrayBuffer
import scala.util.Random
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.recommendation.{ALS, ALSModel}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions._
object RunRecommender {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().getOrCreate()
val base = "hdfs:///user/ds/"
val rawUserArtistData = spark.read.textFile(base + "user_artist_data.txt")
val rawArtistData = spark.read.textFile(base + "artist_data.txt")
val rawArtistAlias = spark.read.textFile(base + "artist_alias.txt")
val runRecommender = new RunRecommender(spark)
runRecommender.preparation(rawUserArtistData, rawArtistData, rawArtistAlias)
runRecommender.model(rawUserArtistData, rawArtistData, rawArtistAlias)
runRecommender.evaluate(rawUserArtistData, rawArtistAlias)
runRecommender.recommend(rawUserArtistData, rawArtistData, rawArtistAlias)
}
}
class RunRecommender(private val spark: SparkSession) {
import spark.implicits._
def preparation(
rawUserArtistData: Dataset[String],
rawArtistData: Dataset[String],
rawArtistAlias: Dataset[String]): Unit = {
val userArtistDF = rawUserArtistData.map { line =>
val Array(user, artist, _*) = line.split(' ')
(user.toInt, artist.toInt)
}.toDF("user", "artist")
userArtistDF.agg(min("user"), max("user"), min("artist"), max("artist")).show()
val artistByID = buildArtistByID(rawArtistData)
val artistAlias = buildArtistAlias(rawArtistAlias)
val (badID, goodID) = artistAlias.head
artistByID.filter($"id" isin (badID, goodID)).show()
}
def model(
rawUserArtistData: Dataset[String],
rawArtistData: Dataset[String],
rawArtistAlias: Dataset[String]): Unit = {
val bArtistAlias = spark.sparkContext.broadcast(buildArtistAlias(rawArtistAlias))
val trainData = buildCounts(rawUserArtistData, bArtistAlias).cache()
val model = new ALS().
setImplicitPrefs(true).
setRank(10).
setRegParam(0.01).
setAlpha(1.0).
setMaxIter(5).
setUserCol("user").
setItemCol("artist").
setRatingCol("count").
setPredictionCol("prediction").
fit(trainData)
trainData.unpersist()
model.userFactors.select("features").show(truncate = false)
val userID = 2093760
val existingArtistIDs = trainData.
filter($"user" === userID).
select("artist").as[Int].collect()
val artistByID = buildArtistByID(rawArtistData)
artistByID.filter($"id" isin (existingArtistIDs:_*)).show()
val topRecommendations = makeRecommendations(model, userID, 5)
topRecommendations.show()
val recommendedArtistIDs = topRecommendations.select("artist").as[Int].collect()
artistByID.filter($"id" isin (recommendedArtistIDs:_*)).show()
model.userFactors.unpersist()
model.itemFactors.unpersist()
}
def evaluate(
rawUserArtistData: Dataset[String],
rawArtistAlias: Dataset[String]): Unit = {
val bArtistAlias = spark.sparkContext.broadcast(buildArtistAlias(rawArtistAlias))
val allData = buildCounts(rawUserArtistData, bArtistAlias)
val Array(trainData, cvData) = allData.randomSplit(Array(0.9, 0.1))
trainData.cache()
cvData.cache()
val allArtistIDs = allData.select("artist").as[Int].distinct().collect()
val bAllArtistIDs = spark.sparkContext.broadcast(allArtistIDs)
val mostListenedAUC = areaUnderCurve(cvData, bAllArtistIDs, predictMostListened(trainData))
println(mostListenedAUC)
val evaluations =
for (rank <- Seq(5, 30);
regParam <- Seq(1.0, 0.0001);
alpha <- Seq(1.0, 40.0))
yield {
val model = new ALS().
setImplicitPrefs(true).
setRank(rank).setRegParam(regParam).
setAlpha(alpha).setMaxIter(20).
setUserCol("user").setItemCol("artist").
setRatingCol("count").setPredictionCol("prediction").
fit(trainData)
val auc = areaUnderCurve(cvData, bAllArtistIDs, model.transform)
model.userFactors.unpersist()
model.itemFactors.unpersist()
(auc, (rank, regParam, alpha))
}
evaluations.sorted.reverse.foreach(println)
trainData.unpersist()
cvData.unpersist()
}
def recommend(
rawUserArtistData: Dataset[String],
rawArtistData: Dataset[String],
rawArtistAlias: Dataset[String]): Unit = {
val bArtistAlias = spark.sparkContext.broadcast(buildArtistAlias(rawArtistAlias))
val allData = buildCounts(rawUserArtistData, bArtistAlias).cache()
val model = new ALS().
setImplicitPrefs(true).
setRank(10).setRegParam(1.0).setAlpha(40.0).setMaxIter(20).
setUserCol("user").setItemCol("artist").
setRatingCol("count").setPredictionCol("prediction").
fit(allData)
allData.unpersist()
val userID = 2093760
val topRecommendations = makeRecommendations(model, userID, 5)
val recommendedArtistIDs = topRecommendations.select("artist").as[Int].collect()
val artistByID = buildArtistByID(rawArtistData)
artistByID.join(spark.createDataset(recommendedArtistIDs).toDF("id"), "id").
select("name").show()
model.userFactors.unpersist()
model.itemFactors.unpersist()
}
def buildArtistByID(rawArtistData: Dataset[String]): DataFrame = {
rawArtistData.flatMap { line =>
val (id, name) = line.span(_ != '\t')
if (name.isEmpty) {
None
} else {
try {
Some((id.toInt, name.trim))
} catch {
case _: NumberFormatException => None
}
}
}.toDF("id", "name")
}
def buildArtistAlias(rawArtistAlias: Dataset[String]): Map[Int,Int] = {
rawArtistAlias.flatMap { line =>
val Array(artist, alias) = line.split('\t')
if (artist.isEmpty) {
None
} else {
Some((artist.toInt, alias.toInt))
}
}.collect().toMap
}
def buildCounts(
rawUserArtistData: Dataset[String],
bArtistAlias: Broadcast[Map[Int,Int]]): DataFrame = {
rawUserArtistData.map { line =>
val Array(userID, artistID, count) = line.split(' ').map(_.toInt)
val finalArtistID = bArtistAlias.value.getOrElse(artistID, artistID)
(userID, finalArtistID, count)
}.toDF("user", "artist", "count")
}
def makeRecommendations(model: ALSModel, userID: Int, howMany: Int): DataFrame = {
val toRecommend = model.itemFactors.
select($"id".as("artist")).
withColumn("user", lit(userID))
model.transform(toRecommend).
select("artist", "prediction").
orderBy($"prediction".desc).
limit(howMany)
}
def areaUnderCurve(
positiveData: DataFrame,
bAllArtistIDs: Broadcast[Array[Int]],
predictFunction: (DataFrame => DataFrame)): Double = {
// What this actually computes is AUC, per user. The result is actually something
// that might be called "mean AUC".
// Take held-out data as the "positive".
// Make predictions for each of them, including a numeric score
val positivePredictions = predictFunction(positiveData.select("user", "artist")).
withColumnRenamed("prediction", "positivePrediction")
// BinaryClassificationMetrics.areaUnderROC is not used here since there are really lots of
// small AUC problems, and it would be inefficient, when a direct computation is available.
// Create a set of "negative" products for each user. These are randomly chosen
// from among all of the other artists, excluding those that are "positive" for the user.
val negativeData = positiveData.select("user", "artist").as[(Int,Int)].
groupByKey { case (user, _) => user }.
flatMapGroups { case (userID, userIDAndPosArtistIDs) =>
val random = new Random()
val posItemIDSet = userIDAndPosArtistIDs.map { case (_, artist) => artist }.toSet
val negative = new ArrayBuffer[Int]()
val allArtistIDs = bAllArtistIDs.value
var i = 0
// Make at most one pass over all artists to avoid an infinite loop.
// Also stop when number of negative equals positive set size
while (i < allArtistIDs.length && negative.size < posItemIDSet.size) {
val artistID = allArtistIDs(random.nextInt(allArtistIDs.length))
// Only add new distinct IDs
if (!posItemIDSet.contains(artistID)) {
negative += artistID
}
i += 1
}
// Return the set with user ID added back
negative.map(artistID => (userID, artistID))
}.toDF("user", "artist")
// Make predictions on the rest:
val negativePredictions = predictFunction(negativeData).
withColumnRenamed("prediction", "negativePrediction")
// Join positive predictions to negative predictions by user, only.
// This will result in a row for every possible pairing of positive and negative
// predictions within each user.
val joinedPredictions = positivePredictions.join(negativePredictions, "user").
select("user", "positivePrediction", "negativePrediction").cache()
// Count the number of pairs per user
val allCounts = joinedPredictions.
groupBy("user").agg(count(lit("1")).as("total")).
select("user", "total")
// Count the number of correctly ordered pairs per user
val correctCounts = joinedPredictions.
filter($"positivePrediction" > $"negativePrediction").
groupBy("user").agg(count("user").as("correct")).
select("user", "correct")
// Combine these, compute their ratio, and average over all users
val meanAUC = allCounts.join(correctCounts, "user").
select($"user", ($"correct" / $"total").as("auc")).
agg(mean("auc")).
as[Double].first()
joinedPredictions.unpersist()
meanAUC
}
def predictMostListened(train: DataFrame)(allData: DataFrame): DataFrame = {
val listenCounts = train.groupBy("artist").
agg(sum("count").as("prediction")).
select("artist", "prediction")
allData.
join(listenCounts, Seq("artist"), "left_outer").
select("user", "artist", "prediction")
}
}miaofudeMacBook-Pro:aas miaofu$
(3) Spark
Launch the spark-shell:
miaofu@miaofu-Virtual-Machine:~$ spark-shell
16/09/14 10:19:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/
Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.7.0_95)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
16/09/14 10:19:14 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/09/14 10:19:14 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/09/14 10:19:22 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/09/14 10:19:22 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/09/14 10:19:26 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/09/14 10:19:26 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
SQL context available as sqlContext.
scala> val base = "hdfs:///user/ds/"
base: String = hdfs:///user/ds/
scala> val base = "hdfs:///user/miaofu/"
base: String = hdfs:///user/miaofu/
scala> val rawArtistData = spark.read.textFile(base + "artist_data.txt")
:27: error: object read is not a member of package spark
val rawArtistData = spark.read.textFile(base + "artist_data.txt")
^
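The error above is because spark.read belongs to the SparkSession API introduced in Spark 2.x; this shell runs Spark 1.6.2, where only the SparkContext sc and the sqlContext are predefined, so sc.textFile is used instead: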
scala> val rawArtistData = sc.textFile(base + "artist_data.txt")
rawArtistData: org.apache.spark.rdd.RDD[String] = hdfs:///user/miaofu/artist_data.txt MapPartitionsRDD[1] at textFile at :29
scala> val head = rawArtistData.take(10)
head: Array[String] = Array(1134999 06Crazy Life, 6821360 Pang Nakarin, 10113088 Terfel, Bartoli- Mozart: Don, 10151459 The Flaming Sidebur, 6826647 Bodenstandig 3000, 10186265 Jota Quest e Ivete Sangalo, 6828986 Toto_XX (1977, 10236364 U.S Bombs -, 1135000 artist formaly know as Mat, 10299728 Kassierer - Musik für beide Ohren)
scala> head.foreach(println)
1134999 06Crazy Life
6821360 Pang Nakarin
10113088 Terfel, Bartoli- Mozart: Don
10151459 The Flaming Sidebur
6826647 Bodenstandig 3000
10186265 Jota Quest e Ivete Sangalo
6828986 Toto_XX (1977
10236364 U.S Bombs -
1135000 artist formaly know as Mat
10299728 Kassierer - Musik für beide Ohren
scala> def f1(line:String)={
| val g = line.split("\t")
| (g(0).toInt,g(1).trim)
| }
f1: (line: String)(Int, String)
scala> head.foreach(println)
1134999 06Crazy Life
6821360 Pang Nakarin
10113088 Terfel, Bartoli- Mozart: Don
10151459 The Flaming Sidebur
6826647 Bodenstandig 3000
10186265 Jota Quest e Ivete Sangalo
6828986 Toto_XX (1977
10236364 U.S Bombs -
1135000 artist formaly know as Mat
10299728 Kassierer - Musik für beide Ohren
scala> head.map(f1).foreach(println)
(1134999,06Crazy Life)
(6821360,Pang Nakarin)
(10113088,Terfel, Bartoli- Mozart: Don)
(10151459,The Flaming Sidebur)
(6826647,Bodenstandig 3000)
(10186265,Jota Quest e Ivete Sangalo)
(6828986,Toto_XX (1977)
(10236364,U.S Bombs -)
(1135000,artist formaly know as Mat)
(10299728,Kassierer - Musik für beide Ohren)
Option is a type in the Scala collections library with two subclasses, Some and None; it represents a value that may or may not be present.
scala> val myMap: Map[String, String] = Map("key1" -> "value")
myMap: Map[String,String] = Map(key1 -> value)
scala> val value1: Option[String] = myMap.get("key1")
value1: Option[String] = Some(value)
scala> val value2: Option[String] = myMap.get("key2")
value2: Option[String] = None
However, map() must return exactly one value for every input, so it cannot be used here. Another approach would be to remove the unparseable lines with filter(), but that would duplicate the parsing logic. When each input element should map to zero, one, or more results, the right tool is flatMap(): it simply flattens the collection of zero or more results produced for each input into one larger RDD. It works with Scala collections as well as with Scala's Option type. Option represents a value that may be absent, rather like a collection of size 0 or 1, where size 1 corresponds to the subclass Some and size 0 to None. So although the function passed to flatMap below could just as well return an empty List or a one-element List, using Some and None is cleaner and makes the intent clear.
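As a tiny standalone illustration of this pattern (a sketch using plain Scala collections, not part of the project code), flatMap over a function that returns an Option keeps the Some values and drops the Nones:
// parse strings to Int, dropping anything that cannot be parsed
val raw = Seq("1", "2", "oops", "4")
val parsed = raw.flatMap { s =>
  try {
    Some(s.toInt)
  } catch {
    case _: NumberFormatException => None
  }
}
// parsed is Seq(1, 2, 4): each Some contributes one element, each None contributes nothing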
With that in mind, we rewrite the parsing function f1 above in the following form.
scala> val artistByID = rawArtistData.flatMap { line =>
| val (id, name) = line.span(_ != '\t')
| if (name.isEmpty) {
| None
| } else {
| try {
| Some((id.toInt, name.trim))
| } catch {
| case _: NumberFormatException => None
| }
| }
| }
artistByID: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[2] at flatMap at :31
scala> artistByID
res9: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[2] at flatMap at :31
scala> val rawArtistAlias = sc.textFile(base+"artist_alias.txt")
rawArtistAlias: org.apache.spark.rdd.RDD[String] = hdfs:///user/miaofu/artist_alias.txt MapPartitionsRDD[4] at textFile at :29
scala> val artistAlias = rawArtistAlias.flatMap { line =>
| val Array(artist, alias) = line.split('\t')
| if (artist.isEmpty) {
| None
| } else {
| Some((artist.toInt, alias.toInt))
| }
| }.collect().toMap
artistAlias: scala.collection.immutable.Map[Int,Int] = Map(1208690 -> 1003926, 2012757 -> 4569, 6949139 -> 1085752, 1109727 -> 1239120, 6772751 -> 1244705, 2070533 -> 1021544, 1157679 -> 2194, 9969617 -> 5630, 2034496 -> 1116214, 6764342 -> 40, 1272489 -> 1278238, 2108744 -> 1009267, 10349857 -> 1000052, 2145319 -> 1020463, 2126338 -> 2717, 10165456 -> 1001169, 6779368 -> 1239506, 10278137 -> 1001523, 9939075 -> 1329390, 2037201 -> 1274155, 1248585 -> 2885, 1106945 -> 1399, 6811322 -> 1019016, 9978396 -> 1784, 6676961 -> 1086433, 2117821 -> 2611, 6863616 -> 1277013, 6895480 -> 1000993, 6831632 -> 1246136, 1001719 -> 1009727, 10135633 -> 4250, 7029291 -> 1034635, 6967939 -> 1002734, 6864694 -> 1017311, 1237279 -> 1029752, 6793956 -> 1283231, 1208609 -> 1000699, 6693428 -> 1100258, 685174...
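artist_alias.txt maps variant or misspelled artist IDs to the ID of a canonical entry, which is why it is used later to translate artist IDs before training. As a quick sanity check (a sketch only; which IDs and names come back depends on the data), one pair from the map can be looked up in artistByID:
val (badID, goodID) = artistAlias.head
// each lookup prints the artist name recorded for that ID, if any;
// the two names should refer to the same artist
artistByID.lookup(badID).foreach(println)
artistByID.lookup(goodID).foreach(println)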
scala> import org.apache.spark.mllib.recommendation._
import org.apache.spark.mllib.recommendation._
scala> val bArtistAlias = sc.broadcastcast(artistAlias)
:36: error: value broadcastcast is not a member of org.apache.spark.SparkContext
val bArtistAlias = sc.broadcastcast(artistAlias)
^
scala> val bArtistAlias = sc.broadcast(artistAlias)
bArtistAlias: org.apache.spark.broadcast.Broadcast[scala.collection.immutable.Map[Int,Int]] = Broadcast(3)
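A broadcast variable sends the read-only alias map to each executor once and caches it there, instead of serializing and shipping it with every task that uses it; this matters because the map is consulted for every line of user_artist_data.txt in the next step.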
scala>
scala> val rawUserArtistData = sc.textFile(base + "user_artist_data.txt")
rawUserArtistData: org.apache.spark.rdd.RDD[String] = hdfs:///user/miaofu/user_artist_data.txt MapPartitionsRDD[7] at textFile at :32
scala> val trainData = rawUserArtistData.map{line =>
| val Array(userID,artistID,count) = line.split(" ").map(_.toInt)
| val finalArtistID =
| bArtistAlias.value.getOrElse(artistID,artistID)
| Rating(userID,finalArtistID,count)
| }.cache()
trainData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating] = MapPartitionsRDD[8] at map at :40
scala> trainData.count()
[Stage 1:==============> (1 + 3) / 4]16/09/14 12:47:18 WARN MemoryStore: Not enough space to cache rdd_8_1 in memory! (computed 172.8 MB so far)
16/09/14 12:47:18 WARN MemoryStore: Not enough space to cache rdd_8_0 in memory! (computed 172.8 MB so far)
16/09/14 12:47:21 WARN MemoryStore: Not enough space to cache rdd_8_2 in memory! (computed 172.8 MB so far)
res0: Long = 24296858
Train the model. trainData is cached above because ALS is iterative and scans the data many times; the positional arguments of ALS.trainImplicit here are rank = 10, iterations = 5, lambda = 0.01 and alpha = 1.0.
scala> val model = ALS.trainImplicit(trainData,10,5,0.01,1.0)
[Stage 2:==============> (1 + 3) / 4]16/09/14 12:48:29 WARN MemoryStore: Not enough space to cache rdd_8_2 in memory! (computed 172.8 MB so far)
16/09/14 12:48:29 WARN MemoryStore: Not enough space to cache rdd_8_0 in memory! (computed 172.8 MB so far)
16/09/14 12:48:30 WARN MemoryStore: Not enough space to cache rdd_8_1 in memory! (computed 172.8 MB so far)
[Stage 3:> (0 + 4) / 4]16/09/14 12:49:12 ERROR Executor: Managed memory leak detected; size = 115703758 bytes, TID = 12
16/09/14 12:49:13 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 12)
Because my personal machine has limited memory, the training job failed here. I'll stop at this point and go request a machine with a larger configuration.
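Assuming the job is relaunched on a machine with more memory (for example, by starting spark-shell with larger --driver-memory and --executor-memory values), the next step once the model trains would look roughly like the sketch below: list the artists user 2093760 (the same example user as in RunRecommender above) has already played, then ask the MLlib model for its top five recommendations and map them back to names. This is only a sketch of the intended next step, not output from an actual run.
val userID = 2093760

// artists this user has actually listened to
val existingArtistIDs = trainData.
  filter(_.user == userID).
  map(_.product).collect().toSet
artistByID.filter { case (id, _) => existingArtistIDs.contains(id) }.
  values.collect().foreach(println)

// top 5 recommendations from the trained MatrixFactorizationModel
val recommendations = model.recommendProducts(userID, 5)
recommendations.foreach(println)

// map the recommended artist IDs back to names
val recommendedIDs = recommendations.map(_.product).toSet
artistByID.filter { case (id, _) => recommendedIDs.contains(id) }.
  values.collect().foreach(println)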