This post is aimed at newcomers to Spark and to recommender systems (myself included).
Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark 2.3.1 uses Scala 2.11. You will need to use a compatible Scala version (2.11.x).
Scala and Java users can include Spark in their projects using its Maven coordinates, and in the future Python users can also install Spark from PyPI.
Spark runs on both Windows and UNIX-like systems (e.g. Linux, macOS). It's easy to run locally on one machine: all you need is to have Java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.
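Setting JAVA_HOME on macOS: https://www.mkyong.com/java/how-to-set-java_home-environment-variable-on-mac-os-x/
Download Spark: https://spark.apache.org/downloads.html
Extract the archive (spark-2.3.1-bin-hadoop2.7.tgz), then set the environment variables and run the bundled example: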
export JAVA_HOME=$(/usr/libexec/java_home)
export SPARK_HOME={YOUR_SPARK_HOME}
export PATH=$SPARK_HOME/bin:$PATH
run-example SparkPi 10
If everything is set up correctly, you should see log output like this:
2018-09-01 13:30:09 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
2018-09-01 13:30:09 INFO DAGScheduler:54 - ResultStage 0 (reduce at SparkPi.scala:38) finished in 0.785 s
2018-09-01 13:30:09 INFO DAGScheduler:54 - Job 0 finished: reduce at SparkPi.scala:38, took 0.888303 s
Pi is roughly 3.142015142015142
2018-09-01 13:30:09 INFO AbstractConnector:318 - Stopped Spark@3088660d{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-09-01 13:30:09 INFO SparkUI:54 - Stopped Spark web UI at http://192.168.0.102:4040
2018-09-01 13:30:09 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-09-01 13:30:09 INFO MemoryStore:54 - MemoryStore cleared
2018-09-01 13:30:09 INFO BlockManager:54 - BlockManager stopped
2018-09-01 13:30:09 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2018-09-01 13:30:09 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-09-01 13:30:09 INFO SparkContext:54 - Successfully stopped SparkContext
2018-09-01 13:30:09 INFO ShutdownHookManager:54 - Shutdown hook called
2018-09-01 13:30:09 INFO ShutdownHookManager:54 - Deleting directory /private/var/folders/nb/59yflyk555q86ny7pj84kjnm0000gn/T/spark-ef6810c7-0fbd-4ec6-a1c4-da3d41963d43
2018-09-01 13:30:09 INFO ShutdownHookManager:54 - Deleting directory /private/var/folders/nb/59yflyk555q86ny7pj84kjnm0000gn/T/spark-1cdbae12-c127-47cb-8571-7d546f088a45
You can also run Spark interactively through a modified version of the Scala shell. This is a great way to learn the framework.
spark-shell --master local[2]
If you use zsh, quote the master URL so the shell does not try to expand the square brackets: https://zpjiang.me/2015/10/17/zsh-no-match-found-local-spark/
spark-shell --master "local[4]"
The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing. For a full list of options, run Spark shell with the --help option.
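My future work is related to recommendation algorithms, so I am jumping straight into collaborative filtering with Spark ML.
Before anything else, get the official demo running:
https://spark.apache.org/docs/latest/ml-collaborative-filtering.html
Start spark-shell:
spark-shell --master "local[4]"
Enter the following script: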
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS
case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
def parseRating(str: String): Rating = {
val fields = str.split("::")
assert(fields.size == 4)
Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
}
// Note: this line needs to be typed in by hand
val ratings = spark.read.textFile("data/mllib/als/sample_movielens_ratings.txt").map(parseRating).toDF()
// Split the data into training and test sets
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))
// Build the recommendation model using ALS on the training data
val als = new ALS().setMaxIter(5).setRegParam(0.01).setUserCol("userId").setItemCol("movieId").setRatingCol("rating")
val model = als.fit(training)
// Evaluate the model by computing the RMSE on the test data
// Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
model.setColdStartStrategy("drop")
val predictions = model.transform(test)
val evaluator = new RegressionEvaluator().setMetricName("rmse").setLabelCol("rating").setPredictionCol("prediction")
val rmse = evaluator.evaluate(predictions)
println(s"Root-mean-square error = $rmse")
// Generate top 10 movie recommendations for each user
val userRecs = model.recommendForAllUsers(10)
// Generate top 10 user recommendations for each movie
val movieRecs = model.recommendForAllItems(10)
// Generate top 10 movie recommendations for a specified set of users
val users = ratings.select(als.getUserCol).distinct().limit(3)
val userSubsetRecs = model.recommendForUserSubset(users, 10)
// Generate top 10 user recommendations for a specified set of movies
val movies = ratings.select(als.getItemCol).distinct().limit(3)
val movieSubSetRecs = model.recommendForItemSubset(movies, 10)
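The evaluation step in the script relies on two ideas: RMSE as the metric, and dropping NaN predictions (the "drop" cold start strategy) so that the metric itself does not become NaN. The following is a minimal plain-Scala sketch of that calculation, not Spark's implementation; `RmseDemo` and `rmse` are hypothetical names introduced only for illustration.

```scala
// Hypothetical illustration, independent of Spark.
object RmseDemo {
  // Each pair is (actual rating, predicted rating).
  // NaN predictions are filtered out first, which mimics what
  // coldStartStrategy = "drop" does to the prediction rows.
  def rmse(pairs: Seq[(Double, Double)]): Double = {
    val valid = pairs.filterNot { case (_, predicted) => predicted.isNaN }
    require(valid.nonEmpty, "need at least one non-NaN prediction")
    val squaredErrors = valid.map { case (actual, predicted) =>
      val diff = actual - predicted
      diff * diff
    }
    math.sqrt(squaredErrors.sum / valid.size)
  }
}
```

Without the filter, a single NaN prediction would make the sum, and therefore the RMSE, NaN.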
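A problem you may run into: "illegal start of definition"
https://stackoverflow.com/questions/2632247/scala-giving-me-illegal-start-of-definition
Run the script one statement at a time. If everything works, look at the recommendations at the end:
userSubsetRecs.show(false)
Output: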
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|userId|recommendations |
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|28 |[[55, 6.5194335], [48, 5.5661573], [79, 5.239481], [81, 5.092548], [42, 5.0799093], [92, 5.032366], [50, 5.024519], [12, 4.713505], [68, 4.4784117], [10, 4.3225317]] |
|26 |[[46, 7.1047177], [32, 6.8479147], [53, 6.2369795], [30, 5.704584], [94, 5.611696], [7, 5.046065], [23, 4.9733334], [22, 4.9221854], [90, 4.765525], [87, 4.404424]] |
|27 |[[18, 3.7961326], [39, 3.2431319], [46, 3.225921], [38, 3.1762652], [51, 3.1215887], [80, 3.118711], [34, 3.1077363], [1, 3.0809362], [75, 3.075434], [62, 3.0638745]]|
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
spark.read.textFile
+--------------------+
| value|
+--------------------+
| 0::2::3::1424380312|
| 0::3::1::1424380312|
What it returns is a Dataset (here a Dataset[String] with a single value column).
ratings.show(false)
The demo defines a case class, parses each line into the fields of that case class, and maps this parser over the Dataset, which produces a structured DataFrame:
+------+-------+------+----------+
|userId|movieId|rating|timestamp |
+------+-------+------+----------+
|0 |2 |3.0 |1424380312|
|0 |3 |1.0 |1424380312|
|0 |5 |2.0 |1424380312|
|0 |9 |4.0 |1424380312|
+------+-------+------+----------+
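The demo's parseRating fails with an assertion or a NumberFormatException as soon as one line is malformed. As a sketch of an alternative, the parser below returns an Option so that bad lines can be skipped. `SafeRatingParser` is a hypothetical helper introduced here for illustration and is not part of the official demo.

```scala
// Hypothetical helper, not part of the official Spark demo.
object SafeRatingParser {
  case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)

  // Returns Some(Rating) for a well-formed "user::movie::rating::timestamp"
  // line and None for anything else (wrong field count or non-numeric field).
  def parse(line: String): Option[Rating] =
    line.split("::") match {
      case Array(user, movie, score, ts) =>
        scala.util.Try(
          Rating(user.trim.toInt, movie.trim.toInt, score.trim.toFloat, ts.trim.toLong)
        ).toOption
      case _ => None
    }
}
```

In spark-shell, something along the lines of `spark.read.textFile(path).flatMap(line => SafeRatingParser.parse(line).toList).toDF()` should then skip the malformed lines; I have not checked this against every Spark version.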
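References: http://www.runoob.com/scala/scala-extractors.html
http://www.runoob.com/scala/scala-pattern-matching.html
When a case class is declared, the following happens automatically:
Every constructor parameter becomes a val unless it is explicitly declared as a var, which is not recommended;
An apply method is provided in the companion object, so instances can be created without the new keyword;
An unapply method is provided so that pattern matching works;
toString, equals, hashCode and copy methods are generated, unless definitions of these methods are given explicitly.
When we create an instance by writing the name followed by zero or more arguments, the compiler turns that expression into a call to the apply method. apply can be defined in both classes and objects.
As mentioned earlier, unapply extracts the values we are looking for, which is the reverse of what apply does. When an extractor object is used in a match expression, unapply is called automatically.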
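The points above can be seen in a few lines of plain Scala. This is only an illustrative sketch: `CaseClassDemo`, its three-field `Rating`, and `describe` are hypothetical names, not code from the demo.

```scala
// Hypothetical illustration of what a case class provides for free.
object CaseClassDemo {
  case class Rating(userId: Int, movieId: Int, rating: Float)

  // apply from the companion object: no `new` needed.
  val r = Rating(0, 2, 3.0f)

  // copy: a new instance with one field changed; r itself is unchanged.
  val bumped = r.copy(rating = 4.0f)

  // unapply: the patterns below take the instance apart into its fields.
  def describe(x: Rating): String = x match {
    case Rating(user, movie, score) if score >= 4.0f => s"user $user likes movie $movie"
    case Rating(user, movie, _)                      => s"user $user rated movie $movie"
  }
}
```

The generated equals compares field values rather than references, which is why two separately constructed Rating(0, 2, 3.0f) instances compare as equal.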
Collaborative filtering is commonly used for recommender systems. These techniques aim to fill in the missing entries of a user-item association matrix. spark.ml currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries. spark.ml uses the alternating least squares (ALS) algorithm to learn these latent factors. The implementation in spark.ml has the following parameters:
numBlocks is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to 10). In short: the degree of parallelism.
rank is the number of latent factors in the model (defaults to 10). In short: the number of latent factors.
maxIter is the maximum number of iterations to run (defaults to 10). In short: the iteration cap.
regParam specifies the regularization parameter in ALS (defaults to 1.0). In short: the regularization penalty, which guards against overfitting.
implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data (defaults to false which means using explicit feedback). In short: whether the data is implicit feedback.
alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations (defaults to 1.0). In short: the confidence weight for implicit feedback.
nonnegative specifies whether or not to use nonnegative constraints for least squares (defaults to false). In short: whether the factors are constrained to be nonnegative.
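As a configuration sketch, the parameters listed above map onto setters of the spark.ml ALS class. The fragment assumes it is entered in spark-shell (so Spark is on the classpath) and that the column names match the demo's ratings DataFrame. The values are illustrative only, not tuned recommendations, and `AlsConfig` is a hypothetical wrapper name.

```scala
import org.apache.spark.ml.recommendation.ALS

// Hypothetical wrapper; the values are examples, not tuned settings.
object AlsConfig {
  val als: ALS = new ALS()
    .setNumBlocks(10)        // parallelism: user and item blocks
    .setRank(10)             // number of latent factors
    .setMaxIter(5)           // iteration cap
    .setRegParam(0.01)       // regularization strength
    .setImplicitPrefs(false) // explicit feedback (ratings)
    .setAlpha(1.0)           // only used when implicitPrefs is true
    .setNonnegative(false)   // no nonnegativity constraint on factors
    .setUserCol("userId")
    .setItemCol("movieId")
    .setRatingCol("rating")
}
```

Wrapping the chained calls in an object keeps the whole expression inside one pair of braces, so it can be entered in spark-shell across several lines.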
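The goal of collaborative filtering is to fill in the empty cells of the user-item matrix. spark.ml currently supports model-based collaborative filtering: users and items are each described by a small set of latent factors, and these factor matrices are used to predict the empty cells. Spark uses the ALS algorithm to learn the latent factors.
How can the principle behind the ALS algorithm in Spark MLlib be explained? http://www.dataguru.cn/article-7049-1.html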
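To make "predicting an empty cell from latent factors" concrete: once a user and an item each have a factor vector of the same rank, the predicted rating is the dot product of the two vectors. The sketch below shows only that prediction step, not the ALS training that learns the vectors. `LatentFactorDemo` and the example numbers are made up for illustration.

```scala
// Hypothetical illustration of the prediction step only; ALS training is not shown.
object LatentFactorDemo {
  // Predicted rating = dot product of the user's and the item's factor vectors.
  def predict(userFactors: Array[Double], itemFactors: Array[Double]): Double = {
    require(userFactors.length == itemFactors.length, "factor vectors must have the same rank")
    userFactors.zip(itemFactors).map { case (u, i) => u * i }.sum
  }
}
```

For example, with rank 2, a user vector (1.0, 2.0) and an item vector (3.0, 0.5) give a predicted rating of 1.0 * 3.0 + 2.0 * 0.5 = 4.0.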