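This post is mainly an annotated translation of
https://www.ctolib.com/IBM-elasticsearch-spark-recommender.html
together with a cleaned-up version of the source code.
Building a Recommender System with Apache Spark & Elasticsearch
Setup
- Install Elasticsearch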
$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.3.0.tar.gz
$ tar xfz elasticsearch-5.3.0.tar.gz
- Install the Elasticsearch vector scoring plugin
$ cd elasticsearch-5.3.0
$ ./bin/elasticsearch-plugin install https://github.com/MLnick/elasticsearch-vector-scoring/releases/download/v5.3.0/elasticsearch-vector-scoring-5.3.0.zip
- Start Elasticsearch
$ ./bin/elasticsearch
Check that the vector scoring plugin has been loaded.
- Install the Elasticsearch Python client
$ pip install elasticsearch
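With the client installed, the plugin check from the start-up step can also be done from Python instead of reading the console log. The following is only a sketch: it assumes the plugin registers under the name `elasticsearch-vector-scoring` and that the node listens on `localhost:9200`; both are assumptions, so adjust them to your setup. The helper names are made up for illustration.

```python
def plugin_listed(cat_plugins_output, plugin_name):
    """Return True if plugin_name appears in the text returned by _cat/plugins.

    Each line of that output has the columns: node name, plugin name, version.
    """
    for line in cat_plugins_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == plugin_name:
            return True
    return False


def check_live_cluster(plugin_name="elasticsearch-vector-scoring"):
    """Ask a running local node for its plugins (plugin name is an assumption)."""
    # Imported lazily so plugin_listed() can be used without the client installed.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])  # assumed default address
    return plugin_listed(str(es.cat.plugins()), plugin_name)
```

Calling `check_live_cluster()` while Elasticsearch is running should return `True` if the plugin was picked up; `plugin_listed` can also be fed the output of `curl localhost:9200/_cat/plugins`.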
- Download the Spark-to-Elasticsearch connector
$ wget http://download.elastic.co/hadoop/elasticsearch-hadoop-5.3.0.zip
$ unzip elasticsearch-hadoop-5.3.0.zip
- Download Spark (the spark-2.4.5-bin-hadoop2.7.tgz archive) and unpack it
$ tar xfz spark-2.4.5-bin-hadoop2.7.tgz
- Install numpy
$ pip install numpy
- Download the training data
$ cd data
$ wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
$ unzip ml-latest-small.zip
- Install and start the notebook
Note: the notebook environment requires Python 2.7 or 3.x (tested with 2.7.11 and 3.6.1).
$ pip install tmdbsimple
$ pip install jupyter
$ PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" ./spark-2.4.5-bin-hadoop2.7/bin/pyspark --driver-memory 2g --driver-class-path ./elasticsearch-hadoop-5.3.0/dist/elasticsearch-spark-20_2.11-5.3.0.jar
- Download the example notebook
[https://github.com/mindis/elasticsearch-spark-recommender-demo/blob/master/notebooks/elasticsearch-spark-recommender.ipynb](https://github.com/mindis/elasticsearch-spark-recommender-demo/blob/master/notebooks/elasticsearch-spark-recommender.ipynb)
- Open the example in the notebook
Example walkthrough
- Workflow diagram
- Change the training data path
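The notebook reads the MovieLens CSV files from a local directory, so the path has to point at wherever ml-latest-small.zip was unpacked. A small sketch for sanity-checking that directory before running the notebook; the helper name `missing_data_files` and the example path are illustrative and not taken from the original notebook.

```python
import os

# Files shipped in the ml-latest-small archive.
EXPECTED_FILES = ("ratings.csv", "movies.csv", "links.csv", "tags.csv")


def missing_data_files(data_dir):
    """Return the expected MovieLens files that are absent from data_dir."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(data_dir, name))]
```

For example, `missing_data_files("data/ml-latest-small")` returns an empty list when the path is correct, and the names of the missing files otherwise.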
- Train the user and movie vectors
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import col
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", regParam=0.1, rank=10, seed=42)
model = als.fit(ratings_from_es)
model.userFactors.show(5)
model.itemFactors.show(5)
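ALS learns one latent factor vector per user and per movie (here of length 10, because `rank=10`), and the predicted rating for a user/movie pair is the dot product of the two vectors. A toy sketch in plain Python with made-up rank-3 vectors, just to show the arithmetic; the function name is hypothetical.

```python
def predict_rating(user_factors, item_factors):
    """Predicted rating = dot product of the two latent factor vectors."""
    if len(user_factors) != len(item_factors):
        raise ValueError("factor vectors must have the same rank")
    return sum(u * i for u, i in zip(user_factors, item_factors))


# Made-up factors, not real model output.
user = [1.0, 2.0, 0.5]
movie = [0.5, 1.0, 2.0]
print(predict_rating(user, movie))  # 0.5 + 2.0 + 1.0 = 3.5
```

This is why storing only the factor vectors in Elasticsearch is enough to score movies at query time.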
- Write the vectors to Elasticsearch
movie_vectors.write.format("es") \
    .option("es.mapping.id", "id") \
    .option("es.write.operation", "update") \
    .save("demo/movies", mode="append")
user_vectors.write.format("es") \
    .option("es.mapping.id", "id") \
    .option("es.write.operation", "update") \
    .save("demo/users", mode="append")
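Two things are worth spelling out here. First, `es.write.operation` set to `update` only modifies documents that already exist under the given `id`, so the movies and users need to have been indexed beforehand. Second, as far as I can tell from the original notebook, the factor vectors are not stored as plain arrays but as a string of `index|value` tokens that the vector scoring plugin can read. The sketch below shows that encoding under this assumption; treat the exact format as something to verify against the plugin's documentation.

```python
def convert_vector(vector):
    """Encode a vector as space-separated 'index|value' tokens.

    Assumed format: position and value joined by '|', tokens joined by spaces.
    """
    return " ".join("%s|%s" % (i, v) for i, v in enumerate(vector))


print(convert_vector([0.1, -0.2, 0.3]))  # 0|0.1 1|-0.2 2|0.3
```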
- Query similar movies
display_similar(2628, num=5)
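`display_similar` is a helper defined in the notebook and is not reproduced here. Conceptually, it looks up the factor vector of the given movie and asks Elasticsearch to rank the other movies by how close their vectors are. The sketch below imitates that ranking offline with cosine similarity on made-up vectors; it assumes cosine similarity is the measure used, and the function names are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_id, vectors, num=5):
    """Rank all other items by cosine similarity to vectors[query_id]."""
    query = vectors[query_id]
    scored = [(other_id, cosine_similarity(query, vec))
              for other_id, vec in vectors.items() if other_id != query_id]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num]


# Made-up 2-dimensional vectors keyed by a fake movie id.
toy_vectors = {1: [1.0, 0.0], 2: [2.0, 0.0], 3: [0.0, 1.0], 4: [1.0, 1.0]}
print([movie_id for movie_id, _ in most_similar(1, toy_vectors, num=2)])  # [2, 4]
```

In the real setup the scoring runs inside Elasticsearch via the plugin, so nothing has to be pulled back into Python to compute similarities.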
- Caveats
The original example file contains errors in the following places; they have been corrected as follows: