Building a Recommender System with Spark ALS and Elasticsearch

This post is mainly my notes on, and a translation of,
https://www.ctolib.com/IBM-elasticsearch-spark-recommender.html
together with a tidied-up version of the source code.

Building a Recommender System with Apache Spark & Elasticsearch

Installation and setup

  1. Install Elasticsearch
$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.3.0.tar.gz
$ tar xfz elasticsearch-5.3.0.tar.gz
  1. Install the Elasticsearch vector scoring plugin
$ cd elasticsearch-5.3.0
$ ./bin/elasticsearch-plugin install https://github.com/MLnick/elasticsearch-vector-scoring/releases/download/v5.3.0/elasticsearch-vector-scoring-5.3.0.zip
  1. Start Elasticsearch
$ ./bin/elasticsearch

Confirm that the vector scoring plugin has been loaded.
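One way to confirm this without reading the startup log is to ask the node for its plugin list. The sketch below is a minimal Python 3 example using only the standard library. It assumes Elasticsearch is listening on http://localhost:9200 (the default) and that the plugin's component name contains the word "vector", which is a guess and not something the original states. `find_plugins` and `installed_plugins` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def find_plugins(rows, keyword):
    """Return the component names in `rows` that contain `keyword`.

    `rows` is the parsed JSON from the _cat/plugins API: a list of dicts,
    each with a "component" key naming one installed plugin.
    """
    return [r["component"] for r in rows if keyword in r.get("component", "")]


def installed_plugins(host="http://localhost:9200"):
    """Fetch the plugin listing from a running node (host is an assumption)."""
    with urlopen(host + "/_cat/plugins?format=json") as resp:
        return json.load(resp)
```

With the node running, `find_plugins(installed_plugins(), "vector")` should return a non-empty list if the plugin was installed correctly.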
  1. Install the Elasticsearch Python client
$ pip install elasticsearch
  1. Download the Spark-Elasticsearch connector
$ wget http://download.elastic.co/hadoop/elasticsearch-hadoop-5.3.0.zip
$ unzip elasticsearch-hadoop-5.3.0.zip
  1. Download Spark and unpack it (this walkthrough uses spark-2.4.5-bin-hadoop2.7.tgz)
$ tar xfz spark-2.4.5-bin-hadoop2.7.tgz
  1. Install numpy
$ pip install numpy
  1. Download the training data
$ cd data
$ wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
$ unzip ml-latest-small.zip
  1. Install and launch the notebook
    Note: the notebook environment needs Python 2.7 or 3.x (tested with 2.7.11 and 3.6.1).
$ pip install tmdbsimple
$ pip install jupyter
$ PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" ./spark-2.4.5-bin-hadoop2.7/bin/pyspark --driver-memory 2g --driver-class-path ./elasticsearch-hadoop-5.3.0/dist/elasticsearch-spark-20_2.11-5.3.0.jar
  1. Download the example notebook
[https://github.com/mindis/elasticsearch-spark-recommender-demo/blob/master/notebooks/elasticsearch-spark-recommender.ipynb](https://github.com/mindis/elasticsearch-spark-recommender-demo/blob/master/notebooks/elasticsearch-spark-recommender.ipynb)

  1. Open the example in the notebook

Walkthrough of the example

  1. Logic diagram
    [Figure: logic diagram]
  2. Change the path to the training data
    [Screenshot: notebook cell with the data path]
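Since the screenshot is not available, here is a minimal sketch of what pointing the notebook at the dataset might look like. It assumes the archive was unpacked under `data/ml-latest-small` as in the setup steps. `PATH_TO_DATA`, `dataset_file` and `load_ratings` are illustrative names and may not match the notebook, and `spark` stands for the SparkSession that the pyspark shell provides.

```python
import os

# Assumed location: adjust to wherever ml-latest-small.zip was unpacked.
PATH_TO_DATA = os.path.join("data", "ml-latest-small")


def dataset_file(name, base=PATH_TO_DATA):
    """Build the path to one of the MovieLens CSV files."""
    return os.path.join(base, name)


def load_ratings(spark, base=PATH_TO_DATA):
    """Read ratings.csv into a DataFrame using an existing SparkSession."""
    return spark.read.csv(dataset_file("ratings.csv", base),
                          header=True, inferSchema=True)
```

Inside the notebook, `load_ratings(spark).show(5)` would then print the first few ratings if the path is right.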
  3. Train the user and movie vectors
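from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import col
# ratings_from_es is assumed to be the ratings DataFrame loaded earlier in the notebook
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", regParam=0.1, rank=10, seed=42)
model = als.fit(ratings_from_es)
# Each row holds an id and its learned factor vector (length equals rank)
model.userFactors.show(5)
model.itemFactors.show(5)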
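To make the role of these factor vectors concrete: an ALS model predicts a rating as the dot product of a user's vector and a movie's vector. The plain-Python sketch below illustrates that idea on small hand-written vectors. `predict_rating` and `recommend` are hypothetical helpers, not part of Spark or of the original notebook.

```python
def predict_rating(user_vec, item_vec):
    """Predicted rating = dot product of the two factor vectors."""
    return sum(u * i for u, i in zip(user_vec, item_vec))


def recommend(user_vec, item_factors, num=5):
    """Rank items for one user by predicted rating, highest first.

    `item_factors` maps an item id to its factor vector.
    """
    scored = [(item_id, predict_rating(user_vec, vec))
              for item_id, vec in item_factors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num]
```

In the real pipeline the vectors would come from `model.userFactors` and `model.itemFactors`, and the ranking is done inside Elasticsearch instead of in Python.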
  4. Write the vectors to Elasticsearch
movie_vectors.write.format("es") \
    .option("es.mapping.id", "id") \
    .option("es.write.operation", "update") \
    .save("demo/movies", mode="append")
user_vectors.write.format("es") \
    .option("es.mapping.id", "id") \
    .option("es.write.operation", "update") \
    .save("demo/users", mode="append")
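The `movie_vectors` and `user_vectors` DataFrames are not defined in this excerpt. The upstream notebook appears to derive them from `model.itemFactors` and `model.userFactors` by encoding each factor array as a string of `index|value` tokens, which is the form the vector scoring plugin reads. The sketch below shows only that encoding in plain Python. The function names and the exact token format are recalled from the upstream project and should be treated as assumptions.

```python
def convert_vector(x):
    """Encode a factor vector as space-separated 'index|value' tokens."""
    return " ".join("%s|%s" % (i, v) for i, v in enumerate(x))


def reverse_convert(s):
    """Decode a token string produced by convert_vector back into floats."""
    return [float(token.split("|")[1]) for token in s.split()]
```

In Spark, a function like `convert_vector` would typically be wrapped in a `udf` and applied to the `features` column before writing to the index.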
  5. Query for similar movies
display_similar(2628, num=5)
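`display_similar` is defined in the notebook and runs its query against Elasticsearch. As a local illustration of what "similar" can mean here, the sketch below ranks movies by the cosine similarity of their factor vectors in plain Python. It is an analogue under that assumption, not the plugin's implementation, and `cosine` and `most_similar` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(item_id, factors, num=5):
    """Return the `num` items closest to `item_id`, excluding the item itself.

    `factors` maps an item id to its factor vector.
    """
    query = factors[item_id]
    scored = [(other, cosine(query, vec))
              for other, vec in factors.items() if other != item_id]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num]
```

Running the search inside Elasticsearch has the advantage that vector scoring can be combined with ordinary text queries and filters in a single request.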
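  6. Note
    The original example notebook contains an error at this point; it was changed to the following:
    [Screenshot: corrected notebook cell]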
