Computing RFM with the K-Means Algorithm in Spark

Environment and definitions

K-Means: org.apache.spark.ml.clustering.KMeans, Spark 2.1.1 (Scala 2.11)

RFM
The definition of RFM varies with business requirements; here it is defined as:

  • R: number of days since the most recent purchase (0 - 365/366)
  • F: number of days with purchase activity (multiple purchases on the same day count once)
  • M: total purchase amount (consider whether amounts of 0, or extremely large amounts, should be filtered out)

Only order data from the past year is included.
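The R/F/M definitions above can be sketched in plain Scala for a single member (the `Order` record and `rfm` helper are illustrative, not part of the original job):

```scala
import java.time.LocalDate

// Hypothetical order record: payment date plus paid amount.
case class Order(payTime: LocalDate, amount: Long)

// Compute (R, F, M) for one member from orders in the past year.
// Assumes the member has at least one order inside the window.
def rfm(orders: Seq[Order], today: LocalDate): (Long, Long, Long) = {
  val windowStart = today.minusDays(365)
  val recent = orders.filter(_.payTime.isAfter(windowStart))      // only the past year
  val r = today.toEpochDay - recent.map(_.payTime.toEpochDay).max // days since last purchase
  val f = recent.map(_.payTime).distinct.size.toLong              // purchase days; one day counts once
  val m = recent.map(_.amount).sum                                // total amount
  (r, f, m)
}
```

Note that two orders on the same day raise M but leave F unchanged, matching the definition above.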

Computed result (label values): users are divided into eight types.

(Figure 1: the eight RFM user types)

Implementation:

1. Load the order data into a DataFrame:

SELECT
         aid AS member_id
       , cast(datediff(current_date(), ifnull(max(pay_time), date_sub(current_date(), 365))) AS bigint) AS order_last  -- R
       , count(DISTINCT to_date(pay_time)) AS order_count  -- F (to_date avoids day-of-year collisions across years)
       , cast(sum(amt_payment) AS bigint) AS order_amount  -- M
FROM cdp_rfm_order
WHERE pay_time > date_sub(current_date(), 365)
GROUP BY aid

2. Initialize K-Means:

    def buildKmeansModel(featherColumn: String, numOfIterations: Int): KMeans = {
        val numOfClusters = 5    // choose K based on the business data
        val modeOfCluster = "k-means||"
        val kmeans = new KMeans()
            .setInitMode(modeOfCluster)
            .setK(numOfClusters)
            .setMaxIter(numOfIterations)  // maximum number of iterations; a few dozen is usually enough
            .setSeed(1000L)               // random seed; if unset, defaults to the hash code of the class name
            .setFeaturesCol(featherColumn) // feature column to train on (must be of Vector type)
        kmeans
    }

3. Train the K-Means model

The three RFM dimensions are trained separately, three times in total; the code below uses the F value as an example.

Before training, the column being trained must first be converted to a vector:
originalDataModel: the DataFrame of loaded order data
originalColumn: the name of the column to train (order_last/order_count/order_amount)
featherColumn: alias of the resulting feature column

      import sparkSession.implicits._
      import org.apache.spark.ml.linalg.Vectors

      val featherDataModel = originalDataModel.select("member_id", originalColumn)
        .map(row => (row.getString(0), Vectors.dense(row.getAs[Long](originalColumn).toDouble)))
        .toDF("member_id", featherColumn)

Training:

val recentKmeansModel: KMeansModel = kMeans.fit(featherDataModel)

Training results may differ from run to run; train several times, pick a run whose result matches expectations, and save that model so it can be reused for later predictions:

  recentKmeansModel.write.overwrite().save(s"/user/RFM/")

Load the saved model:

KMeansModel.load(s"/user/RFM/")

4. Predict:

kmeansModel.transform(featherDataModel)

After prediction, each row is labeled with a cluster value of 0-4.

5. Score each cluster (1-5) based on the prediction results

Group by cluster and compute each cluster's average purchase count; the higher the count, the more valuable the users in that cluster. Rank the clusters accordingly and score them 1-5:

SELECT
        row_number() over(ORDER BY order_count_tal / member_count_tal) AS order_count_score
      , order_count_predict
FROM
        (
                 SELECT   order_count_predict
                        , sum(order_count)  AS order_count_tal
                        , count(member_id)  AS member_count_tal
                 FROM     cdp_frequency_predicted_data
                 GROUP BY order_count_predict
        ) t
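The ranking in the SQL above can be mirrored in plain Scala to make it explicit (a sketch only; `scoreClusters` is an illustrative name, and the actual job runs the SQL against the predicted table):

```scala
// Rank the clusters produced by K-Means (ids 0-4) by their average
// purchase count and assign scores 1-5, lowest average -> score 1.
def scoreClusters(predicted: Seq[(Int, Long)]): Map[Int, Int] = {
  // predicted: (cluster id, order_count of one member in that cluster)
  val avgByCluster = predicted.groupBy(_._1).map { case (c, rows) =>
    c -> rows.map(_._2).sum.toDouble / rows.size
  }
  avgByCluster.toSeq
    .sortBy(_._2)  // same ordering as ORDER BY order_count_tal / member_count_tal
    .zipWithIndex
    .map { case ((cluster, _), idx) => cluster -> (idx + 1) }
    .toMap
}
```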

6. Using the scores from the previous step, compute the average score over the whole sample.

7. Compute the RFM value
Compare each user's score with the sample average: take 1 if it is above the average, otherwise 0; compute the R, F, and M bits in turn.
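Steps 6-7 can be sketched as follows (a plain-Scala illustration; placing R in the high bit is an assumed layout, not necessarily the original one):

```scala
// Combine the per-dimension scores into one of the 8 user types (0-7):
// each dimension contributes 1 if its score is above the sample average,
// else 0. R is placed in the high bit here; that layout is an assumption.
def rfmType(rScore: Int, fScore: Int, mScore: Int,
            rAvg: Double, fAvg: Double, mAvg: Double): Int = {
  def bit(score: Int, avg: Double): Int = if (score > avg) 1 else 0
  (bit(rScore, rAvg) << 2) | (bit(fScore, fAvg) << 1) | bit(mScore, mAvg)
}
```

The three bits yield exactly the eight user types mentioned earlier (type 0 = below average on all dimensions, type 7 = above average on all three).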

 

References:

Spark K-Means: http://spark.apache.org/docs/2.1.1/ml-clustering.html#k-means

K-Means: https://www.bookstack.cn/read/spark-ml-source-analysis/%E8%81%9A%E7%B1%BB-k-means-k-means.md

RFM: https://cloud.tencent.com/developer/article/1780559
