K-Means: org.apache.spark.ml.clustering.KMeans, version 2.11
RFM:
The definition of RFM varies with business requirements. The definition used here:
Only order data from the past year is counted.
Result (tag value): users are divided into 8 types in total.
1. Load the order data and convert it to a DataFrame:
SELECT
  aid AS member_id
, cast(datediff(current_date(), ifnull(max(pay_time), date_add(current_date(), -365))) AS bigint) AS order_last  -- R
, count(DISTINCT dayofyear(pay_time)) AS order_count  -- F
, cast(sum(amt_payment) AS bigint) AS order_amount  -- M
FROM cdp_rfm_order
WHERE pay_time > add_months(current_date(), -12)
GROUP BY aid
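As a sketch, the query above can be run from Spark roughly as follows (assuming a Hive-backed SparkSession; the app name and variable names are illustrative, not from the original notes):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: a SparkSession with Hive support; in spark-shell this is `spark`.
val sparkSession: SparkSession = SparkSession.builder()
  .appName("rfm")
  .enableHiveSupport()
  .getOrCreate()

// The RFM extraction query from step 1.
val originalDataModel: DataFrame = sparkSession.sql(
  """SELECT aid AS member_id
    |, cast(datediff(current_date(), ifnull(max(pay_time), date_add(current_date(), -365))) AS bigint) AS order_last
    |, count(DISTINCT dayofyear(pay_time)) AS order_count
    |, cast(sum(amt_payment) AS bigint) AS order_amount
    |FROM cdp_rfm_order
    |WHERE pay_time > add_months(current_date(), -12)
    |GROUP BY aid""".stripMargin)
```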
2. K-Means initialization:
import org.apache.spark.ml.clustering.KMeans

def buildKmeansModel(featherColumn: String, numOfIterations: Int): KMeans = {
  val numOfClusters = 5 // K value, chosen from the business data
  val modeOfCluster = "k-means||"
  val kmeans = new KMeans()
    .setInitMode(modeOfCluster)
    .setK(numOfClusters)
    .setMaxIter(numOfIterations) // maximum number of iterations; the original note suggests 1/3 to 2/3 of the number of data rows
    .setSeed(1000L) // random seed; if not set, defaults to the hashCode of the class name
    .setFeaturesCol(featherColumn) // feature column to train on (Vector type)
  kmeans
}
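Usage sketch: one model is built per RFM dimension. The feature column names and the iteration count (20) below are illustrative examples, not values fixed by the original notes:

```scala
// One KMeans instance per dimension; feature column names are assumptions.
val recencyKmeans   = buildKmeansModel("order_last_feature", 20)
val frequencyKmeans = buildKmeansModel("order_count_feature", 20)
val monetaryKmeans  = buildKmeansModel("order_amount_feature", 20)
```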
3. K-Means model training (Train)
The three RFM dimensions are trained separately, three times. The code below uses the F value as an example.
Before training, the target column must first be converted to a vector.
originalDataModel: the DataFrame of order data loaded in step 1
originalColumn: the name of the column to train on (order_last / order_count / order_amount)
featherColumn: the alias of the resulting vectorized feature column
import org.apache.spark.ml.linalg.Vectors
import sparkSession.implicits._

val featherDataModel = originalDataModel.select("member_id", originalColumn)
  .mapPartitions(it => {
    it.map(row => {
      (row.getString(0), Vectors.dense(row.getAs[Long](originalColumn)))
    })
  }).toDF("member_id", featherColumn)
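As a side note, the same column-to-vector conversion can be done with Spark ML's VectorAssembler instead of a manual mapPartitions pass (a sketch; it keeps the variable names used above):

```scala
import org.apache.spark.ml.feature.VectorAssembler

// VectorAssembler appends a Vector column built from the numeric input column.
val assembler = new VectorAssembler()
  .setInputCols(Array(originalColumn))
  .setOutputCol(featherColumn)
val featherDataModel = assembler.transform(originalDataModel)
  .select("member_id", featherColumn)
```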
Train:
import org.apache.spark.ml.clustering.KMeansModel

val recentKmeansModel: KMeansModel = kMeans.fit(featherDataModel)
The training result may differ from run to run. Train several times, pick a run whose result matches expectations, and save that model for the later prediction step:
recentKmeansModel.write.overwrite().save(s"/user/RFM/")
Load the saved model:
val kmeansModel = KMeansModel.load(s"/user/RFM/")
4. Prediction (Predict):
kmeansModel.transform(featherDataModel)
After prediction, each record is assigned a cluster label from 0 to 4.
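The step-5 SQL reads from cdp_frequency_predicted_data with a column named order_count_predict, while KMeansModel.transform writes its label to a column named "prediction" by default. A sketch of bridging the two (the rename and temp view are assumptions, not spelled out in the original notes):

```scala
// Rename the default "prediction" column to match the step-5 SQL,
// then expose the result as a temp view for that query.
val predictedDataModel = kmeansModel.transform(featherDataModel)
  .withColumnRenamed("prediction", "order_count_predict")
predictedDataModel.createOrReplaceTempView("cdp_frequency_predicted_data")
```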
5. Score each cluster (1-5) based on the prediction results
Group by cluster and compute the average number of purchases per member in each cluster; the higher it is, the more valuable the users of that type. Then score the clusters 1-5:
SELECT
  row_number() over(ORDER BY order_count_tal / member_count_tal) AS order_count_score
, order_count_predict
FROM
(
  SELECT order_count_predict
  , sum(order_count) AS order_count_tal
  , count(member_id) AS member_count_tal
  FROM cdp_frequency_predicted_data
  GROUP BY order_count_predict
) t
6. From the scores of the previous step, compute the average score of the whole sample.
7. Compute the RFM value:
Compare each member's score with the sample average: take 1 if above the average, otherwise 0. Doing this for R, F, and M in turn yields a 3-bit code, i.e. 2^3 = 8 user types.
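Steps 6-7 can be sketched as plain Scala (all names here are illustrative; the scores are the 1-5 cluster scores from step 5 and the averages are the sample means from step 6):

```scala
// Compare each score with the sample average: above average -> 1, else 0.
// The three binary flags R, F, M combine into a 3-bit code, giving the
// 2^3 = 8 user types mentioned at the top of these notes.
def rfmType(rScore: Int, fScore: Int, mScore: Int,
            rAvg: Double, fAvg: Double, mAvg: Double): Int = {
  val r = if (rScore > rAvg) 1 else 0
  val f = if (fScore > fAvg) 1 else 0
  val m = if (mScore > mAvg) 1 else 0
  r * 4 + f * 2 + m // 0 to 7, one of 8 types
}
```

For example, a member scoring above average on all three dimensions gets type 7, and one below average on all three gets type 0.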
Reference links:
spark K-Means:http://spark.apache.org/docs/2.1.1/ml-clustering.html#k-means
K-Means:https://www.bookstack.cn/read/spark-ml-source-analysis/%E8%81%9A%E7%B1%BB-k-means-k-means.md
RFM:https://cloud.tencent.com/developer/article/1780559