PySpark Machine Learning

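Project overview and goal: This is a fictional music-service dataset with more than ten million users, who can upgrade, downgrade, or cancel their plan at any time. User behavior and intent directly affect the service's revenue, and every user action is logged (specific actions such as adding a favorite, upgrading, downgrading, playing a song, or adding to a playlist). This data is valuable to the provider: by finding what certain users' actions have in common, we can predict what a user is likely to do next. The goal of this task is to find potential customers, who fall into two groups: prospective customers and churning customers. Here we use machine learning to find users who have churned or are about to churn, identify the characteristics they share, and limit the loss with discounts, trials, and similar offers.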

Environment

  • Python
  • PySpark, a distributed machine learning library
  • matplotlib, a visualization library
  • numpy, a scientific computing library
Installation
pip install pyspark
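Technical characteristics

PySpark uses lazy evaluation: transformations are only recorded, and the work is carried out when an action that actually needs a result (such as count() or show()) is called. This reduces resource overhead and speeds up development.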
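As a rough analogy only (this is plain Python, not PySpark, and the names `lazy_filter` and `log` are made up for illustration), a generator behaves in a similar way: building the pipeline does no work, and the filter runs only when a result is requested.

```python
log = []  # records when work actually happens

def lazy_filter(rows):
    """Yield non-empty user ids; nothing runs until the generator is consumed."""
    for row in rows:
        log.append(f"checked {row!r}")
        if row != "":
            yield row

rows = ["u1", "", "u2"]
pipeline = lazy_filter(rows)   # like a transformation: only a plan is built
work_before = len(log)         # nothing has been checked yet
result = list(pipeline)        # like an action: forces the computation
work_after = len(log)          # every row has now been checked

print(work_before, work_after, result)
```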

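Workflow
  • Load and clean the data
  • Exploratory data analysis
  • Feature engineering
  • Modeling
  • Model training and prediction
Final evaluation metric:

Accuracy: the most intuitive measure of a model. The higher the accuracy on the validation set, the better the model.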

Detailed steps

1. Import the required libraries
# import libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, concat, col, desc, year, month, asc, count, avg, countDistinct
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as func

from pyspark.ml import Pipeline
from pyspark.ml.evaluation import  MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler,Normalizer,StandardScaler,IDF,StringIndexer
from pyspark.ml.regression import LinearRegression
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from matplotlib import pyplot as plt  # used by the plots in step 7
from datetime import date
from functools import reduce
import numpy as np
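  • SparkSession is Spark's main entry point, used to create the Spark object and load data
  • Import the needed functions from pyspark.sql.functions (each is explained where it is used)
  • Import the data transformation classes from pyspark.ml.feature
  • Import the linear regression module from pyspark.ml.regression
  • Import the classifiers from pyspark.ml.classification, used to compare the final models
  • Import the evaluators from pyspark.ml.evaluation
2. Create the Spark object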
# create a Spark session
spark = SparkSession.builder.appName("sparkify").getOrCreate()
3. Load the data
data = spark.read.json("s3n://udacity-dsnd/sparkify/sparkify_event_data.json")
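4. Inspect the data
# check detail for data
print(data.count())     # number of rows
print(data.describe())  # repr of the summary DataFrame: column names and types
data.printSchema()      # prints the schema directly (print() around it would add "None")
data.show()             # prints the first 20 rows
Result

Outputs 1 to 4 in the figure correspond, top to bottom, to the four statements above: 1 is the row count (.... a great many rows), 2 shows the data types, 3 is an overview of the schema, and 4 is the data itself.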

5. Clean the data: drop rows whose userId or sessionId is empty (and therefore meaningless):
data = data.dropna(how = "any",subset=["userId","sessionId"])
data = data.filter(data["userId"] != "")
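  • An empty value is not necessarily null; it can also be an empty string, so we filter once more.
6. Define churn

Inspecting the page column shows the following values:

[Figure: distinct values of the page column]

The log records what each user did. Two of these actions, Cancellation Confirmation and Downgrade, indicate that the user has churned or is about to, so any event with one of them is labeled Churn = 1 (a user who needs retention measures) and everything else is labeled 0: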

churn_func = udf(lambda x: 1 if x == "Cancellation Confirmation" or x == "Downgrade" else 0, IntegerType())
data = data.withColumn("Churn", churn_func(data.page))
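  • udf wraps a Python function so that it can be applied to a column to build the new label column
  • udf plays a role similar to map and apply in pandas
  • We create a Churn column: events where the user confirms a cancellation or downgrades are marked 1; all others are treated as normal and marked 0.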
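The labeling rule, and the later step that keeps a single label per user, can be sketched in plain Python. This is a sketch only; `label_event`, `label_users`, and the sample events are hypothetical and not part of the original code.

```python
CHURN_PAGES = {"Cancellation Confirmation", "Downgrade"}

def label_event(page):
    """Event-level flag: 1 if the page marks a (likely) churn action, else 0."""
    return 1 if page in CHURN_PAGES else 0

def label_users(events):
    """User-level flag: a user is labeled 1 if any of their events is flagged."""
    labels = {}
    for user_id, page in events:
        labels[user_id] = max(labels.get(user_id, 0), label_event(page))
    return labels

# made-up sample events: (userId, page)
events = [
    ("u1", "NextSong"),
    ("u1", "Downgrade"),
    ("u2", "NextSong"),
    ("u3", "Cancellation Confirmation"),
    ("u2", "Add to Playlist"),
]
user_labels = label_users(events)
print(user_labels)
```

In Spark itself the same event-level rule could also be written with when(col("page").isin(...), 1).otherwise(0), which avoids the overhead of a Python UDF.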
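7. Visualization

PS: pandas could not be loaded in the full-dataset environment, so the plots use the small dataset (a subset extracted from the full one; the two run in different environments, one on AWS and the other locally):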

# length visual
data_pd = data.select("length").toPandas()
data_pd.plot(kind = "hist", bins = 500)
plt.xlim(0, 550)
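[Figure: histogram of length]
  • The listening length feature (length) looks roughly normally distributed, which makes it a convenient feature for machine learning.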
gender_pd = (
    data.orderBy(desc("Churn"))
    .dropDuplicates(subset=["userId"])
    .where(col("Churn") == 1)
    .groupBy("gender")
    .agg(count("gender").alias("count"))
    .toPandas()
)
gender_pd.plot(kind = "bar", x = "gender", y = "count")
[Figure: Gender Ratio]
  • Among users who cancelled, men make up a somewhat larger share, so gender is also used as a feature.
8. Features
  1. Numerical Features
# count songs
df_songs = data.groupBy("userId").agg(countDistinct("song").alias("countSong")).orderBy("userId")

# calc avg listen time
df_avg_length = data.groupBy("userId").agg(avg("length").alias("avgLength")).orderBy("userId")

# count all artist for each user
df_singers = data.dropDuplicates(["userId", "artist"]).groupBy("userId").agg(count("artist").alias("countArtist")).orderBy("userId")
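  • Number of distinct songs per user: how much a user listens may reflect the decision they eventually make.
  • Average song length per user: since this field is normally distributed, it should be a useful feature.
  • Number of distinct artists per user: possibly another important indicator of whether a user stays or leaves.
  2. Categorical Features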
# select category features
df_catgory = data.select(["userId", "gender", "level", "location", "method"])
df_catgory = df_catgory.dropDuplicates(["userId"]).orderBy(desc("userId"))
  • Gender, level, location, and method may each affect whether a user stays or leaves.
Merge and clean the features
#  user may have 2 Churn values, we keep value 1 only
df_calced = data.select("Churn","userId").orderBy(desc("Churn")).dropDuplicates(["userId"])
# join all features
for feature in [df_songs, df_avg_length, df_singers, df_catgory]:
    df_calced = df_calced.join(feature, ["userId"], how="left")
# drop rows containing any null value
df_calced = df_calced.dropna(how="any")
Convert the categorical fields to numbers
# convert category features to numeric
for index in ["gender", "level", "location", "method"]:
    indexer = StringIndexer(inputCol=index, outputCol=f"{index}Indexer")
    _fit = indexer.fit(df_calced)
    df_calced = _fit.transform(df_calced)
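What StringIndexer does to each column can be sketched in plain Python. By default it orders labels by descending frequency, so the most frequent value gets index 0.0. The helper `string_index` and the sample values are hypothetical, and the alphabetical tie-breaking is an assumption about the default ordering (the sample has no ties).

```python
from collections import Counter

def string_index(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # sort by descending count, then alphabetically so ties are deterministic
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

levels = ["paid", "free", "paid", "paid", "free", "paid"]  # made-up column
mapping = string_index(levels)
indexed = [mapping[v] for v in levels]
print(mapping)
print(indexed)
```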
Convert all features into a vector that PySpark can process
# convert all features to vector
assembler = VectorAssembler(inputCols=["countSong", "avgLength", "countArtist", "genderIndexer", "levelIndexer", "locationIndexer", "methodIndexer"], outputCol="featuresVec")
df_calced = assembler.transform(df_calced)
Scale the vector

# standardize the features column
stander = StandardScaler(inputCol="featuresVec", outputCol="features")
stander_fit = stander.fit(df_calced)
df_calced = stander_fit.transform(df_calced)
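  • Features with very large or very small magnitudes would otherwise dominate the result; scaling reduces the influence of magnitude on the learned weights.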
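For a single feature column, the scaling can be sketched in plain Python. With its default settings StandardScaler divides each feature by its sample standard deviation and does not subtract the mean; the helper `scale_to_unit_std` and the sample numbers are hypothetical.

```python
from statistics import stdev

def scale_to_unit_std(column):
    """Divide every value by the column's sample standard deviation."""
    sd = stdev(column)  # sample (n - 1) standard deviation
    return [x / sd for x in column]

count_song = [120.0, 300.0, 45.0, 980.0]  # made-up large-magnitude feature
gender_idx = [0.0, 1.0, 1.0, 0.0]         # made-up small-magnitude feature

scaled_songs = scale_to_unit_std(count_song)
scaled_gender = scale_to_unit_std(gender_idx)

# after scaling, both columns have a standard deviation of about 1
print(round(stdev(scaled_songs), 6), round(stdev(scaled_gender), 6))
```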
Modeling

From the merged and cleaned dataset, randomly take 70%, 15%, and 15% as the training, test, and validation sets respectively:

# use 70% data as train dataset, and 15% for test and val dataset
train_dataset, test_dataset, val_dataset = df_calced.randomSplit([0.7, 0.15, 0.15], seed = 58)
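  • Evaluator (scores the final predictions; note that BinaryClassificationEvaluator reports areaUnderROC by default, so the values printed as "Accuracy" below are AUC scores)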
# build  the evaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction",labelCol="Churn")
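To make the metric concrete, here is a plain-Python sketch of area under the ROC curve, computed as the probability that a randomly chosen positive is scored above a randomly chosen negative (ties count one half). The function `auc` and the sample data are hypothetical. Because the evaluator above is given the hard 0/1 prediction column as its score, the result for a classifier reduces to the average of the true positive rate and the true negative rate.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels      = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 0, 0, 0, 0, 1]  # hard 0/1 predictions used as scores

tpr = 2 / 3  # 2 of the 3 positives were predicted 1
tnr = 4 / 5  # 4 of the 5 negatives were predicted 0
print(auc(labels, predictions), (tpr + tnr) / 2)
```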
  • Linear regression
# build linear regression model
lr = LinearRegression(labelCol="Churn", featuresCol="features", fitIntercept=False, regParam=0.0, solver="normal")
model = lr.fit(train_dataset)

# predict on the train and test datasets with the fitted linear regression model
train_pre = model.transform(train_dataset)
test_pre = model.transform(test_dataset)

# use the evaluator to evaluate the train and test predictions
print("LinearRegression Accuracy for train dataset:", evaluator.evaluate(train_pre))
print("LinearRegression Accuracy for test dataset:", evaluator.evaluate(test_pre))

# final result
# LinearRegression Accuracy for train dataset: 0.6678041271593208
# LinearRegression Accuracy for test dataset: 0.9106971716581354
  • Logistic regression
# build logistic regression model
logist_lr = LogisticRegression(labelCol="Churn", featuresCol="features")
logist_model = logist_lr.fit(train_dataset)

# predict on the train and test datasets with the fitted logistic regression model
train_pre_logist = logist_model.transform(train_dataset)
test_pre_logist = logist_model.transform(test_dataset)

# use the evaluator to evaluate the train and test predictions
print("LogistRegression Accuracy for train dataset:", evaluator.evaluate(train_pre_logist))
print("LogistRegression Accuracy for test dataset:", evaluator.evaluate(test_pre_logist))

#Final result
# LogistRegression Accuracy for train dataset: 0.5971902287939127
# LogistRegression Accuracy for test dataset: 0.6027563842723956
  • Random forest
# build RandomForest model
forest = RandomForestClassifier(labelCol="Churn", featuresCol="features", maxDepth=10)
forest_model = forest.fit(train_dataset)

# predict on the train and test datasets with the fitted random forest model
train_pre_forest = forest_model.transform(train_dataset)
test_pre_forest = forest_model.transform(test_dataset)

# use the evaluator to evaluate the train and test predictions
print("RandomForest Regression Accuracy for train dataset:", evaluator.evaluate(train_pre_forest))
print("RandomForest Regression Accuracy for test dataset:", evaluator.evaluate(test_pre_forest))

# Final result
# RandomForest Regression Accuracy for train dataset: 0.6351333140056861
# RandomForest Regression Accuracy for test dataset: 0.8192344933184481
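Grid search and hyperparameter tuning (this part was run only in the small-dataset environment; on AWS it kept failing with error 404 for reasons unknown)

First, create a hyperparameter dictionary for each model:

# create hyper-parameter dict for each model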
linearregression_dict = {
    lr.regParam: [0, 0.01, 0.1],
    lr.fitIntercept: [True, False],
    lr.maxIter: [10, 30, 50]
}
logistregression_dict = {
    logist_lr.maxIter: [10, 30, 50],
    logist_lr.fitIntercept: [True, False],
    logist_lr.regParam: [0, 0.01, 0.1]
}
forest_dict = {forest.maxDepth : [*range(10, 40, 10)], 
            forest.minInstancesPerNode : [*range(1, 30, 6)], 
            forest.maxBins : [*range(2, 33, 8)], 
            forest.numTrees: [*range(3, 30, 9)]}

Combine the dictionaries into one structure to simplify the steps that follow:

# combine all to ({module_name}, {module}, {param}, {evaluators})
all_module_dicts = [linearregression_dict, logistregression_dict, forest_dict]
params_maps = [reduce(lambda gb, param: gb.addGrid(*param), 
                    module.items(), 
                    ParamGridBuilder()).build() for module in all_module_dicts]

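all_module_names = ["LinearRegression", "LogistRegression", "RandomForestRegression"]
all_modules = [lr, logist_lr, forest]
# assumption: the original post does not show these two definitions; they are
# reconstructed from the metric names printed below (areaUnderPR/areaUnderROC vs. accuracy/f1)
binary_evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="Churn")
multi_evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="Churn")
all_evaluates = [binary_evaluator] + [multi_evaluator] * 2
all_module_combine = list(zip(all_module_names, all_modules, params_maps, all_evaluates))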
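The size of these grids explains much of the running time. ParamGridBuilder produces one parameter map per combination of values, which the following plain-Python sketch mimics with itertools.product. The string keys stand in for the Param objects, and `build_grid` is a hypothetical helper, not a PySpark function.

```python
from itertools import product

def build_grid(param_dict):
    """Return one dict per combination of the listed values (a Cartesian product)."""
    names = list(param_dict)
    return [dict(zip(names, combo)) for combo in product(*param_dict.values())]

regression_grid = build_grid({
    "regParam": [0, 0.01, 0.1],
    "fitIntercept": [True, False],
    "maxIter": [10, 30, 50],
})
forest_grid = build_grid({
    "maxDepth": [*range(10, 40, 10)],
    "minInstancesPerNode": [*range(1, 30, 6)],
    "maxBins": [*range(2, 33, 8)],
    "numTrees": [*range(3, 30, 9)],
})

num_folds = 3
print(len(regression_grid), len(forest_grid), len(forest_grid) * num_folds)
```

With 3 folds, the random forest grid alone means 540 model fits, plus one final refit of the best combination on the full training set.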

Fit every model over its hyperparameter grid

# cross-validate every hyper-parameter combination
modules = {}
for module_name, module, param, eva in all_module_combine:
    cross = CrossValidator(estimator=module, estimatorParamMaps=param, evaluator=eva, numFolds=3)
    cModel = cross.fit(train_dataset)
    
    train_result = cModel.transform(train_dataset)
    test_result = cModel.transform(test_dataset)
    accuracy, f1 = ("accuracy", "f1") if isinstance(eva, MulticlassClassificationEvaluator) else ("areaUnderPR", "areaUnderROC")
    print(f"{module_name} {accuracy.capitalize()} for Train dataset:", eva.evaluate(train_result, {eva.metricName: accuracy}))
    print(f"{module_name} {f1.capitalize()} for Train dataset:", eva.evaluate(train_result, {eva.metricName: f1}))

    print(f"{module_name} {accuracy.capitalize()} for Test dataset:", eva.evaluate(test_result, {eva.metricName: accuracy}))
    print(f"{module_name} {f1.capitalize()} for Test dataset:", eva.evaluate(test_result, {eva.metricName: f1}))
    
    print(f"Best params for {module_name}:", cModel.getEstimatorParamMaps()[np.argmax(cModel.avgMetrics)])
    print()
    
    modules[module_name] = {"module": cModel, "evaluator": eva}

Running it gives the following results:

LinearRegression Areaunderpr for Train dataset: 0.9709799102289998
LinearRegression Areaunderroc for Train dataset: 0.9107775693141545
LinearRegression Areaunderpr for Test dataset: 0.9882510716003329
LinearRegression Areaunderroc for Test dataset: 0.9357142857142857
Best params for LinearRegression: {Param(parent='LinearRegression_fb672c955ba6', name='regParam', doc='regularization parameter (>= 0).'): 0.01, Param(parent='LinearRegression_fb672c955ba6', name='fitIntercept', doc='whether to fit an intercept term.'): True, Param(parent='LinearRegression_fb672c955ba6', name='maxIter', doc='max number of iterations (>= 0).'): 10}

LogistRegression Accuracy for Train dataset: 0.8417721518987342
LogistRegression F1 for Train dataset: 0.8464320489636945
LogistRegression Accuracy for Test dataset: 0.8787878787878788
LogistRegression F1 for Test dataset: 0.8787878787878788
Best params for LogistRegression: {Param(parent='LogisticRegression_8f718c7ac443', name='maxIter', doc='max number of iterations (>= 0).'): 10, Param(parent='LogisticRegression_8f718c7ac443', name='fitIntercept', doc='whether to fit an intercept term.'): False, Param(parent='LogisticRegression_8f718c7ac443', name='regParam', doc='regularization parameter (>= 0).'): 0.0}

RandomForestRegression Accuracy for Train dataset: 0.8544303797468354
RandomForestRegression F1 for Train dataset: 0.857910562223808
RandomForestRegression Accuracy for Test dataset: 0.8181818181818182
RandomForestRegression F1 for Test dataset: 0.8371628371628371
Best params for RandomForestRegression: {Param(parent='RandomForestClassifier_4a8ffb100f5f', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 10, Param(parent='RandomForestClassifier_4a8ffb100f5f', name='minInstancesPerNode', doc='Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.'): 13, Param(parent='RandomForestClassifier_4a8ffb100f5f', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 18, Param(parent='RandomForestClassifier_4a8ffb100f5f', name='numTrees', doc='Number of trees to train (>= 1).'): 21}

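The test results show linear regression with the highest scores, with best hyperparameters regParam=0.01, fitIntercept=True, maxIter=10. Note that its scores are areaUnderPR and areaUnderROC while the other two models report accuracy and F1, so the numbers are not measured on the same scale. We use linear regression as the final model and evaluate it on the validation set: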

# based on the final result, use linear regression as the final model
final_module_dict = modules["LinearRegression"]
final_module = final_module_dict["module"]
val_pre = final_module.transform(val_dataset)
final_evaluator = final_module_dict["evaluator"]
print("Val dataset areaUnderPR:", final_evaluator.evaluate(val_pre, {final_evaluator.metricName: "areaUnderPR"}))
# Final result
# Val dataset areaUnderPR: 0.9615486013517344

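Summary:

  • In this project we built a model to predict customer churn.
  • During data cleaning, we filtered out entries with an empty userId or sessionId.
  • During data exploration, we labeled the dataset to mark which users churned.
  • During visualization, we saw that male users churn more than female users.
  • For features, we chose average song length, number of songs, number of artists listened to, gender, level, location, and method as inputs.
  • In the end, linear regression turned out to be the best model for this project.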

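Evaluation metrics

Every prediction of a binary classifier falls into one of four outcomes:

  • Predicted positive, actually positive = TP (True Positive)
  • Predicted positive, actually negative (wrong prediction) = FP (False Positive)
  • Predicted negative, actually positive (wrong prediction) = FN (False Negative)
  • Predicted negative, actually negative = TN (True Negative)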
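The four outcomes can be counted directly from a list of actual labels and a list of predictions. The sketch below uses made-up labels, and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn

actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)
```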
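Accuracy

Literally, the fraction of predictions that are correct:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

When the classes in a dataset are very unevenly represented, accuracy is strongly affected: the larger a class's share, the more it dominates the score.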
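A small worked example of that effect, using made-up numbers: with 90 retained users and 10 churned users, a model that always predicts "not churned" reaches 90% accuracy while finding no churned user at all.

```python
actual = [0] * 90 + [1] * 10  # made-up labels: 10% of users churn
predicted = [0] * 100         # a useless model that always predicts "no churn"

correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual)

churned_found = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
print(accuracy, churned_found)
```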

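Precision

The ratio of correctly predicted positive samples (correctly predicted negatives do not count) to all samples predicted positive:

$$\text{Precision} = \frac{TP}{TP + FP}$$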
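Recall

The ratio of correctly predicted positive samples (correctly predicted negatives do not count) to all samples that are actually positive:

$$\text{Recall} = \frac{TP}{TP + FN}$$

In short, precision asks: of the events we predicted, how many did we get right? (For example, I predict that Real Madrid will win this match; "this match" is the event of interest, and the final result decides whether the prediction was right.)
Recall asks: of the events of interest that really happened, how many did we predict? (For example, I predicted that Real Madrid would win 2 of 5 matches, and they actually won 3; if both predicted wins were among those 3, recall is 2/3.)
However, precision and recall pull against each other: when the sample sizes are comparable, raising one usually lowers the other, so F1-Score is used in the end to judge how robust a model is.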
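The football example can be worked through in a few lines, assuming both predicted wins were among the three actual wins. The match order is made up, and `precision` and `recall` are hypothetical helper names.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that were predicted positive."""
    return tp / (tp + fn)

# 5 matches, 1 = win: two wins were predicted, three wins happened
actual    = [1, 1, 1, 0, 0]
predicted = [1, 1, 0, 0, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

print(precision(tp, fp), recall(tp, fn))
```

Every predicted win was right, so precision is 1, but one of the three real wins was missed, so recall is 2/3; recall can never exceed 1.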

F1-Score

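F1-Score is the harmonic mean of Precision and Recall:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$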
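A short sketch of why the harmonic mean is used: it stays low unless both precision and recall are high, whereas an arithmetic mean would hide a very poor recall. The input values are made up, and `f1_score` is a hypothetical helper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1_score(0.8, 0.8)      # both metrics equally good
lopsided = f1_score(1.0, 0.1)      # perfect precision, very poor recall
arithmetic_mean = (1.0 + 0.1) / 2  # for comparison with the lopsided case

print(balanced, lopsided, arithmetic_mean)
```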

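Difficulties

Hyperparameter tuning turned out to be the hard part, and because the dataset is huge every run takes a long time, which on AWS means dollars. Even after finishing the first version of the project, the strengths, weaknesses, and focus of each model were still not very clear to me. I know all three models by heart, but as for which one suits this project best, all I had in mind was "for binary classification, linear regression is better"; my grasp of the concepts is still shaky and will probably only firm up gradually through future work.

Code improvements

At first I did not use grid search and judged the models purely with default parameter values, which is not accurate. After a lot of searching I gradually got the hang of grid search (the course material was beyond me, I am afraid), and only then started tuning hyperparameters with it.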

References:
https://stackoverflow.com/questions/52498970/how-to-get-the-best-hyperparameter-value-after-crossvalidation-in-pyspark
https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.html
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.evaluation.BinaryClassificationMetrics
https://stackoverflow.com/questions/37707305/pyspark-multiple-conditions-in-when-clause
