Original post: http://www.wyblog.cn/2016/11/05/%e5%a4%a9%e6%b1%a0%e5%88%9d%e4%bd%93%e9%aa%8c-%e6%96%b0%e4%ba%ba%e5%ae%9e%e6%88%98%e8%b5%9b%e4%b9%8b%e7%a6%bb%e7%ba%bf%e8%b5%9b/
The official description is at the link below; here I'll only explain it briefly.
https://tianchi.shuju.aliyun.com/getStart/introduction.htm?spm=5176.100068.5678.1.VEirgR&raceId=231522
As you can see, there are two tables. The first, UI, holds each user's behavior over the month on the full item set; the second, P, describes the item subset. The task states that what gets scored is a prediction of which items from the subset each user will buy on December 19. With that in mind, here are a few simple ground rules I followed:
Since a Spark cluster is involved, all files are assumed to live on HDFS. First upload the two source files to HDFS, read them with Spark, and register them as tables. Everything below was done in the interactive pyspark shell.
#Read the CSV files (first row as header) and register them as temp views; the CREATE TABLE statements below persist into the default Derby-backed metastore
df=spark.read.csv("tianchi_fresh_comp_train_user.csv",header=True)
df2=spark.read.csv("tianchi_fresh_comp_train_item.csv",header=True)
df.createOrReplaceTempView("user")
df2.createOrReplaceTempView("item")
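One thing worth knowing: with header=True and no inferSchema, every column comes back as a string, so the integer comparisons on behavior_type below rely on Spark's implicit casts. A quick, optional check (a sketch; inferSchema is a standard read option):
#All columns are strings by default; inspect the schema and a few rows before querying
df.printSchema()
df.show(3, truncate=False)
#Optionally let Spark infer numeric types instead:
#df=spark.read.csv("tianchi_fresh_comp_train_user.csv",header=True,inferSchema=True)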
#SQL: filter the UI table with the P table and create a new user_item table
spark.sql("CREATE TABLE user_item AS \
SELECT t.* FROM user AS t \
JOIN item AS b \
ON t.item_id=b.item_id \
AND t.item_category=b.item_category")
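One caveat: the P table also carries an item_geohash column, so the same (item_id, item_category) pair can appear on several rows, and the join above would then duplicate rows in user_item. If that turns out to be the case, a defensive variant (a sketch, with a hypothetical table name user_item_dedup) deduplicates P first:
#Variant: deduplicate the item subset before joining, to avoid multiplying rows
spark.sql("CREATE TABLE user_item_dedup AS \
SELECT t.* FROM user AS t \
JOIN (SELECT DISTINCT item_id,item_category FROM item) AS b \
ON t.item_id=b.item_id \
AND t.item_category=b.item_category")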
Next, we flag the users' interactions on Dec 16-17 and mark whether the user made a purchase on Dec 18:
spark.sql("CREATE TABLE day1617_18_detail AS \
SELECT user_id,item_id, \
CASE WHEN substr(time,1,10) IN ('2014-12-16','2014-12-17') AND behavior_type=1 THEN 1 ELSE 0 END AS Is_2day_view, \
CASE WHEN substr(time,1,10) IN ('2014-12-16','2014-12-17') AND behavior_type=2 THEN 1 ELSE 0 END AS Is_2day_favor, \
CASE WHEN substr(time,1,10) IN ('2014-12-16','2014-12-17') AND behavior_type=3 THEN 1 ELSE 0 END AS Is_2day_tocar, \
CASE WHEN substr(time,1,10) IN ('2014-12-16','2014-12-17') AND behavior_type=4 THEN 1 ELSE 0 END AS Is_2day_buy, \
CASE WHEN substr(time,1,10)='2014-12-18' AND behavior_type=4 THEN 1 ELSE 0 END AS Is_buy \
FROM user_item ")
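A quick sanity check on the new flags helps before aggregating, for example how many rows actually carry the Dec 18 purchase flag (just an optional check):
#Optional check: how many rows are flagged as a Dec 18 purchase
spark.sql("SELECT SUM(Is_buy) AS day18_buy_rows, COUNT(*) AS total_rows FROM day1617_18_detail").show()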
We also need the same interaction flags for Dec 17-18:
spark.sql("CREATE TABLE day1718_19_detail AS \
SELECT DISTINCT user_id,item_id, \
CASE WHEN substr(time,1,10) IN ('2014-12-17','2014-12-18') AND behavior_type=1 THEN 1 ELSE 0 END AS Is_2day_view, \
CASE WHEN substr(time,1,10) IN ('2014-12-17','2014-12-18') AND behavior_type=2 THEN 1 ELSE 0 END AS Is_2day_favor, \
CASE WHEN substr(time,1,10) IN ('2014-12-17','2014-12-18') AND behavior_type=3 THEN 1 ELSE 0 END AS Is_2day_tocar, \
CASE WHEN substr(time,1,10) IN ('2014-12-17','2014-12-18') AND behavior_type=4 THEN 1 ELSE 0 END AS Is_2day_buy \
FROM user_item")
From each of the two tables above, aggregate the total count of each interaction type per (user_id, item_id) pair for the corresponding dates:
spark.sql("CREATE TABLE day1617_18_train_data AS \
SELECT * FROM (SELECT user_id,item_id, \
SUM(CASE WHEN Is_2day_view=1 THEN 1 ELSE 0 END) AS 2day_view, \
SUM(CASE WHEN Is_2day_favor=1 THEN 1 ELSE 0 END) AS 2day_favor,\
SUM(CASE WHEN Is_2day_tocar=1 THEN 1 ELSE 0 END) AS 2day_tocar,\
SUM(CASE WHEN Is_2day_buy=1 THEN 1 ELSE 0 END) AS 2day_buy,\
SUM(CASE WHEN Is_buy=1 THEN 1 ELSE 0 END) AS DAY18_buy \
FROM day1617_18_detail \
GROUP BY user_id,item_id) \
WHERE 2day_view>0 OR 2day_favor>0 OR 2day_tocar>0 OR 2day_buy>0")
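It is worth seeing how unbalanced the label is at this point, since the next steps sample negatives down to roughly the number of positives (again just an optional check):
#Optional check: positive pairs (bought on Dec 18) versus all candidate pairs
spark.sql("SELECT SUM(CASE WHEN DAY18_buy>0 THEN 1 ELSE 0 END) AS positives, COUNT(*) AS total \
FROM day1617_18_train_data").show()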
spark.sql("CREATE TABLE day1718_19_predict_data AS \
SELECT * FROM (SELECT user_id,item_id, \
SUM(CASE WHEN Is_2day_view=1 THEN 1 ELSE 0 END) AS 2day_view, \
SUM(CASE WHEN Is_2day_favor=1 THEN 1 ELSE 0 END) AS 2day_favor,\
SUM(CASE WHEN Is_2day_tocar=1 THEN 1 ELSE 0 END) AS 2day_tocar,\
SUM(CASE WHEN Is_2day_buy=1 THEN 1 ELSE 0 END) AS 2day_buy \
FROM day1718_19_detail \
GROUP BY user_id,item_id) \
WHERE 2day_view>0 OR 2day_favor>0 OR 2day_tocar>0 OR 2day_buy>0")
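The size of this candidate set feeds directly into precision and therefore F1, so it is also worth a quick count:
#Optional check: number of candidate (user_id,item_id) pairs for the Dec 19 prediction
spark.sql("SELECT COUNT(*) AS candidates FROM day1718_19_predict_data").show()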
Finally, we pull out all of the positive samples plus an equal number of negative samples to form the final training table, train_datatable.
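That sampling step isn't shown in the original code; a minimal sketch, assuming the DataFrame sample API and registering the result as a temp view named train_datatable, could look like this:
#Sketch: keep every positive pair and roughly the same number of randomly sampled negatives
pos=spark.sql("SELECT * FROM day1617_18_train_data WHERE DAY18_buy>0")
neg=spark.sql("SELECT * FROM day1617_18_train_data WHERE DAY18_buy=0")
ratio=pos.count()/float(neg.count())   #fraction of negatives to keep
neg_sampled=neg.sample(False, ratio, seed=42)   #sample without replacement
pos.union(neg_sampled).createOrReplaceTempView("train_datatable")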
Because the raw features are so few, they need to be expanded; in the code below I added a handful of derived features more or less arbitrarily:
#Note: for machine learning in Spark, all features must be assembled into a single vector column; by default the estimators look for a feature column named features and a class label column named label.
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
assembler=VectorAssembler(inputCols=["t1","t2","t3","t4","t5","t6","t7","t8","t9","t10"],outputCol="features")
dataset_train=spark.sql("SELECT user_id,item_id,2day_view,2day_favor,2day_tocar,2day_buy, \
CASE WHEN day18_buy>0 THEN 1 ELSE 0 END as label,\
2day_view*2 AS t1 , 2day_favor*3 AS t2 , 2day_tocar*4 AS t3 , 2day_buy*1 AS t4 ,\
2day_view*1+2day_favor*2+2day_tocar*3-2day_buy*1 AS t5,\
(2day_view+1)/(2day_view+2day_favor+2day_tocar+2day_buy+1) AS t6,\
(2day_favor+1)/(2day_view+2day_favor+2day_tocar+2day_buy+1) AS t7,\
(2day_tocar+1)/(2day_view+2day_favor+2day_tocar+2day_buy+1) AS t8,\
(2day_buy+1)/(2day_view+2day_favor+2day_tocar+2day_buy+1) AS t9,\
(2day_favor+1)*(2day_tocar+1)-2day_buy*2 AS t10 \
FROM train_datatable WHERE 2day_buy<20")
output_train = assembler.transform(dataset_train)
train_data=output_train.select("label","features")
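Since train_data is consumed several times below (fitting two indexers, then the pipeline), caching it avoids re-running the SQL each time; a quick look at the schema also confirms the vector column is in place (optional):
#Optional: cache the training set and verify the assembled columns
train_data.cache()
train_data.printSchema()
train_data.show(5, truncate=False)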
With the training set ready, the next step is to build and train a model with Spark's MLlib. How to use Spark MLlib in general is beyond the scope of this post; assume we have trained a pipeline model called TreeModel.
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(train_data)
featureIndexer =VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(train_data)
(trainingData, testData) = train_data.randomSplit([0.7, 0.3]) #hold out part of the data as a test set to check whether the model is reliable
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
TreeModel = pipeline.fit(trainingData)
predictions = TreeModel.transform(testData) #run the fitted pipeline on the held-out test set
#The following reports the model's error rate on the test set; in my run it was around 0.22
evaluator = MulticlassClassificationEvaluator(
labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))
#The structure of the trained decision tree can be inspected like this
model=TreeModel.stages[2]
print(model.toDebugString)
Now build the prediction set from the Dec 17-18 statistics table; note that the prediction set also needs the same assembled feature vector.
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
assembler=VectorAssembler(inputCols=["t1","t2","t3","t4","t5","t6","t7","t8","t9","t10"],outputCol="features")
dataset_predict=spark.sql("SELECT user_id,item_id,2day_buy, \
2day_view*2 AS t1 , 2day_favor*3 AS t2 , 2day_tocar*4 AS t3 , 2day_buy*1 AS t4 ,\
2day_view*1+2day_favor*2+2day_tocar*3-2day_buy*1 AS t5,\
(2day_view+1)/(2day_view+2day_favor+2day_tocar+2day_buy+1) AS t6,\
(2day_favor+1)/(2day_view+2day_favor+2day_tocar+2day_buy+1) AS t7,\
(2day_tocar+1)/(2day_view+2day_favor+2day_tocar+2day_buy+1) AS t8,\
(2day_buy+1)/(2day_view+2day_favor+2day_tocar+2day_buy+1) AS t9,\
(2day_favor+1)*(2day_tocar+1)-2day_buy*2 AS t10 \
FROM day1718_19_predict_data WHERE 2day_buy<5")
output_predict = assembler.transform(dataset_predict)
predict_data=output_predict.select("user_id","item_id","2day_buy","features")
Finally, we can apply the model to the prediction set, write the result to a file, and submit it.
#index the features the same way as in training, then apply the decision-tree stage
prediction=model.transform(featureIndexer.transform(predict_data))
result=prediction.select("user_id","item_id","2day_buy","prediction")
result.createOrReplaceTempView("result")
#keep pairs predicted as a buy that were not already bought on the 17th/18th
#(prediction holds the StringIndexer label index; check labelIndexer.labels to confirm that index 1 corresponds to label 1)
outdata=spark.sql("SELECT user_id,item_id FROM result WHERE prediction>0 AND 2day_buy=0")
outdata.write.csv("outfile.csv")
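One practical note: write.csv on a cluster produces a directory of part files on HDFS with no header row. If a single plain CSV is more convenient for the submission, a variant like this works (outfile_single is just a hypothetical path):
#Optional: collapse to one part file and write a header row
outdata.coalesce(1).write.csv("outfile_single", header=True)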
In the end, the submission scored: 2016-11-04 12:47:00 || F1 6.36805181% || precision 0.04416168, which put me somewhere just past 200th place. For a first-time newbie, that still felt pretty satisfying, haha.
This post mainly walked through how to get a first submission done from scratch. Feature engineering and model building both deserve much deeper study. I hope everyone gets a good result!