Tianchi First Experience: Newcomer Practice Competition [Offline Round]

 
  
Note: the code in this post has some problems. Please see the revised version of the post at:
Original post: http://www.wyblog.cn/2016/11/05/%e5%a4%a9%e6%b1%a0%e5%88%9d%e4%bd%93%e9%aa%8c-%e6%96%b0%e4%ba%ba%e5%ae%9e%e6%88%98%e8%b5%9b%e4%b9%8b%e7%a6%bb%e7%ba%bf%e8%b5%9b/


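  • Before we start: this post is aimed at readers who have no idea where to begin with a Tianchi competition. It records, from scratch, the simplest way to organize the data, extract features, build a model (or apply hand-written rules) to make predictions, select the rows to submit, and get a score. It therefore does not go deep into feature selection or model design. The tool I use is Spark, which is very convenient: it includes Spark SQL (whose default metastore database is Derby) and the MLlib library for machine learning, so the offline round can be completed entirely in Spark.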

Problem Description


The official description is linked below; only a brief explanation is given here.

https://tianchi.shuju.aliyun.com/getStart/introduction.htm?spm=5176.100068.5678.1.VEirgR&raceId=231522

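There are two tables. The first, UI, holds users' behavior over the month on the full item set; the second, P, holds information on an item subset. According to the task statement, scoring is based on predicting which items in the subset users purchase on December 19. Here are a few simple rules: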

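  • Use the item information in table P to filter table UI, keeping only interaction records that involve items in P. The reasoning is that users follow different purchase strategies for different categories of goods or services, so we learn only from data in the categories that P contains.
  • Simplify the problem by predicting only from user-item interactions in the two days before the target day. That is, to predict purchases on the 19th, we use interactions on the 17th and 18th. To make that prediction we need to learn a model, so we train on interactions from the 16th and 17th together with purchases on the 18th. This post simply uses a decision tree.
  • Table UI contains only four behaviors (view, favorite, add to cart, purchase), which is certainly not enough to build a decision tree, so we need to expand the features. Examples: score each user-item pair (say 2 points per view, 3 per favorite, and so on), or compute the share of views among all interactions. These are features I picked arbitrarily; finding better ones takes some imagination of your own.
  • One important issue: in the extracted training set, pairs with no purchase on the 18th far outnumber pairs with a purchase, so the ratio of positive to negative samples is extremely skewed. This is disastrous for learning, and the resulting model is unusable. We therefore take all pairs in the training set that were purchased on the 18th, then select an equal or similar number of negative samples to form a new training set, and train the decision tree on that.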
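
The two-day window rule can be made concrete with a small plain-Python sketch. The helper name two_day_window is made up for illustration and is not part of the Spark code in this post; it only shows which days feed the training features, the training label, and the prediction features.

```python
from datetime import date, timedelta

def two_day_window(target_day):
    """Return the two calendar days immediately before target_day."""
    return [target_day - timedelta(days=2), target_day - timedelta(days=1)]

# Training: features come from the two days before the 18th, labels from the 18th.
train_label_day = date(2014, 12, 18)
train_feature_days = two_day_window(train_label_day)

# Prediction: features come from the two days before the 19th.
predict_day = date(2014, 12, 19)
predict_feature_days = two_day_window(predict_day)

print([d.isoformat() for d in train_feature_days])    # ['2014-12-16', '2014-12-17']
print([d.isoformat() for d in predict_feature_days])  # ['2014-12-17', '2014-12-18']
```

Keeping the training and prediction windows the same length means the features computed for both have the same meaning.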

Data Preparation


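If a Spark cluster is involved, file operations default to HDFS. First upload the two source files to HDFS, read them with Spark, and register them as tables. I did everything here in the pyspark interactive shell.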


# Read the CSV files (header=True treats the first line as column names) and register them as temporary views
df=spark.read.csv("tianchi_fresh_comp_train_user.csv",header=True)
df2=spark.read.csv("tianchi_fresh_comp_train_item.csv",header=True)
df.createOrReplaceTempView("user")
df2.createOrReplaceTempView("item")

# SQL: filter table UI by table P to create the new user_item table
spark.sql("CREATE TABLE user_item AS \
SELECT t.* FROM user AS t \
JOIN item AS b \
ON t.item_id=b.item_id \
AND t.item_category=b.item_category")

Next, we count user interactions on the 16th and 17th and flag whether the user made a purchase on the 18th:


# The date test uses IN so that it is evaluated as one unit before the AND (AND binds tighter than OR)
spark.sql("CREATE TABLE day1617_18_detail AS \
SELECT user_id,item_id, \
CASE WHEN substr(time,1,10) IN ('2014-12-16','2014-12-17') AND behavior_type=1 THEN 1 ELSE 0 END AS Is_2day_view, \
CASE WHEN substr(time,1,10) IN ('2014-12-16','2014-12-17') AND behavior_type=2 THEN 1 ELSE 0 END AS Is_2day_favor, \
CASE WHEN substr(time,1,10) IN ('2014-12-16','2014-12-17') AND behavior_type=3 THEN 1 ELSE 0 END AS Is_2day_tocar, \
CASE WHEN substr(time,1,10) IN ('2014-12-16','2014-12-17') AND behavior_type=4 THEN 1 ELSE 0 END AS Is_2day_buy, \
CASE WHEN substr(time,1,10)='2014-12-18' AND behavior_type=4 THEN 1 ELSE 0 END AS Is_buy \
FROM user_item ")
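
A note on CASE conditions that mix OR and AND: in SQL, as in Python, AND binds more tightly than OR, so a condition written as `a OR b AND c` is read as `a OR (b AND c)`. The plain-Python sketch below (both helper names are made up for illustration) shows the effect on a purchase row from the 16th, which is why the date test is best grouped with parentheses or IN before it is combined with the behavior type.

```python
def flag_ungrouped(day, behavior_type):
    # Mirrors: day='2014-12-16' OR day='2014-12-17' AND behavior_type=1
    return day == "2014-12-16" or day == "2014-12-17" and behavior_type == 1

def flag_grouped(day, behavior_type):
    # Mirrors: day IN ('2014-12-16','2014-12-17') AND behavior_type=1
    return day in ("2014-12-16", "2014-12-17") and behavior_type == 1

# A purchase (behavior_type 4) on the 16th should not be counted as a view.
print(flag_ungrouped("2014-12-16", 4))  # True: the date alone satisfies the OR
print(flag_grouped("2014-12-16", 4))    # False: the behavior type is checked too
```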

We also need the user interactions for the 17th and 18th:


# No DISTINCT here, so the per-behavior counts are computed the same way as in the training table
spark.sql("CREATE TABLE day1718_19_detail AS \
SELECT user_id,item_id, \
CASE WHEN substr(time,1,10) IN ('2014-12-17','2014-12-18') AND behavior_type=1 THEN 1 ELSE 0 END AS Is_2day_view, \
CASE WHEN substr(time,1,10) IN ('2014-12-17','2014-12-18') AND behavior_type=2 THEN 1 ELSE 0 END AS Is_2day_favor, \
CASE WHEN substr(time,1,10) IN ('2014-12-17','2014-12-18') AND behavior_type=3 THEN 1 ELSE 0 END AS Is_2day_tocar, \
CASE WHEN substr(time,1,10) IN ('2014-12-17','2014-12-18') AND behavior_type=4 THEN 1 ELSE 0 END AS Is_2day_buy \
FROM user_item")

From each of the two tables above, compute the total count of each interaction type over the corresponding days:


spark.sql("CREATE TABLE day1617_18_train_data AS \
SELECT * FROM (SELECT user_id,item_id, \
SUM(CASE WHEN Is_2day_view=1 THEN 1 ELSE 0 END) AS 2day_view, \
SUM(CASE WHEN Is_2day_favor=1 THEN 1 ELSE 0 END) AS 2day_favor,\
SUM(CASE WHEN Is_2day_tocar=1 THEN 1 ELSE 0 END) AS 2day_tocar,\
SUM(CASE WHEN Is_2day_buy=1 THEN 1 ELSE 0 END) AS 2day_buy,\
SUM(CASE WHEN Is_buy=1 THEN 1 ELSE 0 END) AS DAY18_buy \
FROM day1617_18_detail \
GROUP BY user_id,item_id) \
WHERE 2day_view>0 OR 2day_favor>0 OR 2day_tocar>0 OR 2day_buy>0")

spark.sql("CREATE TABLE day1718_19_predict_data AS \
SELECT * FROM (SELECT user_id,item_id, \
SUM(CASE WHEN Is_2day_view=1 THEN 1 ELSE 0 END) AS 2day_view, \
SUM(CASE WHEN Is_2day_favor=1 THEN 1 ELSE 0 END) AS 2day_favor,\
SUM(CASE WHEN Is_2day_tocar=1 THEN 1 ELSE 0 END) AS 2day_tocar,\
SUM(CASE WHEN Is_2day_buy=1 THEN 1 ELSE 0 END) AS 2day_buy \
FROM day1718_19_detail \
GROUP BY user_id,item_id) \
WHERE 2day_view>0 OR 2day_favor>0 OR 2day_tocar>0 OR 2day_buy>0")

Finally, we extract all positive samples plus an equal number of negative samples to form the final training table, train_datatable.
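
The post does not show how train_datatable is built, so the following is only a plain-Python sketch of the balancing idea, not the Spark code that was actually used. The function name balance_by_undersampling, the label key, and the toy rows are all assumptions for illustration.

```python
import random

def balance_by_undersampling(rows, label_key="label", seed=42):
    """Keep every positive row and an equally sized random sample of negatives."""
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(len(positives), len(negatives))
    balanced = positives + rng.sample(negatives, k)
    rng.shuffle(balanced)
    return balanced

# Toy data: 3 positive pairs and 97 negative pairs.
rows = [{"user_id": i, "item_id": i, "label": 1 if i < 3 else 0} for i in range(100)]
balanced = balance_by_undersampling(rows)
print(len(balanced), sum(r["label"] for r in balanced))  # 6 3
```

In Spark the same idea could be expressed by filtering the positive and negative rows into two DataFrames, sampling the negatives, and taking the union, then saving the result as train_datatable.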

Feature Extraction and Model Training


Because the raw features are so few, they need to be expanded. In the code below I added a handful of features arbitrarily:

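# Note: for machine learning in Spark, all features must be assembled into a single vector column. By default the estimators look for a feature column named features and a label column named label.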
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

assembler=VectorAssembler(inputCols=["t1","t2","t3","t4","t5","t6","t7","t8","t9","t10"],outputCol="features")

dataset_train=spark.sql("SELECT user_id,item_id,2day_view,2day_favor,2day_tocar,2day_buy, \
CASE WHEN day18_buy>0 THEN 1 ELSE 0 END as label,\
2day_view*2 AS t1 , 2day_favor*3 AS t2 , 2day_tocar*4 AS t3 , 2day_buy*1 AS t4 ,\
2day_view*1+2day_favor*2+2day_tocar*3-2day_buy*1 AS t5,\
(2day_view+1)/(2day_view+2day_favor+2day_tocar+2day_buy+1) AS t6,\
(2day_favor+1)/(2day_view+2day_favor+2day_tocar+2day_buy+1) AS t7,\
(2day_tocar+1)/(2day_view+2day_favor+2day_tocar+2day_buy+1) AS t8,\
(2day_buy+1)/(2day_view+2day_favor+2day_tocar+2day_buy+1) AS t9,\
(2day_favor+1)*(2day_tocar+1)-2day_buy*2 AS t10 \
FROM train_datatable WHERE 2day_buy<20")

output_train = assembler.transform(dataset_train)
train_data=output_train.select("label","features")
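
To see what the ten columns t1 to t10 look like for one user-item pair, the SQL expressions above can be mirrored in plain Python. The function name expand_features and the sample counts are made up for illustration; the arithmetic follows the query term by term.

```python
def expand_features(view, favor, tocar, buy):
    """Mirror the t1..t10 expressions from the SQL query for one (user, item) pair."""
    total = view + favor + tocar + buy + 1  # the +1 keeps the denominator non-zero
    return [
        view * 2,                                    # t1: weighted views
        favor * 3,                                   # t2: weighted favorites
        tocar * 4,                                   # t3: weighted add-to-cart
        buy * 1,                                     # t4: purchases
        view * 1 + favor * 2 + tocar * 3 - buy * 1,  # t5: combined score
        (view + 1) / total,                          # t6: smoothed share of views
        (favor + 1) / total,                         # t7: smoothed share of favorites
        (tocar + 1) / total,                         # t8: smoothed share of add-to-cart
        (buy + 1) / total,                           # t9: smoothed share of purchases
        (favor + 1) * (tocar + 1) - buy * 2,         # t10: interaction term
    ]

# Example pair: 3 views, 1 favorite, nothing added to cart, nothing bought.
features = expand_features(view=3, favor=1, tocar=0, buy=0)
print(features)
```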

With the training set in hand, the next step is to build and train a model with Spark's MLlib. How to use Spark MLlib is beyond the scope of this post. Suppose we have trained a TreeModel.


from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(train_data)
featureIndexer =VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(train_data)
(trainingData, testData) = train_data.randomSplit([0.7, 0.3])  # hold out part of the data as a test set to check whether the model is reliable
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
TreeModel = pipeline.fit(trainingData)
predictions = TreeModel.transform(testData)  # predict on the test set with the trained pipeline model

# The code below reports the model's test error rate; in my run it was about 0.22
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))

# The learned decision tree can be inspected like this
model=TreeModel.stages[2]
print(model.toDebugString)

Building the Prediction Set and Predicting


First build the prediction set from the statistics table for the 17th and 18th. Note that the prediction set also needs feature vectors.


from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
assembler=VectorAssembler(inputCols=["t1","t2","t3","t4","t5","t6","t7","t8","t9","t10"],outputCol="features")

dataset_predict=spark.sql("SELECT user_id,item_id,2day_buy, \
2day_view*2 AS t1 , 2day_favor*3 AS t2 , 2day_tocar*4 AS t3 , 2day_buy*1 AS t4 ,\
2day_view*1+2day_favor*2+2day_tocar*3-2day_buy*1 AS t5,\
(2day_view+1)/(2day_view+2day_favor+2day_tocar+2day_buy+1) AS t6,\
(2day_favor+1)/(2day_view+2day_favor+2day_tocar+2day_buy+1) AS t7,\
(2day_tocar+1)/(2day_view+2day_favor+2day_tocar+2day_buy+1) AS t8,\
(2day_buy+1)/(2day_view+2day_favor+2day_tocar+2day_buy+1) AS t9,\
(2day_favor+1)*(2day_tocar+1)-2day_buy*2 AS t10 \
FROM day1718_19_predict_data WHERE 2day_buy<5")

output_predict = assembler.transform(dataset_predict)
predict_data=output_predict.select("user_id","item_id","2day_buy","features")

Finally, we can apply the model to the prediction data, write the result to a file, and submit it.


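# Use the full pipeline model (TreeModel) so that featureIndexer produces the indexedFeatures column the tree expects
prediction=TreeModel.transform(predict_data)
result=prediction.select("user_id","item_id","2day_buy","prediction")
result.createOrReplaceTempView("result")
# prediction holds the StringIndexer index, not the raw label; check labelIndexer.labels to confirm that index 1 means "bought"
outdata=spark.sql("SELECT user_id,item_id FROM result WHERE prediction>0 AND 2day_buy=0") # keep only pairs with no purchase on the 17th/18th
outdata.write.csv("outfile.csv")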

The final result: time 2016-11-04 12:47:00, F1 score 6.36805181%, precision 0.04416168. That ranked a little over 200th, which feels like quite an achievement for a first-time entrant, haha.
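
For readers unfamiliar with the score, here is a small plain-Python sketch of precision, recall, and F1 computed over sets of (user_id, item_id) pairs. It assumes the usual definitions of these metrics; the function name and the toy IDs are made up, and the official scorer may differ in details.

```python
def precision_recall_f1(predicted, actual):
    """Compute precision, recall and F1 over sets of (user_id, item_id) pairs."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)  # pairs that were predicted and really bought
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Toy example: 4 submitted pairs, 3 real purchases, 2 of them in common.
predicted = [("u1", "i1"), ("u1", "i2"), ("u2", "i3"), ("u3", "i4")]
actual = [("u1", "i1"), ("u2", "i3"), ("u4", "i9")]
p, r, f1 = precision_recall_f1(predicted, actual)
print(p, r, f1)
```

Because F1 is the harmonic mean of the two, submitting many extra pairs lowers precision and pulls the score down even when recall goes up.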

Summary


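This post mainly walked through, in detail, how to get started with the competition from zero. Feature extraction, model building, and the rest all call for further study. Good luck to everyone!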


