Table description
A random sample of 1.14 million Taobao users' ad display/click logs over 8 days (26 million records) forms the skeleton of the raw sample set. The fields are: user (user ID), time_stamp (log timestamp), adgroup_id (ad unit ID), pid (display position), nonclk (1 = not clicked) and clk (1 = clicked).
Reading and analyzing the data
# A random sample of 1.14 million Taobao users' 8-day ad display/click logs (26 million records) forms the raw sample data
df = spark.read.csv("data/raw_sample.csv", header=True)
df.show(5)
+------+----------+----------+-----------+------+---+
| user|time_stamp|adgroup_id| pid|nonclk|clk|
+------+----------+----------+-----------+------+---+
|581738|1494137644| 1|430548_1007| 1| 0|
|449818|1494638778| 3|430548_1007| 1| 0|
|914836|1494650879| 4|430548_1007| 1| 0|
|914836|1494651029| 5|430548_1007| 1| 0|
|399907|1494302958| 8|430548_1007| 1| 0|
+------+----------+----------+-----------+------+---+
only showing top 5 rows
print("总样本数目:", df.count())
print("adgroup_id数目:", df.groupBy("adgroup_id").count().count())
print("广告示位置:", df.groupBy("pid").count().collect()) # 可以考虑热编码onehot
print("用户的点击情况:", df.groupBy("nonclk").count().collect())
Total samples: 26557961
Distinct adgroup_id count: 846811
Ad display positions: [Row(pid='430548_1007', count=16472898), Row(pid='430539_1007', count=10085063)]
Click distribution: [Row(nonclk='0', count=1366056), Row(nonclk='1', count=25191905)]
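Only about 1.37 million of the 26.6 million impressions were clicked, so the positive class is rare. A quick sanity check of the overall click-through rate (a minimal sketch; note that clk is still a string column at this point):
ctr = df.filter(df.clk == "1").count() / df.count()
print("overall CTR: %.4f" % ctr)  # 1366056 / 26557961 ≈ 0.0514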
Changing data types
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, LongType, StringType
# Print the schema of df
df.printSchema()
# Adjust the schema: cast column types and rename columns
raw_sample_df = df.\
withColumn("user", df.user.cast(IntegerType())).withColumnRenamed("user", "userId").\
withColumn("time_stamp", df.time_stamp.cast(LongType())).withColumnRenamed("time_stamp", "timestamp").\
withColumn("adgroup_id", df.adgroup_id.cast(IntegerType())).withColumnRenamed("adgroup_id", "adgroupId").\
withColumn("pid", df.pid.cast(StringType())).\
withColumn("nonclk", df.nonclk.cast(IntegerType())).\
withColumn("clk", df.clk.cast(IntegerType()))
raw_sample_df.printSchema()
root
|-- user: string (nullable = true)
|-- time_stamp: string (nullable = true)
|-- adgroup_id: string (nullable = true)
|-- pid: string (nullable = true)
|-- nonclk: string (nullable = true)
|-- clk: string (nullable = true)
root
|-- userId: integer (nullable = true)
|-- timestamp: long (nullable = true)
|-- adgroupId: integer (nullable = true)
|-- pid: string (nullable = true)
|-- nonclk: integer (nullable = true)
|-- clk: integer (nullable = true)
"pid"字段热编码
Spark中使用热独编码
StringIndexer:对指定字符串列数据进行特征处理,如将性别数据“男”、“女”转化为0和1
OneHotEncoder:对特征列数据,进行热编码,通常需结合StringIndexer一起使用
Pipeline:让数据按顺序依次被处理,将前一次的处理结果作为下一次的输入
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
stringindexer = StringIndexer(inputCol="pid", outputCol="pid_feature")
onehot = OneHotEncoder(inputCol="pid_feature", outputCol="pid_value", dropLast=False)
pipeline = Pipeline(stages=[stringindexer, onehot])
pipeline_model = pipeline.fit(raw_sample_df)
new_df = pipeline_model.transform(raw_sample_df)
# Inspect the one-hot encoding result
new_df.groupBy("pid_value").count().show()
+-------------+--------+
| pid_value| count|
+-------------+--------+
|(2,[0],[1.0])|16472898|
|(2,[1],[1.0])|10085063|
+-------------+--------+
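To see which pid each sparse vector stands for, the fitted StringIndexer stage exposes its index-to-label mapping (a small sketch; stages[0] of the fitted pipeline is the StringIndexerModel):
print(pipeline_model.stages[0].labels)
# StringIndexer orders labels by descending frequency, so index 0 is the more
# frequent pid '430548_1007' and (2,[0],[1.0]) is its one-hot vector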
Analyzing the timestamp field
from datetime import datetime
# Inspect the timestamps
new_df.sort("timestamp", ascending=False).show()
# Hold out the last day as the test set; the first seven days form the training set
print("Day 8:", datetime.fromtimestamp(1494691186))
print("Day 7 cutoff:", datetime.fromtimestamp(1494691186-24*60*60))
Day 8: 2017-05-13 23:59:46
Day 7 cutoff: 2017-05-12 23:59:46
train_sample = new_df.filter(new_df.timestamp <= (1494691186-24*60*60))
test_sample = new_df.filter(new_df.timestamp > (1494691186-24*60*60))
# Sample counts of each split
train_sample.count(), test_sample.count()
(23249291, 3308670)
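As expected for a 7-day/1-day split, roughly 7/8 of the samples land in the training set; a quick check:
total = train_sample.count() + test_sample.count()
print("train fraction: %.4f" % (train_sample.count() / total))  # 23249291 / 26557961 ≈ 0.8754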
Table description
This dataset covers the basic information of every ad that appears in raw_sample (846,811 entries): adgroup_id (ad unit ID), cate_id (category ID), campaign_id (campaign ID), customer (advertiser ID), brand (brand ID) and price (item price).
Each ad ID corresponds to one item ("baobei"); an item belongs to one category and one brand.
Reading and analyzing the data
# Ad feature table
adf = spark.read.csv("data/ad_feature.csv", header=True)
adf.show(5)
+----------+-------+-----------+--------+------+-----+
|adgroup_id|cate_id|campaign_id|customer| brand|price|
+----------+-------+-----------+--------+------+-----+
| 63133| 6406| 83237| 1| 95471|170.0|
| 313401| 6406| 83237| 1| 87331|199.0|
| 248909| 392| 83237| 1| 32233| 38.0|
| 208458| 392| 83237| 1|174374|139.0|
| 110847| 7211| 135256| 2|145952|32.99|
+----------+-------+-----------+--------+------+-----+
only showing top 5 rows
adf.printSchema(), adf.count()
root
|-- adgroup_id: string (nullable = true)
|-- cate_id: string (nullable = true)
|-- campaign_id: string (nullable = true)
|-- customer: string (nullable = true)
|-- brand: string (nullable = true)
|-- price: string (nullable = true)
(None, 846811)
Changing field types
# First replace the literal "NULL" strings with -1, then cast the corresponding field types
adf = adf.replace("NULL", "-1")
# Cast the data types
ad_feature_df = adf.withColumn("adgroup_id", adf.adgroup_id.cast(IntegerType())).withColumnRenamed("adgroup_id", "adgroupID").\
withColumn("cate_id", adf.cate_id.cast(IntegerType())).withColumnRenamed("cate_id", "cateId").\
withColumn("campaign_id", adf.campaign_id.cast(IntegerType())).withColumnRenamed("campaign_id", "campaignId").\
withColumn("customer", adf.customer.cast(IntegerType())).withColumnRenamed("customer", "customerId").\
withColumn("brand", adf.brand.cast(IntegerType())).withColumnRenamed("brand", "brandId").\
withColumn("price", adf.price.cast(FloatType()))
ad_feature_df.printSchema()
root
|-- adgroupID: integer (nullable = true)
|-- cateId: integer (nullable = true)
|-- campaignId: integer (nullable = true)
|-- customerId: integer (nullable = true)
|-- brandId: integer (nullable = true)
|-- price: float (nullable = true)
Statistics
# Basic descriptive statistics
print("Total ads:", ad_feature_df.count())
_1 = ad_feature_df.groupBy("cateId").count().count()
print("cateId数值个数:", _1)
_2 = ad_feature_df.groupBy("campaignId").count().count()
print("campaignId数值个数:", _2)
_3 = ad_feature_df.groupBy("customerId").count().count()
print("customerId数值个数:", _3)
_4 = ad_feature_df.groupBy("brandId").count().count()
print("brandId数值个数:", _4)
print("价格高于1w的条目个数:", ad_feature_df.filter(ad_feature_df.price > 10000).count())
print("价格低于1w的条目个数:", ad_feature_df.filter(ad_feature_df.price <= 10000).count())
ad_feature_df.sort("price").show()
ad_feature_df.sort("price", ascending=False).show()
ad_feature_df.describe().show()
Total ads: 846811
Distinct cateId values: 6769
Distinct campaignId values: 423436
Distinct customerId values: 255875
Distinct brandId values: 99815
Entries with price > 10000: 6527
Entries with price <= 10000: 840284
+---------+------+----------+----------+-------+-----+
|adgroupID|cateId|campaignId|customerId|brandId|price|
+---------+------+----------+----------+-------+-----+
| 92241| 6130| 72781| 149714| -1| 0.01|
| 149570| 7043| 126746| 176076| -1| 0.01|
| 71678| 9866| 124203| 91492| 63885| 0.01|
| 345870| 9995| 179595| 191036| 79971| 0.01|
| 41925| 7032| 85373| 114532| -1| 0.01|
| 88975| 9996| 198424| 182415| -1| 0.01|
| 485749| 9970| 352666| 140520| -1| 0.01|
| 494084| 9969| 349384| 154919| -1| 0.01|
| 49911| 7032| 129079| 172334| -1| 0.01|
| 42055| 9994| 43866| 113068| 123242| 0.01|
| 692990| 6018| 353223| 223320| -1| 0.01|
| 348342| 8999| 296966| 158809| 113555| 0.01|
| 288172| 9995| 314179| 230326| 399440| 0.01|
| 620285| 7043| 365821| 1960| 188191| 0.01|
| 174248| 8999| 184344| 196777| 113555| 0.01|
| 290675| 4824| 315371| 240984| -1| 0.01|
| 598024| 9970| 22467| 59048| 17554| 0.01|
| 517587| 1847| 352238| 158227| 188592| 0.01|
| 182565| 5375| 274375| 16356| -1| 0.01|
| 169988| 10539| 238823| 221154| 211816| 0.01|
+---------+------+----------+----------+-------+-----+
only showing top 20 rows
+---------+------+----------+----------+-------+-----------+
|adgroupID|cateId|campaignId|customerId|brandId| price|
+---------+------+----------+----------+-------+-----------+
| 179746| 1093| 270027| 102509| 405447| 1.0E8|
| 658722| 1093| 218101| 207754| -1| 1.0E8|
| 443295| 1093| 44251| 102509| 300681| 1.0E8|
| 468220| 1093| 270719| 207754| -1| 1.0E8|
| 243384| 685| 218918| 31239| 278301| 1.0E8|
| 31899| 685| 218918| 31239| 278301| 1.0E8|
| 554311| 1093| 266086| 207754| -1| 1.0E8|
| 513942| 745| 8401| 86243| -1|8.8888888E7|
| 201060| 745| 8401| 86243| -1|5.5555556E7|
| 289563| 685| 37665| 120847| 278301| 1.5E7|
| 35156| 527| 417722| 72273| 278301| 1.0E7|
| 33756| 527| 416333| 70894| -1| 9900000.0|
| 335495| 739| 170121| 148946| 326126| 9600000.0|
| 218306| 206| 162394| 4339| 221720| 8888888.0|
| 213567| 7213| 239302| 205612| 406125| 5888888.0|
| 375920| 527| 217512| 148946| 326126| 4760000.0|
| 262215| 527| 132721| 11947| 417898| 3980000.0|
| 154623| 739| 170121| 148946| 326126| 3900000.0|
| 152414| 739| 170121| 148946| 326126| 3900000.0|
| 448651| 527| 422260| 41289| 209959| 3800000.0|
+---------+------+----------+----------+-------+-----------+
only showing top 20 rows
+-------+-----------------+------------------+------------------+------------------+------------------+------------------+
|summary| adgroupID| cateId| campaignId| customerId| brandId| price|
+-------+-----------------+------------------+------------------+------------------+------------------+------------------+
| count| 846811| 846811| 846811| 846811| 846811| 846811|
| mean| 423406.0| 5868.593464185043|206552.60428005777|113180.40600559038|162566.00186464275| 1838.867108130995|
| stddev|244453.4237388929|2705.1712033181752|125192.34090758236| 73435.83494972308| 152482.7386634471|310887.70017026004|
| min| 1| 1| 1| 1| -1| 0.01|
| max| 846811| 12960| 423436| 255875| 461497| 1.0E8|
+-------+-----------------+------------------+------------------+------------------+------------------+------------------+
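The describe() output shows that price is extremely right-skewed (mean ≈ 1839, stddev ≈ 310888, max 1.0E8). If price is later used as a numeric feature, a log transform is one common way to tame such skew; a minimal sketch (log_price is a hypothetical column name, not used elsewhere in this project):
from pyspark.sql import functions as F
ad_feature_df.withColumn("log_price", F.log1p("price")).select("price", "log_price").show(5)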
Table description
The user profile table user_profile
This dataset covers the basic profile of every user in raw_sample (just over a million users). The fields are shown in the sample below.
Reading and analyzing the data
# User profile table
upf = spark.read.csv("data/user_profile.csv", header=True)
upf.show(5)
+------+---------+------------+-----------------+---------+------------+--------------+----------+---------------------+
|userid|cms_segid|cms_group_id|final_gender_code|age_level|pvalue_level|shopping_level|occupation|new_user_class_level |
+------+---------+------------+-----------------+---------+------------+--------------+----------+---------------------+
| 234| 0| 5| 2| 5| null| 3| 0| 3|
| 523| 5| 2| 2| 2| 1| 3| 1| 2|
| 612| 0| 8| 1| 2| 2| 3| 0| null|
| 1670| 0| 4| 2| 4| null| 1| 0| null|
| 2545| 0| 10| 1| 4| null| 3| 0| null|
+------+---------+------------+-----------------+---------+------------+--------------+----------+---------------------+
only showing top 5 rows
Changing field types
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, LongType, StringType
# Build the schema object
schema = StructType([
StructField("userId", IntegerType()),
StructField("cms_segid", IntegerType()),
StructField("cms_group_id", IntegerType()),
StructField("final_gender_code", IntegerType()),
StructField("age_level", IntegerType()),
StructField("pvalue_level", IntegerType()),
StructField("shopping_level", IntegerType()),
StructField("occupation", IntegerType()),
StructField("new_user_class_level", IntegerType())
])
# Load the CSV with the explicit schema
user_profile_df = spark.read.csv("./data/user_profile.csv", header=True, schema=schema)
user_profile_df.printSchema()
root
|-- userId: integer (nullable = true)
|-- cms_segid: integer (nullable = true)
|-- cms_group_id: integer (nullable = true)
|-- final_gender_code: integer (nullable = true)
|-- age_level: integer (nullable = true)
|-- pvalue_level: integer (nullable = true)
|-- shopping_level: integer (nullable = true)
|-- occupation: integer (nullable = true)
|-- new_user_class_level: integer (nullable = true)
Statistics
print("Distinct values per categorical feature: ")
print("cms_segid: ", user_profile_df.groupBy("cms_segid").count().count())
print("cms_group_id: ", user_profile_df.groupBy("cms_group_id").count().count())
print("final_gender_code: ", user_profile_df.groupBy("final_gender_code").count().count())
print("age_level: ", user_profile_df.groupBy("age_level").count().count())
print("shopping_level: ", user_profile_df.groupBy("shopping_level").count().count())
print("occupation: ", user_profile_df.groupBy("occupation").count().count())
Distinct values per categorical feature:
cms_segid: 97 # many distinct values; one-hot encoding would inflate dimensionality
cms_group_id: 13
final_gender_code: 2
age_level: 7 # seven age tiers
shopping_level: 3 # three shopping tiers
occupation: 2
Missing values
The pvalue_level and new_user_class_level fields contain missing values that need to be filled.
Handling missing values
Normally, a feature with a very high missing ratio would simply be dropped. In our experience, however, ad recommendation is closely tied to a user's consumption level and city tier, so pvalue_level and new_user_class_level are both important features here and we do not discard them.
Instead, we fill the missing values, using the two approaches below.
Missing-value overview
print("含缺失值的特征情况: ")
user_profile_df.groupBy("pvalue_level").count().show()
user_profile_df.groupBy(“new_user_class_level”).count().show()
含缺失值的特征情况:
+------------+------+
|pvalue_level| count|
+------------+------+
| null|575917|
| 1|154436|
| 3| 37759|
| 2|293656|
+------------+------+
+--------------------+------+
|new_user_class_level| count|
+--------------------+------+
| null|344920|
| 1| 80548|
| 3|173047|
| 4|138833|
|                   2|324420|
+--------------------+------+
Missing ratio
t_count = user_profile_df.count()
pl_na_count = t_count - user_profile_df.dropna(subset=["pvalue_level"]).count()
print("pvalue_level null count:", pl_na_count, "null ratio %0.2f%%" % (pl_na_count/t_count*100))
# The missing ratio is high, but since this feature has real predictive value we fill it instead of dropping it
nul_na_count = t_count - user_profile_df.dropna(subset=["new_user_class_level"]).count()
print("new_user_class_level null count:", nul_na_count, "null ratio %0.2f%%" % (nul_na_count/t_count*100))
pvalue_level null count: 575917 null ratio 54.24%
new_user_class_level null count: 344920 null ratio 32.49%
Filling missing values with a random forest (mllib works on RDDs, so the data must be converted to the corresponding RDD types)
Building the training set (rows where the target field is not null)
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
# Build the training and prediction sets for the random-forest filler
# Make sure labels start from 0
train_data = user_profile_df.dropna(subset=["pvalue_level"]).rdd.map(
    lambda r: LabeledPoint(r.pvalue_level - 1,
                           [r.cms_segid, r.cms_group_id, r.final_gender_code,
                            r.age_level, r.shopping_level, r.occupation])
)
# Fill the missing city-tier values next
# keep only the rows where new_user_class_level is not null
train_data2 = user_profile_df.dropna(subset=["new_user_class_level"]).rdd.map(
lambda r:LabeledPoint(r.new_user_class_level - 1, [r.cms_segid, r.cms_group_id, r.final_gender_code, r.age_level, r.shopping_level, r.occupation])
)
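Before training, it is worth peeking at the constructed RDDs; the element count should equal the number of non-null rows, e.g. 154436 + 293656 + 37759 = 485851 for pvalue_level:
print(train_data.take(2))
print(train_data.count())  # 485851 rows with a non-null pvalue_level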
Modeling
# Train the filler models
from pyspark.mllib.tree import RandomForest  # RDD-based API
model = RandomForest.trainClassifier(train_data, numClasses=3, categoricalFeaturesInfo={}, numTrees=5)
model2 = RandomForest.trainClassifier(train_data2, 4, {}, 5)
# Predict a single sample
model.predict([5.0, 2.0, 2.0, 2.0, 3.0, 1.0])
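The single-sample call returns a label in {0, 1, 2}; adding 1 recovers the original pvalue_level. A quick (and optimistic, since it reuses the training data) sanity check of the filler model is its accuracy on that data, following the standard MLlib pattern:
preds = model.predict(train_data.map(lambda lp: lp.features))
labels_and_preds = train_data.map(lambda lp: lp.label).zip(preds)
acc = labels_and_preds.filter(lambda lp: lp[0] == lp[1]).count() / train_data.count()
print("training accuracy: %.4f" % acc)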
Building the prediction set (rows where the target field is null)
# Build the prediction set
pl_na_df = user_profile_df.na.fill(-1).where("pvalue_level=-1")
nul_na_df = user_profile_df.na.fill(-1).where("new_user_class_level=-1")
# Convert rows into plain feature tuples
def row(r):
return r.cms_segid, r.cms_group_id, r.final_gender_code, r.age_level, r.shopping_level, r.occupation
rdd2 = nul_na_df.rdd.map(row)
predicts2 = model2.predict(rdd2)
rdd = pl_na_df.rdd.map(row)
# Predict the missing values
predicts = model.predict(rdd)
print(predicts.take(5))
predicts.count()
[1.0, 1.0, 1.0, 1.0, 1.0]
575917
Filling in the missing field values
# Add 1 back to the labels; convert to pandas for the fill
import numpy as np
temp = predicts.map(lambda x:int(x)).collect()
pdf = pl_na_df.toPandas()  # the rows whose pvalue_level needs filling
pdf["pvalue_level"] = np.array(temp)+1
# Fill complete: union the non-null rows with the freshly filled rows
new_user_profile_df = user_profile_df.dropna(subset=["pvalue_level"]).unionAll(spark.createDataFrame(pdf, schema=schema))
new_user_profile_df.show(5)
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
|userId|cms_segid|cms_group_id|final_gender_code|age_level|pvalue_level|shopping_level|occupation|new_user_class_level|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
| 523| 5| 2| 2| 2| 1| 3| 1| 2|
| 612| 0| 8| 1| 2| 2| 3| 0| null|
| 3644| 49| 6| 2| 6| 2| 3| 0| 2|
| 5777| 44| 5| 2| 5| 2| 3| 0| 2|
| 6355| 2| 1| 2| 1| 1| 3| 0| 4|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
only showing top 5 rows
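A quick check that the fill worked: after the union, pvalue_level should contain no nulls while the other columns are unchanged:
new_user_profile_df.groupBy("pvalue_level").count().show()  # no null row should remain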
Filling missing values: mapping a low-dimensional variable into a higher-dimensional space
Goal: treat a missing value as one more category of its own, mapping the variable into a higher-dimensional space while keeping the original information intact. Since this is exactly what one-hot encoding implements, we process the data with one-hot encoding directly.
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
from pyspark.sql.types import StringType
# Nulls must be filled before the fields can be cast
user_profile_df = user_profile_df.na.fill(-1)
# For one-hot encoding, cast the corresponding fields to string first
user_profile_df = user_profile_df.withColumn("pvalue_level", user_profile_df.pvalue_level.cast(StringType()))\
.withColumn("new_user_class_level", user_profile_df.new_user_class_level.cast(StringType()))
# 1. One-hot encode the pvalue_level field
stringindex = StringIndexer(inputCol="pvalue_level", outputCol="pl_onehot_feature")
encoder = OneHotEncoder(inputCol="pl_onehot_feature", outputCol="pl_onehot_value", dropLast=False)
pipeline = Pipeline(stages=[stringindex, encoder])
pipeline_fit = pipeline.fit(user_profile_df)
user_profile_df2 = pipeline_fit.transform(user_profile_df)
# 2. One-hot encode the new_user_class_level field
stringindexer = StringIndexer(inputCol='new_user_class_level', outputCol='nucl_onehot_feature')
encoder = OneHotEncoder(dropLast=False, inputCol='nucl_onehot_feature', outputCol='nucl_onehot_value')
pipeline = Pipeline(stages=[stringindexer, encoder])
pipeline_fit = pipeline.fit(user_profile_df2)
user_profile_df3 = pipeline_fit.transform(user_profile_df2)
user_profile_df3.show(5, truncate=False)
VectorAssembler (the vector format expected by Spark ML)
from pyspark.ml.feature import VectorAssembler
feature_df = VectorAssembler().setInputCols(["age_level", "pl_onehot_value", "nucl_onehot_value"]).setOutputCol("feature").transform(user_profile_df3)
feature_df.select("feature").show(5, truncate=False)
+--------------------------+
|feature |
+--------------------------+
|(10,[0,1,7],[5.0,1.0,1.0])|
|(10,[0,3,6],[2.0,1.0,1.0])|
|(10,[0,2,5],[2.0,1.0,1.0])|
|(10,[0,1,5],[4.0,1.0,1.0])|
|(10,[0,1,5],[4.0,1.0,1.0])|
+--------------------------+
only showing top 5 rows
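The assembled vector has 10 dimensions: 1 for age_level, 4 for the one-hot pvalue_level (values -1, 1, 2, 3) and 5 for the one-hot new_user_class_level (values -1, 1, 2, 3, 4). A quick check:
print(feature_df.select("feature").head().feature.size)  # 10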