Wzideng

3. CTR Prediction Data Preparation

3.1 Analyzing and preprocessing the raw_sample dataset

# Load the sample data from HDFS
df = spark.read.csv("hdfs://localhost:9000/datasets/raw_sample.csv", header=True)
df.show()    # show the data (first 20 rows by default)
df.printSchema()

Output:

+------+----------+----------+-----------+------+---+
|  user|time_stamp|adgroup_id|        pid|nonclk|clk|
+------+----------+----------+-----------+------+---+
|581738|1494137644|         1|430548_1007|     1|  0|
|449818|1494638778|         3|430548_1007|     1|  0|
|914836|1494650879|         4|430548_1007|     1|  0|
|914836|1494651029|         5|430548_1007|     1|  0|
|399907|1494302958|         8|430548_1007|     1|  0|
|628137|1494524935|         9|430548_1007|     1|  0|
|298139|1494462593|         9|430539_1007|     1|  0|
|775475|1494561036|         9|430548_1007|     1|  0|
|555266|1494307136|        11|430539_1007|     1|  0|
|117840|1494036743|        11|430548_1007|     1|  0|
|739815|1494115387|        11|430539_1007|     1|  0|
|623911|1494625301|        11|430548_1007|     1|  0|
|623911|1494451608|        11|430548_1007|     1|  0|
|421590|1494034144|        11|430548_1007|     1|  0|
|976358|1494156949|        13|430548_1007|     1|  0|
|286630|1494218579|        13|430539_1007|     1|  0|
|286630|1494289247|        13|430539_1007|     1|  0|
|771431|1494153867|        13|430548_1007|     1|  0|
|707120|1494220810|        13|430548_1007|     1|  0|
|530454|1494293746|        13|430548_1007|     1|  0|
+------+----------+----------+-----------+------+---+
only showing top 20 rows

root
 |-- user: string (nullable = true)
 |-- time_stamp: string (nullable = true)
 |-- adgroup_id: string (nullable = true)
 |-- pid: string (nullable = true)
 |-- nonclk: string (nullable = true)
 |-- clk: string (nullable = true)

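Analyze the types and formats of the dataset's fields
- Check whether there are null values
- Check the data type of each column
- Check the distinct categories in each column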

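print("Total number of samples:", df.count())
# about 26 million
print("Number of distinct users (user):", df.groupBy("user").count().count())
# about 1.14 million, slightly more than the number of users in the log data
print("Number of distinct ads (adgroup_id):", df.groupBy("adgroup_id").count().count())
# about 850,000
print("Ad slot (pid) distribution:", df.groupBy("pid").count().collect())
# only two ad slots, split roughly 6:4
print("Click (clk) distribution:", df.groupBy("clk").count().collect())
# ratio of clicks to non-clicks is roughly 1:18

Output:

Total number of samples: 26557961
Number of distinct users (user): 1141729
Number of distinct ads (adgroup_id): 846811
Ad slot (pid) distribution: [Row(pid='430548_1007', count=16472898), Row(pid='430539_1007', count=10085063)]
Click (clk) distribution: [Row(clk='0', count=25191905), Row(clk='1', count=1366056)]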
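
The ratios quoted in the comments can be checked from the counts in the output. A minimal sketch in plain Python, with the counts copied by hand from the output above; the variable names are illustrative and not part of the original notebook:

```python
# Counts copied from the output above
total = 26557961
pid_counts = {"430548_1007": 16472898, "430539_1007": 10085063}
clk_counts = {0: 25191905, 1: 1366056}

# Share of impressions that each ad slot receives
pid_share = {pid: n / total for pid, n in pid_counts.items()}

# Overall click-through rate and the number of non-clicks per click
ctr = clk_counts[1] / total
nonclick_per_click = clk_counts[0] / clk_counts[1]

for pid, share in pid_share.items():
    print(pid, "%.1f%%" % (share * 100))
print("CTR: %.2f%%" % (ctr * 100))
print("non-clicks per click: %.1f" % nonclick_per_click)
```

The click class is a small minority, which is worth keeping in mind when a model is evaluated on this data later.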

Use dataframe.withColumn to change a column's data type, and dataframe.withColumnRenamed to rename a column

# Change the table structure: cast each column to the appropriate data type
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, LongType, StringType

# Print the schema of df
df.printSchema()
# Change the structure of df: column types and column names
raw_sample_df = df.\
    withColumn("user", df.user.cast(IntegerType())).withColumnRenamed("user", "userId").\
    withColumn("time_stamp", df.time_stamp.cast(LongType())).withColumnRenamed("time_stamp", "timestamp").\
    withColumn("adgroup_id", df.adgroup_id.cast(IntegerType())).withColumnRenamed("adgroup_id", "adgroupId").\
    withColumn("pid", df.pid.cast(StringType())).\
    withColumn("nonclk", df.nonclk.cast(IntegerType())).\
    withColumn("clk", df.clk.cast(IntegerType()))
raw_sample_df.printSchema()
raw_sample_df.show()

Output:

root
 |-- user: string (nullable = true)
 |-- time_stamp: string (nullable = true)
 |-- adgroup_id: string (nullable = true)
 |-- pid: string (nullable = true)
 |-- nonclk: string (nullable = true)
 |-- clk: string (nullable = true)

root
 |-- userId: integer (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- adgroupId: integer (nullable = true)
 |-- pid: string (nullable = true)
 |-- nonclk: integer (nullable = true)
 |-- clk: integer (nullable = true)

+------+----------+---------+-----------+------+---+
|userId| timestamp|adgroupId|        pid|nonclk|clk|
+------+----------+---------+-----------+------+---+
|581738|1494137644|        1|430548_1007|     1|  0|
|449818|1494638778|        3|430548_1007|     1|  0|
|914836|1494650879|        4|430548_1007|     1|  0|
|914836|1494651029|        5|430548_1007|     1|  0|
|399907|1494302958|        8|430548_1007|     1|  0|
|628137|1494524935|        9|430548_1007|     1|  0|
|298139|1494462593|        9|430539_1007|     1|  0|
|775475|1494561036|        9|430548_1007|     1|  0|
|555266|1494307136|       11|430539_1007|     1|  0|
|117840|1494036743|       11|430548_1007|     1|  0|
|739815|1494115387|       11|430539_1007|     1|  0|
|623911|1494625301|       11|430548_1007|     1|  0|
|623911|1494451608|       11|430548_1007|     1|  0|
|421590|1494034144|       11|430548_1007|     1|  0|
|976358|1494156949|       13|430548_1007|     1|  0|
|286630|1494218579|       13|430539_1007|     1|  0|
|286630|1494289247|       13|430539_1007|     1|  0|
|771431|1494153867|       13|430548_1007|     1|  0|
|707120|1494220810|       13|430548_1007|     1|  0|
|530454|1494293746|       13|430548_1007|     1|  0|
+------+----------+---------+-----------+------+---+
only showing top 20 rows

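Feature Selection
- Feature selection means keeping the informative features and dropping the redundant ones. For search ads, how well the query keywords match the ad matters most; for display ads, the ad's own historical performance is usually the most important feature.
  
  Based on experience, the only field in this dataset that matters much as a feature is the ad slot pid. Its two values split roughly 6:4, so pid can serve as a key feature.
  
  nonclk and clk are the target values here and are not used as features.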
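One-Hot Encoding (OneHotEncode)
- One-hot encoding is a classic encoding scheme that uses an N-bit status register (bits 0 and 1) to encode N states: each state has its own register bit, and exactly one bit is set at any time.
  
  Suppose there are three categorical features: gender, city, and device.
  
  ["male", "female"] -> [0, 1]
  
  ["Beijing", "Shanghai", "Guangzhou"] -> [0, 1, 2]
  
  ["Apple", "Xiaomi", "Huawei", "Microsoft"] -> [0, 1, 2, 3]
  
  Traditional (label) encoding: enumerate the values of each feature, starting from 0.
  
  ["male", "Shanghai", "Xiaomi"] = [0, 1, 1]
  
  ["female", "Beijing", "Apple"] = [1, 0, 0]
  
  The resulting integers are arbitrarily assigned codes, not quantities on a meaningful continuous scale, so a classifier cannot easily use them directly.
  
  With one-hot encoding the data becomes sparse, which is easier for a classifier to handle:
  
  ["male", "Shanghai", "Xiaomi"] = [1,0, 0,1,0, 0,1,0,0]
  
  ["female", "Beijing", "Apple"] = [0,1, 1,0,0, 1,0,0,0]
  
  This keeps every category value distinct, but note that if the data becomes too sparse (few samples, very high dimensionality) the results can actually get worse.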
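
The two encodings in the example above can be reproduced with a few lines of plain Python. This is only an illustrative sketch: the function names are made up here, and the order of each vocabulary list defines the integer code.

```python
# Category vocabularies from the example above (list order defines the integer code)
vocabularies = [
    ["male", "female"],
    ["Beijing", "Shanghai", "Guangzhou"],
    ["Apple", "Xiaomi", "Huawei", "Microsoft"],
]

def label_encode(sample):
    """Traditional encoding: map each categorical value to its integer index."""
    return [vocab.index(value) for vocab, value in zip(vocabularies, sample)]

def one_hot_encode(sample):
    """One-hot encoding: concatenate one indicator vector per feature."""
    encoded = []
    for vocab, value in zip(vocabularies, sample):
        bits = [0] * len(vocab)
        bits[vocab.index(value)] = 1
        encoded.extend(bits)
    return encoded

print(label_encode(["male", "Shanghai", "Xiaomi"]))
print(one_hot_encode(["male", "Shanghai", "Xiaomi"]))
print(one_hot_encode(["female", "Beijing", "Apple"]))
```

The one-hot vector has 2 + 3 + 4 = 9 positions, one per category value, which is why high-cardinality features quickly lead to very wide, sparse vectors.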
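Using One-Hot Encoding in Spark
- Note: OneHotEncoder works on a column of numeric category indices, so a string column such as pid must first be converted to indices with StringIndexer
  
  StringIndexer: maps the values of a given string column to numeric indices, e.g. the gender values "male" and "female" to 0 and 1
  
  OneHotEncoder: one-hot encodes an indexed feature column; it usually needs to be combined with StringIndexer
  
  Pipeline: processes the data stage by stage, feeding the result of each stage into the next one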
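
To make the indexing step concrete, here is a plain-Python sketch of frequency-based indexing followed by one-hot expansion. It only mimics the idea (by default StringIndexer gives index 0 to the most frequent label); it is not Spark's implementation, and the helper names and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter

def fit_string_index(values):
    """Assign indices by descending frequency, similar to StringIndexer's default ordering."""
    counts = Counter(values)
    # Ties are broken alphabetically here; that choice is an assumption of this sketch.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

def one_hot(index, size):
    """Dense indicator vector that keeps one slot for every category."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# A handful of pid values in the same proportions as the first rows of the dataset
pids = ["430548_1007", "430548_1007", "430539_1007", "430548_1007", "430539_1007"]

index_of = fit_string_index(pids)                               # "fit" step
encoded = [one_hot(index_of[p], len(index_of)) for p in pids]   # "transform" step

print(index_of)
print(encoded[0], encoded[2])
```

The split into a fit step (learn the label-to-index mapping) and a transform step (apply it) mirrors how a pipeline is first fitted and then used to transform the data.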
Feature Processing

'''Feature processing'''
'''
pid: the ad slot. This is a categorical feature with only two values, so one-hot encoding is enough; it becomes two features: "is slot 1" and "is slot 2"
'''
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline

# StringIndexer converts the given string column into a numeric index column
stringindexer = StringIndexer(inputCol='pid', outputCol='pid_feature')

# One-hot encode the indexed column
encoder = OneHotEncoder(dropLast=False, inputCol='pid_feature', outputCol='pid_value')
# Use a pipeline to apply both stages to every row
pipeline = Pipeline(stages=[stringindexer, encoder])
pipeline_model = pipeline.fit(raw_sample_df)
new_df = pipeline_model.transform(raw_sample_df)
new_df.show()

Output:

+------+----------+---------+-----------+------+---+-----------+-------------+
|userId| timestamp|adgroupId|        pid|nonclk|clk|pid_feature|    pid_value|
+------+----------+---------+-----------+------+---+-----------+-------------+
|581738|1494137644|        1|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|449818|1494638778|        3|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|914836|1494650879|        4|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|914836|1494651029|        5|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|399907|1494302958|        8|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|628137|1494524935|        9|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|298139|1494462593|        9|430539_1007|     1|  0|        1.0|(2,[1],[1.0])|
|775475|1494561036|        9|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|555266|1494307136|       11|430539_1007|     1|  0|        1.0|(2,[1],[1.0])|
|117840|1494036743|       11|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|739815|1494115387|       11|430539_1007|     1|  0|        1.0|(2,[1],[1.0])|
|623911|1494625301|       11|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|623911|1494451608|       11|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|421590|1494034144|       11|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|976358|1494156949|       13|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|286630|1494218579|       13|430539_1007|     1|  0|        1.0|(2,[1],[1.0])|
|286630|1494289247|       13|430539_1007|     1|  0|        1.0|(2,[1],[1.0])|
|771431|1494153867|       13|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|707120|1494220810|       13|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|530454|1494293746|       13|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
+------+----------+---------+-----------+------+---+-----------+-------------+
only showing top 20 rows

The returned pid_value column holds sparse vectors of type pyspark.ml.linalg.SparseVector

from pyspark.ml.linalg import SparseVector
# Arguments: size, list of indices, list of values
print(SparseVector(4, [1, 3], [3.0, 4.0]))
print(SparseVector(4, [1, 3], [3.0, 4.0]).toArray())
print("*********")
print(new_df.select("pid_value").first())
print(new_df.select("pid_value").first().pid_value.toArray())

Output:

(4,[1,3],[3.0,4.0])
[0. 3. 0. 4.]
*********
Row(pid_value=SparseVector(2, {0: 1.0}))
[1. 0.]
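
The (size, indices, values) notation in the output above can be read as "a vector of the given size that is zero everywhere except at the listed indices". A minimal plain-Python sketch of that expansion follows; sparse_to_dense is a hypothetical helper written for illustration, not a Spark function.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a dense list of floats."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

# Same triples as in the output above
print(sparse_to_dense(4, [1, 3], [3.0, 4.0]))
print(sparse_to_dense(2, [0], [1.0]))
```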

Check the latest timestamp

new_df.sort("timestamp", ascending=False).show()

+------+----------+---------+-----------+------+---+-----------+-------------+
|userId| timestamp|adgroupId|        pid|nonclk|clk|pid_feature|    pid_value|
+------+----------+---------+-----------+------+---+-----------+-------------+
|177002|1494691186|   593001|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|243671|1494691186|   600195|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|488527|1494691184|   494312|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|488527|1494691184|   431082|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
| 17054|1494691184|   742741|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
| 17054|1494691184|   756665|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|488527|1494691184|   687854|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|839493|1494691183|   561681|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|704223|1494691183|   624504|430539_1007|     1|  0|        1.0|(2,[1],[1.0])|
|839493|1494691183|   582235|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|704223|1494691183|   675674|430539_1007|     1|  0|        1.0|(2,[1],[1.0])|
|628998|1494691180|   618965|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|674444|1494691179|   427579|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|627200|1494691179|   782038|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|627200|1494691179|   420769|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|674444|1494691179|   588664|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|738335|1494691179|   451004|430539_1007|     1|  0|        1.0|(2,[1],[1.0])|
|627200|1494691179|   817569|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|322244|1494691179|   820018|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
|322244|1494691179|   735220|430548_1007|     1|  0|        0.0|(2,[0],[1.0])|
+------+----------+---------+-----------+------+---+-----------+-------------+
only showing top 20 rows

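# The sample dataset covers 8 days of data in total
# The first seven days are used as training data and the last day as test data

from datetime import datetime
# Note: fromtimestamp() converts to the local time zone of the machine running the code
datetime.fromtimestamp(1494691186)
print("Data up to this time is the training set; data after it is the test set:", datetime.fromtimestamp(1494691186-24*60*60))

Output:

Data up to this time is the training set; data after it is the test set: 2017-05-12 23:59:46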
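
Because fromtimestamp() depends on the machine's time zone, the cutoff can be made reproducible by passing an explicit time zone. The sketch below assumes the output above was produced in UTC+8 (an assumption, inferred from the printed time), and the split_by_cutoff helper is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the original output was produced on a machine in UTC+8
UTC8 = timezone(timedelta(hours=8))

latest_ts = 1494691186                 # largest timestamp in the dataset
cutoff_ts = latest_ts - 24 * 60 * 60   # exactly one day earlier

cutoff = datetime.fromtimestamp(cutoff_ts, tz=UTC8)
print(cutoff.strftime("%Y-%m-%d %H:%M:%S"))

def split_by_cutoff(timestamps):
    """Timestamps up to and including the cutoff go to train, later ones to test."""
    train = [t for t in timestamps if t <= cutoff_ts]
    test = [t for t in timestamps if t > cutoff_ts]
    return train, test

# Three timestamps taken from the tables above
train_ts, test_ts = split_by_cutoff([1494137644, 1494638778, 1494691186])
print(train_ts, test_ts)
```

Comparing raw integer timestamps, as done here, avoids any time zone dependence in the split itself; the time zone only matters for the human-readable printout.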

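Training samples

# Training samples:
train_sample = raw_sample_df.filter(raw_sample_df.timestamp<=(1494691186-24*60*60))
print("Number of training samples:")
print(train_sample.count())
# Test samples
test_sample = raw_sample_df.filter(raw_sample_df.timestamp>(1494691186-24*60*60))
print("Number of test samples:")
print(test_sample.count())

# Note: the basic ad features and basic user features still have to be joined in to form a complete sample dataset

Output:

Number of training samples:
23249291
Number of test samples:
3308670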

3.2 Analyzing and preprocessing the ad_feature dataset

# Load the basic ad information from HDFS; returns a Spark DataFrame object
df = spark.read.csv("hdfs://localhost:9000/datasets/ad_feature.csv", header=True)
df.show()    # show the data (first 20 rows by default)

Output:

+----------+-------+-----------+--------+------+-----+
|adgroup_id|cate_id|campaign_id|customer| brand|price|
+----------+-------+-----------+--------+------+-----+
|     63133|   6406|      83237|       1| 95471|170.0|
|    313401|   6406|      83237|       1| 87331|199.0|
|    248909|    392|      83237|       1| 32233| 38.0|
|    208458|    392|      83237|       1|174374|139.0|
|    110847|   7211|     135256|       2|145952|32.99|
|    607788|   6261|     387991|       6|207800|199.0|
|    375706|   4520|     387991|       6|  NULL| 99.0|
|     11115|   7213|     139747|       9|186847| 33.0|
|     24484|   7207|     139744|       9|186847| 19.0|
|     28589|   5953|     395195|      13|  NULL|428.0|
|     23236|   5953|     395195|      13|  NULL|368.0|
|    300556|   5953|     395195|      13|  NULL|639.0|
|     92560|   5953|     395195|      13|  NULL|368.0|
|    590965|   4284|      28145|      14|454237|249.0|
|    529913|   4284|      70206|      14|  NULL|249.0|
|    546930|   4284|      28145|      14|  NULL|249.0|
|    639794|   6261|      70206|      14| 37004| 89.9|
|    335413|   4284|      28145|      14|  NULL|249.0|
|    794890|   4284|      70206|      14|454237|249.0|
|    684020|   6261|      70206|      14| 37004| 99.0|
+----------+-------+-----------+--------+------+-----+
only showing top 20 rows

# Note: this dataset contains the literal string NULL, so a schema cannot be applied directly; the NULL values have to be replaced first, and the types converted afterwards

from pyspark.sql.types import StructType, StructField, IntegerType, FloatType

# Replace the string "NULL" with "-1"
df = df.replace("NULL", "-1")

# Print the schema of df
df.printSchema()
# Change the structure of df: column types and column names
ad_feature_df = df.\
    withColumn("adgroup_id", df.adgroup_id.cast(IntegerType())).withColumnRenamed("adgroup_id", "adgroupId").\
    withColumn("cate_id", df.cate_id.cast(IntegerType())).withColumnRenamed("cate_id", "cateId").\
    withColumn("campaign_id", df.campaign_id.cast(IntegerType())).withColumnRenamed("campaign_id", "campaignId").\
    withColumn("customer", df.customer.cast(IntegerType())).withColumnRenamed("customer", "customerId").\
    withColumn("brand", df.brand.cast(IntegerType())).withColumnRenamed("brand", "brandId").\
    withColumn("price", df.price.cast(FloatType()))
ad_feature_df.printSchema()
ad_feature_df.show()

Output:

root
 |-- adgroup_id: string (nullable = true)
 |-- cate_id: string (nullable = true)
 |-- campaign_id: string (nullable = true)
 |-- customer: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- price: string (nullable = true)

root
 |-- adgroupId: integer (nullable = true)
 |-- cateId: integer (nullable = true)
 |-- campaignId: integer (nullable = true)
 |-- customerId: integer (nullable = true)
 |-- brandId: integer (nullable = true)
 |-- price: float (nullable = true)

+---------+------+----------+----------+-------+-----+
|adgroupId|cateId|campaignId|customerId|brandId|price|
+---------+------+----------+----------+-------+-----+
|    63133|  6406|     83237|         1|  95471|170.0|
|   313401|  6406|     83237|         1|  87331|199.0|
|   248909|   392|     83237|         1|  32233| 38.0|
|   208458|   392|     83237|         1| 174374|139.0|
|   110847|  7211|    135256|         2| 145952|32.99|
|   607788|  6261|    387991|         6| 207800|199.0|
|   375706|  4520|    387991|         6|     -1| 99.0|
|    11115|  7213|    139747|         9| 186847| 33.0|
|    24484|  7207|    139744|         9| 186847| 19.0|
|    28589|  5953|    395195|        13|     -1|428.0|
|    23236|  5953|    395195|        13|     -1|368.0|
|   300556|  5953|    395195|        13|     -1|639.0|
|    92560|  5953|    395195|        13|     -1|368.0|
|   590965|  4284|     28145|        14| 454237|249.0|
|   529913|  4284|     70206|        14|     -1|249.0|
|   546930|  4284|     28145|        14|     -1|249.0|
|   639794|  6261|     70206|        14|  37004| 89.9|
|   335413|  4284|     28145|        14|     -1|249.0|
|   794890|  4284|     70206|        14| 454237|249.0|
|   684020|  6261|     70206|        14|  37004| 99.0|
+---------+------+----------+----------+-------+-----+
only showing top 20 rows
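
The replace-then-cast idea can be illustrated outside Spark with the standard csv module. This is a sketch on three rows copied from the table above; parse_row and INT_COLUMNS are hypothetical names. In plain Python, int("NULL") would raise a ValueError, which is why the placeholder is swapped for "-1" before casting.

```python
import csv
import io

# Three rows copied from the ad_feature output above
raw = """adgroup_id,cate_id,campaign_id,customer,brand,price
63133,6406,83237,1,95471,170.0
375706,4520,387991,6,NULL,99.0
110847,7211,135256,2,145952,32.99
"""

INT_COLUMNS = ["adgroup_id", "cate_id", "campaign_id", "customer", "brand"]

def parse_row(row):
    """Replace the literal string "NULL" with "-1", then cast every field."""
    cleaned = {key: ("-1" if value == "NULL" else value) for key, value in row.items()}
    parsed = {key: int(cleaned[key]) for key in INT_COLUMNS}
    parsed["price"] = float(cleaned["price"])
    return parsed

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(raw))]
print(rows[1])
```

Using -1 as a sentinel keeps "brand unknown" as its own category value instead of silently dropping those rows.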

Inspect the characteristics of each column

print("Total number of ads:", df.count())   # number of rows
_1 = ad_feature_df.groupBy("cateId").count().count()
print("Number of distinct cateId values:", _1)
_2 = ad_feature_df.groupBy("campaignId").count().count()
print("Number of distinct campaignId values:", _2)
_3 = ad_feature_df.groupBy("customerId").count().count()
print("Number of distinct customerId values:", _3)
_4 = ad_feature_df.groupBy("brandId").count().count()
print("Number of distinct brandId values:", _4)
ad_feature_df.sort("price").show()
ad_feature_df.sort("price", ascending=False).show()
print("Number of ads priced above 10,000:", ad_feature_df.select("price").filter("price>10000").count())
print("Number of ads priced below 1:", ad_feature_df.select("price").filter("price<1").count())

Output:

Total number of ads: 846811
Number of distinct cateId values: 6769
Number of distinct campaignId values: 423436
Number of distinct customerId values: 255875
Number of distinct brandId values: 99815
+---------+------+----------+----------+-------+-----+
|adgroupId|cateId|campaignId|customerId|brandId|price|
+---------+------+----------+----------+-------+-----+
|   485749|  9970|    352666|    140520|     -1| 0.01|
|    88975|  9996|    198424|    182415|     -1| 0.01|
|   109704| 10539|     59774|     90351| 202710| 0.01|
|    49911|  7032|    129079|    172334|     -1| 0.01|
|   339334|  9994|    310408|    211292| 383023| 0.01|
|     6636|  6703|    392038|     46239| 406713| 0.01|
|    92241|  6130|     72781|    149714|     -1| 0.01|
|    20397| 10539|    410958|     65726|  79971| 0.01|
|   345870|  9995|    179595|    191036|  79971| 0.01|
|    77797|  9086|    218276|     31183|     -1| 0.01|
|    14435|  1136|    135610|     17788|     -1| 0.01|
|    42055|  9994|     43866|    113068| 123242| 0.01|
|    41925|  7032|     85373|    114532|     -1| 0.01|
|    67558|  9995|     90141|     83948|     -1| 0.01|
|   149570|  7043|    126746|    176076|     -1| 0.01|
|   518883|  7185|    403318|     58013|     -1| 0.01|
|     2246|  9996|    413653|     60214| 182966| 0.01|
|   290675|  4824|    315371|    240984|     -1| 0.01|
|   552638| 10305|    403318|     58013|     -1| 0.01|
|    89831| 10539|     90141|     83948| 211816| 0.01|
+---------+------+----------+----------+-------+-----+
only showing top 20 rows

+---------+------+----------+----------+-------+-----------+
|adgroupId|cateId|campaignId|customerId|brandId|      price|
+---------+------+----------+----------+-------+-----------+
|   658722|  1093|    218101|    207754|     -1|      1.0E8|
|   468220|  1093|    270719|    207754|     -1|      1.0E8|
|   179746|  1093|    270027|    102509| 405447|      1.0E8|
|   443295|  1093|     44251|    102509| 300681|      1.0E8|
|    31899|   685|    218918|     31239| 278301|      1.0E8|
|   243384|   685|    218918|     31239| 278301|      1.0E8|
|   554311|  1093|    266086|    207754|     -1|      1.0E8|
|   513942|   745|      8401|     86243|     -1|8.8888888E7|
|   201060|   745|      8401|     86243|     -1|5.5555556E7|
|   289563|   685|     37665|    120847| 278301|      1.5E7|
|    35156|   527|    417722|     72273| 278301|      1.0E7|
|    33756|   527|    416333|     70894|     -1|  9900000.0|
|   335495|   739|    170121|    148946| 326126|  9600000.0|
|   218306|   206|    162394|      4339| 221720|  8888888.0|
|   213567|  7213|    239302|    205612| 406125|  5888888.0|
|   375920|   527|    217512|    148946| 326126|  4760000.0|
|   262215|   527|    132721|     11947| 417898|  3980000.0|
|   154623|   739|    170121|    148946| 326126|  3900000.0|
|   152414|   739|    170121|    148946| 326126|  3900000.0|
|   448651|   527|    422260|     41289| 209959|  3800000.0|
+---------+------+----------+----------+-------+-----------+
only showing top 20 rows

Number of ads priced above 10,000: 6527
Number of ads priced below 1: 5762

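Feature Selection
- cateId: anonymized product category ID;
- campaignId: anonymized ad campaign ID;
- customerId: anonymized advertiser ID;
- brandId: anonymized brand ID;
All four are categorical features, but each has a very large number of distinct values. One-hot encoding them would make the data far too sparse, and we currently lack more specific information about them (such as the actual product categories or brands), so we cannot cluster them or reduce their dimensionality. They are therefore not selected as features here.

Only price is selected as a feature. Price is a continuous numeric value that reflects the value of the advertised item well and usually needs no further processing (discretization, normalization, standardization, etc.), so it is used directly as a feature here.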
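
To get a feel for "too sparse", the distinct-value counts reported earlier can be added up to estimate how wide a one-hot encoding of the four ID columns would be. This is a back-of-the-envelope sketch in plain Python; the numbers are copied from the output in this section and the variable names are illustrative.

```python
# Distinct-value counts copied from the output earlier in this section
cardinality = {
    "cateId": 6769,
    "campaignId": 423436,
    "customerId": 255875,
    "brandId": 99815,
}
total_ads = 846811

# Width of the vector if all four columns were one-hot encoded
one_hot_width = sum(cardinality.values())

# Average number of ads that share each distinct value
ads_per_value = {name: total_ads / n for name, n in cardinality.items()}

print("one-hot width:", one_hot_width)
for name, avg in ads_per_value.items():
    print(name, "%.1f ads per value" % avg)
```

With roughly two ads per campaignId on average, most of the resulting indicator columns would be non-zero for only a handful of ads, which is consistent with the decision above to leave these columns out.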

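3.3 Analyzing and preprocessing the user_profile dataset

# Load the basic user information from HDFS
df = spark.read.csv("hdfs://localhost:8020/csv/user_profile.csv", header=True)
# pvalue_level and new_user_class_level contain null values (note: null here means a missing value, whereas NULL usually indicates a literal string)
# so the data can be loaded directly with a schema, without replacing the null values
df.show()

Output: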

+------+---------+------------+-----------------+---------+------------+--------------+----------+---------------------+
|userid|cms_segid|cms_group_id|final_gender_code|age_level|pvalue_level|shopping_level|occupation|new_user_class_level |
+------+---------+------------+-----------------+---------+------------+--------------+----------+---------------------+
|   234|        0|           5|                2|        5|        null|             3|         0|                    3|
|   523|        5|           2|                2|        2|           1|             3|         1|                    2|
|   612|        0|           8|                1|        2|           2|             3|         0|                 null|
|  1670|        0|           4|                2|        4|        null|             1|         0|                 null|
|  2545|        0|          10|                1|        4|        null|             3|         0|                 null|
|  3644|       49|           6|                2|        6|           2|             3|         0|                    2|
|  5777|       44|           5|                2|        5|           2|             3|         0|                    2|
|  6211|        0|           9|                1|        3|        null|             3|         0|                    2|
|  6355|        2|           1|                2|        1|           1|             3|         0|                    4|
|  6823|       43|           5|                2|        5|           2|             3|         0|                    1|
|  6972|        5|           2|                2|        2|           2|             3|         1|                    2|
|  9293|        0|           5|                2|        5|        null|             3|         0|                    4|
|  9510|       55|           8|                1|        2|           2|             2|         0|                    2|
| 10122|       33|           4|                2|        4|           2|             3|         0|                    2|
| 10549|        0|           4|                2|        4|           2|             3|         0|                 null|
| 10812|        0|           4|                2|        4|        null|             2|         0|                 null|
| 10912|        0|           4|                2|        4|           2|             3|         0|                 null|
| 10996|        0|           5|                2|        5|        null|             3|         0|                    4|
| 11256|        8|           2|                2|        2|           1|             3|         0|                    3|
| 11310|       31|           4|                2|        4|           1|             3|         0|                    4|
+------+---------+------------+-----------------+---------+------------+--------------+----------+---------------------+

# Note: null is read by PySpark as None, i.e. a missing (NA) value, so the data can be loaded directly with a schema

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType, FloatType

# Build the schema object describing the table structure
schema = StructType([
    StructField("userId", IntegerType()),
    StructField("cms_segid", IntegerType()),
    StructField("cms_group_id", IntegerType()),
    StructField("final_gender_code", IntegerType()),
    StructField("age_level", IntegerType()),
    StructField("pvalue_level", IntegerType()),
    StructField("shopping_level", IntegerType()),
    StructField("occupation", IntegerType()),
    StructField("new_user_class_level", IntegerType())
])
# Load from HDFS using the schema
user_profile_df = spark.read.csv("hdfs://localhost:8020/csv/user_profile.csv", header=True, schema=schema)
user_profile_df.printSchema()
user_profile_df.show()

Output:

root
 |-- userId: integer (nullable = true)
 |-- cms_segid: integer (nullable = true)
 |-- cms_group_id: integer (nullable = true)
 |-- final_gender_code: integer (nullable = true)
 |-- age_level: integer (nullable = true)
 |-- pvalue_level: integer (nullable = true)
 |-- shopping_level: integer (nullable = true)
 |-- occupation: integer (nullable = true)
 |-- new_user_class_level: integer (nullable = true)

+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
|userId|cms_segid|cms_group_id|final_gender_code|age_level|pvalue_level|shopping_level|occupation|new_user_class_level|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
|   234|        0|           5|                2|        5|        null|             3|         0|                   3|
|   523|        5|           2|                2|        2|           1|             3|         1|                   2|
|   612|        0|           8|                1|        2|           2|             3|         0|                null|
|  1670|        0|           4|                2|        4|        null|             1|         0|                null|
|  2545|        0|          10|                1|        4|        null|             3|         0|                null|
|  3644|       49|           6|                2|        6|           2|             3|         0|                   2|
|  5777|       44|           5|                2|        5|           2|             3|         0|                   2|
|  6211|        0|           9|                1|        3|        null|             3|         0|                   2|
|  6355|        2|           1|                2|        1|           1|             3|         0|                   4|
|  6823|       43|           5|                2|        5|           2|             3|         0|                   1|
|  6972|        5|           2|                2|        2|           2|             3|         1|                   2|
|  9293|        0|           5|                2|        5|        null|             3|         0|                   4|
|  9510|       55|           8|                1|        2|           2|             2|         0|                   2|
| 10122|       33|           4|                2|        4|           2|             3|         0|                   2|
| 10549|        0|           4|                2|        4|           2|             3|         0|                null|
| 10812|        0|           4|                2|        4|        null|             2|         0|                null|
| 10912|        0|           4|                2|        4|           2|             3|         0|                null|
| 10996|        0|           5|                2|        5|        null|             3|         0|                   4|
| 11256|        8|           2|                2|        2|           1|             3|         0|                   3|
| 11310|       31|           4|                2|        4|           1|             3|         0|                   4|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
only showing top 20 rows

Inspect the features

print("Number of distinct values per categorical feature: ")
print("cms_segid: ", user_profile_df.groupBy("cms_segid").count().count())
print("cms_group_id: ", user_profile_df.groupBy("cms_group_id").count().count())
print("final_gender_code: ", user_profile_df.groupBy("final_gender_code").count().count())
print("age_level: ", user_profile_df.groupBy("age_level").count().count())
print("shopping_level: ", user_profile_df.groupBy("shopping_level").count().count())
print("occupation: ", user_profile_df.groupBy("occupation").count().count())

print("Features with missing values: ")
user_profile_df.groupBy("pvalue_level").count().show()
user_profile_df.groupBy("new_user_class_level").count().show()

t_count = user_profile_df.count()
pl_na_count = t_count - user_profile_df.dropna(subset=["pvalue_level"]).count()
print("pvalue_level missing values:", pl_na_count, "missing rate: %0.2f%%"%(pl_na_count/t_count*100))
nul_na_count = t_count - user_profile_df.dropna(subset=["new_user_class_level"]).count()
print("new_user_class_level missing values:", nul_na_count, "missing rate: %0.2f%%"%(nul_na_count/t_count*100))

Output:

Number of distinct values per categorical feature: 
cms_segid:  97
cms_group_id:  13
final_gender_code:  2
age_level:  7
shopping_level:  3
occupation:  2
Features with missing values: 
+------------+------+
|pvalue_level| count|
+------------+------+
|        null|575917|
|           1|154436|
|           3| 37759|
|           2|293656|
+------------+------+

+--------------------+------+
|new_user_class_level| count|
+--------------------+------+
|                null|344920|
|                   1| 80548|
|                   3|173047|
|                   4|138833|
|                   2|324420|
+--------------------+------+

pvalue_level missing values: 575917 missing rate: 54.24%
new_user_class_level missing values: 344920 missing rate: 32.49%
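
The missing rates can be cross-checked from the two grouped tables alone. A small plain-Python sketch, with the counts copied from the output above (None stands for the null row); missing_rate is a hypothetical helper.

```python
# Value counts copied from the grouped tables above; None represents null
pvalue_counts = {None: 575917, 1: 154436, 2: 293656, 3: 37759}
city_level_counts = {None: 344920, 1: 80548, 2: 324420, 3: 173047, 4: 138833}

def missing_rate(counts):
    """Fraction of rows whose value is missing."""
    total = sum(counts.values())
    return counts.get(None, 0) / total

print("pvalue_level: %.2f%%" % (missing_rate(pvalue_counts) * 100))
print("new_user_class_level: %.2f%%" % (missing_rate(city_level_counts) * 100))
```

Both tables sum to the same row count, which confirms they describe the same set of users.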

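Handling Missing Values
- As a general rule of thumb:
  - Missing rate below 10%: fill the values directly, e.g. with a default value, the mean, or a model-based estimate;
  - Above 10%: dropping the feature is often considered;
  - Alternatively, transform the feature, e.g. expand one dimension into several
  In our experience, however, ad recommendation is closely related to a user's consumption level and the tier of the user's city, so pvalue_level and new_user_class_level are both important features here and we do not drop them
Options for handling the missing values:
- Imputation: predict the missing values with a random forest, using the user's other feature values. This produces a large amount of artificially constructed data, which adds a certain amount of noise
- Mapping the variable to a higher-dimensional space: e.g. turn the one-dimensional pvalue_level into four dimensions (is 1, is 2, is 3, is missing). This leaves all original data unchanged and can improve accuracy, but it makes the data sparser; with a small sample it can actually hurt the results, so it should not be overused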
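
The second option can be sketched in a few lines of plain Python: each pvalue_level becomes a four-element indicator vector whose last slot marks a missing value. The function name and the LEVELS constant are illustrative only.

```python
LEVELS = [1, 2, 3]  # observed pvalue_level values

def encode_pvalue_level(level):
    """Return [is_1, is_2, is_3, is_missing] for a pvalue_level that may be None."""
    vector = [0] * (len(LEVELS) + 1)
    if level is None:
        vector[-1] = 1                      # dedicated slot for missing values
    else:
        vector[LEVELS.index(level)] = 1
    return vector

for level in [1, 2, 3, None]:
    print(level, encode_pvalue_level(level))
```

No value is invented for the missing rows here; "missing" simply becomes a category of its own, which is the trade-off against imputation described above.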
Imputation
- Use a random forest to predict the missing pvalue_level values

from pyspark.mllib.regression import LabeledPoint

# Drop the rows with missing values and use the remaining rows as training data
# user_profile_df.dropna(subset=["pvalue_level"]): the rows left after removing those where pvalue_level is null, used as training samples
train_data = user_profile_df.dropna(subset=["pvalue_level"]).rdd.map(
    lambda r:LabeledPoint(r.pvalue_level-1, [r.cms_segid, r.cms_group_id, r.final_gender_code, r.age_level, r.shopping_level, r.occupation])
)

# Note: the random forest expects class labels that start from 0, but pvalue_level only takes the values 1, 2, 3, so 1 is subtracted to form the target
# Accordingly, 1 has to be added to the final predictions to restore the original levels

# cms_segid, cms_group_id, final_gender_code, age_level, shopping_level and occupation are the features; pvalue_level is the target
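
The label shift is easy to get wrong in one direction or the other, so here is a plain-Python sketch of the round trip. It uses a dictionary in place of a Spark Row, and all function names are hypothetical; the sample row is user 523 from the table above.

```python
FEATURES = ["cms_segid", "cms_group_id", "final_gender_code",
            "age_level", "shopping_level", "occupation"]

def to_class_index(pvalue_level):
    """Levels 1, 2, 3 become class indices 0, 1, 2."""
    return pvalue_level - 1

def to_pvalue_level(class_index):
    """Inverse mapping, applied to a (float) prediction."""
    return int(class_index) + 1

def to_example(row):
    """Build a (label, feature list) pair from a row given as a dict."""
    return to_class_index(row["pvalue_level"]), [float(row[name]) for name in FEATURES]

# User 523 from the table above
row = {"userId": 523, "cms_segid": 5, "cms_group_id": 2, "final_gender_code": 2,
       "age_level": 2, "pvalue_level": 1, "shopping_level": 3, "occupation": 1}

label, features = to_example(row)
print(label, features)
print(to_pvalue_level(1.0))
```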

Labeled point

A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so we can use labeled points in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ….

Python
A labeled point is represented by LabeledPoint.
Refer to the LabeledPoint Python docs for more details on the API.

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])

# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))

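Random forest: pyspark.mllib.tree.RandomForest

from pyspark.mllib.tree import RandomForest
# Train a classification model
# Argument 1: the training data
# Argument 2: number of target classes (labels 0, 1, 2)
# Argument 3: which features are categorical, as {feature index: number of categories}, e.g. {2:2, 3:7};
#             {2:2} declares the feature at index 2 as categorical with two categories; {} treats every feature as continuous
# Argument 4: number of trees in the random forest
model = RandomForest.trainClassifier(train_data, 3, {}, 5)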
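
A side note on argument 3: MLlib documents that a feature declared categorical with k categories must take values in 0..k-1. The sketch below is plain Python with hypothetical helpers that check this for observed values. It suggests why final_gender_code, which takes the values 1 and 2 in the tables above, would need re-indexing before it could be declared as a two-category feature, whereas passing {} (as the call above does) sidesteps the issue.

```python
def fits_arity(observed_values, arity):
    """True if every observed value is a whole number in 0..arity-1."""
    return all(float(v).is_integer() and 0 <= v < arity for v in observed_values)

def reindex(observed_values):
    """Map the sorted distinct values to 0..k-1 (one simple re-indexing choice)."""
    return {value: index for index, value in enumerate(sorted(set(observed_values)))}

gender_codes = [2, 2, 1, 2, 1]       # final_gender_code values seen in the table above
occupation_codes = [0, 1, 0, 0, 0]   # occupation values seen in the table above

print(fits_arity(gender_codes, 2))       # values 1 and 2 do not fit 0..1
print(fits_arity(occupation_codes, 2))   # values 0 and 1 fit
mapping = reindex(gender_codes)
print(fits_arity([mapping[v] for v in gender_codes], 2))
```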

Random forest model: pyspark.mllib.tree.RandomForestModel

# Predict a single sample
# See the usage notes: https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html?highlight=tree%20random#pyspark.mllib.tree.RandomForestModel.predict
model.predict([0.0, 4.0 ,2.0 , 4.0, 1.0, 0.0])

Output:

1.0

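Select the rows with missing values

pl_na_df = user_profile_df.na.fill(-1).where("pvalue_level=-1")
pl_na_df.show(10)

def row(r):
    return r.cms_segid, r.cms_group_id, r.final_gender_code, r.age_level, r.shopping_level, r.occupation

# Convert to a plain RDD of feature tuples
rdd = pl_na_df.rdd.map(row)
# Predict pvalue_level for all of these rows:
predicts = model.predict(rdd)
# Look at the first 20 predictions
print(predicts.take(20))
print("Total number of predictions", predicts.count())

# Note on the predict argument: to predict many samples at once, it must be an RDD whose elements are plain feature lists, not a dataframe.rdd of Row objects
# That is why map is used here to turn every row into a plain sequence of feature values

Output: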

+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
|userId|cms_segid|cms_group_id|final_gender_code|age_level|pvalue_level|shopping_level|occupation|new_user_class_level|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
|   234|        0|           5|                2|        5|          -1|             3|         0|                   3|
|  1670|        0|           4|                2|        4|          -1|             1|         0|                  -1|
|  2545|        0|          10|                1|        4|          -1|             3|         0|                  -1|
|  6211|        0|           9|                1|        3|          -1|             3|         0|                   2|
|  9293|        0|           5|                2|        5|          -1|             3|         0|                   4|
| 10812|        0|           4|                2|        4|          -1|             2|         0|                  -1|
| 10996|        0|           5|                2|        5|          -1|             3|         0|                   4|
| 11602|        0|           5|                2|        5|          -1|             3|         0|                   2|
| 11727|        0|           3|                2|        3|          -1|             3|         0|                   1|
| 12195|        0|          10|                1|        4|          -1|             3|         0|                   2|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
only showing top 10 rows

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0]
Total number of predictions 575917

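Convert to a pandas DataFrame

# The data here is fairly small, so converting to a pandas DataFrame is convenient.
# This is not recommended for large data, because it loads everything into memory.
# Note: this assumes collect() and toPandas() return the rows in the same order.
temp = predicts.map(lambda x: int(x)).collect()
pdf = pl_na_df.toPandas()
import numpy as np
# Replace the column directly on the pandas DataFrame
pdf["pvalue_level"] = np.array(temp) + 1  # +1 restores the original level from the 0-based prediction
pdf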
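
The +1 above undoes a shift applied at training time: MLlib classifiers expect class indices starting from zero, while the levels in the data start from 1 (new_user_class_level is shifted the same way with `- 1` later in this section). A minimal sketch of that round trip, using hypothetical helper names that do not appear in the original code:

```python
def level_to_label(level):
    """Shift a 1-based level to the 0-based class index used for training."""
    return float(level - 1)

def label_to_level(label):
    """Shift a 0-based predicted class index back to the 1-based level."""
    return int(label) + 1

levels = [1, 2, 3]                               # example levels (illustrative)
labels = [level_to_label(v) for v in levels]     # what the classifier is trained on
restored = [label_to_level(x) for x in labels]   # what is written back to the column
print(labels)
print(restored)
```

Forgetting either half of the shift silently moves every level by one, so it is worth keeping the two conversions next to each other.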

Concatenate with the non-missing rows to complete the imputation of pvalue_level

new_user_profile_df = user_profile_df.dropna(subset=["pvalue_level"]).unionAll(spark.createDataFrame(pdf, schema=schema))
new_user_profile_df.show()

# Note: when using unionAll, the two DataFrames must have exactly the same schema

Output:

+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
|userId|cms_segid|cms_group_id|final_gender_code|age_level|pvalue_level|shopping_level|occupation|new_user_class_level|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
|   523|        5|           2|                2|        2|           1|             3|         1|                   2|
|   612|        0|           8|                1|        2|           2|             3|         0|                null|
|  3644|       49|           6|                2|        6|           2|             3|         0|                   2|
|  5777|       44|           5|                2|        5|           2|             3|         0|                   2|
|  6355|        2|           1|                2|        1|           1|             3|         0|                   4|
|  6823|       43|           5|                2|        5|           2|             3|         0|                   1|
|  6972|        5|           2|                2|        2|           2|             3|         1|                   2|
|  9510|       55|           8|                1|        2|           2|             2|         0|                   2|
| 10122|       33|           4|                2|        4|           2|             3|         0|                   2|
| 10549|        0|           4|                2|        4|           2|             3|         0|                null|
| 10912|        0|           4|                2|        4|           2|             3|         0|                null|
| 11256|        8|           2|                2|        2|           1|             3|         0|                   3|
| 11310|       31|           4|                2|        4|           1|             3|         0|                   4|
| 11739|       20|           3|                2|        3|           2|             3|         0|                   4|
| 12549|       33|           4|                2|        4|           2|             3|         0|                   2|
| 15155|       36|           5|                2|        5|           2|             1|         0|                null|
| 15347|       20|           3|                2|        3|           2|             3|         0|                   3|
| 15455|        8|           2|                2|        2|           2|             3|         0|                   3|
| 15783|        0|           4|                2|        4|           2|             3|         0|                null|
| 16749|        5|           2|                2|        2|           1|             3|         1|                   4|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
only showing top 20 rows

Use a random forest to predict the missing values of new_user_class_level

from pyspark.mllib.regression import LabeledPoint

# Select all rows where new_user_class_level is present
train_data2 = user_profile_df.dropna(subset=["new_user_class_level"]).rdd.map(
    lambda r: LabeledPoint(r.new_user_class_level - 1, [r.cms_segid, r.cms_group_id, r.final_gender_code, r.age_level, r.shopping_level, r.occupation])
)
from pyspark.mllib.tree import RandomForest
model2 = RandomForest.trainClassifier(train_data2, 4, {}, 5)
model2.predict([0.0, 4.0, 2.0, 4.0, 1.0, 0.0])
# The actual level is the prediction + 1, i.e. 2 here

Output:

1.0

nul_na_df = user_profile_df.na.fill(-1).where("new_user_class_level=-1")
nul_na_df.show(10)

def row(r):
    return r.cms_segid, r.cms_group_id, r.final_gender_code, r.age_level, r.shopping_level, r.occupation

rdd2 = nul_na_df.rdd.map(row)
# Use model2, the model trained on new_user_class_level
predicts2 = model2.predict(rdd2)
predicts2.take(20)

Output:

+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
|userId|cms_segid|cms_group_id|final_gender_code|age_level|pvalue_level|shopping_level|occupation|new_user_class_level|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
|   612|        0|           8|                1|        2|           2|             3|         0|                  -1|
|  1670|        0|           4|                2|        4|          -1|             1|         0|                  -1|
|  2545|        0|          10|                1|        4|          -1|             3|         0|                  -1|
| 10549|        0|           4|                2|        4|           2|             3|         0|                  -1|
| 10812|        0|           4|                2|        4|          -1|             2|         0|                  -1|
| 10912|        0|           4|                2|        4|           2|             3|         0|                  -1|
| 12620|        0|           4|                2|        4|          -1|             2|         0|                  -1|
| 14437|        0|           5|                2|        5|          -1|             3|         0|                  -1|
| 14574|        0|           1|                2|        1|          -1|             2|         0|                  -1|
| 14985|        0|          11|                1|        5|          -1|             2|         0|                  -1|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
only showing top 10 rows

[1.0,
 1.0,
 1.0,
 1.0,
 1.0,
 1.0,
 1.0,
 1.0,
 0.0,
 1.0,
 1.0,
 1.0,
 1.0,
 1.0,
 1.0,
 0.0,
 1.0,
 0.0,
 0.0,
 1.0]

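Conclusion: because too many values are missing in these two fields, the predicted values are heavily distorted. When the missing rate is below 10%, however, this approach is fairly effective.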
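
The 10% rule of thumb can be checked before deciding how to handle a column. The following is a plain-Python sketch on made-up rows; the function name missing_rate and the toy data are assumptions for illustration, not part of the original pipeline.

```python
def missing_rate(rows, column):
    """Fraction of rows whose value for `column` is None."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(column) is None)
    return missing / len(rows)

# Toy rows (made-up data): two of the four pvalue_level values are missing.
rows = [
    {"userId": 1, "pvalue_level": None},
    {"userId": 2, "pvalue_level": 2},
    {"userId": 3, "pvalue_level": None},
    {"userId": 4, "pvalue_level": 1},
]
rate = missing_rate(rows, "pvalue_level")
# Model-based imputation is only considered when less than 10% is missing.
use_model_imputation = rate < 0.10
print(rate, use_model_imputation)
```

With a rate this high, the check comes out negative, which is the situation this section ran into with both fields.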

user_profile_df = user_profile_df.na.fill(-1)
user_profile_df.show()
# new_df = new_df.withColumn("pvalue_level", new_df.pvalue_level.cast(StringType()))\
#     .withColumn("new_user_class_level", new_df.new_user_class_level.cast(StringType()))

Output:

+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
|userId|cms_segid|cms_group_id|final_gender_code|age_level|pvalue_level|shopping_level|occupation|new_user_class_level|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
|   234|        0|           5|                2|        5|          -1|             3|         0|                   3|
|   523|        5|           2|                2|        2|           1|             3|         1|                   2|
|   612|        0|           8|                1|        2|           2|             3|         0|                  -1|
|  1670|        0|           4|                2|        4|          -1|             1|         0|                  -1|
|  2545|        0|          10|                1|        4|          -1|             3|         0|                  -1|
|  3644|       49|           6|                2|        6|           2|             3|         0|                   2|
|  5777|       44|           5|                2|        5|           2|             3|         0|                   2|
|  6211|        0|           9|                1|        3|          -1|             3|         0|                   2|
|  6355|        2|           1|                2|        1|           1|             3|         0|                   4|
|  6823|       43|           5|                2|        5|           2|             3|         0|                   1|
|  6972|        5|           2|                2|        2|           2|             3|         1|                   2|
|  9293|        0|           5|                2|        5|          -1|             3|         0|                   4|
|  9510|       55|           8|                1|        2|           2|             2|         0|                   2|
| 10122|       33|           4|                2|        4|           2|             3|         0|                   2|
| 10549|        0|           4|                2|        4|           2|             3|         0|                  -1|
| 10812|        0|           4|                2|        4|          -1|             2|         0|                  -1|
| 10912|        0|           4|                2|        4|           2|             3|         0|                  -1|
| 10996|        0|           5|                2|        5|          -1|             3|         0|                   4|
| 11256|        8|           2|                2|        2|           1|             3|         0|                   3|
| 11310|       31|           4|                2|        4|           1|             3|         0|                   4|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
only showing top 20 rows

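Mapping from low to high dimensions
- Next we process the data by mapping the variable into a higher-dimensional space: the missing value is treated as a category of its own, which keeps the data as close to the original as possible.
  This idea matches how one-hot encoding works, so we use one-hot encoding directly.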
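
The idea can be sketched in plain Python on a toy column, where "-1" stands for a missing value. This only mimics the index-then-encode idea (most frequent category gets index 0, and no category is dropped); the function names fit_index and one_hot are hypothetical and are not Spark APIs.

```python
from collections import Counter

def fit_index(values):
    """Map each category to an index, most frequent first (ties broken by label)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {c: i for i, c in enumerate(ordered)}

def one_hot(value, index):
    """Return a sparse (size, [position], [1.0]) triple for one category."""
    return (len(index), [index[value]], [1.0])

# Toy column (made-up data): "-1" marks a missing value and is kept as a category.
column = ["-1", "1", "2", "-1", "-1", "2", "3"]
index = fit_index(column)
encoded = [one_hot(v, index) for v in column]
print(index)
print(encoded[0])
```

Because the missing marker gets its own position in the vector, the model can learn a separate weight for "unknown" instead of being fed a guessed level.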

from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline

# Use one-hot encoding to turn the one-dimensional pvalue_level into multiple dimensions, with the missing value as a category of its own

# First replace all missing values with a number so they are processed together with the existing categories
from pyspark.sql.types import StringType
user_profile_df = user_profile_df.na.fill(-1)
user_profile_df.show()

# Cast the columns to string before indexing so that their values are treated as category labels
user_profile_df = user_profile_df.withColumn("pvalue_level", user_profile_df.pvalue_level.cast(StringType()))\
    .withColumn("new_user_class_level", user_profile_df.new_user_class_level.cast(StringType()))
user_profile_df.printSchema()

# One-hot encode pvalue_level
stringindexer = StringIndexer(inputCol='pvalue_level', outputCol='pl_onehot_feature')
encoder = OneHotEncoder(dropLast=False, inputCol='pl_onehot_feature', outputCol='pl_onehot_value')
pipeline = Pipeline(stages=[stringindexer, encoder])
pipeline_fit = pipeline.fit(user_profile_df)
user_profile_df2 = pipeline_fit.transform(user_profile_df)
# The pl_onehot_value column holds sparse vectors with the one-hot encoding result
user_profile_df2.printSchema()
user_profile_df2.show()

Output:

+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
|userId|cms_segid|cms_group_id|final_gender_code|age_level|pvalue_level|shopping_level|occupation|new_user_class_level|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
|   234|        0|           5|                2|        5|          -1|             3|         0|                   3|
|   523|        5|           2|                2|        2|           1|             3|         1|                   2|
|   612|        0|           8|                1|        2|           2|             3|         0|                  -1|
|  1670|        0|           4|                2|        4|          -1|             1|         0|                  -1|
|  2545|        0|          10|                1|        4|          -1|             3|         0|                  -1|
|  3644|       49|           6|                2|        6|           2|             3|         0|                   2|
|  5777|       44|           5|                2|        5|           2|             3|         0|                   2|
|  6211|        0|           9|                1|        3|          -1|             3|         0|                   2|
|  6355|        2|           1|                2|        1|           1|             3|         0|                   4|
|  6823|       43|           5|                2|        5|           2|             3|         0|                   1|
|  6972|        5|           2|                2|        2|           2|             3|         1|                   2|
|  9293|        0|           5|                2|        5|          -1|             3|         0|                   4|
|  9510|       55|           8|                1|        2|           2|             2|         0|                   2|
| 10122|       33|           4|                2|        4|           2|             3|         0|                   2|
| 10549|        0|           4|                2|        4|           2|             3|         0|                  -1|
| 10812|        0|           4|                2|        4|          -1|             2|         0|                  -1|
| 10912|        0|           4|                2|        4|           2|             3|         0|                  -1|
| 10996|        0|           5|                2|        5|          -1|             3|         0|                   4|
| 11256|        8|           2|                2|        2|           1|             3|         0|                   3|
| 11310|       31|           4|                2|        4|           1|             3|         0|                   4|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+
only showing top 20 rows

root
 |-- userId: integer (nullable = true)
 |-- cms_segid: integer (nullable = true)
 |-- cms_group_id: integer (nullable = true)
 |-- final_gender_code: integer (nullable = true)
 |-- age_level: integer (nullable = true)
 |-- pvalue_level: string (nullable = true)
 |-- shopping_level: integer (nullable = true)
 |-- occupation: integer (nullable = true)
 |-- new_user_class_level: string (nullable = true)

root
 |-- userId: integer (nullable = true)
 |-- cms_segid: integer (nullable = true)
 |-- cms_group_id: integer (nullable = true)
 |-- final_gender_code: integer (nullable = true)
 |-- age_level: integer (nullable = true)
 |-- pvalue_level: string (nullable = true)
 |-- shopping_level: integer (nullable = true)
 |-- occupation: integer (nullable = true)
 |-- new_user_class_level: string (nullable = true)
 |-- pl_onehot_feature: double (nullable = false)
 |-- pl_onehot_value: vector (nullable = true)

+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+-----------------+---------------+
|userId|cms_segid|cms_group_id|final_gender_code|age_level|pvalue_level|shopping_level|occupation|new_user_class_level|pl_onehot_feature|pl_onehot_value|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+-----------------+---------------+
|   234|        0|           5|                2|        5|          -1|             3|         0|                   3|              0.0|  (4,[0],[1.0])|
|   523|        5|           2|                2|        2|           1|             3|         1|                   2|              2.0|  (4,[2],[1.0])|
|   612|        0|           8|                1|        2|           2|             3|         0|                  -1|              1.0|  (4,[1],[1.0])|
|  1670|        0|           4|                2|        4|          -1|             1|         0|                  -1|              0.0|  (4,[0],[1.0])|
|  2545|        0|          10|                1|        4|          -1|             3|         0|                  -1|              0.0|  (4,[0],[1.0])|
|  3644|       49|           6|                2|        6|           2|             3|         0|                   2|              1.0|  (4,[1],[1.0])|
|  5777|       44|           5|                2|        5|           2|             3|         0|                   2|              1.0|  (4,[1],[1.0])|
|  6211|        0|           9|                1|        3|          -1|             3|         0|                   2|              0.0|  (4,[0],[1.0])|
|  6355|        2|           1|                2|        1|           1|             3|         0|                   4|              2.0|  (4,[2],[1.0])|
|  6823|       43|           5|                2|        5|           2|             3|         0|                   1|              1.0|  (4,[1],[1.0])|
|  6972|        5|           2|                2|        2|           2|             3|         1|                   2|              1.0|  (4,[1],[1.0])|
|  9293|        0|           5|                2|        5|          -1|             3|         0|                   4|              0.0|  (4,[0],[1.0])|
|  9510|       55|           8|                1|        2|           2|             2|         0|                   2|              1.0|  (4,[1],[1.0])|
| 10122|       33|           4|                2|        4|           2|             3|         0|                   2|              1.0|  (4,[1],[1.0])|
| 10549|        0|           4|                2|        4|           2|             3|         0|                  -1|              1.0|  (4,[1],[1.0])|
| 10812|        0|           4|                2|        4|          -1|             2|         0|                  -1|              0.0|  (4,[0],[1.0])|
| 10912|        0|           4|                2|        4|           2|             3|         0|                  -1|              1.0|  (4,[1],[1.0])|
| 10996|        0|           5|                2|        5|          -1|             3|         0|                   4|              0.0|  (4,[0],[1.0])|
| 11256|        8|           2|                2|        2|           1|             3|         0|                   3|              2.0|  (4,[2],[1.0])|
| 11310|       31|           4|                2|        4|           1|             3|         0|                   4|              2.0|  (4,[2],[1.0])|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+-----------------+---------------+
only showing top 20 rows

Use one-hot encoding to turn the one-dimensional new_user_class_level into multiple dimensions

stringindexer = StringIndexer(inputCol='new_user_class_level', outputCol='nucl_onehot_feature')
encoder = OneHotEncoder(dropLast=False, inputCol='nucl_onehot_feature', outputCol='nucl_onehot_value')
pipeline = Pipeline(stages=[stringindexer, encoder])
pipeline_fit = pipeline.fit(user_profile_df2)
user_profile_df3 = pipeline_fit.transform(user_profile_df2)
user_profile_df3.show()

Output:

+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+-----------------+---------------+-------------------+-----------------+
|userId|cms_segid|cms_group_id|final_gender_code|age_level|pvalue_level|shopping_level|occupation|new_user_class_level|pl_onehot_feature|pl_onehot_value|nucl_onehot_feature|nucl_onehot_value|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+-----------------+---------------+-------------------+-----------------+
|   234|        0|           5|                2|        5|          -1|             3|         0|                   3|              0.0|  (4,[0],[1.0])|                2.0|    (5,[2],[1.0])|
|   523|        5|           2|                2|        2|           1|             3|         1|                   2|              2.0|  (4,[2],[1.0])|                1.0|    (5,[1],[1.0])|
|   612|        0|           8|                1|        2|           2|             3|         0|                  -1|              1.0|  (4,[1],[1.0])|                0.0|    (5,[0],[1.0])|
|  1670|        0|           4|                2|        4|          -1|             1|         0|                  -1|              0.0|  (4,[0],[1.0])|                0.0|    (5,[0],[1.0])|
|  2545|        0|          10|                1|        4|          -1|             3|         0|                  -1|              0.0|  (4,[0],[1.0])|                0.0|    (5,[0],[1.0])|
|  3644|       49|           6|                2|        6|           2|             3|         0|                   2|              1.0|  (4,[1],[1.0])|                1.0|    (5,[1],[1.0])|
|  5777|       44|           5|                2|        5|           2|             3|         0|                   2|              1.0|  (4,[1],[1.0])|                1.0|    (5,[1],[1.0])|
|  6211|        0|           9|                1|        3|          -1|             3|         0|                   2|              0.0|  (4,[0],[1.0])|                1.0|    (5,[1],[1.0])|
|  6355|        2|           1|                2|        1|           1|             3|         0|                   4|              2.0|  (4,[2],[1.0])|                3.0|    (5,[3],[1.0])|
|  6823|       43|           5|                2|        5|           2|             3|         0|                   1|              1.0|  (4,[1],[1.0])|                4.0|    (5,[4],[1.0])|
|  6972|        5|           2|                2|        2|           2|             3|         1|                   2|              1.0|  (4,[1],[1.0])|                1.0|    (5,[1],[1.0])|
|  9293|        0|           5|                2|        5|          -1|             3|         0|                   4|              0.0|  (4,[0],[1.0])|                3.0|    (5,[3],[1.0])|
|  9510|       55|           8|                1|        2|           2|             2|         0|                   2|              1.0|  (4,[1],[1.0])|                1.0|    (5,[1],[1.0])|
| 10122|       33|           4|                2|        4|           2|             3|         0|                   2|              1.0|  (4,[1],[1.0])|                1.0|    (5,[1],[1.0])|
| 10549|        0|           4|                2|        4|           2|             3|         0|                  -1|              1.0|  (4,[1],[1.0])|                0.0|    (5,[0],[1.0])|
| 10812|        0|           4|                2|        4|          -1|             2|         0|                  -1|              0.0|  (4,[0],[1.0])|                0.0|    (5,[0],[1.0])|
| 10912|        0|           4|                2|        4|           2|             3|         0|                  -1|              1.0|  (4,[1],[1.0])|                0.0|    (5,[0],[1.0])|
| 10996|        0|           5|                2|        5|          -1|             3|         0|                   4|              0.0|  (4,[0],[1.0])|                3.0|    (5,[3],[1.0])|
| 11256|        8|           2|                2|        2|           1|             3|         0|                   3|              2.0|  (4,[2],[1.0])|                2.0|    (5,[2],[1.0])|
| 11310|       31|           4|                2|        4|           1|             3|         0|                   4|              2.0|  (4,[2],[1.0])|                3.0|    (5,[3],[1.0])|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+-----------------+---------------+-------------------+-----------------+
only showing top 20 rows

Merging the user features

from pyspark.ml.feature import VectorAssembler
feature_df = VectorAssembler().setInputCols(["age_level", "pl_onehot_value", "nucl_onehot_value"]).setOutputCol("features").transform(user_profile_df3)
feature_df.show()

Output:

+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+-----------------+---------------+-------------------+-----------------+--------------------+
|userId|cms_segid|cms_group_id|final_gender_code|age_level|pvalue_level|shopping_level|occupation|new_user_class_level|pl_onehot_feature|pl_onehot_value|nucl_onehot_feature|nucl_onehot_value|            features|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+-----------------+---------------+-------------------+-----------------+--------------------+
|   234|        0|           5|                2|        5|          -1|             3|         0|                   3|              0.0|  (4,[0],[1.0])|                2.0|    (5,[2],[1.0])|(10,[0,1,7],[5.0,...|
|   523|        5|           2|                2|        2|           1|             3|         1|                   2|              2.0|  (4,[2],[1.0])|                1.0|    (5,[1],[1.0])|(10,[0,3,6],[2.0,...|
|   612|        0|           8|                1|        2|           2|             3|         0|                  -1|              1.0|  (4,[1],[1.0])|                0.0|    (5,[0],[1.0])|(10,[0,2,5],[2.0,...|
|  1670|        0|           4|                2|        4|          -1|             1|         0|                  -1|              0.0|  (4,[0],[1.0])|                0.0|    (5,[0],[1.0])|(10,[0,1,5],[4.0,...|
|  2545|        0|          10|                1|        4|          -1|             3|         0|                  -1|              0.0|  (4,[0],[1.0])|                0.0|    (5,[0],[1.0])|(10,[0,1,5],[4.0,...|
|  3644|       49|           6|                2|        6|           2|             3|         0|                   2|              1.0|  (4,[1],[1.0])|                1.0|    (5,[1],[1.0])|(10,[0,2,6],[6.0,...|
|  5777|       44|           5|                2|        5|           2|             3|         0|                   2|              1.0|  (4,[1],[1.0])|                1.0|    (5,[1],[1.0])|(10,[0,2,6],[5.0,...|
|  6211|        0|           9|                1|        3|          -1|             3|         0|                   2|              0.0|  (4,[0],[1.0])|                1.0|    (5,[1],[1.0])|(10,[0,1,6],[3.0,...|
|  6355|        2|           1|                2|        1|           1|             3|         0|                   4|              2.0|  (4,[2],[1.0])|                3.0|    (5,[3],[1.0])|(10,[0,3,8],[1.0,...|
|  6823|       43|           5|                2|        5|           2|             3|         0|                   1|              1.0|  (4,[1],[1.0])|                4.0|    (5,[4],[1.0])|(10,[0,2,9],[5.0,...|
|  6972|        5|           2|                2|        2|           2|             3|         1|                   2|              1.0|  (4,[1],[1.0])|                1.0|    (5,[1],[1.0])|(10,[0,2,6],[2.0,...|
|  9293|        0|           5|                2|        5|          -1|             3|         0|                   4|              0.0|  (4,[0],[1.0])|                3.0|    (5,[3],[1.0])|(10,[0,1,8],[5.0,...|
|  9510|       55|           8|                1|        2|           2|             2|         0|                   2|              1.0|  (4,[1],[1.0])|                1.0|    (5,[1],[1.0])|(10,[0,2,6],[2.0,...|
| 10122|       33|           4|                2|        4|           2|             3|         0|                   2|              1.0|  (4,[1],[1.0])|                1.0|    (5,[1],[1.0])|(10,[0,2,6],[4.0,...|
| 10549|        0|           4|                2|        4|           2|             3|         0|                  -1|              1.0|  (4,[1],[1.0])|                0.0|    (5,[0],[1.0])|(10,[0,2,5],[4.0,...|
| 10812|        0|           4|                2|        4|          -1|             2|         0|                  -1|              0.0|  (4,[0],[1.0])|                0.0|    (5,[0],[1.0])|(10,[0,1,5],[4.0,...|
| 10912|        0|           4|                2|        4|           2|             3|         0|                  -1|              1.0|  (4,[1],[1.0])|                0.0|    (5,[0],[1.0])|(10,[0,2,5],[4.0,...|
| 10996|        0|           5|                2|        5|          -1|             3|         0|                   4|              0.0|  (4,[0],[1.0])|                3.0|    (5,[3],[1.0])|(10,[0,1,8],[5.0,...|
| 11256|        8|           2|                2|        2|           1|             3|         0|                   3|              2.0|  (4,[2],[1.0])|                2.0|    (5,[2],[1.0])|(10,[0,3,7],[2.0,...|
| 11310|       31|           4|                2|        4|           1|             3|         0|                   4|              2.0|  (4,[2],[1.0])|                3.0|    (5,[3],[1.0])|(10,[0,3,8],[4.0,...|
+------+---------+------------+-----------------+---------+------------+--------------+----------+--------------------+-----------------+---------------+-------------------+-----------------+--------------------+
only showing top 20 rows

feature_df.select("features").show()

Output:

+--------------------+
|            features|
+--------------------+
|(10,[0,1,7],[5.0,...|
|(10,[0,3,6],[2.0,...|
|(10,[0,2,5],[2.0,...|
|(10,[0,1,5],[4.0,...|
|(10,[0,1,5],[4.0,...|
|(10,[0,2,6],[6.0,...|
|(10,[0,2,6],[5.0,...|
|(10,[0,1,6],[3.0,...|
|(10,[0,3,8],[1.0,...|
|(10,[0,2,9],[5.0,...|
|(10,[0,2,6],[2.0,...|
|(10,[0,1,8],[5.0,...|
|(10,[0,2,6],[2.0,...|
|(10,[0,2,6],[4.0,...|
|(10,[0,2,5],[4.0,...|
|(10,[0,1,5],[4.0,...|
|(10,[0,2,5],[4.0,...|
|(10,[0,1,8],[5.0,...|
|(10,[0,3,7],[2.0,...|
|(10,[0,3,8],[4.0,...|
+--------------------+
only showing top 20 rows
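
To read these vectors: the assembled vector is the concatenation of the input columns in order, so age_level occupies index 0, pl_onehot_value indices 1 to 4, and nucl_onehot_value indices 5 to 9. A plain-Python sketch of that concatenation follows; the function assemble is a hypothetical stand-in written for illustration, not the VectorAssembler implementation.

```python
def assemble(parts):
    """Concatenate scalars and sparse (size, indices, values) triples into one sparse triple."""
    offset = 0
    indices, values = [], []
    for part in parts:
        if isinstance(part, tuple):
            # Sparse vector: shift its indices by the current offset.
            size, idx, vals = part
            indices.extend(offset + i for i in idx)
            values.extend(vals)
            offset += size
        else:
            # Scalar column: takes exactly one slot; zeros are not stored.
            if part != 0:
                indices.append(offset)
                values.append(float(part))
            offset += 1
    return (offset, indices, values)

# First row above: age_level=5, pl_onehot_value=(4,[0],[1.0]), nucl_onehot_value=(5,[2],[1.0])
features = assemble([5, (4, [0], [1.0]), (5, [2], [1.0])])
print(features)
```

The indices [0, 1, 7] match the first row of the table: index 1 is the first slot of the pvalue_level block and index 7 is offset 5 plus position 2 in the new_user_class_level block.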

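Feature selection

Besides pvalue_level and new_user_class_level processed above (both reflect the user's purchasing power), the following are also used as features.

Number of distinct values for the categorical features analyzed earlier:

- cms_segid: 97
- cms_group_id: 13
- final_gender_code: 2
- age_level: 7
- shopping_level: 3
- occupation: 2
- pvalue_level
- new_user_class_level
- price

From experience, the categorical features above all reflect the user's shopping behavior to some degree, and each has relatively few categories, so all of them can be used as user features.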

概述Pytest是Python里的一个强大的测试框架，灵活易用，可以进行功能，自动化测试使用，可以与Requests，Selenium等进行结合使用，同时可以生成Html的报告。一、Pytest的基本使用在未指定Pytest的配置文件时，会对以下文件进行执行：test_*.py，如：test_1.py*_test.py，如：1_test.py会对以下的类和函数进行执行：类：以Test_开头的类，如
Visual Studio Code官网下载地址及使用技巧（含常用的拓展插件推荐） ITCTCSDN vscode ide 编辑器
VisualStudioCode（简称“VSCode”）是Microsoft于2015年4月发布的可运行于MacOS、Windows和Linux之上的跨平台源代码编辑器，它具有对JavaScript，TypeScript和Node.js的内置支持，并具有丰富的其他语言（例如C++，C＃，Java，Python，PHP，Go）和运行时（例如.NET和Unity）扩展的生态系统。VisualStudi
MySQL中基于机器学习的自适应缓存热点识别优化策略——开启数据库性能新纪元墨夶数据库学习资料1 数据库 mysql 机器学习
在数据驱动的世界里，数据库的性能直接影响到整个应用系统的响应速度和用户体验。随着业务量的增长和技术的发展，传统的缓存机制逐渐暴露出局限性。如何更智能地识别并利用热点数据进行缓存优化，成为提升数据库性能的关键所在。今天，我们将深入探讨一种创新的方法——基于机器学习的自适应缓存热点识别优化策略，并分享其在MySQL中的具体实现方案。为什么选择机器学习？‍传统上，开发者们依赖于手动配置或预设规则来决定哪
python中rmdir和rmtree的用法 Gin387 python
shutil.rmtree()是Python中shutil模块提供的一个函数，用于递归删除整个目录树（包括子目录和所有文件）。os.rmdir()（只能删除空目录）不同，shutil.rmtree()可以强制删除非空目录importshutil#删除指定目录及其所有内容shutil.rmtree('path/to/directory')
构建 Python 插件架构：打造灵活可扩展的模块化应用全栈探索者chen python python 架构开发语言学习机器学习程序人生插件
构建Python插件架构：打造灵活可扩展的模块化应用前言在现代软件开发中，单一的代码库往往难以满足不断变化的业务需求和多样化的扩展场景。如何设计一个应用，使其既能保持核心功能的稳定，又能轻松集成第三方功能、模块或定制化扩展？答案就是——插件架构。通过插件架构，你可以让应用具备极高的灵活性，支持动态加载、无缝扩展以及解耦维护。本文将深入探讨如何在Python中设计和构建一个插件架构。从核心概念、模块
产品经理必备知识之网页设计系列（二）-如何设计出一个优秀的界面文宇肃然产品运营系列课程快速学习实战应用界面设计产品设计产品经理网页设计
前言第一部分参见产品经理必备知识之网页设计系列（一）-创建出色用户体验https://blog.csdn.net/wenyusuran/article/details/108199875第三部分参见产品经理必备知识之网页设计系列（三）-移动端适配&无障碍设计及测试https://wenyusuran.blog.csdn.net/article/details/108199947设计师和开发人员在构
31天Python入门——第11天:挑战一口气把闭包·装饰器讲明白安然无虞 Python手把手教程 python 开发语言后端 pyqt
你好，我是安然无虞。文章目录1.闭包扩展知识:闭包的自由变量是如何存储的2.装饰器装饰器的应用场景3.补充练习1.闭包闭包是指在一个函数内部定义的函数，并且这个内部函数可以访问外部函数的变量、参数.换句话说，闭包是一个包含了函数及其相关引用环境的组合体.在Python中，当一个函数返回了内部函数的引用时，这个内部函数可以访问并操作外部函数的局部变量，它就创建了一个闭包,即使外部函数已经执行完毕，它
opencv python rgb转yuv_OpenCV之色彩空间与色彩空间转换 xiao fei opencv python rgb转yuv
python代码：importcv2ascvsrc=cv.imread("test.jpg")cv.namedWindow("rgb",cv.WINDOW_AUTOSIZE)cv.imshow("rgb",src)#RGBtoHSVhsv=cv.cvtColor(src,cv.COLOR_BGR2HSV)cv.imshow("hsv",hsv)#RGBtoYUVyuv=cv.cvtColor(sr
【AI大模型】搭建本地大模型GPT-NeoX：详细步骤及常见问题处理 qzw1210 gpt 人工智能深度学习
搭建本地大模型GPT-NeoX：详细步骤及常见问题处理GPT-NeoX是一个开源的大型语言模型框架，由EleutherAI开发，可用于训练和部署类似GPT-3的大型语言模型。本指南将详细介绍如何在本地环境中搭建GPT-NeoX，并解决过程中可能遇到的常见问题。1.系统要求1.1硬件要求1.2软件要求操作系统:Linux(推荐Ubuntu20.04或更高版本)CUDA:11.2或更高版本Python
python 列表倒序输出小琳爱分享 python python
python列表倒序输出#使用reverseli1=[1,6,4,3,7,9]li2=['a','m','s','g']li1.reverse()li2.reverse()print(li1,li2)#利用list切片li1=[1,6,4,3,7,9]li2=['a','m','s','g']print(li1[::-1])print(li2[::-1])#利用算法进行转换，这里需要用到深层cop
python怎么输出倒序 hakesashou python基础知识 python java 服务器
python怎么输出倒序？下面给大家介绍四种方法：创建测试列表>>> lst = [1,2,3,4,5,6]方法1：>>> lst.reverse() #reverse()反转>>> lst[6, 5, 4, 3, 2, 1]方法2：>>> lst1 = [i for i in reversed(lst)] #reversed只适用于与序列(列表、元组、字符串)>>> lst1[6, 5, 4,
chatgpt赋能python：Python怎么倒序列表 aijinglingchat ChatGpt python chatgpt 人工智能计算机
Python怎么倒序列表列表是Python中最常用的数据结构之一，但在实际使用时，有时需要将列表进行倒序排列。Python提供了多种方法来实现这个需求，本文将简要介绍这些方法以及它们的使用场景。方法1：使用reverse()函数使用列表的reverse()方法是Python中最简单直接的方法来倒序列表。该方法会将原列表倒置。lst=[1,2,3,4,5]lst.reverse()print(lst
“统计视角看世界”专栏阅读引导赛卡统计视角看世界信息可视化数据分析
根据文章主题和逻辑关系，我为您设计以下阅读引导方案：1.六西格玛基础2.帕累托图3.直方图4.散点图基础5.散点图高阶6.多变量可视化7.密度图进阶8.回归分析配套文字说明：入门基石（必读）《1.六西格玛遇上Python》→方法论总纲，建议优先精读基础三剑客（可并行）├─《2.帕累托图》→重点数据排序与决策├─《3.直方图》→数据分布核心工具└─《4.散点图》→数据探索第一视角高阶应用链（递进学习
自定义mavlink 生成wireshark wlua插件错误（已解决） JasonComing 问题收集 wireshark wlua mavlink
进入正题python3-mpymavlink.tools.mavgen--lang=WLua--wire-protocol=2.0--output=output/developmessage_definitions/v1.0/development.xml编译WLUA的时候遇到一些问题1.ERROR:SCHEMASV:SCHEMAV_CVC_ENUMERATION_VALID3765:0:ERRO
吐血整理 python最全习题100道（含答案）持续更新题目，建议收藏！ Bejpse 面试学习路线阿里巴巴 python 开发语言 pycharm redis java-ee
最近为了提升python水平，在网上找到了python习题，然后根据自己对于python的掌握，整理出来了答案，如果小伙伴们有更好的实现方式，可以下面留言大家一起讨论哦~已知一个字符串为“hello_world_yoyo”,如何得到一个队列[“hello”,”world”,”yoyo”]test=‘hello_world_yoyo’使用split函数，分割字符串，并且将数据转换成列表类型print
Qt插件之自定义插件构建和使用码农飞飞 QT+QML qt 开发语言 ui 插件代码复用
文章目录定义插件的SDK编写自定义插件动态加载自定义插件分发SDK上一篇文章介绍了如何构建QtDesigner插件。其实插件化的这套机制QT是对外开放的，这里就介绍一下如何使用QT开发自定义插件。在开发自定义插件之前我们先定义插件的SDK。插件的SDK就是插件的接口描述，任何开发者开发的插件都应该实现对应的接口。同时只要实现了对应的接口的插件，就可以被集成到系统当中，这其实就是给自定义插件提供了一
2024年第五届MathorCup数学应用挑战赛--大数据竞赛思路、代码更新中..... 宇哥预测优化代码学习 1024程序员节
欢迎来到本博客❤️❤️博主优势：博客内容尽量做到思维缜密，逻辑清晰，为了方便读者。⛳️座右铭：行百里者，半于九十。本文目录如下：目录⛳️研赛及概况一、竞赛背景与目的二、组织机构与参赛对象三、竞赛时间与流程四、竞赛要求与规则五、奖项设置与奖励六、研究文档撰写建议七、参考资料与资源1找程序网站推荐2公式编辑器、流程图、论文排版324年研赛资源下载4思路、Python、Matlab代码分享......⛳
[黑洞与暗粒子]没有光的世界 comsci
无论是相对论还是其它现代物理学,都显然有个缺陷,那就是必须有光才能够计算但是,我相信,在我们的世界和宇宙平面中,肯定存在没有光的世界.... 那么,在没有光的世界,光子和其它粒子的规律无法被应用和考察,那么以光速为核心的 &nbs
jQuery Lazy Load 图片延迟加载 aijuans jquery
基于 jQuery 的图片延迟加载插件，在用户滚动页面到图片之后才进行加载。对于有较多的图片的网页，使用图片延迟加载，能有效的提高页面加载速度。版本： jQuery v1.4.4+ jQuery Lazy Load v1.7.2 注意事项：需要真正实现图片延迟加载，必须将真实图片地址写在 data-original 属性中。若 src
使用Jodd的优点 Kai_Ge jodd
1. 简化和统一 controller ，抛弃 extends SimpleFormController ，统一使用 implements Controller 的方式。 2. 简化 JSP 页面的 bind, 不需要一个字段一个字段的绑定。 3. 对 bean 没有任何要求，可以使用任意的 bean 做为 formBean。使用方法简介
jpa Query转hibernate Query 120153216 Hibernate
public List<Map> getMapList(String hql, Map map) { org.hibernate.Query jpaQuery = entityManager.createQuery(hql); if (null != map) { for (String parameter : map.keySet()) { jp
Django_Python3添加MySQL/MariaDB支持 2002wmj mariaDB
现状首先，[email protected] 中默认的引擎为 django.db.backends.mysql 。但是在Python3中如果这样写的话，会发现 django.db.backends.mysql 依赖 MySQLdb[5] ，而 MySQLdb 又不兼容 Python3 于是要找一种新的方式来继续使用MySQL。 MySQL官方的方案首先据MySQL文档[3]说，自从MySQL
在SQLSERVER中查找消耗IO最多的SQL 357029540 SQL Server
返回做IO数目最多的50条语句以及它们的执行计划。 select top 50 (total_logical_reads/execution_count) as avg_logical_reads, (total_logical_writes/execution_count) as avg_logical_writes, (tot
spring UnChecked 异常官方定义！ 7454103 spring
如果你接触过spring的事物管理！那么你必须明白 spring的非捕获异常！即 unchecked 异常！因为 spring 默认这类异常事物自动回滚！！ public static boolean isCheckedException(Throwable ex) { return !(ex instanceof RuntimeExcep
mongoDB 入门指南、示例 adminjun java mongodb 操作
一、准备工作 1、下载mongoDB 下载地址：http://www.mongodb.org/downloads 选择合适你的版本相关文档：http://www.mongodb.org/display/DOCS/Tutorial 2、安装mongoDB A、不解压模式：将下载下来的mongoDB-xxx.zip打开，找到bin目录，运行mongod.exe就可以启动服务，默
CUDA 5 Release Candidate Now Available aijuans CUDA
The CUDA 5 Release Candidate is now available at http://developer.nvidia.com/<wbr></wbr>cuda/cuda-pre-production. Now applicable to a broader set of algorithms, CUDA 5 has advanced fe
Essential Studio for WinRT网格控件测评 Axiba JavaScript html5
Essential Studio for WinRT界面控件包含了商业平板应用程序开发中所需的所有控件，如市场上运行速度最快的grid 和chart、地图、RDL报表查看器、丰富的文本查看器及图表等等。同时，该控件还包含了一组独特的库，用于从WinRT应用程序中生成Excel、Word以及PDF格式的文件。此文将对其另外一个强大的控件——网格控件进行专门的测评详述。网格控件功能 1、
java 获取windows系统安装的证书或证书链 bewithme windows
有时需要获取windows系统安装的证书或证书链，比如说你要通过证书来创建java的密钥库。有关证书链的解释可以查看此处。 public static void main(String[] args) { SunMSCAPI providerMSCAPI = new SunMSCAPI(); S
NoSQL数据库之Redis数据库管理(set类型和zset类型) bijian1013 redis 数据库 NoSQL
4.sets类型 Set是集合，它是string类型的无序集合。set是通过hash table实现的，添加、删除和查找的复杂度都是O(1)。对集合我们可以取并集、交集、差集。通过这些操作我们可以实现sns中的好友推荐和blog的tag功能。 sadd：向名称为key的set中添加元
异常捕获何时用Exception，何时用Throwable bingyingao
用Exception的情况 try { //可能发生空指针、数组溢出等异常 } catch (Exception e) {
【Kafka四】Kakfa伪分布式安装 bit1129 kafka
在http://bit1129.iteye.com/blog/2174791一文中，实现了单Kafka服务器的安装，在Kafka中，每个Kafka服务器称为一个broker。本文简单介绍下，在单机环境下Kafka的伪分布式安装和测试验证 1. 安装步骤 Kafka伪分布式安装的思路跟Zookeeper的伪分布式安装思路完全一样，不过比Zookeeper稍微简单些(不
Project Euler bookjovi haskell
Project Euler是个数学问题求解网站，网站设计的很有意思，有很多problem，在未提交正确答案前不能查看problem的overview，也不能查看关于problem的discussion thread，只能看到现在problem已经被多少人解决了，人数越多往往代表问题越容易。看看problem 1吧： Add all the natural num
Java-Collections Framework学习与总结-ArrayDeque BrokenDreams Collections
表、栈和队列是三种基本的数据结构，前面总结的ArrayList和LinkedList可以作为任意一种数据结构来使用，当然由于实现方式的不同，操作的效率也会不同。这篇要看一下java.util.ArrayDeque。从命名上看
读《研磨设计模式》-代码笔记-装饰模式-Decorator bylijinnan java 设计模式
声明：本文只为方便我个人查阅和理解，详细的分析以及源代码请移步原作者的博客http://chjavach.iteye.com/ import java.io.BufferedOutputStream; import java.io.DataOutputStream; import java.io.FileOutputStream; import java.io.Fi
Maven学习(一) chenyu19891124 Maven私服
学习一门技术和工具总得花费一段时间，5月底6月初自己学习了一些工具，maven+Hudson+nexus的搭建，对于maven以前只是听说，顺便再自己的电脑上搭建了一个maven环境，但是完全不了解maven这一强大的构建工具，还有ant也是一个构建工具，但ant就没有maven那么的简单方便，其实简单点说maven是一个运用命令行就能完成构建，测试，打包，发布一系列功
[原创]JWFD工作流引擎设计----节点匹配搜索算法(用于初步解决条件异步汇聚问题) 补充 comsci 算法工作 PHP 搜索引擎嵌入式
本文主要介绍在JWFD工作流引擎设计中遇到的一个实际问题的解决方案，请参考我的博文"带条件选择的并行汇聚路由问题"中图例A2描述的情况(http://comsci.iteye.com/blog/339756),我现在把我对图例A2的一个解决方案公布出来，请大家多指点节点匹配搜索算法(用于解决标准对称流程图条件汇聚点运行控制参数的算法) 需要解决的问题：已知分支
Linux中用shell获取昨天、明天或多天前的日期 daizj linux shell 上几年昨天获取上几个月
在Linux中可以通过date命令获取昨天、明天、上个月、下个月、上一年和下一年 # 获取昨天 date -d 'yesterday' # 或 date -d 'last day' # 获取明天 date -d 'tomorrow' # 或 date -d 'next day' # 获取上个月 date -d 'last month' #
我所理解的云计算 dongwei_6688 云计算
在刚开始接触到一个概念时，人们往往都会去探寻这个概念的含义，以达到对其有一个感性的认知，在Wikipedia上关于“云计算”是这么定义的，它说： Cloud computing is a phrase used to describe a variety of computing co
YII CMenu配置 dcj3sjt126com yii
Adding id and class names to CMenu We use the id and htmlOptions to accomplish this. Watch. //in your view $this->widget('zii.widgets.CMenu', array( 'id'=>'myMenu', 'items'=>$this-&g
设计模式之静态代理与动态代理 come_for_dream 设计模式
静态代理与动态代理代理模式是java开发中用到的相对比较多的设计模式，其中的思想就是主业务和相关业务分离。所谓的代理设计就是指由一个代理主题来操作真实主题，真实主题执行具体的业务操作，而代理主题负责其他相关业务的处理。比如我们在进行删除操作的时候需要检验一下用户是否登陆，我们可以删除看成主业务，而把检验用户是否登陆看成其相关业务
【转】理解Javascript 系列 gcc2ge JavaScript
理解Javascript_13_执行模型详解摘要: 在《理解Javascript_12_执行模型浅析》一文中,我们初步的了解了执行上下文与作用域的概念，那么这一篇将深入分析执行上下文的构建过程，了解执行上下文、函数对象、作用域三者之间的关系。函数执行环境简单的代码:当调用say方法时，第一步是创建其执行环境，在创建执行环境的过程中，会按照定义的先后顺序完成一系列操作:1.首先会创建一个
Subsets II hcx2013 set
Given a collection of integers that might contain duplicates, nums, return all possible subsets. Note: Elements in a subset must be in non-descending order. The solution set must not conta
Spring4.1新特性——Spring缓存框架增强 jinnianshilongnian spring4
目录 Spring4.1新特性——综述 Spring4.1新特性——Spring核心部分及其他 Spring4.1新特性——Spring缓存框架增强 Spring4.1新特性——异步调用和事件机制的异常处理 Spring4.1新特性——数据库集成测试脚本初始化 Spring4.1新特性——Spring MVC增强 Spring4.1新特性——页面自动化测试框架Spring MVC T
shell嵌套expect执行命令 liyonghui160com
一直都想把expect的操作写到bash脚本里,这样就不用我再写两个脚本来执行了,搞了一下午终于有点小成就,给大家看看吧. 系统:centos 5.x 1.先安装expect yum -y install expect 2.脚本内容: cat auto_svn.sh #!/bin/bash
Linux实用命令整理 pda158 linux
0. 基本命令　　linux 基本命令整理　　1. 压缩解压　　tar -zcvf a.tar.gz a #把a压缩成a.tar.gz 　　tar -zxvf a.tar.gz #把a.tar.gz解压成a 　　2. vim小结　　2.1 vim替换　　:m,ns/word_1/word_2/gc
独立开发人员通向成功的29个小贴士 shoothao 独立开发
概述：本文收集了关于独立开发人员通向成功需要注意的一些东西,对于具体的每个贴士的注解有兴趣的朋友可以查看下面标注的原文地址。明白你从事独立开发的原因和目的。保持坚持制定计划的好习惯。万事开头难，第一份订单是关键。培养多元化业务技能。提供卓越的服务和品质。谨小慎微。营销是必备技能。学会组织，有条理的工作才是最有效率的。 “独立
JAVA中堆栈和内存分配原理 uule java
1、栈、堆 1.寄存器：最快的存储区, 由编译器根据需求进行分配,我们在程序中无法控制.2. 栈：存放基本类型的变量数据和对象的引用，但对象本身不存放在栈中，而是存放在堆（new 出来的对象）或者常量池中（字符串常量对象存放在常量池中。）3. 堆：存放所有new出来的对象。4. 静态域：存放静态成员（static定义的）5. 常量池：存放字符串常量和基本类型常量（public static f