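News Recommendation 02: Exploratory Data Analysis

1. Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a data-analysis method, or mindset, that uses a range of techniques (mostly visual) to explore the internal structure and regularities of data under as few prior assumptions as possible. It is especially useful when we lack prior knowledge about the data and do not know which method to apply: exploring the data first reveals its patterns and characteristics, which lets us choose and adjust a suitable model flexibly.

EDA techniques generally fall into two groups.

The first is visualization techniques,

such as box plots, histograms, multi-vari charts, run charts, Pareto charts, scatter plots, stem-and-leaf plots, parallel coordinates, odds ratios, multidimensional scaling,
targeted projection pursuit, principal component analysis, multilinear principal component analysis, dimensionality reduction, and nonlinear dimensionality reduction;

The second is quantitative techniques,

such as the sample mean, variance, quantiles, kurtosis, and skewness.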
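The quantitative measures listed above can be computed directly with pandas. The following is a minimal sketch on a made-up series of word counts; the values and the name `words` are assumptions for illustration and are not taken from the competition data.

```python
import pandas as pd

# Hypothetical sample: word counts of eight articles (made-up values).
words = pd.Series([120, 150, 170, 180, 200, 210, 230, 900], name="words_count")

summary = {
    "mean": words.mean(),
    "variance": words.var(),        # sample variance (ddof=1)
    "q25": words.quantile(0.25),
    "median": words.quantile(0.5),
    "q75": words.quantile(0.75),
    "skewness": words.skew(),       # positive values indicate a long right tail
    "kurtosis": words.kurt(),       # excess kurtosis (0 for a normal distribution)
}

for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")
```

A single very long article (900 words here) pulls the mean well above the median and produces positive skewness, which is the kind of signal these statistics are meant to surface.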

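Purpose of exploratory data analysis

The goal is to understand the data set as a whole: what data each file contains, what each field actually means, and how the features relate to each other. In a recommendation setting this mainly means analyzing the basic attributes of users, the basic attributes of articles, and the distributions of user-article interactions. All of this informs the choice of recall strategy and the feature engineering that follow.

Tip: when feature engineering and model tuning no longer improve the score, come back and look at the data from a fresh angle; this may spark new ideas for improving it.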

2. Code Implementation

2.1 Reading the data

%matplotlib inline
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
plt.rc('font', family='SimHei', size=13)

import os,gc,re,warnings,sys
warnings.filterwarnings("ignore")

path = 'data/'

#####train
trn_click = pd.read_csv(path+'train_click_log.csv')
item_df = pd.read_csv(path+'articles.csv')
item_df = item_df.rename(columns={'article_id': 'click_article_id'})  # rename so it can be merged with the click logs later
item_emb_df = pd.read_csv(path+'articles_emb.csv')

#####test
tst_click = pd.read_csv(path+'testA_click_log.csv')
trn_click.head()
user_id click_article_id click_timestamp click_environment click_deviceGroup click_os click_country click_region click_referrer_type rank click_cnts category_id created_at_ts words_count
199999 160417 1507029570190 4 1 17 1 13 1 11 11 281 1506942089000 173
199999 5408 1507029571478 4 1 17 1 13 1 10 11 4 1506994257000 118
199999 50823 1507029601478 4 1 17 1 13 1 9 11 99 1507013614000 213
199998 157770 1507029532200 4 1 17 1 25 5 40 40 281 1506983935000 201
199998 96613 1507029671831 4 1 17 1 25 5 39 40 209 1506938444000 185
item_df.head()
click_article_id category_id created_at_ts words_count
0 0 1513144419000 168
1 1 1405341936000 189
2 1 1408667706000 250
3 1 1408468313000 230
4 1 1407071171000 162
tst_click.head()
user_id click_article_id click_timestamp click_environment click_deviceGroup click_os click_country click_region click_referrer_type rank click_cnts category_id created_at_ts words_count
249999 160974 1506959142820 4 1 17 1 13 2 19 19 281 1506912747000 259
249999 160417 1506959172820 4 1 17 1 13 2 18 19 281 1506942089000 173
249998 160974 1506959056066 4 1 12 1 13 2 5 5 281 1506912747000 259
249998 202557 1506959086066 4 1 12 1 13 2 4 5 327 1506938401000 219
249997 183665 1506959088613 4 1 17 1 15 5 7 7 301 1500895686000 256

2.2 Data preprocessing

Compute each user's click rank and click count

# Rank each user's clicks by timestamp (rank 1 = most recent click)
trn_click['rank'] = trn_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int)
tst_click['rank'] = tst_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int)

# Count each user's clicks and store the result in a new column, click_cnts
trn_click['click_cnts'] = trn_click.groupby(['user_id'])['click_timestamp'].transform('count')
tst_click['click_cnts'] = tst_click.groupby(['user_id'])['click_timestamp'].transform('count')
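To make the two derived columns concrete, here is a small sketch that applies the same `rank` and `transform('count')` calls to a made-up frame. The frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical click log: user 1 clicks three times, user 2 twice.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "click_timestamp": [100, 300, 200, 50, 80],
})

# Rank 1 is the most recent click of each user.
toy["rank"] = toy.groupby("user_id")["click_timestamp"].rank(ascending=False).astype(int)
# Every row carries the total number of clicks of its user.
toy["click_cnts"] = toy.groupby("user_id")["click_timestamp"].transform("count")

print(toy)
```

Note that `rank` uses average ranks for ties by default, so two clicks with an identical timestamp would get a fractional rank that `astype(int)` truncates.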

2.3 Browsing the data

2.3.1 User click logs: training set

trn_click = trn_click.merge(item_df, how='left', on=['click_article_id'])
trn_click.head()
user_id click_article_id click_timestamp click_environment click_deviceGroup click_os click_country click_region click_referrer_type rank click_cnts category_id created_at_ts words_count
199999 160417 1507029570190 4 1 17 1 13 1 11 11 281 1506942089000 173
199999 5408 1507029571478 4 1 17 1 13 1 10 11 4 1506994257000 118
199999 50823 1507029601478 4 1 17 1 13 1 9 11 99 1507013614000 213
199998 157770 1507029532200 4 1 17 1 25 5 40 40 281 1506983935000 201
199998 96613 1507029671831 4 1 17 1 25 5 39 40 209 1506938444000 185

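Field descriptions:

Meaning of each field in train_click_log.csv

  1. user_id: unique identifier of the user
  2. click_article_id: unique identifier of the clicked article
  3. click_timestamp: timestamp of the click
  4. click_environment: environment in which the click occurred
  5. click_deviceGroup: device group used for the click
  6. click_os: operating system used for the click
  7. click_country: country the user was in at click time
  8. click_region: region the user was in at click time
  9. click_referrer_type: referrer type, i.e. the source that led the user to the article
# User click log info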
trn_click.info()

Int64Index: 1112623 entries, 0 to 1112622
Data columns (total 14 columns):
 #   Column               Non-Null Count    Dtype
---  ------               --------------    -----
 0   user_id              1112623 non-null  int64
 1   click_article_id     1112623 non-null  int64
 2   click_timestamp      1112623 non-null  int64
 3   click_environment    1112623 non-null  int64
 4   click_deviceGroup    1112623 non-null  int64
 5   click_os             1112623 non-null  int64
 6   click_country        1112623 non-null  int64
 7   click_region         1112623 non-null  int64
 8   click_referrer_type  1112623 non-null  int64
 9   rank                 1112623 non-null  int64
 10  click_cnts           1112623 non-null  int64
 11  category_id          1112623 non-null  int64
 12  created_at_ts        1112623 non-null  int64
 13  words_count          1112623 non-null  int64
dtypes: int64(14)
memory usage: 127.3 MB
trn_click.describe()
user_id click_article_id click_timestamp click_environment click_deviceGroup click_os click_country click_region click_referrer_type rank click_cnts category_id created_at_ts words_count
count 1112623 1112623 1112623 1112623 1112623 1112623 1112623 1112623 1112623 1112623 1112623 1112623 1112623 1112623
mean 122119.8036 195154.1147 1507587643096 3.947786447 1.815980795 13.01976141 1.310775528 18.13586723 1.910062977 7.118518132 13.23704166 305.6175551 1506598353027 201.1980653
std 55403.49418 92922.85512 336346607 0.327671536 1.03517048 6.967844043 1.61826368 7.105832133 1.220012442 10.16095491 16.31503188 115.5790891 8343065889 52.23881437
min 0 3 1507029532200 1 1 2 1 1 1 1 2 1 1166572800000 0
25% 79347 123909 1507296800236 4 1 2 1 13 1 2 4 250 1507220064000 170
50% 130967 203890 1507596263470 4 1 17 1 21 2 4 8 328 1507553479000 197
75% 170401 277712 1507840567385 4 3 17 1 25 2 8 16 410 1507756287000 228
max 199999 364046 1510603454886 4 5 20 11 28 7 241 241 460 1510666014000 6690
# The training set contains 200,000 users
trn_click.user_id.nunique()
200000
trn_click.groupby('user_id')['click_article_id'].count().min()
# Every user in the training set clicked at least two articles
2

Plot bar charts of value counts to get a rough view of how the basic attributes are distributed:

# Bar charts of the 10 most frequent values of each basic attribute
plt.figure(figsize=(15, 20))
i = 1
for col in ['click_article_id', 'click_timestamp', 'click_environment', 'click_deviceGroup', 'click_os', 'click_country', 
            'click_region', 'click_referrer_type', 'rank', 'click_cnts']:
    plt.subplot(5, 2, i)
    i += 1
    v = trn_click[col].value_counts().head(10)
    # Use the index and values directly so this works across pandas versions
    ax = sns.barplot(x=v.index, y=v.values)
    for item in ax.get_xticklabels():
        item.set_rotation(90)
    plt.title(col)
plt.tight_layout()
plt.show()

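Conclusion: judging by click_timestamp, clicks are spread fairly evenly over time, so no special handling is needed. The timestamps are 13 digits (milliseconds); later they will be converted to 10 digits (seconds) to simplify calculations.

For click_environment, only 1,922 clicks (about 0.2%) come from environment 1 and only 24,617 (about 2.2%) from environment 2; the remaining 97.6% come from environment 4.

For click_deviceGroup, device group 1 accounts for the majority (60.4%) and device group 3 for 36%.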
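The 13-digit to 10-digit timestamp conversion mentioned above amounts to going from milliseconds to seconds. A minimal sketch, using the three `click_timestamp` values from the first rows of the training log; the variable names are assumptions:

```python
import pandas as pd

# Millisecond timestamps taken from the first rows of the training log shown earlier.
ts_ms = pd.Series([1507029570190, 1507029571478, 1507029601478], name="click_timestamp")

# Integer division by 1000 turns 13-digit milliseconds into 10-digit seconds.
ts_s = ts_ms // 1000

# For readability the same values can be parsed as datetimes (interpreted as UTC).
dt = pd.to_datetime(ts_ms, unit="ms")

print(ts_s.tolist())
print(dt.dt.strftime("%Y-%m-%d %H:%M:%S").tolist())
```

Working in seconds keeps the numbers smaller, while the datetime form is convenient for extracting the hour of day or the day of week as features.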

2.3.2 User click logs: test set

tst_click = tst_click.merge(item_df, how='left', on=['click_article_id'])
tst_click.head()
user_id click_article_id click_timestamp click_environment click_deviceGroup click_os click_country click_region click_referrer_type rank click_cnts category_id created_at_ts words_count
249999 160974 1506959142820 4 1 17 1 13 2 19 19 281 1506912747000 259
249999 160417 1506959172820 4 1 17 1 13 2 18 19 281 1506942089000 173
249998 160974 1506959056066 4 1 12 1 13 2 5 5 281 1506912747000 259
249998 202557 1506959086066 4 1 12 1 13 2 4 5 327 1506938401000 219
249997 183665 1506959088613 4 1 17 1 15 5 7 7 301 1500895686000 256
tst_click.describe()
user_id click_article_id click_timestamp click_environment click_deviceGroup click_os click_country click_region click_referrer_type rank click_cnts category_id created_at_ts words_count
count 518010 518010 518010 518010 518010 518010 518010 518010 518010 518010 518010 518010 518010 518010
mean 227342.4282 193803.7926 1507387299502.04 3.947300245 1.738284975 13.62846663 1.348209494 18.25025 1.819613521 15.52178529 30.04358603 305.3249609 1506882565043.1 210.9663308
std 14613.90719 88279.38818 370612663.671477 0.323916087 1.020858384 6.625563915 1.703524195 7.060797599 1.08265705 33.95770247 56.86802106 110.4115128 5816668089.45911 83.04006463
min 200000 137 1506959050386 1 1 2 1 1 1 1 1 1 1265812331000 0
25% 214926 128551 1507026049425.25 4 1 12 1 13 1 4 10 252 1506970055000 176
50% 229109 199197 1507308386675.5 4 1 17 1 21 2 8 19 323 1507249424000 199
75% 240182 272143 1507666408390 4 3 17 1 25 2 18 35 399 1507630043000 232
max 249999 364043 1508831818749 4 5 20 11 28 7 938 938 460 1509949218000 3082

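Conclusion: the training and test sets contain completely different users.

User IDs run from 0 to 199999 in the training set and from 200000 to 249999 in test set A.

# The test set contains 50,000 users
tst_click.user_id.nunique()
50000
tst_click.groupby('user_id')['click_article_id'].count().min()
# Note: some test-set users clicked only one article
1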
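The claim that the two user populations are disjoint can be checked explicitly rather than read off the ID ranges. A minimal sketch on made-up frames; `trn_toy` and `tst_toy` are hypothetical stand-ins for `trn_click` and `tst_click`:

```python
import pandas as pd

# Hypothetical stand-ins for the training and test click logs.
trn_toy = pd.DataFrame({"user_id": [0, 0, 1, 2]})
tst_toy = pd.DataFrame({"user_id": [3, 3, 4]})

# Users that appear in both logs; an empty set means the populations are disjoint.
overlap = set(trn_toy["user_id"]) & set(tst_toy["user_id"])

print(f"train ids: {trn_toy['user_id'].min()} to {trn_toy['user_id'].max()}")
print(f"test ids:  {tst_toy['user_id'].min()} to {tst_toy['user_id'].max()}")
print(f"users in both sets: {len(overlap)}")
```

The set intersection is the stricter check: two ID ranges could overlap numerically without sharing any user, and vice versa.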

2.3.3 Article metadata table

# Browse the article data: first and last five rows
pd.concat([item_df.head(), item_df.tail()])
click_article_id category_id created_at_ts words_count
0 0 0 1513144419000 168
1 1 1 1405341936000 189
2 2 1 1408667706000 250
3 3 1 1408468313000 230
4 4 1 1407071171000 162
364042 364042 460 1434034118000 144
364043 364043 460 1434148472000 463
364044 364044 460 1457974279000 177
364045 364045 460 1515964737000 126
364046 364046 460 1505811330000 479
item_df['words_count'].value_counts()
176     3485
182     3480
179     3463
178     3458
174     3456
        ... 
845        1
710        1
965        1
847        1
1535       1
Name: words_count, Length: 866, dtype: int64
print(item_df['category_id'].nunique())     # 461 article categories
item_df['category_id'].hist()
item_df.shape
# 364,047 articles
(364047, 4)

2.3.4 Article embedding vectors

item_emb_df.head()
article_id        emb_0        emb_1        emb_2  ...
         0  -0.16118301  -0.95723313  -0.13794445  ...
         1  -0.52321565    -0.974058   0.73860806  ...
         2  -0.61961854   -0.9729604  -0.20736018  ...
         3   -0.7408434  -0.97574896   0.39169782  ...
         4   -0.2790515  -0.97231525   0.68537366  ...

5 rows × 251 columns

item_emb_df.shape
(364047, 251)

2.4 Data analysis

2.4.1 Repeated clicks by users

#####merge
user_click_merge = pd.concat([trn_click, tst_click])

# Count how many times each user clicked each article
user_click_count = user_click_merge.groupby(['user_id', 'click_article_id'])['click_timestamp'].count().reset_index(name='count')
user_click_count[:10]
user_id click_article_id count
0 0 30760 1
1 0 157507 1
2 1 63746 1
3 1 289197 1
4 2 36162 1
5 2 168401 1
6 3 36162 1
7 3 50644 1
8 4 39894 1
9 4 42567 1
user_click_count[user_click_count['count']>7]
user_id click_article_id count
311242 86295 74254 10
311243 86295 76268 10
393761 103237 205948 10
393763 103237 235689 10
576902 134850 69463 13
user_click_count['count'].unique()
array([ 1,  2,  4,  3,  6,  5, 10,  7, 13])
# Distribution of click counts per (user, article) pair
user_click_count.loc[:,'count'].value_counts() 
1     1605541
2       11621
3         422
4          77
5          26
6          12
10          4
7           3
13          1
Name: count, dtype: int64

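Conclusion: 1,605,541 user-article pairs (about 99.2%) were clicked only once; only a small minority of pairs involve a user clicking the same article repeatedly. Repeated clicking can also be turned into a feature of its own.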
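One way the repeated-click signal could be turned into a per-user feature is sketched below. This is an illustration under assumptions: the frame `clicks` and the output column names `repeat_articles` and `has_repeat` are made up and do not come from the original notebook.

```python
import pandas as pd

# Hypothetical click log: user 1 clicks article 10 twice.
clicks = pd.DataFrame({
    "user_id":          [1, 1, 1, 2, 2, 3],
    "click_article_id": [10, 10, 11, 10, 12, 13],
})

# Number of clicks per (user, article) pair.
pair_counts = (
    clicks.groupby(["user_id", "click_article_id"])
    .size()
    .reset_index(name="count")
)

# Per-user features: how many articles were re-clicked, and whether any was.
user_repeat = (
    pair_counts.assign(is_repeat=pair_counts["count"] > 1)
    .groupby("user_id")["is_repeat"]
    .agg(repeat_articles="sum", has_repeat="any")
    .reset_index()
)

print(user_repeat)
```

Because repeats are rare, a boolean flag is likely to carry most of the information; the count is included only to show the alternative.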

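2.4.2 Changes in users' click environment

def plot_envs(df, cols, r, c):
    # Draw one bar chart of value counts per column on an r x c grid
    plt.figure(figsize=(10, 5))
    i = 1
    for col in cols:
        plt.subplot(r, c, i)
        i += 1
        v = df[col].value_counts()
        # Use the index and values directly so this works across pandas versions
        ax = sns.barplot(x=v.index, y=v.values)
        for item in ax.get_xticklabels():
            item.set_rotation(90)
        plt.title(col)
    plt.tight_layout()
    plt.show()

# Check whether a user's click environment changes much: randomly sample 10 users and plot the distribution of their click environments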
sample_user_ids = np.random.choice(tst_click['user_id'].unique(), size=10, replace=False)
sample_users = user_click_merge[user_click_merge['user_id'].isin(sample_user_ids)]
cols = ['click_environment','click_deviceGroup', 'click_os', 'click_country', 'click_region','click_referrer_type']
for _, user_df in sample_users.groupby('user_id'):
    plot_envs(user_df, cols, 2, 3)

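Conclusion: for the vast majority of users the click environment is fairly stable.

Idea: summary statistics of these environment fields can be used to represent a user's own attributes.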
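As an illustration of such summary statistics, the sketch below derives each user's most frequent value and number of distinct values for two environment fields. The toy data and the output column names are assumptions; the mode and distinct count are just two plausible choices of statistic.

```python
import pandas as pd

# Hypothetical click log with two of the environment fields.
toy = pd.DataFrame({
    "user_id":           [1, 1, 1, 2, 2],
    "click_os":          [17, 17, 2, 12, 12],
    "click_deviceGroup": [1, 1, 3, 1, 1],
})

def most_common(s):
    # Most frequent value; on ties Series.mode returns sorted values, so take the first.
    return s.mode().iloc[0]

user_env = toy.groupby("user_id").agg(
    os_mode=("click_os", most_common),
    os_nunique=("click_os", "nunique"),
    device_mode=("click_deviceGroup", most_common),
    device_nunique=("click_deviceGroup", "nunique"),
).reset_index()

print(user_env)
```

A distinct count of 1 marks a user whose environment never changes, which the plots above suggest is the common case.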

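2.4.3 Distribution of the number of articles clicked per user

user_click_item_count = sorted(user_click_merge.groupby('user_id')['click_article_id'].count(), reverse=True)
plt.plot(user_click_item_count)

Conclusion: a user's click count indicates how active the user is.

# The 50 users with the most clicks
plt.plot(user_click_item_count[:50])

Conclusion: the top 50 users by click count each have more than 100 clicks.

Idea: users with at least 100 clicks can be defined as active users. This is a simple rule; a fuller measure of activity also takes click times into account, and later we will judge user activity from both click counts and click times.

# Users ranked 25000 to 50000 by click count
plt.plot(user_click_item_count[25000:50000])

Conclusion: a large number of users have two clicks or fewer; these can be treated as inactive users.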
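The threshold rule described above can be sketched as a simple labelling step. The thresholds here are scaled down to fit the toy data (the text suggests 100 clicks or more for active users and 2 or fewer for inactive ones), and the names `ACTIVE_MIN`, `INACTIVE_MAX`, and `activity` are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical click log: users 1 to 4 have 5, 2, 1 and 3 clicks respectively.
clicks = pd.DataFrame({"user_id": [1] * 5 + [2] * 2 + [3] * 1 + [4] * 3})

# Toy thresholds; on the real data the text suggests 100 and 2.
ACTIVE_MIN = 4
INACTIVE_MAX = 2

user_cnt = clicks.groupby("user_id").size().rename("click_cnt").reset_index()
user_cnt["activity"] = np.select(
    [user_cnt["click_cnt"] >= ACTIVE_MIN, user_cnt["click_cnt"] <= INACTIVE_MAX],
    ["active", "inactive"],
    default="normal",
)

print(user_cnt)
```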

2.4.4 Article click counts

item_click_count = sorted(user_click_merge.groupby('click_article_id')['user_id'].count(), reverse=True)
plt.plot(item_click_count)
plt.plot(item_click_count[:100])

Conclusion: each of the 100 most-clicked articles has more than 1,000 clicks.

plt.plot(item_click_count[:20])

Conclusion: each of the 20 most-clicked articles has more than 2,500 clicks.

Idea: these can be defined as hot articles. This too is a simple rule; later we will also grade article popularity by click count and time.

plt.plot(item_click_count[3500:])

Conclusion: many articles were clicked only once or twice.

Idea: these can be defined as cold articles.
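The long-tail shape seen in these plots can also be summarized numerically, for example as the share of all clicks captured by the most popular articles plus the set of articles clicked at most once. A minimal sketch on made-up data; the names `share` and `cold` are assumptions.

```python
import pandas as pd

# Hypothetical stream of clicked article ids.
clicks = pd.Series([10, 10, 10, 10, 11, 11, 12, 13], name="click_article_id")

# Clicks per article, sorted from most to least clicked.
counts = clicks.value_counts()

# Cumulative share of all clicks covered by the top-k articles.
share = counts.cumsum() / counts.sum()
top1_share = share.iloc[0]

# Articles clicked at most once are candidates for the "cold" label.
cold = counts[counts <= 1].index

print(counts)
print(f"share of clicks on the top article: {top1_share:.0%}")
print(f"cold articles: {sorted(cold)}")
```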

2.4.5 Article co-occurrence frequency (how often two articles are clicked consecutively)

tmp = user_click_merge.sort_values('click_timestamp')
tmp['next_item'] = tmp.groupby(['user_id'])['click_article_id'].transform(lambda x:x.shift(-1))
union_item = tmp.groupby(['click_article_id','next_item'])['click_timestamp'].count().reset_index(name='count').sort_values('count', ascending=False)
union_item[['count']].describe()
count
count 433597
mean 3.184138728
std 18.85175315
min 1
25% 1
50% 1
75% 2
max 2202

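Conclusion: the statistics show a mean co-occurrence count of 3.18 and a maximum of 2,202.

This suggests that the articles a user reads in succession are fairly strongly related.

# Plot it for a more intuitive view
x = union_item['click_article_id']
y = union_item['count']
plt.scatter(x, y)
plt.plot(union_item['count'].values[40000:])

Conclusion: reading off the plot, roughly 70,000 pairs co-occur at least once.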
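To see exactly what the co-occurrence count measures, the sketch below runs the same idea (shift each user's click sequence by one and count the resulting pairs) on a made-up log. The data are assumptions, and `groupby(...).shift(-1)` is used here as a shorter equivalent of the `transform` call.

```python
import pandas as pd

# Hypothetical click log: both users read article 10 and then article 11.
clicks = pd.DataFrame({
    "user_id":          [1, 1, 1, 2, 2, 2],
    "click_article_id": [10, 11, 12, 10, 11, 13],
    "click_timestamp":  [1, 2, 3, 1, 2, 3],
})

clicks = clicks.sort_values(["user_id", "click_timestamp"])
# The article the same user clicked next (NaN for each user's last click).
clicks["next_item"] = clicks.groupby("user_id")["click_article_id"].shift(-1)

pairs = (
    clicks.dropna(subset=["next_item"])
    .astype({"next_item": "int64"})
    .groupby(["click_article_id", "next_item"])
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)

print(pairs)
```

The pair (10, 11) is counted twice because two different users clicked those articles back to back; pairs are ordered, so (11, 10) would be a separate entry.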

2.4.6 Article metadata

# Number of clicks per article category
plt.plot(user_click_merge['category_id'].value_counts().values)
# Rarely seen categories: some categories appear only a handful of times
plt.plot(user_click_merge['category_id'].value_counts().values[150:])
# Descriptive statistics of article word counts
user_click_merge['words_count'].describe()
count    1.630633e+06
mean     2.043012e+02
std      6.382198e+01
min      0.000000e+00
25%      1.720000e+02
50%      1.970000e+02
75%      2.290000e+02
max      6.690000e+03
Name: words_count, dtype: float64
plt.plot(user_click_merge['words_count'].values)

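2.4.7 Users' preferences across article categories

This feature can measure how broad a user's interests are.

plt.plot(sorted(user_click_merge.groupby('user_id')['category_id'].nunique(), reverse=True))

Conclusion: the plot shows that a small share of users read an extremely wide range of categories, while most users stay below 20 categories.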

user_click_merge.groupby('user_id')['category_id'].nunique().reset_index().describe()
user_id category_id
count 250000.000000 250000.000000
mean 124999.500000 4.573188
std 72168.927986 4.419800
min 0.000000 1.000000
25% 62499.750000 2.000000
50% 124999.500000 3.000000
75% 187499.250000 6.000000
max 249999.000000 95.000000

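2.4.8 Distribution of the length of articles users read

The average word count of the articles each user clicks can show whether the user prefers long or short articles.

plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), reverse=True))

Conclusion: the plot shows that a small group of users read articles with a very high average word count, and another small group read articles with a very low average word count.

Most users prefer articles of 200 to 400 words.

# Zoom in on the range that covers most users
plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), reverse=True)[1000:45000])

Conclusion: most users read articles of fewer than 250 words.

# More detailed statistics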
user_click_merge.groupby('user_id')['words_count'].mean().reset_index().describe()
user_id words_count
count 250000 250000
mean 124999.5 205.8301889
std 72168.92799 47.17402984
min 0 8
25% 62499.75 187.5
50% 124999.5 202
75% 187499.25 217.75
max 249999 3434.5

2.4.9 Timing of users' clicks

# Normalize the timestamps for easier visualization
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()
user_click_merge['click_timestamp'] = mm.fit_transform(user_click_merge[['click_timestamp']])
user_click_merge['created_at_ts'] = mm.fit_transform(user_click_merge[['created_at_ts']])

user_click_merge = user_click_merge.sort_values('click_timestamp')

user_click_merge.head()
user_id click_article_id click_timestamp click_environment click_deviceGroup click_os click_country click_region click_referrer_type rank click_cnts category_id created_at_ts words_count
18 249990 162300 0 4 3 20 1 25 2 5 5 281 0.989186404 193
2 249998 160974 1.55855E-06 4 1 12 1 13 2 5 5 281 0.989092296 259
30 249985 160974 3.45022E-06 4 1 17 1 8 2 8 8 281 0.989092296 259
50 249979 162300 3.71721E-06 4 1 17 1 25 2 2 2 281 0.989186404 193
25 249988 160974 3.84096E-06 4 1 17 1 21 2 17 17 281 0.989092296 259
def mean_diff_time_func(df, col):
    # Mean absolute gap between consecutive values of `col`.
    # The first row has no predecessor and is compared against 0 (fillna(0)).
    df = pd.DataFrame(df, columns=[col])
    df['time_shift1'] = df[col].shift(1).fillna(0)
    df['diff_time'] = abs(df[col] - df['time_shift1'])
    return df['diff_time'].mean()

# Mean gap between a user's consecutive click times
mean_diff_click_time = user_click_merge.groupby('user_id')[['click_timestamp', 'created_at_ts']].apply(lambda x: mean_diff_time_func(x, 'click_timestamp'))

plt.plot(sorted(mean_diff_click_time.values, reverse=True))

Conclusion: the plot shows that the gap between clicks differs from user to user.

# Mean gap between the creation times of consecutively clicked articles
mean_diff_created_time = user_click_merge.groupby('user_id')[['click_timestamp', 'created_at_ts']].apply(lambda x: mean_diff_time_func(x, 'created_at_ts'))

plt.plot(sorted(mean_diff_created_time.values, reverse=True))

Conclusion: the plot shows that the creation times of consecutively clicked articles also differ.
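The per-user mean gap can also be computed without a Python-level `apply`, using a grouped `diff`. The sketch below is a variant rather than a drop-in replacement: unlike `mean_diff_time_func`, it leaves each user's first click out of the average instead of comparing it against 0. The toy data and names are assumptions.

```python
import pandas as pd

# Hypothetical click log with raw (unnormalized) timestamps.
clicks = pd.DataFrame({
    "user_id":         [1, 1, 1, 2, 2, 3],
    "click_timestamp": [0, 10, 40, 5, 7, 9],
})

clicks = clicks.sort_values(["user_id", "click_timestamp"])

# Gap to the same user's previous click; NaN for each user's first click.
gaps = clicks.groupby("user_id")["click_timestamp"].diff()

# Mean gap per user; NaN values are skipped, so single-click users end up as NaN.
mean_gap = gaps.groupby(clicks["user_id"]).mean()

print(mean_gap)
```

Users with a single click have no gap at all, so they come out as NaN and need an explicit fill value if this is used as a feature.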

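# Distribution of similarity between consecutively clicked articles
item_idx_2_rawid_dict = dict(zip(item_emb_df['article_id'], item_emb_df.index))  # maps article_id to its row index in the embedding matrix

del item_emb_df['article_id']

item_emb_np = np.ascontiguousarray(item_emb_df.values, dtype=np.float32)

# Randomly pick 15 users and look at the similarity between their consecutively clicked articles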
sub_user_ids = np.random.choice(user_click_merge.user_id.unique(), size=15, replace=False)
sub_user_info = user_click_merge[user_click_merge['user_id'].isin(sub_user_ids)]

sub_user_info.head()
user_id click_article_id click_timestamp click_environment click_deviceGroup click_os click_country click_region click_referrer_type rank click_cnts category_id created_at_ts words_count
15044 243899 160974 0.001161 4 1 17 1 21 2 30 30 281 0.989092 259
17205 243023 202383 0.001644 4 1 17 1 13 1 5 5 327 0.989202 206
17206 243023 140729 0.001652 4 1 17 1 13 1 4 5 265 0.973622 168
15045 243899 118864 0.002880 4 1 17 1 21 2 29 30 247 0.989218 174
15046 243899 166580 0.002888 4 1 17 1 21 2 28 30 289 0.989232 207
def get_item_sim_list(df):
    # Cosine similarity between each clicked article and the next one the user clicked
    sim_list = []
    item_list = df['click_article_id'].values
    for i in range(0, len(item_list)-1):
        emb1 = item_emb_np[item_idx_2_rawid_dict[item_list[i]]]
        emb2 = item_emb_np[item_idx_2_rawid_dict[item_list[i+1]]]
        sim_list.append(np.dot(emb1,emb2)/(np.linalg.norm(emb1)*(np.linalg.norm(emb2))))
    sim_list.append(0)  # pad so the list has one entry per click
    return sim_list

for _, user_df in sub_user_info.groupby('user_id'):
    item_sim_list = get_item_sim_list(user_df)
    plt.plot(item_sim_list)

Conclusion: the plot shows that for some users the similarity between consecutively clicked articles fluctuates widely while for others it fluctuates little, so this signal also has some discriminative power.
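The consecutive-article similarity can be computed for a whole click sequence at once by normalizing the embeddings first. The sketch below uses a made-up 2-dimensional embedding matrix; `emb` and `seq` are hypothetical stand-ins for `item_emb_np` and one user's sequence of row indices.

```python
import numpy as np

# Hypothetical embedding matrix: 4 articles, 2 dimensions.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]], dtype=np.float32)

# Row indices of the articles one user clicked, in click order.
seq = np.array([0, 3, 1, 2])

# Scale every row to unit length so a dot product equals cosine similarity.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Pair each click with the next one and take row-wise dot products.
sims = np.sum(unit[seq[:-1]] * unit[seq[1:]], axis=1)

print(sims)
```

Articles 0 and 3 point in the same direction and therefore have similarity 1 despite their different lengths, while articles 3 and 1 are orthogonal and have similarity 0.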

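3. Summary

The analysis above yields the following key findings, which will be very helpful for the feature engineering and analysis that follow:

  1. User IDs in the training and test sets do not overlap, so the model has never seen the test-set users
  2. The minimum number of clicks per user is 2 in the training set and 1 in the test set
  3. Some users click the same article repeatedly, but this occurs only in the training set
  4. A single user's click environment is not always unique; statistical features can be used when building features for this part
  5. Click counts vary widely across users and can be used to build a user-activity feature
  6. Click counts also vary widely across articles and can be used to build an article-popularity feature
  7. The articles a user reads are fairly strongly related, so whether a user is interested in an article often depends heavily on the articles the user clicked before
  8. The word counts of the articles users click vary considerably, which can reflect users' preferences for article length
  9. The categories of the articles users click also vary considerably, which can reflect users' topic preferences
  10. The time gap between clicks differs across users, which can reflect users' preferences for article timeliness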
