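Competition link: https://tianchi.aliyun.com/competition/entrance/231573/introduction
After downloading the data, look at the fields it contains and what they mean, and think about which extra fields to construct and which to drop in order to predict better. Start by checking what information the existing fields provide.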
data_balance.head()
Aggregate the individual records by day:
total_balance = data_balance.groupby(['date'])[['total_purchase_amt', 'total_redeem_amt']].sum().reset_index()
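To make the aggregation step concrete, here is a minimal sketch on a tiny made-up table. The `user_id` column and all the values are assumptions for illustration only; the real table has many more columns.

```python
import pandas as pd

# Hypothetical per-user records (values invented for illustration).
records = pd.DataFrame({
    "user_id": [1, 2, 1, 2],
    "date": pd.to_datetime(["2014-04-01", "2014-04-01", "2014-04-02", "2014-04-02"]),
    "total_purchase_amt": [100, 50, 0, 30],
    "total_redeem_amt": [10, 0, 20, 5],
})

# Select several columns with a list (double brackets), then sum per day.
daily = (
    records.groupby("date")[["total_purchase_amt", "total_redeem_amt"]]
    .sum()
    .reset_index()
)
print(daily)
```

Passing a list of column names after `groupby` is the form current pandas versions expect; the bare tuple form was deprecated and later removed.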
Plot the time series of daily total purchase and redemption amounts:
plt.plot(total_balance['date'], total_balance['total_purchase_amt'], label='purchase')
plt.plot(total_balance['date'], total_balance['total_redeem_amt'], label='redeem')
plt.title("The lineplot of total amount of Purchase and Redeem from July.13 to Sep.14")
fig = plt.figure(figsize=(20,6))
plt.plot(total_balance_1['date'], total_balance_1['total_purchase_amt'],label='total_purchase_amt')
plt.plot(total_balance_1['date'], total_balance_1['total_redeem_amt'],label='total_redeem_amt')
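plt.title("The lineplot of total amount of Purchase and Redeem from April.14 to Sep.14")
Overall the series is quite volatile, and there are clearly about four fluctuations within each month, so the next step is to analyze the data by week.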
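The plots use a `total_balance_1` frame with a `weekday` column, but its construction is not shown here. A minimal sketch of one way to build it, assuming the cut-off is 2014-04-01 (taken from the plot title) and using synthetic amounts:

```python
import numpy as np
import pandas as pd

# Synthetic daily totals standing in for the real aggregate.
dates = pd.date_range("2014-03-25", "2014-04-14", freq="D")
total_balance = pd.DataFrame({
    "date": dates,
    "total_purchase_amt": np.arange(len(dates)) * 1000,
    "total_redeem_amt": np.arange(len(dates)) * 800,
})

# Assumed cut-off: keep only data from April 2014 onwards.
total_balance_1 = total_balance[total_balance["date"] >= pd.Timestamp("2014-04-01")].copy()

# dt.weekday encodes Monday as 0 and Sunday as 6.
total_balance_1["weekday"] = total_balance_1["date"].dt.weekday
print(total_balance_1.head())
```

The `.copy()` avoids pandas warnings about assigning a new column to a slice of another frame.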
Plot the data within a week:
# Plot the distribution for each weekday alongside the overall distribution
a = plt.figure(figsize=(10, 10))
plt.subplot(2,2,1)
plt.title('The distribution of total purchase')
# violinplot does not accept scatter_kws / line_kws (those are sns.regplot arguments), so they are dropped
sns.violinplot(x='weekday', y='total_purchase_amt', data=total_balance_1)
plt.subplot(2,2,2)
plt.title('The distribution of total purchase')
sns.distplot(total_balance_1['total_purchase_amt'].dropna())
plt.ylim(0,1e-8)
plt.subplot(2,2,3)
plt.title('The distribution of total redeem')
# violinplot does not accept scatter_kws / line_kws (those are sns.regplot arguments), so they are dropped
sns.violinplot(x='weekday', y='total_redeem_amt', data=total_balance_1)
plt.subplot(2,2,4)
plt.title('The distribution of total redeem')
sns.distplot(total_balance_1['total_redeem_amt'].dropna())
plt.ylim(0,1e-8)
plt.show()
For some reason the provided code did not render the histograms, so I had to set the y-axis limits manually.
Median analysis:
ax = plt.subplot(1,2,1)
plt.title('The barplot of average total purchase with each weekday')
ax = sns.barplot(x="weekday", y="total_purchase_amt", data=week_sta, label='Purchase')
ax.legend()
ax = plt.subplot(1,2,2)
plt.title('The barplot of average total redeem with each weekday')
ax = sns.barplot(x="weekday", y="total_redeem_amt", data=week_sta, label='Redeem')
ax.legend()
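The `week_sta` frame used in these bar plots is not defined in the snippets above. The heading mentions the median while the plot titles mention the average, so the sketch below computes both per weekday on made-up numbers; which of the two the original used is an assumption left open here.

```python
import pandas as pd

# Made-up daily totals for two weekdays (0 = Monday, 1 = Tuesday).
total_balance_1 = pd.DataFrame({
    "weekday": [0, 0, 0, 1, 1, 1],
    "total_purchase_amt": [100.0, 200.0, 900.0, 50.0, 60.0, 70.0],
    "total_redeem_amt": [80.0, 90.0, 100.0, 40.0, 40.0, 100.0],
})
cols = ["total_purchase_amt", "total_redeem_amt"]

# One row per weekday; either frame can be passed to sns.barplot as week_sta.
week_mean = total_balance_1.groupby("weekday", as_index=False)[cols].mean()
week_median = total_balance_1.groupby("weekday", as_index=False)[cols].median()
print(week_mean)
print(week_median)
```

On skewed data such as the Monday purchases here, the mean and the median differ noticeably, so it is worth checking which one the plot title claims.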
Box plots:
plt.figure(figsize=(12, 5))
ax = plt.subplot(1,2,1)
plt.title('The boxplot of total purchase with each weekday')
ax = sns.boxplot(x="weekday", y="total_purchase_amt", data=total_balance_1)
ax = plt.subplot(1,2,2)
plt.title('The boxplot of total redeem with each weekday')
ax = sns.boxplot(x="weekday", y="total_redeem_amt", data=total_balance_1)
plt.subplot(1,2,1)
plt.title('The Spearman correlation between total purchase and each weekday')
sns.heatmap(feature[[x for x in feature.columns if x not in ['total_redeem_amt','date']]].corr('spearman'),linewidths=0.1, vmax=0.2, vmin=-0.2)
plt.subplot(1,2,2)
plt.title('The Spearman correlation between total redeem and each weekday')
sns.heatmap(feature[[x for x in feature.columns if x not in ['total_purchase_amt','date']]].corr('spearman'),linewidths=0.1, vmax=0.2, vmin=-0.2)
plt.show()
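The heatmaps rely on a `feature` frame that is not shown. A plausible construction, assumed here rather than taken from the original, is to one-hot encode the weekday and keep the two amount columns. The data below is synthetic, with Mondays deliberately boosted so the effect is visible.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2014-04-01", periods=70, freq="D")
total_balance_1 = pd.DataFrame({"date": dates})
total_balance_1["weekday"] = total_balance_1["date"].dt.weekday

# Synthetic amounts: purchases are much higher on Mondays (weekday == 0).
is_monday = total_balance_1["weekday"] == 0
total_balance_1["total_purchase_amt"] = np.where(is_monday, 1000.0, 100.0) + np.arange(70)
total_balance_1["total_redeem_amt"] = 500.0 - np.arange(70)

# One indicator column per weekday: weekday_0 ... weekday_6.
onehot = pd.get_dummies(total_balance_1["weekday"], prefix="weekday").astype(int)
feature = pd.concat(
    [total_balance_1[["date", "total_purchase_amt", "total_redeem_amt"]], onehot],
    axis=1,
)

# Spearman correlation of the purchase amount with each weekday indicator.
purchase_corr = feature.drop(columns=["date", "total_redeem_amt"]).corr(method="spearman")
print(purchase_corr["total_purchase_amt"])
```

Spearman correlation works on ranks, so it is not distorted by the very large magnitudes of the amounts.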
Next, call the mvtpy module to test whether the weekday feature is independent of the label:
(Module reference: https://github.com/ChuanyuXue/MVTest)
from mvtpy import mvtest
mv = mvtest()
mv.test(total_balance_1['total_purchase_amt'], total_balance_1['weekday'])
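The result is clear-cut: {'Tn': 6.75, 'p-value': [0, 0.01]}. A p-value in the interval [0, 0.01] rejects independence, so the weekday does carry information about the purchase amount.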
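For readers without mvtpy installed, the same question can be cross-checked with a simple permutation test. This is a different technique from the MV test, not a reimplementation of it: the function name, the choice of statistic (variance of the per-weekday means) and the synthetic data are all assumptions made for this sketch.

```python
import numpy as np

def permutation_independence_test(values, groups, n_perm=2000, seed=0):
    """Permutation test; statistic = variance of the per-group means."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    rng = np.random.default_rng(seed)

    def stat(g):
        means = np.array([values[g == k].mean() for k in labels])
        return means.var()

    observed = stat(groups)
    # Shuffle the group labels to simulate independence between the two variables.
    exceed = sum(stat(rng.permutation(groups)) >= observed for _ in range(n_perm))
    return observed, (exceed + 1) / (n_perm + 1)

# Synthetic example: amounts are clearly higher when weekday == 0.
weekday = np.arange(70) % 7
amount = np.where(weekday == 0, 10.0, 0.0) + np.arange(70) * 0.01
observed, p_value = permutation_independence_test(amount, weekday)
print(observed, p_value)
```

A small p-value here means the observed differences between weekdays are unlikely under random label assignment, which points in the same direction as a rejection by the MV test.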
With the weekly analysis done, the data can next be analyzed by month.
plt.title('The Probability Density of total purchase amount in Each Month')
# range(7, 12) covers July to November 2013; December 2013 is not plotted by this loop
for i in range(7, 12):
    sns.kdeplot(total_balance[(total_balance['date'] >= pd.to_datetime(f'2013.{str(i)}.1')) & (total_balance['date'] < pd.to_datetime(f'2013.{str(i+1)}.1'))]['total_purchase_amt'],label='13Y,'+str(i)+'M')
for i in range(1, 9):
    sns.kdeplot(total_balance[(total_balance['date'] >= pd.to_datetime(f'2014.{str(i)}.1')) & (total_balance['date'] < pd.to_datetime(f'2014.{str(i+1)}.1'))]['total_purchase_amt'],label='14Y,'+str(i)+'M')
The plot shows the purchase amount growing from 2013 to 2014, which is closely related to changes in the Alipay interest rate.
Monthly redemption amounts:
for i in range(7, 12):
    sns.kdeplot(total_balance[(total_balance['date'] >= pd.to_datetime(f'2013.{str(i)}.1')) & (total_balance['date'] < pd.to_datetime(f'2013.{str(i+1)}.1'))]['total_redeem_amt'],label='13Y,'+str(i)+'M')
for i in range(1, 9):
    sns.kdeplot(total_balance[(total_balance['date'] >= pd.to_datetime(f'2014.{str(i)}.1')) & (total_balance['date'] < pd.to_datetime(f'2014.{str(i+1)}.1'))]['total_redeem_amt'],label='14Y,'+str(i)+'M')
As with purchases, the higher Alipay returns led a large number of users to trade frequently.
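The month-by-month slicing can also be written with a monthly period column, which avoids building date strings by hand and picks up every month present in the data, December 2013 included. This is a sketch on synthetic amounts; the date range used is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic daily totals over an assumed date range.
dates = pd.date_range("2013-07-01", "2014-08-31", freq="D")
total_balance = pd.DataFrame({
    "date": dates,
    "total_purchase_amt": np.arange(len(dates), dtype=float),
})

# Label each row with its calendar month, e.g. 2013-12.
total_balance["month"] = total_balance["date"].dt.to_period("M")

# One series of daily amounts per month, keyed by a label like "2013-12".
monthly = {
    str(month): group["total_purchase_amt"]
    for month, group in total_balance.groupby("month")
}
print(list(monthly))

# Each series could then be drawn with, for example:
# for label, series in monthly.items():
#     sns.kdeplot(series, label=label)
```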
What shows up if the analysis is done by day of the month?
# Bar chart of purchases aggregated by day of the month
ax = sns.barplot(x='day', y='total_purchase_amt', data=day_sta, label='Purchase')
ax = sns.lineplot(x='day', y='total_purchase_amt', data=day_sta, label='Purchase')
# Bar chart of redemptions aggregated by day of the month
ax = sns.barplot(x='day', y='total_redeem_amt', data=day_sta, label='Redeem')
ax = sns.lineplot(x='day', y='total_redeem_amt', data=day_sta, label='Redeem')
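Within a week there are high-purchase days and low-purchase days, which is tied to the financial markets, while redemptions show clear spikes in the middle or at the end of the month.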
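The `day_sta` frame behind these plots is not shown. One way to build it, assumed here, is to average the daily totals by day of the month. The amounts below are synthetic and chosen so the result is easy to check by eye.

```python
import pandas as pd

dates = pd.date_range("2014-04-01", "2014-06-30", freq="D")
total_balance_1 = pd.DataFrame({"date": dates})

# Day of the month, 1 to 31.
total_balance_1["day"] = total_balance_1["date"].dt.day

# Synthetic amounts that depend only on the day of the month.
total_balance_1["total_purchase_amt"] = total_balance_1["day"] * 10.0
total_balance_1["total_redeem_amt"] = total_balance_1["day"] * 5.0

# Average over all months for each day of the month.
day_sta = total_balance_1.groupby("day", as_index=False)[
    ["total_purchase_amt", "total_redeem_amt"]
].mean()
print(day_sta.head())
```

Days 29 to 31 are averaged over fewer months than the others, so their bars are noisier on real data.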
Digging further: how do large-amount users and small-amount users differ in trading volume?
plt.plot(total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 1]['date'], total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 1]['total_purchase_amt'],label='big_purchase')
plt.plot(total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 1]['date'], total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 1]['total_redeem_amt'],label='big_redeem')
plt.plot(total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 0]['date'], total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 0]['total_purchase_amt'],label='small_purchase')
plt.plot(total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 0]['date'], total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 0]['total_redeem_amt'],label='small_redeem')
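After March 2014, large-amount users still traded more frequently than small-amount users. This helps a lot in the later exploration: small-amount users could be predicted with one model and large-amount users with another.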
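How `big_user` was assigned is not shown. The sketch below labels users whose total purchases exceed the 75th percentile of all users as large-amount users. The `user_id` column, the 0.75 cut-off and the sample values are all assumptions for illustration.

```python
import pandas as pd

# Made-up records for four users over two days.
data_balance = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "date": pd.to_datetime(["2014-04-01", "2014-04-02"] * 4),
    "total_purchase_amt": [10, 20, 1000, 3000, 5, 5, 40, 60],
    "total_redeem_amt": [0, 5, 500, 100, 0, 0, 10, 10],
})

# Total purchases per user over the whole period.
user_total = data_balance.groupby("user_id")["total_purchase_amt"].sum()

# Hypothetical threshold: users above the 75th percentile count as "big".
threshold = user_total.quantile(0.75)
big_ids = user_total[user_total > threshold].index
data_balance["big_user"] = data_balance["user_id"].isin(big_ids).astype(int)

# Daily totals split by user group, ready for the four line plots.
total_balance_bigNsmall = data_balance.groupby(
    ["date", "big_user"], as_index=False
)[["total_purchase_amt", "total_redeem_amt"]].sum()
print(total_balance_bigNsmall)
```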
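Since large-amount users trade frequently, it is worth plotting the time series of total purchases and redemptions for frequent versus infrequent users.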
plt.plot(total_balance_hotNcold[total_balance_hotNcold['is_hot_users'] == 1]['date'], total_balance_hotNcold[total_balance_hotNcold['is_hot_users'] == 1]['total_purchase_amt'], label='hot_purchase')
plt.plot(total_balance_hotNcold[total_balance_hotNcold['is_hot_users'] == 1]['date'], total_balance_hotNcold[total_balance_hotNcold['is_hot_users'] == 1]['total_redeem_amt'], label='hot_redeem')
plt.plot(total_balance_hotNcold[total_balance_hotNcold['is_hot_users'] == 0]['date'], total_balance_hotNcold[total_balance_hotNcold['is_hot_users'] == 0]['total_purchase_amt'], label='cold_purchase')
plt.plot(total_balance_hotNcold[total_balance_hotNcold['is_hot_users'] == 0]['date'], total_balance_hotNcold[total_balance_hotNcold['is_hot_users'] == 0]['total_redeem_amt'], label='cold_redeem')
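The four lines in this plot are tangled together, but it is still visible that frequent traders have higher purchase and redemption amounts than infrequent users.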
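The definition of a frequent ("hot") user is likewise not given. A simple assumed rule is to count the days on which a user bought or redeemed anything and compare that with a minimum. The `user_id` column, the `min_days` value and the data are hypothetical.

```python
import pandas as pd

d = pd.to_datetime(["2014-04-01", "2014-04-02", "2014-04-03"])
data_balance = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "date": list(d) * 2,
    "total_purchase_amt": [10, 20, 30, 0, 500, 0],
    "total_redeem_amt": [0, 0, 5, 0, 0, 0],
})

# A day counts as active if the user purchased or redeemed anything.
active = data_balance[
    (data_balance["total_purchase_amt"] > 0) | (data_balance["total_redeem_amt"] > 0)
]
n_active_days = active.groupby("user_id")["date"].nunique()

# Hypothetical rule: at least 3 active days makes a user "hot".
min_days = 3
hot_ids = n_active_days[n_active_days >= min_days].index
data_balance["is_hot_users"] = data_balance["user_id"].isin(hot_ids).astype(int)

total_balance_hotNcold = data_balance.groupby(
    ["date", "is_hot_users"], as_index=False
)[["total_purchase_amt", "total_redeem_amt"]].sum()
print(total_balance_hotNcold)
```

Note that user 2 has the single largest purchase but is still labelled cold, which illustrates that trading frequency and trading amount are separate properties.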
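Another question is whether trading volume is related to the user's city, zodiac sign, or gender.
Taking city as an example:
# Aggregate daily trading totals per city and plot the kernel density estimates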
fig = plt.figure(figsize=(10, 5))
# Aggregate once outside the loop; select the columns with a list
temp = data_balance.groupby(['date','city'], as_index=False)[['total_purchase_amt', 'total_redeem_amt']].sum()
for i in np.unique(data_balance['city']):
    ax = sns.kdeplot(temp[temp['city'] == i]['total_purchase_amt'], label=i)
plt.legend(loc='best')
plt.title('The KDE of daily total purchase amount by city')
plt.xlabel('Amount')
plt.ylabel('Density')
plt.show()
fig = plt.figure(figsize=(10, 5))
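temp = data_balance.groupby(['date','city'], as_index=False)[['total_purchase_amt', 'total_redeem_amt']].sum()
for i in np.unique(data_balance['city']):
    ax = sns.kdeplot(temp[temp['city'] == i]['total_redeem_amt'], label=i)
# From here on temp must be a per-day aggregate that contains share_amt and direct_purchase_amt; that step is not shown in these notes
plt.plot(temp['date'], temp['share_amt'] / temp['direct_purchase_amt'], label='Rate')
Plot the time series of the Alipay interest rate and the trading volume: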
fig,ax1 = plt.subplots(figsize=(15,5))
plt.plot(temp['date'], temp['share_amt'],'b',label="Share_amt")
plt.legend()
ax2=ax1.twinx()
plt.plot(share['date'], share['mfd_daily_yield'],'g',label="Share rate")
Plot the Alipay interest rate against the ratio of daily interest income to direct purchases:
fig,ax1 = plt.subplots(figsize=(15,5))
plt.plot(temp['date'], temp['share_amt'] / temp['direct_purchase_amt'],'b',label="Share_amt / Direct_amt")
plt.legend()
ax2=ax1.twinx()
plt.plot(share['date'], share['mfd_daily_yield'],'g',label="Share rate")
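The `temp` frame used in the two plots above needs per-day sums of `share_amt` and `direct_purchase_amt`. A minimal sketch of that aggregation follows, on invented values. Replacing zero denominators with NaN before dividing, and joining with the yield table on `date`, are my own additions rather than something the original code does.

```python
import numpy as np
import pandas as pd

# Invented records; the second day has no direct purchases at all.
data_balance = pd.DataFrame({
    "date": pd.to_datetime(["2014-04-01", "2014-04-01", "2014-04-02"]),
    "share_amt": [10.0, 20.0, 5.0],
    "direct_purchase_amt": [100.0, 200.0, 0.0],
})
# Invented yield values, for illustration only.
share = pd.DataFrame({
    "date": pd.to_datetime(["2014-04-01", "2014-04-02"]),
    "mfd_daily_yield": [1.5, 1.4],
})

temp = data_balance.groupby("date", as_index=False)[
    ["share_amt", "direct_purchase_amt"]
].sum()

# Avoid dividing by zero: days without direct purchases become NaN.
temp["share_ratio"] = temp["share_amt"] / temp["direct_purchase_amt"].replace(0, np.nan)

# Put the ratio and the yield side by side for plotting on twin axes.
merged = temp.merge(share, on="date", how="left")
print(merged)
```

Matplotlib skips NaN points when drawing a line, so days with no direct purchases show up as gaps instead of infinite spikes.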
fig = plt.figure(figsize=(20,6))
plt.plot(temp['date'], temp['purchase_bal_amt'],label='Bal')
plt.plot(temp['date'], temp['purchase_bank_amt'],label='Bank')
Plot the daily redemption amounts by redemption method:
temp = data_balance.groupby(['date'], as_index=False)[['tftobal_amt','tftocard_amt']].sum()
fig = plt.figure(figsize=(20,6))
plt.plot(temp['date'], temp['tftobal_amt'],label='Bal')
plt.plot(temp['date'], temp['tftocard_amt'],label='Bank')
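Taking the Alipay interest rate into account, users prefer to redeem directly to their Alipay balance, while the two purchase methods differ little. My guess is that this has to do with how Alipay payments work.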
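The preference between the two redemption channels can also be put into numbers instead of being read off the plot. This sketch uses made-up values and computes each channel's share of the total redeemed amount.

```python
import pandas as pd

# Made-up daily records for the two redemption channels.
data_balance = pd.DataFrame({
    "date": pd.to_datetime(["2014-04-01", "2014-04-02"]),
    "tftobal_amt": [60.0, 90.0],   # redeemed to the Alipay balance
    "tftocard_amt": [20.0, 30.0],  # redeemed to a bank card
})

temp = data_balance.groupby("date", as_index=False)[
    ["tftobal_amt", "tftocard_amt"]
].sum()

# Share of each channel in the total redeemed amount over the period.
totals = temp[["tftobal_amt", "tftocard_amt"]].sum()
channel_share = totals / totals.sum()
print(channel_share)
```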
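That is what I learned in Task 1. By typing the code out by hand, and with some prior familiarity with time-series plots, I came to understand what each line of code does, how it helps the later analysis, and what effect it has on this forecast.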