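Table of contents
- Competition overview
- Code
- 1 Feature engineering
- 2 Modeling
- 3 Prediction
- 4 Submitting results
- 0 Inspecting the data
- 0.1 Training data
- 0.1.1 Positive samples
- 0.1.2 Negative samples
- 0.1.3 Weather data
- 0.2 Test data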
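Competition page: https://www.kesci.com/home/competition/5be92233954d6e001063649a
Once again I was only there to make up the numbers: the final result was 39/205. It is a little embarrassing to admit, because the competition is scored by AUC, and without building any model at all, predicting "goes" for half of the users and "does not go" for the other half already scores about 0.5.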
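Competition overview
Task description
Participants are asked to use historical data for a subset of permanent residents of Guiyang from 2017 (the training set), together with data from June and July 2018 (the test set), to predict how likely Guiyang residents are to travel within the province to Qiandongnan Prefecture in August 2018.
The task is:
Training: using the provided training set, i.e. users' historical data from June and July 2017 together with whether they travelled to Qiandongnan Prefecture in August 2017, build a prediction model.
Output: using the provided test set, i.e. users' historical data from June and July 2018, predict with that model the probability that each user will travel to Qiandongnan Prefecture in August 2018. Submit the result for evaluation on Kesci to obtain an AUC score.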
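Data description
The training set (training_set) is about 2.3 GB and contains three folders, 201708n, 201708q and weather_data_2017, which hold the user history data and the weather history data for June and July 2017.
The 201708n and 201708q folders each contain 7 txt files. None of the users in 201708n visited the Qiandongnan target area in August 2017; all of the users in 201708q visited the Qiandongnan target scenic area in August 2017.
In the training set, besides the fields listed below, each record ends with a "label" field: "0" marks a negative sample (the user did not visit the Qiandongnan target area in August 2017) and "1" marks a positive sample (the user did visit it in August 2017).
User identity attributes table (201708n1.txt, 201708q1.txt)
User handset information table (201708n2.txt, 201708q2.txt)
User roaming behaviour table (201708n3.txt, 201708q3.txt)
User roamed-to provinces table (201708n4.txt, 201708q4.txt)
User geographic location table (201708n6.txt, 201708q6.txt)
User app usage table (201708n7.txt, 201708q7.txt)
The weather_data_2017 folder contains two txt files, "weather_reported_2017" with the actual weather for June and July 2017 and "weather_forecast_2017" with the forecast weather for June and July 2017, plus a weather phenomenon code table ("天气现象编码表.xlsx").
2017 observed weather table (weather_reported_2017.txt)
2017 forecast weather table (weather_forecast_2017.txt)
The test set (testing_set) is about 1 GB in total and contains two folders, 201808 and weather_data_2018.
The 201808 folder contains 7 txt files, named 2018_1.txt, 2018_2.txt, … , 2018_7.txt, whose fields correspond to those of the training set.
The weather_data_2018 folder contains two txt files: "weather_reported_2018" with the actual weather for June and July 2018 and "weather_forecast_2018" with the forecast weather for June and July 2018. Their fields correspond to those of the training set.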
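Remarks:
The 7 tables in each folder can be joined to one another through the virtual ID, but not every virtual ID can be matched; participants decide for themselves how to handle and use them.
The virtual ID may be formatted differently in different tables. Participants must handle this themselves and make sure the virtual ID in the submission is a string.
Because there are many tables with different information dimensions and many possible ways to use them, the data may contain anomalies and missing values; participants must handle any such cases themselves.
Participants are welcome to try different approaches, including frontier methods such as transfer learning.
The competition data has been anonymized, so it differs somewhat from the real information, but this does not affect solving the problem.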
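The remark about virtual IDs being formatted differently across tables can be dealt with before any merge. The sketch below is only an illustration: `normalize_id` is a hypothetical helper, and the formats it handles (integers, float-like strings such as `'1752398069509000.0'`, surrounding whitespace) are assumptions about what might show up, not a description of the actual files.

```python
from decimal import Decimal, InvalidOperation


def normalize_id(raw):
    """Return a virtual ID as a plain digit string (hypothetical helper).

    Assumed input formats: ints, digit strings with optional whitespace, and
    float-like strings such as '1752398069509000.0' or '1.752398069509e+15'.
    """
    text = str(raw).strip()
    try:
        value = Decimal(text)
    except InvalidOperation:
        # Not numeric at all: keep the cleaned text so nothing is silently lost.
        return text
    if not value.is_finite() or value != value.to_integral_value():
        # NaN, infinity or a fractional part: not an ID we know how to repair.
        return text
    return str(int(value))


print(normalize_id(" 1752398069509000.0 "))
```

Going through `Decimal` rather than `float` avoids rounding IDs that are too long to be represented exactly as a 64-bit float.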
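Evaluation
1. Preliminary-round scoring rule
The competition uses AUC to judge model quality. AUC is the area under the ROC (Receiver Operating Characteristic) curve, which plots the False Positive Rate on the horizontal axis against the True Positive Rate on the vertical axis.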
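The area under the ROC curve is equal to the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The dependency-free sketch below uses that pairwise view on invented labels and scores; `auc_score` is an illustrative helper, not the organizers' scoring code.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative counts as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 0, 1, 0, 0, 1]
perfect = [0.9, 0.1, 0.8, 0.2, 0.3, 0.7]   # every positive outranks every negative
constant = [0.5] * len(labels)              # no ranking information at all

print(auc_score(labels, perfect))
print(auc_score(labels, constant))
```

A perfectly ordered set of scores gives 1.0 and a constant score gives 0.5, which is why a submission that carries no information about the label lands around 0.5.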
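2. Evaluation notes
The leaderboard uses a Private/Public mechanism: the Private board reflects the score on a fixed proportion of the rows in the submitted result file, and the Public board reflects the score on the remaining rows.
Each team gets 5 submissions per day for scoring and ranking. The Public leaderboard updates in real time, sorted from highest to lowest; if a team submits several times in one day, the newer result overwrites the previous one.
Because of the generalization performance of the model used, the submission with the highest Public score does not necessarily score highest on Private, so participants must choose, from their valid submissions, two result files that they feel balance generalization and score to enter the Private evaluation.
The Private leaderboard is revealed after the competition ends; the final valid score and ranking are determined by the Private board.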
Code
%load_ext klab-autotime
import pandas as pd
import numpy as np
time: 311 ms
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
time: 3.85 ms
1 Feature engineering
Positive samples
q1
Sum the consumption amounts of the two months.
q1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q1.txt', sep='\t', header=None))
q1.columns = ['year_month', 'id', 'consume', 'label']
Mem. usage decreased to 0.16 Mb (53.1% reduction)
time: 39.2 ms
q1.describe()
| | year_month | id | consume | label |
|---|---|---|---|---|
| count | 11200.000000 | 1.120000e+04 | 1.086500e+04 | 11200.0 |
| mean | 201706.500000 | 5.416583e+15 | inf | 1.0 |
| std | 0.500022 | 2.642827e+15 | inf | 0.0 |
| min | 201706.000000 | 1.448104e+12 | 4.998779e-02 | 1.0 |
| 25% | 201706.000000 | 3.117220e+15 | 4.068750e+01 | 1.0 |
| 50% | 201706.500000 | 5.456254e+15 | 9.837500e+01 | 1.0 |
| 75% | 201707.000000 | 7.702940e+15 | 1.785000e+02 | 1.0 |
| max | 201707.000000 | 9.997949e+15 | 1.324000e+03 | 1.0 |
time: 37.3 ms
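The `inf` shown for the mean and standard deviation of `consume` is consistent with the column having been downcast to `float16` by `reduce_mem_usage`: the largest finite float16 value is 65504, so adding up thousands of amounts overflows. The NumPy illustration below uses made-up values of a similar magnitude, not the competition data.

```python
import numpy as np

# Made-up amounts, roughly the size of the consume column (float64 by default).
values = np.full(11000, 100.0)
as_f16 = values.astype(np.float16)

with np.errstate(over="ignore"):     # silence the overflow warning for the demo
    total_f16 = as_f16.sum()         # the result has to fit into float16
total_f32 = as_f16.astype(np.float32).sum()

print(np.finfo(np.float16).max, total_f16, total_f32)
```

Upcasting the column (for example with `astype('float32')`) before summing or describing it sidesteps the overflow while still using less memory than float64.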
q1.info()
RangeIndex: 11200 entries, 0 to 11199
Data columns (total 4 columns):
year_month 11200 non-null int32
id 11200 non-null int64
consume 10865 non-null float16
label 11200 non-null int8
dtypes: float16(1), int32(1), int64(1), int8(1)
memory usage: 164.1 KB
time: 6.91 ms
q1.consume.min()
0.05
time: 2.64 ms
q1 = q1.fillna(98.0)
time: 2.75 ms
q1.info()
RangeIndex: 11200 entries, 0 to 11199
Data columns (total 4 columns):
year_month 11200 non-null int32
id 11200 non-null int64
consume 11200 non-null float16
label 11200 non-null int8
dtypes: float16(1), int32(1), int64(1), int8(1)
memory usage: 164.1 KB
time: 6.71 ms
q1 = q1[['id', 'consume']]
q1_groupbyid = q1.groupby(['id']).agg({'consume': pd.Series.sum})
time: 709 ms
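The q1 step fills missing amounts with a hard-coded 98.0 (close to the median reported by `describe()`) and then adds up the two monthly rows per user. A small sketch of the same idea on invented data, computing the fill value from the column instead of typing it in:

```python
import pandas as pd

# Toy stand-in for table 1: two monthly rows per user, one amount missing.
toy = pd.DataFrame({
    "year_month": [201706, 201707, 201706, 201707],
    "id": [1, 1, 2, 2],
    "consume": [40.0, 60.0, None, 100.0],
})

# Fill gaps with the median computed from the data itself.
median_consume = toy["consume"].median()
toy["consume"] = toy["consume"].fillna(median_consume)

# One row per user holding the two-month total.
total = toy.groupby("id", as_index=False)["consume"].sum()
print(total)
```

With `as_index=False` the `id` stays an ordinary column, which keeps the later merges on `id` straightforward.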
q2
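Feature 1: which of the top-9 phone brands, plus "other", the user has used (10 in total; the mapping below actually names 10 brands, so with "other" it yields 11 indicator columns)
Feature 2: the number of distinct handsets (brand + model) the user has used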
q2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q2.txt', sep='\t', header=None))
q2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
Mem. usage decreased to 11.31 Mb (14.6% reduction)
time: 2.46 s
q2.info()
RangeIndex: 289203 entries, 0 to 289202
Data columns (total 6 columns):
id 289203 non-null int64
brand 197376 non-null object
type 197380 non-null object
first_use_time 289203 non-null int64
recent_use_time 289203 non-null int64
label 289203 non-null int8
dtypes: int64(3), int8(1), object(2)
memory usage: 11.3+ MB
time: 62.6 ms
q2.type = q2.type.fillna('其它')
time: 18.4 ms
brand_series = pd.Series({'苹果' : 'iphone', '华为' : "huawei", '欧珀' : 'oppo', '维沃' : 'vivo', '三星' : 'san', '小米' : 'mi', '金立' : 'jinli', '魅族' : 'mei', '乐视' : 'le', '四季恒美' : 'siji'})
q2.brand = q2.brand.map(brand_series)
time: 42.4 ms
q2.brand = q2.brand.fillna('其它')
time: 17.4 ms
q2.head()
| | id | brand | type | first_use_time | recent_use_time | label |
|---|---|---|---|---|---|---|
| 0 | 1752398069509000 | 其它 | 其它 | 20161209134530 | 20161209190636 | 1 |
| 1 | 1752398069509000 | huawei | PLK-AL10 | 20170609223138 | 20170609224345 | 1 |
| 2 | 1752398069509000 | le | LETV X501 | 20160924102711 | 20160924112425 | 1 |
| 3 | 1752398069509000 | jinli | 金立 GN800 | 20150331210255 | 20150630131232 | 1 |
| 4 | 1752398069509000 | jinli | GIONEE M5 | 20170508191216 | 20170605192347 | 1 |
time: 18.7 ms
q2['brand_type'] = q2['brand'] + q2['type']
time: 109 ms
q2.head()
| | id | brand | type | first_use_time | recent_use_time | label | brand_type |
|---|---|---|---|---|---|---|---|
| 0 | 1752398069509000 | 其它 | 其它 | 20161209134530 | 20161209190636 | 1 | 其它其它 |
| 1 | 1752398069509000 | huawei | PLK-AL10 | 20170609223138 | 20170609224345 | 1 | huaweiPLK-AL10 |
| 2 | 1752398069509000 | le | LETV X501 | 20160924102711 | 20160924112425 | 1 | leLETV X501 |
| 3 | 1752398069509000 | jinli | 金立 GN800 | 20150331210255 | 20150630131232 | 1 | jinli金立 GN800 |
| 4 | 1752398069509000 | jinli | GIONEE M5 | 20170508191216 | 20170605192347 | 1 | jinliGIONEE M5 |
time: 9.75 ms
groupbybrand_type = q2['brand_type'].value_counts()
time: 51.8 ms
groupbybrand_type.head(10)
其它其它 91823
iphoneA1586 14898
iphoneA1524 10330
iphoneA1700 9246
iphoneA1699 8277
iphoneIPHONE6S(A1633) 6271
oppoOPPO R9M 4725
iphoneA1530 4640
oppoOPPO R9TM 2978
vivoVIVO X7 2516
Name: brand_type, dtype: int64
time: 3.44 ms
q2_brand_type = q2[['id', 'brand_type']]
q2_brand_type = q2_brand_type.drop_duplicates()
q2_groupbyid = q2_brand_type['id'].value_counts()
q2_groupbyid = q2_groupbyid.reset_index()
q2_groupbyid.columns = ['id', 'phone_nums']
q2_groupbyid.head()
| | id | phone_nums |
|---|---|---|
| 0 | 8707678197418467 | 422 |
| 1 | 9196501153454276 | 409 |
| 2 | 3900535090108175 | 389 |
| 3 | 4104535378288025 | 352 |
| 4 | 1106540188374027 | 350 |
time: 90 ms
q2_groupbyid.info()
RangeIndex: 5600 entries, 0 to 5599
Data columns (total 2 columns):
id 5600 non-null int64
phone_nums 5600 non-null int64
dtypes: int64(2)
memory usage: 87.6 KB
time: 5.91 ms
q2_brand = q2[['id', 'brand']]
q2_brand = q2_brand.drop_duplicates()
q2_brand_one_hot = pd.get_dummies(q2_brand)
q2_brand_one_hot.head()
| | id | brand_huawei | brand_iphone | brand_jinli | brand_le | brand_mei | brand_mi | brand_oppo | brand_san | brand_siji | brand_vivo | brand_其它 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1752398069509000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1752398069509000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1752398069509000 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1752398069509000 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 1752398069509000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
time: 48.9 ms
q2_one_hot = q2_brand_one_hot.groupby(['id']).agg({
    'brand_huawei': pd.Series.max,
    'brand_iphone': pd.Series.max,
    'brand_jinli': pd.Series.max,
    'brand_le': pd.Series.max,
    'brand_mei': pd.Series.max,
    'brand_mi': pd.Series.max,
    'brand_oppo': pd.Series.max,
    'brand_san': pd.Series.max,
    'brand_siji': pd.Series.max,
    'brand_vivo': pd.Series.max,
    'brand_其它': pd.Series.max
})
q2_one_hot.head()
| id | brand_huawei | brand_iphone | brand_jinli | brand_le | brand_mei | brand_mi | brand_oppo | brand_san | brand_siji | brand_vivo | brand_其它 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1448103998000 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
| 17398718813730 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| 61132623486000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 68156596675520 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 76819334576430 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
time: 6.57 s
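The per-user brand indicators above are built by one-hot encoding each row and then taking the maximum per `id`. A shorter way to express the same aggregation, without listing every column in the `agg` dictionary, might look like the sketch below. The data is invented, and `brand_nums` is a hypothetical extra column (distinct brands, as opposed to the distinct brand + model count in `phone_nums`).

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "brand": ["huawei", "iphone", "huawei", "oppo", "其它"],
})

# Indicator columns per row, then "has this user ever used the brand" per id.
dummies = pd.get_dummies(toy["brand"], prefix="brand").astype("uint8")
multi_hot = dummies.groupby(toy["id"]).max()

# The number of distinct brands per user falls out of the same table.
multi_hot["brand_nums"] = multi_hot.sum(axis=1)
print(multi_hot)
```

Calling `.max()` on the whole grouped frame applies to every indicator column, so the code does not need to change when the brand list does.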
pos_set = q1_groupbyid.merge(q2_groupbyid, on=['id'])
pos_set.info()
Int64Index: 5600 entries, 0 to 5599
Data columns (total 3 columns):
id 5600 non-null int64
consume 5600 non-null float16
phone_nums 5600 non-null int64
dtypes: float16(1), int64(2)
memory usage: 142.2 KB
time: 11.6 ms
pos_set = pos_set.merge(q2_one_hot, on=['id'])
pos_set.info()
Int64Index: 5600 entries, 0 to 5599
Data columns (total 14 columns):
id 5600 non-null int64
consume 5600 non-null float16
phone_nums 5600 non-null int64
brand_huawei 5600 non-null uint8
brand_iphone 5600 non-null uint8
brand_jinli 5600 non-null uint8
brand_le 5600 non-null uint8
brand_mei 5600 non-null uint8
brand_mi 5600 non-null uint8
brand_oppo 5600 non-null uint8
brand_san 5600 non-null uint8
brand_siji 5600 non-null uint8
brand_vivo 5600 non-null uint8
brand_其它 5600 non-null uint8
dtypes: float16(1), int64(2), uint8(11)
memory usage: 202.3 KB
time: 98.6 ms
q3
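1. Sum the contact-circle size over the two months
2. Sum the "roamed out of the province" flag over the two months (yes: 1, no: 0)
3. Sum the "went abroad" flag over the two months (yes: 1, no: 0)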
q3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q3.txt', sep='\t', header=None))
q3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
Mem. usage decreased to 0.18 Mb (64.6% reduction)
time: 85.8 ms
q3.info()
RangeIndex: 11200 entries, 0 to 11199
Data columns (total 6 columns):
year_month 11200 non-null int32
id 11200 non-null int64
call_nums 11200 non-null int16
is_trans_provincial 11200 non-null int8
is_transnational 11200 non-null int8
label 11200 non-null int8
dtypes: int16(1), int32(1), int64(1), int8(3)
memory usage: 186.0 KB
time: 7.49 ms
q3_groupbyid_call = q3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
q3_groupbyid_provincial = q3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
q3_groupbyid_trans = q3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
pos_set = pos_set.merge(q3_groupbyid_call, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_provincial, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_trans, on=['id'])
pos_set.info()
Int64Index: 5600 entries, 0 to 5599
Data columns (total 17 columns):
id 5600 non-null int64
consume 5600 non-null float16
phone_nums 5600 non-null int64
brand_huawei 5600 non-null uint8
brand_iphone 5600 non-null uint8
brand_jinli 5600 non-null uint8
brand_le 5600 non-null uint8
brand_mei 5600 non-null uint8
brand_mi 5600 non-null uint8
brand_oppo 5600 non-null uint8
brand_san 5600 non-null uint8
brand_siji 5600 non-null uint8
brand_vivo 5600 non-null uint8
brand_其它 5600 non-null uint8
call_nums 5600 non-null int16
is_trans_provincial 5600 non-null int8
is_transnational 5600 non-null int8
dtypes: float16(1), int16(1), int64(2), int8(2), uint8(11)
memory usage: 224.2 KB
time: 1.95 s
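The three q3 features are all per-user sums, so they can also be produced by a single `groupby` followed by one merge rather than three of each. A sketch on invented rows, using the column names assigned above:

```python
import pandas as pd

toy = pd.DataFrame({
    "year_month": [201706, 201707, 201706, 201707],
    "id": [1, 1, 2, 2],
    "call_nums": [30, 45, 10, 12],
    "is_trans_provincial": [0, 1, 0, 0],
    "is_transnational": [0, 0, 0, 0],
})

feature_cols = ["call_nums", "is_trans_provincial", "is_transnational"]

# Sum all three columns per user in one pass; id comes back as a normal column.
q3_features = toy.groupby("id")[feature_cols].sum().reset_index()
print(q3_features)
```

The result has one row per `id` and can be merged into the sample table in one step.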
q4
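1. Number of out-of-province roaming records within the two months
2. One-hot over all provinces, or over the top-10 provinces plus "other"
3. Number of distinct provinces roamed to within the two months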
q4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q4.txt', sep='\t', header=None))
q4.columns = ['year_month', 'id', 'province', 'label']
q4.info()
Mem. usage decreased to 0.15 Mb (34.4% reduction)
RangeIndex: 7289 entries, 0 to 7288
Data columns (total 4 columns):
year_month 7289 non-null int32
id 7289 non-null int64
province 7218 non-null object
label 7289 non-null int8
dtypes: int32(1), int64(1), int8(1), object(1)
memory usage: 149.6+ KB
time: 18.4 ms
q4.head()
| | year_month | id | province | label |
|---|---|---|---|---|
| 0 | 201707 | 6062475264825100 | 广东 | 1 |
| 1 | 201707 | 5627768389537500 | 北京 | 1 |
| 2 | 201707 | 2000900444179600 | 山西 | 1 |
| 3 | 201707 | 5304502776817600 | 四川 | 1 |
| 4 | 201707 | 5304502776817600 | 四川 | 1 |
time: 7.16 ms
q4_groupbyid = q4.groupby(['province']).size()
time: 61.3 ms
q4_groupbyid.sort_values()
province
宁夏 15
吉林 20
内蒙古 22
黑龙江 27
青海 35
天津 39
辽宁 44
西藏 69
山西 70
甘肃 73
新疆 74
安徽 86
海南 100
陕西 114
山东 121
福建 150
河北 168
江苏 182
湖北 208
上海 215
河南 237
北京 247
江西 364
重庆 428
浙江 483
云南 530
广西 536
四川 793
广东 835
湖南 933
dtype: int64
time: 4.04 ms
q4.province = q4.province.fillna('湖南')
q4.info()
RangeIndex: 7289 entries, 0 to 7288
Data columns (total 4 columns):
year_month 7289 non-null int32
id 7289 non-null int64
province 7289 non-null object
label 7289 non-null int8
dtypes: int32(1), int64(1), int8(1), object(1)
memory usage: 149.6+ KB
time: 8.09 ms
q4_groupbyid = q4[['id', 'province']].groupby(['id']).size()
q4_groupbyid = q4_groupbyid.reset_index()
q4_groupbyid.columns = ['id', 'province_out_cnt']
pos_set = pos_set.merge(q4_groupbyid, how='left', on=['id'])
pos_set.info()
Int64Index: 5600 entries, 0 to 5599
Data columns (total 18 columns):
id 5600 non-null int64
consume 5600 non-null float16
phone_nums 5600 non-null int64
brand_huawei 5600 non-null uint8
brand_iphone 5600 non-null uint8
brand_jinli 5600 non-null uint8
brand_le 5600 non-null uint8
brand_mei 5600 non-null uint8
brand_mi 5600 non-null uint8
brand_oppo 5600 non-null uint8
brand_san 5600 non-null uint8
brand_siji 5600 non-null uint8
brand_vivo 5600 non-null uint8
brand_其它 5600 non-null uint8
call_nums 5600 non-null int16
is_trans_provincial 5600 non-null int8
is_transnational 5600 non-null int8
province_out_cnt 1942 non-null float64
dtypes: float16(1), float64(1), int16(1), int64(2), int8(2), uint8(11)
memory usage: 268.0 KB
time: 19.6 ms
pos_set = pos_set.fillna(0)
pos_set['label'] = 1
pos_set.info()
Int64Index: 5600 entries, 0 to 5599
Data columns (total 19 columns):
id 5600 non-null int64
consume 5600 non-null float16
phone_nums 5600 non-null int64
brand_huawei 5600 non-null uint8
brand_iphone 5600 non-null uint8
brand_jinli 5600 non-null uint8
brand_le 5600 non-null uint8
brand_mei 5600 non-null uint8
brand_mi 5600 non-null uint8
brand_oppo 5600 non-null uint8
brand_san 5600 non-null uint8
brand_siji 5600 non-null uint8
brand_vivo 5600 non-null uint8
brand_其它 5600 non-null uint8
call_nums 5600 non-null int16
is_trans_provincial 5600 non-null int8
is_transnational 5600 non-null int8
province_out_cnt 5600 non-null float64
label 5600 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2), uint8(11)
memory usage: 311.7 KB
time: 12.7 ms
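Of the three q4 ideas, only the record count (`province_out_cnt`) is built above. The other two, a distinct-province count and a top-k one-hot with an "other" bucket, could be sketched roughly as follows. The rows are invented, and `province_nunique`, the `prov_` prefix and `k = 2` are illustrative choices rather than anything from the original notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3],
    "province": ["湖南", "广东", "湖南", "湖南", "广东", "四川"],
})

# Idea 3: number of distinct provinces per user.
distinct = toy.groupby("id")["province"].nunique().rename("province_nunique")

# Idea 2: keep the k most frequent provinces and bucket the rest as "other".
k = 2  # the post suggests 10; 2 keeps the toy example readable
top_k = toy["province"].value_counts().head(k).index
bucketed = toy["province"].where(toy["province"].isin(top_k), "other")
province_hot = (
    pd.get_dummies(bucketed, prefix="prov")
    .groupby(toy["id"])
    .max()
    .astype("uint8")
)

features = province_hot.join(distinct).reset_index()
print(features)
```

Users with no roaming records would still need the left merge and `fillna(0)` used for `province_out_cnt`.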
q6: ignored for now
q7
1. Total traffic used
2. Number of distinct apps used
3. Whether certain specific (travel-related) apps were used
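The q7 ideas are listed but not implemented in this notebook. Purely as an illustration of how they might be computed, the sketch below assumes a layout of one row per user and app with a traffic amount. The column names `app` and `flow`, the output column names and the contents of `TRAVEL_APPS` are all placeholders, not the real schema of table 7.

```python
import pandas as pd

# Assumed layout (not the real schema): one row per (user, app) with traffic used.
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "app": ["chat_app", "travel_app_a", "video_app", "chat_app", "video_app"],
    "flow": [10.0, 2.5, 300.0, 5.0, 50.0],
})
TRAVEL_APPS = {"travel_app_a", "travel_app_b"}  # placeholder names

toy["is_travel_app"] = toy["app"].isin(TRAVEL_APPS).astype("int8")

q7_features = (
    toy.groupby("id")
    .agg({"flow": "sum", "app": "nunique", "is_travel_app": "max"})
    .rename(columns={
        "flow": "total_flow",              # idea 1: total traffic
        "app": "app_nums",                 # idea 2: distinct apps
        "is_travel_app": "used_travel_app" # idea 3: any travel app used
    })
    .reset_index()
)
print(q7_features)
```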
1.1 Positive samples
q1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q1.txt', sep='\t', header=None))
q1.columns = ['year_month', 'id', 'consume', 'label']
q1 = q1.fillna(98.0)
q1 = q1[['id', 'consume']]
q1_groupbyid = q1.groupby(['id']).agg({'consume': pd.Series.sum})
q2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q2.txt', sep='\t', header=None))
q2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
q2.type = q2.type.fillna('其它')
brand_series = pd.Series({'苹果' : 'iphone', '华为' : "huawei", '欧珀' : 'oppo', '维沃' : 'vivo', '三星' : 'san', '小米' : 'mi', '金立' : 'jinli', '魅族' : 'mei', '乐视' : 'le', '四季恒美' : 'siji'})
q2.brand = q2.brand.map(brand_series)
q2.brand = q2.brand.fillna('其它')
q2['brand_type'] = q2['brand'] + q2['type']
q2_brand_type = q2[['id', 'brand_type']]
q2_brand_type = q2_brand_type.drop_duplicates()
q2_groupbyid = q2_brand_type['id'].value_counts()
q2_groupbyid = q2_groupbyid.reset_index()
q2_groupbyid.columns = ['id', 'phone_nums']
q2_brand = q2[['id', 'brand']]
q2_brand = q2_brand.drop_duplicates()
q2_brand_one_hot = pd.get_dummies(q2_brand)
q2_one_hot = q2_brand_one_hot.groupby(['id']).agg({
    'brand_huawei': pd.Series.max,
    'brand_iphone': pd.Series.max,
    'brand_jinli': pd.Series.max,
    'brand_le': pd.Series.max,
    'brand_mei': pd.Series.max,
    'brand_mi': pd.Series.max,
    'brand_oppo': pd.Series.max,
    'brand_san': pd.Series.max,
    'brand_siji': pd.Series.max,
    'brand_vivo': pd.Series.max,
    'brand_其它': pd.Series.max
})
q2_one_hot.head()
pos_set = q1_groupbyid.merge(q2_groupbyid, on=['id'])
pos_set = pos_set.merge(q2_one_hot, on=['id'])
q3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q3.txt', sep='\t', header=None))
q3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
q3_groupbyid_call = q3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
q3_groupbyid_provincial = q3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
q3_groupbyid_trans = q3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
pos_set = pos_set.merge(q3_groupbyid_call, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_provincial, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_trans, on=['id'])
q4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q4.txt', sep='\t', header=None))
q4.columns = ['year_month', 'id', 'province', 'label']
q4.province = q4.province.fillna('湖南')
q4_groupbyid = q4[['id', 'province']].groupby(['id']).size()
q4_groupbyid = q4_groupbyid.reset_index()
q4_groupbyid.columns = ['id', 'province_out_cnt']
pos_set = pos_set.merge(q4_groupbyid, how='left', on=['id'])
pos_set = pos_set.fillna(0)
pos_set['label'] = 1
pos_set.info()
Mem. usage decreased to 0.16 Mb (53.1% reduction)
Mem. usage decreased to 11.31 Mb (14.6% reduction)
Mem. usage decreased to 0.18 Mb (64.6% reduction)
Mem. usage decreased to 0.15 Mb (34.4% reduction)
Int64Index: 5600 entries, 0 to 5599
Data columns (total 19 columns):
id 5600 non-null int64
consume 5600 non-null float16
phone_nums 5600 non-null int64
brand_huawei 5600 non-null uint8
brand_iphone 5600 non-null uint8
brand_jinli 5600 non-null uint8
brand_le 5600 non-null uint8
brand_mei 5600 non-null uint8
brand_mi 5600 non-null uint8
brand_oppo 5600 non-null uint8
brand_san 5600 non-null uint8
brand_siji 5600 non-null uint8
brand_vivo 5600 non-null uint8
brand_其它 5600 non-null uint8
call_nums 5600 non-null int16
is_trans_provincial 5600 non-null int8
is_transnational 5600 non-null int8
province_out_cnt 5600 non-null float64
label 5600 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2), uint8(11)
memory usage: 311.7 KB
time: 10.1 s
1.2 Negative samples
n1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n1.txt', sep='\t', header=None))
n1.columns = ['year_month', 'id', 'consume', 'label']
n1 = n1.fillna(98.0)
n1_groupbyid = n1[['id', 'consume']].groupby(['id']).agg({'consume': pd.Series.sum})
n2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n2.txt', sep='\t', header=None))
n2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
n2.type = n2.type.fillna('其它')
brand_series = pd.Series({'苹果' : 'iphone', '华为' : "huawei", '欧珀' : 'oppo', '维沃' : 'vivo', '三星' : 'san', '小米' : 'mi', '金立' : 'jinli', '魅族' : 'mei', '乐视' : 'le', '四季恒美' : 'siji'})
n2.brand = n2.brand.map(brand_series)
n2.brand = n2.brand.fillna('其它')
n2['brand_type'] = n2['brand'] + n2['type']
n2_brand_type = n2[['id', 'brand_type']]
n2_brand_type = n2_brand_type.drop_duplicates()
n2_groupbyid = n2_brand_type['id'].value_counts()
n2_groupbyid = n2_groupbyid.reset_index()
n2_groupbyid.columns = ['id', 'phone_nums']
n2_brand = n2[['id', 'brand']]
n2_brand = n2_brand.drop_duplicates()
n2_brand_one_hot = pd.get_dummies(n2_brand)
n2_one_hot = n2_brand_one_hot.groupby(['id']).agg({
    'brand_huawei': pd.Series.max,
    'brand_iphone': pd.Series.max,
    'brand_jinli': pd.Series.max,
    'brand_le': pd.Series.max,
    'brand_mei': pd.Series.max,
    'brand_mi': pd.Series.max,
    'brand_oppo': pd.Series.max,
    'brand_san': pd.Series.max,
    'brand_siji': pd.Series.max,
    'brand_vivo': pd.Series.max,
    'brand_其它': pd.Series.max
})
neg_set = n1_groupbyid.merge(n2_groupbyid, on=['id'])
neg_set = neg_set.merge(n2_one_hot, on=['id'])
n3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n3.txt', sep='\t', header=None))
n3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
n3_groupbyid_call = n3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
n3_groupbyid_provincial = n3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
n3_groupbyid_trans = n3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
neg_set = neg_set.merge(n3_groupbyid_call, on=['id'])
neg_set = neg_set.merge(n3_groupbyid_provincial, on=['id'])
neg_set = neg_set.merge(n3_groupbyid_trans, on=['id'])
n4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n4.txt', sep='\t', header=None))
n4.columns = ['year_month', 'id', 'province', 'label']
n4.province = n4.province.fillna('湖南')
n4_groupbyid = n4[['id', 'province']].groupby(['id']).size()
n4_groupbyid = n4_groupbyid.reset_index()
n4_groupbyid.columns = ['id', 'province_out_cnt']
neg_set = neg_set.merge(n4_groupbyid, how='left', on=['id'])
neg_set = neg_set.fillna(0)
neg_set['label'] = 0
neg_set.info()
Mem. usage decreased to 2.67 Mb (53.1% reduction)
Mem. usage decreased to 51.13 Mb (14.6% reduction)
Mem. usage decreased to 3.03 Mb (64.6% reduction)
Mem. usage decreased to 0.73 Mb (34.4% reduction)
Int64Index: 93375 entries, 0 to 93374
Data columns (total 19 columns):
id 93375 non-null int64
consume 93375 non-null float16
phone_nums 93375 non-null int64
brand_huawei 93375 non-null uint8
brand_iphone 93375 non-null uint8
brand_jinli 93375 non-null uint8
brand_le 93375 non-null uint8
brand_mei 93375 non-null uint8
brand_mi 93375 non-null uint8
brand_oppo 93375 non-null uint8
brand_san 93375 non-null uint8
brand_siji 93375 non-null uint8
brand_vivo 93375 non-null uint8
brand_其它 93375 non-null uint8
call_nums 93375 non-null int16
is_trans_provincial 93375 non-null int8
is_transnational 93375 non-null int8
province_out_cnt 93375 non-null float64
label 93375 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2), uint8(11)
memory usage: 5.1 MB
time: 2min 48s
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames the same way.
train_set = pd.concat([pos_set, neg_set])
train_set.info()
Int64Index: 98975 entries, 0 to 93374
Data columns (total 19 columns):
id 98975 non-null int64
consume 98975 non-null float16
phone_nums 98975 non-null int64
brand_huawei 98975 non-null uint8
brand_iphone 98975 non-null uint8
brand_jinli 98975 non-null uint8
brand_le 98975 non-null uint8
brand_mei 98975 non-null uint8
brand_mi 98975 non-null uint8
brand_oppo 98975 non-null uint8
brand_san 98975 non-null uint8
brand_siji 98975 non-null uint8
brand_vivo 98975 non-null uint8
brand_其它 98975 non-null uint8
call_nums 98975 non-null int16
is_trans_provincial 98975 non-null int8
is_transnational 98975 non-null int8
province_out_cnt 98975 non-null float64
label 98975 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2), uint8(11)
memory usage: 5.4 MB
time: 62.5 ms
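The combined training set holds 5,600 positives against 93,375 negatives, so roughly 5.7% of the rows are positive. With that imbalance it can help to keep the class ratio identical in the training and validation parts. The NumPy sketch below shows one way to do that by hand; `stratified_split` is an illustrative helper, and scikit-learn's `train_test_split` offers the same behaviour through its `stratify` argument.

```python
import numpy as np


def stratified_split(y, test_size=0.2, seed=2019):
    """Return (train_idx, test_idx) with the class ratio preserved in both parts."""
    rng = np.random.RandomState(seed)
    y = np.asarray(y)
    train_parts, test_parts = [], []
    for cls in np.unique(y):
        # Shuffle the row positions of this class, then cut off the test share.
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_test = int(round(len(idx) * test_size))
        test_parts.append(idx[:n_test])
        train_parts.append(idx[n_test:])
    return np.concatenate(train_parts), np.concatenate(test_parts)


# Same class counts as the combined training set in the post.
y = np.array([1] * 5600 + [0] * 93375)
train_idx, test_idx = stratified_split(y)
print(y.mean(), y[train_idx].mean(), y[test_idx].mean())
```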
2 Modeling
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn import metrics
from sklearn.model_selection import train_test_split
X = train_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values
y = train_set['label'].values
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
lgb_train = lgb.Dataset(x_train, y_train)
lgb_eval = lgb.Dataset(x_test, y_test, reference = lgb_train)
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': {'auc'},
    'num_leaves': 100,
    'reg_alpha': 0,
    'reg_lambda': 0.01,
    'max_depth': 6,
    'n_estimators': 100,
    'subsample': 0.9,
    'colsample_bytree': 0.85,
    'subsample_freq': 1,
    'min_child_samples': 25,
    'learning_rate': 0.1,
    'random_state': 2019
}
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=2000,
                valid_sets=lgb_eval,
                verbose_eval=250,
                early_stopping_rounds=50)
y_pred = gbm.predict(X, num_iteration=gbm.best_iteration)
print('AUC: %.4f' % metrics.roc_auc_score(y, y_pred))
y_pred = gbm.predict(x_test, num_iteration=gbm.best_iteration)
print('Test AUC: %.4f' % metrics.roc_auc_score(y_test, y_pred))
Training until validation scores don't improve for 50 rounds.
Early stopping, best iteration is:
[18] valid_0's auc: 0.786865
AUC: 0.7981
Test AUC: 0.7869
time: 772 ms
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from collections import Counter
X = train_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values
y = train_set['label'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
c = Counter(y_train)
'''
params={'booster':'gbtree',
'objective': 'binary:logistic',
'eval_metric': 'auc',
'max_depth':4,
'lambda':10,
'subsample':0.75,
'colsample_bytree':0.75,
'min_child_weight':2,
'eta': 0.025,
'seed':0,
'nthread':8,
'silent':1}
'''
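# Note: the scikit-learn wrapper controls the number of trees through n_estimators,
# so num_boost_round here probably does not have the intended effect.
clf = XGBClassifier(max_depth=5, eval_metric='auc', min_child_weight=6, scale_pos_weight=c[0] / 16 / c[1],
                    nthread=12, num_boost_round=1000, seed=2019
                    )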
print('fit start...')
clf.fit(X_train, y_train)
print('fit finish')
'''
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
print('train score:{}\ntest score:{}'.format(train_score, test_score))
'''
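# Note: predict() returns hard 0/1 labels, so the AUC values below are computed on labels;
# predict_proba(X)[:, 1] would score the ranking of the probabilities instead.
y_pred=clf.predict(X)
from sklearn import metrics
print('AUC: %.4f' % metrics.roc_auc_score(y, y_pred))
y_pred=clf.predict(X_test)
print('Test AUC: %.4f' % metrics.roc_auc_score(y_test, y_pred))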
fit start...
fit finish
AUC: 0.5134
Test AUC: 0.5082
time: 3.11 s
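An AUC this close to 0.5 is what one would expect when the metric is fed hard class labels: `predict` on a scikit-learn style classifier returns 0/1 predictions, which throws away most of the ranking that AUC measures. This is offered as a likely explanation rather than something checked against the original run. The dependency-free example below uses invented numbers to show the effect.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Invented example: 3 positives, 7 negatives, mostly low predicted probabilities.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
probs = [0.40, 0.30, 0.55, 0.10, 0.20, 0.05, 0.15, 0.25, 0.35, 0.12]

# What predict() would effectively return with a 0.5 threshold.
hard = [1 if p >= 0.5 else 0 for p in probs]

auc_probs = pairwise_auc(labels, probs)
auc_hard = pairwise_auc(labels, hard)
print(round(auc_probs, 4), round(auc_hard, 4))
```

The probabilities rank 20 of the 21 positive/negative pairs correctly, while the thresholded labels tie on most pairs and lose a large part of that score.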
import xgboost as xgb
import pandas as pd
from sklearn.model_selection import GridSearchCV
from collections import Counter
X_train = train_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values
y_train = train_set['label'].values
c = Counter(y_train)
parameters = {
    'max_depth': [5, 10, 15],
    'learning_rate': [0.01, 0.02, 0.05],
    'n_estimators': [500, 1000, 2000],
    'min_child_weight': [0, 2, 5],
    'max_delta_step': [0, 0.2, 0.6],
    'subsample': [0.6, 0.7, 0.8],
    'colsample_bytree': [0.5, 0.6, 0.7],
    'reg_alpha': [0, 0.25, 0.5],
    'reg_lambda': [0.2, 0.4, 0.6],
    'scale_pos_weight': [0.8, 8, 14]
}
xlf = xgb.XGBClassifier(max_depth=10,
                        learning_rate=0.01,
                        n_estimators=2000,
                        silent=True,
                        objective='binary:logistic',
                        nthread=12,
                        gamma=0,
                        min_child_weight=1,
                        max_delta_step=0,
                        subsample=0.85,
                        colsample_bytree=0.7,
                        colsample_bylevel=1,
                        reg_alpha=0,
                        reg_lambda=1,
                        scale_pos_weight=1,
                        seed=2019,
                        missing=None)
# The competition is scored by AUC, so search on 'roc_auc' rather than 'accuracy'.
gsearch = GridSearchCV(xlf, param_grid=parameters, scoring='roc_auc', cv=3)
gsearch.fit(X_train, y_train)
print("Best score: %0.3f" % gsearch.best_score_)
print("Best parameters set:")
best_parameters = gsearch.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
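No output is shown for the grid search, and the size of the grid suggests why it may never have finished: ten parameters with three values each give 3^10 candidates, each fitted three times for cross-validation. The standard-library sketch below counts the fits and draws a small random subset of settings instead. `sample_candidates` is an illustrative helper; scikit-learn's `RandomizedSearchCV` provides this kind of sampling together with the cross-validation.

```python
import random

parameters = {
    'max_depth': [5, 10, 15],
    'learning_rate': [0.01, 0.02, 0.05],
    'n_estimators': [500, 1000, 2000],
    'min_child_weight': [0, 2, 5],
    'max_delta_step': [0, 0.2, 0.6],
    'subsample': [0.6, 0.7, 0.8],
    'colsample_bytree': [0.5, 0.6, 0.7],
    'reg_alpha': [0, 0.25, 0.5],
    'reg_lambda': [0.2, 0.4, 0.6],
    'scale_pos_weight': [0.8, 8, 14],
}

# Size of the exhaustive search: product of the list lengths, times the CV folds.
n_combinations = 1
for values in parameters.values():
    n_combinations *= len(values)
cv = 3
print(n_combinations, n_combinations * cv)


def sample_candidates(grid, n_iter, seed=2019):
    """Draw n_iter random parameter settings from the grid (with replacement)."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n_iter)
    ]


candidates = sample_candidates(parameters, n_iter=20)
print(candidates[0])
```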
3 Prediction
3.1 Test set
t1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_1.txt', sep='\t', header=None))
t1.columns = ['year_month', 'id', 'consume']
t1 = t1.fillna(81.0)
t1_groupbyid = t1[['id', 'consume']].groupby(['id']).agg({'consume': pd.Series.sum})
t2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_2.txt', sep='\t', header=None))
t2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time']
t2 = t2.fillna('其它')
brand_series = pd.Series({'苹果' : 'iphone', '华为' : "huawei", '欧珀' : 'oppo', '维沃' : 'vivo', '三星' : 'san', '小米' : 'mi', '金立' : 'jinli', '魅族' : 'mei', '乐视' : 'le', '四季恒美' : 'siji'})
t2.brand = t2.brand.map(brand_series)
t2.brand = t2.brand.fillna('其它')
t2['brand_type'] = t2['brand'] + t2['type']
t2_brand_type = t2[['id', 'brand_type']]
t2_brand_type = t2_brand_type.drop_duplicates()
t2_groupbyid = t2_brand_type['id'].value_counts()
t2_groupbyid = t2_groupbyid.reset_index()
t2_groupbyid.columns = ['id', 'phone_nums']
t2_brand = t2[['id', 'brand']]
t2_brand = t2_brand.drop_duplicates()
t2_brand_one_hot = pd.get_dummies(t2_brand)
t2_one_hot = t2_brand_one_hot.groupby(['id']).agg({
    'brand_huawei': pd.Series.max,
    'brand_iphone': pd.Series.max,
    'brand_jinli': pd.Series.max,
    'brand_le': pd.Series.max,
    'brand_mei': pd.Series.max,
    'brand_mi': pd.Series.max,
    'brand_oppo': pd.Series.max,
    'brand_san': pd.Series.max,
    'brand_siji': pd.Series.max,
    'brand_vivo': pd.Series.max,
    'brand_其它': pd.Series.max
})
test_set = t1_groupbyid.merge(t2_groupbyid, on=['id'])
test_set = test_set.merge(t2_one_hot, on=['id'])
t3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_3.txt', sep='\t', header=None))
t3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational']
t3_groupbyid_call = t3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
t3_groupbyid_provincial = t3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
t3_groupbyid_trans = t3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
test_set = test_set.merge(t3_groupbyid_call, on=['id'])
test_set = test_set.merge(t3_groupbyid_provincial, on=['id'])
test_set = test_set.merge(t3_groupbyid_trans, on=['id'])
t4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_4.txt', sep='\t', header=None))
t4.columns = ['year_month', 'id', 'province']
t4 = t4.fillna('湖南')
t4_groupbyid = t4[['id', 'province']].groupby(['id']).size()
t4_groupbyid = t4_groupbyid.reset_index()
t4_groupbyid.columns = ['id', 'province_out_cnt']
test_set = test_set.merge(t4_groupbyid, how='left', on=['id'])
test_set = test_set.fillna(0)
test_set.info()
Mem. usage decreased to 1.34 Mb (41.7% reduction)
Mem. usage decreased to 60.50 Mb (0.0% reduction)
Mem. usage decreased to 1.53 Mb (60.0% reduction)
Mem. usage decreased to 0.85 Mb (16.7% reduction)
Int64Index: 48668 entries, 0 to 48667
Data columns (total 18 columns):
id 48668 non-null int64
consume 48668 non-null float16
phone_nums 48668 non-null int64
brand_huawei 48668 non-null uint8
brand_iphone 48668 non-null uint8
brand_jinli 48668 non-null uint8
brand_le 48668 non-null uint8
brand_mei 48668 non-null uint8
brand_mi 48668 non-null uint8
brand_oppo 48668 non-null uint8
brand_san 48668 non-null uint8
brand_siji 48668 non-null uint8
brand_vivo 48668 non-null uint8
brand_其它 48668 non-null uint8
call_nums 48668 non-null int16
is_trans_provincial 48668 non-null int8
is_transnational 48668 non-null int8
province_out_cnt 48668 non-null float64
dtypes: float16(1), float64(1), int16(1), int64(2), int8(2), uint8(11)
memory usage: 2.3 MB
time: 1min 39s
X_test = test_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values
y_predict = gbm.predict(X_test, num_iteration=gbm.best_iteration)
submit = test_set[['id']].copy()  # copy so the new column is not written to a slice of test_set
submit['pred'] = y_predict
time: 108 ms
type(y_predict)
numpy.ndarray
time: 2.3 ms
y_predict[:5]
array([0.10280227, 0.08214867, 0.06905468, 0.07655945, 0.11238844])
time: 2.9 ms
X_test = test_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values
y_predict = clf.predict_proba(X_test)[:, 1]
submit_xgb = test_set[['id']].copy()  # copy to avoid SettingWithCopyWarning
submit_xgb['pred'] = y_predict
time: 208 ms
4 Submitting results
tt1 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_1.txt', sep='\t', header=None)
tt1.columns = ['year_month', 'id', 'consume']
time: 41.6 ms
xgb_t1_id = tt1[['id']].drop_duplicates()
time: 13 ms
xgb_t1_id.info()
Int64Index: 50200 entries, 0 to 99852
Data columns (total 1 columns):
id 50200 non-null int64
dtypes: int64(1)
memory usage: 784.4 KB
time: 5.46 ms
t1_id = tt1[['id']].drop_duplicates()
time: 12.5 ms
t1_id.info()
Int64Index: 50200 entries, 0 to 99852
Data columns (total 1 columns):
id 50200 non-null int64
dtypes: int64(1)
memory usage: 784.4 KB
time: 5.67 ms
submit_xgb.info()
Int64Index: 48668 entries, 0 to 48667
Data columns (total 2 columns):
id 48668 non-null int64
pred 48668 non-null float32
dtypes: float32(1), int64(1)
memory usage: 950.5 KB
time: 7.8 ms
submit.info()
Int64Index: 48668 entries, 0 to 48667
Data columns (total 2 columns):
id 48668 non-null int64
pred 48668 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 8.33 ms
tt_xgb = t1_id.merge(submit_xgb, on=['id'], how='left')
time: 17.6 ms
tt_xgb.info()
Int64Index: 50200 entries, 0 to 50199
Data columns (total 2 columns):
id 50200 non-null int64
pred 48668 non-null float32
dtypes: float32(1), int64(1)
memory usage: 980.5 KB
time: 8.14 ms
tt = t1_id.merge(submit, on=['id'], how='left')
time: 19.3 ms
tt.info()
Int64Index: 50200 entries, 0 to 50199
Data columns (total 2 columns):
id 50200 non-null int64
pred 48668 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 8.06 ms
xgboost
submit_xgb = tt_xgb.fillna(0.0)
time: 1.92 ms
lightgbm
submit_gbm = tt.fillna(0.0)
time: 1.96 ms
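The two merges above keep all 50,200 test IDs and leave `pred` empty for the 1,532 users that were dropped during feature building; `fillna(0.0)` then scores those users as zero. A minimal sketch of that pattern on made-up data (the frames `all_ids` and `preds` are hypothetical stand-ins for `t1_id` and `submit_xgb`):

```python
import pandas as pd

# Hypothetical stand-ins: every test ID, and predictions for only some of them.
all_ids = pd.DataFrame({'id': [101, 102, 103, 104]})
preds = pd.DataFrame({'id': [101, 103], 'pred': [0.30, 0.05]})

# A left merge keeps every row of all_ids; unmatched IDs get NaN in 'pred'.
merged = all_ids.merge(preds, on='id', how='left')
n_missing = int(merged['pred'].isna().sum())

# Filling with 0.0 ranks the unmatched users below everyone else.
filled = merged.fillna({'pred': 0.0})
print(n_missing, filled['pred'].tolist())
```

Filling with the mean prediction instead of 0.0 might be a more neutral default; that is a suggestion, not something the notebook tried.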
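1. Model blending (sum of the two models' predictions): score 0.4558
2. All 1.0 or all 0.0: score 0.5
3. Thresholding: predictions at or above the cutoff become 1.0 and the rest become 0.0. The nominal cutoff is 0.5, but roughly 2,800 users should be predicted as going, so lower cutoffs were used: xgb at 0.26 scored 0.50153, gbm at 0.17 scored 0.50554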
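The scores above are easier to read with the definition of AUC in mind: it depends only on how predictions rank positives against negatives. A constant submission ties every pair and scores exactly 0.5, and thresholding to 0/1 discards the ranking inside each group. The sketch below uses a hand-written pairwise AUC (`auc` is a helper defined here, and the labels and scores are invented) to illustrate this on a toy example:

```python
def auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores, invented for illustration only.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.40, 0.10, 0.20, 0.15, 0.30, 0.35]

auc_raw = auc(labels, scores)                                        # raw probabilities
auc_const = auc(labels, [1.0] * len(labels))                         # all ones
auc_binary = auc(labels, [1.0 if s >= 0.26 else 0.0 for s in scores])  # thresholded
auc_doubled = auc(labels, [2 * s for s in scores])                   # rescaled scores
print(auc_raw, auc_const, auc_binary, auc_doubled)
```

In this toy case the thresholded scores do worse than the raw ones, and rescaling changes nothing, which suggests submitting the raw probabilities is usually the safer choice for an AUC-scored task.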
submit_xgb.describe()
| | id | pred |
| count | 5.020000e+04 | 50200.000000 |
| mean | 5.449990e+15 | 0.092590 |
| std | 2.628886e+15 | 0.088487 |
| min | 5.959412e+11 | 0.000000 |
| 25% | 3.177008e+15 | 0.034837 |
| 50% | 5.441108e+15 | 0.063993 |
| 75% | 7.726328e+15 | 0.125547 |
| max | 9.999920e+15 | 0.754152 |
time: 22.4 ms
submit_xgb[submit_xgb['pred']>=0.26].describe()
| | id | pred |
| count | 2.818000e+03 | 2818.000000 |
| mean | 5.523494e+15 | 0.350387 |
| std | 2.632627e+15 | 0.083545 |
| min | 7.736480e+13 | 0.260060 |
| 25% | 3.193231e+15 | 0.287803 |
| 50% | 5.528103e+15 | 0.324941 |
| 75% | 7.801996e+15 | 0.386373 |
| max | 9.999505e+15 | 0.754152 |
time: 16.7 ms
xgb_yes = submit_xgb[submit_xgb['pred']>=0.26].copy()  # copy to avoid SettingWithCopyWarning
xgb_yes['pred'] = 1.0
xgb_yes.describe()
| | id | pred |
| count | 2.818000e+03 | 2818.0 |
| mean | 5.523494e+15 | 1.0 |
| std | 2.632627e+15 | 0.0 |
| min | 7.736480e+13 | 1.0 |
| 25% | 3.193231e+15 | 1.0 |
| 50% | 5.528103e+15 | 1.0 |
| 75% | 7.801996e+15 | 1.0 |
| max | 9.999505e+15 | 1.0 |
time: 347 ms
xgb_no = submit_xgb[submit_xgb['pred']<0.26].copy()  # copy to avoid SettingWithCopyWarning
xgb_no['pred'] = 0.0
xgb_no.describe()
| | id | pred |
| count | 4.738200e+04 | 47382.0 |
| mean | 5.445619e+15 | 0.0 |
| std | 2.628626e+15 | 0.0 |
| min | 5.959412e+11 | 0.0 |
| 25% | 3.175890e+15 | 0.0 |
| 50% | 5.435288e+15 | 0.0 |
| 75% | 7.722863e+15 | 0.0 |
| max | 9.999920e+15 | 0.0 |
time: 380 ms
submit = pd.concat([xgb_yes, xgb_no])  # DataFrame.append was removed in pandas 2.0
time: 2.29 ms
submit.describe()
| | id | pred |
| count | 5.020000e+04 | 50200.000000 |
| mean | 5.449990e+15 | 0.056135 |
| std | 2.628886e+15 | 0.230185 |
| min | 5.959412e+11 | 0.000000 |
| 25% | 3.177008e+15 | 0.000000 |
| 50% | 5.441108e+15 | 0.000000 |
| 75% | 7.726328e+15 | 0.000000 |
| max | 9.999920e+15 | 1.000000 |
time: 19.6 ms
submit_xgb[submit_xgb['pred']>=0.2].describe()
| | id | pred |
| count | 5.547000e+03 | 5547.000000 |
| mean | 5.508672e+15 | 0.289829 |
| std | 2.641133e+15 | 0.086438 |
| min | 5.399382e+12 | 0.200014 |
| 25% | 3.195841e+15 | 0.225862 |
| 50% | 5.489831e+15 | 0.261552 |
| 75% | 7.813588e+15 | 0.326278 |
| max | 9.999505e+15 | 0.754152 |
time: 18.5 ms
5600/98975*50200
2840.3132104066685
time: 2.17 ms
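The expression above appears to scale the share of positive users in training (about 5,600 out of 98,975) to the 50,200 test users, giving roughly 2,840 expected travellers. One way to turn such an expected count into a cutoff is to take the k-th largest prediction. The sketch below is pure Python; `threshold_for_top_k` is a hypothetical helper and the predictions are invented:

```python
def threshold_for_top_k(preds, k):
    """Return the k-th largest prediction, so that `pred >= threshold`
    selects at least k rows (more only if there are ties)."""
    if not 1 <= k <= len(preds):
        raise ValueError('k must be between 1 and len(preds)')
    return sorted(preds, reverse=True)[k - 1]

# Expected number of travellers in the test set, as computed in the notebook.
expected = 5600 / 98975 * 50200

# Toy predictions, invented for illustration.
preds = [0.10, 0.08, 0.07, 0.31, 0.11, 0.26, 0.02, 0.19]
cutoff = threshold_for_top_k(preds, 3)
selected = [p for p in preds if p >= cutoff]
print(round(expected), cutoff, selected)
```

This avoids probing cutoffs by hand (0.2, 0.22, 0.23, 0.26) until the row count looks right.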
submit_gbm[submit_gbm['pred']>=0.23].describe()
| | id | pred |
| count | 2.539000e+03 | 2539.000000 |
| mean | 5.482621e+15 | 0.298836 |
| std | 2.625965e+15 | 0.062903 |
| min | 7.736480e+13 | 0.230013 |
| 25% | 3.200866e+15 | 0.253366 |
| 50% | 5.471503e+15 | 0.279145 |
| 75% | 7.742764e+15 | 0.326900 |
| max | 9.999505e+15 | 0.632138 |
time: 19 ms
submit_gbm[submit_gbm['pred']>=0.22].describe()
| | id | pred |
| count | 2.859000e+03 | 2859.000000 |
| mean | 5.493943e+15 | 0.290563 |
| std | 2.630246e+15 | 0.063701 |
| min | 7.736480e+13 | 0.220121 |
| 25% | 3.195841e+15 | 0.244933 |
| 50% | 5.501943e+15 | 0.270700 |
| 75% | 7.743865e+15 | 0.321506 |
| max | 9.999505e+15 | 0.632138 |
time: 19.6 ms
gbm_yes = submit_gbm[submit_gbm['pred']>=0.23].copy()  # copy to avoid SettingWithCopyWarning
gbm_yes['pred'] = 1.0
gbm_yes.describe()
| | id | pred |
| count | 2.539000e+03 | 2539.0 |
| mean | 5.482621e+15 | 1.0 |
| std | 2.625965e+15 | 0.0 |
| min | 7.736480e+13 | 1.0 |
| 25% | 3.200866e+15 | 1.0 |
| 50% | 5.471503e+15 | 1.0 |
| 75% | 7.742764e+15 | 1.0 |
| max | 9.999505e+15 | 1.0 |
time: 82.2 ms
gbm_no = submit_gbm[submit_gbm['pred']<0.23].copy()  # copy to avoid SettingWithCopyWarning
gbm_no['pred'] = 0.0
gbm_no.describe()
| | id | pred |
| count | 4.766100e+04 | 47661.0 |
| mean | 5.448252e+15 | 0.0 |
| std | 2.629058e+15 | 0.0 |
| min | 5.959412e+11 | 0.0 |
| 25% | 3.175232e+15 | 0.0 |
| 50% | 5.439911e+15 | 0.0 |
| 75% | 7.725629e+15 | 0.0 |
| max | 9.999920e+15 | 0.0 |
time: 58.7 ms
submit = pd.concat([gbm_yes, gbm_no])  # DataFrame.append was removed in pandas 2.0
time: 4.19 ms
submit.describe()
| | id | pred |
| count | 5.020000e+04 | 50200.000000 |
| mean | 5.449990e+15 | 0.018745 |
| std | 2.628886e+15 | 0.135625 |
| min | 5.959412e+11 | 0.000000 |
| 25% | 3.177008e+15 | 0.000000 |
| 50% | 5.441108e+15 | 0.000000 |
| 75% | 7.726328e+15 | 0.000000 |
| max | 9.999920e+15 | 1.000000 |
time: 20.4 ms
submit_gbm.describe()
| | id | pred |
| count | 5.020000e+04 | 50200.000000 |
| mean | 5.449990e+15 | 0.085097 |
| std | 2.628886e+15 | 0.071304 |
| min | 5.959412e+11 | 0.000000 |
| 25% | 3.177008e+15 | 0.036845 |
| 50% | 5.441108e+15 | 0.062206 |
| 75% | 7.726328e+15 | 0.113462 |
| max | 9.999920e+15 | 0.632138 |
time: 20.8 ms
submit.info()
Int64Index: 50200 entries, 91 to 50199
Data columns (total 2 columns):
id 50200 non-null int64
pred 50200 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 9.36 ms
submit = pd.concat([submit_xgb, submit_gbm])  # DataFrame.append was removed in pandas 2.0
submit = submit.groupby(by='id').sum().reset_index()
submit.describe()
| | id | pred |
| count | 5.020000e+04 | 50200.000000 |
| mean | 5.449990e+15 | 0.169012 |
| std | 2.628886e+15 | 0.139313 |
| min | 5.959412e+11 | 0.000000 |
| 25% | 3.177008e+15 | 0.076237 |
| 50% | 5.441108e+15 | 0.125893 |
| 75% | 7.726328e+15 | 0.222622 |
| max | 9.999920e+15 | 1.124561 |
time: 41.7 ms
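Summing the two models pushes some values above 1 (the maximum above is 1.124561). Averaging keeps the blend in the [0, 1] range while ranking users exactly as the sum does, so the AUC is unaffected. A small sketch with invented predictions (`pred_a` and `pred_b` are hypothetical stand-ins for `submit_xgb` and `submit_gbm`):

```python
import pandas as pd

# Hypothetical predictions from two models for the same three users.
pred_a = pd.DataFrame({'id': [1, 2, 3], 'pred': [0.2, 0.6, 0.1]})
pred_b = pd.DataFrame({'id': [1, 2, 3], 'pred': [0.4, 0.5, 0.3]})

# Stack both frames and aggregate per id; the mean stays within [0, 1].
stacked = pd.concat([pred_a, pred_b], ignore_index=True)
blend_mean = stacked.groupby('id', as_index=False)['pred'].mean()
blend_sum = stacked.groupby('id', as_index=False)['pred'].sum()

# Sum and mean differ only by a constant factor, so they rank users identically.
same_order = (blend_mean['pred'].rank() == blend_sum['pred'].rank()).all()
print(blend_mean)
```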
submit.head()
| | id | pred |
| 4 | 9297165066591558 | 1.0 |
| 14 | 8168181097053542 | 1.0 |
| 18 | 6473515505643555 | 1.0 |
| 25 | 4641233171005560 | 1.0 |
| 29 | 6759757036024682 | 1.0 |
time: 6.16 ms
submit_xgb[submit_xgb['id']==595941207920]
| | id | pred |
| 8048 | 595941207920 | 0.185561 |
time: 7.07 ms
submit_gbm[submit_gbm['id']==595941207920]
| | id | pred |
| 8048 | 595941207920 | 0.114782 |
time: 6.33 ms
submit.info()
Int64Index: 50200 entries, 14 to 50199
Data columns (total 2 columns):
id 50200 non-null int64
pred 50200 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 8 ms
All ones
t1_id['pred'] = 1.0
submit = t1_id.copy()
submit.info()
Int64Index: 50200 entries, 0 to 99852
Data columns (total 2 columns):
id 50200 non-null int64
pred 50200 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 8.79 ms
submit.head()
| | id | pred |
| 0 | 6401824160010748 | 1.0 |
| 1 | 6506134548135499 | 1.0 |
| 2 | 5996920884619954 | 1.0 |
| 3 | 1187209424543713 | 1.0 |
| 4 | 9297165066591558 | 1.0 |
time: 13.1 ms
submit.columns = ['ID', 'Pred']
submit['ID'] = submit['ID'].astype(str)
time: 36.7 ms
submit.info()
Int64Index: 50200 entries, 14 to 50199
Data columns (total 2 columns):
ID 50200 non-null object
Pred 50200 non-null float64
dtypes: float64(1), object(1)
memory usage: 1.1+ MB
time: 10.1 ms
submit.to_csv('../submit.csv')
time: 126 ms
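Note that `to_csv` writes the row index as an extra unnamed first column by default. Whether the scoring platform tolerates that column is not stated in the source; assuming it expects exactly the `ID` and `Pred` columns, passing `index=False` would be the safer call. A sketch with a hypothetical two-row frame, written to memory instead of disk:

```python
import io
import pandas as pd

# Hypothetical two-row submission using the notebook's column names.
submit_demo = pd.DataFrame({'ID': [6401824160010748, 595941207920], 'Pred': [0.1, 0.2]})
submit_demo['ID'] = submit_demo['ID'].astype(str)  # the task asks for string IDs

# Default to_csv writes the row index as an unnamed first column.
with_index = io.StringIO()
submit_demo.to_csv(with_index)

# index=False writes only the ID and Pred columns.
without_index = io.StringIO()
submit_demo.to_csv(without_index, index=False)

print(with_index.getvalue().splitlines()[0])     # header including the index column
print(without_index.getvalue().splitlines()[0])  # header with ID and Pred only
```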
!wget -O kesci_submit https://www.heywhale.com/kesci_submit&&chmod +x kesci_submit
wget: /opt/conda/lib/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /opt/conda/lib/libssl.so.1.0.0: no version information available (required by wget)
wget: /opt/conda/lib/libssl.so.1.0.0: no version information available (required by wget)
--2019-07-31 08:15:56-- https://www.heywhale.com/kesci_submit
Resolving www.heywhale.com (www.heywhale.com)... 106.15.25.147
Connecting to www.heywhale.com (www.heywhale.com)|106.15.25.147|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6528405 (6.2M) [application/octet-stream]
Saving to: ‘kesci_submit’
kesci_submit 100%[===================>] 6.23M 12.1MB/s in 0.5s
2019-07-31 08:15:57 (12.1 MB/s) - ‘kesci_submit’ saved [6528405/6528405]
time: 1.83 s
!https_proxy="http://klab-external-proxy" ./kesci_submit -file ../submit.csv -token 578549794d544bff
Kesci Submit Tool 3.0
> Token verified
> Submitting file ../submit.csv (1312.26 KiB)
> File uploaded
> Submission complete
time: 1.7 s
!./kesci_submit -token 578549794d544bff -file ../submit.csv
Kesci Submit Tool
Result File: ../submit.csv (1.28 MiB)
Uploading: 7%====================
Submit Failed.
Serevr Response:
400 - {"message":"The submission tool version is outdated; please download the new tool as described on the competition submission page"}
time: 1 s
!ls ../
input pred.csv work
time: 665 ms
!wget -nv -O kesci_submit https://www.heywhale.com/kesci_submit&&chmod +x kesci_submit
wget: /opt/conda/lib/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /opt/conda/lib/libssl.so.1.0.0: no version information available (required by wget)
wget: /opt/conda/lib/libssl.so.1.0.0: no version information available (required by wget)
2019-07-02 08:08:23 URL:https://www.heywhale.com/kesci_submit [7842088/7842088] -> "kesci_submit" [1]
time: 1.47 s
0 Inspecting the data
0.1 Training data
0.1.1 Positive samples
q1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q1.txt', sep='\t', header=None))
Mem. usage decreased to 0.16 Mb (53.1% reduction)
time: 23 ms
q1.columns = ['year_month', 'id', 'consume', 'label']
time: 1.21 ms
q1 = q1.dropna(axis=0)
time: 6.72 ms
q1.head()
| | year_month | id | consume | label |
| 2 | 201706 | 8160829951314300 | 82.75000 | 1 |
| 3 | 201707 | 8160829951314300 | 37.68750 | 1 |
| 4 | 201706 | 1508075698521400 | 68.00000 | 1 |
| 5 | 201707 | 1508075698521400 | 49.59375 | 1 |
| 6 | 201706 | 1686251204809800 | 200.75000 | 1 |
time: 6.82 ms
q1.describe()
| | year_month | id | consume | label |
| count | 10865.000000 | 1.086500e+04 | 1.086500e+04 | 10865.0 |
| mean | 201706.499678 | 5.417732e+15 | inf | 1.0 |
| std | 0.500023 | 2.635784e+15 | inf | 0.0 |
| min | 201706.000000 | 1.448104e+12 | 4.998779e-02 | 1.0 |
| 25% | 201706.000000 | 3.118365e+15 | 4.068750e+01 | 1.0 |
| 50% | 201706.000000 | 5.456594e+15 | 9.837500e+01 | 1.0 |
| 75% | 201707.000000 | 7.687339e+15 | 1.785000e+02 | 1.0 |
| max | 201707.000000 | 9.997949e+15 | 1.324000e+03 | 1.0 |
time: 37.1 ms
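The `inf` mean and standard deviation of `consume` are most likely a side effect of `reduce_mem_usage` storing the column as float16: that type tops out at 65504, so a sum over roughly ten thousand rows overflows. A minimal NumPy sketch of the effect, using an invented constant column of the same length:

```python
import numpy as np

# float16 cannot represent values above 65504, so large sums overflow to inf.
f16_max = float(np.finfo(np.float16).max)

# Hypothetical consumption values stored as float16, as reduce_mem_usage does.
consume = np.full(10865, 100.0, dtype=np.float16)

# A float16 result overflows; a float64 accumulator does not.
total_f16 = consume.sum(dtype=np.float16)
total_f64 = consume.sum(dtype=np.float64)
print(f16_max, total_f16, total_f64)
```

Casting the column back with `astype('float32')` or `astype('float64')` before aggregating would avoid the overflow, at the cost of some of the memory savings.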
q1.info()
Int64Index: 10865 entries, 2 to 11199
Data columns (total 4 columns):
year_month 10865 non-null int32
id 10865 non-null int64
consume 10865 non-null float16
label 10865 non-null int8
dtypes: float16(1), int32(1), int64(1), int8(1)
memory usage: 244.0 KB
time: 6.9 ms
%matplotlib inline
q1.consume.plot()
Matplotlib is building the font cache using fc-list. This may take a moment.
time: 11.3 s
q1[q1.consume == 1323.74]
| | year_month | id | consume | label |
| 4867 | 201707 | 5510977603357000 | 1324.0 | 1 |
time: 11.1 ms
q2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q2.txt', sep='\t', header=None))
Mem. usage decreased to 11.31 Mb (14.6% reduction)
time: 291 ms
q2 = q2.dropna(axis=0)
time: 77.7 ms
q2.head()
| | 0 | 1 | 2 | 3 | 4 | 5 |
| 1 | 1752398069509000 | 华为 | PLK-AL10 | 20170609223138 | 20170609224345 | 1 |
| 2 | 1752398069509000 | 乐视 | LETV X501 | 20160924102711 | 20160924112425 | 1 |
| 3 | 1752398069509000 | 金立 | 金立 GN800 | 20150331210255 | 20150630131232 | 1 |
| 4 | 1752398069509000 | 金立 | GIONEE M5 | 20170508191216 | 20170605192347 | 1 |
| 5 | 1752398069509000 | 华为 | PLK-AL10 | 20160618182839 | 20170731235959 | 1 |
time: 8.16 ms
q2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
time: 1.15 ms
q2.head()
| | id | brand | type | first_use_time | recent_use_time | label |
| 1 | 1752398069509000 | 华为 | PLK-AL10 | 20170609223138 | 20170609224345 | 1 |
| 2 | 1752398069509000 | 乐视 | LETV X501 | 20160924102711 | 20160924112425 | 1 |
| 3 | 1752398069509000 | 金立 | 金立 GN800 | 20150331210255 | 20150630131232 | 1 |
| 4 | 1752398069509000 | 金立 | GIONEE M5 | 20170508191216 | 20170605192347 | 1 |
| 5 | 1752398069509000 | 华为 | PLK-AL10 | 20160618182839 | 20170731235959 | 1 |
time: 8.58 ms
q2.describe()
| | id | first_use_time | recent_use_time | label |
| count | 1.973760e+05 | 1.973760e+05 | 1.973760e+05 | 197376.0 |
| mean | 5.436228e+15 | 2.015597e+13 | 2.015684e+13 | 1.0 |
| std | 2.642924e+15 | 2.685010e+11 | 2.685124e+11 | 0.0 |
| min | 1.448104e+12 | -1.000000e+00 | -1.000000e+00 | 1.0 |
| 25% | 3.227267e+15 | 2.015122e+13 | 2.016013e+13 | 1.0 |
| 50% | 5.353833e+15 | 2.016052e+13 | 2.016060e+13 | 1.0 |
| 75% | 7.764521e+15 | 2.016102e+13 | 2.016112e+13 | 1.0 |
| max | 9.997949e+15 | 2.017073e+13 | 2.017073e+13 | 1.0 |
time: 64.7 ms
q2.info()
Int64Index: 197376 entries, 1 to 289201
Data columns (total 6 columns):
id 197376 non-null int64
brand 197376 non-null object
type 197376 non-null object
first_use_time 197376 non-null int64
recent_use_time 197376 non-null int64
label 197376 non-null int8
dtypes: int64(3), int8(1), object(2)
memory usage: 9.2+ MB
time: 41.7 ms
q3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q3.txt', sep='\t', header=None))
Mem. usage decreased to 0.18 Mb (64.6% reduction)
time: 18.4 ms
q3 = q3.dropna(axis=0)
time: 6.41 ms
q3.head()
| | 0 | 1 | 2 | 3 | 4 | 5 |
| 0 | 201707 | 6062475264825100 | 88 | 1 | 0 | 1 |
| 1 | 201707 | 8160829951314300 | 27 | 0 | 0 | 1 |
| 2 | 201707 | 1508075698521400 | 19 | 0 | 0 | 1 |
| 3 | 201707 | 1686251204809800 | 207 | 0 | 0 | 1 |
| 4 | 201707 | 5627768389537500 | 133 | 1 | 0 | 1 |
time: 7.62 ms
q3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
time: 1.16 ms
q3.head()
| | year_month | id | call_nums | is_trans_provincial | is_transnational | label |
| 0 | 201707 | 6062475264825100 | 88 | 1 | 0 | 1 |
| 1 | 201707 | 8160829951314300 | 27 | 0 | 0 | 1 |
| 2 | 201707 | 1508075698521400 | 19 | 0 | 0 | 1 |
| 3 | 201707 | 1686251204809800 | 207 | 0 | 0 | 1 |
| 4 | 201707 | 5627768389537500 | 133 | 1 | 0 | 1 |
time: 7.37 ms
q3.describe()
| | year_month | id | call_nums | is_trans_provincial | is_transnational | label |
| count | 11200.000000 | 1.120000e+04 | 11200.000000 | 11200.000000 | 11200.000000 | 11200.0 |
| mean | 201706.500000 | 5.416583e+15 | 70.562232 | 0.235446 | 0.014464 | 1.0 |
| std | 0.500022 | 2.642827e+15 | 61.820144 | 0.424296 | 0.119400 | 0.0 |
| min | 201706.000000 | 1.448104e+12 | -1.000000 | 0.000000 | 0.000000 | 1.0 |
| 25% | 201706.000000 | 3.117220e+15 | 25.000000 | 0.000000 | 0.000000 | 1.0 |
| 50% | 201706.500000 | 5.456254e+15 | 54.000000 | 0.000000 | 0.000000 | 1.0 |
| 75% | 201707.000000 | 7.702940e+15 | 99.250000 | 0.000000 | 0.000000 | 1.0 |
| max | 201707.000000 | 9.997949e+15 | 727.000000 | 1.000000 | 1.000000 | 1.0 |
time: 79.6 ms
q3.info()
Int64Index: 11200 entries, 0 to 11199
Data columns (total 6 columns):
year_month 11200 non-null int32
id 11200 non-null int64
call_nums 11200 non-null int16
is_trans_provincial 11200 non-null int8
is_transnational 11200 non-null int8
label 11200 non-null int8
dtypes: int16(1), int32(1), int64(1), int8(3)
memory usage: 273.4 KB
time: 7.47 ms
q4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q4.txt', sep='\t', header=None))
q4 = q4.dropna(axis=0)
q4.columns = ['year_month', 'id', 'province', 'label']
time: 935 µs
q4.head()
| | year_month | id | province | label |
| 0 | 201707 | 6062475264825100 | 广东 | 1 |
| 1 | 201707 | 5627768389537500 | 北京 | 1 |
| 2 | 201707 | 2000900444179600 | 山西 | 1 |
| 3 | 201707 | 5304502776817600 | 四川 | 1 |
| 4 | 201707 | 5304502776817600 | 四川 | 1 |
time: 6.84 ms
q4.describe()
| | year_month | id | label |
| count | 7218.000000 | 7.218000e+03 | 7218.0 |
| mean | 201706.538515 | 5.341915e+15 | 1.0 |
| std | 0.498549 | 2.631231e+15 | 0.0 |
| min | 201706.000000 | 1.739872e+13 | 1.0 |
| 25% | 201706.000000 | 3.037311e+15 | 1.0 |
| 50% | 201707.000000 | 5.367106e+15 | 1.0 |
| 75% | 201707.000000 | 7.545199e+15 | 1.0 |
| max | 201707.000000 | 9.987407e+15 | 1.0 |
time: 22.2 ms
q4.info()
Int64Index: 7218 entries, 0 to 7288
Data columns (total 4 columns):
year_month 7218 non-null int32
id 7218 non-null int64
province 7218 non-null object
label 7218 non-null int8
dtypes: int32(1), int64(1), int8(1), object(1)
memory usage: 204.4+ KB
time: 6.74 ms
!ls /home/kesci/input/gzlt/train_set/201708q/
201708q1.txt 201708q3.txt 201708q6.txt
201708q2.txt 201708q4.txt 201708q7.txt
time: 667 ms
q6 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q6.txt', sep='\t', header=None))
Mem. usage decreased to 62.58 Mb (52.1% reduction)
time: 3.9 s
q6.columns = ['date', 'hour', 'id', 'user_longitude', 'user_latitude', 'label']
time: 868 µs
q6.head()
| | date | hour | id | user_longitude | user_latitude | label |
| 0 | 2017-07-18 | 8.0 | 9239265006758100 | 106.467545 | 26.58625 | 1 |
| 1 | 2017-07-10 | 0.0 | 3859201812337600 | 106.708213 | 26.57854 | 1 |
| 2 | 2017-07-16 | 18.0 | 3859201812337600 | 106.545690 | 26.56724 | 1 |
| 3 | 2017-07-17 | 8.0 | 3859201812337600 | 106.545690 | 26.56724 | 1 |
| 4 | 2017-07-27 | 16.0 | 3859201812337600 | 106.545690 | 26.56724 | 1 |
time: 16.7 ms
q6.describe()
| | hour | id | user_longitude | user_latitude | label |
| count | 2.852871e+06 | 2.852871e+06 | 2.851527e+06 | 2.851527e+06 | 2852871.0 |
| mean | 1.141897e+01 | 5.415213e+15 | 1.068143e+02 | 2.659968e+01 | 1.0 |
| std | 6.632995e+00 | 2.634349e+15 | 5.580043e-01 | 2.852525e-01 | 0.0 |
| min | 0.000000e+00 | 1.448104e+12 | 1.036700e+02 | 2.470664e+01 | 1.0 |
| 25% | 6.000000e+00 | 3.135488e+15 | 1.066656e+02 | 2.654610e+01 | 1.0 |
| 50% | 1.200000e+01 | 5.442594e+15 | 1.067027e+02 | 2.658143e+01 | 1.0 |
| 75% | 1.800000e+01 | 7.687963e+15 | 1.067373e+02 | 2.662629e+01 | 1.0 |
| max | 2.200000e+01 | 9.997949e+15 | 1.095277e+02 | 2.909348e+01 | 1.0 |
time: 775 ms
q6.info()
RangeIndex: 2852871 entries, 0 to 2852870
Data columns (total 6 columns):
date object
hour float64
id int64
user_longitude float64
user_latitude float64
label int64
dtypes: float64(3), int64(2), object(1)
memory usage: 130.6+ MB
time: 3.24 ms
q7 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q7.txt', sep='\t', header=None))
Mem. usage decreased to 3.80 Mb (42.5% reduction)
time: 137 ms
q7 = q7.dropna(axis=0)
time: 35.4 ms
q7.columns = ['year_month', 'id', 'app', 'flow', 'label']
time: 1.54 ms
q7.head()
| | year_month | id | app | flow | label |
| 0 | 201707 | 6610350034824100 | 腾讯手机管家 | 0.010002 | 1 |
| 1 | 201707 | 6997210664840100 | 喜马拉雅FM | 27.390625 | 1 |
| 2 | 201707 | 3198621664927300 | 网易新闻 | 0.029999 | 1 |
| 3 | 201707 | 9987406611703100 | 喜马拉雅FM | 0.000000 | 1 |
| 4 | 201707 | 1785540174324200 | 天气通 | 0.020004 | 1 |
time: 8.14 ms
q7.describe()
| | year_month | id | flow | label |
| count | 173117.000000 | 1.731170e+05 | 173117.000000 | 173117.0 |
| mean | 201706.539699 | 5.403100e+15 | NaN | 1.0 |
| std | 0.498423 | 2.667026e+15 | NaN | 0.0 |
| min | 201706.000000 | 1.448104e+12 | 0.000000 | 1.0 |
| 25% | 201706.000000 | 3.056260e+15 | 0.010002 | 1.0 |
| 50% | 201707.000000 | 5.429056e+15 | 0.080017 | 1.0 |
| 75% | 201707.000000 | 7.730223e+15 | 1.599609 | 1.0 |
| max | 201707.000000 | 9.997949e+15 | 7828.000000 | 1.0 |
time: 70.4 ms
q7.info()
Int64Index: 173117 entries, 0 to 173116
Data columns (total 5 columns):
year_month 173117 non-null int32
id 173117 non-null int64
app 173117 non-null object
flow 173117 non-null float16
label 173117 non-null int8
dtypes: float16(1), int32(1), int64(1), int8(1), object(1)
memory usage: 5.1+ MB
time: 29.8 ms
q1
Sum the consumption amounts of the two months
q1.head()
| | year_month | id | consume | label |
| 2 | 201706 | 8160829951314300 | 82.75000 | 1 |
| 3 | 201707 | 8160829951314300 | 37.68750 | 1 |
| 4 | 201706 | 1508075698521400 | 68.00000 | 1 |
| 5 | 201707 | 1508075698521400 | 49.59375 | 1 |
| 6 | 201706 | 1686251204809800 | 200.75000 | 1 |
time: 7.05 ms
q1 = q1[['id', 'consume']]
time: 2.91 ms
q1_groupbyid = q1.groupby(['id']).agg({'consume': pd.Series.sum})
time: 747 ms
len(q1)
10865
time: 8.1 ms
q1[q1['id']==1448103998000]
| | id | consume |
| 3532 | 1448103998000 | 18.09375 |
| 3533 | 1448103998000 | 44.28125 |
time: 8.84 ms
q1_groupbyid[:10]
| id | consume |
| 1448103998000 | 62.37500 |
| 17398718813730 | 460.75000 |
| 61132623486000 | 12.28125 |
| 68156596675520 | 903.50000 |
| 76819334576430 | 282.25000 |
| 78745100940550 | 531.00000 |
| 110229638660000 | 253.00000 |
| 122134826301000 | 138.75000 |
| 132923269304000 | 26.81250 |
| 138204830829320 | 387.50000 |
time: 5.8 ms
q2
Feature 1: one-hot flags for the top 9 phone brands used, plus an "other" bucket (10 in total)
Feature 2: number of distinct brands used
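The two brand features listed above could be built roughly as follows. This is a sketch on invented records rather than the notebook's actual code: it keeps the top 2 brands instead of the top 9 to stay small, and the frame `devices` and the brand values are hypothetical.

```python
import pandas as pd

# Hypothetical device records; brand strings are examples only.
devices = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3],
    'brand': ['华为', '苹果', '华为', '苹果', '小众A', '小众B'],
})

# Keep the most frequent brands (top 2 here, top 9 in the notebook); map the rest to '其它'.
top = devices['brand'].value_counts().index[:2]
devices['brand_top'] = devices['brand'].where(devices['brand'].isin(top), '其它')

# Feature 1: one flag per kept brand, 1 if the user ever used it.
dummies = pd.get_dummies(devices['brand_top'], prefix='brand')
flags = dummies.groupby(devices['id']).max().astype(int)

# Feature 2: number of distinct brands per user.
phone_nums = devices.groupby('id')['brand'].nunique().rename('phone_nums')

features = flags.join(phone_nums).reset_index()
print(features)
```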
q2 = q2[['id', 'brand']]
time: 4.86 ms
q2.head(10)
| | id | brand |
| 1 | 1752398069509000 | 华为 |
| 2 | 1752398069509000 | 乐视 |
| 3 | 1752398069509000 | 金立 |
| 4 | 1752398069509000 | 金立 |
| 5 | 1752398069509000 | 华为 |
| 6 | 1752398069509000 | 华为 |
| 7 | 1752398069509000 | 金立 |
| 8 | 1752398069509000 | 三星 |
| 9 | 4799656026499908 | 三星 |
| 10 | 4799656026499908 | 华为 |
time: 6.36 ms
groupbybrand = q2['brand'].value_counts()
time: 18.7 ms
len(groupbybrand)
750
time: 2.09 ms
%matplotlib inline
groupbybrand.plot()
time: 454 ms
groupbybrand[:10]
苹果 62347
华为 22266
欧珀 20516
维沃 17158
三星 13435
小米 10632
金立 9922
魅族 9708
乐视 5609
四季恒美 2163
Name: brand, dtype: int64
time: 3.52 ms
q2 = q2.drop_duplicates()
groupbyid = q2['id'].value_counts()
time: 19.6 ms
len(groupbyid)
5597
time: 2.23 ms
%matplotlib inline
groupbyid.plot()
time: 294 ms
groupbyid[:10]
4104535378288025 115
8707678197418467 108
3900535090108175 104
3986280749497468 93
9196501153454276 88
5510977603357000 84
8569492566715454 78
1106540188374027 71
4091371962011072 71
4874962666674313 71
Name: id, dtype: int64
time: 3.27 ms
q1[q1['id']==4104535378288025]
| | year_month | id | consume | label |
| 10576 | 201706 | 4104535378288025 | 208.000 | 1 |
| 10577 | 201707 | 4104535378288025 | 205.125 | 1 |
time: 7.63 ms
time: 364 µs
type(groupbyid)
pandas.core.series.Series
time: 2.14 ms
type(groupbyid.to_frame())
pandas.core.frame.DataFrame
time: 3.13 ms
q2_groupbyid = groupbyid.reset_index()
time: 2.34 ms
q2_groupbyid.columns = ['id', 'phone_nums']
time: 1.19 ms
q2_groupbyid.head()
| | id | phone_nums |
| 0 | 4104535378288025 | 115 |
| 1 | 8707678197418467 | 108 |
| 2 | 3900535090108175 | 104 |
| 3 | 3986280749497468 | 93 |
| 4 | 9196501153454276 | 88 |
time: 6.12 ms
type(q1_groupbyid)
pandas.core.frame.DataFrame
time: 2.15 ms
pos_set = q1_groupbyid.merge(q2_groupbyid, on=['id'])
time: 6.42 ms
pos_set.head()
| | id | consume | phone_nums |
| 0 | 1448103998000 | 62.37500 | 6 |
| 1 | 17398718813730 | 460.75000 | 23 |
| 2 | 61132623486000 | 12.28125 | 1 |
| 3 | 68156596675520 | 903.50000 | 4 |
| 4 | 76819334576430 | 282.25000 | 21 |
time: 7.11 ms
pos_set.info()
Int64Index: 5473 entries, 0 to 5472
Data columns (total 3 columns):
id 5473 non-null int64
consume 5473 non-null float16
phone_nums 5473 non-null int64
dtypes: float16(1), int64(2)
memory usage: 139.0 KB
time: 6.27 ms
q3
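1. Sum the contact-circle size over the two months
2. Sum the roamed-out-of-province flag over the two months (yes: 1, no: 0)
3. Sum the roamed-abroad flag over the two months (yes: 1, no: 0)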
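The three sums listed above can also be computed in a single `groupby` call instead of three separate aggregations. A sketch on invented rows (the frame `roam` is a hypothetical stand-in for `q3`):

```python
import pandas as pd

# Hypothetical two-month roaming records for two users.
roam = pd.DataFrame({
    'year_month': [201706, 201707, 201706, 201707],
    'id': [1, 1, 2, 2],
    'call_nums': [88, 27, 19, 207],
    'is_trans_provincial': [1, 0, 0, 0],
    'is_transnational': [0, 0, 0, 1],
})

# One groupby call sums all three columns per user across both months.
cols = ['call_nums', 'is_trans_provincial', 'is_transnational']
per_user = roam.groupby('id', as_index=False)[cols].sum()
print(per_user)
```

One caveat: `q3.describe()` reports a minimum `call_nums` of -1, which looks like a missing-value marker. If so, summing it as-is distorts the feature, and replacing -1 with 0 or NaN first may be worth trying; this is a guess about the data, not something the source confirms.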
q3.head()
| | year_month | id | call_nums | is_trans_provincial | is_transnational | label |
| 0 | 201707 | 6062475264825100 | 88 | 1 | 0 | 1 |
| 1 | 201707 | 8160829951314300 | 27 | 0 | 0 | 1 |
| 2 | 201707 | 1508075698521400 | 19 | 0 | 0 | 1 |
| 3 | 201707 | 1686251204809800 | 207 | 0 | 0 | 1 |
| 4 | 201707 | 5627768389537500 | 133 | 1 | 0 | 1 |
time: 7.69 ms
q3_groupbyid_call = q3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
q3_groupbyid_provincial = q3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
q3_groupbyid_trans = q3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
time: 1.95 s
pos_set = pos_set.merge(q3_groupbyid_call, on=['id'])
time: 5.14 ms
pos_set.head()
| | id | consume | phone_nums | call_nums |
| 0 | 1448103998000 | 62.37500 | 6 | 21 |
| 1 | 17398718813730 | 460.75000 | 23 | 217 |
| 2 | 61132623486000 | 12.28125 | 1 | 61 |
| 3 | 68156596675520 | 903.50000 | 4 | 353 |
| 4 | 76819334576430 | 282.25000 | 21 | 431 |
time: 7.94 ms
pos_set = pos_set.merge(q3_groupbyid_provincial, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_trans, on=['id'])
time: 9.61 ms
pos_set.info()
Int64Index: 5473 entries, 0 to 5472
Data columns (total 6 columns):
id 5473 non-null int64
consume 5473 non-null float16
phone_nums 5473 non-null int64
call_nums 5473 non-null int16
is_trans_provincial 5473 non-null int8
is_transnational 5473 non-null int8
dtypes: float16(1), int16(1), int64(2), int8(2)
memory usage: 160.3 KB
time: 7.3 ms
q4
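1. Number of roam-out records over the two months
2. One-hot encoding of all provinces, or of the top 10 provinces plus "other"
3. Number of distinct provinces roamed to over the two months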
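Ideas 1 and 3 from the list above can come out of one aggregation. The notebook itself only builds the record count (`province_out_cnt`); the distinct-province column below, named `province_nunique` here, is a hypothetical addition, and the rows are invented:

```python
import pandas as pd

# Hypothetical roam-out records (one row per recorded trip).
trips = pd.DataFrame({
    'id': [1, 1, 1, 2],
    'province': ['福建', '福建', '河南', '重庆'],
})

# Idea 1: number of records per user; idea 3: number of distinct provinces.
per_user = trips.groupby('id')['province'].agg(
    province_out_cnt='count',
    province_nunique='nunique',
).reset_index()
print(per_user)
```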
q4.head(10)
| | year_month | id | province | label |
| 0 | 201707 | 6062475264825100 | 广东 | 1 |
| 1 | 201707 | 5627768389537500 | 北京 | 1 |
| 2 | 201707 | 2000900444179600 | 山西 | 1 |
| 3 | 201707 | 5304502776817600 | 四川 | 1 |
| 4 | 201707 | 5304502776817600 | 四川 | 1 |
| 5 | 201707 | 5304502776817600 | 四川 | 1 |
| 6 | 201707 | 5304502776817600 | 重庆 | 1 |
| 7 | 201707 | 8594396491246200 | 广西 | 1 |
| 8 | 201707 | 8594396491246200 | 广西 | 1 |
| 9 | 201707 | 8594396491246200 | 广西 | 1 |
time: 8.78 ms
q4_groupbyid = q4[['id', 'province']].groupby(['id']).agg({'province': pd.Series.unique})
q4_groupbyid.head()
| id | province |
| 17398718813730 | 重庆 |
| 61132623486000 | [福建, 河南, 江苏, 安徽] |
| 68156596675520 | [辽宁, 广东] |
| 132923269304000 | 江西 |
| 138204830829320 | 浙江 |
time: 322 ms
q4_groupbyid = q4[['id', 'province']].groupby(['id']).size()
q4_groupbyid.head()
id
17398718813730 1
61132623486000 8
68156596675520 3
132923269304000 1
138204830829320 2
dtype: int64
time: 6.52 ms
q4[q4['id']==61132623486000]
| | year_month | id | province | label |
| 461 | 201707 | 61132623486000 | 福建 | 1 |
| 462 | 201707 | 61132623486000 | 福建 | 1 |
| 463 | 201707 | 61132623486000 | 福建 | 1 |
| 4363 | 201706 | 61132623486000 | 河南 | 1 |
| 4364 | 201706 | 61132623486000 | 江苏 | 1 |
| 4365 | 201706 | 61132623486000 | 安徽 | 1 |
| 4366 | 201706 | 61132623486000 | 安徽 | 1 |
| 4367 | 201706 | 61132623486000 | 江苏 | 1 |
time: 8.26 ms
type(q4_groupbyid.reset_index())
pandas.core.frame.DataFrame
time: 4.03 ms
q4_groupbyid = q4_groupbyid.reset_index()
q4_groupbyid.columns = ['id', 'province_out_cnt']
time: 2.73 ms
q4_groupbyid.head()
| | id | province_out_cnt |
| 0 | 17398718813730 | 1 |
| 1 | 61132623486000 | 8 |
| 2 | 68156596675520 | 3 |
| 3 | 132923269304000 | 1 |
| 4 | 138204830829320 | 2 |
time: 5.73 ms
pos_set = pos_set.merge(q4_groupbyid, how='left', on=['id'])
pos_set.head()
| | id | consume | phone_nums | call_nums | is_trans_provincial | is_transnational | province_out_cnt |
| 0 | 1448103998000 | 62.37500 | 6 | 21 | 0 | 0 | NaN |
| 1 | 17398718813730 | 460.75000 | 23 | 217 | 1 | 0 | 1.0 |
| 2 | 61132623486000 | 12.28125 | 1 | 61 | 2 | 0 | 8.0 |
| 3 | 68156596675520 | 903.50000 | 4 | 353 | 2 | 0 | 3.0 |
| 4 | 76819334576430 | 282.25000 | 21 | 431 | 0 | 0 | NaN |
time: 14.6 ms
pos_set.info()
Int64Index: 5473 entries, 0 to 5472
Data columns (total 7 columns):
id 5473 non-null int64
consume 5473 non-null float16
phone_nums 5473 non-null int64
call_nums 5473 non-null int16
is_trans_provincial 5473 non-null int8
is_transnational 5473 non-null int8
province_out_cnt 1913 non-null float64
dtypes: float16(1), float64(1), int16(1), int64(2), int8(2)
memory usage: 203.1 KB
time: 7.53 ms
pos_set = pos_set.fillna(0)
time: 2.46 ms
pos_set.info()
Int64Index: 5473 entries, 0 to 5472
Data columns (total 7 columns):
id 5473 non-null int64
consume 5473 non-null float16
phone_nums 5473 non-null int64
call_nums 5473 non-null int16
is_trans_provincial 5473 non-null int8
is_transnational 5473 non-null int8
province_out_cnt 5473 non-null float64
dtypes: float16(1), float64(1), int16(1), int64(2), int8(2)
memory usage: 203.1 KB
time: 8.02 ms
Int64Index: 1913 entries, 0 to 1912
Data columns (total 7 columns):
id 1913 non-null int64
consume 1913 non-null float16
phone_nums 1913 non-null int64
call_nums 1913 non-null int16
is_trans_provincial 1913 non-null int8
is_transnational 1913 non-null int8
province_out_cnt 1913 non-null int64
dtypes: float16(1), int16(1), int64(3), int8(2)
memory usage: 71.0 KB
time: 6.67 ms
q6: ignored for now
q7
1. Total traffic used
2. Number of distinct apps used
3. Whether certain (travel-related) apps were used
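The notebook lists these three ideas but does not go on to build them. One possible sketch is shown below, on invented rows; the contents of `travel_apps` are an assumption made only for illustration, and `flow` is cast to float64 first because the notebook's float16 per-app sums come out as `inf`.

```python
import pandas as pd

# Hypothetical app usage rows; the travel app list is an assumption for illustration.
usage = pd.DataFrame({
    'id': [1, 1, 2, 2, 2],
    'app': ['微信', '携程旅行', '微信', 'QQ', '微信'],
    'flow': [10.5, 2.0, 1.0, 3.0, 4.0],
})
travel_apps = {'携程旅行', '去哪儿旅行'}

# Cast to float64 before summing to avoid float16 overflow.
usage['flow'] = usage['flow'].astype('float64')
usage['is_travel'] = usage['app'].isin(travel_apps).astype(int)

per_user = usage.groupby('id').agg(
    flow_sum=('flow', 'sum'),            # idea 1: total traffic
    app_nunique=('app', 'nunique'),      # idea 2: distinct apps
    uses_travel_app=('is_travel', 'max'),  # idea 3: any travel app used
).reset_index()
print(per_user)
```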
q7.head()
| | year_month | id | app | flow | label |
| 0 | 201707 | 6610350034824100 | 腾讯手机管家 | 0.010002 | 1 |
| 1 | 201707 | 6997210664840100 | 喜马拉雅FM | 27.390625 | 1 |
| 2 | 201707 | 3198621664927300 | 网易新闻 | 0.029999 | 1 |
| 3 | 201707 | 9987406611703100 | 喜马拉雅FM | 0.000000 | 1 |
| 4 | 201707 | 1785540174324200 | 天气通 | 0.020004 | 1 |
time: 7.94 ms
q7_groupbyapp = q7.groupby(['app']).agg({'flow': pd.Series.sum})
time: 135 ms
len(q7_groupbyapp)
762
time: 2.04 ms
q7_groupbyapp.sort_values(by='flow', ascending=False)
| app | flow |
| 网易云音乐 | inf |
| 爱奇艺视频 | inf |
| 微信 | inf |
| 新浪微博 | inf |
| QQ音乐 | inf |
| 今日头条 | inf |
| QQ | 57856.0 |
| 手机百度 | 53408.0 |
| 陌陌 | 43488.0 |
| iTunes | 35392.0 |
| 腾讯新闻 | 25952.0 |
| 快手 | 24256.0 |
| 手机淘宝 | 18400.0 |
| UC浏览器 | 16608.0 |
| 酷狗音乐 | 15360.0 |
| 高德地图 | 14984.0 |
| 酷我音乐 | 13488.0 |
| 新浪新闻 | 13432.0 |
| 唯品会 | 11504.0 |
| 腾讯视频 | 10760.0 |
| 优酷视频 | 10736.0 |
| 汽车之家 | 9984.0 |
| 百度地图 | 9816.0 |
| 美团 | 9400.0 |
| 网易新闻 | 8648.0 |
| AppStore | 7776.0 |
| 中国联通手机营业厅 | 6736.0 |
| 百度贴吧 | 6104.0 |
| 凤凰新闻 | 5504.0 |
| 虾米音乐 | 5020.0 |
| ... | ... |
| 百才招聘网 | 0.0 |
| 碰碰 | 0.0 |
| 禾文阿思看图购 | 0.0 |
| 科学作息时间表 | 0.0 |
| 章鱼输入法 | 0.0 |
| 米折 | 0.0 |
| 约会吧 | 0.0 |
| 网易微博 | 0.0 |
| 表情大全 | 0.0 |
| 欢乐互娱 | 0.0 |
| 博客大巴 | 0.0 |
| 查快递 | 0.0 |
| 邮储银行 | 0.0 |
| 号簿助手 | 0.0 |
| 司机邦 | 0.0 |
| 壁纸多多 | 0.0 |
| 天天聊 | 0.0 |
| 天翼阅读 | 0.0 |
| 安全管家 | 0.0 |
| 安卓游戏盒子 | 0.0 |
| 安软市场 | 0.0 |
| 车网互联 | 0.0 |
| 宜搜搜索 | 0.0 |
| 工程师爸爸 | 0.0 |
| 彩票控 | 0.0 |
| 贝瓦儿歌 | 0.0 |
| 搜狗壁纸 | 0.0 |
| 智远一户通 | 0.0 |
| 诚品快拍 | 0.0 |
| 07073手游中心 | 0.0 |
762 rows × 1 columns
time: 12.4 ms
pos_set.describe()
| | id | consume | phone_nums | call_nums | is_trans_provincial | is_transnational | province_out_cnt |
| count | 5.473000e+03 | 5473.000000 | 5473.000000 | 5473.000000 | 5473.000000 | 5473.000000 | 5473.000000 |
| mean | 5.417038e+15 | inf | 8.228942 | 141.201900 | 0.474511 | 0.029600 | 1.300018 |
| std | 2.637784e+15 | inf | 8.551830 | 121.262826 | 0.706162 | 0.187904 | 3.110401 |
| min | 1.448104e+12 | 0.099976 | 1.000000 | -2.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 3.113785e+15 | 82.000000 | 3.000000 | 52.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 5.457364e+15 | 198.250000 | 6.000000 | 108.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 7.688781e+15 | 355.250000 | 10.000000 | 198.000000 | 1.000000 | 0.000000 | 1.000000 |
| max | 9.997949e+15 | 2392.000000 | 115.000000 | 1035.000000 | 2.000000 | 2.000000 | 42.000000 |
time: 126 ms
pos_set['label'] = 1
| | id | consume | phone_nums | call_nums | is_trans_provincial | is_transnational | province_out_cnt | label |
| 0 | 1448103998000 | 62.37500 | 6 | 21 | 0 | 0 | NaN | 1 |
| 1 | 17398718813730 | 460.75000 | 23 | 217 | 1 | 0 | 1.0 | 1 |
| 2 | 61132623486000 | 12.28125 | 1 | 61 | 2 | 0 | 8.0 | 1 |
| 3 | 68156596675520 | 903.50000 | 4 | 353 | 2 | 0 | 3.0 | 1 |
| 4 | 76819334576430 | 282.25000 | 21 | 431 | 0 | 0 | NaN | 1 |
time: 10.5 ms
pos_set.fillna(0)  # returns a new DataFrame; the result is not assigned, so the NaN values remain below
pos_set.head()
| | id | consume | phone_nums | call_nums | is_trans_provincial | is_transnational | province_out_cnt | label |
| 0 | 1448103998000 | 62.37500 | 6 | 21 | 0 | 0 | NaN | 1 |
| 1 | 17398718813730 | 460.75000 | 23 | 217 | 1 | 0 | 1.0 | 1 |
| 2 | 61132623486000 | 12.28125 | 1 | 61 | 2 | 0 | 8.0 | 1 |
| 3 | 68156596675520 | 903.50000 | 4 | 353 | 2 | 0 | 3.0 | 1 |
| 4 | 76819334576430 | 282.25000 | 21 | 431 | 0 | 0 | NaN | 1 |
time: 23.5 ms
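The `province_out_cnt` column still shows `NaN` after the `fillna(0)` call because `DataFrame.fillna` returns a new frame by default and leaves the original untouched. A small illustration on made-up data:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"id": [1, 2], "province_out_cnt": [np.nan, 3.0]})

demo.fillna(0)                  # result discarded: demo is unchanged
still_missing = int(demo["province_out_cnt"].isna().sum())

demo = demo.fillna(0)           # keep the returned frame
missing_after = int(demo["province_out_cnt"].isna().sum())
print(still_missing, missing_after)
```

The negative-sample and test-set cells assign the result back (`neg_set = neg_set.fillna(0)`), which is why their outputs show `0.0` instead of `NaN`.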
0.1.2 Negative samples
n1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n1.txt', sep='\t', header=None))
n1.columns = ['year_month', 'id', 'consume', 'label']
n1 = n1.dropna(axis=0)
n1_groupbyid = n1[['id', 'consume']].groupby(['id']).agg({'consume': pd.Series.sum})
n2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n2.txt', sep='\t', header=None))
n2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
n2 = n2.dropna(axis=0)
n2 = n2[['id', 'brand']]
n2 = n2.drop_duplicates()
n2_groupbyid = n2['id'].value_counts()
n2_groupbyid = n2_groupbyid.reset_index()
n2_groupbyid.columns = ['id', 'phone_nums']
neg_set = n1_groupbyid.merge(n2_groupbyid, on=['id'])
neg_set.head()
Mem. usage decreased to 2.67 Mb (53.1% reduction)
Mem. usage decreased to 51.13 Mb (14.6% reduction)
| | id | consume | phone_nums |
| 0 | 1009387204000 | 225.000000 | 4 |
| 1 | 1167316303000 | 1.199219 | 4 |
| 2 | 1883071709000 | 213.500000 | 8 |
| 3 | 3393143830010 | 517.500000 | 6 |
| 4 | 4568973162000 | 18.078125 | 3 |
time: 10.8 s
neg_set.info()
Int64Index: 76515 entries, 0 to 76514
Data columns (total 3 columns):
id 76515 non-null int64
consume 76515 non-null float16
phone_nums 76515 non-null int64
dtypes: float16(1), int64(2)
memory usage: 1.9 MB
time: 11.1 ms
n3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n3.txt', sep='\t', header=None))
n3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
n3_groupbyid_call = n3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
n3_groupbyid_provincial = n3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
n3_groupbyid_trans = n3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
neg_set = neg_set.merge(n3_groupbyid_call, on=['id'])
neg_set = neg_set.merge(n3_groupbyid_provincial, on=['id'])
neg_set = neg_set.merge(n3_groupbyid_trans, on=['id'])
n4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n4.txt', sep='\t', header=None))
n4.columns = ['year_month', 'id', 'province', 'label']
n4_groupbyid = n4[['id', 'province']].groupby(['id']).size()
n4_groupbyid = n4_groupbyid.reset_index()
n4_groupbyid.columns = ['id', 'province_out_cnt']
neg_set = neg_set.merge(n4_groupbyid, how='left', on=['id'])
neg_set = neg_set.fillna(0)
neg_set.head()
Mem. usage decreased to 3.03 Mb (64.6% reduction)
Mem. usage decreased to 0.73 Mb (34.4% reduction)
| | id | consume | phone_nums | call_nums | is_trans_provincial | is_transnational | province_out_cnt |
| 0 | 1009387204000 | 225.000000 | 4 | 19 | 0 | 0 | 0.0 |
| 1 | 1167316303000 | 1.199219 | 4 | 6 | 0 | 0 | 0.0 |
| 2 | 1883071709000 | 213.500000 | 8 | 40 | 0 | 0 | 0.0 |
| 3 | 3393143830010 | 517.500000 | 6 | 205 | 1 | 0 | 2.0 |
| 4 | 4568973162000 | 18.078125 | 3 | 17 | 0 | 0 | 0.0 |
time: 32.5 s
neg_set['label'] = 0
time: 1.83 ms
neg_set.info()
Int64Index: 76515 entries, 0 to 76514
Data columns (total 8 columns):
id 76515 non-null int64
consume 76515 non-null float16
phone_nums 76515 non-null int64
call_nums 76515 non-null int16
is_trans_provincial 76515 non-null int8
is_transnational 76515 non-null int8
province_out_cnt 76515 non-null float64
label 76515 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2)
memory usage: 3.4 MB
time: 18.9 ms
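The positive, negative and test sets are all built with the same sequence of steps, so the logic could be collected in one helper. The sketch below is an assumption about how that might look, not the author's code: `build_feature_set` is a hypothetical name, it takes the four tables after their columns have been named as in the cells above, and it mirrors the same choices (drop rows with missing values in tables 1 and 2, inner-join tables 1 to 3, left-join the province counts and fill gaps with 0).

```python
import pandas as pd

def build_feature_set(t1, t2, t3, t4):
    """Combine the four per-user tables into one row per id."""
    # Table 1: total consumption per user (rows with missing values dropped).
    consume = t1.dropna().groupby("id", as_index=False)["consume"].sum()
    # Table 2: number of distinct phone brands per user.
    phones = (t2.dropna()[["id", "brand"]]
                .drop_duplicates()
                .groupby("id").size()
                .reset_index(name="phone_nums"))
    # Table 3: summed call counts and roaming flags per user.
    roam = (t3.groupby("id", as_index=False)
              [["call_nums", "is_trans_provincial", "is_transnational"]].sum())
    # Table 4: number of roam-out records per user.
    provinces = t4.groupby("id").size().reset_index(name="province_out_cnt")

    out = consume.merge(phones, on="id").merge(roam, on="id")
    out = out.merge(provinces, how="left", on="id")
    out["province_out_cnt"] = out["province_out_cnt"].fillna(0).astype("int64")
    return out

# Tiny made-up tables, only to show the expected shapes.
t1 = pd.DataFrame({"year_month": [201706, 201707, 201706],
                   "id": [1, 1, 2], "consume": [10.0, None, 7.0]})
t2 = pd.DataFrame({"id": [1, 1, 1, 2], "brand": ["A", "A", "B", "C"],
                   "type": ["x", "x", "y", "z"]})
t3 = pd.DataFrame({"year_month": [201706, 201707, 201706], "id": [1, 1, 2],
                   "call_nums": [3, 4, 5], "is_trans_provincial": [1, 0, 0],
                   "is_transnational": [0, 0, 0]})
t4 = pd.DataFrame({"year_month": [201706], "id": [1], "province": ["广东"]})
features = build_feature_set(t1, t2, t3, t4)
print(features)
```

With such a helper, the label column would still be added afterwards (`1` for the q tables, `0` for the n tables, none for the test set).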
n1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n1.txt', sep='\t', header=None))
Mem. usage decreased to 2.67 Mb (53.1% reduction)
time: 484 ms
n1.columns = ['year_month', 'id', 'consume', 'label']
time: 1.28 ms
n1.head()
| | year_month | id | consume | label |
| 0 | 201707 | 8570518832906100 | 9.00 | 0 |
| 1 | 201707 | 2182640938718700 | 10.00 | 0 |
| 2 | 201707 | 783614344429000 | 8.38 | 0 |
| 3 | 201707 | 2007036960106400 | 100.00 | 0 |
| 4 | 201707 | 9482847959399300 | 226.05 | 0 |
time: 7.22 ms
n1.describe()
| | year_month | id | consume | label |
| count | 186800.000000 | 1.868000e+05 | 150750.000000 | 186800.0 |
| mean | 201706.500000 | 5.464219e+15 | 63.580028 | 0.0 |
| std | 0.500001 | 2.633848e+15 | 84.063600 | 0.0 |
| min | 201706.000000 | 1.009387e+12 | -70.660000 | 0.0 |
| 25% | 201706.000000 | 3.192389e+15 | 12.930000 | 0.0 |
| 50% | 201706.500000 | 5.486486e+15 | 34.000000 | 0.0 |
| 75% | 201707.000000 | 7.744140e+15 | 82.500000 | 0.0 |
| max | 201707.000000 | 9.999717e+15 | 3979.940000 | 0.0 |
time: 52.5 ms
n1.info()
RangeIndex: 186800 entries, 0 to 186799
Data columns (total 4 columns):
year_month 186800 non-null int64
id 186800 non-null int64
consume 150750 non-null float64
label 186800 non-null int64
dtypes: float64(1), int64(3)
memory usage: 5.7 MB
time: 21.7 ms
n2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n2.txt', sep='\t', header=None))
Mem. usage decreased to 51.13 Mb (14.6% reduction)
time: 7.76 s
n2.head()
| | 0 | 1 | 2 | 3 | 4 | 5 |
| 0 | 5227696575283900 | 苹果 | A1699 | 20150331210636 | 20150701063017 | 0 |
| 1 | 6279759720262000 | NaN | NaN | 20160725112240 | 20170731235959 | 0 |
| 2 | 6279759720262000 | NaN | NaN | 20161205220417 | 20161205220417 | 0 |
| 3 | 6279759720262000 | 三星 | SM-A9000 | 20161128231001 | 20161128231001 | 0 |
| 4 | 6279759720262000 | NaN | NaN | 20161220102623 | 20170306173713 | 0 |
time: 8.15 ms
n2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
time: 1.2 ms
n2.head()
| | id | brand | type | first_use_time | recent_use_time | label |
| 0 | 5227696575283900 | 苹果 | A1699 | 20150331210636 | 20150701063017 | 0 |
| 1 | 6279759720262000 | NaN | NaN | 20160725112240 | 20170731235959 | 0 |
| 2 | 6279759720262000 | NaN | NaN | 20161205220417 | 20161205220417 | 0 |
| 3 | 6279759720262000 | 三星 | SM-A9000 | 20161128231001 | 20161128231001 | 0 |
| 4 | 6279759720262000 | NaN | NaN | 20161220102623 | 20170306173713 | 0 |
time: 8.3 ms
n2.describe()
| | id | first_use_time | recent_use_time | label |
| count | 1.307608e+06 | 1.307608e+06 | 1.307608e+06 | 1307608.0 |
| mean | 5.460966e+15 | 1.999810e+13 | 1.999992e+13 | 0.0 |
| std | 2.619222e+15 | 1.801007e+12 | 1.801171e+12 | 0.0 |
| min | 1.009387e+12 | -1.000000e+00 | -1.000000e+00 | 0.0 |
| 25% | 3.196695e+15 | 2.015112e+13 | 2.016022e+13 | 0.0 |
| 50% | 5.477102e+15 | 2.016071e+13 | 2.016101e+13 | 0.0 |
| 75% | 7.728047e+15 | 2.016123e+13 | 2.017023e+13 | 0.0 |
| max | 9.999717e+15 | 2.017073e+13 | 2.017073e+13 | 0.0 |
time: 252 ms
n2.info()
RangeIndex: 1307608 entries, 0 to 1307607
Data columns (total 6 columns):
id 1307608 non-null int64
brand 894190 non-null object
type 894205 non-null object
first_use_time 1307608 non-null int64
recent_use_time 1307608 non-null int64
label 1307608 non-null int64
dtypes: int64(4), object(2)
memory usage: 59.9+ MB
time: 251 ms
n3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n3.txt', sep='\t', header=None))
Mem. usage decreased to 3.03 Mb (64.6% reduction)
time: 584 ms
n3.head()
| | 0 | 1 | 2 | 3 | 4 | 5 |
| 0 | 201707 | 4295277677437000 | 36 | 1 | 0 | 0 |
| 1 | 201707 | 9121335969062000 | 37 | 0 | 0 | 0 |
| 2 | 201707 | 9438277095447300 | -1 | 0 | 0 | 0 |
| 3 | 201707 | 6749854876532500 | 20 | 0 | 0 | 0 |
| 4 | 201707 | 1545361809381400 | 26 | 0 | 0 | 0 |
time: 7.82 ms
n3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
time: 1.13 ms
n3.head()
| | year_month | id | call_nums | is_trans_provincial | is_transnational | label |
| 0 | 201707 | 4295277677437000 | 36 | 1 | 0 | 0 |
| 1 | 201707 | 9121335969062000 | 37 | 0 | 0 | 0 |
| 2 | 201707 | 9438277095447300 | -1 | 0 | 0 | 0 |
| 3 | 201707 | 6749854876532500 | 20 | 0 | 0 | 0 |
| 4 | 201707 | 1545361809381400 | 26 | 0 | 0 | 0 |
time: 7.49 ms
n3.describe()
| | year_month | id | call_nums | is_trans_provincial | is_transnational | label |
| count | 186800.000000 | 1.868000e+05 | 186800.000000 | 186800.000000 | 186800.000000 | 186800.0 |
| mean | 201706.500000 | 5.464219e+15 | 32.674797 | 0.093292 | 0.005054 | 0.0 |
| std | 0.500001 | 2.633848e+15 | 46.054929 | 0.290842 | 0.070909 | 0.0 |
| min | 201706.000000 | 1.009387e+12 | -1.000000 | 0.000000 | 0.000000 | 0.0 |
| 25% | 201706.000000 | 3.192389e+15 | 4.000000 | 0.000000 | 0.000000 | 0.0 |
| 50% | 201706.500000 | 5.486486e+15 | 19.000000 | 0.000000 | 0.000000 | 0.0 |
| 75% | 201707.000000 | 7.744140e+15 | 43.000000 | 0.000000 | 0.000000 | 0.0 |
| max | 201707.000000 | 9.999717e+15 | 1807.000000 | 1.000000 | 1.000000 | 0.0 |
time: 75.7 ms
n3.info()
RangeIndex: 186800 entries, 0 to 186799
Data columns (total 6 columns):
year_month 186800 non-null int64
id 186800 non-null int64
call_nums 186800 non-null int64
is_trans_provincial 186800 non-null int64
is_transnational 186800 non-null int64
label 186800 non-null int64
dtypes: int64(6)
memory usage: 8.6 MB
time: 26.6 ms
n4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n4.txt', sep='\t', header=None))
Mem. usage decreased to 0.73 Mb (34.4% reduction)
time: 88.8 ms
n4.columns = ['year_month', 'id', 'province', 'label']
time: 1.15 ms
n4.head()
| | year_month | id | province | label |
| 0 | 201707 | 4295277677437000 | 重庆 | 0 |
| 1 | 201707 | 5560109665240300 | 广西 | 0 |
| 2 | 201707 | 5560109665240300 | 广东 | 0 |
| 3 | 201707 | 5560109665240300 | 广东 | 0 |
| 4 | 201707 | 5705601521649600 | 重庆 | 0 |
time: 7.14 ms
n4.describe()
| | year_month | id | label |
| count | 36499.000000 | 3.649900e+04 | 36499.0 |
| mean | 201706.539193 | 5.471019e+15 | 0.0 |
| std | 0.498468 | 2.639006e+15 | 0.0 |
| min | 201706.000000 | 3.393144e+12 | 0.0 |
| 25% | 201706.000000 | 3.203830e+15 | 0.0 |
| 50% | 201707.000000 | 5.468480e+15 | 0.0 |
| 75% | 201707.000000 | 7.753756e+15 | 0.0 |
| max | 201707.000000 | 9.999305e+15 | 0.0 |
time: 24.4 ms
n4.info()
RangeIndex: 36499 entries, 0 to 36498
Data columns (total 4 columns):
year_month 36499 non-null int64
id 36499 non-null int64
province 36099 non-null object
label 36499 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.1+ MB
time: 9.97 ms
!ls /home/kesci/input/gzlt/train_set/201708n/
201708n1.txt 201708n3.txt 201708n6.txt
201708n2.txt 201708n4.txt 201708n7.txt
time: 669 ms
n6 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n6.txt', sep='\t', header=None))
Mem. usage decreased to 798.26 Mb (52.1% reduction)
time: 2min 59s
n6.columns = ['date', 'hour', 'id', 'user_longitude', 'user_latitude', 'label']
time: 1.51 ms
n6.head()
| | date | hour | id | user_longitude | user_latitude | label |
| 0 | 2017-07-02 | 10.0 | 7748777616409800 | 106.680816 | 26.563650 | 0 |
| 1 | 2017-07-10 | 0.0 | 7748777616409800 | 106.719520 | 26.576370 | 0 |
| 2 | 2017-07-31 | 14.0 | 7748777616409800 | 106.683060 | 26.654663 | 0 |
| 3 | 2017-07-01 | 0.0 | 6633710902197900 | 106.697440 | 26.613930 | 0 |
| 4 | 2017-07-08 | 14.0 | 6633710902197900 | 106.715700 | 26.609710 | 0 |
time: 9.14 ms
q6.describe()  # note: this describes q6 (the positive samples, label 1.0 throughout), not n6
| | hour | id | user_longitude | user_latitude | label |
| count | 2.852871e+06 | 2.852871e+06 | 2.851527e+06 | 2.851527e+06 | 2852871.0 |
| mean | 1.141897e+01 | 5.415213e+15 | 1.068143e+02 | 2.659968e+01 | 1.0 |
| std | 6.632995e+00 | 2.634349e+15 | 5.580043e-01 | 2.852525e-01 | 0.0 |
| min | 0.000000e+00 | 1.448104e+12 | 1.036700e+02 | 2.470664e+01 | 1.0 |
| 25% | 6.000000e+00 | 3.135488e+15 | 1.066656e+02 | 2.654610e+01 | 1.0 |
| 50% | 1.200000e+01 | 5.442594e+15 | 1.067027e+02 | 2.658143e+01 | 1.0 |
| 75% | 1.800000e+01 | 7.687963e+15 | 1.067373e+02 | 2.662629e+01 | 1.0 |
| max | 2.200000e+01 | 9.997949e+15 | 1.095277e+02 | 2.909348e+01 | 1.0 |
time: 979 ms
n6.info()
RangeIndex: 36393070 entries, 0 to 36393069
Data columns (total 6 columns):
date object
hour float64
id int64
user_longitude float64
user_latitude float64
label int64
dtypes: float64(3), int64(2), object(1)
memory usage: 1.6+ GB
time: 3.76 ms
n7 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n7.txt', sep='\t', header=None))
Mem. usage decreased to 17.98 Mb (31.2% reduction)
time: 3.14 s
n7.columns = ['year_month', 'id', 'app', 'flow']
time: 1.44 ms
n7.head()
| | year_month | id | app | flow |
| 0 | 201707 | 4011022166491000 | 米聊 | 0.01 |
| 1 | 201707 | 8544172893207700 | 百度地图 | 2.07 |
| 2 | 201707 | 9856572220983403 | 搜狗输入法 | 0.00 |
| 3 | 201707 | 6441300393946200 | 爱奇艺视频 | 0.00 |
| 4 | 201707 | 8751918977379700 | 开心消消乐 | 0.03 |
time: 7.51 ms
time: 2.94 ms
| | year_month | id | app | flow | label |
| 0 | 201707 | 4011022166491000 | 米聊 | 0.01 | 0 |
| 1 | 201707 | 8544172893207700 | 百度地图 | 2.07 | 0 |
| 2 | 201707 | 9856572220983403 | 搜狗输入法 | 0.00 | 0 |
| 3 | 201707 | 6441300393946200 | 爱奇艺视频 | 0.00 | 0 |
| 4 | 201707 | 8751918977379700 | 开心消消乐 | 0.03 | 0 |
time: 8.46 ms
n7.describe()
| | year_month | id | flow | label |
| count | 856961.000000 | 8.569610e+05 | 856961.000000 | 856961.0 |
| mean | 201706.535881 | 5.432556e+15 | 9.942533 | 0.0 |
| std | 0.498711 | 2.643712e+15 | 68.096944 | 0.0 |
| min | 201706.000000 | 1.009387e+12 | 0.000000 | 0.0 |
| 25% | 201706.000000 | 3.134290e+15 | 0.000000 | 0.0 |
| 50% | 201707.000000 | 5.440495e+15 | 0.060000 | 0.0 |
| 75% | 201707.000000 | 7.727765e+15 | 1.130000 | 0.0 |
| max | 201707.000000 | 9.999717e+15 | 10986.150000 | 0.0 |
time: 170 ms
n7.info()
RangeIndex: 856961 entries, 0 to 856960
Data columns (total 5 columns):
year_month 856961 non-null int64
id 856961 non-null int64
app 856961 non-null object
flow 856961 non-null float64
label 856961 non-null int64
dtypes: float64(1), int64(3), object(1)
memory usage: 32.7+ MB
time: 116 ms
0.1.3 Weather data
!ls /home/kesci/input/gzlt/train_set/weather_data_2017/
weather_forecast_2017.txt weather_reported_2017.txt 天气现象编码.xlsx
time: 669 ms
weather_reported = pd.read_csv('/home/kesci/input/gzlt/train_set/weather_data_2017/weather_reported_2017.txt', sep='\t')
time: 6.15 ms
weather_reported.head()
| | Station_Name | VACODE | Year | Month | Day | TEM_Avg | TEM_Max | TEM_Min | PRE_Time_2020 | WEP_Record |
| 0 | 麻江 | 522635 | 2017 | 6 | 1 | 23.00 | 24.5 | 20.9 | 0.6 | ( 01 60 ) 60 . |
| 1 | 三穗 | 522624 | 2017 | 6 | 1 | 21.13 | 25.6 | 19.4 | 9.0 | ( 01 10 80 ) 80 60 . |
| 2 | 镇远 | 522625 | 2017 | 6 | 1 | 22.68 | 26.5 | 21.3 | 8.9 | ( 60 ) 60 . |
| 3 | 雷山 | 522634 | 2017 | 6 | 1 | 23.80 | 26.1 | 20.4 | 5.1 | ( 10 ) 60 . |
| 4 | 剑河 | 522629 | 2017 | 6 | 1 | 23.53 | 27.1 | 22.0 | 6.8 | ( 01 10 80 ) 80 10 . |
time: 12.2 ms
time: 1.25 ms
weather_reported.describe()
| | Station_Name | VACODE | Year | Month | Day | TEM_Avg | TEM_Max | TEM_Min | PRE_Time_2020 | WEP_Record |
| count | 1404 | 1404 | 1404 | 1404 | 1404 | 1404 | 1404 | 1404 | 1404 | 1404 |
| unique | 24 | 25 | 2 | 3 | 32 | 448 | 214 | 109 | 330 | 305 |
| top | 贵阳 | 520000 | 2017 | 7 | 4 | 22.83 | 30.5 | 20.5 | 0.0 | ( 01 ) 01 . |
| freq | 61 | 360 | 1403 | 713 | 46 | 10 | 18 | 35 | 625 | 197 |
time: 49.9 ms
weather_reported.info()
RangeIndex: 1404 entries, 0 to 1403
Data columns (total 10 columns):
Station_Name 1404 non-null object
VACODE 1404 non-null object
Year 1404 non-null object
Month 1404 non-null object
Day 1404 non-null object
TEM_Avg 1404 non-null object
TEM_Max 1404 non-null object
TEM_Min 1404 non-null object
PRE_Time_2020 1404 non-null object
WEP_Record 1404 non-null object
dtypes: object(10)
memory usage: 109.8+ KB
time: 6.32 ms
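Every column of `weather_reported` is parsed as `object`, and `describe()` reports two unique values for `Year` with 2017 covering 1403 of the 1404 rows, which suggests that one row holds non-numeric content (a repeated header line, for example). That reading is an assumption; the sketch below shows one way to coerce the numeric columns and drop rows that fail to parse. The helper name `coerce_weather` and the demo rows are made up.

```python
import pandas as pd

NUMERIC_COLS = ["VACODE", "Year", "Month", "Day",
                "TEM_Avg", "TEM_Max", "TEM_Min", "PRE_Time_2020"]

def coerce_weather(df: pd.DataFrame) -> pd.DataFrame:
    """Convert numeric columns from strings; unparseable cells become NaN."""
    out = df.copy()
    for col in NUMERIC_COLS:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Rows whose Year could not be parsed are treated as stray non-data rows.
    return out.dropna(subset=["Year"]).reset_index(drop=True)

demo = pd.DataFrame({
    "Station_Name": ["麻江", "Station_Name"],
    "VACODE": ["522635", "VACODE"],
    "Year": ["2017", "Year"],
    "Month": ["6", "Month"],
    "Day": ["1", "Day"],
    "TEM_Avg": ["23.00", "TEM_Avg"],
    "TEM_Max": ["24.5", "TEM_Max"],
    "TEM_Min": ["20.9", "TEM_Min"],
    "PRE_Time_2020": ["0.6", "PRE_Time_2020"],
    "WEP_Record": ["( 01 60 ) 60 .", "WEP_Record"],
})
cleaned = coerce_weather(demo)
print(cleaned.dtypes)
```

Before dropping anything on the real file, it would be worth printing the rows where `Year` fails to parse to confirm what they contain.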
weather_forecast = pd.read_csv('/home/kesci/input/gzlt/train_set/weather_data_2017/weather_forecast_2017.txt', sep='\t')
time: 10.8 ms
weather_forecast.head()
| | Station_Name | VACODE | Year | Mon | Day | TEM_Max_24h | TEM_Min_24h | WEP_24h | TEM_Max_48h | TEM_Min_48h | ... | TEM_Max_120h | TEM_Min_120h | WEP_120h | TEM_Max_144h | TEM_Min_144h | WEP_144h | TEM_Max_168h | TEM_Min_168h,WEP_168h | Unnamed: 24 | Unnamed: 25 |
| 0 | 白云 | 520113 | 2017 | 6 | 1 | 25.0 | 17.0 | (2)1 | 24.0 | 19.0 | ... | (4)2 | 25.0 | 15.0 | (2)1 | 27.0 | 15.0 | (1)0 | 26.0 | 16.0 | (1)0 |
| 1 | 岑巩 | 522626 | 2017 | 6 | 1 | 31.3 | 19.4 | (1)1 | 31.0 | 22.0 | ... | (4)1 | 32.0 | 19.4 | (1)1 | 32.0 | 22.8 | (1)1 | 32.0 | 21.0 | (1)1 |
| 2 | 从江 | 522633 | 2017 | 6 | 1 | 33.4 | 22.0 | (1)1 | 30.0 | 23.0 | ... | (4)3 | 34.0 | 22.0 | (1)1 | 34.0 | 23.8 | (1)1 | 34.0 | 22.0 | (1)1 |
| 3 | 丹寨 | 522636 | 2017 | 6 | 1 | 27.5 | 18.0 | (1)1 | 24.5 | 20.0 | ... | (4)1 | 28.5 | 18.0 | (1)1 | 28.5 | 21.0 | (1)1 | 28.5 | 20.0 | (1)1 |
| 4 | 贵阳 | 520103 | 2017 | 6 | 1 | 26.0 | 18.0 | (2)1 | 25.0 | 20.0 | ... | (4)2 | 26.0 | 16.0 | (2)1 | 28.0 | 16.0 | (1)0 | 27.0 | 17.0 | (1)0 |
5 rows × 26 columns
time: 86.4 ms
weather_forecast.describe()
| | VACODE | Year | Mon | Day | TEM_Max_24h | TEM_Min_24h | TEM_Max_48h | TEM_Min_48h | TEM_Max_72h | TEM_Min_72h,WEP_72h | TEM_Min_96h | WEP_96h | TEM_Min_120h | WEP_120h | TEM_Min_144h | WEP_144h | TEM_Min_168h,WEP_168h | Unnamed: 24 |
| count | 1464.000000 | 1464.0 | 1464.000000 | 1464.000000 | 1464.000000 | 1464.000000 | 1464.000000 | 1464.000000 | 1464.000000 | 1464.000000 | 1464.000000 | 1464.000000 | 1464.000000 | 1464.000000 | 1464.000000 | 1464.000000 | 1464.000000 | 1464.000000 |
| mean | 521792.583333 | 2017.0 | 6.508197 | 15.754098 | 28.374658 | 20.721585 | 28.375820 | 20.872814 | 28.283811 | 21.112432 | 28.539481 | 21.408128 | 28.702254 | 21.454713 | 29.142623 | 21.485656 | 29.131626 | 21.589003 |
| std | 1180.891163 | 0.0 | 0.500104 | 8.809966 | 4.300391 | 2.290850 | 4.379771 | 2.232788 | 4.329132 | 2.204980 | 4.154188 | 5.203525 | 4.167441 | 5.238257 | 4.124026 | 2.180222 | 4.033227 | 2.391945 |
| min | 520103.000000 | 2017.0 | 6.000000 | 1.000000 | 17.300000 | 13.800000 | 17.300000 | 13.600000 | 17.000000 | 10.000000 | 19.000000 | 14.300000 | 19.000000 | 15.000000 | 18.000000 | 15.000000 | 18.000000 | 2.000000 |
| 25% | 520122.750000 | 2017.0 | 6.000000 | 8.000000 | 25.000000 | 19.000000 | 25.000000 | 19.400000 | 25.000000 | 19.600000 | 25.500000 | 19.700000 | 26.000000 | 19.700000 | 26.000000 | 20.000000 | 26.500000 | 20.000000 |
| 50% | 522624.500000 | 2017.0 | 7.000000 | 16.000000 | 28.500000 | 21.000000 | 28.500000 | 21.000000 | 28.000000 | 21.000000 | 28.500000 | 21.500000 | 28.500000 | 21.500000 | 29.000000 | 22.000000 | 29.000000 | 22.000000 |
| 75% | 522630.250000 | 2017.0 | 7.000000 | 23.000000 | 31.800000 | 22.500000 | 31.600000 | 22.500000 | 31.500000 | 23.000000 | 31.500000 | 23.000000 | 32.000000 | 23.000000 | 32.000000 | 23.000000 | 32.000000 | 23.500000 |
| max | 522636.000000 | 2017.0 | 7.000000 | 31.000000 | 39.000000 | 25.700000 | 39.500000 | 25.500000 | 38.000000 | 25.800000 | 39.000000 | 200.000000 | 39.000000 | 202.000000 | 38.800000 | 25.800000 | 37.500000 | 26.000000 |
time: 121 ms
weather_forecast.info()
RangeIndex: 1464 entries, 0 to 1463
Data columns (total 26 columns):
Station_Name 1464 non-null object
VACODE 1464 non-null int64
Year 1464 non-null int64
Mon 1464 non-null int64
Day 1464 non-null int64
TEM_Max_24h 1464 non-null float64
TEM_Min_24h 1464 non-null float64
WEP_24h 1464 non-null object
TEM_Max_48h 1464 non-null float64
TEM_Min_48h 1464 non-null float64
WEP_48h 1464 non-null object
TEM_Max_72h 1464 non-null float64
TEM_Min_72h,WEP_72h 1464 non-null float64
TEM_Max_96h 1464 non-null object
TEM_Min_96h 1464 non-null float64
WEP_96h 1464 non-null float64
TEM_Max_120h 1464 non-null object
TEM_Min_120h 1464 non-null float64
WEP_120h 1464 non-null float64
TEM_Max_144h 1464 non-null object
TEM_Min_144h 1464 non-null float64
WEP_144h 1464 non-null float64
TEM_Max_168h 1464 non-null object
TEM_Min_168h,WEP_168h 1464 non-null float64
Unnamed: 24 1464 non-null float64
Unnamed: 25 1464 non-null object
dtypes: float64(14), int64(4), object(8)
memory usage: 297.5+ KB
time: 9.2 ms
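The column labels of `weather_forecast` are shifted: two header cells contain a comma instead of a tab (`TEM_Min_72h,WEP_72h` and `TEM_Min_168h,WEP_168h`), so from the 72h block onward the names no longer line up with the values, and the last two columns end up unnamed. One possible fix is to ignore the file's header and supply 26 clean names. The sketch below assumes that each forecast horizon contributes the three fields `TEM_Max`, `TEM_Min`, `WEP` in that order, which matches the rows shown by `head()`; `read_forecast` is a hypothetical helper and the demo input is synthetic.

```python
import io
import pandas as pd

HORIZONS = [24, 48, 72, 96, 120, 144, 168]
FORECAST_COLS = ["Station_Name", "VACODE", "Year", "Mon", "Day"] + [
    f"{field}_{h}h" for h in HORIZONS for field in ("TEM_Max", "TEM_Min", "WEP")
]

def read_forecast(path_or_buffer) -> pd.DataFrame:
    # header=0 discards the file's own header line; names= supplies clean labels.
    return pd.read_csv(path_or_buffer, sep="\t", header=0, names=FORECAST_COLS)

# Synthetic stand-in for the real file: a placeholder header plus one data row.
header = "\t".join(["Station_Name", "VACODE", "Year", "Mon", "Day"]
                   + [f"h{i}" for i in range(21)])
row = "\t".join(["白云", "520113", "2017", "6", "1"] + ["25.0", "17.0", "(2)1"] * 7)
demo = read_forecast(io.StringIO(header + "\n" + row + "\n"))
print(demo.dtypes)
```

With aligned names, the temperature columns parse as floats and the `WEP_*` columns stay as strings, instead of the mixed pattern shown by `info()` above.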
0.2 Test data
0.2.1 Test set
t1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_1.txt', sep='\t', header=None))
t1.columns = ['year_month', 'id', 'consume']
t1 = t1.dropna(axis=0)
t1_groupbyid = t1[['id', 'consume']].groupby(['id']).agg({'consume': pd.Series.sum})
t2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_2.txt', sep='\t', header=None))
t2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time']
t2 = t2.dropna(axis=0)
t2 = t2[['id', 'brand']]
t2 = t2.drop_duplicates()
t2_groupbyid = t2['id'].value_counts()
t2_groupbyid = t2_groupbyid.reset_index()
t2_groupbyid.columns = ['id', 'phone_nums']
test_set = t1_groupbyid.merge(t2_groupbyid, on=['id'])
test_set.head()
Mem. usage decreased to 1.34 Mb (41.7% reduction)
Mem. usage decreased to 60.50 Mb (0.0% reduction)
| | id | consume | phone_nums |
| 0 | 595941207920 | 220.000 | 10 |
| 1 | 901845022650 | 662.000 | 6 |
| 2 | 1868765858840 | 143.375 | 4 |
| 3 | 5058794512580 | 200.000 | 7 |
| 4 | 5399381591230 | 192.000 | 29 |
time: 7.86 s
test_set.info()
Int64Index: 43977 entries, 0 to 43976
Data columns (total 3 columns):
id 43977 non-null int64
consume 43977 non-null float16
phone_nums 43977 non-null int64
dtypes: float16(1), int64(2)
memory usage: 1.1 MB
time: 9.02 ms
t3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_3.txt', sep='\t', header=None))
t3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational']
t3_groupbyid_call = t3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
t3_groupbyid_provincial = t3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
t3_groupbyid_trans = t3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
test_set = test_set.merge(t3_groupbyid_call, on=['id'])
test_set = test_set.merge(t3_groupbyid_provincial, on=['id'])
test_set = test_set.merge(t3_groupbyid_trans, on=['id'])
t4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_4.txt', sep='\t', header=None))
t4.columns = ['year_month', 'id', 'province']
t4_groupbyid = t4[['id', 'province']].groupby(['id']).size()
t4_groupbyid = t4_groupbyid.reset_index()
t4_groupbyid.columns = ['id', 'province_out_cnt']
test_set = test_set.merge(t4_groupbyid, how='left', on=['id'])
test_set = test_set.fillna(0)
test_set.head()
Mem. usage decreased to 1.53 Mb (60.0% reduction)
Mem. usage decreased to 0.85 Mb (16.7% reduction)
| | id | consume | phone_nums | call_nums | is_trans_provincial | is_transnational | province_out_cnt |
| 0 | 595941207920 | 220.000 | 10 | 68 | 1 | 0 | 1.0 |
| 1 | 901845022650 | 662.000 | 6 | 278 | 0 | 0 | 0.0 |
| 2 | 1868765858840 | 143.375 | 4 | 107 | 2 | 0 | 3.0 |
| 3 | 5058794512580 | 200.000 | 7 | 128 | 0 | 0 | 0.0 |
| 4 | 5399381591230 | 192.000 | 29 | 61 | 0 | 0 | 0.0 |
time: 17.4 s
!ls /home/kesci/input/gzlt/test_set/
201808 weather_data_2018
time: 704 ms
!ls /home/kesci/input/gzlt/test_set/201808
2018_1.txt 2018_2.txt 2018_3.txt 2018_4.txt 2018_6.txt 2018_7.txt
time: 702 ms
t1 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_1.txt', sep='\t', header=None)
time: 527 ms
t1.columns = ['year_month', 'id', 'consume']
time: 1.27 ms
t1.head()
| | year_month | id | consume |
| 0 | 201807 | 6401824160010748 | 618.40 |
| 1 | 201807 | 6506134548135499 | NaN |
| 2 | 201807 | 5996920884619954 | 22.05 |
| 3 | 201806 | 1187209424543713 | 7.20 |
| 4 | 201807 | 9297165066591558 | 124.00 |
time: 99.9 ms
t1.describe()
| | year_month | id | consume |
| count | 100402.000000 | 1.004020e+05 | 86787.000000 |
| mean | 201806.500000 | 5.449905e+15 | 103.357399 |
| std | 0.500002 | 2.628916e+15 | 311.428596 |
| min | 201806.000000 | 5.959412e+11 | 0.010000 |
| 25% | 201806.000000 | 3.176902e+15 | 36.500000 |
| 50% | 201806.500000 | 5.440931e+15 | 81.000000 |
| 75% | 201807.000000 | 7.726318e+15 | 132.125000 |
| max | 201807.000000 | 9.999920e+15 | 61465.900000 |
time: 50.6 ms
t1.info()
RangeIndex: 100402 entries, 0 to 100401
Data columns (total 3 columns):
year_month 100402 non-null int64
id 100402 non-null int64
consume 86787 non-null float64
dtypes: float64(1), int64(2)
memory usage: 2.3 MB
time: 12.6 ms
%matplotlib inline
t1.consume.plot()
Matplotlib is building the font cache using fc-list. This may take a moment.
time: 17 s
t1[t1.consume == 61465.9]
| | year_month | id | consume |
| 11962 | 201807 | 4827806860301307 | 61465.9 |
time: 7.15 ms
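The single `consume` value of 61465.9 is far above the 75th percentile of about 132 and also close to the float16 limit of roughly 65504. The notebook only inspects this row; whether to cap such values is a modelling choice. If capping were wanted, a quantile-based clip is one simple option. The sketch below is illustrative only: `clip_upper_quantile` is a hypothetical helper and the quantile used in the demo is arbitrary.

```python
import pandas as pd

def clip_upper_quantile(series: pd.Series, q: float = 0.999) -> pd.Series:
    """Cap values above the q-th quantile; NaN values pass through unchanged."""
    cap = series.quantile(q)
    return series.clip(upper=cap)

demo = pd.Series([7.2, 22.05, 124.0, 618.4, 61465.9])
clipped = clip_upper_quantile(demo, q=0.75)
print(clipped.tolist())
```

Tree-based models are usually insensitive to such outliers, so this step matters mostly for linear models or for features that are later summed or averaged.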
t2 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_2.txt', sep='\t', header=None)
time: 11.8 s
t2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time']
time: 1.18 ms
t2.head()
| | id | brand | type | first_use_time | recent_use_time |
| 0 | 3179771753483280 | 魅族 | M575 | 20180601151052 | 20180601151054 |
| 1 | 4185007692177509 | NaN | NaN | 20171021182915 | 20171021183000 |
| 2 | 4972845789896505 | NaN | NaN | 20180624003647 | 20180624003656 |
| 3 | 4207293827582218 | NaN | NaN | 20171224165902 | 20180306175444 |
| 4 | 2628020151876580 | NaN | NaN | 20170820111053 | 20171207020159 |
time: 7.95 ms
t2.describe()
| | id | first_use_time | recent_use_time |
| count | 1.586024e+06 | 1.586024e+06 | 1.586024e+06 |
| mean | 5.410516e+15 | 2.017033e+13 | 2.017156e+13 |
| std | 2.618994e+15 | 6.902153e+09 | 6.865591e+09 |
| min | 5.959412e+11 | 2.016032e+13 | 2.016033e+13 |
| 25% | 3.140763e+15 | 2.016122e+13 | 2.017021e+13 |
| 50% | 5.389338e+15 | 2.017063e+13 | 2.017080e+13 |
| 75% | 7.660413e+15 | 2.017122e+13 | 2.018013e+13 |
| max | 9.999920e+15 | 2.018073e+13 | 2.018073e+13 |
time: 353 ms
t2.info()
RangeIndex: 1586024 entries, 0 to 1586023
Data columns (total 5 columns):
id 1586024 non-null int64
brand 1098244 non-null object
type 1098250 non-null object
first_use_time 1586024 non-null int64
recent_use_time 1586024 non-null int64
dtypes: int64(3), object(2)
memory usage: 60.5+ MB
time: 291 ms
t3 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_3.txt', sep='\t', header=None)
time: 451 ms
t3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational']
time: 1.14 ms
t3.head()
| | year_month | id | call_nums | is_trans_provincial | is_transnational |
| 0 | 201806 | 3690814703003361 | 49 | 0 | 0 |
| 1 | 201807 | 4315823592069831 | -1 | 0 | 0 |
| 2 | 201806 | 5199170013029443 | -1 | 0 | 0 |
| 3 | 201806 | 1387658205895203 | 35 | 0 | 0 |
| 4 | 201807 | 3280240784164442 | -1 | 0 | 0 |
time: 7.12 ms
t3.describe()
| | year_month | id | call_nums | is_trans_provincial | is_transnational |
| count | 100400.000000 | 1.004000e+05 | 100400.000000 | 100400.000000 | 100400.000000 |
| mean | 201806.500000 | 5.449990e+15 | 51.642102 | 0.206116 | 0.012809 |
| std | 0.500002 | 2.628873e+15 | 90.705957 | 0.404516 | 0.112449 |
| min | 201806.000000 | 5.959412e+11 | -1.000000 | 0.000000 | 0.000000 |
| 25% | 201806.000000 | 3.177008e+15 | 6.000000 | 0.000000 | 0.000000 |
| 50% | 201806.500000 | 5.441108e+15 | 31.000000 | 0.000000 | 0.000000 |
| 75% | 201807.000000 | 7.726328e+15 | 71.000000 | 0.000000 | 0.000000 |
| max | 201807.000000 | 9.999920e+15 | 6537.000000 | 1.000000 | 1.000000 |
time: 46.4 ms
t3.info()
RangeIndex: 100400 entries, 0 to 100399
Data columns (total 5 columns):
year_month 100400 non-null int64
id 100400 non-null int64
call_nums 100400 non-null int64
is_trans_provincial 100400 non-null int64
is_transnational 100400 non-null int64
dtypes: int64(5)
memory usage: 3.8 MB
time: 15.1 ms
t4 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_4.txt', sep='\t', header=None)
time: 240 ms
t4.columns = ['year_month', 'id', 'province']
time: 1.2 ms
t4.head()
| | year_month | id | province |
| 0 | 201807 | 8445647072009305 | 广东 |
| 1 | 201806 | 9414872397547413 | 浙江 |
| 2 | 201806 | 2272887111818372 | 广东 |
| 3 | 201807 | 224368910874770 | 湖北 |
| 4 | 201807 | 6081677258986878 | NaN |
time: 6.81 ms
t4.describe()
| | year_month | id |
| count | 44543.000000 | 4.454300e+04 |
| mean | 201806.530319 | 5.448788e+15 |
| std | 0.499086 | 2.640390e+15 |
| min | 201806.000000 | 5.959412e+11 |
| 25% | 201806.000000 | 3.118911e+15 |
| 50% | 201807.000000 | 5.430117e+15 |
| 75% | 201807.000000 | 7.751481e+15 |
| max | 201807.000000 | 9.999505e+15 |
time: 20.3 ms
t4.info()
RangeIndex: 44543 entries, 0 to 44542
Data columns (total 3 columns):
year_month 44543 non-null int64
id 44543 non-null int64
province 44119 non-null object
dtypes: int64(2), object(1)
memory usage: 1.0+ MB
time: 9.73 ms
t6 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_6.txt', sep='\t', header=None)
time: 2min 7s
t6.columns = ['date', 'hour', 'id', 'user_longitude', 'user_latitude']
time: 1.22 ms
t6.head()
| | date | hour | id | user_longitude | user_latitude |
| 0 | 2018-06-10 | 20 | 1929821481825935 | 106.289902 | 26.837687 |
| 1 | 2018-07-14 | 18 | 5450093661688579 | 106.641975 | 26.627846 |
| 2 | 2018-07-16 | 2 | 4617571498633816 | 106.230420 | 27.466980 |
| 3 | 2018-06-15 | 22 | 2826359445811398 | 106.693610 | 26.591110 |
| 4 | 2018-06-22 | 10 | 3526202744290054 | 107.032570 | 27.715830 |
time: 8.4 ms
t6.describe()
| | hour | id | user_longitude | user_latitude |
| count | 1.655899e+07 | 1.655899e+07 | 1.655081e+07 | 1.655081e+07 |
| mean | 1.144987e+01 | 5.461505e+15 | 1.066642e+02 | 2.662386e+01 |
| std | 6.742805e+00 | 2.629564e+15 | 4.626476e-01 | 3.195807e-01 |
| min | 0.000000e+00 | 5.959412e+11 | 1.036700e+02 | 2.469706e+01 |
| 25% | 6.000000e+00 | 3.191837e+15 | 1.066328e+02 | 2.655164e+01 |
| 50% | 1.200000e+01 | 5.475087e+15 | 1.066902e+02 | 2.658444e+01 |
| 75% | 1.800000e+01 | 7.732384e+15 | 1.067199e+02 | 2.663778e+01 |
| max | 2.200000e+01 | 9.999920e+15 | 1.095534e+02 | 2.916468e+01 |
time: 6.3 s
t6.info()
RangeIndex: 16558993 entries, 0 to 16558992
Data columns (total 5 columns):
date object
hour int64
id int64
user_longitude float64
user_latitude float64
dtypes: float64(2), int64(2), object(1)
memory usage: 631.7+ MB
time: 3.04 ms
t7 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_7.txt', sep='\t', header=None)
time: 8.75 s
t7.columns = ['year_month', 'id', 'app', 'flow']
time: 1.18 ms
t7.head()
| | year_month | id | app | flow |
| 0 | 201806 | 9813651010156104 | OPPO软件商店 | 14545.00 |
| 1 | 201806 | 2338567014163500 | 腾讯新闻 | 0.19 |
| 2 | 201807 | 1133512913801798 | 讯飞输入法 | 0.01 |
| 3 | 201807 | 7739596338372898 | 手机百度 | 1615.00 |
| 4 | 201807 | 5724269192271018 | 百度贴吧 | 1301953.00 |
time: 15.6 ms
t7.describe()
| | year_month | id | flow |
| count | 1.493733e+06 | 1.493733e+06 | 1.492434e+06 |
| mean | 2.018065e+05 | 5.468351e+15 | 8.991198e+07 |
| std | 4.999895e-01 | 2.628382e+15 | 8.503798e+08 |
| min | 2.018060e+05 | 5.959412e+11 | 0.000000e+00 |
| 25% | 2.018060e+05 | 3.196619e+15 | 6.519000e+03 |
| 50% | 2.018070e+05 | 5.477012e+15 | 2.883350e+05 |
| 75% | 2.018070e+05 | 7.737568e+15 | 7.842132e+06 |
| max | 2.018070e+05 | 9.999920e+15 | 3.341152e+11 |
time: 226 ms
t7.info()
RangeIndex: 1493733 entries, 0 to 1493732
Data columns (total 4 columns):
year_month 1493733 non-null int64
id 1493733 non-null int64
app 1457137 non-null object
flow 1492434 non-null float64
dtypes: float64(1), int64(2), object(1)
memory usage: 45.6+ MB
time: 178 ms
0.2.2 Weather data
!ls /home/kesci/input/gzlt/test_set/weather_data_2018/
weather_forecast_2018.txt weather_reported_2018.txt
time: 830 ms
weather_reported_2018 = pd.read_csv('/home/kesci/input/gzlt/test_set/weather_data_2018/weather_reported_2018.txt', sep='\t')
time: 8.57 ms
weather_reported_2018.head()
| | Station_Name | VACODE | Year | Month | Day | TEM_Avg | TEM_Max | TEM_Min | PRE_Time_2020 | WEP_Record |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 镇远 | 522625 | 2018 | 6 | 1 | 19.0 | 21.0 | 17.8 | 0.1 | ( 60 01 ) 01 60 10 . |
| 1 | 丹寨 | 522636 | 2018 | 6 | 1 | 17.0 | 19.9 | 15.3 | 4.3 | ( 60 80 ) 80 . |
| 2 | 三穗 | 522624 | 2018 | 6 | 1 | 17.8 | 19.2 | 17.0 | 0.6 | ( 80 10 ) 60 10 . |
| 3 | 台江 | 522630 | 2018 | 6 | 1 | 18.8 | 21.1 | 17.5 | 1.4 | ( 60 01 ) 01 60 10 . |
| 4 | 剑河 | 522629 | 2018 | 6 | 1 | 19.2 | 21.6 | 17.9 | 2.1 | ( 60 ) 60 10 . |
time: 12.6 ms
weather_reported_2018.describe()
| | VACODE | Year | Month | Day | TEM_Avg | TEM_Max | TEM_Min | PRE_Time_2020 |
|---|---|---|---|---|---|---|---|---|
| count | 1403.000000 | 1403.0 | 1403.000000 | 1403.000000 | 1403.000000 | 1403.000000 | 1403.000000 | 1403.000000 |
| mean | 521862.934426 | 2018.0 | 6.508197 | 15.754098 | 737.393799 | 742.297577 | 734.011119 | 4.922594 |
| std | 1155.972144 | 0.0 | 0.500111 | 8.810097 | 26696.850268 | 26696.719415 | 26696.940604 | 15.090986 |
| min | 520103.000000 | 2018.0 | 6.000000 | 1.000000 | 15.100000 | 16.200000 | 11.800000 | 0.000000 |
| 25% | 520122.000000 | 2018.0 | 6.000000 | 8.000000 | 22.900000 | 27.300000 | 20.000000 | 0.000000 |
| 50% | 522625.000000 | 2018.0 | 7.000000 | 16.000000 | 25.100000 | 30.100000 | 21.600000 | 0.000000 |
| 75% | 522631.000000 | 2018.0 | 7.000000 | 23.000000 | 26.900000 | 32.550000 | 23.050000 | 2.100000 |
| max | 522636.000000 | 2018.0 | 7.000000 | 31.000000 | 999999.000000 | 999999.000000 | 999999.000000 | 281.700000 |
time: 118 ms
weather_reported_2018.info()
RangeIndex: 1403 entries, 0 to 1402
Data columns (total 10 columns):
Station_Name 1403 non-null object
VACODE 1403 non-null int64
Year 1403 non-null int64
Month 1403 non-null int64
Day 1403 non-null int64
TEM_Avg 1403 non-null float64
TEM_Max 1403 non-null float64
TEM_Min 1403 non-null float64
PRE_Time_2020 1403 non-null float64
WEP_Record 1403 non-null object
dtypes: float64(4), int64(4), object(2)
memory usage: 109.7+ KB
time: 6.7 ms
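In the `describe()` output above, the maximum of `TEM_Avg`, `TEM_Max` and `TEM_Min` is 999999.0, which pulls their means up to roughly 737. This looks like a placeholder for missing readings, not a real temperature. A minimal sketch that masks it, assuming 999999 really is a missing-value code (the dataset description quoted in this post does not say so). `mask_sentinel` is a hypothetical helper:

```python
import pandas as pd

SENTINEL = 999999.0  # assumption: placeholder for a missing reading
TEMP_COLS = ["TEM_Avg", "TEM_Max", "TEM_Min"]


def mask_sentinel(df, columns=TEMP_COLS, sentinel=SENTINEL):
    """Return a copy of df with the sentinel replaced by NaN in `columns`."""
    out = df.copy()
    cols = list(columns)
    # DataFrame.mask sets cells to NaN wherever the condition is True.
    out[cols] = out[cols].mask(out[cols] == sentinel)
    return out
```

After masking, `describe()` skips the NaN cells, so the temperature means and standard deviations should fall back into a plausible range.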
weather_forecast_2018 = pd.read_csv('/home/kesci/input/gzlt/test_set/weather_data_2018/weather_forecast_2018.txt', sep='\t')
time: 12 ms
weather_forecast_2018.head()
| | Station_Name | VACODE | Year | Mon | Day | TEM_Max_24h | TEM_Min_24h | WEP_24h | TEM_Max_48h | TEM_Min_48h | ... | TEM_Max_120h | TEM_Min_120h | WEP_120h | TEM_Max_144h | TEM_Min_144h | WEP_144h | TEM_Max_168h | TEM_Min_168h,WEP_168h | Unnamed: 24 | Unnamed: 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 白云 | 520113 | 2018 | 6 | 1 | 20.2 | 14.8 | (3)2 | 23.2 | 15.8 | ... | (2)1 | 27.5 | 13.5 | (1)1 | 26.0 | 14.0 | (2)1 | 24.0 | 16.0 | (1)1 |
| 1 | 岑巩 | 522626 | 2018 | 6 | 1 | 25.5 | 17.5 | (2)2 | 28.5 | 20.2 | ... | (2)0 | 31.0 | 17.0 | (0)0 | 31.0 | 18.5 | (0)1 | 31.0 | 21.5 | (1)1 |
| 2 | 从江 | 522633 | 2018 | 6 | 1 | 27.3 | 19.0 | (7)2 | 29.5 | 22.0 | ... | (21)0 | 33.5 | 19.6 | (0)0 | 33.5 | 20.2 | (0)1 | 31.5 | 23.0 | (1)1 |
| 3 | 丹寨 | 522636 | 2018 | 6 | 1 | 23.0 | 15.5 | (2)2 | 26.0 | 19.2 | ... | (2)0 | 28.0 | 16.2 | (0)0 | 28.0 | 17.2 | (0)1 | 27.0 | 19.5 | (1)1 |
| 4 | 贵阳 | 520103 | 2018 | 6 | 1 | 20.9 | 14.9 | (3)2 | 24.0 | 16.4 | ... | (2)1 | 28.0 | 14.0 | (1)1 | 26.0 | 14.0 | (2)1 | 24.0 | 16.0 | (1)1 |
5 rows × 26 columns
time: 54.2 ms
weather_forecast_2018.describe()
| | VACODE | Year | Mon | Day | TEM_Max_24h | TEM_Min_24h | TEM_Max_48h | TEM_Min_48h | TEM_Max_72h | TEM_Min_72h,WEP_72h | TEM_Min_96h | WEP_96h | TEM_Min_120h | WEP_120h | TEM_Min_144h | WEP_144h | TEM_Min_168h,WEP_168h | Unnamed: 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1463.000000 | 1463.0 | 1463.000000 | 1463.000000 | 1463.000000 | 1463.000000 | 1463.000000 | 1463.000000 | 1463.000000 | 1463.000000 | 1463.000000 | 1463.000000 | 1463.000000 | 1463.000000 | 1463.000000 | 1463.000000 | 1463.000000 | 1463.000000 |
| mean | 521793.738209 | 2018.0 | 6.508544 | 15.759398 | 29.724607 | 21.244703 | 29.724470 | 21.385236 | 29.694463 | 21.655434 | 29.924949 | 21.886945 | 29.891183 | 22.010936 | 30.027341 | 22.055229 | 30.192960 | 21.985373 |
| std | 1180.467638 | 0.0 | 0.500098 | 8.810643 | 3.470128 | 2.536103 | 3.232737 | 2.385237 | 3.167789 | 2.270505 | 3.130886 | 2.131020 | 3.191721 | 2.066640 | 3.199460 | 2.092155 | 3.167676 | 2.227871 |
| min | 520103.000000 | 2018.0 | 6.000000 | 1.000000 | 17.800000 | 10.800000 | 18.000000 | 12.000000 | 16.500000 | 12.500000 | 16.500000 | 14.000000 | 14.500000 | 13.000000 | 17.000000 | 13.200000 | 16.000000 | 15.000000 |
| 25% | 520123.000000 | 2018.0 | 6.000000 | 8.000000 | 27.500000 | 20.000000 | 27.500000 | 20.000000 | 27.500000 | 20.200000 | 28.000000 | 20.500000 | 27.500000 | 21.000000 | 28.000000 | 21.000000 | 28.000000 | 20.850000 |
| 50% | 522625.000000 | 2018.0 | 7.000000 | 16.000000 | 30.000000 | 22.000000 | 29.900000 | 22.000000 | 29.500000 | 22.000000 | 30.000000 | 22.000000 | 30.000000 | 22.200000 | 30.000000 | 22.100000 | 30.000000 | 22.200000 |
| 75% | 522630.500000 | 2018.0 | 7.000000 | 23.000000 | 32.350000 | 23.000000 | 32.000000 | 23.000000 | 32.300000 | 23.300000 | 32.500000 | 23.500000 | 32.500000 | 23.500000 | 32.500000 | 23.700000 | 32.600000 | 24.000000 |
| max | 522636.000000 | 2018.0 | 7.000000 | 31.000000 | 37.500000 | 27.000000 | 37.000000 | 25.900000 | 36.500000 | 26.000000 | 36.500000 | 26.000000 | 36.500000 | 26.200000 | 37.000000 | 26.000000 | 37.000000 | 30.000000 |
time: 74 ms
weather_forecast_2018.info()
RangeIndex: 1463 entries, 0 to 1462
Data columns (total 26 columns):
Station_Name 1463 non-null object
VACODE 1463 non-null int64
Year 1463 non-null int64
Mon 1463 non-null int64
Day 1463 non-null int64
TEM_Max_24h 1463 non-null float64
TEM_Min_24h 1463 non-null float64
WEP_24h 1463 non-null object
TEM_Max_48h 1463 non-null float64
TEM_Min_48h 1463 non-null float64
WEP_48h 1463 non-null object
TEM_Max_72h 1463 non-null float64
TEM_Min_72h,WEP_72h 1463 non-null float64
TEM_Max_96h 1463 non-null object
TEM_Min_96h 1463 non-null float64
WEP_96h 1463 non-null float64
TEM_Max_120h 1463 non-null object
TEM_Min_120h 1463 non-null float64
WEP_120h 1463 non-null float64
TEM_Max_144h 1463 non-null object
TEM_Min_144h 1463 non-null float64
WEP_144h 1463 non-null float64
TEM_Max_168h 1463 non-null object
TEM_Min_168h,WEP_168h 1463 non-null float64
Unnamed: 24 1463 non-null float64
Unnamed: 25 1463 non-null object
dtypes: float64(14), int64(4), object(8)
memory usage: 297.2+ KB
time: 11 ms
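The `info()` listing above suggests that the header row of the forecast file is malformed. `TEM_Min_72h,WEP_72h` and `TEM_Min_168h,WEP_168h` each appear as a single column name, two columns end up as `Unnamed: 24` and `Unnamed: 25`, and from `TEM_Max_96h` onward the dtypes no longer match the names (for example, `WEP_96h` is float64 while `TEM_Max_96h` is object). One plausible reading is that every name after `TEM_Max_72h` is shifted by one position. A minimal sketch that rebuilds the 26 names under that assumption. `forecast_columns` and `repair_forecast_columns` are hypothetical helpers, and the regular 24h to 168h naming pattern is inferred from the visible names, not documented in the source:

```python
def forecast_columns():
    """Expected column names: 5 key columns plus 3 fields per 24h horizon."""
    names = ["Station_Name", "VACODE", "Year", "Mon", "Day"]
    for hours in range(24, 169, 24):  # 24h, 48h, ..., 168h
        names += [f"TEM_Max_{hours}h", f"TEM_Min_{hours}h", f"WEP_{hours}h"]
    return names


def repair_forecast_columns(df):
    """Return a copy of df relabelled with the expected column names."""
    names = forecast_columns()
    # Refuse to relabel if the column count differs from the expected 26.
    if len(df.columns) != len(names):
        raise ValueError(f"expected {len(names)} columns, got {len(df.columns)}")
    out = df.copy()
    out.columns = names
    return out
```

Applied as `weather_forecast_2018 = repair_forecast_columns(weather_forecast_2018)`, the eight object columns would become `Station_Name` plus the seven `WEP_*` fields and the fourteen float64 columns would all be temperatures, which is consistent with the dtype counts reported by `info()`.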
!jupyter nbconvert --to markdown "“联创黔线”杯大数据应用创新大赛.ipynb"