

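    • Competition overview
    • Code
      • 1 Feature engineering
        • 1.1 Positive samples
        • 1.2 Negative samples
      • 2 Modeling
      • 3 Prediction
        • 3.1 Test set
      • 4 Submission
      • 0 Data inspection
        • 0.1 Training data
          • 0.1.1 Positive samples
          • 0.1.2 Negative samples
          • 0.1.3 Weather data
        • 0.2 Test data
          • 0.2.1 Test set
          • 0.2.2 Weather data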






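The training set (training_set) is about 2.3 GB and contains three folders: 201708n, 201708q, and weather_data_2017. They hold the users' historical data and the historical weather data for June and July 2017.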

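User identity attributes table (201708n1.txt, 201708q1.txt)
User handset information table (201708n2.txt, 201708q2.txt)
User roaming behavior table (201708n3.txt, 201708q3.txt)
User out-of-province roaming table (201708n4.txt, 201708q4.txt)
User geographic location table (201708n6.txt, 201708q6.txt)
User app usage table (201708n7.txt, 201708q7.txt)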

The 201808 folder contains 7 txt files named 2018_1.txt, 2018_2.txt, …, 2018_7.txt; their fields correspond to those of the training set.


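This competition uses AUC to evaluate models. AUC is the area under the ROC (Receiver Operating Characteristic) curve, which plots the True Positive Rate on the vertical axis against the False Positive Rate on the horizontal axis.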
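To make the metric concrete: the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below computes AUC that way in plain Python. The function name `auc_score` and the toy labels and scores are illustrative assumptions, not part of the competition code.

```python
def auc_score(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): two positives, three negatives.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
print(auc_score(labels, scores))  # 5 of the 6 pairs are ordered correctly
```

The pairwise loop is quadratic in the number of samples, so it is only meant for illustration; only the ordering of the scores matters, not their absolute values.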



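Because results depend on how well a model generalizes, the submission with the highest score on the Public leaderboard does not necessarily score highest on the Private leaderboard. Each participant therefore selects, from their valid submissions, the two result files they believe best balance generalization and score; these enter the Private leaderboard evaluation.
The Private leaderboard is revealed after the competition ends, and final scores and rankings are based on it.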


import pandas as pd
import numpy as np


# Reduce memory usage by downcasting each numeric column to the smallest dtype that holds its range
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Note: float16 has low precision and overflows above 65504, so sums can become inf
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
1 Feature engineering



q1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q1.txt', sep='\t', header=None))
q1.columns = ['year_month', 'id', 'consume', 'label']
Mem. usage decreased to  0.16 Mb (53.1% reduction)
          year_month            id       consume    label
count   11200.000000  1.120000e+04  1.086500e+04  11200.0
mean   201706.500000  5.416583e+15           inf      1.0
std         0.500022  2.642827e+15           inf      0.0
min    201706.000000  1.448104e+12  4.998779e-02      1.0
25%    201706.000000  3.117220e+15  4.068750e+01      1.0
50%    201706.500000  5.456254e+15  9.837500e+01      1.0
75%    201707.000000  7.702940e+15  1.785000e+02      1.0
max    201707.000000  9.997949e+15  1.324000e+03      1.0
RangeIndex: 11200 entries, 0 to 11199
Data columns (total 4 columns):
year_month    11200 non-null int32
id            11200 non-null int64
consume       10865 non-null float16
label         11200 non-null int8
dtypes: float16(1), int32(1), int64(1), int8(1)
memory usage: 164.1 KB
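The `inf` values for the mean and std of consume in the summary above are consistent with the column having been downcast to float16: the largest finite float16 value is 65504, and the sum of roughly 10,000 values around 100 is far above that. The sketch below reproduces the effect with NumPy on made-up values; it is an illustration of the overflow, not the author's data.

```python
import numpy as np

# Largest finite value representable in float16.
print(np.finfo(np.float16).max)

# 1000 hypothetical consumption values of 120.0, stored as float16.
values = np.full(1000, 120.0, dtype=np.float16)

total16 = values.sum()                     # result is float16, so 120000 overflows
total32 = values.astype(np.float32).sum()  # upcasting first keeps the sum finite

print(total16, total32)
```

If exact aggregates matter, one option is to cast such a column to float32 or float64 before summing or averaging.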
q1 = q1.fillna(98.0)
RangeIndex: 11200 entries, 0 to 11199
Data columns (total 4 columns):
year_month    11200 non-null int32
id            11200 non-null int64
consume       11200 non-null float16
label         11200 non-null int8
dtypes: float16(1), int32(1), int64(1), int8(1)
memory usage: 164.1 KB
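The fill value 98.0 used above is close to the 50% row of the consume summary (9.837500e+01), so it looks like a median fill, although the source does not say so. A minimal sketch of computing the fill value from the data instead of hard-coding it, on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of the five consume values are missing.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "consume": [40.0, np.nan, 98.0, 180.0, np.nan],
})

median_consume = df["consume"].median()  # NaN values are skipped by default
df["consume"] = df["consume"].fillna(median_consume)
print(df)
```

Filling only the consume column, rather than calling `fillna` on the whole frame, avoids accidentally writing the same constant into other columns that may contain missing values.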
q1 = q1[['id', 'consume']]
q1_groupbyid = q1.groupby(['id']).agg({'consume': pd.Series.sum})
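Feature 1: phone brands the user has used, encoded as the top 9 brands plus "other", 10 in total (the mapping below actually names 10 brands, so the one-hot encoding yields 11 columns)
Feature 2: number of distinct phones (brand and model combinations) the user has used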
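The two features can be sketched without pandas as follows: keep the most frequent brands, collapse everything else (including missing values) into "other", then build per-user indicator flags and a distinct count. The records, `TOP_N`, and all names here are hypothetical, and for brevity this sketch counts collapsed brands rather than brand and model combinations.

```python
from collections import Counter, defaultdict

# Hypothetical (user id, brand) records; None marks a missing brand.
records = [
    (1, "A"), (1, "A"), (1, "B"),
    (2, "A"), (2, None),
    (3, "C"), (3, "B"), (3, "D"),
]
TOP_N = 2
OTHER = "other"

# Pick the TOP_N most frequent brands over all records.
brand_counts = Counter(b for _, b in records if b is not None)
top_brands = {b for b, _ in brand_counts.most_common(TOP_N)}

# Collapse rare or missing brands into OTHER and collect the set used by each user.
brands_by_user = defaultdict(set)
for uid, brand in records:
    brands_by_user[uid].add(brand if brand in top_brands else OTHER)

# One indicator per kept brand plus OTHER, and the number of distinct (collapsed) brands.
columns = sorted(top_brands) + [OTHER]
features = {
    uid: {
        **{f"brand_{c}": int(c in used) for c in columns},
        "brand_count": len(used),
    }
    for uid, used in brands_by_user.items()
}
print(features)
```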

q2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q2.txt', sep='\t', header=None))
q2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
Mem. usage decreased to 11.31 Mb (14.6% reduction)
RangeIndex: 289203 entries, 0 to 289202
Data columns (total 6 columns):
id                 289203 non-null int64
brand              197376 non-null object
type               197380 non-null object
first_use_time     289203 non-null int64
recent_use_time    289203 non-null int64
label              289203 non-null int8
dtypes: int64(3), int8(1), object(2)
memory usage: 11.3+ MB
# '其它' means "other"; it is the fill value for missing models and for brands outside the mapping
q2.type = q2.type.fillna('其它')
brand_series = pd.Series({'苹果' : 'iphone', '华为' : "huawei", '欧珀' : 'oppo', '维沃' : 'vivo', '三星' : 'san', '小米' : 'mi', '金立' : 'jinli', '魅族' : 'mei', '乐视' : 'le', '四季恒美' : 'siji'})

q2.brand = q2.brand.map(brand_series)
q2.brand = q2.brand.fillna('其它')
                 id   brand       type  first_use_time  recent_use_time  label
0  1752398069509000      其它         其它  20161209134530   20161209190636      1
1  1752398069509000  huawei   PLK-AL10  20170609223138   20170609224345      1
2  1752398069509000      le  LETV X501  20160924102711   20160924112425      1
3  1752398069509000   jinli   金立 GN800  20150331210255   20150630131232      1
4  1752398069509000   jinli  GIONEE M5  20170508191216   20170605192347      1
q2['brand_type'] = q2['brand'] + q2['type']
                 id   brand       type  first_use_time  recent_use_time  label      brand_type
0  1752398069509000      其它         其它  20161209134530   20161209190636      1            其它其它
1  1752398069509000  huawei   PLK-AL10  20170609223138   20170609224345      1  huaweiPLK-AL10
2  1752398069509000      le  LETV X501  20160924102711   20160924112425      1     leLETV X501
3  1752398069509000   jinli   金立 GN800  20150331210255   20150630131232      1   jinli金立 GN800
4  1752398069509000   jinli  GIONEE M5  20170508191216   20170605192347      1  jinliGIONEE M5
groupbybrand_type = q2['brand_type'].value_counts()
其它其它                     91823
iphoneA1586              14898
iphoneA1524              10330
iphoneA1700               9246
iphoneA1699               8277
iphoneIPHONE6S(A1633)     6271
oppoOPPO R9M              4725
iphoneA1530               4640
oppoOPPO R9TM             2978
vivoVIVO X7               2516
Name: brand_type, dtype: int64

q2_brand_type = q2[['id', 'brand_type']]
q2_brand_type = q2_brand_type.drop_duplicates()
q2_groupbyid = q2_brand_type['id'].value_counts()
q2_groupbyid = q2_groupbyid.reset_index()
q2_groupbyid.columns = ['id', 'phone_nums']
id phone_nums
0 8707678197418467 422
1 9196501153454276 409
2 3900535090108175 389
3 4104535378288025 352
4 1106540188374027 350
RangeIndex: 5600 entries, 0 to 5599
Data columns (total 2 columns):
id            5600 non-null int64
phone_nums    5600 non-null int64
dtypes: int64(2)
memory usage: 87.6 KB
q2_brand = q2[['id', 'brand']]
q2_brand = q2_brand.drop_duplicates()
q2_brand_one_hot = pd.get_dummies(q2_brand)
id brand_huawei brand_iphone brand_jinli brand_le brand_mei brand_mi brand_oppo brand_san brand_siji brand_vivo brand_其它
0 1752398069509000 0 0 0 0 0 0 0 0 0 0 1
1 1752398069509000 1 0 0 0 0 0 0 0 0 0 0
2 1752398069509000 0 0 0 1 0 0 0 0 0 0 0
3 1752398069509000 0 0 1 0 0 0 0 0 0 0 0
8 1752398069509000 0 0 0 0 0 0 0 1 0 0 0
# Collapse to one row per user: a brand flag is 1 if the user has any record with that brand
q2_one_hot = q2_brand_one_hot.groupby(['id']).agg({'brand_huawei': pd.Series.max,
                                                   'brand_iphone': pd.Series.max,
                                                   'brand_jinli': pd.Series.max,
                                                   'brand_le': pd.Series.max,
                                                   'brand_mei': pd.Series.max,
                                                   'brand_mi': pd.Series.max,
                                                   'brand_oppo': pd.Series.max,
                                                   'brand_san': pd.Series.max,
                                                   'brand_siji': pd.Series.max,
                                                   'brand_vivo': pd.Series.max,
                                                   'brand_其它': pd.Series.max})
brand_huawei brand_iphone brand_jinli brand_le brand_mei brand_mi brand_oppo brand_san brand_siji brand_vivo brand_其它
1448103998000 1 1 0 1 1 0 1 1 0 0 1
17398718813730 1 1 1 1 1 1 1 1 0 1 1
61132623486000 1 0 0 0 0 0 0 0 0 0 1
68156596675520 0 1 1 1 0 0 0 0 0 0 1
76819334576430 1 1 1 0 1 1 1 1 0 1 1
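The one-hot step above yields one row per (user, brand) pair, and taking the per-user maximum of each indicator collapses these into a single row per user. A compact sketch of the same idea on hypothetical data, using `groupby(...).max()` over all indicator columns instead of listing each column:

```python
import pandas as pd

# Hypothetical (id, brand) pairs; user 10 has used two different brands.
df = pd.DataFrame({
    "id": [10, 10, 10, 20],
    "brand": ["huawei", "le", "huawei", "iphone"],
})

# One indicator column per brand; dtype=int keeps the flags as 0/1 integers.
one_hot = pd.get_dummies(df, columns=["brand"], dtype=int)

# A flag is 1 for a user if any of that user's rows has the brand.
per_user = one_hot.groupby("id").max()
print(per_user)
```

This avoids repeating every column name in an `agg` dictionary, which also keeps the code valid if the set of brands changes.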
pos_set = q1_groupbyid.merge(q2_groupbyid, on=['id'])

Int64Index: 5600 entries, 0 to 5599
Data columns (total 3 columns):
id            5600 non-null int64
consume       5600 non-null float16
phone_nums    5600 non-null int64
dtypes: float16(1), int64(2)
memory usage: 142.2 KB
pos_set = pos_set.merge(q2_one_hot, on=['id'])

Int64Index: 5600 entries, 0 to 5599
Data columns (total 14 columns):
id              5600 non-null int64
consume         5600 non-null float16
phone_nums      5600 non-null int64
brand_huawei    5600 non-null uint8
brand_iphone    5600 non-null uint8
brand_jinli     5600 non-null uint8
brand_le        5600 non-null uint8
brand_mei       5600 non-null uint8
brand_mi        5600 non-null uint8
brand_oppo      5600 non-null uint8
brand_san       5600 non-null uint8
brand_siji      5600 non-null uint8
brand_vivo      5600 non-null uint8
brand_其它        5600 non-null uint8
dtypes: float16(1), int64(2), uint8(11)
memory usage: 202.3 KB
2. Sum the out-of-province flag over the two months (yes: 1, no: 0)
3. Sum the out-of-country flag over the two months (yes: 1, no: 0)
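Since each user has one row per month and the flags are 0 or 1, the per-user sum over June and July ranges from 0 to 2 and counts the months in which the user roamed. A small plain-Python sketch with hypothetical rows:

```python
from collections import defaultdict

# Hypothetical monthly rows: (year_month, id, is_trans_provincial, is_transnational)
rows = [
    (201706, 1, 1, 0),
    (201707, 1, 1, 0),
    (201706, 2, 0, 0),
    (201707, 2, 1, 1),
]

# Accumulate both flags per user id.
totals = defaultdict(lambda: {"is_trans_provincial": 0, "is_transnational": 0})
for _, uid, provincial, national in rows:
    totals[uid]["is_trans_provincial"] += provincial
    totals[uid]["is_transnational"] += national

print(dict(totals))
```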

q3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q3.txt', sep='\t', header=None))
q3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
Mem. usage decreased to  0.18 Mb (64.6% reduction)
RangeIndex: 11200 entries, 0 to 11199
Data columns (total 6 columns):
year_month             11200 non-null int32
id                     11200 non-null int64
call_nums              11200 non-null int16
is_trans_provincial    11200 non-null int8
is_transnational       11200 non-null int8
label                  11200 non-null int8
dtypes: int16(1), int32(1), int64(1), int8(3)
memory usage: 186.0 KB
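q3_groupbyid_call = q3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
q3_groupbyid_provincial = q3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
# Reconstructed: the output below shows is_transnational in pos_set, so the same aggregation is applied to it
q3_groupbyid_transnational = q3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
pos_set = pos_set.merge(q3_groupbyid_call, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_provincial, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_transnational, on=['id'])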
Int64Index: 5600 entries, 0 to 5599
Data columns (total 17 columns):
id                     5600 non-null int64
consume                5600 non-null float16
phone_nums             5600 non-null int64
brand_huawei           5600 non-null uint8
brand_iphone           5600 non-null uint8
brand_jinli            5600 non-null uint8
brand_le               5600 non-null uint8
brand_mei              5600 non-null uint8
brand_mi               5600 non-null uint8
brand_oppo             5600 non-null uint8
brand_san              5600 non-null uint8
brand_siji             5600 non-null uint8
brand_vivo             5600 non-null uint8
brand_其它               5600 non-null uint8
call_nums              5600 non-null int16
is_trans_provincial    5600 non-null int8
is_transnational       5600 non-null int8
dtypes: float16(1), int16(1), int64(2), int8(2), uint8(11)
memory usage: 224.2 KB
q4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q4.txt', sep='\t', header=None))
q4.columns = ['year_month', 'id', 'province', 'label']
RangeIndex: 7289 entries, 0 to 7288
Data columns (total 4 columns):
year_month    7289 non-null int32
id            7289 non-null int64
province      7218 non-null object
label         7289 non-null int8
dtypes: int32(1), int64(1), int8(1), object(1)
memory usage: 149.6+ KB
year_month id province label
0 201707 6062475264825100 广东 1
1 201707 5627768389537500 北京 1
2 201707 2000900444179600 山西 1
3 201707 5304502776817600 四川 1
4 201707 5304502776817600 四川 1
q4_groupbyid = q4.groupby(['province']).size()
宁夏      15
吉林      20
内蒙古     22
黑龙江     27
青海      35
天津      39
辽宁      44
西藏      69
山西      70
甘肃      73
新疆      74
安徽      86
海南     100
陕西     114
山东     121
福建     150
河北     168
江苏     182
湖北     208
上海     215
河南     237
北京     247
江西     364
重庆     428
浙江     483
云南     530
广西     536
四川     793
广东     835
湖南     933
dtype: int64

q4.province = q4.province.fillna('湖南')

RangeIndex: 7289 entries, 0 to 7288
Data columns (total 4 columns):
year_month    7289 non-null int32
id            7289 non-null int64
province      7289 non-null object
label         7289 non-null int8
dtypes: int32(1), int64(1), int8(1), object(1)
memory usage: 149.6+ KB
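湖南 (Hunan) is the most frequent province in the counts above (933 rows), so filling the missing values with it amounts to a mode fill. A minimal sketch that derives the mode from the data rather than hard-coding it, using a hypothetical list of values:

```python
from collections import Counter

# Hypothetical province column; None marks a missing value.
provinces = ["湖南", "广东", None, "湖南", "四川", None]

# Most frequent non-missing value.
most_common_province, _ = Counter(
    p for p in provinces if p is not None
).most_common(1)[0]

filled = [p if p is not None else most_common_province for p in provinces]
print(most_common_province, filled)
```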
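q4_groupbyid = q4[['id', 'province']].groupby(['id']).size()
q4_groupbyid = q4_groupbyid.reset_index()
# Name the count column to match the province_out_cnt column shown in the output below
q4_groupbyid.columns = ['id', 'province_out_cnt']
pos_set = pos_set.merge(q4_groupbyid, how='left', on=['id'])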

Int64Index: 5600 entries, 0 to 5599
Data columns (total 18 columns):
id                     5600 non-null int64
consume                5600 non-null float16
phone_nums             5600 non-null int64
brand_huawei           5600 non-null uint8
brand_iphone           5600 non-null uint8
brand_jinli            5600 non-null uint8
brand_le               5600 non-null uint8
brand_mei              5600 non-null uint8
brand_mi               5600 non-null uint8
brand_oppo             5600 non-null uint8
brand_san              5600 non-null uint8
brand_siji             5600 non-null uint8
brand_vivo             5600 non-null uint8
brand_其它               5600 non-null uint8
call_nums              5600 non-null int16
is_trans_provincial    5600 non-null int8
is_transnational       5600 non-null int8
province_out_cnt       1942 non-null float64
dtypes: float16(1), float64(1), int16(1), int64(2), int8(2), uint8(11)
memory usage: 268.0 KB
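pos_set = pos_set.fillna(0)
pos_set['label'] = 1  # reconstructed: the output below shows an added int64 label column; positive samples are labeled 1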
Int64Index: 5600 entries, 0 to 5599
Data columns (total 19 columns):
id                     5600 non-null int64
consume                5600 non-null float16
phone_nums             5600 non-null int64
brand_huawei           5600 non-null uint8
brand_iphone           5600 non-null uint8
brand_jinli            5600 non-null uint8
brand_le               5600 non-null uint8
brand_mei              5600 non-null uint8
brand_mi               5600 non-null uint8
brand_oppo             5600 non-null uint8
brand_san              5600 non-null uint8
brand_siji             5600 non-null uint8
brand_vivo             5600 non-null uint8
brand_其它               5600 non-null uint8
call_nums              5600 non-null int16
is_trans_provincial    5600 non-null int8
is_transnational       5600 non-null int8
province_out_cnt       5600 non-null float64
label                  5600 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2), uint8(11)
memory usage: 311.7 KB
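Only 1942 of the 5600 users have rows in the out-of-province table, which is why the left merge above leaves missing values that are then filled with 0 (and why the column is float64). The sketch below shows the same pattern on hypothetical frames: count rows per user, left-merge so every user is kept, and fill the gaps with 0.

```python
import pandas as pd

# Hypothetical user table and trip table; user 2 has no trips.
users = pd.DataFrame({"id": [1, 2, 3], "consume": [10.0, 20.0, 30.0]})
trips = pd.DataFrame({"id": [1, 1, 3], "province": ["广东", "四川", "湖南"]})

# Rows per user, with the count column named directly.
trip_cnt = trips.groupby("id").size().reset_index(name="province_out_cnt")

# A left merge keeps all users; users without trips get NaN, which is replaced by 0.
merged = users.merge(trip_cnt, how="left", on="id")
merged["province_out_cnt"] = merged["province_out_cnt"].fillna(0).astype(int)
print(merged)
```

An inner merge (the default) would instead silently drop the users without trips.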
q6 is ignored for now

Int64Index: 48668 entries, 0 to 48667
Data columns (total 2 columns):
id      48668 non-null int64
pred    48668 non-null float32
dtypes: float32(1), int64(1)
memory usage: 950.5 KB
Int64Index: 48668 entries, 0 to 48667
Data columns (total 2 columns):
id      48668 non-null int64
pred    48668 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
# fill 0 0.469  dropna-0.50005  addfeat-0.46  addfeat dropna-0.4549
# fill 1 0.436  addfeat dropna-0.419
# fill mean0.088458 0.43048757
submit_xgb = tt_xgb.fillna(0.0)
# fill 0 addfeat-0.4491 0.4539  addfeat dropna-0.4512
submit_gbm = tt.fillna(0.0)
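1. Model blending by summing the predictions: score 0.4558
2. All predictions set to 1.0 (or all to 0.0): score 0.5
3. Predictions above 0.5 set to 1.0 and below 0.5 set to 0.0. About 2,800 users should be predicted as going. xgb at 0.26: score 0.50153; gbm at 0.17: score 0.50554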
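The score of 0.5 for constant predictions follows from the AUC definition: when every score is identical, every positive/negative pair is a tie. Converting probabilities to hard 0/1 labels also introduces ties and therefore discards ranking information. The sketch below illustrates both effects with a pairwise AUC on made-up labels and scores; `auc_score`, `xgb`, and `gbm` are hypothetical stand-ins, not the author's models or data.

```python
def auc_score(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and two sets of model scores.
labels = [1, 0, 0, 1, 0, 0]
xgb = [0.70, 0.30, 0.10, 0.20, 0.25, 0.05]
gbm = [0.60, 0.20, 0.15, 0.25, 0.10, 0.05]

blend_sum = [a + b for a, b in zip(xgb, gbm)]         # blending by summing
blend_mean = [s / 2 for s in blend_sum]               # same ordering, so same AUC
constant = [1.0] * len(labels)                        # every pair is a tie
hard = [1.0 if s > 0.5 else 0.0 for s in blend_sum]   # thresholded to 0/1

for name, preds in [("sum", blend_sum), ("mean", blend_mean),
                    ("constant", constant), ("hard", hard)]:
    print(name, auc_score(labels, preds))
```

Because AUC depends only on the ordering of the scores, summing and averaging two models give the same result, and submitting the raw probabilities generally preserves more ranking information than thresholded labels.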

id pred
count 5.020000e+04 50200.000000
mean 5.449990e+15 0.092590
std 2.628886e+15 0.088487
min 5.959412e+11 0.000000
25% 3.177008e+15 0.034837
50% 5.441108e+15 0.063993
75% 7.726328e+15 0.125547
max 9.999920e+15 0.754152
id pred
count 2.818000e+03 2818.000000
mean 5.523494e+15 0.350387
std 2.632627e+15 0.083545
min 7.736480e+13 0.260060
25% 3.193231e+15 0.287803
50% 5.528103e+15 0.324941
75% 7.801996e+15 0.386373
max 9.999505e+15 0.754152
id pred
count 2.539000e+03 2539.000000
mean 5.482621e+15 0.298836
std 2.625965e+15 0.062903
min 7.736480e+13 0.230013
25% 3.200866e+15 0.253366
50% 5.471503e+15 0.279145
75% 7.742764e+15 0.326900
max 9.999505e+15 0.632138
id pred
count 2.859000e+03 2859.000000
mean 5.493943e+15 0.290563
std 2.630246e+15 0.063701
min 7.736480e+13 0.220121
25% 3.195841e+15 0.244933
50% 5.501943e+15 0.270700
75% 7.743865e+15 0.321506
max 9.999505e+15 0.632138
id pred
count 5.020000e+04 50200.000000
mean 5.449990e+15 0.085097
std 2.628886e+15 0.071304
min 5.959412e+11 0.000000
25% 3.177008e+15 0.036845
50% 5.441108e+15 0.062206
75% 7.726328e+15 0.113462
max 9.999920e+15 0.632138
Int64Index: 50200 entries, 91 to 50199
Data columns (total 2 columns):
id      50200 non-null int64
pred    50200 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
Int64Index: 50200 entries, 14 to 50199
Data columns (total 2 columns):
id      50200 non-null int64
pred    50200 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
