Some Interesting Ideas from the 2019 Mobile Ad Anti-Fraud Algorithm Challenge

First, a quick rundown of a few approaches I have already coded up but not yet run, followed by some ideas borrowed from other competitors. If you find this useful after reading, feel free to give it a like, haha.

 

Idea 1: Handling the long tail

Handle the long tail of the categorical features, i.e. group the values of make, model, ver and similar columns that appear fewer than 20 times into a single class. CatBoost automatically brute-forces combinations of categorical features (by default it combines at most four features at a time; see the official CatBoost documentation), so collapsing the long tail first makes those generated combinations more effective.

To enable this, just set Use_Reduce_Dim = True.
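
A minimal sketch of this count-based collapsing, assuming a pandas DataFrame and a threshold of 20 (note that the dataload script below actually keeps the top-N values of each column instead of using a count threshold):

import pandas as pd

def collapse_long_tail(df, cols, min_count=20):
    # Replace every category value that appears fewer than min_count times with 'others'.
    for col in cols:
        counts = df[col].value_counts()
        rare = counts[counts < min_count].index
        df.loc[df[col].isin(rare), col] = 'others'
    return df

# e.g. all_df = collapse_long_tail(all_df, ['make', 'model', 'ver'], min_count=20)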

 

Idea 2: Brute-force feature crossing

Define a batch of cross features without thinking too hard about each one. CatBoost already builds feature crosses on its own once the categorical features are declared, but the top teams reportedly defined 200+ features, and I could not come up with that many meaningful ones by hand. So I simply looped over feature pairs and generated a pile of second-order crosses. I have no idea whether they actually help, but it is certainly quick and convenient.

To enable this, just set Use_Cross_Fea = True.
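
A minimal sketch of the brute-force pairwise crossing (hypothetical helper; the dataload script below crosses three hand-picked feature groups rather than all pairs):

import itertools

def add_pairwise_crosses(df, cols):
    # Concatenate every pair of columns as strings to form a new categorical cross feature.
    for a, b in itertools.combinations(cols, 2):
        df['cross_' + a + '_and_' + b] = df[a].astype(str) + '_' + df[b].astype(str)
    return df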

 

Idea 3: PCA dimensionality reduction

After the brute-force crossing the number of features gets fairly large, so I tried reducing them with PCA. With 152 features, just 20 principal components already reach 99% explained variance; I am not sure whether that means the cross features are largely collinear or something else. When PCA is enabled, the original cross features are dropped by default.

To enable this, just set Use_PCA = True.
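
A small sketch for inspecting the cumulative explained variance when picking the number of components (the dataload script below just passes n_components=0.98 and lets sklearn choose):

import numpy as np
from sklearn.decomposition import PCA

def explained_variance_curve(X):
    # Fit a full PCA and return the cumulative explained-variance ratio per component count.
    pca = PCA()
    pca.fit(X)
    return np.cumsum(pca.explained_variance_ratio_)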

 

Idea 4: Word2Vec

Concatenate the media fields (pkgname, ver, adunitshowid, mediashowid, apptype) and use Word2Vec() to learn vector-space similarities between the concatenated media records (it might also be worth trying the same thing with a subset of the device features). The idea comes from Lin Youxi's solution to the 2018 CCF Big Data & Computational Intelligence Contest (personalized telecom package matching for existing users). Having only a shallow background in NLP, I personally feel this use of Word2Vec is a bit questionable: word2vec places two words close together because they appear in similar contexts. For instance, if the corpus contains both 'I have a cat' and 'I have a dog', word2vec notices that 'I have a' is sometimes followed by 'cat' and sometimes by 'dog', and therefore treats the two words as close in the embedding space. Treating the concatenated fields of a single record as one 'sentence' does not obviously have that property, which is the part I cannot quite reconcile. These are just my very unprofessional opinions, for reference only; in a competition the leaderboard score is what matters.

To enable this, set Use_w2v = True.

 

Idea 5: Count features

For features with strong individual predictive power, simply count how often each value appears, as long as the column name contains one of the important features. This is mainly out of laziness: I could not figure out which columns deserve count statistics and which do not...

Usage: set Use_Count_Fea = True.

 

Idea 6: Missing-value imputation

If 'empty' is also treated as NaN, missing values actually account for a fairly large share of the data. There happens to be a Zhihu post on handling data with many missing values that I found quite good, and it differs from the usual mean/mode imputation I had seen before: it first gives a rigorous classification of the types of missingness and then handles each type differently, and the author open-sourced an implementation on GitHub. I wanted to study it, but it is written in R, which I cannot read, so I did not use it. If I find the time I will try to port it to Python.
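
In the meantime, here is a minimal sketch of a slightly richer fallback than a constant fill: mode imputation plus a missingness-indicator column with sklearn (hypothetical helper; the dataload script below simply fills everything with -9999):

import pandas as pd
from sklearn.impute import SimpleImputer

def impute_with_indicator(df, cols):
    # Add a *_missing flag so the model still sees where a value was originally absent,
    # then fill the missing entries with the most frequent value of each column.
    for col in cols:
        df[col + '_missing'] = df[col].isnull().astype(int)
    imputer = SimpleImputer(strategy='most_frequent')
    df[cols] = imputer.fit_transform(df[cols])
    return df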

 

Idea 7: Learning from other people's solutions

The first solution uses a deep-learning model. It has two inputs: the raw features go through Embedding layers, while the hand-crafted numerical features go through an MLP to form 'hidden' features; the two parts are then merged, fed into a CuDNNGRU, and finished with fully connected layers. I had actually tried something similar before, but with an LSTM and only the raw data.
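
A rough sketch of that architecture as I understand it, written with tf.keras; the layer sizes, the plain GRU in place of CuDNNGRU, and the exact input split are my own assumptions, not the original author's code:

from tensorflow.keras.layers import Input, Embedding, Dense, GRU, Concatenate
from tensorflow.keras.models import Model

def build_two_input_model(n_cat_values, seq_len, n_num_feats, emb_dim=16):
    # Branch 1: raw categorical ids, embedded and fed through a GRU.
    cat_in = Input(shape=(seq_len,), name='cat_ids')
    x1 = Embedding(input_dim=n_cat_values, output_dim=emb_dim)(cat_in)
    x1 = GRU(64)(x1)

    # Branch 2: hand-crafted numeric statistics through a small MLP.
    num_in = Input(shape=(n_num_feats,), name='num_feats')
    x2 = Dense(64, activation='relu')(num_in)

    # Merge both branches and classify.
    x = Concatenate()([x1, x2])
    x = Dense(64, activation='relu')(x)
    out = Dense(1, activation='sigmoid')(x)
    return Model(inputs=[cat_in, num_in], outputs=out)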

 

The second solution: the data-analysis methodology was really nice, but the author did not open-source the model.

 

The third solution uses CatBoost, LightGBM, XGBoost and an MLP, and covers data preprocessing, feature selection and model selection. Very well done.
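
I did not reproduce that pipeline, but for reference, a common way to combine several such models is simply to average their predicted probabilities (a sketch with hypothetical per-model prediction arrays):

import numpy as np

def blend_probabilities(prob_lists, weights=None):
    # Simple (optionally weighted) average of per-model predicted probabilities.
    probs = np.vstack(prob_lists)
    return np.average(probs, axis=0, weights=weights)

# e.g. final_proba = blend_probabilities([cat_proba, lgb_proba, xgb_proba, mlp_proba])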

 

The fourth solution: I have always wondered what makes a feature 'good' in data mining. Feature importance is one way to judge, but this Zhihu post suggests another angle: the overlap of feature values between the training and test sets, and the shape of the feature's own distribution. I am still working through it and have some questions; for example, it claims that a feature whose values are strongly skewed tends to be a good one, which I do not fully understand yet.
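
A quick sketch of the train/test overlap check described there (the column name is just an example):

def value_overlap(train_col, test_col):
    # Fraction of test rows whose value also appears somewhere in the training column.
    train_values = set(train_col.unique())
    return test_col.isin(train_values).mean()

# e.g. value_overlap(train['pkgname'], test['pkgname'])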

 

Implementation:

There is nothing fancy in the code itself. I left out the data-cleaning part; the cleaning steps can be found in my earlier post (which, haha, is not very well written). Since CatBoost runs prediction on the CPU, I predict in batches to save memory. For training you can choose between 5-fold cross-validation and a single split.

 

dataload part:

# -*- coding: utf-8 -*-
# @Time    : 2019/9/15 21:21
# @Author  : YYLin
# @Email   : [email protected]
# @File    : All_Try_Dataload.py
import numpy as np
import pandas as pd
from datetime import timedelta
import multiprocessing
from gensim.models import Word2Vec
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
import gc
import warnings
warnings.filterwarnings('ignore')

# Concatenate the media fields pkgname, ver, adunitshowid, mediashowid, apptype and learn word2vec embeddings on them
Use_w2v = True

# Collapse the long tail: keep only the top-N values of each categorical feature
Use_Reduce_Dim = True

# Build second-order cross features between the media, location and device feature groups
Use_Cross_Fea = True

# For columns whose name contains a strong feature (e.g. pkgname), add occurrence-count features
Use_Count_Fea = True

# Reduce the cross features with PCA; the original cross columns are dropped afterwards
Use_PCA = True

path = '../A_Data/'
print('loading train dataset .......')
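# note: nrows=1000 below only loads a small sample, presumably for debugging; remove nrows for a full run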
train_1 = pd.read_table(path + 'round1_iflyad_anticheat_traindata.txt', nrows=1000)
train_2 = pd.read_table(path + 'round2_iflyad_anticheat_traindata.txt', nrows=1000)
train = train_1.append(train_2).reset_index(drop=True)

print('loading test dataset ....')
test_a = pd.read_table(path + 'round2_iflyad_anticheat_testdata_feature_A.txt', nrows=1000)
test_b = pd.read_table(path + 'round2_iflyad_anticheat_testdata_feature_B.txt', nrows=1000)
test = test_a.append(test_b).reset_index(drop=True)

all_df = train.append(test).reset_index(drop=True)
print('train:', train.shape, 'test:', test.shape)

del train_1, train_2, train, test_a, test_b, test

# Drop a few features
del_cols = ['idfamd5', 'adidmd5', 'os']
for col in del_cols:
    all_df.drop(columns=col, inplace=True)

cat_list = [i for i in all_df.columns if i not in ['sid', 'label', 'nginxtime', 'ip', 'reqrealip']]

if Use_Reduce_Dim:
    # Keep only the top-300 make values
    makes = []
    for i in all_df['make'].value_counts().head(300).index:
        makes.append(i)
    all_df.loc[~all_df['make'].isin(makes), 'make'] = 'others'

    # Normalize the model strings; this removes roughly 3000 distinct values
    def deal_model(x):
        if '-' in x:
            x = x.replace('-', ' ')
        if '%20' in x:
            x = x.replace('%20', ' ')
        if '_' in x:
            x = x.replace('_', ' ')
        if '+' in x:
            x = x.replace('+', ' ')
        if 'the palace museum edition' in x:
            x = 'mix 3'
        return x


    print('loading model ........')
    all_df['model'] = all_df['model'].astype('str').map(lambda x: x.lower()).apply(deal_model)

    # Keep only the top-1500 model values
    models = []
    for i in all_df['model'].value_counts().head(1500).index:
        models.append(i)
    all_df.loc[~all_df['model'].isin(models), 'model'] = 'others'

    # Keep only the top-1500 ver values
    vers = []
    for i in all_df['ver'].value_counts().head(1500).index:
        vers.append(i)
    all_df.loc[~all_df['ver'].isin(vers), 'ver'] = 'others'

    # Keep only the top-150 osv values
    osvs = []
    for i in all_df['osv'].value_counts().head(150).index:
        osvs.append(i)
    all_df.loc[~all_df['osv'].isin(osvs), 'osv'] = 'others'

    # Keep only the top-12 lan values
    lans = []
    for i in all_df['lan'].value_counts().head(12).index:
        lans.append(i)
    all_df.loc[~all_df['lan'].isin(lans), 'lan'] = 'others'

if Use_w2v:
    L = 10
    sentence = []
    for line in list(all_df[['pkgname', 'ver', 'adunitshowid', 'mediashowid', 'apptype']].values):
        sentence.append([str(tmp) for idx, tmp in enumerate(line)])

    print('training...')
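    # note: Word2Vec(size=..., iter=...) is the gensim < 4.0 API; gensim 4 renamed these to vector_size and epochs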
    model = Word2Vec(sentence, size=L, window=2, min_count=1, workers=multiprocessing.cpu_count(),
                     iter=2)

    for fea in ['pkgname', 'ver', 'adunitshowid', 'mediashowid', 'apptype']:
        values = []
        for line in list(all_df[fea].values):
            values.append(line)
        values = set(values)
        print(len(values))
        w2v = []
        for i in values:
            a = [i]
            a.extend(model.wv[str(i)])
            w2v.append(a)
        out_df = pd.DataFrame(w2v)

        name = [fea]
        for i in range(L):
            name.append(name[0] + '_w2v_' + str(i))
        out_df.columns = name

        df = out_df.drop_duplicates([fea])
        all_df = pd.merge(all_df, df, on=fea, how='left')

# IP address prefixes (first 2 and first 3 octets)
all_df['ip_1'] = all_df['ip'].map(lambda x: '.'.join(x.split('.')[0:2]))
all_df['ip_2'] = all_df['ip'].map(lambda x: '.'.join(x.split('.')[0:3]))

all_df['reqrealip_1'] = all_df['reqrealip'].map(lambda x: '.'.join(str(x).split('.')[0:2]))
all_df['reqrealip_2'] = all_df['reqrealip'].map(lambda x: '.'.join(str(x).split('.')[0:3]))

cross_feature = []

if Use_Cross_Fea:
    # Cross features are built between three hand-picked groups of columns
    first_feature = ['pkgname', 'adunitshowid', 'mediashowid', 'apptype']
    second_feature = ['imeimd5', 'openudidmd5', 'lan', 'h', 'w', 'macmd5', 'dvctype', 'model', 'make',
                      'ntt', 'carrier', 'orientation', 'ppi', 'ver', 'osv']

    third_feature = ['ip_2', 'city', 'province', 'reqrealip_2']

    print('begin cross between first_feature and second_feature:')
    for feat_1 in first_feature:
        for feat_2 in second_feature:
            col_name = 'cross_' + feat_1 + '_and_' + feat_2
            print('cross name between first and second:', col_name)
            if col_name in cross_feature:
                continue
            else:
                cross_feature.append(col_name)
                all_df[col_name] = all_df[feat_1].astype(str).values + '_' + all_df[feat_2].astype(str).values

    print('begin cross between second_feature and third_feature:')
    for feat_1 in second_feature:
        for feat_2 in third_feature:
            col_name = 'cross_' + feat_1 + '_and_' + feat_2
            print('cross name between second and third:', col_name)
            if col_name in cross_feature:
                continue
            else:
                cross_feature.append(col_name)
                all_df[col_name] = all_df[feat_1].astype(str).values + '_' + all_df[feat_2].astype(str).values

    print('begin cross between first_feature and third_feature:')
    for feat_1 in first_feature:
        for feat_2 in third_feature:
            col_name = 'cross_' + feat_1 + '_and_' + feat_2
            print('cross name between first and third:', col_name)
            if col_name in cross_feature:
                continue
            else:
                cross_feature.append(col_name)
                all_df[col_name] = all_df[feat_1].astype(str).values + '_' + all_df[feat_2].astype(str).values

    print('the len of cross feature:', len(cross_feature))

if Use_PCA:
    pca = PCA(n_components=0.98)
    print('Label Encoding ...........')
    lab_col = [i for i in cross_feature]
    for i in tqdm(lab_col):
        lbl = LabelEncoder()
        all_df[i] = lbl.fit_transform(all_df[i].astype(str))

    matrix_pca = pca.fit_transform(all_df[cross_feature])
    variance_pca = pca.explained_variance_ratio_.sum()
    print('explained variance ratio after PCA:', variance_pca)

    PCA_Dict = dict()
    for i in range(matrix_pca.shape[1]):
        col_name = 'PCA_' + str(i)
        PCA_Dict.setdefault(col_name, [])
        PCA_Dict[col_name] = PCA_Dict[col_name] + matrix_pca[:, i].tolist()
    PCA_DF = pd.DataFrame(PCA_Dict)

    all_df = pd.concat([all_df, PCA_DF], axis=1)

    # After PCA, drop the original cross-feature columns
    for i in tqdm(lab_col):
        all_df.drop(columns=[i], inplace=True)

    cross_feature = []
    gc.collect()

    print('dataframe info after PCA:')
    all_df.info()


# Time features
all_df['time'] = pd.to_datetime(all_df['nginxtime']/1000, unit='s') + timedelta(hours=8)
all_df['hour'] = all_df['time'].dt.hour
all_df['minute'] = all_df.time.dt.minute
all_df['respond_time'] = abs(all_df['sid'].apply(lambda x: x.split('-')[-1]).astype(float)-all_df['nginxtime'])
all_df['arrive_time'] = all_df['sid'].apply(lambda x: x.split('-')[-1]).astype(float)

# Numeric features derived from screen size
all_df['size'] = (np.sqrt(all_df['h']**2 + all_df['w'] ** 2) / 2.54) / 1000
all_df['ratio'] = all_df['h'] / all_df['w']
all_df['px'] = all_df['ppi'] * all_df['size']
all_df['mj'] = all_df['h'] * all_df['w']

print('reduce memory by label-encoding object columns')
key_cols = [i for i in all_df.select_dtypes(object).columns if i not in ['sid']]
for col in key_cols:
    class_mapping = {label: idx for idx, label in enumerate(set(all_df[col]))}
    all_df[col] = all_df[col].map(class_mapping)

cat_list = cat_list + ['hour', 'reqrealip_2', 'ip_2'] + cross_feature
print('cat_list:', cat_list)

if Use_Count_Fea:
    # Occurrence counts for columns whose name contains one of the important features
    import_fea = ['ratio', 'pkgname', 'mediashowid', 'apptype', 'ip_2', 'reqrealip_2', 'adunitshowid']
    count_cols = []
    for i in all_df.columns:
        for impor_fea in import_fea:
            if impor_fea in i and i not in count_cols:
                count_cols.append(i)

    print('Data counting ............')
    for i in count_cols:
        all_df['{}_count'.format(i)] = all_df.groupby(i)['sid'].transform('count')

# Fill missing values
for column in list(all_df.columns[all_df.isnull().sum() > 0]):
    if column == 'label':
        continue
    else:
        all_df[column].fillna(-9999, inplace=True)


# Downcast numeric columns to save memory
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df


print('reduce_mem_usage ............')
all_df = reduce_mem_usage(all_df)

all_df.to_csv('data/test.csv', index=False)

 

train part: 

# -*- coding: utf-8 -*-
# @Time    : 2019/9/15 21:45
# @Author  : YYLin
# @Email   : [email protected]
# @File    : All_Try_Train.py
# cat_list holds the categorical features for CatBoost; just copy the cat_list printed by the dataload script
import pandas as pd
import gc
import math
from sklearn.model_selection import train_test_split
import catboost as cbt
import numpy as np
from sklearn.model_selection import StratifiedKFold

Use_Five_Fold = False

path = 'data/'

# sid column of the test set
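# note: nrows=2000 only loads a sample, presumably for debugging; remove it for a full run so the sid list matches the test features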
test = pd.read_csv(path + 'aichallenge2019_test_all_sample.csv', sep=',', usecols=['sid'], nrows=2000)

all_data = pd.read_csv(path + 'test.csv', sep=',')

cat_list = ['adunitshowid', 'apptype', 'carrier', 'city', 'dvctype', 'h', 'imeimd5', 'lan', 'macmd5', 'make', 'mediashowid',
            'model', 'ntt', 'openudidmd5', 'orientation', 'osv', 'pkgname', 'ppi', 'province', 'ver', 'w', 'device', 'hour',
            'reqrealip_2', 'ip_2', 'cross_pkgname_and_imeimd5', 'cross_pkgname_and_openudidmd5', 'cross_pkgname_and_lan', 'cross_pkgname_and_h', 'cross_pkgname_and_w', 'cross_pkgname_and_macmd5', 'cross_pkgname_and_dvctype', 'cross_pkgname_and_model', 'cross_pkgname_and_make', 'cross_pkgname_and_ntt', 'cross_pkgname_and_carrier', 'cross_pkgname_and_orientation', 'cross_pkgname_and_ppi', 'cross_pkgname_and_ver', 'cross_pkgname_and_osv', 'cross_adunitshowid_and_imeimd5', 'cross_adunitshowid_and_openudidmd5', 'cross_adunitshowid_and_lan', 'cross_adunitshowid_and_h', 'cross_adunitshowid_and_w', 'cross_adunitshowid_and_macmd5', 'cross_adunitshowid_and_dvctype', 'cross_adunitshowid_and_model', 'cross_adunitshowid_and_make', 'cross_adunitshowid_and_ntt', 'cross_adunitshowid_and_carrier', 'cross_adunitshowid_and_orientation', 'cross_adunitshowid_and_ppi', 'cross_adunitshowid_and_ver', 'cross_adunitshowid_and_osv', 'cross_mediashowid_and_imeimd5', 'cross_mediashowid_and_openudidmd5', 'cross_mediashowid_and_lan', 'cross_mediashowid_and_h', 'cross_mediashowid_and_w', 'cross_mediashowid_and_macmd5', 'cross_mediashowid_and_dvctype', 'cross_mediashowid_and_model', 'cross_mediashowid_and_make', 'cross_mediashowid_and_ntt', 'cross_mediashowid_and_carrier', 'cross_mediashowid_and_orientation', 'cross_mediashowid_and_ppi', 'cross_mediashowid_and_ver', 'cross_mediashowid_and_osv', 'cross_apptype_and_imeimd5', 'cross_apptype_and_openudidmd5', 'cross_apptype_and_lan', 'cross_apptype_and_h', 'cross_apptype_and_w', 'cross_apptype_and_macmd5', 'cross_apptype_and_dvctype', 'cross_apptype_and_model', 'cross_apptype_and_make', 'cross_apptype_and_ntt', 'cross_apptype_and_carrier', 'cross_apptype_and_orientation', 'cross_apptype_and_ppi', 'cross_apptype_and_ver', 'cross_apptype_and_osv', 'cross_imeimd5_and_ip_2', 'cross_imeimd5_and_city', 'cross_imeimd5_and_province', 'cross_imeimd5_and_reqrealip_2', 'cross_openudidmd5_and_ip_2', 'cross_openudidmd5_and_city', 'cross_openudidmd5_and_province', 'cross_openudidmd5_and_reqrealip_2', 'cross_lan_and_ip_2', 'cross_lan_and_city', 'cross_lan_and_province', 'cross_lan_and_reqrealip_2', 'cross_h_and_ip_2', 'cross_h_and_city', 'cross_h_and_province', 'cross_h_and_reqrealip_2', 'cross_w_and_ip_2', 'cross_w_and_city', 'cross_w_and_province', 'cross_w_and_reqrealip_2', 'cross_macmd5_and_ip_2', 'cross_macmd5_and_city', 'cross_macmd5_and_province', 'cross_macmd5_and_reqrealip_2', 'cross_dvctype_and_ip_2', 'cross_dvctype_and_city', 'cross_dvctype_and_province', 'cross_dvctype_and_reqrealip_2', 'cross_model_and_ip_2', 'cross_model_and_city', 'cross_model_and_province', 'cross_model_and_reqrealip_2', 'cross_make_and_ip_2', 'cross_make_and_city', 'cross_make_and_province', 'cross_make_and_reqrealip_2', 'cross_ntt_and_ip_2', 'cross_ntt_and_city', 'cross_ntt_and_province', 'cross_ntt_and_reqrealip_2', 'cross_carrier_and_ip_2', 'cross_carrier_and_city', 'cross_carrier_and_province', 'cross_carrier_and_reqrealip_2', 'cross_orientation_and_ip_2', 'cross_orientation_and_city', 'cross_orientation_and_province', 'cross_orientation_and_reqrealip_2', 'cross_ppi_and_ip_2', 'cross_ppi_and_city', 'cross_ppi_and_province', 'cross_ppi_and_reqrealip_2', 'cross_ver_and_ip_2', 'cross_ver_and_city', 'cross_ver_and_province', 'cross_ver_and_reqrealip_2', 'cross_osv_and_ip_2', 'cross_osv_and_city', 'cross_osv_and_province', 'cross_osv_and_reqrealip_2', 'cross_pkgname_and_ip_2', 'cross_pkgname_and_city', 'cross_pkgname_and_province', 'cross_pkgname_and_reqrealip_2', 'cross_adunitshowid_and_ip_2', 
'cross_adunitshowid_and_city', 'cross_adunitshowid_and_province', 'cross_adunitshowid_and_reqrealip_2', 'cross_mediashowid_and_ip_2', 'cross_mediashowid_and_city', 'cross_mediashowid_and_province', 'cross_mediashowid_and_reqrealip_2', 'cross_apptype_and_ip_2', 'cross_apptype_and_city', 'cross_apptype_and_province', 'cross_apptype_and_reqrealip_2']


feature_name = [i for i in all_data.columns if i not in ['sid', 'label', 'time']]

tr_index = ~all_data['label'].isnull()
X_train = all_data[tr_index][list(set(feature_name))].reset_index(drop=True)
y = all_data[tr_index]['label'].reset_index(drop=True).astype(int)
X_test = all_data[~tr_index][list(set(feature_name))].reset_index(drop=True)
print(X_train.shape, X_test.shape)

# Free memory from variables that are no longer needed
del all_data
gc.collect()


if Use_Five_Fold:
    n_split = 5
    random_seed = 2019
    pass_train = False
    Submit_result = np.zeros((X_test.shape[0],))

    skf = StratifiedKFold(n_splits=n_split, random_state=random_seed, shuffle=True)
    for index, (train_index, test_index) in enumerate(skf.split(X_train, y)):

        # If resuming training, skip the folds that were already trained
        if pass_train:
            if index < 1:
                print('pass %d:' % index, train_index[0:5], test_index[0:5])
                continue

        train_x, test_x, train_y, test_y = X_train[feature_name].iloc[train_index], X_train[feature_name].iloc[
            test_index], y.iloc[train_index], y.iloc[test_index]

        cbt_model = cbt.CatBoostClassifier(iterations=2, learning_rate=0.05, max_depth=11, l2_leaf_reg=1, verbose=10,
                                           early_stopping_rounds=400, task_type='GPU', eval_metric='F1', max_ctr_complexity=8,
                                           cat_features=cat_list)

        cbt_model.fit(train_x[feature_name], train_y, eval_set=(test_x[feature_name], test_y))

        # Free the fold's training data after fitting
        del train_x, train_y, test_x, test_y
        gc.collect()

        # CatBoost predicts on the CPU, which is relatively slow, so predict the test set in batches
        num = 20
        # use ceil so the last batch also covers any remainder rows
        line_num = math.ceil(len(X_test) / num)
        print('line_num:', line_num)

        Proba_result = []
        for i in range(num):
            test_1 = X_test.loc[i * line_num:(i + 1) * line_num - 1, feature_name]
            test_pred = cbt_model.predict_proba(test_1)[:, 1] / n_split
            Proba_result.extend(list(test_pred))

        Proba_result = np.array(Proba_result)
        Submit_result += Proba_result

        print('Saving result proba for fold:', index)
        submit = test[['sid']]
        submit['label'] = Proba_result
        submit.to_csv('cat_baseline_on_five_fold_%d.csv' % index, index=False)

        # keep freeing memory
        del cbt_model, Proba_result
        gc.collect()

    # save the fold-averaged probabilities (each fold's Proba_result was already divided by n_split)
    submit = test[['sid']]
    submit['label'] = Submit_result
    submit.to_csv('cat_baseline_on_five_fold_mean.csv', index=False)

else:
    train_x, val_x, train_y, val_y = train_test_split(X_train, y, test_size=0.2)

    cbt_model = cbt.CatBoostClassifier(iterations=2, learning_rate=0.05, max_depth=11, l2_leaf_reg=1, verbose=10,
                                       early_stopping_rounds=400, task_type='GPU', eval_metric='F1',
                                       cat_features=cat_list)

    cbt_model.fit(train_x[feature_name], train_y, eval_set=(val_x[feature_name], val_y))

    # Prediction only runs on the CPU, so pushing all ~2 million rows at once is very slow; predict the test set in batches instead
    num = 20
    # use ceil so the last batch also covers any remainder rows
    line_num = math.ceil(len(X_test) / num)
    print('line_num:', line_num)
    result = []
    for i in range(num):
        print(i)
        test_1 = X_test.loc[i * line_num:(i + 1) * line_num - 1, feature_name]
        test_pred = cbt_model.predict_proba(test_1)[:, 1]
        for tmp in test_pred:
            result.append(tmp)
        del test_pred
        gc.collect()

    print('len(result):', len(result))

    ser_result = pd.Series(result)
    del result
    gc.collect()

    submit = test[['sid']]

    submit['label'] = np.array(ser_result >= 0.5).astype(int)
    print(submit['label'].value_counts())
    submit.to_csv('cat_baseline.csv', index=False)

    # Also save the raw predicted probabilities so they can be used for model blending later
    df_sub = pd.concat([test['sid'], ser_result], axis=1)
    df_sub.columns = ['sid', 'label']
    df_sub.to_csv('cat_baseline_proba.csv', sep=',', index=False)

 

Run results:

[screenshot of the training run output]
