Telecom Customer Churn Prediction Challenge Baseline [AI Competition]

  • 1. Challenge
    • 1.1 Background
    • 1.2 Task
    • 1.3 Data
    • 1.4 Evaluation Metric
    • 1.5 Submission Requirements
    • 1.6 Prizes
  • 2. Baseline
    • 2.1 Import Modules
    • 2.2 Load and Prepare the Data
    • 2.3 Build the Models
    • 2.4 Train the LightGBM Model
    • 2.5 Write Predictions to File
    • 2.6 Submission Result

1. Challenge

Official competition page (iFLYTEK)

1.1 Background

As market saturation rises, competition among telecom operators keeps intensifying, and operators urgently need to reduce customer churn and extend the customer lifecycle. Every 5% increase in the churn rate can cut profits by 25%-85%, so analyzing and predicting telecom customer churn matters a great deal.

For this reason, operators usually run a customer-service department whose core job is churn analysis: identifying and winning back customers who are likely to leave, and thereby lowering the churn rate. One telecom organization is losing customers in large numbers, and its user base is shrinking rapidly. Faced with this problem, it has opened up part of its customer data and invites participants to build a churn-prediction model that flags customers likely to leave.

1.2 Task

You are given customer records from this organization's real business, containing 69 customer-related fields; the 是否流失 (churn) field indicates whether the customer churns within two months after the observation date. The goal is to train a model on the training set that predicts whether a customer will churn, so that retention efforts can be targeted accordingly.

1.3 Data

Download from CSDN, or from Kaggle.
The data consist of a training set and a test set drawn from more than 250,000 records with 69 feature fields. To keep the competition fair, 150,000 records are sampled as the training set and 30,000 as the test set, and some fields are anonymized.
Feature fields (column names are kept in Chinese in the raw files): 客户ID (customer ID), 地理区域, 是否双频, 是否翻新机, 当前手机价格, 手机网络功能, 婚姻状况, 家庭成人人数, 信息库匹配, 预计收入, 信用卡指示器, 当前设备使用天数, 在职总月数, 家庭中唯一订阅者的数量, 家庭活跃用户数, … , 过去六个月的平均每月使用分钟数, 过去六个月的平均每月通话次数, 过去六个月的平均月费用, 是否流失 (churn flag)

1.4 Evaluation Metric

The metric is AUC, with 1 as the positive class. Reference code:

from sklearn import metrics

# AUC with 1 as the positive class; the column names follow the organizer's reference snippet
auc = metrics.roc_auc_score(data['default_score_true'], data['default_score_pred'])
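
A tiny self-contained example (toy labels and scores, invented for illustration) confirms the orientation of the metric: a model that ranks every churned customer above every retained one scores 1.0.

from sklearn import metrics

# toy data: 1 = churn (the positive class); both positives outrank both negatives
y_true = [1, 0, 1, 0]
y_score = [0.9, 0.2, 0.6, 0.4]
print(metrics.roc_auc_score(y_true, y_score))  # 1.0, a perfect ranking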

1.5 Submission Requirements

Details of the prediction file:

  1. Submit a csv file encoded in UTF-8, with a header in the first row;

  2. Before submitting, make sure the format of your predictions matches sample_submit.csv (a minimal check is sketched below).
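
A minimal pre-submission sketch, assuming sample_submit.csv sits in the working directory and your prediction file uses the name test_sub.csv produced later in this post (both paths are assumptions; adjust as needed):

import pandas as pd

sample = pd.read_csv('sample_submit.csv')  # organizer-provided template (assumed path)
sub = pd.read_csv('test_sub.csv')          # our prediction file
assert list(sub.columns) == list(sample.columns), 'header mismatch'
assert len(sub) == len(sample), 'row count mismatch'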

1.6 Prizes

First prize: 1 team, weekly-contest first-prize certificate, bonus: 1000 RMB;

Second prize: 1 team, weekly-contest second-prize certificate, bonus: 800 RMB;

Third prize: 1 team, weekly-contest third-prize certificate, bonus: 500 RMB;

Excellence award: 10 teams. The top ten receive an outstanding-participant certificate jointly issued by "iFLYTEK x Datawhale".

2. Baseline

Baseline notebook link

2.1 Import Modules

import pandas as pd
import os
import gc
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler
from gensim.models import Word2Vec
import math
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')

2.2 Load and Prepare the Data

train = pd.read_csv('../input/dianxinkehuliushiyvce/train.csv')
test = pd.read_csv('../input/dianxinkehuliushiyvce/test.csv')
data = pd.concat([train, test], axis=0, ignore_index=True)

# prepare the training/test data

features = [f for f in data.columns if f not in ['是否流失','客户ID']]

train = data[data['是否流失'].notnull()].reset_index(drop=True)
test = data[data['是否流失'].isnull()].reset_index(drop=True)

x_train = train[features]
x_test = test[features]

y_train = train['是否流失']
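
The baseline feeds these raw fields straight into the tree models, which assumes every feature is numeric after anonymization. If any column arrives as strings (婚姻状况, for instance, might), a label encoding shared between train and test would be needed first; a minimal sketch under that assumption:

# encode any object-typed feature columns with one mapping shared by train and test
obj_cols = x_train.select_dtypes(include='object').columns
for col in obj_cols:
    cats = pd.concat([x_train[col], x_test[col]]).astype('category').cat.categories
    x_train[col] = pd.Categorical(x_train[col], categories=cats).codes
    x_test[col] = pd.Categorical(x_test[col], categories=cats).codes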

2.3 Build the Models

# build a cross-validated trainer shared by lgb / xgb / cat

def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2022
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

    train = np.zeros(train_x.shape[0])  # out-of-fold predictions
    test = np.zeros(test_x.shape[0])    # fold-averaged test predictions

    cv_scores = []

    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]

        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)

            params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': 'auc',
                'min_child_weight': 5,
                'num_leaves': 2 ** 5,
                'lambda_l2': 10,
                'feature_fraction': 0.7,
                'bagging_fraction': 0.7,
                'bagging_freq': 10,
                'learning_rate': 0.2,
                'seed': 2022,
                'n_jobs':-1
            }

            model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], 
                              categorical_feature=[], verbose_eval=3000, early_stopping_rounds=200)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)
            
            print(list(sorted(zip(features, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:20])
                
        if clf_name == "xgb":
            train_matrix = clf.DMatrix(trn_x , label=trn_y)
            valid_matrix = clf.DMatrix(val_x , label=val_y)
            test_matrix = clf.DMatrix(test_x)
            
            params = {'booster': 'gbtree',
                      'objective': 'binary:logistic',
                      'eval_metric': 'auc',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.2,
                      'tree_method': 'exact',
                      'seed': 2020,
                      'nthread': 36,
                      "silent": True,
                      }
            
            watchlist = [(train_matrix, 'train'),(valid_matrix, 'eval')]
            
            model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=3000, early_stopping_rounds=200)
            val_pred  = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
            test_pred = model.predict(test_matrix , ntree_limit=model.best_ntree_limit)
                 
        if clf_name == "cat":
            params = {'learning_rate': 0.2, 'depth': 5, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
                      'od_type': 'Iter', 'od_wait': 50, 'random_seed': 11, 'allow_writing_files': False}
            
            # note: the baseline uses CatBoostRegressor; its raw scores still rank customers for AUC
            model = clf(iterations=20000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                      cat_features=[], use_best_model=True, verbose=3000)
            
            val_pred  = model.predict(val_x)
            test_pred = model.predict(test_x)
            
        train[valid_index] = val_pred
        test += test_pred / kf.n_splits  # accumulate so test ends up as the average over all folds
        cv_scores.append(roc_auc_score(val_y, val_pred))
        
        print(cv_scores)
       
    print("%s_scotrainre_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test
    
def lgb_model(x_train, y_train, x_test):
    lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_train, lgb_test

def xgb_model(x_train, y_train, x_test):
    xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, "xgb")
    return xgb_train, xgb_test

def cat_model(x_train, y_train, x_test):
    cat_train, cat_test = cv_model(CatBoostRegressor, x_train, y_train, x_test, "cat") 
    return cat_train, cat_test
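
All three wrappers share the same signature, so the XGBoost and CatBoost runs (not shown in this post) can be launched the same way:

# optional: the other two models are called identically
# xgb_train, xgb_test = xgb_model(x_train, y_train, x_test)
# cat_train, cat_test = cat_model(x_train, y_train, x_test)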
    

2.4 Train the LightGBM Model

lgb_train, lgb_test = lgb_model(x_train, y_train, x_test)

Output:

************************************ 1 ************************************
[LightGBM] [Info] Number of positive: 60072, number of negative: 59928
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.063069 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 10515
[LightGBM] [Info] Number of data points in the train set: 120000, number of used features: 67
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500600 -> initscore=0.002400
[LightGBM] [Info] Start training from score 0.002400
Training until validation scores don't improve for 200 rounds
[6000] training's auc: 1 valid_1's auc: 0.832951
[9000] training's auc: 1 valid_1's auc: 0.840899
Early stopping, best iteration is:
[11399] training's auc: 1 valid_1's auc: 0.843936
[('当前设备使用天数', 23206.93494542688), ('当月使用分钟数与前三个月平均值的百分比变化', 18966.850461155176), ('客户生命周期内的平均每月使用分钟数', 13798.198653317988), ('每月平均使用分钟数', 13793.37155648321), ('在职总月数', 13514.45736033097), ('客户整个生命周期内的平均每月通话次数', 13169.779234770685), ('已完成语音通话的平均使用分钟数', 12717.158051796257), ('客户生命周期内的总费用', 12660.670025695115), ('当前手机价格', 12073.53323160857), ('当月费用与前三个月平均值的百分比变化', 12001.614469721913), ('计费调整后的总费用', 11994.650363598019), ('计费调整后的总分钟数', 11881.44530763477), ('使用高峰语音通话的平均不完整分钟数', 11772.639638941735), ('客户生命周期内的总使用分钟数', 11543.160581946373), ('过去六个月的平均每月使用分钟数', 11353.998522041366), ('客户生命周期内平均月费用', 11079.994449861348), ('客户生命周期内的总通话次数', 10965.192475471646), ('过去六个月的平均每月通话次数', 10816.04681776464), ('过去三个月的平均每月通话次数', 10654.587144132704), ('平均月费用', 10615.659350316972)]
[0.8439362868562585]
************************************ 2 ************************************
[LightGBM] [Info] Number of positive: 59900, number of negative: 60100
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.064188 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 10528
[LightGBM] [Info] Number of data points in the train set: 120000, number of used features: 67
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499167 -> initscore=-0.003333
[LightGBM] [Info] Start training from score -0.003333
Training until validation scores don't improve for 200 rounds
[3000] training's auc: 0.999531 valid_1's auc: 0.81211
[6000] training's auc: 1 valid_1's auc: 0.832441
[9000] training's auc: 1 valid_1's auc: 0.83953
[12000] training's auc: 1 valid_1's auc: 0.84293
Early stopping, best iteration is:
[12210] training's auc: 1 valid_1's auc: 0.843058
[('当前设备使用天数', 24020.498692663386), ('当月使用分钟数与前三个月平均值的百分比变化', 19773.12492423877), ('每月平均使用分钟数', 13641.919732686132), ('在职总月数', 13542.374755300581), ('客户整个生命周期内的平均每月通话次数', 13250.751761453226), ('客户生命周期内的平均每月使用分钟数', 13230.550698732957), ('已完成语音通话的平均使用分钟数', 12665.135287033394), ('当前手机价格', 12515.943285102025), ('计费调整后的总费用', 12446.485826000571), ('客户生命周期内的总费用', 12174.580246660858), ('当月费用与前三个月平均值的百分比变化', 12122.996504634619), ('使用高峰语音通话的平均不完整分钟数', 11753.57425396517), ('客户生命周期内的总使用分钟数', 11670.048939611763), ('计费调整后的总分钟数', 11564.826041478664), ('过去六个月的平均每月使用分钟数', 11223.45485602878), ('客户生命周期内的总通话次数', 11168.901827361435), ('过去六个月的平均每月通话次数', 11124.491803480312), ('过去三个月的平均每月通话次数', 10913.613902557641), ('客户生命周期内平均月费用', 10808.741903448477), ('计费调整后的呼叫总数', 10793.087901951745)]
[0.8439362868562585, 0.8430577396278305]
************************************ 3 ************************************
[LightGBM] [Info] Number of positive: 60098, number of negative: 59902
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.022496 seconds.
You can set force_row_wise=true to remove the overhead.
And if memory is not enough, you can set force_col_wise=true.
[LightGBM] [Info] Total Bins 10513
[LightGBM] [Info] Number of data points in the train set: 120000, number of used features: 67
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500817 -> initscore=0.003267
[LightGBM] [Info] Start training from score 0.003267
Training until validation scores don't improve for 200 rounds
[3000] training's auc: 0.999479 valid_1's auc: 0.816165
[6000] training's auc: 1 valid_1's auc: 0.835227
[9000] training's auc: 1 valid_1's auc: 0.842783
Early stopping, best iteration is:
[10972] training's auc: 1 valid_1's auc: 0.845106
[('当前设备使用天数', 23554.02359988913), ('当月使用分钟数与前三个月平均值的百分比变化', 19450.002618733793), ('每月平均使用分钟数', 13781.198779758066), ('客户生命周期内的平均每月使用分钟数', 13459.828927565366), ('在职总月数', 13450.310572762042), ('客户整个生命周期内的平均每月通话次数', 12807.892476923764), ('已完成语音通话的平均使用分钟数', 12764.867111746222), ('客户生命周期内的总费用', 12400.86265109852), ('当前手机价格', 12370.400694530457), ('计费调整后的总费用', 12057.831106703728), ('当月费用与前三个月平均值的百分比变化', 11742.217323374003), ('计费调整后的总分钟数', 11737.426330137998), ('客户生命周期内的总使用分钟数', 11546.544992171228), ('过去六个月的平均每月通话次数', 11189.07267446071), ('使用高峰语音通话的平均不完整分钟数', 11077.357912018895), ('客户生命周期内平均月费用', 11054.351627696306), ('过去六个月的平均每月使用分钟数', 11045.627827014774), ('客户生命周期内的总通话次数', 10956.346342962235), ('过去三个月的平均每月通话次数', 10770.111043587327), ('过去三个月的平均每月使用分钟数', 10649.836913790554)]
[0.8439362868562585, 0.8430577396278305, 0.8451055611157318]
************************************ 4 ************************************
[LightGBM] [Info] Number of positive: 59934, number of negative: 60066
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.080083 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 10521
[LightGBM] [Info] Number of data points in the train set: 120000, number of used features: 67
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499450 -> initscore=-0.002200
[LightGBM] [Info] Start training from score -0.002200
Training until validation scores don't improve for 200 rounds
[3000] training's auc: 0.999563 valid_1's auc: 0.811568
[6000] training's auc: 1 valid_1's auc: 0.833085
[9000] training's auc: 1 valid_1's auc: 0.840568
Early stopping, best iteration is:
[10815] training's auc: 1 valid_1's auc: 0.842836
[('当前设备使用天数', 23250.61392029375), ('当月使用分钟数与前三个月平均值的百分比变化', 19292.30032589659), ('客户生命周期内的平均每月使用分钟数', 13766.847205281258), ('每月平均使用分钟数', 13647.102678723633), ('在职总月数', 13616.28791507706), ('客户整个生命周期内的平均每月通话次数', 12942.528880607337), ('客户生命周期内的总费用', 12354.855169367045), ('已完成语音通话的平均使用分钟数', 12344.891597270966), ('当前手机价格', 12116.039827257395), ('计费调整后的总费用', 11973.962297733873), ('客户生命周期内的总使用分钟数', 11728.554908126593), ('当月费用与前三个月平均值的百分比变化', 11600.252432178706), ('计费调整后的总分钟数', 11484.329358864576), ('使用高峰语音通话的平均不完整分钟数', 11409.022325478494), ('客户生命周期内平均月费用', 11246.711442269385), ('客户生命周期内的总通话次数', 11189.995209667832), ('过去六个月的平均每月通话次数', 10974.940132912248), ('过去六个月的平均每月使用分钟数', 10896.997217286378), ('计费调整后的呼叫总数', 10649.78157223016), ('过去三个月的平均每月使用分钟数', 10490.423435229808)]
[0.8439362868562585, 0.8430577396278305, 0.8451055611157318, 0.8428364570863798]
************************************ 5 ************************************
[LightGBM] [Info] Number of positive: 60164, number of negative: 59836
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.066641 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 10530
[LightGBM] [Info] Number of data points in the train set: 120000, number of used features: 67
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.501367 -> initscore=0.005467
[LightGBM] [Info] Start training from score 0.005467
Training until validation scores don't improve for 200 rounds
[3000] training's auc: 0.999484 valid_1's auc: 0.812766
[6000] training's auc: 1 valid_1's auc: 0.833513
[9000] training's auc: 1 valid_1's auc: 0.840732
[12000] training's auc: 1 valid_1's auc: 0.843099
Early stopping, best iteration is:
[12570] training's auc: 1 valid_1's auc: 0.84359
[('当前设备使用天数', 23424.84182724543), ('当月使用分钟数与前三个月平均值的百分比变化', 20145.600453276187), ('客户生命周期内的平均每月使用分钟数', 13742.811106188223), ('每月平均使用分钟数', 13542.054858371615), ('在职总月数', 13331.132492953911), ('客户整个生命周期内的平均每月通话次数', 13032.698692249134), ('已完成语音通话的平均使用分钟数', 12668.222773976624), ('当前手机价格', 12389.838216289878), ('客户生命周期内的总费用', 12380.491549521685), ('使用高峰语音通话的平均不完整分钟数', 12170.956499611959), ('计费调整后的总费用', 12101.883673759177), ('客户生命周期内的总使用分钟数', 11935.202114250511), ('当月费用与前三个月平均值的百分比变化', 11643.637906264514), ('计费调整后的总分钟数', 11638.548691518605), ('客户生命周期内平均月费用', 11420.638470709324), ('客户生命周期内的总通话次数', 11384.500305030495), ('过去六个月的平均每月使用分钟数', 11285.595895411447), ('过去六个月的平均每月通话次数', 10763.348691094667), ('过去三个月的平均每月通话次数', 10488.567783802748), ('平均月费用', 10480.520216416568)]
[0.8439362868562585, 0.8430577396278305, 0.8451055611157318, 0.8428364570863798, 0.8435898444055294]
lgb_score_list: [0.8439362868562585, 0.8430577396278305, 0.8451055611157318, 0.8428364570863798, 0.8435898444055294]
lgb_score_mean: 0.843705177818346
lgb_score_std: 0.000800204784264596
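
Since cv_model returns the out-of-fold predictions, the five per-fold scores can be complemented by one overall OOF AUC across the 150,000 training rows (a sketch using the arrays already in scope):

# overall out-of-fold AUC across the five folds
print('OOF AUC:', roc_auc_score(y_train, lgb_train))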

2.5 Write Predictions to File

# write the submission file

test['是否流失'] = lgb_test
test[['客户ID','是否流失']].to_csv('test_sub.csv', index=False)
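
Before uploading, a quick look at the file catches obvious mistakes (a sketch; the expected shape follows the 30,000-row test set from section 1.3):

sub = pd.read_csv('test_sub.csv')
print(sub.shape)                  # expect (30000, 2)
print(sub.head())
print(sub['是否流失'].describe())  # binary-objective outputs should lie in [0, 1]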

2.6 Submission Result

Score: 0.83909
[Figure: leaderboard screenshot of the submitted score]
