Contents
1. Competition Overview
2. Data Cleaning
3. Feature Processing
4. Model Selection
1. Competition Overview
This is an individual practice competition, well suited to getting started with binary classification models.
Task type: binary classification
Background:
After a traffic collision (accident), a claims adjuster visits the scene to investigate and collect information, and this information often determines whether the car owner is compensated by the insurance company. The training set contains 36 encoded fields collected on-site by the adjuster for each party to an accident, together with whether that party was ultimately compensated. Our task is to predict, from these 36 fields, the probability that a party is NOT compensated.
2. Data Cleaning
Data cleaning here amounts to handling missing values and removing duplicates, as shown in Figure 1.
The raw dataset has no missing values, but it does contain 186,833 duplicate rows. The usual reflex is to drop duplicates on sight, but here they account for a large share of the data, so I compared model performance with and without them. Dropping the duplicates actually hurt performance, so at the cleaning stage the dataset was left unmodified.
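Both checks take a couple of lines of pandas; a minimal sketch, assuming the raw data is loaded into a DataFrame named train (the file name is hypothetical):

import pandas as pd

train = pd.read_csv('train.csv')       # hypothetical file name
print(train.isnull().sum().sum())      # total missing cells -> 0 on this dataset
print(train.duplicated().sum())        # duplicate rows -> 186833 on this dataset
deduped = train.drop_duplicates()      # the variant that was tried and rejected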
3. Feature Processing
Feature processing generally covers feature engineering, raising dimensionality, and reducing dimensionality. Looking at the data, the samples carry no time-like structure that could be merged or split, and each sample is independent of the others, so in my judgment neither feature engineering nor added dimensions are a good fit here. Since the dataset has 36 features, dimensionality reduction seemed worth trying.
There are many dimensionality-reduction methods; here I used a genetic algorithm to select features, which reduces dimensionality as a side effect. Figure 2 shows the flow of the GA-based feature selection.
Since the dataset has 36 features, we create an initial population as a (30, 36) matrix: each chromosome is one (1, 36) row of the population matrix, and each gene is one feature within a chromosome.
The initial population is binary encoded, i.e. we randomly generate a (30, 36) matrix of 0s and 1s, so a chromosome is a mask over the dataset's feature index, for example [0, 1, 1, 1, 0, ..., 0, 1, 0]. Indexing the dataset's features with this chromosome yields a new dataset, which is split into training and test parts; a model is trained and evaluated on it, and the AUC of its predictions serves as the chromosome's fitness. The best chromosome by fitness is saved, and the termination condition is checked; if it is not met, the population goes through selection, crossover, and mutation to produce a new generation for the next loop. When the termination condition is reached (usually set by hand; here I stopped after 30 iterations), the best chromosome found so far is output, and that is our dimensionality-reduction result.
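Before the full implementation below, the core encode-and-index trick is just this; a two-line sketch, assuming used_feat holds the 36 feature names (as it does in the code that follows):

import numpy as np
chrom = np.random.randint(0, 2, size=36)    # one random chromosome, e.g. [0, 1, 1, ..., 0]
sub_feats = used_feat[chrom.astype(bool)]   # keep only the features whose gene is 1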
# Feature selection with a genetic algorithm
import os
import gc

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold

class BGA():
    def __init__(self, pop_shape, method, p_c=0.8, p_m=0.2, max_round=1000,
                 early_stop_rounds=None, verbose=None, maximum=True):
        if early_stop_rounds is not None:
            assert max_round > early_stop_rounds
        self.pop_shape = pop_shape          # (population size, chromosome length)
        self.method = method                # fitness function
        self.pop = np.zeros(pop_shape)
        self.fitness = np.zeros(pop_shape[0])
        self.p_c = p_c                      # crossover probability
        self.p_m = p_m                      # mutation probability
        self.max_round = max_round
        self.early_stop_rounds = early_stop_rounds
        self.verbose = verbose
        self.maximum = maximum              # True: maximize the fitness

    def evaluation(self, pop):
        # Fitness of every chromosome in the population.
        return np.array([self.method(i) for i in pop])

    def initialization(self):
        # Resume from a checkpoint if one exists, otherwise draw a random 0/1 population.
        if os.path.exists('pop.npy'):
            self.pop = np.load('pop.npy')
        else:
            self.pop = np.random.randint(low=0, high=2, size=self.pop_shape)
            np.save('pop.npy', self.pop)
        self.fitness = self.evaluation(self.pop)

    def crossover(self, ind_0, ind_1):
        # Single-point crossover: swap the parents' tails at a random cut point.
        assert len(ind_0) == len(ind_1)
        point = np.random.randint(len(ind_0))
        new_0 = np.hstack((ind_0[:point], ind_1[point:]))
        new_1 = np.hstack((ind_1[:point], ind_0[point:]))
        assert len(new_0) == len(ind_0)
        return new_0, new_1

    def mutation(self, indi):
        # Flip one randomly chosen gene (0 -> 1 or 1 -> 0).
        point = np.random.randint(len(indi))
        indi[point] = 1 - indi[point]
        return indi

    def rws(self, size, fitness):
        # Roulette-wheel selection: pick indices with probability proportional to fitness.
        if self.maximum:
            fitness_ = fitness
        else:
            fitness_ = 1.0 / fitness
        idx = np.random.choice(np.arange(len(fitness_)), size=size, replace=True,
                               p=fitness_ / fitness_.sum())
        return idx

    def local_search(self, solution, fitness):
        # Optional hill climbing: flip each gene in turn, keep flips that improve fitness.
        for i in range(len(solution)):
            solution_b = solution.copy()    # copy, not a view, so the flip stays isolated
            solution_b[i] = 1 - solution_b[i]
            fit = self.method(solution_b)
            if self.maximum:
                if fit > fitness:
                    fitness = fit
                    solution = solution_b.copy()
            else:
                if fit < fitness:
                    fitness = fit
                    solution = solution_b.copy()
        return solution, fitness

    def run(self):
        self.initialization()
        if self.maximum:
            best_index = np.argsort(self.fitness)[-1]
        else:
            best_index = np.argsort(self.fitness)[0]
        global_best_fitness = self.fitness[best_index]
        global_best_ind = self.pop[best_index, :]
        eva_times = self.pop_shape[0]
        count = 0
        for it in range(self.max_round):
            next_gene = []
            self.pop = np.load('pop.npy')   # reload the checkpointed population
            for n in range(int(self.pop_shape[0] / 2)):
                i, j = self.rws(2, self.fitness)  # choose 2 parents by roulette wheel
                indi_0, indi_1 = self.pop[i, :].copy(), self.pop[j, :].copy()
                if np.random.rand() < self.p_c:
                    indi_0, indi_1 = self.crossover(indi_0, indi_1)
                if np.random.rand() < self.p_m:
                    indi_0 = self.mutation(indi_0)
                    indi_1 = self.mutation(indi_1)
                next_gene.append(indi_0)
                next_gene.append(indi_1)
            self.pop = np.array(next_gene)
            self.fitness = self.evaluation(self.pop)
            eva_times += self.pop_shape[0]
            if self.maximum:
                if np.max(self.fitness) > global_best_fitness:
                    best_index = np.argsort(self.fitness)[-1]
                    global_best_fitness = self.fitness[best_index]
                    global_best_ind = self.pop[best_index, :]
                    # global_best_ind, global_best_fitness = self.local_search(global_best_ind, global_best_fitness)
                    with open('./history.txt', 'w') as f:
                        f.write(str(global_best_ind.tolist()))
                        f.write(str(global_best_fitness))
                    count = 0
                else:
                    count += 1
                    # Elitism: overwrite the worst individual with the global best.
                    # (When maximizing, the worst sits at argsort[0], not argsort[-1].)
                    worst_index = np.argsort(self.fitness)[0]
                    self.pop[worst_index, :] = global_best_ind
                    self.fitness[worst_index] = global_best_fitness
            else:
                if np.min(self.fitness) < global_best_fitness:
                    best_index = np.argsort(self.fitness)[0]
                    global_best_fitness = self.fitness[best_index]
                    global_best_ind = self.pop[best_index, :]
                    # global_best_ind, global_best_fitness = self.local_search(global_best_ind, global_best_fitness)
                    with open('./history.txt', 'w') as f:
                        f.write(str(global_best_ind.tolist()))
                        f.write(str(global_best_fitness))
                    count = 0
                else:
                    count += 1
                    worst_index = np.argsort(self.fitness)[-1]
                    self.pop[worst_index, :] = global_best_ind
                    self.fitness[worst_index] = global_best_fitness
            np.save('pop.npy', self.pop)
            if self.verbose is not None and 0 == (it % self.verbose):
                print('Gene {}:'.format(it))
                print('Global best fitness:', global_best_fitness)
            if self.early_stop_rounds is not None and count > self.early_stop_rounds:
                print('Did not improve within {} rounds. Break.'.format(self.early_stop_rounds))
                break
        print('\n Solution: {} \n Fitness: {} \n Evaluation times: {}'.format(
            global_best_ind, global_best_fitness, eva_times))
        return global_best_ind, global_best_fitness

def evaluate(solution):
    # Fitness function: train LightGBM on the feature subset encoded by `solution`
    # and return the validation AUC. `used_feat`, `train` and `y_train` are globals
    # defined elsewhere in the notebook.
    cols = used_feat[solution.astype(bool)]
    train_x_1 = train[cols]
    train_y_1 = y_train
    folds = 5
    seeds = [44]
    for seed in seeds:
        kfold = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
        for fold, (trn_idx, val_idx) in enumerate(kfold.split(train_x_1, train_y_1)):
            x_trn, y_trn = train_x_1.iloc[trn_idx], train_y_1.iloc[trn_idx]
            x_val, y_val = train_x_1.iloc[val_idx], train_y_1.iloc[val_idx]
            lgb_train = lgb.Dataset(x_trn, y_trn)
            lgb_eval = lgb.Dataset(x_val, y_val)
            print('Training started......')
            params = {
                'task': 'train',
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': {'auc'},
            }
            # early_stopping_rounds / verbose_eval follow the pre-4.0 LightGBM API.
            gbm = lgb.train(params,
                            lgb_train,
                            num_boost_round=1000,
                            early_stopping_rounds=50,
                            valid_sets=lgb_eval,
                            verbose_eval=False)
            gc.collect()
    print("gbm.best_score:", gbm.best_score['valid_0']['auc'])
    return gbm.best_score['valid_0']['auc']  # the last fold's AUC serves as the fitness

## searching
ga = BGA(pop_shape=(30, len(used_feat)), method=evaluate, max_round=30,
         verbose=1, p_m=0.3, maximum=True)
solution, fitness = ga.run()
4. Model Selection
The competition ships two benchmark models, a lasso logistic regression and a random forest; the random forest's PR-AUC is far higher, so among the benchmarks I went with the random forest. I had also worked with LightGBM before, a model that improves on XGBoost; see https://blog.csdn.net/maqunfi/article/details/82219999 for details.
The random forest submission scored 0.850898.
from sklearn.ensemble import RandomForestClassifier

cols = used_feat[solution.astype(bool)]  # features selected by the GA above
# Build the random forest model
rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc = rfc.fit(train[cols], y_train)            # fit on the training set
y_pred = rfc.predict_proba(test[cols])[:, 1]   # predicted probability of the positive class
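Turning these probabilities into a submission file is one more step; a sketch assuming a hypothetical format with an id column and a prediction column (the actual column names depend on the platform):

sub = pd.DataFrame({'id': test.index, 'pred': y_pred})  # hypothetical column names
sub.to_csv('submission.csv', index=False)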
The LightGBM submission scored 0.86035.
cols = used_feat[solution.astype(bool)]  # the chromosome returned by the GA search
train_x_1 = train[cols]
train_y_1 = y_train
test_x_1 = test[cols]
folds = 5
seeds = [44]
for seed in seeds:
    kfold = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    for fold, (trn_idx, val_idx) in enumerate(kfold.split(train_x_1, train_y_1)):
        x_trn, y_trn = train_x_1.iloc[trn_idx], train_y_1.iloc[trn_idx]
        x_val, y_val = train_x_1.iloc[val_idx], train_y_1.iloc[val_idx]
        lgb_train = lgb.Dataset(x_trn, y_trn)
        lgb_eval = lgb.Dataset(x_val, y_val)
        print('Training started......')
        params = {
            'task': 'train',
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'metric': {'auc'},
        }
        gbm = lgb.train(params,
                        lgb_train,
                        num_boost_round=1000,
                        early_stopping_rounds=50,
                        valid_sets=lgb_eval,
                        verbose_eval=False)
        y_pred = gbm.predict(test_x_1)  # overwritten each fold; the last fold's model makes the submission
        print(gbm.best_score['valid_0']['auc'])
        gc.collect()
LightGBM edges out the random forest, but the overall gap is small, so I tried ensembling the two models.
The ensemble submission scored 0.819028.
# Stacking: the base models' out-of-fold predictions become meta-features
# for a logistic-regression meta-model.
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier

# `kf` was not defined in the original snippet; a 5-fold StratifiedKFold is assumed here.
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=44)

def get_oof(clf, x_train, y_train, x_test):
    oof_train = np.zeros((x_train.shape[0],))
    oof_test = np.zeros((x_test.shape[0],))
    oof_test_skf = np.empty((5, x_test.shape[0]))
    for i, (train_index, test_index) in enumerate(kf.split(x_train, y_train)):
        x_tr = x_train.iloc[train_index]
        y_tr = y_train.iloc[train_index]
        x_te = x_train.iloc[test_index]
        clf.fit(x_tr, y_tr)
        oof_train[test_index] = clf.predict(x_te)  # hard labels, not probabilities
        oof_test_skf[i, :] = clf.predict(x_test)
    oof_test[:] = oof_test_skf.mean(axis=0)        # average the five test-set predictions
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)

models = [RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
          LGBMClassifier()]
number_models = len(models)
xtrain_new = pd.DataFrame(np.zeros((train.shape[0], number_models)))
xtest_new = pd.DataFrame(np.zeros((test.shape[0], number_models)))
for l, ml in enumerate(models):
    oof_tr, oof_te = get_oof(ml, train, y_train, test)
    xtrain_new.iloc[:, l] = oof_tr.ravel()   # flatten (n, 1) columns before assignment
    xtest_new.iloc[:, l] = oof_te.ravel()
    print("Loop {} finished".format(l))
clf = LogisticRegression()
clf.fit(xtrain_new, y_train)
y_pred = clf.predict(xtest_new)
The principle is shown in Figure 3.
Step 1:
Split the original training set into 5 equal folds with k-fold cross-validation. Suppose the original training set has 1,000 samples and the original test set has 500 unlabeled samples; after the split, each fold holds 200 samples, with 4 folds forming a new training set and 1 fold serving as a new held-out set.
Step 2:
Train the random forest and the LightGBM model on the new training set and predict the held-out fold, giving 200 predictions; each model contributes one column, so two models yield a (200, 2) block per fold. Looping over all five folds predicts every sample of the original training set once, producing a new training set of shape (1000, 2). In each of the five loops, both models also predict the original (unlabeled) test set, so five loops give a (500, 5) matrix per model; since each sample needs a single value, row-wise averaging collapses it to (500, 1) per model, hence a (500, 2) new test set for the two models.
Step 3:
Train a simple model, e.g. logistic regression, on the new training set and predict on the new test set, yielding a (500, 1) vector of predictions: the final result.
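The shape bookkeeping above can be sanity-checked with synthetic data standing in for the real features; a minimal sketch reusing get_oof and kf from the snippet above:

rng = np.random.RandomState(0)
toy_train = pd.DataFrame(rng.rand(1000, 36))   # stands in for the 1000-sample training set
toy_y = pd.Series(rng.randint(0, 2, size=1000))
toy_test = pd.DataFrame(rng.rand(500, 36))     # stands in for the 500-sample test set

oof_tr, oof_te = get_oof(RandomForestClassifier(n_estimators=10), toy_train, toy_y, toy_test)
print(oof_tr.shape, oof_te.shape)              # (1000, 1) (500, 1), as described above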
In principle the ensemble should beat the single models, but here it came out slightly worse, and despite trying several variations I never reached the result I wanted. One plausible cause is that get_oof feeds the meta-model hard class labels from predict rather than probabilities from predict_proba, which throws away the ranking information an AUC-style metric rewards.
In the end the LightGBM model gave the best result.
References:
https://blog.csdn.net/xiaoliuzz/article/details/79298841
https://blog.csdn.net/maqunfi/article/details/82219999
Code:
https://gitee.com/liu_ji_duan/DuanGe/tree/master/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/%E4%BA%A4%E9%80%9A%E4%BA%8B%E6%95%85%E7%90%86%E8%B5%94%E5%AE%A1%E6%A0%B8%E7%AB%9E%E8%B5%9B