Contents
1. Competition Overview
2. Data Cleaning
3. Feature Processing
4. Model Selection
1. Competition Overview
This is an individual practice competition, well suited to getting started with binary classification models.
Task type: binary classification
Background:
After a traffic collision (accident), a claims adjuster visits the scene to investigate and collect information, and this information often determines whether the car owner is compensated by the insurance company. The training set contains 36 encoded fields collected on-site by the adjuster for each party to an accident, together with whether that party was ultimately compensated. Our task is to predict, from these 36 fields, the probability that a party is NOT compensated.
2. Data Cleaning
Data cleaning here amounts to handling missing values and removing duplicates, as shown in Figure 1.
The raw dataset has no missing values, but it does contain 186,833 duplicate rows. The usual reflex is to drop duplicates on sight, but here they account for a large share of the data, so I compared model performance with and without them. Dropping the duplicates actually hurt performance, so at the cleaning stage the dataset was left unmodified.
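Both checks take a couple of lines of pandas; a minimal sketch, assuming the raw data is loaded into a DataFrame named train (the file name is hypothetical):

import pandas as pd

train = pd.read_csv('train.csv')       # hypothetical file name
print(train.isnull().sum().sum())      # total missing cells -> 0 on this dataset
print(train.duplicated().sum())        # duplicate rows -> 186833 on this dataset
deduped = train.drop_duplicates()      # the variant that was tried and rejected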
3. Feature Processing
Feature processing generally covers feature engineering, raising dimensionality, and reducing dimensionality. Looking at the data, the samples carry no time-like structure that could be merged or split, and each sample is independent of the others, so in my judgment neither feature engineering nor added dimensions are a good fit here. Since the dataset has 36 features, dimensionality reduction seemed worth trying.
There are many dimensionality-reduction methods; here I used a genetic algorithm to select features, which reduces dimensionality as a side effect. Figure 2 shows the flow of the GA-based feature selection.
Since the dataset has 36 features, we create an initial population as a (30, 36) matrix: each chromosome is one (1, 36) row of the population matrix, and each gene is one feature within a chromosome.
The initial population is binary encoded, i.e. we randomly generate a (30, 36) matrix of 0s and 1s, so a chromosome is a mask over the dataset's feature index, for example [0, 1, 1, 1, 0, ..., 0, 1, 0]. Indexing the dataset's features with this chromosome yields a new dataset, which is split into training and test parts; a model is trained and evaluated on it, and the AUC of its predictions serves as the chromosome's fitness. The best chromosome by fitness is saved, and the termination condition is checked; if it is not met, the population goes through selection, crossover, and mutation to produce a new generation for the next loop. When the termination condition is reached (usually set by hand; here I stopped after 30 iterations), the best chromosome found so far is output, and that is our dimensionality-reduction result.
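Before the full implementation below, the core encode-and-index trick is just this; a two-line sketch, assuming used_feat holds the 36 feature names (as it does in the code that follows):

import numpy as np
chrom = np.random.randint(0, 2, size=36)    # one random chromosome, e.g. [0, 1, 1, ..., 0]
sub_feats = used_feat[chrom.astype(bool)]   # keep only the features whose gene is 1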
# Feature selection with a genetic algorithm
import os
import gc

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold

class BGA():
    def __init__(self, pop_shape, method, p_c=0.8, p_m=0.2, max_round=1000,
                 early_stop_rounds=None, verbose=None, maximum=True):
        if early_stop_rounds is not None:
            assert max_round > early_stop_rounds
        self.pop_shape = pop_shape          # (population size, chromosome length)
        self.method = method                # fitness function
        self.pop = np.zeros(pop_shape)
        self.fitness = np.zeros(pop_shape[0])
        self.p_c = p_c                      # crossover probability
        self.p_m = p_m                      # mutation probability
        self.max_round = max_round
        self.early_stop_rounds = early_stop_rounds
        self.verbose = verbose
        self.maximum = maximum              # True: maximize the fitness

    def evaluation(self, pop):
        # Fitness of every chromosome in the population.
        return np.array([self.method(i) for i in pop])

    def initialization(self):
        # Resume from a checkpoint if one exists, otherwise draw a random 0/1 population.
        if os.path.exists('pop.npy'):
            self.pop = np.load('pop.npy')
        else:
            self.pop = np.random.randint(low=0, high=2, size=self.pop_shape)
            np.save('pop.npy', self.pop)
        self.fitness = self.evaluation(self.pop)

    def crossover(self, ind_0, ind_1):
        # Single-point crossover: swap the parents' tails at a random cut point.
        assert len(ind_0) == len(ind_1)
        point = np.random.randint(len(ind_0))
        new_0 = np.hstack((ind_0[:point], ind_1[point:]))
        new_1 = np.hstack((ind_1[:point], ind_0[point:]))
        assert len(new_0) == len(ind_0)
        return new_0, new_1

    def mutation(self, indi):
        # Flip one randomly chosen gene (0 -> 1 or 1 -> 0).
        point = np.random.randint(len(indi))
        indi[point] = 1 - indi[point]
        return indi

    def rws(self, size, fitness):
        # Roulette-wheel selection: pick indices with probability proportional to fitness.
        if self.maximum:
            fitness_ = fitness
        else:
            fitness_ = 1.0 / fitness
        idx = np.random.choice(np.arange(len(fitness_)), size=size, replace=True,
                               p=fitness_ / fitness_.sum())
        return idx

    def local_search(self, solution, fitness):
        # Optional hill climbing: flip each gene in turn, keep flips that improve fitness.
        for i in range(len(solution)):
            solution_b = solution.copy()    # copy, not a view, so the flip stays isolated
            solution_b[i] = 1 - solution_b[i]
            fit = self.method(solution_b)
            if self.maximum:
                if fit > fitness:
                    fitness = fit
                    solution = solution_b.copy()
            else:
                if fit < fitness:
                    fitness = fit
                    solution = solution_b.copy()
        return solution, fitness

    def run(self):
        self.initialization()
        if self.maximum:
            best_index = np.argsort(self.fitness)[-1]
        else:
            best_index = np.argsort(self.fitness)[0]
        global_best_fitness = self.fitness[best_index]
        global_best_ind = self.pop[best_index, :]
        eva_times = self.pop_shape[0]
        count = 0
        for it in range(self.max_round):
            next_gene = []
            self.pop = np.load('pop.npy')   # reload the checkpointed population
            for n in range(int(self.pop_shape[0] / 2)):
                i, j = self.rws(2, self.fitness)  # choose 2 parents by roulette wheel
                indi_0, indi_1 = self.pop[i, :].copy(), self.pop[j, :].copy()
                if np.random.rand() < self.p_c:
                    indi_0, indi_1 = self.crossover(indi_0, indi_1)
                if np.random.rand() < self.p_m:
                    indi_0 = self.mutation(indi_0)
                    indi_1 = self.mutation(indi_1)
                next_gene.append(indi_0)
                next_gene.append(indi_1)
            self.pop = np.array(next_gene)
            self.fitness = self.evaluation(self.pop)
            eva_times += self.pop_shape[0]
            if self.maximum:
                if np.max(self.fitness) > global_best_fitness:
                    best_index = np.argsort(self.fitness)[-1]
                    global_best_fitness = self.fitness[best_index]
                    global_best_ind = self.pop[best_index, :]
                    # global_best_ind, global_best_fitness = self.local_search(global_best_ind, global_best_fitness)
                    with open('./history.txt', 'w') as f:
                        f.write(str(global_best_ind.tolist()))
                        f.write(str(global_best_fitness))
                    count = 0
                else:
                    count += 1
                    # Elitism: overwrite the worst individual with the global best.
                    # (When maximizing, the worst sits at argsort[0], not argsort[-1].)
                    worst_index = np.argsort(self.fitness)[0]
                    self.pop[worst_index, :] = global_best_ind
                    self.fitness[worst_index] = global_best_fitness
            else:
                if np.min(self.fitness) < global_best_fitness:
                    best_index = np.argsort(self.fitness)[0]
                    global_best_fitness = self.fitness[best_index]
                    global_best_ind = self.pop[best_index, :]
                    # global_best_ind, global_best_fitness = self.local_search(global_best_ind, global_best_fitness)
                    with open('./history.txt', 'w') as f:
                        f.write(str(global_best_ind.tolist()))
                        f.write(str(global_best_fitness))
                    count = 0
                else:
                    count += 1
                    worst_index = np.argsort(self.fitness)[-1]
                    self.pop[worst_index, :] = global_best_ind
                    self.fitness[worst_index] = global_best_fitness
            np.save('pop.npy', self.pop)
            if self.verbose is not None and 0 == (it % self.verbose):
                print('Gene {}:'.format(it))
                print('Global best fitness:', global_best_fitness)
            if self.early_stop_rounds is not None and count > self.early_stop_rounds:
                print('Did not improve within {} rounds. Break.'.format(self.early_stop_rounds))
                break
        print('\n Solution: {} \n Fitness: {} \n Evaluation times: {}'.format(
            global_best_ind, global_best_fitness, eva_times))
        return global_best_ind, global_best_fitness

def evaluate(solution):
    # Fitness function: train LightGBM on the feature subset encoded by `solution`
    # and return the validation AUC. `used_feat`, `train` and `y_train` are globals
    # defined elsewhere in the notebook.
    cols = used_feat[solution.astype(bool)]
    train_x_1 = train[cols]
    train_y_1 = y_train
    folds = 5
    seeds = [44]
    for seed in seeds:
        kfold = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
        for fold, (trn_idx, val_idx) in enumerate(kfold.split(train_x_1, train_y_1)):
            x_trn, y_trn = train_x_1.iloc[trn_idx], train_y_1.iloc[trn_idx]
            x_val, y_val = train_x_1.iloc[val_idx], train_y_1.iloc[val_idx]
            lgb_train = lgb.Dataset(x_trn, y_trn)
            lgb_eval = lgb.Dataset(x_val, y_val)
            print('Training started......')
            params = {
                'task': 'train',
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': {'auc'},
            }
            # early_stopping_rounds / verbose_eval follow the pre-4.0 LightGBM API.
            gbm = lgb.train(params,
                            lgb_train,
                            num_boost_round=1000,
                            early_stopping_rounds=50,
                            valid_sets=lgb_eval,
                            verbose_eval=False)
            gc.collect()
    print("gbm.best_score:", gbm.best_score['valid_0']['auc'])
    return gbm.best_score['valid_0']['auc']  # the last fold's AUC serves as the fitness

## searching
ga = BGA(pop_shape=(30, len(used_feat)), method=evaluate, max_round=30,
         verbose=1, p_m=0.3, maximum=True)
solution, fitness = ga.run()
4. Model Selection
The competition ships two benchmark models, a lasso logistic regression and a random forest; the random forest's PR-AUC is far higher, so among the benchmarks I went with the random forest. I had also worked with LightGBM before, a model that improves on XGBoost; see https://blog.csdn.net/maqunfi/article/details/82219999 for details.
The random forest submission scored 0.850898.
from sklearn.ensemble import RandomForestClassifier

cols = used_feat[solution.astype(bool)]  # features selected by the GA above
# Build the random forest model
rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc = rfc.fit(train[cols], y_train)            # fit on the training set
y_pred = rfc.predict_proba(test[cols])[:, 1]   # predicted probability of the positive class
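Turning these probabilities into a submission file is one more step; a sketch assuming a hypothetical format with an id column and a prediction column (the actual column names depend on the platform):

sub = pd.DataFrame({'id': test.index, 'pred': y_pred})  # hypothetical column names
sub.to_csv('submission.csv', index=False)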
The LightGBM submission scored 0.86035.
cols = used_feat[solution.astype(bool)]  # the chromosome returned by the GA search
train_x_1 = train[cols]
train_y_1 = y_train
test_x_1 = test[cols]
folds = 5
seeds = [44]
for seed in seeds:
    kfold = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    for fold, (trn_idx, val_idx) in enumerate(kfold.split(train_x_1, train_y_1)):
        x_trn, y_trn = train_x_1.iloc[trn_idx], train_y_1.iloc[trn_idx]
        x_val, y_val = train_x_1.iloc[val_idx], train_y_1.iloc[val_idx]
        lgb_train = lgb.Dataset(x_trn, y_trn)
        lgb_eval = lgb.Dataset(x_val, y_val)
        print('Training started......')
        params = {
            'task': 'train',
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'metric': {'auc'},
        }
        gbm = lgb.train(params,
                        lgb_train,
                        num_boost_round=1000,
                        early_stopping_rounds=50,
                        valid_sets=lgb_eval,
                        verbose_eval=False)
        y_pred = gbm.predict(test_x_1)  # overwritten each fold; the last fold's model makes the submission
        print(gbm.best_score['valid_0']['auc'])
        gc.collect()
LightGBM edges out the random forest, but the overall gap is small, so I tried ensembling the two models.
The ensemble submission scored 0.819028.
# Stacking: the base models' out-of-fold predictions become meta-features
# for a logistic-regression meta-model.
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier

# `kf` was not defined in the original snippet; a 5-fold StratifiedKFold is assumed here.
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=44)

def get_oof(clf, x_train, y_train, x_test):
    oof_train = np.zeros((x_train.shape[0],))
    oof_test = np.zeros((x_test.shape[0],))
    oof_test_skf = np.empty((5, x_test.shape[0]))
    for i, (train_index, test_index) in enumerate(kf.split(x_train, y_train)):
        x_tr = x_train.iloc[train_index]
        y_tr = y_train.iloc[train_index]
        x_te = x_train.iloc[test_index]
        clf.fit(x_tr, y_tr)
        oof_train[test_index] = clf.predict(x_te)  # hard labels, not probabilities
        oof_test_skf[i, :] = clf.predict(x_test)
    oof_test[:] = oof_test_skf.mean(axis=0)        # average the five test-set predictions
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)

models = [RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
          LGBMClassifier()]
number_models = len(models)
xtrain_new = pd.DataFrame(np.zeros((train.shape[0], number_models)))
xtest_new = pd.DataFrame(np.zeros((test.shape[0], number_models)))
for l, ml in enumerate(models):
    oof_tr, oof_te = get_oof(ml, train, y_train, test)
    xtrain_new.iloc[:, l] = oof_tr.ravel()   # flatten (n, 1) columns before assignment
    xtest_new.iloc[:, l] = oof_te.ravel()
    print("Loop {} finished".format(l))
clf = LogisticRegression()
clf.fit(xtrain_new, y_train)
y_pred = clf.predict(xtest_new)
The principle is shown in Figure 3.
Step 1:
Split the original training set into 5 equal folds with k-fold cross-validation. Suppose the original training set has 1,000 samples and the original test set has 500 unlabeled samples; after the split, each fold holds 200 samples, with 4 folds forming a new training set and 1 fold serving as a new held-out set.
Step 2:
Train the random forest and the LightGBM model on the new training set and predict the held-out fold, giving 200 predictions; each model contributes one column, so two models yield a (200, 2) block per fold. Looping over all five folds predicts every sample of the original training set once, producing a new training set of shape (1000, 2). In each of the five loops, both models also predict the original (unlabeled) test set, so five loops give a (500, 5) matrix per model; since each sample needs a single value, row-wise averaging collapses it to (500, 1) per model, hence a (500, 2) new test set for the two models.
Step 3:
Train a simple model, e.g. logistic regression, on the new training set and predict on the new test set, yielding a (500, 1) vector of predictions: the final result.
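The shape bookkeeping above can be sanity-checked with synthetic data standing in for the real features; a minimal sketch reusing get_oof and kf from the snippet above:

rng = np.random.RandomState(0)
toy_train = pd.DataFrame(rng.rand(1000, 36))   # stands in for the 1000-sample training set
toy_y = pd.Series(rng.randint(0, 2, size=1000))
toy_test = pd.DataFrame(rng.rand(500, 36))     # stands in for the 500-sample test set

oof_tr, oof_te = get_oof(RandomForestClassifier(n_estimators=10), toy_train, toy_y, toy_test)
print(oof_tr.shape, oof_te.shape)              # (1000, 1) (500, 1), as described above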
In principle the ensemble should beat the single models, but here it came out slightly worse, and despite trying several variations I never reached the result I wanted. One plausible cause is that get_oof feeds the meta-model hard class labels from predict rather than probabilities from predict_proba, which throws away the ranking information an AUC-style metric rewards.
In the end the LightGBM model gave the best result.
References:
https://blog.csdn.net/xiaoliuzz/article/details/79298841
https://blog.csdn.net/maqunfi/article/details/82219999
Code:
https://gitee.com/liu_ji_duan/DuanGe/tree/master/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/%E4%BA%A4%E9%80%9A%E4%BA%8B%E6%95%85%E7%90%86%E8%B5%94%E5%AE%A1%E6%A0%B8%E7%AB%9E%E8%B5%9B