Diabetes Genetic Risk Detection Challenge (Coggle 30 Days of ML)

This post follows the Coggle 30 Days of ML program to learn algorithm competitions from the ground up, rather than working from an existing baseline. I expect to review and learn:

  • Pandas and NumPy for data processing
  • Using Sklearn and LightGBM models
  • Building and selecting features
  • Building NLP models

The data-mining competition is the "Diabetes Genetic Risk Detection Challenge" (糖尿病遗传风险检测挑战赛); registration page: 2022 iFLYTEK A.I.开发者大赛-讯飞开放平台.

The check-in write-ups follow below:

Task 1: Register for the competition

  • Step 1: Register at 2022 iFLYTEK A.I.开发者大赛-讯飞开放平台
  • Step 2: Download the competition data (click 赛题数据 on the competition page)
  • Step 3: Unzip the data and read it with pandas;
  • Step 4: Inspect the field types of the train and test sets, and include the loading code in the blog post;
import pandas as pd

# The competition CSVs are GBK-encoded
train_df = pd.read_csv('./train.csv', encoding='gbk')
test_df = pd.read_csv('./test.csv', encoding='gbk')

print(train_df.shape, test_df.shape)
print(train_df.dtypes, test_df.dtypes)

Task 2: Competition data analysis

  • Step 1: Count the missing values in each field and compute the missing ratios;
    • Do the train and test sets have the same missing-value distribution?
    • Are there any columns with a very high missing ratio?
  • Step 2: Analyze the field types;
    • How many fields are numeric and how many are categorical?
    • How did you determine each field's type?
    • Write the reasoning down in the blog (a small type-detection sketch follows the code below);
  • Step 3: Compute field correlations;
    • Use .corr() to compute pairwise correlations;
    • Which fields correlate most strongly with the label?
    • Try other visualizations of how the field distributions differ by label (see the boxplot sketch after the heatmap);
# Missing-value ratios per column
print(train_df.isnull().mean(0))
print(test_df.isnull().mean(0))  # distributions match: the 舒张压 (diastolic BP) column is ~4.9% missing in both sets

# Inspect field types
print(train_df.info())
print(test_df.info())
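
A quick heuristic for the field-type question in Step 2, as a sketch: count the distinct values per column and treat low-cardinality columns as categorical (the threshold of 10 is an assumption, not from the data description).

# Heuristic: few distinct values -> likely categorical
for col in train_df.columns:
    n = train_df[col].nunique()
    print(col, n, 'categorical' if n <= 10 else 'numeric')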

# Compute feature correlations (numeric_only skips the string column 糖尿病家族史; required on pandas >= 2.0)
corr = train_df.corr(numeric_only=True)
print(corr)

# Correlation heatmap
import seaborn as sns
import matplotlib.pyplot as plt
# Make Chinese labels render correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# plt.figure(figsize=(15, 10))
heatmap = sns.heatmap(corr, cmap='Purples')
plt.show()  # the label 患有糖尿病标识 correlates most with 肱三头肌皮褶厚度 (triceps skinfold) and 体重指数 (BMI)

[Figure 1: correlation heatmap of the training-set fields]
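
For Step 3's extra visualization, a minimal sketch with per-label boxplots, using the two features the heatmap flagged:

# Compare feature distributions between the two label groups
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.boxplot(x='患有糖尿病标识', y='体重指数', data=train_df, ax=axes[0])
sns.boxplot(x='患有糖尿病标识', y='肱三头肌皮褶厚度', data=train_df, ax=axes[1])
plt.show()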

Task 3: A first try with logistic regression

  • Step 1: Import logistic regression from sklearn;
  • Step 2: Train on the training set and predict on the test set;
  • Step 3: Submit the Step 2 predictions to the competition and screenshot the score;
  • Step 4: Hold out 20% of the training set as a validation set, train on the rest, predict on the hold-out, and tune the logistic regression hyperparameters (a small sweep follows the code below);
  • Step 5: If the score improves, repeat Steps 2 and 3;
# Mark the test rows with label -1 so train and test can be processed together
test_df['患有糖尿病标识'] = -1
data = pd.concat([train_df, test_df])

# Encode the string column as integers
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
data['糖尿病家族史'] = labelencoder.fit_transform(data['糖尿病家族史'])

# Fill missing values with the column mean
data['舒张压'] = data['舒张压'].fillna(data['舒张压'].mean())

# First logistic regression
from sklearn.linear_model import LogisticRegression as LR
X_train = data[data['患有糖尿病标识'] >= 0].drop(['患有糖尿病标识', '编号'], axis=1)
y_train = data[data['患有糖尿病标识'] >= 0]['患有糖尿病标识']
X_test = data[data['患有糖尿病标识'] < 0].drop(['患有糖尿病标识', '编号'], axis=1)
print(X_train.shape, y_train.shape, X_test.shape)  # (5070, 8) (5070,) (1000, 8)

X_train.info()

model = LR()
model.fit(X_train, y_train)
y_test = model.predict(X_test)
test_df['label'] = y_test
test_df.rename({'编号': 'uuid'}, axis=1)[['uuid', 'label']].to_csv('submit.csv', index=False)

## Hold out 20% of the training set as validation, train on the rest, predict on the hold-out, and tune the hyperparameters
from sklearn.model_selection import train_test_split
X_train2, X_valid2, y_train2, y_valid2 = train_test_split(X_train, y_train, test_size=0.2)
print(X_train2.shape, y_train2.shape, X_valid2.shape, y_valid2.shape)

model = LR()
model.fit(X_train2, y_train2)
y_pred = model.predict(X_valid2)
from sklearn.metrics import f1_score
print(f1_score(y_valid2, y_pred))  # 0.75; note that f1_score expects (y_true, y_pred)
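
A minimal sweep over the regularization strength C for Step 4, as a sketch (the candidate values are assumptions):

# Try several values of C and keep the best validation F1
best_c, best_f1 = None, 0
for c in [0.01, 0.1, 1, 10, 100]:
    m = LR(C=c, max_iter=1000)
    m.fit(X_train2, y_train2)
    f1 = f1_score(y_valid2, m.predict(X_valid2))
    print(c, f1)
    if f1 > best_f1:
        best_c, best_f1 = c, f1
print('best C:', best_c, 'F1:', best_f1)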

Task 4: Feature engineering (with pandas)

  • Step 1: Compute the mean 体重指数 (BMI) and 舒张压 (diastolic BP) for each sex;
  • Step 2: Compute each patient's deviation from the mean for their sex;
  • Step 3: On top of that, hold out 20% for validation and train logistic regression again; does the score improve?
  • Step 4: Think about what the fields mean and try new features, writing the attempts into the blog (an extra sketch follows the age feature below);
# Mean BMI and diastolic BP per sex (in this dataset 性别 0 = female, 1 = male)
female_bmi = train_df[train_df['性别'] == 0]['体重指数'].mean()
female_bp = train_df[train_df['性别'] == 0]['舒张压'].mean()
male_bmi = train_df[train_df['性别'] == 1]['体重指数'].mean()
male_bp = train_df[train_df['性别'] == 1]['舒张压'].mean()

print('Female: mean BMI', female_bmi, 'mean diastolic BP', female_bp)
print('Male: mean BMI', male_bmi, 'mean diastolic BP', male_bp)

'''
Female: mean BMI 37.19760348583878  mean diastolic BP 88.75514089870525
Male: mean BMI 38.92521588946459  mean diastolic BP 90.22257624032773
'''

# Each patient's deviation from the mean for their sex
# (assign via data.loc[row_mask, col] so the new columns are created without chained-assignment issues)
data.loc[data['性别'] == 0, '体重差异'] = data.loc[data['性别'] == 0, '体重指数'] - female_bmi
data.loc[data['性别'] == 1, '体重差异'] = data.loc[data['性别'] == 1, '体重指数'] - male_bmi

data.loc[data['性别'] == 0, '舒张压差异'] = data.loc[data['性别'] == 0, '舒张压'] - female_bp
data.loc[data['性别'] == 1, '舒张压差异'] = data.loc[data['性别'] == 1, '舒张压'] - male_bp

# Refit logistic regression with the new features
from sklearn.linear_model import LogisticRegression as LR
X_train = data[data['患有糖尿病标识'] >= 0].drop(['患有糖尿病标识', '编号'], axis=1)
y_train = data[data['患有糖尿病标识'] >= 0]['患有糖尿病标识']
X_test = data[data['患有糖尿病标识'] < 0].drop(['患有糖尿病标识', '编号'], axis=1)
print(X_train.shape, y_train.shape, X_test.shape)
X_train2, X_valid2, y_train2, y_valid2 = train_test_split(X_train, y_train, test_size=0.2, shuffle=False)
print(X_train2.shape, y_train2.shape, X_valid2.shape, y_valid2.shape)
model = LR()
model.fit(X_train2, y_train2)
y_pred = model.predict(X_valid2)
print(f1_score(y_valid2, y_pred))  # F1 ≈ 0.75, a slight improvement

# New feature: age derived from birth year
data['年龄'] = 2022 - data['出生年份']

from sklearn.linear_model import LogisticRegression as LR
X_train = data[data['患有糖尿病标识'] >= 0].drop(['患有糖尿病标识', '编号'], axis=1)
y_train = data[data['患有糖尿病标识'] >= 0]['患有糖尿病标识']
X_test = data[data['患有糖尿病标识'] < 0].drop(['患有糖尿病标识', '编号'], axis=1)
print(X_train.shape, y_train.shape, X_test.shape)
X_train2, X_valid2, y_train2, y_valid2 = train_test_split(X_train, y_train, test_size=0.2, shuffle=False)
print(X_train2.shape, y_train2.shape, X_valid2.shape, y_valid2.shape)
model = LR()
model.fit(X_train2, y_train2)
y_pred = model.predict(X_valid2)
print(f1_score(y_valid2, y_pred))  # F1 improves slightly
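
One more hypothetical feature for Step 4, as a sketch: a binary flag for elevated diastolic pressure (the 90 mmHg cutoff is a common clinical threshold, an assumption here rather than anything from the data description).

# Hypothetical feature: diastolic BP at or above 90 mmHg
data['高舒张压'] = (data['舒张压'] >= 90).astype(int)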

Task 5: Feature selection

  • Step 1: Train a tree model and select the Top-5 features by feature importance (a permutation-importance cross-check follows the plot below);
  • Step 2: Train logistic regression on the selected features; does the validation score improve?
  • Step 3: If it improved, why? If not, why not?
  • Step 4: Write the attempts into the blog;
from sklearn.tree import DecisionTreeClassifier
model_tree = DecisionTreeClassifier()
model_tree.fit(X_train2, y_train2)

y_pred2 = model_tree.predict(X_valid2)  # predict with the tree, not the earlier LR model
print(f1_score(y_valid2, y_pred2))  # F1 improves slightly

###### 1. feature_importances_ (available on decision trees, random forests, GBDT, XGBoost, LightGBM)
# Importance table
features_import = pd.DataFrame(X_train.columns, columns=['feature'])
features_import['importance'] = model_tree.feature_importances_  # Gini-based importance by default
features_import.sort_values('importance', inplace=True)
# Plot
from matplotlib import pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels
# plt.rcParams['axes.unicode_minus'] = False  # render minus signs
plt.barh(features_import['feature'], features_import['importance'], height=0.7, color='#008792', edgecolor='#005344')
plt.show()

[Figure 2: decision-tree feature importances]
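
As a cross-check on the Gini-based ranking, permutation importance is an alternative; a sketch using sklearn.inspection (available since sklearn 0.22):

# Permutation importance: mean F1 drop when each column is shuffled
from sklearn.inspection import permutation_importance
result = permutation_importance(model_tree, X_valid2, y_valid2,
                                scoring='f1', n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1]:
    print(X_valid2.columns[idx], result.importances_mean[idx])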

# Keep the Top-5 features and refit logistic regression
data2 = data[['体重指数', '肱三头肌皮褶厚度', '口服耐糖量测试', '舒张压', '胰岛素释放实验', '患有糖尿病标识']]

from sklearn.linear_model import LogisticRegression as LR
X_train = data2[data2['患有糖尿病标识'] >= 0].drop(['患有糖尿病标识'], axis=1)
y_train = data2[data2['患有糖尿病标识'] >= 0]['患有糖尿病标识']
X_test = data2[data2['患有糖尿病标识'] < 0].drop(['患有糖尿病标识'], axis=1)
print(X_train.shape, y_train.shape, X_test.shape)
X_train2, X_valid2, y_train2, y_valid2 = train_test_split(X_train, y_train, test_size=0.2, shuffle=False)
print(X_train2.shape, y_train2.shape, X_valid2.shape, y_valid2.shape)
model = LR()
model.fit(X_train2, y_train2)
y_pred = model.predict(X_valid2)
print(f1_score(y_valid2, y_pred))  # F1 dropped: the discarded features still carried signal, so a drop is expected

Task 6: A stronger tree model

  • Step 1: Install LightGBM and learn its basic usage;
  • Step 2: Hold out 20% of the training set for validation and train LightGBM; does the score improve?
  • Step 3: Submit the Step 2 predictions to the competition and screenshot the score (a submission sketch follows the code below);
  • Step 4: Try tuning LightGBM's parameters;
  • Step 5: Retrain with the parameters tuned in Step 4, submit the new predictions, and screenshot the score;
# LightGBM
import lightgbm as lgb

clf = lgb.LGBMClassifier(
    max_depth=3,
    n_estimators=4000,
    n_jobs=-1,
    verbose=-1,  # verbose and verbosity are aliases; pass only one
    learning_rate=0.1,
)
# Rebuild the full-feature matrices (X_train was narrowed to 5 columns in Task 5)
X_train = data[data['患有糖尿病标识'] >= 0].drop(['患有糖尿病标识', '编号'], axis=1)
y_train = data[data['患有糖尿病标识'] >= 0]['患有糖尿病标识']
X_test = data[data['患有糖尿病标识'] < 0].drop(['患有糖尿病标识', '编号'], axis=1)
X_train2, X_valid2, y_train2, y_valid2 = train_test_split(X_train, y_train, test_size=0.2, shuffle=False)

clf.fit(X_train2, y_train2)
y_pred = clf.predict(X_valid2)
print(f1_score(y_valid2, y_pred))  # F1 reaches 0.94, a big improvement
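
For Step 3's submission, refit on the full training set and write the predictions in the same format as the earlier submit code; a sketch (the filename is my choice):

# Refit on all training data and write the LightGBM submission
clf.fit(X_train, y_train)
test_df['label'] = clf.predict(X_test)
test_df.rename({'编号': 'uuid'}, axis=1)[['uuid', 'label']].to_csv('submit_lgb.csv', index=False)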

# Tune LightGBM with a grid search
from sklearn.model_selection import GridSearchCV
estimator = lgb.LGBMClassifier(
    max_depth=3,
    n_jobs=-1,
    verbose=-1,
    learning_rate=0.1)

param_grid = {
    'learning_rate': [0.01, 0.1, 1],
    'n_estimators': [20, 100, 1000, 2000, 3000, 4000]
}

clf2 = GridSearchCV(estimator, param_grid)
clf2.fit(X_train2, y_train2)
print(clf2.best_params_)  # inspect the winning parameter combination
y_pred = clf2.predict(X_valid2)
print(f1_score(y_valid2, y_pred))  # F1 reaches 0.95, a further small gain

Task 7: K-fold training and ensembling

  • Step 1: Split the data with KFold;
  • Step 2: Split the data with StratifiedKFold;
  • Step 3: Train and predict with LightGBM on the StratifiedKFold splits;
  • Step 4: Step 3 trains one model per fold; average their test-set predictions, submit the result, and screenshot the score;
  • Step 5: Train 5 machine-learning models (SVM, LR, etc.) with cross-validation, combine them via stacking, submit the result, and screenshot the score;

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold
n_splits = 5

# KFold variant
# kfold = KFold(n_splits=n_splits, shuffle=False)

# StratifiedKFold keeps the label ratio the same in every fold
kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=2019)

# Allocate the out-of-fold and test prediction arrays once, outside the loop
train_pred = np.zeros((len(X_train), len(np.unique(y_train))))
test_pred = np.zeros((len(X_test), len(np.unique(y_train))))

for trn_idx, val_idx in kfold.split(X_train, y_train):
    X_train3, X_valid3 = X_train.iloc[trn_idx], X_train.iloc[val_idx]
    y_train3, y_valid3 = y_train.iloc[trn_idx], y_train.iloc[val_idx]
    eval_set = [(X_valid3, y_valid3)]
    clf2.fit(X_train3, y_train3, eval_set=eval_set)

    test_pred += clf2.predict_proba(X_test)
    train_pred[val_idx] = clf2.predict_proba(X_valid3)

a = test_pred / n_splits
test_df['label'] = a.argmax(1)
test_df.rename({'编号': 'uuid'}, axis=1)[['uuid', 'label']].to_csv('result.csv', index=False)


The submitted score dropped to 0.94, no better than the earlier single random split; the cause is that the fold models were trained without tuning (a sketch of a fix follows).
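
A hedged fix, as a sketch: train one plain LGBMClassifier per fold with the grid-searched parameters and early stopping, rather than rerunning the whole grid search inside every fold (assumes lightgbm >= 3.3 for the early_stopping callback):

# One tuned model per fold, early-stopped on that fold's validation split
test_pred = np.zeros((len(X_test), 2))
for trn_idx, val_idx in kfold.split(X_train, y_train):
    fold_clf = lgb.LGBMClassifier(max_depth=3, n_estimators=4000,
                                  learning_rate=0.1, n_jobs=-1, verbose=-1)
    fold_clf.fit(X_train.iloc[trn_idx], y_train.iloc[trn_idx],
                 eval_set=[(X_train.iloc[val_idx], y_train.iloc[val_idx])],
                 callbacks=[lgb.early_stopping(100, verbose=False)])
    test_pred += fold_clf.predict_proba(X_test)
test_pred /= n_splits  # averaged fold probabilities; argmax(1) gives labels as before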

# Stacking: four level-1 classifiers and one level-2 classifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

clfs = [svm.SVC(C=3, kernel='rbf'),
        RandomForestClassifier(n_estimators=100, max_features='log2', max_depth=10, min_samples_leaf=1, bootstrap=True, n_jobs=-1, random_state=1),
        lgb.LGBMClassifier(),
        XGBClassifier(n_estimators=100, objective='binary:logistic', gamma=1, max_depth=10, subsample=0.8, nthread=-1, seed=1)
        ]

# Level-2 training matrix (out-of-fold predictions) and test matrix
dataset_blend_train = np.zeros((X_train.shape[0], len(clfs)), dtype=int)  # plain int: np.int was removed from NumPy
dataset_blend_test = np.zeros((X_test.shape[0], len(clfs)), dtype=int)

# 5-fold predictions from each level-1 classifier
n_folds = 5
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=1)
for i, clf in enumerate(clfs):
    dataset_blend_test_j = np.zeros((X_test.shape[0], n_folds))  # this classifier's per-fold test predictions
    for j, (train_index, test_index) in enumerate(skf.split(X_train, y_train)):
        tr_x = X_train.iloc[train_index]
        tr_y = y_train.iloc[train_index]
        clf.fit(tr_x, tr_y)
        dataset_blend_train[test_index, i] = clf.predict(X_train.iloc[test_index])
        dataset_blend_test_j[:, j] = clf.predict(X_test)
    # Majority vote across folds: 1 iff at least 3 of the 5 fold models predict 1
    dataset_blend_test[:, i] = dataset_blend_test_j.sum(axis=1) // (n_folds // 2 + 1)

# Level-2 classifier
clf = LR(tol=1e-6, C=1.0, random_state=1, n_jobs=-1)
clf.fit(dataset_blend_train, y_train)
prediction = clf.predict(dataset_blend_test)
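
Writing the stacked prediction out follows the same pattern as the earlier submissions; a sketch (the filename is my choice):

# Final submission from the level-2 prediction
test_df['label'] = prediction
test_df.rename({'编号': 'uuid'}, axis=1)[['uuid', 'label']].to_csv('stacking_submit.csv', index=False)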

The submitted score reached 0.963, a new personal best. Pushing it higher will have to come from the features. At this point the leaderboard already has plenty of perfect scores; if time allows I will keep working on feature engineering.
