Kaggle Part 3: Diabetes Detection for Beginners

I. Data Cleaning

Before we start: the dataset's columns

Preg   Number of times pregnant
Plas   Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Pres   Diastolic blood pressure (mm Hg)
Skin   Triceps skin fold thickness (mm)
Insu   2-hour serum insulin (mu U/ml)
Mass   Body mass index (weight in kg / (height in m)^2)
Pedi   Diabetes pedigree function
age    Age (years)
class  Class variable (0 or 1)

1. Data Import

As before, we load the data with pandas:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter

diabetes_data = pd.read_csv(r"C:\Users\86137\PycharmProjects\pythonProject\venv\糖尿病检测\diabetes.csv")

 

2. Inspecting the Data

Step 1

diabetes_data.head()  # view the first five rows

[Figure 1: output of diabetes_data.head()]

 

Step 2

diabetes_data  # display the whole DataFrame

[Figure 2: the full DataFrame]

The output shows 768 rows in total.
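The row count can also be confirmed directly (a quick check, not in the original post):

print(diabetes_data.shape)  # (rows, columns); expect (768, 9)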

Step 3

diabetes_data.info()  # check dtypes and non-null counts (this is how missing values show up)

[Figure 3: output of diabetes_data.info()]

A value of 0 is not physiologically plausible for glucose, blood pressure, skin thickness, insulin, or BMI, so we treat zeros in those columns as missing values:

# Replace every zero in the measurement columns with NaN
diabetes_data_copy = diabetes_data.copy(deep = True)
diabetes_data_copy[['plas','pres','skin','insu','mass']] = diabetes_data_copy[['plas','pres','skin','insu','mass']].replace(0, np.nan)

print(diabetes_data_copy.isnull().sum())

which gives:

[Figure 4: missing-value counts per column]

From the counts, plas and mass are missing only a handful of values, pres a few more, while skin and insu are missing far too many.
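One simple option (a sketch, not the original post's code) is to fill each marked column with its median on the copy, since the median is robust to outliers:

# Median-impute the columns whose zeros were marked as NaN above.
# skin and insu are missing too many values for this to be ideal,
# but it keeps every row usable.
for col in ['plas', 'pres', 'skin', 'insu', 'mass']:
    diabetes_data_copy[col] = diabetes_data_copy[col].fillna(diabetes_data_copy[col].median())
print(diabetes_data_copy.isnull().sum())  # all zeros after imputation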

Step 4

diabetes_data.describe()  # summary statistics for each column

[Figure 5: output of diabetes_data.describe()]

 

3. Outlier Detection

We reuse the same code as in the Titanic chapter:

# Outlier detection (Tukey method)

def detect_outliers(df, n, features):
    """
    Takes a dataframe df of features and returns a list of the indices
    corresponding to the observations containing more than n outliers according
    to the Tukey method.
    """
    outlier_indices = []

    # iterate over features(columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col], 75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1

        # outlier step
        outlier_step = 1.5 * IQR

        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)].index

        # append the found outlier indices for col to the list of outlier indices
        outlier_indices.extend(outlier_list_col)

    # select observations that are outliers in more than n features
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(k for k, v in outlier_indices.items() if v > n)

    return multiple_outliers
Outliers_to_drop = detect_outliers(diabetes_data, 4, ["preg", "plas", "pres", "skin","insu","mass","pedi","age"])
print(diabetes_data.loc[Outliers_to_drop]) # Show the outlier rows
# No row is an outlier in more than four features, so no rows are dropped

For more on handling missing values, see 3000字详解四种常用的缺失值处理方法_一行玩python的博客-CSDN博客 (a 3000-word walkthrough of four common missing-value strategies).

 

4. Converting the Non-numeric Column

diabetes_data['OutCome'] = diabetes_data['class'].map({"b'tested_positive'":1,"b'tested_negative'":0})
diabetes_data.drop(labels = ["class"], axis = 1, inplace = True)
# create a new OutCome column that maps b'tested_positive' to 1 and b'tested_negative' to 0, then drop the original class column
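If every value in the new OutCome column comes out as NaN, the labels stored in the CSV differ from the strings assumed above. A quick diagnostic, run before the mapping, is:

# Inspect the raw labels; the exact strings depend on how the CSV was
# exported from the original ARFF file
print(diabetes_data['class'].unique())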

Then check the dtypes:

print("Column dtypes")
print(diabetes_data.dtypes)

Result: all columns are now numeric.

[Figure 6: column dtypes after conversion]

Check the first five rows again:

diabetes_data.head()

[Figure 7: first five rows after conversion]

5. Analyzing the Feature Columns

(1) Correlation matrix

print("Correlation matrix")

# Correlation matrix between the numerical features and OutCome
g = sns.heatmap(diabetes_data[["preg", "plas", "pres", "skin","insu","mass","pedi","age","OutCome"]].corr(),annot=True, fmt = ".2f", cmap = "coolwarm")
plt.show()

[Figure 8: correlation heatmap of all features and OutCome]

Analysis: the features most strongly correlated with OutCome are plas, mass, age, and preg. The columns with many missing (zero-coded) values show the weakest correlations.
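The same ranking can be read off numerically, without the heatmap (a quick sketch):

# Correlation of each feature with OutCome, strongest first
print(diabetes_data.corr()['OutCome'].drop('OutCome').sort_values(ascending=False))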

(2) plas vs. OutCome

From the missing-value counts above, plas is missing only 5 values, so we impute them with the median.

diabetes_data['plas'] = diabetes_data['plas'].fillna(diabetes_data['plas'].median())  # assign back; fillna is not in-place

With a bar plot, the figure looks like this:

# bar plot
print("plas vs. OutCome")

# Explore plas feature vs Outcome
g = sns.catplot(x="plas",y="OutCome",data=diabetes_data,kind="bar", height = 6 ,
palette = "muted")
g.despine(left=True)
g = g.set_ylabels("OutCome")
plt.show()

[Figure 9: bar plot of plas vs. OutCome]

Because plas is continuous, a bar over every distinct value is unreadable, so we use histograms instead:

g = sns.FacetGrid(diabetes_data, col='OutCome')
g = g.map(sns.histplot, "plas")
plt.show()

[Figure 10: plas histograms split by OutCome]

Analysis: the charts show that positive cases cluster around plas 20-40 while negative cases spread across the larger 0-60 range, suggesting plas strongly influences the result.

(3) mass vs. OutCome

From the missing-value counts above, mass is missing only 11 values, so we again impute with the median.

print("mass vs. OutCome")

diabetes_data['mass'] = diabetes_data['mass'].fillna(diabetes_data['mass'].median())  # assign back; fillna is not in-place
g = sns.FacetGrid(diabetes_data, col='OutCome')
g = g.map(sns.histplot, "mass")
plt.show()
[Figure 11: mass histograms split by OutCome]

Analysis: the mass values of positive cases are lower than those of negative cases and do not exceed 60.

(4) age vs. OutCome

print("age vs. OutCome")

# Explore age vs OutCome
g = sns.FacetGrid(diabetes_data, col='OutCome')
g = g.map(sns.histplot, "age")
plt.show()

[Figure 12: age histograms split by OutCome]

Analysis: positive cases are spread fairly evenly across ages 20-60, while the 60-80 range is mostly negative.

(5) preg vs. OutCome

print("preg vs. OutCome")

# Explore preg feature vs Outcome
g = sns.catplot(x="preg",y="OutCome",data=diabetes_data,kind="bar", height = 6 ,
palette = "muted")
g.despine(left=True)
g = g.set_ylabels("OutCome")
plt.show()

[Figure 13: bar plot of preg vs. OutCome]

Analysis: the higher preg is, the more likely the test is positive.
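A quick numeric check of this trend (a sketch, not in the original post):

# OutCome is 0/1, so the mean within each preg group is the positive rate
print(diabetes_data.groupby('preg')['OutCome'].mean())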

(6) age distribution

print("age distribution")

# Explore the age distribution (fill=True replaces the deprecated shade=True)
g = sns.kdeplot(diabetes_data["age"][(diabetes_data["OutCome"] == 0) & (diabetes_data["age"].notnull())], color="Red", fill=True)
g = sns.kdeplot(diabetes_data["age"][(diabetes_data["OutCome"] == 1) & (diabetes_data["age"].notnull())], ax=g, color="Blue", fill=True)
g.set_xlabel("age")
g.set_ylabel("Frequency")
g = g.legend(["0","1"])
plt.show()

[Figure 14: age density curves by OutCome]

Analysis: a positive result is more likely between ages 30 and 60.

(7) age vs. plas, mass, and preg

print("age vs. plas, mass, and preg")

# Explore age vs plas, mass and preg

g = sns.catplot(y="age",x="plas",hue="OutCome", data=diabetes_data,kind="box")
g = sns.catplot(y="age",x="mass", hue="OutCome",data=diabetes_data,kind="box")
g = sns.catplot(y="age",x="preg",hue="OutCome",data=diabetes_data,kind="box")
plt.show()

[Figures 15-17: box plots of plas, mass, and preg against age, split by OutCome]

II. Model Training

Cross validate models

from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV, learning_curve
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier)
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Cross validate model with Kfold stratified cross val
kfold = StratifiedKFold(n_splits=10)

# Modeling step: test different algorithms
random_state = 2
classifiers = []
classifiers.append(SVC(random_state=random_state))
classifiers.append(DecisionTreeClassifier(random_state=random_state))
classifiers.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state),random_state=random_state,learning_rate=0.1))
classifiers.append(RandomForestClassifier(random_state=random_state))
classifiers.append(ExtraTreesClassifier(random_state=random_state))
classifiers.append(GradientBoostingClassifier(random_state=random_state))
classifiers.append(MLPClassifier(random_state=random_state))
classifiers.append(KNeighborsClassifier())
classifiers.append(LogisticRegression(random_state = random_state))
classifiers.append(LinearDiscriminantAnalysis())

diabetes_data["OutCome"] = diabetes_data["OutCome"].astype(int)

Y_diabetes_data = diabetes_data["OutCome"]

X_diabetes_data = diabetes_data.drop(labels = ["OutCome"],axis = 1)

cv_results = []
for classifier in classifiers :
    cv_results.append(cross_val_score(classifier, X_diabetes_data, y = Y_diabetes_data, scoring = "accuracy", cv = kfold, n_jobs=4))

cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())

cv_res = pd.DataFrame({"CrossValMeans":cv_means,"CrossValerrors": cv_std,"Algorithm":["SVC","DecisionTree","AdaBoost",
"RandomForest","ExtraTrees","GradientBoosting","MultipleLayerPerceptron","KNeighbors","LogisticRegression","LinearDiscriminantAnalysis"]})

g = sns.barplot(x="CrossValMeans", y="Algorithm", data = cv_res, palette="Set3", orient = "h", **{'xerr':cv_std})
g.set_xlabel("Mean Accuracy")
g = g.set_title("Cross validation scores")

[Figure 18: cross-validation scores by algorithm]
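The same scores can be read as a table sorted by mean accuracy (a quick sketch, not in the original post):

# Strongest model first
print(cv_res.sort_values("CrossValMeans", ascending=False))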

Hyperparameter tuning for the best models

### META MODELING  WITH ADABOOST, RF, EXTRATREES and GRADIENTBOOSTING

# Adaboost
DTC = DecisionTreeClassifier()

adaDTC = AdaBoostClassifier(DTC, random_state=7)

ada_param_grid = {"base_estimator__criterion" : ["gini", "entropy"],   # prefix is "estimator__" in scikit-learn >= 1.2
              "base_estimator__splitter" :   ["best", "random"],
              "algorithm" : ["SAMME","SAMME.R"],
              "n_estimators" :[1,2],
              "learning_rate":  [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3,1.5]}

gsadaDTC = GridSearchCV(adaDTC,param_grid = ada_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsadaDTC.fit(X_diabetes_data,Y_diabetes_data)

ada_best = gsadaDTC.best_estimator_

gsadaDTC.best_score_
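To see which combination won, print the chosen hyperparameters as well; the same pattern applies to every grid search below:

print(gsadaDTC.best_params_)  # the winning hyperparameter combination
print(gsadaDTC.best_score_)   # its mean cross-validated accuracy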

 

 

#ExtraTrees 
ExtC = ExtraTreesClassifier()


## Search grid for optimal parameters
ex_param_grid = {"max_depth": [None],
              "max_features": [1, 3, 8],   # the data has only 8 feature columns
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [False],
              "n_estimators" :[100,300],
              "criterion": ["gini"]}


gsExtC = GridSearchCV(ExtC,param_grid = ex_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsExtC.fit(X_diabetes_data,Y_diabetes_data)

ExtC_best = gsExtC.best_estimator_

# Best score
gsExtC.best_score_

 

 

# Random forest parameter tuning
RFC = RandomForestClassifier()


## Search grid for optimal parameters
rf_param_grid = {"max_depth": [None],
              "max_features": [1, 3, 8],   # the data has only 8 feature columns
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [False],
              "n_estimators" :[100,300],
              "criterion": ["gini"]}


gsRFC = GridSearchCV(RFC,param_grid = rf_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsRFC.fit(X_diabetes_data,Y_diabetes_data)

RFC_best = gsRFC.best_estimator_

# Best score
gsRFC.best_score_

 

# Gradient boosting tuning

GBC = GradientBoostingClassifier()
gb_param_grid = {'loss' : ["deviance"],   # renamed to "log_loss" in newer scikit-learn
              'n_estimators' : [100,200,300],
              'learning_rate': [0.1, 0.05, 0.01],
              'max_depth': [4, 8],
              'min_samples_leaf': [100,150],
              'max_features': [0.3, 0.1] 
              }

gsGBC = GridSearchCV(GBC,param_grid = gb_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsGBC.fit(X_diabetes_data,Y_diabetes_data)

GBC_best = gsGBC.best_estimator_

# Best score
gsGBC.best_score_

### SVC classifier
SVMC = SVC(probability=True)
svc_param_grid = {'kernel': ['rbf'], 
                  'gamma': [ 0.001, 0.01, 0.1, 1],
                  'C': [1, 10, 50, 100,200,300, 1000]}

gsSVMC = GridSearchCV(SVMC,param_grid = svc_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsSVMC.fit(X_diabetes_data,Y_diabetes_data)

SVMC_best = gsSVMC.best_estimator_

# Best score
gsSVMC.best_score_

 

Plot learning curves 

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """Generate a simple plot of the test and training learning curve"""
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

g = plot_learning_curve(gsRFC.best_estimator_,"RF learning curves",X_diabetes_data,Y_diabetes_data,cv=kfold)
g = plot_learning_curve(gsExtC.best_estimator_,"ExtraTrees learning curves",X_diabetes_data,Y_diabetes_data,cv=kfold)
g = plot_learning_curve(gsSVMC.best_estimator_,"SVC learning curves",X_diabetes_data,Y_diabetes_data,cv=kfold)
g = plot_learning_curve(gsadaDTC.best_estimator_,"AdaBoost learning curves",X_diabetes_data,Y_diabetes_data,cv=kfold)
g = plot_learning_curve(gsGBC.best_estimator_,"GradientBoosting learning curves",X_diabetes_data,Y_diabetes_data,cv=kfold)

[Figures 19-23: learning curves for RandomForest, ExtraTrees, SVC, AdaBoost, and GradientBoosting]

Feature importance of tree-based classifiers

nrows = ncols = 2
fig, axes = plt.subplots(nrows = nrows, ncols = ncols, sharex="all", figsize=(15,15))

names_classifiers = [("AdaBoosting", ada_best),("ExtraTrees",ExtC_best),("RandomForest",RFC_best),("GradientBoosting",GBC_best)]

nclassifier = 0
for row in range(nrows):
    for col in range(ncols):
        name = names_classifiers[nclassifier][0]
        classifier = names_classifiers[nclassifier][1]
        indices = np.argsort(classifier.feature_importances_)[::-1][:40]
        g = sns.barplot(y=X_diabetes_data.columns[indices][:40],x = classifier.feature_importances_[indices][:40] , orient='h',ax=axes[row][col])
        g.set_xlabel("Relative importance",fontsize=12)
        g.set_ylabel("Features",fontsize=12)
        g.tick_params(labelsize=9)
        g.set_title(name + " feature importance")
        nclassifier += 1
plt.show()

[Figures 24-25: feature importance plots for AdaBoost, ExtraTrees, RandomForest, and GradientBoosting]

I haven't yet worked out how to adapt the remaining modeling steps of the reference kernel.
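A natural next step, and how the Titanic kernel this series follows concludes, is to combine the tuned models in a soft-voting ensemble. A minimal sketch of that step, adapted to this data (assuming the *_best estimators above were fitted successfully), might look like:

from sklearn.ensemble import VotingClassifier

# Soft voting averages the predicted class probabilities of the tuned
# models; SVC supports this here because it was built with probability=True
votingC = VotingClassifier(estimators=[('rfc', RFC_best), ('extc', ExtC_best),
                                       ('svc', SVMC_best), ('ada', ada_best),
                                       ('gbc', GBC_best)],
                           voting='soft', n_jobs=4)
votingC = votingC.fit(X_diabetes_data, Y_diabetes_data)
print(cross_val_score(votingC, X_diabetes_data, Y_diabetes_data,
                      scoring="accuracy", cv=kfold).mean())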
