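Predictive modeling on a disease dataset: a logistic regression and a decision tree are fitted, with accuracy and AUC as the evaluation metrics.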
import pandas as pd
import numpy as np
hy_data = pd.read_csv(r'/work/hy_data.csv')
hy_data.head()
corr = hy_data.corr()
import seaborn as sns
import matplotlib.pyplot as plt
plt.subplots(figsize=(12,12))
sns.heatmap(corr,annot=True,vmax=1,square=True,cmap='Reds')
plt.savefig(r'/work/hy_data_corr.png')
plt.show()
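Based on the correlation analysis, the variables cp, trestbps, restecg, thalach, exang, oldpeak, slope, ca, and thal are strongly correlated with the target variable, so they are used as the model features in what follows.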
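The selection above can also be checked numerically instead of being read off the heatmap. The following is a minimal sketch, assuming the target column is named `target` as in hy_data; the helper `rank_by_target_corr` and the synthetic `demo` frame are hypothetical and only stand in for the real data.

```python
import numpy as np
import pandas as pd


def rank_by_target_corr(df: pd.DataFrame, target: str = "target") -> pd.Series:
    """Return features sorted by absolute Pearson correlation with the target."""
    corr_with_target = df.corr()[target].drop(target)
    return corr_with_target.abs().sort_values(ascending=False)


# Synthetic stand-in for hy_data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({"target": rng.integers(0, 2, size=n)})
demo["strong"] = demo["target"] + rng.normal(0, 0.3, size=n)  # tied to the target
demo["weak"] = rng.normal(0, 1, size=n)                       # independent noise

ranking = rank_by_target_corr(demo)
print(ranking)
```

On the real data, the call would be `rank_by_target_corr(hy_data)`. Where to cut the ranking is a judgment call that the correlation values alone do not settle.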
hy_train_data = hy_data.loc[:,['cp','trestbps','restecg','thalach','exang','oldpeak','slope','ca','thal','target']]
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import roc_curve, auc ,roc_auc_score
X = hy_train_data.iloc[:,:-1]
y = hy_train_data.iloc[:,-1]
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create the logistic regression model
model = LogisticRegression(
C=7957.0,max_iter=10,solver="liblinear"
)
# Train the model
model.fit(X_train, y_train)
# Predict on the test set
y_pred = model.predict(X_test)
# Compute the accuracy
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
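# Area under the ROC curve (computed here from the hard class predictions)
roc_auc = roc_auc_score(y_test,y_pred)
print('Area under the ROC curve:',roc_auc)
Model metrics: accuracy and AUC
Accuracy: 0.8558558558558559
Area under the ROC curve: 0.8487105084866639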
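The call `roc_auc_score(y_test, y_pred)` above receives hard 0/1 predictions, so the ROC curve has a single operating point and the result equals the balanced accuracy. AUC is more commonly computed from continuous scores such as `predict_proba`. The sketch below contrasts the two on synthetic data; `X_demo`, `y_demo`, and `demo_model` are hypothetical stand-ins, and the numbers it prints say nothing about the results reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; swap in X, y from hy_train_data)
X_demo, y_demo = make_classification(n_samples=500, n_features=9, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(X_tr, y_tr)

hard_pred = demo_model.predict(X_te)               # 0/1 labels
pos_score = demo_model.predict_proba(X_te)[:, 1]   # probability of class 1

auc_from_labels = roc_auc_score(y_te, hard_pred)   # one threshold only
auc_from_scores = roc_auc_score(y_te, pos_score)   # all thresholds
bal_acc = balanced_accuracy_score(y_te, hard_pred)

print("AUC from hard labels:  ", auc_from_labels)
print("Balanced accuracy:     ", bal_acc)
print("AUC from probabilities:", auc_from_scores)
```

For the model in this walkthrough, the equivalent change would be passing `model.predict_proba(X_test)[:, 1]` as the second argument; the resulting value would differ from the one printed above.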
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
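# Area under the ROC curve
print('AUC:',roc_auc_score(y_test,y_pred))
The decision tree results are:
Accuracy: 1.0
AUC: 1.0
On this test split the decision tree performs best (accuracy and AUC both 1.0, against roughly 0.856 and 0.849 for logistic regression), so it is the model chosen for prediction.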
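A perfect score on a held-out split is worth a sanity check before the model is relied on. One possible cause, which the source does not confirm for hy_data, is duplicated rows landing in both the training and the test set. The sketch below counts duplicates and compares cross-validated accuracy with and without them on a synthetic frame whose labels are pure noise; all names and values in it are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 60 unique rows with random labels, each repeated 3 times
rng = np.random.default_rng(0)
base = pd.DataFrame(rng.normal(size=(60, 3)), columns=["f1", "f2", "f3"])
base["target"] = rng.integers(0, 2, size=60)  # labels carry no signal
demo = pd.concat([base, base, base], ignore_index=True)

n_dup = int(demo.duplicated().sum())  # rows that repeat an earlier row
dedup = demo.drop_duplicates()

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
tree = DecisionTreeClassifier(random_state=0)

# With duplicates, copies of a test row usually sit in the training folds
scores_with_dups = cross_val_score(
    tree, demo.drop(columns="target"), demo["target"], cv=cv
)
# Without duplicates, the tree has to generalize to unseen rows
scores_dedup = cross_val_score(
    tree, dedup.drop(columns="target"), dedup["target"], cv=cv
)

print("duplicate rows:", n_dup)
print("mean accuracy with duplicates:   ", scores_with_dups.mean())
print("mean accuracy without duplicates:", scores_dedup.mean())
```

Running the same two steps on hy_train_data (`duplicated().sum()`, then `cross_val_score` on the deduplicated frame) would show whether the 1.0 survives once each test row is genuinely unseen.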