Predicting Heart Disease Using Regression Analysis
According to the Centers for Disease Control and Prevention, heart disease is the leading cause of death for both men and women in the United States, and it is a leading cause of death around the globe. Researchers and statisticians can use several data mining techniques to help health care professionals identify heart disease and its potential causes. Significant risk factors associated with heart disease include age, blood pressure, total cholesterol, diabetes, hypertension, family history of heart disease, obesity, and lack of physical exercise.
In this project from DataCamp, my objective is to build a regression model and run statistical tests to assess how strongly clinical factors are associated with heart disease and how they relate to a higher probability of developing it. I implement multiple and logistic regression approaches, together with data exploration in ggplot and dplyr. The project uses the Cleveland heart disease dataset.
Here's a glimpse of the dataset in hand:
Figure: The first five rows of the Cleveland heart disease dataset
Data Dictionary
The dataset has 14 columns, described below:
a. Age: A continuous variable giving the age of the person in years.
b. Sex: A discrete variable giving the sex of the person, where 0 = Female and 1 = Male.
c. CP (chest pain type): A discrete variable giving the chest pain type, where 1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic.
d. Trestbps: A continuous variable giving the resting blood pressure in mm Hg.
e. Cholesterol: A continuous variable giving the serum cholesterol in mg/dl.
f. FBS: A discrete variable that compares the person's fasting blood sugar with 120 mg/dl. If fasting blood sugar > 120 mg/dl, then 1 = true; otherwise 0 = false.
g. RestECG: A discrete variable giving the resting ECG result, where 0 = normal; 1 = ST-T wave abnormality; 2 = left ventricular hypertrophy.
h. Thalach: A continuous variable giving the maximum heart rate achieved.
i. Exang: A discrete variable indicating exercise-induced angina, where 1 = Yes and 0 = No.
j. Oldpeak: A continuous variable giving the ST depression induced by exercise relative to rest.
k. Slope: A discrete variable giving the slope of the peak exercise ST segment, where 1 = up-sloping; 2 = flat; 3 = down-sloping.
l. ca: A numeric variable giving the number of major vessels colored by fluoroscopy, ranging from 0 to 3.
m. Thal: A discrete variable for thalassemia, where 3 = normal; 6 = fixed defect; 7 = reversible defect.
n. Class: A discrete variable giving the diagnosis class, where 0 = no presence of heart disease and 1 to 4 rank the person from least likely (1) to most likely (4) to have heart disease.
Data Wrangling
Because the outcome variable class has more than two levels, I created a new variable, hd, using mutate() to represent a binary 0/1 outcome: any value greater than 0 becomes 1, and all 0 values stay 0. I also relabeled the sex levels (originally 1 and 0) as Male/Female for clarity.
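The recoding described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the five rows are made-up stand-ins for the Cleveland data, and using ifelse() and factor() inside mutate() is my assumption about how the step could be written.

```r
library(dplyr)

# Toy stand-in for the Cleveland data (values are made up for illustration)
hd_data <- data.frame(
  age   = c(63, 67, 41, 56, 57),
  sex   = c(1, 1, 0, 1, 0),
  class = c(0, 2, 0, 1, 3)
)

hd_data <- hd_data %>%
  mutate(
    # Collapse the 0-4 diagnosis class into a binary outcome
    hd  = ifelse(class > 0, 1, 0),
    # Relabel the numeric sex codes: 0 = Female, 1 = Male
    sex = factor(sex, levels = 0:1, labels = c("Female", "Male"))
  )

print(hd_data)
```

Storing sex as a factor means that R later reports the model coefficient as sexMale, with Female as the reference level.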
Statistical Tests
I ran statistical tests to check which predictor variables are associated with heart disease. Depending on the data type (continuous or discrete), I used a t-test or a chi-squared test to obtain p-values.
In this project, I examined how sex, age, and thalach are related to heart disease, as shown below:
→ Sex: Since sex is a binary variable in this dataset, the chi-squared test is the appropriate test for it. Here is the output of chisq.test() assessing the relationship between sex and hd (the outcome variable):
data: hd_data$sex and hd_data$hd
X-squared = 22.043, df = 1, p-value = 2.667e-06
→ Age: Since age is a continuous variable, I used t.test() to assess the relationship between age and hd.
data: hd_data$age by hd_data$hd
t = -4.0303, df = 300.93, p-value = 7.061e-05
→ Thalach: I used t.test() again to assess the relationship between thalach and hd.
data: hd_data$thalach by hd_data$hd
t = 7.8579, df = 272.27, p-value = 9.106e-14
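The three tests above can be reproduced in outline with base R. The sketch below runs on simulated data, so its statistics will not match the values reported for the Cleveland dataset; the column names follow the article, while the simulation parameters are arbitrary assumptions.

```r
# Simulated stand-in data; the distributions below are arbitrary assumptions
set.seed(42)
n <- 200
hd_data <- data.frame(
  sex     = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  age     = round(rnorm(n, mean = 54, sd = 9)),
  thalach = round(rnorm(n, mean = 150, sd = 23)),
  hd      = rbinom(n, size = 1, prob = 0.45)
)

# Binary predictor vs. binary outcome: chi-squared test on the 2x2 table
hd_sex <- chisq.test(hd_data$sex, hd_data$hd)

# Continuous predictor vs. binary outcome: Welch two-sample t-test
hd_age     <- t.test(hd_data$age ~ hd_data$hd)
hd_thalach <- t.test(hd_data$thalach ~ hd_data$hd)

print(hd_sex$p.value)
print(hd_age$p.value)
print(hd_thalach$p.value)
```

Each call returns an htest object, so the p-value, test statistic, and degrees of freedom can be pulled out by name instead of being read off the printed output.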
Graphical visualization of the associations (Because every picture tells a better story!)
I plotted box plots for the continuous variables age and thalach (max heart rate).
Figure: Age vs. hd
Figure: Thalach (max heart rate) vs. hd
I plotted a bar plot for sex, since it is a binary variable in this dataset.
Figure: Sex vs. hd
→ The plots above and the statistical tests show that all three clinical variables chosen (age, sex, and thalach) are significantly associated with the outcome, with p-value < 0.001 in every test.
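Plots of this kind can be built with ggplot2 along the following lines. This is a sketch on simulated data; the hd_label column and the choice of a proportional (position = "fill") bar chart are my assumptions, not necessarily what the original figures used.

```r
library(ggplot2)

# Simulated stand-in data; the distributions below are arbitrary assumptions
set.seed(42)
n <- 200
hd_data <- data.frame(
  sex     = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  age     = round(rnorm(n, mean = 54, sd = 9)),
  thalach = round(rnorm(n, mean = 150, sd = 23)),
  hd      = rbinom(n, size = 1, prob = 0.45)
)

# Hypothetical helper column with readable outcome labels
hd_data$hd_label <- factor(hd_data$hd, levels = 0:1,
                           labels = c("No disease", "Disease"))

# Box plots for the continuous predictors
age_plot <- ggplot(hd_data, aes(x = hd_label, y = age)) +
  geom_boxplot() +
  labs(x = "Heart disease", y = "Age (years)")

thalach_plot <- ggplot(hd_data, aes(x = hd_label, y = thalach)) +
  geom_boxplot() +
  labs(x = "Heart disease", y = "Max heart rate")

# Bar plot for the binary predictor, shown as proportions within each sex
sex_plot <- ggplot(hd_data, aes(x = sex, fill = hd_label)) +
  geom_bar(position = "fill") +
  labs(x = "Sex", y = "Proportion", fill = "Heart disease")
```

Calling print(age_plot), print(thalach_plot), or print(sex_plot) renders the corresponding figure.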
Fitting Logistic Regression Model with all 3 variables
I fitted a logistic regression model here because there are three predictor variables and one binary outcome variable. The model helps us determine the effect that max heart rate (thalach), age, and sex have on the likelihood that an individual has heart disease.
model <- glm(data = hd_data, hd ~ age + sex + thalach, family = "binomial")
# extract the model summary
summary(model)
Here is the output after fitting the model:
Call:
glm(formula = hd ~ age + sex + thalach, family = "binomial",
    data = hd_data)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.2250  -0.8486  -0.4570   0.9043   2.1156
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.111610   1.607466   1.936   0.0529 .
age          0.031886   0.016440   1.940   0.0524 .
sexMale      1.491902   0.307193   4.857 1.19e-06 ***
thalach     -0.040541   0.007073  -5.732 9.93e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
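The estimates in this output are on the log-odds scale, so one common way to read them is to exponentiate them into odds ratios. The sketch below does that for the three slope coefficients; the numbers are copied by hand from the summary above, and the vector names are my own.

```r
# Slope estimates copied from the model summary (log-odds scale)
coefs <- c(age = 0.031886, sexMale = 1.491902, thalach = -0.040541)

# exp() converts a log-odds coefficient into an odds ratio
odds_ratios <- exp(coefs)

print(round(odds_ratios, 3))
```

Read this way, the sexMale estimate corresponds to roughly 4.4 times the odds of heart disease for males compared with females at the same age and max heart rate, while each additional beat per minute of thalach lowers the odds by about 4%.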
This logistic regression model can be used to predict the probability that a person has heart disease given their age, sex, and max heart rate. We can also translate the predicted probability into a decision rule for clinical use by defining a cutoff value on the probability scale. For instance, if a 45-year-old female patient with a max heart rate of 150 walks in, we can find the predicted probability of heart disease by creating a new data frame called newdata.
The model gives a heart disease probability of 0.177 for a 45-year-old female with a max heart rate of 150, which indicates a low risk of heart disease.
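The 0.177 figure can be checked by hand from the reported coefficients. The sketch below rebuilds the linear predictor for this patient and applies the inverse logit with plogis(). With the fitted model object available, the equivalent call would presumably be predict(model, newdata, type = "response"). The 0.5 cutoff is an arbitrary assumption used only to illustrate a decision rule.

```r
# Coefficients copied from the model summary reported above
coefs <- c(intercept = 3.111610, age = 0.031886,
           sexMale = 1.491902, thalach = -0.040541)

# The patient from the text: 45-year-old female, max heart rate 150
newdata <- data.frame(age = 45, sex = "Female", thalach = 150)

# Linear predictor on the log-odds scale; the sexMale term is 0 for females
log_odds <- coefs[["intercept"]] +
  coefs[["age"]] * newdata$age +
  coefs[["sexMale"]] * as.numeric(newdata$sex == "Male") +
  coefs[["thalach"]] * newdata$thalach

# Inverse logit converts log-odds into a probability
p_hd <- plogis(log_odds)
print(round(p_hd, 3))  # 0.177

# Hypothetical decision rule with an assumed cutoff of 0.5
cutoff <- 0.5
decision <- ifelse(p_hd >= cutoff, "high risk", "low risk")
print(decision)
```

In practice the cutoff would be chosen by weighing the cost of false positives against false negatives, rather than fixed at 0.5.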
Evaluating model performance
While these predictive models can be used to predict the probability of an event occurring, it is vital to check the accuracy of any model before relying on its predicted values. Some of the core metrics that can be used to evaluate this model are described below:
Accuracy: One of the most straightforward metrics; it tells us the proportion of all predictions that are correct.
Classification Error Rate: This can be calculated as 1 - Accuracy.
Area under the ROC curve (AUC): One of the most sought-after evaluation metrics. It is popular because it does not depend on the proportion of responders (positive cases). It ranges from 0 to 1; the closer it gets to 1, the better the model performance.
Confusion Matrix: An N*N matrix, where N is the number of outcome levels. It reports the number of false positives, false negatives, true positives, and true negatives.
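The four metrics above can be computed in base R as sketched here. The eight actual/predicted pairs are made up for illustration and are not the project's results, the 0.5 cutoff is an assumption, and the AUC uses the rank-based (Mann-Whitney) formula in place of a dedicated ROC package.

```r
# Made-up outcomes and predicted probabilities, for illustration only
actual    <- c(0, 0, 1, 1, 0, 1, 1, 0)
pred_prob <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.60, 0.55)

# Turn probabilities into class labels using an assumed cutoff of 0.5
pred_class <- ifelse(pred_prob >= 0.5, 1, 0)

# Confusion matrix: rows are predicted classes, columns are actual classes
confusion <- table(Predicted = pred_class, Actual = actual)

# Accuracy and classification error rate
accuracy   <- mean(pred_class == actual)
error_rate <- 1 - accuracy

# AUC via the rank formula: the probability that a randomly chosen positive
# case receives a higher predicted probability than a randomly chosen negative
n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
auc <- (sum(rank(pred_prob)[actual == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)

print(confusion)
print(c(accuracy = accuracy, error_rate = error_rate, auc = auc))
```

On this toy data the accuracy is 0.75 and the AUC is 0.875. Unlike accuracy, the AUC does not change when the cutoff is moved.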
Result
From the above output, we can see that the model has an overall accuracy of 0.71. The confusion matrix also shows that some cases were misclassified. We can improve the existing model by including other relevant predictors from the dataset.
You can find the entire code of this project on my GitHub.
Disclaimer: This project has been done solely for educational motives and to solidify my understanding of data mining techniques. It is not intended to be used for diagnosis of actual heart patients. Please consult your healthcare practitioner for professional advice.
Translated from: https://medium.com/swlh/predicting-heart-disease-using-regression-analysis-486401cd0a47