Logistic Regression for Statistical Analysis and Prediction in Python's Statsmodels

This article explains a statistical modeling technique through an example: logistic regression for a binary outcome variable, that is, an outcome that can take only two values, 0 or 1. We will also analyze the correlation among the predictor variables (the input variables used to predict the outcome), see how to extract useful information from the model results, use visualization techniques to present and understand the data, and predict the outcome. I assume you have basic knowledge of statistics and Python.

The Tools Used

For this tutorial, we will use:

  1. NumPy library
  2. pandas library
  3. Matplotlib library
  4. seaborn library
  5. statsmodels library
  6. Jupyter Notebook environment

Dataset

I used the Heart dataset from Kaggle, and I keep a copy in my GitHub repository. Feel free to download it from this link if you want to follow along:

Let’s import the necessary packages and the dataset.

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import numpy as np

# Load the Heart dataset (Heart.csv is read from the working directory)
df = pd.read_csv('Heart.csv')
df.head()

The last column, ‘AHD’, contains only ‘Yes’ or ‘No’, which tells you whether a person has heart disease. Replace ‘Yes’ and ‘No’ with 1 and 0.

df['AHD'] = df.AHD.replace({"No":0, "Yes": 1})

The logistic regression model provides the odds of an event.

A Basic Logistic Regression With One Variable

Let’s dive into the modeling. I will explain each step. I suggest running the code yourself as you read, to better absorb the material.

Logistic regression extends linear regression to binary outcomes: instead of modeling the outcome directly, it models the log-odds of the outcome as a linear function of the predictors. We will use a Generalized Linear Model (GLM) with a binomial family for this example.

There are many variables in the dataset. Which one should we start with? Heart disease is generally more common in the older population and less likely among younger people, so I am taking “Age” as the only covariate. We will add more covariates later.

model = sm.GLM.from_formula("AHD ~ Age", family = sm.families.Binomial(), data=df)
result = model.fit()
result.summary()

The result summary looks complex and intimidating, right? We will focus mostly on the coefficient table.

Hopefully, it looks a lot better now. As a reminder, here is the linear regression formula:

Y = AX + B

Here Y is the output and X is the input, A is the slope and B is the intercept.

If you need a refresher on confidence interval and hypothesis testing, please check out these articles:

Now, let’s understand the terms above. First, we have the coefficients: -3.0059 is B (the intercept) and 0.0520 is A (the slope for ‘Age’). Because the model is fitted on the log-odds scale, each additional year of age increases the log-odds of having heart disease by 0.052, not the probability itself. The p-value in the table tells us whether this association is statistically significant.
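To make the coefficients concrete, here is a minimal sketch that plugs the intercept and slope reported above into the model equation, log-odds = B + A * Age, and converts the result to a probability. The helper name `fitted_probability` and the example ages of 50 and 60 are illustrative choices, not part of the original analysis.

```python
import math

# Coefficients reported in the model summary above
INTERCEPT = -3.0059  # B
AGE_SLOPE = 0.0520   # A

def fitted_probability(age):
    """Fitted probability of heart disease at a given age (hypothetical helper)."""
    log_odds = INTERCEPT + AGE_SLOPE * age
    return 1 / (1 + math.exp(-log_odds))

p50 = fitted_probability(50)
p60 = fitted_probability(60)
print(round(p50, 3), round(p60, 3))
```

Each extra year adds the same 0.0520 to the log-odds, but the resulting change in probability depends on the starting age, which is why the coefficient should not be read as a change in probability.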

The standard error of 0.014 measures the uncertainty in the estimated slope, that is, roughly how far the estimate is expected to fall from the true slope. The z-statistic of 3.803 means that the estimated slope is 3.803 standard errors above zero. The last two columns give the 95% confidence interval, which here runs from 0.025 to 0.079. Later we will visualize the confidence intervals across the range of the data.
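As a rough check of how these columns relate, the sketch below recomputes the z-statistic and the 95% confidence interval from the rounded coefficient and standard error quoted above. It assumes the usual normal-approximation interval (estimate plus or minus 1.96 standard errors); because the inputs are rounded, the z-statistic comes out close to, but not exactly, the 3.803 in the table.

```python
coef = 0.0520     # estimated slope for Age (rounded, from the summary)
std_err = 0.014   # its standard error (rounded, from the summary)

# z-statistic: how many standard errors the estimate lies from zero
z_stat = coef / std_err

# 95% confidence interval under the normal approximation
ci_low = coef - 1.96 * std_err
ci_high = coef + 1.96 * std_err

print(round(z_stat, 2), round(ci_low, 3), round(ci_high, 3))
```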

Odds And Log Odds

To understand odds and log-odds, we will use the gender variable, because a categorical variable is well suited to this. Check the proportion of males and females having heart disease in the dataset.

df["Sex1"] = df.Sex.replace({1: "Male", 0:"Female"})
c = pd.crosstab(df.Sex1, df.AHD)
c = c.apply(lambda x: x/x.sum(), axis=1)

A logistic regression model provides the ‘odds’ of an event. Remember that odds are the probability expressed on a different scale. Here is the formula:

If an event has a probability of p, the odds of that event are p/(1-p). Odds are a transformation of the probability. Based on this formula, if the probability is 1/2, the odds are 1.
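The formula is easy to try out directly. This small sketch defines a hypothetical helper, `odds`, and evaluates it for a few example probabilities chosen only for illustration.

```python
def odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1 - p)

print(odds(0.5))    # probability 1/2 gives odds of 1
print(odds(0.75))   # more likely than not: odds above 1
print(odds(0.25))   # less likely than not: odds below 1
```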

Let’s calculate the ‘odds’ of heart disease for males and females.

c["odds"] = c.loc[:, 1] / c.loc[:, 0]

The odds show that the probability of a female having heart disease is substantially lower than that of a male (32% vs 53%), and the odds reflect this well. Odds ratios are commonly used when comparing two population groups.

c.odds.Male / c.odds.Female

The ratio comes out to be 3.587, which indicates that the odds of a man having heart disease are 3.587 times the odds for a woman.

Remember that an individual probability cannot be calculated from an odds ratio.

Another important convention is to work with log-odds, which are odds on a logarithmic scale. Recall that the neutral point of the probability is 0.5. Using the formula for odds, the odds for a probability of 0.5 are 1 and the log-odds are 0 (the log of 1 is 0).
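One way to see why log-odds are convenient is that they are symmetric around the neutral point. The sketch below uses a hypothetical helper, `log_odds`, with example probabilities picked for illustration: a probability and its complement give log-odds of equal size and opposite sign.

```python
import math

def log_odds(p):
    """Natural log of the odds for probability p (requires 0 < p < 1)."""
    return math.log(p / (1 - p))

for p in (0.25, 0.5, 0.75):
    print(p, round(log_odds(p), 4))
```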

In our example, a group in which more than half the people have heart disease, as with the men here, has odds between 1 and infinity. A group in which fewer than half have heart disease, as with the women here, has odds between 0 and 1.

Here is the log odds calculation:

c['logodds'] = np.log(c.odds)

Here, the log-odds for the female population are negative, which indicates that fewer than 50% of the females have heart disease. The log-odds for males are positive and a little more than 0, which means that more than half of the males have heart disease.

Let’s see the model summary using the gender variable only:

model = sm.GLM.from_formula("AHD ~ Sex1", family = sm.families.Binomial(), data=df)
result = model.fit()
result.summary()

This result should give a better understanding of the relationship between logistic regression and log-odds. Look at the coefficients above. The logistic regression coefficient for males is 1.2722, which should be the same as the log-odds of males minus the log-odds of females.

c.logodds.Male - c.logodds.Female

This difference is exactly 1.2722, the same as the coefficient.
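The link between a difference in log-odds and an odds ratio can be sketched with a small two-group table. The counts below are made up for illustration and are not taken from the Heart dataset; the point is only that exponentiating the difference in log-odds recovers the odds ratio, which is what a logistic regression coefficient for a single binary covariate represents.

```python
import math

# Hypothetical counts (not from the Heart dataset): (with event, without event)
group_a = (30, 70)
group_b = (60, 40)

odds_a = group_a[0] / group_a[1]
odds_b = group_b[0] / group_b[1]

odds_ratio = odds_b / odds_a
log_odds_diff = math.log(odds_b) - math.log(odds_a)

# Exponentiating the log-odds difference gives back the odds ratio
print(round(odds_ratio, 4), round(math.exp(log_odds_diff), 4))
```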

Adding More Covariates

We can use multiple covariates. I am using both the ‘Age’ and ‘Sex1’ variables here. Before we dive into the model, we can conduct an initial analysis with the categorical variable. Check the proportion of males and females having heart disease in the dataset.

df["Sex1"] = df.Sex.replace({1: "Male", 0:"Female"})
c = pd.crosstab(df.Sex1, df.AHD)
c = c.apply(lambda x: x/x.sum(), axis=1)

Now, generate a model using both the ‘Age’ and ‘Sex1’ variables.

model = sm.GLM.from_formula("AHD ~ Age + Sex1", family = sm.families.Binomial(), data=df)
result = model.fit()
result.summary()

Let’s look at the coefficients more closely. Adding gender to the model changed the coefficient of the ‘Age’ parameter a little (0.0520 to 0.0657). According to this fitted model, older people are more likely to have heart disease than younger people: the log-odds of heart disease increase by 0.0657 units for each year of age.

If a person is 10 years older, the log-odds of his or her having heart disease increase by 0.0657 * 10 = 0.657 units.

In the case of the gender variable, female is the reference level, as it does not appear in the output. Comparing a male and a female of the same age, the log-odds of heart disease are 1.4989 units higher for the male.

Now, let’s see the effect of both gender and age. If a 40-year-old female is compared to a 50-year-old male, the log-odds of the male having heart disease are 1.4989 + 0.0657 * 10 = 2.1559 units greater than those of the female.

All the coefficients are on the log-odds scale. You can exponentiate the values to convert them to the odds scale.
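As a quick illustration of that conversion, the sketch below exponentiates the ‘Age’ and ‘Sex1’ coefficients quoted above. The variable names are illustrative; the results are odds ratios, meaning the factor by which the odds are multiplied for a one-unit change in the covariate.

```python
import math

age_coef = 0.0657   # log-odds change per year of age (from the summary)
sex_coef = 1.4989   # log-odds difference, male vs. female at the same age

age_odds_ratio = math.exp(age_coef)            # odds multiplier per extra year
sex_odds_ratio = math.exp(sex_coef)            # odds multiplier, male vs. female
ten_year_odds_ratio = math.exp(10 * age_coef)  # odds multiplier per extra decade

print(round(age_odds_ratio, 3), round(sex_odds_ratio, 3), round(ten_year_odds_ratio, 3))
```

Under this fitted model, each extra year multiplies the odds of heart disease by roughly 1.07, and being male multiplies them by roughly 4.5 at a given age.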

A Logistic Regression Model With Three Covariates

Now, we will fit a logistic regression with three covariates. This time we will add the ‘Chol’ (cholesterol) variable alongside ‘Age’ and ‘Sex1’.

model = sm.GLM.from_formula("AHD ~ Age + Sex1 + Chol", family = sm.families.Binomial(), data=df)
result = model.fit()
result.summary()

As you can see, after adding the ‘Chol’ variable, the coefficient of the ‘Age’ variable went down a little and the coefficient of the ‘Sex1’ variable went up a little. The change is larger for the ‘Sex1’ coefficient than for the ‘Age’ coefficient. This is because ‘Chol’ is more strongly correlated with the ‘Sex1’ covariate than with the ‘Age’ covariate. Let’s check the correlations:

df[['Age', 'Sex', 'Chol']].corr()

Visualization of the Fitted Model

We will begin by plotting the fitted proportion of the population that has heart disease for different subpopulations defined by the regression model. We will plot how the heart disease rate varies with age, holding fixed the other values we want to focus on. Here we visualize the effect of ‘Age’ for the female population with a cholesterol level of 250.

from statsmodels.sandbox.predict_functional import predict_functional
values = {"Sex1": "Female", "Sex":0, "AHD": 1, "Chol": 250}
pr, cb, fv = predict_functional(result, "Age", values=values, ci_method="simultaneous")

# Plot the fitted log-odds against age, then shade the confidence band
ax = sns.lineplot(x=fv, y=pr, lw=4)
ax.fill_between(fv, cb[:, 0], cb[:, 1], color='grey', alpha=0.4)
ax.set_xlabel("Age")
ax.set_ylabel("Heart Disease")

We just plotted the fitted log-odds of having heart disease together with a 95% confidence band. Because we are looking at the whole curve rather than a single age, a simultaneous confidence band is more appropriate than pointwise intervals. The band looks curvy, which means that its width is not uniform throughout the age range.

We can also visualize the fit in terms of probability instead of log-odds. The probability can be calculated from the log-odds using the formula 1 / (1 + exp(-lo)), where lo is the log-odds.
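The conversion formula can be checked on its own before applying it to the fitted values. This is a minimal sketch with hypothetical helper names (`inv_logit` and `logit`): a log-odds of 0 maps to a probability of 0.5, and converting back and forth returns the starting value.

```python
import math

def inv_logit(lo):
    """Convert log-odds to a probability using 1 / (1 + exp(-lo))."""
    return 1 / (1 + math.exp(-lo))

def logit(p):
    """Convert a probability back to log-odds."""
    return math.log(p / (1 - p))

print(inv_logit(0.0))
print(round(inv_logit(2.0), 4), round(inv_logit(-2.0), 4))
print(round(logit(inv_logit(1.3)), 4))
```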

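# Convert the fitted log-odds and both band limits to probabilities
pr1 = 1 / (1 + np.exp(-pr))
cb1 = 1 / (1 + np.exp(-cb))
ax = sns.lineplot(x=fv, y=pr1, lw=4)
# Use the transformed upper limit cb1[:, 1], not the log-odds limit cb[:, 1]
ax.fill_between(fv, cb1[:, 0], cb1[:, 1], color='grey', alpha=0.4)
ax.set_xlabel("Age", size=15)
ax.set_ylabel("Heart Disease")

There is one thing to watch for on the probability scale. Probabilities are limited to the range 0 to 1, so both limits of the confidence band must be transformed as well; a limit left on the log-odds scale is not bounded and can push the band outside that range.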

The plots above show the average: the fitted probability of a female with a cholesterol level of 250 having heart disease, as a function of age. Next, we will visualize the data in a different way, with what is called a partial residual plot. This plot shows the effect of one covariate while the other covariates are held fixed. It reveals even small discrepancies, so the plot will not be as smooth as before. Remember, small discrepancies are not reliable if the sample size is not very large.

from statsmodels.graphics.regressionplots import add_lowess
fig = result.plot_partial_residuals("Age")
ax = fig.get_axes()[0]
ax.lines[0].set_alpha(0.5)
_ = add_lowess(ax)

This plot shows that the heart disease rate rises rapidly from the age of 53 to 60.

Prediction

Using the results from the model, we can predict whether a person has heart disease. The models we fitted before were meant to explain the model parameters. For prediction, I will use all the variables in the DataFrame, because we do not have too many of them. Let’s check the correlations among the variables.

# Encode the categorical text columns as integers
df['ChestPain'] = df.ChestPain.replace({"typical":1, "asymptomatic": 2, 'nonanginal': 3, 'nontypical':4})
df['Thal'] = df.Thal.replace({'fixed': 1, 'normal': 2, 'reversable': 3})
# Use the numeric 'Sex' column here: 'Sex1' holds strings, which corr() cannot use
df[['Age', 'Sex', 'Chol','RestBP', 'Fbs', 'RestECG', 'Slope', 'Oldpeak', 'Ca', 'ExAng', 'ChestPain', 'Thal']].corr()

We can see that each variable has some correlation with the other variables. I will use all the variables to get a better prediction.

model = sm.GLM.from_formula("AHD ~ Age + Sex1 + Chol + RestBP+ Fbs + RestECG + Slope + Oldpeak + Ca + ExAng + ChestPain + Thal", family = sm.families.Binomial(), data=df)
result = model.fit()
result.summary()

We can use the predict function to predict the outcome. The predict function expects a DataFrame containing the model’s variables, so let’s prepare one and then call predict.

X = df[['Age', 'Sex1', 'Chol','RestBP', 'Fbs', 'RestECG', 'Slope', 'Oldpeak', 'Ca', 'ExAng', 'ChestPain', 'Thal']]
predicted_output = result.predict(X)

The predict function returns probabilities, but the predicted output should be either 0 or 1. We set it to 1 when the probability is greater than or equal to 0.5 and to 0 otherwise.

# Threshold the predicted probabilities at 0.5 to get class labels (1 = heart disease)
predicted_output = (predicted_output >= 0.5).astype(int)

Now, compare predicted_output to the ‘AHD’ column of the DataFrame, which records the actual heart disease status, to find the accuracy:

# Accuracy is the fraction of rows where the prediction matches the actual label
accuracy = (df['AHD'] == predicted_output).mean()
accuracy

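The accuracy comes out to be 0.81, or 81%, which is quite good. Keep in mind that this is measured on the same data the model was fitted to, so accuracy on new data may be lower.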
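The thresholding and accuracy steps can be sketched without the dataset at all. The probabilities and labels below are made up purely for illustration; they only show how a 0.5 cutoff turns probabilities into class labels and how accuracy is the share of labels that match.

```python
# Made-up predicted probabilities and true labels, for illustration only
probabilities = [0.10, 0.45, 0.50, 0.80, 0.95, 0.30]
actual = [0, 1, 1, 1, 1, 0]

# Label as 1 when the probability is at least 0.5, otherwise 0
predicted = [1 if p >= 0.5 else 0 for p in probabilities]

# Accuracy: fraction of predictions that match the true labels
correct = sum(int(a == b) for a, b in zip(actual, predicted))
accuracy = correct / len(actual)

print(predicted, round(accuracy, 3))
```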

Conclusion

In this article, I tried to explain statistical model fitting, how to interpret the results from the fitted model, some visualization techniques to present the log-odds with a confidence band, and how to predict a binary variable using the fitted model. I hope this was helpful.

Translated from: https://medium.com/datadriveninvestor/statistical-modeling-analysis-and-prediction-in-pythons-statsmodels-logistic-regression-3136f20eea4
