How I Used Regression Analysis to Analyze Life Expectancy with Scikit-Learn and Statsmodels

In this article, I will use some data related to life expectancy to evaluate the following models: Linear, Ridge, LASSO, and Polynomial Regression. So let's jump right in.

I was exploring the dengue trend in Singapore where there has been a recent spike in dengue cases – especially in the Dengue Red Zone where I am living. However, the raw data was not available on the NEA website.

I was wondering, has dengue affected the life expectancy of people in any country in particular? Do people in rich nations live longer? What are the factors affecting life expectancy of a country?

So I explored life expectancy and looked for data on the following aspects (features):

  • Birth Rate
  • Cancer Rate
  • Dengue Cases
  • Environmental Performance Index (EPI)
  • Gross Domestic Product (GDP)
  • Health Expenditure
  • Heart Disease Rate
  • Population
  • Area
  • Population Density
  • Stroke Rate

The target is Life Expectancy, measured in number of years.

The assumptions are:

  1. These are country-level averages
  2. There is no distinction between male and female

The Python code is available on my GitHub.

Data Science Process

I have used the following data science process in my analysis:

  • data collection, data cleaning, Exploratory Data Analysis
  • feature selection, feature engineering
  • model selection, model tuning and hyperparameter tuning
  • model optimization based on selected performance metric

Tools used for this analysis include:

  • Python libraries, particularly NumPy and Pandas for manipulating data structures
  • Matplotlib and Seaborn for visualization
  • Scikit-Learn and Statsmodels for regression analysis

Exploratory Data Analysis

First I check for multi-collinearity between features.

# Set the figure size, then plot the pairwise correlations between all columns
sns.set(rc={'figure.figsize': (10, 7)})
sns.heatmap(df.corr(), cmap="seismic", annot=True, vmin=-1, vmax=1)

There seems to be some strong collinearity, denoted by boxes in dark red and dark blue as you can see in the image below.

For example, countries that spend more on healthcare have a higher EPI score. When health expenditure is higher, the stroke rate is lower. And countries with a larger area tend to have a larger population.
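
The heatmap can also be read programmatically. The sketch below lists feature pairs whose absolute correlation exceeds a cutoff. The helper name `correlated_pairs`, the 0.7 threshold, and the synthetic data are all assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.7):
    """Return (feature_a, feature_b, r) for pairs with |Pearson r| above threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 2)))
    return pairs

# Synthetic stand-in data: Population is built to follow Area closely
rng = np.random.default_rng(0)
area = rng.uniform(1, 100, size=200)
demo = pd.DataFrame({
    "Area": area,
    "Population": 3 * area + rng.normal(0, 5, size=200),
    "Birth Rate": rng.uniform(5, 40, size=200),
})
pairs = correlated_pairs(demo)
print(pairs)
```

Pairs flagged this way are candidates for dropping or combining before fitting a linear model.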

How about the correlation between features and target? To live a long life, you should have a low stroke rate, high health expenditure, take good care of the environment, and have fewer babies (according to the correlation chart).

Let’s look at the initial pair plot.

sns.pairplot(df, height=1.5, aspect=1.5)

There seems to be a need to remove outliers in many features, for example, Dengue Cases, GDP, Population, Area, and Population Density.

Each outlier is replaced by the next highest value in the column. After removing the outliers, the plots are still skewed to the right (points are very concentrated on the left side). So this suggests that some transformation might be needed.
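
As a rough sketch of this replacement step: the article does not say how outliers were identified, so the snippet assumes the common 1.5 × IQR rule. The function name `cap_high_outliers` and the sample values are hypothetical.

```python
import pandas as pd

def cap_high_outliers(s, k=1.5):
    """Replace values above Q3 + k*IQR with the largest value that is not an outlier."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    upper = q3 + k * (q3 - q1)           # assumed outlier fence (IQR rule)
    next_highest = s[s <= upper].max()   # the next highest value in the column
    return s.where(s <= upper, next_highest)

# Made-up column with one extreme value
gdp = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 250.0])
capped = cap_high_outliers(gdp)
print(capped.tolist())
```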

Another way to handle outliers is to apply the LOG function, which compresses very large values and spreads out the data that is concentrated on the left.
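
To illustrate the effect, the sketch below measures skewness before and after a log transform on synthetic, right-skewed data. The data and the `skewness` helper are assumptions for illustration. Note that `np.log` needs strictly positive values; for columns that can contain zeros, `np.log1p` is a common alternative.

```python
import numpy as np

def skewness(x):
    """Sample skewness: mean cubed deviation divided by the cubed standard deviation."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

# Synthetic right-skewed feature standing in for something like Population
rng = np.random.default_rng(1)
population = rng.lognormal(mean=15, sigma=1.5, size=250)
log_population = np.log(population)

print("skew before:", skewness(population))
print("skew after: ", skewness(log_population))
```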

Feature Selection

To look for significant features, I dropped one feature at a time to see its impact on the simple regression model. Looking at the R² score, I chose these 3 features (Birth Rate, EPI, Stroke Rate), because the model is adversely affected without them.
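
The drop-one-feature check can be sketched as follows. The data is synthetic and the coefficients are made up; only the procedure (refit without each feature and compare validation R²) mirrors the description above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: two informative features and one pure-noise feature
rng = np.random.default_rng(2)
n = 250
names = ["Birth Rate", "EPI", "Noise"]
X = rng.normal(size=(n, 3))
y = 70 - 4 * X[:, 0] + 6 * X[:, 1] + rng.normal(scale=2, size=n)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

def val_r2(cols):
    """Fit on the chosen columns and return R² on the validation split."""
    model = LinearRegression().fit(X_train[:, cols], y_train)
    return model.score(X_val[:, cols], y_val)

baseline = val_r2([0, 1, 2])
drops = {}
for i, name in enumerate(names):
    kept = [j for j in range(len(names)) if j != i]
    drops[name] = baseline - val_r2(kept)
    print(f"without {name}: R2 drops by {drops[name]:.3f}")
```

A feature whose removal barely changes R² is a candidate for exclusion.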

Next, I removed outliers and reviewed the p-values in Statsmodels. This gained one more significant feature (Population Density). I chose 5% as the significance level, so a feature is considered significant when its p-value is less than 0.05.

After that, I applied LOG functions to all features, and gained 4 more significant features (GDP, Heart Disease Rate, Population, and Area).

I also tried other transformations (Reciprocal, Power 2, Square Root), but there was no further improvement.

Features can also be selected using the LassoCV estimator in Scikit-Learn.
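
One possible way to do this is sketched below: fit `LassoCV` on standardized features and keep those with non-zero coefficients. The data is synthetic and the `1e-6` cutoff is an assumption.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: three informative features and two noise features
rng = np.random.default_rng(4)
n = 250
names = np.array(["Birth Rate", "EPI", "Stroke Rate", "Noise A", "Noise B"])
X = rng.normal(size=(n, 5))
y = 70 - 4 * X[:, 0] + 6 * X[:, 1] - 3 * X[:, 2] + rng.normal(scale=2, size=n)

# Standardize so the L1 penalty treats all features on the same scale
X_scaled = StandardScaler().fit_transform(X)

# LassoCV picks the penalty strength alpha by cross-validation
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y)
selected = names[np.abs(lasso.coef_) > 1e-6].tolist()
print("alpha:", lasso.alpha_)
print("selected:", selected)
```

Noise features may still survive with small coefficients, so the result is best cross-checked against the p-values.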

Finally I looked at the pair plot again with all significant features. The scatter plots are now nicely spread out with some clear trends.

Model Selection

I am now ready to fit the following models on the training data set:

  • Linear Regression (a straight line that approximates the relationship between the independent variables, i.e. the features, and the dependent target variable)
  • Ridge Regression (reduces model complexity by shrinking coefficients with an L2 penalty while keeping all of them in the model)
  • LASSO Regression (Least Absolute Shrinkage and Selection Operator; reduces model complexity with an L1 penalty that can shrink some coefficients to exactly zero)
  • Degree 2 Polynomial Regression (a curve that approximates the relationship between the independent variables and the dependent target variable)
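
The four models above can be compared along these lines. This is a sketch on synthetic data; the penalty strengths (`alpha=1.0` for Ridge, `alpha=0.1` for LASSO) and the 80/20 split are assumptions, not the settings used in the original analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a linear relationship plus noise
rng = np.random.default_rng(5)
X = rng.normal(size=(250, 4))
y = 70 + X @ np.array([-4.0, 6.0, 0.5, -3.0]) + rng.normal(scale=2, size=250)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Linear": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "LASSO": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
    "Polynomial (degree 2)": make_pipeline(
        StandardScaler(), PolynomialFeatures(degree=2), LinearRegression()
    ),
}

# Fit each model on the training split and score it on the validation split
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_val, y_val)
    print(f"{name}: validation R2 = {scores[name]:.3f}")
```

Putting the scaler inside each pipeline means it is fitted on the training split only, which avoids leaking validation statistics into the model.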

I have also validated their performance on the validation data set. The simple linear regression model seems to have the potential to be the best performing model.

This is confirmed by Cross Validation using KFold (with 5 splits).

Finally, I checked the residuals against the model assumptions. The residuals should be normally distributed around a mean of zero with equal variance. The normal quantile-quantile (Q-Q) plot also looks acceptably normal.
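
These checks can be approximated numerically as well as visually. The sketch below uses synthetic data; `scipy.stats.probplot` computes the Q-Q points, and the split at the median fitted value is just one rough, assumed way to compare variances.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data with normally distributed noise
rng = np.random.default_rng(6)
X = rng.normal(size=(250, 3))
y = 70 + X @ np.array([-4.0, 6.0, -3.0]) + rng.normal(scale=2, size=250)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# Residuals of a least-squares fit with an intercept average to zero
print("mean residual:", residuals.mean())

# probplot returns the Q-Q points and a line fit; r close to 1 suggests normality
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print("Q-Q correlation:", r)

# Rough equal-variance check: residual spread for low vs high fitted values
median_fit = np.median(fitted)
low = residuals[fitted < median_fit].std()
high = residuals[fitted >= median_fit].std()
print("std ratio (high/low):", high / low)
```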

Since I only have 250 rows (the data is limited by the number of countries in the world), I used the entire data set to simulate the test data set (note: this is done for academic purposes only; it is not good practice, as it leads to data leakage). I used KFold Cross Validation with 10 splits to evaluate the model performance.

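from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# X and y are the full feature matrix and target prepared earlier
# n_splits=5 as in this snippet; set n_splits=10 for the 10-split evaluation described above
kf = KFold(n_splits=5, shuffle=True, random_state=1)
lm = LinearRegression()
# cross_val_score fits a fresh clone of lm on each fold, so no separate fit call is needed
cvs_lm = cross_val_score(lm, X, y, cv=kf, scoring='r2')
print(cvs_lm)
print(cvs_lm.mean())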

The R² values vary quite a bit across folds, from 0.49 to 0.82, but the average is around 0.69, which is quite satisfactory.

How do we interpret the model?

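import numpy as np
import pandas as pd
import statsmodels.api as sm

# Load the cleaned data and select the significant features
df = pd.read_csv('df3.csv')
X = df[['Birth Rate', 'EPI', 'GDP', 'Heart Disease Rate', 'Population', 'Area', 'Pop Density', 'Stroke Rate']].astype(float)
X = np.log(X)  # log-transform every feature
y = df["Life Expectancy"].astype(float)
X = sm.add_constant(X)  # add the intercept term

# Fit ordinary least squares and show the coefficient table
model = sm.OLS(y, X)
results = model.fit()
results.summary()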

If you're unaffected by the features, your life expectancy is 62 years (this is the intercept, the prediction when every log-transformed feature is zero). If your country has a low birth rate, add 5 more years to your life. If the EPI (Environmental Performance Index) is high, add 8 more years to your life. If you live in a rich country, add half a year to your life. Finally, for every unit (or rather LOG unit) decrease in stroke rate, 5 more years could be added to your life.
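
Because the features are log-transformed, a "LOG unit" is a multiplicative change: one unit lower in ln(x) means x is divided by e ≈ 2.718. The sketch below works through that arithmetic, assuming a stroke rate coefficient of about -5 as read from the paragraph above (it is not recomputed from the data). The helper `years_gained` is hypothetical.

```python
import math

def years_gained(coef, factor):
    """Change in predicted life expectancy when a feature is multiplied by `factor`,
    for a model of the form y = b0 + coef * ln(x)."""
    return coef * math.log(factor)

stroke_coef = -5.0  # approximate value taken from the text, not recomputed

# One log unit lower: stroke rate divided by e
print(years_gained(stroke_coef, 1 / math.e))

# A more intuitive change: stroke rate halved
print(years_gained(stroke_coef, 0.5))
```

Under these assumptions, halving the stroke rate corresponds to roughly 5 × ln 2 ≈ 3.5 additional years rather than the full 5.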

Next Steps

I could possibly collect more data by expanding the scope to cities instead of countries, and exploring other features (factors) affecting life expectancy. Also, I could split the data into male and female categories for such life expectancy regression analysis.

To conclude, here are some interesting insights:

  1. Japan has the highest life expectancy (83.7 years). Central African Republic (49.5 years) and many countries on the African continent are at the bottom of the scale. Singapore is ranked #5 (82.7 years).
  2. Take good care of the environment. EPI has the largest coefficient (impact) on a country's life expectancy.

The Python code for the above analysis is available on my GitHub – do feel free to refer to it.

https://github.com/JNYH/Project-Luther

Video presentation: https://youtu.be/gC2m_lvouu8

Thank you for reading.

Translated from: https://www.freecodecamp.org/news/regression-analysis-on-life-expectancy/
