Box-Cox Transformation with scipy.stats.boxcox

Why

Why use the Box-Cox transformation? The reasons are as follows:

  • In linear regression, the general linear model assumes $Y = X\beta + \epsilon,\ \epsilon \sim N(0, \delta^2 I)$, which implies:
    • Linearity: $E(Y)$ is a linear function of the variables in $X$
    • Independence: $\epsilon_1, \epsilon_2, \dots, \epsilon_n$ are mutually independent
    • Homogeneity of variance: $D(\epsilon_1) = \dots = D(\epsilon_n) = \delta^2$
    • Normality: $\epsilon_1, \epsilon_2, \dots, \epsilon_n$ follow a normal distribution
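  • A regression model fitted to Box-Cox-transformed data often outperforms the model fitted to the raw data; the transformation can improve properties such as explanatory power.
  • After a Box-Cox transformation, the residuals better satisfy assumptions such as normality and homogeneity of variance, which lowers the risk of spurious regression.
  • The Box-Cox family can usually transform data to approximate normality, but it fails for binary variables or ordinal variables with few levels. In those cases, consider a generalized linear model such as a logistic model, or an alternative transformation such as the Johnson transformation.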
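
As a rough illustration of the residual argument above, the following sketch fits a straight line to synthetic data before and after a Box-Cox transformation and compares the skewness of the residuals. The data-generating process, the seed, and the helper name `residual_skew` are assumptions made for this example; they are not part of the original analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): the response is linear in x
# only on the log scale, so residuals of a fit on the raw scale are right-skewed.
x = rng.uniform(0, 4, size=500)
y = np.exp(1.0 + 0.5 * x + rng.normal(0, 0.4, size=500))


def residual_skew(x, y):
    """Fit y = a*x + b by least squares and return the skewness of the residuals."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return stats.skew(residuals)


# With lmbda=None, stats.boxcox estimates lambda by maximum likelihood
y_bc, lam = stats.boxcox(y)

skew_before = residual_skew(x, y)
skew_after = residual_skew(x, y_bc)
print(f"lambda = {lam:.3f}")
print(f"residual skewness before: {skew_before:.3f}, after: {skew_after:.3f}")
```

Note that the transformation here is fitted to the marginal distribution of the response, which is how the example later in this post uses it as well.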

What

The Box-Cox transformation formula, where $c$ is a shift constant chosen so that $y + c > 0$:
$$
y^{(\lambda)} =
\begin{cases}
\dfrac{(y+c)^{\lambda}-1}{\lambda}, & \text{if } \lambda \neq 0 \\
\log(y+c), & \text{if } \lambda = 0
\end{cases}
$$
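
The piecewise formula can be written out directly and checked against SciPy. This is a minimal sketch: the function name `boxcox_manual` and the sample values are assumptions for illustration, while `scipy.special.boxcox` and `scipy.special.boxcox1p` are the existing SciPy routines for c = 0 and c = 1.

```python
import numpy as np
from scipy import special


def boxcox_manual(y, lam, c=0.0):
    """Box-Cox transform of y + c, following the piecewise formula."""
    shifted = np.asarray(y, dtype=float) + c
    if np.any(shifted <= 0):
        raise ValueError("y + c must be strictly positive")
    if lam == 0:
        return np.log(shifted)
    return (shifted ** lam - 1.0) / lam


y = np.array([0.5, 1.0, 2.0, 10.0])

for lam in (-0.5, 0.0, 0.5, 2.0):
    # c = 0 corresponds to scipy.special.boxcox, c = 1 to scipy.special.boxcox1p
    assert np.allclose(boxcox_manual(y, lam), special.boxcox(y, lam))
    assert np.allclose(boxcox_manual(y, lam, c=1.0), special.boxcox1p(y, lam))

# The lambda = 0 branch is the limit of the power branch as lambda -> 0,
# so a tiny lambda gives almost the same values as the logarithm.
print(boxcox_manual(y, 1e-8))
print(boxcox_manual(y, 0.0))
```

The log branch is what keeps the transformation continuous in λ, which is why λ can be estimated by a smooth one-dimensional optimization.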

How

Relevant SciPy functions:

scipy.special.boxcox1p(x, lmbda): Compute the Box-Cox transformation of 1 + x.
(This is the function that performs the transformation, with c = 1.)

Parameters

  • x: array_like
    Data to be transformed.

  • lmbda: array_like
    Power parameter of the Box-Cox transform.

Returns

  • y: array
    Transformed data.
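
A short usage sketch for `boxcox1p`. The sample array and the value of `lam` are assumptions for illustration. Because the function transforms 1 + x, it accepts zeros, which plain Box-Cox with c = 0 cannot handle.

```python
import numpy as np
from scipy.special import boxcox, boxcox1p

counts = np.array([0.0, 1.0, 3.0, 20.0])  # assumed sample data containing a zero
lam = 0.25                                # assumed power parameter

shifted = boxcox1p(counts, lam)

# boxcox1p(x, lam) matches boxcox(1 + x, lam)
assert np.allclose(shifted, boxcox(1.0 + counts, lam))

# x = 0 maps to 0, because ((1 + 0)**lam - 1) / lam == 0
print(shifted[0])

# lmbda is array_like as well, so it broadcasts against x
print(boxcox1p(3.0, np.array([0.0, 0.5, 1.0])))
```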

scipy.stats.boxcox_normmax: Compute optimal Box-Cox transform parameter for input data.
(The function that finds the optimal transformation parameter $\lambda$.)
Parameters

  • x: array_like
    Input array.

  • brack: 2-tuple, optional
    The starting interval for a downhill bracket search with optimize.brent. Note that this is in most cases not critical; the final result is allowed to be outside this bracket.

  • method: str, optional
    The method to determine the optimal transform parameter (boxcox lmbda parameter). Options are:

    1. ‘pearsonr’ (default)
      Maximizes the Pearson correlation coefficient between y = boxcox(x) and the expected values for y if x would be normally-distributed.
    2. ‘mle’
      Maximizes the log-likelihood boxcox_llf. This is the method used in boxcox.
    3. ‘all’
      Use all optimization methods available, and return all results. Useful to compare different methods.

Returns

  • maxlog: float or ndarray
    The optimal transform parameter found. An array instead of a scalar for method='all'.
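
The `method` options can be compared side by side. The sketch below uses a synthetic log-normal sample (an assumption for illustration) and relies on `method='all'` returning the estimates in the order pearsonr, mle.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=0.5, size=500)  # assumed positive, right-skewed sample

lam_pearsonr = stats.boxcox_normmax(x)             # default: method='pearsonr'
lam_mle = stats.boxcox_normmax(x, method='mle')
lam_all = stats.boxcox_normmax(x, method='all')    # array holding both estimates

print(lam_pearsonr, lam_mle)
print(lam_all)

# stats.boxcox with lmbda=None uses the MLE estimate
_, lam_boxcox = stats.boxcox(x)
print(np.isclose(lam_mle, lam_boxcox))
```

The two methods usually give similar but not identical values of λ, so results obtained with `boxcox_normmax` defaults are not directly comparable with those from `stats.boxcox`.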

scipy.stats.boxcox: Return a positive dataset transformed by a Box-Cox power transformation.
(This function also performs the Box-Cox transformation, but with c = 0, and it can find the optimal $\lambda$ automatically.)

Parameters

  • x: ndarray
    Input array. Should be 1-dimensional.

  • lmbda: {None, scalar}, optional
    If lmbda is not None, do the transformation for that value.

    If lmbda is None, find the lambda that maximizes the log-likelihood function and return it as the second output argument.

  • alpha: {None, float}, optional
    If alpha is not None, return the 100 * (1-alpha)% confidence interval for lmbda as the third output argument. Must be between 0.0 and 1.0.

Returns

  • boxcox: ndarray
    Box-Cox power transformed array.

  • maxlog: float, optional
    If the lmbda parameter is None, the second returned argument is the lambda that maximizes the log-likelihood function.

  • (min_ci, max_ci): tuple of float, optional
    If lmbda parameter is None and alpha is not None, this returned tuple of floats represents the minimum and maximum confidence limits given alpha.
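
The three return shapes described above can be seen in a small sketch. The gamma-distributed sample and the seed are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=3.0, size=300)  # assumed positive 1-D sample

# 1) lmbda=None (default): returns the transformed data and the fitted lambda
y, lam = stats.boxcox(x)

# 2) lmbda=None with alpha: additionally returns a confidence interval for lambda
y_ci, lam_ci, (ci_low, ci_high) = stats.boxcox(x, alpha=0.05)
print(f"lambda = {lam_ci:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")

# 3) fixed lmbda: returns only the transformed array; lmbda=0 is the log transform
y_log = stats.boxcox(x, lmbda=0)
print(np.allclose(y_log, np.log(x)))

# Non-positive input is rejected, because c = 0 in this function
try:
    stats.boxcox(np.array([-1.0, 2.0, 3.0]))
except ValueError as exc:
    print("ValueError:", exc)
```

If the confidence interval contains a convenient value such as 0 or 0.5, using that value (log or square root) is a common choice because it is easier to interpret.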

Example

Apply the Box-Cox transformation to the data from the Kaggle Housing Price competition.

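import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Kaggle training and test sets
trains = pd.read_csv('train.csv')
tests = pd.read_csv('test.csv')

# Plot the distribution of the target variable
# (sns.distplot is deprecated since seaborn 0.11; it is kept to match the figures in this post)
sns.distplot(trains['SalePrice'])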

[Figure 1: distribution of SalePrice]

from scipy import stats
from scipy.stats import norm, skew #for some statistics
# Inspect the skewness of SalePrice
fig=plt.figure(figsize=(15,5))
#pic1
plt.subplot(1,2,1)
sns.distplot(trains['SalePrice'],fit=norm)
(mu,sigma)=norm.fit(trains['SalePrice'])
plt.legend([r'$\mu=$ {:.2f} and $\sigma=$ {:.2f}'.format(mu,sigma)],loc='best')
plt.ylabel('Frequency')
#pic2
plt.subplot(1,2,2)
res=stats.probplot(trains['SalePrice'],plot=plt)
plt.suptitle('Before')
print(f"Skewness of saleprice: {trains['SalePrice'].skew()}")
print(f"Kurtosis of saleprice: {trains['SalePrice'].kurt()}")

[Figure 2: SalePrice before the transformation: histogram with fitted normal curve (left) and normal probability plot (right)]
Skewness of saleprice: 1.8828757597682129
Kurtosis of saleprice: 6.536281860064529
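
For reading these numbers: pandas reports bias-corrected sample skewness, and `kurt()` is excess kurtosis, so a normal distribution scores about 0 on both. The sketch below checks this against the SciPy equivalents on synthetic data. The log-normal sample is an assumed stand-in for SalePrice, not the Kaggle data.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(7)
# Assumed stand-in for house prices (not the Kaggle data)
s = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))
values = s.to_numpy()

# pandas skew() corresponds to scipy.stats.skew with bias=False
print(s.skew(), stats.skew(values, bias=False))

# pandas kurt() corresponds to excess (Fisher) kurtosis with bias=False
print(s.kurt(), stats.kurtosis(values, fisher=True, bias=False))
```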

# Apply the Box-Cox transformation; lambda is estimated automatically
trains['SalePrice'], lambda_ = stats.boxcox(trains['SalePrice'])
print(lambda_)

-0.0769239637442887 (the automatically computed $\lambda$)
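
The fitted λ is worth keeping, because values on the transformed scale (for example model predictions) have to be mapped back to prices with the same λ. A minimal sketch using `scipy.special.inv_boxcox`; the synthetic `prices` array is an assumed stand-in for SalePrice.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(3)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)  # assumed stand-in for SalePrice

transformed, lam = stats.boxcox(prices)

# Map values on the transformed scale back to the original scale with the same lambda
recovered = inv_boxcox(transformed, lam)
print(np.allclose(recovered, prices))
```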

# Inspect SalePrice again
fig=plt.figure(figsize=(15,5))
#pic1
plt.subplot(1,2,1)
sns.distplot(trains['SalePrice'],fit=norm)
(mu,sigma)=norm.fit(trains['SalePrice'])
plt.legend([r'$\mu=$ {:.2f} and $\sigma=$ {:.2f}'.format(mu,sigma)],loc='best')
plt.ylabel('Frequency')
#pic2
plt.subplot(1,2,2)
res=stats.probplot(trains['SalePrice'],plot=plt)
plt.suptitle('After')
print(f"Skewness of saleprice: {trains['SalePrice'].skew()}")
print(f"Kurtosis of saleprice: {trains['SalePrice'].kurt()}")

[Figure 3: SalePrice after the Box-Cox transformation: histogram with fitted normal curve (left) and normal probability plot (right)]
Skewness of saleprice: -0.00865297992364803
Kurtosis of saleprice: 0.8778702738892878
The transformed data satisfies the normality assumption much better, which is likely to help an ML model learn.
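
Besides skewness and kurtosis, the straightness of the probability plot can be summarized by the correlation coefficient that `stats.probplot` returns along with the fitted line. The sketch below reads that value before and after the transformation; the synthetic data is an assumed stand-in for SalePrice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
raw = rng.lognormal(mean=12.0, sigma=0.4, size=1000)  # assumed stand-in for SalePrice
transformed, _ = stats.boxcox(raw)

# Without plot=..., probplot only computes the quantile pairs and the
# least-squares line; r is the correlation coefficient of that fit.
(_, _), (_, _, r_before) = stats.probplot(raw)
(_, _), (_, _, r_after) = stats.probplot(transformed)

print(f"probability-plot r before: {r_before:.4f}, after: {r_after:.4f}")
```

A value of r closer to 1 means the points lie closer to the reference line, that is, the sample looks more normal.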


Another approach: using boxcox1p

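from scipy.stats import boxcox_normmax
from scipy.special import boxcox1p
# Reload the raw data first: SalePrice was overwritten by the previous transformation
trains = pd.read_csv('train.csv')
# Estimate lambda on the shifted data (default method: 'pearsonr')
lambda_2 = boxcox_normmax(trains['SalePrice'] + 1)
print(lambda_2)
trains['SalePrice'] = boxcox1p(trains['SalePrice'], lambda_2)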

-0.05453787726665998

fig=plt.figure(figsize=(15,5))
#pic1
plt.subplot(1,2,1)
sns.distplot(trains['SalePrice'],fit=norm)
(mu,sigma)=norm.fit(trains['SalePrice'])
plt.legend([r'$\mu=$ {:.2f} and $\sigma=$ {:.2f}'.format(mu,sigma)],loc='best')
plt.ylabel('Frequency')
#pic2
plt.subplot(1,2,2)
res=stats.probplot(trains['SalePrice'],plot=plt)
plt.suptitle('After')
print(f"Skewness of saleprice: {trains['SalePrice'].skew()}")
print(f"Kurtosis of saleprice: {trains['SalePrice'].kurt()}")

Skewness of saleprice: 0.02949933614370673
Kurtosis of saleprice: 0.850754366695321
[Figure 4: SalePrice after the boxcox1p transformation: histogram with fitted normal curve (left) and normal probability plot (right)]
Using boxcox1p() gives a slightly smaller kurtosis, but the skewness is not as close to zero as with boxcox().
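
The two approaches differ in the shift (c = 0 versus c = 1) and in how λ is estimated (`stats.boxcox` uses 'mle', while `boxcox_normmax` defaults to 'pearsonr'). The sketch below suggests that once the shift and the method are aligned, the two routes coincide. The synthetic `prices` array is an assumed stand-in for SalePrice.

```python
import numpy as np
from scipy import stats
from scipy.special import boxcox1p

rng = np.random.default_rng(11)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)  # assumed stand-in for SalePrice

# Route A: stats.boxcox on the shifted data (lambda fitted by MLE)
y_a, lam_a = stats.boxcox(prices + 1)

# Route B: boxcox_normmax with method='mle', then boxcox1p on the unshifted data
lam_b = stats.boxcox_normmax(prices + 1, method='mle')
y_b = boxcox1p(prices, lam_b)

print(np.isclose(lam_a, lam_b), np.allclose(y_a, y_b))

# The default method='pearsonr' generally gives a somewhat different lambda
lam_pearsonr = stats.boxcox_normmax(prices + 1)
print(lam_a, lam_pearsonr)
```

For values as large as house prices the shift of 1 has little effect, so differences like those reported above plausibly come mainly from the estimation method.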

