Why use the Box-Cox transformation? In short: many statistical procedures and ML models assume an approximately normally distributed target, and the Box-Cox transformation reshapes a skewed, strictly positive variable so that it is much closer to normal.
The Box-Cox transformation formula:
$$
y^{(\lambda)}=
\begin{cases}
\dfrac{(y+c)^{\lambda}-1}{\lambda}, & \text{if } \lambda \neq 0 \\
\log(y+c), & \text{if } \lambda = 0
\end{cases}
$$
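To make the piecewise definition concrete, here is a minimal sketch (the helper name boxcox_manual and the sample values are ours, purely for illustration) that implements the formula directly and checks it against scipy.special.boxcox1p, which corresponds to the c = 1 case:

import numpy as np
from scipy.special import boxcox1p

def boxcox_manual(y, lmbda, c=1.0):
    # Direct implementation of the formula above:
    # ((y + c)**lmbda - 1) / lmbda, or log(y + c) when lmbda == 0
    y = np.asarray(y, dtype=float)
    if lmbda == 0:
        return np.log(y + c)
    return ((y + c) ** lmbda - 1.0) / lmbda

y = np.array([0.5, 1.0, 2.0, 10.0])
lmbda = 0.3
# With c = 1 this matches scipy.special.boxcox1p, which transforms 1 + x
print(np.allclose(boxcox_manual(y, lmbda, c=1.0), boxcox1p(y, lmbda)))  # True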
Related scipy functions:
scipy.special.boxcox1p(x, lmbda)
: Compute the Box-Cox transformation of 1 + x.
(i.e., the function that actually performs the transformation, with c = 1)
Parameters
x: array_like
Data to be transformed.
lmbda: array_like
Power parameter of the Box-Cox transform.
Returns
y: array
Transformed data.
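A quick usage sketch (the input values are made up for illustration):

import numpy as np
from scipy.special import boxcox1p

x = np.array([0.0, 1.0, 10.0, 100.0])  # illustrative data; zeros are fine because 1 + x is transformed
lmbda = 0.25
print(boxcox1p(x, lmbda))
# lmbda = 0 reduces to log1p(x) = log(1 + x)
print(np.allclose(boxcox1p(x, 0.0), np.log1p(x)))  # True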
scipy.stats.boxcox_normmax
: Compute optimal Box-Cox transform parameter for input data.
(i.e., the function that finds the optimal transformation parameter $\lambda$)
Parameters
x: array_like
Input array.
brack: 2-tuple, optional
The starting interval for a downhill bracket search with optimize.brent. Note that this is in most cases not critical; the final result is allowed to be outside this bracket.
method: str, optional
The method to determine the optimal transform parameter (boxcox lmbda parameter). Options are:
'pearsonr' (default): Maximizes the Pearson correlation coefficient between y = boxcox(x) and the expected values for y if x would be normally-distributed.
'mle': Minimizes the log-likelihood boxcox_llf. This is the method used in boxcox.
'all': Use all optimization methods available, and return all results. Useful to compare different methods.
Returns
maxlog: float or ndarray
The optimal transform parameter found. An array instead of a scalar for method='all'.
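A brief sketch of calling boxcox_normmax on its own (the simulated sample is purely illustrative):

import numpy as np
from scipy.stats import boxcox_normmax

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.8, size=1000)  # a right-skewed, strictly positive sample
# Default method is 'pearsonr'; 'mle' matches the lambda chosen by scipy.stats.boxcox
print(boxcox_normmax(x))
print(boxcox_normmax(x, method='mle'))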
scipy.stats.boxcox
: Return a positive dataset transformed by a Box-Cox power transformation.
(this function also performs the Box-Cox transformation, but with c = 0, and it can automatically find the optimal $\lambda$)
Parameters
x: ndarray
Input array. Should be 1-dimensional.
lmbda: {None, scalar}, optional
If lmbda is not None, do the transformation for that value.
If lmbda is None, find the lambda that maximizes the log-likelihood function and return it as the second output argument.
alpha: {None, float}, optional
If alpha is not None, return the 100 * (1-alpha)% confidence interval for lmbda as the third output argument. Must be between 0.0 and 1.0.
Returns
boxcox: ndarray
Box-Cox power transformed array.
maxlog: float, optional
If the lmbda parameter is None, the second returned argument is the lambda that maximizes the log-likelihood function.
(min_ci, max_ci): tuple of float, optional
If lmbda parameter is None and alpha is not None, this returned tuple of floats represents the minimum and maximum confidence limits given alpha.
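A short sketch of calling scipy.stats.boxcox with and without an explicit lmbda (the simulated sample is illustrative):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=0.8, size=1000)  # must be 1-D and strictly positive
# lmbda=None: lambda is estimated by maximum likelihood; alpha adds a confidence interval
y, maxlog, (ci_low, ci_high) = stats.boxcox(x, lmbda=None, alpha=0.05)
print(maxlog, (ci_low, ci_high))
# With an explicit lmbda only the transformed array is returned; lmbda=0 is just log(x)
y_fixed = stats.boxcox(x, lmbda=0.0)
print(np.allclose(y_fixed, np.log(x)))  # True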
Now apply the Box-Cox transformation to the data from the Kaggle Housing Price competition.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

trains = pd.read_csv('train.csv')
tests = pd.read_csv('test.csv')
# Quick look at the distribution of the raw target
sns.distplot(trains['SalePrice'])
from scipy import stats
from scipy.stats import norm, skew #for some statistics
# Inspect the skewness of SalePrice
fig=plt.figure(figsize=(15,5))
#pic1
plt.subplot(1,2,1)
sns.distplot(trains['SalePrice'],fit=norm)
(mu,sigma)=norm.fit(trains['SalePrice'])
plt.legend(['$\mu=$ {:.2f} and $\sigma=$ {:.2f}'.format(mu,sigma)],loc='best')
plt.ylabel('Frequency')
#pic2
plt.subplot(1,2,2)
res=stats.probplot(trains['SalePrice'],plot=plt)
plt.suptitle('Before')
print(f"Skewness of saleprice: {trains['SalePrice'].skew()}")
print(f"Kurtosis of saleprice: {trains['SalePrice'].kurt()}")
Skewness of saleprice: 1.8828757597682129
Kurtosis of saleprice: 6.536281860064529
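As a side note, the skew function imported above from scipy.stats computes the same statistic; pandas' .skew() applies a bias correction, so to reproduce its value with scipy one would, as far as we can tell, pass bias=False:

from scipy.stats import skew

# pandas' .skew() uses the bias-corrected estimator; scipy matches it with bias=False
print(skew(trains['SalePrice'], bias=False))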
# Apply the Box-Cox transformation; with the default lmbda=None, scipy estimates lambda automatically
trains['SalePrice'], lambda_ = stats.boxcox(trains['SalePrice'])
print(lambda_)
-0.0769239637442887 (the automatically estimated $\lambda$)
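One practical note, not part of the original walkthrough: if the transformed SalePrice is used as the training target, predictions need to be mapped back to the original price scale. scipy.special.inv_boxcox inverts stats.boxcox for the same lambda; a minimal sketch:

from scipy.special import inv_boxcox

# inv_boxcox undoes stats.boxcox for the same lambda (the c = 0 case), so predictions
# made on the transformed scale can be converted back to actual prices
sale_price_back = inv_boxcox(trains['SalePrice'], lambda_)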
# Inspect SalePrice again
fig=plt.figure(figsize=(15,5))
#pic1
plt.subplot(1,2,1)
sns.distplot(trains['SalePrice'],fit=norm)
(mu,sigma)=norm.fit(trains['SalePrice'])
plt.legend(['$\mu=$ {:.2f} and $\sigma=$ {:.2f}'.format(mu,sigma)],loc='best')
plt.ylabel('Frequency')
#pic2
plt.subplot(1,2,2)
res=stats.probplot(trains['SalePrice'],plot=plt)
plt.suptitle('After')
print(f"Skewness of saleprice: {trains['SalePrice'].skew()}")
print(f"Kurtosis of saleprice: {trains['SalePrice'].kurt()}")
Skewness of saleprice: -0.00865297992364803
Kurtosis of saleprice: 0.8778702738892878
We can see that the transformed data satisfies the normality assumption much better, which is likely to help ML models learn more effectively.
An alternative approach: using boxcox1p
from scipy.stats import boxcox_normmax
from scipy.special import boxcox1p
# Reload the raw data so that boxcox1p is applied to the untransformed SalePrice
trains = pd.read_csv('train.csv')
# boxcox1p transforms 1 + x, so lambda is estimated on SalePrice + 1
lambda_2 = boxcox_normmax(trains['SalePrice'] + 1)
print(lambda_2)
trains['SalePrice'] = boxcox1p(trains['SalePrice'], lambda_2)
-0.05453787726665998
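As a sanity check on how the two APIs relate (this check is ours, with made-up values): boxcox1p(x, λ) is equivalent to applying boxcox to 1 + x with the same λ, which is why lambda_2 is estimated on SalePrice + 1:

import numpy as np
from scipy import stats
from scipy.special import boxcox1p

vals = np.array([0.0, 1.5, 7.0, 42.0])  # made-up values; vals + 1 must be strictly positive
lam = -0.05
# boxcox1p(x, lam) equals Box-Cox applied to 1 + x with the same lambda
print(np.allclose(boxcox1p(vals, lam), stats.boxcox(vals + 1.0, lmbda=lam)))  # True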
fig=plt.figure(figsize=(15,5))
#pic1
plt.subplot(1,2,1)
sns.distplot(trains['SalePrice'],fit=norm)
(mu,sigma)=norm.fit(trains['SalePrice'])
plt.legend(['$\mu=$ {:.2f} and $\sigma=$ {:.2f}'.format(mu,sigma)],loc='best')
plt.ylabel('Frequency')
#pic2
plt.subplot(1,2,2)
res=stats.probplot(trains['SalePrice'],plot=plt)
plt.suptitle('After (boxcox1p)')
print(f"Skewness of saleprice: {trains['SalePrice'].skew()}")
print(f"Kurtosis of saleprice: {trains['SalePrice'].kurt()}")
Skewness of saleprice: 0.02949933614370673
Kurtosis of saleprice: 0.850754366695321
We can see that boxcox1p() produces a slightly smaller kurtosis, but its skewness is not as small as the result from boxcox().