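We transform data because, apart from small samples where non-parametric methods are an option, most statistical theory and parametric tests are derived under the assumption of a normal distribution.
For the basics of the Box-Cox transform, see: BoxCox-变换方法及其实现运用.pptx
For maximum likelihood estimation, see: 极大似然估计思想的最简单解释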
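As a short reminder of what those two references cover, the one-parameter Box-Cox transform and the log-likelihood evaluated by scipy.stats.boxcox_llf can be written as below. The notation is my own choice for this note: x_i are the strictly positive observations, N is the sample size, and the bar denotes the mean of the transformed values.

```latex
y_i^{(\lambda)} =
\begin{cases}
  \dfrac{x_i^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
  \ln x_i, & \lambda = 0
\end{cases}
\qquad
\ell(\lambda) = (\lambda - 1) \sum_{i=1}^{N} \ln x_i
  \;-\; \frac{N}{2} \ln\!\left( \frac{1}{N} \sum_{i=1}^{N}
  \bigl( y_i^{(\lambda)} - \bar{y}^{(\lambda)} \bigr)^{2} \right)
```

The maximum likelihood estimate of lambda is the value that maximizes this log-likelihood.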
Building on the material above, here is the documentation of boxcox_normmax(x); for details see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox_normmax.html
scipy.stats.boxcox_normmax(x, brack=(-2.0, 2.0), method='pearsonr')
Compute optimal Box-Cox transform parameter for input data.
Parameters:
x : array_like
Input array.
brack : 2-tuple, optional
The starting interval for a downhill bracket search with optimize.brent. Note that this is in most cases not critical; the final result is allowed to be outside this bracket.
method : str, optional
The method to determine the optimal transform parameter (boxcox lmbda parameter). Options are:
‘pearsonr’ (default)
Maximizes the Pearson correlation coefficient between y = boxcox(x) and the expected values for y if x would be normally-distributed.
‘mle’
Maximizes the log-likelihood boxcox_llf. This is the method used in boxcox.
‘all’
Use all optimization methods available, and return all results. Useful to compare different methods.
Returns:
maxlog : float or ndarray
The optimal transform parameter found. An array instead of a scalar for method='all'.
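To make the three method options concrete, here is a minimal sketch on synthetic data. The log-normal sample, the seed, and the variable names are assumptions made for illustration; they are not part of the SciPy documentation quoted above.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive, right-skewed sample (illustrative assumption).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Estimate lambda with each method separately.
lam_pearson = stats.boxcox_normmax(x, method='pearsonr')
lam_mle = stats.boxcox_normmax(x, method='mle')

# method='all' returns an array holding the pearsonr and mle estimates.
lam_all = stats.boxcox_normmax(x, method='all')

# stats.boxcox with lmbda left unset also estimates lambda by maximum likelihood.
_, lam_boxcox = stats.boxcox(x)

print('pearsonr estimate:', lam_pearson)
print('mle estimate:     ', lam_mle)
print('all estimates:    ', lam_all)
print('stats.boxcox:     ', lam_boxcox)
```

Because the sample is log-normal, both estimates should land near 0, which corresponds to the log transform.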
Next, let's practice on the dataset from the Kaggle competition House Prices: Advanced Regression Techniques.
For the usage of scipy.stats.boxcox_llf, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox_llf.html
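Before the Kaggle data, here is a small self-contained sketch of the same idea: evaluate boxcox_llf on a grid of lambda values and compare the best grid point with the lambda that stats.boxcox finds with its own optimizer. The synthetic `prices` array is only a stand-in I made up so the snippet runs without the competition files.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a positive, right-skewed price column (assumption).
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

# Evaluate the Box-Cox log-likelihood on a grid of candidate lambdas.
lam_grid = np.linspace(-2, 5, 100)
llf = np.array([stats.boxcox_llf(lam, prices) for lam in lam_grid])
lam_from_grid = lam_grid[llf.argmax()]

# stats.boxcox with lmbda left unset maximizes the same log-likelihood.
_, lam_from_optimizer = stats.boxcox(prices)

step = lam_grid[1] - lam_grid[0]
print('grid estimate:     ', lam_from_grid)
print('optimizer estimate:', lam_from_optimizer)
print('grid step:         ', step)
```

The grid estimate can only be as precise as the grid spacing, so a finer grid or the optimizer is preferable when the exact value of lambda matters.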
import pandas as pd
import numpy as np
from scipy import stats,special
import matplotlib.pyplot as plt
train = pd.read_csv('./data/train.csv')
y = train['SalePrice']
print(y.shape)
lam_range = np.linspace(-2, 5, 100)  # 100 grid points (np.linspace defaults to num=50)
llf = np.zeros(lam_range.shape, dtype=float)
# lambda estimate: evaluate the Box-Cox log-likelihood at each grid point
for i, lam in enumerate(lam_range):
    llf[i] = stats.boxcox_llf(lam, y)  # y must be > 0
# find the index of the max log-likelihood (llf) and pick the corresponding lambda
lam_best = lam_range[llf.argmax()]
print('Suitable lam is: ',round(lam_best,2))
print('Max llf is: ', round(llf.max(),2))
plt.figure()
plt.axvline(round(lam_best,2),ls="--",color="r")
plt.plot(lam_range,llf)
plt.savefig('boxcox.jpg')  # save before show(), otherwise the saved image can come out blank
plt.show()
# boxcox convert:
print('before convert: ','\n', y.head())
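# Note: lam_best was estimated for y, while boxcox1p transforms 1 + y.
# For values as large as SalePrice the difference is small; stats.boxcox(y, lam_best) is the exact counterpart.
#y_boxcox = stats.boxcox(y, lam_best)
y_boxcox = special.boxcox1p(y, lam_best)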
print('after convert: ','\n', pd.DataFrame(y_boxcox).head())
# inverse boxcox convert:
y_invboxcox = special.inv_boxcox1p(y_boxcox, lam_best)
print('after inverse: ', '\n', pd.DataFrame(y_invboxcox).head())
The results are as follows:
In addition, lambda can also be determined with scipy.stats.boxcox_normplot; for details see http://scipy.github.io/devdocs/generated/scipy.stats.boxcox_normplot.html
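A minimal sketch of that approach, again on synthetic data (the sample and the variable names are assumptions for illustration): boxcox_normplot returns the lambda grid together with the probability plot correlation coefficient at each lambda, and the lambda with the highest coefficient is taken as the estimate. It is compared here with boxcox_normmax(method='pearsonr'), which optimizes the same kind of correlation.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive sample (illustrative assumption).
rng = np.random.default_rng(7)
x = rng.lognormal(mean=0.0, sigma=0.5, size=500)

# With plot left as None, nothing is drawn; only the values are returned.
lmbdas, ppcc = stats.boxcox_normplot(x, -2, 2, N=80)
lam_from_plot = lmbdas[np.argmax(ppcc)]

# Continuous optimization of the Pearson-correlation criterion.
lam_pearson = stats.boxcox_normmax(x, method='pearsonr')

step = lmbdas[1] - lmbdas[0]
print('normplot estimate:', lam_from_plot)
print('normmax estimate: ', lam_pearson)
print('grid step:        ', step)
```

Passing matplotlib.pyplot or an Axes object as the plot argument draws the correlation curve, which makes it easy to see how flat or sharp the optimum is.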