Box-Cox transform: a Python implementation

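Methods for estimating the lambda parameter of the boxcox1p transform:

Maximum likelihood estimation or Bayesian estimation (theory omitted)

  • Maximum likelihood estimation:
    Suppose the population has an unknown parameter theta that can take many values. Given the observed sample values, choose the theta that maximizes the probability of observing that sample; this value is called the maximum likelihood estimate of theta.
    Reference: "The simplest explanation of the idea behind maximum likelihood estimation"

    In other words, when only probabilities are available, maximum likelihood estimation disregards the low-probability explanations and treats the highest-probability one as the truth.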

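  • Python code:

# lam_range, llf, and y are defined in the full script further down
for i,lam in enumerate(lam_range):
    llf[i] = stats.boxcox_llf(lam, y)

# find the index of the maximum log-likelihood (llf) and pick that lambda
lam_best = lam_range[llf.argmax()]				# maximum likelihood estimate of lambda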
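
The fragment above depends on variables defined elsewhere. As a minimal self-contained sketch of the same grid search, the example below uses synthetic log-normal data (an assumption for illustration, not the article's dataset) and cross-checks the grid result against the optimizer-based estimate that stats.boxcox returns when no lambda is supplied:

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive data (illustrative only).
rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=0.5, size=500)

# Evaluate the Box-Cox log-likelihood on a grid of candidate lambdas.
lam_range = np.linspace(-2, 2, 401)          # grid step of 0.01
llf = np.array([stats.boxcox_llf(lam, y) for lam in lam_range])
lam_grid = lam_range[llf.argmax()]

# With lmbda=None, stats.boxcox returns the transformed data and the MLE lambda.
_, lam_mle = stats.boxcox(y)

print("grid-search lambda:", round(float(lam_grid), 3))
print("optimizer lambda:  ", round(float(lam_mle), 3))
```

The grid search can only resolve lambda to the grid spacing, so the two values should agree to roughly one grid step; a finer grid narrows the gap at the cost of more llf evaluations.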

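boxcox1p transform formula:

(The original formula image is not available; the standard shifted Box-Cox form, with c = 1 for boxcox1p, is:)

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{(y + c)^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln(y + c), & \lambda = 0
\end{cases}
\]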
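
As a quick numerical check of what scipy.special.boxcox1p computes, the sketch below compares it against the shifted Box-Cox expression with c = 1, written out by hand. The sample values and the choice lam = 0.3 are arbitrary assumptions:

```python
import numpy as np
from scipy import special

y = np.array([0.0, 0.5, 3.0, 100.0])   # boxcox1p only needs 1 + y > 0
lam = 0.3

# lambda != 0: ((1 + y)**lam - 1) / lam
manual = ((1.0 + y) ** lam - 1.0) / lam
print(np.allclose(special.boxcox1p(y, lam), manual))

# lambda == 0: the transform reduces to log(1 + y)
print(np.allclose(special.boxcox1p(y, 0.0), np.log1p(y)))

# boxcox1p(y, lam) is the plain Box-Cox transform applied to 1 + y
print(np.allclose(special.boxcox1p(y, lam), special.boxcox(1.0 + y, lam)))
```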

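  • Note: the +c in y+c is there to guarantee (y+c) > 0, because the Box-Cox transform requires strictly positive input.
  • Python code:
  • y_boxcox = special.boxcox1p(y, lam_best)    # uses the lambda selected via llf
    or:
  • boxcox_normmax(x) returns the optimized lambda:
    for i in highskew.index:
        # boxcox1p for columns of x with high skew;
        # boxcox_normmax needs positive input and boxcox1p transforms 1 + x,
        # so lambda is estimated on x[i] + 1
        x[i] = boxcox1p(x[i], boxcox_normmax(x[i] + 1))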
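
The loop above assumes that x is a DataFrame and that highskew is a Series of skewness values indexed by column name. A minimal self-contained sketch under those assumptions follows; the column names, the synthetic data, and the 0.5 skewness threshold are all hypothetical choices for illustration:

```python
import numpy as np
import pandas as pd
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

rng = np.random.default_rng(42)
x = pd.DataFrame({
    "area": rng.lognormal(mean=5.0, sigma=0.8, size=300),   # right-skewed
    "rooms": rng.normal(loc=6.0, scale=1.0, size=300),      # roughly symmetric
})

skew_before = x.skew()
highskew = skew_before[skew_before.abs() > 0.5]   # threshold is an assumption

for col in highskew.index:
    values = x[col].to_numpy()
    # Estimate lambda on 1 + values, because boxcox1p transforms 1 + values.
    lam = boxcox_normmax(values + 1)
    x[col] = boxcox1p(values, lam)

skew_after = x.skew()
print(pd.DataFrame({"before": skew_before, "after": skew_after}))
```

Only the columns selected by the threshold are transformed; the rest are left unchanged.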
    

Detailed syntax:

scipy.stats.boxcox_normmax(x, brack=(-2.0, 2.0), method='pearsonr')
Compute optimal Box-Cox transform parameter for input data.

Parameters:	
x : array_like 	Input array.
brack : 2-tuple, optional
	The starting interval for a downhill bracket search with optimize.brent. Note that this is in most cases not critical; the final result is allowed to be outside this bracket.
method : str, optional
	The method to determine the optimal transform parameter (boxcox lmbda parameter). Options are:
		‘pearsonr’ (default)
		Maximizes the Pearson correlation coefficient between y = boxcox(x) and the expected values for y if x would be normally-distributed.
		‘mle’
		Maximizes the log-likelihood boxcox_llf. This is the method used in boxcox.
		‘all’
		Use all optimization methods available, and return all results. Useful to compare different methods.

Returns:	
maxlog : float or ndarray
	The optimal transform parameter found. An array instead of a scalar for method='all'.

example:

# Generate some data and determine optimal lmbda in various ways:
>>> from scipy import stats
>>> x = stats.loggamma.rvs(5, size=30) + 5
>>> y, lmax_mle = stats.boxcox(x)
>>> lmax_pearsonr = stats.boxcox_normmax(x)
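
To compare the estimation methods side by side, the sketch below runs boxcox_normmax with each documented method on the same data. The sample size and the fixed random_state are assumptions added for reproducibility:

```python
import numpy as np
from scipy import stats

# Same kind of data as the doc example, with a fixed seed.
x = stats.loggamma.rvs(5, size=200, random_state=0) + 5

lmax_pearsonr = stats.boxcox_normmax(x)                  # default: 'pearsonr'
lmax_mle = stats.boxcox_normmax(x, method='mle')
lmax_all = stats.boxcox_normmax(x, method='all')         # array: [pearsonr, mle]
_, lmax_boxcox = stats.boxcox(x)                         # boxcox uses the 'mle' method

print("pearsonr:", lmax_pearsonr)
print("mle:     ", lmax_mle)
print("all:     ", lmax_all)
print("boxcox:  ", lmax_boxcox)
```

The 'pearsonr' and 'mle' estimates optimize different criteria, so they usually differ slightly; neither is wrong, and method='all' is a convenient way to see how sensitive the result is to that choice.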

------------------------------ divider ------------------------------

# -*- coding: utf-8 -*-
"""
Demonstrates the Box-Cox method: the transform itself, lambda estimation
via the log-likelihood function (llf), and the inverse transform.
"""

import pandas as pd
import numpy as np
from scipy import stats,special
import matplotlib.pyplot as plt

data = pd.read_csv('y_boxcox.csv',header=None)
y = data.iloc[:,1]
print(y.shape)

lam_range = np.linspace(-2,5,100)  # 100 grid points (np.linspace defaults to num=50)
llf = np.zeros(lam_range.shape, dtype=float)

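# lambda estimate:
# Note: boxcox1p below transforms 1 + y; for values this large, estimating
# lambda on y rather than 1 + y makes no practical difference.
for i,lam in enumerate(lam_range):
    llf[i] = stats.boxcox_llf(lam, y)		# y must be > 0

# find the index of the maximum log-likelihood (llf) and pick that lambda
lam_best = lam_range[llf.argmax()]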
print('Suitable lam is: ',round(lam_best,2))
print('Max llf is: ', round(llf.max(),2))

plt.figure()
plt.plot(lam_range,llf)
plt.savefig('boxcox.jpg')   # save before show(); otherwise the saved figure may be blank
plt.show()

# Box-Cox transform:
print('before convert: ','\n', y.head())
#y_boxcox = stats.boxcox(y, lam_best)
# pass a NumPy array so the result is an ndarray and .reshape below works
y_boxcox = special.boxcox1p(y.to_numpy(), lam_best)
print('after convert: ','\n',  pd.DataFrame(y_boxcox.reshape(-1,1)).head())

# inverse boxcox convert:
y_invboxcox = special.inv_boxcox1p(y_boxcox, lam_best)
print('after inverse: ', '\n', pd.DataFrame(y_invboxcox.reshape(-1,1)).head())

'''
output:
(1456,)
Suitable lam is:  -0.02
Max llf is:  -16154.7
before convert:  
 0   208500.00000
1   181500.00000
2   223500.00000
3   140000.00000
4   250000.00000
Name: 1, dtype: float64
after convert:  
          0
0 10.85009
1 10.74166
2 10.90430
3 10.53785
4 10.99156
after inverse:  
              0
0 208500.00000
1 181500.00000
2 223500.00000
3 140000.00000
4 250000.00000
'''
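
The script above needs the file y_boxcox.csv. As a minimal sketch that runs without it, the example below repeats the transform and inverse-transform round trip on synthetic price-like data (an assumption, not the article's dataset) and also reports the skewness before and after. Here lambda is estimated with boxcox_normmax on 1 + y instead of a manual grid, a swapped-in shortcut for illustration:

```python
import numpy as np
from scipy import special, stats

# Synthetic stand-in for the price-like column used in the script.
rng = np.random.default_rng(1)
y = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

# boxcox1p transforms 1 + y, so estimate lambda on 1 + y as well.
lam = stats.boxcox_normmax(y + 1, method='mle')

y_bc = special.boxcox1p(y, lam)            # forward transform
y_back = special.inv_boxcox1p(y_bc, lam)   # inverse transform

print("lambda:       ", round(float(lam), 3))
print("skew before:  ", round(float(stats.skew(y)), 3))
print("skew after:   ", round(float(stats.skew(y_bc)), 3))
print("round trip ok:", np.allclose(y, y_back))
```

The inverse must be called with the same lambda that was used for the forward transform, so it is worth storing lam alongside any model trained on the transformed values.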
 
