OLS regression in Python: rolling-window OLS regression estimation

For my evaluation, I have a dataset, available at this link (https://drive.google.com/drive/folders/0B2Iv8dfU4fTUMVFyYTEtWXlzYkk), in the following format. The third column (Y) is the true value, which is what I want to predict (estimate).

time      X   Y
0.000543  0   10
0.000575  0   10
0.041324  1   10
0.041331  2   10
0.041336  3   10
0.04134   4   10
...
9.987735  55  239
9.987739  56  239
9.987744  57  239
9.987749  58  239
9.987938  59  239

I want to run a rolling OLS regression with a window of, for example, 5 observations, and I tried it with the following script.

#!/usr/bin/python -tt
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('estimated_pred.csv')

# pd.stats.ols.MovingOLS only exists in pandas versions before 0.20.0
model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['X']],
                               window_type='rolling', window=5, intercept=True)
df['Y_hat'] = model.y_predict
print(df['Y_hat'])
print(model.summary)
df.plot.scatter(x='X', y='Y', s=0.1)

The summary of the regression analysis is shown below.

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <X> + <intercept>

Number of Observations:         5
Number of Degrees of Freedom:   2

R-squared:           -inf
Adj R-squared:       -inf

Rmse:              0.0000

F-stat (1, 3):        nan, p-value:        nan

Degrees of Freedom: model 1, resid 3

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             X     0.0000     0.0000       1.97     0.1429     0.0000     0.0000
     intercept   239.0000     0.0000 14567091934632472.00     0.0000   239.0000   239.0000
---------------------------------End of Summary---------------------------------

I want to make a backward-looking prediction of Y at t+1, i.e. predict the next value of Y from the previous values, written p(Y)t+1, and also report the mean squared error (MSE). For example, in row 5 the value of X is 2 and the value of Y is 10. If the predicted value p(Y)t+1 is 6, the squared error is (10-6)^2. How can I do this with either statsmodels or scikit-learn? pd.stats.ols.MovingOLS was removed in pandas version 0.20.0, and I can't find any reference for a replacement.

Solution

Here is an outline of rolling OLS with statsmodels that should work for your data. Simply use df=pd.read_csv('estimated_pred.csv') instead of my randomly generated df:

import pandas as pd
import numpy as np
import statsmodels.api as sm

# random data
# df = pd.DataFrame(np.random.normal(size=(500, 3)), columns=['time', 'X', 'Y'])
df = pd.read_csv('estimated_pred.csv')
df = df.dropna()  # drop rows with NaNs; comment this line out to keep them

window = 5
df['a'] = None   # constant
df['b1'] = None  # beta1 (coefficient on time)
df['b2'] = None  # beta2 (coefficient on X)

for i in range(window, len(df)):
    # fit on the `window` rows immediately before row i
    temp = df.iloc[i - window:i, :]
    # has_constant='add' keeps the intercept column even when a regressor is constant within the window
    exog = sm.add_constant(temp.loc[:, ['time', 'X']], has_constant='add')
    RollOLS = sm.OLS(temp.loc[:, 'Y'], exog).fit()
    df.iloc[i, df.columns.get_loc('a')] = RollOLS.params.iloc[0]
    df.iloc[i, df.columns.get_loc('b1')] = RollOLS.params.iloc[1]
    df.iloc[i, df.columns.get_loc('b2')] = RollOLS.params.iloc[2]

# The following line gives you predicted values in a row, given the PRIOR row's estimated parameters
df['predicted'] = df['a'].shift(1) + df['b1'].shift(1) * df['time'] + df['b2'].shift(1) * df['X']

I store the constant and betas, but there are several ways to approach prediction. You can use the fitted model object (mine is RollOLS) and its .predict() method, or multiply the coefficients out yourself, which is what I did in the final line. That is easier here because the number of variables is fixed and known, so simple column math does it all in one go.
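As a rough sketch of the squared-error calculation the question asks about, the example below fits each window, predicts the row right after it, and averages the squared errors. It makes several assumptions: synthetic data stands in for estimated_pred.csv, np.linalg.lstsq is swapped in for sm.OLS so the example needs only NumPy and pandas, and the function name rolling_one_step_errors and the columns predicted and sq_error are made up for illustration.

```python
import numpy as np
import pandas as pd


def rolling_one_step_errors(df, window=5, features=("time", "X"), target="Y"):
    """Fit least squares on each window of rows and predict the row right after it.

    Hypothetical helper for illustration; not part of pandas or statsmodels.
    """
    # Design matrix with an explicit intercept column of ones.
    X = np.column_stack([np.ones(len(df)), df[list(features)].to_numpy(dtype=float)])
    y = df[target].to_numpy(dtype=float)
    pred = np.full(len(df), np.nan)
    for i in range(window, len(df)):
        # Fit on rows i-window .. i-1; lstsq also copes with rank-deficient windows.
        beta, *_ = np.linalg.lstsq(X[i - window:i], y[i - window:i], rcond=None)
        pred[i] = X[i] @ beta  # one-step-ahead prediction for row i
    out = df.copy()
    out["predicted"] = pred
    out["sq_error"] = (out[target] - out["predicted"]) ** 2
    return out


# Synthetic stand-in for the real file (assumption).
rng = np.random.default_rng(42)
demo = pd.DataFrame({"time": np.linspace(0, 1, 50), "X": rng.normal(size=50)})
demo["Y"] = 1.0 + 2.0 * demo["time"] + 3.0 * demo["X"] + rng.normal(scale=0.05, size=50)

result = rolling_one_step_errors(demo, window=5)
mse = result["sq_error"].mean()  # NaN rows (the first `window` rows) are skipped
print(mse)
```

For a single row, the squared error is just the squared difference, so the question's example of a true value of 10 and a prediction of 6 gives (10 - 6) ** 2 = 16.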

To make predictions with sm as you go, though, it would look like this:

predict_x=np.random.normal(size=(20,2))

RollOLS.predict(sm.add_constant(predict_x))

Keep in mind, though, that if you run the above code in sequence, the predicted values use the model from the last window only. If you want to use a different model, you can save the models as you go or predict values within the for loop. Note that you can also get fitted values with RollOLS.fittedvalues, so if you are smoothing data, pull and save RollOLS.fittedvalues.iloc[-1] for each iteration in the loop.

To help you see how to use this with your own data, here is the tail of my df after the rolling regression loop has run:

          time          X          Y           a         b1         b2
495   0.662463   0.771971   0.643008  -0.0235751   0.037875  0.0907694
496  -0.127879   1.293141   0.404959  0.00314073  0.0441054   0.113387
497  -0.006581  -0.824247   0.226653   0.0105847  0.0439867   0.118228
498   1.870858   0.920964   0.571535   0.0123463  0.0428359    0.11598
499   0.724296   0.537296  -0.411965  0.00104044   0.055003   0.118953
