03 - linear regression

Pisa Tower Example

import pandas
import matplotlib.pyplot as plt

pisa = pandas.DataFrame({"year": range(1975, 1988), 
                         "lean": [2.9642, 2.9644, 2.9656, 2.9667, 2.9673, 2.9688, 2.9696, 
                                  2.9698, 2.9713, 2.9717, 2.9725, 2.9742, 2.9757]})

print(pisa)

plt.scatter(pisa["year"], pisa["lean"])
plt.show()

增加一列截距项

import statsmodels.api as sm

y = pisa.lean # target
X = pisa.year  # features
X = sm.add_constant(X)  # add a column of 1's as the constant term

# OLS -- Ordinary Least Squares Fit
linear = sm.OLS(y, X)
# fit model
linearfit = linear.fit()
print (linearfit.summary())

计算residual value

# Our predicted values of y
yhat = linearfit.predict(X)
print(yhat)

residuals = yhat - y
print (residuals)

#plot residuals and if it looks similar to a bell curve then we will accept the assumption of normality
plt.hist(residuals, bins = 5)
plt.show()

3个summary of squares

Sum of Square Error (SSE)  
Regression Sum of Squares (RSS) 
Total Sum of Squares (TSS)   # total variance

SSE = np.sum((y.values-yhat)**2)
RSS = np.sum((y.mean()-yhat)**2)
TSS = np.sum((y.values-yhat.mean())**2)

TSS=RSS+SSE

SSE is the sum of all residuals giving us a measure between the model's prediction and the observed values.
RSS measures the amount of explained variance which our model accounts for.
A large RSS and small SSE can be an indicator of a strong model.
TSS = RSS + SSE - the total amount of variance in the data is captured by the variance explained by the model plus the variance missed by the model.

R_squared = 1 - SSE / TSS = RSS / TSS
# what the percentage of variation in the target variable is explained by our model.

Show params for the regression model

print("\n",linearfit.params)

The variance of each of the coefficients is an important and powerful measure.

# Compute SSE
SSE = np.sum((y.values - yhat)**2)
# Compute variance in X
xvar = np.sum((pisa.year - pisa.year.mean())**2)
# Compute variance in b1 
s2b1 = SSE / ((y.shape[0] - 2) * xvar)

pdf and cdf

from scipy.stats import t

# 100 values between -3 and 3
x = np.linspace(-3,3,100)

# Compute the pdf with 3 degrees of freedom
tdist3 = t.pdf(x=x, df=3)
tdist30 = t.pdf(x=x, df=30)
            
plt.plot(x, tdist3)
plt.plot(x, tdist30)

计算系数的significance

The variable s2b1 is in memory.  The variance of beta_1
tstat = linearfit.params["year"] / np.sqrt(s2b1)

t检验

# At the 95% confidence interval for a two-sided t-test we must use a p-value of 0.975
pval = 0.975

# The degrees of freedom
df = pisa.shape[0] - 2

# The probability to test against
p = t.cdf(tstat, df=df)

beta1_test = p > pval

03 - linear regression

你可能感兴趣的:(03 - linear regression)