Python Regression Code: Simple Linear Regression, Part 1 -- Basics (implemented in Python)


Contents

1. Basic concepts

2. Visualizing SSE/SSR/SST

3. Two kinds of simple regression

4. The simple linear regression equation

5. The estimated regression equation

6. The least-squares regression line passes through the centroid

7. Predicted values

8. The error term

9. Slope formula

10. Intercept formula

11. Coefficient of determination R**2

12. Testing the linear relationship (F test)

13. Testing the correlation coefficient

14. Residuals

15. The Adjusted R2 Value

16. Standard error of the regression coefficient

17. Residual analysis

19. Reading the OLS output

20. Collinearity in simple regression

21. Bootstrap

A brief introduction to the correlation coefficient

The Pearson correlation coefficient measures whether two data sets lie on a single line; it quantifies the linear relationship between interval-scale variables, for example national income vs. household savings deposits, height vs. weight, or high-school grades vs. college-entrance-exam scores. When both variables are normally distributed continuous variables and the relationship between them is linear, the degree of correlation is expressed with the product-moment correlation coefficient, of which Pearson's simple correlation coefficient is the main one.

Its formula is:

r = sum((x_i - mean(x)) * (y_i - mean(y))) / sqrt( sum((x_i - mean(x))**2) * sum((y_i - mean(y))**2) )
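As a minimal illustration of this formula (the height/weight numbers below are made up purely for demonstration), scipy.stats.pearsonr returns the coefficient together with a two-sided p value:

import numpy as np
from scipy import stats

# hypothetical example data: two positively related variables
height = np.array([160, 165, 170, 175, 180, 185])
weight = np.array([52, 58, 63, 70, 74, 82])

r, p = stats.pearsonr(height, weight)   # correlation coefficient and two-sided p value
print('Pearson r = {:.4f}, p = {:.4f}'.format(r, p))

# the same value computed directly from the definition above
r_manual = np.sum((height - height.mean()) * (weight - weight.mean())) / \
           np.sqrt(np.sum((height - height.mean())**2) * np.sum((weight - weight.mean())**2))
print('Pearson r (by formula) = {:.4f}'.format(r_manual))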

Interpreting the value range

The larger the absolute value of the correlation coefficient, the stronger the correlation: the closer it is to 1 or -1, the stronger the relationship; the closer it is to 0, the weaker the relationship.

The strength of the relationship is usually judged by the following ranges of |r|:

0.8-1.0  very strong correlation

0.6-0.8  strong correlation

0.4-0.6  moderate correlation

0.2-0.4  weak correlation

0.0-0.2  very weak or no correlation

https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.linregress.html
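scipy.stats.linregress, linked above, returns the slope, intercept, correlation coefficient, p value and standard error of the slope in one call; a minimal sketch with made-up data:

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print('slope={:.3f}, intercept={:.3f}, r^2={:.3f}, p={:.4f}, se(slope)={:.3f}'
      .format(slope, intercept, r_value**2, p_value, std_err))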

1. Basic concepts

Dependent variable:

the variable being predicted, denoted y.

Independent variable:

one or more variables used to predict or explain y, denoted x.

Linear relationships:

Positive linear correlation: as one variable increases, the other also increases.

Negative linear correlation: as one variable increases, the other decreases.

Perfect linear correlation: all observed points fall exactly on a straight line (a functional relationship).

Nonlinear correlation: as one variable increases, the other may increase or decrease with no regular pattern.

2. Visualizing SSE/SSR/SST

3. Two kinds of simple regression

1. Only one variable involved.

2. One independent variable and one dependent variable.

4. The simple linear regression equation

y = B0 + B1*x + e

B0 is the intercept.

B1 is the slope.

e is the error term, a random variable. It captures the influence on y of random factors other than the linear relationship between x and y, i.e. the variation in y that cannot be explained by that linear relationship.

5. The estimated regression equation

The least-squares method gives the estimated regression equation y_hat = b0 + b1*x.

If we knew the population parameters (the true slope and intercept), we could use the simple linear regression equation directly.

In practice we cannot observe the population parameters, so we use sample statistics to obtain estimates of the slope and intercept.

For a horizontal line with slope 0, SSE = SST.

6. The least-squares regression line passes through the centroid

The least-squares regression line passes through the centroid (mean of variable 1, mean of variable 2).

Least squares:

the method that minimizes SSE is the least-squares method.

7. Predicted values

With only one variable: the prediction is the mean.

With two variables: a point estimate from the regression line.

Predictions can be made for the mean value of y or for an individual value of y.

8. The error term

e is the error term, a random variable. It captures the influence on y of random factors other than the linear relationship between x and y, i.e. the variation in y that cannot be explained by that linear relationship.

The error term should satisfy:

normality + equal variance (homoscedasticity) + independence.

In the regression model the error term is an independent random variable with expected value 0, constant variance, and a normal distribution. If these assumptions about the error term do not hold, the tests performed on the model are not trustworthy. One way to check the assumptions about the error term is residual analysis.

(Residual analysis plot)

9. Slope formula

b1 = sum((x_i - mean(x)) * (y_i - mean(y))) / sum((x_i - mean(x))**2) = Sxy / Sxx

10. Intercept formula

Because the best-fitting line passes through the centroid (mean of variable 1, mean of variable 2), the intercept follows from the slope and the centroid:

b0 = mean(y) - b1 * mean(x)
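A short sketch of these two formulas applied to the Tobacco/Alcohol sample that appears later in this post (plain numpy, no fitting library):

import numpy as np

x = np.array([4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51])
y = np.array([6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)  # slope = Sxy/Sxx
b0 = y.mean() - b1 * x.mean()                                             # intercept via the centroid
print('slope b1 = {:.4f}, intercept b0 = {:.4f}'.format(b1, b0))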

11. Coefficient of determination R**2

Correlation:

a variable is influenced by many factors, which makes the relationship between two variables uncertain. Such an uncertain relationship between variables is called correlation.

In other words, one variable does not determine the other; several variables jointly determine how the other one develops.

Correlation coefficient:

the strength of the relationship between variables.

Coefficient of determination R**2:

R**2 = SSR/SST = 1 - SSE/SST

The larger the share of SST taken up by SSR, the larger R**2.
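A minimal sketch of the decomposition SST = SSR + SSE and of R**2, computed by hand on the same Tobacco/Alcohol sample used later in this post:

import numpy as np

x = np.array([4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51])
y = np.array([6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SST = np.sum((y - y.mean())**2)      # total sum of squares
SSR = np.sum((y_hat - y.mean())**2)  # regression (explained) sum of squares
SSE = np.sum((y - y_hat)**2)         # error (residual) sum of squares

print('SST = {:.4f}, SSR + SSE = {:.4f}'.format(SST, SSR + SSE))
print('R^2 = SSR/SST = {:.4f}'.format(SSR / SST))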

12. Testing the linear relationship (F test)

Before building the model we assumed that x and y are linearly related, but whether that assumption holds has to be verified by a test.

The test of the linear relationship is the F test; it checks whether the linear relationship between the independent variable x and the dependent variable y is significant.

How do we judge whether the model fits? Use analysis of variance to compute the F statistic, F = (SSR/1) / (SSE/(n-2)) = MSR/MSE, and check whether its p value is below 0.05.

SSE has n-2 degrees of freedom (n observations minus the 2 estimated parameters).

SSR has 1 degree of freedom (2 variables, 2 - 1 = 1).

(Visualization of SST, SSR and SSE)
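The F statistic and its p value can be read directly from a statsmodels fit; a minimal sketch using the Tobacco/Alcohol sample that appears later in this post (the smf alias is just a naming choice):

import pandas as pd
import statsmodels.formula.api as smf

# same Tobacco/Alcohol sample used later in this post
data = pd.DataFrame({
    'Tobacco': [4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51],
    'Alcohol': [6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08]})

result = smf.ols('Alcohol ~ Tobacco', data).fit()
print('F = {:.4f}, p = {:.4f}'.format(result.fvalue, result.f_pvalue))
# SSR has 1 degree of freedom, SSE has n-2; if p < 0.05 the linear relation is significant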

13. Testing the correlation coefficient

Even if the coefficient of determination R**2 looks good, you still have to consider the sample size before generalizing the estimate to the population; if the sample is too small, or r is too small, the result does not generalize to the population.

In that case use a t test, t = (r*math.sqrt(n-2))/(math.sqrt(1-r**2)), with n-2 degrees of freedom.
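A minimal sketch of this t test, reusing the sample correlation r**2 = 0.615 (so r is about 0.784) and n = 10 from the example later in this post:

import math
from scipy import stats

r, n = 0.784, 10                           # sample correlation and sample size (from the later example)
t_score = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
t_crit = stats.t.isf(0.05 / 2, n - 2)      # two-sided critical value at alpha = 0.05
print('t = {:.4f}, critical value = {:.4f}'.format(t_score, t_crit))
print('significant' if abs(t_score) > t_crit else 'not significant')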


14. Residuals

A residual is the difference between the observed value y_i and the fitted value y_hat_i:

e_i = y_i - y_hat_i

The sum of the squared residuals is SSE:

SSE = sum((y_i - y_hat_i)**2)

If a simple linear model fits the data well, SSE is small.

The goal of simple linear regression is to build the linear model whose residual sum of squares is smallest.

Exercise

http://book.2cto.com/201512/58842.html

A restaurant system records daily sales for each dish; sample data are shown in Table 3-7.

Data file: demo/data/catering_sale_all.xls

Analysing the correlations between the sales of different dishes shows how the dishes relate to one another -- whether they are substitutes, complements, or unrelated -- which helps with purchasing raw materials. The Python code is shown in Listing 3-4.

Listing 3-4: correlation analysis of restaurant sales data

#-*- coding: utf-8 -*-
# Correlation analysis of restaurant sales data
from __future__ import print_function
import pandas as pd

catering_sale = '../data/catering_sale_all.xls'  # restaurant data, contains several columns

data = pd.read_excel(catering_sale, index_col = u'日期')  # read the data, using the "日期" (date) column as the index

data.corr()  # correlation matrix: correlation coefficient between every pair of dishes
data.corr()[u'百合酱蒸凤爪']  # only the correlations of "百合酱蒸凤爪" with the other dishes
data[u'百合酱蒸凤爪'].corr(data[u'翡翠蒸香茜饺'])  # correlation between "百合酱蒸凤爪" and "翡翠蒸香茜饺"

Code file: demo/code/correlation_analyze.py

The code above computes correlation coefficients in three different ways. Running it gives the correlation between any two dishes; for example, running "data.corr()[u'百合酱蒸凤爪']" produces the following result.

>>> data.corr()[u'百合酱蒸凤爪']

百合酱蒸凤爪 1.000000

翡翠蒸香茜饺 0.009206

金银蒜汁蒸排骨 0.016799

乐膳真味鸡 0.455638

蜜汁焗餐包 0.098085

生炒菜心 0.308496

铁板酸菜豆腐 0.204898

香煎韭菜饺 0.127448

香煎萝卜糕 -0.090276

原汁原味菜心 0.428316

Name: 百合酱蒸凤爪, dtype: float64

The result shows that when a customer orders "百合酱蒸凤爪", the correlation with orders of "翡翠蒸香茜饺", "金银蒜汁蒸排骨", "香煎萝卜糕", "铁板酸菜豆腐", "香煎韭菜饺" and other staple-type dishes is low, whereas the correlation with "乐膳真味鸡", "生炒菜心" and "原汁原味菜心" is relatively high.

15. The Adjusted R2 Value

http://www.graphpad.com/guides/prism/6/curve-fitting/index.htm?reg_interpreting_the_adjusted_r2.htm

http://www.statisticshowto.com/adjusted-r2/

http://www.360doc.com/content/16/1213/10/33459258_614269488.shtml

n is the number of observations, k is the number of parameters.

A quick and simple way to compare models is to pick the one with the larger adjusted R-squared.

R-squared describes only the sample at hand; for the whole population it is not very informative.

The adjusted R-squared is never larger than R-squared.

If you keep adding useless variables, the adjusted R-squared decreases;

if you add useful variables, the adjusted R-squared increases.

The adjusted R-squared formula:

When independent variables are added to a model, the coefficient of determination keeps increasing, so with enough variables the fit always looks good even when it is not. R2 is therefore adjusted; the result, written Ra2, is called the adjusted coefficient of determination.

R2 = SSR/SST = 1 - SSE/SST

Ra2 = 1 - (SSE/dfE)/(SST/dfT) = 1 - (SSE/(n-k-1)) / (SST/(n-1))
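A one-line helper implementing Ra2 (this mirrors the Adjust_Rsquare function that appears later in this post); the example numbers reuse the R**2 = 0.615 obtained there, with n = 10 observations and k = 1 predictor:

def adjusted_r_square(r_square, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_square) * (n - 1) / (n - k - 1)

print(adjusted_r_square(0.615, 10, 1))   # about 0.567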

Why you should not use R2 to compare models

R2 quantifies how well a model fits the data, so it seems as though it would be an easy way to compare models. It sure sounds easy -- pick the model with the larger R2. The problem with this approach is that there is no penalty for adding more parameters. So the model with more parameters will bend and twist more to come nearer the points, and so almost always has a higher R2. If you use R2 as the criterion for picking the best model, you'd almost always pick the model with the most parameters.

The adjusted R2 accounts for the number of parameters fit

The adjusted R2 always has a lower value than R2 (unless you are fitting only one parameter). The equations below show why.

The equations above show how the adjusted R2 is computed. The sum-of-squares of the residuals from the regression line or curve has n-K degrees of freedom, where n is the number of data points and K is the number of parameters fit by the regression. The total sum-of-squares is the sum of the squares of the distances from a horizontal line through the mean of all Y values. Since it only has one parameter (the mean), its degrees of freedom equals n-1. The adjusted R2 is smaller than the ordinary R2 whenever K is greater than 1.

Using adjusted R2 as a quick and dirty way to compare models

A quick and easy way to compare models is to choose the one with the larger adjusted R2. Choose to report this value on the Diagnostics tab.

Comparing models with adjusted R2 is not a standard method for comparing nonlinear models (it is standard for multiple linear regression), and we suggest that you use the extra sum-of-squares F test or compare AICc instead. If you do compare models by comparing adjusted R2, make sure that identical data, weighted identically, are used for all fits.

Adjusted R2 in linear regression

Prism doesn't report the adjusted R2 with linear regression, but you can fit a straight line with nonlinear regression.

If X and Y are not linearly related at all, the best-fit slope is expected to be 0.0. If you analyzed many randomly selected samples, half the samples would have a slope that is positive and half the samples would have a negative slope. But in all these cases, R2 would be positive (or zero). R2 can never be negative (unless you constrain the slope or intercept so it is forced to fit worse than a horizontal line). In contrast, the adjusted R2 can be negative. If you analyzed many randomly selected samples, you'd expect the adjusted R2 to be positive in half the samples and negative in the other half.

Here is a simple way to think about the distinction. The R2 quantifies the linear relationship in the sample of data you are analyzing. Even if there is no underlying relationship, there almost certainly is some relationship in that sample. The adjusted R2 is smaller than R2 and is your best estimate of the degree of relationship in the underlying population.

Adjusted R2 / Adjusted R-Squared: What is it used for?


Adjusted R2: Overview

Adjusted R2 is a special form of R2, the coefficient of determination.


R2 shows how well terms (data points) fit a curve or line. Adjusted R2 also indicates how well terms fit a curve or line, but adjusts for the number of terms in a model. If you add more and more useless variables to a model, adjusted r-squared will decrease. If you add more useful variables, adjusted r-squared will increase.

Adjusted R2 will always be less than or equal to R2. You only need R2 when working with samples. In other words, R2 isn't necessary when you have data from an entire population.

Adjusted R2 = 1 - [(1 - R2)(N - 1) / (N - K - 1)]

where:

N is the number of points in your data sample.

K is the number of independent regressors, i.e. the number of variables in your model, excluding the constant.

If you already know R2 then it's a fairly simple formula to work with. However, if you do not already have R2 then you'll probably not want to calculate this by hand! (If you must, see How to Calculate the Coefficient of Determination.) There are many statistical packages that can calculate adjusted r-squared for you. Adjusted r-squared is given as part of Excel regression output. See: Excel regression analysis output explained.

Meaning of Adjusted R2

Both R2 and the adjusted R2 give you an idea of how many data points fall within the line of the regression equation. However, there is one main difference between R2 and the adjusted R2: R2 assumes that every single variable explains the variation in the dependent variable, while the adjusted R2 tells you the percentage of variation explained by only the independent variables that actually affect the dependent variable.

How Adjusted R2 Penalizes You

The adjusted R2 will penalize you for adding independent variables (K in the equation) that do not fit the model. Why? In regression analysis, it can be tempting to add more variables to the data as you think of them. Some of those variables will be significant, but you can't be sure that the significance is not just by chance. The adjusted R2 compensates for this by penalizing you for those extra variables.

Problems with R2 that are corrected with an adjusted R2

R2 increases with every predictor added to a model. As R2 always increases and never decreases, it can appear to be a better fit with the more terms you add to the model. This can be completely misleading.

Similarly, if your model has too many terms and too many high-order polynomials you can run into the problem of over-fitting the data. When you over-fit data, a misleadingly high R2 value can lead to misleading projections.

16. Standard error of the regression coefficient

1. Standard error of the regression coefficient

Because the standard deviation of a sample statistic is its standard error, the standard error of a regression coefficient is simply its standard deviation. With repeated sampling, each sample yields one estimated regression coefficient; k samples give k estimated coefficients, and their standard deviation is the standard error of the regression coefficient (see "standard deviation vs. standard error" for details).

2. Standard error of the regression

Yi = Xi*beta + epsilon, where (Xi, Yi) are the observations, beta is the true regression coefficient, and epsilon is the error term;

Yi = Xi*beta_hat + mu, where (Xi, Yi) are the observations, beta_hat is the estimated regression coefficient, and mu is the residual.

(1) The standard error of the regression is the estimate of the standard deviation of the error term.

Each sample gives only one estimate of the regression coefficient, but because the sample contains n individuals, every observation has a residual (one sample gives n residuals), so each sample has a residual standard deviation. The sample residual variance is an unbiased estimator of the (population) error variance, which means the standard error of the regression is the residual standard deviation (also called the root mean squared error, RMSE):

s = sqrt( sum(e_i**2) / (n - k) ), where n is the sample size, k is the number of estimated parameters, and i indexes the individuals in the sample.

Squaring the standard error of the regression gives the residual variance (also called the mean squared error, MSE):

MSE = sum(e_i**2) / (n - k), with n, k and i as above.

The mean squared error measures how much the data scatter around the fit: the smaller the MSE, the more precisely the model describes the data.
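A minimal sketch of computing the standard error of the regression (RMSE) and the MSE from an OLS fit, using the Tobacco/Alcohol sample that appears later in this post; result.ssr and result.mse_resid are standard statsmodels attributes:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame({
    'Tobacco': [4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51],
    'Alcohol': [6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08]})
result = smf.ols('Alcohol ~ Tobacco', data).fit()

n, k = len(data), 2                # k = number of estimated parameters (intercept + slope)
sse = result.ssr                   # residual sum of squares
mse = sse / (n - k)                # mean squared error = estimate of the error variance
rmse = np.sqrt(mse)                # standard error of the regression (root mean squared error)
print('MSE = {:.4f}, RMSE = {:.4f}'.format(mse, rmse))
print('statsmodels mse_resid = {:.4f}'.format(result.mse_resid))   # should agree with mse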

Log-Likelihood (the value of the log-likelihood function)

How to read the log-likelihood value: it is usually negative, and a larger actual value (not absolute value) is better. First, the basic reasoning: for a discrete distribution the likelihood is a probability in the interval 0-1, so its logarithm is negative; for a continuous variable the likelihood is a probability density, which is not restricted to 0-1, so the sign of the log-likelihood is not fixed. Second, the interpretation of the Eviews formula: the value depends mainly on the residual sum of squares (and the sample size). Only when the ratio of the residual sum of squares to the sample size is very small can the bracketed term become negative and the value positive, indicating a very good fit; otherwise the value is negative, and the smaller its absolute value, the smaller the residual sum of squares and the better the fit.

Minus two times the natural log of the likelihood (-2 log L) is often used to describe model fit: the smaller it is, the better the fit.

The likelihood-ratio statistic follows a chi-squared distribution and is a test statistic, so it is not a matter of "larger is better" or "smaller is better": if the statistic exceeds the chi-squared critical value, reject the null hypothesis; otherwise do not reject it.

Maximum likelihood estimation (larger is better) optimizes the likelihood function (no negative sign).

Here is why the objective function is defined as -2 log-likelihood:

the log function is monotonic and makes the computation easier. Some software minimizes rather than maximizes, which is why there is a negative sign (-); a regression problem is conventionally defined as minimizing the sum of squared errors. The factor 2 is there for convenience in hypothesis testing, because 2*(log-likelihood ratio) follows a chi-squared distribution with the appropriate degrees of freedom.

17. Residual analysis

1. Homoscedasticity (equal variance)

2. Normality

Durbin-Watson test

np.sum( np.diff( result.resid.values )**2.0 )

Out[18]: 3.1437096272928842

DW = np.sum( np.diff( result.resid.values )**2.0 )/result.ssr

DW

Out[20]: 1.9753463429714668

print('Durbin-Watson: {:.5f}'.format( DW ))

Durbin-Watson: 1.97535

The D.W. statistic is used to check whether the residuals are autocorrelated. OLS regression assumes that the error terms are independent of one another; if the residuals are autocorrelated, the estimates are biased in their inference and the model's explanatory power suffers.

A D.W. statistic around 2 indicates no first-order autocorrelation in the residuals; the further it departs from 2, the more the explanatory power of the model you have built is compromised.

In linear regression we always assume that the residuals are independent of one another (uncorrelated). If the independence assumption is violated, some model fits become questionable. For example, positive correlation between error terms tends to inflate the coefficient t values, making predictors look important when in fact they may not be.

The Durbin-Watson statistic tests for autocorrelation in the regression residuals by determining whether the correlation between two adjacent error terms is zero. The test is based on the assumption that the errors are generated by a first-order autoregressive process. To draw a conclusion, look up the lower critical value LD and the upper critical value UD in a DW table using the sample size n and the number of independent variables k', and judge the autocorrelation of the residuals by the following rules:

(1) If 0 < DW < LD, the residuals show positive autocorrelation;

(2) if LD < DW < UD, the test is inconclusive;

(3) if UD < DW < 4-UD, the residuals show no (first-order) autocorrelation;

(4) if 4-UD < DW < 4-LD, the test is inconclusive;

(5) if 4-LD < DW < 4, the residuals show negative autocorrelation.


The proof of the Gauss-Markov theorem shows that OLS estimates have minimum variance only under homoscedasticity and no autocorrelation. When the model has autocorrelated errors, the OLS estimator is still unbiased but is no longer efficient. As with heteroscedasticity, this means there exist other estimators whose estimation error is smaller than that of OLS; in other words, for a model with autocorrelation, the parameters should be estimated by some other method.

1. Autocorrelation does not affect the linearity or unbiasedness of the OLS estimator, but it destroys its efficiency.

2. The coefficient estimates will have considerably larger variance under autocorrelation.

3. The t tests on the coefficients become unreliable.

4. The model's predictive ability breaks down.

Jarque-Bera Test

H0: skewness (S) and excess kurtosis (K) are both 0.

H1: at least one of skewness (S) and excess kurtosis (K) is not 0.

With small samples the test tends to reject H0 even when the data are actually normal.

The Jarque-Bera test is another test that considers skewness (S) and kurtosis (K). The null hypothesis is that the distribution is normal, that both the skewness and excess kurtosis equal zero, or alternatively, that the skewness is zero and the regular run-of-the-mill kurtosis is three. Unfortunately, with small samples the Jarque-Bera test is prone to rejecting the null hypothesis -- that the distribution is normal -- when in fact it is true.
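A minimal sketch of running the Jarque-Bera test on a set of residuals with scipy.stats.jarque_bera (the "residuals" here are simulated, purely for illustration):

import numpy as np
from scipy import stats

np.random.seed(0)
residuals = np.random.randn(100)                 # hypothetical residuals

jb_stat, jb_p = stats.jarque_bera(residuals)     # H0: skewness = 0 and excess kurtosis = 0
print('JB = {:.4f}, p = {:.4f}'.format(jb_stat, jb_p))
# p > 0.05: no evidence against normality (but note the small-sample caveat above)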

Condition Number

If it is greater than 30, the predictors have substantial collinearity.

The condition number measures the sensitivity of a function's output to its input. When two predictor variables are highly correlated, which is called multicollinearity, the coefficients or factors of those predictor variables can fluctuate erratically for small changes in the data or the model. Ideally, similar models should be similar, i.e., have approximately equal coefficients. Multicollinearity can cause numerical matrix inversion to crap out, or produce inaccurate results (see Kaplan 2009). One approach to this problem in regression is the technique of ridge regression, which is available in the Python package sklearn.

We calculate the condition number by taking the eigenvalues of the product of the predictor variables (including the constant vector of ones) and then taking the square root of the ratio of the largest eigenvalue to the smallest eigenvalue. If the condition number is greater than 30, then the regression may have multicollinearity.
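Following the recipe above, a minimal sketch of computing the condition number by hand for the Tobacco predictor plus a constant column (statsmodels also reports its own condition number in the OLS summary):

import numpy as np

x = np.array([4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51])
X = np.column_stack((np.ones(len(x)), x))          # design matrix with the constant column

eigvals = np.linalg.eigvalsh(np.dot(X.T, X))       # eigenvalues of X'X (symmetric, so eigvalsh)
cond = np.sqrt(eigvals.max() / eigvals.min())      # sqrt of largest/smallest eigenvalue ratio
print('condition number = {:.2f}'.format(cond))    # > 30 suggests multicollinearity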

AIC

Full name: Akaike information criterion.

The Akaike information criterion was created by the Japanese statistician Hirotugu Akaike and is based on the concept of entropy.

The Akaike information criterion (AIC) is a measure of how well a statistical model fits, created and developed by the Japanese statistician Hirotugu Akaike. Built on the concept of entropy, it balances the complexity of the estimated model against how well the model fits the data.

Formula:

In the general case used here, AIC can be written as:

AIC = (2k - 2L)/n

The fewer the parameters, the smaller the AIC and the better the model.

The more observations, the smaller the AIC and the better the model.

Its underlying assumption is that the model errors are independent and normally distributed.

Here k is the number of parameters in the fitted model, L is the log-likelihood, and n is the number of observations.

The size of AIC depends on L and k: the smaller k is, the smaller the AIC; the larger L is, the smaller the AIC. A small k means a parsimonious model, and a large L means an accurate model. Like the adjusted coefficient of determination, AIC therefore balances parsimony and accuracy when evaluating a model.

Concretely, L = -(n/2)*ln(2*pi) - (n/2)*ln(SSE/n) - n/2, where n is the sample size and SSE is the residual sum of squares.

Adding free parameters improves the goodness of fit, so AIC encourages goodness of fit while discouraging overfitting: the preferred model is the one with the smallest AIC. The AIC approach looks for the model that explains the data best with the fewest free parameters.
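A small sketch that mirrors the convention used in the text, AIC = (2k - 2L)/n with L computed from SSE (note that statsmodels' result.aic uses the more common 2k - 2*log-likelihood without dividing by n); the numeric inputs below are made-up illustration values:

import math

def aic(n, k, sse):
    """AIC = (2k - 2L)/n with L = -(n/2)*ln(2*pi) - (n/2)*ln(sse/n) - n/2, as in the text above."""
    L = -(n / 2.0) * math.log(2 * math.pi) - (n / 2.0) * math.log(sse / n) - n / 2.0
    return (2 * k - 2 * L) / n

# hypothetical values: 10 observations, 2 fitted parameters, residual sum of squares 1.28
print(aic(10, 2, 1.28))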

AICc and AICu

For small samples, AIC is replaced by AICc:

AICc = AIC + 2k(k+1)/(n-k-1)

As n grows, AICc converges to AIC, so AICc can be used for any sample size (Burnham and Anderson, 2004).

McQuarrie and Tsai (1998: 22) define AICc as:

AICc = ln(RSS/n) + (n+k)/(n-k-2),

and propose the closely related index AICu:

AICu = ln[RSS/(n-k)] + (n+k)/(n-k-2).

QAIC

QAIC (quasi-AIC) is defined as:

QAIC = 2k - (2*lnL)/c

where c is the variance inflation (overdispersion) factor, so QAIC adjusts for overdispersion (or lack of fit).

For small samples, QAIC becomes:

QAICc = QAIC + 2k(2k+1)/(n-k-1)

Confidence interval for the mean: for a given value x0 of the independent variable, an interval estimate of the mean of the dependent variable y.

Prediction interval for an individual value: for a given value x0 of the independent variable, an interval estimate of an individual value of the dependent variable y.

# -*- coding: utf-8 -*-
"""
Created on Mon Jul 10 11:04:51 2017

@author: toby
"""

# Import standard packages
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats


def fitLine(x, y, alpha=0.05, newx=[], plotFlag=1):
    ''' Fit a curve to the data using a least squares 1st order polynomial fit '''
    # Summary data
    n = len(x)  # number of samples
    Sxx = np.sum(x**2) - np.sum(x)**2/n
    # Syy = np.sum(y**2) - np.sum(y)**2/n    # not needed here
    Sxy = np.sum(x*y) - np.sum(x)*np.sum(y)/n
    mean_x = np.mean(x)
    mean_y = np.mean(y)

    # Linefit
    b = Sxy/Sxx
    a = mean_y - b*mean_x

    # Residuals
    fit = lambda xx: a + b*xx
    residuals = y - fit(x)
    var_res = np.sum(residuals**2)/(n-2)
    sd_res = np.sqrt(var_res)

    # Confidence intervals
    se_b = sd_res/np.sqrt(Sxx)
    se_a = sd_res*np.sqrt(np.sum(x**2)/(n*Sxx))
    df = n-2                            # degrees of freedom
    tval = stats.t.isf(alpha/2., df)    # appropriate t value
    ci_a = a + tval*se_a*np.array([-1, 1])
    ci_b = b + tval*se_b*np.array([-1, 1])

    # create series of new test x-values to predict for
    npts = 100
    px = np.linspace(np.min(x), np.max(x), num=npts)
    se_fit = lambda x: sd_res * np.sqrt(1./n + (x-mean_x)**2/Sxx)
    se_predict = lambda x: sd_res * np.sqrt(1 + 1./n + (x-mean_x)**2/Sxx)

    print(('Summary: a={0:5.4f}+/-{1:5.4f}, b={2:5.4f}+/-{3:5.4f}'.format(a, tval*se_a, b, tval*se_b)))
    print(('Confidence intervals: ci_a=({0:5.4f} - {1:5.4f}), ci_b=({2:5.4f} - {3:5.4f})'.format(ci_a[0], ci_a[1], ci_b[0], ci_b[1])))
    print(('Residuals: variance = {0:5.4f}, standard deviation = {1:5.4f}'.format(var_res, sd_res)))
    print(('alpha = {0:.3f}, tval = {1:5.4f}, df={2:d}'.format(alpha, tval, df)))

    # Return info
    ri = {'residuals': residuals,
          'var_res': var_res,
          'sd_res': sd_res,
          'alpha': alpha,
          'tval': tval,
          'df': df}

    if plotFlag == 1:
        # Plot the data
        plt.figure()
        plt.plot(px, fit(px), 'k', label='Regression line')
        # plt.plot(x, y, 'k.', label='Sample observations', ms=10)
        plt.plot(x, y, 'k.')
        x.sort()
        limit = (1-alpha)*100
        plt.plot(x, fit(x)+tval*se_fit(x), 'r--', lw=2, label='Confidence limit ({0:.1f}%)'.format(limit))
        plt.plot(x, fit(x)-tval*se_fit(x), 'r--', lw=2)
        plt.plot(x, fit(x)+tval*se_predict(x), '--', lw=2, color=(0.2, 1, 0.2), label='Prediction limit ({0:.1f}%)'.format(limit))
        plt.plot(x, fit(x)-tval*se_predict(x), '--', lw=2, color=(0.2, 1, 0.2))
        plt.xlabel('X values')
        plt.ylabel('Y values')
        plt.title('Linear regression and confidence limits')

        # configure legend
        plt.legend(loc=0)
        leg = plt.gca().get_legend()
        ltext = leg.get_texts()
        plt.setp(ltext, fontsize=14)

        # show the plot
        outFile = 'regression_wLegend.png'
        plt.savefig(outFile, dpi=200)
        print('Image saved to {0}'.format(outFile))
        plt.show()

    if newx != []:
        try:
            newx.size
        except AttributeError:
            newx = np.array([newx])

        print(('Example: x = {0}+/-{1} => se_fit = {2:5.4f}, se_predict = {3:6.5f}'
               .format(newx[0], tval*se_predict(newx[0]), se_fit(newx[0]), se_predict(newx[0]))))

        newy = (fit(newx), fit(newx)-se_predict(newx), fit(newx)+se_predict(newx))
        return (a, b, (ci_a, ci_b), ri, newy)
    else:
        return (a, b, (ci_a, ci_b), ri)


def Draw_confidenceInterval(x, y):
    x = np.array(x)
    y = np.array(y)
    goodIndex = np.invert(np.logical_or(np.isnan(x), np.isnan(y)))
    (a, b, (ci_a, ci_b), ri, newy) = fitLine(x[goodIndex], y[goodIndex], alpha=0.01, newx=np.array([1, 4.5]))


y = [6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08]
x = [4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51]

Draw_confidenceInterval(x, y)

19. Reading the OLS output

# -*- coding: utf-8 -*-
# Spearman's correlation coefficient for ranked data, Pearson correlation,
# and a detailed OLS report for simple linear regression
import math, pylab, scipy
import numpy as np
import scipy.stats as stats
from scipy.stats import t
from scipy.stats import f
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import lillifors
import normality_check          # the author's own helper module for normality tests
import statsmodels.formula.api as sm

x = [4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51]
y = [6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08]
list_group = [x, y]
sample = len(x)

# significance level
a = 0.05

# visualize the data
plt.plot(x, y, 'ro')


# Spearman rank correlation, a nonparametric test
def Spearmanr(x, y):
    print("use spearmanr,Nonparametric tests")
    # warn if the samples have different sizes
    if len(x) != len(y):
        print("warming,the samples are not equal!")
    r, p = stats.spearmanr(x, y)
    print("spearman r**2:", r**2)
    print("spearman p:", p)
    if sample < 500 and p > 0.05:
        print("when sample < 500,p has no mean(>0.05)")
        print("when sample > 500,p has mean")


# Pearson correlation, a parametric test
def Pearsonr(x, y):
    print("use Pearson,parametric tests")
    r, p = stats.pearsonr(x, y)
    print("pearson r**2:", r**2)
    print("pearson p:", p)
    if sample < 30:
        print("when sample <30,pearson has no mean")


# Pearson correlation, parametric test, with a detailed OLS report
def Pearsonr_details(x, y, xLabel, yLabel, formula):
    n = len(x)
    df = n - 2
    data = pd.DataFrame({yLabel: y, xLabel: x})
    result = sm.ols(formula, data).fit()
    print(result.summary())

    # significance of the linear relation (F test on the model)
    print('\n')
    print("linear relation Significant test:...................................")
    # if the p value of the F test < 0.05, reject H0 (no relation): x and y have a significant linear relation
    if result.f_pvalue < 0.05:
        print("P value of f test<0.05,the linear relation is right.")

    # significance test of R
    print('\n')
    print("R significant test:...................................")
    r_square = result.rsquared
    r = math.sqrt(r_square)
    t_score = r * math.sqrt(n - 2) / (math.sqrt(1 - r**2))
    t_std = t.isf(a / 2, df)
    if t_score > t_std:
        print("R is significant according to its sample size")
    else:
        print("R is not significant")

    # residual analysis
    print('\n')
    print("residual error analysis:...................................")
    states = normality_check.check_normality(result.resid)
    if states == True:
        print("the residual error are normal distributed")
    else:
        print("the residual error are not normal distributed")

    # skewness and kurtosis of the residuals
    Skew = stats.skew(result.resid, bias=True)
    Kurtosis = stats.kurtosis(result.resid, fisher=False, bias=True)
    if round(Skew, 1) == 0:
        print("residual errors normality Skew:in middle,perfect match")
    elif round(Skew, 1) > 0:
        print("residual errors normality Skew:close right")
    elif round(Skew, 1) < 0:
        print("residual errors normality Skew:close left")
    if round(Kurtosis, 1) == 3:
        print("residual errors normality Kurtosis:in middle,perfect match")
    elif round(Kurtosis, 1) > 3:
        print("residual errors normality Kurtosis:more peak")
    elif round(Kurtosis, 1) < 3:
        print("residual errors normality Kurtosis:more flat")

    # autocorrelation check (Durbin-Watson)
    print('\n')
    print("autocorrelation test:...................................")
    DW = np.sum(np.diff(result.resid.values)**2.0) / result.ssr
    if round(DW, 1) == 2:
        print("Durbin-Watson close to 2,there is no autocorrelation.OLS model works well")

    # multicollinearity check
    print('\n')
    print("multicollinearity test:")
    conditionNumber = result.condition_number
    if conditionNumber > 30:
        print("conditionNumber>30,multicollinearity exists")
    else:
        print("conditionNumber<=30,multicollinearity not exists")

    # residual plot, used to check homoscedasticity
    Draw_residual(list(result.resid))

'''
result.rsquared
Out[28]: 0.61510660055413524
'''


# Kendall's tau, a nonparametric test
def Kendalltau(x, y):
    print("use kendalltau,Nonparametric tests")
    r, p = stats.kendalltau(x, y)
    print("kendalltau r**2:", r**2)
    print("kendalltau p:", p)


# choose the appropriate model
def R_mode(x, y, xLabel, yLabel, formula):
    # normality test
    Normal_result = normality_check.NormalTest(list_group)
    print("normality result:", Normal_result)
    if len(list_group) > 2:
        Kendalltau(x, y)
    if Normal_result == False:
        Spearmanr(x, y)
        Kendalltau(x, y)
    if Normal_result == True:
        Pearsonr_details(x, y, xLabel, yLabel, formula)


# adjusted R squared
def Adjust_Rsquare(r_square, n, k):
    adjust_rSquare = 1 - ((1 - r_square) * (n - 1) * 1.0 / (n - k - 1))
    return adjust_rSquare

'''
n=len(x)
n=10
k=1
r_square=0.615
Adjust_Rsquare(r_square,n,k)
Out[11]: 0.566875
'''


# plotting
def Plot(x, y, yLabel, xLabel, Title):
    plt.plot(x, y, 'ro')
    plt.ylabel(yLabel)
    plt.xlabel(xLabel)
    plt.title(Title)
    plt.show()


# plot parameters
yLabel = 'Alcohol'
xLabel = 'Tobacco'
Title = 'Sales in Several UK Regions'
Plot(x, y, yLabel, xLabel, Title)
formula = 'Alcohol ~ Tobacco'


# plot the residuals
def Draw_residual(residual_list):
    x = [i for i in range(1, len(residual_list) + 1)]
    y = residual_list
    pylab.plot(x, y, 'ro')
    pylab.title("draw residual to check wrong number")

    # Pad margins so that markers don't get clipped by the axes
    pylab.margins(0.3)

    # draw the grid
    pylab.grid(True)
    pylab.show()


R_mode(x, y, xLabel, yLabel, formula)

'''
result.fittedvalues is the array of predicted y values

result.fittedvalues
Out[42]:
0 6.094983
1 5.823391
2 5.833450
3 5.400915
4 5.531682
5 4.978439
6 5.260090
7 4.767201
8 5.592035
9 6.577813
dtype: float64

# skewness of the residuals
S = stats.skew(result.resid, bias=True)
Out[44]: -0.013678125910039975

K = stats.kurtosis(result.resid, fisher=False,bias=True)
K
Out[47]: 1.5271300905736027
'''

20. Collinearity in simple regression

Example from the official repository:

https://github.com/thomas-haslwanter/statsintro_python/blob/master/ISP/Code_Quantlets/11_LinearModels/simpleModels/swim100m.csv

Load the data:

# -*- coding: utf-8 -*-
'''Simple linear models.
- "model_formulas" is based on examples in Kaplan's book "Statistical Modeling".
- "polynomial_regression" shows how to work with simple design matrices, like MATLAB's "regress" command.
'''

# Copyright(c) 2015, Thomas Haslwanter. All rights reserved, under the CC BY-SA 4.0 International License

# Import standard packages
import numpy as np
import pandas as pd

# additional packages
from statsmodels.formula.api import ols
import statsmodels.regression.linear_model as sm
from statsmodels.stats.anova import anova_lm


def model_formulas():
    ''' Define models through formulas '''
    # Get the data:
    # Development of world record times for the 100m Freestyle, for men and women.
    data = pd.read_csv('swim100m.csv')

    # Different models
    model1 = ols("time ~ sex", data).fit()         # one factor
    model2 = ols("time ~ sex + year", data).fit()  # two factors
    model3 = ols("time ~ sex * year", data).fit()  # two factors with interaction

    # Model information
    print((model1.summary()))
    print((model2.summary()))
    print((model3.summary()))

    # ANOVAs
    print('----------------- Results ANOVAs: Model 1 -----------------------')
    print((anova_lm(model1)))

    print('--------------------- Model 2 -----------------------------------')
    print((anova_lm(model2)))

    print('--------------------- Model 3 -----------------------------------')
    model3Results = anova_lm(model3)
    print(model3Results)

    # Just to check the correct run
    return model3Results['F'][0]  # should be 156.1407931415788


def polynomial_regression():
    ''' Define the model directly through the design matrix.
        Similar to MATLAB's "regress" command.
    '''
    # Generate the data: a noisy second order polynomial
    # To get reproducable values, I provide a seed value
    np.random.seed(987654321)
    t = np.arange(0, 10, 0.1)
    y = 4 + 3*t + 2*t**2 + 5*np.random.randn(len(t))

    # --- >>> START stats <<< ---
    # Make the fit. Note that this is another "OLS" than the one in "model_formulas",
    # as it works directly with the design matrix!
    M = np.column_stack((np.ones(len(t)), t, t**2))
    res = sm.OLS(y, M).fit()
    # --- >>> STOP stats <<< ---

    # Display the results
    print('Summary:')
    print((res.summary()))
    print(('The fit parameters are: {0}'.format(str(res.params))))
    print('The confidence intervals are:')
    print((res.conf_int()))

    return res.params  # should be [ 4.74244177, 2.60675788, 2.03793634]


if __name__ == '__main__':
    model_formulas()
    polynomial_regression()

The first model

model1 = ols("time ~ sex", data).fit()   # one factor

Several statistics indicate that the model is a poor fit:

R-squared: too small, so the model explains too little.

AIC/BIC: too large, so the model is not appropriate.

Omnibus: p is essentially 0, so the residuals are not normally distributed and the model is not appropriate.

Durbin-Watson: much smaller than 2, so the residuals show clear autocorrelation.

Model 2

model2 = ols("time ~ sex + year", data).fit()   # two factors

AIC/BIC: too large, so the model is not appropriate.

Omnibus: p is essentially 0, so the residuals are not normally distributed and the model is not appropriate.

Durbin-Watson: much smaller than 2, so the residuals show clear autocorrelation.

Condition number: far above 30, so multicollinearity is severe.

Model 3

model3 = ols("time ~ sex * year", data).fit()   # two factors with interaction

AIC/BIC: too large, so the model is not appropriate.

Omnibus: p is essentially 0, so the residuals are not normally distributed and the model is not appropriate.

Durbin-Watson: much smaller than 2, so the residuals show clear autocorrelation.

Condition number: far above 30, so multicollinearity is severe.

21. Bootstrap

Installing scikits.bootstrap:

http://scikits.scipy.org/bootstrap

https://wenku.baidu.com/view/0c449147336c1eb91a375d39.html

Scikits.bootstrap provides bootstrap confidence interval algorithms for scipy.

At present, it is rather feature-incomplete and in flux. However, the functions that have been written should be relatively stable as far as results.

Much of the code has been written based off the descriptions from Efron and Tibshirani's Introduction to the Bootstrap, and results should match the results obtained from following those explanations. However, the current ABC code is based off of the modified-BSD-licensed R port of the Efron bootstrap code, as I do not believe I currently have a sufficient understanding of the ABC method to write the code independently.

In any case, please contact me (Constantine Evans ) with any questions or suggestions. I'm trying to add documentation, and will be adding tests as well. I'm especially interested, however, in how the API should actually look; please let me know if you think the package should be organized differently.

The package is licensed under the Modified BSD License.

pip install scikits.bootstrap

bootstrap.py

# -*- coding: utf-8 -*-
''' Example of bootstrapping the confidence interval for the mean of a sample distribution
This function requires "bootstrap.py", which is available from
https://github.com/cgevans/scikits-bootstrap
'''

# Copyright(c) 2015, Thomas Haslwanter. All rights reserved, under the CC BY-SA 4.0 International License

import scikits

# Import standard packages
import matplotlib.pyplot as plt
import scipy as sp
from scipy import stats

# additional packages
import scikits.bootstrap as bootstrap


def generate_data():
    ''' Generate the data for the bootstrap simulation '''
    # To get reproducable values, I provide a seed value
    sp.random.seed(987654321)

    # Generate a non-normally distributed datasample
    data = stats.poisson.rvs(2, size=1000)

    # Show the data
    plt.plot(data, '.')
    plt.title('Non-normally distributed dataset: Press any key to continue')
    plt.waitforbuttonpress()
    plt.close()

    return(data)


def calc_bootstrap(data):
    ''' Find the confidence interval for the mean of the given data set with bootstrapping. '''
    # --- >>> START stats <<< ---
    # Calculate the bootstrap
    CIs = bootstrap.ci(data=data, statfunction=sp.mean)
    # --- >>> STOP stats <<< ---

    # Print the data: the "*" turns the array "CIs" into a list
    print(('The confidence intervals for the mean are: {0} - {1}'.format(*CIs)))

    return CIs


if __name__ == '__main__':
    data = generate_data()
    calc_bootstrap(data)
    input('Done')

bootstrapping 解决不知道分布情况下,计算平均值的置信区间

你可能感兴趣的:(python回归代码)