







Generalized Estimating Equations

Generalized Linear Models

Discrete and Count Models

Multivariate Models

Other Models





2.1 About the paper

2.2 Plotting the time trend of intergenerational mobility

2.3 Baseline regression and implementation


pip install statsmodels


API Reference — statsmodels



Function    Description    Notes

OLS(endog[, exog, missing, hasconst])

Ordinary Least Squares


WLS(endog, exog[, weights, missing, hasconst])

Weighted Least Squares


GLS(endog, exog[, sigma, missing, hasconst])

Generalized Least Squares


GLSAR(endog[, exog, rho, missing, hasconst])

Generalized Least Squares with AR covariance structure

Generalized least squares with an AR(p) covariance structure

RecursiveLS(endog, exog[, constraints])

Recursive least squares


Durbin, James, and Siem Jan Koopman. 2012. Time Series Analysis by State Space Methods: Second Edition. Oxford University Press.

RollingOLS(endog, exog[, window, min_nobs, ...])

Rolling Ordinary Least Squares

RollingWLS(endog, exog[, window, weights, ...])

Rolling Weighted Least Squares



BayesGaussMI(data[, mean_prior, cov_prior, ...])

Bayesian Imputation using a Gaussian model.

MI(imp, model[, model_args_fn, ...])

MI performs multiple imputation using a provided imputer object.

MICE(model_formula, model_class, data[, ...])

Multiple Imputation with Chained Equations.

MICEData(data[, perturbation_method, k_pmm, ...])

Wrap a data set to allow missing data handling with MICE.

Generalized Estimating Equations



GEE(endog, exog, groups[, time, family, ...])

Marginal Regression Model using Generalized Estimating Equations.

NominalGEE(endog, exog, groups[, time, ...])

Nominal Response Marginal Regression Model using GEE.

OrdinalGEE(endog, exog, groups[, time, ...])

Ordinal Response Marginal Regression Model using GEE

Generalized Linear Models


GLM(endog, exog[, family, offset, exposure, ...])

Generalized Linear Models

GLMGam(endog[, exog, smoother, alpha, ...])

Generalized Additive Models (GAM)

BinomialBayesMixedGLM(endog, exog, exog_vc, ...)

Generalized Linear Mixed Model with Bayesian estimation

PoissonBayesMixedGLM(endog, exog, exog_vc, ident)

Generalized Linear Mixed Model with Bayesian estimation

Discrete and Count Models

Logit(endog, exog[, check_rank])

Logit Model

Probit(endog, exog[, check_rank])

Probit Model

MNLogit(endog, exog[, check_rank])

Multinomial Logit Model

OrderedModel(endog, exog[, offset, distr])

Ordinal Model based on logistic or normal distribution

Poisson(endog, exog[, offset, exposure, ...])

Poisson Model

NegativeBinomial(endog, exog[, ...])

Negative Binomial Model

NegativeBinomialP(endog, exog[, p, offset, ...])

Generalized Negative Binomial (NB-P) Model

GeneralizedPoisson(endog, exog[, p, offset, ...])

Generalized Poisson Model

ZeroInflatedPoisson(endog, exog[, ...])

Poisson Zero Inflated Model

ZeroInflatedNegativeBinomialP(endog, exog[, ...])

Zero Inflated Generalized Negative Binomial Model

ZeroInflatedGeneralizedPoisson(endog, exog)

Zero Inflated Generalized Poisson Model

Multivariate Models

Factor([endog, n_factor, corr, method, smc, ...])

Factor analysis

MANOVA(endog, exog[, missing, hasconst])

Multivariate Analysis of Variance

PCA(data[, ncomp, standardize, demean, ...])

Principal Component Analysis

Other Models

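MixedLM (Linear Mixed Effects Model): combines fixed effects and random effects. It is used when the data contain within-group clustering or repeated measurements, and it is common in empirical research.


Quantile Regression: widely used in finance research. It works well on truncated or heavy-tailed data, and it can sometimes reveal relationships that OLS cannot.

Robust Linear Model: a model that copes well with extreme values in the data by reducing the influence of occasional outliers. It is widely used in quantitative investing to down-weight factors that occasionally change sharply but have little effect on stock prices.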

MixedLM(endog, exog, groups[, exog_re, ...])

Linear Mixed Effects Model

SurvfuncRight(time, status[, entry, title, ...])

Estimation and inference for a survival function.

PHReg(endog, exog[, status, entry, strata, ...])

Cox Proportional Hazards Regression Model

QuantReg(endog, exog, **kwargs)

Quantile Regression

RLM(endog, exog[, M, missing])

Robust Linear Model

BetaModel(endog, exog[, exog_precision, ...])

Beta Regression.


ProbPlot(data[, dist, fit, distargs, a, ...])

Q-Q and P-P Probability Plots


qqline(ax, line[, x, y, dist, fmt])

Plot a reference line for a qqplot.



qqplot(data[, dist, distargs, a, loc, ...])

Q-Q plot of the quantiles of x versus the quantiles/ppf of a distribution.


qqplot_2samples(data1, data2[, xlabel, ...])

Q-Q Plot of two samples' quantiles.



Description(data[, stats, numeric, ...])

Extended descriptive statistics for data

describe(data[, stats, numeric, ...])

Extended descriptive statistics for data





test([extra_args, exit])

Run the test suite

add_constant(data[, prepend, has_constant])

Add a column of ones to an array.


Load a previously saved object


List the versions of statsmodels and any installed dependencies

webdoc([func, stable])

Opens a browser and displays online documentation

Other

Beyond the features above, statsmodels also provides Univariate Time-Series Analysis, Multivariate Time Series Models, Filters and Decompositions, Markov Regime Switching Models, and more.



2.1 About the paper

Can Compulsory Education Improve Intergenerational Mobility? (义务教育能提高代际流动性吗?) - CNKI https://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CJFD&dbname=CJFDLAST2021&filename=JRYJ202106005&uniplatform=NZKPT&v=Dy2OTtraPRzoUOo03HDEpf09y-Qrk1KRueN9Eqk7cqjes2XjVzLZJ9e_ywxmTen1



import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels


data = pd.read_excel(r'data_merge2.xlsx')  # change this to the path of your data file


hhcode coun househead gender birthyear nationality siblings hukou education ChildEdu ... Educorr1 Educorr2 Educorr3 Educorr4 Educorr5 PY x Eligibility fa_work birthyear_groupby5
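Column descriptions: household code, county code, whether household head, gender, birth year, ethnicity, number of children, hukou, education level, years of schooling, father's education level, mother's education level, father's occupation, mother's occupation, father's years of schooling, mother's years of schooling, parents' average years of schooling, province, parents' maximum years of schooling, parents' minimum years of schooling, correlation coefficient (measures intergenerational mobility), year the compulsory education policy started, age when compulsory education started, degree of exposure to compulsory education, father's occupation group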
0 1101065303202 110106 1 2 1976 1.0 0.0 1.0 3.0 9.0 ... 0.214808 0.503370 0.349154 0.245640 0.554755 1986 10 0.666667 NaN 0
1 1101065306101 110106 1 1 1957 1.0 2.0 1.0 2.0 5.0 ... 0.276842 0.123774 0.207662 0.276842 0.137444 1986 29 0.000000 NaN 1
2 1101065309001 110106 1 1 1960 1.0 2.0 1.0 3.0 8.0 ... 0.146180 0.355535 0.267936 0.190255 0.311783 1986 26 0.000000 NaN 2
3 1101065311101 110106 1 2 1936 1.0 1.0 1.0 1.0 NaN ... 0.790569 0.607143 0.852116 0.790569 0.607143 1986 50 0.000000 NaN 3
4 1101066100401 110106 1 1 1976 1.0 1.0 1.0 3.0 9.0 ... 0.214808 0.503370 0.349154 0.245640 0.554755 1986 10 0.666667 NaN 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32917 6210210205901 621021 2 2 1975 1.0 2.0 1.0 2.0 5.0 ... 0.359305 0.449707 0.471492 0.354519 0.483449 1991 16 0.000000 11.0 5
32918 6210210209201 621021 2 2 1959 1.0 3.0 1.0 2.0 5.0 ... 0.501423 0.196046 0.467181 0.501423 0.196046 1991 32 0.000000 11.0 6
32919 6210210210501 621021 2 1 1977 1.0 2.0 1.0 2.0 5.0 ... 0.443866 0.404331 0.498954 0.434100 0.432057 1991 14 0.222222 11.0 7
32920 6210210211801 621021 2 2 1984 1.0 0.0 1.0 3.0 8.0 ... -0.018009 0.337699 0.167929 0.147695 0.144852 1991 7 1.000000 11.0 8
32921 6210220110401 621022 2 2 1979 1.0 1.0 1.0 3.0 8.0 ... 0.611632 0.232354 0.528778 0.610094 0.306826 1991 12 0.444444 11.0

32922 rows × 30 columns


coun househead gender birthyear nationality siblings hukou education ChildEdu father_edu ... Educorr1 Educorr2 Educorr3 Educorr4 Educorr5 PY x Eligibility fa_work birthyear_groupby5
count 32922.000000 32922.000000 32922.000000 32922.000000 32916.000000 32785.000000 32914.000000 32911.000000 31999.000000 31195.000000 ... 32575.000000 31706.000000 32586.000000 32578.000000 31578.000000 32922.000000 32922.000000 32922.000000 25588.000000 32922.000000
mean 382013.629549 1.478677 1.506105 1962.923881 1.339318 2.870581 1.552956 3.342196 8.255899 1.858856 ... 0.329442 0.307071 0.357607 0.337348 0.306641 1987.269273 24.345392 0.145553 11.286931 7.998937
std 135042.465042 0.499553 0.499970 11.895852 1.406428 1.754655 0.686343 1.781469 3.639925 1.210058 ... 0.177946 0.178255 0.181399 0.177930 0.175811 1.726825 11.973904 0.308486 0.521000 4.898945
min 110101.000000 1.000000 1.000000 1916.000000 1.000000 -1.000000 1.000000 -1.000000 0.000000 1.000000 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 1986.000000 -12.000000 0.000000 9.000000 0.000000
25% 321002.000000 1.000000 1.000000 1955.000000 1.000000 2.000000 1.000000 2.000000 6.000000 1.000000 ... 0.234758 0.198073 0.257497 0.241156 0.210418 1986.000000 16.000000 0.000000 11.000000 4.000000
50% 411421.000000 1.000000 2.000000 1964.000000 1.000000 3.000000 1.000000 3.000000 8.000000 2.000000 ... 0.340944 0.322726 0.366655 0.343839 0.315769 1987.000000 23.000000 0.000000 11.000000 8.000000
75% 445381.000000 2.000000 2.000000 1971.000000 1.000000 4.000000 2.000000 4.000000 10.000000 2.000000 ... 0.426330 0.414291 0.466546 0.438355 0.418991 1987.000000 33.000000 0.000000 12.000000 12.000000
max 621121.000000 2.000000 2.000000 1999.000000 8.000000 12.000000 4.000000 9.000000 21.000000 9.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1992.000000 72.000000 1.000000 13.000000 16.000000

8 rows × 29 columns

2.2 Plotting the time trend of intergenerational mobility

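import matplotlib as mpl
import matplotlib.pyplot as plt

# Keep only observations whose four mobility measures are all positive
data_draw = data.loc[data["Educorr1"] > 0]
data_draw = data_draw.loc[data_draw["Educorr2"] > 0]
data_draw = data_draw.loc[data_draw["Educorr3"] > 0]
data_draw = data_draw.loc[data_draw["Educorr4"] > 0]
plt.rcParams['font.family'] = 'FangSong'   # FangSong font; only needed when labels contain Chinese text

# Subplot in position 1 of a grid with row=2, col=2
plt.subplot(2, 2, 1)
plt.scatter(data_draw['birthyear'], data_draw['Educorr1'], color="black", s=10, zorder=1)
z = np.polyfit(data_draw.dropna()['birthyear'], data_draw.dropna()['Educorr1'], 1)  # linear trend
p = np.poly1d(z)
# Draw the fitted trend (this call and its line style are assumed; the original computes p but never plots it)
plt.plot(data_draw['birthyear'], p(data_draw['birthyear']), color="red", zorder=2)
plt.ylabel("Educational mobility", fontsize=12)

# Subplot in position 2 of a grid with row=2, col=2
plt.subplot(2, 2, 2)
plt.scatter(data_draw['birthyear'], data_draw['Educorr2'], color="blue", s=10, zorder=1)
z = np.polyfit(data_draw.dropna()['birthyear'], data_draw.dropna()['Educorr2'], 1)
p = np.poly1d(z)
plt.plot(data_draw['birthyear'], p(data_draw['birthyear']), color="red", zorder=2)

# Subplot in position 3
plt.subplot(2, 2, 3)
plt.scatter(data_draw['birthyear'], data_draw['Educorr3'], color="green", s=10, zorder=1)
z = np.polyfit(data_draw.dropna()['birthyear'], data_draw.dropna()['Educorr3'], 1)
p = np.poly1d(z)
plt.plot(data_draw['birthyear'], p(data_draw['birthyear']), color="red", zorder=2)
plt.xlabel("Birth year", fontsize=12)
plt.ylabel("Educational mobility", fontsize=12)

# Subplot in position 4
plt.subplot(2, 2, 4)
plt.scatter(data_draw['birthyear'], data_draw['Educorr4'], color="yellow", s=10, zorder=1)
z = np.polyfit(data_draw.dropna()['birthyear'], data_draw.dropna()['Educorr4'], 1)
p = np.poly1d(z)
plt.plot(data_draw['birthyear'], p(data_draw['birthyear']), color="red", zorder=2)
plt.suptitle("Time trend of intergenerational mobility", fontsize=15)    # the default font size is 12
plt.xlabel("Birth year", fontsize=12)
plt.show()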






2.3 Baseline regression and implementation








count    32922.000000
mean      1962.923881
std         11.895852
min       1916.000000
25%       1955.000000
50%       1964.000000
75%       1971.000000
max       1999.000000
Name: birthyear, dtype: float64

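The earliest birth year is 1916 and the latest is 1999, so five-year bins give 17 groups. The variable birthyear_groupby5 is generated as follows:

a = list(range(1916,1999,5))   # lower bounds: 1916, 1921, ..., 1996
b = list(range(1921,2002,5))   # upper bounds: 1921, 1926, ..., 2001
birthyear_groupby5 = []
for year in data['birthyear']:
    for i in range(0,17):
        # Assumed completion of the truncated condition: a[i] <= year < b[i]
        if a[i] <= year < b[i]:
            birthyear_groupby5.append(i)
# Assumed final step: store the group index as a new column
data['birthyear_groupby5'] = birthyear_groupby5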
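The five-year grouping described above can also be sketched without an explicit double loop. The snippet below is an illustration under assumptions: the short `birthyear` series is a hypothetical stand-in for `data['birthyear']`, and the bins are taken to be left-closed intervals starting at 1916 ([1916, 1921), [1921, 1926), and so on), which is how the lists `a` and `b` appear to be intended.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data['birthyear']
birthyear = pd.Series([1916, 1920, 1921, 1957, 1976, 1996, 1999])

# Integer division: 1916-1920 -> group 0, 1921-1925 -> group 1, ..., 1996-2000 -> group 16
group_floor = (birthyear - 1916) // 5

# The same grouping with pd.cut and left-closed bins [1916, 1921), ..., [1996, 2001)
edges = np.arange(1916, 2002, 5)           # 18 edges define 17 bins
group_cut = pd.cut(birthyear, bins=edges, right=False, labels=False)

print(group_floor.tolist())
print(group_cut.tolist())
```

Both approaches assign the same integer group index to every year, and either result could be stored as the `birthyear_groupby5` column.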


data1 = data[['coun','Educorr1','Eligibility','birthyear_groupby5','province']].dropna()


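*Note: the function below generates both the dummy variables and the intercept term (for more, see "[Machine Learning] OLS Regression with the Statsmodels Package" - -零 - 博客园 (cnblogs))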

def my_OLS(data):
    data = data.dropna()
    x = data['Eligibility']
    # One dummy column per category. sm.categorical is deprecated in recent
    # statsmodels releases, so pd.get_dummies is used to build the same columns.
    dummy_birthyear_groupby5 = pd.get_dummies(data['birthyear_groupby5']).to_numpy(dtype=float)
    dummy_province = pd.get_dummies(data['province']).to_numpy(dtype=float)
    X = np.column_stack((x, dummy_birthyear_groupby5, dummy_province))
    X = sm.add_constant(X)  # prepend the intercept column
    y = data['Educorr1']
    result = sm.OLS(y, X).fit()
    return result



OLS1 = my_OLS(data1)
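As an alternative way to express the same kind of specification (Eligibility plus birth-cohort and province dummies), the statsmodels formula interface can build the dummy columns itself. The sketch below runs on synthetic data: the DataFrame, its values, and the coefficient 0.1 are all assumptions made for illustration and are unrelated to the paper's data. Unlike a full set of dummies, `C(...)` omits one reference level per factor.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with the same column names as the post (values are made up)
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Eligibility": rng.uniform(0, 1, n),
    "birthyear_groupby5": rng.integers(0, 4, n),   # 4 hypothetical cohorts
    "province": rng.choice(["A", "B", "C"], n),    # 3 hypothetical provinces
})
df["Educorr1"] = 0.3 + 0.1 * df["Eligibility"] + rng.normal(scale=0.05, size=n)

# C(...) marks a column as categorical; one level per factor is the reference
res = smf.ols(
    "Educorr1 ~ Eligibility + C(birthyear_groupby5) + C(province)", data=df
).fit()

print(res.params["Eligibility"])
print(len(res.params))  # intercept + Eligibility + 3 cohort dummies + 2 province dummies
```

With this interface the coefficient table is labeled by variable name (for example `Eligibility`) rather than by position (`x1`, `x2`, ...), which makes the output easier to read.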


                           OLS Regression Results                            
Dep. Variable:               Educorr1   R-squared:                       0.061
Model:                            OLS   Adj. R-squared:                  0.060
Method:                 Least Squares   F-statistic:                     70.28
Date:                Thu, 05 May 2022   Prob (F-statistic):               0.00
Time:                        15:38:21   Log-Likelihood:                 11034.
No. Observations:               32575   AIC:                        -2.201e+04
Df Residuals:                   32544   BIC:                        -2.175e+04
Df Model:                          30                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
const          0.2818      0.001    297.766      0.000       0.280       0.284
x1             0.0989      0.003     31.634      0.000       0.093       0.105
x2             0.0130      0.004      3.412      0.001       0.006       0.021
x3             0.0181      0.004      4.738      0.000       0.011       0.026
x4             0.0231      0.004      6.042      0.000       0.016       0.031
x5             0.0178      0.004      4.658      0.000       0.010       0.025
x6             0.0143      0.004      3.753      0.000       0.007       0.022
x7             0.0180      0.004      4.699      0.000       0.010       0.025
x8             0.0189      0.004      4.945      0.000       0.011       0.026
x9             0.0193      0.004      5.056      0.000       0.012       0.027
x10            0.0128      0.004      3.351      0.001       0.005       0.020
x11            0.0102      0.004      2.668      0.008       0.003       0.018
x12            0.0171      0.004      4.477      0.000       0.010       0.025
x13            0.0154      0.004      4.033      0.000       0.008       0.023
x14            0.0133      0.004      3.482      0.000       0.006       0.021
x15            0.0203      0.004      5.318      0.000       0.013       0.028
x16            0.0158      0.004      4.128      0.000       0.008       0.023
x17            0.0143      0.004      3.734      0.000       0.007       0.022
x18            0.0199      0.004      5.194      0.000       0.012       0.027
x19            0.0165      0.004      4.477      0.000       0.009       0.024
x20            0.0072      0.003      2.053      0.040       0.000       0.014
x21            0.0270      0.004      7.287      0.000       0.020       0.034
x22            0.0178      0.003      5.577      0.000       0.012       0.024
x23            0.0134      0.004      3.766      0.000       0.006       0.020
x24           -0.0216      0.003     -6.896      0.000      -0.028      -0.015
x25            0.0106      0.003      3.421      0.001       0.005       0.017
x26           -0.0148      0.003     -4.281      0.000      -0.022      -0.008
x27           -0.0107      0.003     -3.129      0.002      -0.017      -0.004
x28            0.0200      0.003      6.269      0.000       0.014       0.026
x29            0.0855      0.004     21.846      0.000       0.078       0.093
x30           -0.0136      0.003     -4.036      0.000      -0.020      -0.007
x31            0.0515      0.004     13.913      0.000       0.044       0.059
x32            0.0931      0.004     24.275      0.000       0.086       0.101
Omnibus:                     4647.437   Durbin-Watson:                   1.976
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            19392.273
Skew:                          -0.658   Prob(JB):                         0.00
Kurtosis:                       6.544   Cond. No.                     2.77e+15

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.92e-27. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
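Note [2] in the summary, together with Df Model being 30 although 32 regressors were supplied, is consistent with the design matrix containing an intercept plus a complete set of dummies for each categorical variable: each set of dummies sums to the constant column, so the matrix is rank deficient. The small sketch below reproduces this on hypothetical toy data (the `groups` and `x` values are made up) and shows that dropping one dummy per variable restores full column rank.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical variable and one continuous regressor
groups = pd.Series(["a", "b", "c", "a", "b", "c", "a", "b"])
x = np.array([0.1, 0.4, 0.3, 0.9, 0.5, 0.7, 0.2, 0.8])
const = np.ones(len(x))

# Intercept + ALL dummies: the dummy columns sum to the intercept column
full = pd.get_dummies(groups).to_numpy(dtype=float)
X_full = np.column_stack([const, x, full])

# Intercept + dummies with one reference category dropped
reduced = pd.get_dummies(groups, drop_first=True).to_numpy(dtype=float)
X_reduced = np.column_stack([const, x, reduced])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("full dummies:   ", X_full.shape[1], "columns, rank", rank_full)
print("drop_first=True:", X_reduced.shape[1], "columns, rank", rank_reduced)
```

When the rank is lower than the number of columns, the individual dummy coefficients and the intercept are not separately identified, so they should be interpreted with care; the coefficient on a regressor that is not part of the collinear set is not affected by which normalization is chosen.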



