Inverse Probability Weighting & Doubly Robust Methods

1. Inverse Probability Weighting

1.1 Why use inverse probability weighting

Inverse probability weighting (IPW) is a debiasing method: it corrects distortions, such as Simpson's paradox, that arise when treatment is assigned unevenly across subgroups of the sample.

1.2 Derivation of inverse probability weighting

Inverse probability weighting is a further generalization of backdoor adjustment: applying Bayes' rule to the backdoor adjustment formula rewrites it in a weighted form.

$$
\begin{aligned}
P(y|do(t)) &= \sum_x P(y|t,x)\,\underline{P(x)} \\
&= \sum_x \frac{P(y|t,x)\,P(x)\,P(t|x)}{P(t|x)} \\
&= \sum_x \frac{P(y,t,x)}{P(t|x)}
\end{aligned}
$$
Another way to see it:

$$
\begin{aligned}
P(y|do(t)) &= \sum_x P(y|t,x)\,\underline{P(x)} \\
P(y|t) &= \sum_x P(y|t,x)\,\underline{P(x|t)} = \sum_x P(y,x|t)
\end{aligned}
$$
The two expressions differ only in the underlined factors. If we reweight the population so that $P(t|x) = P(t)$, then $P(x|t) = P(x)$ by Bayes' rule, and $P(y|do(t))$ equals $P(y|t)$. In graph terms, the reweighting cuts the edge $X \rightarrow T$ and closes the backdoor path, so the association that remains is the causal association.
Concretely, the ATE is computed as follows (a numpy sketch follows the list):

  • Identification: $\tau = E[Y(1)-Y(0)] = E\left[\frac{\mathbb{1}\{T=1\}\,Y}{P(T|X)}\right] - E\left[\frac{\mathbb{1}\{T=0\}\,Y}{P(T|X)}\right]$
  • Estimation: $\hat\tau = \frac{1}{N}\sum_{i=1}^N \frac{T_i Y_i}{\hat e(X_i)} - \frac{1}{N}\sum_{i=1}^N \frac{(1-T_i)\,Y_i}{1-\hat e(X_i)}$
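
A minimal numpy sketch of the estimator above (`ipw_ate` is a hypothetical helper; in practice `e_hat` would come from a fitted propensity model such as a logistic regression):

```python
import numpy as np

def ipw_ate(t, y, e_hat):
    """IPW estimate of the ATE.
    t: binary treatment (0/1), y: outcomes, e_hat: estimated P(T=1|X)."""
    treated = np.mean(t * y / e_hat)              # E[ 1{T=1} Y / e(X) ]
    control = np.mean((1 - t) * y / (1 - e_hat))  # E[ 1{T=0} Y / (1 - e(X)) ]
    return treated - control
```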

1.3 A worked example of inverse probability weighting

Original sample distribution

| Group | T=1 | T=0 | P(T=1\|X) |
|-------|-----|-----|-----------|
| X=1 | 8 | 2 | 8/10 = 0.8 |
| X=0 | 10 | 6 | 10/16 = 0.625 |

Reweighted (each count is divided by the probability of the treatment it actually received)

| Group | T=1 | T=0 | P(T=1\|X) |
|-------|-----|-----|-----------|
| X=1 | 8/0.8 = 10 | 2/(1−0.8) = 10 | 10/(10+10) = 0.5 |
| X=0 | 10/0.625 = 16 | 6/(1−0.625) = 16 | 16/(16+16) = 0.5 |

Original sample distribution vs. reweighted

| | P(T=1\|X=1) | P(T=1\|X=0) | P(T=1) |
|---|---|---|---|
| original | 0.8 | 0.625 | (8+10)/(10+16) ≈ 0.69 |
| reweighted | 0.5 | 0.5 | 0.5 |

After reweighting, $P(t|x) = P(t) = 0.5$ in both strata: the dependence of $T$ on $X$ is gone, as the pandas snippet below verifies.
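
The arithmetic of the tables above in a few lines of pandas:

```python
import pandas as pd

counts = pd.DataFrame({"T1": [8, 10], "T0": [2, 6]}, index=["X=1", "X=0"])
ps = counts["T1"] / (counts["T1"] + counts["T0"])  # P(T=1|X): 0.8 and 0.625

reweighted = pd.DataFrame({
    "T1": counts["T1"] / ps,        # 10 and 16
    "T0": counts["T0"] / (1 - ps),  # 10 and 16
})
# P(T=1|X) after reweighting: 0.5 in both strata
print(reweighted["T1"] / (reweighted["T1"] + reweighted["T0"]))
```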

1.4 A shortcoming of inverse probability weighting


The propensity score used by IPW is only the probability of selecting the treatment, while the selection bias it is meant to fix manifests as an imbalance in the distribution of the other covariates across groups. In other words, the inverse propensity score models treatment selection alone, yet is then used to balance the covariate distributions. The covariate balancing propensity score (CBPS) was designed to account for both aspects at once, making the propensity score more accurate. CBPS obtains the propensity score by solving a balancing equation; in the just-identified form of Imai and Ratkovic (2014) it reads

$$
\frac{1}{N}\sum_{i=1}^N \left( \frac{T_i\,\tilde X_i}{\pi_\beta(X_i)} - \frac{(1-T_i)\,\tilde X_i}{1-\pi_\beta(X_i)} \right) = 0,
$$

where $\pi_\beta$ is the (logistic) propensity model and $\tilde X$ is a vector of covariate functions (e.g. $X$ itself): the solution $\beta$ forces the weighted covariate means of the treated and control groups to match.
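
A minimal from-scratch sketch of solving that condition with `scipy.optimize.root`, assuming a logistic $\pi_\beta$ (`cbps_fit` is a hypothetical helper for illustration; real analyses would use a dedicated CBPS implementation):

```python
import numpy as np
from scipy.optimize import root

def cbps_fit(X, t):
    """Solve the just-identified CBPS balancing equations for a logistic pi_beta.
    X: (n, d) covariate matrix; t: binary treatment vector of length n."""
    Xd = np.column_stack([np.ones(len(t)), X])  # add an intercept column

    def balance_moments(beta):
        pi = 1 / (1 + np.exp(-Xd @ beta))       # logistic propensity scores
        w = t / pi - (1 - t) / (1 - pi)         # treated-minus-control IPW weights
        return Xd.T @ w / len(t)                # covariate-balance residuals

    beta = root(balance_moments, x0=np.zeros(Xd.shape[1])).x
    return 1 / (1 + np.exp(-(Xd @ beta)))       # balanced propensity scores
```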

2. Doubly Robust Methods

2.1 Why use Doubly Robust methods

Although inverse probability weighting is simple to implement, it is very sensitive to how the propensity score is specified. The method has two weaknesses (the second is illustrated in the snippet after this list):

  • the propensity score must be estimated sufficiently accurately;
  • if the propensity score gets too close to 0 or 1, some weights blow up, and the variance of the estimated average treatment effect becomes very large.
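
A quick numerical illustration of the second point; clipping the estimated scores (the 0.01/0.99 bounds here are an arbitrary illustrative choice) is a common pragmatic mitigation:

```python
import numpy as np

ps = np.array([0.5, 0.1, 0.01, 0.001])
print(1 / ps)  # weights 2, 10, 100, 1000: a single unit can dominate the estimate

ps_clipped = np.clip(ps, 0.01, 0.99)  # trim extreme propensity scores
print(1 / ps_clipped)  # weights now capped at 100
```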

The Doubly Robust formula:

$$
\widehat{ATE} = \frac{1}{N}\sum_{i=1}^N \left( \frac{T_i\,(Y_i-\hat\mu_1(X_i))}{\hat e(X_i)} + \hat\mu_1(X_i) \right) - \frac{1}{N}\sum_{i=1}^N \left( \frac{(1-T_i)\,(Y_i-\hat\mu_0(X_i))}{1-\hat e(X_i)} + \hat\mu_0(X_i) \right)
$$

where
$\hat\mu_1(X_i)$: estimate of $E[Y|X, T=1]$
$\hat\mu_0(X_i)$: estimate of $E[Y|X, T=0]$
$\hat e(X_i)$: estimate of $P(T=1|X)$

2.2 Derivation of the Doubly Robust property

We need to show that $\widehat{ATE} = E(Y_1) - E(Y_0)$, and that the equality survives as long as either the outcome model or the propensity model is correct.
First, we show: $\hat E(Y_1) = \frac{1}{N} \sum_{i=1}^N \left( \frac{T_i (Y_i-\hat \mu_1(X_i))}{\hat e(X_i)}+\hat \mu_1(X_i) \right)$

  • If $\hat\mu_1(X)$ is correctly specified, then $E[T_i(Y_i-\hat\mu_1(X_i))] = 0$: multiplying by $T_i$ keeps only the observations with $T=1$, and for those the residuals of $\hat\mu_1$ have zero expectation by definition. The correction term therefore vanishes, and, also by definition, $\hat E(Y_1) = E(\hat\mu_1(X_i)) = E(Y_1)$, no matter how bad $\hat e$ is.
  • If $\hat e(X)$ is correctly specified, then $E[T_i - \hat e(X_i)] = 0$, and by inverse probability weighting $\hat E(Y_1) = \frac{1}{N} \sum_{i=1}^N \frac{T_i Y_i}{\hat e(X_i)}$ is consistent. Rearranging the estimator makes this visible (the iterated-expectations step is written out after this list):
    $$
    \begin{aligned}
    \hat E(Y_1) &= \frac{1}{N} \sum_{i=1}^N \left( \frac{T_i (Y_i-\hat \mu_1(X_i))}{\hat e(X_i)}+\hat \mu_1(X_i) \right) \\
    &= \frac{1}{N} \sum_{i=1}^N \left( \frac{T_i Y_i}{\hat e(X_i)} - \frac{T_i \hat \mu_1(X_i)}{\hat e(X_i)}+\hat \mu_1(X_i) \right) \\
    &= \frac{1}{N} \sum_{i=1}^N \left[ \frac{T_i Y_i}{\hat e(X_i)} - \left( \frac{T_i - \hat e(X_i)}{\hat e(X_i)} \right) \hat \mu_1(X_i)\right]
    \end{aligned}
    $$
    The second term inside the bracket has zero expectation when $\hat e$ is correct, so the estimator reduces to the IPW estimator of $E(Y_1)$, no matter how bad $\hat\mu_1$ is. The argument for $\hat E(Y_0)$ is symmetric.
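
The step that kills the second term, written out with iterated expectations (assuming $\hat e$ equals the true propensity $e$):

$$
E\left[\frac{T_i - e(X_i)}{e(X_i)}\,\hat\mu_1(X_i)\right]
= E\left[\frac{\hat\mu_1(X_i)}{e(X_i)}\;E\big[T_i - e(X_i)\,\big|\,X_i\big]\right]
= 0,
$$

since $E[T_i \mid X_i] = e(X_i)$.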

2.3 A Doubly Robust example

Data preprocessing

```python
import pandas as pd
import numpy as np

# load the data
data = pd.read_csv('/Users/wujunhan/Downloads/causal inference learning/learning_mindset.csv')
print(data.columns)

# one-hot encode the categorical features
categ = ["ethnicity", "gender", "school_urbanicity"]
cont = ["school_mindset", "school_achievement", "school_ethnic_minority", "school_poverty", "school_size"]

data_with_categ = pd.concat([
    data.drop(columns=categ),  # dataset without the categorical features
    pd.get_dummies(data[categ], columns=categ, drop_first=False)  # categorical features converted to dummies
], axis=1)

print(data_with_categ.shape)
```

Computing the ATE under three model specifications

```python
from sklearn.linear_model import LogisticRegression, LinearRegression

T = 'intervention'
Y = 'achievement_score'
X = data_with_categ.columns.drop(['schoolid', T, Y])

# doubly robust estimator with both models fit from the data
def doubly_robust(df, X, T, Y):
    # propensity score P(T=1|X) from a (nearly unregularized) logistic regression
    ps = LogisticRegression(C=1e6).fit(df[X], df[T]).predict_proba(df[X])[:, 1]
    # outcome models fit separately on the control and treated units
    mu0 = LinearRegression().fit(df.query(f"{T}==0")[X], df.query(f"{T}==0")[Y]).predict(df[X])
    mu1 = LinearRegression().fit(df.query(f"{T}==1")[X], df.query(f"{T}==1")[Y]).predict(df[X])
    ATE = np.mean(df[T]*(df[Y] - mu1)/ps + mu1) - \
        np.mean((1-df[T])*(df[Y] - mu0)/(1-ps) + mu0)
    return round(ATE, 3)

doubly_robust(data_with_categ, X, T, Y)

# deliberately wrong propensity model: uniform random scores instead of a fit
def doubly_robust_wrong_ps(df, X, T, Y):
    np.random.seed(654)
    ps = np.random.uniform(0.1, 0.9, df.shape[0])
    mu0 = LinearRegression().fit(df.query(f"{T}==0")[X], df.query(f"{T}==0")[Y]).predict(df[X])
    mu1 = LinearRegression().fit(df.query(f"{T}==1")[X], df.query(f"{T}==1")[Y]).predict(df[X])
    ATE = np.mean(df[T]*(df[Y] - mu1)/ps + mu1) - \
        np.mean((1-df[T])*(df[Y] - mu0)/(1-ps) + mu0)
    return round(ATE, 3)

doubly_robust_wrong_ps(data_with_categ, X, T, Y)

# deliberately wrong mu(x) models: standard-normal noise instead of a fit
def doubly_robust_wrong_model(df, X, T, Y):
    np.random.seed(654)
    ps = LogisticRegression(C=1e6).fit(df[X], df[T]).predict_proba(df[X])[:, 1]
    mu0 = np.random.normal(0, 1, df.shape[0])
    mu1 = np.random.normal(0, 1, df.shape[0])
    ATE = np.mean(df[T]*(df[Y] - mu1)/ps + mu1) - \
        np.mean((1-df[T])*(df[Y] - mu0)/(1-ps) + mu0)
    return round(ATE, 3)

doubly_robust_wrong_model(data_with_categ, X, T, Y)
```

Bootstrapping a confidence interval for the ATE

```python
from joblib import Parallel, delayed  # for parallel processing

np.random.seed(88)
# run 1000 bootstrap samples: resample with replacement and re-estimate each time
bootstrap_sample = 1000
ates = Parallel(n_jobs=4)(delayed(doubly_robust)(data_with_categ.sample(frac=1, replace=True), X, T, Y)
                          for _ in range(bootstrap_sample))
ates = np.array(ates)
print("true model, ATE 95% CI:", (round(np.percentile(ates, 2.5), 3), round(np.percentile(ates, 97.5), 3)))

np.random.seed(88)
parallel_fn = delayed(doubly_robust_wrong_ps)
wrong_ps = Parallel(n_jobs=4)(parallel_fn(data_with_categ.sample(frac=1, replace=True), X, T, Y)
                              for _ in range(bootstrap_sample))
wrong_ps = np.array(wrong_ps)
print("wrong ps model, ATE 95% CI:", (round(np.percentile(wrong_ps, 2.5), 3), round(np.percentile(wrong_ps, 97.5), 3)))

np.random.seed(88)
parallel_fn = delayed(doubly_robust_wrong_model)
wrong_mux = Parallel(n_jobs=4)(parallel_fn(data_with_categ.sample(frac=1, replace=True), X, T, Y)
                               for _ in range(bootstrap_sample))
wrong_mux = np.array(wrong_mux)
print("wrong mux model, ATE 95% CI:", (round(np.percentile(wrong_mux, 2.5), 3), round(np.percentile(wrong_mux, 97.5), 3)))
```

Notes
In the example, the author sets drop_first=False in get_dummies; the estimated ATE is the same as with drop_first=True, because multicollinearity does not affect a linear model's predictions. Keeping all dummy columns does, however, introduce near-perfect collinearity, which makes the individual coefficient estimates unreliable:

```python
data_with_categ['gender_2'].corr(data_with_categ['gender_1'])
```

The correlation coefficient is −0.99999.

3. References

  • 因果推断总结 (a summary of causal inference)
  • IPW例子 (an IPW example)
  • Doubly Robust通俗讲解 (an intuitive explanation of Doubly Robust estimation)
  • Doubly Robust严格推导 (a rigorous derivation of Doubly Robust estimation)
  • Doubly Robust参考 (a Doubly Robust reference)
