因果推断dowhy之-401(k)资格对净金融资产的影响

0x01. 问题背景

在本案例研究中,我们将使用来自401(k)分析的真实数据来解释如何使用因果库来估计平均治疗效果(ATE)和条件ATE (CATE)。

本案例数据来自真实数据。在20世纪80年代初,美国政府为雇员推出了几种税收递延储蓄选择,以增加个人退休储蓄。一个受欢迎的选择是401(k)计划,该计划允许员工将工资的一部分存入个人账户。考虑到由于个人特征(特别是收入)造成的差异性,这里的目标是了解401(k)资格对净金融资产(即401(k)余额和非401(k)资产的总和)的影响。

由于401(k)计划是由雇主提供的,因此只有提供这些计划的公司的员工才有资格参加。因此,我们正在处理一项非随机研究。有几个因素(如教育程度、储蓄偏好)会影响401(k)计划的资格以及净金融资产。

0x02. 数据

我们考虑的样本来自1991年的收入和项目参与调查。样本由家庭组成,其中参考个人年龄在25-64岁之间,至少有一人受雇,但没有人是自雇的。样本中有9915户的记录。对于每个家庭,记录了44个变量,包括家庭参考人参加401(k)计划的资格(治疗)、净金融资产(结果)和其他协变量,如年龄、收入、家庭规模、教育程度、婚姻状况等。我们特别考虑了16个协变量。部分变量总结如下:

Variable Name Type Details
e401 Treatment eligibility for the 401(k) plan
net_tfa Outcome net financial assets (in USD)
age Covariate Age
inc Covariate income (in USD)
fsize Covariate family size
educ Covariate education (in years)
male Covariate is a male?
db Covariate defined benefit pension
marr Covariate married?
twoearn Covariate two earners
pira Covariate participation in IRA
hown Covariate home owner?
hval Covariate home value (in USD)
hequity Covariate home equity (in USD)
hmort Covariate home mortgage (in USD)
nohs Covariate no high-school? (one-hot encoded)
hs Covariate high-school? (one-hot encoded)
smcol Covariate some-college? (one-hot encoded)

该数据集可从’ hdm https://rdrr.io/cran/hdm/man/pension.html ’ __ R包中在线公开获得。为了更方便的做实验,经过一系列的实验,将数据下载至本地进行实验。具体如何搞到的数据,参考整理的博客:hdm数据R语言获取教程

0x03. 实验

0x03_1. 读取数据

import pandas as pd
df = pd.read_csv("data/pension.csv")
df.head()

数据结果如下:

ira a401 hval hmort hequity nifa net_nifa tfa net_tfa tfa_he tw age inc fsize educ db marr male twoearn dum91 e401 p401 pira nohs hs smcol col icat ecat zhat net_n401 hown i1 i2 i3 i4 i5 i6 i7 a1 a2 a3 a4 a5
0 0 0 69000 60150 8850 100 -3300 100 -3300 5550 53550 31 28146 5 12 0 1 0 0 1 0 0 0 0 1 0 0 3 2 0.273178 -3300 1 0 0 1 0 0 0 0 0 1 0 0 0
1 0 0 78000 20000 58000 61010 61010 61010 61010 119010 124635 52 32634 5 16 0 0 0 0 1 0 0 0 0 0 0 1 4 4 0.386641 61010 1 0 0 0 1 0 0 0 0 0 0 1 0
2 1800 0 200000 15900 184100 7549 7049 9349 8849 192949 192949 50 52206 3 11 0 1 1 1 1 0 0 1 1 0 0 0 6 1 0.533650 8849 1 0 0 0 0 0 1 0 0 0 0 1 0
3 0 0 0 0 0 2487 -6013 2487 -6013 -6013 -513 28 45252 4 15 0 1 0 1 1 0 0 0 0 0 1 0 5 3 0.324319 -6013 0 0 0 0 0 1 0 0 1 0 0 0 0
4 0 0 300000 90000 210000 10625 -2375 10625 -2375 207625 212087 42 33126 3 12 1 0 0 0 1 0 0 0 0 1 0 0 4 2 0.602807 -2375 1 0 0 0 1 0 0 0 0 0 1 0 0

0x03_2. 构建因果图

使用e401作为Treatment,net_tfa作为out_come,其他协变量作为confounder,构建因果图代码如下:

import networkx as nx
import dowhy.gcm as gcm

treatment_var = "e401"
outcome_var = "net_tfa"
covariates = ["age","inc","fsize","educ","male","db",
              "marr","twoearn","pira","hown","hval",
              "hequity","hmort","nohs","hs","smcol"]

edges = [(treatment_var, outcome_var)]
edges.extend([(covariate, treatment_var) for covariate in covariates])
edges.extend([(covariate, outcome_var) for covariate in covariates])

causal_graph = nx.DiGraph(edges)
gcm.util.plot(causal_graph, figure_size=[20, 20])

效果图如下:
因果推断dowhy之-401(k)资格对净金融资产的影响_第1张图片

0x03_3. 数据分析

在我们为变量分配因果模型之前,让我们绘制它们的直方图,以了解变量的分布。

import matplotlib.pyplot as plt

cols = [treatment_var, outcome_var]
cols.extend(covariates)
plt.figure(figsize=(10,5))
for i, col in enumerate(cols):
    plt.subplot(3,6,i+1)
    plt.grid(False)
    plt.hist(df[col])
    plt.xlabel(col)
plt.tight_layout()
plt.show()

绘制结果如下
因果推断dowhy之-401(k)资格对净金融资产的影响_第2张图片
我们观察到,实值变量不遵循众所周知的参数分布,如高斯分布。因此,当这些变量没有父变量时,我们拟合经验分布,这也适用于分类变量。

0x03_4. 数据增加噪声增强鲁棒性

让我们将因果模型分配给变量。对于treatment变量,我们分配了一个带有随机森林分类器的分类器功能因果模型(FCM)。对于outcome变量,我们分配了一个加性噪声模型,其中随机森林回归作为函数和噪声的经验分布。我们将经验分布分配给其他变量,因为它们在因果图中没有父节点。

causal_model = gcm.StructuralCausalModel(causal_graph)
causal_model.set_causal_mechanism(treatment_var, gcm.ClassifierFCM(gcm.ml.create_random_forest_classifier()))
causal_model.set_causal_mechanism(outcome_var, gcm.AdditiveNoiseModel(gcm.ml.create_random_forest_regressor()))
for covariate in covariates:
    causal_model.set_causal_mechanism(covariate, gcm.EmpiricalDistribution())

为了适合分类器FCM,我们将处理列转换为字符串类型。

df = df.astype({treatment_var: str})

0x03_5. 从数据中拟合因果模型

gcm.fit(causal_model, df)

输出如下:

Fitting causal mechanism of node smcol: 100%|██████████| 18/18 [00:06<00:00,  2.68it/s]

在计算CATE之前,我们首先将家庭划分为收入百分位数的等宽箱(equi-width bins)。这使我们能够研究对不同收入群体的影响。

import numpy as np

percentages = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
bin_edges = [0]
bin_edges.extend(np.quantile(df.inc, percentages[1:]).tolist())
bin_edges[-1] += 1 # adding 1 to the last edge as last edge is excluded by np.digitize

groups = [f'{percentages[i]*100:.0f}%-{percentages[i+1]*100:.0f}%' for i in range(len(percentages)-1)]
group_index_to_group_label = dict(zip(range(1, len(bin_edges)+1), groups))

现在我们可以计算CATE了。为此,我们对拟合因果图中的治疗变量进行随机干预,从介入分布中抽取样本,按收入组分组观察,然后计算每个组的治疗效果。

np.random.seed(47)

def estimate_cate():
    samples = gcm.interventional_samples(causal_model,
                                         {treatment_var: lambda x: np.random.choice(['0', '1'])},
                                         observed_data=df)
    eligible = samples[treatment_var] == '1'
    ate = samples[eligible][outcome_var].mean() - samples[~eligible][outcome_var].mean()
    result = dict(ate = ate)

    group_indices = np.digitize(samples['inc'], bin_edges)
    samples['group_index'] = group_indices

    for group_index in group_index_to_group_label:
        group_samples = samples[samples['group_index'] == group_index]
        eligible_in_group = group_samples[treatment_var] == '1'
        cate = group_samples[eligible_in_group][outcome_var].mean() - group_samples[~eligible_in_group][outcome_var].mean()
        result[group_index_to_group_label[group_index]] = cate

    return result

group_to_median, group_to_ci = gcm.confidence_intervals(estimate_cate, num_bootstrap_resamples=100)
print(group_to_median)
print(group_to_ci)

输出结果如下:

{'ate': 6519.046476486404, '0%-20%': 3985.972442541254, '20%-40%': 3109.9999288096888, '40%-60%': 5731.625707624532, '60%-80%': 7605.467796966453, '80%-100%': 11995.55917989574}
{'ate': array([4982.99412698, 8339.97497725]), '0%-20%': array([2630.16909916, 5676.94495668]), '20%-40%': array([1252.7312225 , 5215.15452742]), '40%-60%': array([3533.43542901, 8243.86661569]), '60%-80%': array([ 4726.56666574, 10603.23313684]), '80%-100%': array([ 4981.36999637, 19280.14639468])}

如置信区间所示[4982.99, 8339.97],401(k)资格对净金融资产的平均处理效果为正。 现在,让我们画出不同收入群体的CATEs,以便清楚地了解情况。

fig = plt.figure(figsize=(8,4))
for x, group in enumerate(groups):
    ci = group_to_ci[group]
    plt.plot((x, x), (ci[0], ci[1]), 'ro-', color='orange')
ax = fig.axes[0]
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xticks(range(len(groups)), groups)
plt.xlabel('Income group')
plt.ylabel('ATE of 401(k) eligibility on net financial assets')
plt.show()

效果如下:
因果推断dowhy之-401(k)资格对净金融资产的影响_第3张图片
当一个人从低收入群体向高收入群体移动时,这种影响就会增加。这一结果似乎与不同收入群体的资源约束相一致。

你可能感兴趣的:(论文学习,r语言,开发语言)