临床研究的最后一道防线(四):倾向性评分匹配(propensity score matching, PSM) 在Python的实现
No.25介绍了SPSS实现倾向性评分匹配(propensity score matching, PSM)的具体流程,如果用于基本临床试验或者发表论文已经足够,但当进阶进行1:2或者1:N多重有放回匹配时SPSS的劣势就暴露无遗,因此这一讲围绕脚本语言Python实现PSM的流程进行详细探讨。
运算前Python需要安装numpy、scipy、pandas、scikit-learn与PSM算法数据包ctmatching-0.0.6-source.zip (md5),具体安装流程与细节参见24讲。数据文件仍然选择25讲里使用的re78。
进入脚本界面,首先引入上述数据包:
import pandas as pd
import numpy as np
from ctmatching import psm, load_re78
载入数据集re78:
control, treatment = load_re78()
使用len函数可看到处理组treatment的观察数目为185,对照组control的观察数目为429。
len(control)
Out[78]: 429
len(treatment)
Out[79]: 185
使用help函数查阅psm的说明:help(psm)
psm(control, treatment, use_col=None, stratify_order=None, independent=True, k=1)
Propensity score matching main function.
If you want to know the inside of the psm algorithm, check
:func:`stratified_matching`, :func:`non_stratified_matching`,
:func:`non_repeat_index_matching`, :func:`independent_index_matching`.
otherwise, just read the parameters' definition.
Suppose we have m1 control samples, m2 treatment samples. Sample is n-dimension vector.
:param control: control group sample data, m1 x n matrix. Example::
[[c1_1, c1_2, ..., c1_n], # c means control
[c2_1, c2_2, ..., c2_n],
...,
[cm1_1, cm1_2, ..., cm1_n],]
:type control: numpy.ndarray
:param treatment: control group sample data, m2 x n matrix. Example::
[[t1_1, t1_2, ..., t1_n], # t means treatment
[t2_1, t2_2, ..., t2_n],
...,
[tm1_1, tm1_2, ..., tm1_n],]
:type treatment: numpy.ndarray
:param use_col: (default None, use all) list of column index. Example::
[0, 1, 4, 6, 7, 9] # use first, second, fifth, ... columns
:type use_col: list or numpy.ndarray
:param stratify_order: (default None, use normal nearest neighbor)
list of list. Example::
# for input data has 6 columns
# first feature has highest priority
# [second, third, forth] features' has second highest priority by mean of euclidean distance
# fifth feature has third priority, ...
[[0], [1, 2, 3], [4], [5]]
:type stratify_order: list of list
:param independent: (default True), if True, same treatment sample could be matched to different control sample.
:type independent: Boolean
:param k: (default 1) Number of samples selected from control group.
:type k: int
:returns selected_control_index: all control sample been selected for
entire treatment group.
:returns selected_control_index_for_each_treatment: selected control sample for each treatment sample.
selected_control_index: selected control sample index. Example (k = 3)::
(m2 * k)-length array: [7, 120, 43, 54, 12, 98, ..., 71, 37, 14]
selected_control_index_for_each_treatment: selected control sample index for each treatment sample. Example (k = 3)::
# for treatment[0], we have control[7], control[120], control[43]
# matched by mean of stratification.
[[7, 120, 43],
[54, 12, 98],
...,
[71, 37, 14],]
:raises InputError: if the input parameters are not legal.
:raises NotEnoughControlSampleError: if don't have sufficient data for independent index matching.
头晕吧!简单一点。
selected_control, selected_control_each_treatment = psm(
control, treatment, use_col=[1,2,3,4,5,6], stratify_order=None,
independent=True, k=2)
psm需要调用的五个关键参数是:对照组(control)、处理组(treatment)、匹配的变量(use_col), 匹配优先级stratify_order,independent:if True, same treatment sample could be matched to different control sample,即可以进行有放回多重匹配,k为一个处理组匹配的对照组个数,这里选为2,即采用1:2匹配。
selected_control为选择出的对照组,selected_control_each_treatment为处理组匹配的对照组编号。
进行for循环嵌套,外置循环treatment_sample在treatment中进行,index为对照组编号。内循环目的是寻找匹配出的对照组control[index[i]](i的取值范围为0,1)。
for treatment_sample, index in zip(treatment, selected_control_each_treatment):
print treatment_sample
print("matches")
for i in range(2):
print control[index[i]]
以下是结果,匹配出来后可进行后续的进一步比较分析,这里不再罗列。
[u'NSW183', 1, 35, 9, 1, 0, 1, 1, 13602.43, 13830.64, 12803.97]
matches
[u'PSID27', 0, 36, 9, 1, 0, 1, 1, 13256.4, 8457.484, 0.0]
[u'PSID6', 0, 37, 9, 1, 0, 1, 1, 13685.48, 12756.05, 17833.2]
=======================================
[u'NSW184', 1, 35, 8, 1, 0, 1, 1, 13732.07, 17976.15, 3786.628]
matches
[u'PSID27', 0, 36, 9, 1, 0, 1, 1, 13256.4, 8457.484, 0.0]
[u'PSID6', 0, 37, 9, 1, 0, 1, 1, 13685.48, 12756.05, 17833.2]
=======================================
[u'NSW185', 1, 33, 11, 1, 0, 1, 1, 14660.71, 25142.24, 4181.942]
matches
[u'PSID380', 0, 34, 12, 1, 0, 1, 0, 0.0, 0.0, 18716.88]
[u'PSID293', 0, 31, 12, 1, 0, 1, 0, 0.0, 42.96774, 11023.84]