The Last Line of Defense in Clinical Research (4): Implementing Propensity Score Matching (PSM) in Python

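No.25 walked through the full workflow for propensity score matching (PSM) in SPSS. That is enough for a basic clinical trial or for publishing a paper, but SPSS shows its limits as soon as you move on to 1:2 or 1:N matching with replacement. This installment therefore covers, in detail, how to run PSM in Python, a scripting language.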

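Before running anything, install numpy, scipy, pandas, scikit-learn, and the PSM package ctmatching (source archive ctmatching-0.0.6-source.zip). See No.24 for the installation steps and details. The data set is again re78, the same one used in No.25.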

In the scripting environment, first import the packages listed above:

import pandas as pd
import numpy as np
from ctmatching import psm, load_re78

Load the re78 data set:

control, treatment = load_re78()

The len function shows that the treatment group (treatment) has 185 observations and the control group (control) has 429.

len(control)

Out[78]: 429

len(treatment)

Out[79]: 185

Use the help function to read the documentation for psm: help(psm)

psm(control, treatment, use_col=None, stratify_order=None, independent=True, k=1)
    Propensity score matching main function.

    If you want to know the inside of the psm algorithm, check
    :func:`stratified_matching`, :func:`non_stratified_matching`,
    :func:`non_repeat_index_matching`, :func:`independent_index_matching`.
    otherwise, just read the parameters' definition.

    Suppose we have m1 control samples, m2 treatment samples. Sample is n-dimension vector.

    :param control: control group sample data, m1 x n matrix. Example::

        [[c1_1, c1_2, ..., c1_n], # c means control
         [c2_1, c2_2, ..., c2_n],
         ...,
         [cm1_1, cm1_2, ..., cm1_n],]

    :type control: numpy.ndarray

    :param treatment: control group sample data, m2 x n matrix. Example::

        [[t1_1, t1_2, ..., t1_n], # t means treatment
         [t2_1, t2_2, ..., t2_n],
         ...,
         [tm1_1, tm1_2, ..., tm1_n],]

    :type treatment: numpy.ndarray

    :param use_col: (default None, use all) list of column index. Example::

        [0, 1, 4, 6, 7, 9] # use first, second, fifth, ... columns

    :type use_col: list or numpy.ndarray

    :param stratify_order: (default None, use normal nearest neighbor)
      list of list. Example::

        # for input data has 6 columns
        # first feature has highest priority
        # [second, third, forth] features' has second highest priority by mean of euclidean distance
        # fifth feature has third priority, ...
        [[0], [1, 2, 3], [4], [5]]

    :type stratify_order: list of list

    :param independent: (default True), if True, same treatment sample could be matched to different control sample.
    :type independent: Boolean

    :param k: (default 1) Number of samples selected from control group.
    :type k: int

    :returns selected_control_index: all control sample been selected for
      entire treatment group.
    :returns selected_control_index_for_each_treatment: selected control sample for each treatment sample.

    selected_control_index: selected control sample index. Example (k = 3)::

        (m2 * k)-length array: [7, 120, 43, 54, 12, 98, ..., 71, 37, 14]

    selected_control_index_for_each_treatment: selected control sample index for each treatment sample. Example (k = 3)::

        # for treatment[0], we have control[7], control[120], control[43]
        # matched by mean of stratification.
        [[7, 120, 43],
         [54, 12, 98],
         ...,
         [71, 37, 14],]

    :raises InputError: if the input parameters are not legal.
    :raises NotEnoughControlSampleError: if don't have sufficient data for independent index matching.

Dizzy yet? Let's keep it simple.

selected_control, selected_control_each_treatment = psm(
    control, treatment, use_col=[1,2,3,4,5,6], stratify_order=None,
    independent=True, k=2)

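The call to psm takes six key arguments: the control group (control), the treatment group (treatment), the columns to match on (use_col), the matching priority (stratify_order), independent, and k. The docstring describes independent as "if True, same treatment sample could be matched to different control sample"; in practice it allows matching with replacement, so the same control sample can be matched to more than one treatment sample. k is the number of control samples matched to each treatment sample; it is set to 2 here, which gives 1:2 matching.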

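selected_control holds the indices of all the control samples that were selected, and selected_control_each_treatment holds, for each treatment sample, the indices of the control samples matched to it.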
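To make the roles of k and matching with replacement more concrete, here is a minimal sketch of 1:k nearest-neighbor matching in plain Python. It is not ctmatching's implementation: the helper match_with_replacement and the toy covariates are hypothetical, and the distance is a plain Euclidean distance on raw covariates rather than anything the library is known to compute. It only illustrates the shape of the two return values described above.

```python
import math

def match_with_replacement(control, treatment, k=1):
    """For each treatment row, return the indices of its k nearest control rows.

    Hypothetical helper for illustration only. A control row can be picked
    for several treatment rows, i.e. matching is done with replacement.
    """
    matches = []
    for t_row in treatment:
        # Euclidean distance from this treatment row to every control row.
        distances = [math.dist(t_row, c_row) for c_row in control]
        # Control indices ordered from nearest to farthest.
        order = sorted(range(len(control)), key=lambda i: distances[i])
        matches.append(order[:k])
    return matches

# Made-up covariates: (age, years of education).
control = [[36, 9], [37, 9], [34, 12], [31, 12], [50, 6]]
treatment = [[35, 9], [35, 8], [33, 11]]

# One list of k control indices per treatment row.
each_treatment = match_with_replacement(control, treatment, k=2)
# Flattened list of every selected control index.
selected = [i for row in each_treatment for i in row]

print(each_treatment)
print(selected)
```

In this toy data the first two treatment rows end up sharing the same two control rows, which is exactly what matching with replacement permits.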

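Next, use a nested for loop. The outer loop walks through treatment: treatment_sample is the current treatment row and index holds the indices of its matched control samples. The inner loop looks up each matched control row, control[index[i]], where i takes the values 0 and 1.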

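for treatment_sample, index in zip(treatment, selected_control_each_treatment):
    # Print the treatment row, then its matched control rows.
    print(treatment_sample)
    print("matches")
    for i in range(2):  # k = 2, so each treatment row has two matches
        print(control[index[i]])
    # Separator line between treatment samples.
    print("=======================================")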

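The output is shown below. Once matching is done, the matched sample can go on to further comparative analysis, which is not listed here.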

[u'NSW183', 1, 35, 9, 1, 0, 1, 1, 13602.43, 13830.64, 12803.97]

matches

[u'PSID27', 0, 36, 9, 1, 0, 1, 1, 13256.4, 8457.484, 0.0]

[u'PSID6', 0, 37, 9, 1, 0, 1, 1, 13685.48, 12756.05, 17833.2]

=======================================

[u'NSW184', 1, 35, 8, 1, 0, 1, 1, 13732.07, 17976.15, 3786.628]

matches

[u'PSID27', 0, 36, 9, 1, 0, 1, 1, 13256.4, 8457.484, 0.0]

[u'PSID6', 0, 37, 9, 1, 0, 1, 1, 13685.48, 12756.05, 17833.2]

=======================================

[u'NSW185', 1, 33, 11, 1, 0, 1, 1, 14660.71, 25142.24, 4181.942]

matches

[u'PSID380', 0, 34, 12, 1, 0, 1, 0, 0.0, 0.0, 18716.88]

[u'PSID293', 0, 31, 12, 1, 0, 1, 0, 0.0, 42.96774, 11023.84]
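A common first step in that follow-up analysis is a covariate balance check. The sketch below computes a standardized mean difference (SMD) with the standard library only. The helper standardized_mean_difference is hypothetical, and the input values are the third field of the six rows printed above, assumed here to be age (the usual layout of this data set, not something the output itself confirms).

```python
import math
import statistics

def standardized_mean_difference(treated, controls):
    """Difference in means divided by the pooled standard deviation.

    Hypothetical helper for illustration only; uses sample variances.
    """
    mean_diff = statistics.mean(treated) - statistics.mean(controls)
    pooled_var = (statistics.variance(treated) + statistics.variance(controls)) / 2
    if pooled_var == 0:
        # No spread in either group: report 0 if the means agree, else infinity.
        return 0.0 if mean_diff == 0 else math.copysign(math.inf, mean_diff)
    return mean_diff / math.sqrt(pooled_var)

# Third field of the rows shown above (assumed to be age).
treated_age = [35, 35, 33]
matched_control_age = [36, 37, 36, 37, 34, 31]

smd = standardized_mean_difference(treated_age, matched_control_age)
print(round(smd, 3))
```

An absolute SMD below roughly 0.1 is often used as a rule of thumb for adequate balance, but three treatment rows are far too few to judge; the check only becomes meaningful on the full matched sample.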
