Greedy Match 是基于 Greedy Algorithm 的思想,根据实验组样本在受到特定干预前的各项属性,贪婪的、不放回的生成一个虚拟对照组的方法。该方法的思想清晰透明,且可以根据需要灵活调整约束条件。在大样本的情况下,可以很好的生成所需的对照组。
贪婪算法的核心在于,在每一个阶段,都选择当前阶段的局部最优解。在大多数情况下,依靠这种方法并不能得到全局最优解,但这个算法可以很有效率的逼近全局最优,而且它得到的路径在每个小范围内来看都是“很优秀”的。
在生成虚拟对照组这一问题上,Greedy Algorithm的思路如下:
1. 不放回。
按一定的逻辑从实验组取出一样本,然后在对照池(control pool)中寻找最佳匹配。成功后再取另一样本,从剩余的对照池中寻找最佳匹配。若还存在阈值,则每次需检查最佳匹配间的距离是否超过阈值,若超过则需舍弃。
2. 1:K mapping。
若K>1,则先对实验组的全部样本1:1寻找其最佳匹配,再将整个过程重复K次(不放回)。
最佳匹配意味着对照池中的该样本和选取的样本距离最近。
下面是几种常见的距离定义:
1. 欧氏距离(Euclidean distance)
/*------------------------------------------------------------------*
| MACRO NAME : gmatch
| SHORT DESC : Match 1 or more controls to cases using the
| GREEDY algorithm
*------------------------------------------------------------------*
| CREATED BY : Kosanke, Jon (04/07/2004 16:32)
| : Bergstralh, Erik
*------------------------------------------------------------------*
| PURPOSE
|
| GMATCH Macro to match 1 or more controls for each of N cases
| using the GREEDY algorithm--REPLACES GREEDY option of MATCH macro.
| Changes:
| --cases and controls in same dataset
| --not mandatory to randomly pre-ort cases and controls, but recommended
| --options to transform X's and to choose distance metric
| --input parameters consistent with %DIST macro for optimal matching
|
| *******
|
| Macro name: %gmatch
|
| Authors: Jon Kosanke and Erik Bergstralh
|
| Date: July 23, 2003
| October 31, 2003...tweaked print/means based on "time" var
|
| Macro function:
|
| Matching using the GREEDY algorithm
|
| The purpose of this macro is to match 1 or more controls(from a total
| of M) for each of N cases. The controls may be matched to the cases by
| one or more factors(X's). The control selected for a particular
| case(i) will be the control(j) closest to the case in terms of Dij.
| Dij can be defined in multiple ways. Common choices are the Euclidean
| distance and the weighted sum of the absolute differences between the
| case and control matching factors. I.e.,
|
| Dij= SQRT [SUM { W.k*(X.ik-X.jk)**2} ], or
|
| Dij= SUM { W.k*ABS(X.ik-X.jk) },
|
| where the sum is over the number
| of matching factors X(with index
| k) and W.k = the weight assigned
| to matching factor k and X.ik =
| the value of variable X(k) for
| subject i.
|
| The control(j) selected for a case(i) is the one with the smallest Dij
| (subject to constraints DMAX and DMAXK, defined below). In the case of
| ties, the first one encountered will be used. The higher the user-defined
| weight, the more likely it is that the case and control will be matched
| on the factor. Assign large weights (relative to the other weights) to
| obtain exact matches for two-level factors such as gender. An option to
| using weights might be to standarize the X's in some fashion. The macro
| has options to standardize all X's to mean 0 and variance 1 and to use
| ranks.
|
| The matching algorithm used is the GREEDY method. Using the greedy method,
| once a match is made it is never broken. This may result in inefficiencies
| if a previously matched control would be a better match for the current
| case than those controls currently available. (An alternative method is to
| do optimal matching using the VMATCH & DIST macros. This method guarantees
| the best possible matched set in terms of minimizing the total Dij.)
| The GREEDY method generally produces very good matches, especially if the
| control pool is large relative to the number of cases. When multiple
| controls/case are desired, the algorithm first matches 1 control to all
| cases and then proceeds to select second controls.
|
|
| The gmatch macro checks for missing values of matching variables and the
| time variable(if specified) and deletes those observations from the input
| dataset.
|
| Call statement:
|
|
| %gmatch(data=,group=,id=,
| mvars=,wts=,dmaxk=,dmax=,transf,
| time=, dist=,
| ncontls=,seedca=,seedco=,
| out=,outnmca=,outnmco=,print=);
|
| Parameter definitions(R=required parameter):
|
|
| R data SAS data set containing cases and potential controls. Must
| contain the ID, GROUP, and the matching variables.
|
| R group SAS variable defining cases. Group=1 if case, 0 if control.
|
| R id SAS CHARACTER ID variable for the cases and controls.
|
|
| R mvars List of numeric matching variables common to both case and
| control data sets. For example, mvars=male age birthyr.
|
| R wts List of non-negative weights corresponding to each matching
| variable. For example wts=10 2 1 corresponding to male, age
| and birthyr as in the above example.
|
| dmaxk List of non-negative values corresponding to each matching
| variable. These numbers are the largest possible absolute
| differences compatible with a valid match. Cases will
| NOT be matched to a control if ANY of the INDIVIDUAL
| matching factor differences are >DMAXK. This optional
| parameter allows one to form matches of the type male+/-0,
| age+/-2, birth year+/-5 by specifying DMAXK=0 2 5.
|
| dmax Largest value of Dij considered to be a valid match. If
| you want to match exactly on a two-level factor(such as
| gender coded as 0 or 1) then assign DMAX to be less than
| the weight for the factor. In the example above, one could
| use wt=10 for male and dmax=9. Leave DMAX blank if any
| Dij is a valid match. One would typically NOT use both
| DMAXK and DMAX. The only advantage to using both, would be
| to further restrict potential matches that meet the
| DMAXK criteria.
|
| dist Indicates type of distance to calculate.
|
| 1=weighted sum(over matching vars) of
| absolute case-control differences(default)
|
| 2=weighted Euclidean distance
|
| time Time variable used for risk set matching. Matches are only
| valid if the control time > case time. May need to
|
| transf Indicates whether all matching vars are to be transformed
| (using the combined case+control data) prior to computing
| distances. 0=no(default),
| 1=standardize to mean 0 and variance 1,
| 2=use ranks of matching variables.
|
| ncontls Indicates the number of controls to match to each case. The
| default is 1. With multiple controls per case, the algorithm
| will first match every case to one control and then again
| match each case to a second control, etc. Controls selected
| on the first pass will be stronger matches than those selected in
| later rounds. The output data set contains a variable (cont_n)
| which indicates on which round the control was selected.
|
| seedca Seed value used to randomly sort the cases prior to
| matching. This positive integer will be used as input to
| the RANUNI function. The greedy matching algorithm is
| order dependent which, among other things means that
| cases matched first will be on average more similar to
| their controls than those matched last(as the number of
| control choices will be limited). If the matching order
| is related to confounding factors (possibly age or
| calendar time) then biases may result. Therefore it is
| generally considered good practice when using the GREEDY
| method to randomly sort both the cases and controls
| before beginning the matching process.
|
| seedco Seed value used to randomly sort the controls prior to
| matching using the GREEDY method. This seed value must
| also be a positive integer.
|
|
| print= Option to print data for matched cases. Use PRINT=y to
| print data and PRINT=n or blank to not print. Default is y.
|
| out=name of SAS data set containing the results of the matching
| process. Unmatched cases are not included. See outnm
| below. The default name is __out. This data set will have
| the following layout:
|
| Case_id Cont_id Cont_n Dij Delta_caco MVARS_ca MVARS_co
| 1 67 1 5.2 (Differences & actual
| 1 78 2 6.1 values for matching factors
| 2 52 1 2.9 for cases & controls)
| 2 92 2 3.1
| . . . .
| . . . .
|
| outnmca=name of SAS data set containing NON-matched cases.
| Default name is __nmca .
|
| outnmco=name of SAS data set containing NON-matched controls.
| Default name is __nmco .
| REUSE= Option allow a control can be reused after it's matched to 1 test if REUSE <>0.
|
| References: Bergstralh, EJ and Kosanke JL(1995). Computerized
| matching of controls. Section of Biostatistics
| Technical Report 56. Mayo Foundation.
|
|
| Example: 1-1 matching by male(exact), age(+-2) and year(+-5).
| The wt for male is not relevant, as only exact matches
| on male will be considered. The weight for age(2) is
| double that for year(1).
|
|
| %gmatch(data=all, group=ca_co,id=clinic,
| mvars=male age_od yr_od,
| wts=2 2 1, dmaxk=0 2 5,out=mtch,
| seedca=87877,seedco=987973);
|
*------------------------------------------------------------------*
| OPERATING SYSTEM COMPATIBILITY
|
| UNIX SAS v8 : YES
| UNIX SAS v9 :
| MVS SAS v8 :
| MVS SAS v9 :
| PC SAS v8 :
| PC SAS v9 :
*------------------------------------------------------------------*
| EXAMPLES
|
| Another example is located at the bottom of the code.
*------------------------------------------------------------------*/
%MACRO GMATCH(DATA=,GROUP=,ID=,
MVARS=,WTS=,DMAXK=,DMAX=,DIST=1,
NCONTLS=1, TIME=,TRANSF=0,
SEEDCA=,SEEDCO=,PRINT=y,
OUT=__OUT,OUTNMCA=__NMCA,OUTNMCO=__NMCO,REUSE=0);
%LET BAD=0;
%IF %LENGTH(&DATA)=0 %THEN %DO;
%PUT ERROR: NO DATASET SUPPLIED;
%LET BAD=1;
%END;
%IF %LENGTH(&ID)=0 %THEN %DO;
%PUT ERROR: NO ID VARIABLE SUPPLIED;
%LET BAD=1;
%END;
%IF %LENGTH(&GROUP)=0 %THEN %DO;
%PUT ERROR: NO CASE(1)/CONTROL(0) GROUP VARIABLE SUPPLIED;
%LET BAD=1;
%END;
%IF %LENGTH(&MVARS)=0 %THEN %DO;
%PUT ERROR: NO MATCHING VARIABLES SUPPLIED;
%LET BAD=1;
%END;
%IF %LENGTH(&WTS)=0 %THEN %DO;
%PUT ERROR: NO WEIGHTS SUPPLIED;
%LET BAD=1;
%END;
%LET NVAR=0;
%DO %UNTIL(%SCAN(&MVARS,&NVAR+1,' ')= );
%LET NVAR=%EVAL(&NVAR+1);
%END;
%LET NWTS=0;
%DO %UNTIL(%QSCAN(&WTS,&NWTS+1,' ')= );
%LET NWTS=%EVAL(&NWTS+1);
%END;
%IF &NVAR^= &NWTS %THEN %DO;
%PUT ERROR: #VARS MUST EQUAL #WTS;
%LET BAD=1;
%END;
%LET NK=0;
%IF %QUOTE(&DMAXK)^= %THEN %DO %UNTIL(%QSCAN(&DMAXK,&NK+1,' ')= );
%LET NK=%EVAL(&NK+1);
%END;
%IF &NK>&NVAR %THEN %LET NK=&NVAR;
%DO I=1 %TO &NVAR;
%LET V&I=%SCAN(&MVARS,&I,' ');
%END;
%IF &NWTS>0 %THEN %DO;
DATA _NULL_;
%DO I=1 %TO &NWTS;
%LET W&I=%SCAN(&WTS,&I,' ');
IF &&W&I<0 THEN DO;
PUT 'ERROR: WEIGHTS MUST BE NON-NEGATIVE';
CALL SYMPUT('BAD','1');
END;
%END;
RUN;
%END;
%IF &NK>0 %THEN %DO;
DATA _NULL_;
%DO I=1 %TO &NK;
%LET K&I=%SCAN(&DMAXK,&I,' ');
IF &&K&I<0 THEN DO;
PUT 'ERROR: DMAXK VALUES MUST BE NON-NEGATIVE';
CALL SYMPUT('BAD','1');
END;
%END;
RUN;
%END;
%MACRO MAX1;
%IF &DMAX^= %THEN %DO;
& __D<=&DMAX
%END;
%DO I=1 %TO &NK;
& ABS(__CA&I-__CO&I)<=&&K&I
%END;
%MEND MAX1;
%MACRO REUSE1;
%IF &REUSE =0 %THEN %DO;
%END;
%ELSE %DO;
OR __USED(__K)<__J
%END;
%MEND REUSE1;
%macro greedy;
%GLOBAL BAD2;
data __CHECK; set &DATA;
__id=&id;
if __id="" then delete;
%DO I=1 %TO &NVAR;
IF %scan(&mvars,&i)=. THEN DELETE;
%END;
%IF &TIME^= %THEN %DO;
IF &TIME=. THEN DELETE;
%END;
run;
*** transform data if requested/separate cases & controls;
%if &transf=1 %then %do;
proc standard data=__check m=0 s=1 out=_stdzd; var &mvars;
data _caco;
set _stdzd;
%end;
%if &transf=2 %then %do;
proc rank data=__check out=_ranks; var &mvars;
data _caco;
set _ranks;
%end;
%if &transf=0 %then %do;
data _caco;
set __check;
%end;
DATA __CASE; SET _caco;
if &group=1;
DATA __CASE; SET __CASE END=EOF;
KEEP __IDCA __CA1-__CA&NVAR __R &mvars
%if &time^= %then %do;
__catime
%end;
;
__IDCA=&ID;
%if &time^= %then %do;
__catime=&time;
%end;
%DO I=1 %TO &NVAR;
__CA&I=&&V&I;
%END;
%if &seedca^= %then %do;
SEED=&SEEDCA;
__R=RANUNI( SEED );
%end;
%else %do;
__R=1;
%end;
IF EOF THEN CALL SYMPUT('NCA',_N_);
PROC SORT; BY __R __IDCA;
DATA __CONT; SET _caco;
if &group=0;
DATA __CONT; SET __CONT END=EOF;
KEEP __IDCO __CO1-__CO&NVAR __R &mvars
%if &time^= %then %do;
__cotime
%end;
;
__IDCO=&ID;
%if &time^= %then %do;
__cotime=&time;
%end;
%DO I=1 %TO &NVAR;
__CO&I=&&V&I;
%END;
%if &seedco^= %then %do;
SEED=&SEEDCo;
__R=RANUNI( SEED );
%end;
%else %do;
__R=1;
%end;
IF EOF THEN CALL SYMPUT('NCO',_N_);
RUN;
%LET BAD2=0;
%IF &NCO < %EVAL(&NCA*&NCONTLS) %THEN %DO;
%PUT WARNING: NOT ENOUGH CONTROLS TO MAKE REQUESTED MATCHES;
/*%LET BAD2=1;*/
%END;
%IF &BAD2=0 %THEN %DO;
PROC SORT; BY __R __IDCO; RUN;
DATA __MATCH;
KEEP __IDCA __CA1-__CA&NVAR __DIJ __MATCH __CONT_N
%if &time^= %then %do;
__catime __cotime
%end;
;
ARRAY __USED(&NCO) 8 _TEMPORARY_;
DO __I=1 TO &NCO;
__USED(__I)=0;
END;
DO __J=1 TO &NCA;
DO __I=1 TO &NCONTLS;
SET __CASE POINT=__J;
__SMALL=.;
__MATCH=.;
DO __K=1 TO &NCO;
IF __USED(__K)=0 %REUSE1 THEN DO;
SET __CONT POINT=__K;
%if &dist=2 %then %do;
**wtd euclidian dist;
__D= sqrt(
%do k=1 %to &nvar;
%scan(&wts,&k)*(__ca&k - __co&k)**2
%if &k<&nvar %then + ;
%end;
);
%end;
%else %do;
**wtd sum absolute diff;
__D=
%do k=1 %to &nvar;
%scan(&wts,&k)*abs(__ca&k - __co&k )
%if &k<&nvar %then + ;
%end;
;
%end;
IF __d^=. & (__SMALL=. | __D<__SMALL) %MAX1
%if &time^= %then %do;
& __cotime > __catime
%end;
THEN DO;
__SMALL=__D;
__MATCH=__K;
__DIJ=__D;
__CONT_N=__I;
END;
END;
END;
IF __MATCH^=. THEN DO;
__USED(__MATCH)=__J;
OUTPUT;
END;
END;
END;
STOP;
DATA &OUT;
SET __MATCH;
SET __CONT POINT=__MATCH;
KEEP __IDCA __IDCO __CONT_N __DIJ __CA1-__CA&NVAR
__CO1-__CO&NVAR __d1-__d&nvar __absd1-__absd&nvar __WT1-__WT&NVAR
__catime __cotime __dtime;
%if &time= %then %do;
__cotime=.; __catime=.;
%end;
LABEL
__catime="&time/CASE"
__cotime="&time/CONTROL"
__dtime="&time/ABS. DIFF"
__CONT_N='CONTROL/NUMBER'
__DIJ='DISTANCE/D_IJ'
%DO I=1 %TO &NVAR;
__CA&I="&&V&I/CASE"
__CO&I="&&V&I/CONTROL"
__absd&I="&&V&I/ABS. DIFF "
__d&I="&&V&I/DIFF "
__WT&I="&&V&I/WEIGHT"
%END;
;
%DO I=1 %TO &NVAR;
__d&i= (__CA&I-__CO&I); **raw diff;
__absd&I=abs(__CA&I-__CO&I); **abs diff;
__WT&I=&&W&I;
%END;
__dtime=__cotime-__catime;
PROC SORT DATA=&OUT; BY __IDCA __CONT_N;
proc sort data=__case; by __IDCA;
data &outnmca; merge __case
&out(in=__inout where=(__cont_n=1)); by __idca;
if __inout=0; **non-matches;
proc sort data=__cont; by __IDCO;
proc sort data=&out; by __IDCO;
data &outnmco; merge __cont
&out(in=__inout); by __idco;
if __inout=0; **non-matched controls;
proc sort data=&out; by __IDCA; **re-sort by case id;
%if %upcase(&print)=Y %then %do;
PROC PRINT data=&out LABEL SPLIT='/';
VAR __IDCA __IDCO __CONT_N
__DIJ
%DO I=1 %TO &NVAR;
__absd&I
%END;
%if &time^= %then %do;
__dtime
%end;
%DO I=1 %TO &NVAR;
__CA&I __CO&I
%END;
%if &time^= %then %do;
__catime __cotime
%end;
;
sum __dij;
title9'Data listing for matched cases and controls';
footnote"Greedy matching(gmatch) macro: data=&data group=&group id=&id ";
footnote2" mvars=&mvars wts=&wts dmaxk=&dmaxk dmax=&dmax ncontls=&ncontls";
footnote3" transf=&transf dist=&dist time=&time seedca=&seedca seedco=&seedco";
footnote4" out=&out outnmca=&outnmca outnmco=&outnmco";
run;
title9'Summary data for matched cases and controls--one obs/control';
%if &sysver ge 8 %then %do;
proc means data=&out maxdec=3 fw=8
n mean median min p10 p25 p75 p90 max sum;
%end;
%else %do;
proc means data=&out maxdec=3
n mean min max sum;
%end;
class __cont_n;
var __dij
%do I=1 %TO &NVAR;
__absd&I
%end;
%if &time^= %then %do;
__dtime
%end;
%do I=1 %TO &NVAR;
__ca&I
%end;
%if &time^= %then %do;
__catime
%end;
%do I=1 %TO &NVAR;
__co&I
%end;
%if &time^= %then %do;
__cotime
%end;
;
run;
*** estimate matching var means within matched sets for controls;
proc means data=&out n mean noprint; by __idca;
var __dij
%do i=1 %to &nvar;
__co&i
%end;
__cotime
;
output out=_mcont n=n_co mean=__dijm
%do i=1 %to &nvar;
__com&i
%end;
__tcom
;
data _onecase; set &out; by __idca; if first.__idca;
data __camcon; merge _onecase _mcont; by __idca;
keep __idca n_co __dijm
__dtime __catime __tcom
%do i=1 %to &nvar;
__ca&i __com&i __actd&i __absd&i
%end;
;
%do i=1 %to &nvar;
__absd&i=abs(__ca&i - __com&i);
__actd&i=(__ca&i - __com&i);
%end;
__dtime=__tcom-__catime
;
label
n_co="No./CONTROLS"
__dijm="Average/Dij"
__dtime="&time/Mean Time DIFF"
__tcom="&time/Mean CONT TIME"
%do i=1 %to &nvar; %let vvar=%scan(&mvars,&i);
__absd&i="&vvar/Mean ABS. DIFF"
__com&i="&vvar/Mean CONTROL"
%end;
;
title9'Summary data for matched cases and controls--one obs/case(using average control value)';
%if &sysver ge 8 %then %do;
proc means data=__camcon maxdec=3 fw=8
n mean median min p10 p25 p75 p90 max sum;
%end;
%else %do;
proc means data=__camcon maxdec=3
n mean min max sum;
%end;
var n_co __dijm
%do i=1 %to &nvar;
__absd&i
%end;
%if &time^= %then %do;
__dtime
%end;
%do i=1 %to &nvar;
__ca&i
%end;
%if &time^= %then %do;
__catime
%end;
%do i=1 %to &nvar;
__com&i
%end;
%if &time^= %then %do;
__tcom
%end;
;
%end; **end of print=y loop**;
%END; **end of bad2=0 loop**;
run;
title9; footnote;
run;
%mend greedy;
%IF &BAD=0 %THEN %DO;
%GREEDY
%END;
%MEND GMATCH;
/* **test data;
DATA FAKEREG;
DO I = 1 TO 3000;
*id = _n_;
id=i;
SEX = MOD(I, 2);
IF mod(I, 21) THEN CASE = 0; ELSE CASE = 1;
*AGE = (INT(RANUNI(12378937)*10000)/100);
age= int(ranuni(123789837)*100);
DROP i ;
OUTPUT;
END;
RUN;
data fakereg;set fakereg; timex=5; if case=0 then timex=6;
%gmatch(data=fakereg,group=case, id=id,
mvars=age sex,wts=2 1,dmaxk= 5 0, transf=0,
time=timex, dist=1, ncontls=1,seedca=234098,seedco=0489,
out=regccout,outnmco=matched,print=Y);
run;
*/
在没有预设对照组的情况下,Greedy Match绝对是一个非常好用的生成虚拟对照组的方法。但在使用的过程中也一定要注意,根据控制变量法的思路,添加足够的限定条件。不然可能无法保证后面检测到的差异是由实验组和对照组的区别造成的。