SAS missing-value filtering: dropping variables or observations whose missing ratio exceeds a threshold

1. Dropping variables whose missing-value ratio exceeds a threshold

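options symbolgen;  /* show resolved macro variable values in the log */
/* Sample data: 8 numeric and 4 character variables, 4 observations */
data missing;
input n1 n2 n3 n4 n5 n6 n7 n8 c1$ c2$ c3$ c4$;
datalines;
1 . 1 . 1 . 1 4 a . c .
1 1 . . 2 . . 5 e . g h
1 . 1 . 3 . . 6 . . k l
1 . . . . . . . a b c d
;
run;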
/* Get the number of numeric variables, character variables and
   observations at compile time; no rows are actually read. */
data _null_;
  if 0 then set missing nobs=obs;
  array num_vars[*] _NUMERIC_;
  array char_vars[*] _CHARACTER_;
  call symputx('num_qty', dim(num_vars));
  call symputx('char_qty', dim(char_vars));
  call symputx('m_obs', obs);
  stop;
run;
%put &num_qty &char_qty &m_obs;
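/* Count missing values per variable, then store the names of variables
   whose missing ratio is at least 0.7 in the macro variable mlist. */
data _null_;
  set missing end=finished;
  array num_vars[*] _NUMERIC_;
  array char_vars[*] _CHARACTER_;
  /* temporary arrays keep their values across observations */
  array num_miss[&num_qty] _temporary_ (&num_qty * 0);
  array char_miss[&char_qty] _temporary_ (&char_qty * 0);
  length list $ 32767;  /* a short length such as $ 50 could truncate long lists */
  do i=1 to dim(num_vars);
    if missing(num_vars(i)) then num_miss(i)+1;
  end;
  do i=1 to dim(char_vars);
    if missing(char_vars(i)) then char_miss(i)+1;
  end;
  /* on the last observation, build the space-separated drop list */
  if finished then do;
    do i=1 to dim(num_vars);
      if num_miss(i)/&m_obs. ge 0.7 then list=catx(' ', list, vname(num_vars(i)));
    end;
    do i=1 to dim(char_vars);
      if char_miss(i)/&m_obs. ge 0.7 then list=catx(' ', list, vname(char_vars(i)));
    end;
    call symputx('mlist', list);
  end;
run;
%put &mlist;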
/* Drop the flagged variables */
data notmiss;
  set missing(drop=&mlist);
run;

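In this example, variables whose missing ratio is at least 70% are dropped (the code tests ge 0.7). With the sample data these are n2, n4, n6, n7 and c2.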
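The per-variable missing ratios can be cross-checked independently with PROC FREQ. The following is a minimal sketch, not part of the original method: the format names missfmt and $missfmt are hypothetical names chosen for illustration. Each variable is collapsed into two levels, and the Percent column of the "Missing" row is the ratio that the step above compares against 0.7.

```sas
/* Hypothetical helper formats: collapse every value into two levels */
proc format;
  value $missfmt ' ' = 'Missing' other = 'Not missing';
  value  missfmt  .  = 'Missing' other = 'Not missing';
run;

/* One frequency table per variable; the MISSING option makes the
   missing level count toward the percentages */
proc freq data=missing;
  format _character_ $missfmt. _numeric_ missfmt.;
  tables _all_ / missing nocum;
run;
```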

2. Dropping observations whose missing-value ratio exceeds a threshold

data missing;
input n1 n2 n3 n4 n5 n6 n7 n8 c1$ c2$ c3$ c4$;
datalines;
1 . 1 . 1 . 1 4 a . c .
1 1 . . 2 . . 5 e . g h
1 . 1 . 3 . . 6 . . k l
1 . . . . . . . a b c d
2 . . . 2 . . . a . c d
3 . 4 . 4 . . 6 . . . l
;
run;
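/* Keep only observations with fewer than 50% of their values missing */
data nomissing;
  set missing;
  array num_vars(*) _NUMERIC_;
  array char_vars(*) _CHARACTER_;
  /* missColumns is created after the arrays, so it is not one of their elements */
  missColumns=0;
  do i=1 to dim(num_vars);
    if missing(num_vars(i)) then missColumns+1;
  end;
  do i=1 to dim(char_vars);
    if missing(char_vars(i)) then missColumns+1;
  end;
  if missColumns<(dim(num_vars)+dim(char_vars))*0.5 then output;
  drop i missColumns;  /* keep helper variables out of the result */
run;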

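In this example, observations with 50% or more of their values missing (6 or more of the 12 variables) are dropped. With the sample data only the first two observations are kept.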
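The same row filter can be written more compactly with the CMISS function, which counts missing values among both numeric and character arguments. This is a sketch under the same 50% threshold; the data set name nomissing2 and the variables n_total and n_miss are hypothetical names introduced here.

```sas
data nomissing2;
  set missing;
  array num_vars[*] _NUMERIC_;
  array char_vars[*] _CHARACTER_;
  /* n_total and n_miss are created after the arrays, so they are not counted */
  n_total = dim(num_vars) + dim(char_vars);
  n_miss  = cmiss(of num_vars[*], of char_vars[*]);
  /* subsetting IF: keep rows with fewer than 50% missing values */
  if n_miss < n_total * 0.5;
  drop n_total n_miss;
run;
```

Because the threshold test is unchanged, this should keep the same observations as the loop-based version.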
