How to Automatically Screen for Significant Variables

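The Stata source code below was shared by 宝器兄. The whole workflow uses Stata plus a small Python script. The Python script is only there to generate every combination of the candidate variables, and a link to it can be obtained via the post below.

With so many explanatory variables, which ones should you include to get a significant result?

This post briefly explains the source code as I understand it, to help readers follow the code and adapt it to their own cases.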

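I. Target significant variable: a single variable

Suppose the requirement is: find the combinations of other explanatory variables under which the coefficient on mpg is significant (in a regression of price on mpg, as in the code below), with rep78 always included among the other explanatory variables.

Note: in the examples below, the candidate combinations are drawn from foreign weight length headroom turn.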

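* Load the example data
sysuse auto, clear

* Explanatory variables that must always be included (optional)
local fixed_vars "rep78"
* Combinations of explanatory variables (generated by the Python script)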
local var_lists = "foreign weight length headroom turn foreign,weight foreign,length foreign,headroom foreign,turn weight,length weight,headroom weight,turn length,headroom length,turn headroom,turn foreign,weight,length foreign,weight,headroom foreign,weight,turn foreign,length,headroom foreign,length,turn foreign,headroom,turn weight,length,headroom weight,length,turn weight,headroom,turn length,headroom,turn foreign,weight,length,headroom foreign,weight,length,turn foreign,weight,headroom,turn foreign,length,headroom,turn weight,length,headroom,turn foreign,weight,length,headroom,turn"

* Find the combinations that make mpg significant and display them
foreach var_list of local var_lists{
    local var_list : subinstr local var_list "," " ", all
    qui regress price mpg `fixed_vars' `var_list'
    
    if (_se[mpg] != 0 & abs(_b[mpg]/_se[mpg])>2){
        display in r"`fixed_vars'" " " "`var_list'"
        local con_var_lists = "`con_var_lists'" + "," + "`var_list'"
    }
}

* Run the regression for each significant combination
local con_var_lists : subinstr local con_var_lists " " ".", all
local con_var_lists : subinstr local con_var_lists "," " ", all

foreach con_var_list of local con_var_lists{
    local con_var_list : subinstr local con_var_list "." " ", all
    regress price mpg `fixed_vars' `con_var_list'
}
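
The var_lists string above is described as coming from a Python script, but the script itself is not reproduced in this post. The following is a minimal sketch of how such a string could be generated, assuming the standard-library itertools.combinations is used; the function name build_var_lists is hypothetical and not taken from the original script.

```python
from itertools import combinations


def build_var_lists(candidates):
    """Return every non-empty combination of candidates as one string.

    Variables inside a combination are joined by commas, and the
    combinations themselves are separated by spaces, which is the
    layout the Stata foreach loop expects.
    """
    groups = []
    for size in range(1, len(candidates) + 1):
        for combo in combinations(candidates, size):
            groups.append(",".join(combo))
    return " ".join(groups)


# Candidate variables from the example in the post.
candidates = ["foreign", "weight", "length", "headroom", "turn"]
var_lists = build_var_lists(candidates)
print(var_lists)
```

With five candidates this yields 31 combinations (5 + 10 + 10 + 5 + 1). The printed string can be pasted into the local var_lists = "..." line.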

Key point 1: Local macros

local lclname [=exp | :extended_fcn | "[string]" | `"[string]"']

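A local macro name can be followed by:

  • an expression (=exp)
  • an extended function (:extended_fcn)
  • a "string"
  • a `"string"' (compound quotes, used when the string itself contains double quotes)

local fixed_vars "rep78"    (string)

local var_lists = "foreign weight length..."    (expression)

local var_list : subinstr local var_list "," " ", all    (extended function)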

2. The foreach loop

foreach var_list of local var_lists{
    ...
}
  • foreach lname of local lmacname {
  • foreach lname of global gmacname {

foreach lname of local list { ... } and foreach lname of global list { ... } obtain the list from the indicated place. This method of using foreach produces the fastest executing code.

Official examples:

local grains "rice wheat corn rye barley oats"
foreach x of local grains {
    display "`x'"
}

global money "Franc Dollar Lira Pound"
foreach y of global money {
    display "`y'"
}

Hence, in our code:

local var_lists = "...."
foreach var_list of local var_lists{
   ...
}

Note: when foreach iterates over the list, it splits the items on spaces.

3. The subinstr function

subinstr(s1,s2,s3,n)
Description: s1, where the first n occurrences in s1 of s2 have been replaced with s3

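In other words: return s1 with the first n occurrences of s2 replaced by s3.

In our code:

local var_list : subinstr local var_list "," " ", all

This line replaces every , with a space, because regress expects variable names separated by spaces. Note that it uses the macro extended function form of subinstr rather than the string function quoted above; the all option replaces every occurrence. We can pull this piece out and look at it on its own: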

local var_lists = "foreign foreign,weight foreign,weight,length foreign,weight,length,headroom foreign,weight,length,headroom,turn"
        foreach var_list of local var_lists {
            local var_list : subinstr local var_list "," " ", all
            display "`var_list'"
        }

foreign
foreign weight
foreign weight length
foreign weight length headroom
foreign weight length headroom turn

For comparison, here is the same loop without the replacement. Clearly, passing foreign,weight,length,headroom,turn to regress would raise an error.

local var_lists = "foreign foreign,weight foreign,weight,length foreign,weight,length,headroom foreign,weight,length,headroom,turn"
        foreach var_list of local var_lists {
            display "`var_list'"
        }
foreign
foreign,weight
foreign,weight,length
foreign,weight,length,headroom
foreign,weight,length,headroom,turn

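4. The significance test

The ratio of a regression coefficient to its standard error is the t-value. When the absolute value of t exceeds 1.96 (the large-sample critical value), the p-value reaches the 5% significance level (p<0.05), usually marked with two stars. The condition in the code uses 2 as a round approximation of this threshold, and the _se[mpg] != 0 check guards against dividing by zero.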

_se[mpg] != 0 & abs(_b[mpg]/_se[mpg])>2

The key to understanding this line is the _se[mpg] notation, which Stata calls _variables.

Expressions may also contain _variables (pronounced "underscore variables"), which are built-in system variables that are created and updated by Stata. They are called variables because their names all begin with the underscore character, "_".

[eqno]_b[varname] (synonym: [eqno]_coef[varname]) contains the value (to machine precision) of the coefficient on varname from the most recently fitted model (such as ANOVA, regression, Cox, logit, probit, and multinomial logit).

[eqno]_se[varname] contains the value (to machine precision) of the standard error of the coefficient on varname from the most recently fit model (such as ANOVA,regression, Cox, logit, probit, and multinomial logit).
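
As a rough check on the two thresholds mentioned in this section (1.96 in the text, 2 in the code), the sketch below computes two-sided p-values under a normal approximation. The approximation is an assumption made here for illustration: the exact critical value of the t distribution depends on the residual degrees of freedom and is somewhat larger than 1.96 in small samples. The function names two_sided_p and is_significant are hypothetical.

```python
from statistics import NormalDist


def two_sided_p(t_value):
    """Two-sided p-value for a test statistic, normal approximation (assumption)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_value)))


def is_significant(coef, se, threshold=2.0):
    """Mirror of the Stata condition: se != 0 & abs(b/se) > threshold."""
    return se != 0 and abs(coef / se) > threshold


# Compare the textbook cutoff with the rounded cutoff used in the code.
print(round(two_sided_p(1.96), 4))
print(round(two_sided_p(2.0), 4))
```

Under this approximation, a cutoff of 2 is slightly stricter than 1.96, so every combination that passes the code's test also passes the 1.96 rule.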

5. The display command

display in r"`fixed_vars'" " " "`var_list'"

The r in this line is short for red. This relies on SMCL, Stata's markup language.
SMCL, which stands for Stata Markup and Control Language and is pronounced "smickle", is Stata's output language.
SMCL is the markup language of Stata, and mastering it helps you create nicer outputs for your packages and also write better help files.

In brief, markup languages are computer languages in the sense that they include syntax and are interpreted by computers. They are mainly used for annotating a document. For example, LaTeX, HTML, XHTML, and XML are all markup languages that are used for annotating documents. SMCL is designed based on the same concept and it includes syntax that can be interpreted by Stata for creating electronic documents such as help files, log files, and results window output.
SMCL Markup Language

6. Understanding the syntax

local con_var_lists : subinstr local con_var_lists " " ".", all
local con_var_lists : subinstr local con_var_lists "," " ", all
foreach con_var_list of local con_var_lists{
    local con_var_list : subinstr local con_var_list "." " ", all
    ...
}

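There are three local statements here. To understand them, start from the value that con_var_lists holds at the end of the "find the combinations that make mpg significant and display them" block.

* Load the data
sysuse auto, clear
* Explanatory variables that must always be included (optional)
local fixed_vars "rep78"
* Combinations of explanatory variables (generated by the Python script)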
local var_lists = "foreign weight length headroom turn foreign,weight foreign,length foreign,headroom foreign,turn weight,length weight,headroom weight,turn length,headroom length,turn headroom,turn foreign,weight,length foreign,weight,headroom foreign,weight,turn foreign,length,headroom foreign,length,turn foreign,headroom,turn weight,length,headroom weight,length,turn weight,headroom,turn length,headroom,turn foreign,weight,length,headroom foreign,weight,length,turn foreign,weight,headroom,turn foreign,length,headroom,turn weight,length,headroom,turn foreign,weight,length,headroom,turn"

* Find the combinations that make mpg significant and collect them
foreach var_list of local var_lists{
    local var_list : subinstr local var_list "," " ", all
    qui regress price mpg `fixed_vars' `var_list'
    
    if (_se[mpg] != 0 & abs(_b[mpg]/_se[mpg])>2){
//      display in r"`fixed_vars'" " " "`var_list'"
        local con_var_lists = "`con_var_lists'" + "," + "`var_list'"
    }
}

dis in r"`con_var_lists'"

,foreign,headroom,turn,foreign headroom,foreign turn,headroom turn,foreign headroom turn

The output is ,foreign,headroom,turn,foreign headroom,foreign turn,headroom turn,foreign headroom turn

That is, the following 7 combinations make mpg significant:

  • foreign
  • headroom
  • turn
  • foreign headroom
  • foreign turn
  • headroom turn
  • foreign headroom turn

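First local statement:
local con_var_lists : subinstr local con_var_lists " " ".", all
This replaces each space with ., so that foreign headroom becomes foreign.headroom.
Otherwise the foreach loop would split foreign headroom into two separate items, which would be wrong.

Result after this statement:
,foreign,headroom,turn,foreign.headroom,foreign.turn,headroom.turn,foreign.headroom.turn

Second local statement:
local con_var_lists : subinstr local con_var_lists "," " ", all
This replaces each , with a space, again so that the foreach loop can iterate over the combinations.

Result after this statement:
foreign headroom turn foreign.headroom foreign.turn headroom.turn foreign.headroom.turn

Third local statement:
local con_var_list : subinstr local con_var_list "." " ", all
This replaces each . with a space so the list can be passed to regress. Clearly, passing foreign.headroom.turn to regress would raise an error.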
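
The three replacements can be traced outside Stata as well. The sketch below replays them on the con_var_lists string shown earlier, using Python string methods as a stand-in for the Stata macro functions; the function name split_groups is hypothetical.

```python
def split_groups(con_var_lists):
    """Replay the three substitutions and return one string per combination."""
    # Step 1: protect the spaces inside each combination with dots.
    protected = con_var_lists.replace(" ", ".")
    # Step 2: turn the comma separators into spaces so the list can be tokenized.
    spaced = protected.replace(",", " ")
    # Step 3: inside the loop, restore the spaces within each combination.
    return [token.replace(".", " ") for token in spaced.split()]


con_var_lists = (
    ",foreign,headroom,turn,foreign headroom,foreign turn,"
    "headroom turn,foreign headroom turn"
)
groups = split_groups(con_var_lists)
for group in groups:
    # Each line corresponds to one regress call in the Stata loop.
    print("regress price mpg rep78 " + group)
```

The dot is only a temporary placeholder. Any character that cannot occur in a variable list would serve the same purpose.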

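II. Target significant variables: several variables

Suppose the requirement is now: find the combinations of other explanatory variables under which both mpg and rep78 are significant.

Note: in the examples below, the candidate combinations are drawn from foreign weight length headroom turn.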

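* Load the example data
sysuse auto, clear

* Explanatory variables that must always be included (optional)
local fixed_vars ""
* Combinations of explanatory variables (generated by the Python script)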
local var_lists = "foreign weight length headroom turn foreign,weight foreign,length foreign,headroom foreign,turn weight,length weight,headroom weight,turn length,headroom length,turn headroom,turn foreign,weight,length foreign,weight,headroom foreign,weight,turn foreign,length,headroom foreign,length,turn foreign,headroom,turn weight,length,headroom weight,length,turn weight,headroom,turn length,headroom,turn foreign,weight,length,headroom foreign,weight,length,turn foreign,weight,headroom,turn foreign,length,headroom,turn weight,length,headroom,turn foreign,weight,length,headroom,turn"

* Find the combinations that make mpg and rep78 significant and display them
foreach var_list of local var_lists{
    local var_list : subinstr local var_list "," " ", all
    qui regress price mpg rep78  `fixed_vars' `var_list'
    
    if (_se[mpg] != 0 & abs(_b[mpg]/_se[mpg])>2) &(_se[rep78] != 0 & abs(_b[rep78]/_se[rep78])>2) {
        display in r"`fixed_vars'" " " "`var_list'"
        local con_var_lists = "`con_var_lists'" + "," + "`var_list'"
    }
}

* Run the regression for each significant combination
local con_var_lists : subinstr local con_var_lists " " ".", all
local con_var_lists : subinstr local con_var_lists "," " ", all
foreach con_var_list of local con_var_lists{
    local con_var_list : subinstr local con_var_list "." " ", all
    regress price mpg rep78 `fixed_vars' `con_var_list'
}
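
The only change from the single-variable case is that the condition now requires every target variable to pass the same test. A minimal sketch of that logic follows, assuming coefficients and standard errors are held in dictionaries; the function name all_significant and the numbers are made up for illustration and are not results from the auto data.

```python
def all_significant(coefs, ses, targets, threshold=2.0):
    """Return True only if every target passes se != 0 and |b/se| > threshold."""
    return all(
        ses[name] != 0 and abs(coefs[name] / ses[name]) > threshold
        for name in targets
    )


# Made-up numbers for illustration only, not output from any regression.
coefs = {"mpg": -3.0, "rep78": 1.0}
ses = {"mpg": 1.0, "rep78": 1.0}

print(all_significant(coefs, ses, ["mpg"]))
print(all_significant(coefs, ses, ["mpg", "rep78"]))
```

Adding a third target variable would only mean extending the targets list, which in the Stata code corresponds to appending another parenthesized condition joined with &.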
