proc glm vs proc reg

1.0 Introduction to Regression Procedures in SAS
 

Statistical Procedure    Functions

REG          performs linear regression with many diagnostic capabilities, selects models using one of nine methods, produces scatter plots of raw data and statistics, highlights scatter plots to identify particular observations, and allows interactive changes in both the regression model and the data used to fit the model.
CATMOD       analyzes data that can be represented by a contingency table.
GENMOD       fits generalized linear models.
GLM          uses the method of least squares to fit general linear models.
LOGISTIC     fits logistic models for binomial and ordinal outcomes.
NLIN         builds nonlinear regression models.
PROBIT       performs probit regression as well as logistic regression and ordinal logistic regression.
LIFEREG      fits parametric models to failure-time data that may be right censored.
ROBUSTREG    performs robust regression using Huber M estimation and high breakdown value estimation.

 
In this chapter we introduce the GLM and REG procedures. The REG procedure provides the most general analysis capabilities; the other procedures give more specialized analyses.
 
 
2.0 General Linear Model
 
The GLM procedure (general linear model) uses the method of least squares to fit general linear models relating one or several continuous dependent variables to one or several independent variables.
 
Strengths:
  • direct specification of polynomial effects
  • ease of specifying categorical effects (PROC GLM automatically generates dummy variables for class variables)
Weaknesses:
  • No collinearity diagnostics
  • No influence diagnostics
  • No scatter plots
  • Only one model at one time
Most of the statistics based on predicted and residual values that are available in PROC REG are also available in PROC GLM. However, PROC GLM does not produce collinearity diagnostics, influence diagnostics, or scatter plots. In addition, PROC GLM allows only one model and fits the full model.
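
As a minimal sketch of the two strengths listed above, the step below puts a class variable and a polynomial effect directly in the MODEL statement, without creating dummy variables or a squared column beforehand. It borrows the drugtest data set used in the demonstration below; the quadratic term PreTreatment*PreTreatment is a hypothetical addition for illustration only and is not part of the model fitted in this chapter.

```sas
/* Sketch only: the quadratic term is an illustrative assumption */
proc glm data=mylib.drugtest;
   class Drug;                          /* GLM builds the dummy variables for Drug */
   model PostTreatment = Drug PreTreatment
                         PreTreatment*PreTreatment / solution;
run;
quit;
```

In PROC REG the same model would need the dummy variables and the squared term to be created in a DATA step first.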
 
The general form of PROC GLM can be found in Introduction to ANOVA.
 
 
Demonstrations and explanations:

We use the SAS data set drugtest as an example. It has three variables: Drug (the drug type), PreTreatment and PostTreatment (the measures taken before and after treatment). Here are the first 10 observations:

                                            Pre          Post
                          Obs    Drug    Treatment    Treatment

                            1     A          11            6
                            2     A           8            0
                            3     A           5            2
                            4     A          14            8
                            5     A          19           11
                            6     A           6            4
                            7     A          10           13
                            8     A           6            1
                            9     A          11            8
                           10     A           3            0

The following code fits a general linear model to estimate the effect of drug type and the pre-treatment measure on the post-treatment outcome. In addition, it outputs the predicted values and residuals to a new SAS data set.

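ods html;
ods graphics on;

/* PostTreatment modeled on Drug (class effect) and PreTreatment (covariate) */
PROC GLM data=mylib.drugtest;
      class Drug;
      model PostTreatment = Drug PreTreatment / solution;
      /* save predicted values and residuals in the data set drugest */
      output out=drugest p=drugpred r=resid;
RUN;
QUIT;

ods graphics off;
ods html close;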

The option SOLUTION produces parameter estimates.

Here is the main output.

--------------------------------------------------------------------------------------------------- 

                                    The GLM Procedure

 

Dependent Variable: PostTreatment

 

                                           Sum of

   Source                      DF         Squares     Mean Square    F Value    Pr > F

 

   Model                        3      871.497403      290.499134      18.10    <.0001

   Error                       26      417.202597       16.046254

   Corrected Total             29     1288.700000

 

 

                R-Square     Coeff Var      Root MSE    PostTreatment Mean

 

                0.676261      50.70604      4.005778              7.900000

 

 

   Source                      DF       Type I SS     Mean Square    F Value    Pr > F

 

   Drug                         2     293.6000000     146.8000000       9.15    0.0010

   PreTreatment                 1     577.8974030     577.8974030      36.01    <.0001

 

 

   Source                      DF     Type III SS     Mean Square    F Value    Pr > F

 

   Drug                         2      68.5537106      34.2768553       2.14    0.1384

   PreTreatment                 1     577.8974030     577.8974030      36.01    <.0001

---------------------------------------------------------------------------------------------------

Let's review this output a bit more carefully.

First, the overall F test is statistically significant (F = 18.10, p < .0001), so the model as a whole explains a significant share of the variation in PostTreatment. The R-square of 0.676 means that approximately 67.6% of the variance of the post-treatment measure is accounted for by the model.
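
To make the link between the numbers in the table explicit, here is a small sketch that recomputes the summary statistics from the sums of squares. The values are copied from the output above; the variable names are hypothetical and chosen only for this illustration.

```sas
/* Values copied from the ANOVA table above; variable names are illustrative */
data _null_;
   ss_model = 871.497403;  df_model = 3;
   ss_error = 417.202597;  df_error = 26;
   ss_total = 1288.700000;
   root_mse = 4.005778;    y_mean   = 7.9;

   ms_model  = ss_model / df_model;      /* Mean Square for Model            */
   ms_error  = ss_error / df_error;      /* Mean Square for Error            */
   f_value   = ms_model / ms_error;      /* overall F statistic              */
   r_square  = ss_model / ss_total;      /* share of variance explained      */
   coeff_var = 100 * root_mse / y_mean;  /* Root MSE as a percent of the mean */

   put f_value= 8.2 r_square= 8.6 coeff_var= 8.4;
run;
```

The values written to the log should agree with F Value, R-Square and Coeff Var in the table, up to rounding of the inputs.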

Second, the Type I SS for Drug (293.6) gives the between-drug sums of squares that are obtained for the analysis-of-variance model PostTreatment=Drug. This measures the difference between arithmetic means of posttreatment scores for different drugs, disregarding the covariate. The Type III SS for Drug (68.5537) gives the Drug sum of squares adjusted for the covariate. This measures the differences between Drug LS-means, controlling for the covariate PreTreatment.
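
The two sums of squares can be examined with the sketch below, which assumes the same mylib.drugtest data set. The first step fits the one-way model PostTreatment=Drug, whose Drug sum of squares should match the Type I SS above. The second step adds an LSMEANS statement to display the covariate-adjusted Drug means that the Type III test compares.

```sas
/* One-way ANOVA: Drug only, disregarding the covariate */
proc glm data=mylib.drugtest;
   class Drug;
   model PostTreatment = Drug;
run;
quit;

/* Same model as in the text, plus covariate-adjusted (LS) means */
proc glm data=mylib.drugtest;
   class Drug;
   model PostTreatment = Drug PreTreatment / solution;
   lsmeans Drug / stderr pdiff;   /* LS-means, standard errors, pairwise p-values */
run;
quit;
```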

---------------------------------------------------------------------------------------------------

                                                  Standard

         Parameter              Estimate             Error    t Value    Pr > |t|

 

         Intercept          -0.434671164 B      2.47135356      -0.18      0.8617

         Drug         A     -3.446138280 B      1.88678065      -1.83      0.0793

         Drug         D     -3.337166948 B      1.85386642      -1.80      0.0835

         Drug         F      0.000000000 B       .                .         .

         PreTreatment        0.987183811        0.16449757       6.00      <.0001

 

NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to

      solve the normal equations.  Terms whose estimates are followed by the letter 'B'

      are not uniquely estimable.

 

---------------------------------------------------------------------------------------------------

Next, we look at each independent variable. The t value for PreTreatment is 6.00 and is statistically significant (p < .0001), meaning that the regression coefficient for PreTreatment is significantly different from zero. The coefficient is 0.987, or approximately 1: for a one-unit increase in PreTreatment we expect about a one-unit increase in PostTreatment, holding Drug fixed. The intercept is -0.43467; because drug F is the reference level, this is the predicted PostTreatment for drug F when PreTreatment equals zero. In most cases, the intercept is not very interesting.

 

For the class variable Drug, the individual effects are not significant at the 0.05 level. The estimate for drug F is zero because it is the reference group; in PROC GLM the last level of a class variable (in sorted order) is the reference group by default. The negative estimates for drugs A and D indicate that, at the same PreTreatment value, the predicted post-treatment measure is lower for drugs A and D than for drug F.
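
Because the Drug estimates are differences from the reference level, other comparisons have to be built from them. A minimal sketch, assuming the levels sort as A, D, F as in the table above, uses ESTIMATE statements; the labels in quotes are arbitrary.

```sas
proc glm data=mylib.drugtest;
   class Drug;                                   /* levels sort as A, D, F */
   model PostTreatment = Drug PreTreatment / solution;
   /* difference between the two non-reference drugs */
   estimate 'Drug A vs Drug D' Drug 1 -1 0;
   /* predicted PostTreatment for drug A at PreTreatment = 11 */
   estimate 'Drug A, Pre = 11' intercept 1 Drug 1 0 0 PreTreatment 11;
run;
quit;
```

From the printed estimates, the first contrast should be about -3.446 - (-3.337) = -0.109, and the second about -0.435 - 3.446 + 11 * 0.987 = 6.98.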

 


ODS Graphics can produce an analysis-of-covariance plot in PROC GLM. The plot makes it clear that the control (drug F) has higher post-treatment scores across the range of pre-treatment scores, while the fitted lines for the two antibiotics (drugs A and D) nearly coincide.

Because we saved the predicted values and residuals in the new SAS data set drugest, we can draw a residual diagnostic plot with PROC GPLOT.

PROC GPLOT data=drugest;
       /* residuals (vertical axis) against predicted values (horizontal axis) */
       plot resid*drugpred;
RUN;
QUIT;
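
As an alternative sketch, the same residual plot can be drawn with PROC SGPLOT. This assumes the data set drugest and the variables drugpred and resid created by the OUTPUT statement in the PROC GLM step; the axis labels are illustrative.

```sas
proc sgplot data=drugest;
   scatter x=drugpred y=resid / group=Drug;   /* residuals by predicted value, one marker style per drug */
   refline 0 / axis=y;                        /* horizontal reference line at zero */
   xaxis label="Predicted PostTreatment";
   yaxis label="Residual";
run;
```

Grouping by Drug is a design choice that makes it easier to see whether one drug's residuals sit mostly above or below zero.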

3.0 Regression with PROC REG

The REG procedure provides the most general analysis capabilities:
  • handles multiple regression models
  • provides nine model-selection methods
  • allows interactive changes both in the model and in the data used to fit the model
  • allows linear equality restrictions on parameters
  • tests linear hypotheses and multivariate hypotheses
  • produces collinearity diagnostics, influence diagnostics, and partial regression leverage plots
  • saves estimates, predicted values, residuals, confidence limits, and other diagnostic statistics in output SAS data sets
  • generates plots of data and of various statistics
The general form of a PROC REG step is:

PROC REG DATA=SAS-dataset;
       MODEL dependent-variable = predictors / 
            selection=method R CLI CLM ;
       PLOT  r.*p. ;
RUN ;

QUIT;

MODEL         specifies the dependent and independent variables in the model.
SELECTION=    specifies the model-selection method: FORWARD, BACKWARD, etc.
R             requests an analysis of the residuals.
CLI           requests confidence limits for an individual predicted value.
CLM           displays confidence limits for the expected value of the dependent variable for each observation.
r.*p.         plots the residuals against the predicted values.
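
The diagnostic and model-selection features listed above are requested as options on the MODEL statement. The sketch below is not part of the original demonstrations: it assumes the sample data set sashelp.class that ships with SAS, and the labels Diag and Step and the entry and stay levels of 0.15 are arbitrary choices for illustration.

```sas
proc reg data=sashelp.class;
   /* collinearity diagnostics (VIF, COLLIN) and influence diagnostics */
   Diag: model Weight = Height Age / vif collin influence;
   /* stepwise selection, one of the model-selection methods */
   Step: model Weight = Height Age / selection=stepwise
                                     slentry=0.15 slstay=0.15;
run;
quit;
```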

 
 
 

Demonstrations and explanations:

We use the SAS data set insurance as an example. Here is the data input:

DATA mylib.insurance;

      input time size type @@;

      sizetype=size*type;

      datalines;

   17 151 0   26  92 0   21 175 0   30  31 0   22 104 0

    0 277 0   12 210 0   19 120 0    4 290 0   16 238 0

   28 164 1   15 272 1   11 295 1   38  68 1   31  85 1

   21 224 1   20 166 1   13 305 1   30 124 1   14 246 1

   ;

There are four variables: time, size, type and the interaction term sizetype (the product of size and type). We will fit a linear model that describes the relationship between the outcome time and the independent variables size and type. We also take into account the possible interaction of size and type.

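PROC REG data=mylib.insurance;
      /* full model, including the size-by-type interaction */
      model time = size type sizetype / selection=none;
RUN;
      /* drop the interaction term and refit (MODEL1.1) */
      delete sizetype;
      print;
RUN;
      plot r.*p. time*p.;
      output out=insurancepre p=fit r=resid;
RUN;
QUIT;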

The DELETE statement removes the specified term from the current model, and the PRINT statement displays the results of the refitted model.
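
The interactive DELETE approach is one way to compare the two models. A non-interactive sketch of the same comparison is shown here, assuming the same mylib.insurance data set: both models are fitted with labeled MODEL statements, and a TEST statement checks whether the interaction coefficient is zero. The labels Full, Inter and Reduced are arbitrary.

```sas
proc reg data=mylib.insurance;
   Full:    model time = size type sizetype;
   Inter:   test sizetype = 0;          /* applies to the preceding MODEL */
   Reduced: model time = size type;
run;
quit;
```

For a single coefficient, the F statistic from TEST is the square of the t value reported in the parameter estimates.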

Here is the main output:

--------------------------------------------------------------------------------------------------- 

                                       Model: MODEL1

                                Dependent Variable: time

                                   Analysis of Variance

 

                                          Sum of           Mean

      Source                   DF        Squares         Square    F Value    Pr > F

 

      Model                     3     1504.41904      501.47301      45.49    <.0001

      Error                    16      176.38096       11.02381

      Corrected Total          19     1680.80000

 

 

                   Root MSE              3.32021    R-Square     0.8951

                   Dependent Mean       19.40000    Adj R-Sq     0.8754

                   Coeff Var            17.11450

 

 

                                   Parameter Estimates

 

                                Parameter       Standard

           Variable     DF       Estimate          Error    t Value    Pr > |t|

 

           Intercept     1       33.83837        2.44065      13.86      <.0001

           size          1       -0.10153        0.01305      -7.78      <.0001

           type          1        8.13125        3.65405       2.23      0.0408

           sizetype      1    -0.00041714        0.01833      -0.02      0.9821

--------------------------------------------------------------------------------------------------- 

1) The output above is for MODEL1, which includes the interaction term. The model F test indicates that the model is significant. The R-Square value says that 89.51% of the variance of the outcome is explained by the model. In the parameter estimates, both main effects are significant, but the interaction term is not (p = 0.9821).

 

---------------------------------------------------------------------------------------------------  

 

                                    The REG Procedure

                                     Model: MODEL1.1

                                Dependent Variable: time

 

                                   Analysis of Variance

 

                                          Sum of           Mean

      Source                   DF        Squares         Square    F Value    Pr > F

 

      Model                     2     1504.41333      752.20667      72.50    <.0001

      Error                    17      176.38667       10.37569

      Corrected Total          19     1680.80000

 

 

                   Root MSE              3.22113    R-Square     0.8951

                   Dependent Mean       19.40000    Adj R-Sq     0.8827

                   Coeff Var            16.60377

 

 

                                   Parameter Estimates

 

                                Parameter       Standard

           Variable     DF       Estimate          Error    t Value    Pr > |t|

 

           Intercept     1       33.87407        1.81386      18.68      <.0001

           size          1       -0.10174        0.00889     -11.44      <.0001

           type          1        8.05547        1.45911       5.52      <.0001

--------------------------------------------------------------------------------------------------- 

2) MODEL1.1 excludes the interaction effect. The model and all parameters are significant.
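
Note that R-Square is 0.8951 for both models while Adj R-Sq rises from 0.8754 to 0.8827 once the interaction is dropped. The sketch below recomputes the adjusted values from the two ANOVA tables, using the usual formula 1 - (SSE/(n-p-1)) / (SST/(n-1)) with n = 20 observations. The variable names are hypothetical.

```sas
/* Sums of squares copied from the two ANOVA tables above */
data _null_;
   n = 20;
   ss_total = 1680.80000;
   /* MODEL1: size, type, sizetype (p = 3 predictors) */
   adj_full    = 1 - (176.38096 / (n - 1 - 3)) / (ss_total / (n - 1));
   /* MODEL1.1: size, type (p = 2 predictors) */
   adj_reduced = 1 - (176.38667 / (n - 1 - 2)) / (ss_total / (n - 1));
   put adj_full= 6.4 adj_reduced= 6.4;
run;
```

The error sum of squares barely changes, but the reduced model spends one fewer degree of freedom, so its adjusted R-square is higher.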

 

[Plots not shown: residuals versus predicted values (r.*p.) and time versus predicted values (time*p.)]

3) These two plots are produced by the PLOT statement.

 

---------------------------------------------------------------------------------------------------

4) We can also use ODS Graphics to produce more diagnostic plots.

ods html;
ods graphics on;

PROC REG data=mylib.insurance;
      model time = size type / selection=none;
RUN;
QUIT;

ods graphics off;
ods html close;

 

 

 

 

 

4.0 Polynomial Regression Using PROC REG

Demonstrations and explanations:

The example SAS data set USPopulation has three variables: Population (read in thousands and converted to millions), Year and YearSq.

DATA mylib.USPopulation;

      input Population @@;

      retain Year 1780;

      Year=Year+10;

      YearSq=Year*Year;

      Population=Population/1000;

      datalines;

3929 5308 7239 9638 12866 17069 23191 31443 39818 50155

62947 75994 91972 105710 122775 131669 151325 179323 203211

226542 248710 281422

   ;

Here are the first nine observations:

 

                           Obs    Population    Year     YearSq

 

                             1        3.929     1790    3204100

                             2        5.308     1800    3240000

                             3        7.239     1810    3276100

                             4        9.638     1820    3312400

                             5       12.866     1830    3348900

                             6       17.069     1840    3385600

                             7       23.191     1850    3422500

                             8       31.443     1860    3459600

                             9       39.818     1870    3496900

We first fit a simple linear model of Population on Year, then add the polynomial term YearSq.

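PROC REG data=mylib.USPopulation;
      /* YearSq is not in the first MODEL statement, so declare it here */
      var YearSq;
      model Population=Year / selection=none;
      plot r.*p.;
RUN;
      /* add the quadratic term and refit (MODEL1.1) */
      add YearSq;
      print;
      plot / cframe=ligr;
RUN;
      plot (Population predicted. u95. l95.)*Year
        / overlay cframe=ligr;
RUN;
QUIT;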

Any variable that you might add to the model but that is not included in the first MODEL statement must appear in the VAR statement.

The PLOT statement with no variables re-creates the most recent plot request, here the residual plot, for the new model. The last PLOT statement plots the observed values, the predicted values, and the upper and lower 95% prediction limits against Year on the same axes (OVERLAY); the CFRAME= option sets the background color of the plot frame.
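
For comparison, here is a sketch of the same quadratic model in PROC GLM, where the polynomial effect is written directly in the MODEL statement and the precomputed YearSq column is not needed. The output data set name popfit and the variable names fit and resid are hypothetical.

```sas
proc glm data=mylib.USPopulation;
   model Population = Year Year*Year / solution;   /* Year*Year is the quadratic effect */
   output out=popfit p=fit r=resid;                /* hypothetical output names */
run;
quit;
```

The fit should match MODEL1.1 from PROC REG, since Year*Year and YearSq hold the same values.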

 

Here is the main output:

--------------------------------------------------------------------------------------------------- 

 

                                    The REG Procedure

                                      Model: MODEL1

                             Dependent Variable: Population

 

                                   Analysis of Variance

 

                                          Sum of           Mean

      Source                   DF        Squares         Square    F Value    Pr > F

 

      Model                     1         146869         146869     228.92    <.0001

      Error                    20          12832      641.58160

      Corrected Total          21         159700

 

 

                   Root MSE             25.32946    R-Square     0.9197

                   Dependent Mean       94.64800    Adj R-Sq     0.9156

                   Coeff Var            26.76175

 

 

                                   Parameter Estimates

 

                                Parameter       Standard

          Variable      DF       Estimate          Error    t Value    Pr > |t|

 

          Intercept      1    -2345.85498      161.39279     -14.54      <.0001

          Year           1        1.28786        0.08512      15.13      <.0001

 

                                    The REG Procedure

                                     Model: MODEL1.1

                             Dependent Variable: Population

 

                                   Analysis of Variance

 

                                          Sum of           Mean

      Source                   DF        Squares         Square    F Value    Pr > F

 

      Model                     2         159529          79765    8864.19    <.0001

      Error                    19      170.97193        8.99852

      Corrected Total          21         159700

 

 

                   Root MSE              2.99975    R-Square     0.9989

                   Dependent Mean       94.64800    Adj R-Sq     0.9988

                   Coeff Var             3.16938

 

 

                                   Parameter Estimates

 

                                Parameter       Standard

          Variable      DF       Estimate          Error    t Value    Pr > |t|

 

          Intercept      1          21631      639.50181      33.82      <.0001

          Year           1      -24.04581        0.67547     -35.60      <.0001

          YearSq         1        0.00668     0.00017820      37.51      <.0001

 

--------------------------------------------------------------------------------------------------- 
The results tell us that both the linear term (Year) and the quadratic term (YearSq) are significant, and R-Square rises from 0.9197 to 0.9989 when YearSq is added.

 

SAS also produces three plots, one for each PLOT statement (the plots are not reproduced here): the residuals against the predicted values for the linear model, the same plot for the quadratic model, and the overlay of observed values, predicted values and 95% limits against Year.

The first plot comes from the linear model. Its residuals follow a clear systematic pattern when plotted against the predicted values. The semi-circle shape indicates an inadequate model; perhaps additional terms (such as the quadratic) are needed, or perhaps the data need to be transformed before analysis. If a model fits well, the plot of residuals against predicted values should exhibit no apparent trends.

We can use ODS Graphics to produce more diagnostic plots:

ods html;
ods graphics on;

PROC REG data=mylib.USPopulation;
      Linear: model Population=Year;
      Quadratic: model Population=Year YearSq;
RUN;
QUIT;

ods graphics off;
ods html close;

We omit the results here.
