Statistical Procedure | Functions
REG | performs linear regression with many diagnostic capabilities, selects models using one of nine methods, produces scatter plots of raw data and statistics, highlights scatter plots to identify particular observations, and allows interactive changes in both the regression model and the data used to fit the model.
CATMOD | analyzes data that can be represented by a contingency table.
GENMOD | fits generalized linear models.
GLM | uses the method of least squares to fit general linear models.
LOGISTIC | fits logistic models for binomial and ordinal outcomes.
NLIN | builds nonlinear regression models.
PROBIT | performs probit regression as well as logistic regression and ordinal logistic regression.
LIFEREG | fits parametric models to failure-time data that may be right censored.
ROBUSTREG | performs robust regression using Huber M estimation and high breakdown value estimation.
We use the SAS data set drugtest as an example. It has three variables: Drug (the drug type), PreTreatment (the measure taken before treatment) and PostTreatment (the measure taken after treatment). Here are the first ten observations:
Obs Drug PreTreatment PostTreatment
1 A 11 6
2 A 8 0
3 A 5 2
4 A 14 8
5 A 19 11
6 A 6 4
7 A 10 13
8 A 6 1
9 A 11 8
10 A 3 0
The following code fits a general linear model that predicts the post-treatment outcome from the drug type and the pre-treatment measure. It also writes the predicted values and residuals to a new SAS data set.
ods html;
ods graphics on;
PROC GLM data=mylib.drugtest;
   class Drug;                                  /* treat Drug as a categorical variable */
   model PostTreatment = Drug PreTreatment / solution;
   output out=drugest p=drugpred r=resid;       /* save predicted values and residuals */
RUN;
QUIT;
ods graphics off;
ods html close;
The option SOLUTION produces parameter estimates.
Here is the main output.
---------------------------------------------------------------------------------------------------
The GLM Procedure
Dependent Variable: PostTreatment
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 3 871.497403 290.499134 18.10 <.0001
Error 26 417.202597 16.046254
Corrected Total 29 1288.700000
R-Square Coeff Var Root MSE PostTreatment Mean
0.676261 50.70604 4.005778 7.900000
Source DF Type I SS Mean Square F Value Pr > F
Drug 2 293.6000000 146.8000000 9.15 0.0010
PreTreatment 1 577.8974030 577.8974030 36.01 <.0001
Source DF Type III SS Mean Square F Value Pr > F
Drug 2 68.5537106 34.2768553 2.14 0.1384
PreTreatment 1 577.8974030 577.8974030 36.01 <.0001
---------------------------------------------------------------------------------------------------
Let's review this output a bit more carefully.
First, the overall F test is statistically significant (F = 18.10, p < .0001), which means that the model as a whole is statistically significant. The R-Square of 0.676 means that approximately 67.6% of the variance of PostTreatment is accounted for by the model.
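The summary statistics in this table follow directly from the sums of squares and degrees of freedom. The DATA step below is a minimal sketch that recomputes them from the printed values; the variable names are made up for illustration and are not part of the GLM output.

```sas
/* Recompute the fit statistics from the ANOVA table printed above. */
data _null_;
   ss_model = 871.497403;   /* Model sum of squares            */
   ss_error = 417.202597;   /* Error sum of squares            */
   ss_total = 1288.700000;  /* Corrected Total sum of squares  */
   df_model = 3;
   df_error = 26;
   y_mean   = 7.9;          /* PostTreatment Mean              */

   ms_model  = ss_model / df_model;          /* Mean Square for Model   */
   ms_error  = ss_error / df_error;          /* Mean Square for Error   */
   f_value   = ms_model / ms_error;          /* overall F statistic     */
   r_square  = ss_model / ss_total;          /* share of variance explained */
   root_mse  = sqrt(ms_error);               /* Root MSE                */
   coeff_var = 100 * root_mse / y_mean;      /* Coeff Var, in percent   */
   p_value   = 1 - probf(f_value, df_model, df_error);

   put r_square= 8.6 f_value= 8.2 root_mse= 8.6 coeff_var= 8.5 p_value= pvalue6.4;
run;
```

The values written to the log should agree, up to rounding, with the R-Square, F Value, Root MSE and Coeff Var reported by PROC GLM.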
Second, the Type I SS for Drug (293.6) gives the between-drug sums of squares that are obtained for the analysis-of-variance model PostTreatment=Drug. This measures the difference between arithmetic means of posttreatment scores for different drugs, disregarding the covariate. The Type III SS for Drug (68.5537) gives the Drug sum of squares adjusted for the covariate. This measures the differences between Drug LS-means, controlling for the covariate PreTreatment.
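To see the two kinds of means that this paragraph contrasts, one option is to request them in the same PROC GLM step. The sketch below assumes the same mylib.drugtest data set; the MEANS and LSMEANS statements are additions for illustration and are not part of the original program.

```sas
proc glm data=mylib.drugtest;
   class Drug;
   model PostTreatment = Drug PreTreatment / solution ss1 ss3;
   means Drug;                     /* arithmetic means, ignoring the covariate   */
   lsmeans Drug / stderr pdiff;    /* LS-means adjusted for PreTreatment, with
                                      standard errors and pairwise p-values      */
run;
quit;
```

The MEANS output corresponds to the comparison measured by the Type I SS for Drug, and the LSMEANS output corresponds to the covariate-adjusted comparison measured by the Type III SS.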
---------------------------------------------------------------------------------------------------
Standard
Parameter Estimate Error t Value Pr > |t|
Intercept -0.434671164 B 2.47135356 -0.18 0.8617
Drug A -3.446138280 B 1.88678065 -1.83 0.0793
Drug D -3.337166948 B 1.85386642 -1.80 0.0835
Drug F 0.000000000 B . . .
PreTreatment 0.987183811 0.16449757 6.00 <.0001
NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to
solve the normal equations. Terms whose estimates are followed by the letter 'B'
are not uniquely estimable.
---------------------------------------------------------------------------------------------------
Next we look at each independent variable. The t value for PreTreatment is 6.00 and is statistically significant, meaning that the regression coefficient for PreTreatment is significantly different from zero. The coefficient for PreTreatment is 0.987, or approximately 1, meaning that for a one-unit increase in PreTreatment we would expect about a one-unit increase in PostTreatment, holding Drug fixed. The intercept is -0.43467; this is the predicted value for the reference group (drug F) when PreTreatment equals zero. In most cases, the intercept is not very interesting.
For the class variable Drug, the effects are not significant at the 0.05 level. The estimate for drug F is zero because it is the reference group in the model. In PROC GLM, the last level of a class variable is the reference group by default. The negative estimates for drugs A and D indicate that, at the same pre-treatment value, the predicted post-treatment scores for drugs A and D are lower than for drug F.
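As a worked illustration of how the estimates combine, the DATA step below computes predicted post-treatment scores for the three drugs at one pre-treatment value. It is only a sketch: the coefficients are copied from the parameter table above, and the value PreTreatment = 11 and the variable names are chosen for illustration.

```sas
/* Predicted PostTreatment at PreTreatment = 11 for each drug. */
data _null_;
   b0    = -0.434671164;   /* Intercept (reference group: drug F) */
   b_A   = -3.446138280;   /* shift for drug A relative to drug F */
   b_D   = -3.337166948;   /* shift for drug D relative to drug F */
   b_pre =  0.987183811;   /* slope for PreTreatment              */
   pre   = 11;

   pred_A = b0 + b_A + b_pre * pre;
   pred_D = b0 + b_D + b_pre * pre;
   pred_F = b0       + b_pre * pre;   /* reference group: no shift */

   put pred_A= 8.3 pred_D= 8.3 pred_F= 8.3;
run;
```

The predictions come out at about 6.98 for drug A, 7.09 for drug D and 10.42 for drug F, so the three fitted lines are parallel and differ only by the Drug estimates.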
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
With ODS Graphics enabled, PROC GLM produces an analysis-of-covariance plot. The plot makes it clear that the control (drug F) has higher post-treatment scores across the range of pre-treatment scores, while the fitted lines for the two antibiotics (drugs A and D) nearly coincide.
Because we saved the predicted values and residuals in the new SAS data set drugest, we can draw a residual diagnostic plot with PROC GPLOT.
PROC GPLOT data=drugest;
   plot resid*drugpred;   /* residuals (vertical axis) against predicted values (horizontal axis) */
RUN;
QUIT;
3.0 Regression with PROC REG
PROC REG DATA=SAS-dataset;
   MODEL dependent-variable = predictors /
         selection=method R CLI CLM;
   PLOT r.*p.;
RUN;
QUIT;
MODEL | specifies the dependent and independent variables in the model.
SELECTION | specifies the model selection method: forward, backward, etc.
R | requests a residual analysis to be performed.
CLI | requests confidence limits for an individual predicted value.
CLM | displays confidence limits for the expected value of the dependent variable for each observation.
r.*p. | plot of the residuals against the predicted values.
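As a concrete instance of this template, the sketch below fills in the placeholders using sashelp.class, a sample data set shipped with SAS. The choice of data set and of Weight, Height and Age as variables is an assumption made for illustration only.

```sas
/* Residual analysis and confidence limits for a fixed model. */
proc reg data=sashelp.class;
   model Weight = Height Age / r cli clm;
run;
quit;

/* Forward selection among the same candidate predictors. */
proc reg data=sashelp.class;
   model Weight = Height Age / selection=forward;
run;
quit;
```

The first step prints the residual analysis (R) together with limits for individual predictions (CLI) and for the mean response (CLM); the second lets PROC REG choose the predictors by forward selection.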
Demonstrations and explanations:
We use the SAS data set insurance as an example. Here is the DATA step that creates it:
DATA mylib.insurance;
   input time size type @@;   /* @@ reads several observations from each data line */
   sizetype=size*type;        /* interaction term */
   datalines;
17 151 0 26 92 0 21 175 0 30 31 0 22 104 0
0 277 0 12 210 0 19 120 0 4 290 0 16 238 0
28 164 1 15 272 1 11 295 1 38 68 1 31 85 1
21 224 1 20 166 1 13 305 1 30 124 1 14 246 1
;
RUN;
The data set has four variables: time, size, type and sizetype, the interaction term computed as size*type. We fit a linear model that describes the outcome time as a function of the independent variables size and type, and we include sizetype to account for a possible interaction between size and type.
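PROC REG data=mylib.insurance;
   model time = size type sizetype / selection=none;
RUN;
   delete sizetype;                          /* drop the interaction term and refit */
   print;
RUN;
   plot r.*p. time*p.;                       /* residuals and observed values against predicted values */
   output out=insurancepre p=fit r=resid;    /* save predicted values and residuals */
RUN;
QUIT;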
The DELETE statement removes the specified term from the current model. The PRINT statement prints the results for the updated model.
Here is the main output:
---------------------------------------------------------------------------------------------------
Model: MODEL1
Dependent Variable: time
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 3 1504.41904 501.47301 45.49 <.0001
Error 16 176.38096 11.02381
Corrected Total 19 1680.80000
Root MSE 3.32021 R-Square 0.8951
Dependent Mean 19.40000 Adj R-Sq 0.8754
Coeff Var 17.11450
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 33.83837 2.44065 13.86 <.0001
size 1 -0.10153 0.01305 -7.78 <.0001
type 1 8.13125 3.65405 2.23 0.0408
sizetype 1 -0.00041714 0.01833 -0.02 0.9821
---------------------------------------------------------------------------------------------------
1) The output above shows the results for MODEL1, which includes the interaction term. The overall F test indicates that the model is significant. The R-Square value shows that the model explains 89.51% of the variance of the outcome. In the parameter estimates, both main effects are significant at the 0.05 level, but the interaction term is not (p = 0.9821).
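The interaction term can also be tested explicitly with a TEST statement in PROC REG. The sketch below assumes the mylib.insurance data set created earlier; the label Interaction is an arbitrary name chosen for illustration.

```sas
proc reg data=mylib.insurance;
   model time = size type sizetype;
   Interaction: test sizetype = 0;   /* F test that the interaction coefficient is zero */
run;
quit;
```

For a single coefficient, this F test is equivalent to the t test in the parameter table, so it should lead to the same conclusion about sizetype.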
---------------------------------------------------------------------------------------------------
The REG Procedure
Model: MODEL1.1
Dependent Variable: time
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 2 1504.41333 752.20667 72.50 <.0001
Error 17 176.38667 10.37569
Corrected Total 19 1680.80000
Root MSE 3.22113 R-Square 0.8951
Dependent Mean 19.40000 Adj R-Sq 0.8827
Coeff Var 16.60377
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 33.87407 1.81386 18.68 <.0001
size 1 -0.10174 0.00889 -11.44 <.0001
type 1 8.05547 1.45911 5.52 <.0001
---------------------------------------------------------------------------------------------------
2) MODEL1.1 excludes the interaction effect. The model and all of its parameters are significant, and the R-Square is unchanged at 0.8951.
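Because MODEL1.1 has no interaction term, it describes two parallel lines, one for each value of type. The DATA step below is a minimal sketch that evaluates the fitted equation on a small grid; the data set name score, the variable time_hat and the grid of size values are chosen for illustration.

```sas
/* Evaluate the fitted MODEL1.1 equation for both types at a few sizes. */
data score;
   do type = 0 to 1;
      do size = 100 to 300 by 100;
         /* coefficients copied from the MODEL1.1 parameter estimates */
         time_hat = 33.87407 - 0.10174 * size + 8.05547 * type;
         output;
      end;
   end;
run;

proc print data=score noobs;
run;
```

At every size, the prediction for type 1 exceeds the prediction for type 0 by the type coefficient, 8.05547.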
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
3) The plots above are produced by the PLOT statement.
---------------------------------------------------------------------------------------------------
4) We can also use ODS Graphics to produce more diagnostic plots.
ods html;
ods graphics on;
PROC REG data=mylib.insurance;
   model time = size type / selection=none;
RUN;
QUIT;
ods graphics off;
ods html close;
4.0 Polynomial Regression Using PROC REG
Demonstrations and explanations:
The example SAS data set USPopulation has three variables: Population, Year and YearSq.
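DATA mylib.USPopulation;
   input Population @@;          /* @@ reads several observations from each data line */
   retain Year 1780;             /* starting value, carried across observations */
   Year=Year+10;                 /* first observation is 1790, then every 10 years */
   YearSq=Year*Year;             /* quadratic term */
   Population=Population/1000;   /* rescale from thousands to millions */
   datalines;
3929 5308 7239 9638 12866 17069 23191 31443 39818 50155
62947 75994 91972 105710 122775 131669 151325 179323 203211
226542 248710 281422
;
RUN;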
Here are the first nine observations:
Obs Population Year YearSq
1 3.929 1790 3204100
2 5.308 1800 3240000
3 7.239 1810 3276100
4 9.638 1820 3312400
5 12.866 1830 3348900
6 17.069 1840 3385600
7 23.191 1850 3422500
8 31.443 1860 3459600
9 39.818 1870 3496900
We first fit a simple linear model of Population on Year, then add the polynomial term YearSq.
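PROC REG data=mylib.USPopulation;
   var YearSq;                               /* make YearSq available for a later ADD */
   model Population=Year / selection=none;
   plot r.*p.;                               /* residuals against predicted values */
RUN;
   add YearSq;                               /* add the quadratic term and refit */
   print;
   plot / cframe=ligr;                       /* redraw the most recent plot for the new model */
RUN;
   plot (Population predicted. u95. l95.)*Year
        / overlay cframe=ligr;
RUN;
QUIT;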
Any variable that you might add to the model but that is not included in the first MODEL statement must appear in the VAR statement.
The PLOT statement with no variables recreates the most recent plot requested. The final PLOT statement draws the observed values, the predicted values, and the upper and lower 95% limits (U95. and L95.) against Year on a single plot (OVERLAY), and uses the CFRAME= option to control the look of the resulting plot.
Here is the main output:
---------------------------------------------------------------------------------------------------
The REG Procedure
Model: MODEL1
Dependent Variable: Population
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 1 146869 146869 228.92 <.0001
Error 20 12832 641.58160
Corrected Total 21 159700
Root MSE 25.32946 R-Square 0.9197
Dependent Mean 94.64800 Adj R-Sq 0.9156
Coeff Var 26.76175
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 -2345.85498 161.39279 -14.54 <.0001
Year 1 1.28786 0.08512 15.13 <.0001
The REG Procedure
Model: MODEL1.1
Dependent Variable: Population
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 2 159529 79765 8864.19 <.0001
Error 19 170.97193 8.99852
Corrected Total 21 159700
Root MSE 2.99975 R-Square 0.9989
Dependent Mean 94.64800 Adj R-Sq 0.9988
Coeff Var 3.16938
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 21631 639.50181 33.82 <.0001
Year 1 -24.04581 0.67547 -35.60 <.0001
YearSq 1 0.00668 0.00017820 37.51 <.0001
---------------------------------------------------------------------------------------------------
The results tell us that Year and the polynomial term YearSq are both significant. Adding YearSq raises the R-Square from 0.9197 to 0.9989 and reduces the Root MSE from 25.33 to 3.00.
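The improvement from adding YearSq can be checked by hand with a partial F test built from the two ANOVA tables. The DATA step below is a sketch using the printed values (the linear model's error sum of squares is rounded in the output, so the result is approximate); the variable names are chosen for illustration.

```sas
/* Partial F test for YearSq, from the error sums of squares of the two models. */
data _null_;
   sse_linear = 12832;       /* Error SS, MODEL1 (as printed, rounded) */
   sse_quad   = 170.97193;   /* Error SS, MODEL1.1                     */
   mse_quad   = 8.99852;     /* Error Mean Square, MODEL1.1 (19 df)    */

   f_partial = (sse_linear - sse_quad) / mse_quad;   /* 1 numerator df */
   t_squared = 37.51 ** 2;                           /* t value of YearSq, squared */
   p_value   = 1 - probf(f_partial, 1, 19);

   put f_partial= 10.2 t_squared= 10.2 p_value= pvalue6.4;
run;
```

Both quantities come out at roughly 1407, which illustrates that the t test for a single added term and the partial F test comparing the two models are the same test.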
SAS also produces the following three plots:
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
The plot above comes from the first model, which contains only Year. The residuals follow a curved, semicircular pattern instead of scattering randomly. This shape indicates an inadequate model; additional terms (such as the quadratic) may be needed, or the data may need to be transformed before analysis. If a model fits well, the plot of residuals against predicted values should exhibit no apparent trends.
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
We can use ODS Graphics to produce more diagnostic plots:
ods html;
ods graphics on;
PROC REG data=mylib.USPopulation;
   Linear:    model Population=Year;
   Quadratic: model Population=Year YearSq;
RUN;
QUIT;
ods graphics off;
ods html close;
We omit the results here.