Homework #11 Dealing With Missing Values Using Amelia
Robert Perez
April 25th, 2018
Introduction
This analysis uses the airquality dataset, which contains daily air quality measurements in New York from May 1, 1973 (a Tuesday) to September 30, 1973. The data were obtained from the New York State Department of Conservation (ozone data) and the National Weather Service (meteorological data). I will use this dataset to examine the factors that affect air quality. The dataset is incomplete, so we will handle its missing values with the Amelia package.
Variables
Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island
Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park
Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
Temp: Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport
Month: Month of the year (5 to 9)
Day: Day of the month (1 to 31)
Research Question
Which factors affect air quality in New York from May to September 1973?
library(tidyverse)
library(Zelig)
library(Amelia)
library(pander)
library(texreg)
library(visreg)
library(lmtest)
library(sjmisc)
library(radiant.data)
library(datasets)
data(airquality)
head(airquality)
require(graphics)
pairs(airquality, panel = panel.smooth, main = "airquality data")
m1 <- lm(Ozone ~ Solar.R + Wind, data = airquality)
m2 <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
htmlreg(list(m1, m2))
summary(airquality)
     Ozone            Solar.R           Wind             Temp           Month            Day
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000   Min.   : 1.0
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000   1st Qu.: 8.0
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000   Median :16.0
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993   Mean   :15.8
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000   3rd Qu.:23.0
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000   Max.   :31.0
 NA's   :37       NA's   :7
Listwise Deletion Method
summary(lm(Ozone ~ Solar.R + Wind + Temp + Month + Day, data = airquality, na.action = na.omit))
Call:
lm(formula = Ozone ~ Solar.R + Wind + Temp + Month + Day, data = airquality,
na.action = na.omit)
Residuals:
    Min      1Q  Median      3Q     Max
-37.014 -12.284  -3.302   8.454  95.348
Coefficients:
             Estimate Std. Error t value       Pr(>|t|)
(Intercept) -64.11632   23.48249  -2.730        0.00742 **
Solar.R       0.05027    0.02342   2.147        0.03411 *
Wind         -3.31844    0.64451  -5.149 0.000001231276 ***
Temp          1.89579    0.27389   6.922 0.000000000366 ***
Month        -3.03996    1.51346  -2.009        0.04714 *
Day           0.27388    0.22967   1.192        0.23576
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 20.86 on 105 degrees of freedom
(42 observations deleted due to missingness)
Multiple R-squared: 0.6249, Adjusted R-squared: 0.6071
F-statistic: 34.99 on 5 and 105 DF, p-value: < 0.00000000000000022
The summary above shows that 42 observations were deleted due to missingness. There is no substitute for a complete dataset: dropping these rows also discards the values that were observed on those days, which reduces the sample size and can distort the estimated relationships between the variables. Imputation, and multiple imputation in particular, is generally a better way to handle missing data. Using the Amelia package, we will fill in plausible values for the missing entries so that all 153 observations can be used and we can make better inferences from the data.
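A quick way to see where the 42 dropped rows come from is to count the incomplete rows directly. The sketch below uses only base R; the object names (n_complete, pattern, fit_cc) are chosen here for illustration and are not part of the original analysis.

```r
# Count the rows that listwise deletion keeps and drops.
data(airquality)

n_total    <- nrow(airquality)                 # all daily observations
n_complete <- sum(complete.cases(airquality))  # rows with no NA in any column
n_dropped  <- n_total - n_complete

# Cross-tabulate missingness in the two incomplete variables.
pattern <- table(ozone_missing = is.na(airquality$Ozone),
                 solar_missing = is.na(airquality$Solar.R))
print(pattern)

# Fitting on na.omit() data matches lm(..., na.action = na.omit).
fit_cc <- lm(Ozone ~ Solar.R + Wind + Temp + Month + Day,
             data = na.omit(airquality))

c(total = n_total, complete = n_complete, dropped = n_dropped,
  residual_df = df.residual(fit_cc))
```

Ozone has 37 missing values and Solar.R has 7, but two rows are missing both, so 42 rows are dropped rather than 44.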
Visualizing the Percentage of Missing Data
V1 <- function(x){sum(is.na(x))/length(x)*100}
apply(airquality,2,V1)
Ozone Solar.R Wind Temp Month Day
24.183007 4.575163 0.000000 0.000000 0.000000 0.000000
apply(airquality,1,V1)
[1] 0.00000 0.00000 0.00000 0.00000 33.33333 16.66667 0.00000 0.00000 0.00000 16.66667
[11] 16.66667 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
[21] 0.00000 0.00000 0.00000 0.00000 16.66667 16.66667 33.33333 0.00000 0.00000 0.00000
[31] 0.00000 16.66667 16.66667 16.66667 16.66667 16.66667 16.66667 0.00000 16.66667 0.00000
[41] 0.00000 16.66667 16.66667 0.00000 16.66667 16.66667 0.00000 0.00000 0.00000 0.00000
[51] 0.00000 16.66667 16.66667 16.66667 16.66667 16.66667 16.66667 16.66667 16.66667 16.66667
[61] 16.66667 0.00000 0.00000 0.00000 16.66667 0.00000 0.00000 0.00000 0.00000 0.00000
[71] 0.00000 16.66667 0.00000 0.00000 16.66667 0.00000 0.00000 0.00000 0.00000 0.00000
[81] 0.00000 0.00000 16.66667 16.66667 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
[91] 0.00000 0.00000 0.00000 0.00000 0.00000 16.66667 16.66667 16.66667 0.00000 0.00000
[101] 0.00000 16.66667 16.66667 0.00000 0.00000 0.00000 16.66667 0.00000 0.00000 0.00000
[111] 0.00000 0.00000 0.00000 0.00000 16.66667 0.00000 0.00000 0.00000 16.66667 0.00000
[121] 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
[131] 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
[141] 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 16.66667
[151] 0.00000 0.00000 0.00000
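The output above shows that Ozone is missing about 24.2% of its values and Solar.R about 4.6%; the other four variables are complete. The row-wise output gives the percentage of the six variables that are missing for each observation: 16.67% means one missing value and 33.33% means two.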
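The same percentages can be computed without a helper function, because the mean of a logical vector is the proportion of TRUE values. This is a minimal base R sketch; col_pct and row_pct are names chosen here for illustration.

```r
data(airquality)

# is.na() on a data frame returns a logical matrix;
# column means give the share of NAs per variable.
col_pct <- colMeans(is.na(airquality)) * 100

# Row means give the share of missing fields per daily observation.
row_pct <- rowMeans(is.na(airquality)) * 100

round(col_pct, 1)

# Row numbers of observations with at least one missing field.
which(row_pct > 0)
```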
Imputation Using Amelia Package
The amelia() function from the Amelia package handles the imputation for us. Here it creates m = 20 imputed datasets.
aq1 <- amelia(x=airquality, m = 20)
aq1$imputations$imp1[1:6, ]
Earlier we saw that the fifth Ozone value was NA. The first six rows of the first imputed dataset show the value that Amelia imputed in its place. Amelia produced 20 imputed datasets; only the first is shown here.
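When reading the imputed rows, it helps to know which cells were originally missing, since Amelia leaves the observed values as they are. A small base R sketch (first_six and missing_cells are illustrative names):

```r
data(airquality)

# The same six rows that are printed from the imputed dataset.
first_six <- head(airquality)

# Row and column positions of the cells that were NA before imputation.
missing_cells <- which(is.na(first_six), arr.ind = TRUE)
missing_cells

# Column names of the affected variables.
unique(names(first_six)[missing_cells[, "col"]])
```

Only these cells differ between the original rows and the imputed rows.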
ggplot(data=airquality) + geom_histogram(mapping=aes(Ozone))
z.out <- zelig(Ozone ~ Solar.R + Wind + Temp + Month + Day, model = "ls", data = aq1, cite = FALSE)
summary(z.out, subset = 1)
Imputed Dataset 1
Call:
z5$zelig(formula = Ozone ~ Solar.R + Wind + Temp + Month + Day,
data = aq1)
Residuals:
    Min      1Q  Median      3Q     Max
-46.300 -13.073  -3.004  12.885  98.904
Coefficients:
             Estimate Std. Error t value             Pr(>|t|)
(Intercept) -78.10871   19.85289  -3.934             0.000128
Solar.R       0.02592    0.02057   1.260             0.209607
Wind         -2.70016    0.54658  -4.940 0.000002098565416779
Temp          2.14191    0.23466   9.128 0.000000000000000502
Month        -3.87204    1.35901  -2.849             0.005013
Day           0.30787    0.19645   1.567             0.119235
Residual standard error: 21.03 on 147 degrees of freedom
Multiple R-squared: 0.5925, Adjusted R-squared: 0.5786
F-statistic: 42.74 on 5 and 147 DF, p-value: < 0.00000000000000022
Next step: Use 'setx' method
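This model differs from the listwise deletion model in several ways. The negative effect of Wind on Ozone is smaller in magnitude when the imputed values are used (-2.70 versus -3.32), the Temp coefficient is larger (2.14 versus 1.90), and Solar.R is no longer statistically significant at the 0.05 level (p = 0.21 versus p = 0.034). Note that this summary covers only the first of the 20 imputed datasets.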
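The summary above covers a single imputed dataset. A common way to combine results across imputations is Rubin's rules: average the coefficient estimates, then add the between-imputation variance to the average within-imputation variance. The sketch below is a generic base R illustration, not the pooling code that Zelig or Amelia uses internally; pool_rubin is a hypothetical helper, and the three input values are made up purely to show the arithmetic.

```r
# Pool one coefficient across m imputed datasets using Rubin's rules.
# q:  vector of m point estimates
# se: vector of m standard errors
pool_rubin <- function(q, se) {
  m <- length(q)
  stopifnot(m >= 2, length(se) == m)
  q_bar   <- mean(q)       # pooled point estimate
  within  <- mean(se^2)    # average within-imputation variance
  between <- var(q)        # between-imputation variance
  total   <- within + (1 + 1 / m) * between
  list(estimate = q_bar, se = sqrt(total))
}

# Made-up estimates and standard errors, for illustration only.
pooled <- pool_rubin(q = c(-2.70, -2.90, -2.50), se = c(0.55, 0.60, 0.50))
pooled
```

The pooled standard error is larger than the average of the individual standard errors because it also reflects how much the estimate varies from one imputed dataset to the next. Amelia provides mi.meld() for this kind of combination.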
z.out$setx()
z.out$sim()
plot(z.out)
Conclusion
The imputed values that replaced the NAs had an effect on the models we ran above. By using the Amelia package we were able to recover some information from the two variables that contained NA values. This let us work with completed datasets built from imputed values, which is a better way to deal with missing values than deleting the incomplete observations.
Original link: https://www.rpubs.com/RobertPerez63/384791