2020-09-01-Dealing With Missing Values Using Amelia

Homework #11 Dealing With Missing Values Using Amelia

Robert Perez

April 25th, 2018

Introduction

The dataset used is airquality, which contains daily air quality measurements in New York from May 1, 1973 (a Tuesday) to September 30, 1973. The data were obtained from the New York State Department of Conservation (ozone data) and the National Weather Service (meteorological data). I will use this dataset to examine the factors that affect air quality. The dataset is incomplete, so we will deal with the missing values using the Amelia package.

Variables

Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island

Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park

Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport

Temp: Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport

Month: Month of the year (5 to 9, that is, May to September)

Day: Day of the month (1 to 31)

Research Question

Which factors affect air quality in New York over the five-month span from May to September 1973?

library(tidyverse)
library(Zelig)
library(Amelia)
library(pander)
library(texreg)
library(visreg)
library(lmtest)
library(sjmisc)
library(radiant.data)
library(datasets)

data(airquality)
head(airquality)
require(graphics)
pairs(airquality, panel = panel.smooth, main = "airquality data")
[Figure: scatterplot matrix of the airquality data]

m1 <- lm(Ozone ~ Solar.R + Wind, data = airquality)
m2 <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
htmlreg(list(m1, m2))
summary(airquality)
     Ozone           Solar.R           Wind             Temp           Month            Day      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000   Min.   : 1.0  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000   1st Qu.: 8.0  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000   Median :16.0  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993   Mean   :15.8  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000   Max.   :31.0  
 NA's   :37       NA's   :7                                                                      

Listwise Deletion Method

summary(lm(Ozone ~ Solar.R + Wind + Temp + Month + Day,  data = airquality, na.action = na.omit))

Call:
lm(formula = Ozone ~ Solar.R + Wind + Temp + Month + Day, data = airquality, 
    na.action = na.omit)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.014 -12.284  -3.302   8.454  95.348 

Coefficients:
             Estimate Std. Error t value       Pr(>|t|)    
(Intercept) -64.11632   23.48249  -2.730        0.00742 ** 
Solar.R       0.05027    0.02342   2.147        0.03411 *  
Wind         -3.31844    0.64451  -5.149 0.000001231276 ***
Temp          1.89579    0.27389   6.922 0.000000000366 ***
Month        -3.03996    1.51346  -2.009        0.04714 *  
Day           0.27388    0.22967   1.192        0.23576    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 20.86 on 105 degrees of freedom
  (42 observations deleted due to missingness)
Multiple R-squared:  0.6249,    Adjusted R-squared:  0.6071 
F-statistic: 34.99 on 5 and 105 DF,  p-value: < 0.00000000000000022

The summary above shows that 42 observations were deleted due to missingness, leaving 111 of the 153 rows. There is no substitute for a complete dataset. Deleting these observations also discards the values that were observed in those rows, and it can bias the estimates when the data are not missing completely at random. Multiple imputation is a more principled way to deal with missing data, and by using the Amelia package we will fill in the missing values so that all 153 observations can be used to make better inferences.
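To see where the 42 deleted observations come from, the rows that lm() keeps can be counted directly. This is a minimal sketch in base R; the helper names (vars, complete, n_kept and so on) are introduced here for illustration and are not part of the original analysis.

```r
# Count the rows that lm() keeps and drops under listwise deletion
data(airquality)

# Variables used in the regression above
vars <- c("Ozone", "Solar.R", "Wind", "Temp", "Month", "Day")

# TRUE for rows in which none of the model variables is NA
complete <- complete.cases(airquality[, vars])

n_total   <- nrow(airquality)
n_kept    <- sum(complete)
n_dropped <- n_total - n_kept

c(total = n_total, kept = n_kept, dropped = n_dropped)
```

The dropped count should match the "42 observations deleted due to missingness" line in the regression summary, and the kept count equals the 105 residual degrees of freedom plus the 6 estimated coefficients.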

Visualizing the Percentage of Missing Data

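# Percentage of missing values in a vector
V1 <- function(x) { sum(is.na(x)) / length(x) * 100 }
# Percentage missing in each column (MARGIN = 2)
apply(airquality, 2, V1)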
    Ozone   Solar.R      Wind      Temp     Month       Day 
24.183007  4.575163  0.000000  0.000000  0.000000  0.000000 

# Percentage missing in each row (MARGIN = 1)
apply(airquality, 1, V1)
  [1]  0.00000  0.00000  0.00000  0.00000 33.33333 16.66667  0.00000  0.00000  0.00000 16.66667
 [11] 16.66667  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
 [21]  0.00000  0.00000  0.00000  0.00000 16.66667 16.66667 33.33333  0.00000  0.00000  0.00000
 [31]  0.00000 16.66667 16.66667 16.66667 16.66667 16.66667 16.66667  0.00000 16.66667  0.00000
 [41]  0.00000 16.66667 16.66667  0.00000 16.66667 16.66667  0.00000  0.00000  0.00000  0.00000
 [51]  0.00000 16.66667 16.66667 16.66667 16.66667 16.66667 16.66667 16.66667 16.66667 16.66667
 [61] 16.66667  0.00000  0.00000  0.00000 16.66667  0.00000  0.00000  0.00000  0.00000  0.00000
 [71]  0.00000 16.66667  0.00000  0.00000 16.66667  0.00000  0.00000  0.00000  0.00000  0.00000
 [81]  0.00000  0.00000 16.66667 16.66667  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
 [91]  0.00000  0.00000  0.00000  0.00000  0.00000 16.66667 16.66667 16.66667  0.00000  0.00000
[101]  0.00000 16.66667 16.66667  0.00000  0.00000  0.00000 16.66667  0.00000  0.00000  0.00000
[111]  0.00000  0.00000  0.00000  0.00000 16.66667  0.00000  0.00000  0.00000 16.66667  0.00000
[121]  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
[131]  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
[141]  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 16.66667
[151]  0.00000  0.00000  0.00000

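The first output above shows that Ozone is missing about 24.2% of its values and Solar.R about 4.6%; the other four variables are complete. The second output gives the percentage of variables missing in each row: 16.67% means that one of the six variables is missing, and 33.33% means that two are missing.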
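The same percentages can be computed without apply(), and the row-wise output is easier to read as a table of counts. The following is a small base R sketch; pct_missing_col and n_missing_row are names chosen here for illustration.

```r
data(airquality)

# Percentage of missing values per column:
# is.na() gives a logical matrix, and the column means are the missing shares
pct_missing_col <- colMeans(is.na(airquality)) * 100
round(pct_missing_col, 2)

# Number of missing variables in each row, tabulated
n_missing_row <- rowSums(is.na(airquality))
table(n_missing_row)
```

Rows with a count of 0 are the complete cases, and rows with a count of 2 are those missing both Ozone and Solar.R, since these are the only two variables with NAs.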

Imputation Using Amelia Package

The Amelia package will take care of the imputing process for us.

# Create m = 20 imputed copies of the dataset
aq1 <- amelia(x = airquality, m = 20)
# First six rows of the first imputed dataset
aq1$imputations$imp1[1:6, ]

The row-wise output above showed that the 5th observation was missing two values, Ozone and Solar.R. Here we can see the values that Amelia imputed for that row. The imputation was run 20 times (m = 20), but only the first imputed dataset is shown.
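Because each imputed dataset fills the same gap with a different draw, it can help to look at one missing cell across all imputations. The sketch below assumes the Amelia package is installed; it uses m = 5 rather than 20 only to keep the run short, and the seed, aq_sketch and ozone_row5 are choices made here for illustration, not part of the original analysis.

```r
library(Amelia)

data(airquality)
set.seed(123)  # arbitrary seed, only so the sketch is reproducible

# Five imputed datasets; p2s = 0 suppresses Amelia's console output
aq_sketch <- amelia(x = airquality, m = 5, p2s = 0)

# Imputed Ozone value for row 5 in each of the five completed datasets
ozone_row5 <- sapply(aq_sketch$imputations, function(d) d$Ozone[5])
ozone_row5

# Spread of the imputed values for this single cell
range(ozone_row5)
```

The spread across imputations reflects the uncertainty about the missing value, which is why the analysis should not rely on a single imputed dataset.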

ggplot(data=airquality) + geom_histogram(mapping=aes(Ozone))
[Figure: histogram of Ozone]

z.out <- zelig(Ozone ~ Solar.R + Wind + Temp + Month + Day, model = "ls", data = aq1, cite = FALSE)
summary(z.out, subset = 1)
Imputed Dataset 1
Call:
z5$zelig(formula = Ozone ~ Solar.R + Wind + Temp + Month + Day, 
    data = aq1)

Residuals:
    Min      1Q  Median      3Q     Max 
-46.300 -13.073  -3.004  12.885  98.904 

Coefficients:
             Estimate Std. Error t value             Pr(>|t|)
(Intercept) -78.10871   19.85289  -3.934             0.000128
Solar.R       0.02592    0.02057   1.260             0.209607
Wind         -2.70016    0.54658  -4.940 0.000002098565416779
Temp          2.14191    0.23466   9.128 0.000000000000000502
Month        -3.87204    1.35901  -2.849             0.005013
Day           0.30787    0.19645   1.567             0.119235

Residual standard error: 21.03 on 147 degrees of freedom
Multiple R-squared:  0.5925,    Adjusted R-squared:  0.5786 
F-statistic: 42.74 on 5 and 147 DF,  p-value: < 0.00000000000000022

Next step: Use 'setx' method

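This model differs from the listwise deletion model shown earlier. The negative effect of Wind on Ozone is smaller in magnitude once imputed values are used (-2.70 versus -3.32), and the Solar.R coefficient is no longer significant at the 5% level (p = 0.21 versus 0.03). The model also uses all 153 observations, with 147 residual degrees of freedom instead of 105. Note that this summary covers only the first of the 20 imputed datasets.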
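A summary of a single imputed dataset understates the uncertainty, so the estimates from all imputed datasets are usually combined with Rubin's rules. The sketch below implements those rules in base R; pool_rubin is a hypothetical helper written for this example, and the input numbers are made up rather than taken from the analysis above. The Amelia package provides mi.meld() for the same purpose.

```r
# Pool estimates from m imputed datasets with Rubin's rules.
# q:  m x k matrix of coefficient estimates (one row per imputed dataset)
# se: m x k matrix of the matching standard errors
pool_rubin <- function(q, se) {
  m <- nrow(q)
  q_bar   <- colMeans(q)          # pooled point estimate
  within  <- colMeans(se^2)       # average within-imputation variance
  between <- apply(q, 2, var)     # between-imputation variance
  total   <- within + (1 + 1 / m) * between
  list(estimate = q_bar, se = sqrt(total))
}

# Hypothetical input: three imputations, two coefficients (made-up numbers)
q  <- rbind(c(-2.7, 2.1), c(-2.9, 2.0), c(-2.5, 2.2))
se <- rbind(c(0.5, 0.2), c(0.6, 0.2), c(0.5, 0.3))
colnames(q) <- colnames(se) <- c("Wind", "Temp")

pooled <- pool_rubin(q, se)
pooled$estimate
pooled$se
```

The pooled standard error is larger than the average of the individual standard errors whenever the estimates disagree across imputations, which is how the extra uncertainty from the missing data enters the final result.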

z.out$setx()
z.out$sim()
plot(z.out)
[Figure: plots of the simulated quantities from z.out]

Conclusion

The imputed values that replaced the NAs in the dataset had an effect on the models we ran above. By using the Amelia package we were able to recover some information for the two variables that contained NA values, Ozone and Solar.R. This let us fit the model to all 153 observations instead of the 111 complete cases, which makes multiple imputation a better way to deal with missing values here than listwise deletion.

Original link: https://www.rpubs.com/RobertPerez63/384791
