[R] Importing, viewing and screening imported data

Install and load the psych and forcats packages

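#Install both packages once (straight quotes are required; curly quotes are a syntax error)
install.packages(c("psych", "forcats"))

library(psych)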

Load data

#Load the UCBAdmissions data
data("UCBAdmissions")
head(UCBAdmissions)

Explore the data set

#Explore the dataset using the commands mentioned in the Word file
summary(UCBAdmissions)
describe(UCBAdmissions)
table(UCBAdmissions)
str(UCBAdmissions)

summary(UCBAdmissions):

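  • Function: Provides a summary of an object. For a table such as UCBAdmissions, it reports the number of cases, the number of factors, and a chi-squared test for independence of all factors (for a data frame it would instead report per-variable statistics such as means and quartiles)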
Number of cases in table: 4526 
Number of factors: 3 
Test for independence of all factors:
	Chisq = 2000.3, df = 16, p-value = 0

describe(UCBAdmissions):

  • Function: Generates a comprehensive summary of the dataset, including measures of central tendency, spread, and other statistics
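Error in describe(UCBAdmissions) : could not find function "describe"

Why the error?

"could not find function" means that no attached package provides describe(). The function used here comes from the psych package (the Hmisc package has its own describe() as well), so psych has to be installed and successfully loaded with library(psych) before the call. Separately, describe() is intended for data frames, matrices, and numeric vectors rather than multi-way tables. If you have a table like UCBAdmissions, use functions such as summary() or str(), or convert the table first with as.data.frame().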
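As a minimal sketch of the fix, assuming psych may or may not be installed on your machine, the snippet below converts the table to a data frame and calls describe() through the psych:: prefix, so it does not depend on library() having succeeded. The name admissions_df is only illustrative.

```r
# Convert the 3-way table into a data frame of counts
# (columns: Admit, Gender, Dept, Freq)
admissions_df <- as.data.frame(UCBAdmissions)

# Only call describe() if psych is actually installed; the psych:: prefix
# works even when the package has not been attached with library()
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(admissions_df$Freq))
} else {
  message("psych is not installed; run install.packages(\"psych\") first")
}
```

Here describe() summarises the 24 cell counts in the Freq column, not the individual applicants.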

table(UCBAdmissions):

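  • Function: Creates a contingency table, showing the frequency of occurrences of each combination of variable values. UCBAdmissions is already a table of counts, so here table() tabulates the 24 cell counts themselves: the first row of each pair lists the distinct counts and the second row shows how many cells hold that count (only 138 occurs twice). This is rarely useful for data that is already tabulated.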
    UCBAdmissions
      8  17  19  22  24  53  89  94 120 131 138 202 205 207 244 279 299 313 317 351 353 
      1   1   1   1   1   1   1   1   1   1   2   1   1   1   1   1   1   1   1   1   1 
    391 512 
      1   1 
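For pre-tabulated data, marginal totals are usually more informative than table(). The following is a small sketch using base R only; the variable names are illustrative.

```r
# Total number of applicants across all cells
total_cases <- sum(UCBAdmissions)

# Collapse the 3-way table onto a single dimension (1 = Admit, 2 = Gender)
admit_totals <- margin.table(UCBAdmissions, 1)
gender_totals <- margin.table(UCBAdmissions, 2)

print(total_cases)
print(admit_totals)
print(gender_totals)

# Flatten all three dimensions into one printable 2-D layout
print(ftable(UCBAdmissions))
```

The total should match the "Number of cases in table" line reported by summary().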

str(UCBAdmissions):

  • Function: Provides the structure of the dataset, displaying the data type and the first few values of each variable.

 'table' num [1:2, 1:2, 1:6] 512 313 89 19 353 207 17 8 120 205 ...
 - attr(*, "dimnames")=List of 3
  ..$ Admit : chr [1:2] "Admitted" "Rejected"
  ..$ Gender: chr [1:2] "Male" "Female"
  ..$ Dept  : chr [1:6] "A" "B" "C" "D" ...
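The pieces that str() prints can also be pulled out individually. This is a short sketch with base R; dims, dim_labels, and first_cell are illustrative names.

```r
# Size of each dimension of the table
dims <- dim(UCBAdmissions)

# Names of the three classifying factors
dim_labels <- names(dimnames(UCBAdmissions))

# A single cell can be addressed by its labels:
# admitted male applicants in department A
first_cell <- UCBAdmissions["Admitted", "Male", "A"]

print(dims)
print(dim_labels)
print(first_cell)
```

The cell picked out here is the first value listed in the str() output.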

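After exploring the structure

UCBAdmissions is a 2 x 2 x 6 table of counts (Admit by Gender by Dept), not a data frame with one column per variable. Many R functions for statistical analysis expect a data frame, so the table may need to be converted first, for example with as.data.frame().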
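One possible way to reorganize the table, sketched with base R only: as.data.frame() gives one row per combination of factor levels plus a Freq column, and repeating each row Freq times gives one row per applicant. The names admissions_df, case_rows, and admissions_long are illustrative.

```r
# One row per cell of the table: Admit, Gender, Dept, Freq
admissions_df <- as.data.frame(UCBAdmissions)
str(admissions_df)

# Repeat each row index as many times as its count,
# which yields one row per individual applicant
case_rows <- rep(seq_len(nrow(admissions_df)), times = admissions_df$Freq)
admissions_long <- admissions_df[case_rows, c("Admit", "Gender", "Dept")]
rownames(admissions_long) <- NULL

head(admissions_long)
nrow(admissions_long)
```

Whether the compact (Freq) form or the long form is more convenient depends on the function you plan to use next.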

str()

library(forcats)
#load the gss_cat data (remember you need to load forcats first)
data("gss_cat")
head(gss_cat)
#Use the str (structure) command to find the number of levels in the variable rincome
str(gss_cat$rincome)

The output looks like this:

Factor w/ 16 levels "No answer","Don't know",..: 8 8 16 16 16 5 4 9 4 4 ...

Levels such as "No answer" and "Don't know" carry no income information, so they should be recoded as NA before analysis.
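A minimal sketch of that recoding, shown on a small made-up factor rather than on gss_cat itself so that it runs without forcats. The vector income and the list uninformative are assumptions for illustration only.

```r
# Toy factor standing in for an income variable (hypothetical values)
income <- factor(c("No answer", "$25000 or more", "Don't know", "Lt $1000"))

# Levels that carry no usable information
uninformative <- c("No answer", "Don't know", "Refused")

# Set those observations to NA, then drop the now-empty levels
income_clean <- income
income_clean[income_clean %in% uninformative] <- NA
income_clean <- droplevels(income_clean)

print(income_clean)
print(levels(income_clean))
```

The same pattern could be applied to gss_cat$rincome once forcats is loaded.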

The levels command

#Use the levels command to find the ninth level of the variable rincome
levels(gss_cat$rincome)
 [1] "No answer"      "Don't know"     "Refused"       
 [4] "$25000 or more" "$20000 - 24999" "$15000 - 19999"
 [7] "$10000 - 14999" "$8000 to 9999"  "$7000 to 7999" 
[10] "$6000 to 6999"  "$5000 to 5999"  "$4000 to 4999" 
[13] "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"      
[16] "Not applicable"

rincome is not a well-ordered variable: the first three levels ("No answer", "Don't know", "Refused") should be recoded as NA or moved after all the income ranges. (This will be covered in the next article.)
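Rather than counting through the printed list, a level can be picked out by position. The sketch below rebuilds the 16 levels from the output above on a small made-up factor, so it runs without forcats; income and its three values are hypothetical.

```r
# Levels copied from the levels(gss_cat$rincome) output shown above
rincome_levels <- c(
  "No answer", "Don't know", "Refused",
  "$25000 or more", "$20000 - 24999", "$15000 - 19999",
  "$10000 - 14999", "$8000 to 9999", "$7000 to 7999",
  "$6000 to 6999", "$5000 to 5999", "$4000 to 4999",
  "$3000 to 3999", "$1000 to 2999", "Lt $1000", "Not applicable"
)

# Toy factor using those levels (the three values are made up)
income <- factor(c("$7000 to 7999", "Lt $1000", "Refused"),
                 levels = rincome_levels)

# Number of levels, and the ninth level by position
n_levels <- nlevels(income)
ninth_level <- levels(income)[9]

print(n_levels)
print(ninth_level)
```

With forcats loaded, levels(gss_cat$rincome)[9] applies the same indexing to the real variable.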

Getting counts and calculating a proportion

# What is the proportion of Buddhist believers?
table(gss_cat$marital)
table(gss_cat$relig)
# 147 Buddhists out of 21483 respondents, as a percentage
147/21483*100
#What is the 9th label in the variable relig?
levels(gss_cat$relig)
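Typing the counts by hand is error-prone, so the percentage can also be computed from the table itself. The sketch below uses a made-up factor that reproduces only the two numbers quoted above (147 out of 21483), with every other religion lumped into a hypothetical "Other" level, so it runs without forcats.

```r
# Toy stand-in for gss_cat$relig: 147 "Buddhism" rows out of 21483 in total
relig <- factor(rep(c("Buddhism", "Other"), times = c(147, 21483 - 147)))

# Counts per level, then each count as a percentage of the total
relig_counts <- table(relig)
relig_percent <- prop.table(relig_counts) * 100

buddhism_percent <- relig_percent[["Buddhism"]]
print(round(buddhism_percent, 2))
```

On the real data, prop.table(table(gss_cat$relig)) * 100 would give the percentage for every religion at once.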

shapiro.test()

shapiro.test() runs the Shapiro-Wilk test, which assesses the normality of a univariate data sample. It is based on the W statistic and tests the null hypothesis that the sample comes from a normally distributed population. For example, applied to the Murder column of the built-in USArrests data:

# Perform the Shapiro-Wilk test
shapiro.test(USArrests$Murder)
Shapiro-Wilk normality test

data:  USArrests$Murder
W = 0.95703, p-value = 0.06674

Interpretation:

  • If the p-value is less than the significance level (commonly 0.05), you would reject the null hypothesis, suggesting that the data does not follow a normal distribution.
  • If the p-value is greater than the significance level, you would fail to reject the null hypothesis, indicating that there is not enough evidence to conclude that the data deviates significantly from a normal distribution.

So the test is useful for checking whether a variable meets the normality assumption of the analysis you plan to run. In the example above, the p-value of 0.06674 is greater than 0.05, so normality is not rejected at that level. Remember that the 0.05 significance level is a common convention and somewhat arbitrary. The interpretation of a p-value is always relative to the chosen significance level: a p-value close to 0.05 may indicate some evidence against the null hypothesis, but whether to reject it depends on the threshold set by the researcher.
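The decision rule can be written out explicitly. This is a small sketch using the built-in USArrests data; alpha and the wording of the messages are illustrative choices, not part of shapiro.test() itself.

```r
# Run the test and keep the result object instead of only printing it
result <- shapiro.test(USArrests$Murder)

# Significance level chosen by the analyst (0.05 is only a convention)
alpha <- 0.05

# The result is a list; the W statistic and the p-value can be extracted by name
w_statistic <- unname(result$statistic)
p_value <- result$p.value

if (p_value < alpha) {
  decision <- "reject normality"
} else {
  decision <- "fail to reject normality"
}

print(w_statistic)
print(p_value)
print(decision)
```

Changing alpha changes the decision threshold but not the p-value, which is why the threshold should be fixed before looking at the result.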
