install.packages("psych")
library("psych")
#Load the UCBAdmissions data
data("UCBAdmissions")
head(UCBAdmissions)
#Explore the dataset using the commands mentioned in the Word file
summary(UCBAdmissions)
describe(UCBAdmissions)
table(UCBAdmissions)
str(UCBAdmissions)
summary(UCBAdmissions):
Number of cases in table: 4526
Number of factors: 3
Test for independence of all factors:
Chisq = 2000.3, df = 16, p-value = 0
describe(UCBAdmissions):
Error in describe(UCBAdmissions) : could not find function "describe"
Why the error?
R reports this error when no attached package provides describe(), which means library("psych") did not run successfully (for example, the package was not installed or the call contained a typo). Separately, describe(), whether from "psych" or "Hmisc", is generally intended for data frames, not for tables. If you have a data frame, describe() gives a comprehensive summary of its variables. For a table, use summary(), str(), or functions designed specifically for tables.
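As a minimal sketch of that point, the snippet below checks that the package is installed before calling describe(), and applies it to a data frame rather than a table. The built-in USArrests data frame is only an illustrative stand-in here; it is an assumption of this example, not part of the exercise.

```r
# Check that "psych" is installed before relying on describe()
has_psych <- requireNamespace("psych", quietly = TRUE)

if (has_psych) {
  # describe() is meant for data frames; USArrests is an illustrative stand-in
  print(psych::describe(USArrests))
} else {
  message('Package "psych" is not installed; run install.packages("psych") first.')
}
```

Calling the function as psych::describe() also avoids ambiguity if "Hmisc" is attached at the same time, since both packages export a function with that name.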
table(UCBAdmissions):
UCBAdmissions
8 17 19 22 24 53 89 94 120 131 138 202 205 207 244 279 299 313 317 351 353
1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1
391 512
1 1
str(UCBAdmissions):
Function: Provides the structure of the dataset, displaying the data type and the first few values of each variable.
'table' num [1:2, 1:2, 1:6] 512 313 89 19 353 207 17 8 120 205 ...
- attr(*, "dimnames")=List of 3
..$ Admit : chr [1:2] "Admitted" "Rejected"
..$ Gender: chr [1:2] "Male" "Female"
..$ Dept : chr [1:6] "A" "B" "C" "D" ...
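This three-dimensional table (Admit x Gender x Dept) is compact, but many modelling and plotting functions in R expect a data frame with one row per combination of the three factors, so the table usually needs reshaping before statistical analysis.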
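One possible way to make the table easier to work with, sketched here with base R only, is to convert it to a data frame and to collapse dimensions that are not of interest. The object names ucb_df and by_gender are chosen for this illustration.

```r
# One row per Admit/Gender/Dept combination, with the cell count in Freq
ucb_df <- as.data.frame(UCBAdmissions)
head(ucb_df)

# Collapse the Dept dimension: applicants by admission status and gender
by_gender <- apply(UCBAdmissions, c(1, 2), sum)
by_gender

# Admission rate within each gender (each column sums to 1)
prop.table(by_gender, margin = 2)
```

The data frame form keeps every cell of the original table, so the Freq column still sums to the 4526 cases reported by summary().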
library(forcats)
#load the gss_cat data (remember you need to load forcats first)
data("gss_cat")
head(gss_cat)
#Use the str command to know the number of levels in the variable rincome
str(gss_cat$rincome)
The output of str() looks like this:
Factor w/ 16 levels "No answer","Don't know",..: 8 8 16 16 16 5 4 9 4 4 ...
Levels such as "No answer" and "Don't know" are not income values and should be recoded as NA before analysis.
#Use the levels command to know what is the ninth level of the variable rincome
levels(gss_cat$rincome)
[1] "No answer" "Don't know" "Refused"
[4] "$25000 or more" "$20000 - 24999" "$15000 - 19999"
[7] "$10000 - 14999" "$8000 to 9999" "$7000 to 7999"
[10] "$6000 to 6999" "$5000 to 5999" "$4000 to 4999"
[13] "$3000 to 3999" "$1000 to 2999" "Lt $1000"
[16] "Not applicable"
rincome is not a well-ordered variable: the first three levels ("No answer", "Don't know", "Refused") should be treated as NA and placed after all the income ranges. (This will be taught in the next article.)
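As a rough preview of that clean-up, the base R sketch below works on a small toy factor, not on gss_cat itself. The five observations and the reduced set of levels are made up for illustration; the next article may well use a different approach (for example forcats helpers).

```r
# Toy factor (hypothetical values) mimicking a few rincome levels
income <- factor(
  c("No answer", "$25000 or more", "Lt $1000", "Don't know", "$1000 to 2999"),
  levels = c("No answer", "Don't know", "$25000 or more",
             "$1000 to 2999", "Lt $1000")
)

# Turn the non-answers into missing values, then drop the now-empty levels
income[income %in% c("No answer", "Don't know")] <- NA
income <- droplevels(income)

# Rebuild the factor with the income ranges ordered from lowest to highest
income <- factor(
  income,
  levels = c("Lt $1000", "$1000 to 2999", "$25000 or more"),
  ordered = TRUE
)

levels(income)
table(income, useNA = "ifany")
```

Making the factor ordered is a design choice for this sketch: it lets comparisons such as income > "Lt $1000" work, which an unordered factor would not allow.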
# What is the proportion of Buddhist believers?
table(gss_cat$marital)
table(gss_cat$relig)
147/21483*100
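The same percentage can be obtained without copying numbers by hand. The sketch below reuses the two figures from the calculation above (147 and 21483) so that it runs on its own; the grouping into "Buddhism" and "Other" is an assumption made for this illustration.

```r
# Counts taken from the calculation above: 147 out of 21483 respondents
counts <- c(Buddhism = 147, Other = 21483 - 147)

# prop.table() divides each count by the total
pct <- prop.table(counts) * 100
round(pct["Buddhism"], 2)

# With gss_cat loaded, the same idea would be (assumes forcats is attached):
# round(prop.table(table(gss_cat$relig))["Buddhism"] * 100, 2)
```

Computing the share from the table itself avoids transcription errors and updates automatically if the data change.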
#What is the 9th label in the variable relig
levels(gss_cat$relig)
shapiro.test is a statistical test in R used to assess the normality of a univariate data sample. The test is based on the Shapiro-Wilk W statistic, which tests the null hypothesis that a given sample comes from a normally distributed population. Here is an overview of the functionality of shapiro.test:
# Perform the Shapiro-Wilk test
shapiro.test(USArrests$Murder)
Shapiro-Wilk normality test
data: USArrests$Murder
W = 0.95703, p-value = 0.06674
Interpretation:
The p-value of 0.06674 is above 0.05, so the test does not reject the null hypothesis of normality at the 5% level, although the result is borderline. A check like this helps decide whether methods that assume normally distributed data are appropriate for the sample. Remember, the choice of the 0.05 significance level is somewhat arbitrary, and it's a common convention. The interpretation of p-values is always relative to the chosen significance level. A p-value close to 0.05 may indicate that there is some evidence against the null hypothesis, but the decision to reject or not depends on the specific threshold set by the researcher.
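The decision rule described above can be written out explicitly. This is a minimal sketch using the built-in USArrests data; the variable names and the choice of alpha = 0.05 are assumptions for illustration.

```r
# Significance level chosen by the researcher (0.05 is only a convention)
alpha <- 0.05

# Run the Shapiro-Wilk test and pull out the statistic and p-value
res <- shapiro.test(USArrests$Murder)
w_stat <- unname(res$statistic)
p_val <- res$p.value

# Reject normality only if the p-value falls below the chosen threshold
reject_normality <- p_val < alpha

cat(sprintf("W = %.5f, p = %.5f, reject normality at alpha = %.2f: %s\n",
            w_stat, p_val, alpha, reject_normality))
```

Changing alpha to 0.10 would flip the decision for this sample, which shows how much the conclusion depends on the threshold.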