This tutorial introduces how to compute statistical summaries in R using the dplyr package.
You will learn how to:
Compute summary statistics for ungrouped data, as well as for data grouped by one or more variables. R functions: summarise() and group_by().
Summarise multiple columns at once. R functions:
summarise_all(): apply summary functions to every column in the data frame.
summarise_at(): apply summary functions to specific columns selected with a character vector.
summarise_if(): apply summary functions to columns selected with a predicate function that returns TRUE.
library(tidyverse)
my_data <- as_tibble(iris)
my_data
Summary statistics of ungrouped data
# Compute the mean of Sepal.Length and Petal.Length, as well as the number of observations, using the function n():
my_data %>%
  summarise(
    count = n(),
    mean_sep = mean(Sepal.Length, na.rm = TRUE),
    mean_pet = mean(Petal.Length, na.rm = TRUE)
  )
Note that the additional argument na.rm = TRUE is used to remove NAs before the means are computed.
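To see why na.rm matters, here is a minimal sketch using a small hypothetical vector x (made up for illustration, not part of the iris data) that contains one missing value.

```r
# Hypothetical vector with one missing value (for illustration only)
x <- c(5.1, NA, 4.7)

# Without na.rm, a single NA makes the whole result NA
mean_with_na <- mean(x)

# With na.rm = TRUE, NAs are dropped before averaging
mean_without_na <- mean(x, na.rm = TRUE)

print(mean_with_na)
print(mean_without_na)
```

The iris data has no missing values, so na.rm = TRUE changes nothing there, but it is a sensible default on real data.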
Summary statistics of grouped data
Key R functions: group_by() and summarise()
Grouping by one variable
my_data %>%
  group_by(Species) %>%
  summarise(
    count = n(),
    mean_sep = mean(Sepal.Length),
    mean_pet = mean(Petal.Length)
  )
Note that multiple operations can be chained with the forward-pipe operator %>%. For example, x %>% f is equivalent to f(x).
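The equivalence between piped and nested calls can be checked directly. The sketch below uses a hypothetical numeric vector x and the base functions sqrt() and sum(), chosen only for illustration.

```r
library(dplyr)  # provides the %>% operator

# Hypothetical input vector (for illustration only)
x <- c(1, 4, 9)

# x %>% f() gives the same result as f(x)
piped  <- x %>% sqrt()
direct <- sqrt(x)

# Chained calls read left to right instead of inside out
chained <- x %>% sqrt() %>% sum()
nested  <- sum(sqrt(x))

identical(piped, direct)
identical(chained, nested)
```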
Grouping by multiple variables
# ToothGrowth demo data set
head(ToothGrowth)
# Summarise
ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(n = n(), mean = mean(len), sd = sd(len))
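When data are grouped by several variables, the grouping of the result deserves a look. The sketch below assumes dplyr 1.0.0 or later, where summarise() has a .groups argument. The object names by_supp_dose and ungrouped are hypothetical.

```r
library(dplyr)

# Default behaviour: only the last grouping variable (dose) is dropped,
# so the result is still grouped by supp
by_supp_dose <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(n = n(), mean = mean(len), sd = sd(len))
group_vars(by_supp_dose)

# .groups = "drop" returns an ungrouped tibble instead
ungrouped <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(n = n(), mean = mean(len), sd = sd(len), .groups = "drop")
group_vars(ungrouped)
```

Dropping the groups explicitly avoids surprises when further verbs such as mutate() or filter() are applied to the summary.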
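Computing statistics for multiple variables
Key R functions: summarise_all(), summarise_at() and summarise_if()
Their general forms are:
summarise_all(.tbl, .funs, ...)
summarise_if(.tbl, .predicate, .funs, ...)
summarise_at(.tbl, .vars, .funs, ...)
.tbl: a tbl data frame
.funs: A function, a list of functions, or a character vector of function names. Older code builds this list with funs(), which is deprecated in recent dplyr releases.
...: Additional arguments for the function calls in .funs.
.predicate: A predicate function to be applied to the columns or a logical vector. The variables for which .predicate is or returns TRUE are selected.
.vars: The columns to summarise, given as a character vector of column names, a numeric vector of column positions, or a selection created with vars().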
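As an illustration of the .funs argument, the following sketch passes a named list of two functions to summarise_at(). The list names mean and sd and the object name multi are choices made for this example. Each output column is named after the source column and the list name, joined by an underscore.

```r
library(dplyr)

my_data <- as_tibble(iris)

# A named list in .funs applies every function to every selected column.
# na.rm = TRUE is passed on to both mean() and sd() through `...`.
multi <- my_data %>%
  group_by(Species) %>%
  summarise_at(
    c("Sepal.Length", "Sepal.Width"),
    list(mean = mean, sd = sd),
    na.rm = TRUE
  )
names(multi)
```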
Summarise all variables, here by computing the mean of every column:
my_data %>%
  group_by(Species) %>%
  summarise_all(mean)
Summarise specific variables selected with a character vector:
my_data %>%
  group_by(Species) %>%
  summarise_at(c("Sepal.Length", "Sepal.Width"), mean, na.rm = TRUE)
Summarise specific variables selected with a predicate function:
my_data %>%
  group_by(Species) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)
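In dplyr 1.0.0 and later, the scoped verbs still work but are marked as superseded in favour of across(). The sketch below, which assumes such a version is installed, shows one possible across() counterpart for each of the three calls. The object names all_cols, some_cols and numeric_cols are hypothetical.

```r
library(dplyr)

my_data <- as_tibble(iris)

# Counterpart of summarise_all(mean): every non-grouping column
all_cols <- my_data %>%
  group_by(Species) %>%
  summarise(across(everything(), mean))

# Counterpart of summarise_at(c("Sepal.Length", "Sepal.Width"), mean, na.rm = TRUE)
some_cols <- my_data %>%
  group_by(Species) %>%
  summarise(across(c(Sepal.Length, Sepal.Width), ~ mean(.x, na.rm = TRUE)))

# Counterpart of summarise_if(is.numeric, mean, na.rm = TRUE)
numeric_cols <- my_data %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
```

Extra arguments such as na.rm are written inside a lambda here, which keeps it explicit which function receives them.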
Useful statistical summary functions
This section presents some R functions for computing statistical summaries.
Measure of location:
mean(x): sum of x divided by its length
median(x): the middle value; half of the values in x lie above it and half below
Measure of variation:
sd(x): standard deviation
IQR(x): interquartile range (robust equivalent of sd when outliers are present in the data)
mad(x): median absolute deviation (robust equivalent of sd when outliers are present in the data)
Measure of rank:
min(x): minimum value of x
max(x): maximum value of x
quantile(x, 0.25): 25% of x is below this value
Measure of position:
first(x): equivalent to x[1]
nth(x, 2): equivalent to n<-2; x[n]
last(x): equivalent to x[length(x)]
Counts:
n(): the number of observations in the current group (takes no arguments and only works inside verbs such as summarise())
sum(!is.na(x)): count non-missing values
n_distinct(x): count the number of unique values
Counts and proportions of logical values:
sum(x > 10): count the number of elements where x > 10
mean(y == 0): proportion of elements where y = 0
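Several of the functions listed above can be combined in a single summarise() call. The following sketch applies a selection of them to Sepal.Length in the iris data. The output column names and the threshold of 5 are arbitrary choices made for illustration.

```r
library(dplyr)

my_data <- as_tibble(iris)

stats <- my_data %>%
  group_by(Species) %>%
  summarise(
    count      = n(),                           # rows in the group
    n_non_na   = sum(!is.na(Sepal.Length)),     # non-missing values
    n_unique   = n_distinct(Sepal.Length),      # distinct values
    median_sep = median(Sepal.Length),          # location
    iqr_sep    = IQR(Sepal.Length),             # variation
    q25_sep    = quantile(Sepal.Length, 0.25),  # rank
    first_sep  = first(Sepal.Length),           # position
    prop_long  = mean(Sepal.Length > 5)         # proportion of values above 5
  )
stats
```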
Summary
In this tutorial, we described how to compute statistical summaries with the dplyr functions summarise() and group_by(), and how to summarise several columns at once with summarise_all(), summarise_at() and summarise_if().