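Selecting Rows of a Data Frame in R

This tutorial describes how to extract rows of a data frame based on specific criteria.

In this tutorial, you will learn the following R functions from the dplyr package: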

slice(): Extract rows by position.

filter(): Extract rows that meet a logical condition. For example, iris %>% filter(Sepal.Length > 6).

filter_all(), filter_if() and filter_at(): Filter rows within a selection of variables. These functions replicate the logical criteria over all variables or a selection of variables.

sample_n(): Randomly select n rows.

sample_frac(): Randomly select a fraction of rows.

top_n(): Select the top n rows ordered by a variable.

We will also show you how to remove rows with missing values in a given column.

Required R packages

Load the tidyverse packages, which include dplyr:

library(tidyverse)

Demo dataset

We will use the R built-in iris data set, which we first convert to a tibble data frame (tbl_df) for easier data analysis.

my_data <- as_tibble(iris)

my_data

Extract rows by position

Function: slice() [dplyr package]

my_data %>% slice(1:6)
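As a minimal sketch of other ways to call slice(), the example below picks non-contiguous positions, drops rows with negative positions, and uses n() to reach the last row. The object names picked, dropped and last_row are hypothetical and only used for illustration.

```r
library(dplyr)

my_data <- as_tibble(iris)

# Positions need not be contiguous: rows 1, 3 and 5
picked <- my_data %>% slice(c(1, 3, 5))

# Negative positions drop rows: everything except the first 100 rows
dropped <- my_data %>% slice(-(1:100))

# n() is the number of rows, so this keeps only the last row
last_row <- my_data %>% slice(n())

nrow(picked)   # 3
nrow(dropped)  # 50
```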

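Filter rows by logical criteria

Function: filter() [dplyr package]. Used to filter rows that meet some logical criteria.

Before continuing, we introduce logical comparisons and operators, which are important for filtering data.

Logical comparisons

The "logical" comparison operators available in R are: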

<: for less than

>: for greater than

<=: for less than or equal to

>=: for greater than or equal to

==: for equal to each other

!=: for not equal to each other

%in%: for group membership. For example, value %in% c(2, 3) means that value can be 2 or 3.

is.na(): is NA

!is.na(): is not NA

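Logical operators

|: means or. For example, value == 2 | value == 3 means that value equals 2 or 3. Note that value == 2|3 does not do this: R reads it as (value == 2) | 3, which is always TRUE.

value %in% c(2, 3) is a shortcut equivalent to value == 2 | value == 3.

&: means and. For example, sex == "female" & age > 25.

The most frequent mistake made by beginners in R is to use = instead of == when testing for equality. Remember that, when testing for equality, you should always use == (not =).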
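One pitfall is worth checking by hand when combining comparisons with |. The short base R sketch below uses a hypothetical vector named value to compare the expression value == 2 | 3 with the two correct ways of testing "equal to 2 or 3".

```r
value <- c(1, 2, 3, 4)

# Pitfall: parsed as (value == 2) | 3, and any non-zero number counts as TRUE
wrong <- value == 2 | 3

# Correct: repeat the comparison on both sides of |
right_or <- value == 2 | value == 3

# Correct and shorter: test group membership
right_in <- value %in% c(2, 3)

wrong                          # TRUE TRUE TRUE TRUE
right_or                       # FALSE TRUE TRUE FALSE
identical(right_or, right_in)  # TRUE
```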

Extract rows based on logical criteria

One-column based criteria: Extract rows where Sepal.Length > 7:

my_data %>% filter(Sepal.Length > 7)

Multiple-column based criteria: Extract rows where Sepal.Length > 6.7 and Sepal.Width ≤ 3:

my_data %>% filter(Sepal.Length > 6.7, Sepal.Width <= 3)

Test for equality (==): Extract rows where Sepal.Length > 6.7 and Species == "versicolor":

my_data %>% filter(Sepal.Length > 6.7, Species == "versicolor")

Using the OR operator (|): Extract rows where Sepal.Length > 6.7 and (Species == "versicolor" or Species == "virginica"):

my_data %>% filter(Sepal.Length > 6.7, Species == "versicolor" | Species == "virginica")

Or, equivalently, use this shortcut (%in% operator):

my_data %>% filter(Sepal.Length > 6.7, Species %in% c("versicolor", "virginica"))
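To check that the | form and the %in% form really select the same rows, a small sketch like the following can help. It also shows that conditions separated by commas inside filter() are combined with &. The names with_or, with_in and with_and are hypothetical.

```r
library(dplyr)

my_data <- as_tibble(iris)

# Two species tested with |
with_or <- my_data %>%
  filter(Sepal.Length > 6.7, Species == "versicolor" | Species == "virginica")

# Same test written with %in%
with_in <- my_data %>%
  filter(Sepal.Length > 6.7, Species %in% c("versicolor", "virginica"))

# Comma-separated conditions behave like a single condition joined with &
with_and <- my_data %>%
  filter(Sepal.Length > 6.7 & Species %in% c("versicolor", "virginica"))

# All three results should have the same number of rows
c(nrow(with_or), nrow(with_in), nrow(with_and))
```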

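Filter rows within a selection of variables

The functions filter_all(), filter_if() and filter_at() filter rows within a selection of variables.

These functions replicate the logical criteria over all variables or a selection of variables.

Create a new demo data set by removing the grouping column "Species" from my_data: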

my_data2 <- my_data %>% select(-Species)

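Select rows where all variables are greater than 2.4:

my_data2 %>% filter_all(all_vars(. > 2.4))

Select rows where any of the variables is greater than 2.4:

my_data2 %>% filter_all(any_vars(. > 2.4))

Change the selection of columns to which the filtering criteria apply. filter_at() takes a vars() specification. The following R code applies the filter to the columns Sepal.Length and Sepal.Width:

my_data2 %>% filter_at(vars(starts_with("Sepal")), any_vars(. > 2.4))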
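The scoped filter_all() and filter_at() verbs are marked as superseded in recent dplyr releases. As a hedged sketch, assuming dplyr 1.0.4 or later is installed, the same selections can be written with if_all() and if_any() inside filter(). The object names below are hypothetical, and the comparison at the end is only a sanity check, not part of the original tutorial.

```r
library(dplyr)

my_data2 <- as_tibble(iris) %>% select(-Species)

# All variables greater than 2.4: scoped verb vs. if_all()
all_old <- my_data2 %>% filter_all(all_vars(. > 2.4))
all_new <- my_data2 %>% filter(if_all(everything(), ~ .x > 2.4))

# Any Sepal column greater than 2.4: scoped verb vs. if_any()
any_old <- my_data2 %>%
  filter_at(vars(starts_with("Sepal")), any_vars(. > 2.4))
any_new <- my_data2 %>%
  filter(if_any(starts_with("Sepal"), ~ .x > 2.4))

# Both spellings should keep the same number of rows
nrow(all_old) == nrow(all_new)
nrow(any_old) == nrow(any_new)
```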

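Remove missing values

We start by creating a data frame with missing values. In R, NA (Not Available) is used to represent missing values:

# Create a data frame with missing data (tibble() replaces the deprecated data_frame())

friends_data <- tibble(name = c("A", "B", "C", "D"), age = c(27, 25, 29, 26), height = c(180, NA, NA, 169), married = c("yes", "yes", "no", "no"))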

# Print

friends_data

Extract rows where height is NA:

friends_data %>% filter(is.na(height))

Exclude (drop) rows where height is NA:

friends_data %>% filter(!is.na(height))

Here, !is.na() means "is not NA".
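A related pitfall is testing for missing values with == instead of is.na(). The sketch below rebuilds the demo data and shows that height == NA never matches, because any comparison with NA evaluates to NA and filter() drops rows whose condition is NA. The object names are hypothetical.

```r
library(dplyr)

friends_data <- tibble(
  name = c("A", "B", "C", "D"),
  age = c(27, 25, 29, 26),
  height = c(180, NA, NA, 169),
  married = c("yes", "yes", "no", "no")
)

# Correct: is.na() returns TRUE for missing values
missing_height <- friends_data %>% filter(is.na(height))
known_height <- friends_data %>% filter(!is.na(height))

# Pitfall: height == NA is NA for every row, so no row is kept
never_matches <- friends_data %>% filter(height == NA)

nrow(missing_height)  # 2
nrow(known_height)    # 2
nrow(never_matches)   # 0
```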

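Select random rows from a data frame

You can select n random rows with sample_n(), or a random fraction of rows with sample_frac(). We first call set.seed() to initialize the random number generator, so that readers can reproduce the same sample.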

set.seed(1234)

# Extract 5 random rows without replacement

my_data %>% sample_n(5, replace = FALSE)

# Extract 5% of rows, randomly without replacement

my_data %>% sample_frac(0.05, replace = FALSE)
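To illustrate why the seed matters, the following sketch draws the same sample twice after resetting the seed and compares the results. The names first_draw, second_draw and third_draw are hypothetical.

```r
library(dplyr)

my_data <- as_tibble(iris)

# Same seed, same call: the two samples are the same rows
set.seed(1234)
first_draw <- my_data %>% sample_n(5, replace = FALSE)

set.seed(1234)
second_draw <- my_data %>% sample_n(5, replace = FALSE)

identical(first_draw, second_draw)  # TRUE

# Without resetting the seed, the generator moves on,
# so this draw will generally contain different rows
third_draw <- my_data %>% sample_n(5, replace = FALSE)
```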

Select the top n rows ordered by a variable

# Select the top 5 rows ordered by Sepal.Length

my_data %>% top_n(5, Sepal.Length)

# Group by Species and select the top 5 rows of each group, ordered by Sepal.Length

my_data %>% group_by(Species) %>% top_n(5, Sepal.Length)

## # A tibble: 16 x 5

## # Groups: Species [3]

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species

##

## 1 5.8 4 1.2 0.2 setosa

## 2 5.7 4.4 1.5 0.4 setosa

## 3 5.7 3.8 1.7 0.3 setosa

## 4 5.5 4.2 1.4 0.2 setosa

## 5 5.5 3.5 1.3 0.2 setosa

## 6 7 3.2 4.7 1.4 versicolor

## # ... with 10 more rows
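The output above has 16 rows rather than 15 because top_n() keeps ties: every row that shares the fifth-largest value within a group is returned. The sketch below counts the rows kept per species to make this visible, and adds arrange() because top_n() returns rows in their original order rather than sorted. The names top_by_species, counts and sorted_top are hypothetical.

```r
library(dplyr)

my_data <- as_tibble(iris)

top_by_species <- my_data %>%
  group_by(Species) %>%
  top_n(5, Sepal.Length)

# Rows kept per group; versicolor has three rows tied at 6.7
counts <- top_by_species %>% count(Species)
counts$n  # 5 6 5

# top_n() does not sort, so order explicitly if a ranking is wanted
sorted_top <- top_by_species %>%
  arrange(Species, desc(Sepal.Length))
```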

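Summary

In this tutorial, we introduced how to filter the rows of a data frame using the dplyr package:

Filter rows by logical criteria: my_data %>% filter(Sepal.Length > 7)

Select n random rows: my_data %>% sample_n(10)

Select a random fraction of rows: my_data %>% sample_frac(0.1)

Select the top n rows by a variable: my_data %>% top_n(10, Sepal.Length)

Group by Species, then select the top 5 rows of each group ordered by Sepal.Length:

my_data %>% group_by(Species) %>% top_n(5, Sepal.Length)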