获取更多R语言和生信知识,请欢迎关注公众号:医学和生信笔记
医学和生信笔记 公众号主要分享:1.医学小知识、肛肠科小知识;2.R语言和Python相关的数据分析、可视化、机器学习等;3.生物信息学学习资料和自己的学习笔记!
主要介绍使用pivot_longer
和pivot_wider
进行长宽数据转换,这两个函数都是来自于tidyr
包,是gather
和spread
的升级,比原函数功能更加强大,且参数更加清晰直观好记。
今天先学习宽数据变为长数据的5种情况!
library(tidyr)
library(dplyr)
##
## 载入程辑包:'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
pivot_longer
宽数据变为长数据,宽数据通常是直接输入的原始数据,宽数据方便录入原始资料,但是不便于数据分析。因此在分析时需要变为长数据。
这是一个宽数据,是一个关于不同季节和宗教信仰的人们的收入数据。
relig_income
## # A tibble: 18 x 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
##
## 1 Agnostic 27 34 60 81 76 137 122
## 2 Atheist 12 27 37 52 35 70 73
## 3 Buddhist 27 21 30 34 33 58 62
## 4 Catholic 418 617 732 670 638 1116 949
## 5 Don’t k~ 15 14 15 11 10 35 21
## 6 Evangel~ 575 869 1064 982 881 1486 949
## 7 Hindu 1 9 7 9 11 34 47
## 8 Histori~ 228 244 236 238 197 223 131
## 9 Jehovah~ 20 27 24 24 21 30 15
## 10 Jewish 19 19 25 25 30 95 69
## 11 Mainlin~ 289 495 619 655 651 1107 939
## 12 Mormon 29 40 48 51 56 112 85
## 13 Muslim 6 7 9 10 9 23 16
## 14 Orthodox 13 17 23 32 32 47 38
## 15 Other C~ 9 7 11 13 13 14 18
## 16 Other F~ 20 33 40 46 49 63 46
## 17 Other W~ 5 2 3 4 2 7 3
## 18 Unaffil~ 217 299 374 365 341 528 407
## # ... with 3 more variables: $100-150k , >150k ,
## # Don't know/refused
这个数据包含3个变量:
religion
:宗教income
:分布在不同的列名中count
:单元格中的数字我们使用pivot_longer
把它变为长数据,这是一个最简单的宽数据变长数据的示例:
relig_income %>%
pivot_longer(cols = !religion,
names_to = "income",
values_to = "count")
## # A tibble: 180 x 3
## religion income count
##
## 1 Agnostic <$10k 27
## 2 Agnostic $10-20k 34
## 3 Agnostic $20-30k 60
## 4 Agnostic $30-40k 81
## 5 Agnostic $40-50k 76
## 6 Agnostic $50-75k 137
## 7 Agnostic $75-100k 122
## 8 Agnostic $100-150k 109
## 9 Agnostic >150k 84
## 10 Agnostic Don't know/refused 96
## # ... with 170 more rows
pivot_longer
的第一个参数是数据集,这里是relig_income
,已省略,religion
的所有列,income
,count
新创建的两列income
和count
都不是原来数据集有的,因此需要加引号。
使用billboard
数据集,列名中含有数字,而不是纯字符。
billboard
## # A tibble: 317 x 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
##
## 1 2 Pac Baby D~ 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The Ha~ 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors~ Krypto~ 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors~ Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobble~ 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give M~ 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Teens Dancin~ 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliyah I Don'~ 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliyah Try Ag~ 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams, ~ Open M~ 2000-08-26 76 76 74 69 68 67 61 58
## # ... with 307 more rows, and 68 more variables: wk9 , wk10 ,
## # wk11 , wk12 , wk13 , wk14 , wk15 , wk16 ,
## # wk17 , wk18 , wk19 , wk20 , wk21 , wk22 ,
## # wk23 , wk24 , wk25 , wk26 , wk27 , wk28 ,
## # wk29 , wk30 , wk31 , wk32 , wk33 , wk34 ,
## # wk35 , wk36 , wk37 , wk38 , wk39 , wk40 ,
## # wk41 , wk42 , wk43 , wk44 , wk45 , wk46 , ...
现在我们需要把不同的周放在一列中,因为它其实是一个变量,然后不同的周下面的值也放在一列中,同时我们发现周下面的数字有一些缺失值,我们想丢掉含有这些缺失值的行。
billboard %>%
pivot_longer(cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = T
)
## # A tibble: 5,307 x 5
## artist track date.entered week rank
##
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92
## # ... with 5,297 more rows
假如我们想知道每首歌在表单上待了几周,我们可以把week
这一列变为数值型,这需要另外的参数names_prefix/names_transform
。
billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = T,
names_prefix = "wk", # 数字前面的前缀是什么
names_transform = list(week = as.integer) # 对新建的week做什么操作
)
## # A tibble: 5,307 x 5
## artist track date.entered week rank
##
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 3 92
## # ... with 5,297 more rows
或者你可以直接使用readr
包中的parse_number()
函数直接解析week
这一列。
billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
names_transform = list(week = readr::parse_number), # 对新建的week这一列解析
values_to = "rank",
values_drop_na = TRUE,
)
## # A tibble: 5,307 x 5
## artist track date.entered week rank
##
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 3 92
## # ... with 5,297 more rows
在整洁数据的概念中,每一列都应该是一个变量,假如在列名中含有多个变量,那么我们就需要把这些列名变为不同的列,以who
数据集为例。
who
## # A tibble: 7,240 x 60
## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
##
## 1 Afghani~ AF AFG 1980 NA NA NA NA
## 2 Afghani~ AF AFG 1981 NA NA NA NA
## 3 Afghani~ AF AFG 1982 NA NA NA NA
## 4 Afghani~ AF AFG 1983 NA NA NA NA
## 5 Afghani~ AF AFG 1984 NA NA NA NA
## 6 Afghani~ AF AFG 1985 NA NA NA NA
## 7 Afghani~ AF AFG 1986 NA NA NA NA
## 8 Afghani~ AF AFG 1987 NA NA NA NA
## 9 Afghani~ AF AFG 1988 NA NA NA NA
## 10 Afghani~ AF AFG 1989 NA NA NA NA
## # ... with 7,230 more rows, and 52 more variables: new_sp_m4554 ,
## # new_sp_m5564 , new_sp_m65 , new_sp_f014 ,
## # new_sp_f1524 , new_sp_f2534 , new_sp_f3544 ,
## # new_sp_f4554 , new_sp_f5564 , new_sp_f65 ,
## # new_sn_m014 , new_sn_m1524 , new_sn_m2534 ,
## # new_sn_m3544 , new_sn_m4554 , new_sn_m5564 ,
## # new_sn_m65 , new_sn_f014 , new_sn_f1524 , ...
从new_sp_mo14
到newrel_f65
这几列的列名都包含了4个变量:
new/new_
代表了一个变量,但是所有的都是new
,暂且不管,sp/rel/ep
也是一个变量,代表是如何被诊断的,m/f
是性别,也是一个变量,014/1524/2535/3544/4554/65
代表年龄范围,也是一个变量。也就是说这些列应该被重组为4列,可以使用正则表达式提取这些列的规律,然后进行重组,这里只分了3列:
who %>% pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = c("diagnosis","gender","age"), # new那一列没写,因为都一样
names_pattern = "new_?(.*)_(.)(.*)", # 正则表达式3个括号对应上面3个列名
values_to = "count"
)
## # A tibble: 405,440 x 8
## country iso2 iso3 year diagnosis gender age count
##
## 1 Afghanistan AF AFG 1980 sp m 014 NA
## 2 Afghanistan AF AFG 1980 sp m 1524 NA
## 3 Afghanistan AF AFG 1980 sp m 2534 NA
## 4 Afghanistan AF AFG 1980 sp m 3544 NA
## 5 Afghanistan AF AFG 1980 sp m 4554 NA
## 6 Afghanistan AF AFG 1980 sp m 5564 NA
## 7 Afghanistan AF AFG 1980 sp m 65 NA
## 8 Afghanistan AF AFG 1980 sp f 014 NA
## 9 Afghanistan AF AFG 1980 sp f 1524 NA
## 10 Afghanistan AF AFG 1980 sp f 2534 NA
## # ... with 405,430 more rows
假如我们知道自己的数据情况,比如这个数据,gender
这一列和age
这一列,可以变为因子型,这样方便后面的分析:
who %>% pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = c("diagnosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)",
names_transform = list(
gender = ~ readr::parse_factor(.x, levels = c("f", "m")),
age = ~ readr::parse_factor(
.x,
levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
ordered = TRUE
)
),
values_to = "count",
)
## # A tibble: 405,440 x 8
## country iso2 iso3 year diagnosis gender age count
##
## 1 Afghanistan AF AFG 1980 sp m 014 NA
## 2 Afghanistan AF AFG 1980 sp m 1524 NA
## 3 Afghanistan AF AFG 1980 sp m 2534 NA
## 4 Afghanistan AF AFG 1980 sp m 3544 NA
## 5 Afghanistan AF AFG 1980 sp m 4554 NA
## 6 Afghanistan AF AFG 1980 sp m 5564 NA
## 7 Afghanistan AF AFG 1980 sp m 65 NA
## 8 Afghanistan AF AFG 1980 sp f 014 NA
## 9 Afghanistan AF AFG 1980 sp f 1524 NA
## 10 Afghanistan AF AFG 1980 sp f 2534 NA
## # ... with 405,430 more rows
在整洁数据中,每行一个观测,每列一个变量。
有时一行会有多个观测,你想创建的新列的名字是原数据列名的一部分。
family <- tribble(
~family, ~dob_child1, ~dob_child2, ~gender_child1, ~gender_child2,
1L, "1998-11-26", "2000-01-29", 1L, 2L,
2L, "1996-06-22", NA, 2L, NA,
3L, "2002-07-11", "2004-04-05", 2L, 2L,
4L, "2004-10-10", "2009-08-27", 1L, 1L,
5L, "2000-12-05", "2005-02-28", 2L, 1L,
)
family <- family %>% mutate_at(vars(starts_with("dob")), parse_date)
family
## # A tibble: 5 x 5
## family dob_child1 dob_child2 gender_child1 gender_child2
##
## 1 1 1998-11-26 2000-01-29 1 2
## 2 2 1996-06-22 NA 2 NA
## 3 3 2002-07-11 2004-04-05 2 2
## 4 4 2004-10-10 2009-08-27 1 1
## 5 5 2000-12-05 2005-02-28 2 1
在这个数据集中,每一行有2个孩子的信息,每个孩子都有性别和生日两个信息(或者两个值,value)。我们应该把chid变为一列,gender变为一列,dob变为一列。
.value
参数告诉pivot_longer
:列名的一部分是要用的,会变成输出数据的一个变量。
family %>% pivot_longer(
cols = !family,
names_to = c(".value","child"), # 列名中其他部分要用的
names_sep = "_", # 列名中的分隔符
values_drop_na = T
)
## # A tibble: 9 x 4
## family child dob gender
##
## 1 1 child1 1998-11-26 1
## 2 1 child2 2000-01-29 2
## 3 2 child1 1996-06-22 2
## 4 3 child1 2002-07-11 2
## 5 3 child2 2004-04-05 2
## 6 4 child1 2004-10-10 1
## 7 4 child2 2009-08-27 1
## 8 5 child1 2000-12-05 2
## 9 5 child2 2005-02-28 1
在另一个anscombe
数据集中也存在同样的问题:
anscombe
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
## 4 9 9 9 8 8.81 8.77 7.11 8.84
## 5 11 11 11 8 8.33 9.26 7.81 8.47
## 6 14 14 14 8 9.96 8.10 8.84 7.04
## 7 6 6 6 8 7.24 6.13 6.08 5.25
## 8 4 4 4 19 4.26 3.10 5.39 12.50
## 9 12 12 12 8 10.84 9.13 8.15 5.56
## 10 7 7 7 8 4.82 7.26 6.42 7.91
## 11 5 5 5 8 5.68 4.74 5.73 6.89
这个数据集有4对变量(x1和y1,x2和y2,x3和y3,x4和y4):
anscombe %>%
pivot_longer(
cols = everything(),
names_to = c(".value","set"), # 列名中x,y要用的
names_pattern = "(.)(.)"
) %>%
arrange(set)
## # A tibble: 44 x 3
## set x y
##
## 1 1 10 8.04
## 2 1 8 6.95
## 3 1 13 7.58
## 4 1 9 8.81
## 5 1 11 8.33
## 6 1 14 9.96
## 7 1 6 7.24
## 8 1 4 4.26
## 9 1 12 10.8
## 10 1 7 4.82
## # ... with 34 more rows
接下来再介绍一个数据集也是这样的情况,帮助大家理解:
pnl <- tibble(
x = 1:4,
a = c(1, 1,0, 0),
b = c(0, 1, 1, 1),
y1 = rnorm(4),
y2 = rnorm(4),
z1 = rep(3, 4),
z2 = rep(-2, 4),
)
pnl
## # A tibble: 4 x 7
## x a b y1 y2 z1 z2
##
## 1 1 1 0 -0.0659 -0.789 3 -2
## 2 2 1 1 -1.19 -0.482 3 -2
## 3 3 0 1 -0.339 2.59 3 -2
## 4 4 0 1 -0.821 0.293 3 -2
接下来变为长数据:
pnl %>%
pivot_longer(
!c(x, a, b),
names_to = c(".value", "time"), # y和z是要用的,其他变为time
names_pattern = "(.)(.)"
)
## # A tibble: 8 x 6
## x a b time y z
##
## 1 1 1 0 1 -0.0659 3
## 2 1 1 0 2 -0.789 -2
## 3 2 1 1 1 -1.19 3
## 4 2 1 1 2 -0.482 -2
## 5 3 0 1 1 -0.339 3
## 6 3 0 1 2 2.59 -2
## 7 4 0 1 1 -0.821 3
## 8 4 0 1 2 0.293 -2
比如下面这个数据集,列名是有重复的:
df <- tibble(id = 1:3, y = 4:6, y = 5:7, y = 7:9, .name_repair = "minimal")
df
## # A tibble: 3 x 4
## id y y y
##
## 1 1 4 5 7
## 2 2 5 6 8
## 3 3 6 7 9
pivot_longer
处理这种数据时,会自动添加新列:
df %>%
pivot_longer(
cols = !id,
names_to = "name",
values_to = "value"
)
## # A tibble: 9 x 3
## id name value
##
## 1 1 y 4
## 2 1 y 5
## 3 1 y 7
## 4 2 y 5
## 5 2 y 6
## 6 2 y 8
## 7 3 y 6
## 8 3 y 7
## 9 3 y 9
假如多个输入数据的列名会变成输出数据的一列,也是这样的情况:
df <- tibble(id = 1:3, x1 = 4:6, x2 = 5:7, y1 = 7:9, y2 = 10:12)
df
## # A tibble: 3 x 5
## id x1 x2 y1 y2
##
## 1 1 4 5 7 10
## 2 2 5 6 8 11
## 3 3 6 7 9 12
这个数据集也是有2对的,可以按照上面的思路写
df %>% pivot_longer(!id, names_to = ".value", names_pattern = "(.).")
## # A tibble: 6 x 3
## id x y
##
## 1 1 4 7
## 2 1 5 10
## 3 2 5 8
## 4 2 6 11
## 5 3 6 9
## 6 3 7 12
当然也可以按照每一行有多个观测的数据集的思路写:
df %>% pivot_longer(!id, names_to = c(".value","group"), names_pattern = "(.)(.)")
## # A tibble: 6 x 4
## id group x y
##
## 1 1 1 4 7
## 2 1 2 5 10
## 3 2 1 5 8
## 4 2 2 6 11
## 5 3 1 6 9
## 6 3 2 7 12
以上就是宽数据变长数据的常见情况,基本涵盖了日常数据,下面讲解长数据变成宽数据的例子。
获取更多R语言和生信知识,请欢迎关注公众号:医学和生信笔记
医学和生信笔记 公众号主要分享:1.医学小知识、肛肠科小知识;2.R语言和Python相关的数据分析、可视化、机器学习等;3.生物信息学学习资料和自己的学习笔记!