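I was working through an exercise that I expected to be easy, but I ran into a problem.
1. gather the first 4 columns of the iris data frame, then restore the original shape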
iris_gather <- gather(data = iris,
key = LW,
value = S,
-Species)
Restore:
iris_spread <- spread(data = iris_gather,
key = LW,
value = S)
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 600 rows:
I looked it up and found the following explanation:
The error in spread can occur when there are more than one unique combinations exist. With pivot_wider, it is now replaced with a warning and would return a list column if there are duplicates and then we can unnest. Or another way is to create a sequence column grouped by the column identifier that have duplicates to make a unique row identifier i.e.
So some extra processing is needed:
iris_gather <- gather(data = iris,
key = LW,
value = SW,
-Species)
iris_gather %>% group_by(LW) %>% mutate(id=1:n())%>% spread(LW,SW)
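One way to see what is going on: after gather() nothing records which flower each measurement came from, so a Species/LW pair points at 50 rows rather than one. The sketch below assumes dplyr and tidyr are installed. It counts those duplicates and then tries an alternative fix that records the row number before gathering. The column name id and the object names are only illustrative choices.

```r
library(dplyr)
library(tidyr)

iris_gather <- gather(data = iris, key = LW, value = SW, -Species)

# Each Species/LW combination is shared by 50 rows, so spread() cannot tell
# which values belong together on one output row.
dup_counts <- count(iris_gather, Species, LW)

# Alternative fix: record the original row number before gathering, so that
# Species + id identifies exactly one output row.
iris_restored <- iris %>%
  mutate(id = row_number()) %>%
  gather(key = LW, value = SW, -Species, -id) %>%
  spread(key = LW, value = SW) %>%
  arrange(id)
```

Adding the identifier before reshaping keeps the four measurements of one flower tied together, which the grouped 1:n() trick only reconstructs after the fact.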
Although this solved the problem, as a beginner I still don't know exactly what went wrong, and I can't explain the claim that "there is something fundamentally wrong with the design of spread() and gather()". Those are the package author's own words:
For some time, it’s been obvious that there is something fundamentally wrong with the design of spread() and gather(). Many people don’t find the names intuitive and find it hard to remember which direction corresponds to spreading and which to gathering. It also seems surprisingly hard to remember the arguments to these functions, meaning that many people (including me!) have to consult the documentation every time.
# I thought I was the only one who couldn't keep these two straight, so this resonated with me
In other words, he advises against using spread() and gather(), and offers new, improved replacements with state-of-the-art features:
There are two important new features inspired by other R packages that have been advancing reshaping in R:
pivot_longer() can work with multiple value variables that may have different types, inspired by the enhanced melt() and dcast() functions provided by the data.table package by Matt Dowle and Arun Srinivasan.
pivot_longer() and pivot_wider() can take a data frame that specifies precisely how metadata stored in column names becomes data variables (and vice versa), inspired by the cdata package by John Mount and Nina Zumel.
1 The replacement for gather: pivot_longer
The number of columns decreases and the number of rows increases.
relig_income
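There are 3 variables in total: V1 = religion (18 values); V2 = income range (10 values); V3 = count (number of people).
The goal is a table where each row gives the count for one religion at one income range. The 10 income columns (V2) are repeated once for each of the 18 values of unique(relig_income$religion), so I expect 18 * 10 = 180 rows in total.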
test <- relig_income %>%
pivot_longer(!religion, names_to = "income", values_to = "count")
dim(test)
[1] 180 3
First argument: The first argument is the dataset to reshape, relig_income.
Second argument: The second argument describes which columns need to be reshaped. In this case, it’s every column apart from religion.
To spell it out: !religion excludes that column, and all the other columns (V2, the income ranges) take part in the reshape, so every column holding count values is included.
Third argument: it turns V2 into a column with a name of your choosing. The names_to gives the name of the variable that will be created from the data stored in the column names, i.e. income. In other words, it names the new column built from the merged column names. Here it is income, but you can pick any name.
Fourth argument: it pulls V3 out into its own column. The values_to gives the name of the variable that will be created from the data stored in the cell value, i.e. count.
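To make the four arguments concrete, here is a minimal sketch on a made-up two-row table (the data and the names toy and toy_long are hypothetical, not taken from relig_income). It assumes tidyr and tibble are installed.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: two groups, counts in two income columns.
toy <- tribble(
  ~religion, ~low, ~high,
  "A",        10,   20,
  "B",        30,   40
)

toy_long <- pivot_longer(
  toy,                  # 1st argument: the dataset to reshape
  !religion,            # 2nd argument: reshape every column except religion
  names_to = "income",  # former column names end up in this column
  values_to = "count"   # former cell values end up in this column
)

# 2 religions x 2 income columns give 4 rows and 3 columns.
dim(toy_long)
```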
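The above is the "String data in column names" case: the column names being gathered are plain strings.
In the next dataset, billboard, the column names follow a regular pattern: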
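billboard %>%
  pivot_longer(
    cols = starts_with("wk"),  # only reshape the columns whose names start with "wk"
    names_to = "week",         # name of the new column holding the former column names (all the wk columns)
    values_to = "rank",        # name of the new column holding the cell values
    values_drop_na = FALSE)    # handy: set this to TRUE to drop the rows that are NA after reshaping
# expected: 76 * 317 = 24092 rows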
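The row estimate can be checked rather than trusted. This is a small sketch, assuming the billboard dataset that ships with tidyr. The object names are illustrative.

```r
library(tidyr)

# Count the wk columns and the tracks instead of hard-coding 76 and 317.
n_wk <- sum(startsWith(names(billboard), "wk"))
n_tracks <- nrow(billboard)

billboard_long <- pivot_longer(
  billboard,
  cols = starts_with("wk"),
  names_to = "week",
  values_to = "rank",
  values_drop_na = FALSE
)

# With NA values kept, every track contributes one row per wk column.
rows_match <- nrow(billboard_long) == n_wk * n_tracks
```

With values_drop_na = TRUE the result is shorter, because the weeks in which a track was not charted are removed.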
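However, the week column now contains the characters "wk" as well as a number, and I only want the number. To split them, the author provides the names_prefix argument, together with names_transform: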
billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",                        # strip the leading "wk" from the week column
    names_transform = list(week = as.integer),  # convert what remains after stripping to integer
    # alternative: names_transform = list(week = readr::parse_number),
    values_to = "rank",
    values_drop_na = TRUE,
  )
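The author also gives another way: Alternatively, you could do this with a single argument by using readr::parse_number() which automatically strips non-numeric components.
This covers the case where the column names contain both text and numbers and you want the numbers as integers so they are easier to work with.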
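The two approaches can be compared on a tiny made-up table (the data and object names below are hypothetical). As far as I can tell they differ in one detail: as.integer yields an integer column, while readr::parse_number yields a double. The sketch assumes tidyr, tibble and readr are installed.

```r
library(tidyr)
library(tibble)

# Hypothetical one-track table with two week columns.
toy <- tribble(
  ~track, ~wk1, ~wk2,
  "a",     3,    5
)

# Approach 1: strip the prefix, then convert to integer.
by_prefix <- pivot_longer(
  toy, starts_with("wk"),
  names_to = "week",
  names_prefix = "wk",
  names_transform = list(week = as.integer),
  values_to = "rank"
)

# Approach 2: let parse_number() drop the non-numeric part in one step.
by_parse <- pivot_longer(
  toy, starts_with("wk"),
  names_to = "week",
  names_transform = list(week = readr::parse_number),
  values_to = "rank"
)

class(by_prefix$week)  # integer
class(by_parse$week)   # numeric (double)
```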
What if the column names encode several variables?
Many variables in column names
The dataset used here is who:
who
country, iso2, iso3, and year are already variables, so they can be left as is. But the columns from new_sp_m014 to newrel_f65 encode four variables in their names:
One of them, new, can be ignored:
The new_/new prefix indicates these are counts of new cases. This dataset only contains new cases, so we’ll ignore it here because it’s constant.
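The other 3 are as follows:
V1. sp/rel/ep describe how the case was diagnosed. (diagnosis method)
V2. m/f gives the gender. (male or female)
V3. 014/1524/2534/3544/4554/5564/65 supplies the age range. (age group)
A prerequisite for tidying data is understanding how the dataset is structured.
The names_pattern argument is quite powerful here: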
who %>% pivot_longer(
  cols = new_sp_m014:newrel_f65,
  names_to = c("diagnosis", "gender", "age"),  # names for V1 to V3 above
  names_pattern = "new_?(.*)_(.)(.*)",
  values_to = "count"                          # the column holding the numeric values is called count
)
We can break these variables up by specifying multiple column names in names_to, and then either providing names_sep or names_pattern. Here names_pattern is the most natural fit. It has a similar interface to extract: you give it a regular expression containing groups (defined by ()) and it puts each group in a column
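Take new_sp_m2534 as an example: splitting it by hand gives sp / m / 2534, i.e. three columns.
With names_pattern = "new_?(.*)_(.)(.*)" the split works like this: () () (), where each pair of parentheses wraps one of the three columns.
new_? matches the prefix "new" followed by an optional underscore (some names, such as newrel_f65, have no underscore after new). The first group captures everything from there up to the underscore before the gender. The second group captures exactly one character (f or m). The third group captures everything that remains, which is the age range.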
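The pattern can be tried outside of pivot_longer() on two of the column names. This sketch uses base R's regexec() with perl = TRUE as a stand-in for whatever regex engine tidyr uses internally (an assumption on my part). The object names are illustrative.

```r
# Two column names from who: one with and one without "_" after "new".
cols <- c("new_sp_m2534", "newrel_f65")
pattern <- "new_?(.*)_(.)(.*)"

# regexec() returns the full match followed by each capture group.
parts <- regmatches(cols, regexec(pattern, cols, perl = TRUE))

# Drop the full match and keep the three groups, one row per column name.
pieces <- do.call(rbind, lapply(parts, function(p) p[-1]))
colnames(pieces) <- c("diagnosis", "gender", "age")
pieces
```

The first row comes out as sp / m / 2534 and the second as rel / f / 65, which matches the split done by hand.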
The author then goes a step further and converts the gender and age columns to factors:
who %>% pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = c("diagnosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)",
names_transform = list(
gender = ~ readr::parse_factor(.x, levels = c("f", "m")),
age = ~ readr::parse_factor(
.x,
levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
ordered = TRUE
)
),
values_to = "count",
)
Another possible case: each row of the dataset holds several pieces of information per child (dob and gender):
library(readr)
family <- tribble(
~family, ~dob_child1, ~dob_child2, ~gender_child1, ~gender_child2,
1L, "1998-11-26", "2000-01-29", 1L, 2L,
2L, "1996-06-22", NA, 2L, NA,
3L, "2002-07-11", "2004-04-05", 2L, 2L,
4L, "2004-10-10", "2009-08-27", 1L, 1L,
5L, "2000-12-05", "2005-02-28", 2L, 1L,
)
family <- family %>% mutate_at(vars(starts_with("dob")), parse_date)
family %>%
  pivot_longer(
    !family,
    names_to = c(".value", "child"),
    names_sep = "_",
    values_drop_na = FALSE
  )
Note that we have two pieces of information (or values) for each child: their gender and their dob (date of birth). These need to go into separate columns in the result. Again we supply multiple variables to names_to, using names_sep to split up each variable name. Note the special name .value: this tells pivot_longer() that that part of the column name specifies the “value” being measured (which will become a variable in the output).
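My understanding:
names_to = c(".value", "child")
The part of each column name in front of child (dob or gender) is what .value refers to, and each of those parts becomes its own column in the output.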
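To check that reading of .value, here is a small sketch on the first two rows of the family data, with the dates left as strings. The names kids and kids_long are illustrative. It assumes tidyr and tibble are installed.

```r
library(tidyr)
library(tibble)

# First two families only; family 2 has no second child.
kids <- tribble(
  ~family, ~dob_child1,  ~dob_child2,  ~gender_child1, ~gender_child2,
  1L,      "1998-11-26", "2000-01-29", 1L,             2L,
  2L,      "1996-06-22", NA,           2L,             NA
)

kids_long <- pivot_longer(
  kids,
  !family,
  names_to = c(".value", "child"),  # "dob"/"gender" become columns, "child1"/"child2" become values
  names_sep = "_",
  values_drop_na = TRUE             # drop the row for the missing second child
)

names(kids_long)  # family, child, dob, gender
```

The dob and gender parts of the old column names turn into two output columns, while child1 and child2 turn into values of the child column.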
There are too many usages to cover here. If you are interested, read the documentation for these functions:
vignette("pivot")  # opens the pivoting vignette
Summary: following the hints above, this is the solution I arrived at:
## 1 Recommended replacement for gather: pivot_longer
test2 <- iris %>% pivot_longer(!Species,names_to = "LW", values_to = "size")
## 2 Recommended replacement for spread: pivot_wider (it warns, but does not raise an error)
test3 <- test2 %>%
pivot_wider(names_from = LW, values_from = size) %>%
unnest()
1: Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates
2: `cols` is now required when using unnest().
Please use `cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)`
The inverse operation, widening the data back into the original columns, is more troublesome: it produces warnings, but no error.
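The warnings can probably be avoided altogether by giving every flower an identifier before lengthening, so that pivot_wider() has a unique key for each output row and no list-columns or unnest() are needed. A minimal sketch, assuming dplyr and tidyr are installed. The column name id and the object names are my own choices.

```r
library(dplyr)
library(tidyr)

# Lengthen, keeping a row identifier alongside Species.
iris_long <- iris %>%
  mutate(id = row_number()) %>%
  pivot_longer(!c(Species, id), names_to = "LW", values_to = "size")

# Widen again; Species + id uniquely identify each output row.
iris_wide <- iris_long %>%
  pivot_wider(names_from = LW, values_from = size) %>%
  select(all_of(names(iris)))  # drop id and restore the original column order

dim(iris_wide)
```

The result is a tibble rather than a plain data frame, but it holds the same 150 rows and 5 columns as iris.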