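data.table provides the functions melt() and dcast(), which are enhanced versions of the functions of the same purpose in the reshape2 package.
We first look at the traditional wide-to-long and long-to-wide conversions.
1. Default functionality
a. Melting a data.table (wide to long)
Consider the following data.table: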
> DT = fread("melt_default.csv")
> DT
family_id age_mother dob_child1 dob_child2 dob_child3
1: 1 30 1998/11/26 2000/1/29 NA
2: 2 27 1996/6/22 NA NA
3: 3 26 2002/7/11 2004/4/5 2007/9/2
4: 4 32 2004/10/10 2009/8/27 2012/7/21
5: 5 29 2000/12/5 2005/2/28 NA
> str(DT)
Classes ‘data.table’ and 'data.frame': 5 obs. of 5 variables:
$ family_id : int 1 2 3 4 5
$ age_mother: int 30 27 26 32 29
$ dob_child1: chr "1998/11/26" "1996/6/22" "2002/7/11" "2004/10/10" ...
$ dob_child2: chr "2000/1/29" NA "2004/4/5" "2009/8/27" ...
$ dob_child3: chr NA NA "2007/9/2" "2012/7/21" ...
 - attr(*, ".internal.selfref")=<externalptr>
Convert the data from wide to long format, so that the dob columns are stacked into a single column.
> DT.m1 = melt(DT, id.vars = c("family_id", "age_mother"),
+ measure.vars = c("dob_child1", "dob_child2", "dob_child3"))
> DT.m1
family_id age_mother variable value
1: 1 30 dob_child1 1998/11/26
2: 2 27 dob_child1 1996/6/22
3: 3 26 dob_child1 2002/7/11
4: 4 32 dob_child1 2004/10/10
5: 5 29 dob_child1 2000/12/5
6: 1 30 dob_child2 2000/1/29
7: 2 27 dob_child2 NA
8: 3 26 dob_child2 2004/4/5
9: 4 32 dob_child2 2009/8/27
10: 5 29 dob_child2 2005/2/28
11: 1 30 dob_child3 NA
12: 2 27 dob_child3 NA
13: 3 26 dob_child3 2007/9/2
14: 4 32 dob_child3 2012/7/21
15: 5 29 dob_child3 NA
> str(DT.m1)
Classes ‘data.table’ and 'data.frame': 15 obs. of 4 variables:
$ family_id : int 1 2 3 4 5 1 2 3 4 5 ...
$ age_mother: int 30 27 26 32 29 30 27 26 32 29 ...
$ variable : Factor w/ 3 levels "dob_child1","dob_child2",..: 1 1 1 1 1 2 2 2 2 2 ...
$ value : chr "1998/11/26" "1996/6/22" "2002/7/11" "2004/10/10" ...
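 - attr(*, ".internal.selfref")=<externalptr>
melt() creates two new columns, named "variable" and "value" by default.
The "variable" column is a factor by default. To get a character column instead, set variable.factor = FALSE. This argument is available only in data.table's melt(), not in reshape2's.
Naming the "variable" and "value" columns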
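As a minimal sketch of the variable.factor argument described above: the table is rebuilt inline with data.table() rather than read from melt_default.csv, and the name DT.chr is chosen here purely for illustration.

```r
library(data.table)

# Inline copy of the example data, so the sketch does not depend on the CSV file.
DT <- data.table(
  family_id  = 1:5,
  age_mother = c(30L, 27L, 26L, 32L, 29L),
  dob_child1 = c("1998/11/26", "1996/6/22", "2002/7/11", "2004/10/10", "2000/12/5"),
  dob_child2 = c("2000/1/29", NA, "2004/4/5", "2009/8/27", "2005/2/28"),
  dob_child3 = c(NA, NA, "2007/9/2", "2012/7/21", NA)
)

# variable.factor = FALSE keeps the "variable" column as character
# instead of converting it to a factor.
DT.chr <- melt(DT, id.vars = c("family_id", "age_mother"),
               measure.vars = c("dob_child1", "dob_child2", "dob_child3"),
               variable.factor = FALSE)

class(DT.chr$variable)
```

A character column can be convenient when the names are going to be split or pattern-matched afterwards; a factor preserves the original column order as its level order.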
> DT.m1 = melt(DT, measure.vars = c("dob_child1", "dob_child2", "dob_child3"),
+ variable.name = "child", value.name = "dob")
> DT.m1
family_id age_mother child dob
1: 1 30 dob_child1 1998/11/26
2: 2 27 dob_child1 1996/6/22
3: 3 26 dob_child1 2002/7/11
4: 4 32 dob_child1 2004/10/10
5: 5 29 dob_child1 2000/12/5
6: 1 30 dob_child2 2000/1/29
7: 2 27 dob_child2 NA
8: 3 26 dob_child2 2004/4/5
9: 4 32 dob_child2 2009/8/27
10: 5 29 dob_child2 2005/2/28
11: 1 30 dob_child3 NA
12: 2 27 dob_child3 NA
13: 3 26 dob_child3 2007/9/2
14: 4 32 dob_child3 2012/7/21
15: 5 29 dob_child3 NA
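The long table above still carries a row for every missing child. If those rows are not wanted, melt() also has an na.rm argument. The sketch below is an illustration on an inline copy of the data, not part of the original walkthrough; DT.nona is a name made up for this example.

```r
library(data.table)

# Inline copy of the example data, so the sketch does not depend on the CSV file.
DT <- data.table(
  family_id  = 1:5,
  age_mother = c(30L, 27L, 26L, 32L, 29L),
  dob_child1 = c("1998/11/26", "1996/6/22", "2002/7/11", "2004/10/10", "2000/12/5"),
  dob_child2 = c("2000/1/29", NA, "2004/4/5", "2009/8/27", "2005/2/28"),
  dob_child3 = c(NA, NA, "2007/9/2", "2012/7/21", NA)
)

# na.rm = TRUE drops the rows whose melted value is NA.
DT.nona <- melt(DT, measure.vars = c("dob_child1", "dob_child2", "dob_child3"),
                variable.name = "child", value.name = "dob", na.rm = TRUE)

nrow(DT.nona)  # fewer rows than the 15 produced without na.rm
```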
b. Casting a data.table (long to wide)
> dcast(DT.m1, family_id + age_mother ~ child, value.var = "dob")
family_id age_mother dob_child1 dob_child2 dob_child3
1: 1 30 1998/11/26 2000/1/29 NA
2: 2 27 1996/6/22 NA NA
3: 3 26 2002/7/11 2004/4/5 2007/9/2
4: 4 32 2004/10/10 2009/8/27 2012/7/21
5: 5 29 2000/12/5 2005/2/28 NA
You can also aggregate by group by passing a function to the fun.aggregate argument of dcast(). For example, to count the non-NA dob values for each family_id in DT.m1:
> dcast(DT.m1,family_id ~.,fun.agg=function(x) sum(!is.na(x)),value.var = "dob")
family_id .
1: 1 2
2: 2 1
3: 3 3
4: 4 3
5: 5 2
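The aggregation above can be reproduced without the CSV file. The sketch below rebuilds the data inline and additionally renames the "." column; the name n_children is an assumption made for readability, not something the original uses.

```r
library(data.table)

# Inline copy of the example data, so the sketch does not depend on the CSV file.
DT <- data.table(
  family_id  = 1:5,
  age_mother = c(30L, 27L, 26L, 32L, 29L),
  dob_child1 = c("1998/11/26", "1996/6/22", "2002/7/11", "2004/10/10", "2000/12/5"),
  dob_child2 = c("2000/1/29", NA, "2004/4/5", "2009/8/27", "2005/2/28"),
  dob_child3 = c(NA, NA, "2007/9/2", "2012/7/21", NA)
)

DT.m1 <- melt(DT, measure.vars = c("dob_child1", "dob_child2", "dob_child3"),
              variable.name = "child", value.name = "dob")

# "family_id ~ ." means: one row per family_id and a single aggregated column.
counts <- dcast(DT.m1, family_id ~ ., value.var = "dob",
                fun.aggregate = function(x) sum(!is.na(x)))

# The aggregated column is called "."; give it a descriptive name.
setnames(counts, ".", "n_children")
counts
```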
2. Limitations of the current melt/dcast approach
> DT = fread("melt_enhanced.csv")
> DT
family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1: 1 30 1998/11/26 2000/1/29 NA 1 2 NA
2: 2 27 1996/6/22 NA NA 2 NA NA
3: 3 26 2002/7/11 2004/4/5 2007/9/2 2 2 1
4: 4 32 2004/10/10 2009/8/27 2012/7/21 1 1 1
5: 5 29 2000/12/5 2005/2/28 NA 2 1 NA
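Suppose we want to stack all the dob columns into one column and all the gender columns into another. With the functionality shown so far, we have to melt everything, split the variable names, and cast again. This is roundabout, and because dob (character) and gender (integer) end up in the same value column, gender is coerced to character: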
> DT.m1 = melt(DT, id = c("family_id", "age_mother"))
Warning message:
In melt.data.table(DT, id = c("family_id", "age_mother")) :
'measure.vars' [dob_child1, dob_child2, dob_child3, gender_child1, ...] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'character'. All measure variables not of type 'character' will be coerced to. Check DETAILS in ?melt.data.table for more on coercion.
> DT.m1[, c("variable", "child") := tstrsplit(variable, "_", fixed = TRUE)]
> DT.c1 = dcast(DT.m1, family_id + age_mother + child ~ variable, value.var = "value")
> DT.c1
family_id age_mother child dob gender
1: 1 30 child1 1998/11/26 1
2: 1 30 child2 2000/1/29 2
3: 1 30 child3 NA NA
4: 2 27 child1 1996/6/22 2
5: 2 27 child2 NA NA
6: 2 27 child3 NA NA
7: 3 26 child1 2002/7/11 2
8: 3 26 child2 2004/4/5 2
9: 3 26 child3 2007/9/2 1
10: 4 32 child1 2004/10/10 1
11: 4 32 child2 2009/8/27 1
12: 4 32 child3 2012/7/21 1
13: 5 29 child1 2000/12/5 2
14: 5 29 child2 2005/2/28 1
15: 5 29 child3 NA NA
> str(DT.c1)
Classes ‘data.table’ and 'data.frame': 15 obs. of 5 variables:
$ family_id : int 1 1 1 2 2 2 3 3 3 4 ...
$ age_mother: int 30 30 30 27 27 27 26 26 26 32 ...
$ child : chr "child1" "child2" "child3" "child1" ...
$ dob : chr "1998/11/26" "2000/1/29" NA "1996/6/22" ...
$ gender : chr "1" "2" NA "2" ...
 - attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "family_id" "age_mother" "child"
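The str() output shows gender as chr. One way to repair this after the fact is an explicit conversion. The following is a minimal sketch on an inline copy of the data; the suppressWarnings() wrapper is added here only to silence the coercion warning quoted above.

```r
library(data.table)

# Inline copy of the melt_enhanced.csv data.
DT <- data.table(
  family_id     = 1:5,
  age_mother    = c(30L, 27L, 26L, 32L, 29L),
  dob_child1    = c("1998/11/26", "1996/6/22", "2002/7/11", "2004/10/10", "2000/12/5"),
  dob_child2    = c("2000/1/29", NA, "2004/4/5", "2009/8/27", "2005/2/28"),
  dob_child3    = c(NA, NA, "2007/9/2", "2012/7/21", NA),
  gender_child1 = c(1L, 2L, 2L, 1L, 2L),
  gender_child2 = c(2L, NA, 2L, 1L, 1L),
  gender_child3 = c(NA, NA, 1L, 1L, NA)
)

# Step 1: melt everything; integer gender values are coerced to character.
DT.m1 <- suppressWarnings(melt(DT, id.vars = c("family_id", "age_mother")))

# Step 2: split "dob_child1" into "dob" and "child1".
DT.m1[, c("variable", "child") := tstrsplit(variable, "_", fixed = TRUE)]

# Step 3: cast dob and gender back into separate columns.
DT.c1 <- dcast(DT.m1, family_id + age_mother + child ~ variable, value.var = "value")

# Step 4: undo the coercion by converting gender back to integer.
DT.c1[, gender := as.integer(gender)]
str(DT.c1$gender)
```

Having to remember step 4 for every non-character column is part of what makes this route inconvenient.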
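3. Enhanced functionality
a. Enhanced melt()
melt() accepts a list for measure.vars; each list element names the columns that are stacked into one value column, so dob and gender each keep their own type.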
> colA = paste("dob_child", 1:3, sep = "")
> colB = paste("gender_child", 1:3, sep = "")
> DT.m2 = melt(DT, measure = list(colA, colB), value.name = c("dob", "gender"))
> DT.m2
family_id age_mother variable dob gender
1: 1 30 1 1998/11/26 1
2: 2 27 1 1996/6/22 2
3: 3 26 1 2002/7/11 2
4: 4 32 1 2004/10/10 1
5: 5 29 1 2000/12/5 2
6: 1 30 2 2000/1/29 2
7: 2 27 2 NA NA
8: 3 26 2 2004/4/5 2
9: 4 32 2 2009/8/27 1
10: 5 29 2 2005/2/28 1
11: 1 30 3 NA NA
12: 2 27 3 NA NA
13: 3 26 3 2007/9/2 1
14: 4 32 3 2012/7/21 1
15: 5 29 3 NA NA
> str(DT.m2)
Classes ‘data.table’ and 'data.frame': 15 obs. of 5 variables:
$ family_id : int 1 2 3 4 5 1 2 3 4 5 ...
$ age_mother: int 30 27 26 32 29 30 27 26 32 29 ...
$ variable : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 2 2 2 2 ...
$ dob : chr "1998/11/26" "1996/6/22" "2002/7/11" "2004/10/10" ...
$ gender : int 1 2 2 1 2 2 NA 2 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr>
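In this result the "variable" column is a factor with levels "1", "2", "3", indicating the position of the column within each group. If a plain integer child index is preferred, it could be derived as in the sketch below; the column name child is an assumption chosen for illustration.

```r
library(data.table)

# Inline copy of the melt_enhanced.csv data.
DT <- data.table(
  family_id     = 1:5,
  age_mother    = c(30L, 27L, 26L, 32L, 29L),
  dob_child1    = c("1998/11/26", "1996/6/22", "2002/7/11", "2004/10/10", "2000/12/5"),
  dob_child2    = c("2000/1/29", NA, "2004/4/5", "2009/8/27", "2005/2/28"),
  dob_child3    = c(NA, NA, "2007/9/2", "2012/7/21", NA),
  gender_child1 = c(1L, 2L, 2L, 1L, 2L),
  gender_child2 = c(2L, NA, 2L, 1L, 1L),
  gender_child3 = c(NA, NA, 1L, 1L, NA)
)

colA <- paste0("dob_child", 1:3)
colB <- paste0("gender_child", 1:3)

# variable.name renames the index column while melting.
DT.m2 <- melt(DT, measure.vars = list(colA, colB),
              variable.name = "child", value.name = c("dob", "gender"))

# Convert the factor labels "1", "2", "3" to integers.
DT.m2[, child := as.integer(as.character(child))]
str(DT.m2)
```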
Using patterns(): the column groups can also be selected with regular expressions.
> DT.m2 = melt(DT, measure = patterns("^dob", "^gender"), value.name = c("dob", "gender"))
> DT.m2
family_id age_mother variable dob gender
1: 1 30 1 1998/11/26 1
2: 2 27 1 1996/6/22 2
3: 3 26 1 2002/7/11 2
4: 4 32 1 2004/10/10 1
5: 5 29 1 2000/12/5 2
6: 1 30 2 2000/1/29 2
7: 2 27 2 NA NA
8: 3 26 2 2004/4/5 2
9: 4 32 2 2009/8/27 1
10: 5 29 2 2005/2/28 1
11: 1 30 3 NA NA
12: 2 27 3 NA NA
13: 3 26 3 2007/9/2 1
14: 4 32 3 2012/7/21 1
15: 5 29 3 NA NA
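The explicit list and the patterns() call are expected to give the same result on this data. A quick way to check, sketched on an inline copy of the data (by_list and by_pattern are names made up for this example):

```r
library(data.table)

# Inline copy of the melt_enhanced.csv data.
DT <- data.table(
  family_id     = 1:5,
  age_mother    = c(30L, 27L, 26L, 32L, 29L),
  dob_child1    = c("1998/11/26", "1996/6/22", "2002/7/11", "2004/10/10", "2000/12/5"),
  dob_child2    = c("2000/1/29", NA, "2004/4/5", "2009/8/27", "2005/2/28"),
  dob_child3    = c(NA, NA, "2007/9/2", "2012/7/21", NA),
  gender_child1 = c(1L, 2L, 2L, 1L, 2L),
  gender_child2 = c(2L, NA, 2L, 1L, 1L),
  gender_child3 = c(NA, NA, 1L, 1L, NA)
)

# Column groups given explicitly ...
by_list <- melt(DT,
                measure.vars = list(paste0("dob_child", 1:3),
                                    paste0("gender_child", 1:3)),
                value.name = c("dob", "gender"))

# ... and selected by regular expression.
by_pattern <- melt(DT, measure.vars = patterns("^dob", "^gender"),
                   value.name = c("dob", "gender"))

all.equal(by_list, by_pattern)
```

patterns() saves typing when there are many columns, but it relies on the columns appearing in the same child order within each group.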
b. Enhanced dcast()
dcast() accepts several columns in value.var and casts them all in one call.
> DT.c2 = dcast(DT.m2, family_id + age_mother ~ variable, value.var = c("dob", "gender"))
> DT.c2
family_id age_mother dob_1 dob_2 dob_3 gender_1 gender_2 gender_3
1: 1 30 1998/11/26 2000/1/29 NA 1 2 NA
2: 2 27 1996/6/22 NA NA 2 NA NA
3: 3 26 2002/7/11 2004/4/5 2007/9/2 2 2 1
4: 4 32 2004/10/10 2009/8/27 2012/7/21 1 1 1
5: 5 29 2000/12/5 2005/2/28 NA 2 1 NA
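The cast columns are named dob_1, gender_1, and so on, rather than dob_child1, gender_child1. If the original names are wanted, one option is to rename them afterwards. The sketch below assumes the inline data shown; the regular expression is an illustration that fits these particular column names only.

```r
library(data.table)

# Inline copy of the melt_enhanced.csv data.
DT <- data.table(
  family_id     = 1:5,
  age_mother    = c(30L, 27L, 26L, 32L, 29L),
  dob_child1    = c("1998/11/26", "1996/6/22", "2002/7/11", "2004/10/10", "2000/12/5"),
  dob_child2    = c("2000/1/29", NA, "2004/4/5", "2009/8/27", "2005/2/28"),
  dob_child3    = c(NA, NA, "2007/9/2", "2012/7/21", NA),
  gender_child1 = c(1L, 2L, 2L, 1L, 2L),
  gender_child2 = c(2L, NA, 2L, 1L, 1L),
  gender_child3 = c(NA, NA, 1L, 1L, NA)
)

# Wide -> long with two value columns.
DT.m2 <- melt(DT, measure.vars = patterns("^dob", "^gender"),
              value.name = c("dob", "gender"))

# Long -> wide, casting both value columns at once.
DT.c2 <- dcast(DT.m2, family_id + age_mother ~ variable,
               value.var = c("dob", "gender"))

# Turn "dob_1" into "dob_child1" etc.; names without a trailing "_<digit>" are unchanged.
setnames(DT.c2, sub("_(\\d)$", "_child\\1", names(DT.c2)))
names(DT.c2)
```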