长表和宽表的相互转化

使用 函数melt()和decast(),是reshape2包中函数的增强

先考虑传统长宽表转化函数,

 

1.默认函数

a.融合data.table(宽表到长表)

考虑如下data.table:

> DT = fread("melt_default.csv")
> DT

   family_id age_mother dob_child1 dob_child2 dob_child3
1:         1         30 1998/11/26  2000/1/29         NA
2:         2         27  1996/6/22         NA         NA
3:         3         26  2002/7/11   2004/4/5   2007/9/2
4:         4         32 2004/10/10  2009/8/27  2012/7/21
5:         5         29  2000/12/5  2005/2/28         NA
> str(DT)
Classes ‘data.table’ and 'data.frame':        5 obs. of  5 variables:
 $ family_id : int  1 2 3 4 5
 $ age_mother: int  30 27 26 32 29
 $ dob_child1: chr  "1998/11/26" "1996/6/22" "2002/7/11" "2004/10/10" ...
 $ dob_child2: chr  "2000/1/29" NA "2004/4/5" "2009/8/27" ...
 $ dob_child3: chr  NA NA "2007/9/2" "2012/7/21" ...
 - attr(*, ".internal.selfref")=

 

把数据宽表转化长表,dob列合并成一列。

> DT.m1 = melt(DT, id.vars = c("family_id", "age_mother"),
+              measure.vars = c("dob_child1", "dob_child2", "dob_child3"))
> DT.m1

    family_id age_mother   variable      value
 1:         1         30 dob_child1 1998/11/26
 2:         2         27 dob_child1  1996/6/22
 3:         3         26 dob_child1  2002/7/11
 4:         4         32 dob_child1 2004/10/10
 5:         5         29 dob_child1  2000/12/5
 6:         1         30 dob_child2  2000/1/29
 7:         2         27 dob_child2         NA
 8:         3         26 dob_child2   2004/4/5
 9:         4         32 dob_child2  2009/8/27
10:         5         29 dob_child2  2005/2/28
11:         1         30 dob_child3         NA
12:         2         27 dob_child3         NA
13:         3         26 dob_child3   2007/9/2
14:         4         32 dob_child3  2012/7/21
15:         5         29 dob_child3         NA

> str(DT.m1)
Classes ‘data.table’ and 'data.frame':        15 obs. of  4 variables:
 $ family_id : int  1 2 3 4 5 1 2 3 4 5 ...
 $ age_mother: int  30 27 26 32 29 30 27 26 32 29 ...
 $ variable  : Factor w/ 3 levels "dob_child1","dob_child2",..: 1 1 1 1 1 2 2 2 2 2 ...
 $ value     : chr  "1998/11/26" "1996/6/22" "2002/7/11" "2004/10/10" ...
 - attr(*, ".internal.selfref")=

 

生成两个新变量,默认列名是“variable”和“value”

生成列“variable”默认类型是“factor”,若是想生成“charactor”,可用variable.factor=FALSE来设置,这个只能在data.table的melt中设置,不能用在reshape2中设置。

 

为变量“variable”和“value”进行命名

> DT.m1 = melt(DT, measure.vars = c("dob_child1", "dob_child2", "dob_child3"),
+              variable.name = "child", value.name = "dob")
> DT.m1

    family_id age_mother      child        dob
 1:         1         30 dob_child1 1998/11/26
 2:         2         27 dob_child1  1996/6/22
 3:         3         26 dob_child1  2002/7/11
 4:         4         32 dob_child1 2004/10/10
 5:         5         29 dob_child1  2000/12/5
 6:         1         30 dob_child2  2000/1/29
 7:         2         27 dob_child2         NA
 8:         3         26 dob_child2   2004/4/5
 9:         4         32 dob_child2  2009/8/27
10:         5         29 dob_child2  2005/2/28
11:         1         30 dob_child3         NA
12:         2         27 dob_child3         NA
13:         3         26 dob_child3   2007/9/2
14:         4         32 dob_child3  2012/7/21
15:         5         29 dob_child3         NA

 

b)长表转成宽表

> dcast(DT.m1, family_id + age_mother ~ child, value.var = "dob")
   family_id age_mother dob_child1 dob_child2 dob_child3
1:         1         30 1998/11/26  2000/1/29         NA
2:         2         27  1996/6/22         NA         NA
3:         3         26  2002/7/11   2004/4/5   2007/9/2
4:         4         32 2004/10/10  2009/8/27  2012/7/21
5:         5         29  2000/12/5  2005/2/28         NA

 

也可以通过使用dcast()中的函数fun.aggregate进行分组计算,计算表DT.m1中,family_id下dob不等于NA的个数:

> dcast(DT.m1,family_id ~.,fun.agg=function(x) sum(!is.na(x)),value.var = "dob")
   family_id .
1:         1 2
2:         2 1
3:         3 3
4:         4 3
5:         5 2

 

 

2.当前melt/decast的使用局限性

> DT = fread("melt_enhanced.csv")
> DT

   family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1:         1         30 1998/11/26  2000/1/29         NA             1             2            NA
2:         2         27  1996/6/22         NA         NA             2            NA            NA
3:         3         26  2002/7/11   2004/4/5   2007/9/2             2             2             1
4:         4         32 2004/10/10  2009/8/27  2012/7/21             1             1             1
5:         5         29  2000/12/5  2005/2/28         NA             2             1            NA

 

若想合并所有的dob列和gender列,按照目前的做法,

> DT.m1 = melt(DT, id = c("family_id", "age_mother"))
Warning message:
In melt.data.table(DT, id = c("family_id", "age_mother")) :
  'measure.vars' [dob_child1, dob_child2, dob_child3, gender_child1, ...] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'character'. All measure variables not of type 'character' will be coerced to. Check DETAILS in ?melt.data.table for more on coercion.

> DT.m1[, c("variable", "child") := tstrsplit(variable, "_", fixed = TRUE)]
> DT.c1 = dcast(DT.m1, family_id + age_mother + child ~ variable, value.var = "value")
> DT.c1

    family_id age_mother  child        dob gender
 1:         1         30 child1 1998/11/26      1
 2:         1         30 child2  2000/1/29      2
 3:         1         30 child3         NA     NA
 4:         2         27 child1  1996/6/22      2
 5:         2         27 child2         NA     NA
 6:         2         27 child3         NA     NA
 7:         3         26 child1  2002/7/11      2
 8:         3         26 child2   2004/4/5      2
 9:         3         26 child3   2007/9/2      1
10:         4         32 child1 2004/10/10      1
11:         4         32 child2  2009/8/27      1
12:         4         32 child3  2012/7/21      1
13:         5         29 child1  2000/12/5      2
14:         5         29 child2  2005/2/28      1
15:         5         29 child3         NA     NA

 

> str(DT.c1)
Classes ‘data.table’ and 'data.frame':        15 obs. of  5 variables:
 $ family_id : int  1 1 1 2 2 2 3 3 3 4 ...
 $ age_mother: int  30 30 30 27 27 27 26 26 26 32 ...
 $ child     : chr  "child1" "child2" "child3" "child1" ...
 $ dob       : chr  "1998/11/26" "2000/1/29" NA "1996/6/22" ...
 $ gender    : chr  "1" "2" NA "2" ...
 - attr(*, ".internal.selfref")=
 - attr(*, "sorted")= chr  "family_id" "age_mother" "child"

 

 

3.增强新功能

a.增强melt

> colA = paste("dob_child", 1:3, sep = "")
> colB = paste("gender_child", 1:3, sep = "")
> DT.m2 = melt(DT, measure = list(colA, colB), value.name = c("dob", "gender"))
> DT.m2

    family_id age_mother variable        dob gender
 1:         1         30        1 1998/11/26      1
 2:         2         27        1  1996/6/22      2
 3:         3         26        1  2002/7/11      2
 4:         4         32        1 2004/10/10      1
 5:         5         29        1  2000/12/5      2
 6:         1         30        2  2000/1/29      2
 7:         2         27        2         NA     NA
 8:         3         26        2   2004/4/5      2
 9:         4         32        2  2009/8/27      1
10:         5         29        2  2005/2/28      1
11:         1         30        3         NA     NA
12:         2         27        3         NA     NA
13:         3         26        3   2007/9/2      1
14:         4         32        3  2012/7/21      1
15:         5         29        3         NA     NA

 

> str(DT.m2)
Classes ‘data.table’ and 'data.frame':        15 obs. of  5 variables:
 $ family_id : int  1 2 3 4 5 1 2 3 4 5 ...
 $ age_mother: int  30 27 26 32 29 30 27 26 32 29 ...
 $ variable  : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 2 2 2 2 ...
 $ dob       : chr  "1998/11/26" "1996/6/22" "2002/7/11" "2004/10/10" ...
 $ gender    : int  1 2 2 1 2 2 NA 2 1 1 ...
 - attr(*, ".internal.selfref")=

 

使用pattens

> DT.m2 = melt(DT, measure = patterns("^dob", "^gender"), value.name = c("dob", "gender"))
> DT.m2

    family_id age_mother variable        dob gender
 1:         1         30        1 1998/11/26      1
 2:         2         27        1  1996/6/22      2
 3:         3         26        1  2002/7/11      2
 4:         4         32        1 2004/10/10      1
 5:         5         29        1  2000/12/5      2
 6:         1         30        2  2000/1/29      2
 7:         2         27        2         NA     NA
 8:         3         26        2   2004/4/5      2
 9:         4         32        2  2009/8/27      1
10:         5         29        2  2005/2/28      1
11:         1         30        3         NA     NA
12:         2         27        3         NA     NA
13:         3         26        3   2007/9/2      1
14:         4         32        3  2012/7/21      1
15:         5         29        3         NA     NA

 

b.增强dcast

> DT.c2 = dcast(DT.m2, family_id + age_mother ~ variable, value.var = c("dob", "gender"))
> DT.c2
   family_
id age_mother      dob_1     dob_2     dob_3 gender_1 gender_2 gender_3
1:         1         30 1998/11/26 2000/1/29        NA        1        2       NA
2:         2         27  1996/6/22        NA        NA        2       NA       NA
3:         3         26  2002/7/11  2004/4/5  2007/9/2        2        2        1
4:         4         32 2004/10/10 2009/8/27 2012/7/21        1        1        1
5:         5         29  2000/12/5 2005/2/28        NA        2        1       NA

 

多看书,多写文章,多读书(发出声音来的读),每天坚持1小时。可以学到很多东西。

你可能感兴趣的:(基础学习,data.table)