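For discrete variables, we look at the frequency of each level, cross-tabulate two variables, or draw a bar chart.
For continuous variables, we look at the mean, standard deviation, quantiles, and so on.
In addition, functions such as summary(), str(), and describe() (from the psych package) summarize an entire data frame.
These are the most basic methods, but they are inflexible and their output is fixed. To go further, we need to aggregate and reshape the data.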
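The basic summaries listed above can be sketched in base R as follows. The `toy` data frame and its values are made up for illustration and are not part of the data set used later.

```r
# Toy data frame standing in for a real data set (hypothetical values)
toy <- data.frame(
  gender = c("Female", "Male", "Female", "Male", "Female"),
  house  = c("Yes", "No", "No", "Yes", "Yes"),
  income = c(52000, 48000, 61000, 45000, 70000)
)

# Discrete variables: frequencies and a cross-tabulation
gender_counts   <- table(toy$gender)
gender_by_house <- table(toy$gender, toy$house)

# Continuous variable: mean, standard deviation, quartiles
income_mean      <- mean(toy$income)
income_sd        <- sd(toy$income)
income_quartiles <- quantile(toy$income, probs = c(0.25, 0.5, 0.75))

# Whole-data-frame overviews
str(toy)
summary(toy)
```

A bar chart of a discrete variable can be drawn from the same frequency table with barplot(gender_counts).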
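Before turning to aggregation and reshaping, we introduce tibble, a newer object that can replace the data frame, and readr, a package for reading data sets efficiently. At the end we cover two packages for reshaping data: reshape2 and tidyr.
The tibble object: a replacement for the traditional data frame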
> library(tibble)
> library(ggplot2)
> sim.dat=read.csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv")
> df=data.frame(x=c(1:5),y=rep("a",5))
> as_tibble(df)
# A tibble: 5 x 2
x y
1 1 a
2 2 a
3 3 a
4 4 a
5 5 a
> tibble(x=1:5,y=rep("a",5))
# A tibble: 5 x 2
x y
1 1 a
2 2 a
3 3 a
4 4 a
5 5 a
>
> tibble(x=1:5,y=1,z=x^2+y)
# A tibble: 5 x 3
x y z
1 1 1 2
2 2 1 5
3 3 1 10
4 4 1 17
5 5 1 26
> tb=tibble(':)'="smile",' '="space",'2000'="number")
> print(tb)
# A tibble: 1 x 3
`:)` ` ` `2000`
1 smile space number
>
Note that whenever you refer to such columns, including from other packages, you must wrap the names in backticks.
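A minimal sketch of referring to such non-syntactic column names, assuming the tibble package is installed. The object name `tb` follows the example above.

```r
library(tibble)

# Column names that are not valid R identifiers
tb <- tibble(`:)` = "smile", ` ` = "space", `2000` = "number")

# With $ the name must be wrapped in backticks
smile <- tb$`:)`

# With [[ the name is passed as an ordinary string
number <- tb[["2000"]]

names(tb)
```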
tibble differs from the traditional data frame mainly in two respects: printing and subsetting.
1. Printing
> print(as_tibble(sim.dat))
# A tibble: 1,000 x 19
age gender income house store_exp online_exp store_trans online_trans
<int> <fctr> <dbl> <fctr> <dbl> <dbl> <int> <int>
1 57 Female 120963.4 Yes 529.1344 303.5125 2 2
2 63 Female 122008.1 Yes 478.0058 109.5297 4 2
3 59 Male 114202.3 Yes 490.8107 279.2496 7 2
4 60 Male 113616.3 Yes 347.8090 141.6698 10 2
5 51 Male 124252.6 Yes 379.6259 112.2372 4 4
6 59 Male 107661.5 Yes 338.3154 195.6870 4 5
7 57 Male 120483.3 Yes 482.5445 284.5363 5 3
8 57 Male 110542.0 Yes 340.7368 135.2556 11 5
9 61 Female 132060.5 Yes 608.2310 142.5503 6 1
10 60 Male 105048.8 Yes 470.3190 163.4663 12 1
# ... with 990 more rows, and 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
# Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
# segment <fctr>
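As shown above, a tibble prints only the first 10 rows, fits the number of columns to the screen width, and shows each column's type under its name, which makes the output easier to read.
2. Subsetting
To extract a single variable from a tibble, use "$" or "[[". "[[" can extract by name or by position, while "$" extracts only by name. Both can also be used in a pipeline with "%>%" (the pipe operator), where "." stands for the object being piped.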
sim.dat$age
sim.dat[["age"]]
sim.dat[[1]]
library(dplyr)
sim.dat%>%.$age
sim.dat%>%.[["age"]]
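Both "$" and "[[" return a plain vector, for a data frame and for a tibble alike. The difference lies in single-bracket indexing: selecting one column of a data frame with "[" silently drops the result to a vector, which can break code that expects a data frame, whereas "[" on a tibble always returns another tibble.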
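The subsetting difference is easiest to see with single-bracket indexing. The following is a small sketch, assuming the tibble package is installed; `df` and `tb` are illustrative objects.

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# One column selected with [ from a data frame is dropped to a vector
df_col <- df[, "x"]

# The same selection on a tibble is still a tibble
tb_col <- tb[, "x"]

# [[ extracts the underlying vector from a tibble
tb_vec <- tb[["x"]]

# drop = FALSE keeps the data frame structure in base R
df_keep <- df[, "x", drop = FALSE]
```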
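Note: because tibble is relatively new, some older modeling functions may not handle it. After cleaning the data, you can convert a tibble back to a plain data frame before modeling: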
sim.dat=as.data.frame(sim.dat)
class(sim.dat)
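Efficient data reading and writing: the readr package
Functions in readr for reading data:
read_csv() reads comma-separated files
read_csv2() reads semicolon-separated files
read_tsv() reads tab-separated files
read_delim() reads files with an arbitrary delimiter
Of these, read_csv() covers most data-import needs.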
> library(readr)
# skip=2 skips the first two lines
> dat=read_csv("This line is sample metadata
+ This line is just a comment
+ x,y,z
+ 1,2,3",skip=2)
> print(dat)
# A tibble: 1 x 3
x y z
<int> <int> <int>
1 1 2 3
> dat=read_csv("1,2,3\n4,5,6",col_names=FALSE)
> print(dat)
# A tibble: 2 x 3
X1 X2 X3
<int> <int> <int>
1 1 2 3
2 4 5 6
For semicolon-separated data, use read_csv2():
> dat=read_csv2("x;y;z\n1;2;3")
> print(dat)
# A tibble: 1 x 3
x y z
<int> <int> <int>
1 1 2 3
For tab-separated data, use read_tsv():
> dat1=read_tsv("x\ty\tz\n1\t2\t3")
> print(dat1)
# A tibble: 1 x 3
x y z
<int> <int> <int>
1 1 2 3
For any other delimiter, use read_delim():
> dat2=read_delim("x|y|z\n1|2|3",delim=
+ "|")
> print(dat2)
# A tibble: 1 x 3
x y z
<int> <int> <int>
1 1 2 3
>
Specifying missing values:
> dat=read_csv("x,y,z\n1,2,99",na="99")
> print(dat)
# A tibble: 1 x 3
x y z
<int> <int> <chr>
1 1 2 <NA>
>
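readr also provides two functions for writing data, write_csv() and write_tsv(). Their advantages are:
1. Strings are always encoded as UTF-8
2. Dates and times are stored in ISO 8601 format, so other software can parse them easily
write_excel_csv() also writes a .csv file, but adds a byte-order mark so that Excel recognizes the UTF-8 encoding.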
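A short round-trip sketch of writing and re-reading a file with readr, assuming the package is installed. The data frame `out` and its values are hypothetical, and the file goes to a temporary path.

```r
library(readr)

# Small hypothetical data set with a date column
out <- data.frame(
  id     = 1:2,
  name   = c("Ann", "Bob"),
  joined = as.Date(c("2017-01-15", "2018-03-02"))
)

path <- tempfile(fileext = ".csv")
write_csv(out, path)

# Inspect the raw file: dates are written as ISO 8601 strings (YYYY-MM-DD)
raw_lines <- readLines(path)

# Reading the file back recovers the date column
back <- suppressMessages(read_csv(path))
```

Because the dates are stored in a standard format, read_csv() can guess the column type on the way back in without extra arguments.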
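For other types of data, the following packages are available:
haven: reads SPSS, Stata, and SAS data
readxl: reads Excel files (.xls and .xlsx)
DBI: together with a backend for a specific database (MySQL and so on), reads data directly from the database through SQL.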
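As an illustration of the DBI workflow, here is a sketch that queries an in-memory SQLite database. It assumes the DBI and RSQLite packages are installed. SQLite is used only because it needs no server, and the table name `customers` and its values are made up.

```r
library(DBI)

# In-memory SQLite database (assumption: RSQLite is the backend)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table to query
customers <- data.frame(
  gender       = c("Female", "Male", "Female"),
  online_trans = c(10, 4, 6)
)
dbWriteTable(con, "customers", customers)

# Read data through SQL; the result comes back as a data frame
res <- dbGetQuery(con, paste(
  "SELECT gender, AVG(online_trans) AS avg",
  "FROM customers GROUP BY gender ORDER BY gender"
))

dbDisconnect(con)
```

For a database such as MySQL, only the dbConnect() call would change: it would take that database's backend and connection details.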
The data.table object:
We can index and search a data frame with square brackets.
Simple aggregation can also be done with tapply(), aggregate(), and table().
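A base-R sketch of bracket indexing and the three aggregation functions just mentioned. The `toy` data frame is hypothetical and only borrows column names from the data set used in these notes.

```r
toy <- data.frame(
  gender       = c("Female", "Male", "Female", "Male"),
  house        = c("Yes", "Yes", "No", "No"),
  online_trans = c(10, 4, 20, 16)
)

# Square brackets: keep the rows that satisfy a condition
busy <- toy[toy$online_trans > 5, ]

# tapply(): mean of one variable within each level of another
by_gender <- tapply(toy$online_trans, toy$gender, mean)

# aggregate(): the same idea with a formula and several grouping variables
by_both <- aggregate(online_trans ~ gender + house, data = toy, FUN = mean)

# table(): counts per level
counts <- table(toy$gender)
```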
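Square brackets make subsetting a data frame easy, but further aggregation needs help from other packages. It would be far more convenient to aggregate inside the brackets themselves, and data.table does exactly that:
1. It handles large data sets more efficiently
2. It is as simple to use as a data frame
3. It subsets, groups, and merges data quickly
4. A data frame is easily converted to a data table
> library(data.table)
> dt=data.table(sim.dat)
# Note: a traditional data frame does not support the following operations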
> dt[,mean(online_trans)]
[1] 13.546
> dt[,mean(online_trans),by=gender]
gender V1
1: Female 15.38448
2: Male 11.26233
> dt[,mean(online_trans),by=.(gender,house)]
gender house V1
1: Female Yes 11.312030
2: Male Yes 8.771523
3: Female No 19.145833
4: Male No 16.486111
> dt[,.(avg=mean(online_trans)),by=.(gender,house)]
gender house avg
1: Female Yes 11.312030
2: Male Yes 8.771523
3: Female No 19.145833
4: Male No 16.486111
data.table operations resemble SQL.
For example: select gender,avg(online_trans) from sim.dat group by gender
is equivalent to
> dt[,mean(online_trans),by=gender]
gender V1
1: Female 15.38448
2: Male 11.26233
>
select gender,house,avg(online_trans) as avg from sim.dat group by gender,house
is equivalent to
> dt[,.(avg=mean(online_trans)),by=.(gender,house)]
gender house avg
1: Female Yes 11.312030
2: Male Yes 8.771523
3: Female No 19.145833
4: Male No 16.486111
>
select gender,house,avg(online_trans) as avg from
sim.dat where age <40 group by gender,house
is equivalent to
> dt[age<40,.(avg=mean(online_trans)),by=.(gender,house)]
gender house avg
1: Male Yes 14.45977
2: Female Yes 18.14062
3: Male No 18.24299
4: Female No 20.10196
Selecting rows
> dt[age<20&income>80000]
age gender income house store_exp online_exp store_trans online_trans Q1 Q2
1: 19 Female 83534.70 No 227.6686 1490.719 1 22 2 1
2: 18 Female 89415.97 Yes 209.5487 1926.470 3 28 2 1
3: 19 Female 92812.81 No 186.7475 1041.539 2 18 3 1
Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
1: 1 2 4 1 4 2 4 1 Style
2: 1 1 4 1 4 2 4 1 Style
3: 1 2 4 1 4 3 4 1 Style
> dt[1:2]
age gender income house store_exp online_exp store_trans online_trans Q1 Q2
1: 57 Female 120963.4 Yes 529.1344 303.5125 2 2 4 2
2: 63 Female 122008.1 Yes 478.0058 109.5297 4 2 4 1
Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
1: 1 2 1 4 1 4 2 4 Price
2: 1 2 1 4 1 4 1 4 Price
>
Selecting columns:
> ans=dt[,age]
> head(ans)
[1] 57 63 59 60 51 59
> ans=dt[,.(age,online_exp)]
> head(ans)
age online_exp
1: 57 303.5125
2: 63 109.5297
3: 59 279.2496
4: 60 141.6698
5: 51 112.2372
6: 59 195.6870
> ans=dt[,age:income,with=FALSE]
> head(ans,2)
age gender income
1: 57 Female 120963.4
2: 63 Female 122008.1
# Drop columns; the - can be replaced with !
> ans=dt[,-(age:online_exp),with=FALSE]
Counting
> dt[,.N]
[1] 1000
> dt[,.N,by=gender]
gender N
1: Female 554
2: Male 446
> dt[age<30,.(count=.N),by=gender]
gender count
1: Female 292
2: Male 86
> head(dt[order(-online_exp)],5)
age gender income house store_exp online_exp store_trans online_trans Q1 Q2
1: 40 Female 217599.7 No 7023.684 9479.442 10 6 1 4
2: 41 Female NA Yes 3786.740 8638.239 14 10 1 4
3: 36 Male 228550.1 Yes 3279.621 8220.555 8 12 1 4
4: 31 Female 159508.1 Yes 5177.081 8005.932 11 13 1 4
5: 43 Female 190407.4 Yes 4694.922 7875.562 6 11 1 4
Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
1: 5 4 3 4 4 1 4 2 Conspicuous
2: 4 4 4 4 4 1 4 2 Conspicuous
3: 5 4 4 4 4 1 4 1 Conspicuous
4: 4 4 4 4 4 1 4 2 Conspicuous
5: 5 4 4 4 4 1 4 2 Conspicuous
> dt[order(-online_exp)][1:5]
age gender income house store_exp online_exp store_trans online_trans Q1 Q2
1: 40 Female 217599.7 No 7023.684 9479.442 10 6 1 4
2: 41 Female NA Yes 3786.740 8638.239 14 10 1 4
3: 36 Male 228550.1 Yes 3279.621 8220.555 8 12 1 4
4: 31 Female 159508.1 Yes 5177.081 8005.932 11 13 1 4
5: 43 Female 190407.4 Yes 4694.922 7875.562 6 11 1 4
Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
1: 5 4 3 4 4 1 4 2 Conspicuous
2: 4 4 4 4 4 1 4 2 Conspicuous
3: 5 4 4 4 4 1 4 1 Conspicuous
4: 4 4 4 4 4 1 4 2 Conspicuous
5: 5 4 4 4 4 1 4 2 Conspicuous
> dt[order(gender,-online_exp)][1:5]
age gender income house store_exp online_exp store_trans online_trans Q1 Q2
1: 40 Female 217599.7 No 7023.684 9479.442 10 6 1 4
2: 41 Female NA Yes 3786.740 8638.239 14 10 1 4
3: 31 Female 159508.1 Yes 5177.081 8005.932 11 13 1 4
4: 43 Female 190407.4 Yes 4694.922 7875.562 6 11 1 4
5: 50 Female 263858.0 Yes 5813.802 7448.729 11 11 1 4
Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
1: 5 4 3 4 4 1 4 2 Conspicuous
2: 4 4 4 4 4 1 4 2 Conspicuous
3: 4 4 4 4 4 1 4 2 Conspicuous
4: 5 4 4 4 4 1 4 2 Conspicuous
5: 5 4 4 4 4 1 4 1 Conspicuous
>
Reading data with fread()
The fread() function in data.table reads files even faster than read_csv().
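A minimal sketch of fread(), assuming the data.table package is installed. It reads a small temporary file written on the spot, so the file contents here are illustrative only.

```r
library(data.table)

# Write a tiny CSV file to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("x,y,z", "1,2,3", "4,5,6"), path)

# fread() detects the separator and column types automatically
dt_in <- fread(path)

# The result is already a data.table, so bracket aggregation works directly
z_total <- dt_in[, sum(z)]
```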
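base package: apply(), lapply(), sapply(), etc.
# Column classes; apply() would first coerce the data frame to a character matrix
> sapply(sim.dat,class)
# Keep only the numeric columns
> sdat=sim.dat[,sapply(sim.dat,is.numeric)]
# Column means and standard deviations, ignoring missing values
> apply(sdat,MARGIN=2,function(x) mean(na.omit(x)))
> apply(sdat,MARGIN=2,function(x) sd(na.omit(x)))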
plyr package: ddply()
> library(plyr)
# Average spend per transaction, online and in store, for each segment
> ddply(sim.dat,"segment",summarize,avg_online=round(sum(online_exp)/sum(online_trans),2),avg_store=round(sum(store_exp)/sum(store_trans),2))
segment avg_online avg_store
1 Conspicuous 442.27 479.25
2 Price 69.28 81.30
3 Quality 126.05 105.12
4 Style 92.83 121.07
>
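dplyr package (built specifically for data frames). Its main functions:
1. Displaying a data frame
2. Subsetting
3. Summarizing
4. Creating new variables
5. Merging data sets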
> dplyr::tbl_df(sim.dat)
# A tibble: 1,000 x 19
age gender income house store_exp online_exp store_trans online_trans
1 57 Female 120963.4 Yes 529.1344 303.5125 2 2
2 63 Female 122008.1 Yes 478.0058 109.5297 4 2
3 59 Male 114202.3 Yes 490.8107 279.2496 7 2
4 60 Male 113616.3 Yes 347.8090 141.6698 10 2
5 51 Male 124252.6 Yes 379.6259 112.2372 4 4
6 59 Male 107661.5 Yes 338.3154 195.6870 4 5
7 57 Male 120483.3 Yes 482.5445 284.5363 5 3
8 57 Male 110542.0 Yes 340.7368 135.2556 11 5
9 61 Female 132060.5 Yes 608.2310 142.5503 6 1
10 60 Male 105048.8 Yes 470.3190 163.4663 12 1
# ... with 990 more rows, and 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
# Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
# segment <chr>
> dplyr::glimpse(sim.dat)
Observations: 1,000
Variables: 19
$ age 57, 63, 59, 60, 51, 59, 57, 57, 61, 60, 58, 59, 64, 57,...
$ gender "Female", "Female", "Male", "Male", "Male", "Male", "Ma...
$ income 120963.4, 122008.1, 114202.3, 113616.3, 124252.6, 10766...
$ house " Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ store_exp 529.1344, 478.0058, 490.8107, 347.8090, 379.6259, 338.3...
$ online_exp 303.5125, 109.5297, 279.2496, 141.6698, 112.2372, 195.6...
$ store_trans 2, 4, 7, 10, 4, 4, 5, 11, 6, 12, 5, 6, 7, 7, 5, 5, 5, 5...
$ online_trans 2, 2, 2, 2, 4, 5, 3, 5, 1, 1, 4, 2, 4, 3, 5, 1, 3, 2, 2...
$ Q1 4, 4, 5, 5, 4, 4, 4, 5, 4, 4, 4, 4, 5, 4, 4, 5, 5, 5, 4...
$ Q2 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2...
$ Q3 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ Q4 2, 2, 2, 3, 3, 2, 2, 3, 2, 3, 3, 3, 3, 3, 2, 2, 2, 3, 3...
$ Q5 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ Q6 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
$ Q7 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ Q8 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
$ Q9 2, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2...
$ Q10 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
$ segment " Price", "Price", "Price", "Price", "Price", "Price", "...
Subsetting (by row or column)
> library(magrittr)
> library(dplyr)
> dplyr::filter(sim.dat,income>300000) %>%
+ dplyr::tbl_df()
# A tibble: 4 x 19
age gender income house store_exp online_exp store_trans online_trans
<int> <chr> <dbl> <chr> <dbl> <dbl> <int> <int>
1 40 Male 301398.0 Yes 4840.461 3618.212 10 11
2 33 Male 319704.3 Yes 5998.305 4395.923 9 11
3 41 Male 317476.2 Yes 3029.844 4179.671 11 12
4 37 Female 315697.2 Yes 6548.970 4284.065 13 11
# ... with 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>, Q4 <int>, Q5 <int>,
# Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>, segment <chr>
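In addition, distinct() removes duplicate rows from a data frame; sample_frac() randomly selects a given fraction of rows; sample_n() randomly selects a given number of rows; slice() selects rows by position; and top_n() selects the observations with the largest values of a variable.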
> dplyr::distinct(sim.dat)
# A tibble: 1,000 x 19
age gender income house store_exp online_exp store_trans online_trans
<int> <chr> <dbl> <chr> <dbl> <dbl> <int> <int>
1 57 Female 120963.4 Yes 529.1344 303.5125 2 2
2 63 Female 122008.1 Yes 478.0058 109.5297 4 2
3 59 Male 114202.3 Yes 490.8107 279.2496 7 2
4 60 Male 113616.3 Yes 347.8090 141.6698 10 2
5 51 Male 124252.6 Yes 379.6259 112.2372 4 4
6 59 Male 107661.5 Yes 338.3154 195.6870 4 5
7 57 Male 120483.3 Yes 482.5445 284.5363 5 3
8 57 Male 110542.0 Yes 340.7368 135.2556 11 5
9 61 Female 132060.5 Yes 608.2310 142.5503 6 1
10 60 Male 105048.8 Yes 470.3190 163.4663 12 1
# ... with 990 more rows, and 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
# Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
# segment <chr>
> dplyr::sample_frac(sim.dat,0.05,replace=TRUE)
# A tibble: 50 x 19
age gender income house store_exp online_exp store_trans online_trans
<int> <chr> <dbl> <chr> <dbl> <dbl> <int> <int>
1 22 Male 91553.21 No 200.7210 1777.4974 4 27
2 34 Female 60521.76 No 299.3096 2054.1732 3 16
3 33 Male NA No 265.6550 1892.5581 2 12
4 38 Female 164506.62 Yes 3916.9309 5764.1235 11 10
5 26 Female 89461.40 No 200.4784 2449.7965 1 23
6 26 Female 105528.79 Yes 186.9383 2349.9275 5 17
7 55 Male 128194.20 Yes 595.6952 156.9314 6 2
8 35 Female 130108.64 Yes 6155.4803 6201.7090 9 13
9 36 Male NA Yes 203.3036 2202.5147 2 15
10 38 Male 267564.87 Yes 5335.1143 6052.4377 8 10
# ... with 40 more rows, and 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
# Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
# segment <chr>
> dplyr::sample_n(sim.dat,10,replace=TRUE)
# A tibble: 10 x 19
age gender income house store_exp online_exp store_trans online_trans
<int> <chr> <dbl> <chr> <dbl> <dbl> <int> <int>
1 34 Female 73234.49 No 349.5491 2081.4476 4 21
2 25 Female 90856.12 No 203.7759 2228.4818 4 23
3 37 Male 187062.94 Yes 5931.7494 1942.1789 18 11
4 34 Male 53945.69 Yes 370.5065 2305.3430 3 14
5 23 Female 81763.92 No 205.6662 1040.8967 3 24
6 300 Male 208017.46 Yes 5076.8009 6053.4853 12 11
7 56 Male NA Yes 419.6702 192.3719 3 1
8 26 Female 95341.78 No 198.9729 2036.4738 3 21
9 26 Male 78240.93 No 430.2481 2091.4694 3 14
10 27 Female 90303.46 No 198.9020 1870.3866 6 13
# ... with 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>, Q4 <int>, Q5 <int>,
# Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>, segment <chr>
> dplyr::top_n(sim.dat,2,income)
# A tibble: 2 x 19
age gender income house store_exp online_exp store_trans online_trans
<int> <chr> <dbl> <chr> <dbl> <dbl> <int> <int>
1 33 Male 319704.3 Yes 5998.305 4395.923 9 11
2 41 Male 317476.2 Yes 3029.844 4179.671 11 12
# ... with 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>, Q4 <int>, Q5 <int>,
# Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>, segment <chr>
>
The select() function in dplyr selects columns (code omitted).
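A few common ways of selecting columns with select() are sketched here, assuming dplyr is installed. The `toy` data frame is hypothetical and only reuses column names from the data set above.

```r
library(dplyr)

toy <- data.frame(
  age        = c(57, 63),
  gender     = c("Female", "Male"),
  income     = c(120963.4, 122008.1),
  store_exp  = c(529.13, 478.01),
  online_exp = c(303.51, 109.53)
)

# Select columns by name
by_name <- select(toy, age, income)

# Select a contiguous range of columns
by_range <- select(toy, age:income)

# Drop a column
dropped <- select(toy, -gender)

# Select columns whose names end with a given suffix
by_suffix <- select(toy, ends_with("_exp"))
```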
Summarizing data (works much like apply() and ddply()):
> dplyr::summarise(sim.dat,avg_online=mean(online_trans))
# A tibble: 1 x 1
avg_online
1 13.546
Use group_by() to group observations by a categorical variable before summarizing.
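A small sketch of a grouped summary, assuming dplyr is installed. The `toy` data frame and its values are made up.

```r
library(dplyr)

toy <- data.frame(
  gender       = c("Female", "Male", "Female", "Male"),
  online_trans = c(10, 4, 20, 16)
)

# One row per group, with the group mean and the group size
by_gender <- toy %>%
  group_by(gender) %>%
  summarise(avg_online = mean(online_trans), n = n())
```

This mirrors the data.table form dt[,.(avg=mean(online_trans)),by=gender] shown earlier.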
Creating new variables
mutate() computes new columns from existing ones.
transmute() works like mutate(), but keeps only the newly created columns.
> dplyr::mutate(sim.dat,total_exp=store_exp+online_exp)
# A tibble: 1,000 x 20
age gender income house store_exp online_exp store_trans online_trans
<int> <chr> <dbl> <chr> <dbl> <dbl> <int> <int>
1 57 Female 120963.4 Yes 529.1344 303.5125 2 2
2 63 Female 122008.1 Yes 478.0058 109.5297 4 2
3 59 Male 114202.3 Yes 490.8107 279.2496 7 2
4 60 Male 113616.3 Yes 347.8090 141.6698 10 2
5 51 Male 124252.6 Yes 379.6259 112.2372 4 4
6 59 Male 107661.5 Yes 338.3154 195.6870 4 5
7 57 Male 120483.3 Yes 482.5445 284.5363 5 3
8 57 Male 110542.0 Yes 340.7368 135.2556 11 5
9 61 Female 132060.5 Yes 608.2310 142.5503 6 1
10 60 Male 105048.8 Yes 470.3190 163.4663 12 1
# ... with 990 more rows, and 12 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
# Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
# segment <chr>, total_exp <dbl>
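To contrast mutate() with transmute(), here is a sketch on a two-row hypothetical data frame, assuming dplyr is installed.

```r
library(dplyr)

toy <- data.frame(store_exp = c(100, 200), online_exp = c(50, 25))

# mutate() keeps the existing columns and appends the new one
with_total <- mutate(toy, total_exp = store_exp + online_exp)

# transmute() returns only the columns it creates
only_total <- transmute(toy, total_exp = store_exp + online_exp)
```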
Merging data sets
> x=data.frame(cbind(ID=c("A","B","C"),x1=c(1,2,3)))
> y=data.frame(cbind(ID=c("B","C","D"),y1=c(T,T,F)))
> x
ID x1
1 A 1
2 B 2
3 C 3
> y
ID y1
1 B TRUE
2 C TRUE
3 D FALSE
> left_join(x,y,by="ID")
ID x1 y1
1 A 1 <NA>
2 B 2 TRUE
3 C 3 TRUE
Warning message:
Column `ID` joining factors with different levels, coercing to character vector
> inner_join(x,y,by="ID")
ID x1 y1
1 B 2 TRUE
2 C 3 TRUE
Warning message:
Column `ID` joining factors with different levels, coercing to character vector
> full_join(x,y,by="ID")
ID x1 y1
1 A 1 <NA>
2 B 2 TRUE
3 C 3 TRUE
4 D <NA> FALSE
Warning message:
Column `ID` joining factors with different levels, coercing to character vector
> semi_join(x,y,by="ID")
ID x1
1 B 2
2 C 3
Warning message:
Column `ID` joining factors with different levels, coercing to character vector
> anti_join(x,y,by="ID")
ID x1
1 A 1
Warning message:
Column `ID` joining factors with different levels, coercing to character vector
>
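In addition, dplyr provides set operations on data frames (intersect(), union(), and setdiff() for intersection, union, and difference), as well as functions that append one data frame to another by row or by column (bind_rows(), bind_cols()).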
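The set operations and binding functions can be sketched as follows, assuming dplyr is installed. The data frames `a` and `b` are hypothetical and share the same columns, which the set operations require.

```r
library(dplyr)

a <- data.frame(ID = c("A", "B", "C"), v = c(1, 2, 3), stringsAsFactors = FALSE)
b <- data.frame(ID = c("B", "C", "D"), v = c(2, 3, 4), stringsAsFactors = FALSE)

# Rows that appear in both data frames
both <- dplyr::intersect(a, b)

# Distinct rows that appear in either data frame
either <- dplyr::union(a, b)

# Rows of a that do not appear in b
only_a <- dplyr::setdiff(a, b)

# Stack by row, and attach extra columns side by side
stacked <- bind_rows(a, b)
side    <- bind_cols(a, data.frame(flag = c(TRUE, FALSE, TRUE)))
```

The dplyr:: prefix is used on the set operations because base R has functions with the same names for vectors.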
reshape2 package
First melt() melts the data into long form, then dcast() casts it into the shape you want.
melt() can melt data frames, lists, matrices, tables, and so on.
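A melt-then-cast sketch, assuming the reshape2 package is installed. The `toy` data frame and the column names `channel` and `amount` are chosen for illustration.

```r
library(reshape2)

toy <- data.frame(
  segment    = c("Price", "Price", "Style", "Style"),
  gender     = c("Female", "Male", "Female", "Male"),
  store_exp  = c(100, 200, 300, 400),
  online_exp = c(10, 20, 30, 40)
)

# Melt: one row per (segment, gender, channel) combination
long <- melt(
  toy,
  id.vars       = c("segment", "gender"),
  measure.vars  = c("store_exp", "online_exp"),
  variable.name = "channel",
  value.name    = "amount"
)

# Cast: total amount per segment, one column per channel
wide <- dcast(long, segment ~ channel, value.var = "amount", fun.aggregate = sum)
```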
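tidyr package
First, gather(), which is similar to melt().
spread() is the inverse of gather(): the latter stacks several columns into one, while the former spreads a single column out into several.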
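A round-trip sketch with gather() and spread(), assuming tidyr is installed. The `toy` data frame and the names `channel` and `amount` are hypothetical.

```r
library(tidyr)

toy <- data.frame(
  id         = 1:2,
  store_exp  = c(100, 200),
  online_exp = c(10, 20)
)

# gather(): stack the two expense columns into key/value pairs
long <- gather(toy, key = "channel", value = "amount", store_exp, online_exp)

# spread(): turn the key column back into separate columns
wide <- spread(long, key = channel, value = amount)
```

Newer tidyr releases also offer pivot_longer() and pivot_wider() for the same two steps.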
separate() and unite() are another complementary pair in tidyr: separate() splits one column into several, and unite() pastes several columns together, much like paste().
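A sketch of the separate() and unite() pair, assuming tidyr is installed. The `toy` data frame is hypothetical.

```r
library(tidyr)

toy <- data.frame(
  id   = 1:2,
  date = c("2017-01", "2018-03"),
  stringsAsFactors = FALSE
)

# separate(): split one column into two at the "-" separator
parts <- separate(toy, date, into = c("year", "month"), sep = "-")

# unite(): paste the two columns back together
joined <- unite(parts, "date", year, month, sep = "-")
```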