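Data Manipulation in R

Reading and Writing Data

For discrete variables, we look at the frequency of each level, build a cross-tabulation of two variables, or draw a bar chart.
For continuous variables, we look at the mean, standard deviation, quantiles, and so on.
In addition, functions such as summary(), str(), and describe() (from the psych package) produce a summary of a data frame.
These are the most basic methods, but they are not very flexible and their output is fixed, so we often need to aggregate and reshape the data ourselves.
Before turning to aggregation and reshaping, we introduce tibble, a newer object that can replace the data frame, and readr, a package for reading datasets efficiently. Finally, we introduce two packages for reshaping data: reshape2 and tidyr.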
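
The basic summaries mentioned above can be sketched in base R as follows. The data frame `toy` and its values are made up for illustration; they are not part of the dataset used later.

```r
# Hypothetical toy data: two discrete variables and one continuous variable
toy <- data.frame(
  gender = c("Female", "Male", "Female", "Female", "Male"),
  house  = c("Yes", "No", "Yes", "No", "Yes"),
  income = c(52000, 61000, 58000, 49000, 75000)
)

# Discrete variables: frequency of each level and a cross-tabulation
freq  <- table(toy$gender)
cross <- table(toy$gender, toy$house)

# Continuous variable: mean, standard deviation, quantiles
inc_mean <- mean(toy$income)
inc_sd   <- sd(toy$income)
inc_q    <- quantile(toy$income, probs = c(0.25, 0.5, 0.75))

print(freq)
print(cross)
print(inc_q)
```

A bar chart of the frequencies could then be drawn with barplot(freq).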

tibble: a replacement for the traditional data frame

> library(tibble)
> library(ggplot2)
> sim.dat=read.csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv")
> df=data.frame(x=c(1:5),y=rep("a",5))
> as_tibble(df)
# A tibble: 5 x 2
      x      y
  <int> <fctr>
1     1      a
2     2      a
3     3      a
4     4      a
5     5      a
> tibble(x=1:5,y=rep("a",5))
# A tibble: 5 x 2
      x     y
  <int> <chr>
1     1     a
2     2     a
3     3     a
4     4     a
5     5     a
> 
> tibble(x=1:5,y=1,z=x^2+y)
# A tibble: 5 x 3
      x     y     z
  <int> <dbl> <dbl>
1     1     1     2
2     2     1     5
3     3     1    10
4     4     1    17
5     5     1    26
> tb=tibble(':)'="smile",' '="space",'2000'="number")
> print(tb)
# A tibble: 1 x 3
   `:)`   ` ` `2000`
  <chr> <chr>  <chr>
1 smile space number
> 

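As the last example shows, a tibble allows non-syntactic column names. To refer to such variables, including from other packages, wrap the name in backticks.
tibbles differ from traditional data frames mainly in two respects: printing and subsetting.
1. Printing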

> print(as_tibble(sim.dat))
# A tibble: 1,000 x 19
     age gender   income  house store_exp online_exp store_trans online_trans
   <int> <fctr>    <dbl> <fctr>     <dbl>      <dbl>       <int>        <int>
 1    57 Female 120963.4    Yes  529.1344   303.5125           2            2
 2    63 Female 122008.1    Yes  478.0058   109.5297           4            2
 3    59   Male 114202.3    Yes  490.8107   279.2496           7            2
 4    60   Male 113616.3    Yes  347.8090   141.6698          10            2
 5    51   Male 124252.6    Yes  379.6259   112.2372           4            4
 6    59   Male 107661.5    Yes  338.3154   195.6870           4            5
 7    57   Male 120483.3    Yes  482.5445   284.5363           5            3
 8    57   Male 110542.0    Yes  340.7368   135.2556          11            5
 9    61 Female 132060.5    Yes  608.2310   142.5503           6            1
10    60   Male 105048.8    Yes  470.3190   163.4663          12            1
# ... with 990 more rows, and 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
#   Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
#   segment <fctr>

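As shown above, only the first 10 rows are printed, the number of columns shown adapts to the screen width, and each column's type is displayed under its name, which is friendlier.
2. Subsetting
To extract a single variable from a tibble, use the "$" or "[[" operator:
"[[" can extract by variable name or by position;
"$" can extract only by variable name.
Both can also be combined with the pipe operator "%>%".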

sim.dat$age
sim.dat[["age"]]
sim.dat[[1]]
library(dplyr)
sim.dat%>%.$age
sim.dat%>%.[["age"]]

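When you subset a traditional data frame with single brackets, for example df[, "age"], the result may be a vector rather than a data frame, which can cause errors later in a program. Subsetting a tibble with "[" always returns a tibble, while "$" and "[[" always return a vector.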
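
A small sketch of how single-bracket subsetting differs between a data frame and a tibble. The objects `df` and `tb` are toy examples introduced here for illustration only.

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# Single brackets on a data frame drop to a vector when one column is selected
df_col <- df[, "x"]

# Single brackets on a tibble keep the tibble structure
tb_col <- tb[, "x"]

# "[[" (and "$") return a plain vector for both kinds of object
tb_vec <- tb[["x"]]

print(class(df_col))
print(class(tb_col))
print(class(tb_vec))
```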
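Note: tibble is a relatively new object, so some modeling functions may not work with it. After cleaning the data, you can convert the tibble back to a plain data frame before modeling: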

sim.dat=as.data.frame(sim.dat)
class(sim.dat)

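Efficient data reading and writing: the readr package
readr provides the following functions for reading data:
read_csv() reads comma-separated files
read_csv2() reads semicolon-separated files
read_tsv() reads tab-separated files
read_delim() reads files with an arbitrary delimiter
Of these, read_csv() covers most data-reading needs.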

# skip=2 skips the first two lines
> library(readr)
> dat=read_csv("This line describes the sample data
+ This line is just a comment
+ x,y,z
+ 1,2,3",skip=2)
> print(dat)
# A tibble: 1 x 3
      x     y     z
  <int> <int> <int>
1     1     2     3

> dat=read_csv("1,2,3\n4,5,6",col_names=FALSE)
> print(dat)
# A tibble: 2 x 3
     X1    X2    X3
  <int> <int> <int>
1     1     2     3
2     4     5     6

For semicolon-separated data, use read_csv2()

> dat=read_csv2("x;y;z\n1;2;3")

> print(dat)
# A tibble: 1 x 3
      x     y     z
  <int> <int> <int>
1     1     2     3

For tab-separated data, use read_tsv()

> dat1=read_tsv("x\ty\tz\n1\t2\t3")
> print(dat1)
# A tibble: 1 x 3
      x     y     z
  <int> <int> <int>
1     1     2     3

For an arbitrary delimiter, use read_delim()

> dat2=read_delim("x|y|z\n1|2|3",delim=
+                     "|")
> print(dat2)
# A tibble: 1 x 3
      x     y     z
  <int> <int> <int>
1     1     2     3
> 

Specifying missing values

> dat=read_csv("x,y,z\n1,2,99",na="99")
> print(dat)
# A tibble: 1 x 3
      x     y     z
  <int> <int> <chr>
1     1     2  <NA>
> 

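readr also provides two functions for saving data, write_csv() and write_tsv(). Their advantages are:
1. Strings are encoded in UTF-8.
2. Dates and times are stored in ISO 8601 format, which other software can parse easily.
You can also use write_excel_csv(), which writes a CSV file that Excel opens with the correct encoding.
For other types of data, the following packages are available:
haven: reads SPSS, Stata, and SAS data
readxl: reads Excel files (.xls and .xlsx)
DBI: together with a backend for the specific database (MySQL, etc.), reads data directly from the database via SQL.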
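
As a rough illustration of the write_csv() behavior described above, the sketch below writes a small made-up data frame to a temporary file and reads it back. The column names and the temporary path are assumptions for the example.

```r
library(readr)

# Hypothetical toy data with a date column
df <- data.frame(
  id   = 1:3,
  day  = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03")),
  name = c("a", "b", "c")
)

path <- tempfile(fileext = ".csv")
write_csv(df, path)

# Dates are written in ISO 8601 form (YYYY-MM-DD)
lines <- readLines(path)
print(lines)

# Reading the file back; col_types = cols() lets readr guess the types quietly
back <- read_csv(path, col_types = cols())
print(back)

unlink(path)
```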
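The data.table object:
We can index and search a data frame with square brackets.
Simple aggregation can also be done with functions such as tapply(), aggregate(), and table().
Square brackets make it easy to subset a data frame, but further aggregation requires other packages. It would be much more convenient if aggregation could be done inside the brackets, and data.table makes this possible:
1. It handles large datasets more efficiently.
2. It is as easy to work with as a data frame.
3. It subsets, groups, and merges data quickly.
4. A data frame can easily be converted to a data.table.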
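
A minimal sketch of converting a data frame to a data.table and aggregating inside the brackets. The toy data and the name `dt_toy` are assumptions for illustration and stand in for the real dataset.

```r
library(data.table)

# Hypothetical toy data
df <- data.frame(
  gender       = c("Female", "Male", "Female", "Male"),
  online_trans = c(10, 4, 20, 6)
)

# Convert the data frame to a data.table
dt_toy <- as.data.table(df)

# General form dt[i, j, by]: take rows i, compute j, grouped by "by"
overall   <- dt_toy[, mean(online_trans)]
by_gender <- dt_toy[, .(avg = mean(online_trans)), by = gender]

print(overall)
print(by_gender)
```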

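> library(data.table)
> dt=data.table(sim.dat)
# Note: the following operations do not work on a traditional data frame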
> dt[,mean(online_trans)]
[1] 13.546
> dt[,mean(online_trans),by=gender]
   gender       V1
1: Female 15.38448
2:   Male 11.26233
> dt[,mean(online_trans),by=.(gender,house)]
   gender house        V1
1: Female   Yes 11.312030
2:   Male   Yes  8.771523
3: Female    No 19.145833
4:   Male    No 16.486111
> dt[,.(avg=mean(online_trans)),by=.(gender,house)]
   gender house       avg
1: Female   Yes 11.312030
2:   Male   Yes  8.771523
3: Female    No 19.145833
4:   Male    No 16.486111

data.table operations resemble SQL.
For example: select gender,avg(online_trans) from sim.dat group by gender
is equivalent to

> dt[,mean(online_trans),by=gender]
   gender       V1
1: Female 15.38448
2:   Male 11.26233
> 
select gender,house,avg(online_trans) as avg from sim.dat group by gender,house

is equivalent to

> dt[,.(avg=mean(online_trans)),by=.(gender,house)]
   gender house       avg
1: Female   Yes 11.312030
2:   Male   Yes  8.771523
3: Female    No 19.145833
4:   Male    No 16.486111
> 
select gender,house,avg(online_trans) as avg from
sim.dat where age<40 group by gender,house

is equivalent to

> dt[age<40,.(avg=mean(online_trans)),by=.(gender,house)]
   gender house      avg
1:   Male   Yes 14.45977
2: Female   Yes 18.14062
3:   Male    No 18.24299
4: Female    No 20.10196

Selecting rows

> dt[age<20&income>80000]
   age gender   income house store_exp online_exp store_trans online_trans Q1 Q2
1:  19 Female 83534.70    No  227.6686   1490.719           1           22  2  1
2:  18 Female 89415.97   Yes  209.5487   1926.470           3           28  2  1
3:  19 Female 92812.81    No  186.7475   1041.539           2           18  3  1
   Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
1:  1  2  4  1  4  2  4   1   Style
2:  1  1  4  1  4  2  4   1   Style
3:  1  2  4  1  4  3  4   1   Style
> dt[1:2]
   age gender   income house store_exp online_exp store_trans online_trans Q1 Q2
1:  57 Female 120963.4   Yes  529.1344   303.5125           2            2  4  2
2:  63 Female 122008.1   Yes  478.0058   109.5297           4            2  4  1
   Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
1:  1  2  1  4  1  4  2   4   Price
2:  1  2  1  4  1  4  1   4   Price
> 

Selecting columns:

> ans=dt[,age]
> head(ans)
[1] 57 63 59 60 51 59
> ans=dt[,.(age,online_exp)]
> head(ans)
   age online_exp
1:  57   303.5125
2:  63   109.5297
3:  59   279.2496
4:  60   141.6698
5:  51   112.2372
6:  59   195.6870
> ans=dt[,age:income,with=FALSE]
> head(ans,2)
   age gender   income
1:  57 Female 120963.4
2:  63 Female 122008.1
# Drop columns; - can be replaced with !
> ans=dt[,-(age:online_exp),with=FALSE]

Tabulation

> dt[,.N]
[1] 1000
> dt[,.N,by=gender]
   gender   N
1: Female 554
2:   Male 446
> dt[age<30,.(count=.N),by=gender]
   gender count
1: Female   292
2:   Male    86
> head(dt[order(-online_exp)],5)
   age gender   income house store_exp online_exp store_trans online_trans Q1 Q2
1:  40 Female 217599.7    No  7023.684   9479.442          10            6  1  4
2:  41 Female       NA   Yes  3786.740   8638.239          14           10  1  4
3:  36   Male 228550.1   Yes  3279.621   8220.555           8           12  1  4
4:  31 Female 159508.1   Yes  5177.081   8005.932          11           13  1  4
5:  43 Female 190407.4   Yes  4694.922   7875.562           6           11  1  4
   Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10     segment
1:  5  4  3  4  4  1  4   2 Conspicuous
2:  4  4  4  4  4  1  4   2 Conspicuous
3:  5  4  4  4  4  1  4   1 Conspicuous
4:  4  4  4  4  4  1  4   2 Conspicuous
5:  5  4  4  4  4  1  4   2 Conspicuous
> dt[order(-online_exp)][1:5]
   age gender   income house store_exp online_exp store_trans online_trans Q1 Q2
1:  40 Female 217599.7    No  7023.684   9479.442          10            6  1  4
2:  41 Female       NA   Yes  3786.740   8638.239          14           10  1  4
3:  36   Male 228550.1   Yes  3279.621   8220.555           8           12  1  4
4:  31 Female 159508.1   Yes  5177.081   8005.932          11           13  1  4
5:  43 Female 190407.4   Yes  4694.922   7875.562           6           11  1  4
   Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10     segment
1:  5  4  3  4  4  1  4   2 Conspicuous
2:  4  4  4  4  4  1  4   2 Conspicuous
3:  5  4  4  4  4  1  4   1 Conspicuous
4:  4  4  4  4  4  1  4   2 Conspicuous
5:  5  4  4  4  4  1  4   2 Conspicuous
> dt[order(gender,-online_exp)][1:5]
   age gender   income house store_exp online_exp store_trans online_trans Q1 Q2
1:  40 Female 217599.7    No  7023.684   9479.442          10            6  1  4
2:  41 Female       NA   Yes  3786.740   8638.239          14           10  1  4
3:  31 Female 159508.1   Yes  5177.081   8005.932          11           13  1  4
4:  43 Female 190407.4   Yes  4694.922   7875.562           6           11  1  4
5:  50 Female 263858.0   Yes  5813.802   7448.729          11           11  1  4
   Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10     segment
1:  5  4  3  4  4  1  4   2 Conspicuous
2:  4  4  4  4  4  1  4   2 Conspicuous
3:  4  4  4  4  4  1  4   2 Conspicuous
4:  5  4  4  4  4  1  4   2 Conspicuous
5:  5  4  4  4  4  1  4   1 Conspicuous
> 

Reading data with fread()
fread() in data.table usually reads files even faster than read_csv().
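
A minimal sketch of a fread() call, assuming a small CSV file written to a temporary path with fwrite(). It only shows usage and does not measure speed.

```r
library(data.table)

# Write a small hypothetical table to a temporary CSV file
path <- tempfile(fileext = ".csv")
fwrite(data.table(x = 1:3, y = c("a", "b", "c")), path)

# fread() detects the separator and column types automatically
dt_in <- fread(path)
print(dt_in)
print(class(dt_in))

unlink(path)
```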

Data aggregation

base package: apply(), lapply(), sapply(), etc.

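> # keep only the numeric columns
> sdat=sim.dat[,sapply(sim.dat,is.numeric)]
> # class of each column (apply() would first coerce the data frame to a character matrix)
> sapply(sim.dat,class)
> apply(sdat,MARGIN=2,function(x) mean(na.omit(x)))
> apply(sdat,MARGIN=2,function(x) sd(na.omit(x)))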
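
The same column-wise summaries can be sketched on a small made-up data frame, which makes the result easy to check by hand. Here sapply() is used with na.rm = TRUE instead of apply() with na.omit(); this is one possible variant, not the only way to write it.

```r
# Hypothetical toy data with one missing value and one non-numeric column
toy <- data.frame(
  age    = c(25, 40, NA),
  income = c(50000, 90000, 70000),
  gender = c("Female", "Male", "Female")
)

# Logical vector marking the numeric columns
num_cols <- sapply(toy, is.numeric)

# Column means and standard deviations, ignoring missing values
col_means <- sapply(toy[num_cols], mean, na.rm = TRUE)
col_sds   <- sapply(toy[num_cols], sd, na.rm = TRUE)

print(col_means)
print(col_sds)
```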

plyr package: ddply()

> library(plyr)
# average spend per transaction within each segment
> ddply(sim.dat,"segment",summarize,avg_online=round(sum(online_exp)/sum(online_trans),2),avg_store=round(sum(store_exp)/sum(store_trans),2))
      segment avg_online avg_store
1 Conspicuous     442.27    479.25
2       Price      69.28     81.30
3     Quality     126.05    105.12
4       Style      92.83    121.07
> 

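dplyr package (designed for working with data frames). Its main features:
1. Displaying data frames
2. Subsetting data
3. Summarizing data
4. Creating new variables
5. Merging datasets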

> dplyr::tbl_df(sim.dat)
# A tibble: 1,000 x 19
     age gender   income house store_exp online_exp store_trans online_trans
   <int>  <chr>    <dbl> <chr>     <dbl>      <dbl>       <int>        <int>
 1    57 Female 120963.4   Yes  529.1344   303.5125           2            2
 2    63 Female 122008.1   Yes  478.0058   109.5297           4            2
 3    59   Male 114202.3   Yes  490.8107   279.2496           7            2
 4    60   Male 113616.3   Yes  347.8090   141.6698          10            2
 5    51   Male 124252.6   Yes  379.6259   112.2372           4            4
 6    59   Male 107661.5   Yes  338.3154   195.6870           4            5
 7    57   Male 120483.3   Yes  482.5445   284.5363           5            3
 8    57   Male 110542.0   Yes  340.7368   135.2556          11            5
 9    61 Female 132060.5   Yes  608.2310   142.5503           6            1
10    60   Male 105048.8   Yes  470.3190   163.4663          12            1
# ... with 990 more rows, and 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
#   Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
#   segment <chr>
> dplyr::glimpse(sim.dat)
Observations: 1,000
Variables: 19
$ age          <int> 57, 63, 59, 60, 51, 59, 57, 57, 61, 60, 58, 59, 64, 57,...
$ gender       <chr> "Female", "Female", "Male", "Male", "Male", "Male", "Ma...
$ income       <dbl> 120963.4, 122008.1, 114202.3, 113616.3, 124252.6, 10766...
$ house        <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ store_exp    <dbl> 529.1344, 478.0058, 490.8107, 347.8090, 379.6259, 338.3...
$ online_exp   <dbl> 303.5125, 109.5297, 279.2496, 141.6698, 112.2372, 195.6...
$ store_trans  <int> 2, 4, 7, 10, 4, 4, 5, 11, 6, 12, 5, 6, 7, 7, 5, 5, 5, 5...
$ online_trans <int> 2, 2, 2, 2, 4, 5, 3, 5, 1, 1, 4, 2, 4, 3, 5, 1, 3, 2, 2...
$ Q1           <int> 4, 4, 5, 5, 4, 4, 4, 5, 4, 4, 4, 4, 5, 4, 4, 5, 5, 5, 4...
$ Q2           <int> 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2...
$ Q3           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ Q4           <int> 2, 2, 2, 3, 3, 2, 2, 3, 2, 3, 3, 3, 3, 3, 2, 2, 2, 3, 3...
$ Q5           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ Q6           <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
$ Q7           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ Q8           <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
$ Q9           <int> 2, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2...
$ Q10          <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
$ segment      <chr> "Price", "Price", "Price", "Price", "Price", "Price", "...
Subsetting data (by row/column)

> library(magrittr)
>  library(dplyr)
>  dplyr::filter(sim.dat,income>300000) %>%
+ dplyr::tbl_df()
# A tibble: 4 x 19
    age gender   income house store_exp online_exp store_trans online_trans
  <int>  <chr>    <dbl> <chr>     <dbl>      <dbl>       <int>        <int>
1    40   Male 301398.0   Yes  4840.461   3618.212          10           11
2    33   Male 319704.3   Yes  5998.305   4395.923           9           11
3    41   Male 317476.2   Yes  3029.844   4179.671          11           12
4    37 Female 315697.2   Yes  6548.970   4284.065          13           11
# ... with 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>, Q4 <int>, Q5 <int>,
#   Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>, segment <chr>

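In addition, distinct() removes duplicate rows from a data frame; sample_frac() randomly selects a given fraction of rows; sample_n() randomly selects a given number of rows; slice() selects rows by position; and top_n() selects the observations with the highest values of a variable.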

> dplyr::distinct(sim.dat)
# A tibble: 1,000 x 19
     age gender   income house store_exp online_exp store_trans online_trans
   <int>  <chr>    <dbl> <chr>     <dbl>      <dbl>       <int>        <int>
 1    57 Female 120963.4   Yes  529.1344   303.5125           2            2
 2    63 Female 122008.1   Yes  478.0058   109.5297           4            2
 3    59   Male 114202.3   Yes  490.8107   279.2496           7            2
 4    60   Male 113616.3   Yes  347.8090   141.6698          10            2
 5    51   Male 124252.6   Yes  379.6259   112.2372           4            4
 6    59   Male 107661.5   Yes  338.3154   195.6870           4            5
 7    57   Male 120483.3   Yes  482.5445   284.5363           5            3
 8    57   Male 110542.0   Yes  340.7368   135.2556          11            5
 9    61 Female 132060.5   Yes  608.2310   142.5503           6            1
10    60   Male 105048.8   Yes  470.3190   163.4663          12            1
# ... with 990 more rows, and 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
#   Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
#   segment <chr>
> dplyr::sample_frac(sim.dat,0.05,replace=TRUE)
# A tibble: 50 x 19
     age gender    income house store_exp online_exp store_trans online_trans
   <int>  <chr>     <dbl> <chr>     <dbl>      <dbl>       <int>        <int>
 1    22   Male  91553.21    No  200.7210  1777.4974           4           27
 2    34 Female  60521.76    No  299.3096  2054.1732           3           16
 3    33   Male        NA    No  265.6550  1892.5581           2           12
 4    38 Female 164506.62   Yes 3916.9309  5764.1235          11           10
 5    26 Female  89461.40    No  200.4784  2449.7965           1           23
 6    26 Female 105528.79   Yes  186.9383  2349.9275           5           17
 7    55   Male 128194.20   Yes  595.6952   156.9314           6            2
 8    35 Female 130108.64   Yes 6155.4803  6201.7090           9           13
 9    36   Male        NA   Yes  203.3036  2202.5147           2           15
10    38   Male 267564.87   Yes 5335.1143  6052.4377           8           10
# ... with 40 more rows, and 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
#   Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
#   segment <chr>
> dplyr::sample_n(sim.dat,10,replace=TRUE)
# A tibble: 10 x 19
     age gender    income house store_exp online_exp store_trans online_trans
   <int>  <chr>     <dbl> <chr>     <dbl>      <dbl>       <int>        <int>
 1    34 Female  73234.49    No  349.5491  2081.4476           4           21
 2    25 Female  90856.12    No  203.7759  2228.4818           4           23
 3    37   Male 187062.94   Yes 5931.7494  1942.1789          18           11
 4    34   Male  53945.69   Yes  370.5065  2305.3430           3           14
 5    23 Female  81763.92    No  205.6662  1040.8967           3           24
 6   300   Male 208017.46   Yes 5076.8009  6053.4853          12           11
 7    56   Male        NA   Yes  419.6702   192.3719           3            1
 8    26 Female  95341.78    No  198.9729  2036.4738           3           21
 9    26   Male  78240.93    No  430.2481  2091.4694           3           14
10    27 Female  90303.46    No  198.9020  1870.3866           6           13
# ... with 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>, Q4 <int>, Q5 <int>,
#   Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>, segment <chr>
> dplyr::top_n(sim.dat,2,income)
# A tibble: 2 x 19
    age gender   income house store_exp online_exp store_trans online_trans
  <int>  <chr>    <dbl> <chr>     <dbl>      <dbl>       <int>        <int>
1    33   Male 319704.3   Yes  5998.305   4395.923           9           11
2    41   Male 317476.2   Yes  3029.844   4179.671          11           12
# ... with 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>, Q4 <int>, Q5 <int>,
#   Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>, segment <chr>
> 

dplyr's select() function selects columns (code omitted here).
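
A minimal sketch of what such select() calls might look like, using a hypothetical toy data frame rather than the dataset above:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  age    = c(25, 40),
  income = c(50000, 90000),
  Q1     = c(1, 2),
  Q2     = c(3, 4)
)

# Select columns by name
s1 <- select(toy, age, income)

# Select columns whose names start with "Q"
s2 <- select(toy, starts_with("Q"))

# Drop a column with a minus sign
s3 <- select(toy, -age)

print(names(s1))
print(names(s2))
print(names(s3))
```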
Summarizing data (works much like apply() and ddply()):

> dplyr::summarise(sim.dat,avg_online=mean(online_trans))
# A tibble: 1 x 1
  avg_online
       <dbl>
1     13.546

group_by() groups the observations by a categorical variable so that summaries are computed per group.
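
A short sketch of a grouped summary with group_by() and summarise(). The toy data frame is an assumption for illustration.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  gender       = c("Female", "Male", "Female", "Male"),
  online_trans = c(10, 4, 20, 6)
)

# Average number of online transactions within each gender
by_gender <- toy %>%
  group_by(gender) %>%
  summarise(avg_online = mean(online_trans))

print(by_gender)
```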

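Creating new variables
mutate() computes new columns from existing ones.
transmute() is similar to mutate(), but keeps only the newly created columns.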

> dplyr::mutate(sim.dat,total_exp=store_exp+online_exp)
# A tibble: 1,000 x 20
     age gender   income house store_exp online_exp store_trans online_trans
   <int>  <chr>    <dbl> <chr>     <dbl>      <dbl>       <int>        <int>
 1    57 Female 120963.4   Yes  529.1344   303.5125           2            2
 2    63 Female 122008.1   Yes  478.0058   109.5297           4            2
 3    59   Male 114202.3   Yes  490.8107   279.2496           7            2
 4    60   Male 113616.3   Yes  347.8090   141.6698          10            2
 5    51   Male 124252.6   Yes  379.6259   112.2372           4            4
 6    59   Male 107661.5   Yes  338.3154   195.6870           4            5
 7    57   Male 120483.3   Yes  482.5445   284.5363           5            3
 8    57   Male 110542.0   Yes  340.7368   135.2556          11            5
 9    61 Female 132060.5   Yes  608.2310   142.5503           6            1
10    60   Male 105048.8   Yes  470.3190   163.4663          12            1
# ... with 990 more rows, and 12 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
#   Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
#   segment <chr>, total_exp <dbl>
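
For comparison, a small sketch of mutate() next to transmute() on made-up data: mutate() keeps the existing columns, while transmute() returns only the new one.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(store_exp = c(100, 200), online_exp = c(30, 40))

# mutate() appends the new column to the existing ones
with_total <- mutate(toy, total_exp = store_exp + online_exp)

# transmute() keeps only the newly created column
only_total <- transmute(toy, total_exp = store_exp + online_exp)

print(names(with_total))
print(names(only_total))
```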

Merging datasets

> x=data.frame(cbind(ID=c("A","B","C"),x1=c(1,2,3)))
> y=data.frame(cbind(ID=c("B","C","D"),y1=c(T,T,F)))
> x
  ID x1
1  A  1
2  B  2
3  C  3
> y
  ID    y1
1  B  TRUE
2  C  TRUE
3  D FALSE
> left_join(x,y,by="ID")
  ID x1   y1
1  A  1 <NA>
2  B  2 TRUE
3  C  3 TRUE
Warning message:
Column `ID` joining factors with different levels, coercing to character vector 
> inner_join(x,y,by="ID")
  ID x1   y1
1  B  2 TRUE
2  C  3 TRUE
Warning message:
Column `ID` joining factors with different levels, coercing to character vector 
> full_join(x,y,by="ID")
  ID   x1    y1
1  A    1  <NA>
2  B    2  TRUE
3  C    3  TRUE
4  D <NA> FALSE
Warning message:
Column `ID` joining factors with different levels, coercing to character vector 
> semi_join(x,y,by="ID")
  ID x1
1  B  2
2  C  3
Warning message:
Column `ID` joining factors with different levels, coercing to character vector 
> anti_join(x,y,by="ID")
  ID x1
1  A  1
Warning message:
Column `ID` joining factors with different levels, coercing to character vector 
> 

In addition, dplyr provides set operations on data frames (intersect(), union(), setdiff()) as well as functions that append one data frame to another by row or by column (bind_rows(), bind_cols()).
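
A brief sketch of these set operations and of bind_rows(), using two hypothetical one-column data frames `a` and `b`:

```r
library(dplyr)

# Hypothetical toy data; stringsAsFactors = FALSE keeps ID as character
a <- data.frame(ID = c("A", "B", "C"), stringsAsFactors = FALSE)
b <- data.frame(ID = c("B", "C", "D"), stringsAsFactors = FALSE)

both   <- dplyr::intersect(a, b)  # rows present in both a and b
either <- dplyr::union(a, b)      # rows present in a or b, duplicates removed
only_a <- dplyr::setdiff(a, b)    # rows present in a but not in b

# Stack b underneath a (duplicates are kept)
stacked <- bind_rows(a, b)

print(both)
print(either)
print(only_a)
print(nrow(stacked))
```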

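Reshaping data

reshape2 package
First melt the data into long form with melt(), then cast it into the desired shape with dcast().
melt() can melt data frames, lists, matrices, tables, and so on.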
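
A minimal sketch of the melt-then-cast workflow on a made-up wide data frame. The column names `channel` and `exp` are chosen here for illustration.

```r
library(reshape2)

# Hypothetical wide data: one row per customer, one column per spending channel
wide <- data.frame(
  id         = 1:2,
  store_exp  = c(100, 200),
  online_exp = c(30, 40)
)

# Melt into long form: one row per (id, channel) pair
long <- melt(wide, id.vars = "id",
             variable.name = "channel", value.name = "exp")

# Cast back into wide form: one column per channel
back <- dcast(long, id ~ channel, value.var = "exp")

print(long)
print(back)
```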

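tidyr package
gather() is similar to melt().
spread() is the inverse of gather(): gather() stacks several columns into key-value pairs, while spread() spreads one column out into several.
separate() and unite() are another pair of complementary functions in tidyr: separate() splits one column into several, and unite() combines several columns into one, much like paste().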
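
The four tidyr functions can be sketched on toy data as follows. The data frames and the column names `channel`, `exp`, `source`, and `type` are assumptions for illustration.

```r
library(tidyr)

# Hypothetical wide data
wide <- data.frame(
  id         = 1:2,
  store_exp  = c(100, 200),
  online_exp = c(30, 40)
)

# gather(): stack the two spending columns into key-value pairs
long <- gather(wide, key = "channel", value = "exp", store_exp, online_exp)

# spread(): the inverse, one column per value of the key
wide2 <- spread(long, key = channel, value = exp)

# separate(): split one column into two at the underscore
d <- data.frame(key = c("store_exp", "online_exp"), stringsAsFactors = FALSE)
s <- separate(d, key, into = c("source", "type"), sep = "_")

# unite(): paste the two columns back together
u <- unite(s, "key", source, type, sep = "_")

print(long)
print(wide2)
print(s)
print(u)
```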
