Basic Operations on R Data Frames: Creating, Accessing, Modifying

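The data frame is the most commonly used way to organize data in R.
Like a matrix, a data frame is tabular. The difference is that a data frame is a collection of vectors that may have different storage types, whereas a matrix requires all of its elements to share one storage type.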
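
The contrast with matrices can be seen in a small sketch. The variable names m and d are made up for illustration; the point is only that binding text and numbers into a matrix coerces everything to one type, while a data frame keeps each column's own type.

```r
# A matrix holds a single storage type, so mixing text and numbers
# coerces every element to character.
m = cbind(name = c("Jack", "May"), score = c(78, 99))
typeof(m)            # "character"

# A data frame stores each column as its own vector, so the
# numeric column stays numeric.
d = data.frame(name = c("Jack", "May"), score = c(78, 99))
is.numeric(d$score)  # TRUE
sapply(d, class)     # class of each column
```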

Creating a data frame: data.frame()

data.frame(field_name1 = vector1, field_name2 = vector2, ...)
The names() function returns the field (column) names.

> Name = c("Jack", "May","Rose","Tom")
> maths = c(78,99,89,80)
> Chinese = c(88,87,69,90)
> English = c(90,85,83,93)

> df = data.frame(studentName = Name, ChineseScore = Chinese, MathScore = maths, EnglishScore = English)
> df
  studentName ChineseScore MathScore EnglishScore
1        Jack           88        78           90
2         May           87        99           85
3        Rose           69        89           83
4         Tom           90        80           93

> str(df)
'data.frame':   4 obs. of  4 variables: # obs = rows, variables = columns
 $ studentName : Factor w/ 4 levels "Jack","May","Rose",..: 1 2 3 4
 $ ChineseScore: num  88 87 69 90
 $ MathScore   : num  78 99 89 80
 $ EnglishScore: num  90 85 83 93

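Note that studentName is stored as a factor here. That was the default behavior before R 4.0.0; from R 4.0.0 on, data.frame() defaults to stringsAsFactors = FALSE, so the same code produces a character (chr) column unless you pass stringsAsFactors = TRUE. A factor is a special kind of vector and will be discussed later.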
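
As a minimal sketch of the points above, the following shows names() and how the stringsAsFactors argument of data.frame() controls whether a text column becomes a factor. The names df_chr and df_fct are made up for this illustration.

```r
Name = c("Jack", "May", "Rose", "Tom")
maths = c(78, 99, 89, 80)

# Ask explicitly for character or factor storage, so the result
# does not depend on the default of the installed R version.
df_chr = data.frame(studentName = Name, MathScore = maths,
                    stringsAsFactors = FALSE)
df_fct = data.frame(studentName = Name, MathScore = maths,
                    stringsAsFactors = TRUE)

names(df_chr)               # "studentName" "MathScore"
class(df_chr$studentName)   # "character"
class(df_fct$studentName)   # "factor"
```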

You can also create an empty data frame:

> a = data.frame(x1=character(0),x2=logical(0),x3=numeric(0))
> a
[1] x1 x2 x3
<0 rows> (or 0-length row.names)
> str(a)
'data.frame':   0 obs. of  3 variables:
 $ x1: Factor w/ 0 levels: 
 $ x2: logi 
 $ x3: num
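
One possible use of an empty data frame is as a typed template that rows are appended to later. The sketch below assumes that use case (the text above does not say what the empty frame is for) and uses a made-up row called new_row; stringsAsFactors = FALSE is set explicitly so the text column stays character in any R version.

```r
# Empty data frame with three typed columns.
a = data.frame(x1 = character(0), x2 = logical(0), x3 = numeric(0),
               stringsAsFactors = FALSE)
dim(a)    # 0 rows, 3 columns

# Append one row whose column names and types match.
new_row = data.frame(x1 = "first", x2 = TRUE, x3 = 1.5,
                     stringsAsFactors = FALSE)
a = rbind(a, new_row)
nrow(a)   # 1
```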

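Accessing a data frame

There are three ways to access a field:

data_frame_name$field_name (most common)
data_frame_name[["field_name"]]
data_frame_name[[field_number]]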

> df
  studentName ChineseScore MathScore EnglishScore
1        Jack           88        78           90
2         May           87        99           85
3        Rose           69        89           83
4         Tom           90        80           93
> df$ChineseScore
[1] 88 87 69 90
> df[["ChineseScore"]]
[1] 88 87 69 90
> df[[2]]
[1] 88 87 69 90
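
For comparison, here is a short sketch of how these forms differ from single-bracket indexing, which is not covered above: [[ ]] and $ return the column as a vector, while [ ] with a name returns a one-column data frame, and [row, column] picks out individual cells. The names col_vec and col_df are made up for illustration.

```r
df = data.frame(studentName = c("Jack", "May", "Rose", "Tom"),
                ChineseScore = c(88, 87, 69, 90),
                MathScore = c(78, 99, 89, 80))

col_vec = df[["ChineseScore"]]   # plain numeric vector
col_df = df["ChineseScore"]      # data frame with one column

is.numeric(col_vec)              # TRUE
is.data.frame(col_df)            # TRUE

df[2, "MathScore"]               # single cell: row 2, column MathScore -> 99
df[df$studentName == "Rose", ]   # whole row for Rose
```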

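You can also use the attach function to access the vectors inside the data frame directly. The benefit is that you no longer need to write the data frame name.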

> attach(df)
> ChineseScore
[1] 88 87 69 90
> detach(df)

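Note that attach and detach should always be used as a pair: an attached data frame stays on the search path until it is detached. For that reason, use them with care.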
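
One reason for caution can be sketched as follows: attach() works on a copy of the data frame, so changes made to the data frame afterwards are not visible through the attached names. The variables first_attached and first_in_df are made up for this illustration.

```r
df = data.frame(ChineseScore = c(88, 87, 69, 90))

attach(df)
df$ChineseScore[1] = 0              # modify the data frame after attaching

first_attached = ChineseScore[1]    # value copied at attach time: 88
first_in_df = df$ChineseScore[1]    # current value in df: 0

detach(df)                          # always undo the attach
```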

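The with function serves a similar purpose to attach and detach. Its basic format is:

with(data_frame_name, {
  field access expression 1
  field access expression 2
  ...
})

Using the same example as before: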

> df
  studentName ChineseScore MathScore EnglishScore
1        Jack           88        78           90
2         May           87        99           85
3        Rose           69        89           83
4         Tom           90        80           93
> with(df,{
+ print(ChineseScore)
+ SumScore = ChineseScore + MathScore + EnglishScore # creates a local vector
+ print(SumScore)
+ })
[1] 88 87 69 90
[1] 256 271 241 263

Note that SumScore is a local vector and cannot be used outside with{}:

> SumScore
Error: object 'SumScore' not found
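
If the computed vector is needed afterwards, one option is to rely on the fact that with() returns the value of its last expression and assign that result. A minimal sketch, rebuilding the example data so it runs on its own:

```r
df = data.frame(studentName = c("Jack", "May", "Rose", "Tom"),
                ChineseScore = c(88, 87, 69, 90),
                MathScore = c(78, 99, 89, 80),
                EnglishScore = c(90, 85, 83, 93))

# with() evaluates the expression using df's columns and returns the result.
SumScore = with(df, ChineseScore + MathScore + EnglishScore)
SumScore   # 256 271 241 263
```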

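Modifying a data frame

1. Adding a column

What if you want to modify the data frame itself, for example by adding the total score SumScore as a new field? The within function can do this. Its format is: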

data_frame_name = within(data_frame_name, {
  field access expressions
  ...
  field modification expressions
  ...
})

Adding SumScore to df looks like this:

> df = within(df,{
+ SumScore = ChineseScore + MathScore + EnglishScore
+ })
> df
  studentName ChineseScore MathScore EnglishScore SumScore
1        Jack           88        78           90      256
2         May           87        99           85      271
3        Rose           69        89           83      241
4         Tom           90        80           93      263

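New vectors created inside within{} are added to the data frame by default and become new fields.
The advantage is that you do not have to create a separate variable first and then add it to the data frame.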
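
For comparison, two other common ways to add the same column are direct assignment with $ and the base function transform(). This is only a sketch of alternatives, not the approach used above; df1 and df2 are made-up names.

```r
df = data.frame(studentName = c("Jack", "May", "Rose", "Tom"),
                ChineseScore = c(88, 87, 69, 90),
                MathScore = c(78, 99, 89, 80),
                EnglishScore = c(90, 85, 83, 93))

# Alternative 1: assign a new column with $.
df1 = df
df1$SumScore = df1$ChineseScore + df1$MathScore + df1$EnglishScore

# Alternative 2: transform() returns a copy with the new column.
df2 = transform(df, SumScore = ChineseScore + MathScore + EnglishScore)

df1$SumScore   # 256 271 241 263
```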

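2. Changing the column order

To swap ChineseScore and MathScore, index the data frame with the column names in the desired order. This returns a reordered copy; assign the result back to df if you want to keep the new order.
> df[, c('studentName','MathScore','ChineseScore','EnglishScore','SumScore')]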
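
When a data frame has many columns, typing every name is tedious. The following sketch swaps two columns by position in the name vector instead; cols, i, and swapped are made-up names for this illustration.

```r
df = data.frame(studentName = c("Jack", "May", "Rose", "Tom"),
                ChineseScore = c(88, 87, 69, 90),
                MathScore = c(78, 99, 89, 80),
                EnglishScore = c(90, 85, 83, 93))

cols = names(df)
i = match(c("ChineseScore", "MathScore"), cols)  # positions of the two columns
cols[i] = cols[rev(i)]                           # swap the two names

swapped = df[, cols]   # reordered copy; df itself is unchanged
names(swapped)         # "studentName" "MathScore" "ChineseScore" "EnglishScore"
```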

3. Filtering data with subset

Select the rows whose total score SumScore is greater than 260:

> x = subset(df, SumScore > 260)
> x
  studentName ChineseScore MathScore EnglishScore SumScore
2         May           87        99           85      271
4         Tom           90        80           93      263
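
subset() can also limit the columns through its select argument, and the same filter can be written with logical indexing. A minimal sketch that rebuilds the example data so it runs on its own; top and same_rows are made-up names.

```r
df = data.frame(studentName = c("Jack", "May", "Rose", "Tom"),
                ChineseScore = c(88, 87, 69, 90),
                MathScore = c(78, 99, 89, 80),
                EnglishScore = c(90, 85, 83, 93))
df$SumScore = df$ChineseScore + df$MathScore + df$EnglishScore

# Keep rows with SumScore > 260 and only two of the columns.
top = subset(df, SumScore > 260, select = c(studentName, SumScore))

# The same selection with logical row indexing.
same_rows = df[df$SumScore > 260, c("studentName", "SumScore")]

top   # rows for May (271) and Tom (263)
```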

Reference: 《R语言数据挖掘》 (R Language Data Mining), 2nd edition, by Xue Wei (薛薇).
This article is updated on an ongoing basis.