1. Grouped sums in R
Sample dataset with the columns: name, age, class, score, subject
students <- data.frame(
  name = c("s1", "s2", "s3", "s2", "s1", "s3"),
  age = c(12, 13, 10, 13, 12, 10),
  classid = c("c1", "c2", "c3", "c2", "c1", "c3"),
  score = c(78, 68, 99, 81, 82, 90),
  subject = c("su1", "su1", "su1", "su2", "su2", "su2"),
  # R >= 4.0 defaults to stringsAsFactors = FALSE; set it to get the Factor columns shown below
  stringsAsFactors = TRUE
)
> str(students)
'data.frame': 6 obs. of 5 variables:
$ name : Factor w/ 3 levels "s1","s2","s3": 1 2 3 2 1 3
$ age : num 12 13 10 13 12 10
$ classid: Factor w/ 3 levels "c1","c2","c3": 1 2 3 2 1 3
$ score : num 78 68 99 81 82 90
$ subject: Factor w/ 2 levels "su1","su2": 1 1 1 2 2 2
Now let us compute the total score for each subject. In SQL this would be:
select subject, sum(score) from students group by subject
In R, tapply does the same job:
> tapply(students$score, students$subject, sum)
su1 su2
245 253
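An SQL group by returns a table, while tapply returns a named array. If a data frame is more convenient, base R's aggregate gives the same totals. The snippet below is a minimal sketch that rebuilds only the two columns it needs; swapping sum for mean corresponds to avg(score) in SQL.

```r
# Only the columns needed for the grouped sum
students <- data.frame(
  score   = c(78, 68, 99, 81, 82, 90),
  subject = c("su1", "su1", "su1", "su2", "su2", "su2")
)

# Formula interface: sum score within each subject, result is a data frame
by_subject <- aggregate(score ~ subject, data = students, FUN = sum)
print(by_subject)

# Same grouping, different summary function (like avg(score) in SQL)
mean_by_subject <- tapply(students$score, students$subject, mean)
print(mean_by_subject)
```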
Here is another example to deepen the understanding of factors:
> ages <- c(25, 26, 55, 37, 21, 42)
> affils <- c("R", "D", "D", "R", "U", "D")
> affils <- as.factor(x = affils)
> affils
[1] R D D R U D
Levels: D R U
> affils <- factor(affils, ordered = TRUE)
> affils
[1] R D D R U D
Levels: D < R < U
> affils <- factor(affils, levels = c("U", "R", "D"), ordered = TRUE)
> affils
[1] R D D R U D
Levels: U < R < D
> tapply(ages, affils, mean)
U R D
21 31 41
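The transcript above suggests that the factor levels only control the order in which tapply reports the groups, not the values themselves. A small sketch to check this, using the same ages and affils data (the variable names default_order and custom_order are introduced here for illustration):

```r
ages   <- c(25, 26, 55, 37, 21, 42)
affils <- c("R", "D", "D", "R", "U", "D")

# Default levels are sorted alphabetically: D, R, U
default_order <- tapply(ages, factor(affils), mean)

# Explicit levels: the groups are reported as U, R, D
custom_order <- tapply(ages, factor(affils, levels = c("U", "R", "D")), mean)

print(names(default_order))
print(names(custom_order))

# Same group means either way, only the ordering differs
print(default_order[names(custom_order)])
```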
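With these basics in place, let us make things a little harder: what if there is more than one grouping variable?
Consider the following example.
The sample data: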
> staff <- data.frame(gender = c("M", "M", "F", "M", "F", "F"),
+                     age = c(47, 59, 21, 32, 33, 24),
+                     income = c(55000, 88000, 32450, 76500, 123000, 45650),
+                     stringsAsFactors = TRUE  # needed on R >= 4.0 for the Factor column below
+ )
> staff
gender age income
1 M 47 55000
2 M 59 88000
3 F 21 32450
4 M 32 76500
5 F 33 123000
6 F 24 45650
> str(staff)
'data.frame': 6 obs. of 3 variables:
$ gender: Factor w/ 2 levels "F","M": 2 2 1 2 1 1
$ age : num 47 59 21 32 33 24
$ income: num 55000 88000 32450 76500 123000 ...
> staff$over25 <- ifelse(staff$age > 25, 1, 0)
> staff
gender age income over25
1 M 47 55000 1
2 M 59 88000 1
3 F 21 32450 0
4 M 32 76500 1
5 F 33 123000 1
6 F 24 45650 0
> tapply(staff$income, list(staff$gender, staff$over25), sum)
0 1
F 78100 123000
M NA 219500
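The NA in the M/0 cell appears because no man in the data is aged 25 or under, so sum is never called for that combination. If a zero total is preferred for empty cells, one simple option is to replace the NA afterwards. The sketch below rebuilds the staff data so it runs on its own; totals is a name introduced here.

```r
staff <- data.frame(
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(47, 59, 21, 32, 33, 24),
  income = c(55000, 88000, 32450, 76500, 123000, 45650)
)
staff$over25 <- ifelse(staff$age > 25, 1, 0)

# Rows are gender, columns are over25; the result is a matrix with dimnames
totals <- tapply(staff$income, list(staff$gender, staff$over25), sum)
print(is.na(totals["M", "0"]))  # the empty group is NA

# Treat "no rows in this group" as a total of zero
totals[is.na(totals)] <- 0
print(totals)
```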
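2. What if you only want to group the data, without summarizing it? Then use the split function (note that splitting a string is done with strsplit instead). The following two examples make this clear.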
> split(staff$income, list(staff$over25, staff$gender))
$`0.F`
[1] 32450 45650
$`1.F`
[1] 123000
$`0.M`
numeric(0)
$`1.M`
[1] 55000 88000 76500
> split(staff$income, list(staff$gender, staff$over25))
$F.0
[1] 32450 45650
$M.0
numeric(0)
$F.1
[1] 123000
$M.1
[1] 55000 88000 76500
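split and tapply are closely related: applying a function to each piece returned by split gives the same numbers as tapply. The sketch below also passes drop = TRUE, which asks split to leave out combinations that never occur (here M.0). The staff data is rebuilt so the snippet is self-contained; groups and sums are names introduced here.

```r
staff <- data.frame(
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(47, 59, 21, 32, 33, 24),
  income = c(55000, 88000, 32450, 76500, 123000, 45650)
)
staff$over25 <- ifelse(staff$age > 25, 1, 0)

# drop = TRUE omits empty gender/over25 combinations
groups <- split(staff$income, list(staff$gender, staff$over25), drop = TRUE)
print(names(groups))

# Summing each piece reproduces the non-empty cells of the tapply table
sums <- sapply(groups, sum)
print(sums)
```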
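Here is an interesting example: using split to quickly find the row indices of the men in the data above. A natural first idea is to sort the data, but if the data keeps changing, how do we locate the indices of the category we care about?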
> split(1:length(staff$gender), staff$gender)
$F
[1] 3 5 6
$M
[1] 1 2 4
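For a single category, which gives the same indices; the advantage of split is that it returns the indices for every category in one call. A short sketch with the gender and income columns from above, written as plain vectors so it runs on its own:

```r
gender <- c("M", "M", "F", "M", "F", "F")
income <- c(55000, 88000, 32450, 76500, 123000, 45650)

# Indices for all categories at once; seq_along(x) is a safer 1:length(x)
idx <- split(seq_along(gender), gender)
print(idx$M)

# Indices for one category only
print(which(gender == "M"))

# The indices can then be used to pull out the matching values
print(income[idx$M])
```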
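If we connect this technique with text mining, it turns out to solve the word-index (concordance) problem for English text very easily.
Suppose you are given a text file in which words are separated by whitespace, and you need to find which words occur in it, at which positions, and how many times. The following function does this: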
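filewords <- function(tf) {
  # Read the file as a character vector of whitespace-separated words
  txt <- scan(tf, what = "")
  # Group word positions by word; seq_along also handles an empty file
  words <- split(seq_along(txt), txt)
  return(words)
}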
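A possible way to try the function is sketched below. The sample sentence and the temporary file are made up for illustration, and the function is repeated (with quiet = TRUE to silence scan's "Read N items" message) so the snippet runs on its own. The number of occurrences of each word is simply the length of its position vector.

```r
filewords <- function(tf) {
  txt <- scan(tf, what = "", quiet = TRUE)
  split(seq_along(txt), txt)
}

# Hypothetical sample text written to a temporary file
tf <- tempfile(fileext = ".txt")
writeLines("the cat saw the dog and the bird", tf)

idx <- filewords(tf)
print(idx$the)        # positions at which "the" occurs

# Occurrence count per word = number of positions recorded for it
counts <- lengths(idx)
print(sort(counts, decreasing = TRUE))

unlink(tf)            # clean up the temporary file
```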
One last remark: in R, avoid explicit loops whenever you can.
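To illustrate the point, here is a sketch that computes the per-subject totals from the first section twice: once with an explicit for loop and once with tapply. The loop version is written only for comparison; totals_loop and totals_vec are names introduced here.

```r
score   <- c(78, 68, 99, 81, 82, 90)
subject <- c("su1", "su1", "su1", "su2", "su2", "su2")

# Loop version: accumulate a running total for each subject by hand
totals_loop <- numeric(0)
for (i in seq_along(score)) {
  key  <- subject[i]
  prev <- if (key %in% names(totals_loop)) totals_loop[[key]] else 0
  totals_loop[key] <- prev + score[i]
}
print(totals_loop)

# Loop-free version: one call, same totals
totals_vec <- tapply(score, subject, sum)
print(totals_vec)
```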