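Character vectors can be created with the c function. Strings may be quoted with either single or double quotes, as long as the opening and closing quotes match; double quotes are recommended: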
c("learn", "character", "and", "factor", "of", "r")
>> [1] "learn" "character" "and" "factor" "of" "r"
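The paste function combines strings, and its sep argument sets the separator placed between them. Once the strings have been combined, the collapse argument can merge the result into a single string containing all the elements. The toString function is a variant that joins the elements with a comma and a space and can limit the width of the output.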
paste("red", "yellow")
>> [1] "red yellow"
paste(c("red", "yellow"), "lorry")
>> [1] "red lorry" "yellow lorry"
paste(c("red", "yellow"), "lorry", sep = "-")
>> [1] "red-lorry" "yellow-lorry"
paste("a", 1:5, sep = "")
>> [1] "a1" "a2" "a3" "a4" "a5"
paste("a", 1:5, sep = "", collapse = "+")
>> [1] "a1+a2+a3+a4+a5"
x <- (1:15) ^ 2
toString(x)
>> [1] "1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225"
toString(x, width = 40)
>> [1] "1, 4, 9, 16, 25, 36, 49, 64, 81, 100...."
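The difference between sep and collapse is easy to mix up, so here is a minimal sketch of it. The vector name colours and its values are illustrative assumptions, not part of the examples above.

```r
# Illustrative data (an assumption for this sketch)
colours <- c("red", "yellow", "green")

# collapse joins the elements of one vector into a single string
joined <- paste(colours, collapse = ", ")
joined
# [1] "red, yellow, green"

# toString(x) gives the same result as paste(x, collapse = ", ")
same <- identical(toString(colours), joined)
same
# [1] TRUE

# paste0 is shorthand for paste(..., sep = "")
labels <- paste0("a", 1:3)
labels
# [1] "a1" "a2" "a3"
```

In short, sep works across the arguments element by element, while collapse works along the resulting vector.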
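The format function formats numbers: it can set the number of significant digits, the output width, whether to use scientific notation, and so on. It takes a numeric vector and returns a character vector. For finer control, sprintf builds strings from a C-style format specification: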
pow <- 1:3
(powers_of_e <- exp(pow))
>> [1] 2.718282 7.389056 20.085537
format(powers_of_e)
>> [1] " 2.718282" " 7.389056" "20.085537"
format(powers_of_e, digits = 3)
>> [1] " 2.72" " 7.39" "20.09"
format(powers_of_e, digits = 3, width = 10)
>> [1] "      2.72" "      7.39" "     20.09"
format(powers_of_e, digits = 3, scientific = TRUE)
>> [1] "2.72e+00" "7.39e+00" "2.01e+01"
sprintf("%s %d = %f", "Euler's constant to the power", 3, exp(3))
>> [1] "Euler's constant to the power 3 = 20.085537"
sprintf("To three decimal places, e ^ %d = %.3f", 3, exp(3))
>> [1] "To three decimal places, e ^ 3 = 20.086"
sprintf("In scientific notation, e ^ %d = %e", 3, exp(3))
>> [1] "In scientific notation, e ^ 3 = 2.008554e+01"
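Both functions are vectorized, which the single-value sprintf calls above do not show. The following is a small sketch under that assumption; the variable names formatted and padded are made up for illustration.

```r
powers_of_e <- exp(1:3)

# sprintf recycles its arguments, so one format string yields one
# result per element of the input vectors
formatted <- sprintf("e ^ %d = %.2f", 1:3, powers_of_e)
formatted
# [1] "e ^ 1 = 2.72"  "e ^ 2 = 7.39"  "e ^ 3 = 20.09"

# nsmall sets a minimum number of decimal places in format
padded <- format(2, nsmall = 2)
padded
# [1] "2.00"

# Both functions return character vectors, not numbers
is.character(formatted)
# [1] TRUE
```

Because the results are strings, they should be produced only at the point of display, after all numeric work is done.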
The toupper and tolower functions convert every character in a string to upper or lower case:
toupper("I'm Shouting")
>> [1] "I'M SHOUTING"
tolower("I'm Whispering")
>> [1] "i'm whispering"
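One common use of these functions is normalizing case before comparing strings. This is only a sketch; the vector names_in is a hypothetical input.

```r
# Hypothetical input with inconsistent capitalization
names_in <- c("Jack", "JACK", "jack")

# Converting to a single case makes the three spellings identical
normalized <- tolower(names_in)
distinct <- unique(normalized)
distinct
# [1] "jack"

# Case-insensitive equality check
tolower("Jack") == tolower("JACK")
# [1] TRUE
```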
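substr(x, start, stop) is the most commonly used function for working with substrings. It can extract a substring or replace one, where start is the index of the first character and stop is the index of the last; both are 1-based and inclusive. The strsplit function splits a string at a given separator and returns a list.
# Extract a substring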
substr("His name is Jack", 5, 8)
>> [1] "name"
# Replace a substring
x <- "His name is Jack"
substr(x, 13, 16) <- "John"
x
>> [1] "His name is John"
# Split a string on spaces
x <- "His name is Jack"
strsplit(x, " ")
>> [[1]]
>> [1] "His" "name" "is" "Jack"
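The `[[1]]` in the output above appears because strsplit returns a list with one element per input string. A minimal sketch with two inputs makes this visible; the vector sentences is an assumption made for illustration.

```r
# Two illustrative input strings
sentences <- c("His name is Jack", "Her name is Jill")

# strsplit returns a list: one character vector per input string
words <- strsplit(sentences, " ")
length(words)
# [1] 2

# Pick the first word of each sentence
first_words <- vapply(words, function(w) w[1], character(1))
first_words
# [1] "His" "Her"

# substr is vectorized over its first argument as well
prefixes <- substr(sentences, 1, 3)
prefixes
# [1] "His" "Her"
```

When only one string is split, `strsplit(x, " ")[[1]]` extracts the plain character vector.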
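A factor is a special variable type for storing categorical variables. Before R 4.0.0, data.frame converted character columns to factors by default; from R 4.0.0 onward the default is stringsAsFactors = FALSE, so the conversion has to be requested with stringsAsFactors = TRUE.
Besides letting a data frame create factors internally, you can create them directly with the factor function.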
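# stringsAsFactors = TRUE is required in R >= 4.0.0, where the default is FALSE
heights <- data.frame(
  height_cm = c(153, 181, 150, 172, 165, 149, 174, 169, 198, 163),
  gender = c(
    "female", "male", "female", "male", "male",
    "female", "female", "male", "male", "female"
  ),
  stringsAsFactors = TRUE
)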
class(heights$gender)
>> [1] "factor"
heights$gender
>> [1] female male female male male female female male male female
>> Levels: female male
gender_char <- c(
"female", "male", "female", "male", "male",
"female", "female", "male", "male", "female"
)
(gender_fac <- factor(gender_char))
>> [1] female male female male male female female male male female
>> Levels: female male
The levels argument changes the order of the levels when the factor is created:
factor(gender_char, levels = c("male", "female"))
>> [1] female male female male male female female male male female
>> Levels: male female
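It can help to see what reordering the levels actually changes. The sketch below uses a shortened, illustrative version of gender_char and inspects the integer codes that a factor stores internally.

```r
# Shortened illustrative data
gender_char <- c("female", "male", "female")

# Default: levels are sorted alphabetically
gender_fac <- factor(gender_char)
levels(gender_fac)
# [1] "female" "male"
default_codes <- as.integer(gender_fac)
default_codes
# [1] 1 2 1

# Explicit level order: the labels are unchanged, the codes are not
releveled <- factor(gender_char, levels = c("male", "female"))
new_codes <- as.integer(releveled)
new_codes
# [1] 2 1 2
as.character(releveled)
# [1] "female" "male"   "female"
```

The values printed for the factor stay the same; only the mapping from value to integer code, and therefore the level order used by functions such as table, changes.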
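While processing data you sometimes need to drop unused factor levels. For example, after the rows of the getting_to_work data frame with a missing time_mins value are removed, the level bus is still retained even though no remaining row uses it. The droplevels function removes such unused levels.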
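# stringsAsFactors = TRUE is required in R >= 4.0.0, where the default is FALSE
getting_to_work <- data.frame(
  mode = c(
    "bike", "car", "bus", "car", "walk",
    "bike", "car", "bike", "car", "car"
  ),
  time_mins = c(25, 13, NA, 22, 65, 28, 15, 24, NA, 14),
  stringsAsFactors = TRUE
)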
getting_to_work$mode
>> [1] bike car bus car walk bike car bike car car
>> Levels: bike bus car walk
# Remove rows where time_mins is missing
getting_to_work <- subset(getting_to_work, !is.na(time_mins))
unique(getting_to_work$mode)
>> [1] bike car walk
>> Levels: bike bus car walk
getting_to_work <- droplevels(getting_to_work)
levels(getting_to_work$mode)
>> [1] "bike" "car" "walk"
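droplevels also works on a single factor, and calling factor again on a factor has the same effect. A minimal sketch, assuming an illustrative factor named transport:

```r
# Illustrative factor (the name transport is an assumption)
transport <- factor(c("bike", "car", "bus", "walk", "car"))

# Subsetting keeps every original level, used or not
kept <- transport[transport != "bus"]
levels(kept)
# [1] "bike" "bus"  "car"  "walk"

# The unused level still shows up in table() with a count of zero
bus_count <- table(kept)[["bus"]]
bus_count
# [1] 0

# Either call removes the unused level
dropped <- droplevels(kept)
refactored <- factor(kept)
levels(dropped)
# [1] "bike" "car"  "walk"
```

Dropping unused levels matters mostly for summaries and plots, where an empty category would otherwise appear.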
Some factor levels are semantically greater or less than others. In that case use an ordered factor, created by passing ordered = TRUE.
# Exam grades: good < better < best
grade_choices <- c("good", "better", "best")
grade_values <- sample(grade_choices, 1000, replace = TRUE)
grade_fac <- factor(grade_values, grade_choices, ordered = TRUE)
head(grade_fac)
>> [1] best good good best best good
>> Levels: good < better < best
table(grade_fac)
>> grade_fac
>> good better best
>> 363 293 344
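The practical benefit of an ordered factor is that comparison operators and max or min respect the level order. The sketch below uses a small fixed vector instead of the random sample so that the results are reproducible; the name grades is an assumption.

```r
grade_choices <- c("good", "better", "best")

# Small fixed sample so the output is deterministic
grades <- factor(
  c("best", "good", "better", "good"),
  levels = grade_choices,
  ordered = TRUE
)

# Comparisons follow the level order, not alphabetical order
below_best <- grades < "best"
below_best
# [1] FALSE  TRUE  TRUE  TRUE

# max and min are defined for ordered factors
top <- as.character(max(grades))
top
# [1] "best"

is.ordered(grades)
# [1] TRUE
```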
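The cut function splits a numeric variable into groups and returns a factor. We randomly generate age data for 10,000 workers (from 16 to 66, using a Beta distribution) and group them into 10-year bands: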
ages <- 16 + 50 * rbeta(10000, 2, 3)
grouped_ages <- cut(ages, seq.int(16, 66, 10))
head(grouped_ages)
>> [1] (36,46] (56,66] (46,56] (16,26] (26,36] (26,36]
>> Levels: (16,26] (26,36] (36,46] (46,56] (56,66]
table(grouped_ages)
>> grouped_ages
>> (16,26] (26,36] (36,46] (46,56] (56,66]
>> 1788 3463 2960 1513 276
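The intervals above are open on the left, so a value equal to the lowest break would fall outside every group. The sketch below shows this and two ways to adjust it, using a few hand-picked ages; the labels such as "16-25" are illustrative assumptions.

```r
# Hand-picked ages, including the lowest break itself
ages <- c(16, 25, 26, 40, 65)
breaks <- seq.int(16, 66, 10)

# Default intervals are (a, b], so 16 belongs to no group
default_groups <- cut(ages, breaks)
is.na(default_groups[1])
# [1] TRUE

# right = FALSE switches to [a, b) intervals
closed_left <- cut(ages, breaks, right = FALSE)
as.character(closed_left)
# [1] "[16,26)" "[16,26)" "[26,36)" "[36,46)" "[56,66)"

# labels replaces the interval notation with custom names
labelled <- cut(
  ages, breaks,
  right = FALSE,
  labels = c("16-25", "26-35", "36-45", "46-55", "56-65")
)
as.character(labelled)
# [1] "16-25" "16-25" "26-35" "36-45" "56-65"
```

With right = FALSE the highest break (66) is excluded instead, so the choice depends on which boundary can actually occur in the data.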
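The gl function generates a factor. Its first argument is the number of levels, the second is how many times each level is repeated consecutively, and the third is the total length of the result. The interaction function combines several factors into one whose levels are the combinations of the original levels.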
gl(3, 2)
>> [1] 1 1 2 2 3 3
>> Levels: 1 2 3
gl(3, 2, labels = c("A", "B", "C"))
>> [1] A A B B C C
>> Levels: A B C
gl(3, 1, 6, labels = c("A", "B", "C"))
>> [1] A B C A B C
>> Levels: A B C
grade <- gl(3, 2, labels = c("A", "B", "C"))
gender <- gl(2, 1, 6, labels = c("female", "male"))
interaction(grade, gender)
>> [1] A.female A.male B.female B.male C.female C.male
>> Levels: A.female B.female C.female A.male B.male C.male
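In the output above the levels are ordered with the first factor varying fastest. The sketch below shows two arguments of interaction that change the result, sep and lex.order; the variable names combined and lex are assumptions.

```r
grade <- gl(3, 2, labels = c("A", "B", "C"))
gender <- gl(2, 1, 6, labels = c("female", "male"))

# sep changes the separator used in the combined level names
combined <- interaction(grade, gender, sep = "_")
as.character(combined)
# [1] "A_female" "A_male"   "B_female" "B_male"   "C_female" "C_male"
nlevels(combined)
# [1] 6

# lex.order = TRUE orders the levels with the last factor varying fastest
lex <- interaction(grade, gender, lex.order = TRUE)
levels(lex)
# [1] "A.female" "A.male"   "B.female" "B.male"   "C.female" "C.male"
```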