Processing strings in R with the stringr package

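stringr is one of the packages in the Tidyverse, the popular collection of R packages for data processing. It is a convenient tool for working with strings, and it becomes especially powerful when combined with regular expressions.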

String length

stringr functions operate on vectors. str_length() returns the number of characters in each string.

> library(stringr)
> x <- c("why", "video", "cross", "extra", "deal", "authority")
> str_length(x)
#> [1] 3 5 5 5 4 9

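Passing the strings as separate arguments, as below, raises an error. It is not a syntax error: str_length() accepts a single string argument, so the extra strings are rejected as unused arguments. Wrap them in c() to pass one vector instead.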

> str_length("why", "video", "cross", "extra", "deal", "authority")
Error in str_length("why", "video", "cross", "extra", "deal", "authority") : 
  unused arguments ("video", "cross", "extra", "deal", "authority")
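
A minimal sketch of the fix, assuming stringr is installed. The names words and n_chars are only illustrative. It also shows that a missing value stays missing rather than being counted as a string.

```r
library(stringr)

# Combine the strings into a single character vector before calling str_length()
words <- c("why", "video", NA)
n_chars <- str_length(words)
print(n_chars)
# [1]  3  5 NA
```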

String concatenation

str_c() concatenates strings. Its main arguments are the character vectors to join, sep = "" and collapse = NULL.

> x <- c('apple', 'banana', 'peach')
> y <- c('one', 'two', 'three')
> str_c(x, y)
# [1] "appleone"   "bananatwo"  "peachthree"
> str_c(x, y, sep = '_')
# [1] "apple_one"   "banana_two"  "peach_three"
> str_c(x, y, collapse = "_")
# [1] "appleone_bananatwo_peachthree"

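Note the difference between sep and collapse in the examples above: with sep the result is still a vector of several strings, whereas collapse joins everything into a single string.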

> case1 <- str_c(x, y, sep = '_')
> str_length(case1)
# [1]  9 10 11
> case2 <- str_c(x, y, collapse = "_")
> str_length(case2)
# [1] 29
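
The two arguments can also be combined. The sketch below reuses the vectors from above; the name joined is only illustrative.

```r
library(stringr)

x <- c("apple", "banana", "peach")
y <- c("one", "two", "three")

# sep joins the vectors element-wise; collapse then joins those results into one string
joined <- str_c(x, y, sep = "_", collapse = ", ")
print(joined)
# [1] "apple_one, banana_two, peach_three"
```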

String splitting

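str_split() is the stringr function for splitting strings. It splits each string at the matches of a pattern, and the n argument caps the number of pieces returned, so you can then pick out the pieces you need.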

# Build a character vector whose elements are delimited by '_'
> x <- c('aajs_123_dkks', 'ahda_236_akdk', 'ahdj_178_ajdj', 'agsh_109_auqyr', 'qwp_2635_qnjx')
> str_split(x, pattern = '_')
[[1]]
[1] "aajs" "123"  "dkks"

[[2]]
[1] "ahda" "236"  "akdk"

[[3]]
[1] "ahdj" "178"  "ajdj"

[[4]]
[1] "agsh"  "109"   "auqyr"

[[5]]
[1] "qwp"  "2635" "qnjx"

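The main arguments are pattern, n = Inf and simplify = FALSE. By default the return value is a list; with simplify = TRUE it is a character matrix (whose class is reported as "matrix" "array").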

> class(str_split(x, pattern = '_'))
[1] "list"
> str_split(x, pattern = '_', simplify = TRUE)
     [,1]   [,2]   [,3]   
[1,] "aajs" "123"  "dkks" 
[2,] "ahda" "236"  "akdk" 
[3,] "ahdj" "178"  "ajdj" 
[4,] "agsh" "109"  "auqyr"
[5,] "qwp"  "2635" "qnjx" 
> class(str_split(x, pattern = '_', simplify = TRUE))
# [1] "matrix" "array"

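After splitting a character vector you often want one particular column for later steps. In this example the numeric field is the second column, so ordinary matrix indexing picks it out.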

> str_split(x, pattern = '_', simplify = TRUE)[,2]
[1] "123"  "236"  "178"  "109"  "2635"
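
A short sketch of the n argument and of working with the default list output, assuming the same '_'-delimited input. The names head_rest and second are only illustrative.

```r
library(stringr)

x <- c("aajs_123_dkks", "qwp_2635_qnjx")

# n = 2 stops after the first split, so the remainder stays in one piece
head_rest <- str_split(x, pattern = "_", n = 2)
print(head_rest[[1]])
# [1] "aajs"     "123_dkks"

# Without simplify = TRUE, take the second piece from each list element
second <- sapply(str_split(x, pattern = "_"), function(parts) parts[2])
print(second)
# [1] "123"  "2635"
```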

String subsetting

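str_subset() keeps the elements of a vector that match a pattern, which may be a regular expression. Its arguments include pattern and negate. negate defaults to FALSE; setting it to TRUE inverts the selection and returns the elements that do not match.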

> x <- c("why", "video", "cross", "extra", "deal", "authority")
> str_subset(x, pattern = 'o')
[1] "video"     "cross"     "authority"
> str_subset(x, pattern = 'o', negate = TRUE)
[1] "why"   "extra" "deal"

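The pattern is interpreted as a regular expression, so a literal sequence and a character class behave differently: 'oi' matches only the two-character sequence "oi", whereas '[oi]' matches either "o" or "i".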

> x <- c("why", "video", "cross", "extra", "deal", "authority")
> str_subset(x, pattern = 'oi')
# character(0)
> str_subset(x, pattern = '[oi]')
[1] "video"     "cross"     "authority"
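
When a pattern should be compared literally rather than as a regular expression, stringr provides the fixed() modifier. The sketch below uses a made-up vector named files to contrast the two behaviours.

```r
library(stringr)

files <- c("a.b", "acb", "ab")

# As a regular expression, "." matches any single character
regex_hits <- str_subset(files, pattern = "a.b")
print(regex_hits)
# [1] "a.b" "acb"

# fixed() compares the pattern literally, so only a real dot matches
literal_hits <- str_subset(files, pattern = fixed("a.b"))
print(literal_hits)
# [1] "a.b"
```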

String replacement

str_replace() replaces text that matches a pattern. Its arguments include pattern, the pattern to look for, and replacement, the text to put in its place.

> x <- c("why", "video", "cross", "extra", "deal", "authority")
> str_replace(x, 'i', '@')
[1] "why"       "v@deo"     "cross"     "extra"     "deal"      "author@ty"

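Because the pattern is a regular expression, 'ie' looks for the two-character sequence "ie", while the character class '[ie]' matches either "i" or "e":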

> str_replace(x, 'ie', '@@')
[1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
> str_replace(x, '[ie]', '@@')
[1] "why"        "v@@deo"     "cross"      "@@xtra"     "d@@al"      "author@@ty"

Note that str_replace() replaces only the first match in each string. Use str_replace_all() to replace every match:

> x <- c('apple', 'happy')
> x
[1] "apple" "happy"
> str_replace(string = x, pattern = 'p', replacement = '%')
[1] "a%ple" "ha%py"
> str_replace_all(string = x, pattern = 'p', replacement = '%')
[1] "a%%le" "ha%%y"
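
str_replace_all() also accepts a named character vector, which applies several replacements in one call. This goes slightly beyond the examples above, so treat it as an illustrative sketch; the name swapped is made up.

```r
library(stringr)

x <- c("apple", "happy")

# Each name is a pattern and each value is its replacement
swapped <- str_replace_all(x, c("a" = "A", "p" = "%"))
print(swapped)
# [1] "A%%le" "hA%%y"
```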

In addition, str_replace_na() replaces missing values (NA) with a string of your choice:

> x <- c('one', NA, 'ten', NA, 'eleven', NA)
> x
[1] "one"    NA       "ten"    NA       "eleven" NA   
> str_replace_na(string = x, replacement = '%')
[1] "one"    "%"      "ten"    "%"      "eleven" "%"
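
A small sketch of why this is useful, assuming the default replacement of "NA" and an illustrative prefix "id_": str_c() propagates missing values, so converting them to text first keeps every element.

```r
library(stringr)

x <- c("one", NA, "ten")

# With no replacement given, NA becomes the two-character string "NA"
as_text <- str_replace_na(x)
print(as_text)
# [1] "one" "NA"  "ten"

# A missing value makes the concatenated element missing as well
with_na <- str_c("id_", x)
print(with_na)
# [1] "id_one" NA       "id_ten"

# Replacing NA first keeps all three elements
labelled <- str_c("id_", str_replace_na(x, replacement = "missing"))
print(labelled)
# [1] "id_one"     "id_missing" "id_ten"
```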

String padding

Use str_pad() to pad strings to a given width. Its main arguments are string, width, side = c("left", "right", "both") and pad = " ". For example:

> str_pad(string = letters[1:7], width = 5, side = 'left', pad = '#')
[1] "####a" "####b" "####c" "####d" "####e" "####f" "####g"

> str_pad(string = letters[1:7], width = 5, side = 'both', pad = '#')
[1] "##a##" "##b##" "##c##" "##d##" "##e##" "##f##" "##g##"

> str_pad(string = letters[1:7], width = 5, side = 'right', pad = '#')
[1] "a####" "b####" "c####" "d####" "e####" "f####" "g####"
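
A common use is zero-padding identifiers to a fixed width. The sketch below uses a made-up vector named ids and relies on the default side, "left". Strings that are already at least width characters long are returned unchanged.

```r
library(stringr)

ids <- c("7", "42", "123", "4567")

# Pad on the left with zeros up to three characters; longer strings are not truncated
padded <- str_pad(ids, width = 3, pad = "0")
print(padded)
# [1] "007"  "042"  "123"  "4567"
```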