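Working with strings in R using the stringr package

stringr is one of the packages in the Tidyverse, the popular collection of R packages for data processing. It is a convenient tool for string manipulation and becomes especially powerful when combined with regular expressions.

String length

stringr functions operate on vectors. str_length() returns the length of each string in a character vector.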

> library(stringr)
> x <- c("why", "video", "cross", "extra", "deal", "authority")
> str_length(x)
#> [1] 3 5 5 5 4 9

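The following call, however, fails. This is not a syntax error: str_length() takes a single string argument (a character vector), so the additional strings are treated as extra arguments that the function does not accept. Wrap them in c() to form one vector first.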

> str_length("why", "video", "cross", "extra", "deal", "authority")
Error in str_length("why", "video", "cross", "extra", "deal", "authority") : 
  unused arguments ("video", "cross", "extra", "deal", "authority")
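As a minimal sketch of the fix described above, the strings can be combined with c() before the call. The variable names `words`, `lens` and `edge` are only illustrative, and the second call is an added illustration of how missing values and empty strings are counted.

```r
library(stringr)

# Combine the separate strings into one character vector first
words <- c("why", "video", "cross", "extra", "deal", "authority")
lens <- str_length(words)

# A missing value stays missing, and the empty string has length 0
edge <- str_length(c("abc", NA, ""))
```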

String concatenation

str_c() concatenates strings. Its main arguments are the character vectors to join, sep = '' and collapse = NULL.

> x = c('apple', 'banana','peach')
> y = c('one', 'two', 'three')
> str_c(x, y)
# [1] "appleone"   "bananatwo"  "peachthree"
> str_c(x, y, sep = '_')
# [1] "apple_one"   "banana_two"  "peach_three"
> str_c(x, y, collapse = "_")
# [1] "appleone_bananatwo_peachthree"

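Note how sep and collapse differ in the examples above: with sep the result is still a vector of several strings, whereas collapse joins everything into a single string.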

> case1 <- str_c(x, y, sep = '_')
> str_length(case1)
# [1]  9 10 11
> case2 <- str_c(x, y, collapse = "_")
> str_length(case2)
# [1] 29
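The two arguments can also be used together. The sketch below, which reuses the x and y vectors from above, first joins the pairs with sep and then collapses the results into one string. The names `joined` and `labelled` are illustrative.

```r
library(stringr)

x <- c("apple", "banana", "peach")
y <- c("one", "two", "three")

# sep joins element-wise, then collapse merges the results into one string
joined <- str_c(x, y, sep = "_", collapse = ", ")

# A length-1 argument is recycled across the longer vector
labelled <- str_c("item-", x)
```

Here `joined` is a single string, while `labelled` keeps one element per fruit.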

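String splitting

str_split() is the stringr function for splitting strings. It splits each string on a given pattern, optionally limiting the number of pieces, and the pieces you need can then be picked out of the result.

# Build a character vector whose elements are delimited by '_'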
> x <- c('aajs_123_dkks', 'ahda_236_akdk', 'ahdj_178_ajdj', 'agsh_109_auqyr', 'qwp_2635_qnjx')
> str_split(x, pattern = '_')
[[1]]
[1] "aajs" "123"  "dkks"

[[2]]
[1] "ahda" "236"  "akdk"

[[3]]
[1] "ahdj" "178"  "ajdj"

[[4]]
[1] "agsh"  "109"   "auqyr"

[[5]]
[1] "qwp"  "2635" "qnjx"

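The main arguments are pattern, n = Inf and simplify = FALSE. By default the return value is a list; with simplify = TRUE it is a character matrix, whose class is reported as "matrix" "array" in R 4.0 and later.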

> class(str_split(x, pattern = '_'))
[1] "list"
> str_split(x, pattern = '_', simplify = TRUE)
     [,1]   [,2]   [,3]   
[1,] "aajs" "123"  "dkks" 
[2,] "ahda" "236"  "akdk" 
[3,] "ahdj" "178"  "ajdj" 
[4,] "agsh" "109"  "auqyr"
[5,] "qwp"  "2635" "qnjx" 
> class(str_split(x, pattern = '_', simplify = TRUE))
# [1] "matrix" "array" 

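After splitting a character vector you can pick out a particular column for later steps. In this example, to take the numeric column after the split, use ordinary matrix indexing.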

> str_split(x, pattern = '_', simplify = TRUE)[,2]
[1] "123"  "236"  "178"  "109"  "2635"

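String subsetting

str_subset() selects the strings in a vector that match a given pattern; the pattern can be a regular expression. Its arguments include pattern and negate. negate defaults to FALSE; when set to TRUE, it inverts the selection and returns the strings that do not match.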

> x <- c("why", "video", "cross", "extra", "deal", "authority")
> str_subset(x, pattern = 'o')
[1] "video"     "cross"     "authority"
> str_subset(x, pattern = 'o', negate = TRUE)
[1] "why"   "extra" "deal" 

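Note that pattern is interpreted as a regular expression by default. The pattern 'oi' matches the literal sequence "oi", whereas the character class '[oi]' matches either "o" or "i", as the following example shows.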

> x <- c("why", "video", "cross", "extra", "deal", "authority")
> str_subset(x, pattern = 'oi')
# character(0)
> str_subset(x, pattern = '[oi]')
[1] "video"     "cross"     "authority"

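String replacement

str_replace() replaces the part of each string that matches a pattern. Its arguments include pattern, the pattern to replace, and replacement, the text to substitute in.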

> x <- c("why", "video", "cross", "extra", "deal", "authority")
> str_replace(x, 'i', '@')
[1] "why"       "v@deo"     "cross"     "extra"     "deal"      "author@ty"

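Character classes change the result here as well: 'ie' matches only the sequence "ie", while '[ie]' matches either letter: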

> str_replace(x, 'ie', '@@')
[1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
> str_replace(x, '[ie]', '@@')
[1] "why"        "v@@deo"     "cross"      "@@xtra"     "d@@al"      "author@@ty"

Note that str_replace() replaces only the first match in each string; use str_replace_all() to replace every match:

> x <- c('apple', 'happy')
> x
[1] "apple" "happy"
> str_replace(string = x, pattern = 'p', replacement = '%')
[1] "a%ple" "ha%py"
> str_replace_all(string = x, pattern = 'p', replacement = '%')
[1] "a%%le" "ha%%y"

In addition, str_replace_na() replaces missing values (NA) with a string:

> x <- c('one',  NA,'ten', NA, 'eleven',NA)
> x
[1] "one"    NA       "ten"    NA       "eleven" NA   
> str_replace_na(string = x, replacement = '%')
[1] "one"    "%"      "ten"    "%"      "eleven" "%" 

String padding

str_pad() pads strings to a given width. Its arguments include string, width, side = c("left", "right", "both") and pad = " ". For example:

> str_pad(string = letters[1:7], width = 5, side = 'left', pad = '#')
[1] "####a" "####b" "####c" "####d" "####e" "####f" "####g"

> str_pad(string = letters[1:7], width = 5, side = 'both', pad = '#')
[1] "##a##" "##b##" "##c##" "##d##" "##e##" "##f##" "##g##"

> str_pad(string = letters[1:7], width = 5, side = 'right', pad = '#')
[1] "a####" "b####" "c####" "d####" "e####" "f####" "g####"
