Xiaojie Explains "R for Data Science": Chapter 10, Handling Strings with stringr (Part 2)

4. Tools

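The stringr functions in this section let you:

• determine which strings match a pattern;
• find the positions of matches;
• extract the content of matches;
• replace matches with new values;
• split a string based on a match.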

4.1 Detecting matches

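str_detect() returns only a logical vector saying whether each string matches the pattern; in practice, counting the matches is often more useful.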

x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1] TRUE FALSE TRUE

Combine str_detect() with sum() and mean() to get the number and the proportion of matches (TRUE counts as 1, FALSE as 0).

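# How many common words start with t?
sum(str_detect(words, "^t")) # sum() gives the count
#> [1] 65
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$")) # mean() gives the proportion; [] matches any one of the enclosed characters
#> [1] 0.277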

A complex regular expression can often be broken into several simpler ones.

Find all words that contain no vowels:

# Find all words containing at least one vowel, then negate
no_vowels_1 <- !str_detect(words, "[aeiou]")
# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)

Subsetting by match

words[str_detect(words, "x$")] # option 1: logical subsetting
#> [1] "box" "sex" "six" "tax"
str_subset(words, "x$") # option 2: str_subset() subsets directly
#> [1] "box" "sex" "six" "tax"

Combined with filter(), the same idea subsets the rows of a data frame by match.

df <- tibble(
  word = words,
  i = seq_along(word)
)
# Note: seq_along() generates the row numbers here.
df %>%
  filter(str_detect(word, "x$")) # use the column `word`, not the global vector `words`

str_count() counts how many matches there are in each string.

x <- c("apple", "banana", "pear")
str_count(x, "a")
#> [1] 1 3 1
# On average, how many vowels are there per word?
mean(str_count(words, "[aeiou]"))
#> [1] 1.99

Combine str_count() with mutate() to add the match counts as new columns.

df %>%
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )
# (I quietly looked these two words up: "vowels" and "consonants".)

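This explains the "only one match" behaviour I noticed earlier.
Rule: str_view() shows only the first match in each string and str_view_all() shows every match (this holds for stringr before 1.5.0; from 1.5.0 on, str_view() shows all matches and str_view_all() is deprecated). Either way, matches never overlap.

For example, the pattern "aba" matches "abababa" twice, not three times.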
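The non-overlapping rule can be checked directly. The following is a minimal sketch using str_count() and str_extract_all() on the "abababa" example; the variable names are only for illustration.

```r
library(stringr)

# Matches are consumed left to right, so they never overlap:
# "aba" is found at positions 1-3 and 5-7, but not at 3-5.
n_matches <- str_count("abababa", "aba")
n_matches
#> [1] 2

# str_extract_all() reports the same two non-overlapping matches.
found <- str_extract_all("abababa", "aba")[[1]]
found
#> [1] "aba" "aba"
```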

4.3 Extracting matches

`str_subset` returns the whole strings that contain a match.

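# First, note that `sentences` is a character vector shipped with stringr; each element is a sentence.
# A few colour words are scattered through it, and the example extracts them.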

length(sentences)
head(sentences)

class(sentences)
#>[1] "character"
# Build the pattern: any one of several colours, joined with |
colors <- c(
  "red", "orange", "yellow", "green", "blue", "purple"
)

color_match <- str_c(colors, collapse = "|")
color_match

# Use str_subset() to keep the sentences that match

has_color <- str_subset(sentences, color_match)
has_color

# Keep the sentences that contain more than one match
more <- sentences[str_count(sentences, color_match) > 1]
str_view_all(more, color_match) # view all matches

`str_extract` returns the first match in each string as a character vector; `str_extract_all` returns all matches as a list.

# Extract the first match in each string
str_extract(more, color_match)
#> [1] "blue" "green" "orange"
# Extract all matches in each string (returned as a list)
str_extract_all(more, color_match)
# simplify = TRUE returns a matrix, padding shorter rows with "" so all rows have the same length
str_extract_all(more, color_match, simplify = TRUE)
#>      [,1]     [,2]
#> [1,] "blue"   "red"
#> [2,] "green"  "red"
#> [3,] "orange" "red"
x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
#>      [,1] [,2] [,3]
#> [1,] "a"  ""   ""
#> [2,] "a"  "b"  ""
#> [3,] "a"  "b"  "c"

4.5 Grouped matches

Reminder:
• ?: 0 or 1 time.
• +: 1 or more times.
• *: 0 or more times.

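Match every word that follows "a" or "the". In the regular expression a "word" is approximated as a sequence of at least one non-space character.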

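noun <- "(a|the) ([^ ]+)" # define the pattern
# | means "or", [] matches any one enclosed character, ^ inside [] negates, + means one or more times
has_noun <- sentences %>%
  str_subset(noun) %>% # keep the sentences that match
  head(10) # take the first ten
has_noun %>%
  str_extract(noun) # extract the first match in each sentence, giving ten results
#>  [1] "the smooth" "the sheet"  "the depth"  "a chicken"  "the parked"
#>  [6] "the sun"    "the huge"   "the ball"   "the woman"  "a helps"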

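str_match() and tidyr::extract() take the first match in each string and break it into its groups. str_match() returns a matrix with one column for the full match followed by one column per group; tidyr::extract() does the same for a data frame column, putting each group into its own named column.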

has_noun %>%
  str_match(noun)

tibble(sentence = sentences) %>%
  tidyr::extract(
    sentence, c("article", "noun"), "(a|the) ([^ ]+)",
    remove = FALSE
  )
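To make the shape of the str_match() result concrete, here is a small sketch on a toy vector. The vector `demo` is hypothetical and not taken from `sentences`; the pattern is the same one used above.

```r
library(stringr)

# Hypothetical toy input for illustration only.
demo <- c("the cat sat", "a dog", "no match")
pattern <- "(a|the) ([^ ]+)"

m <- str_match(demo, pattern)
m
# Column 1 holds the full match, column 2 the article (first group),
# column 3 the following word (second group).
# The third string contains no match, so its whole row is NA.
```

Because unmatched strings still produce a row of NA, the result always has one row per input string, which keeps it aligned with the original vector.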


4.7 Replacing matches

str_replace() replaces only the first match in each string.
str_replace_all() replaces every match in each string.

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-") # replace the first vowel in each string with -
#> [1] "-pple" "p-ar" "b-nana"
str_replace_all(x, "[aeiou]", "-") # replace every vowel in each string with -
#> [1] "-ppl-" "p--r" "b-n-n-"

Pass a named vector to str_replace_all() to perform several replacements at once:

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house" "two cars" "three people"

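Backreferences (\\1, \\2, ...) insert the captured groups into the replacement. For example, swap the second and third words: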

sentences %>%
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
  head(5)
# Compare with the unswapped originals
sentences %>%
  head(5)
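The effect of the backreferences is easier to see on a single short string. This is a minimal sketch; the input `demo` is a made-up example.

```r
library(stringr)

# Hypothetical one-line input for illustration.
demo <- "the quick brown fox"

# Groups 1-3 capture the first three words; the replacement
# writes them back in the order 1, 3, 2.
swapped <- str_replace(demo, "([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2")
swapped
#> [1] "the brown quick fox"
```

Only the matched part (the first three words) is rewritten; the rest of the string is left as it was.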

4.9 Splitting with str_split

str_split() splits a string into pieces.
For example, split sentences into words:

sentences %>%
head(5) %>%
str_split(" ")

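str_split() returns a list with one element per input string, so even a single string gives a list.
For a vector with several elements, set simplify = TRUE to get a matrix instead.
n sets the maximum number of pieces: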

fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)

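(This is somewhat like separate() in tidyr, which splits one column into several at a separator. The difference: if you ask for fewer columns than the separator would produce, separate() by default drops the extra pieces, whereas str_split() with n keeps them in the last piece.)
For example: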

# A variation on the code above: two separators would give three columns, but what if I ask for only two?
fields <- c("Name: Hadley: ma", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)

With its default settings, separate() would instead drop the extra "ma" (and emit a warning).
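The comparison can be sketched as follows, assuming tidyr and tibble are installed. The column names `key` and `value` and the object names are chosen here for illustration; `extra = "merge"` is the tidyr option that keeps the remainder instead of dropping it.

```r
library(stringr)
library(tidyr)
library(tibble)

fields <- c("Name: Hadley: ma", "Country: NZ", "Age: 35")
fields_df <- tibble(x = fields)

# Default behaviour (extra = "warn"): the surplus piece "ma" is dropped
# and a warning is raised; the warning is suppressed here to keep output clean.
dropped <- suppressWarnings(
  separate(fields_df, x, into = c("key", "value"), sep = ": ")
)
dropped$value
#> [1] "Hadley" "NZ"     "35"

# extra = "merge" keeps everything after the first separator.
merged <- separate(fields_df, x, into = c("key", "value"),
                   sep = ": ", extra = "merge")
merged$value
#> [1] "Hadley: ma" "NZ"         "35"

# str_split() with n = 2 behaves like the "merge" variant.
split_mat <- str_split(fields, ": ", n = 2, simplify = TRUE)
split_mat[, 2]
#> [1] "Hadley: ma" "NZ"         "35"
```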
Instead of a pattern, you can also split by character, line, sentence, or word using boundary().
Exploring the boundary types:

str_view_all(x, boundary("word"))
str_view_all(x, boundary("sentence"))
str_view_all(x, boundary("letter")) # raises an error: "letter" is not a valid type
#Error in match.arg(type) : 
#  'arg' should be one of “character”, “line_break”, “sentence”, “word”
str_view_all(x, boundary("character"))
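Since the snippet above does not show what `x` contains, here is a self-contained sketch of splitting on word boundaries. The input string `txt` is an assumption chosen for illustration.

```r
library(stringr)

# Hypothetical input; note the double space and the punctuation.
txt <- "This is a sentence.  This is another sentence."

# Splitting on a plain space leaves punctuation attached and
# produces an empty piece for the double space.
by_space <- str_split(txt, " ")[[1]]

# Splitting on word boundaries returns just the words.
by_word <- str_split(txt, boundary("word"))[[1]]
by_word
#> [1] "This"     "is"       "a"        "sentence" "This"     "is"
#> [7] "another"  "sentence"
```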

4.11 Finding matches

str_locate() and str_locate_all() give the start and end position of each match.
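A short sketch of what the two functions return, using a made-up input string. The positions can be passed on to str_sub() to extract or modify the matched text.

```r
library(stringr)

# Hypothetical input for illustration.
fruit_demo <- "banana"

# str_locate(): one row per input string, first match only.
first_hit <- str_locate(fruit_demo, "an")
first_hit
#>      start end
#> [1,]     2   3

# str_locate_all(): a list with one matrix per input string, all matches.
all_hits <- str_locate_all(fruit_demo, "an")[[1]]
all_hits
#>      start end
#> [1,]     2   3
#> [2,]     4   5

# The positions plug straight into str_sub().
piece <- str_sub(fruit_demo, first_hit[, "start"], first_hit[, "end"])
piece
#> [1] "an"
```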

5. Other types of pattern

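Arguments of regex()

ignore_case = TRUE ignores case.
multiline = TRUE lets ^ and $ match at the start and end of each line rather than of the whole string, so every line can match (I tested this myself).
comments = TRUE allows comments and whitespace inside the pattern.
dotall = TRUE lets . match every character, including \n.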
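The four arguments can be tried out one by one. The following is a minimal sketch; all input strings and variable names are invented for illustration.

```r
library(stringr)

# ignore_case: match regardless of upper or lower case.
bananas <- c("banana", "Banana", "BANANA")
case_sensitive <- str_detect(bananas, "banana")
case_insensitive <- str_detect(bananas, regex("banana", ignore_case = TRUE))

# multiline: ^ matches at the start of every line, not just the string.
lines_demo <- "Line 1\nLine 2\nLine 3"
single <- str_extract_all(lines_demo, "^Line")[[1]]
multi <- str_extract_all(lines_demo, regex("^Line", multiline = TRUE))[[1]]

# comments: whitespace is ignored and # starts a comment inside the pattern.
number_pattern <- regex("
  \\d{3}   # three digits
  -        # a literal hyphen
  \\d{4}   # four digits
", comments = TRUE)
number <- str_extract("call 555-1234 now", number_pattern)

# dotall: . also matches the newline character.
dot_default <- str_detect("a\nb", "a.b")
dot_all <- str_detect("a\nb", regex("a.b", dotall = TRUE))
```

Here `case_sensitive` is TRUE only for the first element while `case_insensitive` is TRUE for all three, `single` has one match while `multi` has three, `number` is "555-1234", and only `dot_all` is TRUE.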

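Besides regex(), there are three other functions for specifying a pattern:

fixed() matches the exact sequence of bytes, so special characters need no escaping.
coll() compares strings using standard collation rules, which is useful for locale-aware, case-insensitive matching.
boundary() matches boundaries (character, line_break, sentence, word).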
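A short sketch of how fixed() and coll() differ from an ordinary regular expression. The inputs are made up for illustration, and coll() is called with its default locale.

```r
library(stringr)

# Hypothetical inputs: one contains a literal dot, the other does not.
dotted <- c("a.b", "axb")

# As a regex, . matches any character, so both strings match.
as_regex <- str_detect(dotted, "a.b")
as_regex
#> [1]  TRUE  TRUE

# With fixed(), the dot is taken literally and needs no escaping.
as_fixed <- str_detect(dotted, fixed("a.b"))
as_fixed
#> [1]  TRUE FALSE

# coll() applies collation rules, here for case-insensitive comparison.
as_coll <- str_detect("I", coll("i", ignore_case = TRUE))
as_coll
#> [1] TRUE
```

fixed() is typically the faster choice for plain literal text, while coll() is slower but respects locale-specific comparison rules.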

6. Other uses of regular expressions

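apropos() searches all objects available from the global environment (so it can find functions by name).
dir() lists all the files in a directory; its pattern argument takes a regular expression that is matched against the file names.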
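Both functions are part of base R. The sketch below uses a temporary directory and invented file names so that it does not depend on any existing files.

```r
# apropos(): search attached packages for objects whose name matches a regex.
csv_readers <- apropos("^read\\.csv")
csv_readers

# dir(): list files whose names match a regex.
# A temporary directory with hypothetical files is created for the demo.
demo_dir <- tempfile("regex_demo_")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("a.Rmd", "b.txt", "c.Rmd")))

rmd_files <- dir(demo_dir, pattern = "\\.Rmd$")
rmd_files
#> [1] "a.Rmd" "c.Rmd"

# Clean up the temporary directory.
unlink(demo_dir, recursive = TRUE)
```

Note that `pattern` is a regular expression, not a shell glob, which is why the dot is escaped and the end of the name is anchored with $.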
