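Function:
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
Purpose: tests each element of a character vector against pattern and returns the indices of the elements that match (that is, the elements containing a matching substring).
Arguments:
perl: whether to use Perl-compatible regular expressions (PCRE); when FALSE, R's default extended regular expressions are used.
value: whether to return the matching elements themselves (TRUE) rather than their indices (FALSE).
fixed: when TRUE, pattern is matched as a literal string and not as a regular expression.
invert: when TRUE, returns the elements that do not match instead of those that do.
Function:
grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Purpose: like grep(), except that it returns a logical vector of the same length as x.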
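s <- c("Make", "MAKE", "jack", "June")
# Indices of the elements that match
s_grep <- grep("e\\b", s, perl = TRUE)
# The matching elements themselves
s_grep_value <- grep("e$", s, perl = TRUE, value = TRUE)
# A logical vector as long as s: TRUE where the element matches, FALSE otherwise
s_grepl <- grepl("e$", s, perl = TRUE)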
> s_grep
[1] 1 4
> s_grep_value
[1] "Make" "June"
> s_grepl
[1]  TRUE FALSE FALSE  TRUE
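The remaining arguments (fixed, ignore.case, invert) are not exercised above. A minimal sketch of their effect, using a made-up vector of file names (files and the result names are illustrative only):

```r
files <- c("a.txt", "abtxt", "B.TXT", "notes.md")

# As a regular expression, "." matches any single character
regex_hits <- grep("a.txt", files, value = TRUE)
# With fixed = TRUE, "." is a literal dot
fixed_hits <- grep("a.txt", files, value = TRUE, fixed = TRUE)
# ignore.case = TRUE matches regardless of letter case
nocase_hits <- grep("\\.txt$", files, value = TRUE, ignore.case = TRUE)
# invert = TRUE returns the elements that do not match
non_txt <- grep("\\.txt$", files, value = TRUE, ignore.case = TRUE, invert = TRUE)
```

Here regex_hits contains both "a.txt" and "abtxt", while fixed_hits contains only "a.txt".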
sub() replaces only the first match within each element of the character vector, whereas gsub() replaces every match within each element.
# Inside [...] the | is a literal character, not alternation, so [aeAE] is enough
s_sub <- sub("[aeAE]", replacement = "(R)", s, perl = TRUE)
s_gsub <- gsub("[aeAE]", replacement = "\\L(R)\\E", s, perl = TRUE)
> s_sub
[1] "M(R)ke" "M(R)KE" "j(R)ck" "Jun(R)"
> s_gsub
[1] "M(R)k(R)" "M(R)K(R)" "j(R)ck" "Jun(R)"
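To make the difference between a character class and alternation concrete, here is a small sketch on a made-up string that actually contains the | character (x and the result names are illustrative):

```r
x <- "a|b|e"

# Character class: a, e and the literal | are all replaced
class_all <- gsub("[ae|]", "-", x)
# Alternation: only a or e is replaced, | is left alone
alt_all <- gsub("a|e", "-", x)
# sub() stops after the first match
first_only <- sub("a|e", "-", x)
```

The three results are "--b--", "-|b|-" and "-|b|e" respectively.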
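Using the replacement argument: according to the documentation, when perl = TRUE the replacement may contain \U and \L for case conversion. When fixed = FALSE it may contain the back-references \1 to \9, which refer to the capture groups in pattern; \U and \L complement them by converting the case of the referenced groups, and \E ends the conversion.
For example: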
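text <- "a test of capitalizing"
text2 <- "useRs may fly into JFK of laGuardia"
# Capitalize the first letter of each word and leave the rest unchanged. The \\E after \\1
# limits the upper-casing to \\1; without it, every group after \\U would be upper-cased.
gsub("(\\w)(\\w*)", "\\U\\1\\E\\2", text, perl = TRUE)
# sub() only replaces the first match from the left
sub("(\\w)(\\w*)", "\\U\\1\\E\\2", text, perl = TRUE)
# Same effect as the gsub() call above, using a word-boundary anchor
gsub("\\b(\\w)", "\\U\\1", text, perl = TRUE)
# Upper-case the first and last letter of each word
gsub("(\\w)(\\w*)(\\w)", "\\U\\1\\E\\2\\U\\3", text2, perl = TRUE)
# sub() only processes the first matching word and ignores the rest
sub("(\\w)(\\w*)(\\w)", "\\U\\1\\E\\2\\U\\3", text2, perl = TRUE)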
[1] "A Test Of Capitalizing"
[1] "A test of capitalizing"
[1] "A Test Of Capitalizing"
[1] "UseRS MaY FlY IntO JFK OF LaGuardiA"
[1] "UseRS may fly into JFK of laGuardia"
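As a further illustration of \U, \L and \E acting on back-references, the following sketch rewrites made-up "name@host" strings (the input vector and result names are assumptions for the example):

```r
x <- c("bob@EXAMPLE", "alice@Site")

# Upper-case group 1, stop converting with \\E, then lower-case group 2
mixed <- gsub("(\\w+)@(\\w+)", "\\U\\1\\E at \\L\\2", x, perl = TRUE)
# Without \\E the upper-casing stays in effect for the second group as well
all_upper <- gsub("(\\w+)@(\\w+)", "\\U\\1-\\2", x, perl = TRUE)
```

mixed is "BOB at example" and "ALICE at site"; all_upper is "BOB-EXAMPLE" and "ALICE-SITE".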
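regexpr(): for a single string text, the result is the position at which the first match starts, or -1 if there is no match; it is an integer giving the character of the string at which the match begins.
For a character vector, that is, several strings, the result is an integer vector of the same length; each element gives the start of the first match in the corresponding string, or -1 if that string does not match.
The "match.length" attribute of the result gives the length of each match. When capture groups are used, the attributes "capture.start", "capture.length" and "capture.names" describe them.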
## Matches and capture groups
s_regexpr_1 <- regexpr("[ae]\\b", c("make", "maker"), perl = TRUE)
## Note the Python-style named capture group, (?P<name>...);
# regexpr() and gregexpr() only accept this naming syntax when perl = TRUE
s_regexpr_2 <- regexpr("(?P<first>\\w)e\\B", "you are a beautiful girl! ", perl = TRUE)
attr(s_regexpr_2, "match.length")
attr(s_regexpr_2, "capture.names")
# Examples
s_regexpr_3 <- regexpr("(\\w{1})e\\b", "hello-we-will, en we will", perl = TRUE)
s_regexpr_4 <- regexpr("(\\w{1})e\\B", "hello-we-will, en we will", perl = TRUE)
> s_regexpr_1
[1]  4 -1
attr(,"match.length")
[1]  1 -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
> s_regexpr_2
[1] 11
attr(,"match.length")
[1] 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
attr(,"capture.start")
     first
[1,]    11
attr(,"capture.length")
     first
[1,]     1
attr(,"capture.names")
[1] "first"
> s_regexpr_3
[1] 7
attr(,"match.length")
[1] 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
attr(,"capture.start")
[1,] 7
attr(,"capture.length")
[1,] 1
attr(,"capture.names")
[1] ""
> s_regexpr_4
[1] 1
attr(,"match.length")
[1] 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
attr(,"capture.start")
[1,] 1
attr(,"capture.length")
[1,] 1
attr(,"capture.names")
[1] ""
> attr(s_regexpr_2, "match.length")
[1] 2
> attr(s_regexpr_2, "capture.names")
[1] "first"
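The positions and lengths returned by regexpr() are usually a means to an end: getting the matched text. A minimal sketch of that step, using base R's regmatches() for the whole match and substring() with the capture attributes for the named group (x, m and the result names are illustrative):

```r
x <- "you are a beautiful girl! "
m <- regexpr("(?P<first>\\w)e\\B", x, perl = TRUE)

# Text of the whole match
whole <- regmatches(x, m)

# Text of the named capture group, rebuilt from its start and length
st <- attr(m, "capture.start")[1, "first"]
len <- attr(m, "capture.length")[1, "first"]
first_group <- substring(x, st, st + len - 1)
```

With the match starting at position 11, whole is "be" and first_group is "b".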
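Just as gsub() relates to sub(), gregexpr() returns the start positions of all matches in each string, as a list. Each list element has the same form as the result of a single regexpr() call, including attributes such as match.length.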
s_gregexpr0<-gregexpr("[ae]",c("make","maker"),perl = T)
s_gregexpr0
class(s_gregexpr0)
> s_gregexpr0
[[1]]
[1] 2 4
attr(,"match.length")
[1] 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[[2]]
[1] 2 4
attr(,"match.length")
[1] 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
> class(s_gregexpr0)
[1] "list"
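For a single string the result is also a list, holding the start position of every match, with the corresponding attributes attached to the list element: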
s_gregexpr1<-gregexpr('e',"hello-we-will, en we will",perl = T)
s_gregexpr1
> s_gregexpr1
[[1]]
[1]  2  8 16 20
attr(,"match.length")
[1] 1 1 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
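The difference between a single string and a character vector is this: for a single string the list has length 1 and its only element is a vector of positions; for a character vector the list has as many elements as the vector, and the length of each element depends on how many matches were found in the corresponding string: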
s_gregexpr2 <-gregexpr("[ae]\\b",c("make",'maker'),perl = T)
str(s_gregexpr2)
str(s_gregexpr1)
str(s_gregexpr0)
> s_gregexpr2
[[1]]
[1] 4
attr(,"match.length")
[1] 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[[2]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
> str(s_gregexpr2)
List of 2
 $ : int 4
  ..- attr(*, "match.length")= int 1
  ..- attr(*, "index.type")= chr "chars"
  ..- attr(*, "useBytes")= logi TRUE
 $ : int -1
  ..- attr(*, "match.length")= int -1
  ..- attr(*, "index.type")= chr "chars"
  ..- attr(*, "useBytes")= logi TRUE
> str(s_gregexpr1)
List of 1
 $ : int [1:4] 2 8 16 20
  ..- attr(*, "match.length")= int [1:4] 1 1 1 1
  ..- attr(*, "index.type")= chr "chars"
  ..- attr(*, "useBytes")= logi TRUE
> str(s_gregexpr0)
List of 2
 $ : int [1:2] 2 4
  ..- attr(*, "match.length")= int [1:2] 1 1
  ..- attr(*, "index.type")= chr "chars"
  ..- attr(*, "useBytes")= logi TRUE
 $ : int [1:2] 2 4
  ..- attr(*, "match.length")= int [1:2] 1 1
  ..- attr(*, "index.type")= chr "chars"
  ..- attr(*, "useBytes")= logi TRUE
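To turn the position lists from gregexpr() into the matched text, base R's regmatches() can be applied to the result. A short sketch on the same two words (the variable names are illustrative):

```r
x <- c("make", "maker")

# All vowels a/e in each string
found <- regmatches(x, gregexpr("[ae]", x, perl = TRUE))
# Only a/e at the end of a word; "maker" has none
found_end <- regmatches(x, gregexpr("[ae]\\b", x, perl = TRUE))
# Number of matches per string (a string without matches gives character(0))
n_matches <- lengths(found_end)
```

found is a list of two vectors c("a", "e"); found_end is list("e", character(0)), so a string without matches yields an empty vector rather than -1.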
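regexec() returns results similar to regexpr(), except that it also reports the start position and the length of each capture group. It returns a list of the same length as text. Each element is either -1 (no match) or an integer vector holding the start of the overall match followed by the starts of the substrings matched by the parenthesized subexpressions of pattern; its "match.length" attribute is a vector of the corresponding lengths (-1 for no match). Note that the first value of the integer vector refers to the match of the whole pattern, and only the remaining values correspond to the capture groups, from left to right.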
s1 <- "Test: A1 BC23 DEF456"
pattern <- "([[:alpha:]]+)([[:digit:]]+)"
# Match against s1 (not the vector s defined earlier)
s_regexec_1 <- regexec(pattern, s1)
s_regexec_1
> s_regexec_1
[[1]]
[1] 7 7 8
attr(,"match.length")
[1] 2 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
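regexec() only reports the first match in each string. One way to get the capture groups of every token, sketched here under the assumption that base R's regmatches() is acceptable, is to extract all matches with gregexpr() first and then run regexec() on each of them (tokens, parts and mat are illustrative names):

```r
s1 <- "Test: A1 BC23 DEF456"
pattern <- "([[:alpha:]]+)([[:digit:]]+)"

# First match only: whole match, then group 1, then group 2
first <- regmatches(s1, regexec(pattern, s1))

# All matches, then the groups of each match as one matrix row
tokens <- regmatches(s1, gregexpr(pattern, s1))[[1]]
parts <- regmatches(tokens, regexec(pattern, tokens))
mat <- do.call(rbind, parts)
```

first is list(c("A1", "A", "1")), and mat has one row per token: "A1", "BC23" and "DEF456", each followed by its letters and its digits.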
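The paste() and paste0() functions
Function:
paste(x1, x2, x3, sep = " ", collapse = NULL)
Explanation:
Joins the elements of x1, x2 and x3 position by position, recycling the shorter vectors to the length of the longest. sep is the separator placed between the joined elements. If collapse is not given, the result is a character vector with as many strings as the longest input; if collapse is given, those strings are in turn joined into one single string, separated by collapse.
paste0() has no sep argument (it behaves like paste() with sep = ""), only collapse, and is slightly more efficient.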
paste(1:13,c("A","B","C"),0:1,sep = "_")
paste(1:13,c("A","B","C"),0:1,sep = "_",collapse = "%")
paste0(1:13,"_",c("A","B"),"_",0:1)
 [1] "1_A_0"  "2_B_1"  "3_C_0"  "4_A_1"  "5_B_0"  "6_C_1"  "7_A_0"  "8_B_1"
 [9] "9_C_0"  "10_A_1" "11_B_0" "12_C_1" "13_A_0"
[1] "1_A_0%2_B_1%3_C_0%4_A_1%5_B_0%6_C_1%7_A_0%8_B_1%9_C_0%10_A_1%11_B_0%12_C_1%13_A_0"
 [1] "1_A_0"  "2_B_1"  "3_A_0"  "4_B_1"  "5_A_0"  "6_B_1"  "7_A_0"  "8_B_1"
 [9] "9_A_0"  "10_B_1" "11_A_0" "12_B_1" "13_A_0"
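A smaller sketch of the same three ideas (recycling, sep versus collapse, and paste0() as paste() with an empty separator), using made-up inputs:

```r
# The shorter vector c("a", "b") is recycled to length 4
labels <- paste(c("a", "b"), 1:4, sep = "-")
# collapse then joins the four strings into one
joined <- paste(c("a", "b"), 1:4, sep = "-", collapse = "+")
# paste0() gives the same result as paste() with sep = ""
same <- identical(paste0("x", 1:3), paste("x", 1:3, sep = ""))
```

labels is "a-1" "b-2" "a-3" "b-4", and joined is the single string "a-1+b-2+a-3+b-4".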
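Function:
strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
Arguments:
x: the character vector to split; each string is split separately.
split: also a character vector, whose elements specify the separators; they are treated as regular expressions unless fixed = TRUE. If split has length 0 (note: not NA), that is, no separator is given, each element of x is split into single characters. If its length is greater than 1, it is recycled along x.
The function returns a list of the same length as x; each list element holds the pieces of the corresponding string.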
s_split<-strsplit(c("a+b",'a_b',"a.b","a*b"),split = c("[\\+_\\.\\*]"))
s_split
> s_split
[[1]]
[1] "a" "b"
[[2]]
[1] "a" "b"
[[3]]
[1] "a" "b"
[[4]]
[1] "a" "b"
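The effect of an empty split, of fixed = TRUE, and of recycling split can be seen in a short sketch (the inputs are made up for illustration):

```r
# An empty separator splits into single characters
chars <- strsplit("abc", "")[[1]]
# As a regular expression "." matches every character, so only empty pieces remain
by_regex <- strsplit("a.b", ".")[[1]]
# With fixed = TRUE the dot is a literal separator
by_literal <- strsplit("a.b", ".", fixed = TRUE)[[1]]
# split is recycled along x: "-" for the first string, "_" for the second
recycled <- strsplit(c("a-b", "c_d"), c("-", "_"))
```

This is why the example above escapes the separators inside a character class; an unescaped "." or "*" would be interpreted as a regular expression.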
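Extension:
Unlike strsplit(), split() is mainly used to group data: it divides a vector or data frame into groups defined by a factor. A typical example:
n <- 10; nn <- 100
# Draw 1000 random integers between 0 and 10
g_factor <- factor(round(n * runif(n * nn)))
# Normally distributed noise plus a group-dependent shift
x_norm <- rnorm(n * nn) + sqrt(as.numeric(g_factor))
# Split the data by g_factor. What does that mean?
# x_norm and g_factor each hold 1000 values, paired by position.
# g_factor is a factor whose levels are the distinct integers drawn (0 to 10, so up to 11 levels),
# and each x_norm value can be assigned to the group of the g_factor value paired with it.
x_group <- split(x_norm, g_factor)
xg_data <- data.frame(x = x_norm, group = g_factor)
# Level "1" is the second level (after "0"), hence x_group[[2]]
identical(xg_data[xg_data$group == 1, 1], x_group[[2]])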
[1] TRUE
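Because the example above is random, a tiny deterministic sketch may make the grouping easier to follow (values, groups and the result names are made up):

```r
values <- c(10, 20, 30, 40, 50)
groups <- factor(c("a", "b", "a", "b", "a"))

# One list element per factor level, holding the values paired with that level
by_group <- split(values, groups)
# A typical next step: summarise each group
group_means <- sapply(by_group, mean)
# unsplit() reverses the operation and restores the original order
restored <- unsplit(by_group, groups)
```

by_group is list(a = c(10, 30, 50), b = c(20, 40)), and both group means are 30.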
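sen1 holds 20 sentences. Note the difference between operating on whole strings and operating on words: ^ and $ anchor the start and end of the whole string, whereas \b marks a word boundary. Also, str_subset() only returns the whole strings that contain a match; it does not narrow the result down to the text matched by the regular expression. str_extract() and str_extract_all() go one step further and pull out exactly the text that the regular expression matches.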
library(stringr)
sen1<-sentences[1:20]
first_word<-str_extract(sen1,"\\b\\w*\\b")
# or
first_word1<-str_extract(sen1,"^[A-Z]\\w*\\b")
# or
first_word2<-str_extract(sen1,"^\\w*\\b")
> sen1
 [1] "The birch canoe slid on the smooth planks." "Glue the sheet to the dark blue background."
 [3] "It's easy to tell the depth of a well." "These days a chicken leg is a rare dish."
 [5] "Rice is often served in round bowls." "The juice of lemons makes fine punch."
 [7] "The box was thrown beside the parked truck." "The hogs were fed chopped corn and garbage."
 [9] "Four hours of steady work faced us." "Large size in stockings is hard to sell."
[11] "The boy was there when the sun rose." "A rod is used to catch pink salmon."
[13] "The source of the huge river is the clear spring." "Kick the ball straight and follow through."
[15] "Help the woman get back to her feet." "A pot of tea helps to pass the evening."
[17] "Smoky fires lack flame and heat." "The soft cushion broke the man's fall."
[19] "The salt breeze came across from the sea." "The girl at the booth sold fifty bonds."
> first_word
 [1] "The" "Glue" "It" "These" "Rice" "The" "The" "The" "Four" "Large" "The" "A" "The" "Kick"
[15] "Help" "A" "Smoky" "The" "The" "The"
> first_word1
 [1] "The" "Glue" "It" "These" "Rice" "The" "The" "The" "Four" "Large" "The" "A" "The" "Kick"
[15] "Help" "A" "Smoky" "The" "The" "The"
> first_word2
 [1] "The" "Glue" "It" "These" "Rice" "The" "The" "The" "Four" "Large" "The" "A" "The" "Kick"
[15] "Help" "A" "Smoky" "The" "The" "The"
sen_ing <- str_subset(sentences, "\\b\\w*?ing\\b")
word_ing <- str_extract(sen_ing, "\\b\\w*?ing\\b")
word_ing_all <- str_extract_all(sen_ing, "\\b\\w*?ing\\b")
# Check whether the two functions return the same words
if (identical(word_ing, unlist(word_ing_all))) {
cat("the element of the two word list is same!")
} else {
cat("they are different!")
}
## Question: can a single sentence contain two words that qualify?
## (the count must use \\b rather than ^, otherwise it can never exceed 1)
sum(str_count(sentences, "\\b\\w*?ing\\b") > 1)
## If so, extract those sentences
sentences[str_count(sentences, "\\b\\w*?ing\\b") > 1]
the element of the two word list is same!
[1] 0
character(0)
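Since no sentence in the data set contains two matching words, the difference between str_subset(), str_extract() and str_extract_all() does not show up above. A sketch on three made-up strings, one of which has two -ing words (x and the result names are illustrative):

```r
library(stringr)

x <- c("singing and dancing", "no match here", "a ring")
pat <- "\\b\\w*?ing\\b"

kept <- str_subset(x, pat)        # whole strings that contain a match
first <- str_extract(x, pat)      # first match per string, NA if none
every <- str_extract_all(x, pat)  # all matches per string, as a list
counts <- str_count(x, pat)       # number of matches per string
```

Here first has one entry per input string (with NA for the string without a match), whereas unlist(every) has one entry per match, so the two would no longer be identical.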
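# Core question: how to match a number word. This is fairly involved: can every number word be matched?
# Caveats of the first pattern below: [...] is a character class, not a group, so for example
# four[(teen)]? matches "four" plus at most one of the characters ( t e n ), not "fourteen";
# and since the number word is not preceded by \\b, "ten" is also found inside "often".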
math_num<-"(one|tw[(o)|(e)|(lve)]{1}|four[(teen)]?|forty|five|fift[(een)|(y)]{1}|
six[(teen)|(ty)]?|seven[(teen)|(ty)]?|eight[(een)|(y)]?|nine[(teen)|(ty)]?|ten|eleven|twelve|three|thirteen|thirty)\\b (\\b\\w*\\b)"
math_num1<-"(one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve) (\\b\\w*\\b)"
match_matrix<-str_match(sentences,math_num)
match_matrix[!is.na(match_matrix[,1]),]
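### str_match() only returns the first match per string; str_match_all() returns all of them. The tricky part is telling which strings matched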
match_matrix_all<-str_match_all(sentences,math_num)
match_true_index<-lapply(match_matrix_all,function(x){!is.na(x[1])})
match_true_index<-unlist(match_true_index)
match_matrix_all_true<-match_matrix_all[match_true_index]
### Extract the matched phrase from each element
## sapply() and lapply() differ here: sapply() returns a vector, lapply() returns a list
sapply(match_matrix_all_true,function(x){x[1]})
unlist(lapply(match_matrix_all_true,function(x){x[1]}))
> match_matrix[!is.na(match_matrix[,1]),]
      [,1] [,2] [,3]
 [1,] "ten served" "ten" "served"
 [2,] "fifty bonds" "fifty" "bonds"
 [3,] "one over" "one" "over"
 [4,] "seven books" "seven" "books"
 [5,] "two met" "two" "met"
 [6,] "two factors" "two" "factors"
 [7,] "one and" "one" "and"
 [8,] "three lists" "three" "lists"
 [9,] "thirty times" "thirty" "times"
[10,] "seven is" "seven" "is"
[11,] "two when" "two" "when"
[12,] "one floor" "one" "floor"
[13,] "ten inches" "ten" "inches"
[14,] "one with" "one" "with"
[15,] "one war" "one" "war"
[16,] "one button" "one" "button"
[17,] "ten years" "ten" "years"
[18,] "one in" "one" "in"
[19,] "ten chased" "ten" "chased"
[20,] "one like" "one" "like"
[21,] "two shares" "two" "shares"
[22,] "two distinct" "two" "distinct"
[23,] "one costs" "one" "costs"
[24,] "ten two" "ten" "two"
[25,] "thirty cents" "thirty" "cents"
[26,] "five robins" "five" "robins"
[27,] "four kinds" "four" "kinds"
[28,] "one rang" "one" "rang"
[29,] "ten him" "ten" "him"
[30,] "three story" "three" "story"
[31,] "ten by" "ten" "by"
[32,] "one wall" "one" "wall"
[33,] "three inches" "three" "inches"
[34,] "ten your" "ten" "your"
[35,] "ten than" "ten" "than"
[36,] "one before" "one" "before"
[37,] "three batches" "three" "batches"
[38,] "two leaves" "two" "leaves"
> sapply(match_matrix_all_true,function(x){x[1]})
 [1] "ten served" "fifty bonds" "one over" "seven books" "two met" "two factors" "one and"
 [8] "three lists" "thirty times" "seven is" "two when" "one floor" "ten inches" "one with"
[15] "one war" "one button" "ten years" "one in" "ten chased" "one like" "two shares"
[22] "two distinct" "one costs" "ten two" "thirty cents" "five robins" "four kinds" "one rang"
[29] "ten him" "three story" "ten by" "one wall" "three inches" "ten your" "ten than"
[36] "one before" "three batches" "two leaves"
> unlist(lapply(match_matrix_all_true,function(x){x[1]}))
 [1] "ten served" "fifty bonds" "one over" "seven books" "two met" "two factors" "one and"
 [8] "three lists" "thirty times" "seven is" "two when" "one floor" "ten inches" "one with"
[15] "one war" "one button" "ten years" "one in" "ten chased" "one like" "two shares"
[22] "two distinct" "one costs" "ten two" "thirty cents" "five robins" "four kinds" "one rang"
[29] "ten him" "three story" "ten by" "one wall" "three inches" "ten your" "ten than"
[36] "one before" "three batches" "two leaves"
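Rows such as "ten served" arise because "ten" is matched inside another word ("often served"). One possible tightening, offered only as a sketch and not as the pattern used above, builds the alternatives from real groups and anchors the number word with \\b on both sides. The names units, teens, tens and num_word, and the three sample strings, are assumptions for this example:

```r
library(stringr)

# Number words from one to ninety, built from non-capturing groups (?:...)
units <- "one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve"
teens <- "(?:thir|four|fif|six|seven|eigh|nine)teen"
tens <- "(?:twen|thir|for|fif|six|seven|eigh|nine)ty"
# \\b on both sides keeps the number word from matching inside another word
num_word <- paste0("\\b(", teens, "|", tens, "|", units, ")\\b (\\w+)")

x <- c("Rice is often served in round bowls.",
       "The lease ran out in sixteen weeks.",
       "The two met while playing.")
m <- str_match(x, num_word)
```

With this pattern the first string gives a row of NA, the second gives "sixteen weeks" with groups "sixteen" and "weeks", and the third gives "two met". Applied to sentences, the result would therefore differ from the matrix shown above.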
match_ds<-"\\b\\w*?'\\w*?\\b"
example_s<-"it's my favorite. and I'd like it!"
str_extract_all(example_s,match_ds)
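## The result differs from what we expected: only the part up to the apostrophe is matched and the letters after it are missing. Why?
## Because the \\w* after the apostrophe is followed by ?, which makes it lazy (non-greedy),
## so it matches as few characters as possible, here zero, and nothing after the apostrophe is included.
## Replacing * with + forces at least one character, which works as shown below: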
match_ds1<-"\\b\\w*?'\\w+?\\b"
str_extract_all(example_s,match_ds1)
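## So why use lazy matching with ? at all? Consider extracting the text between an "a" and a "d",
## as follows
str_extract_all("asdsffdfgssafgd","a.*d")
# The greedy version matches as much as possible and swallows the shorter a...d runs inside; adding ? gives the minimal, lazy match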
str_extract_all("asdsffdfgssafgd","a.*?d")
> str_extract_all(example_s, match_ds)
[[1]]
[1] "it'" "I'"
> str_extract_all(example_s, match_ds1)
[[1]]
[1] "it's" "I'd"
> str_extract_all("asdsffdfgssafgd", "a.*d")
[[1]]
[1] "asdsffdfgssafgd"
> str_extract_all("asdsffdfgssafgd", "a.*?d")
[[1]]
[1] "asd"  "afgd"
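Functions:
str_replace(string, pattern, replacement)
str_replace_all(string, pattern, replacement)
Arguments:
replacement is the string to substitute. It can be a single string, in which case the text matched by pattern in every element of string is replaced with it; or a character vector whose length matches that of pattern or string.
Alternatively, to make several different replacements within each string, pass a named vector of the form:
c(pattern1 = replacement1, pattern2 = replacement2)
Functions can also be supplied for replacement and pattern:
For pattern, see the helper functions regex(), fixed() and so on: they wrap the pattern and control how it is interpreted, and their return value is what is used for matching.
For replacement, the function is applied to each match and its return value replaces that match. To replace with NA, specify replacement = NA_character_.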
# Example 1: basic replacement, then passing a function
fruits <- c("one apple", "two pears", "three bananas")
str_replace(fruits, "[aeiou]", "-")
str_replace_all(fruits, "[aeiou]", "+")
## Use toupper() to convert the matched characters to upper case
str_replace_all(fruits, "[aeiou]", toupper)
## Or use a custom function that treats different characters differently
## (written with ifelse() so that it also works when it receives a vector of matches)
str_replace_all(fruits, "[aeiou]",
function(x) ifelse(x == "a", toupper(x), "*"))
# Or define the function first and pass it by name
to_handle <- function(x) {
ifelse(x == "a", toupper(x),
ifelse(x == "e", "*", paste0("[", x, "]")))
}
str_replace_all(fruits, "[aeiou]", to_handle)
> str_replace(fruits, "[aeiou]", "-")
[1] "-ne apple" "tw- pears" "thr-e bananas"
> str_replace_all(fruits, "[aeiou]", "+")
[1] "+n+ +ppl+" "tw+ p++rs" "thr++ b+n+n+s"
> str_replace_all(fruits, "[aeiou]", toupper)
[1] "OnE ApplE" "twO pEArs" "thrEE bAnAnAs"
# result of the call with the anonymous function
[1] "*n* Appl*" "tw* p*Ars" "thr** bAnAnAs"
> str_replace_all(fruits, "[aeiou]", to_handle)
[1] "[o]n* Appl*" "tw[o] p*Ars" "thr** bAnAnAs"
# Example 2: pass a character vector as long as string; the replacements are applied element by element
str_replace_all(fruits, "[aeiou]", c("1", "2", "[3_m]"))
[1] "1n1 1ppl1" "tw2 p22rs" "thr[3_m][3_m] b[3_m]n[3_m]n[3_m]s"
# Example 3: back-references; the call below doubles the first matched character
str_replace(fruits, "([aeiou])", "\\1\\1")
# Or swap the first and last letter of each word; \\0 stands for the text matched by the whole regular expression
str_replace_all(fruits, "(\\b\\w)(\\w*?)(\\w\\b)", "\\0_\\3\\2\\1")
> str_replace(fruits, "([aeiou])", "\\1\\1")
[1] "oone apple" "twoo pears" "threee bananas"
> str_replace_all(fruits, "(\\b\\w)(\\w*?)(\\w\\b)", "\\0_\\3\\2\\1")
[1] "one_eno apple_eppla" "two_owt pears_searp" "three_ehret bananas_sananab"
# Example 4: pass a named vector to make several pattern-specific replacements within each string
str_replace_all(fruits, c("a" = "1", "b" = "2", "c" = "3"))
> str_replace_all(fruits, c("a" = "1", "b" = "2", "c" = "3"))
[1] "one 1pple" "two pe1rs" "three 21n1n1s"
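One detail of the named-vector form is worth a sketch: as far as I know, the pattern/replacement pairs are applied one after another, so a later pair also sees the text produced by an earlier one. The inputs below are made up for illustration:

```r
library(stringr)

# Independent pairs: each letter is simply replaced
multi <- str_replace_all("one apple", c("o" = "0", "e" = "3"))
# Overlapping pairs: "a" becomes "b" first, and then every "b" becomes "c"
chained <- str_replace_all("abc", c("a" = "b", "b" = "c"))
```

multi is "0n3 appl3", while chained ends up as "ccc" rather than "bcc", so a named vector is not a safe way to swap two strings.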
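Functions:
str_split(string, pattern, n = Inf, simplify = FALSE)
str_split_fixed(string, pattern, n)
Purpose:
Splits string at the text matched by pattern; the matched text itself is not included in the result.
n is the number of pieces to split string into; by default it is split into as many pieces as possible.
str_split_fixed() returns a character matrix with n columns; if n exceeds the number of available pieces, the extra columns are filled with empty strings.
simplify controls whether str_split() returns a list of character vectors or a matrix; the default, FALSE, returns a list of vectors.
str_split("a|b|c|d|e", pattern = "\\|")
# Split into two pieces and return a matrix
str_split("a|b|c|d|e", pattern = "\\|", n = 2, simplify = TRUE)
# Ask for more pieces than the string can be split into
str_split("a|b|c|d|e", pattern = "\\|", n = 6, simplify = TRUE)
# Besides a pattern, strings can also be split at character, line, sentence and word boundaries;
# see boundary(): different values of type detect different kinds of boundary
# boundary(type = c("character", "line_break", "sentence", "word"),
# skip_word_none = NA, ...)
head(sentences, n = 3)
str_split(head(sentences, n = 3), pattern = boundary(type = "word"))
# boundary(type = "character") splits into single characters; spaces count as well
str_split(string = "It's a good one!", boundary("character"))
# The empty string "" is equivalent to boundary("character") and splits into single characters
str_split(string = "It's a good one!", "")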
> str_split("a|b|c|d|e", pattern = "\\|")
[[1]]
[1] "a" "b" "c" "d" "e"
> str_split("a|b|c|d|e", pattern = "\\|", n = 2, simplify = TRUE)
     [,1] [,2]
[1,] "a"  "b|c|d|e"
> str_split("a|b|c|d|e", pattern = "\\|", n = 6, simplify = TRUE)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,] "a"  "b"  "c"  "d"  "e"  ""
> head(sentences, n = 3)
[1] "The birch canoe slid on the smooth planks." "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."
> str_split(head(sentences, n = 3), pattern = boundary(type = "word"))
[[1]]
[1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks"
[[2]]
[1] "Glue" "the" "sheet" "to" "the" "dark" "blue" "background"
[[3]]
[1] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well"
> str_split(string = "It's a good one!", boundary("character"))
[[1]]
[1] "I" "t" "'" "s" " " "a" " " "g" "o" "o" "d" " " "o" "n" "e" "!"
> str_split(string = "It's a good one!", "")
[[1]]
[1] "I" "t" "'" "s" " " "a" " " "g" "o" "o" "d" " " "o" "n" "e" "!"
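str_split_fixed() is described above but not demonstrated. A minimal sketch with made-up inputs shows the fixed number of columns and the padding with empty strings:

```r
library(stringr)

# n smaller than the number of pieces: the last column keeps the remainder
two <- str_split_fixed("a|b|c", "\\|", n = 2)
# n larger than the number of pieces: extra columns are empty strings
five <- str_split_fixed("a|b|c", "\\|", n = 5)
# One row per input string, so strings with different piece counts still fit
many <- str_split_fixed(c("a|b", "c|d|e"), "\\|", n = 3)
```

two is a 1 x 2 matrix holding "a" and "b|c"; many is a 2 x 3 matrix whose first row ends in "".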
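Functions:
str_locate(string, pattern)
str_locate_all(string, pattern)
Purpose:
Locates the start and end positions of pattern and returns them as indices.
str_locate() returns an integer matrix: the first column is the start position, the second the end position.
str_locate_all() returns a list of integer matrices, one per input string, each holding all matches in that string.
# Example 1: match the end of the line with $. Note that end is smaller than start: $ matches the empty
# position just after the last character, so start = nchar(string) + 1, and because the match has zero length, end = start - 1 = nchar(string)
str_locate(sentences[1],pattern = "$")
# Example 2: the empty string "" locates every single character
str_locate_all(sen1[1],"")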
> str_locate(sentences[1], pattern = "$")
     start end
[1,]    43  42
> str_locate_all(sen1[1], "")
[[1]]
start end
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
[5,] 5 5
[6,] 6 6
[7,] 7 7
[8,] 8 8
[9,] 9 9
[10,] 10 10
[11,] 11 11
[12,] 12 12
[13,] 13 13
[14,] 14 14
[15,] 15 15
[16,] 16 16
[17,] 17 17
[18,] 18 18
[19,] 19 19
[20,] 20 20
[21,] 21 21
[22,] 22 22
[23,] 23 23
[24,] 24 24
[25,] 25 25
[26,] 26 26
[27,] 27 27
[28,] 28 28
[29,] 29 29
[30,] 30 30
[31,] 31 31
[32,] 32 32
[33,] 33 33
[34,] 34 34
[35,] 35 35
[36,] 36 36
[37,] 37 37
[38,] 38 38
[39,] 39 39
[40,] 40 40
[41,] 41 41
[42,] 42 42
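The positions returned by str_locate() can be fed straight into str_sub() to pull out the matched text. A short sketch on a made-up string (x and the result names are illustrative):

```r
library(stringr)

x <- "banana"

# First match: a one-row matrix with columns start and end
loc <- str_locate(x, "an")
piece <- str_sub(x, loc[, "start"], loc[, "end"])
# All matches: one row per match
all_locs <- str_locate_all(x, "an")[[1]]
# A zero-length match such as $ has end = start - 1
empty <- str_locate(x, "$")
```

For the six-character string, empty has start 7 and end 6, the same relationship as 43 and 42 in Example 1 above.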
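When one of these functions is called with a plain character string as the pattern argument, the string is wrapped in regex() internally, as in the example above:
str_locate(sentences[1],pattern = "$")
# is effectively equivalent to
str_locate(sentences[1],pattern = regex("$"))
In that sense regex() is comparable to re.compile() in Python: it sets the matching options and produces something like Python's Pattern object, which is then used to match against the string. Similar functions that control the matching mode are:
fixed(pattern, ignore_case = FALSE)
coll(pattern, ignore_case = FALSE, locale = "en", ...)
regex(pattern, ignore_case = FALSE, multiline = FALSE, comments = FALSE, dotall = FALSE, ...)
boundary(type = c("character", "line_break", "sentence", "word"), skip_word_none = NA, ...)
Where:
fixed() compares the bytes of the strings directly: a literal comparison, not regular expression matching;
coll() compares using collation rules; for example, locale sets the language whose rules are applied;
regex() uses ICU regular expressions by default;
ignore_case: whether to ignore letter case;
multiline: whether to use multi-line mode; TRUE means ^ and $ match the start and end of each line, FALSE means they only match the start and end of the whole string;
comments: whether comments are allowed inside the pattern; when TRUE, whitespace and everything from # to the end of the line are ignored;
dotall: whether . also matches line terminators such as \n;
Examples: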
str_extract_all("a\nb\nc", "a.")
str_extract_all("a\nb\nc", regex("a.", dotall = TRUE))
str_extract_all("The Cat in the Hat", "[a-z]+")
str_extract_all("The Cat in the Hat", regex("[a-z]+", ignore_case =TRUE))
> str_extract_all("a\nb\nc", "a.")
[[1]]
character(0)
> str_extract_all("a\nb\nc", regex("a.", dotall = TRUE))
[[1]]
[1] "a\n"
> str_extract_all("The Cat in the Hat", "[a-z]+")
[[1]]
[1] "he"  "at"  "in"  "the" "at"
> str_extract_all("The Cat in the Hat", regex("[a-z]+", ignore_case = TRUE))
[[1]]
[1] "The" "Cat" "in"  "the" "Hat"
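The examples above cover dotall and ignore_case. The sketch below, with made-up inputs, illustrates the remaining options described earlier: multiline, fixed() and comments (the variable names are illustrative):

```r
library(stringr)

x <- "a\nb\nc"
# Without multiline, ^ only matches at the start of the whole string
whole <- str_extract_all(x, "^.")[[1]]
# With multiline = TRUE, ^ matches at the start of every line
per_line <- str_extract_all(x, regex("^.", multiline = TRUE))[[1]]

# As a regex "." matches every character; fixed(".") matches only the literal dot
n_regex <- str_count("a.b", ".")
n_fixed <- str_count("a.b", fixed("."))

# comments = TRUE: whitespace and # comments inside the pattern are ignored
commented <- regex("
  \\d+   # one or more digits
  [a-z]  # followed by one lower-case letter
", comments = TRUE)
hit <- str_extract("id 42x", commented)
```

whole is "a", per_line is "a" "b" "c", the two counts are 3 and 1, and hit is "42x".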
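Functions:
apropos(what, where = FALSE, ignore.case = TRUE, mode = "any")
find(what, mode = "any", numeric = FALSE, simple.words = TRUE)
Purpose:
Both look along the search path (not only the global environment) for objects whose names match what;
simple.words plays a role similar to fixed(): when TRUE, what is matched as a whole object name rather than as a general regular expression;
find() returns the names of the environments or packages on the search path in which the object is found;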
apropos("replace")
find("replace")
> apropos("replace")
[1] ".rs.registerReplaceHook" ".rs.replaceBinding" ".rs.rpc.replace_comment_header"
[4] "replace" "setReplaceMethod" "str_replace"
[7] "str_replace_all" "str_replace_na"
> find("replace")
[1] "package:base"
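Because what is treated as a regular expression by apropos(), anchors can be used to narrow the search, and mode restricts the kind of object. A small sketch using only base R objects (the result names are illustrative, and the exact contents depend on which packages are attached):

```r
# Object names that start with "paste"
hits <- apropos("^paste")
# Where the object called "paste" lives on the search path
where <- find("paste")
# Only functions whose names start with "str"
funs <- apropos("^str", mode = "function")
```

In a plain R session hits includes "paste" and "paste0", where includes "package:base", and funs includes "strsplit".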