Online book:
R for Data Science
GitHub repository: https://github.com/hadley/r4ds
II Data Wrangle
Data wrangling consists of three steps: import, tidy, and transform.
This chapter covers string manipulation, mainly with the stringr package.
library(tidyverse)
library(stringr)
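14.2 String basics
- R accepts text wrapped in either double quotes " " or single quotes ' ' as a string; the two forms behave identically.
- A string needs both its opening and closing quote. If the closing quote is missing, the command cannot run and the console shows the continuation prompt + on the next line. Press Esc to cancel and retype the command.
- To include a literal single or double quote inside a string, escape it with \:
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
Alternatively, avoid the escape by using the other kind of quote on the outside: write ' inside " ", and " inside ' '.
- A single \ in a string is consumed as the escape character, so a literal backslash has to be written as "\\".
- The printed representation of a string shows the escapes, so it differs from the string itself. Use writeLines() to see the raw contents.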
x <- c("\"", "\\")
x
#> [1] "\"" "\\"
writeLines(x)
#> "
#> \
- Special characters:
\n : newline
\t : tab
\r : carriage return
\b : backspace
\a : alert (bell)
\f : form feed
\v : vertical tab
\\ : backslash \
\' : ASCII apostrophe '
\" : ASCII quotation mark "
\` : ASCII grave accent (backtick) `
\nnn : character with given octal code (1, 2 or 3 digits)
\xnn : character with given hex code (1 or 2 hex digits)
\unnnn : Unicode character with given code (1--4 hex digits)
\Unnnnnnnn : Unicode character with given code (1--8 hex digits)
Run ?'"' or ?"'" to open the help page on these escapes.
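# How control characters are displayed depends on the console; the outputs below are what the author's console showed.
a <- "abc\\efg\r12456" # "\r" is a carriage return, "\\" is a literal \
a
# "abc\\efg\r12456"
writeLines(a) # the cursor returns to the start of the line, so "12456" overwrites the first five characters and the rest remains
# 12456fg
a <- "abc\\efg\b12456" # "\b" is a backspace: the character before it is overwritten
writeLines(a)
# abc\ef12456
a <- "abc\\efg\a12456" # "\a" is an alert (bell); here it was displayed as a � placeholder
writeLines(a)
# abc\efg�12456
a <- "abc\\efg\f12456" # "\f" is a form feed; here the text before it was cleared, leaving only "12456"
writeLines(a)
# 12456
a <- "abc\\efg\v123456" # "\v" is a vertical tab; here it was also displayed as a � placeholder
writeLines(a)
# abc\efg�123456
a <- "abc\\efg\12456" # "\124" is read as an octal character code (T); the trailing "56" is ordinary text
writeLines(a)
# abc\efgT56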
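- Base R also has many string functions, but they are often inconsistent with one another, so these notes use only stringr, whose functions have more intuitive names. Every stringr function starts with the prefix str_, so typing str_ triggers autocomplete (for example in RStudio) and lists all of them to choose from.
- str_length() returns the number of characters in a string.
- str_c() combines strings; the sep argument sets the separator.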
str_c("x", "y")
#> [1] "xy"
str_c("x", "y", sep = ", ")
#> [1] "x, y"
- str_c() is vectorised and recycles shorter vectors to the length of the longest:
str_c("prefix-", c("a", "b", "c"), "-suffix")
#> [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"
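- Objects of length 0, such as the NULL returned by an if whose condition is FALSE, are silently dropped by str_c(). This is particularly useful in combination with if: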
name <- "Hadley"
time_of_day <- "morning"
birthday <- FALSE
str_c(
"Good ", time_of_day, " ", name,
if (birthday) " and HAPPY BIRTHDAY",
"."
)
#> [1] "Good morning Hadley."
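- Missing values are contagious in str_c(). str_replace_na() turns NA into the literal string "NA" so that it can be combined like any other string.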
x <- c("abc", NA)
str_c("|-", x, "-|")
#> [1] "|-abc-|" NA
str_c("|-", str_replace_na(x), "-|")
#> [1] "|-abc-|" "|-NA-|"
- To collapse a character vector into a single string, use the collapse argument of str_c().
str_c(c("x", "y", "z"), collapse = ", ")
#> [1] "x, y, z"
- Subsetting strings
-- str_sub() takes start and end arguments that give the (inclusive) position of the substring.
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
#> [1] "App" "Ban" "Pea"
# negative numbers count backwards from end
str_sub(x, -3, -1)
#> [1] "ple" "ana" "ear"
-- str_sub() does not fail when the string is too short; it returns as much as it can.
str_sub("a", 1, 5)
#> [1] "a"
-- The assignment form of str_sub() modifies the selected part of the string in place.
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
#> [1] "apple" "banana" "pear"
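-- str_to_lower() converts to lower case.
-- str_to_upper() converts to upper case.
-- str_to_title() converts to title case: the first letter of every word is capitalised.
-- str_to_sentence() converts to sentence case: only the first letter of each sentence is capitalised.
- Locales
Different languages have different rules for changing case and for sorting, so to get the same result on computers set up for different regions, specify the locale = argument.
The locale is given as an ISO 639 language code, a two- or three-letter abbreviation.
The base R order() and sort() functions use the current locale of the computer. When the result must be the same on every machine, use str_sort() and str_order(), which take a locale = argument.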
x <- c("apple", "eggplant", "banana")
str_sort(x, locale = "en") # English
#> [1] "apple" "banana" "eggplant"
str_sort(x, locale = "haw") # Hawaiian
#> [1] "apple" "eggplant" "banana"
-- en: English
-- zh: Chinese
-- fr: French
-- ja: Japanese
-- de: German
-- es: Spanish
......
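Locale affects case conversion as well as sorting. A minimal sketch, borrowing the Turkish example from the book (Turkish has both a dotted and a dotless i); the printed results assume a UTF-8 console:

```r
library(stringr)

# Default (English) rules: both forms of i become a plain "I".
str_to_upper(c("i", "\u0131"))
#> [1] "I" "I"

# Turkish rules: the dotted i becomes the dotted capital İ (U+0130).
str_to_upper(c("i", "\u0131"), locale = "tr")
#> [1] "İ" "I"
```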
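str_wrap()
# Each input string is treated as one or more paragraphs (separated by lines that contain only whitespace). Each paragraph is re-wrapped according to width, indent and exdent, and the result is returned with line breaks inserted. width is the target line width, indent is the indentation of the first line of each paragraph, and exdent is the indentation of the remaining lines (hanging indent).
str_wrap(string, width = 80, indent = 0, exdent = 0)
str_trim() # removes whitespace from the start and end of a string
str_trim(string, side = c("both", "left", "right"))
str_squish(string) # also collapses repeated whitespace inside the string into single spaces
str_pad() # pads a string to a given width, with spaces by default, on the left, the right or both sides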
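The whitespace helpers above are listed without examples, so here is a minimal sketch of each. The input strings are made up for illustration:

```r
library(stringr)

# str_trim() removes whitespace at the start and end only.
str_trim("  padded text  ")
#> [1] "padded text"

# str_squish() also collapses runs of whitespace inside the string.
str_squish("  too   many    spaces  ")
#> [1] "too many spaces"

# str_pad() pads to a minimum width; side defaults to "left".
str_pad("7", width = 3, pad = "0")
#> [1] "007"

# str_wrap() returns one string per input, with "\n" inserted so that
# lines fit within `width`; writeLines() shows the wrapped result.
wrapped <- str_wrap("The quick brown fox jumps over the lazy dog", width = 20)
writeLines(wrapped)
```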
Exercise:
Write a function that turns (e.g.) a vector c("a", "b", "c") into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.
str_commasep <- function(x, delim = ",") {
  n <- length(x)
  x <- str_replace_na(x)
  if (n == 0) {
    ""
  } else if (n == 1) {
    x
  } else if (n == 2) {
    # no comma before "and" when n == 2
    str_c(x[[1]], "and", x[[2]], sep = " ")
  } else {
    # commas after all n - 1 elements
    not_last <- str_c(x[seq_len(n - 1)], delim)
    # prepend "and" to the last element
    last <- str_c("and", x[[n]], sep = " ")
    # combine parts with spaces
    str_c(c(not_last, last), collapse = " ")
  }
}
str_commasep("")
#> [1] ""
str_commasep("a")
#> [1] "a"
str_commasep(c("a", "b"))
#> [1] "a and b"
str_commasep(c("a", "b", "c"))
#> [1] "a, b, and c"
str_commasep(c("a", "b", "c", "d"))
#> [1] "a, b, c, and d"
## Author: Richard_Zhou
## Link: https://www.jianshu.com/p/4790b00dc238
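14.3 Matching patterns with regular expressions
- str_view() shows how a pattern matches a string.
- "." matches any character (except a newline).
- To match a literal ., escape it with \ so that it loses its special meaning. The catch is that \ is also the escape character in strings, so the regular expression \. has to be written as the string "\\.".
- To match a literal \, write "\\\\": the string "\\\\" contains the two characters \\, which the regular expression engine then reads as an escaped backslash.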
x <- "a\\b"
writeLines(x)
#> a\b
str_view(x, "\\\\")
- ^ matches the start of the string.
- $ matches the end of the string.
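A short sketch of the two anchors, using a small example vector. Anchoring both ends forces the pattern to match the whole string:

```r
library(stringr)

x <- c("apple pie", "apple", "apple cake")

# Without anchors the pattern matches anywhere in the string.
str_detect(x, "apple")
#> [1] TRUE TRUE TRUE

# With ^ and $ only the string that is exactly "apple" matches.
str_detect(x, "^apple$")
#> [1] FALSE  TRUE FALSE

str_subset(x, "^apple$")
#> [1] "apple"
```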
14.3.2.1 Exercises
1. How would you match the literal string "$^$"?
str_view("$^$", "\\$\\^\\$")
2. Given the corpus of common words in stringr::words, create regular expressions that find all words that:
Start with “y”.
End with “x”
Are exactly three letters long. (Don’t cheat by using str_length()!)
Have seven letters or more.
str_view(stringr::words, "^y", match = TRUE)
str_view(stringr::words, "x$", match = TRUE)
str_view(stringr::words, "^...$", match = TRUE)
str_view(stringr::words, "^.......", match = TRUE)
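Other patterns that match more than one character:
\d: matches any digit.
\s: matches any whitespace (e.g. space, tab, newline).
[abc]: matches a, b, or c.
[^abc]: matches anything except a, b, or c.
As with \., the backslash has to be doubled inside an R string: "\\d", "\\s".
Inside [], the characters $ . | ? * + ( ) [ { match literally and need no backslash. A few characters keep a special meaning inside [] and must be escaped with a backslash: ] \ ^ and -.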
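A minimal sketch of these classes in action, on a made-up vector. Putting a special character inside [] is often easier to read than escaping it with backslashes:

```r
library(stringr)

x <- c("abc", "a.c", "a*c", "a c")

# Inside [] the dot and the star are literal characters.
str_subset(x, "a[.]c")
#> [1] "a.c"
str_subset(x, "a[*]c")
#> [1] "a*c"

# \s (written "\\s" in an R string) matches whitespace.
str_subset(x, "a\\sc")
#> [1] "a c"

# A negated class matches anything except the listed characters.
str_subset(x, "a[^.*]c")
#> [1] "abc" "a c"

# \d matches digits.
str_extract("room 42", "\\d+")
#> [1] "42"
```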
14.3.3.1 Exercises
1.Create regular expressions to find all words that:
Start with a vowel(元音).
That only contain consonants(辅音). (Hint: thinking about matching “not”-vowels.)
End with ed, but not with eed.
End with ing or ise.
Empirically verify the rule “i before e except after c”.
Is “q” always followed by a “u”?
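str_view(stringr::words, "^[aeiou]", match = TRUE)
str_view(stringr::words, "^[^aeiou]+$", match = TRUE) # only consonants: every character is a non-vowel
str_view(stringr::words, "(^|[^e])ed$", match = TRUE) # the (^| ...) part also allows a word that is just "ed"
str_view(stringr::words, "i(ng|se)$", match = TRUE) # "ing|ise$" would match "ing" anywhere in the word
str_view(stringr::words, "[^c]ie|cei", match = TRUE) # spellings that follow the rule
str_view(stringr::words, "cie|[^c]ei", match = TRUE) # spellings that break the rule
str_view(stringr::words, "q[^u]", match = TRUE) # no matches, so every "q" is followed by a "u"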
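2. Write a regular expression that matches a word if it’s probably written in British English, not American English.
str_view(stringr::words, "re$", match = TRUE) # words ending in -re: British -re, American -er
str_view(stringr::words, "our$", match = TRUE) # words ending in -our: British -our, American usually -or
str_view(stringr::words, "ise$", match = TRUE) # verbs ending in -ize or -ise: British English accepts both, American English always uses -ize
str_view(stringr::words, "yse$", match = TRUE) # verbs ending in -yse: British -yse, American always -yze
str_view(stringr::words, "ll(ed|ing)$", match = TRUE) # verbs ending in vowel + l: British spelling doubles the l before a suffix that starts with a vowel, American spelling does not
str_view(stringr::words, "ae|oe", match = TRUE) # digraphs ae and oe: British spelling keeps both letters, American spelling usually writes a single e
str_view(stringr::words, "ence$", match = TRUE) # some nouns ending in -ence in British English end in -ense in American English
str_view(stringr::words, "ogue$", match = TRUE) # nouns ending in -ogue: British -ogue, American -og or -ogue
These patterns are heuristics: they also match many words that are spelled the same way in both varieties.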
3. Create a regular expression that will match telephone numbers as commonly written in your country.
"(0[0-9]{2,3})-" # landline: the area code and the hyphen that follows it
"1([1-9]{2})([0-9]{8})" # mobile: 11 digits starting with 1
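14.3.4 Repetition
- ?: 0 or 1
- +: 1 or more
- *: 0 or more
- {n}: exactly n
- {n,}: n or more
- {,m}: at most m
- {n,m}: between n and m
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "C{2,3}") # greedy by default: matches the longest string possible
str_view(x, 'C{2,3}?') # with ? after the quantifier it is lazy: matches the shortest string possible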
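str_view() only highlights the match in a viewer, so the difference between greedy and lazy matching is easier to check with str_extract(). A small sketch on the book's Roman numeral string:

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

# Greedy: take as many repetitions as the quantifier allows.
str_extract(x, "C{2,3}")
#> [1] "CCC"
str_extract(x, "C[LX]+")
#> [1] "CLXXX"

# Lazy (quantifier followed by ?): stop as soon as the pattern is satisfied.
str_extract(x, "C{2,3}?")
#> [1] "CC"
str_extract(x, "C[LX]+?")
#> [1] "CL"
```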
14.3.4.1 Exercises
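1. Describe the equivalents of ?, +, * in {m,n} form.
? = {0,1}
+ = {1,}
* = {0,}
2. Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)
- ^.*$ ## any string: .* matches any run of characters (other than newlines) between the start and the end
- "\\{.+\\}" ## the regex \{.+\}: a { and a }, with at least one character between them
- \d{4}-\d{2}-\d{2} ## four digits, -, two digits, -, two digits, e.g. a date such as 2018-01-11
- "\\\\{4}" ## the regex \\{4}: four backslashes in a row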
3. Create regular expressions to find all words that:
- Start with three consonants.
str_view(stringr::words, "^[^aoeiu]{3}", match = TRUE)
- Have three or more vowels in a row.
str_view(stringr::words, "[aoeiu]{3,}", match = TRUE)
- Have two or more vowel-consonant pairs in a row.
str_view(stringr::words, "([aoeiu][^aoeiu]){2,}", match = TRUE)
4. Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.
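14.3.5 Grouping and backreferences
Backreferences are convenient because they let a pattern repeat something it has already matched without spelling it out again. Parentheses define a numbered group, and \N (where N is the group number) refers back to the text that group matched. For example, a text that starts with abc, continues with xyz and is followed by abc again can be matched with "abcxyzabc", or, using a backreference, with "(abc)xyz\\1". \1 refers to the first group (abc), \2 to the second group, \3 to the third, and so on.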
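A minimal sketch of a backreference at work. The vector of fruit names is made up for illustration; the pattern "(..)\\1" finds any pair of characters that is immediately repeated:

```r
library(stringr)

fruit_names <- c("banana", "coconut", "apple", "papaya")

# (..) captures two characters, \\1 requires the same two again.
str_subset(fruit_names, "(..)\\1")
#> [1] "banana"  "coconut" "papaya"

# The example from the text: \\1 must repeat what (abc) matched.
str_detect("abcxyzabc", "(abc)xyz\\1")
#> [1] TRUE
str_detect("abcxyzabd", "(abc)xyz\\1")
#> [1] FALSE
```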
14.3.5.1 Exercises
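- Describe, in words, what these expressions will match:
(.)\1\1
## the same character three times in a row, e.g. aaa
"(.)(.)\\2\\1"
## two characters followed by the same two in reverse order, e.g. abba
(..)\1
## any two characters repeated, e.g. abab
"(.).\\1.\\1"
## a character, any character, the first character again, any character, the first character again, e.g. abaca
"(.)(.)(.).*\\3\\2\\1"
## three characters, then any number of characters, then the same three in reverse order, e.g. abcxxcba
- Construct regular expressions to match words that: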
Start and end with the same character.
str_view(stringr::words, "^(.)((.*\\1$)|\\1?$)", match = TRUE) # the second alternative also covers one-letter words such as "a"
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
str_view(stringr::words, "(..).*\\1", match = TRUE)
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
str_view(stringr::words, "(.).*\\1.*\\1", match = TRUE)
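14.4 Tools
The regular expression is passed to these functions as the pattern argument. What stringr does with a pattern falls into four groups:
Detect: determine whether a string matches the pattern.
Locate: return the start and end positions of the matches.
Extract: return the text that matched.
Replace: replace the matches and return the modified string.
14.4.1 Detect matches
str_detect()
Tells whether each element of a character vector matches the pattern; it returns a logical vector of the same length as the input.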
x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1] TRUE FALSE TRUE
- str_count() tells how many matches there are in each string.
str_count(x, "a")
#> [1] 1 3 1
- Matches never overlap. In the example below, "aba" is counted 2 times in "abababa", not 3.
str_count("abababa", "aba")
#> [1] 2
str_view_all("abababa", "aba")
- Functions with the _all suffix work on every match in a string rather than only the first one.
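Because str_detect() returns a logical vector, it combines naturally with sum(), mean() and subsetting. A minimal sketch on the same small vector:

```r
library(stringr)

x <- c("apple", "banana", "pear")

has_e <- str_detect(x, "e")

# TRUE counts as 1, so sum() gives the number of matching strings
# and mean() gives the proportion.
sum(has_e)
#> [1] 2
mean(has_e)

# Negate the result to keep the strings that do NOT match.
x[!str_detect(x, "n")]
#> [1] "apple" "pear"

# Non-overlapping matches: "aa" fits into "aaaa" twice, not three times.
str_count("aaaa", "aa")
#> [1] 2
```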
14.4.2 Extract matches
- str_subset() keeps only the elements that match the pattern.
str_subset(x,"e")
[1] "apple" "pear"
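- str_extract() returns the text of the first match in each string.
- str_extract_all() returns all matches, as a list.
With simplify = TRUE, str_extract_all() returns the matches as a matrix instead.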
str_extract(x,"e")
[1] "e" NA "e"
## all matches, returned as a list
str_extract_all(x,"a")
[[1]]
[1] "a"
[[2]]
[1] "a" "a" "a"
[[3]]
[1] "a"
## all matches, returned as a matrix
str_extract_all(x,"a",simplify = T)
[,1] [,2] [,3]
[1,] "a" "" ""
[2,] "a" "a" "a"
[3,] "a" "" ""
14.4.3 Grouped matches
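- str_match() returns a character matrix.
The first column is the complete match (what str_extract() would return) and the following columns hold the text matched by each group, one column per pair of parentheses. The pattern in this example has two groups.
If the data is in a tibble, tidyr::extract() is often more convenient. It works like str_match() but requires a name for each group, and puts each one in a new column.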
str_match(x,"([aoeiu]).*([aoeiu])")
[,1] [,2] [,3]
[1,] "apple" "a" "e"
[2,] "anana" "a" "a"
[3,] "ea" "e" "a"
tibble(x=x) %>%
tidyr::extract(x,c("vowel1","vowel2"),"([aoeiu]).*([aoeiu])",
remove = FALSE) ## keep the original column
14.4.4 Replacing matches
- str_replace() and str_replace_all() replace matches with new strings.
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple" "p-ar" "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-" "p--r" "b-n-n-"
- str_replace_all() can also perform several replacements at once when given a named vector.
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house" "two cars" "three people"
- Instead of replacing with a fixed string you can use backreferences to insert components of the match.
sentences %>%
str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% # swap the second and third words
head(5)
#> [1] "The canoe birch slid on the smooth planks."
#> [2] "Glue sheet the to the dark blue background."
#> [3] "It's to easy tell the depth of a well."
#> [4] "These a days chicken leg is a rare dish."
#> [5] "Rice often is served in round bowls."
14.4.1.1 Exercises
1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.
Find all words that start or end with x.
words[str_detect(words, "^x|x$")]
words[str_detect(words, "^x") | str_detect(words, "x$")]
Find all words that start with a vowel and end with a consonant.
words[str_detect(words, "^[aoeiu].*[^aoeiu]$")]
words[str_detect(words, "^[aoeiu]") & str_detect(words, "[^aoeiu]$")]
Are there any words that contain at least one of each different vowel?
words[str_detect(words, "a") & str_detect(words, "e") & str_detect(words, "i") & str_detect(words, "o") & str_detect(words, "u")]
2. What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)
vowels <- str_count(words, "[aoeiu]")
words[vowels == max(vowels)]
prop <- vowels / str_length(words)
words[prop == max(prop)]
14.4.2.1 Exercises
- In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a colour. Modify the regex to fix the problem.
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
#> [1] "red|orange|yellow|green|blue|purple"
has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)
Modified regular expression: wrap the alternatives in word boundaries (\b) so that a colour is matched only as a whole word, not as part of "flickered".
colour_match <- str_c("\\b(", str_c(colours, collapse = "|"), ")\\b")
colour_match
#> [1] "\\b(red|orange|yellow|green|blue|purple)\\b"
str_view_all(more, colour_match)
- From the Harvard sentences data, extract:
The first word from each sentence.
str_extract(sentences, "[A-Za-z']+")
All words ending in ing.
str_extract_all(sentences, "\\b[A-Za-z]+ing\\b") %>% unlist() %>% unique()
All plurals.
str_extract(sentences, "([a-z]+)(((s|x|sh|ch)es)|ies|[aoeiu]ys|ves|[^aeiu']s)[ .]") # a heuristic: it also picks up non-plural words that end in s
14.4.3.1 Exercises
- Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.
numb <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten")
number <- str_c("\\b(", str_c(numb, collapse = "|"), ") +(\\w+)")
str_subset(sentences, number) %>% str_match(number)
Find all contractions. Separate out the pieces before and after the apostrophe.
str_subset(sentences,"'") %>% str_match("([^ ]+)'([^ ]+)")
14.4.4.1 Exercises
- Replace all forward slashes in a string with backslashes.
x <- c("ab/c", "abbc/edf")
str_replace_all(x, "/", "\\\\")
#> [1] "ab\\c"     "abbc\\edf"
writeLines(str_replace_all(x, "/", "\\\\"))
#> ab\c
#> abbc\edf
- Implement a simple version of str_to_lower() using replace_all().
paste0('"',LETTERS,'"',"=",'"',letters,'"') %>% str_c(collapse = ",") %>% writeLines()
#"A"="a","B"="b","C"="c","D"="d","E"="e","F"="f","G"="g","H"="h","I"="i","J"="j","K"="k","L"="l","M"="m","N"="n","O"="o","P"="p","Q"="q","R"="r","S"="s","T"="t","U"="u","V"="v","W"="w","X"="x","Y"="y","Z"="z"
str_replace_all(sentences,c("A"="a","B"="b","C"="c","D"="d","E"="e","F"="f","G"="g","H"="h","I"="i","J"="j","K"="k","L"="l","M"="m","N"="n","O"="o","P"="p","Q"="q","R"="r","S"="s","T"="t","U"="u","V"="v","W"="w","X"="x","Y"="y","Z"="z"))%>% head(5)
[1] "the birch canoe slid on the smooth planks." "glue the sheet to the dark blue background."
[3] "it's easy to tell the depth of a well." "these days a chicken leg is a rare dish."
[5] "rice is often served in round bowls."
c("A"="a","B"="b","C"="c","D"="d","E"="e","F"="f","G"="g","H"="h","I"="i","J"="j","K"="k","L"="l","M"="m","N"="n","O"="o","P"="p","Q"="q","R"="r","S"="s","T"="t","U"="u","V"="v","W"="w","X"="x","Y"="y","Z"="z")
## A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
## "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
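- The call to c() above only builds a named character vector, so the same replacement vector can be created more directly by naming letters with LETTERS: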
names(letters) <- LETTERS
letters
## A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
## "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
str_replace_all(sentences,letters) %>% head(5)
#[1] "the birch canoe slid on the smooth planks."
#[2] "glue the sheet to the dark blue background."
#[3] "it's easy to tell the depth of a well."
#[4] "these days a chicken leg is a rare dish."
#[5] "rice is often served in round bowls."
- Switch the first and last letters in words. Which of those strings are still words?
str_replace_all(words,"(^.)(.*)(.$)","\\3\\2\\1")
str_replace_all(words,"(^.)(.*)(.$)","\\3\\2\\1") %>% str_subset(paste0("^",str_c(words,collapse = "$|^"),"$"))
[1] "a" "america" "area" "dad" "dead" "lead" "read" "depend" "god"
[10] "educate" "else" "encourage" "engine" "europe" "evidence" "example" "excuse" "exercise"
[19] "expense" "experience" "eye" "dog" "health" "high" "knock" "deal" "level"
[28] "local" "nation" "on" "non" "no" "rather" "dear" "refer" "remember"
[37] "serious" "stairs" "test" "tonight" "transport" "treat" "trust" "window" "yesterday"
14.4.5 Splitting
- str_split() splits a string into pieces at each match of a pattern, much like Text to Columns in Excel.
- Because each element may contain a different number of pieces, it returns a list of character vectors.
- When splitting a single string, the easiest approach is to take the first element with [[1]], which gives a character vector.
- With simplify = TRUE it returns a matrix instead. The n = argument sets the maximum number of pieces, which caps the number of columns.
- Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word boundary()s:
boundary(type = c("character", "line_break", "sentence", "word"), skip_word_none = NA, ...)
words <- c("These are   some words.")
str_count(words, boundary("word"))
#> [1] 4
str_split(words, " ")[[1]]
#> [1] "These"  "are"    ""       ""       "some"   "words."
str_split(words, boundary("word"))[[1]]
#> [1] "These" "are"   "some"  "words"
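The simplify = and n = arguments are described above without an example. A minimal sketch, using the key-value strings from the book:

```r
library(stringr)

fields <- c("Name: Hadley", "Country: NZ", "Age: 35")

# Default: a list with one character vector per input string.
str_split(fields, ": ")

# simplify = TRUE returns a character matrix; n = 2 caps the number of pieces.
parts <- str_split(fields, ": ", n = 2, simplify = TRUE)
parts
#>      [,1]      [,2]
#> [1,] "Name"    "Hadley"
#> [2,] "Country" "NZ"
#> [3,] "Age"     "35"

# With n = 2, everything after the first separator stays in the last piece.
str_split("a: b: c", ": ", n = 2)[[1]]
#> [1] "a"    "b: c"
```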
- str_split_fixed() returns a character matrix with n columns.
fruits <- c(
"apples and oranges and pears and bananas",
"pineapples and mangos and guavas"
)
str_split_fixed(fruits, " and ", 3)
[,1] [,2] [,3]
[1,] "apples" "oranges" "pears and bananas"
[2,] "pineapples" "mangos" "guavas"
str_split_fixed(fruits, " and ", 4)
[,1] [,2] [,3] [,4]
[1,] "apples" "oranges" "pears" "bananas"
[2,] "pineapples" "mangos" "guavas" ""
14.4.5.1 Exercises
1. Split up a string like "apples, pears, and bananas" into individual components.
x <- "apples, pears, and bananas"
str_split(x, ", +(and +)?")[[1]]
2. Why is it better to split up by boundary("word") than " "?
boundary("word") splits on word boundaries, so repeated spaces and punctuation such as commas and full stops do not end up in the result.
3. What does splitting with an empty string ("") do? Experiment, and then read the documentation.
It splits the string into individual characters:
str_split(words, "")[[1]]
[1] "T" "h" "e" "s" "e" " " "a" "r" "e" " " " " " " "s" "o" "m" "e" " " "w" "o" "r" "d" "s" "."
14.4.6 Find matches
str_locate() and str_locate_all() give the start and end positions of each match. Use str_locate() to find the matching pattern and str_sub() to extract or modify it.
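A minimal sketch of the locate-then-subset workflow described above, on a small made-up vector:

```r
library(stringr)

x <- c("apple", "banana", "pear")

# str_locate() returns a matrix with the start and end of the first match;
# rows are NA where the pattern does not occur.
loc <- str_locate(x, "an")
loc
#>      start end
#> [1,]    NA  NA
#> [2,]     2   3
#> [3,]    NA  NA

# str_locate_all() returns a list with one matrix per string.
str_locate_all("banana", "an")[[1]]
#>      start end
#> [1,]     2   3
#> [2,]     4   5

# Use the positions with str_sub() to modify the matched part.
y <- "banana"
pos <- str_locate(y, "an")
str_sub(y, pos[1, "start"], pos[1, "end"]) <- "AN"
y
#> [1] "bANana"
```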
14.5 Other types of pattern
- A pattern given as a plain string is automatically wrapped in a call to regex().
- The arguments of regex() give finer control over the match:
- ignore_case = TRUE lets characters match both their upper-case and lower-case forms. The default is FALSE.
bananas <- c("banana", "Banana", "BANANA")
str_extract(bananas, "banana")
#> [1] "banana" NA       NA
str_extract(bananas, regex("banana", ignore_case = TRUE))
#> [1] "banana" "Banana" "BANANA"
- multiline = TRUE lets ^ and $ match the start and end of each line rather than the start and end of the whole string.
x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]
#> [1] "Line"
str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
#> [1] "Line" "Line" "Line"
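- comments = TRUE allows comments and whitespace in the pattern to make a complex regular expression easier to read. Everything after # is treated as a comment and whitespace is ignored. To match a literal space, escape it: "\\ ".
str_split(words, regex("\\ ", comments = TRUE))[[1]] ## splits on spaces
# [1] "These" "are" "" "" "some" "words."
str_split(words, regex(" ", comments = TRUE))[[1]] ## the space in the pattern is ignored, so the string is split into single characters
# [1] "" "T" "h" "e" "s" "e" " " "a" "r" "e" " " " " " " "s" "o" "m" "e" " " "w" "o" "r" "d" "s" "." ""
str_split(words, regex("[. ]", comments = TRUE))[[1]] ## splits on the full stop only
# [1] "These are   some words" ""
str_split(words, regex("[.\\ ]", comments = TRUE))[[1]] ## splits on a full stop or a space
# [1] "These" "are" "" "" "some" "words" ""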
- dotall = TRUE allows . to match every character, including \n.
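A short sketch of what dotall changes, on a made-up two-line string. By default . stops at the line break:

```r
library(stringr)

x <- "Line 1\nLine 2"

# Default: . does not match "\n", so the match ends at the first line.
str_extract(x, "Line.*")
#> [1] "Line 1"

# dotall = TRUE: . also matches "\n", so the match runs to the end.
str_extract(x, regex("Line.*", dotall = TRUE))
#> [1] "Line 1\nLine 2"
```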
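Besides regex(), three other functions can be used to wrap a pattern:
- fixed() matches the exact sequence of bytes given. It ignores all special regular expression characters and is faster. With non-English text, take care: the same character can be represented in more than one way, and fixed() does not recognise the different forms as equal.
- coll() compares strings using standard collation rules. It takes a locale argument that determines which rules are used. coll() is comparatively slow.
- boundary(), introduced with str_split(), can also be used with the other functions.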
str_extract_all(words, boundary("word"))[[1]]
[1] "These" "are" "some" "words"
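The caveat about fixed() and non-English text is easier to see with an example. A minimal sketch, following the book: the letter á can be stored as one code point or as a plain a plus a combining accent, and the two look identical on screen:

```r
library(stringr)

a1 <- "\u00e1"   # á as a single code point
a2 <- "a\u0301"  # a followed by a combining acute accent

# The two strings render the same but are stored differently.
a1 == a2
#> [1] FALSE

# fixed() compares bytes, so it does not see them as equal.
str_detect(a1, fixed(a2))
#> [1] FALSE

# coll() applies collation rules and treats them as the same character.
str_detect(a1, coll(a2))
#> [1] TRUE

# fixed() also switches off regex metacharacters: "." is a literal dot.
str_detect(c("a.c", "abc"), fixed("."))
#> [1]  TRUE FALSE
```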
14.5.1 Exercises
- How would you find all strings containing \ with regex() vs. with fixed()?
x <- "Line 1\\\\Line 2\nLine 3"
writeLines(x)
#Line 1\\Line 2
#Line 3
str_view_all(x, regex("\\\\"))
str_view_all(x,fixed("\\"))
- What are the five most common words in sentences?
str_split(sentences, boundary("word")) %>% ## split into words
unlist() %>% str_to_lower() %>% ## flatten the list and convert to lower case
table() %>% sort(decreasing = TRUE) %>% ## count with table(), order with sort()
head(5) ## show the five most frequent
# .
# the a of to and
# 751 202 132 123 118
14.6 Other uses of regular expressions
- apropos() searches all objects available from the global environment.
- dir() lists all the files in a directory. The pattern argument takes a regular expression and only returns file names that match the pattern.
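A minimal sketch of both functions. The temporary directory and the file names are made up so that the example is self-contained:

```r
library(stringr)

# apropos() takes a regular expression and returns matching object names.
apropos("^str_replace")

# Create a throwaway directory with three files (hypothetical names).
tmp <- tempfile("regex-demo")
dir.create(tmp)
invisible(file.create(file.path(tmp, c("a.csv", "b.csv", "notes.txt"))))

# pattern is a regular expression, not a shell glob:
# the dot is escaped and $ anchors the extension at the end of the name.
dir(tmp, pattern = "\\.csv$")
#> [1] "a.csv" "b.csv"
```

Anyone more comfortable with globs such as *.csv can convert them to regular expressions with glob2rx().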
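14.7 stringi
stringr is built on top of the stringi package. stringr offers the most commonly needed string functions (46 when these notes were written), while stringi is more comprehensive (234 functions). For more complex string processing, turn to stringi. The functions in the two packages are very similar; the main difference is the prefix, str_ versus stri_.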
14.7.1 Exercises
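library(stringi)
Find the stringi functions that:
Count the number of words.
stri_count_words(sentences) %>% head()
Find duplicated strings.
stri_duplicated(c("a", "b", "a", NA, "a", NA))
### flags elements that repeat an earlier element
stri_extract(stringr::words, regex = "(.)\\1")
### a different task: finds a doubled letter inside each word
Generate random text.
stri_rand_strings(3, 5) ### three random strings of length 5
stri_rand_lipsum(1) ### one paragraph of lorem ipsum text
How do you control the language that stri_sort() uses for sorting?
- The decreasing = TRUE argument sorts in descending order.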
stri_sort(sample(LETTERS))
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
stri_sort(sample(LETTERS),decreasing = TRUE)
[1] "Z" "Y" "X" "W" "V" "U" "T" "S" "R" "Q" "P" "O" "N" "M" "L" "K" "J" "I" "H" "G" "F" "E" "D" "C" "B" "A"
- The locale = "xx" argument sorts according to the rules of that language or region.
stri_sort(c("hladny", "chladny"), locale="pl_PL")
[1] "chladny" "hladny"
stri_sort(c("hladny", "chladny"), locale="sk_SK")
[1] "hladny" "chladny"
- The numeric = TRUE argument sorts digits by numeric value.
stri_sort(c(1, 100, 2, 101, 11, 10))
[1] "1" "10" "100" "101" "11" "2"
stri_sort(c(1, 100, 2, 101, 11, 10), numeric=TRUE)
[1] "1" "2" "10" "11" "100" "101"