[Reading notes: r4ds] 14. Strings

Read online:
R for Data Science
GitHub repository: https://github.com/hadley/r4ds


II Data Wrangle

Data wrangling consists of three steps: import, tidy, and transform.

This chapter covers string manipulation, mainly with the stringr package.

library(tidyverse)
library(stringr)

14.2 String basics

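  • R accepts text wrapped in either double quotes " " or single quotes ' ' as a string; the two forms behave identically.
  • A string needs both its opening and its closing quote. A command that is missing the closing quote cannot run, and the console shows a + continuation prompt on the next line. Press Esc to cancel and retype the command.
  • To include a literal single or double quote in a string, use \ to "escape" it:
double_quote <- "\""  # or '"'
single_quote <- '\''  # or "'"

Alternatively, put the other kind of quote on the outside to avoid the problem: ' ' inside " ", or " " inside ' '.

  • A single \ in a string is read as the start of an escape, so a literal backslash must be written as "\\".
  • print() shows a string with its escapes, which differs from the string's actual content.
    Use writeLines() to see the raw content.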
x <- c("\"", "\\")
x
#> [1] "\"" "\\"
writeLines(x)
#> "
#> \
  • Special characters:
    \n : newline
    \t : tab
    \r : carriage return
    \b : backspace
    \a : alert (bell)
    \f : form feed
    \v : vertical tab
    \\ : backslash \
    \' : ASCII apostrophe '
    \" : ASCII quotation mark "
    \` : ASCII grave accent (backtick) `
    \nnn : character with given octal code (1, 2 or 3 digits)
    \xnn : character with given hex code (1 or 2 hex digits)
    \unnnn : Unicode character with given code (1--4 hex digits)
    \Unnnnnnnn : Unicode character with given code (1--8 hex digits)
    Use ?'"' or ?"'" to open the help page for these special characters.
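# How control characters are rendered depends on the console or terminal;
# the outputs below are the ones recorded in these notes.
a <- "abc\\efg\r12456"   # "\r" is a carriage return; "\\" is a literal \
a
#> [1] "abc\\efg\r12456"
writeLines(a)            # the cursor returns to the line start, so "12456" overwrites the first five characters and the rest remains
#> 12456fg
a <- "abc\\efg\b12456"   # "\b" is a backspace: the preceding character is erased
writeLines(a)
#> abc\ef12456
a <- "abc\\efg\a12456"   # "\a" is the alert (bell); here it is shown as a placeholder symbol
writeLines(a)
#> abc\efg�12456
a <- "abc\\efg\f12456"   # "\f" is a form feed: the page was cleared, leaving only "12456"
writeLines(a)
#> 12456
a <- "abc\\efg\v123456"  # "\v" is a vertical tab, also shown as a placeholder symbol
writeLines(a)
#> abc\efg�123456
a <- "abc\\efg\12456"    # "\124" is read as an octal character code (the letter T)
writeLines(a)
#> abc\efgT56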
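  • Base R also has many string functions, but they can be inconsistent, so these notes use only stringr, whose functions have more intuitive names. Every stringr function starts with str_, so typing str_ triggers autocomplete and lists all stringr functions to choose from.
  • str_length() returns the number of characters in a string.
  • str_c() combines strings; the sep argument sets the separator.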
str_c("x", "y")
#> [1] "xy"
str_c("x", "y", sep = ", ")
#> [1] "x, y"
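As a quick illustration of str_length() from the list above, here is a minimal sketch. The input vector is an arbitrary example, not taken from the notes. Note that NA stays NA rather than being counted as two characters.

```r
library(stringr)

# Example input chosen only for illustration
s <- c("a", "R for data science", NA)

# One length per element; missing values stay missing
lens <- str_length(s)
lens
#> [1]  1 18 NA
```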
  • str_c() is vectorised: it automatically recycles shorter vectors to the length of the longest:
str_c("prefix-", c("a", "b", "c"), "-suffix")
#> [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"
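  • Zero-length arguments (such as the NULL returned by an if whose condition is FALSE) are silently dropped by str_c(). This is particularly useful together with if.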
name <- "Hadley"
time_of_day <- "morning"
birthday <- FALSE

str_c(
  "Good ", time_of_day, " ", name,
  if (birthday) " and HAPPY BIRTHDAY",
  "."
)
#> [1] "Good morning Hadley."
  • str_replace_na() turns NA values into the string "NA" so that they can be combined like any other string.
x <- c("abc", NA)
str_c("|-", x, "-|")
#> [1] "|-abc-|" NA
str_c("|-", str_replace_na(x), "-|")
#> [1] "|-abc-|" "|-NA-|"
  • str_c() can also collapse a vector of strings into a single string with the collapse argument.
str_c(c("x", "y", "z"), collapse = ", ")
#> [1] "x, y, z"
  • Subsetting strings
    -- str_sub() takes start and end arguments that give the (inclusive) positions of the substring.
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
#> [1] "App" "Ban" "Pea"
# negative numbers count backwards from end
str_sub(x, -3, -1)
#> [1] "ple" "ana" "ear"

-- str_sub() does not fail if the string is too short; it returns as much as it can.

str_sub("a", 1, 5)
#> [1] "a"

-- The assignment form of str_sub() can be used to modify strings.

str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
#> [1] "apple"  "banana" "pear"

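-- str_to_lower() converts to lower case.
-- str_to_upper() converts to upper case.
-- str_to_title() converts to title case: the first letter of every word is capitalised.
-- str_to_sentence() converts to sentence case: only the first letter of each sentence is capitalised.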
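The four case-conversion helpers can be compared side by side. This is a minimal sketch on a made-up input string; str_to_sentence() assumes stringr 1.4.0 or later.

```r
library(stringr)

# Illustrative input, all lower case
s <- "r for data science"

upper    <- str_to_upper(s)     # every letter capitalised
lower    <- str_to_lower(upper) # back to lower case
title    <- str_to_title(s)     # first letter of each word capitalised
sentence <- str_to_sentence(s)  # only the first letter of the sentence capitalised

c(upper, title, sentence)
#> [1] "R FOR DATA SCIENCE" "R For Data Science" "R for data science"
```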

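  • Locales
    Writing conventions differ between regions, so specify the locale argument when code must give the same result on computers in different regions.
    The locale is given as an ISO 639 language code, a two- or three-letter abbreviation.
    Base R's order() and sort() use the computer's current locale. For results that are the same on every computer, use str_sort() and str_order(), which take a locale argument.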
x <- c("apple", "eggplant", "banana")
str_sort(x, locale = "en")  # English
#> [1] "apple"    "banana"   "eggplant"
str_sort(x, locale = "haw") # Hawaiian
#> [1] "apple"    "eggplant" "banana"

-- en: English
-- zh: Chinese
-- fr: French
-- ja: Japanese
-- de: German
-- es: Spanish
...

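  • str_wrap() treats each input string as one paragraph and inserts line breaks according to the requested format (width, indent, exdent). The result has one string per input, with "\n" inserted at the breaks. width is the target line width, indent is the indentation of the first line, and exdent is the indentation of the following lines (hanging indent).
    str_wrap(string, width = 80, indent = 0, exdent = 0)

  • str_trim() removes whitespace from the start and end of a string.
    -- str_trim(string, side = c("both", "left", "right"))

  • str_squish(string) trims both ends and also collapses repeated whitespace inside the string into a single space.

  • str_pad() pads a string to a given width, adding spaces on the left by default (side can be "left", "right", or "both").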
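The whitespace helpers above have no example in the notes, so here is a minimal sketch. All input strings are made up for illustration.

```r
library(stringr)

messy <- "  too   many  spaces  "

trimmed  <- str_trim(messy)                 # whitespace removed at both ends only
left     <- str_trim(messy, side = "left")  # leading whitespace removed only
squished <- str_squish(messy)               # ends trimmed, inner runs collapsed

padded_left  <- str_pad("7", width = 3, pad = "0")         # pads on the left by default
padded_right <- str_pad("abc", width = 5, side = "right")  # pads on the right

# str_wrap() returns ONE string per input, with "\n" at the line breaks
wrapped <- str_wrap("aaa bbb ccc", width = 5)
writeLines(wrapped)
#> aaa
#> bbb
#> ccc
```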

Exercise:

Write a function that turns (e.g.) a vector c("a", "b", "c") into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.

str_commasep <- function(x, delim = ",") {
  n <- length(x)
  x <-str_replace_na(x)
  if (n == 0) {
    ""
  } else if (n == 1) {
    x
  } else if (n == 2) {
    # no comma before and when n == 2
    str_c(x[[1]], "and", x[[2]], sep = " ")
  } else {
    # commas after all n - 1 elements
    not_last <- str_c(x[seq_len(n - 1)], delim)
    # prepend "and" to the last element
    last <- str_c("and", x[[n]], sep = " ")
    # combine parts with spaces
    str_c(c(not_last, last), collapse = " ")
  }
}
str_commasep(character())
#> [1] ""
str_commasep("a")
#> [1] "a"
str_commasep(c("a", "b"))
#> [1] "a and b"
str_commasep(c("a", "b", "c"))
#> [1] "a, b, and c"
str_commasep(c("a", "b", "c", "d"))
#> [1] "a, b, c, and d"
## Author: Richard_Zhou
## Link: https://www.jianshu.com/p/4790b00dc238

14.3 Matching patterns with regular expressions

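
  • str_view() shows how a pattern matches a string.
  • "." matches any character (except a newline).
  • "\." matches a literal dot: the \ escapes the special meaning of ".". However, \ is also the escape character in R strings, so to write the regex \. you need the string "\\.".
    To match a literal \ you need "\\\\": the string "\\\\" holds the regex \\, and the regex engine then unescapes it to a single \.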
x <- "a\\b"
writeLines(x)
#> a\b

str_view(x, "\\\\")
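The escaping levels described above can be checked with str_detect(). This is a minimal sketch on a made-up vector, where the third element contains one literal backslash.

```r
library(stringr)

# Illustrative strings; x[3] is the three characters a \ c
x <- c("abc", "a.c", "a\\c")
writeLines(x[3])
#> a\c

dot_any     <- str_detect(x, "a.c")     # "." matches any character
dot_literal <- str_detect(x, "a\\.c")   # regex \. matches only a literal dot
backslash   <- str_detect(x, "a\\\\c")  # regex \\ matches one literal backslash

dot_any
#> [1] TRUE TRUE TRUE
dot_literal
#> [1] FALSE  TRUE FALSE
backslash
#> [1] FALSE FALSE  TRUE
```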
  • ^ matches the start of the string.
  • $ matches the end of the string.
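A small sketch of the two anchors, using a made-up vector (the strings are illustrative, not from the notes). Combining ^ and $ forces the pattern to match the whole string.

```r
library(stringr)

x <- c("apple pie", "apple", "pineapple")

starts <- str_detect(x, "^apple")   # "apple" at the start of the string
ends   <- str_detect(x, "apple$")   # "apple" at the end of the string
whole  <- str_detect(x, "^apple$")  # the string is exactly "apple"

starts
#> [1]  TRUE  TRUE FALSE
ends
#> [1] FALSE  TRUE  TRUE
whole
#> [1] FALSE  TRUE FALSE
```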

14.3.2.1 Exercises

1. How would you match the literal string "$^$"?

str_view("$^$","\\$\\^\\$")
  2. Given the corpus of common words in stringr::words, create regular expressions that find all words that:
    Start with “y”.
    End with “x”
    Are exactly three letters long. (Don’t cheat by using str_length()!)
    Have seven letters or more.
str_view(stringr::words,"^y",match=T)
str_view(stringr::words, "x$",match=T)
str_view(stringr::words,"^...$",match=T)
str_view(stringr::words, "^.......",match=T)

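Other patterns that match more than one character:

\d: matches any digit.
\s: matches any whitespace (e.g. space, tab, newline).
[abc]: matches a, b, or c.
[^abc]: matches anything except a, b, or c.

Inside [], the characters $ . | ? * + ( ) [ { can be matched without a \. A few characters keep a special meaning even inside [] and must be escaped with a backslash: ] \ ^ and -.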
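The character classes above can be tried out as follows. This is a minimal sketch with made-up strings; remember that inside an R string each regex backslash is doubled.

```r
library(stringr)

s <- "Call 555 1234 at 9:30."

digits   <- str_extract_all(s, "\\d+")[[1]]  # runs of digits
no_space <- str_replace_all(s, "\\s", "_")   # every whitespace character replaced

# Inside [] most metacharacters are literal ...
literal <- str_replace_all("a.b*c+d", "[.*+]", "-")
# ... but ^ and - are escaped to be taken literally
caret_dash <- str_replace_all("a^b-c", "[\\^\\-]", "+")
# [^abc] matches anything except a, b, or c
not_abc <- str_extract_all("abcde", "[^abc]")[[1]]

digits
#> [1] "555"  "1234" "9"    "30"
literal
#> [1] "a-b-c-d"
```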

14.3.3.1 Exercises

1. Create regular expressions to find all words that:
Start with a vowel.
That only contain consonants. (Hint: thinking about matching “not”-vowels.)
End with ed, but not with eed.
End with ing or ise.
Empirically verify the rule “i before e except after c”.
Is “q” always followed by a “u”?

str_view(stringr::words, "^[aeiou]", match = TRUE)
str_view(stringr::words, "^[^aeiou]+$", match = TRUE)
str_view(stringr::words, "[^e]ed$", match = TRUE)
str_view(stringr::words, "i(ng|se)$", match = TRUE)
str_view(stringr::words, "[^c]ie|cei", match = TRUE)  # spellings that follow the rule
str_view(stringr::words, "[^c]ei|cie", match = TRUE)  # spellings that break the rule
str_view(stringr::words, "q[^u]", match = TRUE)  # no matches, so every "q" is followed by a "u"

2. Write a regular expression that matches a word if it’s probably written in British English, not American English.

These are rough heuristics, so ordinary words that happen to share an ending will also match.

str_view(stringr::words, "re$", match = TRUE)  # words ending in -re: British -re, American -er
str_view(stringr::words, "our$", match = TRUE)  # words ending in -our: British -our, American usually -or
str_view(stringr::words, "ise$", match = TRUE)  # verbs ending in -ise: British English accepts both -ize and -ise, American English always uses -ize
str_view(stringr::words, "yse$", match = TRUE)  # verbs ending in -yse: British -yse, American always -yze
str_view(stringr::words, "ll(ed|ing)$", match = TRUE)  # verbs ending in a vowel + l: British spelling doubles the l before a vowel suffix, American spelling does not
str_view(stringr::words, "ae|oe", match = TRUE)  # digraphs ae and oe: two letters in British English, a single e in American English
str_view(stringr::words, "ence$", match = TRUE)  # nouns ending in -ence: British -ence, American -ense
str_view(stringr::words, "ogue$", match = TRUE)  # nouns ending in -ogue: British -ogue, American -og or -ogue

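3. Create a regular expression that will match telephone numbers as commonly written in your country.
"(0[0-9]{2,3})-"  # landline: the area code (0 plus two or three digits) and the hyphen
"1([1-9]{2})([0-9]{8})"  # mobile: 11 digits starting with 1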

14.3.4 Repetition 重复

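  • ?: 0 or 1
  • +: 1 or more
  • *: 0 or more
  • {n}: exactly n
  • {n,}: n or more
  • {,m}: at most m
  • {n,m}: between n and m
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "C{2,3}")   # greedy by default: matches the longest possible string
str_view(x, 'C{2,3}?')  # a trailing ? makes the match lazy: the shortest possible string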
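To see the difference between greedy and lazy repetition as values rather than in the viewer, str_extract() can be used instead of str_view(). This is a minimal sketch; the Roman numeral string is the one from the R4DS chapter, and the variable is named roman here to keep it separate from other examples.

```r
library(stringr)

roman <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

greedy      <- str_extract(roman, "C{2,3}")   # takes as many C as allowed
lazy        <- str_extract(roman, "C{2,3}?")  # takes as few C as allowed
greedy_plus <- str_extract(roman, "C[LX]+")   # + is greedy too
lazy_plus   <- str_extract(roman, "C[LX]+?")  # +? stops after the first L or X

c(greedy, lazy, greedy_plus, lazy_plus)
#> [1] "CCC"   "CC"    "CLXXX" "CL"
```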

14.3.4.1 Exercises

  1. Describe the equivalents of ?, +, * in {m,n} form.
    ?={0,1}
    +={1,}
    *={0,}

  2. Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)

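    1. ^.*$ ## matches any whole string that contains no newline: .* is any character, zero or more times
    2. "\\{.+\\}" ## the regex \{.+\}: a pair of curly braces enclosing at least one character
    3. \d{4}-\d{2}-\d{2} ## four digits, a hyphen, two digits, a hyphen, two digits (for example a date such as 2020-01-31)
    4. "\\\\{4}" ## the regex \\{4}: four backslashes in a row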
  3. Create regular expressions to find all words that:

    1. Start with three consonants.
      str_view(stringr::words, "^[^aoeiu]{3}",match=T)
    2. Have three or more vowels in a row.
      str_view(stringr::words, "[aoeiu]{3,}",match=T)
    3. Have two or more vowel-consonant pairs in a row.
      str_view(stringr::words, "([aoeiu][^aoeiu]){2,}",match=T)
  4. Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.

14.3.5 Grouping and backreferences

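Backreferences are convenient because they let you repeat a pattern without writing it out again. Use \N (where N is the group number) to refer to a previously defined group (the content wrapped in parentheses). For example, for text that starts with abc, continues with xyz, and is followed by abc again, the regular expression can be "abcxyzabc", or it can be rewritten with a backreference as "(abc)xyz\\1", where \1 refers to the first group (abc). \2 refers to the second group, \3 to the third, and so on.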
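The abc/xyz example from the paragraph above can be run directly. This is a minimal sketch; the second vector of fruit names is made up for illustration.

```r
library(stringr)

# \\1 must repeat exactly what group 1 matched
texts <- c("abcxyzabc", "abcxyzabd")
same_group <- str_detect(texts, "(abc)xyz\\1")
same_group
#> [1]  TRUE FALSE

# (..)\\1 finds any pair of characters that is immediately repeated
fruits_demo <- c("banana", "coconut", "apple")
repeated_pair <- str_extract(fruits_demo, "(..)\\1")
repeated_pair
#> [1] "anan" "coco" NA
```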

14.3.5.1 Exercises

  1. Describe, in words, what these expressions will match:
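    (.)\1\1 ## the same character three times in a row, e.g. aaa
    "(.)(.)\\2\\1" ## two characters followed by the same two in reverse order, e.g. abba
    (..)\1 ## any two characters, repeated, e.g. abab
    "(.).\\1.\\1" ## a character, any character, the same character, any character, the same character again, e.g. abaca
    "(.)(.)(.).*\\3\\2\\1" ## three characters, then any number of characters, then the same three in reverse order, e.g. abcxxcba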
  2. Construct regular expressions to match words that:
    Start and end with the same character.
    str_view(stringr::words, "^(.).*\\1$",match=T)
    Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
    str_view(stringr::words, "(..).*\\1",match=T)
    Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
    str_view(stringr::words, "(.).*\\1.*\\1",match=T)

14.4 Tools


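A regular expression is passed as the pattern argument. stringr divides what regular expressions can do with strings into four tasks:

Detect: determine whether the pattern is present.
Locate: return the start and end positions of the pattern.
Extract: return the text that the pattern matched.
Replace: replace the matched text and return the result.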

14.4.1 Detect matches

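str_detect() determines whether each element of a character vector matches a pattern; it returns a logical vector of the same length as the input.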

x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1]  TRUE FALSE  TRUE
  • str_count() tells you how many matches there are in each string.
str_count(x, "a")
#> [1] 1 3 1
  • Matches never overlap. In the example below the number of matches is 2, not 3.
str_count("abababa", "aba")
#> [1] 2
str_view_all("abababa", "aba")
  • Functions with the _all suffix find all matches rather than only the first one.
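Because str_detect() returns a logical vector, it combines naturally with !, sum(), and subsetting. The sketch below, on a made-up vector, shows two equivalent ways to find strings without vowels; negating a simple pattern is often easier to read than a single more complex regular expression.

```r
library(stringr)

w <- c("sky", "apple", "rhythm")

# Two ways to find elements that contain no vowels
no_vowels_1 <- !str_detect(w, "[aeiou]")     # negate "has a vowel"
no_vowels_2 <- str_detect(w, "^[^aeiou]+$")  # "consists only of non-vowels"

n_with_y     <- sum(str_detect(w, "y"))  # TRUE counts as 1
vowel_counts <- str_count(w, "[aeiou]")  # matches per element

no_vowels_1
#> [1]  TRUE FALSE  TRUE
vowel_counts
#> [1] 0 2 0
```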

14.4.2 Extract matches

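1. str_subset() keeps only the elements that match the pattern.

str_subset(x,"e")
[1] "apple" "pear" 
  2. str_extract() returns the first match in each string.
    str_extract_all() returns all matches, as a list.

With simplify = TRUE, str_extract_all() returns the matches as a matrix instead.

str_extract(x,"e")
[1] "e" NA  "e"
## matches returned as a list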
str_extract_all(x,"a")
[[1]]
[1] "a"

[[2]]
[1] "a" "a" "a"

[[3]]
[1] "a"
## matches returned as a matrix
str_extract_all(x,"a",simplify = T)
     [,1] [,2] [,3]
[1,] "a"  ""   ""  
[2,] "a"  "a"  "a" 
[3,] "a"  ""   ""  

14.4.3 Grouped matches

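  1. str_match() returns a character matrix.
    The first column is the full match (what str_extract() would return), followed by one column per group. The pattern below has two pairs of parentheses, so it gives two extra columns.
    If the data is in a tibble, tidyr::extract() is often more convenient. It works like str_match() but requires you to name the matched groups, which become new columns.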
str_match(x,"([aoeiu]).*([aoeiu])")
     [,1]    [,2] [,3]
[1,] "apple" "a"  "e" 
[2,] "anana" "a"  "a" 
[3,] "ea"    "e"  "a" 
tibble(x=x) %>%
 tidyr::extract(x,c("vowel1","vowel2"),"([aoeiu]).*([aoeiu])",
remove=FALSE)  ## keep the original column
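A further sketch of the matrix returned by str_match(), using a date pattern with three groups. The input strings are made up; an element with no match yields a row of NA.

```r
library(stringr)

dates <- c("2021-03-15", "no date here")

# Column 1 is the full match; columns 2-4 are the three groups
m <- str_match(dates, "(\\d{4})-(\\d{2})-(\\d{2})")
m
#>      [,1]         [,2]   [,3] [,4]
#> [1,] "2021-03-15" "2021" "03" "15"
#> [2,] NA           NA     NA   NA

year <- m[, 2]  # pull out one group as a vector
```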

14.4.4 Replacing matches

  1. str_replace() and str_replace_all() replace matches with new strings.
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple"  "p-ar"   "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-"  "p--r"   "b-n-n-"
  • str_replace_all() can also perform multiple replacements if you supply a named vector.
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house"    "two cars"     "three people"
  • Instead of replacing with a fixed string you can use backreferences to insert components of the match.
sentences %>% 
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%  # swap the second and third words
  head(5) 
#> [1] "The canoe birch slid on the smooth planks." 
#> [2] "Glue sheet the to the dark blue background."
#> [3] "It's to easy tell the depth of a well."     
#> [4] "These a days chicken leg is a rare dish."   
#> [5] "Rice often is served in round bowls."
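The same backreference technique works on any grouped pattern. This is a minimal sketch that reorders a "first last" name into "last, first"; the names are made up for illustration.

```r
library(stringr)

full_names <- c("John Smith", "Ada Lovelace")

# \\1 and \\2 in the replacement refer to the two captured words
swapped <- str_replace(full_names, "(\\w+) (\\w+)", "\\2, \\1")
swapped
#> [1] "Smith, John"  "Lovelace, Ada"
```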

14.4.1.1 Exercises

1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.

Find all words that start or end with x.

words[str_detect(words,"^x|x$")]

Find all words that start with a vowel and end with a consonant.

words[str_detect(words,"^[aoeiu].*[^aoeiu]$")]

Are there any words that contain at least one of each different vowel?

words[str_detect(words, "a") & str_detect(words, "e") & str_detect(words, "i") & str_detect(words, "o") & str_detect(words, "u")]
  2. What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)
words[str_count(words,"[aoeiu]") == max(str_count(words,"[aoeiu]"))]
words[str_count(words,"[aoeiu]")/str_length(words) == max(str_count(words,"[aoeiu]")/str_length(words))]

14.4.2.1 Exercises

  1. In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a colour. Modify the regex to fix the problem.
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
#> [1] "red|orange|yellow|green|blue|purple"
has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)

Modified regular expression: wrap the alternatives in \b word boundaries so that a colour only matches as a whole word, which excludes "flickered":

colour_match <- str_c("\\b(", str_c(colours, collapse = "|"), ")\\b")
colour_match
#> [1] "\\b(red|orange|yellow|green|blue|purple)\\b"
str_view_all(more, colour_match)
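  2. From the Harvard sentences data, extract:
    The first word from each sentence.
str_extract(sentences, "^[A-Za-z']+")

All words ending in ing.

unique(unlist(str_extract_all(sentences, "\\b[A-Za-z]+ing\\b")))

All plurals.

# heuristic only: returns the first plural-looking word in each sentence, together with the space or period that follows it
str_extract(sentences,"([a-z]+)(((s|x|sh|ch)es)|ies|[aoeiu]ys|ves|[^aeiu']s)[ .]")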

14.4.3.1 Exercises

  1. Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.
numb <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten")
number <- str_c("\\b(", str_c(numb, collapse = "|"), ") +(\\w+)")
str_subset(sentences, number) %>% str_match(number)

Find all contractions. Separate out the pieces before and after the apostrophe.

str_subset(sentences,"'") %>% str_match("([^ ]+)'([^ ]+)")

14.4.4.1 Exercises

  1. Replace all forward slashes in a string with backslashes.
x <- c("ab/c", "abbc/edf")
x
#> [1] "ab/c"     "abbc/edf"
str_replace_all(x, "/", "\\\\")
#> [1] "ab\\c"     "abbc\\edf"
  2. Implement a simple version of str_to_lower() using str_replace_all().
paste0('"',LETTERS,'"',"=",'"',letters,'"') %>% str_c(collapse = ",") %>% writeLines()
#"A"="a","B"="b","C"="c","D"="d","E"="e","F"="f","G"="g","H"="h","I"="i","J"="j","K"="k","L"="l","M"="m","N"="n","O"="o","P"="p","Q"="q","R"="r","S"="s","T"="t","U"="u","V"="v","W"="w","X"="x","Y"="y","Z"="z"
str_replace_all(sentences,c("A"="a","B"="b","C"="c","D"="d","E"="e","F"="f","G"="g","H"="h","I"="i","J"="j","K"="k","L"="l","M"="m","N"="n","O"="o","P"="p","Q"="q","R"="r","S"="s","T"="t","U"="u","V"="v","W"="w","X"="x","Y"="y","Z"="z"))%>% head(5)
[1] "the birch canoe slid on the smooth planks."  "glue the sheet to the dark blue background."
[3] "it's easy to tell the depth of a well."      "these days a chicken leg is a rare dish."   
[5] "rice is often served in round bowls."       
c("A"="a","B"="b","C"="c","D"="d","E"="e","F"="f","G"="g","H"="h","I"="i","J"="j","K"="k","L"="l","M"="m","N"="n","O"="o","P"="p","Q"="q","R"="r","S"="s","T"="t","U"="u","V"="v","W"="w","X"="x","Y"="y","Z"="z")
##  A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z 
## "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z" 
  • As the output shows, c() here just builds a named character vector, so the following approach is neater.
names(letters) <- LETTERS
letters
##   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z 
## "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z" 
str_replace_all(sentences,letters) %>% head(5)
#[1] "the birch canoe slid on the smooth planks."               
#[2] "glue the sheet to the dark blue background."              
#[3] "it's easy to tell the depth of a well."                   
#[4] "these days a chicken leg is a rare dish."                 
#[5] "rice is often served in round bowls."  
  3. Switch the first and last letters in words. Which of those strings are still words?
str_replace_all(words,"(^.)(.*)(.$)","\\3\\2\\1")
str_replace_all(words,"(^.)(.*)(.$)","\\3\\2\\1") %>% str_subset(paste0("^",str_c(words,collapse = "$|^"),"$"))
 [1] "a"          "america"    "area"       "dad"        "dead"       "lead"       "read"       "depend"     "god"       
[10] "educate"    "else"       "encourage"  "engine"     "europe"     "evidence"   "example"    "excuse"     "exercise"  
[19] "expense"    "experience" "eye"        "dog"        "health"     "high"       "knock"      "deal"       "level"     
[28] "local"      "nation"     "on"         "non"        "no"         "rather"     "dear"       "refer"      "remember"  
[37] "serious"    "stairs"     "test"       "tonight"    "transport"  "treat"      "trust"      "window"     "yesterday" 

14.4.5 Splitting 分列

  • str_split() splits a string into pieces at each match of a pattern, much like Text to Columns in Excel.
  • Because each element may contain a different number of pieces, the function returns a list of character vectors.
  • If you are splitting a single string, the easiest approach is to subset with [[1]], which gives a plain vector.
  • You can also set simplify = TRUE to return a matrix, and use the n argument to cap the number of pieces (columns) returned.
  • Instead of splitting by a pattern, you can also use boundary() to split by character, line, sentence, or word:
    boundary(type = c("character", "line_break", "sentence", "word"), skip_word_none = NA, ...)
words <- c("These are   some words.")
str_count(words, boundary("word"))
#> [1] 4
str_split(words, " ")[[1]]
#> [1] "These"  "are"    ""       ""       "some"   "words."
str_split(words, boundary("word"))[[1]]
#> [1] "These" "are"   "some"  "words"
  • str_split_fixed() returns a character matrix with n columns.
fruits <- c(
  "apples and oranges and pears and bananas",
  "pineapples and mangos and guavas"
)
str_split_fixed(fruits, " and ", 3)
#>      [,1]         [,2]      [,3]               
#> [1,] "apples"     "oranges" "pears and bananas"
#> [2,] "pineapples" "mangos"  "guavas"           
str_split_fixed(fruits, " and ", 4)
#>      [,1]         [,2]      [,3]     [,4]     
#> [1,] "apples"     "oranges" "pears"  "bananas"
#> [2,] "pineapples" "mangos"  "guavas" ""       
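The n and simplify arguments of str_split() mentioned earlier can be sketched as follows. The fields vector mirrors the style of the R4DS chapter but should be treated as illustrative data.

```r
library(stringr)

# A single string: take element [[1]] of the returned list
pieces <- str_split("a,b,c", ",")[[1]]
pieces
#> [1] "a" "b" "c"

# n = 2 stops after the first split; simplify = TRUE returns a matrix
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
key_value <- str_split(fields, ": ", n = 2, simplify = TRUE)
key_value
#>      [,1]      [,2]    
#> [1,] "Name"    "Hadley"
#> [2,] "Country" "NZ"    
#> [3,] "Age"     "35"
```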

14.4.5.1 Exercises

1. Split up a string like "apples, pears, and bananas" into individual components.

fruit_string <- "apples, pears, and bananas"
str_split(fruit_string, ", +(and +)?")[[1]]

2. Why is it better to split up by boundary("word") than " "?
boundary("word") splits on word boundaries, so punctuation and repeated spaces do not end up in the result.

  3. What does splitting with an empty string ("") do? Experiment, and then read the documentation.
    It splits the string into individual characters:
str_split(words,"")[[1]]
 [1] "T" "h" "e" "s" "e" " " "a" "r" "e" " " " " " " "s" "o" "m" "e" " " "w" "o" "r" "d" "s" "."

14.4.6 Find matches

str_locate() and str_locate_all() give the start and end positions of each match. You can use str_locate() to find the matching position and then str_sub() to extract or modify it.
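A minimal sketch of that locate-then-modify workflow, using a made-up string:

```r
library(stringr)

s <- "banana"

first_loc <- str_locate(s, "an")           # matrix with columns "start" and "end"
all_locs  <- str_locate_all(s, "an")[[1]]  # one row per match

# Extract the first match by position ...
found <- str_sub(s, first_loc[, "start"], first_loc[, "end"])
# ... or overwrite it in place with the assignment form of str_sub()
str_sub(s, first_loc[, "start"], first_loc[, "end"]) <- "AN"
s
#> [1] "bANana"
```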

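14.5 Other types of pattern

  • A pattern given as a plain string is really shorthand for a call to regex().
  • The arguments of regex() give finer control over matching:
    • ignore_case = TRUE lets characters match either their uppercase or lowercase forms; the default is FALSE.
bananas <- c("banana", "Banana", "BANANA")
str_extract(bananas, "banana")
#> [1] "banana" NA       NA      
str_extract(bananas, regex("banana", ignore_case = TRUE))
#> [1] "banana" "Banana" "BANANA"
  • multiline = TRUE lets ^ and $ match the start and end of each line rather than the start and end of the whole string.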
x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]
#> [1] "Line"
str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
#> [1] "Line" "Line" "Line"
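  • comments = TRUE lets you use comments and whitespace to make a complex regular expression easier to read. Everything after # is treated as a comment and whitespace is ignored. To match a literal space, escape it as "\\ ".
str_split(words, regex("\\ ", comments = TRUE))[[1]]  ## split on a space
# [1] "These"  "are"    ""       ""       "some"   "words."
str_split(words, regex(" ", comments = TRUE))[[1]]  ## the space in the pattern is ignored, so the string is split into single characters
# [1] ""  "T" "h" "e" "s" "e" " " "a" "r" "e" " " " " " " "s" "o" "m" "e" " " "w" "o" "r" "d" "s" "." ""
str_split(words, regex("[. ]", comments = TRUE))[[1]]  ## the unescaped space is ignored even inside [], so only the period splits
# [1] "These are   some words" ""                      
str_split(words, regex("[.\\ ]", comments = TRUE))[[1]]  ## split on a period or a space
# [1] "These" "are"   ""      ""      "some"  "words" "" 
  • dotall = TRUE allows . to match any character, including \n.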
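The dotall option has no example in the notes; a minimal sketch with a made-up two-line string:

```r
library(stringr)

two_lines <- "a\nb"

# By default "." does not match a newline ...
default_dot <- str_detect(two_lines, "a.b")
# ... but with dotall = TRUE it does
dotall_dot <- str_detect(two_lines, regex("a.b", dotall = TRUE))

c(default_dot, dotall_dot)
#> [1] FALSE  TRUE
```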

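In addition, three other functions can be used in place of regex():

  • fixed(): matches the exact sequence of bytes and ignores all regular-expression syntax, which makes it faster. Be careful with fixed() on non-English text: the same character can often be represented in more than one way, and fixed() does not treat those representations as equal.
  • coll(): compares strings using standard collation rules. coll() takes a locale argument that selects which rules to use. coll() is relatively slow.
  • boundary(): the boundary() function introduced with str_split() can also be used with other functions.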
 str_extract_all(words, boundary("word"))[[1]]
[1] "These" "are"   "some"  "words"
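The caveat about fixed() on non-English text can be demonstrated with the two ways of writing "á" (this example follows the one in the R4DS chapter): as a single precomposed character, or as "a" plus a combining accent. The two render identically but are different byte sequences.

```r
library(stringr)

a1 <- "\u00e1"   # precomposed á
a2 <- "a\u0301"  # a + combining acute accent

same_bytes <- a1 == a2                     # different byte sequences
with_fixed <- str_detect(a1, fixed(a2))    # fixed() compares bytes
with_coll  <- str_detect(a1, coll(a2))     # coll() applies collation rules

c(same_bytes, with_fixed, with_coll)
#> [1] FALSE FALSE  TRUE

# fixed() is also a handy way to match metacharacters literally
literal_dot <- str_detect(c("a.b", "ab"), fixed("."))
literal_dot
#> [1]  TRUE FALSE
```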

14.5.1 Exercises

  1. How would you find all strings containing \ with regex() vs. with fixed()?
x <- "Line 1\\\\Line 2\nLine 3"
writeLines(x)
#Line 1\\Line 2
#Line 3
str_view_all(x, regex("\\\\"))
str_view_all(x,fixed("\\"))
  2. What are the five most common words in sentences?
str_split(sentences,boundary("word")) %>%  ## split into words
  unlist %>%  str_to_lower%>%   ## flatten the list and convert everything to lower case
  table() %>% sort(decreasing = T) %>%  ## count with table(), order with sort()
  head(5)  ## show the top five
# .
# the   a  of  to and 
# 751 202 132 123 118 

14.6 Other uses of regular expressions

  • apropos() searches all objects available from the global environment.
  • dir() lists all the files in a directory. The pattern argument takes a regular expression and only returns file names that match the pattern.
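Both base functions accept a regular expression, as this minimal sketch shows. The scratch directory name regex_demo and the file names are hypothetical and are created under tempdir() only for the demonstration.

```r
library(stringr)  # attached so that apropos() can see the str_ functions

# Objects on the search path whose names start with "str_replace"
found <- apropos("^str_replace")

# dir() with a regular expression: list only the .csv files
tmp <- file.path(tempdir(), "regex_demo")  # hypothetical directory name
dir.create(tmp, showWarnings = FALSE)
invisible(file.create(file.path(tmp, c("a.csv", "b.csv", "notes.txt"))))

csv_files <- dir(tmp, pattern = "\\.csv$")
csv_files
#> [1] "a.csv" "b.csv"
```

The pattern "\\.csv$" escapes the dot and anchors the extension to the end of the file name, so a file such as "a.csv.bak" would not be listed.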

14.7 stringi

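stringr is built on top of the stringi package. stringr exposes a minimal set of 46 basic string functions, whereas stringi has 234 and is more powerful. For more complex string processing, use stringi; the functions in the two packages are very similar, and you usually only need to replace the str_ prefix with stri_.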

14.7.1 Exercises

  1. Find the stringi functions that:
    Count the number of words.
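    stri_count_words(sentences) %>% head()  ### counts words; stri_count(sentences, regex = " ") would count the spaces between them instead
    Find duplicated strings.
    stri_duplicated(c("a", "b", "a", NA, "a", NA)) ### flags each element that repeats an earlier one
    stri_extract_first_regex(words, "(.)\\1") ### a related task: finds a doubled letter inside each word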
    Generate random text.
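The notes give no answer for this part. As a sketch, stringi provides stri_rand_strings(), stri_rand_shuffle(), and stri_rand_lipsum() for this purpose. The outputs are random, so none are shown here.

```r
library(stringi)

set.seed(1)  # only to make a single run repeatable

ids      <- stri_rand_strings(3, 5)      # three random strings of five characters each
shuffled <- stri_rand_shuffle("abcdef")  # the same characters in random order
lorem    <- stri_rand_lipsum(1)          # one paragraph of pseudo-random lorem ipsum text

stri_length(ids)
#> [1] 5 5 5
```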

  2. How do you control the language that stri_sort() uses for sorting?

  • Set decreasing = TRUE to sort in descending order.
stri_sort(sample(LETTERS))
#>  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
stri_sort(sample(LETTERS), decreasing = TRUE)
#>  [1] "Z" "Y" "X" "W" "V" "U" "T" "S" "R" "Q" "P" "O" "N" "M" "L" "K" "J" "I" "H" "G" "F" "E" "D" "C" "B" "A"

  • Set locale = "xx" to sort by the rules of that language; this is the argument that controls the sorting language.

stri_sort(c("hladny", "chladny"), locale = "pl_PL")
#> [1] "chladny" "hladny" 
stri_sort(c("hladny", "chladny"), locale = "sk_SK")
#> [1] "hladny"  "chladny"

  • Set numeric = TRUE to sort runs of digits by their numeric value.

stri_sort(c(1, 100, 2, 101, 11, 10))
#> [1] "1"   "10"  "100" "101" "11"  "2"  
stri_sort(c(1, 100, 2, 101, 11, 10), numeric = TRUE)
#> [1] "1"   "2"   "10"  "11"  "100" "101"
