R Basics 9: String Handling (Base R Functions + stringr & stringi)


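R Basics series:

  • R Basics 1: R data formats, the difference between .rds and .rda
  • R Basics 2: Sorting data and converting between long and wide formats
  • R Basics 3: A summary of the tidyverse packages
  • R Basics 4: Functions in the dplyr package and how to use them
  • R Basics 5: Functions in the tidyr package and how to use them
  • R Basics 6: The apply family of functions and its uses
  • R Basics 7: Regular expressions in R
  • R Basics 8: Identifying and handling missing values, outliers, and duplicates in R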

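String handling is closely tied to regular expressions; see "Regular expressions in R" (part 7 of this series).

1. Basic string handling

Create a character vector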

x <- c('huake','wuda')
1.1 nchar(): count the characters in each string
nchar(x)
# [1] 5 4

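⚠️ Note the difference between nchar() and length(): length(x) returns 2, the number of elements in the vector, not the number of characters. str_length() from the stringr package behaves like nchar().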

length(x)
# [1] 2
library(stringr)  # str_length() is provided by the stringr package
str_length(x)
# [1] 5 4
1.2 Case conversion

toupper(): lower case to upper case
tolower(): upper case to lower case

toupper('huake')
# [1] "HUAKE"
tolower('WUDA')
# [1] "wuda"
1.3 paste() and paste0(): concatenating strings

paste()

stringa <- LETTERS[1:5]
STRINGB <- 1:5
paste(stringa,STRINGB)
# [1] "A 1" "B 2" "C 3" "D 4" "E 5"

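# the sep argument sets the separator placed between the pasted arguments
paste(stringa,STRINGB,sep='-')
# [1] "A-1" "B-2" "C-3" "D-4" "E-5"

# the collapse argument joins all resulting elements into a single string, using the given separator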
paste(stringa,STRINGB,collapse ='-')
# [1] "A 1-B 2-C 3-D 4-E 5"

paste0() (the 0 means the pieces are joined with no separator)

paste0(stringa,STRINGB)
# [1] "A1" "B2" "C3" "D4" "E5"

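# paste0() has no sep argument: sep='-' is treated as one more string to paste, so it lands at the end of each element rather than in the middle; collapse still joins the elements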
paste0(stringa,STRINGB,sep='-')
# [1] "A1-" "B2-" "C3-" "D4-" "E5-"
paste0(stringa,STRINGB,collapse ='-')
# [1] "A1-B2-C3-D4-E5"

Calling paste() with sep="" gives the same result as paste0().
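The interplay of sep and collapse can be checked with a short sketch. The vectors `letters_vec` and `numbers_vec` are hypothetical stand-ins for `stringa` and `STRINGB`, not part of the original example.

```r
# Hypothetical example data, shortened to three elements
letters_vec <- LETTERS[1:3]
numbers_vec <- 1:3

# sep is placed between the arguments, element by element
pairs <- paste(letters_vec, numbers_vec, sep = "-")

# collapse then joins the resulting elements into one string
one_string <- paste(letters_vec, numbers_vec, sep = "-", collapse = "; ")

# paste0() behaves like paste() with sep = ""
same <- identical(paste0(letters_vec, numbers_vec),
                  paste(letters_vec, numbers_vec, sep = ""))

print(pairs)       # "A-1" "B-2" "C-3"
print(one_string)  # "A-1; B-2; C-3"
print(same)        # TRUE
```

Using sep and collapse together is usually the simplest way to build one delimited string from two parallel vectors.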

1.4 strsplit(): splitting strings

The result of the split is a list

stringC <- paste(stringa,STRINGB,sep='/')
stringC
# [1] "A/1" "B/2" "C/3" "D/4" "E/5"

M <- strsplit(stringC,split = '/')
M
# [[1]]
# [1] "A" "1"

# [[2]]
# [1] "B" "2"

# [[3]]
# [1] "C" "3"

# [[4]]
# [1] "D" "4"

# [[5]]
# [1] "E" "5"

class(M)
# [1] "list"
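Because strsplit() returns a list, a common follow-up step is pulling the same piece out of every element. A minimal sketch, assuming input shaped like `stringC` above (the names `string_c`, `first_parts` and `second_parts` are made up for this illustration):

```r
# Hypothetical input in the same "letter/number" shape as stringC
string_c <- c("A/1", "B/2", "C/3")

# strsplit() returns a list with one character vector per input element;
# fixed = TRUE treats "/" as a literal string rather than a regex
parts <- strsplit(string_c, split = "/", fixed = TRUE)

# Pick the first and the second piece of every element
first_parts  <- vapply(parts, function(p) p[1], character(1))
second_parts <- vapply(parts, function(p) p[2], character(1))

print(first_parts)   # "A" "B" "C"
print(second_parts)  # "1" "2" "3"
```

vapply() is used instead of sapply() so that the result is guaranteed to be a character vector.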
1.5 substr(): extracting substrings
# extract characters 2 through 4
stringd <- c('python','java','ruby','php','huazhongda')
sub_str <- substr(stringd,start = 2,stop = 4)
sub_str
# [1] "yth" "ava" "uby" "hp"  "uaz"

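# substr() can also be used for assignment: replace characters 2-4 with 'aaa' (a string shorter than 4 characters, such as 'php', only has its existing positions overwritten)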
substr(stringd,start = 2,stop = 4) <- 'aaa'
stringd
# [1] "paaaon"     "jaaa"       "raaa"       "paa"        "haaahongda"
1.6 grep() and grepl()

For matching against more complex strings

# Syntax
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
     fixed = FALSE, useBytes = FALSE, invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
      fixed = FALSE, useBytes = FALSE)

Create a vector

seq_names <- c('EU_FRA02_C1_S2008','AF_COM12_B0_2004','AF_COM17_F0_S2008',
               'AS_CHN11_C3_2004','EU-FRA-C3-S2007','NAUSA02E02005',
               'AS_CHN12_N0_05','NA_USA03_C2_S2007','NA USA04 A3 2004',
               'EU_UK01_A0_2009','eu_fra_a2_s98','SA/BRA08/B0/1996')
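# Mixed upper and lower case, slashes and underscores as separators, confirmed and unconfirmed years...
# Take the first element: EU is Europe, FRA is France, 02 means the second sequence from France, C1 is the sequence subtype, 2008 is the year the sample was collected, and the S prefix marks 2008 as an estimated rather than confirmed year.

Use grep() to extract the elements for France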

fra_seq <- grep(pattern = 'FRA|fra',x=seq_names)
fra_seq
# [1]  1  5 11
seq_names[fra_seq]
# [1] "EU_FRA02_C1_S2008" "EU-FRA-C3-S2007"   "eu_fra_a2_s98" 

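# alternatively, set value = TRUE to return the matching elements themselves
fra_seq <- grep(pattern = 'FRA|fra',x=seq_names,value = TRUE)
fra_seq
# [1] "EU_FRA02_C1_S2008" "EU-FRA-C3-S2007"   "eu_fra_a2_s98"

# set ignore.case = TRUE to ignore case; a single spelling of the pattern is then enough
grep(pattern = 'fra',x=seq_names,value = TRUE,ignore.case = TRUE)
# [1] "EU_FRA02_C1_S2008" "EU-FRA-C3-S2007"   "eu_fra_a2_s98"
# the | in 'FRA|fra' above is regular-expression alternation ("FRA or fra")

grepl() returns TRUE or FALSE for each element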

grepl(pattern = 'FRA|fra',x=seq_names)
# [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
# subset with []
seq_names[grepl(pattern = 'FRA|fra',x=seq_names)]
# [1] "EU_FRA02_C1_S2008" "EU-FRA-C3-S2007"   "eu_fra_a2_s98" 

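✨ Exercise: extract the sequences in the vector above that have a confirmed collection year.
(Approach: find the sequences with an unconfirmed year, i.e. those where s or S precedes the year, then negate the result.)

spe_seq <- seq_names[!grepl(pattern = '[sS][0-9]{2,4}\\b',seq_names)]
spe_seq
# [1] "AF_COM12_B0_2004" "AS_CHN11_C3_2004" "NAUSA02E02005"    "AS_CHN12_N0_05"  
# [5] "NA USA04 A3 2004" "EU_UK01_A0_2009"  "SA/BRA08/B0/1996"

# In an R string, \\ produces a single backslash, so '\\b' is the regex \b, which matches a word boundary; placed at the right-hand end it ties the match to the end of a word.
# [sS] matches one character that is either s or S (no | is needed inside brackets; there it would be a literal |), [0-9] matches one digit, and {2,4} right after [0-9] means 2 to 4 digits.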
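A common slip with character classes is writing `[s|S]` to mean "s or S". Inside square brackets the pipe is an ordinary character, so that class also accepts a literal `|`. The sketch below shows the difference on a few made-up IDs (`test_ids` is hypothetical and not taken from `seq_names`):

```r
# Hypothetical IDs; the last one contains a literal pipe before the year
test_ids <- c("EU_S2008", "eu_s98", "AF_2004", "XX_|2004")

# With the pipe inside the brackets, "|2004" is matched as well
with_pipe <- grepl("[s|S][0-9]{2,4}\\b", test_ids)

# Without it, only s or S followed by 2-4 digits is matched
without_pipe <- grepl("[sS][0-9]{2,4}\\b", test_ids)

print(with_pipe)     # TRUE TRUE FALSE TRUE
print(without_pipe)  # TRUE TRUE FALSE FALSE
```

On the `seq_names` vector both patterns give the same answer, since none of its elements contains a pipe.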
1.7 gsub() and sub()
# Syntax
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
    fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
     fixed = FALSE, useBytes = FALSE)
money <- c('$1888','$2888','$3888')

# because of the dollar sign, as.numeric() cannot be applied directly
as.numeric(money)
# [1] NA NA NA

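gsub()

# $ has a special meaning in a regular expression (end of string), so it must be escaped as \\$; once it is removed, as.numeric() works.
money1 <- gsub('\\$',replacement = '',money)
money1
# [1] "1888" "2888" "3888"
as.numeric(money1)
# [1] 1888 2888 3888

gsub() replaces every match it finds in each string
sub() replaces only the first match it finds in each string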

sub('\\$',replacement = '',money)
# [1] "1888" "2888" "3888"

money <- c('$1888 $2888 $3888')
sub('\\$',replacement = '',money)
# [1] "1888 $2888 $3888"
gsub('\\$',replacement = '',money)
# [1] "1888 2888 3888"
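Real price strings often carry thousands separators as well as a currency symbol. A minimal sketch, assuming made-up values in `prices`, that strips both in one gsub() call:

```r
# Hypothetical price strings with a dollar sign and thousands separators
prices <- c("$1,888.50", "$2,888", "$38")

# Inside square brackets $ is a literal character, so no escaping is needed;
# gsub() removes every "$" and every "," in each string
clean <- gsub("[$,]", "", prices)

# The cleaned strings can now be converted to numbers
values <- as.numeric(clean)

print(clean)   # "1888.50" "2888" "38"
print(values)  # 1888.5 2888.0 38.0
```

sub() would not be enough here, because a string such as "$1,888.50" contains two characters to remove.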
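1.8 regexpr(), gregexpr() and regexec()

These functions are closely related: all of them report where a pattern matches. regexpr() gives the position of the first match in each element, gregexpr() gives the positions of all matches, and regexec() gives the first match together with any parenthesized sub-expressions.

# Syntax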
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
        fixed = FALSE, useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
         fixed = FALSE, useBytes = FALSE)
regexec(pattern, text, ignore.case = FALSE, perl = FALSE,
        fixed = FALSE, useBytes = FALSE)

Taking regexpr() as an example:

# find where 'pp' occurs in each element of test_string
test_string <- c('happy','apple','application','apolotoc')
regexpr('pp',test_string)
# [1]  3  2  2 -1
# attr(,"match.length")
# [1]  2  2  2 -1
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
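# The result 3 2 2 -1 means that 'pp' starts at position 3 in the first string and at position 2 in the second and third; the last string has no match, so -1 is returned. match.length gives the length of each match.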
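The positions returned by regexpr() and gregexpr() are mostly useful when passed on to regmatches(), which pulls out the matched text. A short sketch using the same `test_string` (the names `hits`, `matched` and `p_counts` are introduced only for this illustration):

```r
test_string <- c("happy", "apple", "application", "apolotoc")

# regexpr(): first match per element; regmatches() drops elements without a match
hits <- regexpr("pp", test_string)
matched <- regmatches(test_string, hits)

# gregexpr(): all matches per element; count how many times "p" occurs in each
all_p <- gregexpr("p", test_string)
p_counts <- lengths(regmatches(test_string, all_p))

print(matched)   # "pp" "pp" "pp"
print(p_counts)  # 2 2 2 1
```

Note that `matched` has only three elements: the string without a match is silently omitted, so its length can differ from the input.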
1.9 agrep() and agrepl()

Taking agrep() as an example

string1 <- c('I need a favour','my favorite sport','you made an error')
agrep('favor',string1)
# [1] 1 2

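agrep() does approximate (fuzzy) matching based on edit distance, so the British spelling 'favour' is matched by the American pattern 'favor'. It is not aware of spelling variants as such.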
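The edit distances behind this kind of fuzzy matching can be inspected with adist() from base R. A minimal sketch; the comparison words passed to adist() are chosen here for illustration and are not from the original example:

```r
string1 <- c("I need a favour", "my favorite sport", "you made an error")

# Default tolerance, as in the example above
default_hits <- agrep("favor", string1)

# Edit distance between the pattern and three whole words:
# one insertion turns "favor" into "favour"
word_distance <- adist("favor", c("favour", "favorite", "error"))

print(default_hits)   # 1 2
print(word_distance)  # a 1 x 3 matrix: 1 3 3
```

agrep() matches against substrings, which is why "my favorite sport" is a hit even though the whole word "favorite" is three edits away from "favor".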

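2. The stringr and stringi packages

stringr and stringi cover much the same ground. stringr is built on top of stringi and offers a smaller, simpler set of functions; stringi is the more comprehensive of the two, and its functions lean more heavily on regular expressions.

# list the functions in the two packages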
library(stringr)
library(stringi)
ls('package:stringr')
ls('package:stringi')
# at the time of writing, stringr had 52 functions and stringi had 252; the counts depend on the installed package versions.
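2.1 The stringr package ⚠️
Common functions                                   Purpose
str_split() / str_c()                              split and join strings
str_length()                                       number of characters in a string
str_sub()                                          extract characters by position
str_dup()                                          repeat (duplicate) a string
str_trim()                                         remove leading and trailing whitespace
str_to_upper() / str_to_lower() / str_to_title()   case conversion
str_locate()                                       locate a pattern in a string
str_detect(x, "h")                                 detect a pattern; returns logical values
str_extract() / str_extract_all()                  extract matching text
str_remove() / str_remove_all()                    remove matching text
str_replace() / str_replace_all()                  replace matching text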
  • 2.1.1 str_c() and str_split()
    str_c() works much like paste(), but joins with no separator by default (like paste0())
library(stringr)
str_c('a','b')
# [1] "ab"
str_c('a','b',sep='-')
# [1] "a-b"

str_split()⚠️

x <- "The birch canoe slid on the smooth planks."
x
# [1] "The birch canoe slid on the smooth planks."
str_split(x," ") # returns a list
# [[1]]
# [1] "The"     "birch"   "canoe"   "slid"    "on"     
# [6] "the"     "smooth"  "planks."
str_split(x," ")[[1]]  # take the first list element to get a character vector
# [1] "The"     "birch"   "canoe"   "slid"    "on"     
# [6] "the"     "smooth"  "planks."

y = c("john 150","mike 140","lucy 152")
str_split(y," ")
# [[1]]
# [1] "john" "150" 

# [[2]]
# [1] "mike" "140" 

# [[3]]
# [1] "lucy" "152" 
str_split(y," ",simplify = TRUE) # simplify = TRUE returns a character matrix ⚠️
#      [,1]   [,2] 
# [1,] "john" "150"
# [2,] "mike" "140"
# [3,] "lucy" "152"
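A character matrix like this is usually one step away from a data frame. As a cross-check that needs no extra package, the sketch below builds the same matrix with base R and converts the second column to numbers. The column names `name` and `value` are assumptions, since the original does not say what the numbers mean.

```r
y <- c("john 150", "mike 140", "lucy 152")

# Base R counterpart of str_split(y, " ", simplify = TRUE):
# split each string, then stack the pieces row by row
parts <- do.call(rbind, strsplit(y, " ", fixed = TRUE))

# Assumed column names; the second column is converted from text to numbers
scores <- data.frame(name = parts[, 1],
                     value = as.numeric(parts[, 2]),
                     stringsAsFactors = FALSE)

print(dim(parts))          # 3 2
print(mean(scores$value))  # about 147.33
```

The do.call(rbind, ...) approach assumes every string splits into the same number of pieces; with ragged input the rows would be recycled.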
  • 2.1.2 str_length()
    Counts the characters in each string, like nchar()

  • 2.1.3 str_sub(): extract characters by position

aaa <- 'huake tongji cardio'
str_sub(aaa,c(1,4,8),c(2,7,11))
# [1] "hu"   "ke t" "ongj"
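# the first piece is characters 1-2, the second is characters 4-7, the third is characters 8-11
  • 2.1.4 str_dup(): repeat strings
fruit <- c('apple','pear','banana')
str_dup(fruit,2) # 2 means repeat each string twice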
# [1] "appleapple"   "pearpear"     "bananabanana"

str_dup(fruit,2:4)
# [1] "appleapple"               "pearpearpear"             "bananabananabananabanana"
  • 2.1.5 str_trim(): remove leading and trailing whitespace
string <- c('  Huake is good    ')
string
# [1] "  Huake is good    "
str_trim(string,side = 'both')
# [1] "Huake is good"
  • 2.1.6 str_locate(): locate a pattern in a string
fruit <- c("apple", "banana", "pear", "pineapple")
str_locate(fruit, "$")
#      start end
# [1,]     6   5
# [2,]     7   6
# [3,]     5   4
# [4,]    10   9
str_locate(fruit, "a")
#      start end
# [1,]     1   1
# [2,]     2   2
# [3,]     3   3
# [4,]     5   5
str_locate(fruit, c("a", "b", "p", "p"))
#      start end
# [1,]     1   1
# [2,]     1   1
# [3,]     1   1
# [4,]     1   1
  • 2.1.7 str_detect(): detect a pattern ⚠️
fruit <- c("apple", "banana", "pear", "pinapple")
str_detect(fruit, "a")
# [1] TRUE TRUE TRUE TRUE
str_detect(fruit, "^a")
# [1]  TRUE FALSE FALSE FALSE
str_detect(fruit, "a$")
# [1] FALSE  TRUE FALSE FALSE
str_detect(fruit, "b")
# [1] FALSE  TRUE FALSE FALSE
str_detect(fruit, "[aeiou]")
# [1] TRUE TRUE TRUE TRUE

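❗️ Combining str_detect() with ifelse() splits strings into two classes according to whether they contain a given substring. This is often used in GEO and similar analyses to decide from the sample name whether a sample is a normal sample or a case (e.g. tumor) sample.

Usage:

ifelse(str_detect(colnames(a), 'tumor'), 'tumor', 'normal')
# For each column name of data frame a: return 'tumor' if the name contains "tumor", otherwise return 'normal'.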
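The same grouping step can be tried on a small made-up set of sample names. The sketch below uses base R's grepl() in place of str_detect() so that it runs without stringr; `sample_names` is hypothetical and stands in for `colnames(a)`.

```r
# Hypothetical sample names standing in for colnames(a)
sample_names <- c("tumor_01", "normal_01", "tumor_02", "ctrl_03")

# grepl() plays the role of str_detect(): TRUE where "tumor" occurs
is_tumor <- grepl("tumor", sample_names, fixed = TRUE)

# Everything that does not contain "tumor" is labelled "normal"
group <- ifelse(is_tumor, "tumor", "normal")

# A factor with "normal" as the reference level is a common next step
group <- factor(group, levels = c("normal", "tumor"))

print(table(group))  # normal: 2, tumor: 2
```

Note that anything not containing "tumor" falls into "normal", including names such as "ctrl_03", so it is worth checking the resulting table against the expected sample counts.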
  • 2.1.8 str_extract() and str_extract_all()
shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")
str_extract(shopping_list, "\\d")
# [1] "4" NA  NA  "2"
str_extract(shopping_list, "[a-z]+")
# [1] "apples" "bag"    "bag"    "milk"  
str_extract(shopping_list, "[a-z]{1,4}")
# [1] "appl" "bag"  "bag"  "milk"
str_extract(shopping_list, "\\b[a-z]{1,4}\\b")
# [1] NA     "bag"  "bag"  "milk"
str_extract_all(shopping_list, "[a-z]+")
# [[1]]
# [1] "apples" "x"     

# [[2]]
# [1] "bag"   "of"    "flour"

# [[3]]
# [1] "bag"   "of"    "sugar"

# [[4]]
# [1] "milk" "x"  
  • 2.1.9 str_remove() and str_remove_all()
fruits <- c("one apple", "two pears", "three bananas")
str_remove(fruits, "[aeiou]")
# [1] "ne apple"     "tw pears"     "thre bananas"
str_remove_all(fruits, "[aeiou]")
# [1] "n ppl"    "tw prs"   "thr bnns"
  • 2.1.10 str_replace() and str_replace_all()
fruits <- c("one apple", "two pears", "three bananas")
str_replace(fruits, "[aeiou]", "-")
# [1] "-ne apple"     "tw- pears"     "thr-e bananas"
str_replace_all(fruits, "[aeiou]", "-")
# [1] "-n- -ppl-"     "tw- p--rs"     "thr-- b-n-n-s"
2.2 The stringi package
  • 2.2.1 stri_join: pasting strings together
stri_join(1:13, letters)
#  [1] "1a"  "2b"  "3c"  "4d"  "5e"  "6f"  "7g"  "8h"  "9i"  "10j" "11k"
# [12] "12l" "13m" "1n"  "2o"  "3p"  "4q"  "5r"  "6s"  "7t"  "8u"  "9v" 
# [23] "10w" "11x" "12y" "13z"
stri_join(1:13, letters, sep=',')
#  [1] "1,a"  "2,b"  "3,c"  "4,d"  "5,e"  "6,f"  "7,g"  "8,h"  "9,i"  "10,j"
# [11] "11,k" "12,l" "13,m" "1,n"  "2,o"  "3,p"  "4,q"  "5,r"  "6,s"  "7,t" 
# [21] "8,u"  "9,v"  "10,w" "11,x" "12,y" "13,z"
stri_join(1:13, letters, collapse='; ')
# [1] "1a; 2b; 3c; 4d; 5e; 6f; 7g; 8h; 9i; 10j; 11k; 12l; 13m; 1n; 2o; 3p; 4q; 5r; 6s; 7t; 8u; 9v; 10w; 11x; 12y; 13z"
  • 2.2.2 stri_cmp_eq and stri_cmp_neq

stri_cmp_eq tests whether two strings are exactly equal
stri_cmp_neq tests whether two strings differ

stri_cmp_eq('AB','AB')
# [1] TRUE
stri_cmp_eq('AB','aB')
# [1] FALSE
stri_cmp_neq('AB','aB')
# [1] TRUE
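  • 2.2.3 stri_cmp_lt and stri_cmp_gt
    stri_cmp_lt: less than
    stri_cmp_gt: greater than
    Strings are compared in collation (dictionary) order, character by character, and the string that sorts later is the greater one. Strings of digits are compared the same way rather than by numeric value, so the result agrees with numeric order only when the numbers have the same number of digits, as in the examples here.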
stri_cmp_lt('121','221')
# [1] TRUE
stri_cmp_lt('a121','b221')
# [1] TRUE
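The difference between string order and numeric order is easy to see with base R's `<` on character values, which also compares character by character. This is a base R illustration only; the expectation that stri_cmp_lt gives the same answers with its default settings is an assumption worth checking on your own installation.

```r
# Same number of digits: string order and numeric order agree
same_width <- "121" < "221"

# Different number of digits: "1" sorts before "9", so "10" < "9" as strings
as_strings <- "10" < "9"

# Converting to numbers first gives the numeric answer
as_numbers <- as.numeric("10") < as.numeric("9")

print(same_width)  # TRUE
print(as_strings)  # TRUE
print(as_numbers)  # FALSE
```

When the values are really numbers stored as text, converting them with as.numeric() before comparing avoids this trap.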
  • 2.2.4 stri_count
s <- 'Lorem ipsum dolor sit amet, consectetur adipisicing elit.'
stri_count(s, fixed='dolor')
# [1] 1
stri_count(s, regex='\\p{L}+')
# [1] 8
  • 2.2.5 stri_dup
stri_dup('a', 1:5)
# [1] "a"     "aa"    "aaa"   "aaaa"  "aaaaa"
stri_dup(c('a', NA, 'ba'), 4)
# [1] "aaaa"     NA         "babababa"
stri_dup(c('abc', 'pqrst'), c(4, 2))
# [1] "abcabcabcabc" "pqrstpqrst"
  • 2.2.6 stri_detect_fixed
stri_detect_fixed(c('stringi R', 'R STRINGI', '123'), c('i', 'R', '0'))
# [1]  TRUE  TRUE FALSE

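The search is vectorized: each string in the first argument is searched for the corresponding pattern in the second, returning TRUE if it is found and FALSE otherwise.

  • 2.2.7 stri_detect_regex
# find the strings that start with ab, and those that end with t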
stri_detect_regex(c('above','abort','about','abnormal','abandon'),'^ab')
# [1] TRUE TRUE TRUE TRUE TRUE
stri_detect_regex(c('above','abort','about','abnormal','abandon'),'t\\b')
# [1] FALSE  TRUE  TRUE FALSE FALSE

# case_insensitive=TRUE ignores case
stri_detect_regex(c('ABove','abort','About','aBnormal','abandon'),'^ab',case_insensitive=TRUE)
# [1] TRUE TRUE TRUE TRUE TRUE
  • 2.2.8 stri_startswith_fixed: test whether a string starts with a given substring
stri_startswith_fixed(c('a1','a2','b3','a4','c5'),'a1')
# [1]  TRUE FALSE FALSE FALSE FALSE
stri_startswith_fixed(c('abaDc','asdfh','abiude'),'ba',from=2)
# [1]  TRUE FALSE FALSE
# from sets the character position at which matching starts
  • 2.2.9 stri_endswith_fixed: test whether a string ends with a given substring
stri_endswith_fixed(c('abaDc','asdfh','abiudba'),'ba')
# [1] FALSE FALSE  TRUE
stri_endswith_fixed(c('abaDc','asdfh','abiudba'),'ba',to=3)
# [1]  TRUE FALSE FALSE
# to sets the character position at which the match must end
  • 2.2.10 stri_extract_all
stri_extract_all('XaaaaX', regex=c('\\p{Ll}', '\\p{Ll}+', '\\p{Ll}{2,3}', '\\p{Ll}{2,3}?'))
# [[1]]
# [1] "a" "a" "a" "a"

# [[2]]
# [1] "aaaa"

# [[3]]
# [1] "aaa"

# [[4]]
# [1] "aa" "aa"

stri_extract_all('Bartolini', coll='i')
# [[1]]
# [1] "i" "i"

stri_extract_all('stringi is so good!', charclass='\\p{Zs}') # all white-spaces
# [[1]]
# [1] " " " " " "
  • 2.2.11 stri_extract_all_fixed
    overlap=TRUE allows overlapping matches, so one character can belong to more than one match
stri_extract_all_fixed('abaBAba', 'Aba', case_insensitive=TRUE)
# [[1]]
# [1] "aba" "Aba"
stri_extract_all_fixed('abaBAba', 'Aba', case_insensitive=TRUE, overlap=TRUE)
# [[1]]
# [1] "aba" "aBA" "Aba"
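  • 2.2.12 stri_extract_all_boundaries: split a string at text boundaries
    In this example the text is cut after each run of whitespace, so the extracted pieces keep their trailing spaces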
stri_extract_all_boundaries('stringi: THE string processing package 123.48...')
# [[1]]
# [1] "stringi: "   "THE "        "string "    
# [4] "processing " "package "    "123.48..." 
  • 2.2.13 stri_extract_all_words: extract words
stri_extract_all_words('stringi: THE string processing package 123.48...')
# [[1]]
# [1] "stringi"    "THE"        "string"     "processing"
# [5] "package"    "123.48"
  • 2.2.14 stri_isempty: test whether each string is empty
    Note: a string containing only a space is not an empty string
stri_isempty(letters[1:3])
# [1] FALSE FALSE FALSE
stri_isempty(c(',', '', 'abc', '123', '\u0105\u0104'))
# [1] FALSE  TRUE FALSE FALSE FALSE
stri_isempty(character(1))
# [1] TRUE
  • 2.2.15 stri_locate_all: find every position at which a pattern occurs in a string
stri_locate_all('Bartolini', fixed='i')
# [[1]]
#      start end
# [1,]     7   7
# [2,]     9   9
