Spark SQL 字符串函数汇总

本文总结一些常用的字符串函数。还是在databricks社区版。

  • 字符串截取函数:substr \ substring

  • 字符串的长度函数 len \ length

  • 字符串定位函数 instr

  • 字符串分割函数 split \ split_part

  • 字符串去空格函数:trim \ ltrim \ rtrim

  • 字符串补足函数:lpad \ rpad

  • 字符串拼接函数: concat \ concat_ ws

  • 字符串替换函数: replace \ regexp_replace

  • 正则表达式相关函数:regexp

  • 字符串模糊匹配函数: like \ rlike

  • 字符串转大小写函数: lower Icase \ upper \ ucase

  • json解析函数get json_object

  • 重复字符串函数:repeat

  • URL解析函数:parse url

  • 集合查找函数:find_in_set

  • 字符串反转函数:reverse

CREATE TABLE temp (id int,name string,email string,phone string)
INSERT INTO temp VALUES
  (1, 'John Doe', '[email protected]', '123-456-7890'),
  (2, 'Jane Smith', '[email protected]', '555-555-5555'),
  (3, 'Bob Johnson', '[email protected]', '555-123-4567'),
  (4, 'Alice Brown', '[email protected]', '555-987-6543'),
  (5, 'Charlie Davis', '[email protected]', '555-111-2222');
id name email phone
1 John Doe [email protected] 123-456-7890
2 Jane Smith [email protected] 555-555-5555
3 Bob Johnson [email protected] 555-123-4567
4 Alice Brown [email protected] 555-987-6543
5 Charlie Davis [email protected] 555-111-2222
字符串截取函数:substr \ substring

语法1: substr(string A, int start),substring(string A, int start)

返回值: string

**说明:**返回字符串A从start位置到结尾的字符串

语法2: substr(string A, int start, int len),substring(string A, int start, int len)

返回值: string

**说明:**返回字符串A从start位置开始,长度为len的字符串

select t1.*,substr(t1.phone,1,3) as area_number
from temp as t1
id name email phone area_number
1 John Doe [email protected] 123-456-7890 123
2 Jane Smith [email protected] 555-555-5555 555
3 Bob Johnson [email protected] 555-123-4567 555
4 Alice Brown [email protected] 555-987-6543 555
5 Charlie Davis [email protected] 555-111-2222 555
substring

substring(str, pos[, len]) - Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len.

substring(str FROM pos[ FOR len]]) - Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len.

substring(str, pos[, len])-返回从pos开始且长度为Len的str子字符串,或从pos开始且长度为len的字节数组切片。

substring(str FROM pos[ FOR len]])-返回str的子字符串,从pos开始,是长度len,或从pos开始且长度为en的字节数组切片

select 
   t1.phone
  ,substring(t1.phone,1,3) as area_num
  ,substring(t1.phone,9)as tail_num
  ,substring(t1.phone,-4) as tail_num
  ,substring(t1.phone from -4) as tail_num
  ,substring(t1.phone from 5 for 1) as mid_num -- 从第5位开始的第一个
  ,substring(t1.phone from 5 for 3) as mid_num 
from temp as t1
phone area_num tail_num tail_num tail_num mid_num mid_num
123-456-7890 123 7890 7890 7890 4 456
555-555-5555 555 5555 5555 5555 5 555
555-123-4567 555 4567 4567 4567 1 123
555-987-6543 555 6543 6543 6543 9 987
555-111-2222 555 2222 2222 2222 1 111
substring_index

substring_index(str, delim, count) - Returns the substring from str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. The function substring_index performs a case-sensitive match when searching for delim.

substring_index(str, Delim,count)-返回str中的子串,然后才出现分隔符的count。如果count为正,则返回分隔符左侧的所有内容(从左侧开始计数)。如果count为负,则返回最终分隔符右侧的所有内容(从右侧开始计数)。该函数substring_index在搜索Delim时执行区分大小写的匹配。

select 
   t1.email
  ,substring_index(t1.email,'.',1) as name
  ,substring_index(t1.email,'.',-1) as com
  ,substring_index(t1.email,'@',1) as before
  ,substring_index(t1.email,'@',-1) as after
from temp as t1
email name com before after
[email protected] john com john.doe example.com
[email protected] jane com jane.smith example.com
[email protected] bob com bob.johnson example.com
[email protected] alice com alice.brown example.com
[email protected] charlie com charlie.davis example.com
SELECT substring_index('www.apache.org', '.', 2)as wangzhi
-- www.apache
字符串的长度函数 len \ length
select 
   name
  ,len(name) as l_name
  ,length(name) as l_name
from temp
name l_name l_name
John Doe 8 8
Jane Smith 10 10
Bob Johnson 11 11
Alice Brown 11 11
Charlie Davis 13 13
字符串定位函数 instr
-- 注意细节
select 
    email
   ,instr(email,'@') as in_a
   ,len(email) as l_email
   ,substr(email,instr(email,'@')) as e_before
   ,substr(email,instr(email,'@'),len(email)) as e_before
   ,substr(email,instr(email,'@')+1,len(email)) as e_com
   ,substr(email,1,instr(email,'@')-1) as e_after
from temp
email in_a l_email e_before e_before e_com e_after
[email protected] 9 20 @example.com @example.com example.com john.doe
[email protected] 11 22 @example.com @example.com example.com jane.smith
[email protected] 12 23 @example.com @example.com example.com bob.johnson
[email protected] 12 23 @example.com @example.com example.com alice.brown
[email protected] 14 25 @example.com @example.com example.com charlie.davis
字符串分割 split \ split_part

split(str, regex, limit) - Splits str around occurrences that match regex and returns an array with a length of at most limit

Arguments:

  • str - a string expression to split.
  • regex - a string representing a regular expression. The regex string should be a Java regular expression.
  • limit - an integer expression which controls the number of times the regex is applied.
  • limit > 0: The resulting array’s length will not be more than limit, and the resulting array’s last entry will contain all input beyond the last matched regex.
  • limit <= 0: regex will be applied as many times as possible, and the resulting array can be of any size.

拆分(str, regex,limited)-将str拆分为与regex匹配的事件,并返回一个长度最多为限制的数组

论据:

  • str-要拆分的字符串表达式。
  • 正则表达式-表示正则表达式的字符串。正则表达式字符串应该是一个Java的正则表达式。
  • 限制-控制应用正则表达式的次数的整数表达式。
  • 限制>0:结果数组的长度不会超过限制,并且结果数组的最后一个条目将包含超过最后一个匹配正则表达式的所有输入。
  • limited<=0:正则表达式将被应用尽可能多的次数,并且生成的数组可以是任何大小。
split
select 
    email
   ,split(email,'example.') as e_before
   ,split(email,'@') as e_before
from temp12
email e_before e_before
[email protected] [“john.doe@”,“com”] [“john.doe”,“example.com”]
[email protected] [“jane.smith@”,“com”] [“jane.smith”,“example.com”]
[email protected] [“bob.johnson@”,“com”] [“bob.johnson”,“example.com”]
[email protected] [“alice.brown@”,“com”] [“alice.brown”,“example.com”]
[email protected] [“charlie.davis@”,“com”] [“charlie.davis”,“example.com”]
select 
    split('001_prod_7532_20230510','_',1) as e_1
   ,split('001_prod_7532_20230510','_',2) as e_2
   ,split('001_prod_7532_20230510','_',3) as e_3
 -- 但是不是我想要的形式
e_1 e_2 e_3
[“001_prod_7532_20230510”] [“001”,“prod_7532_20230510”] [“001”,“prod”,“7532_20230510”]
split_part

split_part(str, delimiter, partNum) - Splits str by delimiter and return requested part of the split (1-based). If any input is null, returns null. if partNum is out of range of split parts, returns empty string. If partNum is 0, throws an error. If partNum is negative, the parts are counted backward from the end of the string. If the delimiter is an empty string, the str is not split.

split_part(str, delimiter, partNum) -按分隔符拆分str并返回拆分的请求部分(基于1)。如果任何输入为空,则返回空。如果partNum超出分割部分的范围,则返回空字符串。如果partNum为0,则抛出错误。如果partNum为负数,则从字符串末尾开始向后计数。如果分隔符是空字符串,则不拆分该str。

-- 这下就对了
select 
    email
   ,split_part(email,'.',1) as e_before
   ,split_part(email,'.',2) as e_before
from temp12
email e_before e_before
[email protected] john doe@example
[email protected] jane smith@example
[email protected] bob johnson@example
[email protected] alice brown@example
[email protected] charlie davis@example
select 
    split_part('001_prod_7532_20230510','_',1) as e_1
   ,split_part('001_prod_7532_20230510','_',2) as e_2
   ,split_part('001_prod_7532_20230510','_',3) as e_3
e_1 e_2 e_3
001 prod 7532
去空格函数:trim \ ltrim \ rtrim

trim(str) - Removes the leading and trailing space characters from str.

trim(BOTH FROM str) - Removes the leading and trailing space characters from str.

trim(LEADING FROM str) - Removes the leading space characters from str.

trim(TRAILING FROM str) - Removes the trailing space characters from str.

trim(trimStr FROM str) - Remove the leading and trailing trimStr characters from str.

trim(BOTH trimStr FROM str) - Remove the leading and trailing trimStr characters from str.

trim(LEADING trimStr FROM str) - Remove the leading trimStr characters from str.

trim(TRAILING trimStr FROM str) - Remove the trailing trimStr characters from str.

Arguments:

  • str - a string expression
  • trimStr - the trim string characters to trim, the default value is a single space
  • BOTH, FROM - these are keywords to specify trimming string characters from both ends of the string
  • LEADING, FROM - these are keywords to specify trimming string characters from the left end of the string
  • TRAILING, FROM - these are keywords to specify trimming string characters from the right end of the string

trim(str) -从str中删除前后空格字符
trim(BOTH FROM str) -从str中删除前导和尾随空格字符。
trim(LEADING FROM str) -从str中删除前导空格字符
trim(TRAILING FROM str) -从str中删除尾随空格字符
trim(trimStr FROM str) -从str中删除前后的trimStr字符
trim(BOTH trimStr FROM str) -从str中删除开头和结尾的trinstr字符。
trim(LEADING trimStr FROM str) -从str中删除前导trimStr字符
trim(TRAILING trimStr FROM str) -从str中删除尾随的trinstr字符
参数:
•STR—字符串表达式
•trim Str -要修剪的字符串字符,默认值是一个空格

•BOTH, FROM -这些关键字用于指定从字符串的两端修剪字符串字符
•LEADING, FROM -这些关键字用于指定从字符串的左端修剪字符串字符
•TRAILING, FROM -这些关键字用于指定从字符串的右端修剪字符串字符

trim
select 
   lizi
  ,length(lizi) as l_lizi
  ,trim(lizi) as lizi1
  ,ltrim(lizi) as l_trim
  ,rtrim(lizi) as r_trim
  ,len(ltrim(lizi)) as l_trim_l
  ,len(rtrim(lizi)) as r_trim_l
  ,trim(both from lizi) as lizi2
  ,trim(LEADING from lizi) as lizi3
  ,trim(TRAILING from lizi) as lizi4
from 
(
  select ' SparkSQLHive  ' as lizi
  union all 
  select ' SparkSQLHive  ' as lizi
  union all 
  select ' SparkSQLHive  ' as lizi
  union all 
  select ' SparkSQLHive  ' as lizi
)

lizi l_lizi lizi1 l_trim r_trim l_trim_l r_trim_l lizi2 lizi3 lizi4
SparkSQLHive 15 SparkSQLHive SparkSQLHive SparkSQLHive 14 13 SparkSQLHive SparkSQLHive SparkSQLHive
SparkSQLHive 15 SparkSQLHive SparkSQLHive SparkSQLHive 14 13 SparkSQLHive SparkSQLHive SparkSQLHive
SparkSQLHive 15 SparkSQLHive SparkSQLHive SparkSQLHive 14 13 SparkSQLHive SparkSQLHive SparkSQLHive
SparkSQLHive 15 SparkSQLHive SparkSQLHive SparkSQLHive 14 13 SparkSQLHive SparkSQLHive SparkSQLHive
select 
   lizi
  ,trim('SL' from lizi) as lizi5
  ,trim(LEADING 'SL' from lizi) as lizi6
  ,trim(TRAILING 'SL' from lizi) as lizi7
from
  (
    select 'SSparkSQLS' as lizi
  )
lizi lizi5 lizi6 lizi7
SSparkSQLS parkSQ parkSQLS SSparkSQ
补足函数:lpad \ rpad

lpad(str, len[, pad]) - Returns str, left-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters or bytes. If pad is not specified, str will be padded to the left with space characters if it is a character string, and with zeros if it is a byte sequence.

lpad(str, lenl,padl)-返回str,用pad左填充到len的长度如果str比len长,则返回值缩短为len字符或字节。如果未指定pad,如果是字符串,str将在左侧填充空格字符,如果是字节序列,则填充零。

select 
   name
  ,lpad(name,20,'20230520') as l_name
  ,rpad(name,20,'20230520') as r_name
  ,len(lpad(name,20,'20230520')) as len_name
from 
  temp12
拼接函数 concat \ concat_ws

concat

concat(col1, col2, …, colN) - Returns the concatenation of col1, col2, …, colN.

Concat (col1,col2,. a,colN)-返回 col1,col2,. ,CoIN 的串联。

concat_ws

concat_ws(sep[, str | array(str)]+) - Returns the concatenation of the strings separated by sep.

Concat _ ws (sepl,str l array (str) J +)-返回由 sep 分隔的字符串的串联

select 
  id,name
  ,concat(id,name) as id_name
  ,concat(id,'_',name) as id_name2
  ,concat_ws(id,'_',name) as id_name3
  ,concat_ws('_',id,name) as id_name4
  ,id||'_'||name as id_name5
from 
  temp12
id name id_name id_name2 id_name3 id_name4 id_name5
1 John Doe 1John Doe 1_John Doe _1John Doe 1_John Doe 1_John Doe
2 Jane Smith 2Jane Smith 2_Jane Smith _2Jane Smith 2_Jane Smith 2_Jane Smith
3 Bob Johnson 3Bob Johnson 3_Bob Johnson _3Bob Johnson 3_Bob Johnson 3_Bob Johnson
4 Alice Brown 4Alice Brown 4_Alice Brown _4Alice Brown 4_Alice Brown 4_Alice Brown
5 Charlie Davis 5Charlie Davis 5_Charlie Davis _5Charlie Davis 5_Charlie Davis 5_Charlie Davis
替换函数 replace \ regexp_replace

replace(str, search[, replace]) - Replaces all occurrences of search with replace.

Arguments:

  • str - a string expression

  • search - a string expression. If search is not found in str, str is returned unchanged.

  • replace - a string expression. If replace is not specified or is an empty string, nothing replaces the string that is removed from str.

replace(str, search[, replace]) -用replace替换所有搜索
参数:
•str -一个字符串表达式
•search—字符串表达式。如果在str中没有找到搜索,则返回不变的str。
•replace -一个字符串表达式。如果replace未指定或为空字符串,则不会替换从str中删除的字符串

select 
   email
  ,replace(email,'@') as e_12
  ,replace(email,'@','') as e_12
  ,replace(email,'@','#') as e_13
  ,regexp_replace(email,'@','#') as e_14
from 
  temp12
email e_12 e_12 e_13 e_14
[email protected] john.doeexample.com john.doeexample.com john.doe#example.com john.doe#example.com
[email protected] jane.smithexample.com jane.smithexample.com jane.smith#example.com jane.smith#example.com
[email protected] bob.johnsonexample.com bob.johnsonexample.com bob.johnson#example.com bob.johnson#example.com
[email protected] alice.brownexample.com alice.brownexample.com alice.brown#example.com alice.brown#example.com
[email protected] charlie.davisexample.com charlie.davisexample.com charlie.davis#example.com charlie.davis#example.com
正则表达式相关函数:regexp
CREATE TABLE example_data (
  id INT,
  name VARCHAR(50),
  qq_email VARCHAR(50),
  website VARCHAR(50)
);

INSERT INTO example_data VALUES
(1, 'John Smith', '[email protected]', 'www.example.com'),
(2, 'Jane Doe', '[email protected]', 'www.example2.com'),
(3, 'Bob Johnson', '[email protected]', 'www.example2.com'),
(4, 'Alice Brown', '[email protected]', 'www.example.com'),
(5, 'Charlie Green', '[email protected]', 'www.example.com'),
(6, 'David Lee', '[email protected]', 'www.example2.com');
select * from example_data
id name qq_email website
1 John Smith [email protected] www.example.com
2 Jane Doe [email protected] www.example2.com
3 Bob Johnson [email protected] www.example2.com
4 Alice Brown [email protected] www.example.com
5 Charlie Green [email protected] www.example.com
6 David Lee [email protected] www.example2.com
regexp
SELECT * FROM example_data 
WHERE regexp(website, '.*example2.*');
id name qq_email website
2 Jane Doe [email protected] www.example2.com
3 Bob Johnson [email protected] www.example2.com
6 David Lee [email protected] www.example2.com
SELECT * FROM example_data 
WHERE regexp(name, '^J.*');
id name qq_email website
1 John Smith [email protected] www.example.com
2 Jane Doe [email protected] www.example2.com
正则表达式替换函数:regexp_replace

语法: regexp_replace(string A, string B, string C)

返回值: string

**说明:**将字符串A中的符合java正则表达式B的部分替换为C。注意,在有些情况下要使用转义字符,类似oracle中的regexp_replace函数

regexp_count
SELECT id, qq_email, regexp_count(qq_email, '.*qq.*') AS count
FROM example_data
WHERE regexp(qq_email, '.*qq.*')
ORDER BY count DESC;
id qq_email count
1 [email protected] 1
2 [email protected] 1
3 [email protected] 1
4 [email protected] 1
5 [email protected] 1
6 [email protected] 1
正则表达式解析函数:regexp_extract
SELECT qq_email, 
       regexp_extract(qq_email, '^(.+)@', 1) AS username, 
       regexp_extract(qq_email, '@(.+)$', 1) AS domain 
FROM example_data;

qq_email username domain
[email protected] john.smith qq.com
[email protected] john.smith qq.com
[email protected] bob.johnson qq.com
[email protected] alice.brown qq.com
[email protected] alice.brown qq.com
[email protected] david.lee qq.com
SELECT qq_email, 
       regexp_instr(qq_email, 'o') AS position 
FROM example_data
qq_email position
[email protected] 2
[email protected] 2
[email protected] 2
[email protected] 9
[email protected] 9
[email protected] 15
SELECT qq_email 
FROM example_data
WHERE regexp_like(qq_email, 'smith');
qq_email
[email protected]
[email protected]
模糊匹配函数 like / rlike
SELECT *
FROM example_data
WHERE qq_email LIKE '%[email protected]';
id name qq_email website
1 John Smith [email protected] www.example.com
2 Jane Doe [email protected] www.example2.com
SELECT *
FROM example_data
WHERE qq_email RLIKE 'brown';
id name qq_email website
4 Alice Brown [email protected] www.example.com
5 Charlie Green [email protected] www.example.com
字符串转大小写函数 lower \ lcase \ upper \ ucase
select 
    name
   ,lower(name) as l_name
   ,lcase(name) as l_name
   ,upper(name) as u_name
   ,ucase(name) as u_name
from example_data
name l_name l_name u_name u_name
John Smith john smith john smith JOHN SMITH JOHN SMITH
Jane Doe jane doe jane doe JANE DOE JANE DOE
Bob Johnson bob johnson bob johnson BOB JOHNSON BOB JOHNSON
Alice Brown alice brown alice brown ALICE BROWN ALICE BROWN
Charlie Green charlie green charlie green CHARLIE GREEN CHARLIE GREEN
David Lee david lee david lee DAVID LEE DAVID LEE
json解析函数get_json_object

get_json_object(json_txt, path) - Extracts a json object from path.

get_json_object(json_txt, path) -从path中提取一个json对象

SELECT get_json_object('{"name": "John", "age": 30}', '$.name') as name;
name
John
重复字符串函数:repeat
select name,repeat(name,2) as name2
from temp12
name name2
John Doe John DoeJohn Doe
Jane Smith Jane SmithJane Smith
Bob Johnson Bob JohnsonBob Johnson
Alice Brown Alice BrownAlice Brown
Charlie Davis Charlie DavisCharlie Davis
URL解析函数:parse_url

https://learn.microsoft.com/zh-cn/azure/databricks/sql/language-manual/functions/parse_url

> SELECT parse_url('http://spark.apache.org/path?query=1', 'HOST');
 spark.apache.org

> SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY');
 query=1

> SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY', 'query');
 1

> SELECT parse_url('http://spark.  apache.org/path?query=1', 'QUERY', 'query');
Error: Illegal argument
集合查找函数: find_in_set

find_in_set(str, str_array) - Returns the index (1-based) of the given string (str) in the comma-delimited list (str_array). Returns 0, if the string was not found or if the given string (str) contains a comma.

find_in_set(str,str_array)-返回逗号分隔列表(str_array)中给定字符串(str)的索引(基于1)。如果未找到字符串或给定字符串(str)包含逗号,则返回O。

语法: find_in_set(string str, string strList)

返回值: int

说明: 返回str在strlist第一次出现的位置,strlist是用逗号分割的字符串。如果没有找该str字符,则返回0

**使用:**select find_in_set(‘ab’,‘ef,ab,de’) from student limit 1;

2

> SELECT find_in_set(‘ab’,‘abc,b,ab,c,def’);

​ 3

字符串反转函数:reverse
select name,reverse(name) as r_name
from temp12
name r_name
John Doe eoD nhoJ
Jane Smith htimS enaJ
Bob Johnson nosnhoJ boB
Alice Brown nworB ecilA
Charlie Davis sivaD eilrahC

你可能感兴趣的:(spark,sql,大数据)