Spark SQL Operations: Function Roundup (Part 1)

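  • Opening ramble
  • Environment
  • Overview
  • Built-in function details
    • org.apache.spark.sql.functions
    • Aggregate functions
    • Collection functions
    • Date and time functions
    • String functions
    • Some less common cross-column functions
    • The if...else of SQL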

Opening ramble

I have been slacking off for a long while again, so here is a new post.

Environment

1. JDK 1.8
2. Spark 2.1

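Overview

Like every traditional relational database, Spark SQL ships with many built-in functions for processing data. It also knows that it cannot satisfy everyone's appetite, since business requirements are endless, so unsurprisingly it provides a user-defined function (UDF) interface as a fallback. Window functions (also called analytic functions) are available too. I have seen people claim that knowing window functions makes you some kind of wizard, which is nonsense: a window function is just a function evaluated over a range of rows. The built-in functions are demonstrated with examples below; window functions and UDFs are left for the next post.

Built-in function details

org.apache.spark.sql.functions

Almost all built-in functions live in this object: aggregate functions, collection functions, date and time functions, string functions, math functions, sorting functions, window functions, and a grab bag of others. I opened the official documentation page and counted the keyword: roughly 299 functions. Writing up that many functions would... make me hungry. So I will just pick a handful from each category.

Aggregate functions

There are many statistical functions among the aggregates, such as the Pearson correlation coefficient, kurtosis, skewness, standard deviation, mean, sum, and sum of distinct values. Then there are the counting functions: plain count, distinct count, and approximate distinct count.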

scala> spark.range(1,10).createOrReplaceTempView("aaa")
 
scala> spark.sql("select * from aaa").show
+---+
| id|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+
 
### variance is an alias for var_samp ###
scala> spark.sql("select id,id*2 as double_id from aaa").select(
     |   avg("id"),variance("id"),skewness("id"),kurtosis("id"),
     |   corr("id","double_id")).show
+-------+------------+------------+------------+-------------------+
|avg(id)|var_samp(id)|skewness(id)|kurtosis(id)|corr(id, double_id)|
+-------+------------+------------+------------+-------------------+
|    5.0|         7.5|         0.0|       -1.23|                1.0|
+-------+------------+------------+------------+-------------------+
 
scala> spark.sql("select id%3 as id from aaa").select(
     |   count("id"),countDistinct("id"),sum("id"),sumDistinct("id")
     | ).show
+---------+------------------+-------+----------------+
|count(id)|count(DISTINCT id)|sum(id)|sum(DISTINCT id)|
+---------+------------------+-------+----------------+
|        9|                 3|      9|               3|
+---------+------------------+-------+----------------+
 
scala> spark.sql("select id%3 as id from aaa").select(
     |   collect_set("id"), collect_list("id"), 
     |   sort_array(collect_list("id"))).show(false)
+---------------+---------------------------+----------------------------------+
|collect_set(id)|collect_list(id)           |sort_array(collect_list(id), true)|
+---------------+---------------------------+----------------------------------+
|[0, 1, 2]      |[1, 2, 0, 1, 2, 0, 1, 2, 0]|[0, 0, 0, 1, 1, 1, 2, 2, 2]       |
+---------------+---------------------------+----------------------------------+
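
To see where the numbers in the first aggregate example come from, here is a minimal plain-Scala sketch that recomputes them for the values 1 to 9 without Spark. It assumes the population-moment definitions of skewness and excess kurtosis; the names MomentsSketch and moments are made up for this illustration and are not part of any Spark API.

```scala
object MomentsSketch {
  // Returns (sample variance, skewness, excess kurtosis) of xs.
  // Assumption: skewness and kurtosis use population central moments.
  def moments(xs: Seq[Double]): (Double, Double, Double) = {
    val n = xs.size.toDouble
    val mean = xs.sum / n
    // Sum of the k-th powers of the deviations from the mean.
    def central(k: Int): Double = xs.map(x => math.pow(x - mean, k)).sum
    val m2 = central(2)
    val m3 = central(3)
    val m4 = central(4)
    val varSamp = m2 / (n - 1)                        // divides by n - 1
    val skewness = math.sqrt(n) * m3 / math.pow(m2, 1.5)
    val kurtosis = n * m4 / (m2 * m2) - 3.0           // "excess" kurtosis
    (varSamp, skewness, kurtosis)
  }

  def main(args: Array[String]): Unit = {
    val (v, s, k) = moments((1 to 9).map(_.toDouble))
    println(s"var_samp=$v skewness=$s kurtosis=$k")
  }
}
```

For 1 to 9 the deviations are -4 to 4, so the sample variance is 60 / 8 = 7.5, the skewness is 0 because the data is symmetric, and the excess kurtosis works out to 1.77 - 3 = -1.23, which matches the table.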

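Collection functions

The collection functions you will use most are probably the ones that turn one row into multiple rows.

scala> val df = spark.createDataset(Seq(
     |   ("aaa",List(1,2,3)),("bbb",List(3,4)))).toDF("key1","key2")
df: org.apache.spark.sql.DataFrame = [key1: string, key2: array]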
 
scala> df.show
+----+---------+
|key1|     key2|
+----+---------+
| aaa|[1, 2, 3]|
| bbb|   [3, 4]|
+----+---------+
 
### Note: this is where one row becomes several ###
scala> df.select($"key1", explode($"key2").as("explode_col")).show
+----+-----------+
|key1|explode_col|
+----+-----------+
| aaa|          1|
| aaa|          2|
| aaa|          3|
| bbb|          3|
| bbb|          4|
+----+-----------+
 
### There is also posexplode, which additionally returns each element's original position in the array
scala> df.select($"key1", posexplode($"key2").as("key2_pos"::"key2_val"::Nil)).show
+----+--------+--------+
|key1|key2_pos|key2_val|
+----+--------+--------+
| aaa|       0|       1|
| aaa|       1|       2|
| aaa|       2|       3|
| bbb|       0|       3|
| bbb|       1|       4|
+----+--------+--------+
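
If it helps to think of explode and posexplode in terms of ordinary collections, the following plain-Scala sketch mimics them with flatMap and zipWithIndex. It is only an analogy under that assumption, not how Spark implements them; ExplodeSketch and its methods are hypothetical names.

```scala
object ExplodeSketch {
  // One output row per array element, keeping the key next to it.
  def explode[K, V](rows: Seq[(K, Seq[V])]): Seq[(K, V)] =
    rows.flatMap { case (key, values) => values.map(v => (key, v)) }

  // Same as explode, but also emits the 0-based position of each element.
  def posexplode[K, V](rows: Seq[(K, Seq[V])]): Seq[(K, Int, V)] =
    rows.flatMap { case (key, values) =>
      values.zipWithIndex.map { case (v, pos) => (key, pos, v) }
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(("aaa", List(1, 2, 3)), ("bbb", List(3, 4)))
    explode(rows).foreach(println)
    posexplode(rows).foreach(println)
  }
}
```

The two input rows produce five output rows, in the same order as the tables above.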

There are also some functions for handling JSON data that are worth knowing.

scala> val df = spark.createDataset(Seq(
     |   ("aaa",1,2),("bbb",3,4),("ccc",3,5),("bbb",4, 6))).toDF("key1","key2","key3")
df: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 1 more field]
 
scala> df.show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+
 
### Convert the named columns into a JSON record ###
scala> val df_json=df.select(to_json(struct($"key1",$"key2",$"key3")).as("json_key")).cache
df_json: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [json_key: string]
 
scala> df_json.show(false)
+--------------------------------+
|json_key                        |
+--------------------------------+
|{"key1":"aaa","key2":1,"key3":2}|
|{"key1":"bbb","key2":3,"key3":4}|
|{"key1":"ccc","key2":3,"key3":5}|
|{"key1":"bbb","key2":4,"key3":6}|
+--------------------------------+
 
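### Conversely, how do we turn JSON data back into columns? ###
### StructType and the other type classes need an explicit import in spark-shell ###
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val df_json_2 = df_json.select(from_json($"json_key", new StructType().add("key1",StringType).add("key2",LongType).add("key3",LongType)).as("json_data")).cache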
df_json_2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [json_data: struct]
 
### Not quite there yet: everything is still in a single column ###
scala> df_json_2.show
+---------+
|json_data|
+---------+
|[aaa,1,2]|
|[bbb,3,4]|
|[ccc,3,5]|
|[bbb,4,6]|
+---------+
 
### There, now the columns are restored ###
scala> df_json_2.select($"json_data.key1".as("key1"), $"json_data.key2".as("key2"), $"json_data.key3".as("key3")).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+

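Date and time functions

There are quite a few date and time functions: all kinds of arithmetic and all kinds of format conversions. The nicest part is that many of them accept date, timestamp, and string inputs alike, which is very convenient.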

scala> val df = spark.sql("select current_date() as dt, current_timestamp() as ts, '2019-05-16 04:00:00' as tm_str").cache
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [dt: date, ts: timestamp ... 1 more field]
 
scala> df.printSchema
root
 |-- dt: date (nullable = false)
 |-- ts: timestamp (nullable = false)
 |-- tm_str: string (nullable = false)
 
scala> df.show(false)
+----------+-----------------------+-------------------+
|dt        |ts                     |tm_str             |
+----------+-----------------------+-------------------+
|2019-05-16|2019-05-16 05:24:01.874|2019-05-16 04:00:00|
+----------+-----------------------+-------------------+
 
scala> df.select(dayofyear($"dt"), dayofyear($"ts"), dayofyear($"tm_str")).show
+-------------+-------------+-----------------+
|dayofyear(dt)|dayofyear(ts)|dayofyear(tm_str)|
+-------------+-------------+-----------------+
|          136|          136|              136|
+-------------+-------------+-----------------+
 
scala> df.select(hour($"dt"), hour($"ts"), hour($"tm_str")).show
+--------+--------+------------+
|hour(dt)|hour(ts)|hour(tm_str)|
+--------+--------+------------+
|       0|       5|           4|
+--------+--------+------------+
 
### dt is a date and ts is a timestamp; add_months works on both directly ###
### ts is a timestamp and tm_str is a string; date_format works on both directly ###
### Pattern letters are case-sensitive: ss is seconds (SS would be milliseconds), yyyy is the calendar year (YYYY would be the week year) ###
scala> df.select(datediff(add_months($"dt",2), add_months($"ts",1)).as("diff_days"),
     |   date_format($"ts","HH:mm:ss"), date_format($"tm_str","MM/dd/yyyy")).show
+---------+-------------------------+-------------------------------+
|diff_days|date_format(ts, HH:mm:ss)|date_format(tm_str, MM/dd/yyyy)|
+---------+-------------------------+-------------------------------+
|       30|                 05:24:01|                     05/16/2019|
+---------+-------------------------+-------------------------------+
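
The pattern letters passed to date_format are case-sensitive, which is easy to trip over. The sketch below uses java.text.SimpleDateFormat directly, on the assumption that its pattern letters behave the same way as the ones date_format accepts in this Spark version. The helper name reformat, the fixed UTC time zone, and Locale.US are choices made for this illustration only.

```scala
import java.text.SimpleDateFormat
import java.util.{Locale, TimeZone}

object DatePatternSketch {
  // Builds a formatter pinned to UTC and Locale.US so results are reproducible.
  private def formatter(pattern: String): SimpleDateFormat = {
    val f = new SimpleDateFormat(pattern, Locale.US)
    f.setTimeZone(TimeZone.getTimeZone("UTC"))
    f
  }

  // Parses a "yyyy-MM-dd HH:mm:ss.SSS" string and re-renders it with pattern.
  def reformat(ts: String, pattern: String): String =
    formatter(pattern).format(formatter("yyyy-MM-dd HH:mm:ss.SSS").parse(ts))

  def main(args: Array[String]): Unit = {
    val ts = "2019-05-16 05:24:01.874"
    println(reformat(ts, "HH:mm:SS"))  // 05:24:874 (SS is milliseconds)
    println(reformat(ts, "HH:mm:ss"))  // 05:24:01  (ss is seconds)
    val yearEnd = "2019-12-30 00:00:00.000"
    println(reformat(yearEnd, "MM/dd/YYYY"))  // 12/30/2020 (week year)
    println(reformat(yearEnd, "MM/dd/yyyy"))  // 12/30/2019 (calendar year)
  }
}
```

The week-year case only differs from the calendar year in the days around New Year, so a YYYY typo can go unnoticed for most of the year.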

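String functions

There are plenty of string functions; here are some of them in use, covering substring search, regular expression matching, text similarity, and more.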

scala> val df = spark.createDataset(Seq(("aaa"),("abc"),("aabbc"),("ccc"))).toDF("key1")
df: org.apache.spark.sql.DataFrame = [key1: string]
 
### instr and locate do the same thing (note the swapped argument order); locate also has a variant that starts searching from a given position
scala> df.select(instr($"key1","b"), locate("b", $"key1"), locate("b",$"key1",3)).show
+--------------+------------------+------------------+
|instr(key1, b)|locate(b, key1, 1)|locate(b, key1, 3)|
+--------------+------------------+------------------+
|             0|                 0|                 0|
|             2|                 2|                 0|
|             3|                 3|                 3|
|             0|                 0|                 0|
+--------------+------------------+------------------+
 
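### The official docs for instr state explicitly that the result is null if either argument is null, but the docs for locate do not mention it
### In practice the two behave the same.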
scala> df.select(instr($"key1",null),locate(null,$"key1")).show
+-----------------+---------------------+
|instr(key1, NULL)|locate(NULL, key1, 1)|
+-----------------+---------------------+
|             null|                 null|
|             null|                 null|
|             null|                 null|
|             null|                 null|
+-----------------+---------------------+
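
The indexing convention shown above (positions start at 1, and 0 means "not found") can be written down in a few lines of plain Scala. This is a sketch of the observed behaviour for non-null inputs and start positions of 1 or more, not Spark's implementation; InstrSketch is a hypothetical name.

```scala
object InstrSketch {
  // 1-based index of the first occurrence of sub in str, 0 if absent.
  def instr(str: String, sub: String): Int = str.indexOf(sub) + 1

  // Same search, starting at the 1-based position pos.
  // Note the argument order: substring first, then the string.
  def locate(sub: String, str: String, pos: Int = 1): Int =
    str.indexOf(sub, pos - 1) + 1

  def main(args: Array[String]): Unit = {
    for (s <- Seq("aaa", "abc", "aabbc", "ccc"))
      println(s"$s: ${instr(s, "b")} ${locate("b", s)} ${locate("b", s, 3)}")
  }
}
```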
 
### Surprisingly, even a function like format_string shows up in SQL
scala> df.select(format_string("wahaha, value is [%s]", $"key1")).show(false)
+------------------------------------------+
|format_string(wahaha, value is [%s], key1)|
+------------------------------------------+
|wahaha, value is [aaa]                    |
|wahaha, value is [abc]                    |
|wahaha, value is [aabbc]                  |
|wahaha, value is [ccc]                    |
+------------------------------------------+
 
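### translate replaces characters one by one. Note: single characters, not substrings.
### It may look like string replacement, but it is really meant as a character-for-character mapping (a code translation)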
scala> df.select($"key1", translate($"key1","abc","123")).show
+-----+-------------------------+
| key1|translate(key1, abc, 123)|
+-----+-------------------------+
|  aaa|                      111|
|  abc|                      123|
|aabbc|                    11223|
|  ccc|                      333|
+-----+-------------------------+
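
The difference between translate and substring replacement is easier to see as a lookup table. Below is a minimal plain-Scala sketch of that idea, assuming the matching and replacement strings have the same length and the matching string has no repeated characters; TranslateSketch is a hypothetical name and the code is not taken from Spark.

```scala
object TranslateSketch {
  // Maps each character of src through a table built by pairing up
  // matching(i) -> replace(i); characters not in the table are kept.
  def translate(src: String, matching: String, replace: String): String = {
    require(matching.length == replace.length, "sketch only covers equal lengths")
    val table: Map[Char, Char] = matching.zip(replace).toMap
    src.map(c => table.getOrElse(c, c))
  }

  def main(args: Array[String]): Unit = {
    for (s <- Seq("aaa", "abc", "aabbc", "ccc"))
      println(s"$s -> ${translate(s, "abc", "123")}")
  }
}
```

Because every character is looked up independently, "aabbc" becomes "11223" even though the substring "abc" never occurs in it.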
 
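### Regular expressions are supported as well; here are regex extraction and regex replacement
### If regex syntax and capture groups are unfamiliar, you will need to look them up yourself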
scala> df.select($"key1", regexp_extract($"key1", "b(.*?)c",1 )).show
+-----+--------------------------------+
| key1|regexp_extract(key1, b(.*?)c, 1)|
+-----+--------------------------------+
|  aaa|                                |
|  abc|                                |
|aabbc|                               b|
|  ccc|                                |
+-----+--------------------------------+
 
### Replace every run of two or three consecutive a's with "111_"
scala> df.select($"key1", regexp_replace($"key1", "a{2,3}","111_")).show
+-----+----------------------------------+
| key1|regexp_replace(key1, a{2,3}, 111_)|
+-----+----------------------------------+
|  aaa|                              111_|
|  abc|                               abc|
|aabbc|                           111_bbc|
|  ccc|                               ccc|
+-----+----------------------------------+
 
### There are also two similarity-related functions worth knowing
### One is levenshtein, which computes the edit distance
scala> df.select(levenshtein(lit("aaa"),lit("aba")), levenshtein(lit("aaa"),lit("bab"))).limit(1).show
+---------------------+---------------------+
|levenshtein(aaa, aba)|levenshtein(aaa, bab)|
+---------------------+---------------------+
|                    1|                    2|
+---------------------+---------------------+
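
For readers who have not met edit distance before, here is a short plain-Scala sketch of the classic dynamic-programming computation, counting single-character insertions, deletions, and substitutions. It is a textbook version written for illustration, and LevenshteinSketch is a hypothetical name.

```scala
object LevenshteinSketch {
  // Minimum number of single-character insertions, deletions and
  // substitutions needed to turn a into b (row-by-row DP).
  def distance(a: String, b: String): Int = {
    // prev(j) = distance between the first i-1 chars of a and the first j of b.
    var prev: Array[Int] = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val cur = new Array[Int](b.length + 1)
      cur(0) = i // deleting i characters reaches the empty string
      for (j <- 1 to b.length) {
        val substitution = if (a(i - 1) == b(j - 1)) 0 else 1
        cur(j) = math.min(
          math.min(prev(j) + 1, cur(j - 1) + 1), // delete / insert
          prev(j - 1) + substitution)            // substitute or keep
      }
      prev = cur
    }
    prev(b.length)
  }

  def main(args: Array[String]): Unit = {
    println(distance("aaa", "aba"))
    println(distance("aaa", "bab"))
  }
}
```

Turning "aaa" into "aba" takes one substitution, and "aaa" into "bab" takes two, which agrees with the table.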
 
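### The other is soundex, which I have never used. It encodes a string according to how it sounds.
### It encodes the string as 4 characters (a letter followed by three digits); similar-sounding words get similar codes. Fancy~~
scala> df.select(soundex(lit("hello")), soundex(lit("hollow")),
     |   soundex(lit("how"))).limit(1).show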
+--------------+---------------+------------+
|soundex(hello)|soundex(hollow)|soundex(how)|
+--------------+---------------+------------+
|          H400|           H400|        H000|
+--------------+---------------+------------+
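
To make the 4-character code less mysterious, here is a plain-Scala sketch of the commonly described American Soundex rules: keep the first letter, map the following consonants to digits, skip vowels, collapse repeated codes, and pad with zeros. It is an assumption that Spark follows exactly these rules in every edge case; the sketch only aims to reproduce the three results in the table, and SoundexSketch is a hypothetical name.

```scala
object SoundexSketch {
  // Consonant groups and the digit each group maps to.
  private val codes: Map[Char, Char] = (for {
    (letters, digit) <- Seq("bfpv" -> '1', "cgjkqsxz" -> '2', "dt" -> '3',
                            "l" -> '4', "mn" -> '5', "r" -> '6')
    c <- letters
  } yield c -> digit).toMap

  def soundex(word: String): String = {
    val letters = word.toLowerCase.filter(c => c >= 'a' && c <= 'z')
    if (letters.isEmpty) ""
    else {
      val out = new StringBuilder
      out += letters.head.toUpper
      // '0' stands for "no code" (vowels and other uncoded letters).
      var last = codes.getOrElse(letters.head, '0')
      for (c <- letters.tail) {
        // h and w are skipped entirely and do not reset the previous code.
        if (out.length < 4 && c != 'h' && c != 'w') {
          val code = codes.getOrElse(c, '0')
          if (code != '0' && code != last) out += code
          last = code
        }
      }
      (out.toString + "000").take(4) // pad with zeros to 4 characters
    }
  }

  def main(args: Array[String]): Unit =
    Seq("hello", "hollow", "how").foreach(w => println(s"$w -> ${soundex(w)}"))
}
```

"hello" and "hollow" both reduce to an H followed by a single l sound, hence the shared code H400, while "how" has no coded consonant after the H.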
 
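### There are also some oddball functions; I have no idea how often people need them for them to be built in...
### initcap: capitalizes the first letter of each word
### ascii: returns the ASCII code of the first character
### lpad, rpad: pad a string on the left or right to a given length, apparently for alignment~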
scala> df.select($"key1", initcap($"key1"), ascii($"key1"),lpad($"key1", 5,"*")).show
+-----+-------------+-----------+----------------+
| key1|initcap(key1)|ascii(key1)|lpad(key1, 5, *)|
+-----+-------------+-----------+----------------+
|  aaa|          Aaa|         97|           **aaa|
|  abc|          Abc|         97|           **abc|
|aabbc|        Aabbc|         97|           aabbc|
|  ccc|          Ccc|         99|           **ccc|
+-----+-------------+-----------+----------------+

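Some less common cross-column functions

This part covers functions with somewhat odd usages: rarely needed, but quite handy when you do need them. For example, coalesce() finds the first non-null value among the given columns.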

scala> val df_1 = spark.createDataset(Seq((null,"2"),("0",null),(null,null),("4", "5"))).toDF("key1","key2")
df_1: org.apache.spark.sql.DataFrame = [key1: string, key2: string]
 
### Create a column with some NaN values (the square root of a negative number is NaN)
scala> val df = df_1.withColumn("key3",sqrt(rand()-0.4)).cache
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [key1: string, key2: string ... 1 more field]
 
scala> df.show
+----+----+------------------+
|key1|key2|              key3|
+----+----+------------------+
|null|   2|               NaN|
|   0|null|               NaN|
|null|null|0.4865678556424092|
|   4|   5|0.7648302408726048|
+----+----+------------------+
 
scala> df.select($"*",coalesce($"key1",$"key2",$"key3")).show
+----+----+------------------+--------------------------+
|key1|key2|              key3|coalesce(key1, key2, key3)|
+----+----+------------------+--------------------------+
|null|   2|               NaN|                         2|
|   0|null|               NaN|                         0|
|null|null|0.4865678556424092|        0.4865678556424092|
|   4|   5|0.7648302408726048|                         4|
+----+----+------------------+--------------------------+

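### Compare with the example above; this one shows two things
### 1. coalesce returns the first non-null value in the given column order, so changing the order changes the result
### 2. It looks for the first non-null value, and NaN is not null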
scala> df.select($"*",coalesce($"key3",$"key1",$"key2")).show
+----+----+------------------+--------------------------+
|key1|key2|              key3|coalesce(key3, key1, key2)|
+----+----+------------------+--------------------------+
|null|   2|               NaN|                       NaN|
|   0|null|               NaN|                       NaN|
|null|null|0.4865678556424092|        0.4865678556424092|
|   4|   5|0.7648302408726048|        0.7648302408726048|
+----+----+------------------+--------------------------+

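As the counterpart of coalesce for null values, Spark SQL provides nanvl for NaN values: nanvl(a, b) returns a unless a is NaN, in which case it returns b.

### coalesce accepts any number of columns, but nanvl takes exactly two
### key2 is never NaN here, so nanvl simply returns key2 (cast to double); a null key2 stays null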
scala> df.select($"key2",$"key3",nanvl($"key2",$"key3")).show
+----+------------------+-----------------+
|key2|              key3|nanvl(key2, key3)|
+----+------------------+-----------------+
|   2|               NaN|              2.0|
|null|               NaN|             null|
|null|0.4865678556424092|             null|
|   5|0.7648302408726048|              5.0|
+----+------------------+-----------------+
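
The null-versus-NaN distinction in the coalesce and nanvl examples can be modelled with Option[Double] in plain Scala, with None standing in for null. The sketch below encodes the behaviour seen in the tables; it is an illustration under that modelling assumption, and NullNanSketch is a hypothetical name.

```scala
object NullNanSketch {
  // First non-null value in argument order; NaN counts as a value.
  def coalesce[A](values: Option[A]*): Option[A] =
    values.collectFirst { case Some(v) => v }

  // Returns a unless a is NaN, in which case b is returned.
  // A null (None) first argument stays null.
  def nanvl(a: Option[Double], b: Option[Double]): Option[Double] = a match {
    case Some(x) if x.isNaN => b
    case other              => other
  }

  def main(args: Array[String]): Unit = {
    println(coalesce(None, Some(2.0), Some(Double.NaN)))  // picks 2.0
    println(coalesce(Some(Double.NaN), None, Some(2.0)))  // picks NaN
    println(nanvl(Some(Double.NaN), Some(2.0)))           // falls back to 2.0
    println(nanvl(None, Some(2.0)))                       // stays null
  }
}
```

To actually replace the NaN values of a column, the column that may contain NaN goes first and the fallback second.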

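There are also functions such as greatest() and least(), which return the largest or smallest value among the given columns.

### greatest() finds the maximum across all the given columns (nulls are skipped); again, NaN is special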
scala> df.select($"*",greatest($"key1".cast("double"),$"key2".cast("double"),$"key3")).show
+----+----+------------------+----------------------------------------------------------+
|key1|key2|              key3|greatest(CAST(key1 AS DOUBLE), CAST(key2 AS DOUBLE), key3)|
+----+----+------------------+----------------------------------------------------------+
|null|   2|               NaN|                                                       NaN|
|   0|null|               NaN|                                                       NaN|
|null|null|0.4865678556424092|                                        0.4865678556424092|
|   4|   5|0.7648302408726048|                                                       5.0|
+----+----+------------------+----------------------------------------------------------+
 
### When looking for the minimum the NaN disappears, which shows that NaN is treated as larger than any other value
scala> df.select($"*",least($"key1".cast("double"),$"key2".cast("double"),$"key3")).show
+----+----+------------------+-------------------------------------------------------+
|key1|key2|              key3|least(CAST(key1 AS DOUBLE), CAST(key2 AS DOUBLE), key3)|
+----+----+------------------+-------------------------------------------------------+
|null|   2|               NaN|                                                    2.0|
|   0|null|               NaN|                                                    0.0|
|null|null|0.4865678556424092|                                     0.4865678556424092|
|   4|   5|0.7648302408726048|                                     0.7648302408726048|
+----+----+------------------+-------------------------------------------------------+
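
The ordering seen in the greatest and least tables (nulls ignored, NaN sorting above every other number) happens to be the same total order that java.lang.Double.compare defines. The plain-Scala sketch below relies on that to imitate the two functions, again with None standing in for null; it is an illustration of the observed behaviour, and NanOrderingSketch is a hypothetical name.

```scala
object NanOrderingSketch {
  // Total order on doubles in which NaN is greater than everything else.
  private val nanLargest: Ordering[Double] = new Ordering[Double] {
    def compare(a: Double, b: Double): Int = java.lang.Double.compare(a, b)
  }

  // Largest non-null value, or None if every input is null.
  def greatest(values: Option[Double]*): Option[Double] = {
    val present = values.flatMap(_.toList)
    if (present.isEmpty) None else Some(present.max(nanLargest))
  }

  // Smallest non-null value, or None if every input is null.
  def least(values: Option[Double]*): Option[Double] = {
    val present = values.flatMap(_.toList)
    if (present.isEmpty) None else Some(present.min(nanLargest))
  }

  def main(args: Array[String]): Unit = {
    val row = Seq(None, Some(2.0), Some(Double.NaN))
    println(greatest(row: _*)) // NaN wins the maximum
    println(least(row: _*))    // 2.0 wins the minimum
  }
}
```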

The if...else of SQL

Finally, let me mention the when function: a conditional function, simple and practical.

scala> df.select($"key1", when(isnull($"key1"), rand()).when($"key1"<3,5).otherwise(10).as("new_key1")).show
+----+-------------------+
|key1|           new_key1|
+----+-------------------+
|null| 0.8093967624263269|
|   0|                5.0|
|null|0.11877309710543482|
|   4|               10.0|
+----+-------------------+
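
Read as ordinary code, the when/otherwise chain above is an if...else if...else in which the first matching branch wins. The plain-Scala sketch below spells that out, with None standing in for null and the random value passed in as a parameter so the result is reproducible. The name newKey1 is hypothetical, and the string-to-number comparison is assumed to behave like a cast to double.

```scala
object WhenSketch {
  // Branches are tested top to bottom; the first match decides the result.
  // The result is a Double in every branch, which presumably mirrors why
  // the literals 5 and 10 are printed as 5.0 and 10.0 in the table.
  def newKey1(key1: Option[String], randomValue: Double): Double = key1 match {
    case None                       => randomValue // when(isnull(key1), rand())
    case Some(k) if k.toDouble < 3  => 5.0         // .when(key1 < 3, 5)
    case _                          => 10.0        // .otherwise(10)
  }

  def main(args: Array[String]): Unit = {
    println(newKey1(None, 0.42))
    println(newKey1(Some("0"), 0.42))
    println(newKey1(Some("4"), 0.42))
  }
}
```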

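That is about it. Window functions and user-defined functions will be covered in the next post. I promptly renamed this post from "Function Roundup" to "Function Roundup (Part 1)". Clever, right~

Gotta go, I'm going to be late for work~~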

