Been lazy for a long stretch again, so here's a new post.
1. JDK 1.8
2. Spark 2.1
Like every traditional relational database, Spark SQL provides a large set of built-in functions to make data processing easier. It also knows it can't possibly cover every business need, so, unsurprisingly, it exposes a user-defined function (UDF) interface as a fallback. Window functions (also called analytic functions) are available too; some people make knowing window functions sound like a huge deal, which is nonsense: they are just functions evaluated over a range of rows. All of this is illustrated with examples below.
The built-in functions basically all live in the org.apache.spark.sql.functions object: aggregate functions, collection functions, date/time functions, string functions, math functions, sorting functions, window functions, plus assorted odds and ends. Skimming the official docs page and counting keywords, there are roughly 299 functions. There's no way I'm writing all of them up, and... I'm hungry. So I'll just pick a small handful from each category.
Among the aggregate functions there are plenty of statistics: Pearson correlation, kurtosis, skewness, standard deviation, mean, sum, distinct sum, and so on; then there are the counting functions: count, distinct count, and approximate distinct count.
scala> spark.range(1,10).createOrReplaceTempView("aaa")
scala> spark.sql("select * from aaa").show
+---+
| id|
+---+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
### variance is just an alias for var_samp ###
scala> spark.sql("select id,id*2 as double_id from aaa").select(
| avg("id"),variance("id"),skewness("id"),kurtosis("id"),
| corr("id","double_id")).show
+-------+------------+------------+------------+-------------------+
|avg(id)|var_samp(id)|skewness(id)|kurtosis(id)|corr(id, double_id)|
+-------+------------+------------+------------+-------------------+
| 5.0| 7.5| 0.0| -1.23| 1.0|
+-------+------------+------------+------------+-------------------+
scala> spark.sql("select id%3 as id from aaa").select(
| count("id"),countDistinct("id"),sum("id"),sumDistinct("id")
| ).show
+---------+------------------+-------+----------------+
|count(id)|count(DISTINCT id)|sum(id)|sum(DISTINCT id)|
+---------+------------------+-------+----------------+
| 9| 3| 9| 3|
+---------+------------------+-------+----------------+
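The approximate distinct count mentioned earlier doesn't appear in the transcript above, so here is a minimal sketch, assuming the same spark-shell session and the aaa temp view registered above. approx_count_distinct is backed by HyperLogLog++ and trades a small, bounded error for much lower memory use than an exact distinct count.
// Approximate distinct count (sketch, same spark-shell session as above).
// An overload that takes a maximum relative error also exists if more or less
// precision is needed.
import org.apache.spark.sql.functions._

spark.sql("select id % 3 as id from aaa")
  .select(countDistinct("id"), approx_count_distinct(col("id")))
  .show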
scala> spark.sql("select id%3 as id from aaa").select(
| collect_set("id"), collect_list("id"),
| sort_array(collect_list("id"))).show(false)
+---------------+---------------------------+----------------------------------+
|collect_set(id)|collect_list(id) |sort_array(collect_list(id), true)|
+---------------+---------------------------+----------------------------------+
|[0, 1, 2] |[1, 2, 0, 1, 2, 0, 1, 2, 0]|[0, 0, 0, 1, 1, 1, 2, 2, 2] |
+---------------+---------------------------+----------------------------------+
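All the aggregate examples so far collapse the whole table into a single row; in real use these same functions usually run per group through groupBy/agg. A minimal sketch, again assuming the same spark-shell session and the aaa temp view:
// Group-wise aggregation (sketch): the same aggregate functions, but computed
// per id % 3 bucket instead of over the entire table.
import org.apache.spark.sql.functions._

spark.sql("select id, id % 3 as grp from aaa")
  .groupBy("grp")
  .agg(count("id").as("cnt"), sum("id").as("total"), avg("id").as("mean"))
  .show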
For the collection functions, the one that gets used most is probably the one-row-to-many-rows stuff.
scala> val df = spark.createDataset(Seq(
     | ("aaa",List(1,2,3)),("bbb",List(3,4)))).toDF("key1","key2")
df: org.apache.spark.sql.DataFrame = [key1: string, key2: array<int>]
scala> df.show
+----+---------+
|key1| key2|
+----+---------+
| aaa|[1, 2, 3]|
| bbb| [3, 4]|
+----+---------+
### Watch out: here comes the one-row-to-many-rows part ###
scala> df.select($"key1", explode($"key2").as("explode_col")).show
+----+-----------+
|key1|explode_col|
+----+-----------+
| aaa| 1|
| aaa| 2|
| aaa| 3|
| bbb| 3|
| bbb| 4|
+----+-----------+
### There is also a posexplode function, which emits each element's original position alongside the value when unpacking the column
scala> df.select($"key1", posexplode($"key2").as("key2_pos"::"key2_val"::Nil)).show
+----+--------+--------+
|key1|key2_pos|key2_val|
+----+--------+--------+
| aaa| 0| 1|
| aaa| 1| 2|
| aaa| 2| 3|
| bbb| 0| 3|
| bbb| 1| 4|
+----+--------+--------+
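explode is not limited to arrays: it also flattens map columns, turning each entry into its own row with key and value columns. A minimal sketch, where the map column is built with SQL's map(...) purely for illustration:
// Exploding a map column (sketch): one output row per map entry, split into
// "key" and "value" columns.
import org.apache.spark.sql.functions._

val mapDf = spark.sql("select 'aaa' as key1, map('x', 1, 'y', 2) as key2")
mapDf.select(mapDf("key1"), explode(mapDf("key2"))).show
// resulting columns: key1, key, value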
There are also a few functions for dealing with JSON data that are worth a look.
scala> val df = spark.createDataset(Seq(
| ("aaa",1,2),("bbb",3,4),("ccc",3,5),("bbb",4, 6))).toDF("key1","key2","key3")
df: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 1 more field]
scala> df.show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa| 1| 2|
| bbb| 3| 4|
| ccc| 3| 5|
| bbb| 4| 6|
+----+----+----+
### Pack the named columns into JSON-formatted records ###
scala> val df_json=df.select(to_json(struct($"key1",$"key2",$"key3")).as("json_key")).cache
df_json: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [json_key: string]
scala> df_json.show(false)
+--------------------------------+
|json_key |
+--------------------------------+
|{"key1":"aaa","key2":1,"key3":2}|
|{"key1":"bbb","key2":3,"key3":4}|
|{"key1":"ccc","key2":3,"key3":5}|
|{"key1":"bbb","key2":4,"key3":6}|
+--------------------------------+
### The other way around: if the data is in JSON format, how do we turn it back into columns? ###
### (StructType/StringType/LongType below need: import org.apache.spark.sql.types._) ###
scala> import org.apache.spark.sql.types._
scala> val df_json_2 = df_json.select(from_json($"json_key", new StructType().add("key1",StringType).add("key2",LongType).add("key3",LongType)).as("json_data")).cache
df_json_2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [json_data: struct]
### Not quite there yet: everything is still stuffed into one struct column ###
scala> df_json_2.show
+---------+
|json_data|
+---------+
|[aaa,1,2]|
|[bbb,3,4]|
|[ccc,3,5]|
|[bbb,4,6]|
+---------+
### And with that, the original columns are restored ###
scala> df_json_2.select($"json_data.key1".as("key1"), $"json_data.key2".as("key2"), $"json_data.key3".as("key3")).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa| 1| 2|
| bbb| 3| 4|
| ccc| 3| 5|
| bbb| 4| 6|
+----+----+----+
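Besides to_json/from_json there are also get_json_object and json_tuple, which pull fields straight out of a JSON string without declaring a schema first. A minimal sketch against the df_json frame built above (same spark-shell session assumed):
// Extracting JSON fields without a schema (sketch).
// get_json_object takes a JsonPath expression; json_tuple pulls several
// top-level fields at once (its output columns default to c0, c1, ...).
import org.apache.spark.sql.functions._

df_json.select(
  get_json_object(df_json("json_key"), "$.key1").as("key1"),
  json_tuple(df_json("json_key"), "key2", "key3")
).show(false)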
There are quite a few date/time functions: all kinds of arithmetic, all kinds of format conversions. The nicest part is that many of them accept date, timestamp, and string inputs alike, which is very convenient.
scala> val df = spark.sql("select current_date() as dt, current_timestamp() as ts, '2019-05-16 04:00:00' as tm_str").cache
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [dt: date, ts: timestamp ... 1 more field]
scala> df.printSchema
root
|-- dt: date (nullable = false)
|-- ts: timestamp (nullable = false)
|-- tm_str: string (nullable = false)
scala> df.show(false)
+----------+-----------------------+-------------------+
|dt |ts |tm_str |
+----------+-----------------------+-------------------+
|2019-05-16|2019-05-16 05:24:01.874|2019-05-16 04:00:00|
+----------+-----------------------+-------------------+
scala> df.select(dayofyear($"dt"), dayofyear($"ts"), dayofyear($"tm_str")).show
+-------------+-------------+-----------------+
|dayofyear(dt)|dayofyear(ts)|dayofyear(tm_str)|
+-------------+-------------+-----------------+
| 136| 136| 136|
+-------------+-------------+-----------------+
scala> df.select(hour($"dt"), hour($"ts"), hour($"tm_str")).show
+--------+--------+------------+
|hour(dt)|hour(ts)|hour(tm_str)|
+--------+--------+------------+
| 0| 5| 4|
+--------+--------+------------+
### dt is a date column and ts is a timestamp column; add_months works directly on both ###
### ts is a timestamp column and tm_str is a string column; date_format works directly on both as well ###
scala> df.select(datediff(add_months($"dt",2), add_months($"ts",1)).as("diff_days"),
| date_format($"ts","HH:mm:SS"), date_format($"tm_str","MM/dd/YYYY")).show
+---------+-------------------------+-------------------------------+
|diff_days|date_format(ts, HH:mm:SS)|date_format(tm_str, MM/dd/YYYY)|
+---------+-------------------------+-------------------------------+
| 30| 05:24:874| 05/16/2019|
+---------+-------------------------+-------------------------------+
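A few more of the arithmetic and conversion helpers mentioned above, as a minimal sketch against the same df (date dt, timestamp ts, string tm_str): date_add shifts by whole days, months_between returns a fractional month difference, and unix_timestamp/from_unixtime convert to and from epoch seconds.
// More date arithmetic and conversions (sketch, same df as above).
import org.apache.spark.sql.functions._

df.select(
  date_add(df("dt"), 7).as("dt_plus_7_days"),           // shift a date by days
  months_between(df("ts"), df("dt")).as("months_diff"), // fractional months
  unix_timestamp(df("tm_str")).as("epoch_secs"),        // string -> epoch seconds
  from_unixtime(unix_timestamp(df("tm_str")), "yyyy/MM/dd").as("reformatted")
).show(false)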
The string functions cover a lot of ground; here is a subset of them: substring search, regex matching, text similarity, and so on.
scala> val df = spark.createDataset(Seq(("aaa"),("abc"),("aabbc"),("ccc"))).toDF("key1")
df: org.apache.spark.sql.DataFrame = [key1: string]
### instr and locate do the same thing (with the argument order swapped); locate also has a variant that starts searching from a given position
scala> df.select(instr($"key1","b"), locate("b", $"key1"), locate("b",$"key1",3)).show
+--------------+------------------+------------------+
|instr(key1, b)|locate(b, key1, 1)|locate(b, key1, 3)|
+--------------+------------------+------------------+
| 0| 0| 0|
| 2| 2| 0|
| 3| 3| 3|
| 0| 0| 0|
+--------------+------------------+------------------+
### The official docs for instr state explicitly that the result is null if either argument is null, while the locate docs don't mention it,
### but in practice the behavior turns out to be the same.
scala> df.select(instr($"key1",null),locate(null,$"key1")).show
+-----------------+---------------------+
|instr(key1, NULL)|locate(NULL, key1, 1)|
+-----------------+---------------------+
| null| null|
| null| null|
| null| null|
| null| null|
+-----------------+---------------------+
### Surprising that something like format_string shows up among the SQL functions
scala> df.select(format_string("wahaha, value is [%s]", $"key1")).show(false)
+------------------------------------------+
|format_string(wahaha, value is [%s], key1)|
+------------------------------------------+
|wahaha, value is [aaa] |
|wahaha, value is [abc] |
|wahaha, value is [aabbc] |
|wahaha, value is [ccc] |
+------------------------------------------+
### There is a translate function that maps individual characters; note, single characters only.
### It looks a bit like string replacement, but it's really there to provide character-encoding-style translation
scala> df.select($"key1", translate($"key1","abc","123")).show
+-----+-------------------------+
| key1|translate(key1, abc, 123)|
+-----+-------------------------+
| aaa| 111|
| abc| 123|
|aabbc| 11223|
| ccc| 333|
+-----+-------------------------+
### Regular expressions are supported too; here are regex extraction and regex replacement
### If the regex syntax or the capture-group argument is unfamiliar, you'll need to read up on those first
scala> df.select($"key1", regexp_extract($"key1", "b(.*?)c",1 )).show
+-----+--------------------------------+
| key1|regexp_extract(key1, b(.*?)c, 1)|
+-----+--------------------------------+
| aaa| |
| abc| |
|aabbc| b|
| ccc| |
+-----+--------------------------------+
### Replace a run of two or three consecutive a's with "111_"
scala> df.select($"key1", regexp_replace($"key1", "a{2,3}","111_")).show
+-----+----------------------------------+
| key1|regexp_replace(key1, a{2,3}, 111_)|
+-----+----------------------------------+
| aaa| 111_|
| abc| abc|
|aabbc| 111_bbc|
| ccc| ccc|
+-----+----------------------------------+
### There are also two similarity functions worth knowing about
### One is levenshtein, which simply computes the edit distance
scala> df.select(levenshtein(lit("aaa"),lit("aba")), levenshtein(lit("aaa"),lit("bab"))).limit(1).show
+---------------------+---------------------+
|levenshtein(aaa, aba)|levenshtein(aaa, bab)|
+---------------------+---------------------+
| 1| 2|
+---------------------+---------------------+
### The other is soundex, which I have never actually used. It supposedly encodes a string by its pronunciation.
### It maps the string to a 4-character code, so words that sound alike get similar codes. Fancy~~
scala> df.select(soundex(lit("hello")), soundex(lit("hollow")
| ), soundex(lit("how"))).limit(1).show
+--------------+---------------+------------+
|soundex(hello)|soundex(hollow)|soundex(how)|
+--------------+---------------+------------+
| H400| H400| H000|
+--------------+---------------+------------+
### Then there are some oddball functions; no idea how often anyone actually needs these built in...
### initcap: capitalize the first letter of each word in the field
### ascii: the ASCII code of the field's first character
### lpad, rpad: pad the string on the left/right with a character, presumably for alignment~
scala> df.select($"key1", initcap($"key1"), ascii($"key1"),lpad($"key1", 5,"*")).show
+-----+-------------+-----------+----------------+
| key1|initcap(key1)|ascii(key1)|lpad(key1, 5, *)|
+-----+-------------+-----------+----------------+
| aaa| Aaa| 97| **aaa|
| abc| Abc| 97| **abc|
|aabbc| Aabbc| 97| aabbc|
| ccc| Ccc| 99| **ccc|
+-----+-------------+-----------+----------------+
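A few everyday string helpers that didn't make it into the transcript, sketched against the same key1 frame: concat_ws joins values with a separator, split cuts a string into an array on a regex, and substring slices by (1-based) position and length.
// More common string helpers (sketch, same df with the key1 column as above).
import org.apache.spark.sql.functions._

df.select(
  concat_ws("-", df("key1"), upper(df("key1"))).as("joined"),  // e.g. "abc-ABC"
  split(df("key1"), "b").as("split_on_b"),                     // array of pieces
  substring(df("key1"), 1, 2).as("first_two_chars")            // 1-based position
).show(false)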
This part is about some odd little functions: not used often, but quite handy when you do need them. For example, to find the first non-null value among several given columns, there is the coalesce() function.
scala> val df_1 = spark.createDataset(Seq((null,"2"),("0",null),(null,null),("4", "5"))).toDF("key1","key2")
df_1: org.apache.spark.sql.DataFrame = [key1: string, key2: string]
### Manufacture a column with some NaN values
scala> val df = df_1.withColumn("key3",sqrt(rand()-0.4)).cache
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [key1: string, key2: string ... 1 more field]
scala> df.show
+----+----+------------------+
|key1|key2| key3|
+----+----+------------------+
|null| 2| NaN|
| 0|null| NaN|
|null|null|0.4865678556424092|
| 4| 5|0.7648302408726048|
+----+----+------------------+
scala> df.select($"*",coalesce($"key1",$"key2",$"key3")).show
+----+----+------------------+--------------------------+
|key1|key2| key3|coalesce(key1, key2, key3)|
+----+----+------------------+--------------------------+
|null| 2| NaN| 2|
| 0|null| NaN| 0|
|null|null|0.4865678556424092| 0.4865678556424092|
| 4| 5|0.7648302408726048| 4|
+----+----+------------------+--------------------------+
### Note: compared with the example above, this one illustrates two points
### 1. coalesce picks the first non-null value in the given column order, so changing the order changes the result
### 2. it looks for the first non-null value, and NaN counts as non-null
scala> df.select($"*",coalesce($"key3",$"key1",$"key2")).show
+----+----+------------------+--------------------------+
|key1|key2| key3|coalesce(key3, key1, key2)|
+----+----+------------------+--------------------------+
|null| 2| NaN| NaN|
| 0|null| NaN| NaN|
|null|null|0.4865678556424092| 0.4865678556424092|
| 4| 5|0.7648302408726048| 0.7648302408726048|
+----+----+------------------+--------------------------+
As a counterpart to coalesce, which handles null values, Spark SQL also provides nanvl for handling NaN values.
### coalesce accepts any number of columns, but nanvl takes exactly two
scala> df.select($"key2",$"key3",nanvl($"key2",$"key3")).show
+----+------------------+-----------------+
|key2| key3|nanvl(key2, key3)|
+----+------------------+-----------------+
| 2| NaN| 2.0|
|null| NaN| null|
|null|0.4865678556424092| null|
| 5|0.7648302408726048| 5.0|
+----+------------------+-----------------+
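Since nanvl only reacts to NaN and coalesce only skips null, it sometimes helps to test the two conditions explicitly with isnan() and isnull(); a minimal sketch on the same df:
// Telling null apart from NaN explicitly (sketch, same df as above).
import org.apache.spark.sql.functions._

df.select(
  df("key2"), df("key3"),
  isnull(df("key2")).as("key2_is_null"),
  isnan(df("key3")).as("key3_is_nan")
).show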
There are also greatest() and least(), which return the largest and smallest value across the given columns.
### greatest() finds the maximum across all given columns; as before, NaN behaves a bit specially
scala> df.select($"*",greatest($"key1".cast("double"),$"key2".cast("double"),$"key3")).show
+----+----+------------------+----------------------------------------------------------+
|key1|key2| key3|greatest(CAST(key1 AS DOUBLE), CAST(key2 AS DOUBLE), key3)|
+----+----+------------------+----------------------------------------------------------+
|null| 2| NaN| NaN|
| 0|null| NaN| NaN|
|null|null|0.4865678556424092| 0.4865678556424092|
| 4| 5|0.7648302408726048| 5.0|
+----+----+------------------+----------------------------------------------------------+
### When taking the minimum, NaN disappears, which shows that NaN is treated as a maximal value
scala> df.select($"*",least($"key1".cast("double"),$"key2".cast("double"),$"key3")).show
+----+----+------------------+-------------------------------------------------------+
|key1|key2| key3|least(CAST(key1 AS DOUBLE), CAST(key2 AS DOUBLE), key3)|
+----+----+------------------+-------------------------------------------------------+
|null| 2| NaN| 2.0|
| 0|null| NaN| 0.0|
|null|null|0.4865678556424092| 0.4865678556424092|
| 4| 5|0.7648302408726048| 0.7648302408726048|
+----+----+------------------+-------------------------------------------------------+
Finally, one more: the when function, a conditional function that's simple and practical.
scala> df.select($"key1", when(isnull($"key1"), rand()).when($"key1"<3,5).otherwise(10).as("new_key1")).show
+----+-------------------+
|key1| new_key1|
+----+-------------------+
|null| 0.8093967624263269|
| 0| 5.0|
|null|0.11877309710543482|
| 4| 10.0|
+----+-------------------+
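The same conditional logic can also be written as a plain SQL CASE WHEN and dropped into the DataFrame API via expr(); a minimal sketch:
// when/otherwise expressed as a SQL CASE WHEN through expr() (sketch).
import org.apache.spark.sql.functions._

df.select(
  df("key1"),
  expr("case when key1 is null then rand() when key1 < 3 then 5 else 10 end").as("new_key1")
).show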
That's about it; window functions and user-defined functions will go in the next post. So I'm decisively changing the title from "Function Roundup" to "Function Roundup - Part 1". Smart move, right?
Off I go, I'm going to be late for work~~