The article "Spark 2.0 Technical Preview: Easier, Faster, and Smarter" gave a brief overview of the new features in Spark 2.0, the next major release of Apache Spark. This release brings major changes to the architectural abstractions, the APIs, and the platform's libraries, and it sets the direction for the framework over the coming year, so understanding Spark 2.0's features matters a great deal for anyone who wants to use it. This blog will cover Spark 2.0 in a series of posts (see the Spark 2.0 category); stay tuned.
The Dataset and DataFrame APIs in Spark support structured analysis. An important aspect of structured analysis is managing metadata, which can be temporary metadata (such as temporary tables and UDFs registered on the SQLContext) or persistent metadata (such as a Hive metastore or HCatalog).
Earlier versions of Spark had no standard API for accessing this metadata. Users typically issued query statements (such as show tables) to inspect it. Those queries meant manipulating raw strings, and the operations differed from one type of metadata to another.
This changes in Spark 2.0, which adds a standard API, called the catalog, for accessing the metadata of Spark SQL. The API works against both Spark SQL and Hive metadata.
In this post I will show how to use the catalog API.
The Catalog is obtained from a SparkSession; the following code shows how to get it:
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> val sparkSession = SparkSession.builder.appName("spark session example").enableHiveSupport().getOrCreate()
sparkSession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@5d50ea49

scala> val catalog = sparkSession.catalog
catalog: org.apache.spark.sql.catalog.Catalog = org.apache.spark.sql.internal.CatalogImpl@17308af1
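Note that inside spark-shell you do not need to build the session yourself: the shell already creates a SparkSession named spark at startup, so the catalog is available directly (a minimal sketch):

// In spark-shell a SparkSession called `spark` already exists,
// so the catalog can be obtained without calling the builder
val catalog = spark.catalog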
Once we have the catalog object, we can use it to query the databases in the metastore. Every API on the catalog returns its result as a Dataset.
scala> catalog.listDatabases().select("name").show(false)
+-----------------------+
|name                   |
+-----------------------+
|iteblog                |
|default                |
+-----------------------+
only showing top 20 rows
listDatabases returns all the databases in the metastore. By default the metastore contains only the database named default; with a Hive metastore it fetches all of Hive's databases. Because listDatabases returns a Dataset, we can use any Dataset operation to query the metadata.
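For example, listDatabases() returns a Dataset[Database], whose elements carry the database name, description, and location, so ordinary Dataset transformations apply (a sketch, reusing the catalog obtained above):

// listDatabases() returns Dataset[org.apache.spark.sql.catalog.Database];
// filter it like any other Dataset, e.g. hide the built-in default database
catalog.listDatabases()
  .filter(db => db.name != "default")
  .show(false)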
In earlier versions of Spark we used registerTempTable to register a DataFrame. In Spark 2.0 that API is deprecated. The name registerTempTable was misleading, because it suggested that the DataFrame would be persisted and the temporary table kept around, which is not what happens; for that reason the community deliberately replaced it with createTempView, which is used as follows:
df.createTempView("iteblog")
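The df above can be any DataFrame; here is a minimal sketch of how one might be built (the CSV path is a made-up placeholder; judging from the schema printed later in this post, the article's df looks like a headerless CSV load). Note also that createTempView throws an exception if a view with that name already exists, while createOrReplaceTempView overwrites it:

// df can be any DataFrame; hypothetically, a headerless CSV load
// (which yields the _c0, _c1, ... column names seen later in the post)
val df = sparkSession.read.csv("/path/to/data.csv")

// createTempView fails if the name is already taken;
// createOrReplaceTempView replaces an existing view instead
df.createOrReplaceTempView("iteblog")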
Once we have registered a view, we can query it with the listTables function. Just as we can list all the databases in the metastore, we can also list the tables of a database. The listing shows every temporary view registered in Spark SQL together with the tables of Hive's default database (that is, default):
scala> catalog.listTables().select("name").show(false)
+----------------------------------------+
|name                                    |
+----------------------------------------+
|city_to_level                           |
|table2                                  |
|test                                    |
|ticket_order                            |
|tmp1_result                             |
|iteblog                                 |
+----------------------------------------+
The iteblog table above is the temporary view we registered with df.createTempView("iteblog").
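listTables also has an overload that takes a database name, so the listing can be restricted to a single database; a sketch (the database name iteblog comes from the listDatabases output above):

// List only the tables of one database; temporary views
// are still included in the result
catalog.listTables("iteblog").select("name", "database", "isTemporary").show(false)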
We can use the Catalog API to check whether a table is cached:
scala> println(catalog.isCached("iteblog"))
false
The call above checks whether the iteblog table is cached and prints false. Tables are not cached by default, but we can cache one by hand:
scala> df.cache()
res4: df.type = [_c0: string, _c1: string ... 2 more fields]

scala> println(catalog.isCached("iteblog"))
true
The iteblog table is now cached, so this time the output is true.
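The catalog can also manage the cache by table name, with no reference to the DataFrame; a short sketch of the cache-related Catalog methods:

// Cache a table by name instead of calling df.cache()
catalog.cacheTable("iteblog")

// Evict a single table from the in-memory cache
catalog.uncacheTable("iteblog")

// Evict every cached table
catalog.clearCache()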
We can use the catalog API to drop a view. In the Spark SQL case it drops the previously registered view; in the Hive case it drops the table from the metastore.
scala> catalog.dropTempView("iteblog")
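After the drop, any further reference to the view fails; one quick way to confirm it is gone is to search listTables again (a sketch reusing the names above):

// After dropTempView the view no longer shows up in the listing
val stillThere = catalog.listTables().filter(t => t.name == "iteblog").count() > 0
println(stillThere) // prints false once the view has been dropped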
The Catalog API lets us operate not only on tables but also on UDFs. The following snippet lists every function registered on the SparkSession, including Spark's built-in functions.
scala> catalog.listFunctions().select("name", "className", "isTemporary").show(100, false)
+---------------------+-----------------------------------------------------------------------+-----------+
|name                 |className                                                              |isTemporary|
+---------------------+-----------------------------------------------------------------------+-----------+
|!                    |org.apache.spark.sql.catalyst.expressions.Not                          |true       |
|%                    |org.apache.spark.sql.catalyst.expressions.Remainder                    |true       |
|&                    |org.apache.spark.sql.catalyst.expressions.BitwiseAnd                   |true       |
|*                    |org.apache.spark.sql.catalyst.expressions.Multiply                     |true       |
|+                    |org.apache.spark.sql.catalyst.expressions.Add                          |true       |
|-                    |org.apache.spark.sql.catalyst.expressions.Subtract                     |true       |
|/                    |org.apache.spark.sql.catalyst.expressions.Divide                       |true       |
|<                    |org.apache.spark.sql.catalyst.expressions.LessThan                     |true       |
|<=                   |org.apache.spark.sql.catalyst.expressions.LessThanOrEqual              |true       |
|<=>                  |org.apache.spark.sql.catalyst.expressions.EqualNullSafe                |true       |
|=                    |org.apache.spark.sql.catalyst.expressions.EqualTo                      |true       |
|==                   |org.apache.spark.sql.catalyst.expressions.EqualTo                      |true       |
|>                    |org.apache.spark.sql.catalyst.expressions.GreaterThan                  |true       |
|>=                   |org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual           |true       |
|^                    |org.apache.spark.sql.catalyst.expressions.BitwiseXor                   |true       |
|abs                  |org.apache.spark.sql.catalyst.expressions.Abs                          |true       |
|acos                 |org.apache.spark.sql.catalyst.expressions.Acos                         |true       |
|add_months           |org.apache.spark.sql.catalyst.expressions.AddMonths                    |true       |
|and                  |org.apache.spark.sql.catalyst.expressions.And                          |true       |
|approx_count_distinct|org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus|true       |
|array                |org.apache.spark.sql.catalyst.expressions.CreateArray                  |true       |
|array_contains       |org.apache.spark.sql.catalyst.expressions.ArrayContains                |true       |
|ascii                |org.apache.spark.sql.catalyst.expressions.Ascii                        |true       |
|asin                 |org.apache.spark.sql.catalyst.expressions.Asin                         |true       |
|assert_true          |org.apache.spark.sql.catalyst.expressions.AssertTrue                   |true       |
|atan                 |org.apache.spark.sql.catalyst.expressions.Atan                         |true       |
|atan2                |org.apache.spark.sql.catalyst.expressions.Atan2                        |true       |
|avg                  |org.apache.spark.sql.catalyst.expressions.aggregate.Average            |true       |
|base64               |org.apache.spark.sql.catalyst.expressions.Base64                       |true       |
|bin                  |org.apache.spark.sql.catalyst.expressions.Bin                          |true       |
|bround               |org.apache.spark.sql.catalyst.expressions.BRound                       |true       |
|cbrt                 |org.apache.spark.sql.catalyst.expressions.Cbrt                         |true       |
|ceil                 |org.apache.spark.sql.catalyst.expressions.Ceil                         |true       |
|ceiling              |org.apache.spark.sql.catalyst.expressions.Ceil                         |true       |
|coalesce             |org.apache.spark.sql.catalyst.expressions.Coalesce                     |true       |
|collect_list         |org.apache.spark.sql.catalyst.expressions.aggregate.CollectList        |true       |
|collect_set          |org.apache.spark.sql.catalyst.expressions.aggregate.CollectSet         |true       |
|concat               |org.apache.spark.sql.catalyst.expressions.Concat                       |true       |
|concat_ws            |org.apache.spark.sql.catalyst.expressions.ConcatWs                     |true       |
|conv                 |org.apache.spark.sql.catalyst.expressions.Conv                         |true       |
|corr                 |org.apache.spark.sql.catalyst.expressions.aggregate.Corr               |true       |
|cos                  |org.apache.spark.sql.catalyst.expressions.Cos                          |true       |
|cosh                 |org.apache.spark.sql.catalyst.expressions.Cosh                         |true       |
|count                |org.apache.spark.sql.catalyst.expressions.aggregate.Count              |true       |
|covar_pop            |org.apache.spark.sql.catalyst.expressions.aggregate.CovPopulation      |true       |
|covar_samp           |org.apache.spark.sql.catalyst.expressions.aggregate.CovSample          |true       |
|crc32                |org.apache.spark.sql.catalyst.expressions.Crc32                        |true       |
|cube                 |org.apache.spark.sql.catalyst.expressions.Cube                         |true       |
|cume_dist            |org.apache.spark.sql.catalyst.expressions.CumeDist                     |true       |
|current_database     |org.apache.spark.sql.catalyst.expressions.CurrentDatabase              |true       |
|current_date         |org.apache.spark.sql.catalyst.expressions.CurrentDate                  |true       |
|current_timestamp    |org.apache.spark.sql.catalyst.expressions.CurrentTimestamp             |true       |
|date_add             |org.apache.spark.sql.catalyst.expressions.DateAdd                      |true       |
|date_format          |org.apache.spark.sql.catalyst.expressions.DateFormatClass              |true       |
|date_sub             |org.apache.spark.sql.catalyst.expressions.DateSub                      |true       |
|datediff             |org.apache.spark.sql.catalyst.expressions.DateDiff                     |true       |
|day                  |org.apache.spark.sql.catalyst.expressions.DayOfMonth                   |true       |
|dayofmonth           |org.apache.spark.sql.catalyst.expressions.DayOfMonth                   |true       |
|dayofyear            |org.apache.spark.sql.catalyst.expressions.DayOfYear                    |true       |
|decode               |org.apache.spark.sql.catalyst.expressions.Decode                       |true       |
|degrees              |org.apache.spark.sql.catalyst.expressions.ToDegrees                    |true       |
|dense_rank           |org.apache.spark.sql.catalyst.expressions.DenseRank                    |true       |
|e                    |org.apache.spark.sql.catalyst.expressions.EulerNumber                  |true       |
|encode               |org.apache.spark.sql.catalyst.expressions.Encode                       |true       |
|exp                  |org.apache.spark.sql.catalyst.expressions.Exp                          |true       |
|explode              |org.apache.spark.sql.catalyst.expressions.Explode                      |true       |
|expm1                |org.apache.spark.sql.catalyst.expressions.Expm1                        |true       |
|factorial            |org.apache.spark.sql.catalyst.expressions.Factorial                    |true       |
|find_in_set          |org.apache.spark.sql.catalyst.expressions.FindInSet                    |true       |
|first                |org.apache.spark.sql.catalyst.expressions.aggregate.First              |true       |
|first_value          |org.apache.spark.sql.catalyst.expressions.aggregate.First              |true       |
|floor                |org.apache.spark.sql.catalyst.expressions.Floor                        |true       |
|format_number        |org.apache.spark.sql.catalyst.expressions.FormatNumber                 |true       |
|format_string        |org.apache.spark.sql.catalyst.expressions.FormatString                 |true       |
|from_unixtime        |org.apache.spark.sql.catalyst.expressions.FromUnixTime                 |true       |
|from_utc_timestamp   |org.apache.spark.sql.catalyst.expressions.FromUTCTimestamp             |true       |
|get_json_object      |org.apache.spark.sql.catalyst.expressions.GetJsonObject                |true       |
|greatest             |org.apache.spark.sql.catalyst.expressions.Greatest                     |true       |
|grouping             |org.apache.spark.sql.catalyst.expressions.Grouping                     |true       |
|grouping_id          |org.apache.spark.sql.catalyst.expressions.GroupingID                   |true       |
|hash                 |org.apache.spark.sql.catalyst.expressions.Murmur3Hash                  |true       |
|hex                  |org.apache.spark.sql.catalyst.expressions.Hex                          |true       |
|hour                 |org.apache.spark.sql.catalyst.expressions.Hour                         |true       |
|hypot                |org.apache.spark.sql.catalyst.expressions.Hypot                        |true       |
|if                   |org.apache.spark.sql.catalyst.expressions.If                           |true       |
|ifnull               |org.apache.spark.sql.catalyst.expressions.IfNull                       |true       |
|in                   |org.apache.spark.sql.catalyst.expressions.In                           |true       |
|initcap              |org.apache.spark.sql.catalyst.expressions.InitCap                      |true       |
|input_file_name      |org.apache.spark.sql.catalyst.expressions.InputFileName                |true       |
|instr                |org.apache.spark.sql.catalyst.expressions.StringInstr                  |true       |
|isnan                |org.apache.spark.sql.catalyst.expressions.IsNaN                        |true       |
|isnotnull            |org.apache.spark.sql.catalyst.expressions.IsNotNull                    |true       |
|isnull               |org.apache.spark.sql.catalyst.expressions.IsNull                       |true       |
|json_tuple           |org.apache.spark.sql.catalyst.expressions.JsonTuple                    |true       |
|kurtosis             |org.apache.spark.sql.catalyst.expressions.aggregate.Kurtosis           |true       |
|lag                  |org.apache.spark.sql.catalyst.expressions.Lag                          |true       |
|last                 |org.apache.spark.sql.catalyst.expressions.aggregate.Last               |true       |
|last_day             |org.apache.spark.sql.catalyst.expressions.LastDay                      |true       |
|last_value           |org.apache.spark.sql.catalyst.expressions.aggregate.Last               |true       |
|lcase                |org.apache.spark.sql.catalyst.expressions.Lower                        |true       |
+---------------------+-----------------------------------------------------------------------+-----------+
only showing top 100 rows
The listing above shows 100 functions together with their implementing classes.
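Since listFunctions also returns a Dataset (of org.apache.spark.sql.catalog.Function), we can search it like any other Dataset; for example, a sketch that finds every function whose name mentions date:

// listFunctions() returns Dataset[org.apache.spark.sql.catalog.Function];
// ordinary Dataset operations work for searching it
catalog.listFunctions()
  .filter(f => f.name.contains("date"))
  .select("name", "className")
  .show(false)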
Original post: https://www.iteblog.com/archives/1701.html#Catalog_API