Creating Temporary Views in Spark

Preface

Looking at the Spark Dataset API, the official documentation provides four methods for creating temporary views:

def createGlobalTempView(viewName: String): Unit
// Creates a global temporary view using the given name.

def createOrReplaceGlobalTempView(viewName: String): Unit
// Creates or replaces a global temporary view using the given name.

def createOrReplaceTempView(viewName: String): Unit
// Creates a local temporary view using the given name.

def createTempView(viewName: String): Unit
// Creates a local temporary view using the given name.

What are the differences between them, and when should each one be used?

Temporary Views

1. createTempView(viewName: String)

Creates a temporary view named viewName. A temporary view is session-scoped: it disappears when the session that created it ends.

  • If a temporary view with the given name already exists, a TempTableAlreadyExistsException is thrown
  • Parameter viewName: the name of the view

Example:

df.createTempView("people")
df2 = spark_session.sql("select * from people")
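Calling createTempView a second time with the same name would fail, which is exactly the case the Replace variant below handles (a sketch of the expected behavior, continuing from the example above):

// A second registration of the same name is rejected:
// df.createTempView("people")   // throws because the temporary view 'people' already exists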
2. createOrReplaceTempView(viewName: String)

Creates a temporary view named viewName. If a view with that name already exists, it is replaced.

  • Parameter viewName: the name of the view
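Example (a minimal sketch reusing the df and spark_session names from above; the second call replaces the first instead of throwing):

df.createOrReplaceTempView("people")
// re-registering the same name is fine: the old definition is simply replaced
df.limit(2).createOrReplaceTempView("people")
spark_session.sql("select * from people").show()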

The two methods above really belong to one category: both create session-scoped temporary views, not global ones. As the names suggest, the variant with Replace replaces the view if it already exists.

Global Temporary Views

1. createGlobalTempView(viewName: String)

Creates a global temporary view named viewName.
Temporary views in Spark SQL are session-scoped and disappear when the session that created them ends. If you need a temporary view that survives across sessions, create a global temporary view instead.

  • If a global temporary view with the given name already exists, a TempTableAlreadyExistsException is thrown
  • Global temporary views live in the system database global_temp and must be referenced with that database name
  • Parameter viewName: the name of the view

Example:

df.createGlobalTempView("people")
spark_session.sql("SELECT * FROM global_temp.people").show()

To emphasize again: when querying a global temporary view, you must prefix it with the global_temp database name.

2. createOrReplaceGlobalTempView(viewName: String)

Creates a global temporary view named viewName. If a view with that name already exists, it is replaced.

  • Parameter viewName: the name of the view
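Example (a sketch along the same lines for the global variant; note that the global_temp qualifier is still required when querying):

df.createOrReplaceGlobalTempView("people")
spark_session.sql("SELECT * FROM global_temp.people").show()
// calling it again with different data replaces the existing global view without an exception
df.limit(2).createOrReplaceGlobalTempView("people")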

These two, in turn, form a second category: both create global temporary views. Again, the Replace variant replaces the view if it already exists.

Let's underline the key point one more time!!!

Temporary views in Spark SQL are session-scoped: they disappear when the session that created them terminates. If you need a temporary view that is shared across all sessions and stays alive until the Spark application terminates, create a global temporary view. Global temporary views are tied to the system-preserved database global_temp, and you must use the qualified name to refer to them, e.g. SELECT * FROM global_temp.view1.
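Both kinds of views can also be dropped explicitly through the catalog before the session or application ends. A minimal sketch using the standard Catalog API (assuming spark_session is the active SparkSession and the view names come from the examples above):

spark_session.catalog.dropTempView("people")         // removes the session-scoped view
spark_session.catalog.dropGlobalTempView("people")   // removes the view from global_temp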

Cooking Up an Example

Test data:

1,tom,23
2,jack,24
3,lily,18
4,lucy,19

Full test code:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object Test {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark = SparkSession.builder()
      .appName(this.getClass.getSimpleName)
      .master("local[*]")
      .getOrCreate()

    import spark.sql
    import spark.implicits._
    val user_df = spark.read.textFile("./data/user")
      .map(_.split(","))
      .map(x => (x(0), x(1), x(2)))
      .toDF("id", "name", "age")
      .cache()

    user_df.createTempView("view")
    user_df.createGlobalTempView("global_view")

    sql("select * from view").show()
    sql("select * from global_temp.global_view").show()
    // Create a new SparkSession
    val new_session: SparkSession = spark.newSession()
    // The global temporary view is visible from the new session...
    new_session.sql("select * from global_temp.global_view").show()
    // ...but the local temporary view is not, so this call throws an AnalysisException
    new_session.sql("select * from view").show()
    spark.stop()
  }
}

Output:

+---+----+---+
| id|name|age|
+---+----+---+
|  1| tom| 23|
|  2|jack| 24|
|  3|lily| 18|
|  4|lucy| 19|
+---+----+---+

+---+----+---+
| id|name|age|
+---+----+---+
|  1| tom| 23|
|  2|jack| 24|
|  3|lily| 18|
|  4|lucy| 19|
+---+----+---+

+---+----+---+
| id|name|age|
+---+----+---+
|  1| tom| 23|
|  2|jack| 24|
|  3|lily| 18|
|  4|lucy| 19|
+---+----+---+

Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: view; line 1 pos 14
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:47)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:733)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:685)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:715)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:708)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:87)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:708)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:654)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
	at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
	at scala.collection.immutable.List.foldLeft(List.scala:84)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:127)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:121)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:106)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
	at cn.unisk.Test$.main(Test.scala:29)
	at cn.unisk.Test.main(Test.scala)
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'view' not found in database 'default';
	at org.apache.spark.sql.catalyst.catalog.ExternalCatalog$class.requireTableExists(ExternalCatalog.scala:48)
	at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.requireTableExists(InMemoryCatalog.scala:45)
	at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.getTable(InMemoryCatalog.scala:326)
	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getTable(ExternalCatalogWithListener.scala:138)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupRelation(SessionCatalog.scala:701)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:730)
	... 44 more

Clearly, a temporary view can only be accessed from the SparkSession that created it, whereas a global temporary view created by one SparkSession can also be accessed from other SparkSessions.

When do you need more than one SparkSession?

Honestly, I have never used a global temporary view myself, which is to say I have never hit a scenario that required creating multiple SparkSessions.

Two scenarios found online:

  • Keeping sessions with minor differences in configuration
    In plain terms: different SparkSessions can carry different configuration settings, as the spark-shell session below shows.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_141)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.range(100).groupBy("id").count.rdd.getNumPartitions
res0: Int = 200

scala> 

scala> val newSpark = spark.newSession
newSpark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@618a9cb7

scala> newSpark.conf.set("spark.sql.shuffle.partitions", 99)

scala> newSpark.range(100).groupBy("id").count.rdd.getNumPartitions
res2: Int = 99

scala> spark.range(100).groupBy("id").count.rdd.getNumPartitions  // No effect on initial session
res3: Int = 200
  • Separating temporary namespaces
    In plain terms: within the same application, different SparkSessions can register temporary views under the same name without clashing. (Although this does not seem to have much practical value.)
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_141)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.range(1).createTempView("foo")

scala> 

scala> spark.catalog.tableExists("foo")
res1: Boolean = true

scala> 

scala> val newSpark = spark.newSession
newSpark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@73418044

scala> newSpark.catalog.tableExists("foo")
res2: Boolean = false

scala> newSpark.range(100).createTempView("foo")  // No exception

scala> spark.table("foo").count // No effect on inital session
res4: Long = 1     

Afterword

Hopefully the temporary and global temporary views in Spark now make good sense.
