Looking at the Spark Dataset API, the official documentation provides four methods for creating temporary views:
def createGlobalTempView(viewName: String): Unit
// Creates a global temporary view using the given name.
def createOrReplaceGlobalTempView(viewName: String): Unit
// Creates or replaces a global temporary view using the given name.
def createOrReplaceTempView(viewName: String): Unit
// Creates a local temporary view using the given name.
def createTempView(viewName: String): Unit
// Creates a local temporary view using the given name.
What are the differences between them, and when should each one be used?
createTempView creates a local temporary view, with viewName as the view's name. A temporary view is session-scoped: it disappears when the session that created it ends.
Example:
df.createTempView("people")
val df2 = spark.sql("select * from people")
createOrReplaceTempView creates a local temporary view with the given viewName; if a view with that name already exists, it is replaced.
These first two methods form one pair: both create local temporary views that are session-scoped rather than global. As the names suggest, the Replace variant replaces an existing view with the same name, while the plain variant throws an AnalysisException in that case.
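For a concrete feel of the difference, here is a minimal sketch (assuming a fresh session, with df being any DataFrame and spark the active SparkSession; the limit(10) is purely illustrative):
df.createTempView("people")
// df.createTempView("people")                      // would throw AnalysisException: the name is already taken
df.limit(10).createOrReplaceTempView("people")      // silently replaces the existing view
spark.sql("select * from people").show()            // now reads from the replaced view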
createGlobalTempView creates a global temporary view with the given viewName.
Temporary views in Spark SQL are session-scoped and disappear when the session that created them ends. If you want a temporary view that can be shared across sessions, create a global temporary view instead.
Example:
df.createGlobalTempView("people")
spark.sql("SELECT * FROM global_temp.people").show()
To emphasize again: when querying a global temporary view, you must qualify it with the global_temp database name.
createOrReplaceGlobalTempView creates a global temporary view with the given viewName; if a view with that name already exists, it is replaced.
These two methods likewise form a pair: both create global temporary views, and again the Replace variant replaces an existing view with the same name.
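And similarly for the global pair, a sketch under the same assumptions (the unqualified query is commented out because it would fail unless a local view with that name happens to exist):
df.createGlobalTempView("people")
df.limit(10).createOrReplaceGlobalTempView("people")   // replaces the existing global view
spark.sql("SELECT * FROM global_temp.people").show()   // must be qualified with global_temp
// spark.sql("SELECT * FROM people").show()            // unqualified: AnalysisException, table or view not found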
Temporary views in Spark SQL are session-scoped: if the session that created a view terminates, the view disappears. If you need a temporary view that is shared among all sessions and kept alive until the Spark application terminates, create a global temporary view. Global temporary views are tied to the system-preserved database global_temp, and you must use the qualified name to refer to them, e.g. SELECT * FROM global_temp.view1.
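As a side note (this uses the Spark Catalog API and is not part of the original test below), both kinds of views can also be dropped explicitly before the application ends:
spark.catalog.dropTempView("people")          // drops the local temporary view, if it exists
spark.catalog.dropGlobalTempView("people")    // drops the global temporary view from global_temp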
Test data:
1,tom,23
2,jack,24
3,lily,18
4,lucy,19
Full test code:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object Test {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val spark = SparkSession.builder()
      .appName(this.getClass.getSimpleName)
      .master("local[*]")
      .getOrCreate()
    import spark.sql
    import spark.implicits._

    // Load the test file and build a DataFrame with columns id, name, age
    val user_df = spark.read.textFile("./data/user")
      .map(_.split(","))
      .map(x => (x(0), x(1), x(2)))
      .toDF("id", "name", "age")
      .cache()

    // Register a local temporary view and a global temporary view on the same DataFrame
    user_df.createTempView("view")
    user_df.createGlobalTempView("global_view")

    // Both views are visible from the session that created them
    sql("select * from view").show()
    sql("select * from global_temp.global_view").show()

    // Create a new SparkSession
    val new_session: SparkSession = spark.newSession()
    // The global temporary view is still visible from the new session ...
    new_session.sql("select * from global_temp.global_view").show()
    // ... but the local temporary view is not: this line throws AnalysisException
    new_session.sql("select * from view").show()

    spark.stop()
  }
}
Output:
+---+----+---+
| id|name|age|
+---+----+---+
| 1| tom| 23|
| 2|jack| 24|
| 3|lily| 18|
| 4|lucy| 19|
+---+----+---+
+---+----+---+
| id|name|age|
+---+----+---+
| 1| tom| 23|
| 2|jack| 24|
| 3|lily| 18|
| 4|lucy| 19|
+---+----+---+
+---+----+---+
| id|name|age|
+---+----+---+
| 1| tom| 23|
| 2|jack| 24|
| 3|lily| 18|
| 4|lucy| 19|
+---+----+---+
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: view; line 1 pos 14
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:47)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:733)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:685)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:715)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:708)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:87)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:708)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:654)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:127)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:121)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:106)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
at cn.unisk.Test$.main(Test.scala:29)
at cn.unisk.Test.main(Test.scala)
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'view' not found in database 'default';
at org.apache.spark.sql.catalyst.catalog.ExternalCatalog$class.requireTableExists(ExternalCatalog.scala:48)
at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.requireTableExists(InMemoryCatalog.scala:45)
at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.getTable(InMemoryCatalog.scala:326)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getTable(ExternalCatalogWithListener.scala:138)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupRelation(SessionCatalog.scala:701)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:730)
... 44 more
Clearly, a temporary view created by one SparkSession can only be accessed from that same SparkSession, whereas a global temporary view created by one SparkSession can be accessed from any other SparkSession in the same application.
To be honest, I have never used global temporary views myself, simply because I have never had a scenario that required creating multiple SparkSessions.
Here are two scenarios found online: the first shows that each session keeps its own SQL configuration (e.g. spark.sql.shuffle.partitions), and the second shows that each session keeps its own set of temporary views:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_141)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.range(100).groupBy("id").count.rdd.getNumPartitions
res0: Int = 200
scala>
scala> val newSpark = spark.newSession
newSpark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@618a9cb7
scala> newSpark.conf.set("spark.sql.shuffle.partitions", 99)
scala> newSpark.range(100).groupBy("id").count.rdd.getNumPartitions
res2: Int = 99
scala> spark.range(100).groupBy("id").count.rdd.getNumPartitions // No effect on initial session
res3: Int = 200
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_141)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.range(1).createTempView("foo")
scala>
scala> spark.catalog.tableExists("foo")
res1: Boolean = true
scala>
scala> val newSpark = spark.newSession
newSpark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@73418044
scala> newSpark.catalog.tableExists("foo")
res2: Boolean = false
scala> newSpark.range(100).createTempView("foo") // No exception
scala> spark.table("foo").count // No effect on initial session
res4: Long = 1
Hopefully the behavior of temporary and global temporary views in Spark is now clear.