The essence of an execution plan: it is modeled as a tree, and Rule objects are applied to the tree nodes to process the node information (see the sketch after this list).
SparkSQL tree nodes fall into several categories:
a. Unary nodes (one child): filter, count, etc.
b. Binary nodes (two children): join, union, etc.
c. Leaf nodes (no children): loading external data, etc.
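A minimal sketch of this tree view (assumptions: an active SparkSession named spark and the global_temp.person view used in the examples below); it walks the analyzed plan node by node:
// The analyzed logical plan is a tree of nodes.
val df = spark.sql("select * from global_temp.person where name like '%o%'")
val plan = df.queryExecution.analyzed
// Print the indented tree, the same shape explain(true) shows.
println(plan.treeString)
// List each node with its number of children (leaf = 0, unary = 1, binary = 2).
plan.foreach(node => println(s"${node.nodeName}: ${node.children.size} child(ren)"))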
// Example 1
val query_df = spark.sql("select * from global_temp.person where name like '%o%'")
1. explain() prints only the physical plan (query_df.explain()):
== Physical Plan ==
*Filter Contains(name#16, o)
+- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).name, true) AS name#16, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).age AS age#17, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).address, true) AS address#18]
   +- Scan ExternalRDDScan[obj#15]
2. explain(true) prints the full execution plan of the SQL, in four phases (they can also be accessed individually; see the sketch after the outputs):
a. Parsing
== Parsed Logical Plan ==
'Project [*]
+- 'Filter 'name LIKE %o%
   +- 'UnresolvedRelation `global_temp`.`person`
Note: Project is the projection, i.e., the columns returned by the query.
b. Analysis (analyzed logical plan)
== Analyzed Logical Plan ==
name: string, age: int, address: string
Project [name#16, age#17, address#18]
+- Filter name#16 LIKE %o%
   +- SubqueryAlias person, `global_temp`.`person`
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).name, true) AS name#16, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).age AS age#17, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).address, true) AS address#18]
         +- ExternalRDD [obj#15]
c. Optimization
== Optimized Logical Plan ==
Filter Contains(name#16, o)
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).name, true) AS name#16, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).age AS age#17, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).address, true) AS address#18]
   +- ExternalRDD [obj#15]
d. Physical plan
== Physical Plan ==
*Filter Contains(name#16, o)
+- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).name, true) AS name#16, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).age AS age#17, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).address, true) AS address#18]
   +- Scan ExternalRDDScan[obj#15]
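The four stages printed above can also be pulled out one by one from the DataFrame's QueryExecution; a small sketch using the query_df from Example 1:
val qe = query_df.queryExecution
println(qe.logical)        // parsed (unresolved) logical plan
println(qe.analyzed)       // analyzed logical plan
println(qe.optimizedPlan)  // optimized logical plan
println(qe.executedPlan)   // physical plan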
// Example 2: illustrating the optimization process
val query_df = spark.sql("select * from global_temp.person where age > 36")
query_df.explain(true)
1. == Parsed Logical Plan ==
'Project [*]
+- 'Filter ('age > 36)
   +- 'UnresolvedRelation `global_temp`.`person`
2. == Analyzed Logical Plan ==
name: string, age: int, address: string
Project [name#16, age#17, address#18]
+- Filter (age#17 > 36)
   +- SubqueryAlias person, `global_temp`.`person`
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).name, true) AS name#16, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).age AS age#17, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).address, true) AS address#18]
         +- ExternalRDD [obj#15]
3. == Optimized Logical Plan ==
Filter (isnotnull(age#17) && (age#17 > 36))
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).name, true) AS name#16, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).age AS age#17, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).address, true) AS address#18]
   +- ExternalRDD [obj#15]
4. == Physical Plan ==
*Filter (isnotnull(age#17) && (age#17 > 36))
+- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).name, true) AS name#16, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).age AS age#17, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).address, true) AS address#18]
   +- Scan ExternalRDDScan[obj#15]
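The rewrites seen above (LIKE '%o%' turned into Contains, the extra isnotnull check) are applied by built-in optimizer Rules. A hypothetical sketch of the Rule mechanism mentioned at the top, registered through Spark's experimental hook; the rule below is for illustration only and simply logs Filter conditions without changing the plan:
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative rule: match Filter nodes in the tree, print their condition, return the plan unchanged.
object LogFilterConditions extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case f @ Filter(condition, _) =>
      println(s"optimizer saw filter: $condition")
      f
  }
}
// Runs as an extra batch after the built-in optimizer rules.
spark.experimental.extraOptimizations = Seq(LogFilterConditions)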
To run spark-sql against Hive, copy hive-site.xml into the {SPARK_HOME}/conf directory.
spark-sql
// spark-sql starts in local mode by default; to run on a standalone cluster: spark-sql --master spark://master:7077
spark-sql --master yarn
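The same Hive access also works from application code; a minimal sketch (assuming hive-site.xml is already in conf / on the classpath):
import org.apache.spark.sql.SparkSession

// Equivalent of what the spark-sql shell sets up: a session with Hive support.
val spark = SparkSession.builder()
  .appName("spark-sql-demo")
  .enableHiveSupport()
  .getOrCreate()
spark.sql("show databases").show()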
Reserved Memory: 300M by default;
[org.apache.spark.memory.UnifiedMemoryManager] RESERVED_SYSTEM_MEMORY_BYTES = 300 * 1024 * 1024
Purpose: reserved for Spark's own internal objects, e.g., sc, SparkSession, etc.
User Memory:
Purpose: stores user-level data, e.g., user-defined parameters and variables.
Spark Memory:
Purpose: used for computation (Execution Memory) and caching (Storage Memory).
Worked example:
Suppose each executor is allocated 1G = 1024M of memory; then:
Reserved Memory = 300M
User Memory = (1024M - 300M) * (1 - 0.6) = 724M * 0.4 = 289.6M
Spark Memory = (1024M - 300M) * 0.6 = 434.4M = Execution Memory (217.2M) + Storage Memory (217.2M)
User Memory and Spark Memory split the usable heap 4:6 (spark.memory.fraction = 0.6); Spark Memory is further split evenly between execution and storage (spark.memory.storageFraction = 0.5).
When sizing executor memory, note the required minimum is 450M, i.e., 1.5x the reserved memory:
val minSystemMemory = (reservedMemory * 1.5).ceil.toLong
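A small sketch that redoes the arithmetic above with the default fractions (0.6 = spark.memory.fraction, 0.5 = spark.memory.storageFraction), mirroring the split UnifiedMemoryManager performs:
// Reproduce the 1G-executor example.
val systemMemory    = 1024L * 1024 * 1024            // 1 GB executor heap
val reservedMemory  = 300L * 1024 * 1024             // RESERVED_SYSTEM_MEMORY_BYTES
val usable          = systemMemory - reservedMemory  // 724M
val sparkMemory     = (usable * 0.6).toLong          // ~434.4M
val userMemory      = usable - sparkMemory           // ~289.6M
val storageMemory   = (sparkMemory * 0.5).toLong     // ~217.2M
val executionMemory = sparkMemory - storageMemory    // ~217.2M
val minMemory       = (reservedMemory * 1.5).ceil.toLong // 450M minimum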
Memory borrowing (preemption) between the two regions (see the config sketch after this list):
a. If cached data (RDD blocks) outgrow execution needs, Storage Memory can borrow from Execution Memory;
b. When Execution Memory needs the space back, Storage Memory must release it (cached blocks are evicted), because Execution Memory has the higher priority.
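The two knobs behind this behavior, shown with their default values (a tuning sketch, not a recommendation):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "1g")
  // Share of (heap - reserved 300M) given to Spark Memory.
  .set("spark.memory.fraction", "0.6")
  // Portion of Spark Memory where cached blocks are safe from eviction;
  // beyond it, storage gives memory back when execution needs room.
  .set("spark.memory.storageFraction", "0.5")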
IDE shortcuts: Ctrl+Shift+N opens the class search; Ctrl+Shift+F opens the code (find-in-path) search.