SparkSQL Execution Flow, SQL Execution Plans, Hive Integration, and Memory Allocation

SparkSQL Execution Flow

  1. SQL execution process
    Example: select f1,f2,f3 from table_name where condition
    Parse
    First, the SQL text is scanned for keywords (select, from, where, group by, etc.) to mark out the projection, the DataSource, and the filter.
    Bind
    Using the output of the parse phase (projection, DataSource, filter), the DataSource and fields are validated; if validation fails, an exception is thrown.
    Optimize
    Based on the statistics the database has collected about the current DataSource, the corresponding optimizations are applied.
    Execute
    Physical execution starts and the logical plan is translated into the corresponding tasks.
    (A minimal setup sketch for the global_temp.person view used by the later examples is shown below.)
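These four stages can be observed directly by calling explain(true) on a query, as the examples in the next section do. Those examples query a global temporary view global_temp.person backed by a Person case class (the class com.hyxy.SparkSql_Demo$Person appears in their plan output). A minimal sketch of how such a view might be registered follows; the sample rows, the application name, and the local master setting are assumptions for illustration, not the article's original code:

import org.apache.spark.sql.SparkSession

// Hypothetical setup matching the plan output in the next section.
case class Person(name: String, age: Int, address: String)

val spark = SparkSession.builder()
  .appName("SparkSql_Demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Building the Dataset from an RDD is what produces the ExternalRDD /
// SerializeFromObject operators seen in the plans below.
val person_ds = spark.sparkContext
  .parallelize(Seq(
    Person("Tom", 23, "Beijing"),
    Person("John", 38, "Shanghai")))
  .toDS()

// Register a global temporary view so SQL can reference it as global_temp.person.
person_ds.createGlobalTempView("person")

// explain(true) prints all four stages described above:
// Parsed -> Analyzed -> Optimized -> Physical.
spark.sql("select * from global_temp.person").explain(true)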


SQL Execution Plan

In essence, an execution plan is a tree: the nodes of the tree hold the plan information, and Spark processes the tree by applying Rule objects to its nodes.
SparkSQL tree nodes fall into several categories (a sketch for inspecting them follows this list):
a. Unary nodes: filter, count, etc.
b. Binary nodes: join, union, etc.
c. Leaf nodes: loading external data sources, etc.
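The node kinds can be seen by printing a plan tree through the Dataset's queryExecution handle. A minimal sketch, assuming the global_temp.person view registered in the earlier setup; the self-join is used only so that a binary Join node, a unary Filter node, and leaf scan nodes all appear in one tree:

// Hypothetical query: the self-join yields a binary Join node, the WHERE
// clause a unary Filter node, and the two relation scans are leaf nodes.
val tree_demo = spark.sql(
  """select a.name, b.age
    |from global_temp.person a
    |join global_temp.person b on a.name = b.name
    |where a.age > 30""".stripMargin)

// Each plan (parsed, analyzed, optimized, physical) is a tree of nodes;
// numberedTreeString prints one node per line together with its position.
println(tree_demo.queryExecution.optimizedPlan.numberedTreeString)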

// Example 1
val query_df = spark.sql("select  * from global_temp.person where name like '%o%'")
1. explain() shows only the physical execution plan:
        == Physical Plan ==
	*Filter Contains(name#16, o)
	+- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).name, true) AS name#16, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).age AS age#17, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).address, true) AS address#18]
	   +- Scan ExternalRDDScan[obj#15]
2. explain(true) shows the full execution plan of the SQL statement, split into four stages:
    a. Parse stage
	   == Parsed Logical Plan ==
		'Project [*]
		+- 'Filter 'name LIKE %o%
		   +- 'UnresolvedRelation `global_temp`.`person`

	   Note: Project is the projection, i.e., the columns returned in the result.
	b. Analyzed logical plan
           == Analyzed Logical Plan ==
		name: string, age: int, address: string
		Project [name#16, age#17, address#18]
		+- Filter name#16 LIKE %o%
		   +- SubqueryAlias person, `global_temp`.`person`
		      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).name, true) AS name#16, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).age AS age#17, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).address, true) AS address#18]
			 +- ExternalRDD [obj#15]
	c. Optimization stage
	   == Optimized Logical Plan ==
		Filter Contains(name#16, o)
		+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).name, true) AS name#16, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).age AS age#17, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).address, true) AS address#18]
		   +- ExternalRDD [obj#15]
	d. Physical execution plan
	   == Physical Plan ==
		*Filter Contains(name#16, o)
		+- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).name, true) AS name#16, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).age AS age#17, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).address, true) AS address#18]
		   +- Scan ExternalRDDScan[obj#15]
 
        
// Example 2: illustrating the optimization process
val query_df = spark.sql("select * from global_temp.person where age > 36")
query_df.explain(true)	  
1. == Parsed Logical Plan ==
	'Project [*]
	+- 'Filter ('age > 36)
	   +- 'UnresolvedRelation `global_temp`.`person`
	
2.== Analyzed Logical Plan ==
	name: string, age: int, address: string
	Project [name#16, age#17, address#18]
	+- Filter (age#17 > 36)
	   +- SubqueryAlias person, `global_temp`.`person`
	      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).name, true) AS name#16, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).age AS age#17, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).address, true) AS address#18]
		 +- ExternalRDD [obj#15]
	
3.== Optimized Logical Plan ==
	Filter (isnotnull(age#17) && (age#17 > 36))
	+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).name, true) AS name#16, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).age AS age#17, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).address, true) AS address#18]
	   +- ExternalRDD [obj#15]
	
4.== Physical Plan ==
	*Filter (isnotnull(age#17) && (age#17 > 36))
	+- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).name, true) AS name#16, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).age AS age#17, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.hyxy.SparkSql_Demo$Person, true]).address, true) AS address#18]
	   +- Scan ExternalRDDScan[obj#15]
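
The same optimization is applied whether the query is written in SQL or with Dataset operators; a minimal sketch, assuming the person_ds Dataset from the earlier setup:

// Equivalent Dataset-API query; Catalyst produces the same optimized plan,
// including the isnotnull(age) guard that the optimizer adds.
import spark.implicits._
person_ds.filter($"age" > 36).explain(true)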

Integrating SparkSQL with Hive

hive-site.xml must be copied into {SPARK_HOME/conf}; the steps are:

  1. Copy hive-site.xml into {SPARK_HOME/conf}:
    $>cd $HIVE_HOME/conf
    $>cp hive-site.xml ~/soft/spark/conf   [the custom Spark installation directory used earlier]
  2. Copy hive-site.xml to all Spark nodes:
    $>cd $SPARK_HOME/conf
    $>scp hive-site.xml hadoop@slave01:~/soft/spark/conf/
    $>scp hive-site.xml hadoop@slave02:~/soft/spark/conf/
  3. Copy the MySQL driver jar [mysql-connector-java-5.1.36-bin.jar] into {SPARK_HOME/jars} and distribute it to the other nodes:
    $>cp mysql-connector-java-5.1.36-bin.jar ~/soft/spark/jars/   [the jar is in $HIVE_HOME/lib]
    $>cd ~/soft/spark/jars/
    $>scp mysql-connector-java-5.1.36-bin.jar hadoop@slave01:~/soft/spark/jars/
    $>scp mysql-connector-java-5.1.36-bin.jar hadoop@slave02:~/soft/spark/jars/
  4. Start ZooKeeper and Hadoop:
    $>zkServer.sh start
    $>start-dfs.sh
    $>start-yarn.sh
  5. Start spark-sql:
    $>spark-sql              // starts in local mode by default
    equivalent to: spark-sql --master local
  6. In Standalone mode:
    $>spark-sql --master spark://master:7077
    In Spark on YARN mode:
    $>spark-sql --master yarn
  7. Run HQL in the spark-sql command line (a programmatic alternative is sketched below):
    spark-sql>show databases;
    spark-sql>use hive;
    spark-sql>select * from student;
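
Besides the spark-sql shell, Hive tables can be accessed programmatically once hive-site.xml is on the Spark classpath. A minimal sketch using enableHiveSupport(); the application name is arbitrary, and the database/table names follow the example above:

import org.apache.spark.sql.SparkSession

// enableHiveSupport() makes the SparkSession use the Hive metastore
// configured in hive-site.xml instead of the default in-memory catalog.
val spark = SparkSession.builder()
  .appName("SparkSqlOnHive")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()
spark.sql("use hive")
spark.sql("select * from student").show()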

Spark Memory Allocation

  1. Reserved Memory: 300M by default
    [org.apache.spark.memory.UnifiedMemoryManager] RESERVED_SYSTEM_MEMORY_BYTES = 300 * 1024 * 1024
    Purpose: reserved for Spark's own internal objects, such as sc and sparkSession.

  2. User Memory:
    Purpose: stores user-level data, such as user-defined parameters and variables.

  3. Spark Memory:
    Purpose: used for computation (Execution Memory) and caching (Storage Memory).

  4. Worked example (see the sketch after this list):
    If each executor is allocated 1G = 1024M of memory, then:
    Reserved Memory = 300M
    User Memory = (1024M - 300M) * (1 - 0.6) = 724M * 0.4 = 289.6M
    Spark Memory = (1024M - 300M) * 0.6 = 434.4M = Execution Memory (217.2M) + Storage Memory (217.2M)
    User Memory and Spark Memory split the usable 724M at a 4:6 ratio, and Spark Memory is itself split evenly between execution and storage.

  5. When sizing executor memory, note that the minimum allowed is 450M, i.e. 1.5 times the reserved memory:
    val minSystemMemory = (reservedMemory * 1.5).ceil.toLong

  6. Memory preemption:
    a. When cached data (RDDs) outgrows execution data, Storage Memory can borrow from Execution Memory.
    b. When Execution Memory needs the space back, Storage Memory must release it, because Execution Memory has the higher priority.
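
A minimal sketch mirroring the arithmetic in item 4, assuming the default spark.memory.fraction = 0.6 and spark.memory.storageFraction = 0.5:

// Mirrors the 1 GB-per-executor example above; the fractions used are the
// documented defaults (spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5).
val systemMemory   = 1024L * 1024 * 1024            // 1 GB per executor
val reservedMemory = 300L * 1024 * 1024             // RESERVED_SYSTEM_MEMORY_BYTES
val usableMemory   = systemMemory - reservedMemory  // 724 MB
val sparkMemory     = (usableMemory * 0.6).toLong   // ~434.4 MB
val userMemory      = usableMemory - sparkMemory    // ~289.6 MB
val storageMemory   = (sparkMemory * 0.5).toLong    // ~217.2 MB
val executionMemory = sparkMemory - storageMemory   // ~217.2 MB
val minSystemMemory = (reservedMemory * 1.5).ceil.toLong // 450 MB lower bound

println(s"user=$userMemory, execution=$executionMemory, storage=$storageMemory")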

    Tip (IntelliJ IDEA): Ctrl+Shift+N opens the class search dialog; Ctrl+Shift+F opens find-in-path search.
    
