A Quick Walkthrough of the Spark SQL Source Code

Overview

Since Spark unified RDDs and DataFrames (Datasets), DataFrames have largely overtaken raw RDDs in batch jobs, and Spark SQL is used more and more often, so a basic understanding of its execution pipeline is essential. This post walks through the Spark SQL source code at a high level; for the more involved parts we only aim to understand what they are for, without digging into every detail.

Starting from a SQL Statement

In recent versions SparkSession has long been the unified entry point, so let's jump straight into SparkSession.sql:

class SparkSession private...
  def sql(sqlText: String): DataFrame = {
    // tracker is a shared wrapper, mainly used to record execution time; it is threaded
    // through the entire SQL planning process
    val tracker = new QueryPlanningTracker
    // measurePhase is shown below. The construction of sessionState is not expanded here:
    // it is a lazy object whose class name is read from configuration and instantiated via reflection
    val plan = tracker.measurePhase(QueryPlanningTracker.PARSING) {
      sessionState.sqlParser.parsePlan(sqlText)
    }
    Dataset.ofRows(self, plan, tracker)
  }
  // Measure the start and end time of a phase. This is an application of currying;
  // the calling convention will show up repeatedly in the code below
  def measurePhase[T](phase: String)(f: => T): T = {
    val startTime = System.currentTimeMillis()
    val ret = f // evaluates the second (by-name) parameter, i.e. the code block passed in
    val endTime = System.currentTimeMillis
    // phase is the first parameter; callers always pass the same value for the same kind of
    // processing, so the time spent in each phase can be accumulated accurately:
    // val PARSING = "parsing"; val ANALYSIS = "analysis"; val OPTIMIZATION = "optimization"; val PLANNING = "planning"
    if (phasesMap.containsKey(phase)) {
      val oldSummary = phasesMap.get(phase)
      phasesMap.put(phase, new PhaseSummary(oldSummary.startTimeMs, endTime))
    } else {
      phasesMap.put(phase, new PhaseSummary(startTime, endTime))
    }
    ret
  }
}
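As a side note, here is a minimal, standalone sketch of the currying plus by-name-parameter pattern that measurePhase relies on; timed and its call site below are invented purely for illustration and are not Spark code:

def timed[T](phase: String)(f: => T): T = {
  val start = System.currentTimeMillis()
  val ret = f                              // the by-name block is evaluated only here
  val end = System.currentTimeMillis()
  println(s"$phase took ${end - start} ms")
  ret
}

val parsed = timed("parsing") {
  // any expression works; it runs inside timed(), between the two timestamps
  "SELECT 1".toLowerCase
}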

Given the body of measurePhase above, we can go straight to the second, curried part: sessionState.sqlParser.parsePlan(sqlText). Here we jump directly to the implementation class CatalystSqlParser:

class CatalystSqlParser(conf: SQLConf) extends AbstractSqlParser {
  val astBuilder = new AstBuilder(conf)
}

It simply instantiates an AstBuilder. Looking further into parsePlan, we find another curried method whose second part is an anonymous function:

  /** Creates LogicalPlan for a given SQL string. */
  override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
    astBuilder.visitSingleStatement(parser.singleStatement()) match {
      case plan: LogicalPlan => plan
      case _ =>
        val position = Origin(None, None)
        throw new ParseException(Option(sqlText), "Unsupported SQL statement", position, position)
    }
  }

Let's look at the body of parse first. The curried second parameter is only invoked near the end, so you can simply imagine the anonymous function above substituted in at that point. (The rest of this post will not break curried calls apart again.)

protected def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
    logDebug(s"Parsing command: $command")
    // instantiate the lexer (not expanded further)
    val lexer = new SqlBaseLexer(new UpperCaseCharStream(CharStreams.fromString(command)))
    ...
    // build the token stream
    val tokenStream = new CommonTokenStream(lexer)
    // instantiate the parser (not expanded further)
    val parser = new SqlBaseParser(tokenStream)
    try {
      try {
        // first, try parsing with potentially faster SLL mode
        parser.getInterpreter.setPredictionMode(PredictionMode.SLL)
        // invoke the second (curried) parameter, a function, with the freshly built parser
        toResult(parser)
      }
      catch {}
    }
    catch {}
  }

Back to the second part of parsePlan: parser.singleStatement() returns the top-level node defined in the grammar file, i.e. the root rules of SqlBase.g4, and visitSingleStatement walks the syntax tree, converting its nodes into the LogicalPlan that is returned.

{ parser =>
    astBuilder.visitSingleStatement(parser.singleStatement()) match {
      case plan: LogicalPlan => plan
      case _ =>...
    }
}

At this point Spark SQL has completed the initial transformation from a SQL string into a logical tree; everything that follows is a series of passes over this still-unoptimized tree.
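If you want to see this initial, unresolved tree yourself, the parser can be invoked on its own. A small sketch, assuming a SparkSession named spark (sessionState is an unstable developer API):

val unresolved = spark.sessionState.sqlParser.parsePlan("SELECT a FROM t WHERE a > 1")
println(unresolved)
// prints something like:
// 'Project ['a]
// +- 'Filter ('a > 1)
//    +- 'UnresolvedRelation `t`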

The Analyzer (package org.apache.spark.sql.catalyst.analysis)

Back to the last step of SparkSession.sql, Dataset.ofRows(self, plan, tracker), which is where the important QueryExecution object is constructed:

/** A variant of ofRows that allows passing in a tracker so we can track query parsing time. */
  def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan, tracker: QueryPlanningTracker)
    : DataFrame = {
    val qe = new QueryExecution(sparkSession, logicalPlan, tracker)
    qe.assertAnalyzed()
    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
  }

Let's look at what QueryExecution contains. All of its key pieces are lazy vals; assertAnalyzed and later calls trigger them one by one:

class QueryExecution(...) {
  // TODO: Move the planner an optimizer into here from SessionState.
  protected def planner = sparkSession.sessionState.planner
  // the trigger point
  def assertAnalyzed(): Unit = analyzed
  // measurePhase was introduced above: it just records timing; for the curried call, focus on the second part
  lazy val analyzed: LogicalPlan = tracker.measurePhase(QueryPlanningTracker.ANALYSIS) {
    // setActiveSession was covered in the Spark Core walkthrough; it guarantees a single active SparkSession
    SparkSession.setActiveSession(sparkSession)
    sparkSession.sessionState.analyzer.executeAndCheck(logical, tracker)
  }
}
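Almost everything in QueryExecution is a lazy val, so nothing beyond analysis runs at this point. A toy sketch of the same lazy chaining (the Pipeline class is invented for illustration):

class Pipeline(sql: String) {
  lazy val parsed: String    = { println("parsing");    sql.trim }
  lazy val analyzed: String  = { println("analyzing");  parsed.toLowerCase }
  lazy val optimized: String = { println("optimizing"); analyzed.replaceAll("\\s+", " ") }
}

val p = new Pipeline("  SELECT   1  ")
p.analyzed   // triggers parsing and analyzing only
p.optimized  // optimizing runs only now, reusing the cached analyzed value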

先来看解析器analyzer部分,继承至RuleExecutor,这点很重要

class Analyzer(
    catalog: SessionCatalog,
    conf: SQLConf,
    maxIterations: Int)
  extends RuleExecutor[LogicalPlan] with CheckAnalysis {
  def executeAndCheck(plan: LogicalPlan, tracker: QueryPlanningTracker): LogicalPlan = {
      val analyzed = executeAndTrack(plan, tracker)
      ...
  }
}
abstract class RuleExecutor[TreeType <: TreeNode[_]] {
  def executeAndTrack(plan: TreeType, tracker: QueryPlanningTracker): TreeType = {
    QueryPlanningTracker.withTracker(tracker) {
      execute(plan)
    }
  }
  // This is the heart of the analyzer: "Executes the batches of rules defined by the subclass."
  def execute(plan: TreeType): TreeType = {
    var curPlan = plan
    ...
    // iterate over every batch defined in the Analyzer subclass
    batches.foreach { batch =>
      val batchStartPlan = curPlan
      var iteration = 1
      var lastPlan = curPlan
      var continue = true
      // Run until fix point (or the max number of iterations as specified in the strategy).
      while (continue) {
        // fold every rule defined in this batch over the current plan
        curPlan = batch.rules.foldLeft(curPlan) {
          case (plan, rule) =>
            val startTime = System.nanoTime()
            val result = rule(plan)
            val runTime = System.nanoTime() - startTime
            val effective = !result.fastEquals(plan)
            ...
        }
        ...
      }
    }
    curPlan
  }
}
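The fixedPoint strategy that shows up in the batches below simply means "keep re-running the batch until the plan stops changing, or maxIterations is hit". A toy version of that loop, with an invented SimplePlan type standing in for LogicalPlan:

case class SimplePlan(value: Int)

def runBatch(plan: SimplePlan, rules: Seq[SimplePlan => SimplePlan], maxIterations: Int = 100): SimplePlan = {
  var curPlan = plan
  var continue = true
  var iteration = 1
  while (continue) {
    val lastPlan = curPlan
    curPlan = rules.foldLeft(curPlan) { case (p, rule) => rule(p) }        // apply every rule once
    iteration += 1
    if (curPlan == lastPlan || iteration > maxIterations) continue = false // fixed point reached
  }
  curPlan
}

val halveEven: SimplePlan => SimplePlan = p => if (p.value % 2 == 0) SimplePlan(p.value / 2) else p
runBatch(SimplePlan(40), Seq(halveEven))   // 40 -> 20 -> 10 -> SimplePlan(5)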

In short, it walks through and applies the rules defined in the Analyzer's batches. Part of batches is reproduced below for reference:

// batches is just a Seq whose elements are instances of the Batch template class
// protected case class Batch(name: String, strategy: Strategy, rules: Rule[TreeType]*)
lazy val batches: Seq[Batch] = Seq(
    Batch("Hints", fixedPoint,
      new ResolveHints.ResolveBroadcastHints(conf),
      ResolveHints.ResolveCoalesceHints,
      ResolveHints.RemoveAllHints),
    Batch("Simple Sanity Check", Once,
      LookupFunctions),
    Batch("Substitution", fixedPoint,
      CTESubstitution,
      WindowsSubstitution,
      EliminateUnions,
      new SubstituteUnresolvedOrdinals(conf)),
    Batch("Resolution", fixedPoint,
      ResolveTableValuedFunctions ::
      ResolveRelations ::
      ResolveReferences ::
      ResolveCreateNamedStruct ::
      ResolveDeserializer ::
      ResolveNewInstance ::
      ResolveUpCast ::
      ResolveGroupingAnalytics ::
      ResolvePivot ::
      ResolveOrdinalInOrderByAndGroupBy ::
      ResolveAggAliasInGroupBy ::
      ResolveMissingReferences ::
      ExtractGenerator ::
      ResolveGenerate ::
      ResolveFunctions ::
      ResolveAliases ::
      ResolveSubquery ::
      ResolveSubqueryColumnAliases ::
      ResolveWindowOrder ::
      ResolveWindowFrame ::
      ResolveNaturalAndUsingJoin ::
      ResolveOutputRelation ::
      ExtractWindowExpressions ::
      GlobalAggregates ::
      ResolveAggregateFunctions ::
      TimeWindowing ::
      ResolveInlineTables(conf) ::
      ResolveHigherOrderFunctions(catalog) ::
      ResolveLambdaVariables(conf) ::
      ResolveTimeZone(conf) ::
      ResolveRandomSeed ::
      TypeCoercion.typeCoercionRules(conf) ++
      extendedResolutionRules : _*),
    Batch("Post-Hoc Resolution", Once, postHocResolutionRules: _*),
    Batch("View", Once,
      AliasViewChild(conf)),
    Batch("Nondeterministic", Once,
      PullOutNondeterministic),
    Batch("UDF", Once,
      HandleNullInputsForUDF),
    Batch("UpdateNullability", Once,
      UpdateAttributeNullability),
    Batch("Subquery", Once,
      UpdateOuterReferences),
    Batch("Cleanup", fixedPoint,
      CleanupAliases)
  )

Each Rule is a singleton object, and Batch("Resolution") is the most heavily used batch. Taking ResolveRelations as an example, let's trace how the analyzer arrives at the resolved logical tree; the inline comments are worth a careful read:

  /**
   * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog.
   */
  object ResolveRelations extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
      case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved =>
        EliminateSubqueryAliases(lookupTableFromCatalog(u)) match {
          case v: View =>
            u.failAnalysis(s"Inserting into a view is not allowed. View: ${v.desc.identifier}.")
          case other => i.copy(table = other)
        }
      // entry point
      case u: UnresolvedRelation => resolveRelation(u)
    }
    // If the unresolved relation is running directly on files, we just return the original
    // UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog
    // and change the default database name(in AnalysisContext) if it is a view. 
    def resolveRelation(plan: LogicalPlan): LogicalPlan = plan match {
      case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) =>
        val defaultDatabase = AnalysisContext.get.defaultDatabase
        // look it up via lookupTableFromCatalog
        val foundRelation = lookupTableFromCatalog(u, defaultDatabase)
        resolveRelation(foundRelation)
      // The view's child should be a logical plan parsed from the `desc.viewText`, the variable
      // `viewText` should be defined, or else we throw an error on the generation of the View
      // operator.
      case view @ View(desc, _, child) if !child.resolved =>...
        view.copy(child = newChild)
      case p @ SubqueryAlias(_, view: View) =>...
      case _ => plan
    }
    // Look up the table with the given name from catalog. The database we used is decided by the precedence.
    private def lookupTableFromCatalog(u: UnresolvedRelation,defaultDatabase: Option[String] = None): LogicalPlan = {
      val tableIdentWithDb = u.tableIdentifier.copy(
        database = u.tableIdentifier.database.orElse(defaultDatabase))
      try {
        // catalog is a SessionCatalog instance
        catalog.lookupRelation(tableIdentWithDb)
      } catch {...}
    }
 }
 // holds the metadata for all of Spark SQL's tables
class SessionCatalog(
    externalCatalogBuilder: () => ExternalCatalog,
    globalTempViewManagerBuilder: () => GlobalTempViewManager,
    functionRegistry: FunctionRegistry,
    conf: SQLConf,
    hadoopConf: Configuration,
    parser: ParserInterface,
    functionResourceLoader: FunctionResourceLoader) extends Logging {
	...
}
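As a quick aside, the metadata that lookupTableFromCatalog resolves against can also be inspected through the public catalog API. A sketch, assuming a SparkSession named spark:

spark.range(3).toDF("a").createOrReplaceTempView("t")
spark.catalog.listTables().show()   // the temp view t shows up alongside catalog tables
spark.catalog.tableExists("t")      // true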

After the analyzer pass we have a resolved logical tree: data sources, columns, functions and so on have all been bound.
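The difference is easy to observe on any DataFrame; queryExecution is exposed on Dataset as a developer API (a small sketch assuming the temp view t registered above):

val df = spark.sql("SELECT a FROM t WHERE a > 1")
println(df.queryExecution.logical)    // the unresolved plan straight from the parser
println(df.queryExecution.analyzed)   // the resolved plan: relation, columns and types are bound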

The Optimizer (org.apache.spark.sql.catalyst.optimizer)

Back in ofRows, it looks as if nothing else happens after qe.assertAnalyzed(), but that is not the case. Let's go back into the QueryExecution class:

def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan, tracker: QueryPlanningTracker)
    : DataFrame = {
    val qe = new QueryExecution(sparkSession, logicalPlan, tracker)
    qe.assertAnalyzed()
    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
  }
class QueryExecution(...) {
  // Most members are lazy vals with no obvious trigger point, but this one deserves special attention.
  // executedPlan should not be used to initialize any SparkPlan. It should be
  // only used for execution.
  lazy val executedPlan: SparkPlan = tracker.measurePhase(QueryPlanningTracker.PLANNING) {
    // referencing sparkPlan triggers that lazy val
    prepareForExecution(sparkPlan)
  }
}

Note the lazy val executedPlan above. Pick any action wrapped by Dataset, collect for example:

  def collect(): Array[T] = withAction("collect", queryExecution)(collectFromPlan)
  /**
   * Wrap a Dataset action to track the QueryExecution and time cost, then report to the
   * user-registered callback functions.
   */
  private def withAction[U](name: String, qe: QueryExecution)(action: SparkPlan => U) = {
    SQLExecution.withNewExecutionId(sparkSession, qe, Some(name)) {
      qe.executedPlan.foreach { plan =>
        plan.resetMetrics()
      }
      action(qe.executedPlan)
    }
  }

Never mind the other details; it is easy to see that every action wrapped by Dataset goes through QueryExecution.executedPlan, so the job of triggering this chain of lazy vals falls to the action operators.

lazy val sparkPlan: SparkPlan = tracker.measurePhase(QueryPlanningTracker.PLANNING) {
    SparkSession.setActiveSession(sparkSession)
    // TODO: We use next(), i.e. take the first plan returned by the planner, here for now, but we will implement to choose the best plan.
    // triggers optimizedPlan; the two wrapping calls then turn the result into a SparkPlan
    planner.plan(ReturnAnswer(optimizedPlan)).next()
}
lazy val optimizedPlan: LogicalPlan = tracker.measurePhase(QueryPlanningTracker.OPTIMIZATION) {
    // executeAndTrack again; this also triggers withCachedData, which is not expanded here
    sparkSession.sessionState.optimizer.executeAndTrack(withCachedData, tracker)
}

The lazy vals trigger one another in turn. The optimizer, like the analyzer, extends RuleExecutor, and its executeAndTrack call likewise ends up in execute(plan: TreeType), iterating over an array of rule Batches; the only difference between the two is the set of batches each one holds.
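The optimizer's batches can even be extended from user code, which is a handy way to experiment with the Rule mechanism. A hedged sketch; the no-op rule below is invented and deliberately does nothing:

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object NoopRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan   // return the plan unchanged
}

// SparkOptimizer appends user-provided rules as an extra batch of optimizations
spark.experimental.extraOptimizations ++= Seq(NoopRule)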

The Planner (org.apache.spark.sql.execution)

The optimizer produces an optimized logical tree, which planner.plan(LogicalPlan) then converts into a physical tree. Here planner is a SparkPlanner object; SparkPlanner extends SparkStrategies, which in turn extends QueryPlanner, and the plan method is defined in QueryPlanner:

def plan(plan: LogicalPlan): Iterator[PhysicalPlan] = {
    // Collect physical plan candidates.
    // apply each strategy in the strategy set to the plan; strategies is defined in SparkPlanner and every strategy extends Strategy
    val candidates = strategies.iterator.flatMap(_(plan))

    // The candidates may contain placeholders marked as [[planLater]],
    // so try to replace them by their child plans.
    val plans = candidates.flatMap { candidate =>
      // collectPlaceholders is shown below; it is defined in SparkPlanner
      val placeholders = collectPlaceholders(candidate)

      if (placeholders.isEmpty) {
        // Take the candidate as is because it does not contain placeholders.
        Iterator(candidate)
      } else {
        // Plan the logical plan marked as [[planLater]] and replace the placeholders.
        placeholders.iterator.foldLeft(Iterator(candidate)) {
          case (candidatesWithPlaceholders, (placeholder, logicalPlan)) =>
            // Plan the logical plan for the placeholder.
            val childPlans = this.plan(logicalPlan)

            candidatesWithPlaceholders.flatMap { candidateWithPlaceholders =>
              childPlans.map { childPlan =>
                // Replace the placeholder by the child plan
                candidateWithPlaceholders.transformUp {
                  case p if p.eq(placeholder) => childPlan
                }
              }
            }
        }
      }
    }
    // prunePlans is shown below; it is defined in SparkPlanner
    val pruned = prunePlans(plans)
    assert(pruned.hasNext, s"No plan for $plan")
    pruned
  }
  override protected def collectPlaceholders(plan: SparkPlan): Seq[(SparkPlan, LogicalPlan)] = {
    plan.collect {
      case placeholder @ PlanLater(logicalPlan) => placeholder -> logicalPlan
    }
  }
  override protected def prunePlans(plans: Iterator[SparkPlan]): Iterator[SparkPlan] = {
    // TODO: We will need to prune bad plans when we improve plan space exploration
    //       to prevent combinatorial explosion.
    plans
  }

The most important part of plan is the strategies at the top: the various strategies are applied to the incoming logical tree, producing a set of candidate physical trees, and planner.plan(ReturnAnswer(optimizedPlan)).next() takes the first one as the final physical tree.
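Strategies are pluggable in the same way as optimizer rules: user strategies are placed at the head of SparkPlanner's strategies list and consulted by this plan method. A sketch; the strategy below is invented and matches nothing:

import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

object NoopStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil   // Nil means "no match", so the planner moves on
}

spark.experimental.extraStrategies ++= Seq(NoopStrategy)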

Converting to Executable Code

Back to the prepareForExecution method we saw at the beginning:

  // executedPlan should not be used to initialize any SparkPlan. It should be
  // only used for execution.
  lazy val executedPlan: SparkPlan = tracker.measurePhase(QueryPlanningTracker.PLANNING) {
    prepareForExecution(sparkPlan)
  }
  protected def prepareForExecution(plan: SparkPlan): SparkPlan = {
    preparations.foldLeft(plan) { case (sp, rule) => rule.apply(sp) }
  }

Finally, a further set of rules is applied to the physical tree (sparkPlan, the physical plan). The contents of preparations are:

protected def preparations: Seq[Rule[SparkPlan]] = Seq(
   PlanSubqueries(sparkSession),                          // plan special subqueries into physical plans
   EnsureRequirements(sparkSession.sessionState.conf),    // ensure partitioning and ordering requirements are met
   CollapseCodegenStages(sparkSession.sessionState.conf), // whole-stage code generation
   ReuseExchange(sparkSession.sessionState.conf),         // reuse exchange nodes
   ReuseSubquery(sparkSession.sessionState.conf))         // reuse subqueries

Among these, CollapseCodegenStages performs whole-stage code generation. It is fairly involved, so we leave it for a future read; for now it is enough to know that the physical tree ends up as executable code. From here on the processing is the same as in Spark Core. Note that code generation happens on the driver, while the generated code is compiled on the executors.
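If you are curious about the generated code itself, the debug helpers in org.apache.spark.sql.execution.debug can print it without stepping through CollapseCodegenStages. A sketch assuming the temp view t from earlier:

import org.apache.spark.sql.execution.debug._

spark.sql("SELECT a, count(*) AS cnt FROM t GROUP BY a").debugCodegen()
// prints each whole-stage-codegen subtree followed by the Java source generated for it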

Wrapping Up

To recap, the whole execution flow is actually quite clear (the explain example after this list prints each of these stages for a query):

  1. The incoming SQL string is first turned by the SqlParser, via the lexer and parser, into an Unresolved LogicalPlan. At this point we only have a structured object; it carries no information about the actual data.
  2. The Analyzer turns that Unresolved LogicalPlan into an Analyzed LogicalPlan, which now knows where the data comes from and what each column is. This step uses the predefined Rules together with the SessionCatalog.
  3. An action operator then triggers the subsequent optimization steps.
  4. The Optimizer optimizes the Analyzed LogicalPlan from the previous step using predefined rules (Rule) such as predicate pushdown and variable substitution.
  5. The SparkPlanner then converts the Logical Plan into a Physical Plan, using the predefined strategies (Strategy).
  6. Finally, prepareForExecution turns that Physical Plan into executable code, again driven by predefined rules (Rule).
  7. The rest is handed over to Spark Core for the actual execution.
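As promised above, a single explain call prints every one of these stages in order (assuming a temp view t exists):

spark.sql("SELECT a, count(*) AS cnt FROM t GROUP BY a").explain(true)
// == Parsed Logical Plan ==    ...
// == Analyzed Logical Plan ==  ...
// == Optimized Logical Plan == ...
// == Physical Plan ==          ...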

The Spark SQL code base involves quite a lot of additional material, and its coding style is noticeably more advanced than Spark Core's. While chasing down questions I came across two very well-written blog posts; they are listed below.

References:

一条 SQL 在 Apache Spark 之旅
Spark Catalyst 源码解析
