Since Spark unified RDD and DataFrame (Dataset), DataFrame has largely overtaken the raw RDD API in batch processing, and Spark SQL is used more and more often. A basic understanding of its execution process is therefore worth having. This article walks through the Spark SQL source code at a high level; for the more complex parts, the goal is only to understand what they do and why, not to dig into every detail.
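Before diving into the source, it helps to keep the user-facing entry point in mind. A minimal sketch (the view name `t` is made up for illustration):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()
spark.range(10).createOrReplaceTempView("t")
val df = spark.sql("SELECT id FROM t WHERE id > 5") // parsing and analysis happen here
df.collect()                                        // optimization, planning and execution happen here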
In recent versions, SparkSession has long been the unified entry point, so let's go straight into the SparkSession.sql method:
class SparkSession private...
def sql(sqlText: String): DataFrame = {
// tracker is a unified wrapper used mainly to record execution time; it is threaded through the entire SQL planning process
val tracker = new QueryPlanningTracker
// the body of measurePhase is shown below; the construction of sessionState is not expanded here: it is a lazy object whose class name is read from the configuration and instantiated via reflection
val plan = tracker.measurePhase(QueryPlanningTracker.PARSING) {
sessionState.sqlParser.parsePlan(sqlText)
}
Dataset.ofRows(self, plan, tracker)
}
// Measure the start and end time of a phase. This is an application of currying; the same calling pattern shows up again in the code that follows
def measurePhase[T](phase: String)(f: => T): T = {
val startTime = System.currentTimeMillis()
val ret = f // evaluates the second parameter, a by-name function body
val endTime = System.currentTimeMillis
// phase is the first parameter; callers pass the same value for the same kind of processing, so the time spent in each phase is accumulated accurately
// the phase names are constants defined alongside the tracker:
// val PARSING = "parsing"; val ANALYSIS = "analysis"; val OPTIMIZATION = "optimization"; val PLANNING = "planning"
if (phasesMap.containsKey(phase)) {
val oldSummary = phasesMap.get(phase)
phasesMap.put(phase, new PhaseSummary(oldSummary.startTimeMs, endTime))
} else {
phasesMap.put(phase, new PhaseSummary(startTime, endTime))
}
ret
}
}
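To make the currying pattern concrete, here is a minimal standalone sketch (hypothetical names, not Spark code): the first parameter list carries the phase label, the second is a by-name block whose execution is timed.
def timePhase[T](phase: String)(body: => T): T = {
  val start = System.currentTimeMillis()
  val result = body // evaluate the by-name block
  println(s"$phase took ${System.currentTimeMillis() - start} ms")
  result
}

// The call site then reads just like measurePhase in QueryPlanningTracker:
val parsed = timePhase("parsing") {
  "pretend this is a LogicalPlan"
}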
Given the measurePhase body above, let's go straight to the second curried argument, sessionState.sqlParser.parsePlan(sqlText). Here we look directly at the implementation class CatalystSqlParser:
class CatalystSqlParser(conf: SQLConf) extends AbstractSqlParser {
val astBuilder = new AstBuilder(conf)
}
You can see that it instantiates an AstBuilder internally. Looking further at parsePlan, we find yet another curried method whose second part is an anonymous function:
/** Creates LogicalPlan for a given SQL string. */
override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
astBuilder.visitSingleStatement(parser.singleStatement()) match {
case plan: LogicalPlan => plan
case _ =>
val position = Origin(None, None)
throw new ParseException(Option(sqlText), "Unsupported SQL statement", position, position)
}
}
Let's look at the body of parse first. Simply put, the second curried parameter is only invoked at the very end, so you can just imagine the anonymous function above substituted in at that point. (The rest of this article will not break curried functions apart like this again.)
protected def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
logDebug(s"Parsing command: $command")
// instantiate the lexer (details not expanded here)
val lexer = new SqlBaseLexer(new UpperCaseCharStream(CharStreams.fromString(command)))
...
// build the token stream
val tokenStream = new CommonTokenStream(lexer)
// instantiate the parser (details not expanded here)
val parser = new SqlBaseParser(tokenStream)
try {
try {
// first, try parsing with potentially faster SLL mode
parser.getInterpreter.setPredictionMode(PredictionMode.SLL)
// invoke the second parameter, a function, passing in the freshly built parser
toResult(parser)
}
catch { ... } // if SLL parsing fails, rewind and retry with the slower but more complete LL mode
}
catch { ... } // wrap parse/analysis exceptions with the original command text
}
Back to the second part of parsePlan: parser.singleStatement() returns the top-level node defined in the grammar file, i.e. the root rule of SqlBase.g4, and visitSingleStatement walks the syntax tree, turning its nodes into the LogicalPlan that is returned.
{ parser =>
astBuilder.visitSingleStatement(parser.singleStatement()) match {
case plan: LogicalPlan => plan
case _ =>...
}
}
At this point, Spark SQL has completed the initial transformation from a SQL string into a logical tree; what follows is a series of passes over this still-unoptimized tree.
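The tree produced by the parser can be inspected from user code through the Dataset's queryExecution (covered in more detail below). Assuming the `t` view from the earlier sketch (the printed form varies between versions):
val parsed = spark.sql("SELECT id FROM t").queryExecution.logical
println(parsed)
// roughly:
// 'Project ['id]
// +- 'UnresolvedRelation `t`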
Back to the final step of the SparkSession.sql method, Dataset.ofRows(self, plan, tracker), which is the entry point for constructing the important QueryExecution object:
/** A variant of ofRows that allows passing in a tracker so we can track query parsing time. */
def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan, tracker: QueryPlanningTracker)
: DataFrame = {
val qe = new QueryExecution(sparkSession, logicalPlan, tracker)
qe.assertAnalyzed()
new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
}
Let's first look at what QueryExecution contains when it is instantiated. All of its key members are lazy vals, which assertAnalyzed and the later steps trigger one by one:
class QueryExecution(...) {
// TODO: Move the planner an optimizer into here from SessionState.
protected def planner = sparkSession.sessionState.planner
// the trigger point: merely referencing `analyzed` forces the lazy val below
def assertAnalyzed(): Unit = analyzed
// measurePhase was covered above; it just records timing, so focus on the second curried argument
lazy val analyzed: LogicalPlan = tracker.measurePhase(QueryPlanningTracker.ANALYSIS) {
// setActiveSession was covered in the Spark Core posts; it makes this session the active SparkSession
SparkSession.setActiveSession(sparkSession)
sparkSession.sessionState.analyzer.executeAndCheck(logical, tracker)
}
}
First, the analyzer. It extends RuleExecutor, which is an important detail:
class Analyzer(
catalog: SessionCatalog,
conf: SQLConf,
maxIterations: Int)
extends RuleExecutor[LogicalPlan] with CheckAnalysis {
def executeAndCheck(plan: LogicalPlan, tracker: QueryPlanningTracker): LogicalPlan = {
val analyzed = executeAndTrack(plan, tracker)
...
}
}
abstract class RuleExecutor[TreeType <: TreeNode[_]] {
def executeAndTrack(plan: TreeType, tracker: QueryPlanningTracker): TreeType = {
QueryPlanningTracker.withTracker(tracker) {
execute(plan)
}
}
// This is the core of the analyzer: executes the batches of rules defined by the subclass.
def execute(plan: TreeType): TreeType = {
var curPlan = plan
...
// iterate over every batch defined by the subclass (here, the Analyzer)
batches.foreach { batch =>
val batchStartPlan = curPlan
var iteration = 1
var lastPlan = curPlan
var continue = true
// Run until fix point (or the max number of iterations as specified in the strategy).
while (continue) {
// fold over the rules defined in this batch, applying each to the current plan
curPlan = batch.rules.foldLeft(curPlan) {
case (plan, rule) =>
val startTime = System.nanoTime()
val result = rule(plan)
val runTime = System.nanoTime() - startTime
val effective = !result.fastEquals(plan)
...
curPlan
}
}
}
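To see the fixed-point batch mechanism in isolation, here is a toy sketch (not Spark code): the "plan" is a tiny expression tree, and a batch of rules is applied bottom-up until nothing changes or the iteration limit is hit.
sealed trait Expr
case class Lit(v: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

// apply a rule to every node, children first (analogous to resolveOperatorsUp)
def transformUp(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
  val rewritten = e match {
    case Add(l, r) => Add(transformUp(l)(rule), transformUp(r)(rule))
    case leaf      => leaf
  }
  rule.applyOrElse(rewritten, identity[Expr])
}

// one rule: constant folding
val constantFold: PartialFunction[Expr, Expr] = {
  case Add(Lit(a), Lit(b)) => Lit(a + b)
}

// the fixed-point driver, mirroring RuleExecutor.execute in miniature
def execute(plan: Expr, rules: Seq[PartialFunction[Expr, Expr]], maxIterations: Int = 100): Expr = {
  var curPlan = plan
  var continue = true
  var iteration = 0
  while (continue && iteration < maxIterations) {
    val next = rules.foldLeft(curPlan)((p, r) => transformUp(p)(r))
    continue = next != curPlan // reached a fixed point when no rule changed the plan
    curPlan = next
    iteration += 1
  }
  curPlan
}

execute(Add(Lit(1), Add(Lit(2), Lit(3))), Seq(constantFold))
// => Lit(6): the inner Add folds first, then the outer one, all within a single pass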
In short, execute just keeps applying the rules defined in the analyzer's batches. Part of the batches definition is shown below for reference:
// batches is simply a Seq built from the Batch template class
// protected case class Batch(name: String, strategy: Strategy, rules: Rule[TreeType]*)
lazy val batches: Seq[Batch] = Seq(
Batch("Hints", fixedPoint,
new ResolveHints.ResolveBroadcastHints(conf),
ResolveHints.ResolveCoalesceHints,
ResolveHints.RemoveAllHints),
Batch("Simple Sanity Check", Once,
LookupFunctions),
Batch("Substitution", fixedPoint,
CTESubstitution,
WindowsSubstitution,
EliminateUnions,
new SubstituteUnresolvedOrdinals(conf)),
Batch("Resolution", fixedPoint,
ResolveTableValuedFunctions ::
ResolveRelations ::
ResolveReferences ::
ResolveCreateNamedStruct ::
ResolveDeserializer ::
ResolveNewInstance ::
ResolveUpCast ::
ResolveGroupingAnalytics ::
ResolvePivot ::
ResolveOrdinalInOrderByAndGroupBy ::
ResolveAggAliasInGroupBy ::
ResolveMissingReferences ::
ExtractGenerator ::
ResolveGenerate ::
ResolveFunctions ::
ResolveAliases ::
ResolveSubquery ::
ResolveSubqueryColumnAliases ::
ResolveWindowOrder ::
ResolveWindowFrame ::
ResolveNaturalAndUsingJoin ::
ResolveOutputRelation ::
ExtractWindowExpressions ::
GlobalAggregates ::
ResolveAggregateFunctions ::
TimeWindowing ::
ResolveInlineTables(conf) ::
ResolveHigherOrderFunctions(catalog) ::
ResolveLambdaVariables(conf) ::
ResolveTimeZone(conf) ::
ResolveRandomSeed ::
TypeCoercion.typeCoercionRules(conf) ++
extendedResolutionRules : _*),
Batch("Post-Hoc Resolution", Once, postHocResolutionRules: _*),
Batch("View", Once,
AliasViewChild(conf)),
Batch("Nondeterministic", Once,
PullOutNondeterministic),
Batch("UDF", Once,
HandleNullInputsForUDF),
Batch("UpdateNullability", Once,
UpdateAttributeNullability),
Batch("Subquery", Once,
UpdateOuterReferences),
Batch("Cleanup", fixedPoint,
CleanupAliases)
)
Each rule is a singleton object, and the Batch("Resolution") group is the most commonly exercised. Taking ResolveRelations as an example, let's trace how the analyzer arrives at the resolved logical tree; pay particular attention to the comments:
/**
* Replaces [[UnresolvedRelation]]s with concrete relations from the catalog.
*/
object ResolveRelations extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved =>
EliminateSubqueryAliases(lookupTableFromCatalog(u)) match {
case v: View =>
u.failAnalysis(s"Inserting into a view is not allowed. View: ${v.desc.identifier}.")
case other => i.copy(table = other)
}
// entry point
case u: UnresolvedRelation => resolveRelation(u)
}
// If the unresolved relation is running directly on files, we just return the original
// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog
// and change the default database name(in AnalysisContext) if it is a view.
def resolveRelation(plan: LogicalPlan): LogicalPlan = plan match {
case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) =>
val defaultDatabase = AnalysisContext.get.defaultDatabase
// look up the table via lookupTableFromCatalog
val foundRelation = lookupTableFromCatalog(u, defaultDatabase)
resolveRelation(foundRelation)
// The view's child should be a logical plan parsed from the `desc.viewText`, the variable
// `viewText` should be defined, or else we throw an error on the generation of the View
// operator.
case view @ View(desc, _, child) if !child.resolved =>...
view.copy(child = newChild)
case p @ SubqueryAlias(_, view: View) =>...
case _ => plan
}
// Look up the table with the given name from catalog. The database we used is decided by the precedence.
private def lookupTableFromCatalog(u: UnresolvedRelation,defaultDatabase: Option[String] = None): LogicalPlan = {
val tableIdentWithDb = u.tableIdentifier.copy(
database = u.tableIdentifier.database.orElse(defaultDatabase))
try {
// catalog is a SessionCatalog instance
catalog.lookupRelation(tableIdentWithDb)
} catch {...}
}
}
// holds all of Spark SQL's table metadata
class SessionCatalog(
externalCatalogBuilder: () => ExternalCatalog,
globalTempViewManagerBuilder: () => GlobalTempViewManager,
functionRegistry: FunctionRegistry,
conf: SQLConf,
hadoopConf: Configuration,
parser: ParserInterface,
functionResourceLoader: FunctionResourceLoader) extends Logging {
...
}
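The metadata managed by SessionCatalog is also exposed to user code through spark.catalog, which is essentially a thin wrapper around it. For example (assuming the `t` view registered earlier):
spark.catalog.listTables().show()
spark.catalog.listColumns("t").show()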
After the analyzer has done its work, we have a resolved logical tree: data sources, columns, functions and so on have all been bound.
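The difference before and after the analyzer is easy to observe directly (again assuming the `t` view; output abridged):
val qe = spark.sql("SELECT id FROM t").queryExecution
println(qe.logical)  // still contains 'UnresolvedRelation `t`
println(qe.analyzed) // the relation and the `id` attribute are now resolved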
Back in the ofRows method, it looks as though nothing else happens after qe.assertAnalyzed(). That is not quite the case; let's step back into the QueryExecution construction:
def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan, tracker: QueryPlanningTracker)
: DataFrame = {
val qe = new QueryExecution(sparkSession, logicalPlan, tracker)
qe.assertAnalyzed()
new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
}
class QueryExecution(...){
// most members are lazy vals, so it is not obvious when they run, but one of them deserves special attention
// executedPlan should not be used to initialize any SparkPlan. It should be
// only used for execution.
lazy val executedPlan: SparkPlan = tracker.measurePhase(QueryPlanningTracker.PLANNING) {
// forces the lazy sparkPlan val
prepareForExecution(sparkPlan)
}
}
Note the lazy val executedPlan above. Pick any action wrapped by Dataset, collect for example:
def collect(): Array[T] = withAction("collect", queryExecution)(collectFromPlan)
/**
* Wrap a Dataset action to track the QueryExecution and time cost, then report to the
* user-registered callback functions.
*/
private def withAction[U](name: String, qe: QueryExecution)(action: SparkPlan => U) = {
SQLExecution.withNewExecutionId(sparkSession, qe, Some(name)) {
qe.executedPlan.foreach { plan =>
plan.resetMetrics()
}
action(qe.executedPlan)
}
}
The other details do not matter much here. What is easy to see is that every action wrapped by Dataset ends up touching QueryExecution's executedPlan, so triggering these lazy objects is left to the action operators. The lazy definitions themselves look like this:
lazy val sparkPlan: SparkPlan = tracker.measurePhase(QueryPlanningTracker.PLANNING) {
SparkSession.setActiveSession(sparkSession)
// TODO: We use next(), i.e. take the first plan returned by the planner, here for now, but we will implement to choose the best plan.
// forces optimizedPlan, then returns a SparkPlan through the two wrapping calls below
planner.plan(ReturnAnswer(optimizedPlan)).next()
}
lazy val optimizedPlan: LogicalPlan = tracker.measurePhase(QueryPlanningTracker.OPTIMIZATION) {
// once again executeAndTrack; this also forces withCachedData, which we will not expand on here
sparkSession.sessionState.optimizer.executeAndTrack(withCachedData, tracker)
}
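The way these definitions chain together is plain Scala lazy-val semantics; a minimal sketch of the mechanism (not Spark code):
class Pipeline {
  lazy val analyzed: String  = { println("analyze");  "analyzed" }
  lazy val optimized: String = { println("optimize"); analyzed + " -> optimized" }
  lazy val planned: String   = { println("plan");     optimized + " -> planned" }
}

val p = new Pipeline // nothing is evaluated yet
p.planned            // forces planned, which in turn forces optimized and then analyzed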
These lazy vals thus trigger one another in turn. The optimizer, like the analyzer, extends RuleExecutor, and its executeAndTrack also ends up calling execute(plan: TreeType) to walk through an array of rule Batches; the only difference between the two lies in the batches each of them defines.
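Because the optimizer is just another RuleExecutor driven by a list of batches, Spark leaves a hook for appending user-defined rules to it. A minimal sketch (the rule is a deliberate no-op, only to show the expected shape):
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// a pass-through rule: a real rule would pattern-match on the plan and rewrite it
object NoOpRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// picked up by the optimizer as an extra batch of user-provided rules
spark.experimental.extraOptimizations = Seq(NoOpRule)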
The optimizer produces the optimized logical tree, and planner.plan(LogicalPlan) then converts it into a physical tree. Here planner is a SparkPlanner object; SparkPlanner extends SparkStrategies, which in turn extends QueryPlanner, and the plan method is defined in QueryPlanner:
def plan(plan: LogicalPlan): Iterator[PhysicalPlan] = {
// Collect physical plan candidates.
// apply the whole strategy set to the plan; strategies is defined in SparkPlanner, and each strategy extends Strategy
val candidates = strategies.iterator.flatMap(_(plan))
// The candidates may contain placeholders marked as [[planLater]],
// so try to replace them by their child plans.
val plans = candidates.flatMap { candidate =>
// collectPlaceholders is shown below; it is overridden in SparkPlanner
val placeholders = collectPlaceholders(candidate)
if (placeholders.isEmpty) {
// Take the candidate as is because it does not contain placeholders.
Iterator(candidate)
} else {
// Plan the logical plan marked as [[planLater]] and replace the placeholders.
placeholders.iterator.foldLeft(Iterator(candidate)) {
case (candidatesWithPlaceholders, (placeholder, logicalPlan)) =>
// Plan the logical plan for the placeholder.
val childPlans = this.plan(logicalPlan)
candidatesWithPlaceholders.flatMap { candidateWithPlaceholders =>
childPlans.map { childPlan =>
// Replace the placeholder by the child plan
candidateWithPlaceholders.transformUp {
case p if p.eq(placeholder) => childPlan
}
}
}
}
}
}
// prunePlans is shown below; it is overridden in SparkPlanner
val pruned = prunePlans(plans)
assert(pruned.hasNext, s"No plan for $plan")
pruned
}
override protected def collectPlaceholders(plan: SparkPlan): Seq[(SparkPlan, LogicalPlan)] = {
plan.collect {
case placeholder @ PlanLater(logicalPlan) => placeholder -> logicalPlan
}
}
override protected def prunePlans(plans: Iterator[SparkPlan]): Iterator[SparkPlan] = {
// TODO: We will need to prune bad plans when we improve plan space exploration
// to prevent combinatorial explosion.
plans
}
The key part of plan is the strategies applied at the top: they turn the incoming logical tree into a set of candidate physical trees, and planner.plan(ReturnAnswer(optimizedPlan)).next() simply takes the first one as the final physical plan.
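Like the optimizer, the planner has an extension point: extra strategies can be injected ahead of the built-in ones via spark.experimental.extraStrategies. A minimal sketch (this strategy never matches, so planning falls through to the built-in strategies):
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

object NoOpStrategy extends Strategy {
  // returning Nil means "this strategy does not apply to this plan"
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
}

spark.experimental.extraStrategies = Seq(NoOpStrategy)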
Returning to the prepareForExecution method from earlier:
// executedPlan should not be used to initialize any SparkPlan. It should be
// only used for execution.
lazy val executedPlan: SparkPlan = tracker.measurePhase(QueryPlanningTracker.PLANNING) {
prepareForExecution(sparkPlan)
}
protected def prepareForExecution(plan: SparkPlan): SparkPlan = {
preparations.foldLeft(plan) { case (sp, rule) => rule.apply(sp) }
}
Finally, one more set of rules is applied to the physical tree (sparkPlan, the Physical Plan); the contents of preparations are as follows:
protected def preparations: Seq[Rule[SparkPlan]] = Seq(
PlanSubqueries(sparkSession), // plan the physical counterparts of subqueries
EnsureRequirements(sparkSession.sessionState.conf), // ensure the plan's partitioning and ordering requirements are met
CollapseCodegenStages(sparkSession.sessionState.conf), // whole-stage code generation
ReuseExchange(sparkSession.sessionState.conf), // reuse exchange (shuffle) nodes
ReuseSubquery(sparkSession.sessionState.conf)) // reuse subqueries
Among these rules, CollapseCodegenStages performs whole-stage code generation. Its internals are fairly involved and are left for a later, closer read; for now it is enough to know that the physical tree is ultimately turned into executable code. From this point on, processing proceeds just as in Spark Core. One thing worth noting is that the code is generated on the Driver but compiled on the Executors.
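If you want to look at the generated code for a query yourself, the debug helpers make this easy (available in Spark 2.x; the output is long and version-dependent):
import org.apache.spark.sql.execution.debug._

spark.sql("SELECT id * 2 FROM range(10)").debugCodegen()
// prints the whole-stage-generated Java source for each codegen subtree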
To sum up, the overall execution flow is actually quite clear. The Spark SQL source involves a lot of extra machinery, and its coding style is noticeably more advanced than Spark Core's. While chasing down questions along the way I came across two very good blog posts; the links are at the end of this article.
一条 SQL 在 Apache Spark 之旅
Spark Catalyst 源码解析