Spark SQL: Execution Internals

The Spark SQL execution process

[Figure 1: Spark SQL execution flow]
A SQL statement goes through the following steps when executed in Spark (a quick way to inspect each stage from user code is sketched right after the list):

  1. The user submits the SQL text.
  2. The parser turns the SQL text into a logical plan.
  3. The analyzer, with the help of the Catalog, analyzes the logical plan further, checking that tables exist, that operations are supported, and so on.
  4. The optimizer further optimizes the analyzed logical plan, e.g. pushing filters down into subqueries, rewriting queries, and sharing common subqueries.
  5. The planner then converts the optimized logical plan into a physical plan according to predefined mapping strategies.
  6. The physical plan is executed as RDD computations, and the query result is finally returned to the user.
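
A quick way to observe these stages from user code is Dataset.explain(true), which prints the parsed, analyzed, optimized and physical plans. A minimal sketch (the local session and table names are made up for illustration):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("plan-demo").master("local[*]").getOrCreate()
  import spark.implicits._

  // A tiny in-memory table so that the query can be analyzed; the data itself does not matter.
  Seq((1, "a", 25), (2, "b", 18)).toDF("id", "name", "age").createOrReplaceTempView("people")

  val df = spark.sql("SELECT name FROM people WHERE age > 20")

  // Prints the parsed logical plan, analyzed logical plan, optimized logical plan
  // and the physical plan, i.e. the stages listed above.
  df.explain(true)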

Parser

Spark SQL uses ANTLR for lexical and syntactic analysis and produces an unresolved LogicalPlan.

Source code walkthrough:
Step 1: Take SparkSession's sql(sqlText: String): DataFrame as the entry point and trace how a SQL statement is parsed:

  def sql(sqlText: String): DataFrame = {
    Dataset.ofRows(self, sessionState.sqlParser.parsePlan(sqlText))
  }

Step 2: parsePlan calls parse to turn the SQL text into an abstract syntax tree (the method lives in SparkSqlParser's parent class AbstractSqlParser, which implements ParserInterface)
Source: org.apache.spark.sql.catalyst.parser.ParseDriver.scala

  //The main entry point is parse
  override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
    //Starting from the singleStatement node, walk the syntax tree and convert its nodes into a logical plan
    astBuilder.visitSingleStatement(parser.singleStatement()) match {
      case plan: LogicalPlan => plan
      case _ =>
        val position = Origin(None, None)
        throw new ParseException(Option(sqlText), "Unsupported SQL statement", position, position)
    }
  }

Step 3: the parse(sqlText) method

 /**
   * Uses ANTLR 4 to perform lexical and syntactic analysis of the SQL statement,
   * producing the abstract syntax tree.
   */
  protected def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
    logDebug(s"Parsing command: $command")

    //Lexer: the ANTLR-generated SqlBaseLexer splits the SQL text into tokens
    //and feeds them into a CommonTokenStream
    val lexer = new SqlBaseLexer(new UpperCaseCharStream(CharStreams.fromString(command)))
    lexer.removeErrorListeners()
    lexer.addErrorListener(ParseErrorListener)
    lexer.legacy_setops_precedence_enbled = SQLConf.get.setOpsPrecedenceEnforced

    //Parser: the ANTLR-generated SqlBaseParser performs the syntactic analysis;
    //the visitor then turns the parse tree into a LogicalPlan
    val tokenStream = new CommonTokenStream(lexer)
    val parser = new SqlBaseParser(tokenStream)
    parser.addParseListener(PostProcessor)
    parser.removeErrorListeners()
    parser.addErrorListener(ParseErrorListener)
    parser.legacy_setops_precedence_enbled = SQLConf.get.setOpsPrecedenceEnforced

    try {
      try {
        // first, try parsing with potentially faster SLL mode
        parser.getInterpreter.setPredictionMode(PredictionMode.SLL)
        //callback into the caller-supplied function
        toResult(parser)
      }
      catch {
        case e: ParseCancellationException =>
          // if we fail, parse with LL mode
          tokenStream.seek(0) // rewind input stream
          parser.reset()

          // Try Again.
          parser.getInterpreter.setPredictionMode(PredictionMode.LL)
          toResult(parser)
      }
    }
    catch {
      case e: ParseException if e.command.isDefined =>
        throw e
      case e: ParseException =>
        throw e.withCommand(command)
      case e: AnalysisException =>
        val position = Origin(e.line, e.startPosition)
        throw new ParseException(Option(command), e.message, position, position)
    }
  }
}

SqlBaseLexer is the lexer generated by ANTLR 4.

To use the ANTLR-generated parser, the required pieces are:

  • The grammar file (org.apache.spark.sql.catalyst.parser.SqlBase.g4, located in the catalyst sub-project)
  • A listener class or a visitor class.
    Spark SQL mainly uses the visitor class (SparkSqlAstBuilder), with a listener class (PostProcessor) as a helper for post-processing certain tokens.

For example, take the following SQL statement:
SELECT * FROM A  LEFT JOIN B ON A.ID = B.ID WHERE A.AGE > 20

[Figure 2: ANTLR parse tree of the example SQL]

Step 4: Starting from the singleStatement node, astBuilder.visitSingleStatement walks the syntax tree and converts its nodes into a logical plan. The call is dispatched through the ANTLR-generated visit/accept methods shown below:

  override def visitSingleStatement(ctx: SingleStatementContext): LogicalPlan = withOrigin(ctx) {
    visit(ctx.statement).asInstanceOf[LogicalPlan]
  }
@Override
public T visit(ParseTree tree) {
	return tree.accept(this);
}
@Override
public <T> T accept(ParseTreeVisitor<? extends T> visitor) {
	if ( visitor instanceof SqlBaseVisitor ) return ((SqlBaseVisitor<? extends T>)visitor).visitQuerySpecification(this);
	else return visitor.visitChildren(this);
}

Step 5: AstBuilder.scala extends the generated SqlBaseBaseVisitor and implements the actual conversion logic

  override def visitQuerySpecification(
      ctx: QuerySpecificationContext): LogicalPlan = withOrigin(ctx) {
    val from = OneRowRelation().optional(ctx.fromClause) {
      visitFromClause(ctx.fromClause)
    }
    withQuerySpecification(ctx, from)
  }

Parsing the FROM clause

  override def visitFromClause(ctx: FromClauseContext): LogicalPlan = withOrigin(ctx) {
    val from = ctx.relation.asScala.foldLeft(null: LogicalPlan) { (left, relation) =>
      val right = plan(relation.relationPrimary)
      val join = right.optionalMap(left)(Join(_, _, Inner, None))
      withJoinRelations(join, relation)
    }
    if (ctx.pivotClause() != null) {
      if (!ctx.lateralView.isEmpty) {
        throw new ParseException("LATERAL cannot be used together with PIVOT in FROM clause", ctx)
      }
      withPivot(ctx.pivotClause, from)
    } else {
      ctx.lateralView.asScala.foldLeft(from)(withGenerate)
    }
  }

Parsing WHERE and the other clauses

  private def withQuerySpecification(
      ctx: QuerySpecificationContext,
      relation: LogicalPlan): LogicalPlan = withOrigin(ctx) {
    import ctx._

    // WHERE
    def filter(ctx: BooleanExpressionContext, plan: LogicalPlan): LogicalPlan = {
      Filter(expression(ctx), plan)
    }

    def withHaving(ctx: BooleanExpressionContext, plan: LogicalPlan): LogicalPlan = {
      // Note that we add a cast to non-predicate expressions. If the expression itself is
      // already boolean, the optimizer will get rid of the unnecessary cast.
      val predicate = expression(ctx) match {
        case p: Predicate => p
        case e => Cast(e, BooleanType)
      }
      Filter(predicate, plan)
    }


    // Expressions, i.e. the items being selected
    val expressions = Option(namedExpressionSeq).toSeq
      .flatMap(_.namedExpression.asScala)
      .map(typedVisit[Expression])

    // Create either a transform or a regular query.
    val specType = Option(kind).map(_.getType).getOrElse(SqlBaseParser.SELECT)
    specType match {
      case SqlBaseParser.MAP | SqlBaseParser.REDUCE | SqlBaseParser.TRANSFORM =>
        // Transform

        // Add where.
        val withFilter = relation.optionalMap(where)(filter)

        // Create the attributes.
        val (attributes, schemaLess) = if (colTypeList != null) {
          // Typed return columns.
          (createSchema(colTypeList).toAttributes, false)
        } else if (identifierSeq != null) {
          // Untyped return columns.
          val attrs = visitIdentifierSeq(identifierSeq).map { name =>
            AttributeReference(name, StringType, nullable = true)()
          }
          (attrs, false)
        } else {
          (Seq(AttributeReference("key", StringType)(),
            AttributeReference("value", StringType)()), true)
        }

        // Create the transform.
        ScriptTransformation(
          expressions,
          string(script),
          attributes,
          withFilter,
          withScriptIOSchema(
            ctx, inRowFormat, recordWriter, outRowFormat, recordReader, schemaLess))
      //SELECT statement
      case SqlBaseParser.SELECT =>
        // Regular select

        // Add lateral views.
        val withLateralView = ctx.lateralView.asScala.foldLeft(relation)(withGenerate)

        // Add where.
        val withFilter = withLateralView.optionalMap(where)(filter)

        // Add aggregation or a project.
        val namedExpressions = expressions.map {
          case e: NamedExpression => e
          case e: Expression => UnresolvedAlias(e)
        }
      
        //Eventually Project(namedExpressions, withFilter) is returned; Project is a LogicalPlan
        def createProject() = if (namedExpressions.nonEmpty) {
          Project(namedExpressions, withFilter)
        } else {
          withFilter
        }

        val withProject = if (aggregation == null && having != null) {
          if (conf.getConf(SQLConf.LEGACY_HAVING_WITHOUT_GROUP_BY_AS_WHERE)) {
            // If the legacy conf is set, treat HAVING without GROUP BY as WHERE.
            withHaving(having, createProject())
          } else {
            // According to SQL standard, HAVING without GROUP BY means global aggregate.
            withHaving(having, Aggregate(Nil, namedExpressions, withFilter))
          }
        } else if (aggregation != null) {
          val aggregate = withAggregation(aggregation, namedExpressions, withFilter)
          aggregate.optionalMap(having)(withHaving)
        } else {
          // When hitting this branch, `having` must be null.
          createProject()
        }

        // Distinct
        val withDistinct = if (setQuantifier() != null && setQuantifier().DISTINCT() != null) {
          Distinct(withProject)
        } else {
          withProject
        }

        // Window
        val withWindow = withDistinct.optionalMap(windows)(withWindows)

        // Hint
        hints.asScala.foldRight(withWindow)(withHints)
    }
  }

Step 6: The final result is Project(namedExpressions, withFilter), where Project is a LogicalPlan

case class Project(projectList: Seq[NamedExpression], child: LogicalPlan) extends UnaryNode {
  override def output: Seq[Attribute] = projectList.map(_.toAttribute)
  override def maxRows: Option[Long] = child.maxRows

  override lazy val resolved: Boolean = {
    val hasSpecialExpressions = projectList.exists ( _.collect {
        case agg: AggregateExpression => agg
        case generator: Generator => generator
        case window: WindowExpression => window
      }.nonEmpty
    )

    !expressions.exists(!_.resolved) && childrenResolved && !hasSpecialExpressions
  }

  override def validConstraints: Set[Expression] =
    child.constraints.union(getAliasedConstraints(projectList))
}

Summary

Spark SQL parses SQL with the open-source ANTLR 4 framework, adding several layers of abstraction and encapsulation on top of it:

  • AstBuilder encapsulates the ANTLR 4 implementation details of the SQL grammar behind a builder-style abstraction.
  • SparkSqlParser wraps this further: SparkSqlAstBuilder extends AstBuilder, so all of the ANTLR 4 machinery is available to SparkSqlParser.
  • To support user-defined parsers, the parser is typed as the ParserInterface interface. Users can inject their own parser through the SparkSession; Spark merges the user parser with its own and keeps the result consistent (a minimal sketch follows the list).
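
A minimal sketch of this extension point, assuming the SparkSessionExtensions API available since Spark 2.2 (a pass-through analyzer rule is injected here instead of a full custom parser, which would require implementing every ParserInterface method; the rule and session names are made up):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
  import org.apache.spark.sql.catalyst.rules.Rule

  // A do-nothing resolution rule, standing in for real user logic.
  case class NoopRule(session: SparkSession) extends Rule[LogicalPlan] {
    override def apply(plan: LogicalPlan): LogicalPlan = plan
  }

  val extendedSpark = SparkSession.builder()
    .master("local[*]")
    .withExtensions(ext => ext.injectResolutionRule(session => NoopRule(session)))
    .getOrCreate()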

[Figure 3: the unresolved logical plan produced by the parser]

  • The two tables involved in the query are parsed into two UnresolvedRelations: we only know that they are tables, not whether they are EXTERNAL or MANAGED tables, where their data lives, or what their schemas look like.
  • The result of sum(v) has not been named yet.
  • The Project part only knows which attributes are selected, not which table each attribute belongs to, let alone its data type.
  • The Filter part does not know the data types either (the sketch below shows how to inspect this unresolved plan).
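
With the temporary views A and B registered (reusing the spark session from the first sketch and hypothetical schemas), the unresolved plan of the example SQL can be inspected through QueryExecution.logical, which keeps the parsed plan before analysis:

  // Hypothetical schemas, just enough for the example query to parse and analyze.
  Seq((1, 30), (2, 15)).toDF("ID", "AGE").createOrReplaceTempView("A")
  Seq((1, "x"), (3, "y")).toDF("ID", "NAME").createOrReplaceTempView("B")

  val joined = spark.sql("SELECT * FROM A LEFT JOIN B ON A.ID = B.ID WHERE A.AGE > 20")

  // The parsed (pre-analysis) plan still contains UnresolvedRelation nodes for A and B;
  // queryExecution.analyzed shows the same plan after the analyzer has resolved them.
  println(joined.queryExecution.logical)
  println(joined.queryExecution.analyzed)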

Step 7: Back to the original sql() call

  def sql(sqlText: String): DataFrame = {
    Dataset.ofRows(self, sessionState.sqlParser.parsePlan(sqlText))
  }

The Dataset.ofRows method

  def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
    //create the QueryExecution
    val qe = sparkSession.sessionState.executePlan(logicalPlan)
    qe.assertAnalyzed()
    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
  }

Step 8: executePlan(logicalPlan) creates the QueryExecution

def executePlan(plan: LogicalPlan): QueryExecution = createQueryExecution(plan)

Step 9: createQueryExecution(plan) builds the QueryExecution

/**
 *  QueryExecution is the primary workflow Spark uses to execute relational queries.
 *  It is designed to give developers convenient access to the intermediate stages of query execution.
 *  Its most important members are lazy vals; most of its methods exist to be called by those lazy vals.
 */
class QueryExecution(val sparkSession: SparkSession, val logical: LogicalPlan) {

  // TODO: Move the planner an optimizer into here from SessionState.
  protected def planner = sparkSession.sessionState.planner

  def assertAnalyzed(): Unit = analyzed

  def assertSupported(): Unit = {
    if (sparkSession.sessionState.conf.isUnsupportedOperationCheckEnabled) {
      UnsupportedOperationChecker.checkForBatch(analyzed)
    }
  }

  //Invoke the analyzer to bind the unresolved LogicalPlan to catalog metadata, producing a resolved LogicalPlan
  lazy val analyzed: LogicalPlan = {
    SparkSession.setActiveSession(sparkSession)
    sparkSession.sessionState.analyzer.executeAndCheck(logical)
  }

  // If parts of the query are already cached, substitute the cached data
  lazy val withCachedData: LogicalPlan = {
    assertAnalyzed()
    assertSupported()
    sparkSession.sharedState.cacheManager.useCachedData(analyzed)
  }

  //Run the optimizer
  lazy val optimizedPlan: LogicalPlan = sparkSession.sessionState.optimizer.execute(withCachedData)

  //The physical plan (SparkPlan) obtained by converting the already optimized logical plan
  lazy val sparkPlan: SparkPlan = {
    SparkSession.setActiveSession(sparkSession)
    // TODO: We use next(), i.e. take the first plan returned by the planner, here for now,
    //       but we will implement to choose the best plan.
    // pick one plan out of the candidates
    planner.plan(ReturnAnswer(optimizedPlan)).next()
  }

  // executedPlan should not be used to initialize any SparkPlan. It should be
  // only used for execution.
  lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan)

  /** Internal version of the RDD. Avoids copies and has no schema */
  //The RDD finally produced by the QueryExecution; it is evaluated only when an action is triggered
  lazy val toRdd: RDD[InternalRow] = executedPlan.execute()

......

}
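
Because these members are public lazy vals, every stage of the pipeline can be inspected directly from a Dataset. A small sketch, reusing the df from the first example:

  val qe = df.queryExecution

  println(qe.analyzed)      // resolved logical plan
  println(qe.optimizedPlan) // logical plan after the optimizer batches
  println(qe.sparkPlan)     // physical plan chosen by the planner
  println(qe.executedPlan)  // physical plan after the preparation rules

  // Nothing has been computed yet; toRdd builds the RDD lazily and an action finally runs it.
  qe.toRdd.count()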

Analyzer

After the parser has produced the logical plan, the analyzer is used to analyze it.

Step 10: the sparkSession.sessionState.analyzer.executeAndCheck(logical) method
Source: org.apache.spark.sql.catalyst.analysis.Analyzer.scala

class Analyzer(
    //Manages temporary tables, views, functions and external metadata (e.g. the Hive metastore); it is the bridge the analyzer uses for binding
    catalog: SessionCatalog,
    conf: SQLConf,
    maxIterations: Int)
  extends RuleExecutor[LogicalPlan] with CheckAnalysis {

  def this(catalog: SessionCatalog, conf: SQLConf) = {
    this(catalog, conf, conf.optimizerMaxIterations)
  }

  def executeAndCheck(plan: LogicalPlan): LogicalPlan = AnalysisHelper.markInAnalyzer {
    // run the analysis rules
    val analyzed = execute(plan)
    try {
      checkAnalysis(analyzed)
      analyzed
    } catch {
      case e: AnalysisException =>
        val ae = new AnalysisException(e.message, e.line, e.startPosition, Option(analyzed))
        ae.setStackTrace(e.getStackTrace)
        throw ae
    }
  }

}

Step 11: the execute(plan) method

  override def execute(plan: LogicalPlan): LogicalPlan = {
    AnalysisContext.reset()
    try {
      executeSameContext(plan)
    } finally {
      AnalysisContext.reset()
    }
  }
  
  private def executeSameContext(plan: LogicalPlan): LogicalPlan = super.execute(plan)

Step 12: the super.execute(plan) method

/**
 * Rule executor:
 * applies rules to a LogicalPlan batch by batch, in order, iterating each batch multiple times if needed.
 */
abstract class RuleExecutor[TreeType <: TreeNode[_]] extends Logging {

  /**
   * An execution strategy for rules that indicates the maximum number of executions. If the
   * execution reaches fix point (i.e. converge) before maxIterations, it will stop.
   */
  /** A strategy that caps the number of runs; stops early if a fixed point is reached before maxIterations */
  abstract class Strategy { def maxIterations: Int }

  /** A strategy that only runs once. */
  /** A strategy that runs only once */
  case object Once extends Strategy { val maxIterations = 1 }

  /** A strategy that runs until fix point or maxIterations times, whichever comes first. */
  /** Runs until a fixed point or maxIterations, whichever comes first */
  case class FixedPoint(maxIterations: Int) extends Strategy

  /** A batch of rules. */
  //  a batch of rules
  protected case class Batch(name: String, strategy: Strategy, rules: Rule[TreeType]*)

  /** Defines a sequence of rule batches, to be overridden by the implementation. */
  //  all rules, organized into batches
  protected def batches: Seq[Batch]

  /**
   * Defines a check function that checks for structural integrity of the plan after the execution
   * of each rule. For example, we can check whether a plan is still resolved after each rule in
   * `Optimizer`, so we can catch rules that return invalid plans. The check function returns
   * `false` if the given plan doesn't pass the structural integrity check.
   */
  protected def isPlanIntegral(plan: TreeType): Boolean = true

  /**
   * Executes the batches of rules defined by the subclass. The batches are executed serially
   * using the defined execution strategy. Within each batch, rules are also executed serially.
   */
  // Execute the rules: batch by batch, rule by rule, iterating until convergence
  def execute(plan: TreeType): TreeType = {
    // used to check whether the plan changes at all after running the rules
    var curPlan = plan
    val queryExecutionMetrics = RuleExecutor.queryExecutionMeter

    batches.foreach { batch =>
      val batchStartPlan = curPlan
      var iteration = 1
      var lastPlan = curPlan
      var continue = true

      // Run until fix point (or the max number of iterations as specified in the strategy.
      // run until a fixed point or the max number of iterations is reached
      while (continue) {
        curPlan = batch.rules.foldLeft(curPlan) {
          case (plan, rule) =>
            val startTime = System.nanoTime()
            // apply the rule, producing a new plan
            val result = rule(plan)
            val runTime = System.nanoTime() - startTime
            // check whether the rule actually changed the plan
            if (!result.fastEquals(plan)) {
              queryExecutionMetrics.incNumEffectiveExecution(rule.ruleName)
              queryExecutionMetrics.incTimeEffectiveExecutionBy(rule.ruleName, runTime)
              logTrace(
                s"""
                  |=== Applying Rule ${rule.ruleName} ===
                  |${sideBySide(plan.treeString, result.treeString).mkString("\n")}
                """.stripMargin)
            }
            queryExecutionMetrics.incExecutionTimeBy(rule.ruleName, runTime)
            queryExecutionMetrics.incNumExecution(rule.ruleName)

            // Run the structural integrity checker against the plan after each rule.
            if (!isPlanIntegral(result)) {
              val message = s"After applying rule ${rule.ruleName} in batch ${batch.name}, " +
                "the structural integrity of the plan is broken."
              throw new TreeNodeException(result, message, null)
            }

            result
        }
        iteration += 1
        // max number of iterations reached: stop
        if (iteration > batch.strategy.maxIterations) {
          // Only log if this is a rule that is supposed to run more than once.
          // only log if the strategy allows more than one iteration
          if (iteration != 2) {
            val message = s"Max iterations (${iteration - 1}) reached for batch ${batch.name}"
            if (Utils.isTesting) {
              throw new TreeNodeException(curPlan, message, null)
            } else {
              logWarning(message)
            }
          }
          continue = false
        }
        // the plan no longer changes: fixed point reached, stop
        if (curPlan.fastEquals(lastPlan)) {
          logTrace(
            s"Fixed point reached for batch ${batch.name} after ${iteration - 1} iterations.")
          continue = false
        }
        lastPlan = curPlan
      }
      // did this batch of rules have any effect?
      if (!batchStartPlan.fastEquals(curPlan)) {
        logDebug(
          s"""
            |=== Result of Batch ${batch.name} ===
            |${sideBySide(batchStartPlan.treeString, curPlan.treeString).mkString("\n")}
          """.stripMargin)
      } else {
        logTrace(s"Batch ${batch.name} has no effect.")
      }
    }

    curPlan
  }
}

Step 13: rule(plan)

/**
 * A rule takes an old plan as input and returns a new plan, nothing more.
 * The real logic lives in the concrete Rule implementations; analysis is simply the process of applying these rules one by one.
 */
abstract class Rule[TreeType <: TreeNode[_]] extends Logging {

  /** Name for this rule, automatically inferred based on class name. */
  val ruleName: String = {
    val className = getClass.getName
    if (className endsWith "$") className.dropRight(1) else className
  }

  def apply(plan: TreeType): TreeType
}
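
To make the mechanism concrete, here is a sketch (not part of Spark) of a trivial rule that removes a Filter whose condition is the literal true, similar in spirit to Spark's own simplification rules, together with a minimal RuleExecutor that applies it until a fixed point:

  import org.apache.spark.sql.catalyst.expressions.Literal
  import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
  import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor}
  import org.apache.spark.sql.types.BooleanType

  // A toy rule: Filter(true, child) is equivalent to child.
  object RemoveTrueFilter extends Rule[LogicalPlan] {
    override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
      case Filter(Literal(true, BooleanType), child) => child
    }
  }

  // A minimal executor with a single batch that runs the rule until the plan stops changing.
  object TrivialOptimizer extends RuleExecutor[LogicalPlan] {
    override protected def batches: Seq[Batch] =
      Batch("Remove trivial filters", FixedPoint(10), RemoveTrueFilter) :: Nil
  }

Calling TrivialOptimizer.execute(somePlan) returns the rewritten plan; the analyzer and optimizer work exactly this way, just with far larger batches of rules.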

Step 14: the collection of rules (the transformation rules)

  lazy val batches: Seq[Batch] = Seq(
    Batch("Hints", fixedPoint,
      new ResolveHints.ResolveBroadcastHints(conf),
      ResolveHints.ResolveCoalesceHints,
      ResolveHints.RemoveAllHints),
    Batch("Simple Sanity Check", Once,
      LookupFunctions),
    Batch("Substitution", fixedPoint,
      CTESubstitution,
      WindowsSubstitution,
      EliminateUnions,
      new SubstituteUnresolvedOrdinals(conf)),
    Batch("Resolution", fixedPoint,
      ResolveTableValuedFunctions ::
        //resolve table names through the catalog
        ResolveRelations ::
        //resolve attributes produced by child operators, typically qualified references such as a.id
        ResolveReferences ::
        ResolveCreateNamedStruct ::
        ResolveDeserializer ::
        ResolveNewInstance ::
        ResolveUpCast ::
        ResolveGroupingAnalytics ::
        ResolvePivot ::
        ResolveOrdinalInOrderByAndGroupBy ::
        ResolveAggAliasInGroupBy ::
        ResolveMissingReferences ::
        ExtractGenerator ::
        ResolveGenerate ::
        //resolve functions
        ResolveFunctions ::
        ResolveAliases ::
        ResolveSubquery ::
        ResolveSubqueryColumnAliases ::
        ResolveWindowOrder ::
        ResolveWindowFrame ::
        ResolveNaturalAndUsingJoin ::
        ResolveOutputRelation ::
        ExtractWindowExpressions ::
        //resolve global aggregates, e.g. SELECT sum(score) FROM table
        GlobalAggregates ::
        //resolve aggregate filters in the HAVING clause, e.g. HAVING sum(score) > 400
        ResolveAggregateFunctions ::
        TimeWindowing ::
        ResolveInlineTables(conf) ::
        ResolveHigherOrderFunctions(catalog) ::
        ResolveLambdaVariables(conf) ::
        ResolveTimeZone(conf) ::
        ResolveRandomSeed ::
        TypeCoercion.typeCoercionRules(conf) ++
        extendedResolutionRules: _*),
    Batch("Post-Hoc Resolution", Once, postHocResolutionRules: _*),
    Batch("View", Once,
      AliasViewChild(conf)),
    Batch("Nondeterministic", Once,
      PullOutNondeterministic),
    Batch("UDF", Once,
      HandleNullInputsForUDF),
    Batch("FixNullability", Once,
      FixNullability),
    Batch("Subquery", Once,
      UpdateOuterReferences),
    Batch("Cleanup", fixedPoint,
      CleanupAliases))

Step 15: ResolveRelations resolves table names through the catalog

  object ResolveRelations extends Rule[LogicalPlan] {

    // If the unresolved relation is running directly on files, we just return the original
    // UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog
    // and change the default database name(in AnalysisContext) if it is a view.
    // We usually look up a table from the default database if the table identifier has an empty
    // database part, for a view the default database should be the currentDb when the view was
    // created. When the case comes to resolving a nested view, the view may have different default
    // database with that the referenced view has, so we need to use
    // `AnalysisContext.defaultDatabase` to track the current default database.
    // When the relation we resolve is a view, we fetch the view.desc(which is a CatalogTable), and
    // then set the value of `CatalogTable.viewDefaultDatabase` to
    // `AnalysisContext.defaultDatabase`, we look up the relations that the view references using
    // the default database.
    // For example:
    // |- view1 (defaultDatabase = db1)
    //   |- operator
    //     |- table2 (defaultDatabase = db1)
    //     |- view2 (defaultDatabase = db2)
    //        |- view3 (defaultDatabase = db3)
    //   |- view4 (defaultDatabase = db4)
    // In this case, the view `view1` is a nested view, it directly references `table2`, `view2`
    // and `view4`, the view `view2` references `view3`. On resolving the table, we look up the
    // relations `table2`, `view2`, `view4` using the default database `db1`, and look up the
    // relation `view3` using the default database `db2`.
    //
    // Note this is compatible with the views defined by older versions of Spark(before 2.2), which
    // have empty defaultDatabase and all the relations in viewText have database part defined.
    // look the table up in the catalog
    def resolveRelation(plan: LogicalPlan): LogicalPlan = plan match {
      //excludes the case of querying files directly, e.g. SELECT * FROM parquet.`/path/to/query`
      case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) =>
        // the default database
        val defaultDatabase = AnalysisContext.get.defaultDatabase
        // look it up in the catalog
        val foundRelation = lookupTableFromCatalog(u, defaultDatabase)
        resolveRelation(foundRelation)
      // The view's child should be a logical plan parsed from the `desc.viewText`, the variable
      // `viewText` should be defined, or else we throw an error on the generation of the View
      // operator.
      case view @ View(desc, _, child) if !child.resolved =>
        // Resolve all the UnresolvedRelations and Views in the child.
        val newChild = AnalysisContext.withAnalysisContext(desc.viewDefaultDatabase) {
          if (AnalysisContext.get.nestedViewDepth > conf.maxNestedViewDepth) {
            view.failAnalysis(s"The depth of view ${view.desc.identifier} exceeds the maximum " +
              s"view resolution depth (${conf.maxNestedViewDepth}). Analysis is aborted to " +
              s"avoid errors. Increase the value of ${SQLConf.MAX_NESTED_VIEW_DEPTH.key} to work " +
              "around this.")
          }
          executeSameContext(child)
        }
        view.copy(child = newChild)
      case p @ SubqueryAlias(_, view: View) =>
        val newChild = resolveRelation(view)
        p.copy(child = newChild)
      case _ => plan
    }

    // entry point of the rule; resolveOperatorsUp walks the tree bottom-up and applies the rule to each node
    def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
      case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved =>
        EliminateSubqueryAliases(lookupTableFromCatalog(u)) match {
          case v: View =>
            u.failAnalysis(s"Inserting into a view is not allowed. View: ${v.desc.identifier}.")
          case other => i.copy(table = other)
        }
      // an UnresolvedRelation is matched
      case u: UnresolvedRelation => resolveRelation(u)
    }

    // Look up the table with the given name from catalog. The database we used is decided by the
    // precedence:
    // 1. Use the database part of the table identifier, if it is defined;
    // 2. Use defaultDatabase, if it is defined(In this case, no temporary objects can be used,
    //    and the default database is only used to look up a view);
    // 3. Use the currentDb of the SessionCatalog.
    private def lookupTableFromCatalog(
      u: UnresolvedRelation,
      defaultDatabase: Option[String] = None): LogicalPlan = {
      val tableIdentWithDb = u.tableIdentifier.copy(
        database = u.tableIdentifier.database.orElse(defaultDatabase))
      try {
        catalog.lookupRelation(tableIdentWithDb)
      } catch {
        // if the table is not found, an exception is thrown
        case e: NoSuchTableException =>
          u.failAnalysis(s"Table or view not found: ${tableIdentWithDb.unquotedString}", e)
        // If the database is defined and that database is not found, throw an AnalysisException.
        // Note that if the database is not defined, it is possible we are looking up a temp view.
        case e: NoSuchDatabaseException =>
          u.failAnalysis(s"Table or view not found: ${tableIdentWithDb.unquotedString}, the " +
            s"database ${e.db} doesn't exist.", e)
      }
    }
  //effectively a bottom-up (post-order) traversal of the tree
  def resolveOperatorsUp(rule: PartialFunction[LogicalPlan, LogicalPlan]): LogicalPlan = {
    //has this plan already been analyzed?
    if (!analyzed) {
      AnalysisHelper.allowInvokingTransformsInAnalyzer {
        //recurse: process the children first
        val afterRuleOnChildren = mapChildren(_.resolveOperatorsUp(rule))
        if (self fastEquals afterRuleOnChildren) {
          CurrentOrigin.withOrigin(origin) {
            rule.applyOrElse(self, identity[LogicalPlan])
          }
        } else {
          CurrentOrigin.withOrigin(origin) {
            rule.applyOrElse(afterRuleOnChildren, identity[LogicalPlan])
          }
        }
      }
    } else {
      self
    }
  }
//the catalog stores all of Spark SQL's table metadata
class SessionCatalog(
    externalCatalogBuilder: () => ExternalCatalog,
    globalTempViewManagerBuilder: () => GlobalTempViewManager,
    functionRegistry: FunctionRegistry,
    conf: SQLConf,
    hadoopConf: Configuration,
    parser: ParserInterface,
    functionResourceLoader: FunctionResourceLoader) extends Logging {

......

    def lookupRelation(name: TableIdentifier): LogicalPlan = {
    synchronized {
      val db = formatDatabaseName(name.database.getOrElse(currentDb))
      val table = formatTableName(name.table)
      //if db equals globalTempViewManager.database
      if (db == globalTempViewManager.database) {
        //globalTempViewManager (a HashMap[String, LogicalPlan]) maps global temp view names to their LogicalPlans
        globalTempViewManager.get(table).map { viewDef =>
          SubqueryAlias(table, db, viewDef)
        }.getOrElse(throw new NoSuchTableException(db, table))
      //if a database is specified, or the table is not among the session's temporary views
      } else if (name.database.isDefined || !tempViews.contains(table)) {
        //fetch the table's metadata (a CatalogTable) from the externalCatalog (e.g. Hive);
        //it contains the table type (managed or external table, or view), storage format, column schema, etc.
        val metadata = externalCatalog.getTable(db, table)
        //the returned table is a view
        if (metadata.tableType == CatalogTableType.VIEW) {
          val viewText = metadata.viewText.getOrElse(sys.error("Invalid view without text."))
          // The relation is a view, so we wrap the relation by:
          // 1. Add a [[View]] operator over the relation to keep track of the view desc;
          // 2. Wrap the logical plan in a [[SubqueryAlias]] which tracks the name of the view.
          /**
           * Build a View object (parsing viewText into a plan through the parser module) and wrap it
           * in a SubqueryAlias, which is returned.
           *
           * In the else branch below, the name refers to an ordinary (e.g. Hive) table: a relation is
           * built from the metadata and wrapped in a SubqueryAlias that is returned.
           *
           * This is where an unbound UnresolvedRelation gets replaced through the catalog.
           */
          val child = View(
            desc = metadata,
            output = metadata.schema.toAttributes,
            child = parser.parsePlan(viewText))
          SubqueryAlias(table, db, child)
        } else {
          SubqueryAlias(table, db, UnresolvedCatalogRelation(metadata))
        }
      } else {
        //otherwise it is a session-level temporary view: fetch its LogicalPlan from tempViews and wrap it in a SubqueryAlias
        SubqueryAlias(table, tempViews(table))
      }
    }
  }
......

}

Summary:
Comparing the LogicalPlan before and after analysis:
[Figure 4: LogicalPlan before and after analysis]
After analysis, each table's column set, column types and data location are all known, as are the types of the columns used by the Project and Filter operators and their positions in the tables.

With this information, the LogicalPlan could already be converted into a physical plan and executed.

However, the quality of the SQL submitted by different users varies; executing such plans directly would make semantically identical queries, written differently, perform very differently. In other words, to get good performance users would have to hand-optimize their SQL, which badly hurts the user experience.

To make sure Spark SQL executes efficiently regardless of whether the user knows how to optimize SQL or how well the submitted SQL is written, the LogicalPlan has to be optimized before execution.

Optimizer

The optimizer is a key part of Catalyst and provides the optimization of SQL queries. Its main job is to take the analyzer's resolved logical plan and, batch by batch according to the configured optimization strategies, optimize the plan tree: both the logical plan nodes and the expressions they contain. It is also the prerequisite step before conversion to a physical plan.

Step 16: the sparkSession.sessionState.optimizer.execute(withCachedData) method
Source: org.apache.spark.sql.catalyst.optimizer.Optimizer.scala

/**
 * The Optimizer takes the logical plan resolved by the Analyzer and, batch by batch according to
 * the optimization strategies, rewrites the plan tree: both logical plan nodes and expressions.
 * It is the step immediately before conversion to a physical plan.
 * It works the same way as the Analyzer: through the Rule[LogicalPlan]s held in its batches.
 */
abstract class Optimizer(sessionCatalog: SessionCatalog)
  extends RuleExecutor[LogicalPlan] {

  // Check for structural integrity of the plan in test mode. Currently we only check if a plan is
  // still resolved after the execution of each rule.
  override protected def isPlanIntegral(plan: LogicalPlan): Boolean = {
    !Utils.isTesting || plan.resolved
  }

  protected def fixedPoint = FixedPoint(SQLConf.get.optimizerMaxIterations)

  /**
   * Defines the default rule batches in the Optimizer.
   *
   * Implementations of this class should override this method, and [[nonExcludableRules]] if
   * necessary, instead of [[batches]]. The rule batches that eventually run in the Optimizer,
   * i.e., returned by [[batches]], will be (defaultBatches - (excludedRules - nonExcludableRules)).
   */
  def defaultBatches: Seq[Batch] = {
    val operatorOptimizationRuleSet =
      Seq(
        // Operator push down
        PushProjectionThroughUnion, //push Project down through Union
        ReorderJoin,
        EliminateOuterJoin,
        PushPredicateThroughJoin,
        PushDownPredicate,
        LimitPushDown,
        ColumnPruning, //column pruning
        InferFiltersFromConstraints,
        // Operator combine
        CollapseRepartition,
        CollapseProject,
        CollapseWindow,
        CombineFilters, //merge adjacent Filters
        CombineLimits, //merge adjacent Limits
        CombineUnions,
        // Constant folding and strength reduction
        NullPropagation, //fold expressions whose value is already determined by a null operand
        ConstantPropagation,
        FoldablePropagation,
        OptimizeIn, // optimize IN: replace long IN lists with InSet
        ConstantFolding, //evaluate constant expressions at planning time
        ReorderAssociativeOperator,
        LikeSimplification, //simplify LIKE patterns
        //simplify boolean expressions, e.g. true AND score > 0 becomes score > 0
        BooleanSimplification,
        //simplify conditional expressions (IF / CASE WHEN) whose branches can be decided statically
        SimplifyConditionals,
        RemoveDispensableExpressions,
        SimplifyBinaryComparison,
        PruneFilters,
        EliminateSorts,
        //remove casts that are no-ops, e.g. when both sides of a comparison already have the same type
        SimplifyCasts,
        //simplify case-conversion chains, e.g. Upper(Upper('a')) becomes Upper('a')
        SimplifyCaseConversionExpressions,
        RewriteCorrelatedScalarSubquery,
        EliminateSerialization,
        RemoveRedundantAliases,
        RemoveRedundantProject,
        SimplifyExtractValueOps,
        CombineConcats) ++
        extendedOperatorOptimizationRules

    val operatorOptimizationBatch: Seq[Batch] = {
      val rulesWithoutInferFiltersFromConstraints =
        operatorOptimizationRuleSet.filterNot(_ == InferFiltersFromConstraints)
      Batch("Operator Optimization before Inferring Filters", fixedPoint, //精度优化
        rulesWithoutInferFiltersFromConstraints: _*) ::
        Batch("Infer Filters", Once,
          InferFiltersFromConstraints) ::
        Batch("Operator Optimization after Inferring Filters", fixedPoint,
          rulesWithoutInferFiltersFromConstraints: _*) :: Nil
    }

    (Batch("Eliminate Distinct", Once, EliminateDistinct) ::
      // Technically some of the rules in Finish Analysis are not optimizer rules and belong more
      // in the analyzer, because they are needed for correctness (e.g. ComputeCurrentTime).
      // However, because we also use the analyzer to canonicalized queries (for view definition),
      // we do not eliminate subqueries or compute current time in the analyzer.
      Batch("Finish Analysis", Once,
        EliminateSubqueryAliases,
        EliminateView,
        ReplaceExpressions,
        ComputeCurrentTime,
        GetCurrentDatabase(sessionCatalog),
        RewriteDistinctAggregates,
        ReplaceDeduplicateWithAggregate) ::
        //////////////////////////////////////////////////////////////////////////////////////////
        // Optimizer rules start here
        //////////////////////////////////////////////////////////////////////////////////////////
        // - Do the first call of CombineUnions before starting the major Optimizer rules,
        //   since it can reduce the number of iteration and the other rules could add/move
        //   extra operators between two adjacent Union operators.
        // - Call CombineUnions again in Batch("Operator Optimizations"),
        //   since the other rules might make two separate Unions operators adjacent.
        Batch("Union", Once,
          CombineUnions) ::
        // run this once earlier. this might simplify the plan and reduce cost of optimizer.
        // for example, a query such as Filter(LocalRelation) would go through all the heavy
        // optimizer rules that are triggered when there is a filter
        // (e.g. InferFiltersFromConstraints). if we run this batch earlier, the query becomes just
        // LocalRelation and does not trigger many rules
        Batch("LocalRelation early", fixedPoint,
          ConvertToLocalRelation,
          PropagateEmptyRelation) ::
        Batch("Pullup Correlated Expressions", Once,
          PullupCorrelatedPredicates) ::
        Batch("Subquery", Once,
          OptimizeSubqueries) ::
        Batch("Replace Operators", fixedPoint,
          RewriteExceptAll,
          RewriteIntersectAll,
          ReplaceIntersectWithSemiJoin,
          ReplaceExceptWithFilter,
          ReplaceExceptWithAntiJoin,
          ReplaceDistinctWithAggregate) :: // replace Distinct with Aggregate
        Batch("Aggregate", fixedPoint,
          RemoveLiteralFromGroupExpressions,
          RemoveRepetitionFromGroupExpressions) :: Nil ++
        operatorOptimizationBatch) :+
        Batch("Join Reorder", Once,
          CostBasedJoinReorder) :+
        Batch("Remove Redundant Sorts", Once,
          RemoveRedundantSorts) :+
        Batch("Decimal Optimizations", fixedPoint,
          DecimalAggregates) :+
        Batch("Object Expressions Optimization", fixedPoint,
          EliminateMapObjects,
          CombineTypedFilters) :+
        Batch("LocalRelation", fixedPoint,
          ConvertToLocalRelation,
          PropagateEmptyRelation) :+
        Batch("Extract PythonUDF From JoinCondition", Once,
          PullOutPythonUDFInJoinCondition) :+
        // The following batch should be executed after batch "Join Reorder" "LocalRelation" and
        // "Extract PythonUDF From JoinCondition".
        Batch("Check Cartesian Products", Once,
          CheckCartesianProducts) :+
        Batch("RewriteSubquery", Once,
          RewritePredicateSubquery,
          ColumnPruning, //prune columns that are not needed
          CollapseProject,
          RemoveRedundantProject) :+
        Batch("UpdateAttributeReferences", Once,
          UpdateNullabilityInAttributeReferences)
  }

Step 17: the key code of PushPredicateThroughJoin

object PushPredicateThroughJoin extends Rule[LogicalPlan] with PredicateHelper {

......

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // push the where condition down into join filter
    // match a Filter sitting directly on top of a Join
    case f @ Filter(filterCondition, Join(left, right, joinType, joinCondition)) =>
      val (leftFilterConditions, rightFilterConditions, commonFilterCondition) =
        split(splitConjunctivePredicates(filterCondition), left, right)
      joinType match {
        // inner join
        case _: InnerLike =>
          // push down the single side `where` condition into respective sides
          // push the left-side conditions down as a Filter on the left child
          val newLeft = leftFilterConditions.
            reduceLeftOption(And).map(Filter(_, left)).getOrElse(left)
          // push the right-side conditions down as a Filter on the right child
          val newRight = rightFilterConditions.
            reduceLeftOption(And).map(Filter(_, right)).getOrElse(right)
          val (newJoinConditions, others) =
            commonFilterCondition.partition(canEvaluateWithinJoin)
          val newJoinCond = (newJoinConditions ++ joinCondition).reduceLeftOption(And)
          // the resulting optimized join
          val join = Join(newLeft, newRight, joinType, newJoinCond)
          if (others.nonEmpty) {
            Filter(others.reduceLeft(And), join)
          } else {
            join
          }
        case RightOuter =>
          // push down the right side only `where` condition
          val newLeft = left
          val newRight = rightFilterConditions.
            reduceLeftOption(And).map(Filter(_, right)).getOrElse(right)
          val newJoinCond = joinCondition
          val newJoin = Join(newLeft, newRight, RightOuter, newJoinCond)

          (leftFilterConditions ++ commonFilterCondition).
            reduceLeftOption(And).map(Filter(_, newJoin)).getOrElse(newJoin)
        case LeftOuter | LeftExistence(_) =>
          // push down the left side only `where` condition
          val newLeft = leftFilterConditions.
            reduceLeftOption(And).map(Filter(_, left)).getOrElse(left)
          val newRight = right
          val newJoinCond = joinCondition
          val newJoin = Join(newLeft, newRight, joinType, newJoinCond)

          (rightFilterConditions ++ commonFilterCondition).
            reduceLeftOption(And).map(Filter(_, newJoin)).getOrElse(newJoin)
        case FullOuter => f // DO Nothing for Full Outer Join
        case NaturalJoin(_) => sys.error("Untransformed NaturalJoin node")
        case UsingJoin(_, _) => sys.error("Untransformed Using join node")
      }

  ......

}

Summary:

Spark SQL's optimization is currently mostly rule-based optimization (RBO):

  • Each optimization exists as a Rule, and every Rule is an equivalence-preserving transformation of the analyzed plan.
  • RBO is well designed and easy to extend; new rules can be plugged into the Optimizer very easily (see the sketch after this list).
  • RBO is already quite good, but more rules are still needed to cover more scenarios.
  • The general idea is to reduce both the amount of data participating in the computation and the cost of the computation itself.
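
Because the Optimizer is itself just a RuleExecutor, its rule set can be adjusted from user code. A hedged sketch (the spark.sql.optimizer.excludedRules config exists in Spark 2.4; extraOptimizations is part of the public ExperimentalMethods API; RemoveTrueFilter is the toy rule sketched earlier):

  // Exclude a built-in rule by its fully qualified name (applies to subsequent queries).
  spark.conf.set(
    "spark.sql.optimizer.excludedRules",
    "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")

  // Append a custom Rule[LogicalPlan] to the optimizer via the experimental hook.
  spark.experimental.extraOptimizations = Seq(RemoveTrueFilter)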

PushdownPredicate

Predicate pushdown (the PushDownPredicate family of rules) is the most common way to reduce the amount of data participating in the computation.

Without it, the two tables are joined first and the Filter is applied afterwards. With predicate pushdown, the Filter is applied to each table first and the Join happens afterwards.
[Figure 5: logical plan before and after predicate pushdown]
When the Filter removes most of the data, far less data takes part in the Join, which makes the Join much faster.

Note that this is an optimization of the LogicalPlan: logically, pushing the Filter down improves performance simply because less data takes part in the Join. At the physical level there is a further benefit: for storage that supports filter pushdown, the table does not have to be fully scanned and then filtered; only the rows matching the Filter are read, which greatly reduces the scan cost and speeds up execution.
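
With the A and B views registered earlier, the effect can be observed by comparing the analyzed and optimized plans (the exact plan text varies by version):

  val q = spark.sql("SELECT * FROM A JOIN B ON A.ID = B.ID WHERE A.AGE > 20")

  // In the analyzed plan the Filter sits above the Join; in the optimized plan
  // the predicate on A.AGE has been pushed below the Join, onto the scan of A.
  println(q.queryExecution.analyzed)
  println(q.queryExecution.optimizedPlan)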

ConstantFolding

In the example query of this article, the Project contains 100 + 800 + match_score + english_score. Without optimization, 100 + 800 would be recomputed for every row; with a hundred million rows that is a hundred million wasted additions. ConstantFolding folds such constants together, removing the unnecessary computation and speeding up execution.
[Figure 6: logical plan after constant folding]
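
A small sketch of how to observe this, with a hypothetical scores table matching the article's column names (spark.implicits._ in scope as in the first sketch):

  val scores = Seq((90, 80), (70, 60)).toDF("match_score", "english_score")
  scores.createOrReplaceTempView("scores")

  val folded = spark.sql("SELECT 100 + 800 + match_score + english_score AS total FROM scores")

  // The optimized plan typically shows the folded literal 900 instead of 100 + 800.
  println(folded.queryExecution.optimizedPlan)
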
ColumnPruning

In the figure above, the Filter and Join operators keep all columns from both sides, and only the Project selects the specific columns that are needed. If the Project can be pushed down, so that scanning the tables already selects the minimal set of columns required by later operators, the intermediate results flowing through Filter and Project shrink dramatically and execution becomes much faster.
[Figure 7: logical plan after column pruning]
This, too, is a logical optimization. Physically, once the Project is pushed down, columnar formats such as Parquet and ORC can scan only the required columns and skip the rest, further reducing the scan cost and improving execution speed.
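
A sketch of the physical effect, writing the scores data from the previous example to a hypothetical Parquet path; the file scan node of the physical plan then only needs to read the selected column:

  scores.write.mode("overwrite").parquet("/tmp/scores_parquet")

  val pruned = spark.read.parquet("/tmp/scores_parquet").select("match_score")

  // Only match_score has to be read from the columnar file; the other columns are skipped.
  pruned.explain()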

SparkPlanner

Once the optimized LogicalPlan is obtained, the SparkPlanner converts it into a SparkPlan, i.e. the physical plan.

Step 18: planner.plan(ReturnAnswer(optimizedPlan)).next()
Source: org.apache.spark.sql.catalyst.planning.QueryPlanner.scala

abstract class GenericStrategy[PhysicalPlan <: TreeNode[PhysicalPlan]] extends Logging {

  /**
   * Returns a placeholder for a physical plan that executes `plan`. This placeholder will be
   * filled in automatically by the QueryPlanner using the other execution strategies that are
   * available.
   */
  protected def planLater(plan: LogicalPlan): PhysicalPlan

  def apply(plan: LogicalPlan): Seq[PhysicalPlan]
}


abstract class QueryPlanner[PhysicalPlan <: TreeNode[PhysicalPlan]] {
  /** A list of execution strategies that can be used by the planner */
  def strategies: Seq[GenericStrategy[PhysicalPlan]]

  /**
   * Applies every Strategy to the plan (`_(plan)`) and collects the physical plans
   * produced once all strategies have run.
   */
  def plan(plan: LogicalPlan): Iterator[PhysicalPlan] = {
    // Obviously a lot to do here still...

    // Collect physical plan candidates.
    // the strategies do the actual conversion
    val candidates = strategies.iterator.flatMap(_(plan))

    // The candidates may contain placeholders marked as [[planLater]],
    // so try to replace them by their child plans.
    // replace the planLater placeholders
    val plans = candidates.flatMap { candidate =>
      val placeholders = collectPlaceholders(candidate)

      if (placeholders.isEmpty) {
        // Take the candidate as is because it does not contain placeholders.
        Iterator(candidate)
      } else {
        // Plan the logical plan marked as [[planLater]] and replace the placeholders.
        placeholders.iterator.foldLeft(Iterator(candidate)) {
          case (candidatesWithPlaceholders, (placeholder, logicalPlan)) =>
            // Plan the logical plan for the placeholder.
            val childPlans = this.plan(logicalPlan)

            candidatesWithPlaceholders.flatMap { candidateWithPlaceholders =>
              childPlans.map { childPlan =>
                // Replace the placeholder by the child plan
                candidateWithPlaceholders.transformUp {
                  case p if p.eq(placeholder) => childPlan
                }
              }
            }
        }
      }
    }
    // prune bad plans; currently this just returns the plans unchanged
    val pruned = prunePlans(plans)
    assert(pruned.hasNext, s"No plan for $plan")
    pruned
  }

Step 19: SparkPlanner extends SparkStrategies, which in turn extends QueryPlanner

//holds the set of strategies used to turn a logical plan into physical plans
class SparkPlanner(
    val sparkContext: SparkContext,
    val conf: SQLConf,
    val experimentalMethods: ExperimentalMethods)
  extends SparkStrategies {

  //number of shuffle partitions
  def numPartitions: Int = conf.numShufflePartitions

  //the collection of strategies
  override def strategies: Seq[Strategy] =
    experimentalMethods.extraStrategies ++
      extraPlanningStrategies ++ (
      PythonEvals ::
      DataSourceV2Strategy ::
      //a strategy for Hadoop file systems; invoked to scan files when the plan's underlying relation is a HadoopFsRelation
      FileSourceStrategy ::
      //Spark predefines four scan interfaces for data sources: TableScan, PrunedScan, PrunedFilteredScan and CatalystScan
      //(CatalystScan is unstable and rarely used). If a user-defined data source implements one of these interfaces,
      //this strategy is used to scan it when the relation appears at the bottom of the plan
      DataSourceStrategy(conf) ::
      //invoked when the SQL contains LIMIT n (actions such as show() also apply a default limit of 20 rows);
      //in the source, the child of each matched limit case is wrapped in PlanLater
      SpecialLimits ::
      //strategy for aggregations
      Aggregation ::
      Window ::
      //JoinSelection uses statistics to decide whether a Join becomes a
      //BroadcastHashJoinExec, a ShuffledHashJoinExec or a SortMergeJoinExec;
      //this is a cost-based (CBO-style) choice
      JoinSelection ::
      //used when the data has been cached in memory
      InMemoryScans ::
      //strategies for basic operators such as flatMap, sort, project, etc.; mostly they just wrap the children of these nodes in PlanLater
      BasicOperators :: Nil)

......

[Figure 8: the resulting physical plan]
The smaller table is broadcast directly with BroadcastExchangeExec, and the broadcast data is then joined with the people table using BroadcastHashJoinExec. After a Project, HashAggregateExec performs the grouped aggregation.
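
Whether the planner chooses a broadcast join is driven by size statistics and the spark.sql.autoBroadcastJoinThreshold setting; it can also be requested explicitly through the public broadcast hint. A sketch using the A and B views from earlier:

  import org.apache.spark.sql.functions.broadcast

  val a = spark.table("A")
  val b = spark.table("B")

  // Hint that B should be broadcast; the physical plan then typically contains
  // a BroadcastExchangeExec feeding a BroadcastHashJoinExec.
  a.join(broadcast(b), "ID").explain()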

Step 20: lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan), the preparation done before execution

  protected def prepareForExecution(plan: SparkPlan): SparkPlan = {
    //apply each preparation rule to the plan, in order
    preparations.foldLeft(plan) { case (sp, rule) => rule.apply(sp) }
  }

  /** A sequence of rules that will be applied in order to the physical plan before execution. */
  protected def preparations: Seq[Rule[SparkPlan]] = Seq(
    PlanSubqueries(sparkSession),
    EnsureRequirements(sparkSession.sessionState.conf),
    CollapseCodegenStages(sparkSession.sessionState.conf),
    ReuseExchange(sparkSession.sessionState.conf),
    ReuseSubquery(sparkSession.sessionState.conf))
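
Among these preparation rules, CollapseCodegenStages is the one that fuses adjacent operators into whole-stage code generation stages. A hedged sketch for inspecting the result (in Spark 2.x explain() marks codegen'd operators with a leading "*", and the debug helpers in org.apache.spark.sql.execution.debug are assumed to provide debugCodegen):

  import org.apache.spark.sql.execution.debug._

  val filtered = spark.sql("SELECT name FROM people WHERE age > 20")

  filtered.explain()        // fused operators appear with a "*" prefix in the physical plan
  filtered.debugCodegen()   // dumps the Java source generated for each whole-stage subtree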

Step 21: the apply method of EnsureRequirements

  def apply(plan: SparkPlan): SparkPlan = plan.transformUp {
    // TODO: remove this after we create a physical operator for `RepartitionByExpression`.
    case operator @ ShuffleExchangeExec(upper: HashPartitioning, child, _) =>
      child.outputPartitioning match {
        case lower: HashPartitioning if upper.semanticEquals(lower) => child
        case _ => operator
      }
    case operator: SparkPlan =>
      //run ensureDistributionAndOrdering
      ensureDistributionAndOrdering(reorderJoinPredicates(operator))
  }

Step 22: the ensureDistributionAndOrdering method

 /**
   * Compares each child's actual output distribution with the required distribution
   * (by comparing partitionings), then inserts ShuffleExchangeExec, ExchangeCoordinator
   * or SortExec as needed to make them match.
   */
  private def ensureDistributionAndOrdering(operator: SparkPlan): SparkPlan = {
    val requiredChildDistributions: Seq[Distribution] = operator.requiredChildDistribution
    val requiredChildOrderings: Seq[Seq[SortOrder]] = operator.requiredChildOrdering
    var children: Seq[SparkPlan] = operator.children
    assert(requiredChildDistributions.length == children.length)
    assert(requiredChildOrderings.length == children.length)

    // Ensure that the operator's children satisfy their output distribution requirements.
    // make each child's actual output distribution (i.e. its partitioning) satisfy the required distribution
    children = children.zip(requiredChildDistributions).map {
      //already satisfied: return the child as-is
      case (child, distribution) if child.outputPartitioning.satisfies(distribution) =>
        child
      // broadcast is handled separately
      case (child, BroadcastDistribution(mode)) =>
        BroadcastExchangeExec(mode, child)
      case (child, distribution) =>
        val numPartitions = distribution.requiredNumPartitions
          .getOrElse(defaultNumPreShufflePartitions)
        // add a shuffle (effectively a repartition that changes the distribution) to satisfy the requirement
        ShuffleExchangeExec(distribution.createPartitioning(numPartitions), child)
    }

    // Get the indexes of children which have specified distribution requirements and need to have
    // same number of partitions.
    // keep only the children that have a concrete distribution requirement
    val childrenIndexes = requiredChildDistributions.zipWithIndex.filter {
      case (UnspecifiedDistribution, _) => false
      case (_: BroadcastDistribution, _) => false
      case _ => true
    }.map(_._2)

    // partition counts of those children
    val childrenNumPartitions =
      childrenIndexes.map(children(_).outputPartitioning.numPartitions).toSet

    if (childrenNumPartitions.size > 1) {
      // Get the number of partitions which is explicitly required by the distributions.
      // the partition count explicitly required by the children's distributions
      val requiredNumPartitions = {
        val numPartitionsSet = childrenIndexes.flatMap {
          index => requiredChildDistributions(index).requiredNumPartitions
        }.toSet
        assert(numPartitionsSet.size <= 1,
          s"$operator have incompatible requirements of the number of partitions for its children")
        numPartitionsSet.headOption
      }
      // the target partition count
      val targetNumPartitions = requiredNumPartitions.getOrElse(childrenNumPartitions.max)
      // unify the partition count of all children with specified distributions to targetNumPartitions
      children = children.zip(requiredChildDistributions).zipWithIndex.map {
        case ((child, distribution), index) if childrenIndexes.contains(index) =>
          if (child.outputPartitioning.numPartitions == targetNumPartitions) {
            child   // already matches the target, return as-is
          } else {
            // does not match the target: shuffle
            val defaultPartitioning = distribution.createPartitioning(targetNumPartitions)
            child match {
              // If child is an exchange, we replace it with a new one having defaultPartitioning.
              case ShuffleExchangeExec(_, c, _) => ShuffleExchangeExec(defaultPartitioning, c)
              case _ => ShuffleExchangeExec(defaultPartitioning, child)
            }
          }

        case ((child, _), _) => child
      }
    }

    // Now, we need to add ExchangeCoordinator if necessary.
    // Actually, it is not a good idea to add ExchangeCoordinators while we are adding Exchanges.
    // However, with the way that we plan the query, we do not have a place where we have a
    // global picture of all shuffle dependencies of a post-shuffle stage. So, we add coordinator
    // at here for now.
    // Once we finish https://issues.apache.org/jira/browse/SPARK-10665,
    // we can first add Exchanges and then add coordinator once we have a DAG of query fragments.
    //add an ExchangeCoordinator if necessary, to coordinate the data distribution across several SparkPlans
    children = withExchangeCoordinator(children, requiredChildDistributions)

    // Now that we've performed any necessary shuffles, add sorts to guarantee output orderings:
    // if an ordering is required, add a SortExec
    children = children.zip(requiredChildOrderings).map { case (child, requiredOrdering) =>
      // If child.outputOrdering already satisfies the requiredOrdering, we do not need to sort.
      if (SortOrder.orderingSatisfies(child.outputOrdering, requiredOrdering)) {
        child
      } else {
        SortExec(requiredOrdering, global = false, child = child)
      }
    }

    operator.withNewChildren(children)
  }

Step 23: the withExchangeCoordinator(children, requiredChildDistributions) method

  private def withExchangeCoordinator(
      children: Seq[SparkPlan],
      requiredChildDistributions: Seq[Distribution]): Seq[SparkPlan] = {
    // decide whether an ExchangeCoordinator can be added
    val supportsCoordinator =
      if (children.exists(_.isInstanceOf[ShuffleExchangeExec])) {
        // Right now, ExchangeCoordinator only support HashPartitionings.
        children.forall {
          // condition 1: the children contain a ShuffleExchangeExec partitioned by HashPartitioning
          case e @ ShuffleExchangeExec(hash: HashPartitioning, _, _) => true
          case child =>
            child.outputPartitioning match {
              case hash: HashPartitioning => true
              case collection: PartitioningCollection =>
                collection.partitionings.forall(_.isInstanceOf[HashPartitioning])
              case _ => false
            }
        }
      } else {
        // In this case, although we do not have Exchange operators, we may still need to
        // shuffle data when we have more than one children because data generated by
        // these children may not be partitioned in the same way.
        // Please see the comment in withCoordinator for more details.
        // condition 2: the required distributions are ClusteredDistribution or HashClusteredDistribution
        val supportsDistribution = requiredChildDistributions.forall { dist =>
          dist.isInstanceOf[ClusteredDistribution] || dist.isInstanceOf[HashClusteredDistribution]
        }
        children.length > 1 && supportsDistribution
      }

    val withCoordinator =
      // adaptiveExecutionEnabled, and condition 1 or 2 holds
      //adaptiveExecutionEnabled defaults to false, so the ExchangeCoordinator is disabled by default
      if (adaptiveExecutionEnabled && supportsCoordinator) {
        val coordinator =
          new ExchangeCoordinator(
            targetPostShuffleInputSize,
            minNumPostShufflePartitions)
        children.zip(requiredChildDistributions).map {
          case (e: ShuffleExchangeExec, _) =>
            // This child is an Exchange, we need to add the coordinator.
            // condition 1: simply attach the coordinator
            e.copy(coordinator = Some(coordinator))
          case (child, distribution) =>
            // If this child is not an Exchange, we need to add an Exchange for now.
            // Ideally, we can try to avoid this Exchange. However, when we reach here,
            // there are at least two children operators (because if there is a single child
            // and we can avoid Exchange, supportsCoordinator will be false and we
            // will not reach here.). Although we can make two children have the same number of
            // post-shuffle partitions. Their numbers of pre-shuffle partitions may be different.
            // For example, let's say we have the following plan
            //         Join
            //         /  \
            //       Agg  Exchange
            //       /      \
            //    Exchange  t2
            //      /
            //     t1
            // In this case, because a post-shuffle partition can include multiple pre-shuffle
            // partitions, a HashPartitioning will not be strictly partitioned by the hashcodes
            // after shuffle. So, even we can use the child Exchange operator of the Join to
            // have a number of post-shuffle partitions that matches the number of partitions of
            // Agg, we cannot say these two children are partitioned in the same way.
            // Here is another case
            //         Join
            //         /  \
            //       Agg1  Agg2
            //       /      \
            //   Exchange1  Exchange2
            //       /       \
            //      t1       t2
            // In this case, two Aggs shuffle data with the same column of the join condition.
            // After we use ExchangeCoordinator, these two Aggs may not be partitioned in the same
            // way. Let's say that Agg1 and Agg2 both have 5 pre-shuffle partitions and 2
            // post-shuffle partitions. It is possible that Agg1 fetches those pre-shuffle
            // partitions by using a partitionStartIndices [0, 3]. However, Agg2 may fetch its
            // pre-shuffle partitions by using another partitionStartIndices [0, 4].
            // So, Agg1 and Agg2 are actually not co-partitioned.
            //
            // It will be great to introduce a new Partitioning to represent the post-shuffle
            // partitions when one post-shuffle partition includes multiple pre-shuffle partitions.
            // condition 2: even though this child's children may have the same number of partitions, they are not necessarily partitioned the same way, so add an exchange with the coordinator
            val targetPartitioning = distribution.createPartitioning(defaultNumPreShufflePartitions)
            assert(targetPartitioning.isInstanceOf[HashPartitioning])
            ShuffleExchangeExec(targetPartitioning, child, Some(coordinator))
        }
      } else {
        // If we do not need ExchangeCoordinator, the original children are returned.
        children
      }

    withCoordinator
  }

Step 24: executedPlan.execute(), the final execution

  final def execute(): RDD[InternalRow] = executeQuery {
    if (isCanonicalizedPlan) {
      throw new IllegalStateException("A canonicalized plan is not supposed to be executed.")
    }
    //run each concrete SparkPlan's doExecute
    doExecute()
  }
  protected final def executeQuery[T](query: => T): T = {
    RDDOperationScope.withScope(sparkContext, nodeName, false, true) {
      prepare()
      waitForSubqueries()
      query
    }
  }
  final def prepare(): Unit = {
    // doPrepare() may depend on it's children, we should call prepare() on all the children first.
    children.foreach(_.prepare())
    synchronized {
      if (!prepared) {
        prepareSubqueries()
        doPrepare()
        prepared = true
      }
    }
  }

The two key methods are doPrepare and doExecute.
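
Up to this point everything is still lazy; a short sketch of what finally triggers the doExecute chain (reusing the df from the first example):

  // toRdd calls executedPlan.execute(), which recursively calls doExecute() on each node,
  // but no job runs until an RDD action is invoked.
  val internalRows = df.queryExecution.toRdd
  internalRows.count()   // this action finally executes the whole physical plan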

Step 25: taking sorting as an example, SortExec's doExecute method

  protected override def doExecute(): RDD[InternalRow] = {
    val peakMemory = longMetric("peakMemory")
    val spillSize = longMetric("spillSize")
    val sortTime = longMetric("sortTime")

    //call the child's execute(), then sort each partition
    child.execute().mapPartitionsInternal { iter =>
      val sorter = createSorter()

      val metrics = TaskContext.get().taskMetrics()
      // Remember spill data size of this task before execute this operator so that we can
      // figure out how many bytes we spilled for this operator.
      val spillSizeBefore = metrics.memoryBytesSpilled
      // sort
      val sortedIterator = sorter.sort(iter.asInstanceOf[Iterator[UnsafeRow]])
      sortTime += sorter.getSortTimeNanos / 1000000
      peakMemory += sorter.getPeakMemoryUsage
      spillSize += metrics.memoryBytesSpilled - spillSizeBefore
      metrics.incPeakExecutionMemory(sorter.getPeakMemoryUsage)

      sortedIterator
    }
  }

Step 26: child.execute(), i.e. ShuffleExchangeExec

  private[this] val exchanges = ArrayBuffer[ShuffleExchangeExec]()

  override protected def doPrepare(): Unit = {
    // If an ExchangeCoordinator is needed, we register this Exchange operator
    // to the coordinator when we do prepare. It is important to make sure
    // we register this operator right before the execution instead of register it
    // in the constructor because it is possible that we create new instances of
    // Exchange operators when we transform the physical plan
    // (then the ExchangeCoordinator will hold references of unneeded Exchanges).
    // So, we should only call registerExchange just before we start to execute
    // the plan.
    coordinator match {
      // Register this exchange with the ExchangeCoordinator
      case Some(exchangeCoordinator) => exchangeCoordinator.registerExchange(this)
      case _ =>
    }
  }

  @GuardedBy("this")
  // Registering simply appends the exchange to the array
  def registerExchange(exchange: ShuffleExchangeExec): Unit = synchronized {
    exchanges += exchange
  }

  protected override def doExecute(): RDD[InternalRow] = attachTree(this, "execute") {
    // Returns the same ShuffleRowRDD if this plan is used by multiple plans.
    // If a cached RDD already exists, return it directly.
    if (cachedShuffleRDD == null) {
      cachedShuffleRDD = coordinator match {
        // With an ExchangeCoordinator
        case Some(exchangeCoordinator) =>
          val shuffleRDD = exchangeCoordinator.postShuffleRDD(this)
          assert(shuffleRDD.partitions.length == newPartitioning.numPartitions)
          shuffleRDD
        // Without an ExchangeCoordinator
        case _ =>
          val shuffleDependency = prepareShuffleDependency()
          preparePostShuffleRDD(shuffleDependency)
      }
    }
    cachedShuffleRDD
  }

Step 25: the case without an ExchangeCoordinator, the prepareShuffleDependency() method

  /**
   * Returns a ShuffleDependency. The most important part of the ShuffleDependency is
   * rddWithPartitionIds, which determines the post-shuffle partition id of every InternalRow.
   */
  private[exchange] def prepareShuffleDependency()
    : ShuffleDependency[Int, InternalRow, InternalRow] = {
    ShuffleExchangeExec.prepareShuffleDependency(
      child.execute(), child.output, newPartitioning, serializer)
  }
  def prepareShuffleDependency(
      rdd: RDD[InternalRow],
      outputAttributes: Seq[Attribute],
      newPartitioning: Partitioning,
      serializer: Serializer): ShuffleDependency[Int, InternalRow, InternalRow] = {
 
  ......

    // Now, we manually create a ShuffleDependency. Because pairs in rddWithPartitionIds
    // are in the form of (partitionId, row) and every partitionId is in the expected range
    // [0, part.numPartitions - 1]. The partitioner of this is a PartitionIdPassthrough.
    val dependency =
      new ShuffleDependency[Int, InternalRow, InternalRow](
        rddWithPartitionIds,
        new PartitionIdPassthrough(part.numPartitions),
        serializer)

    dependency
  }
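The PartitionIdPassthrough used above is essentially an identity partitioner: every record in rddWithPartitionIds is already keyed by its target partition id, so getPartition only needs to return the key. A minimal sketch of that idea follows (the real class is private to Spark; the name IdentityPartitioner is made up for illustration):

import org.apache.spark.Partitioner

// Sketch of the idea behind PartitionIdPassthrough: the key already is the
// target partition id, so it is passed through unchanged.
class IdentityPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key.asInstanceOf[Int]
}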

Step 26: the preparePostShuffleRDD(shuffleDependency) method

  private[exchange] def preparePostShuffleRDD(
      shuffleDependency: ShuffleDependency[Int, InternalRow, InternalRow],
      specifiedPartitionStartIndices: Option[Array[Int]] = None): ShuffledRowRDD = {
    // If an array of partition start indices is provided, we need to use this array
    // to create the ShuffledRowRDD. Also, we need to update newPartitioning to
    // update the number of post-shuffle partitions.
    // If specifiedPartitionStartIndices is provided, it determines the post-shuffle partitioning.
    // The ExchangeCoordinator relies on specifiedPartitionStartIndices to do its work.
    specifiedPartitionStartIndices.foreach { indices =>
      assert(newPartitioning.isInstanceOf[HashPartitioning])
      newPartitioning = UnknownPartitioning(indices.length)
    }
    new ShuffledRowRDD(shuffleDependency, specifiedPartitionStartIndices)
  }
// The result is a ShuffledRowRDD
class ShuffledRowRDD(
    var dependency: ShuffleDependency[Int, InternalRow, InternalRow],
    specifiedPartitionStartIndices: Option[Array[Int]] = None)
  extends RDD[InternalRow](dependency.rdd.context, Nil) {

  // Number of pre-shuffle partitions
  private[this] val numPreShufflePartitions = dependency.partitioner.numPartitions

  // The start index of each post-shuffle partition
  private[this] val partitionStartIndices: Array[Int] = specifiedPartitionStartIndices match {
    case Some(indices) => indices
    case None =>
      // When specifiedPartitionStartIndices is not defined, every post-shuffle partition
      // corresponds to a pre-shuffle partition.
      (0 until numPreShufflePartitions).toArray
  }

  // The partitioner of this RDD
  private[this] val part: Partitioner =
    new CoalescedPartitioner(dependency.partitioner, partitionStartIndices)

  override def getDependencies: Seq[Dependency[_]] = List(dependency)

  override val partitioner: Option[Partitioner] = Some(part)

  // Build all post-shuffle partitions
  override def getPartitions: Array[Partition] = {
    assert(partitionStartIndices.length == part.numPartitions)
    Array.tabulate[Partition](partitionStartIndices.length) { i =>
      val startIndex = partitionStartIndices(i)
      val endIndex =
        if (i < partitionStartIndices.length - 1) {
          partitionStartIndices(i + 1)
        } else {
          numPreShufflePartitions
        }
      new ShuffledRowRDDPartition(i, startIndex, endIndex)
    }
  }

  override def getPreferredLocations(partition: Partition): Seq[String] = {
    val tracker = SparkEnv.get.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster]
    val dep = dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]]
    tracker.getPreferredLocationsForShuffle(dep, partition.index)
  }

  override def compute(split: Partition, context: TaskContext): Iterator[InternalRow] = {
    val shuffledRowPartition = split.asInstanceOf[ShuffledRowRDDPartition]
    // The range of pre-shuffle partitions that we are fetching at here is
    // [startPreShufflePartitionIndex, endPreShufflePartitionIndex - 1].
    val reader =
      SparkEnv.get.shuffleManager.getReader(
        dependency.shuffleHandle,
        shuffledRowPartition.startPreShufflePartitionIndex,
        shuffledRowPartition.endPreShufflePartitionIndex,
        context)
    reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(_._2)
  }

  override def clearDependencies() {
    super.clearDependencies()
    dependency = null
  }
}
/**
 * A Partitioner that might group together one or more partitions from the parent.
 *
 * @param parent a parent partitioner
 * @param partitionStartIndices indices of partitions in parent that should create new partitions
 *   in child (this should be an array of increasing partition IDs). For example, if we have a
 *   parent with 5 partitions, and partitionStartIndices is [0, 2, 4], we get three output
 *   partitions, corresponding to partition ranges [0, 1], [2, 3] and [4] of the parent partitioner.
 */
class CoalescedPartitioner(val parent: Partitioner, val partitionStartIndices: Array[Int])
  extends Partitioner {
  // Maps each parent partition to its coalesced output partition
  @transient private lazy val parentPartitionMapping: Array[Int] = {
    val n = parent.numPartitions
    val result = new Array[Int](n)
    for (i <- 0 until partitionStartIndices.length) {
      val start = partitionStartIndices(i)
      val end = if (i < partitionStartIndices.length - 1) partitionStartIndices(i + 1) else n
      for (j <- start until end) {
        result(j) = i
      }
    }
    result
  }

  // numPartitions, getPartition and the remaining members are omitted here.
}
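Following the Scaladoc example above, a quick standalone sketch (illustrative only, not Spark code) reproduces the parentPartitionMapping computation for a parent with 5 partitions and partitionStartIndices [0, 2, 4]:

// Illustrative only: reproduces the parentPartitionMapping computation.
object CoalescedMappingSketch {
  def mapping(numParentPartitions: Int, startIndices: Array[Int]): Array[Int] = {
    val result = new Array[Int](numParentPartitions)
    for (i <- startIndices.indices) {
      val start = startIndices(i)
      val end = if (i < startIndices.length - 1) startIndices(i + 1) else numParentPartitions
      for (j <- start until end) result(j) = i
    }
    result
  }

  def main(args: Array[String]): Unit = {
    // Parent partitions 0,1 -> output 0; 2,3 -> output 1; 4 -> output 2
    println(mapping(5, Array(0, 2, 4)).mkString(", ")) // 0, 0, 1, 1, 2
  }
}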

Step 27: the case with an ExchangeCoordinator, the exchangeCoordinator.postShuffleRDD(this) method

// Returns a ShuffledRowRDD
  def postShuffleRDD(exchange: ShuffleExchangeExec): ShuffledRowRDD = {
    doEstimationIfNecessary()

    if (!postShuffleRDDs.containsKey(exchange)) {
      throw new IllegalStateException(
        s"The given $exchange is not registered in this coordinator.")
    }

    postShuffleRDDs.get(exchange)
  }
  @GuardedBy("this")
  private def doEstimationIfNecessary(): Unit = synchronized {
    // It is unlikely that this method will be called from multiple threads
    // (when multiple threads trigger the execution of THIS physical)
    // because in common use cases, we will create new physical plan after
    // users apply operations (e.g. projection) to an existing DataFrame.
    // However, if it happens, we have synchronized to make sure only one
    // thread will trigger the job submission.
    if (!estimated) {
      // Make sure we have the expected number of registered Exchange operators.
      assert(exchanges.length == numExchanges)

      val newPostShuffleRDDs = new JHashMap[ShuffleExchangeExec, ShuffledRowRDD](numExchanges)

      // Submit all map stages
      val shuffleDependencies = ArrayBuffer[ShuffleDependency[Int, InternalRow, InternalRow]]()
      val submittedStageFutures = ArrayBuffer[SimpleFutureAction[MapOutputStatistics]]()
      var i = 0
      // Call prepareShuffleDependency on each registered exchange in turn
      while (i < numExchanges) {
        val exchange = exchanges(i)
        val shuffleDependency = exchange.prepareShuffleDependency()
        shuffleDependencies += shuffleDependency
        if (shuffleDependency.rdd.partitions.length != 0) {
          // submitMapStage does not accept RDD with 0 partition.
          // So, we will not submit this dependency.
          submittedStageFutures +=
            exchange.sqlContext.sparkContext.submitMapStage(shuffleDependency)
        }
        i += 1
      }

      // Wait for the finishes of those submitted map stages.
      // Collect the statistics of the map outputs
      val mapOutputStatistics = new Array[MapOutputStatistics](submittedStageFutures.length)
      var j = 0
      while (j < submittedStageFutures.length) {
        // This call is a blocking call. If the stage has not finished, we will wait at here.
        mapOutputStatistics(j) = submittedStageFutures(j).get()
        j += 1
      }

      // If we have mapOutputStatistics.length < numExchange, it is because we do not submit
      // a stage when the number of partitions of this dependency is 0.
      assert(mapOutputStatistics.length <= numExchanges)

      // Now, we estimate partitionStartIndices. partitionStartIndices.length will be the
      // number of post-shuffle partitions.
      // Compute partitionStartIndices
      val partitionStartIndices =
        if (mapOutputStatistics.length == 0) {
          Array.empty[Int]
        } else {
          // Derive partitionStartIndices from mapOutputStatistics
          estimatePartitionStartIndices(mapOutputStatistics)
        }
      // Call preparePostShuffleRDD; the only difference from the no-coordinator case is the partitionStartIndices argument
      var k = 0
      while (k < numExchanges) {
        val exchange = exchanges(k)
        val rdd =
          exchange.preparePostShuffleRDD(shuffleDependencies(k), Some(partitionStartIndices))
        newPostShuffleRDDs.put(exchange, rdd)

        k += 1
      }

      // Finally, we set postShuffleRDDs and estimated.
      assert(postShuffleRDDs.isEmpty)
      assert(newPostShuffleRDDs.size() == numExchanges)
      // Cache the results
      postShuffleRDDs.putAll(newPostShuffleRDDs)
      estimated = true
    }
  }
  /**
   * Estimates partition start indices for post-shuffle partitions based on
   * mapOutputStatistics provided by all pre-shuffle stages.
   */
  def estimatePartitionStartIndices(
      mapOutputStatistics: Array[MapOutputStatistics]): Array[Int] = {
    // If minNumPostShufflePartitions is defined, it is possible that we need to use a
    // value less than advisoryTargetPostShuffleInputSize as the target input size of
    // a post shuffle task.
    // The target input size of each post-shuffle partition, i.e. how much data each partition should hold
    val targetPostShuffleInputSize = minNumPostShufflePartitions match {
      case Some(numPartitions) =>
        val totalPostShuffleInputSize = mapOutputStatistics.map(_.bytesByPartitionId.sum).sum
        // The max at here is to make sure that when we have an empty table, we
        // only have a single post-shuffle partition.
        // There is no particular reason that we pick 16. We just need a number to
        // prevent maxPostShuffleInputSize from being set to 0.
        val maxPostShuffleInputSize =
          math.max(math.ceil(totalPostShuffleInputSize / numPartitions.toDouble).toLong, 16)
        math.min(maxPostShuffleInputSize, advisoryTargetPostShuffleInputSize)

      case None => advisoryTargetPostShuffleInputSize
    }

    logInfo(
      s"advisoryTargetPostShuffleInputSize: $advisoryTargetPostShuffleInputSize, " +
      s"targetPostShuffleInputSize $targetPostShuffleInputSize.")

    // Make sure we do get the same number of pre-shuffle partitions for those stages.
    // Get the number of pre-shuffle partitions; there should be exactly one distinct value
    val distinctNumPreShufflePartitions =
      mapOutputStatistics.map(stats => stats.bytesByPartitionId.length).distinct
    // The reason that we are expecting a single value of the number of pre-shuffle partitions
    // is that when we add Exchanges, we set the number of pre-shuffle partitions
    // (i.e. map output partitions) using a static setting, which is the value of
    // spark.sql.shuffle.partitions. Even if two input RDDs are having different
    // number of partitions, they will have the same number of pre-shuffle partitions
    // (i.e. map output partitions).
    assert(
      distinctNumPreShufflePartitions.length == 1,
      "There should be only one distinct value of the number pre-shuffle partitions " +
        "among registered Exchange operator.")
    val numPreShufflePartitions = distinctNumPreShufflePartitions.head
    // Start building partitionStartIndices
    val partitionStartIndices = ArrayBuffer[Int]()
    // The first element of partitionStartIndices is always 0.
    partitionStartIndices += 0

    var postShuffleInputSize = 0L
    // Adjust the partitions according to targetPostShuffleInputSize, merging small adjacent pre-shuffle partitions
    var i = 0
    while (i < numPreShufflePartitions) {
      // We calculate the total size of ith pre-shuffle partitions from all pre-shuffle stages.
      // Then, we add the total size to postShuffleInputSize.
      var nextShuffleInputSize = 0L
      var j = 0
      while (j < mapOutputStatistics.length) {
        nextShuffleInputSize += mapOutputStatistics(j).bytesByPartitionId(i)
        j += 1
      }

      // If including the nextShuffleInputSize would exceed the target partition size, then start a
      // new partition.
      if (i > 0 && postShuffleInputSize + nextShuffleInputSize > targetPostShuffleInputSize) {
        partitionStartIndices += i
        // reset postShuffleInputSize.
        postShuffleInputSize = nextShuffleInputSize
      } else postShuffleInputSize += nextShuffleInputSize

      i += 1
    }

    partitionStartIndices.toArray
  }
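The merging loop above is easy to verify with a standalone sketch (illustrative only, single map stage, made-up sizes): adjacent pre-shuffle partitions are accumulated until adding the next one would exceed the target size, at which point a new post-shuffle partition starts.

// Illustrative only: mirrors the merging loop of estimatePartitionStartIndices
// for a single pre-shuffle stage. Sizes use arbitrary units.
object EstimateStartIndicesSketch {
  def estimate(bytesByPartitionId: Array[Long], targetSize: Long): Array[Int] = {
    val startIndices = scala.collection.mutable.ArrayBuffer(0)
    var currentSize = 0L
    for (i <- bytesByPartitionId.indices) {
      val next = bytesByPartitionId(i)
      if (i > 0 && currentSize + next > targetSize) {
        startIndices += i        // start a new post-shuffle partition at i
        currentSize = next
      } else {
        currentSize += next
      }
    }
    startIndices.toArray
  }

  def main(args: Array[String]): Unit = {
    val sizes = Array(10L, 20L, 50L, 5L, 5L, 80L) // pretend the target is 64
    println(estimate(sizes, targetSize = 64L).mkString(", ")) // 0, 2, 5
    // Post-shuffle partitions: [0,1], [2,3,4], [5]
  }
}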

In the ExchangeCoordinator case, partitionStartIndices is computed from the map output statistics, and the post-shuffle partitions are adjusted (coalesced) accordingly.
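In Spark 2.x this coordinator-based adjustment is part of adaptive execution and is disabled by default. It is controlled by the configuration keys below (the values are examples only; `spark` is assumed to be an existing SparkSession):

// Adaptive shuffle settings in Spark 2.x; values shown are examples, not recommendations.
spark.conf.set("spark.sql.adaptive.enabled", "true")
// Advisory target size (in bytes) of a post-shuffle partition, used by estimatePartitionStartIndices.
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864") // 64 MB
// Number of pre-shuffle (map output) partitions.
spark.conf.set("spark.sql.shuffle.partitions", "200")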

The final output of execute is an RDD; what remains is Spark's ordinary computation over that RDD. In fact, the ultimate goal of Spark SQL is precisely to produce an RDD and hand it over to Spark Core for execution.
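From the user side, every stage of this pipeline, and the final RDD, can be inspected through QueryExecution. A rough example (the session setup and table t are placeholders for illustration):

import org.apache.spark.sql.SparkSession

// Placeholder session and data, just to have something to inspect.
val spark = SparkSession.builder().appName("plan-inspection").master("local[*]").getOrCreate()
spark.range(100).toDF("id").createOrReplaceTempView("t")

val df = spark.sql("SELECT id, COUNT(*) AS cnt FROM t GROUP BY id")
val qe = df.queryExecution

println(qe.analyzed)      // analyzed logical plan
println(qe.optimizedPlan) // optimized logical plan
println(qe.executedPlan)  // physical plan after preparations such as EnsureRequirements

// executedPlan.execute() is what ultimately runs; toRdd exposes the resulting
// RDD[InternalRow] that Spark Core schedules.
println(qe.toRdd.toDebugString)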
