hive> select * from test.tmp_demo_small;
OK
tmp_demo_small.pas_phone tmp_demo_small.age
156 20
157 22
158 15
hive> analyze table test.tmp_demo_small compute statistics;
Table test.tmp_demo_small stats: [numFiles=1, numRows=3, totalSize=21, rawDataSize=18]
hive> select * from test.tmp_demo_big;
OK
tmp_demo_big.pas_phone tmp_demo_big.ord_id tmp_demo_big.dt
156 aa1 20191111
156 aa2 20191112
157 bb1 20191111
157 bb2 20191112
157 bb3 20191113
157 bb4 20191114
158 cc1 20191111
158 cc2 20191112
158 cc3 20191113
hive> analyze table test.tmp_demo_big compute statistics;
Table test.tmp_demo_big stats: [numFiles=1, numRows=9, totalSize=153, rawDataSize=144]
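Both tables get ANALYZE TABLE ... COMPUTE STATISTICS up front so that totalSize lands in the metastore: that number is what feeds the sizeInBytes estimate the planner consults below. A quick way to peek at the estimate from the Scala side (using the 2.4 API, where stats takes no argument; on 2.2 it is plan.stats(conf), as in the source later on):
// Inspect the size estimate Spark derives from the Hive stats computed above.
val smallPlan = spark.table("test.tmp_demo_small").queryExecution.optimizedPlan
println(smallPlan.stats.sizeInBytes) // tracks the totalSize reported by ANALYZE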
The Spark SQL parsing pipeline is covered in detail in 《Apache Spark源码走读之11 – sql的解析与执行》 and is not the focus of this post. The parsed syntax trees are still useful here, though: they make it obvious which table sits on the left and which on the right of the join; otherwise some readers might wonder what BuildRight is even supposed to be.
Conclusion up front: in Spark 2.2.0, when a small table joins another small table (both satisfying the default broadcast condition, spark.sql.autoBroadcastJoinThreshold = 10 MB), the right table always wins the broadcast match, whether or not a broadcast target is specified; in other words, the hint is ignored in this case.
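The plan dumps below are in the format of Dataset.explain(true) (equivalently EXPLAIN EXTENDED); the first one can be reproduced in a spark-shell like this:
// Prints all four stages: parsed, analyzed, optimized, physical.
spark.sql("""
  SELECT
    big.pas_phone,
    big.ord_id,
    small.age,
    sum(1) OVER (PARTITION BY big.pas_phone) AS ord_cnt
  FROM test.tmp_demo_small AS small
  JOIN test.tmp_demo_big AS big
    ON small.pas_phone = big.pas_phone
  WHERE small.age > 21
""").explain(true)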
The annotations are all inline in the code and plans below.
select
big.pas_phone,
big.ord_id,
small.age,
sum(1) over(partition by big.pas_phone) as ord_cnt
from
test.tmp_demo_small as small -- small table, 3 rows
join
test.tmp_demo_big as big -- big table, 9 rows
on
small.pas_phone = big.pas_phone
where
small.age > 21
== Parsed Logical Plan == -- the abstract syntax tree, parsed by ANTLR
Project [pas_phone#39, ord_id#40, age#38, ord_cnt#35L]
+- Project [pas_phone#39, ord_id#40, age#38, ord_cnt#35L, ord_cnt#35L]
+- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#39, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#35L], [pas_phone#39]
+- Project [pas_phone#39, ord_id#40, age#38] -- we only know which attributes are selected, not which table they belong to, let alone their data types
+- Filter (age#38 > 21)
+- Join Inner, (pas_phone#37 = pas_phone#39)
:- SubqueryAlias small
: +- SubqueryAlias tmp_demo_small
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#37, age#38]
+- SubqueryAlias big
+- SubqueryAlias tmp_demo_big
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#39, ord_id#40, dt#41]
== Analyzed Logical Plan == -- the resolved logical plan
pas_phone: int, ord_id: string, age: int, ord_cnt: bigint -- data types resolved
Project [pas_phone#39, ord_id#40, age#38, ord_cnt#35L]
+- Project [pas_phone#39, ord_id#40, age#38, ord_cnt#35L, ord_cnt#35L]
+- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#39, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#35L], [pas_phone#39]
+- Project [pas_phone#39, ord_id#40, age#38]
+- Filter (age#38 > 21)
+- Join Inner, (pas_phone#37 = pas_phone#39)
:- SubqueryAlias small
: +- SubqueryAlias tmp_demo_small
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#37, age#38]
+- SubqueryAlias big
+- SubqueryAlias tmp_demo_big
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#39, ord_id#40, dt#41]
== Optimized Logical Plan == -- after logical optimization
Window [sum(1) windowspecdefinition(pas_phone#39, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#35L], [pas_phone#39]
+- Project [pas_phone#39, ord_id#40, age#38]
+- Join Inner, (pas_phone#37 = pas_phone#39)
:- Filter ((isnotnull(age#38) && (age#38 > 21)) && isnotnull(pas_phone#37)) -- predicate pushdown
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#37, age#38]
+- Project [pas_phone#39, ord_id#40]
+- Filter isnotnull(pas_phone#39)
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#39, ord_id#40, dt#41]
== Physical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#39, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#35L], [pas_phone#39]
+- *Sort [pas_phone#39 ASC NULLS FIRST], false, 0
+- Exchange(coordinator id: 449256327) hashpartitioning(pas_phone#39, 1000), coordinator[target post-shuffle partition size: 67108864]
+- *Project [pas_phone#39, ord_id#40, age#38]
+- *BroadcastHashJoin [pas_phone#37], [pas_phone#39], Inner, BuildRight -- BuildRight: the right table (tmp_demo_big, the larger one) is broadcast
:- *Filter ((isnotnull(age#38) && (age#38 > 21)) && isnotnull(pas_phone#37))
: +- HiveTableScan [pas_phone#37, age#38], HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#37, age#38]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
+- *Filter isnotnull(pas_phone#39)
+- HiveTableScan [pas_phone#39, ord_id#40], HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#39, ord_id#40, dt#41]
select
/*+ BROADCAST(small) */
big.pas_phone,
big.ord_id,
small.age,
sum(1) over(partition by big.pas_phone) as ord_cnt
from
test.tmp_demo_small as small -- small table, 3 rows
join
test.tmp_demo_big as big -- big table, 9 rows
on
small.pas_phone = big.pas_phone
where
small.age > 21
== Parsed Logical Plan ==
Project [pas_phone#61, ord_id#62, age#60, ord_cnt#57L]
+- Project [pas_phone#61, ord_id#62, age#60, ord_cnt#57L, ord_cnt#57L]
+- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#61, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#57L], [pas_phone#61]
+- Project [pas_phone#61, ord_id#62, age#60]
+- Filter (age#60 > 21)
+- Join Inner, (pas_phone#59 = pas_phone#61)
:- ResolvedHint isBroadcastable=true
: +- SubqueryAlias small
: +- SubqueryAlias tmp_demo_small
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#59, age#60]
+- SubqueryAlias big
+- SubqueryAlias tmp_demo_big
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#61, ord_id#62, dt#63]
== Analyzed Logical Plan ==
pas_phone: int, ord_id: string, age: int, ord_cnt: bigint
Project [pas_phone#61, ord_id#62, age#60, ord_cnt#57L]
+- Project [pas_phone#61, ord_id#62, age#60, ord_cnt#57L, ord_cnt#57L]
+- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#61, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#57L], [pas_phone#61]
+- Project [pas_phone#61, ord_id#62, age#60]
+- Filter (age#60 > 21)
+- Join Inner, (pas_phone#59 = pas_phone#61)
:- ResolvedHint isBroadcastable=true
: +- SubqueryAlias small
: +- SubqueryAlias tmp_demo_small
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#59, age#60]
+- SubqueryAlias big
+- SubqueryAlias tmp_demo_big
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#61, ord_id#62, dt#63]
== Optimized Logical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#61, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#57L], [pas_phone#61]
+- Project [pas_phone#61, ord_id#62, age#60]
+- Join Inner, (pas_phone#59 = pas_phone#61)
:- ResolvedHint isBroadcastable=true -- as of logical optimization, the hint is still in effect
: +- Filter ((isnotnull(age#60) && (age#60 > 21)) && isnotnull(pas_phone#59))
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#59, age#60]
+- Project [pas_phone#61, ord_id#62]
+- Filter isnotnull(pas_phone#61)
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#61, ord_id#62, dt#63]
== Physical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#61, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#57L], [pas_phone#61]
+- *Sort [pas_phone#61 ASC NULLS FIRST], false, 0
+- Exchange(coordinator id: 1477200907) hashpartitioning(pas_phone#61, 1000), coordinator[target post-shuffle partition size: 67108864]
+- *Project [pas_phone#61, ord_id#62, age#60]
+- *BroadcastHashJoin [pas_phone#59], [pas_phone#61], Inner, BuildRight -- BuildRight: the right table is still the one broadcast, despite the hint
:- *Filter ((isnotnull(age#60) && (age#60 > 21)) && isnotnull(pas_phone#59))
: +- HiveTableScan [pas_phone#59, age#60], HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#59, age#60]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
+- *Filter isnotnull(pas_phone#61)
+- HiveTableScan [pas_phone#61, ord_id#62], HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#61, ord_id#62, dt#63]
Walking down the plans, everything looks fine at first: the filters are pushed down during logical optimization, exactly as the RBO rules prescribe. But the final physical plan goes wrong. In theory the planner should compare the two children and broadcast whichever is smaller, so why doesn't it? The problem has to be in how the physical plan chooses the join strategy, so let's go to the Spark 2.2.0 source and start reading from apply.
Location: spark-2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
object JoinSelection extends Strategy with PredicateHelper {
/**
* Matches a plan whose output should be small enough to be used in broadcast join.
*/
// 3. canBroadcast(right): the argument is a LogicalPlan, i.e. a subtree that carries the table's internal information, including its meta information and any resolved hint. Returns true if either condition holds: a broadcast hint exists, or the subtree's post-filter size estimate (here, the right table) is >= 0 and at most a threshold (10 MB by default)
private def canBroadcast(plan: LogicalPlan): Boolean = {
plan.stats(conf).hints.isBroadcastable.getOrElse(false) ||
(plan.stats(conf).sizeInBytes >= 0 &&
plan.stats(conf).sizeInBytes <= conf.autoBroadcastJoinThreshold)
}
... part of the code omitted
// 2. canBuildRight(joinType): for our inner join, returns true
private def canBuildRight(joinType: JoinType): Boolean = joinType match {
case _: InnerLike | LeftOuter | LeftSemi | LeftAnti => true
case j: ExistenceJoin => true
case _ => false
}
private def canBuildLeft(joinType: JoinType): Boolean = joinType match {
case _: InnerLike | RightOuter => true
case _ => false
}
def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
// --- BroadcastHashJoin --------------------------------------------------------------------
// 1. The broadcast decision: first check (2) canBuildRight(joinType), then (3) canBroadcast(right); when both are true, a broadcast join is executed and the right table is broadcast, ignoring whichever table the hint designates
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
if canBuildRight(joinType) && canBroadcast(right) =>
Seq(joins.BroadcastHashJoinExec(
leftKeys, rightKeys, joinType, BuildRight, condition, planLater(left), planLater(right)))
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
if canBuildLeft(joinType) && canBroadcast(left) =>
Seq(joins.BroadcastHashJoinExec(
leftKeys, rightKeys, joinType, BuildLeft, condition, planLater(left), planLater(right)))
// --- ShuffledHashJoin ---------------------------------------------------------------------
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
if !conf.preferSortMergeJoin && canBuildRight(joinType) && canBuildLocalHashMap(right)
&& muchSmaller(right, left) ||
!RowOrdering.isOrderable(leftKeys) =>
...
// --- SortMergeJoin ------------------------------------------------------------
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
if RowOrdering.isOrderable(leftKeys) =>
...
// --- Without joining keys ------------------------------------------------------------
...
case _ => Nil
}
}
This explains why the hint has no effect in Spark 2.2.0. When choosing the join strategy, broadcast join is considered first, and the pattern match tests the right side first: as long as the right table is small enough to satisfy the broadcast rules, it is broadcast, no matter whether a hint exists or which table the hint names. Only when the right table is too large, and is not marked for broadcast by a hint, does the second case fire and check whether the left table qualifies.
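To make the ordering concrete, here is a minimal, self-contained distillation of the 2.2.0 decision; Stats, decideBuildSide, and the hard-coded threshold are illustrative names for this sketch, not Spark API:
object JoinSelection22Sketch extends App {
  // Stand-in for the stats a LogicalPlan subtree carries: size estimate plus hint.
  case class Stats(sizeInBytes: BigInt, broadcastHint: Boolean)

  val autoBroadcastJoinThreshold = BigInt(10L * 1024 * 1024) // default: 10 MB

  // Mirrors 2.2's canBroadcast: hint OR size within the threshold.
  def canBroadcast(s: Stats): Boolean =
    s.broadcastHint ||
      (s.sizeInBytes >= 0 && s.sizeInBytes <= autoBroadcastJoinThreshold)

  // For an inner join canBuildRight and canBuildLeft are both true, so the
  // right-side case is always tried first -- a hint on the left never gets a say.
  def decideBuildSide(left: Stats, right: Stats): String =
    if (canBroadcast(right)) "BuildRight"    // matched first, hint or no hint
    else if (canBroadcast(left)) "BuildLeft" // reached only if the right side is too big
    else "fall through to ShuffledHashJoin / SortMergeJoin"

  // Both demo tables are tiny and the hint sits on the LEFT one: still BuildRight.
  println(decideBuildSide(
    left  = Stats(sizeInBytes = 21,  broadcastHint = true),
    right = Stats(sizeInBytes = 153, broadcastHint = false))) // -> BuildRight
}
Now let's point the same query at 2.4.3 and see what happens.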
select
/*+ BROADCAST(small) */
big.pas_phone,
big.ord_id,
small.age,
sum(1) over(partition by big.pas_phone) as ord_cnt
from
test.tmp_demo_small as small -- small table, 3 rows
join
test.tmp_demo_big as big -- big table, 9 rows
on
small.pas_phone = big.pas_phone
where
small.age > 21
== Parsed Logical Plan ==
Project [pas_phone#4, ord_id#5, age#3, ord_cnt#0L]
+- Project [pas_phone#4, ord_id#5, age#3, ord_cnt#0L, ord_cnt#0L]
+- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#4, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#0L], [pas_phone#4]
+- Project [pas_phone#4, ord_id#5, age#3]
+- Filter (age#3 > 21)
+- Join Inner, (pas_phone#2 = pas_phone#4)
:- ResolvedHint (broadcast)
: +- SubqueryAlias `small`
: +- SubqueryAlias `test`.`tmp_demo_small`
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#2, age#3]
+- SubqueryAlias `big`
+- SubqueryAlias `test`.`tmp_demo_big`
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#4, ord_id#5, dt#6]
== Analyzed Logical Plan ==
pas_phone: int, ord_id: string, age: int, ord_cnt: bigint
Project [pas_phone#4, ord_id#5, age#3, ord_cnt#0L]
+- Project [pas_phone#4, ord_id#5, age#3, ord_cnt#0L, ord_cnt#0L]
+- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#4, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#0L], [pas_phone#4]
+- Project [pas_phone#4, ord_id#5, age#3]
+- Filter (age#3 > 21)
+- Join Inner, (pas_phone#2 = pas_phone#4)
:- ResolvedHint (broadcast)
: +- SubqueryAlias `small`
: +- SubqueryAlias `test`.`tmp_demo_small`
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#2, age#3]
+- SubqueryAlias `big`
+- SubqueryAlias `test`.`tmp_demo_big`
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#4, ord_id#5, dt#6]
== Optimized Logical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#4, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#0L], [pas_phone#4]
+- Project [pas_phone#4, ord_id#5, age#3]
+- Join Inner, (pas_phone#2 = pas_phone#4)
:- ResolvedHint (broadcast) -- the hint is resolved and designates the broadcast table
: +- Filter ((isnotnull(age#3) && (age#3 > 21)) && isnotnull(pas_phone#2))
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#2, age#3]
+- Project [pas_phone#4, ord_id#5]
+- Filter isnotnull(pas_phone#4)
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#4, ord_id#5, dt#6]
== Physical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#4, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#0L], [pas_phone#4]
+- *(3) Sort [pas_phone#4 ASC NULLS FIRST], false, 0
+- Exchange(coordinator id: 632554218) hashpartitioning(pas_phone#4, 1000), coordinator[target post-shuffle partition size: 67108864]
+- *(2) Project [pas_phone#4, ord_id#5, age#3]
+- *(2) BroadcastHashJoin [pas_phone#2], [pas_phone#4], Inner, BuildLeft -- BuildLeft: the hint takes effect
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
: +- *(1) Filter ((isnotnull(age#3) && (age#3 > 21)) && isnotnull(pas_phone#2))
: +- Scan hive test.tmp_demo_small [pas_phone#2, age#3], HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#2, age#3]
+- *(2) Filter isnotnull(pas_phone#4)
+- Scan hive test.tmp_demo_big [pas_phone#4, ord_id#5], HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#4, ord_id#5, dt#6]
select
big.pas_phone,
big.ord_id,
small.age,
sum(1) over(partition by big.pas_phone) as ord_cnt
from
test.tmp_demo_small as small -- small table, 3 rows
join
test.tmp_demo_big as big -- big table, 9 rows
on
small.pas_phone = big.pas_phone
where
small.age > 21
== Parsed Logical Plan ==
Project [pas_phone#15, ord_id#16, age#14, ord_cnt#11L]
+- Project [pas_phone#15, ord_id#16, age#14, ord_cnt#11L, ord_cnt#11L]
+- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#15, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#11L], [pas_phone#15]
+- Project [pas_phone#15, ord_id#16, age#14]
+- Filter (age#14 > 21)
+- Join Inner, (pas_phone#13 = pas_phone#15)
:- SubqueryAlias `small`
: +- SubqueryAlias `test`.`tmp_demo_small`
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#13, age#14]
+- SubqueryAlias `big`
+- SubqueryAlias `test`.`tmp_demo_big`
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#15, ord_id#16, dt#17]
== Analyzed Logical Plan ==
pas_phone: int, ord_id: string, age: int, ord_cnt: bigint
Project [pas_phone#15, ord_id#16, age#14, ord_cnt#11L]
+- Project [pas_phone#15, ord_id#16, age#14, ord_cnt#11L, ord_cnt#11L]
+- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#15, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#11L], [pas_phone#15]
+- Project [pas_phone#15, ord_id#16, age#14]
+- Filter (age#14 > 21)
+- Join Inner, (pas_phone#13 = pas_phone#15)
:- SubqueryAlias `small`
: +- SubqueryAlias `test`.`tmp_demo_small`
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#13, age#14]
+- SubqueryAlias `big`
+- SubqueryAlias `test`.`tmp_demo_big`
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#15, ord_id#16, dt#17]
== Optimized Logical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#15, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#11L], [pas_phone#15]
+- Project [pas_phone#15, ord_id#16, age#14]
+- Join Inner, (pas_phone#13 = pas_phone#15)
:- Filter ((isnotnull(age#14) && (age#14 > 21)) && isnotnull(pas_phone#13))
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#13, age#14]
+- Project [pas_phone#15, ord_id#16]
+- Filter isnotnull(pas_phone#15)
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#15, ord_id#16, dt#17]
== Physical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#15, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#11L], [pas_phone#15]
+- *(3) Sort [pas_phone#15 ASC NULLS FIRST], false, 0
+- Exchange(coordinator id: 1731877543) hashpartitioning(pas_phone#15, 1000), coordinator[target post-shuffle partition size: 67108864]
+- *(2) Project [pas_phone#15, ord_id#16, age#14]
+- *(2) BroadcastHashJoin [pas_phone#13], [pas_phone#15], Inner, BuildLeft -- the left (smaller) table is broadcast, even without a hint
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
: +- *(1) Filter ((isnotnull(age#14) && (age#14 > 21)) && isnotnull(pas_phone#13))
: +- Scan hive test.tmp_demo_small [pas_phone#13, age#14], HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#13, age#14]
+- *(2) Filter isnotnull(pas_phone#15)
+- Scan hive test.tmp_demo_big [pas_phone#15, ord_id#16], HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#15, ord_id#16, dt#17]
Now this is getting interesting. Let's look at the 2.4 source.
Location: spark-2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
object JoinSelection extends Strategy with PredicateHelper {
/**
* Matches a plan whose output should be small enough to be used in broadcast join.
*/
private def canBroadcast(plan: LogicalPlan): Boolean = {
plan.stats.sizeInBytes >= 0 && plan.stats.sizeInBytes <= conf.autoBroadcastJoinThreshold
}
/**
* Matches a plan whose single partition should be small enough to build a hash table.
*
* Note: this assume that the number of partition is fixed, requires additional work if it's
* dynamic.
*/
private def canBuildLocalHashMap(plan: LogicalPlan): Boolean = {
plan.stats.sizeInBytes < conf.autoBroadcastJoinThreshold * conf.numShufflePartitions
}
/**
* Returns whether plan a is much smaller (3X) than plan b.
*
* The cost to build hash map is higher than sorting, we should only build hash map on a table
* that is much smaller than other one. Since we does not have the statistic for number of rows,
* use the size of bytes here as estimation.
*/
private def muchSmaller(a: LogicalPlan, b: LogicalPlan): Boolean = {
a.stats.sizeInBytes * 3 <= b.stats.sizeInBytes
}
private def canBuildRight(joinType: JoinType): Boolean = joinType match {
case _: InnerLike | LeftOuter | LeftSemi | LeftAnti | _: ExistenceJoin => true
case _ => false
}
private def canBuildLeft(joinType: JoinType): Boolean = joinType match {
case _: InnerLike | RightOuter => true
case _ => false
}
// 3. Simply compares the sizes of the two sides (given which sides may be built)
private def broadcastSide(
canBuildLeft: Boolean,
canBuildRight: Boolean,
left: LogicalPlan,
right: LogicalPlan): BuildSide = {
def smallerSide =
if (right.stats.sizeInBytes <= left.stats.sizeInBytes) BuildRight else BuildLeft
if (canBuildRight && canBuildLeft) {
// Broadcast smaller side base on its estimated physical size
// if both sides have broadcast hint
smallerSide
} else if (canBuildRight) {
BuildRight
} else if (canBuildLeft) {
BuildLeft
} else {
// for the last default broadcast nested loop join
smallerSide
}
}
// 1. canBroadcastByHints(joinType, left, right): checks canBuildLeft(joinType) and canBuildRight(joinType); only one of them needs to be true together with a broadcast hint on that side, and the allowed join types cover most cases. In essence it asks whether either subtree (left or right table) carries a broadcast hint
private def canBroadcastByHints(joinType: JoinType, left: LogicalPlan, right: LogicalPlan)
: Boolean = {
val buildLeft = canBuildLeft(joinType) && left.stats.hints.broadcast
val buildRight = canBuildRight(joinType) && right.stats.hints.broadcast
buildLeft || buildRight
}
// 2. broadcastSideByHints(joinType, left, right) then calls broadcastSide (3), which in the end simply compares the two tables' sizes
private def broadcastSideByHints(joinType: JoinType, left: LogicalPlan, right: LogicalPlan)
: BuildSide = {
val buildLeft = canBuildLeft(joinType) && left.stats.hints.broadcast
val buildRight = canBuildRight(joinType) && right.stats.hints.broadcast
broadcastSide(buildLeft, buildRight, left, right)
}
private def canBroadcastBySizes(joinType: JoinType, left: LogicalPlan, right: LogicalPlan)
: Boolean = {
val buildLeft = canBuildLeft(joinType) && canBroadcast(left)
val buildRight = canBuildRight(joinType) && canBroadcast(right)
buildLeft || buildRight
}
private def broadcastSideBySizes(joinType: JoinType, left: LogicalPlan, right: LogicalPlan)
: BuildSide = {
val buildLeft = canBuildLeft(joinType) && canBroadcast(left)
val buildRight = canBuildRight(joinType) && canBroadcast(right)
broadcastSide(buildLeft, buildRight, left, right)
}
def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
// Two cases are distinguished: a hint was specified, or no hint was specified
// --- BroadcastHashJoin --------------------------------------------------------------------
// broadcast hints were specified
// With a hint: canBroadcastByHints(joinType, left, right) (1) being true only says a hint exists and the join type allows building that side; broadcastSideByHints(joinType, left, right) (2) then decides which table is broadcast
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
if canBroadcastByHints(joinType, left, right) =>
val buildSide = broadcastSideByHints(joinType, left, right)
Seq(joins.BroadcastHashJoinExec(
leftKeys, rightKeys, joinType, buildSide, condition, planLater(left), planLater(right)))
// broadcast hints were not specified, so need to infer it from size and configuration.
// Without a hint: go straight to comparing the two tables' sizes to decide which is broadcast (given the usual preconditions)
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
if canBroadcastBySizes(joinType, left, right) =>
val buildSide = broadcastSideBySizes(joinType, left, right)
Seq(joins.BroadcastHashJoinExec(
leftKeys, rightKeys, joinType, buildSide, condition, planLater(left), planLater(right)))
// --- ShuffledHashJoin ---------------------------------------------------------------------
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
if !conf.preferSortMergeJoin && canBuildRight(joinType) && canBuildLocalHashMap(right)
&& muchSmaller(right, left) ||
!RowOrdering.isOrderable(leftKeys) =>
...
// --- SortMergeJoin ------------------------------------------------------------
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
if RowOrdering.isOrderable(leftKeys) =>
...
// --- Without joining keys ----------------------------------------------------------
...
}
}
To wrap up this part, here are two hand-drawn diagrams contrasting how 2.2 and 2.4 decide on a broadcast join, plus a code distillation of the 2.4 flow right after.
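In code form, the 2.4 flow boils down to the sketch below: first decide whether a broadcast join applies at all (hints outrank sizes), then decide which side to build by comparing both sides. Same illustrative conventions as the 2.2 sketch above (inner join assumed, so both build sides are allowed; not Spark API):
object JoinSelection24Sketch extends App {
  case class Stats(sizeInBytes: BigInt, broadcastHint: Boolean)

  val threshold = BigInt(10L * 1024 * 1024)

  def decideBuildSide(left: Stats, right: Stats): String = {
    // Mirrors broadcastSide: when both sides qualify, take the smaller one.
    def smallerSide =
      if (right.sizeInBytes <= left.sizeInBytes) "BuildRight" else "BuildLeft"
    def pick(buildLeft: Boolean, buildRight: Boolean) =
      if (buildLeft && buildRight) smallerSide
      else if (buildRight) "BuildRight"
      else "BuildLeft"

    if (left.broadcastHint || right.broadcastHint)  // canBroadcastByHints
      pick(left.broadcastHint, right.broadcastHint)
    else if (left.sizeInBytes <= threshold ||
             right.sizeInBytes <= threshold)        // canBroadcastBySizes
      pick(left.sizeInBytes <= threshold, right.sizeInBytes <= threshold)
    else "fall through to ShuffledHashJoin / SortMergeJoin"
  }

  // Same input as the 2.2 sketch -- the hint on the LEFT table now wins:
  println(decideBuildSide(
    left  = Stats(sizeInBytes = 21,  broadcastHint = true),
    right = Stats(sizeInBytes = 153, broadcastHint = false))) // -> BuildLeft
}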
What actually started all of this was a HAVING problem: the query ran fine locally, but after packaging it and submitting to the cluster with spark-submit it failed inexplicably.
select
big.pas_phone,
big.ord_id,
small.age,
sum(1) over(partition by big.pas_phone) as ord_cnt
from
test.tmp_demo_small as small -- small table, 3 rows
join
test.tmp_demo_big as big -- big table, 9 rows
on
small.pas_phone = big.pas_phone
where
small.age > 21
having
ord_cnt > 2
Error in query: grouping expressions sequence is empty, and 'big.`pas_phone`' is not an aggregate function. Wrap '()' in windowing function(s) or wrap 'big.`pas_phone`' in first() (or first_value) if you don't care which value you get.;;
'Project [pas_phone#26, ord_id#27, age#25, ord_cnt#22L]
+- 'Project [pas_phone#26, ord_id#27, age#25, ord_cnt#22L, ord_cnt#22L]
+- 'Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#26, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#22L], [pas_phone#26]
+- 'Filter ('ord_cnt > 2)
+- Aggregate [pas_phone#26, ord_id#27, age#25]
+- Filter (age#25 > 21)
+- Join Inner, (pas_phone#24 = pas_phone#26)
:- SubqueryAlias `small`
: +- SubqueryAlias `test`.`tmp_demo_small`
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#24, age#25]
+- SubqueryAlias `big`
+- SubqueryAlias `test`.`tmp_demo_big`
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#26, ord_id#27, dt#28]
The cause turned out to be version skew. Running locally I used yarn-client mode, so the driver was my own server, which runs Spark 2.2.2, and everything worked: the driver is what builds the DAG and splits tasks once the SQL has been turned into RDDs, and it also does the up-front SQL parsing (the sql -> rdd step). The cluster submission used yarn-cluster mode, so the driver ran on some machine in the cluster, and, awkwardly, the company had upgraded the cluster to 2.4.3, so the SQL-parsing environment no longer matched my local one. The release notes for the new version turned up this:
In Spark version 2.3 and earlier, HAVING without GROUP BY is treated as WHERE. This means, SELECT 1 FROM range(10) HAVING true is executed as SELECT 1 FROM range(10) WHERE true and returns 10 rows. This violates SQL standard, and has been fixed in Spark 2.4. Since Spark 2.4, HAVING without GROUP BY is treated as a global aggregate, which means SELECT 1 FROM range(10) HAVING true will return only one row. To restore the previous behavior, set spark.sql.legacy.parser.havingWithoutGroupByAsWhere to true.
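The change is easy to reproduce in a 2.4 spark-shell (spark is the SparkSession; the expected counts follow from the release note above):
// Spark 2.4: HAVING without GROUP BY is a global aggregate -> one row.
spark.sql("SELECT 1 FROM range(10) HAVING true").count() // 1

// Flip the legacy switch and HAVING is treated as WHERE again -> ten rows.
spark.conf.set("spark.sql.legacy.parser.havingWithoutGroupByAsWhere", "true")
spark.sql("SELECT 1 FROM range(10) HAVING true").count() // 10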
Cool, problem solved and root cause found. If you insist on the old 2.2 behavior and don't want to rewrite the whole query, just prepend set spark.sql.legacy.parser.havingWithoutGroupByAsWhere=true;
The fix:
set spark.sql.legacy.parser.havingWithoutGroupByAsWhere=true;
select
big.pas_phone,
big.ord_id,
small.age,
sum(1) over(partition by big.pas_phone) as ord_cnt
from
test.tmp_demo_small as small -- small table, 3 rows
join
test.tmp_demo_big as big -- big table, 9 rows
on
small.pas_phone = big.pas_phone
where
small.age > 21
having
ord_cnt > 2
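If you would rather not depend on a legacy flag, the portable rewrite is to filter the window column in an outer query; this sketch should behave the same on both 2.2 and 2.4:
spark.sql("""
  SELECT * FROM (
    SELECT
      big.pas_phone,
      big.ord_id,
      small.age,
      sum(1) OVER (PARTITION BY big.pas_phone) AS ord_cnt
    FROM test.tmp_demo_small AS small
    JOIN test.tmp_demo_big AS big
      ON small.pas_phone = big.pas_phone
    WHERE small.age > 21
  ) t
  WHERE t.ord_cnt > 2
""").show()
For reference, here is the full scaladoc of JoinSelection from the 2.4 source, which spells out the precedence discussed above: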
/**
* Select the proper physical plan for join based on joining keys and size of logical plan.
*
* At first, uses the [[ExtractEquiJoinKeys]] pattern to find joins where at least some of the
* predicates can be evaluated by matching join keys. If found, join implementations are chosen
* with the following precedence:
*
* - Broadcast hash join (BHJ):
* BHJ is not supported for full outer join. For right outer join, we only can broadcast the
* left side. For left outer, left semi, left anti and the internal join type ExistenceJoin,
* we only can broadcast the right side. For inner like join, we can broadcast both sides.
* Normally, BHJ can perform faster than the other join algorithms when the broadcast side is
* small. However, broadcasting tables is a network-intensive operation. It could cause OOM
* or perform worse than the other join algorithms, especially when the build/broadcast side
* is big.
*
* For the supported cases, users can specify the broadcast hint (e.g. the user applied the
* [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame) and session-based
* [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold to adjust whether BHJ is used and
* which join side is broadcast.
*
* 1) Broadcast the join side with the broadcast hint, even if the size is larger than
* [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]]. If both sides have the hint (only when the type
* is inner like join), the side with a smaller estimated physical size will be broadcast.
* 2) Respect the [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold and broadcast the side
* whose estimated physical size is smaller than the threshold. If both sides are below the
* threshold, broadcast the smaller side. If neither is smaller, BHJ is not used.
*
* - Shuffle hash join: if the average size of a single partition is small enough to build a hash
* table.
*
* - Sort merge: if the matching join keys are sortable.
*
* If there is no joining keys, Join implementations are chosen with the following precedence:
* - BroadcastNestedLoopJoin (BNLJ):
* BNLJ supports all the join types but the impl is OPTIMIZED for the following scenarios:
* For right outer join, the left side is broadcast. For left outer, left semi, left anti
* and the internal join type ExistenceJoin, the right side is broadcast. For inner like
* joins, either side is broadcast.
*
* Like BHJ, users still can specify the broadcast hint and session-based
* [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold to impact which side is broadcast.
*
* 1) Broadcast the join side with the broadcast hint, even if the size is larger than
* [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]]. If both sides have the hint (i.e., just for
* inner-like join), the side with a smaller estimated physical size will be broadcast.
* 2) Respect the [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold and broadcast the side
* whose estimated physical size is smaller than the threshold. If both sides are below the
* threshold, broadcast the smaller side. If neither is smaller, BNLJ is not used.
*
* - CartesianProduct: for inner like join, CartesianProduct is the fallback option.
*
* - BroadcastNestedLoopJoin (BNLJ):
* For the other join types, BNLJ is the fallback option. Here, we just pick the broadcast
* side with the broadcast hint. If neither side has a hint, we broadcast the side with
* the smaller estimated physical size.
*/
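As the scaladoc notes, the broadcast hint can also be attached through the DataFrame API. A quick way to confirm the build side with the demo tables (assuming a 2.4 session):
import org.apache.spark.sql.functions.broadcast

val small = spark.table("test.tmp_demo_small")
val big = spark.table("test.tmp_demo_big")

// DataFrame equivalent of /*+ BROADCAST(small) */ -- with the marked table on the
// right-hand side, the physical plan should show BroadcastHashJoin ... BuildRight.
big.join(broadcast(small), Seq("pas_phone"), "inner").explain()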