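Spark Source Code Analysis – Shuffle

See the article 详细探究Spark的shuffle实现 (a detailed look at Spark's shuffle implementation), which explains clearly how the current design came about.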

Hadoop

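Hadoop's approach: on the mapper side, whenever the in-memory buffer is nearly full, the buffered data is divided by partition and written out as small spill files. As the buffer keeps spilling, a large number of small files accumulates.
So everything Hadoop does from that point until reduce is essentially repeated merging, that is, a file-based multi-way merge sort. On the map side, data belonging to the same partition is merged together; on the reduce side, the data files copied from the mappers are merged for the final reduce.
The multi-way merge sort serves two purposes:
merge: collect all values of the same key into one list
sort: the final result is ordered by key
This scheme scales well and handles big data without trouble. The cost is efficiency: it needs several rounds of file-based multi-way merge sort, and therefore several rounds of disk reads and writes.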
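To make the "merge plus sort" idea concrete, here is a minimal sketch of a multi-way merge over runs that are already sorted by key. It is an illustration only, not Hadoop's code: the name `mergeRuns`, the in-memory `Seq` runs (standing in for spill files), and the `(String, Int)` record type are all assumptions made for the example.

```scala
import scala.collection.mutable

object MergeSketch {
  // Merge several runs, each already sorted by key, into one key-sorted
  // sequence in which all values of a key are grouped together.
  // Runs are in-memory stand-ins for sorted spill files (an assumption).
  def mergeRuns(runs: Seq[Seq[(String, Int)]]): Vector[(String, Vector[Int])] = {
    // Heap entry: (key, value, run index, position within the run).
    // PriorityQueue is a max-heap, so the ordering is reversed to pop the smallest key first.
    val smallestFirst =
      Ordering.by[(String, Int, Int, Int), (String, Int)](e => (e._1, e._3)).reverse
    val heap = mutable.PriorityQueue.empty[(String, Int, Int, Int)](smallestFirst)
    for ((run, i) <- runs.zipWithIndex if run.nonEmpty) {
      heap.enqueue((run.head._1, run.head._2, i, 0))
    }

    val out = Vector.newBuilder[(String, Vector[Int])]
    var currentKey: Option[String] = None
    var currentValues = Vector.empty[Int]
    while (heap.nonEmpty) {
      val (key, value, runIdx, pos) = heap.dequeue()
      if (currentKey.contains(key)) {
        currentValues = currentValues :+ value // same key: extend the group
      } else {
        currentKey.foreach(k => out += ((k, currentValues))) // emit the finished group
        currentKey = Some(key)
        currentValues = Vector(value)
      }
      // Refill the heap from the run we just consumed.
      val next = pos + 1
      if (next < runs(runIdx).length) {
        val (k, v) = runs(runIdx)(next)
        heap.enqueue((k, v, runIdx, next))
      }
    }
    currentKey.foreach(k => out += ((k, currentValues)))
    out.result()
  }
}
```

Only one record per run has to be in memory at any time, which is why this approach scales; the price is that every round reads and rewrites the data.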

Spark

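Spark's advantage is efficiency, so it skips the merge sort and thereby avoids the repeated disk reads and writes.
This comes at the cost of scalability; it is hard to have both.
Because nothing is merged later, the write side has to keep corenum * bucketnum files open at the same time, and they can only be closed once writing finishes.
On the reduce side, because no merge was done beforehand, a hash map of all keys must be kept in memory so that values can be merged and reduced as they arrive. Details follow below.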

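Write

How shuffle data gets written into blocks is determined by the logic in ShuffleMapTask.
As you can see, it uses shuffleBlockManager. Since 0.8, Spark has separated shuffleBlockManager from the regular BlockManager to make it easier to optimize.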

ShuffleMapTask

      // Obtain all the block writers for shuffle blocks.
      val ser = SparkEnv.get.serializerManager.get(dep.serializerClass)
      shuffle = blockManager.shuffleBlockManager.forShuffle(dep.shuffleId, numOutputSplits, ser) // Create the ShuffleBlocks; the arguments are the shuffleId and the number of target partitions
      buckets = shuffle.acquireWriters(partition) // Create the ShuffleWriterGroup: the target buckets of the shuffle (one per partition)

      // Write the map output to its associated buckets.
      for (elem <- rdd.iterator(split, taskContext)) { // Take each elem from the RDD
        val pair = elem.asInstanceOf[Product2[Any, Any]]
        val bucketId = dep.partitioner.getPartition(pair._1) // Partition by the pair's key to get the target bucketId
        buckets.writers(bucketId).write(pair) // Write the pair to that bucket's writer (a BlockObjectWriter)
      }

      // Commit these buckets to blocks; other RDDs locate the blocks through the shuffleId and read the data
      // Commit the writes. Get the size of each bucket block (total block size).
      var totalBytes = 0L
      val compressedSizes: Array[Byte] = buckets.writers.map { writer: BlockObjectWriter =>
        writer.commit()
        writer.close()
        val size = writer.size()
        totalBytes += size
        MapOutputTracker.compressSize(size)
      }
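The write loop above boils down to "pick a bucket from the key, append the record to that bucket". The following is a minimal sketch of that routing step, assuming a hash-style partitioner. `bucketFor` and `writeToBuckets` are hypothetical names, and in-memory builders stand in for the per-bucket disk writers.

```scala
object BucketSketch {
  // Non-negative modulo, so keys with negative hash codes still map to a valid bucket.
  // (Assumes non-null keys; a real partitioner also has to handle null.)
  def bucketFor(key: Any, numBuckets: Int): Int = {
    val raw = key.hashCode % numBuckets
    if (raw < 0) raw + numBuckets else raw
  }

  // Route every (key, value) record to its bucket. The builders play the role
  // of buckets.writers in the listing above; nothing is written to disk here.
  def writeToBuckets[K, V](records: Iterator[(K, V)], numBuckets: Int): Vector[Vector[(K, V)]] = {
    val buckets = Vector.fill(numBuckets)(Vector.newBuilder[(K, V)])
    for (pair <- records) {
      buckets(bucketFor(pair._1, numBuckets)) += pair
    }
    buckets.map(_.result())
  }
}
```

Note that every bucket needs its own open writer for the whole duration of the task, which is exactly where the file-count and buffer-memory pressure discussed in this post comes from.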

ShuffleBlockManager

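The core function of ShuffleBlockManager is forShuffle, which returns a ShuffleBlocks object.
ShuffleBlocks.acquireWriters returns a ShuffleWriterGroup, which wraps the BlockObjectWriter of every target partition.

The problem here is as follows.
Spark schedules by task, and a task corresponds to a partition.
With m partitions to be shuffled into n partitions, there are m mapper tasks and n reducer tasks.
Of course, not all mapper tasks run at once in Spark; the degree of parallelism depends on the number of cores.

1. If every mapper task produces n files, the total is n*m files, which is too many.
Spark 0.8.1 optimizes this with shuffle consolidation: multiple mapper tasks share one bucket file. How is the sharing done?
It depends on parallelism. Tasks that run concurrently cannot share a bucket file, so at least corenum * bucketnum files are produced; tasks that run later can reuse the bucket files created earlier instead of creating new ones.

2. Each file opened for writing needs a write buffer of 100 KB by default, so corenum * bucketnum * 100 KB of memory is consumed at the same time. This problem has not been solved yet.

In other words, Spark runs into a scalability problem in shuffle. Why doesn't Hadoop have the same problem?
Because Hadoop can tolerate repeated disk I/O and repeated file merges, it can write each buffer spill to a new file and merge the files later.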
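The two costs above are plain arithmetic, and a small calculation shows how quickly they grow. This is a back-of-the-envelope sketch; the object and method names are made up for the example, and the 100 KB figure is the default buffer size quoted in this post.

```scala
object ShuffleCost {
  // Default write buffer per open shuffle file, as stated above (100 KB).
  val BufferBytesPerWriter: Long = 100L * 1024

  // Every mapper task writes one file per reducer.
  def filesWithoutConsolidation(mappers: Long, reducers: Long): Long = mappers * reducers

  // With consolidation, only concurrently running tasks need their own set of files.
  def filesWithConsolidation(cores: Long, reducers: Long): Long = cores * reducers

  // All writers of the concurrently running tasks are open at the same time.
  def writerBufferBytes(cores: Long, reducers: Long): Long =
    cores * reducers * BufferBytesPerWriter
}
```

For example, with 1000 mapper tasks, 1000 reducers and 16 concurrent tasks, the file count drops from 1,000,000 to 16,000 with consolidation, but the open writers still need 1,638,400,000 bytes (roughly 1.5 GiB) of buffer memory.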

private[spark]
class ShuffleWriterGroup(val id: Int, val writers: Array[BlockObjectWriter])

private[spark]
trait ShuffleBlocks {
  def acquireWriters(mapId: Int): ShuffleWriterGroup
  def releaseWriters(group: ShuffleWriterGroup)
}

private[spark]
class ShuffleBlockManager(blockManager: BlockManager) {

  def forShuffle(shuffleId: Int, numBuckets: Int, serializer: Serializer): ShuffleBlocks = {
    new ShuffleBlocks {
      // Get a group of writers for a map task.
      override def acquireWriters(mapId: Int): ShuffleWriterGroup = {
        val bufferSize = System.getProperty("spark.shuffle.file.buffer.kb", "100").toInt * 1024
        val writers = Array.tabulate[BlockObjectWriter](numBuckets) { bucketId => // Create one writer per target partition of the shuffle
          // Note the argument order (shuffleId, bucketId, mapId); the resulting ID puts mapId first:
          // blockId = "shuffle_" + shuffleId + "_" + mapId + "_" + bucketId
          val blockId = ShuffleBlockManager.blockId(shuffleId, bucketId, mapId)
          blockManager.getDiskBlockWriter(blockId, serializer, bufferSize) // Get a DiskBlockWriter from blockManager
        }
        new ShuffleWriterGroup(mapId, writers)
      }

      override def releaseWriters(group: ShuffleWriterGroup) = {
        // Nothing really to release here.
      }
    }
  }
}
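The block ID is the only contract between the write side and the read side: both must build the same string from the same three numbers. Here is a small sketch of that naming scheme, following the "shuffle_<shuffleId>_<mapId>_<reduceId>" format shown in this post. The `parse` helper is an addition for illustration and is not taken from the Spark source.

```scala
object ShuffleBlockIdSketch {
  // Same layout as in the listings: shuffle ID, then map ID, then reduce (bucket) ID.
  def blockId(shuffleId: Int, mapId: Int, reduceId: Int): String =
    "shuffle_%d_%d_%d".format(shuffleId, mapId, reduceId)

  private val Pattern = """shuffle_(\d+)_(\d+)_(\d+)""".r

  // Hypothetical inverse, handy for checking that writer and reader agree.
  def parse(id: String): Option[(Int, Int, Int)] = id match {
    case Pattern(s, m, r) => Some((s.toInt, m.toInt, r.toInt))
    case _ => None
  }
}
```

A mapper with map ID 7 writing bucket 2 of shuffle 3 produces "shuffle_3_7_2", and the reducer for partition 2 asks for exactly that name.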

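Read

PairRDDFunctions.combineByKey

For this part, see Spark源码分析 – PairRDD.
The key point lies in the reduce-side processing (the branch without mapSideCombine is the clearest one to read).
mapPartitions is backed by MapPartitionsRDD: aggregator.combineValuesByKey is applied to each partition's iterator and folds the items into a map one at a time.
The biggest difference from Hadoop shows up here. In Hadoop, reduce receives a collection whose keys have already been merged, so each key can be reduced in one pass and the result written out directly.
Spark has no merge step, so records arrive one by one and a hash map of all keys must be kept in memory. This can become a scalability problem; Spark addresses it in PR303 with an external-sort scheme.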

    // When the RDD's own partitioner equals the one passed in, no reshuffle is needed; a plain map is enough
    if (self.partitioner == Some(partitioner)) {
      self.mapPartitions(aggregator.combineValuesByKey, preservesPartitioning = true) // 2. mapPartitions: call combineValuesByKey directly on the map side
    } else if (mapSideCombine) { // mapSideCombine is requested
      val combined = self.mapPartitions(aggregator.combineValuesByKey, preservesPartitioning = true) // First do the map-side combine within each partition
      val partitioned = new ShuffledRDD[K, C, (K, C)](combined, partitioner).setSerializer(serializerClass) // 3. ShuffledRDD: perform the shuffle
      partitioned.mapPartitions(aggregator.combineCombinersByKey, preservesPartitioning = true) // After the shuffle, combine once more on the reduce side using combineCombinersByKey
    } else {
      // Don't apply map-side combiner. The only difference from the branch above is that mapSideCombine is skipped
      val values = new ShuffledRDD[K, V, (K, V)](self, partitioner).setSerializer(serializerClass)
      values.mapPartitions(aggregator.combineValuesByKey, preservesPartitioning = true)
    }
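The reduce-side behaviour described in this section, where records arrive unsorted and are merged into an in-memory hash map, can be sketched as follows. This is a simplified stand-in written for this post, not Spark's Aggregator: it takes the `createCombiner` and `mergeValue` functions directly as parameters.

```scala
import scala.collection.mutable

object CombineSketch {
  // Fold a stream of (key, value) records into one combiner per key.
  // The whole map lives in memory until the input is exhausted, which is
  // the scalability concern raised above.
  def combineValuesByKey[K, V, C](
      records: Iterator[(K, V)],
      createCombiner: V => C,
      mergeValue: (C, V) => C): Iterator[(K, C)] = {
    val combiners = mutable.HashMap.empty[K, C]
    for ((k, v) <- records) {
      combiners(k) = combiners.get(k) match {
        case Some(c) => mergeValue(c, v) // key seen before: merge into its combiner
        case None    => createCombiner(v) // first value for this key
      }
    }
    combiners.iterator
  }
}
```

Summing values per key, for instance, uses `createCombiner = v => v` and `mergeValue = _ + _`. Nothing can be emitted before the last record has been seen, because any later record may still belong to an earlier key.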

ShuffledRDD

  override def compute(split: Partition, context: TaskContext): Iterator[P] = {
    val shuffledId = dependencies.head.asInstanceOf[ShuffleDependency[K, V]].shuffleId
    SparkEnv.get.shuffleFetcher.fetch[P](shuffledId, split.index, context.taskMetrics, // shuffleFetcher.fetch returns an iterator over the shuffled data
      SparkEnv.get.serializerManager.get(serializerClass))
  }

ShuffleFetcher

First, query mapOutputTracker (by shuffleId and reduceId) for the locations of the shuffle partitions to read.

Then obtain from blockManager an iterator that fetches all of these blocks.

private[spark] abstract class ShuffleFetcher {
  /**
   * Fetch the shuffle outputs for a given ShuffleDependency.
   * @return An iterator over the elements of the fetched shuffle outputs.
   */
  def fetch[T](shuffleId: Int, reduceId: Int, metrics: TaskMetrics,  // reduceId is the partition ID on the reduce side
      serializer: Serializer = SparkEnv.get.serializerManager.default): Iterator[T]

  /** Stop the fetcher */
  def stop() {}
}

private[spark] class BlockStoreShuffleFetcher extends ShuffleFetcher with Logging {

  override def fetch[T](shuffleId: Int, reduceId: Int, metrics: TaskMetrics, serializer: Serializer)
    : Iterator[T] =
  {
    val blockManager = SparkEnv.get.blockManager

    val statuses = SparkEnv.get.mapOutputTracker.getServerStatuses(shuffleId, reduceId) // Ask mapOutputTracker for the (BlockManagerId, size) of every map output of this shuffle for this reduceId

    val splitsByAddress = new HashMap[BlockManagerId, ArrayBuffer[(Int, Long)]]  // Several map outputs may sit on the same node and share a BlockManagerId, so group them
    for (((address, size), index) <- statuses.zipWithIndex) {  // index is the partition ID on the map side
      splitsByAddress.getOrElseUpdate(address, ArrayBuffer()) += ((index, size)) // {BlockManagerId, ((map partition ID, size), ...)}
    }
    val blocksByAddress: Seq[(BlockManagerId, Seq[(String, Long)])] = splitsByAddress.toSeq.map { // (BlockManagerId, (block ID, size))
      case (address, splits) =>
        (address, splits.map(s => ("shuffle_%d_%d_%d".format(shuffleId, s._1, reduceId), s._2))) // The block ID is determined by shuffleId, map partition ID and reduceId
    }
    val blockFetcherItr = blockManager.getMultiple(blocksByAddress, serializer) // Iterator of (block ID, value)
    val itr = blockFetcherItr.flatMap(unpackBlock) // unpackBlock unwraps (block ID, value) and takes the value, producing the iterator over the fetched data

    CompletionIterator[T, Iterator[T]](itr, { // Unlike a plain Iterator, this runs the completion logic below once iteration finishes
      val shuffleMetrics = new ShuffleReadMetrics
      shuffleMetrics.shuffleFinishTime = System.currentTimeMillis
      shuffleMetrics.remoteFetchTime = blockFetcherItr.remoteFetchTime
      shuffleMetrics.fetchWaitTime = blockFetcherItr.fetchWaitTime
      shuffleMetrics.remoteBytesRead = blockFetcherItr.remoteBytesRead
      shuffleMetrics.totalBlocksFetched = blockFetcherItr.totalBlocks
      shuffleMetrics.localBlocksFetched = blockFetcherItr.numLocalBlocks
      shuffleMetrics.remoteBlocksFetched = blockFetcherItr.numRemoteBlocks
      metrics.shuffleReadMetrics = Some(shuffleMetrics)
    })
  }

  // Note: getServerStatuses invokes this helper as MapOutputTracker.convertMapStatuses;
  // it is listed here next to fetch for convenience.
  private def convertMapStatuses(
        shuffleId: Int,
        reduceId: Int,
        statuses: Array[MapStatus]): Array[(BlockManagerId, Long)] = {
    assert (statuses != null)
    statuses.map {
      status =>
        if (status == null) {
          throw new FetchFailedException(null, shuffleId, -1, reduceId,
            new Exception("Missing an output location for shuffle " + shuffleId))
        } else {
          (status.location, decompressSize(status.compressedSizes(reduceId))) // The key conversion: keep only the size that belongs to this reduce partition, decompressed
        }
    }
  }
}
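The grouping step inside fetch is easy to try in isolation. The sketch below assumes a plain String as a stand-in for BlockManagerId and uses the hypothetical name `blocksByAddress`; it turns the per-map-task statuses into one list of block IDs per address, so that each node is contacted once.

```scala
import scala.collection.mutable

object FetchPlanSketch {
  // statuses(i) = (address holding map task i's output, size of the block for this reducer).
  // The position in the sequence is the map partition ID, as in the listing above.
  def blocksByAddress(
      shuffleId: Int,
      reduceId: Int,
      statuses: Seq[(String, Long)]): Map[String, Seq[(String, Long)]] = {
    val grouped = mutable.LinkedHashMap.empty[String, mutable.ArrayBuffer[(String, Long)]]
    for (((address, size), mapId) <- statuses.zipWithIndex) {
      val blockId = "shuffle_%d_%d_%d".format(shuffleId, mapId, reduceId)
      grouped.getOrElseUpdate(address, mutable.ArrayBuffer.empty[(String, Long)]) += ((blockId, size))
    }
    grouped.map { case (address, blocks) => address -> blocks.toList }.toMap
  }
}
```

With three map outputs on two hosts, the reducer ends up with two fetch requests instead of three, and the known sizes let the fetcher account for how much data is in flight.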

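Registering shuffle information - MapOutputTracker

One question was left open above: once the shuffle is done, how does a reducer task know where to fetch all the shuffled blocks that its partition needs?
In Hadoop this goes through the JobTracker. Mappers report their progress to the JobTracker via heartbeats, and reducers keep polling the JobTracker to learn which map output files to copy (these sit on the mappers' local disks rather than in HDFS).
In Spark, the shuffle information is registered with the MapOutputTracker instead.

MapOutputTracker

Every node may need to query shuffle information, so a MapOutputTrackerActor is used for communication.
As the logic in SparkContext shows, the Actor object is created only on the master; the slaves only create an ActorRef.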

private[spark] class MapOutputTrackerActor(tracker: MapOutputTracker) extends Actor with Logging {
  def receive = {
    case GetMapOutputStatuses(shuffleId: Int, requester: String) => // The interface offered for querying shuffle information
      logInfo("Asked to send map output locations for shuffle " + shuffleId + " to " + requester)
      sender ! tracker.getSerializedLocations(shuffleId)

    case StopMapOutputTracker =>
      logInfo("MapOutputTrackerActor stopped!")
      sender ! true
      context.stop(self)
  }
}

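Note that only the MapOutputTracker on the master holds all the latest shuffle information.
For efficiency, slaves also cache the shuffle information they get from the master, so getServerStatuses first looks in the local mapStatuses and only asks the remote master if nothing is found there.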

private[spark] class MapOutputTracker extends Logging {
  var trackerActor: ActorRef = _   // MapOutputTrackerActor
  private var mapStatuses = new TimeStampedHashMap[Int, Array[MapStatus]]  // Caches all the shuffle information

  def registerShuffle(shuffleId: Int, numMaps: Int) {  // Register a shuffle ID and initialize its Array[MapStatus]
    if (mapStatuses.putIfAbsent(shuffleId, new Array[MapStatus](numMaps)).isDefined) {
      throw new IllegalArgumentException("Shuffle ID " + shuffleId + " registered twice")
    }
  }

  def registerMapOutput(shuffleId: Int, mapId: Int, status: MapStatus) { // Register the map output information when a task completes
    var array = mapStatuses(shuffleId)
    array.synchronized {
      array(mapId) = status
    }
  }

  // Remembers which map output locations are currently being fetched on a worker
  private val fetching = new HashSet[Int]

  // Called on possibly remote nodes to get the server URIs and output sizes for a given shuffle
  def getServerStatuses(shuffleId: Int, reduceId: Int): Array[(BlockManagerId, Long)] = {
    val statuses = mapStatuses.get(shuffleId).orNull
    if (statuses == null) {  // Not present in the local mapStatuses
      logInfo("Don't have map outputs for shuffle " + shuffleId + ", fetching them")
      var fetchedStatuses: Array[MapStatus] = null
      fetching.synchronized {
        if (fetching.contains(shuffleId)) { // A fetch is already in progress, so just wait
          // Someone else is fetching it; wait for them to be done
          while (fetching.contains(shuffleId)) {
            try {
              fetching.wait()
            } catch {
              case e: InterruptedException =>
            }
          }
        }

        // Either while we waited the fetch happened successfully, or
        // someone fetched it in between the get and the fetching.synchronized.
        fetchedStatuses = mapStatuses.get(shuffleId).orNull
        if (fetchedStatuses == null) {
          // We have to do the fetch, get others to wait for us.
          fetching += shuffleId  // Still missing: add it to fetching and do the fetch ourselves
        }
      }

      if (fetchedStatuses == null) {
        // We won the race to fetch the output locs; do so
        val hostPort = Utils.localHostPort()
        // This try-finally prevents hangs due to timeouts:
        try {
          val fetchedBytes =
            askTracker(GetMapOutputStatuses(shuffleId, hostPort)).asInstanceOf[Array[Byte]] // Fetch from the remote master
          fetchedStatuses = deserializeStatuses(fetchedBytes)
          logInfo("Got the output locations")
          mapStatuses.put(shuffleId, fetchedStatuses) // Cache the result locally
        } finally {
          fetching.synchronized {
            fetching -= shuffleId
            fetching.notifyAll()
          }
        }
      }
      if (fetchedStatuses != null) {
        fetchedStatuses.synchronized {
          return MapOutputTracker.convertMapStatuses(shuffleId, reduceId, fetchedStatuses)
        }
      }
      else{
        throw new FetchFailedException(null, shuffleId, -1, reduceId,
          new Exception("Missing all output locations for shuffle " + shuffleId))
      }
    } else {  // Found locally, return directly
      statuses.synchronized {
        return MapOutputTracker.convertMapStatuses(shuffleId, reduceId, statuses)
      }
    }
  }
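The locking in getServerStatuses follows a common pattern: many threads may ask for the same key, but only one of them does the expensive remote fetch while the others wait for the result. Below is a stripped-down sketch of that pattern. `FetchOnceCache` and `fetchRemote` are hypothetical names, and unlike the listing above all cache access happens under a single lock to keep the example short.

```scala
import scala.collection.mutable

class FetchOnceCache[K, V](fetchRemote: K => V) {
  private val cache = mutable.HashMap.empty[K, V]
  private val fetching = mutable.HashSet.empty[K] // keys currently being fetched

  def get(key: K): V = {
    val cached: Option[V] = fetching.synchronized {
      // Someone else is fetching this key: wait until they are done.
      while (fetching.contains(key)) fetching.wait()
      val hit = cache.get(key)
      if (hit.isEmpty) fetching += key // we won the race; later callers will wait for us
      hit
    }
    cached match {
      case Some(v) => v
      case None =>
        try {
          val v = fetchRemote(key) // the slow call runs outside the lock
          fetching.synchronized { cache(key) = v }
          v
        } finally {
          // Always clear the marker, even on failure, so waiters do not hang.
          fetching.synchronized {
            fetching -= key
            fetching.notifyAll()
          }
        }
    }
  }
}
```

The try/finally plays the same role as in the original: if the fetch throws, the waiting threads are still woken up and can retry instead of blocking forever.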

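All registration is done by the DAGScheduler on the master.
Spark identifies each shuffle by a shuffleId. Unlike Hadoop, a single job may contain several shuffles, so a job ID cannot be used for this.
Registration happens in two steps:
1. When a new stage is created, the shuffleId is registered. Apart from the final result stage, a new stage is always created because a shuffleDep was encountered.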

  private def newStage(
      rdd: RDD[_],
      shuffleDep: Option[ShuffleDependency[_,_]],
      jobId: Int,
      callSite: Option[String] = None)
    : Stage =
  {
    if (shuffleDep != None) {
      // Kind of ugly: need to register RDDs with the cache and map output tracker here
      // since we can't do it in the RDD constructor because # of partitions is unknown
      mapOutputTracker.registerShuffle(shuffleDep.get.shuffleId, rdd.partitions.size) // Register the shuffleId and the number of partitions of the map-side RDD
    }

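2. When a TaskCompletion event is handled and a ShuffleMapTask has finished, i.e. its map output exists, its MapStatus(BlockManagerId, compressedSizes) can be registered.
Together with the shuffleId, the BlockManagerId, the map partition ID and the reduce ID are enough to determine the block ID and thus to read the data.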

  private def handleTaskCompletion(event: CompletionEvent) {
    event.reason match {
      case Success =>
        task match {
          case rt: ResultTask[_, _] =>
          case smt: ShuffleMapTask =>
            val status = event.result.asInstanceOf[MapStatus] // ShuffleMapTask.run itself returns a MapStatus, so a cast is all that is needed here
            val execId = status.location.executorId  // class MapStatus(var location: BlockManagerId, var compressedSizes: Array[Byte])
            if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
              logInfo("Ignoring possibly bogus ShuffleMapTask completion from " + execId)
            } else {
              stage.addOutputLoc(smt.partition, status) // Record the MapStatus in the stage's outputLocs
            }
              // (excerpt: the listing continues at a deeper nesting level of the original method)
              if (stage.shuffleDep != None) {
                // We supply true to increment the epoch number here in case this is a
                // recomputation of the map outputs. In that case, some nodes may have cached
                // locations with holes (from when we detected the error) and will need the
                // epoch incremented to refetch them.
                // TODO: Only increment the epoch number if this is not the first time
                //       we registered these map outputs.
                mapOutputTracker.registerMapOutputs(  // Register with mapStatuses in mapOutputTracker
                  stage.shuffleDep.get.shuffleId,
                  stage.outputLocs.map(list => if (list.isEmpty) null else list.head).toArray,
                  changeEpoch = true)
              }
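A MapStatus carries compressedSizes as an Array[Byte], one byte per reduce partition, so each block size has to be squeezed into a single byte by compressSize and expanded again by decompressSize. The sketch below shows one way to do that with a logarithmic encoding. The base 1.1 and the exact rounding are assumptions modeled on my recollection of Spark's MapOutputTracker, so treat it as an illustration rather than the actual implementation.

```scala
object SizeCodecSketch {
  // Assumed base of the logarithmic encoding; a smaller base means finer
  // resolution but a smaller maximum representable size.
  private val LogBase = 1.1

  // Encode a size as ceil(log_1.1(size)), capped at 255 so it fits in one byte.
  def compressSize(size: Long): Byte = {
    if (size == 0L) 0.toByte
    else if (size <= 1L) 1.toByte
    else math.min(255, math.ceil(math.log(size.toDouble) / math.log(LogBase)).toInt).toByte
  }

  // Decode by raising the base to the stored exponent; the byte is read as unsigned.
  def decompressSize(compressed: Byte): Long = {
    if (compressed == 0) 0L
    else math.pow(LogBase, (compressed & 0xFF).toDouble).toLong
  }
}
```

The encoding is lossy: the decoded value is an estimate that overshoots the real size by at most about 10% as long as the cap is not hit, which is good enough for deciding how much data to request from each node at a time.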
