(1)发生时机:
与基于Tuple的key操作相关,包括 reduceByKey / groupByKey / sortByKey / countByKey / join / cogroup 等算子
(2)特点:
[1] 在Spark早期版本中,bucket缓存非常重要:需要将一个ShuffleMapTask的所有数据全部写入内存之后,才会刷新到磁盘。但这也引发了一个问题:若Map一端数据过多,很容易造成内存溢出(OOM)。所以在之后的版本中进行了优化:默认bucket缓存是100KB,当已写入的数据达到阈值后,就将内存中的数据一点点地刷新到磁盘,避免了容易发生的OOM;但若缓存设置过小,又会引发过多的磁盘I/O操作(参数设置可参考下方的示意代码)。
[2] 与MapReduce完全不一样的是:MapReduce必须在Map阶段将所有数据写入本地磁盘文件后,才能启动Reduce来拉取数据,因为MR要实现默认的按Key排序,只有先得到所有数据才能排序。Spark默认情况下不会对数据排序,因此ShuffleMapTask每写入一点数据,ResultTask就可以拉取一点数据,然后在本地执行自定义的聚合函数和算子计算。优点是速度快;缺点是数据不像MR那样天然按Key排好序,使用上不如MR的计算模型方便。
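下面是一个最小的示意代码(假设已有可用的集群环境,输入路径为演示用占位,参数值仅为示意):reduceByKey这类基于key的算子会产生ShuffleDependency从而触发Shuffle;同时展示了如何调整上文提到的map端shuffle写缓冲(参数名与默认值随版本有变化,较早版本为 spark.shuffle.file.buffer.kb)。

```scala
import org.apache.spark.{SparkConf, SparkContext}

// 示意:调整 map 端 shuffle 写缓冲大小(值仅为演示,非推荐配置)
val conf = new SparkConf()
  .setAppName("shuffle-demo")
  .set("spark.shuffle.file.buffer", "64k")
val sc = new SparkContext(conf)

// reduceByKey 等基于 key 的算子会触发 Shuffle,map 端先做本地聚合,再按 key 分发到 reduce 端
val wordCounts = sc.textFile("hdfs:///tmp/input") // 路径为演示用占位
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

wordCounts.take(10).foreach(println)
```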
(3)默认的Shuffle:每个ShuffleMapTask都会为每个ResultTask单独创建一个本地文件,文件总数为 ShuffleMapTask数量 × ResultTask数量,Task较多时会在本地磁盘上产生大量小文件。
(4)优化后的Shuffle
在Spark较新版本中引入了consolidation机制,即ShuffleGroup的概念:第一批ShuffleMapTask将数据写入n个本地文件(n = ResultTask数量 × CPU core数量),之后批次的ShuffleMapTask不再新建文件,而是直接将数据写入之前已创建的对应本地文件中,相当于多个ShuffleMapTask的输出在一组共享文件中合并,从而减少本地磁盘上的文件数量(文件数对比见下方的示意估算)。
【防裂说明】假设:当前节点有2个CPU core,运行了4个ShuffleMapTask,==因此只能同时并行其中的2个(即分成2批、每批2个来运行)==
<1>并行的ShuffleMapTask,写入的文件一定是不同的:当一批并行ShuffleMapTask完成后,新一批的ShuffleMapTask启动并行时,使用consolidation机制复用上一批内存缓存和文件。
<2>一个ShuffleGroup中的每个文件,都存储了多个ShuffleMapTask的数据,每个ShuffleMapTask的数据称为一个Segment。此外,还会通过一些索引来标记每个ShuffleMapTask的输出在ShuffleBlockFile中的索引及偏移量,以区分不同ShuffleMapTask的数据。
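按上面的假设(2个CPU core、4个ShuffleMapTask),再假设下游有3个ResultTask,可以用下面这段小代码粗略估算consolidation前后的本地shuffle文件数(纯示意的算术,不是Spark API):

```scala
// 示意性估算:consolidation 前后本地 shuffle 文件数(均为假设值)
val numMapTasks    = 4 // 本节点上运行的 ShuffleMapTask 总数
val numCores       = 2 // 可同时并行的 CPU core 数
val numResultTasks = 3 // 下游 ResultTask 数量(假设)

// 优化前:每个 ShuffleMapTask 各写一组文件 -> 4 * 3 = 12 个文件
val filesWithoutConsolidation = numMapTasks * numResultTasks
// 优化后:每个 core 上的多批 ShuffleMapTask 复用同一组文件 -> 2 * 3 = 6 个文件
val filesWithConsolidation = numCores * numResultTasks

println(s"优化前 $filesWithoutConsolidation 个文件,优化后 $filesWithConsolidation 个文件")
```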
(5)相关源码:
<1> ShuffleWrite相关:HashShuffleWriter / FileShuffleBlockManager 【注:源码Spark2.3 与Spark1.6完全不同】
<2> ShuffleRead入口:ShuffledRDD.compute,具体拉取逻辑在HashShuffleReader / ShuffleBlockFetcherIterator中【注:源码Spark2.3与Spark1.6完全不同】
<3>调优参数:spark.reducer.maxSizeInFlight(对应代码中的maxBytesInFlight,控制reduce端同时“在途”拉取的数据量上限)
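下面给出一个设置该调优参数的最小示意(假设在构造SparkConf时设置,数值仅为演示;默认值随版本可能不同):

```scala
import org.apache.spark.SparkConf

// 示意:调大 reduce 端单次“在途”拉取数据量的上限(默认约 48m,此处仅为演示值)
val conf = new SparkConf()
  .set("spark.reducer.maxSizeInFlight", "96m")
```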
override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
  SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
    .read()
    .asInstanceOf[Iterator[(K, C)]]
}
/**
 * 拉取的关键组件ShuffleBlockFetcherIterator,开始拉取ResultTask对应的多份数据
 */
private[this] def initialize(): Unit = {
  // Add a task completion callback (called in both success case and failure case) to cleanup.
  context.addTaskCompletionListener(_ => cleanup())
  // Split local and remote blocks.
  val remoteRequests = splitLocalRemoteBlocks()
  // Add the remote requests into our queue in a random order
  fetchRequests ++= Utils.randomize(remoteRequests)
  assert((0 == reqsInFlight) == (0 == bytesInFlight),
    "expected reqsInFlight = 0 but found reqsInFlight = " + reqsInFlight +
    ", expected bytesInFlight = 0 but found bytesInFlight = " + bytesInFlight)
  // Send out initial requests for blocks, up to our maxBytesInFlight
  fetchUpToMaxBytes()
  val numFetches = remoteRequests.size - fetchRequests.size
  logInfo("Started " + numFetches + " remote fetches in" + Utils.getUsedTimeMs(startTime))
  // Get Local Blocks 即数据本地化
  fetchLocalBlocks()
  logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
}
/**
* 循环往复,只要发现还有数据没有拉取完,就发送请求到远程去拉取数据
*/
private def fetchUpToMaxBytes(): Unit = {
  // Send fetch requests up to maxBytesInFlight(由 spark.reducer.maxSizeInFlight 调优参数控制).
  // If you cannot fetch from a remote host immediately, defer the request until the next time
  // it can be processed.
  // Process any outstanding deferred fetch requests if possible.
  if (deferredFetchRequests.nonEmpty) {
    for ((remoteAddress, defReqQueue) <- deferredFetchRequests) {
      while (isRemoteBlockFetchable(defReqQueue) &&
          !isRemoteAddressMaxedOut(remoteAddress, defReqQueue.front)) {
        val request = defReqQueue.dequeue()
        logDebug(s"Processing deferred fetch request for $remoteAddress with "
          + s"${request.blocks.length} blocks")
        send(remoteAddress, request)
        if (defReqQueue.isEmpty) {
          deferredFetchRequests -= remoteAddress
        }
      }
    }
  }

  // Process any regular fetch requests if possible.
  while (isRemoteBlockFetchable(fetchRequests)) {
    val request = fetchRequests.dequeue()
    val remoteAddress = request.address
    if (isRemoteAddressMaxedOut(remoteAddress, request)) {
      logDebug(s"Deferring fetch request for $remoteAddress with ${request.blocks.size} blocks")
      val defReqQueue = deferredFetchRequests.getOrElse(remoteAddress, new Queue[FetchRequest]())
      defReqQueue.enqueue(request)
      deferredFetchRequests(remoteAddress) = defReqQueue
    } else {
      send(remoteAddress, request) // 复杂的实现,去远程获取数据
    }
  }
}
【防裂说明】每个节点上都有BlockManager,包括几个关键组件:
组件 | 功能 |
---|---|
DiskStore | 负责对磁盘上的数据进行读写 |
MemoryStore | 负责对内存中的数据进行读写 |
BlockTransferService | 负责对远程其他节点的BlockManager管理的数据读写 |
ConnectionManager | 负责建立当前BlockManager到远程其他节点的BlockManager的网络连接 |
(1)BlockManagerMaster负责管理各个节点上的BlockManager“内部管理的数据”的元数据,比如增删改的变更操作都会维护在这里。
(2)每个BlockManager创建初始化之后,首先会向BlockManagerMaster进行注册,此时BlockManagerMaster会为其创建对应的BlockManagerInfo。
(3)使用BlockManager“写”操作时,例如RDD运行过程中的一些数据,或手动指定了persist(),优先会将数据写入内存中;当内存不够时,会使用自己的算法将内存中的部分数据写入磁盘。此外,如果persist()指定了replica,那么会使用BlockTransferService将数据replicate一份到其他节点的BlockManager上。
(4)使用BlockManager“读”操作时,例如ShuffleRead,如果能从本地读取数据,就利用DiskStore或MemoryStore从本地读取;如果本地没有数据,则会使用ConnectionManager向数据所在节点的BlockManager建立连接,然后使用BlockTransferService从远程BlockManager读取数据。
(5)只要使用了BlockManager进行增删改操作,就必须将Block的BlockStatus上报至BlockManagerMaster,修改其内部对应的BlockManagerInfo的BlockStatus元数据。
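结合上面的读写流程,下面给一个最小示意(假设已有SparkContext `sc`,输入路径为演示用占位):指定带副本的StorageLevel后,BlockManager在本地写入Block之后,还会通过BlockTransferService把Block复制到其他节点的BlockManager。

```scala
import org.apache.spark.storage.StorageLevel

// 假设 sc 已创建;路径仅为演示用占位
val rdd = sc.textFile("hdfs:///tmp/input").map(line => (line.length, line))

// MEMORY_AND_DISK_2 中的 "_2" 表示 2 份副本:
// 本地 BlockManager 写入后,会把 Block 再 replicate 到另一个节点的 BlockManager
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)

rdd.count() // 第一次触发计算,Block 写入 BlockManager,并上报 BlockStatus 给 BlockManagerMaster
rdd.count() // 第二次直接从本地(必要时远程)BlockManager 读取已缓存的 Block
```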
【源码】
(1)BlockManagerMaster.scala中只是定义了一些操作入口,BlockManagerMasterEndpoint才是真正起作用的组件
/**
* BlockManagerMasterEndpoint is an [[ThreadSafeRpcEndpoint]] on the master node to track statuses
* of all slaves' block managers.
*/
private[spark]
class BlockManagerMasterEndpoint(
    override val rpcEnv: RpcEnv,
val isLocal: Boolean,
conf: SparkConf,
listenerBus: LiveListenerBus)
extends ThreadSafeRpcEndpoint with Logging {
// Mapping from block manager id to the block manager's information.
// BlockManagerMaster要负责维护每个BlockManager的BlockManagerInfo
private val blockManagerInfo = new mutable.HashMap[BlockManagerId, BlockManagerInfo]
// Mapping from executor ID to block manager ID.
// 每个Executor与一个BlockManager相关联
private val blockManagerIdByExecutor = new mutable.HashMap[String, BlockManagerId]
// Mapping from block id to the set of block managers that have the block.
private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]
private val askThreadPool = ThreadUtils.newDaemonCachedThreadPool("block-manager-ask-thread-pool")
// ... ...
/**
* 注册BlockManager
* Returns the BlockManagerId with topology information populated, if available.
*/
private def register(
    idWithoutTopologyInfo: BlockManagerId,
    maxOnHeapMemSize: Long,
    maxOffHeapMemSize: Long,
    slaveEndpoint: RpcEndpointRef): BlockManagerId = {
  // the dummy id is not expected to contain the topology information.
  // we get that info here and respond back with a more fleshed out block manager id
  val id = BlockManagerId(
    idWithoutTopologyInfo.executorId,
    idWithoutTopologyInfo.host,
    idWithoutTopologyInfo.port,
    topologyMapper.getTopologyForHost(idWithoutTopologyInfo.host))
  val time = System.currentTimeMillis()
  // 检查已注册的元数据,没有则新注册
  if (!blockManagerInfo.contains(id)) {
    // 根据BlockManager对应的ExecutorId找到对应的BlockManagerInfo
    // 安全判断:如果blockManagerInfo map里没有该BlockManagerId,则同步的blockManagerIdByExecutor map里也必须没有
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(oldId) =>
        // A block manager of the same executor already exists, so remove it (assumed dead)
        logError("Got two different block manager registrations on same executor - "
          + s" will replace old one $oldId with new one $id")
        removeExecutor(id.executorId) // blockManagerIdByExecutor.get(execId).foreach(removeBlockManager)
      case None =>
    }
    logInfo("Registering block manager %s with %s RAM, %s".format(
      id.hostPort, Utils.bytesToString(maxOnHeapMemSize + maxOffHeapMemSize), id))
    // 往blockManagerIdByExecutor map中保存一份executorId到blockManagerId的映射
    blockManagerIdByExecutor(id.executorId) = id
    // 为blockManagerId创建一份BlockManagerInfo,并往blockManagerInfo map中保存一份ID -> INFO的映射
    blockManagerInfo(id) = new BlockManagerInfo(
      id, System.currentTimeMillis(), maxOnHeapMemSize, maxOffHeapMemSize, slaveEndpoint)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(time, id, maxOnHeapMemSize + maxOffHeapMemSize,
    Some(maxOnHeapMemSize), Some(maxOffHeapMemSize)))
  id
}
private def removeBlockManager(blockManagerId: BlockManagerId) {
  // 尝试根据blockManagerId获取对应的BlockManagerInfo
  val info = blockManagerInfo(blockManagerId)
  // Remove the block manager from blockManagerIdByExecutor.
  blockManagerIdByExecutor -= blockManagerId.executorId
  // Remove it from blockManagerInfo and remove all the blocks.
  blockManagerInfo.remove(blockManagerId)
  val iterator = info.blocks.keySet.iterator
  // 遍历BlockManagerInfo内部所有Block对应的BlockId
  while (iterator.hasNext) {
    // 清空BlockManagerInfo内部的Block的BlockStatus
    val blockId = iterator.next
    val locations = blockLocations.get(blockId)
    locations -= blockManagerId
    // De-register the block if none of the block managers have it. Otherwise, if pro-active
    // replication is enabled, and a block is either an RDD or a test block (the latter is used
    // for unit testing), we send a message to a randomly chosen executor location to replicate
    // the given block. Note that we ignore other block types (such as broadcast/shuffle blocks
    // etc.) as replication doesn't make much sense in that context.
    if (locations.size == 0) {
      blockLocations.remove(blockId)
      logWarning(s"No more replicas available for $blockId !")
    } else if (proactivelyReplicate && (blockId.isRDD || blockId.isInstanceOf[TestBlockId])) {
      // As a heuristic, assume single executor failure to find out the number of replicas that
      // existed before failure
      val maxReplicas = locations.size + 1
      val i = (new Random(blockId.hashCode)).nextInt(locations.size)
      val blockLocations = locations.toSeq
      val candidateBMId = blockLocations(i)
      blockManagerInfo.get(candidateBMId).foreach { bm =>
        val remainingLocations = locations.toSeq.filter(bm => bm != candidateBMId)
        val replicateMsg = ReplicateBlock(blockId, remainingLocations, maxReplicas)
        bm.slaveEndpoint.ask[Boolean](replicateMsg)
      }
    }
  }
  listenerBus.post(SparkListenerBlockManagerRemoved(System.currentTimeMillis(), blockManagerId))
  logInfo(s"Removing block manager $blockManagerId")
}
/**
 * 更新BlockInfo:每个BlockManager上,如果Block发生变化,都要发送updateBlockInfo请求至BlockManagerMaster进行更新
*/
private def updateBlockInfo(
    blockManagerId: BlockManagerId,
    blockId: BlockId,
    storageLevel: StorageLevel,
    memSize: Long,
    diskSize: Long): Boolean = {
  if (!blockManagerInfo.contains(blockManagerId)) {
    if (blockManagerId.isDriver && !isLocal) {
      // We intentionally do not register the master (except in local mode),
      // so we should not indicate failure.
      return true
    } else {
      return false
    }
  }
  if (blockId == null) {
    blockManagerInfo(blockManagerId).updateLastSeenMs()
    return true
  }
  blockManagerInfo(blockManagerId).updateBlockInfo(blockId, storageLevel, memSize, diskSize)
  // 每一个Block可能会在多个BlockManager上
  // 因为如果StorageLevel设置为_2这种级别,就需要将Block复制一份副本,放到其他BlockManager上
  // blockLocations map 保存了每个blockId对应的BlockManagerId的Set(自动去重的多个)
  var locations: mutable.HashSet[BlockManagerId] = null
  if (blockLocations.containsKey(blockId)) {
    locations = blockLocations.get(blockId)
  } else {
    locations = new mutable.HashSet[BlockManagerId]
    blockLocations.put(blockId, locations)
  }
  if (storageLevel.isValid) {
    locations.add(blockManagerId)
  } else {
    locations.remove(blockManagerId)
  }
  // Remove the block from master tracking if it has been removed on all slaves.
  if (locations.size == 0) {
    blockLocations.remove(blockId)
  }
  true
}
// ... ...
}
/**
* 每一个BlockManager的元数据结构BlockManagerInfo
*/
private[spark] class BlockManagerInfo(
val blockManagerId: BlockManagerId,
timeMs: Long,
val maxOnHeapMem: Long,
val maxOffHeapMem: Long,
val slaveEndpoint: RpcEndpointRef)
extends Logging {
val maxMem = maxOnHeapMem + maxOffHeapMem
  private var _lastSeenMs: Long = timeMs
  private var _remainingMem: Long = maxMem
// Mapping from block id to its status.
private val _blocks = new JHashMap[BlockId, BlockStatus]
// Cached blocks held by this BlockManager. This does not include broadcast blocks.
private val _cachedBlocks = new mutable.HashSet[BlockId]
}
@DeveloperApi
case class BlockStatus(storageLevel: StorageLevel, memSize: Long, diskSize: Long) {
def isCached: Boolean = memSize + diskSize > 0
}
def updateBlockInfo(
    blockId: BlockId,
    storageLevel: StorageLevel,
    memSize: Long,
    diskSize: Long) {
  updateLastSeenMs()
  val blockExists = _blocks.containsKey(blockId)
  var originalMemSize: Long = 0
  var originalDiskSize: Long = 0
  var originalLevel: StorageLevel = StorageLevel.NONE
  if (blockExists) {
    // The block exists on the slave already.
    val blockStatus: BlockStatus = _blocks.get(blockId)
    originalLevel = blockStatus.storageLevel
    originalMemSize = blockStatus.memSize
    originalDiskSize = blockStatus.diskSize
    // 判断如果storageLevel是基于内存,那么就给剩余内存数量加上当前的内存
    if (originalLevel.useMemory) {
      _remainingMem += originalMemSize
    }
  }
  if (storageLevel.isValid) {
    /* isValid means it is either stored in-memory or on-disk.
     * The memSize here indicates the data size in or dropped from memory,
     * externalBlockStoreSize here indicates the data size in or dropped from externalBlockStore,
     * and the diskSize here indicates the data size in or dropped to disk.
     * They can be both larger than 0, when a block is dropped from memory to disk.
     * Therefore, a safe way to set BlockStatus is to set its info in accurate modes. */
    var blockStatus: BlockStatus = null
    if (storageLevel.useMemory) {
      blockStatus = BlockStatus(storageLevel, memSize = memSize, diskSize = 0)
      _blocks.put(blockId, blockStatus)
      _remainingMem -= memSize
      if (blockExists) {
        logInfo(s"Updated $blockId in memory on ${blockManagerId.hostPort}" +
          s" (current size: ${Utils.bytesToString(memSize)}," +
          s" original size: ${Utils.bytesToString(originalMemSize)}," +
          s" free: ${Utils.bytesToString(_remainingMem)})")
      } else {
        logInfo(s"Added $blockId in memory on ${blockManagerId.hostPort}" +
          s" (size: ${Utils.bytesToString(memSize)}," +
          s" free: ${Utils.bytesToString(_remainingMem)})")
      }
    }
    if (storageLevel.useDisk) {
      blockStatus = BlockStatus(storageLevel, memSize = 0, diskSize = diskSize)
      _blocks.put(blockId, blockStatus)
      if (blockExists) {
        logInfo(s"Updated $blockId on disk on ${blockManagerId.hostPort}" +
          s" (current size: ${Utils.bytesToString(diskSize)}," +
          s" original size: ${Utils.bytesToString(originalDiskSize)})")
      } else {
        logInfo(s"Added $blockId on disk on ${blockManagerId.hostPort}" +
          s" (size: ${Utils.bytesToString(diskSize)})")
      }
    }
    if (!blockId.isBroadcast && blockStatus.isCached) {
      _cachedBlocks += blockId
    }
  } else if (blockExists) {
    // If isValid is not true, drop the block.
    _blocks.remove(blockId)
    _cachedBlocks -= blockId
    if (originalLevel.useMemory) {
      logInfo(s"Removed $blockId on ${blockManagerId.hostPort} in memory" +
        s" (size: ${Utils.bytesToString(originalMemSize)}," +
        s" free: ${Utils.bytesToString(_remainingMem)})")
    }
    if (originalLevel.useDisk) {
      logInfo(s"Removed $blockId on ${blockManagerId.hostPort} on disk" +
        s" (size: ${Utils.bytesToString(originalDiskSize)})")
    }
  }
}
/**
* Manager running on every node (driver and executors) which provides interfaces for putting and
* retrieving blocks both locally and remotely into various stores (memory, disk, and off-heap).
*
* Note that [[initialize()]] must be called before the BlockManager is usable.
*/
private[spark] class BlockManager(
executorId: String,
rpcEnv: RpcEnv,
val master: BlockManagerMaster,
val serializerManager: SerializerManager,
val conf: SparkConf,
memoryManager: MemoryManager,
mapOutputTracker: MapOutputTracker,
shuffleManager: ShuffleManager,
val blockTransferService: BlockTransferService,
securityManager: SecurityManager,
numUsableCores: Int)
extends BlockDataManager with BlockEvictionHandler with Logging {
private[spark] val externalShuffleServiceEnabled =
conf.getBoolean("spark.shuffle.service.enabled", false)
val diskBlockManager = {
  // Only perform cleanup if an external service is not serving our shuffle files.
  val deleteFilesOnStop =
    !externalShuffleServiceEnabled || executorId == SparkContext.DRIVER_IDENTIFIER
  new DiskBlockManager(conf, deleteFilesOnStop)
}
// Visible for testing
private[storage] val blockInfoManager = new BlockInfoManager
private val futureExecutionContext = ExecutionContext.fromExecutorService(
  ThreadUtils.newDaemonCachedThreadPool("block-manager-future", 128))
// Actual storage of where blocks are kept
private[spark] val memoryStore =
  new MemoryStore(conf, blockInfoManager, serializerManager, memoryManager, this)
private[spark] val diskStore = new DiskStore(conf, diskBlockManager, securityManager)
memoryManager.setMemoryStore(memoryStore)
// ... ...
/**
* Initializes the BlockManager with the given appId. This is not performed in the constructor as
* the appId may not be known at BlockManager instantiation time (in particular for the driver,
* where it is only learned after registration with the TaskScheduler).
*
* This method initializes the BlockTransferService and ShuffleClient, registers with the
* BlockManagerMaster, starts the BlockManagerWorker endpoint, and registers with a local shuffle
* service if configured.
*/
def initialize(appId: String): Unit = {
  // 首先初始化用于远程数据传输的BlockTransferService
  blockTransferService.init(this)
  shuffleClient.init(appId)
  blockReplicationPolicy = {
    val priorityClass = conf.get(
      "spark.storage.replication.policy", classOf[RandomBlockReplicationPolicy].getName)
    val clazz = Utils.classForName(priorityClass)
    val ret = clazz.newInstance.asInstanceOf[BlockReplicationPolicy]
    logInfo(s"Using $priorityClass for block replication policy")
    ret
  }
  // 为当前这个BlockManager创建唯一的BlockManagerId
  // 从BlockManagerId的初始化即可看出,一个BlockManager是通过一个节点上的一个Executor来唯一标识的
  val id =
    BlockManagerId(executorId, blockTransferService.hostName, blockTransferService.port, None)
  // 发送BlockManager的注册消息
  val idFromMaster = master.registerBlockManager(
    id,
    maxOnHeapMemory,
    maxOffHeapMemory,
    slaveEndpoint)
  blockManagerId = if (idFromMaster != null) idFromMaster else id
  shuffleServerId = if (externalShuffleServiceEnabled) {
    logInfo(s"external shuffle service port = $externalShuffleServicePort")
    BlockManagerId(executorId, blockTransferService.hostName, externalShuffleServicePort)
  } else {
    blockManagerId
  }
  // Register Executors' configuration with the local shuffle service, if one should exist.
  if (externalShuffleServiceEnabled && !blockManagerId.isDriver) {
    registerWithExternalShuffleServer()
  }
  logInfo(s"Initialized BlockManager: $blockManagerId")
}
/**
* Get block from the local block manager as serialized bytes.
*
* Must be called while holding a read lock on the block.
* Releases the read lock upon exception; keeps the read lock upon successful return.
*/
private def doGetLocalBytes(blockId: BlockId, info: BlockInfo): BlockData = {
  val level = info.level
  logDebug(s"Level for block $blockId is $level")
  // In order, try to read the serialized bytes from memory, then from disk, then fall back to
  // serializing in-memory objects, and, finally, throw an exception if the block does not exist.
  if (level.deserialized) {
    // Try to avoid expensive serialization by reading a pre-serialized copy from disk:
    if (level.useDisk && diskStore.contains(blockId)) {
      // Note: we purposely do not try to put the block back into memory here. Since this branch
      // handles deserialized blocks, this block may only be cached in memory as objects, not
      // serialized bytes. Because the caller only requested bytes, it doesn't make sense to
      // cache the block's deserialized objects since that caching may not have a payoff.
      // DiskStore底层使用Java NIO进行读写操作
      diskStore.getBytes(blockId)
    } else if (level.useMemory && memoryStore.contains(blockId)) {
      // The block was not found on disk, so serialize an in-memory copy:
      // 关键:MemoryStore中用entries维护Block在内存中的数据,即
      // private val entries = new LinkedHashMap[BlockId, MemoryEntry[_]](32, 0.75f, true),
      // 其getBytes / getValues 会对entries做多线程并发访问的同步
      new ByteBufferBlockData(serializerManager.dataSerializeWithExplicitClassTag(
        blockId, memoryStore.getValues(blockId).get, info.classTag), true)
    } else {
      handleLocalReadFailure(blockId)
    }
  } else { // storage level is serialized
    if (level.useMemory && memoryStore.contains(blockId)) {
      new ByteBufferBlockData(memoryStore.getBytes(blockId).get, false)
    } else if (level.useDisk && diskStore.contains(blockId)) {
      val diskData = diskStore.getBytes(blockId)
      maybeCacheDiskBytesInMemory(info, blockId, level, diskData)
        .map(new ByteBufferBlockData(_, false))
        .getOrElse(diskData)
    } else {
      handleLocalReadFailure(blockId)
    }
  }
}
/**
* Get block from remote block managers as serialized bytes.
*/
def getRemoteBytes(blockId: BlockId): Option[ChunkedByteBuffer] = {
  logDebug(s"Getting remote block $blockId")
  require(blockId != null, "BlockId is null")
  var runningFailureCount = 0
  var totalFailureCount = 0
  // Because all the remote blocks are registered in driver, it is not necessary to ask
  // all the slave executors to get block status.
  val locationsAndStatus = master.getLocationsAndStatus(blockId)
  val blockSize = locationsAndStatus.map { b =>
    b.status.diskSize.max(b.status.memSize)
  }.getOrElse(0L)
  val blockLocations = locationsAndStatus.map(_.locations).getOrElse(Seq.empty)
  // If the block size is above the threshold, we should pass our FileManager to
  // BlockTransferService, which will leverage it to spill the block; if not, then passed-in
  // null value means the block will be persisted in memory.
  val tempFileManager = if (blockSize > maxRemoteBlockToMem) {
    remoteBlockTempFileManager
  } else {
    null
  }
  val locations = sortLocations(blockLocations)
  val maxFetchFailures = locations.size
  var locationIterator = locations.iterator
  while (locationIterator.hasNext) {
    val loc = locationIterator.next()
    logDebug(s"Getting remote block $blockId from $loc")
    val data = try {
      blockTransferService.fetchBlockSync(
        loc.host, loc.port, loc.executorId, blockId.toString, tempFileManager).nioByteBuffer()
    } catch {
      case NonFatal(e) =>
        runningFailureCount += 1
        totalFailureCount += 1
        if (totalFailureCount >= maxFetchFailures) {
          // Give up trying anymore locations. Either we've tried all of the original locations,
          // or we've refreshed the list of locations from the master, and have still
          // hit failures after trying locations from the refreshed list.
          logWarning(s"Failed to fetch block after $totalFailureCount fetch failures. " +
            s"Most recent failure cause:", e)
          return None
        }
        logWarning(s"Failed to fetch remote block $blockId " +
          s"from $loc (failed attempt $runningFailureCount)", e)
        // If there is a large number of executors then locations list can contain a
        // large number of stale entries causing a large number of retries that may
        // take a significant amount of time. To get rid of these stale entries
        // we refresh the block locations after a certain number of fetch failures
        if (runningFailureCount >= maxFailuresBeforeLocationRefresh) {
          locationIterator = sortLocations(master.getLocations(blockId)).iterator
          logDebug(s"Refreshed locations from the driver " +
            s"after ${runningFailureCount} fetch failures.")
          runningFailureCount = 0
        }
        // This location failed, so we retry fetch from a different one by returning null here
        null
    }
    if (data != null) {
      return Some(new ChunkedByteBuffer(data))
    }
    logDebug(s"The value of block $blockId is null")
  }
  logDebug(s"Block $blockId not found")
  None
}
/**
* Put the given bytes according to the given level in one of the block stores, replicating
* the values if necessary.
*
* If the block already exists, this method will not overwrite it.
*
* '''Important!''' Callers must not mutate or release the data buffer underlying `bytes`. Doing
* so may corrupt or change the data stored by the `BlockManager`.
*
* @param keepReadLock if true, this method will hold the read lock when it returns (even if the
* block already exists). If false, this method will hold no locks when it
* returns.
* @return true if the block was already present or if the put succeeded, false otherwise.
*/
private def doPutBytes[T](
blockId: BlockId,
bytes: ChunkedByteBuffer,
level: StorageLevel,
classTag: ClassTag[T],
tellMaster: Boolean = true,
keepReadLock: Boolean = false): Boolean = {
doPut(blockId, level, classTag, tellMaster = tellMaster, keepReadLock = keepReadLock) { info =>
val startTimeMs = System.currentTimeMillis
// Since we're storing bytes, initiate the replication before storing them locally.
// This is faster as data is already serialized and ready to send.
    val replicationFuture = if (level.replication > 1) {
Future {
// This is a blocking action and should run in futureExecutionContext which is a cached
// thread pool. The ByteBufferBlockData wrapper is not disposed of to avoid releasing
// buffers that are owned by the caller.
replicate(blockId, new ByteBufferBlockData(bytes, false), level, classTag)
}(futureExecutionContext)
} else {
null
}
val size = bytes.size
if (level.useMemory) {
// Put it in memory first, even if it also has useDisk set to true;
// We will drop it to disk later if the memory store can't hold it.
val putSucceeded = if (level.deserialized) {
val values =
serializerManager.dataDeserializeStream(blockId, bytes.toInputStream())(classTag)
memoryStore.putIteratorAsValues(blockId, values, classTag) match {
case Right(_) => true
case Left(iter) =>
// If putting deserialized values in memory failed, we will put the bytes directly to
// disk, so we don't need this iterator and can close it to free resources earlier.
iter.close()
false
}
} else {
val memoryMode = level.memoryMode
memoryStore.putBytes(blockId, size, memoryMode, () => {
if (memoryMode == MemoryMode.OFF_HEAP &&
bytes.chunks.exists(buffer => !buffer.isDirect)) {
bytes.copy(Platform.allocateDirectBuffer)
} else {
bytes
}
})
}
if (!putSucceeded && level.useDisk) {
logWarning(s "Persisting block $blockId to disk instead." )
diskStore.putBytes(blockId, bytes)
}
} else if (level.useDisk) {
diskStore.putBytes(blockId, bytes)
}
val putBlockStatus = getCurrentBlockStatus(blockId, info)
val blockWasSuccessfullyStored = putBlockStatus.storageLevel.isValid
if (blockWasSuccessfullyStored) {
// Now that the block is in either the memory or disk store,
// tell the master about it.
info.size = size
if (tellMaster && info.tellMaster) {
reportBlockStatus(blockId, putBlockStatus)
}
addUpdatedBlockStatusToTaskMetrics(blockId, putBlockStatus)
}
logDebug( "Put block %s locally took %s" .format(blockId, Utils.getUsedTimeMs(startTimeMs)))
if (level.replication > 1) {
// Wait for asynchronous replication to finish
try {
ThreadUtils.awaitReady(replicationFuture, Duration.Inf)
} catch {
case NonFatal(t) =>
throw new Exception("Error occurred while waiting for replication to finish", t)
}
}
if (blockWasSuccessfullyStored) {
None
} else {
Some(bytes)
}
}.isEmpty
}
/**
* Put the given block according to the given level in one of the block stores, replicating
* the values if necessary.
*
* If the block already exists, this method will not overwrite it.
*
* @param keepReadLock if true, this method will hold the read lock when it returns (even if the
* block already exists). If false, this method will hold no locks when it
* returns.
* @return None if the block was already present or if the put succeeded, or Some(iterator)
* if the put failed.
*/
private def doPutIterator[T](
    blockId: BlockId,
    iterator: () => Iterator[T],
level: StorageLevel,
classTag: ClassTag[T],
tellMaster: Boolean = true,
keepReadLock: Boolean = false): Option[PartiallyUnrolledIterator[T]] = {
doPut(blockId, level, classTag, tellMaster = tellMaster, keepReadLock = keepReadLock) { info =>
val startTimeMs = System.currentTimeMillis
var iteratorFromFailedMemoryStorePut: Option[PartiallyUnrolledIterator[T]] = None
// Size of the block in bytes
var size = 0L
if (level.useMemory) {
// Put it in memory first, even if it also has useDisk set to true;
// We will drop it to disk later if the memory store can't hold it.
if (level.deserialized) {
memoryStore.putIteratorAsValues(blockId, iterator(), classTag) match {
case Right(s) =>
size = s
case Left(iter) =>
// Not enough space to unroll this block; drop to disk if applicable
if (level.useDisk) {
logWarning(s "Persisting block $blockId to disk instead." )
diskStore.put(blockId) { channel =>
val out = Channels.newOutputStream(channel)
serializerManager.dataSerializeStream(blockId, out, iter)(classTag)
}
size = diskStore.getSize(blockId)
} else {
iteratorFromFailedMemoryStorePut = Some(iter)
}
}
} else { // !level.deserialized
memoryStore.putIteratorAsBytes(blockId, iterator(), classTag, level.memoryMode) match {
case Right(s) =>
size = s
case Left(partiallySerializedValues) =>
// Not enough space to unroll this block; drop to disk if applicable
if (level.useDisk) {
logWarning(s "Persisting block $blockId to disk instead." )
diskStore.put(blockId) { channel =>
val out = Channels.newOutputStream(channel)
partiallySerializedValues.finishWritingToStream(out)
}
size = diskStore.getSize(blockId)
} else {
iteratorFromFailedMemoryStorePut = Some(partiallySerializedValues.valuesIterator)
}
}
}
} else if (level.useDisk) {
diskStore.put(blockId) { channel =>
val out = Channels.newOutputStream(channel)
serializerManager.dataSerializeStream(blockId, out, iterator())(classTag)
}
size = diskStore.getSize(blockId)
}
val putBlockStatus = getCurrentBlockStatus(blockId, info)
val blockWasSuccessfullyStored = putBlockStatus.storageLevel.isValid
if (blockWasSuccessfullyStored) {
// Now that the block is in either the memory or disk store, tell the master about it.
info.size = size
if (tellMaster && info.tellMaster) {
reportBlockStatus(blockId, putBlockStatus)
}
addUpdatedBlockStatusToTaskMetrics(blockId, putBlockStatus)
logDebug( "Put block %s locally took %s" .format(blockId, Utils.getUsedTimeMs(startTimeMs)))
if (level.replication > 1 ) {
val remoteStartTime = System.currentTimeMillis
val bytesToReplicate = doGetLocalBytes(blockId, info)
// [SPARK-16550] Erase the typed classTag when using default serialization, since
// NettyBlockRpcServer crashes when deserializing repl-defined classes.
        // TODO(ekl) remove this once the classloader issue on the remote end is fixed.
val remoteClassTag = if (!serializerManager.canUseKryo(classTag)) {
scala.reflect.classTag[Any]
} else {
classTag
}
try {
replicate(blockId, bytesToReplicate, level, remoteClassTag)
} finally {
bytesToReplicate.dispose()
}
logDebug( "Put block %s remotely took %s"
.format(blockId, Utils.getUsedTimeMs(remoteStartTime)))
}
}
assert(blockWasSuccessfullyStored == iteratorFromFailedMemoryStorePut.isEmpty)
iteratorFromFailedMemoryStorePut
}
}
// ... ...
}
/**
* Tracks metadata for an individual block.
*
* Instances of this class are _not_ thread-safe and are protected by locks in the
* [[BlockInfoManager]].
*
* @param level the block's storage level. This is the requested persistence level, not the
* effective storage level of the block (i.e. if this is MEMORY_AND_DISK, then this
* does not imply that the block is actually resident in memory).
* @param classTag the block's [[ClassTag]], used to select the serializer
* @param tellMaster whether state changes for this block should be reported to the master. This
* is true for most blocks, but is false for broadcast blocks.
*/
private[storage] class BlockInfo(
val level: StorageLevel,
val classTag: ClassTag[_],
val tellMaster: Boolean)
/**
* Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
* This should ''not'' be called by users directly, but is available for implementors of custom
* subclasses of RDD.
*/
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  // 若storageLevel不为NONE,也就是之前持久化过该RDD,那么就不会直接从父RDD执行算子来计算新的partition
  // 而是优先尝试通过BlockManager获取已持久化的数据
  if (storageLevel != StorageLevel.NONE) {
    getOrCompute(split, context)
  } else {
    // 否则,直接计算,或尝试从Checkpoint的持久化数据中读取
    computeOrReadCheckpoint(split, context)
  }
}
/**
* Gets or computes an RDD partition. Used by RDD.iterator() when an RDD is cached.
*/
private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
  val blockId = RDDBlockId(id, partition.index)
  var readCachedBlock = true
  // This method is called on executors, so we need call SparkEnv.get instead of sc.env.
  SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
    readCachedBlock = false
    computeOrReadCheckpoint(partition, context)
  }) match {
    case Left(blockResult) =>
      if (readCachedBlock) {
        val existingMetrics = context.taskMetrics().inputMetrics
        existingMetrics.incBytesRead(blockResult.bytes)
        new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
          override def next(): T = {
            existingMetrics.incRecordsRead(1)
            delegate.next()
          }
        }
      } else {
        new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
      }
    case Right(iter) =>
      new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
  }
}
/**
* Compute an RDD partition or read it from a checkpoint if the RDD is checkpointing.
*/
private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
{
  if (isCheckpointedAndMaterialized) {
    firstParent[T].iterator(split, context)
  } else {
    compute(split, context)
  }
}
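下面用一小段示意代码串一下上面三个方法的走向(假设已有SparkContext `sc`,checkpoint目录与输入路径均为演示用占位):persist过的RDD走getOrCompute,经由BlockManager.getOrElseUpdate读取或写入Block;checkpoint物化之后,computeOrReadCheckpoint会直接读取checkpoint数据而不再重算。

```scala
import org.apache.spark.storage.StorageLevel

// 假设 sc 已创建,目录/路径仅为演示用占位
sc.setCheckpointDir("hdfs:///tmp/ckpt")

val words = sc.textFile("hdfs:///tmp/input").flatMap(_.split("\\s+"))

val cached = words.persist(StorageLevel.MEMORY_ONLY)
cached.count() // storageLevel != NONE:走 getOrCompute,第一次未命中,计算后写入 BlockManager
cached.count() // 再次访问时 getOrElseUpdate 直接命中已缓存的 Block

val checkpointed = words.map(_.toUpperCase)
checkpointed.checkpoint()
checkpointed.count() // 触发计算并物化 checkpoint
checkpointed.count() // 之后 computeOrReadCheckpoint 直接读取 checkpoint 的数据
```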
/**
* Retrieve the given block if it exists, otherwise call the provided `makeIterator` method
* to compute the block, persist it, and return its values.
*
* @return either a BlockResult if the block was successfully cached, or an iterator if the block
* could not be cached.
*/
def getOrElseUpdate[T](
    blockId: BlockId,
    level: StorageLevel,
    classTag: ClassTag[T],
    makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
  // Attempt to read the block from local or remote storage. If it's present, then we don't need
  // to go through the local-get-or-put path.
  get[T](blockId)(classTag) match {
    case Some(block) =>
      return Left(block)
    case _ =>
      // Need to compute the block.
  }
  // Initially we hold no locks on this block.
  doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
    case None =>
      // doPut() didn't hand work back to us, so the block already existed or was successfully
      // stored. Therefore, we now hold a read lock on the block.
      val blockResult = getLocalValues(blockId).getOrElse {
        // Since we held a read lock between the doPut() and get() calls, the block should not
        // have been evicted, so get() not returning the block indicates some internal error.
        releaseLock(blockId)
        throw new SparkException(s"get() failed for block $blockId even though we held a lock")
      }
      // We already hold a read lock on the block from the doPut() call and getLocalValues()
      // acquires the lock again, so we need to call releaseLock() here so that the net number
      // of lock acquisitions is 1 (since the caller will only call release() once).
      releaseLock(blockId)
      Left(blockResult)
    case Some(iter) =>
      // The put failed, likely because the data was too large to fit in memory and could not be
      // dropped to disk. Therefore, we need to pass the input iterator back to the caller so
      // that they can decide what to do with the values (e.g. process them without caching).
      Right(iter)
  }
}