九师兄-梁川川

spark学习-35-Spark的Map任务输出跟踪器MapOutputTracker

1。在sparkEnv的初始化中有这样一段代码来初始化Map任务输出跟踪器MapOutputTracker

　　　mapOutputTracker用于跟踪map阶段任务的输出状态，此状态便于reduce阶段任务获取地址以及中间输出结果。每个map任务或者 reduce任务都会有唯一的标识。分别为mapId和reduceId.每个reduce任务的输入可能是多个map任务的输出，reduce会到各个map任务的所有节点上拉去Block，这一过程交shuffle，每批shuffle过程都有唯一的表示shuffleId。

val mapOutputTracker = if (isDriver) {
      new MapOutputTrackerMaster(conf, broadcastManager, isLocal)
    } else {
      new MapOutputTrackerWorker(conf)
    }

2。在MapOutputTracker.scala文件中有这么多类

private[spark] sealed trait MapOutputTrackerMessage

private[spark] case class GetMapOutputStatuses(shuffleId: Int)
  extends MapOutputTrackerMessage

private[spark] case object StopMapOutputTracker extends MapOutputTrackerMessage

private[spark] case class GetMapOutputMessage(shuffleId: Int, context: RpcCallContext)

private[spark] class MapOutputTrackerMasterEndpoint(
                                                     override val rpcEnv: RpcEnv, tracker: MapOutputTrackerMaster, conf: SparkConf)
  extends RpcEndpoint with Logging {}



private[spark] abstract class MapOutputTracker(conf: SparkConf) extends Logging {}

private[spark] class MapOutputTrackerMaster(conf: SparkConf,broadcastManager:BroadcastManager, isLocal: Boolean) extends MapOutputTracker(conf) {}

private[spark] class MapOutputTrackerWorker(conf: SparkConf) extends MapOutputTracker(conf) {}

private[spark] object MapOutputTracker extends Logging {}

可以看到 MapOutputTrackerMaster和MapOutputTrackerWorker都继承了MapOutputTracker。

3。总结

一、MapOutputTracker是一个abstract抽象类。获取的Map out的信息根据master和worker有不同的用途：master上，用来记录ShuffleMapTasks所需的map out的源；worker上，仅仅作为cache用来执行shuffle计算。

网友总结：
* MapOutputTracker是 SparkEnv初始化是初始化的重要组件之一是master-slave的结构
* 用来跟踪记录shuffleMapTask的输出位置（shuffleMapTask要写到哪里去），
* shuffleReader读取shuffle文件之前就是去请求MapOutputTrackerMaster 要自己处理的数据在哪里？
* MapOutputTracker给它返回一批 MapOutputTrackerWorker的列表（地址，port等信息）
* shuffleReader开始读取文件进行后期处理

1、askTracker()：检查MapOutputTracker的连接是否正常。

2、sendTracker()：检查MapOutPutTracker是否正常工作（发送任意信息返回true）。

3、getServerStatuses()：根据参数shuffle id来获取shuffle对应的map out所在的位置及信息。如果没有直接的对应shuffle id的信息，则需要从所有的map中匹配对应shuffle id的map out。

4、getEpoch()和updateEpoch()：获取和更新epoch的值。epoch的值是与master同步的，保证map outs是最新的有用的。

二、MapOutPutTrackerMaster针对master的MapOutPutTracker，按照前文的意思，它的作用是为每个shuffle准备其所需要的所有map out，可以加速map outs传送给shuffle的速度。在存储map out的HashMap中，HashMap是基于时间戳的，因此map outs被减少只能因为它被注销掉或者生命周期耗尽。

1、registerShuffle()：在map out的集合mapStatuses中注册新的Shuffle，参数为Shuffle id和map的个数。

2、registerMapOutPut()：根据Shuffle id在mapStatuses中为Shuffle添加map out的状态（存储的map out其实就是map out的状态）。

3、registerMapOutPuts()：同时添加多个map out。

4、unregisterMapOutPut()：在mapStatuses中注销给定Shuffle的map out。

5、重写unrigesterShuffle()：移除mapStatuses中的给定Shuffle的map out。

6、containShuffle()：判断是否存在给定的Shuffle。

7、incrementEpoch()：同步epoch加一。

8、getSerializedMapOutputStatuses()：给定Shuffle id，返回其map out集合。首先是对epoch进行锁状态下的同步，保证获取资源的正确性；其次，根据Shuffle id获取指定位置的statuses，如果指定位置没有对应Shuffle id的statuses，那么获取这个位置的statuses快照返回，作为参考；最后，如果操作的epoch与锁状态下的epoch是一致的，将获取到的statuses存入缓存。

9、stop()：停止MapOutPutTracker，清除mapStatuses，清空缓存。

10、cleanup()：在指定时间清除mapStatuses和cachedSerializedStatuses。

三、MapOutPutTracker对象。它通过serializedMapStatuses将map out流通过gzip的压缩方式压缩（压缩是可行的，因为很多map out基于同样的hostname），这样方便数据流传递给reduce进行操作。

4。看代码

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark

import java.io._
import java.util.concurrent.{ConcurrentHashMap, LinkedBlockingQueue, ThreadPoolExecutor}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

import scala.collection.JavaConverters._
import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet, Map}
import scala.reflect.ClassTag
import scala.util.control.NonFatal

import org.apache.spark.broadcast.{Broadcast, BroadcastManager}
import org.apache.spark.internal.Logging
import org.apache.spark.rpc.{RpcCallContext, RpcEndpoint, RpcEndpointRef, RpcEnv}
import org.apache.spark.scheduler.MapStatus
import org.apache.spark.shuffle.MetadataFetchFailedException
import org.apache.spark.storage.{BlockId, BlockManagerId, ShuffleBlockId}
import org.apache.spark.util._

private[spark] sealed trait MapOutputTrackerMessage


private[spark] case class GetMapOutputStatuses(shuffleId: Int)
  extends MapOutputTrackerMessage



private[spark] case object StopMapOutputTracker extends MapOutputTrackerMessage



private[spark] case class GetMapOutputMessage(shuffleId: Int, context: RpcCallContext)



/** RpcEndpoint class for MapOutputTrackerMaster
  *
  * */
private[spark] class MapOutputTrackerMasterEndpoint(
                                                     override val rpcEnv: RpcEnv, tracker: MapOutputTrackerMaster, conf: SparkConf)
  extends RpcEndpoint with Logging {

  logDebug("init") // force eager creation of logger

  /** 处理RpcEndpointRef.ask方法，如果不匹配消息，将抛出SparkException
    *
    * 这一段代码没看懂？
    * */
  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case GetMapOutputStatuses(shuffleId: Int) =>
      val hostPort = context.senderAddress.hostPort
      logInfo("Asked to send map output locations for shuffle " + shuffleId + " to " + hostPort)
      val mapOutputStatuses = tracker.post(new GetMapOutputMessage(shuffleId, context))

    case StopMapOutputTracker =>
      logInfo("MapOutputTrackerMasterEndpoint stopped!")
      context.reply(true)
      stop()
  }
}






/**
  * Class that keeps track of the location of the map output of
  * a stage. This is abstract because different versions of MapOutputTracker
  * (driver and executor) use different HashMap to store its metadata.
  *
  * 类跟踪一个阶段map输出的位置。这是抽象的，因为不同版本的MapOutputTracker(驱动程序和执行器)
  * 使用不同的HashMap来存储它的元数据。
  *
  * MapOutputTracker是一个abstract抽象类。获取的Map out的信息根据master和worker有不同的用途：
  * master上，用来记录ShuffleMapTasks所需的map out的源；
  * worker上，仅仅作为cache用来执行shuffle计算。
  *
  * 它通过serializedMapStatuses将map out流通过gzip的压缩方式压缩（压缩是可行的，因为很多map out基于同样的hostname），
  * 这样方便数据流传递给reduce进行操作。
  * 
  * 
  * MapOutputTracker是 SparkEnv初始化是初始化的重要组件之一  是master-slave的结构
  * 用来跟踪记录shuffleMapTask的输出位置 （shuffleMapTask要写到哪里去），
  * shuffleReader读取shuffle文件之前就是去请求MapOutputTrackerMaster 要自己处理的数据 在哪里？
  * MapOutputTracker给它返回一批 MapOutputTrackerWorker的列表（地址，port等信息）
  * shuffleReader开始读取文件  进行后期处理
  *
  */
private[spark] abstract class MapOutputTracker(conf: SparkConf) extends Logging {

  /** Set to the MapOutputTrackerMasterEndpoint living on the driver.
    * 在driver上设置MapOutputTrackerMasterEndpoint为living活动的。
    * */
  var trackerEndpoint: RpcEndpointRef = _

  /**
    * This HashMap has different behavior for the driver and the executors.
    * 这个HashMap对驱动程序和执行器有不同的行为。
    *
    * On the driver, it serves as the source of map outputs recorded from ShuffleMapTasks.
    * On the executors, it simply serves as a cache, in which a miss triggers a fetch from the
    * driver's corresponding HashMap.
    *
    * 在驱动程序中，它充当从ShuffleMapTasks记录的映射输出的源。
    * 在执行器上，它只是充当一个缓存，在这个缓存中，一个失误触发了驱动程序相应的HashMap的获取。
    *
    * Note: because mapStatuses is accessed concurrently, subclasses should make sure it's a
    * thread-safe map.
    * 注意:由于mapStatuses同时被访问，子类应该确保它是一个线程安全的映射。
    */
  protected val mapStatuses: Map[Int, Array[MapStatus]]

  /**
    * Incremented every time a fetch fails so that client nodes know to clear
    * their cache of map output locations if this happens.
    * 每次当一个fetch失败时递增，这样客户节点知道如果发生这种情况，就可以清除它们的映射输出位置缓存。
    */
  protected var epoch: Long = 0
  protected val epochLock = new AnyRef

  /** Remembers which map output locations are currently being fetched on an executor.
    * 记住，哪个映射输出位置当前正在被执行器获取。
    * */
  private val fetching = new HashSet[Int]

  /**
    * Send a message to the trackerEndpoint and get its result within a default timeout, or
    * throw a SparkException if this fails.
    *
    * 向trackerEndpoint发送一条消息，并在默认超时中获取它的结果，如果失败，则抛出一个SparkException。
    *
    * askTracker()：检查MapOutputTracker的连接是否正常。
    */
  protected def askTracker[T: ClassTag](message: Any): T = {
    try {
      trackerEndpoint.askSync[T](message)
    } catch {
      case e: Exception =>
        logError("Error communicating with MapOutputTracker", e)
        throw new SparkException("Error communicating with MapOutputTracker", e)
    }
  }

  /** Send a one-way message to the trackerEndpoint, to which we expect it to reply with true.
    * 发送一条到trackerEndpoint的单向消息，我们期望它以true回复。
    *
    * 检查MapOutPutTracker是否正常工作（发送任意信息返回true）。
    * */
  protected def sendTracker(message: Any) {
    val response = askTracker[Boolean](message)
    if (response != true) {
      throw new SparkException(
        "Error reply received from MapOutputTracker. Expecting true, got " + response.toString)
    }
  }

  /**
    * Called from executors to get the server URIs and output sizes for each shuffle block that
    * needs to be read from a given reduce task.
    * 从executors调用服务器uri，并从给定的reduce任务中读取每个需要读取的shuffle块的输出大小。
    *
    * @return A sequence of 2-item tuples, where the first item in the tuple is a BlockManagerId,
    *         and the second item is a sequence of (shuffle block id, shuffle block size) tuples
    *         describing the shuffle blocks that are stored at that block manager.
    */
  def getMapSizesByExecutorId(shuffleId: Int, reduceId: Int)
  : Seq[(BlockManagerId, Seq[(BlockId, Long)])] = {
    getMapSizesByExecutorId(shuffleId, reduceId, reduceId + 1)
  }

  /**
    * Called from executors to get the server URIs and output sizes for each shuffle block that
    * needs to be read from a given range of map output partitions (startPartition is included but
    * endPartition is excluded from the range).
    * 从executors调用，获取服务器uri和每个shuffle块的输出大小，需要从给定范围的映射输出分区读取
    * (包括startPartition，但是endPartition被排除在范围之外)。
    *
    * @return A sequence of 2-item tuples, where the first item in the tuple is a BlockManagerId,
    *         and the second item is a sequence of (shuffle block id, shuffle block size) tuples
    *         describing the shuffle blocks that are stored at that block manager.
    */
  def getMapSizesByExecutorId(shuffleId: Int, startPartition: Int, endPartition: Int)
  : Seq[(BlockManagerId, Seq[(BlockId, Long)])] = {
    logDebug(s"Fetching outputs for shuffle $shuffleId, partitions $startPartition-$endPartition")
    val statuses = getStatuses(shuffleId)
    // Synchronize on the returned array because, on the driver, it gets mutated in place
    statuses.synchronized {
      return MapOutputTracker.convertMapStatuses(shuffleId, startPartition, endPartition, statuses)
    }
  }

  /**
    * Return statistics about all of the outputs for a given shuffle.
    * 对给定的shuffle的所有输出返回统计信息。
    */
  def getStatistics(dep: ShuffleDependency[_, _, _]): MapOutputStatistics = {
    val statuses = getStatuses(dep.shuffleId)
    // Synchronize on the returned array because, on the driver, it gets mutated in place
    statuses.synchronized {
      val totalSizes = new Array[Long](dep.partitioner.numPartitions)
      for (s <- statuses) {
        for (i <- 0 until totalSizes.length) {
          totalSizes(i) += s.getSizeForBlock(i)
        }
      }
      new MapOutputStatistics(dep.shuffleId, totalSizes)
    }
  }

  /**
    * Get or fetch the array of MapStatuses for a given shuffle ID. NOTE: clients MUST synchronize
    * on this array when reading it, because on the driver, we may be changing it in place.
    *
    * (It would be nice to remove this restriction in the future.)
    *
    *  根据参数shuffle id来获取shuffle对应的map out所在的位置及信息。如果没有直接的对应shuffle id的信息，
    *  则需要从所有的map中匹配对应shuffle id的map out。
    *
    * 获取map任务状态：
    *   Spark通过调用MapOutputTracker的getStatuses（1.5版本是getServerStatuses）来获取map任务执行的状态信息，
    * 其中处理步骤如下：
    *   1.从当前BlockManager的MapOutputTracker中获取MapStatus，若没有就进入第2步。否则进入第4步。
    *   2.如果获取列表（fetching）中已经存在要取的shuffleId，那么久等待其他线程获取，如果获取列表中不存在要获取的shuffleId，
    *     那么就将shuffleId放入获取列表。
    *   3.调用askTracker方法向MapOutputTrackerMasterActor发送GetMapOutputStatuses消息获取map任务的状态信息。
    *     MapOutputTrackerMasterActor接收到GetMapOutputStatuses消息后，将请求的map任务状态信息序列化后发送给请求方，
    *     请求方接收到map任务状态信息后进行反序列化操作，然后放入本地的mapStatuses中。
    *   4.调用MapOutputTracker的convertMapStatuses方法将或得到的MapStatus转换为map任务所在的地址（即BlockManagerId）
    *     和map任务输出中分配给当前reduce任务的Block大小。
    */
  private def getStatuses(shuffleId: Int): Array[MapStatus] = {
    val statuses = mapStatuses.get(shuffleId).orNull
    if (statuses == null) {
      logInfo("Don't have map outputs for shuffle " + shuffleId + ", fetching them")
      val startTime = System.currentTimeMillis
      var fetchedStatuses: Array[MapStatus] = null
      fetching.synchronized {
        // Someone else is fetching it; wait for them to be done
        while (fetching.contains(shuffleId)) {
          try {
            fetching.wait()
          } catch {
            case e: InterruptedException =>
          }
        }

        // Either while we waited the fetch happened successfully, or
        // someone fetched it in between the get and the fetching.synchronized.
        fetchedStatuses = mapStatuses.get(shuffleId).orNull
        if (fetchedStatuses == null) {
          // We have to do the fetch, get others to wait for us.
          fetching += shuffleId
        }
      }

      if (fetchedStatuses == null) {
        // We won the race to fetch the statuses; do so
        logInfo("Doing the fetch; tracker endpoint = " + trackerEndpoint)
        // This try-finally prevents hangs due to timeouts:
        try {
          val fetchedBytes = askTracker[Array[Byte]](GetMapOutputStatuses(shuffleId))
          fetchedStatuses = MapOutputTracker.deserializeMapStatuses(fetchedBytes)
          logInfo("Got the output locations")
          mapStatuses.put(shuffleId, fetchedStatuses)
        } finally {
          fetching.synchronized {
            fetching -= shuffleId
            fetching.notifyAll()
          }
        }
      }
      logDebug(s"Fetching map output statuses for shuffle $shuffleId took " +
        s"${System.currentTimeMillis - startTime} ms")

      if (fetchedStatuses != null) {
        return fetchedStatuses
      } else {
        logError("Missing all output locations for shuffle " + shuffleId)
        throw new MetadataFetchFailedException(
          shuffleId, -1, "Missing all output locations for shuffle " + shuffleId)
      }
    } else {
      return statuses
    }
  }

  /** Called to get current epoch number.
    * 获取当前的epoch数值
    * epoch number是什么有什么用？
    * epoch的值是与master同步的，保证map outs是最新的有用的。
    * */
  def getEpoch: Long = {
    epochLock.synchronized {
      return epoch
    }
  }

  /**
    * Called from executors to update the epoch number, potentially clearing old outputs
    * because of a fetch failure. Each executor task calls this with the latest epoch
    * number on the driver at the time it was created.
    *
    * 由执行器调用来更新epoch号，可能会清除旧的输出，因为它会导致fetch失败。每个执行器任务都在创建时使用最新的epoch号来调用此任务。
    */
  def updateEpoch(newEpoch: Long) {
    epochLock.synchronized {
      if (newEpoch > epoch) {
        logInfo("Updating epoch to " + newEpoch + " and clearing cache")
        epoch = newEpoch
        mapStatuses.clear()
      }
    }
  }

  /** Unregister shuffle data.
    * 注销洗牌数据。
    * */
  def unregisterShuffle(shuffleId: Int) {
    mapStatuses.remove(shuffleId)
  }

  /** Stop the tracker.
    * 停止tracker
    * */
  def stop() { }
}








/**
  * MapOutputTracker for the driver.
  * driver的MapOutputTracker内部使用mapStatuses：TimeStampedHashMap[int,Array[MapStatus]]来维护跟踪各个map任务的输出状态，
  * 其中key对应shuffleld,Array存储各个map任务对应的状态信息MapStatus.
  *
  * MapOutPutTrackerMaster针对master的MapOutPutTracker，按照前文的意思，它的作用是为每个shuffle
  * 准备其所需要的所有map out，可以加速map outs传送给shuffle的速度。在存储map out的HashMap中，HashMap
  * 是基于时间戳的，因此map outs被减少只能因为它被注销掉或者生命周期耗尽。
  */
private[spark] class MapOutputTrackerMaster(conf: SparkConf,
                                            broadcastManager: BroadcastManager, isLocal: Boolean)
  extends MapOutputTracker(conf) {

  /** Cache a serialized version of the output statuses for each shuffle to send them out faster
    * 将输出状态的序列化版本缓存到每个洗牌中，以更快地将它们发送出去
    * */
  private var cacheEpoch = epoch

  // The size at which we use Broadcast to send the map output statuses to the executors
  // 我们使用广播将输出状态的map发送给执行器的大小
  private val minSizeForBroadcast =
    conf.getSizeAsBytes("spark.shuffle.mapOutput.minSizeForBroadcast", "512k").toInt

  /** Whether to compute locality preferences for reduce tasks
    * 是否计算局部首选项以减少任务
    * */
  private val shuffleLocalityEnabled = conf.getBoolean("spark.shuffle.reduceLocality.enabled", true)

  // Number of map and reduce tasks above which we do not assign preferred locations based on map
  // output sizes. We limit the size of jobs for which assign preferred locations as computing the
  // top locations by size becomes expensive.
  /**
    * map数量和reduce任务的数量，我们不根据map的输出大小分配优先位置。我们限制了作业的大小,
    * 他根据指定的优先位置作为计算顶级位置变得昂贵起来。
    */
  private val SHUFFLE_PREF_MAP_THRESHOLD = 1000
  // NOTE: This should be less than 2000 as we use HighlyCompressedMapStatus beyond that
  // 注意:这个应该小于2000作为我们使用HighlyCompressedMapStatus除此之外
  private val SHUFFLE_PREF_REDUCE_THRESHOLD = 1000

  // Fraction of total map output that must be at a location for it to considered as a preferred
  // location for a reduce task. Making this larger will focus on fewer locations where most data
  // can be read locally, but may lead to more delay in scheduling if those locations are busy.
  private val REDUCER_PREF_LOCS_FRACTION = 0.2

  // HashMaps for storing mapStatuses and cached serialized statuses in the driver.
  // Statuses are dropped only by explicit de-registering.
  protected val mapStatuses = new ConcurrentHashMap[Int, Array[MapStatus]]().asScala
  private val cachedSerializedStatuses = new ConcurrentHashMap[Int, Array[Byte]]().asScala

  private val maxRpcMessageSize = RpcUtils.maxMessageSizeBytes(conf)

  // Kept in sync with cachedSerializedStatuses explicitly
  // This is required so that the Broadcast variable remains in scope until we remove
  // the shuffleId explicitly or implicitly.
  private val cachedSerializedBroadcast = new HashMap[Int, Broadcast[Array[Byte]]]()

  // This is to prevent multiple serializations of the same shuffle - which happens when
  // there is a request storm when shuffle start.
  private val shuffleIdLocks = new ConcurrentHashMap[Int, AnyRef]()

  // requests for map output statuses
  private val mapOutputRequests = new LinkedBlockingQueue[GetMapOutputMessage]

  // Thread pool used for handling map output status requests. This is a separate thread pool
  // to ensure we don't block the normal dispatcher threads.
  private val threadpool: ThreadPoolExecutor = {
    val numThreads = conf.getInt("spark.shuffle.mapOutput.dispatcher.numThreads", 8)
    val pool = ThreadUtils.newDaemonFixedThreadPool(numThreads, "map-output-dispatcher")
    for (i <- 0 until numThreads) {
      pool.execute(new MessageLoop)
    }
    pool
  }

  // Make sure that we aren't going to exceed the max RPC message size by making sure
  // we use broadcast to send large map output statuses.
  if (minSizeForBroadcast > maxRpcMessageSize) {
    val msg = s"spark.shuffle.mapOutput.minSizeForBroadcast ($minSizeForBroadcast bytes) must " +
      s"be <= spark.rpc.message.maxSize ($maxRpcMessageSize bytes) to prevent sending an rpc " +
      "message that is too large."
    logError(msg)
    throw new IllegalArgumentException(msg)
  }

  def post(message: GetMapOutputMessage): Unit = {
    mapOutputRequests.offer(message)
  }

  /** Message loop used for dispatching messages. */
  private class MessageLoop extends Runnable {
    override def run(): Unit = {
      try {
        while (true) {
          try {
            val data = mapOutputRequests.take()
            if (data == PoisonPill) {
              // Put PoisonPill back so that other MessageLoops can see it.
              mapOutputRequests.offer(PoisonPill)
              return
            }
            val context = data.context
            val shuffleId = data.shuffleId
            val hostPort = context.senderAddress.hostPort
            logDebug("Handling request to send map output locations for shuffle " + shuffleId +
              " to " + hostPort)
            val mapOutputStatuses = getSerializedMapOutputStatuses(shuffleId)
            context.reply(mapOutputStatuses)
          } catch {
            case NonFatal(e) => logError(e.getMessage, e)
          }
        }
      } catch {
        case ie: InterruptedException => // exit
      }
    }
  }

  /** A poison endpoint that indicates MessageLoop should exit its message loop. */
  private val PoisonPill = new GetMapOutputMessage(-99, null)

  // Exposed for testing
  private[spark] def getNumCachedSerializedBroadcast = cachedSerializedBroadcast.size

  /** 在map out的集合mapStatuses中注册新的Shuffle，参数为Shuffle id和map的个数。  */
  def registerShuffle(shuffleId: Int, numMaps: Int) {
    if (mapStatuses.put(shuffleId, new Array[MapStatus](numMaps)).isDefined) {
      throw new IllegalArgumentException("Shuffle ID " + shuffleId + " registered twice")
    }
    // add in advance
    shuffleIdLocks.putIfAbsent(shuffleId, new Object())
  }

  /** 根据Shuffle id在mapStatuses中为Shuffle添加map out的状态（存储的map out其实就是map out的状态）。 */
  def registerMapOutput(shuffleId: Int, mapId: Int, status: MapStatus) {
    val array = mapStatuses(shuffleId)
    array.synchronized {
      array(mapId) = status
    }
  }

  /** Register multiple map output information for the given shuffle
    *
    * 同时添加多个map out。
    * */
  def registerMapOutputs(shuffleId: Int, statuses: Array[MapStatus], changeEpoch: Boolean = false) {
    mapStatuses.put(shuffleId, statuses.clone())
    if (changeEpoch) {
      incrementEpoch()
    }
  }

  /** Unregister map output information of the given shuffle, mapper and block manager *
    *
    * 在mapStatuses中注销给定Shuffle的map out。
    */
  def unregisterMapOutput(shuffleId: Int, mapId: Int, bmAddress: BlockManagerId) {
    val arrayOpt = mapStatuses.get(shuffleId)
    if (arrayOpt.isDefined && arrayOpt.get != null) {
      val array = arrayOpt.get
      array.synchronized {
        if (array(mapId) != null && array(mapId).location == bmAddress) {
          array(mapId) = null
        }
      }
      incrementEpoch()
    } else {
      throw new SparkException("unregisterMapOutput called for nonexistent shuffle ID")
    }
  }

  /** Unregister shuffle data */
  override def unregisterShuffle(shuffleId: Int) {
    mapStatuses.remove(shuffleId)
    cachedSerializedStatuses.remove(shuffleId)
    cachedSerializedBroadcast.remove(shuffleId).foreach(v => removeBroadcast(v))
    shuffleIdLocks.remove(shuffleId)
  }

  /** Check if the given shuffle is being tracked
    *
    *  判断是否存在给定的Shuffle。
    * */
  def containsShuffle(shuffleId: Int): Boolean = {
    cachedSerializedStatuses.contains(shuffleId) || mapStatuses.contains(shuffleId)
  }

  /**
    * Return the preferred hosts on which to run the given map output partition in a given shuffle,
    * i.e. the nodes that the most outputs for that partition are on.
    *
    * @param dep shuffle dependency object
    * @param partitionId map output partition that we want to read
    * @return a sequence of host names
    */
  def getPreferredLocationsForShuffle(dep: ShuffleDependency[_, _, _], partitionId: Int)
  : Seq[String] = {
    if (shuffleLocalityEnabled && dep.rdd.partitions.length < SHUFFLE_PREF_MAP_THRESHOLD &&
      dep.partitioner.numPartitions < SHUFFLE_PREF_REDUCE_THRESHOLD) {
      val blockManagerIds = getLocationsWithLargestOutputs(dep.shuffleId, partitionId,
        dep.partitioner.numPartitions, REDUCER_PREF_LOCS_FRACTION)
      if (blockManagerIds.nonEmpty) {
        blockManagerIds.get.map(_.host)
      } else {
        Nil
      }
    } else {
      Nil
    }
  }

  /**
    * Return a list of locations that each have fraction of map output greater than the specified
    * threshold.
    *
    * @param shuffleId id of the shuffle
    * @param reducerId id of the reduce task
    * @param numReducers total number of reducers in the shuffle
    * @param fractionThreshold fraction of total map output size that a location must have
    *                          for it to be considered large.
    */
  def getLocationsWithLargestOutputs(
                                      shuffleId: Int,
                                      reducerId: Int,
                                      numReducers: Int,
                                      fractionThreshold: Double)
  : Option[Array[BlockManagerId]] = {

    val statuses = mapStatuses.get(shuffleId).orNull
    if (statuses != null) {
      statuses.synchronized {
        if (statuses.nonEmpty) {
          // HashMap to add up sizes of all blocks at the same location
          val locs = new HashMap[BlockManagerId, Long]
          var totalOutputSize = 0L
          var mapIdx = 0
          while (mapIdx < statuses.length) {
            val status = statuses(mapIdx)
            // status may be null here if we are called between registerShuffle, which creates an
            // array with null entries for each output, and registerMapOutputs, which populates it
            // with valid status entries. This is possible if one thread schedules a job which
            // depends on an RDD which is currently being computed by another thread.
            if (status != null) {
              val blockSize = status.getSizeForBlock(reducerId)
              if (blockSize > 0) {
                locs(status.location) = locs.getOrElse(status.location, 0L) + blockSize
                totalOutputSize += blockSize
              }
            }
            mapIdx = mapIdx + 1
          }
          val topLocs = locs.filter { case (loc, size) =>
            size.toDouble / totalOutputSize >= fractionThreshold
          }
          // Return if we have any locations which satisfy the required threshold
          if (topLocs.nonEmpty) {
            return Some(topLocs.keys.toArray)
          }
        }
      }
    }
    None
  }

  /** 同步epoch加一。 */
  def incrementEpoch() {
    epochLock.synchronized {
      epoch += 1
      logDebug("Increasing epoch to " + epoch)
    }
  }

  private def removeBroadcast(bcast: Broadcast[_]): Unit = {
    if (null != bcast) {
      broadcastManager.unbroadcast(bcast.id,
        removeFromDriver = true, blocking = false)
    }
  }

  private def clearCachedBroadcast(): Unit = {
    for (cached <- cachedSerializedBroadcast) removeBroadcast(cached._2)
    cachedSerializedBroadcast.clear()
  }

  /**
    * 给定Shuffle id，返回其map out集合。首先是对epoch进行锁状态下的同步，保证获取资源的正确性；
    * 其次，根据Shuffle id获取指定位置的statuses，如果指定位置没有对应Shuffle id的statuses，
    * 那么获取这个位置的statuses快照返回，作为参考；最后，如果操作的epoch与锁状态下的epoch是一致的，
    * 将获取到的statuses存入缓存。
    *
    * */
  def getSerializedMapOutputStatuses(shuffleId: Int): Array[Byte] = {
    var statuses: Array[MapStatus] = null
    var retBytes: Array[Byte] = null
    var epochGotten: Long = -1

    // Check to see if we have a cached version, returns true if it does
    // and has side effect of setting retBytes.  If not returns false
    // with side effect of setting statuses
    def checkCachedStatuses(): Boolean = {
      epochLock.synchronized {
        if (epoch > cacheEpoch) {
          cachedSerializedStatuses.clear()
          clearCachedBroadcast()
          cacheEpoch = epoch
        }
        cachedSerializedStatuses.get(shuffleId) match {
          case Some(bytes) =>
            retBytes = bytes
            true
          case None =>
            logDebug("cached status not found for : " + shuffleId)
            statuses = mapStatuses.getOrElse(shuffleId, Array.empty[MapStatus])
            epochGotten = epoch
            false
        }
      }
    }

    if (checkCachedStatuses()) return retBytes
    var shuffleIdLock = shuffleIdLocks.get(shuffleId)
    if (null == shuffleIdLock) {
      val newLock = new Object()
      // in general, this condition should be false - but good to be paranoid
      val prevLock = shuffleIdLocks.putIfAbsent(shuffleId, newLock)
      shuffleIdLock = if (null != prevLock) prevLock else newLock
    }
    // synchronize so we only serialize/broadcast it once since multiple threads call
    // in parallel
    shuffleIdLock.synchronized {
      // double check to make sure someone else didn't serialize and cache the same
      // mapstatus while we were waiting on the synchronize
      if (checkCachedStatuses()) return retBytes

      // If we got here, we failed to find the serialized locations in the cache, so we pulled
      // out a snapshot of the locations as "statuses"; let's serialize and return that
      val (bytes, bcast) = MapOutputTracker.serializeMapStatuses(statuses, broadcastManager,
        isLocal, minSizeForBroadcast)
      logInfo("Size of output statuses for shuffle %d is %d bytes".format(shuffleId, bytes.length))
      // Add them into the table only if the epoch hasn't changed while we were working
      epochLock.synchronized {
        if (epoch == epochGotten) {
          cachedSerializedStatuses(shuffleId) = bytes
          if (null != bcast) cachedSerializedBroadcast(shuffleId) = bcast
        } else {
          logInfo("Epoch changed, not caching!")
          removeBroadcast(bcast)
        }
      }
      bytes
    }
  }

  /** 停止MapOutPutTracker，清除mapStatuses，清空缓存。 */
  override def stop() {
    mapOutputRequests.offer(PoisonPill)
    threadpool.shutdown()
    sendTracker(StopMapOutputTracker)
    mapStatuses.clear()
    trackerEndpoint = null
    cachedSerializedStatuses.clear()
    clearCachedBroadcast()
    shuffleIdLocks.clear()
  }
}










/**
  * MapOutputTracker for the executors, which fetches map output information from the driver's
  * MapOutputTrackerMaster.
  */
private[spark] class MapOutputTrackerWorker(conf: SparkConf) extends MapOutputTracker(conf) {
  protected val mapStatuses: Map[Int, Array[MapStatus]] =
    new ConcurrentHashMap[Int, Array[MapStatus]]().asScala
}











private[spark] object MapOutputTracker extends Logging {

  val ENDPOINT_NAME = "MapOutputTracker"
  private val DIRECT = 0
  private val BROADCAST = 1

  // Serialize an array of map output locations into an efficient byte format so that we can send
  // it to reduce tasks. We do this by compressing the serialized bytes using GZIP. They will
  // generally be pretty compressible because many map outputs will be on the same hostname.
  def serializeMapStatuses(statuses: Array[MapStatus], broadcastManager: BroadcastManager,
                           isLocal: Boolean, minBroadcastSize: Int): (Array[Byte], Broadcast[Array[Byte]]) = {
    val out = new ByteArrayOutputStream
    out.write(DIRECT)
    val objOut = new ObjectOutputStream(new GZIPOutputStream(out))
    Utils.tryWithSafeFinally {
      // Since statuses can be modified in parallel, sync on it
      statuses.synchronized {
        objOut.writeObject(statuses)
      }
    } {
      objOut.close()
    }
    val arr = out.toByteArray
    if (arr.length >= minBroadcastSize) {
      // Use broadcast instead.
      // Important arr(0) is the tag == DIRECT, ignore that while deserializing !
      val bcast = broadcastManager.newBroadcast(arr, isLocal)
      // toByteArray creates copy, so we can reuse out
      out.reset()
      out.write(BROADCAST)
      val oos = new ObjectOutputStream(new GZIPOutputStream(out))
      oos.writeObject(bcast)
      oos.close()
      val outArr = out.toByteArray
      logInfo("Broadcast mapstatuses size = " + outArr.length + ", actual size = " + arr.length)
      (outArr, bcast)
    } else {
      (arr, null)
    }
  }

  // Opposite of serializeMapStatuses.
  def deserializeMapStatuses(bytes: Array[Byte]): Array[MapStatus] = {
    assert (bytes.length > 0)

    def deserializeObject(arr: Array[Byte], off: Int, len: Int): AnyRef = {
      val objIn = new ObjectInputStream(new GZIPInputStream(
        new ByteArrayInputStream(arr, off, len)))
      Utils.tryWithSafeFinally {
        objIn.readObject()
      } {
        objIn.close()
      }
    }

    bytes(0) match {
      case DIRECT =>
        deserializeObject(bytes, 1, bytes.length - 1).asInstanceOf[Array[MapStatus]]
      case BROADCAST =>
        // deserialize the Broadcast, pull .value array out of it, and then deserialize that
        val bcast = deserializeObject(bytes, 1, bytes.length - 1).
          asInstanceOf[Broadcast[Array[Byte]]]
        logInfo("Broadcast mapstatuses size = " + bytes.length +
          ", actual size = " + bcast.value.length)
        // Important - ignore the DIRECT tag ! Start from offset 1
        deserializeObject(bcast.value, 1, bcast.value.length - 1).asInstanceOf[Array[MapStatus]]
      case _ => throw new IllegalArgumentException("Unexpected byte tag = " + bytes(0))
    }
  }

  /**
    * Given an array of map statuses and a range of map output partitions, returns a sequence that,
    * for each block manager ID, lists the shuffle block IDs and corresponding shuffle block sizes
    * stored at that block manager.
    *
    * If any of the statuses is null (indicating a missing location due to a failed mapper),
    * throws a FetchFailedException.
    *
    * @param shuffleId Identifier for the shuffle
    * @param startPartition Start of map output partition ID range (included in range)
    * @param endPartition End of map output partition ID range (excluded from range)
    * @param statuses List of map statuses, indexed by map ID.
    * @return A sequence of 2-item tuples, where the first item in the tuple is a BlockManagerId,
    *         and the second item is a sequence of (shuffle block ID, shuffle block size) tuples
    *         describing the shuffle blocks that are stored at that block manager.
    */
  private def convertMapStatuses(
                                  shuffleId: Int,
                                  startPartition: Int,
                                  endPartition: Int,
                                  statuses: Array[MapStatus]): Seq[(BlockManagerId, Seq[(BlockId, Long)])] = {
    assert (statuses != null)
    val splitsByAddress = new HashMap[BlockManagerId, ArrayBuffer[(BlockId, Long)]]
    for ((status, mapId) <- statuses.zipWithIndex) {
      if (status == null) {
        val errorMessage = s"Missing an output location for shuffle $shuffleId"
        logError(errorMessage)
        throw new MetadataFetchFailedException(shuffleId, startPartition, errorMessage)
      } else {
        for (part <- startPartition until endPartition) {
          splitsByAddress.getOrElseUpdate(status.location, ArrayBuffer()) +=
            ((ShuffleBlockId(shuffleId, mapId, part), status.getSizeForBlock(part)))
        }
      }
    }

    splitsByAddress.toSeq
  }
}

你可能感兴趣的:(spark,MapOutput,Tracker,大数据-spark)

Python用Bokeh处理大规模数据可视化的最佳实践一键难忘 Bokeh python 开发语言
用Bokeh处理大规模数据可视化的最佳实践在大规模数据处理和分析中，数据可视化是一个至关重要的环节。Bokeh是一个在Python生态中广泛使用的交互式数据可视化库，它具有强大的可扩展性和灵活性。本文将介绍如何使用Bokeh处理大规模数据可视化，并提供一些最佳实践和代码实例，帮助你高效地展示大数据集中的重要信息。1.为什么选择Bokeh？Bokeh是一个专为浏览器呈现而设计的可视化库，它支持高效渲
分页优化之——游标分页 PhilipJ0303 Java面试 java 数据库优化游标分页分页查询
游标分页（Cursor-basedPagination）是一种高效的分页方式，特别适用于大数据集和无限滚动的场景。与传统的基于页码的分页（如page=1&size=10）不同，游标分页通过一个唯一的游标（通常是时间戳或唯一ID）来标记分页的位置，避免了传统分页在数据变动时的重复或遗漏问题。以下是游标分页在前后端的实现方式：1.游标分页的核心概念游标（Cursor）：游标是一个唯一标识符，通常是数据
轻松入门Apache SeaTunnel：数据集成利器窝窝和牛牛 SeaTunnel ETL 数据集成
文章目录轻松入门ApacheSeaTunnel：数据集成利器什么是SeaTunnel基本原理运行流程SeaTunnelvsDataX：两大数据集成工具对比实战场景：MySQL数据同步至ElasticsearchSeaTunnel实现方案DataX实现方案实现原理对比底层依赖环境方案优缺点分析快速上手环境准备简单示例总结轻松入门ApacheSeaTunnel：数据集成利器什么是SeaTunnelAp
Azure Delta Lake、Databricks和Event Hubs实现实时欺诈检测 weixin_30777913 azure 云计算
设计Azure云架构方案实现AzureDeltaLake和AzureDatabricks，结合AzureEventHubs/Kafka摄入实时数据，通过DeltaLake实现Exactly-Once语义，实时欺诈检测（流数据写入DeltaLake，批处理模型实时更新），以及具体实现的详细步骤和关键PySpark代码。完整实现代码需要根据具体数据格式和业务规则进行调整，建议通过DatabricksR
探索数据安全新境界：Apache Spark SQL Ranger Security插件深度揭秘乌昱有Melanie
探索数据安全新境界：ApacheSparkSQLRangerSecurity插件深度揭秘项目地址:https://gitcode.com/gh_mirrors/sp/spark-ranger随着大数据的爆炸性增长，数据安全性成为了企业不可忽视的核心议题。在这一背景下，【ApacheSparkSQLRangerSecurityPlugin】以其强大的数据访问控制能力脱颖而出，成为数据处理领域的明星级
Java 大视界 -- Java 大数据在智能医疗远程会诊与专家协作中的技术支持（146）青云交大数据新视界 Java 大视界 java 大数据智能医疗远程会诊专家协作数据安全病例诊断
亲爱的朋友们，热烈欢迎来到青云交的博客！能与诸位在此相逢，我倍感荣幸。在这飞速更迭的时代，我们都渴望一方心灵净土，而我的博客正是这样温暖的所在。这里为你呈上趣味与实用兼具的知识，也期待你毫无保留地分享独特见解，愿我们于此携手成长，共赴新程！一、欢迎加入【福利社群】点击快速加入：青云交灵犀技韵交响盛汇福利社群点击快速加入2：2024CSDN博客之星创作交流营（NEW)二、本博客的精华专栏：大数据新视
Flink相关面试题努力的搬砖人. 面试 java 后端 flink
以下是150道ApacheFlink面试题及其详细回答，涵盖了Flink的基础知识、核心架构、API使用、性能调优等多个方面，每道题目都尽量详细且简单易懂：Flink基础概念类1.什么是ApacheFlink？ApacheFlink是一个开源的流处理和批处理框架，能够实现快速、可靠、可扩展的大数据处理。它既可以处理无界的数据流，也可以处理有界的数据批，提供了低延迟和高吞吐量的实时数据处理能力。Fl
基于Azure云平台构建实时数据仓库 weixin_30777913 云计算 azure 开发语言 spark python
设计Azure云架构方案实现AzureDeltaLake和AzureDatabricks，结合电商网站的流数据，构建实时数据仓库，支持T+0报表（如电商订单分析），具以及具体实现的详细步骤和关键PySpark代码。一、架构设计[电商网站]→[AzureEventHubs]→[AzureDatabricksStreaming]↓[AzureDeltaLake]←→[DatabricksSQLAnal
2017安全之势：云、大数据、IoT、人工智能 weixin_34392906 人工智能大数据嵌入式
“新技术让信息系统变成了孙悟空，开始无所不能，但安全仍是它的‘紧箍咒’！怎样解开这个‘紧箍咒’？各路安全厂商各显其能，但似乎路漫漫兮离目标还很遥远。”三未信安董事长张岳公在ZD至顶网《百位意见领袖寄语2017》中说出了这样一句话，我觉着很有道理。安全是一个永恒的话题，如果说它与新的信息技术相生相克也不过分。即便如此，我们更要尽可能的减少安全带来的束缚。2017已经到来，不妨来看看至顶网与业界大咖总
直方图梯度提升：大数据时代的极速决策引擎万事可爱^ 大数据机器学习深度学习直方图梯度提升 GBDT 算法
一、为什么需要直方图梯度提升？在Kaggle竞赛的冠军解决方案中，超过70%的获奖方案都使用了梯度提升算法。但当数据量突破百万级时，传统梯度提升树（GBDT）面临三大致命瓶颈：训练耗时剧增：每个特征的分割点计算都需要全量数据排序内存消耗爆炸：存储排序后的特征值需要额外空间处理效率低下：无法有效利用现代CPU的多核特性而梯度提升决策树（GBDT）作为集成学习的代表算法，通过迭代构建决策树实现预测能力
从原理到实践：Go 语言内存优化策略深度解析叶间清风1998 服务器 linux 网络
目录一、引言二、Go语言内存管理基础原理2.1栈与堆内存分配2.2垃圾回收机制剖析三、内存优化策略与实践3.1合理使用指针传递3.2避免不必要的内存分配3.3优化切片与映射的使用3.4控制变量作用域3.5减少闭包导致的变量逃逸四、内存优化工具与性能分析4.1pprof工具的使用4.2其他性能分析辅助手段五、不同场景下的内存优化案例分析5.1高并发Web服务场景5.2大数据处理与分析场景六、总结与展
硅谷企业的大数据平台架构什么样？看看Twitter、Airbnb、Uber的实践大数据v 分布式数据库大数据编程语言 hadoop
导读：本文分析一下典型硅谷互联网企业的大数据平台架构。作者：彭锋宋文欣孙浩峰来源：大数据DT（ID：hzdashuju）01Twitter的大数据平台架构Twitter是最早一批推进数字化运营的硅谷企业之一，其公司运营和产品迭代的很多功能是由其底层的大数据平台提供的。图7-2所示为Twitter大数据平台的基本示意图。▲图7-2Twitter大数据平台架构Twitter的大数据平台开发比较早，很多
【图像预处理】瞬间记忆深度学习 python
(4条消息)图像预处理方法总结_AI强仔的博客-CSDN博客对图像进行预处理的一些常见方法包括：调整图像大小和分辨率，以便适应模型的输入要求。对图像进行裁剪或填充，以使其大小和比例符合要求。调整图像的亮度、对比度和饱和度等图像属性。进行图像平滑或锐化操作，以去除噪声或增强图像特征。进行图像归一化或标准化，以确保各个特征在相同的尺度上。应用数据增强技术，如旋转、平移、缩放、翻转等，以扩大数据集，提高
大数据学习（75）-大数据组件总结 viperrrrrrr 大数据 impala yarn hdfs hive CDH mapreduce
大数据学习系列专栏：哲学语录:用力所能及，改变世界。如果觉得博主的文章还不错的话，请点赞+收藏⭐️+留言支持一下博主哦一、CDHCDH（ClouderaDistributionIncludingApacheHadoop)是由Cloudera公司提供的一个集成了ApacheHadoop以及相关生态系统的发行版本。CDH是一个大数据平台，简化和加速了大数据处理分析的部署和管理。CDH提供Hadoop的
大数据点燃智能制造变革之火——从数据到价值的跃迁 Echo_Wish 大数据高阶实战秘籍大数据制造
大数据点燃智能制造变革之火——从数据到价值的跃迁在全球制造业向智能化转型的浪潮中，大数据已然成为点燃变革的关键火种。从车间到供应链，从设备到产品生命周期，制造业正通过大数据分析找到隐形的效率优化机会，打破传统生产模式的桎梏。作为Echo_Wish，今天我将和大家探讨大数据如何融入智能制造，助力实现生产效率和业务价值的双重飞跃。一、智能制造的核心诉求：数据驱动的决策与执行智能制造的目标是通过数据驱动
Sqoop安装部署愿与狸花过一生大数据 sqoop hadoop hive
ApacheSqoop简介Sqoop（SQL-to-Hadoop）是Apache开源项目，主要用于：将关系型数据库中的数据导入Hadoop分布式文件系统（HDFS）或相关组件（如Hive、HBase）。将Hadoop处理后的数据导出回关系型数据库。核心特性批量数据传输支持从数据库表到HDFS/Hive的全量或增量数据迁移。并行化处理基于MapReduce实现并行导入导出，提升大数据量场景的效率。自
AI预测体彩排3新模型百十个定位预测+胆码预测+杀和尾+杀和值2025年3月21日第25弹 GIS小天体彩排3 人工智能机器学习彩票算法
前面由于工作原因停更了很长时间，停更期间很多彩友一直私信我何时恢复发布每日预测，目前手头上的项目已经基本收尾，接下来恢复发布。当然，也有很多朋友一直咨询3D超级助手开发的进度，在这里统一回复下。由于本人既精通编程+大数据分析，也热衷于彩票研究，所以很多彩友通过一些渠道找到了我。目前，加我的已有不少彩友，分成了3类人群：第一类：平时不懂数据分析，买彩全靠瞎猜乱蒙，这些朋友希望借助我的技术和方法来给他
Zynq PL端IP核之AXI DMA Mazy.v fpga开发嵌入式硬件 arm开发单片机
1.AXIDMA简介Zynq提供了两种DMA，一种是PS中的DMA控制器，通过GP口与PL端连接，另一种是PL中的AXIDMAIP核（软核），通过HP口与PS端连接。Zynq有4个HP接口，每一个HP接口都包含控制和数据FIFO，这些FIFO为大数据量突发传输提供缓冲，让HP接口成为理想的高速数据传输接口。AXIDMAIP内核在AXI4内存映射和AXI4StreamIP接口之间提供高带宽直接储存访
揭秘时空大数据：详细介绍、真实应用场景和数据示例解析陈书予 GIS开发（时空大数据）前端大数据 python 时序数据库
时空大数据(SpatialBigData)是指利用空间环境和时间环境信息，以及数字技术，从多种来源获取的海量、动态的、多维的数据，对空间环境和时间环境进行实时监测，并基于复杂的数据分析和挖掘，获取有价值的信息。时空大数据示例：1）社会网络数据：Twitter、Facebook、Instagram等社交媒体上的海量数据，可以通过时间、空间、主题等来提取有价值的信息。2）遥感图像数据：通过遥感技术从卫
python基于Django的旅游景点数据分析及可视化的设计与实现 7blk7 qq2295116502 python django 数据分析
目录项目介绍技术栈具体实现截图Scrapy爬虫框架关键技术和使用的工具环境等的说明解决的思路开发流程爬虫核心代码展示系统设计论文书写大纲详细视频演示源码获取项目介绍大数据分析是现下比较热门的词汇，通过分析之后可以得到更多深入且有价值的信息。现实的科技手段中，越来越多的应用都会涉及到大数据随着大数据时代的到来，数据挖掘、分析与应用成为多个行业的关键,本课题首先介绍了网络爬虫的基本概念以及技术实现方法
存算一体与存算分离：架构设计的深度解析与实现方案克里斯蒂亚诺罗纳尔多阿维罗大数据数据库
随着数据量的不断增大和对计算能力的需求日益提高，存算一体作为一种新型架构设计理念，在大数据处理、云计算和人工智能等领域正逐步引起广泛关注。在深入探讨存算一体之前，我们需要先了解存储和计算的基本概念，以及存算分离和存算一体之间的区别。什么是存算一体？存算一体，顾名思义，是将数据存储与计算资源紧密结合，形成一个统一的架构。在这种架构下，存储和计算不仅在物理层面上结合，更在架构设计上深度融合。具体来说，
LakeHouse湖仓一体成为下一站灯塔，数仓、数据湖架构即将退出群聊科杰科技大数据数据仓库
摘要：当前的大数据技术应用趋势表明，客户对单一的数据湖和数仓架构并不满意。近年来几乎所有的数据仓库都增加了对Parquet和ORC格式的外部表支持，这使数仓用户可以从相同的SQL引擎查询数据湖表，但它不会使数据湖表更易于管理，也不会消除仓库中数据的ETL复杂性、陈旧性和高级分析挑战。KeenDataLakeHouse（湖仓一体）作为新一代大数据技术架构，将逐渐取代单一数据湖和数仓架构，成为大数据架
数据让农业更聪明——用大数据激活田间地头 Echo_Wish 大数据大数据
数据让农业更聪明——用大数据激活田间地头在农业领域，随着人口增长和气候变化的影响，如何提升生产力始终是个关键话题。大数据，这个曾经只属于科技领域的概念，如今已悄然进入田间地头。今天，我以Echo_Wish的视角，和大家聊聊大数据如何赋能农业生产力，帮农民在阳光下也能掌握“科技的钥匙”。认识农业中的大数据什么是农业中的“大数据”？简单来说，就是收集和分析有关土地、气候、作物、病虫害以及市场需求等方面
优化Apache Spark性能之JVM参数配置指南 weixin_30777913 jvm spark 大数据开发语言性能优化
ApacheSpark运行在JVM之上，JVM的垃圾回收（GC）、内存管理以及堆外内存使用情况，会直接对Spark任务的执行效率产生影响。因此，合理配置JVM参数是优化Spark性能的关键步骤，以下将详细介绍优化策略和配置建议。通过以下优化方法，可以显著减少GC停顿时间、提升内存利用率，进而提高Spark作业吞吐量和数据处理效率。同时，要根据具体的工作负载和集群配置进行调整，并定期监控Spark应
GraphCube、Spark和深度学习技术赋能快消行业关键运营环节 weixin_30777913 开发语言大数据深度学习人工智能 spark
在快消品（FMCG）行业，需求计划（DemandPlanning）、库存管理（InventoryManagement）和需求供应管理（DemandSupplyManagement）是影响企业整体效率和利润水平的关键运营环节。GraphCube图多维数据集技术、Spark大数据分析处理技术和深度学习技术的结合，为这些环节提供了智能化、动态化和实时化的解决方案，显著提升业务运营效率和企业利润。一、技术
从 0 到 1 构建 Python 分布式爬虫，实现搜索引擎全攻略七七知享 Python python 分布式爬虫搜索引擎算法程序人生网络爬虫
从0到1构建Python分布式爬虫，实现搜索引擎全攻略在大数据与信息爆炸的时代，搜索引擎已然成为人们获取信息的关键入口。你是否好奇，像百度、谷歌这般强大的搜索引擎，背后是如何精准且高效地抓取海量网页数据的？本文将带你一探究竟，以Python为工具，打造属于自己的分布式爬虫，进而搭建一个简易搜索引擎，完整呈现从底层代码编写到系统搭建的全过程。通过本文的实践，我们成功打造了Python分布式爬虫，并以
第三十篇维度建模：从理论到落地的企业级实践随缘而动，随遇而安数据库 sql 数据仓库大数据数据库架构
目录一、维度建模核心理论体系1.1Kimball方法论四大支柱1.2关键概念对比矩阵二、四步建模法全流程解析2.1选择业务过程（以电商为例）2.2声明原子粒度（订单案例）2.3维度设计规范时间维度（含财年逻辑）SCDType2完整实现（Hudi）2.4事实表类型与设计三、企业级建模实战：电商用户分析3.1业务矩阵分析3.2模型实现代码四、高级建模技巧4.1多星型模式关联4.2大数据场景优化五、性能
计算机专业毕业设计题目推荐（新颖选题）本科计算机人工智能专业相关毕业设计选题大全✅ 会写代码的羊毕设选题课程设计人工智能毕业设计毕设题目毕业设计题目 ai AI编程
文章目录前言最新毕设选题（建议收藏起来）本科计算机人工智能专业相关的毕业设计选题毕设作品推荐前言2025全新毕业设计项目博主介绍：✌全网粉丝10W+,CSDN全栈领域优质创作者，博客之星、掘金/华为云/阿里云等平台优质作者。技术范围：SpringBoot、Vue、SSM、HLMT、Jsp、PHP、Nodejs、Python、爬虫、数据可视化、小程序、大数据、机器学习等设计与开发。主要内容：免费功能
深陷“大数据杀熟”漩涡的飞猪，庄卓然如何力挽狂澜？财经三剑客大数据
在线旅游市场（OTA）的蓬勃发展为消费者带来了诸多便利，然而，在这股数字化浪潮中，飞猪旅行却因其频繁陷入“大数据杀熟”的争议而备受瞩目。这一行为不仅损害了消费者的合法权益，更让飞猪的品牌形象蒙上了一层阴影。近年来，飞猪平台上关于价格乱象的投诉屡禁不止。在黑猫投诉平台上，与“飞猪”相关的投诉累计已超9万条，其中直接以“飞猪杀熟”为关键词的投诉便达数百条。消费者们纷纷反映，在飞猪平台上预订机票、酒店等
【新品发售】NVIDIA 发布全球最小个人 AI 超级计算机 DGX Spark segmentfault
GTC2025大会上，NVIDIA正式推出了搭载NVIDIAGraceBlackwell平台的个人AI超级计算机——DGXSpark。赞奇可接受预订，直接私信后台即刻预订！DGXSpark(前身为ProjectDIGITS)支持AI开发者、研究人员、数据科学家和学生，在台式电脑上对大模型进行原型设计、微调和推理。用户可以在本地运行这些模型，或将其部署在NVIDIADGXCloud或任何其他加速云或
多线程编程之卫生间周凡杨 java 并发卫生间线程厕所
如大家所知，火车上车厢的卫生间很小，每次只能容纳一个人，一个车厢只有一个卫生间，这个卫生间会被多个人同时使用，在实际使用时，当一个人进入卫生间时则会把卫生间锁上，等出来时打开门，下一个人进去把门锁上，如果有一个人在卫生间内部则别人的人发现门是锁的则只能在外面等待。问题分析：首先问题中有两个实体，一个是人，一个是厕所，所以设计程序时就可以设计两个类。人是多数的，厕所只有一个（暂且模拟的是一个车厢）。
How to Install GUI to Centos Minimal sunjing linux Install Desktop GUI
http://www.namhuy.net/475/how-to-install-gui-to-centos-minimal.html I have centos 6.3 minimal running as web server. I’m looking to install gui to my server to vnc to my server. You can insta
Shell 函数 daizj shell 函数
Shell 函数 linux shell 可以用户定义函数，然后在shell脚本中可以随便调用。 shell中函数的定义格式如下： [function] funname [()]{ action; [return int;] } 说明： 1、可以带function fun() 定义，也可以直接fun() 定义,不带任何参数。 2、参数返回
Linux服务器新手操作之一周凡杨 Linux 简单操作
1.whoami 当一个用户登录Linux系统之后，也许他想知道自己是发哪个用户登录的。此时可以使用whoami命令。 [ecuser@HA5-DZ05 ~]$ whoami e
浅谈Socket通信（一）朱辉辉33 socket
在java中ServerSocket用于服务器端，用来监听端口。通过服务器监听，客户端发送请求，双方建立链接后才能通信。当服务器和客户端建立链接后，两边都会产生一个Socket实例，我们可以通过操作Socket来建立通信。首先我建立一个ServerSocket对象。当然要导入java.net.ServerSocket包 ServerSock
关于框架的简单认识西蜀石兰框架
入职两个月多，依然是一个不会写代码的小白，每天的工作就是看代码，写wiki。前端接触CSS、HTML、JS等语言，一直在用的CS模型，自然免不了数据库的链接及使用，真心涉及框架，项目中用到的BootStrap算一个吧，哦，JQuery只能算半个框架吧，我更觉得它是另外一种语言。后台一直是纯Java代码，涉及的框架是Quzrtz和log4j。都说学前端的要知道三大框架，目前node.
You have an error in your SQL syntax; check the manual that corresponds to your 林鹤霄
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'option,changed_ids ) values('0ac91f167f754c8cbac00e9e3dc372
MySQL5.6的my.ini配置 aigo mysql
注意：以下配置的服务器硬件是：8核16G内存 [client] port=3306 [mysql] default-character-set=utf8 [mysqld] port=3306 basedir=D:/mysql-5.6.21-win
mysql 全文模糊查找便捷解决方案 alxw4616 mysql
mysql 全文模糊查找便捷解决方案 2013/6/14 by 半仙 [email protected] 目的: 项目需求实现模糊查找. 原则: 查询不能超过 1秒. 问题: 目标表中有超过1千万条记录. 使用like '%str%' 进行模糊查询无法达到性能需求. 解决方案: 使用mysql全文索引. 1.全文索引 : MySQL支持全文索引和搜索功能。MySQL中的全文索
自定义数据结构链表(单项 ,双向,环形) 百合不是茶单项链表双向链表
链表与动态数组的实现方式差不多, 数组适合快速删除某个元素链表则可以快速的保存数组并且可以是不连续的单项链表;数据从第一个指向最后一个实现代码: //定义动态链表 clas
threadLocal实例 bijian1013 java thread java多线程 threadLocal
实例1： package com.bijian.thread; public class MyThread extends Thread { private static ThreadLocal tl = new ThreadLocal() { protected synchronized Object initialValue() { return new Inte
activemq安全设置—设置admin的用户名和密码 bijian1013 java activemq
ActiveMQ使用的是jetty服务器, 打开conf/jetty.xml文件，找到 <bean id="adminSecurityConstraint" class="org.eclipse.jetty.util.security.Constraint"> <p
【Java范型一】Java范型详解之范型集合和自定义范型类 bit1129 java
本文详细介绍Java的范型，写一篇关于范型的博客原因有两个，前几天要写个范型方法(返回值根据传入的类型而定)，竟然想了半天，最后还是从网上找了个范型方法的写法；再者，前一段时间在看Gson, Gson这个JSON包的精华就在于对范型的优雅简单的处理，看它的源代码就比较迷糊，只其然不知其所以然。所以，还是花点时间系统的整理总结下范型吧。范型内容范型集合类范型类
【HBase十二】HFile存储的是一个列族的数据 bit1129 hbase
在HBase中，每个HFile存储的是一个表中一个列族的数据，也就是说，当一个表中有多个列簇时，针对每个列簇插入数据，最后产生的数据是多个HFile，每个对应一个列族，通过如下操作验证 1. 建立一个有两个列族的表 create 'members','colfam1','colfam2' 2. 在members表中的colfam1中插入50*5
Nginx 官方一个配置实例 ronin47 nginx 配置实例
user www www; worker_processes 5; error_log logs/error.log; pid logs/nginx.pid; worker_rlimit_nofile 8192; events { worker_connections 4096;} http { include conf/mim
java-15.输入一颗二元查找树，将该树转换为它的镜像，即在转换后的二元查找树中，左子树的结点都大于右子树的结点。用递归和循环 bylijinnan java
//use recursion public static void mirrorHelp1(Node node){ if(node==null)return; swapChild(node); mirrorHelp1(node.getLeft()); mirrorHelp1(node.getRight()); } //use no recursion bu
返回null还是empty bylijinnan java apache spring 编程
第一个问题，函数是应当返回null还是长度为0的数组（或集合）？第二个问题，函数输入参数不当时，是异常还是返回null？先看第一个问题有两个约定我觉得应当遵守： 1.返回零长度的数组或集合而不是null（详见《Effective Java》）理由就是，如果返回empty，就可以少了很多not-null判断： List<Person> list
[科技与项目]工作流厂商的战略机遇期 comsci 工作流
在新的战略平衡形成之前，这里有一个短暂的战略机遇期，只有大概最短6年，最长14年的时间，这段时间就好像我们森林里面的小动物，在秋天中，必须抓紧一切时间存储坚果一样，否则无法熬过漫长的冬季。。。。在微软，甲骨文，谷歌，IBM,SONY
过度设计-举例 cuityang 过度设计
过度设计，需要更多设计时间和测试成本，如无必要，还是尽量简洁一些好。未来的事情，比如访问量，比如数据库的容量，比如是否需要改成分布式都是无法预料的再举一个例子，对闰年的判断逻辑：　　1、 if($Year%4==0) return True; else return Fasle; 　　2、if ( ($Year%4==0 &am
java进阶，《Java性能优化权威指南》试读 darkblue086 java性能优化
记得当年随意读了微软出版社的.NET 2.0应用程序调试，才发现调试器如此强大，应用程序开发调试其实真的简单了很多，不仅仅是因为里面介绍了很多调试器工具的使用，更是因为里面寻找问题并重现问题的思想让我震撼，时隔多年，Java已经如日中天，成为许多大型企业应用的首选，而今天，这本《Java性能优化权威指南》让我再次找到了这种感觉，从不经意的开发过程让我刮目相看，原来性能调优不是简单地看看热点在哪里，
网络学习笔记初识OSI七层模型与TCP协议 dcj3sjt126com 学习笔记
协议：在计算机网络中通信各方面所达成的、共同遵守和执行的一系列约定　　计算机网络的体系结构：计算机网络的层次结构和各层协议的集合。　　两类服务：　　面向连接的服务通信双方在通信之前先建立某种状态，并在通信过程中维持这种状态的变化，同时为服务对象预先分配一定的资源。这种服务叫做面向连接的服务。　　面向无连接的服务通信双方在通信前后不建立和维持状态，不为服务对象
mac中用命令行运行mysql dcj3sjt126com mysql linux mac
参考这篇博客：http://www.cnblogs.com/macro-cheng/archive/2011/10/25/mysql-001.html 感觉workbench不好用（有点先入为主了）。 1，安装mysql 在mysql的官方网站下载 mysql 5.5.23 http://www.mysql.com/downloads/mysql/，根据我的机器的配置情况选择了64
MongDB查询（1）——基本查询[五] eksliang mongodb mongodb 查询 mongodb find
MongDB查询转载请出自出处：http://eksliang.iteye.com/blog/2174452 一、find简介 MongoDB中使用find来进行查询。 API:如下 function ( query , fields , limit , skip, batchSize, options ){.....} 参数含义： query:查询参数 fie
base64，加密解密经融加密，对接 y806839048 经融加密对接
String data0 = new String(Base64.encode(bo.getPaymentResult().getBytes(("GBK")))); String data1 = new String(Base64.decode(data0.toCharArray()),"GBK"); // 注意编码格式，注意用于加密，解密的要是同
JavaWeb之JSP概述 ihuning javaweb
什么是JSP？为什么使用JSP？ JSP表示Java Server Page，即嵌有Java代码的HTML页面。使用JSP是因为在HTML中嵌入Java代码比在Java代码中拼接字符串更容易、更方便和更高效。 JSP起源在很多动态网页中，绝大部分内容都是固定不变的，只有局部内容需要动态产生和改变。如果使用Servl
apple watch 指南啸笑天 apple
1. 文档 WatchKit Programming Guide（中译在线版 By @CocoaChina）译文译者原文概览 - 开始为 Apple Watch 进行开发 @星夜暮晨 Overview - Developing for Apple Watch 概览 - 配置 Xcode 项目 - Overview - Configuring Yo
java经典的基础题目 macroli java 编程
1.列举出 10个JAVA语言的优势 a:免费，开源，跨平台(平台独立性)，简单易用，功能完善，面向对象，健壮性，多线程，结构中立，企业应用的成熟平台, 无线应用 2.列举出JAVA中10个面向对象编程的术语 a:包，类，接口，对象，属性，方法，构造器，继承，封装，多态，抽象，范型 3.列举出JAVA中6个比较常用的包 Java.lang;java.util;java.io;java.sql;ja
你所不知道神奇的js replace正则表达式 qiaolevip 每天进步一点点学习永无止境纵观千象 regex
var v = 'C9CFBAA3CAD0'; console.log(v); var arr = v.split(''); for (var i = 0; i < arr.length; i ++) { if (i % 2 == 0) arr[i] = '%' + arr[i]; } console.log(arr.join('')); console.log(v.r
[一起学Hive]之十五-分析Hive表和分区的统计信息(Statistics) superlxw1234 hive hive分析表 hive统计信息 hive Statistics
关键字：Hive统计信息、分析Hive表、Hive Statistics 类似于Oracle的分析表，Hive中也提供了分析表和分区的功能，通过自动和手动分析Hive表，将Hive表的一些统计信息存储到元数据中。表和分区的统计信息主要包括：行数、文件数、原始数据大小、所占存储大小、最后一次操作时间等； 14.1 新表的统计信息对于一个新创建
Spring Boot 1.2.5 发布 wiselyman spring boot
Spring Boot 1.2.5已在7月2日发布，现在可以从spring的maven库和maven中心库下载。这个版本是一个维护的发布版，主要是一些修复以及将Spring的依赖提升至4.1.7(包含重要的安全修复)。官方建议所有的Spring Boot用户升级这个版本。项目首页 | 源