Lesson 12: Spark Streaming Source Code Walkthrough of Executor Fault Tolerance

This lesson focuses on executor fault tolerance; driver fault tolerance is covered in the next lesson.
Executor fault tolerance is mainly about the safety of the data the executor receives; fault tolerance of the computation itself can rely entirely on the fault tolerance of the underlying RDDs. Data safety is critical to Spark Streaming for two reasons:
First, Spark Streaming continuously receives data, continuously generates jobs, and continuously submits jobs;
Second, since it runs on top of Spark Core, as long as the data is safe and reliable, failures at runtime can be recovered from automatically via RDD fault tolerance.

The simplest fault-tolerance mechanism is replication; the other is replay from the data source (that is, re-reading data for some past time window from the source). Replication itself comes in two flavors: configuring a replicated storage level so the BlockManager makes the copies, in which case data received by an executor is naturally replicated by the BlockManager machinery (this is the default), or backing up the data through a write-ahead log (WAL).

1. Replication via the BlockManager:
The default storage level is MEMORY_AND_DISK_SER_2: besides being kept in the memory (and on the disk) of the machine running the receiver's executor, a second copy of the received data is stored in the memory (and on the disk) of another executor. Take socketTextStream() as an example:

  /**
   * Create an input stream from TCP source hostname:port. Data is received using
   * a TCP socket and the received bytes are interpreted as UTF8 encoded `\n` delimited
   * lines.
   * @param hostname      Hostname to connect to for receiving data
   * @param port          Port to connect to for receiving data
   * @param storageLevel  Storage level to use for storing the received objects
   *                      (default: StorageLevel.MEMORY_AND_DISK_SER_2)
   */
  def socketTextStream(
      hostname: String,
      port: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
    ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
    socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
  }

The storageLevel used when creating the Receiver comes from here.
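So the replication factor is controlled entirely by the storage level the caller passes in. A minimal sketch of overriding it (assuming an already-constructed StreamingContext ssc; host and port are placeholders):

import org.apache.spark.storage.StorageLevel

// Keep the default 2 replicas but drop the disk tier: each received block then
// lives in the memory of the receiver's executor plus one other executor.
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER_2)

The storage level is then threaded into the receiver: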

private[streaming]
class SocketReceiver[T: ClassTag](
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends Receiver[T](storageLevel) with Logging {

In ReceiverSupervisorImpl, when the WAL is not enabled, a BlockManagerBasedBlockHandler is instantiated:

 private val receivedBlockHandler: ReceivedBlockHandler = {
    if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
      if (checkpointDirOption.isEmpty) {
        throw new SparkException(
          "Cannot enable receiver write-ahead log without checkpoint directory set. " +
            "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
            "See documentation for more details.")
      }
      new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
        receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
    } else {
      new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
    }
  }

BlockManagerBasedBlockHandler ultimately stores the data through the BlockManager:

/**
 * Implementation of a [[org.apache.spark.streaming.receiver.ReceivedBlockHandler]] which
 * stores the received blocks into a block manager with the specified storage level.
 */
private[streaming] class BlockManagerBasedBlockHandler(
    blockManager: BlockManager, storageLevel: StorageLevel)
  extends ReceivedBlockHandler with Logging {

  def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {

    var numRecords = None: Option[Long]

    val putResult: Seq[(BlockId, BlockStatus)] = block match {
      case ArrayBufferBlock(arrayBuffer) =>
        numRecords = Some(arrayBuffer.size.toLong)
        blockManager.putIterator(blockId, arrayBuffer.iterator, storageLevel,
          tellMaster = true)
      case IteratorBlock(iterator) =>
        val countIterator = new CountingIterator(iterator)
        val putResult = blockManager.putIterator(blockId, countIterator, storageLevel,
          tellMaster = true)
        numRecords = countIterator.count
        putResult
      case ByteBufferBlock(byteBuffer) =>
        blockManager.putBytes(blockId, byteBuffer, storageLevel, tellMaster = true)
      case o =>
        throw new SparkException(
          s"Could not store $blockId to block manager, unexpected block type ${o.getClass.getName}")
    }
    if (!putResult.map { _._1 }.contains(blockId)) {
      throw new SparkException(
        s"Could not store $blockId to block manager with storage level $storageLevel")
    }
    BlockManagerBasedStoreResult(blockId, numRecords)
  }

  def cleanupOldBlocks(threshTime: Long) {
    // this is not used as blocks inserted into the BlockManager are cleared by DStream's clearing
    // of BlockRDDs.
  }
}

2. Replication via the WAL:
In ReceiverSupervisorImpl, when the WAL is enabled, a WriteAheadLogBasedBlockHandler is instantiated instead.
As the code shows, the WAL mechanism writes under the checkpoint directory, which in production is usually on HDFS and thus gets 3 replicas by default. This is safe but relatively slow and may hurt performance, so it is not recommended when latency requirements are strict.
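Enabling the receiver WAL is purely a matter of configuration. A minimal sketch, assuming Spark 1.x and a hypothetical HDFS path:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// This flag is what WriteAheadLogUtils.enableReceiverLog() checks in the
// receivedBlockHandler snippet above.
val conf = new SparkConf()
  .setAppName("WALDemo")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(5))
// The WAL requires a checkpoint directory, otherwise a SparkException is thrown.
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoint")  // hypothetical path

With the WAL enabled, the handler below takes over: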

/**
 * Implementation of a [[org.apache.spark.streaming.receiver.ReceivedBlockHandler]] which
 * stores the received blocks in both a write ahead log and a block manager.
 */
private[streaming] class WriteAheadLogBasedBlockHandler(
    blockManager: BlockManager,
    streamId: Int,
    storageLevel: StorageLevel,
    conf: SparkConf,
    hadoopConf: Configuration,
    checkpointDir: String,
    clock: Clock = new SystemClock
  ) extends ReceivedBlockHandler with Logging {

  private val blockStoreTimeout = conf.getInt(
    "spark.streaming.receiver.blockStoreTimeout", 30).seconds

  private val effectiveStorageLevel = {
    if (storageLevel.deserialized) {
      logWarning(s"Storage level serialization ${storageLevel.deserialized} is not supported when" +
        s" write ahead log is enabled, change to serialization false")
    }
    // Since the WAL is already in place, a replicated storage level is unnecessary:
    // the checkpoint directory is on HDFS, which by default keeps 3 replicas.
    if (storageLevel.replication > 1) {
      logWarning(s"Storage level replication ${storageLevel.replication} is unnecessary when " +
        s"write ahead log is enabled, change to replication 1")
    }

    StorageLevel(storageLevel.useDisk, storageLevel.useMemory, storageLevel.useOffHeap, false, 1)
  }

  if (storageLevel != effectiveStorageLevel) {
    logWarning(s"User defined storage level $storageLevel is changed to effective storage level " +
      s"$effectiveStorageLevel when write ahead log is enabled")
  }

  // The write-ahead log used to persist the received data
  private val writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
    conf, checkpointDirToLogDir(checkpointDir, streamId), hadoopConf)

  // For processing futures used in parallel block storing into block manager and write ahead log
  // # threads = 2, so that both writing to BM and WAL can proceed in parallel
  implicit private val executionContext = ExecutionContext.fromExecutorService(
    ThreadUtils.newDaemonFixedThreadPool(2, this.getClass.getSimpleName))

  // Write to the BlockManager and the WAL in parallel
  /**
   * This implementation stores the block into the block manager as well as a write ahead log.
   * It does this in parallel, using Scala Futures, and returns only after the block has
   * been stored in both places.
   */
  def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {

    var numRecords = None: Option[Long]
    // Serialize the block so that it can be inserted into both
    val serializedBlock = block match {
      case ArrayBufferBlock(arrayBuffer) =>
        numRecords = Some(arrayBuffer.size.toLong)
        blockManager.dataSerialize(blockId, arrayBuffer.iterator)
      case IteratorBlock(iterator) =>
        val countIterator = new CountingIterator(iterator)
        val serializedBlock = blockManager.dataSerialize(blockId, countIterator)
        numRecords = countIterator.count
        serializedBlock
      case ByteBufferBlock(byteBuffer) =>
        byteBuffer
      case _ =>
        throw new Exception(s"Could not push $blockId to block manager, unexpected block type")
    }

    // Store the block in block manager
    val storeInBlockManagerFuture = Future {
      val putResult =
        blockManager.putBytes(blockId, serializedBlock, effectiveStorageLevel, tellMaster = true)
      if (!putResult.map { _._1 }.contains(blockId)) {
        throw new SparkException(
          s"Could not store $blockId to block manager with storage level $storageLevel")
      }
    }

    // Store the block in write ahead log
    val storeInWriteAheadLogFuture = Future {
      writeAheadLog.write(serializedBlock, clock.getTimeMillis())
    }

    // Combine the futures, wait for both to complete, and return the write ahead log record handle
    val combinedFuture = storeInBlockManagerFuture.zip(storeInWriteAheadLogFuture).map(_._2)
    val walRecordHandle = Await.result(combinedFuture, blockStoreTimeout)
    WriteAheadLogBasedStoreResult(blockId, numRecords, walRecordHandle)
  }

  def cleanupOldBlocks(threshTime: Long) {
    writeAheadLog.clean(threshTime, false)
  }

  def stop() {
    writeAheadLog.close()
    executionContext.shutdown()
  }
}

private[streaming] object WriteAheadLogBasedBlockHandler {
  def checkpointDirToLogDir(checkpointDir: String, streamId: Int): String = {
    new Path(checkpointDir, new Path("receivedData", streamId.toString)).toString
  }
}
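The companion object pins down the directory layout: each receiver gets its own subdirectory under the checkpoint directory. Purely illustrative (the helper is private[streaming]), with a hypothetical path:

// Receiver 0 of an application checkpointing to /ckpt writes its WAL
// segments under /ckpt/receivedData/0:
val logDir = WriteAheadLogBasedBlockHandler.checkpointDirToLogDir(
  "hdfs://namenode:8020/ckpt", streamId = 0)
// logDir == "hdfs://namenode:8020/ckpt/receivedData/0"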

Now let's look at WriteAheadLog, an abstract class. A WAL appends data sequentially and reads it back sequentially or randomly; there are no in-place updates or per-record deletes, and reads only need a cursor or pointer (the record handle), so it is quite fast:

/**
 * :: DeveloperApi ::
 *
 * This abstract class represents a write ahead log (aka journal) that is used by Spark Streaming
 * to save the received data (by receivers) and associated metadata to a reliable storage, so that
 * they can be recovered after driver failures. See the Spark documentation for more information
 * on how to plug in your own custom implementation of a write ahead log.
 */
@org.apache.spark.annotation.DeveloperApi
public abstract class WriteAheadLog {
  /**
   * Write the record to the log and return a record handle, which contains all the information
   * necessary to read back the written record. The time is used to the index the record,
   * such that it can be cleaned later. Note that implementations of this abstract class must
   * ensure that the written data is durable and readable (using the record handle) by the
   * time this function returns.
   */
  abstract public WriteAheadLogRecordHandle write(ByteBuffer record, long time);

  /**
   * Read a written record based on the given record handle.
   */
  abstract public ByteBuffer read(WriteAheadLogRecordHandle handle);

  /**
   * Read and return an iterator of all the records that have been written but not yet cleaned up.
   */
  abstract public Iterator<ByteBuffer> readAll();

  /**
   * Clean all the records that are older than the threshold time. It can wait for
   * the completion of the deletion.
   */
  abstract public void clean(long threshTime, boolean waitForCompletion);

  /**
   * Close this log and release any resources.
   */
  abstract public void close();
}
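The contract is small enough to sketch. Here is a deliberately non-durable, in-memory implementation, only to illustrate the handle-based protocol; a real plug-in, selected via spark.streaming.receiver.writeAheadLog.class, must persist the bytes to reliable storage before write() returns (the reflective constructor requirements are omitted here):

import java.nio.ByteBuffer
import java.util.{Iterator => JIterator}
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.streaming.util.{WriteAheadLog, WriteAheadLogRecordHandle}

// Hypothetical handle type: just an index into the in-memory buffer.
case class InMemoryHandle(index: Int) extends WriteAheadLogRecordHandle

class InMemoryWriteAheadLog extends WriteAheadLog {
  private val records = new ArrayBuffer[(Long, ByteBuffer)]

  // A real implementation must be durable before returning; this sketch only
  // keeps the bytes on the heap, so it is NOT fault tolerant.
  override def write(record: ByteBuffer, time: Long): WriteAheadLogRecordHandle =
    synchronized {
      records += ((time, record.duplicate()))
      InMemoryHandle(records.size - 1)
    }

  override def read(handle: WriteAheadLogRecordHandle): ByteBuffer = synchronized {
    records(handle.asInstanceOf[InMemoryHandle].index)._2.duplicate()
  }

  override def readAll(): JIterator[ByteBuffer] = synchronized {
    records.map(_._2.duplicate()).iterator.asJava
  }

  override def clean(threshTime: Long, waitForCompletion: Boolean): Unit =
    synchronized { /* a real WAL deletes records whose time < threshTime */ }

  override def close(): Unit = ()
}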

The handle returned by write() is a WriteAheadLogRecordHandle; one concrete implementation is FileBasedWriteAheadLogSegment:

/**
 * :: DeveloperApi ::
 *
 * This abstract class represents a handle that refers to a record written in a
 * {@link org.apache.spark.streaming.util.WriteAheadLog WriteAheadLog}.
 * It must contain all the information necessary for the record to be read and returned by
 * an implementation of the WriteAheadLog class.
 *
 * @see org.apache.spark.streaming.util.WriteAheadLog
 */
@org.apache.spark.annotation.DeveloperApi
public abstract class WriteAheadLogRecordHandle implements java.io.Serializable {
}
/** Class for representing a segment of data in a write ahead log file */
private[streaming] case class FileBasedWriteAheadLogSegment(path: String, offset: Long, length: Int)
  extends WriteAheadLogRecordHandle

Now for FileBasedWriteAheadLog, the concrete implementation of the WriteAheadLog abstract class. Note that although the comments below mention HDFS, any file system supported by Hadoop can be used:

/**
 * This class manages write ahead log files.
 *
 *  - Writes records (bytebuffers) to periodically rotating log files.
 *  - Recovers the log files and reads the recovered records upon failures.
 *  - Cleans up old log files.
 *
 * Uses [[org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter]] to write
 * and [[org.apache.spark.streaming.util.FileBasedWriteAheadLogReader]] to read.
 *
 * @param logDirectory Directory where rotating log files will be created.
 * @param hadoopConf Hadoop configuration for reading/writing log files.
 */
private[streaming] class FileBasedWriteAheadLog(
    conf: SparkConf,
    logDirectory: String,
    hadoopConf: Configuration,
    rollingIntervalSecs: Int,
    maxFailures: Int,
    closeFileAfterWrite: Boolean
  ) extends WriteAheadLog with Logging {

  import FileBasedWriteAheadLog._

  private val pastLogs = new ArrayBuffer[LogInfo]
  private val callerNameTag = getCallerName.map(c => s" for $c").getOrElse("")

  private val threadpoolName = s"WriteAheadLogManager $callerNameTag"
  private val threadpool = ThreadUtils.newDaemonCachedThreadPool(threadpoolName, 20)
  private val executionContext = ExecutionContext.fromExecutorService(threadpool)
  override protected val logName = s"WriteAheadLogManager $callerNameTag"

  private var currentLogPath: Option[String] = None
  private var currentLogWriter: FileBasedWriteAheadLogWriter = null
  private var currentLogWriterStartTime: Long = -1L
  private var currentLogWriterStopTime: Long = -1L

  initializeOrRecover()

  /**
   * Write a byte buffer to the log file. This method synchronously writes the data in the
   * ByteBuffer to HDFS. When this method returns, the data is guaranteed to have been flushed
   * to HDFS, and will be available for readers to read.
   */
  def write(byteBuffer: ByteBuffer, time: Long): FileBasedWriteAheadLogSegment = synchronized {
    var fileSegment: FileBasedWriteAheadLogSegment = null
    var failures = 0
    var lastException: Exception = null
    var succeeded = false
    while (!succeeded && failures < maxFailures) {
      try {
        fileSegment = getLogWriter(time).write(byteBuffer)
        if (closeFileAfterWrite) {
          resetWriter()
        }
        succeeded = true
      } catch {
        case ex: Exception =>
          lastException = ex
          logWarning("Failed to write to write ahead log")
          resetWriter()
          failures += 1
      }
    }
    if (fileSegment == null) {
      logError(s"Failed to write to write ahead log after $failures failures")
      throw lastException
    }
    fileSegment
  }

  def read(segment: WriteAheadLogRecordHandle): ByteBuffer = {
    val fileSegment = segment.asInstanceOf[FileBasedWriteAheadLogSegment]
    var reader: FileBasedWriteAheadLogRandomReader = null
    var byteBuffer: ByteBuffer = null
    try {
      reader = new FileBasedWriteAheadLogRandomReader(fileSegment.path, hadoopConf)
      byteBuffer = reader.read(fileSegment)
    } finally {
      reader.close()
    }
    byteBuffer
  }

  /**
   * Read all the existing logs from the log directory.
   *
   * Note that this is typically called when the caller is initializing and wants
   * to recover past state from the write ahead logs (that is, before making any writes).
   * If this is called after writes have been made using this manager, then it may not return
   * the latest the records. This does not deal with currently active log files, and
   * hence the implementation is kept simple.
   */
  def readAll(): JIterator[ByteBuffer] = synchronized {
    val logFilesToRead = pastLogs.map{ _.path} ++ currentLogPath
    logInfo("Reading from the logs:\n" + logFilesToRead.mkString("\n"))
    def readFile(file: String): Iterator[ByteBuffer] = {
      logDebug(s"Creating log reader with $file")
      val reader = new FileBasedWriteAheadLogReader(file, hadoopConf)
      CompletionIterator[ByteBuffer, Iterator[ByteBuffer]](reader, reader.close _)
    }
    if (!closeFileAfterWrite) {
      logFilesToRead.iterator.map(readFile).flatten.asJava
    } else {
      // For performance gains, it makes sense to parallelize the recovery if
      // closeFileAfterWrite = true
      seqToParIterator(threadpool, logFilesToRead, readFile).asJava
    }
  }

  /**
   * Delete the log files that are older than the threshold time.
   *
   * Its important to note that the threshold time is based on the time stamps used in the log
   * files, which is usually based on the local system time. So if there is coordination necessary
   * between the node calculating the threshTime (say, driver node), and the local system time
   * (say, worker node), the caller has to take account of possible time skew.
   *
   * If waitForCompletion is set to true, this method will return only after old logs have been
   * deleted. This should be set to true only for testing. Else the files will be deleted
   * asynchronously.
   */
  def clean(threshTime: Long, waitForCompletion: Boolean): Unit = {
    val oldLogFiles = synchronized {
      val expiredLogs = pastLogs.filter { _.endTime < threshTime }
      pastLogs --= expiredLogs
      expiredLogs
    }
    logInfo(s"Attempting to clear ${oldLogFiles.size} old log files in $logDirectory " +
      s"older than $threshTime: ${oldLogFiles.map { _.path }.mkString("\n")}")

    def deleteFile(walInfo: LogInfo): Unit = {
      try {
        val path = new Path(walInfo.path)
        val fs = HdfsUtils.getFileSystemForPath(path, hadoopConf)
        fs.delete(path, true)
        logDebug(s"Cleared log file $walInfo")
      } catch {
        case ex: Exception =>
          logWarning(s"Error clearing write ahead log file $walInfo", ex)
      }
      logInfo(s"Cleared log files in $logDirectory older than $threshTime")
    }
    oldLogFiles.foreach { logInfo =>
      if (!executionContext.isShutdown) {
        try {
          val f = Future { deleteFile(logInfo) }(executionContext)
          if (waitForCompletion) {
            import scala.concurrent.duration._
            Await.ready(f, 1 second)
          }
        } catch {
          case e: RejectedExecutionException =>
            logWarning("Execution context shutdown before deleting old WriteAheadLogs. " +
              "This would not affect recovery correctness.", e)
        }
      }
    }
  }


  /** Stop the manager, close any open log writer */
  def close(): Unit = synchronized {
    if (currentLogWriter != null) {
      currentLogWriter.close()
    }
    executionContext.shutdown()
    logInfo("Stopped write ahead log manager")
  }

  /** Get the current log writer while taking care of rotation */
  private def getLogWriter(currentTime: Long): FileBasedWriteAheadLogWriter = synchronized {
    if (currentLogWriter == null || currentTime > currentLogWriterStopTime) {
      resetWriter()
      currentLogPath.foreach {
        pastLogs += LogInfo(currentLogWriterStartTime, currentLogWriterStopTime, _)
      }
      currentLogWriterStartTime = currentTime
      currentLogWriterStopTime = currentTime + (rollingIntervalSecs * 1000)
      val newLogPath = new Path(logDirectory,
        timeToLogFile(currentLogWriterStartTime, currentLogWriterStopTime))
      currentLogPath = Some(newLogPath.toString)
      currentLogWriter = new FileBasedWriteAheadLogWriter(currentLogPath.get, hadoopConf)
    }
    currentLogWriter
  }

  /** Initialize the log directory or recover existing logs inside the directory */
  private def initializeOrRecover(): Unit = synchronized {
    val logDirectoryPath = new Path(logDirectory)
    val fileSystem = HdfsUtils.getFileSystemForPath(logDirectoryPath, hadoopConf)

    if (fileSystem.exists(logDirectoryPath) && fileSystem.getFileStatus(logDirectoryPath).isDir) {
      val logFileInfo = logFilesTologInfo(fileSystem.listStatus(logDirectoryPath).map { _.getPath })
      pastLogs.clear()
      pastLogs ++= logFileInfo
      logInfo(s"Recovered ${logFileInfo.size} write ahead log files from $logDirectory")
      logDebug(s"Recovered files are:\n${logFileInfo.map(_.path).mkString("\n")}")
    }
  }

  private def resetWriter(): Unit = synchronized {
    if (currentLogWriter != null) {
      currentLogWriter.close()
      currentLogWriter = null
    }
  }
}

private[streaming] object FileBasedWriteAheadLog {

  case class LogInfo(startTime: Long, endTime: Long, path: String)

  val logFileRegex = """log-(\d+)-(\d+)""".r

  def timeToLogFile(startTime: Long, stopTime: Long): String = {
    s"log-$startTime-$stopTime"
  }

  def getCallerName(): Option[String] = {
    val stackTraceClasses = Thread.currentThread.getStackTrace().map(_.getClassName)
    stackTraceClasses.find(!_.contains("WriteAheadLog")).flatMap(_.split("\\.").lastOption)
  }

  /** Convert a sequence of files to a sequence of sorted LogInfo objects */
  def logFilesTologInfo(files: Seq[Path]): Seq[LogInfo] = {
    files.flatMap { file =>
      logFileRegex.findFirstIn(file.getName()) match {
        case Some(logFileRegex(startTimeStr, stopTimeStr)) =>
          val startTime = startTimeStr.toLong
          val stopTime = stopTimeStr.toLong
          Some(LogInfo(startTime, stopTime, file.toString))
        case None =>
          None
      }
    }.sortBy { _.startTime }
  }

  /**
   * This creates an iterator from a parallel collection, by keeping at most `n` objects in memory
   * at any given time, where `n` is the size of the thread pool. This is crucial for use cases
   * where we create `FileBasedWriteAheadLogReader`s during parallel recovery. We don't want to
   * open up `k` streams altogether where `k` is the size of the Seq that we want to parallelize.
   */
  def seqToParIterator[I, O](
      tpool: ThreadPoolExecutor,
      source: Seq[I],
      handler: I => Iterator[O]): Iterator[O] = {
    val taskSupport = new ThreadPoolTaskSupport(tpool)
    val groupSize = tpool.getMaximumPoolSize.max(8)
    source.grouped(groupSize).flatMap { group =>
      val parallelCollection = group.par
      parallelCollection.tasksupport = taskSupport
      parallelCollection.map(handler)
    }.flatten
  }
}

The concrete reads go through FileBasedWriteAheadLogRandomReader (random access by segment) and FileBasedWriteAheadLogReader (sequential scans), while writes go through FileBasedWriteAheadLogWriter.
3. Replay from the data source:
In this case neither replication nor a WAL is needed, because Kafka itself acts like a durable file store. Kafka offers two modes: receiver-based and direct. In the receiver-based mode the metadata, such as consumed offsets, is managed via ZooKeeper; after a failure, Kafka can re-read from the recorded offset (data is not considered consumed until the offset update is acknowledged). This can lead to duplicate consumption: if data has been processed but the offset in ZooKeeper has not yet been updated, the data will be consumed again. That is why production systems increasingly use the direct mode, which manages offsets itself and can therefore achieve exactly-once semantics.
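For reference, the user-facing side of the direct approach looks roughly like this (a sketch assuming a StreamingContext ssc, placeholder brokers, and a hypothetical topic; the offset store is application-specific):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("logs"))  // "logs" is a placeholder topic

stream.foreachRDD { rdd =>
  // The direct stream's RDDs carry their offset ranges; persisting these
  // atomically together with the results gives end-to-end exactly-once.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} -> ${o.untilOffset}")
  }
}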

Let's look at the DirectKafkaInputDStream class:

/**
 * A stream of {@link org.apache.spark.streaming.kafka.KafkaRDD} where
 * each given Kafka topic/partition corresponds to an RDD partition.
 * The spark configuration spark.streaming.kafka.maxRatePerPartition gives the maximum number
 * of messages per second that each '''partition''' will accept.
 * Starting offsets are specified in advance,
 * and this DStream is not responsible for committing offsets,
 * so that you can control exactly-once semantics.
 * For an easy interface to Kafka-managed offsets,
 * see {@link org.apache.spark.streaming.kafka.KafkaCluster}
 * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
 * configuration parameters</a>.
 *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
 *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
 * @param fromOffsets per-topic/partition Kafka offsets defining the (inclusive)
 *  starting point of the stream
 * @param messageHandler function for translating each message into the desired type
 */
private[streaming]
class DirectKafkaInputDStream[
  K: ClassTag,
  V: ClassTag,
  U <: Decoder[K]: ClassTag,
  T <: Decoder[V]: ClassTag,
  R: ClassTag](
    ssc_ : StreamingContext,
    val kafkaParams: Map[String, String],
    val fromOffsets: Map[TopicAndPartition, Long],
    messageHandler: MessageAndMetadata[K, V] => R
  ) extends InputDStream[R](ssc_) with Logging {
  val maxRetries = context.sparkContext.getConf.getInt(
    "spark.streaming.kafka.maxRetries", 1)

  // Keep this consistent with how other streams are named (e.g. "Flume polling stream [2]")
  private[streaming] override def name: String = s"Kafka direct stream [$id]"

  protected[streaming] override val checkpointData =
    new DirectKafkaInputDStreamCheckpointData


  /**
   * Asynchronously maintains & sends new rate limits to the receiver through the receiver tracker.
   */
  override protected[streaming] val rateController: Option[RateController] = {
    if (RateController.isBackPressureEnabled(ssc.conf)) {
      Some(new DirectKafkaRateController(id,
        RateEstimator.create(ssc.conf, context.graph.batchDuration)))
    } else {
      None
    }
  }

  protected val kc = new KafkaCluster(kafkaParams)

  // Throttling: maxRateLimitPerPartition caps the per-partition ingest rate
  private val maxRateLimitPerPartition: Int = context.sparkContext.getConf.getInt(
      "spark.streaming.kafka.maxRatePerPartition", 0)
  protected def maxMessagesPerPartition: Option[Long] = {
    val estimatedRateLimit = rateController.map(_.getLatestRate().toInt)
    val numPartitions = currentOffsets.keys.size

    val effectiveRateLimitPerPartition = estimatedRateLimit
      .filter(_ > 0)
      .map { limit =>
        if (maxRateLimitPerPartition > 0) {
          Math.min(maxRateLimitPerPartition, (limit / numPartitions))
        } else {
          limit / numPartitions
        }
      }.getOrElse(maxRateLimitPerPartition)

    if (effectiveRateLimitPerPartition > 0) {
      val secsPerBatch = context.graph.batchDuration.milliseconds.toDouble / 1000
      Some((secsPerBatch * effectiveRateLimitPerPartition).toLong)
    } else {
      None
    }
  }

  protected var currentOffsets = fromOffsets

  @tailrec
  protected final def latestLeaderOffsets(retries: Int): Map[TopicAndPartition, LeaderOffset] = {
    val o = kc.getLatestLeaderOffsets(currentOffsets.keySet)
    // Either.fold would confuse @tailrec, do it manually
    if (o.isLeft) {
      val err = o.left.get.toString
      if (retries <= 0) {
        throw new SparkException(err)
      } else {
        log.error(err)
        Thread.sleep(kc.config.refreshLeaderBackoffMs)
        latestLeaderOffsets(retries - 1)
      }
    } else {
      o.right.get
    }
  }

  // limits the maximum number of messages per partition
  protected def clamp(
    leaderOffsets: Map[TopicAndPartition, LeaderOffset]): Map[TopicAndPartition, LeaderOffset] = {
    maxMessagesPerPartition.map { mmp =>
      leaderOffsets.map { case (tp, lo) =>
        tp -> lo.copy(offset = Math.min(currentOffsets(tp) + mmp, lo.offset))
      }
    }.getOrElse(leaderOffsets)
  }

  override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = {
    val untilOffsets = clamp(latestLeaderOffsets(maxRetries))
    // When DirectKafkaInputDStream computes a batch, it generates a KafkaRDD
    val rdd = KafkaRDD[K, V, U, T, R](
      context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler)

    // Report the record number and metadata of this batch interval to InputInfoTracker.
    val offsetRanges = currentOffsets.map { case (tp, fo) =>
      val uo = untilOffsets(tp)
      OffsetRange(tp.topic, tp.partition, fo, uo.offset)
    }
    val description = offsetRanges.filter { offsetRange =>
      // Don't display empty ranges.
      offsetRange.fromOffset != offsetRange.untilOffset
    }.map { offsetRange =>
      s"topic: ${offsetRange.topic}\tpartition: ${offsetRange.partition}\t" +
        s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}"
    }.mkString("\n")
    // Copy offsetRanges to immutable.List to prevent from being modified by the user
    val metadata = Map(
      "offsets" -> offsetRanges.toList,
      StreamInputInfo.METADATA_KEY_DESCRIPTION -> description)
    val inputInfo = StreamInputInfo(id, rdd.count, metadata)
    ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

    currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset)
    Some(rdd)
  }

  override def start(): Unit = {
  }

  def stop(): Unit = {
  }

  private[streaming]
  class DirectKafkaInputDStreamCheckpointData extends DStreamCheckpointData(this) {
    def batchForTime: mutable.HashMap[Time, Array[(String, Int, Long, Long)]] = {
      data.asInstanceOf[mutable.HashMap[Time, Array[OffsetRange.OffsetRangeTuple]]]
    }

    override def update(time: Time) {
      batchForTime.clear()
      generatedRDDs.foreach { kv =>
        val a = kv._2.asInstanceOf[KafkaRDD[K, V, U, T, R]].offsetRanges.map(_.toTuple).toArray
        batchForTime += kv._1 -> a
      }
    }

    override def cleanup(time: Time) { }

    override def restore() {
      // this is assuming that the topics don't change during execution, which is true currently
      val topics = fromOffsets.keySet
      val leaders = KafkaCluster.checkErrors(kc.findLeaders(topics))

      batchForTime.toSeq.sortBy(_._1)(Time.ordering).foreach { case (t, b) =>
         logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", "]")}")
         generatedRDDs += t -> new KafkaRDD[K, V, U, T, R](
           context.sparkContext, kafkaParams, b.map(OffsetRange(_)), leaders, messageHandler)
      }
    }
  }

  /**
   * A RateController to retrieve the rate from RateEstimator.
   */
  private[streaming] class DirectKafkaRateController(id: Int, estimator: RateEstimator)
    extends RateController(id, estimator) {
    override def publish(rate: Long): Unit = ()
  }
}

Every time a batch is generated, latestLeaderOffsets() is called to look up the latest offsets; the difference from the offsets the previous batch finished at gives this batch's offset ranges, which fully determine the RDD's input data.
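As a toy check of the clamp() arithmetic above (all numbers hypothetical):

// One partition currently at offset 10000, leader at offset 20000,
// spark.streaming.kafka.maxRatePerPartition = 1000, batch duration = 5s:
val currentOffset = 10000L
val leaderOffset  = 20000L
val maxMessages   = (5.0 * 1000).toLong  // secsPerBatch * effective rate limit
val untilOffset   = math.min(currentOffset + maxMessages, leaderOffset)
assert(untilOffset == 15000L)  // this batch reads offsets [10000, 15000)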
Now look at the KafkaRDD that compute() produces:

/**
 * A batch-oriented interface for consuming from Kafka.
 * Starting and ending offsets are specified in advance,
 * so that you can control exactly-once semantics.
 * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
 * configuration parameters</a>. Requires "metadata.broker.list" or "bootstrap.servers" to be set
 * with Kafka broker(s) specified in host1:port1,host2:port2 form.
 * @param offsetRanges offset ranges that define the Kafka data belonging to this RDD
 * @param messageHandler function for translating each message into the desired type
 */
private[kafka]
class KafkaRDD[
  K: ClassTag,
  V: ClassTag,
  U <: Decoder[_]: ClassTag,
  T <: Decoder[_]: ClassTag,
  R: ClassTag] private[spark] (
    sc: SparkContext,
    kafkaParams: Map[String, String],
    val offsetRanges: Array[OffsetRange],
    leaders: Map[TopicAndPartition, (String, Int)],
    messageHandler: MessageAndMetadata[K, V] => R
  ) extends RDD[R](sc, Nil) with Logging with HasOffsetRanges {
  override def getPartitions: Array[Partition] = {
    offsetRanges.zipWithIndex.map { case (o, i) =>
        val (host, port) = leaders(TopicAndPartition(o.topic, o.partition))
        new KafkaRDDPartition(i, o.topic, o.partition, o.fromOffset, o.untilOffset, host, port)
    }.toArray
  }

  override def count(): Long = offsetRanges.map(_.count).sum

  override def countApprox(
      timeout: Long,
      confidence: Double = 0.95
  ): PartialResult[BoundedDouble] = {
    val c = count
    new PartialResult(new BoundedDouble(c, 1.0, c, c), true)
  }

  override def isEmpty(): Boolean = count == 0L

  override def take(num: Int): Array[R] = {
    val nonEmptyPartitions = this.partitions
      .map(_.asInstanceOf[KafkaRDDPartition])
      .filter(_.count > 0)

    if (num < 1 || nonEmptyPartitions.size < 1) {
      return new Array[R](0)
    }

    // Determine in advance how many messages need to be taken from each partition
    val parts = nonEmptyPartitions.foldLeft(Map[Int, Int]()) { (result, part) =>
      val remain = num - result.values.sum
      if (remain > 0) {
        val taken = Math.min(remain, part.count)
        result + (part.index -> taken.toInt)
      } else {
        result
      }
    }

    val buf = new ArrayBuffer[R]
    val res = context.runJob(
      this,
      (tc: TaskContext, it: Iterator[R]) => it.take(parts(tc.partitionId)).toArray,
      parts.keys.toArray)
    res.foreach(buf ++= _)
    buf.toArray
  }

  override def getPreferredLocations(thePart: Partition): Seq[String] = {
    val part = thePart.asInstanceOf[KafkaRDDPartition]
    // TODO is additional hostname resolution necessary here
    Seq(part.host)
  }

  private def errBeginAfterEnd(part: KafkaRDDPartition): String =
    s"Beginning offset ${part.fromOffset} is after the ending offset ${part.untilOffset} " +
      s"for topic ${part.topic} partition ${part.partition}. " +
      "You either provided an invalid fromOffset, or the Kafka topic has been damaged"

  private def errRanOutBeforeEnd(part: KafkaRDDPartition): String =
    s"Ran out of messages before reaching ending offset ${part.untilOffset} " +
    s"for topic ${part.topic} partition ${part.partition} start ${part.fromOffset}." +
    " This should not happen, and indicates that messages may have been lost"

  private def errOvershotEnd(itemOffset: Long, part: KafkaRDDPartition): String =
    s"Got ${itemOffset} > ending offset ${part.untilOffset} " +
    s"for topic ${part.topic} partition ${part.partition} start ${part.fromOffset}." +
    " This should not happen, and indicates a message may have been skipped"

  override def compute(thePart: Partition, context: TaskContext): Iterator[R] = {
    val part = thePart.asInstanceOf[KafkaRDDPartition]
    assert(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part))
    if (part.fromOffset == part.untilOffset) {
      log.info(s"Beginning offset ${part.fromOffset} is the same as ending offset " +
        s"skipping ${part.topic} ${part.partition}")
      Iterator.empty
    } else {
      new KafkaRDDIterator(part, context)
    }
  }

  private class KafkaRDDIterator(
      part: KafkaRDDPartition,
      context: TaskContext) extends NextIterator[R] {

    context.addTaskCompletionListener{ context => closeIfNeeded() }

    log.info(s"Computing topic ${part.topic}, partition ${part.partition} " +
      s"offsets ${part.fromOffset} -> ${part.untilOffset}")

    val kc = new KafkaCluster(kafkaParams)
    val keyDecoder = classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
      .newInstance(kc.config.props)
      .asInstanceOf[Decoder[K]]
    val valueDecoder = classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
      .newInstance(kc.config.props)
      .asInstanceOf[Decoder[V]]
    val consumer = connectLeader
    var requestOffset = part.fromOffset
    var iter: Iterator[MessageAndOffset] = null

    // The idea is to use the provided preferred host, except on task retry attempts,
    // to minimize number of kafka metadata requests
    private def connectLeader: SimpleConsumer = {
      if (context.attemptNumber > 0) {
        kc.connectLeader(part.topic, part.partition).fold(
          errs => throw new SparkException(
            s"Couldn't connect to leader for topic ${part.topic} ${part.partition}: " +
              errs.mkString("\n")),
          consumer => consumer
        )
      } else {
        kc.connect(part.host, part.port)
      }
    }

    private def handleFetchErr(resp: FetchResponse) {
      if (resp.hasError) {
        val err = resp.errorCode(part.topic, part.partition)
        if (err == ErrorMapping.LeaderNotAvailableCode ||
          err == ErrorMapping.NotLeaderForPartitionCode) {
          log.error(s"Lost leader for topic ${part.topic} partition ${part.partition}, " +
            s" sleeping for ${kc.config.refreshLeaderBackoffMs}ms")
          Thread.sleep(kc.config.refreshLeaderBackoffMs)
        }
        // Let normal rdd retry sort out reconnect attempts
        throw ErrorMapping.exceptionFor(err)
      }
    }

    private def fetchBatch: Iterator[MessageAndOffset] = {
      val req = new FetchRequestBuilder()
        .addFetch(part.topic, part.partition, requestOffset, kc.config.fetchMessageMaxBytes)
        .build()
      val resp = consumer.fetch(req)
      handleFetchErr(resp)
      // kafka may return a batch that starts before the requested offset
      resp.messageSet(part.topic, part.partition)
        .iterator
        .dropWhile(_.offset < requestOffset)
    }

    override def close(): Unit = {
      if (consumer != null) {
        consumer.close()
      }
    }

    override def getNext(): R = {
      if (iter == null || !iter.hasNext) {
        iter = fetchBatch
      }
      if (!iter.hasNext) {
        assert(requestOffset == part.untilOffset, errRanOutBeforeEnd(part))
        finished = true
        null.asInstanceOf[R]
      } else {
        val item = iter.next()
        if (item.offset >= part.untilOffset) {
          assert(item.offset == part.untilOffset, errOvershotEnd(item.offset, part))
          finished = true
          null.asInstanceOf[R]
        } else {
          requestOffset = item.nextOffset
          messageHandler(new MessageAndMetadata(
            part.topic, part.partition, item.message, item.offset, keyDecoder, valueDecoder))
        }
      }
    }
  }
}

private[kafka]
object KafkaRDD {
  import KafkaCluster.LeaderOffset

  /**
   * @param kafkaParams Kafka configuration parameters.
   *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
   *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
   * @param fromOffsets per-topic/partition Kafka offsets defining the (inclusive)
   *  starting point of the batch
   * @param untilOffsets per-topic/partition Kafka offsets defining the (exclusive)
   *  ending point of the batch
   * @param messageHandler function for translating each message into the desired type
   */
  def apply[
    K: ClassTag,
    V: ClassTag,
    U <: Decoder[_]: ClassTag,
    T <: Decoder[_]: ClassTag,
    R: ClassTag](
      sc: SparkContext,
      kafkaParams: Map[String, String],
      fromOffsets: Map[TopicAndPartition, Long],
      untilOffsets: Map[TopicAndPartition, LeaderOffset],
      messageHandler: MessageAndMetadata[K, V] => R
    ): KafkaRDD[K, V, U, T, R] = {
    val leaders = untilOffsets.map { case (tp, lo) =>
        tp -> (lo.host, lo.port)
    }.toMap

    val offsetRanges = fromOffsets.map { case (tp, fo) =>
        val uo = untilOffsets(tp)
        OffsetRange(tp.topic, tp.partition, fo, uo.offset)
    }.toArray

    new KafkaRDD[K, V, U, T, R](sc, kafkaParams, offsetRanges, leaders, messageHandler)
  }
}

The Kafka direct approach is the classic fault-tolerance pattern.
It is worth noting that all fault tolerance costs some performance. Not every workload is intolerant of data loss (often losing a bounded fraction, say 5%, is acceptable), so when data-completeness requirements are modest there may be no need to configure extra fault tolerance at all.
One more point: if 1 block out of 1000 is lost, that still counts as a loss, and under the current mechanism all the blocks have to be re-read and reprocessed. This granularity is too coarse; it could be remedied by modifying the source code of the direct Kafka approach.

This post is based on Wang Jialin's course "Source Code Version Customization Release Class"; many thanks to him!
Everyone is welcome to discuss; let's learn and improve together!
