Reading HBase Data Sources with Spark

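When Spark reads HDFS-backed data sources, it relies heavily on the input machinery that MapReduce already provides. A MapReduce job depends on an InputFormat to validate the input specification and to split the input. Reading an HBase data source uses TableInputFormat. Let's look at InputFormat first.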

InputFormat

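InputFormat is the data-source abstraction that MapReduce provides. By implementing it, a job can read from all kinds of sources (file systems, databases, and so on) and run MapReduce computations over them.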

Here is the definition of InputFormat:

public abstract class InputFormat<K, V> {

  /**
   * Logically split the set of input files for the job.
   *
   * @param context job configuration.
   * @return an array of {@link InputSplit}s for the job.
   */
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  /**
   * Create a record reader for a given split. The framework will call
   * {@link RecordReader#initialize(InputSplit, TaskAttemptContext)} before
   * the split is used.
   */
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
                                                        TaskAttemptContext context)
      throws IOException, InterruptedException;

}

getSplits decides how the input is logically partitioned. createRecordReader returns a reader that iterates over the records of one split.
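To make the two-step contract concrete, here is a minimal sketch in plain Java. It is a toy model, not Hadoop code: ToyInputFormat, Split, and the fixed splitSize are hypothetical names introduced only for illustration. The input is an in-memory list, getSplits cuts it into index ranges, and createRecordReader iterates over one range.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy model of the InputFormat contract. None of these types are Hadoop classes.
final class ToyInputFormat {
    // A split is a half-open index range [start, end) over the input list.
    static final class Split {
        final int start;
        final int end;
        Split(int start, int end) { this.start = start; this.end = end; }
        @Override public String toString() { return "[" + start + ", " + end + ")"; }
    }

    private final List<String> data;
    private final int splitSize;

    ToyInputFormat(List<String> data, int splitSize) {
        if (splitSize <= 0) throw new IllegalArgumentException("splitSize must be positive");
        this.data = data;
        this.splitSize = splitSize;
    }

    // Counterpart of getSplits: decide the logical partitioning without reading records.
    List<Split> getSplits() {
        List<Split> splits = new ArrayList<>();
        for (int start = 0; start < data.size(); start += splitSize) {
            splits.add(new Split(start, Math.min(start + splitSize, data.size())));
        }
        return splits;
    }

    // Counterpart of createRecordReader: iterate over the records of one split only.
    Iterator<String> createRecordReader(Split split) {
        return data.subList(split.start, split.end).iterator();
    }

    public static void main(String[] args) {
        ToyInputFormat format = new ToyInputFormat(Arrays.asList("a", "b", "c", "d", "e"), 2);
        for (Split split : format.getSplits()) {
            StringBuilder records = new StringBuilder();
            Iterator<String> reader = format.createRecordReader(split);
            while (reader.hasNext()) records.append(reader.next());
            System.out.println(split + " -> " + records);
        }
    }
}
```

The point of the separation is that splits can be computed once on the driver side, while each reader is created later, wherever the task for that split happens to run.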

TableInputFormat

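TableInputFormat is the InputFormat implementation that HBase provides. Here is part of its split logic, inherited from TableInputFormatBase. The excerpt is the branch taken when no region start keys are returned, which produces a single split covering the whole table: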

RegionSizeCalculator sizeCalculator =
    new RegionSizeCalculator(getRegionLocator(), getAdmin());
TableName tableName = getTable().getName();
Pair<byte[][], byte[][]> keys = getStartEndKeys();
if (keys == null || keys.getFirst() == null ||
    keys.getFirst().length == 0) {
  HRegionLocation regLoc =
      getRegionLocator().getRegionLocation(HConstants.EMPTY_BYTE_ARRAY, false);
  if (null == regLoc) {
    throw new IOException("Expecting at least one region.");
  }
  List<InputSplit> splits = new ArrayList<>(1);
  // Size of this region in bytes; it becomes the split length, not a partition count
  long regionSize = sizeCalculator.getRegionSize(regLoc.getRegionInfo().getRegionName());
  // Build the TableSplit (HBase's InputSplit implementation) covering the whole key range
  TableSplit split = new TableSplit(tableName, scan,
      HConstants.EMPTY_BYTE_ARRAY, HConstants.EMPTY_BYTE_ARRAY, regLoc
          .getHostnamePort().split(Addressing.HOSTNAME_PORT_SEPARATOR)[0], regionSize);
  splits.add(split);
  return splits;
}
// ... the general path, which builds one TableSplit per region, is omitted from this excerpt

By default the strategy is one split per region, so the number of partitions follows the number of regions that the scan covers.
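The one-split-per-region idea can be sketched in plain Java as follows. This is a simplified model under stated assumptions, not the HBase implementation: RegionSplitSketch, KeyRange, and splitsFor are hypothetical names, plain strings stand in for byte[] row keys, and an empty string means "unbounded", as HBase's empty start and end keys do. Each region that overlaps the scan range yields one split, clipped to the scan range.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of "one split per region"; strings stand in for HBase row keys.
final class RegionSplitSketch {
    // Half-open key range [start, end). An empty string means "unbounded".
    static final class KeyRange {
        final String start;
        final String end;
        KeyRange(String start, String end) { this.start = start; this.end = end; }
        @Override public String toString() { return "[" + start + ", " + end + ")"; }
    }

    // Returns one split per region that overlaps the scan range, clipped to that range.
    static List<KeyRange> splitsFor(List<KeyRange> regions, KeyRange scan) {
        List<KeyRange> splits = new ArrayList<>();
        for (KeyRange region : regions) {
            boolean scanStartsBeforeRegionEnd = scan.start.isEmpty()
                || region.end.isEmpty() || scan.start.compareTo(region.end) < 0;
            boolean scanEndsAfterRegionStart =
                scan.end.isEmpty() || scan.end.compareTo(region.start) > 0;
            if (!scanStartsBeforeRegionEnd || !scanEndsAfterRegionStart) {
                continue; // this region lies entirely outside the scan range
            }
            // Later of the two start keys; the empty key sorts first, so compareTo suffices.
            String start = region.start.compareTo(scan.start) >= 0 ? region.start : scan.start;
            // Earlier of the two end keys, treating the empty key as "no upper bound".
            String end;
            if (scan.end.isEmpty()) {
                end = region.end;
            } else if (region.end.isEmpty()) {
                end = scan.end;
            } else {
                end = region.end.compareTo(scan.end) <= 0 ? region.end : scan.end;
            }
            splits.add(new KeyRange(start, end));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<KeyRange> regions = Arrays.asList(
            new KeyRange("", "g"), new KeyRange("g", "n"), new KeyRange("n", ""));
        System.out.println(splitsFor(regions, new KeyRange("", "")));   // full-table scan
        System.out.println(splitsFor(regions, new KeyRange("h", "p"))); // bounded scan
    }
}
```

In this model a full-table scan yields as many splits as there are regions, while a bounded scan only yields splits for the regions it touches, which is why narrowing the Scan's row range also reduces the number of Spark partitions.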

createRecordReader

public RecordReader<ImmutableBytesWritable, Result> createRecordReader(
      InputSplit split, TaskAttemptContext context)
  throws IOException {
    if (table == null) {
      throw new IOException("Cannot create a record reader because of a" +
          " previous error. Please look at the previous logs lines from" +
          " the task's full log for more details.");
    }
    TableSplit tSplit = (TableSplit) split;
    LOG.info("Input split length: " + StringUtils.humanReadableInt(tSplit.getLength()) + " bytes.");
    TableRecordReader trr = this.tableRecordReader;
    // if no table record reader was provided use default
    if (trr == null) {
      trr = new TableRecordReader();
    }
    Scan sc = new Scan(this.scan);
    sc.setStartRow(tSplit.getStartRow());
    sc.setStopRow(tSplit.getEndRow());
    trr.setScan(sc);
    trr.setHTable(table);
    return trr;
  }

Reading an HBase data source from Spark

sparkContext.newAPIHadoopRDD

val hBaseRDD = sc.newAPIHadoopRDD(hbaseConfig, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

SparkContext then creates an RDD:

new NewHadoopRDD(this, fClass, kClass, vClass, jconf)

Let's go straight to the getPartitions and compute methods of NewHadoopRDD:

override def getPartitions: Array[Partition] = {
  // Instantiate the InputFormat we passed in: TableInputFormat here, but it could be any InputFormat
  val inputFormat = inputFormatClass.newInstance
  inputFormat match {
    case configurable: Configurable =>
      configurable.setConf(_conf)
    case _ =>
  }
  val jobContext = new JobContextImpl(_conf, jobId)
  // Get all the splits
  val rawSplits = inputFormat.getSplits(jobContext).toArray
  // The number of splits becomes the number of Spark partitions
  val result = new Array[Partition](rawSplits.size)
  for (i <- 0 until rawSplits.size) {
    // Wrap each split in a Spark partition
    result(i) = new NewHadoopPartition(id, i, rawSplits(i).asInstanceOf[InputSplit with Writable])
  }
  result
}

compute

      private val format = inputFormatClass.newInstance
      format match {
        case configurable: Configurable =>
          configurable.setConf(conf)
        case _ =>
      }
      // Build the attempt ID and context that the MapReduce API requires
      private val attemptId = new TaskAttemptID(jobTrackerId, id, TaskType.MAP, split.index, 0)
      private val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId)
      private var finished = false
      private var reader =
      try {
        // Get the RecordReader, the key piece
        val _reader = format.createRecordReader(
          split.serializableHadoopSplit.value, hadoopAttemptContext)
        _reader.initialize(split.serializableHadoopSplit.value, hadoopAttemptContext)
        _reader
      } catch {
        case e: IOException if ignoreCorruptFiles =>
          logWarning(
            s"Skipped the rest content in the corrupted file: ${split.serializableHadoopSplit}",
            e)
          finished = true
          null
  }
 
// Build the iterator: hasNext and next
override def hasNext: Boolean = {
  if (!finished && !havePair) {
    try {
      finished = !reader.nextKeyValue
    } catch {
      case e: IOException if ignoreCorruptFiles =>
        logWarning(
          s"Skipped the rest content in the corrupted file: ${split.serializableHadoopSplit}",
          e)
        finished = true
    }
    if (finished) {
      // Close and release the reader here; close() will also be called when the task
      // completes, but for tasks that read from many files, it helps to release the
      // resources early.
      close()
    }
    havePair = !finished
  }
  !finished
}

override def next(): (K, V) = {
  if (!hasNext) {
    throw new java.util.NoSuchElementException("End of stream")
  }
  havePair = false
  if (!finished) {
    inputMetrics.incRecordsRead(1)
  }
  if (inputMetrics.recordsRead % SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS == 0) {
    updateBytesRead()
  }
  (reader.getCurrentKey, reader.getCurrentValue)
}
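The finished/havePair bookkeeping above adapts a reader that must be advanced before its current record can be fetched to the usual hasNext/next protocol. Below is a minimal sketch of that adapter in plain Java. SimpleReader, ArrayReader, and ReaderIterator are hypothetical names used only for illustration; error handling, metrics, and closing the reader are left out.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Minimal stand-in for a Hadoop-style reader: advance first, then fetch the current value.
interface SimpleReader<T> {
    boolean nextValue(); // true if a new current value is available
    T currentValue();
}

// Array-backed reader, used only to drive the example.
final class ArrayReader implements SimpleReader<Integer> {
    private final int[] values;
    private int index = -1;
    ArrayReader(int[] values) { this.values = values; }
    @Override public boolean nextValue() {
        if (index + 1 < values.length) { index++; return true; }
        return false;
    }
    @Override public Integer currentValue() { return values[index]; }
}

// Adapts the "advance, then get" protocol to java.util.Iterator using two flags.
final class ReaderIterator<T> implements Iterator<T> {
    private final SimpleReader<T> reader;
    private boolean finished = false; // the reader is exhausted
    private boolean havePair = false; // a value was fetched by hasNext but not yet consumed

    ReaderIterator(SimpleReader<T> reader) { this.reader = reader; }

    @Override public boolean hasNext() {
        // Advance only when no unconsumed value is pending, so repeated calls skip nothing.
        if (!finished && !havePair) {
            finished = !reader.nextValue();
            havePair = !finished;
        }
        return !finished;
    }

    @Override public T next() {
        if (!hasNext()) throw new NoSuchElementException("End of stream");
        havePair = false; // mark the pending value as consumed
        return reader.currentValue();
    }

    public static void main(String[] args) {
        Iterator<Integer> it = new ReaderIterator<>(new ArrayReader(new int[] {1, 2, 3}));
        while (it.hasNext()) System.out.println(it.next());
    }
}
```

The havePair flag is what makes hasNext idempotent: without it, every call to hasNext would advance the underlying reader and silently drop a record.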

HBaseContext

HBaseContext wraps the common HBase operations for Spark (bulk put, get, increment, delete, scan). Under the hood, its scan path is also built on NewHadoopRDD.

Scan versus reading HFiles

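For large reads from HBase, reading HFiles directly is the approach usually recommended. The path described here does not do that, though. When we call HBaseContext.hbaseRDD(tableName: TableName, scans: Scan): RDD[(ImmutableBytesWritable, Result)], it builds a NewHadoopRDD over TableInputFormat, so each partition runs a Scan over its split's row range (as createRecordReader above shows), and it is the region server that reads the HFiles.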


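Summary:

To stay compatible with MapReduce, Spark offers interfaces such as hadoopRDD(), and HBase offers classes such as TableInputFormat. Together they let Spark pull data out of HBase.

1. Spark can access HBase data through TableInputFormat.

2. The number of partitions Spark gets when reading HBase is determined by the number of regions in the HBase table.

3. HBase provides HBaseContext to simplify the API for reading HBase from Spark.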

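4. On this path Spark reads HBase through Scans served by the region servers, which in turn read the HFiles; Spark itself does not open the HFiles.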





