[Notes] Hadoop MapReduce InputFormat Analysis

The Hadoop MapReduce programming interface layer exposes five programmable components: InputFormat, Mapper, Partitioner, Reducer, and OutputFormat.

InputFormat
It describes the format of the input data and provides two functions:
  • Data splitting: divide the input data into splits (InputSplit); each split is dispatched to one Map task.
  • Record reading: create a RecordReader that extracts the records (key/value pairs) from a given split; the framework initializes the reader before the Mapper consumes the split, and each record becomes one input to the Mapper's map function.
    public abstract 
        List<InputSplit> getSplits(JobContext context
                                   ) throws IOException, InterruptedException;
    public abstract 
        RecordReader<K,V> createRecordReader(InputSplit split,
                                             TaskAttemptContext context
                                            ) throws IOException, 
                                                     InterruptedException;
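
    To make the contract concrete, here is a minimal sketch (my own illustration, not code from the original post or from Hadoop) of a custom InputFormat that satisfies both methods by delegating to TextInputFormat; the class name DelegatingInputFormat is hypothetical:

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class DelegatingInputFormat extends InputFormat<LongWritable, Text> {
      private final TextInputFormat delegate = new TextInputFormat();

      @Override
      public List<InputSplit> getSplits(JobContext context)
          throws IOException, InterruptedException {
        return delegate.getSplits(context);  // data splitting
      }

      @Override
      public RecordReader<LongWritable, Text> createRecordReader(
          InputSplit split, TaskAttemptContext context)
          throws IOException, InterruptedException {
        return delegate.createRecordReader(split, context);  // record reading
      }
    }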
    

    getSplits:
    Quoted from the Javadoc:
    Logically split the set of input files for the job.
    Each InputSplit is then assigned to an individual Mapper for processing.
    Note: The split is a logical split of the inputs and the input files are not physically split into chunks. For e.g. a split could be <input-file-path, start, offset> tuple. The InputFormat also creates the RecordReader to read the InputSplit.

    getSplits splits the input only logically; the files are not physically cut into pieces on disk. Each InputSplit records just the split's metadata: its start position, its length, the list of hosts holding the data, and so on.
    createRecordReader:
    Quoted from the Javadoc:
    Create a record reader for a given split. The framework will call RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is used.
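
    On the job side, an InputFormat is simply registered with the Job, and the framework takes care of calling getSplits and createRecordReader. A minimal driver sketch (assuming the Hadoop 2.x Job.getInstance style; the job name and input path are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class Driver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "inputformat-demo");
        job.setInputFormatClass(TextInputFormat.class);  // supplies splits and readers
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... set Mapper, Reducer, output types, then:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }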

    An example with FileInputFormat:
    [Figure 1: FileInputFormat example]

    Subclass hierarchy of InputFormat (org.apache.hadoop.mapreduce):
    [Figure 2: InputFormat subclass hierarchy]

    TextInputFormat Analysis
  • File split algorithm
    The file split algorithm determines how many InputSplits are produced and which range of the data each one covers. TextInputFormat extends FileInputFormat, which generates InputSplits on a per-file basis (a worked example follows the splitting loop below).
    From FileInputFormat:
    protected long computeSplitSize(long blockSize, long minSize,
                                      long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
      }
    The splitSize computation uses blockSize, minSize, and maxSize:
    blockSize: the HDFS block size of the file, 64 MB by default, set via dfs.block.size.
    minSize: the minimum InputSplit size, set via mapred.min.split.size, default 1.
    maxSize: the maximum InputSplit size, set via mapred.max.split.size, default Long.MAX_VALUE.
    Once splitSize is determined, FileInputFormat cuts the file into consecutive InputSplits of splitSize bytes each, and whatever remains at the end becomes one final InputSplit. Note the SPLIT_SLOP factor (1.1) in the loop below: splitting stops once the remainder is no more than 1.1 × splitSize, so the last split can be up to 10% larger than splitSize.
    long bytesRemaining = length;
    while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
      int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
      splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
                               blkLocations[blkIndex].getHosts()));
      bytesRemaining -= splitSize;
    }

    if (bytesRemaining != 0) {
      splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
                               blkLocations[blkLocations.length-1].getHosts()));
    }
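
    To see the arithmetic end to end, here is a self-contained worked example (plain Java, no Hadoop dependency; the 150 MB file size is hypothetical) that replays computeSplitSize and the splitting loop with the defaults above:

    public class SplitMath {
      static final double SPLIT_SLOP = 1.1;  // same slack factor as FileInputFormat

      static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
      }

      public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long length = 150 * mb;  // hypothetical file size
        long splitSize = computeSplitSize(64 * mb, 1L, Long.MAX_VALUE);  // = 64 MB

        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
          System.out.printf("split: start=%dMB length=%dMB%n",
              (length - bytesRemaining) / mb, splitSize / mb);
          bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {  // trailing split: 22 MB here
          System.out.printf("split: start=%dMB length=%dMB%n",
              (length - bytesRemaining) / mb, bytesRemaining / mb);
        }
      }
    }

    This prints three splits of 64 MB, 64 MB, and 22 MB. Had the remainder been between 64 MB and 70.4 MB (1.1 × splitSize), it would have been kept as a single, slightly oversized split rather than cut again.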
  • FileSplit
    FileSplit extends InputSplit and records the split's file, start offset, length, and the list of hosts holding the data.
      /** Constructs a split with host information
       *
       * @param file the file name
       * @param start the position of the first byte in the file to process
       * @param length the number of bytes in the file to process
       * @param hosts the list of hosts containing the block, possibly null
       */
      public FileSplit(Path file, long start, long length, String[] hosts) {
        this.file = file;
        this.start = start;
        this.length = length;
        this.hosts = hosts;
      }

    The hosts are obtained by asking the NameNode for all BlockLocations of the file the InputSplit belongs to, finding the blkIndex of the block that contains the split's start offset, and then reading the host list from that BlockLocation. A sketch of the index lookup follows.
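
    As a simplified reconstruction (my own sketch, not the verbatim Hadoop source) of what FileInputFormat.getBlockIndex does, the lookup scans the block list for the block whose byte range contains the offset:

    import org.apache.hadoop.fs.BlockLocation;

    public class BlockIndexSketch {
      // Returns the index of the block containing the given byte offset.
      static int blockIndexFor(BlockLocation[] blkLocations, long offset) {
        for (int i = 0; i < blkLocations.length; i++) {
          long start = blkLocations[i].getOffset();
          if (start <= offset && offset < start + blkLocations[i].getLength()) {
            return i;  // the split's first byte falls inside this block
          }
        }
        throw new IllegalArgumentException("offset " + offset + " is past the last block");
      }
    }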
  • LineRecordReader
      public RecordReader<LongWritable, Text> 
        createRecordReader(InputSplit split,
                           TaskAttemptContext context) {
        return new LineRecordReader();
      }

    LineRecordReader extends the RecordReader class and adapts the LineReader class. LineReader reads data from the stream (DFSClient.DFSInputStream.read(byte buf[], int off, int len)) into a byte-array buffer whose size is set by io.file.buffer.size (64 KB by default). When a record crosses a block boundary, the client re-locates to the node holding the next block and performs at least one more read, filling up to a buffer's worth of bytes from the newly located node:
                if (pos > blockEnd) {
                  currentNode = blockSeekTo(pos);
                }
                int realLen = (int) Math.min((long) len, (blockEnd - pos + 1L));
                int result = readBuffer(buf, off, realLen);
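
    To see the reader driven end to end, here is a minimal usage sketch (assuming Hadoop 2.x, where TaskAttemptContextImpl is available; "input.txt" is a placeholder path) that mimics what the framework does: initialize() once per split, then nextKeyValue() until the split is exhausted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.TaskAttemptID;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
    import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;

    public class LineRecordReaderDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path file = new Path("input.txt");  // placeholder input file
        long length = file.getFileSystem(conf).getFileStatus(file).getLen();
        FileSplit split = new FileSplit(file, 0, length, new String[0]);

        LineRecordReader reader = new LineRecordReader();
        TaskAttemptContext ctx = new TaskAttemptContextImpl(conf, new TaskAttemptID());
        reader.initialize(split, ctx);  // seeks to the split's start offset
        while (reader.nextKeyValue()) {  // one record per line
          // key = byte offset of the line, value = the line's text
          System.out.println(reader.getCurrentKey() + "\t" + reader.getCurrentValue());
        }
        reader.close();
      }
    }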
    
