汤高

研究MapReduce源码之实现自定义LineRecordReader完成多行读取文件内容

TextInputFormat是Hadoop默认的数据输入格式,但是它只能一行一行的读记录，如果要读取多行怎么办？
很简单自己写一个输入格式，然后写一个对应的Recordreader就可以了，但是要实现确不是这么简单的

首先看看TextInputFormat是怎么实现一行一行读取的

大家看一看源码

public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader 
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
    return new LineRecordReader(recordDelimiterBytes);
  }
//这个对文件做压缩用的
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    final CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    if (null == codec) {
      return true;
    }
    return codec instanceof SplittableCompressionCodec;
  }

}

我们只要看第一个createRecordReader方法即可，从源码分析可知，它new了一个LineRecordReader，那么我们再来看看LineRecordReader的源码，看看这小子的内部世界

public class LineRecordReader extends RecordReader {
  private static final Log LOG = LogFactory.getLog(LineRecordReader.class);
  public static final String MAX_LINE_LENGTH = 
    "mapreduce.input.linerecordreader.line.maxlength";

  private long start;
  private long pos;
  private long end;
  private SplitLineReader in;
  private FSDataInputStream fileIn;
  private Seekable filePosition;
  private int maxLineLength;
  private LongWritable key;
  private Text value;
  private boolean isCompressedInput;
  private Decompressor decompressor;
  private byte[] recordDelimiterBytes;

  public LineRecordReader() {
  }

  public LineRecordReader(byte[] recordDelimiter) {
    this.recordDelimiterBytes = recordDelimiter;
  }

  public void initialize(InputSplit genericSplit,
                         TaskAttemptContext context) throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
    start = split.getStart();
    end = start + split.getLength();
    final Path file = split.getPath();

    // open the file and seek to the start of the split
    final FileSystem fs = file.getFileSystem(job);
    fileIn = fs.open(file);

    CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
    if (null!=codec) {
      isCompressedInput = true; 
      decompressor = CodecPool.getDecompressor(codec);
      if (codec instanceof SplittableCompressionCodec) {
        final SplitCompressionInputStream cIn =
          ((SplittableCompressionCodec)codec).createInputStream(
            fileIn, decompressor, start, end,
            SplittableCompressionCodec.READ_MODE.BYBLOCK);
        in = new CompressedSplitLineReader(cIn, job,
            this.recordDelimiterBytes);
        start = cIn.getAdjustedStart();
        end = cIn.getAdjustedEnd();
        filePosition = cIn;
      } else {
        in = new SplitLineReader(codec.createInputStream(fileIn,
            decompressor), job, this.recordDelimiterBytes);
        filePosition = fileIn;
      }
    } else {
      fileIn.seek(start);
      in = new UncompressedSplitLineReader(
          fileIn, job, this.recordDelimiterBytes, split.getLength());
      filePosition = fileIn;
    }
    // If this is not the first split, we always throw away first record
    // because we always (except the last split) read one extra line in
    // next() method.
    if (start != 0) {
      start += in.readLine(new Text(), 0, maxBytesToConsume(start));
    }
    this.pos = start;
  }


  private int maxBytesToConsume(long pos) {
    return isCompressedInput
      ? Integer.MAX_VALUE
      : (int) Math.max(Math.min(Integer.MAX_VALUE, end - pos), maxLineLength);
  }

  private long getFilePosition() throws IOException {
    long retVal;
    if (isCompressedInput && null != filePosition) {
      retVal = filePosition.getPos();
    } else {
      retVal = pos;
    }
    return retVal;
  }

  private int skipUtfByteOrderMark() throws IOException {
    // Strip BOM(Byte Order Mark)
    // Text only support UTF-8, we only need to check UTF-8 BOM
    // (0xEF,0xBB,0xBF) at the start of the text stream.
    int newMaxLineLength = (int) Math.min(3L + (long) maxLineLength,
        Integer.MAX_VALUE);
    int newSize = in.readLine(value, newMaxLineLength, maxBytesToConsume(pos));
    // Even we read 3 extra bytes for the first line,
    // we won't alter existing behavior (no backwards incompat issue).
    // Because the newSize is less than maxLineLength and
    // the number of bytes copied to Text is always no more than newSize.
    // If the return size from readLine is not less than maxLineLength,
    // we will discard the current line and read the next line.
    pos += newSize;
    int textLength = value.getLength();
    byte[] textBytes = value.getBytes();
    if ((textLength >= 3) && (textBytes[0] == (byte)0xEF) &&
        (textBytes[1] == (byte)0xBB) && (textBytes[2] == (byte)0xBF)) {
      // find UTF-8 BOM, strip it.
      LOG.info("Found UTF-8 BOM and skipped it");
      textLength -= 3;
      newSize -= 3;
      if (textLength > 0) {
        // It may work to use the same buffer and not do the copyBytes
        textBytes = value.copyBytes();
        value.set(textBytes, 3, textLength);
      } else {
        value.clear();
      }
    }
    return newSize;
  }

  public boolean nextKeyValue() throws IOException {
    if (key == null) {
      key = new LongWritable();
    }
    key.set(pos);
    if (value == null) {
      value = new Text();
    }
    int newSize = 0;
    // We always read one extra line, which lies outside the upper
    // split limit i.e. (end - 1)
    while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
      if (pos == 0) {
        newSize = skipUtfByteOrderMark();
      } else {
        newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
        pos += newSize;
      }

      if ((newSize == 0) || (newSize < maxLineLength)) {
        break;
      }

      // line too long. try again
      LOG.info("Skipped line of size " + newSize + " at pos " + 
               (pos - newSize));
    }
    if (newSize == 0) {
      key = null;
      value = null;
      return false;
    } else {
      return true;
    }
  }

  @Override
  public LongWritable getCurrentKey() {
    return key;
  }

  @Override
  public Text getCurrentValue() {
    return value;
  }

  /**
   * Get the progress within the split
   */
  public float getProgress() throws IOException {
    if (start == end) {
      return 0.0f;
    } else {
      return Math.min(1.0f, (getFilePosition() - start) / (float)(end - start));
    }
  }

  public synchronized void close() throws IOException {
    try {
      if (in != null) {
        in.close();
      }
    } finally {
      if (decompressor != null) {
        CodecPool.returnDecompressor(decompressor);
      }
    }
  }
}

（里面96-97行
in = new CompressedSplitLineReader(cIn, job, this.recordDelimiterBytes);

108-109行
in = new CompressedSplitLineReader(cIn, job,this.recordDelimiterBytes);

这是用来压缩的，大家不用管，知道有这回事就行了
其实这两个类都继承自SplitLineReader，是与压缩有关的，后面我们自定义的时候不用改，粘贴复制过来就行。
）
从上面发现了一个问题，看源码的第57行

private SplitLineReader in;

它引入了一个SplitLineReader 类,用这个小子来读取每一行，不信？你看源码的182-197行，如下（我的基于2.6.4版本的源码，不同的版本代码应该差别不大）

 while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
      if (pos == 0) {
        newSize = skipUtfByteOrderMark();
      } else {
        newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
        pos += newSize;
      }

      if ((newSize == 0) || (newSize < maxLineLength)) {
        break;
      }

      // line too long. try again
      LOG.info("Skipped line of size " + newSize + " at pos " + 
               (pos - newSize));
    }

发现没有 ===》 newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));

它用了SplitLineReader 里面的一个方法readLine来读取，所以我们就得继续跟踪去看看SplitLineReader 这个小子的庐山真面目，下面看SplitLineReader 的源码

public class SplitLineReader extends org.apache.hadoop.util.LineReader {
  public SplitLineReader(InputStream in, byte[] recordDelimiterBytes) {
    super(in, recordDelimiterBytes);
  }

  public SplitLineReader(InputStream in, Configuration conf,
      byte[] recordDelimiterBytes) throws IOException {
    super(in, conf, recordDelimiterBytes);
  }

  public boolean needAdditionalRecordAfterSplit() {
    return false;
  }
}

发现这家伙继承自LineReader（快接近我们的目标了，坚持看下去），发现这小子里面根本就没有readLine方法，大家是不是觉得我在忽悠大家，哈哈，我没有忽悠大家，它源码里面确实没有，但是但是，它可是继承了LineReader这个类，说不定他的父类LineReader有了，好，不信我们去看看LineReader
继续跟踪到LineReader的源码

public class LineReader implements Closeable {
  private static final int DEFAULT_BUFFER_SIZE = 64 * 1024;
  private int bufferSize = DEFAULT_BUFFER_SIZE;
  private InputStream in;
  private byte[] buffer;
  // the number of bytes of real data in the buffer
  private int bufferLength = 0;
  // the current position in the buffer
  private int bufferPosn = 0;

  private static final byte CR = '\r';
  private static final byte LF = '\n';

  // The line delimiter
  private final byte[] recordDelimiterBytes;

  /**
   * Create a line reader that reads from the given stream using the
   * default buffer-size (64k).
   * @param in The input stream
   * @throws IOException
   */
  public LineReader(InputStream in) {
    this(in, DEFAULT_BUFFER_SIZE);
  }

  /**
   * Create a line reader that reads from the given stream using the 
   * given buffer-size.
   * @param in The input stream
   * @param bufferSize Size of the read buffer
   * @throws IOException
   */
  public LineReader(InputStream in, int bufferSize) {
    this.in = in;
    this.bufferSize = bufferSize;
    this.buffer = new byte[this.bufferSize];
    this.recordDelimiterBytes = null;
  }

  /**
   * Create a line reader that reads from the given stream using the
   * io.file.buffer.size specified in the given
   * Configuration.
   * @param in input stream
   * @param conf configuration
   * @throws IOException
   */
  public LineReader(InputStream in, Configuration conf) throws IOException {
    this(in, conf.getInt("io.file.buffer.size", DEFAULT_BUFFER_SIZE));
  }

  /**
   * Create a line reader that reads from the given stream using the
   * default buffer-size, and using a custom delimiter of array of
   * bytes.
   * @param in The input stream
   * @param recordDelimiterBytes The delimiter
   */
  public LineReader(InputStream in, byte[] recordDelimiterBytes) {
    this.in = in;
    this.bufferSize = DEFAULT_BUFFER_SIZE;
    this.buffer = new byte[this.bufferSize];
    this.recordDelimiterBytes = recordDelimiterBytes;
  }

  /**
   * Create a line reader that reads from the given stream using the
   * given buffer-size, and using a custom delimiter of array of
   * bytes.
   * @param in The input stream
   * @param bufferSize Size of the read buffer
   * @param recordDelimiterBytes The delimiter
   * @throws IOException
   */
  public LineReader(InputStream in, int bufferSize,
      byte[] recordDelimiterBytes) {
    this.in = in;
    this.bufferSize = bufferSize;
    this.buffer = new byte[this.bufferSize];
    this.recordDelimiterBytes = recordDelimiterBytes;
  }

  /**
   * Create a line reader that reads from the given stream using the
   * io.file.buffer.size specified in the given
   * Configuration, and using a custom delimiter of array of
   * bytes.
   * @param in input stream
   * @param conf configuration
   * @param recordDelimiterBytes The delimiter
   * @throws IOException
   */
  public LineReader(InputStream in, Configuration conf,
      byte[] recordDelimiterBytes) throws IOException {
    this.in = in;
    this.bufferSize = conf.getInt("io.file.buffer.size", DEFAULT_BUFFER_SIZE);
    this.buffer = new byte[this.bufferSize];
    this.recordDelimiterBytes = recordDelimiterBytes;
  }


  /**
   * Close the underlying stream.
   * @throws IOException
   */
  public void close() throws IOException {
    in.close();
  }

  /**
   * Read one line from the InputStream into the given Text.
   *
   * @param str the object to store the given line (without newline)
   * @param maxLineLength the maximum number of bytes to store into str;
   *  the rest of the line is silently discarded.
   * @param maxBytesToConsume the maximum number of bytes to consume
   *  in this call.  This is only a hint, because if the line cross
   *  this threshold, we allow it to happen.  It can overshoot
   *  potentially by as much as one buffer length.
   *
   * @return the number of bytes read including the (longest) newline
   * found.
   *
   * @throws IOException if the underlying stream throws
   */
  public int readLine(Text str, int maxLineLength,
                      int maxBytesToConsume) throws IOException {
    if (this.recordDelimiterBytes != null) {
      return readCustomLine(str, maxLineLength, maxBytesToConsume);
    } else {
      return readDefaultLine(str, maxLineLength, maxBytesToConsume);
    }
  }

  protected int fillBuffer(InputStream in, byte[] buffer, boolean inDelimiter)
      throws IOException {
    return in.read(buffer);
  }

  /**
   * Read a line terminated by one of CR, LF, or CRLF.
   */
  private int readDefaultLine(Text str, int maxLineLength, int maxBytesToConsume)
  throws IOException {
    /* We're reading data from in, but the head of the stream may be
     * already buffered in buffer, so we have several cases:
     * 1. No newline characters are in the buffer, so we need to copy
     *    everything and read another buffer from the stream.
     * 2. An unambiguously terminated line is in buffer, so we just
     *    copy to str.
     * 3. Ambiguously terminated line is in buffer, i.e. buffer ends
     *    in CR.  In this case we copy everything up to CR to str, but
     *    we also need to see what follows CR: if it's LF, then we
     *    need consume LF as well, so next call to readLine will read
     *    from after that.
     * We use a flag prevCharCR to signal if previous character was CR
     * and, if it happens to be at the end of the buffer, delay
     * consuming it until we have a chance to look at the char that
     * follows.
     */
    str.clear();
    int txtLength = 0; //tracks str.getLength(), as an optimization
    int newlineLength = 0; //length of terminating newline
    boolean prevCharCR = false; //true of prev char was CR
    long bytesConsumed = 0;
    do {
      int startPosn = bufferPosn; //starting from where we left off the last time
      if (bufferPosn >= bufferLength) {
        startPosn = bufferPosn = 0;
        if (prevCharCR) {
          ++bytesConsumed; //account for CR from previous read
        }
        bufferLength = fillBuffer(in, buffer, prevCharCR);
        if (bufferLength <= 0) {
          break; // EOF
        }
      }
      for (; bufferPosn < bufferLength; ++bufferPosn) { //search for newline
        if (buffer[bufferPosn] == LF) {
          newlineLength = (prevCharCR) ? 2 : 1;
          ++bufferPosn; // at next invocation proceed from following byte
          break;
        }
        if (prevCharCR) { //CR + notLF, we are at notLF
          newlineLength = 1;
          break;
        }
        prevCharCR = (buffer[bufferPosn] == CR);
      }
      int readLength = bufferPosn - startPosn;
      if (prevCharCR && newlineLength == 0) {
        --readLength; //CR at the end of the buffer
      }
      bytesConsumed += readLength;
      int appendLength = readLength - newlineLength;
      if (appendLength > maxLineLength - txtLength) {
        appendLength = maxLineLength - txtLength;
      }
      if (appendLength > 0) {
        str.append(buffer, startPosn, appendLength);
        txtLength += appendLength;
      }
    } while (newlineLength == 0 && bytesConsumed < maxBytesToConsume);

    if (bytesConsumed > Integer.MAX_VALUE) {
      throw new IOException("Too many bytes before newline: " + bytesConsumed);
    }
    return (int)bytesConsumed;
  }

  /**
   * Read a line terminated by a custom delimiter.
   */
  private int readCustomLine(Text str, int maxLineLength, int maxBytesToConsume)
      throws IOException {
   /* We're reading data from inputStream, but the head of the stream may be
    *  already captured in the previous buffer, so we have several cases:
    * 
    * 1. The buffer tail does not contain any character sequence which
    *    matches with the head of delimiter. We count it as a 
    *    ambiguous byte count = 0
    *    
    * 2. The buffer tail contains a X number of characters,
    *    that forms a sequence, which matches with the
    *    head of delimiter. We count ambiguous byte count = X
    *    
    *    // ***  eg: A segment of input file is as follows
    *    
    *    " record 1792: I found this bug very interesting and
    *     I have completely read about it. record 1793: This bug
    *     can be solved easily record 1794: This ." 
    *    
    *    delimiter = "record";
    *        
    *    supposing:- String at the end of buffer =
    *    "I found this bug very interesting and I have completely re"
    *    There for next buffer = "ad about it. record 179       ...."           
    *     
    *     The matching characters in the input
    *     buffer tail and delimiter head = "re" 
    *     Therefore, ambiguous byte count = 2 ****   //
    *     
    *     2.1 If the following bytes are the remaining characters of
    *         the delimiter, then we have to capture only up to the starting 
    *         position of delimiter. That means, we need not include the 
    *         ambiguous characters in str.
    *     
    *     2.2 If the following bytes are not the remaining characters of
    *         the delimiter ( as mentioned in the example ), 
    *         then we have to include the ambiguous characters in str. 
    */
    str.clear();
    int txtLength = 0; // tracks str.getLength(), as an optimization
    long bytesConsumed = 0;
    int delPosn = 0;
    int ambiguousByteCount=0; // To capture the ambiguous characters count
    do {
      int startPosn = bufferPosn; // Start from previous end position
      if (bufferPosn >= bufferLength) {
        startPosn = bufferPosn = 0;
        bufferLength = fillBuffer(in, buffer, ambiguousByteCount > 0);
        if (bufferLength <= 0) {
          if (ambiguousByteCount > 0) {
            str.append(recordDelimiterBytes, 0, ambiguousByteCount);
            bytesConsumed += ambiguousByteCount;
          }
          break; // EOF
        }
      }
      for (; bufferPosn < bufferLength; ++bufferPosn) {
        if (buffer[bufferPosn] == recordDelimiterBytes[delPosn]) {
          delPosn++;
          if (delPosn >= recordDelimiterBytes.length) {
            bufferPosn++;
            break;
          }
        } else if (delPosn != 0) {
          bufferPosn--;
          delPosn = 0;
        }
      }
      int readLength = bufferPosn - startPosn;
      bytesConsumed += readLength;
      int appendLength = readLength - delPosn;
      if (appendLength > maxLineLength - txtLength) {
        appendLength = maxLineLength - txtLength;
      }
      bytesConsumed += ambiguousByteCount;
      if (appendLength >= 0 && ambiguousByteCount > 0) {
        //appending the ambiguous characters (refer case 2.2)
        str.append(recordDelimiterBytes, 0, ambiguousByteCount);
        ambiguousByteCount = 0;
        // since it is now certain that the split did not split a delimiter we
        // should not read the next record: clear the flag otherwise duplicate
        // records could be generated
        unsetNeedAdditionalRecordAfterSplit();
      }
      if (appendLength > 0) {
        str.append(buffer, startPosn, appendLength);
        txtLength += appendLength;
      }
      if (bufferPosn >= bufferLength) {
        if (delPosn > 0 && delPosn < recordDelimiterBytes.length) {
          ambiguousByteCount = delPosn;
          bytesConsumed -= ambiguousByteCount; //to be consumed in next
        }
      }
    } while (delPosn < recordDelimiterBytes.length 
        && bytesConsumed < maxBytesToConsume);
    if (bytesConsumed > Integer.MAX_VALUE) {
      throw new IOException("Too many bytes before delimiter: " + bytesConsumed);
    }
    return (int) bytesConsumed; 
  }

  /**
   * Read from the InputStream into the given Text.
   * @param str the object to store the given line
   * @param maxLineLength the maximum number of bytes to store into str.
   * @return the number of bytes read including the newline
   * @throws IOException if the underlying stream throws
   */
  public int readLine(Text str, int maxLineLength) throws IOException {
    return readLine(str, maxLineLength, Integer.MAX_VALUE);
  }

  /**
   * Read from the InputStream into the given Text.
   * @param str the object to store the given line
   * @return the number of bytes read including the newline
   * @throws IOException if the underlying stream throws
   */
  public int readLine(Text str) throws IOException {
    return readLine(str, Integer.MAX_VALUE, Integer.MAX_VALUE);
  }

  protected int getBufferPosn() {
    return bufferPosn;
  }

  protected int getBufferSize() {
    return bufferSize;
  }

  protected void unsetNeedAdditionalRecordAfterSplit() {
    // needed for custom multi byte line delimiters only
    // see MAPREDUCE-6549 for details
  }
}

它里面有很多方法，真的有我们要的readLine方法，说明我的推断没有错，没有忽悠大家，它重载了好几个readLine方法
其他我们不去管，我们只管对我们有用的，如下

 public int readLine(Text str, int maxLineLength,
                      int maxBytesToConsume) throws IOException {
    if (this.recordDelimiterBytes != null) {
      return readCustomLine(str, maxLineLength, maxBytesToConsume);
    } else {
      return readDefaultLine(str, maxLineLength, maxBytesToConsume);
    }
  }

它里面调用了readCustomLine方法和readDefaultLine方法，下面看看这两个方法

 private int readCustomLine(Text str, int maxLineLength, int maxBytesToConsume)
      throws IOException {
    str.clear();
    int txtLength = 0; // tracks str.getLength(), as an optimization
    long bytesConsumed = 0;
    int delPosn = 0;
    int ambiguousByteCount=0; // To capture the ambiguous characters count
    do {
      int startPosn = bufferPosn; // Start from previous end position
      if (bufferPosn >= bufferLength) {
        startPosn = bufferPosn = 0;
        bufferLength = fillBuffer(in, buffer, ambiguousByteCount > 0);
        if (bufferLength <= 0) {
          if (ambiguousByteCount > 0) {
            str.append(recordDelimiterBytes, 0, ambiguousByteCount);
            bytesConsumed += ambiguousByteCount;
          }
          break; // EOF
        }
      }
      for (; bufferPosn < bufferLength; ++bufferPosn) {
        if (buffer[bufferPosn] == recordDelimiterBytes[delPosn]) {
          delPosn++;
          if (delPosn >= recordDelimiterBytes.length) {
            bufferPosn++;
            break;
          }
        } else if (delPosn != 0) {
          bufferPosn--;
          delPosn = 0;
        }
      }
      int readLength = bufferPosn - startPosn;
      bytesConsumed += readLength;
      int appendLength = readLength - delPosn;
      if (appendLength > maxLineLength - txtLength) {
        appendLength = maxLineLength - txtLength;
      }
      bytesConsumed += ambiguousByteCount;
      if (appendLength >= 0 && ambiguousByteCount > 0) {
        //appending the ambiguous characters (refer case 2.2)
        str.append(recordDelimiterBytes, 0, ambiguousByteCount);
        ambiguousByteCount = 0;
        // since it is now certain that the split did not split a delimiter we
        // should not read the next record: clear the flag otherwise duplicate
        // records could be generated
        unsetNeedAdditionalRecordAfterSplit();
      }
      if (appendLength > 0) {
        str.append(buffer, startPosn, appendLength);
        txtLength += appendLength;
      }
      if (bufferPosn >= bufferLength) {
        if (delPosn > 0 && delPosn < recordDelimiterBytes.length) {
          ambiguousByteCount = delPosn;
          bytesConsumed -= ambiguousByteCount; //to be consumed in next
        }
      }
    } while (delPosn < recordDelimiterBytes.length 
        && bytesConsumed < maxBytesToConsume);
    if (bytesConsumed > Integer.MAX_VALUE) {
      throw new IOException("Too many bytes before delimiter: " + bytesConsumed);
    }
    return (int) bytesConsumed; 
  }

 private int readDefaultLine(Text str, int maxLineLength, int maxBytesToConsume)
  throws IOException {
    str.clear();
    int txtLength = 0; //tracks str.getLength(), as an optimization
    int newlineLength = 0; //length of terminating newline
    boolean prevCharCR = false; //true of prev char was CR
    long bytesConsumed = 0;
    do {
      int startPosn = bufferPosn; //starting from where we left off the last time
      if (bufferPosn >= bufferLength) {
        startPosn = bufferPosn = 0;
        if (prevCharCR) {
          ++bytesConsumed; //account for CR from previous read
        }
        bufferLength = fillBuffer(in, buffer, prevCharCR);
        if (bufferLength <= 0) {
          break; // EOF
        }
      }
      for (; bufferPosn < bufferLength; ++bufferPosn) { //search for newline
        if (buffer[bufferPosn] == LF) {
          newlineLength = (prevCharCR) ? 2 : 1;
          ++bufferPosn; // at next invocation proceed from following byte
          break;
        }
        if (prevCharCR) { //CR + notLF, we are at notLF
          newlineLength = 1;
          break;
        }
        prevCharCR = (buffer[bufferPosn] == CR);
      }
      int readLength = bufferPosn - startPosn;
      if (prevCharCR && newlineLength == 0) {
        --readLength; //CR at the end of the buffer
      }
      bytesConsumed += readLength;
      int appendLength = readLength - newlineLength;
      if (appendLength > maxLineLength - txtLength) {
        appendLength = maxLineLength - txtLength;
      }
      if (appendLength > 0) {
        str.append(buffer, startPosn, appendLength);
        txtLength += appendLength;
      }
    } while (newlineLength == 0 && bytesConsumed < maxBytesToConsume);

    if (bytesConsumed > Integer.MAX_VALUE) {
      throw new IOException("Too many bytes before newline: " + bytesConsumed);
    }
    return (int)bytesConsumed;
  }

注意注意readCustomLine和readDefaultLine方法的第一句都用了一句代码（重点，后面我们自定义时要改的地方）
他们都用了str.clear();
这句代码什么意思，意思是系统每读完一行，它会清空这一行的值！！！
如果我们自定义读取多行的时候，肯定不能清空它，因为我们需要它来计数第二行的位置
比如
123，
456
789，
111
如果一次读两行的话假如我把第一行清空了，那么我第二行的偏移量就得不到正确的值了，读出来的值本应该是
123，456
789，111
但是如果清空了的话就读出来少了一行
变成了
456
111

所以我们只有独到最后一行才清空值，它前面的行都不能清空

这就要求我们到时候自己重载一个方法了，到时候再说，大家接着看

有没有晕了，没有？那就好，我都说得这么清楚了，还晕的话，大家就先休息一下

我帮大家理一理：看都用到了哪些类

最开始TextInputFormat里面用到了LineRecordReader，LineRecordReader里面用到了SplitLineReader，而SplitLineReader里面用到了LineReader
自定义的时候思路按这个来
TextInputFormat–》LineRecordReader–》SplitLineReader–》LineReader

下面写第一个自定义的TextInputFormat

package com.my.input;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;


public class myInputFormat extends FileInputFormat {
    //用来压缩的
    @Override
      protected boolean isSplitable(JobContext context, Path file) {
        final CompressionCodec codec =
          new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        if (null == codec) {
          return true;
        }
        return codec instanceof SplittableCompressionCodec;
      }

    @Override
    public RecordReader createRecordReader(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException, InterruptedException {
        context.setStatus(genericSplit.toString());
        return new MyRecordReader(context.getConfiguration());
    }

}

接着写一个自定义的LineRecordReader
其中修改了182行开始的以下代码

因为我这里要实现输出多行，所以写了一个for循环,又由于我前面说得前面的行不能清空，所以要加一个boolean标志量
把

 while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
      if (pos == 0) {
        newSize = skipUtfByteOrderMark();
      } else {
        newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
        pos += newSize;
      }

      if ((newSize == 0) || (newSize < maxLineLength)) {
        break;
      }

      // line too long. try again
      LOG.info("Skipped line of size " + newSize + " at pos " + 
               (pos - newSize));
    }

改为

boolean clear = true;
        for (int i = 1; i <= 2; i++) {
            if (i == 2) {
                clear = false;
            }
            while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
                if (pos == 0) {
                    newSize = skipUtfByteOrderMark();
                } else {
                    newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos), clear);
                    pos += newSize;
                }

                if ((newSize == 0) || (newSize < maxLineLength)) {
                    break;
                }

                // line too long. try again
                LOG.info("Skipped line of size " + newSize + " at pos " + (pos - newSize));
            }
        }

再写自定义SplitLineReader

package com.my.lingRecordReader;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;

@InterfaceAudience.Private
@InterfaceStability.Unstable
public class MySplitLineReader extends MyLineReader {
  public MySplitLineReader(InputStream in, byte[] recordDelimiterBytes) {
    super(in, recordDelimiterBytes);
  }

  public MySplitLineReader(InputStream in, Configuration conf,
      byte[] recordDelimiterBytes) throws IOException {
    super(in, conf, recordDelimiterBytes);
  }

  public boolean needAdditionalRecordAfterSplit() {
    return false;
  }
}

最后写自定义LineReader
它重载了一个readLine(Text str, int maxLineLength, int maxBytesToConsume, boolean clear)方法来实现不清空前面读取的行的值

package com.my.lingRecordReader;

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;

/**
 * A class that provides a line reader from an input stream. Depending on the
 * constructor used, lines will either be terminated by:
 * 
 * one of the following: '\n' (LF) , '\r' (CR), or '\r\n' (CR+LF).
 * or, a custom byte sequence delimiter
 * 
 * In both cases, EOF also terminates an otherwise unterminated line.
 */
@InterfaceAudience.LimitedPrivate({ "MapReduce" })
@InterfaceStability.Unstable
public class MyLineReader implements Closeable {
    private static final int DEFAULT_BUFFER_SIZE = 64 * 1024;
    private int bufferSize = DEFAULT_BUFFER_SIZE;
    private InputStream in;
    private byte[] buffer;
    // the number of bytes of real data in the buffer
    private int bufferLength = 0;
    // the current position in the buffer
    private int bufferPosn = 0;

    private static final byte CR = '\r';
    private static final byte LF = '\n';

    // The line delimiter
    private final byte[] recordDelimiterBytes;

    /**
     * Create a line reader that reads from the given stream using the default
     * buffer-size (64k).
     * 
     * @param in
     *            The input stream
     * @throws IOException
     */
    public MyLineReader(InputStream in) {
        this(in, DEFAULT_BUFFER_SIZE);
    }

    /**
     * Create a line reader that reads from the given stream using the given
     * buffer-size.
     * 
     * @param in
     *            The input stream
     * @param bufferSize
     *            Size of the read buffer
     * @throws IOException
     */
    public MyLineReader(InputStream in, int bufferSize) {
        this.in = in;
        this.bufferSize = bufferSize;
        this.buffer = new byte[this.bufferSize];
        this.recordDelimiterBytes = null;
    }

    /**
     * Create a line reader that reads from the given stream using the
     * io.file.buffer.size specified in the given
     * Configuration.
     * 
     * @param in
     *            input stream
     * @param conf
     *            configuration
     * @throws IOException
     */
    public MyLineReader(InputStream in, Configuration conf) throws IOException {
        this(in, conf.getInt("io.file.buffer.size", DEFAULT_BUFFER_SIZE));
    }

    /**
     * Create a line reader that reads from the given stream using the default
     * buffer-size, and using a custom delimiter of array of bytes.
     * 
     * @param in
     *            The input stream
     * @param recordDelimiterBytes
     *            The delimiter
     */
    public MyLineReader(InputStream in, byte[] recordDelimiterBytes) {
        this.in = in;
        this.bufferSize = DEFAULT_BUFFER_SIZE;
        this.buffer = new byte[this.bufferSize];
        this.recordDelimiterBytes = recordDelimiterBytes;
    }

    /**
     * Create a line reader that reads from the given stream using the given
     * buffer-size, and using a custom delimiter of array of bytes.
     * 
     * @param in
     *            The input stream
     * @param bufferSize
     *            Size of the read buffer
     * @param recordDelimiterBytes
     *            The delimiter
     * @throws IOException
     */
    public MyLineReader(InputStream in, int bufferSize, byte[] recordDelimiterBytes) {
        this.in = in;
        this.bufferSize = bufferSize;
        this.buffer = new byte[this.bufferSize];
        this.recordDelimiterBytes = recordDelimiterBytes;
    }

    /**
     * Create a line reader that reads from the given stream using the
     * io.file.buffer.size specified in the given
     * Configuration, and using a custom delimiter of array of
     * bytes.
     * 
     * @param in
     *            input stream
     * @param conf
     *            configuration
     * @param recordDelimiterBytes
     *            The delimiter
     * @throws IOException
     */
    public MyLineReader(InputStream in, Configuration conf, byte[] recordDelimiterBytes) throws IOException {
        this.in = in;
        this.bufferSize = conf.getInt("io.file.buffer.size", DEFAULT_BUFFER_SIZE);
        this.buffer = new byte[this.bufferSize];
        this.recordDelimiterBytes = recordDelimiterBytes;
    }

    /**
     * Close the underlying stream.
     * 
     * @throws IOException
     */
    public void close() throws IOException {
        in.close();
    }

    /**
     * Read one line from the InputStream into the given Text.
     *
     * @param str
     *            the object to store the given line (without newline)
     * @param maxLineLength
     *            the maximum number of bytes to store into str; the rest of the
     *            line is silently discarded.
     * @param maxBytesToConsume
     *            the maximum number of bytes to consume in this call. This is
     *            only a hint, because if the line cross this threshold, we
     *            allow it to happen. It can overshoot potentially by as much as
     *            one buffer length.
     *
     * @return the number of bytes read including the (longest) newline found.
     *
     * @throws IOException
     *             if the underlying stream throws
     */
    public int readLine(Text str, int maxLineLength, int maxBytesToConsume) throws IOException {
        if (this.recordDelimiterBytes != null) {
            return readCustomLine(str, maxLineLength, maxBytesToConsume);
        } else {
            return readDefaultLine(str, maxLineLength, maxBytesToConsume);
        }
    }

    public int readLine(Text str, int maxLineLength, int maxBytesToConsume, boolean clear) throws IOException {
        if (this.recordDelimiterBytes != null) {
            return readCustomLine(str, maxLineLength, maxBytesToConsume,clear);
        } else {
            return readDefaultLine(str, maxLineLength, maxBytesToConsume,clear);
        }
    }

    protected int fillBuffer(InputStream in, byte[] buffer, boolean inDelimiter) throws IOException {
        return in.read(buffer);
    }

    private int readDefaultLine(Text str, int maxLineLength, int maxBytesToConsume, boolean clear) throws IOException {
        if (clear) {
            str.clear();
        }
        int txtLength = 0; // tracks str.getLength(), as an optimization
        int newlineLength = 0; // length of terminating newline
        boolean prevCharCR = false; // true of prev char was CR
        long bytesConsumed = 0;
        do {
            int startPosn = bufferPosn; // starting from where we left off the
                                        // last time
            if (bufferPosn >= bufferLength) {
                startPosn = bufferPosn = 0;
                if (prevCharCR) {
                    ++bytesConsumed; // account for CR from previous read
                }
                bufferLength = fillBuffer(in, buffer, prevCharCR);
                if (bufferLength <= 0) {
                    break; // EOF
                }
            }
            for (; bufferPosn < bufferLength; ++bufferPosn) { // search for
                                                                // newline
                if (buffer[bufferPosn] == LF) {
                    newlineLength = (prevCharCR) ? 2 : 1;
                    ++bufferPosn; // at next invocation proceed from following
                                    // byte
                    break;
                }
                if (prevCharCR) { // CR + notLF, we are at notLF
                    newlineLength = 1;
                    break;
                }
                prevCharCR = (buffer[bufferPosn] == CR);
            }
            int readLength = bufferPosn - startPosn;
            if (prevCharCR && newlineLength == 0) {
                --readLength; // CR at the end of the buffer
            }
            bytesConsumed += readLength;
            int appendLength = readLength - newlineLength;
            if (appendLength > maxLineLength - txtLength) {
                appendLength = maxLineLength - txtLength;
            }
            if (appendLength > 0) {
                str.append(buffer, startPosn, appendLength);
                txtLength += appendLength;
            }
        } while (newlineLength == 0 && bytesConsumed < maxBytesToConsume);

        if (bytesConsumed > Integer.MAX_VALUE) {
            throw new IOException("Too many bytes before newline: " + bytesConsumed);
        }
        return (int) bytesConsumed;
    }

    /**
     * Read a line terminated by one of CR, LF, or CRLF.
     */
    private int readDefaultLine(Text str, int maxLineLength, int maxBytesToConsume) throws IOException {
        str.clear();
        int txtLength = 0; // tracks str.getLength(), as an optimization
        int newlineLength = 0; // length of terminating newline
        boolean prevCharCR = false; // true of prev char was CR
        long bytesConsumed = 0;
        do {
            int startPosn = bufferPosn; // starting from where we left off the
                                        // last time
            if (bufferPosn >= bufferLength) {
                startPosn = bufferPosn = 0;
                if (prevCharCR) {
                    ++bytesConsumed; // account for CR from previous read
                }
                bufferLength = fillBuffer(in, buffer, prevCharCR);
                if (bufferLength <= 0) {
                    break; // EOF
                }
            }
            for (; bufferPosn < bufferLength; ++bufferPosn) { // search for
                                                                // newline
                if (buffer[bufferPosn] == LF) {
                    newlineLength = (prevCharCR) ? 2 : 1;
                    ++bufferPosn; // at next invocation proceed from following
                                    // byte
                    break;
                }
                if (prevCharCR) { // CR + notLF, we are at notLF
                    newlineLength = 1;
                    break;
                }
                prevCharCR = (buffer[bufferPosn] == CR);
            }
            int readLength = bufferPosn - startPosn;
            if (prevCharCR && newlineLength == 0) {
                --readLength; // CR at the end of the buffer
            }
            bytesConsumed += readLength;
            int appendLength = readLength - newlineLength;
            if (appendLength > maxLineLength - txtLength) {
                appendLength = maxLineLength - txtLength;
            }
            if (appendLength > 0) {
                str.append(buffer, startPosn, appendLength);
                txtLength += appendLength;
            }
        } while (newlineLength == 0 && bytesConsumed < maxBytesToConsume);

        if (bytesConsumed > Integer.MAX_VALUE) {
            throw new IOException("Too many bytes before newline: " + bytesConsumed);
        }
        return (int) bytesConsumed;
    }

    private int readCustomLine(Text str, int maxLineLength, int maxBytesToConsume, boolean clear) throws IOException {
        if (clear) {
            str.clear();
        }
        int txtLength = 0; // tracks str.getLength(), as an optimization
        long bytesConsumed = 0;
        int delPosn = 0;
        int ambiguousByteCount = 0; // To capture the ambiguous characters count
        do {
            int startPosn = bufferPosn; // Start from previous end position
            if (bufferPosn >= bufferLength) {
                startPosn = bufferPosn = 0;
                bufferLength = fillBuffer(in, buffer, ambiguousByteCount > 0);
                if (bufferLength <= 0) {
                    if (ambiguousByteCount > 0) {
                        str.append(recordDelimiterBytes, 0, ambiguousByteCount);
                        bytesConsumed += ambiguousByteCount;
                    }
                    break; // EOF
                }
            }
            for (; bufferPosn < bufferLength; ++bufferPosn) {
                if (buffer[bufferPosn] == recordDelimiterBytes[delPosn]) {
                    delPosn++;
                    if (delPosn >= recordDelimiterBytes.length) {
                        bufferPosn++;
                        break;
                    }
                } else if (delPosn != 0) {
                    bufferPosn--;
                    delPosn = 0;
                }
            }
            int readLength = bufferPosn - startPosn;
            bytesConsumed += readLength;
            int appendLength = readLength - delPosn;
            if (appendLength > maxLineLength - txtLength) {
                appendLength = maxLineLength - txtLength;
            }
            bytesConsumed += ambiguousByteCount;
            if (appendLength >= 0 && ambiguousByteCount > 0) {
                // appending the ambiguous characters (refer case 2.2)
                str.append(recordDelimiterBytes, 0, ambiguousByteCount);
                ambiguousByteCount = 0;
                // since it is now certain that the split did not split a
                // delimiter we
                // should not read the next record: clear the flag otherwise
                // duplicate
                // records could be generated
                unsetNeedAdditionalRecordAfterSplit();
            }
            if (appendLength > 0) {
                str.append(buffer, startPosn, appendLength);
                txtLength += appendLength;
            }
            if (bufferPosn >= bufferLength) {
                if (delPosn > 0 && delPosn < recordDelimiterBytes.length) {
                    ambiguousByteCount = delPosn;
                    bytesConsumed -= ambiguousByteCount; // to be consumed in
                                                            // next
                }
            }
        } while (delPosn < recordDelimiterBytes.length && bytesConsumed < maxBytesToConsume);
        if (bytesConsumed > Integer.MAX_VALUE) {
            throw new IOException("Too many bytes before delimiter: " + bytesConsumed);
        }
        return (int) bytesConsumed;
    }

    /**
     * Read a line terminated by a custom delimiter.
     */
    private int readCustomLine(Text str, int maxLineLength, int maxBytesToConsume) throws IOException {
        str.clear();
        int txtLength = 0; // tracks str.getLength(), as an optimization
        long bytesConsumed = 0;
        int delPosn = 0;
        int ambiguousByteCount = 0; // To capture the ambiguous characters count
        do {
            int startPosn = bufferPosn; // Start from previous end position
            if (bufferPosn >= bufferLength) {
                startPosn = bufferPosn = 0;
                bufferLength = fillBuffer(in, buffer, ambiguousByteCount > 0);
                if (bufferLength <= 0) {
                    if (ambiguousByteCount > 0) {
                        str.append(recordDelimiterBytes, 0, ambiguousByteCount);
                        bytesConsumed += ambiguousByteCount;
                    }
                    break; // EOF
                }
            }
            for (; bufferPosn < bufferLength; ++bufferPosn) {
                if (buffer[bufferPosn] == recordDelimiterBytes[delPosn]) {
                    delPosn++;
                    if (delPosn >= recordDelimiterBytes.length) {
                        bufferPosn++;
                        break;
                    }
                } else if (delPosn != 0) {
                    bufferPosn--;
                    delPosn = 0;
                }
            }
            int readLength = bufferPosn - startPosn;
            bytesConsumed += readLength;
            int appendLength = readLength - delPosn;
            if (appendLength > maxLineLength - txtLength) {
                appendLength = maxLineLength - txtLength;
            }
            bytesConsumed += ambiguousByteCount;
            if (appendLength >= 0 && ambiguousByteCount > 0) {
                // appending the ambiguous characters (refer case 2.2)
                str.append(recordDelimiterBytes, 0, ambiguousByteCount);
                ambiguousByteCount = 0;
                // since it is now certain that the split did not split a
                // delimiter we
                // should not read the next record: clear the flag otherwise
                // duplicate
                // records could be generated
                unsetNeedAdditionalRecordAfterSplit();
            }
            if (appendLength > 0) {
                str.append(buffer, startPosn, appendLength);
                txtLength += appendLength;
            }
            if (bufferPosn >= bufferLength) {
                if (delPosn > 0 && delPosn < recordDelimiterBytes.length) {
                    ambiguousByteCount = delPosn;
                    bytesConsumed -= ambiguousByteCount; // to be consumed in
                                                            // next
                }
            }
        } while (delPosn < recordDelimiterBytes.length && bytesConsumed < maxBytesToConsume);
        if (bytesConsumed > Integer.MAX_VALUE) {
            throw new IOException("Too many bytes before delimiter: " + bytesConsumed);
        }
        return (int) bytesConsumed;
    }

    /**
     * Read from the InputStream into the given Text.
     * 
     * @param str
     *            the object to store the given line
     * @param maxLineLength
     *            the maximum number of bytes to store into str.
     * @return the number of bytes read including the newline
     * @throws IOException
     *             if the underlying stream throws
     */
    public int readLine(Text str, int maxLineLength) throws IOException {
        return readLine(str, maxLineLength, Integer.MAX_VALUE);
    }

    /**
     * Read from the InputStream into the given Text.
     * 
     * @param str
     *            the object to store the given line
     * @return the number of bytes read including the newline
     * @throws IOException
     *             if the underlying stream throws
     */
    public int readLine(Text str) throws IOException {
        return readLine(str, Integer.MAX_VALUE, Integer.MAX_VALUE);
    }

    protected int getBufferPosn() {
        return bufferPosn;
    }

    protected int getBufferSize() {
        return bufferSize;
    }

    protected void unsetNeedAdditionalRecordAfterSplit() {
        // needed for custom multi byte line delimiters only
        // see MAPREDUCE-6549 for details
    }
}

下面是两个帮助类，用来压缩的，自定义一下（其实是复制源码，没有改动，只改了类名，因为我改了整个继承结构，所以只能加一个压缩的）

package com.my.lingRecordReader;
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SplitCompressionInputStream;


/**
 * Line reader for compressed splits
 *
 * Reading records from a compressed split is tricky, as the
 * LineRecordReader is using the reported compressed input stream
 * position directly to determine when a split has ended.  In addition the
 * compressed input stream is usually faking the actual byte position, often
 * updating it only after the first compressed block after the split is
 * accessed.
 *
 * Depending upon where the last compressed block of the split ends relative
 * to the record delimiters it can be easy to accidentally drop the last
 * record or duplicate the last record between this split and the next.
 *
 * Split end scenarios:
 *
 * 1) Last block of split ends in the middle of a record
 *      Nothing special that needs to be done here, since the compressed input
 *      stream will report a position after the split end once the record
 *      is fully read.  The consumer of the next split will discard the
 *      partial record at the start of the split normally, and no data is lost
 *      or duplicated between the splits.
 *
 * 2) Last block of split ends in the middle of a delimiter
 *      The line reader will continue to consume bytes into the next block to
 *      locate the end of the delimiter.  If a custom delimiter is being used
 *      then the next record must be read by this split or it will be dropped.
 *      The consumer of the next split will not recognize the partial
 *      delimiter at the beginning of its split and will discard it along with
 *      the next record.
 *
 *      However for the default delimiter processing there is a special case
 *      because CR, LF, and CRLF are all valid record delimiters.  If the
 *      block ends with a CR then the reader must peek at the next byte to see
 *      if it is an LF and therefore part of the same record delimiter.
 *      Peeking at the next byte is an access to the next block and triggers
 *      the stream to report the end of the split.  There are two cases based
 *      on the next byte:
 *
 *      A) The next byte is LF
 *           The split needs to end after the current record is returned.  The
 *           consumer of the next split will discard the first record, which
 *           is degenerate since LF is itself a delimiter, and start consuming
 *           records after that byte.  If the current split tries to read
 *           another record then the record will be duplicated between splits.
 *
 *      B) The next byte is not LF
 *           The current record will be returned but the stream will report
 *           the split has ended due to the peek into the next block.  If the
 *           next record is not read then it will be lost, as the consumer of
 *           the next split will discard it before processing subsequent
 *           records.  Therefore the next record beyond the reported split end
 *           must be consumed by this split to avoid data loss.
 *
 * 3) Last block of split ends at the beginning of a delimiter
 *      This is equivalent to case 1, as the reader will consume bytes into
 *      the next block and trigger the end of the split.  No further records
 *      should be read as the consumer of the next split will discard the
 *      (degenerate) record at the beginning of its split.
 *
 * 4) Last block of split ends at the end of a delimiter
 *      Nothing special needs to be done here. The reader will not start
 *      examining the bytes into the next block until the next record is read,
 *      so the stream will not report the end of the split just yet.  Once the
 *      next record is read then the next block will be accessed and the
 *      stream will indicate the end of the split.  The consumer of the next
 *      split will correctly discard the first record of its split, and no
 *      data is lost or duplicated.
 *
 *      If the default delimiter is used and the block ends at a CR then this
 *      is treated as case 2 since the reader does not yet know without
 *      looking at subsequent bytes whether the delimiter has ended.
 *
 * NOTE: It is assumed that compressed input streams *never* return bytes from
 *       multiple compressed blocks from a single read.  Failure to do so will
 *       violate the buffering performed by this class, as it will access
 *       bytes into the next block after the split before returning all of the
 *       records from the previous block.
 */
@InterfaceAudience.Private
@InterfaceStability.Unstable
public class MyCompressedSplitLineReader extends MySplitLineReader {

  SplitCompressionInputStream scin;
  private boolean usingCRLF;
  private boolean needAdditionalRecord = false;
  private boolean finished = false;

  public MyCompressedSplitLineReader(SplitCompressionInputStream in,
                                   Configuration conf,
                                   byte[] recordDelimiterBytes)
                                       throws IOException {
    super(in, conf, recordDelimiterBytes);
    scin = in;
    usingCRLF = (recordDelimiterBytes == null);
  }

  @Override
  protected int fillBuffer(InputStream in, byte[] buffer, boolean inDelimiter)
      throws IOException {
    int bytesRead = in.read(buffer);

    // If the split ended in the middle of a record delimiter then we need
    // to read one additional record, as the consumer of the next split will
    // not recognize the partial delimiter as a record.
    // However if using the default delimiter and the next character is a
    // linefeed then next split will treat it as a delimiter all by itself
    // and the additional record read should not be performed.
    if (inDelimiter && bytesRead > 0) {
      if (usingCRLF) {
        needAdditionalRecord = (buffer[0] != '\n');
      } else {
        needAdditionalRecord = true;
      }
    }
    return bytesRead;
  }

  @Override
  public int readLine(Text str, int maxLineLength, int maxBytesToConsume)
      throws IOException {
    int bytesRead = 0;
    if (!finished) {
      // only allow at most one more record to be read after the stream
      // reports the split ended
      if (scin.getPos() > scin.getAdjustedEnd()) {
        finished = true;
      }

      bytesRead = super.readLine(str, maxLineLength, maxBytesToConsume);
    }
    return bytesRead;
  }

  @Override
  public boolean needAdditionalRecordAfterSplit() {
    return !finished && needAdditionalRecord;
  }
}

第二个与压缩有关的类

package com.my.lingRecordReader;

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */


import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.Text;

/**
 * SplitLineReader for uncompressed files.
 * This class can split the file correctly even if the delimiter is multi-bytes.
 */
@InterfaceAudience.Private
@InterfaceStability.Unstable
public class MyUncompressedSplitLineReader extends MySplitLineReader {
  private boolean needAdditionalRecord = false;
  private long splitLength;
  /** Total bytes read from the input stream. */
  private long totalBytesRead = 0;
  private boolean finished = false;
  private boolean usingCRLF;

  public MyUncompressedSplitLineReader(FSDataInputStream in, Configuration conf,
      byte[] recordDelimiterBytes, long splitLength) throws IOException {
    super(in, conf, recordDelimiterBytes);
    this.splitLength = splitLength;
    usingCRLF = (recordDelimiterBytes == null);
  }

  @Override
  protected int fillBuffer(InputStream in, byte[] buffer, boolean inDelimiter)
      throws IOException {
    int maxBytesToRead = buffer.length;
    if (totalBytesRead < splitLength) {
      maxBytesToRead = Math.min(maxBytesToRead,
                                (int)(splitLength - totalBytesRead));
    }
    int bytesRead = in.read(buffer, 0, maxBytesToRead);

    // If the split ended in the middle of a record delimiter then we need
    // to read one additional record, as the consumer of the next split will
    // not recognize the partial delimiter as a record.
    // However if using the default delimiter and the next character is a
    // linefeed then next split will treat it as a delimiter all by itself
    // and the additional record read should not be performed.
    if (totalBytesRead == splitLength && inDelimiter && bytesRead > 0) {
      if (usingCRLF) {
        needAdditionalRecord = (buffer[0] != '\n');
      } else {
        needAdditionalRecord = true;
      }
    }
    if (bytesRead > 0) {
      totalBytesRead += bytesRead;
    }
    return bytesRead;
  }

  @Override
  public int readLine(Text str, int maxLineLength, int maxBytesToConsume)
      throws IOException {
    int bytesRead = 0;
    if (!finished) {
      // only allow at most one more record to be read after the stream
      // reports the split ended
      if (totalBytesRead > splitLength) {
        finished = true;
      }

      bytesRead = super.readLine(str, maxLineLength, maxBytesToConsume);
    }
    return bytesRead;
  }

  @Override
  public boolean needAdditionalRecordAfterSplit() {
    return !finished && needAdditionalRecord;
  }

  @Override
  protected void unsetNeedAdditionalRecordAfterSplit() {
    needAdditionalRecord = false;
  }
}

最后就可以来测试了

先看看测试前的文件内容

package com.my.lingRecordReader;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import com.my.input.myInputFormat;

public class myTest {
    /**
     * 
     * @author 汤高

     */
    //Map过程
    static int count=0;
    public static class MyTestMapper extends Mapper {
        /***
         * 
         */
        @Override
        protected void map(LongWritable key, Text value, Mapper.Context context)
                throws IOException, InterruptedException {
            //默认的map的value是每一行,我这里自定义的是以空格分割
            count++;
            //String[] vs = value.toString().split(",");
            //for (String v : vs) {
                //写出去
                context.write(new Text(value), key);
            //}
            System.out.println("========>"+count);
        }
    }

    public static void main(String[] args) {

        Configuration conf=new Configuration();
        try {
            //args从控制台获取路径 解析得到域名
            String[] paths=new GenericOptionsParser(conf,args).getRemainingArgs();
            if(paths.length<2){
                throw new RuntimeException("必須輸出 輸入 和输出路径");
            }
            //得到一个Job 并设置名字
            Job job=Job.getInstance(conf,"myTest");
            //设置Jar 使本程序在Hadoop中运行
            job.setJarByClass(myTest.class);
            //设置Map处理类
            job.setMapperClass(MyTestMapper.class);
            job.setInputFormatClass(MyTextInputFormat.class);
            //设置map的输出类型,因为不一致,所以要设置
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(LongWritable.class);
            //设置输入和输出目录
            FileInputFormat.addInputPath(job, new Path(paths[0]));
            FileOutputFormat.setOutputPath(job, new Path(paths[1] + System.currentTimeMillis()));// 整合好结果后输出的位置
            //启动运行
            System.exit(job.waitForCompletion(true) ? 0:1);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }


}

测试结果：

看第2列的偏移量，发现已经实现了一次读多行(我测试的是2行)

到此所有分析已经完了，研究源码真不容易，花了我一个晚上去研究hadoop的源码，然后再花了几个小时把这些内容写成博客，所以，码字不易，转载请指明出处http://blog.csdn.net/tanggao1314/article/details/51307642

你可能感兴趣的:(大数据与云计算,大数据生态系统技术)

MongoDB深度解析与实践案例我的运维人生 mongodb 数据库运维开发技术共享
MongoDB深度解析与实践案例在当今大数据与云计算盛行的时代，NoSQL数据库以其灵活的数据模型、水平扩展能力和高性能，成为处理海量数据的重要工具之一。MongoDB，作为NoSQL数据库的杰出代表，凭借其面向文档的存储结构、强大的查询语言以及丰富的生态系统，赢得了众多开发者和企业的青睐。本文将深入探讨MongoDB的核心特性、架构设计原则，并通过一个实际案例展示其在实际项目中的应用。一、Mon
MongoDB深度解析与实践案例我的运维人生 mongodb 数据库运维开发技术共享
MongoDB深度解析与实践案例在当今大数据与云计算蓬勃发展的时代，NoSQL数据库以其灵活的数据模型、水平扩展能力和高性能，成为处理海量数据的重要工具。其中，MongoDB作为NoSQL数据库的佼佼者，凭借其面向文档的存储方式、强大的查询语言以及丰富的生态系统，在各类应用场景中大放异彩。本文将深入探讨MongoDB的核心特性、架构设计原则，并通过实际代码案例展示其在数据处理中的应用。一、Mong
MySQL系列之数据导入导出 ZHOU西口数据库 mysql 数据库备份与恢复 mysqldump load data
前言大数据与云计算作为当今时代，数据要素发展的“动力引擎”，已经走进了社会生活的方方方面。而背后承载的云服务或数据服务的高效运转，起了决定作用。作为数据存储的重要工具，数据库的品类和特性也日新月异。从树型、网络型到关系型，从集中式到分布式，均可胜任不同的业务场景和数据存储要求。在这个云时代（CloudAge），作为“轻、快、高”的代表，MySQL作为RDB的优等生，备受各行各业的青睐。从今天开始，
大数据导论（2）---大数据与云计算、物联网、人工智能冒冒菜菜大数据导论大数据导论云计算和物联网课程学习
文章目录1.云计算1.1云计算概念1.2云计算的服务模式和类型1.3云计算的数据中心与应用2.物联网2.1物联网的概念和关键技术2.2物联网的应用和产业2.3大数据与云计算、物联网的关系1.云计算1.1云计算概念 1.首先从商业角度给云计算下一个定义：通过网络、以服务的方式为千家万户（包含政府、企业和个人用户）提供非常廉价的IT资源。 2.云计算是一种全新的技术，包含了虚拟化、分布式存储、分布式计
大数据与云计算 | 华科软院2020年期末考试试题及答案哆啦一泓实验考试与课设
【注：答案为本人所写，仅供参考】1.就本课程最后一个实验，回答下列问题：(1)请描述该实验系统的功能和你所做的工作(8分)；(2)详细描述实验系统在云端的部署过程(6分)；(3)简述实验过程的难点/痛点和自己做实验的体会(6分)。(1)在阿里云ECS服务器上安装FTP、MySQL服务、JDK、Nginx、tomcat，并编写JavaWeb应用，部署到阿里云服务器，实现学生信息的增删查改、管理员登录
大数据技术原理与应用期末复习知识点全总结（林子雨版天玑y 期末复习大数据学习学习方法笔记 bigdata hdfs hadoop
目录1.第一章大数据概述：（一）三次信息化浪潮（二）人类社会数据产生方式的3个阶段（三）大数据的3个发展阶段（四）大数据4V概念（五）数据存储单位之间的换算关系（六）大数据对科学研究的影响（七）大数据对思维方式的影响（八）大数据技术的不同层面及其功能（九）大数据计算模式及其代表产品（十）大数据产业的6个层次（十一）大数据与云计算、物联网（十二）物联网体系架构（十三）大数据与云计算、物联网的关系第二
助推酒店产业智能化升级 I 喜尔康出席中国饭店协会成立三十周年总结展望大会智哪儿全屋智能智能家居智能家居
1月8日，中国饭店协会六届四次理事会暨中国饭店协会成立三十周年总结展望大会在广州隆重举办。作为中国饭店协会理事单位及此次大会的赞助商，喜尔康受邀出席大会。现场，喜尔康集团董事长吴锡山发表了《智能家居赋能后装修时代》的主题演讲，引发现场共鸣。1、智能家居势不可挡吴锡山表示，5G、大数据与云计算等新技术的发展，特别是科技巨头推动的人工智能大爆发，带来了生产关系的深刻变革。家居等各行各业，包括酒店、建筑
3-分布式存储之Ceph 师范大学通信大怨总分布式 ceph
任务背景虽然使用了分布式的glusterfs存储,但是对于爆炸式的数据增长仍然感觉力不从心。对于大数据与云计算等技术的成熟,存储也需要跟上步伐.所以这次我们选用对象存储.任务要求1,搭建ceph集群2,实现对象存储的应用任务拆解1,了解ceph2,搭建ceph集群3,了解rados原生数据存取4,实现ceph文件存储5,实现ceph块存储6,实现ceph对象存储学习目标能够成功部署ceph集群能够
大数据和智能数据应用架构系列教程之：大数据与云计算禅与计算机程序设计艺术 AI实战大数据AI人工智能 Python实战大数据人工智能语言模型 Java Python 架构设计
作者：禅与计算机程序设计艺术1.背景介绍大数据简介大数据（英语：BigData），指的是一个涵盖多个不同主题、来源、传播方式的海量、复杂和不断增长的数据集合。由于数据的增长迅速、结构化程度高、采集渠道多样，使得大数据产生了新的分析需求、挖掘价值并推动产业革命。随着大数据的飞速发展，越来越多的人们发现自己正在被迫依赖于数据驱动的生产活动，包括金融服务、商业模式、个性化推荐等。同时，大数据也为各行各业
《云计算-刘鹏》学习笔记-第一章：大数据与云计算流动的风与雪其他云计算大数据 IaaS PaaS SaaS
文章目录0笔记说明1大数据时代2云计算——大数据的计算3云计算发展现状4云计算实现机制5云计算压倒性的成本优势0笔记说明参考书籍为《云计算-第三版》，作者为刘鹏。1大数据时代大数据的定义如下：海量数据或巨量数据，其规模巨大到无法通过目前主流的计算机系统在合理时间内获取、存储、管理、处理并提炼以帮助使用者决策。大数据具有以下的特征，即4V+1C：1、数据量大(Volume)：存储的数据量巨大，PB级
大数据与云计算技术---（二）Openstack云计算平台李牛克斯小学生. 六 Linux企业运维 openstack 云计算
一、环境主机网络控制节点服务器配置网络接口配置域名解析网络时间协议(NTP)控制节点服务器其它节点服务器OpenStack包启用OpenStack库安装OpenStack客户端SQL数据库安全并配置组件启动数据库服务消息队列安全并配置组件图形工具Memcached安全并配置组件启动Memcached服务认证服务安装和配置先决条件安全并配置组件c2cec39f898636bfa542配置Apache
大数据、人工智能与云计算的融合与应用 ShuYunBIGDATA 大数据
1引言人工智能、大数据与云计算三者有着密不可分的联系。人工智能从1956年开始发展，在大数据技术出现之前已经发展了数十年，几起几落，但当遇到了大数据与分布式技术的发展，解决了计算力和训练数据量的问题，开始产生巨大的生产价值；同时，大数据技术通过将传统机器学习算法分布式实现，向人工智能领域延伸；此外，随着数据不断汇聚在一个平台，企业大数据基础平台服务各个部门以及分支机构的需求越来越迫切。通过容器技术
JavaEE入门级别最全教程1--初学者必看 itLaity Java基础知识讲解与总结 javaee java 初学者
导读相信很多初入编程的小伙伴对于语言有种选择恐惧症，对于Java也不知怎么去学，这期文章J哥会给大家整理最适合小白学习的JavaEE教程。大数据的概述#大数据与云计算的学习概念:海量数据，具有高增长率、数据类型多样性、一定时间内无法使用常规软件进行捕捉、管理和处理的数据集合。特征:4V特征(是大家普遍认可的)大量多样高速价值#大数据能做什么？在海量的各种各样类型的价值密度低的数据中，我们要进行的是
架构师必知必会系列：大数据处理与架构禅与计算机程序设计艺术禅与计算机程序设计艺术架构师必知必会系列大数据AI人工智能大数据人工智能语言模型 Java Python 架构设计
作者：禅与计算机程序设计艺术1.简介随着互联网、电子商务等新兴产业的发展，互联网企业在海量数据产生、收集、分析的过程中越来越依赖于大数据处理平台进行数据的存储、加工、计算。由于数据量的爆炸性增长，传统的数据处理技术已经无法满足实时分析需求。为了解决这一难题，云计算与大数据平台成为行业主要的发展方向。目前，云计算与大数据领域处于蓬勃发展阶段。大数据与云计算技术的广泛应用导致了大数据的“三驾马车”模型
大数据和智能数据应用架构系列教程之：大数据与云计算禅与计算机程序设计艺术禅与计算机程序设计艺术大数据AI人工智能大数据人工智能语言模型 Java Python 架构设计
作者：禅与计算机程序设计艺术1.背景介绍云计算是现代IT技术中一个重要组成部分，它赋予了用户更多的灵活性、弹性、按需付费能力等，随着互联网和移动互联网的蓬勃发展，越来越多的企业开始转向云计算平台作为基础设施，构建自己的大数据和智能分析平台。而大数据的应用也越来越成为云计算平台的一个重要组成部分，包括数据采集、数据存储、数据处理、数据分析等。传统上，大数据应用架构往往存在以下几个难点：数据采集难度高
2019年华为网络精英挑战赛-大数据 Wakeupeme328514
1.1大数据的基本特征Volume：数量大；Variety：种类和来源多样化；Velocity：及时性要求高；Value：价值密度低。1.1.2Hadoop特点开放，全球生态；结构化、半结构化、非结构化；高性能、实时。1.2大数据理念变革与传统数据对比创新点1.3大数据与云计算、人工智能AI1.4企业级大数据关键技术1.4.2数据处理批处理：适用于传统数据库或分布式数据库；支持结构化与非结构化数据
大数据概述（林子雨慕课课程）几窗花鸢大数据应用大数据
文章目录1.大数据概述1.1大数据概念和影响1.2大数据的应用1.3大数据的关键技术1.4大数据与云计算和物联网的关系云计算物联网1.大数据概述大数据的四大特点：大量化、快速化、多样化、价值密度低1.1大数据概念和影响大数据摩尔定律大数据由结构化和非结构化的数据组成，非结构化的数据占比大，如图像数据结构化的数据就是关系数据库表中的图表数据非结构化的数据种类繁多大数据从数据的生成到消耗，时间窗口非常
hadoop生态现状、介绍、部署小小哭包服务器大数据 Linux hadoop 大数据分布式
一、引出hadoop1、hadoop的高薪现状各招聘平台都有许多hadoop高薪职位，可以看看职位所需求的技能---->hadoop是什么，为什么会这么高薪？引出大数据，大数据时代，大数据与云计算2、大数据时代的介绍大数据的故事，google根据海量数据所作出的一次流行病传播趋势预测，及时性和准确性都远超医疗体系根据传统方法所作出的预警，渲染大数据技术将给这个时代带来的巨大变革---->大数据的4
大数据与云计算实验一惑星撞地球大数据云计算
检查是否开启sudoservicedockerstatus开启服务sudoservicedockerstart运行服务sudodockerrun-itd-p8080:80nginx查询IDdockerps-all进入容器shellsudodockerexec-it/bin/bash找到/usr/share/nginx/html/index.html文件编辑完成
大数据与云计算——让我们进入数字化的新纪元 Sirius·Black 大数据云计算
当谈论大数据和云计算时，我们进入了一个数字化时代的新纪元。这两个领域在科技和商业领域都有着深远的影响，改变了我们如何处理和存储数据，以及如何进行计算和分析。本文将探讨大数据和云计算的基本概念，它们的关系以及它们在不同领域的应用。大数据与云计算——数字化的新纪元基本概念什么是大数据什么是云计算大数据与云计算的关系1.存储和处理大数据2.弹性和可扩展性3.数据分析和挖掘4.数据安全和隐私应用领域1.医
基于 KubeSphere 的应用容器化在智能网联汽车领域的实践云计算
公司简介某国家级智能网联汽车研究中心成立于2018年，是担当产业发展咨询与建议、共性技术研发中心、创新成果转化的国家级创新平台，旨在提高我国在智能网联汽车及相关产业在全球价值链中的地位。目前着力建设基于大数据与云计算的智能汽车云端运营控制中心平台。推进云端运营控制中心建设的过程中，运控中心平台的集成、部署、运维方案经历了3代的升级迭代过程。第一代部署方案是直接将平台的前后端各个模块手动部署在自有物
问道崂山 2018·中国（青岛）大数据应用与解决方案高峰论坛圆满落幕 chuntu1126 大数据嵌入式操作系统
12月6日-7日，“2018问道崂山·中国（青岛）大数据应用与解决方案高峰论坛-暨首届大快搜索合作伙伴生态系统大会&开发者技术沙龙”在青岛海天大剧院酒店成功举办。本次高峰论坛由青岛市大数据与云计算行业协会、山东省计算机学会大数据与智能计算专委会联合主办，大快搜索、青岛新闻网承办，论坛以“创新大数据汇聚新动能”为主题，依托本次活动主要承办方大快搜索全国合作伙伴资源，邀请了百余家知名大数据企业参会，共
大数据课程复习腹黑客大数据
信息科技为大数据时代提供技术支持存储设备容量不断增加CPU处理能力大幅度提升网络带宽不断增加大数据4V特征数据量大数据类型繁多处理速度快价值密度底大数据对思维方式的影响全样而非抽样效率而非精确相关而非因果云计算关键技术虚拟化分布式存储分布式计算多租户大数据与云计算，物联网的关系三者区别大数据侧重与对海量数据的存储，处理分析，从海量数据中发现价值，服务生产生活云计算本质旨在整合优化各种IT资源，通过
为什么这么多人都想学大数据？宁可放弃本职工作也要转行学习。大数据具有什么魔力色彩飞上天的猫神
首先大数据是什么：大数据(bigdata,megadata)，或称巨量资料，指的是需要新处理模式才能具有更强的决策力、洞察力和流程优化能力的海量、高增长率和多样化的信息资产。2、大数据的4V特点：Volume（大量）、Velocity（高速）、Variety（多样）、Value（价值）。3、从技术上看，大数据与云计算的关系就像一枚硬币的正反面一样密不可分。大数据必然无法用单台的计算机进行处理，必须
大数据与云计算之间的关系是怎样的？大数据基础入门教程大数据 hadoop spark
如今，两种主流技术已成为IT领域关注的焦点-大数据和云计算。根本不同的是，大数据只涉及处理海量数据，而云计算则涉及基础架构。但是，大数据和云技术提供的简化功能是其被大量企业采用的主要原因。例如，亚马逊的“ElasticMapReduce”演示了如何利用CloudElasticComputes的功能进行大数据处理。两者的结合为组织带来了有益的结果。更不用说，这两种技术都处于发展阶段，但是它们的结合在
大数据与云计算 HappySSweet 大数据
大数据的4个特点：量大：存储大，计算量大样多：来源多，格式多快速：生成速度快，处理速度要求快价值密度低：价值密度的高低和数据总量的大小成反比云计算和大数据的关系：云计算是底层平台，大数据是应用，云计算作为底层平台整合计算和存储网络等资源，同时提供基础架构资源弹性伸缩的能力，大数据在云计算平台支撑下，调度下层资源进行数据源加载，计算和最终结果输出等动作。
基于 KubeSphere 的应用容器化在智能网联汽车领域的实践 KubeSphere 云原生 k8s 容器平台 kubesphere 云计算
公司简介某国家级智能网联汽车研究中心成立于2018年，是担当产业发展咨询与建议、共性技术研发中心、创新成果转化的国家级创新平台，旨在提高我国在智能网联汽车及相关产业在全球价值链中的地位。目前着力建设基于大数据与云计算的智能汽车云端运营控制中心平台。推进云端运营控制中心建设的过程中，运控中心平台的集成、部署、运维方案经历了3代的升级迭代过程。第一代部署方案是直接将平台的前后端各个模块手动部署在自有物
基于 KubeSphere 的应用容器化在智能网联汽车领域的实践云计算
公司简介某国家级智能网联汽车研究中心成立于2018年，是担当产业发展咨询与建议、共性技术研发中心、创新成果转化的国家级创新平台，旨在提高我国在智能网联汽车及相关产业在全球价值链中的地位。目前着力建设基于大数据与云计算的智能汽车云端运营控制中心平台。推进云端运营控制中心建设的过程中，运控中心平台的集成、部署、运维方案经历了3代的升级迭代过程。第一代部署方案是直接将平台的前后端各个模块手动部署在自有物
大数据与云计算柴玉宾
通俗讲解：未来云计算下面读两个故事一定弄懂“云计算”故事一公共电网抛弃了爱迪生爱迪生的牛气无法言说，这辈子有2000多项发明，在科学界他吃的盐比普通人吃的饭还多。但就是这么一个牛人，也曾被拍打在沙滩上：公共电网狠狠地抛弃了他。1878年，爱迪生决定开发一种新产品——电灯泡，为了持续地给它供电，他紧跟着又发明了电流表、发电机等，这是一套完整的供电系统：爱迪生灯具公司制造灯泡，爱迪生电器公司制造发电机
解锁潜力，驭数赋能：大数据与云计算的强强联合久数君数据可视化大数据云计算物联网信息可视化
随着数字化时代的来临，大数据和云计算已成为信息技术领域的两大热门话题。大数据指的是以海量、高速、多样化的数据为基础，通过分析和挖掘来获得有价值的信息和洞察。而云计算则是一种基于网络的计算模式，通过将数据和应用程序存储在云端服务器上，实现资源共享和灵活扩展。这两个领域的结合，为企业和组织带来了许多机遇和挑战。大数据和云计算的结合使得数据的收集、存储和处理更加高效和便捷。通过云计算的弹性和可扩展性，企
java解析APK 3213213333332132 java apk linux 解析APK
解析apk有两种方法 1、结合安卓提供apktool工具，用java执行cmd解析命令获取apk信息 2、利用相关jar包里的集成方法解析apk 这里只给出第二种方法，因为第一种方法在linux服务器下会出现不在控制范围之内的结果。 public class ApkUtil { /** * 日志对象 */ private static Logger
nginx自定义ip访问N种方法 ronin47 nginx 禁止ip访问
　　　因业务需要，禁止一部分内网访问接口，　由于前端架了F5，直接用deny或allow是不行的，这是因为直接获取的前端Ｆ５的地址。　　　所以开始思考有哪些主案可以实现这样的需求，目前可实施的是三种：　　　一：把ip段放在redis里，写一段lua 二：利用geo传递变量，写一段
mysql timestamp类型字段的CURRENT_TIMESTAMP与ON UPDATE CURRENT_TIMESTAMP属性 dcj3sjt126com mysql
timestamp有两个属性，分别是CURRENT_TIMESTAMP 和ON UPDATE CURRENT_TIMESTAMP两种，使用情况分别如下： 1. CURRENT_TIMESTAMP 当要向数据库执行insert操作时，如果有个timestamp字段属性设为 CURRENT_TIMESTAMP，则无论这
struts2+spring+hibernate分页显示 171815164 Hibernate
分页显示一直是web开发中一大烦琐的难题，传统的网页设计只在一个JSP或者ASP页面中书写所有关于数据库操作的代码，那样做分页可能简单一点，但当把网站分层开发后，分页就比较困难了，下面是我做Spring+Hibernate+Struts2项目时设计的分页代码，与大家分享交流。　　1、DAO层接口的设计，在MemberDao接口中定义了如下两个方法： public in
构建自己的Wrapper应用 g21121 rap
我们已经了解Wrapper的目录结构，下面可是正式利用Wrapper来包装我们自己的应用，这里假设Wrapper的安装目录为:/usr/local/wrapper。首先，创建项目应用 &nb
[简单]工作记录_多线程相关 53873039oycg 多线程
最近遇到多线程的问题,原来使用异步请求多个接口(n*3次请求) 方案一使用多线程一次返回数据,最开始是使用5个线程,一个线程顺序请求3个接口,超时终止返回缺点测试发现必须3个接
调试jdk中的源码，查看jdk局部变量程序员是怎么炼成的 jdk 源码
转自：http://www.douban.com/note/211369821/ 学习jdk源码时使用-- 学习java最好的办法就是看jdk源代码，面对浩瀚的jdk（光源码就有40M多，比一个大型网站的源码都多）从何入手呢，要是能单步调试跟进到jdk源码里并且能查看其中的局部变量最好了。可惜的是sun提供的jdk并不能查看运行中的局部变量
Oracle RAC Failover 详解 aijuans oracle
Oracle RAC 同时具备HA(High Availiablity) 和LB(LoadBalance). 而其高可用性的基础就是Failover(故障转移). 它指集群中任何一个节点的故障都不会影响用户的使用，连接到故障节点的用户会被自动转移到健康节点，从用户感受而言，是感觉不到这种切换。 Oracle 10g RAC 的Failover 可以分为3种： 1. Client-Si
form表单提交数据编码方式及tomcat的接受编码方式 antonyup_2006 JavaScript tomcat 浏览器互联网 servlet
原帖地址：http://www.iteye.com/topic/266705 form有2中方法把数据提交给服务器，get和post,分别说下吧。（一）get提交 1.首先说下客户端（浏览器）的form表单用get方法是如何将数据编码后提交给服务器端的吧。对于get方法来说，都是把数据串联在请求的url后面作为参数，如：http://localhost:
JS初学者必知的基础百合不是茶 js函数 js入门基础
JavaScript是网页的交互语言,实现网页的各种效果, JavaScript 是世界上最流行的脚本语言。 JavaScript 是属于 web 的语言，它适用于 PC、笔记本电脑、平板电脑和移动电话。 JavaScript 被设计为向 HTML 页面增加交互性。许多 HTML 开发者都不是程序员，但是 JavaScript 却拥有非常简单的语法。几乎每个人都有能力将小的
iBatis的分页分析与详解 bijian1013 java ibatis
分页是操作数据库型系统常遇到的问题。分页实现方法很多，但效率的差异就很大了。iBatis是通过什么方式来实现这个分页的了。查看它的实现部分，发现返回的PaginatedList实际上是个接口，实现这个接口的是PaginatedDataList类的对象，查看PaginatedDataList类发现，每次翻页的时候最
精通Oracle10编程SQL(15)使用对象类型 bijian1013 oracle 数据库 plsql
/* *使用对象类型 */ --建立和使用简单对象类型 --对象类型包括对象类型规范和对象类型体两部分。 --建立和使用不包含任何方法的对象类型 CREATE OR REPLACE TYPE person_typ1 as OBJECT( name varchar2(10),gender varchar2(4),birthdate date ); drop type p
【Linux命令二】文本处理命令awk bit1129 linux命令
awk是Linux用来进行文本处理的命令，在日常工作中，广泛应用于日志分析。awk是一门解释型编程语言，包含变量，数组，循环控制结构，条件控制结构等。它的语法采用类C语言的语法。 awk命令用来做什么？ 1.awk适用于具有一定结构的文本行，对其中的列进行提取信息 2.awk可以把当前正在处理的文本行提交给Linux的其它命令处理，然后把直接结构返回给awk 3.awk实际工
JAVA(ssh2框架)+Flex实现权限控制方案分析白糖_ java
目前项目使用的是Struts2+Hibernate+Spring的架构模式，目前已经有一套针对SSH2的权限系统，运行良好。但是项目有了新需求：在目前系统的基础上使用Flex逐步取代JSP，在取代JSP过程中可能存在Flex与JSP并存的情况，所以权限系统需要进行修改。【SSH2权限系统的实现机制】权限控制分为页面和后台两块：不同类型用户的帐号分配的访问权限是不同的，用户使
angular.forEach boyitech AngularJS AngularJS API angular.forEach
angular.forEach 描述: 循环对obj对象的每个元素调用iterator, obj对象可以是一个Object或一个Array. Iterator函数调用方法: iterator(value, key, obj), 其中obj是被迭代对象，key是obj的property key或者是数组的index，value就是相应的值啦. (此函数不能够迭代继承的属性.)
java-谷歌面试题-给定一个排序数组，如何构造一个二叉排序树 bylijinnan 二叉排序树
import java.util.LinkedList; public class CreateBSTfromSortedArray { /** * 题目:给定一个排序数组，如何构造一个二叉排序树 * 递归 */ public static void main(String[] args) { int[] data = { 1, 2, 3, 4,
action执行2次 Chen.H JavaScript jsp XHTML css Webwork
xwork 写道 <action name="userTypeAction" class="com.ekangcount.website.system.view.action.UserTypeAction"> <result name="ssss" type="dispatcher">
[时空与能量]逆转时空需要消耗大量能源 comsci 能源
无论如何,人类始终都想摆脱时间和空间的限制....但是受到质量与能量关系的限制,我们人类在目前和今后很长一段时间内,都无法获得大量廉价的能源来进行时空跨越..... 在进行时空穿梭的实验中,消耗超大规模的能源是必然
oracle的正则表达式(regular expression)详细介绍 daizj oracle 正则表达式
正则表达式是很多编程语言中都有的。可惜oracle8i、oracle9i中一直迟迟不肯加入，好在oracle10g中终于增加了期盼已久的正则表达式功能。你可以在oracle10g中使用正则表达式肆意地匹配你想匹配的任何字符串了。正则表达式中常用到的元数据(metacharacter)如下： ^ 匹配字符串的开头位置。 $ 匹配支付传的结尾位置。 *
报表工具与报表性能的关系 datamachine 报表工具 birt 报表性能润乾报表
在选择报表工具时，性能一直是用户关心的指标，但是，报表工具的性能和整个报表系统的性能有多大关系呢？要回答这个问题，首先要分析一下报表的处理过程包含哪些环节，哪些环节容易出现性能瓶颈，如何优化这些环节。一、报表处理的一般过程分析 1、用户选择报表输入参数后，报表引擎会根据报表模板和输入参数来解析报表，并将数据计算和读取请求以SQL的方式发送给数据库。 2、
初一上学期难记忆单词背诵第一课 dcj3sjt126com word english
what 什么 your 你 name 名字 my 我的 am 是 one 一 two 二 three 三 four 四 five 五 class 班级，课 six 六 seven 七 eight 八 nince 九 ten 十 zero 零 how 怎样 old 老的 eleven 十一 twelve 十二 thirteen
我学过和准备学的各种技术 dcj3sjt126com 技术
语言VB https://msdn.microsoft.com/zh-cn/library/2x7h1hfk.aspxJava http://docs.oracle.com/javase/8/C# https://msdn.microsoft.com/library/vstudioPHP http://php.net/manual/en/Html
struts2中token防止重复提交表单蕃薯耀重复提交表单 struts2中token
struts2中token防止重复提交表单 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 蕃薯耀 2015年7月12日 11:52:32 星期日 ht
线性查找二维数组 hao3100590 二维数组
1.算法描述有序（行有序，列有序，且每行从左至右递增，列从上至下递增）二维数组查找，要求复杂度O(n) 2.使用到的相关知识：结构体定义和使用，二维数组传递（http://blog.csdn.net/yzhhmhm/article/details/2045816） 3.使用数组名传递这个的不便之处很明显，一旦确定就是不能设置列值 //使
spring security 3中推荐使用BCrypt算法加密密码 jackyrong Spring Security
spring security 3中推荐使用BCrypt算法加密密码了，以前使用的是md5， Md5PasswordEncoder 和 ShaPasswordEncoder，现在不推荐了，推荐用bcrpt Bcrpt中的salt可以是随机的，比如： int i = 0; while (i < 10) { String password = "1234
学习编程并不难,做到以下几点即可! lampcy java html 编程语言
不论你是想自己设计游戏，还是开发iPhone或安卓手机上的应用，还是仅仅为了娱乐，学习编程语言都是一条必经之路。编程语言种类繁多，用途各异，然而一旦掌握其中之一，其他的也就迎刃而解。作为初学者，你可能要先从Java或HTML开始学，一旦掌握了一门编程语言，你就发挥无穷的想象，开发各种神奇的软件啦。 1、确定目标学习编程语言既充满乐趣，又充满挑战。有些花费多年时间学习一门编程语言的大学生到
架构师之mysql----------------用group+inner join,left join ,right join 查重复数据（替代in) nannan408 right join
1.前言。如题。 2.代码 (1)单表查重复数据,根据a分组 SELECT m.a,m.b, INNER JOIN （select a,b,COUNT(*) AS rank FROM test.`A` A GROUP BY a HAVING rank>1 )k ON m.a=k.a （2）多表查询，使用改为le
jQuery选择器小结 VS 节点查找（附css的一些东西） Everyday都不同 jquery css name选择器追加元素查找节点
最近做前端页面，频繁用到一些jQuery的选择器，所以特意来总结一下：测试页面： <html> <head> <script src="jquery-1.7.2.min.js"></script> <script> /*$(function() { $(documen
关于EXT tntxia ext
ExtJS是一个很不错的Ajax框架，可以用来开发带有华丽外观的富客户端应用，使得我们的b/s应用更加具有活力及生命力。ExtJS是一个用 javascript编写，与后台技术无关的前端ajax框架。因此，可以把ExtJS用在.Net、Java、Php等各种开发语言开发的应用中。 ExtJs最开始基于YUI技术，由开发人员Jack
一个MIT计算机博士对数学的思考 xjnine Math
在过去的一年中，我一直在数学的海洋中游荡，research进展不多，对于数学世界的阅历算是有了一些长进。为什么要深入数学的世界？作为计算机的学生，我没有任何企图要成为一个数学家。我学习数学的目的，是要想爬上巨人的肩膀，希望站在更高的高度，能把我自己研究的东西看得更深广一些。说起来，我在刚来这个学校的时候，并没有预料到我将会有一个深入数学的旅程。我的导师最初希望我去做的题目，是对appe