Overview

Flow diagram: the stages a MapReduce job passes through, each paired with its default implementation (or the class WordCount plugs in):

| Stage | Implementation |
| --- | --- |
| InputFormat | TextInputFormat |
| RecordReader | LineRecordReader |
| InputSplit | FileSplit |
| map | Mapper (an identity map by default; normally user-supplied, e.g. TokenizerMapper) |
| combine | none by default (optional; WordCount reuses IntSumReducer) |
| Partitioner | HashPartitioner |
| GroupComparator | this one is magical; you could say it is where the miracle happens. It can be configured in two places: inside the key class or in the driver class (see the sketch after this table) |
| reduce | Reducer (an identity reduce by default; the old API's equivalent, IdentityReducer, lives in the legacy jar) |
| OutputFormat | TextOutputFormat (a FileOutputFormat) |
| RecordWriter | LineRecordWriter |
| OutputCommitter | FileOutputCommitter |
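Since the grouping comparator gets a row of its own above, here is a minimal sketch of both registration points. The comparator class and its grouping rule are illustrative, not from the original post:

// Option 1: a standalone comparator set in the driver class.
// Groups keys by their first character only, so one reduce() call
// sees the values of every key sharing that character.
public class FirstCharGroupComparator extends WritableComparator {
  protected FirstCharGroupComparator() {
    super(Text.class, true); // true: instantiate keys so object-level compare works
  }
  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    return a.toString().substring(0, 1).compareTo(b.toString().substring(0, 1));
  }
}
// registered on the job:
job.setGroupingComparatorClass(FirstCharGroupComparator.class);

// Option 2: inside the key class itself, via a static registration block --
// exactly what Text does with WritableComparator.define(), shown later in this post.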
/**
 * A base class for file-based {@link InputFormat}s.
 *
 * <p><code>FileInputFormat</code> is the base class for all file-based
 * <code>InputFormat</code>s. This provides a generic implementation of
 * {@link #getSplits(JobContext)}.
 * Subclasses of <code>FileInputFormat</code> can also override the
 * {@link #isSplitable(JobContext, Path)} method to ensure input-files are
 * not split-up and are processed as a whole by {@link Mapper}s.
 */
public abstract class FileInputFormat<K, V> extends InputFormat<K, V> {
public static enum Counter {
BYTES_READ
}
private static final Log LOG = LogFactory.getLog(FileInputFormat.class);
private static final double SPLIT_SLOP = 1.1; // 10% slop
private static final PathFilter hiddenFileFilter = new PathFilter(){
public boolean accept(Path p){
String name = p.getName();
return !name.startsWith("_") && !name.startsWith(".");
}
};
static final String NUM_INPUT_FILES = "mapreduce.input.num.files";
/**
* Proxy PathFilter that accepts a path only if all filters given in the
* constructor do. Used by the listPaths() to apply the built-in
* hiddenFileFilter together with a user provided one (if any).
*/
private static class MultiPathFilter implements PathFilter {
private List<PathFilter> filters;
public MultiPathFilter(List<PathFilter> filters) {
this.filters = filters;
}
public boolean accept(Path path) {
for (PathFilter filter : filters) {
if (!filter.accept(path)) {
return false;
}
}
return true;
}
}
/**
* Get the lower bound on split size imposed by the format.
* @return the number of bytes of the minimal split for this format
*/
protected long getFormatMinSplitSize() {
return 1;
}
/**
 * Is the given filename splitable? Usually, true, but if the file is
 * stream compressed, it will not be.
 *
 * <p><code>FileInputFormat</code> implementations can override this and
 * return <code>false</code> to ensure that individual input files are
 * never split-up so that {@link Mapper}s process entire files.
 *
 * @param context the job context
 * @param filename the file name to check
 * @return is this file splitable?
 */
protected boolean isSplitable(JobContext context, Path filename) {
return true;
}
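A format that needs each file handled by exactly one mapper can override this hook. A minimal sketch (the class name is illustrative), reusing the TextInputFormat shown later in this post:

public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // never split: each mapper processes one whole file
  }
}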
/**
* Set a PathFilter to be applied to the input paths for the map-reduce job.
* @param job the job to modify
* @param filter the PathFilter class use for filtering the input paths.
*/
public static void setInputPathFilter(Job job,
Class<? extends PathFilter> filter) {
job.getConfiguration().setClass("mapred.input.pathFilter.class", filter,
PathFilter.class);
}
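A minimal sketch of a user-supplied filter (the class name is illustrative); listStatus() below combines it with the built-in hiddenFileFilter through the MultiPathFilter above:

public class LogOnlyFilter implements PathFilter {
  public boolean accept(Path path) {
    return path.getName().endsWith(".log"); // keep only .log inputs
  }
}
// in the driver:
FileInputFormat.setInputPathFilter(job, LogOnlyFilter.class);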
/**
* Set the minimum input split size
* @param job the job to modify
* @param size the minimum size
*/
public static void setMinInputSplitSize(Job job,
long size) {
job.getConfiguration().setLong("mapred.min.split.size", size);
}
/**
* Get the minimum split size
* @param job the job
* @return the minimum number of bytes that can be in a split
*/
public static long getMinSplitSize(JobContext job) {
return job.getConfiguration().getLong("mapred.min.split.size", 1L);
}
/**
* Set the maximum split size
* @param job the job to modify
* @param size the maximum split size
*/
public static void setMaxInputSplitSize(Job job,
long size) {
job.getConfiguration().setLong("mapred.max.split.size", size);
}
/**
* Get the maximum split size.
* @param context the job to look at.
* @return the maximum number of bytes a split can include
*/
public static long getMaxSplitSize(JobContext context) {
return context.getConfiguration().getLong("mapred.max.split.size",
Long.MAX_VALUE);
}
/**
* Get a PathFilter instance of the filter set for the input paths.
*
* @return the PathFilter instance set for the job, NULL if none has been set.
*/
public static PathFilter getInputPathFilter(JobContext context) {
Configuration conf = context.getConfiguration();
Class<?> filterClass = conf.getClass("mapred.input.pathFilter.class", null,
PathFilter.class);
return (filterClass != null) ?
(PathFilter) ReflectionUtils.newInstance(filterClass, conf) : null;
}
/** List input directories.
* Subclasses may override to, e.g., select only files matching a regular
* expression.
*
* @param job the job to list input paths for
* @return array of FileStatus objects
* @throws IOException if zero items.
*/
protected List<FileStatus> listStatus(JobContext job
) throws IOException {
List<FileStatus> result = new ArrayList<FileStatus>();
Path[] dirs = getInputPaths(job);
if (dirs.length == 0) {
throw new IOException("No input paths specified in job");
}
// get tokens for all the required FileSystems..
TokenCache.obtainTokensForNamenodes(job.getCredentials(), dirs,
job.getConfiguration());
List<IOException> errors = new ArrayList<IOException>();
// creates a MultiPathFilter with the hiddenFileFilter and the
// user provided one (if any).
List<PathFilter> filters = new ArrayList<PathFilter>();
filters.add(hiddenFileFilter);
PathFilter jobFilter = getInputPathFilter(job);
if (jobFilter != null) {
filters.add(jobFilter);
}
PathFilter inputFilter = new MultiPathFilter(filters);
for (int i=0; i < dirs.length; ++i) {
Path p = dirs[i];
FileSystem fs = p.getFileSystem(job.getConfiguration());
FileStatus[] matches = fs.globStatus(p, inputFilter);
if (matches == null) {
errors.add(new IOException("Input path does not exist: " + p));
} else if (matches.length == 0) {
errors.add(new IOException("Input Pattern " + p + " matches 0 files"));
} else {
for (FileStatus globStat: matches) {
if (globStat.isDir()) {
for(FileStatus stat: fs.listStatus(globStat.getPath(),
inputFilter)) {
result.add(stat);
}
} else {
result.add(globStat);
}
}
}
}
if (!errors.isEmpty()) {
throw new InvalidInputException(errors);
}
LOG.info("Total input paths to process : " + result.size());
return result;
}
/**
* Generate the list of files and make them into FileSplits.
*/
public List<InputSplit> getSplits(JobContext job
) throws IOException {
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
long maxSize = getMaxSplitSize(job);
// generate splits
List<InputSplit> splits = new ArrayList<InputSplit>();
List<FileStatus> files = listStatus(job);
for (FileStatus file: files) {
Path path = file.getPath();
FileSystem fs = path.getFileSystem(job.getConfiguration());
long length = file.getLen();
BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
if ((length != 0) && isSplitable(job, path)) {
long blockSize = file.getBlockSize();
long splitSize = computeSplitSize(blockSize, minSize, maxSize);
long bytesRemaining = length;
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
blkLocations[blkIndex].getHosts()));
bytesRemaining -= splitSize;
}
if (bytesRemaining != 0) {
splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
blkLocations[blkLocations.length-1].getHosts()));
}
} else if (length != 0) {
splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
} else {
//Create empty hosts array for zero length files
splits.add(new FileSplit(path, 0, length, new String[0]));
}
}
// Save the number of input files in the job-conf
job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
LOG.debug("Total # of splits: " + splits.size());
return splits;
}
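A worked example of the 10% SPLIT_SLOP: with a 100 MB file and a 64 MB split size, 100/64 ≈ 1.56 > 1.1, so a 64 MB split is cut and the remaining 36 MB (36/64 ≈ 0.56 ≤ 1.1) becomes the final split. A 68 MB file, by contrast, never enters the loop (68/64 ≈ 1.06 ≤ 1.1) and produces a single 68 MB split instead of a 64 MB split plus a 4 MB sliver.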
protected long computeSplitSize(long blockSize, long minSize,
long maxSize) {
return Math.max(minSize, Math.min(maxSize, blockSize));
}
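So splitSize = max(minSize, min(maxSize, blockSize)): with the defaults (minSize = 1, maxSize = Long.MAX_VALUE) a split is exactly one block. The setters shown earlier move it in either direction; a hypothetical driver snippet:

// Either knob alone changes the outcome (on a cluster with 64 MB blocks):
FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024); // max(128M, min(MAX, 64M)) = 128 MB splits
// ...or, instead:
FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);  // max(1, min(16M, 64M)) = 16 MB splits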
protected int getBlockIndex(BlockLocation[] blkLocations,
long offset) {
for (int i = 0 ; i < blkLocations.length; i++) {
// is the offset inside this block?
if ((blkLocations[i].getOffset() <= offset) &&
(offset < blkLocations[i].getOffset() + blkLocations[i].getLength())){
return i;
}
}
BlockLocation last = blkLocations[blkLocations.length -1];
long fileLength = last.getOffset() + last.getLength() -1;
throw new IllegalArgumentException("Offset " + offset +
" is outside of file (0.." +
fileLength + ")");
}
/**
* Sets the given comma separated paths as the list of inputs
* for the map-reduce job.
*
* @param job the job
* @param commaSeparatedPaths Comma separated paths to be set as
* the list of inputs for the map-reduce job.
*/
public static void setInputPaths(Job job,
String commaSeparatedPaths
) throws IOException {
setInputPaths(job, StringUtils.stringToPath(
getPathStrings(commaSeparatedPaths)));
}
/**
* Add the given comma separated paths to the list of inputs for
* the map-reduce job.
*
* @param job The job to modify
* @param commaSeparatedPaths Comma separated paths to be added to
* the list of inputs for the map-reduce job.
*/
public static void addInputPaths(Job job,
String commaSeparatedPaths
) throws IOException {
for (String str : getPathStrings(commaSeparatedPaths)) {
addInputPath(job, new Path(str));
}
}
/**
* Set the array of {@link Path}s as the list of inputs
* for the map-reduce job.
*
* @param job The job to modify
* @param inputPaths the {@link Path}s of the input directories/files
* for the map-reduce job.
*/
public static void setInputPaths(Job job,
Path... inputPaths) throws IOException {
Configuration conf = job.getConfiguration();
Path path = inputPaths[0].getFileSystem(conf).makeQualified(inputPaths[0]);
StringBuffer str = new StringBuffer(StringUtils.escapeString(path.toString()));
for(int i = 1; i < inputPaths.length;i++) {
str.append(StringUtils.COMMA_STR);
path = inputPaths[i].getFileSystem(conf).makeQualified(inputPaths[i]);
str.append(StringUtils.escapeString(path.toString()));
}
conf.set("mapred.input.dir", str.toString());
}
/**
* Add a {@link Path} to the list of inputs for the map-reduce job.
*
* @param job The {@link Job} to modify
* @param path {@link Path} to be added to the list of inputs for
* the map-reduce job.
*/
public static void addInputPath(Job job,
Path path) throws IOException {
Configuration conf = job.getConfiguration();
path = path.getFileSystem(conf).makeQualified(path);
String dirStr = StringUtils.escapeString(path.toString());
String dirs = conf.get("mapred.input.dir");
conf.set("mapred.input.dir", dirs == null ? dirStr : dirs + "," + dirStr);
}
// Splits the comma separated path list, ignoring commas that appear
// inside glob groups such as {a,b}.
private static String[] getPathStrings(String commaSeparatedPaths) {
  int length = commaSeparatedPaths.length();
  int curlyOpen = 0;
  int pathStart = 0;
  boolean globPattern = false;
  List<String> pathStrings = new ArrayList<String>();
  for (int i = 0; i < length; i++) {
    char ch = commaSeparatedPaths.charAt(i);
    if (ch == '{') {                        // entering a {a,b} glob group
      curlyOpen++;
      globPattern = true;
    } else if (ch == '}') {                 // leaving the glob group
      if (--curlyOpen == 0) {
        globPattern = false;
      }
    } else if (ch == ',' && !globPattern) { // a real path separator
      pathStrings.add(commaSeparatedPaths.substring(pathStart, i));
      pathStart = i + 1;
    }
  }
  pathStrings.add(commaSeparatedPaths.substring(pathStart, length));
  return pathStrings.toArray(new String[0]);
}
}
public class TextInputFormat extends FileInputFormat<LongWritable, Text> {
@Override
public RecordReader<LongWritable, Text>
createRecordReader(InputSplit split,
TaskAttemptContext context) {
return new LineRecordReader();
}
@Override
protected boolean isSplitable(JobContext context, Path file) {
CompressionCodec codec =
new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
return codec == null;
}
}
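A practical consequence of this override: a gzip file has a codec and therefore is never split, so however large it is, it is read as a single split by a single mapper.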
public class FileSplit extends InputSplit implements Writable {
private Path file;
private long start;
private long length;
private String[] hosts;
FileSplit() {}
/** Constructs a split with host information
*
* @param file the file name
* @param start the position of the first byte in the file to process
* @param length the number of bytes in the file to process
* @param hosts the list of hosts containing the block, possibly null
*/
public FileSplit(Path file, long start, long length, String[] hosts) {
this.file = file;
this.start = start;
this.length = length;
this.hosts = hosts;
}
/** The file containing this split's data. */
public Path getPath() { return file; }
/** The position of the first byte in the file to process. */
public long getStart() { return start; }
/** The number of bytes in the file to process. */
@Override
public long getLength() { return length; }
@Override
public String toString() { return file + ":" + start + "+" + length; }
////////////////////////////////////////////
// Writable methods
////////////////////////////////////////////
@Override
public void write(DataOutput out) throws IOException {
Text.writeString(out, file.toString());
out.writeLong(start);
out.writeLong(length);
}
@Override
public void readFields(DataInput in) throws IOException {
file = new Path(Text.readString(in));
start = in.readLong();
length = in.readLong();
hosts = null;
}
@Override
public String[] getLocations() throws IOException {
if (this.hosts == null) {
return new String[]{};
} else {
return this.hosts;
}
}
}
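Note that write()/readFields() deliberately drop hosts: the block locations are only consulted on the scheduling side for data-local task placement, so they never need to travel with the serialized split (after deserialization, getLocations() simply returns an empty array).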
public class LineRecordReader extends RecordReader<LongWritable, Text> {
private static final Log LOG = LogFactory.getLog(LineRecordReader.class);
private CompressionCodecFactory compressionCodecs = null;
private long start;
private long pos;
private long end;
private LineReader in;
private int maxLineLength;
private LongWritable key = null;
private Text value = null;
public void initialize(InputSplit genericSplit,
TaskAttemptContext context) throws IOException {
FileSplit split = (FileSplit) genericSplit;
Configuration job = context.getConfiguration();
this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
Integer.MAX_VALUE);
start = split.getStart();
end = start + split.getLength();
final Path file = split.getPath();
compressionCodecs = new CompressionCodecFactory(job);
final CompressionCodec codec = compressionCodecs.getCodec(file);
// open the file and seek to the start of the split
FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(split.getPath());
boolean skipFirstLine = false;
if (codec != null) {
in = new LineReader(codec.createInputStream(fileIn), job);
end = Long.MAX_VALUE;
} else {
if (start != 0) {
skipFirstLine = true;
--start;
fileIn.seek(start);
}
in = new LineReader(fileIn, job);
}
if (skipFirstLine) { // skip first line and re-establish "start".
start += in.readLine(new Text(), 0,
(int)Math.min((long)Integer.MAX_VALUE, end - start));
}
this.pos = start;
}
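Why the one-byte back-up and skipped line: a split boundary usually falls in the middle of a line. The convention is that every split except the first discards everything up to and including its first newline, while the previous split's reader is allowed to read past its own end to finish the line it started. Backing start up by one byte makes this work even when the split happens to begin exactly at a line boundary. The net effect is that every line is read exactly once, by exactly one mapper.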
public boolean nextKeyValue() throws IOException {
if (key == null) {
key = new LongWritable();
}
key.set(pos);
if (value == null) {
value = new Text();
}
int newSize = 0;
while (pos < end) {
newSize = in.readLine(value, maxLineLength,
Math.max((int)Math.min(Integer.MAX_VALUE, end-pos),
maxLineLength));
if (newSize == 0) {
break;
}
pos += newSize;
if (newSize < maxLineLength) {
break;
}
// line too long. try again
LOG.info("Skipped line of size " + newSize + " at pos " +
(pos - newSize));
}
if (newSize == 0) {
key = null;
value = null;
return false;
} else {
return true;
}
}
@Override
public LongWritable getCurrentKey() {
return key;
}
@Override
public Text getCurrentValue() {
return value;
}
/**
* Get the progress within the split
*/
public float getProgress() {
if (start == end) {
return 0.0f;
} else {
return Math.min(1.0f, (pos - start) / (float)(end - start));
}
}
public synchronized void close() throws IOException {
if (in != null) {
in.close();
}
}
}
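For context, nextKeyValue()/getCurrentKey()/getCurrentValue() are driven by the framework; Mapper.run() in this Hadoop version is essentially:

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {   // delegates to LineRecordReader.nextKeyValue()
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}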
public class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one); // emit (word, 1) for every token
}
}
}
public class IntSumReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public class HashPartitioner<K, V> extends Partitioner<K, V> {
/** Use {@link Object#hashCode()} to partition. */
public int getPartition(K key, V value, int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
}
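Note the & Integer.MAX_VALUE: hashCode() may be negative and Java's % keeps the dividend's sign, so the sign bit is masked off first, guaranteeing a partition number in [0, numReduceTasks). Because equal keys have equal hash codes, all records for a key land in the same reduce partition.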
public static class Comparator extends WritableComparator {
public Comparator() {
super(Text.class);
}
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2) {
int n1 = WritableUtils.decodeVIntSize(b1[s1]);
int n2 = WritableUtils.decodeVIntSize(b2[s2]);
return compareBytes(b1, s1+n1, l1-n1, b2, s2+n2, l2-n2);
}
}
static {
// register this comparator
WritableComparator.define(Text.class, new Comparator());
}
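This Comparator, together with the static registration block, lives inside the Text class itself; it is the "inside the key class" option from the table at the top. Because it is a raw WritableComparator, the sort and group phases can compare keys on their serialized bytes without deserializing them: Text serializes as a vint length prefix followed by UTF-8 bytes, so decodeVIntSize() skips the prefix and compareBytes() compares the payloads.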
The reduce phase runs the same IntSumReducer already shown above: WordCount registers that single class as both the combiner and the reducer.
public abstract class FileOutputFormat<K, V> extends OutputFormat<K, V> {
protected static final String BASE_OUTPUT_NAME = "mapreduce.output.basename";
protected static final String PART = "part";
public static enum Counter {
BYTES_WRITTEN
}
/** Construct output file names so that, when an output directory listing is
* sorted lexicographically, positions correspond to output partitions.*/
private static final NumberFormat NUMBER_FORMAT = NumberFormat.getInstance();
static {
NUMBER_FORMAT.setMinimumIntegerDigits(5);
NUMBER_FORMAT.setGroupingUsed(false);
}
private FileOutputCommitter committer = null;
/**
* Set whether the output of the job is compressed.
* @param job the job to modify
* @param compress should the output of the job be compressed?
*/
public static void setCompressOutput(Job job, boolean compress) {
job.getConfiguration().setBoolean("mapred.output.compress", compress);
}
/**
 * Is the job output compressed?
 * @param job the Job to look in
 * @return <code>true</code> if the job output should be compressed,
 *         <code>false</code> otherwise
 */
public static boolean getCompressOutput(JobContext job) {
return job.getConfiguration().getBoolean("mapred.output.compress", false);
}
/**
* Set the {@link CompressionCodec} to be used to compress job outputs.
* @param job the job to modify
* @param codecClass the {@link CompressionCodec} to be used to
* compress the job outputs
*/
public static void
setOutputCompressorClass(Job job,
Class<? extends CompressionCodec> codecClass) {
setCompressOutput(job, true);
job.getConfiguration().setClass("mapred.output.compression.codec",
codecClass,
CompressionCodec.class);
}
/**
* Get the {@link CompressionCodec} for compressing the job outputs.
* @param job the {@link Job} to look in
* @param defaultValue the {@link CompressionCodec} to return if not set
* @return the {@link CompressionCodec} to be used to compress the
* job outputs
* @throws IllegalArgumentException if the class was specified, but not found
*/
public static Class<? extends CompressionCodec>
getOutputCompressorClass(JobContext job,
Class<? extends CompressionCodec> defaultValue) {
Class<? extends CompressionCodec> codecClass = defaultValue;
Configuration conf = job.getConfiguration();
String name = conf.get("mapred.output.compression.codec");
if (name != null) {
try {
codecClass =
conf.getClassByName(name).asSubclass(CompressionCodec.class);
} catch (ClassNotFoundException e) {
throw new IllegalArgumentException("Compression codec " + name +
" was not found.", e);
}
}
return codecClass;
}
public abstract RecordWriter<K, V>
getRecordWriter(TaskAttemptContext job
) throws IOException, InterruptedException;
public void checkOutputSpecs(JobContext job
) throws FileAlreadyExistsException, IOException{
// Ensure that the output directory is set and not already there
Path outDir = getOutputPath(job);
if (outDir == null) {
throw new InvalidJobConfException("Output directory not set.");
}
// get delegation token for outDir's file system
TokenCache.obtainTokensForNamenodes(job.getCredentials(),
new Path[] {outDir},
job.getConfiguration());
if (outDir.getFileSystem(job.getConfiguration()).exists(outDir)) {
throw new FileAlreadyExistsException("Output directory " + outDir +
" already exists");
}
}
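This check is why resubmitting a job against an existing output directory fails immediately with FileAlreadyExistsException: delete the directory or pick a new output path before rerunning.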
/**
* Set the {@link Path} of the output directory for the map-reduce job.
*
* @param job The job to modify
* @param outputDir the {@link Path} of the output directory for
* the map-reduce job.
*/
public static void setOutputPath(Job job, Path outputDir) {
job.getConfiguration().set("mapred.output.dir", outputDir.toString());
}
/**
* Get the {@link Path} to the output directory for the map-reduce job.
*
* @return the {@link Path} to the output directory for the map-reduce job.
* @see FileOutputFormat#getWorkOutputPath(TaskInputOutputContext)
*/
public static Path getOutputPath(JobContext job) {
String name = job.getConfiguration().get("mapred.output.dir");
return name == null ? null: new Path(name);
}
/**
 * Get the {@link Path} to the task's temporary output directory
 * for the map-reduce job.
 *
 * <h4>Tasks' Side-Effect Files</h4>
 *
 * <p>Some applications need to create/write-to side-files, which differ from
 * the actual job-outputs.</p>
 *
 * <p>In such cases there could be issues with 2 instances of the same TIP
 * (running simultaneously e.g. speculative tasks) trying to open/write-to the
 * same file (path) on HDFS. Hence the application-writer will have to pick
 * unique names per task-attempt (e.g. using the attemptid, say
 * <tt>attempt_200709221812_0001_m_000000_0</tt>), not just per TIP.</p>
 *
 * <p>To get around this the Map-Reduce framework helps the application-writer
 * out by maintaining a special
 * <tt>${mapred.output.dir}/_temporary/_${taskid}</tt>
 * sub-directory for each task-attempt on HDFS where the output of the
 * task-attempt goes. On successful completion of the task-attempt the files
 * in the <tt>${mapred.output.dir}/_temporary/_${taskid}</tt> (only)
 * are promoted to <tt>${mapred.output.dir}</tt>. Of course, the
 * framework discards the sub-directory of unsuccessful task-attempts. This
 * is completely transparent to the application.</p>
 *
 * <p>The application-writer can take advantage of this by creating any
 * side-files required in a work directory during execution
 * of his task i.e. via
 * {@link #getWorkOutputPath(TaskInputOutputContext)}, and
 * the framework will move them out similarly - thus she doesn't have to pick
 * unique paths per task-attempt.</p>
 *
 * <p>The entire discussion holds true for maps of jobs with
 * reducer=NONE (i.e. 0 reduces) since output of the map, in that case,
 * goes directly to HDFS.</p>
 *
 * @return the {@link Path} to the task's temporary output directory
 * for the map-reduce job.
 */
public static Path getWorkOutputPath(TaskInputOutputContext<?,?,?,?> context
) throws IOException,
InterruptedException {
FileOutputCommitter committer = (FileOutputCommitter)
context.getOutputCommitter();
return committer.getWorkPath();
}
/**
 * Helper function to generate a {@link Path} for a file that is unique for
 * the task within the job output directory.
 *
 * <p>The path can be used to create custom files from within the map and
 * reduce tasks. The path name will be unique for each task. The path parent
 * will be the job output directory.</p>
 *
 * <p>This method uses the {@link #getUniqueFile} method to make the file name
 * unique for the task.</p>
 *
 * @param context the context for the task.
 * @param name the name for the file.
 * @param extension the extension for the file
 * @return a unique path across all tasks of the job.
 */
public
static Path getPathForWorkFile(TaskInputOutputContext<?,?,?,?> context,
String name,
String extension
) throws IOException, InterruptedException {
return new Path(getWorkOutputPath(context),
getUniqueFile(context, name, extension));
}
/**
* Generate a unique filename, based on the task id, name, and extension
* @param context the task that is calling this
* @param name the base filename
* @param extension the filename extension
* @return a string like $name-[mr]-$id$extension
*/
public synchronized static String getUniqueFile(TaskAttemptContext context,
String name,
String extension) {
TaskID taskId = context.getTaskAttemptID().getTaskID();
int partition = taskId.getId();
StringBuilder result = new StringBuilder();
result.append(name);
result.append('-');
result.append(taskId.isMap() ? 'm' : 'r');
result.append('-');
result.append(NUMBER_FORMAT.format(partition));
result.append(extension);
return result.toString();
}
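For example, reduce task 0 writing the default base name yields part-r-00000 (five digits from NUMBER_FORMAT), and map task 3 with a .gz extension would get part-m-00003.gz.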
/**
* Get the default path and filename for the output format.
* @param context the task context
* @param extension an extension to add to the filename
* @return a full path $output/_temporary/$taskid/part-[mr]-$id
* @throws IOException
*/
public Path getDefaultWorkFile(TaskAttemptContext context,
String extension) throws IOException{
FileOutputCommitter committer =
(FileOutputCommitter) getOutputCommitter(context);
return new Path(committer.getWorkPath(), getUniqueFile(context,
getOutputName(context), extension));
}
/**
* Get the base output name for the output file.
*/
protected static String getOutputName(JobContext job) {
return job.getConfiguration().get(BASE_OUTPUT_NAME, PART);
}
/**
* Set the base output name for output file to be created.
*/
protected static void setOutputName(JobContext job, String name) {
job.getConfiguration().set(BASE_OUTPUT_NAME, name);
}
public synchronized
OutputCommitter getOutputCommitter(TaskAttemptContext context
) throws IOException {
if (committer == null) {
Path output = getOutputPath(context);
committer = new FileOutputCommitter(output, context);
}
return committer;
}
}
public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {
protected static class LineRecordWriter<K, V>
extends RecordWriter<K, V> {
private static final String utf8 = "UTF-8";
private static final byte[] newline;
static {
try {
newline = "\n".getBytes(utf8);
} catch (UnsupportedEncodingException uee) {
throw new IllegalArgumentException("can't find " + utf8 + " encoding");
}
}
protected DataOutputStream out;
private final byte[] keyValueSeparator;
public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
this.out = out;
try {
this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
} catch (UnsupportedEncodingException uee) {
throw new IllegalArgumentException("can't find " + utf8 + " encoding");
}
}
public LineRecordWriter(DataOutputStream out) {
this(out, "\t");
}
/**
* Write the object to the byte stream, handling Text as a special
* case.
* @param o the object to print
* @throws IOException if the write throws, we pass it on
*/
private void writeObject(Object o) throws IOException {
if (o instanceof Text) {
Text to = (Text) o;
out.write(to.getBytes(), 0, to.getLength());
} else {
out.write(o.toString().getBytes(utf8));
}
}
public synchronized void write(K key, V value)
throws IOException {
boolean nullKey = key == null || key instanceof NullWritable;
boolean nullValue = value == null || value instanceof NullWritable;
if (nullKey && nullValue) {
return;
}
if (!nullKey) {
writeObject(key);
}
if (!(nullKey || nullValue)) {
out.write(keyValueSeparator);
}
if (!nullValue) {
writeObject(value);
}
out.write(newline);
}
public synchronized
void close(TaskAttemptContext context) throws IOException {
out.close();
}
}
public RecordWriter<K, V>
getRecordWriter(TaskAttemptContext job
) throws IOException, InterruptedException {
Configuration conf = job.getConfiguration();
boolean isCompressed = getCompressOutput(job);
String keyValueSeparator= conf.get("mapred.textoutputformat.separator",
"\t");
CompressionCodec codec = null;
String extension = "";
if (isCompressed) {
Class<? extends CompressionCodec> codecClass =
getOutputCompressorClass(job, GzipCodec.class);
codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
extension = codec.getDefaultExtension();
}
Path file = getDefaultWorkFile(job, extension);
FileSystem fs = file.getFileSystem(conf);
if (!isCompressed) {
FSDataOutputStream fileOut = fs.create(file, false);
return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
} else {
FSDataOutputStream fileOut = fs.create(file, false);
return new LineRecordWriter<K, V>(new DataOutputStream
(codec.createOutputStream(fileOut)),
keyValueSeparator);
}
}
}
public class FileOutputCommitter extends OutputCommitter {
private static final Log LOG = LogFactory.getLog(FileOutputCommitter.class);
/**
* Temporary directory name
*/
protected static final String TEMP_DIR_NAME = "_temporary";
public static final String SUCCEEDED_FILE_NAME = "_SUCCESS";
static final String SUCCESSFUL_JOB_OUTPUT_DIR_MARKER =
"mapreduce.fileoutputcommitter.marksuccessfuljobs";
private FileSystem outputFileSystem = null;
private Path outputPath = null;
private Path workPath = null;
/**
* Create a file output committer
* @param outputPath the job's output path
* @param context the task's context
* @throws IOException
*/
public FileOutputCommitter(Path outputPath,
TaskAttemptContext context) throws IOException {
if (outputPath != null) {
this.outputPath = outputPath;
outputFileSystem = outputPath.getFileSystem(context.getConfiguration());
workPath = new Path(outputPath,
(FileOutputCommitter.TEMP_DIR_NAME + Path.SEPARATOR +
"_" + context.getTaskAttemptID().toString()
)).makeQualified(outputFileSystem);
}
}
/**
* Create the temporary directory that is the root of all of the task
* work directories.
* @param context the job's context
*/
public void setupJob(JobContext context) throws IOException {
if (outputPath != null) {
Path tmpDir = new Path(outputPath, FileOutputCommitter.TEMP_DIR_NAME);
FileSystem fileSys = tmpDir.getFileSystem(context.getConfiguration());
if (!fileSys.mkdirs(tmpDir)) {
LOG.error("Mkdirs failed to create " + tmpDir.toString());
}
}
}
private static boolean shouldMarkOutputDir(Configuration conf) {
return conf.getBoolean(SUCCESSFUL_JOB_OUTPUT_DIR_MARKER,
true);
}
// Mark the output dir of the job for which the context is passed.
private void markOutputDirSuccessful(JobContext context)
throws IOException {
if (outputPath != null) {
FileSystem fileSys = outputPath.getFileSystem(context.getConfiguration());
if (fileSys.exists(outputPath)) {
// create a file in the folder to mark it
Path filePath = new Path(outputPath, SUCCEEDED_FILE_NAME);
fileSys.create(filePath).close();
}
}
}
/**
* Delete the temporary directory, including all of the work directories.
* This is called for all jobs whose final run state is SUCCEEDED
* @param context the job's context.
*/
public void commitJob(JobContext context) throws IOException {
// delete the _temporary folder
cleanupJob(context);
// check if the o/p dir should be marked
if (shouldMarkOutputDir(context.getConfiguration())) {
// create a _success file in the o/p folder
markOutputDirSuccessful(context);
}
}
@Override
@Deprecated
public void cleanupJob(JobContext context) throws IOException {
if (outputPath != null) {
Path tmpDir = new Path(outputPath, FileOutputCommitter.TEMP_DIR_NAME);
FileSystem fileSys = tmpDir.getFileSystem(context.getConfiguration());
if (fileSys.exists(tmpDir)) {
fileSys.delete(tmpDir, true);
}
} else {
LOG.warn("Output path is null in cleanup");
}
}
/**
* Delete the temporary directory, including all of the work directories.
* @param context the job's context
* @param state final run state of the job, should be FAILED or KILLED
*/
@Override
public void abortJob(JobContext context, JobStatus.State state)
throws IOException {
cleanupJob(context);
}
/**
* No task setup required.
*/
@Override
public void setupTask(TaskAttemptContext context) throws IOException {
// FileOutputCommitter's setupTask doesn't do anything. Because the
// temporary task directory is created on demand when the
// task is writing.
}
/**
* Move the files from the work directory to the job output directory
* @param context the task context
*/
public void commitTask(TaskAttemptContext context)
throws IOException {
TaskAttemptID attemptId = context.getTaskAttemptID();
if (workPath != null) {
context.progress();
if (outputFileSystem.exists(workPath)) {
// Move the task outputs to their final place
moveTaskOutputs(context, outputFileSystem, outputPath, workPath);
// Delete the temporary task-specific output directory
if (!outputFileSystem.delete(workPath, true)) {
LOG.warn("Failed to delete the temporary output" +
" directory of task: " + attemptId + " - " + workPath);
}
LOG.info("Saved output of task '" + attemptId + "' to " +
outputPath);
}
}
}
/**
* Move all of the files from the work directory to the final output
* @param context the task context
* @param fs the output file system
* @param jobOutputDir the final output directory
* @param taskOutput the work path
* @throws IOException
*/
private void moveTaskOutputs(TaskAttemptContext context,
FileSystem fs,
Path jobOutputDir,
Path taskOutput)
throws IOException {
TaskAttemptID attemptId = context.getTaskAttemptID();
context.progress();
if (fs.isFile(taskOutput)) {
Path finalOutputPath = getFinalPath(jobOutputDir, taskOutput,
workPath);
if (!fs.rename(taskOutput, finalOutputPath)) {
if (!fs.delete(finalOutputPath, true)) {
throw new IOException("Failed to delete earlier output of task: " +
attemptId);
}
if (!fs.rename(taskOutput, finalOutputPath)) {
throw new IOException("Failed to save output of task: " +
attemptId);
}
}
LOG.debug("Moved " + taskOutput + " to " + finalOutputPath);
} else if(fs.getFileStatus(taskOutput).isDir()) {
FileStatus[] paths = fs.listStatus(taskOutput);
Path finalOutputPath = getFinalPath(jobOutputDir, taskOutput, workPath);
fs.mkdirs(finalOutputPath);
if (paths != null) {
for (FileStatus path : paths) {
moveTaskOutputs(context, fs, jobOutputDir, path.getPath());
}
}
}
}
/**
* Delete the work directory
*/
@Override
public void abortTask(TaskAttemptContext context) {
try {
if (workPath != null) {
context.progress();
outputFileSystem.delete(workPath, true);
}
} catch (IOException ie) {
LOG.warn("Error discarding output" + StringUtils.stringifyException(ie));
}
}
/**
* Find the final name of a given output file, given the job output directory
* and the work directory.
* @param jobOutputDir the job's output directory
* @param taskOutput the specific task output file
* @param taskOutputPath the job's work directory
* @return the final path for the specific output file
* @throws IOException
*/
private Path getFinalPath(Path jobOutputDir, Path taskOutput,
Path taskOutputPath) throws IOException {
URI taskOutputUri = taskOutput.toUri();
URI relativePath = taskOutputPath.toUri().relativize(taskOutputUri);
if (taskOutputUri == relativePath) {
throw new IOException("Can not get the relative path: base = " +
taskOutputPath + " child = " + taskOutput);
}
if (relativePath.getPath().length() > 0) {
return new Path(jobOutputDir, relativePath.getPath());
} else {
return jobOutputDir;
}
}
/**
* Did this task write any files in the work directory?
* @param context the task's context
*/
@Override
public boolean needsTaskCommit(TaskAttemptContext context
) throws IOException {
return workPath != null && outputFileSystem.exists(workPath);
}
/**
* Get the directory that the task should write results into
* @return the work directory
* @throws IOException
*/
public Path getWorkPath() throws IOException {
return workPath;
}
}
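Putting the committer's pieces together, an output file moves through roughly these locations (paths follow the code above; the attempt id is illustrative):

${mapred.output.dir}/_temporary/_attempt_200709221812_0001_r_000000_0/part-r-00000   (task writes here, via getWorkPath())
${mapred.output.dir}/part-r-00000   (after commitTask() moves it)
${mapred.output.dir}/_SUCCESS       (after commitJob(), which also deletes _temporary)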
public class WordCount {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: wordcount ");
System.exit(2);
}
Job job = new Job(conf, "word count");
job.setJarByClass(WordCount.class);
job.setInputFormatClass(TextInputFormat.class);
TextInputFormat.addInputPath(job, new Path(otherArgs[0]));
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setGroupingComparatorClass(Text.Comparator.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
hadoop --config <config directory> jar <path to jar> WordCount in out