Hadoop官方文档翻译——MapReduce Tutorial

  • MapReduce Tutorial(个人指导)
    • Purpose(目的)
    • Prerequisites(必备条件)
    • Overview(综述)
    • Inputs and Outputs(输入输出)
    • MapReduce - User Interfaces(用户接口)
      • Payload(有效负载)
        • Mapper
        • Reducer
        • Partitioner
        • Counter
      • Job Configuration(作业配置)
      • Task Execution & Environment(任务执行和环境)
        • Memory Management(内存管理)
        • Map Parameters(Map参数)
        • Shuffle/Reduce Parameters(Shuffle/Reduce参数)
        • Configured Parameters(配置参数)
        • Task Logs(任务日志)
        • Distributing Libraries(分布式缓存 库)
      • Job Submission and Monitoring(作业提交和监控)
        • Job Control(作业控制)
      • Job Input(作业输入)
        • InputSplit(输入块)
        • RecordReader(记录读取器)
      • Job Output(作业输出)
        • OutputCommitter(输出提交器)
        • Task Side-Effect Files(任务副文件)
        • RecordWriter(记录输出器)
      • Other Useful Features(其他有用的特性)
        • Submitting Jobs to Queues(提交作业到队列中)
        • Counters(计数器)
        • DistributedCache(分布式缓存)
        • Profiling(分析器)
        • Debugging(调试器)
        • Data Compression(数据压缩)
        • Skipping Bad Records(跳过不良数据数据)

 

 

Purpose

This document comprehensively describes all user-facing facets of the Hadoop MapReduce framework and serves as a tutorial.

该文档作为一份个人指导全面性得描述了所有用户使用Hadoop Mapreduce框架时遇到的方方面面。

Prerequisites

Ensure that Hadoop is installed, configured and is running. More details:

    • Single Node Setup for first-time users.
    • Cluster Setup for large, distributed clusters.

确保Hadoop安装、配置和运行。更多细节:

    • 初次使用用户配置单节点。
    •  配置大型、分布式集群

Overview

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

The MapReduce framework consists of a single master ResourceManager, one slave NodeManager per cluster-node, and MRAppMaster per application (see YARN Architecture Guide).

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration.

The Hadoop job client then submits the job (jar/executable etc.) and configuration to the ResourceManager which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.

Although the Hadoop framework is implemented in Java™, MapReduce applications need not be written in Java.

    • Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
    • Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (non JNI™ based).

Hadoop Mapreduce是一个易于编程并且能在大型集群(上千节点)快速地并行得处理大量数据的软件框架,以可靠,容错的方式部署在商用机器上。

MapReduce Job通常将独立大块数据切片以完全并行的方式在map任务中处理。该框架对maps输出的做为reduce输入的数据进行排序,Job的输入输出都是存储在文件系统中。该框架调度任务、监控任务和重启失效的任务。

一般来说计算节点和存储节点都是同样的设置,MapReduce框架和HDFS运行在同组节点。这样的设定使得MapReduce框架能够以更高的带宽来执行任务,当数据已经在节点上时。

MapReduce 框架包含一个主ResourceManager,每个集群节点都有一个从NodeManager和每个应用都有一个MRAppMaster。

应用最少必须指定输入和输出的路径并且通过实现合适的接口或者抽象类来提供map和reduce功能。前面这部分内容和其他Job参数构成了Job的配置。

Hadoop 客户端提交Job和配置信息给ResourceManger,它将负责把配置信息分配给从属节点,调度任务并且监控它们,把状态信息和诊断信息传输给客户端。

  尽管 MapReduce 框架是用Java实现的,但是 MapReduce 应用却不一定要用Java编写。

    • Hadoop Streaming 是一个工具允许用户创建和运行任何可执行文件。
    • Hadoop Pipes 是兼容SWIG用来实现 MapReduce 应用的C++ API(不是基于JNI).

Inputs and Outputs

The MapReduce framework operates exclusively on  pairs, that is, the framework views the input to the job as a set of  pairs and produces a set of pairs as the output of the job, conceivably of different types.

The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement theWritableComparable interface to facilitate sorting by the framework.

Input and Output types of a MapReduce job:

(input)  -> map -> -> combine -> -> reduce ->  (output)

MapReduce 框架只操作键值对,MapReduce 将job的不同类型输入当做键值对来处理并且生成一组键值对作为输出。

Key和Value类必须通过实现Writable接口来实现序列化。此外,Key类必须实现WritableComparable 来使得排序更简单。

MapRedeuce job 的输入输出类型:

(input) ->map->  ->combine->  ->reduce-> (output)

MapReduce - User Interfaces

This section provides a reasonable amount of detail on every user-facing aspect of the MapReduce framework. This should help users implement, configure and tune their jobs in a fine-grained manner. However, please note that the javadoc for each class/interface remains the most comprehensive documentation available; this is only meant to be a tutorial.

Let us first take the Mapper and Reducer interfaces. Applications typically implement them to provide the map and reduce methods.

We will then discuss other core interfaces including Job, Partitioner, InputFormat, OutputFormat, and others.

Finally, we will wrap up by discussing some useful features of the framework such as the DistributedCache, IsolationRunner etc.

这部分将展示 MapReduce 中面向用户方面的尽可能多的细节。这将会帮助用户更小粒度地实现、配置和调试它们的Job。然而,请在 Javadoc 中查看每个类和接口的综合用法,这里仅仅是作为一份指导。

让我们首先来看看Mapper和Reducer接口。应用通常只实现它们提供的map和reduce方法。

我们将会讨论其他接口包括Job、Partitioner、InputFormat和其他的。

最后,我们会讨论一些有用的特性像分布式缓存、隔离运行等。

 

Payload

Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods. These form the core of the job.

应用通常实现Mapper和Reducer接口提供map和reduce方法。这是Job的核心代码。

Mapper

Mapper maps input key/value pairs to a set of intermediate key/value pairs.

Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.

The Hadoop MapReduce framework spawns(产卵) one map task for each InputSplit generated by the InputFormat for the job.

Overall, Mapper implementations are passed the Job for the job via the Job.setMapperClass(Class) method. The framework then calls map(WritableComparable, Writable, Context) for each key/value pair in the InputSplit for that task. Applications can then override the cleanup(Context) method to perform any required cleanup.

Output pairs do not need to be of the same types as input pairs. A given input pair may map to zero or many output pairs. Output pairs are collected with calls to context.write(WritableComparable, Writable).

Applications can use the Counter to report its statistics.

All intermediate(中间的) values associated(联系) with a given output key are subsequently(随后) grouped by the framework, and passed to the Reducer(s) to determine the final output. Users can control the grouping by specifying a Comparator via Job.setGroupingComparatorClass(Class).

The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.

Users can optionally(随意) specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.

The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format. Applications can control if, and how, the intermediate outputs are to be compressed and the CompressionCodec to be used via the Configuration.

Mappers将输入的键值对转换成中间键值对。

Maps是多个单独执行的任务将输入转换成中间记录。那些被转换的中间记录不一定要和输入的记录为相同类型。输入键值对可以在map后输出0或者更多的键值对。

MapReduce 会根据 InputFormat 切分成的各个 InputSplit 都创建一个map任务

总的来说,通过 job.setMapperClass(Class)来给Job设置Mapper实现类,并且将InputSplit输入到map方法进行处理。应用可复写cleanup方法来执行任何需要回收清除的操作。

输出键值对不一定要和输入键值对为相同的类型。一个键值对输入可以输出0至多个不等的键值对。输出键值对将通过context.write(WritableComparable,Writable)方法进行缓存。

应用可以通过Counter进行统计。

所有的中间值都会按照Key进行排序,然后传输给一个特定的Reducer做最后确定的输出。用户可以通过Job.setGroupingComparatorClass(Class)来控制分组规则。

Mapper输出会被排序并且分区到每一个Reducer。分区数和Reduce的数目是一致的。用户可以通过实现一个自定义的Partitioner来控制哪个key对应哪个Reducer。

用户可以随意指定一个combiner,Job.setCombinerClass(Class),来执行局部输出数据的整合,将有效地降低Mapper和Reducer之间的数据传输量。

那些经过排序的中间记录通常会以(key-len, key, value-len, value)的简单格式储存。应用可以通过配置来决定是否需要和怎样压缩数据和选择压缩方式。

 

  How Many Maps?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

The right level of parallelism(平行) for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

Thus, if you expect 10TB of input data and have a blocksize of 128MB, you'll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.

   maps的数据通常依赖于输入数据的总长度,也就是,输入文档的总block数。

每个节点map的正常并行度应该在10-100之间,尽管每个cpu已经设置的上限值为300。任务的配置会花费一些时间,最少需要花费一分钟来启动运行。

因此,如果你有10TB的数据输入和定义blocksize为128M,那么你将需要82000 maps,除非通过Configuration.set(MRJobConfig.NUM_MAPS, int)(设置一个默认值通知框架)来设置更高的值。

 

Reducer

Reducer reduces a set of intermediate values which share a key to a smaller set of values.

The number of reduces for the job is set by the user via Job.setNumReduceTasks(int).

Overall(总的来说), Reducer implementations are passed the Job for the job via the Job.setReducerClass(Class) method and can override it to initialize themselves. The framework then callsreduce(WritableComparable, Iterable, Context) method for each  pair in the grouped inputs. Applications can then override the cleanup(Context)method to perform any required cleanup.

Reducer has 3 primary(主要) phases(阶段): shuffle, sort and reduce.

   Reduce处理一系列相同key的中间记录。

用户可以通过 Job.setNumReduceTasks(int) 来设置reduce的数量。

总的来说,通过 Job.setReducerClass(Class) 可以给 job 设置 recuder 的实现类并且进行初始化。框架将会调用 reduce 方法来处理每一组按照一定规则分好的输入数据,应用可以通过复写cleanup 方法执行任何清理工作。

Reducer有3个主要阶段:混洗、排序和reduce。

 

Shuffle

Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches(取得) the relevant(有关的,恰当的) partition of the output of all the mappers, via HTTP.

输出到Reducer的数据都在Mapper阶段经过排序的。在这个阶段框架将通过HTTP从恰当的Mapper的分区中取得数据。

 

Sort

The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage(阶段).

The shuffle and sort phases occur simultaneously(同时); while map-outputs are being fetched they are merged.

这个阶段框架将对输入到的 Reducer 的数据通过key(不同的 Mapper 可能输出相同的key)进行分组。

混洗和排序阶段是同时进行;map的输出数据被获取时会进行合并。

 

Secondary Sort

If equivalence(平等的) rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator via Job.setSortComparatorClass(Class). Since Job.setGroupingComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction(协调) to simulate(模拟) secondary sort on values.

如果想要对中间记录实现与 map 阶段不同的排序方式,可以通过Job.setSortComparatorClass(Class) 来设置一个比较器 。Job.setGroupingComparatorClass(Class) 被用于控制中间记录的排序方式,这些能用来进行值的二次排序。

 

Reduce

In this phase the reduce(WritableComparable, Iterable, Context) method is called for each  pair in the grouped inputs.

The output of the reduce task is typically written to the FileSystem via Context.write(WritableComparable, Writable).

Applications can use the Counter to report its statistics.

The output of the Reducer is not sorted.

在这个阶段reduce方法将会被调用来处理每个已经分好的组键值对。

reduce 任务一般通过 Context.write(WritableComparable, Writable) 将数据写入到FileSystem。

应用可以使用 Counter 进行统计。

Recuder 输出的数据是不经过排序的。

 

How Many Reduces?

The right number of reduces seems to be 0.95 or 1.75 multiplied(乘上) by (<no. of nodes> * <no. of maximum containers per node>).

With 0.95 all of the reduces can launch immediately(立刻) and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave(波浪) of reduces doing a much better job of load balancing(均衡).

Increasing the number of reduces increases the framework overhead(负担,天花板), but increases load balancing and lowers the cost of failures.

The scaling(规模) factors above are slightly(轻微的) less than whole numbers to reserve a few reduce slots in the framework for speculative(推测的)-tasks and failed tasks.

   合适的 reduce 总数应该在 节点数*每个节点的容器数*0.95 至 节点数*每个节点的容器数*1.75 之间。

当设定值为0.95时,map任务结束后所有的 reduce 将会立刻启动并且开始转移数据,当设定值为1.75时,处理更多任务的时候将会快速地一轮又一轮地运行 reduce 达到负载均衡。

reduce 的数目的增加将会增加框架的负担(天花板),但是会提高负载均衡和降低失败率。

整体的规模将会略小于总数,因为有一些 reduce slot 用来存储推测任务和失败任务。

 

Reducer NONE

It is legal to set the number of reduce-tasks to zero if no reduction is desired.

In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by FileOutputFormat.setOutputPath(Job, Path). The framework does not sort the map-outputs before writing them out to the FileSystem.

当没有 reduction 需求的时候可以将 reduce-task 的数目设置为0,是允许的。

在这种情况当中,map任务将直接输出到 FileSystem,可通过  FileOutputFormat.setOutputPath(Job, Path) 来设置。该框架不会对输出的FileSystem 的数据进行排序。

 

Partitioner

Partitioner partitions the key space.

Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset (子集)of the key) is used to derive(取得;源自) the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent to for reduction.

HashPartitioner is the default Partitioner.

   Partitioner对key进行分区。

Partitioner 对 map 输出的中间值的 key(Recuder之前)进行分区。分区采用的默认方法是对 key 取 hashcode。分区数等于 job 的 reduce 任务数。因此这会根据中间值的key 将数据传输到对应的 reduce。

HashPartitioner 是默认的的分区器。

 

Counter

Counter is a facility for MapReduce applications to report its statistics.

Mapper and Reducer implementations can use the Counter to report statistics.

Hadoop MapReduce comes bundled(捆绑) with a library of generally(普遍的) useful mappers, reducers, and partitioners.

    计数器是一个工具用于报告 Mapreduce 应用的统计。

Mapper 和 Reducer 实现类可使用计数器来报告统计值。

Hadoop Mapreduce 是普遍的可用的 Mappers、Reducers 和 Partitioners 组成的一个库。

 

Job Configuration

Job represents(代表,表示) a MapReduce job configuration.

Job is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution. The framework tries to faithfully(如实的) execute the job as described by Job, however:

    • Some configuration parameters may have been marked as final by administrators (see Final Parameters) and hence cannot be altered(改变).
    • While some job parameters are straight-forward to set (e.g. Job.setNumReduceTasks(int)), other parameters interact(互相影响) subtly(微妙的) with the rest of the framework and/or job configuration and are more complex to set (e.g. Configuration.set(JobContext.NUM_MAPS, int)).

Job is typically used to specify the Mapper, combiner (if any), Partitioner, Reducer, InputFormat, OutputFormat implementations. FileInputFormat indicates(指定,表明) the set of input files (FileInputFormat.setInputPaths(Job, Path...)/ FileInputFormat.addInputPath(Job, Path)) and ( FileInputFormat.setInputPaths(Job, String...)/ FileInputFormat.addInputPaths(Job, String))and where the output files should be written ( FileOutputFormat.setOutputPath(Path)).

Optionally, Job is used to specify other advanced facets of the job such as the Comparator to be used, files to be put in the DistributedCache, whether intermediate and/or job outputs are to be compressed (and how), whether job tasks can be executed in a speculative manner ( setMapSpeculativeExecution(boolean))/ setReduceSpeculativeExecution(boolean)), maximum number of attempts per task (setMaxMapAttempts(int)/ setMaxReduceAttempts(int)) etc.

Of course, users can use Configuration.set(String, String)/ Configuration.get(String) to set/get arbitrary parameters needed by applications. However, use the DistributedCache for large amounts of (read-only) data.

   Job类用来表示MapReduce作业的配置。Job是用户用来描述MapReduce job在Hadoop框架运行的主要接口。Hadoop将尽量如实地按照job所描述的来执行。然而:

    • 一些配置参数已经被管理员标注为不可更改的因此不能被改变。
    • 一些参数是直接设置的(如Job.setNumReduceTasks(int)),有一些参数是跟框架或者任务配置之间有微妙的互相影响并且复杂的设置。

Job 典型地用于指定Mapper、Combiner、Partitioner、Reducer、InputFormat、OutputFormat实现类。 FileInputFormat指定输入文档的设定(FileInputFormat.setInputPaths(Job, Path...)/FileInputFormat.addInputPath(Job, Path))和(FileInputFormat.setInputPaths(Job, String...)/FileInputFormat.addInputPaths(Job, String))和输出文件应该写入通过(FileOutputFormat.setOutputPath(Path)).

随意地,Job也常用来指定job的其他高级配置,例如比较器、文档置于分布式缓存、中间记录是否压缩和怎样压缩, job任务是否已预测的方式去执行,每个任务的最大处理量等等。

当然,用户可以使用来设置或者获得应用所需要的任何参数。然而,使用分布式缓存来存储大量的可读数据。

 

Task Execution & Environment

The MRAppMaster executes the Mapper/Reducer task as a child process in a separate jvm.

The child-task inherits the environment of the parent MRAppMaster. The user can specify additional options to the child-jvm via the mapreduce.{map|reduce}.java.opts and configuration parameter in the Job such as non-standard paths for the run-time linker to search shared libraries via -Djava.library.path=<> etc. If the mapreduce.{map|reduce}.java.opts parameters contains the symbol @taskid@ it is interpolated with value of taskid of the MapReduce task.

Here is an example with multiple arguments and substitutions, showing jvm GC logging, and start of a passwordless(无密码) JVM JMX agent so that it can connect with jconsole and the likes to watch child memory, threads and get thread dumps. It also sets the maximum heap-size of the map and reduce child jvm to 512MB & 1024MB respectively. It also adds an additional path to the java.library.path of the child-jvm.

 1 
 2 
 3   mapreduce.map.java.opts
 4 
 5   
 6 
 7     -Xmx512M -Djava.library.path=/home/mycompany/lib -verbose:gc -Xloggc:/tmp/@[email protected]
 8 
 9     -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
10 
11   
12 
13 
14 
15  
16 
17 
18 
19   mapreduce.reduce.java.opts
20 
21   
22 
23     -Xmx1024M -Djava.library.path=/home/mycompany/lib -verbose:gc -Xloggc:/tmp/@[email protected]
24 
25     -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
26 
27   
28 
29 

 

MRAppMaster 在一个单独的jvm中运行Mapper/Reducer任务做为一个子进程。

子任务继承父MRAppMaster的运行环境。用户可以通过(mapreduce.{map|reduce}.java.opts和配置参数例如通过 Djava.library.path=<>可以设置非标准的路径用于运行时搜索库)指定额外的设置。如果mapreduce.{map|reduce}.java.opts参数包含@taskid@ 符号那么Mapreduce任务将会被修改为taskid的值。

下面有个例子;配置多个参数和代替,展示jvm gc 日志,和 JVM JMX 代理用于无密码登录以致可以连接JConsole来监控子程序的内存、线程和线程垃圾回收。也分别设置了map和reduce的最大堆内存为512M和1024M。它也给子jvm添加了额外的路径通过java.library.path参数。

 

Memory Management

Users/admins can also specify the maximum virtual memory of the launched child-task, and any sub-process it launches recursively, using mapreduce.{map|reduce}.memory.mb. Note that the value set here is a per process limit. The value for mapreduce.{map|reduce}.memory.mb should be specified in mega bytes (MB). And also the value must be greater than or equal to the -Xmx passed to JavaVM, else the VM might not start.

Note: mapreduce.{map|reduce}.java.opts are used only for configuring the launched child tasks from MRAppMaster. Configuring the memory options for daemons is documented inConfiguring the Environment of the Hadoop Daemons.

The memory available to some parts of the framework is also configurable. In map and reduce tasks, performance may be influenced by adjusting parameters influencing the concurrency of operations and the frequency with which data will hit disk. Monitoring the filesystem counters for a job- particularly relative to byte counts from the map and into the reduce- is invaluable to the tuning of these parameters.

用户或者管理员可以使用mapreduce.{map|reduce}.memory.mb指定子任务或者任何子进程运行的最大虚拟内存。需要注意的这里的值是针对每个进程的限制。{map|reduce}.memory.mb的值是以MB为单位的。并且这个值应该大于等于传给JavaVM的-Xmx的值,要不VM可能会无法启动。

  说明:mapreduce.{map|reduce}.java.opts只用来设置MRAppMaster发出的子任务。守护线程的内存选项配置在Configuring the Environment of the Hadoop Daemons.

  框架的一些组成部分的内存也是可配置的。在map和reduce任务中,性能可能会受到并发数的调整和写入到磁盘的频率的影响。文件系统计数器监控作业的map输出和输入到reduce的字节数对于调整这         些参数是宝贵的。

Map Parameters

A record emitted(发射) from a map will be serialized into a buffer and metadata will be stored into accounting buffers. As described in the following options, when either the serialization buffer or the metadata exceed(超过) a threshold(入口), the contents of the buffers will be sorted and written to disk in the background while the map continues to output records. If either buffer fills completely while the spill is in progress, the map thread will block. When the map is finished, any remaining records are written to disk and all on-disk segments are merged into a single file. Minimizing the number of spills to disk can decrease map time, but a larger buffer also decreases the memory available to the mapper.

Name

Type

Description

mapreduce.task.io.sort.mb

int

The cumulative(累积) size of the serialization and accounting buffers storing records emitted from the map, in megabytes.

mapreduce.map.sort.spill.percent

float

The soft limit in the serialization buffer. Once reached, a thread will begin to spill the contents to disk in the background.

Other notes

    • If either spill threshold is exceeded while a spill is in progress, collection will continue until the spill is finished. For example, if mapreduce.map.sort.spill.percent is set to 0.33, and the remainder(剩余) of the buffer is filled while the spill runs, the next spill will include all the collected records, or 0.66 of the buffer, and will not generate additional spills. In other words, the thresholds are defining triggers, not blocking.
    • A record larger than the serialization buffer will first trigger a spill, then be spilled to a separate file. It is undefined whether or not this record will first pass through the combiner.

Map发出的数据将会被序列化在缓存中和源数据将会储存在统计缓存。正如接下来的配置所描述的,当序列化缓存和元数据超过设定的临界值,缓存中的内容将会后台中写入到磁盘中而map将会继续输出记录。当缓存完全满了溢出之后,map线程将会阻塞。当map任务结束,所有剩下的记录都会被写到磁盘中并且磁盘中所有文件块会被合并到一个单独的文件。减小溢出值将减少map的时间,但更大的缓存会减少mapper的内存消耗。

其他说明:

    • 当任何一个spill超出的临界值,收集还会持续进行直到结束。例如,当mapreduce.map.sort.spill.percent 设置为0.33,那么剩余的缓存将会继续填充而spill会继续运行,而下一个spill将会包含所有的收集的记录,而当值为0.66,将不会产生另一个spills。也就是说,临界值会被触发,但不会阻塞。
    • 一个记录大于序列化缓存将会第一时间触发溢出,并且会被写到一个单独的文件。无论是否有定义都会第一时间通过combiner进行传输。

 

Shuffle/Reduce Parameters

As described previously, each reduce fetches the output assigned to it by the Partitioner via HTTP into memory and periodically merges these outputs to disk. If intermediate compression of map outputs is turned on, each output is decompressed into memory. The following options affect the frequency of these merges to disk prior to the reduce and the memory allocated to map output during the reduce.

Name

Type

Description

mapreduce.task.io.soft.factor

int

Specifies the number of segments on disk to be merged at the same time. It limits the number of open files and compression codecs during merge. If the number of files exceeds this limit, the merge will proceed in several passes. Though this limit also applies to the map, most jobs should be configured so that hitting this limit is unlikely there.

mapreduce.reduce.merge.inmem.thresholds

int

The number of sorted map outputs fetched into memory before being merged to disk. Like the spill thresholds in the preceding note, this is not defining a unit of partition, but a trigger. In practice, this is usually set very high (1000) or disabled (0), since merging in-memory segments is often less expensive than merging from disk (see notes following this table). This threshold influences only the frequency of in-memory merges during the shuffle.

mapreduce.reduce.shuffle.merge.percent

float

The memory threshold for fetched map outputs before an in-memory merge is started, expressed as a percentage of memory allocated to storing map outputs in memory. Since map outputs that can't fit in memory can be stalled, setting this high may decrease parallelism between the fetch and merge. Conversely, values as high as 1.0 have been effective for reduces whose input can fit entirely in memory. This parameter influences only the frequency of in-memory merges during the shuffle.

mapreduce.reduce.shuffle.input.buffer.percent

float

The percentage of memory- relative to the maximum heapsize as typically specified in mapreduce.reduce.java.opts- that can be allocated to storing map outputs during the shuffle. Though some memory should be set aside for the framework, in general it is advantageous to set this high enough to store large and numerous map outputs.

mapreduce.reduce.input.buffer.percent

float

The percentage of memory relative to the maximum heapsize in which map outputs may be retained during the reduce. When the reduce begins, map outputs will be merged to disk until those that remain are under the resource limit this defines. By default, all map outputs are merged to disk before the reduce begins to maximize the memory available to the reduce. For less memory-intensive reduces, this should be increased to avoid trips to disk.

Other notes

    • If a map output is larger than 25 percent of the memory allocated to copying map outputs, it will be written directly to disk without first staging through memory.
    • When running with a combiner, the reasoning about high merge thresholds and large buffers may not hold. For merges started before all map outputs have been fetched, the combiner is run while spilling to disk. In some cases, one can obtain better reduce times by spending resources combining map outputs- making disk spills small and parallelizing spilling and fetching- rather than aggressively increasing buffer sizes.
    • When merging in-memory map outputs to disk to begin the reduce, if an intermediate merge is necessary because there are segments to spill and at leastmapreduce.task.io.sort.factor segments already on disk, the in-memory map outputs will be part of the intermediate merge.

正如前面提到的,每个reduce都会通过HTTP在内存中拿到Partitioner分配好的数据并且定期地合并数据写到磁盘中。如果map输出的中间值都进行压缩,那么每个输出都会减少内存的压力。下面这些设置将会影响reduce之前的数据合并到磁盘的频率和reduce过程中分配给map输出的内存。

其他说明:

    • 如果一个map输出大于分配给用于复制map输出的内存的25%,那么将会直接写到磁盘不会通过内存进行临时缓存。
    • 当运行一个combiner,高的临界值和大的缓存的理由将没有效果。在map输出进行合并之前,combiner将会进行溢出写到磁盘的操作。在一些例子当中,耗费资源combine map输出数据获得更小的溢出会比粗暴地增加缓存大小使得recuder的时间更少。
    • 当合并内存中的map数据到磁盘来开始recuder时,如果磁盘中已经存在部分切片数据的话,那么必须将内存中的数据作为磁盘中间数据的一部分来进行合并操作。

 

Configured Parameters

The following properties are localized in the job configuration for each task's execution:

Name

Type

Description

mapreduce.job.id

String

The job id

mapreduce.job.jar

String

job.jar location in job directory

mapreduce.job.local.dir

String

The job specific shared scratch space

mapreduce.task.id

String

The task id

mapreduce.task.attempt.id

String

The task attempt id

mapreduce.task.is.map

boolean

Is this a map task

mapreduce.task.partition

int

The id of the task within the job

mapreduce.map.input.file

String

The filename that the map is reading from

mapreduce.map.input.start

long

The offset of the start of the map input split

mapreduce.map.input.length

long

The number of bytes in the map input split

mapreduce.task.output.dir

String

The task's temporary output directory

Note: During the execution of a streaming job, the names of the "mapreduce" parameters are transformed. The dots ( . ) become underscores ( _ ). For example, mapreduce.job.id becomes mapreduce_job_id and mapreduce.job.jar becomes mapreduce_job_jar. To get the values in a streaming job's mapper/reducer use the parameter names with the underscores.

说明:流式任务的执行过程中,名字以mapreduce开头的参数会被改变。符号(.)会变成(_)。例如,mapreduce.job.id会变成mapreduce_job_id和mapreduce.job.jar会变成mapreduce_job_jar。在Mapper/Reducer中使用带下划线的参数名来获得对应的值。

 

Task Logs

The standard output (stdout) and error (stderr) streams and the syslog of the task are read by the NodeManager and logged to ${HADOOP_LOG_DIR}/userlogs.

NodeManager 会读取stdout、sterr和任务的syslog并写到${HADOOP_LOG_DIR}/userlogs。

 

Distributing Libraries

The DistributedCache can also be used to distribute both jars and native libraries for use in the map and/or reduce tasks. The child-jvm always has its current working directory added to the java.library.path and LD_LIBRARY_PATH. And hence the cached libraries can be loaded via System.loadLibrary or System.load. More details on how to load shared libraries through distributed cache are documented at Native Libraries.

分布是缓存也可以在map/reduce任务中用来分不是存储jars和本地库。子JVM经常将它的工作路径添加到java.librarypath和LD_LIBRARY_PATH.因此缓存的库能通过System.loadLibrary 或者 System.load 来加载。更多关于如何通过分布式缓存来加载第三方库参考Native Libraries.

 

Job Submission and Monitoring

Job is the primary interface by which user-job interacts with the ResourceManager.

Job provides facilities to submit jobs, track their progress, access component-tasks' reports and logs, get the MapReduce cluster's status information and so on.

The job submission process involves:

    1. Checking the input and output specifications of the job.
    2. Computing the InputSplit values for the job.
    3. Setting up the requisite accounting information for the DistributedCache of the job, if necessary.
    4. Copying the job's jar and configuration to the MapReduce system directory on the FileSystem.
    5. Submitting the job to the ResourceManager and optionally monitoring it's status.

Job history files are also logged to user specified directory mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir, which defaults to job output directory.

User can view the history logs summary in specified directory using the following command 
$ mapred job -history output.jhist 
This command will print job details, failed and killed tip details. 
More details about the job such as successful tasks and task attempts made for each task can be viewed using the following command 
$ mapred job -history all output.jhist

Normally the user uses Job to create the application, describe various facets of the job, submit the job, and monitor its progress.

Job 是用户Job与ResourceManager交互的主要接口。

Job 提供工具去提交jobs、跟踪他们的进程、使用组成任务的报告和日志,获得MapReduce集群的状态信息和其他。

Job的提交包含以下内容:

  1. 检查Job的输入输出指定
  2. 计算Job的InputSplit的值
  3. 如果必要的话,设置分布式缓存的需求信息。
  4. 将Job的jar和configuration复制到Mapreduce系统的文件系统路径下。
  5. 将Job提交到ResourceManger并且随时监控它的状态。

Job的历史文件也被记录到用户通过mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir指定的路径下,默认是Job的输出路径。

用户可以通过下面的指令来查看指定路径下的所有的历史记录。

$ mapred job -history output.jhist 

这个命令可以打印job的细节,失败和杀死Job的技巧。用以下的命令可以考到更多关于Job例如成功任务和每个任务的目的细节。

$ mapred job -history all output.jhist

Normally the user uses Job to create the application, describe various facets of the job, submit the job, and monitor its progress.

一般来说用户使用Job来创建应用,描述Job的各个方面,提交Job和监控它的进程。

 

Job Control

Users may need to chain MapReduce jobs to accomplish(实现) complex tasks which cannot be done via a single MapReduce job. This is fairly easy since the output of the job typically goes to distributed file-system, and the output, in turn(依次), can be used as the input for the next job.

However, this also means that the onus on ensuring jobs are complete (success/failure) lies squarely on the clients. In such cases, the various job-control options are:

    • Job.submit() : Submit the job to the cluster and return immediately.
    • Job.waitForCompletion(boolean) : Submit the job to the cluster and wait for it to finish.

用户可能需要将多个任务串行实现复杂任务而没办法通过一个MapReduce任务实现。这是相当容易,job的output通常是输出到分布式缓存,而输出,依次作为下一个任务的输入。

然而,这也意味确保任务的完成(成功/失败)的义务是完全建立在客户端上。在这种情况下,各种作业的控制选项有:

    • Job.submit() :提交作业给集群并立刻回复
    • Job.waitForCompletion(boolean) :提交作业给集群并且等待它完成。

 

Job Input

InputFormat describes the input-specification for a MapReduce job.

The MapReduce framework relies on the InputFormat of the job to:

    1. Validate the input-specification of the job.
    2. Split-up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.
    3. Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.

The default behavior of file-based InputFormat implementations, typically sub-classes of FileInputFormat, is to split the input into logical InputSplit instances based on the total size, in bytes, of the input files. However, the FileSystem blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set viamapreduce.input.fileinputformat.split.minsize.

Clearly, logical splits based on input-size is insufficient for many applications since record boundaries must be respected. In such cases, the application should implement a RecordReader, who is responsible for respecting record-boundaries and presents a record-oriented view of the logical InputSplit to the individual task.

TextInputFormat is the default InputFormat.

If TextInputFormat is the InputFormat for a given job, the framework detects input-files with the .gz extensions and automatically decompresses them using the appropriate CompressionCodec. However, it must be noted that compressed files with the above extensions cannot be split and each compressed file is processed in its entirety by a single mapper.

InputFormat描述MapReduce Job的输入规定。

MapReduce框架依赖Job的InputFormat:

    1. 使Job的输入设定生效。
    2. 将输入文件分割成逻辑上的输入块实例,并将每一输入块分配给单独的Mapper。
    3. 提供RecordReader实现用于收集从逻辑输入块的记录输入到Mapper中。

那些默认的基于InputFormat的实现,通常来说FileInputForamt的子类,基于总字节数将输入基于字节数分成逻辑输入块实例,然而,FileSystem的块大小将是inputSplits的上限值,下限值可以通过mapreduce.input.fileinputformat.split.minsize来设置。

很明显,很多应用必须重视记录的边界,因存在着输入大小不足以逻辑分割。在这种情况,应用应当实现一个RecordReader,负责在单独任务中处理记录边界和显示,面向记录的逻辑视图。

TextInputForamt是默认的InputForamt。

如果job的InputForamt是TextInputFormat,框架会对输入文件进行检测,如果扩展名为.gz那么会自动用合适的压缩编码器进行解压。然而,必须说明的是经过压缩的文件将不能被切割并且每一个压缩文件都必须完全在一个Mapper单独处理。

 

InputSplit

InputSplit represents the data to be processed by an individual Mapper.

Typically InputSplit presents a byte-oriented view of the input, and it is the responsibility of RecordReader to process and present a record-oriented view.

FileSplit is the default InputSplit. It sets mapreduce.map.input.file to the path of the input file for the logical split.

输入块表示每个单独Mapper处理的数据。

通常来说,输入块代表输入的面向字节视图,而RecordReader代表的是面向记录视图。

FileSplit是默认的InputSplit。mapreduce.map.input.file设置用于逻辑分割的输入路径。

 

RecordReader

RecordReader reads  pairs from an InputSplit.

Typically the RecordReader converts the byte-oriented view of the input, provided by the InputSplit, and presents a record-oriented to the Mapper implementations for processing.RecordReader thus assumes the responsibility of processing record boundaries and presents the tasks with keys and values.

RecordReader从InputSplit读取键值对。

通常来说,RecordReader将输入的面向字节视图转换成面向记录视图并输入到Mapper的实现类进行处理。RecordReader因此承担起处理记录边界和显示任务的Keys和Values的责任。

 

Job Output

OutputFormat describes the output-specification for a MapReduce job.

The MapReduce framework relies on the OutputFormat of the job to:

    1. Validate the output-specification of the job; for example, check that the output directory doesn't already exist.
    2. Provide the RecordWriter implementation used to write the output files of the job. Output files are stored in a FileSystem.

TextOutputFormat is the default OutputFormat.

OutputFormat 描述MapReduce Job的输出规定。

MapReduce 框架依赖于Job的OutputFormat:

    1. 使job的输出设置生效;例如,检查输出路径是否已经存在。
    2. 提供RecordWriter实现用于输出文件。输出文件储存在FileSystem。

 

OutputCommitter

OutputCommitter describes the commit of task output for a MapReduce job.

The MapReduce framework relies on the OutputCommitter of the job to:

    1. Setup the job during initialization. For example, create the temporary output directory for the job during the initialization of the job. Job setup is done by a separate task when the job is in PREP state and after initializing tasks. Once the setup task completes, the job will be moved to RUNNING state.
    2. Cleanup the job after the job completion. For example, remove the temporary output directory after the job completion. Job cleanup is done by a separate task at the end of the job. Job is declared SUCCEDED/FAILED/KILLED after the cleanup task completes.
    3. Setup the task temporary output. Task setup is done as part of the same task, during task initialization.
    4. Check whether a task needs a commit. This is to avoid the commit procedure if a task does not need commit.
    5. Commit of the task output. Once task is done, the task will commit it's output if required.
    6. Discard the task commit. If the task has been failed/killed, the output will be cleaned-up. If task could not cleanup (in exception block), a separate task will be launched with same attempt-id to do the cleanup.

FileOutputCommitter is the default OutputCommitter. Job setup/cleanup tasks occupy map or reduce containers, whichever is available on the NodeManager. And JobCleanup task, TaskCleanup tasks and JobSetup task have the highest priority, and in that order.

OutputCommitter 描述着MapReduce的任务输出的提交。

MapReduce依赖于Job的输出提交器:

    1. 初始化时设置Job。例如,job的初始化过程中创建临时输出路径。当Job处于准备阶段和初始化任务之后,Job通过一个单独的任务完成创建。,一旦任务的创建完成之后,job将会转成运行状态。
    2. Job完成之后清除Job。例如,Job完成后移除临时输出路径。Job结束之时用一个单独的任务完成Job的清除。在完成对任务的清除之后Job会声明SUCCEDED/FAILED/KILLED.
    3. 设置任务临时输出。在任务的初始化过程中,任务设置作为任务的一部分来完成。
    4. 检查一个任务是否需要提交。这将避免一个不需要提交的任务执行提交程序。
    5. 提交任务的输出。一旦任务完成,任务将会提交它的输出如果需要的话。
    6. 放弃任务提交。如果任务已经失败或者被杀死,那么输出将会被清除掉。如果任务因为意外没有被清除掉,那么一个单独的任务将会被运行来执行清除工作。

FileOutputCommitter是默认的OutputCommitter。Job 创建/清除任务占有map或者reduce容器,无论NodeManager是否可用。Job的清除任务,任务的清除任务和Job的创建任务拥有最高的优先级。

 

Task Side-Effect Files 

In some applications, component tasks need to create and/or write to side-files, which differ from the actual job-output files.

In such cases there could be issues with two instances of the same Mapper or Reducer running simultaneously (for example, speculative tasks) trying to open and/or write to the same file (path) on the FileSystem. Hence the application-writer will have to pick unique names per task-attempt (using the attemptid, say attempt_200709221812_0001_m_000000_0), not just per task.

To avoid these issues the MapReduce framework, when the OutputCommitter is FileOutputCommitter, maintains a special${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} sub-directory accessible via ${mapreduce.task.output.dir} for each task-attempt on the FileSystem where the output of the task-attempt is stored. On successful completion of the task-attempt, the files in the ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} (only) are promoted to${mapreduce.output.fileoutputformat.outputdir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This process is completely transparent to the application.

The application-writer can take advantage of this feature by creating any side-files required in ${mapreduce.task.output.dir} during execution of a task viaFileOutputFormat.getWorkOutputPath(Conext), and the framework will promote them similarly for succesful task-attempts, thus eliminating the need to pick unique paths per task-attempt.

Note: The value of ${mapreduce.task.output.dir} during execution of a particular task-attempt is actually ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_{$taskid}, and this value is set by the MapReduce framework. So, just create any side-files in the path returned by FileOutputFormat.getWorkOutputPath(Conext) from MapReduce task to take advantage of this feature.

The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces) since output of the map, in that case, goes directly to HDFS.

在一些应用当中,组成的任务必须创建一些其他文档,跟实际输出不同的文档。

在这些情况当中将会同时存在两个Mapper或者Reducer实例去打开或者写到FileSystem中相同的文档。因此应用开发者将会获取独一无二的任务目的(使用目的ID,假如say attempt_200709221812_0001_m_000000_0),不仅是每个任务。

说明:${mapreduce.task.output.dir}的值在一个特定任务执行过程中实际上是${mapreduce.output.fileoutputformat.outputdir}/_temporary/_{$taskid}的值,而这个值是由MapReduce框架设定的。所以,MapReduce任务利用这个特性从FileOutputForamt.getWorkOutPath(Context)返回的路径创建副文档。

整个讨论适用于作业有map但没有reduce的情况,因此map的output直接写到HDFS.

 

RecordWriter

RecordWriter writes the output  pairs to an output file.

RecordWriter implementations write the job outputs to the FileSystem.

RecordWriter将键值对的输出写到输出文件中。

RecordWriter实现类将job的输出写到FileSytem。

 

Other Useful Features 

Submitting Jobs to Queues 

Users submit jobs to Queues. Queues, as collection of jobs, allow the system to provide specific functionality. For example, queues use ACLs to control which users who can submit jobs to them. Queues are expected to be primarily used by Hadoop Schedulers.

Hadoop comes configured with a single mandatory queue, called 'default'. Queue names are defined in the mapreduce.job.queuename> property of the Hadoop site configuration. Some job schedulers, such as the Capacity Scheduler, support multiple queues.

A job defines the queue it needs to be submitted to through the mapreduce.job.queuename property, or through the Configuration.set(MRJobConfig.QUEUE_NAME, String) API. Setting the queue name is optional. If a job is submitted without an associated queue name, it is submitted to the 'default' queue.

用户提交job到队列中。队列,也就是job的集合,允许系统提供特定的功能。例如,队列使用ACLS来控制哪些用户可以提交队列。Hadoop Schedulers是队列的主要使用者。

Hadoop设置一个单独的强制的队列,称之为“默认”。队列的名称是在Hadoop-site配置文件中的mapreduce.job.queuename>属性决定的。一些作业调度器支持多个的队列,例如容量调度器。

一个作业通过mapreduce.job.queuename属性或者Configuration.set(MRJobConfig.QUEUE_NAME, String)API来定义一个队列。设置队列的名字是可选的。如果一个作业被提交时并没有设置队列名称,那么队列名称为“默认”。

 

Counters 

Counters represent global counters, defined either by the MapReduce framework or applications. Each Counter can be of any Enum type. Counters of a particular Enum are bunched into groups of type Counters.Group.

Applications can define arbitrary Counters (of type Enum) and update them via Counters.incrCounter(Enum, long) or Counters.incrCounter(String, String, long) in the map and/or reducemethods. These counters are then globally aggregated by the framework.

计数器是全局计数器,由MapReduce框架或者应用定义。每一个计数器都可以是任何枚举类型。Counters of a particular Enum are bunched into groups of type Counters.Group。

应用可以定义任意计数器和通过 Counters.incrCounter(Enum, long) 或者Counters.incrCounter(String, String, long)来更新在map/reduce方法中。这些计数器是通过框架进行全局计算的。

 

DistributedCache 

DistributedCache distributes application-specific, large, read-only files efficiently.

DistributedCache is a facility provided by the MapReduce framework to cache files (text, archives, jars and so on) needed by applications.

Applications specify the files to be cached via urls (hdfs://) in the Job. The DistributedCache assumes that the files specified via hdfs:// urls are already present on the FileSystem.

The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves.

DistributedCache tracks the modification timestamps of the cached files. Clearly the cache files should not be modified by the application or externally while the job is executing.

DistributedCache can be used to distribute simple, read-only data/text files and more complex types such as archives and jars. Archives (zip, tar, tgz and tar.gz files) are un-archived at the slave nodes. Files have execution permissions set.

The files/archives can be distributed by setting the property mapreduce.job.cache.{files|archives}. If more than one file/archive has to be distributed, they can be added as comma separated paths. The properties can also be set by APIs Job.addCacheFile(URI)/ Job.addCacheArchive(URI) and Job.setCacheFiles(URI[])/ Job.setCacheArchives(URI[]) where URI is of the form hdfs://host:port/absolute-path#link-name. In Streaming, the files can be distributed through command line option -cacheFile/-cacheArchive.

The DistributedCache can also be used as a rudimentary software distribution mechanism for use in the map and/or reduce tasks. It can be used to distribute both jars and native libraries. The Job.addArchiveToClassPath(Path) or Job.addFileToClassPath(Path) api can be used to cache files/jars and also add them to the classpath of child-jvm. The same can be done by setting the configuration properties mapreduce.job.classpath.{files|archives}. Similarly the cached files that are symlinked into the working directory of the task can be used to distribute native libraries and load them.

分布式缓存有效地分布存储应用专用的、大的、只读的文件。

分布是缓存是MapReduce框架提供给应用用于缓存文件(文本,档案、jar包和其他)。

应用可以通过urls (hdfs://)在Job中指定文件的缓存路径。分布式缓存假设通过hdfs:// urls指定的文件已经存在现在的FileSystem。

这个框架将在某个从属节点执行任何任务之前复制必要的文件到该节点上。它的高效源于这样的事实:每个作业只复制一次到那些能够存档但是还没存档的节点上。

分布式缓存跟踪缓存文件的修改时间戳。显然当作业在执行时缓存文件不应该被应用或者外部修改。

分布式缓存可以用来分布缓存简单的、只读的的数据或者文本文档和更复杂类型例如档案和Jar包。档案(zip, tar, tgz and tar.gz files)指的是未存档到从属节点的。文档是有执行权限的。

文件可以通过设置mapreduce.job.cache.{files|archives}属性来分配存储。如果有更多的文件需要存储,那么在用逗号隔开路径即可。该属性还可以通过Job.addCacheFile(URI)/ Job.addCacheArchive(URI) and Job.setCacheFiles(URI[])/ Job.setCacheArchives(URI[]) 来设置,URL的格式为hdfs://host:port/absolute-path#link-name。文件可以通过命令-cacheFile/-cacheArchive来实现分配存储。

分布式缓存也可以用作一个基本的软件分发机制用于map/reduce任务。它也可以用来分布存储jar包和本地库。Job.addArchiveToClassPath(Path) or Job.addFileToClassPath(Path) api可以用来缓存文件/jars并且子Jvm也会将它们添加到类路径下。通过设置mapreduce.job.classpath.{files|archives}属性也可以达到同样效果。同样地缓存文件通过符号链接到任务的工作路径来分布缓存本地库和加载它们。

 

Private and Public DistributedCache Files 

DistributedCache files can be private or public, that determines how they can be shared on the slave nodes.

    • "Private" DistributedCache files are cached in a local directory private to the user whose jobs need these files. These files are shared by all tasks and jobs of the specific user only and cannot be accessed by jobs of other users on the slaves. A DistributedCache file becomes private by virtue of its permissions on the file system where the files are uploaded, typically HDFS. If the file has no world readable access, or if the directory path leading to the file has no world executable access for lookup, then the file becomes private.
    • "Public" DistributedCache files are cached in a global directory and the file access is setup such that they are publicly visible to all users. These files can be shared by tasks and jobs of all users on the slaves. A DistributedCache file becomes public by virtue of its permissions on the file system where the files are uploaded, typically HDFS. If the file has world readable access, AND if the directory path leading to the file has world executable access for lookup, then the file becomes public. In other words, if the user intends to make a file publicly available to all users, the file permissions must be set to be world readable, and the directory permissions on the path leading to the file must be world executable.

分布式缓存文件可以是私有的或者公有的,以确定它们是否可以被分享到从属节点。

    •  私有分布式缓存文件被缓存在局部路径属于那些作业需要这些文件的用户。这些文件只可以被指定用户的所有任务和Job使用,而这些节点的其他用户就不能使用。分布式缓存文档在它所上传的文件系统中通过他的权限变成私有的,文件系统通常为HDFS.如果这些文档没有全局读取权限,或者它的路径没有全局的可执行查找权限,那么这些文档就是私有的。
    • 公有分布式缓存文档被缓存在一个全局路径并且文件被设置为对所有用户都可见。这些文件可以被所有节点上的所有用户分享。分布式缓存文件在它所上传的文件系统上通过他的权限变成公有的,文件系统通常为HDFS。如果文件具有全局可读权限,并且他的路径具有全局的可执行查找权限,那么它就是公有的。也就是说,如果用户想要使文件对所有用户可见可操作,那么文件权限必须是全局可读和他的路径权限必须是全局可执行。

 

Profiling

Profiling is a utility to get a representative (2 or 3) sample of built-in java profiler for a sample of maps and reduces.

User can specify whether the system should collect profiler information for some of the tasks in the job by setting the configuration property mapreduce.task.profile. The value can be set using the api Configuration.set(MRJobConfig.TASK_PROFILE, boolean). If the value is set true, the task profiling is enabled. The profiler information is stored in the user log directory. By default, profiling is not enabled for the job.

Once user configures that profiling is needed, she/he can use the configuration property mapreduce.task.profile.{maps|reduces} to set the ranges of MapReduce tasks to profile. The value can be set using the api Configuration.set(MRJobConfig.NUM_{MAP|REDUCE}_PROFILES, String). By default, the specified range is 0-2.

User can also specify the profiler configuration arguments by setting the configuration property mapreduce.task.profile.params. The value can be specified using the api Configuration.set(MRJobConfig.TASK_PROFILE_PARAMS, String). If the string contains a %s, it will be replaced with the name of the profiling output file when the task runs. These parameters are passed to the task child JVM on the command line. The default value for the profiling parameters is -agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s.

分析器是一个工具可以用来获取2到3个Java内置分析器关于map和reduce的分析样本。

用户可以通过 mapreduce.task.profile来指定系统是否要收集某个作业的一些任务分析信息。这个值也可以通过Configuration.set(MRJobConfig.TASK_PROFILE, boolean) api来设置。如果这个值为真,那么任务分析将会生效。分析器的信息将储存在用户的log路径下。该属性默认是不生效的。

一旦用户配置了该属性,那么他/她就可以通过 mapreduce.task.profile.{maps|reduces} 来设置MapReduce任务的范围。这个值也可以通过Configuration.set(MRJobConfig.NUM_{MAP|REDUCE}_PROFILES, String) api来设置。默认的值为0-2。

用户也可以通过配置mapreduce.task.profile.params属性来指定分析器的的参数。这个值也可以通过api Configuration.set(MRJobConfig.TASK_PROFILE_PARAMS, String)来设置。假如字符串里面包含%s,那么将会在任务执行时被替换成分析输出文件的名字。这些参数将会在命令行中传输给任务所在的子JVM。默认的参数的值为-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s。

 

Debugging

The MapReduce framework provides a facility to run user-provided scripts for debugging. When a MapReduce task fails, a user can run a debug script, to process task logs for example. The script is given access to the task's stdout and stderr outputs, syslog and jobconf. The output from the debug script's stdout and stderr is displayed on the console diagnostics and also as part of the job UI.

In the following sections we discuss how to submit a debug script with a job. The script file needs to be distributed and submitted to the framework.

MapReduce框架提供一个工具用来运行用户提供的脚本用于调试。当一个MapReduce任务失败,用户可以运行调试脚本,去处理任务log。脚本可以读取任务的stdout、stderr输出、syslog和jobconf。调试脚本的stdout和sterr输出将会作为Job UI的一部分显示出来。

在接下来的部分我们将讨论如何提交一个调试脚本到作业中。脚本文件需要提交和存储在框架中。

 

How to distribute the script file:

The user needs to use DistributedCache to distribute and symlink the script file.

用户需要使用分布式缓存来分发和符号链接脚本文件。

 

How to submit the script:

A quick way to submit the debug script is to set values for the properties mapreduce.map.debug.script and mapreduce.reduce.debug.script, for debugging map and reduce tasks respectively. These properties can also be set by using APIs Configuration.set(MRJobConfig.MAP_DEBUG_SCRIPT, String) and Configuration.set(MRJobConfig.REDUCE_DEBUG_SCRIPT, String). In streaming mode, a debug script can be submitted with the command-line options -mapdebug and -reducedebug, for debugging map and reduce tasks respectively.

The arguments to the script are the task's stdout, stderr, syslog and jobconf files. The debug command, run on the node where the MapReduce task failed, is: 
$script $stdout $stderr $syslog $jobconf

Pipes programs have the c++ program name as a fifth argument for the command. Thus for the pipes programs the command is 
$script $stdout $stderr $syslog $jobconf $program

通过mapreduce.map.debug.script 和nd mapreduce.reduce.debug.script属性来分别设置map和reduce任务的调试脚本是一个快速的提交调试脚本的方法。这些属性可以通过APIs Configuration.set(MRJobConfig.MAP_DEBUG_SCRIPT, String) 和 Configuration.set(MRJobConfig.REDUCE_DEBUG_SCRIPT, String)来设置。在流式编程模式,可以通过命令行选项 -mapdebug 和 –reducedebug来分别设置map和reduce的调试脚本用于调试。

脚本的参数是任务的标准输出、标准错误、系统日志和作业配置文档。调试命令,运行在某个Mapreduce任务失败的节点上,是$script $stdout $stderr $syslog $jobconf $program。

拥有C++程度的Pipes项目在命令中增加第五个参数。因此命令如下:$script $stdout $stderr $syslog $jobconf $program

 

Default Behavior:

For pipes, a default script is run to process core dumps under gdb, prints stack trace and gives info about running threads.

在Pipes中,默认的脚本是运行在GDP的核心转储,打印堆跟踪和运行线程的信息。

 

Data Compression

Hadoop MapReduce provides facilities for the application-writer to specify compression for both intermediate map-outputs and the job-outputs i.e. output of the reduces. It also comes bundled with CompressionCodec implementation for the zlib compression algorithm. The gzip, bzip2, snappy, and lz4 file format are also supported.

Hadoop also provides native implementations of the above compression codecs for reasons of both performance (zlib) and non-availability of Java libraries. More details on their usage and availability are available here.

Hadoop MapReduce提供一个功能让应用开发指定压缩方式用于map输出的中间数据和job-outputs也就是reduce的输出。它也捆绑着实现zlib压缩算法的压缩编码器。支持gzip、bzip2、snappy和lz4文件格式的文档。

Hadoop也提供上述编码器的本地实现,因为性能和Java库不支持的原因。更多关于它们的使用细节和可用性可参考官方文档。

 

Intermediate Outputs

Applications can control compression of intermediate map-outputs via the Configuration.set(MRJobConfig.MAP_OUTPUT_COMPRESS, boolean) api and the CompressionCodec to be used via the Configuration.set(MRJobConfig.MAP_OUTPUT_COMPRESS_CODEC, Class) api.

应用可以通过Configuration.set(MRJobConfig.MAP_OUTPUT_COMPRESS, boolean) api来设置是否对map的输出进行压缩和Configuration.set(MRJobConfig.MAP_OUTPUT_COMPRESS_CODEC, Class) api指定压缩编码器。

 

Job Outputs

Applications can control compression of job-outputs via the FileOutputFormat.setCompressOutput(Job, boolean) api and the CompressionCodec to be used can be specified via the FileOutputFormat.setOutputCompressorClass(Job, Class) api.

If the job outputs are to be stored in the SequenceFileOutputFormat, the required SequenceFile.CompressionType (i.e. RECORD / BLOCK - defaults to RECORD) can be specified via the SequenceFileOutputFormat.setOutputCompressionType(Job, SequenceFile.CompressionType) api.

应用可以通过FileOutputFormat.setCompressOutput(Job, boolean) api来控制是否对作业输出进行压缩和通过FileOutputFormat.setOutputCompressorClass(Job, Class)api来设置压缩编码器。

如果作业的输出是以SequenceFileOutputFormat格式存储的,那么需要序列化。压缩类型通过SequenceFileOutputFormat.setOutputCompressionType(Job, SequenceFile.CompressionType) api来指定。

 

Skipping Bad Records

Hadoop provides an option where a certain set of bad input records can be skipped when processing map inputs. Applications can control this feature through the SkipBadRecords class.

This feature can be used when map tasks crash deterministically on certain input. This usually happens due to bugs in the map function. Usually, the user would have to fix these bugs. This is, however, not possible sometimes. The bug may be in third party libraries, for example, for which the source code is not available. In such cases, the task never completes successfully even after multiple attempts, and the job fails. With this feature, only a small portion of data surrounding the bad records is lost, which may be acceptable for some applications (those performing statistical analysis on very large data, for example).

By default this feature is disabled. For enabling it, refer to SkipBadRecords.setMapperMaxSkipRecords(Configuration, long) and SkipBadRecords.setReducerMaxSkipGroups(Configuration, long).

With this feature enabled, the framework gets into 'skipping mode' after a certain number of map failures. For more details, seeSkipBadRecords.setAttemptsToStartSkipping(Configuration, int). In 'skipping mode', map tasks maintain the range of records being processed. To do this, the framework relies on the processed record counter. See SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS and SkipBadRecords.COUNTER_REDUCE_PROCESSED_GROUPS. This counter enables the framework to know how many records have been processed successfully, and hence, what record range caused a task to crash. On further attempts, this range of records is skipped.

The number of records skipped depends on how frequently the processed record counter is incremented by the application. It is recommended that this counter be incremented after every record is processed. This may not be possible in some applications that typically batch their processing. In such cases, the framework may skip additional records surrounding the bad record. Users can control the number of skipped records through SkipBadRecords.setMapperMaxSkipRecords(Configuration, long) andSkipBadRecords.setReducerMaxSkipGroups(Configuration, long). The framework tries to narrow the range of skipped records using a binary search-like approach. The skipped range is divided into two halves and only one half gets executed. On subsequent failures, the framework figures out which half contains bad records. A task will be re-executed till the acceptable skipped value is met or all task attempts are exhausted. To increase the number of task attempts, use Job.setMaxMapAttempts(int) and Job.setMaxReduceAttempts(int)

Skipped records are written to HDFS in the sequence file format, for later analysis. The location can be changed through SkipBadRecords.setSkipOutputPath(JobConf, Path).

Hadoop提供一个选项当执行map输入时可以跳过某一组确定的坏数据。应用可以通过SkipBadRecords 类来控制特性。

当map任务中某些输入一定会导致崩溃时可以使用这个属性。这通常发生在map函数中的bug。通常地,用户会修复这些bug。然而,某些时候不一定有用。这个bug可能是第三方库导致的,例如那些源代码看不了的。在这些情况当中,尽管经过多次尝试都没有办法完成任务,作业也会失败。通过这个属性,只有一小部分的坏数据周边数据会丢失,这对于某些应用是可以接受的(那么数据量非常的统计分析)

这个属性默认是失效的。可以通过SkipBadRecords.setMapperMaxSkipRecords(Configuration, long) 和SkipBadRecords.setReducerMaxSkipGroups(Configuration, long)。来使它生效。

当这个属性生效,框架在一定数量的map失败后会进入“跳过模式”。在跳过模式中,map任务维持被处理数据的范围,看看SkipBadRecords.setAttemptsToStartSkipping(Configuration, int)。为了达到这个目标,框架依赖于记录计数器。看看SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS and SkipBadRecords.COUNTER_REDUCE_PROCESSED_GROUPS的说明。这个计数器是的框架可以知道有多少条记录被成功处理了,因此来找出哪些记录范围会引起任务崩溃。在进一步的尝试中,这些范围的记录会被跳过。

跳过记录的数目取决于运行的记录计数器的增长频率。建议这个计数器在每天记录处理增加。这在批量处理中可已不太可能实现。在这些情况当中,框架会跳过不良记录附近的额外数据。用户可以通过SkipBadRecords.setMapperMaxSkipRecords(Configuration, long) andSkipBadRecords.setReducerMaxSkipGroups(Configuration, long)来控制跳过记录的数量。框架会试图使用二进制搜索方式来缩窄跳过记录的范围。跳过范围被分成两部分并且只有其中一半会被拿来执行。在接下来的错误当中,框架将会指出哪一半范围包含不良数据。一个任务将会重新执行直到跳过记录或者尝试次数用完。可以通过Job.setMaxMapAttempts(int) and Job.setMaxReduceAttempts(int).来增加尝试次数。

跳过的记录将会以序列化的形式写到HDFS中。可以通过 SkipBadRecords.setSkipOutputPath(JobConf, Path)来修改路径。

 

*由于译者本身能力有限,所以译文中肯定会出现表述不正确的地方,请大家多多包涵,也希望大家能够指出文中翻译得不对或者不准确的地方,共同探讨进步,谢谢。

 

转载于:https://www.cnblogs.com/simple-focus/p/6108737.html

你可能感兴趣的:(Hadoop官方文档翻译——MapReduce Tutorial)