There are often things you would like to know about the data you are analyzing but that are peripheral to the analysis you are performing. For example, if you were counting invalid records and discovered that the proportion of invalid records in the whole dataset was very high, you might be prompted to check why so many records were being marked as invalid—perhaps there is a bug in the part of the program that detects invalid records?
Counters are a useful channel for gathering statistics about the job: for quality control or for application-level statistics.
This is easy to understand: counters give you a global count across a large number of distributed nodes, useful for monitoring, recording analysis results, and so on.
It is also fairly easy to see how this is implemented: each node counts independently and reports its tally to the jobtracker, which aggregates the values into the final total. This is why the counter values are not accurate while the job is still running: there may be multiple attempts of the same task, so the intermediate totals can be inflated; when the job finishes, only one attempt's result is kept.
Hadoop maintains some built-in counters for every job, which report various metrics for your job. For example, there are counters for the number of bytes and records processed, which allows you to confirm that the expected amount of input was consumed and the expected amount of output was produced.
Counters are divided into groups, and there are several groups for the built-in counters, listed below together with the enum class that defines each group in the 0.20 and post-0.20 APIs:
MapReduce Task Counters: org.apache.hadoop.mapred.Task$Counter (0.20), org.apache.hadoop.mapreduce.TaskCounter (post 0.20)
Filesystem Counters: FileSystemCounters (0.20), org.apache.hadoop.mapreduce.FileSystemCounter (post 0.20)
FileInputFormat Counters: org.apache.hadoop.mapred.FileInputFormat$Counter (0.20), org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter (post 0.20)
FileOutputFormat Counters: org.apache.hadoop.mapred.FileOutputFormat$Counter (0.20), org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter (post 0.20)
Job Counters: org.apache.hadoop.mapred.JobInProgress$Counter (0.20), org.apache.hadoop.mapreduce.JobCounter (post 0.20)
As you can see, Hadoop itself relies on counters to monitor job status, which is why it defines so many built-in counters; most of the figures shown in the Hadoop web UI come from these built-in counters.
MapReduce allows user code to define a set of counters, which are then incremented as desired in the mapper or reducer. Counters are defined by a Java enum, which serves to group related counters. A job may define an arbitrary number of enums, each with an arbitrary number of fields. The name of the enum is the group name, and the enum’s fields are the counter names.
Counters are global: the MapReduce framework aggregates them across all maps and reduces to produce a grand total at the end of the job.
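For example, a mapper that counts missing records might look roughly like the sketch below. The enum and the emptiness check are only illustrative; the book's full example, MaxTemperatureWithCounters, uses the old API's Reporter instead of the new API's context shown here.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, Text> {

  // The enum name is the counter group; its fields are the counter names.
  enum Temperature { MISSING, MALFORMED }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.toString().isEmpty()) {
      // Increment the counter; the framework aggregates it across all tasks.
      context.getCounter(Temperature.MISSING).increment(1);
      return;
    }
    // ... normal map logic ...
  }
}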
The code makes use of a dynamic counter—one that isn’t defined by a Java enum. Since a Java enum’s fields are defined at compile time, you can’t create new counters on the fly using enums. Here we want to count the distribution of temperature quality codes, and though the format specification defines the values that it can take, it is more convenient to use a dynamic counter to emit the values that it actually takes. The method we use on the Reporter object takes a group and counter name using String names:
public void incrCounter(String group, String counter, long amount)
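Here the counter name is the quality code read from each record, so new counters appear as new codes are encountered. Roughly (the group name string and the parser.getQuality() accessor are illustrative rather than exact code):

// Inside the old-API map/reduce method, where a Reporter is available:
reporter.incrCounter("TemperatureQuality", parser.getQuality(), 1);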
In addition to being available via the web UI and the command line (using hadoop job -counter), you can retrieve counter values using the Java API. You can do this while the job is running, although it is more usual to get counters at the end of a job run, when they are stable.
Counters counters = job.getCounters();
long missing = counters.getCounter(MaxTemperatureWithCounters.Temperature.MISSING);
long total = counters.findCounter("org.apache.hadoop.mapred.Task$Counter", "MAP_INPUT_RECORDS").getCounter();
A Streaming MapReduce program can increment counters by sending a specially formatted line to the standard error stream, which is co-opted as a control channel in this case. The line must have the following format:
reporter:counter:group,counter,amount
This snippet in Python shows how to increment the “Missing” counter in the “Temperature” group by one:
sys.stderr.write("reporter:counter:Temperature,Missing,1\n")
The ability to sort data is at the heart of MapReduce. Even if your application isn’t concerned with sorting per se, it may be able to use the sorting stage that MapReduce provides to organize its data. In this section, we will examine different ways of sorting datasets and how you can control the sort order in MapReduce.
Sorting is at the core of Hadoop and is provided by the platform out of the box: the final output is sorted by key by default.
So what does the framework use to sort the keys?
Controlling Sort Order
The sort order for keys is controlled by a RawComparator, which is found as follows:
1. If the property mapred.output.key.comparator.class is set, either explicitly or by calling setSortComparatorClass() on Job, then an instance of that class is used. (In the old API the equivalent method is setOutputKeyComparatorClass() on JobConf.)
2. Otherwise, keys must be a subclass of WritableComparable, and the registered comparator for the key class is used.
3. If there is no registered comparator, then a RawComparator is used that deserializes the byte streams being compared into objects and delegates to the WritableComparable’s compareTo() method.
These rules reinforce why it’s important to register optimized versions of RawComparators for your own custom Writable classes (which is covered in “Implementing a RawComparator for speed” on page 108), and also that it’s straightforward to override the sort order by setting your own comparator (we do this in “Secondary Sort” on page 276).
In other words, if the job's SortComparatorClass is set, that comparator is used for sorting; otherwise the key must be a WritableComparable, and it is best to register a comparator for it so that comparisons do not have to deserialize the byte streams into objects every time.
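For example, a job can replace the default ascending order simply by setting a different comparator. A minimal sketch, with all other job setup omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;

// Sort LongWritable keys in descending order by overriding the sort comparator.
public class DescendingSortSetup {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "descending sort");
    job.setSortComparatorClass(LongWritable.DecreasingComparator.class);
    // ... mapper, reducer, input/output formats, paths ...
  }
}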
In the hadoop 0.20.2 API, the job is redefined as the class org.apache.hadoop.mapreduce.Job.
It has three methods of particular interest:
job.setSortComparatorClass(RawComparator c); Define the comparator that controls how the keys are sorted before they are passed to the Reducer.
job.setPartitionerClass(Partitioner p); Set the Partitioner for the job.
job.setGroupingComparatorClass(RawComparator c); Define the comparator that controls which keys are grouped together for a single call to Reducer.reduce.
If the user does not define these three classes, the framework behaves as follows:
1. The key must be a WritableComparable, so keys are sorted by default using WritableComparable's compareTo().
2. Keys are partitioned with the default hash partitioner.
3. Grouping is done by key by default: values with the same key are collected into a list and passed to the reducer.
The framework replaces this default logic with whatever classes the user supplies, and most of the techniques below work by overriding these three functions, as in the sketch that follows.
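The sketch below just sets the three hooks explicitly to what they already default to for a job with Text keys. Doing so is redundant, but it shows where each class plugs in:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class DefaultSortControl {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "default sort control");
    job.setMapOutputKeyClass(Text.class);
    job.setSortComparatorClass(Text.Comparator.class);        // 1. how keys are ordered
    job.setPartitionerClass(HashPartitioner.class);           // 2. which reducer a key goes to
    job.setGroupingComparatorClass(Text.Comparator.class);    // 3. which keys share one reduce() call
    // ... mapper, reducer, input/output formats, paths ...
  }
}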
How can you produce a globally sorted file using Hadoop? The naive answer is to use a single partition. But this is incredibly inefficient for large files, since one machine has to process all of the output, so you are throwing away the benefits of the parallel architecture that MapReduce provides.
The output produced this way is sorted within each reducer, but there is no ordering guarantee across reducers, unless you use only one reducer, which is unrealistic when the data is large.
So what can we do?
The idea is simple: understand the distribution of the data and write a Partitioner that partitions keys by range; then the concatenation of the output files is globally sorted.
Temperature range    Proportion of records
< –5.6°C             29%
[–5.6°C, 13.9°C)     24%
[13.9°C, 22.0°C)     23%
>= 22.0°C            24%
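A hand-rolled partitioner using those boundaries might look like the sketch below, assuming the map output key is the temperature as an IntWritable in tenths of a degree Celsius and that the job runs with exactly four reducers. (In practice Hadoop also ships a TotalOrderPartitioner, used together with InputSampler, which derives such boundaries by sampling the keys rather than hard-coding them.)

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class TemperatureRangePartitioner extends Partitioner<IntWritable, Text> {
  // Boundaries in tenths of a degree Celsius: -5.6°C, 13.9°C and 22.0°C.
  private static final int[] BOUNDARIES = { -56, 139, 220 };

  @Override
  public int getPartition(IntWritable key, Text value, int numPartitions) {
    int temp = key.get();
    for (int i = 0; i < BOUNDARIES.length; i++) {
      if (temp < BOUNDARIES[i]) {
        return i;             // falls into the i-th range
      }
    }
    return BOUNDARIES.length; // >= 22.0°C, the last range
  }
}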
The MapReduce framework sorts the records by key before they reach the reducers.
For any particular key, however, the values are not sorted. The order that the values appear is not even stable from one run to the next, since they come from different map tasks, which may finish at different times from run to run.
Generally speaking, most MapReduce programs are written so as not to depend on the order that the values appear to the reduce function. However, it is possible to impose an order on the values by sorting and grouping the keys in a particular way.
What if you need the values for each key to be sorted as well, not just grouped by key?
To summarize, there is a recipe here to get the effect of sorting by value:
• Make the key a composite of the natural key and the natural value.
• The sort comparator should order by the composite key, that is, the natural key and natural value.
• The partitioner and grouping comparator for the composite key should consider only the natural key for partitioning and grouping.
This is a typical example of implementing functionality by overriding those three functions mentioned above.
First, because we need to sort by value in addition to the key, we combine the key and the value into a composite key, so that the value is available when sorting.
Override the SortComparatorClass so that sorting compares the natural key first and, for equal keys, compares the value.
That might look like enough, but it is not quite so simple, because all values for the same natural key must still end up in one list passed to the reducer.
So we must guarantee that records with the same natural key are partitioned to the same reducer, which means overriding the PartitionerClass; otherwise partitioning would be done on the whole composite key.
Likewise, the GroupingComparatorClass must be overridden so that grouping is done on the natural key rather than the composite key, as in the sketch below.
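Here is a rough sketch of the three overrides, assuming for simplicity that the composite key is a Text of the form "naturalKey\tnaturalValue"; the book uses a dedicated pair Writable instead, so the encoding here is purely an assumption of the sketch.

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class SecondarySortSketch {

  // Sort comparator: order by natural key first, then by natural value.
  public static class FullKeyComparator extends WritableComparator {
    public FullKeyComparator() { super(Text.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      String[] left = a.toString().split("\t", 2);
      String[] right = b.toString().split("\t", 2);
      int cmp = left[0].compareTo(right[0]);
      return cmp != 0 ? cmp : left[1].compareTo(right[1]);
    }
  }

  // Partitioner: consider only the natural key, so every record for a given
  // natural key reaches the same reducer regardless of its value part.
  public static class NaturalKeyPartitioner extends Partitioner<Text, NullWritable> {
    @Override
    public int getPartition(Text key, NullWritable value, int numPartitions) {
      String naturalKey = key.toString().split("\t", 2)[0];
      return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Grouping comparator: group reducer input by the natural key only, so all
  // values for one natural key arrive in a single reduce() call, already sorted.
  public static class NaturalKeyGroupingComparator extends WritableComparator {
    public NaturalKeyGroupingComparator() { super(Text.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      String leftKey = a.toString().split("\t", 2)[0];
      String rightKey = b.toString().split("\t", 2)[0];
      return leftKey.compareTo(rightKey);
    }
  }
}

These classes would then be wired onto the job with the same three setters shown earlier.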
MapReduce can perform joins between large datasets, but writing the code to do joins from scratch is fairly involved. Rather than writing MapReduce programs, you might consider using a higher-level framework such as Pig, Hive, or Cascading, in which join operations are a core part of the implementation.
For joins, as suggested above, use Pig or Hive. We can still sketch the approach here, but there is really no need to write joins by hand.
When joining two files where one is small enough to load into memory, it is simple: put the small file into the distributed cache, have each mapper load it into memory, and join against the large file as it streams through.
If both files are large, it is a bit more involved.
First, make both files inputs to the job, as shown below, and use the value being joined on as the map output key:
MultipleInputs.addInputPath(job, ncdcInputPath, TextInputFormat.class, JoinRecordMapper.class);
MultipleInputs.addInputPath(job, stationInputPath, TextInputFormat.class, JoinStationMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);
You can then see that at the reducer, all values with a given key end up in one list; pick out the records that came from each input file and stitch them together.
The remaining problem is easiest to explain with an example. Suppose we join the following two tables:
Station Id     Station Name
011990-99999   SIHCCAJAVRI

Station Id     Other information
011990-99999   0067011990999991950051507004+68750...
011990-99999   0043011990999991950051512004+68750...
011990-99999   0043011990999991950051518004+68750...
With the station id from both tables as the key, the data arriving at the reducer (after passing through the mappers) looks like this:
011990-99999, [0067011990999991950051507004+68750..., 0043011990999991950051512004+68750..., 0043011990999991950051518004+68750..., SIHCCAJAVRI ]
Here SIHCCAJAVRI happens to be at the end of the value list, which is the worst case; in fact you cannot predict where in the list it will appear, because MapReduce does not sort the values by default.
The final join result should look like this:
Station Id     Station Name   Other information
011990-99999   SIHCCAJAVRI    0067011990999991950051507004+68750...
011990-99999   SIHCCAJAVRI    0043011990999991950051512004+68750...
011990-99999   SIHCCAJAVRI    0043011990999991950051518004+68750...
So the obvious approach is to iterate over the list, find SIHCCAJAVRI first, and then concatenate it with the other values; but every value read before SIHCCAJAVRI is found has to be buffered in memory.
For small data this is fine, but for big data there may be a huge amount of other information for a single station id, and the in-memory buffer can easily blow up.
The solution is therefore to guarantee that SIHCCAJAVRI is always the first value we read.
How do we guarantee that? With a secondary sort:
make (Station Id, Document Id) the composite key,
then sort on the composite key, and partition and group on the station id alone, as in the sketch below.
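A rough sketch of the join reducer under that arrangement. It assumes the composite key is encoded as a Text of the form "stationId\ttag", where the station record carries a tag that sorts before the tags of the weather records (the Document Id above plays this role), that every station id has exactly one station record, and that both mappers emit the record text as the value; the class and field names are illustrative.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Iterator<Text> iter = values.iterator();
    // Thanks to the secondary sort, the first value is the station record.
    String stationName = iter.next().toString();          // e.g. SIHCCAJAVRI
    String stationId = key.toString().split("\t")[0];     // natural key part of the composite key
    // Every remaining value is a weather record, so nothing needs to be buffered.
    while (iter.hasNext()) {
      String record = iter.next().toString();
      context.write(new Text(stationId), new Text(stationName + "\t" + record));
    }
  }
}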
Side data can be defined as extra read-only data needed by a job to process the main dataset. The challenge is to make side data available to all the map or reduce tasks (which are spread across the cluster) in a convenient and efficient fashion.
You can set arbitrary key-value pairs in the job configuration using the various setter methods on Configuration (or JobConf in the old MapReduce API). This is very useful if you need to pass a small piece of metadata to your tasks.
If you only need to pass something small, such as a few parameters, the job configuration is enough.
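A minimal sketch; the property name myjob.side.message is made up for the example:

// Driver side: stash a small string in the job configuration.
Configuration conf = new Configuration();
conf.set("myjob.side.message", "hello");
Job job = new Job(conf, "configuration side data");

// Task side, e.g. in Mapper.setup(): read it back from the context.
String message = context.getConfiguration().get("myjob.side.message");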
Rather than serializing side data in the job configuration, it is preferable to distribute datasets using Hadoop’s distributed cache mechanism. This provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run. To save network bandwidth, files are normally copied to any particular node once per job.
If the data is larger, such as a corpus, you need the distributed cache.
When you launch a job, Hadoop copies the files specified by the -files, -archives and -libjars options to the jobtracker’s filesystem (normally HDFS).
Then, before a task is run, the tasktracker copies the files from the jobtracker’s filesystem to a local disk—the cache—so the task can access the files. The files are said to be localized at this point.
From the task’s point of view, the files are just there (and it doesn’t care that they came from HDFS). In addition, files specified by -libjars are added to the task’s classpath before it is launched.
Of the three options for the distributed cache, -files is the straightforward one; -archives are automatically unarchived on the task node, and -libjars are automatically added to the task's Java classpath.
When the job is launched the files are copied to HDFS, and the tasktracker copies them to local disk before the task runs, so the task can access them locally.
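For example, assuming the driver goes through ToolRunner so that GenericOptionsParser handles the -files option, a small lookup file can be shipped with the job and read in the mapper's setup() method. The jar, class, and file names below are made up for the illustration:

% hadoop jar myjob.jar MyDriver -files /local/path/stations.txt input output

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The file shipped with -files is localized into the task's working directory,
// so it can be opened by its plain name in setup().
public class SideDataMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> lookup = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    BufferedReader in = new BufferedReader(new FileReader("stations.txt"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        String[] fields = line.split("\t", 2);   // assumed tab-separated id and name
        if (fields.length == 2) {
          lookup.put(fields[0], fields[1]);
        }
      }
    } finally {
      in.close();
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // ... use lookup to enrich each record ...
  }
}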
The tasktracker also maintains a reference count for the number of tasks using each file in the cache. Before the task has run, the file’s reference count is incremented by one; then after the task has run, the count is decreased by one. Only when the count reaches zero is it eligible for deletion, since no tasks are using it. Files are deleted to make room for a new file when the cache exceeds a certain size—10 GB by default. The cache size may be changed by setting the configuration property local.cache.size, which is measured in bytes.
In short, reference counting ensures that a file is only deleted once no task is using it anymore.