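As we know, the final output of a Hadoop job is only partially ordered rather than totally ordered, unless a single Reduce Task handles all of it: each Reduce Task's output is sorted by key, but there is no ordering across the outputs of different Reduce Tasks.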
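To make "partially ordered" concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class `PartialOrderSketch` and its methods are hypothetical names made up for this illustration. It assumes keys are assigned to reducers by hash code modulo the reducer count, which is how Hadoop's default HashPartitioner behaves, and it uses in-memory lists to stand in for reducer outputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; not part of Hadoop.
class PartialOrderSketch {

    // Assigns each key to a partition by hash, then sorts every partition,
    // the way each reducer sees its own keys in sorted order.
    static List<List<Integer>> hashPartition(List<Integer> keys, int numPartitions) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (Integer k : keys) {
            parts.get((k.hashCode() & Integer.MAX_VALUE) % numPartitions).add(k);
        }
        for (List<Integer> p : parts) {
            Collections.sort(p);
        }
        return parts;
    }

    // Concatenates the partitions in partition order, like reading the
    // reducer output files one after another.
    static List<Integer> concat(List<List<Integer>> parts) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> p : parts) {
            out.addAll(p);
        }
        return out;
    }

    static boolean isSorted(List<Integer> xs) {
        for (int i = 1; i < xs.size(); i++) {
            if (xs.get(i - 1) > xs.get(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = hashPartition(Arrays.asList(5, 2, 9, 4, 7, 1), 2);
        System.out.println(parts);                   // [[2, 4], [1, 5, 7, 9]]
        System.out.println(isSorted(concat(parts))); // false
    }
}
```

Each partition is sorted on its own, but reading them back to back gives 2, 4, 1, 5, 7, 9, which is not a total order.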
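To get a total order across several Reduce Tasks, TotalOrderPartitioner assigns keys to reducers by key range, and it needs a sample of the input keys to decide where those ranges begin and end. Newer versions of Hadoop ship with three built-in samplers for this: SplitSampler, RandomSampler and IntervalSampler. All three are static nested classes of InputSampler, and all three implement Sampler, an interface nested in InputSampler. The relevant code is as follows: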
/**
 * Utility for collecting samples and writing a partition file for
 * {@link org.apache.hadoop.mapred.lib.TotalOrderPartitioner}.
 */
public class InputSampler<K,V> implements Tool {
  ...
  /**
   * Sampler interface.
   */
  public interface Sampler<K,V> {
    /**
     * Collects a subset of the keys in the input data. On the map side,
     * TotalOrderPartitioner then uses these samples to partition keys by
     * range (not by hash), so that the data handled by different reducers
     * is ordered across reducers.
     * The concrete sampling logic is supplied by the implementing classes.
     * For a given job, collect and return a subset of the keys from the
     * input data.
     */
    K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException;
  }

  /**
   * Split sampler: takes the first records of up to N splits.
   * The cheapest of the three.
   * Samples the first n records from s splits.
   * Inexpensive way to sample random data.
   */
  public static class SplitSampler<K,V> implements Sampler<K,V> {
    ...
  }

  /**
   * General-purpose random sampler: samples the whole input at random
   * with a given frequency. The most expensive of the three.
   * Sample from random points in the input.
   * General-purpose sampler. Takes numSamples / maxSplitsSampled inputs from
   * each split.
   */
  public static class RandomSampler<K,V> implements Sampler<K,V> {
    ...
  }

  /**
   * Fixed-interval sampler: suited to sorted input, and somewhat cheaper
   * than the random sampler.
   * Sample from s splits at regular intervals.
   * Useful for sorted data.
   */
  public static class IntervalSampler<K,V> implements Sampler<K,V> {
    ...
  }
  ...
}
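How a set of sampled keys becomes a range partitioning can be sketched as follows. This is a simplified illustration and not Hadoop's own code: the class `RangePartitionSketch` and its methods are hypothetical, keys are plain ints, and duplicate cut points are not handled. It assumes there are at least as many samples as partitions.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Hadoop.
class RangePartitionSketch {

    // Sorts the samples and picks numPartitions - 1 evenly spaced cut points.
    // Assumes samples.length >= numPartitions.
    static int[] splitPoints(int[] samples, int numPartitions) {
        int[] sorted = samples.clone();
        Arrays.sort(sorted);
        int[] points = new int[numPartitions - 1];
        float step = sorted.length / (float) numPartitions;
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted[Math.round(step * i)];
        }
        return points;
    }

    // Finds the partition for a key by binary search over the cut points:
    // keys below points[0] go to partition 0, keys in
    // [points[0], points[1]) go to partition 1, and so on.
    static int partitionFor(int key, int[] points) {
        int pos = Arrays.binarySearch(points, key) + 1;
        return pos < 0 ? -pos : pos;
    }

    public static void main(String[] args) {
        int[] points = splitPoints(new int[] {8, 3, 1, 6, 9, 4, 2, 7}, 3);
        System.out.println(Arrays.toString(points)); // [4, 7]
        System.out.println(partitionFor(3, points)); // 0
        System.out.println(partitionFor(5, points)); // 1
        System.out.println(partitionFor(7, points)); // 2
    }
}
```

Because every key in partition 0 is smaller than every key in partition 1, and so on, sorting inside each partition yields a total order when the partitions are read in sequence. The quality of the samples only affects how evenly the keys are spread over the partitions.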
/**
 * Samples the first n records from s splits.
 * Inexpensive way to sample random data.
 */
public static class SplitSampler<K,V> implements Sampler<K,V> {
  ...
  /**
   * From each split sampled, take the first numSamples / numSplits records.
   */
  @SuppressWarnings("unchecked") // ArrayList::toArray doesn't preserve type
  public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
    // Get all input splits from the InputFormat (covered earlier in the
    // notes on the InputFormat component).
    InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
    ArrayList<K> samples = new ArrayList<K>(numSamples);
    // Number of splits to sample: the smaller of maxSplitsSampled and the
    // total number of splits.
    int splitsToSample = Math.min(maxSplitsSampled, splits.length);
    // Stride between sampled splits.
    int splitStep = splits.length / splitsToSample;
    // Number of samples to take from each sampled split.
    int samplesPerSplit = numSamples / splitsToSample;
    long records = 0;
    for (int i = 0; i < splitsToSample; ++i) {
      // Open a RecordReader on split (i * splitStep); it parses the raw
      // data into key/value pairs.
      RecordReader<K,V> reader = inf.getRecordReader(splits[i * splitStep],
          job, Reporter.NULL);
      K key = reader.createKey();
      V value = reader.createValue();
      while (reader.next(key, value)) { // read the next record into key and value
        // Add the sampled key to the samples list and allocate a fresh key
        // object so the stored one is not overwritten.
        samples.add(key);
        key = reader.createKey();
        ++records;
        // records is cumulative across splits: stop once the quota up to
        // and including this split has been reached.
        if ((i+1) * samplesPerSplit <= records) {
          break;
        }
      }
      reader.close();
    }
    // Return the sampled keys for TotalOrderPartitioner to use.
    return (K[])samples.toArray();
  }
}
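The split selection and the cumulative quota check in SplitSampler can be tried out without a cluster. The sketch below is a stand-in that replaces InputSplit and RecordReader with in-memory lists of strings; the class `SplitSamplerSketch` is a hypothetical name, and the method assumes at least one split and maxSplitsSampled of at least 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Hadoop.
class SplitSamplerSketch {

    // Each inner list plays the role of one input split.
    static List<String> sample(List<List<String>> splits, int numSamples,
                               int maxSplitsSampled) {
        List<String> samples = new ArrayList<>(numSamples);
        int splitsToSample = Math.min(maxSplitsSampled, splits.size());
        int splitStep = splits.size() / splitsToSample;
        int samplesPerSplit = numSamples / splitsToSample;
        long records = 0; // cumulative across all sampled splits
        for (int i = 0; i < splitsToSample; ++i) {
            for (String key : splits.get(i * splitStep)) {
                samples.add(key);
                ++records;
                if ((long) (i + 1) * samplesPerSplit <= records) {
                    break; // quota up to this split reached
                }
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("s0-r0", "s0-r1", "s0-r2"),
            Arrays.asList("s1-r0", "s1-r1", "s1-r2"),
            Arrays.asList("s2-r0", "s2-r1", "s2-r2"),
            Arrays.asList("s3-r0", "s3-r1", "s3-r2"));
        // 4 samples from at most 2 splits: splits 0 and 2, two records each.
        System.out.println(sample(splits, 4, 2)); // [s0-r0, s0-r1, s2-r0, s2-r1]
    }
}
```

Since only the first records of each chosen split are read, the sample is cheap to collect but is representative only when keys are not already ordered inside the input.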
public static class IntervalSampler<K,V> implements Sampler<K,V> {
  ...
  /**
   * Samples from s splits at a regular interval; well suited to input that
   * is already sorted.
   * For each split sampled, emit when the ratio of the number of records
   * retained to the total record count is less than the specified
   * frequency.
   */
  @SuppressWarnings("unchecked") // ArrayList::toArray doesn't preserve type
  public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
    // Get all input splits from the InputFormat.
    InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
    ArrayList<K> samples = new ArrayList<K>();
    // Number of splits to sample: the smaller of maxSplitsSampled and the
    // total number of splits.
    int splitsToSample = Math.min(maxSplitsSampled, splits.length);
    // Stride between sampled splits.
    int splitStep = splits.length / splitsToSample;
    long records = 0;
    long kept = 0;
    for (int i = 0; i < splitsToSample; ++i) {
      // Open a RecordReader on split (i * splitStep); it parses the raw
      // data into key/value pairs.
      RecordReader<K,V> reader = inf.getRecordReader(splits[i * splitStep],
          job, Reporter.NULL);
      K key = reader.createKey();
      V value = reader.createValue();
      while (reader.next(key, value)) { // read the next record into key and value
        ++records;
        // Keep this key only while the ratio of kept samples to records
        // read so far is below freq.
        if ((double) kept / records < freq) {
          ++kept;
          samples.add(key);
          key = reader.createKey();
        }
      }
      reader.close();
    }
    // Return the sampled keys for TotalOrderPartitioner to use.
    return (K[])samples.toArray();
  }
}
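The `kept / records < freq` test is easier to follow with concrete numbers. The sketch below reproduces only that counting logic over in-memory lists. The class `IntervalSamplerSketch` is a hypothetical name, each inner list stands in for one split, and the method assumes at least one split and maxSplitsSampled of at least 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Hadoop.
class IntervalSamplerSketch {

    static List<String> sample(List<List<String>> splits, double freq,
                               int maxSplitsSampled) {
        List<String> samples = new ArrayList<>();
        int splitsToSample = Math.min(maxSplitsSampled, splits.size());
        int splitStep = splits.size() / splitsToSample;
        long records = 0; // records read so far, across all sampled splits
        long kept = 0;    // records kept so far, across all sampled splits
        for (int i = 0; i < splitsToSample; ++i) {
            for (String key : splits.get(i * splitStep)) {
                ++records;
                // Keep the record only while the kept ratio is below freq.
                if ((double) kept / records < freq) {
                    ++kept;
                    samples.add(key);
                }
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        List<List<String>> oneSplit = Arrays.asList(Arrays.asList(
            "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8"));
        // With freq = 0.25 every fourth record is kept.
        System.out.println(sample(oneSplit, 0.25, 1)); // [r0, r4, r8]
    }
}
```

With freq = 0.25 the first record is always kept (0 / 1 is below 0.25), and the next one is kept only when the ratio drops below 0.25 again, at the fifth record (1 / 5). Because both counters carry over from one split to the next, the interval stays regular across split boundaries.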
