Requirement:
1. Count the occurrences of each word in the file 1.txt (word count):
$ cat 1.txt
aa bb aa dd ff rr ee aa kk jj hh uu ii tt rr tt oo uu
2. The output is limited to two files: one holds the words from aa through kk, the other holds the words from ll through zz.
Solution:
The default partitioner MapReduce uses to assign map output keys to reducers is HashPartitioner:
public class HashPartitioner<K, V> extends Partitioner<K, V> {

    /** Use {@link Object#hashCode()} to partition. */
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
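With this default and two reducers, the file a word lands in is determined by its hash code rather than by its alphabetical range, so the requirement above cannot be met without a custom partitioner. Below is a minimal sketch of the default behavior; the class name DefaultPartitionDemo, the word list, and the reducer count are just for illustration and are not part of the original post.

import org.apache.hadoop.io.Text;

public class DefaultPartitionDemo {
    public static void main(String[] args) {
        // For each word, compute the partition the default HashPartitioner would pick
        // with 2 reducers; alphabetically adjacent words may land in different partitions.
        int numReduceTasks = 2;
        for (String w : new String[] {"aa", "bb", "kk", "oo"}) {
            int partition = (new Text(w).hashCode() & Integer.MAX_VALUE) % numReduceTasks;
            System.out.println(w + " -> partition " + partition);
        }
    }
}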
Overriding this method is all that is needed:
private static class MyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Keys in the range aa~kk go to reducer 0; all other keys go to reducer 1.
        if (key.toString().compareTo("aa") >= 0 && key.toString().compareTo("kk") <= 0) {
            return 0;
        } else {
            return 1;
        }
    }
}
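A quick way to convince yourself of the routing (not from the original post; assumes MyPartitioner and the Hadoop Text/IntWritable types are visible from the calling code):

MyPartitioner p = new MyPartitioner();
// Words in the aa~kk range go to reducer 0 (part-r-00000) ...
System.out.println(p.getPartition(new Text("dd"), new IntWritable(1), 2)); // prints 0
// ... and words that sort after "kk" go to reducer 1 (part-r-00001).
System.out.println(p.getPartition(new Text("rr"), new IntWritable(1), 2)); // prints 1

Note that a key sorting before "aa" would also fall into the else branch and go to reducer 1; for this input every word lies within aa~zz, so that case does not arise.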
Set the conf and job parameters:
conf.set("mapred.reduce.tasks", "2");
job.setPartitionerClass(MyPartitioner.class);
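For completeness, here is a minimal sketch of a driver these two lines could live in. The class name WordCountWithPartitioner and the mapper/reducer (the standard word-count pair) are assumptions, not taken from the original post:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithPartitioner {

    // Standard word-count mapper: emit (word, 1) for every token in the line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Standard word-count reducer: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // The custom partitioner from the post: aa~kk -> reducer 0, everything else -> reducer 1.
    private static class MyPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            if (key.toString().compareTo("aa") >= 0 && key.toString().compareTo("kk") <= 0) {
                return 0;
            }
            return 1;
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapred.reduce.tasks", "2");         // two reducers -> two output files

        Job job = Job.getInstance(conf, "wordcount with range partitioner");
        job.setJarByClass(WordCountWithPartitioner.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setPartitionerClass(MyPartitioner.class); // route keys by range, not by hash
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}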
Output:
$ hadoop fs -cat /lxw/output/part-r-00000
aa      3
bb      1
dd      1
ee      1
ff      1
hh      1
ii      1
jj      1
kk      1
$ hadoop fs -cat /lxw/output/part-r-00001
oo      1
rr      2
tt      2
uu      2