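The flow above is the complete MapReduce workflow, but the Shuffle phase itself only runs from step 7 through step 16. The Shuffle phase in detail is as follows:
Note:
mapreduce.task.io.sort.mb, the size of the map-side circular sort buffer, defaults to 100 MB.
The data processing that happens after the map method and before the reduce method is called Shuffle.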
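To put a number on the buffer size mentioned in the note, the sketch below computes the point at which a spill would start. It is plain arithmetic, not Hadoop code. The class and method names are made up for illustration, and the integer 80 is an assumption standing in for the 0.80 default of mapreduce.map.sort.spill.percent.

```java
// Sketch only: plain Java arithmetic, no Hadoop classes involved.
class SpillThresholdSketch {

    // Bytes of buffered data at which a spill would begin.
    // sortMb mirrors mapreduce.task.io.sort.mb; spillPercent is an assumed
    // integer percentage (80 stands in for the 0.80 default of
    // mapreduce.map.sort.spill.percent).
    static long spillThresholdBytes(int sortMb, int spillPercent) {
        long bufferBytes = sortMb * 1024L * 1024L;
        return bufferBytes * spillPercent / 100;
    }

    public static void main(String[] args) {
        // 100 MB buffer, spill at 80% usage
        System.out.println(spillThresholdBytes(100, 80));
    }
}
```

With the defaults this prints 83886080, that is, a spill begins once roughly 80 MB of the 100 MB buffer is in use.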
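Requirement: write the statistics to different files (partitions) according to some condition. For example, write the statistics to different files (partitions) by the province each phone number belongs to.
The default partition is computed by taking the key's hashCode modulo the number of ReduceTasks. Users cannot control which key is stored in which partition.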
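The default rule can be sketched in plain Java. The formula follows the one used by Hadoop's HashPartitioner (clear the sign bit, then take the remainder); the class name and the phone numbers are made up for illustration. In a real job the key is a Text object, whose hashCode differs from String's, so the partition numbers printed here will not match those of an actual run.

```java
// Sketch only: reproduces the hash-modulo rule without any Hadoop classes.
class HashPartitionSketch {

    // Clear the sign bit so a negative hashCode cannot yield a negative
    // partition, then take the remainder by the number of ReduceTasks.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Hypothetical sample keys
        String[] phones = {"13630577991", "13736230513", "13846544121"};
        for (String phone : phones) {
            System.out.println(phone + " -> partition " + partitionFor(phone, 5));
        }
    }
}
```

Because the result depends only on the hash value, similar keys can scatter across partitions, which is why a custom Partitioner is needed to group records by province.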
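If the number of ReduceTasks is greater than the number of partitions returned by getPartition, extra empty output files named part-r-000xx are produced.
If the number of ReduceTasks is greater than 1 but smaller than the number of partitions, some partitioned data has nowhere to go and an Exception is thrown.
If the number of ReduceTasks is 1, all map output goes to that single ReduceTask regardless of how many partitions there are, and only one result file, part-r-00000, is produced.
For example, suppose the number of custom partitions is 5. Then 1 ReduceTask gives a single part-r-00000, 2 to 4 ReduceTasks lead to an Exception once a record is assigned a partition that has no ReduceTask, and 6 or more ReduceTasks leave the surplus part-r-000xx files empty.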
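The relationship between the ReduceTask count and the partition numbers can be modeled with a small plain-Java sketch. It imitates the observable behavior only and is not Hadoop's implementation; the class name, the method names, and the use of IllegalStateException as a stand-in for Hadoop's exception are all assumptions made for illustration.

```java
// Sketch only: models where a record ends up for a given ReduceTask count.
class ReduceTaskCountSketch {

    // Returns the ReduceTask index a record ends up at, or throws when the
    // custom partition number has no matching ReduceTask.
    static int route(int customPartition, int numReduceTasks) {
        if (numReduceTasks == 1) {
            // With a single ReduceTask everything lands in partition 0.
            return 0;
        }
        if (customPartition < 0 || customPartition >= numReduceTasks) {
            // Stand-in for the exception Hadoop raises for an illegal partition.
            throw new IllegalStateException("Illegal partition: " + customPartition);
        }
        return customPartition;
    }

    // Number of output files that stay empty when every partition
    // 0..numPartitions-1 receives data.
    static int emptyOutputFiles(int numPartitions, int numReduceTasks) {
        return Math.max(0, numReduceTasks - numPartitions);
    }

    public static void main(String[] args) {
        System.out.println(route(4, 1));            // single ReduceTask
        System.out.println(route(4, 5));            // one ReduceTask per partition
        System.out.println(emptyOutputFiles(5, 6)); // one surplus ReduceTask
        try {
            route(4, 2);                            // too few ReduceTasks
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```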
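Write the statistics to different files (partitions) by the province each phone number belongs to.
Phone numbers beginning with 136, 137, 138, and 139 each go to their own file (4 files in total); numbers with any other prefix go to one shared file.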
package com.fickler.mapreduce.writable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
/**
* @author dell
* @version 1.0
*/
public class ProvincePartitioner extends Partitioner<Text, FlowBean> {
    @Override
    public int getPartition(Text text, FlowBean flowBean, int i) {
        // Take the first three digits of the phone number as prePhone
        String phone = text.toString();
        String prePhone = phone.substring(0, 3);
        // Choose the partition number according to prePhone
        int partition;
        if ("136".equals(prePhone)) {
            partition = 0;
        } else if ("137".equals(prePhone)) {
            partition = 1;
        } else if ("138".equals(prePhone)) {
            partition = 2;
        } else if ("139".equals(prePhone)) {
            partition = 3;
        } else {
            partition = 4;
        }
        // Return the partition number
        return partition;
    }
}
package com.fickler.mapreduce.writable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
* @author dell
* @version 1.0
*/
public class FlowDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // 1. Get the job
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // 2. Set the jar
        job.setJarByClass(FlowDriver.class);
        // 3. Associate the Mapper and the Reducer
        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReducer.class);
        // 4. Set the Mapper output key and value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        // 5. Set the final output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        // 6. Specify the custom partitioner
        job.setPartitionerClass(ProvincePartitioner.class);
        // 7. Specify a matching number of ReduceTasks
        job.setNumReduceTasks(5);
        // 8. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("C:\\Users\\dell\\Desktop\\input"));
        FileOutputFormat.setOutputPath(job, new Path("C:\\Users\\dell\\Desktop\\output"));
        // 9. Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
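Sorting is one of the most important operations in the MapReduce framework.
Both MapTask and ReduceTask sort data by key. This is Hadoop's default behavior: the data of every application is sorted, whether or not the logic requires it.
The default order is lexicographic, and the sort is implemented with quicksort.
A MapTask places its results temporarily in a circular buffer. When buffer usage reaches a certain threshold, it quicksorts the data in the buffer and spills the sorted data to disk. Once all data has been processed, it merge-sorts all the spill files on disk.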
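The sort that precedes a spill can be pictured with a plain-Java sketch. It assumes that buffered records are ordered first by partition number and then by key, so that each partition's data is contiguous and sorted in the spill file. The class and field names are hypothetical, and List.sort (a merge sort variant) merely stands in for the quicksort Hadoop uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch only: orders buffered map output by (partition, key) before a spill.
class SpillSortSketch {

    // One buffered map output record; the value is omitted for brevity.
    static final class BufferedRecord {
        final int partition;
        final String key;

        BufferedRecord(int partition, String key) {
            this.partition = partition;
            this.key = key;
        }

        @Override
        public String toString() {
            return partition + ":" + key;
        }
    }

    // Returns a copy of the buffer sorted by partition, then by key.
    static List<BufferedRecord> sortForSpill(List<BufferedRecord> buffer) {
        List<BufferedRecord> sorted = new ArrayList<>(buffer);
        sorted.sort(Comparator
                .comparingInt((BufferedRecord r) -> r.partition)
                .thenComparing(r -> r.key));
        return sorted;
    }

    public static void main(String[] args) {
        List<BufferedRecord> buffer = Arrays.asList(
                new BufferedRecord(1, "b"),
                new BufferedRecord(0, "z"),
                new BufferedRecord(1, "a"),
                new BufferedRecord(0, "c"));
        System.out.println(sortForSpill(buffer)); // [0:c, 0:z, 1:a, 1:b]
    }
}
```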
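A ReduceTask remotely copies the relevant data files from each MapTask. A file larger than a certain threshold is spilled to disk; otherwise it is kept in memory. When the number of files on disk reaches a certain threshold, they are merge-sorted into one larger file; when the size or number of files in memory exceeds a certain threshold, they are merged and spilled to disk. After all data has been copied, the ReduceTask performs one final merge sort over all data in memory and on disk.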
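The merge step can be illustrated as a k-way merge over runs that are already sorted. The sketch below is a generic heap-based merge in plain Java, not Hadoop's merger; the class name and the sample runs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Sketch only: merges several sorted runs into one sorted list.
class MergeRunsSketch {

    static List<String> merge(List<List<String>> runs) {
        // Heap entries are {run index, position in run}, ordered by the key
        // they currently point at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heap.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            List<String> run = runs.get(head[0]);
            merged.add(run.get(head[1]));
            // Advance within the same run if it has more elements.
            if (head[1] + 1 < run.size()) {
                heap.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> runs = Arrays.asList(
                Arrays.asList("a", "d", "g"),
                Arrays.asList("b", "e"),
                Arrays.asList("c", "f", "h"));
        System.out.println(merge(runs)); // [a, b, c, d, e, f, g, h]
    }
}
```

Because every run is already sorted, the merge only ever compares the current head of each run, which is what makes merging spill files much cheaper than re-sorting all the data.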
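In a partial sort, MapReduce sorts the data set by the keys of the input records, guaranteeing that each output file is sorted internally.
In a total sort, the final output is a single file that is sorted internally. This is achieved by setting only one ReduceTask. However, this is extremely inefficient for large files, because one machine processes all the data and the parallelism MapReduce provides is lost entirely.
In a grouping sort, keys are grouped on the Reduce side. Use case: when the received key is a bean object and keys that share one or more fields (but do not compare equal on all fields) should enter the same reduce call, grouping sort can be used.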
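The grouping idea can be sketched without Hadoop: keys arrive sorted, and adjacent keys that a grouping comparator treats as equal are handed to the same reduce call. In a real job such a comparator would be registered through job.setGroupingComparatorClass. The class name, the {orderId, price} key layout, and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch only: groups adjacent sorted keys with a comparator that looks at
// just one field of the key.
class GroupingSketch {

    // Splits an already sorted key list into runs of adjacent keys that the
    // grouping comparator considers equal; each run models one reduce call.
    static <K> List<List<K>> group(List<K> sortedKeys, Comparator<K> grouping) {
        List<List<K>> groups = new ArrayList<>();
        for (K key : sortedKeys) {
            if (groups.isEmpty()
                    || grouping.compare(groups.get(groups.size() - 1).get(0), key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
        }
        return groups;
    }

    // Keys are {orderId, price}, already sorted by orderId and then by
    // price descending; grouping compares orderId only.
    static List<Integer> demoGroupSizes() {
        List<int[]> sortedKeys = Arrays.asList(
                new int[] {1, 222}, new int[] {1, 33},
                new int[] {2, 522}, new int[] {2, 122},
                new int[] {3, 232});
        Comparator<int[]> byOrderId = Comparator.comparingInt(k -> k[0]);
        List<Integer> sizes = new ArrayList<>();
        for (List<int[]> g : group(sortedKeys, byOrderId)) {
            sizes.add(g.size());
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(demoGroupSizes()); // [2, 2, 1]
    }
}
```

The five keys all differ when every field is compared, yet they form only three groups, one per orderId.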
In a custom sort, if compareTo checks two conditions, the result is a secondary sort.
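A compareTo with two conditions might look like the following plain-Java sketch. The FlowKey class is hypothetical and leaves out the Writable serialization methods a real key needs. The choice of total flow descending as the first condition and upstream flow ascending as the second is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch only: a key whose compareTo applies two conditions in turn.
class SecondarySortSketch {

    static final class FlowKey implements Comparable<FlowKey> {
        final long upFlow;
        final long sumFlow;

        FlowKey(long upFlow, long sumFlow) {
            this.upFlow = upFlow;
            this.sumFlow = sumFlow;
        }

        @Override
        public int compareTo(FlowKey other) {
            // First condition: total flow, descending.
            int bySum = Long.compare(other.sumFlow, this.sumFlow);
            if (bySum != 0) {
                return bySum;
            }
            // Second condition: upstream flow, ascending.
            return Long.compare(this.upFlow, other.upFlow);
        }

        @Override
        public String toString() {
            return sumFlow + "/" + upFlow;
        }
    }

    // Returns a sorted copy of the given keys.
    static List<FlowKey> sorted(List<FlowKey> keys) {
        List<FlowKey> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<FlowKey> keys = Arrays.asList(
                new FlowKey(30, 100), new FlowKey(10, 200), new FlowKey(20, 100));
        System.out.println(sorted(keys)); // [200/10, 100/20, 100/30]
    }
}
```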
When a bean object is transmitted as the key, it must implement the WritableComparable interface and override compareTo; that is all sorting requires.
@Override
public int compareTo(FlowBean bean) {
int result;
// Sort by total flow in descending order
if (this.sumFlow > bean.getSumFlow()) {
result = -1;
}else if (this.sumFlow < bean.getSumFlow()) {
result = 1;
}else {
result = 0;
}
return result;
}
Based on the output of the previous case, sort the records again by total flow in descending order.
package com.fickler.mapreduce.writablecompable;
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
/**
* @author dell
* @version 1.0
*/
public class FlowBean implements WritableComparable<FlowBean> {
private long upFlow;
private long downFlow;
private long sumFlow;
public FlowBean() {
}
public long getUpFlow() {
return upFlow;
}
public void setUpFlow(long upFlow) {
this.upFlow = upFlow;
}
public long getDownFlow() {
return downFlow;
}
public void setDownFlow(long downFlow) {
this.downFlow = downFlow;
}
public long getSumFlow() {
return sumFlow;
}
public void setSumFlow(long sumFlow) {
this.sumFlow = sumFlow;
}
public void setSumFlow() {
this.sumFlow = this.upFlow + this.downFlow;
}
@Override
public String toString() {
return upFlow +
"\t" + downFlow +
"\t" + sumFlow;
}
@Override
public int compareTo(FlowBean o) {
if (this.sumFlow > o.sumFlow){
return -1;
} else if (this.sumFlow < o.sumFlow){
return 1;
} else {
return 0;
}
}
@Override
public void write(DataOutput dataOutput) throws IOException {
dataOutput.writeLong(this.upFlow);
dataOutput.writeLong(this.downFlow);
dataOutput.writeLong(this.sumFlow);
}
@Override
public void readFields(DataInput dataInput) throws IOException {
this.upFlow = dataInput.readLong();
this.downFlow = dataInput.readLong();
this.sumFlow = dataInput.readLong();
}
}
package com.fickler.mapreduce.writablecompable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* @author dell
* @version 1.0
*/
public class FlowMapper extends Mapper<LongWritable, Text, FlowBean, Text> {
private FlowBean outK = new FlowBean();
private Text outV = new Text();
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, FlowBean, Text>.Context context) throws IOException, InterruptedException {
String line = value.toString();
String[] split = line.split("\t");
outK.setUpFlow(Long.parseLong(split[1]));
outK.setDownFlow(Long.parseLong(split[2]));
outK.setSumFlow();
outV.set(split[0]);
context.write(outK, outV);
}
}
package com.fickler.mapreduce.writablecompable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* @author dell
* @version 1.0
*/
public class FlowReducer extends Reducer<FlowBean, Text, Text, FlowBean> {
@Override
protected void reduce(FlowBean key, Iterable<Text> values, Reducer<FlowBean, Text, Text, FlowBean>.Context context) throws IOException, InterruptedException {
for (Text value : values){
context.write(value, key);
}
}
}
package com.fickler.mapreduce.writablecompable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
* @author dell
* @version 1.0
*/
public class FlowDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration configuration = new Configuration();
Job job = Job.getInstance(configuration);
job.setJarByClass(FlowDriver.class);
job.setMapperClass(FlowMapper.class);
job.setReducerClass(FlowReducer.class);
job.setMapOutputKeyClass(FlowBean.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FlowBean.class);
FileInputFormat.setInputPaths(job, new Path("C:\\Users\\dell\\Desktop\\input"));
FileOutputFormat.setOutputPath(job, new Path("C:\\Users\\dell\\Desktop\\output"));
boolean b = job.waitForCompletion(true);
System.exit(b ? 0 : 1);
}
}
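Requirement: within the output file for each province's phone numbers, records must be sorted by total flow.
Building on the previous requirement, add a custom partitioner class that partitions by the province prefix of the phone number.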
package com.fickler.mapreduce.writablecompable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
/**
* @author dell
* @version 1.0
*/
public class ProvincePartitioner extends Partitioner<FlowBean, Text> {
@Override
public int getPartition(FlowBean flowBean, Text text, int i) {
String phone = text.toString();
String prePhone = phone.substring(0, 3);
int partition;
if ("136".equals(prePhone)){
partition = 0;
} else if ("137".equals(prePhone)) {
partition = 1;
} else if ("138".equals(prePhone)){
partition = 2;
} else if ("139".equals(prePhone)) {
partition = 3;
} else {
partition = 4;
}
return partition;
}
}
job.setPartitionerClass(ProvincePartitioner.class);
job.setNumReduceTasks(5);
Steps to implement a custom Combiner
public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable outV = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
outV.set(sum);
context.write(key,outV);
}
}
job.setCombinerClass(WordCountCombiner.class);
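During the count, locally aggregate the output of each MapTask to reduce the amount of data sent over the network; that is, use the Combiner feature.
Expectation: the Combiner receives many input records, merges them, and emits fewer output records.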
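The expected reduction can be seen in a plain-Java sketch of what a summing Combiner does to one MapTask's output. No Hadoop classes are involved; the class name and the sample words are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: local aggregation of one map task's (word, 1) records.
class CombinerSketch {

    // Sums the counts per word, as a summing combiner would, so that only
    // one record per distinct word leaves the map side.
    static Map<String, Integer> combine(List<String> mapOutputWords) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String word : mapOutputWords) {
            combined.merge(word, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList(
                "hadoop", "spark", "hadoop", "hive", "hadoop", "spark");
        Map<String, Integer> combined = combine(mapOutput);
        // 6 records in, 3 records out: {hadoop=3, hive=1, spark=2}
        System.out.println(mapOutput.size() + " records in, "
                + combined.size() + " records out: " + combined);
    }
}
```

Summation tolerates this kind of pre-aggregation because adding partial sums gives the same total. An operation such as averaging would not, since an average of partial averages is generally not the overall average.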
package com.fickler.mapreduce.wordcount;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* @author dell
* @version 1.0
*/
public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable outV = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable intWritable : values){
sum += intWritable.get();
}
outV.set(sum);
context.write(key, outV);
}
}
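// Option 1: register the custom Combiner class
job.setCombinerClass(WordCountCombiner.class);
// Option 2: register WordCountReducer as the Combiner (use this line or the one above, not both)
job.setCombinerClass(WordCountReducer.class);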