Each machine sorts the portion of the data it is responsible for.
If a shuffle can be avoided, avoid it: do not trigger one unnecessarily.
Partitioning: determines which Reduce will process each record of the Map output.
job.setPartitionerClass(UserFlowPartition.class);
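For reference, a minimal sketch of what a partitioner like UserFlowPartition can look like; the two-way rule below is an assumption for illustration, not the original class's logic.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical sketch: the real UserFlowPartition logic is not shown in these notes.
public class UserFlowPartition extends Partitioner<Text, FlowBean02> {
    @Override
    public int getPartition(Text key, FlowBean02 value, int numReduceTasks) {
        // Illustrative rule: phone numbers starting with "13" go to one reducer,
        // everything else to the other.
        int partition = key.toString().startsWith("13") ? 0 : 1;
        return partition % numReduceTasks; // stay within the configured reducer count
    }
}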
Sorting: determines how the Map output data is ordered by Key.
job.setSortComparatorClass(null); // null: fall back to the key's own compareTo
/**
 * Ordering for this type during Shuffle: descending by upstream traffic.
 * @param o the other FlowBean02 to compare against
 * @return negative, zero, or positive per the Comparable contract
 */
@Override
public int compareTo(FlowBean02 o) {
    // Sort by total upstream traffic only, in descending order
    return Long.compare(o.getUpFlow(), this.getUpFlow());
}
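For context, a minimal sketch of the bean this compareTo could live in; only upFlow is shown, and a real flow bean would carry more fields (an assumption, since the notes show only the fragment above).

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Minimal WritableComparable sketch (fields beyond upFlow omitted).
public class FlowBean02 implements WritableComparable<FlowBean02> {
    private long upFlow; // upstream traffic

    public long getUpFlow() { return upFlow; }
    public void setUpFlow(long upFlow) { this.upFlow = upFlow; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow); // serialize for the shuffle
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        upFlow = in.readLong(); // deserialize in the same field order
    }

    @Override
    public int compareTo(FlowBean02 o) {
        return Long.compare(o.getUpFlow(), this.getUpFlow()); // descending by upFlow
    }
}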
Grouping: groups the Keys; Values belonging to the same group are placed into the same iterator.
The essence of grouping is also comparison: keys that compare equal form one group, keys that do not belong to different groups.
job.setGroupingComparatorClass(null); // null: group using the sort comparator
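This job passes null, i.e. the default; for reference, a custom grouping comparator generally looks like the sketch below. The class name and the group-by-upFlow rule are illustrative assumptions.

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical sketch: records with equal upFlow land in the same reduce-side iterator.
public class FlowGroupingComparator extends WritableComparator {
    protected FlowGroupingComparator() {
        super(FlowBean02.class, true); // true: deserialize keys before compare()
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        FlowBean02 x = (FlowBean02) a;
        FlowBean02 y = (FlowBean02) b;
        return Long.compare(x.getUpFlow(), y.getUpFlow()); // 0 means "same group"
    }
}

// Wired in via: job.setGroupingComparatorClass(FlowGroupingComparator.class);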
Both sorting and grouping are implemented by invoking a comparator.
Data: wc.txt
hadoop hive spark hbase
spark hive spark spark
hbase hbase hbase hbase hbase
hive hbase hbase hbase hive
spark hbase spark hadoop
hadoop
hadoop hadoop hadoop hadoop hadoop
spark hadoop hive spark hadoop
TextInputFormat extends FileInputFormat extends InputFormat
Splitting: getSplits (split sizing is sketched after the listing below)
Split 1:
hadoop hive spark hbase
spark hive spark spark
hbase hbase hbase hbase hbase
hive hbase hbase hbase hive
Split 2:
spark hbase spark hadoop
hadoop
hadoop hadoop hadoop hadoop hadoop
spark hadoop hive spark hadoop
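How getSplits cuts the file is driven by the HDFS block size clamped between a configurable minimum and maximum; conceptually (mirroring FileInputFormat's internal computation):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizing {
    // Conceptual version of FileInputFormat's internal split-size computation:
    // a split is one block, clamped by the configured min/max.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Both bounds are tunable per job:
    static void tune(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 1L);                 // lower bound
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // e.g. cap at 128 MB
    }
}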
Each record of each split is converted into a KeyValue pair by nextKeyValue: the key is the line's starting byte offset (LongWritable), the value is the line content (Text). Offsets below are illustrative.
Split 1:
key    value
0 hadoop hive spark hbase
10 spark hive spark spark
20 hbase hbase hbase hbase hbase
30 hive hbase hbase hbase hive
Split 2:
key    value
0 spark hbase spark hadoop
10 hadoop
20 hadoop hadoop hadoop hadoop hadoop
30 spark hadoop hive spark hadoop
The KeyValue pairs of every split are passed on to the Map stage.
Define your own Map class that extends the Mapper class.
The Map stage launches one MapTask process per split handed over by Input.
Each MapTask calls the map method on every KeyValue pair it is responsible for.
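A minimal word-count Mapper consistent with the (word, 1) pairs listed below; the class name is assumed.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text word = new Text();
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line itself
        for (String w : value.toString().split("\\s+")) {
            if (w.isEmpty()) continue;
            word.set(w);
            context.write(word, ONE); // emit (word, 1)
        }
    }
}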
MapTask1 output:
hadoop 1
hive 1
spark 1
hbase 1
spark 1
hive 1
spark 1
spark 1
hbase 1
hbase 1
hbase 1
hbase 1
hbase 1
hive 1
hbase 1
hbase 1
hbase 1
hive 1
MapTask2 output:
spark 1
hbase 1
spark 1
hadoop 1
hadoop 1
hadoop 1
hadoop 1
hadoop 1
hadoop 1
hadoop 1
spark 1
hadoop 1
hive 1
spark 1
hadoop 1
Each MapTask's output is processed the same way; the steps below follow MapTask1's 18 pairs:
hadoop 1
hive 1
spark 1
hbase 1
spark 1
hive 1
spark 1
spark 1
hbase 1
hbase 1
hbase 1
hbase 1
hbase 1
hive 1
hbase 1
hbase 1
hbase 1
hive 1
Spill: the in-memory data is written to disk (the map-side buffer spills when it fills; its size is mapreduce.task.io.sort.mb, its threshold mapreduce.map.sort.spill.percent).
Each spill is partitioned and sorted. MapTask1's pairs with their partition assignments:
hadoop 1 reduce1
hive 1 reduce2
spark 1 reduce2
hbase 1 reduce1
spark 1 reduce2
hive 1 reduce2
spark 1 reduce2
spark 1 reduce2
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hive 1 reduce2
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hive 1 reduce2
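The reduce1/reduce2 labels above come from the partitioner. For reference, Hadoop's default HashPartitioner assigns partitions as follows (the walkthrough may use a custom rule instead):

import org.apache.hadoop.mapreduce.Partitioner;

// Hadoop's default partitioning rule, shown for reference:
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit, then bucket by reducer count
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}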
This in-memory sort is a quicksort (Hadoop quicksorts the buffer contents before each spill; merge sort comes later, when spill files are combined). Each spill is sorted on its own: the first 11 records above become the first spill, the remaining 7 the second:
hadoop 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hive 1 reduce2
hive 1 reduce2
spark 1 reduce2
spark 1 reduce2
spark 1 reduce2
spark 1 reduce2
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hive 1 reduce2
hive 1 reduce2
MapTask1's spill files on disk:
file1
hadoop 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hive 1 reduce2
hive 1 reduce2
spark 1 reduce2
spark 1 reduce2
spark 1 reduce2
spark 1 reduce2
file2
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hive 1 reduce2
hive 1 reduce2
MapTask2's spill files (produced by the same flow):
file1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hbase 1 reduce1
spark 1 reduce2
spark 1 reduce2
file2
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hive 1 reduce2
spark 1 reduce2
spark 1 reduce2
All the small spill files produced by a MapTask are merged into a single file per MapTask (a multi-way merge; the fan-in is governed by mapreduce.task.io.sort.factor).
MapTask1's merged file:
hadoop 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hive 1 reduce2
hive 1 reduce2
hive 1 reduce2
hive 1 reduce2
spark 1 reduce2
spark 1 reduce2
spark 1 reduce2
spark 1 reduce2
MapTask2's merged file:
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hbase 1 reduce1
hive 1 reduce2
spark 1 reduce2
spark 1 reduce2
spark 1 reduce2
spark 1 reduce2
Once every MapTask has produced its single merged file, the Map-side shuffle is finished and the ReduceTasks are notified to come fetch data.
Fetch: each ReduceTask pulls, from every MapTask's output file, only the partition it is responsible for (two partitions here, i.e. job.setNumReduceTasks(2)).
ReduceTask1 fetches from MapTask1:
hadoop 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
ReduceTask1 fetches from MapTask2:
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hbase 1 reduce1
ReduceTask2 fetches from MapTask1:
hive 1 reduce2
hive 1 reduce2
hive 1 reduce2
hive 1 reduce2
spark 1 reduce2
spark 1 reduce2
spark 1 reduce2
spark 1 reduce2
ReduceTask2 fetches from MapTask2:
hive 1 reduce2
spark 1 reduce2
spark 1 reduce2
spark 1 reduce2
spark 1 reduce2
The fetched runs are merged, and the merge itself keeps the data sorted: a merge sort of already-sorted inputs.
ReduceTask1 after merging:
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hadoop 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
hbase 1 reduce1
ReduceTask2 after merging:
hive 1 reduce2
hive 1 reduce2
hive 1 reduce2
hive 1 reduce2
hive 1 reduce2
spark 1 reduce2
spark 1 reduce2
spark 1 reduce2
spark 1 reduce2
spark 1 reduce2
spark 1 reduce2
spark 1 reduce2
spark 1 reduce2
Grouping then collects each key's values into one iterator:
hadoop 1,1,1,1,1,1,1,1,1,1
hbase 1,1,1,1,1,1,1,1,1,1
hive 1,1,1,1,1
spark 1,1,1,1,1,1,1,1
The point of sorting first: it makes grouping efficient, since identical keys are already adjacent.
Each ReduceTask calls the reduce method once for each group (a key plus its value iterator) of its own data.
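A minimal Reducer matching the grouped iterators above; it sums the ones per key (class name assumed).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable one : values) {
            sum += one.get(); // e.g. hadoop -> 1,1,1,1,1,1,1,1,1,1
        }
        total.set(sum);
        context.write(key, total); // e.g. (hadoop, 10)
    }
}

Applied to the groups above, this yields the final counts: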
hadoop 10
hbase 10
hive 5
spark 8
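To round the walkthrough off, a minimal driver wiring these pieces together; the paths, class names, and the two-reducer setting are assumptions chosen to match the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCountDriver.class);

        job.setInputFormatClass(TextInputFormat.class); // splits + (offset, line) pairs
        FileInputFormat.setInputPaths(job, new Path("wc.txt"));

        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setNumReduceTasks(2); // reduce1 / reduce2, as in the walkthrough

        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("wc-output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}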