Secondary Sort with MapReduce

1. Requirement

Given a batch of order records, use MapReduce's own sorting mechanism to output the records of each order with the largest transaction amount listed first.
Source data:
orderId	productId	amount
Order_0000001	Pdt_01	222.8
Order_0000001	Pdt_05	25.8
Order_0000002	Pdt_03	522.8
Order_0000002	Pdt_04	122.4
Order_0000002	Pdt_05	722.4
Order_0000003	Pdt_01	222.8

2. Approach

1) Use the order id together with the transaction amount as a composite key (packed into a bean). All order records read in the map phase can then be partitioned by order id (via a custom Partitioner) and sorted by the compareTo method of WritableComparable (order id first, then amount in descending order) before they reach the reducers.
2) On the reduce side, use a GroupingComparator to pull records with the same order id into one group, then write them out directly (a short trace of the sample data follows this list).
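
For the sample data above, the sort and group phases (with three reduce tasks, as configured below) arrange the composite keys roughly as follows; which reduce task an order lands on depends on Text.hashCode, but these assignments match the output shown in section 4:

reduce task 0:  (Order_0000003, 222.8)                                                -> 1 group
reduce task 1:  (Order_0000001, 222.8) (Order_0000001, 25.8)                          -> 1 group
reduce task 2:  (Order_0000002, 722.4) (Order_0000002, 522.8) (Order_0000002, 122.4)  -> 1 group

Within a group the keys are already sorted by amount in descending order, so the first key of a group is that order's largest transaction.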

3. Code

1) OrderBean class, implementing the WritableComparable interface
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class OrderBean implements WritableComparable<OrderBean> {
	private Text orderId;
	private DoubleWritable price;
	
	public OrderBean(){
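		// no-arg constructor, required so the framework can instantiate the bean via reflection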
		
	}
	public OrderBean(Text orderId, DoubleWritable price) {
		set(orderId, price);
	}

	public void set(Text orderId, DoubleWritable price) {
		this.orderId = orderId;
		this.price = price;
	}
	
	public Text getOrderId() {
		return orderId;
	}

	public void setOrderId(Text orderId) {
		this.orderId = orderId;
	}

	public DoubleWritable getPrice() {
		return price;
	}

	public void setPrice(DoubleWritable price) {
		this.price = price;
	}
	
	@Override
	public void write(DataOutput out) throws IOException {
		out.writeUTF(orderId.toString());
		out.writeDouble(price.get());
	}
	@Override
	public void readFields(DataInput in) throws IOException {
		String readUTF = in.readUTF();
		double readDouble = in.readDouble();
		
		this.orderId = new Text(readUTF);
		this.price = new DoubleWritable(readDouble);
	}
	
	@Override
	public int compareTo(OrderBean o) {
		int cmp = this.orderId.compareTo(o.getOrderId());
		if(cmp == 0){ // same orderId: fall back to the amount
			cmp = -this.price.compareTo(o.getPrice()); // negate for descending order, largest amount first
		}
		return cmp;
	}
	
	@Override
	public String toString() {
		return this.orderId.toString() + "\t" + this.price.get();
	}
}
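
To sanity-check the comparator, a throwaway class like the following can be run locally (hypothetical test code, not part of the job; OrderBeanSortCheck is just an illustrative name):

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;

public class OrderBeanSortCheck {
	public static void main(String[] args) {
		OrderBean cheap = new OrderBean(new Text("Order_0000001"), new DoubleWritable(25.8));
		OrderBean dear  = new OrderBean(new Text("Order_0000001"), new DoubleWritable(222.8));
		OrderBean other = new OrderBean(new Text("Order_0000002"), new DoubleWritable(722.4));

		System.out.println(dear.compareTo(cheap)); // negative: within the same order, the larger amount sorts first
		System.out.println(dear.compareTo(other)); // negative: different orders are sorted by orderId
	}
}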


2) Mapper class
	// Parse the orderId and the transaction amount from each line, pack them into the bean,
	// and emit the bean as the map output key
	static class SecondarySortMapper extends Mapper<LongWritable, Text, OrderBean, NullWritable> {
		OrderBean ob = new OrderBean();

		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			String line = value.toString();
			String[] fields = line.split("\t");
			String orderId = fields[0];
			double price = Double.parseDouble(fields[2]);
			ob.set(new Text(orderId), new DoubleWritable(price));
			context.write(ob, NullWritable.get());
		}
	}
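
For example, the input line Order_0000001	Pdt_01	222.8 produces the composite key (Order_0000001, 222.8) with a NullWritable value; the value carries nothing because the key already holds both fields the reducer needs.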
	

3) Partitioner class
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// Send beans with different orderIds to different reduce tasks
public class SecondarySortPartitioner extends Partitioner<OrderBean, NullWritable> {

	@Override
	public int getPartition(OrderBean key, NullWritable value, int numPartitions) {
		// Beans with the same orderId always map to the same partition;
		// the number of partitions equals the number of reduce tasks set on the job
		return (key.getOrderId().hashCode() & Integer.MAX_VALUE) % numPartitions;
	}

}
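
To see which of the three partitions each sample orderId would land in, a small standalone sketch such as the following can be run (PartitionCheck is a hypothetical helper; the actual assignment depends on Text.hashCode):

import org.apache.hadoop.io.Text;

public class PartitionCheck {
	public static void main(String[] args) {
		String[] ids = {"Order_0000001", "Order_0000002", "Order_0000003"};
		int numPartitions = 3; // matches job.setNumReduceTasks(3) below
		for (String id : ids) {
			int p = (new Text(id).hashCode() & Integer.MAX_VALUE) % numPartitions;
			System.out.println(id + " -> partition " + p);
		}
	}
}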

4) GroupingComparator class
/* The GroupingComparator decides how records are grouped for each reduce() call.
 *
 * How reduce works:
 * a reduce task reads the map output keys together with the values merged during the shuffle phase.
 * After handling the current record it checks whether the next record's key belongs to the same group
 * as the current key; if it does, the same reduce() call keeps iterating, otherwise that call ends.
 *
 * Without a GroupingComparator, every distinct composite key would form its own group, so the records
 * of one order would be handled in separate reduce() calls and state would have to be carried between
 * them. Setting a GroupingComparator that compares only the orderId removes that complexity.
 */
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class SecondarySortGC extends WritableComparator {

	// Register the key class and let the framework create key instances via reflection
	protected SecondarySortGC() {
		super(OrderBean.class, true);
	}

	// The framework deserializes the keys and compares them through
	// compare(WritableComparable, WritableComparable), so that is the overload to override
	@Override
	public int compare(WritableComparable a, WritableComparable b) {
		OrderBean abean = (OrderBean) a;
		OrderBean bbean = (OrderBean) b;

		// Compare only the orderId, so all records of one order fall into the same group
		return abean.getOrderId().compareTo(bbean.getOrderId());
	}

}
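
The difference between the sort comparison and the grouping comparison can be illustrated with a small sketch (hypothetical test code; it assumes it sits in the same package as SecondarySortGC, since the constructor is protected):

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;

public class GroupingCheck {
	public static void main(String[] args) {
		OrderBean a = new OrderBean(new Text("Order_0000002"), new DoubleWritable(722.4));
		OrderBean b = new OrderBean(new Text("Order_0000002"), new DoubleWritable(122.4));

		System.out.println(a.compareTo(b));                      // negative: the sort order still distinguishes the two records
		System.out.println(new SecondarySortGC().compare(a, b)); // 0: the grouping comparator sees them as one group
	}
}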

5) Reducer class

	// With the grouping comparator in place, reduce() is called once per orderId, and the keys inside a
	// group arrive sorted by amount in descending order. Hadoop reuses the key object while iterating
	// over the values, so writing the key on every iteration emits each record of the order, largest first.
	static class SecondarySortReducer extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {
		@Override
		protected void reduce(OrderBean key, Iterable<NullWritable> values, Context context)
				throws IOException, InterruptedException {
			for (NullWritable value : values) {
				context.write(key, NullWritable.get());
			}
		}
	}

6) main method

	public static void main(String[] args)throws Exception {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		
		job.setJarByClass(SecondarySort.class);
		
		job.setMapperClass(SecondarySortMapper.class);
		job.setReducerClass(SecondarySortReducer.class);
		job.setPartitionerClass(SecondarySortPartitioner.class);
		job.setGroupingComparatorClass(SecondarySortGC.class);
		
		job.setOutputKeyClass(OrderBean.class);
		job.setOutputValueClass(NullWritable.class);
		
		FileInputFormat.setInputPaths(job, new Path("H:/大数据/mapreduce/secondarysort/input"));
		FileOutputFormat.setOutputPath(job, new Path("H:/大数据/mapreduce/secondarysort/output"));
		
		job.setNumReduceTasks(3);
		job.waitForCompletion(true);
	}
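
For reference, here is a sketch of how the snippets above would be assembled into a single driver file (the class name SecondarySort comes from setJarByClass; the imports listed are the ones the code shown above needs):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SecondarySort {

	static class SecondarySortMapper extends Mapper<LongWritable, Text, OrderBean, NullWritable> {
		// map() from 2) above goes here
	}

	static class SecondarySortReducer extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {
		// reduce() from 5) above goes here
	}

	public static void main(String[] args) throws Exception {
		// job setup from 6) above goes here
	}
}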

4. Output

1)part-r-00000
Order_0000003	222.8

2)part-r-00001
Order_0000001	222.8
Order_0000001	25.8

3)part-r-00002
Order_0000002	722.4
Order_0000002	522.8
Order_0000002	122.4


