MapReduce Custom Object Serialization

The data looks like this:
(screenshot of the sample traffic log from the original post; omitted here)
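Judging from the Mapper code below, which splits each line on "|", a record plausibly has the layout phone|upFlow|downFlow. The layout itself is an assumption; the values shown here are only the ones that appear in the worked example further down:

13897230503|400|1300
13897230503|100|300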
First, on the local file system: I'm working on CentOS 6.7 installed with the graphical interface. Open a terminal and, ideally, switch to the root user to avoid permission problems; you can refer to my earlier blog post on Linux basics for an introduction.
The goal is to sum each user's upstream and downstream traffic and compute the total.
For example, the number 13897230503 above has two records; adding them together gives:
13897230503,500,1600,2100
Implementation approach
map
Receives one line of the log; the key is the line's byte offset and the value is the line itself.
On output, the key should be the phone number, and the value should be a single object that bundles the upstream traffic, downstream traffic, and total traffic.
The phone number is a Text string, but the bundled value cannot be represented by a basic Hadoop type; we need to define a custom bean object and make it serializable.
key: 13897230503
value: < upFlow:100, dFlow:300, sumFlow:400 >
reduce
Receives a phone-number key together with the collection of bean objects emitted for that number.
For example:
key:
13897230503
value:
< upFlow:400, dFlow:1300, sumFlow:1700 >,
< upFlow:100, dFlow:300, sumFlow:400 >
Iterate over the bean collection, accumulate each field, and build a new bean, for example:
< upFlow:400+100, dFlow:1300+300, sumFlow:1700+400 >
Finally output:
key: 13897230503
value: < upFlow:500, dFlow:1600, sumFlow:2100 >
Create the project; the IDE I use here is Eclipse.
The first class is the JavaBean that wraps the traffic fields, named FlowBean.

package com;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class FlowBean implements Writable{
	
	int upFlow;    // upstream traffic
	int downFlow;  // downstream traffic
	double toFlow; // total traffic (upFlow + downFlow)
	
	public FlowBean(int upFlow,int downFlow,double toFlow){
		this.upFlow=upFlow;
		this.downFlow=downFlow;
		this.toFlow=toFlow;
	}


	public int getUpFlow() {
		return upFlow;
	}


	public void setUpFlow(int upFlow) {
		this.upFlow = upFlow;
	}


	public int getDownFlow() {
		return downFlow;
	}


	public void setDownFlow(int downFlow) {
		this.downFlow = downFlow;
	}


	public double getToFlow() {
		return toFlow;
	}


	public void setToFlow(double toFlow) {
		this.toFlow = toFlow;
	}
	
	// Deserialization: read the fields back in exactly the order they were written
	@Override
	public void readFields(DataInput in) throws IOException {
		upFlow = in.readInt();
		downFlow = in.readInt();
		toFlow = in.readDouble();
	}

	// Serialization: write the fields out in a fixed order
	@Override
	public void write(DataOutput out) throws IOException {
		out.writeInt(upFlow);
		out.writeInt(downFlow);
		out.writeDouble(toFlow);
	}


	@Override
	public String toString() {
		return upFlow+":"+downFlow+":"+toFlow;
	}


	// Hadoop needs a no-arg constructor: it creates the bean via reflection
	// and then fills it in through readFields() during deserialization
	public FlowBean() {
		super();
	}
}
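To see the Writable contract in action without running a job, a minimal round-trip sketch like the one below can be handy. FlowBeanRoundTrip is just a throwaway class name for this illustration; everything else is plain java.io plus the FlowBean above.

package com;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FlowBeanRoundTrip {
	public static void main(String[] args) throws IOException {
		FlowBean original = new FlowBean(100, 300, 400);

		// serialize: write() pushes the three fields onto a byte stream
		ByteArrayOutputStream bytes = new ByteArrayOutputStream();
		original.write(new DataOutputStream(bytes));

		// deserialize: readFields() reads them back in the same order
		FlowBean copy = new FlowBean();
		copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

		System.out.println(copy); // prints 100:300:400.0
	}
}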

The second class controls which partition each map output record goes to (it runs on the map side); the class name is MyPartitioner.

package com;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<Text, FlowBean> {

	@Override
	public int getPartition(Text key, FlowBean value, int partitionNum) {
		// take the first three digits of the phone number for comparison
		String phoneAre=key.toString().substring(0, 3);
		if(phoneAre.equals("137")){
			// first case: route to partition 0
			return 0;
		}
		if(phoneAre.equals("133")){
			return 1;
		}
		if(phoneAre.equals("138")){
			return 2;
		}
		if(phoneAre.equals("135")){
			return 3;
		}
		// any other prefix goes to the last partition
		return 4;
	}
	}

}
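For a quick sanity check outside of a running job, you can call the partitioner directly; this is only an illustrative sketch, and the two phone numbers below are made up:

		MyPartitioner p = new MyPartitioner();
		// a "137" prefix goes to partition 0, an unmatched prefix goes to partition 4
		System.out.println(p.getPartition(new Text("13712345678"), new FlowBean(1, 2, 3), 5)); // prints 0
		System.out.println(p.getPartition(new Text("15012345678"), new FlowBean(1, 2, 3), 5)); // prints 4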

The third class implements the actual business logic; it contains the program entry point as well as the Mapper and Reducer implementations.
The class name is FlowWritable.

package com;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowWritable {
	
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		
		// delete the output path if it already exists, otherwise the job would fail
		FileSystem fs = FileSystem.get(conf);
		Path inputpath = new Path(args[0]);
		Path outputpath = new Path(args[1]);
		if(fs.exists(outputpath)){
			fs.delete(outputpath, true);
		}
		job.setJarByClass(FlowWritable.class);
		job.setJobName("Flow");
		
		job.setMapperClass(Map.class);
		job.setReducerClass(Red.class);
		
		FileInputFormat.setInputPaths(job, inputpath);
		FileOutputFormat.setOutputPath(job, outputpath);
		
		// the Mapper emits (Text, FlowBean); the final Reducer output is (Text, Text)
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(FlowBean.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		
		// the custom partitioner returns values 0 through 4, so five reduce tasks are needed
		job.setPartitionerClass(MyPartitioner.class);
		job.setNumReduceTasks(5);
		
		job.waitForCompletion(true);
	}
	

	public static class Map extends Mapper<LongWritable, Text, Text, FlowBean>{
		public void map(LongWritable key,Text value,Context context) throws IOException, InterruptedException{
			
			// each input line is expected to look like phone|upFlow|downFlow
			String[] line = value.toString().split("\\|");
			
			int up = Integer.parseInt(line[1]);
			int down = Integer.parseInt(line[2]);
			FlowBean fl = new FlowBean(up, down, up + down);
			
			context.write(new Text(line[0]),fl);
		
		}
	}
	public static class Red extends Reducer<Text, FlowBean, Text, Text>{
		public void reduce(Text key,Iterable<FlowBean> values,Context context) throws IOException, InterruptedException{
			
			
			int upSum=0;
			int downSum=0;
			int toSum=0;
			
			for (FlowBean fl : values) {
				
				upSum+=fl.getUpFlow();
				
				downSum+=fl.getDownFlow();
				
			}
			toSum=upSum+downSum;
			
			context.write(key, new Text(upSum+":"+downSum+":"+toSum));
		}
	}
}

Following the steps above, you can implement basic MapReduce aggregation, classification, and partitioning. Important: the partition numbers returned by the map-side Partitioner must line up with the number of reduce tasks, which you set in the main method via job.setNumReduceTasks(n); here the partitioner returns 0 through 4, so five reduce tasks are configured. Running several reduce tasks in parallel also speeds up the job.
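For reference, the job might be submitted roughly like this; the jar name and the HDFS paths are placeholders, not taken from the original post:

hadoop jar flow.jar com.FlowWritable /input/flow.log /output/flow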
