Case requirement: compute the upstream and downstream traffic totals per phone number, and partition the results.
The test data is shown in the figure below:
Analysis:
Normally the input and output types of a mapper or reducer are built-in types such as LongWritable and Text. If we want to pass a custom bean instead, it must conform to Hadoop's serialization contract. Looking at the LongWritable source code, we can see that it implements the WritableComparable interface:
/** A WritableComparable for longs. */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class LongWritable implements WritableComparable<LongWritable> {...}
Likewise, for a custom bean to be passed around by Hadoop's MapReduce framework, it has to implement the same interface. WritableComparable is in fact just the combination of the Writable and Comparable interfaces, which make a bean serializable and comparable, respectively:
@InterfaceAudience.Public
@InterfaceStability.Stable
public interface WritableComparable<T> extends Writable, Comparable<T> {
}
Unlike the JDK's default serialization, Hadoop does not serialize a bean's inheritance hierarchy or implemented interfaces; only the bean's own fields are written, which saves network bandwidth.
Next, let's implement a bean that follows Hadoop's serialization contract.
FlowBean:
(the setter and getter methods for the fields are omitted)
public class FlowBean implements Writable {

    private String phone;
    private long upStream;
    private long downStream;
    private long sumStream;

    /**
     * A no-arg constructor is required: the framework instantiates the bean
     * via reflection before calling readFields() during deserialization.
     */
    public FlowBean() {}

    public FlowBean(String phone, long upStream, long downStream) {
        super();
        this.phone = phone;
        this.upStream = upStream;
        this.downStream = downStream;
        this.sumStream = upStream + downStream;
    }

    /**
     * Deserialize the object's data from the input stream.
     * Fields must be read in exactly the same order in which they were written.
     */
    public void readFields(DataInput input) throws IOException {
        phone = input.readUTF();
        upStream = input.readLong();
        downStream = input.readLong();
        sumStream = input.readLong();
    }

    /**
     * Serialize the object to the output stream.
     */
    public void write(DataOutput output) throws IOException {
        output.writeUTF(phone);
        output.writeLong(upStream);
        output.writeLong(downStream);
        output.writeLong(sumStream);
    }

    /**
     * Output format used when the reduce result is written to the output file.
     */
    @Override
    public String toString() {
        return "" + upStream + "\t" + downStream + "\t" + sumStream;
    }
}
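To sanity-check that write and readFields mirror each other, a quick round-trip through an in-memory buffer can be used. The sketch below uses only standard java.io classes; the test class name and the sample values are illustrative, not part of the original case:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FlowBeanRoundTrip {
    public static void main(String[] args) throws IOException {
        // Serialize a bean into an in-memory buffer.
        FlowBean original = new FlowBean("13500000000", 100L, 200L);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        original.write(new DataOutputStream(buffer));

        // Deserialize into a fresh instance created through the no-arg constructor.
        FlowBean copy = new FlowBean();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));

        // Should print the same upStream / downStream / sumStream values as the original.
        System.out.println(copy);
    }
}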
FlowMapper:
public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input value is one line of the log file, with tab-separated fields.
        String line = value.toString();
        String[] fields = StringUtils.split(line, "\t");
        // Field 1 is the phone number; fields 7 and 8 are the upstream and downstream traffic values.
        String phone = fields[1];
        long upStream = Long.parseLong(fields[7]);
        long downStream = Long.parseLong(fields[8]);
        context.write(new Text(phone), new FlowBean(phone, upStream, downStream));
    }
}
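Allocating a fresh Text key for every record works, but a common MapReduce idiom is to reuse the output key object across map() calls, since the framework serializes it at write time. A small variant of the mapper illustrating that design choice (my sketch, not from the original):
public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {

    // Reused across calls to avoid allocating a new key object per record.
    private Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = StringUtils.split(value.toString(), "\t");
        String phone = fields[1];
        outKey.set(phone);
        context.write(outKey, new FlowBean(phone,
                Long.parseLong(fields[7]), Long.parseLong(fields[8])));
    }
}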
FlowReducer:
public class FlowReducer extends Reducer<Text, FlowBean, Text, FlowBean> {

    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context)
            throws IOException, InterruptedException {
        long upStreamCounter = 0;
        long downStreamCounter = 0;
        // Sum the upstream and downstream traffic of all records for this phone number.
        for (FlowBean bean : values) {
            upStreamCounter += bean.getUpStream();
            downStreamCounter += bean.getDownStream();
        }
        context.write(key, new FlowBean(key.toString(), upStreamCounter, downStreamCounter));
    }
}
FlowRunner:
The standard way to write the runner is to extend the Configured class and implement the Tool interface.
public class FlowRunner extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // Use the configuration injected by ToolRunner instead of creating a new one.
        Configuration conf = getConf();
        Job job = Job.getInstance(conf);

        job.setJarByClass(FlowRunner.class);

        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new FlowRunner(), args);
        System.exit(res);
    }
}
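Because the per-phone sums are associative and FlowReducer's output types match the map output types, the same class could optionally be registered as a combiner to cut down the data shuffled over the network. This is an optional addition to the job setup, not part of the original case:
// Optional: run the summing logic on the map side as well.
job.setCombinerClass(FlowReducer.class);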
To sort the output, for example by total traffic in descending order, FlowBean can implement the WritableComparable interface directly:
public class FlowBean implements WritableComparable<FlowBean> {...}
and override the compareTo method:
public int compareTo(FlowBean o) {
    // Larger totals sort first (descending order).
    return sumStream > o.sumStream ? -1 : 1;
}
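Note that this comparator never returns 0, so beans with equal totals are still treated as distinct keys instead of being merged into a single reduce call; the trade-off is that the order among ties is arbitrary. A variant that breaks ties deterministically by phone number could look like this (my own sketch, not from the original):
public int compareTo(FlowBean o) {
    if (sumStream != o.sumStream) {
        return sumStream > o.sumStream ? -1 : 1;
    }
    // Break ties by phone number so the ordering is stable and consistent.
    return phone.compareTo(o.phone);
}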
Modify the mapper and reducer code as follows:
public class SortMR {

    public static class SortMapper extends Mapper<LongWritable, Text, FlowBean, NullWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The input here is the output of the first job: phone \t upStream \t downStream \t sumStream
            String line = value.toString();
            String[] fields = StringUtils.split(line, "\t");
            String phone = fields[0];
            long upStream = Long.parseLong(fields[1]);
            long downStream = Long.parseLong(fields[2]);
            // The bean is the map output key so that the framework sorts records by compareTo().
            context.write(new FlowBean(phone, upStream, downStream), NullWritable.get());
        }
    }

    public static class SortReducer extends Reducer<FlowBean, NullWritable, Text, FlowBean> {

        @Override
        protected void reduce(FlowBean key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            String phone = key.getPhone();
            context.write(new Text(phone), key);
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(SortMR.class);

        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);

        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Submitting the job to YARN gives the result shown below:
To partition the results, i.e. to write the traffic statistics of different phone-number ranges to different output files, we need to set the number of concurrent reduce tasks.
First, define a custom Partitioner class as follows:
public class AreaPartitioner<KEY, VALUE> extends Partitioner<KEY, VALUE> {

    private static HashMap<String, Integer> areaMap = new HashMap<String, Integer>();

    static {
        areaMap.put("135", 0);
        areaMap.put("136", 1);
        areaMap.put("137", 2);
        areaMap.put("138", 3);
        areaMap.put("139", 4);
    }

    @Override
    public int getPartition(KEY key, VALUE value, int numPartitions) {
        // Take the phone-number prefix from the key and look it up in the area dictionary,
        // so that different prefixes map to different partition numbers; unknown prefixes go to partition 5.
        Integer areaCoder = areaMap.get(key.toString().substring(0, 3));
        return areaCoder == null ? 5 : areaCoder;
    }
}
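For comparison, when no custom partitioner is configured, Hadoop falls back to HashPartitioner, which spreads keys over the reducers by hash code. Its standard logic is essentially the following (paraphrased from the Hadoop source, shown here only for contrast):
// Default behaviour: non-negative hash of the key, modulo the number of reduce tasks.
public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}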
Then add the following to the job configuration:
// Register the custom partitioning logic.
job.setPartitionerClass(AreaPartitioner.class);

// Set the number of concurrent reduce tasks; it should match the number of partitions.
// If it is larger than the number of partitions, the extra reducers simply produce empty output files and no error occurs;
// if it is smaller than the number of partitions, the job fails with an error;
// if it is set to 1, the behaviour is the same as the default: a single reducer runs and produces a single output file.
job.setNumReduceTasks(6);
The job's output will then look like the following:
If the test data is copied four times, as shown below:
and the job is submitted to YARN, checking the Java processes while the map tasks have started but not yet finished shows the following:
There are five YarnChild processes executing map tasks at the same time. Because each small file occupies its own block, and each block is handled by a separate map-task process, the more files there are, the more map processes are launched, the more resources are consumed, and the lower the efficiency.
In fact, the number of concurrent map tasks is determined by the number of input splits: one map task is launched for each split. A split is a logical concept referring to an offset range within the file data, and the split size should be tuned according to the size of the files being processed. The path data takes from the map output to the reduce input is called the shuffle.
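The split size, and therefore the number of map tasks, can be adjusted through the job configuration. A minimal sketch of the two usual knobs follows; the byte values are illustrative assumptions, not figures from the original case:
// Lower and upper bounds for the split size, in bytes (example values only).
FileInputFormat.setMinInputSplitSize(job, 1L * 1024 * 1024);    // at least 1 MB per split
FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);   // at most 64 MB per split

// For many small files, CombineTextInputFormat can pack several files into one split,
// which reduces the number of map tasks that get launched.
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 4L * 1024 * 1024);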