Hadoop Series (4) MapReduce: Passing a Custom Bean, Sorting the Results, and Analyzing Mapper and Reducer Parallelism

Case requirement: compute the upstream and downstream traffic totals per phone number and partition the results.
The test data is shown in the figure below:

Analysis:
Normally the input and output types of a mapper and reducer are built-in Hadoop types such as LongWritable or Text. If we want to pass a custom bean instead, it has to follow Hadoop's serialization contract. Looking at the LongWritable source, we can see that it implements the WritableComparable interface:

/** A WritableComparable for longs. */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class LongWritable implements WritableComparable<LongWritable> {...}

Likewise, a custom bean must implement the same interface before Hadoop's MapReduce framework can pass it around. WritableComparable is in fact just a combination of the Writable and Comparable interfaces, which make a bean serializable and comparable, respectively:

@InterfaceAudience.Public
@InterfaceStability.Stable
public interface WritableComparable<T> extends Writable, Comparable<T> {
}

Unlike the JDK's default serialization, Hadoop's serialization does not record the bean's class hierarchy or the interfaces it implements; only the bean's own fields are written out, which saves network bandwidth during transfer.

Next, let's implement a bean of our own that follows Hadoop's serialization contract.

FlowBean:
(the setter and getter methods for the fields are omitted)

public class FlowBean implements Writable {
    
    private String phone;
    private long upStream;
    private long downStream;
    private long sumStream;
    
    /**
     * The reflection mechanism needs a no-arg constructor when deserializing
     */
    public FlowBean() {}
    
    public FlowBean(String phone, long upStream, long downStream) {
        super();
        this.phone = phone;
        this.upStream = upStream;
        this.downStream = downStream;
        this.sumStream = upStream + downStream;
    }

    /**
     * Deserialize the object's data from the input stream.
     * Fields must be read in the same order in which they were written during serialization.
     */
    public void readFields(DataInput input) throws IOException {
        phone = input.readUTF();
        upStream = input.readLong();
        downStream = input.readLong();
        sumStream = input.readLong();
    }

    /**
     * Serialize the object into the output stream
     */
    public void write(DataOutput output) throws IOException {
        output.writeUTF(phone);
        output.writeLong(upStream);
        output.writeLong(downStream);
        output.writeLong(sumStream);
    }

    /**
     * Output format of the reduce result
     */
    @Override
    public String toString() {
        return "" + upStream + "\t" + downStream + "\t" + sumStream;
    }

}
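
A quick way to check that the bean serializes and deserializes correctly is a round trip through plain java.io streams. The following is only a minimal sketch: the class name FlowBeanRoundTrip and the sample values are illustrative, while write, readFields and the no-arg constructor are the ones defined above.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FlowBeanRoundTrip {

    public static void main(String[] args) throws IOException {
        FlowBean original = new FlowBean("13512345678", 100L, 200L);

        // Serialize the bean the same way the framework would, via write(DataOutput)
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        original.write(new DataOutputStream(buffer));

        // Deserialize into a fresh instance created through the no-arg constructor
        FlowBean copy = new FlowBean();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));

        // Both lines should print the up, down and total values (100, 200, 300) separated by tabs
        System.out.println(original);
        System.out.println(copy);
    }
}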

FlowMapper:

public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
    
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        
        String line = value.toString();
        String[] fields = StringUtils.split(line, "\t");
        
        // Field layout follows the sample data shown above: the phone number is at index 1,
        // upstream and downstream traffic are at indexes 7 and 8
        String phone = fields[1];
        long upStream = Long.parseLong(fields[7]);
        long downStream = Long.parseLong(fields[8]);
        
        context.write(new Text(phone), new FlowBean(phone, upStream, downStream));
        
    }

}

FlowReducer:

public class FlowReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
    
    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context)
            throws IOException, InterruptedException {
        
        long upStreamCounter = 0;
        long downStreamCounter = 0;
        
        for (FlowBean bean: values) {
            upStreamCounter += bean.getUpStream();
            downStreamCounter += bean.getDownStream();
        }
        
        context.write(key, new FlowBean(key.toString(), upStreamCounter, downStreamCounter));
        
    }

}

FlowRunner:
The standard way to implement a runner is to extend Configured and implement the Tool interface, so that ToolRunner can parse generic command-line options (such as -D properties) into the job's Configuration.

public class FlowRunner extends Configured implements Tool{

    public int run(String[] args) throws Exception {
        
        Configuration conf = getConf();  // use the Configuration populated by ToolRunner rather than creating a new one
        Job job = Job.getInstance(conf);
        
        job.setJarByClass(FlowRunner.class);
        
        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReducer.class);
        
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        
        return job.waitForCompletion(true)?0:1;
    }
    
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new FlowRunner(), args);
        System.exit(res);
    }

}

To sort the output, for example by total traffic from high to low, we can make FlowBean the map output key: the framework sorts map output by key during the shuffle, so the bean itself has to be comparable. FlowBean therefore implements the WritableComparable interface directly:

public class FlowBean implements WritableComparable<FlowBean> {...}

and overrides the compareTo method:

    public int compareTo(FlowBean o) {
        // order by total traffic, descending
        return sumStream > o.sumStream ? -1 : 1;
    }
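
Note that this comparator never returns 0, not even for equal totals. If ties need to be reported as equal (for example when the bean is also used in sorted collections), a variant along the following lines (just a sketch) keeps the descending order while handling ties:

    public int compareTo(FlowBean o) {
        // Long.compare with reversed arguments gives descending order and returns 0 on ties
        return Long.compare(o.sumStream, sumStream);
    }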

The mapper and reducer are modified as follows:

public class SortMR {
    
    public static class SortMapper extends Mapper<LongWritable, Text, FlowBean, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            
            // The input here is the output of the previous statistics job:
            // phone number, upstream, downstream and total traffic, separated by tabs
            String line = value.toString();
            String[] fields = StringUtils.split(line, "\t");
            String phone = fields[0];
            long upStream = Long.parseLong(fields[1]);
            long downStream = Long.parseLong(fields[2]);
            
            context.write(new FlowBean(phone, upStream, downStream), NullWritable.get());
        }
    }
    
    public static class SortReducer extends Reducer<FlowBean, NullWritable, Text, FlowBean> {
        @Override
        protected void reduce(FlowBean key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            
            String phone = key.getPhone();
            context.write(new Text(phone), key);
        }
    }
    
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        
        job.setJarByClass(SortMR.class);
        
        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);
        
        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        
        System.exit(job.waitForCompletion(true)?0:1);
    }
}

Submitting this to YARN produces the following result:

To partition the results, that is, to have the traffic statistics for different phone-number ranges written to different output files, we need a custom partitioner together with a matching number of concurrent reduce tasks.
First, define the partitioner class:

public class AreaPartitioner<KEY, VALUE> extends Partitioner<KEY, VALUE> {

    private static HashMap<String, Integer> areaMap = new HashMap<String, Integer>();
    
    static{
        areaMap.put("135", 0);
        areaMap.put("136", 1);
        areaMap.put("137", 2);
        areaMap.put("138", 3);
        areaMap.put("139", 4);
    }
    
    @Override
    public int getPartition(KEY key, VALUE value, int numPartitions) {
        // Take the first three digits of the phone number from the key and look them up in the
        // area dictionary; different prefixes map to different partition numbers, and prefixes
        // that are not in the map fall into partition 5
        Integer areaCoder = areaMap.get(key.toString().substring(0, 3));
        return areaCoder == null ? 5 : areaCoder;
    }

}
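
For comparison, when no partitioner is configured, MapReduce falls back to HashPartitioner, which spreads keys over the reduce tasks by hash code; records with the same key still meet in the same reducer, but there is no grouping by phone prefix. Its logic is essentially the following:

public class HashPartitioner<K, V> extends Partitioner<K, V> {

    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // mask off the sign bit so the result is non-negative, then spread keys across the reducers
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}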

Then configure the job as follows:

    // use the custom partitioning logic
    job.setPartitionerClass(AreaPartitioner.class);
    
    // Set the number of concurrent reduce tasks; it should match the number of partitions.
    // If it is larger, the extra reducers just produce empty output files and no error occurs;
    // if it is smaller (but greater than 1), the job fails; if it is set to 1, the behavior is
    // the same as the default: a single reducer runs and produces a single output file.
    job.setNumReduceTasks(6);

With this in place, the job output looks like the following:

If the test data file is copied four times, as shown below:

Submit the job to YARN and look at the Java processes while the map tasks have started but not yet finished:

We can see five YarnChild processes executing map tasks at the same time. Each small file occupies its own block and therefore gets its own input split, and each split is processed by a separate map task process, so the more small files there are, the more map task processes are launched, the more resources are consumed, and the lower the overall efficiency.

In fact, the number of concurrent map tasks is determined by the number of input splits: one map task is started per split. A split is a logical concept; it simply describes an offset range of the data within a file, and its size should be tuned to the files being processed. The process that carries data from the map output to the reduce input is called the shuffle.
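
One common way to relieve the small-file problem is to let a single split cover several files, for example with CombineTextInputFormat. The snippet below is a sketch of how the runner could be configured; the 4 MB limit is an arbitrary illustrative value and should be tuned to the actual data:

    // Replace the default TextInputFormat with CombineTextInputFormat, which packs
    // many small files into one split (and therefore one map task)
    job.setInputFormatClass(CombineTextInputFormat.class);
    
    // Upper bound on the combined split size; files are grouped until the limit is reached
    CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024);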
