MapReduce: Sorting by Key, Example 1

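In Hadoop, sorting is at the heart of MapReduce. Both the MapTask and the ReduceTask sort data by key; this is the framework's default behavior and happens whether or not your business logic needs it.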

The example below computes the total net profit (total income minus total expenses) of each account.

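Requirements:

The data in trade_info looks like this (assume there are several input files with similar rows,
so the records for one account are spread across map tasks and have to be brought together by a reducer):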

[email protected]    6000    0   2014-02-20
[email protected]        2000    0   2014-02-20
[email protected]           0  100   2014-02-20
[email protected]    3000    0   2014-02-20
[email protected]      9000    0   2014-02-20
[email protected]         0  200   2014-02-20

The columns are: account, income, expenses, date.

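We need to compute each account's total income, total expenses, and total net profit (total income minus total expenses), and then output the accounts in sorted order.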

Analysis:

This is the result we want:

[email protected]    9000      0   9000
[email protected]      9000    200   8800
[email protected]        2000    100   1900

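We first compute each account's total income, total expenses, and total net profit, then sort the accounts by total income, highest first, breaking ties by lower total expenses.
We wrap the account, income, expenses, and net profit in a bean.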

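  • map phase
    Read each line, wrap its fields in a bean, and emit the account as the key and the bean as the value.
  • reduce phase
    Compute each account's total income, total expenses, and total net profit.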
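
The data flow of this first job can be sketched in plain Java, without Hadoop. This is only a minimal illustration: the class `SumSketch`, its methods, and the account names `alice`, `bob`, and `carol` are hypothetical stand-ins (the real account values are not legible in the sample above), and a `TreeMap` plays the role of the shuffle, which groups values by key and presents the keys in sorted order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free sketch of the first job (totals per account).
class SumSketch {

    // "Map" each tab-separated line to (account, income, expenses), then
    // "reduce" by adding up the amounts that share an account.
    static Map<String, double[]> summarize(List<String> lines) {
        // TreeMap stands in for the shuffle: it groups by key and keeps keys sorted.
        Map<String, double[]> totals = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            double income = Double.parseDouble(fields[1]);
            double expenses = Double.parseDouble(fields[2]);
            double[] sum = totals.computeIfAbsent(fields[0], k -> new double[2]);
            sum[0] += income;
            sum[1] += expenses;
        }
        return totals;
    }

    // Format each total as: account, income, expenses, net profit (tab-separated).
    static List<String> format(Map<String, double[]> totals) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, double[]> e : totals.entrySet()) {
            double income = e.getValue()[0];
            double expenses = e.getValue()[1];
            out.add(e.getKey() + "\t" + income + "\t" + expenses + "\t" + (income - expenses));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "alice\t6000\t0\t2014-02-20",
                "bob\t2000\t0\t2014-02-20",
                "bob\t0\t100\t2014-02-20",
                "alice\t3000\t0\t2014-02-20",
                "carol\t9000\t0\t2014-02-20",
                "carol\t0\t200\t2014-02-20");
        format(summarize(lines)).forEach(System.out::println);
    }
}
```

Note that the result is ordered by account name, because the account is the key. The totals themselves are not sorted yet.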

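At this point it becomes clear that one MapReduce job is not enough: the totals are known only after the reduce phase, but the framework sorts on the keys of the map output.
In that situation we can add a second job. This is the idea of chaining MapReduce jobs: when one job cannot do everything, run several in sequence, each consuming the previous job's output, until the final result is reached.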

So the second MapReduce job looks like this:

  • map phase:
    Read the output files written by the first job's reducer, wrap each line in a bean, and emit the bean as the map output key.

Why design it this way?
After the map phase finishes, the shuffle automatically sorts the map output by key,
so we can let the shuffle do the sorting for us, provided the bean implements compareTo.

The map output no longer needs a value here, so the mapper emits NullWritable as the value.

  • reduce phase:
    Take the account field from the map output key and use it as the reduce output key.
    Use the map output key itself (the bean) as the reduce output value.

In this way, two MapReduce jobs together produce the final result.
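
Why the bean has to be the key, and why its compareTo needs some care, can be illustrated without Hadoop. In the minimal sketch below, `Row`, `SortSketch`, and the account names are hypothetical. A `TreeMap` stands in for the shuffle, since both order keys with `compareTo` and treat keys that compare as equal as one key. The ordering assumed here is total income descending, then total expenses ascending, with the account name as a final tie-breaker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the bean that is used as the map output key.
class Row implements Comparable<Row> {
    final String account;
    final double income;
    final double expenses;

    Row(String account, double income, double expenses) {
        this.account = account;
        this.income = income;
        this.expenses = expenses;
    }

    @Override
    public int compareTo(Row o) {
        int byIncome = Double.compare(o.income, this.income);       // descending
        if (byIncome != 0) {
            return byIncome;
        }
        int byExpenses = Double.compare(this.expenses, o.expenses); // ascending
        if (byExpenses != 0) {
            return byExpenses;
        }
        return this.account.compareTo(o.account);                   // tie-breaker
    }
}

class SortSketch {

    // Insert every row as a key (the value is unused, like NullWritable) and
    // read the accounts back in key order.
    static List<String> sortedAccounts(List<Row> rows) {
        Map<Row, Boolean> shuffle = new TreeMap<>();
        for (Row r : rows) {
            shuffle.put(r, Boolean.TRUE);
        }
        List<String> accounts = new ArrayList<>();
        for (Row r : shuffle.keySet()) {
            accounts.add(r.account);
        }
        return accounts;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("bob", 2000, 100));
        rows.add(new Row("carol", 9000, 200));
        rows.add(new Row("alice", 9000, 0));
        System.out.println(sortedAccounts(rows)); // prints [alice, carol, bob]
    }
}
```

Without the account tie-breaker, two different accounts with identical totals would compare as equal and collapse into a single key in this sketch. In a real job such keys would reach the reducer as one group.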

The code:

InfoBean must implement the WritableComparable interface and its compareTo method, which defines the sort order:

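import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class InfoBean implements WritableComparable<InfoBean> {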

    private String account;
    private double income;
    private double expenses;
    private double surplus;
    
    public void set(String account,double income,double expenses){
        this.account = account;
        this.income = income;
        this.expenses = expenses;
        this.surplus = income - expenses;
    }
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(account);
        out.writeDouble(income);
        out.writeDouble(expenses);
        out.writeDouble(surplus);
        
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.account = in.readUTF();
        this.income = in.readDouble();
        this.expenses = in.readDouble();
        this.surplus = in.readDouble();
    }

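    @Override
    public int compareTo(InfoBean o) {
        // Higher income first; for equal income, lower expenses first.
        int byIncome = Double.compare(o.getIncome(), this.income);
        if (byIncome != 0) {
            return byIncome;
        }
        int byExpenses = Double.compare(this.expenses, o.getExpenses());
        if (byExpenses != 0) {
            return byExpenses;
        }
        // Tie-break on account so that 0 is returned only for identical rows;
        // keys that compare as equal are grouped into one reduce call.
        return this.account.compareTo(o.getAccount());
    }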

    @Override
    public String toString() {
        return  income + "\t" + expenses + "\t" + surplus;
    }
    public String getAccount() {
        return account;
    }

    public void setAccount(String account) {
        this.account = account;
    }

    public double getIncome() {
        return income;
    }

    public void setIncome(double income) {
        this.income = income;
    }

    public double getExpenses() {
        return expenses;
    }

    public void setExpenses(double expenses) {
        this.expenses = expenses;
    }

    public double getSurplus() {
        return surplus;
    }

    public void setSurplus(double surplus) {
        this.surplus = surplus;
    }
}
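
Because write and readFields must handle the fields in exactly the same order, a quick round trip through a byte array is a useful sanity check. The sketch below is a minimal, Hadoop-free illustration: `MiniBean` and `RoundTripSketch` are hypothetical names, and `MiniBean` copies only the serialization logic of InfoBean so that it runs with the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical cut-down copy of InfoBean with the same field order.
class MiniBean {
    String account;
    double income;
    double expenses;
    double surplus;

    void set(String account, double income, double expenses) {
        this.account = account;
        this.income = income;
        this.expenses = expenses;
        this.surplus = income - expenses;
    }

    void write(DataOutput out) throws IOException {
        out.writeUTF(account);
        out.writeDouble(income);
        out.writeDouble(expenses);
        out.writeDouble(surplus);
    }

    void readFields(DataInput in) throws IOException {
        // Must read in the same order that write() used.
        this.account = in.readUTF();
        this.income = in.readDouble();
        this.expenses = in.readDouble();
        this.surplus = in.readDouble();
    }
}

class RoundTripSketch {

    // Serialize the bean to bytes and deserialize it into a fresh instance.
    static MiniBean roundTrip(MiniBean original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                original.write(out);
            }
            MiniBean copy = new MiniBean();
            try (DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        MiniBean bean = new MiniBean();
        bean.set("alice", 9000, 200);
        MiniBean copy = roundTrip(bean);
        System.out.println(copy.account + "\t" + copy.income + "\t" + copy.expenses + "\t" + copy.surplus);
    }
}
```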

SumStep aggregates the data carried by the beans:

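import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SumStep {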

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        
        job.setJarByClass(SumStep.class);
        
        job.setMapperClass(SumMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(InfoBean.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(InfoBean.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        
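        // exit with a non-zero status if the job fails, so a caller can stop before the sort step
        System.exit(job.waitForCompletion(true) ? 0 : 1);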
    }

    public static class SumMapper extends Mapper<LongWritable, Text, Text, InfoBean> {

        private InfoBean bean = new InfoBean();
        private Text k = new Text();
        
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
                
            // split 
            String line = value.toString();
            String[] fields = line.split("\t");
            // get useful field
            String account = fields[0];
            double income = Double.parseDouble(fields[1]);
            double expenses = Double.parseDouble(fields[2]);
            k.set(account);
            bean.set(account, income, expenses);
            context.write(k, bean);
        }
    }
    
    public static class SumReducer extends Reducer<Text, InfoBean, Text, InfoBean> {

        private InfoBean bean = new InfoBean();
        
        @Override
        protected void reduce(Text key, Iterable<InfoBean> values, Context context)
                throws IOException, InterruptedException {
            
            // add up income and expenses over all records of this account
            double inSum = 0;
            double outSum = 0;
            for (InfoBean value : values) {
                inSum += value.getIncome();
                outSum += value.getExpenses();
            }
            // the account is already the output key, so the bean's account stays empty
            bean.set("", inSum, outSum);
            context.write(key, bean);
        }
    }
}

SortStep uses the bean as the mapper's output key and relies on MapReduce's built-in sorting to order the records:

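import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortStep {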

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        
        Job job = Job.getInstance(conf);
        
        job.setJarByClass(SortStep.class);
        
        job.setMapperClass(SortMapper.class);
        job.setMapOutputKeyClass(InfoBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        
        job.setReducerClass(SortReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(InfoBean.class);
        
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        
        job.waitForCompletion(true);
    }
    
    public static class SortMapper extends Mapper<LongWritable, Text, InfoBean, NullWritable> {

        private InfoBean k = new InfoBean();
        @Override
        protected void map(
                LongWritable key,
                Text value,
                Context context)
                throws IOException, InterruptedException {
                
            String line = value.toString();
            String[] fields = line.split("\t");
            k.set(fields[0], Double.parseDouble(fields[1]), Double.parseDouble(fields[2]));
            
            context.write(k, NullWritable.get());
        }
    }
    
    public static class SortReducer extends Reducer<InfoBean, NullWritable, Text, InfoBean> {

        private Text k = new Text();
        
        @Override
        protected void reduce(InfoBean key, Iterable<NullWritable> values,
                Context context)
                throws IOException, InterruptedException {
                
            k.set(key.getAccount());
            
            context.write(k, key);
        }   
    }
}

And with that, we are done.
