In Hadoop, sorting is the soul of MapReduce: both MapTask and ReduceTask sort data by key. This is the framework's default behavior, whether or not your business logic actually needs it.
The following example computes each account's total net profit (total income minus total expenses).
Requirement:
The data in trade_info looks like this (assume there are several files, each containing similar records; with a single file you would hardly need a reducer, since a combiner alone could do the aggregation):
[email protected] 6000 0 2014-02-20
[email protected] 2000 0 2014-02-20
[email protected] 0 100 2014-02-20
[email protected] 3000 0 2014-02-20
[email protected] 9000 0 2014-02-20
[email protected] 0 200 2014-02-20
Columns: account, income, expenses, date.
We need to compute each account's total income, total expenses, and total net profit (total income minus total expenses), and sort the results by net profit.
Analysis:
We want output like this:
[email protected] 9000 0 9000
[email protected] 9000 200 8800
[email protected] 2000 100 1900
We first compute each account's total income, total expenses, and total net profit, and then sort the results (the compareTo shown below actually orders by total income in descending order, which on this data yields the same ordering as net profit).
We wrap the account, income, expenses, and net profit into a bean.
- map phase:
Read each line, wrap its fields into a bean, and emit the bean as the value (with the account as the key).
- reduce phase:
Compute each account's total income, total expenses, and total surplus.
At this step we find that a single MapReduce job cannot do all of it; in that situation we can bring in a second MapReduce job.
So here we introduce the multi-job idea: when one MapReduce job is not enough, chain several jobs, each iterating on the previous one's output, until the final goal is reached; a minimal sketch of the chaining pattern follows.
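Here is a minimal sketch of that chaining pattern, under the assumption of one driver class running both jobs back to back; ChainDriver, the job names, and the three path arguments are placeholders, and the real SumStep and SortStep drivers appear further below:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path in = new Path(args[0]);
        Path mid = new Path(args[1]); // job 1's output doubles as job 2's input
        Path out = new Path(args[2]);

        Job sum = Job.getInstance(conf, "sum");
        // ... setJarByClass, mapper/reducer and key/value classes for the first job ...
        FileInputFormat.setInputPaths(sum, in);
        FileOutputFormat.setOutputPath(sum, mid);
        if (!sum.waitForCompletion(true)) System.exit(1); // stop the chain if job 1 fails

        Job sort = Job.getInstance(conf, "sort");
        // ... the same kind of configuration for the second job ...
        FileInputFormat.setInputPaths(sort, mid);
        FileOutputFormat.setOutputPath(sort, out);
        System.exit(sort.waitForCompletion(true) ? 0 : 1);
    }
}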
So the second MapReduce job:
- map phase:
Read the output files written by the first job's reducer, wrap the fields into a bean, and emit the bean as the map output key.
Why design it this way?
We know that after map finishes, the shuffle automatically sorts the map output keys, so we can piggyback on the shuffle's sorting; the prerequisite is that the bean implements compareTo (a negative return value places an object earlier in the sort order).
And since we no longer need a map output value at this point, we can simply emit NullWritable.
- reduce phase:
Take the account field out of the map output key and use it as the reduce output key.
Use the map output key (the whole bean) as the reduce output value.
This way, we achieve the final result with two chained MapReduce jobs.
On to the code:
InfoBean has to implement the WritableComparable interface and its compareTo method, which specifies the sort order:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class InfoBean implements WritableComparable<InfoBean> {
    private String account;
    private double income;
    private double expenses;
    private double surplus;

    public void set(String account, double income, double expenses) {
        this.account = account;
        this.income = income;
        this.expenses = expenses;
        this.surplus = income - expenses;
    }

    // serialization: fields must be written and read back in the same order
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(account);
        out.writeDouble(income);
        out.writeDouble(expenses);
        out.writeDouble(surplus);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.account = in.readUTF();
        this.income = in.readDouble();
        this.expenses = in.readDouble();
        this.surplus = in.readDouble();
    }

    // sort by income in descending order, breaking ties by expenses ascending;
    // this deliberately never returns 0, so two different accounts with
    // identical figures are not grouped into a single reduce call
    @Override
    public int compareTo(InfoBean o) {
        if (this.income == o.getIncome()) {
            return this.expenses > o.getExpenses() ? 1 : -1;
        }
        return this.income > o.getIncome() ? -1 : 1;
    }

    // TextOutputFormat writes the value via toString(); the account is
    // omitted here because it is already written out as the key
    @Override
    public String toString() {
        return income + "\t" + expenses + "\t" + surplus;
    }

    public String getAccount() {
        return account;
    }
    public void setAccount(String account) {
        this.account = account;
    }
    public double getIncome() {
        return income;
    }
    public void setIncome(double income) {
        this.income = income;
    }
    public double getExpenses() {
        return expenses;
    }
    public void setExpenses(double expenses) {
        this.expenses = expenses;
    }
    public double getSurplus() {
        return surplus;
    }
    public void setSurplus(double surplus) {
        this.surplus = surplus;
    }
}
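Because write and readFields must mirror each other field for field, a quick local round trip can catch a mismatch before the bean ever runs on a cluster. A minimal sketch, using plain java.io streams; the class name is made up and the sample values are taken from the trade_info data above:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

public class InfoBeanRoundTrip {
    public static void main(String[] args) throws Exception {
        InfoBean before = new InfoBean();
        before.set("[email protected]", 6000, 0);

        // serialize exactly as Hadoop would during the shuffle
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        before.write(new DataOutputStream(buf));

        // deserialize into a fresh bean and print it back out
        InfoBean after = new InfoBean();
        after.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(after.getAccount() + "\t" + after); // [email protected]	6000.0	0.0	6000.0
    }
}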
SumStep aggregates the data carried in the beans:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SumStep {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(SumStep.class);

        job.setMapperClass(SumMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(InfoBean.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(InfoBean.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static class SumMapper extends Mapper<LongWritable, Text, Text, InfoBean> {
        private InfoBean bean = new InfoBean();
        private Text k = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // split the line: account \t income \t expenses \t date
            String line = value.toString();
            String[] fields = line.split("\t");
            // keep the useful fields
            String account = fields[0];
            double income = Double.parseDouble(fields[1]);
            double expenses = Double.parseDouble(fields[2]);
            k.set(account);
            bean.set(account, income, expenses);
            context.write(k, bean);
        }
    }

    public static class SumReducer extends Reducer<Text, InfoBean, Text, InfoBean> {
        private InfoBean bean = new InfoBean();

        @Override
        protected void reduce(Text key, Iterable<InfoBean> v2s, Context context)
                throws IOException, InterruptedException {
            double in_sum = 0;
            double out_sum = 0;
            // Hadoop reuses the value instance across iterations,
            // so only read from it inside the loop
            for (InfoBean b : v2s) {
                in_sum += b.getIncome();
                out_sum += b.getExpenses();
            }
            bean.set("", in_sum, out_sum);
            context.write(key, bean);
        }
    }
}
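For what it's worth, on the sample input above this first job should produce one line per account, ordered by the Text key (the account name) rather than by profit, roughly:
[email protected]	9000.0	0.0	9000.0
[email protected]	2000.0	100.0	1900.0
[email protected]	9000.0	200.0	8800.0
TextOutputFormat joins the key and the value with a tab, and the value part is InfoBean's toString().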
SortStep uses the bean as the mapper's output key, letting MapReduce's built-in sorting do the work (note that the result is globally sorted only because the job runs with a single reducer, the default; with several reducers each output file would be sorted independently):
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortStep {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(SortStep.class);

        job.setMapperClass(SortMapper.class);
        job.setMapOutputKeyClass(InfoBean.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setReducerClass(SortReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(InfoBean.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static class SortMapper extends Mapper<LongWritable, Text, InfoBean, NullWritable> {
        private InfoBean k = new InfoBean();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // each line of the first job's output: account \t income \t expenses \t surplus
            String line = value.toString();
            String[] fields = line.split("\t");
            // the bean is the map output key, so the shuffle sorts by its compareTo
            k.set(fields[0], Double.parseDouble(fields[1]), Double.parseDouble(fields[2]));
            context.write(k, NullWritable.get());
        }
    }

    public static class SortReducer extends Reducer<InfoBean, NullWritable, Text, InfoBean> {
        private Text k = new Text();

        @Override
        protected void reduce(InfoBean key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // pull the account back out of the key and emit it as the output key
            k.set(key.getAccount());
            context.write(k, key);
        }
    }
}
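Fed the intermediate file above, this second job should emit the accounts in descending order of income (ties broken by expenses), which matches the target output from the analysis, only with Java's double formatting:
[email protected]	9000.0	0.0	9000.0
[email protected]	9000.0	200.0	8800.0
[email protected]	2000.0	100.0	1900.0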
And with that, we are done.