13. MapReduce Framework Internals (Part 2)

OutputFormat: data output

[Figure: OutputFormat implementation classes]

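OutputFormat is the base type for all MapReduce output: every MapReduce output implementation extends it (it is an abstract class in the org.apache.hadoop.mapreduce API and an interface in the older org.apache.hadoop.mapred API). Several common OutputFormat implementations are described below.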

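OutputFormat implementations

  1. Text output: TextOutputFormat
    The default output format is TextOutputFormat, which writes each record as a line of text. Its keys and values may be of any type, because TextOutputFormat calls toString() to convert them to strings.
  2. SequenceFileOutputFormat
    SequenceFileOutputFormat writes its output as a sequence file. It is a good choice when the output will be the input of a subsequent MapReduce job, because the format is compact and compresses well.
  3. Custom OutputFormat
    Output logic implemented to match the user's requirements.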
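
To make the TextOutputFormat behaviour concrete, here is a minimal plain-Java sketch (no Hadoop dependency) of how one key/value pair becomes one line of text. The names TextLineSketch and formatLine are hypothetical; the tab separator is the TextOutputFormat default, and a Java null stands in for NullWritable in this sketch.

```java
final class TextLineSketch {
    // Converts key and value with toString() and joins them with a tab.
    // A null key or value (standing in for NullWritable) is simply left out.
    static String formatLine(Object key, Object value) {
        if (key == null && value == null) {
            return ""; // nothing to write
        }
        if (key == null) {
            return value.toString();
        }
        if (value == null) {
            return key.toString();
        }
        return key.toString() + "\t" + value.toString();
    }

    public static void main(String[] args) {
        System.out.println(formatLine("hadoop", 3));
        System.out.println(formatLine("http://hadoop.apache.org/", null));
    }
}
```

Because only toString() is involved, any key or value type works, which is the point made in item 1 above.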

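Custom OutputFormat

A custom OutputFormat gives you control over the paths of the final output files.
For example, when one MapReduce program must write two kinds of results to different locations depending on the data, a custom OutputFormat can meet this kind of flexible output requirement.
Steps to write a custom OutputFormat

  1. Define a class that extends FileOutputFormat.
  2. Implement a RecordWriter, overriding write(), the method that actually outputs the data.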

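Custom OutputFormat example

Requirement

Filter the lines of the input file by whether they contain "hadoop":

  1. Sites containing "hadoop" are written to hadoop.txt
  2. Sites not containing "hadoop" are written to other.txt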
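
The routing rule and the write/close lifecycle of a record writer can be tried out without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: RoutingWriterSketch and its route helper are hypothetical names, and in-memory StringWriter objects stand in for the two HDFS output files.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class RoutingWriterSketch implements AutoCloseable {
    private final Writer hadoopOut;
    private final Writer otherOut;

    RoutingWriterSketch(Writer hadoopOut, Writer otherOut) {
        this.hadoopOut = hadoopOut;
        this.otherOut = otherOut;
    }

    // Called once per record: pick the target by content, then append one line.
    void write(String url) throws IOException {
        Writer target = url.contains("hadoop") ? hadoopOut : otherOut;
        target.write(url);
        target.write('\n');
    }

    // Called once at the end: release both targets.
    @Override
    public void close() throws IOException {
        hadoopOut.close();
        otherOut.close();
    }

    // Feeds the URLs through the writer and returns
    // {content of "hadoop.txt", content of "other.txt"}.
    static String[] route(String... urls) {
        StringWriter hadoop = new StringWriter();
        StringWriter other = new StringWriter();
        try (RoutingWriterSketch writer = new RoutingWriterSketch(hadoop, other)) {
            for (String url : urls) {
                writer.write(url);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String[] {hadoop.toString(), other.toString()};
    }

    public static void main(String[] args) {
        String[] out = route(
                "http://hadoop.apache.org/",
                "http://spark.apache.org/",
                "https://hive.apache.org/");
        System.out.print("hadoop.txt:\n" + out[0]);
        System.out.print("other.txt:\n" + out[1]);
    }
}
```

A real RecordWriter has the same shape: the targets are opened once, write() is called for every record, and close() is called when the task finishes.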

Input data

http://hadoop.apache.org/
http://spark.apache.org/
http://hadoop.apache.org/
http://spark.apache.org/
http://hadoop.apache.org/
http://spark.apache.org/
http://hadoop.apache.org/
http://spark.apache.org/
http://hadoop.apache.org/
http://spark.apache.org/
http://spark.apache.org/
http://hadoop.apache.org/
http://spark.apache.org/
https://hbase.apache.org/
https://hive.apache.org/
https://hbase.apache.org/
https://hive.apache.org/
https://hbase.apache.org/
https://hive.apache.org/
https://hbase.apache.org/
https://hive.apache.org/

Implementation

A custom OutputFormat

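import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class LogOutputFormat extends FileOutputFormat<Text, NullWritable> {
    @Override
    public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
        return new LogRecordWriter(job);
    }
}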

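The RecordWriter that actually writes the data

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogRecordWriter extends RecordWriter<Text, NullWritable> {

    private FSDataOutputStream hadoopOutputStream = null;
    private FSDataOutputStream otherOutputStream = null;

    // Opens both target files under the job's output directory.
    // An IOException is propagated so the task fails here rather than on a null stream later.
    // The file names are fixed, so this writer assumes a single reduce task.
    public LogRecordWriter(TaskAttemptContext job) throws IOException {
        Path outputDir = FileOutputFormat.getOutputPath(job);
        FileSystem fileSystem = outputDir.getFileSystem(job.getConfiguration());
        hadoopOutputStream = fileSystem.create(new Path(outputDir, "hadoop.txt"));
        otherOutputStream = fileSystem.create(new Path(outputDir, "other.txt"));
    }

    @Override
    public void write(Text key, NullWritable value) throws IOException, InterruptedException {
        // The key is written as-is, so the caller (here LogReducer) must append the line terminator.
        String line = key.toString();
        byte[] bytes = line.getBytes(StandardCharsets.UTF_8);
        if (line.contains("hadoop")) {
            hadoopOutputStream.write(bytes);
        } else {
            otherOutputStream.write(bytes);
        }
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
        if (hadoopOutputStream != null) {
            hadoopOutputStream.close();
        }
        if (otherOutputStream != null) {
            otherOutputStream.close();
        }
    }
}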

LogMapper

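import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Text k = new Text();
    private final NullWritable v = NullWritable.get();

    // Emits each input line as the key; the value carries no data.
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        k.set(value.toString());
        context.write(k, v);
    }
}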

LogReducer

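import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LogReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    private final Text k = new Text();

    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // Append the line terminator here because LogRecordWriter writes the key bytes as-is.
        k.set(key.toString() + "\r\n");
        // Write once per value so that duplicate input lines are preserved.
        for (NullWritable value : values) {
            context.write(k, NullWritable.get());
        }
    }
}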

LogDriver

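import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());

        job.setJarByClass(LogDriver.class);

        job.setMapperClass(LogMapper.class);
        job.setReducerClass(LogReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // args[0]: input path, args[1]: output directory.
        // The output directory is still required: LogRecordWriter creates its two files inside it.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Register the custom OutputFormat in place of the default TextOutputFormat.
        job.setOutputFormatClass(LogOutputFormat.class);

        // Exit with a non-zero status if the job fails.
        boolean isSuccess = job.waitForCompletion(true);
        System.exit(isSuccess ? 0 : 1);
    }
}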

Join applications

Reduce join

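Map side: tag each key/value pair with its source table (file) so that records from different sources can be told apart, then emit the join field as the key and the remaining fields plus the new tag as the value.
Reduce side: by the time reduce runs, grouping by the join field is already done, so within each group we only need to separate the records that came from different files (using the tag added in the map phase) and then merge them.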
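
The tag, group, and merge steps described above can be sketched in plain Java without any Hadoop classes. This is an illustrative simulation only: ReduceJoinSketch, Tagged, tag, and join are hypothetical names, a TreeMap stands in for the shuffle, and the product names are romanized sample values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ReduceJoinSketch {
    // One shuffled record: the join key, the source tag added on the map side, and the payload.
    static final class Tagged {
        final String key;
        final String source;
        final String payload;

        Tagged(String key, String source, String payload) {
            this.key = key;
            this.source = source;
            this.payload = payload;
        }
    }

    // "Map side": tag every row with its source and use the join field (pid) as the key.
    static List<Tagged> tag(List<String[]> orders, List<String[]> products) {
        List<Tagged> out = new ArrayList<>();
        for (String[] o : orders) {
            out.add(new Tagged(o[1], "order", o[0] + "," + o[2]));
        }
        for (String[] p : products) {
            out.add(new Tagged(p[0], "product", p[1]));
        }
        return out;
    }

    // "Reduce side": group by key, split each group by tag, then merge.
    static List<String> join(List<Tagged> records) {
        Map<String, List<Tagged>> groups = new TreeMap<>();
        for (Tagged r : records) {
            groups.computeIfAbsent(r.key, k -> new ArrayList<>()).add(r);
        }
        List<String> joined = new ArrayList<>();
        for (List<Tagged> group : groups.values()) {
            String pname = null;
            List<String> orders = new ArrayList<>();
            for (Tagged r : group) {
                if (r.source.equals("product")) {
                    pname = r.payload;
                } else {
                    orders.add(r.payload);
                }
            }
            for (String order : orders) {
                joined.add(order + "," + pname);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> orders = Arrays.asList(
                new String[] {"1001", "01", "1"},
                new String[] {"1002", "02", "2"},
                new String[] {"1004", "01", "4"});
        List<String[]> products = Arrays.asList(
                new String[] {"01", "Xiaomi"},
                new String[] {"02", "Huawei"});
        join(tag(orders, products)).forEach(System.out::println);
    }
}
```

Note that this sketch buffers the orders of a group before merging; the Hadoop implementation in this section avoids that by sorting the product record to the front of each group.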

Reduce join example

Requirement

Join the order table with the product table.
The order table (columns: order id, pid, amount):

1001    01  1
1002    02  2
1003    03  3
1004    01  4
1005    02  5
1006    03  6

The product table (columns: pid, pname):

01  小米
02  华为
03  格力

Expected output:

id='1004', amount=4, pname='小米'
id='1001', amount=1, pname='小米'
id='1005', amount=5, pname='华为'
id='1002', amount=2, pname='华为'
id='1006', amount=6, pname='格力'
id='1003', amount=3, pname='格力'

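Analysis

Add a flag field that tells order records from product records.
Sort the keys by pid, then by flag, so that within each pid the product record comes first.
Group by pid.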
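
The effect of "sort by pid and flag, group by pid" can be checked in isolation. Below is a minimal plain-Java sketch under stated assumptions: SecondarySortSketch, Rec, SORT, and sameGroup are hypothetical names, a sorted list stands in for the shuffle output, and the flag values "order" and "product" are compared as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SecondarySortSketch {
    static final class Rec {
        final String id;
        final String pid;
        final int amount;
        final String pname;
        final String flag;

        Rec(String id, String pid, int amount, String pname, String flag) {
            this.id = id;
            this.pid = pid;
            this.amount = amount;
            this.pname = pname;
            this.flag = flag;
        }
    }

    // Sort order: pid ascending, then flag descending, so "product" precedes "order".
    static final Comparator<Rec> SORT =
            Comparator.comparing((Rec r) -> r.pid)
                    .thenComparing((Rec r) -> r.flag, Comparator.<String>reverseOrder());

    // Grouping rule: records with the same pid belong to the same group.
    static boolean sameGroup(Rec a, Rec b) {
        return a.pid.equals(b.pid);
    }

    static List<String> join(List<Rec> input) {
        List<Rec> sorted = new ArrayList<>(input);
        sorted.sort(SORT);
        List<String> out = new ArrayList<>();
        Rec head = null; // first record of the current group, expected to be the product row
        for (Rec r : sorted) {
            if (head == null || !sameGroup(head, r)) {
                head = r; // a new group starts here
            } else {
                out.add("id='" + r.id + "', amount=" + r.amount + ", pname='" + head.pname + "'");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> input = Arrays.asList(
                new Rec("1001", "01", 1, "NULL", "order"),
                new Rec("NULL", "01", -1, "Xiaomi", "product"),
                new Rec("1002", "02", 2, "NULL", "order"),
                new Rec("NULL", "02", -1, "Huawei", "product"));
        join(input).forEach(System.out::println);
    }
}
```

The sketch shares one limitation with any "first record is the product" scheme: if a pid has orders but no product row, the first order is mistaken for the group head.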

Implementation

OrderJoinBean

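import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class OrderJoinBean implements WritableComparable<OrderJoinBean> {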

    private String id;
    private String pid;
    private Integer amount;
    private String pname;
    private String flag;


    public OrderJoinBean() {
    }

    public void set(String id, String pid, Integer amount, String pname, String flag) {
        this.id = id;
        this.pid = pid;
        this.amount = amount;
        this.pname = pname;
        this.flag = flag;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getPid() {
        return pid;
    }

    public void setPid(String pid) {
        this.pid = pid;
    }

    public Integer getAmount() {
        return amount;
    }

    public void setAmount(Integer amount) {
        this.amount = amount;
    }

    public String getPname() {
        return pname;
    }

    public void setPname(String pname) {
        this.pname = pname;
    }

    public String getFlag() {
        return flag;
    }

    public void setFlag(String flag) {
        this.flag = flag;
    }

    @Override
    public int compareTo(OrderJoinBean o) {
        // Primary order: pid ascending.
        int byPid = this.pid.compareTo(o.getPid());
        if (byPid != 0) {
            return byPid;
        }
        // Same pid: reverse flag order so that "product" sorts before "order".
        return o.getFlag().compareTo(this.flag);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.id);
        out.writeUTF(this.pid);
        out.writeInt(this.amount);
        out.writeUTF(this.pname);
        out.writeUTF(this.flag);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.id = in.readUTF();
        this.pid = in.readUTF();
        this.amount = in.readInt();
        this.pname = in.readUTF();
        this.flag = in.readUTF();
    }

    @Override
    public String toString() {
        return "id='" + id + '\'' +
                ", amount=" + amount +
                ", pname='" + pname + '\'';
    }
}

OrderJoinGroupingCompartor

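import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups reduce input by pid only, so a product and its orders reach the same reduce() call.
public class OrderJoinGroupingCompartor extends WritableComparator {
    protected OrderJoinGroupingCompartor() {
        // true: let the parent class create OrderJoinBean instances for comparison.
        super(OrderJoinBean.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        OrderJoinBean oa = (OrderJoinBean) a;
        OrderJoinBean ob = (OrderJoinBean) b;
        return oa.getPid().compareTo(ob.getPid());
    }
}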

OrderJoinMapper

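import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class OrderJoinMapper extends Mapper<LongWritable, Text, OrderJoinBean, NullWritable> {

    private FileSplit fileSplit;
    private final OrderJoinBean k = new OrderJoinBean();
    private final NullWritable v = NullWritable.get();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Remember which file this task reads, so map() can tell orders from products.
        fileSplit = (FileSplit) context.getInputSplit();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] strings = value.toString().split("\t");
        if (fileSplit.getPath().getName().endsWith("order.txt")) {
            // Order line: id, pid, amount. pname is unknown here, so a placeholder is used.
            k.set(strings[0], strings[1], Integer.parseInt(strings[2]), "NULL", "order");
        } else {
            // Product line: pid, pname. id and amount are placeholders.
            k.set("NULL", strings[0], -1, strings[1], "product");
        }
        context.write(k, v);
    }
}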

OrderJoinReducer

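import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class OrderJoinReducer extends Reducer<OrderJoinBean, NullWritable, OrderJoinBean, NullWritable> {

    @Override
    protected void reduce(OrderJoinBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // Hadoop refills the same key object as the values iterator advances,
        // so key reflects the current record of the group on every iteration.
        boolean first = true;
        String pName = "";
        for (NullWritable value : values) {
            if (first) {
                // The sort order puts the product record first: remember its name.
                pName = key.getPname();
                first = false;
            } else {
                // Every following record is an order: fill in the product name and emit it.
                key.setPname(pName);
                context.write(key, NullWritable.get());
            }
        }
    }
}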

OrderJoinDriver

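import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OrderJoinDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());

        job.setJarByClass(OrderJoinDriver.class);

        job.setMapperClass(OrderJoinMapper.class);
        job.setReducerClass(OrderJoinReducer.class);

        job.setMapOutputKeyClass(OrderJoinBean.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setOutputKeyClass(OrderJoinBean.class);
        job.setOutputValueClass(NullWritable.class);

        // Group reduce input by pid only, so each reduce() call sees one product and its orders.
        job.setGroupingComparatorClass(OrderJoinGroupingCompartor.class);
        // The job keeps the default single reduce task. With more reduce tasks, a partitioner
        // keyed on pid would also be needed, because OrderJoinBean does not override hashCode().

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}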

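Summary

With this approach the merge happens in the Reduce phase: the Reduce side carries most of the processing load while the Map nodes do little computation, so resource utilization is poor, and data skew arises very easily in the Reduce phase.
The solution is to merge the data on the Map side.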
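
To see why skew arises, consider how records are spread over reduce tasks. The sketch below is a plain-Java illustration under stated assumptions: it uses the hash-partitioning formula (hashCode & Integer.MAX_VALUE) % numReduceTasks, takes String.hashCode() as a stand-in for the key's hash, and the key distribution (one hot pid) is made up for the example. SkewSketch, partition, and load are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SkewSketch {
    // Hash partitioning: clear the sign bit, then take the remainder.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Counts how many records each reduce task would receive.
    static int[] load(List<String> joinKeys, int numReduceTasks) {
        int[] perReducer = new int[numReduceTasks];
        for (String key : joinKeys) {
            perReducer[partition(key, numReduceTasks)]++;
        }
        return perReducer;
    }

    public static void main(String[] args) {
        // Made-up distribution: pid "01" is a hot key with 90 of 100 records.
        List<String> keys = new ArrayList<>();
        keys.addAll(Collections.nCopies(90, "01"));
        keys.addAll(Collections.nCopies(5, "02"));
        keys.addAll(Collections.nCopies(5, "03"));
        System.out.println(Arrays.toString(load(keys, 3)));
    }
}
```

All records with the same join key land on the same reduce task, so one frequent key is enough to overload a single reducer while the others sit nearly idle.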

Map join

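When to use it

Suitable when one table is very small and the other is very large.

Advantages

Caching the small table(s) on the Map side and handling the business logic there moves work to the Map side, relieves the Reduce side of data pressure, and reduces data skew as far as possible.

How it works

Use the distributed cache (DistributedCache):

  1. Register the cache file in the driver.
  2. In the Mapper's setup() phase, read the file into an in-memory collection.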
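
The two steps above boil down to "load the small table into a map once, then look up every record of the large table". Here is a minimal plain-Java sketch of that idea with no Hadoop dependency; MapJoinSketch is a hypothetical name, its setup and map methods only mirror the Mapper lifecycle, and the tab-separated sample lines use romanized product names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MapJoinSketch {
    // pid -> pname, held in memory for the lifetime of the task.
    private final Map<String, String> pidToName = new HashMap<>();

    // Counterpart of setup(): load the small table once, before any record is processed.
    void setup(List<String> productLines) {
        for (String line : productLines) {
            String[] f = line.split("\t");
            pidToName.put(f[0], f[1]);
        }
    }

    // Counterpart of map(): join one order line against the in-memory table.
    String map(String orderLine) {
        String[] f = orderLine.split("\t");
        return "id='" + f[0] + "', amount=" + f[2] + ", pname='" + pidToName.get(f[1]) + "'";
    }

    public static void main(String[] args) {
        MapJoinSketch task = new MapJoinSketch();
        task.setup(Arrays.asList("01\tXiaomi", "02\tHuawei"));
        for (String order : Arrays.asList("1001\t01\t1", "1002\t02\t2")) {
            System.out.println(task.map(order));
        }
    }
}
```

Because each record is joined as soon as it is read, no shuffle and no reduce phase are needed, which is why the approach only works when the small table fits in memory.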

Map join example

Same requirement as above.

Implementation

OrderJoinBean

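import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class OrderJoinBean implements WritableComparable<OrderJoinBean> {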

    private String id;
    private String pid;
    private Integer amount;
    private String pname;


    public OrderJoinBean() {
    }

    public void set(String id, String pid, Integer amount, String pname) {
        this.id = id;
        this.pid = pid;
        this.amount = amount;
        this.pname = pname;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getPid() {
        return pid;
    }

    public void setPid(String pid) {
        this.pid = pid;
    }

    public Integer getAmount() {
        return amount;
    }

    public void setAmount(Integer amount) {
        this.amount = amount;
    }

    public String getPname() {
        return pname;
    }

    public void setPname(String pname) {
        this.pname = pname;
    }

    @Override
    public int compareTo(OrderJoinBean o) {
        return this.pid.compareTo(o.getPid());
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.id);
        out.writeUTF(this.pid);
        out.writeInt(this.amount);
        out.writeUTF(this.pname);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.id = in.readUTF();
        this.pid = in.readUTF();
        this.amount = in.readInt();
        this.pname = in.readUTF();
    }

    @Override
    public String toString() {
        return "id='" + id + '\'' +
                ", amount=" + amount +
                ", pname='" + pname + '\'';
    }
}

OrderJoinMapper

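import java.io.BufferedReader;
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class OrderJoinMapper extends Mapper<LongWritable, Text, OrderJoinBean, NullWritable> {

    // pid -> pname, loaded once per task from the cached product file.
    private final Map<String, String> cacheValue = new HashMap<>();
    private final OrderJoinBean orderJoinBean = new OrderJoinBean();
    private final NullWritable writeValue = NullWritable.get();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        URI[] cacheFiles = context.getCacheFiles();
        // Works for the local file: URI registered in the driver; the file is assumed to be UTF-8.
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(cacheFiles[0]), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines instead of stopping at the first one
                }
                String[] strings = line.split("\t");
                cacheValue.put(strings[0], strings[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] strings = value.toString().split("\t");
        // Look up pname by pid; an order whose pid is not in the cache gets a null pname.
        orderJoinBean.set(strings[0], strings[1], Integer.parseInt(strings[2]), cacheValue.get(strings[1]));
        context.write(orderJoinBean, writeValue);
    }
}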

OrderJoinDriver

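import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OrderJoinDriver {
    public static void main(String[] args) throws IOException, URISyntaxException, ClassNotFoundException, InterruptedException {

        Job job = Job.getInstance(new Configuration());

        job.setJarByClass(OrderJoinDriver.class);

        job.setMapperClass(OrderJoinMapper.class);

        job.setOutputKeyClass(OrderJoinBean.class);
        job.setOutputValueClass(NullWritable.class);

        // The join is finished on the map side, so no reduce phase is needed.
        job.setNumReduceTasks(0);

        // Register the small product table as a cache file (local test path).
        job.addCacheFile(new URI("file:///E:/testdata/order/join/cache/pd.txt"));

        // Local test paths: the input directory holds only the order data.
        FileInputFormat.setInputPaths(job, new Path("E:\\testdata\\order\\join\\input"));
        FileOutputFormat.setOutputPath(job, new Path("E:\\testdata\\order\\join\\mapjoinoutput"));

        boolean isSuccess = job.waitForCompletion(true);
        System.exit(isSuccess ? 0 : 1);
    }
}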

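Counters

Hadoop maintains a number of built-in counters for every job to describe various metrics. For example, some counters record the number of bytes and records processed, so users can monitor how much input has been consumed and how much output has been produced.
Related API

  1. Counting with an enum
    enum MyCounter{MALFORMED,NORMAL}
    // increment the custom counter defined by the enum by 1
    context.getCounter(MyCounter.MALFORMED).increment(1);
  2. Counting with a counter group and a counter name
    context.getCounter("counterGroup", "countera").increment(1);
    The group name and counter name are arbitrary, but meaningful names are best.
  3. The counter totals are printed to the console after the program has run.
    See the data cleaning section below for concrete usage.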
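
The two addressing styles can be compared side by side in a small plain-Java model. This is only an analogy, not the Hadoop Counter API: CounterSketch, increment, and get are hypothetical names, and the sketch assumes that an enum counter is filed under a group named after the enum's class.

```java
import java.util.Map;
import java.util.TreeMap;

final class CounterSketch {
    enum MyCounter { MALFORMED, NORMAL }

    // group name -> (counter name -> value)
    private final Map<String, Map<String, Long>> groups = new TreeMap<>();

    // Enum style: the enum's class name acts as the group, the constant as the counter name.
    void increment(Enum<?> counter, long delta) {
        increment(counter.getDeclaringClass().getName(), counter.name(), delta);
    }

    // String style: group and counter name are chosen freely by the caller.
    void increment(String group, String name, long delta) {
        groups.computeIfAbsent(group, g -> new TreeMap<>()).merge(name, delta, Long::sum);
    }

    // Returns 0 for a counter that was never incremented.
    long get(String group, String name) {
        Map<String, Long> counters = groups.get(group);
        return counters == null ? 0L : counters.getOrDefault(name, 0L);
    }

    public static void main(String[] args) {
        CounterSketch counters = new CounterSketch();
        counters.increment(MyCounter.MALFORMED, 1);
        counters.increment("counterGroup", "countera", 1);
        System.out.println(counters.groups);
    }
}
```

The enum style gives compile-time checking of counter names, while the string style allows names that are only known at run time.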

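Data cleaning

Before running the core business MapReduce program, the data usually has to be cleaned first to remove records that do not meet the user's requirements. Cleaning usually needs only the mapper; no reduce phase has to run.

Data cleaning: simple version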

LogMapper

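import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Text k = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // Drop lines that fail validation.
        boolean result = parseLog(line, context);
        if (!result) {
            return;
        }
        k.set(line);
        context.write(k, NullWritable.get());
    }

    // Parse the log line: it is valid only if it has more than 11 space-separated fields.
    // Each outcome is tallied in the "map" counter group.
    private boolean parseLog(String line, Context context) {
        String[] fields = line.split(" ");
        if (fields.length > 11) {
            context.getCounter("map", "true").increment(1);
            return true;
        } else {
            context.getCounter("map", "false").increment(1);
            return false;
        }
    }
}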

LogDriver

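import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogDriver {
    public static void main(String[] args) throws Exception {
        // Hard-coded local test paths; delete this line to take the paths from the command line.
        args = new String[] { "e:/input/inputlog", "e:/output1" };

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(LogDriver.class);

        job.setMapperClass(LogMapper.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Map-only job: cleaning needs no reduce phase.
        job.setNumReduceTasks(0);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}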

Data cleaning: advanced version

LogBean

public class LogBean {
    private String remote_addr;// client IP address
    private String remote_user;// client user name; "-" when absent
    private String time_local;// access time and time zone
    private String request;// requested URL and HTTP protocol
    private String status;// response status; 200 means success
    private String body_bytes_sent;// size of the response body sent to the client
    private String http_referer;// the page the request was linked from
    private String http_user_agent;// information about the client browser

    private boolean valid = true;// whether the record is valid

    public String getRemote_addr() {
        return remote_addr;
    }

    public void setRemote_addr(String remote_addr) {
        this.remote_addr = remote_addr;
    }

    public String getRemote_user() {
        return remote_user;
    }

    public void setRemote_user(String remote_user) {
        this.remote_user = remote_user;
    }

    public String getTime_local() {
        return time_local;
    }

    public void setTime_local(String time_local) {
        this.time_local = time_local;
    }

    public String getRequest() {
        return request;
    }

    public void setRequest(String request) {
        this.request = request;
    }

    public String getStatus() {
        return status;
    }

    public void setStatus(String status) {
        this.status = status;
    }

    public String getBody_bytes_sent() {
        return body_bytes_sent;
    }

    public void setBody_bytes_sent(String body_bytes_sent) {
        this.body_bytes_sent = body_bytes_sent;
    }

    public String getHttp_referer() {
        return http_referer;
    }

    public void setHttp_referer(String http_referer) {
        this.http_referer = http_referer;
    }

    public String getHttp_user_agent() {
        return http_user_agent;
    }

    public void setHttp_user_agent(String http_user_agent) {
        this.http_user_agent = http_user_agent;
    }

    public boolean isValid() {
        return valid;
    }

    public void setValid(boolean valid) {
        this.valid = valid;
    }

    @Override
    public String toString() {

        StringBuilder sb = new StringBuilder();
        sb.append(this.valid);
        sb.append("\001").append(this.remote_addr);
        sb.append("\001").append(this.remote_user);
        sb.append("\001").append(this.time_local);
        sb.append("\001").append(this.request);
        sb.append("\001").append(this.status);
        sb.append("\001").append(this.body_bytes_sent);
        sb.append("\001").append(this.http_referer);
        sb.append("\001").append(this.http_user_agent);

        return sb.toString();
    }
}

LogMapper

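import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final Text k = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        LogBean bean = parseLog(line);
        // Drop invalid records.
        if (!bean.isValid()) {
            return;
        }
        k.set(bean.toString());
        context.write(k, NullWritable.get());
    }

    // Parse one log line into a LogBean. The bean is marked invalid if the line
    // has too few fields or its status code is 400 or above.
    private LogBean parseLog(String line) {
        LogBean logBean = new LogBean();
        String[] fields = line.split(" ");
        if (fields.length > 11) {
            logBean.setRemote_addr(fields[0]);
            logBean.setRemote_user(fields[1]);
            // Drop the first character of the timestamp field (presumably its opening bracket).
            logBean.setTime_local(fields[3].substring(1));
            logBean.setRequest(fields[6]);
            logBean.setStatus(fields[8]);
            logBean.setBody_bytes_sent(fields[9]);
            logBean.setHttp_referer(fields[10]);
            if (fields.length > 12) {
                logBean.setHttp_user_agent(fields[11] + " " + fields[12]);
            } else {
                logBean.setHttp_user_agent(fields[11]);
            }
            try {
                if (Integer.parseInt(logBean.getStatus()) >= 400) {
                    logBean.setValid(false);
                }
            } catch (NumberFormatException e) {
                // A non-numeric status means the line is malformed.
                logBean.setValid(false);
            }
        } else {
            logBean.setValid(false);
        }
        return logBean;
    }
}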
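
To see which part of a log line each field index picks up, the parsing rule can be run on a single line outside Hadoop. The sketch below is plain Java under stated assumptions: LogFieldsSketch and isValid are hypothetical names, the sample line is made up in the common "combined" access-log layout, and a non-numeric status is treated as invalid.

```java
final class LogFieldsSketch {
    // A line is kept only if it has more than 11 space-separated fields
    // and its status code (field 8) is a number below 400.
    static boolean isValid(String line) {
        String[] fields = line.split(" ");
        if (fields.length <= 11) {
            return false;
        }
        try {
            return Integer.parseInt(fields[8]) < 400;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical access-log line, used only for illustration.
        String line = "192.0.2.10 - - [18/Sep/2013:06:49:18 +0000] "
                + "\"GET /index.html HTTP/1.1\" 200 512 "
                + "\"http://example.com/\" \"Mozilla/5.0 (compatible)\"";
        String[] f = line.split(" ");
        System.out.println("remote_addr     = " + f[0]);
        System.out.println("time_local      = " + f[3].substring(1));
        System.out.println("request         = " + f[6]);
        System.out.println("status          = " + f[8]);
        System.out.println("body_bytes_sent = " + f[9]);
        System.out.println("http_referer    = " + f[10]);
        System.out.println("valid           = " + isValid(line));
    }
}
```

Splitting on single spaces also cuts the quoted request and user-agent strings into pieces, which is why the mapper reads the URL from index 6 and glues indexes 11 and 12 back together.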

LogDriver

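import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogDriver {
    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(LogDriver.class);

        job.setMapperClass(LogMapper.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Map-only job, as in the simple version: cleaning needs no reduce phase.
        job.setNumReduceTasks(0);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}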
