OutputFormat Data Output
OutputFormat is the base class for MapReduce output; every class that produces MapReduce output implements the OutputFormat contract. Below are several common OutputFormat implementations.
OutputFormat Implementation Classes
- Text output: TextOutputFormat
The default output format is TextOutputFormat, which writes each record as a line of text. Its keys and values may be of any type, because TextOutputFormat calls toString() to turn them into strings.
- SequenceFileOutputFormat
SequenceFileOutputFormat writes its output as a sequence file. This is a good choice when the output will serve as the input of a subsequent MapReduce job, because the format is compact and easily compressed (see the driver sketch after this list).
- Custom OutputFormat
Implement your own output according to the user's requirements.
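As a minimal sketch of how a driver selects an output format (MyDriver and the argument paths are placeholders, not taken from the case below): TextOutputFormat is used by default, and SequenceFileOutputFormat is enabled explicitly, optionally with compression.
// Hypothetical driver fragment; MyDriver and args[1] are placeholders.
Job job = Job.getInstance(new Configuration());
job.setJarByClass(MyDriver.class);
// TextOutputFormat is the default; switch to SequenceFileOutputFormat explicitly.
job.setOutputFormatClass(SequenceFileOutputFormat.class);
// Sequence files compress well; enable block compression for the output.
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
FileOutputFormat.setOutputPath(job, new Path(args[1]));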
Custom OutputFormat
To control the output paths of the final files, you can define a custom OutputFormat.
When a single MapReduce job needs to write two kinds of results to different locations depending on the data, this kind of flexible output requirement can be met with a custom OutputFormat.
Steps to Customize an OutputFormat
- Define a class that extends FileOutputFormat.
- Provide a custom RecordWriter and override its write() method, which actually writes the output data.
Custom OutputFormat Case Study
Requirement
Filter the input file according to whether each line contains hadoop:
- URLs containing hadoop are written to hadoop.txt
- URLs not containing hadoop are written to other.txt
Input Data
http://hadoop.apache.org/
http://spark.apache.org/
http://hadoop.apache.org/
http://spark.apache.org/
http://hadoop.apache.org/
http://spark.apache.org/
http://hadoop.apache.org/
http://spark.apache.org/
http://hadoop.apache.org/
http://spark.apache.org/
http://spark.apache.org/
http://hadoop.apache.org/
http://spark.apache.org/
https://hbase.apache.org/
https://hive.apache.org/
https://hbase.apache.org/
https://hive.apache.org/
https://hbase.apache.org/
https://hive.apache.org/
https://hbase.apache.org/
https://hive.apache.org/
Code Implementation
Define a custom OutputFormat
public class LogOutputFormat extends FileOutputFormat<Text, NullWritable> {
@Override
public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
return new LogRecordWriter(job);
}
}
The RecordWriter that actually writes the data
public class LogRecordWriter extends RecordWriter<Text, NullWritable> {
FSDataOutputStream hadoopOutputStream = null;
FSDataOutputStream otherOutputStream = null;
public LogRecordWriter(TaskAttemptContext job) {
try {
Configuration configuration = job.getConfiguration();
// "mapred.output.dir" is the legacy name for the job output directory set by FileOutputFormat.setOutputPath in the driver
String s = configuration.get("mapred.output.dir");
FileSystem fileSystem = FileSystem.get(configuration);
Path hadoopPath = new Path( s + "/hadoop.txt");
Path otherPath = new Path(s + "/other.txt");
hadoopOutputStream = fileSystem.create(hadoopPath);
otherOutputStream = fileSystem.create(otherPath);
} catch (IOException e) {
e.printStackTrace();
}
}
@Override
public void write(Text key, NullWritable value) throws IOException, InterruptedException {
// Route each record to hadoop.txt or other.txt depending on its content
if(key.toString().contains("hadoop")) {
hadoopOutputStream.write(key.toString().getBytes());
}else {
otherOutputStream.write(key.toString().getBytes());
}
}
@Override
public void close(TaskAttemptContext context) throws IOException, InterruptedException {
if(hadoopOutputStream != null) {
hadoopOutputStream.close();
}
if(otherOutputStream != null) {
otherOutputStream.close();
}
}
}
LogMapper
public class LogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
Text k = new Text();
NullWritable v = NullWritable.get();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
k.set(value.toString());
context.write(k, v);
}
}
LogReducer
public class LogReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
@Override
protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
String k = null;
// Emit the key once per occurrence, appending a line break because the custom RecordWriter writes the raw bytes without one
for(NullWritable value : values) {
k = key.toString() + "\r\n";
context.write(new Text(k), NullWritable.get());
}
}
}
LogDriver
public class LogDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Job job = Job.getInstance(new Configuration());
job.setJarByClass(LogDriver.class);
job.setMapperClass(LogMapper.class);
job.setReducerClass(LogReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(NullWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// Use the custom OutputFormat instead of the default TextOutputFormat
job.setOutputFormatClass(LogOutputFormat.class);
boolean isSuccess = job.waitForCompletion(true);
System.exit(isSuccess ? 0 : 1);
}
}
Join Applications
Reduce Join
Main work on the Map side: tag the key/value pairs coming from different tables (files) so that the source of each record can be identified, then emit the join field as the key and the remaining fields plus the tag as the value.
Main work on the Reduce side: grouping on the join field (the key) has already been done, so within each group we only need to separate the records that came from different files (tagged in the map phase) and then merge them.
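In the case study below, for example, all records with join key (pid) 01 reach the same reducer: order records such as (1001, 01, 1) tagged "order" and the product record (01, 小米) tagged "product". The reducer then copies the product name onto every order record in the group.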
Reduce Join Case Study
Requirement
Join the order table data with the product table data.
The order table data (id, pid, amount):
1001 01 1
1002 02 2
1003 03 3
1004 01 4
1005 02 5
1006 03 6
The product table data (pid, pname):
01 小米
02 华为
03 格力
The expected output:
id='1004', amount=4, pname='小米'
id='1001', amount=1, pname='小米'
id='1005', amount=5, pname='华为'
id='1002', amount=2, pname='华为'
id='1006', amount=6, pname='格力'
id='1003', amount=3, pname='格力'
Analysis
- Add a flag field to mark whether a record comes from the order table or the product table.
- Sort keys by pid and, within the same pid, by the flag, so that the product record sorts ahead of the order records (secondary sort).
- Group by pid, so that each reduce call sees one product record together with all of its orders.
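To make the reducer logic concrete, consider the group for pid 01 after the secondary sort and grouping: the single reduce call receives the product record first, for example (NULL, 01, -1, 小米, product), followed by the order records (1004, 01, 4, NULL, order) and (1001, 01, 1, NULL, order) in no guaranteed order. The first iteration captures the product name; every later iteration fills that name into an order record and emits it.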
Code Implementation
OrderJoinBean
public class OrderJoinBean implements WritableComparable<OrderJoinBean> {
private String id;
private String pid;
private Integer amount;
private String pname;
private String flag; // "order" or "product": marks which table the record came from
public OrderJoinBean() {
}
public void set(String id, String pid, Integer amount, String pname, String flag) {
this.id = id;
this.pid = pid;
this.amount = amount;
this.pname = pname;
this.flag = flag;
}
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getPid() {
return pid;
}
public void setPid(String pid) {
this.pid = pid;
}
public Integer getAmount() {
return amount;
}
public void setAmount(Integer amount) {
this.amount = amount;
}
public String getPname() {
return pname;
}
public void setPname(String pname) {
this.pname = pname;
}
public String getFlag() {
return flag;
}
public void setFlag(String flag) {
this.flag = flag;
}
@Override
public int compareTo(OrderJoinBean o) {
// Primary sort on pid; within the same pid, sort the flag in descending order so that the "product" record precedes the "order" records
if(this.pid.compareTo(o.getPid()) == 0) {
return -this.flag.compareTo(o.getFlag());
}else {
return this.pid.compareTo(o.getPid());
}
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(this.id);
out.writeUTF(this.pid);
out.writeInt(this.amount);
out.writeUTF(this.pname);
out.writeUTF(this.flag);
}
@Override
public void readFields(DataInput in) throws IOException {
this.id = in.readUTF();
this.pid = in.readUTF();
this.amount = in.readInt();
this.pname = in.readUTF();
this.flag = in.readUTF();
}
@Override
public String toString() {
return "id='" + id + '\'' +
", amount=" + amount +
", pname='" + pname + '\'';
}
}
OrderJoinGroupingCompartor
public class OrderJoinGroupingCompartor extends WritableComparator {
protected OrderJoinGroupingCompartor() {
super(OrderJoinBean.class, true);
}
@Override
public int compare(WritableComparable a, WritableComparable b) {
// Group records by pid only, so all orders and the matching product record reach the same reduce() call
OrderJoinBean oa = (OrderJoinBean) a;
OrderJoinBean ob = (OrderJoinBean) b;
return oa.getPid().compareTo(ob.getPid());
}
}
OrderJoinMapper
public class OrderJoinMapper extends Mapper<LongWritable, Text, OrderJoinBean, NullWritable> {
private FileSplit fileSplit;
private OrderJoinBean k = new OrderJoinBean();
private NullWritable v = NullWritable.get();
@Override
protected void setup(Context context) throws IOException, InterruptedException {
fileSplit = (FileSplit) context.getInputSplit();
}
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] strings = value.toString().split("\t");
if(fileSplit.getPath().toString().endsWith("order.txt")) {
// Order record: id, pid, amount; pname is not known yet
k.set(strings[0], strings[1], Integer.parseInt(strings[2]), "NULL", "order");
}else {
// Product record: pid, pname; id and amount are placeholders
k.set("NULL",strings[0], -1, strings[1], "product");
}
context.write(k, v);
}
}
OrderJoinReducer
public class OrderJoinReducer extends Reducer<OrderJoinBean, NullWritable, OrderJoinBean, NullWritable> {
@Override
protected void reduce(OrderJoinBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
int i = 0;
String pName = "";
for(NullWritable value : values) {
// Thanks to the secondary sort, the first record in each pid group is the product record
if(i == 0) {
pName = key.getPname();
}else {
// Every later record is an order: fill in the product name and emit it
key.setPname(pName);
context.write(key, NullWritable.get());
}
i++;
}
}
}
OrderJoinDriver
public class OrderJoinDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Job job = Job.getInstance(new Configuration());
job.setJarByClass(OrderJoinDriver.class);
job.setMapperClass(OrderJoinMapper.class);
job.setReducerClass(OrderJoinReducer.class);
job.setMapOutputKeyClass(OrderJoinBean.class);
job.setMapOutputValueClass(NullWritable.class);
job.setOutputKeyClass(OrderJoinBean.class);
job.setOutputValueClass(NullWritable.class);
// Group map output by pid only, using the custom grouping comparator
job.setGroupingComparatorClass(OrderJoinGroupingCompartor.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
Summary
With this approach the join is performed in the Reduce phase, so the Reduce side carries almost all of the processing load while the Map nodes do very little work; resource utilization is poor, and data skew is very likely to occur in the Reduce phase.
The solution is to perform the join on the Map side.
Map Join
Usage Scenario
Suitable when one table is small and the other is large.
Advantages
Cache the small table(s) on the Map side and apply the join logic there: this shifts work to the Map phase, reduces the pressure on the Reduce side, and minimizes data skew.
Approach
Use the DistributedCache:
- Load the cache file in the driver.
- In the Mapper's setup() phase, read the file into an in-memory collection.
Map Join Case Study
The requirement is the same as in the Reduce join case above.
Code Implementation
OrderJoinBean
public class OrderJoinBean implements WritableComparable<OrderJoinBean> {
private String id;
private String pid;
private Integer amount;
private String pname;
public OrderJoinBean() {
}
public void set(String id, String pid, Integer amount, String pname) {
this.id = id;
this.pid = pid;
this.amount = amount;
this.pname = pname;
}
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getPid() {
return pid;
}
public void setPid(String pid) {
this.pid = pid;
}
public Integer getAmount() {
return amount;
}
public void setAmount(Integer amount) {
this.amount = amount;
}
public String getPname() {
return pname;
}
public void setPname(String pname) {
this.pname = pname;
}
@Override
public int compareTo(OrderJoinBean o) {
return this.pid.compareTo(o.getPid());
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(this.id);
out.writeUTF(this.pid);
out.writeInt(this.amount);
out.writeUTF(this.pname);
}
@Override
public void readFields(DataInput in) throws IOException {
this.id = in.readUTF();
this.pid = in.readUTF();
this.amount = in.readInt();
this.pname = in.readUTF();
}
@Override
public String toString() {
return "id='" + id + '\'' +
", amount=" + amount +
", pname='" + pname + '\'';
}
}
OrderJoinMapper
public class OrderJoinMapper extends Mapper<LongWritable, Text, OrderJoinBean, NullWritable> {
private Map<String, String> cacheValue = new HashMap<>();
@Override
protected void setup(Context context) throws IOException, InterruptedException {
// Read the cached product file (pid -> pname) into memory once per map task
URI[] cacheFiles = context.getCacheFiles();
BufferedReader reader = new BufferedReader(new FileReader(new File(cacheFiles[0])));
String line = null;
while (StringUtils.isNotEmpty(line = reader.readLine())) {
String[] strings = line.split("\t");
cacheValue.put(strings[0], strings[1]);
}
reader.close();
}
private OrderJoinBean orderJoinBean = new OrderJoinBean();
private NullWritable writeValue = NullWritable.get();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// Each input line is an order record; join it with the product name from the in-memory cache
String[] strings = value.toString().split("\t");
orderJoinBean.set(strings[0], strings[1], Integer.parseInt(strings[2]), cacheValue.get(strings[1]));
context.write(orderJoinBean, writeValue);
}
}
OrderJoinDriver
public class OrderJoinDriver {
public static void main(String[] args) throws IOException, URISyntaxException, ClassNotFoundException, InterruptedException {
Job job = Job.getInstance(new Configuration());
job.setJarByClass(OrderJoinDriver.class);
job.setMapperClass(OrderJoinMapper.class);
job.setOutputKeyClass(OrderJoinBean.class);
job.setOutputValueClass(NullWritable.class);
// Map-only job: the join happens entirely on the Map side
job.setNumReduceTasks(0);
// Register the small product table with the distributed cache
job.addCacheFile(new URI("file:///E:/testdata/order/join/cache/pd.txt"));
FileInputFormat.setInputPaths(job, new Path("E:\\testdata\\order\\join\\input"));
FileOutputFormat.setOutputPath(job, new Path("E:\\testdata\\order\\join\\mapjoinoutput"));
boolean isSuccess = job.waitForCompletion(true);
}
}
Counter Applications
Hadoop maintains several built-in counters for every job, reporting various metrics. For example, some counters record the number of bytes and records processed, which lets users monitor how much input data has been consumed and how much output has been produced.
Related API
- Counting with an enum:
enum MyCounter{MALFORORMED,NORMAL}
// Increment the custom counter defined by the enum by 1
context.getCounter(MyCounter.MALFORORMED).increment(1);
- Counting with a counter group name and a counter name:
context.getCounter("counterGroup", "countera").increment(1);
The group and counter names can be chosen freely, but they should be meaningful.
- The counter values can be viewed on the console after the job has run.
See the data cleaning examples below for concrete usage.
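Besides reading them from the console, the counter values can also be fetched programmatically from the Job object once the job has finished. A minimal sketch, placed in a driver after waitForCompletion and assuming the enum and group/counter names defined above:
// Sketch: fetch aggregated counter values in the driver after the job completes.
boolean isSuccess = job.waitForCompletion(true);
Counters counters = job.getCounters();
long malformed = counters.findCounter(MyCounter.MALFORORMED).getValue();
long countera = counters.findCounter("counterGroup", "countera").getValue();
System.out.println("MALFORORMED=" + malformed + ", countera=" + countera);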
Data Cleaning
Before running the core business MapReduce job, the data often has to be cleaned first to remove records that do not meet the requirements. Cleaning usually only needs a Mapper; no Reducer is required.
Simple Data Cleaning
LogMapper
public class LogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
Text k = new Text();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
boolean result = parseLog(line, context);
if (!result) {
return;
}
k.set(line);
context.write(k, NullWritable.get());
}
// Parse the log line and use counters to track how many records are kept or dropped
private boolean parseLog(String line, Context context) {
String[] fields = line.split(" ");
if (fields.length > 11) {
context.getCounter("map", "true").increment(1);
return true;
} else {
context.getCounter("map", "false").increment(1);
return false;
}
}
}
LogDriver
public class LogDriver {
public static void main(String[] args) throws Exception {
// Local test paths; these override any command-line arguments
args = new String[] { "e:/input/inputlog", "e:/output1" };
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(LogDriver.class);
job.setMapperClass(LogMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
// Map-only job: no reducer is needed for cleaning
job.setNumReduceTasks(0);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
Complex Data Cleaning
LogBean
public class LogBean {
private String remote_addr;// client IP address
private String remote_user;// client user name; "-" means not provided
private String time_local;// access time and time zone
private String request;// requested URL and HTTP protocol
private String status;// request status; 200 means success
private String body_bytes_sent;// size of the response body sent to the client
private String http_referer;// the page the request was linked from
private String http_user_agent;// information about the client's browser
private boolean valid = true;// whether the record is valid
public String getRemote_addr() {
return remote_addr;
}
public void setRemote_addr(String remote_addr) {
this.remote_addr = remote_addr;
}
public String getRemote_user() {
return remote_user;
}
public void setRemote_user(String remote_user) {
this.remote_user = remote_user;
}
public String getTime_local() {
return time_local;
}
public void setTime_local(String time_local) {
this.time_local = time_local;
}
public String getRequest() {
return request;
}
public void setRequest(String request) {
this.request = request;
}
public String getStatus() {
return status;
}
public void setStatus(String status) {
this.status = status;
}
public String getBody_bytes_sent() {
return body_bytes_sent;
}
public void setBody_bytes_sent(String body_bytes_sent) {
this.body_bytes_sent = body_bytes_sent;
}
public String getHttp_referer() {
return http_referer;
}
public void setHttp_referer(String http_referer) {
this.http_referer = http_referer;
}
public String getHttp_user_agent() {
return http_user_agent;
}
public void setHttp_user_agent(String http_user_agent) {
this.http_user_agent = http_user_agent;
}
public boolean isValid() {
return valid;
}
public void setValid(boolean valid) {
this.valid = valid;
}
@Override
public String toString() {
StringBuilder sb = new StringBuilder();
sb.append(this.valid);
sb.append("\001").append(this.remote_addr);
sb.append("\001").append(this.remote_user);
sb.append("\001").append(this.time_local);
sb.append("\001").append(this.request);
sb.append("\001").append(this.status);
sb.append("\001").append(this.body_bytes_sent);
sb.append("\001").append(this.http_referer);
sb.append("\001").append(this.http_user_agent);
return sb.toString();
}
}
LogMapper
public class LogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
Text k = new Text();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
LogBean bean = parseLog(line);
if (!bean.isValid()) {
return;
}
k.set(bean.toString());
context.write(k, NullWritable.get());
}
// Parse a log line into a LogBean and mark invalid records
private LogBean parseLog(String line) {
LogBean logBean = new LogBean();
String[] fields = line.split(" ");
// A valid record has at least 12 space-separated fields
if (fields.length > 11) {
logBean.setRemote_addr(fields[0]);
logBean.setRemote_user(fields[1]);
logBean.setTime_local(fields[3].substring(1));
logBean.setRequest(fields[6]);
logBean.setStatus(fields[8]);
logBean.setBody_bytes_sent(fields[9]);
logBean.setHttp_referer(fields[10]);
if (fields.length > 12) {
logBean.setHttp_user_agent(fields[11] + " "+ fields[12]);
}else {
logBean.setHttp_user_agent(fields[11]);
}
// Treat HTTP error responses (status >= 400) as invalid
if (Integer.parseInt(logBean.getStatus()) >= 400) {
logBean.setValid(false);
}
}else {
logBean.setValid(false);
}
return logBean;
}
}
LogDriver
public class LogDriver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(LogDriver.class);
job.setMapperClass(LogMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}