1)定义:MapReduce是一个分布式运算程序的编程框架
核心功能:将用户编写的业务逻辑代码和自带默认组件整合成一个完整的分布式运算程序,并发运行在一个Hadoop集群上。
2)优缺点:
优点:
缺点:
4)常用的序列化类型
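这里用一段示意代码把常见的Java类型与Hadoop Writable类型的对应关系列出来备忘(只列常用类型,非完整清单):
import org.apache.hadoop.io.*;

public class WritableTypes {
    public static void main(String[] args) {
        BooleanWritable b = new BooleanWritable(true);  // boolean -> BooleanWritable
        ByteWritable by = new ByteWritable((byte) 1);   // byte    -> ByteWritable
        IntWritable i = new IntWritable(1);             // int     -> IntWritable
        FloatWritable f = new FloatWritable(1.0f);      // float   -> FloatWritable
        LongWritable l = new LongWritable(1L);          // long    -> LongWritable
        DoubleWritable d = new DoubleWritable(1.0);     // double  -> DoubleWritable
        Text s = new Text("hello");                     // String  -> Text
        NullWritable n = NullWritable.get();            // 空值占位 -> NullWritable
    }
}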
2)环境准备,创建maven工程
调整maven的setting文件(可能存在版本不兼容问题,我用的IDEA 2021搭配Maven 3.5.4可以正常使用)
在pom.xml中添加依赖:
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.3</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.30</version>
    </dependency>
</dependencies>
在项目的src/main/resources目录下,新建一个文件,命名为“log4j.properties”,在文件中填入以下内容:
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
在main/java下创建com.wts.mapreduce.wordcount
写代码(特别注意导包,导入的是org.apache.hadoop下的hadoop、mapreduce相关的包)
(1)Mapper类
package com.wts.mapreduce.wordcount;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/*
* KEYIN,map阶段输入的key类型:LongWritable
* VALUEIN,map阶段输入的value类型:Text
* KEYOUT,map阶段输出的key类型:Text
* VALUEOUT,map阶段输出的value类型:IntWritable
* */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private Text outK = new Text();
private IntWritable outV = new IntWritable(1);//很重要:这里必须传1,否则统计结果会是0
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
//1. 获取一行 wts wts
String line = value.toString();
//2. 切割
//wts
//wts
String[] words = line.split(" ");
//3. 输出(封装)
for (String word : words) {
outK.set(word);
context.write(outK, outV);
}
}
}
(2)Reducer类
package com.wts.mapreduce.wordcount;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/*
* KEYIN,reduce阶段输入的key类型:Text
* VALUEIN,reduce阶段输入的value类型:IntWritable
* KEYOUT,reduce阶段输出的key类型:Text
* VALUEOUT,reduce阶段输出的value类型:IntWritable
* */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable outV = new IntWritable();
int sum;
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
sum = 0;
//wts(1,1)
//累加
for (IntWritable value : values) {
sum += value.get();
}
outV.set(sum);
//写出
context.write(key, outV);
}
}
(3)Driver类
package com.wts.mapreduce.wordcount;
//统计结果是0的常见情况:1. Mapper阶段new IntWritable()的时候没有传1;2. Reducer中sum += value.get()的累加位置写错;3. 最后outV.set()没有传入sum
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class WordCountDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
//1.获取job
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
//2、关联driver当前的jar(设置当前jar包路径)
job.setJarByClass(WordCountDriver.class);
//3、关联mapper和reducer
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
//4、设置map输出k、v类型
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
//5、设置最终输出k、v类型
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
//6、设置输入输出路径
FileInputFormat.setInputPaths(job, new Path("G:\\1_Program\\hadoop_learning\\hadoop-input"));
FileOutputFormat.setOutputPath(job, new Path("G:\\1_Program\\hadoop_learning\\hadoop-output\\output02"));//output1不用自己创建?
//7、提交job
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}
}
本地测试通过后,进行集群测试:
1)用maven打包jar包,添加打包插件依赖
<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.6.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
报错解决方案:工程上出现红色报错,是因为插件没有自动加载出来,需要在plugins中补充完整的插件坐标:
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <version>3.2.0</version>
</plugin>
2)把程序打包成jar包
3)修改不带依赖的jar包名称为wc.jar,并拷贝该jar包到Hadoop集群的/opt/module/hadoop-3.1.3路径。
4)启动Hadoop集群,执行WordCount程序
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop jar wc.jar
com.wts.mapreduce.wordcount.WordCountDriver /input /output
Hadoop自己开发了一套序列化机制(Writable),它的特点:
(1)紧凑 :高效使用存储空间。
(2)快速:读写数据的额外开销小。
(3)互操作:支持多语言的交互
在企业开发中往往常用的基本序列化类型不能满足所有需求,比如在Hadoop框架内部传递一个bean对象,那么该对象就需要实现序列化接口。
具体实现bean对象序列化步骤如下7步。
(1)必须实现Writable接口
(2)反序列化时,需要反射调用空参构造函数,所以必须有空参构造
public FlowBean() {
super();
}
(3)重写序列化方法
@Override
public void write(DataOutput out) throws IOException {
out.writeLong(upFlow);
out.writeLong(downFlow);
out.writeLong(sumFlow);
}
(4)重写反序列化方法
@Override
public void readFields(DataInput in) throws IOException {
upFlow = in.readLong();
downFlow = in.readLong();
sumFlow = in.readLong();
}
(5)注意反序列化的顺序和序列化的顺序完全一致
(6)要想把结果显示在文件中,需要重写toString(),可用"\t"分开,方便后续用。
(7)如果需要将自定义的bean放在key中传输,则还需要实现Comparable接口,因为MapReduce框架中的Shuffle过程要求对key必须能排序。详见后面排序案例。
@Override
public int compareTo(FlowBean o) {
// 倒序排列,从大到小;相等时要返回0,遵守compareTo的约定
if (this.sumFlow > o.getSumFlow()) {
return -1;
} else if (this.sumFlow < o.getSumFlow()) {
return 1;
} else {
return 0;
}
}
要求输出结果:
2)分析过程
3)写代码:导包很重要,很容易出错
FlowBean:
package com.wts.mapreduce.writable;
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
//(1)实现Writable接口
public class FlowBean implements Writable {
private long upFlow;
private long downFlow;
private long sumFlow;
//2.反序列化,需要反射调用空参构造方法,所以必须有空参构造
public FlowBean() {
}
//3.提供三个getter、setter方法
public long getUpFlow() {
return upFlow;
}
public void setUpFlow(long upFlow) {
this.upFlow = upFlow;
}
public long getDownFlow() {
return downFlow;
}
public void setDownFlow(long downFlow) {
this.downFlow = downFlow;
}
public long getSumFlow() {
return sumFlow;
}
public void setSumFlow(long sumFlow) {
this.sumFlow = sumFlow;
}
//这里再加一个无参重载,直接由上行流量和下行流量求出总流量
public void setSumFlow() {
this.sumFlow = this.upFlow + this.downFlow;
}
//4.重写序列化、反序列化方法,注意反序列化的顺序要和序列化的顺序一致
@Override
public void write(DataOutput dataOutput) throws IOException {
dataOutput.writeLong(upFlow);
dataOutput.writeLong(downFlow);
dataOutput.writeLong(sumFlow);
}
@Override
public void readFields(DataInput dataInput) throws IOException {
upFlow = dataInput.readLong();
downFlow = dataInput.readLong();
sumFlow = dataInput.readLong();
}
//5.重写toString方法输出结果
@Override
public String toString() {
return upFlow + "\t" + downFlow + "\t" + sumFlow;
}
}
Mapper类:
package com.wts.mapreduce.writable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class FlowBeanMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
private Text OutK = new Text();
private FlowBean OutV = new FlowBean();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
//1.获取一行数据
//1 13736230513 192.196.100.1 www.atguigu.com 2481 24681 200
String line = value.toString();
//2.切割\t
//1 13736230513 192.196.100.1 www.atguigu.com 2481 24681 200
String[] split = line.split("\t");
//3.获取需要的数据
String phone = split[1];
String up = split[split.length - 3];
String down = split[split.length - 2];
//4.封装
OutK.set(phone);
OutV.setUpFlow(Long.parseLong(up));
OutV.setDownFlow(Long.parseLong(down));
OutV.setSumFlow();
//5.写出
context.write(OutK, OutV);
}
}
Reducer类:
package com.wts.mapreduce.writable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class FlowBeanReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
private FlowBean OutV = new FlowBean();
@Override
protected void reduce(Text key, Iterable<FlowBean> values, Reducer<Text, FlowBean, Text, FlowBean>.Context context) throws IOException, InterruptedException {
//1.遍历集合,累加值
long totalUp = 0;
long totalDown = 0;
for (FlowBean value : values) {
totalUp += value.getUpFlow();
totalDown += value.getDownFlow();
}
//2.封装OutK,OutV
OutV.setUpFlow(totalUp);
OutV.setDownFlow(totalDown);
OutV.setSumFlow();
//3.写出
context.write(key, OutV);
}
}
Driver类:
package com.wts.mapreduce.writable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class FlowBeanDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
//1.获取job
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
//2.设置jar包
job.setJarByClass(FlowBeanDriver.class);
//3.关联mapper和reducer
job.setMapperClass(FlowBeanMapper.class);
job.setReducerClass(FlowBeanReducer.class);
//4.设置mapper 输出key和value的类型
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(FlowBean.class);
//5.设置最后输出的key和value类型
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FlowBean.class);
//6.设置输入、输出路径
FileInputFormat.setInputPaths(job, new Path("G:\\1_Program\\hadoop_learning\\hadoop-input\\phone_date"));
FileOutputFormat.setOutputPath(job, new Path("G:\\1_Program\\hadoop_learning\\hadoop-output\\flow-output01"));
//7.提交job
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}
}
运行结果:
1)MapTask并行度决定机制
2)job提交流程源码分析
3) FileInputFormat 切片机制:
切片大小调整:切片大小的计算公式为 splitSize = max(minSize, min(maxSize, blockSize)),默认等于blockSize
调大:使minSize(默认为1)大于blockSize
调小:使maxSize(默认为Long的最大值)小于blockSize
(设置方式见下方示意代码)
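一个在Driver中调整切片大小的最小示意(假设job已按前文方式创建,数值仅为举例;对应的配置项为mapreduce.input.fileinputformat.split.minsize和mapreduce.input.fileinputformat.split.maxsize):
// 调大切片:让minSize大于blockSize,例如设置为256MB
FileInputFormat.setMinInputSplitSize(job, 256 * 1024 * 1024L);
// 调小切片:让maxSize小于blockSize,例如设置为64MB
FileInputFormat.setMaxInputSplitSize(job, 64 * 1024 * 1024L);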
4)TextInputFormat
实现类: FileInputFormat常见的接口实现类包括:TextInputFormat、KeyValueTextInputFormat、NLineInputFormat、CombineTextInputFormat和自定义InputFormat等。
TextInputFormat是默认的FileInputFormat实现类。按行读取每条记录。
5 ) CombineTextInputFormat
框架默认的TextInputFormat切片机制是按文件规划切片:不管文件多小,都会作为一个单独的切片,交给一个MapTask。这样如果有大量小文件,就会产生大量的MapTask,处理效率极其低下。
应用场景: 所以,引入 CombineTextInputFormat切片机制:用于小文件过多的场景,它可以将多个小文件从逻辑上规划到一个切片中,这样,多个小文件就可以交给一个MapTask处理。
虚拟存储切片最大值: CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);// 4m
注意: 虚拟存储切片最大值设置最好根据实际的小文件大小情况来设置具体的值。
生成过程:
(1)虚拟存储过程:
将输入目录下所有文件的大小,依次和设置的setMaxInputSplitSize值比较:如果不大于设置的最大值,逻辑上划分一个块;如果输入文件大于设置的最大值且大于最大值的两倍,那么先以最大值切割一块;当剩余数据大小超过设置的最大值且不大于最大值的2倍时,将剩余数据均分成2个虚拟存储块(防止出现太小的切片)。
例如setMaxInputSplitSize值为4M,输入文件大小为8.02M,则先逻辑上分成一个4M。剩余的大小为4.02M,如果按照4M逻辑划分,就会出现0.02M的小的虚拟存储文件,所以将剩余的4.02M文件切分成(2.01M和2.01M)两个文件。
(2)切片过程:
(a)判断虚拟存储的文件大小是否大于setMaxInputSplitSize值,大于等于则单独形成一个切片。
(b)如果不大于则跟下一个虚拟存储文件进行合并,共同形成一个切片。
(c)测试举例:有1.7M、5.1M、3.4M以及6.8M这4个小文件,则虚拟存储之后形成6个文件块,大小分别为:
1.7M,(2.55M、2.55M),3.4M以及(3.4M、3.4M)
最终会形成3个切片,大小分别为:
(1.7+2.55)M,(2.55+3.4)M,(3.4+3.4)M
6) CombineTextInputFormat案例实操
为什么每次我的切片数量都是1???算了不学了没必要!
找到问题了:
(a)驱动类中添加代码如下:
// 如果不设置InputFormat,它默认用的是TextInputFormat.class
job.setInputFormatClass(CombineTextInputFormat.class);
//虚拟存储切片最大值设置4m
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);
//number of splits:3
(b)虚拟存储切片最大值设置20m
// 如果不设置InputFormat,它默认用的是TextInputFormat.class
job.setInputFormatClass(CombineTextInputFormat.class);
//虚拟存储切片最大值设置20m
CombineTextInputFormat.setMaxInputSplitSize(job, 20971520);
//number of splits:1
Map方法之后、Reduce方法之前的数据处理过程,称为Shuffle。
问题的引出,统计结果按照条件输出到不同文件中去
手机号案例:
手机号136、137、138、139开头都分别放到一个独立的4个文件中,其他开头的放到一个文件中。(5个文件)
1)在Writable自定义bean对象实现序列化接口中:增加一个分区类
代码如下:
package com.wts.mapreduce.partitioner;
import org.apache.hadoop.io.Text;
public class Partitionerself extends org.apache.hadoop.mapreduce.Partitioner<Text, FlowBean> {
@Override
public int getPartition(Text text, FlowBean flowBean, int numPartitions) {
//获取手机号前三位
String phone = text.toString();
String pre = phone.substring(0, 3);
//设置分区
int partiton;
if ("136".equals(pre)) {
partiton = 0;
} else if ("137".equals(pre)) {
partiton = 1;
} else if ("138".equals(pre)) {
partiton = 2;
} else if ("139".equals(pre)) {
partiton = 3;
} else {
partiton = 4;
}
//返回分区号
return partiton;
}
}
2)在Driver主函数中:指定自定义分区,同时指定相应的数量ReduceTask
//指定自定义分区器
job.setPartitionerClass(Partitionerself.class);
//同时指定相应的数量
job.setNumReduceTasks(5);
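补充一个说明性的示意(基于自定义分区时的常见行为,供参考):ReduceTask数量要与getPartition()返回的分区个数配合设置,不同取值对应的行为大致如下。
// 自定义分区器Partitionerself返回0~4共5个分区号
job.setPartitionerClass(Partitionerself.class);
job.setNumReduceTasks(5); // 与分区数一致:正常产生5个输出文件
// job.setNumReduceTasks(1); // 只有1个ReduceTask时不会执行自定义分区逻辑,只产生1个输出文件
// job.setNumReduceTasks(2); // 大于1但小于分区数:运行时会报Illegal partition相关的异常
// job.setNumReduceTasks(6); // 大于分区数:多出来的ReduceTask只会输出空文件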
1)概述:
排序是MapReduce框架最重要的操作之一
MapTask和ReduceTask均会对数据key进行排序(默认行为),按照字典顺序排序(实现该排序的方法是快速排序)
部分排序
保证输出的每个文件内部有序
全排序
最终输出结果只有一个文件,且文件内部有序
辅助排序(GroupingComparator分组)
在Reduce端按自定义规则对key分组(示意代码见下方)
二次排序
排序的判断条件有两个
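关于辅助排序(分组),给一个最小示意:假设以后文实现了WritableComparable的FlowBean作为key,并希望把总流量sumFlow相等的key视为同一组、进入同一次reduce()调用(这里的分组条件只是假设,用来演示WritableComparator的用法,并非本文案例的实际需求):
package com.wts.mapreduce.writablecomparable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class FlowGroupingComparator extends WritableComparator {
    protected FlowGroupingComparator() {
        // 第二个参数为true,表示创建key的实例,便于反序列化后比较
        super(FlowBean.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        FlowBean f1 = (FlowBean) a;
        FlowBean f2 = (FlowBean) b;
        // 返回0表示两个key属于同一组;这里按总流量是否相等来分组
        return Long.compare(f1.getSumFlow(), f2.getSumFlow());
    }
}
使用时在Driver中通过job.setGroupingComparatorClass(FlowGroupingComparator.class)注册即可。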
2)原理分析:bean对象做为key传输,需要实现WritableComparable接口重写compareTo方法,就可以实现排序。
3)实操
在Writable序列化案例的基础上,实现按总流量倒序排列:
代码实现步骤:
(1) 在FlowBean基础上实现WritableComparable接口,重写方法
添加以下重写方法的代码:
@Override
public int compareTo(FlowBean o) {
//降序排列
if (o.sumFlow < this.sumFlow) {
return -1;
} else if (o.sumFlow > this.sumFlow) {
return 1;
} else {
return 0;
}
}
(2) 重新编写Mapper类
package com.wts.mapreduce.writablecomparable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class FlowBeanMapper extends Mapper<LongWritable, Text, FlowBean, Text> {
//
private FlowBean outK = new FlowBean();
private Text outV = new Text();
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, FlowBean, Text>.Context context) throws IOException, InterruptedException {
//获取一行
//手机号 上 下 总
String line = value.toString();
String[] splits = line.split("\t");
//只对key排序 换句话就是:把要排序的作为key
outK.setUpFlow(Long.parseLong(splits[1]));//Flowbean作为Key
outK.setDownFlow(Long.parseLong(splits[2]));
outK.setSumFlow();
outV.set(splits[0]);//手机号作为value
//写出
context.write(outK, outV);
}
}
(3) 重新写Reducer类
package com.wts.mapreduce.writablecomparable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class FlowBeanReducer extends Reducer<FlowBean, Text, Text, FlowBean> {
@Override
protected void reduce(FlowBean key, Iterable<Text> values, Reducer<FlowBean, Text, Text, FlowBean>.Context context) throws IOException, InterruptedException {
//总流量相同的手机号会进入同一组,逐个写出,避免被合并丢失
for (Text value : values) {
//把顺序换回来:手机号作为key,FlowBean作为value
context.write(value, key);
}
}
}
(4) 修改Driver类:(Map输出的key、value类型要对应改为FlowBean、Text;输入路径指向上一步Writable案例的输出目录,目录里只保留part-r-00000结果文件)
package com.wts.mapreduce.writablecomparable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class FlowBeanDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
//1.获取job
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
//2.设置jar包
job.setJarByClass(FlowBeanDriver.class);
//3.关联mapper和reducer
job.setMapperClass(FlowBeanMapper.class);
job.setReducerClass(FlowBeanReducer.class);
//4.设置mapper 输出key和value的类型
job.setMapOutputKeyClass(FlowBean.class);
job.setMapOutputValueClass(Text.class);
//5.设置最后输出的key和value类型
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FlowBean.class);
//6.设置输入、输出路径
FileInputFormat.setInputPaths(job, new Path("G:\\1_Program\\hadoop_learning\\hadoop-output\\flow-output\\output000"));
FileOutputFormat.setOutputPath(job, new Path("G:\\1_Program\\hadoop_learning\\hadoop-output\\flow-output\\fbcomparable-output"));
//7.提交job
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}
}
二次排序:在FlowBean类重写的compareTo方法里添加修改(总流量相同时,再按上行流量排序):
@Override
public int compareTo(FlowBean o) {
//降序排列
if (o.sumFlow < this.sumFlow) {
return -1;
} else if (o.sumFlow > this.sumFlow) {
return 1;
} else {
//总流量相同时,按照上行流量降序排序
if (o.upFlow > this.upFlow) {
return -1;
} else if (o.upFlow < this.upFlow) {
return 1;
} else {
return 0;
}
}
}
在上述代码的基础上,增加Partition类
package com.wts.mapreduce.writablecomparandpartitioner;
import org.apache.hadoop.io.Text;
public class Partitioner extends org.apache.hadoop.mapreduce.Partitioner<FlowBean, Text> {
@Override
public int getPartition(FlowBean flowBean, Text text, int numPartitions) {
String phone = text.toString();
String prephone = phone.substring(0, 3);
//设置分区
int partition;
if ("136".equals(prephone)) {
partition = 0;
} else if ("137".equals(prephone)) {
partition = 1;
} else if ("138".equals(prephone)) {
partition = 2;
} else if ("139".equals(prephone)) {
partition = 3;
} else {
partition = 4;
}
//返回分区号
return partition;
}
}
然后在Driver类中设置分区器和对应的ReduceTask数量
//设置partition分区器
job.setPartitionerClass(Partitioner.class);
job.setNumReduceTasks(5);
1)概述:
Combiner是在每一个MapTask所在的节点运行
Reducer是接收所有Mapper的数据结果
意义:减少网络传输量
(A,1)、(A,1)、(A,1)、… --> (A,1000)
package com.wts.mapreduce.combiner;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable outV = new IntWritable();
int sum;
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
sum = 0;
//wts(1,1)
//累加
for (IntWritable value : values) {
sum += value.get();
}
outV.set(sum);
//写出
context.write(key, outV);
}
}
//Combiner
job.setCombinerClass(WordCountCombiner.class);
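补充:当Combiner的逻辑与Reducer完全一致时(比如WordCount这种求和场景),也可以不单独编写Combiner类,直接复用Reducer(示意):
// 方案二:直接把已有的Reducer设置为Combiner
job.setCombinerClass(WordCountReducer.class);
注意Combiner的输出kv类型必须和Reducer的输入kv类型一致,并且只有在不影响最终业务结果(如求和、求最大值)时才能使用;像求平均值这类运算就不能直接套用。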
概述:OutputFormat是MapReduce输出的基类
默认是TextOutputFormat(按行输出)
还可以自定义OutputFormat:(自定义输出到mysql、hbase等存储框架中)
要求:处理log.txt文件,期望把包含atguigu的行输出到wts.log,其余行输出到other.log。
package com.wts.mapreduce.outputformat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class LogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, NullWritable>.Context context) throws IOException, InterruptedException {
//每次读入一行,不做处理直接写出
context.write(value, NullWritable.get());
}
}
2)编写Reducer
package com.wts.mapreduce.outputformat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class LogReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
@Override
protected void reduce(Text key, Iterable<NullWritable> values, Reducer<Text, NullWritable, Text, NullWritable>.Context context) throws IOException, InterruptedException {
//相同内容的行key相同,遍历values逐条写出,防止重复行丢失
for (NullWritable value : values) {
context.write(key, NullWritable.get());
}
}
}
3)自定义一个LogOutputFormat类
package com.wts.mapreduce.outputformat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class LogOutputFormat extends FileOutputFormat<Text, NullWritable> {
@Override
public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
LogRecordWriter lrw = new LogRecordWriter(job);
return lrw;
}
}
4)编写LogRecordWriter类
package com.wts.mapreduce.outputformat;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import java.io.IOException;
public class LogRecordWriter extends RecordWriter<Text, NullWritable> {
private FSDataOutputStream wtsOut;
private FSDataOutputStream otherOut;
public LogRecordWriter(TaskAttemptContext job) {
//创建两条流
try {
FileSystem fs = FileSystem.get(job.getConfiguration());
wtsOut = fs.create(new Path("G:\\1_Program\\hadoop_learning\\hadoop-output\\outputformat\\1\\wts.log"));
otherOut = fs.create(new Path("G:\\1_Program\\hadoop_learning\\hadoop-output\\outputformat\\other\\others.log"));
} catch (IOException e) {
e.printStackTrace();
}
}
@Override
public void write(Text key, NullWritable value) throws IOException, InterruptedException {
String log = key.toString();
//具体写
if (log.contains("atguigu")) {
wtsOut.writeBytes(log + "\n");
} else {
otherOut.writeBytes(log + "\n");
}
}
@Override
public void close(TaskAttemptContext context) throws IOException, InterruptedException {
//关流
IOUtils.closeStream(wtsOut);
IOUtils.closeStream(otherOut);
}
}
5)编写Driver
package com.wts.mapreduce.outputformat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class LogDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(LogDriver.class);
job.setMapperClass(LogMapper.class);
job.setReducerClass(LogReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(NullWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
//自定义outputformat
job.setOutputFormatClass(LogOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path("G:\\1_Program\\hadoop_learning\\hadoop-input\\outputformat"));
//虽然我们自定义了outputformat,但是因为我们的outputformat继承自fileoutputformat
//而fileoutputformat要输出一个_SUCCESS文件,所以在这还得指定一个输出目录
FileOutputFormat.setOutputPath(job, new Path("G:\\1_Program\\hadoop_learning\\hadoop-output\\outputformat\\logout"));
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}
}
多个表的关联
上代码:
1)编写WritableBean
package com.wts.mapreduce.reducerjoin;
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class TableBean implements Writable {
private String id;
private String pid;
private int amount;
private String pname;
private String flag; //判断order表还是pd表
public TableBean() {
}
public TableBean(String id, String pid, int amount, String pname, String flag) {
this.id = id;
this.pid = pid;
this.amount = amount;
this.pname = pname;
this.flag = flag;
}
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getPid() {
return pid;
}
public void setPid(String pid) {
this.pid = pid;
}
public int getAmount() {
return amount;
}
public void setAmount(int amount) {
this.amount = amount;
}
public String getPname() {
return pname;
}
public void setPname(String pname) {
this.pname = pname;
}
public String getFlag() {
return flag;
}
public void setFlag(String flag) {
this.flag = flag;
}
//重写toString方法
@Override
public String toString() {
return id + '\t' + pname + '\t' + amount;
}
@Override
public void write(DataOutput dataOutput) throws IOException {
dataOutput.writeUTF(id);
dataOutput.writeUTF(pid);
dataOutput.writeInt(amount);
dataOutput.writeUTF(pname);
dataOutput.writeUTF(flag);
}
@Override
public void readFields(DataInput dataInput) throws IOException {
this.id = dataInput.readUTF();
this.pid = dataInput.readUTF();
this.amount = dataInput.readInt();
this.pname = dataInput.readUTF();
this.flag = dataInput.readUTF();
}
}
2)Mapper(setup()在每个MapTask开始时只执行一次,这里用它获取当前切片对应的文件名,用来区分order表和pd表)
package com.wts.mapreduce.reducerjoin;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import java.io.IOException;
public class TableMapper extends Mapper<LongWritable, Text, Text, TableBean> {
private String filename;
private Text outK = new Text();
private TableBean outV = new TableBean();
@Override
protected void setup(Mapper<LongWritable, Text, Text, TableBean>.Context context) throws IOException, InterruptedException {
//获取对应的文件名称
InputSplit Split = context.getInputSplit();
FileSplit fileSplit = (FileSplit) Split;
filename = fileSplit.getPath().getName();
}
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, TableBean>.Context context) throws IOException, InterruptedException {
String line = value.toString();
if (filename.contains("order")) {
String[] split = line.split("\t");
//封装
outK.set(split[1]);
outV.setId(split[0]);
outV.setPid(split[1]);
outV.setAmount(Integer.parseInt(split[2]));
outV.setPname("");
outV.setFlag("order");
} else {
String[] split = line.split("\t");
//封装
outK.set(split[0]);
outV.setId("");
outV.setPid(split[0]);
outV.setAmount(0);
outV.setPname(split[1]);
outV.setFlag("pd");
}
//写出
context.write(outK, outV);
}
}
3)Reducer
package com.wts.mapreduce.reducerjoin;
import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;
public class TableReducer extends Reducer<Text, TableBean, TableBean, NullWritable> {
@Override
protected void reduce(Text key, Iterable<TableBean> values, Reducer<Text, TableBean, TableBean, NullWritable>.Context context) throws IOException, InterruptedException {
// 01 1001 1 order
// 01 1004 4 order
// 01 小米 pd
//初始化
ArrayList<TableBean> orderBeans = new ArrayList<>();
TableBean pdBean = new TableBean();
for (TableBean value : values) {
if (value.getFlag().equals("order")) {
//order表
//创建临时对象存储value
TableBean tmpOrderBean = new TableBean();
try {
//暂时存储到tmp中
BeanUtils.copyProperties(tmpOrderBean, value);
} catch (IllegalAccessException e) {
e.printStackTrace();
} catch (InvocationTargetException e) {
e.printStackTrace();
}
//将临时tmp对象添加到集合中
orderBeans.add(tmpOrderBean);
} else {
//pd表格
try {
BeanUtils.copyProperties(pdBean, value);
} catch (IllegalAccessException e) {
e.printStackTrace();
} catch (InvocationTargetException e) {
e.printStackTrace();
}
}
}
//遍历集合,用pname替换order中的pid,之后写出
for (TableBean orderBean : orderBeans) {
orderBean.setPname(pdBean.getPname());
//写出
context.write(orderBean, NullWritable.get());
}
}
}
4)Driver
package com.wts.mapreduce.reducerjoin;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class TableDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Job job = Job.getInstance(new Configuration());
job.setJarByClass(TableDriver.class);
job.setMapperClass(TableMapper.class);
job.setReducerClass(TableReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(TableBean.class);
job.setOutputKeyClass(TableBean.class);
job.setOutputValueClass(NullWritable.class);
FileInputFormat.setInputPaths(job, new Path("G:\\1_Program\\hadoop_learning\\hadoop-input\\joinbean"));
FileOutputFormat.setOutputPath(job, new Path("G:\\1_Program\\hadoop_learning\\hadoop-output\\reducejoin\\output01"));
boolean b = job.waitForCompletion(true);
System.exit(b ? 0 : 1);
}
}
小结:Reduce Join用于多表关联的场景,代码思路不太容易吃透,需要多体会。
使用环境:适用于一张表十分小、一张表很大的场景。
思考:在Reduce端处理过多的表,非常容易产生数据倾斜。怎么办?
答案:在Map端缓存多张表,提前处理业务逻辑,这样增加Map端业务,减少Reduce端数据的压力,尽可能的减少数据倾斜。
具体办法:采用DistributedCache
(1)在Mapper的setup阶段,将文件读取到缓存集合中。
(2)在Driver驱动类中加载缓存。
//缓存普通文件到Task运行节点。
job.addCacheFile(new URI("file:///e:/cache/pd.txt"));
//如果是集群运行,需要设置HDFS路径
job.addCacheFile(new URI("hdfs://hadoop102:8020/cache/pd.txt"));
实操:(和reducejoin一样的案例)
1)Mapper
package com.wts.mapreduce.mapjoin;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
private HashMap<String, String> pdMap = new HashMap();
private Text outK = new Text();
@Override
protected void setup(Mapper<LongWritable, Text, Text, NullWritable>.Context context) throws IOException, InterruptedException {
//获取缓存文件pd.txt,并把文件内容封装到集合pdMap中
URI[] cacheFiles = context.getCacheFiles();
Path path = new Path(cacheFiles[0]);
FileSystem fs = FileSystem.get(context.getConfiguration());
FSDataInputStream fis = fs.open(path);
//通过包装流转换为reader,方便按行读取
BufferedReader reader = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
//逐行读取,按行处理
String line;
while (StringUtils.isNotEmpty(line = reader.readLine())) {
//切割
//01 小米
String[] split = line.split("\t");
pdMap.put(split[0], split[1]);
}
//关流
IOUtils.closeStream(reader);
}
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, NullWritable>.Context context) throws IOException, InterruptedException {
//读取order
//1001 01 1
String[] split = value.toString().split("\t");
//通过每行数据的pid 对应取出pdMap里面的pname
String pname = pdMap.get(split[1]);
//封装: id pname amount
outK.set(split[0] + "\t" + pname + "\t" + split[2]);
//写出
context.write(outK, NullWritable.get());
}
}
2)Driver
package com.wts.mapreduce.mapjoin;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
public class MapJoinDriver {
public static void main(String[] args) throws ClassNotFoundException, InterruptedException, IOException, URISyntaxException {
// 1 获取job信息
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// 2 设置加载jar包路径
job.setJarByClass(MapJoinDriver.class);
// 3 关联mapper
job.setMapperClass(MapJoinMapper.class);
// 4 设置Map输出KV类型
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(NullWritable.class);
// 5 设置最终输出KV类型
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
// 加载缓存数据
job.addCacheFile(new URI("file:///G:/1_Program/hadoop_learning/hadoop-input/mapjoin/pd/pd.txt"));
// Map端Join的逻辑不需要Reduce阶段,设置reduceTask数量为0
job.setNumReduceTasks(0);
// 6 设置输入输出路径
FileInputFormat.setInputPaths(job, new Path("G:\\1_Program\\hadoop_learning\\hadoop-input\\mapjoin\\order"));
FileOutputFormat.setOutputPath(job, new Path("G:\\1_Program\\hadoop_learning\\hadoop-output\\mapjoin\\output00"));
// 7 提交
boolean b = job.waitForCompletion(true);
System.exit(b ? 0 : 1);
}
}
清理的过程往往只需要运行Mapper程序,不需要运行Reduce程序。
案例:去除日志中字段个数小于等于11的日志(只用Map阶段清洗)
上代码:
1)Mapper
package com.wts.mapreduce.ETL;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class ETL_Mapper extends Mapper<LongWritable, Text, Text, NullWritable> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
//解析
boolean result = parseLog(line, context);
//不合法退出
if (!result) {
return;
}
//输出
context.write(value, NullWritable.get());
}
private boolean parseLog(String line, Context context) {
String[] split = line.split(" ");
if (split.length > 11) {
return true;
} else {
return false;
}
}
}
2)Driver
package com.wts.mapreduce.ETL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class ETL_Driver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
// 输入输出路径需要根据自己电脑上实际的输入输出路径设置
args = new String[]{"G:\\1_Program\\hadoop_learning\\hadoop-input\\etl", "G:\\1_Program\\hadoop_learning\\hadoop-output\\ETL\\output00"};
// 1 获取job信息
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// 2 加载jar包
job.setJarByClass(ETL_Driver.class);
// 3 关联map
job.setMapperClass(ETL_Mapper.class);
// 4 设置最终输出类型
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
// 设置reducetask个数为0
job.setNumReduceTasks(0);
// 5 设置输入和输出路径
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// 6 提交
boolean b = job.waitForCompletion(true);
System.exit(b ? 0 : 1);
}
}
效果:
1)输入数据接口:InputFormat
(1)默认使用的实现类是:TextInputFormat
(2)TextInputFormat的功能逻辑是:一次读一行文本,然后将该行的起始偏移量作为key,行内容作为value返回。
(3)CombineTextInputFormat可以把多个小文件合并成一个切片处理,提高处理效率。
2)逻辑处理接口:Mapper
用户根据业务需求实现其中三个方法:map() setup() cleanup ()
3)Partitioner分区
(1)有默认实现HashPartitioner,逻辑是根据key的哈希值和numReduces来返回一个分区号:(key.hashCode() & Integer.MAX_VALUE) % numReduces(实现示意见本总结之后的代码)
(2)如果业务上有特别的需求,可以自定义分区。
4)Comparable排序
(1)当我们用自定义的对象作为key来输出时,就必须要实现WritableComparable接口,重写其中的compareTo()方法。
(2)部分排序:对最终输出的每一个文件进行内部排序。
(3)全排序:对所有数据进行排序,通常只有一个Reduce。
(4)二次排序:排序的条件有两个。
5)Combiner合并
Combiner合并可以提高程序执行效率,减少IO传输。但是使用时必须不能影响原有的业务处理结果。
6)逻辑处理接口:Reducer
用户根据业务需求实现其中三个方法:reduce() setup() cleanup ()
7)输出数据接口:OutputFormat
(1)默认实现类是TextOutputFormat,功能逻辑是:将每一个KV对,向目标文本文件输出一行。
(2)用户还可以自定义OutputFormat。
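对于上面3)提到的默认分区器,下面给出一个与HashPartitioner核心逻辑一致的示意实现(类名MyHashPartitioner为举例,实际使用默认分区器时不需要自己编写):
import org.apache.hadoop.mapreduce.Partitioner;

public class MyHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // 先与Integer.MAX_VALUE按位与,保证hashCode为负数时结果仍然非负
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}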
直接上实操:
1)map端压缩
Configuration conf = new Configuration();
// 开启map端输出压缩
conf.setBoolean("mapreduce.map.output.compress", true);
// 设置map端输出压缩方式
conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class,CompressionCodec.class);
Job job = Job.getInstance(conf);
2)reduce端压缩
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// 设置reduce端输出压缩开启
FileOutputFormat.setCompressOutput(job, true);
// 设置压缩的方式
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
// FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
// FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
map端和reduce端的压缩方式可以不对应,互不影响(见下方示意)。
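下面把两段设置合到同一个Driver里,并让map端和reduce端采用不同的压缩方式,作为一个示意(假设在前文WordCount的Driver基础上修改,省略了与压缩无关的设置):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// 在Driver的main方法中:
Configuration conf = new Configuration();
// map端中间结果用BZip2压缩
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);
Job job = Job.getInstance(conf);
// ...关联Mapper/Reducer、设置输出类型、输入路径等与前文相同,此处省略...
// reduce端最终输出用Gzip压缩,与map端不一致也没有问题
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);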