Do not install the NameNode and the SecondaryNameNode on the same server. The ResourceManager is also memory-hungry, so do not place it on the same machine as the NameNode or the SecondaryNameNode.
Hadoop2: NameNode, DataNode, NodeManager
Hadoop3: ResourceManager, DataNode, NodeManager
Hadoop4: SecondaryNameNode, DataNode, NodeManager
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://Hadoop2:8020</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/Module/hadoop-3.1.3/data</value>
</property>
<property>
<name>hadoop.http.staticuser.user</name>
<value>root</value>
</property>
</configuration>
<configuration>
<property>
<name>dfs.namenode.http-address</name>
<value>Hadoop2:9870</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>Hadoop4:9868</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>Hadoop2:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>Hadoop2:19888</value>
</property>
</configuration>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>Hadoop3</value>
</property>
<property>
<description>Environment variables that containers may override rather than use NodeManager's default.</description>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://Hadoop2:19888/jobhistory/logs</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
</configuration>
# The java implementation to use. By default, this environment
# variable is REQUIRED on ALL platforms except OS X!
export JAVA_HOME=/opt/Module/jdk1.8.0_212
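Entries added to this file (the worker host list below) must not have trailing spaces, and the file must not contain blank lines.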
Hadoop2
Hadoop3
Hadoop4
Format the NameNode: hdfs namenode -format
Start HDFS (NameNode, DataNodes, SecondaryNameNode): sbin/start-dfs.sh
Start YARN (ResourceManager, NodeManagers): sbin/start-yarn.sh
Start the HistoryServer: bin/mapred --daemon start historyserver
Kill a process: kill <PID>
Start a single daemon on one node: hdfs --daemon start datanode
Remote copy (over SSH): scp -r jdk1.8.0_212/ root@Hadoop3:/opt/Module/
Remote sync: rsync -rvl /opt/Module/hadoop-3.1.3/etc/hadoop/ root@Hadoop3:/opt/Module/hadoop-3.1.3/etc/hadoop/
Create a directory in HDFS: hadoop fs -mkdir /input
Upload a file to HDFS: hadoop fs -put $HADOOP_HOME/wcinput/word.txt /input
Port | Hadoop 2.x | Hadoop 3.x |
---|---|---|
NameNode internal communication port | 8020 / 9000 | 8020 / 9000 / 9820 |
NameNode HTTP UI | 50070 | 9870 |
YARN web UI for viewing running MapReduce jobs | 8088 | 8088 |
JobHistory server web UI port | 19888 | 19888 |
View the JobHistory UI at http://Hadoop2:19888/jobhistory
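Command | Usage |
---|---|
hadoop fs -help | Show help |
hadoop fs -help cat | Show detailed help for a single command (here, cat) |
-moveFromLocal | Move (cut and paste) a file from the local file system to HDFS |
-copyFromLocal | Copy a file from the local file system to an HDFS path |
-put | Same as copyFromLocal; put is more common in production |
-appendToFile | Append a file to the end of an existing file |
-copyToLocal | Copy from HDFS to the local file system |
-get | Same as copyToLocal; get is more common in production |
-ls | List directory contents |
-cat | Show file contents |
-chgrp, -chmod, -chown | Same usage as on a Linux file system; change a file's group, permissions, or owner |
-mkdir | Create a directory |
-cp | Copy from one HDFS path to another HDFS path |
-mv | Move files within HDFS |
-tail | Show the last 1 KB of a file |
-rm | Delete a file or directory |
-rm -r | Recursively delete a directory and its contents |
-du | Show directory size information |
-setrep | Set the replication factor of a file in HDFS |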
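Block size: small and mid-size companies typically use 128 MB, large companies 256 MB.
The block size is set through the dfs.blocksize configuration parameter.
The default is 128 MB in Hadoop 2.x/3.x and 64 MB in 1.x.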
Parameters for the NameNode and SecondaryNameNode working mechanism
Checkpoint timing
By default, the SecondaryNameNode performs a checkpoint once every hour.
<property>
<name>dfs.namenode.checkpoint.period</name>
<value>3600s</value>
</property>
The transaction count is checked once a minute; when it reaches 1 million, the SecondaryNameNode performs a checkpoint.
<property>
<name>dfs.namenode.checkpoint.txns</name>
<value>1000000</value>
<description>Number of transactions</description>
</property>
<property>
<name>dfs.namenode.checkpoint.check.period</name>
<value>60s</value>
<description>Check the transaction count once a minute</description>
</property>
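The two checkpoint conditions can be read as a simple "whichever comes first" rule. The following is a minimal sketch in plain Java, not Hadoop's own code; the class and method names are hypothetical, and the thresholds are the values configured in the properties for dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns.

```java
// Hypothetical helper, not part of Hadoop: models the two checkpoint triggers.
class CheckpointTrigger {
    // A checkpoint is due when either the time period has elapsed
    // or the number of uncheckpointed transactions has reached the limit.
    static boolean shouldCheckpoint(long secondsSinceLast, long txnsSinceLast,
                                    long periodSeconds, long txnLimit) {
        return secondsSinceLast >= periodSeconds || txnsSinceLast >= txnLimit;
    }

    public static void main(String[] args) {
        long period = 3600L;       // dfs.namenode.checkpoint.period
        long txnLimit = 1_000_000L; // dfs.namenode.checkpoint.txns
        System.out.println(shouldCheckpoint(3600, 10, period, txnLimit));       // time trigger
        System.out.println(shouldCheckpoint(120, 1_000_000, period, txnLimit)); // transaction trigger
        System.out.println(shouldCheckpoint(120, 10, period, txnLimit));        // neither
    }
}
```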
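Parameters for the DataNode working mechanism
Interval at which a DataNode reports its block information to the NameNode; the default is 6 hours.
<property>
<name>dfs.blockreport.intervalMsec</name>
<value>21600000</value>
</property>
Interval at which a DataNode scans the block list on its own node; the default is 6 hours.
<property>
<name>dfs.datanode.directoryscan.interval</name>
<value>21600s</value>
</property>
DataNode offline-timeout parameters (dfs.namenode.heartbeat.recheck-interval is in milliseconds, dfs.heartbeat.interval is in seconds)
<property>
<name>dfs.namenode.heartbeat.recheck-interval</name>
<value>300000</value>
</property>
<property>
<name>dfs.heartbeat.interval</name>
<value>3</value>
</property>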
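These two values combine into the time after which the NameNode treats a silent DataNode as dead: 2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval. The sketch below only does that arithmetic; the class name is hypothetical and it does not read any Hadoop configuration.

```java
import java.time.Duration;

// Hypothetical helper: computes the DataNode dead-node timeout from the two settings.
class DataNodeTimeout {
    // recheckIntervalMs is in milliseconds, heartbeatIntervalSec is in seconds.
    static Duration timeout(long recheckIntervalMs, long heartbeatIntervalSec) {
        return Duration.ofMillis(2 * recheckIntervalMs)
                .plus(Duration.ofSeconds(10 * heartbeatIntervalSec));
    }

    public static void main(String[] args) {
        Duration t = timeout(300_000L, 3L);
        // With the values above: 2 * 300000 ms + 10 * 3 s = 630000 ms (10 min 30 s)
        System.out.println(t.toMillis() + " ms");
    }
}
```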
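A user program is split into three parts: Mapper, Reducer, and Driver.
1. Mapper stage
(1) A user-defined Mapper must extend the framework's Mapper parent class.
(2) The Mapper's input is a key-value pair (the KV types are customizable).
(3) The Mapper's business logic goes in the map() method.
(4) The Mapper's output is a key-value pair (the KV types are customizable).
(5) The map() method (in the MapTask process) is called once for each input <K,V> pair.
2. Reducer stage
(1) A user-defined Reducer must extend the framework's Reducer parent class.
(2) The Reducer's input types match the Mapper's output types, and are also KV.
(3) The Reducer's business logic goes in the reduce() method.
(4) The ReduceTask process calls reduce() once for each group of <K,V> pairs that share the same key.
3. Driver stage
The Driver acts as the client of the YARN cluster and submits the whole program to it; what it submits is a Job object that encapsulates the MapReduce program's runtime parameters.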
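To make the three roles concrete before the Hadoop code, here is a plain-Java simulation of word count with no Hadoop dependency. It is only an illustration under simplifying assumptions: everything runs in one process, the "shuffle" is a sorted in-memory map, and all names are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of the Mapper / Reducer / Driver split (not Hadoop code).
class WordCountSketch {
    // "Mapper": called once per input line, emits (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // "Reducer": called once per key with all values for that key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // "Driver": wires the stages together; the TreeMap stands in for shuffle and sort.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, yarn=1}
        System.out.println(run(Arrays.asList("hello hadoop", "hello yarn")));
    }
}
```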
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.1.3</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.30</version>
</dependency>
</dependencies>
src/main/resources/log4j.properties
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
package MapReduce.wordCount;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* @author shkstart
* @create 2022-04-21 10:26
*/
public class wordCountMapper extends Mapper<LongWritable, Text,Text, IntWritable> {
private Text outk = new Text();
private IntWritable outv=new IntWritable(1);
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString(); // toString
String[] words = line.split(" "); // split()
for (String word : words) {
outk.set(word);
context.write(outk,outv); // Context.write()
}
}
}
package MapReduce.wordCount;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class wordCountReducer extends Reducer<Text, IntWritable,Text,IntWritable> {
private IntWritable outv = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum=0;
for (IntWritable value : values) {
sum+=value.get();
}
outv.set(sum);
context.write(key,outv);
}
}
package MapReduce.wordCount;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
* 1 Get the job
* 2 Set the jar path
* 3 Associate the mapper and reducer
* 4 Set the map output KV types
* 5 Set the final output KV types
* 6 Set the input and output paths
* 7 Submit the job
*/
public class wordCountDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
// 1 Get the job
Configuration configuration = new Configuration();
Job job = Job.getInstance(configuration);
// 2 Set the jar path
job.setJarByClass(wordCountDriver.class);
// 3 Associate the mapper and reducer
job.setMapperClass(wordCountMapper.class);
job.setReducerClass(wordCountReducer.class);
// 4 Set the map output KV types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// 5 Set the final output KV types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// 6 Set the input and output paths
FileInputFormat.setInputPaths(job, new Path("D:\\Big_Data\\input\\wordCountinput\\word.txt"));
FileOutputFormat.setOutputPath(job, new Path("D:\\Big_Data\\output"));
// 7 Submit the job
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}
}
If the following exception appears in local run mode:
Exception in thread "main" java.lang.UnsatisfiedLinkError:
org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
Fix: copy hadoop.dll into the Windows directory C:\Windows\System32.
Packaging as a jar and running on the Hadoop cluster
Dependencies
<!-- Build plugins needed to package the jar with Maven -->
<build>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.6.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
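The input and output paths in the Driver class must not be hard-coded;
otherwise the job fails on the cluster with: Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: D:%5CBig_Data%5Cinput%5CwordCount3%5Ctext.txt
// 6 Set the input and output paths from the command-line arguments
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
Then run the package phase of the Maven lifecycle.
Upload one of the resulting jars (one bundles the dependencies, the other does not; pick one) to the Hadoop cluster.
Run it with: hadoop jar wc.jar MapReduce.wordCount3.Driver /input/word.txt /opt/output
Note: both the input and output paths are HDFS paths on the cluster.
The results can be browsed through the NameNode web UI at Hadoop2:9870.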
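Making a custom bean object serializable (the Writable interface)
Serializing a bean object takes the following 7 steps:
(1) The class must implement the Writable interface.
(2) Deserialization calls the no-arg constructor via reflection, so a no-arg constructor is required.
(3) Override the serialization method write.
(4) Override the deserialization method readFields.
(5) The deserialization order must match the serialization order exactly.
(6) To show the result in the output file, override toString(); separating fields with "\t" makes later processing easier.
(7) If the custom bean is to be transferred as a key, it must also implement the Comparable interface, because the Shuffle phase of the MapReduce framework requires keys to be sortable.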
package MapReduce.writable2;
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
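/**
* 1. Define a class that implements the Writable interface
* 2. Override the serialization and deserialization methods
* 3. Provide a no-arg constructor
* 4. Override toString()
*/
public class PhoneBean implements Writable {
private long upFlow; // upstream traffic
private long downFlow; // downstream traffic
private long sumFlow; // total traffic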
public long getUpFlow() {
return upFlow;
}
public void setUpFlow(long upFlow) {
this.upFlow = upFlow;
}
public long getDownFlow() {
return downFlow;
}
public void setDownFlow(long downFlow) {
this.downFlow = downFlow;
}
public long getSumFlow() {
return sumFlow;
}
public void setSumFlow(long sumFlow) {
this.sumFlow = sumFlow;
}
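// Overload of setSumFlow(): derive the total from the two fields
public void setSumFlow() {
this.sumFlow = this.upFlow+this.downFlow ;
}
// Deserialization calls the no-arg constructor via reflection, so one must exist
public PhoneBean() {
}
// Serialization method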
@Override
public void write(DataOutput dataOutput) throws IOException {
dataOutput.writeLong(upFlow);
dataOutput.writeLong(downFlow);
dataOutput.writeLong(sumFlow);
}
// Deserialization method; fields must be read in the same order they were written
@Override
public void readFields(DataInput dataInput) throws IOException {
this.upFlow=dataInput.readLong();
this.downFlow=dataInput.readLong();
this.sumFlow=dataInput.readLong();
}
@Override
public String toString() {
return upFlow + "\t" + downFlow + "\t" + sumFlow ;
}
}
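The write/readFields pair can be exercised without a cluster, because the methods only depend on java.io's DataOutput and DataInput. The sketch below is a stand-in class (hypothetical name, no Hadoop import) that mirrors the three fields of the bean and round-trips them through a byte array; it illustrates why the read order has to match the write order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for the bean: same field layout, but no dependency on Hadoop's Writable.
class FlowRoundTrip {
    long upFlow;
    long downFlow;
    long sumFlow;

    // Serialize the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    // Deserialize in exactly the same order as write().
    void readFields(DataInput in) throws IOException {
        upFlow = in.readLong();
        downFlow = in.readLong();
        sumFlow = in.readLong();
    }

    // Serialize to bytes and read back into a fresh instance.
    static FlowRoundTrip roundTrip(long up, long down) {
        try {
            FlowRoundTrip source = new FlowRoundTrip();
            source.upFlow = up;
            source.downFlow = down;
            source.sumFlow = up + down;
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            source.write(new DataOutputStream(bytes));
            FlowRoundTrip copy = new FlowRoundTrip();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        FlowRoundTrip copy = roundTrip(2481L, 24681L);
        System.out.println(copy.upFlow + "\t" + copy.downFlow + "\t" + copy.sumFlow);
    }
}
```

If readFields read the fields in a different order, the copy would silently carry swapped values rather than fail, which is why step (5) is easy to get wrong.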
package MapReduce.writable2;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* @author shkstart
* @create 2022-04-25 8:40
*/
public class PhoneMapper extends Mapper<LongWritable,Text,Text, PhoneBean> {
private Text outK=new Text();
private PhoneBean outV = new PhoneBean();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// Get the line
String line = value.toString();
// Split on tabs
String[] lineSplit = line.split("\t");
// Pick the fields we need: phone number, downstream and upstream traffic
String phone = lineSplit[1] ;
String down = lineSplit[lineSplit.length-2] ;
String up = lineSplit[lineSplit.length-3] ;
// Populate the output key and value
outK.set(phone);
outV.setUpFlow(Long.parseLong(up));
outV.setDownFlow(Long.parseLong(down));
outV.setSumFlow();
// Write to the context
context.write(outK,outV);
}
}
package MapReduce.writable2;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* @author shkstart
* @create 2022-04-25 9:04
*/
public class PhoneReduce extends Reducer<Text,PhoneBean,Text,PhoneBean> {
private PhoneBean outV=new PhoneBean();
public PhoneReduce() {
}
@Override
protected void reduce(Text key, Iterable<PhoneBean> values, Context context) throws IOException, InterruptedException {
long totalUp=0;
long totalDown=0;
// Iterate over the values and accumulate the totals
for (PhoneBean value : values) {
totalUp+= value.getUpFlow();
totalDown+=value.getDownFlow();
}
// Populate the output value
outV.setUpFlow(totalUp);
outV.setDownFlow(totalDown);
outV.setSumFlow();
// Write out
context.write(key,outV);
}
}
package MapReduce.writable2;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class PhoneDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
// Get the job
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// Set the jar
job.setJarByClass(MapReduce.writable2.PhoneDriver.class);
// Associate the mapper and reducer
job.setMapperClass(MapReduce.writable2.PhoneMapper.class);
job.setReducerClass(MapReduce.writable2.PhoneReduce.class);
// Set the mapper output types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(PhoneBean.class);
// Set the final output types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(PhoneBean.class);
// Set the input and output paths
FileInputFormat.setInputPaths(job,new Path("D:\\Big_Data\\input\\inputflow\\phone_data.txt"));
FileOutputFormat.setOutputPath(job,new Path("D:\\Big_Data\\outup"));
// Submit the job
boolean b = job.waitForCompletion(true);
System.exit(b?0:1);
}
}
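MapTask parallelism is determined by the number of splits, and the number of splits is determined by the input files and the splitting rules.
1. Splitting mechanism
(1) Files are split simply by content length.
(2) The split size defaults to the block size.
(3) Splitting does not consider the data set as a whole; each file is split separately.
2. Worked example
(1) The input consists of two files:
file1.txt 320M
file2.txt 10M
(2) After FileInputFormat's splitting mechanism runs, the resulting splits are:
file1.txt.split1-- 0~128M
file1.txt.split2-- 128~256M
file1.txt.split3-- 256~320M
file2.txt.split1-- 0~10M
(1) Formula used in the source code to compute the split size
Math.max(minSize, Math.min(maxSize, blockSize));
mapreduce.input.fileinputformat.split.minsize defaults to 1
mapreduce.input.fileinputformat.split.maxsize defaults to Long.MAX_VALUE
Therefore, by default, split size = blockSize.
(2) Tuning the split size
maxsize (maximum split size): if set smaller than blockSize, splits become smaller and equal to the configured value.
minsize (minimum split size): if set larger than blockSize, splits become larger than blockSize.
(3) API for obtaining split information
// Get the split and cast it to a file split
FileSplit inputSplit = (FileSplit) context.getInputSplit();
// Get the name of the file the split belongs to
String name = inputSplit.getPath().getName();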
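The split-size formula and the worked example can be checked with a few lines of plain Java. This is a sketch rather than FileInputFormat itself: the class and method names are hypothetical, and the 1.1 slack factor (a file tail is cut off as a new split only when the remainder is more than 1.1 times the split size) reflects FileInputFormat's SPLIT_SLOP constant, which the notes above do not mention.

```java
// Hypothetical helper that mirrors the split-size arithmetic; it does not touch HDFS.
class SplitSketch {
    static final long MB = 1024L * 1024L;
    // Slack factor: keep cutting splits only while remaining / splitSize > 1.1.
    static final double SPLIT_SLOP = 1.1;

    // Same expression as the formula quoted from the source code.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Count how many splits a single file of the given length produces.
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the tail becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long splitSize = computeSplitSize(128 * MB, 1L, Long.MAX_VALUE);
        System.out.println("file1.txt splits: " + countSplits(320 * MB, splitSize));
        System.out.println("file2.txt splits: " + countSplits(10 * MB, splitSize));
    }
}
```

Under this rule a 129 MB file still yields a single split, since 129/128 is below 1.1.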
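Common FileInputFormat implementations include TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, CombineTextInputFormat, and custom InputFormat classes.
TextInputFormat is the default FileInputFormat implementation and reads records line by line. Its splitting mechanism plans splits per file: however small a file is, it becomes its own split and is handed to its own MapTask, so a large number of small files produces a large number of MapTasks and processing becomes extremely inefficient.
CombineTextInputFormat is meant for jobs with too many small files. It logically groups multiple small files into one split so that they can be handled by a single MapTask.
Setting the maximum virtual-storage split size
Add the following to the driver class:
// If no InputFormat is set, TextInputFormat.class is used by default
job.setInputFormatClass(CombineTextInputFormat.class);
// Set the maximum virtual-storage split size to 4 MB
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB
Note: choose the maximum virtual-storage split size according to the actual sizes of the small files.
Splitting process
(a) Check whether a virtual-storage file is at least as large as the setMaxInputSplitSize value; if it is, it forms a split on its own.
(b) Otherwise it is merged with the next virtual-storage file, and together they form one split.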
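Steps (a) and (b) amount to accumulating virtual-storage files until the running total reaches the configured maximum. The sketch below models only that merge step as described here; it is not CombineTextInputFormat's real implementation, the names are hypothetical, and the virtual file sizes in main are made-up sample values (in KB) taken as given.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the merge step: group virtual-storage files into splits.
class CombineSketch {
    // Returns the size of each resulting split, in the same unit as the inputs.
    static List<Long> mergeIntoSplits(List<Long> virtualFileSizes, long maxSplitSize) {
        List<Long> splits = new ArrayList<>();
        long current = 0;
        for (long size : virtualFileSizes) {
            current += size;
            // (a) once the accumulated size reaches the maximum, close the split
            if (current >= maxSplitSize) {
                splits.add(current);
                current = 0;
            }
            // (b) otherwise keep merging with the next virtual-storage file
        }
        if (current > 0) {
            splits.add(current); // leftover files form the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        // Sample virtual-storage file sizes in KB, with a 4 MB (4096 KB) maximum.
        List<Long> sizes = Arrays.asList(1700L, 2550L, 2550L, 3400L, 3400L, 3400L);
        System.out.println(mergeIntoSplits(sizes, 4096L)); // prints [4250, 5950, 6800]
    }
}
```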
package MapReduce.combineTextInputforamt2;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class mapper extends Mapper<LongWritable,Text,Text,IntWritable> {
public mapper() {
super();
}
private Text outK=new Text() ;
private IntWritable outV =new IntWritable(1);
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// Get the line
String line = value.toString();
// Split on spaces
String[] lineSplit = line.split(" ");
// Emit (word, 1) for every word
for (String s : lineSplit) {
outK.set(s);
// Write to the context
context.write(outK,outV);
}
}
}
package MapReduce.combineTextInputforamt2;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* @author shkstart
* @create 2022-04-25 10:25
*/
public class reducer extends Reducer<Text, IntWritable,Text,IntWritable> {
public reducer() {
super();
}
private IntWritable outV = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum=0;
for (IntWritable value : values) {
sum+=value.get();
}
outV.set(sum);
// Write to the context
context.write(key,outV);
}
}
package MapReduce.combineTextInputforamt2;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class driver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// Set the jar
job.setJarByClass(MapReduce.combineTextInputforamt2.driver.class);
// Associate the mapper and reducer
job.setMapperClass(MapReduce.combineTextInputforamt2.mapper.class);
job.setReducerClass(MapReduce.combineTextInputforamt2.reducer.class);
// Set the map output types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// Set the final output types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// Set the InputFormat implementation
job.setInputFormatClass(CombineTextInputFormat.class);
// Set the maximum virtual-storage split size to 20 MB
CombineTextInputFormat.setMaxInputSplitSize(job, 20971520);
// Set the input and output paths
FileInputFormat.setInputPaths(job,new Path("D:\\Big_Data\\input\\inputcombinetextinputformat"));
FileOutputFormat.setOutputPath(job,new Path("D:\\Big_Data\\output"));
// Submit the job
boolean b = job.waitForCompletion(true);
System.exit(b?0:1);
}
}
(1) Write a class that extends Partitioner and override getPartition().
public class CustomPartitioner extends Partitioner<Text, FlowBean> {
@Override
public int getPartition(Text key, FlowBean value, int numPartitions) {
// Partitioning logic goes here
……
return partition;
}
}
(2) Register the custom Partitioner in the job driver
job.setPartitionerClass(CustomPartitioner.class);
(3) After defining a custom Partitioner, set the number of ReduceTasks to match its logic
job.setNumReduceTasks(5);
phone_data.txt
1 13736230513 192.196.100.1 www.atguigu.com 2481 24681 200
2 13846544121 192.196.100.2 264 0 200
3 13956435636 192.196.100.3 132 1512 200
4 13966251146 192.168.100.1 240 0 404
5 18271575951 192.168.100.2 www.atguigu.com 1527 2106 200
6 84188413 192.168.100.3 www.atguigu.com 4116 1432 200
7 13590439668 192.168.100.4 1116 954 200
8 15910133277 192.168.100.5 www.hao123.com 3156 2936 200
9 13729199489 192.168.100.6 240 0 200
10 13630577991 192.168.100.7 www.shouhu.com 6960 690 200
11 15043685818 192.168.100.8 www.baidu.com 3659 3538 200
12 15959002129 192.168.100.9 www.atguigu.com 1938 180 500
13 13560439638 192.168.100.10 918 4938 200
14 13470253144 192.168.100.11 180 180 200
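Expected output: phone numbers starting with 136, 137, 138, and 139 each go into their own file (four files in total), and all other prefixes go into a fifth file.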
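Before wiring this into a job, the prefix-to-partition mapping can be tried on the sample phone numbers in plain Java. This is a sketch with hypothetical names and no Hadoop types; it also guards against keys shorter than three characters, which is an added assumption rather than something the requirement states.

```java
// Hypothetical stand-alone version of the partitioning rule for quick checking.
class PhonePartitionSketch {
    // Maps a phone number to a partition index in the range 0..4.
    static int partitionFor(String phone) {
        // Assumption: keys shorter than 3 characters fall into the catch-all partition.
        String prefix = phone.length() >= 3 ? phone.substring(0, 3) : phone;
        switch (prefix) {
            case "136": return 0;
            case "137": return 1;
            case "138": return 2;
            case "139": return 3;
            default:    return 4;
        }
    }

    public static void main(String[] args) {
        String[] phones = {"13630577991", "13736230513", "13846544121", "13956435636", "84188413"};
        for (String phone : phones) {
            System.out.println(phone + " -> partition " + partitionFor(phone));
        }
    }
}
```

Because the rule can return indices 0 through 4, the job needs five ReduceTasks, one per output file.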
package MapReduce.Partitoner3;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class PhoneDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
//jar
job.setJarByClass(MapReduce.Partitoner3.PhoneDriver.class);
//map
job.setMapperClass(MapReduce.Partitoner3.PhoneMapper.class);
// Set the partitioner class and the matching number of ReduceTasks
job.setPartitionerClass(MapReduce.Partitoner3.PhonePartitoner.class);
job.setNumReduceTasks(5);
// Set the map output KV types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(PhoneBean.class);
// Set the final output KV types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(PhoneBean.class);
// Set the input and output paths
FileInputFormat.setInputPaths(job,new Path("D:\\Big_Data\\input\\inputflow\\phone_data.txt"));
FileOutputFormat.setOutputPath(job,new Path("D:\\Big_Data\\output"));
// Submit the job
boolean b = job.waitForCompletion(true);
System.exit(b?0:1);
}
}
package MapReduce.Partitoner3;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class PhonePartitoner extends Partitioner<Text,PhoneBean> {
@Override
public int getPartition(Text text, PhoneBean phoneBean, int numPartitions) {
// Partitioning logic: take the first three digits of the phone number
String phoneNum = text.toString();
String subPhone = phoneNum.substring(0, 3);
// Choose the partition from the prefix
if ("136".equals(subPhone)) {
return 0;
} else if ("137".equals(subPhone)) {
return 1;
} else if ("138".equals(subPhone)) {
return 2;
} else if ("139".equals(subPhone)) {
return 3;
} else {
return 4;
}
}
}
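Sorting is one of the most important operations in the MapReduce framework.
Both MapTask and ReduceTask sort data by key; this is Hadoop's default behavior.
The default ordering is lexicographic, and the sort is implemented with quicksort.
A MapTask places its results temporarily in a circular buffer. When buffer usage reaches a threshold, it quicksorts the buffered data and spills the sorted data to disk; once all data has been processed, it merge-sorts all the spill files on disk.
A ReduceTask remotely copies the relevant data files from every MapTask. A file larger than a threshold is spilled to disk; otherwise it is kept in memory. When the number of files on disk reaches a threshold, they are merge-sorted into one larger file; when the size or number of files in memory exceeds a threshold, they are merged and spilled to disk. After all data has been copied, the ReduceTask performs one final merge sort over everything in memory and on disk.
In a total sort, the final output is a single file that is internally ordered. This is achieved by configuring only one ReduceTask. The approach is very inefficient for large files, because a single machine processes everything and the parallelism that MapReduce provides is lost.
When a bean object is used as the key, it must implement the WritableComparable interface and override compareTo; that is enough to get sorting.
Source data. Requirement: sort by total traffic in descending order.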
13470253144 180 200 380
13509468723 110349 404 110753
13560439638 4938 200 5138
13568436656 24681 200 24881
13568436656 954 200 1154
13590439668 954 200 1154
13630577991 690 200 890
13682846555 2910 200 3110
13729199489 0 200 200
13736230513 24681 200 24881
13768778790 120 200 320
13846544121 0 200 200
13956435636 1512 200 1712
13966251146 0 404 404
13975057813 48243 200 48443
13992314666 3720 200 3920
15043685818 3538 200 3738
15910133277 2936 200 3136
15959002129 180 500 680
18271575951 2106 200 2306
18390173782 2412 200 2612
84188413 1432 200 1632
Result:
13509468723 110349 404 110753
13975057813 48243 200 48443
13568436656 24681 200 24881
13736230513 24681 200 24881
13560439638 4938 200 5138
13992314666 3720 200 3920
15043685818 3538 200 3738
15910133277 2936 200 3136
13682846555 2910 200 3110
18390173782 2412 200 2612
18271575951 2106 200 2306
13956435636 1512 200 1712
84188413 1432 200 1632
13590439668 954 200 1154
13568436656 954 200 1154
13630577991 690 200 890
15959002129 180 500 680
13966251146 0 404 404
13470253144 180 200 380
13768778790 120 200 320
13729199489 0 200 200
13846544121 0 200 200
package MapReduce.PaiXu;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class PhoneBean implements WritableComparable<PhoneBean> {
private long upFlow ;
private long downFlow ;
private long sumFlow ;
public PhoneBean() {
}
public long getUpFlow() {
return upFlow;
}
public void setUpFlow(long upFlow) {
this.upFlow = upFlow;
}
public long getDownFlow() {
return downFlow;
}
public void setDownFlow(long downFlow) {
this.downFlow = downFlow;
}
public long getSumFlow() {
return sumFlow;
}
public void setSumFlow(long sumFlow) {
this.sumFlow = sumFlow;
}
public void setSumFlow() {
this.sumFlow = this.upFlow+this.downFlow ;
}
@Override
public String toString() {
return upFlow + "\t" + downFlow + "\t" + sumFlow ;
}
// Serialization
@Override
public void write(DataOutput dataOutput) throws IOException {
dataOutput.writeLong(upFlow);
dataOutput.writeLong(downFlow);
dataOutput.writeLong(sumFlow);
}
// Deserialization
@Override
public void readFields(DataInput dataInput) throws IOException {
this.upFlow = dataInput.readLong();
this.downFlow = dataInput.readLong();
this.sumFlow = dataInput.readLong();
}
// Ordering: larger total traffic sorts first (descending)
@Override
public int compareTo(PhoneBean o) {
if (this.sumFlow>o.sumFlow){
return -1;
}else if (this.sumFlow<o.sumFlow){
return 1;
}else {
return 0;
}
}
}
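The descending order produced by compareTo can be verified outside Hadoop with an ordinary list sort. The sketch below uses a reduced stand-in class with hypothetical names that keeps only the phone number and total traffic, and expresses the same comparison with Long.compare.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Stand-alone check of a descending compareTo; not tied to Hadoop's WritableComparable.
class FlowSortSketch {
    static final class Flow implements Comparable<Flow> {
        final String phone;
        final long sumFlow;

        Flow(String phone, long sumFlow) {
            this.phone = phone;
            this.sumFlow = sumFlow;
        }

        // Arguments are swapped on purpose so that larger totals sort first.
        @Override
        public int compareTo(Flow other) {
            return Long.compare(other.sumFlow, this.sumFlow);
        }
    }

    // Returns the phone numbers ordered by total traffic, largest first.
    static List<String> phonesByDescendingFlow(List<Flow> flows) {
        List<Flow> sorted = new ArrayList<>(flows);
        Collections.sort(sorted);
        List<String> phones = new ArrayList<>();
        for (Flow flow : sorted) {
            phones.add(flow.phone);
        }
        return phones;
    }

    public static void main(String[] args) {
        List<Flow> flows = Arrays.asList(
                new Flow("13470253144", 380L),
                new Flow("13509468723", 110753L),
                new Flow("13560439638", 5138L));
        // prints [13509468723, 13560439638, 13470253144]
        System.out.println(phonesByDescendingFlow(flows));
    }
}
```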
package MapReduce.PaiXu;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class PhoneMapper extends Mapper<LongWritable, Text, PhoneBean,Text> {
private PhoneBean outK = new PhoneBean();
private Text outV = new Text();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// Get the line
String line = value.toString();
// Split on tabs
String[] lineSplit = line.split("\t");
// Populate the output key (traffic bean) and value (phone number)
outK.setUpFlow(Long.parseLong(lineSplit[lineSplit.length-2]));
outK.setDownFlow(Long.parseLong(lineSplit[lineSplit.length-1]));
outK.setSumFlow();
outV.set(lineSplit[1]);
// Write out
context.write(outK,outV);
}
}
package MapReduce.PaiXu;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* @author shkstart
* @create 2022-04-27 9:14
*/
public class PhoneReducer extends Reducer<PhoneBean, Text,Text,PhoneBean> {
@Override
protected void reduce(PhoneBean key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
// Iterate over the values and write each one, so phones with the same total traffic are all kept
for (Text value : values) {
// Swap key and value back when writing out
context.write(value,key);
}
}
}
package MapReduce.PaiXu;
import MapReduce.Partitoner3.PhonePartitoner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class PhoneDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
//jar
job.setJarByClass(PhoneDriver.class);
//map
job.setMapperClass(PhoneMapper.class);
//reduce
job.setReducerClass(MapReduce.PaiXu.PhoneReducer.class);
// Set the map output KV types
job.setMapOutputKeyClass(PhoneBean.class);
job.setMapOutputValueClass(Text.class);
// Set the final output KV types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(PhoneBean.class);
// Set the input and output paths
FileInputFormat.setInputPaths(job,new Path("D:\\Big_Data\\input\\inputflow\\phone_data.txt"));
FileOutputFormat.setOutputPath(job,new Path("D:\\Big_Data\\output"));
// Submit the job
boolean b = job.waitForCompletion(true);
System.exit(b?0:1);
}
}
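Building on the total sort, add a custom partitioner class and register it in the driver. PhoneBean, PhoneMapper, and PhoneReducer are the same as in the total-sort example.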
package MapReduce.PaiXu;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
/**
* @author shkstart
* @create 2022-04-27 9:47
*/
public class PhonePartition extends Partitioner<PhoneBean, Text> {
private int district;
@Override
public int getPartition(PhoneBean phoneBean, Text text, int i) {
// Take the first three digits of the phone number
String phoneNum = text.toString();
String subPhone = phoneNum.substring(0, 3);
if ("136".equals(subPhone)){
district = 0;
}else if ("137".equals(subPhone)){
district = 1;
}else if ("138".equals(subPhone)){
district = 2;
}else if ("139".equals(subPhone)){
district = 3;
}else{
district = 4;
}
return district;
}
}
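The routing rule in PhonePartition can be exercised on its own. A minimal sketch, assuming phone numbers arrive as plain strings; `PrefixPartitionDemo` and `partitionFor` are illustrative names, not part of Hadoop. Unlike the class above, this version sends inputs shorter than three characters to the catch-all partition instead of throwing.

```java
class PrefixPartitionDemo {
    // Map a phone number to a partition index in the range 0..4.
    static int partitionFor(String phoneNum) {
        if (phoneNum == null || phoneNum.length() < 3) {
            return 4; // no usable prefix: catch-all partition (an assumption of this sketch)
        }
        switch (phoneNum.substring(0, 3)) {
            case "136": return 0;
            case "137": return 1;
            case "138": return 2;
            case "139": return 3;
            default:    return 4;
        }
    }

    public static void main(String[] args) {
        String[] samples = {"13612345678", "13987654321", "15000000000"};
        for (String s : samples) {
            System.out.println(s + " -> partition " + partitionFor(s));
        }
    }
}
```

The five possible return values correspond to the `job.setNumReduceTasks(5)` call in the driver: each partition index is handled by one ReduceTask and produces one output file.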
package MapReduce.PaiXu;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* @author shkstart
* @create 2022-04-27 9:14
*/
public class PhoneReducer extends Reducer<PhoneBean, Text,Text,PhoneBean> {
@Override
protected void reduce(PhoneBean key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
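// Iterate over values and write each one: several phone numbers can share the same total flow, and therefore the same key
for (Text value : values) {
// Swap key and value on output: phone number first, then the flow bean
context.write(value,key);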
}
}
}
package MapReduce.PaiXu;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class PhoneDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
//jar
job.setJarByClass(PhoneDriver.class);
//map
job.setMapperClass(PhoneMapper.class);
//reduce
job.setReducerClass(MapReduce.PaiXu.PhoneReducer.class);
//partition and ReduceTask
job.setPartitionerClass(MapReduce.PaiXu.PhonePartition.class);
job.setNumReduceTasks(5);
// Key/value types of the map output
job.setMapOutputKeyClass(PhoneBean.class);
job.setMapOutputValueClass(Text.class);
// Key/value types of the final output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(PhoneBean.class);
// Input and output paths
FileInputFormat.setInputPaths(job,new Path("D:\\Big_Data\\input\\inputflow\\phone_data.txt"));
FileOutputFormat.setOutputPath(job,new Path("D:\\Big_Data\\output"));
// Submit the job
boolean b = job.waitForCompletion(true);
System.exit(b?0:1);
}
}
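(1) A Combiner is a component of an MR program separate from the Mapper and the Reducer.
(2) The parent class of a Combiner is Reducer.
(3) A Combiner differs from a Reducer in where it runs:
the Combiner runs on the node of each MapTask;
the Reducer receives the output of all Mappers globally.
(4) The purpose of a Combiner is to aggregate each MapTask's output locally, which reduces the amount of data sent over the network.
(5) A Combiner may be used only if it does not change the final business logic, and its output key/value types must match the Reducer's input key/value types.
To write a custom Combiner, extend Reducer, override the reduce method, and register the Combiner class in the Job driver.
(WordcountReducer can also be specified as the Combiner in the WordcountDriver driver class.)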
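Point (5) is the condition that is easiest to violate. The sketch below uses plain Java lists in place of MapTask outputs (the class and method names are illustrative). It shows that summing per task and then summing the partial results matches a single global sum, whereas averaging per task and then averaging the partial averages does not match the global average.

```java
import java.util.Arrays;
import java.util.List;

class CombinerSafetyDemo {
    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    static double average(List<Long> values) {
        return (double) sum(values) / values.size();
    }

    public static void main(String[] args) {
        // Values for one key, as emitted by two separate map tasks.
        List<Long> task1 = Arrays.asList(3L, 5L, 7L);
        List<Long> task2 = Arrays.asList(2L, 6L);
        List<Long> all = Arrays.asList(3L, 5L, 7L, 2L, 6L);

        // Sum: combining locally first does not change the result.
        long combinedSum = sum(Arrays.asList(sum(task1), sum(task2)));
        System.out.println("sum: " + sum(all) + " vs " + combinedSum);

        // Average: averaging the per-task averages changes the result.
        double avgOfAvgs = (average(task1) + average(task2)) / 2;
        System.out.println("avg: " + average(all) + " vs " + avgOfAvgs);
    }
}
```

With these inputs the two sums agree (23 and 23) while the averages differ (4.6 against 4.5), so a summing Reducer can double as a Combiner but an averaging one cannot.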
Worked example
Requirement: count the number of occurrences of each word. Input data:
aa aa aa aa aa ab
c2 c3 c5 ccx cc
b2 b1d bc
d2 d1 d3c
Result:
aa 5
ab 1
b1d 1
b2 1
bc 1
c2 1
c3 1
c5 1
cc 1
ccx 1
d1 1
d2 1
d3c 1
package MapReduce.CombinerTest;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class WordMapper extends Mapper<LongWritable, Text,Text,LongWritable> {
private Text outK = new Text();
private LongWritable outV = new LongWritable(1);
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// Get the line, e.g. "aa aa aa aa aa ab"
String line = value.toString();
String[] linesSplit = line.split(" ");
for (String s : linesSplit) {
outK.set(s);
context.write(outK,outV);
}
}
}
package MapReduce.CombinerTest;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* @author shkstart
* @create 2022-04-27 14:01
*/
public class WordCombiner extends Reducer<Text, LongWritable,Text, LongWritable> {
private LongWritable outV = new LongWritable();
@Override
protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
long sum = 0; // long, because LongWritable.get() returns long
for (LongWritable value : values) {
sum+= value.get();
}
outV.set(sum);
context.write(key,outV);
}
}
package MapReduce.CombinerTest;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class WordReduce extends Reducer<Text, LongWritable,Text,LongWritable> {
private LongWritable outV = new LongWritable();
@Override
protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
long sum = 0; // long, because LongWritable.get() returns long
for (LongWritable value : values) {
sum +=value.get();
}
outV.set(sum);
context.write(key,outV);
}
}
package MapReduce.CombinerTest;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class WordDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
//jar
job.setJarByClass(MapReduce.CombinerTest.WordDriver.class);
//map
job.setMapperClass(MapReduce.CombinerTest.WordMapper.class);
//combiner
job.setCombinerClass(MapReduce.CombinerTest.WordCombiner.class);
//reducer
job.setReducerClass(MapReduce.CombinerTest.WordReduce.class);
// map output and final output types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
// input and output paths
FileInputFormat.setInputPaths(job,new Path("D:\\Big_Data\\input\\PaiXu"));
FileOutputFormat.setOutputPath(job,new Path("D:\\Big_Data\\output"));
// submit the job
boolean b = job.waitForCompletion(true);
System.exit(b?0:1);
}
}
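OutputFormat is the base class for MapReduce output; every MapReduce output implementation derives from it. The default output format is TextOutputFormat.
A custom OutputFormat is useful when results should go somewhere other than plain text files, for example to storage systems such as MySQL, HBase, or Elasticsearch.
To write one: define a class that extends FileOutputFormat, define a second class that extends RecordWriter and put the actual output logic in its write() method, then register the output format in the driver class.
Requirement: filter the input log so that sites containing atguigu are written to e:/atguigu.log and all other sites are written to e:/other.log (the code below writes to d:/Big_Data/ instead).
Input data: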
http://www.baidu.com
http://www.google.com
http://cn.bing.com
http://www.atguigu.com
http://www.sohu.com
http://www.sina.com
http://www.sin2a.com
http://www.sin2desa.com
http://www.sindsafa.com
Result:
atguigu.log:
http://www.atguigu.com
other.log:
http://cn.bing.com
http://www.baidu.com
http://www.google.com
http://www.sin2a.com
http://www.sin2desa.com
http://www.sina.com
http://www.sindsafa.com
http://www.sohu.com
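The routing that the custom RecordWriter has to perform can be sketched without Hadoop. This is a minimal illustration, assuming in-memory Writer targets in place of the two output files; `LogRouterDemo` and its methods are hypothetical names that only mirror the shape of RecordWriter's write() and close().

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

class LogRouterDemo {
    private final Writer atguiguOut;
    private final Writer otherOut;

    LogRouterDemo(Writer atguiguOut, Writer otherOut) {
        this.atguiguOut = atguiguOut;
        this.otherOut = otherOut;
    }

    // Pick the target by content, then append the line and a newline.
    void write(String line) throws IOException {
        Writer target = line.contains("atguigu") ? atguiguOut : otherOut;
        target.write(line);
        target.write('\n');
    }

    // Release both targets once all records have been written.
    void close() throws IOException {
        atguiguOut.close();
        otherOut.close();
    }

    public static void main(String[] args) throws IOException {
        StringWriter atguigu = new StringWriter();
        StringWriter other = new StringWriter();
        LogRouterDemo router = new LogRouterDemo(atguigu, other);
        String[] urls = {"http://www.baidu.com", "http://www.atguigu.com", "http://www.sohu.com"};
        for (String url : urls) {
            router.write(url);
        }
        router.close();
        System.out.print("atguigu.log:\n" + atguigu + "other.log:\n" + other);
    }
}
```

Both targets are opened once and closed once, rather than per record; the Hadoop version follows the same lifecycle by creating its streams in the constructor and closing them in close().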
package MapReduce.outputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* @author shkstart
* @create 2022-04-27 15:03
*/
public class logMapper extends Mapper<LongWritable, Text,Text, NullWritable> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
context.write(value,NullWritable.get());
}
}
package MapReduce.outputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* @author shkstart
* @create 2022-04-27 15:08
*/
public class logReducer extends Reducer<Text, NullWritable,Text,NullWritable> {
@Override
protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
// Iterate so that duplicate lines are each written out
for (NullWritable value : values) {
context.write(key,value);
}
}
}
package MapReduce.outputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
* @author shkstart
* @create 2022-04-27 15:05
*/
public class logOutputFormat extends FileOutputFormat<Text, NullWritable> {
@Override
public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
logRWriter logRWriter = new logRWriter(job);
return logRWriter;
}
}
package MapReduce.outputFormat;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import java.io.IOException;
/**
* @author shkstart
* @create 2022-04-27 15:23
*/
public class logRWriter extends RecordWriter<Text, NullWritable> {
private FSDataOutputStream fileDataAtguigu ;
private FSDataOutputStream fileDataOther ;
public logRWriter(TaskAttemptContext job) throws IOException {
// Get the file system object
FileSystem fs = FileSystem.get(job.getConfiguration());
// Create the two output streams; a failure here propagates to getRecordWriter
// instead of being swallowed and leaving the streams null
fileDataAtguigu = fs.create(new Path("d:/Big_Data/atguigu.log"));
fileDataOther = fs.create(new Path("d:/Big_Data/other.log"));
}
@Override
public void write(Text text, NullWritable nullWritable) throws IOException, InterruptedException {
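// Get the line
String line = text.toString();
// Encode as UTF-8: writeBytes(String) keeps only the low byte of each char
byte[] bytes = (line + "\n").getBytes(java.nio.charset.StandardCharsets.UTF_8);
// Route by content
if(line.contains("atguigu")){
fileDataAtguigu.write(bytes);
}else {
fileDataOther.write(bytes);
}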
}
@Override
public void close(TaskAttemptContext context) throws IOException, InterruptedException {
// Close both streams
IOUtils.closeStream(fileDataAtguigu);
IOUtils.closeStream(fileDataOther);
}
}
package MapReduce.outputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class logDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
//jar
job.setJarByClass(MapReduce.outputFormat.logDriver.class);
//map
job.setMapperClass(MapReduce.outputFormat.logMapper.class);
//reducer
job.setReducerClass(MapReduce.outputFormat.logReducer.class);
//fileoutputformat
job.setOutputFormatClass(MapReduce.outputFormat.logOutputFormat.class);
// map output and final output types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(NullWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
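// input and output paths (FileOutputFormat still needs an output directory, where it writes the _SUCCESS marker)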
FileInputFormat.setInputPaths(job,new Path("D:\\Big_Data\\input\\inputoutputformat"));
FileOutputFormat.setOutputPath(job,new Path("D:\\Big_Data\\output"));
// submit the job
boolean b = job.waitForCompletion(true);
System.exit(b?0:1);
}
}