In the previous chapters we mainly covered HDFS. Starting with this chapter, we will walk through the commonly used Hadoop MapReduce.
Some people feel that, now that Spark has all but taken over, there is little point in learning Map/Reduce. I still think it is worthwhile, mainly for the following reasons:
- Hive runs on top of Hadoop's Map/Reduce; learning the basic Map/Reduce operations helps you understand how Hive executes queries.
- The Map/Reduce workflow touches many classic big-data problems, which are valuable references for later work.
The code for this article can be found in my GitHub project https://github.com/SeanYanxml/bigdata/ . PS: if you find the project useful, feel free to give it a Star.
# File hello2019.sh (any content will do)
hello 2019
cat
pitty
kitty
able
pitty
cat
# Create the input directory
hadoop fs -mkdir -p /wordcount/input
# Upload the file
hadoop fs -put hello2019.sh /wordcount/input/
OK, with the preparation above done, let's move on to writing the code.
First, the pom.xml file, which pulls in the required jars:
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>com.yanxml</groupId>
    <artifactId>bigdata</artifactId>
    <version>0.0.1-SNAPSHOT</version>
  </parent>
  <artifactId>hadoop</artifactId>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-common</artifactId>
      <version>2.7.5</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.7.5</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>com.alibaba</groupId>
      <artifactId>fastjson</artifactId>
      <version>1.2.28</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.3</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals><goal>shade</goal></goals>
            <configuration>
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                  <mainClass>com.yanxml.bigdata.hadoop.mr.wordcount.WordcountDriver</mainClass>
                </transformer>
              </transformers>
            </configuration>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
The custom mapper extends org.apache.hadoop.mapreduce.Mapper. (Note: the Mapper under the mapred package is the Hadoop 1.x API; we do not use that version.) The main job of WordcountMapper is to split the input data and write the results into the context, so that the Reducer can pick them up.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Mapper
 * The framework reads the input and hands it to us as <key, value> pairs.
 * KEYIN: by default, the start offset of the line of text the MR framework has read; logically a Long,
 *        but Hadoop has its own compact serialization types, so LongWritable is used instead of Long.
 * VALUEIN: by default, the content of that line of text; logically a String, likewise Text is used.
 *
 * KEYOUT: the key of the output produced by the user-defined logic; here it is the word, so Text.
 * VALUEOUT: the value of the output produced by the user-defined logic; here it is the word count, so IntWritable.
 */
public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    /**
     * Override the parent class's map() method.
     * The business logic of the map phase is written inside this custom map() method.
     */
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Convert the line of text handed to us by the map task into a String.
        String line = value.toString();
        // Split the line into words on spaces.
        String[] words = line.split(" ");
        // Emit each word as <word, 1>.
        for (String word : words) {
            // Use the word as the key and 1 as the value, so the shuffle distributes records by word
            // and identical words end up in the same reduce task.
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
The custom reducer extends org.apache.hadoop.mapreduce.Reducer. It reads the <word, 1> pairs produced by the Mapper and counts how many times each word occurs. Note that the values arrive grouped by key: pairs with the same key are merged and delivered together (for example, all the <cat, 1> items are handed to the reducer inside a single Iterator).
The Iterator can be traversed either with hasNext() or with a for(IntWritable value : values) loop. The code is as follows.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * KEYIN, VALUEIN correspond to the Mapper's KEYOUT, VALUEOUT types.
 *
 * KEYOUT, VALUEOUT are the output types of the user-defined reduce logic:
 * KEYOUT is the word,
 * VALUEOUT is the total count.
 */
public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    /**
     * The key parameter is the key shared by one group of <word, 1> pairs.
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        Iterator<IntWritable> it = values.iterator();
        while (it.hasNext()) {
            count += it.next().get();
        }
        context.write(key, new IntWritable(count));
        // Equivalent for-each form:
        // for (IntWritable value : values) {
        //     count += value.get();
        // }
    }
}
Starting a Map/Reduce Job mainly involves the following steps:
1. Create a Configuration and set the runtime parameters, for example:
conf.set("mapreduce.framework.name", "yarn");
// conf.set("mapreduce.framework.name", "local");
conf.set("yarn.resourcemanager.hostname", "localhost");
conf.set("fs.defaultFS", "hdfs://localhost:9000/");
2. Create the Job from the Configuration and configure it: the jar (or the class used to locate it), the Mapper and Reducer classes, the output KV types, optional components such as the Combiner, Partitioner and GroupingComparatorClass, and the input/output directories.
3. Submit the job via submit() or waitForCompletion().
The full code is shown below:
package com.yanxml.bigdata.hadoop.mr.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Acts as a client of the YARN cluster.
 * It wraps the runtime parameters of our MR program, points to the jar,
 * and finally submits the job to YARN.
 */
public class WordcountDriver {

    public static void main(String[] args) throws IllegalArgumentException, IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");
        // conf.set("mapreduce.framework.name", "local");
        conf.set("yarn.resourcemanager.hostname", "localhost");
        conf.set("fs.defaultFS", "hdfs://localhost:9000/");
        Job job = Job.getInstance(conf);

        // Point to the jar that contains this program.
        job.setJar("/Users/Sean/Documents/Gitrep/bigdata/hadoop/target/hadoop-0.0.1-SNAPSHOT.jar");
        // job.setJarByClass(WordcountDriver.class);

        // Mapper and Reducer classes used by this job.
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);

        // KV types of the mapper output.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // KV types of the final output.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Use a Combiner, and specify which class implements the combiner logic.
        job.setCombinerClass(WordCountCombiner.class);

        // If no InputFormat is set, TextInputFormat.class is used by default.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);
        CombineTextInputFormat.setMinInputSplitSize(job, 2097152);

        // Input directory of the job.
        // FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileInputFormat.setInputPaths(job, new Path("/wordcount/input"));
        // Output directory of the job.
        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path("/wordcount/output"));

        // Submit the job configuration, together with the jar containing the job classes, to YARN.
        // job.submit();
        boolean flag = job.waitForCompletion(true);
        System.exit(flag ? 0 : 1);
    }
}
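The Driver above registers a WordCountCombiner via job.setCombinerClass(...), but that class is not listed in this post. Below is a minimal sketch, assuming the combiner simply pre-sums the <word, 1> pairs on the map side with the same logic as the reducer (only the class name has to match what the Driver references):
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * A Combiner is a "mini reducer" that runs on the map side, so it also extends Reducer.
 * Its input and output KV types must both match the Mapper's output types.
 */
public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        // Pre-aggregate the counts locally so that less data is shuffled to the reduce tasks.
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}
Because word counting is associative and commutative, running the reduce logic as a combiner does not change the final result; it only cuts down the shuffle traffic.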
In the Driver code above:
- conf.set("mapreduce.framework.name", "yarn"); means the job runs on the Yarn platform, while conf.set("mapreduce.framework.name", "local"); means it runs inside the local JVM;
- job.setJar("/Users/Sean/Documents/Gitrep/bigdata/hadoop/target/hadoop-0.0.1-SNAPSHOT.jar"); tells the job that its classes live inside that jar, which is the usual configuration when submitting to Yarn with the hadoop jar command;
- job.setJarByClass(WordcountDriver.class); locates the jar through the class itself, which is what we normally use in local mode.
These settings come in pairs: the "yarn" setting goes with setJar(...), and the "local" setting goes with setJarByClass(...), depending on whether the job runs on Yarn or locally.
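To make the pairing concrete, a local-mode variant of the Driver configuration could look roughly like this (a sketch only; the file:/// default filesystem is an assumption for reading local files, and an HDFS URI also works with the local framework):
Configuration conf = new Configuration();
// Run the MR framework inside the local JVM instead of on YARN.
conf.set("mapreduce.framework.name", "local");
// Illustrative: read and write the local filesystem instead of HDFS.
conf.set("fs.defaultFS", "file:///");

Job job = Job.getInstance(conf);
// In local mode the jar can be located through the driver class itself.
job.setJarByClass(WordcountDriver.class);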
We package the project with the maven-shade-plugin mentioned earlier; its configuration was already given at the beginning of this article and is not repeated here. Running the mvn package command is enough to build the jar.
When running the program, depending on the configuration you can execute it in local mode or in Yarn mode. Both can be launched from Eclipse; the differences are the configuration, and the fact that Yarn mode additionally requires packaging the jar first.
When submitting to Yarn, there are again two ways to launch the job: java -cp and hadoop jar.
java -cp hadoop-0.0.1-SNAPSHOT.jar com.yanxml.bigdata.hadoop.mr.wordcount.WordcountDriver /wordcount/input /wordcount/output
hadoop jar hadoop-0.0.1-SNAPSHOT.jar /wordcount/input /wordcount/output
The last two arguments are the input and output paths (note that the Driver above hard-codes these paths; see the sketch below for the argument-driven variant).
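A small sketch of the argument-driven variant, which the commented-out lines in the Driver already hint at:
// Inside WordcountDriver.main(String[] args): take the paths from the command line.
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));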
For hadoop jar, if the main class was already chosen when the jar was packaged (via the shade plugin's mainClass), it can be omitted from the command line; otherwise the fully qualified driver class must be supplied. The run log looks like this:
localhost:target Sean$ hadoop jar hadoop-0.0.1-SNAPSHOT.jar /wordcount/input /wordcount/output
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
19/04/03 20:41:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/04/03 20:41:16 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
19/04/03 20:41:16 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
19/04/03 20:41:17 INFO input.FileInputFormat: Total input paths to process : 1
19/04/03 20:41:17 INFO mapreduce.JobSubmitter: number of splits:1
19/04/03 20:41:17 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1553933297569_0003
19/04/03 20:41:18 INFO impl.YarnClientImpl: Submitted application application_1553933297569_0003
19/04/03 20:41:18 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1553933297569_0003/
19/04/03 20:41:18 INFO mapreduce.Job: Running job: job_1553933297569_0003
19/04/03 20:41:26 INFO mapreduce.Job: Job job_1553933297569_0003 running in uber mode : false
19/04/03 20:41:26 INFO mapreduce.Job: map 0% reduce 0%
19/04/03 20:41:31 INFO mapreduce.Job: map 100% reduce 0%
19/04/03 20:41:37 INFO mapreduce.Job: map 100% reduce 100%
19/04/03 20:41:37 INFO mapreduce.Job: Job job_1553933297569_0003 completed successfully
19/04/03 20:41:37 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=96
FILE: Number of bytes written=243449
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=157
HDFS: Number of bytes written=44
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3156
Total time spent by all reduces in occupied slots (ms)=2911
Total time spent by all map tasks (ms)=3156
Total time spent by all reduce tasks (ms)=2911
Total vcore-milliseconds taken by all map tasks=3156
Total vcore-milliseconds taken by all reduce tasks=2911
Total megabyte-milliseconds taken by all map tasks=3231744
Total megabyte-milliseconds taken by all reduce tasks=2980864
Map-Reduce Framework
Map input records=7
Map output records=8
Map output bytes=74
Map output materialized bytes=96
Input split bytes=115
Combine input records=0
Combine output records=0
Reduce input groups=6
Reduce shuffle bytes=96
Reduce input records=8
Reduce output records=6
Spilled Records=16
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=108
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=311427072
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=42
File Output Format Counters
Bytes Written=44
localhost:target Sean$
Check the output file:
localhost:~ Sean$ hadoop fs -cat /wordcount/output/part-r-00000
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
19/04/06 01:10:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019 1
able 1
cat 2
hello 1
kitty 1
pitty 2
1. When running on Mac, the following exception may appear:
localhost:target Sean$ hadoop jar hadoop-0.0.1-SNAPSHOT.jar com.yanxml.bigdata.hadoop.mr.wordcount.WordcountDriver /wordcount/input /wordcount/output
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Exception in thread "main" java.io.IOException: Mkdirs failed to create /var/folders/lm/j_tf25pd1bn1lvf3nm1qkjd40000gn/T/hadoop-unjar5489522687418409987/META-INF/license
    at org.apache.hadoop.util.RunJar.ensureDirectory(RunJar.java:129)
    at org.apache.hadoop.util.RunJar.unJar(RunJar.java:104)
    at org.apache.hadoop.util.RunJar.unJar(RunJar.java:81)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:209)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Fix: strip the license entries from the jar:
zip -d hadoop-0.0.1-SNAPSHOT.jar META-INF/LICENSE
# or
zip -d wordcount.jar LICENSE
Reference articles: "mac上运行hadoop Mkdirs failed to create 的坑" and "解决ES-Hadoop打包报错“Mkdirs failed to create /var/folders…”问题".
2. When launching with the java -cp command, the following error appears:
localhost:target Sean$ java -cp hadoop-0.0.1-SNAPSHOT.jar com.yanxml.bigdata.hadoop.mr.wordcount.WordcountDriver /wordcount/input /wordcount/output
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
    at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)
    at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
    at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
    at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1260)
    at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1256)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
    at org.apache.hadoop.mapreduce.Job.connect(Job.java:1256)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1284)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
    at com.yanxml.bigdata.hadoop.mr.wordcount.WordcountDriver.main(WordcountDriver.java:57)
Solution:
Add the following dependency to the pom.xml file:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce-client-common</artifactId>
  <version>2.7.5</version>
</dependency>
Reference article: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the co