Hadoop MR (Part 1): Writing Your Own WordCount

Preface

The previous chapters focused on HDFS. Starting with this chapter, we turn to the commonly used Hadoop MapReduce.

Some may feel that, now that Spark has all but taken over, there is little point in learning Map/Reduce. I still think it is worthwhile, for three main reasons:

  • Hadoop's Map/Reduce framework can fairly be called the pioneer of distributed offline computing, and later open-source projects borrowed from it to varying degrees;
  • Many companies' traditional offline data pipelines still run Hive on top of Hadoop Map/Reduce, and learning basic Map/Reduce operations helps in understanding how Hive works;
  • The Map/Reduce workflow touches on many classic big-data problems that are well worth drawing on later.

The code for this article can be found in my GitHub project at https://github.com/SeanYanxml/bigdata/. PS: if you find the project useful, feel free to give it a Star.


Prerequisites

  • JDK
  • Maven
  • Eclipse (any other IDE works too)
  • A locally installed Hadoop HDFS/YARN cluster, so that we can submit the job to it
  • Base data: put the WordCount input data into the target HDFS directory
# File hello2019.sh (any content will do)
hello 2019
cat
pitty
kitty
able
pitty
cat
  • Upload to HDFS
# Create the directory
hadoop fs -mkdir -p /wordcount/input
# Upload the file
hadoop fs -put hello2019.sh /wordcount/input/



Writing the Basic Code

OK, with the preparation above done, let's move on to the code.

  • Write the pom.xml file and pull in the required JAR dependencies.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<parent>
		<groupId>com.yanxml</groupId>
		<artifactId>bigdata</artifactId>
		<version>0.0.1-SNAPSHOT</version>
	</parent>
	<artifactId>hadoop</artifactId>

	<dependencies>
		<!-- MapReduce client classes -->
		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-mapreduce-client-common</artifactId>
			<version>2.7.5</version>
		</dependency>
		<!-- Hadoop client (HDFS/YARN) -->
		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-client</artifactId>
			<version>2.7.5</version>
		</dependency>

		<!-- Unit testing -->
		<dependency>
			<groupId>junit</groupId>
			<artifactId>junit</artifactId>
			<version>4.12</version>
			<scope>test</scope>
		</dependency>

		<!-- JSON handling -->
		<dependency>
			<groupId>com.alibaba</groupId>
			<artifactId>fastjson</artifactId>
			<version>1.2.28</version>
		</dependency>
	</dependencies>

	<build>
		<plugins>
			<!-- Build a runnable (shaded) jar with the Driver as its main class -->
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-shade-plugin</artifactId>
				<version>2.4.3</version>
				<executions>
					<execution>
						<phase>package</phase>
						<goals>
							<goal>shade</goal>
						</goals>
						<configuration>
							<transformers>
								<transformer
									implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
									<mainClass>com.yanxml.bigdata.hadoop.mr.wordcount.WordcountDriver</mainClass>
								</transformer>
							</transformers>
						</configuration>
					</execution>
				</executions>
			</plugin>
			<!-- Compile with Java 1.7 source/target -->
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-compiler-plugin</artifactId>
				<configuration>
					<source>1.7</source>
					<target>1.7</target>
				</configuration>
			</plugin>
		</plugins>
	</build>
</project>


Write the Mapper class (extending org.apache.hadoop.mapreduce.Mapper)

Note: the Mapper under the mapred package is the Hadoop 1.x version; we do not use it. The main job of WordcountMapper is to split the input and write the resulting <word, 1> pairs into the context for the Reducer to consume.

package com.yanxml.bigdata.hadoop.mr.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Mapper
 * The framework reads the input and hands it to us as <key, value> pairs.
 * KEYIN: by default, the byte offset at which the line read by the MR framework starts. Conceptually a Long,
 *        but Hadoop uses its own compact serialization, so LongWritable is used instead of Long.
 * VALUEIN: by default, the content of the line read by the MR framework; a String, so Text is used.
 *
 * KEYOUT: the key of the output produced by the user-defined logic; here it is the word, so Text.
 * VALUEOUT: the value of the output produced by the user-defined logic; here it is the count, so IntWritable.
 */
public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

	/**
	 * Override the parent class's map() method.
	 * The business logic of the map phase goes into this custom map() method.
	 */
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		// Convert the text handed to us by the map task into a String
		String line = value.toString();
		// Split the line into words on spaces
		String[] words = line.split(" ");

		// Emit each word as <word, 1>
		for (String word : words) {
			// Use the word as the key and 1 as the value; records are partitioned by word
			// during the shuffle, so identical words end up in the same reduce task.
			context.write(new Text(word), new IntWritable(1));
		}
	}
}
Write the Reducer class (extending org.apache.hadoop.mapreduce.Reducer)

The Reducer reads the <word, 1> pairs produced by the Mapper and totals the count for each word. Note that the data arrives as <key, Iterable<value>>: all records sharing the same key are merged and handed over together. (For instance, the two <cat, 1> records from our sample file arrive inside a single Iterator.)
The values can be traversed either with the Iterator's hasNext() method or with an enhanced for loop such as for (IntWritable value : values). The code is shown below.


package com.yanxml.bigdata.hadoop.mr.wordcount;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * KEYIN, VALUEIN correspond to the Mapper's KEYOUT, VALUEOUT types.
 *
 * KEYOUT, VALUEOUT are the output types of the custom reducer logic:
 * KEYOUT is the word,
 * VALUEOUT is the total count.
 */
public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

	/**
	 * The key parameter is the key shared by one group of <word, 1> pairs.
	 */
	@Override
	protected void reduce(Text key, Iterable<IntWritable> values, Context context)
			throws IOException, InterruptedException {
		int count = 0;
		Iterator<IntWritable> it = values.iterator();
		while (it.hasNext()) {
			count += it.next().get();
		}
		context.write(key, new IntWritable(count));

		// Equivalent traversal with an enhanced for loop:
		// for (IntWritable value : values) {
		//     count += value.get();
		// }
	}
}

Write the Driver class

Launching a Map/Reduce job involves the following main steps:

  • Build the Configuration;
    • set the execution mode: conf.set("mapreduce.framework.name", "yarn"); or conf.set("mapreduce.framework.name", "local");
    • set the ResourceManager hostname: conf.set("yarn.resourcemanager.hostname", "localhost");
    • set the file system to read from: conf.set("fs.defaultFS", "hdfs://localhost:9000/");
  • Create the Job from the Configuration and point it at the jar (or the class directory) containing the job code;
  • Set the Mapper and Reducer classes;
  • Set the output (and input) key/value types of the Mapper and Reducer;
  • Set the input and output directories;
  • Other settings (Combiner/Partitioner/GroupingComparatorClass, etc.);
  • Call submit() or waitForCompletion().

The full code is shown below:

package com.yanxml.bigdata.hadoop.mr.wordcount;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Acts as a client of the YARN cluster.
 * It bundles the runtime parameters of our MR program, specifies the jar,
 * and finally submits everything to YARN.
 */
public class WordcountDriver {
	public static void main(String[] args) throws IllegalArgumentException, IOException, ClassNotFoundException, InterruptedException {
		Configuration conf =  new Configuration();
//		conf.set("mapreduce.framework.name", "yarn");
//		conf.set("yarn.resourcemanager.hostname", "localhost");
		
		conf.set("mapreduce.framework.name", "yarn");
//		conf.set("mapreduce.framework.name", "local");
		conf.set("yarn.resourcemanager.hostname", "localhost");
		conf.set("fs.defaultFS", "hdfs://localhost:9000/");
		
		Job job =  Job.getInstance(conf);
		
//		job.setJar("/");
		job.setJar("/Users/Sean/Documents/Gitrep/bigdata/hadoop/target/hadoop-0.0.1-SNAPSHOT.jar");
		// Specify where this program's jar is located
//		job.setJarByClass(WordcountDriver.class);

		
		// Specify the Mapper and Reducer classes this job uses
		job.setMapperClass(WordcountMapper.class);
		job.setReducerClass(WordcountReducer.class);
		
		// Specify the key/value types of the Mapper output
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		
		// Specify the key/value types of the final output
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		
		// Specify that a Combiner should be used, and which class implements the Combiner logic
		job.setCombinerClass(WordCountCombiner.class);
		
		// If no InputFormat is set, TextInputFormat.class is used by default
		job.setInputFormatClass(CombineTextInputFormat.class);
		CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);
		CombineTextInputFormat.setMinInputSplitSize(job, 2097152);
		
		// Specify the job's input directory (via command-line arguments)
//		FileInputFormat.setInputPaths(job, new Path(args[0]));
		// Specify the job's output directory (via command-line arguments)
//		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		// Specify the job's input directory
		FileInputFormat.setInputPaths(job, new Path("/wordcount/input"));
		// Specify the job's output directory
		FileOutputFormat.setOutputPath(job, new Path("/wordcount/output"));
		
		// Submit the parameters configured in the job, together with the jar containing the job's classes, to YARN
//		job.submit();
		
		// Wait for the job to finish and print progress; exit with 0 on success, 1 on failure
		boolean flag = job.waitForCompletion(true);
		System.exit(flag ? 0 : 1);
		
	}
}
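
The Driver above registers a WordCountCombiner class that is not shown in this post (it lives in the companion GitHub repository). For word count, a combiner performs the same aggregation as the reducer, pre-summing the <word, 1> pairs on the map side before the shuffle. A minimal sketch of such a class could look like the following; treat it as an illustration rather than the exact class from the repository:

package com.yanxml.bigdata.hadoop.mr.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A combiner is a "mini reducer" that runs on each map task's output before the shuffle,
// reducing the amount of data transferred to the reduce tasks.
public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

	@Override
	protected void reduce(Text key, Iterable<IntWritable> values, Context context)
			throws IOException, InterruptedException {
		// Sum the partial counts for this word on the map side
		int count = 0;
		for (IntWritable value : values) {
			count += value.get();
		}
		context.write(key, new IntWritable(count));
	}
}

Since the logic is identical, job.setCombinerClass(WordcountReducer.class) would work just as well for this particular job.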


Running the Code & Packaging

Configuration Notes

In the Driver code above:

  • Configuration 1-1: conf.set("mapreduce.framework.name", "yarn"); means the job runs on the YARN platform;
  • Configuration 1-2: conf.set("mapreduce.framework.name", "local"); means the job runs locally inside the JVM;
  • Configuration 2-1: job.setJar("/Users/Sean/Documents/Gitrep/bigdata/hadoop/target/hadoop-0.0.1-SNAPSHOT.jar"); means the job's code lives inside a jar; this is usually needed when submitting to YARN with the hadoop jar command;
  • Configuration 2-2: job.setJarByClass(WordcountDriver.class); locates the jar through the given class; this is what we normally use in local mode.

Configurations 1-1/2-1 and 1-2/2-2 normally appear as pairs, for running on YARN or locally, respectively, as the sketch below shows.
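
To make the pairing concrete, here is a minimal sketch that groups the two combinations into helper methods (the class WordcountJobModes and its method names are purely illustrative and not part of the repository; the property values and jar path are taken from the Driver above):

package com.yanxml.bigdata.hadoop.mr.wordcount;

import org.apache.hadoop.mapreduce.Job;

public class WordcountJobModes {

	// Pairing 1-2 / 2-2: local mode, convenient when running directly from the IDE.
	static void useLocalMode(Job job) {
		job.getConfiguration().set("mapreduce.framework.name", "local");
		job.setJarByClass(WordcountDriver.class);
	}

	// Pairing 1-1 / 2-1: YARN mode, used when submitting the packaged jar to the cluster.
	static void useYarnMode(Job job) {
		job.getConfiguration().set("mapreduce.framework.name", "yarn");
		job.getConfiguration().set("yarn.resourcemanager.hostname", "localhost");
		job.setJar("/Users/Sean/Documents/Gitrep/bigdata/hadoop/target/hadoop-0.0.1-SNAPSHOT.jar");
	}
}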

Packaging

We package the project with the maven-shade-plugin mentioned earlier; its configuration was already given at the beginning of this article, so it is not repeated here.
Running mvn package produces the jar.

Run Logs

Depending on the configuration, the program can run in local mode or in YARN mode. Both can be launched from within Eclipse; the differences are the configuration values, and that YARN mode requires packaging the jar first.

When submitting to YARN, there are likewise two ways to launch the job: java -cp and hadoop jar.

  • java -cp hadoop-0.0.1-SNAPSHOT.jar com.yanxml.bigdata.hadoop.mr.wordcount.WordcountDriver /wordcount/input /wordcount/output — the last two arguments are the input and output paths passed to the program.
  • hadoop jar hadoop-0.0.1-SNAPSHOT.jar /wordcount/input /wordcount/output — if the main class was already set at packaging time, it can be omitted here; otherwise the fully qualified driver class must be supplied after the jar name.

The run log looks like this:

localhost:target Sean$ hadoop jar hadoop-0.0.1-SNAPSHOT.jar /wordcount/input /wordcount/output
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
19/04/03 20:41:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/04/03 20:41:16 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
19/04/03 20:41:16 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
19/04/03 20:41:17 INFO input.FileInputFormat: Total input paths to process : 1
19/04/03 20:41:17 INFO mapreduce.JobSubmitter: number of splits:1
19/04/03 20:41:17 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1553933297569_0003
19/04/03 20:41:18 INFO impl.YarnClientImpl: Submitted application application_1553933297569_0003
19/04/03 20:41:18 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1553933297569_0003/
19/04/03 20:41:18 INFO mapreduce.Job: Running job: job_1553933297569_0003
19/04/03 20:41:26 INFO mapreduce.Job: Job job_1553933297569_0003 running in uber mode : false
19/04/03 20:41:26 INFO mapreduce.Job:  map 0% reduce 0%
19/04/03 20:41:31 INFO mapreduce.Job:  map 100% reduce 0%
19/04/03 20:41:37 INFO mapreduce.Job:  map 100% reduce 100%
19/04/03 20:41:37 INFO mapreduce.Job: Job job_1553933297569_0003 completed successfully
19/04/03 20:41:37 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=96
		FILE: Number of bytes written=243449
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=157
		HDFS: Number of bytes written=44
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=3156
		Total time spent by all reduces in occupied slots (ms)=2911
		Total time spent by all map tasks (ms)=3156
		Total time spent by all reduce tasks (ms)=2911
		Total vcore-milliseconds taken by all map tasks=3156
		Total vcore-milliseconds taken by all reduce tasks=2911
		Total megabyte-milliseconds taken by all map tasks=3231744
		Total megabyte-milliseconds taken by all reduce tasks=2980864
	Map-Reduce Framework
		Map input records=7
		Map output records=8
		Map output bytes=74
		Map output materialized bytes=96
		Input split bytes=115
		Combine input records=0
		Combine output records=0
		Reduce input groups=6
		Reduce shuffle bytes=96
		Reduce input records=8
		Reduce output records=6
		Spilled Records=16
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=108
		CPU time spent (ms)=0
		Physical memory (bytes) snapshot=0
		Virtual memory (bytes) snapshot=0
		Total committed heap usage (bytes)=311427072
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=42
	File Output Format Counters
		Bytes Written=44
localhost:target Sean$

Viewing the Output File

localhost:~ Sean$ hadoop fs -cat /wordcount/output/part-r-00000
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8

19/04/06 01:10:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019	1
able	1
cat	2
hello	1
kitty	1
pitty	2



Q & A

1. On macOS, the following exception may appear at run time:

localhost:target Sean$ hadoop jar hadoop-0.0.1-SNAPSHOT.jar  com.yanxml.bigdata.hadoop.mr.wordcount.WordcountDriver /wordcount/input /wordcount/output
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Exception in thread "main" java.io.IOException: Mkdirs failed to create /var/folders/lm/j_tf25pd1bn1lvf3nm1qkjd40000gn/T/hadoop-unjar5489522687418409987/META-INF/license
	at org.apache.hadoop.util.RunJar.ensureDirectory(RunJar.java:129)
	at org.apache.hadoop.util.RunJar.unJar(RunJar.java:104)
	at org.apache.hadoop.util.RunJar.unJar(RunJar.java:81)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:209)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Fix: zip -d hadoop-0.0.1-SNAPSHOT.jar META-INF/LICENSE or zip -d wordcount.jar LICENSE
References: "mac上运行hadoop Mkdirs failed to create 的坑" and "解决ES-Hadoop打包报错“Mkdirs failed to create /var/folders…”问题"

  2. When launching with the java -cp command, the following error appears:
localhost:target Sean$ java -cp hadoop-0.0.1-SNAPSHOT.jar com.yanxml.bigdata.hadoop.mr.wordcount.WordcountDriver /wordcount/input /wordcount/output
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
	at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)
	at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:82)
	at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:75)
	at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1260)
	at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1256)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
	at org.apache.hadoop.mapreduce.Job.connect(Job.java:1256)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1284)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
	at com.yanxml.bigdata.hadoop.mr.wordcount.WordcountDriver.main(WordcountDriver.java:57)

Fix: add the following dependency to pom.xml:


		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-mapreduce-client-common</artifactId>
			<version>2.7.5</version>
		</dependency>

Reference: "Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the co"
