Hadoop: A WordCount Example

Please credit the source when reposting: http://blog.csdn.net/l1028386804/article/details/78238100

Recently, quite a few readers who want to get into big data have messaged me asking for articles on big data analysis to use as a reference on the way from beginner to practitioner. Having spent years in this field, I am glad to have earned the recognition of my peers, and I am happy to share the experience I have accumulated. So today I bring you a classic Hadoop starter: the WordCount example.

I. Preparation

1. Hadoop Installation

(1) Pseudo-distributed installation

See the post "Hadoop: Setting Up Hadoop 2.4.1 in Pseudo-distributed Mode".

(2) Cluster installation

See the post "Hadoop: CentOS + hadoop2.5.2 Distributed Environment Setup".

(3) High-availability cluster installation

See the posts "Hadoop: Hadoop 2.5.2 HA High-Availability Cluster Setup (Hadoop + Zookeeper), Preparation" and "Hadoop: Hadoop 2.5.2 HA High-Availability Cluster Setup (Hadoop + Zookeeper)".

2. Eclipse Configuration

All code in this example is developed and run in Eclipse. See the post "Hadoop: windows7 + eclipse + hadoop2.5.2 Environment Setup" to configure your Eclipse accordingly, so that this example and the Hadoop examples that follow can be run directly from the IDE.

II. Program Development

1. The WCMapper Class for Counting Words

package com.lyz.hdfs.mr.worldcount;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Mapper for counting words.
 * KEYIN:    the byte offset of the current line within the input file
 * VALUEIN:  the text of the current line
 * KEYOUT:   a single word
 * VALUEOUT: the count emitted per word occurrence, always 1 in this example
 * @author liuyazhuang
 *
 */
public class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable>{
	@Override
	protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		String line = value.toString();
		// commons-lang split treats runs of spaces as a single separator
		String[] words = StringUtils.split(line, " ");
		for(String word : words){
			context.write(new Text(word), new LongWritable(1));
		}
	}
}
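
As a small refinement (my addition, not part of the original post), the mapper can reuse a single Text object and a shared LongWritable instead of allocating new ones for every word; context.write serializes the key and value immediately, so reuse across calls is safe. A minimal sketch, using the same imports as the listing above (the class name WCMapperReusing is hypothetical):

public class WCMapperReusing extends Mapper<LongWritable, Text, Text, LongWritable>{
	// Reused across map() calls to avoid one allocation per word
	private final Text outKey = new Text();
	private static final LongWritable ONE = new LongWritable(1);

	@Override
	protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		for(String word : StringUtils.split(value.toString(), " ")){
			outKey.set(word);
			context.write(outKey, ONE);
		}
	}
}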

2. The WCReducer Class for Summing Word Counts

package com.lyz.hdfs.mr.worldcount;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Reducer for counting words.
 * KEYIN:    a single word
 * VALUEIN:  the per-occurrence counts (each 1) emitted by the mapper
 * KEYOUT:   the word
 * VALUEOUT: the total number of times the word appears
 * @author liuyazhuang
 *
 */
public class WCReducer extends Reducer<Text, LongWritable, Text, LongWritable>{
	@Override
	protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
		long count = 0;
		// Sum the 1s emitted for this word by all mappers
		for(LongWritable value : values){
			count += value.get();
		}
		context.write(key, new LongWritable(count));
	}
}
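
Because addition is associative and commutative, this same reducer can also serve as a combiner that pre-aggregates counts on the map side; in the run shown below, Combine input records=0 because no combiner is configured. Enabling one takes a single extra line in the runner (my suggestion, not part of the original code):

job.setCombinerClass(WCReducer.class);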

3. The WCRunner Entry-Point Class

package com.lyz.hdfs.mr.worldcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;


/**
 * Entry point that configures and runs the word-count MR job.
 * @author liuyazhuang
 *
 */
public class WCRunner extends Configured implements Tool{
	public static void main(String[] args) throws Exception{
		System.exit(ToolRunner.run(new Configuration(), new WCRunner(), args));
	}

	@Override
	public int run(String[] args) throws Exception {
		// Note: building the job from a fresh Configuration discards the one
		// ToolRunner parsed, which is why the "command-line option parsing not
		// performed" warning appears in the console output below.
		Configuration conf = new Configuration();
		
		Job job = Job.getInstance(conf);
		
		job.setJarByClass(WCRunner.class);
		
		job.setMapperClass(WCMapper.class);
		job.setReducerClass(WCReducer.class);
		
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(LongWritable.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(LongWritable.class);
		
		
		// Local input file and output directory; the output directory must not
		// already exist, or job submission will fail
		FileInputFormat.setInputPaths(job, new Path("D:/hadoop_data/wordcount/src.txt"));
		FileOutputFormat.setOutputPath(job, new Path("D:/hadoop_data/wordcount/dest"));
		
		return job.waitForCompletion(true) ? 0 : 1;
	}
}
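
The paths in run() above are hard-coded local Windows paths, which suits the Eclipse setup from section I but nothing else. As a sketch of a more portable variant (my addition, not the original code), run() could take the paths from the command line and build the Job from getConf(), reusing the Configuration that ToolRunner already parsed; this would also silence the option-parsing warning visible in the output below:

	@Override
	public int run(String[] args) throws Exception {
		if(args.length != 2){
			System.err.println("Usage: WCRunner <input path> <output dir>");
			return 2;
		}
		Job job = Job.getInstance(getConf()); // reuse ToolRunner's parsed conf
		job.setJarByClass(WCRunner.class);
		job.setMapperClass(WCMapper.class);
		job.setReducerClass(WCReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(LongWritable.class);
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		return job.waitForCompletion(true) ? 0 : 1;
	}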

III. Running the Program

In Eclipse, right-click the WCRunner class and choose Run As -> Java Application. The console output is as follows:

2017-10-14 23:52:51,865 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1019)) - session.id is deprecated. Instead, use dfs.metrics.session-id
2017-10-14 23:52:51,868 INFO  [main] jvm.JvmMetrics (JvmMetrics.java:init(76)) - Initializing JVM Metrics with processName=JobTracker, sessionId=
2017-10-14 23:52:52,665 WARN  [main] mapreduce.JobSubmitter (JobSubmitter.java:copyAndConfigureFiles(150)) - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2017-10-14 23:52:52,669 WARN  [main] mapreduce.JobSubmitter (JobSubmitter.java:copyAndConfigureFiles(259)) - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2017-10-14 23:52:52,675 INFO  [main] input.FileInputFormat (FileInputFormat.java:listStatus(281)) - Total input paths to process : 1
2017-10-14 23:52:52,713 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(396)) - number of splits:1
2017-10-14 23:52:52,788 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(479)) - Submitting tokens for job: job_local994420281_0001
2017-10-14 23:52:52,820 WARN  [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/tmp/hadoop-liuyazhuang/mapred/staging/liuyazhuang994420281/.staging/job_local994420281_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2017-10-14 23:52:52,822 WARN  [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/tmp/hadoop-liuyazhuang/mapred/staging/liuyazhuang994420281/.staging/job_local994420281_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2017-10-14 23:52:52,908 WARN  [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/tmp/hadoop-liuyazhuang/mapred/local/localRunner/liuyazhuang/job_local994420281_0001/job_local994420281_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2017-10-14 23:52:52,909 WARN  [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/tmp/hadoop-liuyazhuang/mapred/local/localRunner/liuyazhuang/job_local994420281_0001/job_local994420281_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2017-10-14 23:52:52,913 INFO  [main] mapreduce.Job (Job.java:submit(1289)) - The url to track the job: http://localhost:8080/
2017-10-14 23:52:52,914 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1334)) - Running job: job_local994420281_0001
2017-10-14 23:52:52,915 INFO  [Thread-2] mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(471)) - OutputCommitter set in config null
2017-10-14 23:52:52,921 INFO  [Thread-2] mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(489)) - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2017-10-14 23:52:52,956 INFO  [Thread-2] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(448)) - Waiting for map tasks
2017-10-14 23:52:52,956 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:run(224)) - Starting task: attempt_local994420281_0001_m_000000_0
2017-10-14 23:52:52,982 INFO  [LocalJobRunner Map Task Executor #0] util.ProcfsBasedProcessTree (ProcfsBasedProcessTree.java:isAvailable(181)) - ProcfsBasedProcessTree currently is supported only on Linux.
2017-10-14 23:52:53,048 INFO  [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:initialize(587)) -  Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@50b84eb3
2017-10-14 23:52:53,051 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:runNewMapper(733)) - Processing split: file:/D:/hadoop_data/wordcount/src.txt:0+173
2017-10-14 23:52:53,060 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:createSortingCollector(388)) - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2017-10-14 23:52:53,089 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:setEquator(1182)) - (EQUATOR) 0 kvi 26214396(104857584)
2017-10-14 23:52:53,089 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(975)) - mapreduce.task.io.sort.mb: 100
2017-10-14 23:52:53,089 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(976)) - soft limit at 83886080
2017-10-14 23:52:53,089 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(977)) - bufstart = 0; bufvoid = 104857600
2017-10-14 23:52:53,089 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(978)) - kvstart = 26214396; length = 6553600
2017-10-14 23:52:53,096 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 
2017-10-14 23:52:53,097 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1437)) - Starting flush of map output
2017-10-14 23:52:53,097 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1455)) - Spilling map output
2017-10-14 23:52:53,097 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1456)) - bufstart = 0; bufend = 326; bufvoid = 104857600
2017-10-14 23:52:53,097 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1458)) - kvstart = 26214396(104857584); kvend = 26214320(104857280); length = 77/6553600
2017-10-14 23:52:53,114 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:sortAndSpill(1641)) - Finished spill 0
2017-10-14 23:52:53,122 INFO  [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:done(1001)) - Task:attempt_local994420281_0001_m_000000_0 is done. And is in the process of committing
2017-10-14 23:52:53,129 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - map
2017-10-14 23:52:53,129 INFO  [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:sendDone(1121)) - Task 'attempt_local994420281_0001_m_000000_0' done.
2017-10-14 23:52:53,129 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:run(249)) - Finishing task: attempt_local994420281_0001_m_000000_0
2017-10-14 23:52:53,129 INFO  [Thread-2] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(456)) - map task executor complete.
2017-10-14 23:52:53,132 INFO  [Thread-2] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(448)) - Waiting for reduce tasks
2017-10-14 23:52:53,132 INFO  [pool-3-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:run(302)) - Starting task: attempt_local994420281_0001_r_000000_0
2017-10-14 23:52:53,138 INFO  [pool-3-thread-1] util.ProcfsBasedProcessTree (ProcfsBasedProcessTree.java:isAvailable(181)) - ProcfsBasedProcessTree currently is supported only on Linux.
2017-10-14 23:52:53,178 INFO  [pool-3-thread-1] mapred.Task (Task.java:initialize(587)) -  Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@1c3a583d
2017-10-14 23:52:53,182 INFO  [pool-3-thread-1] mapred.ReduceTask (ReduceTask.java:run(362)) - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@4a7f319c
2017-10-14 23:52:53,192 INFO  [pool-3-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:&lt;init&gt;(193)) - MergerManager: memoryLimit=1503238528, maxSingleShuffleLimit=375809632, mergeThreshold=992137472, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2017-10-14 23:52:53,194 INFO  [EventFetcher for fetching Map Completion Events] reduce.EventFetcher (EventFetcher.java:run(61)) - attempt_local994420281_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2017-10-14 23:52:53,216 INFO  [localfetcher#1] reduce.LocalFetcher (LocalFetcher.java:copyMapOutput(140)) - localfetcher#1 about to shuffle output of map attempt_local994420281_0001_m_000000_0 decomp: 368 len: 372 to MEMORY
2017-10-14 23:52:53,220 INFO  [localfetcher#1] reduce.InMemoryMapOutput (InMemoryMapOutput.java:shuffle(100)) - Read 368 bytes from map-output for attempt_local994420281_0001_m_000000_0
2017-10-14 23:52:53,239 INFO  [localfetcher#1] reduce.MergeManagerImpl (MergeManagerImpl.java:closeInMemoryFile(307)) - closeInMemoryFile -> map-output of size: 368, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->368
2017-10-14 23:52:53,239 INFO  [EventFetcher for fetching Map Completion Events] reduce.EventFetcher (EventFetcher.java:run(76)) - EventFetcher is interrupted.. Returning
2017-10-14 23:52:53,240 INFO  [pool-3-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 1 / 1 copied.
2017-10-14 23:52:53,240 INFO  [pool-3-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(667)) - finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2017-10-14 23:52:53,251 INFO  [pool-3-thread-1] mapred.Merger (Merger.java:merge(591)) - Merging 1 sorted segments
2017-10-14 23:52:53,251 INFO  [pool-3-thread-1] mapred.Merger (Merger.java:merge(690)) - Down to the last merge-pass, with 1 segments left of total size: 359 bytes
2017-10-14 23:52:53,254 INFO  [pool-3-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(742)) - Merged 1 segments, 368 bytes to disk to satisfy reduce memory limit
2017-10-14 23:52:53,255 INFO  [pool-3-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(772)) - Merging 1 files, 372 bytes from disk
2017-10-14 23:52:53,255 INFO  [pool-3-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(787)) - Merging 0 segments, 0 bytes from memory into reduce
2017-10-14 23:52:53,256 INFO  [pool-3-thread-1] mapred.Merger (Merger.java:merge(591)) - Merging 1 sorted segments
2017-10-14 23:52:53,256 INFO  [pool-3-thread-1] mapred.Merger (Merger.java:merge(690)) - Down to the last merge-pass, with 1 segments left of total size: 359 bytes
2017-10-14 23:52:53,257 INFO  [pool-3-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 1 / 1 copied.
2017-10-14 23:52:53,264 INFO  [pool-3-thread-1] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1019)) - mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
2017-10-14 23:52:53,269 INFO  [pool-3-thread-1] mapred.Task (Task.java:done(1001)) - Task:attempt_local994420281_0001_r_000000_0 is done. And is in the process of committing
2017-10-14 23:52:53,270 INFO  [pool-3-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 1 / 1 copied.
2017-10-14 23:52:53,270 INFO  [pool-3-thread-1] mapred.Task (Task.java:commit(1162)) - Task attempt_local994420281_0001_r_000000_0 is allowed to commit now
2017-10-14 23:52:53,272 INFO  [pool-3-thread-1] output.FileOutputCommitter (FileOutputCommitter.java:commitTask(439)) - Saved output of task 'attempt_local994420281_0001_r_000000_0' to file:/D:/hadoop_data/wordcount/dest/_temporary/0/task_local994420281_0001_r_000000
2017-10-14 23:52:53,273 INFO  [pool-3-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - reduce > reduce
2017-10-14 23:52:53,273 INFO  [pool-3-thread-1] mapred.Task (Task.java:sendDone(1121)) - Task 'attempt_local994420281_0001_r_000000_0' done.
2017-10-14 23:52:53,273 INFO  [pool-3-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:run(325)) - Finishing task: attempt_local994420281_0001_r_000000_0
2017-10-14 23:52:53,274 INFO  [Thread-2] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(456)) - reduce task executor complete.
2017-10-14 23:52:53,916 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1355)) - Job job_local994420281_0001 running in uber mode : false
2017-10-14 23:52:53,917 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1362)) -  map 100% reduce 100%
2017-10-14 23:52:53,918 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1373)) - Job job_local994420281_0001 completed successfully
2017-10-14 23:52:53,924 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1380)) - Counters: 33
	File System Counters
		FILE: Number of bytes read=1438
		FILE: Number of bytes written=476268
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=9
		Map output records=20
		Map output bytes=326
		Map output materialized bytes=372
		Input split bytes=103
		Combine input records=0
		Combine output records=0
		Reduce input groups=17
		Reduce shuffle bytes=372
		Reduce input records=20
		Reduce output records=17
		Spilled Records=40
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=0
		CPU time spent (ms)=0
		Physical memory (bytes) snapshot=0
		Virtual memory (bytes) snapshot=0
		Total committed heap usage (bytes)=385875968
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=173
	File Output Format Counters 
		Bytes Written=176

IV. Appendix

The input file used in this example is src.txt, with the following content:

dysdgy ubdh shdh
ssusdfy sdusf duyfu
fuyfuyfys sydfyusd sydufyus
dhfdf fyudyfu dyuefyue
dfhusf fyueyf dyiefyu sudiufi
liuyazhuang
liuyazhuang
liuyazhuang
liuyazhuang

The output file is part-r-00000. Note that the 20 words in the input collapse to 17 distinct keys, matching the Map output records=20 and Reduce output records=17 counters above. Its content:

dfhusf	1
dhfdf	1
duyfu	1
dyiefyu	1
dysdgy	1
dyuefyue	1
fuyfuyfys	1
fyudyfu	1
fyueyf	1
liuyazhuang	4
sdusf	1
shdh	1
ssusdfy	1
sudiufi	1
sydfyusd	1
sydufyus	1
ubdh	1

At this point, the Hadoop MapReduce program for counting word occurrences is complete.

V. Tips

If you run into problems during development, please refer to the other posts in my Hadoop column to resolve them.


