MapReduce's "Hello World": WordCount

Project structure

[Image 1: project structure]

Code

WordCount.java

FileInputFormat.setInputPaths(job, new Path("/input/input.txt"));
Instead of hard-coding the path, this step can take it from the runtime arguments, i.e. String[] args.
Change it to:

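// Requires: import org.apache.hadoop.util.GenericOptionsParser;
String[] otherArgs = (new GenericOptionsParser(conf, args)).getRemainingArgs();
if (otherArgs.length < 2) {
    System.err.println("Usage: wordcount <in> [<in>...] <out>");
    System.exit(2);
}
// ... (middle part omitted) ...
// Every argument except the last one is an input path
for (int i = 0; i < otherArgs.length - 1; ++i) {
    FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
}
// The last argument is the output path
FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));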

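This way you no longer have to type long path strings into the code again and again. Another benefit is that when the files in HDFS change, you pass different arguments rather than editing the paths in the source.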
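The argument convention used above (every argument except the last is an input path, the last is the output path) can be sketched in plain Java without any Hadoop dependency. The class name ArgSplitSketch and the example paths are made up for illustration; only the splitting rule comes from the snippet above.

```java
import java.util.Arrays;
import java.util.List;

public class ArgSplitSketch {
    // All arguments except the last one are treated as input paths.
    static List<String> inputPaths(String[] args) {
        return Arrays.asList(args).subList(0, args.length - 1);
    }

    // The last argument is treated as the output path.
    static String outputPath(String[] args) {
        return args[args.length - 1];
    }

    public static void main(String[] args) {
        // Hypothetical paths, used only to show the split.
        String[] demo = {"/input/a.txt", "/input/b.txt", "/output"};
        System.out.println("inputs = " + inputPaths(demo)); // inputs = [/input/a.txt, /input/b.txt]
        System.out.println("output = " + outputPath(demo)); // output = /output
    }
}
```

In the real job the same two values would go to FileInputFormat.addInputPath and FileOutputFormat.setOutputPath, as in the snippet above.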



package com.jxufe.xzy.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class WordCount {

	/**
	 * Configures and submits the WordCount job.
	 *
	 * @param args command-line arguments (not used in this version)
	 * @throws IOException
	 * @throws InterruptedException
	 * @throws ClassNotFoundException
	 */
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		// Configuration
		Configuration conf = new Configuration();
		conf.set("fs.defaultFS", "hdfs://Master:9000");
		conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
		Job job = Job.getInstance(conf);

		// Set the main class of the program (used to locate the job jar)
		job.setJarByClass(WordCount.class);
		job.setMapperClass(MMapper.class);   // set the mapper class
		job.setReducerClass(RRducer.class);  // set the reducer class
		job.setCombinerClass(RRducer.class); // the reducer also serves as the combiner
		job.setOutputKeyClass(Text.class);          // output key type
		job.setOutputValueClass(IntWritable.class); // output value type
		// Set the input file and the output directory
		FileInputFormat.setInputPaths(job, new Path("/input/input.txt"));
		FileOutputFormat.setOutputPath(job, new Path("/output"));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}

}

MMapper.java

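The Mapper's job is to process the input file and produce a series of <key, value> pairs (here, <word, 1>). These pairs are passed on to the Reducer through the Context.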
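The map step described above can be sketched in plain Java, without Hadoop, to show what the emitted pairs look like. MapPhaseSketch is a made-up name, and the returned list stands in for what the real mapper writes to the Context; this is an illustration under those assumptions, not the Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

public class MapPhaseSketch {
    // Emits one (word, 1) pair per whitespace-separated token of the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("hello world hello")); // [hello=1, world=1, hello=1]
    }
}
```

Note that the map step does no counting: a repeated word simply yields repeated pairs, and the summing is left to the reduce step.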

package com.jxufe.xzy.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

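public class MMapper extends Mapper<Object, Text, Text, IntWritable> {
	public static final IntWritable one = new IntWritable(1);
	private Text word = new Text();
	// Text can be thought of simply as Hadoop's counterpart to Java's String

	@Override
	public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
		// Convert value to a String and split it into individual words;
		// by default StringTokenizer splits on whitespace
		/*
			while (st.hasMoreElements()) {
				System.out.println(st.nextToken());
			}

			StringTokenizer(String str, String delim, boolean returnDelims)
			The first argument is the string to tokenize; the second is the set of delimiter characters.
			If the returnDelims flag is true, the delimiter characters are also returned as tokens.
		*/

		StringTokenizer itr = new StringTokenizer(value.toString());
		while (itr.hasMoreTokens()) {
			this.word.set(itr.nextToken());
			// context is loosely comparable to a session in web programming: here it collects
			// the <key, value> pairs produced by map (it also exposes other runtime parameters)
			context.write(this.word, one);
		}
	}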

}
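The comment inside map mentions the three-argument StringTokenizer constructor. A small standalone sketch of how the delim and returnDelims arguments behave, using only java.util (the class name TokenizerSketch and the sample string are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerSketch {
    // Collects every token produced for the given delimiter set and flag.
    static List<String> tokens(String s, String delim, boolean returnDelims) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s, delim, returnDelims);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // returnDelims = false: delimiters are dropped, giving the tokens a, b, c
        System.out.println(tokens("a,b,c", ",", false));
        // returnDelims = true: each "," also comes back as its own token
        System.out.println(tokens("a,b,c", ",", true));
    }
}
```

WordCount itself uses the one-argument constructor, which splits on whitespace and drops the delimiters.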

RRducer.java

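The Reducer's job is to fetch its own share of the Mapper output and merge it: for each key it receives the group of values collected for that key and combines them.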
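As a rough illustration of the step described above, the grouping and summing can be simulated in plain Java. In a real job the framework performs the grouping (the shuffle) between map and reduce; the shuffle method below is only a stand-in for that, and ReducePhaseSketch is a made-up name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReducePhaseSketch {
    // Stand-in for the framework's shuffle: groups one 1 per occurrence under each word.
    static Map<String, List<Integer>> shuffle(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String w : words) {
            grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Sums all values received for one key, as the reducer does.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = shuffle(Arrays.asList("hello", "world", "hello"));
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            // Prints "hello" with count 2, then "world" with count 1
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
        }
    }
}
```

Because summing is associative and commutative, the same logic can also run as a combiner on partial groups, which is why the driver can register the reducer class as the combiner.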

package com.jxufe.xzy.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;

public class RRducer extends Reducer<Text, IntWritable, Text, IntWritable> {

	@Override
	public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
		// Sum all the counts received for this word
		int sum = 0;
		for (IntWritable val : values) {
			sum += val.get();
		}
		context.write(key, new IntWritable(sum));
	}

}
