Study Notes: Hadoop MapReduce Word Count

Contents

          • 1. Hadoop MapReduce Word Count: Mapper
          • 2. Hadoop MapReduce Word Count: Reducer
          • 3. Hadoop MapReduce Word Count: Driver
          • 4. Hadoop MapReduce Word Count: Local Testing
          • 5. Hadoop MapReduce Word Count: Combiner

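1. Hadoop MapReduce Word Count: Mapper

Summary (from the Hadoop API documentation): Maps input key/value pairs to a set of intermediate key/value pairs.

In other words: a Mapper transforms each input key/value pair into a set of intermediate key/value pairs.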

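  • Meaning of the generic type parameters passed to Mapper

    • KEYIN: the key type of the data read by the map task. It is the offset at which each line starts in the input, typed as LongWritable rather than Java's Long
    • VALUEIN: the value type of the data read by the map task. It is simply one line of text at a time, typed as Text rather than Java's String
    • KEYOUT: the key type emitted by your own map method. For word count this is Text (note: not String)
    • VALUEOUT: the value type emitted by your own map method. For word count this is IntWritable (note: not Integer)
  • Custom word count Mapper: WordCountMapper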

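    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    /**
     * @ClassName WordCountMapper
     * @Description Word count mapper
     * @Author eastern
     * @Date 2020/4/29 2:15 PM
     * @Version 1.0
     *
     * For word count the output is (word, 1): conceptually KEYOUT is a String
     * and VALUEOUT is an Integer, expressed with Hadoop's Writable types:
     * LongWritable corresponds to Long
     * Text corresponds to String
     * IntWritable corresponds to Integer
     **/
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Split the line into words on the tab separator
            String[] words = value.toString().split("\t");
            // Iterate over the words
            for (String word : words) {
                // Emit (word, 1) to the context
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }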
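To make the KEYIN/VALUEIN description more concrete, here is a minimal plain-Java sketch of what the map method sees for a small tab-separated input. It uses no Hadoop classes and assumes the usual line-oriented text input, where each record's key is the byte offset of the line and the value is the line itself. The names MapperInputSketch and mapAll are hypothetical and exist only for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this is not Hadoop code.
class MapperInputSketch {

    // For every line, records the line's starting byte offset (the role of KEYIN)
    // and one (word,1) emission per tab-separated word (the role of KEYOUT/VALUEOUT).
    static List<String> mapAll(String fileContent) {
        List<String> emitted = new ArrayList<>();
        long offset = 0; // plays the role of the LongWritable key
        for (String line : fileContent.split("\n")) {
            // "line" plays the role of the Text value
            for (String word : line.split("\t")) {
                emitted.add("offset=" + offset + " -> (" + word + ",1)");
            }
            // The next line starts after this line's bytes plus the newline byte
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return emitted;
    }

    public static void main(String[] args) {
        mapAll("hello\tworld\nhello\twelcome\n").forEach(System.out::println);
    }
}
```

With this input the first line starts at offset 0 and the second at offset 12, since "hello", a tab, and "world" take 11 bytes and the newline takes one more.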
    
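2. Hadoop MapReduce Word Count: Reducer

Summary (from the Hadoop API documentation): Reduces a set of intermediate values which share a key to a smaller set of values.

In other words: a Reducer takes the intermediate values that share the same key and merges them into a smaller set of values.

For example:

 # (word, 1) pairs produced by map from the words read from the file
 (hello,1) (world,1)
 (hello,1) (world,1)
 (hello,1) (world,1)
 (welcome,1)
 # Map output is distributed to the reduce side by key: pairs with the same key go to the same reduce
 reduce1:	(hello,1) (hello,1) (hello,1) ===> (hello, <1,1,1>)
 reduce2:	(world,1) (world,1) (world,1) ===> (world, <1,1,1>)
 reduce3:	(welcome,1) ===> (welcome, <1>)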
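The grouping shown above can be imitated in a few lines of plain Java. This is only a sketch of the idea (group the 1s by key, then sum each group), not how Hadoop implements the shuffle internally; the names ShuffleReduceSketch, shuffle, and reduce are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; this is not Hadoop code.
class ShuffleReduceSketch {

    // Groups map output by key: each word maps to the list of 1s emitted for it,
    // e.g. hello -> [1, 1, 1].
    static Map<String, List<Integer>> shuffle(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : words) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Sums the values of each group, which is what the word count reduce step does.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) {
                sum += one;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "hello", "world", "hello", "world", "hello", "world", "welcome");
        Map<String, List<Integer>> grouped = shuffle(words);
        System.out.println(grouped);         // {hello=[1, 1, 1], welcome=[1], world=[1, 1, 1]}
        System.out.println(reduce(grouped)); // {hello=3, welcome=1, world=3}
    }
}
```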


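  • Meaning of the generic type parameters passed to Reducer

    • KEYIN: the type of the key emitted by the map phase
    • VALUEIN: the type of the value emitted by the map phase
    • KEYOUT: the key type emitted by your own reduce method. For word count this is Text (note: not String)
    • VALUEOUT: the value type emitted by your own reduce method. For word count this is IntWritable (note: not Integer)
  • Custom word count Reducer: WordCountReducer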

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    /**
     * @ClassName WordCountReducer
     * @Description Word count Reducer
     * @Author eastern
     * @Date 2020/4/29 3:18 PM
     * @Version 1.0
     **/
    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
                InterruptedException {
            int count = 0;
            // values holds every count emitted for this key, e.g. <1,1,1>
            for (IntWritable value : values) {
                count += value.get();
            }
            // Emit (word, total count)
            context.write(key, new IntWritable(count));
        }
    }
    
3. Hadoop MapReduce Word Count: Driver

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @ClassName WordCountApp
 * @Description Driver: configures the Mapper and Reducer settings and submits the job to run locally
 * @Author eastern
 * @Date 2020/4/29 4:35 PM
 * @Version 1.0
 **/
public class WordCountApp {

	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

		System.setProperty("HADOOP_USER_NAME", "root");
		// Configure the HDFS connection
		Configuration configuration = new Configuration();
		configuration.set("fs.defaultFS", "hdfs://139.129.240.xxx:8020");
		configuration.set("dfs.client.use.datanode.hostname", "true");
		configuration.set("dfs.replication", "1");

		// Create a job
		Job job = Job.getInstance(configuration);

		// Job setting: main class
		job.setJarByClass(WordCountApp.class);

		// Job setting: custom Mapper and Reducer classes
		job.setMapperClass(WordCountMapper.class);
		job.setReducerClass(WordCountReducer.class);

		// Job setting: key and value types of the Mapper output
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);

		// Job setting: key and value types of the Reducer output
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		// Job setting: input and output paths
		FileInputFormat.setInputPaths(job, new Path("/hdfsapi/test/second/words.txt"));
		FileOutputFormat.setOutputPath(job, new Path("/wordcount/output"));

		// Submit the job, wait for it to finish, and report success or failure via the exit code
		boolean success = job.waitForCompletion(true);
		System.exit(success ? 0 : 1);
	}
}

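4. Hadoop MapReduce Word Count: Local Testing
  • Remove the configuration that connects to HDFS.

  • Job setting: point the input and output paths at paths on the local file system.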

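    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountLocalFileApp {

        public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

            // Create a job (default configuration, no HDFS settings)
            Job job = Job.getInstance();

            // Job setting: main class
            job.setJarByClass(WordCountLocalFileApp.class);

            // Job setting: custom Mapper and Reducer classes
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);

            // Job setting: key and value types of the Mapper output
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);

            // Job setting: key and value types of the Reducer output
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Job setting: local input and output paths
            FileInputFormat.setInputPaths(job, new Path("/Users/xxxx/IdeaProjects/bigdata/hadoop-mapreduce/src/main/resources/words.txt"));
            FileOutputFormat.setOutputPath(job, new Path("/Users/xxxx/IdeaProjects/bigdata/hadoop-mapreduce/src/main/resources/output"));

            // Submit the job and wait for it to finish
            job.waitForCompletion(true);
        }
    }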
    
5. Hadoop MapReduce Word Count: Combiner


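  • Aggregation performed on the map side is called a combiner.

  • Advantages and limitations of a combiner

    • It reduces I/O, because less map output has to be transferred to the reducers, which improves execution efficiency.
    • It is not suitable for computations that involve division, such as averages: an average of per-map averages is generally not the overall average.
  • Changing the example code: each map's output is first summed locally and only then sent to the reducer

    // Set the Combiner (the reducer can be reused because summing partial sums gives the same total)
    job.setCombinerClass(WordCountReducer.class);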
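The limitation above is easy to check with a small worked example. The following plain-Java sketch uses no Hadoop classes, and CombinerSketch and its methods are hypothetical names for illustration. It assumes two map tasks whose outputs for one key are [1, 2, 3] and [10], and compares aggregating per map task first against aggregating everything at once, for both a sum and an average.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; this is not Hadoop code.
class CombinerSketch {

    // Sum each map task's values first (the "combiner" step), then sum the partial sums.
    static int sumWithCombiner(List<List<Integer>> mapOutputs) {
        int total = 0;
        for (List<Integer> oneMap : mapOutputs) {
            int partial = 0;
            for (int v : oneMap) partial += v;
            total += partial;
        }
        return total;
    }

    // Sum all values directly, as the reducer would without a combiner.
    static int sumWithoutCombiner(List<List<Integer>> mapOutputs) {
        int total = 0;
        for (List<Integer> oneMap : mapOutputs) {
            for (int v : oneMap) total += v;
        }
        return total;
    }

    // Average each map task's values first, then average the partial averages (gives a wrong answer).
    static double averageWithCombiner(List<List<Integer>> mapOutputs) {
        double sumOfAverages = 0;
        for (List<Integer> oneMap : mapOutputs) {
            double partial = 0;
            for (int v : oneMap) partial += v;
            sumOfAverages += partial / oneMap.size();
        }
        return sumOfAverages / mapOutputs.size();
    }

    // Average over all values directly (the correct answer).
    static double averageWithoutCombiner(List<List<Integer>> mapOutputs) {
        double total = 0;
        int count = 0;
        for (List<Integer> oneMap : mapOutputs) {
            for (int v : oneMap) {
                total += v;
                count++;
            }
        }
        return total / count;
    }

    public static void main(String[] args) {
        List<List<Integer>> mapOutputs = Arrays.asList(Arrays.asList(1, 2, 3), Arrays.asList(10));
        System.out.println(sumWithCombiner(mapOutputs));        // 16
        System.out.println(sumWithoutCombiner(mapOutputs));     // 16
        System.out.println(averageWithCombiner(mapOutputs));    // 6.0
        System.out.println(averageWithoutCombiner(mapOutputs)); // 4.0
    }
}
```

The sums agree (16 both ways), so pre-aggregating on the map side is safe for word count. The averages do not (6.0 versus 4.0), which is why a division-based computation cannot simply reuse its reducer as a combiner.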
    
