MapReduce example: WordCount (word counting)

Reference URL: https://blog.csdn.net/zhangyunfeixyz/article/details/77151083
Input data: hdfs://192.168.145.180:8020/user/root/input/djt.txt

hadoop
hadoop
hadoop
dajiangtai
dajiangtai
dajiangtai
hsg
qq.com
hello you
hello me  her

MapReduce processing

Execution steps:
 1. Map task processing
1.1 Read the input file and parse it into key/value pairs: each line of the input file becomes one key/value pair, and the map function is called once per pair.
1.2 Write your own logic: process the input key/value and emit new key/value pairs.
1.3 Partition the emitted key/value pairs.
1.4 Within each partition, sort and group the data by key; all values with the same key are collected into one set.
1.5 (Optional) Locally combine the grouped data.
2. Reduce task processing
2.1 The outputs of the map tasks are copied over the network, partition by partition, to the corresponding reduce nodes.
2.2 The map outputs are merged and sorted. Write your own reduce logic: process the input key and its values and emit new key/value pairs.
2.3 The reduce output is saved to a file.

Steps 1.3, 1.4 and 1.5 are handled automatically by Hadoop;
what we write ourselves is the map output logic of step 1.2 (for step 1.5 an optional combiner can be plugged in, as sketched below).
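
Step 1.5 corresponds to Hadoop's optional combiner. The job below does not configure one; a minimal sketch, assuming the summing reducer is safe to reuse as a local combine step (it is, because adding partial counts is associative), would be one extra line in the driver:

    // optional step 1.5: pre-aggregate each map's output locally before the shuffle
    job.setCombinerClass(myReducer.class);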

Overriding the map function

1.2 Write your own logic: process the input key/value and emit new key/value pairs.

//Mapper definition
    //LongWritable, Text, Text, LongWritable: the first two type parameters are the map input types, the last two are the map output types
    //e.g. <0,hello you>,<10,hello me>  --->  <hello,1>,<you,1>,<hello,1>,<me,1>
    public static class myMapper extends Mapper<LongWritable, Text, Text, LongWritable>{

        //define k2 and v2
        Text k2 = new Text();
        LongWritable v2 = new LongWritable();
        //the input key/value pairs look like <0,hello you>,<10,hello me>
        //0 and 10 are the byte offsets at which each line starts
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            for (String word : words) {
                if (!word.trim().isEmpty()) {
                    System.out.println(word + " 1"); //debug output (the original used a custom Debug.println helper)
                    //word is each word of the current line, i.e. k2
                    k2.set(word);
                    //before sorting/grouping every word counts as 1; the L suffix makes it a long literal
                    v2.set(1L);
                    //emit the pair
                    context.write(k2, v2);
                }
            }
        }
    }
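
To see what this map step emits for the sample lines, here is a small standalone sketch with no Hadoop dependency (the class name MapLogicDemo is made up for illustration); it mirrors the split-and-filter logic above:

public class MapLogicDemo {
    public static void main(String[] args) {
        String[] lines = { "hello you", "hello me  her" };          // sample input lines
        for (String line : lines) {
            for (String word : line.split(" ")) {
                if (!word.trim().isEmpty()) {                       // the double space yields an empty token, which is skipped
                    System.out.println("<" + word + ",1>");         // same shape as context.write(k2, v2)
                }
            }
        }
        // prints <hello,1> <you,1> <hello,1> <me,1> <her,1>
    }
}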

Overriding the reduce function

Steps 2.1 and 2.2 are handled by Hadoop; we only need to write our own reduce logic.
2.2 The map outputs are merged and sorted. Write your own reduce logic: process the input key and its values and emit new key/value pairs.

//the myReducer class below produces the final output; its logic is what we write ourselves
public static class myReducer extends Reducer<Text,LongWritable,Text,LongWritable>{
        LongWritable v3 = new LongWritable();
        //the (k2, v2s) input looks like <hello,{1,1}>,<me,{1}>  --->  output <hello,2>,<me,1>
        @Override
        protected void reduce(Text k2, Iterable<LongWritable> v2s,
                     Context context)
                     throws IOException, InterruptedException {
            long count = 0L;
            for (LongWritable v2 : v2s) {
                count += v2.get();
            }
            v3.set(count);
            //k2 is also k3: both are a single word
            context.write(k2, v3);
        }
    }
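
Hadoop's shuffle/sort/group step that feeds this reducer can be imitated with a plain sorted map; a minimal standalone sketch (no Hadoop dependency, the class name ReduceLogicDemo is made up) of how the <word,1> pairs collapse into counts:

import java.util.Map;
import java.util.TreeMap;

public class ReduceLogicDemo {
    public static void main(String[] args) {
        String[] mapOutput = { "hello", "you", "hello", "me", "her" }; // the emitted k2 values
        Map<String, Long> counts = new TreeMap<>();  // TreeMap keeps keys sorted, like the grouped reduce input
        for (String word : mapOutput) {
            counts.merge(word, 1L, Long::sum);       // same summation as the reduce loop above
        }
        counts.forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}

The complete program (wordcount.java) is: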
package wordcount;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class wordcount {    
    //Mapper definition
    //LongWritable, Text, Text, LongWritable: the first two type parameters are the map input types, the last two are the map output types
    //e.g. <0,hello you>,<10,hello me>  --->  <hello,1>,<you,1>,<hello,1>,<me,1>
    public static class myMapper extends Mapper<LongWritable, Text, Text, LongWritable>{

        //define k2 and v2
        Text k2 = new Text();
        LongWritable v2 = new LongWritable();
        //the input key/value pairs look like <0,hello you>,<10,hello me>
        //0 and 10 are the byte offsets at which each line starts
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            for (String word : words) {
                if (!word.trim().isEmpty()) {
                    System.out.println(word + " 1"); //debug output (the original used a custom Debug.println helper)
                    //word is each word of the current line, i.e. k2
                    k2.set(word);
                    //before sorting/grouping every word counts as 1; the L suffix makes it a long literal
                    v2.set(1L);
                    //emit the pair
                    context.write(k2, v2);
                }
            }
        }
    }
    public static class myReducer extends Reducer<Text, LongWritable, Text, LongWritable>{
        LongWritable v3 = new LongWritable();
        //the (k2, v2s) input looks like <hello,{1,1}>,<me,{1}>  --->  output <hello,2>,<me,1>
        @Override
        protected void reduce(Text k2, Iterable<LongWritable> v2s,
                     Context context)
                     throws IOException, InterruptedException {
            long count = 0L;
            for (LongWritable v2 : v2s) {
                count += v2.get();
            }
            v3.set(count);
            //k2 is also k3: both are a single word
            context.write(k2, v3);
        }
    }
    //delete the output directory if it already exists
    public static void deleteOutDir(Configuration conf, String out_dir)
               throws IOException, URISyntaxException {
        FileSystem fs = FileSystem.get(new URI(out_dir), conf);
        if (fs.exists(new Path(out_dir))) {
            fs.delete(new Path(out_dir), true);
        }
    }
    public static void main(String[] args) throws Exception
    {
        //load the Hadoop configuration
        Configuration conf = new Configuration(); 
        Job job = Job.getInstance(conf, wordcount.class.getSimpleName());
        job.setJarByClass(wordcount.class);
        Path in_path = new Path("hdfs://192.168.145.180:8020/user/root/input/djt.txt");
        FileInputFormat.setInputPaths(job, in_path);
        //TextInputFormat turns the raw input into <k1,v1> pairs (byte offset, line text)
        job.setInputFormatClass(TextInputFormat.class);
        //register the Mapper; myMapper receives the <k1,v1> pairs and its map function processes them
        job.setMapperClass(myMapper.class);
        //set the map output data types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        //register the Reducer; it automatically receives the shuffled and sorted map output
        job.setReducerClass(myReducer.class);
        //set the final output data types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        //set the output directory output2
        String OUT_DIR = "hdfs://192.168.145.180:8020/user/root/output2";
        FileOutputFormat.setOutputPath(job, new Path(OUT_DIR));
        job.setOutputFormatClass(TextOutputFormat.class);
        //delete the output directory if it already exists; otherwise the job fails
        deleteOutDir(conf, OUT_DIR);
        //run the job and write the results to the output directory
        job.waitForCompletion(true);
    }
}
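
The job log below prints a warning that the driver does not implement the Tool interface. A minimal sketch of wrapping the same driver with Tool/ToolRunner (the class name WordCountTool is made up; the job setup inside run() would be the same calls as in main() above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountTool extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "wordcount");
        job.setJarByClass(WordCountTool.class);
        // ...same mapper/reducer/input/output setup as in main() above...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses generic options (e.g. -D mapreduce.job.reduces=1) before calling run()
        System.exit(ToolRunner.run(new Configuration(), new WordCountTool(), args));
    }
}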

Execution method 1: run inside the Eclipse IDE

In the wordcount.java editor, right-click and choose Run As / Run on Hadoop.

Execution method 2: build a jar and run it on the Hadoop server

Building the jar:

In the Project Explorer tree, select wordcount.java, right-click and choose Export / Java / Runnable JAR file.
Settings:
Launch configuration: wordcount-wordcount
Export destination: D:\应用集合\eclipse\eclipse-workspace\bin_jar\wordcount.jar
Library handling:
check: Extract required libraries into generated JAR
Click Finish.

Copy the generated jar to the Hadoop server,
into the /home/hadoop3/app/hadoop directory.

Run: hadoop jar wordcount.jar

The run output is as follows:

[hadoop3@master hadoop]$ hadoop jar wordcount.jar
18/09/03 19:22:38 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/09/03 19:22:41 INFO input.FileInputFormat: Total input paths to process : 1
18/09/03 19:22:42 INFO mapreduce.JobSubmitter: number of splits:1
18/09/03 19:22:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1535940670011_0002
18/09/03 19:22:42 INFO impl.YarnClientImpl: Submitted application application_1535940670011_0002
18/09/03 19:22:42 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1535940670011_0002/
18/09/03 19:22:42 INFO mapreduce.Job: Running job: job_1535940670011_0002
18/09/03 19:22:51 INFO mapreduce.Job: Job job_1535940670011_0002 running in uber mode : false
18/09/03 19:22:51 INFO mapreduce.Job:  map 0% reduce 0%
18/09/03 19:23:01 INFO mapreduce.Job:  map 100% reduce 0%
18/09/03 19:23:12 INFO mapreduce.Job:  map 100% reduce 100%
18/09/03 19:23:12 INFO mapreduce.Job: Job job_1535940670011_0002 completed successfully
18/09/03 19:23:12 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=235
        FILE: Number of bytes written=255343
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=205
        HDFS: Number of bytes written=65
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters 
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=7766
        Total time spent by all reduces in occupied slots (ms)=6780
        Total time spent by all map tasks (ms)=7766
        Total time spent by all reduce tasks (ms)=6780
        Total vcore-milliseconds taken by all map tasks=7766
        Total vcore-milliseconds taken by all reduce tasks=6780
        Total megabyte-milliseconds taken by all map tasks=7952384
        Total megabyte-milliseconds taken by all reduce tasks=6942720
    Map-Reduce Framework
        Map input records=10
        Map output records=14
        Map output bytes=201
        Map output materialized bytes=235
        Input split bytes=116
        Combine input records=0
        Combine output records=0
        Reduce input groups=9
        Reduce shuffle bytes=235
        Reduce input records=14
        Reduce output records=9
        Spilled Records=28
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=169
        CPU time spent (ms)=2070
        Physical memory (bytes) snapshot=312471552
        Virtual memory (bytes) snapshot=4165705728
        Total committed heap usage (bytes)=152428544
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=89
    File Output Format Counters 
        Bytes Written=65
[hadoop3@master hadoop]$ 

This indicates the job completed successfully.
The results are written to:
hdfs://192.168.145.180:8020/user/root/output2

dajiangtai  3
hadoop  3
hello   2
her 1
hsg 1
me  1
qq.com  1
you 1
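
With the default TextOutputFormat and a single reduce task, the counts land in a part-r-00000 file under the output directory; they can be viewed with, for example:

hdfs dfs -cat /user/root/output2/part-r-00000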

If the run fails with: RunJar jarFile [mainClass] args…
you probably built the jar with Export / Java / JAR file, which does not record a main class.
Build it with Export / Java / Runnable JAR file instead; the only downside is a larger jar.
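
A plain jar without a recorded main class can also be run by naming the main class on the command line; assuming the package and class names used above (wordcount.wordcount), that would be:

hadoop jar wordcount.jar wordcount.wordcount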
