Big Data Learning Path (6): MapReduce (simple WordCount)

Finally, time to write some code.


Let's start by setting up the IDEA development environment.

Reference post: [ https://blog.csdn.net/u010171031/article/details/53024516 ]

Note: the following files need to be placed under the src folder. I'm testing against a pseudo-distributed cluster, which my machine can barely handle:
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
If you don't include these config files, you have to specify the HDFS service address manually (see the sketch below).
Then upload a sample file for the wordcount to HDFS.
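If you skip the XML files, a minimal sketch of pointing the code at HDFS yourself looks like this. The hdfs://localhost:9000 address is an assumption for a typical pseudo-distributed setup (use whatever fs.defaultFS is in your core-site.xml), and the local file path is only illustrative; the same FileSystem handle can also upload the sample file:

Configuration conf = new Configuration();
// Assumed address for a pseudo-distributed cluster; replace with your own fs.defaultFS value
conf.set("fs.defaultFS", "hdfs://localhost:9000");
FileSystem fs = FileSystem.get(conf);
// Upload a local sample file to the input path used by the job below (local path is illustrative)
fs.copyFromLocalFile(new Path("/path/to/local/sample.txt"), new Path("/tmp/wc"));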

OK, here is my code.


1. Write the WordCount class

package com.hadoop.learn.wc;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCount {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // Uncomment to submit the job to the remote server using this jar
        // conf.set("mapred.jar", "/Users/zhengyifan/app/project/bigdata-learn/hadoop/hadoop.jar");
        Job job = new Job(conf);
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WCMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.getName());
        job.setReducerClass(WCReduce.class);
        Path path = new Path("/tmp/wc");
        FileInputFormat.addInputPath(job, path);

        Path outpath = new Path("/tmp/out_wc");
        // Make sure the output path does not already exist
        if (fs.exists(outpath)) {
            fs.delete(outpath, true);
        }
        FileOutputFormat.setOutputPath(job, outpath);
        boolean res = job.waitForCompletion(true);
        if (res) {
            System.out.println("Job succeeded!");
        } else {
            System.out.println("Job failed!");
        }
    }

}
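One note on the driver: the Job(Configuration) constructor used above is deprecated in the Hadoop 2.x API. An equivalent, non-deprecated sketch would be:

// Same effect as new Job(conf), via the factory method; the job name is arbitrary
Job job = Job.getInstance(conf, "wordcount");
job.setJarByClass(WordCount.class);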

2. The Mapper class

package com.hadoop.learn.wc;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String val = value.toString();
        String[] list = val.split(" ");
        for (String s : list) {
            System.out.println(s + "\n===" );
            // The IDEA setup post writes these as static/reused fields, which should reduce GC pressure (see the variant sketched after this class)
            context.write(new Text(s), new IntWritable(1));
        }
    }
}
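As the comment notes, the IDEA setup post reuses the output objects instead of allocating a new Text and IntWritable for every word. A sketch of that variant (same package and imports as the class above; the field names are my own) would be:

public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Reused across map() calls instead of allocating new objects per word, to reduce GC pressure
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String s : value.toString().split(" ")) {
            word.set(s);
            context.write(word, ONE);
        }
    }
}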

3. The Reduce class

package com.hadoop.learn.wc;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WCReduce extends Reducer<Text, IntWritable, Text, IntWritable>{

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable i : values) {
            count += i.get();
            System.out.println(key.toString() + "---" + count);
        }
        context.write(key, new IntWritable(count));
    }
}
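To make the flow concrete: if the input file contained the line "hello world hello", the mapper would emit (hello, 1), (world, 1), (hello, 1); the framework groups the values by key, so the reducer receives hello with [1, 1] and world with [1], and the final output would look like:

hello	2
world	1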

4. Run it in the test environment

Just hit Run.
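To check the result you can dump the output directory from HDFS, e.g. hdfs dfs -cat /tmp/out_wc/part-r-00000 (part-r-00000 is the default name of the single reducer's output file).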

5. Submit the job to the server from IDEA with a specified jar (first, build the jar)

Reference post: [ https://www.cnblogs.com/blog5277/p/5920560.html ]
Uncomment the commented-out line in my WordCount class and run it.

Once it's running, you can watch the job in progress at localhost:8088.

6. Deploy the jar to the server and run it there

hadoop  jar  /path/to/your.jar  com.your.mapreduce.class
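For this post's job, that would be something like the following (assuming the jar was exported to the path used in the commented-out conf.set above):

hadoop jar /Users/zhengyifan/app/project/bigdata-learn/hadoop/hadoop.jar com.hadoop.learn.wc.WordCount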

I ran this on a Mac, so I hit an issue along the way. I'll collect all the problems in one place for easy reference later; see [ https://blog.csdn.net/qq_31343581/article/details/80861790 ]

Once it's running, you can watch the job in progress at localhost:8088.

OK, that's a simple MapReduce job done.
