MapReduce can be decomposed into Map (mapping) + Reduce (reduction); the concrete flow is illustrated below.
MapReduce is a programming framework for distributed computation, and the core framework for developing "Hadoop-based data analysis applications". Its core function is to combine the user's business-logic code with the framework's built-in default components into a complete distributed program that runs in parallel on a Hadoop cluster.
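To make the Map + Reduce split concrete, here is the data flow for a small word-count input (a hand-worked illustration of the phases, not output of the code below):

Input lines:            "hello world"  /  "hello hadoop"
Map output:             (hello, 1) (world, 1) (hello, 1) (hadoop, 1)
Shuffle (group by key): (hadoop, [1]) (hello, [1, 1]) (world, [1])
Reduce output:          (hadoop, 1) (hello, 2) (world, 1)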
Advantages: easy to program, good scalability (capacity can be grown simply by adding machines), high fault tolerance, and suitability for offline batch processing of massive data.
Disadvantages: not suited to real-time or streaming computation, and inefficient for multi-stage (DAG) jobs, since each stage writes its intermediate results to disk.
In the classic MapReduce (MRv1) architecture, two machine roles execute MapReduce work: JobTracker and TaskTracker. The JobTracker handles job scheduling and the TaskTrackers execute tasks; a Hadoop cluster has exactly one JobTracker.
When a client submits a job to the JobTracker, the JobTracker splits it into tasks and hands them to multiple TaskTrackers for execution. Each TaskTracker sends heartbeat messages at fixed intervals; if the JobTracker receives no heartbeat from a TaskTracker for a period of time, it marks that TaskTracker as failed and reassigns its tasks to other TaskTrackers.
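A minimal sketch of the timeout check behind this failover logic (all class, field, and constant names here are hypothetical illustrations, not Hadoop's actual internals):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical JobTracker-side heartbeat bookkeeping, for illustration only.
public class HeartbeatMonitor {
    // Last heartbeat timestamp per TaskTracker, updated on every heartbeat.
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    // Assumed timeout; the real interval is configurable in Hadoop.
    private static final long TIMEOUT_MS = 10 * 60 * 1000;

    public void onHeartbeat(String trackerId) {
        lastHeartbeat.put(trackerId, System.currentTimeMillis());
    }

    // Called periodically; true means the tracker is considered dead
    // and its tasks should be reassigned to other TaskTrackers.
    public boolean isExpired(String trackerId) {
        Long last = lastHeartbeat.get(trackerId);
        return last == null || System.currentTimeMillis() - last > TIMEOUT_MS;
    }
}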
| Java type | Hadoop Writable type |
| --- | --- |
| boolean | BooleanWritable |
| byte | ByteWritable |
| int | IntWritable |
| float | FloatWritable |
| long | LongWritable |
| double | DoubleWritable |
| String | Text |
| Map | MapWritable |
| Array | ArrayWritable |
| null | NullWritable |
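Conversion between Java types and their Writable wrappers is explicit: wrap via the constructor or set(), unwrap via get() (or toString() for Text). A quick sketch:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        // Java -> Writable: wrap via constructor or set()
        IntWritable count = new IntWritable(5);
        Text word = new Text("hadoop");
        // Writable -> Java: unwrap via get() / toString()
        int n = count.get();        // 5
        String s = word.toString(); // "hadoop"
        System.out.println(s + " -> " + n);
    }
}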
Word-count example: counting how often each word appears in a document.
1. Add the Maven dependency to pom.xml:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.1.3</version>
</dependency>
2. The serialization class: Writable
(Note: this word-count example does not use a custom Writable; the class below is shown for reference.)
Hadoop has its own serialization mechanism, Writable. Compared with Java's built-in serialization, Hadoop serialization is more compact, faster, and supports multiple languages.
The steps to implement a Hadoop Writable:
import lombok.Data;
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

//1 Implement the Writable interface (WritableComparable = Writable + Comparable)
@Data
public class FlowBeanWritable implements WritableComparable<FlowBeanWritable> {
    private long upFlow;
    private long downFlow;
    private long sumFlow;

    //2 Provide a no-arg constructor (required for reflective instantiation)
    public FlowBeanWritable() { }

    //3 Getters and setters (generated here by Lombok's @Data)

    //4 Implement serialization and deserialization; the field order must be exactly the same in both
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeLong(upFlow);
        dataOutput.writeLong(downFlow);
        dataOutput.writeLong(sumFlow);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.upFlow = dataInput.readLong();
        this.downFlow = dataInput.readLong();
        this.sumFlow = dataInput.readLong();
    }

    //5 Override toString for the output format
    @Override
    public String toString() {
        return upFlow + "\t" + downFlow + "\t" + sumFlow;
    }

    //6 If the bean is transferred as a key, also implement compareTo
    @Override
    public int compareTo(FlowBeanWritable o) {
        // Descending order by sumFlow (returns 0 when equal)
        return Long.compare(o.getSumFlow(), this.getSumFlow());
    }
}
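To see write()/readFields() in action, here is a small round-trip sketch: serialize a FlowBeanWritable into a byte buffer, then deserialize it back. This is test-only code (it assumes the class above with Lombok-generated setters), not part of the MapReduce job:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WritableRoundTrip {
    public static void main(String[] args) throws IOException {
        FlowBeanWritable in = new FlowBeanWritable();
        in.setUpFlow(100);
        in.setDownFlow(200);
        in.setSumFlow(300);

        // Serialize: write() emits the three longs in declaration order
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        in.write(new DataOutputStream(buf));

        // Deserialize: readFields() must read them back in the same order
        FlowBeanWritable out = new FlowBeanWritable();
        out.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(out); // 100	200	300
    }
}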
3. Write the Mapper class, extending the Mapper base class
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Text outK = new Text();
    private IntWritable outV = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1 Take one line of input and convert it to a String for processing
        String line = value.toString();
        // 2 Split the String on spaces into a String array
        String[] words = line.split(" ");
        // 3 For each word, wrap the word and its count into a key-value pair and write it to the context for the next phase
        for (String word : words) {
            // Convert the String to Text, then write the key-value pair
            outK.set(word);
            context.write(outK, outV);
        }
    }
}
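The four generic parameters of Mapper<LongWritable, Text, Text, IntWritable> are, in order: the input key (the byte offset of the line in the file, supplied by the default TextInputFormat), the input value (the line itself), the output key (a word), and the output value (its count). For the input line "hello world hello", this map() emits (hello, 1), (world, 1), (hello, 1); the duplicate keys are merged later, in the reduce phase.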
4. Write the Reducer class, extending the Reducer base class
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable outV = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum the counts for this key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        outV.set(sum);
        // Write out the result
        context.write(key, outV);
    }
}
Note: reduce() is executed once per group, and records with the same key land in the same group, so here we only need to add up the counts for each key. For example, the group ("hello", [1, 1]) produces the output ("hello", 2).
5. Write the Driver class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class WordCountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Get the configuration and the job object
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // Associate the jar containing this Driver class
        job.setJarByClass(WordCountDriver.class);
        // Specify the Mapper and Reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // Set the key/value types of the map output and the final output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit the job (e.g. to YARN) and wait for it to finish
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
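To run the job, package the classes into a jar and submit it with the hadoop CLI. The jar name and paths below are placeholders (use the fully qualified class name if WordCountDriver lives in a package), and note that the output directory must not already exist:

hadoop jar wordcount.jar WordCountDriver /input/words.txt /output/wordcount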