Since this article is too long, the remaining parts are in several of my other blog posts!
Part 1: Hadoop introduction and installation
Part 2: HDFS
Part 4: Hands-on project case study
MapReduce is a programming framework for distributed computation and the core framework for developing "Hadoop-based data analysis applications".
The core job of MapReduce is to combine the business logic code written by the user with its built-in default components into a complete distributed computing program that runs concurrently on a Hadoop cluster.
MapReduce is easy to program
By simply implementing a few interfaces you get a distributed program that can run on a large number of cheap PC machines. In other words, writing a distributed program feels just like writing an ordinary serial program. This characteristic is what made MapReduce programming so popular.
Good scalability
When your computing resources are no longer sufficient, you can scale up the computing power simply by adding machines.
High fault tolerance
MapReduce was designed from the start to run on cheap PC machines, which requires high fault tolerance. For example, if one machine goes down, its computing tasks can be transferred to another node so the job does not fail, and this process needs no manual intervention: it is handled entirely inside Hadoop.
Suitable for offline processing of massive data at the PB scale and above
Thousands of servers in a cluster can work concurrently, providing the required data-processing capacity.
Not good at real-time computation
MapReduce cannot return results within milliseconds or seconds the way MySQL can.
Not good at stream processing
The input of stream processing is dynamic, whereas the input dataset of MapReduce is static and cannot change dynamically. This is determined by MapReduce's own design: the data source must be static.
Not good at DAG (directed acyclic graph) computation
Here multiple applications depend on each other, with the input of one application being the output of the previous one. MapReduce is not incapable of this, but every MapReduce job writes its output to disk, which causes a huge amount of disk I/O and very poor performance.
Text to be processed
Before calling submit(), the client obtains information about the data to be processed and then, based on the parameter configuration, forms a plan for assigning tasks
Submit the split information (to the YARN ResourceManager)
Split information, the jar package (only needed when the job runs on a different machine), and the xml configuration files
Calculate the number of MapTasks
Before data is read in the map phase, FileInputFormat divides the input files into splits. The number of splits determines the number of map tasks. The main factors affecting the number of maps (i.e. the number of splits) are:
File size. With a block size (dfs.block.size) of 128 MB, a 128 MB input file is divided into 1 split, while a 256 MB file is divided into 2 splits.
Number of files. FileInputFormat creates splits per file and only splits large files, i.e. files larger than an HDFS block. If dfs.block.size is set to 128 MB and the input directory contains 100 files, there will be at least 100 splits.
The split size. Splits are cut according to splitsize; if nothing is set, a split defaults to the size of an HDFS block, but an application can tune splitsize through two parameters (the minimum and maximum split size); a sketch of the computation follows this list.
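For reference, in Hadoop 2.x FileInputFormat derives the split size from the block size and the configurable minimum/maximum split sizes as max(minSize, min(maxSize, blockSize)). The sketch below mirrors that formula; the concrete sizes are made-up illustration values, not taken from this article.
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
public class SplitSizeSketch {
// Same formula FileInputFormat uses internally: max(minSize, min(maxSize, blockSize))
static long computeSplitSize(long blockSize, long minSize, long maxSize) {
return Math.max(minSize, Math.min(maxSize, blockSize));
}
public static void main(String[] args) throws Exception {
long blockSize = 128L * 1024 * 1024; // a 128 MB HDFS block
System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE)); // defaults: 128 MB splits
// Lowering the maximum split size below the block size produces more (smaller) splits:
Job job = Job.getInstance();
FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // splits become 64 MB
}
}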
Each MapTask processes its data as follows:
After all MapTasks have finished, the corresponding number of ReduceTasks (as many as there are partitions) are started, each handling its own range of data
Download the data belonging to the same partition to the ReduceTask's local disk
Merge the files (merge sort)
Reduce(k,v)
Context.write(k,v)
Write the result to a file under the specified output path
MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs this sort and transfers the mapper outputs to the reducers as input is called the shuffle.
Partitioning: grouping related data together
Each ReduceTask reads one partition and produces one result file; multiple partitions can be read by multiple ReduceTasks, which each produce their own result file (see the Partitioner sketch below).
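To make "each ReduceTask reads one partition" concrete, here is a minimal custom Partitioner sketch (the two-way split on the first letter is an invented example, not code from this post). It decides, for every map output pair, which partition, and therefore which ReduceTask, receives it.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
@Override
public int getPartition(Text key, IntWritable value, int numPartitions) {
// Words whose first character is 'm' or earlier go to partition 0, the rest to partition 1
String word = key.toString().toLowerCase();
int partition = (!word.isEmpty() && word.charAt(0) <= 'm') ? 0 : 1;
return partition % numPartitions; // never exceed the number of ReduceTasks
}
}
In the Driver this would be enabled with job.setPartitionerClass(FirstLetterPartitioner.class) together with job.setNumReduceTasks(2).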
Serialization
Serialization converts objects in memory into a byte sequence (or another data transfer protocol) so that they can be stored on disk (persisted) or transmitted over the network.
Deserialization
Deserialization converts a received byte sequence (or another data transfer protocol), or data persisted on disk, back into objects in memory.
Generally speaking, "live" objects only exist in memory and are gone once the machine is powered off. Moreover, "live" objects can only be used by the local process and cannot be sent to another computer on the network. Serialization, however, lets us store "live" objects and send them to remote computers.
Java's serialization (Serializable) is a heavyweight serialization framework: once an object is serialized it carries a lot of extra information (various checksums, headers, the inheritance hierarchy, and so on), which makes efficient transfer over the network difficult. So Hadoop developed its own serialization mechanism (Writable), with the following characteristics:
Compact
A compact format lets us make full use of network bandwidth, the scarcest resource in a data center
Fast
Inter-process communication forms the backbone of a distributed system, so the performance overhead of serialization and deserialization must be kept as small as possible; this is fundamental
Extensible
Protocols change to meet new requirements, so new protocols can be introduced between client and server, and the original serialization format must still be able to carry the new protocol messages
Interoperable
Clients and servers written in different languages can interact with each other
| Java type | Hadoop Writable type |
| --- | --- |
| boolean | BooleanWritable |
| byte | ByteWritable |
| int | IntWritable |
| float | FloatWritable |
| long | LongWritable |
| double | DoubleWritable |
| String | Text |
| map | MapWritable |
| array | ArrayWritable |
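As a quick illustration of the Writable types above, the following sketch (my own minimal example, not code from the original post) serializes a Text and an IntWritable into an in-memory buffer and reads them back in the same order, which is what Hadoop does when shipping kv pairs to disk or across the network.
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
public class WritableRoundTrip {
public static void main(String[] args) throws Exception {
// Serialize: write the fields into an in-memory byte buffer
DataOutputBuffer out = new DataOutputBuffer();
new Text("hadoop").write(out);
new IntWritable(42).write(out);
// Deserialize: read the same bytes back in exactly the order they were written
DataInputBuffer in = new DataInputBuffer();
in.reset(out.getData(), out.getLength());
Text word = new Text();
IntWritable count = new IntWritable();
word.readFields(in);
count.readFields(in);
System.out.println(word + " -> " + count.get()); // hadoop -> 42
}
}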
A user-written program consists of three parts: Mapper, Reducer, and Driver.
Mapper phase
(1) A user-defined Mapper must extend the framework's Mapper parent class
(2) The Mapper's input data comes in the form of KV pairs (the types of K and V can be customized)
(3) The Mapper's business logic is written in the map() method
(4) The Mapper's output data comes in the form of KV pairs (the types of K and V can be customized)
(5) The map() method (run by the MapTask process) is called once for each <K,V> pair
Reducer phase
(1) A user-defined Reducer must extend the framework's Reducer parent class
(2) The Reducer's input data types correspond to the Mapper's output data types, and are also KV pairs
(3) The Reducer's business logic is written in the reduce() method
(4) The ReduceTask process calls the reduce() method once for each group of <K,V> pairs that share the same K
Driver phase
Equivalent to the client of the YARN cluster: it submits our whole program to the YARN cluster, and what is submitted is a job object that encapsulates the runtime parameters of the MapReduce program
Requirement: word count, i.e. count how many times each word appears in the input file.
package com.atSchool.MapReduce;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
/**
 * Map phase
 * The Map input comes from reading a text file: the input key is the byte offset of the line,
 * and the input value is the line of text that was read.
 * The output key is a Text (the word), and the output value is Hadoop's integer type.
 */
public class MapDemo extends Mapper<LongWritable, Text, Text, IntWritable> {
/**
 * The Text class stores text using standard UTF-8 encoding. It provides methods to serialize,
 * deserialize and compare text at the byte level. The length is an integer, serialized using a
 * zero-compressed format.
 * It also offers ways to iterate over the characters without converting the byte array to a String,
 * plus utilities for serializing/deserializing strings, encoding/decoding, checking whether a byte
 * array contains valid UTF-8, and computing the length of an encoded string.
 */
private Text keyword = new Text();
private IntWritable keyvalue = new IntWritable(1);
/**
 * Parameter note: context is the context object that runs through the whole task
 */
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
throws IOException, InterruptedException {
// 1. Get one line
String line = value.toString();
// 2. Split it
String[] split = line.split(",");
// 3. Write to the context object
for (String string : split) {
keyword.set(string);
context.write(keyword, keyvalue);
}
}
}
package com.atSchool.MapReduce;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
/**
 * Reduce phase
 * The Reduce input matches the Map output; the output types stay the same.
 * By the Reduce phase the incoming data has the form hadoop -> {1,1,1}.
 */
public class ReduceDemo extends Reducer<Text, IntWritable, Text, IntWritable> {
IntWritable val = new IntWritable();
/**
 * Parameter 1, Text key: the key passed from Map. Parameter 2, Iterable<IntWritable> values: an iterator holding the {1,1,1} counts.
 */
@Override
protected void reduce(Text key, Iterable<IntWritable> values,
Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
// Sum the counts for this key
int sum = 0;
for (IntWritable intWritable : values) {
sum += intWritable.get();
}
// Write the result
val.set(sum);
context.write(key, val);
}
}
package com.atSchool.MapReduce;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.atSchool.utils.HDFSUtils;
public class WordCountJob extends Configured implements Tool {
public static void main(String[] args) throws Exception {
ToolRunner.run(new WordCountJob(), args);
}
@Override
public int run(String[] args) throws Exception {
// Get the Job
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
Job job = Job.getInstance(configuration);
// Set the job that needs to run
job.setJarByClass(WordCountJob.class);
// Tell the job where the Map and Reduce classes are
job.setMapperClass(MapDemo.class);
job.setReducerClass(ReduceDemo.class);
// Tell the job the data types of the Map output key and value
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// Tell the job the data types of the Reduce output key and value
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// Tell the job the input and output paths
FileInputFormat.addInputPath(job, new Path("/world.txt"));
/**
 * The output directory must not already exist, so handle that here
 */
FileSystem fileSystem = HDFSUtils.getFileSystem();
Path path = new Path("/out");
if (fileSystem.exists(path)) {
fileSystem.delete(path, true);
System.out.println("Deleted the existing output directory");
}
FileOutputFormat.setOutputPath(job, path);
// Submit the job
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "Job succeeded" : "Job failed");
return 0;
}
}
package com.atSchool.MapReduce;
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
/**
* Map 阶段 Map的输入是在读一个文本,输入时的key是一个偏移量,value是文本的类型
* 输出的key是文本类型,value就是Hadoop中的整数类型
*/
public class MapDemo extends Mapper<LongWritable, Text, Text, Text> {
/**
* 该类使用标准UTF-8编码存储文本。它提供了在字节级别序列化、反序列化和比较文本的方法。 长度类型为整数,并使用零压缩格式序列化.
* 此外,它还提供了字符串遍历的方法,而不将字节数组转换为字符串。
* 还包括用于序列化/反序列化字符串、编码/解码字符串、检查字节数组是否包含有效UTF-8代码、计算编码字符串长度的实用程序。
*/
private Text outKey = new Text();
/**
* 参数说明: context:上下文对象,贯穿整个任务的对象
*/
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context)
throws IOException, InterruptedException {
// 1. Get one line
String line = value.toString().trim();
// 2. Convert it to a char array and sort the characters
char[] charArray = line.toCharArray();
Arrays.sort(charArray);
String string = new String(charArray);
// 3. Write to the context object
outKey.set(string);
// The output key is the sorted characters; the value is the original word that was passed in
context.write(outKey, value);
}
}
package com.atSchool.MapReduce;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
/**
* Reduce 阶段 Reduce的输入和Map的输出应该是一样的 输出的话保持不变就可以了
* 到了Reduce阶段,传过来的数据是hadoop{1,1,1}的形式
*/
public class ReduceDemo extends Reducer<Text, Text, Text, Text> {
Text val = new Text();
/**
* 参数1Text key:Map传过来的key 参数2Iterable values(是一个迭代器)中存储的是{1,1,1}
*/
@Override
protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context)
throws IOException, InterruptedException {
StringBuffer stringBuffer = new StringBuffer();
// Concatenate all original words that share the same sorted key
for (Text text : values) {
stringBuffer.append("-" + text.toString());
}
// 输出
val.set(stringBuffer.toString());
context.write(key, val);
}
}
package com.atSchool.MapReduce;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.atSchool.utils.HDFSUtils;
public class WordCountJob extends Configured implements Tool {
public static void main(String[] args) throws Exception {
ToolRunner.run(new WordCountJob(), args);
}
@Override
public int run(String[] args) throws Exception {
// 获取Job
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
Job job = Job.getInstance(configuration);
// 设置需要运行的任务
job.setJarByClass(WordCountJob.class);
// 告诉job Map和Reduce在哪
job.setMapperClass(MapDemo.class);
job.setReducerClass(ReduceDemo.class);
// 告诉job Map输出的key和value的数据类型的是什么
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
// 告诉job Reduce输出的key和value的数据类型的是什么
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// 告诉job输入和输出的路径
FileInputFormat.addInputPath(job, new Path("/world.txt"));
/**
* 因为输出的文件不允许存在,所以需要处理一下
*/
FileSystem fileSystem = HDFSUtils.getFileSystem();
Path path = new Path("/MapReduceOut");
if (fileSystem.exists(path)) {
fileSystem.delete(path, true);
System.out.println("删除成功");
}
FileOutputFormat.setOutputPath(job, path);
// 提交任务
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "执行成功" : "执行失败");
return 0;
}
}
Data to analyze:
package com.atSchool.MapReduce;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
/**
* Map 阶段 Map的输入是在读一个文本,输入时的key是一个偏移量,value是文本的类型
* 输出的key是文本类型,value就是Hadoop中的整数类型
*/
public class MapDemo extends Mapper<LongWritable, Text, Text, IntWritable> {
/**
* 该类使用标准UTF-8编码存储文本。它提供了在字节级别序列化、反序列化和比较文本的方法。 长度类型为整数,并使用零压缩格式序列化.
* 此外,它还提供了字符串遍历的方法,而不将字节数组转换为字符串。
* 还包括用于序列化/反序列化字符串、编码/解码字符串、检查字节数组是否包含有效UTF-8代码、计算编码字符串长度的实用程序。
*/
private Text outKey = new Text();
private IntWritable outValue = new IntWritable();
/**
* 参数说明: context:上下文对象,贯穿整个任务的对象
*/
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
throws IOException, InterruptedException {
/**
 * 1. Hadoop divides the MapReduce input data into equal-sized chunks called input splits (or simply "splits").
 * 2. Hadoop builds one map task per split, and that task runs the user-defined map function on every record in the split.
 * 3. getSplits() is responsible for dividing the file into InputSplits, but an InputSplit does not physically cut the file;
 *    it only describes how the data is divided, i.e. the split is purely logical.
 * 4. Each InputSplit corresponds to one map task and logically provides the key-value pairs that the map task will process.
 */
// 获取到文件名对应的InputSplit
InputSplit inputSplit = context.getInputSplit();
// 强转成子类类型FileSplit
FileSplit fSplit = (FileSplit) inputSplit;
// 获取到路径
Path path = fSplit.getPath();
// 获取到文件名
String name = path.getName();
// 1、获取一行
String line = value.toString();
// 2、截取
String[] split = line.split(" +");
// 3、输出到上下文对象中
outKey.set(name);
int valueOf = Integer.valueOf(split[4].trim());
// The data file contains dirty records (-9999), so check before writing
if (valueOf != -9999) {
outValue.set(valueOf);
context.write(outKey, outValue);
}
}
}
package com.atSchool.MapReduce;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
/**
* Reduce 阶段 Reduce的输入和Map的输出应该是一样的 输出的话保持不变就可以了
* 到了Reduce阶段,传过来的数据是hadoop{1,1,1}的形式
*/
public class ReduceDemo extends Reducer<Text, IntWritable, Text, IntWritable> {
IntWritable val = new IntWritable();
/**
* 参数1Text key:Map传过来的key 参数2Iterable values(是一个迭代器)中存储的是{1,1,1}
*/
@Override
protected void reduce(Text key, Iterable<IntWritable> values,
Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
// 累加
int sum = 0; // 存储累加后的值
int count = 0; // 统计数据的个数
for (IntWritable intWritable : values) {
sum += intWritable.get();
count++;
}
sum /= count;
// 输出
val.set(sum);
context.write(key, val);
}
}
package com.atSchool.MapReduce;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.atSchool.utils.HDFSUtils;
public class WordCountJob extends Configured implements Tool {
public static void main(String[] args) throws Exception {
ToolRunner.run(new WordCountJob(), args);
}
@Override
public int run(String[] args) throws Exception {
// 获取Job
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
Job job = Job.getInstance(configuration);
// 设置需要运行的任务
job.setJarByClass(WordCountJob.class);
// 告诉job Map和Reduce在哪
job.setMapperClass(MapDemo.class);
job.setReducerClass(ReduceDemo.class);
// 告诉job Map输出的key和value的数据类型的是什么
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// 告诉job Reduce输出的key和value的数据类型的是什么
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// 告诉job输入和输出的路径
FileInputFormat.addInputPath(job, new Path("/weather_forecast_source"));
/**
* 因为输出的文件不允许存在,所以需要处理一下
*/
FileSystem fileSystem = HDFSUtils.getFileSystem();
Path path = new Path("/MapReduceOut");
if (fileSystem.exists(path)) {
fileSystem.delete(path, true);
System.out.println("删除成功");
}
FileOutputFormat.setOutputPath(job, path);
// 提交任务
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "执行成功" : "执行失败");
return 0;
}
}
Data to analyze:
package com.atSchool.MapReduce;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
/**
* Map 阶段 Map的输入是在读一个文本,输入时的key是一个偏移量,value是文本的类型
* 输出的key是文本类型,value就是Hadoop中的整数类型
*/
public class MapDemo extends Mapper<LongWritable, Text, Text, IntWritable> {
/**
* 该类使用标准UTF-8编码存储文本。它提供了在字节级别序列化、反序列化和比较文本的方法。 长度类型为整数,并使用零压缩格式序列化.
* 此外,它还提供了字符串遍历的方法,而不将字节数组转换为字符串。
* 还包括用于序列化/反序列化字符串、编码/解码字符串、检查字节数组是否包含有效UTF-8代码、计算编码字符串长度的实用程序。
*/
private Text outKey = new Text();
private IntWritable outValue = new IntWritable();
/**
* 参数说明: context:上下文对象,贯穿整个任务的对象
*/
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
throws IOException, InterruptedException {
// 1、获取一行
String line = value.toString();
// 2、截取
String[] split = line.split(" +");
// 3. Write to the context object
// The data file contains dirty records (a "-" in the last field), so check first
if (!split[split.length - 1].trim().equals("-")) {
outKey.set(split[0].trim()); // the key is the IP address
outValue.set(Integer.valueOf(split[split.length - 1].trim()));
context.write(outKey, outValue);
}
}
}
package com.atSchool.MapReduce;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
/**
* Reduce 阶段 Reduce的输入和Map的输出应该是一样的 输出的话保持不变就可以了
* 到了Reduce阶段,传过来的数据是hadoop{1,1,1}的形式
*/
public class ReduceDemo extends Reducer<Text, IntWritable, Text, IntWritable> {
IntWritable val = new IntWritable();
/**
* 参数1Text key:Map传过来的key 参数2Iterable values(是一个迭代器)中存储的是{1,1,1}
*/
@Override
protected void reduce(Text key, Iterable<IntWritable> values,
Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
// 累加
int sum = 0; // 存储累加后的值
for (IntWritable intWritable : values) {
sum += intWritable.get();
}
// 输出
val.set(sum);
context.write(key, val);
}
}
package com.atSchool.MapReduce;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.atSchool.utils.HDFSUtils;
public class WordCountJob extends Configured implements Tool {
public static void main(String[] args) throws Exception {
ToolRunner.run(new WordCountJob(), args);
}
@Override
public int run(String[] args) throws Exception {
// 获取Job
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
Job job = Job.getInstance(configuration);
// 设置需要运行的任务
job.setJarByClass(WordCountJob.class);
// 告诉job Map和Reduce在哪
job.setMapperClass(MapDemo.class);
job.setReducerClass(ReduceDemo.class);
// 告诉job Map输出的key和value的数据类型的是什么
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// 告诉job Reduce输出的key和value的数据类型的是什么
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// 告诉job输入和输出的路径
FileInputFormat.addInputPath(job, new Path("/traffic_source"));
/**
* 因为输出的文件不允许存在,所以需要处理一下
*/
FileSystem fileSystem = HDFSUtils.getFileSystem();
Path path = new Path("/MapReduceOut");
if (fileSystem.exists(path)) {
fileSystem.delete(path, true);
System.out.println("删除成功");
}
FileOutputFormat.setOutputPath(job, path);
// 提交任务
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "执行成功" : "执行失败");
return 0;
}
}
Copy the hadoop-2.7.3\etc\hadoop\log4j.properties file from the Hadoop installation on Windows
into the src directory of the project.
_SUCCESS
: only indicates that the job completed successfully.
part-r-00000
: stores the results of the job (see the sketch below for reading it back from code).
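If you want to check the result from code rather than the HDFS web UI, a small sketch like this (reusing the HDFSUtils helper from this post; /out/part-r-00000 is whatever output path your job used) prints the result file to the console:
import java.io.InputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import com.atSchool.utils.HDFSUtils;
public class PrintResult {
public static void main(String[] args) throws Exception {
FileSystem fileSystem = HDFSUtils.getFileSystem();
// Open the reducer's output file and copy it to standard output
try (InputStream in = fileSystem.open(new Path("/out/part-r-00000"))) {
IOUtils.copyBytes(in, System.out, 4096, false);
}
}
}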
For a custom bean object to be serialized and transferred, it must implement the serialization interface; note the following points.
It must implement the Writable interface
During deserialization the framework calls the no-argument constructor via reflection, so a no-argument constructor is required
public FlowBean() {
super();
}
Override the serialization method
@Override
public void write(DataOutput out) throws IOException {
out.writeLong(upFlow);
out.writeLong(downFlow);
out.writeLong(sumFlow);
}
Override the deserialization method
@Override
public void readFields(DataInput in) throws IOException {
upFlow = in.readLong();
downFlow = in.readLong();
sumFlow = in.readLong();
}
Note that the deserialization order must be exactly the same as the serialization order
To display the result in a file you need to override toString(); the fields can be separated with "\t" for convenient later use.
Data to analyze:
1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157995052 13826544101 5C-0E-8B-C7-F1-E0:CMCC 120.197.40.4 4 0 264 0 200
1363157991076 13926435656 20-10-7A-28-CC-0A:CMCC 120.196.100.99 2 4 132 1512 200
1363154400022 13926251106 5C-0E-8B-8B-B1-50:CMCC 120.197.40.4 4 0 240 0 200
1363157993044 18211575961 94-71-AC-CD-E6-18:CMCC-EASY 120.196.100.99 iface.qiyi.com 视频网站 15 12 1527 2106 200
1363157995074 84138413 5C-0E-8B-8C-E8-20:7DaysInn 120.197.40.4 122.72.52.12 20 16 4116 1432 200
1363157993055 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 18 15 1116 954 200
1363157995033 15920133257 5C-0E-8B-C7-BA-20:CMCC 120.197.40.4 sug.so.360.cn 信息安全 20 20 3156 2936 200
1363157983019 13719199419 68-A1-B7-03-07-B1:CMCC-EASY 120.196.100.82 4 0 240 0 200
1363157984041 13660577991 5C-0E-8B-92-5C-20:CMCC-EASY 120.197.40.4 s19.cnzz.com 站点统计 24 9 6960 690 200
1363157973098 15013685858 5C-0E-8B-C7-F7-90:CMCC 120.197.40.4 rank.ie.sogou.com 搜索引擎 28 27 3659 3538 200
1363157986029 15989002119 E8-99-C4-4E-93-E0:CMCC-EASY 120.196.100.99 www.umeng.com 站点统计 3 3 1938 180 200
1363157992093 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 15 9 918 4938 200
1363157986041 13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4 3 3 180 180 200
1363157984040 13602846565 5C-0E-8B-8B-B6-00:CMCC 120.197.40.4 2052.flash2-http.qq.com 综合门户 15 12 1938 2910 200
1363157995093 13922314466 00-FD-07-A2-EC-BA:CMCC 120.196.100.82 img.qfc.cn 12 12 3008 3720 200
1363157982040 13502468823 5C-0A-5B-6A-0B-D4:CMCC-EASY 120.196.100.99 y0.ifengimg.com 综合门户 57 102 7335 110349 200
1363157986072 18320173382 84-25-DB-4F-10-1A:CMCC-EASY 120.196.100.99 input.shouji.sogou.com 搜索引擎 21 18 9531 2412 200
1363157990043 13925057413 00-1F-64-E1-E6-9A:CMCC 120.196.100.55 t3.baidu.com 搜索引擎 69 63 11058 48243 200
1363157988072 13760778710 00-FD-07-A4-7B-08:CMCC 120.196.100.82 2 2 120 120 200
1363157985066 13560436666 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157993055 13560436666 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 18 15 1116 954 200
// Explanation:
1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
Report timestamp  Phone number  AP MAC  AC MAC  Visited URL  URL category  Upstream packet count  Downstream packet count  Upstream traffic  Downstream traffic
Requirement: compute each phone number's total upstream traffic, total downstream traffic, and combined total traffic.
package com.atSchool.Serialize;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
/**
 * Custom serializable class
 * 1. Implement the Writable interface
 */
public class FlowBean implements Writable {
// Upstream traffic (traffic used when uploading)
private Long upFlow;
// Downstream traffic (traffic used when downloading)
private Long downFlow;
// Total traffic
private Long sumFlow;
// 2. Deserialization calls the no-argument constructor via reflection, so one must be provided
public FlowBean() {
super();
}
// 带参的构造器
public FlowBean(Long upFlow, Long downFlow) {
this.upFlow = upFlow;
this.downFlow = downFlow;
this.sumFlow = upFlow + downFlow;
}
// 3、重写序列化方法
@Override
public void write(DataOutput out) throws IOException {
out.writeLong(upFlow);
out.writeLong(downFlow);
out.writeLong(sumFlow);
}
// 4、重写反序列化方法
@Override
public void readFields(DataInput in) throws IOException {
upFlow = in.readLong();
downFlow = in.readLong();
sumFlow = in.readLong();
}
/**
 * Override toString
 * If toString is not overridden, the contents of the FlowBean are not printed; instead you get output like:
 * 13480253104 com.atSchool.Serialize.FlowBean@77353c84
 * 13502468823 com.atSchool.Serialize.FlowBean@31b3f152
 * 13560436666 com.atSchool.Serialize.FlowBean@52a567cf
 * 13560439658 com.atSchool.Serialize.FlowBean@3c17b48b
 * 13602846565 com.atSchool.Serialize.FlowBean@62186e91
 * 13660577991 com.atSchool.Serialize.FlowBean@4c0f3ae1
 * 13719199419 com.atSchool.Serialize.FlowBean@356db7b0
 */
@Override
public String toString() {
return upFlow + "\t" + downFlow + "\t" + sumFlow;
}
public Long getUpFlow() {
return upFlow;
}
public void setUpFlow(Long upFlow) {
this.upFlow = upFlow;
}
public Long getDownFlow() {
return downFlow;
}
public void setDownFlow(Long downFlow) {
this.downFlow = downFlow;
}
public Long getSumFlow() {
return sumFlow;
}
public void setSumFlow(Long sumFlow) {
this.sumFlow = sumFlow;
}
}
package com.atSchool.Serialize;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MapSerialize extends Mapper<LongWritable, Text, Text, FlowBean> {
private FlowBean outValue;
private Text outKey = new Text();
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, FlowBean>.Context context)
throws IOException, InterruptedException {
// 读取一行
String line = value.toString();
// 分割
String[] split = line.split("\t+");
// 输出到上下文对象中
String tel = split[1].trim();
String upFlow = split[split.length - 3].trim();
String downFlow = split[split.length - 2].trim();
outKey.set(tel);
outValue = new FlowBean(Long.parseLong(upFlow), Long.parseLong(downFlow));
// 输出的key:电话号码;value:FlowBean对象
context.write(outKey, outValue);
}
}
package com.atSchool.Serialize;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class ReduceSerialize extends Reducer<Text, FlowBean, Text, FlowBean> {
@Override
protected void reduce(Text key, Iterable<FlowBean> value, Reducer<Text, FlowBean, Text, FlowBean>.Context context)
throws IOException, InterruptedException {
long upFlow = 0;
long downFlow = 0;
// 累加上行和下行流量数
for (FlowBean flowBean : value) {
upFlow += flowBean.getUpFlow();
downFlow += flowBean.getDownFlow();
}
// 输出的key:电话号码;value:FlowBean对象
context.write(key, new FlowBean(upFlow, downFlow));
}
}
package com.atSchool.Serialize;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.atSchool.utils.HDFSUtils;
public class FlowJob extends Configured implements Tool {
public static void main(String[] args) throws Exception {
ToolRunner.run(new FlowJob(), args);
}
@Override
public int run(String[] args) throws Exception {
// 获取Job
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
Job job = Job.getInstance(configuration);
// 设置需要运行的任务
job.setJarByClass(FlowJob.class);
// 告诉job Map和Reduce在哪
job.setMapperClass(MapSerialize.class);
job.setReducerClass(ReduceSerialize.class);
// 告诉job Map输出的key和value的数据类型的是什么
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(FlowBean.class);
// 告诉job Reduce输出的key和value的数据类型的是什么
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FlowBean.class);
// 告诉job输入和输出的路径
FileInputFormat.addInputPath(job, new Path("/phone_traffic.txt"));
/**
* 因为输出的文件不允许存在,所以需要处理一下
*/
FileSystem fileSystem = HDFSUtils.getFileSystem();
Path path = new Path("/MapReduceOut");
if (fileSystem.exists(path)) {
fileSystem.delete(path, true);
System.out.println("删除成功");
}
FileOutputFormat.setOutputPath(job, path);
// 提交任务
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "执行成功" : "执行失败");
return 0;
}
}
The MR program sorts the data while processing it (the kv pairs output by map are sorted before being passed to reduce), and the sort is based on the key output by map.
So if we want our own sort order, we can put the sorting criterion into the key: have the key class implement the WritableComparable interface and override its compareTo method. Finally, in the reduce, swap the key and value of the data coming from the mapper, so that the output has the phone number first and everything else after it.
package com.atSchool.Serialize;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
/**
 * Custom serializable class
 * 1. Implement the WritableComparable interface
 * Note:
 * public interface WritableComparable<T> extends Writable, Comparable<T> {}
 * This interface extends both Writable and Comparable, so implementing it alone is enough
 */
public class FlowBean implements WritableComparable<FlowBean> {
// 上行流量(即上传时使用的流量)
private Long upFlow;
// 下行流量(即下载时使用的流量)
private Long downFlow;
// 总流量
private Long sumFlow;
// 2、反序列化时,需要反射调用空参构造函数,所以必须有空参构造
public FlowBean() {
super();
}
// 带参的构造器
public FlowBean(Long upFlow, Long downFlow) {
this.upFlow = upFlow;
this.downFlow = downFlow;
this.sumFlow = upFlow + downFlow;
}
// 为了在排序的时候使用(排序时已经知道了sumFlow,再用上面的一个进行加减不太好)
public FlowBean(Long upFlow, Long downFlow, Long sumFlow) {
this.upFlow = upFlow;
this.downFlow = downFlow;
this.sumFlow = sumFlow;
}
// 3、重写序列化方法
@Override
public void write(DataOutput out) throws IOException {
out.writeLong(upFlow);
out.writeLong(downFlow);
out.writeLong(sumFlow);
}
// 4、重写反序列化方法
@Override
public void readFields(DataInput in) throws IOException {
upFlow = in.readLong();
downFlow = in.readLong();
sumFlow = in.readLong();
}
// 重写toString方法
@Override
public String toString() {
return upFlow + "\t" + downFlow + "\t" + sumFlow;
}
public Long getUpFlow() {
return upFlow;
}
public void setUpFlow(Long upFlow) {
this.upFlow = upFlow;
}
public Long getDownFlow() {
return downFlow;
}
public void setDownFlow(Long downFlow) {
this.downFlow = downFlow;
}
public Long getSumFlow() {
return sumFlow;
}
public void setSumFlow(Long sumFlow) {
this.sumFlow = sumFlow;
}
// Override compareTo so the sort is by sumFlow in descending order
@Override
public int compareTo(FlowBean flowBean) {
return sumFlow > flowBean.sumFlow ? -1 : 1;
}
}
Here the key and value are first swapped so that the sort happens on the FlowBean.
package com.atSchool.Serialize;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
/**
 * Since we want to sort by sumFlow, the map output key is set to the FlowBean type
 */
public class MapSort extends Mapper<LongWritable, Text, FlowBean, Text> {
private Text outValue = new Text();
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, FlowBean, Text>.Context context)
throws IOException, InterruptedException {
String line = value.toString();
// 分割
String[] split = line.split("\t+");
// 输出到上下文对象中
String tel = split[0].trim();
String upFlow = split[1].trim();
String downFlow = split[2].trim();
String sumFlow = split[3].trim();
FlowBean flowBean = new FlowBean(Long.parseLong(upFlow), Long.parseLong(downFlow), Long.parseLong(sumFlow));
outValue.set(tel);
context.write(flowBean, outValue);
}
}
Here the key and value are swapped back.
package com.atSchool.Serialize;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class ReduceSort extends Reducer<FlowBean, Text, Text, FlowBean> {
@Override
protected void reduce(FlowBean key, Iterable<Text> value, Reducer<FlowBean, Text, Text, FlowBean>.Context context)
throws IOException, InterruptedException {
// No need to loop over value: the value passed in is just a phone number, so take it out directly
Iterator<Text> iterator = value.iterator();
Text next = iterator.next();
context.write(next, key);
}
}
package com.atSchool.Serialize;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.atSchool.utils.HDFSUtils;
public class SortJob extends Configured implements Tool {
public static void main(String[] args) throws Exception {
ToolRunner.run(new SortJob(), args);
}
@Override
public int run(String[] args) throws Exception {
// 获取Job
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
Job job = Job.getInstance(configuration);
// 设置需要运行的任务
job.setJarByClass(SortJob.class);
// 告诉job Map和Reduce在哪
job.setMapperClass(MapSort.class);
job.setReducerClass(ReduceSort.class);
// 告诉job Map输出的key和value的数据类型的是什么
job.setMapOutputKeyClass(FlowBean.class);
job.setMapOutputValueClass(Text.class);
// 告诉job Reduce输出的key和value的数据类型的是什么
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FlowBean.class);
// 告诉job输入和输出的路径
FileInputFormat.addInputPath(job, new Path("/MapReduceOut/part-r-00000"));
/**
* 因为输出的文件不允许存在,所以需要处理一下
*/
FileSystem fileSystem = HDFSUtils.getFileSystem();
Path path = new Path("/SortOut");
if (fileSystem.exists(path)) {
fileSystem.delete(path, true);
System.out.println("删除成功");
}
FileOutputFormat.setOutputPath(job, path);
// 提交任务
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "执行成功" : "执行失败");
return 0;
}
}
Counters record the progress and status of a job; you can think of them as a kind of log. We can insert a counter at some point in the program to record changes in the data or in the progress.
MapReduce counters (Counter) give us a window for observing all kinds of detailed data while a MapReduce job is running. They are very helpful for MapReduce performance tuning, because most of the evaluation in MapReduce performance optimization is based on the values these counters report.
MapReduce ships with many default counters. Let us analyze what these default counters mean so it is easier to interpret job results, for example the number of bytes read and written, the bytes and record counts on the Map input/output, and the bytes and record counts on the Reduce input/output.
...
...
21/04/02 11:06:49 INFO mapreduce.Job: Counters: 35 // Counters: 35 means 35 kinds of counters were used during this run
File System Counters // counter group
FILE: Number of bytes read=46873922
FILE: Number of bytes written=107355440
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=765843722
HDFS: Number of bytes written=624014
HDFS: Number of read operations=37
HDFS: Number of large read operations=0
HDFS: Number of write operations=10
Map-Reduce Framework // counter group
Map input records=1948789
Map output records=1151021
Map output bytes=21132892
Map output materialized bytes=23434952
Input split bytes=387
Combine input records=0
Combine output records=0
Reduce input groups=31046
Reduce shuffle bytes=23434952
Reduce input records=1151021
Reduce output records=31046
Spilled Records=2302042
Shuffled Maps =3
Failed Shuffles=0
Merged Map outputs=3
GC time elapsed (ms)=1612
Total committed heap usage (bytes)=3603431424
Shuffle Errors // counter group
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters // counter group
Bytes Read=218157941
File Output Format Counters // counter group
Bytes Written=624014
File System Counters: the data an MR job depends on comes from different file systems; this group records the job's read/write statistics against those file systems
HDFS: Number of bytes read=765843722
**Note:** the map reads data from HDFS, including the source file contents and the split metadata, so this value is slightly larger than FileInputFormatCounters.BYTES_READ.
FILE: Number of bytes written=107355440
**Note:** the total number of bytes the map tasks wrote to local disk (the merge on the Reduce side also writes to local files)
FILE: Number of bytes read=46873922
**Note:** the reduce reads data from the local file system (map results are stored on local disk)
HDFS: Number of bytes written=624014
**Note:** the final results written to HDFS
Job Counters (not present in the example above): statistics about the MR sub-tasks, i.e. map tasks and reduce tasks
Launched map tasks=4
**Note:** the number of map tasks launched
Launched reduce tasks=5
**Note:** the number of reduce tasks launched
Map-Reduce Framework: MR framework counters
Map input records=1948789
**Note:** the total number of lines the map tasks read from HDFS
Reduce input groups=31046
**Note:** the number of input groups on the Reduce side
Combine input records=0
**Note:** Combiner input = map output
Spilled Records=2302042
**Note:** spilling happens on both the map and the reduce side; this counts how many records in total were spilled from memory to disk
Shuffle Errors:
File Input Format Counters: input format counters
Bytes Read=218157941
**Note:** in the map phase, the total number of bytes of all values passed into the map method across all map tasks
File Output Format Counters: output format counters
Bytes Written=624014
**Note:** the total number of bytes of MR output, including the words, the spaces, the counts and the newline of every line
// How to define a custom counter
Counter counter = context.getCounter("find hello", "hello");
if (string.contains("hello")) {
counter.increment(1L); // +1 for each occurrence
}
One method runs before the map method and one runs after it (setup and cleanup), and each of them runs only once (see the sketch below).
So when writing counters we do not even need a reduce; simply override cleanup and write the output there.
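This is essentially what the framework's Mapper.run() does, which is why setup() and cleanup() each run exactly once per MapTask while map() runs once per record; the class below just spells out the default lifecycle (a simplified sketch, not a verbatim copy of the Hadoop source):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Mirrors what the default Mapper.run() already does, spelled out to show the lifecycle
public class LifecycleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
@Override
public void run(Context context) throws IOException, InterruptedException {
setup(context); // runs once, before the first record
try {
while (context.nextKeyValue()) { // map() is called once per input record
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
} finally {
cleanup(context); // runs once, after the last record
}
}
}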
package com.atSchool.Counter;
import java.io.IOException;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Mapper;
/**
 * Count the total number of lines and the total number of words
 */
public class WordCounterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
enum COUNTERS {
ALLLINES, ALLWORDS
}
Counter counter = null;
Counter counter2 = null;
/**
 * Called once for each key/value pair in the input split. Most applications should override this; the default is the identity function.
 */
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
throws IOException, InterruptedException {
/**
* 获取一个计数器
*
* 此处用了动态代理
*/
counter = context.getCounter(COUNTERS.ALLLINES);
counter2 = context.getCounter(COUNTERS.ALLWORDS);
/**
* void increment(long incr) 按给定值递增此计数器 参数: incr-增加该计数器的价值
*/
counter.increment(1L); // 出现一次 +1
String[] split = value.toString().split(" +");
for (String string : split) {
/**
* public static boolean isNotBlank(java.lang.String str)
* 检查字符串是否为空(“”),不为空,是否仅为空白。
* 示例:
* StringUtils.isNotBlank(null) = false
* StringUtils.isNotBlank("") = false
* StringUtils.isNotBlank(" ") = false
* StringUtils.isNotBlank("bob") = true
* StringUtils.isNotBlank(" bob ") = true
*/
if (StringUtils.isNotBlank(string)) {
counter2.increment(1L); // 出现一次 +1
}
}
}
/**
 * One method runs before the map method and one runs after it:
 * before: the setup(context) method
 * after: the cleanup(context) method
 *
 * So we do not need a reduce here; simply override cleanup and write the output there.
 */
@Override
protected void cleanup(Mapper<LongWritable, Text, Text, IntWritable>.Context context)
throws IOException, InterruptedException {
// Read the counters back out
String name = counter.getName();
long value = counter.getValue();
String name2 = counter2.getName();
long value2 = counter2.getValue();
context.write(new Text(name), new IntWritable((int) value));
context.write(new Text(name2), new IntWritable((int) value2));
}
}
package com.atSchool.Counter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.atSchool.utils.HDFSUtils;
public class WordCounterJob extends Configured implements Tool {
public static void main(String[] args) throws Exception {
ToolRunner.run(new WordCounterJob(), args);
}
@Override
public int run(String[] args) throws Exception {
// 获取Job
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
Job job = Job.getInstance(configuration);
// 设置需要运行的任务
job.setJarByClass(WordCounterJob.class);
// 告诉job Map和Reduce在哪
job.setMapperClass(WordCounterMapper.class);
job.setNumReduceTasks(0); // NOTE: set to 0 because no Reduce is used here
// 告诉job Map输出的key和value的数据类型的是什么
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// 告诉job输入和输出的路径
FileInputFormat.addInputPath(job, new Path("/a.txt"));
/**
* 因为输出的文件不允许存在,所以需要处理一下
*/
FileSystem fileSystem = HDFSUtils.getFileSystem();
Path path = new Path("/MapReduceOut");
if (fileSystem.exists(path)) {
fileSystem.delete(path, true);
System.out.println("删除成功");
}
FileOutputFormat.setOutputPath(job, path);
// 提交任务
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "执行成功" : "执行失败");
return 0;
}
}
Data to analyze:
Data description:
Date  Camera ID  Address  Speed  Latitude  Longitude  Location
07/01/2014,CHI003,4124 W FOSTER AVE,123,41.9756053,-87.7316698,"(41.9756053, -87.7316698)"
07/01/2014,CHI004,5120 N PULASKI,68,41.9743327,-87.728347,"(41.9743327, -87.728347)"
07/01/2014,CHI005,2080 W PERSHING,68,41.8231888,-87.6773488,"(41.8231888, -87.6773488)"
07/01/2014,CHI007,3843 S WESTERN,75,41.823564,-87.6847211,"(41.823564, -87.6847211)"
07/01/2014,CHI008,3655 W JACKSON,15,41.8770708,-87.7181683,"(41.8770708, -87.7181683)"
07/01/2014,CHI009,3646 W MADISON,50,41.8809382,-87.7178984,"(41.8809382, -87.7178984)"
07/01/2014,CHI010,1111 N HUMBOLDT,77,,,
Requirement: count the vehicles whose speed exceeds 120.
package com.atSchool.Counter.overspeed;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Mapper;
public class SpeedMaperr extends Mapper<LongWritable, Text, Text, LongWritable> {
enum speed {
OVERSPPD_CARS
}
Counter counter = null;
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context)
throws IOException, InterruptedException {
// Get the counter
counter = context.getCounter(speed.OVERSPPD_CARS);
// Process the record
String[] split = value.toString().split(",");
// The first line is the header, so it needs a small check
if (Character.isDigit(split[3].charAt(0))) {
// Count
if (Integer.parseInt(split[3].trim()) >= 120) {
counter.increment(1L); // any speed of 120 or more counts as speeding, +1
}
}
}
@Override
protected void cleanup(Mapper<LongWritable, Text, Text, LongWritable>.Context context)
throws IOException, InterruptedException {
String name = counter.getName();
long value = counter.getValue();
context.write(new Text(name), new LongWritable(value));
}
}
package com.atSchool.Counter.overspeed;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.atSchool.utils.HDFSUtils;
public class SpeedJob extends Configured implements Tool {
public static void main(String[] args) throws Exception {
ToolRunner.run(new SpeedJob(), args);
}
@Override
public int run(String[] args) throws Exception {
// 获取Job
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
Job job = Job.getInstance(configuration);
// 设置需要运行的任务
job.setJarByClass(SpeedJob.class);
// 告诉job Map和Reduce在哪
job.setMapperClass(SpeedMaperr.class);
job.setNumReduceTasks(0); // set to 0 because no Reduce is used here
// 告诉job Map输出的key和value的数据类型的是什么
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
// 告诉job输入和输出的路径
FileInputFormat.addInputPath(job, new Path("/overspeed_cars.csv"));
/**
* 因为输出的文件不允许存在,所以需要处理一下
*/
FileSystem fileSystem = HDFSUtils.getFileSystem();
Path path = new Path("/MapReduceOut");
if (fileSystem.exists(path)) {
fileSystem.delete(path, true);
System.out.println("删除成功");
}
FileOutputFormat.setOutputPath(job, path);
// 提交任务
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "执行成功" : "执行失败");
return 0;
}
}
We know that map and reduce both have a limitation: map is executed once per line read, and reduce is executed once per group.
But what if we want to get all of the data first, filter it according to the requirement, and only then write the output (for example, the classic topN problem)? Using only map and reduce clearly cannot achieve this. So what do we do?
This is where the property of setup and cleanup comes in: they execute only once. So we can do the final filtering of the data in cleanup and output the result there. That way the data is not written out group by group; instead we collect all of it, filter it at the end, and then output it.
Analysis:
We know MapReduce can split and aggregate, so step one is: read each word in the map, then aggregate in the reduce to get the number of occurrences of each word.
But how do we output only the top three? Map runs once per line and reduce once per group, so map and reduce alone cannot control how many times output is written.
However, both map and reduce have setup and cleanup, and these two methods run only once. So in the reduce phase we can put each word as the key and its count as the value into a Map collection (only storing, not writing out yet). Then, in the reduce's cleanup phase, we sort that Map and output the top three.
package com.atSchool.topN;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class maperrDemo extends Mapper<LongWritable, Text, Text, IntWritable> {
private Text outKey = new Text();
private IntWritable outValue = new IntWritable(1);
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
throws IOException, InterruptedException {
String[] split = value.toString().split(" +");
for (String string : split) {
outKey.set(string);
context.write(outKey, outValue);
}
}
}
package com.atSchool.topN;
import java.io.IOException;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class reduceDemo extends Reducer<Text, IntWritable, Text, IntWritable> {
// HashMap的key不能重复
Map<String, Integer> hashMap = new HashMap<>();
@Override
protected void reduce(Text key, Iterable<IntWritable> values,
Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
int count = 0;
for (IntWritable intWritable : values) {
count += intWritable.get();
}
// 拿到每个单词和总的数量后
hashMap.put(key.toString(), count);
}
@Override
protected void cleanup(Reducer<Text, IntWritable, Text, IntWritable>.Context context)
throws IOException, InterruptedException {
// Sort, then output the top three
/**
 * Set<Map.Entry<K, V>> entrySet(): returns a Set view of the key/value mappings
 * Note: Map.Entry describes one key/value pair, and it offers:
 * K getKey()   returns the key corresponding to this entry.
 * V getValue() returns the value corresponding to this entry.
 */
LinkedList<Entry<String, Integer>> linkedList = new LinkedList<>(hashMap.entrySet());
// 排序
Collections.sort(linkedList, new Comparator<Map.Entry<String, Integer>>() {
@Override
public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2) {
return o2.getValue() - o1.getValue();
}
});
for (int i = 0; i < linkedList.size(); i++) {
if (i <= 2) {
context.write(new Text(linkedList.get(i).getKey()), new IntWritable(linkedList.get(i).getValue()));
}
}
}
}
package com.atSchool.topN;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.atSchool.utils.HDFSUtils;
public class jobDemo extends Configured implements Tool {
public static void main(String[] args) throws Exception {
ToolRunner.run(new jobDemo(), args);
}
@Override
public int run(String[] args) throws Exception {
// 获取Job
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
Job job = Job.getInstance(configuration);
// 设置需要运行的任务
job.setJarByClass(jobDemo.class);
// 告诉job Map和Reduce在哪
job.setMapperClass(maperrDemo.class);
job.setReducerClass(reduceDemo.class);
// 告诉job Map输出的key和value的数据类型的是什么
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// 告诉job Reduce输出的key和value的数据类型的是什么
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// 告诉job输入和输出的路径
FileInputFormat.addInputPath(job, new Path("/a.txt"));
/**
* 因为输出的文件不允许存在,所以需要处理一下
*/
FileSystem fileSystem = HDFSUtils.getFileSystem();
Path path = new Path("/MapReduceOut");
if (fileSystem.exists(path)) {
fileSystem.delete(path, true);
System.out.println("删除成功");
}
FileOutputFormat.setOutputPath(job, path);
// 提交任务
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "执行成功" : "执行失败");
return 0;
}
}
The Combiner in MapReduce exists to cut down the data transferred between the map tasks and the reduce tasks. Hadoop allows the user to specify a combine function to run on each map task's output, precisely to reduce the amount of data sent to the Reduce side. Its main purpose is to trim the Mapper output and thereby reduce network bandwidth usage and the load on the Reducers.
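Since summing word counts is commutative and associative, the existing reducer could itself serve as the combiner; the WordCombiner class defined further below does exactly the same thing. Wiring it in is one extra line in the Driver (a fragment of the run() method shown later, not a standalone program):
// In the Driver's run() method, between setMapperClass and setReducerClass:
job.setMapperClass(MapDemo.class);
job.setCombinerClass(ReduceDemo.class); // reuse the reducer as the combiner, since summing is associative
job.setReducerClass(ReduceDemo.class);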
21/04/04 14:14:27 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=5975774
FILE: Number of bytes written=9544692
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=3562268
HDFS: Number of bytes written=39
HDFS: Number of read operations=15
HDFS: Number of large read operations=0
HDFS: Number of write operations=6
Map-Reduce Framework
Map input records=172368
Map output records=229824
Map output bytes=2528064
Map output materialized bytes=2987718
Input split bytes=98
Combine input records=0 // shows 0 here (no Combiner was set)
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=2987718
Reduce input records=229824
Reduce output records=3
Spilled Records=459648
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=28
Total committed heap usage (bytes)=466616320
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1781134
File Output Format Counters
Bytes Written=39
执行成功
21/04/04 14:25:39 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=426
FILE: Number of bytes written=582500
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=3562268
HDFS: Number of bytes written=39
HDFS: Number of read operations=15
HDFS: Number of large read operations=0
HDFS: Number of write operations=6
Map-Reduce Framework
Map input records=172368
Map output records=229824
Map output bytes=2528064
Map output materialized bytes=44
Input split bytes=98
Combine input records=229824 // this shows the Combiner was used
Combine output records=3
Reduce input groups=3 // the Reduce side's workload is clearly much smaller
Reduce shuffle bytes=44
Reduce input records=3
Reduce output records=3
Spilled Records=6
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=25
Total committed heap usage (bytes)=464519168
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1781134
File Output Format Counters
Bytes Written=39
执行成功
package com.atSchool.MapReduce;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MapDemo extends Mapper<LongWritable, Text, Text, IntWritable> {
private Text keyword = new Text();
private IntWritable keyvalue = new IntWritable(1);
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
throws IOException, InterruptedException {
// 1、获取一行
String line = value.toString();
// 2、截取
String[] split = line.split(" +");
// 3、输出到上下文对象中
for (String string : split) {
keyword.set(string);
context.write(keyword, keyvalue);
}
}
}
package com.atSchool.MapReduce;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class ReduceDemo extends Reducer<Text, IntWritable, Text, IntWritable> {
IntWritable val = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values,
Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
// 累加
int sum = 0;
for (IntWritable intWritable : values) {
sum += intWritable.get();
}
// 输出
val.set(sum);
context.write(key, val);
}
}
package com.atSchool.MapReduce;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
IntWritable val = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values,
Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
// 累加
int sum = 0;
for (IntWritable intWritable : values) {
sum += intWritable.get();
}
// 输出
val.set(sum);
context.write(key, val);
}
}
package com.atSchool.MapReduce;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.atSchool.utils.HDFSUtils;
public class WordCountJob extends Configured implements Tool {
public static void main(String[] args) throws Exception {
ToolRunner.run(new WordCountJob(), args);
}
@Override
public int run(String[] args) throws Exception {
// 获取Job
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
Job job = Job.getInstance(configuration);
// 设置需要运行的任务
job.setJarByClass(WordCountJob.class);
// 告诉job Map和Reduce在哪
job.setMapperClass(MapDemo.class);
job.setCombinerClass(WordCombiner.class); // just specify the Combiner to use here
job.setReducerClass(ReduceDemo.class);
// 告诉job Map输出的key和value的数据类型的是什么
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// 告诉job Reduce输出的key和value的数据类型的是什么
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// 告诉job输入和输出的路径
FileInputFormat.addInputPath(job, new Path("/a.txt"));
/**
* 因为输出的文件不允许存在,所以需要处理一下
*/
FileSystem fileSystem = HDFSUtils.getFileSystem();
Path path = new Path("/out");
if (fileSystem.exists(path)) {
fileSystem.delete(path, true);
System.out.println("删除成功");
}
FileOutputFormat.setOutputPath(job, path);
// 提交任务
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "执行成功" : "执行失败");
return 0;
}
}
In a relational database, an index is a separate, physical storage structure that sorts the values of one or more columns of a table; it is the collection of values of one or more columns together with a list of logical pointers to the data pages that physically identify where those values are stored in the table.
An index works like the table of contents of a book: you can jump to the content you need using the page numbers in the contents. An index provides pointers to the data values stored in a given column of the table and sorts those pointers in the order you specify. The database uses the index to find a particular value and then follows the pointer to the row containing that value. This makes the SQL statements against that table execute faster and gives quick access to specific information in the table.
When a table has a large number of records and you want to query it, the first way to search is a full table scan: take out every record one by one, compare it with the query condition, and return the records that match. This consumes a great deal of database time and causes a lot of disk I/O.
The second way is to build an index on the table, find the index entries that match the query condition in the index, and finally use the ROWID stored in the index (the equivalent of a page number) to quickly find the corresponding rows in the table.
ID of "document 1" > word 1: occurrence count, list of positions; word 2: occurrence count, list of positions; ...
ID of "document 2" > list of keywords appearing in this document.
A forward index generally goes from the key (the document) to the value. When a user searches for the keyword "华为手机" (Huawei phone) on the home page, if only a forward index exists, the engine would have to scan every document in the index library, find all documents containing the keyword, score them with a ranking model, and present them to the user in ranked order. Since the number of documents collected by a search engine on the Internet is astronomical, such an index structure simply cannot return ranked results in real time.
Therefore search engines rebuild the forward index into an inverted index, i.e. the mapping from document ID to keywords is converted into a mapping from keyword to document IDs. Each keyword corresponds to a series of documents, all of which contain that keyword.
"Keyword 1": ID of "document 1", ID of "document 2".
"Keyword 2": list of IDs of documents containing this keyword.
An inverted index goes from the keyword to the documents.
Goal: build an inverted index of the keywords in the files day01~03.
------------- Step 1: the Mapper output has the following format --------------------
context.write("love:day01.txt", "1")
context.write("beijing:day01.txt", "1")
context.write("love:day02.txt", "1")
context.write("beijing:day01.txt", "1")
------------- Step 2: the Combiner receives input in the following format -------------
<"love:day01.txt", {"1"}>
<"beijing:day01.txt", {"1","1"}>
<"love:day02.txt", {"1"}>
------------- Step 2: the Combiner output has the following format ---------------------
context.write("love", "day01.txt:1")
context.write("beijing", "day01.txt:2")
context.write("love", "day02.txt:1")
------------- Step 3: the Reducer receives input in the following format -----------------
<"love", {"day01.txt:1", "day02.txt:1"}>
<"beijing", {"day01.txt:2"}>
------------- Step 3: the Reducer output has the following format -----------------
context.write("love", "day01.txt:1 day02.txt:1 ")
context.write("beijing", "day01.txt:2 ")
The final result is:
love day01.txt:1 day02.txt:1
beijing day01.txt:2
package com.atSchool.index;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
/**
 * Output format:
 * context.write("love:day01.txt", "1")
 * context.write("beijing:day01.txt", "1")
 * context.write("love:day02.txt", "1")
 * context.write("beijing:day01.txt", "1")
 */
public class mapperDemo extends Mapper<LongWritable, Text, Text, Text> {
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context)
throws IOException, InterruptedException {
// 分割
String[] split = value.toString().split(" +");
// 获取文件名
// 1、获取切片
InputSplit inputSplit = context.getInputSplit();
// 2、强转
FileSplit fSplit = (FileSplit) inputSplit;
// 3、得到文件路径再获取文件名
String name = fSplit.getPath().getName();
for (String string : split) {
context.write(new Text(string + ":" + name), new Text("1"));
System.out.println("mapper:" + "<" + string + ":" + name + "," + "1" + ">");
}
}
}
package com.atSchool.index;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
/**
 * Input format:
 * <"love:day01.txt", {"1"}>
 * <"beijing:day01.txt", {"1","1"}>
 * <"love:day02.txt", {"1"}>
 *
 * Output format:
 * context.write("love", "day01.txt:1")
 * context.write("beijing", "day01.txt:2")
 * context.write("love", "day02.txt:1")
 */
public class combinerDemo extends Reducer<Text, Text, Text, Text> {
private Text outKey = new Text();
private Text outValue = new Text();
@Override
protected void reduce(Text key, Iterable<Text> value, Reducer<Text, Text, Text, Text>.Context context)
throws IOException, InterruptedException {
// 处理value
int sum = 0;
for (Text text : value) {
int parseInt = Integer.parseInt(text.toString().trim());
sum += parseInt;
}
// 处理key
String[] split = key.toString().split(":");
// 输出
outKey.set(split[0].trim());
outValue.set(split[1].trim() + ":" + String.valueOf(sum));
context.write(outKey, outValue);
System.out
.println("combiner:" + "<" + split[0].trim() + "," + split[1].trim() + ":" + String.valueOf(sum) + ">");
}
}
package com.atSchool.index;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
/**
* 输入格式:
* <"love", {"day01.txt:1", "day02.txt:1"}>
* <"beijing", {"day01.txt:2"}>
*
* 输出格式:
* context.write("love", "day01.txt:1 day02.txt:2 ")
* context.write("beijing", "day01.txt:2 ")
*/
public class reduceDemo extends Reducer<Text, Text, Text, Text> {
// StringBuilder stringBuilder = new StringBuilder();
private Text outValue = new Text();
@Override
protected void reduce(Text key, Iterable<Text> value, Reducer<Text, Text, Text, Text>.Context context)
throws IOException, InterruptedException {
/**
 * Why can't stringBuilder be defined outside this method?
 * Because when the MR program runs, only the reduce method is invoked repeatedly; the whole class is not re-executed each time.
 * stringBuilder's append method would keep appending data at the end and never overwrite the previous data,
 * whereas Text's set method overwrites the previous content.
 */
StringBuilder stringBuilder = new StringBuilder();
/**
 * Why can't an iterator be traversed more than once?
 * When next() is called, it returns the element at the current index (cursor) and then increments the cursor by 1.
 * Once all elements have been visited, cursor == Collection.size(),
 * so using while(Iterator.hasNext()) as the loop condition returns false and no further traversal is possible.
 * To traverse with an Iterator again after one full pass, you must re-initialize it via the Collection's iterator().
 */
// for (Text text : value) {
// System.out.println("value passed into reduce: " + text.toString());
// }
for (Text text : value) {
System.out.println("value passed into reduce: " + text.toString());
stringBuilder.append(" " + text.toString());
}
// 输出
outValue.set(stringBuilder.toString());
context.write(key, outValue);
System.out.println("reduce:" + "<" + key.toString() + "," + stringBuilder.toString() + ">");
}
}
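If the values really do need to be traversed more than once, the usual workaround is to copy them into an ordinary list first and iterate over that copy. The class below is a hypothetical variant of reduceDemo written only to illustrate this two-pass pattern; it is not part of the original project.
package com.atSchool.index;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class reduceTwoPassDemo extends Reducer<Text, Text, Text, Text> {
	private Text outValue = new Text();
	@Override
	protected void reduce(Text key, Iterable<Text> value, Reducer<Text, Text, Text, Text>.Context context)
			throws IOException, InterruptedException {
		// first pass: copy each value into a plain list
		// (copy via toString(), because Hadoop reuses the same Text instance for every value)
		List<String> cached = new ArrayList<>();
		for (Text text : value) {
			cached.add(text.toString());
		}
		// later passes run over the cached copy instead of the framework's one-shot iterator
		StringBuilder stringBuilder = new StringBuilder();
		for (String string : cached) {
			stringBuilder.append(" " + string);
		}
		outValue.set(stringBuilder.toString());
		context.write(key, outValue);
	}
}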
package com.atSchool.index;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.atSchool.utils.HDFSUtils;
public class jobDemo extends Configured implements Tool {
public static void main(String[] args) throws Exception {
new ToolRunner().run(new jobDemo(), null);
}
@Override
public int run(String[] args) throws Exception {
// 获取Job
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
Job job = Job.getInstance(configuration);
// 设置需要运行的任务
job.setJarByClass(jobDemo.class);
// 告诉job Map和Reduce在哪
job.setMapperClass(mapperDemo.class);
job.setCombinerClass(combinerDemo.class);
job.setReducerClass(reduceDemo.class);
// 告诉job Map输出的key和value的数据类型的是什么
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
// 告诉job Reduce输出的key和value的数据类型的是什么
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// 告诉job输入和输出的路径
FileInputFormat.addInputPath(job, new Path("/index_source"));
/**
* 因为输出的文件不允许存在,所以需要处理一下
*/
FileSystem fileSystem = HDFSUtils.getFileSystem();
Path path = new Path("/MapReduceOut");
if (fileSystem.exists(path)) {
fileSystem.delete(path, true);
System.out.println("删除成功");
}
FileOutputFormat.setOutputPath(job, path);
// 提交任务
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "执行成功" : "执行失败");
return 0;
}
}
and day01.txt:1
beijing day01.txt:1 day03.txt:1
capital day03.txt:1
china day03.txt:1 day02.txt:1
i day02.txt:1 day01.txt:2
is day03.txt:1
love day01.txt:2 day02.txt:1
of day03.txt:1
shanghai day01.txt:1
the day03.txt:1
Source data: sample lines from the debate CSV are quoted in the Mapper comment below.
Requirement: output each word in the form: word  filename:speaker:count
package com.atSchool.speak;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
/**
* 统计辩论赛的词频
*
* 文件内容
* "Line","Speaker","Text","Date"
* 1,"Holt","Good evening from Hofstra University in Hempstead, New York. I'm Lester Holt, anchor of ""NBC Nightly News."" I want to welcome you to the first presidential debate. The participants tonight are Donald Trump and Hillary Clinton. This debate is sponsored by the Commission on Presidential Debates, a nonpartisan, nonprofit organization. The commission drafted tonight's format, and the rules have been agreed to by the campaigns. The 90-minute debate is divided into six segments, each 15 minutes long. We'll explore three topic areas tonight: Achieving prosperity; America's direction; and securing America. At the start of each segment, I will ask the same lead-off question to both candidates, and they will each have up to two minutes to respond. From that point until the end of the segment, we'll have an open discussion. The questions are mine and have not been shared with the commission or the campaigns. The audience here in the room has agreed to remain silent so that we can focus on what the candidates are saying. I will invite you to applaud, however, at this moment, as we welcome the candidates: Democratic nominee for president of the United States, Hillary Clinton, and Republican nominee for president of the United States, Donald J. Trump.","9/26/16"
* 2,"Audience","(APPLAUSE)","9/26/16"
* 3,"Clinton","How are you, Donald?","9/26/16"
* 4,"Audience","(APPLAUSE)","9/26/16"
* 5,"Holt","Good luck to you.","9/26/16"
*/
public class mapperDemo extends Mapper<LongWritable, Text, Text, Text> {
private Text outKey = new Text();
private Text outValue = new Text();
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context)
throws IOException, InterruptedException {
// 获取按指定规则分割后的一行
String[] split = value.toString().split(",");
if (split.length == 4 && Character.isDigit(split[0].charAt(0))) {
String speaker = split[1]; // 辩论者
String text = split[2]; // 辩论内容
// 对辩论内容再进行分割
/**
* 字符串StringTokenizer类允许应用程序将字符串拆分成令牌。
* StringTokenizer方法不区分标识符,数字和引用的字符串,也不识别和跳过注释。
* 可以在创建时或每个令牌的基础上指定一组分隔符(分隔标记的字符)。
*/
StringTokenizer stringTokenizer = new StringTokenizer(text, " (),.?!--\"\"\n#");
// 获取文件名
String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
// 遍历每一个单词
while (stringTokenizer.hasMoreElements()) {
String nextToken = stringTokenizer.nextToken();
outKey.set(nextToken + ":" + fileName + ":" + speaker);
outValue.set("1");
context.write(outKey, outValue);
}
}
}
}
package com.atSchool.speak;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class combinerDemo extends Reducer<Text, Text, Text, Text> {
@Override
protected void reduce(Text key, Iterable<Text> value, Reducer<Text, Text, Text, Text>.Context context)
throws IOException, InterruptedException {
// 合并
int sum = 0;
for (Text text : value) {
Integer valueOf = Integer.valueOf(text.toString());
sum += valueOf;
}
// 改变kv重新输出
String[] split = key.toString().split(":");
String outKey = split[0]; // 单词
String outValue = split[1] + ":" + split[2] + ":" + String.valueOf(sum); // 文件名:辩论者:出现次数
context.write(new Text(outKey), new Text(outValue));
}
}
package com.atSchool.speak;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class reduceDemo extends Reducer<Text, Text, Text, Text> {
@Override
protected void reduce(Text key, Iterable<Text> value, Reducer<Text, Text, Text, Text>.Context context)
throws IOException, InterruptedException {
StringBuffer stringBuffer = new StringBuffer();
for (Text text : value) {
stringBuffer.append(text.toString() + "\t");
}
context.write(key, new Text(stringBuffer.toString()));
}
}
package com.atSchool.speak;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.atSchool.utils.HDFSUtils;
public class jobDemo extends Configured implements Tool {
public static void main(String[] args) throws Exception {
new ToolRunner().run(new jobDemo(), null);
}
@Override
public int run(String[] args) throws Exception {
// 获取Job
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
Job job = Job.getInstance(configuration);
// 设置需要运行的任务
job.setJarByClass(jobDemo.class);
// 告诉job Map和Reduce在哪
job.setMapperClass(mapperDemo.class);
job.setCombinerClass(combinerDemo.class);
job.setReducerClass(reduceDemo.class);
// 告诉job Map输出的key和value的数据类型的是什么
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
// 告诉job Reduce输出的key和value的数据类型的是什么
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// 告诉job输入和输出的路径
FileInputFormat.addInputPath(job, new Path("/美国两党辩论关键词"));
/**
* 因为输出的文件不允许存在,所以需要处理一下
*/
FileSystem fileSystem = HDFSUtils.getFileSystem();
Path path = new Path("/MapReduceOut");
if (fileSystem.exists(path)) {
fileSystem.delete(path, true);
System.out.println("删除成功");
}
FileOutputFormat.setOutputPath(job, path);
// 提交任务
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "执行成功" : "执行失败");
return 0;
}
}
分析资料:
A:B,C,D,F,E,O
B:A,C,E,K
C:F,A,D,I
D:A,E,F,L
E:B,C,D,M,L
F:A,B,C,D,E,O,M
G:A,C,D,E,F
H:A,C,D,E,O
I:A,O
J:B,O
K:A,C,D
L:D,E,F
M:E,F,G
O:A,H,I,J
package com.atSchool.friend;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
/**
* 统计两个人的共同好友
*
* 原数据:
* A:B,C,D,F,E,O
* B:A,C,E,K
* C:F,A,D,I
* D:A,E,F,L
*
* 性质:好友双方应当都有对方名字,不会存在单一的情况
* 所以A中有B,即B中有A
*/
public class mapperDemo extends Mapper<LongWritable, Text, Text, Text> {
private Text outKey = new Text();
private Text outValue = new Text();
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context)
throws IOException, InterruptedException {
String[] split = value.toString().trim().split(":");
// 获取好友列表
String[] split2 = split[1].split(",");
outValue.set(split[0]); // 当前用户
// 遍历好友列表
for (String string : split2) {
/**
* 输出:
* B A
* C A
* D A
* F A
* E A
* O A
* A B
* C B
* E B
* K B
*/
outKey.set(string);
context.write(outKey, outValue);
System.out.println("mapper:" + string + "\t" + split[0]);
}
}
}
package com.atSchool.friend;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class reduceDemo extends Reducer<Text, Text, Text, Text> {
private Text outValue = new Text();
@Override
protected void reduce(Text key, Iterable<Text> value, Reducer<Text, Text, Text, Text>.Context context)
throws IOException, InterruptedException {
StringBuilder stringBuilder = new StringBuilder();
for (Text text : value) {
stringBuilder.append(text.toString() + ",");
}
outValue.set(stringBuilder.toString());
context.write(key, outValue);
System.out.println("reduce-in-out:" + key.toString() + "\t" + stringBuilder.toString());
}
}
package com.atSchool.friend;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.atSchool.utils.HDFSUtils;
public class jobDemo extends Configured implements Tool {
public static void main(String[] args) throws Exception {
new ToolRunner().run(new jobDemo(), null);
}
@Override
public int run(String[] args) throws Exception {
// 获取Job
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
Job job = Job.getInstance(configuration);
// 设置需要运行的任务
job.setJarByClass(jobDemo.class);
// 告诉job Map和Reduce在哪
job.setMapperClass(mapperDemo.class);
job.setReducerClass(reduceDemo.class);
// 告诉job Map输出的key和value的数据类型的是什么
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
// 告诉job Reduce输出的key和value的数据类型的是什么
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// 告诉job输入和输出的路径
FileInputFormat.addInputPath(job, new Path("/friend.txt"));
/**
* 因为输出的文件不允许存在,所以需要处理一下
*/
FileSystem fileSystem = HDFSUtils.getFileSystem();
Path path = new Path("/MapReduceOut");
if (fileSystem.exists(path)) {
fileSystem.delete(path, true);
System.out.println("删除成功");
}
FileOutputFormat.setOutputPath(job, path);
// 提交任务
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "执行成功" : "执行失败");
return 0;
}
}
package com.atSchool.friend;
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class mapperDemo2 extends Mapper<LongWritable, Text, Text, Text> {
private Text outKey = new Text();
private Text outValue = new Text();
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context)
throws IOException, InterruptedException {
// 输出的value
String[] split = value.toString().split("\t");
outValue.set(split[0]);
// 输出的key
String[] split2 = split[1].split(",");
Arrays.sort(split2);
for (int i = 0; i < split2.length; i++) {
for (int j = i + 1; j < split2.length; j++) {
outKey.set(split2[i] + split2[j]);
/**
* 输出:
* BC A
* BD A
* BF A
* BG A
* BH A
* BI A
* BK A
* BO A
* CD A
* CF A
* CG A
* CH A
* CI A
* CK A
* CO A
*/
context.write(outKey, outValue);
System.out.println("mapper-out:" + outKey.toString() + "\t" + outValue.toString());
}
}
}
}
package com.atSchool.friend;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.atSchool.utils.HDFSUtils;
public class jobDemo2 extends Configured implements Tool {
public static void main(String[] args) throws Exception {
new ToolRunner().run(new jobDemo2(), null);
}
@Override
public int run(String[] args) throws Exception {
// 获取Job
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
Job job = Job.getInstance(configuration);
// 设置需要运行的任务
job.setJarByClass(jobDemo2.class);
// 告诉job Map和Reduce在哪
job.setMapperClass(mapperDemo2.class);
job.setReducerClass(reduceDemo.class);
// 告诉job Map输出的key和value的数据类型的是什么
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
// 告诉job Reduce输出的key和value的数据类型的是什么
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// 告诉job输入和输出的路径
FileInputFormat.addInputPath(job, new Path("/MapReduceOut/part-r-00000"));
/**
* 因为输出的文件不允许存在,所以需要处理一下
*/
FileSystem fileSystem = HDFSUtils.getFileSystem();
Path path = new Path("/out");
if (fileSystem.exists(path)) {
fileSystem.delete(path, true);
System.out.println("删除成功");
}
FileOutputFormat.setOutputPath(job, path);
// 提交任务
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "执行成功" : "执行失败");
return 0;
}
}
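The two stages above are launched as two separate drivers: jobDemo turns the friend lists into "friend -> people who list that friend" under /MapReduceOut, and jobDemo2 then reads that output and writes the common friends of every person pair into /out. If you prefer to run both with a single command, a small chained driver along the following lines would do it (a hypothetical convenience class, not part of the original code; it relies on the paths hard-coded in the two drivers).
package com.atSchool.friend;
import org.apache.hadoop.util.ToolRunner;
public class chainDriver {
	public static void main(String[] args) throws Exception {
		// stage 1: /friend.txt -> /MapReduceOut
		ToolRunner.run(new jobDemo(), new String[0]);
		// stage 2: /MapReduceOut/part-r-00000 -> /out
		ToolRunner.run(new jobDemo2(), new String[0]);
	}
}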
(1) Write a class that extends Partitioner and override its getPartition() method.
(2) In the job driver, register the custom partitioner:
job.setPartitionerClass(CustomPartitioner.class);
(3) After customizing the partition logic, set a matching number of ReduceTasks:
job.setNumReduceTasks(5);
Note:
The number of ReduceTasks decides how many output files are produced!
If the number of ReduceTasks > the number of partitions returned by getPartition(), a few extra empty output files part-r-000xx are produced.
If 1 < the number of ReduceTasks < the number of partitions, some partitioned data has nowhere to go and the job throws an Exception.
If the number of ReduceTasks = 1, then no matter how many partition files the MapTasks produce, everything is handed to that single ReduceTask and only one result file, part-r-00000, is written.
(For comparison, a sketch of the default hash-style partitioning follows below.)
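When no custom partitioner is configured, Hadoop falls back to its HashPartitioner, which simply hashes the map output key modulo the number of ReduceTasks, so every key lands in a valid partition whatever numReduceTasks is. The class below is only an illustrative sketch of that style of getPartition(), typed for the Text/LongWritable pair used in the example that follows; the real implementation ships with Hadoop.
package com.atSchool.partitioner;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class hashStylePartitioner extends Partitioner<Text, LongWritable> {
	@Override
	public int getPartition(Text key, LongWritable value, int numPartitions) {
		// mask off the sign bit so the result is always in [0, numPartitions)
		return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
	}
}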
分析资料:
1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157995052 13826544101 5C-0E-8B-C7-F1-E0:CMCC 120.197.40.4 4 0 264 0 200
1363157991076 13926435656 20-10-7A-28-CC-0A:CMCC 120.196.100.99 2 4 132 1512 200
1363154400022 13926251106 5C-0E-8B-8B-B1-50:CMCC 120.197.40.4 4 0 240 0 200
1363157993044 18211575961 94-71-AC-CD-E6-18:CMCC-EASY 120.196.100.99 iface.qiyi.com 视频网站 15 12 1527 2106 200
1363157995074 84138413 5C-0E-8B-8C-E8-20:7DaysInn 120.197.40.4 122.72.52.12 20 16 4116 1432 200
1363157993055 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 18 15 1116 954 200
1363157995033 15920133257 5C-0E-8B-C7-BA-20:CMCC 120.197.40.4 sug.so.360.cn 信息安全 20 20 3156 2936 200
1363157983019 13719199419 68-A1-B7-03-07-B1:CMCC-EASY 120.196.100.82 4 0 240 0 200
1363157984041 13660577991 5C-0E-8B-92-5C-20:CMCC-EASY 120.197.40.4 s19.cnzz.com 站点统计 24 9 6960 690 200
1363157973098 15013685858 5C-0E-8B-C7-F7-90:CMCC 120.197.40.4 rank.ie.sogou.com 搜索引擎 28 27 3659 3538 200
1363157986029 15989002119 E8-99-C4-4E-93-E0:CMCC-EASY 120.196.100.99 www.umeng.com 站点统计 3 3 1938 180 200
1363157992093 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 15 9 918 4938 200
1363157986041 13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4 3 3 180 180 200
1363157984040 13602846565 5C-0E-8B-8B-B6-00:CMCC 120.197.40.4 2052.flash2-http.qq.com 综合门户 15 12 1938 2910 200
1363157995093 13922314466 00-FD-07-A2-EC-BA:CMCC 120.196.100.82 img.qfc.cn 12 12 3008 3720 200
1363157982040 13502468823 5C-0A-5B-6A-0B-D4:CMCC-EASY 120.196.100.99 y0.ifengimg.com 综合门户 57 102 7335 110349 200
1363157986072 18320173382 84-25-DB-4F-10-1A:CMCC-EASY 120.196.100.99 input.shouji.sogou.com 搜索引擎 21 18 9531 2412 200
1363157990043 13925057413 00-1F-64-E1-E6-9A:CMCC 120.196.100.55 t3.baidu.com 搜索引擎 69 63 11058 48243 200
1363157988072 13760778710 00-FD-07-A4-7B-08:CMCC 120.196.100.82 2 2 120 120 200
1363157985066 13560436666 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157993055 13560436666 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 18 15 1116 954 200
// 说明:(注意不一定每一条数据都有如下项,所以得注意)
1363157995033 15920133257 5C-0E-8B-C7-BA-20:CMCC 120.197.40.4 sug.so.360.cn 信息安全 20 20 3156 2936 200
记录报告时间戳 手机号码 AP mac AC mac 访问的网址 网址种类 上行数据包个数 下行数据包个数 上行流量 下行流量 HTTP Response的状态
package com.atSchool.partitioner;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class mapperDemo extends Mapper<LongWritable, Text, Text, LongWritable> {
private Text outKey = new Text();
private LongWritable outValue = new LongWritable();
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context)
throws IOException, InterruptedException {
// 读取一行
String line = value.toString();
// 分割
String[] split = line.split("\t+");
// 输出到上下文对象中
String tel = split[1].trim();
// the URL and category columns are optional, so the up/down traffic fields are taken from the end of the line
String upFlow = split[split.length - 3].trim();
String downFlow = split[split.length - 2].trim();
outKey.set(tel);
outValue.set(Long.parseLong(upFlow) + Long.parseLong(downFlow));
context.write(outKey, outValue);
}
}
package com.atSchool.partitioner;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class partitionerDemo extends Partitioner<Text, LongWritable> {
// 初始化分区
private static Map<String, Integer> maps = new HashMap<>();
static {
maps.put("135", 0);
maps.put("137", 1);
maps.put("138", 2);
maps.put("139", 3);
}
// 如果不是上面分区的任意一种,则放到第5个分区
@Override
public int getPartition(Text key, LongWritable value, int numPartitions) {
String substring = key.toString().substring(0, 3);
if (maps.get(substring) == null) {
return 4;
} else {
return maps.get(substring);
}
}
}
package com.atSchool.partitioner;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class reduceDemo extends Reducer<Text, LongWritable, Text, LongWritable> {
@Override
protected void reduce(Text key, Iterable<LongWritable> value,
Reducer<Text, LongWritable, Text, LongWritable>.Context context) throws IOException, InterruptedException {
long sum = 0;
for (LongWritable longWritable : value) {
sum += longWritable.get();
}
context.write(key, new LongWritable(sum));
}
}
package com.atSchool.partitioner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.atSchool.utils.HDFSUtils;
public class jobDemo extends Configured implements Tool {
public static void main(String[] args) throws Exception {
new ToolRunner().run(new jobDemo(), null);
}
@Override
public int run(String[] args) throws Exception {
// 获取Job
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
Job job = Job.getInstance(configuration);
// 设置需要运行的任务
job.setJarByClass(jobDemo.class);
// 告诉job Map和Reduce在哪
job.setMapperClass(mapperDemo.class);
job.setPartitionerClass(partitionerDemo.class); // 设置自定义partitioner
job.setReducerClass(reduceDemo.class);
// 告诉job Map输出的key和value的数据类型的是什么
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
// 告诉job Reduce输出的key和value的数据类型的是什么
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
job.setNumReduceTasks(5); // 根据自定义Partitioner的逻辑设置相应数量的reducetask
// 告诉job输入和输出的路径
FileInputFormat.addInputPath(job, new Path("/phone_traffic.txt"));
/**
* 因为输出的文件不允许存在,所以需要处理一下
*/
FileSystem fileSystem = HDFSUtils.getFileSystem();
Path path = new Path("/MapReduceOut");
if (fileSystem.exists(path)) {
fileSystem.delete(path, true);
System.out.println("删除成功");
}
FileOutputFormat.setOutputPath(job, path);
// 提交任务
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "执行成功" : "执行失败");
return 0;
}
}
统计分析学生课程与成绩案例
分析资料:
computer,huangxilaoming,78,90,80,76,87
computer,huangbo,65,60,0,75,77
computer,xuzheng,50,60,60,40,80,90,100
computer,wangbaoqiang,57,87,98,87,54,65,32,21
java,liuyifei,50,40,90,50,80,70,60,50,40
java,xuezhiqian,50,40,60,70,80,90,90,50
java,wanghan,88,99,88, 99,77,55
java,huangzitao,90,50,80,70,60,50,40
java,huangbo,70,80,90,90,50
java,xuzheng,99,88,99,77,55
scala,wangzulan,50,60,70,80,90,40,50,60
scala,dengchao,60,50,90,60,40,50,60,70,50
scala,zhouqi,60,50,40,90,80,70,80,90
scala,liuqian,80,90,40,50,60
scala,liutao,60,40,50,60,70,50
scala,zhourunfa,40,90,80,70,80,90
mysql,liutao,50,60,70,80,90,100,65,60
mysql,liuqian,50,60,50,70,80,90,50,60,40
mysql,zhourunfa,80,50,60,90,70,50,60,40,50
Data notes:
The number of fields per record is not fixed: the first field is the course name (four courses appear in the data: computer, java, scala, mysql), the second is the student name, and the remaining fields are that student's exam scores; how many exams a student took in a given course varies.
package com.atSchool.partitioner2;
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
/**
* 统计分析学生课程与成绩案例
*/
public class mapperDemo extends Mapper<LongWritable, Text, Text, FloatWritable> {
private Text outKey = new Text();
private FloatWritable outValue = new FloatWritable();
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, FloatWritable>.Context context)
throws IOException, InterruptedException {
// 读取一行+分割
String[] split = value.toString().trim().split(",");
// 输出到上下文对象中
String courseName = split[0].trim(); // 课程名
float sum = 0;
float index = 0;
for (int i = 2; i < split.length; i++) {
sum += Float.parseFloat(split[i].trim());
index++;
}
float averageScore = sum / index; // 学生平均分
outKey.set(courseName);
outValue.set(averageScore);
context.write(outKey, outValue);
}
}
package com.atSchool.partitioner2;
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class reduceDemo extends Reducer<Text, FloatWritable, Text, Text> {
@Override
protected void reduce(Text key, Iterable<FloatWritable> value,
Reducer<Text, FloatWritable, Text, Text>.Context context) throws IOException, InterruptedException {
float sum = 0;
float index = 0;
for (FloatWritable floatWritable : value) {
sum += floatWritable.get();
index++;
}
float averageScore = sum / index;
context.write(key, new Text("考试人数:" + index + "\t" + "课程考试平均成绩:" + averageScore));
}
}
package com.atSchool.partitioner2;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.atSchool.utils.HDFSUtils;
public class jobDemo extends Configured implements Tool {
public static void main(String[] args) throws Exception {
new ToolRunner().run(new jobDemo(), null);
}
@Override
public int run(String[] args) throws Exception {
// 获取Job
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
Job job = Job.getInstance(configuration);
// 设置需要运行的任务
job.setJarByClass(jobDemo.class);
// 告诉job Map和Reduce在哪
job.setMapperClass(mapperDemo.class);
job.setReducerClass(reduceDemo.class);
// 告诉job Map输出的key和value的数据类型的是什么
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(FloatWritable.class);
// 告诉job Reduce输出的key和value的数据类型的是什么
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// 告诉job输入和输出的路径
FileInputFormat.addInputPath(job, new Path("/student.txt"));
/**
* 因为输出的文件不允许存在,所以需要处理一下
*/
FileSystem fileSystem = HDFSUtils.getFileSystem();
Path path = new Path("/MapReduceOut");
if (fileSystem.exists(path)) {
fileSystem.delete(path, true);
System.out.println("删除成功");
}
FileOutputFormat.setOutputPath(job, path);
// 提交任务
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "执行成功" : "执行失败");
return 0;
}
}
输出结果:
computer 考试人数:4.0 课程考试平均成绩:67.19911
java 考试人数:6.0 课程考试平均成绩:71.98823
mysql 考试人数:3.0 课程考试平均成绩:64.69907
scala 考试人数:6.0 课程考试平均成绩:64.23148
package com.atSchool.studentDemo2;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
public class beanDemo implements WritableComparable<beanDemo> {
private String courseName; // 课程名称
private String stuName; // 学生姓名
private float avgScore; // 平均成绩
// 构造方法
public beanDemo() {
super();
}
public beanDemo(String courseName, String stuName, float avgScore) {
super();
this.courseName = courseName;
this.stuName = stuName;
this.avgScore = avgScore;
}
// getter/setter方法
public String getCourseName() {
return courseName;
}
public void setCourseName(String courseName) {
this.courseName = courseName;
}
public String getStuName() {
return stuName;
}
public void setStuName(String stuName) {
this.stuName = stuName;
}
public float getAvgScore() {
return avgScore;
}
public void setAvgScore(float avgScore) {
this.avgScore = avgScore;
}
// 重写方法
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(courseName);
out.writeUTF(stuName);
out.writeFloat(avgScore);
}
@Override
public void readFields(DataInput in) throws IOException {
courseName = in.readUTF();
stuName = in.readUTF();
avgScore = in.readFloat();
}
/**
 * Returns a negative integer, zero, or a positive integer as this object is less than,
 * equal to, or greater than the specified object.
 *
 * For o1.compareTo(o2): a positive result means o1 is the "larger" one, so in an ascending
 * sort the argument o2 ends up in front of the current object o1; a negative result puts o2 after o1.
 * (A small local demonstration of the resulting order follows after this class.)
 */
@Override
public int compareTo(beanDemo o) {
// 判断是不是同一个课程
int compareTo = o.courseName.compareTo(this.courseName);
// 如果是同一个课程
if (compareTo == 0) {
// 如果比较的对象比当前对象小,就返回正数,比较对象放在当前对象的后面
return avgScore > o.avgScore ? -1 : 1;
}
return compareTo > 0 ? 1 : -1;
}
@Override
public String toString() {
return "courseName=" + courseName + ", stuName=" + stuName + ", avgScore=" + avgScore;
}
}
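To see what compareTo() does to the key order outside of MapReduce, the following throwaway check (a hypothetical test class, not part of the job) sorts a few beans locally: beans of the same course stay together and, within a course, the one with the higher average comes first, which is also the order in which the keys reach the reduce side.
package com.atSchool.studentDemo2;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
public class beanSortCheck {
	public static void main(String[] args) {
		List<beanDemo> beans = new ArrayList<>();
		beans.add(new beanDemo("java", "stuA", 70.0f));
		beans.add(new beanDemo("java", "stuB", 90.5f));
		beans.add(new beanDemo("scala", "stuC", 60.0f));
		// Collections.sort uses beanDemo.compareTo
		Collections.sort(beans);
		for (beanDemo bean : beans) {
			// the two java beans print with avgScore=90.5 before avgScore=70.0
			System.out.println(bean);
		}
	}
}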
package com.atSchool.studentDemo2;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;
public class partitionerDemo extends Partitioner<beanDemo, NullWritable> {
@Override
public int getPartition(beanDemo key, NullWritable value, int numPartitions) {
String courseName = key.getCourseName();
if (courseName.equals("java")) {
return 0;
} else if (courseName.equals("computer")) {
return 1;
} else if (courseName.equals("scala")) {
return 2;
} else if (courseName.equals("mysql")) {
return 3;
} else {
return 4;
}
}
}
package com.atSchool.studentDemo2;
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
/**
* 统计每门课程参考学生的平均分,并且按课程存入不同的结果文件,要求一门课程一个结果文件,
* 并且按平均分从高到低排序,分数保留一位小数
*/
public class mapperDemo extends Mapper<LongWritable, Text, beanDemo, NullWritable> {
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, beanDemo, NullWritable>.Context context)
throws IOException, InterruptedException {
// 读取一行+分割
String[] split = value.toString().trim().split(",");
// 输出到上下文对象中
String courseName = split[0].trim(); // 课程名
String stuName = split[1].trim();
float sum = 0;
float index = 0;
for (int i = 2; i < split.length; i++) {
sum += Float.parseFloat(split[i].trim());
index++;
}
String format = String.format("%.1f", sum / index); // 学生平均分,保留一位小数
beanDemo beanDemo = new beanDemo(courseName, stuName, Float.parseFloat(format));
context.write(beanDemo, NullWritable.get());
}
}
package com.atSchool.studentDemo2;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.atSchool.utils.HDFSUtils;
public class jobDemo extends Configured implements Tool {
public static void main(String[] args) throws Exception {
new ToolRunner().run(new jobDemo(), null);
}
@Override
public int run(String[] args) throws Exception {
// 获取Job
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
Job job = Job.getInstance(configuration);
// 设置需要运行的任务
job.setJarByClass(jobDemo.class);
// 告诉job Map和Reduce在哪
job.setMapperClass(mapperDemo.class);
job.setPartitionerClass(partitionerDemo.class); // 设置自定义partitioner
job.setNumReduceTasks(5); // 根据自定义Partitioner的逻辑设置相应数量的reducetask
// 告诉job Map输出的key和value的数据类型的是什么
job.setMapOutputKeyClass(beanDemo.class);
job.setMapOutputValueClass(NullWritable.class);
// 告诉job Reduce输出的key和value的数据类型的是什么
job.setOutputKeyClass(beanDemo.class);
job.setOutputValueClass(NullWritable.class);
// 告诉job输入和输出的路径
FileInputFormat.addInputPath(job, new Path("/student.txt"));
/**
* 因为输出的文件不允许存在,所以需要处理一下
*/
FileSystem fileSystem = HDFSUtils.getFileSystem();
Path path = new Path("/MapReduceOut");
if (fileSystem.exists(path)) {
fileSystem.delete(path, true);
System.out.println("删除成功");
}
FileOutputFormat.setOutputPath(job, path);
// 提交任务
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "执行成功" : "执行失败");
return 0;
}
}
package com.atSchool.studentDemo3;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
/**
 * Grouping comparator: keys whose courseName is equal are treated as one group, so a single
 * reduce() call receives all records of one course. Because the map output keys are already
 * sorted by beanDemo.compareTo (higher average first within a course), the first three values
 * seen in reduce() are that course's top 3.
 */
public class groupDemo extends WritableComparator {
public groupDemo() {
super(beanDemo.class, true);
}
@Override
public int compare(WritableComparable a, WritableComparable b) {
beanDemo s1 = (beanDemo) a;
beanDemo s2 = (beanDemo) b;
return s1.getCourseName().compareTo(s2.getCourseName());
}
}
package com.atSchool.studentDemo3;
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
/**
* 求出每门课程参考学生成绩排名前3的学生的信息:课程,姓名和平均分
*/
public class mapperDemo extends Mapper<LongWritable, Text, beanDemo, NullWritable> {
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, beanDemo, NullWritable>.Context context)
throws IOException, InterruptedException {
// 读取一行+分割
String[] split = value.toString().trim().split(",");
// 输出到上下文对象中
String courseName = split[0].trim(); // 课程名
String stuName = split[1].trim(); // 学生姓名
float sum = 0;
float index = 0;
for (int i = 2; i < split.length; i++) {
sum += Float.parseFloat(split[i].trim());
index++;
}
float averageScore = sum / index; // 学生平均分
beanDemo beanDemo = new beanDemo(courseName, stuName, averageScore);
context.write(beanDemo, NullWritable.get());
}
}
package com.atSchool.studentDemo3;
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;
public class reduceDemo extends Reducer<beanDemo, NullWritable, beanDemo, NullWritable> {
@Override
protected void reduce(beanDemo key, Iterable<NullWritable> value,
Reducer<beanDemo, NullWritable, beanDemo, NullWritable>.Context context)
throws IOException, InterruptedException {
int count = 0;
for (NullWritable nullWritable : value) {
if (count < 3) {
context.write(key, nullWritable);
count++;
} else {
break;
}
}
}
}
package com.atSchool.studentDemo3;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.atSchool.utils.HDFSUtils;
public class jobDemo extends Configured implements Tool {
public static void main(String[] args) throws Exception {
new ToolRunner().run(new jobDemo(), null);
}
@Override
public int run(String[] args) throws Exception {
// 获取Job
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
Job job = Job.getInstance(configuration);
// 设置需要运行的任务
job.setJarByClass(jobDemo.class);
// 告诉job Map和Reduce在哪
job.setMapperClass(mapperDemo.class);
job.setReducerClass(reduceDemo.class);
// 告诉job Map输出的key和value的数据类型的是什么
job.setMapOutputKeyClass(beanDemo.class);
job.setMapOutputValueClass(NullWritable.class);
// 告诉job Reduce输出的key和value的数据类型的是什么
job.setOutputKeyClass(beanDemo.class);
job.setOutputValueClass(NullWritable.class);
job.setGroupingComparatorClass(groupDemo.class); // 指定分组的规则
// 告诉job输入和输出的路径
FileInputFormat.addInputPath(job, new Path("/student.txt"));
/**
* 因为输出的文件不允许存在,所以需要处理一下
*/
FileSystem fileSystem = HDFSUtils.getFileSystem();
Path path = new Path("/MapReduceOut");
if (fileSystem.exists(path)) {
fileSystem.delete(path, true);
System.out.println("删除成功");
}
FileOutputFormat.setOutputPath(job, path);
// 提交任务
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "执行成功" : "执行失败");
return 0;
}
}
分析数据库:
package com.atSchool.db.MR;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
public class Shopping implements DBWritable {
private int id; // 商品id
private String name; // 商品名称
private String subtitle; // 商品副标题
private float price; // 商品价格
private int stock; // 库存数量
public Shopping() {}
public Shopping(int id, String name, String subtitle, float price, int stock) {
super();
this.id = id;
this.name = name;
this.subtitle = subtitle;
this.price = price;
this.stock = stock;
}
public int getId() {
return id;
}
public void setId(int id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public String getSubtitle() {
return subtitle;
}
public void setSubtitle(String subtitle) {
this.subtitle = subtitle;
}
public float getPrice() {
return price;
}
public void setPrice(float price) {
this.price = price;
}
public int getStock() {
return stock;
}
public void setStock(int stock) {
this.stock = stock;
}
// PreparedStatement在JDBC中用来存储已经预编译好了的sql语句
@Override
public void write(PreparedStatement statement) throws SQLException {
statement.setInt(1, id);
statement.setString(2, name);
statement.setString(3, subtitle);
statement.setFloat(4, price);
statement.setInt(5, stock);
}
// ResultSet在JDBC中是存储结果集的对象
@Override
public void readFields(ResultSet resultSet) throws SQLException {
this.id = resultSet.getInt(1);
this.name = resultSet.getString(2);
this.subtitle = resultSet.getString(3);
this.price = resultSet.getFloat(4);
this.stock = resultSet.getInt(5);
}
@Override
public String toString() {
return "id=" + id + ", name=" + name + ", subtitle=" + subtitle + ", price=" + price + ", stock=" + stock;
}
}
package com.atSchool.db.MR;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;
public class MapperDemo extends Mapper<LongWritable, Shopping, NullWritable, Shopping> {
@Override
protected void map(LongWritable key, Shopping value,
Mapper<LongWritable, Shopping, NullWritable, Shopping>.Context context)
throws IOException, InterruptedException {
context.write(NullWritable.get(), value);
}
}
package com.atSchool.db.MR;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.atSchool.utils.HDFSUtils;
public class JobDemo extends Configured implements Tool {
private String className = "com.mysql.cj.jdbc.Driver";
private String url = "jdbc:mysql://127.0.0.1:3306/shopping?useSSL=false&serverTimezone=UTC&rewriteBatchedStatements=true";
private String user = "root";
private String password = "password";
public static void main(String[] args) throws Exception {
new ToolRunner().run(new JobDemo(), null);
}
@Override
public int run(String[] args) throws Exception {
/**
* 获取job:一个工作对象
*/
// 创建一个 配置 对象
Configuration configuration = new Configuration();
// 设置 name属性 的值。
// 如果名称已弃用或有一个弃用的名称与之关联,它会将值设置为两个名称。
// 名称将在配置前进行修剪。
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
// 在configuration中设置数据库访问相关字段。
DBConfiguration.configureDB(configuration, className, url, user, password);
// 根据配置文件创建一个job
Job job = Job.getInstance(configuration);
/**
* 设置job
*/
// sql语句
String sql_1 = "SELECT id,name,subtitle,price,stock FROM neuedu_product WHERE price>1999";
String sql_2 = "SELECT COUNT(id) FROM neuedu_product WHERE price>1999";
/**
 * public static void setInput(Job job, Class<? extends DBWritable> inputClass, String inputQuery, String inputCountQuery)
* 使用适当的输入设置初始化作业的映射-部分。
* 参数:
* job-The map-reduce job
* inputClass-实现DBWritable的类对象,它是保存元组字段的Java对象。
* inputQuery-选择字段的输入查询。示例:"SELECT f1, f2, f3 FROM Mytable ORDER BY f1"
* inputCountQuery-返回表中记录数的输入查询。示例:"SELECT COUNT(f1) FROM Mytable"
*/
DBInputFormat.setInput(job, Shopping.class, sql_1, sql_2);
// 通过查找给定类的来源来设置Jar。
job.setJarByClass(JobDemo.class);
// 给 job 设置 Map和Reduce
job.setMapperClass(MapperDemo.class);
job.setNumReduceTasks(0); // 这里因为没有用到reduce,所以设置为0
// 给 job 设置InputFormat
// InputFormat:描述 Map-Reduce job 的输入规范
// DBInputFormat:从一个SQL表中读取输入数据的输入格式。
job.setInputFormatClass(DBInputFormat.class);
// 告诉job Map输出的key和value的数据类型的是什么
job.setMapOutputKeyClass(NullWritable.class);
job.setMapOutputValueClass(Shopping.class);
/**
* 设置输出路径
*/
// 因为输出的文件不允许存在,所以需要处理一下
FileSystem fileSystem = HDFSUtils.getFileSystem();
Path path = new Path("/MapReduceOut");
if (fileSystem.exists(path)) {
fileSystem.delete(path, true);
System.out.println("删除成功");
}
// 设置map-reduce job的输出目录
FileOutputFormat.setOutputPath(job, path);
// 将job提交到集群并等待它完成。
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "执行成功" : "执行失败");
return 0;
}
}
Note: when writing out to MySQL, the target table must already exist in the database (a minimal creation sketch follows).
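One simple way to create that table up front is a one-off JDBC helper like the one below. It assumes the same shopping database and credentials used in the JobDemo further down, and the column names match the fieldNames later passed to DBOutputFormat.setOutput(); the column types and lengths are my own assumption, so adjust them to your schema.
package com.atSchool.db.outmysql;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
public class CreateUserTable {
	public static void main(String[] args) throws Exception {
		String url = "jdbc:mysql://127.0.0.1:3306/shopping?useSSL=false&serverTimezone=UTC";
		try (Connection connection = DriverManager.getConnection(url, "root", "password");
				Statement statement = connection.createStatement()) {
			// id/name/password must match the fieldNames given to DBOutputFormat.setOutput()
			statement.execute("CREATE TABLE IF NOT EXISTS user ("
					+ "id INT PRIMARY KEY, "
					+ "name VARCHAR(100), "
					+ "password VARCHAR(100))");
			System.out.println("table user is ready");
		}
	}
}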
package com.atSchool.db.outmysql;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
public class User implements DBWritable {
private int id;
private String name;
private String password;
public User() {}
public User(int id, String name, String password) {
super();
this.id = id;
this.name = name;
this.password = password;
}
public int getId() {
return id;
}
public void setId(int id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public String getPassword() {
return password;
}
public void setPassword(String password) {
this.password = password;
}
@Override
public void write(PreparedStatement statement) throws SQLException {
statement.setInt(1, id);
statement.setString(2, name);
statement.setString(3, password);
}
@Override
public void readFields(ResultSet resultSet) throws SQLException {
this.id = resultSet.getInt(1);
this.name = resultSet.getString(2);
this.password = resultSet.getString(3);
}
@Override
public String toString() {
return "id=" + id + ", name=" + name + ", password=" + password;
}
}
package com.atSchool.db.outmysql;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MapperDemo extends Mapper<LongWritable, Text, User, NullWritable> {
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, User, NullWritable>.Context context)
throws IOException, InterruptedException {
String[] split = value.toString().split(",");
User user = new User(Integer.parseInt(split[0]), split[1], split[2]);
context.write(user, NullWritable.get());
System.out.println("mapper-out:" + user.toString());
}
}
package com.atSchool.db.outmysql;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class JobDemo extends Configured implements Tool {
private String className = "com.mysql.cj.jdbc.Driver";
/**
* rewriteBatchedStatements=true
* 说明:
* MySQL Jdbc驱动在默认情况下会无视executeBatch()语句,把我们期望批量执行的一组sql语句拆散,一条一条地发给MySQL数据库,直接造成较低的性能。
* 而把rewriteBatchedStatements参数置为true, 驱动就会批量执行SQL
* 注意:
* 在这里就不能将该属性设置成true了,这样会和mapreduce中形成冲突
*/
private String url = "jdbc:mysql://127.0.0.1:3306/shopping?characterEncoding=utf-8&useSSL=false&serverTimezone=UTC";
private String user = "root";
private String password = "password";
public static void main(String[] args) throws Exception {
new ToolRunner().run(new JobDemo(), null);
}
@Override
public int run(String[] args) throws Exception {
/**
* 获取job:一个工作对象
*/
// 创建一个 配置 对象
Configuration configuration = new Configuration();
// 设置 name属性 的值。
// 如果名称已弃用或有一个弃用的名称与之关联,它会将值设置为两个名称。
// 名称将在配置前进行修剪。
configuration.set("fs.defaultFS", "hdfs://192.168.232.129:9000");
// 在configuration中设置数据库访问相关字段。
DBConfiguration.configureDB(configuration, className, url, user, password);
// 根据配置文件创建一个job
Job job = Job.getInstance(configuration);
/**
* 设置job
*/
/**
* setOutput(Job job, String tableName, String... fieldNames) throws IOException
* 用适当的输出设置初始化作业的缩减部分
* 参数:
* job:The job
* tableName:要插入数据的表
* fieldNames:表中的字段名。
*/
DBOutputFormat.setOutput(job, "user", new String[] { "id", "name", "password" });
// 通过查找给定类的来源来设置Jar。
job.setJarByClass(JobDemo.class);
// 给 job 设置 Map和Reduce
job.setMapperClass(MapperDemo.class);
job.setNumReduceTasks(0); // 这里因为没有用到reduce,所以设置为0
// 告诉job Map输出的key和value的数据类型的是什么
job.setMapOutputKeyClass(User.class);
job.setMapOutputValueClass(NullWritable.class);
// Set the job's input and output formats:
// the input is a plain text file on HDFS, so TextInputFormat is used;
// the output goes through DBOutputFormat into the MySQL table configured above.
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(DBOutputFormat.class);
/**
* 设置输入路径
*/
FileInputFormat.setInputPaths(job, "/user_for_mysql.txt");
// 将job提交到集群并等待它完成。
boolean waitForCompletion = job.waitForCompletion(true);
System.out.println(waitForCompletion ? "执行成功" : "执行失败");
return 0;
}
}