Word count is regarded as the "Hello World" of MapReduce. Let's see how to implement it with the Java API.
1. Define a class that extends Mapper and override its map method
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

// LongWritable is Hadoop's efficiently serializable counterpart of Long; Text corresponds to String
public class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    /*
     * key:     the byte offset of the current line in the input (a long by default)
     * value:   one line of the input file as a string
     * context: the context object, used to emit the map output
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split(" ");                        // split the line into words on spaces
        for (String w : words) {
            context.write(new Text(w), new LongWritable(1));     // emit word w with a count of 1
        }
    }
}
2. Define a class that extends Reducer and override its reduce method

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WCReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    // At this point the framework has already grouped the map output by key,
    // much like a HashMap<key, ArrayList<Long>>
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        long counter = 0;
        for (LongWritable v : values) {
            counter += v.get();
        }
        context.write(key, new LongWritable(counter));   // emit the final count for this word
    }
}
3. Write the driver class that assembles and submits the job

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDemo {

    public static void main(String[] args) throws Exception {
        // 1. Build the Job object
        Job job = Job.getInstance(new Configuration());

        // 2. Set the class whose jar contains the main method
        job.setJarByClass(WordCountDemo.class);

        // 3. Configure the mapper
        job.setMapperClass(WCMapper.class);                       // the class implementing the map method
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));    // input path

        // 4. Configure the reducer
        job.setReducerClass(WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output path

        // 5. Submit the job; "true" means print the job's progress information
        job.waitForCompletion(true);
    }
}
4. Package the code into a jar and submit it with:

hadoop jar <jar-name> [class-containing-main] [input-path] [output-path]
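As a concrete illustration, an invocation might look like the following; the jar name wc.jar and the HDFS paths are hypothetical placeholders, only WordCountDemo comes from the code above:

hadoop jar wc.jar WordCountDemo /wordcount/input /wordcount/output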
Problem statement: given a child-parent table, produce the grandchild-grandparent table.
+--------+--------+
| child  | parent |
+--------+--------+
| Tom    | Lucy   |
| Tom    | Jack   |
| Jone   | Lucy   |
| Jone   | Jack   |
| Lucy   | Mary   |
| Lucy   | Ben    |
| Jack   | Alice  |
| Jack   | Jesse  |
| Terry  | Alice  |
| Terry  | Jesse  |
| Philip | Terry  |
| Philip | Alma   |
| Mark   | Terry  |
| Mark   | Alma   |
+--------+--------+
The result table:
+------------+-------------+
| grandchild | grandparent |
+------------+-------------+
| Tom        | Mary        |
| Jone       | Mary        |
| Tom        | Ben         |
| Jone       | Ben         |
| Tom        | Alice       |
| Jone       | Alice       |
| Tom        | Jesse       |
| Jone       | Jesse       |
| Philip     | Alice       |
| Mark       | Alice       |
| Philip     | Jesse       |
| Mark       | Jesse       |
+------------+-------------+
The equivalent MySQL query is easy to write: it is a self-join of the table:
select cp1.child,cp2.parent from child_parent as cp1 inner join child_parent as cp2 on cp1.parent=cp2.child;
Take Lucy as an example. Lucy appears as a parent in cp1; we then look in cp2 for the rows where Lucy is the child. The child paired with Lucy in cp1 is the grandchild, and the parent paired with Lucy in cp2 is the grandparent. In other words, using Lucy as an intermediary, we obtain the grandchild-grandparent relationships:
| Tom  | Lucy | Lucy | Mary |
| Jone | Lucy | Lucy | Mary |
| Tom  | Lucy | Lucy | Ben  |
| Jone | Lucy | Lucy | Ben  |
Recall the shuffle phase: it merges the records in the map output that share the same key and collects their values into one group. This merging of identical keys can be viewed as an equi-join. For example, if the map output type is <child, parent>, then on the reduce side, for a given child, we can easily obtain all of its parents. But at that point we have lost the information about the child's own children. Therefore the values must carry both the key's children and the key's parents. How do we do that?
Answer:
For each key, we use the prefix 'l_' to mark a child of the key and the prefix 'r_' to mark a parent of the key. For example, the map output contains pairs such as <Tom, r_Lucy>, <Lucy, l_Tom>, <Tom, r_Jack>, <Jack, l_Tom> ...
public static class KMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Text txtKey = new Text();
    private Text txtValue = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] values = value.toString().split("\t");
        String childName = values[0];
        String parentName = values[1];

        txtKey.set(childName);
        txtValue.set("r_" + parentName);   // the prefixes 'r_' and 'l_' mark the key's parent and child respectively
        context.write(txtKey, txtValue);   // the child is the key, the value records its parent

        txtKey.set(parentName);
        txtValue.set("l_" + childName);    // the parent is the key, the value records its child
        context.write(txtKey, txtValue);
    }
}
Taking Lucy as an example again, on the reduce side we get the group <Lucy, {l_Tom, l_Jone, r_Mary, r_Ben}>. Emitting the Cartesian product of the children and the parents in this group gives exactly the rows of the result table.
public static class KReducer extends Reducer<Text, Text, Text, Text> {

    private Text txtKey = new Text();
    private Text txtValue = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        Iterator<Text> iterator = values.iterator();
        ArrayList<String> childList = new ArrayList<String>();    // the key's children
        ArrayList<String> parentList = new ArrayList<String>();   // the key's parents
        while (iterator.hasNext()) {
            String v = iterator.next().toString();
            if (v.startsWith("l")) {              // collect the child list
                childList.add(v.substring(2));
            } else if (v.startsWith("r")) {       // collect the parent list
                parentList.add(v.substring(2));
            }
        }
        // emit the Cartesian product of children and parents: grandchild-grandparent pairs
        for (String c : childList) {
            for (String p : parentList) {
                txtKey.set(c);
                txtValue.set(p);
                context.write(txtKey, txtValue);
            }
        }
    }
}
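KMapper and KReducer above are static nested classes, so they still need a driver like WordCountDemo's to wire them into a job. Below is a minimal sketch; the enclosing class name ChildGrandparent and the use of args[0]/args[1] for the paths are my assumptions, not taken from the original text:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;

public class ChildGrandparent {   // hypothetical enclosing class name

    // ... KMapper and KReducer from above go here as static nested classes ...

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(ChildGrandparent.class);

        job.setMapperClass(KMapper.class);
        job.setMapOutputKeyClass(Text.class);     // both map output key and value are Text in this job
        job.setMapOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        job.setReducerClass(KReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}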
If you don't want to add the Hadoop jars to your project manually, you can use Maven and declare the following dependencies:
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.7.0</version>
    </dependency>
</dependencies>
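With these dependencies in place, a standard Maven build produces the jar that you submit with the hadoop jar command shown earlier; the command below assumes the default Maven project layout:

mvn clean package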