Hadoop - Simple MapReduce Examples

1. Word Count

Word count is regarded as the Hello World of MapReduce. Let's see how to implement it with the Java API.

1. Define a class that extends Mapper and override its map method

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;


// LongWritable is Hadoop's own efficiently serializable long type; Text plays the role of String
public class WCMapper extends Mapper<LongWritable,Text,Text,LongWritable> {
    
    /*
     * key: the byte offset of the current line within the input (a long by default)
     * value: one line of the input file as a string
     * context: the context object, used to emit the map output
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line=value.toString();
        String[] words=line.split(" "); // split the line into words on spaces
        for (String w:words){
            context.write(new Text(w),new LongWritable(1)); // emit (w, 1): one occurrence of word w
        }
    }
}

2. Define a class that extends Reducer and override its reduce method

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;


public class WCReducer extends Reducer<Text,LongWritable,Text,LongWritable>{
    
    // By the time reduce is called, the framework has already grouped the values of each key, much like a HashMap<key, ArrayList<Long>>
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        long counter=0;
        for(LongWritable v:values){
            counter+=v.get(); 
        }
        context.write(key,new LongWritable(counter)); // emit the result: (word, total count)
    }
}

3. Wire everything together in a main method

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class WordCountDemo {

    public static void main(String[] args) throws Exception {
        //1. Build the Job object
        Job job = Job.getInstance(new Configuration());
        //2. Set the class that contains this jar's main method
        job.setJarByClass(WordCountDemo.class);
        //3. Configure the Mapper
        job.setMapperClass(WCMapper.class); // the class whose map method will run
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0])); // input file path
        //4. Configure the Reducer
        job.setReducerClass(WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path (must not already exist)
        //5. Submit the job
        job.waitForCompletion(true); // true means print the job's progress information
    }
}
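
One optional tweak that is not in the original driver: because WCReducer simply sums LongWritable values and its input and output types match, it can also be registered as a combiner, so partial sums are computed on the map side and less data crosses the shuffle. A hedged one-line addition to the driver above, placed before job.waitForCompletion:

        // optional addition: run WCReducer map-side as a combiner to shrink shuffle traffic
        job.setCombinerClass(WCReducer.class);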

4. Finally, package the classes into a jar, upload it to the Linux machine, and run it with the hadoop command:

hadoop jar <jar file> [main class] [input path] [output path]
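
For example, if the jar were named wc.jar and the word-count input lived under /wordcount/in on HDFS (both names are placeholders, not from the original), the call might look like:

hadoop jar wc.jar WordCountDemo /wordcount/in /wordcount/out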


2. Single-Table Self-Join

Problem: given a child-parent table, produce the corresponding grandchild-grandparent table.

+--------+--------+
| child  | parent |
+--------+--------+
| Tom    | Lucy   |
| Tom    | Jack   |
| Jone   | Lucy   |
| Jone   | Jack   |
| Lucy   | Mary   |
| Lucy   | Ben    |
| Jack   | Alice  |
| Jack   | Jesse  |
| Terry  | Alice  |
| Terry  | Jesse  |
| Philip | Terry  |
| Philip | Alma   |
| Mark   | Terry  |
| Mark   | Alma   |
+--------+--------+

Expected result:

+------------+-------------+
| grandchild | grandparent |
+------------+-------------+
| Tom        | Mary        |
| Jone       | Mary        |
| Tom        | Ben         |
| Jone       | Ben         |
| Tom        | Alice       |
| Jone       | Alice       |
| Tom        | Jesse       |
| Jone       | Jesse       |
| Philip     | Alice       |
| Mark       | Alice       |
| Philip     | Jesse       |
| Mark       | Jesse       |
+------------+-------------+

The SQL query is easy to come up with: a self-join of the table with itself:

select cp1.child,cp2.parent from child_parent as cp1 inner join child_parent as cp2 on cp1.parent=cp2.child;

Take Lucy as an example. Lucy appears as a parent in cp1, so we look in cp2 for rows where Lucy is the child. The child that pairs with Lucy in cp1 is the grandchild, and the parent that pairs with Lucy in cp2 is the grandparent. In other words, with Lucy acting as the go-between we obtain the grandchild-grandparent relation:

| Tom    | Lucy   | Lucy  | Mary   |
| Jone   | Lucy   | Lucy  | Mary   |
| Tom    | Lucy   | Lucy  | Ben    |
| Jone   | Lucy   | Lucy  | Ben    |

But how do we implement this in MapReduce? Fortunately, Hadoop provides Hive: run this SQL statement in Hive and it automatically generates the corresponding MapReduce job (impressive, a bit like Entity Framework), giving you the result directly. What we want to study here, though, is how to write that MR job ourselves.
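
As a rough sketch of the Hive route (the table name and the assumption that the data is a tab-separated text file are mine, not from the original), it would look something like this:

-- assumed layout: one child-parent pair per line, separated by a tab
CREATE TABLE child_parent (child STRING, parent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- the same self-join as above; Hive compiles it into a MapReduce job
SELECT cp1.child AS grandchild, cp2.parent AS grandparent
FROM child_parent cp1
JOIN child_parent cp2 ON cp1.parent = cp2.child;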

Recall the shuffle phase: it merges map output records that share the same key and collects their values into the set belonging to that key. This merging of identical keys can be viewed as an equi-join. For example, if the map output type is <child, parent>, then on the reduce side we easily know all the parents of a given child, but we have lost the information about that child's own children. So the value must carry both the key's children and the key's parents. How do we do that?

Answer:

For a given key, we prefix the value with 'l_' when it is one of the key's children and with 'r_' when it is one of the key's parents. For example, the map output looks like <Tom, r_Lucy>, <Lucy, l_Tom>, <Tom, r_Jack>, <Jack, l_Tom> ...

public static class KMapper extends Mapper<LongWritable, Text, Text, Text> {
        private Text txtKey = new Text();
        private Text txtValue = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] values = value.toString().split("\t");
            String childName = values[0];
            String parentName = values[1];
            txtKey.set(childName);
            txtValue.set("r_" + parentName); // the prefixes 'r_' and 'l_' mark the key's parent and child respectively
            context.write(txtKey, txtValue); // the child is the key; the value records its parent
            txtKey.set(parentName);
            txtValue.set("l_" + childName); // the parent is the key; the value records its child
            context.write(txtKey, txtValue);
        }
    }


Taking Lucy as an example again, on the reduce side we get the group <Lucy, {l_Tom, l_Jone, r_Mary, r_Ben}>. Emitting the Cartesian product of the child list and the parent list in this group gives exactly the rows we are after.

 public static class KReducer extends Reducer<Text, Text, Text, Text> {
        private Text txtKey = new Text();
        private Text txtValue = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            Iterator<Text> iterator = values.iterator();
            ArrayList<String> childList = new ArrayList<String>();  // the key's children
            ArrayList<String> parentList = new ArrayList<String>(); // the key's parents
            while (iterator.hasNext()) {
                String v = iterator.next().toString();
                if(v.startsWith("l")){ //get the child-list
                    childList.add(v.substring(2));
                }else if(v.startsWith("r")){ //get the parent-list
                    parentList.add(v.substring(2));
                }
            }
            for (String c : childList) {
                for (String p : parentList) {
                    txtKey.set(c);
                    txtValue.set(p);
                    context.write(txtKey, txtValue);
                }
            }
        }
    }
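
For completeness, here is a minimal driver sketch for this job, modeled on the word-count driver above. The outer class name SingleTableJoin is only a placeholder; note that both the map and the reduce output key/value types are Text this time.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SingleTableJoin {

    // ... KMapper and KReducer from above are nested here ...

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(SingleTableJoin.class);

        job.setMapperClass(KMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0])); // the tab-separated child-parent file

        job.setReducerClass(KReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory, must not exist yet

        job.waitForCompletion(true);
    }
}

Package and run it exactly like the word-count example.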

If you don't want to add the Hadoop jars to your project by hand, you can use Maven with the following dependencies:

<dependencies>
	<dependency>
		<groupId>org.apache.hadoop</groupId>
		<artifactId>hadoop-common</artifactId>
		<version>2.7.0</version>
	</dependency>
	<dependency>
		<groupId>org.apache.hadoop</groupId>
		<artifactId>hadoop-hdfs</artifactId>
		<version>2.7.0</version>
	</dependency>

	<dependency>
		<groupId>org.apache.hadoop</groupId>
		<artifactId>hadoop-mapreduce-client-core</artifactId>
		<version>2.7.0</version>
	</dependency>
</dependencies>


