Implementing Joins with MapReduce

1. Introduction

Joining two datasets on a key is a very common requirement in real business scenarios. If both datasets are small, the join can be done entirely in memory. But what about large datasets? An in-memory join would clearly run out of memory (OOM). MapReduce can be used to join large datasets.

MapReduce joins fall into two main categories:

- MapJoin (map-side join)
- ReduceJoin (reduce-side join)

ReduceJoin:
1. Map phase: the two datasets, data1 and data2, are each read by mappers and parsed into key-value pairs, with the join field as the key and the selected fields as the value, tagged with their source (data1 or data2).
2. Reduce phase: each reduce task receives the records from data1 and data2 that share the same key and computes their cross product on the reduce side. The main drawback is heavy memory consumption, which can lead to OOM.
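The cross-product step described above can be sketched in plain Java, outside of Hadoop, for illustration. The `"1:"`/`"2:"` source tags and the sample values are assumptions made for this sketch; in a real job the tagging scheme is whatever the mappers emit. Buffering both sides in lists is exactly the memory pressure the text warns about.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReduceJoinSketch {
    // Simulates one reduce() call for a single join key: every value is
    // tagged "1:" (from data1) or "2:" (from data2); emit the cross product.
    static List<String> joinForKey(String key, List<String> taggedValues) {
        List<String> left = new ArrayList<>();   // records from data1
        List<String> right = new ArrayList<>();  // records from data2
        for (String v : taggedValues) {
            if (v.startsWith("1:")) left.add(v.substring(2));
            else right.add(v.substring(2));
        }
        List<String> out = new ArrayList<>();
        for (String l : left)        // buffering both sides and pairing them
            for (String r : right)   // is what can exhaust reducer memory
                out.add(key + "\t" + l + "\t" + r);
        return out;
    }

    public static void main(String[] args) {
        // one movie record joined against two rating records for key "1"
        List<String> vals = Arrays.asList("1:Toy Story (1995)", "2:1,5", "2:6,4");
        for (String line : joinForKey("1", vals)) System.out.println(line);
    }
}
```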

MapJoin:
MapJoin applies when one of the two datasets is small. The small dataset is loaded entirely into memory and indexed by the join key, and the large dataset serves as the MapTask input; each call to map() probes the in-memory index to perform the join and emits the joined result by key. This approach uses Hadoop's DistributedCache to distribute the small dataset to the compute nodes: every node running a map task loads it into memory and builds the index on the join key.

2. Example

Two datasets, movies.dat and ratings.dat, are available; samples are shown below (https://pan.baidu.com/s/1vC-uq2sm0yFdqFZVOntlhA, access code: dplw).

movies.dat

1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller
11::American President, The (1995)::Comedy|Drama|Romance
12::Dracula: Dead and Loving It (1995)::Comedy|Horror
13::Balto (1995)::Animation|Children's
14::Nixon (1995)::Drama
15::Cutthroat Island (1995)::Action|Adventure|Romance
16::Casino (1995)::Drama|Thriller
17::Sense and Sensibility (1995)::Drama|Romance
18::Four Rooms (1995)::Thriller

Fields: movieid, moviename, movietype

ratings.dat

1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
1::594::4::978302268
1::919::4::978301368

Fields: userid, movieid, rate, timestamp

 

1. Requirement:

 SELECT * FROM movies a JOIN ratings b ON a.movieid = b.movieid

 Join the two tables so that the final output contains the following six fields: movieid, userid, rate, moviename, movietype, timestamp
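Before looking at the full job, the per-record join logic can be shown on its own. The sketch below parses one `::`-delimited line from each file and formats the six output fields in the required order; the rating row used in `main` is a hypothetical one for movieid 1, chosen so the join condition matches.

```java
public class JoinFormatSketch {
    // Joins one movies.dat line with one ratings.dat line and formats the
    // six fields: movieid, userid, rate, moviename, movietype, timestamp.
    static String joinLine(String movieLine, String ratingLine) {
        String[] m = movieLine.split("::");   // movieid :: moviename :: movietype
        String[] r = ratingLine.split("::");  // userid :: movieid :: rate :: timestamp
        if (!m[0].equals(r[1])) return null;  // join condition: a.movieid = b.movieid
        return m[0] + "\t" + r[0] + "\t" + r[2] + "\t" + m[1] + "\t" + m[2] + "\t" + r[3];
    }

    public static void main(String[] args) {
        System.out.println(joinLine(
                "1::Toy Story (1995)::Animation|Children's|Comedy",
                "1::1::5::978300760"));
    }
}
```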

 

2. Implementation

Step 1: Encapsulate a MovieRate class to support sorting and serialization

package MapReduceJoin;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class MovieRate implements WritableComparable<MovieRate> {
    private String movieid;
    private String useid;
    private int rate;
    private String movieName;
    private String movieType;
    private long ts;

    public String getMovieid() {
        return movieid;
    }

    public String getUseid() {
        return useid;
    }

    public int getRate() {
        return rate;
    }

    public String getMovieName() {
        return movieName;
    }

    public String getMovieType() {
        return movieType;
    }



    public void setMovieid(String movieid) {
        this.movieid = movieid;
    }

    public void setUseid(String useid) {
        this.useid = useid;
    }

    public void setRate(int rate) {
        this.rate = rate;
    }

    public void setMovieName(String movieName) {
        this.movieName = movieName;
    }

    public void setMovieType(String movieType) {
        this.movieType = movieType;
    }

    public long getTs() {
        return ts;
    }

    public void setTs(long ts) {
        this.ts = ts;
    }

    public MovieRate() {
        // Hadoop requires a no-argument constructor to instantiate the
        // Writable during deserialization
    }

    public MovieRate(String movieid, String useid, int rate, String movieName, String movieType, long ts) {
        this.movieid = movieid;
        this.useid = useid;
        this.rate = rate;
        this.movieName = movieName;
        this.movieType = movieType;
        this.ts = ts;
    }

    @Override
    public String toString() {
        return   movieid + "\t" + useid + "\t" + rate + "\t" + movieName
                + "\t" + movieType + "\t" + ts;
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(movieid);
        out.writeUTF(useid);
        out.writeInt(rate);
        out.writeUTF(movieName);
        out.writeUTF(movieType);
        out.writeLong(ts);
    }

    public void readFields(DataInput in) throws IOException {
        this.movieid = in.readUTF();
        this.useid = in.readUTF();
        this.rate = in.readInt();
        this.movieName = in.readUTF();
        this.movieType = in.readUTF();
        this.ts = in.readLong();
    }
    public int compareTo(MovieRate o) {
        // sort in descending order by movieid, then by userid
        int it = o.getMovieid().compareTo(this.movieid);
        if(it == 0){
            return o.getUseid().compareTo(this.useid) ;
        }else{
            return it;
        }
    }
}

Step 2: Define the Movie class

package MapReduceJoin;

public class Movie {
    private String movieid;
    private String movieName;
    private String moiveType;

    public String getMovieid() {
        return movieid;
    }

    public String getMovieName() {
        return movieName;
    }

    public String getMoiveType() {
        return moiveType;
    }

    public void setMovieid(String movieid) {
        this.movieid = movieid;
    }

    public void setMovieName(String movieName) {
        this.movieName = movieName;
    }

    public void setMoiveType(String moiveType) {
        this.moiveType = moiveType;
    }

    public Movie(String movieid, String movieName, String moiveType) {
        this.movieid = movieid;
        this.movieName = movieName;
        this.moiveType = moiveType;
    }
}

Step 3: Write the MapReduce program

package MapReduceJoin;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class MovieRatingMapJoin {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://qyl01:9000");
        System.setProperty("HADOOP_USER_NAME","hadoop");
        Job job = Job.getInstance(conf);

        job.setJar("/home/qyl/mrmr.jar");

        job.setMapperClass(MovieRatingMapper.class);
        job.setMapOutputKeyClass(MovieRate.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setNumReduceTasks(0);  // map-only job: no reduce phase is needed

        String minInput = args[0];
        String maxInput = args[1];
        String output = args[2];

        FileInputFormat.setInputPaths(job, new Path(maxInput));
        Path outputPath = new Path(output);
        FileSystem fs = FileSystem.get(conf);
        if(fs.exists(outputPath)){
            fs.delete(outputPath, true);
        }
        FileOutputFormat.setOutputPath(job, outputPath);

        URI uri = new Path(minInput).toUri();
        job.addCacheFile(uri);
        boolean status = job.waitForCompletion(true);
        System.exit(status?0:1);

    }

    static class MovieRatingMapper extends Mapper<LongWritable, Text, MovieRate, NullWritable> {
        // holds every key-value pair parsed from the small dataset
        private static Map<String, Movie> movieMap = new HashMap<String, Movie>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // get the local paths of the files distributed via DistributedCache
            Path[] localCacheFilePaths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            String myfilePath = localCacheFilePaths[0].toString();
            System.out.println(myfilePath);
            URI[] cacheFiles = context.getCacheFiles();
            System.out.println(java.util.Arrays.toString(cacheFiles));

            BufferedReader br = new BufferedReader(new FileReader(myfilePath));
            // each line read from the file is one movie record
            String line = "";
            while (null != (line = br.readLine())) {
                String[] split = line.split("::");   // split one line of data
                movieMap.put(split[0], new Movie(split[0], split[1], split[2]));
            }
            IOUtils.closeStream(br);

        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] splits = value.toString().split("::");
            String userid = splits[0];
            String movieid = splits[1];
            int rate = Integer.parseInt(splits[2]);
            long ts = Long.parseLong(splits[3]);
            String movieName = movieMap.get(movieid).getMovieName();
            String movieType = movieMap.get(movieid).getMoiveType();
            MovieRate mr = new MovieRate(movieid,userid,rate,movieName,movieType,ts);
            context.write(mr,NullWritable.get());
        }


    }

}
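Assuming the jar path from the driver above and illustrative HDFS paths, submitting the job could look like the following; the three arguments map to args[0] (small input, distributed via the cache), args[1] (large input), and args[2] (output directory).

```shell
# arguments: <small input: movies.dat> <large input: ratings.dat> <output dir>
hadoop jar /home/qyl/mrmr.jar MapReduceJoin.MovieRatingMapJoin \
    /movie/input/movies.dat /movie/input/ratings.dat /movie/output
```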

