Inverted Index in MapReduce

What is an inverted index?

An inverted index maps each term to the documents that contain it, along with how often it occurs in each. Given a search keyword, a search engine can then retrieve the documents in which that keyword appears most often.

Requirement:

Count how many times each word appears in a.txt and b.txt.

a.txt:

hello tom
hello jerry
hello kitty
jerry world

b.txt:

hello jerry
hello tom
jerry world

Analysis:

We want output in this shape:

hello   "a.txt->3  b.txt->2"
jerry   "a.txt->2  b.txt->2"
...

First we need the name of the file the current split came from, which the mapper can obtain like this:

FileSplit inputSplit = (FileSplit) context.getInputSplit();
Path path = inputSplit.getPath();
String name = path.getName();

map stage:

Read each line, split it into words, and emit word + "->" + fileName as the output key
and 1 as the output value, producing records like:

("hello->a.txt", 1)
("hello->a.txt", 1)
("hello->a.txt", 1)

("hello->b.txt", 1)
("hello->b.txt", 1)

combiner stage:

By the time the data reaches the combiner, the values for each key have been grouped together:

("hello->a.txt", {1,1,1})
("hello->b.txt", {1,1})

First iterate over the values to compute the sum:

("hello->a.txt", 3)
("hello->b.txt", 2)

Then split the key on "->", taking the word as the output key
and fileName + "->" + sum as the output value:

("hello", "a.txt->3")
("hello", "b.txt->2")

reduce stage:

The data arriving at reduce is again grouped by key:
("hello", {"a.txt->3", "b.txt->2"})

We iterate over the values and join them into a single space-separated String:
("hello", "a.txt->3 b.txt->2")

The implementation:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InverseIndex {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        
        Job job = Job.getInstance(conf);
        //set the job jar from this class's location
        job.setJarByClass(InverseIndex.class);
        
        //mapper settings
        job.setMapperClass(IndexMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));//input path
        
        //reducer settings
        job.setReducerClass(IndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        
        //the combiner pre-aggregates ("word->file", 1) records on the map side
        job.setCombinerClass(IndexCombiner.class);
                
        //submit the job and wait for completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
    
    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {

        private Text k = new Text();
        private Text v = new Text();
        
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
                
            String line = value.toString();
            String[] fields = line.split(" ");
            //the input split tells us which file this line came from
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            Path path = inputSplit.getPath();
            String name = path.getName();

            //emit ("word->fileName", 1) for each word on the line
            for (String f : fields) {
                k.set(f + "->" + name);
                v.set("1");
                context.write(k, v);
            }
        }
    }
    
    public static class IndexCombiner extends Reducer<Text, Text, Text, Text> {

        private Text k = new Text();
        private Text v = new Text();
        
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
                
            //the key looks like "hello->a.txt"; split it back into word and file
            String[] fields = key.toString().split("->");
            long sum = 0;
            for (Text t : values) {
                sum += Long.parseLong(t.toString());
            }
            //emit ("hello", "a.txt->3")
            k.set(fields[0]);
            v.set(fields[1] + "->" + sum);
            context.write(k, v);
        }
    }
    
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {

        private Text v = new Text();
        
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
                
            //join all "file->count" values into one space-separated string
            StringBuilder sb = new StringBuilder();
            for (Text t : values) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(t.toString());
            }
            v.set(sb.toString());
            context.write(key, v);
        }
    }
}
