Breadth-first Graph Search in MapReduce

In chapter 5 of the book "Data-Intensive Text Processing with MapReduce", the authors introduce how to parallelize breadth-first graph search with MapReduce. This parallel algorithm is a variant of Dijkstra's algorithm. I'm not going to cover the sequential version of Dijkstra's algorithm here; for a detailed explanation, refer to Wikipedia. Likewise, I'm not going to cover the graph data structure; please refer to Wikipedia for that as well.

If you know Dijkstra's algorithm well, you know that the key to it is the priority queue, which maintains a globally sorted list of nodes by current distance. This is not possible in MapReduce, as the programming model does not provide a mechanism for exchanging global data. Instead, we adopt a brute-force approach known as parallel breadth-first search.

The algorithm works by mapping over all nodes and emitting a key-value pair for each neighbor on the node's adjacency list. The key is the node id of the neighbor, and the value is the current distance to the node plus the weight of the edge to that neighbor: if we can reach node n with distance d, then we can reach every node connected to n with distance d + w, where w is the weight of the connecting edge. After the shuffle and sort, each reducer receives a destination node id as its key and, as values, the distances of all paths leading to that node. The reducer selects the shortest of these distances and then updates the distance in the node's data structure.
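
To make this concrete, consider the mapper in the code below, which counts hops (every edge has weight 1). For the sample input line 1 0 2:3: (node 1, current distance 0, neighbors 2 and 3), it emits:

(2, "VALUE 1")       candidate distance to neighbor 2
(3, "VALUE 1")       candidate distance to neighbor 3
(1, "VALUE 0")       node 1's own current distance
(1, "NODES 2:3:")    node 1's adjacency list, carried along for the next iteration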

It is apparent that parallel breadth-first search is an iterative algorithm, where each iteration corresponds to one MapReduce job. Each iteration discovers all nodes that are one more hop away from the current frontier. We also need to pass the graph structure along from one iteration to the next, which is accomplished by emitting the node data structure itself in the mapper. In the reducer, we must distinguish the node data structure from the distance values, update the minimum distance inside the node data structure, and emit it as the final value. The output is then ready to serve as input to the next iteration.
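
For example, tracing the sample input below by hand, the reducer for key 2 receives the following values in the first iteration:

VALUE 1         via node 1 (distance 0 + 1)
VALUE 10000     node 2's own current distance
VALUE 10001     via node 4, which is still at the placeholder distance 10000
VALUE 10001     via node 5, likewise still at 10000
NODES 1:4:5:    node 2's adjacency list

It selects the minimum distance, 1, and emits 2 1 1:4:5:, which is exactly the input format again.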

But how many iterations are necessary to compute the shortest distance to all nodes? The answer is the diameter of the graph, that is, the greatest distance between any pair of nodes (plus one more pass to detect that nothing changed). The algorithm can terminate when the shortest distance at every node no longer changes. We can use counters to keep track of such events: at the end of each MapReduce iteration, the driver program reads the counter value and determines whether another iteration is necessary.
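
The quoted code below actually takes a simpler route: the driver re-reads the output file after each job and compares every node's distance against the previous iteration. A counter-based variant might look like the following sketch; the Convergence enum and the wiring around it are my own assumptions, not part of the original code:

import org.apache.hadoop.mapreduce.Job;

//A minimal sketch of counter-based convergence detection.
public class ConvergenceSketch {

    public enum Convergence { UPDATED }

    //In the reducer, after computing the new minimum distance, one would
    //increment the counter whenever the node's distance improved:
    //    if (lowest < previousDistance) {
    //        context.getCounter(Convergence.UPDATED).increment(1);
    //    }

    //In the driver, after job.waitForCompletion(true):
    public static boolean converged(Job job) throws Exception {
        long updated = job.getCounters()
                .findCounter(Convergence.UPDATED).getValue();
        return updated == 0;//no node improved its distance this round
    }
}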

package graph;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;

public class ParallelDijkstra extends Configured implements Tool {

    public static String OUT = "output";
    public static String IN = "inputlarger";

    public static class DijkstraMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

            //From slide 20 of Graph Algorithms with MapReduce (by Jimmy Lin, Univ @ Maryland)
            //Key is node n
            //Value is D, Points-To
            //For every point (or key), look at everything it points to.
            //Emit or write to the points to variable with the current distance + 1
            Text word = new Text();
            String line = value.toString();//looks like 1 0 2:3:
            String[] sp = line.split(" ");//splits on space
            int distanceAdded = Integer.parseInt(sp[1]) + 1;//every edge has weight 1, so one hop adds 1
            String[] pointsTo = sp[2].split(":");
            for (String neighbor : pointsTo) {//each entry is a neighbor's node id
                word.set("VALUE " + distanceAdded);//candidate distance for this neighbor
                context.write(new LongWritable(Integer.parseInt(neighbor)), word);
                word.clear();
            }
            //pass in current node's distance (if it is the lowest distance)
            word.set("VALUE " + sp[1]);
            context.write(new LongWritable(Integer.parseInt(sp[0])), word);
            word.clear();

            word.set("NODES " + sp[2]);//tells me to append on the final tally
            context.write(new LongWritable(Integer.parseInt(sp[0])), word);
            word.clear();

        }
    }

    public static class DijkstraReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
        public void reduce(LongWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {

            //From slide 20 of Graph Algorithms with MapReduce (by Jimmy Lin, Univ @ Maryland)
            //The key is the current point
            //The values are all the possible distances to this point
            //we simply emit the point and the minimum distance value

            String nodes = "UNMODED";
            Text word = new Text();
            int lowest = 10009;//sentinel for infinity: larger than any distance in the input (which uses 10000)

            for (Text val : values) {//looks like NODES/VALUES 1 0 2:3:, we need to use the first as a key
                String[] sp = val.toString().split(" ");//splits on space
                //look at first value
                if (sp[0].equalsIgnoreCase("NODES")) {
                    nodes = null;
                    nodes = sp[1];
                } else if (sp[0].equalsIgnoreCase("VALUE")) {
                    int distance = Integer.parseInt(sp[1]);
                    lowest = Math.min(distance, lowest);
                }
            }
            word.set(lowest + " " + nodes);
            context.write(key, word);
            word.clear();
        }
    }

    //Almost exactly from http://hadoop.apache.org/mapreduce/docs/current/mapred_tutorial.html
    public int run(String[] args) throws Exception {
        //http://code.google.com/p/joycrawler/source/browse/NetflixChallenge/src/org/niubility/learning/knn/KNNDriver.java?r=242
        //make the key -> value space separated (for iterations)
        getConf().set("mapred.textoutputformat.separator", " ");

        //set in and out to args.
        IN = args[0];
        OUT = args[1];

        String infile = IN;
        String outputfile = OUT + System.nanoTime();

        boolean isdone = false;
        boolean success = false;

        HashMap<Integer, Integer> _map = new HashMap<Integer, Integer>();

        while (!isdone) {

            Job job = new Job(getConf(), "Dijkstra");
            job.setJarByClass(ParallelDijkstra.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            job.setMapperClass(DijkstraMapper.class);
            job.setReducerClass(DijkstraReducer.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            FileInputFormat.addInputPath(job, new Path(infile));
            FileOutputFormat.setOutputPath(job, new Path(outputfile));

            success = job.waitForCompletion(true);

            //remove the input file
            //http://eclipse.sys-con.com/node/1287801/mobile
            if (!infile.equals(IN)) {
                String indir = infile.replace("part-r-00000", "");
                Path ddir = new Path(indir);
                FileSystem dfs = FileSystem.get(getConf());
                dfs.delete(ddir, true);
            }

            infile = outputfile + "/part-r-00000";
            outputfile = OUT + System.nanoTime();

            //do we need to re-run the job with the new input file??
            //http://www.hadoop-blog.com/2010/11/how-to-read-file-from-hdfs-in-hadoop.html
            isdone = true;//set the job to NOT run again!
            Path ofile = new Path(infile);
            FileSystem fs = FileSystem.get(new Configuration());
            BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(ofile)));

            HashMap<Integer, Integer> imap = new HashMap<Integer, Integer>();
            String line = br.readLine();
            while (line != null) {
                //each line looks like 0 1 2:3:
                //we need to verify node -> distance doesn't change
                String[] sp = line.split(" ");
                int node = Integer.parseInt(sp[0]);
                int distance = Integer.parseInt(sp[1]);
                imap.put(node, distance);
                line = br.readLine();
            }
            br.close();
            if (_map.isEmpty()) {
                //first iteration... must do a second iteration regardless!
                isdone = false;
            } else {
                //http://www.java-examples.com/iterate-through-values-java-hashmap-example
                //http://www.javabeat.net/articles/33-generics-in-java-50-1.html
                for (Integer node : imap.keySet()) {
                    int val = imap.get(node);
                    Integer prev = _map.get(node);
                    if (prev == null || prev != val) {
                        //the distance changed (or the node is new)... not at convergence yet
                        isdone = false;
                    }
                }
            }
            if (!isdone) {
                _map.putAll(imap);//copy imap to _map for the next iteration (if required)
            }
        }

        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new ParallelDijkstra(), args));
    }
}

The code is quoted from this blog post; the post also has instructions for running the code in the Eclipse IDE with the Hadoop plugin. In my case, I run it on a tiny Hadoop cluster.

The input data (each line is a node id, its current shortest distance from the source, and a colon-separated adjacency list; node 1 is the source with distance 0, and 10000 stands in for infinity):

1 0 2:3:
2 10000 1:4:5:
3 10000 1:
4 10000 2:5:
5 10000 2:4:

Output:

[root@n1 hadoop-examples]# hadoop jar hadoop-algorithms.jar graph.ParallelDijkstra inputgraph output
13/07/13 20:07:51 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/13 20:08:03 INFO input.FileInputFormat: Total input paths to process : 1
13/07/13 20:09:04 INFO mapred.JobClient: Running job: job_201307131656_0001
13/07/13 20:09:05 INFO mapred.JobClient:  map 0% reduce 0%
13/07/13 20:10:05 INFO mapred.JobClient:  map 100% reduce 0%
13/07/13 20:15:13 INFO mapred.JobClient:  map 100% reduce 100%
13/07/13 20:15:16 INFO mapred.JobClient: Job complete: job_201307131656_0001
13/07/13 20:15:16 INFO mapred.JobClient: Counters: 32
13/07/13 20:15:16 INFO mapred.JobClient:   File System Counters
13/07/13 20:15:16 INFO mapred.JobClient:     FILE: Number of bytes read=183
13/07/13 20:15:16 INFO mapred.JobClient:     FILE: Number of bytes written=313855
13/07/13 20:15:16 INFO mapred.JobClient:     FILE: Number of read operations=0
13/07/13 20:15:16 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/07/13 20:15:16 INFO mapred.JobClient:     FILE: Number of write operations=0
13/07/13 20:15:16 INFO mapred.JobClient:     HDFS: Number of bytes read=173
13/07/13 20:15:16 INFO mapred.JobClient:     HDFS: Number of bytes written=53
13/07/13 20:15:16 INFO mapred.JobClient:     HDFS: Number of read operations=2
13/07/13 20:15:16 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/07/13 20:15:16 INFO mapred.JobClient:     HDFS: Number of write operations=1
13/07/13 20:15:16 INFO mapred.JobClient:   Job Counters 
13/07/13 20:15:16 INFO mapred.JobClient:     Launched map tasks=1
13/07/13 20:15:16 INFO mapred.JobClient:     Launched reduce tasks=1
13/07/13 20:15:16 INFO mapred.JobClient:     Data-local map tasks=1
13/07/13 20:15:16 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=38337
13/07/13 20:15:16 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=159310
13/07/13 20:15:16 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/07/13 20:15:16 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/13 20:15:16 INFO mapred.JobClient:   Map-Reduce Framework
13/07/13 20:15:16 INFO mapred.JobClient:     Map input records=5
13/07/13 20:15:16 INFO mapred.JobClient:     Map output records=20
13/07/13 20:15:16 INFO mapred.JobClient:     Map output bytes=383
13/07/13 20:15:16 INFO mapred.JobClient:     Input split bytes=112
13/07/13 20:15:16 INFO mapred.JobClient:     Combine input records=0
13/07/13 20:15:16 INFO mapred.JobClient:     Combine output records=0
13/07/13 20:15:16 INFO mapred.JobClient:     Reduce input groups=5
13/07/13 20:15:16 INFO mapred.JobClient:     Reduce shuffle bytes=179
13/07/13 20:15:16 INFO mapred.JobClient:     Reduce input records=20
13/07/13 20:15:16 INFO mapred.JobClient:     Reduce output records=5
13/07/13 20:15:16 INFO mapred.JobClient:     Spilled Records=40
13/07/13 20:15:16 INFO mapred.JobClient:     CPU time spent (ms)=3970
13/07/13 20:15:16 INFO mapred.JobClient:     Physical memory (bytes) snapshot=240209920
13/07/13 20:15:16 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2276737024
13/07/13 20:15:16 INFO mapred.JobClient:     Total committed heap usage (bytes)=101519360
13/07/13 20:15:17 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/13 20:15:18 INFO input.FileInputFormat: Total input paths to process : 1
13/07/13 20:15:18 INFO mapred.JobClient: Running job: job_201307131656_0002
13/07/13 20:15:19 INFO mapred.JobClient:  map 0% reduce 0%
13/07/13 20:15:34 INFO mapred.JobClient:  map 100% reduce 0%
13/07/13 20:15:41 INFO mapred.JobClient:  map 100% reduce 100%
13/07/13 20:15:44 INFO mapred.JobClient: Job complete: job_201307131656_0002
13/07/13 20:15:44 INFO mapred.JobClient: Counters: 32
13/07/13 20:15:44 INFO mapred.JobClient:   File System Counters
13/07/13 20:15:44 INFO mapred.JobClient:     FILE: Number of bytes read=190
13/07/13 20:15:44 INFO mapred.JobClient:     FILE: Number of bytes written=312999
13/07/13 20:15:44 INFO mapred.JobClient:     FILE: Number of read operations=0
13/07/13 20:15:44 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/07/13 20:15:44 INFO mapred.JobClient:     FILE: Number of write operations=0
13/07/13 20:15:44 INFO mapred.JobClient:     HDFS: Number of bytes read=188
13/07/13 20:15:44 INFO mapred.JobClient:     HDFS: Number of bytes written=45
13/07/13 20:15:44 INFO mapred.JobClient:     HDFS: Number of read operations=2
13/07/13 20:15:44 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/07/13 20:15:44 INFO mapred.JobClient:     HDFS: Number of write operations=1
13/07/13 20:15:44 INFO mapred.JobClient:   Job Counters 
13/07/13 20:15:44 INFO mapred.JobClient:     Launched map tasks=1
13/07/13 20:15:44 INFO mapred.JobClient:     Launched reduce tasks=1
13/07/13 20:15:44 INFO mapred.JobClient:     Data-local map tasks=1
13/07/13 20:15:44 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=16471
13/07/13 20:15:44 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=6476
13/07/13 20:15:44 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/07/13 20:15:44 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/13 20:15:44 INFO mapred.JobClient:   Map-Reduce Framework
13/07/13 20:15:44 INFO mapred.JobClient:     Map input records=5
13/07/13 20:15:44 INFO mapred.JobClient:     Map output records=20
13/07/13 20:15:44 INFO mapred.JobClient:     Map output bytes=359
13/07/13 20:15:44 INFO mapred.JobClient:     Input split bytes=135
13/07/13 20:15:44 INFO mapred.JobClient:     Combine input records=0
13/07/13 20:15:44 INFO mapred.JobClient:     Combine output records=0
13/07/13 20:15:44 INFO mapred.JobClient:     Reduce input groups=5
13/07/13 20:15:44 INFO mapred.JobClient:     Reduce shuffle bytes=186
13/07/13 20:15:44 INFO mapred.JobClient:     Reduce input records=20
13/07/13 20:15:44 INFO mapred.JobClient:     Reduce output records=5
13/07/13 20:15:44 INFO mapred.JobClient:     Spilled Records=40
13/07/13 20:15:44 INFO mapred.JobClient:     CPU time spent (ms)=2290
13/07/13 20:15:44 INFO mapred.JobClient:     Physical memory (bytes) snapshot=250818560
13/07/13 20:15:44 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1309839360
13/07/13 20:15:44 INFO mapred.JobClient:     Total committed heap usage (bytes)=81399808
13/07/13 20:15:45 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/13 20:15:45 INFO input.FileInputFormat: Total input paths to process : 1
13/07/13 20:15:46 INFO mapred.JobClient: Running job: job_201307131656_0003
13/07/13 20:15:47 INFO mapred.JobClient:  map 0% reduce 0%
13/07/13 20:15:56 INFO mapred.JobClient:  map 100% reduce 0%
13/07/13 20:16:02 INFO mapred.JobClient:  map 100% reduce 100%
13/07/13 20:16:04 INFO mapred.JobClient: Job complete: job_201307131656_0003
13/07/13 20:16:04 INFO mapred.JobClient: Counters: 32
13/07/13 20:16:04 INFO mapred.JobClient:   File System Counters
13/07/13 20:16:04 INFO mapred.JobClient:     FILE: Number of bytes read=172
13/07/13 20:16:04 INFO mapred.JobClient:     FILE: Number of bytes written=312907
13/07/13 20:16:04 INFO mapred.JobClient:     FILE: Number of read operations=0
13/07/13 20:16:04 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/07/13 20:16:04 INFO mapred.JobClient:     FILE: Number of write operations=0
13/07/13 20:16:04 INFO mapred.JobClient:     HDFS: Number of bytes read=180
13/07/13 20:16:04 INFO mapred.JobClient:     HDFS: Number of bytes written=45
13/07/13 20:16:04 INFO mapred.JobClient:     HDFS: Number of read operations=2
13/07/13 20:16:04 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/07/13 20:16:04 INFO mapred.JobClient:     HDFS: Number of write operations=1
13/07/13 20:16:04 INFO mapred.JobClient:   Job Counters 
13/07/13 20:16:04 INFO mapred.JobClient:     Launched map tasks=1
13/07/13 20:16:04 INFO mapred.JobClient:     Launched reduce tasks=1
13/07/13 20:16:04 INFO mapred.JobClient:     Data-local map tasks=1
13/07/13 20:16:04 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=8584
13/07/13 20:16:04 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=4759
13/07/13 20:16:04 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/07/13 20:16:04 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/13 20:16:04 INFO mapred.JobClient:   Map-Reduce Framework
13/07/13 20:16:04 INFO mapred.JobClient:     Map input records=5
13/07/13 20:16:04 INFO mapred.JobClient:     Map output records=20
13/07/13 20:16:04 INFO mapred.JobClient:     Map output bytes=335
13/07/13 20:16:04 INFO mapred.JobClient:     Input split bytes=135
13/07/13 20:16:04 INFO mapred.JobClient:     Combine input records=0
13/07/13 20:16:04 INFO mapred.JobClient:     Combine output records=0
13/07/13 20:16:04 INFO mapred.JobClient:     Reduce input groups=5
13/07/13 20:16:04 INFO mapred.JobClient:     Reduce shuffle bytes=168
13/07/13 20:16:04 INFO mapred.JobClient:     Reduce input records=20
13/07/13 20:16:04 INFO mapred.JobClient:     Reduce output records=5
13/07/13 20:16:04 INFO mapred.JobClient:     Spilled Records=40
13/07/13 20:16:04 INFO mapred.JobClient:     CPU time spent (ms)=1880
13/07/13 20:16:04 INFO mapred.JobClient:     Physical memory (bytes) snapshot=286953472
13/07/13 20:16:04 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3241136128
13/07/13 20:16:04 INFO mapred.JobClient:     Total committed heap usage (bytes)=139132928

From the output we can see that the job iterated three times on the given input graph: the first iteration propagates distance 1 to nodes 2 and 3, the second propagates distance 2 to nodes 4 and 5, and the third confirms that no distance changed, so the driver stops.
