Chapter 5 of the book "Data-Intensive Text Processing with MapReduce" introduces how to perform parallel breadth-first graph search with MapReduce. This parallel algorithm is a variant of Dijkstra's algorithm. I'm not going to cover the sequential version of Dijkstra's algorithm or the graph data structure here; for detailed explanations of both, refer to Wikipedia.
If you know Dijkstra's algorithm well, you will know that its key component is the priority queue, which maintains a globally sorted list of nodes by current distance. This is not possible in MapReduce, as the programming model does not provide a mechanism for exchanging global data. Instead, we adopt a brute-force approach known as parallel breadth-first search.
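Just to make the role of that global priority queue concrete, here is a minimal sketch of the sequential core loop (my own illustration with a plain adjacency map, not code from the book):

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class SequentialDijkstra {

    // adj maps a node to its neighbors and edge weights;
    // returns the shortest distance from source to every reachable node.
    public static Map<Integer, Integer> shortestDistances(
            Map<Integer, Map<Integer, Integer>> adj, int source) {
        Map<Integer, Integer> dist = new HashMap<Integer, Integer>();
        // The priority queue keeps a globally sorted view of the frontier by
        // tentative distance -- exactly the global state MapReduce cannot share.
        PriorityQueue<int[]> pq =
                new PriorityQueue<int[]>(Comparator.comparingInt((int[] e) -> e[1]));
        dist.put(source, 0);
        pq.add(new int[]{source, 0});
        while (!pq.isEmpty()) {
            int[] head = pq.poll(); // the closest unsettled node overall
            int u = head[0], d = head[1];
            if (d > dist.get(u)) continue; // stale queue entry, skip it
            Map<Integer, Integer> neighbors = adj.get(u);
            if (neighbors == null) continue;
            for (Map.Entry<Integer, Integer> e : neighbors.entrySet()) {
                int nd = d + e.getValue();
                Integer old = dist.get(e.getKey());
                if (old == null || nd < old) { // relax the edge
                    dist.put(e.getKey(), nd);
                    pq.add(new int[]{e.getKey(), nd});
                }
            }
        }
        return dist;
    }
}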
The algorithm works by mapping over all nodes and emitting a key-value pair for each neighbor on the node's adjacency list. The key is the node id of the neighbor, and the value is the current distance to the node plus the weight of the edge to that neighbor. If we can reach node n with distance d, then we can reach every node m connected to n with distance d + w(n, m); with unit edge weights, as in the code below, this is simply d + 1. For example, if node 1 currently has distance 0 and neighbors 2 and 3, the mapper emits (2, 1) and (3, 1). After the shuffle and sort, each reducer receives a destination node id as the key and the distances of all paths leading to that node as the values. The reducer selects the shortest of these distances and then updates the distance in the node data structure.
It is apparent that parallel breadth-first search is an iterative algorithm, where each iteration corresponds to a MapReduce job. With each iteration, we discover all nodes connected to the current frontier. We also need to pass the graph structure along from one iteration to the next; this is accomplished by emitting the node data structure itself in the mapper. In the reducer, we must distinguish the node data structure from the distance values, update the minimum distance in the node data structure, and emit it as the final value, ready to serve as input to the next iteration.
But how many iterations are necessary to compute the shortest distance to all nodes? The answer is the diameter of the graph, i.e., the greatest distance between any pair of nodes. The algorithm can terminate when the shortest distances at every node no longer change. We can use counters to keep track of such events: at the end of each MapReduce iteration, the driver program reads the counter value and determines whether another iteration is necessary.
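As a minimal sketch of that counter-based check (the enum and method names here are my own; note that the full program below actually takes a different route, re-reading the reducer output from HDFS and comparing every node's distance against the previous iteration):

import org.apache.hadoop.mapreduce.Job;

public class ConvergenceCheck {

    // Illustrative counter: the reducer would increment it whenever it lowers
    // a node's minimum distance, e.g.
    //     context.getCounter(Counters.DISTANCE_UPDATED).increment(1);
    public enum Counters { DISTANCE_UPDATED }

    // Called by the driver after job.waitForCompletion(true): if no reducer
    // lowered any distance in this iteration, the algorithm has converged.
    public static boolean converged(Job job) throws Exception {
        long updated = job.getCounters()
                .findCounter(Counters.DISTANCE_UPDATED)
                .getValue();
        return updated == 0;
    }
}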
package graph;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;

public class ParallelDijkstra extends Configured implements Tool {

    public static String OUT = "output";
    public static String IN = "inputlarger";

    public static class DijkstraMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // From slide 20 of "Graph Algorithms with MapReduce" (Jimmy Lin, Univ. of Maryland).
            // Each input line looks like "1 0 2:3:" -- node id, current distance,
            // and a colon-separated adjacency list. All edges have weight 1.
            Text word = new Text();
            String line = value.toString();
            String[] sp = line.split(" ");

            // Every neighbor on the adjacency list is reachable at the
            // current distance plus one.
            int distanceAdded = Integer.parseInt(sp[1]) + 1;
            String[] pointsTo = sp[2].split(":");
            for (String neighbor : pointsTo) {
                word.set("VALUE " + distanceAdded); // a candidate distance for this neighbor
                context.write(new LongWritable(Integer.parseInt(neighbor)), word);
                word.clear();
            }

            // Re-emit this node's own current distance, in case it is already the lowest.
            word.set("VALUE " + sp[1]);
            context.write(new LongWritable(Integer.parseInt(sp[0])), word);
            word.clear();

            // Pass the graph structure (the adjacency list) along to the next iteration.
            word.set("NODES " + sp[2]);
            context.write(new LongWritable(Integer.parseInt(sp[0])), word);
            word.clear();
        }
    }

    public static class DijkstraReducer extends Reducer<LongWritable, Text, LongWritable, Text> {

        public void reduce(LongWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Also from slide 20 of "Graph Algorithms with MapReduce".
            // The key is a node id; the values are all the candidate distances to
            // it, plus one NODES record carrying its adjacency list. Emit the
            // node with the minimum distance and its (unchanged) adjacency list.
            String nodes = "UNMODED"; // placeholder in case no NODES record arrives
            Text word = new Text();
            int lowest = 10009; // stands in for infinity (must exceed 10000 + #iterations)
            for (Text val : values) {
                String[] sp = val.toString().split(" ");
                if (sp[0].equalsIgnoreCase("NODES")) {
                    nodes = sp[1]; // the adjacency list for this node
                } else if (sp[0].equalsIgnoreCase("VALUE")) {
                    int distance = Integer.parseInt(sp[1]);
                    lowest = Math.min(distance, lowest);
                }
            }
            word.set(lowest + " " + nodes);
            context.write(key, word);
            word.clear();
        }
    }

    // Driver loosely follows http://hadoop.apache.org/mapreduce/docs/current/mapred_tutorial.html
    public int run(String[] args) throws Exception {
        // Make the output key/value space-separated, so each iteration's output can
        // be parsed as the next iteration's input (see http://code.google.com/p/joycrawler/source/browse/NetflixChallenge/src/org/niubility/learning/knn/KNNDriver.java?r=242).
        getConf().set("mapred.textoutputformat.separator", " ");

        IN = args[0];
        OUT = args[1];
        String infile = IN;
        String outputfile = OUT + System.nanoTime();
        boolean isdone = false;
        boolean success = false;
        HashMap<Integer, Integer> _map = new HashMap<Integer, Integer>();

        while (!isdone) {
            Job job = new Job(getConf(), "Dijkstra");
            job.setJarByClass(ParallelDijkstra.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            job.setMapperClass(DijkstraMapper.class);
            job.setReducerClass(DijkstraReducer.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileInputFormat.addInputPath(job, new Path(infile));
            FileOutputFormat.setOutputPath(job, new Path(outputfile));
            success = job.waitForCompletion(true);

            // Remove the previous iteration's output, which served as this
            // iteration's input (see http://eclipse.sys-con.com/node/1287801/mobile).
            if (!infile.equals(IN)) {
                String indir = infile.replace("part-r-00000", "");
                Path ddir = new Path(indir);
                FileSystem dfs = FileSystem.get(getConf());
                dfs.delete(ddir, true);
            }
            infile = outputfile + "/part-r-00000";
            outputfile = OUT + System.nanoTime();

            // Do we need to re-run the job? Read this iteration's output from HDFS
            // and compare every node's distance with the previous iteration's
            // (see http://www.hadoop-blog.com/2010/11/how-to-read-file-from-hdfs-in-hadoop.html).
            isdone = true; // assume convergence until a changed distance is found
            Path ofile = new Path(infile);
            FileSystem fs = FileSystem.get(getConf());
            BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(ofile)));
            HashMap<Integer, Integer> imap = new HashMap<Integer, Integer>();
            String line = br.readLine();
            while (line != null) {
                // Each output line looks like "0 1 2:3:"; record node -> distance.
                String[] sp = line.split(" ");
                int node = Integer.parseInt(sp[0]);
                int distance = Integer.parseInt(sp[1]);
                imap.put(node, distance);
                line = br.readLine();
            }
            br.close();

            if (_map.isEmpty()) {
                // First iteration: we must run at least one more, regardless.
                isdone = false;
            } else {
                for (Integer node : imap.keySet()) {
                    Integer prev = _map.get(node);
                    if (prev == null || prev.intValue() != imap.get(node).intValue()) {
                        // A distance changed, so we have not converged yet.
                        isdone = false;
                    }
                }
            }
            if (!isdone) {
                _map.putAll(imap); // remember these distances for the next comparison
            }
        }
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new ParallelDijkstra(), args));
    }
}
The code is quoted from this blog post, which also has instructions for running it in the Eclipse IDE with the Hadoop plugin. In my case, I will run it on a tiny Hadoop cluster.
The input data (one line per node: node id, current shortest distance, and a colon-separated adjacency list; 10000 stands in for infinity, and node 1 is the source with distance 0):
1 0 2:3:
2 10000 1:4:5:
3 10000 1:
4 10000 2:5:
5 10000 2:4:
Output:
[root@n1 hadoop-examples]# hadoop jar hadoop-algorithms.jar graph.ParallelDijkstra inputgraph output
13/07/13 20:07:51 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/13 20:08:03 INFO input.FileInputFormat: Total input paths to process : 1
13/07/13 20:09:04 INFO mapred.JobClient: Running job: job_201307131656_0001
13/07/13 20:09:05 INFO mapred.JobClient:  map 0% reduce 0%
13/07/13 20:10:05 INFO mapred.JobClient:  map 100% reduce 0%
13/07/13 20:15:13 INFO mapred.JobClient:  map 100% reduce 100%
13/07/13 20:15:16 INFO mapred.JobClient: Job complete: job_201307131656_0001
13/07/13 20:15:16 INFO mapred.JobClient: Counters: 32
13/07/13 20:15:16 INFO mapred.JobClient:   File System Counters
13/07/13 20:15:16 INFO mapred.JobClient:     FILE: Number of bytes read=183
13/07/13 20:15:16 INFO mapred.JobClient:     FILE: Number of bytes written=313855
13/07/13 20:15:16 INFO mapred.JobClient:     FILE: Number of read operations=0
13/07/13 20:15:16 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/07/13 20:15:16 INFO mapred.JobClient:     FILE: Number of write operations=0
13/07/13 20:15:16 INFO mapred.JobClient:     HDFS: Number of bytes read=173
13/07/13 20:15:16 INFO mapred.JobClient:     HDFS: Number of bytes written=53
13/07/13 20:15:16 INFO mapred.JobClient:     HDFS: Number of read operations=2
13/07/13 20:15:16 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/07/13 20:15:16 INFO mapred.JobClient:     HDFS: Number of write operations=1
13/07/13 20:15:16 INFO mapred.JobClient:   Job Counters
13/07/13 20:15:16 INFO mapred.JobClient:     Launched map tasks=1
13/07/13 20:15:16 INFO mapred.JobClient:     Launched reduce tasks=1
13/07/13 20:15:16 INFO mapred.JobClient:     Data-local map tasks=1
13/07/13 20:15:16 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=38337
13/07/13 20:15:16 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=159310
13/07/13 20:15:16 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/07/13 20:15:16 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/13 20:15:16 INFO mapred.JobClient:   Map-Reduce Framework
13/07/13 20:15:16 INFO mapred.JobClient:     Map input records=5
13/07/13 20:15:16 INFO mapred.JobClient:     Map output records=20
13/07/13 20:15:16 INFO mapred.JobClient:     Map output bytes=383
13/07/13 20:15:16 INFO mapred.JobClient:     Input split bytes=112
13/07/13 20:15:16 INFO mapred.JobClient:     Combine input records=0
13/07/13 20:15:16 INFO mapred.JobClient:     Combine output records=0
13/07/13 20:15:16 INFO mapred.JobClient:     Reduce input groups=5
13/07/13 20:15:16 INFO mapred.JobClient:     Reduce shuffle bytes=179
13/07/13 20:15:16 INFO mapred.JobClient:     Reduce input records=20
13/07/13 20:15:16 INFO mapred.JobClient:     Reduce output records=5
13/07/13 20:15:16 INFO mapred.JobClient:     Spilled Records=40
13/07/13 20:15:16 INFO mapred.JobClient:     CPU time spent (ms)=3970
13/07/13 20:15:16 INFO mapred.JobClient:     Physical memory (bytes) snapshot=240209920
13/07/13 20:15:16 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2276737024
13/07/13 20:15:16 INFO mapred.JobClient:     Total committed heap usage (bytes)=101519360
13/07/13 20:15:17 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/13 20:15:18 INFO input.FileInputFormat: Total input paths to process : 1
13/07/13 20:15:18 INFO mapred.JobClient: Running job: job_201307131656_0002
13/07/13 20:15:19 INFO mapred.JobClient:  map 0% reduce 0%
13/07/13 20:15:34 INFO mapred.JobClient:  map 100% reduce 0%
13/07/13 20:15:41 INFO mapred.JobClient:  map 100% reduce 100%
13/07/13 20:15:44 INFO mapred.JobClient: Job complete: job_201307131656_0002
13/07/13 20:15:44 INFO mapred.JobClient: Counters: 32
13/07/13 20:15:44 INFO mapred.JobClient:   File System Counters
13/07/13 20:15:44 INFO mapred.JobClient:     FILE: Number of bytes read=190
13/07/13 20:15:44 INFO mapred.JobClient:     FILE: Number of bytes written=312999
13/07/13 20:15:44 INFO mapred.JobClient:     FILE: Number of read operations=0
13/07/13 20:15:44 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/07/13 20:15:44 INFO mapred.JobClient:     FILE: Number of write operations=0
13/07/13 20:15:44 INFO mapred.JobClient:     HDFS: Number of bytes read=188
13/07/13 20:15:44 INFO mapred.JobClient:     HDFS: Number of bytes written=45
13/07/13 20:15:44 INFO mapred.JobClient:     HDFS: Number of read operations=2
13/07/13 20:15:44 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/07/13 20:15:44 INFO mapred.JobClient:     HDFS: Number of write operations=1
13/07/13 20:15:44 INFO mapred.JobClient:   Job Counters
13/07/13 20:15:44 INFO mapred.JobClient:     Launched map tasks=1
13/07/13 20:15:44 INFO mapred.JobClient:     Launched reduce tasks=1
13/07/13 20:15:44 INFO mapred.JobClient:     Data-local map tasks=1
13/07/13 20:15:44 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=16471
13/07/13 20:15:44 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=6476
13/07/13 20:15:44 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/07/13 20:15:44 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/13 20:15:44 INFO mapred.JobClient:   Map-Reduce Framework
13/07/13 20:15:44 INFO mapred.JobClient:     Map input records=5
13/07/13 20:15:44 INFO mapred.JobClient:     Map output records=20
13/07/13 20:15:44 INFO mapred.JobClient:     Map output bytes=359
13/07/13 20:15:44 INFO mapred.JobClient:     Input split bytes=135
13/07/13 20:15:44 INFO mapred.JobClient:     Combine input records=0
13/07/13 20:15:44 INFO mapred.JobClient:     Combine output records=0
13/07/13 20:15:44 INFO mapred.JobClient:     Reduce input groups=5
13/07/13 20:15:44 INFO mapred.JobClient:     Reduce shuffle bytes=186
13/07/13 20:15:44 INFO mapred.JobClient:     Reduce input records=20
13/07/13 20:15:44 INFO mapred.JobClient:     Reduce output records=5
13/07/13 20:15:44 INFO mapred.JobClient:     Spilled Records=40
13/07/13 20:15:44 INFO mapred.JobClient:     CPU time spent (ms)=2290
13/07/13 20:15:44 INFO mapred.JobClient:     Physical memory (bytes) snapshot=250818560
13/07/13 20:15:44 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1309839360
13/07/13 20:15:44 INFO mapred.JobClient:     Total committed heap usage (bytes)=81399808
13/07/13 20:15:45 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/13 20:15:45 INFO input.FileInputFormat: Total input paths to process : 1
13/07/13 20:15:46 INFO mapred.JobClient: Running job: job_201307131656_0003
13/07/13 20:15:47 INFO mapred.JobClient:  map 0% reduce 0%
13/07/13 20:15:56 INFO mapred.JobClient:  map 100% reduce 0%
13/07/13 20:16:02 INFO mapred.JobClient:  map 100% reduce 100%
13/07/13 20:16:04 INFO mapred.JobClient: Job complete: job_201307131656_0003
13/07/13 20:16:04 INFO mapred.JobClient: Counters: 32
13/07/13 20:16:04 INFO mapred.JobClient:   File System Counters
13/07/13 20:16:04 INFO mapred.JobClient:     FILE: Number of bytes read=172
13/07/13 20:16:04 INFO mapred.JobClient:     FILE: Number of bytes written=312907
13/07/13 20:16:04 INFO mapred.JobClient:     FILE: Number of read operations=0
13/07/13 20:16:04 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/07/13 20:16:04 INFO mapred.JobClient:     FILE: Number of write operations=0
13/07/13 20:16:04 INFO mapred.JobClient:     HDFS: Number of bytes read=180
13/07/13 20:16:04 INFO mapred.JobClient:     HDFS: Number of bytes written=45
13/07/13 20:16:04 INFO mapred.JobClient:     HDFS: Number of read operations=2
13/07/13 20:16:04 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/07/13 20:16:04 INFO mapred.JobClient:     HDFS: Number of write operations=1
13/07/13 20:16:04 INFO mapred.JobClient:   Job Counters
13/07/13 20:16:04 INFO mapred.JobClient:     Launched map tasks=1
13/07/13 20:16:04 INFO mapred.JobClient:     Launched reduce tasks=1
13/07/13 20:16:04 INFO mapred.JobClient:     Data-local map tasks=1
13/07/13 20:16:04 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=8584
13/07/13 20:16:04 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=4759
13/07/13 20:16:04 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/07/13 20:16:04 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/13 20:16:04 INFO mapred.JobClient:   Map-Reduce Framework
13/07/13 20:16:04 INFO mapred.JobClient:     Map input records=5
13/07/13 20:16:04 INFO mapred.JobClient:     Map output records=20
13/07/13 20:16:04 INFO mapred.JobClient:     Map output bytes=335
13/07/13 20:16:04 INFO mapred.JobClient:     Input split bytes=135
13/07/13 20:16:04 INFO mapred.JobClient:     Combine input records=0
13/07/13 20:16:04 INFO mapred.JobClient:     Combine output records=0
13/07/13 20:16:04 INFO mapred.JobClient:     Reduce input groups=5
13/07/13 20:16:04 INFO mapred.JobClient:     Reduce shuffle bytes=168
13/07/13 20:16:04 INFO mapred.JobClient:     Reduce input records=20
13/07/13 20:16:04 INFO mapred.JobClient:     Reduce output records=5
13/07/13 20:16:04 INFO mapred.JobClient:     Spilled Records=40
13/07/13 20:16:04 INFO mapred.JobClient:     CPU time spent (ms)=1880
13/07/13 20:16:04 INFO mapred.JobClient:     Physical memory (bytes) snapshot=286953472
13/07/13 20:16:04 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3241136128
13/07/13 20:16:04 INFO mapred.JobClient:     Total committed heap usage (bytes)=139132928
From the output we can see that the job iterates three times on the given input graph: the first iteration discovers nodes 2 and 3 at distance 1, the second discovers nodes 4 and 5 at distance 2, and the third finds no further changes, so the driver stops.