In my previous post, you learned how to write a basic MapReduce job and run it on Apache Hadoop. In this post, we’ll delve deeper into MapReduce programming and cover some of the framework’s more advanced features. In particular, we’ll explore combiner functions, custom Writable types, counters, and unit testing with MRUnit.
The full code and short instructions for how to compile and run it are available at https://github.com/sryza/traffic-reduce.
In addition, this time we’ll write our MapReduce program using the “new” MapReduce API, a cleaned-up take on what MapReduce programs should look like that was introduced in Hadoop 0.20. Note that the difference between the old and new MapReduce API is entirely separate from the difference between MR1 and MR2: the API changes affect developers writing MapReduce code, while MR2 is an architectural change that, under the hood, extracts scheduling and resource management into YARN, allowing Hadoop to support other parallel execution frameworks and scale to larger clusters. Both MR1 and MR2 support both the old and new APIs.
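To give a feel for the difference, here is a minimal side-by-side sketch of a mapper in each API. The class names and bodies are made up for illustration and are not part of this post’s code:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Old API (org.apache.hadoop.mapred): Mapper is an interface, outputs go through
// an OutputCollector, and a Reporter handles progress and counters.
class OldApiMapper extends MapReduceBase
    implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    output.collect(new Text(value.toString()), new IntWritable(1));
  }
}

// New API (org.apache.hadoop.mapreduce): Mapper is a class, and a single Context
// object takes the place of OutputCollector and Reporter.
class NewApiMapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(new Text(value.toString()), new IntWritable(1));
  }
}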
It’s 11pm on a Thursday, and while Los Angeles is known for its atrocious traffic, you can usually count on being safe from heavy traffic five hours after rush hour. But when you merge onto the I-10 going west, it’s bumper to bumper for miles! What’s going on?
It has to be the Clippers game. With tens of thousands of cars leaving the Staples Center after a home-team basketball game, of course it’s going to be bad. But what about a Lakers game? How bad does it get for those? And what about on holidays, or during political events? It would be great if you could enter a time and determine how far traffic deviated from average for every road in the city.
CalTrans’ Performance and Measurement System (PeMS) provides detailed traffic data from sensors placed on freeways across the state, with updates coming in every 30 seconds. The Los Angeles area alone contains over 4,000 sensor stations. While this is frankly a boatload of data, MapReduce allows you to leverage a cluster to process it in a reasonable amount of time.
In this post, we’ll write a MapReduce program that computes the average traffic at each sensor for each time of the week, and next time, we’ll write a program that uses this information to build an index of the data, so that a program can easily query it to display data from the relevant time.
For our first MapReduce job, we would like to find the average traffic for each sensor station at each time of the week. While the data is available every 30 seconds, we don’t need such fine granularity, so we will use the five-minute summaries that PeMS also publishes. Thus, with 4,370 stations, we will be calculating 4,370 * (60 / 5) * 24 * 7 = 8,809,920 averages.
Each of our input data files contains the measurements for all the stations over a month. Each line contains a station ID, a time, some information about the station, and the measurements taken from that station during that time interval.
Here are some example lines. The fields that are useful to us are the first, which tells the time; the second, which tells the station ID; and the 10th, which gives a normalized vehicle count at that station at that time.
01/01/2012 00:00:00,312346,3,80,W,ML,.491,354,33,1128,.0209,71.5,0,0,0,0,0,0,260,.012,73.9,432,.0273,69.2,436,.0234,72.3,,,,,,,,,,,,,,,
01/01/2012 00:00:00,312347,3,80,W,OR,,236,50,91,,,,,,,,,91,,,,,,,,,,,,,,,,,,,,,,,
01/01/2012 00:00:00,312382,3,99,N,ML,.357,0,0,629,.0155,67,0,0,0,0,0,0,330,.0159,69.9,299,.015,63.9,,,,,,,,,,,,,,,,,,
01/01/2012 00:00:00,312383,3,99,N,OR,,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
01/01/2012 00:00:00,312386,3,99,N,ML,.42,0,0,1336,.0352,67.1,0,0,0,0,0,0,439,.0309,70.4,494,.039,67.4,403,.0357,63.4,,,,,,,,,,,,,,,
The mappers will parse the input lines and emit a key/value pair for each line, where the key is an ID that combines the station ID with the time of the week, and the value is the number of cars that passed over that sensor during that time. Each call to the reduce function receives a station/time of week and the vehicle count values over all the weeks, and computes their average.
An interesting inefficiency to note is that if a single mapper processes measurements over multiple weeks, it will end up with multiple outputs going to the same reducer. As these outputs are going to be averaged by the reducer anyway, we would be able to save I/O by computing partial averages before we have the complete data. To do this, we would need to maintain a count of how many data points are in each partial average, so that we can weight our final average by that count. For example, we could collapse a set of map outputs like 5, 6, 9, 10 into (avg=7.5, count=4). As each map output is written to disk on the mapper, sent over the network, and then possibly written to disk on the reducer, reducing the number of outputs in this way can save a fair amount of I/O.
MapReduce provides us with a way to do exactly this in the form of combiner functions. The framework calls the combiner function in between the map and reduce phases, with the combiner’s outputs sent to the reducer instead of the map outputs it was called on. The framework may choose to call a combiner function zero or more times; generally it is called before map outputs are persisted to disk, on both the map and reduce sides.
Thus, at a high level, our program looks like this:
map(line of input file) {
    parse line of input file
    emit (station id, time of week) -> (vehicle count, 1)
}

combine((station id, time of week), list of corresponding (vehicle count, 1)s) {
    take average of the input values
    emit (station id, time of week) -> (average vehicle count, size(list))
}

reduce((station id, time of week), list of corresponding (vehicle count, size)s) {
    take weighted average of the input values
    emit (station id, time of week) -> average vehicle count
}
MapReduce key and value classes implement Hadoop’s Writable interface so that they can be serialized to and from binary. While Hadoop provides a set of classes that implement Writable to serialize primitive types, the tuples we use in our pseudocode don’t map efficiently onto any of them. For our keys, we can concatenate the station ID with the time of week to represent them as strings and use the Text type. However, as our value tuple is composed of primitive types, a double and an integer, it would be nice not to have to convert them to and from strings each time we want to use them. We can accomplish this by implementing a Writable for them.
public class AverageWritable implements Writable {

  private int numElements;
  private double average;

  public AverageWritable() {}

  public void set(int numElements, double average) {
    this.numElements = numElements;
    this.average = average;
  }

  public int getNumElements() {
    return numElements;
  }

  public double getAverage() {
    return average;
  }

  @Override
  public void readFields(DataInput input) throws IOException {
    numElements = input.readInt();
    average = input.readDouble();
  }

  @Override
  public void write(DataOutput output) throws IOException {
    output.writeInt(numElements);
    output.writeDouble(average);
  }

  [toString(), equals(), and hashCode() shown in the github repo]
}
We deploy our Writable by including it in our job jar. To instantiate our Writable, the framework will call its no-argument constructor and then fill it in by calling its readFields method. Note that if we wanted to use a custom class as a key, it would need to implement WritableComparable so that it could be sorted.
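We don’t need a custom key type for this job, since our keys are Text, but for illustration, here is a minimal sketch of what a WritableComparable key might look like (StationTimeKey is a hypothetical name, not part of the traffic-reduce code):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class StationTimeKey implements WritableComparable<StationTimeKey> {

  private int stationId;
  private long timeOfWeek;

  public StationTimeKey() {}

  public void set(int stationId, long timeOfWeek) {
    this.stationId = stationId;
    this.timeOfWeek = timeOfWeek;
  }

  @Override
  public void readFields(DataInput input) throws IOException {
    stationId = input.readInt();
    timeOfWeek = input.readLong();
  }

  @Override
  public void write(DataOutput output) throws IOException {
    output.writeInt(stationId);
    output.writeLong(timeOfWeek);
  }

  // Defines the sort order used during the shuffle: by station, then by time of week.
  @Override
  public int compareTo(StationTimeKey other) {
    int cmp = Integer.compare(stationId, other.stationId);
    return cmp != 0 ? cmp : Long.compare(timeOfWeek, other.timeOfWeek);
  }
}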
With our custom data type in hand, we are at last ready to write our MapReduce program. Here is what our mapper looks like:
public class AveragerMapper extends Mapper<LongWritable, Text, Text, AverageWritable> {

  private AverageWritable outAverage = new AverageWritable();
  private Text id = new Text();

  @Override
  public void map(LongWritable key, Text line, Context context)
      throws InterruptedException, IOException {
    String[] tokens = line.toString().split(",");
    if (tokens.length < 10) {
      return;
    }
    String dateTime = tokens[0];
    String stationId = tokens[1];
    String trafficCount = tokens[9];

    if (trafficCount.length() > 0) {
      id.set(stationId + "_" + TimeUtil.toTimeOfWeek(dateTime));
      outAverage.set(1, Integer.parseInt(trafficCount));
      context.write(id, outAverage);
    }
  }
}
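The mapper relies on a TimeUtil.toTimeOfWeek() helper, whose actual implementation is in the github repo. As a rough sketch of what such a helper might look like, assuming it returns the number of seconds since the start of the week:

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;

public class TimeUtil {

  // Returns the number of seconds elapsed since the start of the week (Sunday 00:00:00)
  // for a timestamp formatted like "01/01/2012 00:00:00".
  public static long toTimeOfWeek(String dateTime) {
    try {
      SimpleDateFormat format = new SimpleDateFormat("MM/dd/yyyy HH:mm:ss");
      Calendar cal = Calendar.getInstance();
      cal.setTime(format.parse(dateTime));
      long daysIntoWeek = cal.get(Calendar.DAY_OF_WEEK) - Calendar.SUNDAY;
      return daysIntoWeek * 24 * 60 * 60
          + cal.get(Calendar.HOUR_OF_DAY) * 60 * 60
          + cal.get(Calendar.MINUTE) * 60
          + cal.get(Calendar.SECOND);
    } catch (ParseException e) {
      throw new IllegalArgumentException("Malformed timestamp: " + dateTime, e);
    }
  }
}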
You may notice that this mapper looks a little different from the mapper used in the last post. That’s because, as mentioned above, we’re using the new MapReduce API here; it’s a little cleaner, but Hadoop will support both APIs far into the future.
An astute observer will notice that our combiner and reducer are doing exactly the same thing – i.e. outputting a weighted average of the inputs. Thus, we can write the following reducer function, and pass it as a combiner as well:
public class AveragerReducer extends Reducer<Text, AverageWritable, Text, AverageWritable> {

  private AverageWritable outAverage = new AverageWritable();

  @Override
  public void reduce(Text key, Iterable<AverageWritable> averages, Context context)
      throws InterruptedException, IOException {
    double sum = 0.0;
    int numElements = 0;
    for (AverageWritable partialAverage : averages) {
      // weight partial average by number of elements included in it
      sum += partialAverage.getAverage() * partialAverage.getNumElements();
      numElements += partialAverage.getNumElements();
    }
    double average = sum / numElements;
    outAverage.set(numElements, average);
    context.write(key, outAverage);
  }
}
Using the new API, our driver class looks like this:
public class AveragerRunner {
  public static void main(String[] args) throws IOException, ClassNotFoundException,
      InterruptedException {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    Job job = new Job(conf);
    job.setJarByClass(AveragerRunner.class);
    job.setMapperClass(AveragerMapper.class);
    job.setReducerClass(AveragerReducer.class);
    job.setCombinerClass(AveragerReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(AverageWritable.class);
    job.setInputFormatClass(TextInputFormat.class);

    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

    job.waitForCompletion(true);
  }
}
Note that unlike last time, when we used KeyValueTextInputFormat, we use TextInputFormat for our input data. While KeyValueTextInputFormat splits up the line into a key and a value, TextInputFormat passes the entire line as the value, and uses its position in the file (as an offset from the first byte) as the key. The position is not used, which is fairly typical when using TextInputFormat.
In the real world, data is messy. Traffic sensor data, for example, contains records with missing fields all the time, as sensors in the wild are bound to malfunction at times. When running our MapReduce job, it is often useful to count up and collect metrics on the side about what our job is doing. For a program on a single computer, we might just do this by adding a count variable, incrementing it whenever our event of interest occurs, and printing it out at the end. But when our code is running in a distributed fashion, aggregating these counts gets hairy very quickly.
Luckily, Hadoop provides a mechanism to handle this for us, using Counters. MapReduce contains a number of built-in counters that you have probably seen in the output on completion of a MapReduce job.
Map-Reduce Framework
        Map input records=10
        Map output records=7
        Map output bytes=175
        [and many more]
This information is also available in the web UI, both per-job and per-task. To use our own counter, we can simply add a line like
context.getCounter("Averager Counters", "Missing vehicle flows").increment(1);
to the point in the code where the mapper comes across a record with a missing count. Then, when our job completes, we will see our count along with the built-in counters:
Averager Counters
Missing vehicle flows=2329
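Counter values can also be read programmatically from the driver once the job finishes. A minimal sketch, reusing the Job object from AveragerRunner above:

// In AveragerRunner's main(), after the job has finished:
job.waitForCompletion(true);
long missing = job.getCounters()
    .findCounter("Averager Counters", "Missing vehicle flows").getValue();
System.out.println("Records with missing vehicle flows: " + missing);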
It’s often convenient to wrap your entire map or reduce function in a try/catch and increment a counter in the catch block, using the exception class’s name as the counter’s name, to get a profile of what kinds of errors come up.
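Here is a minimal sketch of that pattern, assuming the real per-record logic has been pulled out into a hypothetical doMap() helper:

@Override
public void map(LongWritable key, Text line, Context context)
    throws InterruptedException, IOException {
  try {
    doMap(key, line, context); // the actual parsing and emitting logic
  } catch (Exception e) {
    // Group error counts by exception class name, e.g. "NumberFormatException=12"
    context.getCounter("Averager Errors", e.getClass().getSimpleName()).increment(1);
  }
}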
Running a MapReduce program on a cluster, if we even have access to one, can take a while. However, if we just want to make sure that our basic logic works, we have no need for all that machinery. Enter Apache MRUnit, an Apache project that makes writing JUnit tests for MapReduce programs about as easy as it can be. With MRUnit, we can test our mappers and reducers both separately and as a full flow.
To include it in our project, we add the following to the dependencies section of Maven’s pom.xml:
<dependency>
  <groupId>org.apache.mrunit</groupId>
  <artifactId>mrunit</artifactId>
  <version>0.9.0-incubating</version>
  <classifier>hadoop2</classifier>
</dependency>
The following contains a test for both the mapper and reducer, verifying that with sample inputs, they produce the expected outputs:
public class TestTrafficAverager {

  private MapDriver<LongWritable, Text, Text, AverageWritable> mapDriver;
  private ReduceDriver<Text, AverageWritable, Text, AverageWritable> reduceDriver;

  @Before
  public void setup() {
    AveragerMapper mapper = new AveragerMapper();
    AveragerReducer reducer = new AveragerReducer();
    mapDriver = MapDriver.newMapDriver(mapper);
    reduceDriver = ReduceDriver.newReduceDriver(reducer);
  }

  @Test
  public void testMapper() throws IOException {
    String line = "01/01/2012 00:00:00,311831,3,5,S,OR,,118,0,200,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,";
    mapDriver.withInput(new LongWritable(0), new Text(line));
    Text outKey = new Text("311831_" + TimeUtil.toTimeOfWeek("01/01/2012 00:00:00"));
    AverageWritable outVal = new AverageWritable();
    outVal.set(1, 200.0);
    mapDriver.withOutput(outKey, outVal);
    mapDriver.runTest();
  }

  @Test
  public void testReducer() {
    AverageWritable avg1 = new AverageWritable();
    avg1.set(1, 2.0);
    AverageWritable avg2 = new AverageWritable();
    avg2.set(3, 1.0);
    AverageWritable outAvg = new AverageWritable();
    outAvg.set(4, 1.25);
    Text key = new Text("331831_86400");
    reduceDriver.withInput(key, Arrays.asList(avg1, avg2));
    reduceDriver.withOutput(key, outAvg);
    reduceDriver.runTest();
  }
}
We can run our tests with “mvn test” in the project directory. If there are failures, information on why they failed is available in target/surefire-reports in the project directory.
A more in-depth MRUnit tutorial is available here: https://cwiki.apache.org/confluence/display/MRUNIT/MRUnit+Tutorial.
Like last time, we can build the jar with
mvn install
The full data is available at http://pems.dot.ca.gov/?dnode=Clearinghouse, but like last time, the github repo contains some sample data to run our program on. To place it on the cluster, we can run:
hadoop fs -mkdir trafficcounts
hadoop fs -put samples/input.txt trafficcounts
To run our program, we can use
hadoop jar target/trafficinduce-1.0-SNAPSHOT.jar AveragerRunner trafficcounts/input.txt trafficcounts/output
We can inspect the output with:
hadoop fs -cat trafficcounts/output/part-r-00000
Thanks for reading! Next time, we’ll delve into some more advanced MapReduce features, like the distributed cache, custom partitioners, and custom input and output formats.
Sandy Ryza is a Software Engineer on the Platform team.