improve hadoop mapreduce performance

http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
http://hadoop.group.iteye.com/group/topic/18294

1. Set a combiner:
Users can optionally specify a combiner, via JobConf.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps cut down the amount of data transferred from the Mapper to the Reducer. (In other words, the combiner pre-aggregates each mapper's output on the local machine before it is sent to the reducers, avoiding large-scale data movement: process the local data first, then send it to the reducer.)
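As a sketch, assuming a WordCount-style job whose reduce function is associative and commutative, so the reducer class can double as the combiner (this is a config fragment, not a complete job; the mapper/reducer class names are hypothetical):

```java
// Hypothetical job setup using the old org.apache.hadoop.mapred API.
// WordCountMapper / WordCountReducer are placeholder class names.
JobConf conf = new JobConf(WordCount.class);
conf.setMapperClass(WordCountMapper.class);
conf.setReducerClass(WordCountReducer.class);
// The combiner runs on each mapper's local output before the shuffle,
// so far less intermediate data crosses the network to the reducers.
conf.setCombinerClass(WordCountReducer.class);
```

Note that a combiner is only safe when applying the reduce logic twice (once locally, once globally) does not change the result.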

2. How many maps:
Ideally, number of maps = sizeOf(inputData) / blockSize. (To verify: is this the ideal number or the maximum?)
Task setup takes a while, so it is best if each map takes at least a minute to execute. The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set as high as 300 maps for very CPU-light map tasks.

Thus, if you expect 10TB of input data and have a blocksize of 128MB, you'll end up with 82,000 maps, unless setNumMapTasks(int) is used to set it even higher. Note that setNumMapTasks() only provides a hint to the MapReduce framework; it is not necessarily the actual number of map tasks at runtime.
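The arithmetic above can be checked with a few lines of plain Java (no Hadoop needed):

```java
// Back-of-the-envelope check of the rule of thumb:
// number of maps ≈ input size / HDFS block size.
public class MapCount {
    public static long numMaps(long inputBytes, long blockBytes) {
        // Ceiling division: a final partial block still gets its own map.
        return (inputBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long tenTB = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB in bytes
        long block = 128L * 1024 * 1024;              // 128 MB block size
        System.out.println(numMaps(tenTB, block));    // 81920, i.e. ~82,000 maps
    }
}
```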

3.How Many Reduces:

The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).

With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing.

Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.
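The two heuristics can be sketched as plain arithmetic; the node count and slots-per-node below are made-up example values, not recommendations:

```java
// Reducer-count heuristics from the tutorial: 0.95 or 1.75 times
// (number of nodes * mapred.tasktracker.reduce.tasks.maximum).
public class ReduceCount {
    public static int numReduces(double factor, int nodes, int reduceSlotsPerNode) {
        return (int) (factor * nodes * reduceSlotsPerNode);
    }

    public static void main(String[] args) {
        int nodes = 10, slots = 2; // hypothetical cluster: 10 nodes, 2 reduce slots each
        // 0.95: a single wave; every reduce starts fetching as soon as maps finish.
        System.out.println(numReduces(0.95, nodes, slots)); // 19
        // 1.75: two waves; fast nodes take on a second round, balancing the load better.
        System.out.println(numReduces(1.75, nodes, slots)); // 35
    }
}
```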

4. Reducer NONE:

It is legal to set the number of reduce-tasks to zero if no reduction is desired; the reduce phase is simply skipped.
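A minimal sketch of a map-only job with the old mapred API (a config fragment, assuming the rest of the job is set up elsewhere):

```java
JobConf conf = new JobConf();
// Zero reduces: the shuffle/sort/reduce phases are skipped and the
// map outputs are written directly to the job's output path.
conf.setNumReduceTasks(0);
```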

5. mapred.tasktracker.map.tasks.maximum (default value = 2):
The maximum number of map tasks that will be run simultaneously by a task tracker, i.e. the number of map tasks running concurrently on one machine. (Each map task runs the map function over one split of the input data?)
Pro Hadoop (page 79) says this value is best set to the effective number of CPUs on the node. (Meaning the CPU count? Set it to 2 for a dual-core machine??)

6. mapred.map.tasks (default value = 2): the number of map tasks for the whole job. If you set it, then per point 2 above, use roughly numMachines * (10-100). (To verify??)

7. dfs.block.size: TODO
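Points 5-7 are cluster configuration properties; a sketch of how they might look in the config files (the values are illustrative assumptions, not recommendations):

```xml
<!-- mapred-site.xml -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value> <!-- e.g. a node with 4 effective CPUs -->
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>100</value> <!-- e.g. 10 nodes * 10 maps per node; only a hint -->
</property>

<!-- hdfs-site.xml -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128 MB, in bytes -->
</property>
```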
