7 Tips for Improving MapReduce Performance

One service that Cloudera provides for our customers is help with tuning and optimizing MapReduce jobs. Since MapReduce and HDFS are complex distributed systems that run arbitrary user code, there’s no hard and fast set of rules to achieve optimal performance; instead, I tend to think of tuning a cluster or job much like a doctor would treat a sick human being. There are a number of key symptoms to look for, and each set of symptoms leads to a different diagnosis and course of treatment.

In medicine, there’s no automatic process that can replace the experience of a well-seasoned doctor. The same is true with complex distributed systems: experienced users and operators often develop a “sixth sense” for common issues. Having worked with Cloudera customers in a number of different industries, each with a different workload, dataset, and cluster hardware, I’ve accumulated a bit of this experience, and would like to share some with you today.

In this blog post, I’ll highlight a few tips for improving MapReduce performance. The first few tips are cluster-wide, and will be useful for operators and developers alike. The latter tips are for developers writing custom MapReduce jobs in Java. For each tip, I’ll also note a few of the “symptoms” or “diagnostic tests” that indicate a particular remedy might bring you some good improvements.

Please note, also, that these tips contain lots of rules of thumb based on my experience across a variety of situations. They may not apply to your particular workload, dataset, or cluster, and you should always benchmark your jobs before and after any changes. For these tips, I’ll show some comparative numbers for a 40GB wordcount job on a small 4-node cluster. Tuned optimally, each of the map tasks in this job runs in about 33 seconds, and the total job runtime is about 8m30s.

Tip 1) Configure your cluster correctly

Diagnostics/symptoms:

·       top shows slave nodes fairly idle even when all map and reduce task slots are filled up running jobs.

·       top shows kernel processes like RAID (mdX_raid*) or pdflush taking most of the CPU time.

·       Linux load averages are often seen more than twice the number of CPUs on the system.

·       Linux load averages stay less than half the number of CPUs on the system, even when running jobs.

·       Any swap usage on nodes beyond a few MB.

The first step to optimizing your MapReduce performance is to make sure your cluster configuration has been tuned. For starters, check out our earlier blog post on configuration parameters. In addition to those knobs in the Hadoop configuration, here are a few more checklist items you should go through before beginning to tune the performance of an individual job:

·       Make sure the mounts you’re using for DFS and MapReduce storage have been mounted with the noatime option. This disables access time tracking and can improve IO performance.

·       Avoid RAID and LVM on TaskTracker and DataNode machines – it generally reduces performance.

·       Make sure you’ve configured mapred.local.dir and dfs.data.dir to point to one directory on each of your disks to ensure that all of your IO capacity is used. Run iostat -dx 5 from the sysstat package while the cluster is loaded to make sure each disk shows utilization.

·       Ensure that you have SMART monitoring for the health status of your disk drives. MapReduce jobs are fault tolerant, but dying disks can cause performance to degrade as tasks must be re-executed. If you find that a particular TaskTracker becomes blacklisted on many job invocations, it may have a failing drive.

·       Monitor and graph swap usage and network usage with software like Ganglia. Monitoring Hadoop metrics in Ganglia is also a good idea. If you see swap being used, reduce the amount of RAM allocated to each task in mapred.child.java.opts.
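
As a rough sanity check for the swap advice, the heap a node commits to tasks when every slot is busy is simply the slot count times the per-task heap. The sketch below is my own arithmetic helper, not a Hadoop API. The inputs mirror the benchmark cluster described in the appendix with a 1GB task heap, and the 4GB reserve for daemons and the OS is an assumed figure for illustration only. Real task processes also use somewhat more memory than their heap setting.

```java
// Hypothetical helper (not part of Hadoop): estimates the heap a node commits
// to task JVMs when every slot is busy.
final class TaskMemoryBudget {
    /** Worst-case task heap in MB: every map and reduce slot running at once. */
    static long committedHeapMb(int mapSlots, int reduceSlots, long heapPerTaskMb) {
        if (mapSlots < 0 || reduceSlots < 0 || heapPerTaskMb < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        return (long) (mapSlots + reduceSlots) * heapPerTaskMb;
    }

    /** True when the committed task heap plus a reserve for daemons and the OS fits in RAM. */
    static boolean fitsInRam(long committedHeapMb, long reserveMb, long ramMb) {
        return committedHeapMb + reserveMb <= ramMb;
    }

    public static void main(String[] args) {
        // 6 map + 6 reduce slots, 1GB heap per task, 24GB of RAM.
        // The 4GB reserve is an assumption made for this example.
        long committed = committedHeapMb(6, 6, 1024);
        System.out.println("Committed task heap (MB): " + committed);
        System.out.println("Fits in RAM: " + fitsInRam(committed, 4096, 24 * 1024));
    }
}
```

If the check fails for your own numbers, lowering the heap in mapred.child.java.opts or reducing the slot count are the two obvious levers.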

Benchmarks:

Unfortunately I was not able to perform benchmarks for this tip, as it would involve re-imaging the cluster. If you have had relevant experience, feel free to leave a note in the Comments section below.

Tip 2) Use LZO Compression

Diagnostics/symptoms:

·       This is almost always a good idea for intermediate data! In the doctor analogy, consider LZO compression your vitamins.

·       Output data size of the MapReduce job is nontrivial.

·       Slave nodes show high iowait utilization in top and iostat when jobs are running.

Almost every Hadoop job that generates a non-negligible amount of map output will benefit from intermediate data compression with LZO. Although LZO adds a little bit of CPU overhead, the reduced amount of disk IO during the shuffle will usually save time overall.

Whenever a job needs to output a significant amount of data, LZO compression can also increase performance on the output side. Since writes are replicated 3x by default, each GB of output data you save will save 3GB of disk writes.

In order to enable LZO compression, check out our recent guest blog from Twitter. Be sure to set mapred.compress.map.output to true.

Benchmarks:

Disabling LZO compression on the wordcount example increased the job runtime only slightly on our cluster. The FILE_BYTES_WRITTEN counter increased from 3.5GB to 9.2GB, showing that the compression yielded a 62% decrease in disk IO. Since this job was not sharing the cluster, and each node has a high ratio of number of disks to number of tasks, IO is not the bottleneck here, and thus the improvement was not substantial. On clusters where disks are pegged due to a lot of concurrent activity, a 60% reduction in IO can yield a substantial improvement in job completion speed.
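
If you want to reproduce this kind of arithmetic from your own counters, here is a minimal sketch. The class and method names are hypothetical helpers of mine, and the inputs are the FILE_BYTES_WRITTEN figures and the default 3x replication quoted in this tip.

```java
// Hypothetical helper (not part of Hadoop) for the IO arithmetic in this tip.
final class IoSavings {
    /** Percentage by which a counter shrank, e.g. bytes written with vs. without compression. */
    static double percentReduction(double before, double after) {
        if (before <= 0) {
            throw new IllegalArgumentException("before must be positive");
        }
        return 100.0 * (before - after) / before;
    }

    /** Disk writes caused by job output once HDFS replication is taken into account. */
    static double replicatedWriteGb(double outputGb, int replication) {
        return outputGb * replication;
    }

    public static void main(String[] args) {
        // 9.2GB written without compression, 3.5GB with it.
        System.out.printf("Reduction in bytes written: %.0f%%%n", percentReduction(9.2, 3.5));
        // Each GB of output costs 3GB of disk writes at the default replication.
        System.out.println("Disk writes per GB of output: " + replicatedWriteGb(1.0, 3) + "GB");
    }
}
```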

Tip 3) Tune the number of map and reduce tasks appropriately

Diagnostics/symptoms:

·       Each map or reduce task finishes in less than 30-40 seconds.

·       A large job does not utilize all available slots in the cluster.

·       After most mappers or reducers are scheduled, one or two remain pending and then run all alone.

Tuning the number of map and reduce tasks for a job is important and easy to overlook. Here are some rules of thumb I use to set these parameters:

·       If each task takes less than 30-40 seconds, reduce the number of tasks. The task setup and scheduling overhead is a few seconds, so if tasks finish very quickly, you’re wasting time while not doing work. JVM reuse can also be enabled to solve this problem.

·       If a job has more than 1TB of input, consider increasing the block size of the input dataset to 256MB or even 512MB so that the number of tasks will be smaller. You can change the block size of existing files with a command like hadoop distcp -Ddfs.block.size=$[256*1024*1024] /path/to/inputdata /path/to/inputdata-with-largeblocks. After this command completes, you can remove the original data.

·       So long as each task runs for at least 30-40 seconds, increase the number of mapper tasks to some multiple of the number of mapper slots in the cluster. If you have 100 map slots in your cluster, try to avoid having a job with 101 mappers – the first 100 will finish at the same time, and then the 101st will have to run alone before the reducers can run. This is more important on small clusters and small jobs.

·       Don’t schedule too many reduce tasks – for most jobs, we recommend a number of reduce tasks equal to or a bit less than the number of reduce slots in the cluster.
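
The rules of thumb above come down to simple ceiling arithmetic, sketched below. These are hypothetical helpers of mine, not Hadoop code. The split count assumes one task per full split and ignores file boundaries, so treat it as an approximation.

```java
// Hypothetical helper (not part of Hadoop): task-count and wave arithmetic.
final class TaskWaves {
    /** Number of scheduling waves needed to run all tasks: ceil(tasks / slots). */
    static int waves(int tasks, int slots) {
        if (tasks < 0 || slots <= 0) {
            throw new IllegalArgumentException("tasks must be >= 0 and slots > 0");
        }
        return (tasks + slots - 1) / slots;
    }

    /** How many tasks run in the final wave; a small number means idle slots. */
    static int tasksInLastWave(int tasks, int slots) {
        if (tasks < 0 || slots <= 0) {
            throw new IllegalArgumentException("tasks must be >= 0 and slots > 0");
        }
        if (tasks == 0) {
            return 0;
        }
        int remainder = tasks % slots;
        return remainder == 0 ? slots : remainder;
    }

    /** Approximate number of map tasks: ceil(inputBytes / splitBytes). */
    static long approxSplits(long inputBytes, long splitBytes) {
        if (inputBytes < 0 || splitBytes <= 0) {
            throw new IllegalArgumentException("inputBytes must be >= 0 and splitBytes > 0");
        }
        return (inputBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        // 101 mappers on 100 slots: a second wave with a single task in it.
        System.out.println("Waves: " + waves(101, 100)
                + ", tasks in last wave: " + tasksInLastWave(101, 100));

        // 1TB of input with 256MB splits.
        long oneTb = 1L << 40;
        System.out.println("Approximate map tasks: " + approxSplits(oneTb, 256L << 20));
    }
}
```

A last wave that holds only a handful of tasks is the situation described above, where most of the cluster sits idle waiting for stragglers.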

Benchmarks:

To make the wordcount job run with too many tasks, I ran it with the argument -Dmapred.max.split.size=$[16*1024*1024]. This yielded 2640 tasks instead of the 360 that the framework chose by default. When running with this setting, each task took about 9 seconds, and watching the Cluster Summary view on the JobTracker showed the number of running maps fluctuating between 0 and 24 continuously throughout the job. The entire job finished in 17m52s, more than twice as slow as the original job.

Tip 4) Write a Combiner

Diagnostics/symptoms:

·       A job performs aggregation of some sort, and the Reduce input groups counter is significantly smaller than the Reduce input records counter.

·       The job performs a large shuffle (e.g. map output bytes is multiple GB per node).

·       The number of spilled records is many times larger than the number of map output records as seen in the Job counters.

If your algorithm involves computing aggregates of any sort, chances are you can use a Combiner in order to perform some kind of initial aggregation before the data hits the reducer. The MapReduce framework runs combiners intelligently in order to reduce the amount of data that has to be written to disk and transferred over the network in between the Map and Reduce stages of computation.
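
To make the effect concrete, here is a plain-Java simulation of what a word-count combiner does to one mapper’s output. It deliberately does not use the Hadoop API, and the class and method names are illustrative only. In a real job you would register your combiner class with setCombinerClass.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of a word-count combiner; not Hadoop code.
final class CombinerSketch {
    /** Sums the implicit count of 1 for each word, as a word-count combiner would. */
    static Map<String, Integer> combine(List<String> mapOutputWords) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String word : mapOutputWords) {
            combined.merge(word, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Without a combiner, each word becomes its own (word, 1) record.
        List<String> mapOutput = Arrays.asList("to", "be", "or", "not", "to", "be");
        Map<String, Integer> combined = combine(mapOutput);

        System.out.println("Records without combiner: " + mapOutput.size());
        System.out.println("Records with combiner:    " + combined.size());
        System.out.println("Combined output: " + combined);
    }
}
```

The more often keys repeat within a single mapper’s output, the larger the saving in spilled and shuffled records.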

Benchmarks:

I modified the word count example to remove the call to setCombinerClass, and otherwise left it the same. This changed the average map task run time from 33s to 48s, and increased the amount of shuffled data from 1GB to 1.4GB. The total job runtime increased from 8m30s to 15m42s, nearly a factor of two. Note that this benchmark was run with map output compression enabled – without map output compression, the effect of the combiner would have been even more important.

Tip 5) Use the most appropriate and compact Writable type for your data

Symptoms/diagnostics:

·       Text objects are used for working with non-textual or complex data.

·       IntWritable or LongWritable objects are used when most output values tend to be significantly smaller than the maximum value.

When users are new to programming in MapReduce, or are switching from Hadoop Streaming to Java MapReduce, they often use the Text writable type unnecessarily. Although Text can be convenient, converting numeric data to and from UTF-8 strings is inefficient and can actually make up a significant portion of CPU time. Whenever dealing with non-textual data, consider using the binary Writables like IntWritable, FloatWritable, etc.

In addition to avoiding the text parsing overhead, the binary Writable types will take up less space as intermediate data. Since disk IO and network transfer will become a bottleneck in large jobs, reducing the sheer number of bytes taken up by the intermediate data can provide a substantial performance gain. When dealing with integers, it can also sometimes be faster to use VIntWritable or VLongWritable; these implement variable-length integer encoding, which saves space when serializing small integers. For example, the value 4 will be serialized in a single byte, whereas the value 10000 will be serialized in three (a length byte followed by two value bytes). These variable-length numbers can be very effective for data like counts, where you expect that the majority of records will have a small number that fits in one or two bytes.
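
To get a feel for how much space the variable-length types save, the sketch below computes the encoded size of a value. It is a standalone re-implementation of the size rule as I understand it from Hadoop’s zero-compressed encoding in WritableUtils, not a call into Hadoop itself, so check it against your Hadoop version before relying on it.

```java
// Standalone sketch of the variable-length integer size rule; not Hadoop code.
final class VIntSizeSketch {
    /** Bytes needed to encode the value with the variable-length scheme. */
    static int vintSize(long value) {
        // Small values are stored directly in a single byte.
        if (value >= -112 && value <= 127) {
            return 1;
        }
        // Negative values are stored as their one's complement.
        if (value < 0) {
            value = ~value;
        }
        // One length byte plus however many bytes the magnitude needs.
        int dataBits = Long.SIZE - Long.numberOfLeadingZeros(value);
        return (dataBits + 7) / 8 + 1;
    }

    public static void main(String[] args) {
        long[] samples = {4, 127, 128, 10000, 1000000, -5, -1000};
        for (long sample : samples) {
            System.out.println(sample + " -> " + vintSize(sample)
                    + " byte(s), versus 4 for IntWritable and 8 for LongWritable");
        }
    }
}
```

Values between -112 and 127 take a single byte. Anything larger costs one length byte plus the bytes of the magnitude, so very large values can end up bigger than their fixed-width form.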

If the Writable types that ship with Hadoop don’t fit the bill, consider writing your own. It’s pretty simple, and will be significantly faster than parsing text. If you do so, make sure to provide a RawComparator; see the source code for the built-in Writables for an example.

In the same vein, if your MapReduce job is part of a multistage workflow, use a binary format like SequenceFile for the intermediate steps, even if the last stage needs to output text. This will reduce the amount of data that needs to be materialized along the way.

Benchmarks:

For the example word count job, I modified the intermediate count values to be Text type rather than IntWritable. In the reducer, I used Integer.parseInt(value.toString()) when accumulating the sum. The performance of the suboptimal version of the WordCount was about 10% slower than the original. The full job ran in a bit over 9 minutes, and each map task took 36 seconds instead of the original 33. Since integer parsing is itself rather fast, this did not represent a large difference; in the general case, I have seen more efficient Writables make as much as a 2-3x difference in performance.

Tip 6) Reuse Writables

Symptoms/diagnostics:

·       Add -verbose:gc -XX:+PrintGCDetails to mapred.child.java.opts. Then inspect the logs for some tasks. If garbage collection is frequent and represents a lot of time, you may be allocating unnecessary objects.

·       grep for “new Text” or “new IntWritable” in your code base. If you find this in an inner loop, or inside the map or reduce functions, this tip may help.

·       This tip is especially helpful when your tasks are constrained in RAM.

One of the first mistakes that many MapReduce users make is to allocate a new Writable object for every output from a mapper or reducer. For example, one might implement a word-count mapper like this:

public void map(...) {
  ...
  for (String word : words) {
    // Allocates two new objects for every word emitted.
    output.collect(new Text(word), new IntWritable(1));
  }
}

This implementation causes thousands of very short-lived objects to be allocated. While the Java garbage collector does a reasonable job at dealing with this, it is more efficient to write:

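class MyMapper ... {
  // Allocated once per mapper instance and reused for every record.
  Text wordText = new Text();
  IntWritable one = new IntWritable(1);

  public void map(...) {
    ...
    for (String word : words) {
      wordText.set(word);
      // Emit the reused Text object, not the raw String.
      output.collect(wordText, one);
    }
  }
}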

Benchmarks:

When I modified the word count example as described above, I initially found it made no difference in the run time of the job. This is because this cluster’s default settings include a 1GB heap size for each task, so garbage collection never ran. However, running it with each task allocated only 200MB of heap size showed a drastic slowdown in the version that did not reuse Writables: the total job runtime increased from around 8m30s to over 17 minutes. The original version, which does reuse Writables, stayed the same speed even with the smaller heap. Since reusing Writables is an easy fix, I recommend always doing so – it may not bring you a gain for every job, but if you’re low on memory it can make a huge difference.

Tip 7) Use “Poor Man’s Profiling” to see what your tasks are doing

This is a trick I almost always use when first looking at the performance of a MapReduce job. Profiling purists will disagree and say that this won’t work, but you can’t argue with results!

In order to do what I call “poor man’s profiling”, ssh into one of your slave nodes while some tasks from a slow job are running. Then simply run sudo killall -QUIT java 5-10 times in a row, each a few seconds apart. Don’t worry: this doesn’t cause anything to quit, despite the name. Then, use the JobTracker interface to navigate to the stdout logs for one of the tasks that’s running on this node, or look in /var/log/hadoop/userlogs/ for a stdout file of a task that is currently running. You’ll see stack trace output from each time you sent the SIGQUIT signal to the JVM.

It takes a bit of experience to parse this output, but here’s the method I usually use:

1. For each thread in the trace, quickly scan for the name of your Java package (e.g. com.mycompany.mrjobs). If you don’t see any lines in the trace that are part of your code, skip over this thread.

2. When you find a stack trace that has some of your code in it, make a quick mental note of what it’s doing. For example, “something NumberFormat-related” is all you need at this point. Don’t worry about specific line numbers yet.

3. Go down to the next dump you took a few seconds later in the logs. Perform the same process here and make a note.

4. After you’ve gone through 4-5 of the traces, you might notice that the same vague thing shows up in every one of them. If that thing is something that you expect to be fast, you probably found your culprit. If you take 10 traces, and 5 of them show NumberFormat in the dump, it means that you’re spending somewhere around 50% of your CPU time formatting numbers, and you might consider doing something differently.
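
If you would rather not tally by eye, the counting in the steps above can be sketched in a few lines. This is a hypothetical helper, not a Hadoop tool. It assumes you have pasted each dump into its own string and that threads within a dump are separated by blank lines. The sample dumps in main are hand-written for illustration and are not real task output.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for tallying hand-collected thread dumps.
final class DumpTally {
    /**
     * Counts the dumps in which at least one thread has a frame from
     * packagePrefix and also mentions keyword somewhere in its stack.
     */
    static int dumpsMentioning(List<String> dumps, String packagePrefix, String keyword) {
        int hits = 0;
        for (String dump : dumps) {
            // Threads in a dump are assumed to be separated by blank lines.
            for (String thread : dump.split("\\n\\s*\\n")) {
                if (thread.contains(packagePrefix) && thread.contains(keyword)) {
                    hits++;
                    break; // count each dump at most once
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up dumps: one where our code is formatting numbers, one where it is not.
        String formatting =
            "\"main\" prio=10 runnable\n"
            + "\tat java.text.NumberFormat.format(NumberFormat.java)\n"
            + "\tat com.mycompany.mrjobs.MyMapper.map(MyMapper.java)\n"
            + "\n"
            + "\"IPC Client\" daemon waiting\n"
            + "\tat java.lang.Object.wait(Native Method)\n";
        String other =
            "\"main\" prio=10 runnable\n"
            + "\tat com.mycompany.mrjobs.MyMapper.map(MyMapper.java)\n";
        List<String> dumps = Arrays.asList(formatting, other, formatting, other);

        int hits = dumpsMentioning(dumps, "com.mycompany.mrjobs", "NumberFormat");
        System.out.println(hits + " of " + dumps.size() + " dumps mention NumberFormat");
    }
}
```

The fraction of dumps that mention the keyword is the same rough estimate of CPU share described in step 4.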

Sure, this method isn’t as scientific as using a real profiler on your tasks, but I’ve found that it’s a surefire way to notice any glaring CPU bottlenecks very quickly and with no setup involved. It’s also a technique that you’ll get better at with practice as you learn what a normal dump looks like and when something jumps out as odd.

Here are a few performance mistakes I often find through this technique:

·       NumberFormat is slow – avoid it where possible.

·       String.split, as well as encoding or decoding UTF-8, are slower than you think – see the tips above about using the appropriate Writables.

·       Concatenating Strings rather than using StringBuffer.append
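
As an illustration of the last point, here are two ways to build the same tab-separated line. This is a sketch of mine rather than code from the benchmarks. It uses StringBuilder, the unsynchronized sibling of StringBuffer, which is my substitution. The field values are made up.

```java
// Illustrative only: both methods build the same tab-separated line.
final class JoinSketch {
    /** Repeated concatenation: each += copies everything built so far into a new String. */
    static String joinWithConcat(String[] fields) {
        String line = "";
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                line += "\t";
            }
            line += fields[i];
        }
        return line;
    }

    /** One growable buffer, appended to in place and converted to a String once. */
    static String joinWithBuilder(String[] fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                line.append('\t');
            }
            line.append(fields[i]);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        String[] fields = {"user42", "2009-12-17", "click"};
        System.out.println(joinWithConcat(fields).equals(joinWithBuilder(fields)));
    }
}
```

Inside a map or reduce function that runs once per record, the intermediate copies made by the first version add up quickly.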


These are just a few tips for improving MapReduce performance. If you have your own tips and tricks for profiling and optimizing MapReduce jobs, please leave a comment below! If you’d like to look at the code I used for running the benchmarks, I’ve put it online at http://github.com/toddlipcon/performance-blog-code/


Appendix: Benchmark Cluster Setup
Each node in the cluster is a dual quad-core Nehalem box with hyperthreading enabled, 24GB of RAM and 12×1TB disks. The TaskTrackers are configured with 6 map and 6 reduce slots, slightly lower than we normally recommend since we sometimes run multiple clusters at once on these boxes for testing.

 
