Optimizing Joins running on HDInsight Hive on Azure at GFS

Introduction

To analyze hardware utilization within their data centers, Microsoft’s Online Services Division – Global Foundation Services (GFS) is working with Hadoop / Hive via HDInsight on Azure.  A common scenario is to perform joins between the various tables of data.  This quick blog post provides a little context on how we managed to take a query from >2h to <10min and the thinking behind it.

Background

The join is a three-column join between a large fact table (~1.2B rows/day) and a smaller dimension table (~300K rows).  The size of a single day of compressed source files is ~4.2GB; decompressed, it is ~120GB.  When performing a regular join (in Hive parlance, a “common join”), the job generated ~230GB of intermediate files.  On a 4-node HDInsight on Azure cluster, using a 1/6th sample of the large table for a single day of data, the query took 2h 24min.

SELECT
colA, colB, … , colN
FROM FactTable f
LEFT OUTER JOIN DimensionTable d
ON d.colC = f.colC
AND d.colD = f.colD
AND d.colE = f.colE
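
For context, both tables were defined as external Hive tables over the compressed source files in Azure blob storage.  A minimal sketch of what such a definition might look like is below; the column types, delimiter, and asv:// location are illustrative rather than the actual GFS schema.  Hive reads the gzip-compressed text files transparently, so the DDL itself needs nothing special for the compression.

CREATE EXTERNAL TABLE FactTable (
  colA STRING,
  colB STRING,
  colC STRING,
  colD STRING,
  colE STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'asv://mycontainer@mystorageaccount.blob.core.windows.net/factdata/';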

Join Categories

Our options to improve join performance were noted in Join Strategies in Hive:

- Common Join: the standard Hive join. Result: 2h 24min on 1/6 of the full dataset.
- Map Join: designed for joins between one large table and one small table, where the small table can be loaded into memory. Map joins should work perfectly for this scenario.
- Bucket Map Join: great for joining large tables together, where buckets are created for the tables so the joins occur bucket to bucket. Not optimal for this situation, since we had created external Hive tables against the data (we wanted to avoid the additional step / processing time needed to create bucketed tables).
- Skewed Joins: a hint to tell Hive that the data is skewed so it can optimize the query accordingly. Reviewing the join column groupings (colC, colD, colE in the query above), the data was evenly distributed across its 38 groupings, so it was not skewed at all (see the sketch below).
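
As a quick way to review the join key distribution (a sketch of the kind of check, not the exact query we ran), you can group the fact table by the join columns and look at the row counts per grouping:

SELECT colC, colD, colE, COUNT(*) AS row_count
FROM FactTable
GROUP BY colC, colD, colE
ORDER BY row_count DESC
LIMIT 50;

If a handful of groupings dominate the counts, the data is skewed and the skew join optimization is worth considering; in our case the rows were spread evenly across the 38 groupings.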

Query Path

Below is the thought process we followed to get the best query performance.

Test Run Duration Mappers Reducers
Base Query* 2:23:59 23 1
Compression* 1:24:38 23 1
Configure Reducer Task Size* 0:21:39 23 30
Full Dataset 2:01:56 134 182
Increase Nodes (4 to 10) 1:10:57 134 182
Map Joins 0:09:58 132 0

* sample data size (1/6 of the full daily dataset)

Test Run   Map file bytes read   Reduce file bytes read   Map file bytes written   Reduce file bytes written
Base Query* 43,370,646,355 78,930,287,557 67,577,746,322 59,748,935,558
Compression* 1,727,983,197 39,441,385,351 2,695,972,976 20,259,915,184
Configure Reducer Task Size* 3,285,339,403 38,775,855,507 2,677,260,304 19,595,626,728
Full Dataset 106,420,783,433 255,327,019,090 17,460,501,681 128,929,981,208
Increase Nodes (4 to 10) 106,420,795,137 255,327,093,479 17,460,513,463 128,930,072,938
Map Joins 540,664 0 7,212,269 0

Base Query

As noted above, the regular common join on just 1/6 of the daily data took 2h 24min.

Compressing the Intermediate Files and Output

As noted earlier, analysis showed that ~230GB of intermediate files were being generated.  Compressing the intermediate files (using the set commands below) brought the query down to 1:24:38 and sharply reduced the file bytes read and written.

set mapred.compress.map.output=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set hive.exec.compress.intermediate=true;

Note that HDInsight currently supports the Gzip and BZip2 codecs; we chose Gzip to match the gzip-compressed source files.
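
The settings above compress only the intermediate map output.  If the final job output also needs to be compressed (for example, when results are written into another table), Hive has analogous settings; a minimal sketch using the same codec:

set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;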

Configure Reducer Task Size

In the previous two runs, it was apparent that only one reducer was in operation, and increasing the number of reducers (up to a point) should improve query performance as well.  Configuring the amount of data handled per reducer brought the query down to 0:21:39.

set hive.exec.reducers.bytes.per.reducer=25000000;
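
hive.exec.reducers.bytes.per.reducer tells Hive how much input to assign to each reducer; Hive estimates the reducer count as roughly the total input size divided by this value (capped by hive.exec.reducers.max), so lowering it well below the default increases reduce-side parallelism.  If the estimate still comes out wrong for a particular dataset, the count can also be pinned explicitly; a sketch, with an illustrative value matching the sample run above:

set mapred.reduce.tasks=30;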

Full Dataset

While this improved performance on the sample, once we switched back to the full dataset with the same configuration, the job needed 134 mappers and 182 reducers and completed in 2:01:56.   Increasing the number of nodes from four to ten dropped the query duration to 1:10:57.

Map Joins

The great thing about map joins is that they were designed for exactly this situation: large tables joined to a small table.  The small table is placed into memory / the distributed cache, and the join completes during the map phase, so there are no reducers and far less data movement.  Using the configuration below, we took a query that had run in 1:10:57 down to 00:09:58.

set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=50000000;

An important note: do not forget the hive.mapjoin.smalltable.filesize setting.  It defaults to 25MB, and in this case the smaller table was 43MB.  Because I had forgotten to raise it to 50MB, all of my original map join tests silently reverted back to common joins.
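
hive.auto.convert.join lets Hive decide on its own to turn a common join into a map join whenever the smaller table is under hive.mapjoin.smalltable.filesize.  Hive also supports an explicit MAPJOIN hint in the query itself (depending on the Hive version, hive.ignore.mapjoin.hint may need to be set to false for the hint to be honored); a sketch against the tables above, noting that for a LEFT OUTER JOIN only the right-hand dimension table can be the in-memory side:

SELECT /*+ MAPJOIN(d) */
colA, colB, … , colN
FROM FactTable f
LEFT OUTER JOIN DimensionTable d
ON d.colC = f.colC
AND d.colD = f.colD
AND d.colE = f.colE;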

Verifying Map Joins are Happening

The ways to verify that map joins are actually happening (as opposed to common joins):

1. With a map join, there are no reducers, because the join is done at the map level.
2. The command-line output reports that a map join is being performed, because it pushes the smaller table up into memory (note the “Dump the hashtable” line below).
3. Right at the end, there is a call-out that the join is being converted into a MapJoin.
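
Another check that works before running the query at all is the EXPLAIN plan (with the same settings applied): a converted query shows a Map Join Operator, plus a local work stage that builds the hash table, instead of a Join Operator inside a reduce stage.  Depending on the Hive version, the plan may instead show a conditional task carrying both a map join stage and a common join backup stage.

EXPLAIN
SELECT colA, colB, … , colN
FROM FactTable f
LEFT OUTER JOIN DimensionTable d
ON d.colC = f.colC
AND d.colD = f.colD
AND d.colE = f.colE;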

Below is the command line output of a map join:

2013-04-26 10:52:41 Starting to launch local task to process map join; maximum memory = 932118528
2013-04-26 10:52:45 Processing rows: 200000 Hashtable size: 199999 Memory usage: 145227488 rate: 0.156
2013-04-26 10:52:47 Processing rows: 300000 Hashtable size: 299999 Memory usage: 183032536 rate: 0.196
2013-04-26 10:52:49 Processing rows: 330936 Hashtable size: 330936 Memory usage: 149795152 rate: 0.161
2013-04-26 10:52:49 Dump the hashtable into file: file:/tmp/msgbigdata/hive_2013-04-26_22-52-34_959_3143934780177488621/-local-10002/HashTable-Stage-4/MapJoin-mapfile01--.hashtable
2013-04-26 10:52:56 Upload 1 File to: file:/tmp/msgbigdata/hive_2013-04-26_22-52-34_959_3143934780177488621/-local-10002/HashTable-Stage-4/MapJoin-mapfile01--.hashtable File size: 39687547
2013-04-26 10:52:56 End of local task; Time Taken: 14.203 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 2 out of 2

Discussion

By compressing the intermediate / map output files and configuring the map join correctly (and adding some extra nodes), we were able to take a join query that originally needed more than two hours and bring it under ten minutes.  For this particular situation map joins were perfect, but it is important to analyze your data first to see whether you have any skew, whether the smaller table fits in memory, and so on.

set mapred.compress.map.output=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set hive.exec.compress.intermediate=true;
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=50000000;
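
Note that set commands like these apply only to the current Hive session.  For a recurring job, one approach (a sketch, with an illustrative file name) is to place the settings above at the top of the .hql script together with the query and run it from the head node:

hive -f optimize_fact_join.hql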

References

Other great references on Hive Map Joins include:

- Join Strategies in Hive: https://cwiki.apache.org/Hive/presentations.data/Hive%20Summit%202011-join.pdf

- Join Optimization in Hive: http://www.slideshare.net/aiolos127/join-optimization-in-hive

- Hadoop’s Map Side Join Implements Hash Join: http://stackoverflow.com/questions/2823303/hadoops-map-side-join-implements-hash-join

- Apache Hive Language Manual > Joins: https://cwiki.apache.org/Hive/languagemanual-joins.html

Ref:  http://dennyglee.com/2013/04/26/optimizing-joins-running-on-hdinsight-hive-on-azure-at-gfs/
