Parallel Execution – 4

I’m aware that in the previous article in this series I said I’d continue “in a few days” and it has now been more like 11 weeks – but finally I’ve got the time. In this article I’m going to talk primarily about Bloom filters and their impact on performance, but I’ll need to say something about the “virtual tables” and “parallel execution message size” before I begin. Take a look at this fragment of a parallel execution plan:

-----------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost  |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------

|   7 |        PX SEND BROADCAST    | :TQ10000 |     3 |    18 |     2 |  Q1,00 | P->P | BROADCAST  |
|   8 |         PX BLOCK ITERATOR   |          |     3 |    18 |     2 |  Q1,00 | PCWC |            |
|*  9 |          TABLE ACCESS FULL  | T3       |     3 |    18 |     2 |  Q1,00 | PCWP |            |

-----------------------------------------------------------------------------------------------------

Examining the Name and TQ columns, we can say (from the TQ column) that one set of parallel execution slaves has operated steps 7, 8, and 9 to populate a “virtual table” called :TQ10000 which is broadcast (every slave receives every row) to another set of parallel execution slaves. Each slave in the other slave set will read from that virtual table (at some point) and do something with the data.

I am fairly certain (and it is an important point in understanding the mechanics and rationale of parallel execution plans) that a parallel slave (set) cannot write one virtual table while reading from another virtual table; it has to read the whole of the “incoming” virtual table before it can start writing to an “outgoing” virtual table. This restriction probably keeps the code as simple as possible, partly because it limits the number of parallel execution slave sets needed for a data flow operation to a maximum of two. Bear in mind that there is also (probably) a “side-channel” that the query co-ordinator uses to pass instructions to the parallel query slaves, and this side-channel probably falls outside any “can’t read and write simultaneously” restriction.

So how does Oracle implement virtual tables? A virtual table is simply a small number of "pages" of memory, whose size is dictated by the parameter _parallel_execution_message_size. The default value for this parameter is either 2,148 bytes, 4,096 bytes, or 16,384 bytes, depending on the Oracle version and the settings of various other parameters. When I say "small number", though, this is per communication channel, and every slave in one slave set has to be able to pass data to every slave in the other slave set, so if you're running a query with degree N you need to allocate a total of N-squared times that small number of pages. (In fact, since both sets of slaves may also have to send data to the query co-ordinator at some stage (e.g. while "ranging"), you may need a further 2N sets of pages – i.e. N for each slave set – for those channels.)
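
If you want to check the value on your own system, a simple query against v$parameter will do, assuming your version exposes the documented (non-hidden) form of the parameter:

select  name, value, isdefault
from    v$parameter
where   name = 'parallel_execution_message_size'
;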

I recall reading a document many years ago (in the era of 7.2 or 7.3 – when we had OCIS rather than Metalink or MoS) which said the number of pages required was 3, though I've also seen a blog note recently saying it's 4 in RAC systems. So, assuming you have a query running parallel 5 in single instance and your _parallel_execution_message_size is 16,384 bytes, the amount of memory you need for virtual tables for that query is something like: 5 * 5 * 3 * 16KB = 1,200KB (or maybe 5 * 7 * 3 * 16KB = 1,680KB if my comment about the query co-ordinator is correct).

Again depending on version and parameter settings, this memory will be located either in the shared pool or large pool with the name “PX msg pool”; if you run a series of tests that continuously increases the degree of parallelism on a fixed query you’ll see this pool size growing at roughly the right rate to match the formulae above (although the approximation was somewhat obscured when I checked as Oracle was allocating the pool in chunks of 120KB + 24 bytes).
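
A quick way of watching the pool as you step up the degree of parallelism is to query v$sgastat between tests; depending on your settings the pool may appear against the shared pool or the large pool:

select  pool, name, bytes
from    v$sgastat
where   name = 'PX msg pool'
;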

Given the small number of pages available on any channel for reading and writing, you can imagine that a writer (producer) will sometimes find that all the available pages for a given channel are full and it has to wait for the reader (consumer) to empty them before it can write the next set of data. This leads to the two waits: "PX Deq Credit: send blkd" and "PX Deq Credit: need buffer". I'm not sure exactly why there are two different waits, but it may be the difference between an explicit block and a simple shortage (e.g. "send blkd" = the consumer is currently reading the buffers and has made them unavailable, while "need buffer" = all the buffers are full but the consumer is busy doing something else and hasn't noticed).
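
If you want to see how much time a given session has lost to these waits you can check v$session_event (or v$system_event for instance-wide totals); something like:

select  sid, event, total_waits, time_waited
from    v$session_event
where   event like 'PX Deq Credit%'
order by
        sid, event
;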

There is an important performance consideration that goes along with this blocking: if a consumer starts spilling its data to disc that's a (relatively) slow operation, which means that all the producers trying to send data to that consumer are likely to be blocked each time it writes a few more pages to disc. As a result, one (unlucky) consumer could add far more delay to the query than just the time it spends writing: if I spill my pages to disc, none of the slaves in the other slave set can send data to me until I've finished, which means they can't send data to anyone else either. Moreover, if a single producer tends to send most of its data to a single consumer, that producer will effectively block all the other producers, because they may have to wait for that one consumer to spill the data from that one producer every time they try to send data to that consumer.

Clearly, spilling to disc is a bad thing – it's going to slow the query down whatever the circumstances, but an imbalance in the sharing of data between consumers and producers could have an impact far greater than we might at first expect. Any strategy that minimises the risk of spilling to disc is likely to be a good idea – and that's where Bloom filters come into play; if we can find a cheap way of telling one set of slaves about the data that the other set doesn't need then we can reduce traffic through the virtual tables and reduce the time lost through "need buffer" and "send blkd" waits.

The time when it’s really important to find a way to eliminate data before it goes through a table queue is when we do a hash join. Remember a hash join takes place in two pieces – we extract data from one table and build an (in-memory) hash table, then we extract data from the other table and probe the hash table for each row extracted; but when we are actually extracting the data we don’t consider the join condition. When a hash join goes parallel we may be in a position where one slave set holds the in-memory hash table and the other slave set does the work of scanning the second table – in this case we would like the second slave set to discard as much data as possible before sending it through the table queue … but it doesn’t have any information about the join condition.

A Bloom filter is simply a small dataset created by the first slave set that is a "summary" of the values that are likely to be useful in the join. As it builds the in-memory hash table, the first slave set can construct a Bloom filter for the join values; when the hash table is complete the first slave set can pass the Bloom filter to the second slave set (possibly by way of the query co-ordinator, I haven't checked the mechanism yet), allowing the second slave set to discard the data that definitely won't be needed and to pass only the data that will probably be needed through the table queue. (Note: the Bloom filter may not be a perfect filter: it may let through some data that isn't needed, but it will never discard a row that is needed. For more detailed information about Bloom filters, see this article by Christian Antognini.)
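
As an aside, if you want to see how effective a Bloom filter is at run time, the view v$sql_join_filter reports how many rows each filter has probed and how many it has filtered out; as far as I know it is only populated while the parallel query is active, so you have to catch it in flight with something like:

select  qc_session_id, filtered, probed, active
from    v$sql_join_filter
;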

Let's take a look at a simple example to demonstrate the principle – I've created tables t1 and t2 as "select * from all_objects" (some 70,000 rows in 11.2.0.4) and gathered stats with no histograms. Now I'm going to run three queries: first an unfiltered join between the two tables, then a filtered join showing a Bloom filter magically appearing, and finally the filtered query again after disabling Bloom filtering so that we can see the benefit of the filter. In all three cases I'll enable rowsource execution statistics so that you can see the volume of data that went through the table queues.
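
For reference, the setup would have looked something like the following; the statistics_level setting and the dbms_xplan call are my assumptions about how the rowsource execution statistics were captured and reported rather than something spelled out above.

create table t1 as select * from all_objects;
create table t2 as select * from all_objects;

begin
        dbms_stats.gather_table_stats(user, 'T1', method_opt => 'for all columns size 1');
        dbms_stats.gather_table_stats(user, 'T2', method_opt => 'for all columns size 1');
end;
/

alter session set statistics_level = all;

rem     then, after each query:
rem     select * from table(dbms_xplan.display_cursor(null, null, 'allstats'));

Here's the first, unfiltered, query: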

select
    /*+
        parallel(t1 2)
        parallel(t2 2)
        leading(t1 t2)
        pq_distribute(t2 hash hash)
    */
    t1.object_name, t2.object_name, rpad('x',100,'x') padding
from    t1, t2
where
    t2.object_id = t1.object_id
;

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Starts | E-Time   |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem |  O/1/M   | Max-Tmp |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |      1 |          |        |      |            |  70101 |00:00:00.17 |      12 |      0 |      0 |       |       |          |         |
|   1 |  PX COORDINATOR         |          |      1 |          |        |      |            |  70101 |00:00:00.17 |      12 |      0 |      0 |       |       |          |         |
|   2 |   PX SEND QC (RANDOM)   | :TQ10002 |      0 | 00:00:01 |  Q1,02 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|*  3 |    HASH JOIN BUFFERED   |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  70101 |00:00:00.28 |       0 |    310 |    310 |  5952K|  1953K|     2/0/0|    2048 |
|   4 |     PX RECEIVE          |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  70101 |00:00:00.08 |       0 |      0 |      0 |       |       |          |         |
|   5 |      PX SEND HASH       | :TQ10000 |      0 | 00:00:01 |  Q1,00 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|   6 |       PX BLOCK ITERATOR |          |      2 | 00:00:01 |  Q1,00 | PCWC |            |  70101 |00:00:00.05 |    1031 |   1005 |      0 |       |       |          |         |
|*  7 |        TABLE ACCESS FULL| T1       |     26 | 00:00:01 |  Q1,00 | PCWP |            |  70101 |00:00:00.02 |    1031 |   1005 |      0 |       |       |          |         |
|   8 |     PX RECEIVE          |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  70102 |00:00:00.06 |       0 |      0 |      0 |       |       |          |         |
|   9 |      PX SEND HASH       | :TQ10001 |      0 | 00:00:01 |  Q1,01 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|  10 |       PX BLOCK ITERATOR |          |      2 | 00:00:01 |  Q1,01 | PCWC |            |  70102 |00:00:00.04 |    1031 |   1005 |      0 |       |       |          |         |
|* 11 |        TABLE ACCESS FULL| T2       |     26 | 00:00:01 |  Q1,01 | PCWP |            |  70102 |00:00:00.01 |    1031 |   1005 |      0 |       |       |          |         |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

So, following the table queues:

  • slave set 2 scans t1 and distributes it to slave set 1 through virtual table 0, and slave set 1 builds an in-memory hash table (implied by lines 3 – 7).
  • slave set 2 scans t2 and distributes it to slave set 1 through virtual table 1, and slave set 1 buffers it until the scan has completed and all relevant data has arrived (implied by lines 3 and 8 – 11).
  • slave set 1 does the hash join (line 3) reading back the second data set from its buffer (which may have spilled to disk) and passes the results on to the query coordinator through virtual table 2.
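
If you want to check the actual traffic through each of those virtual tables, v$pq_tqstat reports rows, bytes, and waits per slave process for each table queue; as far as I know it is only populated in the session that ran the query, and only for the most recent parallel statement, so you have to query it immediately after the statement completes:

select  dfo_number, tq_id, server_type, process, num_rows, bytes, waits
from    v$pq_tqstat
order by
        dfo_number, tq_id, server_type, process
;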

Now let's restrict the data we select from table t1.

select
    /*+
        parallel(t1 2)
        parallel(t2 2)
        leading(t1 t2)
        pq_distribute(t2 hash hash)
     */
    t1.object_name, t2.object_name, rpad('x',100,'x') padding
from    t1, t2
where
    t2.object_id = t1.object_id
and t1.owner = 'SYS'
;

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Starts | E-Time   |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem |  O/1/M   | Max-Tmp |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |      1 |          |        |      |            |  34463 |00:00:00.10 |      12 |      0 |      0 |       |       |          |         |
|   1 |  PX COORDINATOR          |          |      1 |          |        |      |            |  34463 |00:00:00.10 |      12 |      0 |      0 |       |       |          |         |
|   2 |   PX SEND QC (RANDOM)    | :TQ10002 |      0 | 00:00:01 |  Q1,02 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|*  3 |    HASH JOIN BUFFERED    |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  34463 |00:00:00.16 |       0 |    124 |    124 |  3611K|  1953K|     2/0/0|    1024 |
|   4 |     JOIN FILTER CREATE   | :BF0000  |      2 | 00:00:01 |  Q1,02 | PCWP |            |  34463 |00:00:00.05 |       0 |      0 |      0 |       |       |          |         |
|   5 |      PX RECEIVE          |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  34463 |00:00:00.04 |       0 |      0 |      0 |       |       |          |         |
|   6 |       PX SEND HASH       | :TQ10000 |      0 | 00:00:01 |  Q1,00 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|   7 |        PX BLOCK ITERATOR |          |      2 | 00:00:01 |  Q1,00 | PCWC |            |  34463 |00:00:00.03 |    1031 |   1005 |      0 |       |       |          |         |
|*  8 |         TABLE ACCESS FULL| T1       |     26 | 00:00:01 |  Q1,00 | PCWP |            |  34463 |00:00:00.01 |    1031 |   1005 |      0 |       |       |          |         |
|   9 |     PX RECEIVE           |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  38848 |00:00:00.05 |       0 |      0 |      0 |       |       |          |         |
|  10 |      PX SEND HASH        | :TQ10001 |      0 | 00:00:01 |  Q1,01 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|  11 |       JOIN FILTER USE    | :BF0000  |      2 | 00:00:01 |  Q1,01 | PCWP |            |  38848 |00:00:00.04 |    1031 |   1005 |      0 |       |       |          |         |
|  12 |        PX BLOCK ITERATOR |          |      2 | 00:00:01 |  Q1,01 | PCWC |            |  38848 |00:00:00.03 |    1031 |   1005 |      0 |       |       |          |         |
|* 13 |         TABLE ACCESS FULL| T2       |     26 | 00:00:01 |  Q1,01 | PCWP |            |  38848 |00:00:00.01 |    1031 |   1005 |      0 |       |       |          |         |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T2"."OBJECT_ID"="T1"."OBJECT_ID")
   8 - access(:Z>=:Z AND :Z<=:Z)
        filter("T1"."OWNER"='SYS')
   13 - access(:Z>=:Z AND :Z<=:Z)
        filter(SYS_OP_BLOOM_FILTER(:BF0000,"T2"."OBJECT_ID"))

In this case the basic plan is the same, although we've introduced two extra lines: a "JOIN FILTER CREATE" at line 4, which is slave set 1 creating a Bloom filter as it constructs the in-memory hash table, and a "JOIN FILTER USE" at line 11, which is slave set 2 discarding data from the tablescan of t2 before sending data through virtual table 1.

Looking at predicate information for line 13 we can see the filter predicate using a function called sys_op_bloom_filter() to eliminate data. Oracle has decided that the number of distinct values of object_id returned by the predicate OWNER = ‘SYS’ is sufficiently small that it’s worth building a Bloom filter describing it and passing it to the other slave set. (In fact the optimizer has been a little over-optimistic in this case because I didn’t gather any histograms on the table – there’s a lot more SYS-owned data than the optimizer is expecting).
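
If you want to check that, you could compare the column statistics with the real data (assuming the tables are in your current schema):

select  column_name, num_distinct, num_nulls, histogram
from    user_tab_col_statistics
where   table_name = 'T1'
and     column_name = 'OWNER'
;

select  count(*)
from    t1
where   owner = 'SYS'
;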

We can see that the Bloom filter has been pretty effective. Since the two tables hold the same data (plus or minus one row) the volume returned by a perfect filter at line 13 ought to match the volume produced from the first table at line 8; in fact line 13 produced about 12.7% more rows than that ideal (38,848 compared with 34,463), but that's still a big improvement over sending the whole table through the virtual table.

Just to finish off, here are the results when I disable Bloom filtering using the /*+ no_px_join_filter(t2) */ hint. (This hint appeared, along with the opposite hint /*+ px_join_filter(alias) */ and Bloom filters themselves, in 10gR2 but didn't get into outlines and SQL Plan Baselines until 11.1.0.6.)
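
The statement would presumably be the previous query with the extra hint added, i.e. something like:

select
    /*+
        parallel(t1 2)
        parallel(t2 2)
        leading(t1 t2)
        pq_distribute(t2 hash hash)
        no_px_join_filter(t2)
    */
    t1.object_name, t2.object_name, rpad('x',100,'x') padding
from    t1, t2
where
    t2.object_id = t1.object_id
and t1.owner = 'SYS'
;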

Plan hash value: 3683239666

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Starts | E-Time   |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem |  O/1/M   | Max-Tmp |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |      1 |          |        |      |            |  34463 |00:00:00.22 |      10 |      0 |      0 |       |       |          |         |
|   1 |  PX COORDINATOR         |          |      1 |          |        |      |            |  34463 |00:00:00.22 |      10 |      0 |      0 |       |       |          |         |
|   2 |   PX SEND QC (RANDOM)   | :TQ10002 |      0 | 00:00:01 |  Q1,02 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|*  3 |    HASH JOIN BUFFERED   |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  34463 |00:00:00.21 |       0 |     93 |     93 |  3611K|  1953K|     2/0/0|    1024 |
|   4 |     PX RECEIVE          |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  34463 |00:00:00.04 |       0 |      0 |      0 |       |       |          |         |
|   5 |      PX SEND HASH       | :TQ10000 |      0 | 00:00:01 |  Q1,00 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|   6 |       PX BLOCK ITERATOR |          |      2 | 00:00:01 |  Q1,00 | PCWC |            |  34463 |00:00:00.04 |    1031 |   1005 |      0 |       |       |          |         |
|*  7 |        TABLE ACCESS FULL| T1       |     26 | 00:00:01 |  Q1,00 | PCWP |            |  34463 |00:00:00.02 |    1031 |   1005 |      0 |       |       |          |         |
|   8 |     PX RECEIVE          |          |      2 | 00:00:01 |  Q1,02 | PCWP |            |  70102 |00:00:00.07 |       0 |      0 |      0 |       |       |          |         |
|   9 |      PX SEND HASH       | :TQ10001 |      0 | 00:00:01 |  Q1,01 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|  10 |       PX BLOCK ITERATOR |          |      2 | 00:00:01 |  Q1,01 | PCWC |            |  70102 |00:00:00.06 |    1031 |   1005 |      0 |       |       |          |         |
|* 11 |        TABLE ACCESS FULL| T2       |     26 | 00:00:01 |  Q1,01 | PCWP |            |  70102 |00:00:00.02 |    1031 |   1005 |      0 |       |       |          |         |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T2"."OBJECT_ID"="T1"."OBJECT_ID")
   7 - access(:Z>=:Z AND :Z<=:Z)
        filter("T1"."OWNER"='SYS')
   11 - access(:Z>=:Z AND :Z<=:Z)

You'll notice we're back to 11 lines (the shape of the first plan): there's no reference to JOIN FILTER and the predicate section no longer has a call to sys_op_bloom_filter().
More significantly, the actual rows (A-Rows) reported at line 11 is the full 70,000 for the whole table: the entire data set will be passing through virtual table 1, increasing the CPU load, the probability of "send blkd" and "need buffer" waits, and the chance of the BUFFERED data spilling to disc.

That’s nearly half a day gone writing up the preamble to how to interpret the execution plan for my original query after I hinted it into a hash/hash distribution – but it’s a lovely day outside so I’m heading out to the garden with my pruning secateurs and it’s going to be a few more days (I hope) before the last part of the series comes out.
