partition outer join in oracle 10g
Partition outer join is a new mechanism in 10g to "invent" data to fill the gaps in non-contiguous results. In 10g there are many methods to deal with such a problem (including the awe-inspiring, but equally terrifying, MODEL clause). In older versions of Oracle, "data-densification" was not as simple and certainly less efficient than it has now become.
the problem
This article has been motivated by a response I gave to a problem raised on an Oracle developer forum. Our requirement is to produce a report that details customer spending for each month of the year. Our database only records actual spend, so for any given month, data for dormant or idle customers will have to be generated.
setup
First, we'll create a mock CUSTOMER_ORDERS table with sparse data to represent customer spending. To keep the example simple, we'll denormalise the customer name onto the orders table.
SQL> CREATE TABLE customer_orders (name, dt, amt)
2 AS
3 SELECT *
4 FROM (
5 SELECT owner
6 , TRUNC(created) + MOD(ROWNUM,6)
7 , TRUNC(object_id/ROWNUM)
8 FROM all_objects
9 WHERE created > TRUNC(SYSDATE,'YEAR')
10 AND owner IN ('ORDSYS','WKSYS')
11 ORDER BY
12 DBMS_RANDOM.RANDOM
13 )
14 WHERE ROWNUM <= 1000;
Table created.
a sparse report
With our customer orders data as sparse as it is, a monthly report for purchases by customer would look as follows.
SQL> SELECT name
2 , TO_CHAR(dt,'YYYYMM') AS year_month
3 , SUM(amt) AS total_amt
4 FROM customer_orders
5 GROUP BY
6 name
7 , TO_CHAR(dt,'YYYYMM')
8 ORDER BY 2, 1;
NAME YEAR_MONTH TOTAL_AMT
--------------- ------------ ----------
ORDSYS 200405 3668
WKSYS 200405 93
ORDSYS 200406 21427
WKSYS 200406 371
ORDSYS 200407 156864
WKSYS 200407 2163
6 rows selected.
adding the missing months
We can see from the data that we are missing most months of the year for our two customers. Remember that our requirement is to show a report for every month in 2004 for every customer. First we will build a "time dimension" set (using subquery factoring) and outer join it to our orders table.
SQL> WITH year_months AS (
2 SELECT TO_CHAR(
3 ADD_MONTHS(
4 DATE '2004-01-01', ROWNUM-1 ), 'YYYYMM' ) AS year_month
5 FROM dual
6 CONNECT BY ROWNUM < 12
7 )
8 SELECT co.name
9 , ym.year_month
10 , NVL(SUM(co.amt),0) AS total_amount
11 FROM year_months ym
12 LEFT OUTER JOIN
13 customer_orders co
14 ON (TO_CHAR(co.dt,'YYYYMM') = ym.year_month)
15 GROUP BY
16 co.name
17 , ym.year_month
18 ORDER BY
19 co.name
20 , ym.year_month;
NAME YEAR_MONTH TOTAL_AMOUNT
--------------- ------------ ------------
ORDSYS 200405 3668
ORDSYS 200406 21427
ORDSYS 200407 156864
WKSYS 200405 93
WKSYS 200406 371
WKSYS 200407 2163
200401 0
200402 0
200403 0
200404 0
200408 0
200409 0
200410 0
200411 0
200412 0
15 rows selected.
We can see that this hasn't quite worked. We have the zero sums and the year-months, but we are missing customer names. This is because we outer joined to CUSTOMER_ORDERS on the year-months, so any customer columns would show as NULL for deficient rows. Until PARTITION OUTER JOIN appeared in Oracle 10g, we couldn't "invent" data easily, though the next section shows that it is possible in prior versions.
data-densification without partition outer join
We can now take the previous example further and generate the missing customer names for our report. We need to create a set of distinct customer names in addition to the year-months that we generated by subquery. We can combine the two generated "dimensions" (name and year-month) to drive the report as follows.
SQL> WITH year_months AS (
2 SELECT TO_CHAR(
3 ADD_MONTHS(
4 DATE '2004-01-01', ROWNUM-1 ), 'YYYYMM' ) AS year_month
5 FROM dual
6 CONNECT BY ROWNUM < 12
7 )
8 , customer_names AS (
9 SELECT DISTINCT name
10 FROM customer_orders
11 )
12 SELECT sq.name
13 , sq.year_month
14 , NVL(co.total_amount,0) AS total_amount
15 FROM (
16 SELECT name
17 , TO_CHAR(dt,'YYYYMM') AS year_month
18 , SUM(amt) AS total_amount
19 FROM customer_orders
20 GROUP BY
21 name
22 , TO_CHAR(dt,'YYYYMM')
23 ) co
24 , (
25 SELECT *
26 FROM customer_names
27 , year_months
28 ) sq
29 WHERE sq.year_month = co.year_month (+)
30 AND sq.name = co.name (+)
31 ORDER BY
32 sq.name
33 , sq.year_month;
NAME YEAR_MONTH TOTAL_AMOUNT
--------------- ------------ ------------
ORDSYS 200401 0
ORDSYS 200402 0
ORDSYS 200403 0
ORDSYS 200404 0
ORDSYS 200405 3668
ORDSYS 200406 21427
ORDSYS 200407 156864
ORDSYS 200408 0
ORDSYS 200409 0
ORDSYS 200410 0
ORDSYS 200411 0
ORDSYS 200412 0
WKSYS 200401 0
WKSYS 200402 0
WKSYS 200403 0
WKSYS 200404 0
WKSYS 200405 93
WKSYS 200406 371
WKSYS 200407 2163
WKSYS 200408 0
WKSYS 200409 0
WKSYS 200410 0
WKSYS 200411 0
WKSYS 200412 0
24 rows selected.
We can see that this has now become quite complicated and unintuitive. Using the generated dimensions, we built a cartesian product of our range of months and customer names to derive the full range of dates per customer. We then outer joined this cartesian product to CUSTOMER_ORDERS to fill in the gaps for our report. We can imagine the potential cost of such a method if our CUSTOMER_ORDERS table had millions of rows and we had to generate a distinct set of names and then join this back to the table.
In the following section, we'll see just how much simpler the PARTITION OUTER JOIN solution is.
partition outer join
In 10g, the PARTITION OUTER JOIN enables us to avoid the second access of CUSTOMER_ORDERS, and makes the report much easier to build. We can see this with a re-write of the previous example.
SQL> WITH year_months AS (
2 SELECT TO_CHAR(
3 ADD_MONTHS(
4 DATE '2004-01-01', ROWNUM-1), 'YYYYMM') AS year_month
5 FROM dual
6 CONNECT BY ROWNUM < 12
7 )
8 SELECT co.name
9 , ym.year_month
10 , NVL(SUM(co.amt),0) AS total_amount
11 FROM year_months ym
12 LEFT OUTER JOIN
13 customer_orders co PARTITION BY (co.name)
14 ON (TO_CHAR(co.dt,'YYYYMM') = ym.year_month)
15 GROUP BY
16 co.name
17 , ym.year_month
18 ORDER BY
19 co.name
20 , ym.year_month;
NAME YEAR_MONTH TOTAL_AMOUNT
--------------- ------------ ------------
ORDSYS 200401 0
ORDSYS 200402 0
ORDSYS 200403 0
ORDSYS 200404 0
ORDSYS 200405 3668
ORDSYS 200406 21427
ORDSYS 200407 156864
ORDSYS 200408 0
ORDSYS 200409 0
ORDSYS 200410 0
ORDSYS 200411 0
ORDSYS 200412 0
WKSYS 200401 0
WKSYS 200402 0
WKSYS 200403 0
WKSYS 200404 0
WKSYS 200405 93
WKSYS 200406 371
WKSYS 200407 2163
WKSYS 200408 0
WKSYS 200409 0
WKSYS 200410 0
WKSYS 200411 0
WKSYS 200412 0
24 rows selected.
This is so much easier! The PARTITION BY clause on the outer join has enabled us to split our deficient data into groups and outer join to each group. In the context of our example, this means that Oracle knows the customer name for each group that we are outer joining to. Hence we did not need to generate a separate set of names and this in turn keeps our resource usage to a minimum.
If we fetch the execution plan for the previous query using DBMS_XPLAN.DISPLAY_CURSOR, we can see the presence of the PARTITION OUTER JOIN as follows.
SQL> SELECT plan_table_output
2 FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR);
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------
SQL_ID 2w0usftp0k4un, child number 0
-------------------------------------
WITH year_months AS ( SELECT TO_CHAR( ADD_MONTHS(
DATE '2004-01-01', ROWNUM-1), 'YYYYMM') AS year_month FROM dual CONNECT BY
ROWNUM < 12 ) SELECT co.name , ym.year_month , NVL(SUM(co.amt),0) AS
total_amount FROM year_months ym LEFT OUTER JOIN customer_orders co PARTITION
BY (co.name) ON (TO_CHAR(co.dt,'YYYYMM') = ym.year_month) GROUP BY co.name ,
ym.year_month ORDER BY co.name , ym.year_month
Plan hash value: 2660043991
------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 7 (100)| |
| 1 | SORT GROUP BY | | 1 | 79 | 6 (34)| 00:00:01 |
| 2 | VIEW | | 1 | 79 | 5 (20)| 00:00:01 |
| 3 | NESTED LOOPS PARTITION OUTER | | 1 | 44 | 5 (20)| 00:00:01 |
| 4 | BUFFER SORT | | | | | |
| 5 | VIEW | | 1 | 5 | 2 (0)| 00:00:01 |
| 6 | COUNT | | | | | |
| 7 | CONNECT BY WITHOUT FILTERING| | | | | |
| 8 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
|* 9 | FILTER | | | | | |
| 10 | SORT PARTITION JOIN | | 10 | 390 | 3 (34)| 00:00:01 |
| 11 | TABLE ACCESS FULL | CUSTOMER_ORDERS | 10 | 390 | 2 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
9 - filter("YM"."YEAR_MONTH"=TO_CHAR("CO"."DT",'YYYYMM'))
Note
-----
- dynamic sampling used for this statement
37 rows selected.
The effects of the PARTITION OUTER JOIN can be seen in steps 3 and 10. Step 10 is where Oracle sorts the customer data to partition by name and step 3 is where the partitions are outer joined to the year-month data. Note that the CONNECT BY steps are a result of the method we chose to generate the time dimension data.
further reading
For more information on PARTITION OUTER JOIN and how it can be applied, see the online documentation. Jonathan Gennick has written about this feature in a wider article on the MODEL clause in Oracle Magazine and this thread on Ask Tom shows the original forum problem.
source code
The source code for the examples in this article can be downloaded from here.
总结:
1. 注意PARTITION OUTER JOIN 的用法:
( xx: table 1 ) LEFT OUTER JOIN ( xxx: table 2 ) PARTITION BY ( xxx: column name ) ON ( xxx: join condition )
2. left outer join partition 主要解决left outer join 中存在的空值问题。
当两个用来做Join的表中的数据不“相当”的时候,如果我们希望在最终结果中其中一张表的所有records都不落下,一般会用到outer join。另外一个表中“缺失”的那部分数据自然都是以NULL来补足。但是如果我们希望缺失的那部分数据(NULL)能正常"补足'’ 出来,这个时候用Partiton Outer Join就会很方便,而且性能很好。
this article from http://www.oracle-developer.net/display.php?id=312