【semi join】IN/EXISTS子查询的区别及其优化方式

【semi join】IN/EXISTS子查询的区别及其优化方式 2013-04-27 12:45:18

分类: Oracle

    在ORACLE 11G大行其道的今天,还有很多人受早期版本的影响,记住一些既定的规则,
   1.子查询结果集小,用IN
   2.外表小,子查询表大,用EXISTS


     这是完全错误的观点。在8i时代,这经常是正确的,但是现在已经11G了,马上12C就要面世了。其实在ORACLE 9i CBO就已经优化了IN,EXISTS的区别,ORACLE优化器有个查询转换器,很多SQL虽然写法不同,但是ORACLE优化器会根据既定规则进行查询重写,重写为优化器觉得效率最高的SQL,所以可能SQL写法不同,但是执行计划却是完全一样的。

      IN与EXISTS有一点要记住:IN一般是用于非相关子查询,而EXISTS一般用于相关子查询。当然IN也可以用于相关子查询,EXISTS也可以用于非相关子查询。但是这区别很重要,虽然优化器很强大, 但是查询转换是有一定的限制的,在EXISTS性能低下,无法进行相关查询转换,比如不能UNNEST SUBQUERY,那么可能我们需要改写SQL,通常可以用IN/JOIN等改写相关子查询,或对于含有OR的SEMI JOIN为UNION ALL/UNION的形式。
    

    下面就用例子说明一下:
DROP TABLE a;
DROP TABLE b;
CREATE TABLE a AS SELECT * FROM hr.employees;
CREATE TABLE b AS SELECT * FROM  hr.employees;
--反复插入,构造20万行+
INSERT INTO a SELECT * FROM a;
INSERT INTO b SELECT * FROM a;
COMMIT;
BEGIN
 dbms_stats.gather_table_stats(ownname => USER,tabname => 'a',estimate_percent => 100,cascade => TRUE);
  dbms_stats.gather_table_stats(ownname => USER,tabname => 'b',estimate_percent => 100,cascade => TRUE);
END;
/

1.测试IN,EXISTS在简单查询中,是等价的

SQL> set autotrace traceonly exp
SQL> SELECT * FROM a
  2  WHERE EXISTS(
  3  SELECT 1 FROM b WHERE a.employee_id=b.employee_id);


执行计划
----------------------------------------------------------
Plan hash value: 2317816356


---------------------------------------------------------------------------------
| Id  | Operation              | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |        |   217K|    15M|  1375   (2)| 00:00:17 |
|*  1 |  HASH JOIN             |        |   217K|    15M|  1375   (2)| 00:00:17 |
|   2 |   SORT UNIQUE          |        |   217K|   848K|   126   (1)| 00:00:02 |
|   3 |    INDEX FAST FULL SCAN| IDX2_B |   217K|   848K|   126   (1)| 00:00:02 |
|   4 |   TABLE ACCESS FULL    | A      |   217K|    14M|   620   (1)| 00:00:08 |
---------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------


   1 - access("A"."EMPLOYEE_ID"="B"."EMPLOYEE_ID")


SQL> SELECT * FROM a
  2  WHERE a.employee_id IN (SELECT b.employee_id FROM b);


执行计划
----------------------------------------------------------
Plan hash value: 2317816356


---------------------------------------------------------------------------------
| Id  | Operation              | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |        |   217K|    15M|  1375   (2)| 00:00:17 |
|*  1 |  HASH JOIN             |        |   217K|    15M|  1375   (2)| 00:00:17 |
|   2 |   SORT UNIQUE          |        |   217K|   848K|   126   (1)| 00:00:02 |
|   3 |    INDEX FAST FULL SCAN| IDX2_B |   217K|   848K|   126   (1)| 00:00:02 |
|   4 |   TABLE ACCESS FULL    | A      |   217K|    14M|   620   (1)| 00:00:08 |
---------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------


   1 - access("A"."EMPLOYEE_ID"="B"."EMPLOYEE_ID")

  可以看出,两个计划完全没有区别。类似于ORACLE查询重写为:
 SELECT a.* FROM a,(SELECT DISTINCT b.employee_id FROM b) b1 WHERE a.employee_id=b1.employee_id;

看看8i时代的区别:
SQL> SELECT/*+optimizer_features_enable('8.1.7')*/ * FROM a
  2  WHERE EXISTS(
  3  SELECT 1 FROM b WHERE a.employee_id=b.employee_id);
已用时间:  00: 00: 00.00


执行计划
----------------------------------------------------------
Plan hash value: 3422092984


-------------------------------------------------------------
| Id  | Operation          | Name   | Rows  | Bytes | Cost  |
-------------------------------------------------------------
|   0 | SELECT STATEMENT   |        | 10854 |   731K|   344 |
|*  1 |  FILTER            |        |       |       |       |
|   2 |   TABLE ACCESS FULL| A      | 10854 |   731K|   344 |
|*  3 |   INDEX RANGE SCAN | IDX2_B |  2049 |  8196 |     5 |
-------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------


   1 - filter( EXISTS (SELECT 0 FROM "B" "B" WHERE
              "B"."EMPLOYEE_ID"=:B1))
   3 - access("B"."EMPLOYEE_ID"=:B1)


Note
-----
   - cpu costing is off (consider enabling it)


SQL> SELECT/*+optimizer_features_enable('8.1.7')*/ * FROM a
  2  WHERE a.employee_id IN (SELECT b.employee_id FROM b);
已用时间:  00: 00: 00.00


执行计划
----------------------------------------------------------
Plan hash value: 1679318093


-------------------------------------------------------------------------
| Id  | Operation            | Name     | Rows  | Bytes |TempSpc| Cost  |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |          |   217K|    16M|       |  1126 |
|*  1 |  HASH JOIN           |          |   217K|    16M|       |  1126 |
|   2 |   VIEW               | VW_NSO_1 |   106 |  1378 |       |   779 |
|   3 |    SORT UNIQUE       |          |   106 |   424 |  2576K|   779 |
|   4 |     TABLE ACCESS FULL| B        |   217K|   848K|       |   344 |
|   5 |   TABLE ACCESS FULL  | A        |   217K|    14M|       |   344 |
-------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------


   1 - access("A"."EMPLOYEE_ID"="EMPLOYEE_ID")


Note
-----
   - cpu costing is off (consider enabling it)

    显然,在8i时代,还是有明显的区别,EXISTS是主表驱动,走的FILTER,如果主表返回行很多,那么必然效率会低,如果主表小,内表,内表能走索引,是很好的,IN走HASH JOIN,受内表
驱动,内表返回行少,效率高。当然具体情况肯定会很复杂,这里不做研究了,因为8i已经是过去式了。

2.遇到优化器限制的做法:改写SQL
       
ORACLE优化器虽然已经很强大,但是还有很多限制,比如无法UNNEST的限制如子查询有CONNECT BY,SET操作,ROWNUM,关联子查询内部包含分组函数等。还比如SEMI JOIN.ANTI JOIN条件带OR的形式等。这时候,我们常用的优化方式就是SQL的等价改写,这要根据具体的业务和数据特点,来重写等价的SQL,千万别改写成结果不等价,那就糟糕了。

     比如这个SQL:
SELECT * FROM a
WHERE EXISTS(
SELECT 1 FROM b WHERE a.department_id=b.department_id  GROUP BY b.department_id HAVING a.salary>=MAX(b.salary));

SQL> @display_cursor
SQL_ID  dgc8b80sxwct2, child number 1
-------------------------------------
SELECT * FROM a WHERE EXISTS( SELECT 1 FROM b WHERE
a.department_id=b.department_id  GROUP BY b.department_id HAVING
a.salary>=MAX(b.salary))


Plan hash value: 705769378


-----------------------------------------------------------------------------------------
| Id  | Operation              | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      |      1 |        |  22528 |00:09:13.61 |      32M|
|*  1 |  FILTER                |      |      1 |        |  22528 |00:09:13.61 |      32M|
|   2 |   TABLE ACCESS FULL    | A    |      1 |    217K|    217K|00:00:00.20 |    3733 |
|*  3 |   FILTER               |      |  14403 |        |   2058 |00:09:13.20 |      32M|
|   4 |    SORT GROUP BY NOSORT|      |  14403 |      1 |  14403 |00:09:13.17 |      32M|
|*  5 |     TABLE ACCESS FULL  | B    |  14403 |  19745 |    973M|00:06:17.21 |      32M|
-----------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------


   1 - filter( IS NOT NULL)
   3 - filter(MAX("B"."SALARY")<=:B1)
   5 - filter("B"."DEPARTMENT_ID"=:B1)

   这是个很简单的SQL,但是因为使用了EXISTS关联子查询,并且内部有分组操作,无法进行有效的查询转换,走了FILTER操作,FILTER操作类似于NESTED LOOPS,但是不同于NESTED LOOPS的是,他还可以通过条件判断,是否走子步骤。 这里全表扫描B 14403次(这是无法忍受的,特别遇到大表,甭想跑出来了)

  那么如何优化这种SQL呢,要改写,改写为JOIN形式:
 SQL> SELECT * FROM a,(SELECT department_id,MAX(b.salary) max_salary FROM b GROUP BY b.department_id) b1
  2  WHERE a.department_id=b1.department_id AND a.salary>=b1.max_salary;


已选择22528行。


已用时间:  00: 00: 00.64


执行计划
----------------------------------------------------------
Plan hash value: 774961296


-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      | 10854 |  1006K|  1250   (2)| 00:00:16 |
|*  1 |  HASH JOIN           |      | 10854 |  1006K|  1250   (2)| 00:00:16 |
|   2 |   VIEW               |      |    11 |   286 |   629   (3)| 00:00:08 |
|   3 |    HASH GROUP BY     |      |    11 |    88 |   629   (3)| 00:00:08 |
|   4 |     TABLE ACCESS FULL| B    |   217K|  1696K|   620   (1)| 00:00:08 |
|   5 |   TABLE ACCESS FULL  | A    |   217K|    14M|   620   (1)| 00:00:08 |
-----------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------


   1 - access("A"."DEPARTMENT_ID"="B1"."DEPARTMENT_ID")
       filter("A"."SALARY">="B1"."MAX_SALARY")




统计信息
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       5986  consistent gets
          0  physical reads
          0  redo size
    1851483  bytes sent via SQL*Net to client
      16926  bytes received via SQL*Net from client
       1503  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
      22528  rows processed

    从上面看出,逻辑读相对于没有改写之前的32M,时间9分钟,效率现在大幅度提升,时间变为 00: 00: 00.64,l逻辑读变为5986次,因为走了HASH JOIN,相当于两表各扫描1次。

  上面是通过将子查询改为JOIN的形式来优化,当然改写方式多样,下面用一个EXISTS改写为IN的方式来提高效率:

 
 SELECT * FROM a WHERE EXISTS(SELECT 1 FROM b WHERE  a.department_id=b.department_id AND a.employee_id=b.employee_id
  START WITH b.employee_id =202 CONNECT BY PRIOR b.employee_id=b.manager_id);

SQL> set autotrace traceonly exp
SQL>  SELECT * FROM a WHERE EXISTS(SELECT 1 FROM b WHERE  a.department_id=b.department_id AND a.employee_id=b.employee_id
  2    START WITH b.employee_id =202 CONNECT BY PRIOR b.employee_id=b.manager_id);
已用时间:  00: 00: 00.00


执行计划
----------------------------------------------------------
Plan hash value: 985844456


---------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |      |   186 | 12834 |   723K  (1)| 02:24:37 |
|*  1 |  FILTER                                    |      |       |       |            |          |
|   2 |   TABLE ACCESS FULL                        | A    |   217K|    14M|   620   (1)| 00:00:08 |
|*  3 |   FILTER                                   |      |       |       |            |          |
|*  4 |    CONNECT BY NO FILTERING WITH SW (UNIQUE)|      |       |       |            |          |
|   5 |     TABLE ACCESS FULL                      | B    |   217K|  2545K|   620   (1)| 00:00:08 |
---------------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------


   1 - filter( EXISTS (SELECT 0 FROM "B" "B" WHERE "B"."DEPARTMENT_ID"=:B1 AND
              "B"."EMPLOYEE_ID"=:B2 START WITH "B"."EMPLOYEE_ID"=202 CONNECT BY "B"."MANAGER_ID"=PRIOR
              "B"."EMPLOYEE_ID"))
   3 - filter("B"."DEPARTMENT_ID"=:B1 AND "B"."EMPLOYEE_ID"=:B2)
   4 - access("B"."MANAGER_ID"=PRIOR "B"."EMPLOYEE_ID")
       filter("B"."EMPLOYEE_ID"=202)


   又是和前面类似的FILTER操作,这条SQL也要运行很长时间,FILTER操作不是不好,就像NESTED LOOPS一样,也有高效的时候,如果FILTER操作做的次数不多,而且分支操作可以高效实用索引,那么也是高效的,这得注意。

  改写,以上查询可以很容易改写为IN的形式:
 SQL> SELECT * FROM a WHERE (a.department_id,a.employee_id) IN
  2   (SELECT b.department_id,b.employee_id FROM b  START WITH b.employee_id =202 CONNECT BY PRIOR b.employee_id=b.manager_id );
已用时间:  00: 00: 00.13


执行计划
----------------------------------------------------------
Plan hash value: 1584203533


--------------------------------------------------------------------------------------------------------------
| Id  | Operation                                 | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                          |          |  2048 |   190K|       |  2466   (1)| 00:00:30 |
|*  1 |  HASH JOIN RIGHT SEMI                     |          |  2048 |   190K|  8064K|  2466   (1)| 00:00:30 |
|   2 |   VIEW                                    | VW_NSO_1 |   217K|  5514K|       |  1617  (62)| 00:00:20 |
|*  3 |    CONNECT BY NO FILTERING WITH START-WITH|          |       |       |       |            |       |
|   4 |     TABLE ACCESS FULL                     | B        |   217K|  2545K|       |   620   (1)| 00:00:08 |
|   5 |   TABLE ACCESS FULL                       | A        |   217K|    14M|       |   620   (1)| 00:00:08 |
--------------------------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------


   1 - access("A"."DEPARTMENT_ID"="DEPARTMENT_ID" AND "A"."EMPLOYEE_ID"="EMPLOYEE_ID")
   3 - access("B"."MANAGER_ID"=PRIOR "B"."EMPLOYEE_ID")
       filter("B"."EMPLOYEE_ID"=202)


    走了HASH JOIN,这条SQL效率大增,基本在<1s内会返回结果。

  在学习过程中,一定要亲自实践,不能遵循于从某个地方看到的规则,特别是N年前的规则,规则是有用的,但是规则也会不断滴更新的,如果规则发生了更新,但是在你的脑子里没有更新,你却不经过实践,就永远遵循这规则,那你对这方面的知识永远知之甚少或知而不全。

你可能感兴趣的:(【semi join】IN/EXISTS子查询的区别及其优化方式)