ORACLE分析函数学习之Windowing Aggregate Functions

ORACLE在数据仓库的应用上,有很多针对分析数据和报表查询的函数。分类如下:

 

 

Type Used For

Ranking

Calculating ranks, percentiles, and n-tiles of the values in a result set.

Windowing

Calculating cumulative and moving aggregates.

Works with these functions: SUM, AVG, MIN, MAX, COUNT, VARIANCE, STDDEV, FIRST_VALUE, LAST_VALUE,

and new statistical functions. Note that the DISTINCT keyword is

not supported in windowing functions except for MAX and MIN.

Reporting

Calculating shares, for example, market share.

Works with these functions: SUM, AVG, MIN, MAX, COUNT (with/without DISTINCT),

VARIANCE, STDDEV, RATIO_TO_REPORT, and new statistical functions.

Note that the DISTINCT keyword may be used in those reporting

functions that support DISTINCT in aggregate mode.

LAG/LEAD

Finding a value in a row a specified number of rows from a current row.

FIRST/LAST

First or last value in an ordered group.

Linear Regression

Calculating linear regression and other statistics (slope, intercept, and so on).

Inverse Percentile

The value in a data set that corresponds to a specified percentile.

Hypothetical Rank and Distribution

The rank or percentile that a row would have if inserted into a specified data set.

  

    

其中 Windowing analytic function 的主要功能是 Calculating cumulative and moving aggregates,即累计计算求和。

Reporting 只是省略了WIDNOW ,即对分组后的全部数据求和,没有累加效果,很适合对某个地区销售百分额的查询。

 

分析函数的语法格式如下:

 

analytic_function::=

 

 

analytic_clause::=

 

 

 

 

Processing order:

 

 

 

 

首先,SQL语句中JOINS,WHERE,GROUP BY HAVING Clause先执行,然后为分析函数生成一个结果集,分析函数在此基础上进行计算,

如果有最后ORDER BY Clause,再进行排序(分析函数内部如有ORDER BY子句是在计算之前执行的)。

 

Result set partitions:

 

对JOINS,WHERE,GROUP BY HAVING 字句后生成的结果集分组,如省略次子句,默认对partition by 所有列

 

Window(重点):

 partition by后的每一行,都拥有一个视窗。可以把每一行想象成一个天真活泼的小男孩,视窗就是他们手中拿着的可以自动伸长缩短的魔法棍。

一个结果集就如同所有的小男孩排成一个长队。然后每个小男孩拿着魔法棍,往前或者往后或者前后同时,想打谁就打谁,只要输入口令。

UNBOUNDED PRECEDING,就是从最前面到自己,UNBOUNDED FOLLOWING,就是从自己到最后面。

BETWEEN '起点' and '终点',就是以自己为中心从起点到终点的范围。

但是魔法棒有个缺点,只能范围攻击,不能单独击中某一个自己相邻的除外。

打比方打到这里,其实魔法棒不是最佳比喻,还是WINDOW视窗比较合适,正好是范围的滑动。

 

Current row:

就是当前行啦,它作为视窗的参考点,并包括在整个视窗的计算范围内。

 

详细语法如下:

analytic_function([ arguments ])
   OVER (analytic_clause)

where analytic_clause =
     [ query_partition_clause ]
     [ order_by_clause [ windowing_clause ] ]

and query_partition_clause =
    PARTITION BY
      { value_expr[, value_expr ]...
      | ( value_expr[, value_expr ]... )
      }
and windowing_clause =
     { ROWS | RANGE }
     { BETWEEN
       { UNBOUNDED PRECEDING
     | CURRENT ROW
     | value_expr { PRECEDING | FOLLOWING }
     }
     AND
     { UNBOUNDED FOLLOWING
     | CURRENT ROW
     | value_expr { PRECEDING | FOLLOWING }
     }
   | { UNBOUNDED PRECEDING
     | CURRENT ROW
     | value_expr PRECEDING
     }
   }

RANGE:

定义为Logical Offset,即前后相隔可以通过具体的逻辑表达式表示,

如 range INTERVAL '1' DAY AND INTERVAL '2' DAY 就是当前行的ORDER BY 字段的时间前一天和后两天的范围。

用RANG的时候,RANG 后面可是NUMERIC INTERVAL和表达式 ORDER BY字段的类型必须和RANG后的类型一样。并且用RANG的时候,只能ORDER BY一个字段。

ROWS:

定义为Physical Offsets,即前后相隔只能是具体的行,如rows between 2 and 2就是当前行的前两行和后两行。可以ORDER BY多个字段。

 

   WINDOWS 子句只能在ORDER BY子句后出现,如果不写则默认为 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

即WINDOWS都是基于分析子句中的OREDR BY,无ORDER BY的情况下,功能相当于REPORTING FUNCTION。

分析函数只能出现在SELECT子句和ORDER BY子句中。

(这里插一句,跟本主题不相关,只是突然想把在想的写下来,不管对不对。ORACLE在SQL执行计划中SELECT都放在最后,

在解析语句的时候,就解析到了分析函数,当执行GROUP BY等后,就会在RESULT SETS中根据解析到的分析函数语法,

对排序后的每一行数据进行计算,最后在FETCH所有的行显示出来。GROUP BY 后的统计函数也是,如解析出SUM后,

对分组后的结果集中每个组中的每一行数据依次FENTCH相加,如果在SUM中写CASE WHEN就能根据条件选择每一行的加减乘除。

但GROUP 带不出所有的行,而分析函数是基于每一行的。

哎,有些只能意会不能言传,主要还是没有实际的搞懂,要是能看到源代码就好了:)

 

下面举具体的例子说明。

Windowing Functions with Logical Offset:

1.

 

SELECT t.time_id, TO_CHAR (SUM(amount_sold), '9,999,999,999')
AS SALES, TO_CHAR(AVG(SUM(amount_sold)) OVER
  (ORDER BY t.time_id
   RANGE BETWEEN INTERVAL '1' DAY PRECEDING AND
   INTERVAL '1' DAY FOLLOWING), '9,999,999,999') AS CENTERED_3_DAY_AVG
   FROM sales s, times t
   WHERE s.time_id=t.time_id AND t.calendar_week_number IN (51)
     AND calendar_year=1999
   GROUP BY t.time_id
   ORDER BY t.time_id;

 

TIME_ID          SALES          CENTERED_3_DAY
---------           -------------         --------------
20-DEC-99        134,337        106,676
21-DEC-99         79,015        102,539
22-DEC-99         94,264         85,342
23-DEC-99         82,746         93,322
24-DEC-99        102,957         82,937
25-DEC-99         63,107         87,062
26-DEC-99         95,123         79,115

每一行结果中带出 当前行的TIME_ID前后相隔一天的的SALES的值的平均值。第一行前面没有行,平均值为后一行加当前行除以2.当找不到行的时候不会把那一行算入到计算内。RANG后为INTERVAL,GROUP BY的字段必须为DATE类型。

2.

SELECT t_timekey, AVG(stock_price)
        OVER (ORDER BY t_timekey RANGE fn(t_timekey) PRECEDING) av_price
FROM stock, time WHERE st_timekey = t_timekey
ORDER BY t_timekey;

    RANG后可跟表达式,表达式可以为函数。

3.

SELECT time_id, daily_sum, SUM(daily_sum) OVER (ORDER BY time_id
RANGE BETWEEN INTERVAL '10' DAY PRECEDING AND CURRENT ROW)
AS current_group_sum
FROM (SELECT time_id, channel_id, SUM(s.quantity_sold)
AS daily_sum
FROM customers c, sales s, countries
WHERE c.cust_id=s.cust_id
  AND c.country_id = countries.country_id
  AND s.cust_id IN (638, 634, 753, 440 ) AND s.time_id BETWEEN '01-MAY-00'
  AND '13-MAY-00' GROUP BY time_id, channel_id);

 

TIME_ID    DAILY_SUM CURRENT_GROUP_SUM
---------          ----------           -----------------
06-MAY-00          7                 7                             /* 7 */
10-MAY-00          1                 9                              /* 7 + (1+1) */
10-MAY-00          1                 9                               * 7 + (1+1) */
11-MAY-00          2                15                             /* 7 + (1+1) + (2+4) */
11-MAY-00          4                15                             /* 7 + (1+1) + (2+4) */
12-MAY-00          1                16                             /* 7 + (1+1) + (2+4) + 1 */
13-MAY-00          2                23                             /* 7 + (1+1) + (2+4) + 1 + (5+2) */
13-MAY-00          5                23                             /* 7 + (1+1) + (2+4) + 1 + (5+2) */

 

 

因为是LOGIC OFFCET,当ORDER BY 字段中有相同值时,会把相同字段的分析函数中要分析的值相加后在根据WINDOW值累计。

注意这个特性。要想不要这种效果,也就是无论ORDER BY字段是否有相同的值,都按照WINDOW值相加就要用ROWS了。

 

Windowing Aggregate Functions With Physical Offsets

1.

SELECT t.time_id, TO_CHAR(amount_sold, '9,999,999,999') AS INDIV_SALE,
       TO_CHAR(SUM(amount_sold) OVER (PARTITION BY t.time_id ORDER BY t.time_id
ROWS UNBOUNDED PRECEDING), '9,999,999,999') AS CUM_SALES
FROM sales s, times t, customers c
WHERE s.time_id=t.time_id AND s.cust_id=c.cust_id
  AND t.time_id IN 
   (TO_DATE('11-DEC-1999'), TO_DATE('12-DEC-1999')) 
  AND c.cust_id 
BETWEEN 6500 AND 6600
ORDER BY t.time_id;

 

TIME_ID        INDIV_SALE    CUM_SALES
---------             ----------    ---------
12-DEC-99         23           23
12-DEC-99          9           32
12-DEC-99         14           46
12-DEC-99         24           70
12-DEC-99         19           89

 

ROWS UNBOUNDED PRECEDING 表示从第一行到当前行的视窗范围,即使ORDER BY字段的值一样,也实现了第一行到当前行的求和。

 

2.

SELECT c.cust_id, t.calendar_month_desc, TO_CHAR (SUM(amount_sold), 
      '9,999,999,999') AS SALES, TO_CHAR(AVG(SUM(amount_sold))
OVER (ORDER BY c.cust_id, t.calendar_month_desc ROWS 2 PRECEDING),
      '9,999,999,999') AS MOVING_3_MONTH_AVG
FROM sales s, times t, customers c
WHERE s.time_id=t.time_id AND s.cust_id=c.cust_id
   AND t.calendar_year=1999 AND c.cust_id IN (6510)
GROUP BY c.cust_id, t.calendar_month_desc
ORDER BY c.cust_id, t.calendar_month_desc; 

 

CUST_ID CALENDAR       SALES          MOVING_3_MONTH
----------    --------            -------------- --------------
      6510 1999-04             125            125
      6510 1999-05           3,395          1,760
      6510 1999-06           4,080          2,533
      6510 1999-07           6,435          4,637
      6510 1999-08           5,105          5,207
      6510 1999-09           4,676          5,405
      6510 1999-10           5,109          4,963
      6510 1999-11             802          3,529

前两行到当前行求AVG,注意,如果行没有找到,将不计在内。如第二行的AVG为(第一行+第二行的SALES值)/2,

第三行则为(第一行+第二行+第三行的SALES值)/3

 

如果不写ORDER BY子句,便没有视窗概念,功能为REPORTING FUNCTION

 

SELECT prod_category, country_region, sales
FROM (SELECT SUBSTR(p.prod_category,1,8) AS prod_category, co.country_region,
 SUM(amount_sold) AS sales,
MAX(SUM(amount_sold)) OVER (PARTITION BY prod_category) AS MAX_REG_SALES
FROM sales s, customers c, countries co, products p
WHERE s.cust_id=c.cust_id AND c.country_id=co.country_id
   AND s.prod_id =p.prod_id AND s.time_id = TO_DATE('11-OCT-2001')
GROUP BY prod_category, country_region)
WHERE sales = MAX_REG_SALES;

 

   查询结果

PROD_CAT COUNTRY_REGION            SALES MAX_REG_SALES
--------       -------------------- ---------- -------------
Electron    Americas                 581.92        581.92
Hardware Americas                 925.93        925.93
Peripher Americas                3084.48       4290.38
Peripher Asia                      2616.51         4290.38
Peripher Europe                  4290.38       4290.38
Peripher Oceania                  940.43       4290.38
Software Americas                 4445.7        4445.7
Software Asia                    1408.19          4445.7
Software Europe                  3288.83        4445.7
Software Oceania                  890.25        4445.7

  

 

参考资料ORACLE document《Data Warehousing Guide 》,《SQL Language Reference 》

你可能感兴趣的:(oracle,sql,C++,c,C#)