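One table in our data warehouse has grown to 137,669,168 rows, and we need to run some special business processing over its data. We therefore decided to build summary tables aggregated by day.
The first SQL statement I wrote was: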
create table views_ads_date
as select a.user_id,count(b.id) as Impressions,b.crdate as DateRange
from ads a join views b on a.id = b.ad_id
group by a.user_id,b.crdate;
Marcus pointed out a problem with this:
No, this will not group per day. This will group per millisecond.
So I changed the query to format the timestamp as a day string:
select a.user_id,count(b.id) as Impressions,
to_char(b.crdate, 'YYYY-MM-DD') as DateRange
from ads a join views b on a.id = b.ad_id
group by a.user_id,DateRange;
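To make the difference between the first two queries concrete, here is a small sketch on made-up rows. The temporary table `views_demo` and its values are invented for illustration and stand in for the real `views` table; the join to `ads` is left out to keep it short.

```sql
-- Hypothetical sample data: three views of the same ad on one day, one on the next.
create temporary table views_demo (id int, ad_id int, crdate timestamp);
insert into views_demo values
  (1, 10, '2011-03-01 09:15:00.123'),
  (2, 10, '2011-03-01 09:15:00.456'),
  (3, 10, '2011-03-01 18:40:12'),
  (4, 10, '2011-03-02 07:05:30');

-- Grouping by the raw timestamp: every distinct instant is its own group,
-- so this returns four rows, each with a count of 1.
select crdate as daterange, count(id) as impressions
from views_demo
group by crdate
order by crdate;

-- Grouping by the formatted day: two rows (3 views on 2011-03-01, 1 on 2011-03-02),
-- but daterange is now text rather than a date.
select to_char(crdate, 'YYYY-MM-DD') as daterange, count(id) as impressions
from views_demo
group by daterange
order by daterange;
```

The second query groups correctly, but the resulting column is a string, which matters for the next objection.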
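Marcus then pointed out that this version has a performance problem:
No. You'll convert the date to a string which is slow to search on.
The next version casts the timestamp to a date instead: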
select a.user_id,count(b.id) as Impressions,b.crdate::date as DateRange
from ads a join views b on a.id = b.ad_id
group by a.user_id,DateRange;
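Marcus asked in return why I had not simply used a date function.
The reason was that, in my tests, date_trunc returned values that looked like 0000-00-00 00:00:00, whereas I wanted the YYYY-MM-DD format, so I wrote it that way.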
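Marcus's reply: No, the format isn't that. It is PRINTED as that. The format is a binary number of the amount of microseconds since 1976-01-01 I think it is.
Big difference. You want this type because it is fast to search at.
(On the date Marcus was unsure of: PostgreSQL stores a timestamp as an 8-byte count of microseconds since 2000-01-01.)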
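Marcus's distinction between how a value is printed and what type it has can be checked with `pg_typeof`. This is a small sketch using a literal timestamp chosen for illustration; it assumes `crdate` is a `timestamp` without time zone, which the original does not state.

```sql
-- Compare the result types of the three ways of reducing a timestamp to a day.
select
  pg_typeof(to_char(ts, 'YYYY-MM-DD')) as to_char_type,     -- text
  pg_typeof(ts::date)                  as cast_type,        -- date
  pg_typeof(date_trunc('day', ts))     as date_trunc_type   -- timestamp without time zone
from (select timestamp '2011-03-01 09:15:00.123' as ts) s;
```

date_trunc('day', ...) keeps the timestamp type, which is why its result is printed with a 00:00:00 time part. The cast to date prints as YYYY-MM-DD while still being a native date value, and only to_char produces a string.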
The final version:
create table views_ads_date
as select a.user_id,count(b.id) as Impressions,date_trunc('day',b.crdate) as DateRange
from ads a join views b on a.id = b.ad_id
group by a.user_id,DateRange;
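Once a summary table of this shape exists, a typical lookup filters on the day column with a range. The sketch below is only an illustration of that idea: the demo tables, the index name, the user id 42 and the date range are all made up, and the real table and data will differ.

```sql
-- Hypothetical miniature of the schema, for illustration only.
create temporary table ads_demo (id int, user_id int);
create temporary table views_sample (id int, ad_id int, crdate timestamp);
insert into ads_demo values (10, 42);
insert into views_sample values
  (1, 10, '2011-03-01 09:15:00'),
  (2, 10, '2011-03-01 18:40:12'),
  (3, 10, '2011-03-02 07:05:30');

-- Same shape as the final statement above, run on the demo tables.
create temporary table views_ads_date_demo as
select a.user_id, count(b.id) as impressions, date_trunc('day', b.crdate) as daterange
from ads_demo a join views_sample b on a.id = b.ad_id
group by a.user_id, daterange;

-- An index on the native timestamp column supports range filters directly.
create index views_ads_date_demo_day_idx on views_ads_date_demo (user_id, daterange);

-- Daily impressions for one user in March 2011, using a half-open range.
select user_id, daterange::date as day, impressions
from views_ads_date_demo
where user_id = 42
  and daterange >= timestamp '2011-03-01'
  and daterange <  timestamp '2011-04-01'
order by daterange;
```

Because daterange is a timestamp and not a string, the range comparison works on the stored value, and the cast to date in the select list only affects how each day is displayed.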