group by前后 数据不一致

相同的筛选条件,有了group by后数值不一样,如

1、原始sql

SELECT

source,

--platform,

max(source_name) as source_name,

uniqExactMerge(device_id_state) as uv,

toStartOfMonth(toDate(date_str)) as date_str

FROM

mdata_flows_user_realtime_day_all table_alias

WHERE

bd_id != '6'

AND shop_type = '1'

AND type = 'prod'

AND source IN ('1')

AND bd_id IN ('5',

'12')

AND platform IN ('12',

'20')

AND (table_alias.date_str between '2021-05-01' and '2021-05-25'

or table_alias.date_str between '2020-05-01' and '2020-05-25'

or table_alias.date_str between '2021-04-01' and '2021-04-25')

GROUP BY

date_str,

source

--platform

结果:

source_name|source|uv |date_str |

-----------|------|-------|----------|

主站        |1    |1135552|2021-05-01

uv=1135552

2、再加一个group by字段

SELECT

new_id,

--platform,

uniqExactMerge(device_id_state) as uv,

toStartOfMonth(toDate(date_str)) as date_str

FROM

mdata_flows_user_realtime_day_all table_alias

WHERE

bd_id != '6'

AND source = '1'

AND shop_type = '1'

AND type = 'prod'

AND bd_id IN ('5',

'12')

AND platform IN ('12',

'20')

AND (table_alias.date_str between '2021-05-01' and '2021-05-25'

or table_alias.date_str between '2020-05-01' and '2020-05-25'

or table_alias.date_str between '2021-04-01' and '2021-04-25')

GROUP BY

date_str,

new_id

--platform

结果:

new_id|uv |date_str |

------|-------|----------|

    0| 207473|2021-05-01|

    1| 961056|2021-05-01

uv=207473+961056=1168529

3、继续添加一个字段platform

SELECT

new_id,

platform,

uniqExactMerge(device_id_state) as uv,

toStartOfMonth(toDate(date_str)) as date_str

FROM

mdata_flows_user_realtime_day_all table_alias

WHERE

bd_id != '6'

AND source = '1'

AND shop_type = '1'

AND type = 'prod'

AND bd_id IN ('5',

'12')

AND platform IN ('12',

'20')

AND (table_alias.date_str between '2021-05-01' and '2021-05-25'

or table_alias.date_str between '2020-05-01' and '2020-05-25'

or table_alias.date_str between '2021-04-01' and '2021-04-25')

GROUP BY

date_str,

new_id,

platform

结果:

new_id|platform|uv |date_str |

------|--------|------|----------|

    1|20      |193091|2021-05-01|

    1|12      |767966|2021-05-01|

    0|20      | 36139|2021-05-01|

    0|12      |171334|2021-05-01|

uv=767966+171334+36139+193091=1168530

可见,三个完全相同筛选条件的sql,只因为group by字段的不同数据有差异。

具体原因就是去重的顺序影响的,也就是 先汇总再去重 和先group by 再去重的区别。

后者往往会大于等于前者,不同group by分组中可能存在相同的去重字段。

你可能感兴趣的:(group by前后 数据不一致)