DISTINCT blows up memory under the Infobright engine:


mysql> explain SELECT count(DISTINCT wuuid) as vnum  FROM visit_info WHERE (1=1) and usertype in (-1) and (begintime >= 1380556800 and begintime <= 1387209600);
+----+-------------+------------+------+---------------+------+---------+------+----------+-----------------------------------+
| id | select_type | table      | type | possible_keys | key  | key_len | ref  | rows     | Extra                             |
+----+-------------+------------+------+---------------+------+---------+------+----------+-----------------------------------+
|  1 | SIMPLE      | visit_info | ALL  | NULL          | NULL | NULL    | NULL | 37778749 | Using where with pushed condition |
+----+-------------+------------+------+---------------+------+---------+------+----------+-----------------------------------+
1 row in set (0.65 sec)

mysql> explain  SELECT count(DISTINCT uuid) as vnum  FROM visit_info WHERE (1=1) and usertype in (-1) and (begintime >= 1380556800 and begintime <= 1387209600);
+----+-------------+------------+------+---------------+------+---------+------+----------+-----------------------------------+
| id | select_type | table      | type | possible_keys | key  | key_len | ref  | rows     | Extra                             |
+----+-------------+------------+------+---------------+------+---------+------+----------+-----------------------------------+
|  1 | SIMPLE      | visit_info | ALL  | NULL          | NULL | NULL    | NULL | 37778749 | Using where with pushed condition |
+----+-------------+------------+------+---------------+------+---------+------+----------+-----------------------------------+
1 row in set (0.00 sec)


mysql> SELECT count(DISTINCT uuid) as vnum  FROM visit_info WHERE (1=1) and usertype in (-1) and (begintime >= 1380556800 and begintime <= 1387209600);


ERROR 9 (HY000): Brighthouse out of resources error: Insufficient memory/disk space

mysql> SELECT count( uuid) as vnum  FROM visit_info WHERE (1=1) and usertype in (-1) and (begintime >= 1380556800 and begintime <= 1387209600);
+----------+
| vnum     |
+----------+
| 37196097 |
+----------+
1 row in set (2.82 sec)


How DISTINCT is implemented:

How does a database implement the DISTINCT operation? There are generally two basic approaches: sorting and hashing.

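The sort approach sorts all rows of the table using the columns named in DISTINCT as the key, then iterates row by row.
Each row is compared with the previous row on the key: if the keys are equal, the current row is discarded and iteration moves on to the next row;
if they differ, the row is output. A side effect of the sort approach is that the output is ordered by the key.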
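
The sort approach described above can be sketched in a few lines of Python. This is only a minimal in-memory illustration, not how Infobright or any particular database implements it; the function name `distinct_by_sort` and the sample rows are made up for the example.

```python
def distinct_by_sort(rows, key):
    """Yield one row per distinct key, in key order (sort approach)."""
    has_previous = False
    previous_key = None
    # Sort every row by the DISTINCT key, then compare each row with the one before it.
    for row in sorted(rows, key=key):
        current_key = key(row)
        if has_previous and current_key == previous_key:
            continue  # same key as the previous row: discard and move on
        yield row
        previous_key = current_key
        has_previous = True


# Hypothetical sample data: (uuid, visit_id) pairs.
rows = [("u3", 1), ("u1", 2), ("u3", 3), ("u2", 4), ("u1", 5)]
print([row[0] for row in distinct_by_sort(rows, key=lambda row: row[0])])
# prints ['u1', 'u2', 'u3']
```

The output comes back ordered by key, which is the side effect mentioned above.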


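The hash approach uses the values of the columns named in DISTINCT as the hash key and distributes all rows of the table into buckets; rows with the same key land in the same bucket, so duplicates are identified naturally.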
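
A minimal sketch of the hash approach, assuming everything fits in memory. Python's built-in `set` stands in for the hash buckets; the function name `distinct_by_hash` and the sample rows are invented for illustration and do not reflect any engine's real code.

```python
def distinct_by_hash(rows, key):
    """Yield the first row seen for each distinct key (hash approach)."""
    seen = set()  # hash table holding every distinct key seen so far
    for row in rows:
        current_key = key(row)
        if current_key not in seen:
            seen.add(current_key)
            yield row


# Hypothetical sample data: (uuid, visit_id) pairs.
rows = [("u3", 1), ("u1", 2), ("u3", 3), ("u2", 4), ("u1", 5)]
print([row[0] for row in distinct_by_hash(rows, key=lambda row: row[0])])
# prints ['u3', 'u1', 'u2']
```

Note that `seen` grows with the number of distinct keys, not with the number of rows. In this sketch, a column with tens of millions of distinct values would need all of them held in memory at once. Whether that is what triggers the Brighthouse error shown earlier is not established here.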

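The sort approach runs into a few problems in an actual implementation:

1. How do you sort when the data set exceeds the memory limit?

2. How can the implementation minimize data copying?

3. If a Sort operator already exists, how can its code be reused?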
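
One common answer to the first question is an external merge sort: sort the data in runs small enough to fit in memory, then merge the sorted runs and drop adjacent duplicates. The sketch below only simulates this, keeping the runs in Python lists where a real system would spill them to disk. The names `sorted_runs`, `distinct_external` and `run_size` are assumptions made for the example.

```python
import heapq
import itertools


def sorted_runs(keys, run_size):
    """Split the input into runs of at most run_size keys and sort each run."""
    iterator = iter(keys)
    while True:
        run = sorted(itertools.islice(iterator, run_size))
        if not run:
            return
        yield run


def distinct_external(keys, run_size):
    """Yield distinct keys in sorted order by merging sorted runs."""
    # A real implementation would write each run to a temporary file here.
    runs = list(sorted_runs(keys, run_size))
    # heapq.merge lazily merges the already-sorted runs into one sorted stream.
    merged = heapq.merge(*runs)
    # Equal keys are now adjacent, so one key per group is enough.
    for key, _group in itertools.groupby(merged):
        yield key


print(list(distinct_external([5, 3, 5, 1, 3, 2, 1, 4], run_size=3)))
# prints [1, 2, 3, 4, 5]
```

The deduplication step is a thin layer over an already-sorted stream, which hints at one way to approach the third question: DISTINCT can consume the output of an existing Sort operator instead of sorting by itself.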

Question: which of the two approaches above uses less memory?


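One last aside: DISTINCT looks a lot like GROUP BY, so what is the difference between them? Put simply, DISTINCT is a very weak form of GROUP BY. For details, see the reposted article listed under the reference blog below.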
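
The relationship between DISTINCT and GROUP BY can be checked on a toy table. This sketch uses Python's built-in `sqlite3` module as a stand-in, not Infobright. The table and column names are borrowed from the queries earlier in this post, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visit_info (uuid TEXT, usertype INTEGER)")
# Hypothetical sample rows, not real data.
conn.executemany(
    "INSERT INTO visit_info VALUES (?, ?)",
    [("a", -1), ("b", -1), ("a", -1), ("c", -1), ("b", -1)],
)

# DISTINCT and a GROUP BY without aggregates return the same set of keys.
distinct_rows = conn.execute(
    "SELECT DISTINCT uuid FROM visit_info ORDER BY uuid"
).fetchall()
group_rows = conn.execute(
    "SELECT uuid FROM visit_info GROUP BY uuid ORDER BY uuid"
).fetchall()

# GROUP BY can additionally compute aggregates per group; DISTINCT cannot.
counted_rows = conn.execute(
    "SELECT uuid, COUNT(*) FROM visit_info GROUP BY uuid ORDER BY uuid"
).fetchall()

print(distinct_rows == group_rows)
# prints True
print(counted_rows)
# prints [('a', 2), ('b', 2), ('c', 1)]
```

In this sense DISTINCT behaves like a GROUP BY that keeps only the grouping key and computes nothing per group.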


Reference blog:

http://blog.csdn.net/maray/article/details/7634543