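The filtered column reports the estimated percentage of rows that remain after the additional WHERE conditions are applied to the rows scanned by the storage engine. I ran into a problem related to it and wanted to see how the estimate is produced. The scope of this note is:
- The code walked through is mainly 5.7.22
- Only equality conditions are considered
- 8.0.28 is tested with histograms
I. How the value is calculated
Debugging shows that the filtered calculation is fairly simple. Its main function is calculate_condition_filter.
Several macros are defined for it: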
/// Filtering effect for equalities: col1 = col2
#define COND_FILTER_EQUALITY 0.1f
/// Filtering effect for inequalities: col1 > col2
#define COND_FILTER_INEQUALITY 0.3333f
/// Filtering effect for between: col1 BETWEEN a AND b
#define COND_FILTER_BETWEEN 0.1111f
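For equality conditions the relevant macro is COND_FILTER_EQUALITY, fixed at 10%, and it takes part in the estimates below. The main steps are:
1. Each WHERE condition that cannot use an index is estimated as max(1/n_rows, 0.1)
This comes from Item_func_eq::get_filtering_effect, where Item_func_eq represents one equality condition in the WHERE clause. The relevant code is: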
return fld->get_cond_filter_default_probability(rows_in_table,
COND_FILTER_EQUALITY);//0.1f
Item_field::get_cond_filter_default_probability
return std::max(static_cast<float>(1/max_distinct_values), default_filter);
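As a rough illustration of step 1, the rule can be written as a tiny standalone function. This is only a sketch: default_equality_filter and kCondFilterEquality are hypothetical names introduced here, not MySQL identifiers, and the function assumes a positive row estimate.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper mirroring step 1; this is not MySQL's actual code.
// rows_in_table is the optimizer's row estimate for the table.
float default_equality_filter(double rows_in_table) {
    assert(rows_in_table > 0);
    const float kCondFilterEquality = 0.1f;  // same value as COND_FILTER_EQUALITY
    // Small tables get 1/rows, larger tables fall back to the fixed 10%.
    return std::max(static_cast<float>(1.0 / rows_in_table), kCondFilterEquality);
}
```

With 7 rows this yields about 0.1429, and with 28755 rows it yields 0.1, since 1/28755 is far below the fixed default.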
2. What exactly is n_rows?
The key question is what this row count, n_rows, actually represents. In calculate_condition_filter we find the following code:
filtered*=
tab->join()->where_cond->get_filtering_effect(tab->table_ref->map(),
used_tables,
&table->tmp_set,
static_cast<double>(tab->records()));
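Here where_cond holds the WHERE conditions that cannot use an index, and the row count that matters is tab->records(). A quick look shows that this value is
the following member of QEP_shared (part of the execution plan structures):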
/**
Either #rows in the table or 1 for const table.
Used in optimization, and also in execution for FOUND_ROWS().
*/
ha_rows m_records;
It is set by QEP_shared::set_records,
and the value ultimately comes from
tab->set_records(tab->found_records= tab->table()->file->stats.records);
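Here tab->table() returns a TABLE structure, which is what we usually call the table cache.
The value is InnoDB's own statistic, the estimated number of rows in the table, as can be seen in ha_innobase::info_low: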
n_rows = ib_table->stat_n_rows;
....
stats.records = (ha_rows) n_rows;
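Put simply, it is the row count reported for the table in mysql.innodb_table_stats / information_schema.tables.
3. With multiple conditions, the individual filters are multiplied to get the final value
This comes from Item_cond_and::get_filtering_effect. AND is itself an item that holds the individual conditions as child items (two in the tests here). The code and its comment say it all: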
/*
Calculated as "Conjunction of independent events":
P(A and B ...) = P(A) * P(B) * ...
*/
while ((item= it++))
filtered*= item->get_filtering_effect(filter_for_table,
read_tables,
fields_to_ignore,
rows_in_table);
return filtered;
This needs little further explanation.
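For a concrete feel of the product rule, here is a minimal sketch. conjunction_filter is a hypothetical stand-in for the loop above, not a MySQL function, and it takes the per-condition estimates as plain numbers.

```cpp
#include <vector>

// Hypothetical stand-in for the AND loop: multiply the per-condition
// estimates, treating the conditions as independent events.
float conjunction_filter(const std::vector<float>& per_condition) {
    float filtered = 1.0f;  // with no conditions nothing is filtered out
    for (float f : per_condition) {
        filtered *= f;
    }
    return filtered;
}
```

Two conditions estimated at 0.1 each combine to about 0.01. The independence assumption is the weak point: if the two columns always hold the same value, the true combined selectivity equals that of a single condition, not the product.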
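4. The final filtered value is compared with 1/n_rows and the larger one is kept: max(filtered, 1/n_rows)
Note the difference from step 1: step 1 applies to each WHERE condition separately, whereas this step applies to the final combined value.
This is done in calculate_condition_filter: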
filtered= max(filtered, 1.0f / tab->records());
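5. If the estimated number of scanned rows multiplied by the final filtered value is below 0.05, filtered is set to 0.05 divided by the scanned rows
Like step 4, this applies to the final combined value and is done in calculate_condition_filter: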
if ((filtered * fanout) < 0.05f) // fanout: estimated rows scanned
filtered= 0.05f/static_cast<float>(fanout);
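Steps 4 and 5 are two lower bounds applied one after the other. The sketch below restates them in isolation; apply_floors and its parameter names are made up for illustration, and positive inputs are assumed.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper for steps 4 and 5; names are illustrative only.
// table_rows: row estimate for the table (tab->records() in the excerpt)
// fanout:     rows the chosen access method is expected to read
float apply_floors(float filtered, double table_rows, double fanout) {
    assert(table_rows > 0 && fanout > 0);
    // Step 4: never go below one row's share of the table.
    filtered = std::max(filtered, static_cast<float>(1.0 / table_rows));
    // Step 5: keep the estimated output at 0.05 rows or more.
    if (filtered * fanout < 0.05f) {
        filtered = 0.05f / static_cast<float>(fanout);
    }
    return filtered;
}
```

Note that the second bound depends on fanout, so it only yields exactly 0.05 when a single row is expected to be read.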
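Overall, filtered is a rough estimate. It is useful as a reference but is usually not very accurate. The inputs to the calculation are:
- 0.1, a hard-coded value
- 0.05, a hard-coded value
- 1/n_rows, which takes no account of data skew; n_rows is the total row count of the table
- The estimated number of rows scanned by the storage engine
As a result the optimizer may pick the wrong driving table for a join, because this selectivity is purely a guess that is not based on statistics such as cardinality (or on index dives). From 8.0 on, histograms can be taken into account. The tests below check whether the calculation really works this way.
II. Tests
First create two tables: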
mysql> create table testfil(a int,b int,c int);
Query OK, 0 rows affected (0.01 sec)
mysql> insert into testfil values(1,1,1);
Query OK, 1 row affected (0.00 sec)
mysql> insert into testfil values(2,2,2);
Query OK, 1 row affected (0.01 sec)
mysql> insert into testfil values(3,3,3);
Query OK, 1 row affected (0.01 sec)
mysql> insert into testfil values(4,4,4);
Query OK, 1 row affected (0.00 sec)
mysql> insert into testfil values(4,4,4);
Query OK, 1 row affected (0.00 sec)
mysql> insert into testfil values(4,4,4);
Query OK, 1 row affected (0.00 sec)
mysql> insert into testfil values(4,4,4);
Query OK, 1 row affected (0.01 sec)
mysql> insert into testfil values(3,3,3);
Query OK, 1 row affected (0.00 sec)
mysql> select * from testfil;
+------+------+------+
| a | b | c |
+------+------+------+
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 4 | 4 | 4 |
| 4 | 4 | 4 |
| 4 | 4 | 4 |
| 4 | 4 | 4 |
| 3 | 3 | 3 |
+------+------+------+
7 rows in set (0.00 sec)
mysql> alter table testfil add key(a);
Query OK, 0 rows affected (0.02 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> create table testfil2 like testfil;
Query OK, 0 rows affected (0.01 sec)
mysql> insert into testfil2 select * from testfil;
Query OK, 7 rows affected (0.01 sec)
Records: 7 Duplicates: 0 Warnings: 0
...
mysql> insert into testfil2 select * from testfil2;
Query OK, 14336 rows affected (4.91 sec)
Records: 14336 Duplicates: 0 Warnings: 0
mysql> select table_name,n_rows from mysql.innodb_table_stats where table_name in ('testfil','testfil2');
+------------+--------+
| table_name | n_rows |
+------------+--------+
| testfil | 7 |
| testfil2 | 28755 |
+------------+--------+
2 rows in set (0.09 sec)
mysql> select table_name,TABLE_ROWS from information_schema.tables where table_name in ('testfil','testfil2');
+------------+------------+
| table_name | TABLE_ROWS |
+------------+------------+
| testfil | 7 |
| testfil2 | 28755 |
+------------+------------+
2 rows in set (0.13 sec)
The point is to have one small table and one large table. We look at each in turn.
- Single WHERE condition, small table
mysql> desc select * from testfil where b=1 ;
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | testfil | NULL | ALL | NULL | NULL | NULL | NULL | 7 | 14.29 | Using where |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.01 sec)
mysql> desc select * from testfil where b=4 ;
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | testfil | NULL | ALL | NULL | NULL | NULL | NULL | 7 | 14.29 | Using where |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
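Although many more rows have b=4 than b=1 (which matches a single row), both get a filtered value of about 1/7. Since 1/7 > 0.1, 14.29% is displayed. Data skew is clearly not considered.
- Single WHERE condition, large table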
mysql> desc select * from testfil2 where b=1 ;
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| 1 | SIMPLE | testfil2 | NULL | ALL | NULL | NULL | NULL | NULL | 28755 | 10.00 | Using where |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
mysql> desc select * from testfil2 where b=4 ;
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| 1 | SIMPLE | testfil2 | NULL | ALL | NULL | NULL | NULL | NULL | 28755 | 10.00 | Using where |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
Here 1/28755 < 0.1, so both show 10%, again regardless of data skew.
- Two WHERE conditions, small table
mysql> desc select * from testfil where b=1 and c=1;
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | testfil | NULL | ALL | NULL | NULL | NULL | NULL | 7 | 14.29 | Using where |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.01 sec)
mysql> desc select * from testfil where b=4 and c=4;
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | testfil | NULL | ALL | NULL | NULL | NULL | NULL | 7 | 14.29 | Using where |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
This still shows 14.29%. The initial product is (1/7)*(1/7), which is smaller than 1/7, so by step 4 the value is raised back to 1/7, i.e. 14.29%.
The debug session:
1416 filtered= max(filtered, 1.0f / tab->records());
(gdb) p filtered
$1 = 0.0204081647 ((1/7)*(1/7))
(gdb) n
1428 if ((filtered * fanout) < 0.05f)
(gdb) p filtered
$2 = 0.142857149
Data skew is still not considered.
- Two WHERE conditions, large table
mysql> desc select * from testfil2 where b=1 and c=1;
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| 1 | SIMPLE | testfil2 | NULL | ALL | NULL | NULL | NULL | NULL | 28755 | 1.00 | Using where |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.01 sec)
mysql> desc select * from testfil2 where b=4 and c=4;
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| 1 | SIMPLE | testfil2 | NULL | ALL | NULL | NULL | NULL | NULL | 28755 | 1.00 | Using where |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
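This is simply two 0.1 factors multiplied: 0.1*0.1=0.01, i.e. 1%. Since 0.01 > 1/28755, it is displayed as is.
- Index lookup plus two more WHERE conditions
This case exercises the 0.05 floor from step 5. The table is: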
mysql> show create table testauto \G
*************************** 1. row ***************************
Table: testauto
Create Table: CREATE TABLE `testauto` (
`a` int(11) NOT NULL,
`b` int(11) DEFAULT NULL,
`c` int(11) DEFAULT NULL,
KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
1 row in set (0.00 sec)
mysql> select count(*) from testauto;
+----------+
| count(*) |
+----------+
| 16384 |
+----------+
1 row in set (0.77 sec)
mysql> select table_name,n_rows from mysql.innodb_table_stats where table_name in ('testauto');
+------------+--------+
| table_name | n_rows |
+------------+--------+
| testauto | 16188 |
+------------+--------+
1 row in set (0.10 sec)
Column a is indexed and its values are unique, so for the following statement
mysql> desc select * from testauto where a=100 and b=1 and c=1;
+----+-------------+----------+------------+------+---------------+------+---------+-------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+---------------+------+---------+-------+------+----------+-------------+
| 1 | SIMPLE | testauto | NULL | ref | a | a | 4 | const | 1 | 5.00 | Using where |
+----+-------------+----------+------------+------+---------------+------+---------+-------+------+----------+-------------+
1 row in set, 1 warning (0.01 sec)
filtered is 5%, so step 5 applies: rows=1 and the computed filter is 0.01, and since 1*0.01 < 0.05 it is set to 0.05, i.e. 5%. The debug session:
1416 filtered= max(filtered, 1.0f / tab->records());
(gdb)
1428 if ((filtered * fanout) < 0.05f)
(gdb) p filtered
$3 = 0.0100000007
(gdb) p fanout
$4 = 1
(gdb) n
1429 filtered= 0.05f/static_cast<float>(fanout);
(gdb) n
1433 bitmap_clear_all(&table->tmp_set);
(gdb) p filtered
$5 = 0.0500000007
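The single-table results so far can be cross-checked with a simplified model of the five steps. This is a sketch under stated assumptions: every non-indexed condition is an equality that gets the default estimate, and estimate_filtered_percent with its parameters is a hypothetical name, not part of MySQL.

```cpp
#include <algorithm>
#include <cassert>

// Simplified model of steps 1 to 5 for n_conditions non-indexed equality
// conditions. Hypothetical function; it only mimics the rules described here.
double estimate_filtered_percent(int n_conditions, double table_rows, double fanout) {
    assert(n_conditions >= 0 && table_rows > 0 && fanout > 0);
    const float one_row_share = static_cast<float>(1.0 / table_rows);
    const float per_condition = std::max(one_row_share, 0.1f);  // step 1
    float filtered = 1.0f;
    for (int i = 0; i < n_conditions; ++i) {
        filtered *= per_condition;                               // step 3
    }
    filtered = std::max(filtered, one_row_share);                // step 4
    if (filtered * fanout < 0.05f) {
        filtered = 0.05f / static_cast<float>(fanout);           // step 5
    }
    return filtered * 100.0;  // EXPLAIN shows a percentage
}
```

Called with (1, 7, 7) and (2, 7, 7) it gives about 14.29, with (1, 28755, 28755) about 10, with (2, 28755, 28755) about 1, and with (2, 16188, 1) about 5, which matches the filtered values in the EXPLAIN output above.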
- Join test
mysql> desc select * from testauto a,testfil2 b where a.a=b.a and a.b=1 and a.c=1;
+----+-------------+-------+------------+------+---------------+------+---------+--------+-------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+--------+-------+----------+-------------+
| 1 | SIMPLE | a | NULL | ALL | a | NULL | NULL | NULL | 16188 | 1.00 | Using where |
| 1 | SIMPLE | b | NULL | ref | a | a | 5 | tp.a.a | 1 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+------+---------+--------+-------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
mysql> desc select * from testauto a,testfil2 b where a.a=b.a and a.b=1 and a.c=1 and a.a=1;
+----+-------------+-------+------------+------+---------------+------+---------+-------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+-------+------+----------+-------------+
| 1 | SIMPLE | a | NULL | ref | a | a | 4 | const | 1 | 5.00 | Using where |
| 1 | SIMPLE | b | NULL | ref | a | a | 5 | const | 4096 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+------+---------+-------+------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
mysql> desc select * from testauto a,testfil2 b where a.a=b.a and a.b=1 and a.c=1 and a.a=1 and b.c=1;
+----+-------------+-------+------------+------+---------------+------+---------+-------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+-------+------+----------+-------------+
| 1 | SIMPLE | a | NULL | ref | a | a | 4 | const | 1 | 5.00 | Using where |
| 1 | SIMPLE | b | NULL | ref | a | a | 5 | const | 4096 | 10.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+-------+------+----------+-------------+
2 rows in set, 1 warning (0.01 sec)
mysql> desc select * from testauto a,testfil2 b where a.a=b.a and a.b=1 and a.c=1 and a.a=1 and b.c=1 and b.b=1;
+----+-------------+-------+------------+------+---------------+------+---------+-------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+-------+------+----------+-------------+
| 1 | SIMPLE | a | NULL | ref | a | a | 4 | const | 1 | 5.00 | Using where |
| 1 | SIMPLE | b | NULL | ref | a | a | 5 | const | 4096 | 1.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+-------+------+----------+-------------+
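I will not analyze this in depth. The same rules apply in joins, except that the join condition a.a=b.a has to be treated as a WHERE condition on the driven table.
III. A quick test on 8.0 (8.0.28)
The same tables are used.
- Without histograms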
mysql> desc select * from testfil where b=4 and c=4;
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | testfil | NULL | ALL | NULL | NULL | NULL | NULL | 7 | 14.29 | Using where |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
mysql> desc select * from testfil2 where b=4 and c=4;
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| 1 | SIMPLE | testfil2 | NULL | ALL | NULL | NULL | NULL | NULL | 86265 | 1.00 | Using where |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
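Data skew is still not considered.
- After collecting histograms
There are very few distinct values here, fewer than the 10 buckets requested, so MySQL should build a singleton histogram (8.0 offers singleton and equi-height histograms, not equi-width):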
mysql> ANALYZE TABLE testfil UPDATE HISTOGRAM ON b,c WITH 10 BUCKETS;
+----------------+-----------+----------+----------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+----------------+-----------+----------+----------------------------------------------+
| mytest.testfil | histogram | status | Histogram statistics created for column 'b'. |
| mytest.testfil | histogram | status | Histogram statistics created for column 'c'. |
+----------------+-----------+----------+----------------------------------------------+
2 rows in set (0.04 sec)
mysql> ANALYZE TABLE testfil2 UPDATE HISTOGRAM ON b,c WITH 10 BUCKETS;
+-----------------+-----------+----------+----------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+-----------------+-----------+----------+----------------------------------------------+
| mytest.testfil2 | histogram | status | Histogram statistics created for column 'b'. |
| mytest.testfil2 | histogram | status | Histogram statistics created for column 'c'. |
+-----------------+-----------+----------+----------------------------------------------+
2 rows in set (5.21 sec)
mysql> desc select * from testfil where b=4 and c=4;
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | testfil | NULL | ALL | NULL | NULL | NULL | NULL | 7 | 32.65 | Using where |
+----+-------------+---------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
mysql> desc select * from testfil2 where b=4 and c=4;
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| 1 | SIMPLE | testfil2 | NULL | ALL | NULL | NULL | NULL | NULL | 86265 | 32.65 | Using where |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
mysql>
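The histogram clearly changes the filtered estimate and makes it more accurate. However, histograms are not collected by default in 8.0, so they are not widely used, and my own knowledge of them is limited. Without a histogram, filtered is still computed as in 5.7.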
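The 32.65 figure is consistent with multiplying per-column value frequencies: 4 of the 7 rows have b=4, 4 of 7 have c=4, and (4/7)*(4/7) is about 0.3265. The sketch below reproduces that arithmetic. It assumes the histogram supplies the exact frequency of the value and that the two estimates are still multiplied as in step 3; both function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: selectivity of `col = value` taken as the value's
// observed frequency, i.e. matching rows divided by total rows.
double frequency_selectivity(long matching_rows, long total_rows) {
    assert(total_rows > 0 && matching_rows >= 0 && matching_rows <= total_rows);
    return static_cast<double>(matching_rows) / static_cast<double>(total_rows);
}

// Combined estimate, in percent, for two equality conditions on columns
// b and c, multiplied as independent events.
double combined_percent(long matching_b, long matching_c, long total_rows) {
    return frequency_selectivity(matching_b, total_rows) *
           frequency_selectivity(matching_c, total_rows) * 100.0;
}
```

combined_percent(4, 4, 7) is about 32.65. In this data b and c always hold the same value, so the real fraction of rows matching b=4 and c=4 is 4/7, about 57%. The histogram removes the skew problem per column, but the independence assumption still underestimates correlated conditions.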
Appendix: breakpoints used
1 breakpoint keep y 0x0000000000eb13ac in main(int, char**) at /opt/percona-server-locks-detail-5.7.22/sql/main.cc:25
breakpoint already hit 1 time
2 breakpoint keep y 0x0000000001727699 in Explain_join::explain_rows_and_filtered() at /opt/percona-server-locks-detail-5.7.22/sql/opt_explain.cc:1480
breakpoint already hit 14 times
3 breakpoint keep y 0x0000000001589411 in calculate_condition_filter(JOIN_TAB const*, Key_use const*, unsigned long long, double, bool)
at /opt/percona-server-locks-detail-5.7.22/sql/sql_planner.cc:1219
breakpoint already hit 26 times
5 breakpoint keep y 0x0000000000fadb41 in Item_equal::get_filtering_effect(unsigned long long, unsigned long long, st_bitmap const*, double)
at /opt/percona-server-locks-detail-5.7.22/sql/item_cmpfunc.cc:7478
breakpoint already hit 4 times
6 breakpoint keep y 0x0000000000f83dc2 in Item_field::get_cond_filter_default_probability(double, float) const at /opt/percona-server-locks-detail-5.7.22/sql/item.cc:7905
breakpoint already hit 3 times