MySQL Indexing Best Practices: Webinar Questions Followup
Peter Zaitsev | August 16, 2012 | Posted In: Insight for DBAs, MySQL
I had a lot of questions on my MySQL Indexing: Best Practices Webinar (both recording and slides are available now). I did not have time to answer some of them, and others are better answered in writing anyway.
Q1:
One developer on our team wants to replace longish (25-30) indexed varchars with an additional bigint column containing the crc64, and change the indexing to be on that column. That would clearly save indexing space. Is this a reasonable performance optimization? (Keep in mind that the prefix adaptive hashing would fail here, because the first 10 or so characters usually are the same.) Of course, UNIQUE index optimizations can no longer be applied either.
A1:
This is a good optimization in many cases. When you apply it, though, remember that a hash can have collisions, so your queries will need to do something like SELECT * FROM TBL WHERE hash=crc32('string') AND string='string'
The other thing to consider is that string comparison in MySQL is case insensitive by default, while the hash comparison will be case sensitive unless you lowercase the string before hashing. I would also note that 25-30 bytes is rather short for such a hack: BIGINT itself is 8 bytes, so the difference in index length, with all the overhead, is not going to be huge. I think this technique is best when you’re working with strings of 100 bytes or more. I say bytes because the string length at which it makes sense is collation specific.
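As a rough sketch of the hash-column approach: the table pages and its columns are made up for illustration, and CRC32() is used because it is built into MySQL (a wider hash would need a different function and a BIGINT column).

```sql
-- Hypothetical table: url is the long string, url_crc is the extra hash column.
CREATE TABLE pages (
  id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
  url     VARCHAR(255) NOT NULL,
  url_crc INT UNSIGNED NOT NULL,      -- CRC32() returns an unsigned 32-bit value
  PRIMARY KEY (id),
  KEY k_url_crc (url_crc)             -- index the short hash, not the long string
) ENGINE=InnoDB;

-- Lowercase before hashing so the hash matches case-insensitive string comparison.
INSERT INTO pages (url, url_crc)
VALUES ('http://example.com/a', CRC32(LOWER('http://example.com/a')));

-- The hash narrows the lookup through the index;
-- the string comparison filters out any hash collisions.
SELECT id, url
FROM pages
WHERE url_crc = CRC32(LOWER('http://Example.com/A'))
  AND url = 'http://Example.com/A';
```

Keeping the hash in sync is the application's job here; every INSERT and UPDATE that touches url has to recompute url_crc.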
Q2:
ORDER By optimization issues: select * from table where A=xxx and B between 100 and 200 order by B
Very common for a date range to also need to be ordered. The question is how can one have optimized indexes and sorting in such a scenario, since inequality ends index usage.
A2:
Actually, in this case an index on (A,B) would work well. If, however, you need to sort by some third column, say C, an index on (A,B,C) would not work, because the range on B prevents the sort from using the index. In this case you can use the trick mentioned in the presentation and convert the sort into a union for small ranges.
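The slides are not reproduced here, so the following is only my reading of how such a rewrite could look. It assumes a hypothetical table t with an index on (A,B,C), a range on B small enough to enumerate, and a query that only needs the first 10 rows.

```sql
-- Original form, where the range on B blocks index-ordered reads by C:
--   SELECT * FROM t WHERE A = 5 AND B BETWEEN 100 AND 102 ORDER BY C LIMIT 10;

-- Rewritten as one branch per value of B. Inside each branch A and B are
-- equalities, so index (A,B,C) can return rows already ordered by C.
(SELECT * FROM t WHERE A = 5 AND B = 100 ORDER BY C LIMIT 10)
UNION ALL
(SELECT * FROM t WHERE A = 5 AND B = 101 ORDER BY C LIMIT 10)
UNION ALL
(SELECT * FROM t WHERE A = 5 AND B = 102 ORDER BY C LIMIT 10)
ORDER BY C
LIMIT 10;
```

The final sort only has to handle at most 30 rows, which is why this pays off for small ranges and stops making sense as the number of branches grows.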
Q3:
In the case of a junction table, would indexing on (foreignkey1, foreignkey2) AND on (foreignkey2, foreignkey1) be a good idea?
A3:
Yes. This is a good practice. Normally I’d do something like
CREATE TABLE LINK (
id1 int unsigned not null,
id2 int unsigned not null,
PRIMARY KEY(id1, id2),
KEY K(id2)
)
engine=INNODB;
when the table has to be traversed in both directions by different queries. This will use the fast primary key for some queries and use key K as a covering index for lookups in the other direction.
For an InnoDB table, id1 is not needed as part of the second key because the PRIMARY KEY columns are appended to it internally anyway. For a MyISAM table you should use K(id2,id1) in the same case. Some people prefer to define the second key as UNIQUE, which has benefits and drawbacks. The benefit is that the optimizer can apply extra optimizations when it knows the index is UNIQUE; the drawback is that the insert buffer cannot be used, which can be important for large, heavily written tables.
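To make the two directions concrete, here is a small sketch. The table definition repeats the one above, and the literal values are arbitrary.

```sql
CREATE TABLE LINK (
  id1 INT UNSIGNED NOT NULL,
  id2 INT UNSIGNED NOT NULL,
  PRIMARY KEY (id1, id2),
  KEY K (id2)
) ENGINE=InnoDB;

-- Direction 1: everything linked from id1 = 10.
-- The condition is a prefix of the primary key.
SELECT id2 FROM LINK WHERE id1 = 10;

-- Direction 2: everything linked to id2 = 20.
-- Key K starts with id2, and InnoDB stores the primary key columns in it,
-- so id1 can be read from the index without touching the row.
SELECT id1 FROM LINK WHERE id2 = 20;
```

Running EXPLAIN on the second query is an easy way to check whether the Extra column reports "Using index" on your version.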
Q4:
In trick #1, will WHERE a IN (2-4) be worse than “WHERE a IN (2,3,4)”? In other words, is a range in the IN clause better than BETWEEN?
A4:
IN(2-4) will not do what you’re implying here. 2-4 will be evaluated as a math expression, so the result will be IN(-2), which is not what you’re looking for.
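This is easy to verify from any MySQL client. The last two statements use a hypothetical table t with an integer column a.

```sql
SELECT 2-4;           -- plain arithmetic: -2
SELECT 3 IN (2-4);    -- 0, because the list contains only -2
SELECT -2 IN (2-4);   -- 1

-- What was probably intended, for an integer column a:
SELECT * FROM t WHERE a IN (2, 3, 4);
SELECT * FROM t WHERE a BETWEEN 2 AND 4;
```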
Q5:
I have a primary index on an int (ID) and other indexes on columns idx1(X,A,B,C) idx2(Y,A,B,C) etc (there are 5) would I be better off making the primary A,B,C,ID and Having other indexes on one column, idx1(X) idx2(Y) etc?
A5:
I would question whether having 5 indexes that differ only by the first column is the best setup. Whether to change the primary key to include such a column prefix depends a lot on what you’re looking for. It will cluster the data by these columns, which can help if you do a lot of range scans on what would become the primary key, but it can also slow down your inserts and leave the primary key significantly fragmented. I would also note that the MySQL optimizer has some restrictions in how well it can deal with the primary key columns appended to a secondary index, especially in a case like the one you’re suggesting. In the end, I would want to see large performance gains before moving to such an unusual setup.
Q6:
Table1 has a primary key. Table2 joins to table1 using Table1’s primary key. Should table2 have an index on the field that is used to join the two tables?
A6:
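The question in this case is how MySQL will execute the join. If it first looks up rows in Table2 using some other index and then goes to Table1 to look up each row by primary key, then you do not need an index on the join field in Table2. If the join runs in the other direction, reading Table1 first and then finding the matching rows in Table2, the index on the join field is what makes that lookup fast.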
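EXPLAIN shows which direction the optimizer picks. The schema below is invented for illustration (table2.table1_id is the join field and k_created stands in for "some other index").

```sql
CREATE TABLE table1 (
  id   INT UNSIGNED NOT NULL,
  name VARCHAR(50) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE table2 (
  id        INT UNSIGNED NOT NULL,
  table1_id INT UNSIGNED NOT NULL,    -- join field, deliberately not indexed
  created   DATE NOT NULL,
  PRIMARY KEY (id),
  KEY k_created (created)
) ENGINE=InnoDB;

-- If the plan lists table2 first (using k_created) and then table1 with a
-- primary key lookup, an index on table2.table1_id would not be used here.
EXPLAIN
SELECT table1.name, table2.created
FROM table2
JOIN table1 ON table1.id = table2.table1_id
WHERE table2.created = '2012-08-16';

-- If other queries start from table1 and need its table2 rows,
-- the join field does need an index:
-- ALTER TABLE table2 ADD KEY k_table1_id (table1_id);
```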
Q7:
In regards to extending an index being better than adding a new one: Let’s say I have a table named PO that has a primary key of PO # and 2 additional fields for vendor_id and order_id. If I have an index on vendor_id, order_id but my query is only selecting on vendor, will the index have any impact on the speed of the query?
A7:
If you extend the index from (vendor_id) to (vendor_id,order_id) you will make each entry 4 bytes longer (assuming order_id is int), which will affect queries that only use vendor_id, but is unlikely to do so significantly. It is likely to be a lot less expensive than having another index on (vendor_id,order_id) in addition to the index on (vendor_id) alone. The cases where you really should worry about the performance impact of extending an index are those where you increase its length dramatically, for example by adding a long varchar column. In such cases it might indeed be better to add another index.
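In DDL terms, extending the index rather than adding a second one could look like this. The column po_number and the index names are placeholders, since the question does not give them.

```sql
CREATE TABLE PO (
  po_number INT UNSIGNED NOT NULL,
  vendor_id INT UNSIGNED NOT NULL,
  order_id  INT UNSIGNED NOT NULL,
  PRIMARY KEY (po_number),
  KEY k_vendor (vendor_id)
) ENGINE=InnoDB;

-- Replace the single-column index with the extended one in one statement.
ALTER TABLE PO
  DROP INDEX k_vendor,
  ADD INDEX k_vendor_order (vendor_id, order_id);

-- Both queries can use k_vendor_order: the first through its leftmost
-- prefix, the second through both columns.
SELECT * FROM PO WHERE vendor_id = 42;
SELECT * FROM PO WHERE vendor_id = 42 AND order_id = 1001;
```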
Q8:
We have a database that has about 400GB of indexes. The indexes can’t fit in memory anymore. How does this affect performance?
A8:
Typically you do not need all your indexes to be in memory, only those portions of them which are accessed frequently. The size of this “working set” depends greatly on the application and can range from 5% of the total size or less to almost 100%. When you go from a working set that fits in memory to one that no longer does, performance can degrade 10 or more times.
Q9:
In which cases should auto-increment be used as primary key?
A9:
Auto-increment is a good default primary key. You should pick something else if you have a good reason to do so. Frequent reasons are that you would benefit from the data being clustered in a different way, or that you have some other natural candidate for the primary key which gets a lot of lookups.
Q10:
How many indexes is too many?
A10:
There is a hard limit on the number of indexes you can have, which is 64 per table in recent MySQL versions. However, that is often far too many. Instead of thinking about the hard limit, I prefer to add indexes only in cases where they have a positive impact on performance. At some point the gains from the indexes you add will be smaller than the performance loss from having too many indexes.
Q11:
Is there a difference between `id` = 5 and `id` IN (5) regarding indexes and performance?
A11:
Recent MySQL versions are smart enough to convert id IN (5) to id=5 (for single-item IN lists). There were times when it would make a difference, though.
Q12:
Would you recommend creating an index in every table you create? Example:
CREATE TABLE user_competition_entry (
user_id INT,
competition_id INT
);
The table is only used to record a user_id and competition_id, nothing more. Would doing a SELECT competition_id, COUNT(user_id) AS user_count FROM user_competition_entry GROUP BY competition_id; be slower without an index?
A12:
I would define (competition_id,user_id) as the PRIMARY KEY for such a table. It will also help the query you mention by allowing the GROUP BY to be performed without a temporary table or external sort.
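Spelled out, the suggested definition and the query from the question would look roughly like this (the NOT NULL constraints are my addition, since primary key columns cannot be NULL):

```sql
CREATE TABLE user_competition_entry (
  competition_id INT NOT NULL,
  user_id        INT NOT NULL,
  PRIMARY KEY (competition_id, user_id)
) ENGINE=InnoDB;

-- Rows are read in primary key order, so each competition_id group arrives
-- contiguously and can be counted one group at a time.
SELECT competition_id, COUNT(user_id) AS user_count
FROM user_competition_entry
GROUP BY competition_id;
```

The primary key also prevents the same user from being entered into the same competition twice, which is usually what such a table wants.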
Q13:
How can we manage indexes on servers from DBA point of view ? Is there any management required or server does everything itself. Especially when using a CMS where DB structure is predefined.
A13:
MySQL Server will not automatically define any indexes for you. Hopefully your CMS already comes with a reasonable set of indexes; if not, you will need to add indexes manually.
Q14:
What are some methods to overcome vastly differing cardinality on a primary key. After running an analyze on a table with 11M rows I’ve seen cardinality range from 19 to over 19,000?
A14:
Cardinality is a property of the data, so you usually deal with it rather than overcome it. The best place to start is looking at the queries around the “outliers”, the keys which have a lot of values, to see if you can make them work well. You might need to redesign the schema and queries to make it work well.
Q15:
How is an index used when the index is on one column and ORDER BY is on another column? Do I need to add an index on the column used in the ORDER BY clause?
A15:
If an index is used for ORDER BY, the same index must also be used for selecting rows from that table, not some other index. Also, you can only have equality comparisons on the index columns that precede the sort column: WHERE A=5 ORDER BY B will use index (A,B) for the sorting optimization. For more complicated conditions you will need to use something like the “Unionizing Order By” trick described in the presentation.
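In other words, if the required ORDER BY cannot be optimized with an index, you can consider the union approach: split one statement into several, order each of them separately, and then union them.
Note that a plain UNION allows only one ORDER BY. To order each part separately before the union, each part needs its own parentheses (which alone will not make the ORDER BY take effect), and each must also be wrapped in an outer SELECT (that is, disguised as two plain SELECTs joined by UNION), as follows: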
SELECT * FROM
(SELECT * FROM t1 WHERE username LIKE 'l%' ORDER BY score ASC) t3
UNION
SELECT * FROM
(SELECT * FROM t1 WHERE username LIKE '%m%' ORDER BY score ASC) t4
Q16:
what is the impact on indexing to use wider UUID such as VARCHAR(36) instead of auto-increment?
A16:
If you’re using UUID, it is at least good to convert it to binary form and store it as VARBINARY(16) for performance reasons. In any case you are likely to get a table which is larger than if you used auto-increment. Having said that, many people use UUIDs quite successfully in applications which do not need to be optimized for peak performance, or in cases where this does not become the bottleneck. Also check out my old article on the topic, which goes into a lot more detail.
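One way to do the conversion with built-in functions is sketched below. The table items is hypothetical, and the UUID literal in the last query is just a sample value.

```sql
CREATE TABLE items (
  id   VARBINARY(16) NOT NULL,
  name VARCHAR(100) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- Strip the dashes from the 36-character text form and pack the remaining
-- 32 hex digits into 16 bytes.
INSERT INTO items (id, name)
VALUES (UNHEX(REPLACE(UUID(), '-', '')), 'example');

-- Convert back to hex for display.
SELECT HEX(id) AS id_hex, name FROM items;

-- Look up by a UUID received in text form.
SELECT name
FROM items
WHERE id = UNHEX(REPLACE('6ccd780c-baba-1026-9564-5b8c656024db', '-', ''));
```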
Q17:
How does MySQL use an index for GROUP BY?
A17:
If you have an index on the column, MySQL can avoid a temporary table or filesort for a GROUP BY on this column. This works because, by scanning data in index order, MySQL gets it already sorted and can look at “one group at a time”, computing aggregate functions as needed.
Q19:
Is there any special concerns or tricks for selecting using some date ranges? or between dates? or after a date?
A19:
Date comparisons work very much like other comparisons, and the same tricks may apply. For example, in some cases you may benefit from converting BETWEEN into an IN list for better index usage.
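A minimal sketch of that conversion, assuming a hypothetical table events with a DATE column day and an index on (day, user_id):

```sql
CREATE TABLE events (
  id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
  day     DATE NOT NULL,
  user_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (id),
  KEY k_day_user (day, user_id)
) ENGINE=InnoDB;

-- Range form: day is a range, so only the day part of the index
-- bounds the scan and user_id is checked for every entry in that range.
SELECT * FROM events
WHERE day BETWEEN '2012-08-14' AND '2012-08-16'
  AND user_id = 5;

-- IN-list form: each (day, user_id) pair becomes an equality lookup,
-- so both index columns are used to locate rows.
SELECT * FROM events
WHERE day IN ('2012-08-14', '2012-08-15', '2012-08-16')
  AND user_id = 5;
```

This only works when the values can be enumerated, so it suits DATE columns and short ranges; a DATETIME column with arbitrary times cannot be rewritten this way.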
Q20:
Is the b+ tree innodb index a single or double linked list at the leaf nodes? your slide showed single but the fact that you can use and index for “order by desc” indicates a double linked list.
A20:
InnoDB has a doubly linked list: each leaf page contains pointers to both the previous and next pages in index order. Note, however, that this is not really a requirement for the ORDER BY DESC optimization. You can still traverse a BTREE in either direction even if there are no leaf page pointers; it just becomes relatively more expensive.
Everyone, Thank you for attending and your questions!
Check out more MySQL Webinars from Percona!
Translation (for reference only):
https://www.oschina.net/translate/mysql-indexing-best-practices-webinar-questions-followup?cmp&p=1#