innodb 36个问题

Heikki Tuuri Innodb answers – Part I

Its almost a month since I promised Heikki Tuuri to answer Innodb Questions. Heikki is a busy man so I got answers to only some of the questions but as people still poking me about this I decided to publish the answers I have so far. Plus we may get some interesting follow up questions out of this.

I had added my comments to some of the questions. HT will stand for Heikki Tuuri in the answers and PZ for myself.

Q1: Why did you decide to go with gzip compression instead of implementing “classical” approach of index prefix compression ?

HT: The reason is that index prefix compression prevents one from using a binary search on a B-tree node. We could not compare a search key K to an index record R if we only knew the suffix of R. There are, of course, hybrid approaches where some prefix information is stored, and a search almost as efficient as a binary search can be performed.

But InnoDB’s adaptive hash indexes require that the prefix is stored in every record R. An adaptive hash index tries to guess where our search would end in the B-tree. To verify that we arrived at the right position, we need to know the whole record R.

Also, a gzip compression potentially compresses the database more than just prefix compression. The downside in gzip is increased usage of CPU time.

PZ: We will see how it plays when we can see more benchmarks. The big problems I see so far with this feature are ease of use and requirement to “guess” how your data will compress for the whole table. Among other things compression for data pages and uncompressed index pages can be significantly different.

Q2: Does Innodb has pointers to the next/previous page in index tree leaf pages ?

HT: Yes

PZ: It is great as it can help to make full table scans and index scans more efficient. I can’t wait for ability to create physically sorted indexes with Innodb (with index built by sort) this is when this should be helpful the most.

Q3: Does Innodb secondary Indexes trees are allocated in two extents same as primary key tree or stored in the same extent ?

HT: An InnoDB extent is a contiguous block of 64 pages, 16 kB each. Each B-tree index in InnoDB occupies two SEGMENTs. One segment is for the non-leaf nodes, and the other for leaf nodes. The first 32 pages of a segment are allocated from ‘fragment extents’ where any individual page can be allocated to whatever use in whatever segment. After that, a segment always reserves whole 64 page extents.

Thus, the answer to the question is: for small tables, everything can reside within a single extent. For big tables, a secondary index reserves different extents from the clustered index.

Q4: Does Innodb ever merges sequential pages together if they become almost empty

HT: Yes. If a page becomes less than half full, a merge is attempted.

btr0cur.h:

/* In the pessimistic delete, if the page data size drops below this
limit, merging it to a neighbor is tried */

#define BTR_CUR_PAGE_COMPRESS_LIMIT     (UNIV_PAGE_SIZE / 2)

Q5: When Innodb free space becomes available for use within same object? When does Innodb make it available for other objects.

HT: Free space can always be recycled within a segment. A segment can release free space to other segments, if a 64 page extent becomes totally empty, and also it can release individual pages from its 32 ‘fragment extent’ pages.

PZ: This is the reason why when you delete a lot of scattered rows you may not see a “free space” in Innodb tablespace to grow significantly.

Q6: Does Innodb policy replacement algorithm takes into account page position in btree index ?

HT: No. It is a pure LRU algorithm.

PZ: This is one of the things which we would like to play with – I would expect serious improvements are possible especially on high IO pressure workloads.

Q7: Does Innodb has any protection from pages being overwritten in buffer pool by large full table scan

HT: No

PZ: Another possible area of optimization. I frequently see batch jobs killing server performance overtaking buffer pool. Though full table scan is only one of replacement policy optimizations possible.

Q8: How do you assess current state of Innodb scalability with multiple threads and multiple concurrent transactions ?

HT: The scalability in my opinion is ‘mediocre’, and we are working on improving it.

PZ: Thank you for honest acknowledgment and I can’t wait to get hold of further improved versions.

Q9: Do you favour database “self tuning” or prefer to expose tuning options to be available to the user.

HT: I favor self tuning.

PZ: I also like self tuning in theory. But I really like to have tools to steer the database behavior. in the cases when self tuning gets it wrong.

Q10: When Innodb decides to schedule sequential read-ahead, random read ahead ?

HT: InnoDB schedules a sequential read-ahead when it notices an ascending access pattern that passes an extent boundary. Roughly, if most of the pages in the extent have been accessed, and in an ascending order of the
file address, then InnoDB schedules the read of all the 64 pages of the extent where the next index page in the ascending order is. InnoDB has a similar heuristic for descending index scans.

The InnoDB sequential read-ahead is not optimal at all. It should schedule the read of the next extent when we are in the middle of the processing of the previous extent.

Random read-ahead means that when InnoDB notices that if at least 13 pages in an extent have been accessed very recently, then InnoDB reads in all the rest of the pages in the extent.

PZ: Interesting enough, there was some work done by DIKU students improving algorithms used for read ahead and results were significant. They however were not merged in Innodb mainline yet.

Q11: If read-ahead is happening and Innodb needs to access one of the pages being fetched does it has to wait for whole read ahead request to complete

HT: Unfortunately, it often has to wait. A read-ahead of a 64 page extent is
often done as a single 1 MB file read.

PZ: “often” a bit stikes me here. I thought it would be always single read, typically 1MB in size. I should also mention a post on this topic once.

Q12: If read-ahead request was placed to the queue but did not start executing yet will innodb has to wait for it to complete or can just go ahead and read page it needs

HT: InnoDB has to wait.

PZ: This one is a bummer. I thought Innodb will make sure to schedule synchronous wait in front if read-ahead IO has not been started already. The interesting question it brings – how many read-ahead requests can be queued at once.

Q13: Is there any IO concurrency control – how many random and sequential read aheads can be happening at the same time

HT: All read-aheads are performed in the Unix version with a single thread. Only one read-ahead can be happening on Unix at a time. On Windows, InnoDB uses native async I/O, and can perform. many read-aheads at the same time.

PZ: Another thing I’d like to experiment with. I would expect you can gain significantly with multiple IO threads on systems with large amount of hard drives. Google has Patches which let you to try it out.

Q14: Are there any plans to fix Innodb so it unlocks (or never locks) the rows if they were not matched by query where clause when it was not resolved by index

HT: That is already implemented in 5.1: if you set the transaction isolation level READ COMMITTED, then InnoDB normally does not lock the ‘gaps’, and it does not lock rows whose latest committed version does not match the
WHERE condition in a search. Please use this with care: you MUST use row-based binlogging and replication, and remember that the execution is not serializable.

PZ: Cool. This is one of little known new features in MySQL 5.1

Q15: How frequently does Innodb fuzzy checkpointing is activated

HT: InnoDB flushes about 128 dirty pages per flush. That means that under a heavy write load, a new flush and a checkpoint happens more than once per second.

PZ: Not what this answers question exactly. So is it activated as soon as 128 pages must be flushed ? If so how this is discovered when. I think fuzzy checkpointing one of little known aspects of Innodb operations while I see it causing problems every so often.

Q16: How Innodb decided how many pages to flush at each checkpoint interval

HT: Usually it is 128, or less.

PZ: Looks like another magic number to experiment with.

Q17: How InnoBD handles blobs/text fields (needs more info)
According to the documentation, InnoDB put first 768 chars of each text/blob in the page and will allocate some space outside of the page.

However:
1. if the total size of the row is less than 8100 bytes InnoDB will not allocate additional space for blobs, even if each blob is larger than 768.
2. InnoDB will allocate additional space outside of the page for _each_ blob (if we have 8 blobs 8xN bytes will be allocated)

Question:
How much space InnoDB allocates for each blob outside of the page?

HT: For each column that InnoDB needs to store ‘externally’, it allocates
whole 16 kB pages. That will cause a lot of waste of space if the fields
are less that 16 kB.

The ‘zip’ source code tree by Marko has removed most of the 768 byte
local storage in the record. In that source code tree, InnoDB only needs
to store locally a prefix of an indexed column.

PZ: I think it is also very interesting question what happens for blobs larger than 16K – is exact size allocated or also segment based allocation is used.

Q18: Innodb Q: Is Group commit still busted in 5.0/5.1?

HT: Yes. However, work has been done on it by Inaam.

Q19: INNODB uses itâ

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/133735/viewspace-753912/,如需转载,请注明出处,否则将追究法律责任。

转载于:http://blog.itpub.net/133735/viewspace-753912/

你可能感兴趣的:(innodb 36个问题)