Collected notes on varchar(255)

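When creating tables on MySQL + InnoDB + UTF8, whether we go by our own experience or a DBA's, we usually default to nothing longer than varchar(255).
Dig into why going past 255 is discouraged and most search results say the same thing: once a value exceeds 768 bytes it behaves like TEXT and queries get slow.
After reading the official documentation closely, I found it is not that simple.
First, look at the Row Format documentation, focusing on COMPACT and DYNAMIC, since these are the two row formats we use most today.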
COMPACT Row Format
Tables that use the COMPACT row format store the first 768 bytes of variable-length column values (VARCHAR, VARBINARY, and BLOB and TEXT types) in the index record within the B-tree node, with the remainder stored on overflow pages. Fixed-length columns greater than or equal to 768 bytes are encoded as variable-length columns, which can be stored off-page. For example, a CHAR(255) column can exceed 768 bytes if the maximum byte length of the character set is greater than 3, as it is with utf8mb4.
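In other words: with the COMPACT row format, once a variable-length column exceeds 768 bytes, only the first 768 bytes are kept in the index record and the rest spills to separate overflow pages. A fixed-length column of 768 bytes or more is also treated as variable-length; for example, char(255) under utf8mb4 can exceed 768 bytes.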

If the value of a column is 768 bytes or less, an overflow page is not used, and some savings in I/O may result, since the value is stored entirely in the B-tree node. This works well for relatively short BLOB column values, but may cause B-tree nodes to fill with data rather than key values, reducing their efficiency. Tables with many BLOB columns could cause B-tree nodes to become too full, and contain too few rows, making the entire index less efficient than if rows were shorter or column values were stored off-page.
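When a column value is 768 bytes or less, no overflow page is used, which saves some I/O because the whole value sits in the B-tree node. The downside is that nodes fill up with data rather than key values, which makes the index less efficient. A table with many BLOB columns can make B-tree nodes so full that each holds only a few rows, so the whole index is less efficient than it would be with shorter rows or with column values stored off-page.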

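Seen this way, keeping varchar at or below 255 under the COMPACT row format does make some sense with UTF8: at up to 3 bytes per character, 255 characters take at most 765 bytes, which stays within 768. A longer column that is queried often may have to be read from off-page storage, which is slow. Nor should a table have too many large fields like varchar(255): with too many of them, B-tree nodes hold mostly data rather than key values, and the whole index becomes less efficient.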
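The 255-character rule of thumb can be checked with a little arithmetic. The sketch below is only an illustration: the function names are made up for this note, and the per-character maxima are the usual ones for these MySQL character sets (MySQL's utf8 is an alias for utf8mb3).

```python
# Maximum bytes per character for a few MySQL character sets.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8mb3": 3, "utf8mb4": 4}

# Bytes of a long column that COMPACT keeps in the B-tree node (from the quote above).
COMPACT_INLINE_PREFIX = 768


def max_column_bytes(n_chars, charset):
    """Worst-case byte length of a column declared with n_chars characters."""
    return n_chars * MAX_BYTES_PER_CHAR[charset]


def can_exceed_prefix(n_chars, charset):
    """True if a full-length value could be longer than the 768-byte prefix."""
    return max_column_bytes(n_chars, charset) > COMPACT_INLINE_PREFIX


for charset in MAX_BYTES_PER_CHAR:
    print(charset, max_column_bytes(255, charset), can_exceed_prefix(255, charset))
```

Under utf8mb3 a 255-character column tops out at 765 bytes, while under utf8mb4 it can reach 1020 bytes, so the same declaration lands on different sides of the 768-byte line.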

Now look at the other row format.
DYNAMIC Row Format
When a table is created with ROW_FORMAT=DYNAMIC, InnoDB can store long variable-length column values (for VARCHAR, VARBINARY, and BLOB and TEXT types) fully off-page, with the clustered index record containing only a 20-byte pointer to the overflow page. Fixed-length fields greater than or equal to 768 bytes are encoded as variable-length fields. For example, a CHAR(255) column can exceed 768 bytes if the maximum byte length of the character set is greater than 3, as it is with utf8mb4.
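With the DYNAMIC row format, InnoDB can store an overly long variable-length column entirely off-page, leaving only a 20-byte pointer to the overflow page in the clustered index record. Fixed-length fields of 768 bytes or more are likewise treated as variable-length.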

Whether columns are stored off-page depends on the page size and the total size of the row. When a row is too long, the longest columns are chosen for off-page storage until the clustered index record fits on the B-tree page. TEXT and BLOB columns that are less than or equal to 40 bytes are stored in line.
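Whether a column goes off-page depends on the page size and the total row size. When a row is too long, the longest columns are moved off-page until the clustered index record fits on the B-tree page. TEXT and BLOB columns of 40 bytes or less are stored in line.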

The DYNAMIC row format maintains the efficiency of storing the entire row in the index node if it fits (as do the COMPACT and REDUNDANT formats), but the DYNAMIC row format avoids the problem of filling B-tree nodes with a large number of data bytes of long columns. The DYNAMIC row format is based on the idea that if a portion of a long data value is stored off-page, it is usually most efficient to store the entire value off-page. With DYNAMIC format, shorter columns are likely to remain in the B-tree node, minimizing the number of overflow pages required for a given row.
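DYNAMIC keeps the efficiency of storing the entire row in the index node when it fits (as COMPACT does), but avoids filling B-tree nodes with large amounts of long-column data. It rests on the idea that if part of a long value has to be stored off-page, it is usually most efficient to store the whole value off-page. Shorter columns therefore tend to stay in the B-tree node, which keeps the number of overflow pages per row to a minimum.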

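The documentation shows two clear differences between COMPACT and DYNAMIC. The first is how much data stays in the B-tree node once a column goes off-page: 768 bytes for COMPACT, 20 bytes for DYNAMIC. The second is when off-page storage happens: the COMPACT text gives an explicit 768-byte figure, while the DYNAMIC text only says it depends on the page size and the row size.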
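To get a feel for the first difference, here is a rough sketch using only the two figures quoted above. The function name is hypothetical, and the model ignores any extra bookkeeping bytes a real record carries next to the prefix or pointer.

```python
# Bytes left in the B-tree node for each column that has gone off-page,
# taken from the COMPACT and DYNAMIC descriptions quoted above.
IN_NODE_BYTES_PER_OFF_PAGE_COLUMN = {"compact": 768, "dynamic": 20}


def in_node_bytes(row_format, off_page_columns):
    """Approximate in-node footprint of columns already stored off-page."""
    return IN_NODE_BYTES_PER_OFF_PAGE_COLUMN[row_format] * off_page_columns


for fmt in ("compact", "dynamic"):
    print(fmt, in_node_bytes(fmt, 5))
```

With five long columns off-page, the node still carries 3840 bytes under COMPACT but only 100 bytes under DYNAMIC.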

So let's look for other related material:
Row Size Limits
The internal representation of a MySQL table has a maximum row size limit of 65,535 bytes, even if the storage engine is capable of supporting larger rows. BLOB and TEXT columns only contribute 9 to 12 bytes toward the row size limit because their contents are stored separately from the rest of the row.
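MySQL caps a row at 65,535 bytes internally, even if the storage engine supports larger rows. BLOB and TEXT columns count for only 9 to 12 bytes toward that limit, because their contents are stored separately.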

The maximum row size for an InnoDB table, which applies to data stored locally within a database page, is slightly less than half a page for 4KB, 8KB, 16KB, and 32KB innodb_page_size settings. For example, the maximum row size is slightly less than 8KB for the default 16KB InnoDB page size. For 64KB pages, the maximum row size is slightly less than 16KB.
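For InnoDB, the maximum row size applies to data stored locally within a page and is slightly less than half a page for 4KB, 8KB, 16KB, and 32KB pages. With the default 16KB page, for example, the maximum row is slightly under 8KB. For 64KB pages it is slightly under 16KB.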

If a row containing variable-length columns exceeds the InnoDB maximum row size, InnoDB selects variable-length columns for external off-page storage until the row fits within the InnoDB row size limit. The amount of data stored locally for variable-length columns that are stored off-page differs by row format.
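If a row containing variable-length columns exceeds the InnoDB maximum row size, InnoDB moves variable-length columns off-page until the row fits within the limit. How much of an off-page column is still stored locally differs by row format.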

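This shows that InnoDB limits a row to less than half a page (so that each page holds at least two records, which keeps the index efficient). Since innodb_page_size defaults to 16KB, a row cannot exceed roughly 8KB. With that in mind, it becomes clear when DYNAMIC moves data off-page.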
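The half-page rule from the quoted text can be written down as a small helper. This is a sketch with a made-up function name; it returns an upper bound only, since the documentation says the real limit is "slightly less" than this.

```python
def row_size_upper_bound(page_size_kb):
    """Upper bound in bytes on an in-page InnoDB row, per the rule quoted above.

    The actual limit is slightly lower than the value returned here.
    """
    if page_size_kb in (4, 8, 16, 32):
        return page_size_kb * 1024 // 2  # half a page
    if page_size_kb == 64:
        return 16 * 1024  # 64KB pages are capped near 16KB, not half a page
    raise ValueError("unsupported innodb_page_size: %r KB" % page_size_kb)


for kb in (4, 8, 16, 32, 64):
    print(kb, row_size_upper_bound(kb))
```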

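For example, creating a table with COMPACT + latin1 and 33 char(255) columns is not allowed: under latin1 a char(255) always occupies 255 bytes, and 33 * 255 = 8415 exceeds 8126 (slightly less than the half-page size of 8192), so the CREATE TABLE fails.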
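The same check as a few lines of arithmetic. This is a simplified sketch: the helper names are made up, and it counts column data only, ignoring record headers and other per-row overhead.

```python
ROW_LIMIT = 8126  # row limit for the default 16KB page, as cited above


def total_bytes(n_columns, bytes_per_column):
    """Bytes needed by n_columns fixed-size columns (data only)."""
    return n_columns * bytes_per_column


def max_columns(bytes_per_column, row_limit=ROW_LIMIT):
    """How many fixed-size columns fit under the row limit (data only)."""
    return row_limit // bytes_per_column


print(total_bytes(33, 255), total_bytes(33, 255) > ROW_LIMIT)
print(max_columns(255))
```

By this count at most 31 latin1 char(255) columns fit in one row, which is why 33 are rejected.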
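To make the table creation succeed, there are two options:
1. COMPACT + utf8mb4. Table creation is checked against the maximum length; the maximum size of char(255) exceeds 768 bytes, so the column is treated as variable-length, and variable-length columns can be stored off-page, so the table can be created. Inserts still need care:
Under utf8mb4, a char(255) value does not simply occupy 4 * 255 bytes (English letters take 1 byte, Chinese characters 3, emoji 4). COMPACT and DYNAMIC optimize for variable-length character sets (utf8, utf8mb4) and try to keep a char(255) value at 255 bytes. If every value is pure English and no longer than 255 characters, each column takes 255 bytes, so only 31 columns can be filled; filling any more exceeds the row limit. Why 31? Because 31 * 255 = 7905 fits, while 32 * 255 = 8160 exceeds 8126.
If columns 1-30 hold English and column 31 holds Chinese, the total depends on how many bytes the Chinese text takes; in my test, column 31 could hold about 128 Chinese characters.
The first 30 English columns take 30 * 255 = 7650 bytes and the 128 Chinese characters in column 31 take 128 * 3 = 384 bytes, for a total of 8034.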

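2. DYNAMIC + utf8mb4. The maximum size of char(255) exceeds 768 bytes, so it is treated as variable-length, and under DYNAMIC an off-page column leaves only 20 bytes in the B-tree node.
With a table created this way, all 33 char columns can be filled (up to 255 characters each). The total is at least 33 * 255 bytes (255 bytes per column for English, up to 765 for Chinese), well above 8126, but DYNAMIC moves the longest columns off-page until the row fits.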
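The "move the longest columns off-page until the row fits" behaviour can be pictured with a toy model. This is a deliberately simplified sketch, not InnoDB's actual algorithm: the function name is made up, record headers are ignored, and every off-page column is assumed to leave exactly the 20-byte pointer mentioned above.

```python
POINTER_BYTES = 20  # in-node pointer per off-page column (DYNAMIC description)
ROW_LIMIT = 8126    # row limit for the default 16KB page, as cited above


def choose_off_page(column_bytes, row_limit=ROW_LIMIT):
    """Move the longest columns off-page until the in-node size fits.

    Returns (indices of columns moved off-page, resulting in-node size).
    """
    in_node = list(column_bytes)
    off_page = []
    # Visit columns from longest to shortest (ties keep their original order).
    order = sorted(range(len(in_node)), key=lambda i: in_node[i], reverse=True)
    for idx in order:
        if sum(in_node) <= row_limit:
            break  # the row already fits
        if in_node[idx] <= POINTER_BYTES:
            break  # moving this column would not save any space
        in_node[idx] = POINTER_BYTES
        off_page.append(idx)
    return off_page, sum(in_node)


moved, size = choose_off_page([255] * 33)
print(moved, size)
```

In this model, 33 columns of 255 bytes start at 8415 bytes; moving two of them off-page brings the in-node size down to 7945 bytes, which fits under 8126.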

The optimization that COMPACT and DYNAMIC apply to char(N) under variable-length character sets, mentioned above, is described as follows:
COMPACT Row Format Storage Characteristics
Internally, for variable-length character sets such as utf8mb3 and utf8mb4, InnoDB attempts to store CHAR(N) in N bytes by trimming trailing spaces. If the byte length of a CHAR(N) column value exceeds N bytes, trailing spaces are trimmed to a minimum of the column value byte length. The maximum length of a CHAR(N) column is the maximum character byte length × N.
For variable-length character sets such as utf8mb3 and utf8mb4, InnoDB tries to keep a char(N) value at N bytes by trimming trailing spaces.
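A small model of that rule helps re-check the byte counts used earlier. It is a sketch under assumptions: char_stored_bytes is a made-up name, Python's UTF-8 encoder stands in for utf8mb4, and a value is assumed to occupy at least N bytes, or its own byte length if that is larger.

```python
def char_stored_bytes(value, n, encoding="utf-8"):
    """Approximate bytes used by a CHAR(n) value under a variable-length charset.

    Trailing spaces are trimmed; the value then takes n bytes, or more when
    its encoded length is larger than n.
    """
    trimmed = value.rstrip(" ")
    return max(n, len(trimmed.encode(encoding)))


english = "a" * 255   # 1 byte per character
chinese = "中" * 128   # 3 bytes per character

print(char_stored_bytes(english, 255))
print(char_stored_bytes(chinese, 255))
print(30 * char_stored_bytes(english, 255) + char_stored_bytes(chinese, 255))
```

The three printed values are 255, 384, and 8034, matching the figures in the COMPACT + utf8mb4 example. The sum covers column data only, so a real row has somewhat less room than the gap to 8126 suggests.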
