MySQL utf8mb4排序规则

文章直通车:
  • utf8mb4 和 utf8
  • utf8mb4排序规则

一、先了解下 utf8mb4 和 utf8

参考MySQL文档:

  • utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.
  • utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character.
  • utf8: An alias for utf8mb3.

Note

The utf8mb3 character set is deprecated and will be removed in a future MySQL release. Please use utf8mb4 instead. Although utf8 is currently an alias for utf8mb3, at some point utf8 will become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.

UTF-8是使用1~4个字节,一种变长的编码格式。(字符编码 )

mb4即 most bytes 4,使用4个字节来表示完整的UTF-8。而MySQL中的utf8是utfmb3,只有三个字节,节省空间但不能表达全部的UTF-8(比如emoji表情),只能支持“基本多文种平面”(Basic Multilingual Plane,BMP)。

所以推荐使用utf8mb4。

二、utf8mb4排序规则:utf8mb4_unicode_ci、utf8mb4_general_ci、utf8mb4_bin

utf8mb4_unicode_ci 和 utf8mb4_general_ci 的对比:

To further illustrate, the following equalities hold in both utf8_general_ci and utf8_unicode_ci (for the effect of this in comparisons or searches, see Section 10.8.6, “Examples of the Effect of Collation”):

Ä = A

Ö = O

Ü = U

A difference between the collations is that this is true for utf8_general_ci:

ß = s

Whereas this is true for utf8_unicode_ci, which supports the German DIN-1 ordering (also known as dictionary order):

ß = ss

MySQL implements utf8 language-specific collations if the ordering with utf8_unicode_ci does not work well for a language. For example, utf8_unicode_ci works fine for German dictionary order and French, so there is no need to create special utf8 collations.

utf8_general_ci also is satisfactory for both German and French, except that ß is equal to s, and not to ss. If this is acceptable for your application, you should use utf8_general_ci because it is faster. If this is not acceptable (for example, if you require German dictionary order), use utf8_unicode_ci because it is more accurate.

utf8mb4_general_ci, utf8mb4_unicode_ci:ci即case insensitive,不区分大小写。

准确性:

  • utf8mb4_unicode_ci 是基于标准的Unicode来排序和比较,能够在各种语言之间精确排序
  • utf8mb4_general_ci 没有实现Unicode排序规则,在遇到某些特殊语言或者字符集,排序结果可能不一致。但是,在绝大多数情况下,这些特殊字符的顺序并不需要那么精确

性能:

  • utf8mb4_general_ci 在比较和排序的时候更快
  • utf8mb4_unicode_ci 在特殊情况下,Unicode排序规则为了能够处理特殊字符的情况,实现了略微复杂的排序算法。但是在绝大多数情况下发,不会发生此类复杂比较。相比选择哪一种collation,使用者更应该关心字符集与排序规则在db里需要统一。

utf8mb4_bin:将字符串每个字符用二进制数据编译存储,区分大小写,而且可以存二进制的内容。


参考:

  1. https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html
  2. https://dev.mysql.com/doc/refman/8.0/en/charset-collation-effect.html

你可能感兴趣的:(技术交流,MYSQL,utf8mb4,utf8mb4)