MySQL -- utf8 - utf8mb4

通过一个emoji表情引发的报错来认识下MySQL中独一无二的utf8mb4

1.发现问题

java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\xAB' for column 'name' at row 1
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:964)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3973)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3909)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2527)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2680)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2487)
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
    at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1197)
    at com.example.demo.tst.Three.main(Three.java:51)

碰到这种四个字节(\xF0\x9F\x98\xAB)的报错,说明字符集有问题。

2. 分析问题

第一篇文章中,我们就说过mysql中的字符集转换是怎么玩的,那就一个一个来看下哪里的问题。

表结构:

mysql> show create table account;
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table   | Create Table
                                                                                                                                  |
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| account | CREATE TABLE `account` (
  `id` int(11) NOT NULL AUTO_INCREMENT COMMENT '自增id',
  `name` varchar(100) CHARACTER SET utf8mb4 DEFAULT NULL,
  `balance` int(11) DEFAULT NULL COMMENT '余额',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8     |
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)
  • 先看库中字段的字符集
mysql> show full columns from account;
+---------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+----------+
| Field   | Type         | Collation       | Null | Key | Default | Extra          | Privileges                      | Comment  |
+---------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+----------+
| id      | int(11)      | NULL            | NO   | PRI | NULL    | auto_increment | select,insert,update,references | 自增id   |
| name    | varchar(100) | utf8_general_ci | YES  |     | NULL    |                | select,insert,update,references |          |
| balance | int(11)      | NULL            | YES  |     | NULL    |                | select,insert,update,references | 余额     |
+---------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+----------+
3 rows in set (0.01 sec)

上面报错的字段是name,这里其比较规则为utf8_general_ci,我们知道这是utf8 的比较规则,MySQL中的utf8字符集实际是utf8mb3,最大只能放三个字节。
那就改呗,四个字节的utf8 -- utf8mb4

MySQL版本5.1.13以后开始支持utf8mb4

3.解决问题

  • 检查列字符集:
mysql> alter table account modify name varchar(100) character set utf8mb4 collate utf8mb4_general_ci;
Query OK, 1 row affected (0.76 sec)
Records: 1  Duplicates: 0  Warnings: 0
  • 检查MySQL服务端那三个字符集
mysql> show variables like 'character_set_%';
+--------------------------+---------------------------------------+
| Variable_name            | Value                                 |
+--------------------------+---------------------------------------+
| character_set_client     | utf8                               |
| character_set_connection | utf8                               |
| character_set_database   | utf8                                  |
| character_set_filesystem | binary                                |
| character_set_results    | utf8                               |
| character_set_server     | utf8                                  |
| character_set_system     | utf8                                  |
| character_sets_dir       | E:\MySQL\mysql-5.7.26\share\charsets\ |
+--------------------------+---------------------------------------+
8 rows in set, 1 warning (0.00 sec)

啥?居然都是utf8,没的说,改:

mysql> set names utf8mb4;
Query OK, 0 rows affected (0.00 sec)

再查看下:

mysql> show variables like 'character_set_%';
+--------------------------+---------------------------------------+
| Variable_name            | Value                                 |
+--------------------------+---------------------------------------+
| character_set_client     | utf8mb4                               |
| character_set_connection | utf8mb4                               |
| character_set_database   | utf8                                  |
| character_set_filesystem | binary                                |
| character_set_results    | utf8mb4                               |
| character_set_server     | utf8                                  |
| character_set_system     | utf8                                  |
| character_sets_dir       | E:\MySQL\mysql-5.7.26\share\charsets\ |
+--------------------------+---------------------------------------+
8 rows in set, 1 warning (0.00 sec)

嗯,改回来了~

  • 检查客户端发送字符集(这步有点坑~)

上面两步都能理解,这一步可能有点不明白为啥还要再来一个字符集?
其实,一个字符从出生(发送)到安全落库(落盘),经历的字符集可能有点多,但其第一个字符集就是这个所谓的客户端发送字符集。

虽然要经过这么多的字符集转换,但并不会有那么多的转换,因为实际情况下会将其中几个字符集都设为一致~

我们是通过Java代码连接MySQL,这里可以在Connector/J中指定(省略其他配置):

 jdbc:mysql://localhost:3306/asia?characterEncoding=utf8mb4&useUnicode=true

不过,你会发现这样启动会报错告诉你不支持这么写。。
这里先贴下官方说明:

Setting the Character Encoding

The character encoding between client and server is automatically detected upon connection (provided that the Connector/J connection properties characterEncoding and connectionCollation are not set). You specify the encoding on the server using the system variable character_set_server (for more information, see Server Character Set and Collation). The driver automatically uses the encoding specified by the server. For example, to use the 4-byte UTF-8 character set with Connector/J, configure the MySQL server with character_set_server=utf8mb4, and leave characterEncoding and connectionCollation out of the Connector/J connection string. Connector/J will then autodetect the UTF-8 setting.

MySQL告诉我们如若要使用4个字节的utf8,那你需要做两件事:

  • MySQL服务端配置character_set_server=utf8mb4
  • 不要设置characterEncoding and connectionCollation属性

网上还有一种说法:

Connector/J did not support utf8mb4 for servers 5.5.2 and newer.

Connector/J now auto-detects servers configured with character_set_server=utf8mb4 or treats the Java encoding utf-8 passed usingcharacterEncoding=... as utf8mb4 in the SET NAMES= calls it makes when establishing the connection. (Bug #54175)

这个意思是Connector/J并不支持utf8mb4,不过有两种方式等价:

  • MySQL服务器中配置加上character_set_server=utf8mb4
  • Connector/J中加上characterEncoding=utf8

本文选择相信官方文档,那个Bug #54175贴出来也是求同存异, 也可能是我理解错了[狗头]


我这里就直接改库中字符集了:

mysql> set character_set_server=utf8mb4;
Query OK, 0 rows affected (0.00 sec)

嗯,上面吹了这么多牛逼,结果发现还是不行:

java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\xAB' for column 'name' at row 1
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:964)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3973)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3909)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2527)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2680)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2487)
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
    at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1197)
    at com.example.demo.tst.Three.main(Three.java:51)

原因很简单,需要重启,不过重启读的是配置文件中的配置,本次会话手动设置的字符集将全部失效。
所以建议写在MySQL配置文件中 my.iniwindows系统)
综上的配置如下(省略其他配置):

[client]
# 服务端的三个字符集,效果等同 set NAMES utf8mb4
default-character-set=utf8mb4

[mysqld]
# 指定服务器系统字符集
character-set-server=utf8mb4
collation-server=utf8mb4_general_ci

这里的比较规则是 ci(默认是这种比较规则),是不区分大小写的,如果需要可以改成utf8mb4_bin

再把服务重启一哈:

C:\Windows\system32>net stop mysql
MySQL 服务正在停止..
MySQL 服务已成功停止。

C:\Windows\system32>
C:\Windows\system32>net start mysql
MySQL 服务正在启动 ..
MySQL 服务已经启动成功。

代码再执行就执行成功了,查询看看:

mysql> select * from account;
+----+------+---------+
| id | name | balance |
+----+------+---------+
|  1 |      |      10 |
+----+------+---------+
1 row in set (0.00 sec)

这个表情真丑。。

这里还可以玩一个东西,我们把服务端的三个字符集改成utf8,再查询看看:

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql>
mysql> select * from account;
+----+------+---------+
| id | name | balance |
+----+------+---------+
|  1 | ?    |      10 |
+----+------+---------+
1 row in set (0.00 sec)

还可以在这个时候来把列name的字符集改回utf8试试:

mysql> alter table account modify name varchar(100) character set utf8 collate utf8_general_ci;
ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x98\xAB' for column 'name' at row 2

挺好玩吧~~~


番外

  • 本地开发环境测试的时候发现,当配置文件中加了这个配置character-set-server=utf8mb4,那么无论其他的字符集你怎么玩,连接怎么设置,你的emoji表情都能正常入库,而生产环境不行。

推论: character-set-server这个字符集的作用可能不止于指定发送字符集为utf8mb4, 甚至猜测不能指定发送的字符集

如果有大佬知道的,还望留言告知,万谢!

  • 鉴于以上情况,测试环境characterEncoding=utf8 此配置基本无用.
  • 对于生产环境不太好修改库设置的,可以在数据源中增加 initSql,以druid配置为例:
dataSource = new DruidDataSource();
// 省略其他配置
List sqlList = new ArrayList<>();
sqlList.add("SET NAMES utf8mb4");
dataSource.setConnectionInitSqls(sqlList);

每次连接都会去设置字符集为uft8mb4,生产环境可行。

  • 最后捋一捋思路
    • 库中字段字符集为utf8mb4
    • 连接的字符集要为utf8mb4, 就是连接过程中的那三个, 通过SET NAMES UTF8MB4指定,这一步可以走库配置也可以走数据源设置,具体见上文
    • 发送字符集要为utf8mb4,也就是这个配置character-set-server=utf8mb4

按照这个思路,BMP中的字符保管你全部都能存。

BMP是啥? 我下篇文章就写写Unicode吧,等我~

你可能感兴趣的:(MySQL -- utf8 - utf8mb4)