When a Crawler Meets Emoji

Scenario: while crawling Douban movie reviews, some usernames and review texts contained emoji. Writing them to the database failed with this error:

Incorrect string value: '\xF0\x9F\x98...' for column 'name'

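At this point MySQL's character set was utf8.

Here is the problem: MySQL's utf8 character set (an alias for utf8mb3) stores at most 3 bytes per character, whereas an emoji takes 4 bytes in UTF-8. The \xF0\x9F\x98 in the error message is the start of such a 4-byte sequence, so the insert fails.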
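The byte counts can be checked directly in Python. This is only a minimal sketch: the sample characters are arbitrary, and utf8_len and needs_utf8mb4 are hypothetical helper names, not part of any library.

```python
# Compare how many bytes UTF-8 needs for a few sample characters.
samples = ["a", "中", "😀"]


def utf8_len(ch: str) -> int:
    """Return the number of bytes `ch` occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def needs_utf8mb4(text: str) -> bool:
    """True if any character lies above U+FFFF, i.e. takes 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


for ch in samples:
    # Prints the character, its UTF-8 length, and the raw bytes.
    print(repr(ch), utf8_len(ch), ch.encode("utf-8"))
```

A check like needs_utf8mb4 could be run on scraped usernames and reviews to see which rows would be rejected by a 3-byte utf8 column.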

The current character set settings can be checked with the following statement:

SHOW VARIABLES LIKE 'char%';

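Solution: use utf8mb4. It is a superset of utf8 and allows up to four bytes per character.

Changing the MySQL character set on Linux:

Edit the /etc/my.cnf file (I installed MySQL with yum, and the file is in that location too).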


Under [mysqld], set:

collation_server = utf8mb4_general_ci
character-set-server = utf8mb4

Under [client], set:

default-character-set = utf8mb4
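Before restarting, the edited file could be sanity-checked with a few lines of Python. This is a rough sketch using the standard configparser module. SAMPLE_CNF is a hypothetical fragment that mirrors the settings above, and charset_settings is a made-up helper name. configparser does not understand everything MySQL accepts in an option file (for example, MySQL treats dashes and underscores in option names as interchangeable, while this check does not).

```python
import configparser

# Hypothetical fragment mirroring the settings above;
# in practice the text would be read from /etc/my.cnf.
SAMPLE_CNF = """
[mysqld]
collation_server = utf8mb4_general_ci
character-set-server = utf8mb4

[client]
default-character-set = utf8mb4
"""


def charset_settings(cnf_text: str) -> dict:
    """Collect the charset-related options from my.cnf-style text."""
    parser = configparser.ConfigParser(
        allow_no_value=True,   # my.cnf allows bare flags such as `log_bin`
        interpolation=None,    # do not treat % as special
        strict=False,          # tolerate repeated sections or options
    )
    parser.read_string(cnf_text)
    return {
        "server_charset": parser.get("mysqld", "character-set-server", fallback=None),
        "server_collation": parser.get("mysqld", "collation_server", fallback=None),
        "client_charset": parser.get("client", "default-character-set", fallback=None),
    }


settings = charset_settings(SAMPLE_CNF)
print(settings)
```

Any value that comes back as None means the option was not found under the expected section.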

Restart the service:

service mysqld restart

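You also need to check in a MySQL GUI client, for example SQLyog:

Tables created earlier keep their old character set, so the affected columns in the table have to be changed to utf8mb4 as well. Done.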
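The column change does not have to go through a GUI. MySQL's ALTER TABLE ... CONVERT TO CHARACTER SET statement converts a table and its text columns in one step. Below is a small sketch that only builds the statement as a string. convert_table_sql is a hypothetical helper, and `comments` is an assumed table name, not one taken from the original setup.

```python
def convert_table_sql(table: str, charset: str = "utf8mb4",
                      collation: str = "utf8mb4_general_ci") -> str:
    """Build an ALTER TABLE statement that converts a table to `charset`."""
    # Backtick-quote the identifier and double any embedded backticks.
    quoted = "`" + table.replace("`", "``") + "`"
    return (f"ALTER TABLE {quoted} "
            f"CONVERT TO CHARACTER SET {charset} COLLATE {collation};")


# `comments` is a hypothetical table name for the scraped reviews.
print(convert_table_sql("comments"))
```

The printed statement can be pasted into any MySQL client. The crawler's own database connection would presumably also need to request utf8mb4 (most Python MySQL drivers accept a charset argument), which is not shown here.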

For reference, here is my my.cnf file:

# For advice on how to change settings please see
# http://dev.mysql.com/doc/refman/5.7/en/server-configuration-defaults.html

[mysqld]
#
# Remove leading # and set to the amount of RAM for the most important data
# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.
# innodb_buffer_pool_size = 128M
#
# Remove leading # to turn on a very important data integrity option: logging
# changes to the binary log between backups.
# log_bin
#
# Remove leading # to set options mainly useful for reporting servers.
# The server defaults are faster for transactions and fast SELECTs.
# Adjust sizes as needed, experiment to find the optimal values.
# join_buffer_size = 128M
# sort_buffer_size = 2M
# read_rnd_buffer_size = 2M
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock

# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
collation_server = utf8mb4_general_ci
character-set-server = utf8mb4
# log-error and pid-file are server options, so they stay under [mysqld]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

[client]
default-character-set = utf8mb4

 
