异常:
BatchUpdate error! sql:INSERT INTO t_industry_summary_error_info (id, create_time, update_time, rowkey, content, summary) VALUES (?, ?, ?, ?, ?, ?) ;
Caused by: java.sql.SQLException: Incorrect string value: '\xF0\xAB\x93\xB9\xE5\xB0...' for column 'content' at row 1 Query: INSERT INTO t_industry_summary_error_info (id, create_time, update_time, rowkey, content, summary) VALUES (?, ?, ?, ?, ?, ?) Parameters: [[7cce59098a2c458d8542b0b5d9f12dd5, 2018-02-02 16:07:00.065, 2018-02-02 16:07:00.065, 535cf21378c044f77b0e6489e44090cf, 格隆汇22日讯,凯基投顾分析师郭明????将iphonex生命周期的出货量预估大幅下调23%,原因包括中国市场需求疲弱且高售价压抑了换机需求。郭明????之前预计其生命周期出货量在8,000万部,现估6,200万部;下修iphonex第一季及第二季出货量分别至1,800万部及1,300万部。郭明????是12月以来一连串下调iphonex出货量的最新一位,他还预计iphonex的生产将在今年中结束。苹果股价上周五一度下跌1%以上,最终收跌0.45%。a股相关概念股:德赛电池跌0.66%、歌尔股份跌0.12%、吉比特、东山精密跌2.88%、蓝思科技跌1.99%等。港股手机产业链股:信利国际跌超4%,瑞声科技跌超2%,通达集团、丘钛科技、高伟电子和舜宇光学目前均初处于下跌中。, 郭明????之前预计其生命周期出货量在8,000万部,现估6,200万部。郭明????是12月以来一连串下调iphonex出货量的最新一位,他还预计iphonex的生产将在今年中结束。]]
at org.apache.commons.dbutils.AbstractQueryRunner.rethrow(AbstractQueryRunner.java:392)
at org.apache.commons.dbutils.QueryRunner.batch(QueryRunner.java:155)
at org.apache.commons.dbutils.QueryRunner.batch(QueryRunner.java:92)
at cn.org.zeronote.orm.dao.DefaultCommonDao.batchUpdate(DefaultCommonDao.java:674)
... 23 common frames omitted
原因分析:
经过查看Hbase中查到“????”原来在Hbase中以及元数据文件中都存的是“?“字,通过查看编码级发现其unicode编码为:
这是字符集不支持的异常。 新旧数据库都使用的是utf8编码,utf8最大的一个特点,就是它是一种变长的编码方式,它可以使用1~4个字节表示一个符号,根据不同的符号而变化字节长度。其中Emoji表情和一些生僻字(比如“?“)是4个字节,而MySql的utf8编码最多3个字节,所以导致了数据插不进去报错。
具体解决方案:
1、修改Mysql的编码级--使用utf8mb4的mysql编码来容纳这些字符
特别注意:mysql支持的最低版本为5.5.3
可以保证文本的一致性。需要修改mysql的配置信息,具体操作如下:
1).建表的时候添加如下限制:ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
2).在my.cnf上修改如下:
------------------my.cnf------------------------------------------------------
# For advice on how to change settings please see
# http://dev.mysql.com/doc/refman/5.6/en/server-configuration-defaults.html
[client]
default-character-set=utf8mb4
[mysql]
default-character-set = utf8mb4
# Remove leading # and set to the amount of RAM for the most important data
# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.
# innodb_buffer_pool_size = 128M
# Remove leading # to turn on a very important data integrity option: logging
# changes to the binary log between backups.
# log_bin
# These are commonly set, remove the # and set as required.
# basedir = .....
# datadir = .....
# port = .....
# server_id = .....
# socket = .....
# Remove leading # to set options mainly useful for reporting servers.
# The server defaults are faster for transactions and fast SELECTs.
# Adjust sizes as needed, experiment to find the optimal values.
# join_buffer_size = 128M
# sort_buffer_size = 2M
# read_rnd_buffer_size = 2M
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES
log-error=/var/log/mysqld.log
long_query_time=3
通过命令查看编码级:
SHOW VARIABLES WHERE Variable_name LIKE 'character_set_%' OR Variable_name LIKE 'collation%';
当然上面比较麻烦,也可以修改一列的配置信息,命令如下:
varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci
2、将特殊字符替换--特殊的表情字符
这种方式明显会对数据有所失真,如果对数据准确性要求比较高的话慎用。
summary.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "*");
特别感谢:
本文参考以下文章:
https://www.cnblogs.com/nick-huang/p/5507262.html
http://blog.csdn.net/ahuyangdong/article/details/50739549
https://www.cnblogs.com/junrong624/p/6073700.html