CDH Hive: A General Method for Fixing Garbled Chinese Text

1. Symptoms

1.1 Garbled Chinese comments


hive> DESCRIBE test;
OK
# col_name              data_type               comment
id                      string                  ??ID ??             
pcs                     string                  ?????               
mzmc                    string                  ????                
gzdb_addtime            string                  ???????             
swdd                    string                  ????                
swyydm                  string                  ??????              
dz                      string                  ????                           
xm                      string                  ??                  
gjmc                    string                  ????                
zt                      string                  ?? 

2. The Fix

2.1 Specifying utf8 when creating the database

CREATE DATABASE my_database
COMMENT 'My database'
LOCATION 'hdfs://localhost:9000/user/hive/warehouse/my_database.db'
WITH DBPROPERTIES ('charset'='utf8');

2.2 Specifying utf8 when creating the table

CREATE TABLE `test_table`(
  `xm` string COMMENT '姓名',
  `xb` string COMMENT '性别',
  `nl` string COMMENT '年龄',
  `czdz` string COMMENT '常住地址',
  `hjdz` string COMMENT '户籍地址',
  `csrq` string COMMENT '出生日期')
COMMENT '测试'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('charset'='utf8');

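Neither of the approaches above actually fixes the encoding problem. The root cause is not in Hive itself but in the MySQL database that stores the Hive metadata.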
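The `?` characters are what a lossy character-set conversion typically looks like. As a rough illustration in Python (this is not MySQL itself; the assumption is that a latin1 metastore column behaves like a single-byte target encoding), encoding Chinese text into latin1 with replacement yields one `?` per character, which matches the `??` shown for the two-character comment on `xm` in the DESCRIBE output above:

```python
# Illustration only: Python's codec stands in for the latin1 metastore column.
comment = "姓名"  # a two-character column comment

# latin1 has no mapping for Chinese characters, so each one is replaced by "?".
stored = comment.encode("latin-1", errors="replace")
print(stored)

# The same text survives a round trip through UTF-8 unchanged.
round_trip = comment.encode("utf-8").decode("utf-8")
print(round_trip == comment)
```

Once a comment has been reduced to `?` characters, the original text cannot be recovered from the stored value.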


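2.3 Checking and changing the MySQL encoding

2.3.1 Changing the encoding of the Hive metastore database

(1) Check the encoding of the Hive metastore database (it shows utf8mb3)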

mysql> show create database hive;
+----------+-----------------------------------------------------------------------------------------------------+
| Database | Create Database                                                                                     |
+----------+-----------------------------------------------------------------------------------------------------+
| hive     | CREATE DATABASE `hive` /*!40100 DEFAULT CHARACTER SET utf8mb3 */ /*!80016 DEFAULT ENCRYPTION='N' */ |
+----------+-----------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

(2) Change the database default encoding to latin1

mysql> alter database hive character set latin1;
Query OK, 1 row affected (0.01 sec)

2.3.2 Changing the table encoding

(1) List the tables in the hive database

mysql> use hive;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+-------------------------------+
| Tables_in_hive                |
+-------------------------------+
| AUX_TABLE                     |
| BUCKETING_COLS                |
| CDH_VERSION                   |
| CDS                           |
| COLUMNS_V2                    |
| COMPACTION_QUEUE              |
| COMPLETED_COMPACTIONS         |
| COMPLETED_TXN_COMPONENTS      |
| CTLGS                         |
| CTLGS_bak20230606             |
| DATABASE_PARAMS               |
| DBS                           |
| DB_PRIVS                      |
| DELEGATION_TOKENS             |
| FUNCS                         |
| FUNC_RU                       |
| GLOBAL_PRIVS                  |
| HIVE_LOCKS                    |
| IDXS                          |
| INDEX_PARAMS                  |
| I_SCHEMA                      |
| KEY_CONSTRAINTS               |
| MASTER_KEYS                   |
| MATERIALIZATION_REBUILD_LOCKS |
| METASTORE_DB_PROPERTIES       |
| MIN_HISTORY_LEVEL             |
| MV_CREATION_METADATA          |
| MV_TABLES_USED                |
| NEXT_COMPACTION_QUEUE_ID      |
| NEXT_LOCK_ID                  |
| NEXT_TXN_ID                   |
| NEXT_WRITE_ID                 |
| NOTIFICATION_LOG              |
| NOTIFICATION_SEQUENCE         |
| NUCLEUS_TABLES                |
| PARTITIONS                    |
| PARTITION_EVENTS              |
| PARTITION_KEYS                |
| PARTITION_KEY_VALS            |
| PARTITION_PARAMS              |
| PART_COL_PRIVS                |
| PART_COL_STATS                |
| PART_PRIVS                    |
| REPL_TXN_MAP                  |
| ROLES                         |
| ROLE_MAP                      |
| RUNTIME_STATS                 |
| SCHEMA_VERSION                |
| SDS                           |
| SD_PARAMS                     |
| SEQUENCE_TABLE                |
| SERDES                        |
| SERDE_PARAMS                  |
| SKEWED_COL_NAMES              |
| SKEWED_COL_VALUE_LOC_MAP      |
| SKEWED_STRING_LIST            |
| SKEWED_STRING_LIST_VALUES     |
| SKEWED_VALUES                 |
| SORT_COLS                     |
| TABLE_PARAMS                  |
| TAB_COL_STATS                 |
| TBLS                          |
| TBL_COL_PRIVS                 |
| TBL_PRIVS                     |
| TXNS                          |
| TXN_COMPONENTS                |
| TXN_TO_WRITE_ID               |
| TYPES                         |
| TYPE_FIELDS                   |
| VERSION                       |
| WM_MAPPING                    |
| WM_POOL                       |
| WM_POOL_TO_TRIGGER            |
| WM_RESOURCEPLAN               |
| WM_TRIGGER                    |
| WRITE_SET                     |
+-------------------------------+
76 rows in set (0.00 sec)

(2) The following tables need their encoding changed

# COLUMNS_V2
# TABLE_PARAMS
# PARTITION_PARAMS
# PARTITION_KEYS
# INDEX_PARAMS

(3) Check the current table encoding

# Inspect table COLUMNS_V2 (the others are omitted)
mysql> show create table COLUMNS_V2;
+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table      | Create Table                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| COLUMNS_V2 | CREATE TABLE `COLUMNS_V2` (
  `CD_ID` bigint NOT NULL,
  `COMMENT` varchar(256) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  `COLUMN_NAME` varchar(767) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `TYPE_NAME` mediumtext,
  `INTEGER_IDX` int NOT NULL,
  PRIMARY KEY (`CD_ID`,`COLUMN_NAME`),
  KEY `COLUMNS_V2_N49` (`CD_ID`),
  CONSTRAINT `COLUMNS_V2_FK1` FOREIGN KEY (`CD_ID`) REFERENCES `CDS` (`CD_ID`) ON DELETE RESTRICT ON UPDATE RESTRICT
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

(4) Change the encoding of the comment and parameter columns in the tables above

mysql> alter table COLUMNS_V2 modify column COMMENT varchar(256) character set utf8;

mysql> alter table TABLE_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8;

mysql> alter table PARTITION_PARAMS  modify column PARAM_VALUE varchar(4000) character set utf8;

mysql> alter table PARTITION_KEYS  modify column PKEY_COMMENT varchar(4000) character set utf8;

mysql> alter table  INDEX_PARAMS  modify column PARAM_VALUE  varchar(4000) character set utf8;
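If the statements need to be reviewed or re-run on several clusters, they can be generated from a single mapping. The following is a minimal sketch: `TARGETS` and `build_alter_statements` are hypothetical names introduced here, and the column types are copied from the statements above rather than read from the live schema.

```python
# Hypothetical helper: the (table, column, type) triples mirror the statements above.
TARGETS = [
    ("COLUMNS_V2", "COMMENT", "varchar(256)"),
    ("TABLE_PARAMS", "PARAM_VALUE", "varchar(4000)"),
    ("PARTITION_PARAMS", "PARAM_VALUE", "varchar(4000)"),
    ("PARTITION_KEYS", "PKEY_COMMENT", "varchar(4000)"),
    ("INDEX_PARAMS", "PARAM_VALUE", "varchar(4000)"),
]


def build_alter_statements(targets, charset="utf8"):
    """Return one ALTER TABLE ... MODIFY COLUMN statement per target column."""
    return [
        f"ALTER TABLE {table} MODIFY COLUMN {column} {col_type} CHARACTER SET {charset};"
        for table, column, col_type in targets
    ]


statements = build_alter_statements(TARGETS)
for stmt in statements:
    print(stmt)
```

Because MODIFY COLUMN replaces the whole column definition, it is worth comparing each type with what `show create table` reports for your metastore version before running the output.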

(5) Check the table encoding after the change

# The COMMENT column is now set to utf8mb3 (the table default remains latin1)
mysql> show create table COLUMNS_V2;
+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table      | Create Table                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| COLUMNS_V2 | CREATE TABLE `COLUMNS_V2` (
  `CD_ID` bigint NOT NULL,
  `COMMENT` varchar(256) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci DEFAULT NULL,
  `COLUMN_NAME` varchar(767) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `TYPE_NAME` mediumtext,
  `INTEGER_IDX` int NOT NULL,
  PRIMARY KEY (`CD_ID`,`COLUMN_NAME`),
  KEY `COLUMNS_V2_N49` (`CD_ID`),
  CONSTRAINT `COLUMNS_V2_FK1` FOREIGN KEY (`CD_ID`) REFERENCES `CDS` (`CD_ID`) ON DELETE RESTRICT ON UPDATE RESTRICT
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
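Reading the character set of each column out of a long `show create table` result by eye is error-prone. The sketch below pulls the per-column character sets out of the DDL text with a regular expression. The `DDL` string is an abbreviated copy of the output above, and `column_charsets` is a hypothetical helper name, not part of MySQL or Hive.

```python
import re

# Abbreviated copy of the SHOW CREATE TABLE output for COLUMNS_V2.
DDL = """CREATE TABLE `COLUMNS_V2` (
  `CD_ID` bigint NOT NULL,
  `COMMENT` varchar(256) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci DEFAULT NULL,
  `COLUMN_NAME` varchar(767) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `TYPE_NAME` mediumtext,
  `INTEGER_IDX` int NOT NULL,
  PRIMARY KEY (`CD_ID`,`COLUMN_NAME`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1"""

# Matches column lines that carry an explicit CHARACTER SET clause.
COLUMN_CHARSET = re.compile(r"^\s*`(\w+)`\s.*?CHARACTER SET (\w+)", re.MULTILINE)


def column_charsets(ddl):
    """Map column name -> explicitly declared character set."""
    return dict(COLUMN_CHARSET.findall(ddl))


charsets = column_charsets(DDL)
print(charsets)
```

Columns without an explicit CHARACTER SET clause (such as `TYPE_NAME` here) are not reported; they use the table default.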

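(6) Recreate the Hive database and tables

# This time the database and table were created without specifying utf-8 as before
# The Chinese comments are no longer garbled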
hive> desc test;
OK                                                  
xm                      string                  姓名                  
sfzhm                   string                  身份证号码               
xbdm                    string                  性别代码                
xb                      string                  性别                  
csrq                    string                  出生日期                
hjdz                    string                  户籍地址                            
lxdh                    string                  联系电话      

3. Fallback

If the steps above did not fix the garbled text, try the following:

Edit the hive-site.xml configuration. My default configuration is shown below:


<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hadoop105:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

Append useUnicode=true&characterEncoding=UTF-8 to the URL (inside the XML file each & must be written as &amp;), so that it becomes:


<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hadoop105:3306/hive?createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

After restarting Hive, create the database and tables again.
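A common stumbling block with this change is that a raw `&` is not valid inside an XML value. The sketch below builds the connection URL from its parts and escapes it for hive-site.xml using Python's standard library. The host `hadoop105` and the parameters come from the configuration above; the variable names are only illustrative.

```python
from xml.sax.saxutils import escape

# Values taken from the configuration above.
base_url = "jdbc:mysql://hadoop105:3306/hive"
params = [
    "createDatabaseIfNotExist=true",
    "useUnicode=true",
    "characterEncoding=UTF-8",
]

# The URL as the JDBC driver should see it.
jdbc_url = base_url + "?" + "&".join(params)

# The same URL as it must appear inside the <value> element.
xml_value = escape(jdbc_url)

print(jdbc_url)
print(xml_value)
```

The escaped form is what goes into the file; the XML parser turns each `&amp;` back into `&` before the URL reaches the driver.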
