OS:CentOS 7
Hive:2.3.0
Hadoop:2.7.7
MySQL Server:5.7.10
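Official Hive manual: LanguageManual LZO
Before configuring Hive to use LZO compression, make sure the lzo libraries are correctly installed on the Hadoop cluster and the hadoop-lzo dependency is correctly configured. For reference, see: Configuring LZO compression in Hadoop
Tip: when packaging custom Hive components, do not bundle their dependencies into the package, to avoid version conflicts; just add the extra dependencies to the classpath.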
In the core-site.xml file, add the codec classes for lzo and lzop compression to the io.compression.codecs parameter, and set the io.compression.codec.lzo.class parameter, as shown below:
<property>
  <name>io.compression.codecs</name>
  <value>
    org.apache.hadoop.io.compress.GzipCodec,
    org.apache.hadoop.io.compress.DefaultCodec,
    org.apache.hadoop.io.compress.DeflateCodec,
    org.apache.hadoop.io.compress.BZip2Codec,
    org.apache.hadoop.io.compress.SnappyCodec,
    org.apache.hadoop.io.compress.Lz4Codec,
    com.hadoop.compression.lzo.LzoCodec,
    com.hadoop.compression.lzo.LzopCodec
  </value>
  <description>
    A comma-separated list of the compression codec classes that can
    be used for compression/decompression. In addition to any classes specified
    with this property (which take precedence), codec classes on the classpath
    are discovered using a Java ServiceLoader.
  </description>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
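A quick sanity check for the configuration above can be sketched as a small shell helper. The function name has_lzo_codecs is hypothetical (it is not part of Hadoop), and the check is deliberately coarse: it only confirms that both codec class names appear somewhere in the given file, not that they sit inside the io.compression.codecs value.

```shell
# Sketch: report whether a Hadoop config file mentions both LZO codec classes.
# has_lzo_codecs is a hypothetical helper, not a Hadoop command.
has_lzo_codecs() {
  local conf_file="$1"
  local codec
  for codec in com.hadoop.compression.lzo.LzoCodec \
               com.hadoop.compression.lzo.LzopCodec; do
    # -F matches a fixed string, -q suppresses output;
    # stop at the first class name that is missing.
    if ! grep -Fq "$codec" "$conf_file"; then
      echo "missing: $codec"
      return 1
    fi
  done
  echo "ok"
}
```

It could be called as has_lzo_codecs "$HADOOP_HOME/etc/hadoop/core-site.xml", assuming HADOOP_HOME points at the Hadoop installation.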
In the mapred-site.xml file, configure the following parameters to set the compression used when MR jobs run:
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
  <description>
    Should the outputs of the maps be compressed before being
    sent across the network. Uses SequenceFile compression.
  </description>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
  <description>
    If the map outputs are compressed, how should they be compressed?
  </description>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
  <description>Should the job outputs be compressed?
  </description>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
  <description>If the job outputs are compressed, how should they be compressed?
  </description>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.type</name>
  <value>BLOCK</value>
  <description>If the job outputs are to compressed as SequenceFiles, how should
    they be compressed? Should be one of NONE, RECORD or BLOCK.
  </description>
</property>
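In the $HIVE_HOME/conf/hive-site.xml file, set the following parameters so that Hive uses compression when running queries. The compression algorithm defaults to the one configured in Hadoop, and there are corresponding parameters that can override it: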
<property>
  <name>hive.exec.compress.output</name>
  <value>true</value>
  <description>
    This controls whether the final outputs of a query (to a local/HDFS file or a Hive table)
    is compressed.
    The compression codec and other options are determined from Hadoop config variables
    mapred.output.compress*
  </description>
</property>
<property>
  <name>hive.exec.compress.intermediate</name>
  <value>true</value>
  <description>
    This controls whether intermediate files produced by Hive between multiple map-reduce jobs are compressed.
    The compression codec and other options are determined from Hadoop config variables mapred.output.compress*
  </description>
</property>
CREATE TABLE tmp like emp
STORED AS
INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
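com.hadoop.mapred.DeprecatedLzoTextInputFormat is an implementation of the org.apache.hadoop.mapred.InputFormat interface and is located in the hadoop-lzo jar (e.g. hadoop-lzo-0.4.20.jar). Only LzoTextInputFormat keeps lzo index files from being recognized as data files, and because Hive works with the old-version API, the DeprecatedLzoTextInputFormat shown in the example must be used in order to use lzo splitting.
DeprecatedLzoTextInputFormat only recognizes lzo-compressed files with the .lzo suffix; it cannot recognize lzo-compressed files with the .lzo_deflate suffix. The former are produced by the LzopCodec codec, the latter by LzoCodec. An index can be built for .lzo files but not for .lzo_deflate files, and lzo splitting only works once the lzo index exists.
PS: the following command changes a table's InputFormat/OutputFormat/SerDe: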
-- The following command changes how the table's data is read, written, and serialized/deserialized
ALTER TABLE tmp
SET FILEFORMAT
INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat" -- Hive default OutputFormat
SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'; -- Hive default SerDe
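The suffix rules described above (.lzo from LzopCodec can be indexed, .lzo_deflate from LzoCodec cannot) can be summarized in a small shell sketch. The function name lzo_file_kind and its output labels are made up for illustration; it only inspects the file name and does not read the file.

```shell
# Sketch: classify a file by its suffix, following the rules in the text.
# lzo_file_kind and its labels are hypothetical, chosen for this example.
lzo_file_kind() {
  case "$1" in
    *.lzo.index)   echo "index" ;;                      # index built for a .lzo file
    *.lzo)         echo "lzop-indexable" ;;             # written by LzopCodec
    *.lzo_deflate) echo "lzo-deflate-not-indexable" ;;  # written by LzoCodec
    *)             echo "other" ;;
  esac
}
```

For example, lzo_file_kind 000000_0.lzo prints lzop-indexable, which is the only kind DeprecatedLzoTextInputFormat can split after indexing.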
LOAD DATA INPATH '/tmp/data/output/emp/000000_0.lzo' OVERWRITE INTO TABLE tmp;
Build the .lzo.index index file for the lzo file:
hadoop jar \
/opt/module/hadoop-2.7.7/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar \
com.hadoop.compression.lzo.DistributedLzoIndexer \
/user/hive/warehouse/test.db/tmp/000000_0.lzo
[tomandersen@hadoop101 libs]$ hadoop fs -ls /user/hive/warehouse/test.db/tmp/;
Found 2 items
-rwxr-xr-x 1 tomandersen supergroup 515 2020-06-19 17:43 /user/hive/warehouse/test.db/tmp/000000_0.lzo
-rw-r--r-- 1 tomandersen supergroup 8 2020-06-21 21:53 /user/hive/warehouse/test.db/tmp/000000_0.lzo.index
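To spot .lzo files that still lack an index, the listing above can be filtered with a short awk pipeline. This is only a sketch: unindexed_lzo is a hypothetical helper name, it reads hadoop fs -ls output on standard input, and it assumes paths contain no spaces.

```shell
# Sketch: print every .lzo path in an `hadoop fs -ls` listing that has no
# matching .lzo.index entry. unindexed_lzo is a hypothetical helper.
unindexed_lzo() {
  awk '
    { path = $NF }                      # the path is the last field of a listing line
    path ~ /\.lzo\.index$/ {            # remember which .lzo files have an index
      sub(/\.index$/, "", path)
      indexed[path] = 1
      next
    }
    path ~ /\.lzo$/ { lzo[path] = 1 }   # remember every .lzo data file
    END {
      for (p in lzo)
        if (!(p in indexed)) print p
    }
  ' | sort
}
```

A possible use is hadoop fs -ls /user/hive/warehouse/test.db/tmp/ | unindexed_lzo; with the listing shown above it prints nothing, because 000000_0.lzo already has its index.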
When the table's InputFormat is DeprecatedLzoTextInputFormat, the lzo index file is no longer treated as a data file and the query result is correct:
0: jdbc:hive2://hadoop101:10000/default (test)> select * from test.tmp;
+------------+------------+----------+------------+----------+---------------+----------+-----------+-------------+
| tmp.empno | tmp.ename | tmp.sex | tmp.job | tmp.mgr | tmp.hiredate | tmp.sal | tmp.comm | tmp.deptno |
+------------+------------+----------+------------+----------+---------------+----------+-----------+-------------+
| 7369 | SMITH | male | CLERK | 7902 | 1980-12-17 | 800.0 | NULL | 20 |
| 7499 | ALLEN | male | SALESMAN | 7698 | 1981-2-20 | 1600.0 | 300.0 | 30 |
| 7521 | WARD | female | SALESMAN | 7698 | 1981-2-22 | 1250.0 | 500.0 | 30 |
| 7566 | JONES | male | MANAGER | 7839 | 1981-4-2 | 2975.0 | NULL | 20 |
| 7654 | MARTIN | female | SALESMAN | 7698 | 1981-9-28 | 1250.0 | 1400.0 | 30 |
| 7698 | BLAKE | male | MANAGER | 7839 | 1981-5-1 | 2850.0 | NULL | 30 |
| 7782 | CLARK | male | MANAGER | 7839 | 1981-6-9 | 2450.0 | NULL | 10 |
| 7788 | SCOTT | male | ANALYST | 7566 | 1987-4-19 | 3000.0 | NULL | 20 |
| 7839 | KING | female | PRESIDENT | NULL | 1981-11-17 | 5000.0 | NULL | 10 |
| 7844 | TURNER | female | SALESMAN | 7698 | 1981-9-8 | 1500.0 | 0.0 | 30 |
| 7876 | ADAMS | male | CLERK | 7788 | 1987-5-23 | 1100.0 | NULL | 20 |
| 7900 | JAMES | male | CLERK | 7698 | 1981-12-3 | 950.0 | NULL | 30 |
| 7902 | FORD | male | ANALYST | 7566 | 1981-12-3 | 3000.0 | NULL | 20 |
| 7934 | MILLER | female | CLERK | 7782 | 1982-1-23 | 1300.0 | NULL | 10 |
+------------+------------+----------+------------+----------+---------------+----------+-----------+-------------+
14 rows selected (0.244 seconds)
When the table's InputFormat is not DeprecatedLzoTextInputFormat, the lzo index file is read as a data file and the query result contains an extra row of NULL values:
0: jdbc:hive2://hadoop101:10000/default (test)> select * from test.tmp;
+------------+------------+----------+------------+----------+---------------+----------+-----------+-------------+
| tmp.empno | tmp.ename | tmp.sex | tmp.job | tmp.mgr | tmp.hiredate | tmp.sal | tmp.comm | tmp.deptno |
+------------+------------+----------+------------+----------+---------------+----------+-----------+-------------+
| 7369 | SMITH | male | CLERK | 7902 | 1980-12-17 | 800.0 | NULL | 20 |
| 7499 | ALLEN | male | SALESMAN | 7698 | 1981-2-20 | 1600.0 | 300.0 | 30 |
| 7521 | WARD | female | SALESMAN | 7698 | 1981-2-22 | 1250.0 | 500.0 | 30 |
| 7566 | JONES | male | MANAGER | 7839 | 1981-4-2 | 2975.0 | NULL | 20 |
| 7654 | MARTIN | female | SALESMAN | 7698 | 1981-9-28 | 1250.0 | 1400.0 | 30 |
| 7698 | BLAKE | male | MANAGER | 7839 | 1981-5-1 | 2850.0 | NULL | 30 |
| 7782 | CLARK | male | MANAGER | 7839 | 1981-6-9 | 2450.0 | NULL | 10 |
| 7788 | SCOTT | male | ANALYST | 7566 | 1987-4-19 | 3000.0 | NULL | 20 |
| 7839 | KING | female | PRESIDENT | NULL | 1981-11-17 | 5000.0 | NULL | 10 |
| 7844 | TURNER | female | SALESMAN | 7698 | 1981-9-8 | 1500.0 | 0.0 | 30 |
| 7876 | ADAMS | male | CLERK | 7788 | 1987-5-23 | 1100.0 | NULL | 20 |
| 7900 | JAMES | male | CLERK | 7698 | 1981-12-3 | 950.0 | NULL | 30 |
| 7902 | FORD | male | ANALYST | 7566 | 1981-12-3 | 3000.0 | NULL | 20 |
| 7934 | MILLER | female | CLERK | 7782 | 1982-1-23 | 1300.0 | NULL | 10 |
| NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
+------------+------------+----------+------------+----------+---------------+----------+-----------+-------------+
15 rows selected (0.229 seconds)
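Analysis: this is clearly a class-not-found error. LzoCodec is a class in the hadoop-lzo dependency, so adding the corresponding jar to the classpath fixes it; there are many ways to do that.
Example solution: in the hive-env.sh configuration file, set the HIVE_AUX_JARS_PATH environment variable and put the hadoop-lzo jar in the path it points to (or set the variable to the path of hadoop-lzo.jar itself).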
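As a minimal sketch, the hive-env.sh entry could look like the following. The jar path is the one used by the indexer command earlier in this article and is an assumption about where the jar lives; adjust it to the actual location on your machine.

```shell
# Sketch for $HIVE_HOME/conf/hive-env.sh.
# Assumed path: the hadoop-lzo jar location used earlier in this article.
export HIVE_AUX_JARS_PATH=/opt/module/hadoop-2.7.7/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar
```

Hive needs to be restarted after the change so that the new value is picked up.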
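Analysis: material found online suggests that an error like this can have several causes. The most common is that the lzo-related configuration in Hadoop conflicts with the settings of the Hive table (the exact cause is unknown).
Example solution: in the mapred-site.xml file, set the mapreduce.output.fileoutputformat.compress.codec parameter to any value other than com.hadoop.compression.lzo.LzopCodec, such as com.hadoop.compression.lzo.LzoCodec. In other words, change the codec used for the final MR output files to something other than LzopCodec.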