[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-FfIuDsO2-1611677436463)(/images/74/1.png)]
Hadoop编码/解码器方式,如下表所示
压缩格式 | 对应的编码/解码 |
---|---|
DEFAULT | org.apache.hadoop.io.compress.DefaultCodec |
Gzip | org.apache.hadoop.io.compress.GzipCodec |
Bzip | org.apache.hadoop.io.compress.BzipCodec |
Snappy | org.apache.hadoop.io.compress.SnappyCodec |
Lzo | org.apache.hadoop.io.compress.LzopCodec |
HiveQL语句最终都将转换成为hadoop中的MapReduce job,而MapReduce job可以有对处理的数据进行压缩。
Hive中间数据压缩
hive.exec.compress.intermediate:默认为false,设置true为激活中间数据压缩功能,就是MapReduce的shuffle阶段对mapper产生中间压缩,在这个阶段,优先选择一个低CPU开销:
set hive.exec.compress.intermediate=true
set mapred.map.output.compression.codec= org.apache.hadoop.io.compress.SnappyCodec
set mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
最终输出结果压缩
hive.exec.compress.output:用户可以对最终生成的Hive表的数据通常也需要压缩。该参数控制这一功能的激活与禁用,设置为true来声明将结果文件进行压缩。
mapred.output.compression.codec:将hive.exec.compress.output参数设置成true后,然后选择一个合适的编解码器,如选择SnappyCodec。设置如下(两种压缩的编写方式是一样的):
set hive.exec.compress.output=true
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
或者
或者
set mapred.output.compress=true
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzopCodec
1. TEXTFILE
create table if not exists textfile_table(
site string,
url string,
pv bigint,
label string)
row format delimited
fields terminated by '\t'
stored as textfile;
插入数据操作:
set hive.exec.compress.output=true; //输出结果压缩开启
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec; //压缩和解压缩编码类列表,用逗号分隔,将所用到解压和压缩码设置其中
insert overwrite table textfile_table select * from testfile_table;
2. SEQUENCEFILE
create table if not exists seqfile_table(
site string,
url string,
pv bigint,
label string)
row format delimited
fields terminated by '\t'
stored as sequencefile;
插入数据操作:
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.output.compression.type=BLOCK;
insert overwrite table seqfile_table select * from testfile_table;
3. RCFILE
create table if not exists rcfile_table(
site string,
url string,
pv bigint,
label string)
row format delimited
fields terminated by '\t'
stored as rcfile;
插入数据操作:
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
insert overwrite table rcfile_table select * from testfile_table;
4. ORCFILE
create table if not exists orcfile_table(
site string,
url string,
pv bigint,
label string)
row format delimited
fields terminated by '\t'
stored as orc;
插入数据操作:
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
insert overwrite table orcfile_table select * from testfile_table;
总结: