Snappy, LZO, bzip2, gzip and deflate are all compression formats commonly used for Hive files, each with its own strengths. Here we only care about decompressing concrete files, starting with a small utility class:
package compress;

import java.io.FileInputStream;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;

public class Decompress {

    public static final Log LOG = LogFactory.getLog(Decompress.class.getName());

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register the codecs the factory should know about; the factory maps
        // file name extensions (.gz, .lzo, .bz2, .deflate, ...) to these classes.
        String name = "io.compression.codecs";
        String value = "org.apache.hadoop.io.compress.GzipCodec,"
                + "org.apache.hadoop.io.compress.DefaultCodec,"
                + "com.hadoop.compression.lzo.LzoCodec,"
                + "com.hadoop.compression.lzo.LzopCodec,"
                + "org.apache.hadoop.io.compress.BZip2Codec";
        conf.set(name, value);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Each argument is a compressed file: pick a codec by its extension,
        // decompress it and write the plain contents to stdout.
        for (int i = 0; i < args.length; ++i) {
            CompressionCodec codec = factory.getCodec(new Path(args[i]));
            if (codec == null) {
                System.out.println("Codec for " + args[i] + " not found.");
            } else {
                CompressionInputStream in = null;
                try {
                    in = codec.createInputStream(new FileInputStream(args[i]));
                    byte[] buffer = new byte[100];
                    int len = in.read(buffer);
                    while (len > 0) {
                        System.out.write(buffer, 0, len);
                        len = in.read(buffer);
                    }
                } finally {
                    if (in != null) {
                        in.close();
                    }
                }
            }
        }
    }
}
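For completeness, here is a minimal sketch of the opposite direction, writing a compressed file with a codec chosen directly by class. The Compress class and the choice of GzipCodec are purely illustrative and not part of the original walkthrough:

package compress;

import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class Compress {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Pick a codec directly by class; GzipCodec is only an example here.
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        String input = args[0];
        // Name the output file with the codec's own extension, e.g. foo -> foo.gz
        String output = input + codec.getDefaultExtension();
        FileInputStream in = new FileInputStream(input);
        CompressionOutputStream out = codec.createOutputStream(new FileOutputStream(output));
        try {
            IOUtils.copyBytes(in, out, 4096);
        } finally {
            out.close();
            in.close();
        }
    }
}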
To briefly explain, the core codec classes for these compression formats are:
org.apache.hadoop.io.compress.SnappyCodec
org.apache.hadoop.io.compress.GzipCodec
org.apache.hadoop.io.compress.BZip2Codec
org.apache.hadoop.io.compress.DeflateCodec
org.apache.hadoop.io.compress.DefaultCodec
The LZO codecs (com.hadoop.compression.lzo.LzoCodec and com.hadoop.compression.lzo.LzopCodec) come from the separate hadoop-lzo package.
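Each codec also advertises the file-name extension it is matched against, which matters later when decompressing. A tiny illustrative class (PrintExtensions is my name for it, not from the original article) that just prints them:

package compress;

import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.io.compress.DeflateCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

// Prints the file-name extension each codec is matched against; note that
// DeflateCodec and DefaultCodec share the same ".deflate" suffix.
public class PrintExtensions {
    public static void main(String[] args) {
        System.out.println("SnappyCodec  -> " + new SnappyCodec().getDefaultExtension());
        System.out.println("GzipCodec    -> " + new GzipCodec().getDefaultExtension());
        System.out.println("BZip2Codec   -> " + new BZip2Codec().getDefaultExtension());
        System.out.println("DeflateCodec -> " + new DeflateCodec().getDefaultExtension());
        System.out.println("DefaultCodec -> " + new DefaultCodec().getDefaultExtension());
    }
}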
First we need the dependencies; I put all the jars needed for decompression under /home/apache/test/lib/.
We also need the native libraries used by these compression formats: find a machine with Hadoop installed, copy its $HADOOP_HOME/lib/native directory over, and I put it under /tmp/decompress.
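It can be worth verifying up front that the JVM actually sees those native libraries. A small optional check (the NativeCheck class is my addition, not part of the original code):

package compress;

import org.apache.hadoop.util.NativeCodeLoader;

// Run with -Djava.library.path=/tmp/decompress, the same flag used for the
// decompression runs below; if this prints false, the native-backed codecs
// (snappy, lzo) will fail when we try to decompress.
public class NativeCheck {
    public static void main(String[] args) {
        System.out.println("hadoop native code loaded: "
                + NativeCodeLoader.isNativeCodeLoaded());
    }
}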
Since I don't have the Snappy library installed locally, I'll use Hive to create the snappy-compressed file.
Only two parameters are needed:
hive.exec.compress.output: set to true to request that the result files be compressed
mapred.output.compression.codec: selects the concrete compression codec for the result files
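For orientation, the second parameter is simply the property of the underlying MapReduce job. A hedged sketch of the plain MapReduce equivalent of these two switches (OutputCompressionExample is illustrative only and is not run anywhere in this walkthrough):

package compress;

import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

// Roughly what hive.exec.compress.output=true plus
// mapred.output.compression.codec=...SnappyCodec amount to on the
// (old-API) MapReduce job configuration.
public class OutputCompressionExample {
    public static void main(String[] args) {
        JobConf job = new JobConf();
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        System.out.println(FileOutputFormat.getOutputCompressorClass(job, SnappyCodec.class));
    }
}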
Check these two parameters in the hive shell, set the codec to the Snappy format we want, and then run any SQL that writes its result to a local directory:
> set hive.exec.compress.output;
hive.exec.compress.output=true
hive> set mapred.output.compression.codec;
mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/snappy' select * from info900m limit 20;
At this point we have the result file /tmp/snappy/000000_0.snappy.
In the same way, we set the compression format to lzo:
> set hive.exec.compress.output;
hive.exec.compress.output=true
hive> set mapred.output.compression.codec;
mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
hive> set mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/lzo' select * from info900m limit 20;
This gives us the result file /tmp/lzo/000000_0.lzo.
Create a bz2 file:
[apache@indigo bz2]$ cp /etc/resolv.conf .
[apache@indigo bz2]$ cat resolv.conf
# Generated by NetworkManager
domain dhcp
search dhcp server
nameserver 192.168.0.1
Create a gz file:
[apache@indigo bz2]$ tar zcf resolv.conf.gz resolv.conf
(Note that tar zcf actually produces a gzip-compressed tar archive rather than a bare gzip stream, which is why the decompressed output further below still shows the tar header in front of the file contents.)
Finally, switch the output codec to DefaultCodec to produce a deflate file:
> set mapred.output.compression.codec;
mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/deflate' select * from info900m limit 20;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_1385947742139_0006, Tracking URL = http://indigo:8088/proxy/application_1385947742139_0006/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1385947742139_0006
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
2013-12-02 13:30:48,522 Stage-1 map = 0%, reduce = 0%
2013-12-02 13:30:56,271 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 1.2 sec
2013-12-02 13:30:57,330 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.85 sec
......
2013-12-02 13:31:15,508 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.85 sec
2013-12-02 13:31:16,552 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.85 sec
MapReduce Total cumulative CPU time: 4 seconds 850 msec
Ended Job = job_1385947742139_0006 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1385947742139_0006_m_000003 (and more) from job job_1385947742139_0006
Task with the most failures(4):
-----
Task ID:
task_1385947742139_0006_r_000000
URL:
http://indigo:8088/taskdetails.jsp?jobid=job_1385947742139_0006&tipid=task_1385947742139_0006_r_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"20130526","_col1":"20130526","_col2":"SXY","_col3":"4577020","_col4":"20","_col5":"P029124","_col6":"1","_col7":"612423196707110625","_col8":"","_col9":"Y1","_col10":"20130526"},"alias":0}
at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:270)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:460)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:407)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"20130526","_col1":"20130526","_col2":"SXY","_col3":"4577020","_col4":"20","_col5":"P029124","_col6":"1","_col7":"612423196707110625","_col8":"","_col9":"Y1","_col10":"20130526"},"alias":0}
at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:258)
... 7 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.DefaultCode was not found.
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:479)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:543)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
at org.apache.hadoop.hive.ql.exec.LimitOperator.processOp(LimitOperator.java:51)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:249)
... 7 more
Caused by: java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.DefaultCode was not found.
at org.apache.hadoop.mapred.FileOutputFormat.getOutputCompressorClass(FileOutputFormat.java:94)
at org.apache.hadoop.hive.ql.exec.Utilities.getFileExtension(Utilities.java:910)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:469)
... 16 more
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.io.compress.DefaultCode not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
at org.apache.hadoop.mapred.FileOutputFormat.getOutputCompressorClass(FileOutputFormat.java:91)
... 18 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 4 Reduce: 1 Cumulative CPU: 4.85 sec HDFS Read: 460084 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 4 seconds 850 msec
Oddly, hive did not pick the codec up from hadoop's classpath here, so I had to copy the dependency onto hive's own classpath, restart hive, and rerun the query:
cp /usr/lib/hadoop/hadoop-common-2.0.0-cdh4.3.0.jar /usr/lib/hive/lib
Everything is now in place; compile the class above and we can start.
One thing to note: the arguments are the files to decompress, and the decompressor is chosen from the compressed file's name extension, so the extensions must not be renamed arbitrarily. The small snippet right after this paragraph illustrates that mapping; after that we decompress the snappy file obtained earlier.
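Here is a minimal illustrative class (the WhichCodec name is mine, not part of the original code) that simply asks the factory which codec it would choose for each of the files produced above:

package compress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Shows that codec lookup is purely file-name based: strip the suffix and
// the factory no longer knows what the file is.
public class WhichCodec {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Same codec list as in Decompress, so the lzo suffixes resolve too.
        conf.set("io.compression.codecs",
                "org.apache.hadoop.io.compress.GzipCodec,"
              + "org.apache.hadoop.io.compress.DefaultCodec,"
              + "com.hadoop.compression.lzo.LzoCodec,"
              + "com.hadoop.compression.lzo.LzopCodec,"
              + "org.apache.hadoop.io.compress.BZip2Codec");
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        for (String n : new String[] { "000000_0.snappy", "000000_0.lzo",
                "resolv.conf.bz2", "resolv.conf.gz", "000000_0.deflate", "000000_0" }) {
            CompressionCodec codec = factory.getCodec(new Path(n));
            System.out.println(n + " -> "
                    + (codec == null ? "no codec found" : codec.getClass().getName()));
        }
    }
}

With the suffix stripped off ("000000_0"), the lookup returns null, which is exactly the "Codec for ... not found." branch in Decompress above. Now for the snappy file: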
[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/test/lib/*" compress.Decompress /tmp/snappy/000000_0.snappy
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
.................. file contents omitted ................................
Since I do have the lzo library installed, this file can also be decompressed directly with lzop:
[apache@indigo lzo]$ lzop -d 000000_0.lzo
[apache@indigo lzo]$ ll
total 8
-rw-r--r--. 1 apache apache 1650 Dec 2 13:12 000000_0
-rwxr-xr-x. 1 apache apache 848 Dec 2 13:12 000000_0.lzo
Or simply pass the lzo file name to compress.Decompress:
[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/test/lib/*" compress.Decompress /tmp/lzo/000000_0.lzo
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
The bz2 and gz files work the same way:
[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/diary/1202/lib/*" compress.Decompress /tmp/bz2/resolv.conf.bz2
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
resolv.conf0000644000175000017500000000012412247014427012463 0ustar apacheapache# Generated by NetworkManager
domain dhcp
search dhcp server
nameserver 192.168.0.1
[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/diary/1202/lib/*" compress.Decompress /tmp/bz2/resolv.conf.gz
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
resolv.conf0000644000175000017500000000012412247014427012463 0ustar apacheapache# Generated by NetworkManager
domain dhcp
search dhcp server
nameserver 192.168.0.1
And finally the deflate file:
[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/diary/1202/lib/*" compress.Decompress /tmp/deflate/000000_0.deflate
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
.................. file contents omitted ................................