Decompressing Snappy, LZO, bzip2, gzip, and deflate Files


Snappy, LZO, bzip2, gzip, and deflate are all compression formats commonly used with Hive, each with its own strengths. Here we are only concerned with decompressing the resulting files.

1. The code first:

package compress;

import java.io.FileInputStream;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;

public class Decompress {

	public static final Log LOG = LogFactory.getLog(Decompress.class.getName());

	public static void main(String[] args) throws Exception {

		Configuration conf = new Configuration();
		// Register the codecs the factory should recognize (the LZO/LZOP codecs are not part
		// of stock Hadoop, and SnappyCodec is added here so that .snappy files also resolve).
		String name = "io.compression.codecs";
		String value = "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec";
		conf.set(name, value);
		// The factory maps a file extension (.gz, .bz2, .lzo, .snappy, .deflate, ...) to its codec.
		CompressionCodecFactory factory = new CompressionCodecFactory(conf);
		for (int i = 0; i < args.length; ++i) {
			// Pick the codec from the file name's extension; null means no codec matched.
			CompressionCodec codec = factory.getCodec(new Path(args[i]));
			if (codec == null) {
				System.out.println("Codec for " + args[i] + " not found.");
			} else {
				CompressionInputStream in = null;
				try {
					// Wrap the raw file in a decompressing stream and copy it to stdout.
					in = codec.createInputStream(new FileInputStream(args[i]));
					byte[] buffer = new byte[100];
					int len = in.read(buffer);
					while (len > 0) {
						System.out.write(buffer, 0, len);
						len = in.read(buffer);
					}
				} finally {
					if (in != null) {
						in.close();
					}
				}
			}
		}
	}
}

2. Preparation

1. Prepare the dependencies

Briefly, the core codec classes for these compression formats are listed below (a small sketch of instantiating one of them by name follows the list):

org.apache.hadoop.io.compress.SnappyCodec
org.apache.hadoop.io.compress.GzipCodec
org.apache.hadoop.io.compress.BZip2Codec
org.apache.hadoop.io.compress.DeflateCodec
org.apache.hadoop.io.compress.DefaultCodec
com.hadoop.compression.lzo.LzopCodec (LZO, shipped separately from Hadoop)
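
As a quick illustration (my own sketch, not part of the original write-up), any of these codec classes can be instantiated from its fully qualified name with Hadoop's ReflectionUtils, as long as the jar that contains it is on the classpath; the class name used below is simply the Gzip codec from the list above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecByName {

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		// Load one of the codec classes listed above by name.
		Class<?> codecClass = Class.forName("org.apache.hadoop.io.compress.GzipCodec");
		CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
		// Each codec reports the file extension it handles, ".gz" in this case.
		System.out.println(codec.getClass().getName() + " -> " + codec.getDefaultExtension());
	}
}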

First we need the dependency jars. I put everything needed for decompression under /home/apache/test/lib/.

We also need the native libraries that these codecs depend on. Find a machine with Hadoop installed and copy over its $HADOOP_HOME/lib/native directory; I put it under /tmp/decompress.
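
Before running the decompressor it is worth confirming that this native directory is actually picked up. The following is a minimal sketch of my own (the class name NativeCheck is hypothetical, not from the article); run it with -Djava.library.path pointing at the copied directory:

import org.apache.hadoop.util.NativeCodeLoader;

public class NativeCheck {

	public static void main(String[] args) {
		// Prints true only if libhadoop from $HADOOP_HOME/lib/native was found on java.library.path.
		System.out.println("native hadoop loaded: " + NativeCodeLoader.isNativeCodeLoaded());
	}
}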

2. Prepare the compressed files

2.1 Snappy file

Since I do not have the Snappy command-line tools installed, I use Hive to create the Snappy-compressed file.

Only two parameters are involved:

hive.exec.compress.output is set to true to declare that the query output should be compressed.

mapred.output.compression.codec sets the codec used for the compressed output.

Check both parameters in the Hive shell; once the codec is set to the Snappy format we want, run any SQL that writes its result to a local directory:

hive> set hive.exec.compress.output;
hive.exec.compress.output=true
hive> set mapred.output.compression.codec;
mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/snappy' select * from info900m limit 20;
This gives us the result file /tmp/snappy/000000_0.snappy.

2.2 LZO file

Same as above, except that this time we set the output codec to LZOP:

hive> set hive.exec.compress.output;
hive.exec.compress.output=true
hive> set mapred.output.compression.codec;                                           
mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
hive> set mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/lzo' select * from info900m limit 20;
This produces the result file /tmp/lzo/000000_0.lzo.

2.3 Creating the bz2 and gz files

Create the bz2 file. The transcript below only shows preparing the source file; the command that actually produced resolv.conf.bz2 is not shown, but judging from the tar header visible in the decompressed output later, it was presumably created with tar (something like tar jcf resolv.conf.bz2 resolv.conf):
[apache@indigo bz2]$ cp /etc/resolv.conf .
[apache@indigo bz2]$ cat resolv.conf
# Generated by NetworkManager
domain dhcp
search dhcp server
nameserver 192.168.0.1

Create the gz file (note that tar zcf actually produces a gzipped tar archive rather than a plain gzip file, which is why a tar header shows up in the decompressed output later):
[apache@indigo bz2]$ tar zcf resolv.conf.gz resolv.conf

2.4 Creating the deflate file

hive> set mapred.output.compression.codec;
mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec; 
hive> 
    > INSERT OVERWRITE LOCAL DIRECTORY '/tmp/deflate' select * from info900m limit 20;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1385947742139_0006, Tracking URL = http://indigo:8088/proxy/application_1385947742139_0006/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1385947742139_0006
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
2013-12-02 13:30:48,522 Stage-1 map = 0%,  reduce = 0%
2013-12-02 13:30:56,271 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 1.2 sec
2013-12-02 13:30:57,330 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.85 sec
......
2013-12-02 13:31:15,508 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.85 sec
2013-12-02 13:31:16,552 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.85 sec
MapReduce Total cumulative CPU time: 4 seconds 850 msec
Ended Job = job_1385947742139_0006 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1385947742139_0006_m_000003 (and more) from job job_1385947742139_0006

Task with the most failures(4): 
-----
Task ID:
  task_1385947742139_0006_r_000000

URL:
  http://indigo:8088/taskdetails.jsp?jobid=job_1385947742139_0006&tipid=task_1385947742139_0006_r_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"20130526","_col1":"20130526","_col2":"SXY","_col3":"4577020","_col4":"20","_col5":"P029124","_col6":"1","_col7":"612423196707110625","_col8":"","_col9":"Y1","_col10":"20130526"},"alias":0}
	at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:270)
	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:460)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:407)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"20130526","_col1":"20130526","_col2":"SXY","_col3":"4577020","_col4":"20","_col5":"P029124","_col6":"1","_col7":"612423196707110625","_col8":"","_col9":"Y1","_col10":"20130526"},"alias":0}
	at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:258)
	... 7 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.DefaultCode was not found.
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:479)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:543)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
	at org.apache.hadoop.hive.ql.exec.LimitOperator.processOp(LimitOperator.java:51)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
	at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
	at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:249)
	... 7 more
Caused by: java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.DefaultCode was not found.
	at org.apache.hadoop.mapred.FileOutputFormat.getOutputCompressorClass(FileOutputFormat.java:94)
	at org.apache.hadoop.hive.ql.exec.Utilities.getFileExtension(Utilities.java:910)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:469)
	... 16 more
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.io.compress.DefaultCode not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
	at org.apache.hadoop.mapred.FileOutputFormat.getOutputCompressorClass(FileOutputFormat.java:91)
	... 18 more


FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched: 
Job 0: Map: 4  Reduce: 1   Cumulative CPU: 4.85 sec   HDFS Read: 460084 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 4 seconds 850 msec
For some reason Hive did not pick this class up from the Hadoop classpath here, so the workaround was to copy the dependency onto Hive's own classpath, restart Hive, and run the query again:

cp /usr/lib/hadoop/hadoop-common-2.0.0-cdh4.3.0.jar /usr/lib/hive/lib

Everything is now in place; compile the class shown above and we can start.

3. Decompression

1. Snappy file

Note that the arguments are the names of the files to decompress, and the decompressor is chosen from the compressed file's extension, so the extension must not be changed arbitrarily.
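
The extension-to-codec mapping can be checked on its own. Here is a small sketch of mine (the file name part-00000.gz is just a hypothetical example, not from the article) showing how CompressionCodecFactory resolves a codec purely from the path, without touching the file itself:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class WhichCodec {

	public static void main(String[] args) {
		CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
		// Only the extension matters; the file does not have to exist.
		CompressionCodec codec = factory.getCodec(new Path("part-00000.gz"));
		System.out.println(codec == null ? "no codec found" : codec.getClass().getName());
	}
}

With that in mind, decompress the Snappy file obtained earlier: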

[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/test/lib/*" compress.Decompress /tmp/snappy/000000_0.snappy
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
.................. file contents omitted ................................

2. LZO file

Since I have the LZO tools installed locally, the file can also be decompressed directly with lzop:

[apache@indigo lzo]$ lzop -d 000000_0.lzo 
[apache@indigo lzo]$ ll
total 8
-rw-r--r--. 1 apache apache 1650 Dec  2 13:12 000000_0
-rwxr-xr-x. 1 apache apache  848 Dec  2 13:12 000000_0.lzo
Passing the .lzo file name to compress.Decompress works just as well:
[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/test/lib/*" compress.Decompress /tmp/lzo/000000_0.lzo
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.


3. bzip2 file

[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/diary/1202/lib/*" compress.Decompress /tmp/bz2/resolv.conf.bz2
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
resolv.conf0000644000175000017500000000012412247014427012463 0ustar  apacheapache# Generated by NetworkManager
domain dhcp
search dhcp server
nameserver 192.168.0.1

4. gzip file

[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/diary/1202/lib/*" compress.Decompress /tmp/bz2/resolv.conf.gz
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
resolv.conf0000644000175000017500000000012412247014427012463 0ustar  apacheapache# Generated by NetworkManager
domain dhcp
search dhcp server
nameserver 192.168.0.1

5. deflate file

[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/diary/1202/lib/*" compress.Decompress /tmp/deflate/000000_0.deflate
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
.................. file contents omitted ................................


