Hadoop-2.6.0-cdh5.7.0 does not ship with native compression support, but in a production environment compression is essentially mandatory.
Benefits: compressed data takes less disk space and less disk/network I/O to move.
Drawbacks: compressing and decompressing costs extra CPU time.
Compression format | Tool | Algorithm | File extension | Splittable |
---|---|---|---|---|
DEFLATE | N/A | DEFLATE | .deflate | No |
gzip | gzip | DEFLATE | .gz | No |
bzip2 | bzip2 | bzip2 | .bz2 | Yes |
LZO | Lzop | LZO | .lzo | Yes |
LZ4 | N/A | LZ4 | .lz4 | No |
Snappy | N/A | Snappy | .snappy | No |
As the table suggests, there is a trade-off: the higher the compression ratio, the longer compression takes. Snappy sits at the fast, low-ratio end of that spectrum; bzip2 at the slow, high-ratio end.
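The trade-off is easy to observe locally with the gzip and bzip2 commands that most Linux distributions ship. A minimal sketch (the file paths under /tmp are arbitrary):

```shell
# Generate a highly compressible sample file
seq 1 100000 > /tmp/sample.txt
orig=$(wc -c < /tmp/sample.txt)

# Compress with gzip (DEFLATE, fast) and bzip2 (slower, usually a higher ratio)
gzip -c -9 /tmp/sample.txt > /tmp/sample.txt.gz
bzip2 -c -9 /tmp/sample.txt > /tmp/sample.txt.bz2

gz=$(wc -c < /tmp/sample.txt.gz)
bz=$(wc -c < /tmp/sample.txt.bz2)
echo "original=$orig gzip=$gz bzip2=$bz"
```

Timing the two commands with `time` shows the same trade-off on the speed axis.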
a. gzip
**Pros:** relatively high compression ratio among the four; supported by Hadoop out of the box, so gzip files can be processed just like plain text; has a Hadoop native library; most Linux distributions ship the gzip command, so it is convenient to use
**Cons:** does not support splitting
b. lzo
**Pros:** fast compression/decompression with a reasonable ratio; supports splitting and is one of the most popular splittable formats in Hadoop; works with the Hadoop native library; the lzop command can be installed on Linux and is convenient to use
**Cons:** lower compression ratio than gzip; not supported by Hadoop out of the box, so it must be installed separately; although LZO supports splitting, an index must be built for each .lzo file (e.g. with hadoop-lzo's DistributedLzoIndexer) and the job's InputFormat must be set to the LZO one, otherwise Hadoop treats the file as one ordinary, unsplittable file
c. snappy
**Pros:** very fast compression; works with the Hadoop native library
**Cons:** does not support splitting; low compression ratio; not supported by Hadoop out of the box, so it must be installed; Linux has no corresponding standalone command
d. bzip2
**Pros:** supports splitting; very high compression ratio, higher than gzip; supported by Hadoop out of the box (though without a native library); Linux ships the bzip2 command, so it is convenient to use
**Cons:** slow compression/decompression; no native library support
Different scenarios call for different compression schemes; there is no one-size-fits-all choice. A high compression ratio demands more CPU and more compression/decompression time; a low ratio costs more disk I/O, network I/O, and storage space. Formats that support splitting allow parallel processing, so pick per workload.
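In practice the choice gets wired into the job or cluster configuration. As an illustrative sketch (the property names are the Hadoop 2.x MapReduce ones; the codec pairing here is only an example of the reasoning above: Snappy for speed on intermediate map output, splittable bzip2 for final output), the following could go inside the `<configuration>` element of mapred-site.xml:

```xml
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
```

The same properties can also be set per job with `-D` on the command line.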
Tips: try to keep each output file no larger than one HDFS block. For example, with a 128 MB block size, aim for output files of around 126 MB, and never over 128 MB.
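The arithmetic behind this tip is simple: a file even slightly over one block still occupies a second block and spawns a second map task. A quick sketch with a 128 MB block size:

```shell
blocksize=$((128 * 1024 * 1024))

# 126 MB fits in one block; 130 MB spills into a second
for mb in 126 130; do
  filesize=$((mb * 1024 * 1024))
  # Ceiling division: number of blocks the file occupies
  blocks=$(( (filesize + blocksize - 1) / blocksize ))
  echo "${mb}MB -> ${blocks} block(s)"
done
```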
# Download Maven; the Tsinghua mirror hosts it
[hadoop@hadoop001 scripts]$ wget https://mirrors.tuna.tsinghua.edu.cn/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
[root@hadoop001 scripts]# tar -xzvf apache-maven-3.3.9-bin.tar.gz -C /home/hadoop/app/
[root@hadoop001 app]# mv apache-maven-3.3.9/ maven
[root@hadoop001 app]# chown -R hadoop:hadoop maven
# Configure environment variables
[hadoop@hadoop001 ~]$ vim ~/.bash_profile
export MAVEN_HOME=/home/hadoop/app/maven
export PATH=$MAVEN_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin:$ZOOKEEPER_HOME/bin:$PATH
[hadoop@hadoop001 ~]$ source ~/.bash_profile
# Configure the Maven local repository
[hadoop@hadoop001 ~]$ cd app/maven/conf/
[hadoop@hadoop001 conf]$ vim settings.xml
# Set the local repository location in settings.xml
<localRepository>/home/hadoop/maven_repo/repo</localRepository>
# Add the Aliyun mirror of Maven Central; it must go inside the <mirrors></mirrors> element, which is easy to get wrong
<mirror>
<id>nexus-aliyun</id>
<mirrorOf>central</mirrorOf>
<name>Nexus aliyun</name>
<url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>
# Extract findbugs-3.0.1.tar.gz and protobuf-2.5.0.tar.gz; watch the owner/group of the extracted files and directories
[root@hadoop001 software]# tar -xzvf findbugs-3.0.1.tar.gz -C /home/hadoop/app/
[root@hadoop001 software]# tar -xzvf protobuf-2.5.0.tar.gz -C /usr/local
# While checking environment variables, double-check Java: Hadoop-2.6.0-cdh5.7.0 should be compiled with JDK 1.7; JDK 1.8 fails the build
# Extract the Hadoop-2.6.0-cdh5.7.0 source tarball
[hadoop@hadoop001 ~]$ tar -xzvf hadoop-2.6.0-cdh5.7.0-src.tar.gz -C ~/app/source/
# cd into the source directory and check the build requirements
[hadoop@hadoop001 hadoop-2.6.0-cdh5.7.0]$ cat BUILDING.txt
----------------------------------------------------------
## Build instructions for Hadoop
Requirements:
- Unix System
- JDK 1.7+
- Maven 3.0 or later
- Findbugs 1.3.9 (if running findbugs)
- ProtocolBuffer 2.5.0
- CMake 2.6 or newer (if compiling native code), must be 3.0 or newer on Mac
- Zlib devel (if compiling native code)
- openssl devel ( if compiling native hadoop-pipes )
- Internet connection for first build (to fetch all Maven and Hadoop dependencies)
----------------------------------------------------------
[root@hadoop001 conf]# yum install -y gcc gcc-c++ make cmake
# cd into the protobuf directory and build it
[root@hadoop001 app]# cd /usr/local/protobuf-2.5.0/
[root@hadoop001 protobuf-2.5.0]# ./configure --prefix=/usr/local/protobuf
[root@hadoop001 protobuf-2.5.0]# make && make install
# Verify all environment variables
[hadoop@hadoop001 ~]$ vim ~/.bash_profile
export JAVA_HOME=/usr/java/jdk1.7.0_60
export ZOOKEEPER_HOME=/home/hadoop/app/zookeeper
export HADOOP_HOME=/home/hadoop/app/hadoop
export MAVEN_HOME=/home/hadoop/app/maven
export FINDBUGS_HOME=/home/hadoop/app/findbugs-3.0.1
export PROTOC_HOME=/usr/local/protobuf
export PATH=$FINDBUGS_HOME/bin:$PROTOC_HOME/bin:$MAVEN_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin:$ZOOKEEPER_HOME/bin:$PATH
[hadoop@hadoop001 ~]$ source ~/.bash_profile
[hadoop@hadoop001 ~]$ which java
/usr/java/jdk1.7.0_60/bin/java
[hadoop@hadoop001 ~]$ which mvn
~/app/maven/bin/mvn
[hadoop@hadoop001 ~]$ which findbugs
~/app/findbugs-3.0.1/bin/findbugs
[hadoop@hadoop001 ~]$ which protoc
/usr/local/protobuf/bin/protoc
# Check the versions of all the tools
[hadoop@hadoop001 ~]$ java -version
java version "1.7.0_60"
Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)
[hadoop@hadoop001 ~]$ mvn -version
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-11T00:41:47+08:00)
Maven home: /home/hadoop/app/maven
Java version: 1.7.0_60, vendor: Oracle Corporation
Java home: /usr/java/jdk1.7.0_60/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-696.16.1.el6.x86_64", arch: "amd64", family: "unix"
[hadoop@hadoop001 ~]$ findbugs -version
3.0.1
[hadoop@hadoop001 ~]$ protoc --version
libprotoc 2.5.0
# Install the libraries for the various compression codecs
[root@hadoop001 ~]# yum install -y openssl openssl-devel svn ncurses-devel zlib-devel libtool
[root@hadoop001 ~]# yum install -y snappy snappy-devel bzip2 bzip2-devel lzo lzo-devel lzop autoconf automake
[hadoop@hadoop001 hadoop-2.6.0-cdh5.7.0]$ pwd
/home/hadoop/app/source/hadoop-2.6.0-cdh5.7.0
[hadoop@hadoop001 hadoop-2.6.0-cdh5.7.0]$ mvn clean package -Pdist,native -DskipTests -Dtar
# Output like the following means the build succeeded
[INFO] hadoop-mapreduce ................................... SUCCESS [ 4.853 s]
[INFO] Apache Hadoop MapReduce Streaming .................. SUCCESS [ 3.133 s]
[INFO] Apache Hadoop Distributed Copy ..................... SUCCESS [ 6.775 s]
[INFO] Apache Hadoop Archives ............................. SUCCESS [ 1.603 s]
[INFO] Apache Hadoop Archive Logs ......................... SUCCESS [ 1.474 s]
[INFO] Apache Hadoop Rumen ................................ SUCCESS [ 4.117 s]
[INFO] Apache Hadoop Gridmix .............................. SUCCESS [ 3.182 s]
[INFO] Apache Hadoop Data Join ............................ SUCCESS [ 2.064 s]
[INFO] Apache Hadoop Ant Tasks ............................ SUCCESS [ 1.562 s]
[INFO] Apache Hadoop Extras ............................... SUCCESS [ 2.098 s]
[INFO] Apache Hadoop Pipes ................................ SUCCESS [ 6.435 s]
[INFO] Apache Hadoop OpenStack support .................... SUCCESS [ 3.757 s]
[INFO] Apache Hadoop Amazon Web Services support .......... SUCCESS [ 5.689 s]
[INFO] Apache Hadoop Azure support ........................ SUCCESS [ 3.103 s]
[INFO] Apache Hadoop Client ............................... SUCCESS [ 4.345 s]
[INFO] Apache Hadoop Mini-Cluster ......................... SUCCESS [ 1.204 s]
[INFO] Apache Hadoop Scheduler Load Simulator ............. SUCCESS [ 3.345 s]
[INFO] Apache Hadoop Tools Dist ........................... SUCCESS [ 7.353 s]
[INFO] Apache Hadoop Tools ................................ SUCCESS [ 0.047 s]
[INFO] Apache Hadoop Distribution ......................... SUCCESS [ 34.657 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 30:43 min
[INFO] Finished at: 2019-04-08T10:51:51+08:00
[INFO] Final Memory: 212M/970M
[INFO] ------------------------------------------------------------------------
# Check which compression codecs the rebuilt native libraries support
[hadoop@hadoop001 hadoop-2.6.0-cdh5.7.0]$ hadoop checknative -a
19/04/08 11:28:50 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
19/04/08 11:28:50 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/lib/native/libhadoop.so.1.0.0
zlib: true /lib64/libz.so.1
snappy: true /usr/lib64/libsnappy.so.1
lz4: true revision:99
bzip2: true /lib64/libbz2.so.1
openssl: true /usr/lib64/libcrypto.so
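Even after checknative reports the native libraries, codecs that are not in Hadoop's built-in list still have to be registered so jobs can resolve them by class name. A hedged sketch of the relevant core-site.xml entries (the `com.hadoop.compression.lzo.*` classes exist only if the separate hadoop-lzo package is installed; Hadoop trims whitespace in the comma-separated list):

```xml
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,
         org.apache.hadoop.io.compress.DefaultCodec,
         org.apache.hadoop.io.compress.BZip2Codec,
         org.apache.hadoop.io.compress.SnappyCodec,
         com.hadoop.compression.lzo.LzoCodec,
         com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```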
If you instead get the error "Unable to load native-hadoop library for your platform… using builtin-java classes where applicable":
[hadoop@hadoop001 hadoop-2.6.0-cdh5.7.0]$ hadoop checknative -a
19/04/08 11:20:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Native library checking:
hadoop: false
zlib: false
snappy: false
lz4: false
bzip2: false
openssl: false
19/04/08 11:20:21 INFO util.ExitUtil: Exiting with status 1
Workaround:
Download the matching prebuilt native bundle from http://dl.bintray.com/sequenceiq/sequenceiq-bin/ and run the following:
[hadoop@hadoop001 software]$ tar -xvf hadoop-native-64-2.6.0.tar -C $HADOOP_HOME/lib/
[hadoop@hadoop001 software]$ tar -xvf hadoop-native-64-2.6.0.tar -C $HADOOP_HOME/lib/native
# Configure the environment variables
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_HOME/lib/native"
# Then copy the freshly built native libraries into place
[hadoop@hadoop001 software]$ cp ~/app/source/hadoop-2.6.0-cdh5.7.0/hadoop-dist/target/hadoop-2.6.0-cdh5.7.0/lib/native/* /home/hadoop/app/hadoop/lib/native/
[hadoop@hadoop001 software]$ cp ~/app/source/hadoop-2.6.0-cdh5.7.0/hadoop-dist/target/hadoop-2.6.0-cdh5.7.0/lib/native/* /home/hadoop/app/hadoop/lib