hive (default)> create table product_info_snappy as select * from product_info where 1=2; (Create a table in Hive with the same structure as product_info; the product_info table lives in the ruozedata5 database in MySQL.)
[hadoop@hadoop001 ~]$ sqoop import --connect jdbc:mysql://localhost:3306/ruozedata5 --username root --password 123456 --delete-target-dir --table product_info --fields-terminated-by '\t' --hive-import --hive-overwrite --hive-table product_info_snappy --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec -m 1
[hadoop@hadoop001 ~]$ hdfs dfs -ls /user/hive/warehouse/product_info_snappy
-rwxr-xr-x 1 hadoop supergroup 990 2018-12-05 20:05 /user/hive/warehouse/product_info_snappy/part-m-00000.snappy (you can see that the file imported into Hive is already in compressed Snappy format)
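If the import does not produce compressed output, it is worth first confirming that the native Snappy library is loaded. This check is not part of the original steps; hadoop checknative is a standard Hadoop command, and its snappy line should report true together with the path to libsnappy:
[hadoop@hadoop001 ~]$ hadoop checknative -a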
Note: all of the following build steps were done on another machine and the results were then scp'd to the current machine. Compiling requires Maven, which I covered earlier, so I won't repeat that here.
[root@hadoop002 opt]# yum -y install lzo-devel zlib-devel gcc autoconf automake libtool
[root@hadoop002 opt]# wget www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
[root@hadoop002 opt]# tar -zxf lzo-2.06.tar.gz
[root@hadoop002 opt]# cd lzo-2.06
[root@hadoop002 lzo-2.06]# export CFLAGS=-m64
[root@hadoop002 lzo-2.06]# ./configure -enable-shared -prefix=/usr/local/lzo-2.06
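The build/install step itself is not captured in these notes, but the later references to /usr/local/lzo-2.06/include and /usr/local/lzo-2.06/lib imply the usual sequence was run:
[root@hadoop002 lzo-2.06]# make && make install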
[root@hadoop002 opt]# wget https://github.com/twitter/hadoop-lzo/archive/master.zip
[root@hadoop002 opt]# unzip master
[root@hadoop002 opt]# vi hadoop-lzo-master/pom.xml
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<hadoop.current.version>2.7.4</hadoop.current.version>  <!-- change this to the matching Hadoop version -->
<hadoop.old.version>1.0.4</hadoop.old.version>
[root@hadoop002 opt]# cd hadoop-lzo-master/
[root@hadoop002 hadoop-lzo-master]# export CFLAGS=-m64
[root@hadoop002 hadoop-lzo-master]# export C_INCLUDE_PATH=/usr/local/lzo-2.06/include
[root@hadoop002 hadoop-lzo-master]# export LIBRARY_PATH=/usr/local/lzo-2.06/lib
[root@hadoop002 hadoop-lzo-master]# mvn clean package -Dmaven.test.skip=true
[root@hadoop002 hadoop-lzo-master]# pwd
/opt/hadoop-lzo-master
[root@hadoop002 hadoop-lzo-master]# cd target/native/Linux-amd64-64
[root@hadoop002 Linux-amd64-64]# tar -cBf - -C lib . | tar -xBvf - -C ~
[root@hadoop002 Linux-amd64-64]# cp ~/libgplcompression* /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/lib/native/
[root@hadoop002 target]# pwd
/opt/hadoop-lzo-master/target
[root@hadoop002 target]# cp hadoop-lzo-0.4.21-SNAPSHOT.jar /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/
[root@hadoop002 target]# cd /usr/local
[root@hadoop002 local]# scp -r lzo-2.06 [email protected]:/opt/
[root@hadoop002 local]# scp /opt/hadoop-lzo-master/target/hadoop-lzo-0.4.21-SNAPSHOT.jar [email protected]:/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/
Then copy the files under native, one by one, to the native directory of the Hadoop installation on 192.168.2.65.
After the transfer, change the owner and group of the copied files under the Hadoop directory to hadoop, e.g. as sketched below.
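A hypothetical example, assuming the Hadoop install on 192.168.2.65 lives at /home/hadoop/app/hadoop-2.6.0-cdh5.7.0 as above:
[root@hadoop001 ~]# chown -R hadoop:hadoop /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/lib/native /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar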
vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export LD_LIBRARY_PATH=/usr/local/lzo-2.06/lib
vi $HADOOP_HOME/etc/hadoop/core-site.xml
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,
         org.apache.hadoop.io.compress.GzipCodec,
         org.apache.hadoop.io.compress.BZip2Codec,
         org.apache.hadoop.io.compress.SnappyCodec,
         com.hadoop.compression.lzo.LzoCodec,
         com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
Note: use lzop -V to test whether the installation succeeded.
lzop -h             # shows the help text
lzop access.log     # compresses the log file
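The DDL for test_lzo is not shown in these notes; a minimal sketch that would match the desc output below and the tab-delimited Sqoop import is:
hive (default)> create table test_lzo (id varchar(10), first_name varchar(10), last_name varchar(10), sex varchar(5), score varchar(10), copy_id varchar(10)) row format delimited fields terminated by '\t';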
hive (default)> desc test_lzo;
id varchar(10)
first_name varchar(10)
last_name varchar(10)
sex varchar(5)
score varchar(10)
copy_id varchar(10)
hive (default)> select * from test_lzo;
Time taken: 0.441 seconds (empty; only the table structure exists)
[hadoop@hadoop001 ~]$ sqoop import --connect jdbc:mysql://localhost:3306/mysql --username root --password 123456 --delete-target-dir --table test --fields-terminated-by '\t' --hive-import --hive-overwrite --hive-table test_lzo -m 1
[hadoop@hadoop001 hadoop-2.6.0-cdh5.7.0]$ hdfs dfs -ls /user/hive/warehouse/test_lzo
-rwxr-xr-x 1 hadoop supergroup 66407404 2018-12-07 17:03 /user/hive/warehouse/test_lzo/part-m-00000.lzo
[hadoop@hadoop001 hadoop-2.6.0-cdh5.7.0]$ hadoop jar /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/test_lzo/part-m-00000.lzo (generate the index file)
[hadoop@hadoop001 hadoop-2.6.0-cdh5.7.0]$ hdfs dfs -ls /user/hive/warehouse/test_lzo/
-rwxr-xr-x 1 hadoop supergroup 66407404 2018-12-07 17:03 /user/hive/warehouse/test_lzo/part-m-00000.lzo
-rw-r--r-- 1 hadoop supergroup 4120 2018-12-07 17:45 /user/hive/warehouse/test_lzo/part-m-00000.lzo.index
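For a directory containing many .lzo files, hadoop-lzo also provides com.hadoop.compression.lzo.DistributedLzoIndexer, which builds the indexes in a MapReduce job instead of locally; a sketch using the same jar (not one of the original steps):
[hadoop@hadoop001 hadoop-2.6.0-cdh5.7.0]$ hadoop jar /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/test_lzo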
[hadoop@hadoop001 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /user/hive/warehouse/test_lzo /data/test1 (18/12/07 18:51:00 INFO mapreduce.JobSubmitter: number of splits:2)
[hadoop@hadoop001 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /user/hive/warehouse/test_lzo/part-m-00000.lzo /data/test2(18/12/07 21:39:52 INFO mapreduce.JobSubmitter: number of splits:1)
From the above: running the wordcount MR job on the single file /test_lzo/part-m-00000.lzo gives 1 split, while running it on the /test_lzo directory (which contains both the .lzo file and the .lzo.index) gives 2 splits.
Note: because the indexed .lzo file always ended up with 2 splits regardless of its size, I followed someone else's blog and added one more parameter to mapred-site.xml, as follows:
<property>
  <name>mapred.child.env</name>
  <value>LD_LIBRARY_PATH=/usr/local/lzo-2.06/lib</value>
</property>
After making this change and restarting Hadoop, the split count for the indexed file was still 2, so the problem was not solved.
If you only create an index for the .lzo file, the wordcount MR job treats the .index file as another input file, which is why the tests above kept reporting 2 splits. Instead, the input format should be specified at run time, as follows:
[hadoop@hadoop001 data1]$ hdfs dfs -ls /data/ice
-rw-r--r-- 1 hadoop supergroup 733104209 2018-12-07 22:24 /data/ice/ice.txt.lzo
-rw-r--r-- 1 hadoop supergroup 46464 2018-12-08 08:50 /data/ice/ice.txt.lzo.index
[hadoop@hadoop001 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat /data/ice/ice.txt.lzo /data/ice_out
[hadoop@hadoop001 data1]$ hdfs dfs -du -h /data/ice_out
0 0 /data/ice_out/_SUCCESS
689.9 M 689.9 M /data/ice_out/part-r-00000.lzo (at this point, LZO splitting finally works end to end)
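When the LZO data is queried through Hive rather than a raw MapReduce job, the table is commonly declared with hadoop-lzo's mapred-API input format so that Hive also honors the .lzo.index files; a sketch with a hypothetical table name (not part of the original steps):
hive (default)> create table test_lzo_split (id varchar(10), first_name varchar(10), last_name varchar(10), sex varchar(5), score varchar(10), copy_id varchar(10)) row format delimited fields terminated by '\t' stored as inputformat "com.hadoop.mapred.DeprecatedLzoTextInputFormat" outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";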
Main references:
the official site
lzop / lzo
map intermediate output uses LzoCodec, reduce output uses LzopCodec
why my .lzo file, smaller than the 128 MB block size, was still split into two pieces
how to generate the test data