版本:
cdh5.0.0+hadoop2.3.0+hbase0.96.1.1+Spoon5.0.1
一、HBase数据导入
HBase数据导入使用org.apache.hadoop.hbase.mapreduce.ImportTsv 的两种方式,一种是直接导入,一种是转换为HFile,然后再次导入。
1. HDFS数据为(部分):
[root@node33 data]# hadoop fs -ls /input Found 1 items -rwxrwxrwx 1 hdfs supergroup 13245467 2014-05-01 17:09 /input/hbase-data.csv [root@node33 data]# hadoop fs -cat /input/* | head -n 3 1,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0,0,1 2,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0,0,1 3,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0,0,1
2. 使用直接导入的方式
a. 建立hbase-employees-1表,使用hbase shell,进入shell模式,使用命令:create 'hbase-employees-1','col' ,建立表;
b. 进入hbase安装目录,如果使用cdh默认安装,一般在目录/usr/lib/hbase/bin中,运行:
./hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.columns=HBASE_ROW_KEY,col:x1,col:x2,col:x3,col:x4,col:x5,col:x6,col:x7,col:x8,col:x9,col:y hbase-employees-1 hdfs://node33:8020/input/hbase-data.csv
日志如下:
2014-05-02 13:15:07,716 INFO [main] mapreduce.JobSubmitter: Submitting tokens for job: job_1398958404577_0018 2014-05-02 13:15:08,674 INFO [main] impl.YarnClientImpl: Submitted application application_1398958404577_0018 2014-05-02 13:15:09,101 INFO [main] mapreduce.Job: The url to track the job: http://node33:8088/proxy/application_1398958404577_0018/ 2014-05-02 13:15:09,103 INFO [main] mapreduce.Job: Running job: job_1398958404577_0018 2014-05-02 13:15:34,169 INFO [main] mapreduce.Job: Job job_1398958404577_0018 running in uber mode : false 2014-05-02 13:15:34,207 INFO [main] mapreduce.Job: map 0% reduce 0% 2014-05-02 13:16:32,789 INFO [main] mapreduce.Job: map 1% reduce 0% 2014-05-02 13:16:53,477 INFO [main] mapreduce.Job: map 5% reduce 0% 2014-05-02 13:16:56,701 INFO [main] mapreduce.Job: map 9% reduce 0% 2014-05-02 13:16:59,928 INFO [main] mapreduce.Job: map 13% reduce 0% 2014-05-02 13:17:02,970 INFO [main] mapreduce.Job: map 16% reduce 0% 2014-05-02 13:17:07,260 INFO [main] mapreduce.Job: map 22% reduce 0% 2014-05-02 13:17:10,472 INFO [main] mapreduce.Job: map 29% reduce 0% 2014-05-02 13:17:12,879 INFO [main] mapreduce.Job: map 36% reduce 0% 2014-05-02 13:17:16,555 INFO [main] mapreduce.Job: map 45% reduce 0% 2014-05-02 13:17:43,452 INFO [main] mapreduce.Job: map 48% reduce 0% 2014-05-02 13:17:45,629 INFO [main] mapreduce.Job: map 63% reduce 0% 2014-05-02 13:17:52,845 INFO [main] mapreduce.Job: map 79% reduce 0% 2014-05-02 13:17:55,862 INFO [main] mapreduce.Job: map 91% reduce 0% 2014-05-02 13:18:01,000 INFO [main] mapreduce.Job: map 100% reduce 0% 2014-05-02 13:18:14,644 INFO [main] mapreduce.Job: Job job_1398958404577_0018 completed successfully 2014-05-02 13:18:17,123 INFO [main] mapreduce.Job: Counters: 31 File System Counters FILE: Number of bytes read=0 FILE: Number of bytes written=115531 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=13245571 HDFS: Number of bytes written=0 HDFS: Number of read operations=2 HDFS: Number of large read operations=0 HDFS: Number of write operations=0 Job Counters Launched map tasks=1 Data-local map tasks=1 Total time spent by all maps in occupied slots (ms)=146367 Total time spent by all reduces in occupied slots (ms)=0 Total time spent by all map tasks (ms)=146367 Total vcore-seconds taken by all map tasks=146367 Total megabyte-seconds taken by all map tasks=149879808 Map-Reduce Framework Map input records=203500 Map output records=203500 Input split bytes=104 Spilled Records=0 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=8595 CPU time spent (ms)=11290 Physical memory (bytes) snapshot=99639296 Virtual memory (bytes) snapshot=647782400 Total committed heap usage (bytes)=26517504 ImportTsv Bad Lines=0 File Input Format Counters Bytes Read=13245467 File Output Format Counters Bytes Written=0
查看hbase表数据大小:
bash-4.1$ hadoop fs -count /hbase/data/default/hbase-employees-1 6 4 60100796 /hbase/data/default/hbase-employees-1
大约57.3M数据,原始数据是12.3M的。耗时方面:从任务提交到任务结束耗时3分6秒。
3. 使用HFile方式:
a. 转换为HFile并新建hbase-employee表:
./hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.bulk.output=hdfs://node33:8020/output/hbase_bulk -Dimporttsv.columns=HBASE_ROW_KEY,col:x1,col:x2,col:x3,col:x4,col:x5,col:x6,col:x7,col:x8,col:x9,col:y hbase-employees hdfs://node33:8020/input/hbase-data.csv
这样运行后,可以查看HFile的大小:
hadoop fs -count /output/hbase_bulk
同时数据库中已经新建了hbase-employees表,但是表中没有数据。
b. 插入数据到表中
这里也有两种方式,但是这两种方式操作前都需要把相应目录的权限放开。简单来说可以这样:
bash-4.1$ hadoop fs -chmod -R 777 /output bash-4.1$ hadoop fs -chmod -R 777 /hbase
(1)使用hbase的命令:
/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs://node33:8020/output/hbase_bulk hbase-employees
ps:这个我没有试,官网说可以这样。
(2)使用hadoop jar方式:
hadoop jar /usr/lib/hbase/lib/hbase-server-0.96.1.1-cdh5.0.0.jar completebulkload hdfs://node33:8020/output/hbase_bulk hbase-employees
不过做这个前需要把hbase的相关包拷贝到hadoop的lib目录下,可以拷贝到/usr/lib/hadoop-mapreduce目录下,jar包如下:
-rw-r--r--. 1 root root 818519 May 2 10:35 hbase-client.jar -rw-r--r--. 1 root root 360422 May 2 10:36 hbase-common.jar -rw-r--r--. 1 root root 66898 May 2 10:36 hbase-hadoop2-compat.jar -rw-r--r--. 1 root root 4071 May 2 10:36 hbase-it.jar -rw-r--r--. 1 root root 87533 May 2 10:36 hbase-prefix-tree.jar -rw-r--r--. 1 root root 3142247 May 2 10:36 hbase-protocol.jar -rw-r--r--. 1 root root 3110607 May 2 10:36 hbase-server.jar -rw-r--r--. 1 root root 4499 May 2 10:37 hbase-shell-0.96.1.1-cdh5.0.0.jar -rw-r--r--. 1 root root 2270053 May 2 10:37 hbase-thrift-0.96.1.1-cdh5.0.0.jar
这种方式的log,如下:
2014-05-02 13:39:27,916 INFO [main] mapreduce.JobSubmitter: Submitting tokens for job: job_1398958404577_0020 2014-05-02 13:39:28,408 INFO [main] impl.YarnClientImpl: Submitted application application_1398958404577_0020 2014-05-02 13:39:28,552 INFO [main] mapreduce.Job: The url to track the job: http://node33:8088/proxy/application_1398958404577_0020/ 2014-05-02 13:39:28,553 INFO [main] mapreduce.Job: Running job: job_1398958404577_0020 2014-05-02 13:39:44,376 INFO [main] mapreduce.Job: Job job_1398958404577_0020 running in uber mode : false 2014-05-02 13:39:44,402 INFO [main] mapreduce.Job: map 0% reduce 0% 2014-05-02 13:40:24,580 INFO [main] mapreduce.Job: map 67% reduce 0% 2014-05-02 13:40:38,125 INFO [main] mapreduce.Job: map 100% reduce 0% 2014-05-02 13:41:09,295 INFO [main] mapreduce.Job: map 100% reduce 67% 2014-05-02 13:41:17,905 INFO [main] mapreduce.Job: map 100% reduce 77% 2014-05-02 13:41:20,928 INFO [main] mapreduce.Job: map 100% reduce 90% 2014-05-02 13:41:23,948 INFO [main] mapreduce.Job: map 100% reduce 100% 2014-05-02 13:41:29,514 INFO [main] mapreduce.Job: Job job_1398958404577_0020 completed successfully 2014-05-02 13:41:31,530 INFO [main] mapreduce.Job: Counters: 50 File System Counters FILE: Number of bytes read=48950370 FILE: Number of bytes written=98519119 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=13245571 HDFS: Number of bytes written=73861410 HDFS: Number of read operations=8 HDFS: Number of large read operations=0 HDFS: Number of write operations=3 Job Counters Launched map tasks=1 Launched reduce tasks=1 Data-local map tasks=1 Total time spent by all maps in occupied slots (ms)=51857 Total time spent by all reduces in occupied slots (ms)=42035 Total time spent by all map tasks (ms)=51857 Total time spent by all reduce tasks (ms)=42035 Total vcore-seconds taken by all map tasks=51857 Total vcore-seconds taken by all reduce tasks=42035 Total megabyte-seconds taken by all map tasks=53101568 Total megabyte-seconds taken by all reduce tasks=43043840 Map-Reduce Framework Map input records=203500 Map output records=203500 Map output bytes=48339864 Map output materialized bytes=48950370 Input split bytes=104 Combine input records=203500 Combine output records=203500 Reduce input groups=203500 Reduce shuffle bytes=48950370 Reduce input records=203500 Reduce output records=2035000 Spilled Records=407000 Shuffled Maps =1 Failed Shuffles=0 Merged Map outputs=1 GC time elapsed (ms)=1573 CPU time spent (ms)=26250 Physical memory (bytes) snapshot=341180416 Virtual memory (bytes) snapshot=1296404480 Total committed heap usage (bytes)=152965120 ImportTsv Bad Lines=0 Shuffle Errors BAD_ID=0 CONNECTION=0 IO_ERROR=0 WRONG_LENGTH=0 WRONG_MAP=0 WRONG_REDUCE=0 File Input Format Counters Bytes Read=13245467 File Output Format Counters Bytes Written=73861410
这个是转为HFile的job,插入数据到HBase表中2秒不到就完成了。
查看HBase中数据大小:
bash-4.1$ hadoop fs -count /hbase/data/default/hbase-employees 5 3 73861752 /hbase/data/default/hbase-employees
一共大概70.4M数据,原始数据是12.3M,耗时为:2分2秒。
总结:HBase的数据导入都会把数据扩大,一般大概为5倍左右,同时使用HFile的方式,在时间上会有优势,但是数据会比较大。ps:这个仅仅是一次测试总结而已,实际情况可能不同。(为什么HBase的数据被扩大了?参考前篇,Hive的数据如果是直接导入,一般是不会扩大数据占用空间的,而且使用orc方式的时候还可以达到很好的压缩比,但是HBase却是扩大的,这是为什么?)
二、Kettle操作Hbase0.96
1. HBase Output组件,可以参考 http://wiki.pentaho.com/display/BAD/Loading+Data+into+HBase 。
2. HBase Input组件,可以参考 http://wiki.pentaho.com/display/BAD/Extracting+Data+from+HBase+to+Load+an+RDBMS。
参考这两个操作步骤,都是可以得到正确结果的。这里只是说说遇到的问题。
(1)zookeeper和hbase-site.xml文件两者选择一个配置即可,配置两个可能有问题。
(2)mapping配置的是HBase表中Family的名称。
(3)如果遇到下面的问题:
java.lang.IllegalArgumentException: offset (0) + length (8) exceed the capacity of the array: 1 at org.apache.hadoop.hbase.util.Bytes.explainWrongLengthOrOffset(Bytes.java:602) at org.apache.hadoop.hbase.util.Bytes.toLong(Bytes.java:580) at org.apache.hadoop.hbase.util.Bytes.toLong(Bytes.java:553) at org.pentaho.hbase.shim.common.CommonHBaseBytesUtil.toLong(CommonHBaseBytesUtil.java:87) at org.pentaho.hbase.shim.api.HBaseValueMeta.decodeKeyValue(HBaseValueMeta.java:567) at org.pentaho.di.trans.steps.hbaseinput.HBaseInputData.getOutputRow(HBaseInputData.java:464) at org.pentaho.di.trans.steps.hbaseinput.HBaseInput.processRow(HBaseInput.java:281) at org.pentaho.di.trans.step.RunThread.run(RunThread.java:60) at java.lang.Thread.run(Thread.java:724)
或者
Unknown type for column at org.pentaho.hbase.shim.api.HBaseValueMeta.decodeColumnValue(HBaseValueMeta.java:769) at org.pentaho.di.trans.steps.hbaseinput.HBaseInputData.getOutputRow(HBaseInputData.java:495) at org.pentaho.di.trans.steps.hbaseinput.HBaseInput.processRow(HBaseInput.java:281) at org.pentaho.di.trans.step.RunThread.run(RunThread.java:60) at java.lang.Thread.run(Thread.java:724)
或者
org.pentaho.di.core.exception.KettleException: Length of integer column value is not equal to the defined length of a short, int or long at org.pentaho.hbase.shim.api.HBaseValueMeta.decodeColumnValue(HBaseValueMeta.java:711) at org.pentaho.di.trans.steps.hbaseinput.HBaseInputData.getOutputRow(HBaseInputData.java:495) at org.pentaho.di.trans.steps.hbaseinput.HBaseInput.processRow(HBaseInput.java:281) at org.pentaho.di.trans.step.RunThread.run(RunThread.java:60) at java.lang.Thread.run(Thread.java:724)
则把Mapping的时候的数据格式全部设置为String就可以了,这个是因为,如果HBase表中的数据是以其他方式导入的话,那么其存储是以String格式存储的(但是如果使用kettle的HBase output组件导入数据,那么数据格式可以调整),所以使用其他的数据格式就会报上面的错误。
分享,成长,快乐
转载请注明blog地址: http://blog.csdn.net/fansy1990