Bulk Importing Data into HBase

HBase is commonly used for batch analysis of big data, so in many cases large volumes of data need to be imported into HBase from external sources.
HBase ships with a tool designed for exactly this kind of bulk import: importtsv.
Its usage is as follows:
 
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>

Imports the given input directory of TSV data into the specified table.

The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family, or a columnfamily:qualifier. The special
column name HBASE_ROW_KEY is used to designate that this column should be used as
the row key for each imported record. You must specify exactly one column to be the
row key, and you must specify a column name for every column that exists in the input
data. Another special column HBASE_TS_KEY designates that this column should be
used as timestamp for each record. Unlike HBASE_ROW_KEY, HBASE_TS_KEY is optional.
You must specify atmost one column as timestamp key for each imported record.
Record with invalid timestamps (blank, non-numeric) will be treated as bad record.
Note: if you use this option, then 'importtsv.timestamp' option will be ignored.

By default importtsv will load data directly into HBase. To instead generate HFiles
of data to prepare for a bulk data load, pass the option:
  -Dimporttsv.bulk.output=/path/for/output
  Note: if you do not use this option, then the target table must already exist in HBase

Other options that may be specified with -D include:
  -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
  '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
  -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
  -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
For performance consider the following options:
  -Dmapred.map.tasks.speculative.execution=false
  -Dmapred.reduce.tasks.speculative.execution=false

HBase provides the importtsv tool to import data from TSV files into HBase. Loading text data into HBase with this tool is very efficient because the import is carried out by a MapReduce job. Even when loading data from an existing relational database, you can first export the data to a text file and then use importtsv to import it into HBase. This approach works well for massive data sets, because exporting the data is much faster than running SQL against the relational database. The importtsv tool not only supports loading data directly into an HBase table, it can also generate files in HBase's own storage format (HFiles), which you can then load into a running HBase cluster with the bulk load tool. This reduces the network traffic generated by data transfer and HBase writes during migration. The rest of this article describes how importtsv and the bulk load tool are used: we first show how to load data from a TSV file into an HBase table with importtsv, then how to generate HFiles directly and load those pre-built files into HBase.
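To make the column mapping concrete, here is a hypothetical two-row sample of the kind of TSV input the test below consumes (the IP addresses and values are invented for illustration, and the whitespace shown stands for single tab characters). The first field supplies HBASE_ROW_KEY, and the remaining eight fields map, in order, to IPAddress:countrycode, countryname, region, regionname, city, latitude, longitude and timezone:

1.2.3.4    US    United States    CA    California    Los Angeles    34.05    -118.24    America/Los_Angeles
5.6.7.8    CN    China            02    Zhejiang      Hangzhou       30.29    120.16     Asia/Shanghai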

The purpose of bulk load is to load files on HDFS into HBase by way of MapReduce, which is very useful for loading massive amounts of data into HBase.

The test run is as follows:

landen@Master:~/UntarFile/hadoop-1.0.4$ bin/hadoop jar $HADOOP_HOME/lib/hbase-0.94.12.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,IPAddress:countrycode,IPAddress:countryname,IPAddress:region,IPAddress:regionname,IPAddress:city,IPAddress:latitude,IPAddress:longitude,IPAddress:timezone -Dimporttsv.bulk.output=/output HiddenIPInfo /input
Warning: $HADOOP_HOME is deprecated.

13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:host.name=Master
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:java.version=1.7.0_17
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:java.home=/home/landen/UntarFile/jdk1.7.0_17/jre
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/home/landen/UntarFile/hadoop-1.0.4/conf:/home/landen/UntarFile/jdk1.7.0_17/lib/tools.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/..:/home/landen/UntarFile/hadoop-1.0.4/libexec/../hadoop-core-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/asm-3.2.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/aspectjrt-1.6.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/aspectjtools-1.6.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/chukwa-0.5.0-client.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/chukwa-0.5.0.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-beanutils-1.7.0.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-beanutils-core-1.8.0.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-cli-1.2.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-codec-1.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-collections-3.2.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-configuration-1.6.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-daemon-1.0.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-digester-1.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-el-1.0.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-httpclient-3.0.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-io-2.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-lang-2.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-logging-1.1.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-logging-api-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-math-2.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-net-1.4.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/core-3.1.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/guava-11.0.2.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hadoop-capacity-scheduler-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hadoop-fairscheduler-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hadoop-thriftfs-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hbase-0.94.12.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hsqldb-1.8.0.10.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jackson-core-asl-1.8.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jasper-compiler-5.5.12.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jasper-runtime-5.5.12.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jdeb-0.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jersey-core-1.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jersey-json-1.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jersey-server-1.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jets3t-0.6.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jetty-6.1.26.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jetty-util-6.1.26.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jsch-0.1.42.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/json-simple-1.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/junit-4.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/kfs-0.2.2.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/LoadJsonData.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/..
/lib/log4j-1.2.15.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/mockito-all-1.8.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/oro-2.0.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/protobuf-java-2.4.0a.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/servlet-api-2.5-20081211.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/slf4j-api-1.4.3.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/slf4j-log4j12-1.4.3.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/xmlenc-0.52.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/zookeeper-3.4.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jsp-2.1/jsp-2.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jsp-2.1/jsp-api-2.1.jar
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/native/Linux-i386-32
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:os.arch=i386
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:os.version=3.2.0-24-generic-pae
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:user.name=landen
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/landen
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:user.dir=/home/landen/UntarFile/hadoop-1.0.4
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=Slave1:2222,Master:2222,Slave2:2222 sessionTimeout=180000 watcher=hconnection
13/12/09 21:52:28 INFO zookeeper.ClientCnxn: Opening socket connection to server Slave1/10.21.244.124:2222. Will not attempt to authenticate using SASL (unknown error)
13/12/09 21:52:28 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 6809@Master
13/12/09 21:52:28 INFO zookeeper.ClientCnxn: Socket connection established to Slave1/10.21.244.124:2222, initiating session
13/12/09 21:52:28 INFO zookeeper.ClientCnxn: Session establishment complete on server Slave1/10.21.244.124:2222, sessionid = 0x142cbdf535f0010, negotiated timeout = 180000
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=Slave1:2222,Master:2222,Slave2:2222 sessionTimeout=180000 watcher=catalogtracker-on-org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@821075
13/12/09 21:52:28 INFO zookeeper.ClientCnxn: Opening socket connection to server Slave2/10.21.244.110:2222. Will not attempt to authenticate using SASL (unknown error)
13/12/09 21:52:28 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 6809@Master
13/12/09 21:52:28 INFO zookeeper.ClientCnxn: Socket connection established to Slave2/10.21.244.110:2222, initiating session
13/12/09 21:52:28 INFO zookeeper.ClientCnxn: Session establishment complete on server Slave2/10.21.244.110:2222, sessionid = 0x242d5abedac0016, negotiated timeout = 180000
13/12/09 21:52:28 INFO zookeeper.ClientCnxn: EventThread shut down
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Session: 0x242d5abedac0016 closed
13/12/09 21:52:28 INFO mapreduce.HFileOutputFormat: Looking up current regions for table org.apache.hadoop.hbase.client.HTable@1ae6df8
13/12/09 21:52:28 INFO mapreduce.HFileOutputFormat: Configuring 1 reduce partitions to match current region count
13/12/09 21:52:28 INFO mapreduce.HFileOutputFormat: Writing partition information to hdfs://Master:9000/user/landen/partitions_b0c3723c-85ea-4828-8521-52de201023f0
13/12/09 21:52:28 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/12/09 21:52:28 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/12/09 21:52:28 INFO compress.CodecPool: Got brand-new compressor
13/12/09 21:52:29 INFO mapreduce.HFileOutputFormat: Incremental table output configured.
13/12/09 21:52:34 INFO input.FileInputFormat: Total input paths to process : 1
13/12/09 21:52:34 WARN snappy.LoadSnappy: Snappy native library not loaded
13/12/09 21:52:35 INFO mapred.JobClient: Running job: job_201312042044_0027
13/12/09 21:52:36 INFO mapred.JobClient:  map 0% reduce 0%
13/12/09 21:53:41 INFO mapred.JobClient:  map 100% reduce 0%
13/12/09 21:53:56 INFO mapred.JobClient:  map 100% reduce 100%
13/12/09 21:54:01 INFO mapred.JobClient: Job complete: job_201312042044_0027
13/12/09 21:54:01 INFO mapred.JobClient: Counters: 30
13/12/09 21:54:01 INFO mapred.JobClient:   Job Counters
13/12/09 21:54:01 INFO mapred.JobClient:     Launched reduce tasks=1
13/12/09 21:54:01 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=42735
13/12/09 21:54:01 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/12/09 21:54:01 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/12/09 21:54:01 INFO mapred.JobClient:     Launched map tasks=1
13/12/09 21:54:01 INFO mapred.JobClient:     Data-local map tasks=1
13/12/09 21:54:01 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=13878
13/12/09 21:54:01 INFO mapred.JobClient:   ImportTsv
13/12/09 21:54:01 INFO mapred.JobClient:     Bad Lines=0
13/12/09 21:54:01 INFO mapred.JobClient:   File Output Format Counters
13/12/09 21:54:01 INFO mapred.JobClient:     Bytes Written=2194
13/12/09 21:54:01 INFO mapred.JobClient:   FileSystemCounters
13/12/09 21:54:01 INFO mapred.JobClient:     FILE_BYTES_READ=1895
13/12/09 21:54:01 INFO mapred.JobClient:     HDFS_BYTES_READ=333
13/12/09 21:54:01 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=77323
13/12/09 21:54:01 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2194
13/12/09 21:54:01 INFO mapred.JobClient:   File Input Format Counters
13/12/09 21:54:01 INFO mapred.JobClient:     Bytes Read=233
13/12/09 21:54:01 INFO mapred.JobClient:   Map-Reduce Framework
13/12/09 21:54:01 INFO mapred.JobClient:     Map output materialized bytes=1742
13/12/09 21:54:01 INFO mapred.JobClient:     Map input records=3
13/12/09 21:54:01 INFO mapred.JobClient:     Reduce shuffle bytes=1742
13/12/09 21:54:01 INFO mapred.JobClient:     Spilled Records=6
13/12/09 21:54:01 INFO mapred.JobClient:     Map output bytes=1724
13/12/09 21:54:01 INFO mapred.JobClient:     Total committed heap usage (bytes)=131731456
13/12/09 21:54:01 INFO mapred.JobClient:     CPU time spent (ms)=14590
13/12/09 21:54:01 INFO mapred.JobClient:     Combine input records=0
13/12/09 21:54:01 INFO mapred.JobClient:     SPLIT_RAW_BYTES=100
13/12/09 21:54:01 INFO mapred.JobClient:     Reduce input records=3
13/12/09 21:54:01 INFO mapred.JobClient:     Reduce input groups=3
13/12/09 21:54:01 INFO mapred.JobClient:     Combine output records=0
13/12/09 21:54:01 INFO mapred.JobClient:     Physical memory (bytes) snapshot=184393728
13/12/09 21:54:01 INFO mapred.JobClient:     Reduce output records=24
13/12/09 21:54:01 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=698474496
13/12/09 21:54:01 INFO mapred.JobClient:     Map output records=3
landen@Master:~/UntarFile/hadoop-1.0.4$ bin/hadoop fs -ls /output
Warning: $HADOOP_HOME is deprecated.

Found 3 items
drwxr-xr-x   - landen supergroup          0 2013-12-09 21:53 /output/IPAddress
-rw-r--r--   1 landen supergroup          0 2013-12-09 21:53 /output/_SUCCESS
drwxr-xr-x   - landen supergroup          0 2013-12-09 21:52 /output/_logs
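Before loading, the generated HFile under /output/IPAddress can optionally be inspected with the HFile tool that ships with HBase. This is only a sketch, assuming an HBase client installation with bin/hbase available; <generated-hfile-name> is a placeholder for whatever file importtsv actually produced:

$HBASE_HOME/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile -p -f hdfs://Master:9000/output/IPAddress/<generated-hfile-name>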

The completebulkload tool reads the generated files, determines which region each of them belongs to, and then contacts the appropriate region server. The region server moves the HFile into its own storage directory and brings the data online for clients.

landen@Master:~/UntarFile/hadoop-1.0.4$ bin/hadoop jar $HADOOP_HOME/lib/hbase-0.94.12.jar completebulkload /output HiddenIPInfo
(here HiddenIPInfo is the name of the target HBase table)
Warning: $HADOOP_HOME is deprecated.

13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:host.name=Master
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:java.version=1.7.0_17
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:java.home=/home/landen/UntarFile/jdk1.7.0_17/jre
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/home/landen/UntarFile/hadoop-1.0.4/conf:/home/landen/UntarFile/jdk1.7.0_17/lib/tools.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/..:/home/landen/UntarFile/hadoop-1.0.4/libexec/../hadoop-core-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/asm-3.2.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/aspectjrt-1.6.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/aspectjtools-1.6.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/chukwa-0.5.0-client.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/chukwa-0.5.0.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-beanutils-1.7.0.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-beanutils-core-1.8.0.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-cli-1.2.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-codec-1.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-collections-3.2.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-configuration-1.6.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-daemon-1.0.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-digester-1.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-el-1.0.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-httpclient-3.0.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-io-2.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-lang-2.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-logging-1.1.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-logging-api-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-math-2.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-net-1.4.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/core-3.1.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/guava-11.0.2.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hadoop-capacity-scheduler-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hadoop-fairscheduler-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hadoop-thriftfs-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hbase-0.94.12.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hsqldb-1.8.0.10.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jackson-core-asl-1.8.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jasper-compiler-5.5.12.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jasper-runtime-5.5.12.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jdeb-0.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jersey-core-1.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jersey-json-1.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jersey-server-1.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jets3t-0.6.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jetty-6.1.26.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jetty-util-6.1.26.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jsch-0.1.42.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/json-simple-1.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/junit-4.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/kfs-0.2.2.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/LoadJsonData.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/..
/lib/log4j-1.2.15.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/mockito-all-1.8.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/oro-2.0.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/protobuf-java-2.4.0a.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/servlet-api-2.5-20081211.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/slf4j-api-1.4.3.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/slf4j-log4j12-1.4.3.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/xmlenc-0.52.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/zookeeper-3.4.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jsp-2.1/jsp-2.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jsp-2.1/jsp-api-2.1.jar
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/native/Linux-i386-32
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:os.arch=i386
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:os.version=3.2.0-24-generic-pae
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:user.name=landen
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/landen
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:user.dir=/home/landen/UntarFile/hadoop-1.0.4
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=Slave1:2222,Master:2222,Slave2:2222 sessionTimeout=180000 watcher=hconnection
13/12/09 22:00:00 INFO zookeeper.ClientCnxn: Opening socket connection to server Slave1/10.21.244.124:2222. Will not attempt to authenticate using SASL (unknown error)
13/12/09 22:00:00 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 7168@Master
13/12/09 22:00:00 INFO zookeeper.ClientCnxn: Socket connection established to Slave1/10.21.244.124:2222, initiating session
13/12/09 22:00:00 INFO zookeeper.ClientCnxn: Session establishment complete on server Slave1/10.21.244.124:2222, sessionid = 0x142cbdf535f0011, negotiated timeout = 180000
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=Slave1:2222,Master:2222,Slave2:2222 sessionTimeout=180000 watcher=catalogtracker-on-org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@a13b90
13/12/09 22:00:00 INFO zookeeper.ClientCnxn: Opening socket connection to server Slave1/10.21.244.124:2222. Will not attempt to authenticate using SASL (unknown error)
13/12/09 22:00:00 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 7168@Master
13/12/09 22:00:00 INFO zookeeper.ClientCnxn: Socket connection established to Slave1/10.21.244.124:2222, initiating session
13/12/09 22:00:00 INFO zookeeper.ClientCnxn: Session establishment complete on server Slave1/10.21.244.124:2222, sessionid = 0x142cbdf535f0012, negotiated timeout = 180000
13/12/09 22:00:01 INFO zookeeper.ZooKeeper: Session: 0x142cbdf535f0012 closed
13/12/09 22:00:01 INFO zookeeper.ClientCnxn: EventThread shut down
13/12/09 22:00:01 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://Master:9000/output/_SUCCESS
13/12/09 22:00:01 INFO hfile.CacheConfig: Allocating LruBlockCache with maximum size 222.2m
13/12/09 22:00:01 INFO util.ChecksumType: Checksum can use java.util.zip.CRC32
13/12/09 22:00:01 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://Master:9000/output/IPAddress/b29b74ad57ff4be1a62968229b7e23d4 first=125.111.251.118 last=60.180.248.201
landen@Master:~/UntarFile/hadoop-1.0.4$
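For reference, completebulkload is a driver alias for the org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles class, so the same load step could also be launched through the HBase client script; a sketch assuming $HBASE_HOME points at the HBase installation:

$HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /output HiddenIPInfo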

Query the data that was bulk-imported into the HBase table HiddenIPInfo from the HBase shell:

hbase(main):045:0> scan 'HiddenIPInfo'
ROW                            COLUMN+CELL                                                                             
 125.111.251.118               column=IPAddress:city, timestamp=1386597147615, value=Ningbo                            
 125.111.251.118               column=IPAddress:countrycode, timestamp=1386597147615, value=CN                         
 125.111.251.118               column=IPAddress:countryname, timestamp=1386597147615, value=China                      
 125.111.251.118               column=IPAddress:latitude, timestamp=1386597147615, value=29.878204                     
 125.111.251.118               column=IPAddress:longitude, timestamp=1386597147615, value=121.5495                     
 125.111.251.118               column=IPAddress:region, timestamp=1386597147615, value=02                              
 125.111.251.118               column=IPAddress:regionname, timestamp=1386597147615, value=Zhejiang                    
 125.111.251.118               column=IPAddress:timezone, timestamp=1386597147615, value=Asia/Shanghai                 
 221.12.10.218                 column=IPAddress:city, timestamp=1386597147615, value=Hangzhou                          
 221.12.10.218                 column=IPAddress:countrycode, timestamp=1386597147615, value=CN                         
 221.12.10.218                 column=IPAddress:countryname, timestamp=1386597147615, value=China                      
 221.12.10.218                 column=IPAddress:latitude, timestamp=1386597147615, value=30.293594                     
 221.12.10.218                 column=IPAddress:longitude, timestamp=1386597147615, value=120.16141                    
 221.12.10.218                 column=IPAddress:region, timestamp=1386597147615, value=02                              
 221.12.10.218                 column=IPAddress:regionname, timestamp=1386597147615, value=Zhejiang                    
 221.12.10.218                 column=IPAddress:timezone, timestamp=1386597147615, value=Asia/Shanghai                 
 60.180.248.201                column=IPAddress:city, timestamp=1386597147615, value=Wenzhou                           
 60.180.248.201                column=IPAddress:countrycode, timestamp=1386597147615, value=CN                         
 60.180.248.201                column=IPAddress:countryname, timestamp=1386597147615, value=China                      
 60.180.248.201                column=IPAddress:latitude, timestamp=1386597147615, value=27.999405                     
 60.180.248.201                column=IPAddress:longitude, timestamp=1386597147615, value=120.66681                    
 60.180.248.201                column=IPAddress:region, timestamp=1386597147615, value=02                              
 60.180.248.201                column=IPAddress:regionname, timestamp=1386597147615, value=Zhejiang                    
 60.180.248.201                column=IPAddress:timezone, timestamp=1386597147615, value=Asia/Shanghai                 
3 row(s) in 0.2640 seconds

Notes:

1> HBASE_ROW_KEY does not have to be the first column; if it is specified in the second position, then the second field of each line is used as the row key (an illustrative columns specification is shown after these notes);

2> The mapping between the field positions in the TSV file and the columns of the HBase table is configured through the -Dimporttsv.columns parameter;

3> If the output directory -Dimporttsv.bulk.output is specified, the data does not show up in the HiddenIPInfo table yet; the HFiles are only written out under the /output directory (the data lands in HiddenIPInfo only after the completebulkload import step). You then run bin/hadoop jar hbase-VERSION.jar completebulkload /output HiddenIPInfo (where /output is the directory holding the HFiles and HiddenIPInfo is the corresponding HBase table name), which moves the HFiles from that output directory into the appropriate regions; since this step is essentially just a move, it is very fast;

4> If the data set is very large and the table already contains regions, the load step performs splitting: it finds the region each piece of data belongs to, splits files that cross region boundaries, and loads them into the matching regions;

5> For bin/hadoop jar $HADOOP_HOME/lib/hbase-0.94.12.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,IPAddress:countrycode,IPAddress:countryname,IPAddress:region,IPAddress:regionname,IPAddress:city,IPAddress:latitude,IPAddress:longitude,IPAddress:timezone (-Dimporttsv.bulk.output=/output) HiddenIPInfo /input, when the -Dimporttsv.bulk.output parameter is not specified:

1. The table must already have been created before the command is run;

2. In this mode data is written to HBase with Put operations, which is relatively slow; the map phase uses TableOutputFormat. By specifying the -Dimporttsv.bulk.output parameter, the importtsv tool instead uses HFileOutputFormat to write HBase's own storage format files (HFiles) directly to HDFS, and the completebulkload tool is then used to load the generated files into a running cluster; this performs much better. If the table does not exist, the completebulkload tool will create it automatically;

6> The importtsv tool only reads data from HDFS, so we first need to copy the TSV file from the local file system into HDFS. The tool expects the source file to be in TSV format; for details on the format see http://en.wikipedia.org/wiki/Tab-separated_values. After obtaining the source file, upload it to HDFS first, e.g. hadoop dfs -copyFromLocal file:///path/to/source-file hdfs:///path/to/source-file. By default the fields of the source file are split on "\t"; if you need a different separator, add -Dimporttsv.separator="," when running the command and the fields will then be split on ",". A sketch of this workflow follows these notes.
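To make note 1> concrete, an illustrative (hypothetical) columns specification in which the second TSV field supplies the row key would look like this:

-Dimporttsv.columns=IPAddress:countrycode,HBASE_ROW_KEY,IPAddress:city

And here is a minimal sketch of the staging-and-import workflow from note 6>, assuming a comma-separated source file and illustrative local and HDFS paths (the column list is shortened for readability):

# upload the local source file to HDFS first (importtsv only reads input from HDFS)
bin/hadoop dfs -copyFromLocal file:///home/landen/iplist.csv hdfs:///input/iplist.csv
# run importtsv, overriding the default tab separator with a comma
bin/hadoop jar $HADOOP_HOME/lib/hbase-0.94.12.jar importtsv '-Dimporttsv.separator=,' -Dimporttsv.columns=HBASE_ROW_KEY,IPAddress:countrycode,IPAddress:city HiddenIPInfo /input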
