Import data to HBase

There are three common ways to import data into HBase:

1> Use ImportTsv

ImportTsv is a utility that will load data in TSV format into HBase.

It can be used in two ways:

a. Loading TSV-format data from HDFS into HBase via Puts.

This is a non-bulk load: each row is written through the normal HBase write path.

E.g.:

bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,information:c1 table_name /user/hadoop/data

b. Bulk loading.

It is divided into two steps:

1> Generate StoreFiles for bulk loading:

 HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.94.2.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,information:c1 -Dimporttsv.bulk.output=/user/hadoop/rdc_search_hfile -Dimporttsv.separator=, rdc_search_information /user/hadoop/sourcedata

2> Move the generated StoreFiles into the HBase table using the completebulkload utility:

bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/hadoop/rdc_search_hfile rdc_search_information
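
Per the referenced HBase book, the same step can also be driven through the hadoop jar runner, matching the style of step 1> above (same paths and jar as in that example):

HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.94.2.jar completebulkload /user/hadoop/rdc_search_hfile rdc_search_information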

In the examples above, the table name is rdc_search_information and it has one column family, information. The -Dimporttsv.separator=, option tells ImportTsv the source file is comma-separated rather than tab-separated.

Running ImportTsv with no arguments prints brief usage information:

Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>

Imports the given input directory of TSV data into the specified table.

The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family, or a columnfamily:qualifier. The
special column name HBASE_ROW_KEY is used to designate that this column should
be used as the row key for each imported record. You must specify exactly one
column to be the row key, and you must specify a column name for every column
that exists in the input data.

By default importtsv will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
  -Dimporttsv.bulk.output=/path/for/output
  Note: the target table will be created with default column family descriptors if it does not already exist.

Other options that may be specified with -D include:
  -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
  '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
  -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
  -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper


2> Write a Java program that reads the source file from the local file system and uses the Put class to write each record into HBase, as sketched below.
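
A minimal sketch of that approach, assuming the table and column family from the examples above, the HBase 0.94-era client API, and a hypothetical local file /home/hadoop/sourcedata.csv containing rowkey,value lines:

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutLoader {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        HTable table = new HTable(conf, "rdc_search_information");
        BufferedReader reader = new BufferedReader(
                new FileReader("/home/hadoop/sourcedata.csv"));  // hypothetical local path
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split(",");
            Put put = new Put(Bytes.toBytes(fields[0]));    // first field is the row key
            put.add(Bytes.toBytes("information"),           // column family
                    Bytes.toBytes("c1"),                    // qualifier
                    Bytes.toBytes(fields[1]));              // value
            table.put(put);                                 // non-bulk write path
        }
        reader.close();
        table.close();
    }
}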

3> Use Pig to read the data from HDFS and write it to HBase.
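
A minimal Pig Latin sketch of this approach, assuming the same comma-separated source data in HDFS and the table from the examples above; Pig's built-in HBaseStorage handles the write, and the first field of each tuple becomes the row key:

data = LOAD '/user/hadoop/sourcedata' USING PigStorage(',') AS (rowkey:chararray, c1:chararray);
STORE data INTO 'hbase://rdc_search_information'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('information:c1');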

 

Ref: http://hbase.apache.org/book/ops_mgt.html#importtsv

 
