There are three common ways to import data into HBase:
1> Use ImportTsv
ImportTsv is a utility that will load data in TSV format into HBase.
It can be used in two ways:
a. Loading data in TSV format from HDFS into HBase via Puts.
This is a non-bulk load.
E.g.:
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,information:c1 table_name /user/hadoop/data
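For illustration, the input file for this command might look like the following (a hypothetical two-line sample; fields are tab-separated because no importtsv.separator is given, the first field becomes the row key, and the second is stored in information:c1):
row1	v1
row2	v2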
b. The other method is bulk loading.
It is divided into two steps:
1> Generate StoreFiles for bulk loading:
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.94.2.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,information:c1 -Dimporttsv.bulk.output=/user/hadoop/rdc_search_hfile -Dimporttsv.separator=, rdc_search_information /user/hadoop/sourcedata
2> Move the generated StoreFiles into the HBase table using the completebulkload utility:
bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/hadoop/rdc_search_hfile rdc_search_information
In the above example, the table name is rdc_search_information and it has one column family, information.
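If the target table does not already exist, it can be created up front in the HBase shell (a minimal example matching this table):
create 'rdc_search_information', 'information'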
Running ImportTsv with no arguments prints brief usage information:
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
Imports the given input directory of TSV data into the specified table.
The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family, or a columnfamily:qualifier. The special
column name HBASE_ROW_KEY is used to designate that this column should be used
as the row key for each imported record. You must specify exactly one column
to be the row key, and you must specify a column name for every column that exists in the
input data.
By default importtsv will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
-Dimporttsv.bulk.output=/path/for/output
Note: the target table will be created with default column family descriptors if it does not already exist.
Other options that may be specified with -D include:
-Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
'-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
-Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
-Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
2> We can write a Java program that reads a file on the local file system and uses the Put class to insert the data into HBase, as sketched below.
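A minimal sketch of this approach against the 0.94-era client API (the local file path, the comma-separated "rowkey,value" input format, and the reuse of the table/column names from above are assumptions for illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutLoader {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml from the classpath
        HTable table = new HTable(conf, "rdc_search_information");
        // hypothetical local file with one "rowkey,value" pair per line
        BufferedReader reader = new BufferedReader(new FileReader("/home/hadoop/data.csv"));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split(",");
            Put put = new Put(Bytes.toBytes(fields[0]));  // first field is the row key
            put.add(Bytes.toBytes("information"),         // column family
                    Bytes.toBytes("c1"),                  // qualifier
                    Bytes.toBytes(fields[1]));            // value
            table.put(put);
        }
        reader.close();
        table.close();
    }
}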
3> Use Pig to read the data from HDFS and write it to HBase, as sketched below.
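A minimal Pig Latin sketch, assuming comma-separated input where the first field is the row key (the paths and names are hypothetical):

raw_data = LOAD '/user/hadoop/sourcedata' USING PigStorage(',') AS (id:chararray, c1:chararray);
STORE raw_data INTO 'hbase://rdc_search_information' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('information:c1');

When storing with HBaseStorage, the first field of each tuple is used as the row key and the remaining fields map to the listed columns.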
Ref: http://hbase.apache.org/book/ops_mgt.html#importtsv