This tutorial will quickly teach you how to use HBase, a column-oriented database that sits on top of Hadoop. It works best when you have large tables and are accessing your Big Data randomly and in real time. Though it does not support SQL, HBase can easily be connected to Hive, providing you with the read/write speed of HBase, the ease of Hive, and the parallel processing of MapReduce.
The BigSQL bundle automatically starts HBase in pseudo-distributed mode, in which a master and a region server both run on your local computer.
The tutorial uses the data file previously used in the Hadoop Hive Tutorial (see that tutorial for all prerequisites). If you have not grabbed the file already, it is located in the zipfile here. Place the file ex1data.csv into the ~/Downloads/Sample_files/ directory.
Note: if you are using the BigSQL distribution (highly recommended) make sure you are using at least version beta 2.28!
The first step is to upload the CSV file into HDFS. Use the hadoop fs command to create the directory and copy ex1data.csv from your Downloads folder.
$ hadoop fs -mkdir /user/data/salesdata
$ hadoop fs -copyFromLocal ~/Downloads/Sample_files/ex1data.csv /user/data/salesdata/ex1data.csv
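Before or after uploading, it can help to confirm that every line of the file has the same number of comma-separated fields, since the import later in this tutorial maps 22 columns (the row key plus 21 named columns). The helper below is a minimal sketch; the function name field_counts is made up for illustration, and it splits on every comma, so a quoted value that contains a comma is counted as two fields.

```shell
# Print "<field count> <number of lines>" for each distinct field count in a
# comma-separated file, smallest field count first.
# field_counts is a hypothetical helper name, not part of Hadoop or HBase.
field_counts() {
  awk -F',' '{ counts[NF]++ } END { for (n in counts) print n, counts[n] }' "$1" | sort -n
}

# Example call (left as a comment so the snippet has no side effects):
# field_counts ~/Downloads/Sample_files/ex1data.csv
```

A single output line starting with 22 would mean every line has the expected shape. Any other line points at rows worth inspecting before the import.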
Next, start the hbase shell and create the table “sales_data” with the column families location, units, size, age and pricing.
$ hbase shell
hbase > create 'sales_data', 'location', 'units', 'size', 'age', 'pricing'
hbase > quit
Use the ImportTsv tool to import the CSV file into the HBase table. The column that will serve as the row key does not need to be listed by name. In this example, we list HBASE_ROW_KEY in its position instead of explicitly saying s_num.
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=,' -Dimporttsv.columns=HBASE_ROW_KEY,location:s_borough,location:s_neighbor,location:s_b_class,location:s_c_p,location:s_block,location:s_lot,location:s_easement,location:w_c_p_2,location:s_address,location:s_app_num,location:s_zip,units:s_res_units,units:s_com_units,units:s_tot_units,size:s_sq_ft,size:s_g_sq_ft,age:s_yr_built,pricing:s_tax_c,pricing:s_b_class2,pricing:s_price,pricing:s_sales_dt sales_data /user/data/salesdata/ex1data.csv
Since this file is separated by commas and not tabs, you need to specify '-Dimporttsv.separator=,'.
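The column list has to reach ImportTsv as a single argument with no spaces in it, which is easy to get wrong when typing 22 entries on one line. One way to keep it readable is to write one column per line and join the lines in the shell. This is only a sketch: columns_spec, SALES_COLUMNS and import_sales_data are names invented here, and the import function is defined but not called, because it needs a running HBase and the file already in HDFS.

```shell
# Join the non-empty lines on stdin into one comma-separated token,
# dropping any spaces or tabs. (Hypothetical helper name.)
columns_spec() {
  tr -d ' \t' | grep -v '^$' | paste -sd, -
}

# The same 22 entries used in the ImportTsv command above, one per line.
SALES_COLUMNS=$(columns_spec <<'EOF'
HBASE_ROW_KEY
location:s_borough
location:s_neighbor
location:s_b_class
location:s_c_p
location:s_block
location:s_lot
location:s_easement
location:w_c_p_2
location:s_address
location:s_app_num
location:s_zip
units:s_res_units
units:s_com_units
units:s_tot_units
size:s_sq_ft
size:s_g_sq_ft
age:s_yr_built
pricing:s_tax_c
pricing:s_b_class2
pricing:s_price
pricing:s_sales_dt
EOF
)

# Defined but not invoked here: requires HBase and the uploaded CSV file.
import_sales_data() {
  hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    '-Dimporttsv.separator=,' \
    "-Dimporttsv.columns=$SALES_COLUMNS" \
    sales_data /user/data/salesdata/ex1data.csv
}
```

Running import_sales_data afterwards should be equivalent to the one-line command, assuming the same table and HDFS path.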
HBase is also very good with bulk uploads. To do this, pass the 'importtsv.bulk.output' option to ImportTsv so that it generates HBase-compatible files instead of writing to the table directly, then use the 'completebulkload' utility to load those files into the HBase table.
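The two bulk-load steps can be sketched as a pair of commands. The function below only prints them for review and does not run anything. Its name (bulk_load_plan) and the default staging directory /user/data/salesdata_hfiles are assumptions made for this example. LoadIncrementalHFiles in the mapreduce package is the class behind completebulkload in older HBase releases. Newer releases may ship it under a different package or name, so check the documentation for your version.

```shell
# Print the two commands of a bulk load for review (nothing is executed).
#   $1: the comma-separated importtsv.columns value (no spaces)
#   $2: HDFS directory for the generated HFiles (hypothetical default)
bulk_load_plan() {
  columns=$1
  hfile_dir=${2:-/user/data/salesdata_hfiles}
  printf '%s\n' \
    "hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=,' -Dimporttsv.bulk.output=$hfile_dir -Dimporttsv.columns=$columns sales_data /user/data/salesdata/ex1data.csv" \
    "hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles $hfile_dir sales_data"
}

# Example call (commented out):
# bulk_load_plan 'HBASE_ROW_KEY,location:s_borough,...'
```

The first command writes HFiles to the staging directory instead of inserting rows. The second hands those files to the region servers that host sales_data.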
To ensure that the table has been created and loaded into HBase, you can use the list command to show all HBase tables.
$ hbase shell
hbase > list
TABLE
sales_data
To check the data within the table, you can use the scan command. This prints every cell in the table on its own line of output, so expect many lines for each row key.
hbase > scan 'sales_data'
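A full scan of a large table produces a lot of output. A more selective check could look like the sketch below, which prints a few HBase shell commands that can be piped into hbase shell. The function name inspect_script is made up for this example. LIMIT and COLUMNS are standard scan options, and count reports the number of rows.

```shell
# Print a short HBase shell script that samples the table instead of
# dumping it. (inspect_script is a hypothetical helper name.)
inspect_script() {
  cat <<'EOF'
scan 'sales_data', {LIMIT => 5}
scan 'sales_data', {COLUMNS => ['pricing:s_price'], LIMIT => 5}
count 'sales_data'
EOF
}

# To run it against a live HBase (commented out here):
# inspect_script | hbase shell
```

The first scan shows all cells for five rows, the second shows only the pricing:s_price cell for five rows, and count gives a quick check that the number of imported rows matches the file.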
To add the table to Hive, create an external table in Hive stored by org.apache.hadoop.hive.hbase.HBaseStorageHandler. You must list the hbase.columns.mapping as shown below. Note that even though s_num is listed in the definition of the table, it is not listed under the serde properties; the first Hive column is mapped to the HBase row key implicitly.
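As an aside on the row key: Hive's HBase integration also accepts an explicit :key entry in hbase.columns.mapping, and the Hive documentation recommends it over relying on the implicit mapping. The sketch below prints a shortened statement that uses it. The Hive table name sales_data_keyed and the choice of only three columns are assumptions made to keep the example small, and hbase.table.name points the Hive table at the existing HBase table.

```shell
# Print a shortened CREATE statement with an explicit :key mapping.
# (keyed_table_ddl and sales_data_keyed are hypothetical names.)
keyed_table_ddl() {
  cat <<'EOF'
CREATE EXTERNAL TABLE IF NOT EXISTS sales_data_keyed (
  s_num FLOAT,
  s_borough INT,
  s_price FLOAT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,location:s_borough,pricing:s_price"
)
TBLPROPERTIES ("hbase.table.name" = "sales_data");
EOF
}

# To run it against a live Hive (commented out here):
# hive -e "$(keyed_table_ddl)"
```

With :key present, the mapping has one entry per Hive column, in the same order. The tutorial's full statement, which relies on the implicit mapping, follows.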
$ hive
hive > CREATE EXTERNAL TABLE IF NOT EXISTS sales_data (
  s_num FLOAT,
  s_borough INT,
  s_neighbor STRING,
  s_b_class STRING,
  s_c_p STRING,
  s_block STRING,
  s_lot STRING,
  s_easement STRING,
  w_c_p_2 STRING,
  s_address STRING,
  s_app_num STRING,
  s_zip STRING,
  s_res_units STRING,
  s_com_units STRING,
  s_tot_units INT,
  s_sq_ft FLOAT,
  s_g_sq_ft FLOAT,
  s_yr_built INT,
  s_tax_c INT,
  s_b_class2 STRING,
  s_price FLOAT,
  s_sales_dt STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "location:s_borough,location:s_neighbor,location:s_b_class,location:s_c_p,location:s_block,location:s_lot,location:s_easement,location:w_c_p_2,location:s_address,location:s_app_num,location:s_zip,units:s_res_units,units:s_com_units,units:s_tot_units,size:s_sq_ft,size:s_g_sq_ft,age:s_yr_built,pricing:s_tax_c,pricing:s_b_class2,pricing:s_price,pricing:s_sales_dt");

hive> DESCRIBE sales_data;
OK
col_name data_type comment
s_num float from deserializer
s_borough int from deserializer
s_neighbor string from deserializer
s_b_class string from deserializer
s_c_p string from deserializer
s_block string from deserializer
s_lot string from deserializer
s_easement string from deserializer
w_c_p_2 string from deserializer
s_address string from deserializer
s_app_num string from deserializer
s_zip string from deserializer
s_res_units string from deserializer
s_com_units string from deserializer
s_tot_units int from deserializer
s_sq_ft float from deserializer
s_g_sq_ft float from deserializer
s_yr_built int from deserializer
s_tax_c int from deserializer
s_b_class2 string from deserializer
s_price float from deserializer
s_sales_dt string from deserializer
Time taken: 0.27 seconds, Fetched: 22 row(s)

You can also use the HBase Console (localhost:60010/master-status) to check the user tables created, their attributes, and other metrics.
For more information on BigSQL visit BigSQL.org