Hadoop Hbase Tutorial

This tutorial with quickly teach you how to use HBase, a column-oriented tool that sits on top of Hadoop, it works best when you have large tables and are accessing your Big Data randomly and in real-time. Though it does not support SQL, HBase can easily be connected to Hive, providing you with the read/write speed of HBase, the ease of Hive, and the parallel processing of MapReduce.

The BigSQL bundle automatically starts up a pseudo-distributed model of HBase in which a master and region server are both running on your local computer.

The tutorial will use the data file previously used in the Hadoop Hive Tutorial (See this tutorial for all prerequisites).  If you have not grabbed the file already it is located in the zipfile here. Place the file ex1data.csv into the

~/Downloads/Sample_files/ex1data.csv directory.

Note: if you are using the BigSQL distribution (highly recommended) make sure you are using at least version beta 2.28!

The first step is to upload the csv file into HDFS. Use the hadoop fs command to make the directory and copy the ex1data.csv from your Downloads folder.

	$ hadoop fs -mkdir /user/data/salesdata
 	$ hadoop fs -copyFromLocal ~/Downloads/Sample_files/ex1data.csv /user/data/salesdata/ex1data.csv

Next, start the hbase shell and create the table “sales_data” with the column families location, units, size, age and pricing.

	$ hbase shell
     	hbase > create 'sales_data', 'location', 'units', 'size', 'age', 'pricing'
     	hbase > quit

Use the ImportTsv tool to import the csv file into the HBase table. The column that will be the row’s primary key does not need to be listed by name. In this example, we list HBASE_ROW_KEY instead of explicitly saying s_num.

     	$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=,'
            -Dimporttsv.columns=HBASE_ROW_KEY,location:s_borough,location:s_neighbor,
            location:s_b_class,location:s_c_p,location:s_block,location:s_lot,location:s_easement,
            location:w_c_p_2,location:s_address,location:s_app_num,location:s_zip,units:s_res_units,
            units:s_com_units,units:s_tot_units,size:s_sq_ft,size:s_g_sq_ft,age:s_yr_built,
            pricing:s_tax_c,pricing:s_b_class2,pricing:s_price,pricing:s_sales_dt
            sales_data /user/data/salesdata/ex1data.csv

Since this file was separated by comas and not tabs, you need to specify ‘-Dimporttsv.separator=,’.

HBase is also very good with bulk uploads. In order to do this, use the ‘importtsv.bulk.output’ tool to generate compatible files, then use the ‘completebulkloads’ utility to load those into the HBase tables.

To ensure that the table has been created and loaded into hive, you can use the list command to show all HBase tables.

     	$ hbase shell
     	hbase > list
        	TABLE
        	sales_data

To check the data within the table, you can use the scan command. This will list every cell in the table as one row.

     	hbase > scan 'sales_data'

To add the table to hive, create an external table in hive stored by org.apache.hadoop.hive.hbase.HBaseStorageHandler. You must list the hbase.columns.mapping as shown below. Note that the even though s_num is listed in the definition of the table, it is not listed under the serde properties.

     	$ hive
     	hive > CREATE EXTERNAL TABLE IF NOT EXISTS sales_data ( s_num FLOAT, s_borough INT, s_neighbor STRING, s_b_class STRING, s_c_p STRING, s_block STRING, s_lot STRING, s_easement STRING, w_c_p_2 STRING, s_address STRING, s_app_num STRING, s_zip STRING, s_res_units STRING, s_com_units STRING, s_tot_units INT, s_sq_ft FLOAT, s_g_sq_ft FLOAT, s_yr_built INT, s_tax_c INT, s_b_class2 STRING, s_price FLOAT, s_sales_dt STRING ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = "location:s_borough, location:s_neighbor,location:s_b_class,location:s_c_p,location:s_block, location:s_lot,location:s_easement,location:w_c_p_2,location:s_address, location:s_app_num,location:s_zip,units:s_res_units,units:s_com_units, units:s_tot_units,size:s_sq_ft,size:s_g_sq_ft,age:s_yr_built,pricing:s_tax_c, pricing:s_b_class2,pricing:s_price,pricing:s_sales_dt");

	hive> DESCRIBE sales_data;                                                                                  
		OK
		col_name		data_type		comment
		s_num               	float               	from deserializer   
		s_borough           	int                 	from deserializer   
		s_neighbor          	string              	from deserializer   
		s_b_class           	string              	from deserializer   
		s_c_p               	string              	from deserializer   
		s_block             	string              	from deserializer   
		s_lot               	string              	from deserializer   
		s_easement          	string              	from deserializer   
		w_c_p_2             	string              	from deserializer   	
		s_address           	string              	from deserializer   
		s_app_num           	string              	from deserializer   
		s_zip               	string              	from deserializer   
		s_res_units         	string              	from deserializer   
		s_com_units         	string              	from deserializer   
		s_tot_units         	int                 	from deserializer   
		s_sq_ft             	float               	from deserializer   
		s_g_sq_ft           	float               	from deserializer   
		s_yr_built          	int                 	from deserializer   
		s_tax_c             	int                 	from deserializer   
		s_b_class2          	string              	from deserializer   
		s_price             	float               	from deserializer   
		s_sales_dt          	string              	from deserializer   
		Time taken: 0.27 seconds, Fetched: 22 row(s)

You can also use the HBase Console (localhost:60010/master-status) to check the user tables created and their attributes and other metrics!
For more information on BigSQL visit BigSQL.org

你可能感兴趣的:(hadoop,hbase)