Alex's Hadoop Beginner Tutorial, Lesson 13: Importing into HBase and Hive with Sqoop1

Original post: http://blog.csdn.net/nsrainbow/article/details/41649671

Picking up from the previous lesson, this one shows how to use Sqoop to import MySQL data into HBase or Hive.

Data Preparation

MySQL

Create an employee table in MySQL and insert some data:
CREATE TABLE `employee` (
  `id` int(11) NOT NULL,
  `name` varchar(20) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

insert into employee (id,name) values (1,'michael');  
insert into employee (id,name) values (2,'ted'); 
insert into employee (id,name) values (3,'jack'); 

HBase

hbase(main):006:0> create 'employee','info'
0 row(s) in 0.4440 seconds

=> Hbase::Table - employee

Hive

No preparation needed here; we'll use --create-hive-table later to create the table automatically.


Importing from MySQL into HBase

# sqoop import --connect jdbc:mysql://host1:3306/sqoop_test --username root --password root --table employee --hbase-table employee --column-family info --hbase-row-key id -m 1
  • --table: the source table in MySQL
  • --hbase-table: the target HBase table
  • --column-family: the column family to import into
  • --hbase-row-key: the column whose value becomes the HBase row key
  • See http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html#_importing_data_into_hbase for the full set of parameters
Execution output:
…………
); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
15/01/23 07:56:49 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
15/01/23 07:57:16 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 1,315.279 seconds (0 bytes/sec)
15/01/23 07:57:16 INFO mapreduce.ImportJobBase: Retrieved 3 records.

At the end it reports that 3 records were imported successfully.

Now check HBase:

hbase(main):001:0> scan 'employee'
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
ROW                                      COLUMN+CELL                                                                                                           
 1                                       column=info:name, timestamp=1417426628685, value=michael                                                              
 2                                       column=info:name, timestamp=1417426628685, value=ted                                                                  
 3                                       column=info:name, timestamp=1417426628685, value=jack                                                                 
3 row(s) in 0.1630 seconds

The 3 rows were inserted successfully.
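
By the way, you don't have to create the HBase table by hand first. A minimal variant of the command above, assuming the same cluster and credentials, uses the standard --hbase-create-table flag to let Sqoop create the missing table and column family itself:

# sqoop import --connect jdbc:mysql://host1:3306/sqoop_test --username root --password root --table employee --hbase-table employee --column-family info --hbase-row-key id --hbase-create-table -m 1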

Importing from MySQL into Hive

# sqoop import --connect jdbc:mysql://host1:3306/sqoop_test --username root --password root --table employee --hive-import --hive-table hive_employee --create-hive-table

  • --hive-import: marks this as a Hive import
  • --hive-table: the target Hive table
  • --create-hive-table: create the Hive table automatically as part of the import
  • See http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html#_importing_data_into_hive for the full set of parameters
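
Note that --create-hive-table makes the job fail if the Hive table already exists. For repeat runs, a sketch of one alternative: drop that flag and use the standard --hive-overwrite option to replace the table's contents instead:

# sqoop import --connect jdbc:mysql://host1:3306/sqoop_test --username root --password root --table employee --hive-import --hive-table hive_employee --hive-overwrite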


One more reminder: do not use localhost in the MySQL JDBC URL. The import job is shipped to different machines in the Hadoop cluster, and every one of those machines must genuinely be able to reach MySQL over JDBC; otherwise you will lose data.
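
A quick sanity check, assuming the mysql command-line client is installed on the worker nodes, is to run something like this on each of them and confirm it connects:

# mysql -h host1 -P 3306 -u root -proot -e 'select 1' sqoop_test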

Check Hive:
hive> select * from hive_employee;
OK
1	michael
2	ted
3	jack
Time taken: 0.179 seconds, Fetched: 3 row(s)

One more caveat: at the moment Sqoop cannot import data into a Hive table that is backed by HBase.
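
A common workaround (a sketch, not from the original post; the table name hbase_employee is illustrative) is to import into HBase with Sqoop as shown earlier, then map that HBase table into Hive with the HBase storage handler, so Hive queries read the HBase data directly:

hive> CREATE EXTERNAL TABLE hbase_employee (id int, name string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name")
    > TBLPROPERTIES ("hbase.table.name" = "employee");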

Importing from MySQL into a partitioned Hive table

Importing into a partitioned table is a bit more involved.

Step 1

Create the people table and insert some data:
CREATE TABLE `people` (
  `id` int(11) NOT NULL,
  `name` varchar(20) NOT NULL,
  `year` varchar(10),
  `month` varchar(10),
  `day` varchar(10),
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

insert into people values (1,'jack','2015','01','02');
insert into people values (2,'ted','2015','01','02');
insert into people values (3,'billy','2015','01','02');
insert into people values (4,'sara','2015','01','03');

Step 2

sqoop import --connect jdbc:mysql://xmseapp03:3306/sqoop_test --username sqoop --password sqoop --query 'select id,name from people where year="2015"  AND $CONDITIONS'  --direct -m 2 --split-by id --hive-import --create-hive-table --hive-table hive_people --target-dir /user/hive_people --hive-partition-key year --hive-partition-value '2015'
  • --query: the SQL to run against MySQL; the AND $CONDITIONS clause cannot be omitted, because Sqoop replaces $CONDITIONS with each mapper's split-range predicate
  • --split-by: the column used to split the work between mappers, normally the primary key
  • --hive-partition-key: the partition column of the Hive table; --hive-partition-value: the partition value this import loads into
After the import succeeds, check the data:
hive> select * from hive_people;
OK
1	jack	2015
2	ted	2015
3	billy	2015
4	sara	2015
Time taken: 1.041 seconds, Fetched: 4 row(s)

Let's see how the data is stored in HDFS. The fields look glued together because Hive's default field delimiter is the non-printing \001 (Ctrl-A) character:
[root@host1 ~]$ hdfs dfs -cat /user/hive/warehouse/hive_people/year=2015/part-m-00000
1jack
2ted
[root@host1 ~]$ hdfs dfs -cat /user/hive/warehouse/hive_people/year=2015/part-m-00001
3billy
4sara
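
Each run like this loads exactly one partition. To load another year into the same table, a sketch mirroring the command above (assuming MySQL also held year 2016 rows, which the sample data does not): drop --create-hive-table, change the year in both the query and --hive-partition-value, and point --target-dir at a fresh directory:

sqoop import --connect jdbc:mysql://xmseapp03:3306/sqoop_test --username sqoop --password sqoop --query 'select id,name from people where year="2016" AND $CONDITIONS' --direct -m 2 --split-by id --hive-import --hive-table hive_people --target-dir /user/hive_people_2016 --hive-partition-key year --hive-partition-value '2016'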




