Sometimes you need to import a full snapshot of MySQL data into Hive or HBase, and Sqoop is a convenient, relatively fast tool for the job; incremental MySQL changes can be synchronized in real time by other means.
Import command:
sqoop import --connect jdbc:mysql://xxx.xxx.xxx.xxx:3306/database --table tablename --hbase-table hbasetablename --column-family family --hbase-row-key ID --hbase-create-table --username 'root' -P
Parameter description:
--connect: the JDBC connection string for the database
--username: the database user
-P: prompt for the password interactively
--table: the source table name
-m: the number of map tasks that run the import in parallel; 4 maps are launched by default if not specified
--split-by: the column each map task uses to partition the data during a parallel import; pick a column that splits the data fairly evenly, such as a creation timestamp or an auto-increment ID (see the sketch after this list)
--hbase-table: the name of the HBase table that receives the data
--hbase-create-table: create the target table in HBase if it does not already exist
--column-family: the column family name; every column of the source table is written into this column family
--hbase-row-key: if not specified, the source table's primary key is used as the HBase row key. You can name a single column as the row key, or use a composite row key; for a composite key, wrap the column list in double quotes and separate the columns with commas
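To make -m and --split-by concrete, here is a minimal sketch (the host, database, and table names are the same placeholders used above, and the split column ID is assumed to be a roughly uniform auto-increment key) that runs the same HBase import with 8 parallel map tasks:
sqoop import --connect jdbc:mysql://xxx.xxx.xxx.xxx:3306/database --table tablename --hbase-table hbasetablename --column-family family --hbase-row-key ID --hbase-create-table -m 8 --split-by ID --username 'root' -P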
Partial log output from running the command:
[hdfs@slave1 ~]$ sqoop import --connect jdbc:mysql://xxx.xxx.xxx.xxx:3306/database --table tablename --hbase-table hbasetablename --column-family family --hbase-row-key ID --hbase-create-table --username 'root' -P
Warning: /soft/bigdata/clouderamanager/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
17/04/28 15:54:37 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.10.0
Enter password:
17/04/28 15:54:44 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
17/04/28 15:54:44 INFO tool.CodeGenTool: Beginning code generation
Fri Apr 28 15:54:44 CST 2017 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
17/04/28 15:54:45 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `COMP_DICT` AS t LIMIT 1
17/04/28 15:54:45 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `COMP_DICT` AS t LIMIT 1
17/04/28 15:54:45 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /soft/bigdata/clouderamanager/cloudera/parcels/CDH/lib/hadoop-mapreduce
Note: /tmp/sqoop-hdfs/compile/f5c3b693ffb26b66c554308ad32b2880/COMP_DICT.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/04/28 15:54:47 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/f5c3b693ffb26b66c554308ad32b2880/COMP_DICT.jar
……
17/04/28 15:54:53 INFO mapreduce.Job: The url to track the job: http://master2:8088/proxy/application_1491881598805_0027/
17/04/28 15:54:53 INFO mapreduce.Job: Running job: job_1491881598805_0027
17/04/28 15:54:59 INFO mapreduce.Job: Job job_1491881598805_0027 running in uber mode : false
17/04/28 15:54:59 INFO mapreduce.Job: map 0% reduce 0%
17/04/28 15:55:05 INFO mapreduce.Job: map 20% reduce 0%
17/04/28 15:55:06 INFO mapreduce.Job: map 60% reduce 0%
17/04/28 15:55:09 INFO mapreduce.Job: map 100% reduce 0%
17/04/28 15:55:10 INFO mapreduce.Job: Job job_1491881598805_0027 completed successfully
17/04/28 15:55:10 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=925010
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=665
HDFS: Number of bytes written=0
HDFS: Number of read operations=5
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=5
Other local map tasks=5
Total time spent by all maps in occupied slots (ms)=25663
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=25663
Total vcore-seconds taken by all map tasks=25663
Total megabyte-seconds taken by all map tasks=26278912
Map-Reduce Framework
Map input records=10353
Map output records=10353
Input split bytes=665
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=586
CPU time spent (ms)=17940
Physical memory (bytes) snapshot=1619959808
Virtual memory (bytes) snapshot=14046998528
Total committed heap usage (bytes)=1686634496
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
17/04/28 15:55:10 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 20.0424 seconds (0 bytes/sec)
17/04/28 15:55:10 INFO mapreduce.ImportJobBase: Retrieved 10353 records.
PS: If you also want the primary key column(s) written into the HBase cells, i.e. to appear as columns inside the column family, set the following property:
sqoop.hbase.add.row.key=true
That is:
sqoop import -D sqoop.hbase.add.row.key=true --connect jdbc:mysql://xxx.xxx.xxx.xxx:3306/database --table tablename --hbase-table hbasetablename --column-family family --hbase-row-key "etl_date,APPLY_ID" --hbase-create-table --username 'root' -P
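After the job finishes you can spot-check the result from the HBase shell; a minimal sketch, assuming the placeholder table name hbasetablename used above (with sqoop.hbase.add.row.key=true, the etl_date and APPLY_ID columns should also show up as cells inside the column family):
hbase shell
count 'hbasetablename'
scan 'hbasetablename', {LIMIT => 1}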
Importing MySQL data into Hive takes two steps.
1. Create a Hive table with the same structure as the MySQL table:
sqoop create-hive-table --connect jdbc:mysql://xxx.xxx.xxx.xxx:3306/shiro --table UserInfo --hive-database shiro --hive-table userinfo --username root --password xxxxxx --fields-terminated-by "\0001" --lines-terminated-by "\n";
Parameter description:
--fields-terminated-by "\0001": sets the delimiter between columns; "\0001" is the character with ASCII code 1 (Ctrl-A), which is also Hive's default field delimiter, whereas Sqoop's default field delimiter is ","
--lines-terminated-by "\n": sets the delimiter between rows; here it is the newline character, which is also the default
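Before importing, you can confirm the generated table and its delimiters in Hive; a minimal sketch, assuming the shiro database and userinfo table names from the command above:
hive -e "DESCRIBE shiro.userinfo;"
hive -e "SHOW CREATE TABLE shiro.userinfo;"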
2. Import the data into the Hive table:
sqoop import --connect jdbc:mysql://xxx.xxx.xxx.xxx:3306/testSqoop --table dydata --hive-database testsqoop --hive-import --hive-table dydata --username root --password xxxxxx --fields-terminated-by "\0001";
Parameter description:
-m 2: run the import with two map tasks (not shown in the command above; add it if you want to control the parallelism)
--fields-terminated-by "\0001": must match the delimiter used when the Hive table was created
--hive-import: this flag is required; without it the data will not be imported into Hive
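Once the import completes, a quick row-count comparison between MySQL and Hive catches most problems; a sketch, assuming the testSqoop/testsqoop database names and root credentials used above:
mysql -h xxx.xxx.xxx.xxx -u root -p -e "SELECT COUNT(*) FROM testSqoop.dydata;"
hive -e "SELECT COUNT(*) FROM testsqoop.dydata;"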