Sqoop

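  • Installation

    • Download, extract, and set the environment variables.
    • Nothing under conf needs to change. If ZooKeeper and HBase are not installed, comment out every ZooKeeper- and HBase-related part of the configure-sqoop script; if both are installed, leave the script as it is.
  • Import, and a MySQL pitfall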

We import the DBS table from the hive database in MySQL (the Hive metastore):

➜  sqoop git:(master) sqoop import --connect jdbc:mysql://localhost:3306/hive --table DBS --username root --password root

The import fails with: java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@3901d134 is still active.

Warning: /Users/chenxiaokang/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /Users/chenxiaokang/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
18/08/07 10:52:24 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
18/08/07 10:52:24 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
18/08/07 10:52:24 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
18/08/07 10:52:24 INFO tool.CodeGenTool: Beginning code generation
18/08/07 10:52:25 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `DBS` AS t LIMIT 1
18/08/07 10:52:25 ERROR manager.SqlManager: Error reading from database: java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@3901d134 is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.
java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@3901d134 is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.

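This is a problem in the older MySQL JDBC driver, not in Sqoop itself. Replacing the connector jar in Sqoop's lib directory, mysql-connector-java-5.1.13-bin.jar, with mysql-connector-java-5.1.32.jar fixes it.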

➜  lib git:(master) sqoop import --connect jdbc:mysql://localhost:3306/hive --table DBS --username root --password root
Warning: /Users/chenxiaokang/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /Users/chenxiaokang/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
18/08/07 11:01:47 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
18/08/07 11:01:47 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
18/08/07 11:01:47 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
18/08/07 11:01:47 INFO tool.CodeGenTool: Beginning code generation
18/08/07 11:01:48 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `DBS` AS t LIMIT 1
18/08/07 11:01:48 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `DBS` AS t LIMIT 1
18/08/07 11:01:48 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /Users/chenxiaokang/hadoop-2.7.6
Note: /tmp/sqoop-chenxiaokang/compile/3ecfbbea71dfb1dd1314eba358b9a7d7/DBS.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
18/08/07 11:01:52 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-chenxiaokang/compile/3ecfbbea71dfb1dd1314eba358b9a7d7/DBS.jar
18/08/07 11:01:52 WARN manager.MySQLManager: It looks like you are importing from mysql.
18/08/07 11:01:52 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
18/08/07 11:01:52 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
18/08/07 11:01:52 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
18/08/07 11:01:52 INFO mapreduce.ImportJobBase: Beginning import of DBS
18/08/07 11:01:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/07 11:01:53 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
18/08/07 11:01:56 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
18/08/07 11:01:56 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
18/08/07 11:02:02 INFO db.DBInputFormat: Using read commited transaction isolation
18/08/07 11:02:02 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`DB_ID`), MAX(`DB_ID`) FROM `DBS`
18/08/07 11:02:02 INFO mapreduce.JobSubmitter: number of splits:4
18/08/07 11:02:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1533537460397_0001
18/08/07 11:02:04 INFO impl.YarnClientImpl: Submitted application application_1533537460397_0001
18/08/07 11:02:05 INFO mapreduce.Job: The url to track the job: http://172.20.10.3:8088/proxy/application_1533537460397_0001/
18/08/07 11:02:05 INFO mapreduce.Job: Running job: job_1533537460397_0001
18/08/07 11:02:23 INFO mapreduce.Job: Job job_1533537460397_0001 running in uber mode : false
18/08/07 11:02:23 INFO mapreduce.Job:  map 0% reduce 0%
18/08/07 11:02:38 INFO mapreduce.Job:  map 50% reduce 0%
18/08/07 11:02:39 INFO mapreduce.Job:  map 100% reduce 0%
18/08/07 11:02:40 INFO mapreduce.Job: Job job_1533537460397_0001 completed successfully
18/08/07 11:02:40 INFO mapreduce.Job: Counters: 31
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=565736
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=417
        HDFS: Number of bytes written=158
        HDFS: Number of read operations=16
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=8
    Job Counters 
        Killed map tasks=1
        Launched map tasks=4
        Other local map tasks=4
        Total time spent by all maps in occupied slots (ms)=50716
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=50716
        Total vcore-milliseconds taken by all map tasks=50716
        Total megabyte-milliseconds taken by all map tasks=51933184
    Map-Reduce Framework
        Map input records=2
        Map output records=2
        Input split bytes=417
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=454
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=440926208
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=158
18/08/07 11:02:40 INFO mapreduce.ImportJobBase: Transferred 158 bytes in 44.4693 seconds (3.553 bytes/sec)
18/08/07 11:02:40 INFO mapreduce.ImportJobBase: Retrieved 2 records.

The data imported from MySQL is now in HDFS:

0: jdbc:hive2://localhost:10000> dfs -ls /user/hdfs;
+----------------------------------------------------------------------------+--+
|                                 DFS Output                                 |
+----------------------------------------------------------------------------+--+
| Found 1 items                                                              |
| drwxr-xr-x   - hdfs supergroup          0 2018-08-07 11:02 /user/hdfs/DBS  |
+----------------------------------------------------------------------------+--+
2 rows selected (0.01 seconds)
0: jdbc:hive2://localhost:10000> dfs -cat /user/hdfs/DBS/part-m-00000;
+----------------------------------------------------------------------------------------+--+
|                                       DFS Output                                       |
+----------------------------------------------------------------------------------------+--+
| 1,Default Hive database,hdfs://localhost:9000/user/hive/warehouse,default,public,ROLE  |
+----------------------------------------------------------------------------------------+--+
1 row selected (0.024 seconds)
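  • How an import works

    • Sqoop runs the import as a MapReduce job. The job reads the table row by row and writes the records to HDFS:
      • Before the import starts, Sqoop uses JDBC to fetch the database metadata it needs, such as the column names and data types of the table being imported;
      • The database types (VARCHAR, NUMBER, etc.) are then mapped to Java types (String, int, etc.). From this information Sqoop generates a class with the same name as the table; the class holds one row of the table and handles its deserialization;
      • Sqoop launches the MapReduce job;
      • During the input phase, the job reads the table contents over JDBC and deserializes them with the Sqoop-generated class;
      • Finally the records are written to HDFS, and the same generated class serializes them as they are written.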
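    • An import job usually runs more than one map task, with each task fetching part of the table. If a single map task did the whole import, step 4 would simply execute: SELECT col1,col2,… FROM table;
    • With several map tasks the table has to be split horizontally, usually on its primary key. When launching the MapReduce job, Sqoop first queries the minimum and maximum values of the split column over JDBC, then divides that range by the number of tasks (set with -m) to decide which rows each task handles: SELECT col1,col2,... FROM table WHERE id >= 0 AND id < 50000; and SELECT col1,col2,... FROM table WHERE id >= 50000 AND id < 100000;
    • The distribution of values in the split column has a large effect on parallel import performance. An even distribution gives the best performance; a heavily skewed one leaves a few tasks doing most of the work and performs poorly. It is therefore worth sampling the split column before importing to understand how its values are distributed.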
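    • Sqoop gives fine-grained control over the import, so you do not have to import every column of a table each time. You can select specific columns (--columns), add a WHERE clause (--where), or supply a custom SQL query (--query), and the SQL may use any function the source database supports.
    • When importing to HDFS, the matching table can be created in Hive first: sqoop create-hive-table --connect jdbc:mysql://master:3306/hive --table DBS --fields-terminated-by ',' --username [username] --password [password]; after that, LOAD the data into it.
    • Sqoop writes imported files with comma-separated fields by default, so the create-hive-table command uses --fields-terminated-by ',' to give the Hive DBS table the same column delimiter.
    • Importing to HDFS, creating the table, and loading the data can also be combined into one step: sqoop import --connect jdbc:mysql://master:3306/hive --table DBS --username [username] --password [password] -m [num] --hive-import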
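
The range splitting and the effect of skew described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Sqoop's actual splitter: the function names split_ranges, split_queries and rows_per_split are invented for this example, the split column is assumed to be an integer, and the skewed sample keys are made up.

```python
from bisect import bisect_left


def split_ranges(min_id, max_id, num_mappers):
    """Cut the closed interval [min_id, max_id] into contiguous half-open
    ranges [lo, hi), one per map task (fewer if there are fewer values)."""
    if max_id < min_id or num_mappers < 1:
        raise ValueError("need min_id <= max_id and num_mappers >= 1")
    total = max_id - min_id + 1
    n = min(num_mappers, total)
    base, extra = divmod(total, n)
    ranges, lo = [], min_id
    for i in range(n):
        # The first `extra` ranges take one more value each.
        hi = lo + base + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def split_queries(table, columns, split_col, ranges):
    """Build one SELECT per range, in the shape shown in the text."""
    cols = ", ".join(columns)
    return [
        f"SELECT {cols} FROM {table} "
        f"WHERE {split_col} >= {lo} AND {split_col} < {hi}"
        for lo, hi in ranges
    ]


def rows_per_split(ids, ranges):
    """Count how many of the given key values fall into each range."""
    ordered = sorted(ids)
    return [bisect_left(ordered, hi) - bisect_left(ordered, lo)
            for lo, hi in ranges]


if __name__ == "__main__":
    ranges = split_ranges(0, 99999, 2)
    for query in split_queries("table", ["col1", "col2"], "id", ranges):
        print(query)
    # Hypothetical skewed keys: 100 small ids plus one very large id.
    skewed_ids = list(range(100)) + [99999]
    print(rows_per_split(skewed_ids, ranges))
```

With the skewed sample, the first range holds 100 rows and the second holds 1, even though both ranges are the same width. That is the situation the sampling advice above is meant to catch before the import runs.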
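  • How an export works

    • Before a Hive table can be exported to a database, a table to receive the data must already exist in that database.
    • As with import, Sqoop first generates a Java class from the structure of the target table (steps 1 and 2) to handle serialization and deserialization. It then launches a MapReduce job (step 3). The job uses the generated class to read the data from HDFS (step 4) and produces a series of INSERT statements, each of which inserts multiple records into the MySQL target table (step 5). Reads and writes both run in parallel, so throughput is limited by the write performance of the target database.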
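
The multi-row INSERT step can be sketched as follows. This is an illustration of the batching idea only, not the code Sqoop generates: parse_sc_line and batched_inserts are hypothetical helpers, the '?' placeholder style is an assumption, and the sample records are invented to match the sc table's three columns.

```python
def parse_sc_line(line, delimiter=","):
    """Turn one delimited text record into a typed (id, courseid, account) row."""
    id_, courseid, account = line.rstrip("\n").split(delimiter)
    return (int(id_), int(courseid), account)


def batched_inserts(table, columns, rows, rows_per_statement):
    """Yield (sql, params) pairs; each statement inserts up to
    rows_per_statement rows using '?' placeholders."""
    if rows_per_statement < 1:
        raise ValueError("rows_per_statement must be >= 1")
    col_list = ", ".join(columns)
    one_row = "(" + ", ".join("?" for _ in columns) + ")"
    for start in range(0, len(rows), rows_per_statement):
        batch = rows[start:start + rows_per_statement]
        sql = (f"INSERT INTO {table} ({col_list}) VALUES "
               + ", ".join(one_row for _ in batch))
        # Flatten the batch so the values line up with the placeholders.
        params = [value for row in batch for value in row]
        yield sql, params


if __name__ == "__main__":
    # Made-up sample records in the comma-delimited layout of the export dir.
    lines = ["1,101,alice\n", "2,102,bob\n", "3,103,carol\n"]
    rows = [parse_sc_line(line) for line in lines]
    for sql, params in batched_inserts(
            "sc", ["id", "courseid", "account"], rows, 2):
        print(sql, params)
```

Grouping rows into one statement reduces the number of round trips to the database, which matters because the database's write speed is the bottleneck.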

Example: exporting the sc table to MySQL

0: jdbc:hive2://localhost:10000> DESC sc;
+-----------+------------+----------+--+
| col_name  | data_type  | comment  |
+-----------+------------+----------+--+
| id        | bigint     |          |
| courseid  | bigint     |          |
| account   | string     |          |
+-----------+------------+----------+--+
3 rows selected (0.185 seconds)
0: jdbc:hive2://localhost:10000> 
mysql> use test;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> create table sc(id bigint, courseid bigint, account varchar(32));
Query OK, 0 rows affected (0.03 sec)
➜  bin git:(master) ✗ sqoop export --connect jdbc:mysql://localhost:3306/test --table sc --export-dir /user/hive/warehouse/sc --username root --password root -m 1 --fields-terminated-by ',';
Warning: /Users/chenxiaokang/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /Users/chenxiaokang/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
18/08/07 12:23:32 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
18/08/07 12:23:32 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
18/08/07 12:23:32 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
18/08/07 12:23:32 INFO tool.CodeGenTool: Beginning code generation
18/08/07 12:23:33 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `sc` AS t LIMIT 1
18/08/07 12:23:33 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `sc` AS t LIMIT 1
18/08/07 12:23:33 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /Users/chenxiaokang/hadoop-2.7.6
Note: /tmp/sqoop-chenxiaokang/compile/fb350fa941a369d077323a7b646b5380/sc.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
18/08/07 12:23:35 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-chenxiaokang/compile/fb350fa941a369d077323a7b646b5380/sc.jar
18/08/07 12:23:35 INFO mapreduce.ExportJobBase: Beginning export of sc
18/08/07 12:23:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/07 12:23:35 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
18/08/07 12:23:37 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
18/08/07 12:23:37 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
18/08/07 12:23:37 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
18/08/07 12:23:37 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
18/08/07 12:23:42 INFO input.FileInputFormat: Total input paths to process : 1
18/08/07 12:23:42 INFO input.FileInputFormat: Total input paths to process : 1
18/08/07 12:23:43 INFO mapreduce.JobSubmitter: number of splits:1
18/08/07 12:23:43 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
18/08/07 12:23:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1533537460397_0002
18/08/07 12:23:44 INFO impl.YarnClientImpl: Submitted application application_1533537460397_0002
18/08/07 12:23:44 INFO mapreduce.Job: The url to track the job: http://172.20.10.3:8088/proxy/application_1533537460397_0002/
18/08/07 12:23:44 INFO mapreduce.Job: Running job: job_1533537460397_0002
18/08/07 12:23:57 INFO mapreduce.Job: Job job_1533537460397_0002 running in uber mode : false
18/08/07 12:23:57 INFO mapreduce.Job:  map 0% reduce 0%
18/08/07 12:24:04 INFO mapreduce.Job:  map 100% reduce 0%
18/08/07 12:24:05 INFO mapreduce.Job: Job job_1533537460397_0002 completed successfully
18/08/07 12:24:05 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=141082
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1044
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=4
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters 
        Launched map tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=4797
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=4797
        Total vcore-milliseconds taken by all map tasks=4797
        Total megabyte-milliseconds taken by all map tasks=4912128
    Map-Reduce Framework
        Map input records=82
        Map output records=82
        Input split bytes=132
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=74
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=121110528
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=0
18/08/07 12:24:05 INFO mapreduce.ExportJobBase: Transferred 1.0195 KB in 28.176 seconds (37.0528 bytes/sec)
18/08/07 12:24:05 INFO mapreduce.ExportJobBase: Exported 82 records.
