A小菠菜罐头

【Hive】用Sqoop实现数据HDFS到mysql到Hive

大数据协作框架

“大数据协作框架”其实是一个统称，主要是以下四个框架

数据转换工具 Sqoop
文件收集库框架 Flume
任务调度框架 Oozie
大数据WEB工具Hue

Sqoop作用

将关系数据库中的某张表数据抽取到Hadoop的HDFS文件系统中，底层运行的还是MapReduce。

利用MapReduce加快数据传输速度。

批处理方式进行数据传输。

也可以将HDFS上的文件数据或者Hive表中的数据导出到关系型数据库当中的某张表中。

HDFS →RDBMS

sqoop export \
--connect jdbc:mysql://xxx:3306/xxx \
--username xxx \
--password xxx \
--table xxx \
--export-dir xxx

RDBMS→Hive

sqoop import \
--connect jdbc:mysql://xxx:3306/xxx \
--username xxx \
--password xxx \
--fields-terminated-by "\t" \
--table xxx \
--hive-import \
--hive-table xxx

Hive→RDBMS

sqoop export \
--connect jdbc:mysql://xxx:3306/xxx \
--username xxx \
--password xxx \
--table xxx \
--export-dir xxx \
--input-fields-terminated-by '\t'

RDBMS→HDFS

sqoop import \
--connect jdbc:mysql://xxx:3306/xxx \
--username xxx \
--password xxx \
--table xxx \
--target-dir xxx

规律：

从RDBMS导入到HDFS或者Hive中的都使用import；从HDFS或者Hive导出到RDBMS中的都使用export；以HDFS和Hive为参考，根据数据流向选择关键字。
connect、username、password、table四个参数为每一种传输都必须的；其中connect参数格式均为--connect jdbc:mysql://主机名:3306/数据库名(使用mysql数据库)；table是指明mysql中的表名。
export-dir参数只有在导出数据到RDBMS中时才会用到，含义为表在hdfs中存放的路径。

区别：

HDFS →RDBMS：指明table：mysql中的表，需要自行先创建；指明export-dir：HDFS中数据的存储路径
RDBMS→Hive：指明fields-terminated-by：指定分隔符，分隔符指的是存放在Hive中的数据的分隔符，如果目标将存储在Hive中可理解为编码格式，若目标将存储在RDBMS上，则可理解为解码格式；指明table:mysql中的表名；指明hive-import：导入到hive操作;指明hive-table:hive中的表名。注意：table参数不可以与用户家目录下已存在的目录重名，因为sqoop导数据到hive会先将数据导入到HDFS上，然后再将数据load到hive中，最后把这个目录再删除掉。
Hive→RDBMS：指明table:mysql中的表名；指明export-dir:hive在hdfs中存储的路径;指明hive-table:hive中的表名。
RDBMS→HDFS：指明table:mysql里的表名;指明target-dir:hdfs存储数据的目录;

Sqoop安装

配置Sqoop1.x

conf目录【sqoop-env-template.sh】

export HADOOP_COMMON_HOME=Hadoop目录
export HADOOP_MAPRED_HOME=Hadoop目录
export HIVE_HOME=Hive目录
export ZOOCFGDIR=Zookeeper目录

将mysqlJDBC驱动包拷到sqoop的lib目录下

测试sqoop

bin/sqoop list-databases \
--connect jdbc:mysql://主机名：3306 \
--username root \
--password 123456 \

查看本地mysql


mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| metastore          |
| mysql              |
| test               |
+--------------------+
4 rows in set (0.00 sec)

mysql> use test;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+----------------+
| Tables_in_test |
+----------------+
| my_user        |
+----------------+
1 row in set (0.00 sec)

mysql> desc my_user;
+---------+--------------+------+-----+---------+----------------+
| Field   | Type         | Null | Key | Default | Extra          |
+---------+--------------+------+-----+---------+----------------+
| id      | tinyint(4)   | NO   | PRI | NULL    | auto_increment |
| account | varchar(255) | YES  |     | NULL    |                |
| passwd  | varchar(255) | YES  |     | NULL    |                |
+---------+--------------+------+-----+---------+----------------+
3 rows in set (0.00 sec)


mysql> select * from my_user;
+----+----------+----------+
| id | account  | passwd   |
+----+----------+----------+
|  1 | admin    | admin    |
|  2 | johnny   | 123456   |
|  3 | zhangsan | zhangsan |
|  4 | lisi     | lisi     |
|  5 | test     | test     |
|  6 | qiqi     | qiqi     |
|  7 | hangzhou | hangzhou |
+----+----------+----------+
7 rows in set (0.00 sec)

hive创建相同结构的空表

hive (test)> create table h_user(                            
           > id int,                                         
           > account string,                                 
           > passwd string                                   
           > )row format delimited fields terminated by '\t';
OK
Time taken: 0.113 seconds

hive (test)> desc h_user;
OK
col_name        data_type       comment
id                      int                                         
account                 string                                      
passwd                  string                                      
Time taken: 0.228 seconds, Fetched: 3 row(s)

从本地mysql导出数据到Hive里

bin/sqoop import \
--connect jdbc:mysql://cdaisuke:3306/test \
--username root \
--password 123456 \
--table my_user \
--num-mappers 1 \
--delete-target-dir \
--fields-terminated-by "\t" \
--hive-database test \
--hive-import \
--hive-table h_user

hive (test)> select * from h_user;                      
OK
h_user.id       h_user.account  h_user.passwd
1       admin   admin
2       johnny  123456
3       zhangsan        zhangsan
4       lisi    lisi
5       test    test
6       qiqi    qiqi
7       hangzhou        hangzhou
Time taken: 0.061 seconds, Fetched: 7 row(s)

从mysql导入到HDFS里

bin/sqoop import \
--connect jdbc:mysql://cdaisuke:3306/test \
--username root \
--password 123456 \
--table my_user \
--num-mappers 3 \
--target-dir /user/hadoop/ \
--delete-target-dir \
--fields-terminated-by "\t"

------------------------------------------------------------

[hadoop@cdaisuke sqoop-1.4.5-cdh5.3.6]$ bin/sqoop import \
> --connect jdbc:mysql://cdaisuke:3306/test \
> --username root \
> --password 123456 \
> --table my_user \
> --num-mappers 3 \
> --target-dir /user/hadoop/ \
> --delete-target-dir \
> --fields-terminated-by "\t"
18/08/14 00:02:11 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.6
18/08/14 00:02:11 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
18/08/14 00:02:12 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
18/08/14 00:02:12 INFO tool.CodeGenTool: Beginning code generation
18/08/14 00:02:13 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `my_user` AS t LIMIT 1
18/08/14 00:02:13 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `my_user` AS t LIMIT 1
18/08/14 00:02:13 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/modules/hadoop-2.5.0-cdh5.3.6_Hive
Note: /tmp/sqoop-hadoop/compile/7c8bdb7cd3df7b2f4b48700704f46f65/my_user.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
18/08/14 00:02:18 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/7c8bdb7cd3df7b2f4b48700704f46f65/my_user.jar
18/08/14 00:02:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/14 00:02:22 INFO tool.ImportTool: Destination directory /user/hadoop is not present, hence not deleting.
18/08/14 00:02:22 WARN manager.MySQLManager: It looks like you are importing from mysql.
18/08/14 00:02:22 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
18/08/14 00:02:22 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
18/08/14 00:02:22 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
18/08/14 00:02:22 INFO mapreduce.ImportJobBase: Beginning import of my_user
18/08/14 00:02:22 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
18/08/14 00:02:22 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
18/08/14 00:02:23 INFO client.RMProxy: Connecting to ResourceManager at slave01/192.168.79.140:8032
18/08/14 00:02:28 INFO db.DBInputFormat: Using read commited transaction isolation
18/08/14 00:02:28 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `my_user`
18/08/14 00:02:28 INFO mapreduce.JobSubmitter: number of splits:3
18/08/14 00:02:28 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1533652222364_0078
18/08/14 00:02:29 INFO impl.YarnClientImpl: Submitted application application_1533652222364_0078
18/08/14 00:02:29 INFO mapreduce.Job: The url to track the job: http://slave01:8088/proxy/application_1533652222364_0078/
18/08/14 00:02:29 INFO mapreduce.Job: Running job: job_1533652222364_0078
18/08/14 00:02:50 INFO mapreduce.Job: Job job_1533652222364_0078 running in uber mode : false
18/08/14 00:02:50 INFO mapreduce.Job:  map 0% reduce 0%
18/08/14 00:03:00 INFO mapreduce.Job:  map 33% reduce 0%
18/08/14 00:03:01 INFO mapreduce.Job:  map 67% reduce 0%
18/08/14 00:03:02 INFO mapreduce.Job:  map 100% reduce 0%
18/08/14 00:03:02 INFO mapreduce.Job: Job job_1533652222364_0078 completed successfully
18/08/14 00:03:02 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=394707
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=295
                HDFS: Number of bytes written=106
                HDFS: Number of read operations=12
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=6
        Job Counters 
                Launched map tasks=3
                Other local map tasks=3
                Total time spent by all maps in occupied slots (ms)=25213
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=25213
                Total vcore-seconds taken by all map tasks=25213
                Total megabyte-seconds taken by all map tasks=25818112
        Map-Reduce Framework
                Map input records=7
                Map output records=7
                Input split bytes=295
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=352
                CPU time spent (ms)=3600
                Physical memory (bytes) snapshot=316162048
                Virtual memory (bytes) snapshot=2523156480
                Total committed heap usage (bytes)=77766656
        File Input Format Counters 
                Bytes Read=0
        File Output Format Counters 
                Bytes Written=106
18/08/14 00:03:02 INFO mapreduce.ImportJobBase: Transferred 106 bytes in 40.004 seconds (2.6497 bytes/sec)
18/08/14 00:03:02 INFO mapreduce.ImportJobBase: Retrieved 7 records.

设置3个map任务

--num-mappers 3 \

设置HDFS目标存储目录
--target-dir /user/hadoop/ \

如果设置目录存在则删除此目录
--delete-target-dir \

从Hive导出到mysql

在mysql创建新表

create table user_export(
id tinyint(4) not null auto_increment,
account varchar(255) default null,
passwd varchar(255) default null,
primary key(id)
);

用sqoop导出数据

bin/sqoop export \
--connect jdbc:mysql://cdaisuke:3306/test \
--username root \
--password 123456 \
--table user_export \
--num-mappers 1 \
--fields-terminated-by "\t" \
--export-dir /user/hive/warehouse/test.db/h_user

----------------------------------------------------

[hadoop@cdaisuke sqoop-1.4.5-cdh5.3.6]$ bin/sqoop export \
> --connect jdbc:mysql://cdaisuke:3306/test \
> --username root \
> --password 123456 \
> --table user_export \
> --num-mappers 1 \
> --fields-terminated-by "\t" \
> --export-dir /user/hive/warehouse/test.db/h_user
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
18/08/14 00:16:32 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.6
18/08/14 00:16:32 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
18/08/14 00:16:33 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
18/08/14 00:16:33 INFO tool.CodeGenTool: Beginning code generation
18/08/14 00:16:34 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `user_export` AS t LIMIT 1
18/08/14 00:16:34 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `user_export` AS t LIMIT 1
18/08/14 00:16:34 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/modules/hadoop-2.5.0-cdh5.3.6_Hive
Note: /tmp/sqoop-hadoop/compile/6823ffae505b34f7ae8b9881bae4b898/user_export.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
18/08/14 00:16:39 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/6823ffae505b34f7ae8b9881bae4b898/user_export.jar
18/08/14 00:16:39 INFO mapreduce.ExportJobBase: Beginning export of user_export
18/08/14 00:16:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/14 00:16:40 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
18/08/14 00:16:43 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
18/08/14 00:16:43 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
18/08/14 00:16:43 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
18/08/14 00:16:43 INFO client.RMProxy: Connecting to ResourceManager at slave01/192.168.79.140:8032
18/08/14 00:16:48 INFO input.FileInputFormat: Total input paths to process : 1
18/08/14 00:16:48 INFO input.FileInputFormat: Total input paths to process : 1
18/08/14 00:16:48 INFO mapreduce.JobSubmitter: number of splits:1
18/08/14 00:16:48 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
18/08/14 00:16:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1533652222364_0079
18/08/14 00:16:50 INFO impl.YarnClientImpl: Submitted application application_1533652222364_0079
18/08/14 00:16:50 INFO mapreduce.Job: The url to track the job: http://slave01:8088/proxy/application_1533652222364_0079/
18/08/14 00:16:50 INFO mapreduce.Job: Running job: job_1533652222364_0079
18/08/14 00:17:11 INFO mapreduce.Job: Job job_1533652222364_0079 running in uber mode : false
18/08/14 00:17:11 INFO mapreduce.Job:  map 0% reduce 0%
18/08/14 00:17:27 INFO mapreduce.Job:  map 100% reduce 0%
18/08/14 00:17:27 INFO mapreduce.Job: Job job_1533652222364_0079 completed successfully
18/08/14 00:17:27 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=131287
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=258
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=4
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
        Job Counters 
                Launched map tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=13426
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=13426
                Total vcore-seconds taken by all map tasks=13426
                Total megabyte-seconds taken by all map tasks=13748224
        Map-Reduce Framework
                Map input records=7
                Map output records=7
                Input split bytes=149
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=73
                CPU time spent (ms)=1230
                Physical memory (bytes) snapshot=113061888
                Virtual memory (bytes) snapshot=838946816
                Total committed heap usage (bytes)=45613056
        File Input Format Counters 
                Bytes Read=0
        File Output Format Counters 
                Bytes Written=0
18/08/14 00:17:27 INFO mapreduce.ExportJobBase: Transferred 258 bytes in 44.2695 seconds (5.8279 bytes/sec)
18/08/14 00:17:27 INFO mapreduce.ExportJobBase: Exported 7 records.

-----------------------------------------------------------------

mysql> select * from user_export;
+----+----------+----------+
| id | account  | passwd   |
+----+----------+----------+
|  1 | admin    | admin    |
|  2 | johnny   | 123456   |
|  3 | zhangsan | zhangsan |
|  4 | lisi     | lisi     |
|  5 | test     | test     |
|  6 | qiqi     | qiqi     |
|  7 | hangzhou | hangzhou |
+----+----------+----------+
7 rows in set (0.00 sec)

从HDFS导出到mysql