lcz-2000

Big Data Development: A Detailed Introduction to Sqoop

Test environment: CDH 6.3.1, Sqoop 1.4.7

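1. Sqoop Overview

Apache Sqoop (SQL-to-Hadoop) is designed to move large volumes of data efficiently between an RDBMS and Hadoop. With Sqoop, users can import data from a relational database into Hadoop and related systems (such as HBase and Hive), and can also extract data from Hadoop and export it back into a relational database.

Sqoop is a tool for bulk data transfer between structured data stores and Hadoop, where the structured store can be an RDBMS such as MySQL or Oracle. Under the hood, Sqoop uses MapReduce programs to perform extraction, transformation, and loading, and MapReduce provides parallelism and fault tolerance by design. Compared with traditional ETL tools such as Kettle, the jobs run on the Hadoop cluster, which reduces the load on the ETL server. In some scenarios, extraction performance improves significantly.

To use Sqoop, Hadoop must be installed and configured correctly, because Sqoop relies on the local Hadoop environment to launch MapReduce jobs. The JDBC drivers for MySQL, Oracle, and other databases must also be placed in Sqoop's lib directory.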

2. Overview of Sqoop Tools

The sqoop help command lists the tools that Sqoop provides:

[root@zhou ~]# sqoop help
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
23/11/20 17:00:56 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

For example, to see which arguments the import tool accepts:

[root@zhou ~]# sqoop help import
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
23/11/20 17:11:43 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
   --connect <jdbc-uri>                                       Specify JDBC
                                                              connect
                                                              string
   --connection-manager <class-name>                          Specify
                                                              connection
                                                              manager
                                                              class name
   --connection-param-file <properties-file>                  Specify
                                                              connection
                                                              parameters
                                                              file
   --driver <class-name>                                      Manually
                                                              specify JDBC
                                                              driver class
                                                              to use

**Many lines omitted here for brevity**


At minimum, you must specify --connect and --table
Arguments to mysqldump and other subprograms may be supplied
after a '--' on the command line.

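3. Sqoop Tools in Detail

3.1 codegen

The codegen tool generates Java classes that encapsulate and interpret imported records. The Java definition of a record is instantiated as part of the import process, but the tool can also be run on its own. For example, if the Java source is lost, it can be recreated, and new versions of a class can be generated that use different delimiters between fields, and so on.

I came to big data from traditional data warehousing and am not yet familiar with Java, so I will not go into detail here.

3.2 create-hive-table

The create-hive-table tool creates a Hive table from the definition of a table in the source database.

3.2.1 create-hive-table arguments

Hive arguments

Argument | Description
--hive-home | …

3.2.2 create-hive-table test case

Requirement: copy the structure of the emp table in the MySQL test database to the ods_emp table in the Hive test database.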

sqoop create-hive-table \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--hive-database test \
--table emp --hive-table ods_emp

Test log:

[root@zhou ~]# 
[root@zhou ~]# sqoop create-hive-table \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> --hive-database test \
> --table emp --hive-table ods_emp
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
23/11/23 18:12:28 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
23/11/23 18:12:28 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
23/11/23 18:12:28 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
23/11/23 18:12:28 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
23/11/23 18:12:28 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
23/11/23 18:12:29 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `emp` AS t LIMIT 1
23/11/23 18:12:29 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `emp` AS t LIMIT 1
23/11/23 18:12:29 WARN hive.TableDefWriter: Column hiredate had to be cast to a less precise type in Hive
23/11/23 18:12:29 WARN hive.TableDefWriter: Column sal had to be cast to a less precise type in Hive
23/11/23 18:12:29 WARN hive.TableDefWriter: Column comm had to be cast to a less precise type in Hive
23/11/23 18:12:30 INFO hive.HiveImport: Loading uploaded data into Hive
23/11/23 18:12:30 INFO conf.HiveConf: Found configuration file file:/etc/hive/conf.cloudera.hive/hive-site.xml
********** Much output omitted here **********
23/11/23 18:12:33 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=zhou:2181,hp3:2181,zhou:2181 sessionTimeout=1200000 watcher=org.apache.curator.ConnectionState@176555c
23/11/23 18:12:33 INFO zookeeper.ClientCnxn: Opening socket connection to server zhou/10.31.1.123:2181. Will not attempt to authenticate using SASL (unknown error)
23/11/23 18:12:33 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.31.1.123:54526, server: zhou/10.31.1.123:2181
23/11/23 18:12:33 INFO zookeeper.ClientCnxn: Session establishment complete on server zhou/10.31.1.123:2181, sessionid = 0x175e00052b51d5e, negotiated timeout = 40000
23/11/23 18:12:33 INFO state.ConnectionStateManager: State change: CONNECTED
23/11/23 18:12:33 INFO ql.Driver: Executing command(queryId=root_20231123181231_97629f38-ba50-4864-9d21-6aa2239f11a9): CREATE TABLE IF NOT EXISTS `test`.`ods_emp` ( `empno` INT, `ename` STRING, `job` STRING, `mgr` INT, `hiredate` STRING, `sal` DOUBLE, `comm` DOUBLE, `deptno` INT) COMMENT 'Imported by sqoop on 2023/11/23 18:12:29' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\012' STORED AS TEXTFILE
23/11/23 18:12:33 INFO ql.Driver: Starting task [Stage-0:DDL] in serial mode
23/11/23 18:12:33 INFO hive.metastore: Mestastore configuration hive.metastore.filter.hook changed from org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook to org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl
23/11/23 18:12:33 INFO hive.metastore: Closed a connection to metastore, current connections: 0
23/11/23 18:12:33 INFO exec.DDLTask: creating table test.ods_emp on null
23/11/23 18:12:34 INFO hive.metastore: HMS client filtering is enabled.
23/11/23 18:12:34 INFO hive.metastore: Trying to connect to metastore with URI thrift://zhou:9083
23/11/23 18:12:34 INFO hive.metastore: Opened a connection to metastore, current connections: 1
23/11/23 18:12:34 INFO hive.metastore: Connected to metastore.
23/11/23 18:12:34 INFO ql.Driver: Completed executing command(queryId=root_20231123181231_97629f38-ba50-4864-9d21-6aa2239f11a9); Time taken: 0.515 seconds
OK
23/11/23 18:12:34 INFO ql.Driver: OK
Time taken: 2.633 seconds
23/11/23 18:12:34 INFO CliDriver: Time taken: 2.633 seconds
23/11/23 18:12:34 INFO conf.HiveConf: Using the default value passed in for log id: 2fec9e5a-4d5e-4786-9e5b-5e5303b8cca1
23/11/23 18:12:34 INFO session.SessionState: Resetting thread name to  main
23/11/23 18:12:34 INFO conf.HiveConf: Using the default value passed in for log id: 2fec9e5a-4d5e-4786-9e5b-5e5303b8cca1
23/11/23 18:12:34 INFO session.SessionState: Deleted directory: /tmp/hive/root/2fec9e5a-4d5e-4786-9e5b-5e5303b8cca1 on fs with scheme hdfs
23/11/23 18:12:34 INFO session.SessionState: Deleted directory: /tmp/root/2fec9e5a-4d5e-4786-9e5b-5e5303b8cca1 on fs with scheme file
23/11/23 18:12:34 INFO hive.metastore: Closed a connection to metastore, current connections: 0
23/11/23 18:12:34 INFO hive.HiveImport: Hive import complete.
23/11/23 18:12:34 INFO imps.CuratorFrameworkImpl: backgroundOperationsLoop exiting
23/11/23 18:12:34 INFO zookeeper.ZooKeeper: Session: 0x175e00052b51d5e closed
23/11/23 18:12:34 INFO CuratorFrameworkSingleton: Closing ZooKeeper client.
23/11/23 18:12:34 INFO zookeeper.ClientCnxn: EventThread shut down

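Looking at the table on the Hive side, integers were converted to int, decimals were automatically converted to double, and varchar became string, all of which is fine. However, the date column also became string. I am not sure why, although the log above does warn that hiredate, sal, and comm "had to be cast to a less precise type in Hive". In addition, primary and foreign keys were simply ignored.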

hive> 
    > 
    > desc ods_emp;
OK
empno                   int                                         
ename                   string                                      
job                     string                                      
mgr                     int                                         
hiredate                string                                      
sal                     double                                      
comm                    double                                      
deptno                  int                                         
Time taken: 0.078 seconds, Fetched: 8 row(s)
    > show create table ods_emp;
OK
CREATE TABLE `ods_emp`(
  `empno` int, 
  `ename` string, 
  `job` string, 
  `mgr` int, 
  `hiredate` string, 
  `sal` double, 
  `comm` double, 
  `deptno` int)
COMMENT 'Imported by sqoop on 2023/11/23 18:12:29'
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
WITH SERDEPROPERTIES ( 
  'field.delim'='', 
  'line.delim'='\n', 
  'serialization.format'='') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://nameservice1/user/hive/warehouse/test.db/ods_emp'
TBLPROPERTIES (
  'transient_lastDdlTime'='1606126354')
Time taken: 0.107 seconds, Fetched: 24 row(s)
hive>
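
If a real Hive DATE column is wanted instead of string, one option is the --map-column-hive argument that the Sqoop user guide lists for sqoop import. The sketch below only prints such a command for review and does not run it. It reuses the connection settings from the test above, swaps --password for -P, and assumes that Hive can parse the imported yyyy-MM-dd text as DATE. Whether create-hive-table accepts the same argument was not tested here, and the function name is made up for illustration.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) a Hive import that overrides the
# default Hive type of one column. print_import_with_date_cmd is a
# hypothetical helper name; the connection values come from the test above.
print_import_with_date_cmd() {
  printf '%s ' sqoop import \
    --connect jdbc:mysql://10.31.1.122:3306/test \
    --username root -P \
    --table emp \
    --hive-import \
    --hive-database test \
    --hive-table ods_emp \
    --map-column-hive hiredate=DATE \
    --num-mappers 1
  printf '\n'
}

print_import_with_date_cmd
```

Printing the command first makes it easy to check the column mapping before anything touches the cluster.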

3.3 eval

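The eval tool runs a SQL statement against the database and displays the result, so a statement can be checked for correctness before it is used in a job.

3.3.1 eval arguments

SQL evaluation arguments

Argument | Description
-e,--query | …

3.3.2 eval tests

Testing --query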

sqoop eval \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--query "select * from emp"

Test log:

[root@zhou ~]# sqoop eval \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> --query "select * from emp"
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
23/11/23 18:28:05 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
23/11/23 18:28:05 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
23/11/23 18:28:05 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
----------------------------------------------------------------------------------
| empno | ename      | job       | mgr  | hiredate   | sal       | comm      | deptno | 
----------------------------------------------------------------------------------
| 7369 | SMITH      | CLERK     | 7902 | 1980-12-17 | 800.00    | (null)    | 20 | 
| 7499 | ALLEN      | SALESMAN  | 7698 | 1981-02-20 | 1600.00   | 300.00    | 30 | 
| 7521 | WARD       | SALESMAN  | 7698 | 1981-02-22 | 1250.00   | 500.00    | 30 | 
| 7566 | JONES      | MANAGER   | 7839 | 1981-04-02 | 2975.00   | (null)    | 20 | 
| 7654 | MARTIN     | SALESMAN  | 7698 | 1981-09-28 | 1250.00   | 1400.00   | 30 | 
| 7698 | BLAKE      | MANAGER   | 7839 | 1981-05-01 | 2850.00   | (null)    | 30 | 
| 7782 | CLARK      | MANAGER   | 7839 | 1981-06-09 | 2450.00   | (null)    | 10 | 
| 7788 | SCOTT      | ANALYST   | 7566 | 1987-06-13 | 3000.00   | (null)    | 20 | 
| 7839 | KING       | PRESIDENT | (null) | 1981-11-17 | 5000.00   | (null)    | 10 | 
| 7844 | TURNER     | SALESMAN  | 7698 | 1981-09-08 | 1500.00   | 0.00      | 30 | 
| 7876 | ADAMS      | CLERK     | 7788 | 1987-06-13 | 1100.00   | (null)    | 20 | 
| 7900 | JAMES      | CLERK     | 7698 | 1981-12-03 | 950.00    | (null)    | 30 | 
| 7902 | FORD       | ANALYST   | 7566 | 1981-12-03 | 3000.00   | (null)    | 20 | 
| 7934 | MILLER     | CLERK     | 7782 | 1982-01-23 | 1300.00   | (null)    | 10 | 
----------------------------------------------------------------------------------
[root@zhou ~]#

Testing -e

sqoop eval \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
-e "insert into emp(empno) values (1)"

Test log:

[root@zhou ~]# sqoop eval \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> -e "insert into emp(empno) values (1)"
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
23/11/23 18:32:26 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
23/11/23 18:32:26 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
23/11/23 18:32:26 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
23/11/23 18:32:26 INFO tool.EvalSqlTool: 1 row(s) updated.
[root@zhou ~]#
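
Each run above logs a warning that putting the password on the command line is insecure. Besides -P, the sqoop help import output lists a --password-file argument. A minimal sketch of preparing such a file follows. The file location is a hypothetical choice, and the file:// prefix assumes the file lives on the local filesystem rather than on HDFS.

```shell
#!/bin/sh
# Hypothetical location for the password file; override PW_FILE to change it.
PW_FILE="${PW_FILE:-${HOME:-/tmp}/.sqoop_mysql.pwd}"

# Write the password without a trailing newline (a trailing newline would
# be read as part of the password), readable only by the current user.
rm -f "$PW_FILE"
( umask 077 && printf '%s' 'abc123' > "$PW_FILE" )
chmod 400 "$PW_FILE"

# Print (do not execute) an eval command that uses the file
# instead of --password.
printf '%s ' sqoop eval \
  --connect jdbc:mysql://10.31.1.122:3306/test \
  --username root \
  --password-file "file://$PW_FILE" \
  --query '"select 1"'
printf '\n'
```

A cheap eval query such as select 1 is a convenient way to confirm that the credentials work before starting a long import or export.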

3.4 export

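The export tool copies data from HDFS into a relational database. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters.

3.4.1 export arguments

Code generation arguments

Argument | Description
--bindir | …

3.4.2 export test cases

Sqoop's export tool supports insert and update against a relational database, but not merge: I tested this myself, and passing --merge-key id raises an error. Export does, however, support calling a stored procedure, so once the data has been transferred it can be processed further by a MySQL stored procedure.

3.4.2.1 Hive table to MySQL: insert case

Data preparation:

-- Hive table data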
hive> 
    > 
    > select * from t1;
OK
1       abc
2       def

-- MySQL data
mysql> select * from t1; 
Empty set (0.00 sec)

mysql>

Code: --fields-terminated-by '\0001' has to be written in exactly this form. The other forms I tried raised errors, so this needs further investigation.

sqoop export \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t1 \
--export-dir /user/hive/warehouse/test.db/t1 \
--num-mappers 1 \
--fields-terminated-by '\0001'

Test log:

[root@zhou ~]# sqoop export \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> --table t1 \
> --export-dir /user/hive/warehouse/test.db/t1 \
> --num-mappers 1 \
> --fields-terminated-by '\0001'
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
23/11/23 19:22:04 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
23/11/23 19:22:04 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
23/11/23 19:22:04 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
23/11/23 19:22:04 INFO tool.CodeGenTool: Beginning code generation
23/11/23 19:22:05 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t1` AS t LIMIT 1
23/11/23 19:22:05 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t1` AS t LIMIT 1
23/11/23 19:22:05 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
23/11/23 19:22:06 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-root/compile/fab279df6cf018b9c6c4a9b4d9508e39/t1.java to /root/./t1.java. Error: Destination '/root/./t1.java' already exists
23/11/23 19:22:06 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/fab279df6cf018b9c6c4a9b4d9508e39/t1.jar
23/11/23 19:22:06 INFO mapreduce.ExportJobBase: Beginning export of t1
23/11/23 19:22:06 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
23/11/23 19:22:07 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
23/11/23 19:22:07 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
23/11/23 19:22:07 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
23/11/23 19:22:07 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
23/11/23 19:22:07 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/root/.staging/job_1605780958086_0024
23/11/23 19:22:09 INFO input.FileInputFormat: Total input files to process : 2
23/11/23 19:22:09 INFO input.FileInputFormat: Total input files to process : 2
23/11/23 19:22:09 INFO mapreduce.JobSubmitter: number of splits:1
23/11/23 19:22:09 INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address
23/11/23 19:22:09 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
23/11/23 19:22:09 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
23/11/23 19:22:09 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1605780958086_0024
23/11/23 19:22:09 INFO mapreduce.JobSubmitter: Executing with tokens: []
23/11/23 19:22:10 INFO conf.Configuration: resource-types.xml not found
23/11/23 19:22:10 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
23/11/23 19:22:10 INFO impl.YarnClientImpl: Submitted application application_1605780958086_0024
23/11/23 19:22:10 INFO mapreduce.Job: The url to track the job: http://hp3:8088/proxy/application_1605780958086_0024/
23/11/23 19:22:10 INFO mapreduce.Job: Running job: job_1605780958086_0024
23/11/23 19:22:16 INFO mapreduce.Job: Job job_1605780958086_0024 running in uber mode : false
23/11/23 19:22:16 INFO mapreduce.Job:  map 0% reduce 0%
23/11/23 19:22:21 INFO mapreduce.Job:  map 100% reduce 0%
23/11/23 19:22:21 INFO mapreduce.Job: Job job_1605780958086_0024 completed successfully
23/11/23 19:22:21 INFO mapreduce.Job: Counters: 33
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=247166
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=241
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=7
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
                HDFS: Number of bytes read erasure-coded=0
        Job Counters 
                Launched map tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=2732
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=2732
                Total vcore-milliseconds taken by all map tasks=2732
                Total megabyte-milliseconds taken by all map tasks=2797568
        Map-Reduce Framework
                Map input records=2
                Map output records=2
                Input split bytes=223
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=65
                CPU time spent (ms)=930
                Physical memory (bytes) snapshot=235925504
                Virtual memory (bytes) snapshot=2592751616
                Total committed heap usage (bytes)=193462272
                Peak Map Physical memory (bytes)=235925504
                Peak Map Virtual memory (bytes)=2592751616
        File Input Format Counters 
                Bytes Read=0
        File Output Format Counters 
                Bytes Written=0
23/11/23 19:22:21 INFO mapreduce.ExportJobBase: Transferred 241 bytes in 13.9456 seconds (17.2815 bytes/sec)
23/11/23 19:22:21 INFO mapreduce.ExportJobBase: Exported 2 records.
[root@zhou ~]#

Verifying the data shows that the export succeeded:

mysql> select * from t1;
Empty set (0.00 sec)

mysql> select * from t1;
+------+------+
| id   | name |
+------+------+
|    1 | abc  |
|    2 | def  |
+------+------+
2 rows in set (0.00 sec)

3.4.2.2 Hive table to MySQL: update case

Data preparation:

-- Hive data
hive> 
    > 
    > select * from t1;
OK
1       abc
2       def
-- MySQL data
mysql> select * from t1;
+------+------+
| id   | name |
+------+------+
|    1 | a    |
|    3 | b    |
+------+------+
2 rows in set (0.00 sec)

Code:

sqoop export \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t1 \
--export-dir /user/hive/warehouse/test.db/t1 \
--update-key id \
--num-mappers 1 \
--fields-terminated-by '\0001'

Test log:

[root@zhou ~]# sqoop export \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> --table t1 \
> --export-dir /user/hive/warehouse/test.db/t1 \
> --update-key id \
> --num-mappers 1 \
> --fields-terminated-by '\0001'
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
23/11/24 15:00:54 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
23/11/24 15:00:54 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
23/11/24 15:00:54 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
23/11/24 15:00:54 INFO tool.CodeGenTool: Beginning code generation
23/11/24 15:00:54 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t1` AS t LIMIT 1
23/11/24 15:00:55 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t1` AS t LIMIT 1
23/11/24 15:00:55 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
23/11/24 15:00:56 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-root/compile/a3298f025b919a783d8aeba0fcfdd07c/t1.java to /root/./t1.java. Error: Destination '/root/./t1.java' already exists
23/11/24 15:00:56 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/a3298f025b919a783d8aeba0fcfdd07c/t1.jar
23/11/24 15:00:56 INFO mapreduce.ExportJobBase: Beginning export of t1
23/11/24 15:00:56 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
23/11/24 15:00:57 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
23/11/24 15:00:57 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
23/11/24 15:00:57 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
23/11/24 15:00:57 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
23/11/24 15:00:58 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/root/.staging/job_1605780958086_0028
23/11/24 15:00:59 INFO input.FileInputFormat: Total input files to process : 2
23/11/24 15:00:59 INFO input.FileInputFormat: Total input files to process : 2
23/11/24 15:00:59 INFO mapreduce.JobSubmitter: number of splits:1
23/11/24 15:00:59 INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address
23/11/24 15:00:59 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
23/11/24 15:00:59 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
23/11/24 15:00:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1605780958086_0028
23/11/24 15:00:59 INFO mapreduce.JobSubmitter: Executing with tokens: []
23/11/24 15:00:59 INFO conf.Configuration: resource-types.xml not found
23/11/24 15:00:59 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
23/11/24 15:01:00 INFO impl.YarnClientImpl: Submitted application application_1605780958086_0028
23/11/24 15:01:00 INFO mapreduce.Job: The url to track the job: http://hp3:8088/proxy/application_1605780958086_0028/
23/11/24 15:01:00 INFO mapreduce.Job: Running job: job_1605780958086_0028
23/11/24 15:01:06 INFO mapreduce.Job: Job job_1605780958086_0028 running in uber mode : false
23/11/24 15:01:06 INFO mapreduce.Job:  map 0% reduce 0%
23/11/24 15:01:11 INFO mapreduce.Job:  map 100% reduce 0%
23/11/24 15:01:11 INFO mapreduce.Job: Job job_1605780958086_0028 completed successfully
23/11/24 15:01:11 INFO mapreduce.Job: Counters: 33
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=247485
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=241
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=7
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
                HDFS: Number of bytes read erasure-coded=0
        Job Counters 
                Launched map tasks=1
                Rack-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=3094
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=3094
                Total vcore-milliseconds taken by all map tasks=3094
                Total megabyte-milliseconds taken by all map tasks=3168256
        Map-Reduce Framework
                Map input records=2
                Map output records=2
                Input split bytes=223
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=60
                CPU time spent (ms)=1010
                Physical memory (bytes) snapshot=243896320
                Virtual memory (bytes) snapshot=2593943552
                Total committed heap usage (bytes)=193462272
                Peak Map Physical memory (bytes)=243896320
                Peak Map Virtual memory (bytes)=2593943552
        File Input Format Counters 
                Bytes Read=0
        File Output Format Counters 
                Bytes Written=0
23/11/24 15:01:11 INFO mapreduce.ExportJobBase: Transferred 241 bytes in 13.9585 seconds (17.2655 bytes/sec)
23/11/24 15:01:11 INFO mapreduce.ExportJobBase: Exported 2 records.
[root@zhou ~]#

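Verifying the data shows that the update succeeded. The row with id 1 was updated to abc, the Hive row with id 2 was not inserted, and the MySQL row with id 3 was left unchanged: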

mysql> select * from t1;
+------+------+
| id   | name |
+------+------+
|    1 | abc  |
|    3 | b    |
+------+------+
2 rows in set (0.00 sec)
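
In the result above, the Hive row with id 2 never reached MySQL, because --update-key on its own only issues updates. Sqoop's export tool also has an --update-mode argument whose allowinsert value inserts rows that have no match. The sketch below prints such a command without running it. It assumes that t1 has a primary or unique key on id, which MySQL needs for this mode to behave as an upsert. The table definition is not shown in this post, so treat that as an assumption. The helper name is hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) an export that updates matching
# rows and inserts the rest. print_upsert_export_cmd is a hypothetical
# helper name; all values are copied from the update test above, except
# that -P replaces --password.
print_upsert_export_cmd() {
  printf '%s ' sqoop export \
    --connect jdbc:mysql://10.31.1.122:3306/test \
    --username root -P \
    --table t1 \
    --export-dir /user/hive/warehouse/test.db/t1 \
    --update-key id \
    --update-mode allowinsert \
    --num-mappers 1 \
    --fields-terminated-by "'\\0001'"
  printf '\n'
}

print_upsert_export_cmd
```

If the assumption about the key holds, the expected effect would be that id 1 is updated, id 2 is inserted, and id 3 stays as it is. This was not run as part of the tests in this post.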

3.4 help

The help tool shows the usage of a command. For example, if I do not know the syntax of sqoop import, I can look it up:

[root@zhou ~]# sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
   --connect <jdbc-uri>     Specify JDBC connect string
   --connection-manager <class-name>     Specify connection manager class to use
   --driver <class-name>    Manually specify JDBC driver class to use
   --hadoop-mapred-home <dir>            Override $HADOOP_MAPRED_HOME
   --help                   Print usage instructions
   --password-file          Set path for file containing authentication password
   -P                       Read password from console
   --password <password>    Set authentication password
   --username <username>    Set authentication username
   --verbose                Print more information while working
   --hadoop-home <dir>      Deprecated. Override $HADOOP_HOME

Import control arguments:
   --as-avrodatafile             Imports data to Avro Data Files
   --as-sequencefile             Imports data to SequenceFiles
   --as-textfile                 Imports data as plain text (default)
   --as-parquetfile              Imports data to Parquet Data Files
...

3.5 import

sqoop import is the most frequently used tool. It copies data from a relational database to HDFS. Each row of a table is treated as a separate record and stored as text files, Avro, or SequenceFiles.
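
An import is divided among parallel map tasks (set with --num-mappers in the test case of this section), and by default the work is split by ranges of a split column, normally the primary key. The sketch below is a simplified illustration of that idea, not Sqoop's actual splitter. It assumes, hypothetically, that ids run contiguously from 1 to 767830000 and prints the range each of 4 mappers would read.

```shell
#!/bin/sh
# split_ranges MIN MAX N
# Print N contiguous half-open id ranges that together cover MIN..MAX.
# Simplified illustration only; Sqoop's own splitter differs in detail.
split_ranges() {
  min=$1
  max=$2
  n=$3
  span=$(( max - min + 1 ))
  i=0
  while [ "$i" -lt "$n" ]; do
    lo=$(( min + span * i / n ))
    hi=$(( min + span * (i + 1) / n ))
    printf 'mapper %d: id >= %d AND id < %d\n' "$i" "$lo" "$hi"
    i=$(( i + 1 ))
  done
}

# Hypothetical bounds: ids 1..767830000, 4 mappers.
split_ranges 1 767830000 4
```

Ranges of equal width only give equal work when the ids are evenly distributed. With large gaps in the key, some mappers would read far fewer rows than others.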

3.5.1 sqoop import 工具命令介绍

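Escape characters can be used in delimiter arguments, for example --fields-terminated-by \t. The supported escape characters are:

1. \b (backspace)
2. \n (newline)
3. \r (carriage return)
4. \t (tab)
5. \" (double-quote)
6. \' (single-quote)
7. \\ (backslash)
8. \0 (NUL): inserts NUL characters between fields or lines, or disables enclosing/escaping if used for one of the --enclosed-by, --optionally-enclosed-by, or --escaped-by arguments

A delimiter can also be given as the octal representation of a UTF-8 character's code point; for example, --fields-terminated-by \001 yields the ^A character.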
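
As a quick illustration of the octal form, the sketch below is plain shell and does not call Sqoop. It builds one record whose two fields (sample values, chosen only for the demo) are joined by the ^A byte, then makes that invisible byte visible. The function names make_record and show_fields are made up for this example.

```shell
# Build one record whose fields are separated by the ^A byte (octal 001),
# the delimiter that --fields-terminated-by '\001' asks Sqoop to write.
make_record() {
  # printf interprets \001 in the format string as the byte with octal value 001.
  printf '%s\001%s\n' "$1" "$2"
}

# Replace the invisible ^A byte with a pipe so the field boundary can be seen.
show_fields() {
  tr '\001' '|'
}

make_record 1 abc | show_fields                          # prints: 1|abc
# cut can split on the same byte to pull out a single field.
make_record 3 b | cut -d "$(printf '\001')" -f 2         # prints: b
```

Because ^A rarely appears in business data, it is a common choice when text columns may contain commas or tabs.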

Hive arguments:

| Argument | Description |
|--|--|
| --hive-home <dir> | Override $HIVE_HOME |

3.5.2 Hive import test case

Data preparation:

-- MySQL
mysql> select count(*) from fact_sale;
+-----------+
| count(*)  |
+-----------+
| 767830000 |
+-----------+
1 row in set (2 min 32.75 sec)

mysql> desc fact_sale;
+-----------+-------------+------+-----+-------------------+-----------------------------+
| Field     | Type        | Null | Key | Default           | Extra                       |
+-----------+-------------+------+-----+-------------------+-----------------------------+
| id        | bigint(8)   | NO   | PRI | NULL              | auto_increment              |
| sale_date | timestamp   | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| prod_name | varchar(50) | YES  |     | NULL              |                             |
| sale_nums | int(11)     | YES  |     | NULL              |                             |
+-----------+-------------+------+-----+-------------------+-----------------------------+
4 rows in set (0.01 sec)

-- On the Hive side, the table does not need to be created in advance

Code:

sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table fact_sale \
--fields-terminated-by '\0001' \
--delete-target-dir \
--num-mappers 4 \
--hive-import \
--hive-database test \
--hive-table ods_fact_sale \
--hive-overwrite

Test log:

Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
23/11/25 09:49:54 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
23/11/25 09:49:54 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
23/11/25 09:49:54 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
23/11/25 09:49:54 INFO tool.CodeGenTool: Beginning code generation
23/11/25 09:49:54 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `fact_sale` AS t LIMIT 1
23/11/25 09:49:54 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `fact_sale` AS t LIMIT 1
23/11/25 09:49:54 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
23/11/25 09:49:56 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-root/compile/ac811b1a59e599927528fdb832a4a853/fact_sale.java to /root/./fact_sale.java. Error: Destination '/root/./fact_sale.java' already exists
23/11/25 09:49:56 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/ac811b1a59e599927528fdb832a4a853/fact_sale.jar
23/11/25 09:49:56 INFO tool.ImportTool: Destination directory fact_sale is not present, hence not deleting.
23/11/25 09:49:56 WARN manager.MySQLManager: It looks like you are importing from mysql.
23/11/25 09:49:56 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
23/11/25 09:49:56 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
23/11/25 09:49:56 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
23/11/25 09:49:56 INFO mapreduce.ImportJobBase: Beginning import of fact_sale
23/11/25 09:49:56 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
23/11/25 09:49:56 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
23/11/25 09:49:57 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
23/11/25 09:49:57 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/root/.staging/job_1605780958086_0032
23/11/25 09:49:59 INFO db.DBInputFormat: Using read commited transaction isolation
23/11/25 09:49:59 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `fact_sale`
23/11/25 09:49:59 INFO db.IntegerSplitter: Split size: 196905399; Num splits: 4 from: 1 to: 787621597
23/11/25 09:49:59 INFO mapreduce.JobSubmitter: number of splits:4
23/11/25 09:49:59 INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address
23/11/25 09:49:59 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
23/11/25 09:49:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1605780958086_0032
23/11/25 09:49:59 INFO mapreduce.JobSubmitter: Executing with tokens: []
23/11/25 09:49:59 INFO conf.Configuration: resource-types.xml not found
23/11/25 09:49:59 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
23/11/25 09:49:59 INFO impl.YarnClientImpl: Submitted application application_1605780958086_0032
23/11/25 09:49:59 INFO mapreduce.Job: The url to track the job: http://hp3:8088/proxy/application_1605780958086_0032/
23/11/25 09:49:59 INFO mapreduce.Job: Running job: job_1605780958086_0032
23/11/25 09:50:06 INFO mapreduce.Job: Job job_1605780958086_0032 running in uber mode : false
23/11/25 09:50:06 INFO mapreduce.Job:  map 0% reduce 0%
23/11/25 09:56:28 INFO mapreduce.Job:  map 25% reduce 0%
23/11/25 09:57:08 INFO mapreduce.Job:  map 50% reduce 0%
23/11/25 10:03:40 INFO mapreduce.Job:  map 75% reduce 0%
23/11/25 10:04:07 INFO mapreduce.Job:  map 100% reduce 0%
23/11/25 10:04:07 INFO mapreduce.Job: Job job_1605780958086_0032 completed successfully
23/11/25 10:04:07 INFO mapreduce.Job: Counters: 33
**Part of the output omitted**
23/11/25 10:04:11 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=zhou:2181,hp3:2181,zhou:2181 sessionTimeout=1200000 watcher=org.apache.curator.ConnectionState@615bad16
23/11/25 10:04:11 INFO zookeeper.ClientCnxn: Opening socket connection to server hp3/10.31.1.125:2181. Will not attempt to authenticate using SASL (unknown error)
23/11/25 10:04:11 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.31.1.123:41634, server: hp3/10.31.1.125:2181
23/11/25 10:04:11 INFO zookeeper.ClientCnxn: Session establishment complete on server hp3/10.31.1.125:2181, sessionid = 0x275e0004e5929e2, negotiated timeout = 40000
23/11/25 10:04:11 INFO state.ConnectionStateManager: State change: CONNECTED
23/11/25 10:04:11 INFO ql.Driver: Executing command(queryId=root_20231125100408_f35214cb-e680-4e38-9833-7cb096592131): CREATE TABLE IF NOT EXISTS `test`.`ods_fact_sale` ( `id` BIGINT, `sale_date` STRING, `prod_name` STRING, `sale_nums` INT) COMMENT 'Imported by sqoop on 2023/11/25 10:04:07' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\012' STORED AS TEXTFILE
23/11/25 10:04:11 INFO ql.Driver: Completed executing command(queryId=root_20231125100408_f35214cb-e680-4e38-9833-7cb096592131); Time taken: 0.006 seconds
OK
23/11/25 10:04:11 INFO ql.Driver: OK
Time taken: 2.319 seconds
23/11/25 10:04:11 INFO CliDriver: Time taken: 2.319 seconds
23/11/25 10:04:11 INFO conf.HiveConf: Using the default value passed in for log id: c9680e77-741b-4fe8-a882-55ac6ae6ee79
23/11/25 10:04:11 INFO session.SessionState: Resetting thread name to  main
23/11/25 10:04:11 INFO conf.HiveConf: Using the default value passed in for log id: c9680e77-741b-4fe8-a882-55ac6ae6ee79
23/11/25 10:04:11 INFO session.SessionState: Updating thread name to c9680e77-741b-4fe8-a882-55ac6ae6ee79 main
23/11/25 10:04:11 INFO ql.Driver: Compiling command(queryId=root_20231125100411_b2ff0e5c-3ccc-4f7e-b202-53855934f5f0): 
LOAD DATA INPATH 'hdfs://nameservice1/user/root/fact_sale' OVERWRITE INTO TABLE `test`.`ods_fact_sale`
23/11/25 10:04:11 INFO ql.Driver: Semantic Analysis Completed
23/11/25 10:04:11 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
23/11/25 10:04:11 INFO ql.Driver: Completed compiling command(queryId=root_20231125100411_b2ff0e5c-3ccc-4f7e-b202-53855934f5f0); Time taken: 0.232 seconds
23/11/25 10:04:11 INFO ql.Driver: Executing command(queryId=root_20231125100411_b2ff0e5c-3ccc-4f7e-b202-53855934f5f0): 
LOAD DATA INPATH 'hdfs://nameservice1/user/root/fact_sale' OVERWRITE INTO TABLE `test`.`ods_fact_sale`
23/11/25 10:04:11 INFO ql.Driver: Starting task [Stage-0:MOVE] in serial mode
23/11/25 10:04:11 INFO hive.metastore: Closed a connection to metastore, current connections: 0
Loading data to table test.ods_fact_sale
23/11/25 10:04:11 INFO exec.Task: Loading data to table test.ods_fact_sale from hdfs://nameservice1/user/root/fact_sale
23/11/25 10:04:11 INFO hive.metastore: HMS client filtering is enabled.
23/11/25 10:04:11 INFO hive.metastore: Trying to connect to metastore with URI thrift://zhou:9083
23/11/25 10:04:11 INFO hive.metastore: Opened a connection to metastore, current connections: 1
23/11/25 10:04:11 INFO hive.metastore: Connected to metastore.
23/11/25 10:04:11 INFO fs.TrashPolicyDefault: Moved: 'hdfs://nameservice1/user/hive/warehouse/test.db/ods_fact_sale/part-m-00000' to trash at: hdfs://nameservice1/user/root/.Trash/Current/user/hive/warehouse/test.db/ods_fact_sale/part-m-00000
23/11/25 10:04:11 INFO common.FileUtils: Creating directory if it doesn't exist: hdfs://nameservice1/user/hive/warehouse/test.db/ods_fact_sale
23/11/25 10:04:11 INFO ql.Driver: Starting task [Stage-1:STATS] in serial mode
23/11/25 10:04:11 INFO exec.StatsTask: Executing stats task
23/11/25 10:04:11 INFO hive.metastore: Closed a connection to metastore, current connections: 0
23/11/25 10:04:11 INFO hive.metastore: HMS client filtering is enabled.
23/11/25 10:04:11 INFO hive.metastore: Trying to connect to metastore with URI thrift://zhou:9083
23/11/25 10:04:11 INFO hive.metastore: Opened a connection to metastore, current connections: 1
23/11/25 10:04:11 INFO hive.metastore: Connected to metastore.
23/11/25 10:04:12 INFO hive.metastore: Closed a connection to metastore, current connections: 0
23/11/25 10:04:12 INFO hive.metastore: HMS client filtering is enabled.
23/11/25 10:04:12 INFO hive.metastore: Trying to connect to metastore with URI thrift://zhou:9083
23/11/25 10:04:12 INFO hive.metastore: Opened a connection to metastore, current connections: 1
23/11/25 10:04:12 INFO hive.metastore: Connected to metastore.
23/11/25 10:04:12 INFO exec.StatsTask: Table test.ods_fact_sale stats: [numFiles=4, numRows=0, totalSize=31421093662, rawDataSize=0, numFilesErasureCoded=0]
23/11/25 10:04:12 INFO ql.Driver: Completed executing command(queryId=root_20231125100411_b2ff0e5c-3ccc-4f7e-b202-53855934f5f0); Time taken: 0.897 seconds
OK
23/11/25 10:04:12 INFO ql.Driver: OK
Time taken: 1.154 seconds
23/11/25 10:04:12 INFO CliDriver: Time taken: 1.154 seconds
23/11/25 10:04:12 INFO conf.HiveConf: Using the default value passed in for log id: c9680e77-741b-4fe8-a882-55ac6ae6ee79
23/11/25 10:04:12 INFO session.SessionState: Resetting thread name to  main
23/11/25 10:04:12 INFO conf.HiveConf: Using the default value passed in for log id: c9680e77-741b-4fe8-a882-55ac6ae6ee79
23/11/25 10:04:12 INFO session.SessionState: Deleted directory: /tmp/hive/root/c9680e77-741b-4fe8-a882-55ac6ae6ee79 on fs with scheme hdfs
23/11/25 10:04:12 INFO session.SessionState: Deleted directory: /tmp/root/c9680e77-741b-4fe8-a882-55ac6ae6ee79 on fs with scheme file
23/11/25 10:04:12 INFO hive.metastore: Closed a connection to metastore, current connections: 0
23/11/25 10:04:12 INFO hive.HiveImport: Hive import complete.
23/11/25 10:04:12 INFO imps.CuratorFrameworkImpl: backgroundOperationsLoop exiting
23/11/25 10:04:12 INFO zookeeper.ZooKeeper: Session: 0x275e0004e5929e2 closed
23/11/25 10:04:12 INFO zookeeper.ClientCnxn: EventThread shut down
23/11/25 10:04:12 INFO CuratorFrameworkSingleton: Closing ZooKeeper client.
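
The BoundingValsQuery and IntegerSplitter lines in the log above show how the four mappers divide the work: Sqoop reads MIN(id) and MAX(id) and cuts that interval into equal-width ranges. The sketch below reproduces the arithmetic in plain shell. It is a simplified model, not Sqoop's own code; the function name split_ranges is made up here, and the real splitter treats remainders and edge cases in its own way.

```shell
# Print the id range each mapper would scan when the interval [lo, hi]
# is cut into n equal-width splits (simplified: the last split runs up to hi).
split_ranges() {
  lo=$1; hi=$2; n=$3
  size=$(( (hi - lo) / n ))
  echo "split size: $size"
  i=0
  while [ "$i" -lt "$n" ]; do
    start=$(( lo + i * size ))
    if [ "$i" -eq $(( n - 1 )) ]; then
      # The last split is closed on the right so that MAX(id) is included.
      echo "id >= $start AND id <= $hi"
    else
      echo "id >= $start AND id < $(( start + size ))"
    fi
    i=$(( i + 1 ))
  done
}

# Values taken from the log: from 1 to 787621597, 4 splits.
split_ranges 1 787621597 4
```

The first output line is "split size: 196905399", matching the log. Equal-width id ranges do not guarantee equal row counts: MAX(id) is 787621597 while the table holds 767830000 rows, so the id sequence has gaps and the mappers may finish at different times.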

3.6 import-all-tables tool

sqoop import-all-tables imports all tables from a database to HDFS, so they do not have to be imported one at a time.

3.6.1 import-all-tables arguments

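A delimiter can be given as the octal representation of a UTF-8 character's code point; for example, --fields-terminated-by \001 yields the ^A character.

Hive arguments:

| Argument | Description |
|--|--|
| --hive-home <dir> | Override $HIVE_HOME |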

3.6.2 import-all-tables test case

Import every table in the MySQL test database except fact_sale and t1 into the Hive test database.

Data preparation:

mysql> show tables;
+----------------+
| Tables_in_test |
+----------------+
| bonus          |
| dept           |
| emp            |
| fact_sale      |
| salgrade       |
| t1             |
+----------------+
6 rows in set (0.00 sec)

Code:

sqoop import-all-tables \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--exclude-tables fact_sale,t1 \
--autoreset-to-one-mapper \
--fields-terminated-by '\0001' \
--num-mappers 4 \
--hive-import \
--hive-database test \
--hive-overwrite

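Test log: some of the tables have no primary key, so the first run with --num-mappers 4 alone fails with the error shown below. Adding --autoreset-to-one-mapper makes Sqoop fall back to a single mapper for any table that has no primary key.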

[root@zhou ~]# sqoop import-all-tables \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> --exclude-tables fact_sale,t1 \
> --fields-terminated-by '\0001' \
> --num-mappers 4 \
> --hive-import \
> --hive-database test \
> --hive-overwrite
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
23/11/25 10:40:36 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
23/11/25 10:40:36 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
23/11/25 10:40:36 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
23/11/25 10:40:37 INFO tool.CodeGenTool: Beginning code generation
23/11/25 10:40:37 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `bonus` AS t LIMIT 1
23/11/25 10:40:37 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `bonus` AS t LIMIT 1
23/11/25 10:40:37 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
23/11/25 10:40:38 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/9ef9bc1769dcf7d49b3848d6a55144fa/bonus.jar
23/11/25 10:40:38 WARN manager.MySQLManager: It looks like you are importing from mysql.
23/11/25 10:40:38 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
23/11/25 10:40:38 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
23/11/25 10:40:38 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
23/11/25 10:40:38 ERROR tool.ImportAllTablesTool: Error during import: No primary key could be found for table bonus. Please specify one with --split-by or perform a sequential import with '-m 1'.
[root@zhou ~]#

Test log after the adjustment:

Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
23/11/25 10:47:05 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
23/11/25 10:47:05 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
23/11/25 10:47:05 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
23/11/25 10:47:06 INFO tool.CodeGenTool: Beginning code generation
23/11/25 10:47:06 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `bonus` AS t LIMIT 1
23/11/25 10:47:06 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `bonus` AS t LIMIT 1
23/11/25 10:47:06 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
23/11/25 10:47:07 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-root/compile/7e08dfd7df6bcae92b216d3c68d5a4f3/bonus.java to /root/./bonus.java. Error: Destination '/root/./bonus.java' already exists
23/11/25 10:47:07 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/7e08dfd7df6bcae92b216d3c68d5a4f3/bonus.jar
23/11/25 10:47:07 WARN manager.MySQLManager: It looks like you are importing from mysql.
23/11/25 10:47:07 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
23/11/25 10:47:07 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
23/11/25 10:47:07 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
23/11/25 10:47:07 WARN manager.SqlManager: Split by column not provided or can't be inferred.  Resetting to one mapper
23/11/25 10:47:07 INFO mapreduce.ImportJobBase: Beginning import of bonus
23/11/25 10:47:07 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
23/11/25 10:47:08 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
23/11/25 10:47:08 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
23/11/25 10:47:08 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/root/.staging/job_1605780958086_0037
23/11/25 10:47:10 INFO db.DBInputFormat: Using read commited transaction isolation
23/11/25 10:47:10 INFO mapreduce.JobSubmitter: number of splits:1
23/11/25 10:47:10 INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address
23/11/25 10:47:10 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
23/11/25 10:47:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1605780958086_0037
23/11/25 10:47:10 INFO mapreduce.JobSubmitter: Executing with tokens: []
23/11/25 10:47:10 INFO conf.Configuration: resource-types.xml not found
23/11/25 10:47:10 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
23/11/25 10:47:10 INFO impl.YarnClientImpl: Submitted application application_1605780958086_0037
23/11/25 10:47:10 INFO mapreduce.Job: The url to track the job: http://hp3:8088/proxy/application_1605780958086_0037/
23/11/25 10:47:10 INFO mapreduce.Job: Running job: job_1605780958086_0037
23/11/25 10:47:16 INFO mapreduce.Job: Job job_1605780958086_0037 running in uber mode : false
23/11/25 10:47:16 INFO mapreduce.Job:  map 0% reduce 0%
23/11/25 10:47:22 INFO mapreduce.Job:  map 100% reduce 0%
23/11/25 10:47:22 INFO mapreduce.Job: Job job_1605780958086_0037 completed successfully
**Part of the output omitted**
23/11/25 10:48:21 INFO common.FileUtils: Creating directory if it doesn't exist: hdfs://nameservice1/user/hive/warehouse/test.db/salgrade
23/11/25 10:48:21 INFO ql.Driver: Starting task [Stage-1:STATS] in serial mode
23/11/25 10:48:21 INFO exec.StatsTask: Executing stats task
23/11/25 10:48:21 INFO hive.metastore: Closed a connection to metastore, current connections: 0
23/11/25 10:48:21 INFO hive.metastore: HMS client filtering is enabled.
23/11/25 10:48:21 INFO hive.metastore: Trying to connect to metastore with URI thrift://zhou:9083
23/11/25 10:48:21 INFO hive.metastore: Opened a connection to metastore, current connections: 1
23/11/25 10:48:21 INFO hive.metastore: Connected to metastore.
23/11/25 10:48:21 INFO hive.metastore: Closed a connection to metastore, current connections: 0
23/11/25 10:48:21 INFO hive.metastore: HMS client filtering is enabled.
23/11/25 10:48:21 INFO hive.metastore: Trying to connect to metastore with URI thrift://zhou:9083
23/11/25 10:48:21 INFO hive.metastore: Opened a connection to metastore, current connections: 1
23/11/25 10:48:21 INFO hive.metastore: Connected to metastore.
23/11/25 10:48:21 INFO exec.StatsTask: Table test.salgrade stats: [numFiles=1, numRows=0, totalSize=59, rawDataSize=0, numFilesErasureCoded=0]
23/11/25 10:48:21 INFO ql.Driver: Completed executing command(queryId=root_20231125104820_ad1fb49d-6166-41cb-9113-91074fb2ea16); Time taken: 0.549 seconds
OK
23/11/25 10:48:21 INFO ql.Driver: OK
Time taken: 0.588 seconds
23/11/25 10:48:21 INFO CliDriver: Time taken: 0.588 seconds
23/11/25 10:48:21 INFO conf.HiveConf: Using the default value passed in for log id: 1e735270-9c28-4658-aad6-fc816a42e377
23/11/25 10:48:21 INFO session.SessionState: Resetting thread name to  main
23/11/25 10:48:21 INFO conf.HiveConf: Using the default value passed in for log id: 1e735270-9c28-4658-aad6-fc816a42e377
23/11/25 10:48:21 INFO session.SessionState: Deleted directory: /tmp/hive/root/1e735270-9c28-4658-aad6-fc816a42e377 on fs with scheme hdfs
23/11/25 10:48:21 INFO session.SessionState: Deleted directory: /tmp/root/1e735270-9c28-4658-aad6-fc816a42e377 on fs with scheme file
23/11/25 10:48:21 INFO hive.metastore: Closed a connection to metastore, current connections: 0
23/11/25 10:48:21 INFO hive.HiveImport: Hive import complete.
Skipping table: t1
23/11/25 10:48:21 INFO imps.CuratorFrameworkImpl: backgroundOperationsLoop exiting
23/11/25 10:48:21 INFO zookeeper.ZooKeeper: Session: 0x275e0004e592a1e closed
23/11/25 10:48:21 INFO zookeeper.ClientCnxn: EventThread shut down
23/11/25 10:48:21 INFO CuratorFrameworkSingleton: Closing ZooKeeper client.
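
The error message in the failed run names two other ways out: --split-by or a sequential import with -m 1. Neither can be tuned per table inside import-all-tables, so one option is to import a key-less table on its own. The sketch below only prints the command it would run (a dry run). The function name import_cmd is made up, and the split column sal is an assumption for illustration; the source does not show the columns of bonus. -P, listed in the help output earlier, prompts for the password instead of passing it on the command line.

```shell
# Print (dry run) a per-table sqoop import command.
# $1 = table name, $2 = optional split column (hypothetical here; use an
# evenly distributed numeric column that really exists in your table).
import_cmd() {
  table=$1
  split_col=${2:-}
  if [ -n "$split_col" ]; then
    parallel="--split-by $split_col --num-mappers 4"
  else
    # No primary key and no split column: fall back to a sequential import.
    parallel="--num-mappers 1"
  fi
  echo "sqoop import --connect jdbc:mysql://10.31.1.122:3306/test" \
       "--username root -P --table $table $parallel" \
       "--hive-import --hive-database test --hive-overwrite"
}

import_cmd bonus          # sequential import with one mapper
import_cmd bonus sal      # parallel import, split on the assumed column sal
```

For small lookup tables such as the ones in this test, a single mapper is usually sufficient; a split column mainly pays off on large tables.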

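3.7 import-mainframe tool

sqoop import-mainframe imports datasets from a mainframe server to HDFS. No test environment was available, so it is skipped here.

3.8 list-databases tool

sqoop list-databases lists the databases available on a server.

3.8.1 list-databases arguments

3.8.2 list-databases test log

Code: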

sqoop list-databases \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123

Test log:

[root@zhou ~]# sqoop list-databases \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
23/11/25 10:58:57 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
23/11/25 10:58:57 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
23/11/25 10:58:57 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
mysql
performance_schema
sys
test
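
Every run above logs a warning that a password on the command line is insecure. The help output lists --password-file as an alternative, and the sketch below shows one way it might be used. The file location and the helper name write_password_file are assumptions for this example, and the sqoop call is guarded so the script also runs on a machine without Sqoop. Sqoop uses the whole file content as the password, so the file must not end with a newline.

```shell
# Keep the password in a private directory; the path is a placeholder.
pw_dir=$(mktemp -d)
pw_file="$pw_dir/mysql.password"

write_password_file() {
  rm -f "$pw_file"
  # printf '%s' writes exactly the given characters; echo would append a newline.
  ( umask 077; printf '%s' "$1" > "$pw_file" )
  chmod 400 "$pw_file"
}

write_password_file 'abc123'
wc -c < "$pw_file"        # 6 bytes: the password only, no trailing newline

# Run the real command only where sqoop is installed.
# The file:// prefix marks a local path (assumed setup; an HDFS path also works).
if command -v sqoop >/dev/null 2>&1; then
  sqoop list-databases \
    --connect jdbc:mysql://10.31.1.122:3306/test \
    --username root \
    --password-file "file://$pw_file"
fi
```

For interactive use, -P achieves the same goal by prompting for the password on the console.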

3.9 list-tables tool

sqoop list-tables lists the tables in a database.

3.9.1 list-tables arguments

3.9.2 list-tables test case

Code:

sqoop list-tables \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123

Test log:

[root@zhou ~]# sqoop list-tables \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
23/11/25 11:07:37 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
23/11/25 11:07:37 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
23/11/25 11:07:37 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
bonus
dept
emp
fact_sale
salgrade
t1

3.10 version tool

sqoop version shows version information.

Test log:

[root@zhou ~]# sqoop version
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
23/11/25 11:09:36 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
Sqoop 1.4.7-cdh6.3.1
git commit id 
Compiled by jenkins on Thu Sep 26 02:57:53 PDT 2019
[root@zhou ~]#

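4. Syncing relational database tables to HDFS with Sqoop

In production, data is usually synced from the relational database to HDFS first and then loaded from HDFS into the Hive table. In this test, the performance was the same as importing directly into Hive.

-- Create the table structure on the target side
-- Very fast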
sqoop create-hive-table \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--hive-database default \
--table fact_sale --hive-table ods_fact_sale

-- Sync the MySQL table data to HDFS
sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table fact_sale \
--delete-target-dir \
--fields-terminated-by '\0001' \
--num-mappers 4 \
--target-dir /tmp/fact_sale 

-- Load the HDFS data into the table
-- Very fast
LOAD DATA INPATH '/tmp/fact_sale' into table ods_fact_sale;

Test log:

[root@zhou tmp]# sqoop import \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> --table fact_sale \
> --delete-target-dir \
> --fields-terminated-by '\0001' \
> --num-mappers 4 \
> --target-dir /tmp/fact_sale
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
23/12/07 15:13:42 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
23/12/07 15:13:42 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
23/12/07 15:13:42 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
23/12/07 15:13:42 INFO tool.CodeGenTool: Beginning code generation
23/12/07 15:13:42 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `fact_sale` AS t LIMIT 1
23/12/07 15:13:42 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `fact_sale` AS t LIMIT 1
23/12/07 15:13:42 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
23/12/07 15:13:44 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-root/compile/76fc9602ec3165207e0dd51e2115cc14/fact_sale.java to /tmp/./fact_sale.java. Error: Destination '/tmp/./fact_sale.java' already exists
23/12/07 15:13:44 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/76fc9602ec3165207e0dd51e2115cc14/fact_sale.jar
23/12/07 15:13:44 INFO tool.ImportTool: Destination directory /tmp/fact_sale deleted.
23/12/07 15:13:44 WARN manager.MySQLManager: It looks like you are importing from mysql.
23/12/07 15:13:44 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
23/12/07 15:13:44 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
23/12/07 15:13:44 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
23/12/07 15:13:45 INFO mapreduce.ImportJobBase: Beginning import of fact_sale
23/12/07 15:13:45 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
23/12/07 15:13:45 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
23/12/07 15:13:45 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/root/.staging/job_1606698967173_0119
23/12/07 15:13:47 INFO db.DBInputFormat: Using read commited transaction isolation
23/12/07 15:13:47 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `fact_sale`
23/12/07 15:13:47 INFO db.IntegerSplitter: Split size: 196905399; Num splits: 4 from: 1 to: 787621597
23/12/07 15:13:47 INFO mapreduce.JobSubmitter: number of splits:4
23/12/07 15:13:47 INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address
23/12/07 15:13:47 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
23/12/07 15:13:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1606698967173_0119
23/12/07 15:13:48 INFO mapreduce.JobSubmitter: Executing with tokens: []
23/12/07 15:13:48 INFO conf.Configuration: resource-types.xml not found
23/12/07 15:13:48 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
23/12/07 15:13:48 INFO impl.YarnClientImpl: Submitted application application_1606698967173_0119
23/12/07 15:13:48 INFO mapreduce.Job: The url to track the job: http://zhou:8088/proxy/application_1606698967173_0119/
23/12/07 15:13:48 INFO mapreduce.Job: Running job: job_1606698967173_0119
23/12/07 15:13:55 INFO mapreduce.Job: Job job_1606698967173_0119 running in uber mode : false
23/12/07 15:13:55 INFO mapreduce.Job:  map 0% reduce 0%
23/12/07 15:20:19 INFO mapreduce.Job:  map 25% reduce 0%
23/12/07 15:20:52 INFO mapreduce.Job:  map 50% reduce 0%
23/12/07 15:27:27 INFO mapreduce.Job:  map 75% reduce 0%
23/12/07 15:27:56 INFO mapreduce.Job:  map 100% reduce 0%
23/12/07 15:27:56 INFO mapreduce.Job: Job job_1606698967173_0119 completed successfully
23/12/07 15:27:56 INFO mapreduce.Job: Counters: 33
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=990012
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=441
                HDFS: Number of bytes written=31421093662
                HDFS: Number of read operations=24
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=8
                HDFS: Number of bytes read erasure-coded=0
        Job Counters 
                Launched map tasks=4
                Other local map tasks=4
                Total time spent by all maps in occupied slots (ms)=1647540
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=1647540
                Total vcore-milliseconds taken by all map tasks=1647540
                Total megabyte-milliseconds taken by all map tasks=1687080960
        Map-Reduce Framework
                Map input records=767830000
                Map output records=767830000
                Input split bytes=441
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=15214
                CPU time spent (ms)=1769770
                Physical memory (bytes) snapshot=1753706496
                Virtual memory (bytes) snapshot=10405761024
                Total committed heap usage (bytes)=1332215808
                Peak Map Physical memory (bytes)=464699392
                Peak Map Virtual memory (bytes)=2621968384
        File Input Format Counters 
                Bytes Read=0
        File Output Format Counters 
                Bytes Written=31421093662
23/12/07 15:27:56 INFO mapreduce.ImportJobBase: Transferred 29.2632 GB in 851.8971 seconds (35.175 MB/sec)
23/12/07 15:27:56 INFO mapreduce.ImportJobBase: Retrieved 767830000 records.
[root@zhou tmp]# 
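
The split size and throughput figures in the log above can be reproduced with simple arithmetic. The sketch below only approximates what Sqoop reports. It assumes the split size is (max - min) / number of mappers with integer division, and that MB means 1024 * 1024 bytes. The function names are illustrative and are not part of Sqoop.

```shell
#!/bin/sh
# Values copied from the log above.
min_id=1
max_id=787621597
num_mappers=4
bytes_written=31421093662
seconds=851.8971

# Split size as reported by IntegerSplitter (integer division).
split_size() {
  echo $(( ($max_id - $min_id) / $num_mappers ))
}

# Approximate lower bound of each split. Sqoop's own splitter decides
# the exact boundaries and how any remainder is distributed.
split_starts() {
  size=$(split_size)
  i=0
  while [ "$i" -lt "$num_mappers" ]; do
    echo $(( $min_id + $i * $size ))
    i=$(( $i + 1 ))
  done
}

# Throughput in MB/sec, with 1 MB = 1024 * 1024 bytes.
throughput_mb() {
  awk -v b="$bytes_written" -v s="$seconds" \
    'BEGIN { printf "%.3f\n", b / 1048576 / s }'
}

echo "split size: $(split_size)"
split_starts
echo "throughput: $(throughput_mb) MB/sec"
```

With the values from this run, the split size is 196905399 and the throughput is about 35.175 MB/sec, matching the log. Because each split is a range of id values rather than a row count, mappers can receive uneven numbers of rows when the id column has gaps.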


hive> 
    > 
    > LOAD DATA INPATH '/tmp/fact_sale' into table ods_fact_sale;
Loading data to table default.ods_fact_sale
OK
Time taken: 0.947 seconds
hive>

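5. Sqoop Incremental Sync

In production, running a full refresh every time consumes a lot of storage and keeps a lot of redundant data, so incremental updates are used instead. Sqoop supports incremental imports.

5.1 Import modes

1) append
2) lastmodified, which must be combined with --append (append the rows) or --merge-key (merge the rows, usually on the primary key)

5.2 Test cases

Prepare the test data: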

-- MySQL 
create table t_test(id int,name varchar(200),last_update_datetime timestamp,primary key(id) );
insert into t_test values (1,'abc','2023-12-01 00:09:00');
insert into t_test values (2,'def','2023-12-02 00:10:00');

Sync the data to Hive:

sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t_test \
--fields-terminated-by '\0001' \
--delete-target-dir \
--num-mappers 1 \
--hive-import \
--hive-database test \
--hive-table t_test

Check the Hive table data:

hive> 
    > select * from t_test;
OK
1       abc     2023-12-01 00:09:00.0
2       def     2023-12-02 00:10:00.0
Time taken: 0.452 seconds, Fetched: 2 row(s)

Create a Hive backup table so the test can be rolled back:

create table t_test_bak as select * from t_test;
hive> 
    > select * from t_test_bak;
OK
1       abc     2023-12-01 00:09:00.0
2       def     2023-12-02 00:10:00.0
Time taken: 0.059 seconds, Fetched: 2 row(s)
hive>

-- To roll back, truncate the table and then re-insert from the backup

truncate table t_test;
insert into t_test select * from t_test_bak;

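5.2.1 Import in append mode

Use append mode when importing a table to which new rows are continually added with increasing row id values. Specify the column that holds the row id with --check-column. Sqoop imports the rows whose check column value is greater than the value given with --last-value.

This mode suits insert-only data such as logs. It does not pick up updates to existing rows.

Change the MySQL source table: insert two new rows and update one existing row.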

mysql> insert into t_test values (3,'ccc','2023-12-07 00:09:00');
Query OK, 1 row affected (0.01 sec)

mysql> insert into t_test values (4,'ddd','2023-12-07 00:10:00'); 
Query OK, 1 row affected (0.00 sec)

mysql> update t_test set name='bbb',last_update_datetime = '2023-12-07 08:00:00' where id = 2;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select * from t_test;
+----+------+----------------------+
| id | name | last_update_datetime |
+----+------+----------------------+
|  1 | abc  | 2023-12-01 00:09:00  |
|  2 | bbb  | 2023-12-07 08:00:00  |
|  3 | ccc  | 2023-12-07 00:09:00  |
|  4 | ddd  | 2023-12-07 00:10:00  |
+----+------+----------------------+
4 rows in set (0.00 sec)

Run the incremental sync. --target-dir is the path where the table's data is stored, and --last-value is the largest id from the previous sync.

sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t_test \
--fields-terminated-by '\0001' \
--target-dir '/user/hive/warehouse/test.db/t_test' \
--incremental append \
--check-column id \
--last-value 2 \
-m 1

Verify the data. The two new rows were synced, but the update to the row with id 2 was not.

hive> 
    > 
    > select * from t_test;
OK
1       abc     2023-12-01 00:09:00.0
2       def     2023-12-02 00:10:00.0
3       ccc     2023-12-07 00:09:00.0
4       ddd     2023-12-07 00:10:00.0
Time taken: 0.1 seconds, Fetched: 4 row(s)

Restore the Hive table data, then sync again using the last_update_datetime timestamp as the check column:

sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t_test \
--fields-terminated-by '\0001' \
--target-dir '/user/hive/warehouse/test.db/t_test' \
--incremental append \
--check-column last_update_datetime \
--last-value '2023-12-02 00:10:00' \
-m 1

Verify the data. The row with id 2 is now duplicated:

hive> 
    > 
    > select * from t_test;
OK
1       abc     2023-12-01 00:09:00.0
2       def     2023-12-02 00:10:00.0
2       bbb     2023-12-07 08:00:00.0
3       ccc     2023-12-07 00:09:00.0
4       ddd     2023-12-07 00:10:00.0
Time taken: 0.067 seconds, Fetched: 5 row(s)
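
Both results above follow from how append mode selects rows: it only compares the check column with --last-value. The sketch below models that filter with awk on the four source rows. It is a simplified illustration of the selection rule, not Sqoop's implementation, and the function names select_by_id and select_by_ts are made up for this example.

```shell
#!/bin/sh
# Source rows after the MySQL changes: id|name|last_update_datetime
rows='1|abc|2023-12-01 00:09:00
2|bbb|2023-12-07 08:00:00
3|ccc|2023-12-07 00:09:00
4|ddd|2023-12-07 00:10:00'

# Numeric check column: keep rows with id greater than the last value.
select_by_id() {
  printf '%s\n' "$rows" | awk -F'|' -v last="$1" '$1 + 0 > last + 0'
}

# Timestamp check column: a string comparison is enough here because
# the timestamps share one fixed-width format.
select_by_ts() {
  printf '%s\n' "$rows" | awk -F'|' -v last="$1" '($3 "") > (last "")'
}

echo "check column id, last value 2:"
select_by_id 2
echo "check column last_update_datetime, last value 2023-12-02 00:10:00:"
select_by_ts '2023-12-02 00:10:00'
```

Filtering on id selects only rows 3 and 4, so the update to id 2 is missed. Filtering on last_update_datetime also selects the updated row 2, which append mode then writes next to the old copy instead of replacing it.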

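5.2.2 Import in lastmodified mode

Sqoop supports a second table update strategy called lastmodified mode. Use it when rows in the source table may be updated and every update sets a last-modified column to the current timestamp. Sqoop imports the rows whose check column holds a timestamp more recent than the one given with --last-value.

When an incremental import finishes, Sqoop prints the value to pass as --last-value on the next import. Passing that value on the next run ensures that only new or updated rows are imported. Creating the incremental import as a saved job handles this automatically and is the preferred way to run recurring incremental imports.

Original text from the official documentation: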

An alternate table update strategy supported by Sqoop is called lastmodified mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.

At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information.
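
A saved job for the t_test table might look like the sketch below. The job name t_test_incr and the password file path are hypothetical. --password-file is used so that the password does not appear on the command line. By default the script only prints the commands (DRY_RUN=1), so it can be reviewed before anything runs against the cluster.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1 (the default); execute it otherwise.
# The printed form does not preserve shell quoting.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

JOB_NAME=t_test_incr                      # hypothetical job name
PASSWORD_FILE=/user/root/mysql.password   # hypothetical path

# Create the saved job once. Sqoop keeps the incremental state with the
# job and updates it after each successful run.
create_job() {
  run sqoop job --create "$JOB_NAME" -- import \
    --connect jdbc:mysql://10.31.1.122:3306/test \
    --username root \
    --password-file "$PASSWORD_FILE" \
    --table t_test \
    --fields-terminated-by '\0001' \
    --target-dir /user/hive/warehouse/test.db/t_test \
    --incremental lastmodified \
    --check-column last_update_datetime \
    --last-value '2023-12-02 00:10:00' \
    --merge-key id \
    -m 1
}

# Run the saved job, for example from a scheduler.
exec_job() {
  run sqoop job --exec "$JOB_NAME"
}

create_job
exec_job
```

Setting DRY_RUN=0 executes the commands instead of printing them. After the job exists, only the --exec step needs to be repeated, and --last-value no longer has to be tracked by hand.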

5.2.2.1 --incremental lastmodified --append mode

This is also an incremental, insert-only approach.

sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t_test \
--fields-terminated-by '\0001' \
--target-dir '/user/hive/warehouse/test.db/t_test' \
--check-column last_update_datetime \
--incremental lastmodified \
--last-value '2023-12-02 00:10:00' \
-m 1 \
--append

Check the data. Every row whose timestamp is later than the last value was inserted, so the row with id 2 again appears twice:

hive> 
    > 
    > select * from t_test;
OK
1       abc     2023-12-01 00:09:00.0
2       def     2023-12-02 00:10:00.0
2       bbb     2023-12-07 08:00:00.0
3       ccc     2023-12-07 00:09:00.0
4       ddd     2023-12-07 00:10:00.0
Time taken: 0.07 seconds, Fetched: 5 row(s)

5.2.2.2 --incremental lastmodified --merge-key mode

The merge-key approach is similar to the MERGE statement in Oracle: a row that already exists is updated, and a row that does not exist is inserted.

sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t_test \
--fields-terminated-by '\0001' \
--target-dir '/user/hive/warehouse/test.db/t_test' \
--check-column last_update_datetime \
--incremental lastmodified \
--last-value '2023-12-02 00:10:00' \
-m 1 \
--merge-key id

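Check the data. The row with id 2 was updated, and rows 3 and 4 were inserted. For business data that carries a last-update timestamp, this method can keep the data warehouse consistent with the source data.

hive> select * from t_test;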
OK
1       abc     2023-12-01 00:09:00.0
2       bbb     2023-12-07 08:00:00.0
3       ccc     2023-12-07 00:09:00.0
4       ddd     2023-12-07 00:10:00.0
Time taken: 0.064 seconds, Fetched: 4 row(s)
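
The effect of --merge-key can be modeled as a keyed overwrite: rows already in the target are kept unless an imported row has the same key, in which case the imported row wins. The awk sketch below is a simplified model of that behavior on the test data, not Sqoop's actual merge job, and merge_by_key is a made-up helper name.

```shell
#!/bin/sh
# Rows already in the Hive table directory: id|name|last_update_datetime
old='1|abc|2023-12-01 00:09:00
2|def|2023-12-02 00:10:00'

# Rows selected by the lastmodified import
new='2|bbb|2023-12-07 08:00:00
3|ccc|2023-12-07 00:09:00
4|ddd|2023-12-07 00:10:00'

# Old rows are read first and new rows second, so for a repeated id
# the new row overwrites the old one.
merge_by_key() {
  printf '%s\n%s\n' "$old" "$new" |
    awk -F'|' '{ row[$1] = $0 } END { for (k in row) print row[k] }' |
    sort -t'|' -k1,1n
}

merge_by_key
```

The model yields four rows with id 2 carrying the name bbb, which is the same outcome as the Hive query above.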

