该项目已退役。详情请参阅其 阁楼页面。
文档主页
Sqoop 用户指南 (v1.4.7)
Sqoop 用户指南 (v1.4.7)
目录
一、简介
2. 支持的版本
3. Sqoop 发布
4. 先决条件
5. 基本用法
6. Sqoop 工具
6.1. 使用命令别名
6.2. 控制 Hadoop 安装
6.3. 使用通用参数和特定参数
6.4. 使用选项文件传递参数
6.5. 使用工具
7. sqoop-import
7.1. 目的
7.2. 句法
7.2.1. 连接到数据库服务器
7.2.2. 选择要导入的数据
7.2.3. 自由格式查询导入
7.2.4. 控制并行度
7.2.5. 控制分布式缓存
7.2.6. 控制进口过程
7.2.7. 控制事务隔离
7.2.8. 控制类型映射
7.2.9. 模式名称处理
7.2.10. 增量导入
7.2.11. 文件格式
7.2.12. 大型物体
7.2.13. 将数据导入 Hive
7.2.14. 将数据导入 HBase
7.2.15. 将数据导入 Accumulo
7.2.16. 其他导入配置属性
7.3. 示例调用
8. sqoop-import-all-tables
8.1. 目的
8.2. 句法
8.3. 示例调用
9. sqoop-import-mainframe
9.1. 目的
9.2. 句法
9.2.1. 连接到大型机
9.2.2. 选择要导入的文件
9.2.3. 控制并行度
9.2.4. 控制分布式缓存
9.2.5. 控制进口过程
9.2.6. 文件格式
9.2.7. 将数据导入 Hive
9.2.8. 将数据导入 HBase
9.2.9. 将数据导入 Accumulo
9.2.10. 其他导入配置属性
9.3. 示例调用
10. sqoop-export
10.1. 目的
10.2. 句法
10.3. 插入与更新
10.4. 出口和交易
10.5. 导出失败
10.6. 示例调用
11. validation
11.1. 目的
11.2. 介绍
11.3. 句法
11.4. 配置
11.5. 限制
11.6. 示例调用
12. 保存的工作
13. sqoop-job
13.1. 目的
13.2. 句法
13.3. 保存的作业和密码
13.4. 保存的工作和增量导入
14. sqoop-metastore
14.1. 目的
14.2. 句法
15. sqoop-merge
15.1. 目的
15.2. 句法
16. sqoop-codegen
16.1. 目的
16.2. 句法
16.3. 示例调用
17. sqoop-create-hive-table
17.1. 目的
17.2. 句法
17.3. 示例调用
18. sqoop-eval
18.1. 目的
18.2. 句法
18.3. 示例调用
19. sqoop-list-databases
19.1. 目的
19.2. 句法
19.3. 示例调用
20. sqoop-list-tables
20.1. 目的
20.2. 句法
20.3. 示例调用
21. sqoop-help
21.1. 目的
21.2. 句法
21.3. 示例调用
22. sqoop-version
22.1. 目的
22.2. 句法
22.3. 示例调用
23. Sqoop-HCatalog 集成
23.1. HCatalog 背景
23.2. 将 HCatalog 表暴露给 Sqoop
23.2.1. 新的命令行选项
23.2.2. 支持的 Sqoop Hive 选项
23.2.3. 直接模式支持
23.2.4. 不支持的 Sqoop 选项
23.2.4.1. 不受支持的 Sqoop Hive 导入选项
23.2.4.2. 不支持的 Sqoop 导出和导入选项
23.2.5. 忽略的 Sqoop 选项
23.3. 自动表创建
23.4. 分隔文本格式以及字段和行分隔符
23.5. HCatalog 表要求
23.6. 支持分区
23.7. 模式映射
23.8. 支持 HCatalog 数据类型
23.9. 为 Sqoop 作业提供 Hive 和 HCatalog 库
23.10. 例子
23.11. 进口
23.12. 出口
24. 兼容性说明
24.1. 支持的数据库
24.2. MySQL
24.2.1. 零日期时间行为
24.2.2. UNSIGNED列
24.2.3. BLOB和CLOB列
24.2.4. 在直接模式下导入视图
24.3. PostgreSQL
24.3.1. 在直接模式下导入视图
24.4. 甲骨文
24.4.1. 日期和时间
24.5. Hive 中的模式定义
24.6. Hive 中的 Parquet 支持
24.7. CUBRID
25. 特定连接器的注意事项
25.1. MySQL JDBC 连接器
25.1.1. 插入功能
25.2. MySQL 直接连接器
25.2.1. 要求
25.2.2. 限制
25.2.3. 直接模式事务
25.3. Microsoft SQL 连接器
25.3.1. 额外的参数
25.3.2. 允许身份插入
25.3.3. 非弹性操作
25.3.4. 模式支持
25.3.5. 表格提示
25.4. PostgreSQL 连接器
25.4.1. 额外的参数
25.4.2. 模式支持
25.5. PostgreSQL 直接连接器
25.5.1. 要求
25.5.2. 限制
25.6. pg_bulkload 连接器
25.6.1. 目的
25.6.2. 要求
25.6.3. 句法
25.6.4. 数据暂存
25.7. Netezza 连接器
25.7.1. 额外的参数
25.7.2. 直接模式
25.7.3. 空字符串处理
25.8. 适用于 Oracle 和 Hadoop 的数据连接器
25.8.1. 关于
25.8.1.1. 工作
25.8.1.2. 标准 Oracle 管理器如何用于导入
25.8.1.3. 适用于 Oracle 和 Hadoop 的数据连接器如何用于导入
25.8.1.4. 用于 Oracle 和 Hadoop 导出的数据连接器
25.8.2. 要求
25.8.2.1. 确保正确设置 Oracle 数据库 JDBC 驱动程序
25.8.2.2. Oracle 角色和特权
25.8.2.3. 导出所需的其他 Oracle 角色和特权
25.8.2.4. 支持的数据类型
25.8.3. 使用 Data Connector for Oracle 和 Hadoop 执行 Sqoop
25.8.3.1. 连接到 Oracle / Oracle RAC
25.8.3.2. 连接到 Oracle 数据库实例
25.8.3.3. 连接到 Oracle RAC
25.8.3.4. 登录 Oracle 实例
25.8.3.5. 杀死 Oracle 和 Hadoop 作业的数据连接器
25.8.4. 从 Oracle 导入数据
25.8.4.1. 将 Hadoop 文件与 Oracle 表分区匹配
25.8.4.2. 指定要导入的分区
25.8.4.3. 一致读取:所有映射器从同一时间点读取
25.8.5. 将数据导出到 Oracle
25.8.5.1。插入导出
25.8.5.2. 更新导出
25.8.5.3. 合并导出
25.8.5.4. 创建 Oracle 表
25.8.5.5. 不记录
25.8.5.6. 分区
25.8.5.7。通过多列匹配行
25.8.5.8. 存储条款
25.8.6. 管理日期和时间戳数据类型
25.8.6.1。从 Oracle 导入日期和时间戳数据类型
25.8.6.2. 适用于 Oracle 和 Hadoop 的数据连接器不对 DATE / TIMESTAMP 数据类型应用时区
25.8.6.3. 适用于 Oracle 和 Hadoop 的数据连接器以 TIMEZONE 数据类型保留时区信息
25.8.6.4. 适用于 Oracle 和 Hadoop 的数据连接器明确说明 LOCAL TIMEZONE 数据类型的时区
25.8.6.5. java.sql.时间戳
25.8.6.6. 将日期和时间戳数据类型导出到 Oracle
25.8.7. 为 Oracle 和 Hadoop 配置数据连接器
25.8.7.1。oraoop-site-template.xml
25.8.7.2. oraoop.oracle.session.initialization.statements
25.8.7.3. oraoop.table.import.where.clause.location
25.8.7.4. oracle.row.fetch.size
25.8.7.5。oraoop.import.hint
25.8.7.6. oraoop.oracle.append.values.hint.usage
25.8.7.7。mapred.map.tasks.speculative.execution
25.8.7.8。oraoop.block.allocation
25.8.7.9。oraoop.import.omit.lobs.and.long
25.8.7.10。oraoop.locations
25.8.7.11。sqoop.connection.factories
25.8.7.12。oraoop-site.xml 中的表达式
25.8.8. 对 Oracle 和 Hadoop 的数据连接器进行故障排除
25.8.8.1。引用 Oracle 所有者和表
25.8.8.2. 引用 Oracle 列
25.8.8.3. 确认 Oracle 和 Hadoop 的数据连接器可以初始化 Oracle 会话
25.8.8.4. 检查 Sqoop 调试日志中的错误消息
25.8.8.5。导出:检查表是否兼容
25.8.8.6。导出:并行化
25.8.8.7。导出:检查oraoop.oracle.append.values.hint.usage
25.8.8.8。打开详细
26. 获得支持
27. 故障排除
27.1. 一般故障排除过程
27.2. 具体故障排除提示
27.2.1. Oracle:连接重置错误
27.2.2. Oracle:区分大小写的目录查询错误
27.2.3. MySQL:连接失败
27.2.4. Oracle:ORA-00933 错误(SQL 命令未正确结束)
27.2.5. MySQL:从 MySQL 导入 TINYINT(1) 的行为很奇怪
授权给 Apache 软件基金会 (ASF) 下一个
或更多贡献者许可协议。查看通知文件
与这项工作一起分发以获取更多信息
关于版权归属。ASF 许可此文件
根据 Apache 许可,版本 2.0(
“执照”); 除非合规,否则您不得使用此文件
与许可证。您可以在以下网址获取许可证副本
http://www.apache.org/licenses/LICENSE-2.0
除非适用法律要求或书面同意,否则软件
根据许可证分发的分发是按“原样”的基础分发的,
不作任何形式的明示或暗示的保证或条件。
请参阅许可证以了解管理权限的特定语言和
许可证下的限制。
一、简介
Sqoop 是一种旨在在 Hadoop 和关系数据库或大型机之间传输数据的工具。您可以使用 Sqoop 将数据从 MySQL 或 Oracle 等关系数据库管理系统 (RDBMS) 或大型机导入 Hadoop 分布式文件系统 (HDFS),在 Hadoop MapReduce 中转换数据,然后将数据导出回 RDBMS .
Sqoop 自动化了这个过程的大部分,依靠数据库来描述要导入的数据的模式。Sqoop 使用 MapReduce 来导入和导出数据,它提供并行操作和容错。
本文档介绍了如何开始使用 Sqoop 在数据库和 Hadoop 或大型机之间将数据移动到 Hadoop,并为 Sqoop 命令行工具套件的操作提供参考信息。本文档适用于:
系统和应用程序员
系统管理员
数据库管理员
数据分析师
数据工程师
2. 支持的版本
本文档适用于 Sqoop v1.4.7。
Sqoop 的软件开发发生在http://sqoop.apache.org 在该站点您可以获得:
Sqoop 的新版本及其最新的源代码
问题跟踪器
包含 Sqoop 文档的 wiki
4. 先决条件
本产品需要以下必备知识:
基本计算机技术和术语
熟悉命令行界面,例如 bash
关系数据库管理系统
基本熟悉Hadoop的目的和操作
在您可以使用 Sqoop 之前,必须安装和配置一个 Hadoop 版本。Sqoop 目前支持 4 个主要的 Hadoop 版本 - 0.20、0.23、1.0 和 2.0。
本文档假设您使用的是 Linux 或类似 Linux 的环境。如果您使用的是 Windows,则可以使用 cygwin 来完成以下大部分任务。如果您使用的是 Mac OS X,您应该会看到很少(如果有)兼容性错误。Sqoop 主要在 Linux 上运行和测试。
导入过程的一个副产品是一个生成的 Java 类,它可以封装导入表的一行。此类由 Sqoop 本身在导入过程中使用。还为您提供了此类的 Java 源代码,用于后续的 MapReduce 数据处理。此类可以将数据与 SequenceFile 格式进行序列化和反序列化。它还可以解析记录的分隔文本形式。这些功能使您能够快速开发在处理管道中使用 HDFS 存储记录的 MapReduce 应用程序。您还可以使用您喜欢的任何其他工具自行解析分隔记录数据。
在处理导入的记录(例如,使用 MapReduce 或 Hive)后,您可能拥有一个结果数据集,然后您可以将其导出 回关系数据库。Sqoop 的导出过程会从 HDFS 并行读取一组带分隔符的文本文件,将它们解析为记录,并将它们作为新行插入目标数据库表中,供外部应用程序或用户使用。
Sqoop 包括一些其他命令,允许您检查您正在使用的数据库。例如,您可以列出可用的数据库模式(使用sqoop-list-databases工具)和模式中的表(使用sqoop-list-tables工具)。Sqoop 还包括一个原始的 SQL 执行 shell(sqoop-eval工具)。
导入、代码生成和导出过程的大多数方面都可以自定义。对于数据库,您可以控制导入的特定行范围或列。您可以为基于文件的数据表示以及使用的文件格式指定特定的分隔符和转义字符。您还可以控制生成的代码中使用的类或包名称。本文档的后续部分解释了如何为 Sqoop 指定这些和其他参数。
如果 Sqoop 是从自己的源代码编译的,则无需正式安装过程即可通过运行bin/sqoop程序来运行 Sqoop 。Sqoop 打包部署(例如 Apache Bigtop 附带的 RPM)的用户将看到此程序安装为/usr/bin/sqoop. 本文档的其余部分将此程序称为sqoop. 例如:
$ sqoop 工具名称 [工具参数]
[笔记] 笔记
以下以 字 符 开 头 的 示 例 表 示 必 须 在 终 端 提 示 符 下 输 入 命 令 ( 例 如 b a s h ) 。 的 字符开头的示例表示必须在终端提示符下输入命令(例如 bash)。的 字符开头的示例表示必须在终端提示符下输入命令(例如bash)。的字符表示提示本身; 您不应通过键入 . 您 还 可 以 在 段 落 文 本 中 内 联 输 入 命 令 ; 例 如 , s q o o p h e l p 。 这 些 示 例 不 显 示 . 您还可以在段落文本中内联输入命令;例如,sqoop help。这些示例不显示 .您还可以在段落文本中内联输入命令;例如,sqoophelp。这些示例不显示前缀,但您应该以相同的方式输入它们。不要将 示 例 中 的 s h e l l 提 示 与 示例中的shell 提示与 示例中的shell提示与 环境变量名称前面的混淆。例如,字符串文字$HADOOP_HOME包括“ $”。
Sqoop 附带了一个帮助工具。要显示所有可用工具的列表,请键入以下命令:
$ sqoop 帮助
用法:sqoop 命令 [ARGS]
可用命令:
codegen 生成与数据库记录交互的代码
create-hive-table 将表定义导入 Hive
eval 评估一条 SQL 语句并显示结果
export 将 HDFS 目录导出到数据库表
help 列出可用命令
import 将表从数据库导入到 HDFS
import-all-tables 从数据库导入表到 HDFS
import-mainframe 将大型机数据集导入 HDFS
list-databases 列出服务器上可用的数据库
list-tables 列出数据库中的可用表
version 显示版本信息
有关特定命令的信息,请参阅“sqoop help COMMAND”。
您可以通过输入来显示特定工具的帮助:sqoop help (tool-name); 例如,sqoop help import。
您还可以将–help参数添加到任何命令:sqoop import --help.
6.1. 使用命令别名
除了键入sqoop (toolname)语法之外,您还可以使用指定sqoop-(toolname)语法的别名脚本。例如,脚本sqoop-import、sqoop-export等各选择一个特定的工具。
6.2. 控制 Hadoop 安装
您可以通过 Hadoop 提供的程序启动功能调用 Sqoop。在sqoop命令行程序是运行在一个包装 bin/hadoop附带的Hadoop脚本。如果您的机器上安装了多个 Hadoop,您可以通过设置$HADOOP_COMMON_HOME和 $HADOOP_MAPRED_HOME环境变量来选择 Hadoop 安装。
例如:
$ HADOOP_COMMON_HOME=/path/to/some/hadoop
HADOOP_MAPRED_HOME=/path/to/some/hadoop-mapreduce
sqoop导入–参数…
要么:
$ export HADOOP_COMMON_HOME=/some/path/to/hadoop
$ export HADOOP_MAPRED_HOME=/some/path/to/hadoop-mapreduce
$ sqoop import --arguments…
如果这些变量中的任何一个未设置,Sqoop 将回退到 $HADOOP_HOME. 如果也未设置,Sqoop 将分别使用 Apache Bigtop/usr/lib/hadoop和 的默认安装位置/usr/lib/hadoop-mapreduce。
活动 Hadoop 配置从 加载 H A D O O P H O M E / c o n f / , 除 非 HADOOP_HOME/conf/,除非 HADOOPHOME/conf/,除非HADOOP_CONF_DIR设置了环境变量。
6.3. 使用通用参数和特定参数
要控制每个 Sqoop 工具的操作,您可以使用通用参数和特定参数。
例如:
$ sqoop 帮助导入
用法:sqoop 导入 [GENERIC-ARGS] [TOOL-ARGS]
常见论点:
–connect 指定 JDBC 连接字符串
–connect-manager 指定要使用的连接管理器类
–driver 手动指定要使用的 JDBC 驱动程序类
–hadoop-mapred-home <目录> 覆盖 $HADOOP_MAPRED_HOME
–help 打印使用说明
–password-file 为包含认证密码的文件设置路径
-P 从控制台读取密码
–password <密码> 设置认证密码
–username 设置认证用户名
–verbose 工作时打印更多信息
–hadoop-home
[…]
通用 Hadoop 命令行参数:
(必须在任何特定于工具的参数之前)
支持的通用选项是
-conf <配置文件> 指定应用程序配置文件
-D
-fs
-jt
-files <逗号分隔的文件列表> 指定要复制到map reduce集群的逗号分隔文件
-libjars <逗号分隔的 jar 列表> 指定要包含在类路径中的逗号分隔的 jar 文件。
-archives <逗号分隔的存档列表> 指定要在计算机上取消存档的逗号分隔存档。
一般的命令行语法是
bin/hadoop 命令 [genericOptions] [commandOptions]
您必须在工具名称之后但在任何特定于工具的参数(例如)之前提供通用参数-conf、等 。请注意,通用 Hadoop 参数以单个短划线字符 ( ) 开头,而特定于工具的参数以两个短划线 ( )开头,除非它们是单个字符参数,例如.-D–connect----P
的-conf,-D,-fs和-jt参数控制配置和Hadoop服务器设置。例如,-D mapred.job.name=
的-files,-libjars和-archives参数通常不使用Sqoop,但它们包含作为Hadoop的内部争论,分析系统的一部分。
6.4. 使用选项文件传递参数
使用 Sqoop 时,为方便起见,可以将在调用之间不发生变化的命令行选项放在一个选项文件中。选项文件是一个文本文件,其中每一行按照它在命令行中出现的顺序标识一个选项。选项文件允许通过在中间行末尾使用反斜杠字符在多行上指定单个选项。还支持选项文件中以散列字符开头的注释。注释必须在新行中指定,并且不能与选项文本混合。展开选项文件时,所有注释和空行都将被忽略。除非选项显示为带引号的字符串,否则任何前导或尾随空格都将被忽略。带引号的字符串(如果使用)不得超出指定的行。
选项文件可以在命令行中的任何位置指定,只要其中的选项遵循其他规定的选项排序规则。例如,无论选项是从哪里加载的,它们都必须遵循这样的顺序,即通用选项首先出现,然后是工具特定选项,最后是旨在传递给子程序的选项。
要指定选项文件,只需在方便的位置创建一个选项文件,然后通过–options-file参数将其传递到命令行 。
每当指定选项文件时,它都会在调用工具之前在命令行上展开。如果需要,您可以在同一个调用中指定多个选项文件。
例如,可以替代地指定以下用于导入的 Sqoop 调用,如下所示:
$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST
$ sqoop --options-file /users/homer/work/import.txt --table TEST
其中选项文件/users/homer/work/import.txt包含以下内容:
进口
进口
6.5. 使用工具
以下部分将描述每个工具的操作。这些工具按照您会发现它们最有用的顺序列出。
7.2. 句法
7.2.1. 连接到数据库服务器
7.2.2. 选择要导入的数据
7.2.3. 自由格式查询导入
7.2.4. 控制并行度
7.2.5. 控制分布式缓存
7.2.6. 控制进口过程
7.2.7. 控制事务隔离
7.2.8. 控制类型映射
7.2.9. 模式名称处理
7.2.10. 增量导入
7.2.11. 文件格式
7.2.12. 大型物体
7.2.13. 将数据导入 Hive
7.2.14. 将数据导入 HBase
7.2.15. 将数据导入 Accumulo
7.2.16. 其他导入配置属性
$ sqoop导入(通用参数)(导入参数)
$ sqoop-import (generic-args) (import-args)
虽然 Hadoop 通用参数必须在任何导入参数之前,但您可以按任意顺序输入导入参数。
[笔记] 笔记
在本文档中,参数被分组到按函数组织的集合中。一些集合存在于多个工具中(例如,“common”参数)。仅在本文档的第一个演示文稿中对其功能进行了扩展说明。
表 1. 常见参数
争论 描述
–connect 指定 JDBC 连接字符串
–connection-manager 指定要使用的连接管理器类
–driver 手动指定要使用的 JDBC 驱动程序类
–hadoop-mapred-home
7.2.1. 连接到数据库服务器
Sqoop 旨在将表从数据库导入 HDFS。为此,您必须指定一个描述如何连接到数据库的连接字符串。该连接字符串是类似于URL,并传达给Sqoop与–connect争论。这描述了要连接的服务器和数据库;它还可以指定端口。例如:
$ sqoop import --connect jdbc:mysql://database.example.com/employees
此字符串将连接到employees主机上命名的 MySQL 数据库database.example.com。如果您打算将 Sqoop 与分布式 Hadoop 集群一起使用,请不要使用 URL ,这一点很重要localhost。您提供的连接字符串将用于整个 MapReduce 集群的 TaskTracker 节点;如果您指定文字 name localhost,每个节点将连接到不同的数据库(或者更有可能根本没有数据库)。相反,您应该使用所有远程节点都可以看到的数据库主机的完整主机名或 IP 地址。
您可能需要先对数据库进行身份验证,然后才能访问它。您可以使用–username向数据库提供用户名。Sqoop 提供了几种不同的方式来向数据库提供密码,安全和非安全,下面详述。
向数据库提供密码的安全方式。 您应该将密码保存在用户主目录中具有 400 权限的文件中–password-file ,并使用参数指定该文件的路径,这是输入凭据的首选方法。然后 Sqoop 将从文件中读取密码,并使用安全方式将其传递给 MapReduce 集群,而不会在作业配置中暴露密码。包含密码的文件可以位于本地 FS 或 HDFS 上。例如:
$ sqoop import --connect jdbc:mysql://database.example.com/employees
–username venkatesh --password-file ${user.home}/.password
[警告] 警告
Sqoop 将读取密码文件的全部内容并将其用作密码。这将包括任何尾随空白字符,例如大多数文本编辑器默认添加的换行符。您需要确保您的密码文件只包含属于您的密码的字符。在命令行上,您可以使用echo带有 switch 的命令 -n来存储密码,而无需任何尾随空格字符。例如,要存储密码,secret您将调用 echo -n “secret” > password.file.
另一种提供密码的方法是使用-P从控制台提示中读取密码的参数。
保护密码免受猎物的侵害。 Hadoop 2.6.0 提供了一个 API 来将密码存储与应用程序分开。此 API 称为凭据提供 API,并且有一个新的 credential 命令行工具来管理密码及其别名。密码与其别名一起存储在受密码保护的密钥库中。密钥库密码可以提供给命令行上的密码提示,通过环境变量或默认为软件定义的常量。请查看 Hadoop 文档以了解该工具的使用情况。
一旦使用 Credential Provider 工具存储密码并且适当更新 Hadoop 配置,所有应用程序都可以选择使用别名代替实际密码,并在运行时解析要使用的密码的别名。
由于用于存储凭证提供者的密钥库或类似技术是跨组件共享的,因此可以安全地存储各种应用程序、各种数据库和其他密码的密码,并且只需要在配置文件中公开别名,从而保护密码不被泄露可见的。
Sqoop 已得到增强,如果在所使用的底层 Hadoop 版本中可用,则允许使用此功能。引入了一个新选项来在命令行上提供别名而不是实际密码 (–password-alias)。此选项的参数值是与实际密码关联的存储上的别名。示例用法如下:
$ sqoop import --connect jdbc:mysql://database.example.com/employees
–username dbuser --password-alias mydb.password.alias
同样,如果不首选命令行选项,则可以将别名保存在 --password-file 选项提供的文件中。与此同时,Sqoop 配置参数 org.apache.sqoop.credentials.loader.class 应设置为提供别名解析的类名: org.apache.sqoop.util.password.CredentialProviderPasswordLoader
示例用法如下(假设 .password.alias 具有真实密码的别名):
$ sqoop import --connect jdbc:mysql://database.example.com/employees
–username dbuser --password-file ${user.home}/.password-alias
[警告] 警告
该–password参数是不安全的,因为其他用户可能能够通过诸如ps. 与使用-P参数相比,参数是首选方法–password。凭证仍然可以使用不安全的方式在 MapReduce 集群的节点之间传输。例如:
$ sqoop import --connect jdbc:mysql://database.example.com/employees
–username aaron --password 12345
Sqoop 自动支持多种数据库,包括 MySQL。以 开头的连接字符串jdbc:mysql://在 Sqoop 中自动处理。(“支持的数据库”部分提供了具有内置支持的完整数据库列表。对于某些数据库,您可能需要自己安装 JDBC 驱动程序。)
您可以将 Sqoop 与任何其他符合 JDBC 的数据库一起使用。首先,为要导入的数据库类型下载相应的 JDBC 驱动程序,并将 .jar 文件安装在$SQOOP_HOME/lib客户端计算机的目录中。(/usr/lib/sqoop/lib如果您是从 RPM 或 Debian 软件包安装的,则会出现这种情况。)每个驱动程序.jar文件还有一个特定的驱动程序类,用于定义驱动程序的入口点。例如,MySQL 的 Connector/J 库的驱动程序类为com.mysql.jdbc.Driver. 请参阅特定于数据库供应商的文档以确定主要驱动程序类。此类必须作为 Sqoop 的参数提供–driver。
例如,要连接到 SQLServer 数据库,首先从 microsoft.com 下载驱动程序并将其安装在您的 Sqoop 库路径中。
然后运行 Sqoop。例如:
$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver
–connect <连接字符串> …
使用 JDBC 连接到数据库时,您可以选择使用选项通过属性文件指定额外的 JDBC 参数 --connection-param-file。该文件的内容被解析为标准 Java 属性,并在创建连接时传递给驱动程序。
[笔记] 笔记
通过可选属性文件指定的参数仅适用于 JDBC 连接。任何使用 JDBC 以外的连接的快速路径连接器都将忽略这些参数。
表 2. 验证参数更多细节
争论 描述
–validate 启用数据复制验证,仅支持单表复制。
–validator 指定要使用的验证器类。
–validation-threshold 指定要使用的验证阈值类。
–validation-failurehandler 指定要使用的验证失败处理程序类。
表 3. 导入控制参数:
争论 描述
–append 将数据附加到 HDFS 中的现有数据集
–as-avrodatafile 将数据导入 Avro 数据文件
–as-sequencefile 将数据导入到 SequenceFiles
–as-textfile 以纯文本形式导入数据(默认)
–as-parquetfile 将数据导入 Parquet 文件
–boundary-query 用于创建拆分的边界查询
–columns
在–null-string和–null-non-string参数都是可选的。\如果未指定,那么字符串“null”将被使用。
7.2.2. 选择要导入的数据
Sqoop 通常以表为中心的方式导入数据。使用 --table参数选择要导入的表。例如,–table employees。此参数还可以标识数据库中的一个VIEW或其他类似表的实体。
默认情况下,选择表中的所有列进行导入。导入的数据按其“自然顺序”写入 HDFS;也就是说,包含 A、B 和 C 列的表会导致导入数据,例如:
A1,B1,C1
A2,B2,C2
…
您可以选择列的子集并使用–columns参数控制它们的顺序。这应该包括要导入的列的逗号分隔列表。例如:–columns “name,employee_id,jobtitle”。
您可以通过WHERE在导入语句中添加 SQL子句来控制导入哪些行。默认情况下,Sqoop 生成形式为 的语句SELECT FROM
。您可以WHERE在此–where参数中附加一个 子句。例如:–where “id > 400”。只有列id值大于 400 的行才会被导入。默认情况下,sqoop 将使用查询select min(), max() from
来找出创建拆分的边界。在某些情况下,此查询不是最佳查询,因此您可以使用–boundary-query参数指定任何返回两个数字列的任意查询。7.2.3. 自由格式查询导入
Sqoop 还可以导入任意 SQL 查询的结果集。您可以使用参数指定 SQL 语句,而不是使用–table,–columns和参数。–where–query
导入自由格式查询时,必须使用–target-dir.
如果要并行导入查询的结果,那么每个 map 任务都需要执行查询的副本,结果由 Sqoop 推断的边界条件进行分区。您的查询必须包含$CONDITIONS 每个 Sqoop 进程将替换为唯一条件表达式的标记。您还必须选择带有 的拆分列–split-by。
例如:
$ sqoop导入
–query ‘SELECT a., b. FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS’
–split-by a.id --target-dir /user/foo/joinresults
或者,通过使用以下命令指定单个 map 任务,查询可以执行一次并连续导入-m 1:
$ sqoop导入
–query ‘SELECT a., b. FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS’
-m 1 --target-dir /user/foo/joinresults
[笔记] 笔记
If you are issuing the query wrapped with double quotes ("), you will have to use $CONDITIONS instead of just $CONDITIONS to disallow your shell from treating it as a shell variable. For example, a double quoted query may look like: “SELECT * FROM x WHERE a=‘foo’ AND $CONDITIONS”
[笔记] Note
The facility of using free-form query in the current version of Sqoop is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause. Use of complex queries such as queries that have sub-queries or joins leading to ambiguous projections can lead to unexpected results.
7.2.4. Controlling Parallelism
Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism greater than that available within your MapReduce cluster; tasks will run serially and will likely increase the amount of time required to perform the import. Likewise, do not increase the degree of parallism higher than that which your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.
When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.
If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument. For example, --split-by employee_id. Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
User can override the --num-mapers by using --split-limit option. Using the --split-limit parameter places a limit on the size of the split section created. If the size of the split created is larger than the size specified in this parameter, then the splits would be resized to fit within this limit, and the number of splits will change according to that.This affects actual number of mappers. If size of a split calculated based on provided --num-mappers parameter exceeds --split-limit parameter then actual number of mappers will be increased.If the value specified in --split-limit parameter is 0 or negative, the parameter will be ignored altogether and the split size will be calculated according to the number of mappers.
If a table does not have a primary key defined and the --split-by is not provided, then import will fail unless the number of mappers is explicitly set to one with the --num-mappers 1 option or the --autoreset-to-one-mapper option is used. The option --autoreset-to-one-mapper is typically used with the import-all-tables tool to automatically handle tables without a primary key in a schema.
7.2.5. Controlling Distributed Cache
Sqoop will copy the jars in $SQOOP_HOME/lib folder to job cache every time when start a Sqoop job. When launched by Oozie this is unnecessary since Oozie use its own Sqoop share lib which keeps Sqoop dependencies in the distributed cache. Oozie will do the localization on each worker node for the Sqoop dependencies only once during the first Sqoop job and reuse the jars on worker node for subsquencial jobs. Using option --skip-dist-cache in Sqoop command when launched by Oozie will skip the step which Sqoop copies its dependencies to job cache and save massive I/O.
7.2.6. Controlling the Import Process
By default, the import process will use JDBC which provides a reasonable cross-vendor import channel. Some databases can perform imports in a more high-performance fashion by using database-specific data movement tools. For example, MySQL provides the mysqldump tool which can export data from MySQL to other systems very quickly. By supplying the --direct argument, you are specifying that Sqoop should attempt the direct import channel. This channel may be higher performance than using JDBC.
Details about use of direct mode with each specific RDBMS, installation requirements, available options and limitations can be found in Section 25, “Notes for specific connectors”.
By default, Sqoop will import a table named foo to a directory named foo inside your home directory in HDFS. For example, if your username is someuser, then the import tool will write to /user/someuser/foo/(files). You can adjust the parent directory of the import with the --warehouse-dir argument. For example:
$ sqoop import --connnect --table foo --warehouse-dir /shared
…
This command would write to a set of files in the /shared/foo/ directory.
You can also explicitly choose the target directory, like so:
$ sqoop import --connnect --table foo --target-dir /dest
…
This will import the files into the /dest directory. --target-dir is incompatible with --warehouse-dir.
When using direct mode, you can specify additional arguments which should be passed to the underlying tool. If the argument – is given on the command-line, then subsequent arguments are sent directly to the underlying tool. For example, the following adjusts the character set used by mysqldump:
$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar
–direct – --default-character-set=latin1
By default, imports go to a new target location. If the destination directory already exists in HDFS, Sqoop will refuse to import and overwrite that directory’s contents. If you use the --append argument, Sqoop will import data to a temporary directory and then rename the files into the normal target directory in a manner that does not conflict with existing filenames in that directory.
7.2.7. Controlling transaction isolation
By default, Sqoop uses the read committed transaction isolation in the mappers to import data. This may not be the ideal in all ETL workflows and it may desired to reduce the isolation guarantees. The --relaxed-isolation option can be used to instruct Sqoop to use read uncommitted isolation level.
The read-uncommitted isolation level is not supported on all databases (for example, Oracle), so specifying the option --relaxed-isolation may not be supported on all databases.
7.2.8. Controlling type mapping
Sqoop is preconfigured to map most SQL types to appropriate Java or Hive representatives. However the default mapping might not be suitable for everyone and might be overridden by --map-column-java (for changing mapping to Java) or --map-column-hive (for changing Hive mapping).
Table 4. Parameters for overriding mapping
Argument Description
–map-column-java Override mapping from SQL to Java type for configured columns.
–map-column-hive Override mapping from SQL to Hive type for configured columns.
Sqoop is expecting comma separated list of mapping in form =. For example:
$ sqoop import … --map-column-java id=String,value=Integer
Notice that specifying commas in --map-column-hive option, you should use URL encoded keys and values, for example, use DECIMAL(1%2C%201) instead of DECIMAL(1, 1).
Sqoop will rise exception in case that some configured mapping will not be used.
7.2.9. Schema name handling
When sqoop imports data from an enterprise store, table and column names may have characters that are not valid Java identifier characters or Avro/Parquet identifiers. To address this, sqoop translates these characters to _ as part of the schema creation. Any column name starting with an _ (underscore) character will be translated to have two underscore characters. For example _AVRO will be converted to __AVRO.
In the case of HCatalog imports, column names are converted to lower case when mapped to HCatalog columns. This may change in future.
7.2.10. Incremental Imports
Sqoop provides an incremental import mode which can be used to retrieve only rows newer than some previously-imported set of rows.
The following arguments control incremental imports:
Table 5. Incremental import arguments:
Argument Description
–check-column (col) Specifies the column to be examined when determining which rows to import. (the column should not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/ LONGVARCHAR/LONGNVARCHAR)
–incremental (mode) Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
–last-value (value) Specifies the maximum value of the check column from the previous import.
Sqoop supports two types of incremental imports: append and lastmodified. You can use the --incremental argument to specify the type of incremental import to perform.
You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row’s id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.
An alternate table update strategy supported by Sqoop is called lastmodified mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.
At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information.
7.2.11. File Formats
You can import data in one of two file formats: delimited text or SequenceFiles.
Delimited text is the default import format. You can also specify it explicitly by using the --as-textfile argument. This argument will write string-based representations of each record to the output files, with delimiter characters between individual columns and rows. These delimiters may be commas, tabs, or other characters. (The delimiters can be selected; see “Output line formatting arguments.”) The following is the results of an example text-based import:
1,here is a message,2010-05-01
2,happy new year!,2010-01-01
3,another message,2009-11-12
Delimited text is appropriate for most non-binary data types. It also readily supports further manipulation by other tools, such as Hive.
SequenceFiles are a binary format that store individual records in custom record-specific data types. These data types are manifested as Java classes. Sqoop will automatically generate these data types for you. This format supports exact storage of all data in binary representations, and is appropriate for storing binary data (for example, VARBINARY columns), or data that will be principly manipulated by custom MapReduce programs (reading from SequenceFiles is higher-performance than reading from text files, as records do not need to be parsed).
Avro data files are a compact, efficient binary format that provides interoperability with applications written in other programming languages. Avro also supports versioning, so that when, e.g., columns are added or removed from a table, previously imported data files can be processed along with new ones.
By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument. This applies to SequenceFile, text, and Avro files.
7.2.12. Large Objects
Sqoop handles large objects (BLOB and CLOB columns) in particular ways. If this data is truly large, then these columns should not be fully materialized in memory for manipulation, as most columns are. Instead, their data is handled in a streaming fashion. Large objects can be stored inline with the rest of the data, in which case they are fully materialized in memory on every access, or they can be stored in a secondary storage file linked to the primary data storage. By default, large objects less than 16 MB in size are stored inline with the rest of the data. At a larger size, they are stored in files in the _lobs subdirectory of the import target directory. These files are stored in a separate format optimized for large record storage, which can accomodate records of up to 2^63 bytes each. The size at which lobs spill into separate files is controlled by the --inline-lob-limit argument, which takes a parameter specifying the largest lob size to keep inline, in bytes. If you set the inline LOB limit to 0, all large objects will be placed in external storage.
Table 6. Output line formatting arguments:
Argument Description
–enclosed-by Sets a required field enclosing character
–escaped-by Sets the escape character
–fields-terminated-by Sets the field separator character
–lines-terminated-by Sets the end-of-line character
–mysql-delimiters Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: ’
–optionally-enclosed-by Sets a field enclosing character
When importing to delimited files, the choice of delimiter is important. Delimiters which appear inside string-based fields may cause ambiguous parsing of the imported data by subsequent analysis passes. For example, the string “Hello, pleased to meet you” should not be imported with the end-of-field delimiter set to a comma.
Delimiters may be specified as:
a character (–fields-terminated-by X)
an escape character (–fields-terminated-by \t). Supported escape characters are:
\b (backspace)
\n (newline)
\r (carriage return)
\t (tab)
" (double-quote)
\’ (single-quote)
\ (backslash)
\0 (NUL) - This will insert NUL characters between fields or lines, or will disable enclosing/escaping if used for one of the --enclosed-by, --optionally-enclosed-by, or --escaped-by arguments.
The octal representation of a UTF-8 character’s code point. This should be of the form \0ooo, where ooo is the octal value. For example, --fields-terminated-by \001 would yield the ^A character.
The hexadecimal representation of a UTF-8 character’s code point. This should be of the form \0xhhh, where hhh is the hex value. For example, --fields-terminated-by \0x10 would yield the carriage return character.
The default delimiters are a comma (,) for fields, a newline (\n) for records, no quote character, and no escape character. Note that this can lead to ambiguous/unparsible records if you import database records containing commas or newlines in the field data. For unambiguous parsing, both must be enabled. For example, via --mysql-delimiters.
If unambiguous delimiters cannot be presented, then use enclosing and escaping characters. The combination of (optional) enclosing and escaping characters will allow unambiguous parsing of lines. For example, suppose one column of a dataset contained the following values:
Some string, with a comma.
Another “string with quotes”
The following arguments would provide delimiters which can be unambiguously parsed:
$ sqoop import --fields-terminated-by , --escaped-by \ --enclosed-by ‘"’ …
(Note that to prevent the shell from mangling the enclosing character, we have enclosed that argument itself in single-quotes.)
The result of the above arguments applied to the above dataset would be:
“Some string, with a comma.”,“1”,“2”,“3”…
“Another “string with quotes””,“4”,“5”,“6”…
Here the imported strings are shown in the context of additional columns (“1”,“2”,“3”, etc.) to demonstrate the full effect of enclosing and escaping. The enclosing character is only strictly necessary when delimiter characters appear in the imported text. The enclosing character can therefore be specified as optional:
$ sqoop import --optionally-enclosed-by ‘"’ (the rest as above)…
Which would result in the following import:
“Some string, with a comma.”,1,2,3…
“Another “string with quotes””,4,5,6…
[笔记] Note
Even though Hive supports escaping characters, it does not handle escaping of new-line character. Also, it does not support the notion of enclosing characters that may include field delimiters in the enclosed string. It is therefore recommended that you choose unambiguous field and record-terminating delimiters without the help of escaping and enclosing characters when working with Hive; this is due to limitations of Hive’s input parsing abilities.
The --mysql-delimiters argument is a shorthand argument which uses the default delimiters for the mysqldump program. If you use the mysqldump delimiters in conjunction with a direct-mode import (with --direct), very fast imports can be achieved.
While the choice of delimiters is most important for a text-mode import, it is still relevant if you import to SequenceFiles with --as-sequencefile. The generated class’ toString() method will use the delimiters you specify, so subsequent formatting of the output data will rely on the delimiters you choose.
Table 7. Input parsing arguments:
Argument Description
–input-enclosed-by Sets a required field encloser
–input-escaped-by Sets the input escape character
–input-fields-terminated-by Sets the input field separator
–input-lines-terminated-by Sets the input end-of-line character
–input-optionally-enclosed-by Sets a field enclosing character
When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates when doing a delimited-format import. The delimiters are chosen with arguments such as --fields-terminated-by; this controls both how the data is written to disk, and how the generated parse() method reinterprets this data. The delimiters used by the parse() method can be chosen independently of the output arguments, by using --input-fields-terminated-by, and so on. This is useful, for example, to generate classes which can parse records created with one set of delimiters, and emit the records to a different set of files using a separate set of delimiters.
Table 8. Hive arguments:
Argument Description
–hive-home
7.2.13. Importing Data Into Hive
Sqoop’s import tool’s main function is to upload your data into files in HDFS. If you have a Hive metastore associated with your HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data’s layout in Hive. Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line.
If the Hive table already exists, you can specify the --hive-overwrite option to indicate that existing table in hive must be replaced. After your data is imported into HDFS or this step is omitted, Sqoop will generate a Hive script containing a CREATE TABLE operation defining your columns using Hive’s types, and a LOAD DATA INPATH statement to move the data files into Hive’s warehouse directory.
The script will be executed by calling the installed copy of hive on the machine where Sqoop is run. If you have multiple Hive installations, or hive is not in your $PATH, use the --hive-home option to identify the Hive installation directory. Sqoop will use $HIVE_HOME/bin/hive from here.
[笔记] Note
This function is incompatible with --as-avrodatafile and --as-sequencefile.
Even though Hive supports escaping characters, it does not handle escaping of new-line character. Also, it does not support the notion of enclosing characters that may include field delimiters in the enclosed string. It is therefore recommended that you choose unambiguous field and record-terminating delimiters without the help of escaping and enclosing characters when working with Hive; this is due to limitations of Hive’s input parsing abilities. If you do use --escaped-by, --enclosed-by, or --optionally-enclosed-by when importing data into Hive, Sqoop will print a warning message.
Hive will have problems using Sqoop-imported data if your database’s rows contain string fields that have Hive’s default row delimiters (\n and \r characters) or column delimiters (\01 characters) present in them. You can use the --hive-drop-import-delims option to drop those characters on import to give Hive-compatible text data. Alternatively, you can use the --hive-delims-replacement option to replace those characters with a user-defined string on import to give Hive-compatible text data. These options should only be used if you use Hive’s default delimiters and should not be used if different delimiters are specified.
Sqoop will pass the field and record delimiters through to Hive. If you do not set any delimiters and do use --hive-import, the field delimiter will be set to ^A and the record delimiter will be set to \n to be consistent with Hive’s defaults.
Sqoop will by default import NULL values as string null. Hive is however using string \N to denote NULL values and therefore predicates dealing with NULL (like IS NULL) will not work correctly. You should append parameters --null-string and --null-non-string in case of import job or --input-null-string and --input-null-non-string in case of an export job if you wish to properly preserve NULL values. Because sqoop is using those parameters in generated code, you need to properly escape value \N to \N:
$ sqoop import … --null-string ‘\N’ --null-non-string ‘\N’
The table name used in Hive is, by default, the same as that of the source table. You can control the output table name with the --hive-table option.
Hive can put data into partitions for more efficient query performance. You can tell a Sqoop job to import data for Hive into a particular partition by specifying the --hive-partition-key and --hive-partition-value arguments. The partition value must be a string. Please see the Hive documentation for more details on partitioning.
You can import compressed tables into Hive using the --compress and --compression-codec options. One downside to compressing tables imported into Hive is that many codecs cannot be split for processing by parallel map tasks. The lzop codec, however, does support splitting. When importing tables with this codec, Sqoop will automatically index the files for splitting and configuring a new Hive table with the correct InputFormat. This feature currently requires that all partitions of a table be compressed with the lzop codec.
Table 9. HBase arguments:
Argument Description
–column-family Sets the target column family for the import
–hbase-create-table If specified, create missing HBase tables
–hbase-row-key Specifies which input column to use as the row key
In case, if input table contains composite
key, then must be in the form of a
comma-separated list of composite key
attributes
–hbase-table Specifies an HBase table to use as the target instead of HDFS
–hbase-bulkload Enables bulk loading
7.2.14. Importing Data Into HBase
Sqoop supports additional import targets beyond HDFS and Hive. Sqoop can also import records into a table in HBase.
By specifying --hbase-table, you instruct Sqoop to import to a table in HBase rather than a directory in HDFS. Sqoop will import data to the table specified as the argument to --hbase-table. Each row of the input table will be transformed into an HBase Put operation to a row of the output table. The key for each row is taken from a column of the input. By default Sqoop will use the split-by column as the row key column. If that is not specified, it will try to identify the primary key column, if any, of the source table. You can manually specify the row key column with --hbase-row-key. Each output column will be placed in the same column family, which must be specified with --column-family.
[笔记] Note
This function is incompatible with direct import (parameter --direct).
If the input table has composite key, the --hbase-row-key must be in the form of a comma-separated list of composite key attributes. In this case, the row key for HBase row will be generated by combining values of composite key attributes using underscore as a separator. NOTE: Sqoop import for a table with composite key will work only if parameter --hbase-row-key has been specified.
If the target table and column family do not exist, the Sqoop job will exit with an error. You should create the target table and column family before running an import. If you specify --hbase-create-table, Sqoop will create the target table and column family if they do not exist, using the default parameters from your HBase configuration.
Sqoop currently serializes all values to HBase by converting each field to its string representation (as if you were importing to HDFS in text mode), and then inserts the UTF-8 bytes of this string in the target cell. Sqoop will skip all rows containing null values in all columns except the row key column.
To decrease the load on hbase, Sqoop can do bulk loading as opposed to direct writes. To use bulk loading, enable it using --hbase-bulkload.
Table 10. Accumulo arguments:
Argument Description
–accumulo-table Specifies an Accumulo table to use as the target instead of HDFS
–accumulo-column-family Sets the target column family for the import
–accumulo-create-table If specified, create missing Accumulo tables
–accumulo-row-key Specifies which input column to use as the row key
–accumulo-visibility (Optional) Specifies a visibility token to apply to all rows inserted into Accumulo. Default is the empty string.
–accumulo-batch-size (Optional) Sets the size in bytes of Accumulo’s write buffer. Default is 4MB.
–accumulo-max-latency (Optional) Sets the max latency in milliseconds for the Accumulo batch writer. Default is 0.
–accumulo-zookeepers host:port Comma-separated list of Zookeeper servers used by the Accumulo instance
–accumulo-instance Name of the target Accumulo instance
–accumulo-user Name of the Accumulo user to import as
–accumulo-password Password for the Accumulo user
7.2.15. Importing Data Into Accumulo
Sqoop supports importing records into a table in Accumulo
By specifying --accumulo-table, you instruct Sqoop to import to a table in Accumulo rather than a directory in HDFS. Sqoop will import data to the table specified as the argument to --accumulo-table. Each row of the input table will be transformed into an Accumulo Mutation operation to a row of the output table. The key for each row is taken from a column of the input. By default Sqoop will use the split-by column as the row key column. If that is not specified, it will try to identify the primary key column, if any, of the source table. You can manually specify the row key column with --accumulo-row-key. Each output column will be placed in the same column family, which must be specified with --accumulo-column-family.
[笔记] Note
This function is incompatible with direct import (parameter --direct), and cannot be used in the same operation as an HBase import.
If the target table does not exist, the Sqoop job will exit with an error, unless the --accumulo-create-table parameter is specified. Otherwise, you should create the target table before running an import.
Sqoop currently serializes all values to Accumulo by converting each field to its string representation (as if you were importing to HDFS in text mode), and then inserts the UTF-8 bytes of this string in the target cell.
By default, no visibility is applied to the resulting cells in Accumulo, so the data will be visible to any Accumulo user. Use the --accumulo-visibility parameter to specify a visibility token to apply to all rows in the import job.
For performance tuning, use the optional --accumulo-buffer-size\ and --accumulo-max-latency parameters. See Accumulo’s documentation for an explanation of the effects of these parameters.
In order to connect to an Accumulo instance, you must specify the location of a Zookeeper ensemble using the --accumulo-zookeepers parameter, the name of the Accumulo instance (–accumulo-instance), and the username and password to connect with (–accumulo-user and --accumulo-password respectively).
Table 11. Code generation arguments:
Argument Description
–bindir
As mentioned earlier, a byproduct of importing a table to HDFS is a class which can manipulate the imported data. If the data is stored in SequenceFiles, this class will be used for the data’s serialization container. Therefore, you should use this class in your subsequent MapReduce processing of the data.
The class is typically named after the table; a table named foo will generate a class named foo. You may want to override this class name. For example, if your table is named EMPLOYEES, you may want to specify --class-name Employee instead. Similarly, you can specify just the package name with --package-name. The following import generates a class named com.foocorp.SomeTable:
$ sqoop import --connect --table SomeTable --package-name com.foocorp
The .java source file for your class will be written to the current working directory when you run sqoop. You can control the output directory with --outdir. For example, --outdir src/generated/.
The import process compiles the source into .class and .jar files; these are ordinarily stored under /tmp. You can select an alternate target directory with --bindir. For example, --bindir /scratch.
If you already have a compiled class that can be used to perform the import and want to suppress the code-generation aspect of the import process, you can use an existing jar and class by providing the --jar-file and --class-name options. For example:
$ sqoop import --table SomeTable --jar-file mydatatypes.jar
–class-name SomeTableType
This command will load the SomeTableType class out of mydatatypes.jar.
7.2.16. Additional Import Configuration Properties
There are some additional properties which can be configured by modifying conf/sqoop-site.xml. Properties can be specified the same as in Hadoop configuration files, for example:
sqoop import -D property.name=property.value …
Table 12. Additional import configuration properties:
Argument Description
sqoop.bigdecimal.format.string Controls how BigDecimal columns will formatted when stored as a String. A value of true (default) will use toPlainString to store them without an exponent component (0.0000001); while a value of false will use toString which may include an exponent (1E-7)
sqoop.hbase.add.row.key When set to false (default), Sqoop will not add the column used as a row key into the row data in HBase. When set to true, the column used as a row key will be added to the row data in HBase.
7.3. Example Invocations
The following examples illustrate how to use the import tool in a variety of situations.
A basic import of a table named EMPLOYEES in the corp database:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
A basic import requiring a login:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
–username SomeUser -P
Enter password: (hidden)
Selecting specific columns from the EMPLOYEES table:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
–columns “employee_id,first_name,last_name,job_title”
Controlling the import parallelism (using 8 parallel tasks):
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
-m 8
Storing data in SequenceFiles, and setting the generated class name to com.foocorp.Employee:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
–class-name com.foocorp.Employee --as-sequencefile
Specifying the delimiters to use in a text-mode import:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
–fields-terminated-by ‘\t’ --lines-terminated-by ‘\n’
–optionally-enclosed-by ‘"’
Importing the data to Hive:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
–hive-import
Importing only new employees:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
–where “start_date > ‘2010-01-01’”
Changing the splitting column from the default:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
–split-by dept_id
Verifying that an import was successful:
$ hadoop fs -ls EMPLOYEES
Found 5 items
drwxr-xr-x - someuser somegrp 0 2010-04-27 16:40 /user/someuser/EMPLOYEES/_logs
-rw-r–r-- 1 someuser somegrp 2913511 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00000
-rw-r–r-- 1 someuser somegrp 1683938 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00001
-rw-r–r-- 1 someuser somegrp 7245839 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00002
-rw-r–r-- 1 someuser somegrp 7842523 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00003
$ hadoop fs -cat EMPLOYEES/part-m-00000 | head -n 10
0,joe,smith,engineering
1,jane,doe,marketing
…
Performing an incremental import of new data, after having already imported the first 100,000 rows of a table:
$ sqoop import --connect jdbc:mysql://db.foo.com/somedb --table sometable
–where “id > 100000” --target-dir /incremental_dataset --append
An import of a table named EMPLOYEES in the corp database that uses validation to validate the import using the table row count and number of rows copied into HDFS: More Details
$ sqoop import --connect jdbc:mysql://db.foo.com/corp
–table EMPLOYEES --validate
8. sqoop-import-all-tables
8.1. Purpose
8.2. Syntax
8.3. Example Invocations
8.1. Purpose
The import-all-tables tool imports a set of tables from an RDBMS to HDFS. Data from each table is stored in a separate directory in HDFS.
For the import-all-tables tool to be useful, the following conditions must be met:
Each table must have a single-column primary key or --autoreset-to-one-mapper option must be used.
You must intend to import all columns of each table.
You must not intend to use non-default splitting column, nor impose any conditions via a WHERE clause.
8.2. Syntax
$ sqoop import-all-tables (generic-args) (import-args)
$ sqoop-import-all-tables (generic-args) (import-args)
Although the Hadoop generic arguments must preceed any import arguments, the import arguments can be entered in any order with respect to one another.
Table 13. Common arguments
Argument Description
–connect Specify JDBC connect string
–connection-manager Specify connection manager class to use
–driver Manually specify JDBC driver class to use
–hadoop-mapred-home
Table 14. Import control arguments:
Argument Description
–as-avrodatafile Imports data to Avro Data Files
–as-sequencefile Imports data to SequenceFiles
–as-textfile Imports data as plain text (default)
–as-parquetfile Imports data to Parquet Files
–direct Use direct import fast path
–inline-lob-limit Set the maximum size for an inline LOB
-m,–num-mappers Use n map tasks to import in parallel
–warehouse-dir
These arguments behave in the same manner as they do when used for the sqoop-import tool, but the --table, --split-by, --columns, and --where arguments are invalid for sqoop-import-all-tables. The --exclude-tables argument is for +sqoop-import-all-tables only.
Table 15. Output line formatting arguments:
Argument Description
–enclosed-by Sets a required field enclosing character
–escaped-by Sets the escape character
–fields-terminated-by Sets the field separator character
–lines-terminated-by Sets the end-of-line character
–mysql-delimiters Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: ’
–optionally-enclosed-by Sets a field enclosing character
Table 16. Input parsing arguments:
Argument Description
–input-enclosed-by Sets a required field encloser
–input-escaped-by Sets the input escape character
–input-fields-terminated-by Sets the input field separator
–input-lines-terminated-by Sets the input end-of-line character
–input-optionally-enclosed-by Sets a field enclosing character
Table 17. Hive arguments:
Argument Description
–hive-home
Table 18. Code generation arguments:
Argument Description
–bindir
The import-all-tables tool does not support the --class-name argument. You may, however, specify a package with --package-name in which all generated classes will be placed.
8.3. Example Invocations
Import all tables from the corp database:
$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp
Verifying that it worked:
$ hadoop fs -ls
Found 4 items
drwxr-xr-x - someuser somegrp 0 2010-04-27 17:15 /user/someuser/EMPLOYEES
drwxr-xr-x - someuser somegrp 0 2010-04-27 17:15 /user/someuser/PAYCHECKS
drwxr-xr-x - someuser somegrp 0 2010-04-27 17:15 /user/someuser/DEPARTMENTS
drwxr-xr-x - someuser somegrp 0 2010-04-27 17:15 /user/someuser/OFFICE_SUPPLIES
9. sqoop-import-mainframe
9.1. Purpose
9.2. Syntax
9.2.1. Connecting to a Mainframe
9.2.2. Selecting the Files to Import
9.2.3. Controlling Parallelism
9.2.4. Controlling Distributed Cache
9.2.5. Controlling the Import Process
9.2.6. File Formats
9.2.7. Importing Data Into Hive
9.2.8. Importing Data Into HBase
9.2.9. Importing Data Into Accumulo
9.2.10. Additional Import Configuration Properties
9.3. Example Invocations
9.1. Purpose
The import-mainframe tool imports all sequential datasets in a partitioned dataset(PDS) on a mainframe to HDFS. A PDS is akin to a directory on the open systems. The records in a dataset can contain only character data. Records will be stored with the entire record as a single text field.
9.2. Syntax
9.2.1. Connecting to a Mainframe
9.2.2. Selecting the Files to Import
9.2.3. Controlling Parallelism
9.2.4. Controlling Distributed Cache
9.2.5. Controlling the Import Process
9.2.6. File Formats
9.2.7. Importing Data Into Hive
9.2.8. Importing Data Into HBase
9.2.9. Importing Data Into Accumulo
9.2.10. Additional Import Configuration Properties
$ sqoop import-mainframe (generic-args) (import-args)
$ sqoop-import-mainframe (generic-args) (import-args)
While the Hadoop generic arguments must precede any import arguments, you can type the import arguments in any order with respect to one another.
Table 19. Common arguments
Argument Description
–connect Specify mainframe host to connect
–connection-manager Specify connection manager class to use
–hadoop-mapred-home
9.2.1. Connecting to a Mainframe
Sqoop is designed to import mainframe datasets into HDFS. To do so, you must specify a mainframe host name in the Sqoop --connect argument.
$ sqoop import-mainframe --connect z390
This will connect to the mainframe host z390 via ftp.
You might need to authenticate against the mainframe host to access it. You can use the --username to supply a username to the mainframe. Sqoop provides couple of different ways to supply a password, secure and non-secure, to the mainframe which is detailed below.
Secure way of supplying password to the mainframe. You should save the password in a file on the users home directory with 400 permissions and specify the path to that file using the --password-file argument, and is the preferred method of entering credentials. Sqoop will then read the password from the file and pass it to the MapReduce cluster using secure means with out exposing the password in the job configuration. The file containing the password can either be on the Local FS or HDFS.
Example:
$ sqoop import-mainframe --connect z390
–username david --password-file ${user.home}/.password
Another way of supplying passwords is using the -P argument which will read a password from a console prompt.
[警告] Warning
The --password parameter is insecure, as other users may be able to read your password from the command-line arguments via the output of programs such as ps. The -P argument is the preferred method over using the --password argument. Credentials may still be transferred between nodes of the MapReduce cluster using insecure means.
Example:
$ sqoop import-mainframe --connect z390 --username david --password 12345
Table 20. Import control arguments:
Argument Description
–as-avrodatafile Imports data to Avro Data Files
–as-sequencefile Imports data to SequenceFiles
–as-textfile Imports data as plain text (default)
–as-parquetfile Imports data to Parquet Files
–delete-target-dir Delete the import target directory if it exists
-m,–num-mappers Use n map tasks to import in parallel
–target-dir
9.2.2. Selecting the Files to Import
You can use the --dataset argument to specify a partitioned dataset name. All sequential datasets in the partitioned dataset will be imported.
9.2.3. Controlling Parallelism
Sqoop imports data in parallel by making multiple ftp connections to the mainframe to transfer multiple files simultaneously. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. You can adjust this value to maximize the data transfer rate from the mainframe.
9.2.4. Controlling Distributed Cache
Sqoop will copy the jars in $SQOOP_HOME/lib folder to job cache every time when start a Sqoop job. When launched by Oozie this is unnecessary since Oozie use its own Sqoop share lib which keeps Sqoop dependencies in the distributed cache. Oozie will do the localization on each worker node for the Sqoop dependencies only once during the first Sqoop job and reuse the jars on worker node for subsquencial jobs. Using option --skip-dist-cache in Sqoop command when launched by Oozie will skip the step which Sqoop copies its dependencies to job cache and save massive I/O.
9.2.5. Controlling the Import Process
By default, Sqoop will import all sequential files in a partitioned dataset pds to a directory named pds inside your home directory in HDFS. For example, if your username is someuser, then the import tool will write to /user/someuser/pds/(files). You can adjust the parent directory of the import with the --warehouse-dir argument. For example:
$ sqoop import-mainframe --connnect --dataset foo --warehouse-dir /shared
…
This command would write to a set of files in the /shared/pds/ directory.
You can also explicitly choose the target directory, like so:
$ sqoop import-mainframe --connnect --dataset foo --target-dir /dest
…
This will import the files into the /dest directory. --target-dir is incompatible with --warehouse-dir.
By default, imports go to a new target location. If the destination directory already exists in HDFS, Sqoop will refuse to import and overwrite that directory’s contents.
9.2.6. File Formats
By default, each record in a dataset is stored as a text record with a newline at the end. Each record is assumed to contain a single text field with the name DEFAULT_COLUMN. When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates.
You can also import mainframe records to Sequence, Avro, or Parquet files.
By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument.
Table 21. Output line formatting arguments:
Argument Description
–enclosed-by Sets a required field enclosing character
–escaped-by Sets the escape character
–fields-terminated-by Sets the field separator character
–lines-terminated-by Sets the end-of-line character
–mysql-delimiters Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: ’
–optionally-enclosed-by Sets a field enclosing character
Since mainframe record contains only one field, importing to delimited files will not contain any field delimiter. However, the field may be enclosed with enclosing character or escaped by an escaping character.
Table 22. Input parsing arguments:
Argument Description
–input-enclosed-by Sets a required field encloser
–input-escaped-by Sets the input escape character
–input-fields-terminated-by Sets the input field separator
–input-lines-terminated-by Sets the input end-of-line character
–input-optionally-enclosed-by Sets a field enclosing character
When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates when doing a delimited-format import. The delimiters are chosen with arguments such as --fields-terminated-by; this controls both how the data is written to disk, and how the generated parse() method reinterprets this data. The delimiters used by the parse() method can be chosen independently of the output arguments, by using --input-fields-terminated-by, and so on. This is useful, for example, to generate classes which can parse records created with one set of delimiters, and emit the records to a different set of files using a separate set of delimiters.
Table 23. Hive arguments:
Argument Description
–hive-home
9.2.7. Importing Data Into Hive
Sqoop’s import tool’s main function is to upload your data into files in HDFS. If you have a Hive metastore associated with your HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data’s layout in Hive. Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line.
If the Hive table already exists, you can specify the --hive-overwrite option to indicate that existing table in hive must be replaced. After your data is imported into HDFS or this step is omitted, Sqoop will generate a Hive script containing a CREATE TABLE operation defining your columns using Hive’s types, and a LOAD DATA INPATH statement to move the data files into Hive’s warehouse directory.
The script will be executed by calling the installed copy of hive on the machine where Sqoop is run. If you have multiple Hive installations, or hive is not in your $PATH, use the --hive-home option to identify the Hive installation directory. Sqoop will use $HIVE_HOME/bin/hive from here.
[笔记] Note
This function is incompatible with --as-avrodatafile and --as-sequencefile.
Even though Hive supports escaping characters, it does not handle escaping of new-line character. Also, it does not support the notion of enclosing characters that may include field delimiters in the enclosed string. It is therefore recommended that you choose unambiguous field and record-terminating delimiters without the help of escaping and enclosing characters when working with Hive; this is due to limitations of Hive’s input parsing abilities. If you do use --escaped-by, --enclosed-by, or --optionally-enclosed-by when importing data into Hive, Sqoop will print a warning message.
Hive will have problems using Sqoop-imported data if your database’s rows contain string fields that have Hive’s default row delimiters (\n and \r characters) or column delimiters (\01 characters) present in them. You can use the --hive-drop-import-delims option to drop those characters on import to give Hive-compatible text data. Alternatively, you can use the --hive-delims-replacement option to replace those characters with a user-defined string on import to give Hive-compatible text data. These options should only be used if you use Hive’s default delimiters and should not be used if different delimiters are specified.
Sqoop will pass the field and record delimiters through to Hive. If you do not set any delimiters and do use --hive-import, the field delimiter will be set to ^A and the record delimiter will be set to \n to be consistent with Hive’s defaults.
Sqoop will by default import NULL values as string null. Hive is however using string \N to denote NULL values and therefore predicates dealing with NULL (like IS NULL) will not work correctly. You should append parameters --null-string and --null-non-string in case of import job or --input-null-string and --input-null-non-string in case of an export job if you wish to properly preserve NULL values. Because sqoop is using those parameters in generated code, you need to properly escape value \N to \N:
$ sqoop import … --null-string ‘\N’ --null-non-string ‘\N’
The table name used in Hive is, by default, the same as that of the source table. You can control the output table name with the --hive-table option.
Hive can put data into partitions for more efficient query performance. You can tell a Sqoop job to import data for Hive into a particular partition by specifying the --hive-partition-key and --hive-partition-value arguments. The partition value must be a string. Please see the Hive documentation for more details on partitioning.
You can import compressed tables into Hive using the --compress and --compression-codec options. One downside to compressing tables imported into Hive is that many codecs cannot be split for processing by parallel map tasks. The lzop codec, however, does support splitting. When importing tables with this codec, Sqoop will automatically index the files for splitting and configuring a new Hive table with the correct InputFormat. This feature currently requires that all partitions of a table be compressed with the lzop codec.
Table 24. HBase arguments:
Argument Description
–column-family Sets the target column family for the import
–hbase-create-table If specified, create missing HBase tables
–hbase-row-key Specifies which input column to use as the row key
In case, if input table contains composite
key, then must be in the form of a
comma-separated list of composite key
attributes
–hbase-table Specifies an HBase table to use as the target instead of HDFS
–hbase-bulkload Enables bulk loading
9.2.8. Importing Data Into HBase
Sqoop supports additional import targets beyond HDFS and Hive. Sqoop can also import records into a table in HBase.
By specifying --hbase-table, you instruct Sqoop to import to a table in HBase rather than a directory in HDFS. Sqoop will import data to the table specified as the argument to --hbase-table. Each row of the input table will be transformed into an HBase Put operation to a row of the output table. The key for each row is taken from a column of the input. By default Sqoop will use the split-by column as the row key column. If that is not specified, it will try to identify the primary key column, if any, of the source table. You can manually specify the row key column with --hbase-row-key. Each output column will be placed in the same column family, which must be specified with --column-family.
[笔记] Note
This function is incompatible with direct import (parameter --direct).
If the input table has composite key, the --hbase-row-key must be in the form of a comma-separated list of composite key attributes. In this case, the row key for HBase row will be generated by combining values of composite key attributes using underscore as a separator. NOTE: Sqoop import for a table with composite key will work only if parameter --hbase-row-key has been specified.
If the target table and column family do not exist, the Sqoop job will exit with an error. You should create the target table and column family before running an import. If you specify --hbase-create-table, Sqoop will create the target table and column family if they do not exist, using the default parameters from your HBase configuration.
Sqoop currently serializes all values to HBase by converting each field to its string representation (as if you were importing to HDFS in text mode), and then inserts the UTF-8 bytes of this string in the target cell. Sqoop will skip all rows containing null values in all columns except the row key column.
To decrease the load on hbase, Sqoop can do bulk loading as opposed to direct writes. To use bulk loading, enable it using --hbase-bulkload.
Table 25. Accumulo arguments:
Argument Description
–accumulo-table Specifies an Accumulo table to use as the target instead of HDFS
–accumulo-column-family Sets the target column family for the import
–accumulo-create-table If specified, create missing Accumulo tables
–accumulo-row-key Specifies which input column to use as the row key
–accumulo-visibility (Optional) Specifies a visibility token to apply to all rows inserted into Accumulo. Default is the empty string.
–accumulo-batch-size (Optional) Sets the size in bytes of Accumulo’s write buffer. Default is 4MB.
–accumulo-max-latency (Optional) Sets the max latency in milliseconds for the Accumulo batch writer. Default is 0.
–accumulo-zookeepers host:port Comma-separated list of Zookeeper servers used by the Accumulo instance
–accumulo-instance Name of the target Accumulo instance
–accumulo-user Name of the Accumulo user to import as
–accumulo-password Password for the Accumulo user
9.2.9. Importing Data Into Accumulo
Sqoop supports importing records into a table in Accumulo
By specifying --accumulo-table, you instruct Sqoop to import to a table in Accumulo rather than a directory in HDFS. Sqoop will import data to the table specified as the argument to --accumulo-table. Each row of the input table will be transformed into an Accumulo Mutation operation to a row of the output table. The key for each row is taken from a column of the input. By default Sqoop will use the split-by column as the row key column. If that is not specified, it will try to identify the primary key column, if any, of the source table. You can manually specify the row key column with --accumulo-row-key. Each output column will be placed in the same column family, which must be specified with --accumulo-column-family.
[笔记] Note
This function is incompatible with direct import (parameter --direct), and cannot be used in the same operation as an HBase import.
If the target table does not exist, the Sqoop job will exit with an error, unless the --accumulo-create-table parameter is specified. Otherwise, you should create the target table before running an import.
Sqoop currently serializes all values to Accumulo by converting each field to its string representation (as if you were importing to HDFS in text mode), and then inserts the UTF-8 bytes of this string in the target cell.
By default, no visibility is applied to the resulting cells in Accumulo, so the data will be visible to any Accumulo user. Use the --accumulo-visibility parameter to specify a visibility token to apply to all rows in the import job.
For performance tuning, use the optional --accumulo-buffer-size\ and --accumulo-max-latency parameters. See Accumulo’s documentation for an explanation of the effects of these parameters.
In order to connect to an Accumulo instance, you must specify the location of a Zookeeper ensemble using the --accumulo-zookeepers parameter, the name of the Accumulo instance (–accumulo-instance), and the username and password to connect with (–accumulo-user and --accumulo-password respectively).
Table 26. Code generation arguments:
Argument Description
–bindir
As mentioned earlier, a byproduct of importing a table to HDFS is a class which can manipulate the imported data. You should use this class in your subsequent MapReduce processing of the data.
The class is typically named after the partitioned dataset name; a partitioned dataset named foo will generate a class named foo. You may want to override this class name. For example, if your partitioned dataset is named EMPLOYEES, you may want to specify --class-name Employee instead. Similarly, you can specify just the package name with --package-name. The following import generates a class named com.foocorp.SomePDS:
$ sqoop import-mainframe --connect --dataset SomePDS --package-name com.foocorp
The .java source file for your class will be written to the current working directory when you run sqoop. You can control the output directory with --outdir. For example, --outdir src/generated/.
The import process compiles the source into .class and .jar files; these are ordinarily stored under /tmp. You can select an alternate target directory with --bindir. For example, --bindir /scratch.
If you already have a compiled class that can be used to perform the import and want to suppress the code-generation aspect of the import process, you can use an existing jar and class by providing the --jar-file and --class-name options. For example:
$ sqoop import-mainframe --dataset SomePDS --jar-file mydatatypes.jar
–class-name SomePDSType
This command will load the SomePDSType class out of mydatatypes.jar.
9.2.10. Additional Import Configuration Properties
There are some additional properties which can be configured by modifying conf/sqoop-site.xml. Properties can be specified the same as in Hadoop configuration files, for example:
sqoop import -D property.name=property.value …
9.3. Example Invocations
The following examples illustrate how to use the import tool in a variety of situations.
A basic import of all sequential files in a partitioned dataset named EMPLOYEES in the mainframe host z390:
$ sqoop import-mainframe --connect z390 --dataset EMPLOYEES
–username SomeUser -P
Enter password: (hidden)
Controlling the import parallelism (using 8 parallel tasks):
$ sqoop import-mainframe --connect z390 --dataset EMPLOYEES
–username SomeUser --password-file mypassword -m 8
Importing the data to Hive:
$ sqoop import-mainframe --connect z390 --dataset EMPLOYEES
–hive-import
10. sqoop-export
10.1. Purpose
10.2. Syntax
10.3. Inserts vs. Updates
10.4. Exports and Transactions
10.5. Failed Exports
10.6. Example Invocations
10.1. Purpose
The export tool exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters.
The default operation is to transform these into a set of INSERT statements that inject the records into the database. In “update mode,” Sqoop will generate UPDATE statements that replace existing records in the database, and in “call mode” Sqoop will make a stored procedure call for each record.
10.2. Syntax
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
Although the Hadoop generic arguments must preceed any export arguments, the export arguments can be entered in any order with respect to one another.
Table 27. Common arguments
Argument Description
–connect Specify JDBC connect string
–connection-manager Specify connection manager class to use
–driver Manually specify JDBC driver class to use
–hadoop-mapred-home
Table 28. Validation arguments More Details
Argument Description
–validate Enable validation of data copied, supports single table copy only.
–validator Specify validator class to use.
–validation-threshold Specify validation threshold class to use.
–validation-failurehandler Specify validation failure handler class to use.
Table 29. Export control arguments:
Argument Description
–columns
The --export-dir argument and one of --table or --call are required. These specify the table to populate in the database (or the stored procedure to call), and the directory in HDFS that contains the source data.
By default, all columns within a table are selected for export. You can select a subset of columns and control their ordering by using the --columns argument. This should include a comma-delimited list of columns to export. For example: --columns “col1,col2,col3”. Note that columns that are not included in the --columns parameter need to have either defined default value or allow NULL values. Otherwise your database will reject the imported data which in turn will make Sqoop job fail.
You can control the number of mappers independently from the number of files present in the directory. Export performance depends on the degree of parallelism. By default, Sqoop will use four tasks in parallel for the export process. This may not be optimal; you will need to experiment with your own particular setup. Additional tasks may offer better concurrency, but if the database is already bottlenecked on updating indices, invoking triggers, and so on, then additional load may decrease performance. The --num-mappers or -m arguments control the number of map tasks, which is the degree of parallelism used.
Some databases provides a direct mode for exports as well. Use the --direct argument to specify this codepath. This may be higher-performance than the standard JDBC codepath. Details about use of direct mode with each specific RDBMS, installation requirements, available options and limitations can be found in Section 25, “Notes for specific connectors”.
The --input-null-string and --input-null-non-string arguments are optional. If --input-null-string is not specified, then the string “null” will be interpreted as null for string-type columns. If --input-null-non-string is not specified, then both the string “null” and the empty string will be interpreted as null for non-string columns. Note that, the empty string will be always interpreted as null for non-string columns, in addition to other string if specified by --input-null-non-string.
Since Sqoop breaks down export process into multiple transactions, it is possible that a failed export job may result in partial data being committed to the database. This can further lead to subsequent jobs failing due to insert collisions in some cases, or lead to duplicated data in others. You can overcome this problem by specifying a staging table via the --staging-table option which acts as an auxiliary table that is used to stage exported data. The staged data is finally moved to the destination table in a single transaction.
In order to use the staging facility, you must create the staging table prior to running the export job. This table must be structurally identical to the target table. This table should either be empty before the export job runs, or the --clear-staging-table option must be specified. If the staging table contains data and the --clear-staging-table option is specified, Sqoop will delete all of the data before starting the export job.
[笔记] Note
Support for staging data prior to pushing it into the destination table is not always available for --direct exports. It is also not available when export is invoked using the --update-key option for updating existing data, and when stored procedures are used to insert the data. It is best to check the Section 25, “Notes for specific connectors” section to validate.
10.3. Inserts vs. Updates
By default, sqoop-export appends new rows to a table; each input record is transformed into an INSERT statement that adds a row to the target database table. If your table has constraints (e.g., a primary key column whose values must be unique) and already contains data, you must take care to avoid inserting records that violate these constraints. The export process will fail if an INSERT statement fails. This mode is primarily intended for exporting records to a new, empty table intended to receive these results.
If you specify the --update-key argument, Sqoop will instead modify an existing dataset in the database. Each input record is treated as an UPDATE statement that modifies an existing row. The row a statement modifies is determined by the column name(s) specified with --update-key. For example, consider the following table definition:
CREATE TABLE foo(
id INT NOT NULL PRIMARY KEY,
msg VARCHAR(32),
bar INT);
Consider also a dataset in HDFS containing records like these:
0,this is a test,42
1,some more data,100
…
Running sqoop-export --table foo --update-key id --export-dir /path/to/data --connect … will run an export job that executes SQL statements based on the data like so:
UPDATE foo SET msg=‘this is a test’, bar=42 WHERE id=0;
UPDATE foo SET msg=‘some more data’, bar=100 WHERE id=1;
…
If an UPDATE statement modifies no rows, this is not considered an error; the export will silently continue. (In effect, this means that an update-based export will not insert new rows into the database.) Likewise, if the column specified with --update-key does not uniquely identify rows and multiple rows are updated by a single statement, this condition is also undetected.
The argument --update-key can also be given a comma separated list of column names. In which case, Sqoop will match all keys from this list before updating any existing record.
Depending on the target database, you may also specify the --update-mode argument with allowinsert mode if you want to update rows if they exist in the database already or insert rows if they do not exist yet.
Table 30. Input parsing arguments:
Argument Description
–input-enclosed-by Sets a required field encloser
–input-escaped-by Sets the input escape character
–input-fields-terminated-by Sets the input field separator
–input-lines-terminated-by Sets the input end-of-line character
–input-optionally-enclosed-by Sets a field enclosing character
Table 31. Output line formatting arguments:
Argument Description
–enclosed-by Sets a required field enclosing character
–escaped-by Sets the escape character
–fields-terminated-by Sets the field separator character
–lines-terminated-by Sets the end-of-line character
–mysql-delimiters Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: ’
–optionally-enclosed-by Sets a field enclosing character
Sqoop automatically generates code to parse and interpret records of the files containing the data to be exported back to the database. If these files were created with non-default delimiters (comma-separated fields with newline-separated records), you should specify the same delimiters again so that Sqoop can parse your files.
If you specify incorrect delimiters, Sqoop will fail to find enough columns per line. This will cause export map tasks to fail by throwing ParseExceptions.
Table 32. Code generation arguments:
Argument Description
–bindir
If the records to be exported were generated as the result of a previous import, then the original generated class can be used to read the data back. Specifying --jar-file and --class-name obviate the need to specify delimiters in this case.
The use of existing generated code is incompatible with --update-key; an update-mode export requires new code generation to perform the update. You cannot use --jar-file, and must fully specify any non-default delimiters.
10.4. Exports and Transactions
Exports are performed by multiple writers in parallel. Each writer uses a separate connection to the database; these have separate transactions from one another. Sqoop uses the multi-row INSERT syntax to insert up to 100 records per statement. Every 100 statements, the current transaction within a writer task is committed, causing a commit every 10,000 rows. This ensures that transaction buffers do not grow without bound, and cause out-of-memory conditions. Therefore, an export is not an atomic process. Partial results from the export will become visible before the export is complete.
10.5. Failed Exports
Exports may fail for a number of reasons:
Loss of connectivity from the Hadoop cluster to the database (either due to hardware fault, or server software crashes)
Attempting to INSERT a row which violates a consistency constraint (for example, inserting a duplicate primary key value)
Attempting to parse an incomplete or malformed record from the HDFS source data
Attempting to parse records using incorrect delimiters
Capacity issues (such as insufficient RAM or disk space)
If an export map task fails due to these or other reasons, it will cause the export job to fail. The results of a failed export are undefined. Each export map task operates in a separate transaction. Furthermore, individual map tasks commit their current transaction periodically. If a task fails, the current transaction will be rolled back. Any previously-committed transactions will remain durable in the database, leading to a partially-complete export.
10.6. Example Invocations
A basic export to populate a table named bar:
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar
–export-dir /results/bar_data
This example takes the files in /results/bar_data and injects their contents in to the bar table in the foo database on db.example.com. The target table must already exist in the database. Sqoop performs a set of INSERT INTO operations, without regard for existing content. If Sqoop attempts to insert rows which violate constraints in the database (for example, a particular primary key value already exists), then the export fails.
Alternatively, you can specify the columns to be exported by providing --columns “col1,col2,col3”. Please note that columns that are not included in the --columns parameter need to have either defined default value or allow NULL values. Otherwise your database will reject the imported data which in turn will make Sqoop job fail.
Another basic export to populate a table named bar with validation enabled: More Details
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar
–export-dir /results/bar_data --validate
An export that calls a stored procedure named barproc for every record in /results/bar_data would look like:
$ sqoop export --connect jdbc:mysql://db.example.com/foo --call barproc
–export-dir /results/bar_data
11. validation
11.1. Purpose
11.2. Introduction
11.3. Syntax
11.4. Configuration
11.5. Limitations
11.6. Example Invocations
11.1. Purpose
Validate the data copied, either import or export by comparing the row counts from the source and the target post copy.
11.2. Introduction
There are 3 basic interfaces: ValidationThreshold - Determines if the error margin between the source and target are acceptable: Absolute, Percentage Tolerant, etc. Default implementation is AbsoluteValidationThreshold which ensures the row counts from source and targets are the same.
ValidationFailureHandler - Responsible for handling failures: log an error/warning, abort, etc. Default implementation is LogOnFailureHandler that logs a warning message to the configured logger.
Validator - Drives the validation logic by delegating the decision to ValidationThreshold and delegating failure handling to ValidationFailureHandler. The default implementation is RowCountValidator which validates the row counts from source and the target.
11.3. Syntax
$ sqoop import (generic-args) (import-args)
$ sqoop export (generic-args) (export-args)
Validation arguments are part of import and export arguments.
11.4. Configuration
The validation framework is extensible and pluggable. It comes with default implementations but the interfaces can be extended to allow custom implementations by passing them as part of the command line arguments as described below.
Validator.
Property: validator
Description: Driver for validation,
must implement org.apache.sqoop.validation.Validator
Supported values: The value has to be a fully qualified class name.
Default value: org.apache.sqoop.validation.RowCountValidator
Validation Threshold.
Property: validation-threshold
Description: Drives the decision based on the validation meeting the
threshold or not. Must implement
org.apache.sqoop.validation.ValidationThreshold
Supported values: The value has to be a fully qualified class name.
Default value: org.apache.sqoop.validation.AbsoluteValidationThreshold
Validation Failure Handler.
Property: validation-failurehandler
Description: Responsible for handling failures, must implement
org.apache.sqoop.validation.ValidationFailureHandler
Supported values: The value has to be a fully qualified class name.
Default value: org.apache.sqoop.validation.AbortOnFailureHandler
11.5. Limitations
Validation currently only validates data copied from a single table into HDFS. The following are the limitations in the current implementation:
all-tables option
free-form query option
Data imported into Hive, HBase or Accumulo
table import with --where argument
incremental imports
11.6. Example Invocations
A basic import of a table named EMPLOYEES in the corp database that uses validation to validate the row counts:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp
–table EMPLOYEES --validate
A basic export to populate a table named bar with validation enabled:
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar
–export-dir /results/bar_data --validate
Another example that overrides the validation args:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
–validate --validator org.apache.sqoop.validation.RowCountValidator
–validation-threshold
org.apache.sqoop.validation.AbsoluteValidationThreshold
–validation-failurehandler
org.apache.sqoop.validation.AbortOnFailureHandler
12. Saved Jobs
Imports and exports can be repeatedly performed by issuing the same command multiple times. Especially when using the incremental import capability, this is an expected scenario.
Sqoop allows you to define saved jobs which make this process easier. A saved job records the configuration information required to execute a Sqoop command at a later time. The section on the sqoop-job tool describes how to create and work with saved jobs.
By default, job descriptions are saved to a private repository stored in $HOME/.sqoop/. You can configure Sqoop to instead use a shared metastore, which makes saved jobs available to multiple users across a shared cluster. Starting the metastore is covered by the section on the sqoop-metastore tool.
If a saved job is configured to perform an incremental import, state regarding the most recently imported rows is updated in the saved job to allow the job to continually import only the newest rows.
13.2. Syntax
$ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
Although the Hadoop generic arguments must preceed any job arguments, the job arguments can be entered in any order with respect to one another.
Table 33. Job management options:
Argument Description
–create Define a new saved job with the specified job-id (name). A second Sqoop command-line, separated by a – should be specified; this defines the saved job.
–delete Delete a saved job.
–exec Given a job defined with --create, run the saved job.
–show Show the parameters for a saved job.
–list List all saved jobs
Creating saved jobs is done with the --create action. This operation requires a – followed by a tool name and its arguments. The tool and its arguments will form the basis of the saved job. Consider:
$ sqoop job --create myjob – import --connect jdbc:mysql://example.com/db
–table mytable
This creates a job named myjob which can be executed later. The job is not run. This job is now available in the list of saved jobs:
$ sqoop job --list
Available jobs:
myjob
We can inspect the configuration of a job with the show action:
direct.import = false
codegen.input.delimiters.record = 0
hdfs.append.dir = false
db.table = mytable
…
And if we are satisfied with it, we can run the job with exec:
$ sqoop job --exec myjob
10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation
…
The exec action allows you to override arguments of the saved job by supplying them after a --. For example, if the database were changed to require a username, we could specify the username and password with:
$ sqoop job --exec myjob – --username someuser -P
Enter password:
…
Table 34. Metastore connection options:
Argument Description
–meta-connect Specifies the JDBC connect string used to connect to the metastore
By default, a private metastore is instantiated in $HOME/.sqoop. If you have configured a hosted metastore with the sqoop-metastore tool, you can connect to it by specifying the --meta-connect argument. This is a JDBC connect string just like the ones used to connect to databases for import.
In conf/sqoop-site.xml, you can configure sqoop.metastore.client.autoconnect.url with this address, so you do not have to supply --meta-connect to use a remote metastore. This parameter can also be modified to move the private metastore to a location on your filesystem other than your home directory.
If you configure sqoop.metastore.client.enable.autoconnect with the value false, then you must explicitly supply --meta-connect.
Table 35. Common options:
Argument Description
–help Print usage instructions
–verbose Print more information while working
13.3. Saved jobs and passwords
The Sqoop metastore is not a secure resource. Multiple users can access its contents. For this reason, Sqoop does not store passwords in the metastore. If you create a job that requires a password, you will be prompted for that password each time you execute the job.
You can enable passwords in the metastore by setting sqoop.metastore.client.record.password to true in the configuration.
Note that you have to set sqoop.metastore.client.record.password to true if you are executing saved jobs via Oozie because Sqoop cannot prompt the user to enter passwords while being executed as Oozie tasks.
13.4. Saved jobs and incremental imports
Incremental imports are performed by comparing the values in a check column against a reference value for the most recent import. For example, if the --incremental append argument was specified, along with --check-column id and --last-value 100, all rows with id > 100 will be imported. If an incremental import is run from the command line, the value which should be specified as --last-value in a subsequent incremental import will be printed to the screen for your reference. If an incremental import is run from a saved job, this value will be retained in the saved job. Subsequent runs of sqoop job --exec someIncrementalJob will continue to import only newer rows than those previously imported.
Clients must be configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.
14.2. Syntax
$ sqoop metastore (generic-args) (metastore-args)
$ sqoop-metastore (generic-args) (metastore-args)
Although the Hadoop generic arguments must preceed any metastore arguments, the metastore arguments can be entered in any order with respect to one another.
Table 36. Metastore management options:
Argument Description
–shutdown Shuts down a running metastore instance on the same machine.
Running sqoop-metastore launches a shared HSQLDB database instance on the current machine. Clients can connect to this metastore and create jobs which can be shared between users for execution.
The location of the metastore’s files on disk is controlled by the sqoop.metastore.server.location property in conf/sqoop-site.xml. This should point to a directory on the local filesystem.
The metastore is available over TCP/IP. The port is controlled by the sqoop.metastore.server.port configuration parameter, and defaults to 16000.
Clients should connect to the metastore by specifying sqoop.metastore.client.autoconnect.url or --meta-connect with the value jdbc:hsqldb:hsql://:/sqoop. For example, jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop.
This metastore may be hosted on a machine within the Hadoop cluster, or elsewhere on the network.
15.2. Syntax
$ sqoop merge (generic-args) (merge-args)
$ sqoop-merge (generic-args) (merge-args)
Although the Hadoop generic arguments must preceed any merge arguments, the job arguments can be entered in any order with respect to one another.
Table 37. Merge options:
Argument Description
–class-name Specify the name of the record-specific class to use during the merge job.
–jar-file Specify the name of the jar to load the record class from.
–merge-key Specify the name of a column to use as the merge key.
–new-data Specify the path of the newer dataset.
–onto Specify the path of the older dataset.
–target-dir Specify the target path for the output of the merge job.
The merge tool runs a MapReduce job that takes two directories as input: a newer dataset, and an older one. These are specified with --new-data and --onto respectively. The output of the MapReduce job will be placed in the directory in HDFS specified by --target-dir.
When merging the datasets, it is assumed that there is a unique primary key value in each record. The column for the primary key is specified with --merge-key. Multiple rows in the same dataset should not have the same primary key, or else data loss may occur.
To parse the dataset and extract the key column, the auto-generated class from a previous import must be used. You should specify the class name and jar file with --class-name and --jar-file. If this is not availab,e you can recreate the class using the codegen tool.
The merge tool is typically run after an incremental import with the date-last-modified mode (sqoop import --incremental lastmodified …).
Supposing two incremental imports were performed, where some older data is in an HDFS directory named older and newer data is in an HDFS directory named newer, these could be merged like so:
$ sqoop merge --new-data newer --onto older --target-dir merged
–jar-file datatypes.jar --class-name Foo --merge-key id
This would run a MapReduce job where the value in the id column of each row is used to join rows; rows in the newer dataset will be used in preference to rows in the older dataset.
This can be used with both SequenceFile-, Avro- and text-based incremental imports. The file types of the newer and older datasets must be the same.
16.2. Syntax
$ sqoop codegen (generic-args) (codegen-args)
$ sqoop-codegen (generic-args) (codegen-args)
Although the Hadoop generic arguments must preceed any codegen arguments, the codegen arguments can be entered in any order with respect to one another.
Table 38. Common arguments
Argument Description
–connect Specify JDBC connect string
–connection-manager Specify connection manager class to use
–driver Manually specify JDBC driver class to use
–hadoop-mapred-home
Table 39. Code generation arguments:
Argument Description
–bindir
Table 40. Output line formatting arguments:
Argument Description
–enclosed-by Sets a required field enclosing character
–escaped-by Sets the escape character
–fields-terminated-by Sets the field separator character
–lines-terminated-by Sets the end-of-line character
–mysql-delimiters Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: ’
–optionally-enclosed-by Sets a field enclosing character
Table 41. Input parsing arguments:
Argument Description
–input-enclosed-by Sets a required field encloser
–input-escaped-by Sets the input escape character
–input-fields-terminated-by Sets the input field separator
–input-lines-terminated-by Sets the input end-of-line character
–input-optionally-enclosed-by Sets a field enclosing character
Table 42. Hive arguments:
Argument Description
–hive-home
If Hive arguments are provided to the code generation tool, Sqoop generates a file containing the HQL statements to create a table and load data.
16.3. Example Invocations
Recreate the record interpretation code for the employees table of a corporate database:
$ sqoop codegen --connect jdbc:mysql://db.example.com/corp
–table employees
17. sqoop-create-hive-table
17.1. Purpose
17.2. Syntax
17.3. Example Invocations
17.1. Purpose
The create-hive-table tool populates a Hive metastore with a definition for a table based on a database table previously imported to HDFS, or one planned to be imported. This effectively performs the “–hive-import” step of sqoop-import without running the preceeding import.
If data was already loaded to HDFS, you can use this tool to finish the pipeline of importing the data to Hive. You can also create Hive tables with this tool; data then can be imported and populated into the target after a preprocessing step run by the user.
17.2. Syntax
$ sqoop create-hive-table (generic-args) (create-hive-table-args)
$ sqoop-create-hive-table (generic-args) (create-hive-table-args)
Although the Hadoop generic arguments must preceed any create-hive-table arguments, the create-hive-table arguments can be entered in any order with respect to one another.
Table 43. Common arguments
Argument Description
–connect Specify JDBC connect string
–connection-manager Specify connection manager class to use
–driver Manually specify JDBC driver class to use
–hadoop-mapred-home
Table 44. Hive arguments:
Argument Description
–hive-home
Table 45. Output line formatting arguments:
Argument Description
–enclosed-by Sets a required field enclosing character
–escaped-by Sets the escape character
–fields-terminated-by Sets the field separator character
–lines-terminated-by Sets the end-of-line character
–mysql-delimiters Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: ’
–optionally-enclosed-by Sets a field enclosing character
Do not use enclosed-by or escaped-by delimiters with output formatting arguments used to import to Hive. Hive cannot currently parse them.
17.3. Example Invocations
Define in Hive a table named emps with a definition based on a database table named employees:
$ sqoop create-hive-table --connect jdbc:mysql://db.example.com/corp
–table employees --hive-table emps
18. sqoop-eval
18.1. Purpose
18.2. Syntax
18.3. Example Invocations
18.1. Purpose
The eval tool allows users to quickly run simple SQL queries against a database; results are printed to the console. This allows users to preview their import queries to ensure they import the data they expect.
[警告] Warning
The eval tool is provided for evaluation purpose only. You can use it to verify database connection from within the Sqoop or to test simple queries. It’s not suppose to be used in production workflows.
18.2. Syntax
$ sqoop eval (generic-args) (eval-args)
$ sqoop-eval (generic-args) (eval-args)
Although the Hadoop generic arguments must preceed any eval arguments, the eval arguments can be entered in any order with respect to one another.
Table 46. Common arguments
Argument Description
–connect Specify JDBC connect string
–connection-manager Specify connection manager class to use
–driver Manually specify JDBC driver class to use
–hadoop-mapred-home
Table 47. SQL evaluation arguments:
Argument Description
-e,–query Execute statement in SQL.
18.3. Example Invocations
Select ten records from the employees table:
$ sqoop eval --connect jdbc:mysql://db.example.com/corp
–query “SELECT * FROM employees LIMIT 10”
Insert a row into the foo table:
$ sqoop eval --connect jdbc:mysql://db.example.com/corp
-e “INSERT INTO foo VALUES(42, ‘bar’)”
19. sqoop-list-databases
19.1. Purpose
19.2. Syntax
19.3. Example Invocations
19.1. Purpose
List database schemas present on a server.
19.2. Syntax
$ sqoop list-databases (generic-args) (list-databases-args)
$ sqoop-list-databases (generic-args) (list-databases-args)
Although the Hadoop generic arguments must preceed any list-databases arguments, the list-databases arguments can be entered in any order with respect to one another.
Table 48. Common arguments
Argument Description
–connect Specify JDBC connect string
–connection-manager Specify connection manager class to use
–driver Manually specify JDBC driver class to use
–hadoop-mapred-home
19.3. Example Invocations
List database schemas available on a MySQL server:
$ sqoop list-databases --connect jdbc:mysql://database.example.com/
information_schema
employees
[笔记] Note
This only works with HSQLDB, MySQL and Oracle. When using with Oracle, it is necessary that the user connecting to the database has DBA privileges.
20.2. Syntax
$ sqoop list-tables (generic-args) (list-tables-args)
$ sqoop-list-tables (generic-args) (list-tables-args)
Although the Hadoop generic arguments must preceed any list-tables arguments, the list-tables arguments can be entered in any order with respect to one another.
Table 49. Common arguments
Argument Description
–connect Specify JDBC connect string
–connection-manager Specify connection manager class to use
–driver Manually specify JDBC driver class to use
–hadoop-mapred-home
20.3. Example Invocations
List tables available in the “corp” database:
$ sqoop list-tables --connect jdbc:mysql://database.example.com/corp
employees
payroll_checks
job_descriptions
office_supplies
In case of postgresql, list tables command with common arguments fetches only “public” schema. For custom schema, use --schema argument to list tables of particular schema Example
$ sqoop list-tables --connect jdbc:postgresql://localhost/corp --username name -P – --schema payrolldept
employees
expenses
21. sqoop-help
21.1. Purpose
21.2. Syntax
21.3. Example Invocations
21.1. Purpose
List tools available in Sqoop and explain their usage.
21.2. Syntax
$ sqoop help [tool-name]
$ sqoop-help [tool-name]
If no tool name is provided (for example, the user runs sqoop help), then the available tools are listed. With a tool name, the usage instructions for that specific tool are presented on the console.
21.3. Example Invocations
List available tools:
$ sqoop help
usage: sqoop COMMAND [ARGS]
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
…
See ‘sqoop help COMMAND’ for information on a specific command.
Display usage instructions for the import tool:
$ bin/sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]
Common arguments:
–connect Specify JDBC connect string
–connection-manager Specify connection manager class to use
–driver Manually specify JDBC driver class to use
–hadoop-mapred-home
Import control arguments:
–as-avrodatafile Imports data to Avro Data Files
–as-sequencefile Imports data to SequenceFiles
–as-textfile Imports data as plain text (default)
–as-parquetfile Imports data to Parquet Data Files
…
22. sqoop-version
22.1. Purpose
22.2. Syntax
22.3. Example Invocations
22.1. Purpose
Display version information for Sqoop.
22.2. Syntax
$ sqoop version
$ sqoop-version
22.3. Example Invocations
Display the version:
$ sqoop version
Sqoop {revnumber}
git commit id 46b3e06b79a8411320d77c984c3030db47dd1c22
Compiled by aaron@jargon on Mon May 17 13:43:22 PDT 2010
23. Sqoop-HCatalog Integration
23.1. HCatalog Background
23.2. Exposing HCatalog Tables to Sqoop
23.2.1. New Command Line Options
23.2.2. Supported Sqoop Hive Options
23.2.3. Direct Mode support
23.2.4. Unsupported Sqoop Options
23.2.4.1. Unsupported Sqoop Hive Import Options
23.2.4.2. Unsupported Sqoop Export and Import Options
23.2.5. Ignored Sqoop Options
23.3. Automatic Table Creation
23.4. Delimited Text Formats and Field and Line Delimiter Characters
23.5. HCatalog Table Requirements
23.6. Support for Partitioning
23.7. Schema Mapping
23.8. Support for HCatalog Data Types
23.9. Providing Hive and HCatalog Libraries for the Sqoop Job
23.10. Examples
23.11. Import
23.12. Export
23.1. HCatalog Background
HCatalog is a table and storage management service for Hadoop that enables users with different data processing tools Pig, MapReduce, and Hive to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored: RCFile format, text files, or SequenceFiles.
HCatalog supports reading and writing files in any format for which a Hive SerDe (serializer-deserializer) has been written. By default, HCatalog supports RCFile, CSV, JSON, and SequenceFile formats. To use a custom format, you must provide the InputFormat and OutputFormat as well as the SerDe.
The ability of HCatalog to abstract various storage formats is used in providing the RCFile (and future file types) support to Sqoop.
23.2. Exposing HCatalog Tables to Sqoop
23.2.1. New Command Line Options
23.2.2. Supported Sqoop Hive Options
23.2.3. Direct Mode support
23.2.4. Unsupported Sqoop Options
23.2.4.1. Unsupported Sqoop Hive Import Options
23.2.4.2. Unsupported Sqoop Export and Import Options
23.2.5. Ignored Sqoop Options
HCatalog integration with Sqoop is patterned on an existing feature set that supports Avro and Hive tables. Seven new command line options are introduced, and some command line options defined for Hive have been reused.
23.2.1. New Command Line Options
–hcatalog-database
Specifies the database name for the HCatalog table. If not specified, the default database name default is used. Providing the --hcatalog-database option without --hcatalog-table is an error. This is not a required option.
–hcatalog-table
The argument value for this option is the HCatalog tablename. The presence of the --hcatalog-table option signifies that the import or export job is done using HCatalog tables, and it is a required option for HCatalog jobs.
–hcatalog-home
The home directory for the HCatalog installation. The directory is expected to have a lib subdirectory and a share/hcatalog subdirectory with necessary HCatalog libraries. If not specified, the system property hcatalog.home will be checked and failing that, a system environment variable HCAT_HOME will be checked. If none of these are set, the default value will be used and currently the default is set to /usr/lib/hcatalog. This is not a required option.
–create-hcatalog-table
This option specifies whether an HCatalog table should be created automatically when importing data. By default, HCatalog tables are assumed to exist. The table name will be the same as the database table name translated to lower case. Further described in Automatic Table Creation below.
–drop-and-create-hcatalog-table
Same as --create-hcatalog-table, but does a drop if exists before creating the table.
–hcatalog-storage-stanza
This option specifies the storage stanza to be appended to the table. Further described in Automatic Table Creation below.
–hcatalog-partition-keys and --hcatalog-partition-values
These two options are used to specify multiple static partition key/value pairs. In the prior releases, --hive-partition-key and --hive-partition-value options were used to specify the static partition key/value pair, but only one level of static partition keys could be provided. The options --hcatalog-partition-keys and --hcatalog-partition-values allow multiple keys and values to be provided as static partitioning keys. Multiple option values are to be separated by , (comma).
For example, if the hive partition keys for the table to export/import from are defined with partition key names year, month and date and a specific partition with year=1999, month=12, day=31 is the desired partition, then the values for the two options will be as follows:
–hcatalog-partition-keys year,month,day
–hcatalog-partition-values 1999,12,31
To provide backward compatibility, if --hcatalog-partition-keys or --hcatalog-partition-values options are not provided, then --hive-partitition-key and --hive-partition-value will be used if provided.
It is an error to specify only one of --hcatalog-partition-keys or --hcatalog-partition-values options. Either both of the options should be provided or neither of the options should be provided.
23.2.2. Supported Sqoop Hive Options
The following Sqoop options are also used along with the --hcatalog-table option to provide additional input to the HCatalog jobs. Some of the existing Hive import job options are reused with HCatalog jobs instead of creating HCatalog-specific options for the same purpose.
–map-column-hive
This option maps a database column to HCatalog with a specific HCatalog type.
–hive-home
The Hive home location.
–hive-partition-key
Used for static partitioning filter. The partitioning key should be of type STRING. There can be only one static partitioning key. Please see the discussion about --hcatalog-partition-keys and --hcatalog-partition-values options.
–hive-partition-value
The value associated with the partition. Please see the discussion about --hcatalog-partition-keys and --hcatalog-partition-values options.
23.2.3. Direct Mode support
HCatalog integration in Sqoop has been enhanced to support direct mode connectors (which are high performance connectors specific to a database). Netezza direct mode connector has been enhanced to take advatange of this feature.
[重要的] Important
Only Netezza direct mode connector is currently enabled to work with HCatalog.
23.2.4. Unsupported Sqoop Options
23.2.4.1. Unsupported Sqoop Hive Import Options
23.2.4.2. Unsupported Sqoop Export and Import Options
23.2.4.1. Unsupported Sqoop Hive Import Options
The following Sqoop Hive import options are not supported with HCatalog jobs.
–hive-import
–hive-overwrite
23.2.4.2. Unsupported Sqoop Export and Import Options
The following Sqoop export and import options are not supported with HCatalog jobs.
–export-dir
–target-dir
–warehouse-dir
–append
–as-sequencefile
–as-avrodatafile
–as-parquetfile
23.2.5. Ignored Sqoop Options
The following options are ignored with HCatalog jobs.
All input delimiter options are ignored.
Output delimiters are generally ignored unless either --hive-drop-import-delims or --hive-delims-replacement is used. When the --hive-drop-import-delims or --hive-delims-replacement option is specified, all CHAR type database table columns will be post-processed to either remove or replace the delimiters, respectively. See Delimited Text Formats and Field and Line Delimiter Characters below. This is only needed if the HCatalog table uses text formats.
23.3. Automatic Table Creation
One of the key features of Sqoop is to manage and create the table metadata when importing into Hadoop. HCatalog import jobs also provide for this feature with the option --create-hcatalog-table. Furthermore, one of the important benefits of the HCatalog integration is to provide storage agnosticism to Sqoop data movement jobs. To provide for that feature, HCatalog import jobs provide an option that lets a user specifiy the storage format for the created table.
The option --create-hcatalog-table is used as an indicator that a table has to be created as part of the HCatalog import job. If the option --create-hcatalog-table is specified and the table exists, then the table creation will fail and the job will be aborted.
The option --hcatalog-storage-stanza can be used to specify the storage format of the newly created table. The default value for this option is stored as rcfile. The value specified for this option is assumed to be a valid Hive storage format expression. It will be appended to the create table command generated by the HCatalog import job as part of automatic table creation. Any error in the storage stanza will cause the table creation to fail and the import job will be aborted.
Any additional resources needed to support the storage format referenced in the option --hcatalog-storage-stanza should be provided to the job either by placing them in $HIVE_HOME/lib or by providing them in HADOOP_CLASSPATH and LIBJAR files.
If the option --hive-partition-key is specified, then the value of this option is used as the partitioning key for the newly created table. Only one partitioning key can be specified with this option.
Object names are mapped to the lowercase equivalents as specified below when mapped to an HCatalog table. This includes the table name (which is the same as the external store table name converted to lower case) and field names.
23.4. Delimited Text Formats and Field and Line Delimiter Characters
HCatalog supports delimited text format as one of the table storage formats. But when delimited text is used and the imported data has fields that contain those delimiters, then the data may be parsed into a different number of fields and records by Hive, thereby losing data fidelity.
For this case, one of these existing Sqoop import options can be used:
–hive-delims-replacement
–hive-drop-import-delims
If either of these options is provided for import, then any column of type STRING will be formatted with the Hive delimiter processing and then written to the HCatalog table.
23.5. HCatalog Table Requirements
The HCatalog table should be created before using it as part of a Sqoop job if the default table creation options (with optional storage stanza) are not sufficient. All storage formats supported by HCatalog can be used with the creation of the HCatalog tables. This makes this feature readily adopt new storage formats that come into the Hive project, such as ORC files.
23.6. Support for Partitioning
The Sqoop HCatalog feature supports the following table types:
Unpartitioned tables
Partitioned tables with a static partitioning key specified
Partitioned tables with dynamic partition keys from the database result set
Partitioned tables with a combination of a static key and additional dynamic partitioning keys
23.7. Schema Mapping
Sqoop currently does not support column name mapping. However, the user is allowed to override the type mapping. Type mapping loosely follows the Hive type mapping already present in Sqoop except that SQL types FLOAT and REAL are mapped to HCatalog type float. In the Sqoop type mapping for Hive, these two are mapped to double. Type mapping is primarily used for checking the column definition correctness only and can be overridden with the --map-column-hive option.
All types except binary are assignable to a String type.
Any field of number type (int, shortint, tinyint, bigint and bigdecimal, float and double) is assignable to another field of any number type during exports and imports. Depending on the precision and scale of the target type of assignment, truncations can occur.
Furthermore, date/time/timestamps are mapped to date/timestamp hive types. (the full date/time/timestamp representation). Date/time/timstamp columns can also be mapped to bigint Hive type in which case the value will be the number of milliseconds since epoch.
BLOBs and CLOBs are only supported for imports. The BLOB/CLOB objects when imported are stored in a Sqoop-specific format and knowledge of this format is needed for processing these objects in a Pig/Hive job or another Map Reduce job.
Database column names are mapped to their lowercase equivalents when mapped to the HCatalog fields. Currently, case-sensitive database object names are not supported.
Projection of a set of columns from a table to an HCatalog table or loading to a column projection is allowed, subject to table constraints. The dynamic partitioning columns, if any, must be part of the projection when importing data into HCatalog tables.
Dynamic partitioning fields should be mapped to database columns that are defined with the NOT NULL attribute (although this is not enforced during schema mapping). A null value during import for a dynamic partitioning column will abort the Sqoop job.
23.8. Support for HCatalog Data Types
All the primitive Hive types that are part of Hive 0.13 version are supported. Currently all the complex HCatalog types are not supported.
BLOB/CLOB database types are only supported for imports.
23.9. Providing Hive and HCatalog Libraries for the Sqoop Job
With the support for HCatalog added to Sqoop, any HCatalog job depends on a set of jar files being available both on the Sqoop client host and where the Map/Reduce tasks run. To run HCatalog jobs, the environment variable HADOOP_CLASSPATH must be set up as shown below before launching the Sqoop HCatalog jobs.
HADOOP_CLASSPATH=$(hcat -classpath) export HADOOP_CLASSPATH
The necessary HCatalog dependencies will be copied to the distributed cache automatically by the Sqoop job.
23.10. Examples
Create an HCatalog table, such as:
hcat -e “create table txn(txn_date string, cust_id string, amount float, store_id int) partitioned by (cust_id string) stored as rcfile;”
Then Sqoop import and export of the “txn” HCatalog table can be invoked as follows:
23.11. Import
$SQOOP_HOME/bin/sqoop import --connect -table --hcatalog-table txn
23.12. Export
$SQOOP_HOME/bin/sqoop export --connect -table --hcatalog-table txn
Nonetheless, several database-specific decisions were made in the implementation of Sqoop, and some databases offer additional settings which are extensions to the standard.
This section describes the databases tested with Sqoop, any exceptions in Sqoop’s handling of each database relative to the norm, and any database-specific settings available in Sqoop.
24.1. Supported Databases
While JDBC is a compatibility layer that allows a program to access many different databases through a common API, slight differences in the SQL language spoken by each database may mean that Sqoop can’t use every database out of the box, or that some databases may be used in an inefficient manner.
When you provide a connect string to Sqoop, it inspects the protocol scheme to determine appropriate vendor-specific logic to use. If Sqoop knows about a given database, it will work automatically. If not, you may need to specify the driver class to load via --driver. This will use a generic code path which will use standard SQL to access the database. Sqoop provides some databases with faster, non-JDBC-based access mechanisms. These can be enabled by specfying the --direct parameter.
Sqoop includes vendor-specific support for the following databases:
Database version --direct support? connect string matches
HSQLDB 1.8.0+ No jdbc:hsqldb://
MySQL 5.0+ Yes jdbc:mysql://
Oracle 10.2.0+ Yes jdbc:oracle://
PostgreSQL 8.3+ Yes (import only) jdbc:postgresql://
CUBRID 9.2+ NO jdbc:cubrid:*
Sqoop may work with older versions of the databases listed, but we have only tested it with the versions specified above.
Even if Sqoop supports a database internally, you may still need to install the database vendor’s JDBC driver in your $SQOOP_HOME/lib path on your client. Sqoop can load classes from any jars in $SQOOP_HOME/lib on the client and will use them as part of any MapReduce jobs it runs; unlike older versions, you no longer need to install JDBC jars in the Hadoop library path on your servers.
24.2. MySQL
24.2.1. zeroDateTimeBehavior
24.2.2. UNSIGNED columns
24.2.3. BLOB and CLOB columns
24.2.4. Importing views in direct mode
JDBC Driver: MySQL Connector/J
MySQL v5.0 and above offers very thorough coverage by Sqoop. Sqoop has been tested with mysql-connector-java-5.1.13-bin.jar.
24.2.1. zeroDateTimeBehavior
MySQL allows values of ‘0000-00-00’ for DATE columns, which is a non-standard extension to SQL. When communicated via JDBC, these values are handled in one of three different ways:
Convert to NULL.
Throw an exception in the client.
Round to the nearest legal date (‘0001-01-01’).
You specify the behavior by using the zeroDateTimeBehavior property of the connect string. If a zeroDateTimeBehavior property is not specified, Sqoop uses the convertToNull behavior.
You can override this behavior. For example:
$ sqoop import --table foo
–connect jdbc:mysql://db.example.com/someDb?zeroDateTimeBehavior=round
24.2.2. UNSIGNED columns
Columns with type UNSIGNED in MySQL can hold values between 0 and 2^32 (4294967295), but the database will report the data type to Sqoop as INTEGER, which will can hold values between -2147483648 and +2147483647. Sqoop cannot currently import UNSIGNED values above 2147483647.
24.2.3. BLOB and CLOB columns
Sqoop’s direct mode does not support imports of BLOB, CLOB, or LONGVARBINARY columns. Use JDBC-based imports for these columns; do not supply the --direct argument to the import tool.
24.2.4. Importing views in direct mode
Sqoop is currently not supporting import from view in direct mode. Use JDBC based (non direct) mode in case that you need to import view (simply omit --direct parameter).
24.3. PostgreSQL
24.3.1. Importing views in direct mode
Sqoop supports JDBC-based connector for PostgreSQL: http://jdbc.postgresql.org/
The connector has been tested using JDBC driver version “9.1-903 JDBC 4” with PostgreSQL server 9.1.
24.3.1. Importing views in direct mode
Sqoop is currently not supporting import from view in direct mode. Use JDBC based (non direct) mode in case that you need to import view (simply omit --direct parameter).
24.4. Oracle
24.4.1. Dates and Times
JDBC Driver: Oracle JDBC Thin Driver - Sqoop is compatible with ojdbc6.jar.
Sqoop has been tested with Oracle 10.2.0 Express Edition. Oracle is notable in its different approach to SQL from the ANSI standard, and its non-standard JDBC driver. Therefore, several features work differently.
24.4.1. Dates and Times
Oracle JDBC represents DATE and TIME SQL types as TIMESTAMP values. Any DATE columns in an Oracle database will be imported as a TIMESTAMP in Sqoop, and Sqoop-generated code will store these values in java.sql.Timestamp fields.
When exporting data back to a database, Sqoop parses text fields as TIMESTAMP types (with the form yyyy-mm-dd HH:MM:SS.ffffffff) even if you expect these fields to be formatted with the JDBC date escape format of yyyy-mm-dd. Dates exported to Oracle should be formatted as full timestamps.
Oracle also includes the additional date/time types TIMESTAMP WITH TIMEZONE and TIMESTAMP WITH LOCAL TIMEZONE. To support these types, the user’s session timezone must be specified. By default, Sqoop will specify the timezone “GMT” to Oracle. You can override this setting by specifying a Hadoop property oracle.sessionTimeZone on the command-line when running a Sqoop job. For example:
$ sqoop import -D oracle.sessionTimeZone=America/Los_Angeles
–connect jdbc:oracle:thin:@//db.example.com/foo --table bar
Note that Hadoop parameters (-D …) are generic arguments and must appear before the tool-specific arguments (–connect, --table, and so on).
Legal values for the session timezone string are enumerated at http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14225/applocaledata.htm#i637736.
24.5. Schema Definition in Hive
Hive users will note that there is not a one-to-one mapping between SQL types and Hive types. In general, SQL types that do not have a direct mapping (for example, DATE, TIME, and TIMESTAMP) will be coerced to STRING in Hive. The NUMERIC and DECIMAL SQL types will be coerced to DOUBLE. In these cases, Sqoop will emit a warning in its log messages informing you of the loss of precision.
24.6. Parquet Support in Hive
In order to contact the Hive MetaStore from a MapReduce job, a delegation token will be fetched and passed. HIVE_CONF_DIR and HIVE_HOME must be set appropriately to add Hive to the runtime classpath. Otherwise, importing/exporting into Hive in Parquet format may not work.
24.7. CUBRID
Sqoop supports JDBC-based connector for Cubrid: http://www.cubrid.org/?mid=downloads&item=jdbc_driver
The connector has been tested using JDBC driver version “JDBC-9.2.0.0155-cubrid.jar” with Cubrid 9.2.
25.1.1. Upsert functionality
MySQL JDBC Connector is supporting upsert functionality using argument --update-mode allowinsert. To achieve that Sqoop is using MySQL clause INSERT INTO … ON DUPLICATE KEY UPDATE. This clause do not allow user to specify which columns should be used to distinct whether we should update existing row or add new row. Instead this clause relies on table’s unique keys (primary key belongs to this set). MySQL will try to insert new row and if the insertion fails with duplicate unique key error it will update appropriate row instead. As a result, Sqoop is ignoring values specified in parameter --update-key, however user needs to specify at least one valid column to turn on update mode itself.
25.2. MySQL Direct Connector
25.2.1. Requirements
25.2.2. Limitations
25.2.3. Direct-mode Transactions
MySQL Direct Connector allows faster import and export to/from MySQL using mysqldump and mysqlimport tools functionality instead of SQL selects and inserts.
To use the MySQL Direct Connector, specify the --direct argument for your import or export job.
Example:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
–direct
Passing additional parameters to mysqldump:
$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar
–direct – --default-character-set=latin1
25.2.1. Requirements
Utilities mysqldump and mysqlimport should be present in the shell path of the user running the Sqoop command on all nodes. To validate SSH as this user to all nodes and execute these commands. If you get an error, so will Sqoop.
25.2.2. Limitations
Currently the direct connector does not support import of large object columns (BLOB and CLOB).
Importing to HBase and Accumulo is not supported
Use of a staging table when exporting data is not supported
Import of views is not supported
25.2.3. Direct-mode Transactions
For performance, each writer will commit the current transaction approximately every 32 MB of exported data. You can control this by specifying the following argument before any tool-specific arguments: -D sqoop.mysql.export.checkpoint.bytes=size, where size is a value in bytes. Set size to 0 to disable intermediate checkpoints, but individual files being exported will continue to be committed independently of one another.
Sometimes you need to export large data with Sqoop to a live MySQL cluster that is under a high load serving random queries from the users of your application. While data consistency issues during the export can be easily solved with a staging table, there is still a problem with the performance impact caused by the heavy export.
First off, the resources of MySQL dedicated to the import process can affect the performance of the live product, both on the master and on the slaves. Second, even if the servers can handle the import with no significant performance impact (mysqlimport should be relatively “cheap”), importing big tables can cause serious replication lag in the cluster risking data inconsistency.
With -D sqoop.mysql.export.sleep.ms=time, where time is a value in milliseconds, you can let the server relax between checkpoints and the replicas catch up by pausing the export process after transferring the number of bytes specified in sqoop.mysql.export.checkpoint.bytes. Experiment with different settings of these two parameters to archieve an export pace that doesn’t endanger the stability of your MySQL cluster.
[重要的] Important
Note that any arguments to Sqoop that are of the form -D parameter=value are Hadoop generic arguments and must appear before any tool-specific arguments (for example, --connect, --table, etc). Don’t forget that these parameters are only supported with the --direct flag set.
25.3. Microsoft SQL Connector
25.3.1. Extra arguments
25.3.2. Allow identity inserts
25.3.3. Non-resilient operations
25.3.4. Schema support
25.3.5. Table hints
25.3.1. Extra arguments
List of all extra arguments supported by Microsoft SQL Connector is shown below:
Table 50. Supported Microsoft SQL Connector extra arguments:
Argument Description
±-identity-insert Set IDENTITY_INSERT to ON before export insert.
–non-resilient Don’t attempt to recover failed export operations.
–schema Scheme name that sqoop should use. Default is “dbo”.
–table-hints Table hints that Sqoop should use for data movement.
25.3.2. Allow identity inserts
You can allow inserts on columns that have identity. For example:
$ sqoop export … --export-dir custom_dir --table custom_table – --identity-insert
25.3.3. Non-resilient operations
You can override the default and not use resilient operations during export. This will avoid retrying failed operations. For example:
$ sqoop export … --export-dir custom_dir --table custom_table – --non-resilient
25.3.4. Schema support
If you need to work with tables that are located in non-default schemas, you can specify schema names via the --schema argument. Custom schemas are supported for both import and export jobs. For example:
$ sqoop import … --table custom_table – --schema custom_schema
25.3.5. Table hints
Sqoop supports table hints in both import and export jobs. Table hints are used only for queries that move data from/to Microsoft SQL Server, but they cannot be used for meta data queries. You can specify a comma-separated list of table hints in the --table-hints argument. For example:
$ sqoop import … --table custom_table – --table-hints NOLOCK
25.4. PostgreSQL Connector
25.4.1. Extra arguments
25.4.2. Schema support
25.4.1. Extra arguments
List of all extra arguments supported by PostgreSQL Connector is shown below:
Table 51. Supported PostgreSQL extra arguments:
Argument Description
–schema Scheme name that sqoop should use. Default is “public”.
25.4.2. Schema support
If you need to work with table that is located in schema other than default one, you need to specify extra argument --schema. Custom schemas are supported for both import and export job (optional staging table however must be present in the same schema as target table). Example invocation:
$ sqoop import … --table custom_table – --schema custom_schema
25.5. PostgreSQL Direct Connector
25.5.1. Requirements
25.5.2. Limitations
PostgreSQL Direct Connector allows faster import and export to/from PostgresSQL “COPY” command.
To use the PostgreSQL Direct Connector, specify the --direct argument for your import or export job.
When importing from PostgreSQL in conjunction with direct mode, you can split the import into separate files after individual files reach a certain size. This size limit is controlled with the --direct-split-size argument.
The direct connector offers also additional extra arguments:
Table 52. Additional supported PostgreSQL extra arguments in direct mode:
Argument Description
–boolean-true-string String that will be used to encode true value of boolean columns.
Default is “TRUE”.
–boolean-false-string String that will be used to encode false value of boolean columns.
Default is “FALSE”.
25.5.1. Requirements
Utility psql should be present in the shell path of the user running the Sqoop command on all nodes. To validate SSH as this user to all nodes and execute these commands. If you get an error, so will Sqoop.
25.5.2. Limitations
Currently the direct connector does not support import of large object columns (BLOB and CLOB).
Importing to HBase and Accumulo is not supported
Import of views is not supported
25.6. pg_bulkload connector
25.6.1. Purpose
25.6.2. Requirements
25.6.3. Syntax
25.6.4. Data Staging
25.6.1. Purpose
pg_bulkload connector is a direct connector for exporting data into PostgreSQL. This connector uses pg_bulkload. Users benefit from functionality of pg_bulkload such as fast exports bypassing shared bufferes and WAL, flexible error records handling, and ETL feature with filter functions.
25.6.2. Requirements
pg_bulkload connector requires following conditions for export job execution:
The pg_bulkload must be installed on DB server and all slave nodes. RPM for RedHat or CentOS is available in then download page.
The PostgreSQL JDBC is required on client node.
Superuser role of PostgreSQL database is required for execution of pg_bulkload.
25.6.3. Syntax
Use --connection-manager option to specify connection manager classname.
$ sqoop export (generic-args) --connection-manager org.apache.sqoop.manager.PGBulkloadManager (export-args)
$ sqoop-export (generic-args) --connection-manager org.apache.sqoop.manager.PGBulkloadManager (export-args)
This connector supports export arguments shown below.
Table 53. Supported export control arguments:
Argument Description
–export-dir
There are additional configuration for pg_bulkload execution specified via Hadoop Configuration properties which can be given with -D
Table 54. Supported export control properties:
Property Description
mapred.reduce.tasks Number of reduce tasks for staging. The defalt value is 1. Each tasks do staging in a single transaction.
pgbulkload.bin Path of the pg_bulkoad binary installed on each slave nodes.
pgbulkload.check.constraints Specify whether CHECK constraints are checked during the loading. The default value is YES.
pgbulkload.parse.errors The maximum mumber of ingored records that cause errors during parsing, encoding, filtering, constraints checking, and data type conversion. Error records are recorded in the PARSE BADFILE. The default value is INFINITE.
pgbulkload.duplicate.errors Number of ingored records that violate unique constraints. Duplicated records are recorded in the DUPLICATE BADFILE on DB server. The default value is INFINITE.
pgbulkload.filter Specify the filter function to convert each row in the input file. See the pg_bulkload documentation to know how to write FILTER functions.
pgbulkload.clear.staging.table Indicates that any data present in the staging table can be dropped.
Here is a example of complete command line.
$ sqoop export
-Dmapred.reduce.tasks=2
-Dpgbulkload.bin="/usr/local/bin/pg_bulkload"
-Dpgbulkload.input.field.delim=$’\t’
-Dpgbulkload.check.constraints=“YES”
-Dpgbulkload.parse.errors=“INFINITE”
-Dpgbulkload.duplicate.errors=“INFINITE”
–connect jdbc:postgresql://pgsql.example.net:5432/sqooptest
–connection-manager org.apache.sqoop.manager.PGBulkloadManager
–table test --username sqooptest --export-dir=/test -m 2
25.6.4. Data Staging
Each map tasks of pg_bulkload connector’s export job create their own staging table on the fly. The Name of staging tables is decided based on the destination table and the task attempt ids. For example, the name of staging table for the “test” table is like test_attempt_1345021837431_0001_m_000000_0 .
Staging tables are automatically dropped if tasks successfully complete or map tasks fail. When reduce task fails, staging table for the task are left for manual retry and users must take care of it.
25.7. Netezza Connector
25.7.1. Extra arguments
25.7.2. Direct Mode
25.7.3. Null string handling
25.7.1. Extra arguments
List of all extra arguments supported by Netezza Connector is shown below:
Table 55. Supported Netezza extra arguments:
Argument Description
–partitioned-access Whether each mapper acts on a subset of data slices of a table or all Default is “false” for standard mode and “true” for direct mode.
–max-errors Applicable only for direct mode export. This option specifies the error threshold per mapper while transferring data. If the number of errors encountered exceed this threshold then the job will fail.
Default value is 1.
–log-dir Applicable only for direct mode export. Specifies the directory where Netezza external table operation logs are stored on the hadoop filesystem. Logs are stored under this directory with one directory for the job and sub-directories for each task number and attempt. Default value is the user home directory. The nzlog and nzbad files will be under
(logdir)/job-id/job-attempt-id.
–trunc-string Applicable only for direct mode export. Specifies whether the system truncates strings to the declared storage and loads the data. By default truncation of strings is reported as an error.
–ctrl-chars Applicable only for direct mode export. Specifies whether control characters (ASCII chars 1 - 31) can be allowed to be part of char/nchar/varchar/nvarchar columns. Default is false.
–crin-string Applicable only for direct mode export. Specifies whether carriage return (ASCII char 13) can be allowed to be part of char/nchar/varchar/nvarchar columns. Note that CR can no longer be a record delimiter with this option. Default is false.
–ignore-zero Applicable only for direct mode export. Specifies whether NUL character (ASCII char 0) should be scanned and ignored as part of the data loaded into char/nchar/varchar/nvarchar columns. Default is false.
25.7.2. Direct Mode
Netezza connector supports an optimized data transfer facility using the Netezza external tables feature. Each map tasks of Netezza connector’s import job will work on a subset of the Netezza partitions and transparently create and use an external table to transport data. Similarly, export jobs will use the external table to push data fast onto the NZ system. Direct mode does not support staging tables, upsert options etc.
Here is an example of complete command line for import using the Netezza external table feature.
$ sqoop import
–direct
–connect jdbc:netezza://nzhost:5480/sqoop
–table nztable
–username nzuser
–password nzpass
–target-dir hdfsdir
Here is an example of complete command line for export with tab as the field terminator character.
$ sqoop export
–direct
–connect jdbc:netezza://nzhost:5480/sqoop
–table nztable
–username nzuser
–password nzpass
–export-dir hdfsdir
–input-fields-terminated-by “\t”
25.7.3. Null string handling
Netezza direct connector supports the null-string features of Sqoop. The null string values are converted to appropriate external table options during export and import operations.
Table 56. Supported export control arguments:
Argument Description
–input-null-string The string to be interpreted as null for string columns.
–input-null-non-string The string to be interpreted as null for non string columns.
In the case of Netezza direct mode connector, both the arguments must be left to the default values or explicitly set to the same value. Furthermore the null string value is restricted to 0-4 utf8 characters.
On export, for non-string columns, if the chosen null value is a valid representation in the column domain, then the column might not be loaded as null. For example, if the null string value is specified as “1”, then on export, any occurrence of “1” in the input file will be loaded as value 1 instead of NULL for int columns.
It is suggested that the null value be specified as empty string for performance and consistency.
Table 57. Supported import control arguments:
Argument Description
–null-string The string to be interpreted as null for string columns.
–null-non-string The string to be interpreted as null for non string columns.
In the case of Netezza direct mode connector, both the arguments must be left to the default values or explicitly set to the same value. Furthermore the null string value is restricted to 0-4 utf8 characters.
On import, for non-string columns, the chosen null value in current implementations the null value representation is ignored for non character columns. For example, if the null string value is specified as “\N”, then on import, any occurrence of NULL for non-char columns in the table will be imported as an empty string instead of \N, the chosen null string representation.
It is suggested that the null value be specified as empty string for performance and consistency.
25.8. Data Connector for Oracle and Hadoop
25.8.1. About
25.8.1.1. Jobs
25.8.1.2. How The Standard Oracle Manager Works for Imports
25.8.1.3. How The Data Connector for Oracle and Hadoop Works for Imports
25.8.1.4. Data Connector for Oracle and Hadoop Exports
25.8.2. Requirements
25.8.2.1. Ensure The Oracle Database JDBC Driver Is Setup Correctly
25.8.2.2. Oracle Roles and Privileges
25.8.2.3. Additional Oracle Roles And Privileges Required for Export
25.8.2.4. Supported Data Types
25.8.3. Execute Sqoop With Data Connector for Oracle and Hadoop
25.8.3.1. Connect to Oracle / Oracle RAC
25.8.3.2. Connect to An Oracle Database Instance
25.8.3.3. Connect to An Oracle RAC
25.8.3.4. Login to The Oracle Instance
25.8.3.5. Kill Data Connector for Oracle and Hadoop Jobs
25.8.4. Import Data from Oracle
25.8.4.1. Match Hadoop Files to Oracle Table Partitions
25.8.4.2. Specify The Partitions To Import
25.8.4.3. Consistent Read: All Mappers Read From The Same Point In Time
25.8.5. Export Data into Oracle
25.8.5.1. Insert-Export
25.8.5.2. Update-Export
25.8.5.3. Merge-Export
25.8.5.4. Create Oracle Tables
25.8.5.5. NOLOGGING
25.8.5.6. Partitioning
25.8.5.7. Match Rows Via Multiple Columns
25.8.5.8. Storage Clauses
25.8.6. Manage Date And Timestamp Data Types
25.8.6.1. Import Date And Timestamp Data Types from Oracle
25.8.6.2. The Data Connector for Oracle and Hadoop Does Not Apply A Time Zone to DATE / TIMESTAMP Data Types
25.8.6.3. The Data Connector for Oracle and Hadoop Retains Time Zone Information in TIMEZONE Data Types
25.8.6.4. Data Connector for Oracle and Hadoop Explicitly States Time Zone for LOCAL TIMEZONE Data Types
25.8.6.5. java.sql.Timestamp
25.8.6.6. Export Date And Timestamp Data Types into Oracle
25.8.7. Configure The Data Connector for Oracle and Hadoop
25.8.7.1. oraoop-site-template.xml
25.8.7.2. oraoop.oracle.session.initialization.statements
25.8.7.3. oraoop.table.import.where.clause.location
25.8.7.4. oracle.row.fetch.size
25.8.7.5. oraoop.import.hint
25.8.7.6. oraoop.oracle.append.values.hint.usage
25.8.7.7. mapred.map.tasks.speculative.execution
25.8.7.8. oraoop.block.allocation
25.8.7.9. oraoop.import.omit.lobs.and.long
25.8.7.10. oraoop.locations
25.8.7.11. sqoop.connection.factories
25.8.7.12. Expressions in oraoop-site.xml
25.8.8. Troubleshooting The Data Connector for Oracle and Hadoop
25.8.8.1. Quote Oracle Owners And Tables
25.8.8.2. Quote Oracle Columns
25.8.8.3. Confirm The Data Connector for Oracle and Hadoop Can Initialize The Oracle Session
25.8.8.4. Check The Sqoop Debug Logs for Error Messages
25.8.8.5. Export: Check Tables Are Compatible
25.8.8.6. Export: Parallelization
25.8.8.7. Export: Check oraoop.oracle.append.values.hint.usage
25.8.8.8. Turn On Verbose
25.8.1. About
25.8.1.1. Jobs
25.8.1.2. How The Standard Oracle Manager Works for Imports
25.8.1.3. How The Data Connector for Oracle and Hadoop Works for Imports
25.8.1.4. Data Connector for Oracle and Hadoop Exports
The Data Connector for Oracle and Hadoop is now included in Sqoop.
It can be enabled by specifying the --direct argument for your import or export job.
25.8.1.1. Jobs
The Data Connector for Oracle and Hadoop inspects each Sqoop job and assumes responsibility for the ones it can perform better than the Oracle manager built into Sqoop.
Data Connector for Oracle and Hadoop accepts responsibility for the following Sqoop Job types:
Import jobs that are Non-Incremental.
Export jobs
Data Connector for Oracle and Hadoop does not accept responsibility for other Sqoop job types. For example Data Connector for Oracle and Hadoop does not accept eval jobs etc.
Data Connector for Oracle and Hadoop accepts responsibility for those Sqoop Jobs with the following attributes:
Oracle-related
Table-Based - Jobs where the table argument is used and the specified object is a table.
[笔记] Note
Data Connector for Oracle and Hadoop does not process index-organized tables unless the table is partitioned and oraoop.chunk.method is set to PARTITION
There are at least 2 mappers — Jobs where the Sqoop command-line does not include: --num-mappers 1
25.8.1.2. How The Standard Oracle Manager Works for Imports
The Oracle manager built into Sqoop uses a range-based query for each mapper. Each mapper executes a query of the form:
SELECT * FROM sometable WHERE id >= lo AND id < hi
The lo and hi values are based on the number of mappers and the minimum and maximum values of the data in the column the table is being split by.
If no suitable index exists on the table then these queries result in full table-scans within Oracle. Even with a suitable index, multiple mappers may fetch data stored within the same Oracle blocks, resulting in redundant IO calls.
25.8.1.3. How The Data Connector for Oracle and Hadoop Works for Imports
The Data Connector for Oracle and Hadoop generates queries for the mappers of the form:
SELECT *
FROM sometable
WHERE rowid >= dbms_rowid.rowid_create(1, 893, 1, 279, 0) AND
rowid <= dbms_rowid.rowid_create(1, 893, 1, 286, 32767)
The Data Connector for Oracle and Hadoop queries ensure that:
No two mappers read data from the same Oracle block. This minimizes redundant IO.
The table does not require indexes.
The Sqoop command line does not need to specify a --split-by column.
25.8.1.4. Data Connector for Oracle and Hadoop Exports
Benefits of the Data Connector for Oracle and Hadoop:
Merge-Export facility - Update Oracle tables by modifying changed rows AND inserting rows from the HDFS file that did not previously exist in the Oracle table. The Connector for Oracle and Hadoop’s Merge-Export is unique - there is no Sqoop equivalent.
Lower impact on the Oracle database - Update the rows in the Oracle table that have changed, not all rows in the Oracle table. This has performance benefits and reduces the impact of the query on Oracle (for example, the Oracle redo logs).
Improved performance - With partitioned tables, mappers utilize temporary Oracle tables which allow parallel inserts and direct path writes.
25.8.2. Requirements
25.8.2.1. Ensure The Oracle Database JDBC Driver Is Setup Correctly
25.8.2.2. Oracle Roles and Privileges
25.8.2.3. Additional Oracle Roles And Privileges Required for Export
25.8.2.4. Supported Data Types
25.8.2.1. Ensure The Oracle Database JDBC Driver Is Setup Correctly
You may want to ensure the Oracle Database 11g Release 2 JDBC driver is setup correctly on your system. This driver is required for Sqoop to work with Oracle.
The Oracle Database 11g Release 2 JDBC driver file is ojdbc6.jar (3.2Mb).
If this file is not on your system then download it from: http://www.oracle.com/technetwork/database/features/jdbc/index-091264.html
This file should be put into the $SQOOP_HOME/lib directory.
25.8.2.2. Oracle Roles and Privileges
The Oracle user for The Data Connector for Oracle and Hadoop requires the following roles and privileges:
create session
In addition, the user must have the select any dictionary privilege or select_catalog_role role or all of the following object privileges:
select on v_KaTeX parse error: Expected group after '_' at position 229: …nly select on v_̲database — Required for Sqoop imports only
select on v_$parameter — Required for Sqoop imports only
[笔记] Note
The user also requires the alter session privilege to make use of session tracing functionality. See “oraoop.oracle.session.initialization.statements” for more information.
25.8.2.3. Additional Oracle Roles And Privileges Required for Export
The Oracle user for Data Connector for Oracle and Hadoop requires:
Quota on the tablespace in which the Oracle export tables are located.
An example Oracle command to achieve this is
alter user username quota unlimited on tablespace
The following privileges:
Type of Export Privileges Required
All Export create table
select on dba_tab_partitions
select on dba_tab_subpartitions
select on dba_indexes
select on dba_ind_columns
Insert-Export with a template table into another schema select any table
create any table
insert any table
alter any table (partitioning)
Insert-Export without a template table into another schema select,insert on table (no partitioning)
select,alter on table (partitioning)
Update-Export into another schema select,update on table (no partitioning)
select,delete,alter,insert on table (partitioning)
Merge-Export into another schema select,insert,update on table (no partitioning)
select,insert,delete,alter on table (partitioning)
25.8.2.4. Supported Data Types
The following Oracle data types are supported by the Data Connector for Oracle and Hadoop:
BINARY_DOUBLE NCLOB
BINARY_FLOAT NUMBER
BLOB NVARCHAR2
CHAR RAW
CLOB ROWID
DATE TIMESTAMP
FLOAT TIMESTAMP WITH TIME ZONE
INTERVAL DAY TO SECOND TIMESTAMP WITH LOCAL TIME ZONE
INTERVAL YEAR TO MONTH URITYPE
LONG VARCHAR2
NCHAR
All other Oracle column types are NOT supported. Example Oracle column types NOT supported by Data Connector for Oracle and Hadoop include:
All of the ANY types BFILE
All of the MEDIA types LONG RAW
All of the SPATIAL types MLSLABEL
Any type referred to as UNDEFINED UROWID
All custom (user-defined) URI types XMLTYPE
[笔记] Note
Data types RAW, LONG and LOB (BLOB, CLOB and NCLOB) are supported for Data Connector for Oracle and Hadoop imports. They are not supported for Data Connector for Oracle and Hadoop exports.
25.8.3. Execute Sqoop With Data Connector for Oracle and Hadoop
25.8.3.1. Connect to Oracle / Oracle RAC
25.8.3.2. Connect to An Oracle Database Instance
25.8.3.3. Connect to An Oracle RAC
25.8.3.4. Login to The Oracle Instance
25.8.3.5. Kill Data Connector for Oracle and Hadoop Jobs
25.8.3.1. Connect to Oracle / Oracle RAC
The Sqoop --connect parameter defines the Oracle instance or Oracle RAC to connect to. It is required with all Sqoop import and export commands.
Data Connector for Oracle and Hadoop expects the associated connection string to be of a specific format dependent on whether the Oracle SID, Service or TNS name is defined. The TNS name based URL scheme can be used to enable authentication using Oracle wallets.
–connect jdbc:oracle:thin:@OracleServer:OraclePort:OracleSID
–connect jdbc:oracle:thin:@//OracleServer:OraclePort/OracleService
–connect jdbc:oracle:thin:@TNSName
25.8.3.2. Connect to An Oracle Database Instance
Parameter / Component Description
jdbc:oracle:thin The Data Connector for Oracle and Hadoop requires the connection string starts with jdbc:oracle.
The Data Connector for Oracle and Hadoop has been tested with the thin driver however it should work equally well with other drivers such as OCI.
OracleServer The host name of the Oracle server.
OraclePort The port to connect to the Oracle server.
OracleSID The Oracle instance.
OracleService The Oracle Service.
TNSName The TNS name for the entry describing the connection to the Oracle server.
[笔记] Note
The Hadoop mappers connect to the Oracle database using a dynamically generated JDBC URL. This is designed to improve performance however it can be disabled by specifying:
-D oraoop.jdbc.url.verbatim=true
25.8.3.3. Connect to An Oracle RAC
Use the --connect parameter as above. The connection string should point to one instance of the Oracle RAC. The listener of the host of this Oracle instance will locate the other instances of the Oracle RAC.
[笔记] Note
To improve performance, The Data Connector for Oracle and Hadoop identifies the active instances of the Oracle RAC and connects each Hadoop mapper to them in a roundrobin manner.
If services are defined for this Oracle RAC then use the following parameter to specify the service name:
-D oraoop.oracle.rac.service.name=ServiceName
Parameter / Component Description
OracleServer:OraclePort:OracleInstance Name one instance of the Oracle RAC. The Data Connector for Oracle and Hadoop assumes the same port number for all instances of the Oracle RAC.
The listener of the host of this Oracle instance is used to locate other instances of the Oracle RAC. For more information enter this command on the host command line:
lsnrctl status
-D oraoop.oracle.rac.service.name=ServiceName The service to connect to in the Oracle RAC.
A connection is made to all instances of the Oracle RAC associated with the service given by ServiceName.
If omitted, a connection is made to all instances of the Oracle RAC.
The listener of the host of this Oracle instance needs to know the ServiceName and all instances of the Oracle RAC. For more information enter this command on the host command line:
lsnrctl status
25.8.3.4. Login to The Oracle Instance
Login to the Oracle instance on the Sqoop command line:
–connect jdbc:oracle:thin:@OracleServer:OraclePort:OracleInstance --username UserName -P
Parameter / Component Description
–username UserName The username to login to the Oracle instance (SID).
-P You will be prompted for the password to login to the Oracle instance.
25.8.3.5. Kill Data Connector for Oracle and Hadoop Jobs
Use the Hadoop Job Tracker to kill the Sqoop job, just as you would kill any other Map-Reduce job.
$ hadoop job -kill jobid
To allow an Oracle DBA to kill a Data Connector for Oracle and Hadoop job (via killing the sessions in Oracle) you need to prevent Map-Reduce from re-attempting failed jobs. This is done via the following Sqoop command-line switch:
-D mapred.map.max.attempts=1
This sends instructions similar to the following to the console:
14/07/07 15:24:51 INFO oracle.OraOopManagerFactory:
Note: This Data Connector for Oracle and Hadoop job can be killed via Oracle
by executing the following statement:
begin
for row in (select sid,serial# from v$session where
module=‘Data Connector for Oracle and Hadoop’ and
action=‘import 20140707152451EST’) loop
execute immediate ‘alter system kill session ‘’’ || row.sid ||
‘,’ || row.serial# || ‘’’’;
end loop;
end;
25.8.4. Import Data from Oracle
25.8.4.1. Match Hadoop Files to Oracle Table Partitions
25.8.4.2. Specify The Partitions To Import
25.8.4.3. Consistent Read: All Mappers Read From The Same Point In Time
Execute Sqoop. Following is an example command:
$ sqoop import --direct --connect … --table OracleTableName
If The Data Connector for Oracle and Hadoop accepts the job then the following text is output:
*** Using Data Connector for Oracle and Hadoop ***
[笔记] Note
More information is available on the --connect parameter. See “Connect to Oracle / Oracle RAC” for more information.
If Java runs out of memory the workaround is to specify each mapper’s JVM memory allocation. Add the following parameter for example to allocate 4GB:
-Dmapred.child.java.opts=-Xmx4000M
An Oracle optimizer hint is included in the SELECT statement by default. See “oraoop.import.hint” for more information.
You can alter the hint on the command line as follows:
-Doraoop.import.hint=“NO_INDEX(t)”
You can turn off the hint on the command line as follows (notice the space between the double quotes):
-Doraoop.import.hint=" "
25.8.4.1. Match Hadoop Files to Oracle Table Partitions
-Doraoop.chunk.method={ROWID|PARTITION}
To import data from a partitioned table in such a way that the resulting HDFS folder structure in Hadoop will match the table’s partitions, set the chunk method to PARTITION. The alternative (default) chunk method is ROWID.
[笔记] Note
For the number of Hadoop files to match the number of Oracle partitions, set the number of mappers to be greater than or equal to the number of partitions.
If the table is not partitioned then value PARTITION will lead to an error.
25.8.4.2. Specify The Partitions To Import
-Doraoop.import.partitions=PartitionA,PartitionB --table OracleTableName
Imports PartitionA and PartitionB of OracleTableName.
[笔记] Note
You can enclose an individual partition name in double quotes to retain the letter case or if the name has special characters.
-Doraoop.import.partitions=’“PartitionA”,PartitionB’ --table OracleTableName
If the partition name is not double quoted then its name will be automatically converted to upper case, PARTITIONB for above.
When using double quotes the entire list of partition names must be enclosed in single quotes.
If the last partition name in the list is double quoted then there must be a comma at the end of the list.
-Doraoop.import.partitions=’“PartitionA”,“PartitionB”,’ --table OracleTableName
Name each partition to be included. There is no facility to provide a range of partition names.
There is no facility to define sub partitions. The entire partition is included/excluded as per the filter.
25.8.4.3. Consistent Read: All Mappers Read From The Same Point In Time
-Doraoop.import.consistent.read={true|false}
When set to false (by default) each mapper runs a select query. This will return potentially inconsistent data if there are a lot of DML operations on the table at the time of import.
Set to true to ensure all mappers read from the same point in time. The System Change Number (SCN) is passed down to all mappers, which use the Oracle Flashback Query to query the table as at that SCN.
[笔记] Note
Values true | false are case sensitive.
By default the SCN is taken from V$database. You can specify the SCN in the following command
-Doraoop.import.consistent.read.scn=12345
25.8.5. Export Data into Oracle
25.8.5.1. Insert-Export
25.8.5.2. Update-Export
25.8.5.3. Merge-Export
25.8.5.4. Create Oracle Tables
25.8.5.5. NOLOGGING
25.8.5.6. Partitioning
25.8.5.7. Match Rows Via Multiple Columns
25.8.5.8. Storage Clauses
Execute Sqoop. Following is an example command:
$ sqoop export --direct --connect … --table OracleTableName --export-dir /user/username/tablename
The Data Connector for Oracle and Hadoop accepts all jobs that export data to Oracle. You can verify The Data Connector for Oracle and Hadoop is in use by checking the following text is output:
*** Using Data Connector for Oracle and Hadoop ***
[笔记] Note
OracleTableName is the Oracle table the data will export into.
OracleTableName can be in a schema other than that for the connecting user. Prefix the table name with the schema, for example SchemaName.OracleTableName.
Hadoop tables are picked up from the /user/username/tablename directory.
The export will fail if the Hadoop file contains any fields of a data type not supported by The Data Connector for Oracle and Hadoop. See “Supported Data Types” for more information.
The export will fail if the column definitions in the Hadoop table do not exactly match the column definitions in the Oracle table.
The Data Connector for Oracle and Hadoop indicates if it finds temporary tables that it created more than a day ago that still exist. Usually these tables can be dropped. The only circumstance when these tables should not be dropped is when an The Data Connector for Oracle and Hadoop job has been running for more than 24 hours and is still running.
More information is available on the --connect parameter. See “Connect to Oracle / Oracle RAC” for more information.
25.8.5.1. Insert-Export
Appends data to OracleTableName. It does not modify existing data in OracleTableName.
Insert-Export is the default method, executed in the absence of the --update-key parameter. All rows in the HDFS file in /user/UserName/TableName are inserted into OracleTableName. No change is made to pre-existing data in OracleTableName.
$ sqoop export --direct --connect … --table OracleTableName --export-dir /user/username/tablename
[笔记] Note
If OracleTableName was previously created by The Data Connector for Oracle and Hadoop with partitions then this export will create a new partition for the data being inserted.
When creating OracleTableName specify a template. See “Create Oracle Tables” for more information.
25.8.5.2. Update-Export
–update-key OBJECT
Updates existing rows in OracleTableName.
Rows in the HDFS file in /user/UserName/TableName are matched to rows in OracleTableName by the OBJECT column. Rows that match are copied from the HDFS file to the Oracle table. No action is taken on rows that do not match.
$ sqoop export --direct --connect … --update-key OBJECT --table OracleTableName --export-dir /user/username/tablename
[笔记] Note
If OracleTableName was previously created by The Data Connector for Oracle and Hadoop with partitions then this export will create a new partition for the data being inserted. Updated rows will be moved to the new partition that was created for the export.
For performance reasons it is strongly recommended that where more than a few rows are involved column OBJECT be an index column of OracleTableName.
Ensure the column name defined with --update-key OBJECT is specified in the correct letter case. Sqoop will show an error if the letter case is incorrect.
It is possible to match rows via multiple columns. See “Match Rows Via Multiple Columns” for more information.
25.8.5.3. Merge-Export
–update-key OBJECT -Doraoop.export.merge=true
Updates existing rows in OracleTableName. Copies across rows from the HDFS file that do not exist within the Oracle table.
Rows in the HDFS file in /user/UserName/TableName are matched to rows in OracleTableName by the OBJECT column. Rows that match are copied from the HDFS file to the Oracle table. Rows in the HDFS file that do not exist in OracleTableName are added to OracleTableName.
$ sqoop export --direct --connect … --update-key OBJECT -Doraoop.export.merge=true --table OracleTableName --export-dir /user/username/tablename
[笔记] Note
Merge-Export is unique to The Data Connector for Oracle and Hadoop. It is not a standard Sqoop feature.
If OracleTableName was previously created by The Data Connector for Oracle and Hadoop with partitions, then this export will create a new partition for the data being inserted. Updated rows will be moved to the new partition that was created for the export.
For performance reasons it is strongly recommended that where more than a few rows are involved column OBJECT be an index column of OracleTableName.
Ensure the column name defined with --update-key OBJECT is specified in the correct letter case. Sqoop will show an error if the letter case is incorrect.
It is possible to match rows via multiple columns. See “Match Rows Via Multiple Columns” for more information.
25.8.5.4. Create Oracle Tables
-Doraoop.template.table=TemplateTableName
Creates OracleTableName by replicating the structure and data types of TemplateTableName. TemplateTableName is a table that exists in Oracle prior to executing the Sqoop command.
[笔记] Note
The export will fail if the Hadoop file contains any fields of a data type not supported by The Data Connector for Oracle and Hadoop. See “Supported Data Types” for more information.
The export will fail if the column definitions in the Hadoop table do not exactly match the column definitions in the Oracle table.
This parameter is specific to creating an Oracle table. The export will fail if OracleTableName already exists in Oracle.
Example command:
$ sqoop export --direct --connect… --table OracleTableName --export-dir /user/username/tablename -Doraoop.template.table=TemplateTableName
25.8.5.5. NOLOGGING
-Doraoop.nologging=true
Assigns the NOLOGGING option to OracleTableName.
NOLOGGING may enhance performance but you will be unable to backup the table.
25.8.5.6. Partitioning
-Doraoop.partitioned=true
Partitions the table with the following benefits:
The speed of the export is improved by allowing each mapper to insert data into a separate Oracle table using direct path writes. (An alter table exchange subpartition SQL statement is subsequently executed to swap the data into the export table.)
You can selectively query or delete the data inserted by each Sqoop export job. For example, you can delete old data by dropping old partitions from the table.
The partition value is the SYSDATE of when Sqoop export job was performed.
The partitioned table created by The Data Connector for Oracle and Hadoop includes the following columns that don’t exist in the template table:
oraoop_export_sysdate - This is the Oracle SYSDATE when the Sqoop export job was performed. The created table will be partitioned by this column.
oraoop_mapper_id - This is the id of the Hadoop mapper that was used to process the rows from the HDFS file. Each partition is subpartitioned by this column. This column exists merely to facilitate the exchange subpartition mechanism that is performed by each mapper during the export process.
oraoop_mapper_row - A unique row id within the mapper / partition.
[笔记] Note
If a unique row id is required for the table it can be formed by a combination of oraoop_export_sysdate, oraoop_mapper_id and oraoop_mapper_row.
25.8.5.7. Match Rows Via Multiple Columns
-Doraoop.update.key.extra.columns=“ColumnA,ColumnB”
Used with Update-Export and Merge-Export to match on more than one column. The first column to be matched on is --update-key OBJECT. To match on additional columns, specify those columns on this parameter.
[笔记] Note
Letter case for the column names on this parameter is not important.
All columns used for matching should be indexed. The first three items on the index should be ColumnA, ColumnB and the column specified on --update-key - but the order in which the columns are specified is not important.
25.8.5.8. Storage Clauses
-Doraoop.temporary.table.storage.clause=“StorageClause”
-Doraoop.table.storage.clause=“StorageClause”
Use to customize storage with Oracle clauses as in TABLESPACE or COMPRESS
-Doraoop.table.storage.clause applies to the export table that is created from -Doraoop.template.table. See “Create Oracle Tables” for more information. -Doraoop.temporary.table.storage.clause applies to all other working tables that are created during the export process and then dropped at the end of the export job.
25.8.6. Manage Date And Timestamp Data Types
25.8.6.1. Import Date And Timestamp Data Types from Oracle
25.8.6.2. The Data Connector for Oracle and Hadoop Does Not Apply A Time Zone to DATE / TIMESTAMP Data Types
25.8.6.3. The Data Connector for Oracle and Hadoop Retains Time Zone Information in TIMEZONE Data Types
25.8.6.4. Data Connector for Oracle and Hadoop Explicitly States Time Zone for LOCAL TIMEZONE Data Types
25.8.6.5. java.sql.Timestamp
25.8.6.6. Export Date And Timestamp Data Types into Oracle
25.8.6.1. Import Date And Timestamp Data Types from Oracle
This section lists known differences in the data obtained by performing an Data Connector for Oracle and Hadoop import of an Oracle table versus a native Sqoop import of the same table.
25.8.6.2. The Data Connector for Oracle and Hadoop Does Not Apply A Time Zone to DATE / TIMESTAMP Data Types
Data stored in a DATE or TIMESTAMP column of an Oracle table is not associated with a time zone. Sqoop without the Data Connector for Oracle and Hadoop inappropriately applies time zone information to this data.
Take for example the following timestamp in an Oracle DATE or TIMESTAMP column: 2am on 3rd October, 2010.
Request Sqoop without the Data Connector for Oracle and Hadoop import this data using a system located in Melbourne Australia. The data is adjusted to Melbourne Daylight Saving Time. The data is imported into Hadoop as: 3am on 3rd October, 2010.
The Data Connector for Oracle and Hadoop does not apply time zone information to these Oracle data-types. Even from a system located in Melbourne Australia, The Data Connector for Oracle and Hadoop ensures the Oracle and Hadoop timestamps match. The Data Connector for Oracle and Hadoop correctly imports this timestamp as: 2am on 3rd October, 2010.
[笔记] Note
In order for The Data Connector for Oracle and Hadoop to ensure data accuracy, Oracle DATE and TIMESTAMP values must be represented by a String, even when --as-sequencefile is used on the Sqoop command-line to produce a binary file in Hadoop.
25.8.6.3. The Data Connector for Oracle and Hadoop Retains Time Zone Information in TIMEZONE Data Types
Data stored in a TIMESTAMP WITH TIME ZONE column of an Oracle table is associated with a time zone. This data consists of two distinct parts: when the event occurred and where the event occurred.
When Sqoop without The Data Connector for Oracle and Hadoop is used to import data it converts the timestamp to the time zone of the system running Sqoop and omits the component of the data that specifies where the event occurred.
Take for example the following timestamps (with time zone) in an Oracle TIMESTAMP WITH TIME ZONE column:
2:59:00 am on 4th April, 2010. Australia/Melbourne
2:59:00 am on 4th April, 2010. America/New York
Request Sqoop without The Data Connector for Oracle and Hadoop import this data using a system located in Melbourne Australia. From the data imported into Hadoop we know when the events occurred, assuming we know the Sqoop command was run from a system located in the Australia/Melbourne time zone, but we have lost the information regarding where the event occurred.
2010-04-04 02:59:00.0
2010-04-04 16:59:00.0
Sqoop with The Data Connector for Oracle and Hadoop imports the example timestamps as follows. The Data Connector for Oracle and Hadoop retains the time zone portion of the data.
2010-04-04 02:59:00.0 Australia/Melbourne
2010-04-04 02:59:00.0 America/New_York
25.8.6.4. Data Connector for Oracle and Hadoop Explicitly States Time Zone for LOCAL TIMEZONE Data Types
Data stored in a TIMESTAMP WITH LOCAL TIME ZONE column of an Oracle table is associated with a time zone. Multiple end-users in differing time zones (locales) will each have that data expressed as a timestamp within their respective locale.
When Sqoop without the Data Connector for Oracle and Hadoop is used to import data it converts the timestamp to the time zone of the system running Sqoop and omits the component of the data that specifies location.
Take for example the following two timestamps (with time zone) in an Oracle TIMESTAMP WITH LOCAL TIME ZONE column:
2:59:00 am on 4th April, 2010. Australia/Melbourne
2:59:00 am on 4th April, 2010. America/New York
Request Sqoop without the Data Connector for Oracle and Hadoop import this data using a system located in Melbourne Australia. The timestamps are imported correctly but the local time zone has to be guessed. If multiple systems in different locale were executing the Sqoop import it would be very difficult to diagnose the cause of the data corruption.
2010-04-04 02:59:00.0
2010-04-04 16:59:00.0
Sqoop with the Data Connector for Oracle and Hadoop explicitly states the time zone portion of the data imported into Hadoop. The local time zone is GMT by default. You can set the local time zone with parameter:
-Doracle.sessionTimeZone=Australia/Melbourne
The Data Connector for Oracle and Hadoop would import these two timestamps as:
2010-04-04 02:59:00.0 Australia/Melbourne
2010-04-04 16:59:00.0 Australia/Melbourne
25.8.6.5. java.sql.Timestamp
To use Sqoop’s handling of date and timestamp data types when importing data from Oracle use the following parameter:
-Doraoop.timestamp.string=false
[笔记] Note
Sqoop’s handling of date and timestamp data types does not store the timezone. However, some developers may prefer Sqoop’s handling as the Data Connector for Oracle and Hadoop converts date and timestamp data types to string. This may not work for some developers as the string will require parsing later in the workflow.
25.8.6.6. Export Date And Timestamp Data Types into Oracle
Ensure the data in the HDFS file fits the required format exactly before using Sqoop to export the data into Oracle.
[笔记] Note
The Sqoop export command will fail if the data is not in the required format.
ff = Fractional second
TZR = Time Zone Region
Oracle Data Type Required Format of The Data in the HDFS File
DATE yyyy-mm-dd hh24:mi:ss
TIMESTAMP yyyy-mm-dd hh24:mi:ss.ff
TIMESTAMPTZ yyyy-mm-dd hh24:mi:ss.ff TZR
TIMESTAMPLTZ yyyy-mm-dd hh24:mi:ss.ff TZR
25.8.7. Configure The Data Connector for Oracle and Hadoop
25.8.7.1. oraoop-site-template.xml
25.8.7.2. oraoop.oracle.session.initialization.statements
25.8.7.3. oraoop.table.import.where.clause.location
25.8.7.4. oracle.row.fetch.size
25.8.7.5. oraoop.import.hint
25.8.7.6. oraoop.oracle.append.values.hint.usage
25.8.7.7. mapred.map.tasks.speculative.execution
25.8.7.8. oraoop.block.allocation
25.8.7.9. oraoop.import.omit.lobs.and.long
25.8.7.10. oraoop.locations
25.8.7.11. sqoop.connection.factories
25.8.7.12. Expressions in oraoop-site.xml
25.8.7.1. oraoop-site-template.xml
The oraoop-site-template.xml file is supplied with the Data Connector for Oracle and Hadoop. It contains a number of ALTER SESSION statements that are used to initialize the Oracle sessions created by the Data Connector for Oracle and Hadoop.
If you need to customize these initializations to your environment then:
Find oraoop-site-template.xml in the Sqoop configuration directory.
Copy oraoop-site-template.xml to oraoop-site.xml.
Edit the ALTER SESSION statements in oraoop-site.xml.
25.8.7.2. oraoop.oracle.session.initialization.statements
The value of this property is a semicolon-delimited list of Oracle SQL statements. These statements are executed, in order, for each Oracle session created by the Data Connector for Oracle and Hadoop.
The default statements include:
alter session set time_zone = ‘{oracle.sessionTimeZone|GMT}’;
This statement initializes the timezone of the JDBC client. This ensures that data from columns of type TIMESTAMP WITH LOCAL TIMEZONE are correctly adjusted into the timezone of the client and not kept in the timezone of the Oracle database.
[笔记] Note
There is an explanation to the text within the curly-braces. See “Expressions in oraoop-site.xml” for more information…
A list of the time zones supported by your Oracle database is available by executing the following query: SELECT TZNAME FROM V$TIMEZONE_NAMES;
alter session disable parallel query;
This statement instructs Oracle to not parallelize SQL statements executed by the Data Connector for Oracle and Hadoop sessions. This Oracle feature is disabled because the Map/Reduce job launched by Sqoop is the mechanism used for parallelization.
It is recommended that you not enable parallel query because it can have an adverse effect the load on the Oracle instance and on the balance between the Data Connector for Oracle and Hadoop mappers.
Some export operations are performed in parallel where deemed appropriate by the Data Connector for Oracle and Hadoop. See “Parallelization” for more information.
alter session set “_serial_direct_read”=true;
This statement instructs Oracle to bypass the buffer cache. This is used to prevent Oracle from filling its buffers with the data being read by the Data Connector for Oracle and Hadoop, therefore diminishing its capacity to cache higher prioritized data. Hence, this statement is intended to minimize the Data Connector for Oracle and Hadoop’s impact on the immediate future performance of the Oracle database.
–alter session set events ‘10046 trace name context forever, level 8’;
This statement has been commented-out. To allow tracing, remove the comment token “–” from the start of the line.
[笔记] Note
These statements are placed on separate lines for readability. They do not need to be placed on separate lines.
A statement can be commented-out via the standard Oracle double-hyphen token: “–”. The comment takes effect until the next semicolon.
25.8.7.3. oraoop.table.import.where.clause.location
SUBSPLIT (default)
When set to this value, the where clause is applied to each subquery used to retrieve data from the Oracle table.
A Sqoop command like:
sqoop import -D oraoop.table.import.where.clause.location=SUBSPLIT --table JUNK --where “owner like ‘G%’”
Generates SQL query of the form:
SELECT OWNER,OBJECT_NAME
FROM JUNK
WHERE ((rowid >=
dbms_rowid.rowid_create(1, 113320, 1024, 4223664, 0)
AND rowid <=
dbms_rowid.rowid_create(1, 113320, 1024, 4223671, 32767)))
AND (owner like ‘G%’)
UNION ALL
SELECT OWNER,OBJECT_NAME
FROM JUNK
WHERE ((rowid >=
dbms_rowid.rowid_create(1, 113320, 1024, 4223672, 0)
AND rowid <=
dbms_rowid.rowid_create(1, 113320, 1024, 4223679, 32767)))
AND (owner like ‘G%’)
SPLIT
When set to this value, the where clause is applied to the entire SQL statement used by each split/mapper.
A Sqoop command like:
sqoop import -D oraoop.table.import.where.clause.location=SPLIT --table JUNK --where “rownum ⇐ 10”
Generates SQL query of the form:
SELECT OWNER,OBJECT_NAME
FROM (
SELECT OWNER,OBJECT_NAME
FROM JUNK
WHERE ((rowid >=
dbms_rowid.rowid_create(1, 113320, 1024, 4223664, 0)
AND rowid <=
dbms_rowid.rowid_create(1, 113320, 1024, 4223671, 32767)))
UNION ALL
SELECT OWNER,OBJECT_NAME
FROM JUNK
WHERE ((rowid >=
dbms_rowid.rowid_create(1, 113320, 1024, 4223672, 0)
AND rowid <=
dbms_rowid.rowid_create(1, 113320, 1024, 4223679,32767)))
)
WHERE rownum <= 10
[笔记] Note
In this example, there are up to 10 rows imported per mapper.
The SPLIT clause may result in greater overhead than the SUBSPLIT clause because the UNION statements need to be fully materialized before the data can be streamed to the mappers. However, you may wish to use SPLIT in the case where you want to limit the total number of rows processed by each mapper.
25.8.7.4. oracle.row.fetch.size
The value of this property is an integer specifying the number of rows the Oracle JDBC driver should fetch in each network round-trip to the database. The default value is 5000.
If you alter this setting, confirmation of the change is displayed in the logs of the mappers during the Map-Reduce job.
25.8.7.5. oraoop.import.hint
The Oracle optimizer hint is added to the SELECT statement for IMPORT jobs as follows:
SELECT /*+ NO_INDEX(t) */ * FROM employees;
The default hint is NO_INDEX(t)
[笔记] Note
The hint can be added to the command line. See “Import Data from Oracle” for more information.
See the Oracle Database Performance Tuning Guide (Using Optimizer Hints) for more information on Oracle optimizer hints.
To turn the hint off, insert a space between the elements.
AUTO
AUTO is the default value.
Currently AUTO is equivalent to OFF.
ON
During export the Data Connector for Oracle and Hadoop uses direct path writes to populate the target Oracle table, bypassing the buffer cache. Oracle only allows a single session to perform direct writes against a specific table at any time, so this has the effect of serializing the writes to the table. This may reduce throughput, especially if the number of mappers is high. However, for databases where DBWR is very busy, or where the IO bandwidth to the underlying table is narrow (table resides on a single disk spindle for instance), then setting oraoop.oracle.append.values.hint.usage to ON may reduce the load on the Oracle database and possibly increase throughput.
OFF
During export the Data Connector for Oracle and Hadoop does not use the APPEND_VALUES Oracle hint.
[笔记] Note
This parameter is only effective on Oracle 11g Release 2 and above.
25.8.7.7. mapred.map.tasks.speculative.execution
By default speculative execution is disabled for the Data Connector for Oracle and Hadoop. This avoids placing redundant load on the Oracle database.
If Speculative execution is enabled, then Hadoop may initiate multiple mappers to read the same blocks of data, increasing the overall load on the database.
25.8.7.8. oraoop.block.allocation
This setting determines how Oracle’s data-blocks are assigned to Map-Reduce mappers.
[笔记] Note
Applicable to import. Not applicable to export.
ROUNDROBIN (default)
Each chunk of Oracle blocks is allocated to the mappers in a roundrobin manner. This helps prevent one of the mappers from being allocated a large proportion of typically small-sized blocks from the start of Oracle data-files. In doing so it also helps prevent one of the other mappers from being allocated a large proportion of typically larger-sized blocks from the end of the Oracle data-files.
Use this method to help ensure all the mappers are allocated a similar amount of work.
RANDOM
The list of Oracle blocks is randomized before being allocated to the mappers via a round-robin approach. This has the benefit of increasing the chance that, at any given instant in time, each mapper is reading from a different Oracle data-file. If the Oracle data-files are located on separate spindles, this should increase the overall IO throughput.
SEQUENTIAL
Each chunk of Oracle blocks is allocated to the mappers sequentially. This produces the tendency for each mapper to sequentially read a large, contiguous proportion of an Oracle data-file. It is unlikely for the performance of this method to exceed that of the round-robin method and it is more likely to allocate a large difference in the work between the mappers.
Use of this method is generally not recommended.
25.8.7.9. oraoop.import.omit.lobs.and.long
This setting can be used to omit all LOB columns (BLOB, CLOB and NCLOB) and LONG column from an Oracle table being imported. This is advantageous in troubleshooting, as it provides a convenient way to exclude all LOB-based data from the import.
25.8.7.10. oraoop.locations
[笔记] Note
Applicable to import. Not applicable to export.
By default, four mappers are used for a Sqoop import job. The number of mappers can be altered via the Sqoop --num-mappers parameter.
If the data-nodes in your Hadoop cluster have 4 task-slots (that is they are 4-CPU core machines) it is likely for all four mappers to execute on the same machine. Therefore, IO may be concentrated between the Oracle database and a single machine.
This setting allows you to control which DataNodes in your Hadoop cluster each mapper executes on. By assigning each mapper to a separate machine you may improve the overall IO performance for the job. This will also have the side-effect of the imported data being more diluted across the machines in the cluster. (HDFS replication will dilute the data across the cluster anyway.)
Specify the machine names as a comma separated list. The locations are allocated to each of the mappers in a round-robin manner.
If using EC2, specify the internal name of the machines. Here is an example of using this parameter from the Sqoop command-line:
$ sqoop import -D oraoop.locations=ip-10-250-23-225.ec2.internal,ip-10-250-107-32.ec2.internal,ip-10-250-207-2.ec2.internal,ip-10-250-27-114.ec2.internal --direct --connect…
25.8.7.11. sqoop.connection.factories
This setting determines behavior if the Data Connector for Oracle and Hadoop cannot accept the job. By default Sqoop accepts the jobs that the Data Connector for Oracle and Hadoop rejects.
Set the value to org.apache.sqoop.manager.oracle.OraOopManagerFactory when you want the job to fail if the Data Connector for Oracle and Hadoop cannot accept the job.
25.8.7.12. Expressions in oraoop-site.xml
Text contained within curly-braces { and } are expressions to be evaluated prior to the SQL statement being executed. The expression contains the name of the configuration property optionally followed by a default value to use if the property has not been set. A pipe | character is used to delimit the property name and the default value.
For example:
When this Sqoop command is executed
$ sqoop import -D oracle.sessionTimeZone=US/Hawaii --direct --connect
The statement within oraoop-site.xml
alter session set time_zone =’{oracle.sessionTimeZone|GMT}’;
Becomes
alter session set time_zone = ‘US/Hawaii’
If the oracle.sessionTimeZone property had not been set, then this statement would use the specified default value and would become
alter session set time_zone = ‘GMT’
[笔记] Note
The oracle.sessionTimeZone property can be specified within the sqoop-site.xml file if you want this setting to be used all the time.
25.8.8. Troubleshooting The Data Connector for Oracle and Hadoop
25.8.8.1. Quote Oracle Owners And Tables
25.8.8.2. Quote Oracle Columns
25.8.8.3. Confirm The Data Connector for Oracle and Hadoop Can Initialize The Oracle Session
25.8.8.4. Check The Sqoop Debug Logs for Error Messages
25.8.8.5. Export: Check Tables Are Compatible
25.8.8.6. Export: Parallelization
25.8.8.7. Export: Check oraoop.oracle.append.values.hint.usage
25.8.8.8. Turn On Verbose
25.8.8.1. Quote Oracle Owners And Tables
If the owner of the Oracle table needs to be
quoted, use:
$ sqoop import … --table
“”“Scott”.customers""
This is the equivalent of:
“Scott”.customers
If the Oracle table needs to be quoted, use:
$ sqoop import … --table
““scott.“Customers”””
This is the equivalent of:
scott.“Customers”
If both the owner of the Oracle table and the
table itself needs to be quoted, use:
$ sqoop import … --table
“”“Scott”.“Customers”""
This is the equivalent of:
“Scott”.“Customers”
[笔记] Note
The HDFS output directory is called something like: /user/username/“Scott”.“Customers”
If a table name contains a $ character, it may need to be escaped within your Unix shell. For example, the dr$object table in the ctxsys schema would be referred to as: $ sqoop import … --table “ctxsys.dr$object”
25.8.8.2. Quote Oracle Columns
If a column name of an Oracle table needs to be quoted, use
$ sqoop import … --table customers --columns “”“first name”""
This is the equivalent of: select “first name” from customers
If multiple columns need to be quoted, use
$ sqoop import … --table customers --columns “”“first name”,“last name”,“region name”""
This is the equivalent of: select “first name”, “last name”, “region name” from customers
25.8.8.3. Confirm The Data Connector for Oracle and Hadoop Can Initialize The Oracle Session
If the Sqoop output includes feedback such as the following then the configuration properties contained within oraoop-site-template.xml and oraoop-site.xml have been loaded by Hadoop and can be accessed by the Data Connector for Oracle and Hadoop.
14/07/08 15:21:13 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL
25.8.8.4. Check The Sqoop Debug Logs for Error Messages
For more information about any errors encountered during the Sqoop import, refer to the log files generated by each of the (by default 4) mappers that performed the import.
The logs can be obtained via your Map-Reduce Job Tracker’s web page.
Include these log files with any requests you make for assistance on the Sqoop User Group web site.
25.8.8.5. Export: Check Tables Are Compatible
Check tables particularly in the case of a parsing error.
Ensure the fields contained with the HDFS file and the columns within the Oracle table are identical. If they are not identical, the Java code dynamically generated by Sqoop to parse the HDFS file will throw an error when reading the file – causing the export to fail. When creating a table in Oracle ensure the definitions for the table template are identical to the definitions for the HDFS file.
Ensure the data types in the table are supported. See “Supported Data Types” for more information.
Are date and time zone based data types used? See “Export Date And Timestamp Data Types into Oracle” for more information.
25.8.8.6. Export: Parallelization
-D oraoop.export.oracle.parallelization.enabled=false
If you see a parallelization error you may decide to disable parallelization on Oracle queries.
25.8.8.7. Export: Check oraoop.oracle.append.values.hint.usage
The oraoop.oracle.append.values.hint.usage parameter should not be set to ON if the Oracle table contains either a BINARY_DOUBLE or BINARY_FLOAT column and the HDFS file being exported contains a NULL value in either of these column types. Doing so will result in the error: ORA-12838: cannot read/modify an object after modifying it in parallel.
25.8.8.8. Turn On Verbose
Turn on verbose on the Sqoop command line.
–verbose
Check Sqoop stdout (standard output) and the mapper logs for information as to where the problem may be.
Report bugs in Sqoop to the issue tracker at https://issues.apache.org/jira/browse/SQOOP.
Questions and discussion regarding the usage of Sqoop should be directed to the sqoop-user mailing list.
Before contacting either forum, run your Sqoop job with the --verbose flag to acquire as much debugging information as possible. Also report the string returned by sqoop version as well as the version of Hadoop you are running (hadoop version).
Turn on verbose output by executing the same command again and specifying the --verbose option. This produces more debug output on the console which can be inspected to identify any obvious errors.
Look at the task logs from Hadoop to see if there are any specific failures recorded there. It is possible that the failure that occurs while task execution is not relayed correctly to the console.
Make sure that the necessary input files or input/output tables are present and can be accessed by the user that Sqoop is executing as or connecting to the database as. It is possible that the necessary files or tables are present but the specific user that Sqoop connects as does not have the necessary permissions to access these files.
If you are doing a compound action such as populating a Hive table or partition, try breaking the job into two separate actions to see where the problem really occurs. For example if an import that creates and populates a Hive table is failing, you can break it down into two steps - first for doing the import alone, and the second to create a Hive table without the import using the create-hive-table tool. While this does not address the original use-case of populating the Hive table, it does help narrow down the problem to either regular import or during the creation and population of Hive table.
Search the mailing lists archives and JIRA for keywords relating to the problem. It is possible that you may find a solution discussed there that will help you solve or work-around your problem.
27.2. Specific Troubleshooting Tips
27.2.1. Oracle: Connection Reset Errors
27.2.2. Oracle: Case-Sensitive Catalog Query Errors
27.2.3. MySQL: Connection Failure
27.2.4. Oracle: ORA-00933 error (SQL command not properly ended)
27.2.5. MySQL: Import of TINYINT(1) from MySQL behaves strangely
27.2.1. Oracle: Connection Reset Errors
Problem: When using the default Sqoop connector for Oracle, some data does get transferred, but during the map-reduce job a lot of errors are reported as below:
11/05/26 16:23:47 INFO mapred.JobClient: Task Id : attempt_201105261333_0002_m_000002_0, Status : FAILED
java.lang.RuntimeException: java.lang.RuntimeException: java.sql.SQLRecoverableException: IO Error: Connection reset
at com.cloudera.sqoop.mapreduce.db.DBInputFormat.setConf(DBInputFormat.java:164)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:605)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.RuntimeException: java.sql.SQLRecoverableException: IO Error: Connection reset
at com.cloudera.sqoop.mapreduce.db.DBInputFormat.getConnection(DBInputFormat.java:190)
at com.cloudera.sqoop.mapreduce.db.DBInputFormat.setConf(DBInputFormat.java:159)
… 9 more
Caused by: java.sql.SQLRecoverableException: IO Error: Connection reset
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:428)
at oracle.jdbc.driver.PhysicalConnection.(PhysicalConnection.java:536)
at oracle.jdbc.driver.T4CConnection.(T4CConnection.java:228)
at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:32)
at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:521)
at java.sql.DriverManager.getConnection(DriverManager.java:582)
at java.sql.DriverManager.getConnection(DriverManager.java:185)
at com.cloudera.sqoop.mapreduce.db.DBConfiguration.getConnection(DBConfiguration.java:152)
at com.cloudera.sqoop.mapreduce.db.DBInputFormat.getConnection(DBInputFormat.java:184)
… 10 more
Caused by: java.net.SocketException: Connection reset
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at oracle.net.ns.DataPacket.send(DataPacket.java:199)
at oracle.net.ns.NetOutputStream.flush(NetOutputStream.java:211)
at oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:227)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:175)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:100)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:85)
at oracle.jdbc.driver.T4CSocketInputStreamWrapper.readNextPacket(T4CSocketInputStreamWrapper.java:123)
at oracle.jdbc.driver.T4CSocketInputStreamWrapper.read(T4CSocketInputStreamWrapper.java:79)
at oracle.jdbc.driver.T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1122)
at oracle.jdbc.driver.T4CMAREngine.unmarshalSB1(T4CMAREngine.java:1099)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:288)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:191)
at oracle.jdbc.driver.T4CTTIoauthenticate.doOAUTH(T4CTTIoauthenticate.java:366)
at oracle.jdbc.driver.T4CTTIoauthenticate.doOAUTH(T4CTTIoauthenticate.java:752)
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:366)
… 18 more
Solution: This problem occurs primarily due to the lack of a fast random number generation device on the host where the map tasks execute. On typical Linux systems this can be addressed by setting the following property in the java.security file:
securerandom.source=file:/dev/…/dev/urandom
The java.security file can be found under $JAVA_HOME/jre/lib/security directory. Alternatively, this property can also be specified on the command line via:
-D mapred.child.java.opts="-Djava.security.egd=file:/dev/…/dev/urandom"
Please note that it’s very important to specify this weird path /dev/…/dev/urandom as it is due to a Java bug 6202721, or /dev/urandom will be ignored and substituted by /dev/random.
27.2.2. Oracle: Case-Sensitive Catalog Query Errors
Problem: While working with Oracle you may encounter problems when Sqoop can not figure out column names. This happens because the catalog queries that Sqoop uses for Oracle expect the correct case to be specified for the user name and table name.
One example, using --hive-import and resulting in a NullPointerException:
1/09/21 17:18:49 INFO manager.OracleManager: Time zone has been set to
GMT
11/09/21 17:18:49 DEBUG manager.SqlManager: Using fetchSize for next
query: 1000
11/09/21 17:18:49 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM addlabel_pris t WHERE 1=0
11/09/21 17:18:49 DEBUG manager.OracleManager$ConnCache: Caching
released connection for jdbc:oracle:thin:
11/09/21 17:18:49 ERROR sqoop.Sqoop: Got exception running Sqoop:
java.lang.NullPointerException
java.lang.NullPointerException
at com.cloudera.sqoop.hive.TableDefWriter.getCreateTableStmt(TableDefWriter.java:148)
at com.cloudera.sqoop.hive.HiveImport.importTable(HiveImport.java:187)
at com.cloudera.sqoop.tool.ImportTool.importTable(ImportTool.java:362)
at com.cloudera.sqoop.tool.ImportTool.run(ImportTool.java:423)
at com.cloudera.sqoop.Sqoop.run(Sqoop.java:144)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.cloudera.sqoop.Sqoop.runSqoop(Sqoop.java:180)
at com.cloudera.sqoop.Sqoop.runTool(Sqoop.java:219)
at com.cloudera.sqoop.Sqoop.runTool(Sqoop.java:228)
at com.cloudera.sqoop.Sqoop.main(Sqoop.java:237)
Solution:
Specify the user name, which Sqoop is connecting as, in upper case (unless it was created with mixed/lower case within quotes).
Specify the table name, which you are working with, in upper case (unless it was created with mixed/lower case within quotes).
27.2.3. MySQL: Connection Failure
Problem: While importing a MySQL table into Sqoop, if you do not have the necessary permissions to access your MySQL database over the network, you may get the below connection failure.
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
Solution: First, verify that you can connect to the database from the node where you are running Sqoop:
$ mysql --host= --database=test --user= --password=
If this works, it rules out any problem with the client network configuration or security/authentication configuration.
Add the network port for the server to your my.cnf file /etc/my.cnf:
[mysqld]
port = xxxx
Set up a user account to connect via Sqoop. Grant permissions to the user to access the database over the network: (1.) Log into MySQL as root mysql -u root -p. (2.) Issue the following command:
mysql> grant all privileges on test.* to ‘testuser’@’%’ identified by ‘testpassword’
Note that doing this will enable the testuser to connect to the MySQL server from any IP address. While this will work, it is not advisable for a production environment. We advise consulting with your DBA to grant the necessary privileges based on the setup topology.
If the database server’s IP address changes, unless it is bound to a static hostname in your server, the connect string passed into Sqoop will also need to be changed.
27.2.4. Oracle: ORA-00933 error (SQL command not properly ended)
Problem: While working with Oracle you may encounter the below problem when the Sqoop command explicitly specifies the --driver option. When the driver option is included in the Sqoop command, the built-in connection manager selection defaults to the generic connection manager, which causes this issue with Oracle. If the driver option is not specified, the built-in connection manager selection mechanism selects the Oracle specific connection manager which generates valid SQL for Oracle and uses the driver “oracle.jdbc.OracleDriver”.
ERROR manager.SqlManager: Error executing statement:
java.sql.SQLSyntaxErrorException: ORA-00933: SQL command not properly ended
Solution: Omit the option --driver oracle.jdbc.driver.OracleDriver and then re-run the Sqoop command.
27.2.5. MySQL: Import of TINYINT(1) from MySQL behaves strangely
Problem: Sqoop is treating TINYINT(1) columns as booleans, which is for example causing issues with HIVE import. This is because by default the MySQL JDBC connector maps the TINYINT(1) to java.sql.Types.BIT, which Sqoop by default maps to Boolean.
解决方案:一个更干净的解决方案是通过添加tinyInt1isBit=false到您的 JDBC 路径(创建类似jdbc:mysql://localhost/test?tinyInt1isBit=false)来强制 MySQL JDBC 连接器停止将 TINYINT(1) 转换为 java.sql.Types.BIT 。另一种解决方案是显式覆盖数据类型 TINYINT(1) 列的列映射。例如,如果列名称为 foo,则在导入期间将以下选项传递给 Sqoop:–map-column-hive foo=tinyint。在非 Hive 导入到 HDFS 的情况下,使用 --map-column-java foo=integer。
文档主页
本文档是从位于https://git-wip-us.apache.org/repos/asf?p=sqoop.git 的Sqoop 源构建的 。