Sqoop Chinese documentation: User Guide, part 7

22. Compatibility Notes

22.1. Supported Databases
22.2. MySQL
22.2.1. zeroDateTimeBehavior
22.2.2. UNSIGNED columns
22.2.3. BLOB and CLOB columns
22.2.4. Importing views in direct mode
22.2.5. Direct-mode Transactions
22.3. PostgreSQL
22.3.1. Importing views in direct mode
22.4. Oracle
22.4.1. Dates and Times
22.5. Schema Definition in Hive

Sqoop uses JDBC to connect to databases and adheres to published standards as much as possible. For databases which do not support standards-compliant SQL, Sqoop uses alternate codepaths to provide functionality. In general, Sqoop is believed to be compatible with a large number of databases, but it is tested with only a few.

Translator's question: what are "codepaths"?

Nonetheless, several database-specific decisions were made in the implementation of Sqoop, and some databases offer additional settings which are extensions to the standard.

This section describes the databases tested with Sqoop, any exceptions in Sqoop’s handling of each database relative to the norm, and any database-specific settings available in Sqoop.

While JDBC is a compatibility layer that allows a program to access many different databases through a common API, slight differences in the SQL language spoken by each database may mean that Sqoop can’t use every database out of the box, or that some databases may be used in an inefficient manner.

When you provide a connect string to Sqoop, it inspects the protocol scheme to determine the appropriate vendor-specific logic to use. If Sqoop knows about a given database, it will work automatically. If not, you may need to specify the driver class to load via --driver. This will use a generic code path which will use standard SQL to access the database. Sqoop provides some databases with faster, non-JDBC-based access mechanisms. These can be enabled by specifying the --direct parameter.

Translator's note: a connect string looks like, for example, jdbc:mysql://192.168.2.104:3306/qun.
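A minimal sketch of the generic code path described above. The driver class com.example.jdbc.Driver, the jdbc:exampledb:// scheme, the host, and the table name are hypothetical placeholders, not values from the guide. The command is only printed so it can be reviewed first; clearing DRY_RUN would run it.

```shell
# Hypothetical driver class and connect string (placeholders for illustration).
DRIVER_CLASS="com.example.jdbc.Driver"
CONNECT="jdbc:exampledb://db.example.com/someDb"

# DRY_RUN=echo prints the command; set DRY_RUN="" to actually execute it.
DRY_RUN="echo"

# --driver names the JDBC driver class explicitly, so Sqoop uses its
# generic, standard-SQL code path for a database it does not recognise.
$DRY_RUN sqoop import --driver "$DRIVER_CLASS" --connect "$CONNECT" --table foo
```

Note that --driver takes a class name, not a jar file; the jar containing that class still has to be available to Sqoop.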

Sqoop includes vendor-specific support for the following databases:

Sqoop may work with older versions of the databases listed, but we have only tested it with the versions specified above.

Even if Sqoop supports a database internally, you may still need to install the database vendor’s JDBC driver in your $SQOOP_HOME/lib path on your client. Sqoop can load classes from any jars in $SQOOP_HOME/lib on the client and will use them as part of any MapReduce jobs it runs; unlike older versions, you no longer need to install JDBC jars in the Hadoop library path on your servers.

Translator's question: a lot is unclear here. Is Sqoop split into a client side and a server side? Does Sqoop ship the driver jar itself, so that it only has to be installed on the client, or is the jar not provided at all and has to be installed separately?
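As the paragraph above says, the vendor's driver jar may have to be copied into $SQOOP_HOME/lib on the client machine. The sketch below shows that step against a throwaway directory so it can be run safely anywhere; DEMO_HOME stands in for $SQOOP_HOME, and the empty file stands in for a jar downloaded from the vendor (both are assumptions for illustration).

```shell
# Throwaway directory standing in for $SQOOP_HOME (assumption for this demo).
DEMO_HOME=$(mktemp -d)
mkdir -p "$DEMO_HOME/lib" "$DEMO_HOME/downloads"

# Jar name taken from the MySQL section of this guide.
JAR="mysql-connector-java-5.1.13-bin.jar"

# Empty placeholder for the jar you would download from the vendor.
: > "$DEMO_HOME/downloads/$JAR"

# The actual installation step: copy the driver jar into the lib directory.
cp "$DEMO_HOME/downloads/$JAR" "$DEMO_HOME/lib/"
ls "$DEMO_HOME/lib"
```

On a real installation the copy target would be "$SQOOP_HOME/lib/" on the machine where the sqoop command is run.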


22.2. MySQL

22.2.1. zeroDateTimeBehavior
22.2.2. UNSIGNED columns
22.2.3. BLOB and CLOB columns
22.2.4. Importing views in direct mode
22.2.5. Direct-mode Transactions

JDBC Driver: MySQL Connector/J (driver download link)

MySQL v5.0 and above offers very thorough coverage by Sqoop. Sqoop has been tested with mysql-connector-java-5.1.13-bin.jar.

22.2.1. zeroDateTimeBehavior

MySQL allows values of '0000-00-00' for DATE columns, which is a non-standard extension to SQL. When communicated via JDBC, these values are handled in one of three different ways:

  • Convert to NULL.

  • Throw an exception in the client.

  • Round to the nearest legal date ('0001-01-01').

You specify the behavior by using the zeroDateTimeBehavior property of the connect string. If a zeroDateTimeBehavior property is not specified, Sqoop uses the convertToNull behavior.

You can override this behavior. For example:

$ sqoop import --table foo \
    --connect "jdbc:mysql://db.example.com/someDb?zeroDateTimeBehavior=round"

22.2.2. UNSIGNED columns

Columns with type UNSIGNED in MySQL can hold values between 0 and 2^32 - 1 (4294967295), but the database will report the data type to Sqoop as INTEGER, which can hold values between -2147483648 and +2147483647. Sqoop cannot currently import UNSIGNED values above 2147483647.
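The two ranges above can be checked with plain shell arithmetic. This is only an illustration of the limits, assuming a shell with 64-bit arithmetic; the sample value 3000000000 is an arbitrary choice, not from the guide.

```shell
# Largest value of a 32-bit unsigned column and of a 32-bit signed INTEGER.
MAX_UNSIGNED=$(( (1 << 32) - 1 ))
MAX_SIGNED=$(( (1 << 31) - 1 ))
echo "unsigned max: $MAX_UNSIGNED"
echo "signed max:   $MAX_SIGNED"

# A value in the upper half of the unsigned range does not fit in INTEGER,
# which is why Sqoop cannot import it.
VALUE=3000000000
if [ "$VALUE" -gt "$MAX_SIGNED" ]; then
  echo "$VALUE exceeds the signed INTEGER range"
fi
```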

22.2.3. BLOB and CLOB columns

Sqoop’s direct mode does not support imports of BLOB, CLOB, or LONGVARBINARY columns. Use JDBC-based imports for these columns; do not supply the --direct argument to the import tool.


22.2.4. Importing views in direct mode

Sqoop does not currently support importing from a view in direct mode. Use JDBC-based (non-direct) mode if you need to import a view (simply omit the --direct parameter).

22.2.5. Direct-mode Transactions

For performance, each writer will commit the current transaction approximately every 32 MB of exported data. You can control this by specifying the following argument before any tool-specific arguments: -D sqoop.mysql.export.checkpoint.bytes=size, where size is a value in bytes. Set size to 0 to disable intermediate checkpoints, but individual files being exported will continue to be committed independently of one another.

Translator's note: the import and export discussed here refer to the transfer from the HDFS side to the externally running MySQL server.


Sometimes you need to export large data with Sqoop to a live MySQL cluster that is under a high load serving random queries from the users of your application. While data consistency issues during the export can be easily solved with a staging table, there is still a problem with the performance impact caused by the heavy export.

Translator's question: the passage above describes the impact of imports and exports. Does that impact occur only in direct mode?

Answer: no, there is an impact outside direct mode as well.

Translator's question: is the performance impact on MySQL or on the Hadoop cluster?

Answer: both are affected.

First off, the resources of MySQL dedicated to the import process can affect the performance of the live product, both on the master and on the slaves. Second, even if the servers can handle the import with no significant performance impact (mysqlimport should be relatively "cheap"), importing big tables can cause serious replication lag in the cluster, risking data inconsistency.

With -D sqoop.mysql.export.sleep.ms=time, where time is a value in milliseconds, you can let the server relax between checkpoints and the replicas catch up by pausing the export process after transferring the number of bytes specified in sqoop.mysql.export.checkpoint.bytes. Experiment with different settings of these two parameters to achieve an export pace that doesn’t endanger the stability of your MySQL cluster.

[Important]

Note that any arguments to Sqoop that are of the form -D parameter=value are Hadoop generic arguments and must appear before any tool-specific arguments (for example, --connect, --table, etc). Don’t forget that these parameters are only supported with the --direct flag set.
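Putting the two throttling parameters and the ordering rule together, a direct-mode export might be sketched as follows. The 16 MB checkpoint size, the 500 ms pause, the table name, and the export directory are illustrative assumptions to experiment from, not recommended values. The command is only printed; clearing DRY_RUN would run it.

```shell
# Illustrative settings (assumptions): commit about every 16 MB,
# then pause 500 ms so the server and replicas can catch up.
CHECKPOINT_BYTES=$(( 16 * 1024 * 1024 ))
SLEEP_MS=500

# DRY_RUN=echo prints the command; set DRY_RUN="" to actually execute it.
DRY_RUN="echo"

# The -D generic arguments come first, before any tool-specific arguments,
# and --direct is set because these parameters only apply in direct mode.
$DRY_RUN sqoop export \
    -D sqoop.mysql.export.checkpoint.bytes="$CHECKPOINT_BYTES" \
    -D sqoop.mysql.export.sleep.ms="$SLEEP_MS" \
    --direct \
    --connect jdbc:mysql://db.example.com/someDb \
    --table foo \
    --export-dir /user/someuser/foo
```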

22.3. PostgreSQL

22.3.1. Importing views in direct mode

Sqoop supports a JDBC-based connector for PostgreSQL: http://jdbc.postgresql.org/ (driver download link)

The connector has been tested using JDBC driver version "9.1-903 JDBC 4" with PostgreSQL server 9.1.

22.3.1. Importing views in direct mode

Sqoop does not currently support importing from a view in direct mode. Use JDBC-based (non-direct) mode if you need to import a view (simply omit the --direct parameter).


22.4. Oracle

22.4.1. Dates and Times

JDBC Driver: Oracle JDBC Thin Driver. Sqoop is compatible with ojdbc6.jar. (driver download link)

Sqoop has been tested with Oracle 10.2.0 Express Edition. Oracle is notable for its approach to SQL, which differs from the ANSI standard, and for its non-standard JDBC driver. Therefore, several features work differently.

22.4.1. Dates and Times

Oracle JDBC represents DATE and TIME SQL types as TIMESTAMP values. Any DATE columns in an Oracle database will be imported as a TIMESTAMP in Sqoop, and Sqoop-generated code will store these values in java.sql.Timestamp fields.

When exporting data back to a database, Sqoop parses text fields as TIMESTAMP types (with the form yyyy-mm-dd HH:MM:SS.ffffffff) even if you expect these fields to be formatted with the JDBC date escape format of yyyy-mm-dd. Dates exported to Oracle should be formatted as full timestamps.
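One way to prepare date-only values for export is to pad them to a full timestamp beforehand. The helper below is a hypothetical sketch, not part of Sqoop; it assumes that midnight is an acceptable time of day for date-only values.

```shell
# Hypothetical helper: pad a yyyy-mm-dd date to a full timestamp.
# Values that already contain a time part are passed through unchanged.
to_timestamp() {
  case "$1" in
    *" "*) printf '%s\n' "$1" ;;              # already has a time part
    *)     printf '%s 00:00:00.0\n' "$1" ;;   # date only: assume midnight
  esac
}

to_timestamp "2010-05-01"
to_timestamp "2010-05-01 13:45:00.0"
```

In practice the same padding would be applied to the relevant column of the files being exported before running the export.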

Oracle also includes the additional date/time types TIMESTAMP WITH TIME ZONE and TIMESTAMP WITH LOCAL TIME ZONE. To support these types, the user’s session timezone must be specified. By default, Sqoop will specify the timezone "GMT" to Oracle. You can override this setting by specifying a Hadoop property oracle.sessionTimeZone on the command-line when running a Sqoop job. For example:

$ sqoop import -D oracle.sessionTimeZone=America/Los_Angeles \
    --connect jdbc:oracle:thin:@//db.example.com/foo --table bar

Note that Hadoop parameters (-D …) are generic arguments and must appear before the tool-specific arguments (--connect, --table, and so on).

Legal values for the session timezone string are enumerated at http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14225/applocaledata.htm#i637736.

22.5. Schema Definition in Hive

Hive users will note that there is not a one-to-one mapping between SQL types and Hive types. In general, SQL types that do not have a direct mapping (for example, DATE, TIME, and TIMESTAMP) will be coerced to STRING in Hive. The NUMERIC and DECIMAL SQL types will be coerced to DOUBLE. In these cases, Sqoop will emit a warning in its log messages informing you of the loss of precision.

23. Notes for specific connectors

23.1. MySQL JDBC Connector
23.1.1. Upsert functionality
23.2. Microsoft SQL Connector
23.2.1. Extra arguments
23.2.2. Schema support
23.2.3. Table hints
23.3. PostgreSQL Connector
23.3.1. Extra arguments
23.3.2. Schema support
23.4. pg_bulkload connector
23.4.1. Purpose
23.4.2. Requirements
23.4.3. Syntax
23.4.4. Data Staging

23.1. MySQL JDBC Connector

23.1.1. Upsert functionality

This section contains information specific to the MySQL JDBC Connector.

23.1.1. Upsert functionality

The MySQL JDBC Connector supports upsert functionality using the argument --update-mode allowinsert. To achieve that, Sqoop uses the MySQL clause INSERT INTO … ON DUPLICATE KEY UPDATE. This clause does not allow the user to specify which columns should be used to decide whether to update an existing row or add a new row. Instead, the clause relies on the table’s unique keys (the primary key belongs to this set). MySQL will try to insert the new row, and if the insertion fails with a duplicate unique key error, it will update the appropriate row instead. As a result, Sqoop ignores the values specified in the --update-key parameter; however, the user needs to specify at least one valid column to turn on update mode itself.
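An upsert export along these lines might look like the sketch below. The table foo, the column id, and the export directory are hypothetical placeholders. The command is only printed; clearing DRY_RUN would run it.

```shell
# DRY_RUN=echo prints the command; set DRY_RUN="" to actually execute it.
DRY_RUN="echo"

# --update-key only switches update mode on here; which row gets updated is
# decided by the table's unique keys via INSERT ... ON DUPLICATE KEY UPDATE.
$DRY_RUN sqoop export \
    --connect jdbc:mysql://db.example.com/someDb \
    --table foo \
    --export-dir /user/someuser/foo \
    --update-key id \
    --update-mode allowinsert
```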

23.2. Microsoft SQL Connector

23.2.1. Extra arguments
23.2.2. Schema support
23.2.3. Table hints

23.2.1. Extra arguments

A list of all extra arguments supported by the Microsoft SQL Connector is shown below:

Table 41. Supported Microsoft SQL Connector extra arguments:

Argument                 Description
--schema <name>          Schema name that Sqoop should use. Default is "dbo".
--table-hints <hints>    Table hints that Sqoop should use for data movement.

Translator's question: how should "schema" be translated?

23.2.2. Schema support

If you need to work with tables that are located in non-default schemas, you can specify schema names via the --schema argument. Custom schemas are supported for both import and export jobs. For example:

$ sqoop import ... --table custom_table -- --schema custom_schema

The small remaining part has not been translated yet; stay tuned.
