This article describes how to transfer data from an Oracle database to S3 with the Apache Sqoop utility, storing the data in the Avro data format.
The transfer relies on Apache Sqoop 1.4.7, which supports Avro data files. To store the data in Avro format, the following parameters should be added to the Sqoop command:
--as-avrodatafile # imports data to Avro data files
--compression-codec snappy # use Hadoop codec (in this case - snappy)
The template of the Sqoop command is as follows:
sqoop import \
--bindir ./ \
--connect 'jdbc:oracle:thin:<user>/<password>@<host>:<port>/<service_name>' \
# 'jdbc:sqlserver://<host>:<port>;databasename=<database>' \ # SQL Server 2008 and higher
# 'jdbc:jtds:sqlserver://<host>:<port>/<database>' \ # SQL Server 2000
--username <username> \
--driver <jdbc_driver_class> \ # manually specify the JDBC driver class to use
# example: --driver net.sourceforge.jtds.jdbc.Driver
--connection-manager <connection_manager_class> \ # specify the connection manager class to use
# example: --connection-manager org.apache.sqoop.manager.SQLServerManager
--password <password> \
--num-mappers <number_of_mappers> \
--fields-terminated-by '\t' \ # sets the field separator character
--lines-terminated-by '\n' \ # sets the end-of-line character
--as-avrodatafile \ # imports data to Avro data files
--compression-codec snappy \ # use a Hadoop codec (in this case, snappy)
--options-file <options_file> \
--split-by <column_name> \ # only used if the number of mappers > 1
--target-dir s3://<bucket>/<path> \
# example for HDFS: --target-dir hdfs:///<path>
--null-string '' \
--null-non-string '' \
--boundary-query <query> # if used, --split-by should also be present
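One of the template parameters, --options-file, lets you keep frequently used options in a text file instead of repeating them on the command line. A minimal sketch of such a file (the file name oracle_import.opts is hypothetical; each option and its value go on separate lines, and # starts a comment):
# oracle_import.opts (hypothetical example)
import
--connect
jdbc:oracle:thin:user/password@host_address.com:1521/orcl
--num-mappers
1
The remaining options can then be passed on the command line, for example: sqoop --options-file ./oracle_import.opts --target-dir s3://my-bucket/staging/employee ...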
An example Sqoop command that dumps data from Oracle to S3:
sqoop import \
-Dmapreduce.job.user.classpath.first=true \
--connect "jdbc:oracle:thin:user/password@host_address.com:1521/orcl" \
--num-mappers 1 \
--query 'select * from employee where $CONDITIONS' \
--target-dir s3://my-bucket/staging/employee \
--as-avrodatafile \
--compression-codec snappy \
--null-string '' \
--null-non-string ''
Note that the target directory must not exist when you run the command; otherwise, the Sqoop command will fail.
You can remove the target directory with a simple AWS CLI command:
aws s3 rm s3://my-bucket/staging/employee --recursive
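If you want the removal to happen only when the directory actually exists, a small shell sketch like the following can be used (it reuses the bucket and prefix from the example above):
# delete the S3 target prefix only if it already contains objects
if aws s3 ls s3://my-bucket/staging/employee/ > /dev/null 2>&1; then
  aws s3 rm s3://my-bucket/staging/employee --recursive
fi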
An example Sqoop command that dumps data from Oracle to Hadoop:
sqoop import \
-Dmapreduce.job.user.classpath.first=true \
--connect "jdbc:oracle:thin:user/password@host_address.com:1521/orcl" \
--num-mappers 1 \
--query 'select * from employee where $CONDITIONS' \
--delete-target-dir \
--target-dir /user/hive/warehouse/employee \
--as-avrodatafile \
--compression-codec snappy \
--null-string '' \
--null-non-string ''
Note that the command contains an additional parameter, --delete-target-dir, which deletes the target directory; it can only be used when the target directory is located in HDFS.
Sqoop can transfer data to Hadoop (HDFS) or to AWS (S3). To query the transferred data, a table has to be created on top of the physical files. If the data was transferred to Hadoop, you can create a Hive table. If the data was transferred to S3, you can create either a Hive table or an Amazon Athena table. In both cases you need a table schema, which can be retrieved from the physical files. Starting with version 1.4.7 (part of EMR 5.14.0, Hadoop distribution Amazon 2.8.3), Sqoop retrieves the table schema automatically and stores it in an AutoGeneratedSchema.avsc file in the same folder. If you use Sqoop version 1.4.6 (part of EMR 5.13.0) or lower, you can retrieve the table schema manually.
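For example, you can list the target directory to check whether the .avsc file is there (the paths are taken from the examples in this article):
hadoop fs -ls /user/hive/warehouse/employee          # HDFS target
aws s3 ls s3://my-bucket/staging/employee/           # S3 target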
If the destination of the data is HDFS, the table schema can be retrieved with the following command:
hadoop jar avro-tools-1.8.1.jar getschema /user/hive/warehouse/employee/part-m-00000.avro > employee.avsc
If the destination of the data is S3, you need to copy an Avro data file to the local file system and then retrieve the schema:
java -jar avro-tools-1.8.1.jar getschema part-m-00000.avro > employee.avsc
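The preceding copy step might look like this (a sketch that assumes the example bucket used earlier):
aws s3 cp s3://my-bucket/staging/employee/part-m-00000.avro .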
avro-tools-1.8.1.jar is part of Avro Tools, which provide a CLI interface for working with Avro files.
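Besides getschema, Avro Tools are also handy for spot-checking the imported records; for example, the tojson subcommand prints them as JSON (a sketch on a local file):
java -jar avro-tools-1.8.1.jar tojson part-m-00000.avro | head -n 5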
After the table schema has been retrieved, it can be used to create the tables.
To create an Avro table in Hive (on a Hadoop cluster or on EMR), you have to provide the location of the table schema retrieved from the Avro data file:
CREATE TABLE employee
STORED AS AVRO
LOCATION '/user/hive/warehouse/employee'
TBLPROPERTIES('avro.schema.url'='hdfs:///user/hive/warehouse/avsc/employee.avsc');
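Note that the file referenced by avro.schema.url has to exist at that location; a sketch of uploading the schema retrieved earlier to the HDFS path used above:
hadoop fs -mkdir -p /user/hive/warehouse/avsc
hadoop fs -put employee.avsc /user/hive/warehouse/avsc/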
The table location can also point to S3:
CREATE TABLE employee
STORED AS AVRO
location 's3://my-bucket/staging/employee'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hive/warehouse/avsc/employee.avsc');
You can even keep the table schema in S3:
CREATE EXTERNAL TABLE employee
STORED AS AVRO
location 's3://my-bucket/staging/employee'
TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/staging/avsc/employee.avsc');
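As before, the schema file has to be uploaded to the referenced S3 location first, for example with the AWS CLI:
aws s3 cp employee.avsc s3://my-bucket/staging/avsc/employee.avsc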
The Avro schema for the employee table looks like this:
{
"type" : "record",
"name" : "AutoGeneratedSchema",
"doc" : "Sqoop import of QueryResult",
"fields" : [ {
"name" : "ID",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "ID",
"sqlType" : "2"
}, {
"name" : "NAME",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "NAME",
"sqlType" : "12"
}, {
"name" : "AGE",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "AGE",
"sqlType" : "2"
}, {
"name" : "GEN",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "GEN",
"sqlType" : "12"
}, {
"name" : "CREATE_DATE",
"type" : [ "null", "long" ],
"default" : null,
"columnName" : "CREATE_DATE",
"sqlType" : "93"
}, {
"name" : "PROCESS_NAME",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "PROCESS_NAME",
"sqlType" : "12"
}, {
"name" : "UPDATE_DATE",
"type" : [ "null", "long" ],
"default" : null,
"columnName" : "UPDATE_DATE",
"sqlType" : "93"
} ],
"tableName" : "QueryResult"
}
Note that all timestamp columns are defined as long.
Important: all tables created in Hive with the CREATE TABLE statement are managed tables. This means that if you drop the table, the corresponding directory in HDFS or S3 will be deleted as well. To keep the data in HDFS or S3, the table should be created as an external table:
CREATE EXTERNAL TABLE employee
In this case the physical files in HDFS or S3 remain intact even if the external table is dropped.
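For example, the HDFS-backed table shown above could be declared as an external table like this:
CREATE EXTERNAL TABLE employee
STORED AS AVRO
LOCATION '/user/hive/warehouse/employee'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hive/warehouse/avsc/employee.avsc');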
Amazon Athena does not support the avro.schema.url table property; the schema needs to be added explicitly via avro.schema.literal:
CREATE EXTERNAL TABLE employee
(
ID string,
NAME string,
AGE string,
GEN string,
CREATE_DATE bigint,
PROCESS_NAME string,
UPDATE_DATE bigint
)
STORED AS AVRO
LOCATION 's3://my-bucket/staging/employee'
TBLPROPERTIES (
'avro.schema.literal'='
{
"type" : "record",
"name" : "AutoGeneratedSchema",
"doc" : "Sqoop import of QueryResult",
"fields" : [ {
"name" : "ID",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "ID",
"sqlType" : "2"
}, {
"name" : "NAME",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "NAME",
"sqlType" : "12"
}, {
"name" : "AGE",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "AGE",
"sqlType" : "2"
}, {
"name" : "GEN",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "GEN",
"sqlType" : "12"
}, {
"name" : "CREATE_DATE",
"type" : [ "null", "long" ],
"default" : null,
"columnName" : "CREATE_DATE",
"sqlType" : "93"
}, {
"name" : "PROCESS_NAME",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "PROCESS_NAME",
"sqlType" : "12"
}, {
"name" : "UPDATE_DATE",
"type" : [ "null", "long" ],
"default" : null,
"columnName" : "UPDATE_DATE",
"sqlType" : "93"
} ],
"tableName" : "QueryResult"
}
');
Note that all timestamp columns in the table definition are defined as bigint. The explanation is as follows.
When Sqoop imports data from Oracle into Avro (using --as-avrodatafile), it stores all timestamp values in Unix time (epoch time), i.e. as long values in milliseconds; for example, 1538265652000 corresponds to 2018-09-30 00:00:52.
Nothing changes when the Avro table is created in Hive:
CREATE TABLE employee
STORED AS AVRO
LOCATION '/user/hive/warehouse/employee'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hive/warehouse/avsc/employee.avsc');
When querying the data, you just need to convert the milliseconds to a string:
from_unixtime(<column> div 1000)
The resulting data set without the timestamp conversion looks like this:
hive> select id, name, age, gen, create_date, process_name, update_date
> from employee limit 2;
OK
id name age gen create_date process_name update_date
-- ---- --- --- ----------- ------------ -----------
2 John 30 M 1538265652000 BACKFILL 1538269659000
3 Jennie 25 F 1538265652000 BACKFILL 1538269659000
The resulting data set with the timestamp conversion looks like this:
hive> select
> id, name, age, gen,
> from_unixtime(create_date div 1000) as create_date,
> process_name,
> from_unixtime(update_date div 1000) as update_date
> from employee limit 2;
OK
id name age gen create_date process_name update_date
-- ---- --- --- ----------- ------------ -----------
2 John 30 M 2018-09-30 00:00:52 BACKFILL 2018-09-30 01:07:39
3 Jennie 25 F 2018-09-30 00:00:52 BACKFILL 2018-09-30 01:07:39
Important: in Hive, if a reserved word is used as a column name (such as timestamp), you need to escape it with backticks:
select from_unixtime(`timestamp` div 1000) as time_stamp
from employee;
When creating the Athena table, all long fields should be created as bigint in the CREATE TABLE statement (but not in the Avro schema!):
CREATE EXTERNAL TABLE employee
(
ID string,
NAME string,
AGE string,
GEN string,
CREATE_DATE bigint,
PROCESS_NAME string,
UPDATE_DATE bigint
)
STORED AS AVRO
LOCATION 's3://my-bucket/staging/employee'
TBLPROPERTIES (
'avro.schema.literal'='
{
"type" : "record",
"name" : "AutoGeneratedSchema",
"doc" : "Sqoop import of QueryResult",
"fields" : [ {
"name" : "ID",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "ID",
"sqlType" : "2"
}, {
"name" : "NAME",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "NAME",
"sqlType" : "12"
}, {
"name" : "AGE",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "AGE",
"sqlType" : "2"
}, {
"name" : "GEN",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "GEN",
"sqlType" : "12"
}, {
"name" : "CREATE_DATE",
"type" : [ "null", "long" ],
"default" : null,
"columnName" : "CREATE_DATE",
"sqlType" : "93"
}, {
"name" : "PROCESS_NAME",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "PROCESS_NAME",
"sqlType" : "12"
}, {
"name" : "UPDATE_DATE",
"type" : [ "null", "long" ],
"default" : null,
"columnName" : "UPDATE_DATE",
"sqlType" : "93"
} ],
"tableName" : "QueryResult"
}
');
When querying the data, you just need to convert the milliseconds to a string:
from_unixtime(<column> / 1000)
The resulting data set without the timestamp conversion looks like this:
select id, name, age, gen, create_date, process_name, update_date
from employee limit 2;
id name age gen create_date process_name update_date
-- ---- --- --- ----------- ------------ -----------
2 John 30 M 1538265652000 BACKFILL 1538269659000
3 Jennie 25 F 1538265652000 BACKFILL 1538269659000
The resulting data set with the timestamp conversion looks like this:
select id, name, age, gen,
from_unixtime(create_date / 1000) as create_date,
process_name,
from_unixtime(update_date / 1000) as update_date
from employee limit 2;
id name age gen create_date process_name update_date
-- ---- --- --- ----------- ------------ -----------
2 John 30 M 2018-09-30 00:00:52.000 BACKFILL 2018-09-30 01:07:39.000
3 Jennie 25 F 2018-09-30 00:00:52.000 BACKFILL 2018-09-30 01:07:39.000
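If the trailing .000 milliseconds are not wanted, the converted value can additionally be formatted; a sketch using Presto's date_format function, which Athena supports:
select id, name, age, gen,
       date_format(from_unixtime(create_date / 1000), '%Y-%m-%d %H:%i:%s') as create_date,
       process_name,
       date_format(from_unixtime(update_date / 1000), '%Y-%m-%d %H:%i:%s') as update_date
from employee limit 2;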
If you don't want to convert the timestamps from Unix time every time you run a query, you can store the timestamp values as text by adding the following parameter to the Sqoop command:
--map-column-java CREATE_DATE=String,UPDATE_DATE=String
After applying this parameter and running Sqoop, the table schema looks like this:
{
"type" : "record",
"name" : "AutoGeneratedSchema",
"doc" : "Sqoop import of QueryResult",
"fields" : [ {
"name" : "ID",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "ID",
"sqlType" : "2"
}, {
"name" : "NAME",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "NAME",
"sqlType" : "12"
}, {
"name" : "AGE",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "AGE",
"sqlType" : "2"
}, {
"name" : "GEN",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "GEN",
"sqlType" : "12"
}, {
"name" : "CREATE_DATE",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "CREATE_DATE",
"sqlType" : "93"
}, {
"name" : "PROCESS_NAME",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "PROCESS_NAME",
"sqlType" : "12"
}, {
"name" : "UPDATE_DATE",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "UPDATE_DATE",
"sqlType" : "93"
} ],
"tableName" : "QueryResult"
}
Note that the timestamp columns in the table schema are now defined as string.
The Sqoop command that stores the timestamp fields in string format:
sqoop import \
-Dmapreduce.job.user.classpath.first=true \
--connect "jdbc:oracle:thin:user/password@host_address.com:1521/orcl" \
--num-mappers 1 \
--query 'select * from employee where $CONDITIONS' \
--target-dir s3://my-bucket/staging/employee_ts_str \
--as-avrodatafile \
--compression-codec snappy \
--null-string '' \
--null-non-string '' \
--map-column-java CREATE_DATE=String,UPDATE_DATE=String
For dumping the data into HDFS, the Sqoop command will differ only in the --target-dir parameter:
--target-dir hdfs:///user/hive/warehouse/employee_ts_str
Create a new table in Hive using the new table schema:
CREATE TABLE employee_ts_str
STORED AS AVRO
LOCATION '/user/hive/warehouse/employee_ts_str'
TBLPROPERTIES('avro.schema.url'='hdfs:///user/hive/warehouse/avsc/employee_ts_str.avsc');
Select the data without any timestamp conversion:
hive> select id, name, age, gen, create_date, process_name, update_date
> from employee_ts_str limit 2;
OK
id name age gen create_date process_name update_date
-- ---- --- --- ----------- ------------ -----------
2 John 30 M 2018-09-30 00:00:52 BACKFILL 2018-09-30 01:07:39
3 Jennie 25 F 2018-09-30 00:00:52 BACKFILL 2018-09-30 01:07:39
Create a new table in Amazon Athena using the new table schema:
CREATE EXTERNAL TABLE employee_ts_str
(
ID string,
NAME string,
AGE string,
GEN string,
CREATE_DATE string,
PROCESS_NAME string,
UPDATE_DATE string
)
STORED AS AVRO
LOCATION 's3://my-bucket/staging/employee_ts_str'
TBLPROPERTIES (
'avro.schema.literal'='
{
"type" : "record",
"name" : "AutoGeneratedSchema",
"doc" : "Sqoop import of QueryResult",
"fields" : [ {
"name" : "ID",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "ID",
"sqlType" : "2"
}, {
"name" : "NAME",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "NAME",
"sqlType" : "12"
}, {
"name" : "AGE",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "AGE",
"sqlType" : "2"
}, {
"name" : "GEN",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "GEN",
"sqlType" : "12"
}, {
"name" : "CREATE_DATE",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "CREATE_DATE",
"sqlType" : "93"
}, {
"name" : "PROCESS_NAME",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "PROCESS_NAME",
"sqlType" : "12"
}, {
"name" : "UPDATE_DATE",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "UPDATE_DATE",
"sqlType" : "93"
} ],
"tableName" : "QueryResult"
}
');
Note that the timestamp columns in the table definition are defined as string.
Select the data without any timestamp conversion:
select id, name, age, gen, create_date, process_name, update_date
from employee_ts_str limit 2;
id name age gen create_date process_name update_date
-- ---- --- --- ----------- ------------ -----------
2 John 30 M 2018-09-30 00:00:52 BACKFILL 2018-09-30 01:07:39
3 Jennie 25 F 2018-09-30 00:00:52 BACKFILL 2018-09-30 01:07:39
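If you later need real timestamp semantics (for example, range filters or date functions), values in this 'yyyy-MM-dd HH:mm:ss' format can usually be cast back to a timestamp; a sketch that should work in both Hive and Athena:
select id, name, cast(create_date as timestamp) as create_date
from employee_ts_str limit 2;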
If there are multiple output files (multiple mappers were used) and you want to combine them into a single file, you can use concat:
hadoop jar avro-tools-1.8.1.jar concat part-m-00000.avro part-m-00001.avro cons_file.avro
The files can be either local files or S3 files:
hadoop jar avro-tools-1.8.1.jar concat s3://my-bucket/staging/employee/part-m-00000.avro s3://my-bucket/staging/employee/part-m-00001.avro s3://my-bucket/staging/employee/employee_final.avro
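When many part files are on the local file system, shell globbing can pass them all to concat in one go (a sketch; the last argument is the output file):
java -jar avro-tools-1.8.1.jar concat part-m-*.avro employee_final.avro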
This article has explained how to transfer data from a relational database (Oracle) to S3 or HDFS and store it in Avro data files using Apache Sqoop. It has also shown how to work with the Avro table schema and how to handle timestamp fields in Avro, either keeping them in Unix time (epoch time) or converting them to the string data type.