Sqoop import as ORC file

Question

Is there any option in Sqoop to import data from an RDBMS and store it in HDFS as ORC files?

Alternative tried: imported in text format and used a temp Hive table to read the input as a text file and write it back to HDFS as ORC.

Answer 1:

As of at least Sqoop 1.4.5 there is HCatalog integration that supports the ORC file format (among others).

For example, you have the option

--hcatalog-storage-stanza

which can be set to

stored as orc tblproperties ("orc.compress"="SNAPPY")

Example:

sqoop import \
--connect jdbc:postgresql://foobar:5432/my_db \
--driver org.postgresql.Driver \
--connection-manager org.apache.sqoop.manager.GenericJdbcManager \
--username foo \
--password-file hdfs:///user/foobar/foo.txt \
--table fact \
--hcatalog-home /usr/hdp/current/hive-webhcat \
--hcatalog-database my_hcat_db \
--hcatalog-table fact \
--create-hcatalog-table \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")'
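
After the import, the ORC-backed table is an ordinary Hive table, so it can be queried to verify the load. As a quick sanity check from spark-shell (a sketch, assuming a HiveContext pointed at the same metastore):

scala> val sqlHiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> sqlHiveContext.sql("select * from my_hcat_db.fact limit 5").show()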

Answer 2:

Sqoop import supports only the formats below:

--as-avrodatafile Imports data to Avro data files
--as-sequencefile Imports data to SequenceFiles
--as-textfile Imports data as plain text (default)
--as-parquetfile Imports data as Parquet files (since Sqoop 1.4.6)

Answer 3:

In the currently available version of Sqoop, it is not possible to import data from an RDBMS to HDFS in ORC format with a single command. This is a known issue in Sqoop.

Reference link for the issue: https://issues.apache.org/jira/browse/SQOOP-2192

I think the only alternative available for now is the one you mentioned. I came across a similar use case and used the same two-step approach.

Answer 4:

Currently there is no option to import RDBMS table data directly as ORC files using Sqoop.

We can achieve the same result in two steps:

1. Import the data in any available format (say, text).
2. Read the data using Spark SQL and save it as ORC files.

Example:

Step 1: Import the table data as a text file.

sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--table orders \
--target-dir /user/cloudera/text \
--as-textfile

Step 2: Use spark-shell at the command prompt to get a Scala REPL.

scala> val sqlHiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlHiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@638a9d61
scala> val textDF = sqlHiveContext.read.text("/user/cloudera/text")
textDF: org.apache.spark.sql.DataFrame = [value: string]
scala> textDF.write.orc("/user/cloudera/orc/")
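
Note that read.text loads each line into a single string column named value (as the [value: string] schema above shows), so the ORC files written this way hold one string column rather than the original table's columns. If typed columns are wanted, here is a minimal sketch that splits the delimited lines first; it assumes Sqoop's default comma delimiter and the usual retail_db orders layout (order_id, order_date, order_customer_id, order_status), so adjust to the real schema:

import org.apache.spark.sql.functions.split

// Read the Sqoop text output; each row is one delimited line in a column named "value"
val raw = sqlHiveContext.read.text("/user/cloudera/text")
val parts = split(raw("value"), ",")

// Project the split fields into named, typed columns (assumed schema)
val orders = raw.select(
  parts.getItem(0).cast("int").as("order_id"),
  parts.getItem(1).as("order_date"),
  parts.getItem(2).cast("int").as("order_customer_id"),
  parts.getItem(3).as("order_status"))

orders.write.orc("/user/cloudera/orc_typed/")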

Step 3: Check the output.

[root@quickstart exercises]# hadoop fs -ls /user/cloudera/orc/
Found 5 items
-rw-r--r-- 1 cloudera cloudera 0 2018-02-13 05:59 /user/cloudera/orc/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 153598 2018-02-13 05:59 /user/cloudera/orc/part-r-00000-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
-rw-r--r-- 1 cloudera cloudera 153466 2018-02-13 05:59 /user/cloudera/orc/part-r-00001-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
-rw-r--r-- 1 cloudera cloudera 153725 2018-02-13 05:59 /user/cloudera/orc/part-r-00002-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
-rw-r--r-- 1 cloudera cloudera 160907 2018-02-13 05:59 /user/cloudera/orc/part-r-00003-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
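
The output can also be read back from the same spark-shell session to confirm the files are valid ORC (a quick check, using the path above):

scala> sqlHiveContext.read.orc("/user/cloudera/orc/").show(5)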

Source: https://stackoverflow.com/questions/29978522/sqoop-import-as-orc-file
