Pitfalls I hit with Sqoop 1.x over the past couple of days

A quick share of the pitfalls I ran into with Sqoop 1.x over the past couple of days.

First, an overall summary: Sqoop 1.4.4's import does not support --as-parquetfile at all.

Fine, switch to 1.4.6. That version does support it, but --as-parquetfile cannot be combined with --query: it throws a baffling NullPointerException. --as-parquetfile with --table works. Import also cannot write Parquet directly into a Hive table; you have to take the indirect route of importing to HDFS first and then loading into Hive. On top of that, MySQL datetime columns always came back NULL when queried in Hive (I tried mapping them to string, timestamp, and bigint).

Fine, try the latest 1.4.7. That solved all of the problems above. Import still does not support Parquet directly into a Hive table, but at least it now prints a clear message saying Parquet is not compatible with direct Hive import. The first run complained about a missing org.json.JSONObject, which is easy: grab a java-json.jar from the web and drop it into lib/. --as-parquetfile now works with --query, but MySQL time types such as datetime are converted to bigint in Parquet, so the corresponding Hive column must be declared bigint; the value is stored as a unix timestamp. Declaring any other type raises an unsupported-type error. Then, importing from MySQL directly into a Hive table still produced a string of errors, because I was on Hive 1.2.0, which Sqoop 1.4.7 does not support. After upgrading to Hive 2.3.4 the import succeeded.

The details, step by step, are below.

 

1. Create the MySQL table

CREATE TABLE `sqoop_job` (

  `id` int(11) DEFAULT NULL,

  `name` varchar(255) DEFAULT NULL,

  `jobname` varchar(255) DEFAULT NULL,

  `formatTimeStamp` datetime DEFAULT NULL,

  `time` varchar(20) DEFAULT NULL

) ENGINE=InnoDB DEFAULT CHARSET=utf8

 

2. Insert some data

insert into sqoop_job values(1,"name1","jobname1","2015-11-09 21:00:00","2015-11-09 21:00:00");

insert into sqoop_job values(2,"name2","jobname2","2015-11-09 22:00:00","2015-11-09 22:00:00");

insert into sqoop_job values(3,"name3","jobname3","2015-11-09 23:00:00","2015-11-09 23:00:00");

insert into sqoop_job values(4,"name4","jobname4","2015-11-10 21:00:00","2015-11-10 21:00:00");

 

Sync the MySQL table structure to Hive (not actually recommended; better to create the Hive table by hand):

sqoop create-hive-table \

--connect jdbc:mysql://hadoop03:3306/test \

--username root --password root \

--table sqoop_job \

--hive-table test.sqoop_job \

--fields-terminated-by ,

Check the generated DDL:

show create table sqoop_job;

| CREATE TABLE `sqoop_job`(`id` int, `name` string, `jobname` string, `formattimestamp` string, `time` string)

COMMENT 'Imported by sqoop on 2019/03/02 18:49:49'

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

WITH SERDEPROPERTIES (                         

  'line.delim' = '\n',

  'field.delim' = ',',

  'serialization.format' = ','

)

STORED AS

  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'

  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

TBLPROPERTIES (

  'transient_lastDdlTime' = '1551523804'

)

 

As you can see, MySQL's datetime type is mapped to string by default.
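Even as a string, the value can still be treated as a time on the Hive side when needed. Once data has been loaded, a cast along these lines should work (my own note, not part of the original walkthrough):

select id, cast(formattimestamp as timestamp) as ts
from test.sqoop_job;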

 

3. Import from MySQL directly into a Parquet Hive table: fails (Sqoop 1.4.6 + Hive 1.2.0)

 

3.1 Create the Hive table

 

CREATE TABLE `parquet_sqoop_job` (

  `id` int ,

  `name` string ,

  `jobname` string ,

  `date_time` string ,

  `time` string

)

row format delimited fields terminated by '\t'

stored as parquet;

 

Note: specifying ROW FORMAT DELIMITED raises an error on Hive 1.2.0:

ROW FORMAT DELIMITED is only compatible with 'textfile', not 'parquet' (line 1, pos 0). Hive 2.3.4 accepts it.
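If your Hive version rejects that combination, a stripped-down DDL that simply drops the ROW FORMAT clause should be accepted (an untested variant of the table above; Parquet ignores field delimiters anyway):

CREATE TABLE `parquet_sqoop_job` (
  `id` int,
  `name` string,
  `jobname` string,
  `date_time` string,
  `time` string
)
STORED AS PARQUET;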

Run the sqoop import command:

sqoop import \

--connect jdbc:mysql://hadoop03:3306/test \

 --username root \

--password root  \

--table sqoop_job \

--hive-import  \

--hive-database test  \

--hive-table parquet_sqoop_job \

--delete-target-dir \

--fields-terminated-by '\t' \

-m 1  \

--target-dir test  \

--as-parquetfile

 

Sqoop 1.4.7 throws: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/metastore/api/NoSuchObjectException. This feels like a Hive-side problem, since in the end Sqoop loads the data into the Hive table through Hive itself.

 

4. Import from MySQL directly into a textfile Hive table (Sqoop 1.4.6 + Hive 1.2.0)

 

4.1 Let Sqoop create the Hive table automatically

 

########## No spaces are allowed after the trailing \ at the end of each line ##########

sqoop import --connect jdbc:mysql://hadoop03:3306/test \

--username root --password root  \

--create-hive-table \

--direct  \

--table sqoop_job \

--hive-overwrite \

--hive-import --hive-database test --hive-table sqoop1  \

--delete-target-dir --fields-terminated-by '\t' --target-dir sqoopjob -m 1

 

Query result:

 

0: jdbc:hive2://hadoop03:20000/> select * from sqoop1;

+------------+--------------+-----------------+-------------------------+----------------------+

| sqoop1.id  | sqoop1.name  | sqoop1.jobname  | sqoop1.formattimestamp  |     sqoop1.time      |

+------------+--------------+-----------------+-------------------------+----------------------+

| 1          | name1        | jobname1        | 2015-11-09 21:00:00     | 2015-11-09 21:00:00  |

| 2          | name2        | jobname2        | 2015-11-09 22:00:00     | 2015-11-09 22:00:00  |

| 3          | name3        | jobname3        | 2015-11-09 23:00:00     | 2015-11-09 23:00:00  |

| 4          | name4        | jobname4        | 2015-11-10 21:00:00     | 2015-11-10 21:00:00  |

+------------+--------------+-----------------+-------------------------+----------------------+

 

4.2 Create the Hive table manually

CREATE TABLE `sqoop_job` (

`id` int ,

`name` string ,

`jobname` string ,

`date_time` string ,

`time` string

)

row format delimited fields terminated by '\t'

stored as textfile;

 

Run the import:

sqoop import --connect jdbc:mysql://hadoop03:3306/test \

--username root --password root --direct \

--hive-overwrite \

--query 'SELECT id as id, name as name, jobname as jobname, formatTimeStamp as date_time, time as time FROM test.sqoop_job WHERE $CONDITIONS' \

--hive-import --hive-database test --hive-table sqoop_job  \

--delete-target-dir --fields-terminated-by '\t' --target-dir sqoopjob -m 1

 

If the SQL passed to --query contains values wrapped in single quotes, wrap the whole statement in double quotes and escape $CONDITIONS as \$CONDITIONS; otherwise no escaping is needed:

--query "SELECT id as id,name as name ,jobname as jobname,formatTimeStamp as date_time,time as time FROM test.sqoop_job WHERE \$CONDITIONS"                   

 

Query result:

0: jdbc:hive2://hadoop03:20000/> select * from sqoop_job;

+---------------+-----------------+--------------------+------------------------+----------------------+

| sqoop_job.id  | sqoop_job.name  | sqoop_job.jobname  |  sqoop_job.date_time   |    sqoop_job.time    |

+---------------+-----------------+--------------------+------------------------+----------------------+

| 1             | name1           | jobname1           | 2015-11-09 21:00:00.0  | 2015-11-09 21:00:00  |

| 2             | name2           | jobname2           | 2015-11-09 22:00:00.0  | 2015-11-09 22:00:00  |

| 3             | name3           | jobname3           | 2015-11-09 23:00:00.0  | 2015-11-09 23:00:00  |

| 4             | name4           | jobname4           | 2015-11-10 21:00:00.0  | 2015-11-10 21:00:00  |

+---------------+-----------------+--------------------+------------------------+----------------------+

Problem: the column mapped from MySQL's datetime has an extra .0 suffix.

Fix: drop --query and use --table instead; the .0 no longer appears. I don't know the exact reason.

# Cannot specify --query and --table together.

# --table cannot be combined with --query.
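An alternative I did not verify in this run: format the datetime into a plain string inside the --query SQL itself, so the JDBC driver never hands Sqoop a Timestamp value (the DATE_FORMAT call is my own addition):

sqoop import --connect jdbc:mysql://hadoop03:3306/test \
--username root --password root \
--query "SELECT id, name, jobname, DATE_FORMAT(formatTimeStamp, '%Y-%m-%d %H:%i:%s') AS date_time, time FROM test.sqoop_job WHERE \$CONDITIONS" \
--hive-import --hive-database test --hive-table sqoop_job \
--delete-target-dir --fields-terminated-by '\t' --target-dir sqoopjob -m 1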

5. Import from MySQL to HDFS, then into a Parquet Hive table (Sqoop 1.4.6 + Hive 1.2.0)

 

5.1 MySQL --> HDFS

First attempt: set the HDFS path to the Hive table's warehouse directory — this fails.

Create the Hive table manually:

CREATE TABLE `parquet_sqoop_job` (

`id` int ,

`name` string ,

`jobname` string ,

`date_time` string ,

`time` string

)

row format delimited fields terminated by '\t'

stored as parquet;

Run the import:

 

sqoop import \
--connect jdbc:mysql://hadoop03:3306/test \
--username root \
--password root \
--mapreduce-job-name FromMySQL2HDFS \
--table sqoop_job \
--fields-terminated-by '\t' \
--delete-target-dir \
--target-dir /user/hive/warehouse/test.db/parquet_sqoop_job \
--null-string '\\N' \
--null-non-string '\\N' \
--num-mappers 1 \
--as-parquetfile

Error (same with Sqoop 1.4.7):

19/03/03 23:07:02 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.ValidationException: Namespace test.db is not alphanumeric (plus '_')

org.kitesdk.data.ValidationException: Namespace test.db is not alphanumeric (plus '_')

--as-parquetfile does not allow a '.' in the namespace, nor in the dataset name.

A similar issue: https://issues.apache.org/jira/browse/SQOOP-2874

Textfile imports can target the warehouse path this way; re-running the job overwrites whatever the table already contains.

 

Setting --target-dir to a different path:

 

# Sqoop 1.4.6: --as-parquetfile combined with --query fails; --as-parquetfile with --table works

 

sqoop import \

--connect jdbc:mysql://hadoop03:3306/test \

--username root \

--password root \

--query "SELECT * FROM sqoop_job WHERE \$CONDITIONS" \

--target-dir /sqoop/import/user_parquet \

--delete-target-dir \

--num-mappers 1 \

--as-parquetfile

 

Error: NullPointerException

19/03/03 00:47:32 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.NullPointerException

java.lang.NullPointerException

        at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:97)

        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:478)

        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)

        at org.apache.sqoop.Sqoop.run(Sqoop.java:143)

        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)

        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)

        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)

        at org.apache.sqoop.Sqoop.main(Sqoop.java:236)

See: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SQOOP-2582

 

# Sqoop 1.4.6: --as-parquetfile with --table works

 

sqoop import \

--connect jdbc:mysql://hadoop03:3306/test \

--username root \

--password root \

--table sqoop_job \

--target-dir /sqoop/import/user_parquet \

--delete-target-dir \

--num-mappers 1 \

--as-parquetfile

 

Create a Hive external table:

 

CREATE external TABLE `ex_parquet_sqoop_job` (

  `id` int ,

  `name` string ,

  `jobname` string ,

  `date_time` string ,

  `time` string

)

stored as parquet

location "/sqoop/import/user_parquet" ;

 

Query result:

+-----+--------+-----------+------------+----------------------+--+

| id  |  name  |  jobname  | date_time  |         time         |

+-----+--------+-----------+------------+----------------------+--+

| 1   | name1  | jobname1  | NULL       | 2015-11-09 21:00:00  |

| 2   | name2  | jobname2  | NULL       | 2015-11-09 22:00:00  |

| 3   | name3  | jobname3  | NULL       | 2015-11-09 23:00:00  |

| 4   | name4  | jobname4  | NULL       | 2015-11-10 21:00:00  |

+-----+--------+-----------+------------+----------------------+--+

The date_time column comes back NULL; I also tried declaring it as timestamp and as bigint, and it was NULL either way.

With no better option, I went ahead and installed Sqoop 1.4.7.

6. Install Sqoop 1.4.7 (Sqoop 1.4.7 + Hive 1.2.0)

Download: http://mirror.bit.edu.cn/apache/sqoop/1.4.7/

6.1 First run of Sqoop 1.4.7 import: errors

 

ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver

Fix: add the MySQL connector jar under lib/:

 ln -s /home/software/sqoop-1.4.4.bin__hadoop-2.0.4-alpha/lib/mysql-connector-java-5.1.38-bin.jar /home/software/sqoop1.4.7/lib/mysql-connector-java-5.1.38-bin.jar

 

 

java.lang.ClassNotFoundException: org.json.JSONObject

Fix: download java-json.jar and put it under lib/. Download link: http://www.java2s.com/Code/Jar/j/Downloadjavajsonjar.htm
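For example (the download location is just an assumption), drop the jar into Sqoop's lib directory:

cp ~/Downloads/java-json.jar /home/software/sqoop1.4.7/lib/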

 

On to the next pit:

19/03/03 16:43:50 ERROR hive.HiveConfig: Could not load org.apache.hadoop.hive.conf.HiveConf. Make sure HIVE_CONF_DIR is set correctly.

19/03/03 16:43:50 ERROR tool.ImportTool: Import failed: java.io.IOException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf

 

Fix: the hive-common jar is missing. Find hive-common-*.jar under hive/lib/ and add it to Sqoop's lib/ (mine is hive-common-1.2.0.jar):

ln -s /home/software/apache-hive-1.2.0-bin/lib/hive-common-1.2.0.jar /home/software/sqoop1.4.7/lib/hive-common-1.2.0.jar

 

 

With that resolved, running the import again produced:

19/03/03 16:50:33 INFO hive.HiveImport: Loading uploaded data into Hive

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/shims/ShimLoader

Cause:

From a bit of searching: the shims classes exist to provide compatibility across different Hadoop and Hive versions.

See: https://blog.51cto.com/caiguangguang/1564601
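A quick way to confirm which Hive jar actually ships the missing class (my own diagnostic, assuming unzip is available; not part of the original troubleshooting):

for j in /home/software/apache-hive-1.2.0-bin/lib/hive-shims*.jar; do
  # print the jar that contains ShimLoader
  unzip -l "$j" | grep -q 'org/apache/hadoop/hive/shims/ShimLoader.class' && echo "$j"
done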

My guess is that Sqoop 1.4.7 is simply incompatible with Hive 1.2.0, so I upgraded Hive to 2.3.4.

 

7. Install and configure Hive 2.3.4 (Sqoop 1.4.7 + Hive 2.3.4)

 

See my other post for the detailed installation and configuration steps.

7.1 Replace hive-common with hive-common-2.3.4.jar

Delete the old symlink to hive-common-1.2.0.jar and replace it with hive-common-2.3.4.jar:
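Assuming the same paths as before, removing the stale link first would look like:

rm /home/software/sqoop1.4.7/lib/hive-common-1.2.0.jar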

ln -s /home/software/apache-hive-2.3.4-bin/lib/hive-common-2.3.4.jar /home/software/sqoop1.4.7/lib/hive-common-2.3.4.jar

7.2 Update the HIVE_HOME path in sqoop-env.sh
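Assuming the install path used above, the relevant line in conf/sqoop-env.sh would be:

export HIVE_HOME=/home/software/apache-hive-2.3.4-bin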

 

7.3 Run the import jobs

 

MySQL --> Parquet Hive table: fails

sqoop import \
--connect jdbc:mysql://hadoop03:3306/test \
--username root --password root \
--table sqoop_job \
--fields-terminated-by "\t" --lines-terminated-by "\n" \
--hive-import --hive-overwrite --create-hive-table \
--hive-table test.sqoop \
--delete-target-dir -m 1 \
--null-string '\\N' --null-non-string '\\N' \
--as-parquetfile

 

It prints: Hive import and create hive table is not compatible with importing into ParquetFile format.

So 1.4.7 also does not support importing Parquet files directly into a Hive table;

and --as-parquetfile fails even when the Hive table already exists.

 

MySQL --> HDFS --> Parquet Hive table: succeeds

--as-parquetfile combined with --query: supported

sqoop import \

--connect jdbc:mysql://hadoop03:3306/test \

--username root \

--password root \

--query 'SELECT id as id,name as name ,jobname as jobname,formatTimeStamp as date_time,time as time FROM test.sqoop_job WHERE $CONDITIONS' \

--target-dir /sqoop/import/user_parquet \

--delete-target-dir \

--num-mappers 1 \

--as-parquetfile

 

Create the Hive external table:

CREATE external TABLE `ex_parquet_sqoop_job` (

  `id` int ,

  `name` string ,

  `jobname` string ,

  `date_time` bigint ,

  `time` string

)

stored as parquet

location "/sqoop/import/user_parquet" ;

 

Check the result:

 

0: jdbc:hive2://hadoop03:20000/> select * from ex_parquet_sqoop_job;

+--------------------------+----------------------------+-------------------------------+---------------------------------+----------------------------+

| ex_parquet_sqoop_job.id  | ex_parquet_sqoop_job.name  | ex_parquet_sqoop_job.jobname  | ex_parquet_sqoop_job.date_time  | ex_parquet_sqoop_job.time  |

+--------------------------+----------------------------+-------------------------------+---------------------------------+----------------------------+

| 1                        | name1                      | jobname1                      | 1447074000000                   | 2015-11-09 21:00:00        |

| 2                        | name2                      | jobname2                      | 1447077600000                   | 2015-11-09 22:00:00        |

| 3                        | name3                      | jobname3                      | 1447081200000                   | 2015-11-09 23:00:00        |

| 4                        | name4                      | jobname4                      | 1447160400000                   | 2015-11-10 21:00:00        |

+--------------------------+----------------------------+-------------------------------+---------------------------------+----------------------------+

date_time is stored as a unix-timestamp value (epoch milliseconds).
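To get a human-readable time back out, something like this should work, since from_unixtime expects seconds rather than milliseconds (untested against this exact table):

select id,
       from_unixtime(cast(date_time / 1000 as bigint)) as date_time_readable
from ex_parquet_sqoop_job;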

Declaring date_time as any type other than bigint raises an error:

 

  • Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.LongWritable (state=,code=0)
  • Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.DateWritable (state=,code=0)

 

Note: when the Hive column names differ from the MySQL column names, a Parquet import still needs a --query statement that aliases each source column to the matching Hive column name; it is best to always do this. Otherwise date_time comes back NULL again. (The problem in section 5 above was most likely this too; I just hadn't noticed at the time.)

For example, if you use this import statement (which relies on --table and does no aliasing):

sqoop import \

--connect jdbc:mysql://hadoop03:3306/test \

--username root \

--password root \

--table sqoop_job \

--target-dir /sqoop/import/user_parquet \

--delete-target-dir \

--num-mappers 1 \

--as-parquetfile

 

Final takeaway: whenever possible, just upgrade everything to the latest reasonably stable versions.
