A quick share of the pitfalls I hit over the past couple of days working with Sqoop 1.x.
First, the overall summary: Sqoop 1.4.4's import does not support --as-parquetfile at all.
Fine, switch to 1.4.6. It does support the option, but --as-parquetfile cannot be combined with --query: that combination dies with an obscure NullPointerException, while --as-parquetfile with --table works. Import also cannot write Parquet directly into a Hive table; the only route is to import to HDFS first and then load the data into Hive. On top of that, MySQL datetime columns always came back NULL when queried from Hive (I tried mapping them to string, timestamp, and bigint).
Fine, try the latest 1.4.7. It solved every problem above. Import still cannot write Parquet directly into a Hive table, but at least it now prints a clear message that Parquet is not compatible with direct Hive import. The only hiccup on the first run was a missing org.json.JSONObject class, which is easy to fix: grab a java-json.jar from the web and drop it into lib/. --as-parquetfile now works with --query, but MySQL datetime (and similar time types) are written to Parquet as bigint, so the corresponding Hive column has to be declared bigint; the value is stored as a timestamp, and declaring any other type raises a type-not-supported error. Importing from MySQL straight into a Hive table then raised a further series of errors, because I was running Hive 1.2.0, which Sqoop 1.4.7 does not support. After upgrading to Hive 2.3.4 the import succeeded.
The details, one by one, follow below.
CREATE TABLE `sqoop_job` (
`id` int(11) DEFAULT NULL,
`name` varchar(255) DEFAULT NULL,
`jobname` varchar(255) DEFAULT NULL,
`formatTimeStamp` datetime DEFAULT NULL,
`time` varchar(20) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
insert into sqoop_job values(1,"name1","jobname1","2015-11-09 21:00:00","2015-11-09 21:00:00");
insert into sqoop_job values(2,"name2","jobname2","2015-11-09 22:00:00","2015-11-09 22:00:00");
insert into sqoop_job values(3,"name3","jobname3","2015-11-09 23:00:00","2015-11-09 23:00:00");
insert into sqoop_job values(4,"name4","jobname4","2015-11-10 21:00:00","2015-11-10 21:00:00");
sqoop create-hive-table \
--connect jdbc:mysql://hadoop03:3306/test \
--username root --password root \
--table sqoop_job \
--hive-table test.sqoop_job \
--fields-terminated-by ,
show create table sqoop_job;
CREATE TABLE `sqoop_job`(
  `id` int,
  `name` string,
  `jobname` string,
  `formattimestamp` string,
  `time` string)
COMMENT 'Imported by sqoop on 2019/03/02 18:49:49'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('line.delim' = ' ', 'field.delim' = ',', 'serialization.format' = ',')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
TBLPROPERTIES ('transient_lastDdlTime' = '1551523804')
As you can see, MySQL's datetime type is mapped to string in Hive by default.
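As an aside, Sqoop's --map-column-hive option can override the default SQL-to-Hive type mapping if you want something other than string for such a column. A minimal sketch of a plain text-format Hive import using it (the target table name sqoop_job_ts is made up; this is not one of the runs below, and it does not help with the Parquet NULL problem described later):
# Hypothetical variant: declare the formatTimeStamp column as timestamp in the generated Hive table.
sqoop import \
--connect jdbc:mysql://hadoop03:3306/test \
--username root --password root \
--table sqoop_job \
--hive-import --hive-database test --hive-table sqoop_job_ts \
--map-column-hive formatTimeStamp=timestamp \
--delete-target-dir --fields-terminated-by ',' --target-dir sqoop_job_ts -m 1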
CREATE TABLE `parquet_sqoop_job` (
`id` int ,
`name` string ,
`jobname` string ,
`date_time` string ,
`time` string
)
row format delimited fields terminated by '\t'
stored as parquet;
Note: specifying ROW FORMAT DELIMITED on a Parquet table fails on Hive 1.2.0 with:
ROW FORMAT DELIMITED is only compatible with 'textfile', not 'parquet'(line 1, pos 0). Hive 2.3.4 accepts it.
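On Hive 1.2.0 the simple workaround is to drop the ROW FORMAT DELIMITED clause entirely; Parquet is a binary columnar format, so a field delimiter is meaningless for it anyway. A sketch of the equivalent DDL:
-- Same table without ROW FORMAT DELIMITED; Hive 1.2.0 accepts this form.
CREATE TABLE `parquet_sqoop_job` (
`id` int ,
`name` string ,
`jobname` string ,
`date_time` string ,
`time` string
)
stored as parquet;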
Run the sqoop import:
sqoop import \
--connect jdbc:mysql://hadoop03:3306/test \
--username root \
--password root \
--table sqoop_job \
--hive-import \
--hive-database test \
--hive-table parquet_sqoop_job \
--delete-target-dir \
--fields-terminated-by '\t' \
-m 1 \
--target-dir test \
--as-parquetfile
Sqoop 1.4.7 fails here with: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/metastore/api/NoSuchObjectException. This looks like a Hive-side problem, since in the end Sqoop still loads the data into the Hive table through Hive.
######### There must be no space after the trailing \ on any line ##########
sqoop import --connect jdbc:mysql://hadoop03:3306/test \
--username root --password root \
--create-hive-table \
--direct \
--table sqoop_job \
--hive-overwrite \
--hive-import --hive-database test --hive-table sqoop1 \
--delete-target-dir --fields-terminated-by '\t' --target-dir sqoopjob -m 1
Query result:
0: jdbc:hive2://hadoop03:20000/> select * from sqoop1;
+------------+--------------+-----------------+-------------------------+----------------------+
| sqoop1.id | sqoop1.name | sqoop1.jobname | sqoop1.formattimestamp | sqoop1.time |
+------------+--------------+-----------------+-------------------------+----------------------+
| 1 | name1 | jobname1 | 2015-11-09 21:00:00 | 2015-11-09 21:00:00 |
| 2 | name2 | jobname2 | 2015-11-09 22:00:00 | 2015-11-09 22:00:00 |
| 3 | name3 | jobname3 | 2015-11-09 23:00:00 | 2015-11-09 23:00:00 |
| 4 | name4 | jobname4 | 2015-11-10 21:00:00 | 2015-11-10 21:00:00 |
+------------+--------------+-----------------+-------------------------+----------------------+
CREATE TABLE `sqoop_job` (
`id` int ,
`name` string ,
`jobname` string ,
`date_time` string ,
`time` string
)
row format delimited fields terminated by '\t'
stored as textfile;
sqoop import --connect jdbc:mysql://hadoop03:3306/test \
--username root --password root --direct \
--hive-overwrite \
--query 'SELECT id as id,name as name ,jobname as jobname,formatTimeStamp as date_time,time as time FROM test.sqoop_job WHERE $CONDITIONS' \
--hive-import --hive-database test --hive-table sqoop_job \
--delete-target-dir --fields-terminated-by '\t' --target-dir sqoopjob -m 1
If the SQL passed to --query contains content wrapped in single quotes '', the whole statement must be wrapped in double quotes "" and $ escaped as \$; otherwise this is not needed. For example: --query "SELECT id as id,name as name ,jobname as jobname,formatTimeStamp as date_time,time as time FROM test.sqoop_job WHERE \$CONDITIONS"
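For illustration, here is the same import with the whole --query statement double-quoted (identical parameters to the command above; only the quoting changes):
sqoop import --connect jdbc:mysql://hadoop03:3306/test \
--username root --password root --direct \
--hive-overwrite \
--query "SELECT id as id,name as name ,jobname as jobname,formatTimeStamp as date_time,time as time FROM test.sqoop_job WHERE \$CONDITIONS" \
--hive-import --hive-database test --hive-table sqoop_job \
--delete-target-dir --fields-terminated-by '\t' --target-dir sqoopjob -m 1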
0: jdbc:hive2://hadoop03:20000/> select * from sqoop_job;
+---------------+-----------------+--------------------+------------------------+----------------------+
| sqoop_job.id | sqoop_job.name | sqoop_job.jobname | sqoop_job.date_time | sqoop_job.time |
+---------------+-----------------+--------------------+------------------------+----------------------+
| 1 | name1 | jobname1 | 2015-11-09 21:00:00.0 | 2015-11-09 21:00:00 |
| 2 | name2 | jobname2 | 2015-11-09 22:00:00.0 | 2015-11-09 22:00:00 |
| 3 | name3 | jobname3 | 2015-11-09 23:00:00.0 | 2015-11-09 23:00:00 |
| 4 | name4 | jobname4 | 2015-11-10 21:00:00.0 | 2015-11-10 21:00:00 |
+---------------+-----------------+--------------------+------------------------+----------------------+
Problem: the column that comes from MySQL's datetime type has an extra .0 appended.
Fix: drop --query and use --table instead, and the .0 disappears. I don't know the exact reason; presumably the --query path fetches the value over JDBC as a java.sql.Timestamp, whose string form carries the .0 fractional seconds, while --table with --direct dumps the literal text.
#Cannot specify --query and --table together.
# --table cannot be used together with --query
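If --query has to stay, one possible workaround (a sketch only, not something verified in this troubleshooting) is to format the datetime as a string on the MySQL side, so no Timestamp conversion happens downstream:
# Sketch: DATE_FORMAT yields a plain string, avoiding the ".0" from the Timestamp conversion.
sqoop import --connect jdbc:mysql://hadoop03:3306/test \
--username root --password root \
--hive-overwrite \
--query "SELECT id, name, jobname, DATE_FORMAT(formatTimeStamp, '%Y-%m-%d %H:%i:%s') AS date_time, time FROM test.sqoop_job WHERE \$CONDITIONS" \
--hive-import --hive-database test --hive-table sqoop_job \
--delete-target-dir --fields-terminated-by '\t' --target-dir sqoopjob -m 1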
CREATE TABLE `parquet_sqoop_job` (
`id` int ,
`name` string ,
`jobname` string ,
`date_time` string ,
`time` string
)
row format delimited fields terminated by '\t'
stored as parquet;
Run:
sqoop import \
--connect jdbc:mysql://hadoop03:3306/test \
--username root \
--password root \
--mapreduce-job-name FromMySQL2HDFS \
--table sqoop_job \
--fields-terminated-by "\t" \
--delete-target-dir \
--target-dir /user/hive/warehouse/test.db/parquet_sqoop_job \
--null-string '\\N' \
--null-non-string '\\N' \
--num-mappers 1 \
--as-parquetfile
19/03/03 23:07:02 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.ValidationException: Namespace test.db is not alphanumeric (plus '_')
org.kitesdk.data.ValidationException: Namespace test.db is not alphanumeric (plus '_')
--as-parquetfile does not allow a '.' in the namespace, and the dataset name cannot contain a '.' either (hence the dot-free --target-dir used further below).
A similar issue is tracked at https://issues.apache.org/jira/browse/SQOOP-2874
With textfile the same path is fine, and re-running the import overwrites the table's existing content.
sqoop import \
--connect jdbc:mysql://hadoop03:3306/test \
--username root \
--password root \
--query "SELECT * FROM sqoop_job WHERE \$CONDITIONS" \
--target-dir /sqoop/import/user_parquet \
--delete-target-dir \
--num-mappers 1 \
--as-parquetfile
Error: NullPointerException
19/03/03 00:47:32 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:97)
    at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:478)
    at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
See: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SQOOP-2582
sqoop import \
--connect jdbc:mysql://hadoop03:3306/test \
--username root \
--password root \
--table sqoop_job \
--target-dir /sqoop/import/user_parquet \
--delete-target-dir \
--num-mappers 1 \
--as-parquetfile
CREATE external TABLE `ex_parquet_sqoop_job` (
`id` int ,
`name` string ,
`jobname` string ,
`date_time` string ,
`time` string
)
stored as parquet
location "/sqoop/import/user_parquet" ;
+-----+--------+-----------+------------+----------------------+--+
| id | name | jobname | date_time | time |
+-----+--------+-----------+------------+----------------------+--+
| 1 | name1 | jobname1 | NULL | 2015-11-09 21:00:00 |
| 2 | name2 | jobname2 | NULL | 2015-11-09 22:00:00 |
| 3 | name3 | jobname3 | NULL | 2015-11-09 23:00:00 |
| 4 | name4 | jobname4 | NULL | 2015-11-10 21:00:00 |
+-----+--------+-----------+------------+----------------------+--+
The date_time column comes back NULL. I tried changing its type to timestamp and to bigint; it is NULL in every case.
Out of options, I went ahead and installed Sqoop 1.4.7.
Download: http://mirror.bit.edu.cn/apache/sqoop/1.4.7/
ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
Fix: add the MySQL connector jar to lib/:
ln -s /home/software/sqoop-1.4.4.bin__hadoop-2.0.4-alpha/lib/mysql-connector-java-5.1.38-bin.jar /home/software/sqoop1.4.7/lib/mysql-connector-java-5.1.38-bin.jar
java.lang.ClassNotFoundException: org.json.JSONObject
Fix: download java-json.jar and put it in the lib/ directory. Download: http://www.java2s.com/Code/Jar/j/Downloadjavajsonjar.htm
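For example, assuming the jar was downloaded to the current directory (a hypothetical path; adjust to wherever it was saved):
cp ./java-json.jar /home/software/sqoop1.4.7/lib/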
On to the next pit:
19/03/03 16:43:50 ERROR hive.HiveConfig: Could not load org.apache.hadoop.hive.conf.HiveConf. Make sure HIVE_CONF_DIR is set correctly.
19/03/03 16:43:50 ERROR tool.ImportTool: Import failed: java.io.IOException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf
Fix: the cause is that Hive's hive-common jar is missing from Sqoop's lib/:
ln -s /home/software/apache-hive-1.2.0-bin/lib/hive-common-1.2.0.jar /home/software/sqoop1.4.7/lib/hive-common-1.2.0.jar
With that fixed, running the import again gives:
19/03/03 16:50:33 INFO hive.HiveImport: Loading uploaded data into Hive
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/shims/ShimLoader
Cause:
Searching around, someone explained that the shims classes exist to bridge compatibility between different Hadoop and Hive versions.
Details: https://blog.51cto.com/caiguangguang/1564601
My guess was that Sqoop 1.4.7 is simply not compatible with Hive 1.2.0, so I upgraded Hive to 2.3.4.
The installation and configuration are covered in another post of mine.
Delete the old hive-common-1.2.0.jar symlink and replace it with hive-common-2.3.4.jar:
ln -s /home/software/apache-hive-2.3.4-bin/lib/hive-common-2.3.4.jar /home/software/sqoop1.4.7/lib/hive-common-2.3.4.jar
sqoop import \
--connect jdbc:mysql://hadoop03:3306/test \
--username root --password root \
--table sqoop_job \
--fields-terminated-by "\t" --lines-terminated-by "\n" \
--hive-import --hive-overwrite --create-hive-table \
--hive-table test.sqoop \
--delete-target-dir -m 1 \
--null-string '\\N' --null-non-string '\\N' \
--as-parquetfile
It prints: Hive import and create hive table is not compatible with importing into ParquetFile format.
So 1.4.7 also does not support importing Parquet files directly into a Hive table;
nor does it help, with --as-parquetfile, to create the Hive table in advance.
--as-parquetfile together with --query: supported.
sqoop import \
--connect jdbc:mysql://hadoop03:3306/test \
--username root \
--password root \
--query 'SELECT id as id,name as name ,jobname as jobname,formatTimeStamp as date_time,time as time FROM test.sqoop_job WHERE $CONDITIONS' \
--target-dir /sqoop/import/user_parquet \
--delete-target-dir \
--num-mappers 1 \
--as-parquetfile
CREATE external TABLE `ex_parquet_sqoop_job` (
`id` int ,
`name` string ,
`jobname` string ,
`date_time` bigint ,
`time` string
)
stored as parquet
location "/sqoop/import/user_parquet" ;
0: jdbc:hive2://hadoop03:20000/> select * from ex_parquet_sqoop_job;
+--------------------------+----------------------------+-------------------------------+---------------------------------+----------------------------+
| ex_parquet_sqoop_job.id | ex_parquet_sqoop_job.name | ex_parquet_sqoop_job.jobname | ex_parquet_sqoop_job.date_time | ex_parquet_sqoop_job.time |
+--------------------------+----------------------------+-------------------------------+---------------------------------+----------------------------+
| 1 | name1 | jobname1 | 1447074000000 | 2015-11-09 21:00:00 |
| 2 | name2 | jobname2 | 1447077600000 | 2015-11-09 22:00:00 |
| 3 | name3 | jobname3 | 1447081200000 | 2015-11-09 23:00:00 |
| 4 | name4 | jobname4 | 1447160400000 | 2015-11-10 21:00:00 |
+--------------------------+----------------------------+-------------------------------+---------------------------------+----------------------------+
date_time is stored as an epoch timestamp (milliseconds).
Declaring date_time as any type other than bigint raises a type error.
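To read those epoch-millisecond values back in human-readable form at query time, Hive's from_unixtime (which expects seconds) can be applied after dividing by 1000, for example:
-- Convert the epoch-millisecond bigint back into a readable time string.
select id, name, from_unixtime(cast(date_time / 1000 as bigint)) as date_time_readable, `time`
from ex_parquet_sqoop_job;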
Note: when the Hive column names differ from the MySQL column names, a Parquet import still needs --query with column aliases so that the field names written to Parquet line up with the Hive columns; it is best to always do this. Otherwise date_time will again come back NULL. (The NULL problem in section 5 above was presumably this as well; I just hadn't realized it at the time.)
For example, with this import statement (plain --table, no aliases, so the Parquet field keeps the MySQL name formatTimeStamp rather than date_time) date_time would still be NULL:
sqoop import \
--connect jdbc:mysql://hadoop03:3306/test \
--username root \
--password root \
--table sqoop_job \
--target-dir /sqoop/import/user_parquet \
--delete-target-dir \
--num-mappers 1 \
--as-parquetfile
Final takeaway: whenever possible, just keep everything on the latest reasonably stable versions.