Sqoop Common Problems

1. When importing data from MySQL into Hive with Sqoop, the job fails with:

20/09/18 11:20:33 INFO mapreduce.Job: Job job_1600395587790_0002 failed with state FAILED due to: Application application_1600395587790_0002 failed 2 times due to AM Container for appattempt_1600395587790_0002_000002 exited with  exitCode: -104

Failing this attempt.Diagnostics: [2020-09-18 11:20:32.442]Container [pid=69122,containerID=container_e106_1600395587790_0002_02_000001] is running 58400768B beyond the 'PHYSICAL' memory limit. Current usage: 225.7 MB of 170 MB physical memory used; 2.0 GB of 357.0 MB virtual memory used. Killing container.

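Solution:

The key error is "Current usage: 225.7 MB of 170 MB physical memory used; 2.0 GB of 357.0 MB virtual memory used". The container used 225.7 MB of physical memory against a limit of only 170 MB, and 2.0 GB of virtual memory against a limit of only 357 MB. (The 357 MB virtual limit is the 170 MB physical limit multiplied by 2.1, the default value of yarn.nodemanager.vmem-pmem-ratio.)

Setting yarn.scheduler.minimum-allocation-mb to 256 (MB) in yarn-site.xml solved the problem.

If the error complains about insufficient virtual memory instead, the check can be turned off by setting yarn.nodemanager.vmem-check-enabled to false.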
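One plausible reading of why this works: YARN never grants a container smaller than yarn.scheduler.minimum-allocation-mb, and with the Capacity Scheduler's default settings it rounds requests up to a multiple of that value. The sketch below only reproduces that arithmetic in shell. The function name round_up_mb is made up for illustration, and the exact rounding depends on the scheduler in use.

```shell
# Hypothetical helper: round a requested container size (MB) up to the
# next multiple of the scheduler's minimum allocation (MB).
round_up_mb() {
    request_mb=$1
    min_alloc_mb=$2
    echo $(( (request_mb + min_alloc_mb - 1) / min_alloc_mb * min_alloc_mb ))
}

# A 170 MB request with a 256 MB minimum becomes a 256 MB container,
# which leaves room for the 225.7 MB the AM container was actually using.
round_up_mb 170 256
```

Under this assumption, any minimum allocation above roughly 226 MB would have been enough for this particular job.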

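2. All data imported into the Hive table by Sqoop is NULL

Solution:

This is caused by a mismatch between the field delimiter declared when the Hive table was created (fields terminated by '\001') and the delimiter used for the import. Set both to '\001'.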
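The effect of a delimiter mismatch can be reproduced outside Hive. In this sketch (the sample row is invented), a line written with '\001' separators is split once with the wrong delimiter and once with the right one. With the wrong delimiter the whole row lands in a single field, which is consistent with Hive showing NULL for columns it cannot parse.

```shell
# A sample row as it would be written with --fields-terminated-by '\001'.
row=$(printf '1\001alice\00130')

# Splitting on the wrong delimiter (a comma) finds a single field.
wrong_count=$(printf '%s\n' "$row" | awk -F ',' '{ print NF }')

# Splitting on \001 recovers all three columns.
right_count=$(printf '%s\n' "$row" | awk -F '\001' '{ print NF }')

echo "fields with ',': $wrong_count; fields with \\001: $right_count"
```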

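3. Row count increases after sqoop import into Hive:

After importing into Hive with Sqoop, select count(*) returns more rows than the source MySQL table holds, and also more than the "Retrieved 52136 records." figure printed in the Sqoop log.

Solution:

1) The column passed to --split-by is not an integer type and contains duplicate values. For details, see the CSDN blog post "The problem of row counts increasing after sqoop import into Hive" by IKnowNothinglee.

2) The field or record delimiter also appears inside the data. For details, see the CSDN blog post "A fix for row counts increasing when importing data with sqoop" by weixin_30693183.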
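The delimiter-related cause can be illustrated without a cluster. In this sketch (sample data invented), one record whose text column contains a line break is counted as two rows by a line-based reader, which is how Hive's default text format sees it. Sqoop's --hive-drop-import-delims option removes \n, \r and \01 from string fields on import; the tr call below only imitates the newline part of that behavior.

```shell
# One logical record: an id, a text column with an embedded newline, a number.
record=$(printf '1\001first line\nsecond line\00142')

# Count rows the way a line-based reader would:
# the embedded newline splits the record in two.
rows_before=$(printf '%s\n' "$record" | wc -l | tr -d ' ')

# Imitate dropping newline characters from the field, then count again.
cleaned=$(printf '%s' "$record" | tr -d '\n\r')
rows_after=$(printf '%s\n' "$cleaned" | wc -l | tr -d ' ')

echo "rows before: $rows_before, rows after: $rows_after"
```

If the source text columns may contain line breaks, adding --hive-drop-import-delims to the sqoop import command is one option worth trying.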

4. Exporting to MySQL with Sqoop fails with: java.io.FileNotFoundException: Path is not a file

When running the following export command:

sqoop export --connect jdbc:mysql://192.56.1.111:3306/bigdata --username root --password 2342344 --table user_tb_summary --fields-terminated-by '\001' --update-key date_str --update-mode allowinsert --export-dir  /warehouse/tablespace/managed/hive/taxbook1.db/user_summary/

the following error is reported:

20/10/10 18:34:03 ERROR tool.ExportTool: Encountered IOException running export job:
java.io.FileNotFoundException: Path is not a file: /warehouse/tablespace/managed/hive/taxbook1.db/user_summary/delta_0000053_0000053_0000
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:90)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)

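Listing the user_summary table directory shows an extra level of subdirectories. After a lot of digging I found that the Hive version is 3.1. My first guess was that this comes from Hive 3.x running on Tez instead of MR, but delta_* directories are the storage layout of Hive transactional (ACID) tables, which managed tables typically are by default in Hive 3.x distributions: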

[hdfs@hadoop05 sqoop_job]$  hdfs dfs -ls /warehouse/tablespace/managed/hive/taxbook1.db/user_summary

Found 10 items
drwxrwx---+  - hive hadoop          0 2020-10-10 14:56 /warehouse/tablespace/managed/hive/taxbook1.db/user_summary/delta_0000053_0000053_0000
drwxrwx---+  - hive hadoop          0 2020-10-10 14:57 /warehouse/tablespace/managed/hive/taxbook1.db/user_summary/delta_0000054_0000054_0000

Solution: use a wildcard so that --export-dir matches the delta directories:

sqoop export --connect jdbc:mysql://192.56.1.111:3306/bigdata --username root --password 2342344 --table user_tb_summary --fields-terminated-by '\001' --update-key date_str --update-mode allowinsert --export-dir  /warehouse/tablespace/managed/hive/taxbook1.db/user_summary/delta*
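When the export runs from a script, it may be worth quoting the wildcard so that the local shell never tries to expand it and the pattern reaches Hadoop unchanged. A minimal sketch, assuming the same table path as above; the command is only printed here, not executed, and the connection options are left out for brevity.

```shell
# Base directory of the Hive table, taken from the export command above.
table_dir="/warehouse/tablespace/managed/hive/taxbook1.db/user_summary"

# Quoting keeps the pattern literal in the local shell; the wildcard is
# then matched against HDFS paths, not the local filesystem.
export_dir="${table_dir}/delta*"

# Print the command instead of running it, so the sketch works without a cluster.
printf 'sqoop export --table user_tb_summary --export-dir %s\n' "$export_dir"
```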

5. The job hangs while connecting to HiveServer2

21/05/25 14:02:42 INFO hive.HiveImport: Connecting to jdbc:hive2://hadoop02.com:2181,hadoop01.com:2181,hadoop03.com:2181,hadoop04.com:2181/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2

On first use, execution hangs at this point. The fix is as follows:

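In the Hive conf directory (usually /etc/hive/conf), create a beeline-hs2-connection.xml file (as the hive user) if it does not already exist, then run the job again. The file content is:

<?xml version="1.0"?>
<configuration>
<property>
  <name>beeline.hs2.connection.user</name>
  <value>hive</value>
</property>
<property>
  <name>beeline.hs2.connection.password</name>
  <value>hive</value>
</property>
</configuration>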

6. The MapReduce job fails while a Sqoop task is running:

21/06/11 09:40:08 INFO impl.YarnClientImpl: Submitted application application_1622620346132_0051
21/06/11 09:40:08 INFO mapreduce.Job: The url to track the job: http://hadoop01.com:8088/proxy/application_1622620346132_0051/
21/06/11 09:40:08 INFO mapreduce.Job: Running job: job_1622620346132_0051
21/06/11 09:49:33 INFO mapreduce.Job: Job job_1622620346132_0051 running in uber mode : false
21/06/11 09:49:33 INFO mapreduce.Job:  map 0% reduce 0%
21/06/11 09:49:38 INFO mapreduce.Job:  map 100% reduce 0%
21/06/11 09:52:37 INFO mapreduce.Job: Job job_1622620346132_0051 failed with state FAILED due to: Task failed task_1622620346132_0051_m_000003
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0

21/06/11 09:52:37 INFO mapreduce.Job: Counters: 12
    Job Counters
        Failed map tasks=1
        Killed map tasks=3
        Launched map tasks=4
        Rack-local map tasks=4
        Total time spent by all maps in occupied slots (ms)=10237
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=10237
        Total vcore-milliseconds taken by all map tasks=10237
        Total megabyte-milliseconds taken by all map tasks=10482688
    Map-Reduce Framework
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
21/06/11 09:52:37 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
21/06/11 09:52:37 INFO mapreduce.ExportJobBase: Transferred 0 bytes in 774.6075 seconds (0 bytes/sec)
21/06/11 09:52:37 INFO mapreduce.ExportJobBase: Exported 0 records.
21/06/11 09:52:37 ERROR mapreduce.ExportJobBase: Export job failed!
21/06/11 09:52:37 ERROR tool.ExportTool: Error during export:
Export job failed!
    at org.apache.sqoop.mapreduce.ExportJobBase.runExport(ExportJobBase.java:445)
    at org.apache.sqoop.manager.SqlManager.exportTable(SqlManager.java:930)
    at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:94)
    at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:113)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:151)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:187)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:241)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:250)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:259)

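The output contains no detailed error message, but it does show that task task_1622620346132_0051_m_000003 failed. In this case, check the YARN logs: take the applicationId that the task belongs to (it shares the numeric part of the task ID), then run the following command on a YARN host to see the detailed log: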

yarn logs -applicationId application_1622620346132_0051
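The application ID can be derived from the failed task ID by plain string manipulation. This is a small sketch; the helper name task_to_app_id is invented for illustration, and the final command is only printed so the snippet runs without a cluster.

```shell
# Hypothetical helper: derive the YARN application ID from a MapReduce task ID.
# task_<clusterTimestamp>_<sequence>_m_<n> -> application_<clusterTimestamp>_<sequence>
task_to_app_id() {
    rest=${1#task_}            # drop the "task_" prefix
    cluster_ts=${rest%%_*}     # first part: cluster timestamp
    rest=${rest#*_}
    app_seq=${rest%%_*}        # second part: application sequence number
    echo "application_${cluster_ts}_${app_seq}"
}

app_id=$(task_to_app_id "task_1622620346132_0051_m_000003")

# Print the command rather than running it.
echo "yarn logs -applicationId $app_id"
```

The output of yarn logs can be long, so piping it through a pager or grep for "Exception" is often a practical first step.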
