Production Pitfalls Series: the Hive on Spark connection timeout problem

How it started

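In the early hours of 7/16, an alert suddenly arrived on DingTalk: in the ETL for the organizational-structure table that covers all of the company's business departments, an exception occurred while pushing data to the DIM layer, and the task failed.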

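This data feeds the organizational-structure base data used the next day by all of the big data team's externally facing application services. Of course, our Plan B is no pushover either: as soon as an error occurs, the downstream permission-management module and the gateway automatically work together to switch to the organizational-structure data from the previous day's last successful load into DIM. The only people affected are colleagues whose organizational structure changed during the previous day, and only in what they do in the system, so the impact is manageable. We also keep audit data for every organizational-structure change, so if the push ETL cannot be fixed by the next day, we handle those users manually from the audit data first, to keep production stable.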
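The fallback rule described above (serve the most recent successful load when the current one fails) can be sketched roughly as follows. This is only an illustration under stated assumptions: `LoadRun`, `latest_successful`, and the snapshot ids are hypothetical names, not the actual permission module or gateway code.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class LoadRun:
    """One ETL load into DIM (hypothetical record shape)."""
    finished_at: datetime
    succeeded: bool
    snapshot_id: str


def latest_successful(runs: Iterable[LoadRun], before: datetime) -> Optional[LoadRun]:
    """Return the most recent successful run that finished before `before`,
    or None if there is no such run."""
    candidates = [r for r in runs if r.succeeded and r.finished_at < before]
    return max(candidates, key=lambda r: r.finished_at, default=None)


# Example history: two successful loads on 7/15, then the failed load on 7/16.
runs = [
    LoadRun(datetime(2021, 7, 15, 1, 30), True, "org_20210715_a"),
    LoadRun(datetime(2021, 7, 15, 13, 0), True, "org_20210715_b"),
    LoadRun(datetime(2021, 7, 16, 1, 27), False, "org_20210716_a"),
]
fallback = latest_successful(runs, before=datetime(2021, 7, 16))
```

Here `fallback` is the 13:00 load from 7/15, which is the snapshot a downstream consumer would keep serving until the failed load is repaired.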

Technical architecture

  • Cluster: CDH, 18 physical compute machines (256G/64C each)
  • Scheduling: dolphin
  • Data extraction: datax
  • DIM-layer database: Doris
  • Hive version: 2.1.1

Alerting

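The current alerting strategy is to have a bot pick up dolphin's alert emails and forward them to a DingTalk group. dolphin can in fact capture the exceptions itself, but that would take a fair amount of development, and we were worried that with complex scheduling flows some task monitoring would slip through and alerts would be lost, which would be a serious problem. So we kept it simple and blunt: a bot reads the emails in place of a human and sends the alerts to DingTalk, and all we have to watch is this one little darling that comes knocking.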
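A minimal sketch of such an email-to-DingTalk relay might look like the following. It is not the bot we actually run. It assumes a DingTalk custom robot webhook that accepts a JSON text message; the token placeholder, the `ALERT` keyword, and the function names are hypothetical. Fetching the mail (for example over IMAP) is left out.

```python
import json
import urllib.request
from email import message_from_string
from email.header import decode_header, make_header

# Hypothetical webhook URL; a real custom robot has its own access_token.
WEBHOOK_URL = "https://oapi.dingtalk.com/robot/send?access_token=<your-token>"


def _plain_text(msg) -> str:
    """Return the first text/plain part of a parsed email, or an empty string."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""


def build_alert_payload(raw_email: str, keyword: str = "ALERT", max_body: int = 500) -> dict:
    """Turn a raw alert email into a DingTalk text-message payload.

    `keyword` is prepended because a robot can be configured to accept only
    messages containing a given keyword (assumed setup).
    """
    msg = message_from_string(raw_email)
    subject = str(make_header(decode_header(msg.get("Subject", "(no subject)"))))
    body = _plain_text(msg).strip()[:max_body]
    return {"msgtype": "text", "text": {"content": f"[{keyword}] {subject}\n{body}"}}


def send_to_dingtalk(payload: dict, url: str = WEBHOOK_URL) -> str:
    """POST the payload to the webhook and return the raw response body."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8")
```

Keeping payload construction separate from the HTTP call means the formatting can be checked without touching the network.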


Cluster log

Log Type: stderr

Log Upload Time: Fri Jul 16 01:27:46 +0800 2021

Log Length: 10569

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data7/yarn/nm/usercache/dolphinscheduler/filecache/8096/__spark_libs__6065796770539359217.zip/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/07/16 01:27:43 INFO util.SignalUtils: Registered signal handler for TERM
21/07/16 01:27:43 INFO util.SignalUtils: Registered signal handler for HUP
21/07/16 01:27:43 INFO util.SignalUtils: Registered signal handler for INT
21/07/16 01:27:43 INFO spark.SecurityManager: Changing view acls to: yarn,dolphinscheduler
21/07/16 01:27:43 INFO spark.SecurityManager: Changing modify acls to: yarn,dolphinscheduler
21/07/16 01:27:43 INFO spark.SecurityManager: Changing view acls groups to: 
21/07/16 01:27:43 INFO spark.SecurityManager: Changing modify acls groups to: 
21/07/16 01:27:43 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, dolphinscheduler); groups with view permissions: Set(); users  with modify permissions: Set(yarn, dolphinscheduler); groups with modify permissions: Set()
21/07/16 01:27:43 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1625364172078_3093_000001
21/07/16 01:27:43 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
21/07/16 01:27:43 INFO yarn.ApplicationMaster: Waiting for spark context initialization...
21/07/16 01:27:43 INFO client.RemoteDriver: Connecting to HiveServer2 address: hadoop-task-1.bigdata.xx.com:24173
21/07/16 01:27:44 INFO conf.HiveConf: Found configuration file file:/data8/yarn/nm/usercache/dolphinscheduler/filecache/8097/__spark_conf__.zip/__hadoop_conf__/hive-site.xml
21/07/16 01:27:44 ERROR yarn.ApplicationMaster: User class threw exception: java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173
java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.XX.com/10.25.15.104:24173
	at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:41)
	at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:155)
	at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:559)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:673)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:715)
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused
	... 10 more
21/07/16 01:27:44 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User class threw exception: java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173
	at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:41)
	at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:155)
	at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:559)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:673)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refuse
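
As a first diagnostic for a log like the one above, one option is to pull the refused endpoint out of the stack trace and probe it from the node that ran the ApplicationMaster. The sketch below assumes the `Connection refused: host/ip:port` message format seen in this log; `extract_endpoint` and `probe` are hypothetical helper names. A "refused" result generally means the host answered but nothing was accepting on that port at that moment, while a "timeout" result points to packets being dropped or the host being unreachable.

```python
import re
import socket
from typing import Optional, Tuple

# Matches e.g. "Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173"
REFUSED_RE = re.compile(
    r"Connection refused: (?P<host>[\w.-]+)/(?P<ip>[\d.]+):(?P<port>\d+)"
)


def extract_endpoint(log_text: str) -> Optional[Tuple[str, str, int]]:
    """Return (hostname, ip, port) from the first refused-connection message, or None."""
    m = REFUSED_RE.search(log_text)
    if m is None:
        return None
    return m.group("host"), m.group("ip"), int(m.group("port"))


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Try one TCP connect and classify the outcome as open, refused, timeout, or error."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        try:
            s.connect((host, port))
        except socket.timeout:
            return "timeout"
        except ConnectionRefusedError:
            return "refused"
        except OSError as exc:
            return f"error: {exc}"
        return "open"
```

Usage would be along the lines of `endpoint = extract_endpoint(stderr_text)` followed by `probe(endpoint[1], endpoint[2])`. The port in this log looks like one chosen per session, so a later probe may not reproduce the state at the time of the failure.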
