org.apache.hadoop.hdfs.DataStreamer[]-Slow ReadProcessor read fields...82171ms(threshold=30000ms)

Error message:

2023-07-31 14:31:02,502 INFO  org.apache.hadoop.yarn.client.RMProxy                        [] - Connecting to ResourceManager at hadoop102/172.18.0.202:8032
2023-07-31 14:31:02,756 INFO  org.apache.flink.yarn.YarnClusterDescriptor                  [] - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2023-07-31 14:31:02,772 WARN  org.apache.flink.yarn.YarnClusterDescriptor                  [] - Job Clusters are deprecated since Flink 1.15. Please use an Application Cluster/Application Mode instead.
[INFO] 2023-07-31 06:31:08.114 +0000 -  -> 2023-07-31 14:31:07,201 INFO  org.apache.hadoop.conf.Configuration                         [] - resource-types.xml not found
2023-07-31 14:31:07,202 INFO  org.apache.hadoop.yarn.util.resource.ResourceUtils           [] - Unable to find 'resource-types.xml'.
[INFO] 2023-07-31 06:31:16.115 +0000 -  -> 2023-07-31 14:31:15,644 INFO  org.apache.flink.yarn.YarnClusterDescriptor                  [] - The configured JobManager memory is 2728 MB. YARN will allocate 3072 MB to make up an integer multiple of its minimum allocation memory (1024 MB, configured via 'yarn.scheduler.minimum-allocation-mb'). The extra 344 MB may not be used by Flink.
2023-07-31 14:31:15,645 INFO  org.apache.flink.yarn.YarnClusterDescriptor                  [] - The configured TaskManager memory is 2280 MB. YARN will allocate 3072 MB to make up an integer multiple of its minimum allocation memory (1024 MB, configured via 'yarn.scheduler.minimum-allocation-mb'). The extra 792 MB may not be used by Flink.
2023-07-31 14:31:15,645 INFO  org.apache.flink.yarn.YarnClusterDescriptor                  [] - Cluster specification: ClusterSpecification{masterMemoryMB=2728, taskManagerMemoryMB=2280, slotsPerTaskManager=1}
[INFO] 2023-07-31 06:32:29.129 +0000 -  -> 2023-07-31 14:32:28,658 INFO  org.apache.hadoop.hdfs.DataStreamer                          [] - Slow ReadProcessor read fields for block BP-578738613-127.0.0.1-1680765983742:blk_1073846764_313346 took 34319ms (threshold=30000ms); ack: seqno: 5 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 34317847272 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[172.18.0.201:9866,DS-0532fcd2-9e17-44e9-be20-846250dfbcaf,DISK], DatanodeInfoWithStorage[172.18.0.203:9866,DS-5e53ca55-f7a6-4ca3-b691-7838c48fd169,DISK], DatanodeInfoWithStorage[172.18.0.202:9866,DS-64752494-84a5-4022-b9b4-b1aac3c707bc,DISK]]
[INFO] 2023-07-31 06:32:34.131 +0000 -  -> 2023-07-31 14:32:33,378 INFO  org.apache.hadoop.hdfs.DataStreamer                          [] - Slow ReadProcessor read fields for block BP-578738613-127.0.0.1-1680765983742:blk_1073846764_313346 took 39141ms (threshold=30000ms); ack: seqno: 6 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 39138907324 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[172.18.0.201:9866,DS-0532fcd2-9e17-44e9-be20-846250dfbcaf,DISK], DatanodeInfoWithStorage[172.18.0.203:9866,DS-5e53ca55-f7a6-4ca3-b691-7838c48fd169,DISK], DatanodeInfoWithStorage[172.18.0.202:9866,DS-64752494-84a5-4022-b9b4-b1aac3c707bc,DISK]]
[INFO] 2023-07-31 06:32:39.133 +0000 -  -> 2023-07-31 14:32:38,438 INFO  org.apache.hadoop.hdfs.DataStreamer                          [] - Slow ReadProcessor read fields for block BP-578738613-127.0.0.1-1680765983742:blk_1073846764_313346 took 44201ms (threshold=30000ms); ack: seqno: 7 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 44198378674 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[172.18.0.201:9866,DS-0532fcd2-9e17-44e9-be20-846250dfbcaf,DISK], DatanodeInfoWithStorage[172.18.0.203:9866,DS-5e53ca55-f7a6-4ca3-b691-7838c48fd169,DISK], DatanodeInfoWithStorage[172.18.0.202:9866,DS-64752494-84a5-4022-b9b4-b1aac3c707bc,DISK]]
[INFO] 2023-07-31 06:32:42.134 +0000 -  -> 2023-07-31 14:32:41,794 INFO  org.apache.hadoop.hdfs.DataStreamer                          [] - Slow ReadProcessor read fields for block BP-578738613-127.0.0.1-1680765983742:blk_1073846764_313346 took 47555ms (threshold=30000ms); ack: seqno: 8 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 47552750303 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[172.18.0.201:9866,DS-0532fcd2-9e17-44e9-be20-846250dfbcaf,DISK], DatanodeInfoWithStorage[172.18.0.203:9866,DS-5e53ca55-f7a6-4ca3-b691-7838c48fd169,DISK], DatanodeInfoWithStorage[172.18.0.202:9866,DS-64752494-84a5-4022-b9b4-b1aac3c707bc,DISK]]
[INFO] 2023-07-31 06:32:49.135 +0000 -  -> 2023-07-31 14:32:48,227 INFO  org.apache.hadoop.hdfs.DataStreamer                          [] - Slow ReadProcessor read fields for block BP-578738613-127.0.0.1-1680765983742:blk_1073846764_313346 took 53988ms (threshold=30000ms); ack: seqno: 9 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 53985441808 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[172.18.0.201:9866,DS-0532fcd2-9e17-44e9-be20-846250dfbcaf,DISK], DatanodeInfoWithStorage[172.18.0.203:9866,DS-5e53ca55-f7a6-4ca3-b691-7838c48fd169,DISK], DatanodeInfoWithStorage[172.18.0.202:9866,DS-64752494-84a5-4022-b9b4-b1aac3c707bc,DISK]]
[INFO] 2023-07-31 06:32:53.136 +0000 -  -> 2023-07-31 14:32:52,498 INFO  org.apache.hadoop.hdfs.DataStreamer                          [] - Slow ReadProcessor read fields for block BP-578738613-127.0.0.1-1680765983742:blk_1073846764_313346 took 58259ms (threshold=30000ms); ack: seqno: 10 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 58255516777 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[172.18.0.201:9866,DS-0532fcd2-9e17-44e9-be20-846250dfbcaf,DISK], DatanodeInfoWithStorage[172.18.0.203:9866,DS-5e53ca55-f7a6-4ca3-b691-7838c48fd169,DISK], DatanodeInfoWithStorage[172.18.0.202:9866,DS-64752494-84a5-4022-b9b4-b1aac3c707bc,DISK]]
[INFO] 2023-07-31 06:33:00.138 +0000 -  -> 2023-07-31 14:32:59,277 INFO  org.apache.hadoop.hdfs.DataStreamer                          [] - Slow ReadProcessor read fields for block BP-578738613-127.0.0.1-1680765983742:blk_1073846764_313346 took 65038ms (threshold=30000ms); ack: seqno: 11 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 65033887715 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[172.18.0.201:9866,DS-0532fcd2-9e17-44e9-be20-846250dfbcaf,DISK], DatanodeInfoWithStorage[172.18.0.203:9866,DS-5e53ca55-f7a6-4ca3-b691-7838c48fd169,DISK], DatanodeInfoWithStorage[172.18.0.202:9866,DS-64752494-84a5-4022-b9b4-b1aac3c707bc,DISK]]
[INFO] 2023-07-31 06:33:04.139 +0000 -  -> 2023-07-31 14:33:03,958 INFO  org.apache.hadoop.hdfs.DataStreamer                          [] - Slow ReadProcessor read fields for block BP-578738613-127.0.0.1-1680765983742:blk_1073846764_313346 took 69718ms (threshold=30000ms); ack: seqno: 12 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 69714786695 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[172.18.0.201:9866,DS-0532fcd2-9e17-44e9-be20-846250dfbcaf,DISK], DatanodeInfoWithStorage[172.18.0.203:9866,DS-5e53ca55-f7a6-4ca3-b691-7838c48fd169,DISK], DatanodeInfoWithStorage[172.18.0.202:9866,DS-64752494-84a5-4022-b9b4-b1aac3c707bc,DISK]]
[INFO] 2023-07-31 06:33:09.140 +0000 -  -> 2023-07-31 14:33:08,814 INFO  org.apache.hadoop.hdfs.DataStreamer                          [] - Slow ReadProcessor read fields for block BP-578738613-127.0.0.1-1680765983742:blk_1073846764_313346 took 74573ms (threshold=30000ms); ack: seqno: 13 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 74570160268 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[172.18.0.201:9866,DS-0532fcd2-9e17-44e9-be20-846250dfbcaf,DISK], DatanodeInfoWithStorage[172.18.0.203:9866,DS-5e53ca55-f7a6-4ca3-b691-7838c48fd169,DISK], DatanodeInfoWithStorage[172.18.0.202:9866,DS-64752494-84a5-4022-b9b4-b1aac3c707bc,DISK]]
[INFO] 2023-07-31 06:33:15.141 +0000 -  -> 2023-07-31 14:33:14,765 INFO  org.apache.hadoop.hdfs.DataStreamer                          [] - Slow ReadProcessor read fields for block BP-578738613-127.0.0.1-1680765983742:blk_1073846764_313346 took 80523ms (threshold=30000ms); ack: seqno: 14 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 80520664025 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[172.18.0.201:9866,DS-0532fcd2-9e17-44e9-be20-846250dfbcaf,DISK], DatanodeInfoWithStorage[172.18.0.203:9866,DS-5e53ca55-f7a6-4ca3-b691-7838c48fd169,DISK], DatanodeInfoWithStorage[172.18.0.202:9866,DS-64752494-84a5-4022-b9b4-b1aac3c707bc,DISK]]
[INFO] 2023-07-31 06:33:22.142 +0000 -  -> 2023-07-31 14:33:21,697 INFO  org.apache.hadoop.hdfs.DataStreamer                          [] - Slow ReadProcessor read fields for block BP-578738613-127.0.0.1-1680765983742:blk_1073846764_313346 took 87455ms (threshold=30000ms); ack: seqno: 15 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 87452286022 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[172.18.0.201:9866,DS-0532fcd2-9e17-44e9-be20-846250dfbcaf,DISK], DatanodeInfoWithStorage[172.18.0.203:9866,DS-5e53ca55-f7a6-4ca3-b691-7838c48fd169,DISK], DatanodeInfoWithStorage[172.18.0.202:9866,DS-64752494-84a5-4022-b9b4-b1aac3c707bc,DISK]]
[INFO] 2023-07-31 06:33:25.143 +0000 -  -> 2023-07-31 14:33:24,716 INFO  org.apache.hadoop.hdfs.DataStreamer                          [] - Slow ReadProcessor read fields for block BP-578738613-127.0.0.1-1680765983742:blk_1073846764_313346 took 90475ms (threshold=30000ms); ack: seqno: 16 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 90472112027 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[172.18.0.201:9866,DS-0532fcd2-9e17-44e9-be20-846250dfbcaf,DISK], DatanodeInfoWithStorage[172.18.0.203:9866,DS-5e53ca55-f7a6-4ca3-b691-7838c48fd169,DISK], DatanodeInfoWithStorage[172.18.0.202:9866,DS-64752494-84a5-4022-b9b4-b1aac3c707bc,DISK]]
[INFO] 2023-07-31 06:33:29.144 +0000 -  -> 2023-07-31 14:33:28,523 INFO  org.apache.hadoop.hdfs.DataStreamer                          [] - Slow ReadProcessor read fields for block BP-578738613-127.0.0.1-1680765983742:blk_1073846764_313346 took 94281ms (threshold=30000ms); ack: seqno: 17 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 94278043524 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[172.18.0.201:9866,DS-0532fcd2-9e17-44e9-be20-846250dfbcaf,DISK], DatanodeInfoWithStorage[172.18.0.203:9866,DS-5e53ca55-f7a6-4ca3-b691-7838c48fd169,DISK], DatanodeInfoWithStorage[172.18.0.202:9866,DS-64752494-84a5-4022-b9b4-b1aac3c707bc,DISK]]
[INFO] 2023-07-31 06:33:35.145 +0000 -  -> 2023-07-31 14:33:34,447 INFO  org.apache.hadoop.hdfs.DataStreamer                          [] - Slow ReadProcessor read fields for block BP-578738613-127.0.0.1-1680765983742:blk_1073846764_313346 took 100206ms (threshold=30000ms); ack: seqno: 18 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 100202433416 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[172.18.0.201:9866,DS-0532fcd2-9e17-44e9-be20-846250dfbcaf,DISK], DatanodeInfoWithStorage[172.18.0.203:9866,DS-5e53ca55-f7a6-4ca3-b691-7838c48fd169,DISK], DatanodeInfoWithStorage[172.18.0.202:9866,DS-64752494-84a5-4022-b9b4-b1aac3c707bc,DISK]]
[INFO] 2023-07-31 06:33:37.145 +0000 -  -> 2023-07-31 14:33:37,061 INFO  org.apache.hadoop.hdfs.DataStreamer                          [] - Slow ReadProcessor read fields for block BP-578738613-127.0.0.1-1680765983742:blk_1073846764_313346 took 102819ms (threshold=30000ms); ack: seqno: 19 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 102814408448 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[172.18.0.201:9866,DS-0532fcd2-9e17-44e9-be20-846250dfbcaf,DISK], DatanodeInfoWithStorage[172.18.0.203:9866,DS-5e53ca55-f7a6-4ca3-b691-7838c48fd169,DISK], DatanodeInfoWithStorage[172.18.0.202:9866,DS-64752494-84a5-4022-b9b4-b1aac3c707bc,DISK]]
[INFO] 2023-07-31 06:33:44.147 +0000 -  -> 2023-07-31 14:33:43,437 INFO  org.apache.hadoop.hdfs.DataStreamer                          [] - Slow ReadProcessor read fields for block BP-578738613-127.0.0.1-1680765983742:blk_1073846764_313346 took 109190ms (threshold=30000ms); ack: seqno: 20 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 109187549515 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[172.18.0.201:9866,DS-0532fcd2-9e17-44e9-be20-846250dfbcaf,DISK], DatanodeInfoWithStorage[172.18.0.203:9866,DS-5e53ca55-f7a6-4ca3-b691-7838c48fd169,DISK], DatanodeInfoWithStorage[172.18.0.202:9866,DS-64752494-84a5-4022-b9b4-b1aac3c707bc,DISK]]
[INFO] 2023-07-31 06:33:47.148 +0000 -  -> 2023-07-31 14:33:46,414 INFO  org.apache.hadoop.hdfs.DataStreamer                          [] - Slow ReadProcessor read fields for block BP-578738613-127.0.0.1-1680765983742:blk_1073846764_313346 took 112171ms (threshold=30000ms); ack: seqno: 21 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 112168545196 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[172.18.0.201:9866,DS-0532fcd2-9e17-44e9-be20-846250dfbcaf,DISK], DatanodeInfoWithStorage[172.18.0.203:9866,DS-5e53ca55-f7a6-4ca3-b691-7838c48fd169,DISK], DatanodeInfoWithStorage[172.18.0.202:9866,DS-64752494-84a5-4022-b9b4-b1aac3c707bc,DISK]]
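To make the trend in the log above easier to see, here is a minimal Python sketch that extracts the seqno, the elapsed time, the threshold, the downstreamAckTimeNanos value, and the target DataNodes from a "Slow ReadProcessor" line. The regular expressions are written against the line format shown above and are an assumption for other Hadoop versions; the names parse_slow_read_processor and SlowAck are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional

# Written against the DataStreamer lines shown above; other Hadoop versions
# may format the message differently (assumption).
SLOW_RE = re.compile(
    r"Slow ReadProcessor read fields for block (?P<block>\S+) "
    r"took (?P<took_ms>\d+)ms \(threshold=(?P<threshold_ms>\d+)ms\); "
    r"ack: seqno: (?P<seqno>\d+) .*?"
    r"downstreamAckTimeNanos: (?P<downstream_ns>\d+)"
)
TARGET_RE = re.compile(r"DatanodeInfoWithStorage\[([\d.]+):(\d+),")


class SlowAck(NamedTuple):
    block: str
    seqno: int
    took_ms: int
    threshold_ms: int
    downstream_ms: float
    targets: List[str]


def parse_slow_read_processor(line: str) -> Optional[SlowAck]:
    """Parse one log line; return None if it is not a Slow ReadProcessor line."""
    m = SLOW_RE.search(line)
    if m is None:
        return None
    targets = [f"{ip}:{port}" for ip, port in TARGET_RE.findall(line)]
    return SlowAck(
        block=m["block"],
        seqno=int(m["seqno"]),
        took_ms=int(m["took_ms"]),
        threshold_ms=int(m["threshold_ms"]),
        downstream_ms=int(m["downstream_ns"]) / 1_000_000,  # ns -> ms
        targets=targets,
    )


# First "Slow ReadProcessor" line from the log above (whitespace condensed).
sample = (
    "2023-07-31 14:32:28,658 INFO  org.apache.hadoop.hdfs.DataStreamer [] - "
    "Slow ReadProcessor read fields for block "
    "BP-578738613-127.0.0.1-1680765983742:blk_1073846764_313346 took 34319ms "
    "(threshold=30000ms); ack: seqno: 5 reply: SUCCESS reply: SUCCESS "
    "reply: SUCCESS downstreamAckTimeNanos: 34317847272 flag: 0 flag: 0 "
    "flag: 0, targets: "
    "[DatanodeInfoWithStorage[172.18.0.201:9866,DS-0532fcd2-9e17-44e9-be20-846250dfbcaf,DISK], "
    "DatanodeInfoWithStorage[172.18.0.203:9866,DS-5e53ca55-f7a6-4ca3-b691-7838c48fd169,DISK], "
    "DatanodeInfoWithStorage[172.18.0.202:9866,DS-64752494-84a5-4022-b9b4-b1aac3c707bc,DISK]]"
)
ack = parse_slow_read_processor(sample)
print(ack.seqno, ack.took_ms, round(ack.downstream_ms), ack.targets)
```

In this sample the downstreamAckTimeNanos value is within 2 ms of the reported duration. That is consistent with the time being spent waiting on the DataNode pipeline rather than in the client, but treat it as a hint, not a diagnosis.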

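Analysis:

"Slow ReadProcessor" and "Slow BlockReceiver" messages usually mean that the cluster is under heavy load or that some nodes are unhealthy. This article helps you determine whether the cause is high cluster load or a hardware problem on specific nodes.

Symptoms:

1. Jobs take longer to run than before.

2. The job log contains messages like the following (logged at INFO level in this cluster's output):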

[INFO] 2023-07-31 06:33:47.148 +0000 -  -> 2023-07-31 14:33:46,414 INFO  org.apache.hadoop.hdfs.DataStreamer                          [] - Slow ReadProcessor read fields for block BP-578738613-127.0.0.1-1680765983742:blk_1073846764_313346 took 112171ms (threshold=30000ms); ack: seqno: 21 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 112168545196 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[172.18.0.201:9866,DS-0532fcd2-9e17-44e9-be20-846250dfbcaf,DISK], DatanodeInfoWithStorage[172.18.0.203:9866,DS-5e53ca55-f7a6-4ca3-b691-7838c48fd169,DISK], DatanodeInfoWithStorage[172.18.0.202:9866,DS-64752494-84a5-4022-b9b4-b1aac3c707bc,DISK]]

3. The DataNode log contains WARN messages like the following:

2023-04-17 06:23:48,796 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 341ms (threshold=300ms)
2023-06-21 06:23:55,775 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:873ms (threshold=300ms)
2023-04-17 08:37:52,397 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow flushOrSync took 534ms (threshold=300ms), isSync:false, flushTotalNanos=533345033ns 
2023-04-17 08:38:57,929 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow manageWriterOsCache took 331ms (threshold=300ms)

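Causes:

Symptom: Cause
High cluster load: If your cluster is at or near its resource limits (memory, CPU, or disk), it may be unable to guarantee data locality while processing jobs, so data blocks have to be transferred over the network. In that case, the extra load from transferring the blocks can cause WARN messages to appear in the job or DataNode logs.
Slow BlockReceiver write packet to mirror: There is a delay when writing the block over the network.
Slow BlockReceiver write data to disk cost: There is a delay when writing the block to the OS cache or to disk.
Slow flushOrSync: There is a delay when writing the block to the OS cache or to disk.
Slow manageWriterOsCache: There is a delay when writing the block to the OS cache or to disk.

Note that under normal production load, some of these WARN messages in the DataNode logs are expected. When a single node logs many more of them than usual, this points to an underlying hardware problem.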
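The symptom-to-cause mapping above can be written as a small lookup. This is a minimal Python sketch; the names SUSPECTS and suspect_for are hypothetical, and the "network" and "disk or OS cache" labels only restate the causes listed above.

```python
import re
from typing import Optional, Tuple

# Restates the symptom/cause list above: which subsystem each DataNode
# "Slow" message points at. The labels are shorthand, not Hadoop terminology.
SUSPECTS = {
    "Slow BlockReceiver write packet to mirror": "network",
    "Slow BlockReceiver write data to disk cost": "disk or OS cache",
    "Slow flushOrSync": "disk or OS cache",
    "Slow manageWriterOsCache": "disk or OS cache",
}

# Matches both "took 341ms (threshold=300ms)" and "cost:873ms (threshold=300ms)".
DURATION_RE = re.compile(r"(?:took |cost:)(\d+)ms \(threshold=(\d+)ms\)")


def suspect_for(line: str) -> Optional[Tuple[str, str, int, int]]:
    """Return (message, suspect, duration_ms, threshold_ms), or None if no match."""
    for message, suspect in SUSPECTS.items():
        if message in line:
            m = DURATION_RE.search(line)
            if m is None:
                return None
            return message, suspect, int(m.group(1)), int(m.group(2))
    return None


line = (
    "2023-04-17 06:23:48,796 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: "
    "Slow BlockReceiver write packet to mirror took 341ms (threshold=300ms)"
)
print(suspect_for(line))
```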

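Resolution:

The following steps help identify the underlying hardware problem behind the "Slow" messages in the DataNode logs.

1. Run the following command on each DataNode to count all the Slow messages: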

grep -Eo "Slow.*(took|cost)" /path/to/current/datanode/log | sort | uniq -c

This command gives the count of each type of "Slow" message in the DataNode log. The output will look similar to:

1000 Slow BlockReceiver write data to disk cost
234 Slow BlockReceiver write packet to mirror took
4 Slow flushOrSync took
6 Slow manageWriterOsCache took
41 Slow PacketResponder send ack to upstream took
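When the logs have already been copied to another machine, the same count can be produced in Python. This is a minimal sketch that mirrors the command above; count_slow_messages is a hypothetical helper name, and the sample lines reuse the DataNode messages quoted earlier (one of them repeated to show the counting).

```python
import re
from collections import Counter
from typing import Iterable

# Same idea as the grep pattern above: capture from "Slow" up to and
# including the first "took" or "cost" on the line.
SLOW_CATEGORY_RE = re.compile(r"Slow.*?(?:took|cost)")


def count_slow_messages(lines: Iterable[str]) -> Counter:
    """Count occurrences of each "Slow ..." message category."""
    counts: Counter = Counter()
    for line in lines:
        m = SLOW_CATEGORY_RE.search(line)
        if m:
            counts[m.group(0)] += 1
    return counts


prefix = "WARN org.apache.hadoop.hdfs.server.datanode.DataNode: "
sample_lines = [
    prefix + "Slow BlockReceiver write packet to mirror took 341ms (threshold=300ms)",
    prefix + "Slow BlockReceiver write data to disk cost:873ms (threshold=300ms)",
    prefix + "Slow BlockReceiver write data to disk cost:873ms (threshold=300ms)",
    prefix + "Slow flushOrSync took 534ms (threshold=300ms), isSync:false",
    prefix + "Slow manageWriterOsCache took 331ms (threshold=300ms)",
]
counts = count_slow_messages(sample_lines)
for category, n in counts.most_common():
    print(n, category)
```

To run it against a real log, pass an open file object instead of the sample list, since a file iterates line by line.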

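2. If, on a single node, one or more categories of "Slow" messages outnumber those on the other hosts by an order of magnitude, investigate that node for an underlying hardware problem.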
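Once the per-host counts are collected, the comparison in step 2 can be automated. This is a minimal sketch that takes "an order of magnitude" to mean at least 10 times the median count on the other hosts. That factor, the function name find_outliers, and the host names and counts for dn2 and dn3 are illustrative assumptions, not measurements.

```python
from statistics import median
from typing import Dict, List, Tuple


def find_outliers(
    counts_by_host: Dict[str, Dict[str, int]], factor: float = 10.0
) -> List[Tuple[str, str, int, float]]:
    """Return (host, category, count, baseline) for counts >= factor * baseline.

    The baseline is the median count of the same category on all other hosts.
    """
    outliers = []
    categories = {c for counts in counts_by_host.values() for c in counts}
    for category in sorted(categories):
        for host in sorted(counts_by_host):
            own = counts_by_host[host].get(category, 0)
            others = [
                counts.get(category, 0)
                for h, counts in counts_by_host.items()
                if h != host
            ]
            if not others:
                continue  # nothing to compare a single host against
            baseline = median(others)
            # max(baseline, 1) avoids flagging tiny counts against a zero baseline
            if own >= factor * max(baseline, 1):
                outliers.append((host, category, own, baseline))
    return outliers


DISK = "Slow BlockReceiver write data to disk cost"
MIRROR = "Slow BlockReceiver write packet to mirror took"
counts_by_host = {
    "dn1": {DISK: 1000, MIRROR: 234},  # counts from the sample output above
    "dn2": {DISK: 40, MIRROR: 210},    # hypothetical
    "dn3": {DISK: 55, MIRROR: 190},    # hypothetical
}
result = find_outliers(counts_by_host)
print(result)
```

The median is used rather than the mean so that one very noisy host does not raise the baseline for every other host.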

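3. If the most frequent Slow message is Slow BlockReceiver write packet to mirror took, investigate possible network problems using the output of the following commands:

  • ifconfig -a (check the problem host periodically for rising errors and dropped counts; an increase usually points to a problem with the NIC, the cable, or the upstream network)

  • netstat -s (compared with a healthy node, look for a large number of retransmitted packets or other abnormally high metrics)

  • netstat -s | grep -i retrans (run it across the whole cluster and look for counts that are higher than normal on one or more nodes)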
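To compare retransmission counters across nodes, the output of netstat -s | grep -i retrans can be turned into numbers. This is a minimal sketch; the sample text is illustrative (counter names and wording vary between operating systems and netstat versions), and parse_retrans is a hypothetical helper.

```python
import re
from typing import Dict

COUNT_FIRST_RE = re.compile(r"^\s*(\d+)\s+(.+?)\s*$")  # e.g. "1245 segments retransmitted"
NAME_FIRST_RE = re.compile(r"^\s*(\S+):\s+(\d+)\s*$")  # e.g. "TCPLostRetransmit: 12"


def parse_retrans(text: str) -> Dict[str, int]:
    """Extract retransmission-related counters from netstat -s style text."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        if "retrans" not in line.lower():
            continue
        m = COUNT_FIRST_RE.match(line)
        if m:
            counters[m.group(2)] = int(m.group(1))
            continue
        m = NAME_FIRST_RE.match(line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters


# Illustrative sample, not captured from the cluster in this article.
sample_output = "\n".join([
    "    98765 segments sent out",
    "    1245 segments retransmitted",
    "    34 fast retransmits",
    "    TCPLostRetransmit: 12",
])
retrans = parse_retrans(sample_output)
print(retrans)
```

These counters are cumulative since boot, so comparing the increase over a fixed interval on each node is usually more informative than comparing the raw totals.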

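4. If the most frequent Slow messages are any of the other messages, check for disk problems using the following commands:

  • iostat (high iowait percentage, above 15%)

  • iostat -x and sar -d (high await or %util on specific partitions)

  • dmesg (disk errors)

  • Run a health check on the disks with smartctl: stop all Hadoop processes on the affected node, then run sudo smartctl -H /dev/ to check each disk used by HDFS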
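The 15% iowait rule of thumb from the list above can be checked against the avg-cpu section that iostat prints. This is a minimal sketch; the sample report is illustrative rather than captured from the cluster in this article, the column layout may differ between sysstat versions, and iowait_percent is a hypothetical helper.

```python
from typing import Optional

IOWAIT_THRESHOLD = 15.0  # percent, the rule of thumb quoted above


def iowait_percent(report: str) -> Optional[float]:
    """Return %iowait from the first avg-cpu section, or None if absent."""
    lines = report.splitlines()
    for i, line in enumerate(lines):
        if line.lstrip().startswith("avg-cpu:"):
            headers = line.split()[1:]  # drop the "avg-cpu:" label
            if i + 1 >= len(lines) or "%iowait" not in headers:
                return None
            values = lines[i + 1].split()
            if len(values) != len(headers):
                return None
            return float(values[headers.index("%iowait")])
    return None


# Illustrative sample of the avg-cpu section.
sample_report = "\n".join([
    "avg-cpu:  %user   %nice %system %iowait  %steal   %idle",
    "           5.12    0.00    1.20   18.40    0.00   75.28",
])
value = iowait_percent(sample_report)
if value is not None and value > IOWAIT_THRESHOLD:
    print(f"iowait {value}% exceeds {IOWAIT_THRESHOLD}%: check the disks")
```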
