Environment
- Flink 1.3.5
- Hive 2.1.1
Symptoms
- The job runs fine for a while, then hits an OOM.
- GC stats below were captured with the Arthas `dashboard` command:
```
Memory                   used    total   max     usage    GC
heap                     7839M   7912M   7912M   99.08%   gc.g1_young_generation.count     3103
g1_eden_space            266M    338M    -1      78.70%   gc.g1_young_generation.time(ms)  149029
g1_survivor_space        2M      2M      -1      100.00%  gc.g1_old_generation.count       84
g1_old_gen               7571M   7572M   7912M   95.70%   gc.g1_old_generation.time(ms)    1728307
nonheap                  201M    213M    1520M   13.29%
code_cache               75M     77M     240M    31.47%
metaspace                113M    120M    256M    44.24%
compressed_class_space   13M     14M     1024M   1.29%
direct                   1030M   1030M   -       100.00%
mapped                   0K      0K      -       0.00%
```
The `g1_old_gen` row (7571M used of a 7912M max, 95.70%) shows the old generation is essentially full.
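The numbers the Arthas dashboard reports can also be cross-checked from inside the JVM via `MemoryPoolMXBean`, which is useful when an agent cannot be attached. A minimal sketch (class name is illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class PoolUsage {
    public static void main(String[] args) {
        // Print used/committed/max for every heap pool, mirroring the
        // dashboard columns; max is reported as -1 when the pool is unbounded.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) continue;
            long used = pool.getUsage().getUsed();
            long committed = pool.getUsage().getCommitted();
            long max = pool.getUsage().getMax();
            System.out.printf("%-25s used=%dM committed=%dM max=%s%n",
                    pool.getName(), used >> 20, committed >> 20,
                    max < 0 ? "-1" : (max >> 20) + "M");
        }
    }
}
```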
- `jmap -histo:live 13176 | head -n 10`:

```
 num     #instances         #bytes  class name
----------------------------------------------
   1:      29038550     4765922152  [C
   2:      78950689     3158027560  java.util.LinkedHashMap$Entry
   3:      29037950      696910800  java.lang.String
   4:       3039863      660059128  [Ljava.util.HashMap$Node;
   5:       3037192      170082752  java.util.LinkedHashMap
   6:          7459       48344936  [B
   7:            98        3212832  [Lakka.dispatch.forkjoin.ForkJoinTask;
   8:         32815        2100160  java.nio.DirectByteBuffer
   9:         18195        2023848  java.lang.Class
  10:         33261        1862616  org.apache.flink.core.memory.MemorySegment
  11:         48302        1545664  java.util.concurrent.ConcurrentHashMap$Node
  12:         14347        1328040  [Ljava.lang.Object;
  13:         32817        1312680  sun.misc.Cleaner
  14:         32810        1049920  java.nio.DirectByteBuffer$Deallocator
```

Char arrays (`[C`, ~4.7 GB) and `java.util.LinkedHashMap$Entry` (~78.9M instances, ~3 GB) dominate the heap.
Diagnosis
- 1. Find out which loaded classes subclass `java.util.LinkedHashMap`, and which class loader loaded them. `sc -d java.util.LinkedHashMap` returns:

```
[arthas@510008]$ sc -d java.util.LinkedHashMap

class-info       org.apache.hadoop.hive.ql.udf.UDFJson$HashCache
code-source      /HDATA/8/yarn/local1/usercache/mario/appcache/application_1572953124045_54575/filecache/37/flink-hudi090-20211215-1.0-SNAPSHOT-jar-with-dependencies.jar
simple-name      HashCache
modifier         static
super-class      +-java.util.LinkedHashMap
                   +-java.util.HashMap
                     +-java.util.AbstractMap
                       +-java.lang.Object
class-loader     +-sun.misc.Launcher$AppClassLoader@6e0be858
                   +-sun.misc.Launcher$ExtClassLoader@e73f9ac
classLoaderHash  6e0be858

class-info       org.apache.hadoop.hive.ql.udf.generic.GenericUDTFJSONTuple$HashCache
code-source      (same jar as above)
classLoaderHash  6e0be858

class-info       org.apache.hudi.org.apache.avro.Schema$Names
code-source      (same jar as above)
classLoaderHash  6e0be858

class-info       org.codehaus.jackson.util.InternCache
code-source      (same jar as above)
classLoaderHash  6e0be858
```

(Output condensed; every entry reports the same code source, `LinkedHashMap` superclass chain, and class loader.)
The output shows the `LinkedHashMap` subclasses in the process are:

- `org.apache.hadoop.hive.ql.udf.UDFJson$HashCache` (classLoaderHash 6e0be858)
- `org.apache.hadoop.hive.ql.udf.generic.GenericUDTFJSONTuple$HashCache` (classLoaderHash 6e0be858)
- `org.apache.hudi.org.apache.avro.Schema$Names` (classLoaderHash 6e0be858)
- `org.codehaus.jackson.util.InternCache` (classLoaderHash 6e0be858)
- 2. `org.apache.hadoop.hive.ql.udf.UDFJson` looked suspicious, so count its instances:

```
[arthas@350606]$ vmtool --action getInstances --className org.apache.hadoop.hive.ql.udf.UDFJson --limit 100
@UDFJson[][
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@277ec174],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@2a46f1fe],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@67cd8d87],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@3fafc749],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@6e0bc4a9],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@735f0197],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@1b8fc53e],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@78f50bce],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@39a10e16],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@9bfab07],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@2e7401e],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@521c90f7],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@63fcf51c],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@17ef8df7],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@2f30c8a4],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@191d6e9d],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@14304fa2],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@1dcd2cb7],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@76a68acc],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@5ffa1954],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@2d61fdb0],
    @UDFJson[org.apache.hadoop.hive.ql.udf.UDFJson@593dd43f],
]
```
- Not many instances. Check `org.apache.hadoop.hive.ql.udf.UDFJson$HashCache` the same way:

```
[arthas@350606]$ vmtool --action getInstances --className org.apache.hadoop.hive.ql.udf.UDFJson$HashCache --limit 20
@HashCache[][
    @HashCache[isEmpty=false;size=11],
    @HashCache[isEmpty=false;size=11],
    @HashCache[isEmpty=false;size=11],
    @HashCache[isEmpty=false;size=11],
    @HashCache[isEmpty=false;size=1653617],
]
```
- The 5th `HashCache` has a wildly anomalous size, and re-running the command a while later shows it still growing, which essentially pins down the leak. You can also dump the cache contents with `-x 2`; unfortunately the 5th instance is too large for the command to return, but the other four come back as:

```
[arthas@350606]$ vmtool --action getInstances -c 6e0be858 --className org.apache.hadoop.hive.ql.udf.UDFJson$HashCache -x 2
@HashCache[][
    @HashCache[
        @String[sws]:@Boolean[true],
        @String[pickup]:@Boolean[true],
        @String[elecSign]:@Boolean[true],
        @String[addresseeBuildingId]:@Boolean[true],
        @String[addresseeAoiDeptCode]:@Boolean[true],
        @String[consignorAoiDeptCode]:@Boolean[true],
        @String[consignorAoiId]:@Boolean[true],
        @String[wemWaybill]:@Boolean[true],
        @String[wemAckBill]:@Boolean[true],
        @String[uploadWaybill]:@Boolean[true],
        @String[wbepWaybill]:@Boolean[true],
    ],
    @HashCache[
        @String[sws]:@String[sws],
        ... (the same 11 keys, each mapped to itself)
    ],
    @HashCache[
        @String[sws]:@ArrayList[isEmpty=true;size=0],
        ... (the same 11 keys, each mapped to an empty list)
    ],
    @HashCache[
        @String[$.sws]:@String[][isEmpty=false;size=2],
        ... (the same 11 keys as JSON paths, each mapped to a String[2])
    ],
    @HashCache[ ],
]
```
- You can also evaluate an expression against the 5th instance to read its size directly:

```
[arthas@350606]$ vmtool --action getInstances -c 6e0be858 --className org.apache.hadoop.hive.ql.udf.UDFJson$HashCache --express 'instances[4].size()'
@Integer[1864440]
```
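A size near 1.9 million is anomalous because this eviction pattern, used single-threaded, caps the map at 16 entries. A minimal sketch (class names are illustrative, not Hive's actual source):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruCap {
    // LRU-style cache: removeEldestEntry evicts the oldest entry
    // whenever an insertion pushes the size past CACHE_SIZE.
    static class Cache<K, V> extends LinkedHashMap<K, V> {
        private static final int CACHE_SIZE = 16;
        Cache() { super(32, 0.6f); }
        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > CACHE_SIZE;
        }
    }

    public static void main(String[] args) {
        Cache<String, String> cache = new Cache<>();
        for (int i = 0; i < 100_000; i++) {
            cache.put("key-" + i, "value");
        }
        // Single-threaded, the cache never exceeds CACHE_SIZE
        System.out.println(cache.size()); // 16
    }
}
```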
- 3. At this point the leak is essentially located. Other Arthas commands used along the way: `profiler start --event alloc`, `profiler getSamples`, `profiler stop --format html --file /tmp/dw/output2.html`, `watch`, etc.
- 4. Root cause: `UDFJson$HashCache` is a static `LinkedHashMap` shared across threads without synchronization; since `LinkedHashMap` is not thread-safe, concurrent access corrupts its eviction bookkeeping, old entries are never cleaned up in time, and the map grows without bound. A minimal reproduction:
```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

public class TEST {
    static class HashCache<K, V> extends LinkedHashMap<K, V> {
        private static final int CACHE_SIZE = 16;
        private static final int INIT_SIZE = 32;
        private static final float LOAD_FACTOR = 0.6f;
        private static final long serialVersionUID = 1;

        HashCache() {
            super(INIT_SIZE, LOAD_FACTOR);
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > CACHE_SIZE;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final HashCache<String, String> cache = new HashCache<>();
        // Hammer the shared cache from several threads. Without
        // synchronization, concurrent puts corrupt the eviction
        // bookkeeping and size() can climb far past CACHE_SIZE.
        for (int i = 0; i < 4; i++) {
            Thread t = new Thread(() -> {
                while (true) {
                    cache.put(UUID.randomUUID().toString(), "");
                }
            });
            t.setDaemon(true);
            t.start();
        }
        while (true) {
            Thread.sleep(1000);
            System.out.println("size = " + cache.size());
        }
    }
}
```
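The remedy is to keep the cache from being mutated concurrently. One option, sketched below under the assumption that each thread can tolerate its own independent cache (class and field names are illustrative, not Hive's actual fix), is to make the LRU map thread-local:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

public class SafeHashCache {
    private static final int CACHE_SIZE = 16;

    // Per-thread LRU: each thread mutates only its own LinkedHashMap,
    // so removeEldestEntry's bookkeeping is never corrupted.
    private static final ThreadLocal<Map<String, String>> CACHE =
            ThreadLocal.withInitial(() -> new LinkedHashMap<String, String>(32, 0.6f) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                    return size() > CACHE_SIZE;
                }
            });

    public static void main(String[] args) throws InterruptedException {
        Runnable hammer = () -> {
            Map<String, String> cache = CACHE.get();
            for (int i = 0; i < 100_000; i++) {
                cache.put(UUID.randomUUID().toString(), "v");
            }
            // Each thread's cache stays capped at CACHE_SIZE
            System.out.println(cache.size());
        };
        Thread t1 = new Thread(hammer);
        Thread t2 = new Thread(hammer);
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}
```

Alternatives include wrapping the map with `Collections.synchronizedMap` or bounding it with a concurrent cache library; either way, the key point is that an unsynchronized shared `LinkedHashMap` is the leak.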