现象:Apollo告警平台16台JBOSS服务器中有一台登陆不上,堆内存耗尽,OOM异常。
分析:
1、取回dump日志,通过IBM HeapAnalyzer分析
NonRegisteringDriver类concurrentHashMap占用了73%的堆内存。
疑点:为何GC回收不了此内存?
打开5.1.45版本的mysql-connector-java下NonRegisteringDriver.java。
71 protectedstaticfinalConcurrentHashMap
348 protectedstaticvoidtrackConnection(Connection newConn) {
349
350 ConnectionPhantomReference phantomRef = newConnectionPhantomReference((ConnectionImpl) newConn, refQueue);
351 connectionPhantomRefs.put(phantomRef, phantomRef);
352 }
staticclassConnectionPhantomReference extendsPhantomReference
889 privateNetworkResources io;
890
891 ConnectionPhantomReference(ConnectionImpl connectionImpl, ReferenceQueue
892 super(connectionImpl, q);
893
894 try{
895 this.io = connectionImpl.getIO().getNetworkResources();
896 } catch(SQLException e) {
897 // if we somehow got here and there's really no i/o, we deal with it later
898 }
899 }
900
901 voidcleanup() {
902 if(this.io != null) {
903 try{
904 this.io.forceClose();
905 } finally{
906 this.io = null;
907 }
908 }
909 }
910 }
显然connectionPhantomRefs map占用太多的堆内存了。同时ConnectionPhantomReference 继承PhantomReference。
而这个Reference如果不调用cleanup,将io=null,是无法通过GC回收的。
那么正常情况下,谁负责调用cleanup呢?这里connect/j中会启用AbandonedConnectionUpThread这个常驻线程,负责清空此队列。
AbandonedConnectionCleanUpThread.java
60 publicvoidrun() {
61 for(;;) {
62 try{
63 checkContextClassLoaders();
64 Reference extendsConnectionImpl> ref = NonRegisteringDriver.refQueue.remove(5000);
65 if(ref != null) {
66 try{
67 ((ConnectionPhantomReference) ref).cleanup();
68 } finally{
69 NonRegisteringDriver.connectionPhantomRefs.remove(ref);
70 }
71 }
72
73 } catch(InterruptedException e) {
74 threadRef = null;
75 return;
76
77 } catch(Exception ex) {
78 // Nowhere to really log this.
79 }
80 }
81 }
82
既然有这个线程保证回收队列,那为什么又占那么多堆内存呢。莫非是线程挂起或者阻塞了?
通过visualvm工具打开dump,显示线程信息,找到Abandoned线程
"Abandoned connection -
thread" daemon prio=5 tid=90 BLOCKED
at java.util.ArrayList.
at org.jboss.modules.ModuleClassLoader.loadResourceLocal(ModuleClassLoader.java:349)
Local Variable: java.util.ArrayList#54264
at org.jboss.modules.ModuleClassLoader$1.loadResourceLocal(ModuleClassLoader.java:103)
Local Variable: org.jboss.modules.ModuleClassLoader$1#14
at org.jboss.modules.Module.getResource(Module.java:600)
Local Variable: org.jboss.modules.Module#15
at org.jboss.modules.ModuleClassLoader.findResource(ModuleClassLoader.java:574)
Local Variable: org.jboss.modules.ModuleClassLoader#15
Local Variable: java.lang.String#1101031
at org.jboss.modules.ConcurrentClassLoader.getResource(ConcurrentClassLoader.java:261)
at com.mysql.jdbc.AbandonedConnectionCleanupThread.checkContextClassLoaders(AbandonedConnectionCleanupThread.java:90)
at com.mysql.jdbc.AbandonedConnectionCleanupThread.run(AbandonedConnectionCleanupThread.java:63)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Local Variable: java.util.concurrent.ThreadPoolExecutor#7
Local Variable: com.mysql.jdbc.AbandonedConnectionCleanupThread#1
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Local Variable: java.util.concurrent.ThreadPoolExecutor$Worker#48
at java.lang.Thread.run(Thread.java:724)
发现果然线程被blocked,阻塞状态。等待ArrayList,init方法处理完。
查看上下文,发现是checkContextClassLoaders()的问题。
调用链方法如下:
AbandoneConnectionCleanUpThread.java
privatevoidcheckContextClassLoaders() {
89 try{
90 threadRef.getContextClassLoader().getResource("");
91 } catch(Throwable e) {
92 // Shutdown no matter what.
93 uncheckedShutdown();
94 }
95 }
ConcurrentClassLoader.java
publicfinalURL getResource(finalString name) {
250 for(String s : Module.systemPaths) {
251 if(name.startsWith(s)) {
252 returnJDKSpecific.getSystemResource(name);
253 }
254 }
255 returnfindResource(name, false);
256 }
ModuleClassLoader.java
/**{@inheritDoc} */
650 @Override
651 publicfinalURL findResource(finalString name, finalbooleanexportsOnly) {
652 returnmodule.getResource(name);
653 }
Module.java
748 * @returnthe resource URL, or {@code null}if not found
749 */
750 URL getResource(finalString name) {
751 finalString canonPath = PathUtils.canonicalize(name);
752 for(String s : Module.systemPaths) {
753 if(canonPath.startsWith(s)) {
754 returnmoduleClassLoader.getResource(canonPath);
755 }
756 }
757 log.trace("Attempting to find resource %s in %s", canonPath, this);
758 finalString path = pathOf(canonPath);
759 finalURLConnectionResource jaxpResource = ModuleClassLoader.jaxpImplResources.get(canonPath);
760 finalMap
761 finalList
762 if(loaders != null) {
763 for(LocalLoader loader : loaders) {
764 finalIterator
765 if(iterator.hasNext()) {
766 finalURL url = iterator.next().getURL();
767 if(jaxpResource != null) log.jaxpResourceLoaded(url, this);
768 returnurl;
769 }
770 }
771 }
772 finalLocalLoader fallbackLoader = this.fallbackLoader;
773 if(fallbackLoader != null) {
774 finalIterator
775 if(iterator.hasNext()) {
776 finalURL url = iterator.next().getURL();
777 if(jaxpResource != null) log.jaxpResourceLoaded(url, this);
778 returnurl;
779 }
780 }
781 if(jaxpResource != null) {
782 finalURL url = jaxpResource.getURL();
783 log.jaxpResourceLoaded(url, this);
784 returnurl;
785 }
786 returnnull;
787 }
moduleClassLoader.java
/**
395 *Load a local exported resource from this class loader.
396 *
397 * @paramnamethe resource name
398 * @returnthe list of resources
399 */
400 publicList
401 finalMap
402
403 finalString path = Module.pathOf(name);
404 finalURLConnectionResource jaxpResource = jaxpImplResources.get(name);
405
406 finalList
407 finalList
408 if(loaders != null) {
409 for(ResourceLoader loader : loaders) {
410 finalResource resource = loader.getResource(name);
411 if(resource != null) {
412 if(jaxpResource != null) Module.log.jaxpResourceLoaded(resource.getURL(), module);
413 list.add(resource);
414 }
415 }
416 }
417 returnlist.isEmpty() ? Collections.emptyList() : list;
418 }
分析是ArrayList init的时候阻塞线程,那么就是407行的初始化loaders.size()大小的队列。
考虑到此dump占用了99.7%的空间,说明这个地方已经无法申请那么多空间内存,所以回收线程被阻塞了。
那么concurrentHashMap无法回收数据,导致内存泄漏,最终发生OOM。
所以结论就是堆内存空间不足,导致Abandon线程阻塞,引起内存泄漏,发生OOM
紧接着问题就是为什么堆内存空间不足?
正常情况下的单台服务器占用500M堆内存,最多到2G。但是这台出问题的服务器,内存一直在2G左右。
IBMheap Analyzer分析第二疑似点为:
JBOSS的org.jboss.module.linkage的链表也很大,占用了22%的空间。
57 Linkage(finalDependencySpec[] dependencySpecs, finalDependency[] dependencies, finalState state, finalMap
58 this.dependencySpecs = dependencySpecs;
59 this.dependencies = dependencies;
60 this.state = state;
61 this.allPaths = PathUtils.deduplicateLists(allPaths);
62 }
LinkAge的构造函数里面进行了拷贝操作。
static
400 if(allPaths == null) {
401 returnnull;
402 } elseif(allPaths.size() == 0) {
403 returnCollections.emptyMap();
404 } else{
405 Map
406 Map, List
newHashMap<>();
407 for(Map.Entry
408 List
409 if(l == null) {
410 dedup.put(e.getValue(), l = Collections.unmodifiableList(newArrayList<>(e.getValue())));
411 }
412 newPaths.put(e.getKey(), l);
413 }
414 returnCollections.unmodifiableMap(newPaths);
415 }
416 }
而Module.getPaths,又会new linkage
1554 Map
1555 Linkage oldLinkage = this.linkage;
1556 Linkage linkage;
1557 Linkage.State state = oldLinkage.getState();
1558 if(state == Linkage.State.LINKED) {
1559 returnoldLinkage.getPaths();
1560 }
1561 // slow path loop
1562 booleanintr = false;
1563 try{
1564 for(;;) {
1565 synchronized(this) {
1566 oldLinkage = this.linkage;
1567 state = oldLinkage.getState();
1568 while(state == Linkage.State.LINKING || state == Linkage.State.NEW) try{
1569 wait();
1570 oldLinkage = this.linkage;
1571 state = oldLinkage.getState();
1572 } catch(InterruptedException e) {
1573 intr = true;
1574 }
1575 if(state == Linkage.State.LINKED) {
1576 returnoldLinkage.getPaths();
1577 }
1578 this.linkage = linkage = newLinkage(oldLinkage.getDependencySpecs(), oldLinkage.getDependencies(), Linkage.State.LINKING);
1579 // fall out and link
1580 }
1581 booleanok = false;
1582 try{
1583 link(linkage);
1584 ok = true;
1585 } finally{
1586 if(! ok) {
1587 // restore original (lack of) linkage
1588 synchronized(this) {
1589 if(this.linkage == linkage) {
1590 this.linkage = oldLinkage;
1591 notifyAll();
1592 }
…
这里有无限循环。
所以在Abandon线程在check的时候,可能会不停创建LInkage。
同时分析正常环境的dump,不存在此现象。
我有理由怀疑,Abandon线程在某种场景下会不停的创建LinkAge,
而这句话threadRef.getContextClassLoader().getResource("”);
可能就是问题的原因。
重启WILDFLY后,此问题不复现。Abandon线程正常,不会一直有进程加载资源。
一切都恢复了原样!然而平静的表面下,还是暗藏着玄机。