Java 7 HashMap: Concurrent Multithreaded Access Drives CPU to 100%

Symptoms

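While deploying a Java 7 service to production, we found that one machine could not serve traffic after the deployment finished. A large number of threads were BLOCKED, which triggered an alert: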

JVM thread monitoring

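The monitoring shows that the number of live threads in the JVM had passed 2k, which is abnormal in itself. In addition, nearly 1.8k of those threads were BLOCKED, which means the service never finished starting up properly.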

Analysis

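The abnormal thread count and the blocked threads are cause and effect: because threads were blocked, more threads were needed to do the work, so new threads kept being created.
The next step was therefore to find out where the threads were blocked. A quick investigation showed that a large number of them were blocked at the same place: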

    at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:213)
    at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:308)
    at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:197)

Here is the code at the blocking point, org.springframework.beans.factory.support.DefaultSingletonBeanRegistry#getSingleton(java.lang.String, boolean):

    /**
     * Return the (raw) singleton object registered under the given name.
     * <p>Checks already instantiated singletons and also allows for an early
     * reference to a currently created singleton (resolving a circular reference).
     * @param beanName the name of the bean to look for
     * @param allowEarlyReference whether early references should be created or not
     * @return the registered singleton object, or {@code null} if none found
     */
    protected Object getSingleton(String beanName, boolean allowEarlyReference) {
        Object singletonObject = this.singletonObjects.get(beanName);
        if (singletonObject == null && isSingletonCurrentlyInCreation(beanName)) {
            synchronized (this.singletonObjects) {
                singletonObject = this.earlySingletonObjects.get(beanName);
                if (singletonObject == null && allowEarlyReference) {
                    ObjectFactory singletonFactory = this.singletonFactories.get(beanName);
                    if (singletonFactory != null) {
                        singletonObject = singletonFactory.getObject();
                        this.earlySingletonObjects.put(beanName, singletonObject);
                        this.singletonFactories.remove(beanName);
                    }
                }
            }
        }
        return (singletonObject != NULL_OBJECT ? singletonObject : null);
    }

org.springframework.beans.factory.support.DefaultSingletonBeanRegistry#getSingleton(java.lang.String, boolean) does contain a synchronized block. A thread that needs to run that block must first acquire the lock; otherwise it is BLOCKED.

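What we can establish at this point is that calling org.springframework.beans.factory.support.DefaultSingletonBeanRegistry#getSingleton(java.lang.String, boolean) can indeed block threads that compete for the monitor. According to the alert, however, almost all threads were blocked, so there might be a deadlock that keeps the lock from ever being released, leaving every thread that calls this method blocked. To find the actual cause, we first took a thread dump: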

"xxx-13-thread-1" daemon prio=10 tid=0x00007fa3790a1000 nid=0x48b4c waiting for monitor entry [0x00007fa38e17b000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:213)
    - waiting to lock <0x0000000727af5b68> (a java.util.concurrent.ConcurrentHashMap)
    at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:308)
    at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:197)
    at org.springframework.aop.aspectj.annotation.BeanFactoryAspectInstanceFactory.getAspectInstance(BeanFactoryAspectInstanceFactory.java:83)
    at org.springframework.aop.aspectj.annotation.LazySingletonAspectInstanceFactoryDecorator.getAspectInstance(LazySingletonAspectInstanceFactoryDecorator.java:53)
    at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:627)
    at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:616)
    at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:70)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:168)
    at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:92)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:671)
    ...
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:736)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
    at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:92)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:671)
    ...
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

All blocked threads had the same stack as the one above. The key line is:

waiting to lock <0x0000000727af5b68> (a java.util.concurrent.ConcurrentHashMap)

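0x0000000727af5b68 is the address of the object. The hint that follows it, (a java.util.concurrent.ConcurrentHashMap), also confirms that this is the synchronization object analyzed above. Here is that object again: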

    /** Cache of singleton objects: bean name --> bean instance */
    private final Map singletonObjects = new ConcurrentHashMap(256);

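Now I needed to know which thread was holding the lock on object 0x0000000727af5b68 without releasing it, thereby blocking the others. To find the holder, search the thread dump for the string "locked <0x0000000727af5b68>". By the rules of monitor acquisition, only one thread can hold the object's lock at a time. The search turned up the following stack: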

"main" prio=10 tid=0x00007fa4a0018000 nid=0x4889e runnable [0x00007fa4a8fea000]
   java.lang.Thread.State: RUNNABLE
    at java.util.HashMap.put(HashMap.java:494)
    at org.apache.thrift.meta_data.FieldMetaData.addStructMetaDataMap(FieldMetaData.java:49)
  ...
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:191)
    at com.sun.proxy.$Proxy346.<clinit>(Unknown Source)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at java.lang.reflect.Proxy.newInstance(Proxy.java:764)
    at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:755)
    at org.springframework.aop.framework.JdkDynamicAopProxy.getProxy(JdkDynamicAopProxy.java:122)
    at org.springframework.aop.framework.JdkDynamicAopProxy.getProxy(JdkDynamicAopProxy.java:112)
    at org.springframework.aop.framework.ProxyFactory.getProxy(ProxyFactory.java:96)
  ...
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.invokeCustomInitMethod(AbstractAutowireCapableBeanFactory.java:1759)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.invokeInitMethods(AbstractAutowireCapableBeanFactory.java:1696)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1626)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:553)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:481)
    at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:312)
    at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:230)
    - locked <0x0000000727af5b68> (a java.util.concurrent.ConcurrentHashMap)
    at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:308)
    at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:197)
    at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:756)
    at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:867)
    at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:542)
    - locked <0x00000007292f2fb0> (a java.lang.Object)
    at org.springframework.boot.context.embedded.EmbeddedWebApplicationContext.refresh(EmbeddedWebApplicationContext.java:123)
    at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:666)
    at org.springframework.boot.SpringApplication.refreshContext(SpringApplication.java:353)
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:300)

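It is the main thread. It holds the lock on object 0x0000000727af5b68, and its state is RUNNABLE, so it is not blocked. In other words, this is not a deadlock. Most likely the main thread is stuck in an infinite loop, so the lock it holds is never released and every other thread that needs the lock on 0x0000000727af5b68 stays BLOCKED.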
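The same lookup can also be done from inside a JVM. The following is a minimal sketch, not the procedure used in this investigation: it assumes a hypothetical LOCK object standing in for singletonObjects, a "holder" thread that spins while owning the monitor, and a "waiter" thread that blocks on it. It then asks ThreadMXBean which thread owns the monitor the waiter is blocked on.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

final class LockOwnerDemo {
    // Hypothetical stand-in for the contended monitor (singletonObjects in the article).
    private static final Object LOCK = new Object();
    private static volatile boolean spinning;

    /** Returns "<owner name>:<owner state>" for the thread that keeps the waiter BLOCKED. */
    static String findOwner() {
        spinning = true;
        final CountDownLatch lockHeld = new CountDownLatch(1);
        Thread holder = new Thread(new Runnable() {
            public void run() {
                synchronized (LOCK) {
                    lockHeld.countDown();
                    while (spinning) {
                        // Busy loop: the thread stays RUNNABLE and never leaves the block.
                    }
                }
            }
        }, "holder");
        Thread waiter = new Thread(new Runnable() {
            public void run() {
                synchronized (LOCK) {
                    // Entered only once the holder has released the monitor.
                }
            }
        }, "waiter");
        holder.setDaemon(true);
        waiter.setDaemon(true);
        try {
            holder.start();
            lockHeld.await();          // make sure the holder owns the monitor first
            waiter.start();
            while (waiter.getState() != Thread.State.BLOCKED) {
                Thread.sleep(10);      // wait until the waiter is parked on the monitor
            }
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            ThreadInfo waiterInfo = threads.getThreadInfo(waiter.getId());
            ThreadInfo ownerInfo = threads.getThreadInfo(waiterInfo.getLockOwnerId());
            return ownerInfo.getThreadName() + ":" + ownerInfo.getThreadState();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        } finally {
            spinning = false;          // let the holder exit and release the monitor
        }
    }

    public static void main(String[] args) {
        System.out.println(findOwner());
    }
}
```

In this sketch the owner is reported as RUNNABLE rather than BLOCKED or WAITING, which is the same signal that distinguishes a spinning lock holder from a deadlock in the dump above.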

Seeing java.util.HashMap.put(HashMap.java:494) at the top of the stack gave me a jolt: it looked like I had found a bug in thrift. This deserves a closer look.

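HashMap has not been thread-safe in any Java version to date. If a Map is accessed concurrently in your scenario, you cannot use a HashMap without external synchronization, or you will run into problems of varying severity. We were on Java 7, where concurrent access to a HashMap from multiple threads can send a thread into an infinite loop.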

To illustrate, here is the put method of HashMap:

    public V put(K key, V value) {
        if (table == EMPTY_TABLE) {
            inflateTable(threshold);
        }
        if (key == null)
            return putForNullKey(value);
        int hash = hash(key);
        int i = indexFor(hash, table.length);
        for (Entry<K,V> e = table[i]; e != null; e = e.next) { // ----- line 494
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }

        modCount++;
        addEntry(hash, key, value, i);
        return null;
    }

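The loop never terminates when e.next == e, that is, when some node's next pointer points to the node itself. To verify this, I took a heap dump, loaded it into the Eclipse Memory Analyzer Tool (MAT below), and clicked the button shown in the figure below to get the thread list: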


Thread details

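MAT shows each thread's name, its current stack, and the objects it holds, which makes it very convenient for troubleshooting memory problems. Locate the main thread: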

Top of the main thread's stack

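Matching this against the looping code in HashMap.put, e at that moment was the java.util.HashMap.Entry at 0x73ae67608. Its next field points back to itself, which is why the main thread executing this code was stuck in an infinite loop.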

For how the HashMap infinite loop arises, see the article 为什么HashMap不线程安全 ("Why HashMap is not thread-safe").
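As a rough illustration of one commonly described interleaving, the sketch below replays, on a single thread and deterministically, how the head-insertion loop that Java 7 uses when moving entries to a resized table can leave a bucket circular. It is a simplified model, not the JDK source: Node, simulate, and hasCycle are hypothetical names, only one bucket is modeled, and both entries are assumed to land in the same bucket of the new table. It produces a two-node cycle, whereas the heap dump above shows a node pointing at itself, but the effect on a traversal is the same.

```java
final class ResizeRaceSimulation {
    /** Simplified stand-in for java.util.HashMap.Entry (key and next pointer only). */
    static final class Node {
        final String key;
        Node next;
        Node(String key) { this.key = key; }
    }

    /** Replays the interleaving and returns the bucket head that "thread 1" publishes. */
    static Node simulate() {
        // Old bucket before the resize: A -> B -> null
        Node a = new Node("A");
        Node b = new Node("B");
        a.next = b;

        // Thread 1 enters the transfer loop, reads e and next, then is suspended.
        Node e1 = a;
        Node next1 = e1.next;           // B

        // Thread 2 runs its whole transfer. Head insertion reverses the list.
        Node head2 = null;
        for (Node e = a; e != null; ) {
            Node next = e.next;
            e.next = head2;
            head2 = e;
            e = next;
        }
        // The shared nodes now read: B -> A -> null

        // Thread 1 resumes with its stale locals and finishes the interrupted iteration.
        Node head1 = null;
        e1.next = head1;
        head1 = e1;                     // bucket: A
        e1 = next1;                     // B

        // Thread 1 continues its loop as usual.
        while (e1 != null) {
            Node next = e1.next;        // B.next is A (set by thread 2), then A.next is null
            e1.next = head1;
            head1 = e1;
            e1 = next;
        }
        return head1;                   // A -> B -> A -> ...
    }

    /** Floyd's cycle detection, so the check itself cannot loop forever. */
    static boolean hasCycle(Node head) {
        Node slow = head;
        Node fast = head;
        while (fast != null && fast.next != null) {
            slow = slow.next;
            fast = fast.next.next;
            if (slow == fast) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("cycle: " + hasCycle(simulate()));
    }
}
```

Any later put or get that walks this bucket with `e = e.next` and does not find a matching key never reaches null, which corresponds to the RUNNABLE main thread seen at HashMap.java:494.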

Resolution

The HashMap code in question belongs to thrift. Here is the original class (decompiled):

//
// Source code recreated from a .class file by IntelliJ IDEA
// (powered by Fernflower decompiler)
//

package org.apache.thrift.meta_data;

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import org.apache.thrift.TBase;
import org.apache.thrift.TFieldIdEnum;

public class FieldMetaData implements Serializable {
    public final String fieldName;
    public final byte requirementType;
    public final FieldValueMetaData valueMetaData;
    private static Map<Class<? extends TBase>, Map<? extends TFieldIdEnum, FieldMetaData>> structMap = new HashMap();

    public FieldMetaData(String name, byte req, FieldValueMetaData vMetaData) {
        this.fieldName = name;
        this.requirementType = req;
        this.valueMetaData = vMetaData;
    }

    public static void addStructMetaDataMap(Class sClass, Map map) {
        structMap.put(sClass, map);
    }

    public static Map getStructMetaDataMap(Class sClass) {
        if(!structMap.containsKey(sClass)) {
            try {
                sClass.newInstance();
            } catch (InstantiationException var2) {
                throw new RuntimeException("InstantiationException for TBase class: " + sClass.getName() + ", message: " + var2.getMessage());
            } catch (IllegalAccessException var3) {
                throw new RuntimeException("IllegalAccessException for TBase class: " + sClass.getName() + ", message: " + var3.getMessage());
            }
        }

        return (Map)structMap.get(sClass);
    }
}

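From the analysis, there are two ways to fix this. One is to declare structMap as a thread-safe ConcurrentHashMap. The other is to synchronize the code that accesses structMap, that is, to add the synchronized keyword to the methods (or code blocks) that operate on structMap.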
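Both options can be sketched side by side. The class below is an illustration under assumptions, not thrift's code: SafeRegistryDemo and its methods are hypothetical names, and String/Integer stand in for the real key and value types. One map is a ConcurrentHashMap; the other is a plain HashMap whose every access goes through a static synchronized method.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class SafeRegistryDemo {
    // Option 1: a map that is safe for concurrent use on its own.
    private static final ConcurrentMap<String, Integer> concurrentMap =
            new ConcurrentHashMap<String, Integer>();

    // Option 2: a plain HashMap, touched only inside static synchronized methods,
    // which all lock on SafeRegistryDemo.class.
    private static final Map<String, Integer> guardedMap = new HashMap<String, Integer>();

    static void putConcurrent(String key, Integer value) {
        concurrentMap.put(key, value);
    }

    static synchronized void putGuarded(String key, Integer value) {
        guardedMap.put(key, value);
    }

    static synchronized int guardedSize() {
        return guardedMap.size();
    }

    private static synchronized void clearGuarded() {
        guardedMap.clear();
    }

    /** Fills both maps from several threads; returns {concurrent size, guarded size}. */
    static int[] fillFromThreads(int threadCount, final int perThread) {
        concurrentMap.clear();
        clearGuarded();
        Thread[] workers = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            final int id = t;
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < perThread; i++) {
                        String key = id + "-" + i;   // unique per thread and iteration
                        putConcurrent(key, i);
                        putGuarded(key, i);
                    }
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(ex);
            }
        }
        return new int[] { concurrentMap.size(), guardedSize() };
    }

    public static void main(String[] args) {
        int[] sizes = fillFromThreads(8, 1000);
        System.out.println(sizes[0] + " " + sizes[1]);
    }
}
```

With the synchronized variant, reads must take the same lock as writes; guarding only the put would still let a reader walk a bucket while it is being resized.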
Excited, I was about to send thrift a PR, but then I found the following code:

The fix in thrift on GitHub

As you can see, thrift has already fixed the problem by adding the synchronized keyword. Upgrading to 0.9.3 or later avoids the problem.

This PR was made to resolve the issue THRIFT-1618. To check whether that issue matches ours, look it up:

THRIFT-1618

The issue's status is CLOSED, meaning it has been resolved, and its description matches our situation.

Conclusion

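To summarize the analysis above: concurrent access to a HashMap from multiple threads triggered a Java 7 HashMap resize that left a bucket's linked list circular, so a thread entered an infinite loop. The monitor held by the looping thread was never released, and every other thread requesting it was BLOCKED. Upgrading thrift to 0.9.3 or later resolves the problem.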
