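We recently upgraded the client jar for tair (our cache system) to a completely rewritten version. After releasing it to production, we found that machines in one newly launched data center showed high CPU usage, typically between 50% and 100% (on 5-core virtual machines), while machines in the other two data centers stayed low.
1. top showed that the java process accounted for the vast majority of the CPU usage.
2. top H (or top -p PID H) showed that a single thread was consuming most of that CPU.
3. printf %x tid gives the thread id in hex; then jstack pid | grep 0xtid -A 10 gives the stack of the thread that is using the most CPU: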
    [admin@hostname001 ~]$ jstack 36776 | grep 0x9003 -A 10
    "Xxxx-Timeout-Channel-Checker" daemon prio=10 tid=0x00007fa39438d000 nid=0x9003 runnable [0x00000000431b9000]
       java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x0000000752803898> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
        at java.util.concurrent.DelayQueue.take(DelayQueue.java:164)
        at com...........net.TairBgWorker.run(TairBgWorker.java:52)
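For readers who prefer to do the decimal-to-hex step in code rather than with printf, here is a minimal Java sketch. The class and method names (NidHex, toNid) are made up for illustration, and 36867 is simply the decimal value of the nid 0x9003 seen in the jstack output above.

```java
// Sketch: turn the decimal thread id shown by top into the hex "nid" form that jstack prints.
// NidHex and toNid are hypothetical names used only for this illustration.
class NidHex {
    static String toNid(long tid) {
        return "0x" + Long.toHexString(tid);
    }

    public static void main(String[] args) {
        // 36867 decimal corresponds to nid=0x9003 in the jstack output
        System.out.println(toNid(36867L)); // prints 0x9003
    }
}
```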
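The stack shows that the cache client's timeout-scanning thread runs into trouble while checking the timeout queue. In principle, when DelayQueue has no expired element to hand out, the thread should simply stay parked, which is harmless. So why does it consume so much CPU? Could it be related to spin locks?
4. I couldn't figure it out, so I handed the problem to the cache team. They were too busy, though: a full week passed between my first report and the point where I started working on it myself. It felt as if, once you had rolled out their new version and it hadn't caused a production incident, they stopped caring about the rest.
5. So when I had a little spare time I started on it myself. A Google search for DelayQueue CPU usage turned up someone who had hit a similar problem. Their explanation looked plausible, so I checked our code and, sure enough, getDelay returned a raw millisecond value: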
    public long getDelay(TimeUnit unit) {
        return delayed - System.currentTimeMillis();
    }
But DelayQueue.take asks for the delay in nanoseconds and uses the returned value as ns!
    public E take() throws InterruptedException {
        final ReentrantLock lock = this.lock;
        lock.lockInterruptibly();
        try {
            for (;;) {
                E first = q.peek();
                if (first == null) {
                    available.await();
                } else {
                    long delay = first.getDelay(TimeUnit.NANOSECONDS);
                    if (delay > 0) {
                        long tl = available.awaitNanos(delay);
                    } else {
                        E x = q.poll();
                        assert x != null;
                        if (q.size() != 0)
                            available.signalAll(); // wake up other takers
                        return x;
                    }
                }
            }
        } finally {
            lock.unlock();
        }
    }
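A getDelay that honors the requested unit could look roughly like the sketch below. This is not the tair client's actual code: the class name TimeoutTask and its field are hypothetical, and Long.compare assumes Java 7 or later. The only point is that the remaining milliseconds are converted with unit.convert before being returned.

```java
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical Delayed implementation that converts the remaining time into the requested unit.
class TimeoutTask implements Delayed {
    private final long deadlineMillis; // absolute expiry time in epoch milliseconds

    TimeoutTask(long timeoutMillis) {
        this.deadlineMillis = System.currentTimeMillis() + timeoutMillis;
    }

    @Override
    public long getDelay(TimeUnit unit) {
        long remainingMillis = deadlineMillis - System.currentTimeMillis();
        // Convert from milliseconds to whatever unit the caller asked for
        // (DelayQueue.take asks for TimeUnit.NANOSECONDS).
        return unit.convert(remainingMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.NANOSECONDS), other.getDelay(TimeUnit.NANOSECONDS));
    }
}
```

With this version, a task that expires in 100 ms reports roughly 100,000,000 when asked for nanoseconds instead of 100, so awaitNanos parks for the real remaining time.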
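So the unit mismatch was probably the problem. But we had never noticed it in the daily or performance-test environments, and why would some data centers be affected while others were not? The mismatch alone could not explain that. Production releases cannot be pushed at will, so even with a fix in hand I could not verify it right away. If this turned out not to be the cause, one of our only 1-2 release slots per week would be wasted and the real fix would have to wait for the next one. So I decided to write a small program to reproduce the issue first, and then check whether the fix resolved it.
6. I wrote a program with a DelayQueue, one thread adding elements and one thread waiting for and taking expired ones. It did not reproduce locally. I then ran it on machines in two different data centers: still no reproduction. Was there another cause? Thinking it over: if the request volume is high or the timeouts are short, every DelayQueue.take call gets an expired element, the thread then processes a batch of expired entries and sleeps 5ms, so it should not use this much CPU.
7. Continuing the analysis from step 5: since I suspected that ConditionObject.awaitNanos was being called too often, I used btrace to look at the call count, and that was indeed the case: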
    import com.sun.btrace.*;
    import com.sun.btrace.annotations.*;
    import static com.sun.btrace.BTraceUtils.*;

    @BTrace
    public class HelloWorld {
        private static java.util.concurrent.atomic.AtomicInteger ai = new java.util.concurrent.atomic.AtomicInteger(0);
        private static long start = System.currentTimeMillis();

        @OnMethod(clazz="java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject", method="awaitNanos")
        public static void func(AnyType[] args) {
            if (ai.incrementAndGet() % 1000 == 0) {
                println("1000 spends " + (System.currentTimeMillis() - start));
                start = System.currentTimeMillis();
            }
        }
    }
Results on the two machines:

    DEBUG: received com.sun.btrace.comm.MessageCommand@67eb366
    100,000 spends 116
    DEBUG: received com.sun.btrace.comm.MessageCommand@6833f0de
    100,000 spends 184

This is the time taken by 100,000 calls on the 2.6.18 kernel.

    DEBUG: received com.sun.btrace.comm.MessageCommand@474b5f4a
    1000 spends 1320
    DEBUG: received com.sun.btrace.comm.MessageCommand@49bdc9d8
    1000 spends 1388

This is the interval per 1000 calls on the 2.6.32 kernel.
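8. After the btrace run I went back to the analysis in step 6. It turned out that in production, consecutive timeout tasks in the queue are spaced apart rather than back to back: the gap between the current timeout task and the next one can be tens of ms or more. So after the 5ms sleep, take still finds a positive remaining delay, hands that millisecond count to awaitNanos as if it were nanoseconds, and ends up looping again every few tens of ns while it waits, which burns CPU. I modified the demo program accordingly, reran it, and reproduced the problem: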
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.DelayQueue;
    import java.util.concurrent.Delayed;
    import java.util.concurrent.TimeUnit;

    public class Test {
        static class MyDelayed implements Delayed {
            long id;
            long delay;   // absolute expiry time in ms
            long timeout; // relative timeout in ms

            public MyDelayed(long id, long timeout) {
                this.id = id;
                this.timeout = timeout;
                this.delay = timeout + System.currentTimeMillis();
            }

            @Override
            public int compareTo(Delayed o) {
                MyDelayed md = (MyDelayed) o;
                return md.delay < this.delay ? 1 : -1;
            }

            @Override
            public long getDelay(TimeUnit unit) {
                // Deliberately reproduces the bug: returns milliseconds regardless of the requested unit.
                return this.delay - System.currentTimeMillis(); // unit.convert(this.delay - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
            }

            public String toString() {
                return this.id + " delay " + timeout + " ms ";
            }
        }

        private static long id = 0;
        protected static final int COUNT = 50;
        static DelayQueue<MyDelayed> dq = new DelayQueue<MyDelayed>();

        public Test() {
        }

        public static void main(String[] args) {
            // Consumer: takes expired entries, drains the rest, then sleeps 5 ms.
            new Thread(new Runnable() {
                @Override
                public void run() {
                    while (!Thread.interrupted()) {
                        try {
                            List<MyDelayed> list = new ArrayList<MyDelayed>();
                            list.add(dq.take());
                            dq.drainTo(list);
                            for (int i = 0, size = list.size(); i < size; i++) {
                            }
                            Thread.sleep(5L);
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                    }
                }
            }).start();

            // Producer: adds a batch of entries with a 100 ms timeout every 50 ms.
            new Thread(new Runnable() {
                @Override
                public void run() {
                    while (!Thread.interrupted()) {
                        for (int i = 0; i < COUNT; i++) {
                            dq.add(new MyDelayed(id++, 100)); // with very random timeouts here, the high CPU usage is unlikely to show up
                        }
                        try {
                            Thread.sleep(50L);
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                    }
                }
            }).start();
        }
    }
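The effect can also be looked at without DelayQueue at all. The sketch below is my own simplified model, not code from the tair client: it waits for a deadline the way the take loop does, once passing the remaining milliseconds straight to LockSupport.parkNanos (the buggy interpretation) and once converting them to nanoseconds first. ParkLoopDemo and waitUntil are hypothetical names; the counts it prints depend on the kernel's timer behavior, so no particular output is claimed.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

// Simplified model of the wait loop: counts how many park calls are needed to reach a deadline.
class ParkLoopDemo {
    static int waitUntil(long deadlineMillis, boolean convertToNanos) {
        int parks = 0;
        for (;;) {
            long remainingMillis = deadlineMillis - System.currentTimeMillis();
            if (remainingMillis <= 0) {
                return parks;
            }
            // Buggy variant: the millisecond count is used directly as a nanosecond timeout.
            long nanos = convertToNanos
                    ? TimeUnit.MILLISECONDS.toNanos(remainingMillis)
                    : remainingMillis;
            LockSupport.parkNanos(nanos);
            parks++;
        }
    }

    public static void main(String[] args) {
        int buggy = waitUntil(System.currentTimeMillis() + 50, false);
        int fixed = waitUntil(System.currentTimeMillis() + 50, true);
        System.out.println("park calls with ms passed as ns: " + buggy);
        System.out.println("park calls with proper conversion: " + fixed);
    }
}
```

Running it on machines with different kernels and comparing the first number gives a rough idea of how often the buggy loop wakes up on each.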
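9. The problem was found, but why did it behave this way? Was it the hardware? I ran cat /proc/cpuinfo on a machine with the problem and on one without it, and the CPUs were indeed different!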
CPU on the 2.6.32 kernel (high CPU usage):

    model name      : Intel(R) Xeon(R) CPU E**** @ 2.*0GHz
    stepping        : 5
    cpu MHz         : 2400.086
    cache size      : 8192 KB
    physical id     : 4
    siblings        : 1
    core id         : 0
    cpu cores       : 1
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 11
    wp              : yes
    flags           : fpu de tsc msr pae cx8 apic sep cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc pni ssse3 cx16 sse4_1 sse4_2 popcnt

CPU on the 2.6.18 kernel (low CPU usage):

    model name      : Intel(R) Xeon(R) CPU E**** @ 2.*0GHz
    stepping        : 5
    cpu MHz         : 2400.086
    cache size      : 8192 KB
    physical id     : 4
    siblings        : 1
    core id         : 0
    cpu cores       : 1
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 11
    wp              : yes
    flags           : fpu de tsc msr pae cx8 apic sep cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc pni ssse3 cx16 sse4_1 sse4_2 popcnt lahf_lm
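10. [This was already heading in the wrong direction...] Since the CPUs differed, I wondered whether the supported instructions differed as well, for example in runtime optimizations or intrinsics. I looked into some non-standard JVM options: PrintCompilation only shows which methods were compiled to native code; I wanted to use PrintOptoAssembly to see the generated assembly, but that option is only available in debug builds and only covers code compiled from Java, not native code. pstack showed that park uses pthread_cond_timedwait, and the two machines are actually the same in that respect: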
2.6.18 kernel:

    [admin@hostnameXXX~]$ pstack 9307 | grep 9307 -A 10
    Thread 221 (Thread 0x45a0b940 (LWP 9307)):
    #0 0x00000034f7a0ae00 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
    #1 0x00002b0017a17177 in os::Linux::safe_cond_timedwait ()
    #2 0x00002b00176e19dc in Unsafe_Park ()
    #3 0x00002aaaacc2282f in ?? ()
    #4 0x00000007912ff020 in ?? ()
    .......

2.6.32 kernel:

    [admin@hostnameXXX bin]$ pstack 174038
    Thread 1 (process 174038):
    #0 0x000000354260b14a in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
    #1 0x00007f7fdd8b4177 in os::Linux::safe_cond_timedwait(pthread_cond_t*, pthread_mutex_t*, timespec const*) () from /.../jdk-1.6.0_32/jre/lib/amd64/server/libjvm.so
    #2 0x00007f7fdd57e9dc in Unsafe_Park () from /.../jdk-1.6.0_32/jre/lib/amd64/server/libjvm.so
    #3 0x00007f7fd932ba27 in ?? ()
    #4 0x0000000752824210 in ?? ()
    #5 0x00000000430cdc78 in ?? ()
    #6 0x0000000000000001 in ?? ()
    #7 0x00000000430cdc70 in ?? ()
    #8 0x0000000000000000 in ?? ()
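11. Over lunch I chatted briefly with my boss, who brought up the OS version. I didn't see what that could have to do with it, but decided to check anyway. The OS versions did differ:
Old machines: 2.6.18-***.*.*.***.***
New machines: 2.6.32-***.*.*.***.***.x86_64
Then my boss sent me some material on High Resolution Timers. That was well beyond what I knew; I would never have guessed this was the cause.
12. The explanation made sense, so I verified it the way the linked material describes: cat /proc/timer_list shows that 2.6.32 uses hr timers and 2.6.18 does not.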
2.6.32 kernel machine:

    [admin@hostnameXXX ~]$ cat /proc/timer_list
    Timer List Version: v0.5
    HRTIMER_MAX_CLOCK_BASES: 2
    now at 3511439984437741 nsecs

    cpu: 0
     clock 0:
      .base:       0000000000000000
      .index:      0
      .resolution: 1 nsecs
      .get_time:   ktime_get_real
      .offset:     1375286460793426172 nsecs
    active timers:
     clock 1:
      .base:       0000000000000000
      .index:      1
      .resolution: 1 nsecs
      .get_time:   ktime_get
      .offset:     0 nsecs
    active timers:
     #0: <0000000000000000>, tick_sched_timer, S:01, hrtimer_start, swapper/0
     # expires at 3511439985000000-3511439985000000 nsecs [in 562259 to 562259 nsecs]
     #1: <0000000000000000>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, DragoonAgent/247215
     # expires at 3511439987840327-3511439987940326 nsecs [in 3402586 to 3502585 nsecs]
     …………
     #50: <0000000000000000>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, java/193871
     # expires at 51000553286044185-51000553286094185 nsecs [in 47489113301606444 to 47489113301656444 nsecs]
      .expires_next   : 3511439985000000 nsecs
      .hres_active    : 1
      .nr_events      : 1359911422
      .nr_retries     : 132555
      .nr_hangs       : 0
      .max_hang_time  : 0 nsecs
      .nohz_mode      : 2
      .idle_tick      : 3511439976000000 nsecs
      .tick_stopped   : 1
      .idle_jiffies   : 7806107272
      .idle_calls     : 1603082563
      .idle_sleeps    : 1141722573
      .idle_entrytime : 3511439982234812 nsecs
      .idle_waketime  : 3511439982233361 nsecs
      .idle_exittime  : 3511439975020148 nsecs
      .idle_sleeptime : 3356315902706452 nsecs
      .iowait_sleeptime: 85456657539562 nsecs
      .last_jiffies   : 7806107278
      .next_jiffies   : 7806107281
      .idle_expires   : 3511439985000000 nsecs
    jiffies: 7806107280
    …… …… …… …… …… ……
    Tick Device: mode:     1
    Broadcast device
    Clock Event Device: hpet
     max_delta_ns:   149983005959
     min_delta_ns:   13409
     mult:           61496114
     shift:          32
     mode:           3
     next_event:     9223372036854775807 nsecs
     set_next_event: hpet_legacy_next_event
     set_mode:       hpet_legacy_set_mode
     event_handler:  tick_handle_oneshot_broadcast
    tick_broadcast_mask: 00000000
    tick_broadcast_oneshot_mask: 00000000
    ………… …………
    Tick Device: mode:     1
    Per CPU device: 23
    Clock Event Device: lapic
     max_delta_ns:   1341990326
     min_delta_ns:   2399
     mult:           26847282
     shift:          32
     mode:           3
     next_event:     3511441485000000 nsecs
     set_next_event: lapic_next_event
     set_mode:       lapic_timer_setup
     event_handler:  hrtimer_interrupt
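When many machines need checking, reading this output by eye gets tedious. The following sketch looks for the ".hres_active : 1" line that appears in the output above. It is only an illustration: TimerListCheck and hresActive are hypothetical names, the code assumes Java 7 or later, it treats a missing field as "not active" (an assumption on my part, not something confirmed for every kernel), and reading /proc/timer_list may require elevated privileges on some systems.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: scan the text of /proc/timer_list for a non-zero ".hres_active" field.
class TimerListCheck {
    private static final Pattern HRES = Pattern.compile("\\.hres_active\\s*:\\s*(\\d+)");

    /** Returns true if any CPU section reports a non-zero .hres_active value. */
    static boolean hresActive(String timerListText) {
        Matcher m = HRES.matcher(timerListText);
        while (m.find()) {
            if (!"0".equals(m.group(1))) {
                return true;
            }
        }
        return false; // assumption: no field found means high resolution mode is not active
    }

    public static void main(String[] args) throws Exception {
        byte[] raw = Files.readAllBytes(Paths.get("/proc/timer_list"));
        String text = new String(raw, StandardCharsets.UTF_8);
        System.out.println("hres_active: " + hresActive(text));
    }
}
```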
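13. dmesg: look for a message such as "Switched to high resolution mode on CPU0" (or something similar) to confirm whether the high resolution timer is enabled. Neither machine had this message, though.
14. Kernel configuration: look for the CONFIG_HIGH_RES_TIMERS=y option in the kernel config to confirm that the high resolution timer is enabled. On our production linux 2.6.32 machines the config is at /proc/config.gz, and the option is visible after decompressing it with gunzip. On the 2.6.18 machines the config is at /boot/config-2.6.18-**.*.*.***.***, and the option is not present there, so at this point it is certain that the high resolution timer is not enabled on those machines.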