We recently upgraded the client jar of tair (our cache system) to a completely rewritten version. After it went to production, machines in one newly provisioned datacenter showed fairly high CPU usage, generally between 50% and 100% (on 5-core VMs), while machines in the other two datacenters stayed low.
1. top showed that the java process was consuming the vast majority of the CPU.
2. Pressing H inside top (or top -p PID, then H) showed that a single thread accounted for most of it.
3. printf %x tid converts that thread id to hex; jstack pid | grep 0xtid -A 10 then pulls out the busy thread's stack:
[admin@hostname001 ~]$ jstack 36776 | grep 0x9003 -A 10
"Xxxx-Timeout-Channel-Checker" daemon prio=10 tid=0x00007fa39438d000 nid=0x9003 runnable [0x00000000431b9000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000752803898> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
at java.util.concurrent.DelayQueue.take(DelayQueue.java:164)
at com...........net.TairBgWorker.run(TairBgWorker.java:52)
So it's the cache client's timeout-scanning thread, misbehaving while it checks the timeout queue. In theory, when DelayQueue has no expired element the thread should simply stay parked; nothing looks wrong, so why would it burn so much CPU? Could it have something to do with spin locks?
4. I couldn't figure it out, so I handed the problem to the cache team. They were swamped, though; a whole week passed between my first report and my deciding to dig in myself. It felt like once you've rolled their new version out for them without causing a production incident, the rest just isn't their concern anymore...
5. So, once I had a bit of time, I started on it myself. Googling for DelayQueue CPU usage turned up someone who had hit a similar problem, and his write-up looked credible. Checking our code confirmed it: getDelay was returning a raw millisecond value, ignoring the requested unit:
public long getDelay(TimeUnit unit) {
    // BUG: 'delayed' is an absolute deadline in ms, so this returns a
    // millisecond count no matter which unit the caller asked for
    return delayed - System.currentTimeMillis();
}
But inside DelayQueue.take, that value is consumed as nanoseconds!!:
public E take() throws InterruptedException {
    final ReentrantLock lock = this.lock;
    lock.lockInterruptibly();
    try {
        for (;;) {
            E first = q.peek();
            if (first == null) {
                available.await();
            } else {
                // our buggy getDelay hands back a millisecond count here,
                // but it is treated as a nanosecond count
                long delay = first.getDelay(TimeUnit.NANOSECONDS);
                if (delay > 0) {
                    long tl = available.awaitNanos(delay);
                } else {
                    E x = q.poll();
                    assert x != null;
                    if (q.size() != 0)
                        available.signalAll(); // wake up other takers
                    return x;
                }
            }
        }
    } finally {
        lock.unlock();
    }
}
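The fix is simply to honor the unit the caller requests; it is the one-line change that also appears as a comment in the demo program further down:

public long getDelay(TimeUnit unit) {
    // convert the remaining milliseconds into whatever unit DelayQueue asks for
    // (it asks for NANOSECONDS in take())
    return unit.convert(delayed - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
}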
That was most likely the problem. But we had never noticed it in the daily or performance-testing environments, and why did some datacenters show it while others didn't? This bug alone couldn't explain that! Production releases can't be pushed at will, so even with a fix we couldn't verify it immediately; if this wasn't the real cause, we'd burn one of our 1-2 weekly release slots and have to wait for the next one. So I decided to first write a small program to reproduce the issue, and only then check whether the fix worked.
6. I wrote a program: a DelayQueue, one thread stuffing data in, another taking expired items out. Running it locally did not reproduce the problem! Neither did running it on machines in the two different datacenters! Was there yet another cause? Thinking it through, though: if the request volume is high, or the timeouts are short, every DelayQueue.take immediately finds an expired element, processes a batch of expired items, then sleeps 5 ms; that shouldn't burn much CPU (a sketch of this first attempt follows below).
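For the record, the producer of that first, non-reproducing attempt was shaped roughly like the sketch below (a reconstruction from the description above; the original snippet wasn't kept, and MyDelayed/dq/id are the definitions from the step-8 demo further down). Because elements trickle in continuously, the queue head is almost always already expired by the time the consumer loops back, so take() returns immediately and hardly ever parks on a tiny ms-as-ns timeout:

// hypothetical reconstruction of the first attempt's producer: a steady
// one-at-a-time trickle instead of spaced-out bursts; at steady state the
// queue head expires roughly every millisecond, so the consumer's take()
// rarely blocks in awaitNanos at all
while (!Thread.interrupted()) {
    dq.add(new MyDelayed(id++, 100));
    Thread.sleep(1L);
}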
7. Continuing the analysis from step 5: since the suspicion was that ConditionObject.awaitNanos was being called far too many times, I used btrace to count the calls. Sure enough:
import com.sun.btrace.*;
import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;

@BTrace
public class HelloWorld {
    private static java.util.concurrent.atomic.AtomicInteger ai = new java.util.concurrent.atomic.AtomicInteger(0);
    private static long start = System.currentTimeMillis();

    // count every awaitNanos call and print how long each batch of 1000 took
    // (the batch size and message were tweaked between runs; the 2.6.18 output
    // below was apparently taken with a batch size of 100,000)
    @OnMethod(clazz = "java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject", method = "awaitNanos")
    public static void func(AnyType[] args) {
        if (ai.incrementAndGet() % 1000 == 0) {
            println("1000 spends " + (System.currentTimeMillis() - start));
            start = System.currentTimeMillis();
        }
    }
}
Execution results on the two machines (the script is attached to the running JVM with btrace <pid> HelloWorld.java):
DEBUG: received com.sun.btrace.comm.MessageCommand@67eb366
100,000 spends 116
DEBUG: received com.sun.btrace.comm.MessageCommand@6833f0de
100,000 spends 184
That was the 2.6.18 kernel: the time taken per 100,000 calls.
DEBUG: received com.sun.btrace.comm.MessageCommand@474b5f4a
1000 spends 1320
DEBUG: received com.sun.btrace.comm.MessageCommand@49bdc9d8
1000 spends 1388
And this is the 2.6.32 kernel: the interval per 1,000 calls.
8. After btrace, back to the analysis from step 6. Further digging showed that in production, the expiry deadlines of consecutive tasks in the timeout queue are spaced some distance apart rather than packed back to back: the gap between the current expired task and the next one can be tens of ms or more. So after the 5 ms sleep, the head of the queue may still be, say, 30 ms away; the buggy getDelay returns ~30, awaitNanos(30) parks for only about 30 ns, the loop re-peeks and parks again, cycling once every few tens of nanoseconds until the real deadline finally passes. That is what eats the CPU. I modified the demo program accordingly, reran it, and reproduced the problem:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

public class Test {
    static class MyDelayed implements Delayed {
        long id;
        long timeout; // relative timeout in ms, kept for toString
        long delay;   // absolute deadline in ms

        public MyDelayed(long id, long timeout) {
            this.id = id;
            this.timeout = timeout;
            this.delay = timeout + System.currentTimeMillis();
        }

        @Override
        public int compareTo(Delayed o) {
            // the original never returned 0; fixed to a consistent ordering
            MyDelayed md = (MyDelayed) o;
            return this.delay < md.delay ? -1 : (this.delay == md.delay ? 0 : 1);
        }

        @Override
        public long getDelay(TimeUnit unit) {
            // reproduces the bug: returns milliseconds regardless of the requested unit;
            // correct: unit.convert(this.delay - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
            return this.delay - System.currentTimeMillis();
        }

        public String toString() {
            return this.id + " delay " + timeout + " ms ";
        }
    }

    private static long id = 0;
    protected static final int COUNT = 50;
    static DelayQueue<MyDelayed> dq = new DelayQueue<MyDelayed>();

    public static void main(String[] args) {
        // consumer: block for one expired element, drain the rest, "process" them, sleep 5 ms
        new Thread(new Runnable() {
            @Override
            public void run() {
                while (!Thread.interrupted()) {
                    try {
                        List<MyDelayed> list = new ArrayList<MyDelayed>();
                        list.add(dq.take());
                        dq.drainTo(list);
                        for (int i = 0, size = list.size(); i < size; i++) {
                            // simulate processing the expired batch
                        }
                        Thread.sleep(5L);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            }
        }).start();

        // producer: bursts of COUNT tasks with identical deadlines, 50 ms apart --
        // the spaced-out deadline pattern that triggers the spin
        new Thread(new Runnable() {
            @Override
            public void run() {
                while (!Thread.interrupted()) {
                    for (int i = 0; i < COUNT; i++) {
                        dq.add(new MyDelayed(id++, 100)); // with random timeouts, the CPU burn is much less likely to appear
                    }
                    try {
                        Thread.sleep(50L);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            }
        }).start();
    }
}
9. The problem was pinned down, but why does it behave this way? Could it be hardware? cat /proc/cpuinfo on an affected and an unaffected machine showed CPUs that were indeed not identical!
CPU on the 2.6.32 kernel (high CPU usage):
model name : Intel(R) Xeon(R) CPU E**** @ 2.*0GHz
stepping : 5
cpu MHz : 2400.086
cache size : 8192 KB
physical id : 4
siblings : 1
core id : 0
cpu cores : 1
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu de tsc msr pae cx8 apic sep cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc pni ssse3 cx16 sse4_1 sse4_2 popcnt
CPU on the 2.6.18 kernel (low CPU usage):
model name : Intel(R) Xeon(R) CPU E**** @ 2.*0GHz
stepping : 5
cpu MHz : 2400.086
cache size : 8192 KB
physical id : 4
siblings : 1
core id : 0
cpu cores : 1
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu de tsc msr pae cx8 apic sep cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc pni ssse3 cx16 sse4_1 sse4_2 popcnt lahf_lm
10. [The investigation veers off course here...] Since the CPUs differed, maybe the supported instructions differed too, and with them runtime optimizations or intrinsics. I dug into some non-standard JVM options: -XX:+PrintCompilation only shows which methods get JIT-compiled to native code; I wanted -XX:+PrintOptoAssembly to dump the generated assembly, but that flag is only available in debug builds, and it only covers compiled Java code anyway, not native code... pstack then showed that park is implemented via pthread_cond_timedwait, and that the two machines are in fact identical on this path:
2.6.18 kernel:
[admin@hostnameXXX~]$ pstack 9307 | grep 9307 -A 10
Thread 221 (Thread 0x45a0b940 (LWP 9307)):
#0 0x00000034f7a0ae00 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
#1 0x00002b0017a17177 in os::Linux::safe_cond_timedwait ()
#2 0x00002b00176e19dc in Unsafe_Park ()
#3 0x00002aaaacc2282f in ?? ()
#4 0x00000007912ff020 in ?? ()
.......
2.6.32 kernel:
[admin@hostnameXXX bin]$ pstack 174038
Thread 1 (process 174038):
#0 0x000000354260b14a in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f7fdd8b4177 in os::Linux::safe_cond_timedwait(pthread_cond_t*, pthread_mutex_t*, timespec const*) () from /.../jdk-1.6.0_32/jre/lib/amd64/server/libjvm.so
#2 0x00007f7fdd57e9dc in Unsafe_Park () from /.../jdk-1.6.0_32/jre/lib/amd64/server/libjvm.so
#3 0x00007f7fd932ba27 in ?? ()
#4 0x0000000752824210 in ?? ()
#5 0x00000000430cdc78 in ?? ()
#6 0x0000000000000001 in ?? ()
#7 0x00000000430cdc70 in ?? ()
#8 0x0000000000000000 in ?? ()
11. Over lunch I mentioned this to my boss, who suggested checking the OS version. I couldn't see how that would matter, but decided to look anyway. The OS versions were indeed different:
Old machines:
2.6.18-***.*.*.***.***
New machines:
2.6.32-***.*.*.***.***.x86_64
Then he sent me some material on the High Resolution Timer. That was definitely beyond my knowledge at the time; I would never have guessed this was the cause.
12. Since the explanation made sense, I verified it the way the link described, with cat /proc/timer_list. With hrtimers the kernel can honor nanosecond-scale timeouts instead of rounding them up to the next tick; 2.6.32 has the hr timer (note the .resolution: 1 nsecs entries below), 2.6.18 does not.
On a 2.6.32 kernel machine:
[admin@hostnameXXX ~]$ cat /proc/timer_list
Timer List Version: v0.5
HRTIMER_MAX_CLOCK_BASES: 2
now at 3511439984437741 nsecs
cpu: 0
clock 0:
.base: 0000000000000000
.index: 0
.resolution: 1 nsecs
.get_time: ktime_get_real
.offset: 1375286460793426172 nsecs
active timers:
clock 1:
.base: 0000000000000000
.index: 1
.resolution: 1 nsecs
.get_time: ktime_get
.offset: 0 nsecs
active timers:
#0: <0000000000000000>, tick_sched_timer, S:01, hrtimer_start, swapper/0
# expires at 3511439985000000-3511439985000000 nsecs [in 562259 to 562259 nsecs]
#1: <0000000000000000>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, DragoonAgent/247215
# expires at 3511439987840327-3511439987940326 nsecs [in 3402586 to 3502585 nsecs]
…………
#50: <0000000000000000>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, java/193871
# expires at 51000553286044185-51000553286094185 nsecs [in 47489113301606444 to 47489113301656444 nsecs]
.expires_next : 3511439985000000 nsecs
.hres_active : 1
.nr_events : 1359911422
.nr_retries : 132555
.nr_hangs : 0
.max_hang_time : 0 nsecs
.nohz_mode : 2
.idle_tick : 3511439976000000 nsecs
.tick_stopped : 1
.idle_jiffies : 7806107272
.idle_calls : 1603082563
.idle_sleeps : 1141722573
.idle_entrytime : 3511439982234812 nsecs
.idle_waketime : 3511439982233361 nsecs
.idle_exittime : 3511439975020148 nsecs
.idle_sleeptime : 3356315902706452 nsecs
.iowait_sleeptime: 85456657539562 nsecs
.last_jiffies : 7806107278
.next_jiffies : 7806107281
.idle_expires : 3511439985000000 nsecs
jiffies: 7806107280
…… ……
…… ……
…… ……
Tick Device: mode: 1
Broadcast device
Clock Event Device: hpet
max_delta_ns: 149983005959
min_delta_ns: 13409
mult: 61496114
shift: 32
mode: 3
next_event: 9223372036854775807 nsecs
set_next_event: hpet_legacy_next_event
set_mode: hpet_legacy_set_mode
event_handler: tick_handle_oneshot_broadcast
tick_broadcast_mask: 00000000
tick_broadcast_oneshot_mask: 00000000
…………
…………
Tick Device: mode: 1
Per CPU device: 23
Clock Event Device: lapic
max_delta_ns: 1341990326
min_delta_ns: 2399
mult: 26847282
shift: 32
mode: 3
next_event: 3511441485000000 nsecs
set_next_event: lapic_next_event
set_mode: lapic_timer_setup
event_handler: hrtimer_interrupt
13. dmesg: grep for "Switched to high resolution mode on CPU0" (or something similar) to confirm that the high resolution timer kicked in; however, the message showed up on neither machine.
14. Kernel config: look for CONFIG_HIGH_RES_TIMERS=y to confirm the high resolution timer is enabled. On our production 2.6.32 machines the config sits in /proc/config.gz, and after gunzip the option is there; on the 2.6.18 machines the config is in /boot/config-2.6.18-**.*.*.***.***, and the option is absent. That confirms the 2.6.18 kernel does not have the high resolution timer enabled.
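As a postscript, a minimal probe (my own sketch, not part of the original investigation) makes the kernel difference visible without any DelayQueue involved: it issues the same kind of nanosecond-scale timed parks that the ms-as-ns bug fed into awaitNanos. Run it on a 2.6.18 and a 2.6.32 machine and compare the per-iteration cost and CPU usage; how a 30 ns sleep request is honored (rounded up to the next tick, or programmed as a one-shot high resolution timer) is exactly the difference this investigation converged on.

import java.util.concurrent.locks.LockSupport;

public class ParkProbe {
    public static void main(String[] args) {
        final int iterations = 100000;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            // ask the kernel for a 30 ns sleep, the same kind of tiny timeout
            // the buggy getDelay produced
            LockSupport.parkNanos(30L);
        }
        long elapsedMs = (System.nanoTime() - start) / 1000000L;
        System.out.println(iterations + " parkNanos(30) calls took " + elapsedMs + " ms");
    }
}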