BTrace是一个可以动态跟踪Java代码执行的工具,网上已经有很多文章介绍,我这里分享一个我在实际工作中排查性能问题的例子。
现象
我的一个REST接口调用非常慢,postman耗时3873 ms,这个接口就是从redis里把一批数据取出来,redis性能很好,理论上不会这么慢,于是用btrace,trace方法调用。
/* BTrace Script Template */ import com.sun.btrace.annotations.*; import static com.sun.btrace.BTraceUtils.*; @BTrace public class TracingScript { /* put your code here */ @TLS private static long startlTime = 0; @TLS private static long startmTime = 0; @TLS private static long startjTime = 0; @TLS private static long startrTime = 0; @TLS private static long startbTime = 0; @OnMethod(clazz="com.xx.monitoring.api.util.JedisHelper", method="listSize") public static void startjMethod() { startjTime = timeMillis(); } @OnMethod(clazz="com.xx.monitoring.api.util.JedisHelper", method="listSize", location=@Location(Kind.RETURN)) public static void endjMethod() { println(strcat("listSize method execute time=>", str(timeMillis()-startjTime))); println("-------------------------------------------"); } @OnMethod(clazz="com.xx.monitoring.api.util.JedisHelper", method="listRange") public static void startrMethod() { startrTime = timeMillis(); } @OnMethod(clazz="com.xx.monitoring.api.util.JedisHelper", method="listRange", location=@Location(Kind.RETURN)) public static void endrMethod() { println(strcat("listRange method execute time=>", str(timeMillis()-startrTime))); println("-------------------------------------------"); } @OnMethod(clazz="com.xx.monitoring.api.persistence.MetricRedisDao", method="listMetricByIds") public static void startbMethod() { startbTime = timeMillis(); } @OnMethod(clazz="com.xx.monitoring.api.persistence.MetricRedisDao", method="listMetricByIds", location=@Location(Kind.RETURN)) public static void endbMethod(java.util.List metricIds) { println("metrcIds.size: " + metricIds.size()); println(strcat("listMetricByIds method execute time=>", str(timeMillis()-startbTime))); println("-------------------------------------------"); } @OnMethod(clazz="com.xx.monitoring.api.service.MetricServiceImpl", method="listMetricsInternal") public static void startlMethod() { startlTime = timeMillis(); } @OnMethod(clazz="com.xx.monitoring.api.service.MetricServiceImpl", method="listMetricsInternal", location=@Location(Kind.RETURN)) public static void endlMethod() { println(strcat("listMetricsInternal method execute time=>", str(timeMillis()-startlTime))); println("-------------------------------------------"); } @OnMethod(clazz="com.xx.monitoring.api.service.MetricServiceImpl", method="listMetricsMap") public static void startmMethod() { startmTime = timeMillis(); } @OnMethod(clazz="com.xx.monitoring.api.service.MetricServiceImpl", method="listMetricsMap", location=@Location(Kind.RETURN)) public static void endmMethod() { println(strcat("listMetricsMap execute time=>", str(timeMillis()-startmTime))); println("-------------------------------------------"); } }
编译:
$ btracec TracingScript.java
开始trace
$ btrace 19416 TracingScript.class
结果:
DEBUG: received com.sun.btrace.comm.MessageCommand@5b6f7412 listSize method execute time=>3 DEBUG: received com.sun.btrace.comm.MessageCommand@27973e9b ------------------------------------------- DEBUG: received com.sun.btrace.comm.MessageCommand@312b1dae listRange method execute time=>18 DEBUG: received com.sun.btrace.comm.MessageCommand@7530d0a ------------------------------------------- DEBUG: received com.sun.btrace.comm.MessageCommand@27bc2616 listRange method execute time=>19 DEBUG: received com.sun.btrace.comm.MessageCommand@3941a79c ------------------------------------------- DEBUG: received com.sun.btrace.comm.MessageCommand@506e1b77 listMetricsInternal method execute time=>4820 DEBUG: received com.sun.btrace.comm.MessageCommand@4fca772d ------------------------------------------- DEBUG: received com.sun.btrace.comm.MessageCommand@9807454 listMetricsMap execute time=>4821
可以看到listMetricsInternal执行了4820毫秒
这个方法就是把metric的id循环从redis取metric bean,初步判断是id太多导致很慢,继续trace
添加方法:
@OnMethod(clazz="com.xx.monitoring.api.persistence.MetricRedisDao", method="listMetricByIds", location=@Location(Kind.RETURN)) void endbMethod(java.util.List<String> metricIds) { println("metrcIds.size: " + str(metricIds)); println("-------------------------------------------"); }
执行结果:
metrcIds.size: [weblogic.datasource.waiting_for_connection_high_count, weblogic.datasource.leaked_connection_count, weblogic.datasource.active_connections_currentCount, weblogic.threadpool.standby_thread_count, weblogic.threadpool.active_thread_idle_count, weblogic.threadpool.execute_thread_total_count, weblogic.session.open_sessions_current_count, weblogic.session.open_sessions_high_count, weblogic.session.sessions_opened_total_count, weblogic.server.health_state, weblogic.server.activation_time, server.os.kernel_version, server.os.version, server.os.id, server.interfaces.tx, server.interfaces.rx, server.disk.used_percent, server.cpu.load, server.memory.free_percent, server.memory.used_percent, server.cpu.idle, server.cpu.iowait, server.cpu.sys, server.cpu.user, oracle.info.status, oracle.worst.sql, oracle.time_ratio.cpu_time_ratio, oracle.time_ratio.wait_time_ratio, oracle.connection.count, jvm.memory.used, jvm.memory.max, jvm.memory.committed, jvm.thread.total_started_thread_count, jvm.thread.daemon_thread_count, jvm.thread.peak_thread_count, jvm.thread.count, jvm.gc.collection_time, jvm.gc.collection_count, jvm.classsloading.unloaded_class_count, jvm.classsloading.loaded_class_count, jvm.classsloading.total_loaded_class_count, jvm.info.input_arguments, jvm.info.vm_id, jvm.info.vm_version, weblogic.datasource.waiting_for_connection_high_count, weblogic.datasource.leaked_connection_count, weblogic.datasource.active_connections_currentCount, weblogic.threadpool.standby_thread_count, weblogic.threadpool.active_thread_idle_count, weblogic.threadpool.execute_thread_total_count, weblogic.session.open_sessions_current_count, weblogic.session.open_sessions_high_count, weblogic.session.sessions_opened_total_count, weblogic.server.health_state, weblogic.server.activation_time, server.os.kernel_version, server.os.version, serve 还有很多 ...
一共3610个,从redis取一个1ms,那也要3610ms,怪不得。
原因找到了,改代码就很简单了
另外发现BTrace还有一个工具专门分析性能的BTraceUtils.Profiling,代码:
import com.sun.btrace.BTraceUtils; import com.sun.btrace.Profiler; import com.sun.btrace.annotations.*; import static com.sun.btrace.BTraceUtils.*; @BTrace class Profiling { @Property Profiler swingProfiler = BTraceUtils.Profiling.newProfiler(); @OnMethod(clazz="/com\\.xx\\.monitoring\\.api\\..*/", method="/.*/") void entry(@ProbeMethodName(fqn=false) String probeMethod) { //fqn是否打印长方法名 BTraceUtils.Profiling.recordEntry(swingProfiler, probeMethod); } @OnMethod(clazz="/com\\.xx\\.monitoring\\.api\\..*/", method="/.*/", location=@Location(value=Kind.RETURN)) void exit(@ProbeMethodName(fqn=false) String probeMethod, @Duration long duration) { BTraceUtils.Profiling.recordExit(swingProfiler, probeMethod, duration); } @OnTimer(5000) //每5秒打印一次 void timer() { BTraceUtils.Profiling.printSnapshot("Performance profile", swingProfiler); } }
结果:
Performance profile Block Invocations SelfTime.Total SelfTime.Avg SelfTime.Min SelfTime.Max WallTime.Total WallTime.Avg WallTime.Min WallTime.Max preHandle 1 96000 96000 96000 96000 96000 96000 96000 96000 listSize 1 2231000 2231000 2231000 2231000 2231000 2231000 2231000 2231000 listRange 2 18447000 9223500 23000 18424000 36871000 18435500 18424000 18447000 getMetric 3610 20915000 5793 1000 384000 3805159000 1054060 689000 5752000 getMetricIndexOfId 3610 189140000 52393 19000 930000 189140000 52393 19000 930000 get 3610 3583769000 992733 629000 4758000 3595104000 995873 633000 4787000 <init> 3610 760000 210 0 10000 760000 210 0 10000 setId 3610 1983000 549 0 20000 1983000 549 0 20000 setUnit 3610 2019000 559 0 51000 2019000 559 0 51000 setCreated 3610 2111000 584 0 32000 2111000 584 0 32000 setUpdated 3610 1556000 431 0 26000 1556000 431 0 26000 setValueType 3610 1251000 346 0 19000 1251000 346 0 19000 setDisplayName 3610 1476000 408 0 54000 1476000 408 0 54000 setCustomized 271 179000 660 0 16000 179000 660 0 16000 doFilter 1 22833000 22833000 22833000 22833000 3867649000 3867649000 3867649000 3867649000 listMetricsMap 2 2151000 1075500 49000 2102000 7685645000 3842822500 3842798000 3842847000 listMetricsInternal 1 13000 13000 13000 13000 3839969000 3839969000 3839969000 3839969000 listMetrics 1 2332000 2332000 2332000 2332000 3839956000 3839956000 3839956000 3839956000 listMetricByIds 1 11787000 11787000 11787000 11787000 3816946000 3816946000 3816946000 3816946000 getId 3676 737000 200 0 12000 737000 200 0 12000 isCustomized 66 11000 166 0 1000 11000 166 0 1000 getUnit 66 8000 121 0 1000 8000 121 0 1000 getCreated 66 9000 136 0 1000 9000 136 0 1000 getUpdated 66 9000 136 0 1000 9000 136 0 1000 getValueType 66 14000 212 0 1000 14000 212 0 1000 getDisplayName 66 9000 136 0 1000 9000 136 0 1000 postHandle 1 1803000 1803000 1803000 1803000 1803000 1803000 1803000 1803000
可见getMetric执行了3610次
怎么样,BTrace还是很强大的吧,分析线上问题很方便