Many modern Java applications rely on a complex set of distributed dependencies and moving parts, and many external factors can affect an application's performance and availability. These influences are practically impossible to eliminate entirely, or to simulate accurately, in a preproduction environment. Stuff happens. But you can dramatically reduce the severity and duration of these incidents by creating and maintaining a comprehensive system for monitoring your application's entire ecosystem.
This three-part article presents patterns and techniques for implementing such a system. The patterns, and some of the terms I'll use, are intentionally generic. Together with the code samples and illustrations, they will help you gain a conceptual understanding of application performance monitoring (APM). That understanding highlights the need for a solution and, in turn, can help you select a commercial or open-source solution, extend and customize one, or (for the motivated) serve as a blueprint for building your own.
This is Part 1. Part 2 will focus on methods for instrumenting Java classes and resources without modifying the original source code. Part 3 will cover methods for monitoring resources outside the JVM, including hosts and their operating systems, and remote services such as databases and messaging systems. It concludes with a discussion of additional APM issues such as data management, data visualization, reporting, and alerting.
First, I should emphasize that although much of the Java-specific content I present here may seem similar to the process of application and code profiling, that is not what I'm referring to. Profiling is a highly valuable preproduction process for confirming (or denying) that your Java code is scalable, efficient, fast, and generally wonderful. But on the axiom that stuff happens, the highest accolades your code earned from a development-phase profiler will not serve you when you encounter inexplicable issues in production.
What I am referring to is implementing some aspects of production profiling: collecting, in real time, some of the same data from the running application and all of its external dependencies. This data comprises a series of continuous quantitative measurements, pervasive across their targets, that provide a granular and detailed representation of the whole system's health. Moreover, by retaining a historical store of these measurements, you can capture accurate baselines that help you confirm that the environment is healthy, or pinpoint the root cause and magnitude of a specific deficiency.
It is probably a rare application that has no monitoring resources at all, but consider the following antipatterns, which appear frequently in operational environments:
Disconnected and piecemeal monitoring systems: An application may be hosted in a large shared data center, where its dependencies include a large number of shared resources such as databases, storage area network (SAN) storage, or messaging and middleware services. Organizations are sometimes heavily siloed, with each group managing its own monitoring and APM systems (see the Pitfalls of siloed monitoring sidebar). Without a unified view of every dependency, each component owner sees only a small slice of the picture.
Figure 1 contrasts siloed and consolidated APM systems:
Implementing a consolidated APM does not preclude or devalue highly specialized monitoring and diagnostic tools, such as DBA administration toolsets, low-level network-analysis applications, and data-center management solutions. Those tools remain invaluable resources, but if they are relied on to the exclusion of a unified view, the siloing effect is difficult to overcome.
In contrast to the antipatterns I have just described, the ideal APM system presented in this series of articles has the following attributes:
Before I dive into the implementation details of such a system, though, it will help to understand some generic aspects of APM systems.
All APM systems access performance data sources and include collection and tracing facilities. Note that these are generic terms of my own choosing, used to describe generic categories; they are not specific to any particular APM system, and other terms may be used for the same concepts. In the remainder of this article, I use these terms according to the following definitions.
A performance data source (PDS) is a source of performance or availability data that can serve as a metric reflecting a component's relative health. For example, the Java Management Extensions (JMX) services can typically provide a wealth of data on a JVM's health, and most relational databases publish performance data through an SQL interface. Both of these PDSs are examples of what I call direct sources: they provide performance data directly. In contrast, inferential sources measure a deliberate or incidental activity from which performance data is inferred. For example, a test message can be sent periodically to, and then retrieved from, a Java Message Service (JMS) server, with the round-trip time constituting an inferential measurement of that service's performance.
Inferential sources, instances of which are referred to as synthetic transactions, can be extremely useful because they can effectively measure several components, or tiered calls, through the same execution path as real activity. Synthetic transactions also play a crucial role in monitoring continuity: they confirm system health during periods of relative inactivity, when direct sources may be insufficient.
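To make the idea concrete, here is a minimal, hypothetical sketch of an inferential collector: it times an arbitrary round-trip operation (a JMS send/receive in the example above; a stand-in task here) and reports the elapsed time as the synthetic measurement. The class and method names are my own, not part of any APM product's API:

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of an inferential PDS collector: the measurement is not
// read from the target directly; it is derived by timing an operation that
// travels the same path as real activity (for example, a JMS test-message
// round trip). Names are illustrative only.
public class SyntheticTransaction {

    // Executes the operation and returns its round-trip time in milliseconds.
    public static long measure(Callable<?> operation) throws Exception {
        long start = System.nanoTime();
        operation.call(); // e.g. send a test message and block until it comes back
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a real round trip: a 50 ms simulated service call.
        long elapsed = measure(() -> { Thread.sleep(50); return null; });
        System.out.println("Synthetic transaction round trip: " + elapsed + " ms");
    }
}
```

The elapsed value would then be traced to the APM system like any direct measurement.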
Collection is the process of acquiring performance or availability data from a PDS. In the case of a direct PDS, the collector typically implements some sort of API to access that data; to read statistics from a network router, for example, a collector might use the Simple Network Management Protocol (SNMP) or Telnet. In the case of an inferential PDS, the collector executes and measures the underlying activity.
Tracing is the process of delivering measurements from the collector to the core APM system. Many commercial and open-source APM systems supply an API of some sort for this purpose. For this article's examples, I have implemented a generic Java tracer interface, which I describe in detail in the next section.
Most APM systems organize the data submitted by tracers into some sort of categorization and hierarchy. Figure 2 illustrates the general flow of this data capture:
Figure 2 also shows some services commonly found in APM systems:
The implementation and use of a common tracing API across the whole APM target environment provides a level of consistency. And, for the purposes of custom collectors, it lets developers concentrate on acquiring performance data without worrying about the tracing aspects. The next section introduces an APM tracing interface that addresses this topic.
The Java language works well as an implementation language for collectors, because it has:
One caveat, though, is that your Java collectors must be able to integrate with the tracing API offered by your target APM system. Some of these patterns still apply if your APM tracing mechanism does not offer a Java interface. But in cases where the target PDS is specifically Java-based (such as JMX) and the application platform is not, you need a bridging interface such as IKVM, a Java-to-.NET compiler (see Related topics).
In the absence of a formal standard, the tracing APIs offered by different APM products all differ. Accordingly, I have abstracted this issue away by implementing a generic tracing Java interface called org.runtimemonitoring.tracing.ITracer. The ITracer interface is a generalized wrapper around a proprietary tracing API. This technique insulates the source base from version or provider changes in the wrapped API, and it also presents opportunities to implement additional functionality not available in that API. Most of the remaining examples in this article implement the ITracer interface and the basic concepts underlying it.

Figure 3 is a UML class diagram of the org.runtimemonitoring.tracing.ITracer interface:
The ITracer interface and factory class: The fundamental premise of ITracer is the submission of a measurement value and an associated name to the central APM system. This activity is implemented by the trace methods, which vary according to the nature of the measurement being submitted. Each trace method accepts a String[] name parameter containing the contextual components of a compound name whose structure is specific to the APM system. The compound name indicates to the APM system both the namespace of the submission and the actual metric name. Accordingly, a compound name usually has at least one root category and a metric description. The underlying ITracer implementation is expected to know how to build the compound name from the passed String[]. Table 1 shows two examples of compound naming conventions:
Name structure | Compound name |
---|---|
Simple slash-delimited | Hosts/SalesDatabaseServer/CPU Utilization/CPU3 |
JMX MBean ObjectName | com.myco.datacenter.apm:type=Hosts,service=SalesDatabaseServer,group=CPU Utilization,instance=CPU3 |
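For illustration only, here is how an ITracer implementation might render a String[] of name components into each of the two compound-name styles in Table 1. The class and method names are invented for this sketch, not part of the article's source:

```java
// Illustrative sketch: two ways an ITracer implementation could build a
// compound name from the contextual String[] components described above.
public class CompoundNames {

    // Simple slash-delimited form, e.g. Hosts/SalesDatabaseServer/CPU Utilization/CPU3.
    public static String slash(String... parts) {
        return String.join("/", parts);
    }

    // JMX ObjectName form: the first element is the domain, the remaining
    // elements are key=value properties joined by commas.
    public static String jmxStyle(String domain, String... keyProps) {
        return domain + ":" + String.join(",", keyProps);
    }

    public static void main(String[] args) {
        System.out.println(slash("Hosts", "SalesDatabaseServer", "CPU Utilization", "CPU3"));
        System.out.println(jmxStyle("com.myco.datacenter.apm",
            "type=Hosts", "service=SalesDatabaseServer",
            "group=CPU Utilization", "instance=CPU3"));
    }
}
```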
Listing 1 is a brief example of tracing calls made with this API:
ITracer simpleTracer = TracerFactory.getInstance(sprops);
ITracer jmxTracer = TracerFactory.getInstance(jprops);
.
.
simpleTracer.trace(37, "Hosts", "SalesDatabaseServer",
   "CPU Utilization", "CPU3", "Current Utilization %");
jmxTracer.trace(37,
   "com.myco.datacenter.apm",
   "type=Hosts",
   "service=SalesDatabaseServer",
   "group=CPU Utilization",
   "instance=CPU3", "Current Utilization %");
In this interface, a measurement can have one of the following types:
int
long
java.util.Date
String
An APM system provider may support additional data types for collected measurements.
Given a specific measurement data type (long, for example), a given value can be interpreted in different ways, depending on the type support in the APM system. Keep in mind, too, that each APM implementation may use different terms for the same types; ITracer uses its own generic naming.
The tracer types represented in ITracer are:

Interval averaged: The trace(long value, String[] name) and trace(int value, String[] name) methods issue traces of interval averaged values (see the Intervals sidebar). Each submission is factored into the aggregate value for the current interval; once a new interval begins, the aggregate counters reset to zero.

Sticky: The traceSticky(long value, String[] name) and traceSticky(int value, String[] name) methods issue traces of sticky values. In contrast to interval averaged metrics, the aggregates retain their values across intervals. If I trace a value of 5 now and then do not trace again until some time tomorrow, the metric remains 5 until a new value is supplied.

Incident: The traceIncident(String[] name) call has no value; a tick of one incident is implied. To register more than one tick without calling the method repeatedly in a loop, the traceIncident(int value, String[] name) method ticks the total by value.

Smart: The value is passed in as a String, along with a type name; the available types are defined as constants in the interface. This is a convenience method for cases in which the collector does not know the data type of the collected data, or which tracer type to use, but can simply pass the collected value and a configured type name straight through to the tracer.

TracerFactory is a generic factory class that creates new ITracer instances based on the configuration properties passed to it, or references already-created ITracers from a cache.
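The interval averaged, sticky, and incident semantics described above can be illustrated with a tiny in-memory aggregator. This is not the article's ITracer implementation, just a sketch of the contract:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal in-memory sketch (not a real APM client) contrasting tracer
// semantics: interval averages reset when a new interval starts, sticky
// values persist until replaced, and incidents simply count ticks.
public class MiniTracer {
    private final Map<String, long[]> interval = new HashMap<>(); // name -> {sum, count}
    private final Map<String, Long> sticky = new HashMap<>();
    private final Map<String, Integer> incidents = new HashMap<>();

    // Interval averaged: each submission is folded into the current interval's aggregate.
    public void trace(long value, String name) {
        long[] agg = interval.computeIfAbsent(name, k -> new long[2]);
        agg[0] += value;
        agg[1]++;
    }

    // Sticky: the last submitted value is retained across intervals.
    public void traceSticky(long value, String name) { sticky.put(name, value); }

    // Incident: a tick of one is implied.
    public void traceIncident(String name) { incidents.merge(name, 1, Integer::sum); }

    // Called when a new interval begins: returns the average and resets the aggregate.
    public long flushAverage(String name) {
        long[] agg = interval.remove(name);
        return (agg == null || agg[1] == 0) ? 0 : agg[0] / agg[1];
    }

    public long stickyValue(String name) { return sticky.getOrDefault(name, 0L); }
    public int incidentCount(String name) { return incidents.getOrDefault(name, 0); }
}
```

Note how flushAverage() clears the interval aggregate while stickyValue() keeps returning the last submission indefinitely.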
Collectors typically use one of three patterns — polling, listening, and interception — and the pattern influences the type of tracer that should be used. In an interception pattern, for example, measurements might be broken down by HTTP request type (GET, POST, and so on) or by Uniform Resource Identifier (URI). Now that I have outlined the performance-data tracing API, its underlying data types, and the patterns of data collection, I'll present some specific use cases and examples that put the API to work.
The JVM itself is a sensible place to start implementing performance monitoring. I'll begin with performance metrics common to all JVMs and then move on to some JVM-resident components commonly found in enterprise applications. With few exceptions, an instance of a Java application is a process supported by an underlying operating system, so several aspects of JVM monitoring are best viewed from the perspective of the hosting OS, which I'll cover in Part 3.
Before the release of Java Platform, Standard Edition 5 (Java SE), the options for collecting internal and standardized JVM diagnostics efficiently and reliably at run time were limited. Now several useful monitoring points are available through the java.lang.management interfaces, which are standard in all compliant Java SE 5 (and later) JVMs. Some implementations of these JVMs provide additional proprietary metrics, but the access patterns are roughly the same. I'll focus on the standard items, which are accessible through the JVM's MXBeans — JMX MBeans deployed inside the VM that expose a management and monitoring interface (see Related topics):

ClassLoadingMXBean: Monitors the class-loading system.
CompilationMXBean: Monitors the compilation system.
GarbageCollectionMXBean: Monitors the JVM's garbage collectors.
MemoryMXBean: Monitors the JVM's heap and non-heap memory spaces.
MemoryPoolMXBean: Monitors the memory pools allocated by the JVM.
RuntimeMXBean: Monitors the runtime system. This MXBean provides few useful monitoring metrics, but it does supply the JVM's input arguments and the start time and up time, both of which can be useful as factors in other derived metrics.
ThreadMXBean: Monitors the threading system.
:监视线程系统。 JMX收集器的前提是它获取MBeanServerConnection
,该对象可以从部署在JVM中的MBean读取属性,读取目标属性的值并使用ITracer
API进行跟踪。 对于此类收集,一个关键的决定是在何处部署收集器。 选择是本地部署和远程部署 。
In local deployment, the collector and its invoking scheduler are deployed within the target JVM itself. The JMX collector component then accesses the MXBeans using the PlatformMBeanServer, a statically accessible MBeanServerConnection internal to the JVM. In remote deployment, the collector runs in a separate process and connects to the target JVM using some form of JMX remoting. This can be less efficient than local deployment, but it requires no additional components to be deployed to the target system. JMX remoting is beyond this article's scope, but it can be implemented easily by deploying an RMIConnectorServer or by simply enabling external connectivity in the JVM (see Related topics).
The sample JMX collector for this article (see Download for the full source code) contains three separate methods for acquiring an MBeanServerConnection. The collector can:

Acquire an MBeanServerConnection to the local JVM's platform MBeanServer, using a static call to the java.lang.management.ManagementFactory.getPlatformMBeanServer() method.
Acquire an MBeanServerConnection to a secondary MBeanServer deployed locally in the JVM, using a static call to the javax.management.MBeanServerFactory.findMBeanServer(String agentId) method. Note that more than one MBeanServer can reside in a single JVM, and more complex systems — such as Java Platform, Enterprise Edition (Java EE) servers — almost always have an application-server-specific MBeanServer that is separate and distinct from the platform MBeanServer (see the Cross-registering MBeans sidebar).
Acquire a remote MBeanServerConnection through standard RMI remoting, using a javax.management.remote.JMXServiceURL.
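Of these options, only the first needs no remoting setup. A minimal sketch of a locally deployed probe that reads one thread metric through the platform MBeanServer might look like this (the class name and metric choice are mine, not the article's sample code):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Sketch of the first acquisition option: a collector running inside the
// target JVM reads an MXBean attribute via the platform MBeanServer. A
// remote collector would instead open a JMXConnector from a JMXServiceURL.
public class LocalJmxProbe {

    public static int currentThreadCount() throws Exception {
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        // "java.lang:type=Threading" — the standard ThreadMXBean ObjectName.
        ObjectName on = new ObjectName(ManagementFactory.THREAD_MXBEAN_NAME);
        return (Integer) conn.getAttribute(on, "ThreadCount");
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Live threads: " + currentThreadCount());
    }
}
```

In a real collector the value returned here would be handed to the ITracer API rather than printed.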
Listing 2 is a short snippet of the JMX collector's collect() method, showing the collection and tracing of thread activity from the ThreadMXBean.

Listing 2. A portion of the sample JMX collector's collect() method for the ThreadMXBean
方法的一部分 .
.
objectNameCache.put(THREAD_MXBEAN_NAME, new ObjectName(THREAD_MXBEAN_NAME));
.
.
public void collect() {
CompositeData compositeData = null;
String type = null;
try {
log("Starting JMX Collection");
long start = System.currentTimeMillis();
ObjectName on = null;
.
.
// Thread Monitoring
on = objectNameCache.get(THREAD_MXBEAN_NAME);
tracer.traceDeltaSticky((Long)jmxServer.getAttribute(on,"TotalStartedThreadCount"),
hostName, "JMX", on.getKeyProperty("type"), "StartedThreadRate");
tracer.traceSticky((Integer)jmxServer.getAttribute(on, "ThreadCount"), hostName,
"JMX", on.getKeyProperty("type"), "CurrentThreadCount");
.
.
// Done
long elapsed = System.currentTimeMillis()-start;
tracer.trace(elapsed, hostName, "JMX", "JMX Collector",
"Collection", "Last Elapsed Time");
tracer.trace(new Date(), hostName, "JMX", "JMX Collector",
"Collection", "Last Collection");
log("Completed JMX Collection in ", elapsed, " ms.");
} catch (Exception e) {
log("Failed:" + e);
tracer.traceIncident(hostName, "JMX", "JMX Collector",
"Collection", "Collection Errors");
}
}
The code in Listing 2 traces the values of TotalStartedThreadCount and CurrentThreadCount. Because this is a polling collector, both traces use the sticky option. However, because TotalStartedThreadCount is always an increasing number, the most interesting aspect is not the absolute count but the rate at which threads are created, so the tracer uses the delta sticky option.
Figure 7 shows the APM metric tree created by this collector:
The JMX collector has several aspects that are not shown in Listing 2 (but can be seen in the full source code), such as scheduler registration, which creates a recurring callback to the collect() method every 10 seconds.
In Listing 2, different tracer types and data types come into play, depending on the data source. For example:

TotalLoadedClasses and UnloadedClassCount are traced as sticky deltas, because these values always rise and the delta is potentially more useful than the absolute value as a gauge of class-loading activity.
ThreadCount is a fluctuating value that can rise or fall, so it is traced as sticky.
Collection Errors is traced as an interval incident, ticked by any exception encountered while making a collection.
For efficiency, because the JMX ObjectNames of the target MXBeans never change for the lifetime of the target JVM, the collector caches them, keyed by the constant names provided in ManagementFactory.
For two of the MXBean types — GarbageCollector and MemoryPool — the exact ObjectNames may not be known in advance, but you can supply a general pattern. In these cases, the first time collection occurs, a query is issued against the MBeanServerConnection requesting a list of all MBeans matching the supplied pattern. The matching MBean ObjectNames returned are then cached to avoid future queries for the lifetime of the target JVM.
In some cases, the target MBean attribute of a collection may not be a flat numeric type. This is the case with the MemoryMXBean and the MemoryPoolMXBeans, where the attribute type is a CompositeData object that is queried for its keys and values. For the java.lang.management JVM management interface, the MXBean standard adopts the JMX open type model, in which all attributes are language-neutral types such as java.lang.Boolean and java.lang.Integer.
。 Or in the case of complex types such as javax.management.openmbean.CompositeType
, these types can be decomposed into key/value pairs of the same simple types. The full list of simple types is enumerated in the static javax.management.openmbean.OpenType.ALLOWED_CLASSNAMES
field. This model supports a level of type independence so that JMX clients do not have a dependency on nonstandard classes and can also support non-Java clients because of the relative simplicity of the underlying types. For more detail on JMX Open Types, see Related topics .
In cases in which a target MBean attribute is a nonstandard complex type, you need to ensure that the class defining that type is in your collector's classpath. And you must implement some custom code to render the useful data from the retrieved complex object.
In instances in which a single connection is acquired and retained for all collections, error detection and remediation is required to create a new connection in the event of a failure. Some collection APIs provide disconnect listeners that can prompt the collector to close, clean up, and create a new connection. To address scenarios in which a collector tries to connect to a PDS that has been taken down for maintenance or is inaccessible for some other reason, the collector should poll for reconnect on a friendly frequency. Tracking a connection's elapsed time can also be useful in order to degrade the frequency of collections if a slowdown is detected. This can reduce overhead on a target JVM that may be overly taxed for a period of time.
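One possible shape for the "poll for reconnect on a friendly frequency" policy described above is a capped exponential backoff. This small helper is an illustrative assumption, not code from the article's download:

```java
// Illustrative reconnect policy: on each failed attempt the delay before the
// next reconnect doubles, up to a cap, so an unavailable PDS is not hammered;
// a successful collection resets the delay to its base value.
public class ReconnectPolicy {
    private static final long BASE_DELAY_MS = 1_000;
    private static final long MAX_DELAY_MS = 60_000;
    private long delayMs = BASE_DELAY_MS;

    // Returns the delay to wait before the next reconnect attempt.
    public long nextDelay() {
        long current = delayMs;
        delayMs = Math.min(MAX_DELAY_MS, delayMs * 2);
        return current;
    }

    // Called after a successful collection to restore the base frequency.
    public void onSuccess() {
        delayMs = BASE_DELAY_MS;
    }
}
```

The same counter could also be used to degrade collection frequency when elapsed collection times indicate the target is under stress.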
Two additional techniques not implemented in these examples can improve the JMX collector's efficiency and reduce the overhead of running it against the target JVM. The first technique applies in cases in which multiple attributes are being interrogated from one MBean. Rather than requesting one attribute at a time using getAttribute(ObjectName name, String attribute)
, it is possible to issue a request for multiple attributes in one call using getAttributes(ObjectName name, String[] attributes). The difference might be negligible in local collection but can reduce resource utilization significantly in remote collection by reducing the number of network calls. The second technique is to reduce the polling overhead of the JMX exposed memory pools further by implementing the listening collector pattern instead of a polling pattern. The MemoryPoolMXBean
supports the ability to establish a usage threshold that, when exceeded, fires a notification to a listener, which in turn can trace the value. As the memory usage increases, the usage threshold can be increased accordingly. The downside of this approach is that without extremely small increments in the usage threshold, some granularity of data can be lost and patterns of memory usage below the threshold become invisible.
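The first technique — one getAttributes() call in place of several getAttribute() calls — can be sketched as follows against the standard Threading MXBean (the class name and attribute selection are illustrative):

```java
import java.lang.management.ManagementFactory;
import javax.management.Attribute;
import javax.management.AttributeList;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Sketch of batch attribute retrieval: several attributes of one MBean are
// fetched in a single getAttributes() call, which matters most for remote
// collection, where each call is a network round trip.
public class BatchAttributeRead {

    public static AttributeList readThreadStats() throws Exception {
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        ObjectName on = new ObjectName(ManagementFactory.THREAD_MXBEAN_NAME);
        return conn.getAttributes(on, new String[] {
            "ThreadCount", "PeakThreadCount", "TotalStartedThreadCount"});
    }

    public static void main(String[] args) throws Exception {
        for (Object o : readThreadStats()) {
            Attribute a = (Attribute) o;
            System.out.println(a.getName() + " = " + a.getValue());
        }
    }
}
```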
A final unimplemented technique is to measure windows of elapsed time and the total elapsed garbage-collection time and implement some simple arithmetic to derive the percentage of elapsed time that the garbage collector is active. This is a useful metric because some garbage collection is (for the time being) an inevitable fact of life for most applications. Because some number of collections, each lasting some period of time, are to be expected, the percentage of elapsed time when garbage collections are running can put the JVM's memory health in a clearer context. A general rule of thumb (but highly variable by application) is that any more than 10 percent of any 15-minute period indicates a potential issue.
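The arithmetic behind that derived metric can be sketched as a small polling helper that samples the cumulative collection time reported by the GarbageCollectorMXBeans. The class name and sampling approach are my own:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch of the derived metric described above: the percentage of a sampling
// window spent in garbage collection, computed as the delta of the JVM's
// cumulative GC time between two polls, divided by the window length.
public class GcPercent {
    private long lastGcMillis = totalGcMillis();
    private long lastSampleMillis = System.currentTimeMillis();

    // Sum of cumulative collection time across all collectors.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if undefined for this collector
            if (t > 0) total += t;
        }
        return total;
    }

    // Call once per polling interval; returns the percent of wall-clock time
    // the garbage collectors were active during the window.
    public double sample() {
        long now = System.currentTimeMillis();
        long gcNow = totalGcMillis();
        long windowMs = Math.max(1, now - lastSampleMillis);
        double pct = 100.0 * (gcNow - lastGcMillis) / windowMs;
        lastSampleMillis = now;
        lastGcMillis = gcNow;
        return pct;
    }
}
```

Against the rule of thumb above, a monitor could raise an incident whenever a 15-minute window's sample exceeds 10 percent.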
The JMX collector I've outlined in this section is simplified to illustrate the collection process, but it's extremely limiting always to have hard-coded collections. Ideally, a collector implements the data-access how , and an externally supplied configuration supplies the what . Such a design makes collectors much more useful and reusable. For the highest level of reuse, an externally configured collector should support these configuration points:
Listing 3 illustrates an external configuration for a JMX collector:
collectors.jmx.RemoteRMIMBeanServerConnectionFactory
jmx.rmi.url=service:jmx:rmi://127.0.0.1/jndi/rmi://127.0.0.1:1090/jmxconnector
AppServer3.myco.org,JMX
10000
Note that the TargetAttribute elements contain an attribute called type, which represents a parameterized argument to a smart type tracer. The SINT type represents sticky int, and the SDINT type represents delta sticky int.
So far, I've examined monitoring only standard JVM resources through JMX. But many application frameworks, such as Java EE, can expose valuable application-specific metrics through JMX, depending on the vendor. One classic example is DataSource utilization. A DataSource is a service that pools connections to an external resource (most commonly a database), limiting the number of concurrent connections to protect the resource from misbehaving or stressed applications. Monitoring data sources is a critical piece of an overall monitoring plan. Thanks to JMX's abstraction layer, the process is similar to what you've already seen.
Here's a list of typical data source metrics taken from a JBoss 4.2 application server instance:
This time, the collector uses batch attribute retrieval, acquiring all the attributes in one call. The only caveat is the need to interrogate the returned data in order to switch on the different data and tracer types. DataSource metrics are also fairly flat in the absence of any activity, so to see some movement in the numbers, you need to generate some load. Listing 4 shows the DataSource collector's collect() method:
public void collect() {
try {
log("Starting DataSource Collection");
long start = System.currentTimeMillis();
ObjectName on = objectNameCache.get("DS_OBJ_NAME");
AttributeList attributes = jmxServer.getAttributes(on, new String[]{
"AvailableConnectionCount",
"MaxConnectionsInUseCount",
"InUseConnectionCount",
"ConnectionCount",
"ConnectionCreatedCount",
"ConnectionDestroyedCount"
});
for(Attribute attribute: (List)attributes) {
if(attribute.getName().equals("ConnectionCreatedCount")
|| attribute.getName().equals("ConnectionDestroyedCount")) {
tracer.traceDeltaSticky((Integer)attribute.getValue(), hostName,
"DataSource", on.getKeyProperty("name"), attribute.getName());
} else {
if(attribute.getValue() instanceof Long) {
tracer.traceSticky((Long)attribute.getValue(), hostName, "DataSource",
on.getKeyProperty("name"), attribute.getName());
} else {
tracer.traceSticky((Integer)attribute.getValue(), hostName,
"DataSource",on.getKeyProperty("name"), attribute.getName());
}
}
}
// Done
long elapsed = System.currentTimeMillis()-start;
tracer.trace(elapsed, hostName, "DataSource", "DataSource Collector",
"Collection", "Last Elapsed Time");
tracer.trace(new Date(), hostName, "DataSource", "DataSource Collector",
"Collection", "Last Collection");
log("Completed DataSource Collection in ", elapsed, " ms.");
} catch (Exception e) {
log("Failed:" + e);
tracer.traceIncident(hostName, "DataSource", "DataSource Collector",
"Collection", "Collection Errors");
}
}
Figure 8 shows the corresponding metric tree for the DataSource collector:
This section addresses techniques that can be used to monitor application components, services, classes, and methods. The primary areas of interest are:
Using metrics made available by some implementations of the Java SE 5 (and newer) ThreadMXBean
, it is also possible to collect the following metrics:
WAITING
or TIMED_WAITING
pending another thread's activity. BLOCKED
state while invoking a method or service. Blocks occur when a thread is waiting for a monitor lock to enter or reenter a synchronized block. These metrics, and others, can also be determined using alternative tool sets and native interfaces, but this usually involves some level of overhead that makes them undesirable for production run-time monitoring. Having said that, the metrics themselves, even when collected, are low level. They may not be useful for anything other than trending, and they are quite difficult to correlate with any causal effects that can't be identified through other means.
All of the above metrics can be collected by a process of instrumenting the classes and methods of interest to make them collect and trace the performance data to the target APM system. A number of techniques can be used to instrument Java classes directly or to derive performance metrics from them indirectly:
Here in Part 1, I address only source code based instrumentation; you'll read more about interception, bytecode instrumentation, and class wrapping in Part 2 . (Interception, bytecode instrumentation, and class wrapping are virtually identical from a topological perspective, but the action to achieve the result has slightly different implications in each case.)
Asynchronous instrumentation is a fundamental issue in class instrumentation. A previous section explored the concepts of polling for performance data. If polling is done reasonably well, it should have no impact on the core application performance or overhead. In contrast, instrumenting the application code itself directly modifies and affects the core code's execution. The primary goal of any sort of instrumentation must be Above all, do no harm . The overhead penalty must be as close to negligible as possible. There is virtually no way to eliminate an extremely small execution penalty in the measurement itself, but once the performance data has been acquired, it is critical that the remainder of the trace process be asynchronous. There are several patterns for implementing asynchronous tracing. Figure 9 illustrates a general overview of how it can be done:
Figure 9 illustrates a simple instrumentation interceptor that measures the elapsed time of an invocation by capturing its start time and end time, and then dispatches the measurement (the elapsed time and the metric compound name) to a processing queue. The queue is then read by a thread pool, which acquires the measurement and completes the trace process.
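A bare-bones version of the pattern in Figure 9 might look like this: the instrumented application thread does nothing but enqueue the measurement, and a small worker pool completes the trace asynchronously. The queue size, pool size, and drop-on-overflow policy are illustrative assumptions:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of asynchronous tracing: trace() is O(1) and never blocks the
// caller; worker threads drain the queue and perform the (possibly slow)
// submission to the APM system off the application's critical path.
public class AsyncTracer {
    static class Measurement {
        final String name;
        final long elapsedMs;
        Measurement(String name, long elapsedMs) { this.name = name; this.elapsedMs = elapsedMs; }
    }

    private final BlockingQueue<Measurement> queue = new LinkedBlockingQueue<>(10_000);
    private final ExecutorService workers = Executors.newFixedThreadPool(2);
    final AtomicInteger delivered = new AtomicInteger(); // visible for testing

    public AsyncTracer() {
        for (int i = 0; i < 2; i++) {
            workers.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    Measurement m = queue.poll(100, TimeUnit.MILLISECONDS);
                    if (m != null) deliver(m);
                }
                return null;
            });
        }
    }

    // Called from instrumented application code: enqueue and return immediately.
    public void trace(String name, long elapsedMs) {
        queue.offer(new Measurement(name, elapsedMs)); // drop on overflow rather than block
    }

    // Stand-in for the real APM submission call.
    void deliver(Measurement m) { delivered.incrementAndGet(); }

    public void shutdown() { workers.shutdownNow(); }
}
```

Dropping measurements on overflow is a deliberate trade: losing a data point is preferable to stalling an application thread when the APM system falls behind.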
This section addresses the subject of implementing source level instrumentation and provides some best practices and example source code. It also introduces some new tracing constructs that I'll detail in the context of source code instrumentation to clarify their actions and their implementation patterns.
Despite the prevalence of alternatives, instrumentation of source code is unavoidable in some instances; in some cases it's the only solution. With sensible precautions, it's not necessarily a bad one. Considerations include:
Contextual tracing is highly subjective to the specific application, but consider the simplified example of a payroll-processing class with a processPayroll(long clientId)
method. When invoked, the method calculates and stores the paycheck for each of the client's employees. You can probably instrument the method through various means, but an underlying pattern in the execution clearly indicates that the invocation time increases disproportionately with the number of employees. Consequently, examining a trend of elapsed times for processPayroll
has no context unless you know how many employees are in each run. More simply put, for a given period of time the average elapsed time of processPayroll
was x milliseconds. You can't be sure if that value indicates acceptable or poor performance because if the window comprised only one employee, you would perceive it as poor, but if it comprised 150 employees, you'd think it was flying. Listing 5 displays this simplified concept in code:
public void processPayroll(long clientId) {
Collection employees = null;
// Acquire the collection of employees
//...
//...
// Process each employee
for(Employee emp: employees) {
processEmployee(emp.getEmployeeId(), clientId);
}
}
The primary challenge here is that by most instrumentation techniques, anything inside the processPayroll()
method is untouchable. So although you might be able to instrument processPayroll
and even processEmployee
, you have no way of tracing the number of employees to provide context to the method's performance data. Listing 6 displays a poorly hardcoded (and somewhat inefficient) example of how to capture the contextual data in question:
public void processPayrollContextual(long clientId) {
Collection employees = null;
// Acquire the collection of employees
employees = popEmployees();
// Process each employee
int empCount = 0;
String rangeName = null;
long start = System.currentTimeMillis();
for(Employee emp: employees) {
processEmployee(emp.getEmployeeId(), clientId);
empCount++;
}
rangeName = tracer.lookupRange("Payroll Processing", empCount);
long elapsed = System.currentTimeMillis()-start;
tracer.trace(elapsed, "Payroll Processing", rangeName, "Elapsed Time (ms)");
tracer.traceIncident("Payroll Processing", rangeName, "Payrolls Processed");
log("Processed Client with " + empCount + " employees.");
}
The key part of Listing 6 is the call to tracer.lookupRange
. Ranges are named collections that are keyed by a numerical range limit and have a String
value representing the name of the numerical range. Instead of tracing a payroll process's simple flat elapsed times, Listing 6 demarcates employee counts into ranges, effectively separating out elapsed times and grouping them by roughly similar employee counts. Figure 10 displays the metric tree generated by the APM system:
Figure 11 illustrates the elapsed times of the payroll processing demarcated by employee counts, revealing the relative relationship between the number of employees and the elapsed time:
The tracer configuration properties allow the option of including a URL to a properties file where ranges and thresholds can be defined. (I'll cover thresholds shortly.) The properties are read in at tracer construction time and provide the backing data for the tracer.lookupRange
implementation. Listing 7 shows an example configuration of the Payroll Processing
range. I have elected to use the XML representation of java.util.Properties
because it is more forgiving of oddball characters.
Payroll Process Range
181+ Emps,10:1-10 Emps,50:11-50 Emps,
80:51-80 Emps,120:81-120 Emps,180:121-180 Emps
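One plausible way to implement tracer.lookupRange over a configuration like the one above is a sorted map keyed by each range's upper bound: a measurement maps to the name of the smallest boundary at or above it, with a default for values past the last boundary. The class and method names are mine, not the ITracer source:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical lookupRange implementation: boundaries parsed from the
// external properties are kept in a TreeMap so each lookup is a single
// ceiling search rather than a scan.
public class RangeLookup {
    private final TreeMap<Integer, String> ranges = new TreeMap<>();
    private final String overflowName; // name used beyond the last boundary

    public RangeLookup(String overflowName) { this.overflowName = overflowName; }

    public RangeLookup add(int upperBound, String name) {
        ranges.put(upperBound, name);
        return this;
    }

    public String lookup(int value) {
        Map.Entry<Integer, String> e = ranges.ceilingEntry(value);
        return e == null ? overflowName : e.getValue();
    }
}
```

Because the map is consulted on every trace, keeping the lookup to a single O(log n) search matters: it runs on the instrumented thread.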
The injection of externally defined ranges protects your application from the need to update constantly at a source-code level because of adjusted expectations or business-driven changes to service level agreements (SLAs). As ranges and thresholds changes take effect, you are only required to update the external file, not the application itself.
The flexibility of externally configurable contextual tracing enables a more accurate and granular way to define and measure performance thresholds . While a range defines a series of numerical windows within which a measurement can be categorized, a threshold is a further categorization on a range that grades the acquired measurement in accordance with a measurement's determined range. A common requirement for the analysis of collected performance data is the determination and reporting of "successful" executions vs. executions that are considered "failed" because they did not occur within a specified time. The aggregation of this data can be required as a general report card on a system's operational health and capacity or as some form of SLA compliance assessment.
Using the payroll-processing system example, consider an internal service-level goal that defines execution times of payrolls (within the defined employee count ranges) into bands of Ok
, Warn
, and Critical
. The process of generating threshold counts is conceptually simple. You just need to provide the tracers the values you consider to be the upper elapsed time of each group for each band and direct the tracer to issue a tracer.traceIncident
for the categorized elapsed time, and then — to simplify reporting — a total. Table 2 outlines some contrived SLA elapsed times:
Employee count | Ok (ms) | Warn (ms) | Critical (ms) |
---|---|---|---|
1-10 | 280 | 400 | >400 |
11-50 | 850 | 1200 | >1200 |
51-80 | 900 | 1100 | >1100 |
81-120 | 1100 | 1500 | >1500 |
121-180 | 1400 | 2000 | > 2000 |
181+ | 2000 | 3000 | >3000 |
The ITracer
API implements threshold-reporting using values defined in the same XML (properties) file as the ranges we explored. Range and threshold definitions differ slightly in two ways. First, the key value for a threshold definition is a regular expression. When ITracer
traces a numeric value, it checks to see if a threshold regular expression matches the compound name of the metric being traced. If it matches, the threshold can then grade the measurement as Ok
, Warn
, or Critical
, and an additional tracer.traceIncident
is piggybacked on the trace. Second, because thresholds define only two values (a Critical
value is defined as being greater than a warn
value), the configuration consists of simply two numbers. Listing 8 shows the threshold configuration for the payroll-process SLA I outlined previously:
1100,1500
280,400
850,1200
900,1100
1400,2000
2000,3000
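The regular-expression matching and Ok/Warn/Critical grading described above could be sketched like this. The class is illustrative, not the ITracer source:

```java
import java.util.regex.Pattern;

// Sketch of a threshold definition: a regular expression keyed against the
// metric's compound name, plus two boundary values. A traced value at or
// below okMax grades Ok, at or below warnMax grades Warn, else Critical.
public class Threshold {
    private final Pattern namePattern;
    private final long okMax;
    private final long warnMax;

    public Threshold(String nameRegex, long okMax, long warnMax) {
        this.namePattern = Pattern.compile(nameRegex);
        this.okMax = okMax;
        this.warnMax = warnMax;
    }

    // Does this threshold apply to the metric being traced?
    public boolean matches(String compoundName) {
        return namePattern.matcher(compoundName).matches();
    }

    public String grade(long value) {
        if (value <= okMax) return "Ok";
        if (value <= warnMax) return "Warn";
        return "Critical";
    }
}
```

On a match, the tracer would piggyback a traceIncident for the resulting grade alongside the original elapsed-time trace.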
Figure 12 shows the metric tree for payroll processing with the added threshold metrics:
Figure 13 illustrates what the data collected can represent in the form of a pie chart:
It is important to ensure that lookups for contextual and threshold categorization are as efficient and as fast as possible because they are being executed in the same thread that is doing the actual work. In the ITracer
implementation, all metric names are stored into (thread-safe) maps designated for metrics with and without designated thresholds the first time they are seen by the tracer. After the first trace event for a given metric, the elapsed time for the determination of the threshold (or lack of one) is a Map
lookup time, which is typically fast enough. In cases where the number of threshold entries or the number of distinct metric names is extremely high, a reasonable solution would be to defer the threshold determination and have it handled in the asynchronous tracing thread-pool worker.
This first article in the series has presented some monitoring antipatterns as well as some desirable attributes of an APM system. I've summarized some general performance data collection patterns and introduced the ITracer
interface, which I'll continue to use for the rest of the series. I've demonstrated techniques for monitoring the health of a JVM and general performance data acquisition through JMX. Lastly, I summarized ways you can implement efficient and code-change-resistant source-level instrumentation that monitors raw performance statistics and contextual derived statistics, and how these statistics can be used to report on application SLAs. Part 2 will explore techniques for instrumenting Java systems without modifying the application source code, by using interception, class wrapping, and dynamic bytecode instrumentation.
Go to Part 2 now.
Translated from: https://www.ibm.com/developerworks/java/library/j-rtm1/index.html