Runtime performance and availability monitoring for Java systems

Many modern Java applications rely on a complex set of distributed dependencies and moving parts. Many external factors can affect your application's performance and availability. These influences are nearly impossible to eliminate completely or to simulate accurately in a preproduction environment. Stuff happens. But you can dramatically reduce the severity and duration of these incidents by creating and maintaining a comprehensive system that monitors your application's entire ecosystem.

This three-part article presents some patterns and techniques for implementing such a system. The patterns, and some of the terminology I use, are intentionally generic. Together with the code samples and illustrations, they will help you gain a conceptual understanding of application performance monitoring. That understanding emphasizes the need for a solution, which in turn can help you select a commercial or open source solution, extend and customize one, or -- for the motivated -- serve as a blueprint for building one.

Part 1:

  • Explores the attributes of application performance management (APM) systems
  • Describes common antipatterns in system monitoring
  • Presents methods for monitoring the performance of the JVM
  • Provides techniques for instrumenting application source code efficiently

Part 2 will focus on methods for instrumenting Java classes and resources without modifying the original source code. Part 3 will cover methods for monitoring resources outside the JVM, including hosts and their operating systems and remote services such as databases and messaging systems. It concludes by addressing additional APM issues such as data management, data visualization, reporting, and alerting.

APM systems: Patterns and antipatterns

First, I should emphasize that although much of the Java-specific material I present here may appear similar to the process of application and code profiling, that is not what I am referring to. Profiling is a very valuable preproduction process for confirming or refuting that your Java code is scalable, efficient, fast, and generally excellent. But on the axiom that stuff happens, the highest accolades from a development-phase code profiler will do you no good when you run into inexplicable problems in production.

What I am referring to here is implementing some aspects of profiling in production and collecting some of the same data -- in real time -- from a running application and all of its external dependencies. This data consists of a continuing series of quantitative measurements that are pervasive across their targets, with the goal of providing a granular and detailed representation of the whole system's health. And by retaining a historical store of these measurements, you can capture accurate baselines that help you confirm that the environment is healthy, or pinpoint the root cause and magnitude of a specific deficiency.

Monitoring antipatterns

It would be a rare application that had no monitoring resources at all, but consider the following antipatterns, which appear frequently in operational environments:

  • Blind spots: Some system dependencies are not monitored, or the monitoring data is not accessible. An operational database may have complete monitoring coverage, but if the supporting network does not, a fault is effectively hidden in the network while the triage team pores over the database's performance and the application server's symptoms.
  • Black boxes: The core application, or one of its dependencies, may have no monitoring transparency into its internals. A JVM can effectively be a black box. For example, a triage team investigating unexplained latency in a JVM, armed only with the supporting operating-system statistics -- such as CPU utilization or the memory size of the process -- may be unable to diagnose a garbage-collection or thread-synchronization issue.
  • Disjointed and disconnected monitoring systems: An application may be hosted in a large, shared data center where its dependencies include many shared resources such as databases, storage area network (SAN) storage, or messaging and middleware services. Organizations are sometimes highly siloed, with each group managing its own monitoring and APM systems (see the Pitfalls of siloed monitoring sidebar). Without a unified view of every dependency, each component owner sees only a small piece of the picture.

    Figure 1 contrasts siloed and consolidated APM systems:

    Figure 1. Siloed versus consolidated APM systems
  • After-the-fact reporting and correlation: To work around the problem of siloed monitoring, an operations support team may run periodic processes that acquire data from the various sources, consolidate it in one place after the fact, and then generate summary reports. This approach can be inefficient or impractical to perform at a regular frequency, and the absence of consolidated data in real time can hamper a triage team's ability to diagnose an issue as it happens. In addition, after-the-fact aggregation may lack sufficient granularity, hiding important patterns in the data. For example, a report might show that a particular service call averaged 200 milliseconds of elapsed time yesterday, while concealing the fact that between 1:00 p.m. and 1:45 p.m. it was regularly clocking in at more than 3,500 milliseconds.
  • Periodic or on-demand monitoring: Because some monitoring tools carry a high resource overhead, they cannot (or should not) run constantly. As a result, they collect data infrequently or only after a problem has already been detected. Consequently, the APM system performs minimal baselining, cannot alert you to a problem before its severity becomes intolerable, and may itself make a bad situation worse.
  • Nonpersistent monitoring: Many monitoring tools provide useful real-time displays of performance and availability metrics, but they are not configured for, or do not support, persisting the measurements for long- or short-term comparison and analysis. Without historical context, performance metrics often have little value, because there is no basis for judging whether a metric's value is good, bad, or terrible. For example, consider a current CPU utilization of 45 percent. Without knowing what utilization was during past periods of heavy or light load, this measurement is far less informative than knowing that the typical value is x percent and that the historical ceiling for acceptable user performance is y percent.
  • Reliance on preproduction modeling: The practice of relying exclusively on preproduction monitoring and system modeling -- on the assumption that all potential problems can be scrubbed out of the environment before production deployment -- frequently results in inadequate runtime monitoring. This assumption fails to account for unpredictable events and dependency failures, leaving the triage team with no tools or data to work with when such events occur.

The implementation of consolidated APM does not preclude or devalue highly specialized monitoring and diagnostic tools, such as DBA administration toolsets, low-level network-analysis applications, and data-center management solutions. Those tools remain invaluable resources, but if they are relied on to the exclusion of a consolidated view, the silo effect is difficult to overcome.

Attributes of the ideal APM system

In contrast to the antipatterns I've just described, the ideal APM system presented in this series of articles has the following attributes:

  • Pervasive: It monitors all application components and dependencies.
  • Granular: It can monitor extremely low-level functions.
  • Consolidated: All the collected measurements are routed to the same logical APM, supporting a consolidated view.
  • Constant: It monitors 24 hours a day, 7 days a week.
  • Efficient: The collection of performance data does not adversely affect the monitoring targets.
  • Real time: The monitored resources' metrics can be visualized, reported on, and alerted on in real time.
  • Historical: The monitored resources' metrics are persisted to a data store so that historical data can be visualized, compared, and reported on.

Before I dig into the implementation details of such a system, though, it will help to understand some general aspects of APM systems.

APM system concepts

All APM systems access performance data sources and include collection and tracing functions. Note that these are generic terms of my own choosing, used to describe general categories. They are not specific to any particular APM system, and other terms may be in use for the same concepts. In the rest of this article I use these terms in accordance with the following definitions.

Performance data sources

A performance data source (PDS) is a source of performance or availability data that can serve as a measurement reflecting a component's relative health. For example, a Java Management Extensions (JMX) service can typically supply a wealth of data about a JVM's health. Most relational databases publish performance data through an SQL interface. Both of these PDSs are examples of what I call direct sources: the source provides performance data directly. In contrast, an inferential source measures an intentional or incidental activity from which performance data is derived. For example, a test message can be sent to, and then retrieved from, a Java Message Service (JMS) server on a regular basis, with the round-trip time serving as an inferential measurement of that service's performance.

Inferential sources -- instances of which are referred to as synthetic transactions -- can be extremely useful because they can effectively measure several components, or layered calls, through the same pathways used by real activity. When direct sources fall short, synthetic transactions also play a key role in monitoring continuity to confirm that the system is operating in a relative state of health.
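
As a sketch of such a synthetic transaction (this is not part of the article's sample code, and the connection factory and dedicated test queue are assumed to be configured elsewhere), a collector might measure a JMS round trip like this:

import javax.jms.*;

public class JmsRoundTripCollector {
   private final ConnectionFactory connectionFactory; // assumed to be injected or configured
   private final Queue testQueue;                      // an assumed dedicated test queue, e.g. "APM.TEST"

   public JmsRoundTripCollector(ConnectionFactory cf, Queue q) {
      this.connectionFactory = cf;
      this.testQueue = q;
   }

   /** Sends a test message, waits for it to come back, and returns the round-trip time in milliseconds. */
   public long measureRoundTripMillis(long timeoutMs) throws JMSException {
      Connection conn = connectionFactory.createConnection();
      try {
         conn.start();
         Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
         MessageProducer producer = session.createProducer(testQueue);
         MessageConsumer consumer = session.createConsumer(testQueue);
         long start = System.currentTimeMillis();
         producer.send(session.createTextMessage("APM synthetic transaction"));
         Message received = consumer.receive(timeoutMs);   // blocks until the message returns or times out
         long elapsed = System.currentTimeMillis() - start;
         // A null response signals an availability failure rather than a latency sample
         return received == null ? -1L : elapsed;
      } finally {
         conn.close();
      }
   }
}

The returned round-trip time can then be traced like any other measurement, and a timeout can be traced as an availability incident.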

Collection and collectors

Collection is the process of acquiring performance or availability data from a PDS. In the case of a direct PDS, a collector typically implements some sort of API to access that data. To read statistics from a network router, a collector might use Simple Network Management Protocol (SNMP) or Telnet. In the case of an inferential PDS, the collector executes and measures the underlying activity.

Tracing and tracers

Tracing is the process of delivering measurements from a collector to the core APM system. Many commercial and open source APM systems provide an API of some kind for this purpose. For this article's examples, I have implemented a generic Java tracer interface, which I describe in detail in the next section.

Most APM systems organize the data submitted by tracers into some form of categorization and hierarchy. Figure 2 illustrates the general flow of this data capture:

Figure 2. Collection, tracing, and the APM system

Figure 2 also shows some services commonly found in APM systems:

  • Live visualization: Graphs and charts display selected metrics in real time.
  • Reporting: Reports are generated on metric activity. These typically include a set of canned reports, custom reporting, and the ability to export data for use elsewhere.
  • Historical store: A historical data store holds raw or summarized metrics so that visualizations and reports can be viewed for specific time ranges.
  • Alerting: The ability to notify interested individuals or groups of specific conditions determined from the collected metrics. Typical alert methods are e-mail and some form of custom hook interface that lets operations teams propagate events into an event-handling system.

The implementation and use of a common tracing API across the APM's entire target environment provides a level of consistency. And, for the purpose of custom collectors, it lets developers concentrate on acquiring the performance data without worrying about the tracing aspect. The next section introduces an APM tracing interface that addresses this subject.

ITracer: The tracer interface

The Java language serves well as an implementation language for collectors because it has:

  • Broad platform support. Java collector classes can run unmodified on most target platforms, which gives the monitoring architecture the flexibility to place collector processes locally with the PDS rather than forcing remote collection.
  • Generally excellent performance (although it varies with the resources available).
  • Strong support for concurrency and asynchronous execution.
  • Support for a wide variety of communication protocols.
  • Broad support from third-party APIs -- such as JDBC implementations, SNMP, and proprietary Java interfaces -- which in turn support a wide range of collector libraries.
  • Support from an active open source community that supplies the language with additional tools and interfaces for accessing or deriving data from a large number of sources.

One caveat, however, is that your Java collectors must be able to integrate with the tracing API supplied by the target APM system. If your APM's tracing mechanism does not provide a Java interface, some of these patterns may still apply. But in cases where the target PDS is exclusively Java-based (such as JMX) and your APM platform is not, you need a bridging interface such as IKVM, a Java-to-.NET compiler (see Related topics).

In the absence of a formal standard, the tracing APIs provided by different APM products are all different. So I have abstracted the issue away by implementing a generic tracing Java interface called org.runtimemonitoring.tracing.ITracer. The ITracer interface is a generic wrapper around a proprietary tracing API. This technique insulates the source base from version or API-provider changes, and it also provides the opportunity to implement additional functionality not available in the wrapped API. Most of the remaining examples in this article implement the ITracer interface and the basic concepts it supports.

Figure 3 is a UML class diagram of the org.runtimemonitoring.tracing.ITracer interface:

Figure 3. The ITracer interface and factory class

Tracing categories and names

The fundamental premise of ITracer is the submission of a measurement value and an associated name to the central APM system. This activity is implemented through the trace methods, which vary according to the nature of the measurement being submitted. Each trace method accepts a String[] name parameter containing the contextual components of a compound name whose structure is specific to the APM system. The compound name indicates to the APM system both the namespace of the submission and the actual metric name. Accordingly, a compound name usually has at least one root category and a metric description. The underlying ITracer implementation is expected to know how to build the compound name from the String[] passed in. Table 1 presents two examples of compound-naming conventions:

Table 1. Example compound names
Naming structure    Compound name
Simple slash-delimited    Hosts/SalesDatabaseServer/CPU Utilization/CPU3
JMX MBean ObjectName    com.myco.datacenter.apm:type=Hosts,service=SalesDatabaseServer,group=CPU Utilization,instance=CPU3

Listing 1 is a brief example of tracing calls made with this API:

Listing 1. Example tracing API calls
ITracer simpleTracer = TracerFactory.getInstance(sprops);
ITracer jmxTracer = TracerFactory.getInstance(jprops);
.
.
simpleTracer.trace(37, "Hosts", "SalesDatabaseServer",
   "CPU Utilization", "CPU3", "Current Utilization %");
jmxTracer.trace(37, 
   "com.myco.datacenter.apm", 
   "type=Hosts", 
   "service=SalesDatabaseServer", 
   "group=CPU Utilization", 
   "instance=CPU3", "Current Utilization %");

Tracer measurement data types

In this interface, a measurement can have one of the following types:

  • int
  • long
  • java.util.Date
  • String

An APM system provider may support additional data types for collected measurements.

Tracer types

Given a specific measurement data type (such as long), a given value can be interpreted in different ways, depending on the type support in the APM system. Keep in mind also that each APM implementation may use different terms for the same types; ITracer uses its own generic naming.

The tracer types represented in ITracer are:

  • Interval averaged: The trace(long value, String[] name) and trace(int value, String[] name) methods issue traces of interval averaged values (see the Intervals sidebar). This means that each submission is factored into an aggregated value for the current interval. Once a new interval begins, the aggregated-value counter is reset to zero.
  • Sticky: The traceSticky(long value, String[] name) and traceSticky(int value, String[] name) methods issue traces of sticky values. This means that, in contrast to interval averaged metrics, the aggregates retain their values across intervals. If I trace a value of 5 now and then do not trace again until sometime tomorrow, the metric stays at 5 until a new value is supplied.
  • Delta: A delta trace is passed a number, but the actual value supplied to (or interpreted by) the APM system is the difference between this measurement and the preceding one. These are sometimes called rate types, reflecting their strength. Consider a measurement of a transaction manager's total number of commits. The number always increases, and the absolute value is quite possibly of no use. The useful aspect of the number is the rate at which it increases, so collecting the absolute number periodically and tracing the difference between readings reflects the rate of transaction commits. Delta traces come in interval averaged and sticky flavors, although few use cases call for interval averaged deltas. A delta tracer must be able to distinguish measurements that are expected only to increase from measurements that can both increase and decrease. A submitted measurement that is smaller than the preceding value should either be ignored or cause the underlying delta to be reset.
  • Incident: This type is a simple, nonaggregated metric that is an incrementing count of the occurrences of a specific event in an interval. Because neither the collector nor the tracer can be expected to know the running total at any given time, the basic traceIncident(String[] name) call has no value, and a tick of one incident is implied. For cases where you want more than one tick without calling the method repeatedly in a loop, the traceIncident(int value, String[] name) method ticks the total by value.
  • Smart: The smart tracer is a parameterized type that maps to one of the tracer's other types. The measurement value and the trace type are passed in as Strings, and the available types are defined as constants in the interface. This is a convenience for situations in which the collector has no idea what the data type or tracer type of the collected data is but can simply pass the collected value and the configured type name straight through to the tracer.

TracerFactory is a generic factory class for creating new ITracer instances based on the configuration properties passed in, or for referencing already-created ITracers from a cache. A sketch of the interface follows.
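
Based on the trace methods and tracer types just described, the ITracer interface looks roughly like the sketch below. The exact signatures are in the downloadable source; in particular, the smartTrace and traceDelta method names here are assumptions, while the others appear in this article's listings.

package org.runtimemonitoring.tracing;

import java.util.Date;

public interface ITracer {
   // Interval averaged traces for each supported measurement data type
   void trace(long value, String... name);
   void trace(int value, String... name);
   void trace(Date value, String... name);
   void trace(String value, String... name);

   // Sticky traces: the aggregate retains its value across intervals
   void traceSticky(long value, String... name);
   void traceSticky(int value, String... name);

   // Delta traces: the APM system records the difference from the prior reading
   void traceDelta(long value, String... name);        // assumed name
   void traceDeltaSticky(long value, String... name);

   // Incident traces: increment a per-interval event counter
   void traceIncident(String... name);
   void traceIncident(int value, String... name);

   // Smart trace: the value and tracer type are passed as Strings (assumed name)
   void smartTrace(String type, String value, String... name);

   // Maps a numeric measurement to a named range (see the contextual-tracing section later in this article)
   String lookupRange(String rangeName, long value);
}

A collector obtains an instance through the factory, as in Listing 1: ITracer tracer = TracerFactory.getInstance(props);.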

Collector patterns

Collectors typically use one of the following three patterns, which influences the tracer type that should be used:

  • Polling: The collector is invoked at a fixed frequency, and it retrieves and traces the current value of a metric, or set of metrics, from the PDS. For example, a collector might be invoked once a minute to read a host's CPU utilization, or to read the total number of committed transactions from a transaction manager through a JMX interface. The premise of the polling pattern is periodic sampling of the target metrics. Accordingly, the metric's value is supplied to the APM system at the time of the polling event but is assumed to be unchanged for the duration of the intervening periods. For that reason, polling collectors typically use sticky tracer types: the APM system reports the value as unchanged between polling events. Figure 4 illustrates this pattern:
    Figure 4. The polling collection pattern
  • Listening: This general data pattern is a form of the Observer pattern. The collector registers itself as an event listener with the target PDS and receives a callback whenever an event of interest occurs. The traceable values issued as a result of a callback depend on the contents of the callback payload itself, but at a minimum the collector can trace an incident for each callback. Figure 5 illustrates this pattern:
    Figure 5. The listening collection pattern
  • Interception: In this pattern, the collector inserts itself as an interceptor between a target and its callers. For each instance of activity that passes through it, the interceptor takes a measurement and traces it. In cases where the interception pattern is request/response, the collector can measure the number of requests, the response times, and possibly the size of the request or response payloads. For example, an HTTP proxy server that also acts as a collector can:
    • Count requests, optionally broken out by HTTP request type (GET, POST, and so on) or by Uniform Resource Identifier (URI).
    • Time the responses to requests.
    • Measure the sizes of requests and responses.
    Because an intercepting collector can be assumed to "see" every event, the tracer types implemented are usually interval averaged. So if an interval expires without any activity, the aggregated value for that interval is zero, regardless of the activity in prior intervals. Figure 6 illustrates this pattern:
    Figure 6. The interception collection pattern

Now that I've outlined the performance-data tracing API, its underlying data types, and the patterns of data collection, I'll present some specific use cases and examples that put the API to work.

Monitoring the JVM

The JVM itself is a sensible place to start implementing performance monitoring. I'll begin with performance metrics common to all JVMs and then move on to some JVM-resident components that are typical in enterprise applications. With few exceptions, an instance of a Java application is a process supported by an underlying operating system, so several aspects of JVM monitoring are best viewed from the perspective of the hosting OS, which is covered in Part 3.

Before the release of Java Platform, Standard Edition 5 (Java SE), the means available for collecting internal and standardized JVM diagnostics efficiently and reliably at run time were limited. Now several useful monitoring points are available through the java.lang.management interfaces, which are standard in all compliant Java SE 5 (and newer) JVM releases. Some implementations of these JVMs provide additional proprietary metrics, but the access patterns are roughly the same. I'll focus on the standard ones, accessible through the JVM's MXBeans -- JMX MBeans deployed inside the VM that expose a management and monitoring interface (see Related topics):

  • ClassLoadingMXBean: Monitors the class-loading system.
  • CompilationMXBean: Monitors the compilation system.
  • GarbageCollectorMXBean: Monitors the JVM's garbage collectors.
  • MemoryMXBean: Monitors the JVM's heap and nonheap memory spaces.
  • MemoryPoolMXBean: Monitors the memory pools allocated by the JVM.
  • RuntimeMXBean: Monitors the runtime system. This MXBean provides few useful monitoring metrics, but it does supply the JVM's input arguments as well as its start time and uptime, both of which can serve as factors in other derived metrics.
  • ThreadMXBean: Monitors the threading system.

The premise of a JMX collector is that it acquires an MBeanServerConnection -- an object that can read attributes from MBeans deployed in a JVM -- reads the values of the target attributes, and traces them using the ITracer API. A key decision for this kind of collection is where to deploy the collector. The choices are local deployment and remote deployment.

In local deployment, the collector and its invocation scheduler are deployed inside the target JVM itself. The JMX collector components then access the MXBeans using the PlatformMBeanServer, a statically accessible MBeanServerConnection internal to the JVM. In remote deployment, the collector runs in a separate process and connects to the target JVM using some form of JMX remoting. This can be less efficient than local deployment, but it requires no additional components to be deployed into the target system. JMX remoting is beyond the scope of this article, but it can be implemented easily by deploying an RMIConnectorServer or by simply enabling external connectivity in the JVM (see Related topics).

A sample JMX collector

This article's sample JMX collector (see Download for the full source code) contains three separate methods for acquiring an MBeanServerConnection; a brief sketch of all three follows the list below. The collector can:

  • Acquire an MBeanServerConnection to the local JVM's platform MBeanServer, using a static call to the java.lang.management.ManagementFactory.getPlatformMBeanServer() method.
  • Acquire an MBeanServerConnection to a secondary MBeanServer deployed locally in the JVM, using a static call to the javax.management.MBeanServerFactory.findMBeanServer(String agentId) method. Note that more than one MBeanServer can reside in a single JVM, and more complex systems -- such as Java Platform, Enterprise Edition (Java EE) servers -- nearly always have an application-server-specific MBeanServer that is separate and distinct from the platform MBeanServer (see the Cross-registering MBeans sidebar).
  • Acquire a remote MBeanServerConnection through standard RMI remoting, using a javax.management.remote.JMXServiceURL.
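
Here is a brief sketch of those three acquisition approaches; the JMXServiceURL value shown is illustrative.

import java.lang.management.ManagementFactory;
import java.util.List;
import javax.management.MBeanServer;
import javax.management.MBeanServerConnection;
import javax.management.MBeanServerFactory;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class MBeanServerConnections {

   // 1. The local platform MBeanServer (in-VM collection)
   public static MBeanServerConnection platform() {
      return ManagementFactory.getPlatformMBeanServer();
   }

   // 2. A secondary MBeanServer co-located in the same JVM (for example, an app server's own MBeanServer)
   public static MBeanServerConnection coLocated(String agentId) {
      List<MBeanServer> servers = MBeanServerFactory.findMBeanServer(agentId);
      return servers.isEmpty() ? null : servers.get(0);
   }

   // 3. A remote MBeanServer reached through standard RMI remoting
   public static MBeanServerConnection remote(String serviceUrl) throws Exception {
      // e.g. "service:jmx:rmi:///jndi/rmi://appserver3:1090/jmxconnector"
      JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(serviceUrl));
      return connector.getMBeanServerConnection();
   }
}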

Listing 2 is a brief snippet of the JMXCollector's collect() method, showing the collection and tracing of thread activity from the ThreadMXBean.

Listing 2. Part of the sample JMX collector's collect() method, using ThreadMXBean
.
.
objectNameCache.put(THREAD_MXBEAN_NAME, new ObjectName(THREAD_MXBEAN_NAME));
.
.
public void collect() {
   CompositeData compositeData = null;
   String type = null;
   try {
      log("Starting JMX Collection");
      long start = System.currentTimeMillis();
      ObjectName on = null;
.
.
      // Thread Monitoring
      on = objectNameCache.get(THREAD_MXBEAN_NAME);
      tracer.traceDeltaSticky((Long)jmxServer.getAttribute(on,"TotalStartedThreadCount"), 
        hostName, "JMX", on.getKeyProperty("type"), "StartedThreadRate");
      tracer.traceSticky((Integer)jmxServer.getAttribute(on, "ThreadCount"), hostName, 
        "JMX", on.getKeyProperty("type"), "CurrentThreadCount");
.
.
      // Done
      long elapsed = System.currentTimeMillis()-start;
      tracer.trace(elapsed, hostName, "JMX", "JMX Collector", 
         "Collection", "Last Elapsed Time");
      tracer.trace(new Date(), hostName, "JMX", "JMX Collector", 
         "Collection", "Last Collection");         
      log("Completed JMX Collection in ", elapsed, " ms.");         
   } catch (Exception e) {
      log("Failed:" + e);
      tracer.traceIncident(hostName, "JMX", "JMX Collector", 
         "Collection", "Collection Errors");
   }
}

The code in Listing 2 traces the values of TotalStartedThreadCount and ThreadCount. Because this is a polling collector, both traces use the sticky option. But because TotalStartedThreadCount is an ever-increasing number, the most interesting aspect is not the absolute count but the rate of thread creation, so the tracer uses the DeltaSticky option.

Figure 7 shows the APM metric tree created by this collector:

Figure 7. The JMX collector APM metric tree

The JMX collector has several aspects not shown in Listing 2 (but visible in the full source code), such as the scheduling registration that creates a recurring callback to the collect() method every 10 seconds.
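
As a minimal sketch of such a scheduling registration -- assuming a java.util.concurrent scheduler, which the downloadable source may or may not use -- the collector's collect() method can be registered for a recurring 10-second callback like this:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CollectorScheduler {
   private final ScheduledExecutorService scheduler =
         Executors.newSingleThreadScheduledExecutor();

   /** Registers a collector for periodic invocation, e.g. schedule(collector, 10000). */
   public void schedule(final JMXCollector collector, long periodMs) {
      scheduler.scheduleAtFixedRate(new Runnable() {
         public void run() {
            try {
               collector.collect();
            } catch (Exception e) {
               // a failed collection must never kill the scheduling thread
            }
         }
      }, periodMs, periodMs, TimeUnit.MILLISECONDS);
   }
}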

In Listing 2, different tracer types and data types are implemented depending on the data source. For example:

  • TotalLoadedClasses and UnloadedClassCount are traced as sticky deltas, because these values only ever rise and the delta is probably more useful than the absolute value as a measure of class-loading activity.
  • ThreadCount is a variable quantity that can increase or decrease, so it is traced as sticky.
  • Collection Errors are traced as interval incidents, incremented by any exception encountered while performing a collection.

For efficiency's sake, because the JMX ObjectNames of the target MXBeans will not change for the lifetime of the target JVM, the collector caches the names, using the constant names supplied by ManagementFactory.

For two of the MXBean types -- GarbageCollector and MemoryPool -- the exact ObjectNames may not be known in advance, but you can supply a general pattern. In these cases, the first time a collection occurs, a query is issued against the MBeanServerConnection requesting a list of all MBeans matching the supplied pattern. The matching MBean ObjectNames that are returned are cached to avoid future queries for the lifetime of the target JVM.
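
A minimal sketch of that resolve-once-and-cache approach, assuming wildcard patterns such as java.lang:type=GarbageCollector,* and java.lang:type=MemoryPool,*:

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class ObjectNameResolver {
   private final Map<String, Set<ObjectName>> cache =
         new ConcurrentHashMap<String, Set<ObjectName>>();

   /** Resolves a wildcard pattern on first use, then serves the matching ObjectNames from cache. */
   public Set<ObjectName> resolve(MBeanServerConnection conn, String pattern) throws Exception {
      Set<ObjectName> names = cache.get(pattern);
      if (names == null) {
         names = conn.queryNames(new ObjectName(pattern), null);  // one query per pattern per JVM lifetime
         cache.put(pattern, names);
      }
      return names;
   }
}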

In some cases, the target MBean attribute of a collection may not be a flat numeric type. This is the case with MemoryMXBean and MemoryPoolMXBean. In these cases, the attribute type is a CompositeData object, which is queried for its keys and values. For the java.lang.management JVM management interface, the MXBean standard adopts the JMX open type model, in which all attributes are language-neutral types such as java.lang.Boolean or java.lang.Integer. Or in the case of complex types such as javax.management.openmbean.CompositeType, these types can be decomposed into key/value pairs of the same simple types. The full list of simple types is enumerated in the static javax.management.openmbean.OpenType.ALLOWED_CLASSNAMES field. This model supports a level of type independence so that JMX clients do not have a dependency on nonstandard classes and can also support non-Java clients because of the relative simplicity of the underlying types. For more detail on JMX Open Types, see Related topics.
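
As a brief illustration of that decomposition -- this is not the article's sample code -- here is how the platform MemoryMXBean's HeapMemoryUsage attribute, a CompositeData value, breaks down into simple key/value pairs that can each be traced:

import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

public class CompositeDataExample {
   public static void main(String[] args) throws Exception {
      MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
      ObjectName memory = new ObjectName("java.lang:type=Memory");
      CompositeData heap = (CompositeData) conn.getAttribute(memory, "HeapMemoryUsage");
      // The CompositeType enumerates the keys; each value is a simple open type (here, Long)
      for (String key : heap.getCompositeType().keySet()) {
         System.out.println("HeapMemoryUsage/" + key + " = " + heap.get(key));
         // init, used, committed, and max could each be traced as a sticky long
      }
   }
}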

In cases in which a target MBean attribute is a nonstandard complex type, you need to ensure that the class defining that type is in your collector's classpath. And you must implement some custom code to render the useful data from the retrieved complex object.

In instances in which a single connection is acquired and retained for all collections, error detection and remediation is required to create a new connection in the event of a failure. Some collection APIs provide disconnect listeners that can prompt the collector to close, clean up, and create a new connection. To address scenarios in which a collector tries to connect to a PDS that has been taken down for maintenance or is inaccessible for some other reason, the collector should poll for reconnect on a friendly frequency. Tracking a connection's elapsed time can also be useful in order to degrade the frequency of collections if a slowdown is detected. This can reduce overhead on a target JVM that may be overly taxed for a period of time.

Two additional techniques not implemented in these examples can improve the JMX collector's efficiency and reduce the overhead of running it against the target JVM. The first technique applies in cases in which multiple attributes are being interrogated from one MBean. Rather than requesting one attribute at a time using getAttribute(ObjectName name, String attribute) , it is possible to issue a request for multiple attributes in one call using getAttributes(ObjectName name, String[] attributes). The difference might be negligible in local collection but can reduce resource utilization significantly in remote collection by reducing the number of network calls. The second technique is to reduce the polling overhead of the JMX exposed memory pools further by implementing the listening collector pattern instead of a polling pattern. The MemoryPoolMXBean supports the ability to establish a usage threshold that, when exceeded, fires a notification to a listener, which in turn can trace the value. As the memory usage increases, the usage threshold can be increased accordingly. The downside of this approach is that without extremely small increments in the usage threshold, some granularity of data can be lost and patterns of memory usage below the threshold become invisible.
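
A sketch of that listening alternative follows, using the platform MemoryMXBean's threshold-exceeded notifications; the threshold-raising strategy shown is an assumption, and the granularity trade-off described above still applies.

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;

public class MemoryThresholdListener implements NotificationListener {

   public static void register(MemoryPoolMXBean pool, long thresholdBytes) {
      if (!pool.isUsageThresholdSupported()) {
         return; // not every pool supports usage thresholds
      }
      pool.setUsageThreshold(thresholdBytes);
      // The platform MemoryMXBean is documented to be a NotificationEmitter
      NotificationEmitter emitter = (NotificationEmitter) ManagementFactory.getMemoryMXBean();
      emitter.addNotificationListener(new MemoryThresholdListener(), null, pool);
   }

   public void handleNotification(Notification notification, Object handback) {
      if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(notification.getType())) {
         MemoryPoolMXBean pool = (MemoryPoolMXBean) handback;
         long used = pool.getUsage().getUsed();
         // trace 'used' here, then raise the threshold so that further growth is still reported
         long max = pool.getUsage().getMax();
         long next = (long) (used * 1.1);
         if (max < 0 || next <= max) {
            pool.setUsageThreshold(next);
         }
      }
   }
}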

A final unimplemented technique is to measure windows of elapsed time and the total elapsed garbage-collection time and implement some simple arithmetic to derive the percentage of elapsed time that the garbage collector is active. This is a useful metric because some garbage collection is (for the time being) an inevitable fact of life for most applications. Because some number of collections, each lasting some period of time, are to be expected, the percentage of elapsed time when garbage collections are running can put the JVM's memory health in a clearer context. A general rule of thumb (but highly variable by application) is that any more than 10 percent of any 15-minute period indicates a potential issue.
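
As a sketch of that derivation (again, not part of the article's sample code), the percentage can be computed from deltas of each garbage collector's cumulative collection time between polling events:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPercentageCollector {
   private long lastWallClock = System.currentTimeMillis();
   private long lastGcTime = totalGcTimeMillis();

   private static long totalGcTimeMillis() {
      long total = 0;
      for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
         total += Math.max(gc.getCollectionTime(), 0); // -1 means the metric is unavailable
      }
      return total;
   }

   /** Called on each polling event; returns the GC busy percentage for the window just ended. */
   public double collect() {
      long nowWall = System.currentTimeMillis();
      long nowGc = totalGcTimeMillis();
      long wallDelta = nowWall - lastWallClock;
      long gcDelta = nowGc - lastGcTime;
      lastWallClock = nowWall;
      lastGcTime = nowGc;
      if (wallDelta <= 0) {
         return 0.0;
      }
      // e.g. a sustained value above 10 percent over a 15-minute window may indicate a problem
      return (gcDelta * 100.0) / wallDelta;
   }
}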

External configuration for collectors

The JMX collector I've outlined in this section is simplified to illustrate the collection process, but it's extremely limiting always to have hard-coded collections. Ideally, a collector implements the data-access how , and an externally supplied configuration supplies the what . Such a design makes collectors much more useful and reusable. For the highest level of reuse, an externally configured collector should support these configuration points:

  • A PDS connection-factory directive to provide the collector the interface to use to connect to the PDS and the configuration to use when connecting.
  • The frequency to collect on.
  • The frequency on which to attempt a reconnect.
  • The target MBean for collection, or a wildcard object name.
  • For each target, the tracing compound name or fragment the measurement should be traced to, and the data type that it should be traced as.

Listing 3 illustrates an external configuration for a JMX collector:

Listing 3. Example of external configuration for a JMX collector

<!-- Element names here are illustrative -->
<JMXCollector>
   <connectionFactory>
      collectors.jmx.RemoteRMIMBeanServerConnectionFactory
   </connectionFactory>
   <connectionFactoryProperties>
      jmx.rmi.url=service:jmx:rmi://127.0.0.1/jndi/rmi://127.0.0.1:1090/jmxconnector
   </connectionFactoryProperties>
   <tracingNameSpace>AppServer3.myco.org,JMX</tracingNameSpace>
   <frequency>10000</frequency>
   <targets>
      <Target objectName="java.lang:type=Threading">
         <TargetAttribute name="ThreadCount" traceName="CurrentThreadCount" type="SINT"/>
         <TargetAttribute name="TotalStartedThreadCount" traceName="StartedThreadRate" type="SDINT"/>
      </Target>
   </targets>
</JMXCollector>

Note that the TargetAttribute elements contain an attribute called type, which represents a parameterized argument to a smart type tracer. The SINT type represents sticky int, and the SDINT type represents delta sticky int.

Monitoring application resources through JMX

So far, I've examined monitoring only standard JVM resources through JMX. However, many application frameworks, such as Java EE, can expose important application-specific metrics through JMX, depending on the vendor. One classic example is DataSource utilization. A DataSource is a service that pools connections to an external resource (most commonly, a database), limiting the number of concurrent connections to protect the resource from misbehaving or stressed applications. Monitoring data sources is a critical piece of an overall monitoring plan. The process is similar to what you have already seen, thanks to JMX's abstraction layer.

Here's a list of typical data source metrics taken from a JBoss 4.2 application server instance:

  • Available connection count : The number of connections that are currently available in the pool.
  • Connection count : The number of actual physical connections to the database from connections in the pool.
  • Maximum connections in use : The high-water mark of connections in the pool being in use.
  • In-use connection count : The number of connections currently in use.
  • Connections-created count : The total number of connections created for this pool.
  • Connections destroyed count : The total number of connections destroyed for this pool.

This time, the collector uses batch attribute retrieval and acquires all the attributes in one call. The only caveat here is the need to interrogate the returned data to switch on the different data and tracer types. DataSource metrics are also pretty flat without any activity, so to see some movement in the numbers, you need to generate some load. Listing 4 shows the DataSource collector's collect() method:

Listing 4. The DataSource collector
public void collect() {
   try {
      log("Starting DataSource Collection");
      long start = System.currentTimeMillis();
      ObjectName on = objectNameCache.get("DS_OBJ_NAME");
      AttributeList attributes  = jmxServer.getAttributes(on, new String[]{
            "AvailableConnectionCount", 
            "MaxConnectionsInUseCount",
            "InUseConnectionCount",
            "ConnectionCount",
            "ConnectionCreatedCount",
            "ConnectionDestroyedCount"
      });
      for(Attribute attribute: (List<Attribute>)attributes) {
         if(attribute.getName().equals("ConnectionCreatedCount") 
            || attribute.getName().equals("ConnectionDestroyedCount")) {
               tracer.traceDeltaSticky((Integer)attribute.getValue(), hostName, 
               "DataSource", on.getKeyProperty("name"), attribute.getName());
         } else {
            if(attribute.getValue() instanceof Long) {
               tracer.traceSticky((Long)attribute.getValue(), hostName, "DataSource", 
                  on.getKeyProperty("name"), attribute.getName());
            } else {
               tracer.traceSticky((Integer)attribute.getValue(), hostName, 
                  "DataSource",on.getKeyProperty("name"), attribute.getName());
            }
         }
      }
      // Done
      long elapsed = System.currentTimeMillis()-start;
      tracer.trace(elapsed, hostName, "DataSource", "DataSource Collector", 
         "Collection", "Last Elapsed Time");
      tracer.trace(new Date(), hostName, "DataSource", "DataSource Collector", 
         "Collection", "Last Collection");         
      log("Completed DataSource Collection in ", elapsed, " ms.");         
   } catch (Exception e) {
      log("Failed:" + e);
      tracer.traceIncident(hostName, "DataSource", "DataSource Collector", 
         "Collection", "Collection Errors");
   }      
}

Figure 8 shows the corresponding metric tree for the DataSource collector:

Figure 8. The DataSource collector metric tree

Monitoring components in the JVM

This section addresses techniques that can be used to monitor application components, services, classes, and methods. The primary areas of interest are:

  • Invocation rate: The rate at which a service or method is being invoked.
  • Invocation response rate : The rate at which a service or method responds.
  • Invocation error rate : The rate at which a service or method generates errors.
  • Invocation elapsed time : The average, minimum, and maximum elapsed time for an invocation per interval.
  • Invocation concurrency : The number of threads of execution concurrently invoking a service or method.

Using metrics made available by some implementations of the Java SE 5 (and newer) ThreadMXBean, it is also possible to collect the following metrics (a brief sketch of capturing them follows this list):

  • System and user CPU time : The elapsed CPU time consumed invoking a method.
  • Number of waits and total wait time : The number of instances and total elapsed time when the thread was waiting while invoking a method or service. Waits occur when a thread enters a wait state of WAITING or TIMED_WAITING pending another thread's activity.
  • Number of blocks and total block time : The number of instances and total elapsed time when the thread was in a BLOCKED state while invoking a method or service. Blocks occur when a thread is waiting for a monitor lock to enter or reenter a synchronized block.
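
Here is a brief sketch of capturing those metrics around a single invocation. Thread contention monitoring is optional and must be enabled, and both support and overhead vary by JVM implementation; the class and method names are illustrative.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadMetricsInterceptor {
   private static final ThreadMXBean TMX = ManagementFactory.getThreadMXBean();

   public static void profile(Runnable invocation) {
      if (TMX.isThreadContentionMonitoringSupported()) {
         TMX.setThreadContentionMonitoringEnabled(true);   // otherwise blocked/waited times report -1
      }
      long threadId = Thread.currentThread().getId();
      ThreadInfo before = TMX.getThreadInfo(threadId);
      long cpuBefore = TMX.getCurrentThreadCpuTime();       // nanoseconds; -1 if unsupported
      long userBefore = TMX.getCurrentThreadUserTime();

      invocation.run();

      ThreadInfo after = TMX.getThreadInfo(threadId);
      System.out.println("CPU (ns): " + (TMX.getCurrentThreadCpuTime() - cpuBefore));
      System.out.println("User (ns): " + (TMX.getCurrentThreadUserTime() - userBefore));
      System.out.println("Blocks: " + (after.getBlockedCount() - before.getBlockedCount())
            + ", blocked ms: " + (after.getBlockedTime() - before.getBlockedTime()));
      System.out.println("Waits: " + (after.getWaitedCount() - before.getWaitedCount())
            + ", waited ms: " + (after.getWaitedTime() - before.getWaitedTime()));
   }
}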

These metrics, and others, can also be determined using alternative tool sets and native interfaces, but this usually involves some level of overhead that makes them undesirable for production run-time monitoring. Having said that, the metrics themselves, even when collected, are low level. They may not be useful for anything other than trending, and they are quite difficult to correlate with any causal effects that can't be identified through other means.

All of the above metrics can be collected by a process of instrumenting the classes and methods of interest to make them collect and trace the performance data to the target APM system. A number of techniques can be used to instrument Java classes directly or to derive performance metrics from them indirectly:

  • Source code instrumentation: The most basic technique is to add instrumentation at the source code phase so that the compiled and deployed classes already contain the instrumentation at run time. In some cases, it makes sense to do this, and certain practices make it a tolerable process and investment.
  • Interception : By diverting an invocation through an interceptor that performs the measurement and tracing, it is possible to monitor accurately and efficiently without touching the targeted classes, their source code, or their run-time bytecode. This practice is quite accessible because many Java EE frameworks and other popular Java frameworks:
    • Favor abstraction through configuration.
    • Enable class injection and referencing through interfaces.
    • In some cases directly support the concept of an interception stack. The flow of execution passes through a configuration-defined stack of objects whose purpose and design is to accept an invocation, do something with it, and then pass it on.
  • Bytecode instrumentation: This is the process of injecting bytecode into the application classes. The injected bytecode adds performance-data-collecting instrumentation that is invoked as part and parcel of what is essentially a new class. This process can be highly efficient because the instrumentation is fully compiled bytecode, and the code's execution path is extended in about as small a way as possible while still collecting data. It also has the virtue of not requiring any modification to the original source code, and potentially minimal configuration change to the environment. Moreover, the general pattern and techniques of bytecode injection allow the instrumentation of classes and libraries for which source code is not available, as is the case with many third-party classes.
  • Class wrapping : This is the process of wrapping or replacing a target class with another class that implements the same functionality but also contains instrumentation.

Here in Part 1, I address only source code based instrumentation; you'll read more about interception, bytecode instrumentation, and class wrapping in Part 2 . (Interception, bytecode instrumentation, and class wrapping are virtually identical from a topological perspective, but the action to achieve the result has slightly different implications in each case.)

Asynchronous instrumentation

Asynchronous instrumentation is a fundamental issue in class instrumentation. A previous section explored the concepts of polling for performance data. If polling is done reasonably well, it should have no impact on the core application performance or overhead. In contrast, instrumenting the application code itself directly modifies and affects the core code's execution. The primary goal of any sort of instrumentation must be Above all, do no harm . The overhead penalty must be as close to negligible as possible. There is virtually no way to eliminate an extremely small execution penalty in the measurement itself, but once the performance data has been acquired, it is critical that the remainder of the trace process be asynchronous. There are several patterns for implementing asynchronous tracing. Figure 9 illustrates a general overview of how it can be done:

Figure 9. Asynchronous tracing

Figure 9 illustrates a simple instrumentation interceptor that measures the elapsed time of an invocation by capturing its start time and end time, and then dispatches the measurement (the elapsed time and the metric compound name) to a processing queue. The queue is then read by a thread pool, which acquires the measurement and completes the trace process.
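
A minimal sketch of that queue-and-worker pattern follows; the class name, queue capacity, and pool size are illustrative.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.runtimemonitoring.tracing.ITracer;

public class AsyncTraceDispatcher {
   private static class Measurement {
      final long value;
      final String[] name;
      Measurement(long value, String... name) { this.value = value; this.name = name; }
   }

   private final BlockingQueue<Measurement> queue = new ArrayBlockingQueue<Measurement>(10000);
   private final ExecutorService workers = Executors.newFixedThreadPool(2);
   private final ITracer tracer; // the underlying (possibly APM-specific) tracer

   public AsyncTraceDispatcher(ITracer tracer) {
      this.tracer = tracer;
      for (int i = 0; i < 2; i++) {
         workers.execute(new Runnable() {
            public void run() {
               try {
                  while (!Thread.currentThread().isInterrupted()) {
                     Measurement m = queue.take();      // blocks until a measurement arrives
                     // the (potentially expensive) trace happens off the instrumented thread
                     AsyncTraceDispatcher.this.tracer.trace(m.value, m.name);
                  }
               } catch (InterruptedException ie) {
                  Thread.currentThread().interrupt();
               }
            }
         });
      }
   }

   /** Called from the instrumented code path: must be as close to free as possible. */
   public void submit(long elapsed, String... name) {
      queue.offer(new Measurement(elapsed, name));      // drops the sample rather than block the caller
   }
}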

Java class instrumentation through source code

This section addresses the subject of implementing source level instrumentation and provides some best practices and example source code. It also introduces some new tracing constructs that I'll detail in the context of source code instrumentation to clarify their actions and their implementation patterns.

Despite the prevalence of alternatives, instrumentation of source code is unavoidable in some instances; in some cases it's the only solution. With sensible precautions, it's not necessarily a bad one. Considerations include:

  • If the option to instrument source code is available, and you're prohibited from implementing configuration changes to effect instrumentation more orthogonally, the use of a configurable and flexible tracing API is critical.
  • An abstracted tracing API is analogous to a logging API such as log4j, with these attributes in common:
    • Runtime verbosity control: The verbosity level of log4j loggers and appenders can be configured at start time and then modified at run time. Similarly, a tracing API should be able to control which metric names are enabled for tracing based on a hierarchical naming pattern.
    • Output endpoint configuration: log4j issues logging statements through loggers, which in turn are dispatched to appenders that can be configured to send the log stream to a variety of outputs such as files, sockets, and e-mail. The tracing API does not require this level of output diversity, but the ability to abstract a proprietary or APM system-specific library protects the source code from change through external configuration.
  • In some circumstances, it might not be possible to trace a specific item through any other means. This is typical in cases I refer to as contextual tracing . I use this term to describe performance data that is not of primary importance but adds context to the primary data.

Contextual tracing

Contextual tracing is highly subjective and specific to the application, but consider the simplified example of a payroll-processing class with a processPayroll(long clientId) method. When invoked, the method calculates and stores the paycheck for each of the client's employees. You can probably instrument the method through various means, but an underlying pattern in the execution clearly indicates that the invocation time increases disproportionately with the number of employees. Consequently, examining a trend of elapsed times for processPayroll has no context unless you know how many employees are in each run. More simply put, for a given period of time the average elapsed time of processPayroll was x milliseconds. You can't be sure if that value indicates acceptable or poor performance because if the window comprised only one employee, you would perceive it as poor, but if it comprised 150 employees, you'd think it was flying. Listing 5 displays this simplified concept in code:

Listing 5. A case for contextual tracing
public void processPayroll(long clientId) {
   Collection<Employee> employees = null;
   // Acquire the collection of employees
   //...
   //...
   // Process each employee
   for(Employee emp: employees) {
      processEmployee(emp.getEmployeeId(), clientId);
   }
}

The primary challenge here is that by most instrumentation techniques, anything inside the processPayroll() method is untouchable. So although you might be able to instrument processPayroll and even processEmployee , you have no way of tracing the number of employees to provide context to the method's performance data. Listing 6 displays a poorly hardcoded (and somewhat inefficient) example of how to capture the contextual data in question:

Listing 6. Contextual tracing example
public void processPayrollContextual(long clientId) {      
   Collection<Employee> employees = null;
   // Acquire the collection of employees
   employees = popEmployees();
   // Process each employee
   int empCount = 0;
   String rangeName = null;
   long start = System.currentTimeMillis();
   for(Employee emp: employees) {
      processEmployee(emp.getEmployeeId(), clientId);
      empCount++;
   }
   rangeName = tracer.lookupRange("Payroll Processing", empCount);
   long elapsed = System.currentTimeMillis()-start;
   tracer.trace(elapsed, "Payroll Processing", rangeName, "Elapsed Time (ms)");
   tracer.traceIncident("Payroll Processing", rangeName, "Payrolls Processed");
   log("Processed Client with " + empCount + " employees.");
}

The key part of Listing 6 is the call to tracer.lookupRange . Ranges are named collections that are keyed by a numerical range limit and have a String value representing the name of the numerical range. Instead of tracing a payroll process's simple flat elapsed times, Listing 6 demarcates employee counts into ranges, effectively separating out elapsed times and grouping them by roughly similar employee counts. Figure 10 displays the metric tree generated by the APM system:
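
One way lookupRange can be backed is sketched below: each named range is a TreeMap keyed by the band's upper limit, and a lookup takes the smallest key that is greater than or equal to the measured value. This is an assumption for illustration; the implementation in the download may differ.

import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

public class RangeLookup {
   // e.g. "Payroll Processing" -> {10="1-10 Emps", 50="11-50 Emps", ..., Long.MAX_VALUE="181+ Emps"}
   private final Map<String, TreeMap<Long, String>> ranges =
         new ConcurrentHashMap<String, TreeMap<Long, String>>();

   public void addRange(String rangeName, TreeMap<Long, String> bands) {
      ranges.put(rangeName, bands);
   }

   public String lookupRange(String rangeName, long value) {
      TreeMap<Long, String> bands = ranges.get(rangeName);
      if (bands == null) {
         return null;
      }
      SortedMap<Long, String> tail = bands.tailMap(value);           // bands whose upper limit >= value
      // fall back to the highest (catch-all) band when the value exceeds every configured limit
      return tail.isEmpty() ? bands.get(bands.lastKey()) : tail.get(tail.firstKey());
   }
}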

Figure 10: Payroll-processing times grouped by range

Figure 11 illustrates the elapsed times of the payroll processing demarcated by employee counts, revealing the relative relationship between the number of employees and the elapsed time:

Figure 11. Payroll-processing elapsed times by range

The tracer configuration properties allow the option of including a URL to a properties file where ranges and thresholds can be defined. (I'll cover thresholds shortly.) The properties are read in at tracer construction time and provide the backing data for the tracer.lookupRange implementation. Listing 7 shows an example configuration of the Payroll Processing range. I have elected to use the XML representation of java.util.Properties because it is more forgiving of oddball characters.

Listing 7. Sample range configuration
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
   <comment>Payroll Process Range</comment>
   <entry key="Payroll Processing">181+ Emps,10:1-10 Emps,50:11-50 Emps,80:51-80 Emps,120:81-120 Emps,180:121-180 Emps</entry>
</properties>

The injection of externally defined ranges protects your application from the need for constant updates at the source-code level because of adjusted expectations or business-driven changes to service level agreements (SLAs). As changes to ranges and thresholds take effect, you only need to update the external file, not the application itself.

Tracking thresholds and SLAs

The flexibility of externally configurable contextual tracing enables a more accurate and granular way to define and measure performance thresholds . While a range defines a series of numerical windows within which a measurement can be categorized, a threshold is a further categorization on a range that grades the acquired measurement in accordance with a measurement's determined range. A common requirement for the analysis of collected performance data is the determination and reporting of "successful" executions vs. executions that are considered "failed" because they did not occur within a specified time. The aggregation of this data can be required as a general report card on a system's operational health and capacity or as some form of SLA compliance assessment.

Using the payroll-processing system example, consider an internal service-level goal that defines execution times of payrolls (within the defined employee count ranges) into bands of Ok , Warn , and Critical . The process of generating threshold counts is conceptually simple. You just need to provide the tracers the values you consider to be the upper elapsed time of each group for each band and direct the tracer to issue a tracer.traceIncident for the categorized elapsed time, and then — to simplify reporting — a total. Table 2 outlines some contrived SLA elapsed times:

Table 2. Payroll-processing thresholds
Employee count Ok (ms) Warn (ms) Critical (ms)
1-10 280 400 >400
11-50 850 1200 >1200
51-80 900 1100 >1100
81-120 1100 1500 >1500
121-180 1400 2000 > 2000
181+ 2000 3000 >3000

The ITracer API implements threshold-reporting using values defined in the same XML (properties) file as the ranges we explored. Range and threshold definitions differ slightly in two ways. First, the key value for a threshold definition is a regular expression. When ITracer traces a numeric value, it checks to see if a threshold regular expression matches the compound name of the metric being traced. If it matches, the threshold can then grade the measurement as Ok , Warn , or Critical , and an additional tracer.traceIncident is piggybacked on the trace. Second, because thresholds define only two values (a Critical value is defined as being greater than a warn value), the configuration consists of simply two numbers. Listing 8 shows the threshold configuration for the payroll-process SLA I outlined previously:

Listing 8. The threshold configuration for the payroll process
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
   <!-- Each key is a regular expression matched against the compound metric name;
        the patterns shown here are illustrative -->
   <entry key="Payroll Processing.*81-120 Emps.*Elapsed Time.*">1100,1500</entry>
   <entry key="Payroll Processing.*1-10 Emps.*Elapsed Time.*">280,400</entry>
   <entry key="Payroll Processing.*11-50 Emps.*Elapsed Time.*">850,1200</entry>
   <entry key="Payroll Processing.*51-80 Emps.*Elapsed Time.*">900,1100</entry>
   <entry key="Payroll Processing.*121-180 Emps.*Elapsed Time.*">1400,2000</entry>
   <entry key="Payroll Processing.*181\+ Emps.*Elapsed Time.*">2000,3000</entry>
</properties>

Figure 12 shows the metric tree for payroll processing with the added threshold metrics:

Figure 12. Payroll processing metric tree with thresholds

Figure 13 illustrates what the data collected can represent in the form of a pie chart:

Figure 13. SLA summary for payroll processing (1 to 10 employees)

It is important to ensure that lookups for contextual and threshold categorization are as efficient and as fast as possible because they are being executed in the same thread that is doing the actual work. In the ITracer implementation, all metric names are stored into (thread-safe) maps designated for metrics with and without designated thresholds the first time they are seen by the tracer. After the first trace event for a given metric, the elapsed time for the determination of the threshold (or lack of one) is a Map lookup time, which is typically fast enough. In cases where the number of threshold entries or the number of distinct metric names is extremely high, a reasonable solution would be to defer the threshold determination and have it handled in the asynchronous tracing thread-pool worker.
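
A minimal sketch of that lookup-and-grade flow is shown below, with illustrative class and method names; threshold definitions are keyed by a regular expression against the compound metric name, and the match result (or the lack of one) is cached after the first sighting so that later traces pay only a map lookup.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

public class ThresholdGrader {
   public enum Grade { OK, WARN, CRITICAL }

   public static class Threshold {
      final Pattern pattern; final long okUpper; final long warnUpper;
      public Threshold(String regex, long okUpper, long warnUpper) {
         this.pattern = Pattern.compile(regex); this.okUpper = okUpper; this.warnUpper = warnUpper;
      }
   }

   private final Threshold[] thresholds;                          // loaded from the external properties file
   private final Map<String, Threshold> matched = new ConcurrentHashMap<String, Threshold>();
   private final Map<String, Boolean> unmatched = new ConcurrentHashMap<String, Boolean>();

   public ThresholdGrader(Threshold... thresholds) { this.thresholds = thresholds; }

   /** Returns the grade for a traced value, or null if no threshold applies to this metric. */
   public Grade grade(String compoundName, long value) {
      Threshold t = matched.get(compoundName);
      if (t == null) {
         if (unmatched.containsKey(compoundName)) {
            return null;                                          // known to have no threshold
         }
         for (Threshold candidate : thresholds) {                 // first sighting: scan the regex list once
            if (candidate.pattern.matcher(compoundName).matches()) { t = candidate; break; }
         }
         if (t == null) { unmatched.put(compoundName, Boolean.TRUE); return null; }
         matched.put(compoundName, t);
      }
      if (value <= t.okUpper) return Grade.OK;
      if (value <= t.warnUpper) return Grade.WARN;
      return Grade.CRITICAL;
   }
}

The returned grade can then drive the piggybacked tracer.traceIncident call for the Ok, Warn, or Critical count, plus a total.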

Conclusion to Part 1

This first article in the series has presented some monitoring antipatterns as well as some desirable attributes of an APM system. I've summarized some general performance data collection patterns and introduced the ITracer interface, which I'll continue to use for the rest of the series. I've demonstrated techniques for monitoring the health of a JVM and general performance data acquisition through JMX. Lastly, I summarized ways you can implement efficient and code-change-resistant source-level instrumentation that monitors raw performance statistics and contextual derived statistics, and how these statistics can be used to report on application SLAs. Part 2 will explore techniques for instrumenting Java systems without modifying the application source code, by using interception, class wrapping, and dynamic bytecode instrumentation.

Go to Part 2 now.


Translated from: https://www.ibm.com/developerworks/java/library/j-rtm1/index.html
