The Rise of AIOps: How Data, Machine Learning, and AI Will Transform Performance Monitoring

Over the last decade, application environments have exploded in complexity.


Gone are the days of managing monoliths. Today’s IT professionals are tasked with ensuring the performance and reliability of distributed systems across virtualized and multi-cloud environments. And while it may be true that the emergence of this modern application environment has provided the speed and flexibility professionals demand, these numerous services have unleashed a deluge of data on the enterprise IT environment.


Application performance monitoring (APM) solutions have proven essential in helping leaders take back control by providing the real-time insights needed to take action. But as the volume of data in IT ecosystems increases, many professionals are finding it challenging to take a proactive approach to managing it all. While automating tasks has helped teams free up some bandwidth for operations and planning, automation alone is no match for today’s increasingly complex environments. What’s needed is a strategy focused on reducing the burden of mounting IT operations responsibilities and surfacing the insights that matter most, so that businesses can take the right action.


So, what are forward-thinking IT professionals doing to stay ahead of the curve?


Many are applying what’s being called an AIOps approach to the challenge of application environment complexity. This approach leverages advances in machine learning and artificial intelligence (AI) to proactively solve problems that arise in the application environment. Even though relatively new, the approach is gaining momentum. And for good reason: Using AI to identify potential challenges within the application environment doesn’t just help IT professionals get ahead of problems — it helps companies avoid revenue-impacting outages that jeopardize the customer experience, the business, and the brand.


In order to fully understand the rise of AIOps and why it has developed the momentum it has, we wanted to dig deeper to uncover the actual challenges faced by IT professionals, and how they’re managing them in an increasingly complex application environment. To accomplish that, AppDynamics undertook a study of 6,000 global IT leaders in Australia, Canada, France, Germany, the United Kingdom, and the United States. Their responses answered three key questions about the shift in the performance space:


(1) What’s the current enterprise approach to managing increasing application environment complexity?


(2) How are global IT leaders taking a proactive approach to identifying problems in the application environment?


(3) How broadly is AI identified as a potential solution to reducing complexity in IT ecosystems?


Let’s see what the research revealed.


The Demand for Proactive Application Performance Monitoring Tools


Today, midsize to large companies use an average of eight different cloud providers for various enterprise applications and services. As a result, IT professionals are managing an ever-increasing set of tasks that have the potential to become disconnected if not managed properly. What’s more, within these highly distributed systems, IT leaders must grapple with the impact of new code being deployed, as well as the virtually infinite potential outcomes associated with doing so. Without a unified view of how all of these elements interact, there’s significant potential for issues to arise that impact performance — and, ultimately — the customer experience.


New research from AppDynamics underscores the cause for concern: 48% of enterprises surveyed say they’re releasing new features or code at least monthly, but their current approach to monitoring only provides a siloed view on the quality and impact of each release. In fact, of those enterprises that release on that cadence, a massive 91% say that monitoring tools only provide data on how each release drives the performance of their own area of responsibility.


Research from AppDynamics indicates performance monitoring remains siloed.


Should these findings raise eyebrows? Absolutely.


That’s because they indicate that for the vast majority of those surveyed, a holistic view of business and customer value is still difficult to achieve. And that puts innovation — as well as modern, best-in-class software development practices like continuous delivery — at serious risk.


But that’s where leveraging data about the application environment using machine learning, as well as AI, can make a massive difference. Instead of merely ingesting data from every dimension of the application environment, these tools can help IT professionals build a more proactive approach to APM.


And, by all accounts, that’s what most global IT leaders want.


According to research findings from AppDynamics, 74% of those surveyed said they want to use monitoring and analytics tools proactively to detect emerging business-impacting issues, optimize user experience, and drive business outcomes like revenue and conversion. Yet 42% of respondents are still using monitoring and analytics tools reactively to find and resolve technical issues. There are indications, however, that this approach is extremely problematic for businesses. Beyond serving as a pain point for IT professionals in terms of capacity and resource planning, reactive monitoring can, in some cases, cost businesses hundreds of thousands of dollars in lost revenue.


How Reactive Monitoring Hurts Performance, Revenue, and Brand


From e-commerce to banking, booking flights to watching movies on Netflix, applications have permeated people’s lives. As a result, consumers have high expectations for application performance that businesses must deliver on. If they don’t, businesses risk jeopardizing brand loyalty and, as our research revealed, their bottom line.

“As the broader technology landscape undergoes its own dramatic change, forcing businesses to double down on their customer focus, managing the performance of applications has never been more critical to the bottom line.” (Jason Bloomberg, The Rebirth of Application Performance Management)


IT professionals have long relied on the mean time to repair (MTTR) metric to evaluate the overall health of an application environment. The longer it takes to resolve an issue, the greater the potential for it to turn into a significant business problem, particularly in an increasingly fast-paced digital world. However, in this latest AppDynamics research, we made a startling discovery: most organizations are grappling with a high average MTTR. Respondents reported that it took an average of one business day, or seven hours, to resolve a system-wide issue.

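To make the metric concrete, here is a minimal sketch of how MTTR can be computed from an incident log. It assumes each incident is recorded as an (opened, resolved) pair of timestamps; the function name `mean_time_to_repair` and the sample incidents are hypothetical illustrations, not data from the survey.

```python
from datetime import datetime, timedelta

def mean_time_to_repair(incidents):
    """Average repair duration over (opened, resolved) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    # Sum the individual repair durations, starting from a zero timedelta.
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)

# Hypothetical incident log (illustrative values only).
incidents = [
    (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 15, 0)),    # 6 hours
    (datetime(2024, 2, 2, 10, 0), datetime(2024, 2, 2, 18, 0)),   # 8 hours
    (datetime(2024, 3, 5, 8, 30), datetime(2024, 3, 5, 15, 30)),  # 7 hours
]

print(mean_time_to_repair(incidents))  # 7:00:00
```

In this toy log the three repairs average out to seven hours, the same order of magnitude as the figure respondents reported.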

But that wasn’t the most alarming finding.


Our research also revealed that many enterprise IT teams weren’t notified about performance issues via monitoring tools at all. In fact:


58% find out from users calling or emailing their organization’s help desk


55% find out from an executive or non-IT team member at their company who informs IT


38% find out from users posting on social networks


To fully appreciate the impact of a seven-hour MTTR on a business, AppDynamics asked survey respondents to report the total number of dollars lost during an hour-long outage, and used that figure to extrapolate the typical cost of an average, day-long outage. For the United States and United Kingdom, the cost of an average outage totals $402,542 and $212,254, respectively (the United Kingdom figure was converted into United States dollars).

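The extrapolation described above can be sketched as simple arithmetic. This assumes the day-long figure is the reported hourly loss scaled linearly by the seven-hour MTTR, which the article implies but does not spell out; the hourly values below are therefore derived from the published totals rather than quoted from the survey, and the function names are hypothetical.

```python
MTTR_HOURS = 7  # one business day, as reported by respondents

def outage_cost(hourly_loss_usd, hours=MTTR_HOURS):
    """Cost of one outage, assuming losses scale linearly with duration."""
    return hourly_loss_usd * hours

def implied_hourly_loss(total_usd, hours=MTTR_HOURS):
    """Back out the hourly loss implied by a published per-outage total."""
    return total_usd / hours

# Per-outage totals published in the article (USD).
reported_totals = {"United States": 402_542, "United Kingdom": 212_254}

for country, total in reported_totals.items():
    hourly = implied_hourly_loss(total)
    print(f"{country}: ${hourly:,.0f} per hour -> ${outage_cost(hourly):,.0f} per outage")
```

Under that linear assumption, the published totals correspond to roughly $57,506 and $30,322 of lost revenue per hour for the United States and United Kingdom, respectively.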

It’s important to note that these figures reflect the total cost for a single outage in the enterprise — if a company has more than one, that figure can rise dramatically. In fact, a substantial 97% of global IT leaders surveyed said they’d had performance issues related to business-critical applications in the last six months alone.


In addition to the impact on a company’s bottom line, global IT leaders reported that reactive performance monitoring had created stressful war room situations and damaged their brand. 36% said they had to pull developers and other teams off other work to analyze and fix problems as they presented themselves, and nearly a quarter of respondents said slow root cause analyses drained resources.

The takeaway here is clear: global IT leaders need to build a more proactive approach to APM in order to lower MTTR and protect their bottom line. But in today’s increasingly complex application environment, that’s easier said than done.


Unless, of course, you’re developing an AIOps strategy to manage it.


The Risk of Not Adopting an AIOps Strategy


AppDynamics research showed that the overwhelming majority of IT professionals want a more proactive approach to APM, but one of the main ways of achieving that, the adoption of an AIOps strategy, isn’t being widely pursued by global IT teams in the near term.


In fact, the global IT leaders AppDynamics surveyed reported that although they believe AIOps will be critical to their monitoring strategy, only 15% identified it as a top priority for their business in the next two years.


What’s more, the capabilities that respondents identified as essential to APM in the next 5 years are precisely those that AIOps has the potential to help provide. For example:


Intelligent alerting that can be trusted to indicate an emerging issue.

49% of respondents identified this feature as core to their performance monitoring capabilities in the next five years. By ingesting data from any application environment, AIOps platforms and technology can play a pivotal role in not just automating existing IT tasks, but identifying and managing new ones based on potential problems detected in the application environment.



Automated root cause analysis and business impact assessment. 

44% of respondents said solving problems quickly and understanding their impact on the business would play a crucial part of their performance management in the years ahead. With the help of AIOps technology, this can be achieved, providing increased agility in the face of potential service disruptions or threats, and without additional drain on resources.



Automated remediation for common issues.

42% of survey respondents said that they needed to build automated remediation into their strategy for performance monitoring. With AIOps, it’s easy to automate remediation not only for known issues but for unknown issues, too. That’s because it not only ingests data from your application environment but also derives more intelligent insights from it.


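As a rough illustration of the first capability above, alerting that flags an emerging issue rather than a fixed threshold breach, the sketch below compares each new latency sample with a rolling baseline and flags large deviations. This is a simple statistical stand-in chosen for brevity, not the method used by AppDynamics or any particular AIOps platform; the function name, window size, threshold, and sample latencies are all assumptions.

```python
from collections import deque
from statistics import mean, stdev

def detect_anomalies(samples, window=5, threshold=3.0):
    """Yield (index, value) for samples more than `threshold` standard
    deviations above the mean of the preceding `window` samples."""
    recent = deque(maxlen=window)
    for i, value in enumerate(samples):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            # Skip flat baselines (sigma == 0) to avoid dividing by zero.
            if sigma > 0 and (value - mu) / sigma > threshold:
                yield i, value
        recent.append(value)

# Hypothetical response times in milliseconds (illustrative only).
latencies_ms = [120, 118, 125, 122, 119, 121, 480, 123, 120]

print(list(detect_anomalies(latencies_ms)))  # [(6, 480)]
```

Note that the spike itself enters the baseline after it is flagged, which temporarily inflates the rolling mean and deviation; a production system would need to handle that, along with seasonality and correlated signals, which is where learned models come in.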

Leading the Way With AIOps Strategy and Platforms


Despite increasingly complex application environments, few of the global IT leaders surveyed are prioritizing the development of an AIOps strategy, which would allow them to implement the platforms and practices to permit proactive identification of issues before they become system-wide problems. Instead, global IT leaders report an average MTTR rate that hovers at a full business day, and has the potential to cost companies hundreds of thousands of dollars in lost revenue with each incident.


What’s more, AppDynamics research findings also make it clear that many global IT leaders are struggling to integrate monitoring activities into the purview of the broader business. As noted, this can significantly lengthen MTTR and leave companies vulnerable to service disruptions that can cause irreparable harm to the customer experience and to the enterprise as a whole.


While IT leaders have expressed a desire for a more proactive approach to monitoring, this research indicates that there’s still plenty of work to be done on numerous fronts. But the first step is clear: IT leaders must prioritize the development of an AIOps strategy and related technology. In doing so, they’ll simplify the demands of an increasingly complex application environment, and build a stronger connection from IT to the business as a whole.


—Editor’s Note: In this piece, the term “global IT leaders” refers to the respondents surveyed for this report. The term “IT professionals” refers to people in the IT or related professions as a whole.

