Stripe: Architecting for observability at massive scale

Welcome everyone. Thanks for joining today's session, "Architecting for Observability at Massive Scale," presented together with Stripe. I'm sure most of you are working with systems and applications that are critical to your business, so you need these systems to be observable. As the business grows, the scale of observability grows with it, presenting some new challenges.

In today's session, you will learn how Stripe navigated similar challenges and the solutions they implemented to address them.

My name is Hassan Tariq. I'm a Principal Solutions Architect with AWS. I work closely with Stripe on multiple projects, including their observability architecture. Joining me today is Cody from Stripe; I'll let him introduce himself when he takes the stage.

Here is our overview of today's agenda:

I'll start by giving some context on the importance of observability, followed by the observability landscape at AWS. I'll cover a few services in more detail that are most relevant to today's session, followed by Cody from Stripe, who will walk you through the architecture that Stripe has implemented, the lessons learned, and how Stripe builds a culture of observability.

Then we'll wrap the session up by giving you some takeaways. So let's dive into it.

I'll start by addressing the question "Why do we need observability?" Amazon CTO Werner Vogels said, "Everything fails all the time." What this means is that despite your best efforts, the systems and applications you're working on may eventually experience failures, whether due to a configuration change, a faulty deployment, or something in the underlying infrastructure.

Now, you may not be able to control all of these aspects. But what you can do is give your users visibility into their environments so they can respond quickly when issues arise. A comprehensive observability strategy not only alerts you when there is a problem, but also helps you understand and deal with it.

So how do we create such a comprehensive strategy? You need to look beyond just operations and cover all aspects of the business. But it starts with collecting system level telemetry data, such as CPU utilization and memory metrics from your VMs and control plane. You can then use these signals to manage and track business level metrics.

For example, two of the important metrics that Stripe tracks are charge latency and charge success rate. Using these metrics, you can drive business insights that are very valuable and can help you make informed decisions and also improve the business performance.

Now that we have established the importance of observability, let's take a look at some of the challenges as we dive deep into the world of observability at scale.

One of the challenges that customers face is infrastructure complexity. Our customers tell us that with the adoption of microservices and the expansion of compute platforms, their infrastructure has become more complex over time. One of the primary factors contributing to this increased complexity is the need to deploy distributed, dynamic systems that scale automatically based on customer demand.

Some of the underlying or related factors in this complexity are edge computing, hybrid systems, containerization and orchestration.

So to understand these distributed and dynamic systems, businesses need data, and lots of it. This data can come in the form of metrics, logs and traces. As you deploy more of these dynamic systems, more of them fail, scale in and scale out, and that dynamic behavior produces even larger volumes of data. As a direct consequence, the overall cost of managing this data grows higher and higher, and it can become significantly higher.

So at its core, observability at scale presents a fundamental challenge of balancing the benefit of insights with increased complexity and cost.

Now, to solve for these challenges, AWS offers a wide range of observability services that you can use as building blocks to architect a solution that works for you.

We understand that observability comes in many shapes and sizes. In general, there are two high level categories of these services - the CloudWatch services which are cloud native to the AWS ecosystem and the open source managed services.

I'll just briefly explain some of these services. Starting with CloudWatch, at the collectors level you have the CloudWatch agent, which is used by millions of customers to collect metrics and log data from EC2 instances, applications and managed services. This data is then sent to the CloudWatch service for processing.

You also have the X-Ray agent, which collects application trace data. As we go higher up in this stack, there are higher level insight services you can use: Container Insights to understand how your containerized workloads are doing, and similarly Lambda Insights and Application Insights, which can give you a much broader view of your application performance.

Switching over to the managed open source side, at the collectors level you have the AWS Distro for OpenTelemetry, which is used to collect metrics and trace data. This data is sent to managed services like Amazon OpenSearch Service or Amazon Managed Service for Prometheus for processing.

To query and visualize this data, you can use Amazon Managed Grafana. And in between, you can either develop your own applications to further process this data or use a third party ISV application.

So the idea here is to give you choices so you can select the services that best fit your need. For today's session, I'll just cover these two services a little more in detail because these are relevant and will be used in the architecture that Cody will be presenting soon.

Amazon Managed Service for Prometheus is compatible with open source Prometheus, which is a monitoring and alerting tool. Amazon Managed Grafana is compatible with open source Grafana, which is a popular visualization tool.

Now, with a show of hands, how many of you have used Prometheus or Grafana in any shape or form? Wow, looks like a lot of you are familiar with it.

So before I go further and explain these two services in more detail, I'd like to address one question that some of you might be thinking: why not just use the open source software?

Well, when it comes to open source, enterprises have some concerns. The number one concern is: is this software secure and compliant alongside the other applications and tools in their environment? Upgrading and patching is another concern. Whenever there is a new version, you have to test and validate it to make sure it works well. And what happens when a vulnerability is discovered?

So in short, when you're using open source, you are responsible for patching, securing and scaling the underlying infrastructure. The AWS managed services do all of that for you; they abstract away the operational overhead so you can focus on the functionality.

Let's go through these one by one.

Amazon Managed Service for Prometheus is serverless, meaning there are no servers for you to manage. You create a workspace and just start ingesting metrics. If need be, you can create multiple workspaces in different regions, and each workspace allows up to 500 million active time series, that is, series that have received samples within the last two-hour window.

This service is fully managed and scales automatically based on your query and ingest needs. You can use the Prometheus query language, PromQL, which I'm sure a lot of you are familiar with, to query the data.

What this means is if you are using open source Prometheus and you migrate over to Amazon Managed Service for Prometheus, all your queries stay the same. There is no change required.

And finally, the pricing is very flexible - it's a pay as you go model, you only pay for the data that you ingest and query.
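
To make that concrete, here is a minimal sketch, assuming boto3 with credentials that can administer Amazon Managed Service for Prometheus; the workspace alias and the metric and label names in the PromQL string are made up for illustration.

```python
# A minimal sketch: create a workspace and reuse an existing PromQL query.
import boto3

amp = boto3.client("amp", region_name="us-east-1")

# Create a workspace and look up its Prometheus-compatible endpoint
# (the endpoint becomes available once the workspace is ACTIVE).
workspace_id = amp.create_workspace(alias="payments-observability")["workspaceId"]
endpoint = amp.describe_workspace(workspaceId=workspace_id)["workspace"]["prometheusEndpoint"]
print("Endpoint:", endpoint)  # append api/v1/remote_write to ingest, api/v1/query to query

# Because the service is PromQL-compatible, a query written for open source
# Prometheus works unchanged, for example a hypothetical charge success rate:
charge_success_rate = (
    'sum(rate(charge_attempts_total{outcome="succeeded"}[5m]))'
    ' / sum(rate(charge_attempts_total[5m]))'
)
```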

Amazon Managed Grafana is also a managed, highly scalable service. You start by creating a workspace and then you connect that workspace to many different data sources. There are a lot of AWS native data sources that are supported. For example, you can connect it to Amazon Managed Service for Prometheus and use PromQL to query this data. You can also connect it to CloudWatch and use CloudWatch Insights or Metric Insights to query this data.

In addition to that, there are dozens of non-AWS data sources that are also supported. So with Amazon Managed Grafana, you can essentially create a dashboard that pulls data from various sources, giving you a single pane of glass to observe your systems.

Now, let's take a look how these services can be used together to create a scalable solution.

You always start with collecting the data from your applications and services. So that's why we have the collectors layer. And you have multiple choices here.

You can use the AWS Distro for OpenTelemetry that I explained earlier to collect the data. You can also use a Prometheus server to scrape the metrics and then use the remote write API to write them to an Amazon Managed Service for Prometheus workspace. If need be, you can even write your own Prometheus-compatible agent to collect this data.

When you create a workspace in Amazon Managed Service for Prometheus, you get an ingest endpoint that is used to ingest this data coming from these collectors. This data is then saved in an internal storage which is highly scalable and has 11 9s of durability.

I'm sure most of you can guess which AWS service is used internally that is very scalable and has 11 nines of durability. If you guessed S3, you are correct.

The query component allows you to query this data. You can summarize and aggregate it over different durations, and define rules or conditions that get evaluated regularly.

There are two types of rules that are supported:

  • Recording rules pre-compute frequently accessed and computationally expensive expressions, and the results of those computations are written back as new time series. If you use those new time series in your queries, the queries return results much faster than they would using the original expression.

  • Alerting rules are conditions with thresholds. When a rule is triggered, a notification is sent to the Alertmanager, which can route it to, for example, Amazon SNS to fan it out, generate further notifications, send it downstream to other tools, or send emails; a minimal sketch of defining both kinds of rules follows below.
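
As referenced above, here is a minimal sketch of defining both kinds of rules against a workspace, assuming boto3 and the `amp` API; the metric names, recording rule name, threshold and workspace ID are all illustrative, not Stripe's actual rules.

```python
# A minimal sketch: upload a rule group containing one recording rule and
# one alerting rule to an Amazon Managed Service for Prometheus workspace.
import boto3

amp = boto3.client("amp", region_name="us-east-1")

rules_yaml = b"""
groups:
  - name: charge-rules
    rules:
      # Recording rule: pre-compute an expensive expression as a new series.
      - record: service:charge_success_rate:ratio_rate5m
        expr: >
          sum(rate(charge_attempts_total{outcome="succeeded"}[5m]))
          / sum(rate(charge_attempts_total[5m]))
      # Alerting rule: a condition with a threshold, routed via Alertmanager.
      - alert: ChargeSuccessRateLow
        expr: service:charge_success_rate:ratio_rate5m < 0.95
        for: 5m
        labels:
          severity: page
"""

amp.create_rule_groups_namespace(
    workspaceId="ws-EXAMPLE",  # hypothetical workspace ID
    name="charge-rules",
    data=rules_yaml,
)
```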

Now, to query and visualize this data, you can use Amazon Managed Grafana to create the dashboards and visualizations you need. All these services can be connected together with or without VPC endpoints. I hope this gives you a good overview of how these services can be used together to create a scalable observability solution that works for you.

Now, I'd like to invite Cody to walk you through Stripe's architecture and cover the rest of the session.

If you continue to scale your business indefinitely, you will quickly reach a point where the standard observability stack no longer meets the needs of your business. That might be because it can't scale to the cardinality you need to feed into it, or because of reliability issues related to the total volume of data you're throwing at it, but most likely it has to do with the cost of the overall observability solution.

My name is Cody Re, and I spent the last year at Stripe helping them navigate this inflection point. Before that, I spent about eight years at Netflix working on similar problems. Most of that time I spent working on a distributed system named Mantis that is designed specifically to implement the architectures we'll discuss today.

So what I'd like to do with you over the next 40 or so minutes is go through a set of lessons I've learned in the last 10 years of my career. We'll cover a few facts that will help inform our decision making. Then we'll cover five architectural changes that you can bolt onto the standard observability stack you're probably using today to solve some of the problems you're having. And finally, we'll discuss the cultural elements that go into creating a company with good observability practices, because as your business continues to scale, your problems will still be technological, but you will also have people problems and cultural problems on top of them.

But before we dive into that, I'd like to discuss the scale of what we're dealing with at Stripe. We have approximately 3,000 engineers across 360 teams. Those engineers produce about half a billion metrics every 10 seconds, and on those metrics they have about 40,000 alerts and 150,000 dashboard queries.

Now, a few things might jump out at you about this. Half a billion metrics sounds like a lot, but the reality is that there are distributed systems solutions these days that can handle that volume without much trouble. 40,000 alerts and 150,000 dashboard queries also sound like a lot, but you'll see a little further into this talk that those two sets are actually much smaller than they appear on the surface.

I would argue that the most difficult part about managing and effecting change in this environment is actually the 3,000 engineers. Changing the mental model of 3,000 engineers is much harder than deploying distributed systems these days.

So much of this talk is going to focus on how you can convince people to change the way they're doing observability in order to leverage some of the benefits we'll talk about here.

So we have 3,000 engineers at Stripe who are generally working on things that are not observability. What are the two dozen or so Stripes who are working on observability doing? I like to say that the mandate of the observability team at Stripe is to support Stripe's availability, reliability and performance, cost effectively and at scale. That's quite the mouthful, but I'd really like to direct your attention to the highlighted words: cost effectively. I would argue that cost effectiveness is one of the most important factors in observability, and Hassan alluded to this in his section of the talk as well. Cost is very important. If something is too expensive, you simply won't do it, and if it's cheap, you can do a lot of it.

To think about this a little, let's take it to two extremes. On the first day you spin up a new service, you probably don't have any observability for it. Your cost is zero, and that's great, but your effectiveness is also zero: you have no observability. On the other end of the spectrum, you can imagine a world where you record absolutely everything, and you could replay all that state and get your system into any state possible. But that is far too expensive, not only in terms of money, but in terms of the developer time spent figuring out what's going on, and the latency and complexity added to your service as it ensures the observability system stays consistent with what's happening in production. That's also not something we would do.

Somewhere between those two extremes exists a point in the middle where you have reached your cost effectiveness balance, if you will. I like to think of this as the efficient frontier of observability, the efficient frontier of insight to cost. It is the job of the observability team to bend that curve: to push your cost closer to the limit of observing nothing, while pushing your insight closer to being able to view everything.

And in order to get there, we're going to take some axioms, some lessons that I've learned over the last decade, and apply them to our architecture.

So it wouldn't be a Monday morning if somebody wasn't throwing an axiom sheet at you. I'd like to discuss three facts I've learned about observability, and then we'll take those facts and apply them to the observability stack we put up on screen.

The first one came as a surprise to me. Some of the work we did in 2023 was writing a parser to analyze all the queries that our users were using in their alerts, and one of the things we discovered was that there are actually only a few dozen unique queries used in most alerts. Of the 40,000 alerts I put up on screen, it turns out there are only a few dozen unique ones. We learned this because my colleague Michael, who is somewhere in the crowd here, took those parse trees, ignored some nodes that don't matter, and performed some clustering on them. What he discovered was that just three of the modular queries we provide to our users represent a quarter of our alerts, and just eight of them, the set that the observability team provides for users, represent 60% of our alerts. Extend that out to a few dozen and you have almost everything, except for the long tail where each unique alert has its own unique parse tree.
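
To illustrate the idea (this is not Stripe's actual parser or clustering code, just a sketch of the approach under simplified assumptions): normalize each alert query by discarding the parts that don't change its shape, then count how many alerts share each canonical form.

```python
# An illustrative sketch of clustering alert queries by their "shape".
import re
from collections import Counter

def canonical_form(query: str) -> str:
    q = re.sub(r'"[^"]*"', '"<VAL>"', query)    # ignore specific label values
    q = re.sub(r'\b\d+(\.\d+)?\b', '<NUM>', q)  # ignore literal thresholds/windows
    return re.sub(r'\s+', ' ', q).strip()

alert_queries = [  # hypothetical alert expressions
    'sum(rate(http_errors_total{service="api"}[5m])) > 10',
    'sum(rate(http_errors_total{service="billing"}[1m])) > 250',
    'histogram_quantile(0.99, rate(latency_bucket{service="api"}[5m])) > 0.5',
]

clusters = Counter(canonical_form(q) for q in alert_queries)
for shape, count in clusters.most_common():
    print(count, shape)
# The first two alerts collapse into one shape. In practice, comparing
# parse trees is more robust than regexes, but the clustering idea is the same.
```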

There are some pretty profound implications for user experience here, primarily that our users don't necessarily care about the query language or how the alert is expressed. What they really want is just to declare what they want out of the alert. That gives us a major advantage, because we can move the storage layer or the query engine, or we can even move that query out of the time series database and into a stream processor. We'll talk about all of this a little later, but let's put this lesson in our back pocket. We have a ton of alerts, and we have 150,000 dashboard queries where a very similar dynamic exists: most of them actually belong to a very small set.

Now, our second axiom, and this one might be my favorite, mostly because I've known it for a while. There's actually an error on this slide. Can anyone point it out? Feel free to shout it out. Oh, really quiet bunch today. OK. I wrote metrics, but that's not quite right: it's metrics, logs and traces. This applies to almost all of your observability data. You will only use between 2 and 20% of the data you're shipping to your observability store, whether that's logs, metrics or traces. And when I say this, I mean that 80 to 98% of the data you write in is not referenced by an alert, will never be looked at in a dashboard, and will never be looked up ad hoc during or outside of an incident. It will be written and never read.

Now, if you're an engineer in this crowd, hopefully you're thinking: hey, maybe there's some way I can bifurcate this set, save my company a bunch of money, and hopefully get a promotion. If you take nothing else away from this talk and you go back to work next week and do that, that's great. And if you're a leader in this crowd, you're thinking: hey, that big observability bill I have, this guy on stage is telling me we're wasting 80 to 98% of that money. Yes and no. You're not going to read it, but there is some value in the optionality of having it there, just in case you want to look at it in the future. So this isn't necessarily about getting rid of it as much as it is about reducing the cost of that optionality, creating cost effectiveness in our optionality.
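
A back-of-the-envelope sketch of that bifurcation idea, with entirely hypothetical metric names: compare the set of metrics written against the set referenced by alerts and dashboards, and treat the difference as cold tier candidates rather than deleting it.

```python
# Illustrative only: the inputs would come from your ingestion pipeline and
# from parsing your own alert and dashboard queries.
written_metrics = {"charge_latency", "charge_success_rate", "gc_pause_seconds",
                   "thread_pool_queue_depth", "cache_eviction_total"}

referenced_by_alerts = {"charge_latency", "charge_success_rate"}
referenced_by_dashboards = {"charge_latency"}

read_set = referenced_by_alerts | referenced_by_dashboards
unread = written_metrics - read_set
print(f"{len(unread) / len(written_metrics):.0%} of metric names are written but never read")
# These are candidates for a cheaper tier, preserving the optionality of
# looking at them later instead of throwing them away.
```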

So let's take that one, put it in our back pocket along with the small set of total alerts and look at the next one.

Axiom three: the tradeoffs are fundamentally different for observability data. What I mean is that we generally use the same distributed systems architectures for observability that we use when building any other kind of system, but the tradeoffs we want to make in those distributed systems are very different. If we're not cognizant of this while building our observability systems, we are doomed to build systems that are too expensive and too slow to achieve the objectives we want.

We'll dive into that a little when we get into the architectures, but I wanted to give an illustrative example. At Stripe we process payment transaction data, and if I had to choose between a payment being correctly written to a ledger or the observability telemetry about that payment being recorded correctly, I would always choose the payment ledger being written correctly. That tells me that no matter how important I think my observability data is, I have a preference for my business's control plane functioning correctly over the information about it being correct.

We can leverage this to build systems that cost significantly less, and it has the added benefit that the tradeoffs we make tend to make our systems faster as well, which is equally important, because the value of observability data decays exponentially over time. The data coming in right now is very important because it's being used in our alerts, and we need it to be accurate so that we get alerted when there's a problem and don't get alerted when there isn't one. An hour goes by and that data is significantly less valuable to us. A day, a week, a month goes by, and we almost don't care about it anymore.

So it's critical that our data not only is inexpensive to process, but that it comes in quickly because if it doesn't, it's not very useful to us. So with those three axioms in our back pocket, let's start to look at the architecture a little bit.

I have this very unlabeled diagram up here, and I've done that on purpose. When I look at that cloud, I see my cloud, Stripe's cloud, and that cloud is piping data over to an observability vendor. Again, I've left that unlabeled so you can picture your own vendor there. Our users engage with that vendor's user interface to get their dashboards, their alerts, and all the insights we get from our telemetry data.

Now, quick show of hands: whose observability stack looks just like this today? I've seen a lot of hands up. If it doesn't look like this today, has it looked like this in the past? OK, a decent number of hands. So I think we all understand each other; we're on the same page with this architecture. And generally speaking, this is a really good architecture. There's some stuff to love about it, number one being that it's simple. It's so simple: you install the client, you configure it or it configures itself, you configure your network, you feed data to your vendor, and everything works. Your users engage with that vendor and get the insights they need.

The nice part is that this simplicity comes with limited failure modes. When you deploy changes to your environment, whether those are code changes, configuration changes, or environmental changes like increased traffic, they are very unlikely to break the link between your service and your vendor. Because of that, your vendor is likely to be working when you're not working, and that's your critical moment, the moment of truth for your vendor. If they're up when you need them, then observability is functioning for you.

And of course, this gives us a feature-rich walled garden, and we like that. All the new features work together really well, everything is integrated, great. But the one thing we know about walled gardens is that once you need to go outside them, it can become really painful. If you're trying to grow your business, which I hope almost everybody in here is doing, one of the things you'll realize is that business growth eventually results in super linear metrics growth, simply because the complexity of your environment keeps expanding. This is something Hassan mentioned earlier. The inflection point for most companies is when they break up their monolith into microservices: you used to have one thing reporting metrics, and now you have N things reporting metrics, so you're reporting a super linear number of metrics compared to last week. Your vendor probably charges you linearly based on metrics volume, and of course a linear function times a super linear function is still super linear.

So now the cost of observability is growing faster than your business, and that is a major problem. If that growth continues, you'll run into scale limitations. Eventually you'll start recording metrics with higher cardinality than your time series database can handle, and that's going to be a problem because you'll have blind spots in your observability, probably in the places that matter most. These high cardinality situations get worse as you continue to scale, because while they may represent a small percentage of the overall business, they represent larger and larger numbers of your customers.

And finally, if this scale problem continues long enough, you'll actually start running into reliability issues, and very likely you'll be firefighting all three of these before you do something about it. If I had to sum this problem up in one sentence, I would say that this baseline architecture is very database centric. You have to put everything into the time series database before you can get insights out of it, and because of that you are subject not only to the technical limitations of that time series database, but also to the economic limitations of the pricing model. If it costs a certain amount to deal with the metrics you want to deal with, you are subject to that cost, and it's very difficult for you to bend that curve because you're inside this walled garden.

So what can we do to become less database centric? I'd like to talk about five architectural changes that you can bolt onto this existing architecture, so you don't need to do some sort of big swap. These five changes will help you deal with scale, reliability, and most importantly the cost effectiveness of the entire system.

The first one you should probably reach for is fairly simple: sharding. You can see I've got two copies of the same database here, with the rest of the diagram blurred into the background so we can focus on them. Essentially, sharding across some partition keys is one easy way to increase the scalability and reliability of your time series database. I'd recommend you reach for this first, even though that might seem unintuitive, because when you start out you're at a scale where one database works really well, and you can probably double or even 10x in size and that single database will still work well.

The problem is that when you shard, you create a user experience problem for your customers. If they're used to everything being in one database and now it's spread across multiple databases or data sources, it's very painful for your users to figure out where all their metrics went and to change their mindset when observing their systems. So shard early is what I'm getting at. When we ran the analysis on the alert queries for axiom one, we confirmed something I already knew: we have a QA, a preprod and a prod environment at Stripe, and our users query specifically for their QA, preprod or prod environment in their alerts.

So our users were already sharding implicitly in their mental models, even though we hadn't sharded the time series database underneath them, and that works pretty well; we get some extra mileage out of that sharding. But what would be really great is if we could shard along one of our company's fault lines. A lot of companies use, say, regional failover as their reliability story: if you deploy regionally and you break a region, you simply fail your traffic into another region while you debug and triage the broken one. If you shard your time series database across that same fault line, you'll already be up and running in the new region without taking any action whatsoever, and your users will already be thinking about the traffic in that region without you having to do anything.
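
A minimal sketch of what sharding along those fault lines could look like at the routing layer; the shard map and the (region, environment) partition key are assumptions for illustration, not Stripe's actual layout.

```python
# Route each sample to a time series database shard keyed on region and
# environment, so a regional failover never depends on the failed region's shard.
SHARDS = {
    ("us-east-1", "prod"):    "ws-prod-use1",
    ("us-west-2", "prod"):    "ws-prod-usw2",
    ("us-east-1", "preprod"): "ws-preprod-use1",
    ("us-east-1", "qa"):      "ws-qa-use1",
}

def shard_for(sample_labels: dict) -> str:
    """Pick the shard (here, a hypothetical workspace ID) for a sample."""
    key = (sample_labels.get("region", "us-east-1"),
           sample_labels.get("env", "prod"))
    return SHARDS[key]

print(shard_for({"region": "us-west-2", "env": "prod", "service": "api"}))
```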

So I would highly encourage you to shard your database earlier than you think you need to, and to consider your reliability fault lines when you do it.

The second solution, the cloud I've stuck in between our cloud and the time series database, is aggregation. I think this is the first thing teams tend to reach for because it's easy. Actually, maybe it's the second thing; the first thing they tend to reach for is austerity measures, as in: you're producing too many metrics, so you go back to your client teams and ask them, hey, do you need this tag? Can you stop producing this thing?

But one thing you learn very quickly is that 3,000 engineers can produce metrics a lot faster than 24 engineers can ask them to clean them up. Once you realize this is a losing battle, and every team that tries it does, you end up deciding to stick an aggregator between the users generating metrics and your time series database, and you start to remove, say, tag values that you don't think are useful. This is an implicit endorsement of that 80/20, or 98/2, rule we talked about: you're trying to get rid of the unused data while keeping the useful 20% or 2%.

But what ends up happening is that you lose the optionality of that unused data as you discard it. So there's a tradeoff with aggregation. You might get it right, and if you nailed it perfectly you'd save a ton of money, but you need to be careful: throw away too much data and you lose information that people need.
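
Here is an illustrative sketch of that aggregation step, with hypothetical label names: drop the tag values you believe are unused and pre-aggregate before anything reaches the time series database.

```python
# Strip high-cardinality labels and sum the remaining series in the aggregator.
from collections import defaultdict

DROP_LABELS = {"pod", "request_id"}  # hypothetical high-cardinality labels

def aggregate(batch: list[dict]) -> list[dict]:
    totals: dict[tuple, float] = defaultdict(float)
    for sample in batch:
        kept = tuple(sorted((k, v) for k, v in sample["labels"].items()
                            if k not in DROP_LABELS))
        totals[(sample["name"], kept)] += sample["value"]
    return [{"name": name, "labels": dict(labels), "value": value}
            for (name, labels), value in totals.items()]

batch = [
    {"name": "http_requests_total", "labels": {"service": "api", "pod": "api-1"}, "value": 3},
    {"name": "http_requests_total", "labels": {"service": "api", "pod": "api-2"}, "value": 5},
]
print(aggregate(batch))  # one series per service instead of one per pod
```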

Let's dive a little into what that architecture looks like. You might be looking at this and thinking, OK, wow, this is a streaming map reduce, not super interesting. And that's true. Generally speaking, when people think about a streaming map reduce, two technologies tend to come to mind. Anyone want to shout them out? Kafka and Flink. Most people immediately think: OK, if I need to do a streaming map reduce, I'm going to throw everything on Kafka, spin up a Flink architecture, process all these metrics, reduce them down to the dimensions I need, and report the result into my time series database.

The problem is that when you change the Flink deployment, whether you're rolling out a new version or a node has failed, what happens? It stops, waits for that change to deploy fully, resets to the last checkpoint, and starts processing again. Meanwhile, none of your metrics are being delivered and none of your alerts are functioning correctly. You've essentially given up X minutes of observability while you reprocess everything. That is entirely unacceptable for your observability architecture.

So what led us to that decision? The problem was that we applied the same thought process that we apply to all of our other distributed systems. We didn't consider axiom three: what are our tradeoffs? Our tradeoff here is that we want our data to be really fresh, really fast, and really cheap.

So what we need to do is apply other technologies, and there are a lot of great ones in this space right now. I mentioned Mantis earlier; there's Vector, and there's the OpenTelemetry Collector, both deployed in aggregation mode. There are a lot of new stream processors coming up in this space, and I highly encourage you to check them out. I think even Flink can be tuned to behave like this. But you probably want to keep Kafka out of your observability stack.

The third architectural change we can apply...

I like this one a lot: tiered storage. Also nothing completely wild, but unlike aggregation, if we can bifurcate that 80/20 perfectly, putting the 80 into our cold tier and keeping the 20 in the hot tier, we can keep all that optionality without giving up any observability power, because we still have all the data.

Typically with cold storage, the data becomes less expensive at the cost of speed: looking it up in the cold tier is much, much slower. But the thing we know about the data we're putting in the cold tier is that we're probably never going to look it up anyway.

And unlike the rest of these, and Hassan alluded to this earlier as well, there's really only one technology you should consider for the 80% case, and that's Amazon S3. It's incredibly durable, and it's a somewhat open secret in the industry that S3, with the correct file format and a good index, can behave almost as quickly as a database, with effectively infinite scalability and at a very favorable price point.
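
A minimal sketch of a cold tier writer along those lines, assuming pyarrow and boto3 are available; the bucket name, key layout and schema are illustrative. Partitioning keys by metric name and day acts as the "good index", so the rare lookup only scans a narrow prefix.

```python
# Flush blocks of metric samples to S3 as Parquet, partitioned by metric and day.
import io
from datetime import datetime, timezone

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")
BUCKET = "example-observability-cold-tier"  # hypothetical bucket

def flush_block(samples: list[dict]) -> None:
    """Write a block of samples to the cold tier; columns: name, labels, timestamp, value."""
    table = pa.Table.from_pylist(samples)
    buf = io.BytesIO()
    pq.write_table(table, buf, compression="zstd")

    now = datetime.now(timezone.utc)
    key = f"metrics/{samples[0]['name']}/{now:%Y/%m/%d}/{int(now.timestamp())}.parquet"
    s3.put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue())

flush_block([
    {"name": "http_requests_total", "labels": '{"service":"api","pod":"api-7f9c"}',
     "timestamp": 1700000000, "value": 42.0},
])
```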

Now, the fourth change, and this one is probably my favorite because I've spent a lot of my career working on it. I wrote streaming alerts, but really this should be written as something like arbitrary computation on your metric stream. Again, once you've deployed this stream processor between the cloud that's generating metrics and your observability solution, you can effectively decouple the concept of alerting from the time series database.

And that means you can decouple the concept of alerting from the technical limitations of your storage layer. This might sound basic, but it becomes huge. You may have teams dealing with really high cardinality metrics that can't be put into the time series database. Maybe your service mesh wants to do an N by M comparison of all the connections at your company, or maybe you're comparing devices and backend versions across countries or networks; multiply a few of those dimensions together and you get a very high cardinality computation.

What you can do with this stream processor in between is perform alert evaluation in memory, where it's very inexpensive and cardinality is effectively limitless, or at least whatever you can fit onto the box. And this is a major advantage, because now we can optionally choose to store the data in the time series database, store it in cold storage, or not store it at all. We could just write a summary value in and toss the data away, unless of course an alert is firing; then maybe we write it into the time series database, pump it into cold storage, or just attach a snapshot to the alert and discard the rest.
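
A minimal sketch of that idea: evaluate the alert condition over an in-memory window on the stream, and only spend storage when the alert actually fires. The window size, threshold and what happens in fire() are assumptions for illustration.

```python
# Evaluate a threshold alert on the stream, persisting nothing until it fires.
from collections import deque
from statistics import mean

class StreamingAlert:
    """Sliding-window alert held entirely in memory, so cardinality is
    bounded only by what fits on the box."""

    def __init__(self, threshold: float, window: int = 30):
        self.threshold = threshold
        self.window = deque(maxlen=window)  # recent samples, never persisted
        self.firing = False

    def observe(self, value: float) -> None:
        self.window.append(value)
        breached = (len(self.window) == self.window.maxlen
                    and mean(self.window) > self.threshold)
        if breached and not self.firing:
            self.firing = True
            self.fire()
        elif not breached:
            self.firing = False

    def fire(self) -> None:
        # Only now do we spend money: attach a snapshot to the notification
        # (or optionally write to the TSDB / cold tier) instead of persisting
        # every raw sample all the time.
        print(f"ALERT: window mean exceeded {self.threshold}, "
              f"snapshot={list(self.window)[-5:]}")

alert = StreamingAlert(threshold=250.0)
for latency_ms in [120.0] * 40 + [400.0] * 40:
    alert.observe(latency_ms)
```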

The point is that this streaming system can be reactive to changes in your environment, including your changing alerting needs as you go from a normal state to an anomalous state. And of course I've been talking about metrics this whole time, but the exact same concept applies to logs and traces as well. You can take this a step further. We probably all have those users who've written a really gnarly, multi-hundred-line alert that is basically an application monitoring their service, as opposed to a simple declared alert. Those users probably aren't that happy with that alert either.

I know that you on the observability team aren't. This gives them an additional option. One of the things I've seen over the last 10 years is that if you enable users to write an application on top of this stream processor to monitor their service, they will implement all of the domain knowledge they have on top of your stream processor and build an application that monitors their service in the context of your business.

I've seen cloud gateway teams build applications that monitor all the middle tier services for connectivity problems, latency and errors. I've seen globally distributed database teams use this to monitor an entire view of their distributed architecture. You buy yourself a lot of leeway, not only with your power users, but also with the users who are hungry for more speed and more cardinality.

So streaming alerts, or more specifically arbitrary computation on that stream of data before it goes into your time series database, not only lets you serve the power users and the cardinality hungry users, it also makes it very easy to route to different data stores depending on changing conditions within your environment.

The fifth and final architectural consideration is isolation, and this is a really big one. You're going to be working at a company that probably has a lot of infrastructure teams building great platform code. Everyone's going to be excited about it, and you're going to want to use it. The problem is that every time you take a dependency on a piece of technology developed at your company, you risk being in a failure state at the same time that technology is in a failure state. And if they're depending on you to get them out of that failure state, that's a major problem.

We'll talk about this a little more in the cultural section, but you need to cultivate a culture of self reliance on the observability team, you need to evaluate your tradeoffs realistically, and ultimately you need to minimize the probability of observability experiencing an incident while your company is experiencing an incident. But you still need these technological levers; you can't just exist outside of all your company's technology. And I've been there: I've worked on observability teams that rewrote the platform code at their company because they didn't want to depend on the platform team's code.

I've been on teams that rewrote their deployment system so they didn't have to use the company's deployment system. These teams existed essentially completely outside of their company's normal platform offerings. I think an easier place to get a lever is through your partner. When you start out with the baseline architecture we talked about, your partner is your observability vendor; they are the start and finish of your technical offerings. But as you start to bolt more of these solutions on, more of the responsibility falls on you to maintain that separation from your environment.

So up here I've shown you Kinesis streams for the stream processor, but you could just as easily deploy EKS and run Mantis, Vector or the OpenTelemetry Collector on it. I've shown Amazon Managed Service for Prometheus, but again you could deploy something similar to EKS or ECS. I don't really consider S3 optional; it's an essential component of this architecture. And of course the same goes for Amazon Managed Grafana.

So the point here is that you really need to lean on your partners. That was true when you were just using a vendor based solution, and it continues to be true as you begin to take more responsibility for your architecture. If you don't heed this, you're going to have outages at the same time your company does, and that's a major problem for you and for your business.

But as I mentioned, this isn't just a technical problem; it's also a cultural problem, and the larger your company is, the more cultural it becomes. I'd like to take a few minutes to talk about how we evolved our culture at Stripe to meet the changing needs and demands of observability.

The first one, of course, is creating a culture of self reliance on the observability team. If I had to distill what I mean down to one definition, it is minimizing the probability of having an observability incident conditional on your company having an incident. Let's say you take a dependency on the container platform your company offers. Maybe it's really reliable, maybe it has five nines of reliability, and that means your overall probability of having an incident because of that decision seems very low.

The problem is that if they're having a problem, you're having a problem, which means your conditional probability of having an incident in that case is very high. This is what we want to minimize, and it influences your technology choices. I've driven home the point that your engineers need to make their technology decisions with this mindset, but that's not where it stops.

The leaders on your team need to get behind this mindset and change the way they interact with the company. They have peers, probably in the infrastructure org, who are evaluated on the uptake of the technologies they're building, and those peers are going to look at the observability team and say: hey, you run a lot of services and a lot of compute resources at this company; why aren't you using our platform? That would look really good for us. Your leaders need to be able to negotiate that relationship without caving to the pressure and the temptation to adopt the cool technologies being built across your company.

Furthermore, they need to manage this relationship upward as well, because executives at the company are looking at things from a very high level, and they're going to start seeing redundancy. Why isn't the observability team using these high leverage solutions that other teams are building? I'm told this is great, I'm told that is great, so why is the observability team doing its own thing and wasting all these resources? Simply put, the leaders on your team need to be able to navigate that conversation with this mindset, because if they don't, you're going to end up taking dependencies that are very unfavorable for the reliability dynamics you need in observability.

Finally, you need to cultivate a culture of observability across the company, and this really boils down to one thing: you need to make it easy for your users to do the right thing. It's an old user experience adage, but there are so many points, moments of truth, where you can do this for your users.

I mentioned earlier that our analysis found that 60% of our alerts came from just eight modules we offer to users. One of the other things we discovered in that analysis is that another few tens of percent of alerts could have fit into those modules but just didn't happen to be written that way, and part of that is on us, because I don't think we made it easy enough to use those modules.

Taking this thought process a step further: if we managed to modularize all of those alerts, we could do things like move them between storage layers, move them between query engines, or even move them out of the query engine and into the stream processor, all without our users knowing. Ultimately, we need to engineer our user experience so that users are guided toward the things we want them to do: creating lower cardinality metrics, and executing those higher cardinality queries on the stream processor or against the cold storage rather than in the query engine where we don't want them.

A corollary is that we want to make it hard to do the wrong thing. So don't use an ultra flexible query language that lets users build entire programs in the query engine; if you want them to do that, make them write a stream processing job on your stream processor. Use simple declarative alerts instead of complex programmable alerts. This thought process can be taken to the extreme, and I'll leave that as an exercise for the listener.

So I've thrown a lot at you here. So I'd like to recap. There are five architectural decisions that you should make in order to reduce your scale problems, your reliability problems and ultimately improve your cost effectiveness.

The first one should be sharding, though the first one you'll probably go for is aggregation. Tiered storage is critical to getting that 80/20 split right, and streaming alerts are really the future of what we're doing in observability as you begin to take control of your architecture. And you should be considering isolation at every point, because if you don't stay isolated from your company's technology, you're going to be subject to failures of your company's technology.

Culturally, you need to cultivate that culture of self reliance and, finally, make it easy for your users to do the right thing. But if I had to distill this down to one line: I want you to become way less database centric and focus on your data plane in observability.

So if you stuck with us this far, thank you. And I hope you have a great re:Invent.
