Thank you for joining us today. You're going to learn about resilient architectures at scale, and the way you're going to learn about them is through real-life examples I'm going to share with you today from amazon.com.
My name is Seth. I'm really happy to be here, really glad to see you in this lovely theater, and I'm thrilled to be joined by my co-presenters today, Avinash and Tulip. They'll introduce themselves a little later. I'm a developer architect — I was a developer, then a solutions architect, so I just put the titles together. When I was a solutions architect, I was the reliability lead for AWS Well-Architected, so I've worked with a lot of folks and a lot of customers on resilience challenges. I've had about 12 years total at Amazon, the last four at AWS and the previous eight at amazon.com — and that's where the examples you'll hear today come from: resilient, scalable architectures from the .com side. To get us kicked off — the title says resilient architectures, so what is resilience? Resilience is when your application can withstand, mitigate, or recover from the kinds of faults and load spikes you're going to see in production.
If any of you are running applications in a data center, the cloud, or any kind of production environment, you know it's chaos out there, right? There's always something happening — unusual user patterns, network issues — so you have to build in resilience so that your application remains available. Toward that end, about a month or two ago we released the lifecycle framework for resilience, because resilience is a continuous process. It's not a one-and-done kind of thing; you don't just do it once and you're finished. We made the process map to a software development lifecycle, so in your own software development lifecycle you can make sure resilience is part of it. You can learn more about it later today — there's going to be a breakout session about it, and if you miss that, there's a link there you can read. For our purposes today, we're going to present, as I said, multiple real-life examples of resilient architectures from Amazon, and they fall into three categories on this framework.
The first one is design and implementation — designing for the best practices for resilience. We're going to show you examples of things like fault isolation using cells, auto scaling, and decoupled architectures. The next one on the lifecycle is testing and evaluation; we're going to show you examples of teams that have done chaos engineering and load testing. And finally — people sometimes forget this — you can design whatever you want into an application to make it resilient, but you also have to operate it resiliently. We're going to show you examples of how teams are using metrics and observability across accounts and across services to ensure the resilience of their workloads, of their applications.
And the second part of the title is "at scale," so I might as well define scale. I think we all know, but it's basically this: when you get extra load or extra scope, your system, your application, accommodates it and remains available despite the amount of load or scope you're getting. Going back to amazon.com, which is where our examples come from: in 1995, Amazon was running on two servers — one running the executable, the other running the database — so it started out small. And they had a motto: get big fast. You can see the t-shirt there from the 1997 company picnic — "Get big fast, eat another hot dog." And they did get big fast. If you look at Prime Day from this year, the number of items sold and the dollars of sales are quite impressive.
Now, we're a tech audience here, so this is the slide I really like to show. It shows some — just a few — of the AWS services and resources that Amazon teams are using to be resilient and to scale. You can see DynamoDB with millions of requests per second, Aurora with billions of transactions and terabytes of data. I'm not going to read all the stats to you, but the point is to show you that if you want to be resilient and scalable — resilient at scale — the cloud and AWS services are a way to help you achieve that goal.
And a point I forgot to make earlier about Amazon starting small and scaling up: no matter what scale your applications or your enterprise are at, almost everything we're going to show you today applies to you. You want to put in those resilience best practices, and you want to put them in so that you can scale when you need to. So fast-forward: I said two servers before; now this is the architecture. Each dot here is a service — a microservice or part of a service-oriented architecture — at Amazon, and the lines between them show the dependencies between the services. There are tens of thousands of services running at Amazon today. Amazon does like to use SOA- and microservices-type architectures.
I'm going to show you an example of that right now. This is our first example — actually, this is just an Amazon web page. It's called a detail page; it's the page where you buy stuff. In this case, you're going to buy a Kindle Fire tablet. When you look at that page, it has everything you need: the reviews, the picture, the title, the price, et cetera. But this is actually a framework — an internal framework supported and owned by a team inside Amazon. The framework makes hundreds of calls to back-end services called widgets, and those widgets are essentially microservices. Each widget owns a little piece of business logic and a little piece of what's displayed on the page, and the framework makes those calls in parallel and renders them very quickly.
If I take this page and run it through an internal tool at Amazon, it looks like this. You can see there's a microservice serving the image, a microservice serving the title, a microservice serving the average customer reviews, and all of these are being called in parallel and rendered. This leads to both resilience and scalability, because if one of these services were to have a fault or failure and not operate properly — as long as it's not the title or the image or the price — the customer still has a usable experience. They can still get most of what they need and make a purchase.
So this is what we call graceful degradation: rather than go down, we remain available to the customer, maybe without some functionality. That's the resilience part. The scalability part is that each of these back-end microservices can be deployed independently, which gives the teams the ability to deploy when they need to, innovate when they need to, and put in features when they need to.
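The parallel widget calls with graceful degradation can be sketched roughly like this — a minimal illustration, not Amazon's internal framework; the widget names and the one-second timeout are assumptions:

```python
# Sketch: call widgets in parallel; drop failed optional widgets, but fail
# the page if a critical widget (title, image, price) cannot render.
from concurrent.futures import ThreadPoolExecutor

CRITICAL = {"title", "image", "price"}  # page is unusable without these

def render_page(widgets):
    """widgets: name -> zero-arg callable returning that widget's content."""
    with ThreadPoolExecutor(max_workers=len(widgets)) as pool:
        futures = {name: pool.submit(fn) for name, fn in widgets.items()}
        page = {}
        for name, future in futures.items():
            try:
                page[name] = future.result(timeout=1.0)
            except Exception:
                if name in CRITICAL:
                    raise  # can't degrade gracefully without these
                # optional widget failed: omit it, keep the page usable
        return page
```

If the reviews widget throws, the customer still gets a page with the title, image, and price — degraded, but available.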
Finally, the third point: as I said, it's a framework owned by a centralized team that maintains the framework, while the business logic is owned by the teams that own the widgets — and a different team owns each widget. That means the widget teams can focus on their business logic and not on what the framework is doing, which takes that burden off of them so they can innovate faster.
Ok. So with that, we're gonna go to our next example with Tulip.
Thank you, Seth. So let's get a quick raise of hands to see how many of you know about cell-based architecture — I see a few hands out there. In this part of the session, I'll talk about the basics of cell-based architecture and some use cases from Prime Video and Amazon Music, and how they were able to improve availability and fault isolation using cell-based architecture. To start off with:
I'm Tulip Gupta. I'm a senior solutions architect with AWS. I've been with AWS for the past 2.5 years, helping Amazon customers like Prime Video, Amazon Game Studios, Amazon Music, Twitch, and Audible.
So you might be familiar with traditional scaling. In traditional scaling you usually have your worker nodes — in this case, eight worker nodes serving the needs of all your customers, and we have eight customers out there. But let's say one of the customers, intentionally or unintentionally, sends in a bad request. Now one of your worker nodes gets impaired, so the customer retries, and slowly all your worker nodes are impaired, and thus all your customers are impacted. So the blast radius is all your customers. With cell-based scaling, let's see how we can avoid that poison-pill situation you saw in the previous slide.
So the same customer sends in a bad request, but in this case what we have done is break the system up into cells, and each cell consists of two worker nodes. Now when the customer sends in the bad request, only two worker nodes are impacted — only one cell — and any customer being served by that cell is also impacted. As you can see, only two customers are impacted out of the eight in this scenario. The blast radius has been reduced considerably: a 4x improvement over entire-system impact.
So this is what a cell-based architecture looks like. Cells are a design pattern where a service is split into multiple deployment stacks called cells. They are independent instances of their own, and each can independently serve the full workload of its customers. One important thing to note is that cells share nothing. So if we have three cells — cell zero, cell one, and cell two — it's very important that cell zero and cell one do not share any data. The reason is that if there's data cell one needs from cell zero, and cell zero is impaired, then cell one would be impaired too.
The other key thing is the cell router. The cell router routes requests based on some configured logic. A request comes in, and the cell router routes it — maybe based on a partition key like customer ID, maybe round robin from cell zero to cell one to cell two — to the different cells. One important thing about the cell router is that it has to be as thin as possible, because if the router itself is impaired, it cannot route requests to any of the cells, and all your customers would be impacted.
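A thin cell router of the kind described can be sketched as follows — an illustrative minimum, with hypothetical cell names; the two routing modes match the partition-key and round-robin options just mentioned:

```python
# Sketch: two thin routing strategies for a cell-based architecture.
import hashlib
import itertools

CELLS = ["cell-0", "cell-1", "cell-2"]

def route_by_key(customer_id: str) -> str:
    """Partition-key routing: the same customer always lands on the same cell."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

_rr = itertools.cycle(CELLS)

def route_round_robin() -> str:
    """Round-robin routing: spread requests evenly across cells."""
    return next(_rr)
```

The router holds no state beyond the cell list, which is what keeps it thin: there is almost nothing in it that can fail.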
So we're going to deep dive into some of the use cases from Prime Video and Amazon Music; I helped them adopt cell-based architecture this year.
The team that adopted cell-based architecture at Prime Video is the Prime Video analytics team. It allows internal clients to deep dive into the experiences of external customers as they're watching Prime Video, and thus provide improved video delivery quality. One of the key reasons they wanted to adopt cell-based architecture was simplifying global setups: they wanted to be able to move their workload quickly from an underperforming region to a healthy region, and if a region doesn't have enough capacity, to quickly move to a different region.
So let's say they had their workload in us-east-1 and there was not enough capacity for certain instance types; they wanted to be able to quickly move to us-east-2. For Amazon Music, it was the metric transition service team. That service collects metrics from different clients and helps improve music delivery quality. The key reason they wanted to adopt cell-based architecture was fault isolation. They had different kinds of events coming in, from the most critical to the least critical, and events coming in from different device types. They wanted fault isolation so that if there was a lot of noisy traffic, like operational events, their most critical customer-impact events wouldn't be affected.
Let me go through the key decisions Prime Video took. One of the key decisions was how they wanted to design their cells. Previously they had one workload serving the needs of all their customers, spread across all the Availability Zones in one region. They split it up into different cells, with three cells per region. The reason each cell spanned the AZs in one region is that they were using regional services like Lambda — that's why it was a regional cell.
One of the key decisions whenever you adopt cell-based architecture is to look at what services you're using. If you're using EC2 or services like that, which are AZ-based, your cells can be AZ-based. The second decision they took was around the cellular traffic policy. When a request came in from the devices to Route 53, they had traffic policies built in on Route 53 that would route the traffic round robin.
So a request would go to cell one, cell two, cell three, and so on. Let's say the request comes into cell two. They also had Route 53 DNS policies that would do geoproximity routing, which means routing the request to the region closest to where the request came from. So if the request came from New York, it's routed to the closest region.
In this case, us-east-1. When the request gets to a region, it hits the Application Load Balancer and then the corresponding cell behind it. The third decision they took was cell health checks. One thing I want you to note is that you don't want to route requests to a cell that's underperforming or unhealthy. The way they checked whether a cell is healthy is they set up Route 53 health checks that ping the bootstrap API of the individual cells. If they got a 400 or 500 error, they would know the cell is unhealthy and would not route requests to it.
The second thing they did was CloudWatch alarms. They looked at the ELB 500 errors, and if there were more than 100 errors in a minute, they would know that the load balancer in that particular region was unhealthy and would not route requests there either. As a result, they saw an outcome of 99.999% availability over a span of four weeks — that's the percentage of events that were processed successfully.
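The two health signals just described — the Route 53 health check on the bootstrap API and the CloudWatch-style alarm on ELB 500s — can be sketched like this. The thresholds follow the talk (4xx/5xx responses, more than 100 errors per minute); the function names and data shapes are illustrative:

```python
# Sketch: keep a cell in rotation only if both health signals pass.
def healthy_by_status(status_code: int) -> bool:
    """Route 53 health check: a 400 or 500 response marks the cell unhealthy."""
    return status_code < 400

def healthy_by_elb_errors(errors_last_minute: int, threshold: int = 100) -> bool:
    """CloudWatch-style alarm: too many ELB 500s marks the region unhealthy."""
    return errors_last_minute <= threshold

def routable_cells(cells: dict) -> list:
    """cells: name -> (last bootstrap-API status, ELB 500s in the last minute)."""
    return [name for name, (status, errs) in cells.items()
            if healthy_by_status(status) and healthy_by_elb_errors(errs)]
```

Requests then only go to the cells this filter returns; an unhealthy cell simply drops out of DNS rotation.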
And the way they calculated availability was:
Availability = (Total Requests − Errors) / Total Requests × 100
Any failure was labeled as a service-side failure, like an ELB 500 error. And all of this came with the improved availability and ability to fail over that comes with cellularization.
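The availability formula above, as a quick sketch — the numbers in the comment are illustrative, not from the talk:

```python
# Sketch: availability as the percentage of requests served without a
# service-side failure.
def availability(total_requests: int, errors: int) -> float:
    return (total_requests - errors) / total_requests * 100

# e.g. one failed request out of 100,000 is "five nines":
# availability(100_000, 1) ≈ 99.999
```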
This brings us to Amazon Music. In one of my previous slides I talked about the cell router, and that's exactly what they built. For Prime Video, the cells contained an ALB, Lambda, and SQS; for Amazon Music, an ALB, Lambda, and Kinesis. And by stateless system, I mean they did not store any information in their cells.
The routing policy for Prime Video was round robin and geoproximity, while for Amazon Music it was device-type and event based. As a result, they both saw similar outcomes: increased availability and resiliency.
So with that, I'll hand it over to Seth.
Alright, thank you Tulip.
So we learned about the website — she made me get a drink of water — we learned about Amazon Music, we learned about Prime Video. Now we're going to learn about Ring. Ring built a massively scalable event-driven architecture that achieves six 9s of availability while serving about 129,000 requests per second.
So before I dive into what it looks like, I gotta make sure everybody knows what Ring is. I'm a Ring customer, I'm a Ring fan. So Ring is a set of doorbells and cameras and alarm equipment that you can put on your house. And then you know, when something happens in your driveway, you get a motion alert and you look on your phone and see, oh there's my driveway. Oh there's my minivan and oh there's my bunny, oh that's not really my bunny, but it was a bunny crossing my driveway and it was still fun to see.
So that's what Ring is about. And before I get into the 129,000-requests-per-second case, I want to present a different service: their video encoder service. That previous slide I showed you was a snapshot from a video from this service. There's a camera in my driveway taking raw video footage and putting it into an S3 bucket — object storage. But that's not what Ring wants to show me on my phone; they need to do some kind of post-processing, transcoding.
So when they put the video in the bucket, it sets off an event that puts a request on an SQS queue, where a fleet — those three little boxes — of EC2 instances running a transcoder service is polling that queue. When they get a message — oh, there's work to do — they pick up that video, transcode it, put it in that other bucket, and that's where I can look at it on my phone.
So that's how the transcoder works. But like many services at Amazon, Ring has to be able to scale up and scale down. With most services at Amazon, if you're looking at the website or video or music, they're going to have big events around Prime Day. But Ring is different, right? So Ring is doing this video transcoding. What do you think the big event for Ring is where they're doing video transcoding?
Yeah, you got it. It's Halloween. So there's kids going door to door setting off the motion detection. I personally love it because I take the kids out trick or treating, my wife stays home with the candy bowl and I can get little alerts showing the kids coming up to our door and see that we didn't waste our money buying all that candy. So it's great.
But Ring needs to be able to scale up. That's quite a massive scale that's happening there to be able to transcode all that video. So how do they do it?
Well, here's that architecture again. They monitor the queue using CloudWatch, and they watch a metric called EmptyReceives. EmptyReceives is interesting: if there are too many EmptyReceives — meaning the poller is asking for work and there's nothing there — it means we're probably over-scaled and can scale down. But if they're asking for work and there's never an EmptyReceive — there's always work there — it means the queue is probably backing up and we need to scale up.
So they feed that data into a Step Function which is a state machine where they could take that data plus some other proprietary metrics and decide whether to scale up or scale down to be able to serve that video as quickly as possible.
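The scale decision driven by EmptyReceives can be sketched like this. The ratio thresholds here are illustrative assumptions; in Ring's system, as described, the decision runs in a Step Functions state machine and also factors in other proprietary metrics:

```python
# Sketch: decide a scaling action from SQS polling statistics for one
# evaluation period.
def scale_decision(empty_receives: int, total_receives: int) -> str:
    """Return 'scale_down', 'scale_up', or 'hold'."""
    if total_receives == 0:
        return "hold"              # no polling data this period
    empty_ratio = empty_receives / total_receives
    if empty_ratio > 0.5:
        return "scale_down"        # pollers mostly find nothing: over-scaled
    if empty_ratio < 0.05:
        return "scale_up"          # pollers almost always find work: backlog
    return "hold"
```

On Halloween night the empty ratio drops toward zero and the fleet scales up; the morning after, it climbs and the fleet scales back down.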
So that previous example was about reducing the latency to see video. The next example is also latency-focused: reducing the latency between everything. Ring is basically built on event-driven architecture — between the devices and their back-end services, and service to service, everything is event-driven.
So they wanted to build a system that would reduce the latency for those things to talk to each other as much as possible, and make it as resilient and scalable as possible. What do I mean by event-driven? For example, a camera might record an event like StreamStart — that's the name of the event internally; to us it means the device detected motion. That event then needs to get routed to a notification service, because the notification service is going to send the push notification to me and tell me there's someone at my doorbell.
In this case, me, I'm at my own doorbell, but still as a user, you want to know that as quick as possible. So that's an example.
So they built the Streaming Event Bus or SEB. So I love making architecture diagrams. This one looks a little complex, but I'm going to walk you through it and we're going to break it down piece by piece.
So first thing: it's a multi-tier architecture, and everything in gray is outside the scope of SEB. There are event producers, like the cameras and various other services, and event consumers, like the notification service I showed you earlier. That's in gray; everything in white is SEB.
So in that first tier is the API layer and at the API layer, it's doing some authentication, it's doing some logic, but it's also doing routing just like Tulip showed you. It's deciding which cell to send a given event to based on the event topic.
At the processing layer, you can see there's multiple cells here running Kafka. Apache Kafka is a high throughput, highly scalable event stream processing system.
And then at this layer is the consumer proxy. They did something clever: they wanted to be able to onboard many consumers, and they didn't want all those consumers to have to be polling Kafka. So they built a consumer proxy that polls Kafka for them and then serves them the events, either by direct API call or by putting them on an SQS queue.
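The consumer-proxy fan-out can be sketched like this. This is a minimal in-memory stand-in — a list of events stands in for what was polled from Kafka, a callback stands in for a direct API call, and a `deque` stands in for an SQS queue:

```python
# Sketch: one proxy polls the event stream and fans events out to many
# consumers, so the consumers never have to poll Kafka themselves.
from collections import deque

def fan_out(events, subscriptions):
    """events: list of (topic, payload) polled from the stream.
    subscriptions: topic -> list of ('api', callback) or ('queue', deque)."""
    for topic, payload in events:
        for kind, sink in subscriptions.get(topic, []):
            if kind == "api":
                sink(payload)        # direct API call to the consumer
            else:
                sink.append(payload) # enqueue (stands in for SQS)
```

Each consumer chooses the delivery mode that suits it; the proxy absorbs the cost of polling once, on everyone's behalf.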
So that's SEB. As I said, it's multi-cell and you might notice that each of these pink boxes is a separate AWS account. They divide all these tiers and cells into different AWS accounts both for blast radius and manageability.
As I said, Kafka is an event streaming system — highly scalable, high throughput. So what is Managed Kafka, or Managed Streaming for Kafka (MSK)? Managed Kafka is a way you can run Kafka on AWS where AWS takes care of setting up the cluster for you. You don't have to worry about setting up the servers: you tell it what servers you want, it takes care of that; you tell it you want encryption, it takes care of that; you tell it you want shared storage, it takes care of that. It's managed — that's what managed means.
And so they built SEB, the Streaming Event Bus, as a cellular architecture, right? As Tulip described: those little pink boxes on top are different events coming in — thousands of them — and based on the event topic, they're going to go to either Cell 1 or Cell 2.
And the thing about a cellular architecture is blast radius. So we've already shared with you if a cell goes down, then only half the topics are affected, the other half are still going to work. So that's pretty good. But the team found something really interesting after they implemented this, they found that if a cell goes down, they could actually scale up the other cell and accommodate all the topics there.
So it's cellular, but when it needs to, it actually scales up to accommodate all the topics. So they get all the benefits, all the scalability of a cellular architecture, and now the blast radius is nothing, because all topics are being served by the remaining healthy cell.
And this is something really clever that they did that I really like. In this case you can see Cell 1 and Cell 2 are not healthy — they're down. Why might that be? Tulip said that a cell should be a fault isolation boundary; that's the whole point — a fault shouldn't cross cells. But it can happen. Call that a correlated failure: some failure that has correlated impact across multiple cells.
Let's say Kafka is having an issue, or they deploy a bug to their Kafka implementation. In this case, a lot of services might choose to go multi-region, right? Something's wrong with Kafka in US East, we'll fail over to US West. They didn't go that route. They created what's called a fallback cell — I call it Cell 3, but I put "cell" in quotes because a cell is really supposed to be the same stack everywhere, and Cell 3 is not the same stack. Cell 3 is not running Kafka or MSK; it's using Simple Notification Service (SNS) and Simple Queue Service (SQS) to do the stream processing — not necessarily as efficiently as Kafka does it in Cells 1 and 2, but it maintains availability. And this avoids the correlated failure: if the correlated failure has something to do with MSK or Kafka, it's not likely to be affecting SNS and SQS, so they're able to maintain availability.
The other pattern they implemented is the circuit breaker. I think many people have heard of circuit breakers, but we'll cover it anyway so we're all on the same page, and show you how they did it. The circuit starts closed, and closed is good. Think of a light switch: when your light switch is on, the circuit is closed, which means electricity is flowing, and that's a good thing. Here, circuit closed means the cell is accepting requests. But if you get a number of errors above a threshold that tells you the cell is unhealthy, you open the circuit. You can see with Cell 1 we've opened the circuit, so requests are not going to go to an ailing cell where they won't be served — they'll go to the healthy cell. And remember, if you get a few sporadic failures in Cell 2, it falls back to that quote-unquote Cell 3 using the different technology and serves from there. That gives you an extra layer of resilience.
Now, once a circuit's been opened, it goes into a half-open state. In the half-open state it sends occasional requests, and if it gets enough healthy responses, it assesses that the cell is healthy, closes the circuit again, and the cell receives requests again. I showed you an example of what these events and notifications look like before, but I want to bring it home — I learn from these examples by understanding what the service actually does.
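The closed / open / half-open state machine just described can be sketched like this — the failure and success thresholds are illustrative assumptions, not Ring's actual values:

```python
# Sketch: a per-cell circuit breaker with closed, open, and half-open states.
class CircuitBreaker:
    def __init__(self, failure_threshold=5, success_threshold=3):
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold

    def record_failure(self):
        self.failures += 1
        self.successes = 0
        if self.failures >= self.failure_threshold:
            self.state = "open"        # stop sending traffic to this cell

    def record_success(self):
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"  # cell looks healthy again
                self.failures = 0
        else:
            self.failures = 0          # sporadic failures reset on success

    def try_probe(self):
        """After a cool-down, an open circuit lets occasional probes through."""
        if self.state == "open":
            self.state = "half_open"
```

While a cell's circuit is open, its traffic goes to the healthy cell (or the fallback "Cell 3"); the half-open probes decide when to bring it back.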
So in this case, here's another example. Remember the StreamStart example from before, where a camera detects motion and sends me a push notification? Here's another: it sends that same event to the event manager, and the event manager puts together a nice timeline for me, so I can see when a person was detected or a package was detected in front of my door — in this case, an Amazon delivery driver. And here's a fun one: the ring-ding event, which means someone rang your doorbell. If you've configured it, it sends that event to the bot service, and the bot service does an auto-response. The auto-response could be something as mundane as "please leave a message," or for Halloween it could be "trick or treat" or something like that. You can read what they are there — they're kind of fun. And I promised massive scale.
You see on the right, that's showing all those little pink boxes — thousands of messages coming in. So how many messages per hour is SEB receiving? The graph on the left is actually showing the messages per hour for eight different regions — Ring deploys SEB in eight regions. In us-east-1, the biggest region where it's deployed, it goes up to that promised 129,000 per second. It actually goes higher than that; that's just the highest it was for the screen grab I got, but it's a good representation of how high it gets. It's multiple regions all running their separate SEB stacks and serving all these events, and because it's eight regions, the total comes to 299,000 requests per second across all eight — served at the promised six nines of availability, averaged across those regions. And if you break down each region: even us-east-1, for the three-day period I was looking at, achieved 100% availability on SEB. That's because they implemented the cellular architecture, the failover and the failback, and the circuit breaker — they applied these best practices to get that six nines, and even 100%, availability. And with that, I'm going to hand it over to Avinash. Thank you.
Alright, let's do a quick hand raise: how many of you here use the Alexa mobile app? I see a few hands. Alright. Today I'll be discussing how Alexa has improved their resiliency and their developer velocity, in particular with an example from Alexa mobile personalization.
I'm Avinash Kuri, a senior solutions architect with AWS. I'm supporting Amazon as a customer of AWS, primarily working with Alexa and devices.
Alright. So Alexa mobile personalization is basically a landing-zone app for all the smart devices that are integrated with Alexa. With it, you can arrange the actions you want quick access to on your favorite devices, or take shortcuts to particular actions like controlling your thermostat temperature or switching on the living room lights. At the same time, it helps you focus on your daily routines, such as weather or traffic updates. As Seth pointed out earlier, in the vast ecosystem of microservices that we support, Alexa mobile personalization is one among them, and it serves as a kind of triggering point for many other downstream services across Alexa.
Here are some of the resiliency goals we wanted to focus on — a snapshot of common goals across different organizations. We come from a customer-obsession background, and improving customer experience is one of our key priorities. At the same time, when we see certain peak events — as discussed earlier, it could be Prime Day or any such event — we see a lot of new devices being added, and when a new device is added we also have to scale the corresponding downstream services transparently. So it is necessary for us to scale for these peak events and for the downstream services. The next is fault tolerance: we want to pre-identify faults and issues before customers catch them, and take contingency measures. Doing all of this requires a lot of developer effort, and we want our developers always focused primarily on innovation and development activities, not on operational or resiliency activities.
Well, we had many challenges in this overall journey of resiliency, but here are some of the ones we wanted to bring up. We initially tried working with many different tools and technologies and built our own homemade scripts, and all of those scripts and tools require a lot of operational capability, because you have to maintain them, patch them, and deal with certain operational burdens. While doing that, what we observed is that because we come from a diversified technology stack, it is equally important for us to attain compatibility — and when you have many tools, agents, and libraries, attaining compatibility across the whole technology stack is another challenge. At the same time, we wanted to make sure our security is tight: we are not leaving room for any sort of intruders or leaks when using different tools or agents within our production systems. And the last is mimicking real-world events and scenarios. To do that, we would have to pull different teams and workforces together and make sure all of them align to a certain standard in order to simulate a set of events. Unfortunately, to do all of this — we are not the Avengers, just developers.
All right. So with that, we started leaning on AWS Fault Injection Service. This is a managed chaos experiment service that helps you run fault injection experiments, or actions, directly on your AWS resources. While it supports a lot of actions on various AWS resources, today I will be showcasing these two: one is about Amazon EC2 instances and how you can stop, terminate, or reboot them; the other is about how you can run Systems Manager run commands and take explicit control actions on your resources. One good thing is that Fault Injection Service goes hand in hand with CloudWatch, so you can set up your own monitoring and alarms and make sure you have a stop or exit condition as a guardrail, so that you know when to exit while conducting these sorts of experiments.
Here is a quick overview of our steady state. We want to make sure that our CPU utilization is always less than 50%, with memory utilization less than 20%. At the same time, we want to support at least 3 million users with a P99 latency of less than 100 milliseconds.
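The steady-state criteria just described can be expressed as a simple predicate, the kind of check you might use as a guardrail in test automation. This is only a sketch; the thresholds are the ones stated in the talk, and the function name is our own.

```python
def within_steady_state(cpu_pct, mem_pct, p99_latency_ms):
    """Return True if the metrics satisfy the steady-state criteria
    from the talk: CPU < 50%, memory < 20%, P99 latency < 100 ms."""
    return cpu_pct < 50 and mem_pct < 20 and p99_latency_ms < 100

# A healthy sample passes:
print(within_steady_state(35, 12, 80))   # True
# A latency breach fails:
print(within_steady_state(35, 12, 130))  # False
```

In a chaos experiment, a check like this (typically wired up as a CloudWatch alarm rather than inline code) is what defines "the system is still fine" while faults are being injected.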
Here is the first example I will be discussing: the CPU and memory stress experiment. On the left side, you're looking at the hypothesis. What are we doing here? Within Alexa mobile personalization, we are injecting 40% CPU and memory load, and at the same time we are scaling the traffic from our load generator by an additional 30%. When we do that, the expected outcome is that there are no incidents reported, and our P99 latency stays within 100 milliseconds, with the exception of an occasional spike of around 130 milliseconds. The mechanism we are using here is experiment templates, which are basically JSON or YAML templates you can use directly with Fault Injection Service. A template starts with a name, a description, and a role ARN. The role ARN gives you explicit control over the targets, the AWS resources against which you want to execute these actions. And as I stated earlier, we have stop conditions here, making sure we keep the experiment under control.
The stop condition is a CloudWatch alarm, and the targets are EC2 instances. We are using resource tags to classify all our instances and make sure we run this experiment against those targets.
And these are the actions. They include a Systems Manager run command that introduces CPU and memory stress, and these actions are executed against those EC2 instance targets.
And this is the whole template. When we executed it, here is a quick snapshot of the many events we captured across our entire infrastructure stack; we use CloudWatch to capture all of these events.
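To make the template structure concrete, here is a minimal sketch of what a CPU-stress experiment template like the one described might look like, built as a Python dict of the kind you would pass to the FIS `create_experiment_template` API. The role ARN, alarm ARN, tag values, and region are placeholders of our own; `aws:ssm:send-command` and the `AWSFIS-Run-CPU-Stress` SSM document are the standard FIS identifiers for this kind of stress action, but verify parameter names against the current FIS documentation.

```python
import json

template = {
    "description": "Inject CPU stress on tagged EC2 instances",
    # Placeholder IAM role that grants FIS permission to act on the targets:
    "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role",
    "stopConditions": [{
        # Guardrail: abort the experiment if the latency alarm fires
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:p99-latency",
    }],
    "targets": {
        "stress-targets": {
            "resourceType": "aws:ec2:instance",
            # Placeholder tag used to classify the instances in scope:
            "resourceTags": {"service": "alexa-mobile-personalization"},
            "selectionMode": "ALL",
        }
    },
    "actions": {
        "cpu-stress": {
            # Run an SSM document on the target instances:
            "actionId": "aws:ssm:send-command",
            "parameters": {
                "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-CPU-Stress",
                "documentParameters": json.dumps(
                    {"DurationSeconds": "600", "LoadPercent": "40"}
                ),
                "duration": "PT10M",
            },
            "targets": {"Instances": "stress-targets"},
        },
    },
}

# With boto3, this dict would be passed to
# boto3.client("fis").create_experiment_template(clientToken=..., **template).
print(sorted(template.keys()))
```

The important pieces line up with what the slide shows: a description and role ARN at the top, a CloudWatch alarm as the stop condition, tag-selected EC2 instances as targets, and an SSM run-command action that injects the stress.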
And this is a particular instance of those events. We observed a rise in CPU utilization, because we introduced an additional 40% CPU load, and at the same time memory usage grew. What we also observed was a spike in network traffic out from these EC2 instances, because we are also generating additional TPS load.
And when we do that, what we noticed is that our P99 latency still stays within our expected outcome; it spiked to around 130 milliseconds in just one instance, and our P90 latency still remains under 100 milliseconds. That gives us the confidence that our infrastructure stack is ready to take extra load and extra traffic, even though we have constrained its CPU and memory.
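The gap between P90 and P99 here is worth a quick illustration: a handful of slow outliers can push the P99 above 100 ms while the P90 stays comfortably below it. This is a sketch with synthetic latency numbers, using a simple nearest-rank percentile, not Amazon's actual measurement code.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Synthetic latencies (ms): 90 fast requests, 8 slower ones, 2 outliers.
latencies_ms = [80] * 90 + [95] * 8 + [130] * 2

print(percentile(latencies_ms, 90))  # 80  -> P90 well under 100 ms
print(percentile(latencies_ms, 99))  # 130 -> P99 spikes on the outliers
```

Two slow requests out of a hundred are enough to move the P99 to 130 ms, which is why the hypothesis above allowed for occasional spikes while still treating sub-100 ms P90 as the steady state.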
Here is the second experiment. In this one, we are trying to simulate an availability zone outage, in order to have a kind of real-world failure: what happens when an availability zone goes down, and how are you going to deal with it?
Again, the hypothesis here is that we are injecting an availability zone impairment, and the expected mitigation is that auto scaling kicks in in the other availability zones. The outcome is that, even though we are cutting off an availability zone, the traffic is handled gracefully.
And at the same time, our P90 latency should again stay under 100 milliseconds. As I stated earlier, the experiment starts with a description and a role ARN, the stop condition is again a CloudWatch alarm, and the targets are EC2 instances.
The action here is that we stop the set of EC2 instances belonging to a specific availability zone. The beauty of using AWS FIS experiments is that you get to run the experiment actions either in series or in parallel.
At the end here, you can see that this chaos experiment, the availability zone outage, is configured to start after the CPU and memory stress actions.
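The series-versus-parallel behavior just mentioned comes from the `startAfter` field on an action. As a sketch, here is what the actions and targets sections of such an AZ-outage experiment might look like; `aws:ec2:stop-instances` and the `Placement.AvailabilityZone` target filter are standard FIS features, while the tag values, AZ name, and the names of the preceding stress actions are placeholders of our own.

```python
# Sketch of an AZ-outage experiment fragment: stop every tagged EC2
# instance in one Availability Zone, but only after the stress actions
# have finished (series execution via "startAfter").
az_outage_fragment = {
    "actions": {
        "stop-az-instances": {
            "actionId": "aws:ec2:stop-instances",
            "targets": {"Instances": "instances-in-one-az"},
            # Sequencing: without startAfter, actions run in parallel.
            # These action names are placeholders for the stress actions.
            "startAfter": ["cpu-stress", "memory-stress"],
        }
    },
    "targets": {
        "instances-in-one-az": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"service": "alexa-mobile-personalization"},  # placeholder
            "filters": [{
                # Narrow the tagged fleet to a single AZ (placeholder AZ):
                "path": "Placement.AvailabilityZone",
                "values": ["us-east-1a"],
            }],
            "selectionMode": "ALL",
        }
    },
}

print(az_outage_fragment["actions"]["stop-az-instances"]["startAfter"])
```

Stopping only the instances whose placement matches one AZ is what makes the experiment a realistic zonal impairment: auto scaling in the surviving zones has to absorb the traffic, which is exactly the hypothesis being tested.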
And here is another observation from our CloudWatch dashboards. What we saw while executing this experiment is a constant increase in TPS, because we had taken down one of the availability zones, and then an immediate CPU spike, because the other two availability zones have to start taking that traffic.
At the same time, our average latency is still under 100 milliseconds. This tells us that our infrastructure is always in a ready state and that we can serve the traffic even when one availability zone is down.
So this is how we started. We began with manual load testing, and then we moved to game days; those are always evergreen. Then we introduced all the functional and traditional testing into the pipeline, baked into the pipeline, making sure it runs with every deployment cycle.
At the same time, we also started designing our own set of tools and technologies, along with scripts, to do this sort of resiliency testing. But we came to understand that all of this involved a lot of operational overhead for us, and cost, and so we moved to fully automated chaos testing using Fault Injection Service.
An additional advantage of using AWS is that these experiment templates can be shared across different developer communities, so other teams need not start from scratch; they can use the templates directly as an abstraction layer on top of their own services.
Some of the key takeaways we observed from this entire exercise: we were able to scale from 3 to 4 million users without changing anything on our infrastructure side, which greatly improved our operational resilience.
And now that our developers get more time to focus on innovation and new development activities, we found that, by our calculations, almost 640 developer hours have been saved per quarter, which improves developer productivity a lot too.
We also took down some of the infrastructure we had provisioned, based on our experiments with 40% CPU and memory stress. That helped us reduce our overall infrastructure cost by 60% and commit to carbon savings of 30%.
With that, I'll hand over to Tulip for the next use case.
Thank you, Avinash. I'll get a quick sip of water; it's very dry up here. So you've heard the stories around chaos engineering, cell-based architecture, and rings architecture as well. When you have these massive architectures, these massive workloads, running on AWS, observability becomes really important: you want to be able to monitor your infrastructure.
In this part of the session, we're going to learn about how Audible scales observability using CloudWatch's unified observability capabilities. You might know that Audible is one of the largest producers of audiobooks in the world, so they have a lot of services, and each of those services generates its own logs and metrics.
Previously, they lacked a holistic view that would let them pinpoint root causes. They weren't able to get to the bottom of why a high-severity issue occurred very quickly; it took them a long time. So when CloudWatch cross-account observability was released last year, they were one of the early adopters and were able to quickly realize the benefits.
This is what CloudWatch cross-account observability looks like. Let's say you have three AWS accounts and you're running ECS, EC2, and Lambda workloads out there, and you might have set up AWS X-Ray. What AWS X-Ray does is trace a request from one service to the other and create a trace map, collecting all of these traces. And CloudWatch is set up in all of these accounts as well, collecting the logs and metrics.
Now, with CloudWatch cross-account observability, you can send these traces, logs, and metrics into one single AWS monitoring account, and that becomes your centralized observability account. All you need to do is log into that monitoring account, and you can correlate your logs, traces, and metrics across all your source accounts.
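Under the hood, this wiring is done with CloudWatch's Observability Access Manager (OAM): a sink in the monitoring account and a link from each source account. The sketch below builds the request payloads as plain dicts; the account IDs, sink name, and sink ARN are placeholders of our own, while the `AWS::CloudWatch::Metric`, `AWS::Logs::LogGroup`, and `AWS::XRay::Trace` resource-type strings are the real OAM values for metrics, logs, and traces.

```python
import json

# 1) In the monitoring account, create the sink:
#    boto3.client("oam").create_sink(**sink_request)
sink_request = {"Name": "central-observability-sink"}  # placeholder name

# 2) Attach a sink policy allowing the source accounts to link in
#    (put_sink_policy takes this as a JSON string):
sink_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": ["111111111111", "222222222222"]},  # placeholder accounts
        "Action": ["oam:CreateLink", "oam:UpdateLink"],
        "Resource": "*",
    }],
}

# 3) In each source account, create a link back to the sink:
#    boto3.client("oam").create_link(**link_request)
link_request = {
    "LabelTemplate": "$AccountName",  # how the account appears in dashboards
    "ResourceTypes": [
        "AWS::CloudWatch::Metric",
        "AWS::Logs::LogGroup",
        "AWS::XRay::Trace",
    ],
    "SinkIdentifier": "arn:aws:oam:us-east-1:999999999999:sink/EXAMPLE",  # placeholder ARN
}

print(json.dumps(link_request["ResourceTypes"]))
```

Once each source account has a link for metrics, logs, and traces, logging into the monitoring account is enough to see all of them side by side, which is the single-account experience described above.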
Now I'm going to do a demo of tracing a severity issue. We'll basically step into the shoes of an on-call engineer from Audible and see how they would do it.
So here's the trace map and what it looks like. For folks who don't know what a trace map is, it basically shows how your request flows from one service to the other. Let's put ourselves in the shoes of an on-call engineer, and let's say you're seeing a lot of error codes coming up because one of the clients is not able to complete its requests. How would you handle it? Traditionally, for on-call engineers at Audible, you had to log into separate accounts to look at those logs and metrics. But now they can just log into the one monitoring account here and trace that request.
In a trace map, the arrows show how the request flows from one service node to the other, while the circles are the service nodes themselves. At the top you can see a little red dot, and what that red dot indicates is that an error or a fault happened in that particular service node. So it becomes really easy to just go into that one account, look at the service map, and see where all the errors occurred.
You can also filter down. This view collects the services from all four AWS accounts here, and you can select one of the accounts and see all the services attached to it, or select a different account and see the services attached to that one, so you can look at separate AWS accounts individually and also get a holistic view.
So let's go back. You can see the service node that has the error, so we click on that service node and it brings us to this view. What this helps you do is correlate your metrics to your traces: you can see the trace map, and at the bottom you can see metrics like latency, and also metrics like faults, which indicate there are errors out there.
If you want to dive deeper, all you need to do is click on "view traces", which picks up the trace segments from around the time the faults occurred. Clicking on that brings us to something like this, where you can see the trace map at the top and see that there are some faults associated with when this trace was collected.
And then it also shows you what happened, what exactly the cause was. The cause was that the customer ID didn't get propagated, didn't get sent, from one service node to the other. So we're able to quickly dig in and find out why the error occurred.
The other thing is that, from the page where you clicked on "view traces", you can also click on "view container logs", and that brings you to this Logs Insights screen here. It automatically selects the time frame when the error occurred, and you can see all the logs associated with that time frame.
It also picks up the trace ID from when the error occurred, so you can get further information about why those errors happened. As a result, after Audible implemented cross-account observability, they were able to correlate their logs, traces, and metrics easily.
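The "logs for this trace" step boils down to a CloudWatch Logs Insights query scoped to the error's time window and filtered on the trace ID. Here is a sketch of the request payload such a query might use; the log group name, time window, and trace ID are placeholders of our own, and filtering on `@message` is just one simple way to match a trace ID that appears in log lines.

```python
trace_id = "1-581cf771-a006649127e371903a2de979"  # placeholder X-Ray trace ID

# With boto3 this dict would be passed to
# boto3.client("logs").start_query(**query_request).
query_request = {
    "logGroupName": "/ecs/audible-service",  # placeholder log group
    "startTime": 1700000000,  # epoch seconds: window around the fault
    "endTime": 1700000300,
    "queryString": (
        "fields @timestamp, @message "
        f"| filter @message like '{trace_id}' "
        "| sort @timestamp asc"
    ),
}

print(trace_id in query_request["queryString"])  # True
```

The console does this selection for you, pre-filling the time frame and trace ID, which is why jumping from a fault on the trace map to the relevant log lines takes only a couple of clicks.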
They only had to use one monitoring account to look across all their source accounts, and as a result they saw over a 60% reduction in debugging time. Previously they would spend, on average, around two hours debugging any severity issue; now they're spending almost 20 to 30 minutes.
And this is a quote from one of the developers, who says that previously he had to log into multiple windows, and now he only has to log into one. He can query all the services in one single pane of glass, and that saves him a lot of time.
And with this, I'm going to hand it over to Seth. Thank you.
I love seeing quotes from developers, and that picture we had there is just, you know, licensed photography; it's the same one I've used with other developer quotes in the past. So somebody's going to think that's the happiest developer in the world, with all the happiest quotes from Amazon.
The conclusion today is pretty straightforward. We wanted to show you how you can build resilient, scalable architectures, and we wanted to do that by example, so we showed you real-life examples from all of these teams to help inspire you and show you it can be done. As I said, no matter what size you are now, small to large, a lot of these principles apply, and a lot of these best practices apply. I want you to go out there and do them. Now, to learn more:
There are other sessions you can check out. This slide is from an earlier version of this talk, so some of these sessions may have happened already; if they're breakouts, they're going to be recorded, so don't worry about it.
That lifecycle framework I started out talking about is linked there, and there are a couple of other things you can learn about: cell-based architecture, chaos engineering, X-Ray, and so on, from all of these links. And again, this will be posted to YouTube eventually, so you can get the links there.
And while Avinash talked about FIS, there are several purpose-built services for resilience at AWS that you should be aware of. Resilience Hub is another great one, along with AWS Backup, Elastic Disaster Recovery, and Route 53 Application Recovery Controller. These are all good services you might want to use on your resilience journey.
And with that, we do have some time for questions. I want to make sure you fill out the survey, but we thank you very much. Thank you.