Building Serverlesspresso: Creating event-driven architectures

So today we're going to spend the next hour talking about a real-life production application that my team built. We're going to go through some of the lessons we learned, lots of mistakes we made, and hopefully show you some of the patterns that we think are useful when you're building this type of application.

So my name is James Beswick and I lead the Developer Advocacy team in AWS. Prior to being an advocate, I was a software engineer for quite a few years, and also a product manager for a long time. I'm a serverless geek, so I've built quite a few large-scale serverless applications and I really enjoy this type of technology.

The most useful thing on this slide is the QR code - connect with me on LinkedIn, because long after the conference, when you have questions, this is the best way to reach me. And if you ask me any questions, I'll do my very best to help.

So we've got four things to cover today:

I'm going to give you an introduction to Serverlesspresso, since it's early in the conference and you won't have seen it in the expo hall just yet.

Then I'll talk about the design decisions involved in building this application.

And then we'll go through the lessons learned and some useful patterns.

At the very end, I'll share a resources link that will have the deck and a bunch of other useful things you can download and play with.

So Serverlesspresso is an event-based coffee ordering system, and it was created as an idea during the pandemic. We wanted to build something a bit more real-life that customers could interact with, to see how you could build real serverless applications.

We took two of our great loves in life - serverless and coffee - and put them together. Now, we didn't really know how it would be received. It was on my office floor for a long time, and when we brought it to re:Invent in 2021, we didn't know just how many people would be enjoying this application.

It turned out to be wildly popular. We did 1,920 drinks over the course of three days, peaking at 71 drinks per hour. And since then, we've been at summits all over the world - Berlin, London, Stockholm, Milan - and at GOTO and other conferences too.

And so after doing about 30 conferences, what we found is we do about 1,000 drinks per day, and one day we even had five events in different countries. We had to modify the application to support this sort of SaaS-style operation, where you can serve different countries at the same time. It's really been incredibly popular.

So how does it work? Well, it's designed to be very, very simple. You scan a QR code that's above the barista at the booth, then place an order on your phone, and the order appears on the screens above the barista, on your phone, and on the barista's tablet.

From that point, once the order goes into production, all of the updates appear on those various apps. Let me give you an idea of what these apps look like.

So this is the display app that appears on the screen above the barista. It shows a dynamic barcode that changes every five minutes, and it also shows the order status of whatever drinks are currently in the queue. It listens for global events too - such as the store being open or closed, or any other problems - and we display those on the screen.

The ordering app - this is the one that's in your pocket. When you scan the barcode to start with as a customer, your phone loads a dynamic menu from the back end; the menu can change for various reasons. Once you place your order, it gives you a human-readable order number and tells you where to look for updates.

Next, when the barista pulls your order into production and starts making your drink, you get a notification. And then finally, once the drink is ready to be picked up, you get that very last notification.

And then the third UI is the barista web app. We typically have at least one barista, often two or three. Each barista has their own tablet and their own thermal printer, and they don't really interact with the customers except through this app.

So on the left hand side, they have incoming orders, they can choose to pull those in to make them - that takes them off of the other barista screens. Sometimes they have to cancel them in the event they're out of product and they can take other admin actions as well using this application.

And this also interacts with a thermal printer that prints tickets.

Now, under the covers, these are the services we use to put all of this together. We've got Amplify Console because we've got three front ends - it's a great service for building and deploying VueJS, React, and Angular-type apps. You simply connect a GitHub repo, and the service makes sure you've got a live production site, manages certificates, and handles all the work involved with that.

API Gateway is what connects the front end and the back end, and we're using Cognito to secure that. So when you get to use this, you'll see you sign up with a phone number - it's Cognito providing that authentication.

DynamoDB - all of the microservices involved have their own local storage, which are DynamoDB tables. We also have a main table that contains all of the coffee orders. I think it's got something like 80,000 coffee drinks in there right now.

EventBridge and Step Functions are the main focus of what I'll be talking about, so I'll come back to those in just a few minutes.

IoT Core is how we keep the front ends up to date in real time. When you use this app, you'll see you don't need to refresh - it's all using websockets to keep the data up to date.

And then finally, of course, for any custom compute, we're using AWS Lambda to run our code.

So those are the finished UIs. But there was a lot of build-up to this point: we had to build a back end and make various design decisions before we even got this far.

So before we did anything at all, we sat down as a team and agreed on some design guidelines. We're a team of six or seven people, and we tried to operate like a real software team.

We decided we needed to operate with minimal code. I like this idea because I think code is a liability - especially my code - and having less code makes it more testable and more maintainable. I think extensibility is really important, because you just don't know what you'll need your application to do at the point where you're designing it.

So we needed to come up with an architecture that enabled us to make changes as we went forward. Scalability is similarly important: when all the equipment was on my office floor and not at a conference, I didn't know how popular it would be, and it's very difficult to design scale into your code early on when you don't know what the traffic is going to be like.

And then also cost efficiency - I've been a developer a long time, and it's important to me how you think about the cost of your application. So I wanted to make sure that whatever we built was the most cost-effective way of delivering this app.

And so we had some tenets that went along with this. We said that each team member is responsible for one component, and there was no implementation sharing - so I couldn't just look at other people's code to see how their component worked.

Each microservice had to have its own API, and had to produce or consume events to participate in the system.

Before doing anything, we found a whiteboard in a conference room and mapped out what this application needed to do, and it came down to about seven different things.

First of all, we knew the customer had to scan a barcode. At that point, we need to check if the store is open. The reason is a race condition: if we've just closed the store but someone manages to sneak an order in at the last moment, we want to check on the back end that the store is in fact still open.

We want to get the barista capacity, and that's important because if there are too many orders in the queue, we don't want to just add another order.

Then at that point, we want to wait for the customer order. So in the UI we give them up to five minutes to make a choice about the drink. And at that point, if they don't make the choice, we time out and we return the token back to the queue and allow somebody else to make an order.

Then we generate the human-readable order number. So instead of the GUIDs that we pass around internally, customers see sequential numbers - Order 1, Order 2 and so forth.

Then we wait for the barista. Now this is a tech demo. It's not a real coffee shop. So we decided that 15 minutes is probably the maximum amount of time. If we have to wait that long, something's gone wrong. So we should cancel.

And then finally, we need to handle situations like cancellations: customers can cancel drinks because they don't want to wait, or a barista may have to cancel if they're out of product. So there are some corner cases there.

So once we had this whiteboard, we then wrote it all out in pseudocode, thinking about what sort of Lambda functions you'd need to implement some of these requirements.

And so on the left-hand side, we've got this order acceptance function. In our pseudocode, this checks if the store is open and checks the barista is not too busy. If that's the case, it saves the order to DynamoDB; otherwise it rejects it.

On the right-hand side, we look at dealing with those timeouts. We've got a bit of pseudocode that gets a list of all the open orders and iterates through the list. If too much time has passed since an order's last action, it cancels the order, and so on and so forth.
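
Here's a rough, runnable reconstruction of that timeout-sweeper pseudocode - the table name, field names, and five-minute limit are illustrative, not the actual code:

```javascript
// Rough reconstruction of the timeout-sweeper pseudocode we abandoned.
// Table name, field names, and the 5-minute limit are illustrative only.
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, ScanCommand, UpdateCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const FIVE_MINUTES_MS = 5 * 60 * 1000;

// Invoked once per minute by a scheduled rule.
export const handler = async () => {
  // Problem #1: this scan gets slower as the number of open orders grows.
  const { Items = [] } = await ddb.send(new ScanCommand({
    TableName: 'coffee-orders',
    FilterExpression: '#s = :open',
    ExpressionAttributeNames: { '#s': 'orderState' },
    ExpressionAttributeValues: { ':open': 'OPEN' },
  }));

  for (const order of Items) {
    // Problem #2: resolution is limited by the once-per-minute schedule.
    if (Date.now() - order.lastActionTimestamp > FIVE_MINUTES_MS) {
      await ddb.send(new UpdateCommand({
        TableName: 'coffee-orders',
        Key: { orderId: order.orderId },
        UpdateExpression: 'SET #s = :cancelled',
        ExpressionAttributeNames: { '#s': 'orderState' },
        ExpressionAttributeValues: { ':cancelled': 'CANCELLED' },
      }));
    }
  }
};
```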

And as we were writing this, we realized it was completely the wrong approach, because what we're really doing here is building a state machine, and there are a few problems with that.

The first is that, on the right-hand side, as the number of open orders grows, the process of iterating through the list gets slower and slower. That's not very scalable, and it doesn't meet one of our tenets.

Also, this runs once per minute, because we're using a scheduled rule to invoke the Lambda function. But what happens if we need to run it every 15 seconds, or some other time period? That doesn't really work either - it's hard to do.

And so we realized this wasn't the right approach and we very quickly stopped trying to code a state machine solution.

And instead what we did is we went towards Workflow Studio. This was a fairly new feature in Step Functions at the time, and it's really interesting because we could take those steps on the whiteboard and drag and snap the various services onto the canvas to represent what the workflow should do.

It took us about a couple of hours. But at that point, we could put a test payload into an execution, see which way the workflow would flow, and make sure it was doing what we expected.

We could replace all of that spaghetti pseudocode with this very elegant, very simple way of representing the workflow.

And once we finished building with that, we ended up with version one of this workflow. You can see the very first step is: is the store open? If so, take a decision; otherwise check the capacity, and so on and so forth.

Now, any time there's a problem - like the store isn't open - we branch off to a step where we emit an event saying the store isn't open. What happens then? We don't know yet. That's a problem for future James. It's something we want to handle, just not at this point.

So we can use those events later, when we need to handle those types of exceptions. We also like this design because we can pause any time there's a timeout and we need to wait for a customer or a barista.

"So five minutes, 15 minutes, we can use a task token to pause the workflow and wait to resume. And so this is kind of an interesting pattern because actually whether you're waiting for five seconds or nine months, the workflow will do this for you. You don't need to have some sort of compute running periodically to check the status of your time out. So that again, that comes back to being very low code.

Now, in this first version, we're still using Lambda functions for some of the custom logic. We figure out the capacity status through Lambda using the SDK, and we've got a step that creates an order number using a sequential counter in a DynamoDB table. So this was version one of the workflow. We like to think about it as a conveyor belt: essentially, the order comes in, we wait for the customer to decide what the order is, then we wait again for the barista; it handles a couple of exceptions and emits events as needed.
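
That sequential counter is a classic DynamoDB atomic-counter pattern. A minimal sketch, assuming a hypothetical counters table:

```javascript
// A minimal sketch of the sequential order-number step, assuming a
// hypothetical 'counters' table with a single item holding the last number.
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, UpdateCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export const handler = async () => {
  // ADD is atomic, so concurrent workflows never get the same number.
  const { Attributes } = await ddb.send(new UpdateCommand({
    TableName: 'counters',
    Key: { pk: 'orderNumber' },
    UpdateExpression: 'ADD #c :one',
    ExpressionAttributeNames: { '#c': 'current' },
    ExpressionAttributeValues: { ':one': 1 },
    ReturnValues: 'UPDATED_NEW',
  }));
  return { orderNumber: Attributes.current }; // e.g. Order 1, Order 2, ...
};
```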

So if you look at the architecture so far, this is what we've got: this workflow on the right-hand side, no front-end apps yet, and something on the left - we know it needs to be there, we're just not sure what it is. This workflow is going to exist by consuming and emitting events, and that's all we have.

So next, we think about how we start the order process, and this is where the QR code came to life. We know that customers are going to come to the booth and we're going to show a QR code. Now, in version one of this, we thought about printing a big QR code and sticking it on the booth. Then it occurred to us that somebody might share it on Reddit during re:Invent, and basically your exhibit gets swamped with orders from people who aren't even here. So this seemed like a bad idea. We needed QR codes that could change.

And so if somebody was to share them on social media, our maximum risk is really just a couple of minutes. This is where the idea of the QR service was born. We set five minutes as an interval, and we allowed 10 drinks per code. And because this is event based, it occurred to us pretty early that we could do something clever here. As you might know, serverless applications scale up incredibly quickly, but often things downstream don't scale up nearly as fast - whether that's a database, various legacy systems, or, in our case, a human barista. You have to make sure you protect those systems from the scale-up, and we could do that here. Essentially, we could count the number of scans coming into the system, and once we'd hit a certain number over a period of time, we could take the barcode away. And every time a code is scanned successfully, this microservice emits an event that starts the workflow we designed earlier.

In terms of where this service lives on the architecture, it's on the left-hand side of the bus. It's got a REST API so the front ends can reach the service, and otherwise it emits events to other parts of the system. Originally we used Postman for testing, but very quickly we decided we wanted to build a couple of skeletal UIs just to get a sense of what the flow was. So the ordering and display apps were the first things we put together.

Now inside this service, there are two main methods that make it work. You've got the generate method, which is an admin method that's protected, so it can only be used by the admin applications we provide. This is what's responsible for generating the QR code's random number - and it's just random, there's nothing particularly special about it. The Lambda function creates the random number, and you can call it multiple times over a five-minute period: it breaks the day into five-minute buckets and stores that random number in a DynamoDB table.

And then the validate side does the same thing in reverse. When you scan the QR code with your phone, your phone decodes it and passes that random number along with the API call. The Lambda function is just checking that the number you've provided is in the table, and if it is, it raises an event saying it was a successful scan.

So this is what it looks like in that DynamoDB table. Each row is a five-minute window, and the latest code is indexed with a GSI - a global secondary index - on the table so we can query it very quickly. Then we've got a number of available tokens, which we set to 10. Each time the code is scanned, that's decremented in the table, and at the point where it hits zero, the code is no longer valid. Five minutes later, once the next code is retrieved, that adds another row to the table.
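
A sketch of that validate path, assuming a hypothetical table keyed by the five-minute bucket - the conditional update is what rejects a scan once the tokens run out:

```javascript
// A sketch of the validate Lambda, assuming a hypothetical 'qr-codes' table.
// The conditional update rejects the scan once the 10 tokens are used up.
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, UpdateCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const BUCKET_MS = 5 * 60 * 1000;

export const handler = async (event) => {
  const bucket = Math.floor(Date.now() / BUCKET_MS); // current 5-minute window
  const { code } = JSON.parse(event.body);

  try {
    await ddb.send(new UpdateCommand({
      TableName: 'qr-codes',
      Key: { bucketId: bucket },
      // Only succeeds if the code matches this window AND tokens remain.
      ConditionExpression: 'codeValue = :code AND availableTokens > :zero',
      UpdateExpression: 'SET availableTokens = availableTokens - :one',
      ExpressionAttributeValues: { ':code': code, ':zero': 0, ':one': 1 },
    }));
    // Valid scan: emit the event that starts the order workflow (not shown).
    return { statusCode: 200 };
  } catch (err) {
    if (err.name === 'ConditionalCheckFailedException') {
      return { statusCode: 403, body: 'Invalid or exhausted QR code' };
    }
    throw err;
  }
};
```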

So at this point, we've got this endpoint that handles the IDs and scanning of QR codes, and it can trigger an order to start. But now we need a way to interact with the order, and there are several things we need to do. We need to make sure that customers can update and cancel orders, and that baristas can also cancel or complete orders. The task tokens we generate in that workflow whenever we have to wait need to be stored somewhere. And the display app and barista app need a list of open and completed orders to show on the displays. So we had a few questions here that we weren't sure how to solve.

First: should we build a monolithic workflow? This came up very early on. We got back to the whiteboard and charted out literally every possible variant of what could happen between a customer and a barista, and it ended up with hundreds of steps. We stopped - I think we'd had too much coffee at that point - and said, OK, this isn't going to work, because the whole point of doing this is to get away from monoliths, not to build another monolith. So we intuitively felt that wasn't the right way to go.

Then we considered keeping those task tokens on the client. There's nothing particularly special about a task token - it's just related to your cup of coffee and your order. So maybe we pass it back to the client and let the client manage when it gets returned to the service. But that doesn't really feel right, because you're taking a piece of back-end data and keeping it on the front end. If customers change browsers or clear caches, it's gone - it's not really that great.

And in terms of querying all open workflows for the list of open orders - do we go to Step Functions, pull a list of all the executions that are currently running, and pull out some sort of metadata? The problem with that is it gets very slow: as you've got more and more open orders, that process slows down. So again, not really a great way to approach the problem. All these problems came together at the same time, and that's really how we ended up with the idea of an order manager microservice.

And so this order manager microservice solved all of these main problems. It's got a REST API endpoint for the UIs to communicate with, and again, it consumes and emits events to the main bus. Whenever the workflows emit those task tokens to the event bus, it catches them and stores them in a DynamoDB table. Inside, this order manager is really just a standard serverless microservice using API Gateway, Lambda, and DynamoDB, listening to all the events flowing through about your orders and storing them in its table.

So any time you need to get a list of open orders, we can query a GSI in DynamoDB very quickly. Even today, with those 80,000 orders in the table, if we query for open orders we typically get the result back within a couple of milliseconds. It doesn't slow down as the number of orders grows.
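
A sketch of that open-orders query, assuming a hypothetical index name:

```javascript
// A sketch of the open-orders query, assuming a hypothetical GSI named
// 'GSI-status' that keys orders by their status.
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, QueryCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export async function getOpenOrders() {
  // Because this is a Query on an index, not a Scan of 80,000 items,
  // it stays fast no matter how large the main table grows.
  const { Items } = await ddb.send(new QueryCommand({
    TableName: 'coffee-orders',
    IndexName: 'GSI-status',
    KeyConditionExpression: '#s = :open',
    ExpressionAttributeNames: { '#s': 'orderState' },
    ExpressionAttributeValues: { ':open': 'OPEN' },
  }));
  return Items;
}
```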

Now, the order manager microservice in version one looked a little bit like this. Each of the main functions is handled by an API Gateway method backed by a Lambda function. What we quickly realized was there's a lot of duplication. When you cancel an order, it updates a DynamoDB table and resumes a workflow. When you complete an order, it updates the DynamoDB table and then resumes a workflow. When you make an order, and so on and so forth. You start to realize this is becoming a tightly coupled code base with a lot of duplication - the same code in different folders. When you're sharing this with your team, it's a little bit awkward, and it's fairly complicated for something that's largely doing something fairly simple.

So we made a couple of revisions to this. The first was using direct integration: in cases where we had routes that were just reading or writing to a table, we didn't really need the Lambda function there, so we were able to take it out. To do that, we used something called VTL - Velocity Template Language - which API Gateway supports, and which enables you to map the incoming request to the attributes in the DynamoDB table.

This can be very useful. Perhaps not so much in our case, because coffee ordering isn't that high-volume, but where you have microservices handling millions of requests, this can give you a lot of scale because you're taking a service out and mapping directly between API Gateway and DynamoDB. API Gateway natively starts at 10,000 TPS, and many customers run much higher than that; with DynamoDB, you simply turn up the RCUs and WCUs. By putting those services together, you can get massive scale. It also reduces latency, because you're taking one service out of the path. So this can be a really interesting solution for microservices that are largely doing CRUD-type operations.

Now, our microservice also has a couple of steps that are write-based and a bit more complex. When you submit an order, we have to make sure the order you provided is actually available. Because the menu is nested JSON, we've got a Lambda function that compares what you've provided with what's in the currently supported menu, just to make sure it's there. But we realized we could replace that step with a Step Functions workflow. So we took apart that put operation and replaced it with a workflow. And we decided not to stop there - we thought this whole microservice could be a workflow. We removed everything we'd done so far and built one single Step Functions workflow where the first step decides whether the operation is a create, read, update, or delete, and takes the next step accordingly. You'll see in this workflow there's now only one Lambda function; the rest of it is all just state transitions. And there are interesting benefits to this.

You've got essentially your entire microservice represented by a Step Functions state machine. And versioning becomes very easy, because you can use Step Functions' versioning as part of the service.

This approach might reduce your cost as well. If you're running lots of traffic through lots of Lambda functions, this could be a very cost-effective alternative.

So at this point, we have a way to start the order workflow, we interact with the order, but we've really got no way to keep the front ends up to date. And so we need to do several things here.

We've got to keep the web apps in sync with the orders. We've got to respond to issues like the store opening and closing. We've got to do all this close to real time, so customers get a sense of what's happening as events change. And we have to be resilient enough to handle network dropouts, which can be a problem at conferences.

So we had to make sure our system was resilient to that. With this notification problem, we again came to a bunch of questions that we didn't immediately know the answers to:

  • Should we create polling APIs? We'd create a secondary API where we provide the transaction ID to see what the state of the order is, and just have the front ends pinging these APIs.

  • What about not doing it in the front end at all? We could use SMS, since all the customers are on their phones. We can simply have them receive a text message when the drink is ready, that could work.

  • And what about using API Gateway websockets? That's a possible option too.

So we evaluated all of those. Thinking about this problem a little bit more abstractly, this is the problem you're trying to solve when you're thinking about these notifications.

It's the state problem in distributed systems: you've got a client, the client makes a request to Service A, Service A turns around to Service B and does something, and then Service A only sends an acknowledgement back to the client. The client actually has no idea what happened to the transaction - did it work? Did it fail? Who knows.

So how can we address this problem? Well, the traditional method for this is polling, which has been around since the dawn of computers. The idea is you've got this secondary endpoint that you're pinging to see what the state of your transaction is. And this is actually the technology equivalent of having your kids in the back of the car saying "Are we there yet? Are we there yet? Are we there yet?"

The problem with this is that if you have a randomly distributed series of events and you poll every 60 seconds, on average you're going to be out of date by 30 seconds. What can you do? You could poll every 30 seconds, and now you're out of date by only 15 seconds on average, or poll every 10 seconds. And we see customers doing this - basically tightening the polling window all the time, increasing their compute cost because they're making all these empty calls with nothing coming back. And you're still not real time; you haven't solved the problem.

So we didn't like this approach. We think polling is something you should generally avoid. The real answer is to use websockets, because if you think about your customers using apps these days, they've got email clients, maps clients, social media apps - they're all essentially approximating real time. This is the standard you have to meet when you're building these web apps.

So we knew that we wanted to use websockets. But websockets are a little bit hard to use, because of the way they work: the front end makes the connection and the back end never closes it, with the server sending information to keep the connection alive. It's very clever, but it's not easy to manage yourself.

We were very happy to find a middle ground that enabled us to do both - and that was to use IoT Core. You can think of IoT Core as being like AWS-managed websockets: essentially, we could pull it into our service and have it do the websocket management for us.

This is a good candidate any time you've got partial information coming back from your back end, where you might want to provide updates. Think about rideshare services: when you make a request for drivers and you see drivers appearing on your map, the app isn't pinging an API - it simply makes a subscription, and as drivers come into the zone, they appear on your device.

So this is a very elegant solution to the problem. Essentially, the IoT Core approach for real time uses the MQTT protocol, which is based around topics. First, the front end makes a subscription to the topics it's interested in. Then we've got this publisher microservice in the back end that's listening to events on the event bus, and the ones it's interested in, it passes along to those topics.

And then these messages are categorized to those topics. Here's one of the great things - if a service handles fanout for you, I think that's a win for developers. If you think about having thousands of front ends out there where they're listening to information, this is a hard problem to solve by yourself, whereas if the service basically fans out and gets that data to all those clients for you with very little code, that's great.

The reason is this is a more difficult problem than it sounds, because on mobile phones, apps can stop for any reason, customers can lose network connection for any reason. So you want to use a service so that when they come back, they can catch up on messages. And this is something that IoT Core can do for you.

So there's lots of great functionality you can use here, and again with very little custom code. To give you a sense of what the custom code looks like, we're using VueJS, but if you're using anything like React or Angular, this is very similar.

I've omitted a little bit of code at the top that sets up the connection, but you can see that on the repo I share at the end. These are the key things - the handlers. You've got three handlers you have to care about (there's a sketch after this list):

  1. When you first connect, that's just making sure you've got a connection. Always good to check that.

  2. You want to check if there's an error - this should only happen if there's a service disruption or some other problem. But you want to make sure you manage that too.

  3. The third one's the really important one, and this is when an event arrives. So in this case, a cup of coffee has finished being made, the event is passed through the event bus, it comes through the publisher microservice, and it lands here. All we do is unpack that JSON and then update our app.
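
Here's a minimal sketch of those three handlers using an MQTT-style client. As on the slide, the connection setup (the IoT Core websocket endpoint and SigV4 credentials) is omitted; the topic name and `appState` model are hypothetical:

```javascript
// A minimal sketch of the three handlers, mqtt.js-style. Assume `client`
// is already configured against the IoT Core websocket endpoint.
client.on('connect', () => {
  // 1. Connection established - safe to subscribe to our topics.
  client.subscribe('serverlesspresso/user/12345'); // hypothetical topic name
});

client.on('error', (err) => {
  // 2. Should only fire on a service disruption or network problem.
  console.error('Realtime connection error', err);
});

client.on('message', (topic, payload) => {
  // 3. The important one: an event has arrived from the publisher service.
  const event = JSON.parse(payload.toString());
  // Update the reactive data model; Vue re-renders the UI for us.
  appState.orders = event.detail.orders;
});
```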

Here's the really cool thing - if you're using VueJS or React, all you have to do is update the component's data model with the new state and the UI takes care of itself. So those UIs I showed you earlier, with the numbers whizzing across the screen - we didn't write that code. We're just using the framework to enable those real time updates.

So it's a really good way of getting real time events and keeping that event type formatting from the backend all the way through to the front.

And so this is the publisher microservice that we slotted into the architecture here. Amazingly, it's only about 20 lines of code. Its job in life is to be triggered by an EventBridge rule, decide which events it cares about, and route those to topics in IoT Core.
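
As a rough sketch of what those 20 lines might look like - the topic names and routing logic here are illustrative, not the production code:

```javascript
// A sketch of the publisher microservice, with hypothetical topic names.
// It's triggered by an EventBridge rule and relays events to IoT Core.
import { IoTDataPlaneClient, PublishCommand } from '@aws-sdk/client-iot-data-plane';

const iot = new IoTDataPlaneClient({});

export const handler = async (event) => {
  // Route on detail-type: per-user updates vs. global announcements.
  const topic = event['detail-type'].startsWith('OrderManager.')
    ? `serverlesspresso/user/${event.detail.userId}`
    : 'serverlesspresso/global';

  await iot.send(new PublishCommand({
    topic,
    qos: 1, // at-least-once delivery
    payload: JSON.stringify(event),
  }));
};
```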

Now, one of the really interesting things here is that you can combine approaches - you don't want to completely ditch APIs and do everything through a websocket. I've seen some customers try that. Websocket messages are very lightweight - essentially just deltas, changes in the application.

But remember, if you've got a mobile app or a web app, it can essentially restart at any time - it's nondeterministic when it restarts. So when your app starts up, it's a good idea to use an API call to get the state of the world: the list of all your orders, everything you care about, in that initial API call.

And at the same time, you make the websocket subscription to get those deltas. It's this combination - when you put the two together, your application stays up to date as changes happen.

An interesting thing happened here at re:Invent, back when we obviously hadn't built out all of the application functionality we have today. We ran out of soy milk - everybody in the first year was ordering soy milk - and we realized we needed to change the menu. Of course, we didn't have that functionality built yet.

And because we don't know anything about coffee shops, what we did is we went to the backend and we changed the menu service and updated the table directly, removing soy milk. And we were absolutely shocked when we walked outside to the booth and saw everybody's phone - the soy milk had disappeared!

I think at the time we claimed it was intentional, but it really wasn't. Essentially what had happened was exactly this: the front end apps had made a request for the menu; when the DynamoDB table changed, it emitted an event through the event bus, which got picked up by the publisher service and fanned out to all of those front ends in the crowd.

So you get some really interesting benefits with this type of architecture. At this point, we've got this final architecture and this is the complete loop. You can see the front ends on the left, they're making these REST API calls coming in through the top.

The microservices are doing whatever they need to do there, passing off events to the event bus. The order processing workflows on the right hand side are orchestrating each one of those workflows, events coming back into the bus, filtered by the publisher service, and then pushed out to those front ends.

So you have this flow all the way around. One of the interesting things here - and we didn't really think about this at the time - is that those microservices don't know anything about each other at all. They're completely decoupled. Each one simply consumes and publishes events and has no idea it's part of a coffee ordering process. So the parts of this system really are totally decoupled.

Let's look at the software lessons learned. I think in any software project you make a lot of mistakes and learn a lot of lessons for next time - in the 20+ years I've been building things, I still make loads of mistakes. So I'd like to look back and figure out what we could have done better, and hopefully some of these are things you can bring to your own projects as you start to build event-driven architectures.

What should events contain? Producers create events, consumers have no control whatsoever over what's in an event. On the screen on the left, you can see this is what the event looks like - it's just JSON. So if you can write JSON you can write an event.

The envelope attributes that are shown on the screen - source, account, time, region - those are provided by the EventBridge service. You have no control over those, they'll just always be there. But everything in blue is yours - the detail, the detail type, and the source.

It's entirely your choice, and that's good and bad, isn't it? Because different teams can do different things with what goes in those blue sections. One of the first questions we came to as a team working in a decoupled way was: should you produce fat events or thin events? What do I mean by that?

Well, a fat event would be: this is order number two, it's a cappuccino, it's for James, it costs $4.50 (well, not in Vegas, but $4.50). A thin event would be: it's order number two - and it's on the receiving service to then look up what that order is. What's the right approach here? It honestly depends on what you're building, but it's something you need to think about fairly early on.

Versioning is another factor, because these events change as you produce them, and because you have no idea who the consumers are - somebody might be listening, somebody might not - you have to assume the moment you publish an event that somebody is listening. And if you're going to change the event, you probably want to provide some sort of versioning information so you don't accidentally break consumers downstream.

So events are immutable, observable, and temporal. And what events should you generate? Here are the events I pulled out from the system towards the end - there are about 15 or so. The interesting thing is that some of these aren't used yet, and that's an important point when you're working with events: publishing more events than you currently need helps future you and your team build functionality you may not know you need at this point in time. This is very different from the API world, when you think about it. You can generate all these events easily - Step Functions makes events, your workflows make events. When you build things with APIs, you typically don't build an API with the view that no one's going to use it. Whereas with events, you might well publish events that never get used. It's a very different problem.

Naming conventions also become fairly important early on, because an event can be called anything you like. We decided early on a format that worked for us: the microservice name, dot, the thing that happened - and the past tense was important. Events are immutable, so being in the past tense really represents that: this thing happened and it's never going to change. Just because a coffee order gets canceled doesn't undo the original coffee order event - you've just now got two events.
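
Putting these last few points together, here's a hedged sketch of emitting an event under that convention - a fat, versioned event with a past-tense name. The bus name and field values are illustrative:

```javascript
// A sketch of emitting a fat, versioned event. Names and values are
// illustrative, not the actual Serverlesspresso event schema.
import { EventBridgeClient, PutEventsCommand } from '@aws-sdk/client-eventbridge';

const eb = new EventBridgeClient({});

await eb.send(new PutEventsCommand({
  Entries: [{
    EventBusName: 'serverlesspresso', // hypothetical bus name
    Source: 'serverlesspresso.orderManager',
    DetailType: 'OrderManager.OrderCancelled', // microservice.ThingThatHappened
    Detail: JSON.stringify({
      version: '1.0',       // lets consumers survive future changes
      orderId: 'order-2',
      drink: 'Cappuccino',  // "fat" event: consumers don't need a lookup
    }),
  }],
}));
```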

Another early question that came along was in EventBridge: do you need one bus or two? You already have one default bus that's on in every region of your AWS account, and that's where AWS events get thrown from all the services you use today. So you've likely got a bus full of events going on already. But the question is, do you want to add more events to this bus, or have a separate bus?

Now, if you add more events to the default bus, you're arguably helping with extensibility, because other teams in your organization - other people who can see the bus - have more access and visibility into what you're producing, and that could be helpful to them. Of course, that might not be what you want. You might have events that, for security reasons, you don't want other people to see. In that case, a second, custom bus can act almost as a security boundary if you decide to use one instead.

There's a bit more to this, because EventBridge has a lot of security controls that let you manage who sees which attributes. But largely speaking, that initial decision is about this choice: do you want to allow other people to see what's going on in your application? That's an early trade-off.

Thinking about discovery of events is something we hadn't considered either. When your app is producing all these events and you've got separate teams building separate microservices, they don't know what events you're producing, and those events are also changing during development. There are a few things you can do here that help. One is the schema registry in EventBridge, which we used. It's easy to use: you simply turn it on for your bus, it captures all of the events coming through, and it builds a very tidy registry of what's going on in your application.

There are also some open source tools like EventBridge Atlas, created by David Boyne, who's on my team now. Before he joined AWS, he built this tool, and when we discovered it, it was great - you can visualize all the microservices you have and the flow of events between them. That was very helpful.

We also decided that documenting early on was important, because it's not always obvious what you mean in your event. If you have a field like price and you put in a value of 4.50, is that $4.50? Is it euros, pounds, or pence? What is it? You realize that actually documenting the event helps consumers understand what you mean.

Testing by event injection was another fun thing. We originally tested by calling the APIs using Artillery, and because these are secured APIs, we were having to create JSON Web Tokens for the testing process to throw all these orders at the API. Then we realized we didn't need to do this: we could simulate the events that the upstream microservices would produce and simply throw those onto the event bus at much higher volume.

So we created a robot tester that did two things. It acted like a customer, placing tens of thousands of orders, and it also acted as a barista, filling those orders as quickly as possible, so we could test the throughput of the system. It was really useful.
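
A sketch of what the customer side of that robot tester might look like - bus, source, and detail-type names are illustrative:

```javascript
// A sketch of the robot tester's customer side: injecting simulated order
// events directly onto the bus. Bus and event names are illustrative.
import { EventBridgeClient, PutEventsCommand } from '@aws-sdk/client-eventbridge';

const eb = new EventBridgeClient({});

export async function injectOrders(count) {
  for (let i = 0; i < count; i += 10) {
    // PutEvents accepts up to 10 entries per call.
    const entries = Array.from({ length: Math.min(10, count - i) }, (_, j) => ({
      EventBusName: 'serverlesspresso',
      Source: 'serverlesspresso.robotTester',
      DetailType: 'OrderProcessor.WorkflowStarted',
      Detail: JSON.stringify({ orderId: `load-test-${i + j}`, drink: 'Latte' }),
    }));
    await eb.send(new PutEventsCommand({ Entries: entries }));
  }
}
```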

There's also a really great feature called archive and replay in EventBridge. Again, you just turn this on for your bus and it will archive all of the events coming through. Why would you want to do this? Well, if you've got a busy production system and you want to make changes and then test what those look like in development, you can take a day of data - live data from the system - from the archive and replay it in your dev environment. It gives you a much better idea of what your changes are going to do.

Another really simple one: you can create one rule that matches all of the events going into the bus and point that rule at a CloudWatch Logs target, so you're essentially logging everything that goes through the bus in dev and test. This is a great way to see what's happening. You probably don't want to turn it on in prod because of the amount of data, but in the development process it saved us a lot of time.
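
A sketch of that catch-all rule created via the SDK (you could equally define it in a SAM template); the empty-prefix pattern is a common trick for matching every event, and the names and ARN here are hypothetical:

```javascript
// A sketch of a catch-all logging rule. Names and ARN are hypothetical.
import { EventBridgeClient, PutRuleCommand, PutTargetsCommand } from '@aws-sdk/client-eventbridge';

const eb = new EventBridgeClient({});

await eb.send(new PutRuleCommand({
  Name: 'log-everything',
  EventBusName: 'serverlesspresso',
  // An empty prefix on 'source' matches every event on the bus.
  EventPattern: JSON.stringify({ source: [{ prefix: '' }] }),
}));

await eb.send(new PutTargetsCommand({
  Rule: 'log-everything',
  EventBusName: 'serverlesspresso',
  Targets: [{
    Id: 'central-log-group',
    // Hypothetical log group ARN - CloudWatch Logs is a native rule target.
    Arn: 'arn:aws:logs:us-east-1:123456789012:log-group:/serverlesspresso/events',
  }],
}));
```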

Now, communication between microservices in EDA is a little bit different and, again, takes a bit of time to get used to. You've got this event flow driving the application: as events are emitted, they trigger other parts of the application - other microservices - and that's what causes things to happen. So the events choreograph the services between them, while Step Functions orchestrates the transactions, and it's important to know when to use each.

We actually got this wrong a couple of times. It's easy to accidentally feel that you need to orchestrate across microservices, because it's very comforting having something that knows about retries and error handling watching those events. But you end up building a monolith if you do this. It's also easy to go the other way and decide you don't need orchestration at all and can do everything with events - but often that doesn't save you any time, and it creates a lot of unnecessary events and a lot of extra code. Knowing where the separation is can save you a lot of time and reduce the amount of code you create. The microservices on the left are completely decoupled and emit events with no knowledge of each other, and that's another really powerful part of building with EDA.

By the end of building this application, we had about a dozen microservices, the Step Functions workflow, and all the other parts. Originally we had one AWS SAM template - if you haven't used SAM, it's an infrastructure-as-code framework, a bit like CDK or Terraform. We ended up with this massive template, and the problem was that occasionally, when we made certain changes and deployed them, it would do things like tear down resources we didn't want to tear down, like the Cognito user pool or the event bus.

So instead, we came up with the idea of a core template containing the main things that would never change - the bus name, the Cognito setup, and so forth - and then each microservice ended up with its own template, so we could deploy them separately. That made it much simpler and much faster to deploy changes to the application. We did the same thing with the order processing workflow, which again has its own SAM template. In this type of design, it was worthwhile breaking up these monolithic deployments.

Now, there are a few useful patterns that emerged that we think we can use in future projects, and hopefully you'll find some of them useful as well. You might have heard of CQRS - Command Query Responsibility Segregation - which is a very long-winded way of saying that you separate the paths that update data from the paths that read it. This is very useful when you're building distributed systems or systems that have to scale.

We did this a bit by accident, by having those APIs coming in on the input side and then, on the way out, going out over websockets. I've not actually seen this in other applications, so I'm not sure if we're the first to have come up with the idea, but creating CQRS over websockets with web apps afforded us quite a lot of flexibility. It can be a very useful design when you're trying to update everyone about something happening.

One example of this is when the coffee store opens or closes. Remember, we've got those barista tablets, and the baristas can hit the open or close button when they want to change the state of the store. So how would you do this? Imagine you've got three tablets, 100 people outside, and the screens - how do you get those messages out to everything to indicate the store is closed? This architecture makes it a very easy problem to solve, because the barista tablet where they close or open the store simply makes a call to a config service to make the change.

At that point, the service updates the DynamoDB table, the events travel around these various systems and get published out, and the tablet that made the original request to open or close the store also receives that event. Until that happens, it sits in a pending state, waiting to find out if the open or close happened. It's a very elegant way to solve this type of problem.

Now, the way it works under the hood is worth taking apart in a bit more detail. With CQRS, we've got these events coming through and this publisher microservice - the Lambda function routing between what it sees in the event and the topics in the IoT Core configuration. This works with a very large number of subscribers, and it's actually a very cost-effective way to push those messages out.

What we found was that in applications with both customers and admin users, certain topic configurations start to appear, and it becomes a pattern worth being aware of. You've got this first type of topic, which is just for the user - this is your cup of coffee. When your coffee is ready, the message goes to a topic that only you can see, secured using IAM when you first log in with Cognito. So the messages coming to you, and just you, are secure. That's useful when it's something only one person cares about in your application.

Then you've got the second type - topic B - which is more general. These are used by the admin apps, like the TV app and the barista app, where they need to know about changes across multiple orders. Here we can use wildcards in the topics so we can pass along those messages in bulk. So if all of you have coffee orders, the TV screen gets a notification for each one of those changes, while the individual applications on your phones don't receive those messages. This isn't a customer channel, it's simply an admin channel.

And then you've got the third type, which is the global setup, where absolutely everything that's connected needs to know. These are for emergencies or changes - the store opening or closing, something of that nature - and everybody on the system, whether on a customer or an admin channel, subscribes to these.
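
As a sketch, the three topic types might look like this on the subscriber side - the topic names are hypothetical, and `client` is the same MQTT-style client from the earlier snippet:

```javascript
// A sketch of the three topic types, with hypothetical topic names.
// MQTT wildcards: '+' matches one level, '#' matches all remaining levels.

// Topic A - per-user, IAM-scoped to just this customer:
client.subscribe('serverlesspresso/user/12345');

// Topic B - admin apps use a wildcard to see every order's updates:
client.subscribe('serverlesspresso/orders/+');

// Topic C - global announcements every connected client subscribes to:
client.subscribe('serverlesspresso/global');
```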

There are a couple of other things we can do. We can use the rules on the EventBridge bus to route very cleanly to those topics. We want to stay true to the JSON format that's going around, so we simply map it, pass it along to those topics, and let the front ends unpack the information they receive.

We can also use content filtering rules and wildcards in EventBridge, so we can pass along large numbers of messages - such as the order subscriptions - without setting up custom rules for every single event type. That can save a lot of time too.
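
For example, a single content-filtering pattern like this hypothetical one can match a whole family of order events for the publisher rule:

```javascript
// A sketch of a content-filtering event pattern: one rule matching a whole
// family of order events. Names and values are illustrative.
const orderEventsPattern = {
  source: ['serverlesspresso.orderManager'],
  'detail-type': [{ prefix: 'OrderManager.' }], // prefix match on the name
  detail: { orderState: ['OPEN', 'COMPLETED', 'CANCELLED'] },
};
```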

So, orchestration and choreography are two words that typically strike fear into people because they're easy to mix up and get the wrong way around. But actually, as concepts, they're really important to know as you build out these types of applications.

Orchestration is what you want when you've got microservices where you need very tight control over what's happening. Think about something like a payment service: if you're processing a credit card, lots of things can go wrong. You want to be able to retry and have very elegant error handling; you don't want to accidentally charge a customer twice. It's a great idea to use an orchestration service to manage that type of operation.

Choreography, on the other hand, is most of the time what you'll use for communication between microservices. The more you apply orchestration across multiple services, the more you're creating something monolithic. So generally, the idea is that orchestration exists within the microservice boundary, and choreography exists between microservices. If you just remember that, it's an easy way of thinking about how to put them together.

But if you do it right, what you end up with is a very low code implementation that can save you a lot of time as you add new microservices and make changes to your application.

Now, we took this one step further. We had all these microservices being developed by our team, but as we rolled out the application to different conferences in different parts of the world, other AWS teams wanted to add extra functionality. So we thought it would be interesting to add an extension component - something that can listen to the bus and route events to code that we've not written or managed.

And so we created this idea called Serverlesspresso extensions. We documented how it worked and put it on our internal system to allow other teams to build extra functionality for our system. What was amazing is that over the course of a few months, we saw people create extensions like an average wait time - something that just listens to all the events flowing through and works out how long something is taking; I think you'll see that in the expo hall today. Others built ChatOps integrations with Slack and other messaging systems, so that if there were any problems, they could catch them and receive those messages through chat applications. And for business metrics, if you want to get a sense of how the application is performing and what's going on, we saw a lot of custom dashboards being built.

So this became a very powerful way of opening up the application very broadly to people who weren't involved with our team. It also meant they could produce code that wasn't part of our CI/CD pipeline - we didn't need to test it - but it gave them the ability to build whatever they needed to build.

Now, this week at re:Invent, you'll see a new thing the DA team has produced called Serverless Video. This is a serverless streaming application that enables you to broadcast from your phone and watch what else is happening at re:Invent. It'll be in the expo hall, and you can take part and broadcast videos with each other.

The interesting thing about this application is that we started building it about three or four months ago, and we weren't originally intending to build an EDA-based app. But very quickly, a lot of the lessons we learned from building Serverlesspresso became useful here, and interestingly, the end architecture looks a lot like Serverlesspresso.

We have a similar sort of thing, with the microservices on the left receiving information via REST APIs and pushing information back out over websockets. We've actually got WebRTC and a couple of other things too, but that's all on the left.

We've got this event bus in the middle that's essentially routing and coordinating all the traffic between these different services. And then on the right-hand side, you've got the orchestration pieces, where we've got a plug-in based system that's all orchestrated with Step Functions.

So, almost by accident, we found a pattern that's very useful for building a very, very different application. Apart from the fact that the architecture looks like this, it really has nothing to do with ordering coffee.

Also, if you're moving to the event-driven world - and I think probably many of us are - there are a few things that are a bit awkward that you need to get your head around when you start building with EDA. In API-based systems, you often see these types of diagrams for something like an order processing architecture, and they look very tidy because they're showing you the happy path.

One service calls the API of the next service, which calls the API of the next service, and so forth. But when you have problems, it doesn't look like this anymore: you end up with these microservices calling backwards, upstream, to other APIs, trying to make changes whenever something goes wrong.

And what you find with these API-based microservices is that you've built a monolith - all these very tightly bound microservices that you need to coordinate when something goes wrong. As the complexity grows in your API-based system, and you have downstream and upstream consumers working with you, you become part of this web where other APIs are also calling into you, and you'll start to find the system gets more and more fragile and less and less agile.

Now, if you take this type of API system - which I've seen many times in things we've built - and move it to the EDA-based world, it actually looks like this. And this is fascinating to me, because all of those services are still there in this diagram, but they're just communicating through the bus.

If you have the same problem where, for example, the fulfillment service can no longer fulfill, it simply raises an event to indicate it's got that problem, and other services can listen to those events and take action as needed.

Also, as you add more microservices, you're not having to think about upstream and downstream APIs. You can simply add them as consumers and producers of events, with their own APIs as well. Suddenly it becomes much more extensible - and in all the time I've been building these things, I think this is the magic of EDA.

As developers, I think the hardest problem we have to deal with on a daily basis is the fact that you don't know what you'll have to build three or six months from now. You think you've got the requirements, but things change so fast. So you're trying to build architectures that give you the extensibility to keep up with customer demand six months from now. This is something EDA gives you a lot of power over as a developer, and I became a convert when I saw it.

One example of this came about a week before we went live at re:Invent. We had a request to change the application to build an order journey report, which shows every microservice your coffee order went through as it moved through our system. We panicked, thinking this was a disaster - you'd have to change the application a week before re:Invent and retest everything. What do you do?

And then we realized: no, we don't, because it's EDA - it's actually very simple. We simply built the order journey microservice and added it on. It listens to all the coffee order events coming through and creates the report. The only code change we made was to put a button on the front end to access that report. So it really does give you much greater flexibility.

There's also this CRUD workflow using Step Functions, which we tested in a microservice. I think it could be very useful for many CRUD-based microservices. Why is that? Well, there are several things you can do.

You've got one API Gateway endpoint that starts the service. The first step decides which CRUD action it's taking; then you use direct service integrations wherever you can and only use Lambda functions where you need custom code. This can help reduce the complexity of your application, because you've got one ASL (Amazon States Language) file for your whole microservice - you can reduce a massive amount of code down to one single file.

It can make your monitoring and debugging that much easier, because you can see very clearly in Step Functions when something has gone wrong, and you can use the versioning built into the service. It may also be very cost-effective if you're pulling lots of custom compute out and doing the work directly in your workflow. You might also use Express Workflows to make it run even quicker for very high throughput, which could also reduce your cost. Something definitely worth thinking about. And using workflows to manage waiting was an early win.

I think it's something we've all built and written in code somewhere: some sort of time-based system that polls to see if you've timed out on something. Whether you're waiting nine months, six months, or 15 seconds, you have a job that runs, checking to see if something needs to time out. You don't need to do that. You can have a workflow manage the whole thing for you: simply start the workflow and tell it to wait. A workflow can wait up to one year, and you still only pay for one state transition.

So it can be a very cost-effective way of managing waiting at scale as well. The way it works is that the workflow passes the task token out; then, in your code, when you're ready to resume, you pass that token back into your Step Functions process and it restarts the workflow - up to one year later. You can use the SDK or the API. Very simple, very low code.
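
A minimal sketch of that resume call, assuming the task token was stored earlier by the order manager:

```javascript
// A sketch of resuming a paused workflow, assuming the task token was
// stored when the workflow emitted it.
import { SFNClient, SendTaskSuccessCommand } from '@aws-sdk/client-sfn';

const sfn = new SFNClient({});

export async function resumeOrder(taskToken, order) {
  // One call restarts the workflow, even if it's been waiting for months.
  await sfn.send(new SendTaskSuccessCommand({
    taskToken,
    output: JSON.stringify(order),
  }));
}
```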

I said at the very beginning that we were thinking about cost very seriously when we built this application. I wanted to be able to stand here today and tell you we built something cost-effective, and I think we managed to do that.

If you think about this being a highly available, scalable application that supports thousands of drinks, using the serverless approach was very cost-effective. We ran this for a month, went through our AWS bill, and took apart what it cost by service - and this is what it looked like.

On a day where we did nearly 1,000 drinks, we found that SNS was the most expensive service, because it sends the text messages - and however you send text messages, it costs more money than you'd hope. So that's one number. But with the rest of the services all together, we were running the entire platform for less than $1 a day, and on days where we don't have any events happening whatsoever, it doesn't cost us anything at all.

And so I truly believe this is one of the most cost-effective ways of approaching this type of problem.

So this QR code gives you a link to everything in this deck and other resources, including the code of the application, so you can deploy your own Serverlesspresso. But here are some of the things we just talked about today.

So from a design point of view, here's what worked for us:

  • Start with the workflow, then think about the front ends, then build microservices and add them on as needed. That worked really well in this type of project.

  • Have microservices communicate with events instead of private APIs. Private APIs can be challenging to set up because you have VPCs and other considerations, whereas events are private by default. This became an effective way for our microservices to communicate with each other.

  • Wherever you have real-time front ends involved, think about using IoT Core. Many customers don't know about it, but it's a very low-code, scalable, inexpensive way to bring real time to your front end and continue that event journey from the back end through to the front.

  • Combining orchestration with choreography is really kind of magic. It gives you extraordinary capability as a developer to increase the extensibility of your code, reduce the amount of code in the whole application, and reduce your cost.

Now, one other website I wanted to share with you very quickly is Serverless Land. This is managed by the DA team and contains news, blogs, videos, and hundreds of patterns that can help you get started building serverless apps. It's available 24/7 at serverlessland.com. And AWS re:Invent is all about learning - the learning doesn't stop with the sessions.

So think about looking at Skill Builder, ramp-up guides, and digital badges to learn more about how serverless solutions can be built. There's a link at the bottom of the slide, on s12d.com, that shares all of these.

Thank you very much for your time today. It's been an absolute pleasure. I hope you enjoy the rest of your week here at AWS re:Invent. Thank you.
