Using serverless for event-driven architecture & domain-driven design

All right, I think everybody heard the keynote today from Swami. It's quite amazing, right? All the generative AI stuff. This talk is no different. It's still the same thing: innovation, innovation, innovation, right?

So welcome to NT 203, Using Serverless for Event-Driven Architecture and Domain-Driven Design. I'm Balachander, an AWS Solutions Architect and a DevOps/systems engineer by trade. And I'm here with Jonathan Luk, Global Director of Software Engineering at City, and also Lee Gilmore, Global Head of Architecture and Technology at City.

I've had the amazing opportunity to work with the City team. Please give a big shout-out to the City team in the front here. I've been working with them for the past year, and they have been continuously learning and trying to reinvent themselves using AWS services. And today we'll learn a few things about how their journey went in the past year.

So with that, here's a quick look at the agenda:

  • We're going to look at the benefits and advantages of going towards the EDA and DDD approach
  • Jonathan is going to go over City's journey with AWS in the past year
  • Then Lee is going to go over City's architecture

So with that, how many of us here in the room are already users of AWS services or event-driven architecture? OK, that's a handful. That's very encouraging. Because 14 years ago, I was super proud of getting my Linux server installed along with the databases configured in HA mode. That was amazing, back when I was writing scripts. Now it's different. What we can do today with AWS services is transformational. All I have to do is focus on my business logic.

So with that, in this EDA-driven era, we as enterprises are continuing to transform our businesses, and in this day and age there is an even greater need to reinvent ourselves. So how is AWS going to help you in this journey? If we think about reinventing ourselves, the first thing we have to think about is: what is the value we are providing to our customers? It's a design question even before technology comes into the picture. And to get to the design question, you first have to think about events. What are events? Events are pretty much everywhere, and to innovate you need to completely understand not just your own product but your business domains, and all the events within those business domains.

Later on, we will touch a little bit more on domain-driven design. But specifically about events: events can be anywhere. At the airport, where you went through the process of getting a flight ticket and checking in, that's an event, right? And maybe in your hotel check-in process. Customer experiences get defined by events. This is so fundamental that, in this day and age of digital software innovation, we as enterprises have the responsibility to do the best for our customers, and events are one way to approach this problem from the bottom up.

So with that context, let's talk about what Melvin Conway said. Conway's law is very popular, and it's close to my heart, primarily because as a systems engineer I can understand what he's saying, right? And we all do: any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.

Why is this important? The more tightly coupled our processes and our teams are, the more our software just ends up being an implementation of how our teams communicate and are set up. Within an organization, it is our leaders' responsibility to keep things loosely coupled, because that will end up building scalable systems over time. And this is all taken into consideration for the long term, right?

So event-driven architecture is one way to help you promote this loosely coupled architecture, right? Event-driven architecture is not new; we've been talking about it for a while. So let's take a look at where AWS serverless comes into the picture, enabling this journey of orchestrating event-driven architectures.

The most important thing I realized as a systems engineer is this: yes, it was so satisfying to install a server, upgrade the server, and do all of that. But I didn't really care about the bigger picture of my business; I focused on my own functional role, and I was happy about it. When I got to know the bigger picture of how I as an individual could contribute to my business's value, then it all made sense.

So that's where I fell in love with serverless and event-driven over time, and this is what I see: every single event-driven architecture needs a number of services across event producers, event consumers, and the broker. You can have your team spin up those services or use AWS services. But the best part about AWS services is that you get all those services across the board, and you're able to handle all of the underlying boilerplate with very, very minimal effort, right?

So that's the beauty of it: it's off the shelf, there's no infrastructure to manage, and you only pay for what you use. One more thing I realized is that cost is an important factor for most enterprises. When something is fancy, complex, and too costly, enterprises try to avoid it; they don't want to spend too much even though it looks fancy, right? But when your developers are unleashed with this freedom to innovate and to experiment really quickly with an event-driven serverless model, it becomes very interesting how fast we can innovate as a business.

And that's where the combination of event-driven architecture and serverless together provides these amazing benefits: greater developer agility, scalability and fault tolerance, and lower TCO.

So think about greater developer agility and extensibility. Tightly coupled systems are hard to innovate in because feature velocity slows down. You need to understand your entire ecosystem of services and components, and one change on one side affects the other. So there is risk, and that's why we slow down. That's just part of the nature of this problem, right?

But with a central event broker in the middle, you're now decoupled. So what happens? Any team member within each of the components is free to innovate, and all they have to understand is the event broker itself and the events that they are actually working on. This is why, when Lee explains later on why we need bounded contexts, it makes sense to understand what the events within those bounded contexts are.
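
As a rough illustration of the decoupling described here, the following is a minimal in-memory sketch, not any AWS API. The `EventBroker` class, the `OrderPlaced` event type, and the two consumers are all hypothetical names chosen for the example. The producer only knows the broker and the event; it does not know who consumes it.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for a central event broker (an assumption,
# not a real AWS service client).
Handler = Callable[[dict], None]


class EventBroker:
    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        """Register a consumer for one event type."""
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, detail: dict) -> int:
        """Deliver the event to every subscriber; return how many received it."""
        handlers = self._subscribers.get(event_type, [])
        for handler in handlers:
            handler(detail)
        return len(handlers)


broker = EventBroker()
inventory_log: List[str] = []
shipping_log: List[str] = []

# Two independent consumers; neither knows about the other or the producer.
broker.subscribe("OrderPlaced", lambda e: inventory_log.append(f"reserve {e['order_id']}"))
broker.subscribe("OrderPlaced", lambda e: shipping_log.append(f"ship {e['order_id']}"))

# The producer publishes without knowing who is listening.
delivered = broker.publish("OrderPlaced", {"order_id": "A-100"})
```

Adding a third consumer in this sketch means one more `subscribe` call and no change to the producer, which is the property the talk attributes to a central broker.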

And Taco Bell is an amazing example of this developer agility story. During the pandemic, they had a sudden surge of orders from online delivery partners. Those orders had to be integrated with the point-of-sale system in the stores so that customer orders were delivered seamlessly, and they were able to achieve it. They did this by integrating with their existing systems, without much change to those, taking an event-driven serverless approach and creating an API layer to integrate with their in-store point-of-sale applications.

All of the APIs were done within a span of two weeks, and within two months the application that you see here was in production. They load tested it up to millions of orders per hour, and later on they also added integrations with other delivery partners.

And the next benefit you see here is increased scalability and fault tolerance. As a systems engineer, I've been up all night, sometimes burning the midnight oil, just to troubleshoot one small change that brought down the entire system. It happens a lot. We have all been there, and we want to avoid that. An event-driven approach is one mechanism, because it avoids this cascading failure effect by having asynchronous communication.

Take the order service. Let's say your shipping service has some errors and doesn't respond. It's OK. You still get your responses back from the inventory service, and once the shipping service is back in effect, you're fine, right? But the most important thing we realized is that serverless also provides this fault tolerance built in.
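
The order, inventory, and shipping example can be sketched as follows. This is a simplified, synchronous simulation under stated assumptions: `dispatch`, `inventory_service`, and `shipping_service` are hypothetical names, and a real broker would deliver asynchronously and retry on its own. It only illustrates that one failing consumer does not stop the others.

```python
from typing import Callable, Dict, List, Tuple

Handler = Callable[[dict], str]


def dispatch(event: dict, consumers: Dict[str, Handler]) -> Tuple[Dict[str, str], List[str]]:
    """Deliver one event to every consumer independently.

    Returns the results of consumers that succeeded and the names of
    consumers that failed, so the failures can be retried later.
    """
    results: Dict[str, str] = {}
    failed: List[str] = []
    for name, handler in consumers.items():
        try:
            results[name] = handler(event)
        except Exception:
            # A failing consumer is recorded, not propagated to the others.
            failed.append(name)
    return results, failed


def inventory_service(event: dict) -> str:
    return f"reserved stock for {event['order_id']}"


def shipping_service(event: dict) -> str:
    # Simulated outage (an assumption for the example).
    raise TimeoutError("shipping service did not respond")


results, retry_queue = dispatch(
    {"order_id": "A-100"},
    {"inventory": inventory_service, "shipping": shipping_service},
)
```

Here `retry_queue` plays the role a retry policy or dead-letter queue would play in a managed service: the shipping delivery is parked while the inventory result is already available.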

Lego is an excellent example of using this approach to rebuild their e-commerce platform from scratch, and they were able to successfully scale and meet their Black Friday needs.

So with that, the final benefit is lower TCO. We don't always think of cost first when we're solving a problem. But in this serverless and event-driven approach, it's fundamentally important to think like that, primarily because you need to know which events have an impact on business value, and then you need to know how to architect those in a scalable manner.

And with an event-driven and serverless approach, you have no infrastructure to manage, it's all pay per use, and you don't pay for idle time. Liberty Mutual is an amazing example of this: an early adopter of AWS serverless and event-driven architecture, they processed over 1 million transactions for just $60. And that's amazing, right? I couldn't have done that 14 years ago. It wasn't even thinkable.
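
To put the quoted figure in per-unit terms, here is a small arithmetic sketch. It uses only the two numbers stated above ($60 and 1 million transactions); because the talk says "over 1 million", the result is best read as an upper bound on the per-transaction cost. The function name is hypothetical.

```python
def cost_per_transaction(total_cost_usd: float, transactions: int) -> float:
    """Average cost of one transaction under a pure pay-per-use model."""
    return total_cost_usd / transactions


# Figures quoted in the talk: $60 for (at least) 1,000,000 transactions.
per_transaction = cost_per_transaction(60.0, 1_000_000)
per_thousand = per_transaction * 1_000

print(f"{per_transaction:.5f} USD per transaction")
print(f"{per_thousand:.2f} USD per thousand transactions")
```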

So now, taking a step back: yes, we know it's a design problem. We have solved it with the help of services and the ability to have things up just when we need them. That's important. But everything starts from the ground up. Your developers now have more freedom to unleash their creativity, and that's where it starts, because we as business enablers have provided them with a platform that lets them succeed more for your customers, right?

So that's where your developers get to brainstorm, or even call it event storm, right? That's super important, and Lee and Jonathan are going to talk about it a little bit more as well. At the end of the day, when you have these loosely coupled teams and products coming together, it's going to enable greater business innovation.

And, for example, generative AI tools like CodeWhisperer are going to play a significant role in accelerating developer agility even further, enabling you to modernize better and at a much, much better pace, right? And this is all with the focus on customer experience at the center of it.

With that, let me bring Jonathan Luk onto the stage, please.

Thank you. Thank you, Bala. Thank you, everybody, for joining us today. I appreciate you all coming out and listening to our story. What I am planning to do is give you the why, what, and who: the backstory around what we did, what we achieved, and why it was important.

So, first off, just a little bit about me. I'm a software developer by trade. I started out as a developer on an ERP system, rose through the ranks within City, and ended up being the deputy global director of software engineering. In my current role, I'm responsible for the North American software engineering teams. I've been at City for about 20 years, so I've got a lot of good experience within this market.

So who are we at City Electric? What are we? We are a full-line electrical wholesaler. We sell wire, pipe, breakers, all the wonderful things that keep the light and the energy flowing in your home and in your office. As a business, we are privately owned. We've been in business for 40 years. We have about 705 branches in North America, and we did just shy of $3 billion of turnover last year. And we have about 2.3 million lines of inventory.

So, our story. We want to talk a little bit today about moving into event-driven architecture and serverless, and a little bit about where we came from. Like many existing businesses, we have a big giant bowl of spaghetti that is our software ecosystem. Right now, we have about seven bespoke products that cover the entirety of our operations within City. And they have all different languages, all different databases, all different integrations, release cycles, things like that.

I do want to take a second to compliment that. These software platforms have driven and grown our business from where we were a few years ago, which is around 300 million, to now 3 billion. So while it worked, eventually you do have to move on and mature your architecture.

So here's a little bit about the teams that help support this. I don't know how many of you have similar challenges. We have teams that are spread across a good chunk of the US and some in Canada. So we're in two countries, 13 states, and three different time zones. And we have 60-plus people supporting all the various platforms and infrastructure that allow these software systems to run.

OK. So with that being said, what are the challenges? What are the problems? Why are we here trying to find solutions? Well, one of the biggest problems we have within City's architecture at the moment is that it's not uniform. We have seven different platforms and seven different IDEs with seven different databases. It's just all over the place, which breeds a lot of duplication of work because there are no shareable components. Some of the platforms also age out a little bit because, as with a lot of ERP systems, once you get them in, it's very tough to get them back out: they've got momentum.

So you deal with aging platforms, no uniform architecture, and duplication of work. That drives up costs and technical debt, which is a challenge, and that then leads into people, right? You've got people supporting these systems, and with systems that are very siloed, it's very hard to move people around between them.

So you have challenges with recruitment, and you have challenges with onboarding. And then the other thing that we end up having a problem with is just keeping up with technical trends.

So we obviously wanted to make a change, right? But we had to start somewhere. At City, we do a yearly inventory, and we call it stocktake. This is where we stop the business for about 12 hours and count all those 2.3 million lines of inventory. We do this so we can validate the counts in our system, and it allows things to get picked up. So if there are human errors, physical errors, or system errors, it allows us to reset that inventory.

So why do we do this? Why is it important? Within City, we have the luxury of being essentially a profit-sharing business, so our employees share in the success of City. This inventory number feeds into the profitability of the branch, and that then feeds into people's compensation. So it's very, very important that we do it right, and that it's accurate and efficient.

So this is what we do today, with this wonderful piece of hardware that is a million years old. It is an old Motorola scanner. If you have poor eyesight or big fingers, you're in a world of hurt trying to use this stuff out in the warehouse. It's a very dumb device. It does batch downloading over serial ports. There's no feedback, there's no interaction. It's a very cumbersome tool to use. And so we said, you know what, we've got to move on from this, for many reasons: the ones that I've given before, but also because we can't source these things anymore. So we have to move on to new technology.

So, where do we go? What do we want? How do we structure a solution for this problem, trying to embrace the new, which is why we're here and what Lee's going to talk about, while still feeding into the old?

So, here is just a handful of things that we put together as an outline for a solution. We at City have a history of building things on our own, so we wanted it to be bespoke. We wanted it to be cloud based. We wanted to make sure that it was a modern architecture, something reusable that we could use as a foundational piece for other projects that will come on the heels of this. We also didn't want to go wholly on our own; we wanted to leverage existing knowledge within City, and Lee will probably talk a little bit more about that as well. And then we also wanted to utilize ProServe and the assistance that the AWS teams can bring.

So, in regards to that, I don't know if any of you have had the luxury or the privilege of working with the AWS account teams or the ProServe teams. They've been absolutely fantastic and amazing with us. As many of you know, AWS is a massive set of tools, suites, and services that you can get totally lost in. It's like an absolute forest, and what those teams do is help us navigate that forest and make sure that we are pulling the right solutions for the right problem.

They also bring in various types of architects and experts in those fields to really guide us. It's been amazing, and individually, I can't say enough good things about the solutions architects and the technical architects that have come from AWS. They've been incredibly supportive and patient, and they've worked with us a lot. Good stuff.

The other thing that they have provided, which has been brilliant, is upskilling and training. We're coming from a very on-prem, legacy type of development environment, which you're probably all familiar with, and moving into cloud based. So there was a big effort around upskilling and bringing knowledge and expertise into the teams, which they've been amazing at.

We've mentioned CEF a couple of times here. Just to give you a quick background, for about 10 seconds, 30 seconds: this is the founding member of City Electric. City Electric as a business was founded in Coventry, England, in 1951, and it expanded into the US in 1983. So we had different businesses at that point, with different technology teams. They were running into problems several years back, and they started to embrace cloud and serverless. That's where Lee and others have joined us, and so we were able to lean on their expertise to help guide our journey.

So this is what we did, and this is what I'm really most proud of. We took all these different challenges, new technologies, new team philosophies, the whole nine yards, and we were able to build something incredibly quick, incredibly effective, and incredibly cheap.

So we went from teams that had very little experience and no solution to, within nine weeks, a full production-ready application that worked exceedingly well. Being a little bit reserved and careful, we only did this in 16 locations, with 78 end users, but it worked, and it worked fantastically.

This is what we built, as a quick little screenshot. It was all on Zebra scanners. It was a Vue front end, and it was a very interactive, guided journey with a very simple UI/UX for our users. Another benefit was that, because it was all Wi-Fi, it was able to interact with our existing ERP system dynamically. So we were able to get feedback back into the scanners, and it was an incredible experience for the users.

So with all of this, what did it cost? There's a little joke here: it cost us peanuts. The night cost $4.63 to run all these systems, which was amazing, right? Granted, it wasn't the whole business, but it still shows how quickly you can develop on AWS and how cost-effective it really is.

So here's just some general feedback that we got from our branch staff. The short version of this is they loved it. It saved them hours, it worked, and it was clean. It was just a really great experience for them, and with the syncing with their current ERP systems, it was a fantastic result. The team should be really, really proud.

So what's next? Where do we go from here? Well, first off, nothing is perfect, and neither are we, and neither was this. There was some feedback that we had to go back and address and correct, so that was a big thing that we were going to be working on.

The other thing that we wanted to tackle is scalability, right? We have 705 stores and we only did 16, so we've got to scale this out by 30x or more.
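
As a quick check of the scale involved, the sketch below works through the arithmetic from the figures quoted in the talk (16 pilot locations, 705 branches, $4.63 for the pilot night). The cost projection assumes cost grows linearly with the number of locations, which is purely an assumption made for illustration and not a claim from the talk. Function names are hypothetical.

```python
def scale_factor(total_branches: int, pilot_branches: int) -> float:
    """How many times larger the full rollout is than the pilot."""
    return total_branches / pilot_branches


def linear_cost_projection(pilot_cost_usd: float, pilot_branches: int, total_branches: int) -> float:
    """Naive projection: assumes cost per location stays constant (an assumption)."""
    return pilot_cost_usd / pilot_branches * total_branches


factor = scale_factor(705, 16)
projected = linear_cost_projection(4.63, 16, 705)

print(f"scale factor: {factor:.1f}x")
print(f"projected cost for one night at full scale: ${projected:.2f}")
```

The scale factor comes out above 30x, consistent with "30x or more". Real pay-per-use pricing has free tiers and volume tiers, so the projected figure is only an order-of-magnitude guide.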

There were some bits in the code that we took shortcuts on, as you do in a nine-week timeline, right? You've got to go quick, and you've all probably been there at some point, where you beg, borrow, and steal to get through and hit the deliverables.

But there are some things we wanted to go back and revisit. And then, like with every other product, we want to make sure that we are continually improving it and making it better for our users. So we're obviously going to add more feature sets to it as we move on.

So I'd like to introduce Lee up onto the stage now.

Hi everyone. My name is Lee Gilmore, and I'm the Global Head of Architecture and Technology at City. What I'm going to talk about is the global tech strategy that allowed us to build this out in nine weeks.

The teams previously hadn't used AWS or TypeScript and did really well to get this out production ready. So what we're going to talk about is these five key areas:

  • Serverless first, supported by AWS
  • Domain-driven design and team topologies
  • EDA, so the event-driven architecture strategy
  • Our evolutionary architecture
  • And then we're going to touch on the global tech radar

So, serverless first, supported by AWS. Historically, we come from data centers and VMs with CEF and CES. Now this move to serverless allowed us to be really quick to market. We could innovate, and we could get this out very, very quickly.

And then we could iterate on these particular designs, and we also had reduced operational complexity. We're not having to patch servers anymore; that's part of the shared responsibility model with AWS.

And what this gave us is massive scalability and, obviously, high availability too. And now we've got a future-proof technology stack as we move away from Sybase and Delphi to TypeScript, CDK, and serverless technologies.

And as Jonathan talked about earlier, obviously, we've got lower running costs for anything that we actually run in the cloud. And just to note here, we actually built out our reusable reference architectures as well. So we've got something called the City CDK, where we can start to package these parts of IaC and take the cognitive load off teams, allowing them to utilize this through an npm package.

And although we are serverless first as a mindset, it is still buy before build. So if there's no competitive advantage in building something out, we're just going to buy that off the shelf.

And as Jonathan said earlier, obviously, we've been supported by AWS ProServe. So 18 months ago, we had some initial wins in the UK. Teams were very siloed; they'd done some small MVPs and got those out. It was a learning curve, but there was no reuse and no standards really in place.

So when myself and ProServe first started 18 months ago, we did a gap analysis, and this allowed us to move towards a cloud target operating model. We generated a backlog of work for that. They also supported our reference architecture, which included being on site and doing whiteboarding sessions with us, and obviously training and upskilling.

So there were a lot of immersion days and actually working with the teams to help upskill them. And as I said before, this has allowed us to move to what we call the City CDK: taking the cognitive load off teams and having these reusable building blocks of code which they can use. And coming from a data center, it's very different to running in the cloud.

So they also supported the security teams with best practices and standards. Once we did all this, we then started looking at what we're actually going to build and who is going to build it. And this is where we looked at domain-driven design and team topologies.

So this is a great quote from Eric Evans. He wrote the original blue book, for anyone that's read that. And this quote really resonated with us: "The heart of software is its ability to solve domain-related problems for its user." And that's what we need to do. We need to go back to what the actual problem is that we're solving through these serverless architectures.

So, domain-driven design aligns real-world business domains and real-world problems with architecture and software engineering, so we're actually building something that the customer wants. And it places the key focus on the core domains: what makes us unique, where the competitive advantage is. That's where we want to focus our efforts.

It also promotes continuous collaboration between technical teams, so engineering, and domain experts, i.e. the customer. They agree on these models and refine them, so we're all on the same wavelength with what we're actually building.

And it also allows us to understand the key domain events: what are the significant changes that are happening within the domains?

So I said the word domain there a lot of times, but what actually is a domain? If we went to the Cambridge Dictionary definition, we'd see things like "an area of interest or an area over which a person has control," with examples such as:

"You'd better ask Paul. Electronics is not my domain, I'm afraid." And: "Boardroom decisions are the exclusive domain of company directors."

So if we look at what that tangibly is, we've got fuzzy interpretations of interest and meaning. We've got boundaries for language and concepts, and distinct boundaries are ideal for system design. "Good fences make good neighbors": that's a common quote that most of us will have heard.

So then we've got this notion of our overall domain. This is City, and this is very complex. This is across manufacturing, warehouse, sale, and customer. And if we went to build this outright, we'd build a complex model, or a big ball of mud.

So what we typically do is look at things like event storming and context mapping to start to break this down into smaller subdomains. We've got a few of those listed there. And as we do this, what we find is there are three types that we need to look at.

So we've got core. This encapsulates what makes you competitive, what differentiates you from your competitors, and this is where you want to spend about 80% of your engineering effort. Then we've got supporting. These support the core domains but don't really have value on their own, and they don't really affect competitiveness. And then we've got generic. This is typically where it's very complex to build, and more often than not we're just going to buy this off the shelf.

So to bring this to life a little bit: at City, we would have core domains such as order fulfillment, product, customer, and price. This is very unique to what we do at City. It makes us competitive.

We've got supporting things like data, BI, integrations, and messaging, which underpin everything we do in the core domains. And then we've got generic things like finance and HR. Everybody uses pretty much the same systems, so we're just going to buy those off the shelf and integrate them into the other systems.
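
The three subdomain types and the examples just listed can be captured in a small lookup, sketched below. The subdomain names come from the talk; the `SubdomainType` enum, the `DEFAULT_STRATEGY` table, and in particular the "mix of build and buy" label for supporting subdomains are assumptions made for this illustration.

```python
from enum import Enum


class SubdomainType(Enum):
    CORE = "core"
    SUPPORTING = "supporting"
    GENERIC = "generic"


# Default sourcing strategy per type. The wording of each strategy is an
# assumption for this sketch, not a rule stated in the talk.
DEFAULT_STRATEGY = {
    SubdomainType.CORE: "build",
    SubdomainType.SUPPORTING: "mix of build and buy",
    SubdomainType.GENERIC: "buy",
}

# Subdomain examples as named in the talk.
SUBDOMAINS = {
    "order fulfillment": SubdomainType.CORE,
    "product": SubdomainType.CORE,
    "customer": SubdomainType.CORE,
    "price": SubdomainType.CORE,
    "data": SubdomainType.SUPPORTING,
    "BI": SubdomainType.SUPPORTING,
    "integrations": SubdomainType.SUPPORTING,
    "messaging": SubdomainType.SUPPORTING,
    "finance": SubdomainType.GENERIC,
    "HR": SubdomainType.GENERIC,
}


def sourcing_strategy(subdomain: str) -> str:
    """Look up the default build/buy strategy for a named subdomain."""
    return DEFAULT_STRATEGY[SUBDOMAINS[subdomain]]


core_subdomains = sorted(n for n, t in SUBDOMAINS.items() if t is SubdomainType.CORE)
```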

So as we start to break these down into these different subdomains, what we typically do is look at the domain model. This is your structured knowledge of the problems within the domain, and it is made up of rules, language, and concepts. That's key here: it's all about language. It should identify the relationships among all of the entities within the scope of that domain.

It's typically modeled with things like Post-it notes, if you're doing an event storming session, or it could be code and diagrams. This is to make sure we're all on the same wavelength about what we're building, between the customer and the engineering teams. So it's just a construct to allow that.

And as we start to build out these subdomains, we've then got the notion of a bounded context. A bounded context is the boundary of the subdomain model, and it's always desirable to have a one-to-one mapping between the domain model and the bounded context.

So the subdomain relates to the problem space: the language, the concepts, and the rules within that subdomain. The bounded context is the solution space, and this is where we're going to build microservices. One team should build and maintain within that bounded context, because we want to limit dependencies between teams; we want them to do a full slice of the work. One team, however, could actually own multiple bounded contexts.
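
The ownership rule described here (each bounded context has a single owning team, while one team may own several contexts) can be checked mechanically. The following is a minimal sketch; the function name and the team names are hypothetical, and only the rule itself comes from the talk.

```python
from typing import Dict, List


def contexts_with_multiple_owners(ownership: Dict[str, List[str]]) -> List[str]:
    """Return bounded contexts claimed by more than one team.

    `ownership` maps a team name to the bounded contexts it owns. A team
    owning several contexts is fine; a context with several teams is not.
    """
    owners: Dict[str, List[str]] = {}
    for team, contexts in ownership.items():
        for context in contexts:
            owners.setdefault(context, []).append(team)
    return sorted(c for c, teams in owners.items() if len(teams) > 1)


# Hypothetical team names; one team owning two contexts is allowed.
valid = {"team-a": ["customer", "price"], "team-b": ["product"]}
# Here "price" is claimed by two teams, which violates the rule.
invalid = {"team-a": ["customer", "price"], "team-c": ["price"]}

valid_conflicts = contexts_with_multiple_owners(valid)
invalid_conflicts = contexts_with_multiple_owners(invalid)
```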

So again, to bring this to life a little bit, we can see a simplified diagram here. We've got HR, which is generic; we're just going to buy that off the shelf. We've got data and BI, which is a supporting subdomain, so we're probably going to have a one-to-one ratio between build and buy. And then we've got price and customer, which are both core. This makes us competitive, and we're going to spend a lot of engineering effort building these out.

And we've zoomed into customer a little bit there to see the domain model, which is in yellow. You can see that for the customer subdomain, this part of the overall system, this is where we've got language, processes, and rules. You can see that red dotted line around the edge: that's the bounded context, and that's where we're going to start building out these microservices.

So we can see here that within that bounded context, it's well encapsulated. We've got one or more microservices in there, and they can communicate privately between themselves; they might have private domain events. But all of that is fully encapsulated. The only way that we actually talk to other domains is either by exposing certain functionality through REST APIs or by raising public domain events.
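
One way to picture the private/public split is the sketch below: every event raised inside a context is recorded internally, but only events marked public reach an outbox that other domains could read. The `DomainEvent` and `BoundedContext` classes and the event names are hypothetical, and the outbox is an illustrative device rather than the mechanism used at City.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DomainEvent:
    name: str
    detail: dict
    public: bool = False  # private by default: stays inside the context


@dataclass
class BoundedContext:
    name: str
    internal_log: List[DomainEvent] = field(default_factory=list)
    outbox: List[DomainEvent] = field(default_factory=list)

    def raise_event(self, event: DomainEvent) -> None:
        """Record every event internally; expose only public ones."""
        self.internal_log.append(event)
        if event.public:
            self.outbox.append(event)


customer = BoundedContext("customer")
# Hypothetical event names for illustration.
customer.raise_event(DomainEvent("AddressValidated", {"customer_id": "C-1"}))
customer.raise_event(DomainEvent("CustomerCreated", {"customer_id": "C-1"}, public=True))

published_names = [e.name for e in customer.outbox]
```

Other domains in this sketch would only ever see `CustomerCreated`; `AddressValidated` remains an implementation detail of the customer context.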

And one thing that's missing from that diagram: typically at City we would have an EventBridge custom bus in there as well, and that allows us to also consume domain events.

So now we're looking at team topologies, and this is a great quote from Matthew Skelton and Manuel Pais, who wrote the book on team topologies: "Choose software architectures that encourage team-scoped flow."

So how do we allow teams to work very quickly? Team topologies is an approach to organizing business and technology teams for fast flow and limiting the dependencies between them. There are four fundamental team types that we need to align to, and this helps reduce the cognitive load on teams and streamlines the interactions between them.

So these are the four fundamental team types that we've aligned to at City. We've got stream-aligned teams. For us that would be our domain teams, or in some companies these would be product teams. These are long-lived teams that typically work on a backlog of work.

And then we've got the notion of enabling teams, and these are teams that help support the stream-aligned teams. So this could be, for example, architecture or UX practices. We might want to have architects work with the team, help upskill them, overcome certain problems, and then step back and let the team work away.

And we've got complicated-subsystem teams, which we don't actually have at City. This is where significant mathematics, calculations, or technical expertise is needed. And then finally, we've got platform teams. These are teams that build internal developer platforms. Again, this supports the stream-aligned teams and allows them to move quicker.

So how have DDD and Team Topologies affected City as an organization? We can see here that we've got our enabling teams in yellow. We've got enterprise architecture; typically they're looking at the north-star architecture and supporting the stream-aligned teams in moving towards that.

We've got the architecture enablement team and the architecture, engineering, UX, and QA practices, again supporting and upskilling those stream-aligned teams. We can see the stream-aligned teams there, left to right, the purple ones. These are all long-lived domain teams, if you go back to DDD.

And underpinning all of this, we've got our cloud engineering teams, who build out all of the platforms for us. If we overlay the DDD aspect onto this, we can see our core domains. We've got things like product and price, customer and sale. This is where we're going to spend a lot of engineering effort.

We've got supporting things like the common component team and BI analytics. And then we've got the generic subdomains. These are things that we're typically just going to buy off the shelf.

So if we then zoom out and look at the high-level architecture at City, we call this Serverless Architecture Layers, or SAL architecture. You can see at the top we have channels: this could be ad, this could be chatbot, it could be web or mobile, and they only communicate through this experience layer.

So these are backends for frontends, typically. They don't have a lot of business logic in them; this is just the interface into the domain services. We can see that we've got integration APIs, B2B, and web and mobile, and they then talk to the domain services. These are private to the AWS network.

So we use PrivateLink to actually talk between them, with SigV4 and IAM authentication. And what you can see here, going back to Team Topologies, is that we want the teams to do the full slice. So with the branch WMS project, the team did the front end, the backend for frontend, and the back-end warehouse API changes.

So we don't want to have dependencies between teams. We want them to do the full vertical slice. And then we've got our data layer: we can see we've got our enterprise service bus, and we've got things like our data lake and data warehouse. You'd also have machine learning and AI in this area.

And then we've got the platform layer. This is where we've got things like developer experience, landing zones, and pipelines: everything that's supporting the teams above with undifferentiated heavy lifting. And again, we want the teams to move very quickly, so spinning up an AWS account or equivalent should be very quick.

And then on the left-hand side, we've got our cross-cutting concerns. This is typically something that affects everything within an organization, things like logging and tracing. We've got our own SDK, and obviously we've got our City CDK that I talked about earlier, and this is managed through an enablement team.

So this is our common component team, who build these out.

So we saw an ESB on there, and a lot of people will be thinking about SOA-style architectures, which we don't actually have at City. But we're now going to go on to the EDA side of it, our EDA strategy.

Now Bala talked about this earlier, and this is a great quote from Werner Vogels as well: everything fails all the time. And this is why we want to decouple our serverless microservices.

So the EDA strategy that we've got has allowed us to have decoupled domain and experience services. Typically, we only communicate through events. There are always times we need to do synchronous calls through REST, but it is event-driven first where we can. This enables these services to scale independently, and the events are based on our domain models, going back to the DDD that we talked about earlier.

And again, Bala talked about this earlier, but we can utilize error handling, things like DLQs. So when a service comes back online, we can replay the events, and again the customer is not affected.
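
The dead-letter idea can be sketched in memory as follows. This is a minimal illustration under stated assumptions, not City's implementation: `deliver`, `replay`, and the `Consumer` type are hypothetical names, and the array stands in for a real dead-letter queue such as an SQS queue.

```typescript
// In-memory sketch of dead-letter handling and replay (all names are hypothetical).
type DomainEvent = { id: string; detailType: string; detail: Record<string, unknown> };
type Consumer = (event: DomainEvent) => void; // throws while the service is unavailable

// Try to deliver; on failure, park the event in the dead-letter queue instead of losing it.
function deliver(event: DomainEvent, consumer: Consumer, dlq: DomainEvent[]): boolean {
  try {
    consumer(event);
    return true;
  } catch {
    dlq.push(event);
    return false;
  }
}

// Once the consumer is healthy again, drain the queue in arrival order.
// Events that still fail are pushed back onto the queue for a later attempt.
function replay(dlq: DomainEvent[], consumer: Consumer): number {
  const pending = dlq.splice(0, dlq.length);
  let replayed = 0;
  for (const event of pending) {
    if (deliver(event, consumer, dlq)) replayed += 1;
  }
  return replayed;
}
```

Because the failed event is parked rather than dropped, the publisher never needs to know that the consumer was offline.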

So I said the word event there quite a few times. But what is an event? There are two types that we use at City. We've got a typical event, which is a domain event. It's something that's happened; it's immutable; it's in the past. This could be order created or invoice generated.

And then we have the notion of a command, which we don't use a lot at City. This is where there's an intent aimed at another domain, something like send email or generate PDF.

Now I won't read this verbatim. This is a quote from AWS about an ESB. But what this typically does is allow different services to communicate through events in a standard way.

So going back to Serverless Architecture Layers, we can see in our data layer we've got our ESB, which is Amazon EventBridge, and we can see that we have events flowing between the experience layer and the domain layer. To bring it to life a little bit, we have things like payment canceled, order created, user logged in, or product updated.
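
The two message types described above, a past-tense immutable domain event and an imperative command aimed at one domain, could be modelled roughly like this. It is only a sketch: the field names (`kind`, `occurredAt`, `targetDomain`) and the helper functions are assumptions for illustration, not City's schema.

```typescript
// Hypothetical message shapes; field names are illustrative only.
interface DomainEvent {
  kind: "event";
  name: string;        // past tense, e.g. "OrderCreated" or "PaymentCanceled"
  occurredAt: string;  // ISO 8601 timestamp of when it happened
  payload: Readonly<Record<string, unknown>>;
}

interface Command {
  kind: "command";
  name: string;         // imperative, e.g. "SendEmail" or "GeneratePdf"
  targetDomain: string; // a command is an intent aimed at one specific domain
  payload: Record<string, unknown>;
}

type Message = DomainEvent | Command;

// Events describe the past, so the sketch freezes them to keep them immutable.
function orderCreated(orderId: string, now: Date): Readonly<DomainEvent> {
  const created: DomainEvent = {
    kind: "event",
    name: "OrderCreated",
    occurredAt: now.toISOString(),
    payload: Object.freeze({ orderId }),
  };
  return Object.freeze(created);
}

// The discriminant lets consumers treat the two kinds differently.
function summarize(message: Message): string {
  switch (message.kind) {
    case "event":
      return `${message.name} happened at ${message.occurredAt}`;
    case "command":
      return `${message.targetDomain} is asked to ${message.name}`;
  }
}
```

The naming convention carries the meaning: an event name is a fact that any number of consumers may react to, while a command name is a request with exactly one intended recipient.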

Now, the reason we call it an ESB is that we use something called a single-bus, multi-account pattern. You can see on this diagram that you've got a central event bus that everything flows through from an event perspective. Then we've got the blue, orange, and purple teams here: they're publishing domain events to the central event bus, but also consuming through their own local event buses where they need to. So we've got that asynchronous two-way communication.

And again, zooming out a little bit and having a look at an example at City, we can see our stream-aligned team has an experience layer there. This is our customer web app, and it's publishing domain events which go to our data layer, so our ESB, which is our shared EventBridge account. And then we've got target rules that route to our domain layer.

These are private domain services, and again they're looked after by our stream-aligned team. We can see the orders domain might do some processing, and the customer service might do its own processing off the back of that.

Now, you can see the customer service in the domain layer is then publishing back to the main ESB, and that's being consumed again in the front end. So you might use something like AWS AppSync subscriptions, and you might want to do that in real time and give some kind of update to the user, or maybe cache some kind of read store in the experience layer.
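
The single-bus, multi-account flow can be modelled in memory to show the routing idea. This is a sketch under stated assumptions: `CentralBus`, `LocalBus`, and `addRule` are hypothetical names, and matching is reduced to a list of detail types, which is far simpler than real EventBridge event patterns.

```typescript
// In-memory model of a central bus routing to per-team local buses (names are hypothetical).
type DomainEvent = { source: string; detailType: string; detail: Record<string, unknown> };

// Stands in for a team's own event bus in its own account.
class LocalBus {
  readonly team: string;
  readonly received: DomainEvent[] = [];
  constructor(team: string) {
    this.team = team;
  }
  receive(event: DomainEvent): void {
    this.received.push(event);
  }
}

// Stands in for the shared bus that every team publishes to.
class CentralBus {
  private rules: { detailTypes: string[]; target: LocalBus }[] = [];

  // Similar in spirit to a rule whose target is another account's bus.
  addRule(detailTypes: string[], target: LocalBus): void {
    this.rules.push({ detailTypes, target });
  }

  // Fan the event out to every local bus whose rule matches; returns the delivery count.
  publish(event: DomainEvent): number {
    let deliveries = 0;
    for (const rule of this.rules) {
      if (rule.detailTypes.indexOf(event.detailType) !== -1) {
        rule.target.receive(event);
        deliveries += 1;
      }
    }
    return deliveries;
  }
}
```

Publishers only know about the central bus, so a new consuming team can be added by registering one more rule, without touching any publisher.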

So the key learning for us using Amazon EventBridge is that you haven't got guaranteed ordering. So when you need ordering, you need to look at SNS and SQS patterns. There is a way of using the two in conjunction. If anybody's interested, you can grab me after the talk and we can talk that through.

So publishers should always validate their events. You should be a good event citizen, because a lot of other domain services are going to consume that event. But at the same time, you should always validate anything you consume. And Amazon EventBridge supports at-least-once semantics, which means you could get the same event multiple times. So your downstream services have to be idempotent: they need to be able to ignore the subsequent events.

What you don't want to do is charge a customer twice or maybe send an email three times.

So we've gone through the strategic side of DDD very quickly there, and now we're going to look at the tactical side.
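
Returning to the consumer-side points for a moment, validating what you consume and ignoring redeliveries can be sketched as below. The event shape, the `ChargeCustomerHandler` class, and the in-memory `Set` are all assumptions for illustration; a real service would record processed IDs in a durable store so that duplicates are also detected across invocations.

```typescript
// Idempotent, validating consumer sketch (event shape and names are hypothetical).
type PaymentRequested = { id: string; detail: { customerId: string; amount: number } };

// Never trust an incoming event: check its shape before acting on it.
function isPaymentRequested(value: unknown): value is PaymentRequested {
  if (typeof value !== "object" || value === null) return false;
  const { id, detail } = value as { id?: unknown; detail?: unknown };
  if (typeof id !== "string" || typeof detail !== "object" || detail === null) return false;
  const { customerId, amount } = detail as { customerId?: unknown; amount?: unknown };
  return typeof customerId === "string" && typeof amount === "number" && amount > 0;
}

class ChargeCustomerHandler {
  // In-memory stand-in for a durable record of processed event IDs.
  private readonly processedIds = new Set<string>();
  readonly charges: { customerId: string; amount: number }[] = [];

  handle(raw: unknown): "processed" | "duplicate" | "invalid" {
    if (!isPaymentRequested(raw)) return "invalid";
    if (this.processedIds.has(raw.id)) return "duplicate"; // redelivery: ignore it
    this.charges.push({ ...raw.detail });                  // the side effect happens once
    this.processedIds.add(raw.id);
    return "processed";
  }
}
```

With at-least-once delivery, the second arrival of the same event ID returns `"duplicate"` and the customer is charged only once.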

So we're going to look at evolutionary architecture. Again, this is a quote that resonated with us. This was by Robert C. Martin, or Uncle Bob as some people know him: "When any of the external parts of the system become obsolete, such as the database, or the web framework, you can replace those obsolete elements with a minimum of fuss."

Now at City, we knew this was transient. So to start with, we would call through to our monolithic API. We knew that over time we'd gather insights and we might want to swap out certain AWS services for other ones.

And we knew that as we broke down our monolithic API, we'd break it down into domain services, and that meant we'd have to start changing different HTTP calls, hitting different REST interfaces, and we'd have different DTO objects coming back too. And some of these services that we build might be more applicable to containers, some might be more applicable to functions, and vice versa. And sometimes we only find out when they're in production and we've actually got some kind of scale going through them.

And again, for the same reason, we can't know all data access patterns up front. So we want to be able to swap out a certain database during development if we need to without a full rewrite.

So at City, we've got two types: we've got the lightweight version of hexagonal architecture, and then we've got the full version. Now, we can't talk through the full version today, because it covers things like repositories, use cases, aggregates, and aggregate roots. That's a talk in its own right. But we'll talk about the lightweight version.

So we use this typically when it's more of a CRUD-style service; it could be in our experience layer or an integration. The full version we typically use when there's a lot of business logic in there, a very heavy domain model.

So if we look at this example here, we can see we've got API Gateway, which is invoking a Lambda function. That's what the purple circle denotes. And we can see that we've got primary adapters on the driving side. So this is some kind of input that's coming in. This would be taking the event from API Gateway, transposing it, maybe doing some kind of instrumentation and logging, and then calling the use case for the business logic.

And eventually, that's going to return the status code and the body back to API Gateway. So this is completely devoid of any business logic; it's purely framework. The use case, again, is the reverse: it has no notion of frameworks. It is purely domain logic, and it will have to persist data or retrieve data.

So we've got DynamoDB here and that does it on the driven side through something called secondary adapters. And we can also see that we're publishing a domain event there to Amazon EventBridge.
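
The lightweight hexagonal layout described here can be sketched as a primary adapter, a use case, and two secondary ports. Everything in the block is an illustrative assumption: the request and response types are simplified stand-ins rather than the real API Gateway shapes, the ports would be backed by DynamoDB and EventBridge adapters in practice, and the code is synchronous only to keep the sketch short.

```typescript
// Lightweight hexagonal sketch; all names and shapes are hypothetical.
type Customer = { id: string; name: string };

// Secondary (driven) ports: the use case depends on these interfaces only.
interface CustomerStore {
  save(customer: Customer): void;
}
interface EventPublisher {
  publish(detailType: string, detail: Customer): void;
}

// Use case: pure domain logic with no knowledge of HTTP, databases, or event buses.
function createCustomerUseCase(store: CustomerStore, events: EventPublisher) {
  return (input: { id: string; name: string }): Customer => {
    const name = input.name.trim();
    if (name.length === 0) throw new Error("name is required");
    const customer: Customer = { id: input.id, name };
    store.save(customer);
    events.publish("CustomerCreated", customer);
    return customer;
  };
}

// Primary (driving) adapter: translates the HTTP-shaped input and output, nothing more.
type HttpRequest = { body: string | null };
type HttpResponse = { statusCode: number; body: string };

function httpAdapter(createCustomer: (input: { id: string; name: string }) => Customer) {
  return (request: HttpRequest): HttpResponse => {
    try {
      const input = JSON.parse(request.body ?? "{}");
      return { statusCode: 201, body: JSON.stringify(createCustomer(input)) };
    } catch (error) {
      return { statusCode: 400, body: JSON.stringify({ message: (error as Error).message }) };
    }
  };
}
```

The handler is assembled by passing concrete adapters into `createCustomerUseCase` and wrapping the result with `httpAdapter`, so the business rule about names can be tested without any AWS service involved.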

Now, the beauty of this comes in where we might want to go with the storage-first pattern. So we might want to go from API Gateway to SQS now, and then Lambda. All of a sudden, we've got this primary adapter that needs to understand SQS. So what we typically do is write a new primary adapter for SQS, and we can slot that in very easily, and nothing else changes from left to right.

And then if you look at the driven side, you can see that we might want to swap out Amazon EventBridge for SNS. Again, we can create a new secondary adapter for SNS and with it being adapters, we can just unplug the EventBridge one and plug in SNS very easily.

Now, if we'd done this as one big Lambda handler, this would have been pretty much a rewrite. And then if you extrapolate that out across maybe 100 or 1,000 Lambdas, that's a lot of work.
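
The adapter swap described above can be illustrated with one unchanged use case and two interchangeable adapters on each side. This is a sketch with assumed names and simplified shapes: the publishers only record what they would send, where real adapters would call the EventBridge or SNS clients, and the batch type is a cut-down stand-in for an SQS event.

```typescript
// Swapping adapters around an unchanged use case (all shapes are simplified assumptions).
type Order = { orderId: string };

interface EventPublisher {
  publish(detailType: string, detail: Order): void;
}

// The use case is identical no matter which adapters surround it.
function placeOrder(events: EventPublisher) {
  return (order: Order): Order => {
    events.publish("OrderCreated", order);
    return order;
  };
}

// Two interchangeable secondary adapters behind the same port.
class EventBridgePublisher implements EventPublisher {
  readonly entries: { DetailType: string; Detail: string }[] = [];
  publish(detailType: string, detail: Order): void {
    this.entries.push({ DetailType: detailType, Detail: JSON.stringify(detail) });
  }
}

class SnsPublisher implements EventPublisher {
  readonly messages: { Subject: string; Message: string }[] = [];
  publish(detailType: string, detail: Order): void {
    this.messages.push({ Subject: detailType, Message: JSON.stringify(detail) });
  }
}

// Two interchangeable primary adapters: an HTTP-style request and an SQS-style batch.
const fromHttp = (useCase: (order: Order) => Order) =>
  (request: { body: string }): { statusCode: number; body: string } => ({
    statusCode: 201,
    body: JSON.stringify(useCase(JSON.parse(request.body))),
  });

const fromSqs = (useCase: (order: Order) => Order) =>
  (batch: { Records: { body: string }[] }): number => {
    batch.Records.forEach((record) => useCase(JSON.parse(record.body)));
    return batch.Records.length;
  };
```

Moving to the storage-first pattern then means composing `fromSqs(placeOrder(...))` instead of `fromHttp(placeOrder(...))`, and switching buses means passing a different publisher; `placeOrder` itself is not edited.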

So I just want to touch upon the global tech radar that we've got at City because this has really underpinned what we've done. So a global tech radar is something that was created by ThoughtWorks and this allows us to look at these four quadrants.

So: tools, techniques, platforms, and languages and frameworks. And then we've got these rings. We've got adopt; for us at City, this would be things like TypeScript and the CDK, and this is what we want teams to go to any time they build out a new service.

But then we've got the notion of hold. So we don't want to do any new development with COBOL or Delphi, and this makes sure that we're all on the same wavelength with what we're building globally. And things like GenAI might be in assess.

So we might look at Amazon Bedrock, for example. As we start to use this a little bit more, we might put it in trial, and eventually, as we use it in more and more services and we think we get real value from it, we can move it to adopt. That way, as a business, we're all aligned on what we're actually building.
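
As a small data-structure illustration, the radar's quadrants and rings, and the assess to trial to adopt progression, might be modelled like this. The type names and the `promote` and `allowedForNewServices` helpers are hypothetical, and the ring placements used with them are examples taken from the talk rather than a record of City's actual radar.

```typescript
// Tech radar sketch; helper names are hypothetical.
type Ring = "adopt" | "trial" | "assess" | "hold";
type Quadrant = "tools" | "techniques" | "platforms" | "languages-and-frameworks";
type Blip = { name: string; quadrant: Quadrant; ring: Ring };

// The path a technology follows as confidence in it grows.
const PROMOTION_PATH: Ring[] = ["assess", "trial", "adopt"];

// Move a blip one ring closer to adopt; blips on hold or already adopted are returned unchanged.
function promote(blip: Blip): Blip {
  const index = PROMOTION_PATH.indexOf(blip.ring);
  if (index === -1 || index === PROMOTION_PATH.length - 1) return blip;
  return { ...blip, ring: PROMOTION_PATH[index + 1] };
}

// New services should be built only with technology in the adopt ring.
function allowedForNewServices(radar: Blip[]): string[] {
  return radar.filter((blip) => blip.ring === "adopt").map((blip) => blip.name);
}
```

Keeping the radar as data makes the alignment checkable: a pipeline step could, for instance, compare a new service's declared stack against `allowedForNewServices`.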

And this has really supported our common component team. So they're building things out like our City CDK, so the composable architecture. Now, if we had teams using Terraform and Serverless Framework, they wouldn't be able to use this particular framework that we're creating. And that's exactly the same with our front-end components as well, in our design language.

So we did this over 18 months in the UK and this has really underpinned the first bit of work that we did in North America, which is our branch WMS project.

So we needed to use the agility and speed of serverless and really thin slices. As Jonathan said, we had quite tight timescales to get this out, and we knew it would take time to strangle the monolith that we have in North America.

So the first iteration was going to be to call through from the experience layer through to visit the domain layer and future iterations over time would then call out to the domain services as we start to break those down.

So that means we have to have evolutionary architecture. We wanted to make this really easy for ourselves, and we used the patterns, best practices, and reference architectures that we'd built out over the past 18 months to really underpin the work that we did in North America.

And the North American team also aligned to the tech radar, which meant that they could start to utilize the common components that we had.

So this is a very simple diagram. There are a lot of services that have been removed; there's only so much I can get on the slide. But you can see here, we've got a scanner device, so this is somebody doing the stock take. We can see we've got AWS WAF, because we want to make sure that we restrict the use of this down to the actual branches.

The WMS app is in S3 and we've got a CloudFront distribution there. And obviously, there's other services at play like Route 53 and there's also authentication. So it's authenticated with Azure AD.

And then we've got API Gateway, which, as you can see, is calling through to these Lambda functions within a VPC. Now, the reason we've got the Lambda functions in a VPC is because typically, as I said earlier, the synchronous communication uses SigV4 and IAM, so we keep everything private to the AWS network.

And again, going back to DDD, we can see these Lambda functions are raising public domain events and they're going to our shared EventBridge account. So our shared Amazon EventBridge bus and we've got gateway endpoints and VPC endpoints there for talking out to Parameter Store and DynamoDB again, keeping it private to the AWS network.

And we've got that NAT gateway there because we want to make sure any egress is all on a static IP address. This allows the on-prem API, on the right-hand side, to allow this particular experience layer. And again, that's fully authorized: you've got a machine-to-machine auth flow there, so a client credentials grant flow between the two.

So I'm now going to invite Bala and Jonathan to come on stage and talk about what we actually learned as part of this project.

Thanks, Lee. So obviously, a lot of work went into this across the board, and like everything else, there are wins and there are losses, things we did right and things we did wrong.

The thing that I most want to focus on is the fact that we did it. We took a lot of these concepts, we took a lot of this new technology, and we found a way to deliver a solution in a very quick and efficient way that had very minimal cost impact, which was impressive.

The team did an amazing job working together, rallying around each other to deliver the solution, with all the different support coming from AWS and from our colleagues in Europe.

But like everything else, there are things here that didn't work as well. We did incur some technical debt, like we mentioned before, that we had to go back and correct and clean up. There was a little bit of a siloed approach; we needed to do that in order to get it delivered.

And then a lot of the processes that we had were ad hoc, right? So we had to go back and look at restructuring them to make them a little bit more formalized in their ways of working.

So, I mean, at the end of the day, we delivered value to the business, and we did it in a modern architecture utilizing cloud services. And the team were able to hit their mark, which was an impressive result.

So I'm very proud of what everybody has done, and thank you all for coming. We appreciate the time and the effort.
