Todd: Thanks so much for showing up for this session today. I hope your re:Invent's going well. I really hope you're ready to dig into a little bit of SaaS and multi-tenancy today. As the slide says, my name is Todd Golding. I'm a Solutions Architect at AWS, and I've been working for the past eight or so years with different SaaS providers across different companies and domains, helping them with a whole range of different problems.
I've also been part of the SaaS Factory team, where we've developed a ton of reference content and guidance around what it means to build and deliver SaaS on AWS. And obviously a big part of the effort we've put in over the years has been focused on: just show me how to build something, give me concrete examples. That was always the feedback we got from lots of people. It was: good, you've told us about these principles, but show us how.
And so a few years ago our talks became very much about: show me how to build an EKS SaaS solution, show me how to build a serverless SaaS solution, give me the code, give me a token vending machine. And those things were awesome, and they're still awesome, at giving you a great head start in terms of a specific end-to-end example of how to build SaaS.
But then when we sat down for re:Invent this year and asked what topics were going to be valuable, we said: well, what happens beyond just getting over the basics of getting a solution to work? What does it mean to build a really good, scalable multi-tenant SaaS solution independent of which technology you're building it on? What is a good way to build a resilient SaaS environment?
And so that was the motivation and the spirit behind this talk: instead of just showing one specific solution, give me a mental model, give me a framework for the basic moving parts of what it means to build a great, scalable, efficient SaaS solution on top of AWS.
Now, it's a super broad topic, and I don't think we're going to get into every nuance of what it means to achieve this. But what I've tried to do for the talk today is tease out the moving parts of this to give you a sense of: here are some of the big things that, if I were sitting down with a customer right now and talking to them about scale and resilience, I would tell them to go after.
And I would also say: yes, the front of this will be about scale and resilience, but to me the back third of this talk, which is about how we actually prove these things are working, is the area we've talked the least about. How do I build chaos mechanisms? How do I build validation mechanisms that tell me that scale is working, that tell me that resilience is working? Now, this is a 300-level session. It says deep dive, but let's set your expectations about what deep dive is. We're going to get into the architecture, absolutely; it'll be all over the architecture here. But I'm not going to show you a bunch of code, and I'm not going to show you all the underlying moving parts of this. This is more about architectural patterns and architectural strategies. I hope that's why you're here, I hope that matches what you read in the abstract, and if so, let's move forward.
Well, I want to start at the outer level here, which is just: what if I'm trying to architect and build a SaaS application? It turns out that when we talk about scale, availability, and resilience, there's tons of really good material out there. In fact, I would tell you, as a complement to this talk, you should absolutely be looking at the guidance in AWS Well-Architected if you've not checked it out: really good pillars on resilience, scale, and availability with great guidance on how to do those things, and those things are all still valid for a SaaS environment.
But what I find is that SaaS architects have another layer of considerations they have to think about, and multi-tenancy adds all these extra nuances to what it means to achieve scale and resilience. The things at play here are that, yes, we have availability; everybody wants availability, we never want the system to go down. But imagine, in a multi-tenant environment, what it means to say we are targeting availability, and it's your job as the architect to build a highly available system.
Well, in traditional systems, if you weren't available, it might mean one or two customers go down because some particular aspect of the system went down. In a multi-tenant SaaS environment, we have the potential to bring the entire business down, and all the customers down, if our system goes down. So for me, the bar for availability is way higher in a multi-tenant SaaS environment. In fact, when big multi-tenant SaaS companies have outages, it makes the news, because some other business can't consume their solution, or some other dimension of the outage is so wide-sweeping that others are interested in it.
The other thing here is that as SaaS architects we're being asked to achieve cost efficiency. I gave a whole talk on cost optimization on Monday; in fact, there'll be a little overlap here, because scale and resilience and all those things also intersect with these concepts. But we're being asked to achieve economies of scale, and in SaaS, the whole reason people go to SaaS is to get the great margins that come out of this. So as the business grows, we generate more profit.
So how do we somehow achieve all these awesome ways of minimizing the amount of money we're spending on infrastructure, while still over-provisioning and doing all the other things that give us safety in terms of availability and so on? The other challenge here is predictability. Software is inherently unpredictable, but imagine a multi-tenant environment where you have new tenants showing up all the time, tenants potentially leaving, and the workloads of those tenants varying wildly across the day and across the month. And so here you're saying: I want availability, I want efficiency, I want all these other things. But by the way, you have to achieve all that and design an environment that still allows for the fact that the profile of tenants, and how they consume your system, is changing all the time.
The other piece of this is that we don't tend to end up with one architectural footprint. If we just had to achieve scale and resilience and all these things in one well-understood architecture, it would be a little easier to do. But the reality is, and we'll get into this, SaaS environments actually have to support a range of deployment models. Some tenants are deployed in what we'll call silo deployments, and we'll also talk about pool deployments; the footprint of their solutions and their architecture changes. And now, what does it mean to achieve scale, resilience, et cetera, across that experience?
And then finally, the business is coming to us and saying: by the way, we want to be able to sell into many segments. We want to sell into small companies and large companies, and we want to offer them different tiered experiences. So we might want to put throttling in place, or we might want to change the experience of those customers. And your architecture, by the way, needs to achieve all of that for me.
So for me, it feels like this big tug of war sometimes. On one side of the equation, the business is telling me, as you'll see on the left side here: we want all this efficiency, just enough infrastructure for what we're doing, maximum cost efficiency, share all the infrastructure you possibly can, so we get all the great economies of scale, because we want as big a margin as we can possibly have.
But on the other side, they're saying: by the way, we've got all these other variations we want to support, because we want to be able to go into multiple markets, we want to be able to offer tiers, we want to support multiple deployment models. And these things don't necessarily outright conflict with one another, but they definitely are hard to support side by side.
So I feel like this is part of the challenge you all have in trying to build these systems, and part of the fun of building these systems, honestly, because you have to be pretty creative to come up with the approaches that are going to work.
So, awesome. Let's talk about scale here, and what it means to actually build a scalable multi-tenant environment. What are the unique multi-tenant things you need to be thinking about? And I think if we're going to talk about this, we really have to broaden our view of what it means to scale. This is probably me getting on the soapbox a little bit, because I feel like this is an area where people put scale in too small of a box.
Generally, I think people, especially infrastructure people and architects, view scale as: how will I scale vertically, or how will I scale horizontally? And so as I throw a bunch of workload at this, I'm just going to scale. I'll use the elasticity of the cloud and all the great constructs that are out there to make the capacity grow as much as I need, maybe over-provision a little bit, but I'll still get good scale out of it. Absolutely, 100% true, and a good way to build a scalable SaaS solution.
I just think it's not enough. I think you have to add to that definition. Yes, I want the scale of infrastructure to be part of my experience. But if you're in a SaaS universe, whether you like it or not, as a SaaS architect you're wearing a business hat a little bit more now; you have to think about how the SaaS business is going to scale.
So that means your definition of scale has to be bigger than just the infrastructure. Think about onboarding. What does it mean to onboard new tenants into an environment? A lot of organizations will actually put onboarding off till later; they'll focus all their scale on the application and then eventually say, yeah, we've got to go write that onboarding automation. Onboarding has to scale too. If I tell you tomorrow I'm going to give you 100 new tenants, or 1,000 new tenants, and I ask you whether your onboarding process can scale to meet that need, a lot of companies haven't even thought about whether that solution scales. B2C companies do, because they have to; they can only survive by being able to scale wildly. But a B2B organization that only gets maybe ten tenants a month might think they don't have to focus on this. I would suggest that you have to include onboarding as part of this experience.
Oops, went forward, didn't mean to; let's go back. I think operations is part of this story as well. You have to scale operations. You as an architect have to provide your teams with the tools, the mechanisms, and the constructs they need to be able to operate a multi-tenant environment at scale. You may not even know if you're scaling effectively if you don't have the data, the metrics, and the operational insights to see how your architecture is actually behaving.
And finally (the slide wants to move forward for some reason even though it shouldn't, let's try this one more time), deployment is part of this as well. How rapidly can we roll out features? How can we support feature flags? How can we support all the unique deployment footprints of these SaaS environments and still do that efficiently?
And so I could probably put more items in that box. But my big point is think beyond just the core infrastructure of your application. When you think about scale, think about everything in your business that has to scale as you add new customers. Ok?
So when I asked myself what the main things are that I'd put on one slide, if I could create just one slide that tried to say what's on my mind when I'm developing a scaling strategy, it's a mix of the things you see here.
On the left, you'll see I have workloads. We have to support all kinds of workloads with different profiles: different tenants potentially saturating the system in different ways, consuming different parts of the system in different ways and in different patterns all the time. So no matter what I'm doing for scale, I've got to figure that out; I've got to be thinking about that, and it's got to be part of my plan.
And then on the right-hand side, you'll see I've also got options in terms of how I can address those workloads, and that's a pretty long list of options. Certainly the compute stack I choose has a lot to do with how my system will scale. Am I going to do EC2? Am I going to do serverless? We'll get into these details.
"Am I going to do containers and how do those particular compute technologies align? Well, with all these varying workloads and all these other requirements, the storage that we choose, the storage stack we choose, are we going to do RDS or Dynamo managed not managed? Like there's all kinds of variation in here that depending on the nature of those workloads, depending on the goals of the business mean, you have to pick a different strategy and even here like domain industry, other kind of requirements that are just common environment compliance, things of that nature are going to influence your approach and then overlaying all of that is the stuff in the middle that you see here.
So you'll see like tearing like strategies, you'll see uh isolation, you wouldn't think of isolation when you're thinking about scale. But isolation actually has a lot to do with scale because depending on how i deploy the resources, but depending on the architecture, i choose some things will scale differently than others and some are good compromises on isolation. Some are not um same with noisy neighbor, that's here, right? What strategy is noisy? Which one of these are good for noisy neighbor if i have really spiky loads, which compute stack is the right stack.
So for me, this is like, if i'm at the very beginning of the process, and you said todd, what are you going to do? Where are you going to start? I'm going to try to get the business and myself to think enough about these things that i have a sense of like what the landscape of, of options i have and what choices that will you get it? All right. On the first day, you absolutely will not get it right. But if you don't do this at all and you just go grab a stack because it's the one you like. And you just go grab a database because it's the one you like. You still may not end up having made very good choices. At least here, there's a little bit of data in the process.
Now, if we just said scale, what's the simplest view of scale? If I could just do this one slide, you could all go home. This is the easiest and simplest view of what scale can look like in SaaS, which is: we put a bunch of tenants into a shared environment. Pool means infrastructure is shared; that's the term we use to describe shared infrastructure here, and everything is pooled for all tenants. Storage is shared, compute is shared, and all these tenants are just poured into this environment and it scales horizontally to meet their needs. We look at this as one big collective workload, we probably over-provision it a bit, and it sort of assumes that whatever the experience here is, the same is good enough for everybody.
But you can imagine that in this environment, scaling policies and mechanisms are going to be challenging, because it depends on the nature of these microservices and how they scale. There'd be a little work here to chase the scaling policies, which most people will just overcome with over-provisioning. But if this is all you need, and this is what you think your SaaS environment looks like, just use all the basic tools of scale that are available to you, take advantage of the goodness of pooled resources, and you're done.
The reality is that most environments I work with don't look like that. Some part of the environment looks like that, but it generally doesn't all fall out like that. In fact, the landscape of environments that SaaS companies are running is way different than a lot of people think; they are a mix of patterns.
So here you'll see I've got a pooled environment, just like we showed; it has order and product in it as two microservices, one running in Lambda and one running in containers for some particular reason, because of the nature of the workload. And then I have these siloed microservices, siloed meaning dedicated to an individual tenant. Here we've said, hey, the analytics service, based on SLAs or compliance or some other need of our solution, needed to be broken out as a separate, stand-alone, siloed microservice. And then we also have a set of customers who are what we call full-stack silo: they get an entire silo of their own. By the way, all these tenants and all these services are running the exact same version of the software, and they're all managed through the same experience, so it's not like we're having one-off versions here, but they are running on different infrastructure deployed in different patterns. And if somebody's willing to write you a big enough check (I've seen lots of companies where a customer says: we won't buy your system unless it's full-stack silo), or if you just want that to be your premium experience, you offer that as an option. So now, what does it mean to scale when this is the footprint? Because all of this is your whole environment; you have to come up with a way to scale when you have to support all of these strategies, and now the notion of scale gets much harder.
What does scaling look like in a full-stack silo? It looks way different than scaling in a pooled environment. So if we were going to try to put a little formula to this, I'd say: start with the personas. What I would go out and do is actually create some profiles. I'd ask myself: what are the consumption profiles, the isolation profiles, the tier profiles, and so on, of my environment? Take the mix of the personas, the different ways they are going to consume these resources, and the different ways they want to be isolated, and then lay those alongside the different options I have available to me. For example, how am I going to meet these needs with different microservice decomposition strategies? What's the right breakdown of my microservices? What's the right mix? Which ones should be siloed and which should be pooled to meet the needs of these specific workloads? So I'm not just arbitrarily taking domain objects, or doing event storming, or using one of these other techniques to come up with nouns and call those my microservices. No, I'm picking these microservices based on the actual profiles that you see here. And sometimes, by the way, you end up with microservices that aren't mapped to anything in the domain; it's just that this is a big bottleneck in the system, and we carved out this one little piece of functionality because it made sense to run it as a stand-alone microservice. Other things may be way more coarse-grained than you expected.
Then again, back to compute technologies: what are we going to do? Serverless with Lambda? Containers? Which ones fit? And then, what deployment models are we going to need to support? When I have some union of all of these things, I think I have a good sense of where I'm going to go. And don't think of compute as mutually exclusive here. We're not getting into control plane and app plane here, but if you look at our patterns, we'll say control plane and app plane are part of a SaaS environment, and I'll see some people using serverless for the control plane and containers for the app plane, or even, for some microservices, a batch workload versus not, containers versus Lambda; you could pick them for any number of different reasons.
And the bottom-right one, don't overlook that: deployment models. Ask your product teams now: who are we selling to? Are people going to come along and need full-stack silo? I want to know that now, so I can build for scale around that and come up with the scaling strategy for it. Just to drive home this point on decomposition: say I have this order service and it's just a bunch of Lambda operations; each function corresponds to some operation, and as a noun in my environment it totally made sense to just make this a microservice and be done. But after I got more data on how these profiles showed consumption and isolation needs, I came up with four separate microservices that represented this decomposition. This is an oversimplified example, but just imagine, for instance, that fulfillment was some huge scaling point of your system, or that fulfillment had some specific isolation need, and so on. I might decompose differently to achieve scale, based on the nature of the workloads.
The other thing we have to look at here is compute; I want to drill into picking compute a little more. The easy one here: I've got this order microservice, and it has a couple of operations on it, get order and update order. The unit of deployment is the microservice, and in EC2 it's just what AWS has been doing forever: elasticity, and we scale horizontally, and I can absolutely run in this environment. I will say, if you're building for scale and you're dealing with all these multi-tenant workload challenges, getting EC2 to spin up and react and respond to spiky loads can be tough in a multi-tenant environment, and this is why people will often over-provision in this scenario. So you do have to think about what it means to spin up these instances and how quickly they'll spin up. This probably works better in a pooled scenario, so if I send a whole bunch of tenants into it, I get more tenants scaling together, probably fewer idle resources, and probably a better fit where things get better.
Where I have done lots of talks is on Lambda and the fit between Lambda and multi-tenancy, because with Lambda we move to a managed compute service, and with a managed compute service the unit of scale is not the whole service; the unit of scale is individual functions. So if today somebody's really consuming the update-order function and doing nothing with get-order, and tomorrow that inverts, I don't really care; I'm just going to pay for whatever tenants are doing. In fact, I might not even care as much about how I break the service down and decompose it, because the service is still going to scale at the function level, not at the whole-microservice level. And this is just awesome in terms of getting you away from figuring out the right scaling policy and how you'll get this thing to scale; I'll just rely on this. And by the way, this works great for both silo and pool models. In fact, our serverless SaaS reference architecture has examples of both of those; I would absolutely recommend you take a look at that and see how it works.
And then of course containers: lots of SaaS companies just love EKS as a deployment model. It gives them lots of tools. What I like in the EKS space is that I also get new ways to think about deployment models. For example, namespace per tenant comes in as an option, so I can put namespaces in here, and I get ways to do things like node affinity and attach workloads to certain kinds of nodes; we'll look at that in a minute. So here I get fast-scaling environments, I don't have to over-provision a bunch, and I get really good efficiency out of this. But I also have another option here: you'll see AWS Fargate in the window. Fargate lets me bring the serverless option into the EKS space, so I don't even have to think about what nodes are running underneath the cluster, and I can operate in roughly the same mindset I do with Lambda. There are nuances there, but generally you get to bring that mindset with you.
So for me, when you're sitting down and picking your scaling strategy, where are you going? Again, you could just declare that one of these is absolutely the right one, but I'm going to look at my workloads and figure out what's best, and again, it all comes back to the personas and the nature of those workloads. But if we were going to pull this apart and ask how the deployment models affect scaling: well, if we look at siloed scaling, a silo is pretty predictable; you have one tenant in there, and if it's fully siloed, that resource's scaling profile is probably a little more predictable. It's like a traditional system: it has a life cycle, it may have an end of day for the business that's consuming it, with a tail at the end. You're not going to work too hard to figure out how to scale those environments.
But when you start putting resources into these pooled environments, any resource in a pooled environment, the consumption patterns are all over the place. So figuring out how to scale here is much more focused on how we deal with the peaks and valleys of these things. We don't have to worry about idle consumption here as much, but now we have to think about things like noisy neighbor and things of that nature.
Finally, one of the other units of scale, and I talked more about it at re:Invent this year in the cost efficiency talk, and I think it's valid here as well: I think generally we think about scale along service boundaries, like how does compute scale, how does this storage scale. But you can also take a broader view of scale and say: I'm going to create these notions called pods, and I'm going to put tenants into pods, a certain number of them in each. Say I've got eight tenants here; for just eight tenants, I can probably figure out how to define scaling policies for this pod that generally mean this pod is safe. It also isolates this pod, so its blast radius is limited to those eight tenants.
And then once I get comfortable with wherever the boundaries of that pod are and how many tenants I want to put into it, I can spin up another pod. I'm just sharding here, horizontally, on a pod-by-pod basis, and I put the next set of tenants in there. Some teams like this because they're not dealing with scaling policies all the way down to the individual service level; they're just trying to get the pod to scale successfully enough. And then also, by the way, if they get a tenant in one pod that's not fitting anymore, or they're struggling with that tenant, they will potentially migrate tenants between pods to deal with management and scaling issues. This just continues on, and we essentially scale out on pods, as the sketch below illustrates.
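To make the pod model a little more concrete, here's a minimal sketch of the kind of tenant-to-pod lookup that routing and migration tooling might rely on. The table name, attribute names, and the idea of an API-level router are my own assumptions for illustration, not part of any specific reference implementation.

```python
# Hypothetical sketch: resolve which pod (shard) a tenant lives in before routing a request.
import boto3

dynamodb = boto3.resource("dynamodb")
pod_map = dynamodb.Table("TenantPodMapping")  # assumed table: tenantId -> pod metadata

def resolve_pod(tenant_id: str) -> dict:
    """Look up the pod that owns this tenant."""
    item = pod_map.get_item(Key={"tenantId": tenant_id}).get("Item")
    if item is None:
        raise LookupError(f"Tenant {tenant_id} has not been assigned to a pod")
    # e.g. {"tenantId": "t-123", "podId": "pod-7",
    #       "apiUrl": "https://pod-7.example.com", "region": "us-west-2"}
    return item

def route_request(tenant_id: str, path: str) -> str:
    """Build the pod-specific URL the router would forward this tenant's request to."""
    pod = resolve_pod(tenant_id)
    return f"{pod['apiUrl']}{path}"
```

Because the mapping lives in one place, migrating a tenant between pods is just an update to this record plus whatever data movement your services require.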
Now, I would also say this has value in terms of scaling because it means you can support multi-region models as well. If I already have a pod as a unit of deployment for my environment, I can take these same pods and deploy them into other regions as well and have a multi-region footprint. There's lots of nuance to that; it doesn't come at zero cost, because it has more deployment complexity and more operational complexity. We now have to aggregate everything across the pods for operational views into it. But still, if you're talking about scale, it is another dimension. I think it's still somewhat debatable, but it is a way to approach scale a little bit differently.
The other thing, and this is something some of my team has started talking about and I'm still wrapping my head around it: as part of looking at all these workloads and scaling strategies, I have certain kinds of services that have very different footprints. Some are compute intensive, some are batch focused. Do I just write all my microservices, put them all on the same instance type, especially, let's say, in a container-based environment, and just assume that if that instance type needs more memory, or the workload favors a GPU model, I'll let that instance type scale out and it'll have to scale out to meet that load?
And that raises the question: could we connect specific workloads to specific instance types to optimize scale a little bit better here? Again, I think we're still thinking about this. But imagine, for example, these three made-up services, so who knows if they fit the profile I've got here, but the idea is: could I run these three different services on three different instance types? Could I use node affinity inside of EKS and bind certain types of workloads to certain types of instances, and would that yield a better scaling experience for my environment? If I have something really memory intensive, or something that would really benefit from a GPU, am I better off giving that workload a GPU? Well, it depends; there's a whole lot of "it depends" in there. The cost, how much it scales: there's a lot of math to do to prove to yourself that it's valid. But I think it's an interesting area to think about.
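As a sketch of what that binding might look like in Kubernetes, here's one way to pin a memory-heavy service to a specific instance type using a nodeSelector on the well-known instance-type label (a lightweight form of node affinity). The service name, namespace, image, and the choice of r5.2xlarge are all made-up illustrations, not a recommendation.

```python
# Hypothetical sketch: bind a memory-intensive analytics service to memory-optimized nodes.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "analytics-service", "namespace": "pooled"},
    "spec": {
        "replicas": 2,
        "selector": {"matchLabels": {"app": "analytics-service"}},
        "template": {
            "metadata": {"labels": {"app": "analytics-service"}},
            "spec": {
                # Schedule these pods only onto nodes of this instance type.
                "nodeSelector": {"node.kubernetes.io/instance-type": "r5.2xlarge"},
                "containers": [{
                    "name": "analytics",
                    "image": "example.com/analytics:latest",  # placeholder image
                    "resources": {"requests": {"memory": "8Gi", "cpu": "1"}},
                }],
            },
        },
    },
}

client.AppsV1Api().create_namespaced_deployment(namespace="pooled", body=deployment)
```

Whether this actually yields better scaling than letting everything share a general-purpose node group is exactly the math the talk is pointing at: you trade bin-packing efficiency for a closer fit between workload and hardware.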
And just to go one step further with this: if you look at an EKS cluster here, I've got a couple of nodes running on m5 instances. One of the cool things we have here is this tool called Karpenter, and Karpenter actually gives me a way to go after this problem within the EKS environment. With Karpenter, I can essentially tell it: here's a list of available instance types, and I would like you, as you schedule a set of pods onto nodes, to potentially assign different instance types to those nodes as they come to life.
I think this is super speculative, but I think it's interesting enough to start thinking about, especially the node affinity version of it. With the Karpenter version, I still have to figure out whether it schedules effectively enough to really get the right workload to the right instance type, but it seems intriguing. Now, the other part of this, which you may not think about at all, but I encourage you to really lean into, is what I said earlier about onboarding and why onboarding is part of this story. You do have to think about the scale of onboarding.
So if we have this onboarding process, you'll see the control plane in here: it creates tenants, it provisions tenant environments, and obviously it goes out to the app plane and sets up all the services we need for all these different deployment models. It talks to a billing provider to set up that relationship. There are a ton of moving parts to this. Ask yourself: what's a lot of scale for us? Today we're doing about 10; so what if I threw 100 at it, or 1,000? Find some upper limit that's practical but high for your organization. Do you scale effectively in that environment? Because imagine onboarding starting to fail. Say tomorrow, for whatever reason, the business needed to onboard a whole bunch more tenants, and our answer back to the business is: we can't onboard them fast enough, we're not going to be able to scale to meet that need. That's going to be a big hit to the business. So please give emphasis to that.
Now, if we look at the multi-tenant complexities that come with provisioning these environments, and we talk about this onboarding process, we have our control plane, and one of the things it has to deal with is these different deployment models. So when you go build these environments and you're looking at scale, one of the things you have to figure out is how to automate the deployment and configuration of each of these different tenant environments.
So if I have full-stack silo for my premium tier, I'm going to have unique Terraform, I'm going to have CDK bits, I'm going to have whatever my DevOps automation bits are that have to deal with the fact that full stack looks a little different than everything else. It might share some things, but it has its own nuances. I might have an advanced tier, and the advanced tier happens to have one service that runs siloed and the rest of its services running shared in the pooled environment.
I also have the model where I onboard a basic tier tenant and they just go to the fully pooled environment. Well, there's a lot to think about here, and if you look at our examples that are out there, we have examples that show all the moving parts of this. In fact, there's a great builder session here that shows how Helm and a bunch of other Kubernetes tooling is used to automate and control all of this experience in a way where you aren't just chasing all kinds of crazy one-off code to make it all work, but it's still one process end to end.
And then, as part of scale, we want to say: great, this works for one customer, but what if I throw a whole bunch at it? Will all that automation work? Do we have good fallbacks if some of these things fail along the way? How does the system know if something failed or succeeded? Lots of important questions to ask there.
Now, the other bit of this is deployment. Deployment is a little bit different than tenant provisioning. We onboard a tenant, we get them into the environment, we set them up and provision all their environments. But we also have the experience where a builder on your team somewhere is just writing some new microservice, and they don't care about deployment models; they shouldn't care about deployment models. But somewhere in your deployment pipeline, you still have to have this tenant awareness baked in.
So for example, say I'm using feature flags here, and I've got standard here and advanced here, and they all have all kinds of different settings, and this environment then has to deploy to all these different configurations. You have to figure out how to make this scale as well. What does it mean to push out a new feature, and how do you push it out with feature flags? What if you're doing A/B testing, or you're doing canary releases? Canary releases are really popular in SaaS environments. Well, now you have to do that against all these different deployment models. What does it look like? How does it work? How do you make it effective?
And then, just to make this a little more concrete, here's one example lifted out of the serverless SaaS reference architecture, which supports multiple deployment models. It has two stacks, you'll see here: a basic stack, and then one that's a fully siloed stack identified as tenant one. Basically, all the pooled tenants go into the basic stack, and then every new premium tier (or advanced tier, or whatever it is) gets its own entry into this table, because we have to keep track. This is an example of some of the work you have to do: you have to keep track of which tenants have been allocated, with which models, and where the resources are, so that when you come back to deploy all this stuff, you know where it needs to go and who gets what, including which flags might be on or off to know whether they get it.
And then here you'll see AWS CodePipeline on the right-hand side, which just goes through, gets the source, does the build, and then does the deploy, creating whatever entry needs to go into the stack and also using whatever configuration applies. Here, I have provisioned concurrency because it's a serverless environment, and my basic tier tenants get zero and my premium tier tenants get 50. I don't know if zero is a good idea there; we should think about that, I just put that in the diagram myself. And then obviously, now we have this tenant-one stack and we deploy it, and because we're using API Gateway and each tenant stack gets its own API Gateway, I also have to keep track of the URL that's the entry point for that.
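Here's a hedged sketch of the kind of bookkeeping that the deploy step leans on: a mapping table of tenant stacks and their per-stack settings that the pipeline iterates over. The table name, attribute names, and the CloudFormation parameter are illustrative assumptions, not the reference architecture's actual schema.

```python
# Hypothetical sketch: re-deploy the same template to every tenant stack, applying per-stack config.
import boto3

dynamodb = boto3.resource("dynamodb")
stack_map = dynamodb.Table("TenantStackMapping")  # assumed bookkeeping table

def stacks_to_deploy() -> list[dict]:
    """Return the pooled basic stack plus each siloed tenant stack, with its settings."""
    return stack_map.scan()["Items"]
    # e.g. [{"stackName": "stack-basic", "tier": "basic", "provisionedConcurrency": 0},
    #       {"stackName": "stack-tenant1", "tier": "premium", "provisionedConcurrency": 50,
    #        "apiUrl": "https://abc123.execute-api.us-east-1.amazonaws.com/prod"}]

def deploy_all(template_url: str) -> None:
    cfn = boto3.client("cloudformation")
    for stack in stacks_to_deploy():
        cfn.update_stack(
            StackName=stack["stackName"],
            TemplateURL=template_url,
            Parameters=[{
                "ParameterKey": "ProvisionedConcurrency",  # assumed template parameter
                "ParameterValue": str(stack["provisionedConcurrency"]),
            }],
            Capabilities=["CAPABILITY_IAM"],
        )
```

The point isn't this specific code; it's that the pipeline can't deploy correctly unless something tracks which tenants live in which stacks and what configuration each stack is entitled to.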
So when workloads are being processed here, all of this stuff is moving parts, and I show it to you not to teach you what the stack does; I show it to you to make the point that if you're going to scale, you have to scale in a way where all these mechanisms scale effectively with you. Are these the right tools to use? Are these the right mechanisms? You have to think about deployment as part of this story.
OK, resilience, and then we're going to get into validation and chaos at the end of this. So resilience, obviously, is key to every environment. In this case, though, it's more about the SaaS layers of resilience. We know the system has to be fault tolerant, we know it has to have all these circuit breakers and all these other patterns in the architecture to be resilient. But what else? I tried to break it down into a few key areas; this is the first time I've made this slide or tried to categorize these areas, and it's an evolving area for me. But I thought: if I have this whole SaaS architecture, it's got compute, it's got a control plane, it's got these tenants coming in from the top, what are the layers of this resilience story? And one of them is: how do I control the way tenants are putting load on my environment?
Right. Classic sort of question, which is just how do I make sure as they're coming in the front door, they're not coming in in a way that is saturating my system or putting a load in the system that's going to just bring it to its knees and just going to have it fail.
The other thing we're going to look at, and this is one you could argue isn't part of resilience, but I think it is: we talk about tenant isolation all the time. To me, part of resilience is making sure one tenant can't see another tenant's resources. So I feel like part of your resilience story is that you have to do everything you can to put all the pieces in place so that one tenant can't see another tenant's data, because if they do, that can be a huge event for a SaaS company.
Then there's the one we've really covered the most here. I'm not going to rehash it, but I have to include scale as part of resilience, because we do have to scale effectively. We have to have enough resources; if we don't scale well enough and the system falls down because we can't scale, that's going to be a problem.
And then the one you might not be thinking of: do we have enough visibility into what the system is actually doing? How is it scaling? How are tenants scaling? How are microservices scaling? How are they putting load on the system, and how is the system behaving based on that load? If you don't have visibility into that, and this is mostly about surfacing it, you don't know if you're scaling effectively, and to me you don't know if your system is resilient. Resilience is partly the ability to detect things before they actually go wrong; if you don't have a way to detect them before they go wrong, you're not going to be able to use all those other approaches to achieve resilience.
And then the last one here: onboarding and deployment resilience. We have to think about those two pieces of this as well. How well does onboarding withstand issues? How does it recover from issues? How does deployment handle failures and recover from failures, so we're as sound as we can be there? So now we start at that front door and we work our way in.
We're going to start with the most basic theme, and this actually applies to any system: if we're going to prevent users from imposing excess load on our system and bringing it to its knees, we have to put throttling mechanisms into place. For me, I've shown one easy example here, which is API Gateway. API Gateway has this notion of a Lambda authorizer, which we'll look at in a second, that lets me define policies and control the kind of load tenants are putting on my environment.
But this goes deeper than that. This whole discussion of throttling is a layered discussion. Even after I get through the front door of the application, as I'm going service to service and service to storage, all of those parts and layers of my system should be asking whether to implement some kind of throttling there. Do I have provisioned capacity on my storage? What are the mechanisms, the knobs and dials somebody's giving me, so that I can control the workloads and the demands that tenants are putting on my environment? And to make this a little more concrete, let's take this and put tiers on it.
So here I've got four different tenants; it's actually three tiers, with two tenants in the platinum tier. They're coming into my environment and hitting the API Gateway, and I hit the authorizer to figure out which tenant maps to which API key. With a Lambda authorizer, I can have API keys that are connected to usage plans. I basically have three different API keys here, and those three API keys map to usage plans. All within the Lambda authorizer, I resolve the incoming tenant's tier to a usage plan, and that usage plan configures the authorizer policy, which is applied in API Gateway as we go downstream.
And if you've exceeded the quota, your request won't go through. The other piece I wish I'd put on this diagram is that you can also use that authorizer policy to control which methods and entry points are visible. So if somebody's trying to access an entry point on the gateway that's not valid for their role or for something else, you can block that path here, which to me is another great tool to have in the resilience story.
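Here's a minimal sketch of what that authorizer might look like. The tier-to-key mapping, the claim names, and the skipped JWT signature verification are simplifications of mine; the real mechanism is that API Gateway, when the API's key source is set to AUTHORIZER, applies the usage plan (throttle and quota) attached to the API key the authorizer returns.

```python
# Hypothetical Lambda authorizer: map the caller's tier to an API key whose usage plan
# defines the throttle/quota for that tier, and return the IAM policy for the request.
import json
import base64

TIER_API_KEYS = {  # each key is attached to a different API Gateway usage plan (assumed)
    "basic": "basic-tier-api-key",
    "advanced": "advanced-tier-api-key",
    "platinum": "platinum-tier-api-key",
}

def handler(event, context):
    token = event["authorizationToken"].replace("Bearer ", "")
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)          # re-pad the base64url segment
    claims = json.loads(base64.urlsafe_b64decode(payload))  # verify the signature in real code
    tier = claims.get("custom:tier", "basic")
    tenant_id = claims["custom:tenantId"]

    return {
        "principalId": tenant_id,
        # API Gateway applies the usage plan associated with this key to the request.
        "usageIdentifierKey": TIER_API_KEYS[tier],
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": "Allow",                 # could also scope to specific methods per tier
                "Resource": event["methodArn"],
            }],
        },
        "context": {"tenantId": tenant_id, "tier": tier},
    }
```

Scoping the Resource list in that policy is where you'd also block entry points that aren't valid for a given role or tier.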
But think about what I've done here: I've actually equated tiering to resilience. I think tiering is part of your resilience story. If that basic tier tenant is imposing a ton of load on the system and they're affecting the availability of a platinum tier tenant, that's a resilience problem; that's how I categorize it in my system. So I'm going to put policies on that basic tier tenant that say: you're going to get cut off at a certain level, and that's intentional.
And so when they call me and say, wait a minute, I'm getting throttled, what's going on? I'm going to say: well, become a standard tier tenant, or move up to the platinum tier if you want better throughput. Most of the time it's going to be OK, and I'm going to set that policy somewhere where it's not kicking in all day long. But I'm still going to set it, because I don't want to be there on the one day they choose to go crazy and end up affecting all my other tenants.
Just to show you one other approach, and not make this all about API Gateway: here with Lambda, you'll see I can actually use reserved concurrency. I could deploy my tenants into three separate tiers, with separate copies of the functions but different reserved concurrency values. Reserved concurrency controls how many concurrent executions I can have for a Lambda function. And here, with 100 at the basic tier, you'll hit the concurrency wall there; with 300 at advanced, you'll hit it there; and with premium, everybody gets more.
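As a quick sketch, this is how those per-tier caps could be applied to per-tier copies of the function. The function names are made up, the 100 and 300 values echo the slide, and the premium value here is simply a higher made-up cap rather than anything prescribed.

```python
# Hypothetical sketch: apply a different reserved concurrency to each tier's copy of the order function.
import boto3

lambda_client = boto3.client("lambda")

TIER_CONCURRENCY = {
    "order-function-basic": 100,      # basic tier hits the concurrency wall first
    "order-function-advanced": 300,
    "order-function-premium": 600,    # assumed value for illustration
}

for function_name, reserved in TIER_CONCURRENCY.items():
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=reserved,
    )
```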
So for me, this is just another way to implement the same mindset I talked about on the prior slide. Now, the area of resilience that's a little harder to classify is this notion of resilient storage. How do we build resilience into storage? This is one of the toughest and most difficult areas I talk about, because so many people who pick storage are also picking, or have to pick, some compute size when they're picking it.
So they go out and say: hey, my tenant one is a silo, and they're going to start out on a db.m3; silo two is on a db.m5; and then I've got all these pool tenants who are going to run on a db.m5 as well. Is it the right size instance? I really don't know. That's usually why we over-provision here; it's the only option we have. But that's not really resilience to me; that's just hoping that you've put enough capacity there that you're really not going to fail.
And then what you're really trying to do is right-size for efficiency; this is efficiency versus scale fighting with one another. I don't want to over-provision too much. So then you'll see on the right-hand side, as I move over there, I start adjusting the size of the instances, which organizations do, and I have no clue what to do with pool, because remember, that graph for pool is all over the place. I probably can never size it down, because who knows, on the one day somebody goes wild with something, they might take all my pooled customers down. That's not a good moment, right?
So then what's the answer here? There is no magic answer for this one; I wish I had one for you. I do think a hint of the answer is serverless storage. There are lots of good storage options now on AWS that have a serverless option: DynamoDB and Aurora Serverless, and I think EMR has serverless, OpenSearch has serverless now, so more and more serverless is finding its way into the storage stack. And the more you can put your storage on those models, the more resilient your storage strategies are going to be, because you're not so tightly coupled to a fixed size.
I will also say these storage mechanisms have their own knobs and dials around provisioned throughput, what capacity you're given, and whether you're on demand or not. So there are a lot of tools you can use there to deal with resilience as well.
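As one small, hedged illustration of that direction, here's a pooled, tenant-partitioned DynamoDB table created in on-demand mode, so storage capacity follows tenant load instead of a pre-sized instance. The table and key names are illustrative only.

```python
# Hypothetical sketch: an on-demand, tenant-partitioned table for pooled storage.
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="OrderData",
    AttributeDefinitions=[
        {"AttributeName": "tenantId", "AttributeType": "S"},
        {"AttributeName": "orderId", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "tenantId", "KeyType": "HASH"},   # pooled: tenant is the partition key
        {"AttributeName": "orderId", "KeyType": "RANGE"},
    ],
    BillingMode="PAY_PER_REQUEST",  # on-demand: no capacity to guess at or over-provision
)
```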
Now, the other bit of resilience is: where are my fault boundaries in my environment? For example, I look at onboarding, and onboarding may have interactions with an identity provider. It may have interactions with tenant provisioning, which is going off and asking another service to provision all your resources, and then it may have interactions with billing, which may be a third-party billing provider.
Well, all of those potentially asynchronous, potentially third-party dependencies are a fault-tolerance point of contact for your system. If the billing system is down, what do I want to do? This is where you've got to have fallback strategies. If you look at classic resilience strategies: maybe I'll let the billing fail right now, come back later and retry it, but let the system go forward with the rest of the onboarding experience.
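Here's a minimal sketch of that fallback: if the third-party billing call fails, park the request on a retry queue and let the rest of onboarding continue. The billing client, queue URL, and payload shape are all assumptions for illustration.

```python
# Hypothetical sketch: don't let a billing outage block tenant onboarding; defer and retry.
import json
import boto3

sqs = boto3.client("sqs")
BILLING_RETRY_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/billing-retry"  # placeholder

def register_with_billing(tenant: dict, billing_client) -> None:
    try:
        # billing_client is an assumed wrapper around your third-party billing provider's API
        billing_client.create_customer(tenant_id=tenant["tenantId"], plan=tenant["tier"])
    except Exception as error:
        # Fallback: queue the registration for a later retry instead of failing onboarding.
        sqs.send_message(
            QueueUrl=BILLING_RETRY_QUEUE_URL,
            MessageBody=json.dumps({"tenantId": tenant["tenantId"], "tier": tenant["tier"]}),
        )
        print(f"Billing registration deferred for {tenant['tenantId']}: {error}")

def onboard_tenant(tenant: dict, billing_client) -> None:
    # ...create identity, provision tenant resources, etc....
    register_with_billing(tenant, billing_client)  # a billing failure does not block the rest
```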
The same thing is true for deployment. If I'm running all these different deployment mechanisms, and part of that deployment is running all this Terraform, running all the CDK code to do this, and that's probably happening somewhat asynchronously as well: what do I do when it fails? You probably have more control over this, because it's probably more your own code, but you still have to have a strategy. Are you going to retry it? Are you going to clean it up and retry it? What are you going to do?
You also have to have a strategy around this whole notion of isolation. I told you isolation is part of the resilience story, and to me, figuring out how you're going to build resilience in for isolation depends on how you've deployed your environment. For example, if I'm using some notion of account per tenant as my deployment model, then I have to think about how I prevent cross-account access between those accounts; that's probably an easier one.
If I'm doing a VPC for every tenant, then what's my strategy, and what's my resilience strategy? How do I make sure these VPCs are successfully isolated from one another? There are good constructs available for you there. And then when you get more granular, we get down to service and resource isolation, and that's where it gets way more tricky.
How do I make sure one service can't call another service if they're both siloed? In some cases you want them to talk, but in some cases you don't. And for every other AWS service I'm talking to, how can I make sure I'm only talking to the view of that service that I'm supposed to get for that specific tenant and that specific tenant context? You have to come up with a strategy for that, and generally I tell you tenant isolation is something you have to do.
I'm connecting tenant isolation to the resilience of your system. I'm saying: do tenant isolation like we've always said to do it, but think about it through the lens of resilience.
The last bit of resilience here, and this one is maybe a little vague and maybe a little too hopeful, but I actually feel like part of resilience is moving code and policies away from builders. I don't want my isolation policy sitting in the code of every single microservice; I want there to be a generic mechanism that handles that.
How do I unpack JWT tokens to get tenant context out? How do I record metrics and logs? I don't want my builders to be the ones applying that policy over and over across their code; that is going to create issues, and it's going to create resilience problems for me.
So in this particular case it's Lambda, and I've got a Lambda layer. This could be EC2 with Java libraries, it could be JARs or whatever; my point is that you, as the architect, are going to move these policies outside the view of developers and just ask them to use them.
So in this particular case, you'll see I've got a product and an order service, and all the multi-tenant policies being applied here are applied in a layer that's shared across all the services. So now if I want to go change something about the way logs are injecting tenant context, or the way tokens are being handled, or the way IAM policies are being used to assume roles to get tenant scope, that's all outside of their view.
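As a hedged sketch of what might live in such a shared layer (or library), here are two helpers: one pulls tenant context out of the incoming JWT, the other emits logs already stamped with that context, so builders call the helpers instead of re-implementing the policy. The module and claim names are illustrative; real code would verify the token signature.

```python
# Hypothetical shared-layer helpers for tenant context and tenant-aware logging.
import json
import base64
import logging

logger = logging.getLogger("tenant")
logger.setLevel(logging.INFO)

def tenant_context(event: dict) -> dict:
    """Decode the bearer token on the request and return the tenant claims."""
    token = event["headers"]["Authorization"].replace("Bearer ", "")
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)              # re-pad the base64url segment
    claims = json.loads(base64.urlsafe_b64decode(payload))  # signature verification omitted here
    return {"tenantId": claims["custom:tenantId"], "tier": claims.get("custom:tier", "basic")}

def log_with_tenant(context: dict, message: str) -> None:
    """Builders call this; the layer decides how tenant context gets injected into logs."""
    logger.info(json.dumps({"tenantId": context["tenantId"],
                            "tier": context["tier"],
                            "message": message}))
```

If the way you stamp logs or assume tenant-scoped roles needs to change, you change it here once rather than in every microservice.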
So for me, that is a great defensive tactic that will lead to higher resilience in your system. It's also just a good practice.
Finally, the last bit I talked about: OK, we build scale, we build resilience, and most people just stop there. Like, great, we wrote good code, it seems to work really well, that's enough.
I feel like these mechanisms are so nuanced that if you don't know whether they're working, you really haven't proved that you've achieved what you were after.
So here, if you look at the common pieces of this, and by the way, there's nothing uniquely SaaS about this, it's the basic notion of chaos testing and just good testing: I'm going to go find the resilience profiles and the scale profiles that I'm after.
What are those different consumption profiles we talked about earlier? What are the different isolation profiles? I'm going to use all of that as input to this experience: I'm going to define all these different workloads and these different tiers, and I'm going to use that data as input to my application.
So I'm going to generate some tenant population, build out some population of tenants that matches the profile I'm after, have some automated auth into this experience, and then, because we're trying to do something that's load based, run all these parallel workloads exercising my environment and stressing these bits.
And the whole idea here is: build all of these strategies and be interested in the left-hand side of this. I think a lot of people are interested in the right-hand side; they want to go right to the code that does the right-hand side. But the cool things that are going to uncover issues with your scale and your resilience happen on that left-hand side. What kind of data is going to be put into the process? How many tenants, and in what profiles? Should I do a bunch of premium tier tenants and very few pooled tenants, then switch and invert that in different patterns, and see how it reacts and responds?
Now simulate real tenants executing in that environment: does it scale and respond the way you expect? This will tell you a ton about how your environment is performing before it goes out into the wild, and it will uncover things along the way that you may not have realized were a problem.
And I think it's really just about going after the high-value scaling strategies. You don't want to do this to cover every single scenario and every bit here, but you'll know where those bits of your scaling and resilience strategy are, and you'll know how to go after them.
And so if you look at exercising that for noisy neighbor and scale and resilience: this is a noisy neighbor scenario. I've created a bunch of noisy neighbors with different profiles; the different colors represent the different tenants, or the outliers in these groups, that are doing different things. I create some noisy neighbor orchestrator here; this is new code you're going to have to go write, or some third-party tool, and there are good tools that do this stuff as well. It's going to go out and provision tenants, and provision the app plane for those tenants, and then I'm going to run these different workloads through this load simulator.
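Here's a hedged sketch of what that load driver might look like: several tenants generate steady traffic while one deliberately noisy tenant hammers the API, so you can watch how throttling, scaling, and the ops dashboards respond. The endpoint, tokens, and request rates are all made up.

```python
# Hypothetical noisy-neighbor load driver: parallel tenant workloads with one outlier.
import time
import threading
import requests

API_URL = "https://api.example.com/orders"  # placeholder endpoint

TENANTS = [
    {"token": "token-basic-1", "requests_per_sec": 2},
    {"token": "token-basic-2", "requests_per_sec": 2},
    {"token": "token-platinum-1", "requests_per_sec": 5},
    {"token": "token-noisy", "requests_per_sec": 50},  # the deliberate noisy neighbor
]

def drive_tenant(tenant: dict, duration_sec: int = 300) -> None:
    deadline = time.time() + duration_sec
    while time.time() < deadline:
        response = requests.get(API_URL, headers={"Authorization": f"Bearer {tenant['token']}"})
        # 429s for the noisy tenant are a good sign; 429s or timeouts for everyone else are not.
        print(tenant["token"], response.status_code)
        time.sleep(1 / tenant["requests_per_sec"])

threads = [threading.Thread(target=drive_tenant, args=(t,)) for t in TENANTS]
for t in threads:
    t.start()
for t in threads:
    t.join()
```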
And then the key piece of this, which most people miss: as I hit the application plane, I'm going to observe the operational view of what happens when I run these loads. A lot of people are like: great, it survived, I looked at New Relic or Datadog or AppDynamics or whatever, and it looks like everything's healthy. Good, thumbs up.
No, I want to go see, inside the dashboards that the ops people are going to look at, whether I can create conditions where I can see: oh, something's getting saturated. Is it getting surfaced? Are the alerts and alarms going off that are supposed to go off? I want to know that the operational view of this is going to have the visibility into it that I expect when I put these different demands on the system.
That's the missing piece I think for a lot of this is people don't go all the way to the operations side of this experience.
The other bit is these async integrations we talked about, and having these graceful, fault-tolerant experiences. We already talked about this: onboarding hits tenant management, tenant management hits tenant provisioning, which provisions the app plane, and billing goes out to a third party. I want to handle these failures here, and wherever else you have them: anywhere you've got a third-party dependency, anywhere somebody else's availability or resilience is outside of your control, what's your fallback strategy? That's fundamental resilience, but it's more important in a multi-tenant environment.
And it's more important in this control plane, because you're orchestrating most of this experience there; your control plane is going to have to be able to handle these bits. You also need to validate your onboarding experience. I said you need to have a scalable and resilient onboarding experience. This means taking different profiles of tenants, with different distributions of load and different makeups, and running them through your onboarding process, and proving to yourself that when we onboard a platinum tenant, which does a full-stack silo, versus a basic tenant, which only configures some new environment settings but doesn't really provision much new infrastructure, and we do those things in different combinations or some at the same time, the system handles all of that effectively and all the pieces do what they're supposed to do. A lot of people have a lot of confidence, even when they've built full onboarding, and then they start doing this kind of testing and start finding out: oh, there are little things that go wrong; they just don't happen a lot, so we don't see them.
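One hedged sketch of that kind of onboarding validation: push a mix of tenant profiles through onboarding at the same time and verify each one lands in a completed state. The onboarding endpoint, payload shape, and status values are assumptions about your own control plane.

```python
# Hypothetical onboarding load test: a mix of tiers submitted concurrently.
import uuid
import concurrent.futures
import requests

ONBOARDING_URL = "https://control-plane.example.com/tenants"  # placeholder endpoint

def onboard(tier: str) -> dict:
    tenant_name = f"loadtest-{tier}-{uuid.uuid4().hex[:8]}"
    response = requests.post(ONBOARDING_URL, json={"name": tenant_name, "tier": tier})
    response.raise_for_status()
    return {"name": tenant_name, "tier": tier, "status": response.json().get("status")}

# A mix that exercises both paths: full-stack silos and config-only pooled tenants.
profiles = ["platinum"] * 3 + ["basic"] * 20 + ["advanced"] * 5

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(onboard, profiles))

failed = [r for r in results if r["status"] not in ("provisioning", "complete")]
print(f"{len(results) - len(failed)} onboarded, {len(failed)} failed: {failed}")
```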
The other one, and I think this is one of the hardest: we said isolation is part of the resilience story. What do we do? How do we test for that? So if I have this environment, it's using the token vending machine, and I've got a tenant one that's coming into this environment and trying to access data. Tenant one's database and tenant two's database are both here, and tenant one is obviously only supposed to see tenant one's database. Well, what happens? I've done the token vending machine, it's assuming a role, all those things are there. How do I prove that it's going to work if something goes wrong? And really the only option you have here is, either in the code or through a third-party tool, you have to inject a context that says, no, you're not tenant one, you're tenant two. I'm going to inject the JWT, I'm going to do something that changes the context, and then prove that you've injected it at a point where the IAM policy is going to say no, you can't, you're trying to cross the boundary. And again, when somebody does try to cross a boundary here, does something land in the operations dashboard that says there was an attempt to cross a boundary between tenants? I want to know if that happens in my environment. Anything that tries to cross the boundary, even if it's rejected, means something's going on. Maybe somebody's trying to corrupt the system, so I want to know about it.
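Here's one way that injection-style isolation test might look, as a sketch; the signing key, claim names, and endpoint are all assumptions, and in practice you'd sign with your identity provider's test key or use a third-party tool rather than a shared secret:

```python
# Cross-tenant isolation test sketch: mint a token whose tenant claim is
# deliberately wrong and prove the boundary holds. All names are placeholders.
import jwt       # PyJWT
import requests

SIGNING_KEY = "test-only-secret"                              # never a production key
API_URL = "https://api.example.com/tenants/tenant-2/orders"   # tenant two's data

# A token that claims to be tenant one but asks for tenant two's resources.
forged = jwt.encode({"sub": "user-123", "tenantId": "tenant-1"},
                    SIGNING_KEY, algorithm="HS256")

resp = requests.get(API_URL,
                    headers={"Authorization": f"Bearer {forged}"},
                    timeout=5)

# The scoped policy behind the token vending machine should say no.
assert resp.status_code in (401, 403), "cross-tenant access was NOT blocked"
print("Boundary held; now confirm the attempt landed in the ops dashboard")
```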
And then I said we have this operations experience. Now that we have this awesome operations experience, how do we feed it as much as we possibly can? What are all the tests that prove the operations experience is working the way we expect? See this as something you and the ops team do together; if you're in ops, great, you already do this, but treat it as a combined effort. And I would say to everybody that this is as valuable to dev and QA as it is to production. Have you ever tried to build a multi-tenant system where everybody's building, everybody's pushing stuff, and of course it's broken because it's new? The tool I want to go to in the dev environment is a tool like this, to ask what's going on, because I'm still sorting out what my new service is doing alongside all the other services and I want to see what's happening. And as a QA person, I want to simulate these load issues and see: is it really doing what it's supposed to do?
Ok, a few takeaways here. I hope it's clear that the story for resilience and scale is not just the traditional notion of scale and resilience. Multi-tenancy, to me, absolutely adds a whole new layer of considerations on top of that. So when you're thinking about scale and resilience for your solution, absolutely be thinking about what the multi-tenant version of this story is and what new things you ought to be thinking about.
I expect efficiency and scale to always be competing with one another here, because they're always pulling at each other. We want to be as efficient as we can; I told everybody to build good, efficient environments. But you're always saying, yes, but scale within reason: I want to give myself enough room to be sure we're not having outages and not scaling in a way that fails to meet the needs of the system.
And I will also say, everything in SaaS is a mix of business and technical, but scale is absolutely that. If somebody tells me, go build a scalable SaaS environment, what's the right way to build it? I'm going to say, well, what deployment models are you going to have? I'm going to ask a hundred questions about what they want to do to figure out what version of scale is right. So you have to lean into the business here to find the right answers.
Um and then I think you should be looking at how you can use different deployment strategies and tools for scale and resilience: how do you deploy and exercise these environments to really prove that these things are working?
And obviously, to me, throttling isn't just a tiering strategy; throttling is also a fundamental strategy in a multi-tenant environment. I want to throttle and make sure nobody's pushing excess load into the environment in a way that's going to impact everybody else's experience.
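As one concrete illustration of tier-based throttling on AWS, here's a sketch that sets up API Gateway usage plans per tier with boto3; the limits and the API id are illustrative placeholders, not recommendations:

```python
# Tier-based throttling sketch: one API Gateway usage plan per tier.
# Each tenant's API key would then be associated with its tier's plan.
import boto3

apigw = boto3.client("apigateway")

TIER_LIMITS = {
    "basic":    {"rate": 10.0,   "burst": 20,   "monthly_quota": 100_000},
    "advanced": {"rate": 100.0,  "burst": 200,  "monthly_quota": 1_000_000},
    "platinum": {"rate": 1000.0, "burst": 2000, "monthly_quota": 10_000_000},
}

for tier, limits in TIER_LIMITS.items():
    apigw.create_usage_plan(
        name=f"saas-{tier}",
        throttle={"rateLimit": limits["rate"], "burstLimit": limits["burst"]},
        quota={"limit": limits["monthly_quota"], "period": "MONTH"},
        apiStages=[{"apiId": "abc123restapi", "stage": "prod"}],  # placeholder API
    )
# A noisy basic-tier tenant is now capped before it can erode the experience
# of tenants in the other tiers.
```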
And by all means, build interesting workload profiles; figure out what the workload patterns in your environment actually are. It may not sound like the most interesting thing you could be doing, but if you invest time in simulating these workloads, figuring out really interesting personas, and getting your product owners and others invested in that, you will have very interesting data to feed into your scale and resilience work.
Uh and then by all means, don't make validation of all this stuff an afterthought. I would build it and validate it immediately; it's almost like test-driven development: I built something really cool, it's supposed to do something really cool, how do I prove to myself it does that thing? Don't go overboard here, but at least do enough to prove to yourself that the fundamentals are doing what they're supposed to.
Ok, and now just a few highlights in terms of other sessions. I can't figure out which of these have happened yet and which haven't; it's all a blur to me at this point. But here are the breakouts related to SaaS that are going on. If you're interested, here are some chalk talks that are still coming up. Federated identity is interesting. There's a chaos talk, which is a variation of what I talked about here, if you want to go deeper on that in a chalk talk setting; I'll be doing that one, and we'll go deeper into this notion of validation and testing.
There are great workshops out there too. The SaaS survivor one, I think, is a really cool one that's about operations and testing, all this operational tooling and so on.
And that's it. I really appreciate you being here for this session. I hope you got a lot of value out of it, and that it gives you a general sense of the things you've got to think about when you're thinking about scale and resilience for a multi-tenant environment. I hope it pairs well with the more prescriptive, concrete tools that we give you. Enjoy the rest of your re:Invent and have a good day.