SaaS architecture pitfalls: Lessons from the field

All right, everybody. Let's get started.

First, I want to thank you all. I know it's the last day of re:Invent, and this is probably the last session any of you will be able to attend. Thank you for taking the time to come out and join us today. We're going to be talking about SaaS architectural pitfalls.

These are the lessons that our team, the AWS SaaS Factory, has learned over the last few years working with hundreds of SaaS companies, and those lessons range across a wide variety of stages and types of companies. So hopefully there'll be something in here for all of you. If not, feel free to boo, I won't mind.

So with that, let's jump right into the promise of SaaS and start talking. Oh, by the way, I'm Bill Tarr, a senior solutions architect with the AWS SaaS Factory.

So let's start off with the promise of SaaS and start talking about what we expect to get when we think about building SaaS. And for me that starts with agility and flexibility. Now, those of you from a development background might immediately start thinking, oh, he's talking about agile methodologies, but with SaaS, that's not really the gist of what we're trying to get to. We're talking about organizational agility. We're talking about being able to build a product that can move between markets and between regions, and easily adapt to changes in market conditions.

And that agility also comes in the same sense that we were thinking about with agile methodologies: the ability to deliver rapid innovation. If you're building a SaaS product, you're able to deliver value to your customers continually. We want to get to a point where our customers aren't expecting releases quarterly or monthly. Instead, they're expecting their bugs to be fixed within hours or days, and they're seeing new releases and new features continually, keeping them delighted with the platform and reducing churn.

And all of this comes with both a benefit and a responsibility: being able to operate your platform for all of your tenants through a single view. Sometimes we call this a single pane of glass, but it's the ability to see all of your tenants and their experiences, and to be able to adjust to those experiences on the fly with a relatively lean team and at relatively low cost.

And we've been talking about SaaS as a growth model for years. 2023 has been a very interesting year for a lot of SaaS providers. There have been a lot of economic challenges that have changed the landscape for us, and increasingly I'm starting to think about growth in terms of sustainable growth. Increasingly, investors, stakeholders and executives are asking us to prove that what we're building is actually going to become profitable over time.

So it's not enough just to attract logos, to get new companies. Instead, we have to prove that what we're building is going to be profitable and continue to grow those customers over time. And of course, that involves cost efficiency. If we're not building cost-efficient SaaS products, we're not going to have a successful operation over time. But with SaaS, that promise isn't simply cost efficiency for the bottom-line cost; rather, it's cost efficiency in how we build our platform. That's one of the things we'll be talking about today. Every SaaS journey is individual, but we tend to think of it as a series of steps, and the pitfalls also fall into those steps. It starts right up front with our envisioning stage.

That's when we start to think about whether we should even be building SaaS. Do we have a fit between the product we're trying to build and the SaaS delivery model, or is there a different delivery model? Do we have enough customers, enough of an addressable market, to make the investment into SaaS worthwhile? Our team does try to talk to customers to make sure that when they think about building SaaS, they understand the value proposition of SaaS and aren't simply building it because they think it's the next great thing to build, but really have a strong idea of the value their customers are going to get out of the product.

Then it happens in the design phase. There are a lot of solutions architects in the room; you break out your markers and you start drawing on the whiteboard. This is another area a lot of pitfalls can fall into. When we're thinking about what we're going to build, we have to be thinking far enough ahead and have a roadmap for what our customers are going to be asking of us. We have to be thinking about those customers, what markets they're in, what regions they're in, and planning to avoid some of the technical debt that can occur in the later phases of our development.

And of course, build. If we're going to build a product, are we building something that our customers are expecting and that we're promising to our customers? If our sales teams are selling this thing to customers, what are the promises they're making around SLAs? What are the expectations for how this should perform, and have we really thought through what that experience should be for our customers and how they expect to consume our product?

This may be one of the more interesting ones. Operations is really where we see the vast majority of the time spent on a SaaS product. We spend six or twelve months building it, but we've got to operate it for the rest of time, right? So we have to be thinking about how we operate it efficiently, how we create a product that can be sustained by a team without continually doing repetitive tasks, how we automate all of the things that we've built. And the last phase may actually be something of a phase by omission.

We have to continue to optimize our product and iterate on it. It's not enough to simply build it and operate it. We have to continue to revisit all of the decisions we made over time, and continue to get better at understanding our customers and what they're asking for, and of course the cost profile, operational profile and security of what we've built.

So we saw that promise of SaaS. I've taken that slide and thought about some of the things that challenge those specific promises, and it starts in that envisioning phase: we often don't take the time to think about our customer profiles. What are the types of customers? What are the roles who are going to use our applications? What industries are they in? What regions are they in? If we don't think about all of these things ahead of time, we're not going to achieve that agility and flexibility. Instead, we're going to be surprised when our customers come to us and tell us, you know what, we actually need HIPAA compliance. How long is that going to take? It's going to take a minute, hold on, let me go ask. Right? We have to get ahead of the curve on those if we want to achieve the agility and flexibility that was promised in SaaS.

And this might not seem like something you have to think about up front, but in the envisioning stage of what we're building, we want to get into the mindset of maintaining a single version of our software. This is something a lot of people miss, especially in the early stages of building SaaS. Sometimes we have to make concessions as we're getting toward product-market fit and figuring out whether our product is really going to be successful and adopted by our customers. But we have to maintain the view that having a single version that we operate for all of our customers is the goal, or we're never going to get to the rapid innovation that was the promise of SaaS. Instead, our developers are increasingly going to be pulled into operating all of these different versions of our software that we've built for individual tenants, and it's going to get harder and harder to grow our platform.

And you can see the proliferation of people here. If you don't invest in operations, in that operational excellence we were talking about, it's going to become human processes instead. We're going to get to a point where individuals are being sucked into onboarding processes, individuals are switching between accounts trying to understand the different tenant experiences. And instead of a team that's focused on delivering value to customers, which is the goal of SaaS, we have a team that's invested in operating it. That bogs down more and more, the team gets bigger and bigger, and what they're doing isn't rewarding and isn't adding value to customers.

And we said sustainable growth was one of the promises of SaaS. So not having a targeted growth strategy, not understanding how we're going to grow our product, is a pitfall. Planning ahead in this envisioning stage means thinking about who the logos are that we want to address, and what industries outside of our target industries might be future industries this product could also address. Getting ahead of that and thinking about how we're going to grow the product is an important step of envisioning the product. And if we can't do that, if we can't think of how this product is going to be successful, if we don't have a plan for it, it won't be.

This is another one that directly challenges the promise we talked about. I hear lots of customers coming to me saying, my executives are saying our AWS bill is too high. That's the wrong conversation to be having about SaaS. If you're building a SaaS product, your AWS bill isn't your overall problem. It's how we think about the cost efficiency of what we're building and the units within that. That's going to be one of the pitfalls we'll spend a lot of time on in a little while.

And with that, we're going to jump into some of these architectural pitfalls. We've gotten past the envisioning stage; we're going to talk about some of the build and design challenges that we see customers fall into.

Now, deployments as a one-way door. A lot of you will be familiar with tenant isolation, I think, right? The idea that everyone could get their own stack, or everyone shares a stack of infrastructure. These are deployments, and deployments are often something we think about during the design phase. We say, you know what, our customers are really going to demand a completely siloed version of this, or we're going to build a shared version of this. Thinking that this is the last time we're going to revisit this decision is one of these pitfalls. I often meet customers in the field at the point this quote is coming from, saying: we've got lots of great enterprise logos, customers coming on board, but we're having a hard time growing. We can't move out of this market that we're in, we can't address other markets. We're not able to do a free tier. We're not able to switch strategies to product-led growth, which we could talk about a little as well.

We've put ourselves in this box, and now we can't find different price points to reach our customers at. How do we get around this pitfall?

So first, let's describe a little bit about what the one archetype might look like. This is a perfectly reasonable stack. The customer built out an ECS cluster, they put an Aurora database behind it, and they put in an OpenSearch instance because their customers demanded some specific search. And this is an actual customer example, a customer I worked with earlier this year.

And what they've done is, every time a new customer came on board, and you'll see the word tenant here. If you're not too familiar with SaaS, we can use tenant and customer interchangeably in this case. In some cases it could be internal operational teams, or there can be different versions of SaaS, but for our purposes, think of these as your customers. And they kept onboarding customers, creating these new stacks.

And they got to a point where it was expensive and difficult for them to operate all of these different stacks. They didn't really have operational efficiency, but really the main problem for them was that the OpenSearch instances they were spinning up for each customer were actually quite expensive. It was getting increasingly hard for them to operate their solution at a profitable level, their customers were pushing back on the price points, and they weren't able to move downmarket and find new customers outside of their current targets.

So what did we talk to them about? What are some of the strategies that we used to work around this pitfall?

Well, first, if you're in this situation and your databases or your storage are one of your main cost contention points, look at your data patterns. Why are you storing the data where you're storing it? Have we thought about whether there's PII, whether there are compliance implications? For this customer, it turns out that in OpenSearch there weren't; they were simply putting a lot of data in there to make it easier for them to search. There really wasn't any compliance footprint, there really wasn't any need for them to silo out that data.

So the first thing we did was create a combined OpenSearch instance and use an index per tenant in that instance. And of course, you can see a new word has appeared here as well: you see Advanced Tier where we used to say Tenant. So now we've also introduced the concept of a pricing tier. We're saying these are our premium tier customers, and even for our premium customers, we're saying, hey, we can still share something in your stack to save on costs.
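To make the index-per-tenant idea concrete, here's a minimal sketch in Python. The function names and the `catalog` index prefix are illustrative assumptions, and the client is assumed to be an opensearch-py `OpenSearch` instance; the point is simply that every read and write is scoped by a tenant-derived index name.

```python
# Illustrative sketch of index-per-tenant routing in a shared
# OpenSearch domain. Names here are assumptions, not the customer's
# real implementation.

def tenant_index(tenant_id: str, base: str = "catalog") -> str:
    """Return the per-tenant index name, e.g. 'catalog-tenant-42'."""
    safe = tenant_id.lower().replace(" ", "-")
    return f"{base}-tenant-{safe}"

def index_document(client, tenant_id: str, doc: dict) -> None:
    """Write a document into the calling tenant's index only.

    `client` is assumed to be an opensearch-py OpenSearch client.
    Because every operation is scoped by the index name, tenants
    never touch each other's data even though the domain is shared.
    """
    client.index(index=tenant_index(tenant_id), body=doc)
```

The same `tenant_index` helper would scope search queries, so isolation lives in one place instead of being scattered across call sites.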

But of course, if there's an Advanced Tier, there have got to be some other tiers, right? So we've got a Standard Tier here as well. And we also looked at the rest of their data and said, some of your customers would be willing to share, or don't even really care about how you're storing their data as long as you're storing it securely. They're not asking you to silo out their data.

So we create a new Standard Tier of pricing and say, listen, all of the data is shared. If customers are really asking for their data to be siloed, well, let's give them the Advanced Tier product. But we've created significant savings in terms of the overall profile, and now we've got a couple of price points where we can meet our customers.

We can go even further and look at the compute. We've got a namespace-per-tenant model here, which is simply a construct in ECS, if you're not familiar with it, that allows you to use the same ECS cluster and create specific namespaces for specific customers, and they're protected from one another from a networking perspective. We can pack in those containers a lot more tightly and save some money on the underlying compute of ECS.

Now we've got a Basic Tier that is an even lower price point to meet our customers at. So this is one option: considering shared storage. It's certainly not the only way we could go. I don't know if you noticed, but the Catalog Service disappeared from that last slide.

With the Catalog Service, we thought about what it was actually doing, and it was simply an asynchronous service that was running a couple of times a day, reaching out and grabbing some catalog data. That was fairly unstructured data, it wasn't relational, and it really wasn't necessary for it to be part of the overall compute stack. We can consider breaking workloads out and doing workload decomposition.

Some people might throw the word microservices at this; that may or may not be true. These could be microservices, and in this case it kind of looks like a microservice, right? But thinking about the individual workloads, what they do, whether they're asynchronous, whether they could be event-driven, whether they could use different compute profiles or different data stores, is another interesting strategy for SaaS, especially if you're in a model where your deployments have gotten to a point where they're limiting your potential as a SaaS product.

So here we just have a simple serverless stack that's just managing your Catalog Service. But of course, we have introduced some complexity here. Now we have two different technologies in our stack, and now we have some different deployment models.

What do we need to do to prove that we haven't introduced a whole new set of pitfalls into our solution? Well, anyone who's a software engineer realizes that testing is at the core of everything we do, right? We have to be able to prove that what we've built is sustainable, that we can operate it on behalf of our customers safely, and that it's reliable.

So let's talk about what that means for a SaaS product. Naturally, load testing is at the center of almost every testing strategy. We have to be able to test the upper range of the capabilities of our solution. And now that we've introduced these different deployment models, we might have different expectations, and our customers might have different expectations, of what those levels are for each tier.

We might have an expectation of one second on average for our Advanced customers and three seconds for our Basic Tier customers. This is just one example; we can test all of the different pieces of our architecture for load, and that's important.

But another piece that's important here is creating different load profiles. If you're building a SaaS solution, I guarantee customers will find ways to use your system in unexpected ways that you never even considered. So you have to be able to come up with different scenarios that you can test in parallel.

What happens if you're working in the finance industry, for example, and most of your banks tend to send updates to you at five o'clock every day in the same region? That creates an unexpected load on your system. But it's a load that we might be able to predict; we might be able to think ahead about what these different profiles look like.

What happens if all of our customers have a spiky workload that happens at a similar time? How does our system react? Creating these different load profiles, testing them, and applying them when we do releases or continually at regular intervals will allow us to assess whether we're meeting our SLAs for our customers, even at the worst of times for our system.
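As a rough illustration of overlapping tenant load profiles and per-tier latency budgets, here's a small sketch. All the function names, spike shapes, and thresholds are made up for the example, not taken from any real SLA.

```python
# Sketch: model spiky per-tenant load, overlay tenants to find the
# worst-case combined load, and check a run against a tier's budget.

def spiky_profile(base_rps: int, spike_rps: int, spike_at: int, duration: int) -> list:
    """Requests per second over `duration` seconds, with a 5-second spike."""
    return [spike_rps if spike_at <= t < spike_at + 5 else base_rps
            for t in range(duration)]

def combined_load(profiles: list) -> list:
    """Overlay several tenants' profiles second by second; the maximum of
    this combined series is the scenario we load-test against."""
    return [sum(rps) for rps in zip(*profiles)]

def meets_sla(latencies_ms: list, p95_budget_ms: float) -> bool:
    """Check a test run's p95 latency against a tier's budget,
    e.g. 1000 ms for Advanced and 3000 ms for Basic."""
    ordered = sorted(latencies_ms)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    return p95 <= p95_budget_ms
```

Running several `spiky_profile` scenarios in parallel and feeding the measured latencies through `meets_sla` is the "different load profiles" idea in miniature: the interesting failures show up when spikes coincide.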

And finally, as I said, we've decomposed our workload a little, and there is some new complexity that's been added. So we need to be able to test individual workloads: test things like making sure that this asynchronous workload is in fact asynchronous, and that the events we're driving aren't in any way interfering with the rest of our synchronous workloads.

We have to be able to test that all of the other systems that now rely on this external system can continue to function if that Catalog Service for some reason disappears and stops operating. Are we caching our Catalog well enough? Are those other services able to continue to operate in the case of a failure of that particular service?
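One way to sketch that resilience pattern in code: a wrapper that serves the last good catalog copy when the upstream fetch fails. This is an illustrative pattern, not the customer's actual implementation; `fetch_catalog` stands in for whatever call reaches the external source.

```python
import time

class CachedCatalog:
    """Serve fresh catalog data when possible, stale data when the
    upstream Catalog Service is down. A sketch of graceful degradation."""

    def __init__(self, fetch_catalog, ttl_seconds: float = 3600.0):
        self._fetch = fetch_catalog      # callable hitting the external source
        self._ttl = ttl_seconds
        self._cached = None
        self._cached_at = 0.0

    def get(self) -> dict:
        """Return cached data within the TTL; on a failed refetch,
        fall back to the last good copy instead of failing callers."""
        if self._cached is not None and time.time() - self._cached_at < self._ttl:
            return self._cached
        try:
            self._cached = self._fetch()
            self._cached_at = time.time()
        except Exception:
            if self._cached is None:
                raise                    # no stale copy to fall back on
        return self._cached
```

A test for this is exactly the scenario in the talk: make the fetch fail on purpose and assert the dependent services still get an answer.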

These are just a few examples. Obviously, getting into the SaaS well-architected lens will give you a lot more testing scenarios, so make sure you check that out, or reach out to our team and we can talk you through some more of these testing scenarios.

And this takes us to the other side of tenant isolation, right? We started with a very siloed model, which is common; we often see customers start with a completely siloed architecture. What happens if you went in the other direction, and you started with a pooled solution where all of your tenants are using the same infrastructure? That comes with its own set of pitfalls.

Those often start with people assuming that all of your tenants are the same, that they're going to behave the same, that they have the same requirements and needs. This is one example from a customer I worked with last year. They were running large-scale events and they had a shared system, and it was running great. Usually they could anticipate the load and scale ahead of time.

But then what happens? One customer comes along and throws a great event unexpectedly; thousands more users pile into the system at the same time. And at the same time, another customer happens to be running an event that they knew about but hadn't really planned to scale for.

So now these two events are running simultaneously. Noisy neighbor: this is what we'd use to describe the situation. Noisy neighbor issues impacted all of their customers. Their customers were unhappy with the experience; they even lost some customers over this event. And this is exactly what we're trying to avoid in a SaaS solution: these types of noisy neighbor issues negatively impacting other customers on the system.

And what might this look like? What are some of the symptoms that we see here? Well, noisy neighbor symptoms can happen at every layer of our solution. So this is where I want to start with your thought process: when you're looking for noisy neighbor issues, troubleshooting noisy neighbor issues, remember it's not necessarily enough to just look at the front edge of your solution.

You might start with your API Gateway and try to assess, you know, are there ways we could actually protect ourselves from unexpected traffic? But that's not the only layer we need to be looking at. Our compute has to be able to scale. In the case of this simple example with a Lambda, perhaps you use up your provisioned concurrency, and the next customer that comes in starts running into cold starts. If you're using EC2s, maybe these are scaling events. Regardless of what this looks like, it affects both our front end and our compute layer, and of course our database, and it can happen in unexpected ways.

Beyond the current load that's on your solution, perhaps we have to think about the data that's actually stored and how it's distributed in our data storage. In this example I'm using a Postgres schema per tenant, which simply means we've got a shared instance of our database and multiple tenants' data shares that same database. We have three tenants in there. If one of them has a lot of data, does that affect the queries for the other two tenants? Does that affect the indexing processes? Is it possible that those indexing processes themselves become a problem for the overall database? Thinking about these types of issues ahead of time, and thinking about how we can avoid them or come up with different strategies for working around them, is part of this pitfall.

But before you even get to how you're going to solve the problem, how do you know you have a problem? When you're in a shared construct, you don't have a simple way to understand what the tenant experience is unless, and this is our simple serverless stack, you're emitting tenant identity across your whole stack into whatever logging and metrics solution you're using. You're going to see me use CloudWatch here as an example a lot; don't pay any attention to the CloudWatch part of this. If you're using Datadog, if you're using Honeycomb, whatever you're using has the capability to do queries like this. Make sure you're emitting tenant identity in all of those logs and metrics so that we can quickly aggregate all of that data, but then de-aggregate it as well. When things are going wrong, we really need to be able to drill down to the tenant level and understand what that tenant's experience is at that moment.
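A tiny Python sketch of what "emitting tenant identity" and then de-aggregating it might look like. The field names are assumptions, and in practice the query side would be CloudWatch Logs Insights, Datadog, or whatever tool you use rather than this in-process counter; the shape of the idea is the same.

```python
import json

def log_request(tenant_id: str, route: str, latency_ms: float) -> str:
    """Emit one structured log line with tenant identity baked in."""
    return json.dumps({"tenant_id": tenant_id, "route": route,
                       "latency_ms": latency_ms})

def requests_per_tenant(log_lines: list) -> dict:
    """The de-aggregation step: count requests by tenant, the way a
    per-tenant Logs Insights query would group them."""
    counts = {}
    for line in log_lines:
        tenant = json.loads(line)["tenant_id"]
        counts[tenant] = counts.get(tenant, 0) + 1
    return counts
```

Because every line carries `tenant_id`, the same logs support both the aggregate dashboard and the tenant-level drill-down when something goes wrong.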

So, all of these pieces. This is just a simple example of us using CloudWatch and querying those logs to produce this obviously fake mock of my three different tenants, where Tenant 2 is causing the issues because they have a lot more requests than my other two tenants, right?

And so getting to this point where I can understand that the profile has changed for one of my tenants is one of the ways to start to work around this pitfall.

Of course, there's an alarm aspect to this as well. We do want alarms to tell us when truly unexpected things have happened. But we don't necessarily want to rely on our alarms either; certainly your SREs don't want to. So make sure we're only using alarms after we've created the overall dashboard experience and provided our administrators the ability to see, in a single pane of glass, everything that's going on in our solution.

So how do we start to think about solving this problem? There are a few different strategies. One of those is aligning our tenant infrastructure to the usage that we're experiencing in our solution. And what might that look like?

Well, we've got a shared stack. What we've built isn't necessarily wrong; there's nothing we have to do to this overall stack. We might keep our API Gateway and our Lambda, we might keep the exact same strategy we've had of a schema per tenant. But we need to focus on each one of these layers and think about what they mean, and we have to think about whether we need to take that one noisy tenant and move them out of that infrastructure into their own environment.

And again, you see pricing tiers appear here. So we could say, simply, hey, Tenant 2 is an Advanced Tier customer; we simply have to put them in their own siloed environment. And I'm using air quotes there for siloed, because in fact it's the same shared environment. And what do you call a shared environment that only has one tenant in it? Siloed. You didn't have to change your infrastructure; you simply had to deploy another version of it and shard your tenants across it.

And that decision might not be as simple as this. You might have specific tenants in different regions whose usage profiles don't overlap very much, and we could put them in the same infrastructure. We might have groups of tenants, and again, we're going to use tiers here: we're going to pack more customers into the Standard Tier and simply acknowledge to them, your SLAs aren't as high; there may be times when the system isn't as responsive, and if you want a more performant system, you can upgrade to the Advanced Tier. But after all of that, after we've made any sort of sharding changes, after we've revisited those policies and thought about what they mean at the API Gateway, at the compute layer, at our data layer, and we've put policies in place to protect our tenants.
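The tier-based sharding decision above could be sketched like this. The tier names, pool size, and deployment naming are all assumptions for illustration, not a prescription.

```python
def shard_tenants(tenants: list, pool_size: int = 3) -> dict:
    """Map deployment name -> list of tenant ids.

    `tenants` is a list of (tenant_id, tier) pairs. Advanced tenants get
    a dedicated "silo" (the same shared stack, with one tenant in it);
    Standard tenants are packed into pooled stacks of `pool_size`.
    """
    shards = {}
    pool, pool_idx = [], 0
    for tenant_id, tier in tenants:
        if tier == "advanced":
            shards[f"silo-{tenant_id}"] = [tenant_id]
        else:
            pool.append(tenant_id)
            if len(pool) == pool_size:
                shards[f"pool-{pool_idx}"] = pool
                pool, pool_idx = [], pool_idx + 1
    if pool:  # flush any partially filled pool
        shards[f"pool-{pool_idx}"] = pool
    return shards
```

In a real system the packing decision would also weigh regions and usage profiles, as described above, rather than simple counting.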

How do we prove that they're working? We need to be able to test, and create artificial tenant profiles that allow us to hit tenant-specific infrastructure and expose whether there are any unexpected behaviors or any impact these tenants could have, in terms of noisy neighbors, but also in security.

So we need to be able to test the front edge of our solution, whether that's Route 53 or some other DNS, or even if it goes around DNS. What happens if I put a tremendous load on that with my fake tenant? What happens if I get behind that, go to an API Gateway or a load balancer behind it, and start directly hitting that with lots of traffic? Does that affect my other tenants?

What happens if one of the ARNs or one of the underlying URLs exposed by an API Gateway that's supposed to be tenant-specific leaks out? For example, somebody changes roles and decides, well, I'm going to screw my old employer here, and they send a bunch of traffic their way and ruin their experience. Are we protected in some way against that? Can I access the underlying compute? And this is a tricky one: if I've got a legitimate payload, is there some way for me to get that payload into infrastructure that wasn't assigned to me? Or can I create a JWT, if we're using JWTs for security, that has a different tenant's identity? How do we handle that and understand that this isn't a legitimate workload?
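That cross-tenant JWT scenario comes down to one check: does the tenant claim in the token match the tenant that owns the resource? A minimal sketch, assuming the token's signature has already been verified and decoded into a claims dict, and using a made-up `custom:tenant_id` claim name:

```python
class CrossTenantAccessError(Exception):
    """Raised when a verified token tries to reach another tenant's resources."""

def guard_tenant(token_claims: dict, resource_tenant_id: str) -> None:
    """Reject the request unless the token's tenant owns the resource.

    Signature verification is assumed to have happened upstream; this is
    only the tenant-match step that stops an otherwise legitimate-looking
    payload from landing in infrastructure that wasn't assigned to it.
    """
    token_tenant = token_claims.get("custom:tenant_id")
    if token_tenant is None or token_tenant != resource_tenant_id:
        raise CrossTenantAccessError(
            f"token tenant {token_tenant!r} cannot access {resource_tenant_id!r}")
```

The artificial tenant profiles mentioned above would deliberately send mismatched tokens through every layer to prove a check like this exists everywhere it needs to.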

And finally, at the back, the database. This is what we're trying to protect at the heart of everything: tenant data. Have we sufficiently built out the solution so that tenant data is protected regardless of what route the traffic comes through? What happens if somehow database credentials leak? Is there any way for those credentials to be used without going through our compute stack? What happens again if that bad JWT actually comes through? Have we intercepted that traffic and prevented it from getting to other tenants' data? All of these are legitimate aspects of this pitfall.

As we think about how we build and fix for noisy neighbors, we have to continue to test to make sure that we haven't introduced other problems into our solution. I don't know if you heard Werner earlier this week. He gave a quote during his keynote: every engineering decision is a purchasing decision. This is actually a quote that comes from the founder of CloudZero, and they do a tremendous job of helping people understand costs. I want to make sure I call them out, because I don't think they were called out during the keynote.

But we hear this conversation constantly now, stakeholders in your SaaS application coming to you and saying, I simply don't understand what this bill is that you've put in front of me. It keeps going up, it's expensive. It doesn't seem like it's matching up with the revenue numbers I'm seeing. And if we don't have a common language to talk to our stakeholders about costs in SaaS, there's no way we can have a successful conversation. It's simply not viable for us to try to drive down our total costs when we're adding new customers into our stack.

So what does it look like to focus on costs in SaaS? What is the thing that we're trying to get to? Well, it starts with this, because this is the picture that CFO or those stakeholders are seeing, and that total cost is continuing to climb. In the early stages of a SaaS product, when we're launching, when we're trying to get to product-market fit, we're usually seeing a revenue dip. So from their perspective, everything is a disaster, the world's on fire, we need to fix this and somehow get to a point of profitability. What they don't see happening, what all of us on the engineering side are trying to work on, is having a focus on unit costs.

This is the common language that we need in SaaS: unit economics. The ability to say, yes, our total costs are going up, but the cost per transaction, the cost per tenant, the cost per tier, even the cost per feature are being driven down as we become more and more efficient at building this platform. This is the cost efficiency of SaaS, not the total cost of your AWS bill. And I can't say this strongly enough: this is one of the biggest pitfalls I've seen people falling into this year as they get pressure from investors, simply trying to reduce their total cost rather than focusing on the efficiency of their platforms. Continue to try to grow, continue to try to add customers, but find ways to drive down the unit cost and become more efficient in what you're building. This is what we're trying to get to.

How do we get there? How do we get out of this pitfall? Well, it starts with capturing those costs. And what are the costs that we want to capture? Let's look at a couple of simple stacks. You're going to see some of the same language here: silo. We have an infrastructure per tenant, and we can simply tag these, right? If anyone's not familiar with cost allocation tags, this is an option we have, just like tagging your servers or anything else in AWS: take our silo resources, tag that Lambda, tag the RDS instance that's behind it, and we'll be able to allocate our silo costs out to each of our tenants. And that continues for tenant two, tenant three. We're also going to have some shared infrastructure. Sometimes we call this a control plane: our onboarding services, our billing services. These are things that are hard to split; they're sort of our cost of doing business, right?
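As a sketch of that tag-based rollup, assume a simplified cost report where each line carries its cost and its tags. The `TenantID` tag name is my invention for the example; the real data would come from something like the AWS Cost and Usage Report with cost allocation tags activated.

```python
def allocate_silo_costs(cost_lines: list):
    """Split cost lines into per-tenant totals and a shared bucket.

    `cost_lines` is an iterable of (cost_usd, tags) pairs; lines without
    a TenantID tag are treated as shared control-plane spend.
    """
    per_tenant, shared = {}, 0.0
    for cost, tags in cost_lines:
        tenant = tags.get("TenantID")
        if tenant is None:
            shared += cost  # onboarding, billing: cost of doing business
        else:
            per_tenant[tenant] = per_tenant.get(tenant, 0.0) + cost
    return per_tenant, shared
```

The shared bucket is exactly the control-plane spend that still needs its own division strategy, which is where the pooled-cost techniques below come in.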

So we need to have a strategy for how we're going to divide those up as well. But the pool is one of the main problems we're going to have with any infrastructure in SaaS. How do we get to a point where we can understand these pooled costs? And just to give away the ending here, I'm sorry, but it's going to be some manual work. We're going to have to put some effort into this process. I'd love to say that AWS has launched a new service that's going to tell you all your shared costs. We didn't launch that this year.

So instead I'm going to give you a couple of different techniques you can use, and they start with what I call a coarse-grained approach. This is simply finding some spot in your architecture where most of your traffic flows through. You could call it a choke point, and that might look like an API Gateway or a load balancer. All of that infrastructure produces logs, and a lot of those logs, if we set them up properly, will have tenant identity baked into them.

So we can take our Lambda authorizer and have it inject the tenant identity from the JWT token that's flowing through into those logs. And now I can do almost the same process we were talking about before: I can use my CloudWatch Logs and write a simple query that asks, how many times is each tenant calling this specific service? This gets me to a pretty good view of consumption. Now, I haven't said anything about cost yet, right? But understanding consumption is a pretty good step toward being able to divide up our costs.
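Here's a small sketch of that counting step. The log line shape and field names (`tenant_id`, `route`) are assumptions; they stand in for whatever your authorizer injects into the access logs. The comment at the end shows the kind of CloudWatch Logs Insights query that would ask the same question at scale.

```python
import json
from collections import Counter

# Hypothetical structured access-log lines, as an API Gateway access log
# might look after a Lambda authorizer injects a tenant identifier.
log_lines = [
    '{"tenant_id": "tenant-1", "route": "/orders", "status": 200}',
    '{"tenant_id": "tenant-2", "route": "/orders", "status": 200}',
    '{"tenant_id": "tenant-1", "route": "/orders", "status": 201}',
    '{"tenant_id": "tenant-1", "route": "/catalog", "status": 200}',
]

def consumption_by_tenant(lines, route=None):
    """Count requests per tenant, optionally filtered to one service route."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        if route is None or event["route"] == route:
            counts[event["tenant_id"]] += 1
    return counts

print(consumption_by_tenant(log_lines, route="/orders"))

# Roughly the same question as a Logs Insights query might be:
#   filter route = "/orders" | stats count(*) by tenant_id
```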

Let's say before, we were dividing up our cost per tenant evenly and saying, we've got 30 tenants, each one gets an equal piece of this pool. Now we have visibility into how much each of these tenants is consuming of our services. You could take that and divide your costs up by it, and you've already gotten a lot more accurate. For some providers, that's enough. You may never need to do more.

I'll simply show another version of this with the Application Load Balancer logs. If you're doing subdomain-based routing, then your tenant identity is actually in your host name. If you've got your services reflected in your path as well, like the order service in this example, well, now I know both the tenant and the feature that they're using in my logs. Those logs land in an S3 bucket, so I can use Athena to simply query them and ask the same question I was asking of API Gateway: how much of each of these services is each one of my tenants using? I've got pretty good consumption metrics. Again, that might be enough.
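As a sketch of what that extraction looks like, here's the tenant-and-feature parse from a request URL. The domain names and path shapes are hypothetical, and a real ALB access log is a space-delimited line with many fields; this only handles the request-URL field that an Athena query would typically split apart.

```python
from urllib.parse import urlparse

# Hypothetical request URLs as they'd appear in an ALB access-log entry.
# Tenant identity rides in the subdomain; the feature is the first path segment.
sample_requests = [
    "https://tenant-1.example.com/order/create",
    "https://tenant-2.example.com/order/create",
    "https://tenant-1.example.com/catalog/list",
]

def tenant_and_feature(url):
    """Pull (tenant, feature) out of a subdomain-routed request URL."""
    parsed = urlparse(url)
    tenant = parsed.hostname.split(".")[0]           # e.g. "tenant-1"
    feature = parsed.path.strip("/").split("/")[0]   # e.g. "order"
    return tenant, feature

print([tenant_and_feature(u) for u in sample_requests])
```

In practice you'd express the same split in the Athena SQL that queries the S3-hosted logs, rather than pulling the lines into Python.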

But what if it's not? What if I look at my services and I'm thinking of the customer I was talking about before with OpenSearch? They've got this shared OpenSearch instance, and it's an index per tenant, but it's still a big part of their overall costs. They want to understand how much it costs for each one of their tenants using that specific service.

Well, at that point, you've got to get down to a fine-grained understanding of cost, and that usually means instrumenting your code. And I apologize, that means you're actually going to have to go in and put some specific metrics in place.

This is a simple example, and it's probably hard to read on the big screen, but I'm simply using Lambda Powertools with the metrics library that's in there. We're injecting some dimensions: of course tenant identity, perhaps what services they're consuming, how long it took for that service to respond. You might also put in here things like the DynamoDB calls happening under the hood. How long did they take? How many WCUs did we use for DynamoDB? You can get pretty specific, right? In the OpenSearch example, you can get down to a point where you're measuring how long each call to OpenSearch took, and we can query OpenSearch and ask how much data each tenant is using under the hood. You have lots of options at a fine-grained level, but it's work. It's real effort. So you have to be able to emit these metrics.

It might end up looking something as simple as this. It could be much more complicated. You can use the CloudWatch Embedded Metric Format and send those metrics off to CloudWatch. And again, we can query CloudWatch and simply ask, hey, break down my tenant consumption. Now we've got a much more granular view of it, and some of you in the audience are going to have systems that are complicated enough that you really have to get down to this level. Start coarse-grained, work your way down to fine-grained, and then, sorry, I saw some people take pictures and I advanced it a little fast there, but we'll share out the deck afterwards, correlate this with cost.
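Here's a hedged sketch of what one of those Embedded Metric Format records looks like when built by hand with the standard library (Lambda Powertools builds this structure for you). The namespace, dimension names, and values are all assumptions for illustration; printing a record like this from a Lambda function is what lets CloudWatch extract the metric.

```python
import json
import time

def emf_metric(tenant_id, tier, service, latency_ms):
    """Build a CloudWatch Embedded Metric Format record with tenant-aware
    dimensions; emitting this JSON to stdout in Lambda publishes the metric."""
    return {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "SaaSMetrics",  # hypothetical namespace
                "Dimensions": [["TenantId", "Tier", "Service"]],
                "Metrics": [{"Name": "LatencyMs", "Unit": "Milliseconds"}],
            }],
        },
        "TenantId": tenant_id,
        "Tier": tier,
        "Service": service,
        "LatencyMs": latency_ms,
    }

print(json.dumps(emf_metric("tenant-1", "standard", "order", 42.0)))
```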

So now we've done this side of the equation. We've got CloudWatch Logs that tell us what our consumption looks like. The other side of this, the other leg, is going out to our Cost and Usage Report, or, I don't have it on this slide, maybe the AWS Price List API, another option that simply gives you the public pricing of AWS, depending on how you want to slice and dice these costs. The Cost and Usage Report is your exact record of how much AWS has billed you for each of your Lambda calls, for each of your DynamoDB tables. We can use that. Those reports are essentially written off to S3, and again, we can use Athena to query them.

I didn't give an example of this, but we do have another workshop that talks about these. So if you want some more details about querying your Cost and Usage Report, we can share those. You can also look at the AWS Well-Architected Labs, where they have a number of Cost and Usage Report queries all written out for Athena, and you can use those as the basis for creating a system like this. But essentially we're trying to get to this: a point where we've measured tenant consumption. Say Tenant One has used 50% of a particular service, and that service costs $10; well, that's $5 for them, right? We're able to create these records that describe the consumption and then correlate them to the cost. Because this is what we're trying to get to in the end, if you're a business user in the audience and you've been thinking this is a lot of technical detail. This is what you want: a point where you can create a dashboard that gives us the cost per tenant, tells us specifically what this looks like, and lets us slice and dice the data.
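The 50%-of-$10 arithmetic above generalizes into one small allocation step: take a pooled service's billed cost from the Cost and Usage Report and split it by each tenant's share of consumption. The numbers here are made up for illustration.

```python
def allocate_pooled_cost(service_cost, consumption):
    """Split one pooled service's billed cost by each tenant's share of
    measured consumption (e.g. request counts from the log queries)."""
    total = sum(consumption.values())
    return {tenant: round(service_cost * calls / total, 2)
            for tenant, calls in consumption.items()}

# Tenant One made half the calls to a service that cost $10 this period.
print(allocate_pooled_cost(10.00, {"tenant-1": 50, "tenant-2": 30, "tenant-3": 20}))
# → {'tenant-1': 5.0, 'tenant-2': 3.0, 'tenant-3': 2.0}
```

Run per service and summed per tenant, rows like these are exactly what feeds the cost-per-tenant dashboard.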

The reason I talked about tier and feature and some of the other metadata is that these queries can start to get quite complicated. What if I want to understand the usage of a specific feature by the different tiers of customers? And why might I want to do this? Because I might want to repackage my software. Maybe the specific feature we're looking at here is in the advanced tier today, but really I want to move it down to the standard tier, because it's not too expensive and I want to be able to offer it more cheaply. And I want to take this very expensive feature and move it up to my advanced tier. Being able to repackage and rethink how we think of our SaaS solution and who's consuming it is part of the journey to SaaS and part of this cost question. Then we need to iterate on this process.

So I'm going to get a little bit away from the technical side of this and think about the process of what we're building. Most requests to add something to our application come through as feature requests: they came from a customer, they came from our product team. And we just have a simple software development life cycle that we follow. We design, we develop, we deploy, we operate, and we iterate over it. We continue to try to improve on what we've built from an architectural perspective, from a performance perspective, from a reliability perspective. But alongside that, costs are occurring, and we're not spending enough time thinking about those.

We need to get to a point where, at the very beginning, our design decisions reflect that. Remember how we said every engineering decision is a purchase decision? That means we should be involving our finance teams and our product teams and having a conversation about what the cost is, not just to build this feature we're creating, but to operate it. How does the product team understand how they're going to sell this? How does the sales team understand who the audience is for this? Can they provide us feedback on who's asking for this feature? How do they know, if they don't know how much it's going to cost to operate? Estimates are never accurate. I've never created an accurate estimate in my entire life as a developer or an architect. That doesn't matter. It's the exercise of going through and actually doing these estimations, and making an effort to communicate about what we think this might cost. Because sometimes the answers are surprising, and sometimes those answers will affect how we develop and how we architect this specific feature.

And then, of course, everything we've been talking about before is the operate phase. Continue to iterate over our cost through this whole cycle, just like we iterate over it from an architectural perspective. Think about cost as a nonfunctional requirement of our application, one we continue to revisit, not just reliability and security but also cost, with a team in the middle whose responsibility it is to make sure that we're building unit-cost-effective features. That's really the point of this slide, and one of the ways we get out of this pitfall: thinking about costs holistically across a SaaS solution.

The DevOps journey is another pitfall that a lot of SaaS customers fall into. We often talk to customers early in the process when they're engaging with us, and their onboarding process is measured in weeks or months. That's because they're coming from a traditional software background where they delivered their software on a regular cadence. Maybe their customers are used to that, and maybe that is the expectation in the industry. But why set the bar low? Raise the bar on your application, and think about how you're going to do that. Doing that requires better DevOps practices, and for a SaaS solution they might look something like this.

Onboarding is one of the first terms I'm going to introduce, and onboarding produces new challenges for our DevOps teams. We're used to software release cycles; we release software in a release, and we have a software release cycle today. With SaaS, though, here's the first DevOps challenge: we're going to release our software more rapidly, introducing these rapid releases perhaps daily, perhaps even hourly. But alongside those, our tenants are continuing to onboard, to enter our solution, asynchronously to our software release cycle. Between one release and the next, maybe five new customers signed up to our software. This creates an unexpected, or unpredictable, load on our solution that we need to handle from a DevOps perspective alongside our software release cycle.

How do we think about this in SaaS? How do we start to think about this as a potential pitfall, with our onboarding either affecting the experience of our customers or simply taking too long? Well, the first concept I want to introduce is the control plane. We mentioned the control plane before. The control plane is simply a group of services that includes onboarding. It also includes billing, metrics and analytics, and how we store our individual tenants' information and what tier they belong to. All of these are concepts we could cover, but we're really going to double-click on onboarding today and think about what it means to have a repeatable onboarding process, because this is how we get to good SaaS DevOps.

We need to be able to automate everything that we're doing in onboarding. Now, some people might be thinking, that doesn't work in my industry. You know, our customers require long sales cycles; we're going to go and integrate with their backend systems. That's relevant, and it's true, but it doesn't mean you can't automate everything around that so that the blocker is never you. If we have to wait for a customer to tell us about their identity store, that's fine. That's part of the human process. There are always human elements, but everything that's technical can be automated, and that might look something like this.

We had a great workshop this year. Michael Beardsley and Bob Provis created an interesting example of a workshop for how you could build out a control plane. They posited that we could use serverless for that control plane, simply creating some Lambda services and some API Gateway endpoints that will help us to onboard customers. So you can go through that workshop on your own later and look at how we built this out. This serverless control plane allows us to manage everything that's happening in the application plane, and we'll talk about some of those processes in a second. But hey, it's probably event-driven, right? So they proposed using EventBridge to send messages back and forth as customers onboard and as we want to update their software. EventBridge provides a very interesting way for us to communicate between the planes.
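As a hedged sketch of that event-driven hand-off, here's the shape of an onboarding message a control plane might publish. The `Source`, `DetailType`, and detail fields are all assumptions for illustration, not the workshop's actual schema; in real code the returned dictionary would be passed to the EventBridge `PutEvents` API.

```python
import json
import uuid
from datetime import datetime, timezone

def onboarding_event(tenant_name, tier):
    """Build the kind of message a serverless control plane might put on an
    EventBridge bus to kick off tenant provisioning in the application plane."""
    return {
        "Source": "saas.control-plane",   # hypothetical event source name
        "DetailType": "TenantOnboarded",
        "Detail": json.dumps({
            "tenantId": str(uuid.uuid4()),
            "tenantName": tenant_name,
            "tier": tier,
            "onboardedAt": datetime.now(timezone.utc).isoformat(),
        }),
    }

event = onboarding_event("Acme Corp", "standard")
print(event["DetailType"], json.loads(event["Detail"])["tier"])
```

A rule on the bus can then route `TenantOnboarded` events to the provisioning pipeline, the billing integration, and anything else that needs to react.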

So creating this repeatable process, being able to have an automated process for onboarding, starts with building out the control plane aspects of this. And then we have to find the right things to automate. It starts with our SaaS application. That SaaS application is at the core of our customer's experience, but we have to think about the deployment pipelines. What does it mean for us to bootstrap a new customer environment? How do we onboard the tenant admin if they've self-onboarded? How do we allow them to add new users? How do we set up their user management? How do we add them to any external systems they might need to be in? Perhaps we need to add them to a marketplace or a billing provider. How do we get those tenant users into our system? Is it in fact an integration, a federation with an external identity provider, or do we have to provide UI for the tenant's system administrator to come in and do this themselves? We automate all of that to enable them to take on that role themselves.

We need to think about how we automate sending our updates out. How does that happen automatically, seamlessly, and invisibly to our customers? And the last one, and this is a pitfall in and of itself: offboarding. I run into so many SaaS providers who've onboarded hundreds or thousands of customers and forgot, when those customers left, to deprovision all of the resources. Now, they may have deprovisioned some of the resources that were costing them money and deleted some EC2 instances, but they leave behind lots of artifacts. If a tenant has offboarded from your system and they're not coming back, remove everything, because one of the pitfalls you can fall into here is running into quota and limit issues.

You may not know all of the quotas and limits. I certainly don't know them all: the number of AWS accounts you can have under a single payer, the number of S3 buckets you can have, the number of IAM roles you can have in a single AWS account per region. All of these are relevant quotas, and if you don't clean up your resources, you might find out what they are. Some of them are hard quotas that AWS cannot increase on your behalf. As a SaaS provider, it's your responsibility to make sure you don't run into them, because they can prevent you from onboarding your next customer. So have a solid onboarding process, and monitor your quotas and limits to make sure that you're not running afoul of any of those or running up against them.

And coming up with a sharding strategy for moving your tenants across environments is another part of this DevOps journey that we need to get to, to get out of this pitfall. Last pitfall, I promise. I know it's lunchtime; everyone's probably dying to get out of here. But let's think a little bit about customization. I see this pitfall especially with early-stage customers. As they start to onboard new logos, they're willing to do anything to keep these customers happy, and who can blame them? Our first 10 customers are probably the most important customers we're ever going to land. They help us find product-market fit. They ask us for a change, we make the change, and we're happy to do it. But what happens when that starts to affect our entire system, and we get to a point where our teams are managing all of these different versions?

Well, it might look something like this, right? We might start customizing for some tenants and develop technical debt that affects our whole system. A simple stack that involves our repositories, our build tools, our infrastructure as code, and our application might just get more and more complicated as we branch out. We might end up with different branches of our code reacting to relatively simple changes, like injecting new variables into our CodeBuild for something as simple as EFS configurations. Perhaps we start to put in application code that's tenant-specific, and you start to see tenant names appear in our code, differentiating that this tenant over here is using EFS Infrequent Access and this one isn't. And what happens if a specific tenant asks for a new service that the other ones aren't using? Now we introduce FSx for Lustre, and now we've got these different branches. How do we ever reconcile what tenant four is doing with tenant one? It gets harder and harder the more we do this.

How do we avoid this? Well, this is a very simple, an overly simple, diagram that shows what good looks like: configuration over customization. You want to get to a point where, regardless of what you're doing for a specific tenant, you do it with an external configuration system if in any way possible. We get to a point where tenant three here simply gets this feature and everyone else doesn't, and if we need to add it to tenant four, we flip a switch and now tenant four has it. This allows us to improve numerous parts of our system. The interesting thing about configuration is that it affects so many of our processes.

So now we can think about our tiers. We have different tiers of customers, and they have different feature sets that we've applied to them. We can get to a point where we change our development processes. Perhaps before, we had a development environment per developer, and that isn't necessarily an anti-pattern, but maybe we can get to a point where we can do trunk-based development, where individual developers simply turn on the features that they're working on, and those are continually flowing out to production, continuous deployment, right? And we can get to a point where we inject these variables into our build pipelines, and the decisions about which tenant gets what are automated and driven by our configuration, rather than having different code pipelines for each one of our customers. And of course, our code starts to look cleaner.
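The flip-a-switch idea above can be sketched in a few lines. The flag names, tier names, and the idea of a per-tenant override list are all hypothetical; in practice this lookup table would live in an external configuration service rather than in the code.

```python
# Hypothetical long-lived flag configuration, externalized from the code.
FLAG_CONFIG = {
    "advanced-search": {
        "tiers": ["advanced", "enterprise"],
        "tenant_overrides": ["tenant-3"],  # tenant three gets it early
    },
    "bulk-export": {
        "tiers": ["enterprise"],
        "tenant_overrides": [],
    },
}

def is_enabled(feature, tenant_id, tier):
    """A feature is on if the tenant's tier includes it, or the tenant has been
    explicitly switched on; no tenant-specific branches live in the code."""
    cfg = FLAG_CONFIG.get(feature, {})
    return tier in cfg.get("tiers", []) or tenant_id in cfg.get("tenant_overrides", [])

print(is_enabled("advanced-search", "tenant-3", "standard"))  # override flips it on
print(is_enabled("bulk-export", "tenant-1", "standard"))
```

Adding the feature for tenant four then means editing configuration, not code: append `"tenant-4"` to the override list and no pipeline branches.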

Now, one of the things I'm proposing here, and this is a little different from how some of you might think about feature flags, for example, is that these are long-lived feature flags in SaaS. We introduced the concept this year and wrote a couple of blogs about how you can use LaunchDarkly or AppConfig to do custom configuration that sets up your pricing and packaging in SaaS. So these are long-lived flags that say this group of tenants, this tier of tenants, now receives these features, and these don't. Unlike release flags, which would quickly be removed from your system, these are operational flags inside your code that allow us to externalize the configuration of our different pricing tiers. You can look up those blogs as well; there are really interesting solutions there. If you're more on the development side, AppConfig is an awesome solution; LaunchDarkly is an amazing AWS Partner. Both of them provide very interesting ways to do this. One of the differentiators might be whether you have business users who need to make these changes to your configuration; LaunchDarkly has a really nice UI-driven solution that makes that easier. We can talk through those choices if you're considering what to do there. Reach out; we're happy to talk about that as well.

Now, what do you do? You've got this configuration. How do we go through it and prove out that, in fact, we can have a better tenant experience, all of those things we were talking about: the operational efficiency, the better release management? How do we prove all of these things out? Well, having these configurations allows us to introduce chaos into our system. We can go and turn everything off, we can turn everything on, we can mess with all of our configurations, and prove that the tenant experience at least meets the bar of what we're expecting in those circumstances.

Chaos engineering is hard, and I'm not talking about the infrastructure level, where I'm going to flip off some ECS tasks. Being able to test that our application actually performs with all of the features turned off is one way that we can test and prove that our configuration is adding value to our customers. Blue/green releases, canary releases, A/B testing: all of these options are enabled by feature flags, by configuration. Now we can get to a point where we release a new feature out to 10% of our customers and let the other 90% live in peace. If something goes wrong, we simply roll it back for those 10%, lowering the blast radius for our overall customer base. It also allows us to do A/B testing. What if we're introducing a new feature and we're not really sure how our customers are going to prefer it? What's the UI going to look like? What's the behavior they expect? What's the user experience? This is also enabled by configuration. And of course, as I mentioned, we get to the point where we can iterate on pricing and packaging. Think of this as adding a new tier. Our business users have evaluated the marketplace, and they want a new enterprise tier of our product. How do they test this out? How are they able to set up the configurations so that when they turn on all of our new enterprise-tier customers, everything simply works? Externalizing this configuration and allowing our business users to manage the configuration of our pricing tiers is another interesting option that is unlocked by having these types of configurations externalized from our system.
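One common way to pick that 10% is deterministic hash-based bucketing, so a tenant stays in the same group on every request. This is a generic sketch, not a specific AWS or LaunchDarkly mechanism; the salt string and percentage are illustrative.

```python
import hashlib

def in_canary(tenant_id, percent=10, salt="release-2024-01"):
    """Deterministically place roughly `percent` of tenants in the canary
    group by hashing the tenant id; the salt reshuffles groups per release."""
    digest = hashlib.sha256(f"{salt}:{tenant_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

tenants = [f"tenant-{i}" for i in range(1000)]
canary = [t for t in tenants if in_canary(t)]
print(f"{len(canary)} of {len(tenants)} tenants get the new feature")
```

Rolling back is then just lowering `percent` to zero (or flipping the flag off), touching only the 10% who ever saw the feature.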

So we've seen several different benefits of configuration over customization, and really, getting out of that pitfall is as simple as not going and doing these individual customizations, and instead thinking ahead about how we're going to do this configuration and how we're going to externalize it. And with that, we've gone over a number of these different potential pitfalls. When we're talking about building SaaS, we want to get to the point where we're building great SaaS. We've talked about deployments and how deployment shouldn't be a one-way door, that instead we should be thinking about deployments holistically and continuing to iterate over what we're developing for our customers and how we can differentiate that product.

So we've covered that pitfall. We talked about noisy neighbor issues, how a shared solution needs to be carefully planned, how we need to test and make sure that we're covering noisy neighbor issues and have the ability to potentially shard customers out across different environments if, in fact, those noisy neighbor issues are persistent and we can't fix them for specific tenants. We've talked about how having a focus on cost is important, but it's not enough to simply focus on our overall cost. We need to get down to unit economics and really think about the cost of supporting individual tenants, building specific features, or even tiers of tenants.

We've talked about how DevOps practices can help us get over the hump of onboarding customers, having customers self-onboard, and automating all those processes to allow our developers to focus on delivering value to customers. And we've covered that pitfall. And finally, we've talked about how customizations are the root of all evil in SaaS, and how we have to get to configuration over customization to allow us to innovate in our solution, to be able to test our solution, and to potentially allow our business users to also manage the packaging of our solution.

So we've covered a few pitfalls. You'll notice there are a lot more on here; we only had enough time to cover these few. Reach out to our team, and if you are experiencing any other pitfalls, let us know what you're going through. We're here to work with you in helping you build excellent SaaS on AWS. We've got resources as well. Make sure you take a picture of these QR codes. Just this week, we launched SaaS on AWS as a new home page to collect all of our resources, to collect our partner resources and what they're building, and to provide you different options to self-serve in SaaS. But you can also reach out to both our team and our SaaS Competency Partners, and that's the second link on here. We have lots of domain experts in SaaS who can help you. Work with our team, work with our SaaS Competency Partners, who are also vetted for their ability to build SaaS. And finally, join the AWS Partner Network.

SaaS Factory is a benefit, a complimentary program, for AWS Partners. And there are numerous other programs that give you benefits as a partner, both with go-to-market and with helping you to build specific types of solutions, including compliance solutions. Reach out if you have any questions about joining the Partner Network; there are lots of benefits, and I encourage anyone who's building SaaS to become an AWS Partner. With that, my name is Bill Tarr, and this is my contact information. Please don't hesitate to reach out. We're here to answer your questions. I love working with partners, and I love talking to our partners. Please reach out. It's been a great re:Invent. I hope all of you have a safe trip home, and I look forward to seeing you again next year. Thank you, everybody.