Seamlessly Swapping the API Backend of the Netflix Android App

by Rohan Dhruva, Ed Ballot

As Android developers, we usually have the luxury of treating our backends as magic boxes running in the cloud, faithfully returning us JSON. At Netflix, we have adopted the Backend for Frontend (BFF) pattern: instead of having one general purpose “backend API”, we have one backend per client (Android/iOS/TV/web). On the Android team, while most of our time is spent working on the app, we are also responsible for maintaining this backend that our app communicates with, and its orchestration code.

Recently, we completed a year-long project rearchitecting and decoupling our backend from the centralized model used previously. We did this migration without slowing down the usual cadence of our releases, and with particular care to avoid any negative effects to the user experience. We went from an essentially serverless model in a monolithic service, to deploying and maintaining a new microservice that hosted our app backend endpoints. This allowed Android engineers to have much more control and observability over how we get our data. Over the course of this post, we will talk about our approach to this migration, the strategies that we employed, and the tools we built to support this.

Background

The Netflix Android app uses the falcor data model and query protocol. This allows the app to query a list of “paths” in each HTTP request, and get specially formatted JSON (jsonGraph) that we use to cache the data and hydrate the UI. As mentioned earlier, each client team owns their respective endpoints, which effectively means that we’re writing the resolvers for each of the paths in a query.

As an example, to render the screen shown here, the app sends a query that looks like this:

paths: ["videos", 80154610, "detail"]

A path starts from a root object, and is followed by a sequence of keys that we want to retrieve the data for. In the snippet above, we’re accessing the detail key for the video object with id 80154610.

For that query, the response is:

Response for the query [“videos”, 80154610, “detail”]
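
A jsonGraph payload mirrors the path structure of the query, so the app can walk the same keys it asked for to hydrate its cache. As a rough sketch of that shape (the field names below are hypothetical, not Netflix's actual schema):

```javascript
// Illustrative jsonGraph-shaped response for ["videos", 80154610, "detail"].
// Field names are made up for this example, not the real schema.
const response = {
  jsonGraph: {
    videos: {
      80154610: {
        detail: {
          title: "Example Title",
          synopsis: "An example synopsis.",
          artworkUrl: "https://example.com/boxart.jpg"
        }
      }
    }
  }
};

// The app walks the same path it queried to pull the value out of the graph.
const detail = response.jsonGraph.videos[80154610].detail;
console.log(detail.title); // Example Title
```

The key property is that the response is addressable by the same path the client requested, which is what makes caching by path possible.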

In the Monolith

In the example you see above, the data that the app needs is served by different backend microservices. For example, the artwork service is separate from the video metadata service, but we need the data from both in the detail key.

We do this orchestration on our endpoint code using a library provided by our API team, which exposes an RxJava API to handle the downstream calls to the various backend microservices. Our endpoint route handlers are effectively fetching the data using this API, usually across multiple different calls, and massaging it into data models that the UI expects. These handlers we wrote were deployed into a service run by the API team, shown in the diagram below.

Diagram from a previously published blog post

As you can see, our code was just a part (#2 in the diagram) of this monolithic service. In addition to hosting our route handlers, this service also handled the business logic necessary to make the downstream calls in a fault tolerant manner. While this gave client teams a very convenient “serverless” model, over time we ran into multiple operational and devex challenges with this service. You can read more about this in our previous posts here: part 1, part 2.

The Microservice

It was clear that we needed to isolate the endpoint code (owned by each client team), from the complex logic of fault tolerant downstream calls. Essentially, we wanted to break out the client-specific code from this monolith into its own service. We tried a few iterations of what this new service should look like, and eventually settled on a modern architecture that aimed to give more control of the API experience to the client teams. It was a Node.js service with a composable JavaScript API that made downstream microservice calls, replacing the old Java API.

Java…Script?

As Android developers, we’ve come to rely on the safety of a strongly typed language like Kotlin, maybe with a side of Java. Since this new microservice uses Node.js, we had to write our endpoints in JavaScript, a language that many people on our team were not familiar with. The context around why the Node.js ecosystem was chosen for this new service deserves an article in and of itself. For us, it means that we now need to have ~15 MDN tabs open when writing routes :)

Let’s briefly discuss the architecture of this microservice. It looks like a very typical backend service in the Node.js world: a combination of Restify, a stack of HTTP middleware, and the Falcor-based API. We’ll gloss over the details of this stack: the general idea is that we’re still writing resolvers for paths like [videos, , detail], but we’re now writing them in JavaScript.

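
To make the resolver idea concrete, here is a minimal sketch of dispatching query paths to handlers in plain JavaScript. This is illustrative only, not the actual Restify or Falcor Router APIs:

```javascript
// A minimal sketch of path resolution: match a queried path to a route and
// resolve it. In the real service this would orchestrate downstream calls;
// here we return canned data.
const routes = [
  {
    // Matches ["videos", <numeric id>, "detail"]
    match: (path) =>
      path.length === 3 &&
      path[0] === "videos" &&
      Number.isInteger(path[1]) &&
      path[2] === "detail",
    resolve: (path) => ({ id: path[1], title: `Title for ${path[1]}` })
  }
];

function handleQuery(paths) {
  return paths.map((path) => {
    const route = routes.find((r) => r.match(path));
    return route
      ? route.resolve(path)
      : { error: "no route for " + JSON.stringify(path) };
  });
}

console.log(handleQuery([["videos", 80154610, "detail"]]));
```

The actual service hangs this kind of resolver off a Restify middleware stack, but the core contract is the same: paths in, resolved values out.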
The big difference from the monolith, though, is that this is now a standalone service deployed as a separate “application” (service) in our cloud infrastructure. More importantly, we’re no longer just getting and returning requests from the context of an endpoint script running in a service: we’re now getting a chance to handle the HTTP request in its entirety. Starting from “terminating” the request from our public gateway, we then make downstream calls to the api application (using the previously mentioned JS API), and build up various parts of the response. Finally, we return the required JSON response from our service.

The Migration

Before we look at what this change meant for us, we want to talk about how we did it. Our app had ~170 query paths (think: route handlers), so we had to figure out an iterative approach to this migration. Let’s take a look at what we built in the app to support this migration. Going back to the screenshot above, if you scroll a bit further down on that page, you will see the section titled “more like this”:

As you can imagine, this does not belong in the video details data for this title. Instead, it is part of a different path: [videos, , similars]. The general idea here is that each UI screen (Activity/Fragment) needs data from multiple query paths to render the UI.

To prepare ourselves for a big change in the tech stack of our endpoint, we decided to track metrics around the time taken to respond to queries. After some consultation with our backend teams, we determined that the most effective way to group these metrics was by UI screen. Our app uses a version of the repository pattern, where each screen can fetch data using a list of query paths. These paths, along with some other configuration, build a Task. These Tasks already carry a uiLabel that uniquely identifies each screen: this label became our starting point, which we passed in a header to our endpoint. We then used this to log the time taken to respond to each query, grouped by the uiLabel. This meant that we could track any possible regressions to user experience by screen, which corresponds to how users navigate through the app. We will talk more about how we used these metrics in the sections to follow.

Fast forward a year: the 170 paths we started with were slowly but surely whittled down to 0, and all our “routes” (query paths) were migrated to the new microservice. So, how did it go…?

The Good

Today, a big part of this migration is done: most of our app gets its data from this new microservice, and hopefully our users never noticed. As with any migration of this scale, we hit a few bumps along the way: but first, let’s look at the good parts.

Migration Testing Infrastructure

Our monolith had been around for many years and hadn’t been created with functional and unit testing in mind, so those were independently bolted on by each UI team. For the migration, testing was a first-class citizen. While there was no technical reason stopping us from adding full automation coverage earlier, it was just much easier to add this while migrating each query path.

For each route we migrated, we wanted to make sure we were not introducing any regressions: either in the form of missing (or worse, wrong) data, or by increasing the latency of each endpoint. If we pare down the problem to absolute basics, we essentially have two services returning JSON. We want to make sure that for a given set of paths as input, the returned JSON is always exactly the same. With lots of guidance from other platform and backend teams, we took a 3-pronged approach to ensure correctness for each route migrated.

Functional Testing

Functional testing was the most straightforward of them all: a set of tests alongside each path exercised it against the old and new endpoints. We then used the excellent Jest testing framework with a set of custom matchers that sanitized a few things like timestamps and uuids. It gave us really high confidence during development, and helped us cover all the code paths that we had to migrate. The test suite automated a few things like setting up a test user, and matching the query parameters/headers sent by a real device: but that’s as far as it goes. The scope of functional testing was limited to the already set up test scenarios, but we would never be able to replicate the variety of device, language and locale combinations used by millions of our users across the globe.

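
The custom-matcher idea can be sketched as a sanitizing pass over both responses before a deep comparison. The volatile-field patterns below are assumptions for illustration, not the matchers we actually wrote:

```javascript
// Normalize volatile fields (timestamps, uuids) so that two otherwise
// identical responses compare equal.
const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function sanitize(value) {
  if (Array.isArray(value)) return value.map(sanitize);
  if (value && typeof value === "object") {
    const out = {};
    for (const [k, v] of Object.entries(value)) {
      if (/timestamp|generatedAt/i.test(k)) out[k] = "<timestamp>";
      else if (typeof v === "string" && UUID_RE.test(v)) out[k] = "<uuid>";
      else out[k] = sanitize(v);
    }
    return out;
  }
  return value;
}

// Two responses that differ only in volatile fields compare equal once sanitized.
const oldResp = { title: "A", requestId: "8a6b1c2d-3e4f-4a5b-8c7d-9e0f1a2b3c4d", timestamp: 1 };
const newResp = { title: "A", requestId: "00000000-1111-4222-8333-444455556666", timestamp: 2 };
console.log(JSON.stringify(sanitize(oldResp)) === JSON.stringify(sanitize(newResp))); // true
```

In a Jest suite, a function like this would sit inside a custom matcher so that assertions read naturally while still ignoring the noise.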
Replay Testing

Enter replay testing. This was a custom-built, 3-step pipeline:

  • Capture the production traffic for the desired path(s)
  • Replay the traffic against the two services in the TEST environment
  • Compare and assert for differences

It was a self-contained flow that, by design, captured entire requests, and not just the one path we requested. This test was the closest to production: it replayed real requests sent by the device, thus exercising the part of our service that fetches responses from the old endpoint and stitches them together with data from the new endpoint. The thoroughness and flexibility of this replay pipeline is best described in its own post. For us, the replay test tooling gave the confidence that our new code was nearly bug free.

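
The “compare and assert” step can be sketched as a recursive diff of the two JSON responses. The real pipeline is far more thorough, but the core idea looks something like this:

```javascript
// Given the monolith's response and the new service's response for the same
// replayed request, report the dotted paths where the two differ.
function diffResponses(oldResp, newResp, prefix = "") {
  const diffs = [];
  const keys = new Set([
    ...Object.keys(oldResp || {}),
    ...Object.keys(newResp || {})
  ]);
  for (const key of keys) {
    const path = prefix ? `${prefix}.${key}` : key;
    const a = oldResp ? oldResp[key] : undefined;
    const b = newResp ? newResp[key] : undefined;
    if (a && b && typeof a === "object" && typeof b === "object") {
      diffs.push(...diffResponses(a, b, path)); // recurse into nested objects
    } else if (a !== b) {
      diffs.push(path);
    }
  }
  return diffs;
}

console.log(diffResponses({ title: "A", rating: 4 }, { title: "A", rating: 5 }));
```

An empty diff for every replayed request is what let a migrated route graduate to the canary stage.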
Canaries

Canaries were the last step involved in “vetting” our new route handler implementation. In this step, a pipeline picks our candidate change, deploys the service, makes it publicly discoverable, and redirects a small percentage of production traffic to this new service. You can find a lot more details about how this works in the Spinnaker canaries documentation.

This is where our previously mentioned uiLabel metrics become relevant: for the duration of the canary, Kayenta was configured to capture and compare these metrics for all requests (in addition to the system level metrics already being tracked, like server CPU and memory). At the end of the canary period, we got a report that aggregated and compared the percentiles of each request made by a particular UI screen. Looking at our high traffic UI screens (like the homepage) allowed us to identify any regressions caused by the endpoint before we enabled it for all our users. Here’s one such report to get an idea of what it looks like:

Each identified regression (like this one) was subject to a lot of analysis: chasing down a few of these led to previously unidentified performance gains! Being able to canary a new route let us verify latency and error rates were within acceptable limits. This type of tooling required time and effort to create, but in the end, the feedback it provided was well worth the cost.

Observability

Many Android engineers will be familiar with systrace or one of the excellent profilers in Android Studio. Imagine getting a similar tracing for your endpoint code, traversing along many different microservices: that is effectively what distributed tracing provides. Our microservice and router were already integrated into the Netflix request tracing infrastructure. We used Zipkin to consume the traces, which allowed us to search for a trace by path. Here’s what a typical trace looks like:

A typical Zipkin trace (truncated)

Request tracing has been critical to the success of Netflix infrastructure, but when we operated in the monolith, we did not have the ability to get this detailed look into how our app interacted with the various microservices. To demonstrate how this helped us, let us zoom into this part of the picture:

Serialized calls to this service add a few milliseconds of latency

It’s pretty clear here that the calls are being serialized: however, at this point we’re already ~10 hops disconnected from our microservice. It’s hard to conclude this, and uncover such problems, from looking at raw numbers: either on our service or the testservice above, and even harder to attribute them back to the exact UI platform or screen. With the rich end-to-end tracing instrumented in the Netflix microservice ecosystem and made easily accessible via Zipkin, we were able to pretty quickly triage this problem to the responsible team.

End-to-end Ownership

As we mentioned earlier, our new service now had the “ownership” for the lifetime of the request. Where previously we only returned a Java object back to the api middleware, now the final step in the service was to flush the JSON down the request buffer. This increased ownership gave us the opportunity to easily test new optimisations at this layer. For example, with about a day’s worth of work, we had a prototype of the app using the binary msgpack response format instead of plain JSON. In addition to the flexible service architecture, this can also be attributed to the Node.js ecosystem and the rich selection of npm packages available.

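
As a sketch of what that experiment could look like, here is a hypothetical content-negotiation step that picks a serializer based on the Accept header. `encodeMsgpack` stands in for a real msgpack library and is not the code we shipped:

```javascript
// Choose the response encoding per request: msgpack when the client asks for
// it (and an encoder is available), JSON otherwise.
function serializeResponse(accept, body, encodeMsgpack) {
  if (accept === "application/x-msgpack" && encodeMsgpack) {
    return {
      contentType: "application/x-msgpack",
      payload: encodeMsgpack(body)
    };
  }
  return { contentType: "application/json", payload: JSON.stringify(body) };
}

const out = serializeResponse("application/json", { ok: true });
console.log(out.contentType); // application/json
```

Owning the final flush of the response buffer is what makes this kind of per-request encoding swap a one-layer change instead of a cross-team project.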
Local Development

Before the migration, developing and debugging on the endpoint was painful due to slow deployment and lack of local debugging (this post covers that in more detail). One of the Android team’s biggest motivations for doing this migration project was to improve this experience. The new microservice gave us fast deployment and debug support by running the service in a local Docker instance, which has led to significant productivity improvements.

The Not-so-good

In the arduous process of breaking a monolith, you might get a sharp shard or two flung at you. A lot of what follows is not specific to Android, but we want to briefly mention these issues because they did end up affecting our app.

Latencies

The old api service was running on the same “machine” that also cached a lot of video metadata (by design). This meant that data that was static (e.g. video titles, descriptions) could be aggressively cached and reused across multiple requests. However, with the new microservice, even fetching this cached data needed to incur a network round trip, which added some latency.

This might sound like a classic example of “monoliths vs microservices”, but the reality is somewhat more complex. The monolith was also essentially still talking to a lot of downstream microservices: it just happened to have a custom-designed cache that helped a lot. Some of this increased latency was mitigated by better observability and more efficient batching of requests. But, for a small fraction of requests, after a lot of attempts at optimization, we just had to take the latency hit: sometimes, there are no silver bullets.

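
To illustrate what the monolith got “for free”, here is a sketch of a small in-process TTL cache for static metadata. The key format and TTL are illustrative only:

```javascript
// A tiny TTL cache: static metadata (titles, descriptions) served from
// memory instead of a network round trip. The clock is injectable for tests.
function makeTtlCache(ttlMs, now = Date.now) {
  const entries = new Map();
  return {
    get(key) {
      const e = entries.get(key);
      if (!e || now() - e.at > ttlMs) return undefined; // miss or expired
      return e.value;
    },
    set(key, value) {
      entries.set(key, { value, at: now() });
    }
  };
}

const cache = makeTtlCache(60_000);
cache.set("video:80154610:title", "Example Title");
console.log(cache.get("video:80154610:title")); // Example Title
```

In the monolith this kind of lookup stayed on-box; in the microservice the equivalent read became a network hop, which is exactly where the extra latency came from.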
Increased Partial Query Errors

As each call to our endpoint might need to make multiple requests to the api service, some of these calls can fail, leaving us with partial data. Handling such partial query errors isn’t a new problem: it is baked into the nature of composite protocols like Falcor or GraphQL. However, as we moved our route handlers into a new microservice, we now introduced a network boundary for fetching any data, as mentioned earlier.

This meant that we now ran into partial states that weren’t possible before because of the custom caching. We were not completely aware of this problem in the beginning of our migration: we only saw it when some of our deserialized data objects had null fields. Since a lot of our code uses Kotlin, these partial data objects led to immediate crashes, which helped us notice the problem early: before it ever hit production.

As a result of increased partial errors, we’ve had to improve our overall error-handling approach and explore ways to minimize the impact of network errors. In some cases, we also added custom retry logic on either the endpoint or the client code.

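
As an illustration of the kind of custom retry logic mentioned, here is a hedged sketch of retrying a failed downstream call with a simple linear backoff. The retry count and delays are made up, not production values:

```javascript
// Retry a downstream call up to `retries` extra times, waiting a little
// longer before each attempt; rethrow the last error if all attempts fail.
async function withRetry(fn, { retries = 2, delayMs = 100 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        await new Promise((r) => setTimeout(r, delayMs * (attempt + 1)));
      }
    }
  }
  throw lastError;
}

// Demo: a call that fails twice, then succeeds.
(async () => {
  let attempts = 0;
  const result = await withRetry(async () => {
    attempts++;
    if (attempts < 3) throw new Error("transient downstream failure");
    return "ok";
  }, { retries: 2, delayMs: 10 });
  console.log(result, "after", attempts, "attempts"); // ok after 3 attempts
})();
```

Retries like this only make sense for idempotent reads; for anything else, surfacing the partial error to the client is usually the safer choice.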
Final Thoughts

This has been a long (you can tell!) and fulfilling journey for us on the Android team: as we mentioned earlier, we typically work on the app, and until now we did not have a chance to work with our endpoint at this level of scrutiny. Not only did we learn more about the intriguing world of microservices, but working on this project gave us the perfect opportunity to add observability to our app-endpoint interaction. At the same time, we ran into some unexpected issues, like partial errors, and made our app more resilient to them in the process.

As we continue to evolve and improve our app, we hope to share more insights like these with you.

The planning and successful migration to this new service was the combined effort of multiple backend and front end teams.

On the Android team, we ship the Netflix app on Android to millions of members around the world. Our responsibilities include extensive A/B testing on a wide variety of devices by building highly performant and often custom UI experiences. We work on data driven optimizations at scale in a diverse and sometimes unforgiving device and network ecosystem. If you find these challenges interesting, and want to work with us, we have an open position.

Translated from: https://netflixtechblog.com/seamlessly-swapping-the-api-backend-of-the-netflix-android-app-3d4317155187
