Reliable, Scalable, and Maintainable Applications


----------------------------------

PART I: Foundations of Data Systems

The first four chapters go through the fundamental ideas that apply to all data systems, whether running on a single machine or distributed across a cluster of machines:

1. Chapter 1 introduces the terminology and approach that we're going to use throughout this book. It examines what we actually mean by words like reliability, scalability, and maintainability, and how we can try to achieve these goals.

2. Chapter 2 compares several different data models and query languages—the most visible distinguishing factor between databases from a developer's point of view. We will see how different models are appropriate to different situations.

3. Chapter 3 turns to the internals of storage engines and looks at how databases lay out data on disk. Different storage engines are optimized for different workloads, and choosing the right one can have a huge effect on performance.

4. Chapter 4 compares various formats for data encoding (serialization) and especially examines how they fare in an environment where application requirements change and schemas need to adapt over time.

Later, Part II will turn to the particular issues of distributed data systems.

 

CHAPTER 1

Reliable, Scalable, and Maintainable Applications

The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free?

Alan Kay, in interview with Dr. Dobb's Journal (2012)

Many applications today are data-intensive, as opposed to compute-intensive. Raw CPU power is rarely a limiting factor for these applications—bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.

A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:

•    Store data so that they, or another application, can find it again later (databases)

•    Remember the result of an expensive operation, to speed up reads (caches)

•    Allow users to search data by keyword or filter it in various ways (search indexes)

•    Send a message to another process, to be handled asynchronously (stream processing)

•    Periodically crunch a large amount of accumulated data (batch processing)

If that sounds painfully obvious, that's just because these data systems are such a successful abstraction: we use them all the time without thinking too much. When building an application, most engineers wouldn't dream of writing a new data storage engine from scratch, because databases are a perfectly good tool for the job.

But reality is not that simple. There are many database systems with different characteristics, because different applications have different requirements. There are various approaches to caching, several ways of building search indexes, and so on. When building an application, we still need to figure out which tools and which approaches are the most appropriate for the task at hand. And it can be hard to combine tools when you need to do something that a single tool cannot do alone.

This book is a journey through both the principles and the practicalities of data systems, and how you can use them to build data-intensive applications. We will explore what different tools have in common, what distinguishes them, and how they achieve their characteristics.

In this chapter, we will start by exploring the fundamentals of what we are trying to achieve: reliable, scalable, and maintainable data systems. We'll clarify what those things mean, outline some ways of thinking about them, and go over the basics that we will need for later chapters. In the following chapters we will continue layer by layer, looking at different design decisions that need to be considered when working on a data-intensive application.

Thinking About Data Systems

We typically think of databases, queues, caches, etc. as being very different categories of tools. Although a database and a message queue have some superficial similarity—both store data for some time—they have very different access patterns, which means different performance characteristics, and thus very different implementations.

So why should we lump them all together under an umbrella term like data systems?

Many new tools for data storage and processing have emerged in recent years. They are optimized for a variety of different use cases, and they no longer neatly fit into traditional categories [1]. For example, there are datastores that are also used as message queues (Redis), and there are message queues with database-like durability guarantees (Apache Kafka). The boundaries between the categories are becoming blurred.

Secondly, increasingly many applications now have such demanding or wide-ranging requirements that a single tool can no longer meet all of its data processing and storage needs. Instead, the work is broken down into tasks that can be performed efficiently on a single tool, and those different tools are stitched together using application code.

For example, if you have an application-managed caching layer (using Memcached or similar), or a full-text search server (such as Elasticsearch or Solr) separate from your main database, it is normally the application code's responsibility to keep those caches and indexes in sync with the main database. Figure 1-1 gives a glimpse of what this may look like (we will go into detail in later chapters).

Figure 1-1. One possible architecture for a data system that combines several components.

When you combine several tools in order to provide a service, the service's interface or application programming interface (API) usually hides those implementation details from clients. Now you have essentially created a new, special-purpose data system from smaller, general-purpose components. Your composite data system may provide certain guarantees: e.g., that the cache will be correctly invalidated or updated on writes so that outside clients see consistent results. You are now not only an application developer, but also a data system designer.

If you are designing a data system or service, a lot of tricky questions arise. How do you ensure that the data remains correct and complete, even when things go wrong internally? How do you provide consistently good performance to clients, even when parts of your system are degraded? How do you scale to handle an increase in load? What does a good API for the service look like?

There are many factors that may influence the design of a data system, including the skills and experience of the people involved, legacy system dependencies, the timescale for delivery, your organization's tolerance of different kinds of risk, regulatory constraints, etc. Those factors depend very much on the situation.

In this book, we focus on three concerns that are important in most software systems:

Reliability
The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error). See "Reliability" on page 6.

Scalability
As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth. See "Scalability" on page 10.

Maintainability
Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively. See "Maintainability" on page 18.

These words are often cast around without a clear understanding of what they mean. In the interest of thoughtful engineering, we will spend the rest of this chapter exploring ways of thinking about reliability, scalability, and maintainability. Then, in the following chapters, we will look at various techniques, architectures, and algorithms that are used in order to achieve those goals.

Reliability

Everybody has an intuitive idea of what it means for something to be reliable or unreliable. For software, typical expectations include:

•    The application performs the function that the user expected.

•    It can tolerate the user making mistakes or using the software in unexpected ways.

•    Its performance is good enough for the required use case, under the expected load and data volume.

•    The system prevents any unauthorized access and abuse.

If all those things together mean "working correctly," then we can understand reliability as meaning, roughly, "continuing to work correctly, even when things go wrong." The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient. The former term is slightly misleading: it suggests that we could make a system tolerant of every possible kind of fault, which in reality is not feasible. If the entire planet Earth (and all servers on it) were swallowed by a black hole, tolerance of that fault would require web hosting in space—good luck getting that budget item approved. So it only makes sense to talk about tolerating certain types of faults.

Note that a fault is not the same as a failure [2]. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures. In this book we cover several techniques for building reliable systems from unreliable parts.

Counterintuitively, in such fault-tolerant systems, it can make sense to increase the rate of faults by triggering them deliberately—for example, by randomly killing individual processes without warning. Many critical bugs are actually due to poor error handling [3]; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally. The Netflix Chaos Monkey [4] is an example of this approach.
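Here is a minimal Python sketch of that idea, assuming a hypothetical get_worker_pids helper that returns the PIDs of a service's worker processes (an illustration in the spirit of Chaos Monkey, not Netflix's actual implementation):

    import os
    import random
    import signal
    import time

    def chaos_loop(get_worker_pids, interval_seconds=3600):
        # Periodically pick one worker at random and kill it without
        # warning, so the fault-tolerance machinery is exercised and
        # tested continually rather than only when real faults occur.
        while True:
            time.sleep(interval_seconds)
            pids = get_worker_pids()
            if pids:
                victim = random.choice(pids)
                os.kill(victim, signal.SIGKILL)  # no warning, no cleanup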

Although we generally prefer tolerating faults over preventing faults, there are cases where prevention is better than cure (e.g., because no cure exists). This is the case with security matters, for example: if an attacker has compromised a system and gained access to sensitive data, that event cannot be undone. However, this book mostly deals with the kinds of faults that can be cured, as described in the following sections.

Hardware Faults

When we think of causes of system failure, hardware faults quickly come to mind. Hard disks crash, RAM becomes faulty, the power grid has a blackout, someone unplugs the wrong network cable. Anyone who has worked with large datacenters can tell you that these things happen all the time when you have a lot of machines.

Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years [5, 6]. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
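A quick back-of-the-envelope check of that claim, assuming an MTTF in the middle of the quoted range and treating disk failures as independent:

    disks = 10_000
    mttf_years = 30          # midpoint of the quoted 10-50 year range
    failures_per_day = disks / (mttf_years * 365)
    print(failures_per_day)  # ~0.9, i.e. roughly one disk death per day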

Our first response is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system. Disks may be set up in a RAID configuration, servers may have dual power supplies and hot-swappable CPUs, and datacenters may have batteries and diesel generators for backup power. When one component dies, the redundant component can take its place while the broken component is replaced. This approach cannot completely prevent hardware problems from causing failures, but it is well understood and can often keep a machine running uninterrupted for years.

Until recently, redundancy of hardware components was sufficient for most applications, since it makes total failure of a single machine fairly rare. As long as you can restore a backup onto a new machine fairly quickly, the downtime in case of failure is not catastrophic in most applications. Thus, multi-machine redundancy was only required by a small number of applications for which high availability was absolutely essential.

However, as data volumes and applications' computing demands have increased, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults. Moreover, in some cloud platforms such as Amazon Web Services (AWS) it is fairly common for virtual machine instances to become unavailable without warning [7], as the platforms are designed to prioritize flexibility and elasticity[i] over single-machine reliability.

[i] Defined in "Approaches for Coping with Load" on page 17.

Hence there is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy. Such systems also have operational advantages: a single-server system requires planned downtime if you need to reboot the machine (to apply operating system security patches, for example), whereas a system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system (a rolling upgrade; see Chapter 4).

Software Errors

We usually think of hardware faults as being random and independent from each other: one machine's disk failing does not imply that another machine's disk is going to fail. There may be weak correlations (for example due to a common cause, such as the temperature in the server rack), but otherwise it is unlikely that a large number of hardware components will fail at the same time.

Another class of fault is a systematic error within the system [8]. Such faults are harder to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults [5]. Examples include:

•    A software bug that causes every instance of an application server to crash when given a particular bad input. For example, consider the leap second on June 30, 2012, that caused many applications to hang simultaneously due to a bug in the Linux kernel [9].

•    A runaway process that uses up some shared resource—CPU time, memory, disk space, or network bandwidth.


•    A service that the system depends on that slows down, becomes unresponsive, or starts returning corrupted responses.

•    Cascading failures, where a small fault in one component triggers a fault in another component, which in turn triggers further faults [10].

The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. In those circumstances, it is revealed that the software is making some kind of assumption about its environment—and while that assumption is usually true, it eventually stops being true for some reason [11].

There is no quick solution to the problem of systematic faults in software. Lots of small things can help: carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; measuring, monitoring, and analyzing system behavior in production. If a system is expected to provide some guarantee (for example, in a message queue, that the number of incoming messages equals the number of outgoing messages), it can constantly check itself while it is running and raise an alert if a discrepancy is found [12].
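A minimal sketch of that self-checking idea, using a hypothetical in-process queue wrapper (the names here are illustrative, not from any real library):

    from threading import Lock

    class AuditedQueue:
        def __init__(self, alert):
            self._items = []
            self._lock = Lock()
            self._in_count = 0
            self._out_count = 0
            self._alert = alert  # callback invoked on a detected discrepancy

        def put(self, item):
            with self._lock:
                self._items.append(item)
                self._in_count += 1

        def get(self):
            with self._lock:
                item = self._items.pop(0)
                self._out_count += 1
                return item

        def audit(self):
            # Invariant: messages in == messages out + messages still queued.
            with self._lock:
                if self._in_count != self._out_count + len(self._items):
                    self._alert("queue counter mismatch detected")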

Human Errors

Humans design and build software systems, and the operators who keep the systems running are also human. Even when they have the best intentions, humans are known to be unreliable. For example, one study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages [13].

How do we make our systems reliable, in spite of unreliable humans? The best systems combine several approaches:

•    Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do "the right thing" and discourage "the wrong thing." However, if the interfaces are too restrictive people will work around them, negating their benefit, so this is a tricky balance to get right.

•    Decouple the places where people make the most mistakes from the places where they can cause failures. In particular, provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.

•    Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests [3]. Automated testing is widely used, well understood, and especially valuable for covering corner cases that rarely arise in normal operation.

•    Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure. For example, make it fast to roll back configuration changes, roll out new code gradually (so that any unexpected bugs affect only a small subset of users), and provide tools to recompute data (in case it turns out that the old computation was incorrect).

•    Set up detailed and clear monitoring, such as performance metrics and error rates. In other engineering disciplines this is referred to as telemetry. (Once a rocket has left the ground, telemetry is essential for tracking what is happening, and for understanding failures [14].) Monitoring can show us early warning signals and allow us to check whether any assumptions or constraints are being violated. When a problem occurs, metrics can be invaluable in diagnosing the issue.

•    Implement good management practices and training—a complex and important aspect, and beyond the scope of this book.

How Important Is Reliability?

Reliability is not just for nuclear power stations and air traffic control software—more mundane applications are also expected to work reliably. Bugs in business applications cause lost productivity (and legal risks if figures are reported incorrectly), and outages of ecommerce sites can have huge costs in terms of lost revenue and damage to reputation.

Even in "noncritical" applications we have a responsibility to our users. Consider a parent who stores all their pictures and videos of their children in your photo application [15]. How would they feel if that database was suddenly corrupted? Would they know how to restore it from a backup?

There are situations in which we may choose to sacrifice reliability in order to reduce development cost (e.g., when developing a prototype product for an unproven market) or operational cost (e.g., for a service with a very narrow profit margin)—but we should be very conscious of when we are cutting corners.

Scalability

Even if a system is working reliably today, that doesn't mean it will necessarily work reliably in the future. One common reason for degradation is increased load: perhaps the system has grown from 10,000 concurrent users to 100,000 concurrent users, or from 1 million to 10 million. Perhaps it is processing much larger volumes of data than it did before.

Scalability is the term we use to describe a system's ability to cope with increased load. Note, however, that it is not a one-dimensional label that we can attach to a system: it is meaningless to say "X is scalable" or "Y doesn't scale." Rather, discussing scalability means considering questions like "If the system grows in a particular way, what are our options for coping with the growth?" and "How can we add computing resources to handle the additional load?"

Describing Load

First, we need to succinctly describe the current load on the system; only then can we discuss growth questions (what happens if our load doubles?). Load can be described with a few numbers which we call load parameters. The best choice of parameters depends on the architecture of your system: it may be requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else. Perhaps the average case is what matters for you, or perhaps your bottleneck is dominated by a small number of extreme cases.

To make this idea more concrete, let's consider Twitter as an example, using data published in November 2012 [16]. Two of Twitter's main operations are:

Post tweet
A user can publish a new message to their followers (4.6k requests/sec on average, over 12k requests/sec at peak).

Home timeline
A user can view tweets posted by the people they follow (300k requests/sec).

Simply handling 12,000 writes per second (the peak rate for posting tweets) would be fairly easy. However, Twitter's scaling challenge is not primarily due to tweet volume, but due to fan-out[ii]—each user follows many people, and each user is followed by many people. There are broadly two ways of implementing these two operations:

[ii] A term borrowed from electronic engineering, where it describes the number of logic gate inputs that are attached to another gate's output. The output needs to supply enough current to drive all the attached inputs. In transaction processing systems, we use it to describe the number of requests to other services that we need to make in order to serve one incoming request.

1. Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users, and merge them (sorted by time). In a relational database like in Figure 1-2, you could write a query such as:

    SELECT tweets.*, users.*
      FROM tweets
      JOIN users   ON tweets.sender_id    = users.id
      JOIN follows ON follows.followee_id = users.id
     WHERE follows.follower_id = current_user


 


2. Maintain a cache for each user's home timeline—like a mailbox of tweets for each recipient user (see Figure 1-3). When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. The request to read the home timeline is then cheap, because its result has been computed ahead of time. (A minimal sketch of this fan-out on write follows below.)
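As a concrete illustration of approach 2, here is a Python sketch of fan-out on write, using hypothetical in-memory stand-ins for the follower graph and the per-user timeline caches (Twitter's real system uses dedicated data stores; this only shows the data flow):

    from collections import defaultdict, deque

    followers = defaultdict(set)    # user_id -> set of follower ids
    timelines = defaultdict(deque)  # user_id -> recent tweets, newest first

    TIMELINE_LIMIT = 800  # keep only the newest entries in each cache

    def post_tweet(sender_id, tweet):
        # Fan-out on write: push the new tweet into every follower's cache.
        for follower_id in followers[sender_id]:
            cache = timelines[follower_id]
            cache.appendleft((sender_id, tweet))
            while len(cache) > TIMELINE_LIMIT:
                cache.pop()

    def home_timeline(user_id):
        # Reading is cheap: the result was already computed at write time.
        return list(timelines[user_id])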

The first version of Twitter used approach 1, but the systems struggled to keep up with the load of home timeline queries, so the company switched to approach 2. This works better because the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads, and so in this case it's preferable to do more work at write time and less at read time.

Figure 1-2. Simple relational schema for implementing a Twitter home timeline.

Figure 1-3. Twitter's data pipeline for delivering tweets to followers, with load parameters as of November 2012 [16].

However, the downside of approach 2 is that posting a tweet now requires a lot of extra work. On average, a tweet is delivered to about 75 followers, so 4.6k tweets per second become 345k writes per second to the home timeline caches. But this average hides the fact that the number of followers per user varies wildly, and some users have over 30 million followers. This means that a single tweet may result in over 30 million writes to home timelines! Doing this in a timely manner—Twitter tries to deliver tweets to followers within five seconds—is a significant challenge.
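The arithmetic behind those numbers, spelled out:

    tweets_per_sec = 4_600
    avg_followers = 75
    print(tweets_per_sec * avg_followers)  # 345,000 timeline writes/sec

    # Worst case: a single tweet from a user with 30 million followers
    # triggers 30 million home timeline writes on its own.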

In the example of Twitter, the distribution of followers per user (maybe weighted by how often those users tweet) is a key load parameter for discussing scalability, since it determines the fan-out load. Your application may have very different characteristics, but you can apply similar principles to reasoning about its load.

The final twist of the Twitter anecdote: now that approach 2 is robustly implemented, Twitter is moving to a hybrid of both approaches. Most users' tweets continue to be fanned out to home timelines at the time when they are posted, but a small number of users with a very large number of followers (i.e., celebrities) are excepted from this fan-out. Tweets from any celebrities that a user may follow are fetched separately and merged with that user's home timeline when it is read, like in approach 1. This hybrid approach is able to deliver consistently good performance. We will revisit this example in Chapter 12 after we have covered some more technical ground.

Describing Performance

Once you have described the load on your system, you can investigate what happens when the load increases. You can look at it in two ways:

•    When you increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of your system affected?

•    When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?

Both questions require performance numbers, so let's look briefly at describing the performance of a system.

In a batch processing system such as Hadoop, we usually care about throughput—the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size.[iii] In online systems, what's usually more important is the service's response time—that is, the time between a client sending a request and receiving a response.

[iii] In an ideal world, the running time of a batch job is the size of the dataset divided by the throughput. In practice, the running time is often longer, due to skew (data not being spread evenly across worker processes) and needing to wait for the slowest task to complete.

Latency and response time

Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays. Latency is the duration that a request is waiting to be handled—during which it is latent, awaiting service [17].

Even if you only make the same request over and over again, you'll get a slightly different response time on every try. In practice, in a system handling a variety of requests, the response time can vary a lot. We therefore need to think of response time not as a single number, but as a distribution of values that you can measure.

In Figure 1-4, each gray bar represents a request to a service, and its height shows how long that request took. Most requests are reasonably fast, but there are occasional outliers that take much longer. Perhaps the slow requests are intrinsically more expensive, e.g., because they process more data. But even in a scenario where you'd think all requests should take the same time, you get variation: random additional latency could be introduced by a context switch to a background process, the loss of a network packet and TCP retransmission, a garbage collection pause, a page fault forcing a read from disk, mechanical vibrations in the server rack [18], or many other causes.

Figure 1-4. Illustrating mean and percentiles: response times for a sample of 100 requests to a service.

It’s common to see the average response time of aservice reported. (Strictly speaking, the term “average” doesn’t refer to anyparticular formula, but in practice it is usually understood as thearithmeticmean: given n values, add up all the values, and divide byn.)However, the mean is not a very good metric if you want to know your “typical”response time, because it doesn’t tell you how many users actually experiencedthat delay.

一般关注服务的平均响应时间(严格来说,平均这个词并不指任何特定公式算法,在实际中它一般被理解为算数平均:给n个值,求和再除n)。但是,这并不是你想知道的“典型”的响应时间的好的维度,因为它并没有说明多少用户真正能感受到那种时延。

Usually it is better to use percentiles. If you take your list of response times and sort it from fastest to slowest, then the median is the halfway point: for example, if your median response time is 200 ms, that means half your requests return in less than 200 ms, and half your requests take longer than that.

This makes the median a good metric if you want to know how long users typically have to wait: half of user requests are served in less than the median response time, and the other half take longer than the median. The median is also known as the 50th percentile, and sometimes abbreviated as p50. Note that the median refers to a single request; if the user makes several requests (over the course of a session, or because several resources are included in a single page), the probability that at least one of them is slower than the median is much greater than 50%.

In order to figure out how bad your outliers are, you can look at higher percentiles: the 95th, 99th, and 99.9th percentiles are common (abbreviated p95, p99, and p999). They are the response time thresholds at which 95%, 99%, or 99.9% of requests are faster than that particular threshold. For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more. This is illustrated in Figure 1-4.
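A minimal sketch of computing such percentiles by sorting, using the nearest-rank method (one of several common conventions):

    def percentile(response_times_ms, p):
        # Nearest-rank method: the smallest measured value such that at
        # least p percent of the measurements are less than or equal to it.
        ordered = sorted(response_times_ms)
        rank = max(1, -(-len(ordered) * p // 100))  # ceil(n * p / 100)
        return ordered[rank - 1]

    samples = [30, 32, 35, 41, 44, 52, 60, 75, 120, 1500]  # made-up data
    print(percentile(samples, 50))  # 44   (the median, p50)
    print(percentile(samples, 95))  # 1500 (with only 10 samples, p95 is the slowest)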

High percentiles of response times, also known as tail latencies, are important because they directly affect users' experience of the service. For example, Amazon describes response time requirements for internal services in terms of the 99.9th percentile, even though it only affects 1 in 1,000 requests. This is because the customers with the slowest requests are often those who have the most data on their accounts because they have made many purchases—that is, they're the most valuable customers [19]. It's important to keep those customers happy by ensuring the website is fast for them: Amazon has also observed that a 100 ms increase in response time reduces sales by 1% [20], and others report that a 1-second slowdown reduces a customer satisfaction metric by 16% [21, 22].

On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed too expensive and to not yield enough benefit for Amazon's purposes. Reducing response times at very high percentiles is difficult because they are easily affected by random events outside of your control, and the benefits are diminishing.

For example, percentiles are often used in service level objectives (SLOs) and service level agreements (SLAs), contracts that define the expected performance and availability of a service. An SLA may state that the service is considered to be up if it has a median response time of less than 200 ms and a 99th percentile under 1 s (if the response time is longer, it might as well be down), and the service may be required to be up at least 99.9% of the time. These metrics set expectations for clients of the service and allow customers to demand a refund if the SLA is not met.

Queueing delays often account for a large part of the response time at high percentiles. As a server can only process a small number of things in parallel (limited, for example, by its number of CPU cores), it only takes a small number of slow requests to hold up the processing of subsequent requests—an effect sometimes known as head-of-line blocking. Even if those subsequent requests are fast to process on the server, the client will see a slow overall response time due to the time waiting for the prior request to complete. Due to this effect, it is important to measure response times on the client side.

When generating load artificially in order to test the scalability of a system, the load-generating client needs to keep sending requests independently of the response time. If the client waits for the previous request to complete before sending the next one, that behavior has the effect of artificially keeping the queues shorter in the test than they would be in reality, which skews the measurements [23].
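A sketch of what such an open-loop load generator looks like, assuming a hypothetical send_request function: requests are issued on a fixed schedule regardless of how long responses take, so server-side queueing shows up in the measurements instead of being hidden by a waiting client:

    import threading
    import time

    def open_loop_load(send_request, rate_per_sec, duration_sec):
        interval = 1.0 / rate_per_sec
        deadline = time.monotonic() + duration_sec
        next_send = time.monotonic()
        while next_send < deadline:
            # Fire and forget: do not wait for the previous response
            # before issuing the next request.
            threading.Thread(target=send_request, daemon=True).start()
            next_send += interval
            time.sleep(max(0.0, next_send - time.monotonic()))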

Percentiles in Practice

High percentiles become especially important in backend services that are called multiple times as part of serving a single end-user request. Even if you make the calls in parallel, the end-user request still needs to wait for the slowest of the parallel calls to complete. It takes just one slow call to make the entire end-user request slow, as illustrated in Figure 1-5. Even if only a small percentage of backend calls are slow, the chance of getting a slow call increases if an end-user request requires multiple backend calls, and so a higher proportion of end-user requests end up being slow (an effect known as tail latency amplification [24]).

If you want to add response time percentiles to the monitoring dashboards for your services, you need to efficiently calculate them on an ongoing basis. For example, you may want to keep a rolling window of response times of requests in the last 10 minutes. Every minute, you calculate the median and various percentiles over the values in that window and plot those metrics on a graph.

The naïve implementation is to keep a list of response times for all requests within the time window and to sort that list every minute. If that is too inefficient for you, there are algorithms that can calculate a good approximation of percentiles at minimal CPU and memory cost, such as forward decay [25], t-digest [26], or HdrHistogram [27]. Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from several machines, is mathematically meaningless—the right way of aggregating response time data is to add the histograms [28].
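A sketch of that naïve implementation (fine for modest request volumes; the approximation algorithms cited above are for when this becomes too expensive):

    import time

    class RollingPercentiles:
        # Keep response times from the last window_sec seconds and
        # compute percentiles by sorting -- the naive approach.
        def __init__(self, window_sec=600):
            self.window_sec = window_sec
            self.samples = []  # list of (timestamp, response_time_ms)

        def record(self, response_time_ms):
            self.samples.append((time.monotonic(), response_time_ms))

        def snapshot(self, ps=(50, 95, 99)):
            cutoff = time.monotonic() - self.window_sec
            self.samples = [s for s in self.samples if s[0] >= cutoff]
            ordered = sorted(t for _, t in self.samples)
            if not ordered:
                return {}
            return {p: ordered[min(len(ordered) - 1, len(ordered) * p // 100)]
                    for p in ps}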

 

Figure 1-5. When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request.

Approaches for Coping with Load

Now that we have discussed the parameters for describing load and metrics for measuring performance, we can start discussing scalability in earnest: how do we maintain good performance even when our load parameters increase by some amount?

An architecture that is appropriate for one level of load is unlikely to cope with 10 times that load. If you are working on a fast-growing service, it is therefore likely that you will need to rethink your architecture on every order of magnitude load increase—or perhaps even more often than that.

People often talk of a dichotomy between scaling up (vertical scaling, moving to a more powerful machine) and scaling out (horizontal scaling, distributing the load across multiple smaller machines). Distributing load across multiple machines is also known as a shared-nothing architecture. A system that can run on a single machine is often simpler, but high-end machines can become very expensive, so very intensive workloads often can't avoid scaling out. In reality, good architectures usually involve a pragmatic mixture of approaches: for example, using several fairly powerful machines can still be simpler and cheaper than a large number of small virtual machines.

Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase, whereas other systems are scaled manually (a human analyzes the capacity and decides to add more machines to the system). An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises (see "Rebalancing Partitions" on page 209).

While distributing stateless services across multiple machines is fairly straightforward, taking stateful data systems from a single node to a distributed setup can introduce a lot of additional complexity. For this reason, common wisdom until recently was to keep your database on a single node (scale up) until scaling cost or high-availability requirements forced you to make it distributed.

As the tools and abstractions for distributed systems get better, this common wisdom may change, at least for some kinds of applications. It is conceivable that distributed data systems will become the default in the future, even for use cases that don't handle large volumes of data or traffic. Over the course of the rest of this book we will cover many kinds of distributed data systems, and discuss how they fare not just in terms of scalability, but also ease of use and maintainability.

The architecture of systems that operate at large scale is usually highly specific to the application—there is no such thing as a generic, one-size-fits-all scalable architecture (informally known as magic scaling sauce). The problem may be the volume of reads, the volume of writes, the volume of data to store, the complexity of the data, the response time requirements, the access patterns, or (usually) some mixture of all of these plus many more issues.

For example, a system that is designed to handle 100,000 requests per second, each 1 kB in size, looks very different from a system that is designed for 3 requests per minute, each 2 GB in size—even though the two systems have the same data throughput.

An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare—the load parameters. If those assumptions turn out to be wrong, the engineering effort for scaling is at best wasted, and at worst counterproductive. In an early-stage startup or an unproven product it's usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load.

Even though they are specific to a particular application, scalable architectures are nevertheless usually built from general-purpose building blocks, arranged in familiar patterns. In this book we discuss those building blocks and patterns.

Maintainability

It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance—fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features.

Yet, unfortunately, many people working on software systems dislike maintenance of so-called legacy systems—perhaps it involves fixing other people's mistakes, or working with platforms that are now outdated, or systems that were forced to do things they were never intended for. Every legacy system is unpleasant in its own way, and so it is difficult to give general recommendations for dealing with them.

However, we can and should design software in such a way that it will hopefully minimize pain during maintenance, and thus avoid creating legacy software ourselves. To this end, we will pay particular attention to three design principles for software systems:

Operability
Make it easy for operations teams to keep the system running smoothly.

Simplicity
Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system. (Note this is not the same as simplicity of the user interface.)

Evolvability
Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity.

As previously with reliability and scalability, there are no easy solutions for achieving these goals. Rather, we will try to think about systems with operability, simplicity, and evolvability in mind.

Operability: Making Life Easy for Operations

It has been suggested that "good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations" [12]. While some aspects of operations can and should be automated, it is still up to humans to set up that automation in the first place and to make sure it's working correctly.

Operations teams are vital to keeping a software system running smoothly. A good operations team typically is responsible for the following, and more [29]:

•    Monitoring the health of the system and quickly restoring service if it goes into a bad state

•    Tracking down the cause of problems, such as system failures or degraded performance

•    Keeping software and platforms up to date, including security patches

•    Keeping tabs on how different systems affect each other, so that a problematic change can be avoided before it causes damage

•    Anticipating future problems and solving them before they occur (e.g., capacity planning)

•    Establishing good practices and tools for deployment, configuration management, and more

•    Performing complex maintenance tasks, such as moving an application from one platform to another

•    Maintaining the security of the system as configuration changes are made

•    Defining processes that make operations predictable and help keep the production environment stable

•    Preserving the organization's knowledge about the system, even as individual people come and go

Good operability means making routine tasks easy, allowing the operations team to focus their efforts on high-value activities. Data systems can do various things to make routine tasks easy, including:

•    Providing visibility into the runtime behavior and internals of the system, with good monitoring

•    Providing good support for automation and integration with standard tools

•    Avoiding dependency on individual machines (allowing machines to be taken down for maintenance while the system as a whole continues running uninterrupted)

•    Providing good documentation and an easy-to-understand operational model ("If I do X, Y will happen")

•    Providing good default behavior, but also giving administrators the freedom to override defaults when needed

•    Self-healing where appropriate, but also giving administrators manual control over the system state when needed

•    Exhibiting predictable behavior, minimizing surprises

Simplicity: Managing Complexity

Small software projects can have delightfully simple and expressive code, but as projects get larger, they often become very complex and difficult to understand. This complexity slows down everyone who needs to work on the system, further increasing the cost of maintenance. A software project mired in complexity is sometimes described as a big ball of mud [30].

There are various possible symptoms of complexity: explosion of the state space, tight coupling of modules, tangled dependencies, inconsistent naming and terminology, hacks aimed at solving performance problems, special-casing to work around issues elsewhere, and many more. Much has been said on this topic already [31, 32, 33].

When complexity makes maintenance hard, budgets and schedules are often overrun. In complex software, there is also a greater risk of introducing bugs when making a change: when the system is harder for developers to understand and reason about, hidden assumptions, unintended consequences, and unexpected interactions are more easily overlooked. Conversely, reducing complexity greatly improves the maintainability of software, and thus simplicity should be a key goal for the systems we build.

Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity. Moseley and Marks [32] define complexity as accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.

One of the best tools we have for removing accidental complexity is abstraction. A good abstraction can hide a great deal of implementation detail behind a clean, simple-to-understand façade. A good abstraction can also be used for a wide range of different applications. Not only is this reuse more efficient than reimplementing a similar thing multiple times, but it also leads to higher-quality software, as quality improvements in the abstracted component benefit all applications that use it.

For example, high-level programming languages are abstractions that hide machine code, CPU registers, and syscalls. SQL is an abstraction that hides complex on-disk and in-memory data structures, concurrent requests from other clients, and inconsistencies after crashes. Of course, when programming in a high-level language, we are still using machine code; we are just not using it directly, because the programming language abstraction saves us from having to think about it.

However, finding good abstractions is very hard. In the field of distributed systems, although there are many good algorithms, it is much less clear how we should be packaging them into abstractions that help us keep the complexity of the system at a manageable level.

Throughout this book, we will keep our eyes open for good abstractions that allow us to extract parts of a large system into well-defined, reusable components.

Evolvability: Making Change Easy

It’s extremely unlikely that your system’s requirements will remain unchanged forever. They are much more likely to be in constant flux: you learn new facts, previously unanticipated use cases emerge, business priorities change, users request new features, new platforms replace old platforms, legal or regulatory requirements change, growth of the system forces architectural changes, etc.

In terms of organizational processes, Agile working patterns provide a framework for adapting to change. The Agile community has also developed technical tools and patterns that are helpful when developing software in a frequently changing environment, such as test-driven development (TDD) and refactoring.
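As a minimal, hypothetical illustration of the TDD rhythm (the function and its behavior are invented for this sketch): the test is written first and fails, then the simplest passing implementation follows, and later refactorings can proceed with the test as a safety net.

```python
import re

# Red: a test written before the implementation exists, pinning down
# the behavior we want.
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"

# Green: the simplest implementation that makes the test pass.
def slugify(text):
    # Collapse runs of non-alphanumeric characters into single hyphens.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

# Refactor: with the test in place, the implementation can be reworked
# freely as long as the test keeps passing.
test_slugify()
```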

Most discussions of these Agile techniques focus on a fairly small, local scale (a couple of source code files within the same application). In this book, we search for ways of increasing agility on the level of a larger data system, perhaps consisting of several different applications or services with different characteristics. For example, how would you “refactor” Twitter’s architecture for assembling home timelines (“Describing Load” on page 11) from approach 1 to approach 2?
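To recall the two approaches in code, here is a hedged in-memory sketch (the names and data structures are stand-ins, not Twitter's real implementation): approach 1 assembles the home timeline at read time, approach 2 fans each tweet out to follower caches at write time, and the “refactoring” question is how to migrate a live system from one shape to the other.

```python
from collections import defaultdict

tweets = []                    # (author, text) pairs, append-only
follows = defaultdict(set)    # follower -> set of followees
timelines = defaultdict(list)  # follower -> cached home timeline

# Approach 1: cheap writes, expensive reads (query at read time).
def home_timeline_v1(user):
    return [(a, t) for (a, t) in tweets if a in follows[user]]

# Approach 2: expensive writes, cheap reads (fan out at write time).
# A real system would keep a reverse index of followers instead of
# scanning every user on each tweet.
def post_tweet_v2(author, text):
    tweets.append((author, text))
    for follower, followees in follows.items():
        if author in followees:
            timelines[follower].append((author, text))

def home_timeline_v2(user):
    return timelines[user]

follows["alice"].add("bob")
post_tweet_v2("bob", "hello")
assert home_timeline_v1("alice") == home_timeline_v2("alice")
```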

The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions: simple and easy-to-understand systems are usually easier to modify than complex ones. But since this is such an important idea, we will use a different word to refer to agility on a data system level: evolvability [34].

Summary

In this chapter, we have explored some fundamental ways of thinking about data-intensive applications. These principles will guide us through the rest of the book, where we dive into deep technical detail.

An application has to meet various requirements in order to be useful. There are functional requirements (what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways), and some nonfunctional requirements (general properties like security, reliability, compliance, scalability, compatibility, and maintainability). In this chapter we discussed reliability, scalability, and maintainability in detail.

Reliability means making systems work correctly, even when faults occur. Faults can be in hardware (typically random and uncorrelated), software (bugs are typically systematic and hard to deal with), and humans (who inevitably make mistakes from time to time). Fault-tolerance techniques can hide certain types of faults from the end user.

Scalability means having strategies for keeping performance good, even when load increases. In order to discuss scalability, we first need ways of describing load and performance quantitatively. We briefly looked at Twitter’s home timelines as an example of describing load, and response time percentiles as a way of measuring performance. In a scalable system, you can add processing capacity in order to remain reliable under high load.
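For instance, response time percentiles can be computed with the simple nearest-rank method sketched below (the sample latencies are made up); note how a single slow outlier dominates p95 and p99 while the median still looks healthy.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest measurement such that
    at least p% of measurements are less than or equal to it."""
    ranked = sorted(values)
    rank = math.ceil(p / 100 * len(ranked))
    return ranked[max(rank - 1, 0)]

response_times_ms = [30, 32, 35, 40, 42, 50, 55, 60, 120, 2000]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(response_times_ms, p)} ms")
# p50: 42 ms, p95: 2000 ms, p99: 2000 ms
```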

Maintainability has many facets, but in essence it’s about making life better for the engineering and operations teams who need to work with the system. Good abstractions can help reduce complexity and make the system easier to modify and adapt for new use cases. Good operability means having good visibility into the system’s health, and having effective ways of managing it.

There is unfortunately no easy fix for making applications reliable, scalable, or maintainable. However, there are certain patterns and techniques that keep reappearing in different kinds of applications. In the next few chapters we will take a look at some examples of data systems and analyze how they work toward those goals.

Later in the book, in Part III, we will look at patterns for systems that consist of several components working together, such as the one in Figure 1-1.

References

[1] Michael Stonebraker and Uğur Çetintemel: “‘One Size Fits All’: An Idea Whose Time Has Come and Gone,” at 21st International Conference on Data Engineering (ICDE), April 2005.

[2] Walter L. Heimerdinger and Charles B. Weinstock: “A Conceptual Framework for System Fault Tolerance,” Technical Report CMU/SEI-92-TR-033, Software Engineering Institute, Carnegie Mellon University, October 1992.

[3] Ding Yuan, Yu Luo, Xin Zhuang, et al.: “Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems,” at 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2014.

[4] Yury Izrailevsky and Ariel Tseitlin: “The Netflix Simian Army,” techblog.netflix.com, July 19, 2011.

[5] Daniel Ford, François Labelle, Florentina I. Popovici, et al.: “Availability in Globally Distributed Storage Systems,” at 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2010.

[6] Brian Beach: “Hard Drive Reliability Update – Sep 2014,” backblaze.com, September 23, 2014.

[7] Laurie Voss: “AWS: The Good, the Bad and the Ugly,” blog.awe.sm, December 18, 2012.

[8] Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, et al.: “What Bugs Live in the Cloud?,” at 5th ACM Symposium on Cloud Computing (SoCC), November 2014. doi:10.1145/2670979.2670986

[9] Nelson Minar: “Leap Second Crashes Half the Internet,” somebits.com, July 3, 2012.

[10] Amazon Web Services: “Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region,” aws.amazon.com, April 29, 2011.

[11] Richard I. Cook: “How Complex Systems Fail,” Cognitive Technologies Laboratory, April 2000.

[12] Jay Kreps: “Getting Real About Distributed System Reliability,” blog.empathybox.com, March 19, 2012.

[13] David Oppenheimer, Archana Ganapathi, and David A. Patterson: “Why Do Internet Services Fail, and What Can Be Done About It?,” at 4th USENIX Symposium on Internet Technologies and Systems (USITS), March 2003.

[14] Nathan Marz: “Principles of Software Engineering, Part 1,” nathanmarz.com, April 2, 2013.

[15] Michael Jurewitz: “The Human Impact of Bugs,” jury.me, March 15, 2013.

[16] Raffi Krikorian: “Timelines at Scale,” at QCon San Francisco, November 2012.

[17] Martin Fowler: Patterns of Enterprise Application Architecture. Addison-Wesley, 2002. ISBN: 978-0-321-12742-6

[18] Kelly Sommers: “After all that run around, what caused 500ms disk latency even when we replaced physical server?” twitter.com, November 13, 2014.

[19] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, et al.: “Dynamo: Amazon’s Highly Available Key-Value Store,” at 21st ACM Symposium on Operating Systems Principles (SOSP), October 2007.

[20] Greg Linden: “Make Data Useful,” slides from presentation at Stanford University Data Mining class (CS345), December 2006.

[21] Tammy Everts: “The Real Cost of Slow Time vs Downtime,” webperformancetoday.com, November 12, 2014.

[22] Jake Brutlag: “Speed Matters for Google Web Search,” googleresearch.blogspot.co.uk, June 22, 2009.

[23] Tyler Treat: “Everything You Know About Latency Is Wrong,” bravenewgeek.com, December 12, 2015.

[24] Jeffrey Dean and Luiz André Barroso: “The Tail at Scale,” Communications of the ACM, volume 56, number 2, pages 74–80, February 2013. doi:10.1145/2408776.2408794

[25] Graham Cormode, Vladislav Shkapenyuk, Divesh Srivastava, and Bojian Xu: “Forward Decay: A Practical Time Decay Model for Streaming Systems,” at 25th IEEE International Conference on Data Engineering (ICDE), March 2009.

[26] Ted Dunning and Otmar Ertl: “Computing Extremely Accurate Quantiles Using t-Digests,” github.com, March 2014.

[27] Gil Tene: “HdrHistogram,” hdrhistogram.org.

[28] Baron Schwartz: “Why Percentiles Don’t Work the Way You Think,” vividcortex.com, December 7, 2015.

[29] James Hamilton: “On Designing and Deploying Internet-Scale Services,” at 21st Large Installation System Administration Conference (LISA), November 2007.

[30] Brian Foote and Joseph Yoder: “Big Ball of Mud,” at 4th Conference on Pattern Languages of Programs (PLoP), September 1997.

[31] Frederick P. Brooks: “No Silver Bullet – Essence and Accident in Software Engineering,” in The Mythical Man-Month, Anniversary edition, Addison-Wesley, 1995. ISBN: 978-0-201-83595-3

[32] Ben Moseley and Peter Marks: “Out of the Tar Pit,” at BCS Software Practice Advancement (SPA), 2006.

[33] Rich Hickey: “Simple Made Easy,” at Strange Loop, September 2011.

[34] Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson: “Analyzing Software Evolvability,” at 32nd Annual IEEE International Computer Software and Applications Conference (COMPSAC), July 2008. doi:10.1109/COMPSAC.2008.50
