Reliable, Scalable, and Maintainable Applications

PART I Foundations of Data Systems


The first four chapters go through the fundamental ideas that apply to all data systems, whether running on a single machine or distributed across a cluster of machines:


1.    Chapter 1 introduces the terminology and approach that we’re going to use throughout this book. It examines what we actually mean by words like reliability, scalability, and maintainability, and how we can try to achieve these goals.

2.    Chapter 2 compares several different data models and query languages, the most visible distinguishing factor between databases from a developer’s point of view. We will see how different models are appropriate to different situations.

3.    Chapter 3 turns to the internals of storage engines and looks at how databases lay out data on disk. Different storage engines are optimized for different workloads, and choosing the right one can have a huge effect on performance.

4.    Chapter 4 compares various formats for data encoding (serialization) and especially examines how they fare in an environment where application requirements change and schemas need to adapt over time.

Later, Part II will turn to the particular issues of distributed data systems.



Reliable, Scalable, and Maintainable Applications


The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free?


Alan Kay, in interview with Dr Dobb’s Journal (2012)

Many applications today are data-intensive, as opposed to compute-intensive. Raw CPU power is rarely a limiting factor for these applications; bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.


A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:


•    Store data so that they, or another application, can find it again later (databases)

•    Remember the result of an expensive operation, to speed up reads (caches)

•    Allow users to search data by keyword or filter it in various ways (search indexes)

•    Send a message to another process, to be handled asynchronously (stream processing)

•    Periodically crunch a large amount of accumulated data (batch processing)

If that sounds painfully obvious, that’s just because these data systems are such a successful abstraction: we use them all the time without thinking too much. When building an application, most engineers wouldn’t dream of writing a new data storage engine from scratch, because databases are a perfectly good tool for the job.



But reality is not that simple. There are many database systems with different characteristics, because different applications have different requirements. There are various approaches to caching, several ways of building search indexes, and so on. When building an application, we still need to figure out which tools and which approaches are the most appropriate for the task at hand. And it can be hard to combine tools when you need to do something that a single tool cannot do alone.


This book is a journey through both the principles and the practicalities of data systems, and how you can use them to build data-intensive applications. We will explore what different tools have in common, what distinguishes them, and how they achieve their characteristics.


In this chapter, we will start by exploring the fundamentals of what we are trying to achieve: reliable, scalable, and maintainable data systems. We’ll clarify what those things mean, outline some ways of thinking about them, and go over the basics that we will need for later chapters. In the following chapters we will continue layer by layer, looking at different design decisions that need to be considered when working on a data-intensive application.


Thinking About Data Systems


We typically think of databases, queues, caches, etc. as being very different categories of tools. Although a database and a message queue have some superficial similarity (both store data for some time), they have very different access patterns, which means different performance characteristics, and thus very different implementations.


So why should we lump them all together under an umbrella term like data systems?


Many new tools for data storage and processing have emerged in recent years. They are optimized for a variety of different use cases, and they no longer neatly fit into traditional categories [1]. For example, there are datastores that are also used as message queues (Redis), and there are message queues with database-like durability guarantees (Apache Kafka). The boundaries between the categories are becoming blurred.


Secondly, increasingly many applications now have such demanding or wide-ranging requirements that a single tool can no longer meet all of its data processing and storage needs. Instead, the work is broken down into tasks that can be performed efficiently on a single tool, and those different tools are stitched together using application code.


For example, if you have an application-managed caching layer (using Memcached or similar), or a full-text search server (such as Elasticsearch or Solr) separate from your main database, it is normally the application code’s responsibility to keep those caches and indexes in sync with the main database. Figure 1-1 gives a glimpse of what this may look like (we will go into detail in later chapters).


Figure 1-1. One possible architecture for a data system that combines several components.


When you combine several tools in order to provide a service, the service’s interface or application programming interface (API) usually hides those implementation details from clients. Now you have essentially created a new, special-purpose data system from smaller, general-purpose components. Your composite data system may provide certain guarantees: e.g., that the cache will be correctly invalidated or updated on writes so that outside clients see consistent results. You are now not only an application developer, but also a data system designer.


If you are designing a data system or service, a lot of tricky questions arise. How do you ensure that the data remains correct and complete, even when things go wrong internally? How do you provide consistently good performance to clients, even when parts of your system are degraded? How do you scale to handle an increase in load? What does a good API for the service look like?


There are many factors that may influence the design of a data system, including the skills and experience of the people involved, legacy system dependencies, the timescale for delivery, your organization’s tolerance of different kinds of risk, regulatory constraints, etc. Those factors depend very much on the situation.


In this book, we focus on three concerns that are important in most software systems:


Reliability

The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error). See “Reliability.”

Scalability

As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth. See “Scalability.”

Maintainability

Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively. See “Maintainability.”


These words are often cast around without a clear understanding of what they mean. In the interest of thoughtful engineering, we will spend the rest of this chapter exploring ways of thinking about reliability, scalability, and maintainability. Then, in the following chapters, we will look at various techniques, architectures, and algorithms that are used in order to achieve those goals.



Reliability


Everybody has an intuitive idea of what it means for something to be reliable or unreliable. For software, typical expectations include:


•    The application performs the function that the user expected.

•    It can tolerate the user making mistakes or using the software in unexpected ways.

•    Its performance is good enough for the required use case, under the expected load and data volume.

•    The system prevents any unauthorized access and abuse.

If all those things together mean “working correctly,” then we can understand reliability as meaning, roughly, “continuing to work correctly, even when things go wrong.”

The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient. The former term is slightly misleading: it suggests that we could make a system tolerant of every possible kind of fault, which in reality is not feasible. If the entire planet Earth (and all servers on it) were swallowed by a black hole, tolerance of that fault would require web hosting in space; good luck getting that budget item approved. So it only makes sense to talk about tolerating certain types of faults.

Note that a fault is not the same as a failure [2]. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures. In this book we cover several techniques for building reliable systems from unreliable parts.


Counterintuitively, in such fault-tolerant systems, it can make sense to increase the rate of faults by triggering them deliberately, for example by randomly killing individual processes without warning. Many critical bugs are actually due to poor error handling [3]; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally. The Netflix Chaos Monkey [4] is an example of this approach.
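To make the idea of deliberately induced faults more concrete, here is a minimal in-process sketch. It is not how Chaos Monkey itself works; the `Replica` class, `call_with_failover`, and `chaos_round` are hypothetical names invented for this illustration, and real fault injection would kill actual processes or machines rather than flip a flag.

```python
import random


class Replica:
    """Hypothetical stand-in for one process that serves requests."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return (self.name, request)


def call_with_failover(replicas, request):
    """Try each replica in turn. One dead replica is a fault;
    only running out of replicas is a failure."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("all replicas are down")


def chaos_round(replicas, rng):
    """Kill one randomly chosen replica without warning and restart the rest."""
    victim = rng.choice(replicas)
    for replica in replicas:
        replica.alive = replica is not victim
    return victim


rng = random.Random(42)  # fixed seed so the experiment is repeatable
replicas = [Replica(f"replica-{i}") for i in range(3)]
for round_number in range(100):
    victim = chaos_round(replicas, rng)
    served_by, _ = call_with_failover(replicas, f"request-{round_number}")
    assert served_by != victim.name  # the fault never became a failure
```

Because the failover path runs on every round, a bug in the error handling would show up quickly instead of lying dormant until a real outage.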


Although we generally prefer tolerating faults over preventing faults, there are cases where prevention is better than cure (e.g., because no cure exists). This is the case with security matters, for example: if an attacker has compromised a system and gained access to sensitive data, that event cannot be undone. However, this book mostly deals with the kinds of faults that can be cured, as described in the following sections.


Hardware Faults


When we think of causes of system failure, hardware faults quickly come to mind. Hard disks crash, RAM becomes faulty, the power grid has a blackout, someone unplugs the wrong network cable. Anyone who has worked with large datacenters can tell you that these things happen all the time when you have a lot of machines.


Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years [5, 6]. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
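The "one disk per day" figure follows from simple arithmetic, sketched below. The function name is made up for this example, and the calculation assumes disks fail independently at a constant rate of 1/MTTF, which is a simplification.

```python
def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Expected disk failures per day, assuming each disk fails
    independently at a constant rate of one failure per MTTF."""
    mttf_days = mttf_years * 365
    return num_disks / mttf_days


for mttf_years in (10, 30, 50):
    rate = expected_failures_per_day(10_000, mttf_years)
    print(f"MTTF of {mttf_years} years: about {rate:.2f} failures per day")
```

Across the quoted MTTF range the result lies between roughly half a failure and a little under three failures per day, so "about one per day" is a reasonable order-of-magnitude summary.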


Our first response is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system. Disks may be set up in a RAID configuration, servers may have dual power supplies and hot-swappable CPUs, and datacenters may have batteries and diesel generators for backup power. When one component dies, the redundant component can take its place while the broken component is replaced. This approach cannot completely prevent hardware problems from causing failures, but it is well understood and can often keep a machine running uninterrupted for years.


Until recently, redundancy of hardware components was sufficient for most applications, since it makes total failure of a single machine fairly rare. As long as you can restore a backup onto a new machine fairly quickly, the downtime in case of failure is not catastrophic in most applications. Thus, multi-machine redundancy was only required by a small number of applications for which high availability was absolutely essential.


However, as data volumes and applications’ computing demands have increased, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults. Moreover, in some cloud platforms such as Amazon Web Services (AWS) it is fairly common for virtual machine instances to become unavailable without warning [7], as the platforms are designed to prioritize flexibility and elasticity over single-machine reliability.


Hence there is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy. Such systems also have operational advantages: a single-server system requires planned downtime if you need to reboot the machine (to apply operating system security patches, for example), whereas a system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system (a rolling upgrade; see Chapter 4).


Software Errors


We usually think of hardware faults as being random and independent from each other: one machine’s disk failing does not imply that another machine’s disk is going to fail. There may be weak correlations (for example due to a common cause, such as the temperature in the server rack), but otherwise it is unlikely that a large number of hardware components will fail at the same time.


Another class of fault is a systematic error within the system [8]. Such faults are harder to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults [5]. Examples include:


•    A software bug that causes every instance of an application server to crash when given a particular bad input. For example, consider the leap second on June 30, 2012, that caused many applications to hang simultaneously due to a bug in the Linux kernel [9].

•    A runaway process that uses up some shared resource: CPU time, memory, disk space, or network bandwidth.

•    A service that the system depends on that slows down, becomes unresponsive, or starts returning corrupted responses.

•    Cascading failures, where a small fault in one component triggers a fault in another component, which in turn triggers further faults [10].

The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. In those circumstances, it is revealed that the software is making some kind of assumption about its environment, and while that assumption is usually true, it eventually stops being true for some reason [11].


There is no quick solution to the problem of systematic faults in software. Lots of small things can help: carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; measuring, monitoring, and analyzing system behavior in production. If a system is expected to provide some guarantee (for example, in a message queue, that the number of incoming messages equals the number of outgoing messages), it can constantly check itself while it is running and raise an alert if a discrepancy is found [12].
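The self-checking idea for a message queue might look roughly like the sketch below. `CheckedQueue` and its `alert` callback are hypothetical names for this illustration. Since messages can still be waiting in the queue, the sketch checks the slightly refined invariant "incoming equals outgoing plus currently queued", which is an assumption added here rather than something the text specifies.

```python
from collections import deque


class CheckedQueue:
    """In-memory FIFO queue that audits its own bookkeeping (illustrative only)."""

    def __init__(self, alert):
        self._items = deque()
        self._incoming = 0
        self._outgoing = 0
        self._alert = alert  # callback that receives a description of the discrepancy

    def put(self, message):
        self._items.append(message)
        self._incoming += 1
        self.check()

    def get(self):
        message = self._items.popleft()
        self._outgoing += 1
        self.check()
        return message

    def check(self):
        """Return True if incoming == outgoing + queued, otherwise raise an alert."""
        queued = len(self._items)
        if self._incoming != self._outgoing + queued:
            self._alert(
                f"discrepancy: {self._incoming} in, "
                f"{self._outgoing} out, {queued} queued"
            )
            return False
        return True


alerts = []
queue = CheckedQueue(alert=alerts.append)
queue.put("a")
queue.put("b")
queue.get()
queue._items.clear()  # simulate a bug that silently drops a message
queue.check()         # the audit notices and records an alert
print(alerts)
```

In a real system the check would more likely run periodically and feed a monitoring system rather than append to a list.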


Human Errors

Humans design and build software systems, and the operators who keep the systems running are also human. Even when they have the best intentions, humans are known to be unreliable. For example, one study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages [13].

How do we make our systems reliable, in spite of unreliable humans? The best systems combine several approaches:


•    Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.” However, if the interfaces are too restrictive people will work around them, negating their benefit, so this is a tricky balance to get right.

•    Decouple the places where people make the most mistakes from the places where they can cause failures. In particular, provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.

•    Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests [3]. Automated testing is widely used, well understood, and especially valuable for covering corner cases that rarely arise in normal operation.

•    Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure. For example, make it fast to roll back configuration changes, roll out new code gradually (so that any unexpected bugs affect only a small subset of users), and provide tools to recompute data (in case it turns out that the old computation was incorrect).

•    Set up detailed and clear monitoring, such as performance metrics and error rates. In other engineering disciplines this is referred to as telemetry. (Once a rocket has left the ground, telemetry is essential for tracking what is happening, and for understanding failures [14].) Monitoring can show us early warning signals and allow us to check whether any assumptions or constraints are being violated. When a problem occurs, metrics can be invaluable in diagnosing the issue.

•    Implement good management practices and training, a complex and important aspect that is beyond the scope of this book.
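One of the approaches above, rolling out new code gradually so that a bug affects only a small subset of users, can be sketched with a deterministic percentage gate. This is one common way to do it, not a technique the text prescribes; `in_rollout` and the feature name `new-timeline` are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Return True if this user falls into the first `percent` of 100 buckets.

    Hashing the feature name together with the user ID gives every user a
    stable bucket per feature, so raising `percent` only ever adds users.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


users = [f"user-{i}" for i in range(1000)]
early = [u for u in users if in_rollout(u, "new-timeline", 5)]
later = [u for u in users if in_rollout(u, "new-timeline", 50)]
print(len(early), len(later))  # roughly 5% and 50% of the users
```

Setting `percent` back to 0 acts as a fast rollback, which ties in with the point about making configuration changes quick to undo.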

How Important Is Reliability?


Reliability is not just for nuclear power stations and air traffic control software; more mundane applications are also expected to work reliably. Bugs in business applications cause lost productivity (and legal risks if figures are reported incorrectly), and outages of ecommerce sites can have huge costs in terms of lost revenue and damage to reputation.

Even in “noncritical” applications we have a responsibility to our users. Consider a parent who stores all their pictures and videos of their children in your photo application [15]. How would they feel if that database was suddenly corrupted? Would they know how to restore it from a backup?

There are situations in which we may choose to sacrifice reliability in order to reduce development cost (e.g., when developing a prototype product for an unproven market) or operational cost (e.g., for a service with a very narrow profit margin), but we should be very conscious of when we are cutting corners.




Scalability


Even if a system is working reliably today, that doesn’t mean it will necessarily work reliably in the future. One common reason for degradation is increased load: perhaps the system has grown from 10,000 concurrent users to 100,000 concurrent users, or from 1 million to 10 million. Perhaps it is processing much larger volumes of data than it did before.


Scalability is the term we use to describe a system’s ability to cope with increased load. Note, however, that it is not a one-dimensional label that we can attach to a system: it is meaningless to say “X is scalable” or “Y doesn’t scale.” Rather, discussing scalability means considering questions like “If the system grows in a particular way, what are our options for coping with the growth?” and “How can we add computing resources to handle the additional load?”


Describing Load


First, we need to succinctly describe the current load on the system; only then can we discuss growth questions (what happens if our load doubles?). Load can be described with a few numbers which we call load parameters. The best choice of parameters depends on the architecture of your system: it may be requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else. Perhaps the average case is what matters for you, or perhaps your bottleneck is dominated by a small number of extreme cases.


To make this idea more concrete, let’s consider Twitter as an example, using data published in November 2012 [16]. Two of Twitter’s main operations are:


Post tweet


A user can publish a new message to their followers (4.6k requests/sec on average, over 12k requests/sec at peak).


Home timeline


A user can view tweets posted by the people they follow (300k requests/sec).


Simply handling 12,000 writes per second (the peak rate for posting tweets) would be fairly easy. However, Twitter’s scaling challenge is not primarily due to tweet volume, but due to fan-out (see footnote ii): each user follows many people, and each user is followed by many people. There are broadly two ways of implementing these two operations:


1.    Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users, and merge them (sorted by time). In a relational database like in Figure 1-2, you could write a query such as:


    SELECT tweets.*, users.* FROM tweets
      JOIN users   ON tweets.sender_id    = users.id
      JOIN follows ON follows.followee_id = users.id
      WHERE follows.follower_id = current_user

ii. A term borrowed from electronic engineering, where it describes the number of logic gate inputs that are attached to another gate’s output. The output needs to supply enough current to drive all the attached inputs. In transaction processing systems, we use it to describe the number of requests to other services that we need to make in order to serve one incoming request.



2.    Maintain a cache for each user’s home timeline, like a mailbox of tweets for each recipient user (see Figure 1-3). When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. The request to read the home timeline is then cheap, because its result has been computed ahead of time.


The first version of Twitter used approach 1, but the systems struggled to keep up with the load of home timeline queries, so the company switched to approach 2. This works better because the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads, and so in this case it’s preferable to do more work at write time and less at read time.
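The difference between the two approaches can be seen in a small in-memory sketch. This is only an illustration of where the work happens, not Twitter’s implementation; the class names are invented, tweets are plain `(timestamp, sender, text)` tuples, and the write-time variant does not backfill older tweets when someone starts following a user.

```python
from collections import defaultdict


class FanOutOnRead:
    """Approach 1: posting is one cheap insert; the timeline is assembled on read."""

    def __init__(self):
        self.tweets = []                   # global collection of tweets
        self.following = defaultdict(set)  # follower -> users they follow

    def follow(self, follower, followee):
        self.following[follower].add(followee)

    def post_tweet(self, timestamp, sender, text):
        self.tweets.append((timestamp, sender, text))

    def home_timeline(self, user):
        followees = self.following[user]
        matching = [tweet for tweet in self.tweets if tweet[1] in followees]
        return sorted(matching, reverse=True)  # newest first


class FanOutOnWrite:
    """Approach 2: posting writes to every follower's cache; reading is cheap."""

    def __init__(self):
        self.followers = defaultdict(set)   # followee -> their followers
        self.timelines = defaultdict(list)  # user -> cached home timeline

    def follow(self, follower, followee):
        self.followers[followee].add(follower)

    def post_tweet(self, timestamp, sender, text):
        for follower in self.followers[sender]:
            self.timelines[follower].append((timestamp, sender, text))

    def home_timeline(self, user):
        return sorted(self.timelines[user], reverse=True)  # newest first


for system in (FanOutOnRead(), FanOutOnWrite()):
    system.follow("alice", "bob")
    system.follow("alice", "carol")
    system.post_tweet(1, "bob", "hello")
    system.post_tweet(2, "carol", "hi there")
    print(type(system).__name__, system.home_timeline("alice"))
```

Both classes return the same timeline for the same inputs; they differ only in whether the cost is paid in `post_tweet` or in `home_timeline`.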



Figure 1-2. Simple relational schema for implementing a Twitter home timeline.



Figure 1-3. Twitter’s data pipeline for delivering tweets to followers, with load parameters as of November 2012 [16].


However, the downside of approach 2 is that posting a tweet now requires a lot of extra work. On average, a tweet is delivered to about 75 followers, so 4.6k tweets per second become 345k writes per second to the home timeline caches. But this average hides the fact that the number of followers per user varies wildly, and some users have over 30 million followers. This means that a single tweet may result in over 30 million writes to home timelines! Doing this in a timely manner (Twitter tries to deliver tweets to followers within five seconds) is a significant challenge.
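The write amplification described above is straightforward to reproduce. The sketch below uses only the numbers quoted in the text; the function and variable names are invented for the example.

```python
def timeline_writes_per_second(tweets_per_second: int, followers_per_tweet: int) -> int:
    """Home timeline cache writes caused by fan-out on write."""
    return tweets_per_second * followers_per_tweet


# Average case: 4.6k tweets/sec, each delivered to about 75 followers.
average_writes = timeline_writes_per_second(4_600, 75)
print(f"{average_writes:,} timeline writes per second on average")

# Extreme case: one tweet from a user with 30 million followers,
# to be delivered within the five-second target.
celebrity_followers = 30_000_000
delivery_target_seconds = 5
burst_rate = celebrity_followers // delivery_target_seconds
print(f"{burst_rate:,} writes per second for a single celebrity tweet")
```

A single such tweet therefore needs a sustained write rate many times higher than the whole system’s average for those five seconds, which is why the distribution of followers matters more than the mean.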


In the example of Twitter, the distribution of followers per user (maybe weighted by how often those users tweet) is a key load parameter for discussing scalability, since it determines the fan-out load. Your application may have very different characteristics, but you can apply similar principles to reasoning about its load.


The final twist of the Twitter anecdote: now that approach 2 is robustly implemented, Twitter is moving to a hybrid of both approaches. Most users’ tweets continue to be fanned out to home timelines at the time when they are posted, but a small number of users with a very large number of followers (i.e., celebrities) are excepted from this fan-out. Tweets from any celebrities that a user may follow are fetched separately and merged with that user’s home timeline when it is read, like in approach 1. This hybrid approach is able to deliver consistently good performance. We will revisit this example in Chapter 12 after we have covered some more technical ground.
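A hybrid of the two approaches could be sketched as follows. The text does not say how Twitter decides who counts as a celebrity, so the follower-count cutoff `celebrity_threshold` is purely an assumption for illustration, as are the class and method names.

```python
from collections import defaultdict


class HybridTimelines:
    """Fan out on write for most users; merge celebrity tweets on read."""

    def __init__(self, celebrity_threshold):
        self.celebrity_threshold = celebrity_threshold  # assumed cutoff, not from the text
        self.followers = defaultdict(set)         # followee -> followers
        self.following = defaultdict(set)         # follower -> followees
        self.timelines = defaultdict(list)        # user -> cached home timeline
        self.tweets_by_sender = defaultdict(list)

    def is_celebrity(self, user):
        return len(self.followers[user]) >= self.celebrity_threshold

    def follow(self, follower, followee):
        self.followers[followee].add(follower)
        self.following[follower].add(followee)

    def post_tweet(self, timestamp, sender, text):
        tweet = (timestamp, sender, text)
        self.tweets_by_sender[sender].append(tweet)
        if not self.is_celebrity(sender):
            # Ordinary user: write the tweet into every follower's cache.
            for follower in self.followers[sender]:
                self.timelines[follower].append(tweet)

    def home_timeline(self, user):
        # A set removes duplicates if a user was fanned out before
        # crossing the celebrity threshold.
        merged = set(self.timelines[user])
        for followee in self.following[user]:
            if self.is_celebrity(followee):
                merged.update(self.tweets_by_sender[followee])
        return sorted(merged, reverse=True)  # newest first


system = HybridTimelines(celebrity_threshold=2)
system.follow("alice", "bob")
system.follow("alice", "star")
system.follow("dave", "star")
system.post_tweet(1, "bob", "hello")       # fanned out at write time
system.post_tweet(2, "star", "big news")   # stored once, merged at read time
print(system.home_timeline("alice"))
```

The celebrity’s tweet costs one write regardless of follower count, at the price of a little extra work on each read by their followers.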


Describing Performance


Once you have described the load on your system, you can investigate what happens when the load increases. You can look at it in two ways:


•    When you increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of your system affected?

•    When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?

Both questions require performance numbers, so let’s look briefly at describing the performance of a system.

In a batch processing system such as Hadoop, we usually care about throughput: the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size (see footnote iii). In online systems, what’s usually more important is the service’s response time, that is, the time between a client sending a request and receiving a response.


iii. In an ideal world, the running time of a batch job is the size of the dataset divided by the throughput. In practice, the running time is often longer, due to skew (data not being spread evenly across worker processes) and needing to wait for the slowest task to complete.


Latency and response time


Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays. Latency is the duration that a request is waiting to be handled, during which it is latent, awaiting service [17].


Even if you only make the same request over and over again, you’ll get a slightly different response time on every try. In practice, in a system handling a variety of requests, the response time can vary a lot. We therefore need to think of response time not as a single number, but as a distribution of values that you can measure.


In Figure 1-4, each gray bar represents a request to a service, and its height shows how long that request took. Most requests are reasonably fast, but there are occasional outliers that take much longer. Perhaps the slow requests are intrinsically more expensive, e.g., because they process more data. But even in a scenario where you’d think all requests should take the same time, you get variation: random additional latency could be introduced by a context switch to a background process, the loss of a network packet and TCP retransmission, a garbage collection pause, a page fault forcing a read from disk, mechanical vibrations in the server rack [18], or many other causes.



Figure 1-4. Illustrating mean and percentiles: response times for a sample of 100 requests to a service.


It’s common to see the average response time of a service reported. (Strictly speaking, the term “average” doesn’t refer to any particular formula, but in practice it is usually understood as the arithmetic mean: given n values, add up all the values, and divide by n.) However, the mean is not a very good metric if you want to know your “typical” response time, because it doesn’t tell you how many users actually experienced that delay.


Usually it is better to use percentiles. If you take your list of response times and sort it from fastest to slowest, then the median is the halfway point: for example, if your median response time is 200 ms, that means half your requests return in less than 200 ms, and half your requests take longer than that.


This makes the median a good metric if you want to know how long users typically have to wait: half of user requests are served in less than the median response time, and the other half take longer than the median. The median is also known as the 50th percentile, and sometimes abbreviated as p50. Note that the median refers to a single request; if the user makes several requests (over the course of a session, or because several resources are included in a single page), the probability that at least one of them is slower than the median is much greater than 50%.
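The last point can be quantified with a short calculation. The sketch assumes that the response times of the individual requests are independent of each other, which real traffic only approximates; the function name is made up for this example.

```python
def prob_at_least_one_slower(num_requests: int, percentile: float = 50.0) -> float:
    """Probability that at least one of `num_requests` independent requests
    is slower than the given percentile of the response time distribution."""
    prob_single_request_fast = percentile / 100
    return 1 - prob_single_request_fast ** num_requests


for n in (1, 2, 5, 10):
    print(f"{n:>2} requests: {prob_at_least_one_slower(n):.1%} chance of exceeding the median")
```

With five requests per page the chance is already close to 97%, so the experience of a typical page load is governed by the slower requests rather than by the median.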


In order to figure out how bad your outliers are, you can look at higher percentiles: the 95th, 99th, and 99.9th percentiles are common (abbreviated p95, p99, and p999). They are the response time thresholds at which 95%, 99%, or 99.9% of requests are faster than that particular threshold. For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more. This is illustrated in Figure 1-4.
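Following the description above (sort the response times, then pick the value at the relevant position), percentiles can be computed in a few lines. This sketch uses the nearest-rank definition, which is one of several common conventions and treats the threshold as "at most" rather than strictly "less than"; the sample data is made up.

```python
import math
from statistics import mean


def percentile(response_times_ms, p):
    """Nearest-rank percentile: the smallest measured value such that
    at least p percent of the measurements are less than or equal to it."""
    if not response_times_ms:
        raise ValueError("need at least one measurement")
    ordered = sorted(response_times_ms)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up sample of 100 requests: 94 fast, 5 slow, 1 very slow outlier.
sample_ms = [100] * 94 + [1500] * 5 + [5000]

for label, p in (("p50", 50), ("p95", 95), ("p99", 99), ("p999", 99.9)):
    print(f"{label}: {percentile(sample_ms, p)} ms")
print(f"mean: {mean(sample_ms)} ms")
```

In this sample the mean is more than double the median even though 94% of requests took exactly the median time, which illustrates why the mean says little about the typical experience.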


High percentiles of response times, also known as tail latencies, are important because they directly affect users’ experience of the service. For example, Amazon describes response time requirements for internal services in terms of the 99.9th percentile, even though it only affects 1 in 1,000 requests. This is because the customers with the slowest requests are often those who have the most data on their accounts because they have made many purchases; that is, they’re the most valuable customers [19]. It’s important to keep those customers happy by ensuring the website is fast for them: Amazon has also observed that a 100 ms increase in response time reduces sales by 1% [20], and others report that a 1-second slowdown reduces a customer satisfaction metric by 16% [21, 22].


On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed too expensive and to not yield enough benefit for Amazon’s purposes. Reducing response times at very high percentiles is difficult because they are easily affected by random events outside of your control, and the benefits are diminishing.


For example, percentiles are often used in service level objectives (SLOs) and service level agreements (SLAs), contracts that define the expected performance and availability of a service. An SLA may state that the service is considered to be up if it has a median response time of less than 200 ms and a 99th percentile under 1 s (if the response time is longer, it might as well be down), and the service may be required to be up at least 99.9% of the time. These metrics set expectations for clients of the service and allow customers to demand a refund if the SLA is not met.
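
The example SLA above could be checked along the following lines. This is only a sketch under stated assumptions: the thresholds come from the example in the text, but the idea of evaluating the SLA over fixed measurement windows, the nearest-rank percentile convention, and all function names are assumptions made for illustration.

```python
import math
from statistics import median

MEDIAN_LIMIT_MS = 200    # median must be below 200 ms (from the example SLA)
P99_LIMIT_MS = 1000      # 99th percentile must be below 1 s (from the example SLA)
REQUIRED_UPTIME = 0.999  # service must be up at least 99.9% of the time


def p99(times_ms):
    """Nearest-rank 99th percentile (an assumed convention)."""
    ordered = sorted(times_ms)
    return ordered[math.ceil(99 * len(ordered) / 100) - 1]


def window_is_up(times_ms):
    """A measurement window counts as 'up' only if both latency limits hold."""
    return median(times_ms) < MEDIAN_LIMIT_MS and p99(times_ms) < P99_LIMIT_MS


def sla_met(windows):
    """windows: one list of response times per measurement window
    (a hypothetical layout chosen for this sketch)."""
    up = sum(1 for w in windows if window_is_up(w))
    return up / len(windows) >= REQUIRED_UPTIME


fast_window = [120] * 99 + [800]        # median and p99 are both 120 ms
slow_window = [120] * 90 + [3000] * 10  # median is fine, but p99 is 3000 ms

print(window_is_up(fast_window), window_is_up(slow_window))
print(sla_met([fast_window] * 999 + [slow_window]))  # 999 of 1,000 windows up
print(sla_met([fast_window] * 99 + [slow_window]))   # only 99 of 100 windows up
```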


Queueing delays often account for a large part of the response time at high percentiles. As a server can only process a small number of things in parallel (limited, for example, by its number of CPU cores), it only takes a small number of slow requests to hold up the processing of subsequent requests, an effect sometimes known as head-of-line blocking. Even if those subsequent requests are fast to process on the server, the client will see a slow overall response time due to the time waiting for the prior request to complete. Due to this effect, it is important to measure response times on the client side.
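
Head-of-line blocking can be seen in a small simulation. The sketch below assumes the simplest possible server, a single worker that handles requests in arrival order; the workload numbers and the function name `simulate_fifo` are invented for illustration.

```python
def simulate_fifo(arrivals_ms, service_ms):
    """Single-worker FIFO queue.

    Returns the response time of each request measured from arrival to
    completion, which is what a client would observe.
    """
    free_at = 0  # time at which the worker finishes its current request
    response_times = []
    for arrival, service in zip(arrivals_ms, service_ms):
        start = max(arrival, free_at)  # wait if the worker is still busy
        free_at = start + service
        response_times.append(free_at - arrival)
    return response_times


# Hypothetical workload: one request every 10 ms, each needing 5 ms of work,
# except the third request, which needs 50 ms.
arrivals = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
service = [5, 5, 50, 5, 5, 5, 5, 5, 5, 5]

client_side = simulate_fifo(arrivals, service)
for server_time, client_time in zip(service, client_side):
    print(f"server processing: {server_time:3d} ms   client observed: {client_time:3d} ms")
```

Only one request is slow to process, yet every request that arrives behind it is observed as slow by the client, because the observed time includes the wait in the queue.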


When generating load artificially in order to test the scalability of a system, the load-generating client needs to keep sending requests independently of the response time. If the client waits for the previous request to complete before sending the next one, that behavior has the effect of artificially keeping the queues shorter in the test than they would be in reality, which skews the measurements [23].
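
The difference between the two kinds of load generator can be sketched with a toy model. It assumes a single-worker server that processes requests in order; the function names `open_loop` and `closed_loop` and the workload are hypothetical, chosen only to show the skew.

```python
def open_loop(service_ms, interval_ms):
    """Requests are sent on a fixed schedule, regardless of earlier responses."""
    free_at, latencies = 0, []
    for i, service in enumerate(service_ms):
        sent = i * interval_ms
        free_at = max(sent, free_at) + service  # queue behind earlier requests
        latencies.append(free_at - sent)
    return latencies


def closed_loop(service_ms, interval_ms):
    """The next request is sent only after the previous response has arrived,
    and no earlier than one interval after the previous send."""
    next_send, free_at, latencies = 0, 0, []
    for service in service_ms:
        sent = max(next_send, free_at)  # wait for the previous response
        free_at = sent + service
        latencies.append(free_at - sent)
        next_send = sent + interval_ms
    return latencies


# Hypothetical workload: 5 ms of work per request, one request needing 50 ms.
service = [5, 5, 50, 5, 5, 5, 5, 5, 5, 5]

open_latencies = open_loop(service, interval_ms=10)
closed_latencies = closed_loop(service, interval_ms=10)
print("open loop:  ", open_latencies)
print("closed loop:", closed_latencies)
```

In this model the closed-loop client never sees a queue build up, because it stops sending while the slow request is in progress; the open-loop client keeps sending and records the queueing delay that real, independent users would experience.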


Percentiles in Practice


High percentiles become especially important in backend services that are called multiple times as part of serving a single end-user request. Even if you make the calls in parallel, the end-user request still needs to wait for the slowest of the parallel calls to complete. It takes just one slow call to make the entire end-user request slow, as illustrated in Figure 1-5. Even if only a small percentage of backend calls are slow, the chance of getting a slow call increases if an end-user request requires multiple backend calls, and so a higher proportion of end-user requests end up being slow (an effect known as tail latency amplification [24]).
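
The size of this effect can be estimated with a short calculation. The sketch assumes that backend calls are slow independently of one another and that the end-user request must wait for all of them; the function name and the 1% figure are chosen purely for illustration.

```python
def chance_of_slow_request(p_slow_call, n_calls):
    """Probability that at least one of n independent backend calls is slow."""
    return 1 - (1 - p_slow_call) ** n_calls


# Suppose 1% of backend calls are slow (i.e., slower than the p99 threshold).
for n in (1, 10, 100):
    print(f"{n:3d} backend calls: {chance_of_slow_request(0.01, n):.1%} of end-user requests are slow")
```

Under these assumptions, a request that fans out to 10 backend calls is slow roughly 10% of the time, and one that fans out to 100 calls is slow roughly 63% of the time, even though each individual call is slow only 1% of the time.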

If you want to add response time percentiles to the monitoring dashboards for your services, you need to efficiently calculate them on an ongoing basis. For example, you may want to keep a rolling window of response times of requests in the last 10 minutes. Every minute, you calculate the median and various percentiles over the values in that window and plot those metrics on a graph.
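
A straightforward way to sketch such a rolling window is to store every sample and sort on demand. The class below is an illustration only: the name `RollingPercentiles`, its methods, and the nearest-rank percentile convention are assumptions, and it expects samples to be recorded in timestamp order.

```python
import math
from collections import deque


class RollingPercentiles:
    """Rolling window that stores every sample and sorts when asked."""

    def __init__(self, window_seconds=600):  # 10 minutes, as in the example
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp_seconds, response_time_ms)

    def record(self, timestamp, response_time_ms):
        # Assumes timestamps are recorded in non-decreasing order.
        self.samples.append((timestamp, response_time_ms))

    def snapshot(self, now, percentiles=(50, 95, 99)):
        # Drop samples that have fallen out of the window.
        while self.samples and self.samples[0][0] <= now - self.window_seconds:
            self.samples.popleft()
        ordered = sorted(rt for _, rt in self.samples)
        if not ordered:
            return {}
        # Nearest-rank percentile for each requested p.
        return {
            f"p{p}": ordered[math.ceil(p * len(ordered) / 100) - 1]
            for p in percentiles
        }


# Hypothetical usage: one request per second for 20 minutes, then a snapshot
# of the most recent 10 minutes (a dashboard would take one every minute).
window = RollingPercentiles()
for t in range(1200):
    window.record(t, 100 + t % 50)
print(window.snapshot(now=1200))
```

Memory use grows with the request rate and every snapshot sorts the whole window, so this approach is only practical at modest request volumes.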


The naïve implementation is to keep a list of response times for all requests within the time window and to sort that list every minute. If that is too inefficient for you, there are algorithms that can calculate a good approximation of percentiles at minimal CPU and memory cost, such as forward decay [25], t-digest [26], or HdrHistogram [27]. Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from several machines, is mathematically meaningless; the right way of aggregating response time data is to add the histograms [28].
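
The following sketch shows why averaging percentiles goes wrong and what adding histograms looks like. It uses exact histograms (one bucket per distinct response time) built on Python's `collections.Counter`; this is a simplification for illustration and not how forward decay, t-digest, or HdrHistogram represent data. The machine data and function name are invented.

```python
import math
from collections import Counter


def percentile_from_histogram(hist, p):
    """Nearest-rank percentile from a histogram that maps a response time (ms)
    to the number of requests observed. Assumes hist is non-empty."""
    total = sum(hist.values())
    rank = math.ceil(p * total / 100)
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if seen >= rank:
            return value


# Hypothetical per-machine histograms: machine A is fast and busy,
# machine B is slow but handles very little traffic.
machine_a = Counter({10: 990, 100: 10})
machine_b = Counter({10: 5, 500: 5})

p99_a = percentile_from_histogram(machine_a, 99)
p99_b = percentile_from_histogram(machine_b, 99)
averaged = (p99_a + p99_b) / 2  # the meaningless aggregate

merged = machine_a + machine_b  # Counter addition sums the bucket counts
p99_merged = percentile_from_histogram(merged, 99)

print("per-machine p99:", p99_a, p99_b)
print("average of p99s:", averaged)
print("p99 of merged histogram:", p99_merged)
```

The average of the two p99 values gives machine B's handful of requests the same weight as machine A's thousand, whereas the merged histogram weights every request equally.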


Figure 1-5. When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request.


Approaches for Coping with Load


Now that we have discussed the parameters for describing load and metrics for measuring performance, we can start discussing scalability in earnest: how do we maintain good performance even when our load parameters increase by some amount?


An architecture that is appropriate for one level of load is unlikely to cope with 10 times that load. If you are working on a fast-growing service, it is therefore likely that you will need to rethink your architecture on every order of magnitude load increase, or perhaps even more often than that.


People often talk of a dichotomy between scaling up (vertical scaling, moving to a more powerful machine) and scaling out (horizontal scaling, distributing the load across multiple smaller machines). Distributing load across multiple machines is also known as a shared-nothing architecture. A system that can run on a single machine is often simpler, but high-end machines can become very expensive, so very intensive workloads often can’t avoid scaling out. In reality, good architectures usually involve a pragmatic mixture of approaches: for example, using several fairly powerful machines can still be simpler and cheaper than a large number of small virtual machines.


Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase, whereas other systems are scaled manually (a human analyzes the capacity and decides to add more machines to the system). An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises (see “Rebalancing Partitions” on page 209).

While distributing stateless services across multiple machines is fairly straightforward, taking stateful data systems from a single node to a distributed setup can introduce a lot of additional complexity. For this reason, common wisdom until recently was to keep your database on a single node (scale up) until scaling cost or high-availability requirements forced you to make it distributed.


As the tools and abstractions for distributed systems get better, this common wisdom may change, at least for some kinds of applications. It is conceivable that distributed data systems will become the default in the future, even for use cases that don’t handle large volumes of data or traffic. Over the course of the rest of this book we will cover many kinds of distributed data systems, and discuss how they fare not just in terms of scalability, but also ease of use and maintainability.


The architecture of systems that operate at large scale is usually highly specific to the application: there is no such thing as a generic, one-size-fits-all scalable architecture (informally known as magic scaling sauce). The problem may be the volume of reads, the volume of writes, the volume of data to store, the complexity of the data, the response time requirements, the access patterns, or (usually) some mixture of all of these plus many more issues.

For example, a system that is designed to handle 100,000 requests per second, each 1 kB in size, looks very different from a system that is designed for 3 requests per minute, each 2 GB in size, even though the two systems have the same data throughput.
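
The claim that both systems have the same data throughput can be checked with a little arithmetic. The sketch assumes decimal (SI) units, i.e., 1 kB = 1,000 bytes and 1 GB = 1,000,000,000 bytes; the variable names are arbitrary.

```python
KB = 1000       # bytes, assuming decimal (SI) units
GB = 1000 ** 3  # bytes

# System 1: 100,000 requests per second, each 1 kB.
small_requests = 100_000 * 1 * KB  # bytes per second

# System 2: 3 requests per minute, each 2 GB.
large_requests = 3 * 2 * GB / 60   # bytes per second

print(small_requests / 1e6, "MB/s")
print(large_requests / 1e6, "MB/s")
```

Both work out to 100 MB/s, yet the first system has to cope with a very high request rate and the second with very large individual payloads.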


An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare: the load parameters. If those assumptions turn out to be wrong, the engineering effort for scaling is at best wasted, and at worst counterproductive. In an early-stage startup or an unproven product it’s usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load.


Even though they are specific to a particular application, scalable architectures are nevertheless usually built from general-purpose building blocks, arranged in familiar patterns. In this book we discuss those building blocks and patterns.




Maintainability


It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance: fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features.


Yet, unfortunately, many people working on software systems dislike maintenance of so-called legacy systems: perhaps it involves fixing other people’s mistakes, or working with platforms that are now outdated, or systems that were forced to do things they were never intended for. Every legacy system is unpleasant in its own way, and so it is difficult to give general recommendations for dealing with them.


However, we can and should design software in such a way that it will hopefully minimize pain during maintenance, and thus avoid creating legacy software ourselves. To this end, we will pay particular attention to three design principles for software systems:




Operability

Make it easy for operations teams to keep the system running smoothly.


Simplicity

Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system. (Note this is not the same as simplicity of the user interface.)


Evolvability

Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity.


As previously with reliability and scalability, there are no easy solutions for achieving these goals. Rather, we will try to think about systems with operability, simplicity, and evolvability in mind.


Operability: Making Life Easy for Operations


It has been suggested that “good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations” [12]. While some aspects of operations can and should be automated, it is still up to humans to set up that automation in the first place and to make sure it’s working correctly.


Operations teams are vital to keeping a software system running smoothly. A good operations team typically is responsible for the following, and more [29]:


•    Monitoring the health of the system and quickly restoring service if it goes into a bad state

•    Tracking down the cause of problems, such as system failures or degraded performance

•    Keeping software and platforms up to date, including security patches

•    Keeping tabs on how different systems affect each other, so that a problematic change can be avoided before it causes damage

•    Anticipating future problems and solving them before they occur (e.g., capacity planning)

•    Establishing good practices and tools for deployment, configuration management, and more

•    Performing complex maintenance tasks, such as moving an application from one platform to another

•    Maintaining the security of the system as configuration changes are made

•    Defining processes that make operations predictable and help keep the production environment stable

•    Preserving the organization’s knowledge about the system, even as individual people come and go

Good operability means making routine tasks easy, allowing the operations team to focus their efforts on high-value activities. Data systems can do various things to make routine tasks easy, including:


•    Providing visibility into the runtime behavior and internals of the system, with good monitoring

•    Providing good support for automation and integration with standard tools

•    Avoiding dependency on individual machines (allowing machines to be taken down for maintenance while the system as a whole continues running uninterrupted)

•    Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”)

•    Providing good default behavior, but also giving administrators the freedom to override defaults when needed

•    Self-healing where appropriate, but also giving administrators manual control over the system state when needed

•    Exhibiting predictable behavior, minimizing surprises

Simplicity: Managing Complexity


Small software projects can have delightfully simple and expressive code, but as projects get larger, they often become very complex and difficult to understand. This complexity slows down everyone who needs to work on the system, further increasing the cost of maintenance. A software project mired in complexity is sometimes described as a big ball of mud [30].


There are various possible symptoms of complexity: explosion of the state space, tight coupling of modules, tangled dependencies, inconsistent naming and terminology, hacks aimed at solving performance problems, special-casing to work around issues elsewhere, and many more. Much has been said on this topic already [31, 32, 33].


When complexity makes maintenance hard, budgets and schedules are often overrun. In complex software, there is also a greater risk of introducing bugs when making a change: when the system is harder for developers to understand and reason about, hidden assumptions, unintended consequences, and unexpected interactions are more easily overlooked. Conversely, reducing complexity greatly improves the maintainability of software, and thus simplicity should be a key goal for the systems we build.


Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity. Moseley and Marks [32] define complexity as accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.

One of the best tools we have for removing accidental complexity is abstraction. A good abstraction can hide a great deal of implementation detail behind a clean, simple-to-understand façade. A good abstraction can also be used for a wide range of different applications. Not only is this reuse more efficient than reimplementing a similar thing multiple times, but it also leads to higher-quality software, as quality improvements in the abstracted component benefit all applications that use it.


For example, high-level programming languages are abstractions that hide machine code, CPU registers, and syscalls. SQL is an abstraction that hides complex on-disk and in-memory data structures, concurrent requests from other clients, and inconsistencies after crashes. Of course, when programming in a high-level language, we are still using machine code; we are just not using it directly, because the programming language abstraction saves us from having to think about it.


However, finding good abstractions is very hard. In the field of distributed systems, although there are many good algorithms, it is much less clear how we should be packaging them into abstractions that help us keep the complexity of the system at a manageable level.


Throughout this book, we will keep our eyes open for good abstractions that allow us to extract parts of a large system into well-defined, reusable components.


Evolvability: Making Change Easy


It’s extremely unlikely that your system’s requirements will remain unchanged forever. They are much more likely to be in constant flux: you learn new facts, previously unanticipated use cases emerge, business priorities change, users request new features, new platforms replace old platforms, legal or regulatory requirements change, growth of the system forces architectural changes, etc.


In terms of organizational processes, Agile working patterns provide a framework for adapting to change. The Agile community has also developed technical tools and patterns that are helpful when developing software in a frequently changing environment, such as test-driven development (TDD) and refactoring.


Most discussions of these Agile techniques focus on a fairly small, local scale (a couple of source code files within the same application). In this book, we search for ways of increasing agility on the level of a larger data system, perhaps consisting of several different applications or services with different characteristics. For example, how would you “refactor” Twitter’s architecture for assembling home timelines (“Describing Load” on page 11) from approach 1 to approach 2?

The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions: simple and easy-to-understand systems are usually easier to modify than complex ones. But since this is such an important idea, we will use a different word to refer to agility on a data system level: evolvability [34].




Summary


In this chapter, we have explored some fundamental ways of thinking about data-intensive applications. These principles will guide us through the rest of the book, where we dive into deep technical detail.


An application has to meet various requirements in order to be useful. There are functional requirements (what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways), and some nonfunctional requirements (general properties like security, reliability, compliance, scalability, compatibility, and maintainability). In this chapter we discussed reliability, scalability, and maintainability in detail.


Reliability means making systems work correctly, even when faults occur. Faults can be in hardware (typically random and uncorrelated), software (bugs are typically systematic and hard to deal with), and humans (who inevitably make mistakes from time to time). Fault-tolerance techniques can hide certain types of faults from the end user.


Scalability means having strategies for keeping performance good, even when load increases. In order to discuss scalability, we first need ways of describing load and performance quantitatively. We briefly looked at Twitter’s home timelines as an example of describing load, and response time percentiles as a way of measuring performance. In a scalable system, you can add processing capacity in order to remain reliable under high load.

Maintainability has many facets, but in essence it’s about making life better for the engineering and operations teams who need to work with the system. Good abstractions can help reduce complexity and make the system easier to modify and adapt for new use cases. Good operability means having good visibility into the system’s health, and having effective ways of managing it.


There is unfortunately no easy fix for making applications reliable, scalable, or maintainable. However, there are certain patterns and techniques that keep reappearing in different kinds of applications. In the next few chapters we will take a look at some examples of data systems and analyze how they work toward those goals.


Later in the book, in Part III, we will look at patterns for systems that consist of several components working together, such as the one in Figure 1-1.


