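This is an introductory article on designing scalable distributed architectures. I found it very useful after reading it, so I am sharing it here. The original text is attached; original link: http://aosabook.org/en/distsys.html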
Open source software has become a fundamental building block for some of the biggest websites. And as those websites have grown, best practices and guiding principles around their architectures have emerged. This chapter seeks to cover some of the key issues to consider when designing large websites, as well as some of the building blocks used to achieve these goals.
This chapter is largely focused on web systems, although some of the material is applicable to other distributed systems as well.
开源软件已经成为一些大型网站的基础组件,伴随这些网站的成长,围绕它们的架构,会形成了一些好的的实践和指导准则,本文旨在寻求设计大型网站时考虑的关键因素,以及如何利基础用组件达成目标。
本文主要聚焦大型WEB系统,当然其中一些内容也适合其他分布式系统。
1.1. Principles of Web Distributed Systems Design
What exactly does it mean to build and operate a scalable web site or application? At a primitive level it's just connecting users with remote resources via the Internet—the part that makes it scalable is that the resources, or access to those resources, are distributed across multiple servers.
构建和运营一个可扩展的网站或者应用程序到底指的是什么?从原始的需求来看,仅仅只是通过Internet连接用户和远程资源——使得分布在多台服务器上的资源或者访问这些资源变得可扩展。
Like most things in life, taking the time to plan ahead when building a web service can help in the long run; understanding some of the considerations and tradeoffs behind big websites can result in smarter decisions at the creation of smaller web sites. Below are some of the key principles that influence the design of large-scale web systems:
像生活中其他事情一样,构建一个网站服务前花时间计划会很有帮助:了解了大型网站背后的一些考虑和平衡,在构建小一点的网站时会作出英明的决定。下面就是一些大尺度web系统设计的关键原则:
◊ Availability: The uptime of a website is absolutely critical to the reputation and functionality of many companies. For some of the larger online retail sites, being unavailable for even minutes can result in thousands or millions of dollars in lost revenue, so designing their systems to be constantly available and resilient to failure is both a fundamental business and a technology requirement. High availability in distributed systems requires the careful consideration of redundancy for key components, rapid recovery in the event of partial system failures, and graceful degradation when problems occur.
可用性:对很多公司声誉和服务而言,绝对首要任务就是保证网站的线上运行时间,对于一些大型线上零售网站。即使是几分钟的服务不可用都会导致成千上万元的销售损失,对于这些系统的设计来说,稳定可用和灵活应对故障既是业务基础也是技术的要求。分布式系统的高可用,要求仔细的考虑大量的关键组件,面对部分系统故障时的快速恢复机制,以及问题发生时的平滑的服务降级。
◊ Performance: Website performance has become an important consideration for most sites. The speed of a website affects usage and user satisfaction, as well as search engine rankings, a factor that directly correlates to revenue and retention. As a result, creating a system that is optimized for fast responses and low latency is key.
性能:对于大部分网站,性能也是一个重要考虑因素,网站的访问速度直接影响用户满意度,就像搜索引擎排名直接影响网站收入,所以,快速响应和低延迟是系统的优化关键。
◊ Reliability: A system needs to be reliable, such that a request for data will consistently return the same data. In the event the data changes or is updated, then that same request should return the new data. Users need to know that if something is written to the system, or stored, it will persist and can be relied on to be in place for future retrieval.
可靠性:一个系统必须是可靠的,比如同样的请求必须拿到同样的数据,当数据发生变化或者被更新后,同样的请求必须拿到最新的数据,要让用户知道,存放在系统中的数据是持久化、能够在将来重新获取。.
◊ Scalability: When it comes to any large distributed system, size is just one aspect of scale that needs to be considered. Just as important is the effort required to increase capacity to handle greater amounts of load, commonly referred to as the scalability of the system. Scalability can refer to many different parameters of the system: how much additional traffic can it handle, how easy is it to add more storage capacity, or even how many more transactions can be processed.
扩展性:当分布式系统变得庞大以后,规模只是可伸缩性需要考虑的一个方面,对于系统的可伸缩性来说,首要就是增加容量能够处理很大的负载。系统的扩展性具体指许多不同的参数:能额外处理多少交易,怎样方便的增加存储能力,或者有多少交易能够被处理。
◊ Manageability: Designing a system that is easy to operate is another important consideration. The manageability of the system equates to the scalability of operations: maintenance and updates. Things to consider for manageability are the ease of diagnosing and understanding problems when they occur, ease of making updates or modifications, and how simple the system is to operate. (I.e., does it routinely operate without failure or exceptions?)
管理性:设计系统中,运营的易用性也是另外一个需要考虑的,系统的管理性等同于运营的伸缩性,包括:维护和更新,当故障发生时,能够方便的诊断和排查,能够方便的更新和修改,以及尽量使系统易于管理。
◊ Cost: Cost is an important factor. This obviously can include hardware and software costs, but it is also important to consider other facets needed to deploy and maintain the system. The amount of developer time the system takes to build, the amount of operational effort required to run the system, and even the amount of training required should all be considered. Cost is the total cost of ownership.
成本:成本也是一个重要因素,主要包括软件和硬件成本,但还要考虑部署和维护系统的其他方面成本,构建系统的开发费用,运行系统必须的运营成本,甚至是必要的培训成本,等等这些都需要考虑,成本就是这些因素的总和。
Each of these principles provides the basis for decisions in designing a distributed web architecture. However, they also can be at odds with one another, such that achieving one objective comes at the cost of another. A basic example: choosing to address capacity by simply adding more servers (scalability) can come at the price of manageability (you have to operate an additional server) and cost (the price of the servers).
这些原则为设计分布式web架构提供了基本的决策,然而,这些原则也会互相影响,完成这个目标时会增加其他目标的成本,一个简单例子:通过简单的增加服务器来扩容会带来运营成本和硬件成本。
When designing any sort of web application it is important to consider these key principles, even if it is to acknowledge that a design may sacrifice one or more of them.
当设计各种web应用时重要一点就是要综合考虑这些原则,甚至设计中不得不牺牲一个或者多个原则。
1.2. The Basics
When it comes to system architecture there are a few things to consider: what are the right pieces, how these pieces fit together, and what are the right tradeoffs. Investing in scaling before it is needed is generally not a smart business proposition; however, some forethought into the design can save substantial time and resources in the future.
在开始系统架构时,需要考虑几件事情:什么是正确的组件,这些组件如何配合在一起,如何合理的平衡,虽然在需求来之前投入过多的时间精力通常不是一个聪明的做法,但一些前瞻性的设计能够在未来节省时间和资源。
This section is focused on some of the core factors that are central to almost all large web applications: services, redundancy, partitions, and handling failure. Each of these factors involves choices and compromises, particularly in the context of the principles described in the previous section. In order to explain these in detail it is best to start with an example.
这一节主要集中于诸多大型web应用普遍存在的的核心因素:服务,冗余,切分和故障处理,在前面一节的原则指导下,每个因素都面临选择和妥协,为了更好的说明细节,最好是通过一个例子开始.
Example: Image Hosting Application
At some point you have probably posted an image online. For big sites that host and deliver lots of images, there are challenges in building an architecture that is cost-effective, highly available, and has low latency (fast retrieval).
或许你已经在网上发布过图片,对于一个托管和分发大量图片的大型网站,构建一个低成本,高可用,低延迟的系统是一个挑战。
Imagine a system where users are able to upload their images to a central server, and the images can be requested via a web link or API, just like Flickr or Picasa. For the sake of simplicity, let's assume that this application has two key parts: the ability to upload (write) an image to the server, and the ability to query for an image. While we certainly want the upload to be efficient, we care most about having very fast delivery when someone requests an image (for example, images could be requested for a web page or other application). This is very similar functionality to what a web server or Content Delivery Network (CDN) edge server (a server CDN uses to store content in many locations so content is geographically/physically closer to users, resulting in faster performance) might provide.
想象一下,在这个系统中,用户能够上传图片到服务器,图片也能够通过链接或者API被请求获取,就像Flickr和Picasa,出于简化的目的,假设这个应用包含两个主要部分:能够上传(写入)图片到服务器,能够查询图片。我们当然希望上传是快速的,我们也非常关心当图片被请求获取时(比如,图片能够被网页或者其他应用程序请求),分发也要快,功能上非常像CDN的edge server(注:边缘服务器 它主要是配合主服务器使用,加快不同地区的服务浏览速度,相当于镜像的作用,减轻主服务器的负担).
Other important aspects of the system are:
这个系统其他要求有:
There is no limit to the number of images that will be stored, so storage scalability, in terms of image count, needs to be considered.
图片存储的数量没有限制,所以,就图片数量而言,存储的可扩展必须考虑。
There needs to be low latency for image downloads/requests.
需要做到上传/请求的低延迟。
If a user uploads an image, the image should always be there (data reliability for images).
如果用户上传了图片,图片要一直被保存(图片数据的可靠性)。
The system should be easy to maintain (manageability).
系统非常容易维护(可管理性)。
Since image hosting doesn't have high profit margins, the system needs to be cost-effective.
托管服务不会有好的利润收益,所以系统必须做到低成本
Figure 1.1 is a simplified diagram of the functionality.
图1.1是一个简单的功能图
Figure 1.1: Simplified architecture diagram for image hosting application
In this image hosting example, the system must be perceivably fast, its data stored reliably and all of these attributes highly scalable. Building a small version of this application would be trivial and easily hosted on a single server; however, that would not be interesting for this chapter. Let's assume that we want to build something that could grow as big as Flickr.
在这个图片托管例子中,这个系统给人直观的感觉是响应快,数据存储可靠,具有高扩展性,可以在一台服务器上很容易的搭建一个小版本的应用,但这不是本文感兴趣的,假定我们是在构建一个增长达到像Flickr一样规模的系统。
Services
When considering scalable system design, it helps to decouple functionality and think about each part of the system as its own service with a clearly defined interface. In practice, systems designed in this way are said to have a Service-Oriented Architecture (SOA). For these types of systems, each service has its own distinct functional context, and interaction with anything outside of that context takes place through an abstract interface, typically the public-facing API of another service.
考虑系统设计的扩展性,服务有助于做功能解耦,系统的每一个部分被当作一个服务,并有着清晰的接口定义,在实践中,按照这个思路设计的系统被称为有着面向服务架构(SOA),对于这类系统,每一个服务都有它自己的功能上下文,通过抽象接口与外部交互,通常是其他服务的面向公共API。
Deconstructing a system into a set of complementary services decouples the operation of those pieces from one another. This abstraction helps establish clear relationships between the service, its underlying environment, and the consumers of that service. Creating these clear delineations can help isolate problems, but also allows each piece to scale independently of one another. This sort of service-oriented design for systems is very similar to object-oriented design for programming.
将一个系统解耦成为一系列相互配合的服务,这种抽象能够帮助在服务之间建立清晰的关系,服务依赖环境,服务的消费方。划分清晰的界限能够帮助隔离问题,也能够使得每一个组件能够独立的扩展。这种面向服务的设计非常像编程中的面向对象设计。
In our example, all requests to upload and retrieve images are processed by the same server; however, as the system needs to scale it makes sense to break out these two functions into their own services.
在我们的例子中,所有的上载请求和获取图片请求都是在一台服务器上处理;然而,当系统需要扩展时,将这两个功能分开为各自的服务会更好。
Fast-forward and assume that the service is in heavy use; such a scenario makes it easy to see how longer writes will impact the time it takes to read the images (since the two functions will be competing for shared resources). Depending on the architecture this effect can be substantial. Even if the upload and download speeds are the same (which is not true of most IP networks, since most are designed for at least a 3:1 download-speed:upload-speed ratio), read files will typically be read from cache, and writes will have to go to disk eventually (and perhaps be written several times in eventually consistent situations). Even if everything is in memory or read from disks (like SSDs), database writes will almost always be slower than reads. (See Pole Position, an open source tool for DB benchmarking, http://polepos.org/, and its results, http://polepos.sourceforge.net/results/PolePositionClientServer.pdf.)
进一步假设服务被大量使用,在这种场景下很容易看到长时间的写操作将会影响读取照片时间(因为这两个功能在共享资源上产生竞争),基于这种架构(注:单机服务)带来的影响会是重大的,即使上传和下载具有同样的速度(在大多数IP网络中一般不会,通常认为上传和下载的速比为3:1),通常文件是从缓存中读取,写入不得不最终要落盘(可能要写入多次才能最终达到一致)。哪怕数据都是在内存或者从SSD盘读取,数据写入通常要慢于读取。(Pole Position,一个开源的数据库测量工具)
Another potential problem with this design is that a web server like Apache or lighttpd typically has an upper limit on the number of simultaneous connections it can maintain (defaults are around 500, but can go much higher) and in high traffic, writes can quickly consume all of those. Since reads can be asynchronous, or take advantage of other performance optimizations like gzip compression or chunked transfer encoding, the web server can serve reads faster and switch between clients quickly, serving many more requests per second than the max number of connections (with Apache and max connections set to 500, it is not uncommon to serve several thousand read requests per second). Writes, on the other hand, tend to maintain an open connection for the duration of the upload; uploading a 1MB file could take more than 1 second on most home networks, so that web server could only handle 500 such simultaneous writes.
另外一个潜在的问题就是像Apache或者lighttpd这种web serve通常管理的并发连接是存在上限的(默认是500左右,但可以更高),写入将会消耗大量的服务连接。由于读取可以做成异步,或者采取其他优化性能手段,比如gzip压缩或者分包传输,使web server能够更快的切换读和切换客户端连接获取比最大连接数更高的RPS(request per second)(Apache的最大连接是500,能够服务上千的请求每秒也很有可能),另外一方面,写入时,在上传期间一直持有打开的连接,虽然大部分家庭网络中上传1M的文件会耗时1秒,因此web server仅仅只能支持500个并发写。
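The connection-limit argument above can be checked with a little arithmetic. This is a minimal sketch, not a benchmark: the 500-connection limit and the 1-second upload come from the text, while the 125 ms that a read holds a connection is purely an assumed figure chosen for illustration.

```python
MAX_CONNECTIONS = 500    # connection limit quoted in the text
UPLOAD_SECONDS = 1.0     # text: a 1MB upload holds a connection for about 1 second
READ_SECONDS = 0.125     # assumption: a read holds a connection for about 125 ms


def max_requests_per_second(max_connections: int, seconds_per_request: float) -> float:
    """Upper bound on throughput when each request occupies one connection slot."""
    return max_connections / seconds_per_request


writes_per_second = max_requests_per_second(MAX_CONNECTIONS, UPLOAD_SECONDS)
reads_per_second = max_requests_per_second(MAX_CONNECTIONS, READ_SECONDS)
print(f"writes/s <= {writes_per_second:.0f}, reads/s <= {reads_per_second:.0f}")
```

Under these assumptions the same 500 slots serve several thousand reads per second but only 500 uploads, which is why slow writes can starve reads on a shared server.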
Figure 1.2: Splitting out reads and writes
Planning for this sort of bottleneck makes a good case to split out reads and writes of images into their own services, shown in Figure 1.2. This allows us to scale each of them independently (since it is likely we will always do more reading than writing), but also helps clarify what is going on at each point. Finally, this separates future concerns, which would make it easier to troubleshoot and scale a problem like slow reads.
解决此类瓶颈问题最好的办法就是读写分离,图1.2,这样使我们能够分开扩展(因为看上去我们的读取会比写入多),另外也有助于清楚的知道在每个点上将怎么做。最终,关注点也会分开,从而比较容易的解决类似于慢读之类的问题。
The advantage of this approach is that we are able to solve problems independently of one another—we don't have to worry about writing and retrieving new images in the same context. Both of these services still leverage the global corpus of images, but they are free to optimize their own performance with service-appropriate methods (for example, queuing up requests, or caching popular images—more on this below). And from a maintenance and cost perspective each service can scale independently as needed, which is great because if they were combined and intermingled, one could inadvertently impact the performance of the other as in the scenario discussed above.
这种方式的优点在于我们能够分开的解决问题——我们不用担心同一上下文中新图片的写入和读取,尽管写入和读取仍然影响所有的图片,但我们能够基于服务选择合适的方法来优化性能(比如:请求队列,缓存热点图片——更多的在下面),而且从维护和成本的角度来看,每个服务能够根据需要独立扩展,要好于将它们糅合在一起相互影响性能,就像上面讨论的那样。
Of course, the above example can work well when you have two different endpoints (in fact this is very similar to several cloud storage providers' implementations and Content Delivery Networks). There are lots of ways to address these types of bottlenecks though, and each has different tradeoffs.
For example, Flickr solves this read/write issue by distributing users across different shards such that each shard can only handle a set number of users, and as users increase more shards are added to the cluster (see the presentation on Flickr's scaling, http://mysqldba.blogspot.com/2008/04/mysql-uc-2007-presentation-file.html). In the first example it is easier to scale hardware based on actual usage (the number of reads and writes across the whole system), whereas Flickr scales with their user base (but forces the assumption of equal usage across users so there can be extra capacity). In the former an outage or issue with one of the services brings down functionality across the whole system (no-one can write files, for example), whereas an outage with one of Flickr's shards will only affect those users. In the first example it is easier to perform operations across the whole dataset (for example, updating the write service to include new metadata or searching across all image metadata), whereas with the Flickr architecture each shard would need to be updated or searched (or a search service would need to be created to collate that metadata, which is in fact what they do).
比如,Flickr解决读/写问题就是将用户分发到不同的切片上,每个切片只处理一部分用户,当用户量增长,就在集群中加入很多的切片(参见Flickr扩容)。在第一个例子中,可以很容易的基于实际行为(整个系统中读和写的数量)来扩容硬件,然而Flickr是基于用户群来扩容,前者的一个问题就是一个服务出现问题会波及到整个系统(比如没有人能够写入),但是Flickr的一个切片出现问题将只会影响与切片相关的用户。前者的好处就是很容易操作整个数据集合——比如增加新的元数据或者搜索整个图片元数据——然而在Flickr的架构下,每个切片都需要完成更新或者搜索。
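The user-based sharding idea can be sketched in a few lines. This is only an illustration of the general pattern, not Flickr's actual scheme: the per-shard capacity, the sequential integer user IDs, and all function names are assumptions.

```python
USERS_PER_SHARD = 1000  # hypothetical capacity of a single shard


def shard_for_user(user_id: int) -> int:
    """Map a user to a shard; assumes sequential, non-negative integer user IDs."""
    return user_id // USERS_PER_SHARD


def shards_needed(user_count: int) -> int:
    """Number of shards required for a user base (ceiling division)."""
    return -(-user_count // USERS_PER_SHARD)


def affected_users(failed_shard: int, user_ids: list[int]) -> list[int]:
    """Only users living on the failed shard are hit by its outage."""
    return [u for u in user_ids if shard_for_user(u) == failed_shard]


print(shards_needed(2500))                     # shards grow with the user base
print(affected_users(1, [5, 1500, 2500]))      # an outage of shard 1 hits only its users
```

Capacity here grows with the number of users rather than with measured read and write load, which is the tradeoff the paragraph above describes.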
When it comes to these systems there is no right answer, but it helps to go back to the principles at the start of this chapter, determine the system needs (heavy reads or writes or both, level of concurrency, queries across the data set, ranges, sorts, etc.), benchmark different alternatives, understand how the system will fail, and have a solid plan for when failure happens.
讨论这些没有正确答案,还是本文开头的那些原则决定这系统需求(读写比重,并发级别,查询的集合、范围、排序等),测量不同的方案,知道系统的临界点,以及故障发生时的应急方案。
Redundancy
In order to handle failure gracefully a web architecture must have redundancy of its services and data. For example, if there is only one copy of a file stored on a single server, then losing that server means losing that file. Losing data is seldom a good thing, and a common way of handling it is to create multiple, or redundant, copies.
为了更好的处理故障,web架构中必须具有服务和数据的备份,比如,如果一台服务器上只有文件一份拷贝,服务器的文件丢失就是文件的整个丢失,数据丢失很少是个好事情,一般解决方法就是创建多份拷贝或者冗余。
This same principle also applies to services. If there is a core piece of functionality for an application, ensuring that multiple copies or versions are running simultaneously can secure against the failure of a single node.
对于服务来说也是同样的道理,对于应用的核心功能,确保多份拷贝或者版本同时在运行,安全防范单点故障。
Creating redundancy in a system can remove single points of failure and provide a backup or spare functionality if needed in a crisis. For example, if there are two instances of the same service running in production, and one fails or degrades, the system can fail over to the healthy copy. Failover can happen automatically or require manual intervention.
创建备用系统能够消除单点故障,并提供碰到灾难时需要的备份,比如,生产环境中同一个服务存在两个实例在运行,如果一个发生故障,系统能够故障转移到正常的备份,故障转移可以自动也可以根据需要人工干预。
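Note: Fail-Over
Fail-over is a backup mode of operation: when the primary component fails, its function is transferred to a backup component. The key points are that there is a primary and a standby, that the standby can be brought into service when the primary fails, and that it is then promoted to primary. For example, in MySQL's dual-master setup, when the master currently in use fails, the standby master can take over as the primary.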
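Automatic failover between redundant copies of a service can be sketched as follows. This is a minimal illustration under stated assumptions: the replicas are plain Python callables, a failure is signalled by `ConnectionError`, and the names `call_with_failover`, `broken_primary` and `healthy_standby` are hypothetical.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no redundant copy of the service could answer."""


def call_with_failover(replicas: Sequence[Callable[[str], bytes]], image_id: str) -> bytes:
    """Try each replica in order and return the first successful response."""
    failures = 0
    for replica in replicas:
        try:
            return replica(image_id)
        except ConnectionError:
            failures += 1  # this copy is unhealthy, fall through to the next one
    raise AllReplicasFailed(f"{failures} replicas failed for {image_id}")


def broken_primary(image_id: str) -> bytes:
    raise ConnectionError("primary is down")


def healthy_standby(image_id: str) -> bytes:
    return b"image-bytes-for-" + image_id.encode()


print(call_with_failover([broken_primary, healthy_standby], "dog.jpg"))
```

A real system would also need health checks and a way to promote the standby; the sketch only shows the request path when one instance fails.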
Another key part of service redundancy is creating a shared-nothing architecture. With this architecture, each node is able to operate independently of one another and there is no central "brain" managing state or coordinating activities for the other nodes. This helps a lot with scalability since new nodes can be added without special conditions or knowledge. However, and most importantly, there is no single point of failure in these systems, so they are much more resilient to failure.
服务冗余的另外一个关键点就是shared-nothing架构(注:去中心化架构),在这种架构下,每个节点都能够独立于其他节点而运行,没有一个中心脑节点管理其他节点的状态和相应的活跃。这会极大方便新增节点的扩容而不需要特别的条件,另外最重要的,这类系统没有单点故障,从而具有更好的抗故障能力。
For example, in our image server application, all images would have redundant copies on another piece of hardware somewhere (ideally in a different geographic location in the event of a catastrophe like an earthquake or fire in the data center), and the services to access the images would be redundant, all potentially servicing requests. (See Figure 1.3.) (Load balancers are a great way to make this possible, but there is more on that below).
举例来说,对于我们的图片托管应用,所有的图片将会在另外的硬件上会有备份(理想的情况是在另一个地域存在一个灾备),访问图片的服务也是有备份的,都能服务请求。图1.3(下面会讨论负载均衡器)
Figure 1.3: Image hosting application with redundancy
Partitions
There may be very large data sets that are unable to fit on a single server. It may also be the case that an operation requires too many computing resources, diminishing performance and making it necessary to add capacity. In either case you have two choices: scale vertically or horizontally.
可能有非常大的数据量,无法适合放在一台服务器上,还有其他情况,比如一个操作需要很多计算资源,逐渐变差的性能需要增加容量,这些情况下你有两种选择:垂直扩展和水平扩展。
Scaling vertically means adding more resources to an individual server. So for a very large data set, this might mean adding more (or bigger) hard drives so a single server can contain the entire data set. In the case of the compute operation, this could mean moving the computation to a bigger server with a faster CPU or more memory. In each case, vertical scaling is accomplished by making the individual resource capable of handling more on its own.
垂直扩展意味着为单台服务器增加资源,对于很大的数据量,也就意味着增加更多的硬件,使得一台服务器能够存储整个数据集,对于计算密集型的服务,意味着将计算挪到更快CPU和更多内存的服务器上,这种情况下,垂直扩展就是通过为它自身增加资源来完成。
To scale horizontally, on the other hand, is to add more nodes. In the case of the large data set, this might be a second server to store parts of the data set, and for the computing resource it would mean splitting the operation or load across some additional nodes. To take full advantage of horizontal scaling, it should be included as an intrinsic design principle of the system architecture, otherwise it can be quite cumbersome to modify and separate out the context to make this possible.
另一方面,水平扩展就是增加更多的节点,对于大数据集,就是增加一台服务器来存储一部分数据,对于计算密集的服务,意味着分解步骤或者负载到增加的节点上。为了充分的利用水平扩容的好处,必须把它当作系统架构的一个内在设计原则,否则后续进行修改或者分离将会很麻烦。
When it comes to horizontal scaling, one of the more common techniques is to break up your services into partitions, or shards. The partitions can be distributed such that each logical set of functionality is separate; this could be done by geographic boundaries, or by other criteria such as non-paying versus paying users. The advantage of these schemes is that they provide a service or data store with added capacity.
当采用水平扩展,一般的技术就是对服务进行切分,分片是分布式的,每个切片的逻辑功能都是独立的,切分可以通过地理位置,或者付费和非付费用户。这种模式的特点就是易于新增容量来提供服务或者存储。
In our image server example, it is possible that the single file server used to store images could be replaced by multiple file servers, each containing its own unique set of images. (See Figure 1.4.) Such an architecture would allow the system to fill each file server with images, adding additional servers as the disks become full. The design would require a naming scheme that tied an image's filename to the server containing it. An image's name could be formed from a consistent hashing scheme mapped across the servers. Or alternatively, each image could be assigned an incremental ID, so that when a client makes a request for an image, the image retrieval service only needs to maintain the range of IDs that are mapped to each of the servers (like an index).
在我们的图片托管例子中,可以将存储图片的单台文件服务器替换成多台文件服务器,每一台都包含不同部分的图片(图1.4),这样的架构允许系统在每台文件服务器上存放图片,当硬盘空间满时再增加新的服务器。这种设计要求一个naming scheme,用来映射图片名称和它存放的哪台服务器。图片名称可以由一致性哈希映射到服务器的方式来组成,或者,每个图片都有一个自增ID,当客户端请求一张图片时,获取图片的服务只需要维护每台服务器的ID区间(就像一个索引)。
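The incremental-ID variant of the naming scheme can be sketched as a small range index. The server names and range boundaries below are made up for illustration; the only idea taken from the text is that the retrieval service keeps the range of IDs mapped to each server.

```python
import bisect

# Hypothetical index: (first image ID stored on the server, server name), sorted by ID.
RANGE_INDEX = [(0, "files-01"), (1_000_000, "files-02"), (2_000_000, "files-03")]
RANGE_STARTS = [start for start, _ in RANGE_INDEX]


def server_for_image(image_id: int) -> str:
    """Find the file server whose ID range contains image_id."""
    if image_id < 0:
        raise ValueError("image IDs are non-negative")
    # bisect_right gives the insertion point after any equal start value,
    # so subtracting 1 yields the last range that begins at or before image_id.
    position = bisect.bisect_right(RANGE_STARTS, image_id) - 1
    return RANGE_INDEX[position][1]


print(server_for_image(999_999))
print(server_for_image(1_000_000))
```

When a disk fills up, a new `(first_id, server)` entry is appended to the index and new images land on the new server, while existing IDs keep resolving to the servers that already hold them.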
Figure 1.4: Image hosting application with redundancy and partitioning
Of course there are challenges distributing data or functionality across multiple servers. One of the key issues is data locality; in distributed systems the closer the data to the operation or point of computation, the better the performance of the system. Therefore it is potentially problematic to have data spread across multiple servers, as any time it is needed it may not be local, forcing the servers to perform a costly fetch of the required information across the network.
当然这里面存在数据分散或者跨多台服务器功能的风险,一个关键的问题就是数据局部性,在分布式系统中,离操作数据越近系统性能越好,因此,一个潜在的问题就是数据散落在多台服务器上,有时需要数据时却不是本地的,需要服务器完成代价较高的跨网络数据获取。
Another potential issue comes in the form of inconsistency. When there are different services reading and writing from a shared resource, potentially another service or data store, there is the chance for race conditions—where some data is supposed to be updated, but the read happens prior to the update—and in those cases the data is inconsistent. For example, in the image hosting scenario, a race condition could occur if one client sent a request to update the dog image with a new title, changing it from "Dog" to "Gizmo", but at the same time another client was reading the image. In that circumstance it is unclear which title, "Dog" or "Gizmo", would be the one received by the second client.
另外一个潜在的问题就是非一致性,当有不同的服务读取和写入共享资源时,对数据产生竞争。(注:这个例子不是很明白)
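The "Dog" versus "Gizmo" race can be made concrete by replaying the two possible orderings of the same pair of operations. This is a deterministic toy, assuming a single in-memory record; in a real system the ordering is decided by network and scheduling timing, not by the caller.

```python
def title_seen_by_reader(order: list[str]) -> str:
    """Replay one interleaving of a title update and a concurrent read."""
    image = {"title": "Dog"}
    seen = ""
    for step in order:
        if step == "write":
            image["title"] = "Gizmo"   # first client renames the image
        elif step == "read":
            seen = image["title"]      # second client reads the title
    return seen


print(title_seen_by_reader(["read", "write"]))   # the read lands before the update
print(title_seen_by_reader(["write", "read"]))   # the read lands after the update
```

Both orderings are valid outcomes of "at the same time", which is why the second client cannot know which title it will receive.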
There are certainly some obstacles associated with partitioning data, but partitioning allows each problem to be split—by data, load, usage patterns, etc.—into manageable chunks. This can help with scalability and manageability, but is not without risk. There are lots of ways to mitigate risk and handle failures; however, in the interest of brevity they are not covered in this chapter. If you are interested in reading more, you can check out my blog post on fault tolerance and monitoring.
当然,关于数据切片还有许多障碍,但是通过数据,负载,使用模式将问题分解。这有助于与可伸缩性和可管理型,但也不是没有风险。有许多办法应对风险和处理故障,但为了简化篇幅,本文不在覆盖这些。
1.3. The Building Blocks of Fast and Scalable Data Access
Having covered some of the core considerations in designing distributed systems, let's now talk about the hard part: scaling access to the data.
Most simple web applications, for example, LAMP stack applications, look something like Figure 1.5.
讨论了分布式系统设计的注意事项,现在开始讨论最难的部分:可扩展的数据访问。
大部分简单的web应用都多少像图1.5
Figure 1.5: Simple web applications
As they grow, there are two main challenges: scaling access to the app server and to the database. In a highly scalable application design, the app (or web) server is typically minimized and often embodies a shared-nothing architecture. This makes the app server layer of the system horizontally scalable. As a result of this design, the heavy lifting is pushed down the stack to the database server and supporting services; it's at this layer where the real scaling and performance challenges come into play.
随着增长,面临这两个挑战:访问应用服务和数据库的可伸缩性,对于高可伸缩的应用设计,app(或者web)服务通常最小化并且经常采用shared-nothing架构,这使得app服务处于可以水平扩展的层面,这种设计的后果就是,压力往下推到了数据库服务器及其相关服务,在这一层才是真正的面临着可扩展和性能的挑战。
The rest of this chapter is devoted to some of the more common strategies and methods for making these types of services fast and scalable by providing fast access to data.
本文的剩下部分将专注一些通用的策略和方法,提供更快的数据访问,使得服务变得更快更有可伸缩性。
Figure 1.6: Oversimplified web application
Most systems can be oversimplified to Figure 1.6. This is a great place to start. If you have a lot of data, you want fast and easy access, like keeping a stash of candy in the top drawer of your desk. Though overly simplified, the previous statement hints at two hard problems: scalability of storage and fast access of data.
许多系统都可以简化为图1.6,就从这里开始,如果你有很多数据,你希望更快的访问。尽管被极度简化,但前面的引子已经揭示了两个难题:可伸缩的存储和快速数据访问。
For the sake of this section, let's assume you have many terabytes (TB) of data and you want to allow users to access small portions of that data at random. (See Figure 1.7.) This is similar to locating an image file somewhere on the file server in the image application example.
为了这一节的需要,假定你有如干TB的数据,而且你运行用户可以随意的访问这些数据的一小部分,看图1.7,就像图片托管应用例子中,在文件服务器上定位一张图片文件。
Figure 1.7: Accessing specific data
This is particularly challenging because it can be very costly to load TBs of data into memory; this directly translates to disk IO. Reading from disk is many times slower than from memory—memory access is as fast as Chuck Norris, whereas disk access is slower than the line at the DMV. This speed difference really adds up for large data sets; in real numbers memory access is as little as 6 times faster for sequential reads, or 100,000 times faster for random reads, than reading from disk (see "The Pathologies of Big Data", http://queue.acm.org/detail.cfm?id=1563874). Moreover, even with unique IDs, solving the problem of knowing where to find that little bit of data can be an arduous task. It's like trying to get that last Jolly Rancher from your candy stash without looking.
这是一个很大的麻烦,因为将TB的数据载入内存的成本会很高;如果直接转化为磁盘IO。从磁盘读取比从内存读取要慢很多倍,内存读取就像Chuck Norris一样快,磁盘读取就像DMV一样慢。对于大数据集这种速度上的差异会拉大,对于顺序读,内存访问是磁盘读取的6倍,对于随机读,内存访问是磁盘读取的100000倍。然而,即使有唯一ID,从哪儿能够找到这一小部分数据仍然是一个艰巨的任务。
Thankfully there are many options that you can employ to make this easier; four of the more important ones are caches, proxies, indexes and load balancers. The rest of this section discusses how each of these concepts can be used to make data access a lot faster.
幸运的是,有许多选择是你能够将事情变得容易:非常重要的4个:缓存、代理、索引和负载均衡。本文的剩下部分将会讨论每一条的概念和如何使用它们使得数据访问更快。
Caches
Caches take advantage of the locality of reference principle: recently requested data is likely to be requested again. They are used in almost every layer of computing: hardware, operating systems, web browsers, web applications and more. A cache is like short-term memory: it has a limited amount of space, but is typically faster than the original data source and contains the most recently accessed items. Caches can exist at all levels in architecture, but are often found at the level nearest to the front end, where they are implemented to return data quickly without taxing downstream levels.
缓存利用了引用本地化原则:最近请求的数据会被再次请求。缓存被用在计算机的每一层:硬件、操作系统、浏览器、web应用等。缓存就像一个短期的内存:它具有有限大小的空间,通常快于原始的数据源,包含了最近访问的条目。缓存可以存在于架构的所有层,但一般用在离前端最近的一层,这样能够快速返回结果,而不用往下深入到其他层。
How can a cache be used to make your data access faster in our API example? In this case, there are a couple of places you can insert a cache. One option is to insert a cache on your request layer node, as in Figure 1.8.
在我们的例子中,如何利用缓存使得数据访问更快?在这个例子中,有两个地方可以嵌入缓存,一个就是在请求层节点上嵌入缓存,图1.8
Figure 1.8: Inserting a cache on your request layer node
Placing a cache directly on a request layer node enables the local storage of response data. Each time a request is made to the service, the node will quickly return local, cached data if it exists. If it is not in the cache, the request node will query the data from disk. The cache on one request layer node could also be located both in memory (which is very fast) and on the node's local disk (faster than going to network storage).
直接将缓存放在请求层,能够响应本地数据。每次请求过来,节点如果缓存数据存在,就快速返回。如果没在缓存里,将从磁盘查询。请求层节点的缓存既可以放在内存中(非常快)也可以放在节点的本地磁盘上(快于通过网络)。
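A cache on a single request layer node might look like the sketch below. It assumes a least-recently-used eviction policy (in line with the "most recently accessed items" description earlier) and a caller-supplied `load_from_origin` function standing in for the disk read; both are illustrative choices, not something the text prescribes.

```python
from collections import OrderedDict
from typing import Callable


class LocalCache:
    """Small in-memory LRU cache that falls back to the origin on a miss."""

    def __init__(self, capacity: int, load_from_origin: Callable[[str], bytes]):
        self.capacity = capacity
        self.load_from_origin = load_from_origin
        self.entries: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> bytes:
        if key in self.entries:
            self.entries.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = self.load_from_origin(key)     # slow path: go to disk/origin
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        return value


cache = LocalCache(capacity=2, load_from_origin=lambda key: key.upper().encode())
for name in ["a", "a", "b", "c", "a"]:
    cache.get(name)
print(cache.hits, cache.misses)
```

The limited capacity is what makes the eviction policy matter: once the third distinct key arrives, the oldest entry is dropped and a later request for it goes back to the origin.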
Figure 1.9: Multiple caches
What happens when you expand this to many nodes? As you can see in Figure 1.9, if the request layer is expanded to multiple nodes, it's still quite possible to have each node host its own cache. However, if your load balancer randomly distributes requests across the nodes, the same request will go to different nodes, thus increasing cache misses. Two choices for overcoming this hurdle are global caches and distributed caches.
如果扩展到多个节点会发生什么?如图1.9,如果请求层有多个节点,每个节点都有它自己的缓存,然而,如果你的负载均衡器随意的分发请求,同样的请求到达不同的节点,会增加缓存缺失率。可以通过全局缓存和分布式缓存来解决这个问题。
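The effect of random load balancing on per-node caches can be illustrated with a small simulation. This is a toy model under loud assumptions: caches are unbounded sets, the workload is 1,000 requests over 100 hypothetical image keys, and the random generator is seeded only so that runs are repeatable.

```python
import random
from typing import Callable


def count_misses(requests: list[str], node_count: int,
                 pick_node: Callable[[str], int]) -> int:
    """Count cache misses when each node keeps its own (unbounded) cache."""
    caches = [set() for _ in range(node_count)]
    misses = 0
    for key in requests:
        cache = caches[pick_node(key)]
        if key not in cache:
            misses += 1
            cache.add(key)
    return misses


rng = random.Random(0)
NODES = 4
requests = [f"image-{rng.randrange(100)}" for _ in range(1000)]

# Four nodes with private caches, requests scattered randomly across them.
per_node_misses = count_misses(requests, NODES, lambda key: rng.randrange(NODES))
# One cache shared by every node: each key misses exactly once.
shared_misses = count_misses(requests, 1, lambda key: 0)
print(per_node_misses, shared_misses)
```

With private caches the same key has to miss once on every node it happens to reach, while a shared cache pays for each key only once.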
Global Cache
A global cache is just as it sounds: all the nodes use the same single cache space. This involves adding a server, or file store of some sort, faster than your original store and accessible by all the request layer nodes. Each of the request nodes queries the cache in the same way it would a local one. This kind of caching scheme can get a bit complicated because it is very easy to overwhelm a single cache as the number of clients and requests increase, but is very effective in some architectures (particularly ones with specialized hardware that make this global cache very fast, or that have a fixed dataset that needs to be cached).
全局缓存:所有的节点共用同一个缓存空间。这需要增加一台服务器,或者文件存储之类的,这比从数据原始地要快,并且所有请求层的节点能够访问。每个请求节点通过同样的方式查询全局缓存,就像是本地缓存一样。这种缓存模式会变得复杂一些,因为随着客户端和请求的增加,很容易将单台缓存服务器压垮,但在某些架构中确实非常有效的(特别是利用特殊的硬件使得全局缓存访问特别快,或者固定大小的数据需要被缓存)。
There are two common forms of global caches depicted in the diagrams. In Figure 1.10, when a cached response is not found in the cache, the cache itself becomes responsible for retrieving the missing piece of data from the underlying store. In Figure 1.11 it is the responsibility of request nodes to retrieve any data that is not found in the cache.
全局缓存通常有两种形式,在图1.10中,当在缓存中没有找到响应内容时,缓存自己从下层的存储获取缺失的数据。在图1.11中,当在缓存中没有找到数据时,请求节点自己去获取数据。
Figure 1.10: Global cache where cache is responsible for retrieval
Figure 1.11: Global cache where request nodes are responsible for retrieval
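The two forms of global cache differ only in who fetches the missing data. The sketch below models the backing store as a plain dictionary; the class and function names are hypothetical and eviction is left out to keep the contrast visible.

```python
class ReadThroughCache:
    """Figure 1.10 style: the cache itself fetches missing data from the store."""

    def __init__(self, store: dict):
        self.store = store
        self.data = {}

    def get(self, key):
        if key not in self.data:
            self.data[key] = self.store[key]   # the cache talks to the store
        return self.data[key]


class PlainCache:
    """Figure 1.11 style: the cache only stores what callers put into it."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def put(self, key, value):
        self.data[key] = value


def request_node_get(cache: PlainCache, store: dict, key):
    """The request node handles the miss and fills the cache afterwards."""
    value = cache.get(key)
    if value is None:
        value = store[key]                     # the request node talks to the store
        cache.put(key, value)
    return value


store = {"dog.jpg": b"dog-bytes"}
read_through = ReadThroughCache(store)
plain = PlainCache()
print(read_through.get("dog.jpg"), request_node_get(plain, store, "dog.jpg"))
```

In the first form the request nodes never see the store at all; in the second, the application logic decides what goes into the cache and when.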
The majority of applications leveraging global caches tend to use the first type, where the cache itself manages eviction and fetching data to prevent a flood of requests for the same data from the clients. However, there are some cases where the second implementation makes more sense. For example, if the cache is being used for very large files, a low cache hit percentage would cause the cache buffer to become overwhelmed with cache misses; in this situation it helps to have a large percentage of the total data set (or hot data set) in the cache. Another example is an architecture where the files stored in the cache are static and shouldn't be evicted. (This could be because of application requirements around that data latency—certain pieces of data might need to be very fast for large data sets—where the application logic understands the eviction strategy or hot spots better than the cache.)
大部分应用采用全局缓存时都倾向于利用第一种方式,缓存自己管理空间收回和获取数据,防止客户端洪水般的请求同一数据。然而,某些情况下第二种实现也很好。举例来说,缓存很大的文件时,很低的缓存命中率会导致缓存空间被撑爆,这种情况下,缓存大部分数据(或者热数据)会很有帮助。另外一个例子就是文件存储在缓存中,是静态的不能够被回收(对于大数据集来说,某些片段数据能够被快速访问——应用程序的逻辑比缓存更为清楚回收策略和热点)。
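The two arrangements in Figures 1.10 and 1.11 can be contrasted with a minimal Python sketch. The class and function names (`ReadThroughCache`, `LookAsideCache`, `request_node_read`, `slow_origin`) are hypothetical and chosen only for illustration; a plain dictionary stands in for the cache server, and eviction is left out entirely.

```python
class ReadThroughCache:
    """Figure 1.10 style: the cache itself fetches missing data from the origin."""

    def __init__(self, origin_fetch):
        self._origin_fetch = origin_fetch  # hypothetical callable: key -> value
        self._data = {}

    def get(self, key):
        if key not in self._data:
            # Cache miss: the cache, not the caller, goes to the origin.
            self._data[key] = self._origin_fetch(key)
        return self._data[key]


class LookAsideCache:
    """Figure 1.11 style: the cache only stores data; callers handle misses."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)  # None signals a miss in this sketch

    def put(self, key, value):
        self._data[key] = value


def request_node_read(cache, origin_fetch, key):
    """A request node fetches from the origin and populates the cache on a miss."""
    value = cache.get(key)
    if value is None:
        value = origin_fetch(key)
        cache.put(key, value)
    return value


origin_reads = []  # records every trip to the (simulated) origin


def slow_origin(key):
    origin_reads.append(key)
    return f"data-for-{key}"


read_through = ReadThroughCache(slow_origin)
read_through.get("a")
read_through.get("a")  # served from the cache, no second origin read

look_aside = LookAsideCache()
request_node_read(look_aside, slow_origin, "b")
request_node_read(look_aside, slow_origin, "b")  # also served from the cache
```

In both variants the origin is read once per key. The difference is only in who owns the miss logic, which is what decides whether the cache or the application controls fetching and eviction.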
Distributed Cache
In a distributed cache (Figure 1.12), each of its nodes owns part of the cached data. If a refrigerator acts as a cache to the grocery store, a distributed cache is like putting your food in several locations (your fridge, cupboards, and lunch box), all convenient places to grab a snack without a trip to the store. Typically the cache is divided up using a consistent hashing function, such that if a request node is looking for a certain piece of data it can quickly know where to look within the distributed cache to determine if that data is available. In this case, each node has a small piece of the cache, and will send a request to another node for the data before going to the origin. Therefore, one of the advantages of a distributed cache is the increased cache space that can be had just by adding nodes to the request pool.
分布式缓存中(图1.12),每个节点都有它自己的缓存数据,分布式缓存就像把你的食物放在好几个地方——冰箱,橱柜——方便获取的地方。通常,缓存通过一致性哈希函数来分开(存储),这样请求节点能够快速的知道去哪儿查找缓存数据。这种情况下,每个节点都会有一部分缓存数据,并会到去原始数据源之前给其他节点发送请求,因此,分布式缓存的一个优点就是往请求池中增加节点就能增加缓存空间。
A disadvantage of distributed caching is remedying a missing node. Some distributed caches get around this by storing multiple copies of the data on different nodes; however, you can imagine how this logic can get complicated quickly, especially when you add or remove nodes from the request layer. That said, even if a node disappears and part of the cache is lost, the requests will just pull from the origin, so it isn't necessarily catastrophic!
分布式缓存的一个问题就是弥补缺失节点的数据。一些分布式缓存是通过在其他节点存储多份拷贝来解决这个问题。然而,可以想象这种方式马上将问题变得复杂起来,特别是从请求层增加或者移除节点。虽然即使一个节点不可用导致缓存数据丢失,请求仍然能够从原始地方获取数据,这也不是一个灾难性的问题。
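To make the consistent hashing idea concrete, here is a small Python sketch of a hash ring. It is one common way to build such a ring, not necessarily how any particular cache does it; the `HashRing` class, the node names, and the choice of MD5 with 100 virtual points per node are all assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Maps keys to cache nodes using consistent hashing with virtual nodes."""

    def __init__(self, nodes, replicas=100):
        self._replicas = replicas  # virtual points per node (assumed value)
        self._ring = []            # sorted list of (hash, node) points
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self._replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (self._hash(key), ""))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"image-{i}" for i in range(1000)]

before = {key: ring.node_for(key) for key in keys}
ring.remove_node("cache-b")  # simulate a cache node disappearing
after = {key: ring.node_for(key) for key in keys}

# Only keys that lived on the missing node are remapped; the rest stay put.
moved = [key for key in keys if before[key] != after[key]]
```

The keys in `moved` are the ones that will miss and fall back to the origin after the node is lost, which is why losing a node costs some cache hits but does not invalidate the whole cache.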
Figure 1.12: Distributed cache
The great thing about caches is that they usually make things much faster (implemented correctly, of course!). The methodology you choose just allows you to make it faster for even more requests. However, all this caching comes at the cost of having to maintain additional storage space, typically in the form of expensive memory; nothing is free. Caches are wonderful for making things generally faster, and moreover they keep the system functioning under high load conditions when otherwise there would be complete service degradation.
缓存最牛逼的就是将事情变得更快(当然实现要正确),然而,缓存会增加更多存储成本,通常都是昂贵的内存:没有免费午餐。缓存通常是事情更快,在高负载的情况下提供更好的弹性,否则会导致服务降级。
One example of a popular open source cache is Memcached (http://memcached.org/) (which can work both as a local cache and distributed cache); however, there are many other options (including many language- or framework-specific options).
一个很流行的缓存框架是Memcached(既可以当作本地缓存也可以是用作分布式缓存)。然而,还有很多其他选择。
Memcached is used in many large web sites, and even though it can be very powerful, it is simply an in-memory key value store, optimized for arbitrary data storage and fast lookups (O(1)).
Memcached被很多大型网站采用,它是简单的基于内存的K-V存储,数据的存储和访问达到O(1)级别。
Facebook uses several different types of caching to improve their site performance (see "Facebook caching and performance"). They use $GLOBALS and APC caching at the language level (provided in PHP at the cost of a function call) which helps make intermediate function calls and results much faster. (Most languages have these types of libraries to improve web page performance and they should almost always be used.) Facebook then uses a global cache that is distributed across many servers (see "Scaling memcached at Facebook"), such that one function call accessing the cache could make many requests in parallel for data stored on different Memcached servers. This allows them to get much higher performance and throughput for their user profile data, and have one central place to update data (which is important, since cache invalidation and maintaining consistency can be challenging when you are running thousands of servers).
Facebook利用其他的缓存技术来提升网站性能,他们在语言级别利用$GLOBAL和APC缓存。
Now let's talk about what to do when the data isn't in the cache…
现在来讨论一下如果数据没有在缓存里..
Proxies
At a basic level, a proxy server is an intermediate piece of hardware/software that receives requests from clients and relays them to the backend origin servers. Typically, proxies are used to filter requests, log requests, or sometimes transform requests (by adding/removing headers, encrypting/decrypting, or compression).
基本来说,代理服务器就是一个从客户端接收请求然后传递请求到后台原始服务器的中间软件/硬件部分。通常,代理用用来过滤请求,记录日志,或者转换请求(比如增加/删除协议头,加密/解密或者压缩)
Figure 1.13: Proxy server
Proxies are also immensely helpful when coordinating requests from multiple servers, providing opportunities to optimize request traffic from a system-wide perspective. One way to use a proxy to speed up data access is to collapse the same (or similar) requests together into one request, and then return the single result to the requesting clients. This is known as collapsed forwarding.
代理能够极大的帮助协调多个服务器的请求,能够从系统的视角提供优化机会,其中一个办法是利用代理来加速数据访问,通过合并相同或者类似的请求成为一个请求,然后返回单个结果给请求客户端,这被称作压缩转发。
Imagine there is a request for the same data (let's call it littleB) across several nodes, and that piece of data is not in the cache. If that request is routed through the proxy, then all of those requests can be collapsed into one, which means we only have to read littleB off disk once. (See Figure 1.14.) There is some cost associated with this design, since each request can have slightly higher latency, and some requests may be slightly delayed to be grouped with similar ones. But it will improve performance in high load situations, particularly when that same data is requested over and over. This is similar to a cache, but instead of storing the data/document like a cache, it is optimizing the requests or calls for those documents and acting as a proxy for those clients.
假设来自好几个节点请求一个相同的数据(称为LittleB),这部分数据没有在缓存里,如果请求通过代理被路由,那么这些请求将会被合并成一个,这就意味这仅有一次从磁盘读取LittleB。看图1.14。这种设计也会有一定代价,每个请求会有轻度的延迟,一些请求会被因为合并成一个而延迟。但是,从负载角度来看提升了性能,特别是相同的数据被请求了多次,这很像缓存,但不同的是缓存是存储了数据,代理是优化请求。
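Collapsed forwarding can be sketched in Python with a table of in-flight requests: the first caller for a key performs the origin read, and everyone who arrives while that read is pending waits on the same result. `CollapsingProxy` and `read_from_disk` are hypothetical names, and the half-second sleep merely simulates a slow disk so that the concurrent requests overlap.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor


class CollapsingProxy:
    """Forwards at most one origin read per key while that read is in flight."""

    def __init__(self, origin_fetch):
        self._origin_fetch = origin_fetch  # hypothetical callable: key -> value
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> Future shared by all waiting callers

    def get(self, key):
        with self._lock:
            future = self._in_flight.get(key)
            leader = future is None
            if leader:
                future = Future()
                self._in_flight[key] = future
        if leader:
            # Only the first caller reads from the origin.
            try:
                future.set_result(self._origin_fetch(key))
            except Exception as exc:
                future.set_exception(exc)
            finally:
                with self._lock:
                    del self._in_flight[key]
        # Leader and followers all return the same shared result.
        return future.result()


origin_reads = []


def read_from_disk(key):
    origin_reads.append(key)
    time.sleep(0.5)  # simulated slow read; other requests pile up meanwhile
    return f"contents of {key}"


proxy = CollapsingProxy(read_from_disk)
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(proxy.get, "littleB") for _ in range(5)]
    results = [f.result() for f in futures]
```

Five clients get their answer, but the disk is read once. Note that nothing is stored after the read completes, which is the difference from a cache that the paragraph above points out.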
In a LAN proxy, for example, the clients do not need their own IPs to connect to the Internet, and the LAN will collapse calls from the clients for the same content. It is easy to get confused here though, since many proxies are also caches (as it is a very logical place to put a cache), but not all caches act as proxies.
举例来说,局域网代理中,客户端访问Internet时不需要独立的IP,局域网代理会合并相同内容的客户端请求。这很容易混淆,因为一些代理也提供缓存(放入缓存非常合理),但不是所有缓存都能当作代理。
Figure 1.14: Using a proxy server to collapse requests
Another great way to use the proxy is to not just collapse requests for the same data, but also to collapse requests for data that is spatially close together in the origin store (consecutively on disk). Employing such a strategy maximizes data locality for the requests, which can result in decreased request latency. For example, let's say a bunch of nodes request parts of B: partB1, partB2, etc. We can set up our proxy to recognize the spatial locality of the individual requests, collapsing them into a single request and returning only bigB, greatly minimizing the reads from the data origin. (See Figure 1.15.) This can make a really big difference in request time when you are randomly accessing across TBs of data! Proxies are especially helpful under high load situations, or when you have limited caching, since they can essentially batch several requests into one.
另外一个利用代理不仅是对相同数据的请求进行合并,更可以对存储很近(在磁盘上是连续的)的请求进行合并。采用这种策略,对请求而言,最大限度将数据本地化,降低了请求延迟。举例来说,假设有很多节点请求数据B:partB1,partB2等,我们能够利用代理合并这些请求成为一个请求,仅返回BigB,最大限度减少读取原始数据,看图1.15.这在随机访问TB数据时真的很重要!在高负载下,或者缓存有限制的情况下,代理会非常有用,因为它们可以将多个请求变成一个。
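The spatial variant can be sketched as range coalescing: requests are treated as (offset, length) byte ranges, adjacent ranges are merged, and each merged range is read once. The function names and the 1024-byte stand-in for bigB are assumptions made for this example.

```python
def coalesce_ranges(requests, max_gap=0):
    """Merge (offset, length) ranges that touch or sit within max_gap bytes."""
    merged = []
    for offset, length in sorted(requests):
        end = offset + length
        if merged and offset <= merged[-1][1] + max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


def serve(requests, read_origin):
    """Read each merged range once, then slice out what each caller asked for."""
    results = {}
    for start, length in coalesce_ranges(requests):
        block = read_origin(start, length)
        for offset, size in requests:
            if start <= offset and offset + size <= start + length:
                begin = offset - start
                results[(offset, size)] = block[begin:begin + size]
    return results


big_b = bytes(range(256)) * 4  # stand-in for bigB on disk (1024 bytes)
reads = []                     # records every read issued to the origin


def read_origin(offset, length):
    reads.append((offset, length))
    return big_b[offset:offset + length]


# partB1, partB2 and partB3 are contiguous; the last request is far away.
part_requests = [(0, 100), (100, 100), (200, 56), (900, 24)]
parts = serve(part_requests, read_origin)
```

Four requests turn into two origin reads here. Raising `max_gap` trades some wasted bytes for fewer, larger reads.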
Figure 1.15: Using a proxy to collapse requests for data that is spatially close together
It is worth noting that you can use proxies and caches together, but generally it is best to put the cache in front of the proxy, for the same reason that it is best to let the faster runners start first in a crowded marathon race. The cache serves data from memory, so it is very fast, and it doesn't mind multiple requests for the same result. But if the cache were located on the other side of the proxy server, then every request would incur additional latency before reaching the cache, and this could hinder performance.
值得注意的是,可以同时利用代理和缓存,通常来说最好的是将缓存放在代理前面,就像拥挤的马拉松比赛中将跑得快的选手安排去领跑,因为缓存从内存中读取数据非常快,但它不会注意到对相同数据的多个请求。如果将缓存放在代理服务器的另外一边(后面),每个达到缓存的请求会增加额外的延迟,这可能会影响性能。
If you are looking at adding a proxy to your systems, there are many options to consider; Squid and Varnish have both been road tested and are widely used in many production web sites. These proxy solutions offer many optimizations to make the most of client-server communication. Installing one of these as a reverse proxy (explained in the load balancer section below) at the web server layer can improve web server performance considerably, reducing the amount of work required to handle incoming client requests.
如果你正在为系统寻找一个代理,有许多选择可以考虑:Squid和Varnish已经在很多网站得到检验和广泛应用。这些代理的解决方案为客户端-服务器通信提供很多优化。可以在web服务器层安装后当作反向代理(会在负载均衡章节解释),提升web服务器性能,减少一些客户端请求的处理。
Indexes
Using an index to access your data quickly is a well-known strategy for optimizing data access performance; it is probably best known from databases. An index trades increased storage overhead and slower writes (since you must both write the data and update the index) for the benefit of faster reads.
通过索引访问来优化数据访问性能是众所周之的策略,索引概念被广泛认识是来自数据库,索引通过增加额外存储和写变慢(因为既要写数据还有更新索引)来换取快读,使读写达到一种平衡.
Just as with a traditional relational data store, you can also apply this concept to larger data sets. The trick with indexes is you must carefully consider how users will access your data. In the case of data sets that are many TBs in size, but with very small payloads (e.g., 1 KB), indexes are a necessity for optimizing data access. Finding a small payload in such a large data set can be a real challenge since you can't possibly iterate over that much data in any reasonable time. Furthermore, it is very likely that such a large data set is spread over several (or many!) physical devices, which means you need some way to find the correct physical location of the desired data. Indexes are the best way to do this.
你可以利用这个理念运用在大数据中,就像传统型关系型数据库那样,你必须仔细考虑用户是如何访问你的数据,假设数据量达到许多TB,但只有很小一部分数据(比如1k)需要被载入,优化数据访问必须要用到索引。在很大的数据量中找到很小一点数是一个真正的挑战,因为你不可能在有限的时间内遍历查找,此外,如此大的数据分布在机台(或者多台)物理设备上——这就意味着要求你通过某种方式能够正确的找到数据的物理位置,索引是解决此类问题的最好办法。
Figure 1.16: Indexes
An index can be used like a table of contents that directs you to the location where your data lives. For example, let's say you are looking for a piece of data, part 2 of B—how will you know where to find it? If you have an index that is sorted by data type—say data A, B, C—it would tell you the location of data B at the origin. Then you just have to seek to that location and read the part of B you want. (See Figure 1.16.)
索引看上去就像一个指引你想要的数据位置放在哪儿的内容表格,举例来说,假设你要查找数据B的part2这一段——你知道怎么它存放在哪儿?如果你有一个按照数据类型排过序(比如说数据A,B,C)的索引,它会告诉你B的数据存放位置点,然后你就只需要定位这个位置点并开始读取你要想要的数据B。看Figure1.16
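As a rough illustration of the table-of-contents idea, the Python sketch below writes three records back to back into an in-memory byte store and keeps an index of where each one starts. The record contents and the `read_record` helper are invented for the example; a real system would index locations on disk or on other machines.

```python
import io

# Hypothetical records standing in for data A, B and C at the origin.
records = {"A": b"alpha-data", "B": b"part1|part2|part3", "C": b"gamma"}

# Write the records back to back and remember where each one starts.
store = io.BytesIO()
index = {}
for name in sorted(records):
    index[name] = (store.tell(), len(records[name]))
    store.write(records[name])


def read_record(name):
    offset, length = index[name]  # one lookup instead of scanning the store
    store.seek(offset)
    return store.read(length)


part_2_of_b = read_record("B").split(b"|")[1]
```

The index costs a little extra storage and must be updated on every write, but a read of B becomes a single seek to offset 10 rather than a scan from the beginning.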
These indexes are often stored in memory, or somewhere very local to the incoming client request. Berkeley DBs (BDBs) and tree-like data structures are commonly used to store data in ordered lists, ideal for access with an index.
索引一般存储在内存中,或者离访问请求最近的地方, Berkeley DB和树形数据结构非常适合存储排序列表。
Often there are many layers of indexes that serve as a map, moving you from one location to the next, and so forth, until you get the specific piece of data you want. (See Figure 1.17.)
经常会有很多层级的索引表,就像一个map,带你从一个地址到下一个地址,如此往复,直到你获取到你想要的数据。看Figure1.17
Figure 1.17: Many layers of indexes
Indexes can also be used to create several different views of the same data. For large data sets, this is a great way to define different filters and sorts without resorting to creating many additional copies of the data.
也可以对相同的数据创建几个不同的视图,对于大数据来说,不需要创建额外的多份拷贝,而是定义不同的过滤器和排序,这也是一个好办法。
For example, imagine that the image hosting system from earlier is actually hosting images of book pages, and the service allows client queries across the text in those images, searching all the book content about a topic, in the same way search engines allow you to search HTML content. In this case, all those book images take many, many servers to store the files, and finding one page to render to the user can be a bit involved. First, inverted indexes to query for arbitrary words and word tuples need to be easily accessible; then there is the challenge of navigating to the exact page and location within that book, and retrieving the right image for the results. So in this case the inverted index would map to a location (such as book B), and then B may contain an index with all the words, locations and number of occurrences in each part.
举个例子,假定图片托管系统托管书页的图片,允许客户端查询这些图片中的文字,根据标题搜索整本书的内容,就像搜索引擎允许你搜索html内容一样。在这个例子中,所有书的图片需要大量大量的服务器来存储,找到其中一页并返回给用户会有些复杂。首先,需要倒排索引使任意的单词和词组查询变得容易,然后,引导的那本书的页面和位置,获取正确的图片是一个挑战。因此在这个例子中,倒排索引将建立位置(比如书B)和B包含索引的所有单词,位置和发生数量的映射。
An inverted index, which could represent Index1 in the diagram above, might look something like the following: each word or tuple of words provides an index of which books contain it.
一个倒排索引,就像下面图表展示的那样——每个单词和词组提供哪些书包含它们的索引。
Word(s) | Book(s)
being awesome | Book B, Book C, Book D
always | Book C, Book F
believe | Book B
The intermediate index would look similar but would contain just the words, location, and information for book B. This nested index architecture allows each of these indexes to take up less space than if all of that info had to be stored into one big inverted index.
中间索引看上去类似,包含了单词,位置和书B的信息,嵌套索引架构允许每个索引占用更少的空间,而不是存储为一个大的倒排索引。
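The nested arrangement can be sketched in Python with two dictionaries: a top-level inverted index from terms to books, and a per-book intermediate index from terms to page numbers. The book contents below are made up so that the result matches the small table above, and the helper names (`terms`, `search`) are hypothetical.

```python
from collections import defaultdict

# Invented page texts, chosen to reproduce the example table.
books = {
    "Book B": ["i believe", "being awesome"],
    "Book C": ["always being awesome"],
    "Book D": ["being awesome today"],
    "Book F": ["always"],
}


def terms(text):
    """Single words plus adjacent word pairs (the 'tuples of words')."""
    words = text.lower().split()
    pairs = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + pairs


top_level = defaultdict(set)                       # term -> books containing it
per_book = defaultdict(lambda: defaultdict(list))  # book -> term -> page numbers

for book, pages in books.items():
    for page_number, text in enumerate(pages, start=1):
        for term in terms(text):
            top_level[term].add(book)
            per_book[book][term].append(page_number)


def search(term):
    """Use the top-level index to find books, then each book's own index."""
    return {book: per_book[book][term] for book in sorted(top_level.get(term, ()))}


hits = search("being awesome")
```

Because the top-level index stores only book names, the page-level detail for each book can live on whichever server holds that book.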
And this is key in large-scale systems because even compressed, these indexes can get quite big and expensive to store. In this system if we assume we have a lot of the books in the world—100,000,000 (see Inside Google Books blog post)—and that each book is only 10 pages long (to make the math easier), with 250 words per page, that means there are 250 billion words. If we assume an average of 5 characters per word, and each character takes 8 bits (or 1 byte, even though some characters are 2 bytes), so 5 bytes per word, then an index containing only each word once is over a terabyte of storage. So you can see creating indexes that have a lot of other information like tuples of words, locations for the data, and counts of occurrences, can add up very quickly.
在大尺度系统中,即使这些索引被压缩过,但依然需要(消耗)十分庞大的、价格昂贵贵的存储,在这个系统中,如果我们假定有100,000,000本书,每一本书只有10页,每一页有250个字,这就有2500亿个字,如果平均每个字是5个字符,每个字符需要一个字节,这样每个字就需要5个字节,那么仅仅包含这些字的索引需要一个TB的存储,另外你也看到了,创建索引时还有其他比如单词组,数据地址,出现次数等,导致增长很快。
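The back-of-the-envelope estimate above can be reproduced in a few lines of Python. The inputs are the simplifying assumptions stated in the text (10 pages per book, 250 words per page, 5 one-byte characters per word), and a terabyte is taken as 10^12 bytes.

```python
books = 100_000_000
pages_per_book = 10
words_per_page = 250
bytes_per_word = 5  # 5 characters at 1 byte each

total_words = books * pages_per_book * words_per_page
raw_bytes = total_words * bytes_per_word
raw_terabytes = raw_bytes / 10**12

print(f"{total_words:,} words")    # 250,000,000,000 words
print(f"{raw_terabytes} TB")       # 1.25 TB
```

This counts only the words themselves. Word tuples, locations, and occurrence counts would all add to that figure.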
Creating these intermediate indexes and representing the data in smaller sections makes big data problems tractable. Data can be spread across many servers and still accessed quickly. Indexes are a cornerstone of information retrieval, and the basis for today's modern search engines. Of course, this section only scratched the surface, and there is a lot of research being done on how to make indexes smaller, faster, contain more information (like relevancy), and update seamlessly. (There are some manageability challenges with race conditions, and with the sheer number of updates required to add new data or change existing data, particularly in the event where relevancy or scoring is involved).
创建中间索引并将数据划分成细粒度段使得处理大数据问题得到可能,即使数据分布在多个服务器上也能够快速访问。索引已经成为今天搜索引擎获取信息的基石,当然,这一章节仅仅只是触及一些表面,还有大量的研究需要完成,比如如何使得索引更小更快包含更多的信息和无缝更新。
Being able to find your data quickly and easily is important; indexes are an effective and simple tool to achieve this.
为了能够做到查找数据又快又容易,一个重要点就是:索引是达成目标的一个有效、简单的工具。
Load Balancers
Finally, another critical piece of any distributed system is a load balancer. Load balancers are a principal part of any architecture, as their role is to distribute load across a set of nodes responsible for servicing requests. This allows multiple nodes to transparently service the same function in a system. (See Figure 1.18.) Their main purpose is to handle a lot of simultaneous connections and route those connections to one of the request nodes, allowing the system to scale to service more requests by just adding nodes.
最后,分布式系统另外一个重要部分就是负载均衡器,负载均衡器已经是架构的重要部分,它们的作用就是将服务请求负载到多个节点上,这使得一个系统的多个节点能够提供相同的功能(图1.18),它们的主要目的就是处理大量的并发连接,并将这些连接路由到节点,通过增加节点使得系统能够做到服务可扩展的处理更多请求。
Figure 1.18: Load balancer
There are many different algorithms that can be used to service requests, including picking a random node, round robin, or even selecting the node based on certain criteria, such as memory or CPU utilization. Load balancers can be implemented as software or hardware appliances. One open source software load balancer that has received wide adoption is HAProxy.
用于服务请求有很多不同算法,包括随机选择节点,轮询算法,甚至是基于某个指标,比如内存或者CPU的利用率选择节点,负载均衡器有软件实现也有硬件实现,HAProxy是一个被广泛采用的开源软件负载均衡器。
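The three selection strategies mentioned above can be sketched in Python as follows. The `Node` class and its `cpu_utilization` field are hypothetical; a real load balancer would obtain such metrics from health reports rather than from a constructor argument.

```python
import itertools
import random


class Node:
    def __init__(self, name, cpu_utilization=0.0):
        self.name = name
        self.cpu_utilization = cpu_utilization  # hypothetical metric in [0, 1]


def pick_random(nodes, rng=random):
    """Random choice: no state to keep, evens out over many requests."""
    return rng.choice(nodes)


def round_robin(nodes):
    """Return a zero-argument picker that cycles through the nodes in order."""
    cycle = itertools.cycle(nodes)
    return lambda: next(cycle)


def pick_least_utilized(nodes):
    """Criteria-based choice: send the request to the least busy node."""
    return min(nodes, key=lambda node: node.cpu_utilization)


nodes = [Node("node-1", 0.72), Node("node-2", 0.15), Node("node-3", 0.40)]

next_node = round_robin(nodes)
order = [next_node().name for _ in range(4)]
least_busy = pick_least_utilized(nodes).name
any_node = pick_random(nodes).name
```

Round robin visits `node-1`, `node-2`, `node-3`, and then wraps back to `node-1`, while the utilization-based picker chooses `node-2`.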
In a distributed system, load balancers are often found at the very front of the system, such that all incoming requests are routed accordingly. In a complex distributed system, it is not uncommon for a request to be routed to multiple load balancers as shown in Figure 1.19.
在分布式系统中,负载均衡器大多是放在系统的最前面用于对访问请求进行分流,在复杂的分布式系统中,一个请求被多个负载均衡器路由也很有可能,就像Figure1.19
Figure 1.19: Multiple load balancers
Like proxies, some load balancers can also route a request differently depending on the type of request it is. (Technically these are also known as reverse proxies.)
就像代理,一些负载均衡器能够根据请求类型进行路由。(技术上被当作反向代理)
One of the challenges with load balancers is managing user-session-specific data. In an e-commerce site, when you only have one client it is very easy to allow users to put things in their shopping cart and persist those contents between visits (which is important, because it is much more likely you will sell the product if it is still in the user's cart when they return). However, if a user is routed to one node for a session, and then a different node on their next visit, there can be inconsistencies since the new node may be missing that user's cart contents. (Wouldn't you be upset if you put a 6 pack of Mountain Dew in your cart and then came back and it was empty?) One way around this can be to make sessions sticky so that the user is always routed to the same node, but then it is very hard to take advantage of some reliability features like automatic failover. In this case, the user's shopping cart would always have the contents, but if their sticky node became unavailable there would need to be a special case and the assumption of the contents being there would no longer be valid (although hopefully this assumption wouldn't be built into the application). Of course, this problem can be solved using other strategies and tools in this chapter, like services, and many not covered (like browser caches, cookies, and URL rewriting).
负载均衡器的一个问题就是用户session数据的管理,在电子商务网站,如果仅有一个客户端,在用户多次浏览之间将购物车的数据持久化是非常容易的(在用户再次回到网站时,如果购物车的数据仍然存在,你可能会买下这些商品,这一点非常重要)。然而,如果用户路由到其中一个节点产生了session,下一次浏览可能路由到另外一个节点,新的节点并没有用户的购物车数据从而存在不一致。一个办法就是将session固化,从而用户将一直路由到同一个节点,但这样会牺牲像自动故障恢复这样的可用性的能力,举个例子,用户的购物车一直都有东西,但如果固化的节点不可用了,一种情况就是session数据丢失。当然,这个问题可以通过其他手段和工具来解决,比如本文提到的服务,或者没有提到的像浏览器缓存,cookie,url重写。
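The sticky-session tradeoff can be demonstrated with a short Python sketch. It assumes, purely for illustration, that stickiness is implemented by hashing a session ID onto the node list and that each node keeps carts in its own memory; `sticky_node` and the `carts` dictionary are hypothetical.

```python
import hashlib


def sticky_node(session_id, nodes, healthy):
    """Pick the same node for a session every time, skipping unhealthy nodes."""
    digest = int(hashlib.sha256(session_id.encode("utf-8")).hexdigest(), 16)
    start = digest % len(nodes)
    for step in range(len(nodes)):
        candidate = nodes[(start + step) % len(nodes)]
        if candidate in healthy:
            return candidate
    raise RuntimeError("no healthy nodes available")


nodes = ["node-1", "node-2", "node-3"]
carts = {node: {} for node in nodes}  # per-node, in-memory session state

home = sticky_node("user-42", nodes, healthy=set(nodes))
carts[home]["user-42"] = ["6-pack of Mountain Dew"]

# Same session, same node, as long as that node stays healthy.
same_again = sticky_node("user-42", nodes, healthy=set(nodes))

# After the home node fails, the user lands elsewhere and the cart is gone.
fallback = sticky_node("user-42", nodes, healthy=set(nodes) - {home})
cart_after_failover = carts[fallback].get("user-42", [])
```

The empty cart after failover is the inconsistency described above. Moving session state out of the request nodes, for example into a shared service or a cookie, avoids tying the cart to one machine.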
If a system only has a couple of nodes, systems like round robin DNS may make more sense since load balancers can be expensive and add an unneeded layer of complexity. Of course in larger systems there are all sorts of different scheduling and load-balancing algorithms, including simple ones like random choice or round robin, and more sophisticated mechanisms that take things like utilization and capacity into consideration. All of these algorithms allow traffic and requests to be distributed, and can provide helpful reliability tools like automatic failover, or automatic removal of a bad node (such as when it becomes unresponsive). However, these advanced features can make problem diagnosis cumbersome. For example, when it comes to high load situations, load balancers will remove nodes that may be slow or timing out (because of too many requests), but that only exacerbates the situation for the other nodes. In these cases extensive monitoring is important, because overall system traffic and throughput may look like it is decreasing (since the nodes are serving fewer requests) but the individual nodes are becoming maxed out.
如果系统只有一对节点,系统可以采用轮询机制可能会更好,采用负载均衡器会更加昂贵和增加不必要的一层变得复杂。当然,在大型系统中,会有各种负载调度的各种类型算法,简单包括像随机选择或者轮询,复杂的一些的还要考虑利用率和容量。所有的这些算法都是为了分发请求,通过一些工具提高可用性,像自动故障恢复,或者自动移除损坏节点(比如当节点变得响应迟缓),然而,这些先进的特性会使得问题诊断变得复杂。举例来说,在一个高负载的情况下,负载均衡器会移除变得响应慢或者超时的节点(因为请求很多),但这会影响恶化其他节点。这时需要更为细致的监控,因为系统的整个交易量和吞吐量看上去在下降(因为节点处理的请求量变少)但单个节点变得透支。
Load balancers are an easy way to allow you to expand system capacity, and like the other techniques in this article, play an essential role in distributed system architecture. Load balancers also provide the critical function of being able to test the health of a node, such that if a node is unresponsive or over-loaded, it can be removed from the pool handling requests, taking advantage of the redundancy of different nodes in your system.
负载均衡器是对系统扩容的一个容易办法,就像本文提到的其他技术,它在分布式架构中扮演重要角色。负载均衡器同时也提供其他关键功能像检测节点的存活,当节点变得迟缓或者过载会被移除,其他节点来处理请求。
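Health checking can be sketched in Python as a counter of consecutive failed checks per node. The `HealthCheckedPool` class and the threshold of three failures are assumptions for this example; how a check is actually performed (a ping, an HTTP request, a timeout) is left out.

```python
class HealthCheckedPool:
    """Tracks consecutive failed health checks and hides bad nodes from routing."""

    def __init__(self, nodes, max_failures=3):
        self._failures = {node: 0 for node in nodes}
        self._max_failures = max_failures  # assumed threshold

    def record_check(self, node, ok):
        # A passing check resets the count; a failing one increments it.
        self._failures[node] = 0 if ok else self._failures[node] + 1

    def active_nodes(self):
        return [node for node, failures in self._failures.items()
                if failures < self._max_failures]


pool = HealthCheckedPool(["node-1", "node-2", "node-3"])

for _ in range(3):
    pool.record_check("node-2", ok=False)
after_failures = pool.active_nodes()  # node-2 is out of rotation

pool.record_check("node-2", ok=True)
after_recovery = pool.active_nodes()  # node-2 is back in rotation
```

While `node-2` is out, the same traffic is spread over two nodes instead of three, which is how removing slow nodes under high load can push the remaining ones toward the same fate.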
Queues
So far we have covered a lot of ways to read data quickly, but another important part of scaling the data layer is effective management of writes. When systems are simple, with minimal processing loads and small databases, writes can be predictably fast; however, in more complex systems writes can take an almost non-deterministically long time. For example, data may have to be written to several places on different servers or indexes, or the system could just be under high load. In the cases where writes, or any task for that matter, may take a long time, achieving performance and availability requires building asynchrony into the system; a common way to do that is with queues.
我们已经讨论了很多快速读取数据的办法,另外一个重要部分就是有效管理数据写入。当系统非常简单只有少量的负载和数据,可以预见写入还是很快的,然而,更大的复杂系统的写入需要的时间变得不可预测。举例来说,数据要写入到多个服务器或者索引上,或者系统处于高负载运行状态。这种情况下,写入或者其他任务都会需要很长时间,只有为系统建立异步机制才能达到优化和可用性的要求,一个通用的处理办法就是队列。
Figure 1.20: Synchronous request
Imagine a system where each client is requesting a task to be remotely serviced. Each of these clients sends their request to the server, where the server completes the tasks as quickly as possible and returns the results to their respective clients. In small systems where one server (or logical service) can service incoming clients just as fast as they come, this sort of situation should work just fine. However, when the server receives more requests than it can handle, then each client is forced to wait for the other clients' requests to complete before a response can be generated. This is an example of a synchronous request, depicted in Figure 1.20.
假设一个系统,客户端都在远程请求服务,每个客户端发送请求到服务器,服务器尽可能的处理任务并将结果返回给对应的客户端,在小系统中,只有一个服务器能够处理过来的请求,能够工作的不错。然而,当服务器接收到超过它处理能力的更多请求时,一些客户端不得不等待其他请求完成,这是处理同步请求的例子,如1.20图描述。
This kind of synchronous behavior can severely degrade client performance; the client is forced to wait, effectively performing zero work, until its request can be answered. Adding additional servers to address system load does not solve the problem either; even with effective load balancing in place it is extremely difficult to ensure the even and fair distribution of work required to maximize client performance. Further, if the server handling requests is unavailable, or fails, then the clients upstream will also fail. Solving this problem effectively requires abstraction between the client's request and the actual work performed to service it.
这种同步行为会极大降低客户端的性能:客户端不得不等待,工作效率为零,直到请求被应答。增加额外的服务器提升系统负载也不能解决这个问题;即使在很有效的负载均衡的情况下,也很难最大化客户端性能要求的分布式工作的公平性。另外,如果服务器处理请求失败,上游的客户端也会失败。有效解决此类问题可以抽象为客户端请求和服务器真正处理请求的能力。
Figure 1.21: Using queues to manage requests
Enter queues. A queue is as simple as it sounds: a task comes in, is added to the queue and then workers pick up the next task as they have the capacity to process it. (See Figure 1.21.) These tasks could represent simple writes to a database, or something as complex as generating a thumbnail preview image for a document. When a client submits task requests to a queue they are no longer forced to wait for the results; instead they need only acknowledgement that the request was properly received. This acknowledgement can later serve as a reference for the results of the work when the client requires it.
入列。队列听起来很简单:一个任务进来,进入队列,空闲的工作者挑选下一任务开始处理。(图1.21)这些任务可能是简单的写入数据库,或者复杂的像生成文件的预览缩略图。当客户端提交一个任务到队列,便不在等待关心结果,实际上它只需要确认请求已经收到。这个确认将会被作为以后客户请求运行结果的引用。
Queues enable clients to work in an asynchronous manner, providing a strategic abstraction of a client's request and its response. On the other hand, in a synchronous system, there is no differentiation between request and reply, and they therefore cannot be managed separately. In an asynchronous system the client requests a task, the service responds with a message acknowledging the task was received, and then the client can periodically check the status of the task, only requesting the result once it has completed. While the client is waiting for an asynchronous request to be completed it is free to perform other work, even making asynchronous requests of other services. The latter is an example of how queues and messages are leveraged in distributed systems.
队列使得客户端处在一种异步方式,提供了客户请求和响应的抽象策略。另一方面,在同步系统中,请求和响应没有被分离,因此不能分开管理。在异步系统中,客户端请求一个任务,服务响应一条信息确认任务已经收到,然后客户端开始周期性的检测任务状态,一旦完成便可请求结果,当客户端在等待一个异步请求结果时,可以随便处理其他任务,即使调用其他异步服务。
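The submit, acknowledge, and check-back flow can be sketched in Python with an in-process queue and worker threads. This is only a stand-in for a real message queue: the `TaskQueue` class, the ticket format, and the `make_thumbnail` task are all hypothetical.

```python
import queue
import threading
import uuid


class TaskQueue:
    def __init__(self, workers=2):
        self._tasks = queue.Queue()
        self._status = {}   # ticket -> "queued", "done" or "failed"
        self._results = {}  # ticket -> return value or exception
        self._lock = threading.Lock()
        for _ in range(workers):
            threading.Thread(target=self._work, daemon=True).start()

    def submit(self, func, *args):
        """Enqueue work and return immediately with an acknowledgement ticket."""
        ticket = str(uuid.uuid4())
        with self._lock:
            self._status[ticket] = "queued"
        self._tasks.put((ticket, func, args))
        return ticket

    def status(self, ticket):
        with self._lock:
            return self._status[ticket]

    def result(self, ticket):
        with self._lock:
            return self._results[ticket]

    def wait_all(self):
        self._tasks.join()  # block until every queued task was processed

    def _work(self):
        while True:
            ticket, func, args = self._tasks.get()
            try:
                outcome = ("done", func(*args))
            except Exception as exc:
                outcome = ("failed", exc)
            with self._lock:
                self._status[ticket], self._results[ticket] = outcome
            self._tasks.task_done()


def make_thumbnail(name):
    return f"{name}.thumb.png"


tasks = TaskQueue()
ticket = tasks.submit(make_thumbnail, "report.pdf")
# The client is free to do other work here and check the ticket later.
tasks.wait_all()
final_status = tasks.status(ticket)
thumbnail = tasks.result(ticket)
```

`submit` returns as soon as the task is queued, so the ticket plays the role of the acknowledgement, and `status` and `result` are what the client polls with afterward.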
Queues also provide some protection from service outages and failures. For instance, it is quite easy to create a highly robust queue that can retry service requests that have failed due to transient server failures. It is preferable to use a queue to enforce quality-of-service guarantees than to expose clients directly to intermittent service outages, which would require complicated and often inconsistent client-side error handling.
队列还能提供服务中断和故障的保护,举例来说,可以十分容易的构建一个健壮性高的队列,能够重新处理因为瞬间服务器异常导致失败的服务请求。对于间隔性的服务中断,通过队列来确保服务质量比直接暴露给客户端更为合适。
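Retrying on transient failures can be sketched in Python as a queue that puts a failed task back at the end of the line until a retry limit is reached. `TransientError`, `drain_with_retries`, and the limit of three attempts are assumptions for the example; backoff delays, which a production queue would normally add, are omitted.

```python
import collections


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def drain_with_retries(tasks, handler, max_attempts=3):
    """Process tasks in order, re-queueing any that fail transiently."""
    pending = collections.deque((task, 1) for task in tasks)
    done, dead = [], []
    while pending:
        task, attempt = pending.popleft()
        try:
            done.append(handler(task))
        except TransientError:
            if attempt < max_attempts:
                pending.append((task, attempt + 1))  # back of the queue
            else:
                dead.append(task)  # give up and keep it for inspection
    return done, dead


# Simulated service: write-2 fails once, write-3 keeps failing.
failures_left = {"write-2": 1, "write-3": 5}


def flaky_write(task):
    if failures_left.get(task, 0) > 0:
        failures_left[task] -= 1
        raise TransientError(task)
    return f"{task}: ok"


done, dead = drain_with_retries(["write-1", "write-2", "write-3"], flaky_write)
```

The client that submitted `write-2` never sees its first failure. Only `write-3`, which exhausts its attempts, ends up in `dead` for someone to look at.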
Queues are fundamental in managing distributed communication between different parts of any large-scale distributed system, and there are lots of ways to implement them. There are quite a few open source queues like RabbitMQ, ActiveMQ, BeanstalkD, but some also use services like Zookeeper, or even data stores like Redis.
队列也是大尺度分布式系统中不同组件进行分布式通信的基础,有许多办法来实现,有相当多的开源队列框架,像RabbitMQ, ActiveMQ, BeanstalkD, 也可以使用像Zookeeper的服务,甚至是数据存储框架像Redis。
1.4. Conclusion
Designing efficient systems with fast access to lots of data is exciting, and there are lots of great tools that enable all kinds of new applications. This chapter covered just a few examples, barely scratching the surface, but there are many more—and there will only continue to be more innovation in the space.