Large-scale cluster management at Google with Borg: notes on the paper's key points

The accompanying slides (PPT) are my own summary.


Abstract

Google’s Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines.
It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior.

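Borg is Google's cluster management system. It runs hundreds of thousands of jobs from many thousands of applications, across clusters that each contain up to tens of thousands of machines.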

1 Introduction
Borg provides three main benefits: it (1) hides the details of resource management and failure handling so its users can focus on application development instead; (2) operates with very high reliability and availability, and supports applications that do the same; and (3) lets us run workloads across tens of thousands of machines effectively.

Borg benefits

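  • Hides the details of resource management and failure handling so users can focus on application development;
  • High reliability and high availability;
  • Runs workloads effectively across tens of thousands of machines.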

2 The user perspective
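Borg's users are mainly system administrators and Google developers, who run their services and applications on it. Users submit work as jobs; each job consists of one or more tasks and runs in a single cell. A Borg cell is a set of machines that is managed as a unit.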

Borg cells run a heterogeneous workload with two main parts. The first is long-running services that should “never” go down, and handle short-lived latency-sensitive requests (a few μs to a few hundred ms). Such services are used for end-user-facing products such as Gmail, Google Docs, and web search, and for internal infrastructure services (e.g., BigTable). The second is batch jobs that take from a few seconds to a few days to complete; these are much less sensitive to short-term performance fluctuations.

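Borg cells run a heterogeneous workload with two main parts: services and batch jobs. Services are long-running, almost never stop, and handle short-lived, latency-sensitive requests; they typically back end-user-facing products (Gmail, Google Docs, web search) and internal infrastructure services (e.g., BigTable, the system that the open-source HBase is modeled on). Batch jobs run for anywhere from a few seconds to a few days and are not sensitive to short-term performance fluctuations. Both kinds of workload are mixed within a cell.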

For this paper, we classify higher-priority Borg jobs as “production” (prod) ones, and the rest as “non-production” (non-prod). Most long-running server jobs are prod; most batch jobs are non-prod. In a representative cell, prod jobs are allocated about 70% of the total CPU resources and represent about 60% of the total CPU usage; they are allocated about 55% of the total memory and represent about 85% of the total memory usage.

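The paper calls higher-priority jobs “production” (prod) jobs and the rest “non-production” (non-prod) jobs. Most long-running services are prod. In a representative cell, prod jobs are allocated about 70% of the CPU resources and account for about 60% of CPU usage; they are allocated about 55% of the memory and account for about 85% of memory usage.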

The machines in a cell belong to a single cluster, defined by the high-performance datacenter-scale network fabric that connects them. A cluster lives inside a single datacenter building, and a collection of buildings makes up a site.
Our median cell size is about 10 k machines after excluding test cells; some are much larger. The machines in a cell are heterogeneous in many dimensions: sizes (CPU, RAM, disk, network), processor type, performance, and capabilities such as an external IP address or flash storage.

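A cluster usually hosts one large cell and may also have a few smaller test or special-purpose cells. The median cell has about 10k machines, and the machines are heterogeneous (CPU, RAM, disk, and so on all vary).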


A Borg job’s properties include its name, owner, and the number of tasks it has. Jobs can have constraints to force its tasks to run on machines with particular attributes such as processor architecture, OS version, or an external IP address.

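A Borg job has properties such as its name, owner, and number of tasks. A job can also carry constraints that force its tasks onto machines with particular attributes, such as processor architecture, OS version, or an external IP address. Constraints are either hard or soft: a hard constraint is a requirement that must be met, while a soft constraint is a preference that is met when possible.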

Each task maps to a set of Linux processes running in a container on a machine [62]. The vast majority of the Borg workload does not run inside virtual machines (VMs), because we don’t want to pay the cost of virtualization. Also, the system was designed at a time when we had a considerable investment in processors with no virtualization support in hardware.

A task has properties too, such as its resource requirements and the task’s index within the job.

Borg programs are statically linked to reduce dependencies on their runtime environment, and structured as packages of binaries and data files, whose installation is orchestrated by Borg.

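A job's properties include its name, owner, and task count, plus scheduling constraints such as processor architecture, OS version, or an external IP address. These constraints influence the Borgmaster's scheduling decisions, but not all of them are mandatory; they come in hard and soft variants.
A job runs in exactly one cell and has N tasks, and each task may consist of several processes while it runs. Google does not use virtual machines to isolate tasks from one another; it uses lightweight containers built on cgroups instead.
A task has its own properties too: its resource requirements and its index within the job. Most of the time, all tasks in a job have the same resource requirements.
State diagram of the job and task lifecycle: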

A Borg alloc (short for allocation) is a reserved set of resources on a machine in which one or more tasks can be run; the resources remain assigned whether or not they are used. Allocs can be used to set resources aside for future tasks, to retain resources between stopping a task and starting it again, and to gather tasks from different jobs onto the same machine.

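A Borg alloc is a reserved set of resources on a machine in which one or more tasks can run. Allocs can be used to set resources aside for future tasks, to retain resources between stopping a task and starting it again, and to gather tasks from different jobs onto the same machine (for example, a web server instance and its associated logsaver task).
An alloc set is like a job: it is a group of allocs that reserve resources on multiple machines. Once an alloc set has been created, one or more jobs can be submitted to run in it. For brevity, the paper uses “task” to mean either an alloc or a top-level task, and “job” to mean either a job or an alloc set; this is a naming shorthand, not a one-to-one mapping between tasks and allocs.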

An alloc set is like a job: it is a group of allocs that reserve resources on multiple machines. Once an alloc set has been created, one or more jobs can be submitted to run in it. For brevity, we will generally use “task” to refer to an alloc or a top-level task (one outside an alloc) and “job” to refer to a job or alloc set.

2.5 Priority, quota, and admission control
Every job has a priority, a small positive integer. A high-priority task can obtain resources at the expense of a lower-priority one, even if that involves preempting (killing) the latter. Borg defines non-overlapping priority bands for different uses, including (in decreasing-priority order): monitoring, production, batch, and best effort (also known as testing or free). For this paper, prod jobs are the ones in the monitoring and production bands.

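Every job has a priority, expressed as a small positive integer. Borg defines non-overlapping priority bands, in decreasing-priority order: monitoring, production, batch, and best effort (also known as testing or free). Prod jobs are those in the first two bands.
To avoid preemption cascades, Borg does not allow tasks in the production priority band to preempt one another, so preemption by production work falls on non-prod jobs.
Quota is a vector of resource quantities (CPU, RAM, disk, etc.) at a given priority, for a period of time. A job submitted without sufficient quota is rejected.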
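
The preemption rule and the quota check can be sketched in a few lines of Python. This is only an illustration of the rules described above: the numeric ranges for each band and the function names (band_of, can_preempt, admit) are invented here, since the notes give only the band order.

```python
# Priority bands in decreasing-priority order, as listed in the paper.
# The numeric ranges are invented for this sketch.
BANDS = {
    "monitoring": range(300, 400),
    "production": range(200, 300),
    "batch": range(100, 200),
    "best_effort": range(1, 100),
}


def band_of(priority):
    """Return the name of the band that contains this priority."""
    for name, values in BANDS.items():
        if priority in values:
            return name
    raise ValueError(f"priority {priority} is outside every band")


def can_preempt(preemptor, victim):
    """A higher priority may preempt a lower one, except inside the production band."""
    if preemptor <= victim:
        return False
    both_production = band_of(preemptor) == band_of(victim) == "production"
    return not both_production


def admit(request, quota):
    """Admit a job only if every requested quantity fits in the remaining quota."""
    return all(amount <= quota.get(resource, 0) for resource, amount in request.items())
```

For example, can_preempt(250, 150) allows a production task to preempt a batch task, while can_preempt(250, 210) refuses because both priorities fall in the production band.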

Borg has a capability system that gives special privileges to some users; for example, allowing administrators to delete or modify any job in the cell, or allowing a user to access restricted kernel features or Borg behaviors such as disabling resource estimation (§5.5) on their jobs. 

Because the users are professional Google developers, some of them are granted extra privileges where appropriate.

2.6 Naming and monitoring
It’s not enough to create and place tasks: a service’s clients and other systems need to be able to find them, even after they are relocated to a new machine. To enable this, Borg creates a stable “Borg name service” (BNS) name for each task that includes the cell name, job name, and task number. Borg writes the task’s hostname and port into a consistent, highly-available file in Chubby [14] with this name, which is used by our RPC system to find the task endpoint.

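Creating and placing tasks is not enough: a service's clients and other systems must be able to find them, even after they are relocated to a new machine. To this end, Borg provides a stable Borg name service (BNS) and creates a BNS name for each task from the cell name, job name, and task index. The task's hostname and port are written under this name into Chubby, and the BNS name can also be resolved through DNS, so users can find a task by its BNS name alone. The job's task count and the health of each task are also updated in Chubby. The main purpose is high availability for the service (here the service is the job itself, which might be a web server, a distributed storage system, and so on): load balancers use this information to route user requests.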
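
A small sketch may make the idea of a stable name concrete. It assumes a made-up name layout ("cell/job/index") and uses an in-memory dictionary as a stand-in for the Chubby file; neither is Borg's actual format or API.

```python
from typing import Dict, NamedTuple


class Endpoint(NamedTuple):
    host: str
    port: int


def bns_name(cell, job, task_index):
    """Build a stable name from cell, job, and task index (hypothetical layout)."""
    return f"{cell}/{job}/{task_index}"


class NameStore:
    """In-memory stand-in for the Chubby file that holds each task's endpoint."""

    def __init__(self):
        self._endpoints: Dict[str, Endpoint] = {}

    def publish(self, cell, job, task_index, endpoint):
        # Called whenever a task starts or is relocated; the name never changes.
        self._endpoints[bns_name(cell, job, task_index)] = endpoint

    def resolve(self, name):
        return self._endpoints[name]
```

The point of the design is that clients hold only the name: when a task moves, publish overwrites the endpoint and the next resolve returns the new machine.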

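Almost every task has a built-in HTTP server that exposes health information and performance metrics such as RPC latencies. Borg monitors a health-check URL to decide whether the task is healthy; if it is not (for example, the request returns an HTTP error code or gets no timely response), Borg restarts the task.
Google also has a system called Sigma, a web UI through which users can inspect all of their jobs, the state of a cell, and even per-task health, resource usage, logs, and state-change history. Logs are rotated to avoid filling the disk, and to help debugging they are kept for a while after the task exits.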
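
The restart decision can be written as a pure function of one probe result. This is a minimal sketch: the one-second timeout and the function name needs_restart are assumptions, and the actual probing of the health-check URL is left out.

```python
from typing import Optional


def needs_restart(status_code: Optional[int], latency_s: Optional[float],
                  timeout_s: float = 1.0) -> bool:
    """Decide from one health-check probe whether the task should be restarted.

    status_code and latency_s are None when the probe got no response at all.
    """
    if status_code is None or latency_s is None:
        return True                      # no response
    if latency_s > timeout_s:
        return True                      # did not respond promptly
    return status_code >= 400            # HTTP error code
```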

3 Borg architecture


A Borg cell consists of a set of machines, a logically centralized controller called the Borgmaster, and an agent process called the Borglet that runs on each machine in a cell (see Figure 1). All components of Borg are written in C++.

Each cell contains a controller, the Borgmaster, and every machine in the cell runs an agent called the Borglet. Both the master and the agent are written in C++.

Each cell’s Borgmaster consists of two processes: the main Borgmaster process and a separate scheduler (§3.2). The main Borgmaster process handles client RPCs that either mutate state (e.g., create job) or provide read-only access to data (e.g., lookup job). It also manages state machines for all of the objects in the system (machines, tasks, allocs, etc.), communicates with the Borglets, and offers a web UI as a backup to Sigma.

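The Borgmaster consists of two processes: the main Borgmaster process and a separate scheduler process. The main Borgmaster process handles client RPCs; manages the state machines for all objects in the system (machines, tasks, allocs, etc.); communicates with the Borglets; and offers a web UI.
The Borgmaster is logically a single process but is replicated five times. Each replica keeps an in-memory copy of the cell state, and the state is also persisted in a highly available, distributed, Paxos-based store on the replicas' local disks. A single elected master serves as both the Paxos leader and the state mutator. A master is elected, following the Paxos protocol, when the cell is brought up and whenever the elected master fails.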

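When a job is submitted, the Borgmaster records it persistently in the Paxos store and adds the job’s tasks to the pending queue. This is scanned asynchronously by the scheduler, which assigns tasks to machines if there are sufficient available resources that meet the job’s constraints.
When a job is submitted, the Borgmaster records it in the Paxos store and adds the job's tasks to the pending queue. The scheduler scans this queue asynchronously and assigns tasks to machines. The scheduling algorithm has two parts: feasibility checking and scoring.
Feasibility checking finds the set of machines that meet the task's constraints and have enough available resources. Scoring then ranks the feasible machines. The score takes user-specified preferences into account but is driven mostly by built-in criteria, such as picking machines that already have a copy of the task's packages and spreading tasks across failure domains (for fault tolerance).
Borg has used different scoring policies. In practice, E-PVM (sometimes called “worst fit”) spreads tasks across machines, whereas “best fit” packs machines as tightly as possible to reduce fragmentation. Borg currently uses a hybrid model that tries to reduce the amount of stranded resources.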
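
The two phases, and the difference between worst fit and best fit, can be sketched as follows. This is a toy model, not Borg's scoring function: resources are assumed to be in normalized units so that CPU and RAM can be added, the 0.5 bonus per satisfied soft constraint is arbitrary, and the hybrid model is not attempted.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    free_cpu: float
    free_ram: float
    attrs: set = field(default_factory=set)


@dataclass
class Task:
    cpu: float
    ram: float
    hard: set = field(default_factory=set)   # attributes that must all be present
    soft: set = field(default_factory=set)   # attributes preferred when available


def feasible(task, machines):
    """Phase 1: keep machines with enough free resources and all hard constraints."""
    return [m for m in machines
            if m.free_cpu >= task.cpu and m.free_ram >= task.ram
            and task.hard <= m.attrs]


def score(task, machine, policy):
    """Phase 2: worst fit favors the most leftover room, best fit the least."""
    leftover = (machine.free_cpu - task.cpu) + (machine.free_ram - task.ram)
    packing = leftover if policy == "worst_fit" else -leftover
    return packing + 0.5 * len(task.soft & machine.attrs)


def place(task, machines, policy="best_fit"):
    """Return the highest-scoring feasible machine, or None if none is feasible."""
    candidates = feasible(task, machines)
    if not candidates:
        return None
    return max(candidates, key=lambda m: score(task, m, policy))
```

With one nearly empty machine and one nearly full machine that both satisfy the hard constraints, the worst_fit policy picks the empty one (spreading load) and best_fit picks the full one (tight packing).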

3.3 Borglet
The Borglet is a local Borg agent that is present on every machine in a cell. It starts and stops tasks; restarts them if they fail; manages local resources by manipulating OS kernel settings; rolls over debug logs; and reports the state of the machine to the Borgmaster and other monitoring systems.

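The Borglet is the local Borg agent on every machine; it manages local tasks and resources. The Borgmaster periodically polls each Borglet for its current state. Having the master initiate the polling gives it control over the communication rate and avoids “recovery storms”.
The Borglet's responsibilities are:
1. Start and stop tasks
2. Restart tasks that fail
3. Isolate resources between tasks, mainly by manipulating kernel settings such as cgroups
4. Roll over debug logs
5. Monitor and report task state
The Borgmaster polls all Borglets regularly to collect the state of every task. Having the master connect to the agents lets the master control its own load. Many other distributed systems have the agents connect to the master instead, which keeps the master's error-handling logic relatively simple.
As noted earlier, the master is replicated. The elected leader is responsible for sending poll messages to the agents and for updating the master's state with their responses. For performance, the reported information may be compressed so that only diffs are transmitted. If a Borglet fails to respond to several polls, the master marks its machine as down and reschedules all the tasks that were running on it. If communication is later restored, the master tells the Borglet to kill the tasks that have been rescheduled, to avoid duplicates.
A master failure does not affect the Borglets or the tasks that are already running, and running tasks also survive a crash of the Borglet process.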
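
The handling of an unreachable Borglet can be sketched as a small state machine. The class name MasterSketch, the missed-poll threshold, and the choice to leave never-rescheduled tasks running on the recovered machine are all assumptions made for this illustration.

```python
class MasterSketch:
    """Tracks where tasks run and reacts to missed polls (illustrative only)."""

    def __init__(self, max_missed_polls=3):
        self.max_missed = max_missed_polls
        self.placement = {}   # task -> machine, or None while pending
        self.missed = {}      # machine -> consecutive missed polls
        self.displaced = {}   # machine -> tasks taken away when it was marked down

    def assign(self, task, machine):
        self.placement[task] = machine

    def record_poll(self, machine, responded):
        """Record one poll result; return the tasks the Borglet must kill."""
        if not responded:
            self.missed[machine] = self.missed.get(machine, 0) + 1
            if self.missed[machine] == self.max_missed:
                # Machine is considered down: put its tasks back in the pending state.
                tasks = [t for t, m in self.placement.items() if m == machine]
                for t in tasks:
                    self.placement[t] = None
                self.displaced[machine] = tasks
            return []
        self.missed[machine] = 0
        to_kill = []
        for t in self.displaced.pop(machine, []):
            if self.placement[t] is None:
                # Assumption: a task that was never rescheduled keeps running here.
                self.placement[t] = machine
            else:
                to_kill.append(t)   # already running elsewhere: kill the duplicate
        return to_kill
```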

A scheduler replica operates on a cached copy of the cell state. It repeatedly: retrieves state changes from the elected master (including both assigned and pending work); updates its local copy; does a scheduling pass to assign tasks; and informs the elected master of those assignments. The master will accept and apply these assignments unless they are inappropriate (e.g., based on out of date state), which will cause them to be reconsidered in the scheduler’s next pass. This is quite similar in spirit to the optimistic concurrency control used in Omega [69], and indeed we recently added the ability for Borg to use different schedulers for different workload types.

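At Google a single Borgmaster can manage many thousands of machines in a cell (as noted above, the median cell has about 10k machines), and some cells see more than 10,000 tasks submitted per minute. A busy Borgmaster uses 10–14 CPU cores and up to 50 GiB of RAM. So how does Borg cope with the scalability problems that come with ever-growing clusters?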

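Early versions of the Borgmaster had a single simple, synchronous loop that:
1. Accepted user requests
2. Scheduled tasks
3. Communicated with the Borglets
To handle larger cells, the scheduler was split out into a separate process that runs in parallel with the other Borgmaster functions, which remain replicated for failure tolerance.
The separate scheduler process repeatedly:
1. Retrieves cell state changes from the elected master (including both assigned and pending work);
2. Updates its local copy
3. Does a scheduling pass that tentatively assigns tasks (the assignment is not yet final)
4. Informs the master of those assignments, which the master accepts or rejects (for example, because they are based on out-of-date state)
This is similar in spirit to the optimistic concurrency control in Omega. Borg also recently gained the ability to use different schedulers for different workload types.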
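
The accept-or-reject step can be sketched as below. It is a simplification under stated assumptions: cell state is reduced to free CPU per machine, and the names ElectedMaster, snapshot, apply, and scheduling_pass are invented for the example.

```python
class ElectedMaster:
    """Holds the authoritative cell state; here just free CPU per machine."""

    def __init__(self, free_cpu):
        self.free_cpu = dict(free_cpu)

    def snapshot(self):
        """What a scheduler replica copies into its local cache."""
        return dict(self.free_cpu)

    def apply(self, machine, cpu):
        """Accept an assignment only if it is still valid against current state."""
        if self.free_cpu.get(machine, 0) < cpu:
            return False
        self.free_cpu[machine] -= cpu
        return True


def scheduling_pass(master, cached_free, pending):
    """Assign tasks using the cached copy; return the tasks to retry next pass."""
    retry = []
    for task, cpu in pending:
        target = next((m for m, free in cached_free.items() if free >= cpu), None)
        if target is None or not master.apply(target, cpu):
            retry.append((task, cpu))    # rejected or unplaceable: reconsider later
        else:
            cached_free[target] -= cpu
    return retry
```

If the cell changes after the snapshot was taken, the master rejects the stale assignment, and the task simply goes through the next pass with a fresh snapshot.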
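Borg also makes several scheduler optimizations for scalability:
1. Score caching: evaluating feasibility and scoring a machine is expensive, while machine properties are mostly static and task properties rarely change, so Borg caches the results until the properties of the machine or the task change.
2. Equivalence classes: tasks in the same job usually have identical resource requirements and constraints, so Borg groups tasks with the same configuration into an equivalence class and does feasibility checking and scoring once per class rather than once per task.
3. Relaxed randomization: computing feasibility and scores for every machine in a large cell is too expensive, so the scheduler examines machines in random order until it has found enough feasible machines to score, then picks the best of that set.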
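
The three optimizations can be combined in one short sketch. The dictionary layout for tasks and machines, the "version" counter that changes when a machine's properties change, and the names CachedScorer and pick_machine are assumptions made for this example.

```python
import random


def equivalence_key(task):
    """Tasks with identical requirements and constraints share one key."""
    return (task["cpu"], task["ram"], frozenset(task["constraints"]))


class CachedScorer:
    """Caches scores per (equivalence class, machine, machine version)."""

    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.cache = {}
        self.evaluations = 0   # counts real (uncached) scoring calls

    def score(self, task, machine):
        key = (equivalence_key(task), machine["name"], machine["version"])
        if key not in self.cache:
            self.evaluations += 1
            self.cache[key] = self.score_fn(task, machine)
        return self.cache[key]


def pick_machine(task, machines, scorer, enough=3, rng=random):
    """Examine machines in random order until `enough` feasible ones are found."""
    order = list(machines)
    rng.shuffle(order)
    sample = []
    for m in order:
        if (m["free_cpu"] >= task["cpu"] and m["free_ram"] >= task["ram"]
                and task["constraints"] <= m["attrs"]):
            sample.append(m)
            if len(sample) >= enough:
                break
    if not sample:
        return None
    return max(sample, key=lambda m: scorer.score(task, m))
```

Bumping a machine's "version" whenever its properties change invalidates the cached score for that machine without touching any other entry.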

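4 Availability
Failures are the norm in large-scale systems. The paper lists the ways Borg improves availability:

(1) automatically reschedules evicted tasks;
(2) reduces correlated failures by spreading a job's tasks across failure domains;
(3) limits the allowed rate of task disruptions and the number of tasks from a job that can be down at the same time;
(4) rate-limits the rescheduling of tasks from unreachable machines, because it cannot distinguish large-scale machine failure from a network partition;
(5) avoids repeating task-machine pairings that cause crashes;
(6) recovers critical intermediate data that was written to local disk.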

5 Utilization
Figure 5 shows that segregating prod and non-prod work would need 20–30% more machines in the median cell to run our workload. That’s because prod jobs usually reserve resources to handle rare workload spikes, but don’t use these resources most of the time. Borg reclaims the unused resources (§5.5) to run much of the non-prod work, so we need fewer machines overall.

Borg users request CPU in units of milli-cores, and memory and disk space in bytes. (A core is a processor hyperthread, normalized for performance across machine types.) Figure 8 shows that they take advantage of this granularity: there are few obvious “sweet spots” in the amount of memory or CPU cores requested, and few obvious correlations between these resources.

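The paper then uses experiments to show several techniques that raise cluster utilization: cell sharing, large cells, fine-grained resource requests, and resource reclamation. The first few are fairly intuitive and are not expanded on here. Resource reclamation is the more interesting one and is covered in detail.
A job can specify a resource limit, an upper bound on the resources each task should be granted. Borg uses the limit to decide whether the user has enough quota to admit the job, and whether a machine has enough free resources to schedule the task. Because Borg kills a task that tries to use more RAM or disk space than it requested, and throttles CPU to what it asked for, users tend to request more resources than their tasks actually need. In addition, some tasks occasionally use all of their resources but most of the time do not.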

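For work that can tolerate lower-quality resources (such as batch jobs), Borg estimates how many resources a task will use and reclaims the rest. The whole process is called resource reclamation, and the estimate is called the task's reservation. The initial reservation equals the resource request; after 300 seconds it decays slowly toward the actual usage plus a safety margin. If usage exceeds the reservation, the reservation is increased rapidly.
A figure by Zhong Cheng (钟诚) of Huawei, cited here, shows the relationship between actual usage, reservation, and resource limit.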
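
The reservation rule can be sketched as a single update step. Only the 300-second startup window comes from the text above; the 10% safety margin, the 5% decay per step, and the function name next_reservation are assumptions chosen for illustration.

```python
def next_reservation(reservation, usage, limit, age_s,
                     margin=0.1, decay=0.05, startup_s=300):
    """One update of a task's reservation (all resource values in the same unit)."""
    if age_s < startup_s:
        return limit                       # startup window: reserve the full request
    target = min(limit, usage * (1 + margin))
    if usage > reservation:
        return target                      # usage exceeded the reservation: rise at once
    # Otherwise drift slowly toward actual usage plus the safety margin.
    return reservation + decay * (target - reservation)
```

The gap between the limit and the reservation is what Borg can hand to lower-priority work; for a task with a limit of 10 that uses 2, repeated calls shrink the reservation from 10 toward 2.2.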
