Original author: Andrei Matei
This document aims to explain the execution of an SQL query against CockroachDB, explaining the code paths through the various layers of the system (network protocol, SQL session management, parsing, execution planning, syntax tree transformations, query running, interface with the KV code, routing of KV requests, request processing, Raft, on-disk storage engine). The idea is to provide a high-level unifying view of the structure of the various components; no one will be explored in particular depth but pointers to other documentation will be provided where such documentation exists. Code pointers will abound.
This document will generally not discuss design decisions; it will rather focus on tracing through the actual (current) code.
The intended audience for this post is folks curious about a dive through the architecture of a modern, albeit young, database presented differently than in a design doc. It will hopefully also be helpful for open source contributors and new Cockroach Labs engineers.
This document does not cover some important aspects of query execution, in particular major developments that have occurred after the document was initially authored; including but not limited to:
A SQL query arrives at the server through the Postgres wire protocol (CockroachDB speaks the Postgres protocol for compatibility with existing client drivers and applications). The pgwire package implements protocol-related functionality; once a client connection is authenticated, it is represented by a pgwire.v3Conn struct (it wraps a net.Conn interface - Go's sockets). v3Conn.serve() implements the "read query - execute it - return result" loop. The protocol is message-oriented: for the lifetime of the connection, we read a message usually representing one or more SQL statements, pass it to the sql.Executor for executing all the statements in the batch and, once that's done and the results have been produced, serialize them and send them to the client.
Notice that the results are not streamed to the client and, moreover, a whole batch of statements might be executed before any results are sent back.
The sql.Executor is responsible for parsing statements, executing them and returning results back to the pgwire.v3Conn. The main entry point is Executor.execRequest(), which receives a batch of statements as a raw string. The execution of the batch is done in the context of a sql.Session object which accumulates information about the state of the connection (e.g. the database that has been selected, the various variables that can be set, the transaction status), as well as accounting for the memory in use at any given time by this connection. The Executor also manipulates a planner struct which provides the functionality around actually planning and executing a query.
Executor.execRequest() implements a state-machine of sorts by receiving batches of statements from pgwire, executing them one by one, updating the Session's transaction state (did a new transaction just begin or an old transaction just end? did we encounter an error which forces us to abort the current transaction?) and returning results and control back to pgwire. The next batch of statements received from the client will continue from the transaction state left by the previous batch.
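The state-machine idea can be sketched in a few lines of Go. This is a minimal illustration only: the TxnState type, its three states and the step and runBatch helpers are hypothetical names invented for this sketch, and the real Session tracks more states than shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// TxnState is a hypothetical, simplified stand-in for the transaction
// state that a session carries from one batch to the next.
type TxnState int

const (
	NoTxn   TxnState = iota // no transaction is open
	Open                    // inside BEGIN ... COMMIT/ROLLBACK
	Aborted                 // a statement failed; waiting for ROLLBACK
)

// step returns the state after running one statement. failed reports
// whether the statement itself hit an error.
func step(s TxnState, stmt string, failed bool) TxnState {
	kw := strings.ToUpper(strings.TrimSpace(stmt))
	switch {
	case kw == "BEGIN" && s == NoTxn:
		return Open
	case kw == "COMMIT" || kw == "ROLLBACK":
		return NoTxn
	case s == Open && failed:
		return Aborted
	}
	return s
}

// runBatch runs a batch of statements (assumed here to all succeed),
// starting from the state left behind by the previous batch.
func runBatch(s TxnState, stmts []string) TxnState {
	for _, stmt := range stmts {
		s = step(s, stmt, false)
	}
	return s
}

func main() {
	s := runBatch(NoTxn, []string{"BEGIN", "INSERT INTO t VALUES (1)"})
	fmt.Println("open after first batch:", s == Open)
	s = runBatch(s, []string{"COMMIT"})
	fmt.Println("closed after second batch:", s == NoTxn)
}
```

The point of the sketch is only that the state returned by one batch is the input to the next one, which is what lets a transaction span several client round trips.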
The first thing the Executor does is parse the statements; parsing uses an LALR parser generated by go-yacc from a Yacc-like grammar file, originally copied from Postgres and stripped down, and then gradually grown organically with ever-more SQL support. The process of parsing transforms a string into an array of ASTs (Abstract Syntax Trees), one for each statement. The AST nodes are structs defined in the sql/parser package, generally of two types - statements and expressions. Expressions implement a common interface useful for applying tree transformations. These ASTs will later be transformed by the planner into an execution plan.
With a list of statements in hand, Executor.execRequest() goes through them in order and executes one transaction's worth of statements at a time (i.e. groups of statements between BEGIN and COMMIT/ROLLBACK statements, or single statements executed outside of a transaction). If the session had an open transaction after execution of the previous batch, we continue consuming statements until a COMMIT/ROLLBACK. This "consuming of statements" is done by the call to runTxnAttempt; this function executes statements and returns once the COMMIT/ROLLBACK has been encountered.
There is an impedance mismatch that has to be explained here, around the interfacing of the SQL Executor/session code, which is stream-oriented (with statements being executed one at a time possibly within the scope of SQL transactions) and CockroachDB's Key/Value (KV) interface, which is request-oriented with transactions explicitly attached to every request. The most interesting interface for the KV layer of the database is the Txn.Exec() method. Txn lives in the internal/client package, which contains the KV client interface (the "client" and the server in this context are both internal to CockroachDB, although we used to expose the KV interface externally and it's not out of the question that we'll do it again in the future). Txn represents a KV transaction; there's generally one associated with the SQL session, reused between client ping-pongs.
The Txn.Exec interface takes a callback and some execution options and, based on those options, executes the callback and commits the transaction afterwards. If allowed by the options, the callback might be called multiple times, to deal with retries of transactions that are sometimes necessary in CockroachDB (usually because of data contention). The SQL Executor might or might not want to let the KV client perform such retries automatically.
To hint at the complications: a single SQL statement executed outside of a SQL transaction (i.e. an "implicit transaction") can be safely retried. However, a SQL transaction spanning multiple client requests will have different statements executed in different callbacks passed to Txn.Exec(); as such, it is not sufficient to retry one of these callbacks - we have to retry all the statements in the transaction, and generally some of these statements might be conditional on the client's logic and thus cannot be retried verbatim (i.e. different results for a SELECT might trigger different subsequent statements). In this case, we bubble up a retryable error to the client; more details about this can be read in our transaction documentation. This complexity is captured in Executor.execRequest(), which has logic for setting the different execution options and contains a suitable callback passed to Txn.Exec(); this callback will call runTxnAttempt(). The statement execution code path continues inside the callback, but it is worth noting that, from this moment on, we have interfaced with the (client of the) KV layer and everything below is executing in the context of a KV transaction.
Now that we have figured out what (KV) transaction we're running inside of, we are concerned with executing SQL statements one at a time. runTxnAttempt() has a few layers below it dealing with the various states a SQL transaction can be in (open / aborted / waiting for a user retry, etc.), but the interesting one is execStmt. It creates an "execution plan" for a statement and runs it.
An execution plan in CockroachDB is a tree of planNode nodes, similar in spirit to the AST but, this time, containing semantic information and also runtime state. This tree is built by planner.makePlan(), which takes a parsed statement and returns the root of the planNode tree after having performed all the semantic analysis and various transformations. The nodes in this tree are actually "executable" (they have Start() and Next() methods), and each one will consume data produced by its children (e.g. a JoinNode has left and right children whose data it consumes).
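To make the "executable tree" idea concrete, here is a minimal sketch of two nodes pulling rows from one another. The Row type, the simplified planNode interface (Next returning a row and a boolean) and the drain helper are assumptions made for this illustration; the real interface differs in its details.

```go
package main

import "fmt"

// Row and planNode are simplified, hypothetical versions of the real types.
type Row []string

type planNode interface {
	Start() error
	Next() (Row, bool) // the bool is false once the node is exhausted
}

// scanNode is a leaf that produces rows from an in-memory slice.
type scanNode struct {
	rows []Row
	pos  int
}

func (n *scanNode) Start() error { n.pos = 0; return nil }

func (n *scanNode) Next() (Row, bool) {
	if n.pos >= len(n.rows) {
		return nil, false
	}
	n.pos++
	return n.rows[n.pos-1], true
}

// filterNode passes through only the child rows that satisfy pred.
type filterNode struct {
	child planNode
	pred  func(Row) bool
}

func (n *filterNode) Start() error { return n.child.Start() }

func (n *filterNode) Next() (Row, bool) {
	for {
		row, ok := n.child.Next()
		if !ok || n.pred(row) {
			return row, ok
		}
	}
}

// drain starts the plan and pulls rows from its root until exhaustion.
func drain(root planNode) ([]Row, error) {
	if err := root.Start(); err != nil {
		return nil, err
	}
	var out []Row
	for row, ok := root.Next(); ok; row, ok = root.Next() {
		out = append(out, row)
	}
	return out, nil
}

func main() {
	scan := &scanNode{rows: []Row{{"Google", "CA"}, {"Apple", "CA"}, {"IBM", "NY"}}}
	plan := &filterNode{child: scan, pred: func(r Row) bool { return r[1] == "CA" }}
	rows, _ := drain(plan)
	fmt.Println(rows)
}
```

Each node only knows about its direct child, so new node types can be stacked on top without changing the ones below.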
Currently, building the execution plan, performing semantic analysis and applying various transformations is a pretty ad-hoc process, but we are working on replacing the code with a more structured process and separating the IR (Intermediate Representation) used for analysis and transforms from the runtime structures (see this WIP RFC: https://github.com/cockroachdb/cockroach/pull/10055/files#diff-542aa8b21b245d1144c920577333ceed).
In the meantime, the planner looks at the type of the statement at the top of the AST and, for each statement type, invokes a specific method that builds the execution plan. For example, the tree for a SELECT statement is produced by planner.SelectClause(). Notice how different aspects of a SELECT statement are handled there: a scanNode is created (renderNode.initFrom() ->...-> planner.Scan()) to scan a table, a WHERE clause is transformed into an expression and assigned to a filterNode, an ORDER BY clause is turned into a sortNode, etc. In the end, a selectTopNode is produced, which in fact is a tree of a groupNode, a windowNode, a sortNode, a distinctNode and a renderNode wrapping a scanNode (acting as the original data source).
Finally, the execution plan is simplified and optimized somewhat; this includes removing the selectTopNode wrappers and eliding all no-op intermediate nodes.
To make this notion of the execution plan more concrete, consider one actually "rendered" by the EXPLAIN statement:
root@:26257> create table customers(
  name string primary key,
  address string,
  state string,
  index SI (state)
);
root@:26257> insert into customers values
  ('Google', '1600 Amphitheatre Parkway', 'CA'),
  ('Apple', '1 Infinite Loop', 'CA'),
  ('IBM', '1 New Orchard Road ', 'NY');
root@:26257> EXPLAIN(EXPRS,NOEXPAND,NOOPTIMIZE,METADATA) SELECT * FROM customers WHERE address like '%Infinite%' ORDER BY state;
+-------+--------+----------+---------------------------+------------------------+----------+
| Level | Type   | Field    | Description               | Columns                | Ordering |
+-------+--------+----------+---------------------------+------------------------+----------+
| 0     | select |          |                           | (name, address, state) | +state   |
| 1     | nosort |          |                           | (name, address, state) | +state   |
| 1     |        | order    | +@3                       |                        |          |
| 1     | render |          |                           | (name, address, state) |          |
| 1     |        | render 0 | name                      |                        |          |
| 1     |        | render 1 | address                   |                        |          |
| 1     |        | render 2 | state                     |                        |          |
| 2     | filter |          |                           | (name, address, state) |          |
| 2     |        | filter   | address LIKE '%Infinite%' |                        |          |
| 3     | scan   |          |                           | (name, address, state) |          |
| 3     |        | table    | customers@primary         |                        |          |
+-------+--------+----------+---------------------------+------------------------+----------+
You can see data being produced by a scanNode, filtered by a filterNode (presented as "filter"), rendered by a renderNode (presented as "render"), and then sorted by a sortNode (presented as "nosort", because we have turned off order analysis with NOEXPAND and the sort node doesn't know yet whether sorting is needed), wrapped in a selectTopNode (presented as "select").
With plan simplification turned on, the EXPLAIN output becomes:
root@:26257> EXPLAIN (EXPRS,METADATA) SELECT * FROM customers WHERE address LIKE '%Infinite%' ORDER BY state;
+-------+------+--------+---------------------------+------------------------+--------------+
| Level | Type | Field  | Description               | Columns                | Ordering     |
+-------+------+--------+---------------------------+------------------------+--------------+
| 0     | sort |        |                           | (name, address, state) | +state       |
| 0     |      | order  | +state                    |                        |              |
| 1     | scan |        |                           | (name, address, state) | +name,unique |
| 1     |      | table  | customers@primary         |                        |              |
| 1     |      | spans  | ALL                       |                        |              |
| 1     |      | filter | address LIKE '%Infinite%' |                        |              |
+-------+------+--------+---------------------------+------------------------+--------------+
Expressions
A subset of ASTs are parser.Expr, representing various "expressions" - parts of statements that can occur in many different places - in a WHERE clause, in a LIMIT clause, in an ORDER BY clause, as the projections of a SELECT statement, etc. Expression nodes implement a common interface so that a visitor pattern can be applied to them for different transformations and analysis. Regardless of where they appear in the query, all expressions need some common processing (e.g. names appearing in them need to be resolved to columns from data sources). These tasks are performed by planner.analyzeExpr. Each planNode is responsible for calling analyzeExpr on the expressions it contains, usually at node creation time (again, we hope to unify our execution planning more in the future).
planner.analyzeExpr performs the following tasks:
1. resolving names (the colA in select 3 * colA from MyTable needs to be replaced by an index within the rows produced by the underlying data source (usually a scanNode))
2. normalization (e.g. a = 1 + 1 -> a = 2, a not between b and c -> (a < b) or (a > c))
3. constant folding (e.g. 1 + 2 becomes 3): we perform exact arithmetic using the same library used by the Go compiler and classify all the constants into two categories: numeric - NumVal or string-like - StrVal. These representations of the constants are smart enough to figure out the set of types that can represent the value (e.g. NumVal.AvailableTypes: 5 can be represented as int, decimal or float, but 5.4 can only be represented as decimal or float). This will come in useful in the next step.
4. type inference and propagation: this analysis phase assigns a result type to an expression, and in the process types all the sub-expressions. Typed expressions are represented by the TypedExpr interface, and they are finally able to evaluate themselves to a result value through the Eval method. The typing algorithm is presented in detail in the typing RFC: the general idea is that it's a recursive algorithm operating on sub-expressions; each level of the recursion may take a hint about the desired outcome, and each expression node takes that hint into consideration while weighing what options it has. In the absence of a hint, there's also a set of "natural typing" rules. For example, a NumVal described above checks whether the hint is compatible with its list of possible types. This process also deals with overload resolution for function calls and operators.
5. replacing sub-query syntax nodes by a sql.subquery execution plan node.
A note about sub-queries: consider a query like select * from Employees where DepartmentID in (select DepartmentID from Departments where NumEmployees > 100). The query on the Departments table is called a sub-query. Subqueries are recognized and replaced with an execution node by subqueryVisitor. The subqueries are then run and replaced by their results through the subqueryPlanVisitor. This is usually done by various top-level nodes when they start execution (e.g. renderNode.Start()).
Notable planNodes
As hinted throughout, execution plan nodes are responsible for executing parts of a query. Each one consumes data from lower-level nodes, performs some logic, and feeds data into a higher-level one.
After being constructed, their main methods are Start, which initiates the processing, and Next, which is called repeatedly to produce the next row.
To tie this to the SQL Executor section above, executor.execLocal(), the method responsible for executing one statement, calls plan.Next() repeatedly and accumulates the results.
Consider some planNodes involved in running a SELECT statement, using the table defined above and
SELECT * FROM customers WHERE State LIKE 'C%' AND strpos(address, 'Infinite') != 0 ORDER BY Name;
as a slightly contrived example. This is supposed to return customers from states starting with "C" and whose address contains the string "Infinite". To get excited, let's see the query plan for this statement:
root@:26257> EXPLAIN(EXPRS) SELECT * FROM customers WHERE State LIKE 'C%' and strpos(address, 'Infinite') != 0 order by name;
+-------+------------+--------+----------------------------------+
| Level | Type       | Field  | Description                      |
+-------+------------+--------+----------------------------------+
| 0     | sort       |        |                                  |
| 0     |            | order  | +name                            |
| 1     | index-join |        |                                  |
| 2     | scan       |        |                                  |
| 2     |            | table  | customers@SI                     |
| 2     |            | spans  | /"C"-/"D"                        |
| 2     |            | filter | state LIKE 'C%'                  |
| 2     | scan       |        |                                  |
| 2     |            | table  | customers@primary                |
| 2     |            | filter | strpos(address, 'Infinite') != 0 |
+-------+------------+--------+----------------------------------+
So the plan produced for this query, from top (highest-level) to bottom, looks like:
sortNode -> indexJoinNode -> scanNode (index)
                          -> scanNode (PK)
Before we inspect the nodes in turn, one thing deserves explanation: how did the indexJoinNode (which indicates that the query is going to use the "SI" index) come to be? The fact that this query uses an index is not apparent in the syntactical structure of the SELECT statement, and so this plan is not simply a product of the mechanical tree building hinted at above. Indeed, there's a step that we haven't mentioned before: "plan expansion". Among other things, this step performs "index selection" (more information about the algorithms currently used for index selection can be found in Radu's blog post). We're looking for indexes that can be scanned to efficiently retrieve only rows that match (part of) the filter. In our case, the "SI" index (indexing the state) can be scanned to efficiently retrieve only the rows that are candidates for satisfying the state LIKE 'C%' expression (in an ecstasy to agony moment, we see that our index selection / expression normalization code is smart enough to infer that state LIKE 'C%' implies state >= 'C' AND state < 'D', but is not smart enough to infer that the two expressions are in fact equivalent and thus the filter can be elided altogether). We won't go into plan expansion or index selection here, but the index selection process happens in the expansion of the SelectNode and, as a byproduct, produces indexJoinNodes configured with the index spans to be scanned.
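The inference from a LIKE prefix to a key span (the /"C"-/"D" span in the plan above) can be sketched as follows. This is an illustration only, assuming plain byte-wise string ordering; prefixSpan and inSpan are hypothetical helpers and not the actual index selection code.

```go
package main

import "fmt"

// prefixSpan returns the half-open span [start, end) containing exactly
// the strings that begin with prefix, under byte-wise ordering. ok is
// false when no finite end key exists (empty prefix or all 0xff bytes).
func prefixSpan(prefix string) (start, end string, ok bool) {
	b := []byte(prefix)
	for i := len(b) - 1; i >= 0; i-- {
		if b[i] < 0xff {
			b[i]++ // the smallest key that sorts after every "prefix..." key
			return prefix, string(b[:i+1]), true
		}
	}
	return prefix, "", false
}

// inSpan reports whether key falls inside [start, end).
func inSpan(key, start, end string) bool {
	return key >= start && key < end
}

func main() {
	start, end, _ := prefixSpan("C")
	fmt.Printf("state LIKE 'C%%' scans [%q, %q)\n", start, end)
	for _, state := range []string{"CA", "CO", "NY"} {
		fmt.Println(state, inSpan(state, start, end))
	}
}
```

Every key inside such a span starts with the prefix, which is why, for a pure prefix pattern, the span check and the LIKE filter are equivalent.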
Now let's see how these planNodes run:
sortNode: The sortNode sorts the rows produced by its child and corresponds to the ORDER BY SQL clause. The constructor has a bunch of logic related to the quirky rules for name resolution from SQL92/99. Another interesting fact is that, if we're sorting by a non-trivial expression (e.g. SELECT a, b ... ORDER BY a + b), we need the a + b values (for every row) to be produced by a lower-level node. This is achieved through a pattern that's also present in other nodes: the lower node capable of evaluating expressions and rendering their results is the renderNode; the sortNode constructor checks if the expressions it needs are already rendered by that node and, if they are not, asks for them to be produced through the renderNode.addOrMergeRenders() method. The actual sorting is performed in the sortNode.Next() method. The first time it is called, it consumes all the data produced by the child node and accumulates it into n.sortStrategy (an interface hiding multiple sorting algorithms). When the last row is consumed, n.sortStrategy.Finish() is called, at which time the sorting algorithm finishes its processing. Subsequent calls to sortNode.Next() simply iterate through the results of the sorting algorithm.
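The "consume everything on the first call, then iterate" behaviour described for sortNode.Next() can be sketched like this. The source type, the fromSlice helper and the single-column string comparison are assumptions of this sketch; the real node delegates to a sortStrategy and supports arbitrary orderings.

```go
package main

import (
	"fmt"
	"sort"
)

type Row []string

// source is a hypothetical stand-in for a child node's Next method.
type source func() (Row, bool)

// fromSlice returns a source that yields the given rows in order.
func fromSlice(rows []Row) source {
	i := 0
	return func() (Row, bool) {
		if i >= len(rows) {
			return nil, false
		}
		i++
		return rows[i-1], true
	}
}

// sortNode buffers its whole input on the first Next call, sorts it by
// one column, and afterwards simply iterates over the sorted buffer.
type sortNode struct {
	child  source
	col    int
	sorted []Row
	loaded bool
	pos    int
}

func (n *sortNode) Next() (Row, bool) {
	if !n.loaded {
		for row, ok := n.child(); ok; row, ok = n.child() {
			n.sorted = append(n.sorted, row)
		}
		sort.SliceStable(n.sorted, func(i, j int) bool {
			return n.sorted[i][n.col] < n.sorted[j][n.col]
		})
		n.loaded = true
	}
	if n.pos >= len(n.sorted) {
		return nil, false
	}
	n.pos++
	return n.sorted[n.pos-1], true
}

func main() {
	rows := []Row{{"Google", "CA"}, {"Apple", "CA"}, {"IBM", "NY"}}
	n := &sortNode{child: fromSlice(rows), col: 0}
	for row, ok := n.Next(); ok; row, ok = n.Next() {
		fmt.Println(row)
	}
}
```

The consequence visible in the sketch is that a sort is a blocking operation: no row can be returned before the child has been fully drained.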
indexJoinNode: The indexJoinNode implements joining of results from an index with the rows of a table. It is used when an index can be used for a query, but it doesn't contain all the necessary columns; columns not available in the index need to be retrieved from the Primary Key (PK) key-values. The indexJoinNode sits on top of two scan nodes - one configured to scan the index, and one that is constantly reconfigured to do "point lookups" by PK. In the case of our query, we can see that the "SI" index is used to read a compact set of rows that match the "state" filter but, since it doesn't contain the "address" column, the PK also needs to be used. Each index KV pair contains the primary key of the row, so there is enough information to do PK lookups. indexJoinNode.Next keeps reading rows from the index and, for each one, adds a span to be read by the PK. Once enough such spans have been batched, they are all read from the PK. As described in the section on SQL rows to KV pairs from the design doc, each SQL row is represented as a single KV pair in the indexes, but as multiple consecutive KV pairs in the PK (represented by a "key span").
An interesting detail has to do with how filters are handled: note that the state LIKE 'C%' condition is evaluated by the index scan, and the strpos(address, 'Infinite') != 0 condition is evaluated by the PK scan. This is nice because it means that we will be filtering as much as we can on the index side and we will be doing fewer expensive PK lookups. The code that figures out which conjunction is to be evaluated where is in splitFilter(), called by the indexJoinNode constructor.
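The idea behind splitting a filter between the index scan and the PK scan can be sketched as below. This is a simplified illustration and not the real splitFilter(): the conjunct type, which carries a pre-computed list of referenced columns, and the function signature are assumptions made for the example.

```go
package main

import "fmt"

// conjunct is one AND-ed term of a WHERE clause, together with the
// columns it references (hypothetical representation).
type conjunct struct {
	expr string
	cols []string
}

// splitFilter separates the conjuncts that can be evaluated using only
// the columns stored in the index from those that need the full row.
func splitFilter(conjuncts []conjunct, indexCols map[string]bool) (onIndex, onTable []conjunct) {
	for _, c := range conjuncts {
		covered := true
		for _, col := range c.cols {
			if !indexCols[col] {
				covered = false
				break
			}
		}
		if covered {
			onIndex = append(onIndex, c)
		} else {
			onTable = append(onTable, c)
		}
	}
	return onIndex, onTable
}

func main() {
	// The "SI" index stores the indexed column plus the primary key.
	si := map[string]bool{"state": true, "name": true}
	filter := []conjunct{
		{expr: "state LIKE 'C%'", cols: []string{"state"}},
		{expr: "strpos(address, 'Infinite') != 0", cols: []string{"address"}},
	}
	onIndex, onTable := splitFilter(filter, si)
	fmt.Println("index scan:", onIndex)
	fmt.Println("PK scan:   ", onTable)
}
```

Evaluating as many conjuncts as possible on the index side reduces the number of point lookups that have to be issued against the primary key.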
scanNode: The scanNode generally constitutes the source of a renderNode or filterNode; it is responsible for scanning over the key/value pairs for a table or index and reconstructing them into rows. This node is starting to smell like rubber meeting a road, because we are getting closer to the actual data - the monolithic, distributed KV map. You'll see that the Next() method is not particularly climactic, since it delegates the work to a rowFetcher, described below. There's one interesting thing that the scanNode does: it runs a filter expression, just like the filterNode. That is because we are trying to push down parts of the WHERE clause as far as possible. This is generally a work in progress, see filter_opt.go. The idea is that a query like
/* Select the orders placed by each customer in the first year of membership. */
SELECT * FROM Orders o inner join Customers c ON o.CustomerID = c.ID WHERE o.amount > 10 AND c.State = 'NY' AND age(c.JoinDate, o.Date) < INTERVAL '1 year'
is going to be compiled into two scanNodes, one for Customers, one for Orders. Each one of them can do the part of filtering that refers exclusively to their respective tables, and then the higher-level joinNode only needs to evaluate expressions that need data from both (i.e. age(c.JoinDate, o.Date) < INTERVAL '1 year').
Let's continue downwards, looking at the structures that the scanNode uses for actually reading data.
rowFetcher: The rowFetcher is responsible for iterating through key-value pairs, figuring out where a SQL table or index row ends (remember that a SQL row is potentially encoded in multiple KV entries), and decoding all the keys and values into SQL column values, dealing with differences between the primary index and other indexes and with the layout of a table. For details on the mapping between SQL rows and KV pairs, see the corresponding section from the Design Doc and the encoding tech note.
The rowFetcher also performs decoding from on-disk byte arrays to the representation of data that we do most processing on: implementations of the parser.Datum interface. For details on what the on-disk format is for different data types, browse around the util/encoding directory. To actually read KV pairs from the database, the rowFetcher delegates to the kvBatchFetcher.
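The core job of finding row boundaries in a stream of KV pairs can be sketched as follows. The key layout used here ("<row prefix>/<column>" as readable strings) is invented purely for the illustration and is not CockroachDB's actual encoding; kv, splitKey and fetchRows are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// kv is one key/value pair as returned by the KV layer.
type kv struct{ key, value string }

// splitKey separates a key of the hypothetical form "<row prefix>/<column>".
func splitKey(key string) (rowPrefix, column string) {
	i := strings.LastIndex(key, "/")
	if i < 0 {
		return key, ""
	}
	return key[:i], key[i+1:]
}

// fetchRows groups consecutive KV pairs that share a row prefix into
// one row, represented here as a column-name-to-value map.
func fetchRows(kvs []kv) []map[string]string {
	var rows []map[string]string
	var cur map[string]string
	curPrefix := ""
	for _, p := range kvs {
		prefix, col := splitKey(p.key)
		if cur == nil || prefix != curPrefix {
			// A new prefix means the previous row has ended.
			cur = map[string]string{}
			rows = append(rows, cur)
			curPrefix = prefix
		}
		cur[col] = p.value
	}
	return rows
}

func main() {
	kvs := []kv{
		{"/customers/primary/Apple/address", "1 Infinite Loop"},
		{"/customers/primary/Apple/state", "CA"},
		{"/customers/primary/IBM/address", "1 New Orchard Road"},
		{"/customers/primary/IBM/state", "NY"},
	}
	for _, row := range fetchRows(kvs) {
		fmt.Println(row["address"], "|", row["state"])
	}
}
```

Because the pairs belonging to one row are adjacent in key order, a single forward pass is enough to reassemble rows.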
kvBatchFetcher: The kvBatchFetcher finally reads data from the KV database. It understands nothing of SQL concepts, such as tables, rows or columns. When it is created, it is configured with a number of "key spans" that it needs to read (these might be, for example, a single span for reading a whole table, or a couple of spans for reading parts of the PK or of an index).
To actually read data from the KV database, the kvBatchFetcher uses the KV layer's "client" interface, namely client.Batch. This is where the "SQL layer" interfaces with the "KV layer" - the kvBatchFetcher will build such Batches of requests, send them for execution in the context of the KV transaction (remember the Transaction mentioned in the Statement Execution section), read the results and return them to the hierarchy of planNodes. The requests being sent to the KV layer, in the case of this read-only query, are ScanRequests.
The rest of this document will walk through the "execution" of KV requests, such as the ones sent by the kvBatchFetcher.
The KV layer of CockroachDB deals with execution of "requests". The protocol-buffer-based API is defined in api.proto, listing the various types of requests and responses. In practice, the KV's client always sends BatchRequests, a generic request containing a collection of other requests. All requests have a Header which contains routing information (which replica a request is destined for) and transaction information.
Clients "send" KV requests using a client interface (currently this interface is internal, used by SQL, but we might offer it directly to users in some form in the future). This client interface contains primitives for starting a (KV) transaction (remember, the SQL Executor uses this to run every statement in the context of a transaction). Afterwards, a Txn object is available for executing requests in the context of that transaction - this is what the kvBatchFetcher uses. If you trace what happens inside that Txn.Run() method you eventually get to txn.db.sender.Send(..., batch): the request starts percolating through a hierarchy of Senders. Senders have a single method - Send() - which ultimately passes the request to the lower level. Let's go down this "sending" rabbit hole: TxnCoordSender -> DistSender -> Node -> Stores -> Store -> Replica. The first two run on the same node as the one that received the SQL query and is doing the SQL processing (the "gateway node"); the others run on the nodes responsible for the data that is being accessed (the "range node").
客户端使用客户端接口“发送”KV请求(当前此接口是内部的,由SQL使用,但我们可能会在将来以某种形式直接向用户提供)。此客户端接口包含用于启动(KV)事务的原语(请记住,SQL Executor使用它来运行事务上下文中的每个语句)。之后,Txn对象可用于在该事务的上下文中执行请求
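The layering of Senders can be pictured with a small Go sketch. The Sender interface, the Batch type and the generic layer struct below are simplified stand-ins invented for illustration, not CockroachDB's actual definitions; the only point is that each layer does its own work and then hands the batch to the Sender it wraps.

```go
package main

import (
	"fmt"
	"strings"
)

// Batch is a hypothetical stand-in for a BatchRequest.
type Batch struct {
	Requests []string
	Trace    []string // names of the layers the batch has passed through
}

// Sender mirrors the single-method shape described above.
type Sender interface {
	Send(b *Batch) string
}

// layer is a hypothetical Sender that records its name and delegates to next.
type layer struct {
	name string
	next Sender
}

func (l layer) Send(b *Batch) string {
	b.Trace = append(b.Trace, l.name)
	return l.next.Send(b)
}

// replica is the last Sender in the chain: it "executes" the batch.
type replica struct{}

func (replica) Send(b *Batch) string {
	b.Trace = append(b.Trace, "Replica")
	return fmt.Sprintf("executed %d request(s)", len(b.Requests))
}

// buildChain wraps the final Sender in the given layers, outermost first.
func buildChain(names ...string) Sender {
	var s Sender = replica{}
	for i := len(names) - 1; i >= 0; i-- {
		s = layer{name: names[i], next: s}
	}
	return s
}

func main() {
	chain := buildChain("TxnCoordSender", "DistSender", "Node", "Stores", "Store")
	b := &Batch{Requests: []string{"Scan"}}
	fmt.Println(chain.Send(b))
	fmt.Println(strings.Join(b.Trace, " -> "))
}
```

In the real system the layers are distinct types with very different responsibilities, and some hops cross the network; the sketch only shows the wrapping pattern.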
The top-most client.Sender
is the TxnCoordSender
. A TxnCoordSender is responsible for dealing with transactions' state (see the Transaction Management section of the design doc). After a transaction is started, the TxnCoordSender starts asynchronously sending heartbeat messages to that transaction's "txn record", to keep it live. It also keeps track of each written key or key range over the course of the transaction. When the transaction is committed or aborted, it clears accumulated write intents for the transaction. All requests being performed as part of a transaction have to go through the same TxnCoordSender
so that all write intents are accounted for and eventually cleaned up. After performing this bookkeeping, the request is passed to the DistSender
.
最顶层的client.Sender是txnCordSender。txnCordSender负责处理事务的状态(请参阅设计文档的事务管理部分)。事务启动后,txnCordSender开始异步地向该事务的“txn记录”发送心跳消息,以使其保持活动状态。它还可以在事务过程中跟踪每个写入的key或key范围。当事务被提交或中止时,它清除事务的累积写入意图。事务的所有请求都必须经过同一个txncordsender,以便对所有写入意图进行记录,并最终清除。在执行这个簿记之后,请求被传递给DistSender。
The DistSender
is truly a workhorse: it handles the communication between the gateway node and the (possibly many) range nodes, putting the "distributed" in "distributed database". It receives BatchRequest
s, looks at the requests inside the batch, figures out what range each command needs to go to, finds the nodes/replicas responsible for that range, routes the requests there and then collects and reassembles the results.
DistSender确实是一个主力:它处理网关节点和(可能很多)范围节点之间的通信,将“分布式”放在“分布式数据库”中。 它接收BatchRequests,查看批处理内的请求,确定每个命令需要去的范围,找到负责该范围的节点/副本,将请求路由到那里,然后收集并重新组合结果。
Let's go through the code a bit:
我们稍微讨论一下代码:
The request is subdivided into ranges: DistSender.Send()
calls DistSender.divideAndSendBatchToRanges()
which iterates over the constituent ranges of requests by using a RangeIterator
(a single request, such as a ScanRequest
can refer to a key span that might straddle potentially many ranges). A lot of things hide behind this innocent-looking iteration: the cluster's range metadata needs to be accessed in order to find the mapping of keys to ranges (info on this metadata can be found in the Range Metadata section of the design doc). Range metadata is stored as regular data in the cluster, in a two-level index mapping range end keys to descriptors about the replicas of the respective range (the ranges storing this index are called "meta-ranges"). The RangeIterator
logically iterates over these descriptors, in range key order. Brace yourselves: for moving from one range to the next, the iterator calls back into the DistSender
, which knows how to find the descriptor of the range responsible for one particular key. The DistSender
delegates resolving a key to a descriptor to the rangeDescriptorCache (an LRU tree cache, indexed by range end key). This cache becomes desynchronized with reality as ranges in a cluster split or move around; when an entry is discovered to be stale, we'll see below that the DistSender removes it from the cache.
When the cache cannot resolve a key, the lookup goes to the DistSender, which sends a RangeLookupRequest KV command addressed directly to the meta range (so the DistSender is not recursively involved in routing this request).
Each sub-request (partial batch) is sent to its range. This is done through a call to DistSender.sendPartialBatchAsync(), which truncates all the requests in the batch to the current range and then sends the truncated batch to that range. All these partial batches are sent concurrently.
每个子请求(部分批处理)都发送到其范围。 这是通过调用DistSender.sendPartialBatchAsync()完成的,该调用将批处理中的所有请求截断到当前范围,然后将截断的批处理发送到范围。 所有这些部分批次同时发送。
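The "divide and truncate" step can be sketched as follows. This is a minimal illustration under stated assumptions: keys are plain strings, rangeDesc is a pared-down hypothetical descriptor, and the descriptors are handed over as a sorted slice rather than fetched through a cache and meta-ranges.

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a half-open key interval [Start, End).
type Span struct{ Start, End string }

// rangeDesc is a hypothetical, pared-down range descriptor.
type rangeDesc struct {
	ID       int
	StartKey string
	EndKey   string // exclusive
}

// partial pairs a range with the portion of the span that falls inside it.
type partial struct {
	RangeID int
	Span    Span
}

// divide truncates span to each range it overlaps. descs must be sorted by
// EndKey and cover the keyspace contiguously.
func divide(descs []rangeDesc, span Span) []partial {
	// Find the first range whose end key is past the start of the span.
	i := sort.Search(len(descs), func(i int) bool { return descs[i].EndKey > span.Start })
	var out []partial
	for ; i < len(descs) && descs[i].StartKey < span.End; i++ {
		p := span
		if descs[i].StartKey > p.Start {
			p.Start = descs[i].StartKey
		}
		if descs[i].EndKey < p.End {
			p.End = descs[i].EndKey
		}
		out = append(out, partial{RangeID: descs[i].ID, Span: p})
	}
	return out
}

func main() {
	descs := []rangeDesc{{1, "a", "g"}, {2, "g", "p"}, {3, "p", "z"}}
	for _, p := range divide(descs, Span{"c", "r"}) {
		fmt.Printf("r%d: [%s, %s)\n", p.RangeID, p.Span.Start, p.Span.End)
	}
}
```

Indexing by end key is what makes the binary search work: the first descriptor whose end key is greater than the lookup key is the range that contains it.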
sendPartialBatch()
is the level at which errors stemming from stale rangeDescriptorCache
information are handled: the range descriptor that's detected to be stale is evicted from the cache and the partial batch is reprocessed.
sendPartialBatch()是处理从陈旧的rangeDescriptorCache信息产生的错误的级别:检测到过时的范围描述符从缓存中逐出,并且重新处理部分批处理。
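The evict-and-retry behavior can be sketched like this. Everything here is hypothetical scaffolding: descCache is a plain map rather than an LRU tree, errStaleDescriptor stands in for whatever error signals a stale descriptor, and the retry count is bounded to keep the example finite.

```go
package main

import (
	"errors"
	"fmt"
)

// errStaleDescriptor is a hypothetical error signalling that the range we
// addressed no longer matches the cached descriptor.
var errStaleDescriptor = errors.New("range descriptor is stale")

// descCache is a hypothetical key -> range ID cache backed by an
// authoritative (and more expensive) lookup function.
type descCache struct {
	entries map[string]int
	lookup  func(key string) int
	lookups int // number of authoritative lookups performed
}

func (c *descCache) get(key string) int {
	if id, ok := c.entries[key]; ok {
		return id
	}
	c.lookups++
	id := c.lookup(key)
	c.entries[key] = id
	return id
}

func (c *descCache) evict(key string) { delete(c.entries, key) }

// sendPartialBatch sends to the cached range; on a stale-descriptor error it
// evicts the entry and tries again with a freshly looked-up descriptor.
func sendPartialBatch(c *descCache, key string, send func(rangeID int) error, maxRetries int) (int, error) {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		id := c.get(key)
		if err = send(id); !errors.Is(err, errStaleDescriptor) {
			return id, err
		}
		c.evict(key)
	}
	return 0, err
}

func main() {
	truth := 2 // the range that really owns the key
	c := &descCache{entries: map[string]int{"k": 1}, lookup: func(string) int { return truth }}
	send := func(rangeID int) error {
		if rangeID != truth {
			return errStaleDescriptor
		}
		return nil
	}
	id, err := sendPartialBatch(c, "k", send, 3)
	fmt.Println(id, err, c.lookups)
}
```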
Sending a partial batch to a single range implies selecting the right replica of that range and performing an RPC to it. By default, each range is replicated three ways, but only one of the three replicas is the "lease holder" - the temporarily designated owner of that range, in charge of coordinating all reads and writes to it (see the Range Leases section in the design doc). Figuring out which replica has the lease is done through another cache - the leaseHolderCache
, whose information can also get stale.
将部分批处理发送到单个范围意味着选择该范围的正确副本并对其执行RPC。默认情况下,每个范围有三个副本,但三个副本中只有一个是“租赁持有者” - 该范围的临时设计所有者,负责协调对范围的所有读取和写入(请参阅中的范围租赁部分设计doc)。确定哪个副本具有租约是通过另一个缓存完成的 - leaseHolderCache,其信息也可能变得陈旧。
The method of the DistSender
dealing with this is sendSingleRange
. It will use the cache to send the request to the lease holder, but it's also prepared to try the other replicas, in order of "proximity". The replica that the cache says is the leaseholder is simply moved to the front of the list of replicas to be tried and then an RPC is sent to all of them, in order.
DistSender处理此情况的方法是sendSingleRange。它将使用缓存将请求发送给租约持有者,但它也准备按照“邻近”的顺序尝试其他副本。缓存所说的副本是租约持有者,只需将租约持有者移动到尝试副本列表的最前面,然后按顺序将RPC发送给所有副本。
Sending the RPCs is initiated by sendToReplicas
, which sends the request to the first replica and subsequently to the others, until one succeeds or returns a processing error. Processing errors are distinguished from routing errors by handlePerReplicaError which, among other things, handles wrong information in the leaseHolderCache
.
发送RPC由sendToReplicas启动,sendToReplicas将请求发送到第一个副本,然后发送到另一个副本,直到成功或返回处理错误。处理错误通过handlePerReplicaError与路由错误区分开来,其中sendReplicaError处理leaseHolderCache中的错误信息。
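The two steps above (put the presumed lease holder first, then try replicas in order until one gives a definitive answer) can be sketched as below. The function names echo the ones in the text, but the signatures, the string replica IDs and the errRouting marker are assumptions made for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// errRouting is a hypothetical marker for "wrong replica" style failures.
var errRouting = errors.New("routing error")

// orderReplicas moves the cached lease holder guess (if present) to the
// front, keeping the remaining replicas in their existing (proximity) order.
func orderReplicas(replicas []string, leaseHolder string) []string {
	out := make([]string, 0, len(replicas))
	for _, r := range replicas {
		if r == leaseHolder {
			out = append(out, r)
		}
	}
	for _, r := range replicas {
		if r != leaseHolder {
			out = append(out, r)
		}
	}
	return out
}

// sendToReplicas tries each replica in order. Routing errors move on to the
// next replica; success or any other (processing) error ends the loop.
func sendToReplicas(replicas []string, rpc func(replica string) (string, error)) (string, error) {
	err := errors.New("no replicas to try")
	for _, r := range replicas {
		var resp string
		resp, err = rpc(r)
		if err == nil || !errors.Is(err, errRouting) {
			return resp, err
		}
	}
	return "", err
}

func main() {
	replicas := orderReplicas([]string{"n1", "n2", "n3"}, "n3")
	resp, err := sendToReplicas(replicas, func(r string) (string, error) {
		if r != "n2" { // pretend the cache was stale and n2 really holds the lease
			return "", fmt.Errorf("%s is not the lease holder: %w", r, errRouting)
		}
		return "served by " + r, nil
	})
	fmt.Println(replicas, resp, err)
}
```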
Actually sending the RPCs is hidden behind the Transport interface
. Concretely, grpcTransport.SendNext()
does gRPC calls to the nodes containing the destination replicas, namely to a service implementing the Internal
service.
实际上,发送RPC隐藏在传输接口后面。具体而言,grpcTransport.SendNext()对包含目标副本的节点执行gRPC调用,即对实现内部服务的服务执行gRPC调用。
The (async) responses from the different replicas are combined into a single BatchResponse
, which is ultimately returned from the Send()
method.
来自不同副本的(异步)响应组合成单个BatchResponse,最终从Send()方法返回。
We've now gone through the relevant things that happen on the gateway node. Further, we're going to look at what happens on the "remote" side - on each of the ranges.
我们现在已经了解了网关节点上发生的相关事情。此外,我们将在每个范围节点内查看“远程”侧发生的情况。
RPC server - Node and Stores RPC服务器—节点和存储
We've seen how the DistSender
splits BatchRequest
into partial batches, each containing commands local to a single replica, and how these commands are sent to the lease holders of their ranges through RPCs. We're now moving to the "server" end of these RPCs. The struct that implements the RPC service is Node
. The Node
doesn't do anything of great relevance; it delegates the request to its Stores
member which represents a collection of "stores" (on-disk databases imagined to be one per physical disk, see the Architecture section of the design doc). The Stores
implements the Sender
interface, just like the gateway layers that we've seen before, resuming the pattern of wrapping another Sender
and passing requests down through the Send()
method.
我们已经看到distssender如何将batchrequest拆分为部分批,每个批包含单个副本的本地命令,以及如何通过rpcs将这些命令发送给其范围的租用持有者。我们现在要转到这些RPC的“服务器”端。实现RPC服务的结构是节点。节点不做任何有意义的事情;它将请求委托给它的Stores存储成员,该成员代表一组“存储”(在磁盘上,数据库设计为每个物理磁盘一个,请参阅设计文档的架构部分)。Stores实现sender接口,就像我们以前看到的网关层一样,恢复包装另一个sender的模式,并通过send()方法向下传递请求。
Stores.Send()
identifies which particular store contains the destination replica (based on routing info filled into the request by the DistSender
) and routes the request there. One interesting thing that the Stores
does, in case requests from the current transaction have already been processed on this node, is update the upper bound on the uncertainty interval to be used by the current request (see the "Choosing a Timestamp" section of the design doc for details on uncertainty intervals). The uncertainty interval dictates which timestamps for values are ambiguous because of clock skew between nodes (the values for which we don't know if they were written before or after the serialization point of the current txn). This code realizes that, if a request from the current txn has been processed on this node before, no value written after that node's timestamp at the time of that other request processing is ambiguous.
Stores.Send()标识哪个特定存储包含目标副本(基于DistsSender填充到请求中的路由信息),然后路由请求到那个存储。存储做的一件有趣的事情是,如果当前事务的请求已经在此节点上处理,则更新当前请求使用的不确定度间隔的上限(有关不确定度间隔的详细信息,请参阅设计文档的“选择时间戳”部分)。不确定间隔指示哪些时间戳的值不明确,因为节点之间存在时钟偏差(不知道这些值是在当前txn的序列化点之前还是之后写入的)。此代码认识到,如果以前在该节点上处理过来自当前txn的请求,那么在该节点的时间戳之后写入的值在其他请求处理时是不明确的。
A Store
represents one physical disk device. For our purposes, a Store
mostly delegates the request to a replica
, but it has one important role - in case the request runs into "write intents" (i.e. uncommitted values), it deals with those intents. This handles read-write and write-write conflicts between transactions. Notice that the code calling the replica
is inside a big infinite retry loop and that a bunch of the code inside it deals with WriteIntentError
. When we see such an error, we try to "resolve" it using the intentResolver
. Resolving means figuring out if the transaction to which the intent belongs is still pending (it might already be committed or aborted, in which case the intent is "resolved"), or possibly "pushing" the transaction in question (forcing it to restart at a higher timestamp, such that it doesn't conflict with the current txn). If the conflicting txn is no longer pending or if it was pushed, then the intents can be properly resolved (i.e. either replaced by a committed value, or simply discarded).
Store表示一个物理磁盘设备。出于我们的目的,Store主要将请求委托给副本,但它有一个重要的角色 - 如果请求遇到“写入意图”(即未提交的值),它会处理这些意图。这会处理事务之间的读写和写写冲突。请注意,调用副本的代码位于一个大的无限重试循环中,并且其中的一堆代码处理WriteIntentError。当我们看到这样的错误时,我们尝试使用intentResolver“解决”它。解决意味着确定意图所属的事务是否仍然挂起(它可能已经提交或中止,在这种情况下意图被“解决”),或者可能“推动”有问题的事务(强制它在更高的时间戳重新启动,使其不与当前的txn冲突)。如果冲突的txn不再挂起或者它被推动,则可以正确地解决意图(即,由提交的值替换,或者简单地丢弃)。第一部分 - 搞清楚txn状态或推动它
The first part - figuring out the txn status or pushing it - happens in intentResolver.maybePushTransaction
: we can see that a series of PushTxnRequest
s are batched and sent to the cluster (meaning the hierarchy of Sender
s on the current node will be used, top to bottom, to route the requests to the various transaction records - see the "Transaction execution flow" section of the design doc). In case the transaction we're trying to push is still pending, the decision about whether or not the push is successful is made deep in the processing of the PushTxnRequest (several levels below the Store level we're discussing here, in the stack for the spun-off PushTxnRequest) based on the relative priorities of the pusher/pushee txns. The second part - replacing the intents that can now be resolved - is done through a call to intentResolver.resolveIntents
. Back where we left off in Store.Send()
, the call to the intentResolver
, if successful, will change the resolved
field of the WriteIntentError
which will cause us to retry immediately. Otherwise, we'll retry according to an exponential backoff, waiting for the still pending transaction that we couldn't push to complete - we don't want to retry too soon, as we'd almost surely run into the same intent again (we're working to replace this "polling"-based mechanism for waiting for a conflicting txn to finish with something more reactive).
第二部分 - 替换现在可以解决的意图,是通过调用intentResolver.resolveIntents完成的。 回到我们在Store.Send()中停止的地方,对intentResolver的调用,如果成功,将更改WriteIntentError的resolved字段,这将导致我们立即重试。 否则,我们将根据指数退避重试,等待我们无法推动的未决的事务 - 我们不想太快重试,因为我们几乎肯定会再次遇到相同的意图( 等待冲突的txn完成时,我们正在努力用反应机制取代这种“轮询”机制。)
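The shape of that retry loop (retry at once if the intent was resolved, back off exponentially if the conflicting transaction is still pending) can be sketched as follows. The writeIntentError type, the callbacks and the maxAttempts bound are all assumptions made for illustration; the loop described above is unbounded and its backoff parameters are not specified here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// writeIntentError is a hypothetical stand-in for the error described above.
type writeIntentError struct {
	resolved bool // set once the conflicting intent has been dealt with
}

func (e *writeIntentError) Error() string { return "conflicting write intent" }

// sendWithRetry calls exec until it stops returning a writeIntentError.
// resolve reports whether the intent could be resolved (txn finished or pushed).
func sendWithRetry(exec func() error, resolve func() bool, maxAttempts int) (attempts int, err error) {
	backoff := time.Millisecond
	for attempts = 1; attempts <= maxAttempts; attempts++ {
		err = exec()
		var wiErr *writeIntentError
		if !errors.As(err, &wiErr) {
			return attempts, err // success or an unrelated error
		}
		wiErr.resolved = resolve()
		if wiErr.resolved {
			continue // the obstacle is gone: retry immediately
		}
		time.Sleep(backoff) // still pending: wait before running into it again
		backoff *= 2
	}
	return maxAttempts, err
}

func main() {
	pending := 2 // the conflicting txn stays pending for two attempts
	exec := func() error {
		if pending > 0 {
			return &writeIntentError{}
		}
		return nil
	}
	resolve := func() bool {
		pending--
		return pending == 0
	}
	attempts, err := sendWithRetry(exec, resolve, 10)
	fmt.Println(attempts, err)
}
```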
A Replica
represents one copy of a range, which in turn is a contiguous keyspace managed by one instance of the Raft consensus algorithm. The system tries to keep ranges around 64MB, by default. The Replica
is the final Sender
in our hierarchy. The role of all the other Sender
s was, mostly, to route requests to the Replica
currently acting as the lease holder for the range (a primus inter pares Replica
that takes on a bunch of coordination responsibilities we'll explore below). A replica deals with read requests differently than write requests. Reads are evaluated directly, whereas writes will enter another big chapter in their life and go through the Raft consensus protocol.
一个副本表示范围的一个副本,而该副本又是一个由raft共识算法的一个实例管理的连续键空间。默认情况下,系统尝试将范围保持在64MB左右。副本是我们层次结构中的最终发送者。所有其他发送者的作用主要是将请求路由到当前充当范围的租约持有的副本(一个主要副本,承担了我们将在下面探讨的一系列协调责任)。副本处理读请求的方式与处理写请求的方式不同。读取被直接评估,而写作将进入他们生命的另一个大篇章,并通过RAFT共识协议。
The difference between the paths of read requests vs write requests is seen immediately: replica.Send()
quickly branches off based on the request type. We'll talk about the read/write paths in turn.
读取请求和写入请求的路径之间的差异立即显示:replica.send()根据请求类型快速分支。我们将依次讨论读/写路径。
The first thing that is done for a read request is checking if the request got to the right place (i.e. the current replica is the lease holder); remember that a lot of the routing was done based on caches or outright guesses. This check is performed by replica.redirectOnOrAcquireLease()
, a rabbit hole in its own right. Let's just say that, in case the current replica is not the lease holder, redirectOnOrAcquireLease
either redirects to the lease holder, if there is a valid lease (remember that the DistSender
will handle such redirections), or requests a new lease otherwise, in the hope that it will become the lease holder. Requesting a lease is done through the pendingLeaseRequest
helper struct, which coalesces multiple requests for the same lease and eventually constructs a RequestLeaseRequest
and sends it for execution directly to the replica (as we've seen in other cases, bypassing all the senders to avoid recursing infinitely). In case a lease is requested, redirectOnOrAcquireLease
will wait for that request to complete and check if it was successful.
对读取请求所做的第一件事是检查请求是否到达正确的位置(即当前副本是租约持有者);请记住,很多路由都是基于缓存或正确的猜测完成的。这项检查由replica.redirectOnOrAcquireLease()执行,这是一个独立的兔子洞。我们只是说,如果当前副本不是租约持有者,redirectOnOrAcquireLease重定向到租约持有者,如果存在有效租约(请记住DistSender将处理此类重定向),或者不存在请求新租约,总之,希望它将成为租约持有者。请求租约是通过pendingLeaseRequest帮助器结构完成的,该结构为同一个租约合并多个请求,并最终构造一个RequestLeaseRequest并将其直接发送到副本(正如我们在其他情况下所见,绕过所有发送器以避免无限地递归)。如果请求租约,redirectOnOrAcquireLease将等待该请求完成并检查它是否成功。
Once the lease situation has been settled, the next thing to do for the read is synchronizing it with possible in-flight writes - if a write to an overlapping key span is in progress, the read might need to see its value, so we can't race with it; we must wait until the write is done. This synchronization is done through the CommandQueue
struct - an interval tree maintaining all the in-flight requests, indexed by the key or span of keys that they touch. Waiting for the writes is done inside replica.beginCmds()
. Notice that immediately after figuring out which commands we need to wait for, we atomically add the current read to the command queue in order to block future writes. This overlaps in spirit with the use of the TimestampCache
structure described below and in fact there is a proposal for not putting reads in the queue. Removal of commands from the queue is done later through the callback returned by beginCmds
. This epilogue also does something else that's important: it records the read in the TimestampCache, a bounded in-memory cache from key range to the latest timestamp at which it was read. This structure serves to protect against violations of the Snapshot Isolation transaction isolation level (the lowest that CockroachDB provides), which requires that the outcome of reads be preserved, i.e. a write of a key at a lower timestamp than a previous read must not succeed (see the Read-Write Conflicts – Read Timestamp Cache section in Matt's blog post). As we'll see in the writes section, writes consult this structure to make sure they're not writing "under" a read that has already been performed.
一旦解决了租用情况,接下来要做的就是将读取与可能正在进行的写入进行同步-如果正在对重叠的键跨度进行写入,则读取可能需要查看其值,因此我们无法与之竞争;我们必须等待写入完成。这种同步是通过CommandQueue结构完成的,该结构是一个间隔树,用于维护所有正在进行的请求,并通过它们所接触的键或键的跨度进行索引。等待写入操作在replica.beginCmds()内完成。请注意,在确定需要等待哪些命令之后,我们会自动地将当前读取添加到命令队列中,以阻止将来的写入。这在精神上与下面描述的TimestampCache结构的使用重叠,事实上,有一个建议不将读取放入队列。从队列中删除命令是稍后完成通过BeginCmds返回的回调。这篇结语还做了一些其他重要的事情:它在TimeStampCache中记录了读取,这是一个内存缓存,从键范围到读取它的最新时间戳。此结构用于防止违反快照隔离事务隔离级别(蟑螂数据库提供的最低级别),该级别要求必须保留读取结果,即在低于以前读取时间戳的时间戳处写入key不得成功(请参阅读取写入冲突-读取时间戳缓存部分In马特的博客帖子)。正如我们将在“写入”部分看到的,写入操作参考此结构,以确保它们不会在已执行的“读取”下写入。
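The rule that the timestamp cache enforces can be shown with a toy version. This sketch assumes integer timestamps and single keys, whereas the structure described above is bounded and indexed by key range; tsCache and its methods are names made up for this example. It also assumes a write at exactly the read's timestamp is bumped as well.

```go
package main

import "fmt"

// tsCache is a hypothetical, unbounded per-key version of the structure
// described above.
type tsCache struct {
	lastRead map[string]int64
}

func newTSCache() *tsCache { return &tsCache{lastRead: map[string]int64{}} }

// recordRead remembers the highest timestamp at which key was read.
func (c *tsCache) recordRead(key string, ts int64) {
	if ts > c.lastRead[key] {
		c.lastRead[key] = ts
	}
}

// writeTimestamp returns the timestamp a write to key may use: the proposed
// one if it is above every recorded read, otherwise one tick past that read.
func (c *tsCache) writeTimestamp(key string, proposed int64) int64 {
	if read := c.lastRead[key]; proposed <= read {
		return read + 1
	}
	return proposed
}

func main() {
	c := newTSCache()
	c.recordRead("k", 10)
	fmt.Println(c.writeTimestamp("k", 7))  // bumped past the read at 10
	fmt.Println(c.writeTimestamp("k", 12)) // already above the read
}
```

The write at 7 cannot be allowed to land "under" the read at 10, because that read has already returned a result that did not include it.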
Now we're ready to actually evaluate the read - control moves to replica.evaluateBatch() which calls replica.evaluateCommand for each request in the batch. evaluateCommand switches over the request types using a helper request-to-method map and passes execution to the request-specific method. One typical read request is a ScanRequest; this is evaluated by evalScan. The code is very brief - it immediately calls a corresponding function on the engine - a handle to the on-disk RocksDB database. Before we dig a bit into this engine, let's look at what evalScan will do next: it will return intents to the higher levels. These are intents that the scanning encountered, but they didn't prohibit it from continuing (e.g. intents with a timestamp higher than the timestamp at which we're reading - the read doesn't care if those intents are committed or not); this is in contrast with intents that do block the read - those, as we'll see below, are transformed into WriteIntentErrors which, as we've seen, are handled by the Store. These non-interfering intents are collected for cleanup purposes - they might be garbage left over by dead transactions and we want to proactively clean them up. They're returned up the stack until replica.addReadOnlyCmd tries to clean them up using our old friend, the intentResolver.
现在我们正在阅读以实际评估读取 - 控制移动到replica.evaluateBatch(),它为批处理中的每个请求调用了reprelica.evaluateCommand。 evaluateCommand使用helper request to method map切换请求类型,并将执行传递给特定的请求方法。一个典型的读取请求是ScanRequest;这是由evalScan评估的。代码非常简短 - 它立即调用引擎上的相应代码 - 磁盘上RocksDB数据库的句柄。在我们深入研究这个引擎之前,让我们看一下evalScan下一步会做什么:它会将意图返回到更高级别。这些是扫描遇到的意图,但它们并没有禁止它继续(例如时间戳高于我们正在读取的时间戳的意图 - 读取并不关心这些意图是否被提交);这与阻止读取的意图形成对比 - 正如我们将在下面看到的那样,它们被转换为WriteIntentErrors,我们已经看到它们由Store处理。这些非干扰意图是为了清理目的而收集的 - 它们可能是死交易留下的垃圾,我们希望主动清理它们。它们被返回堆栈,直到replica.addReadOnlyCmd尝试使用我们的老朋友,intentResolver来清理它们。
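The "request to method map" dispatch is a common Go pattern and can be sketched as below. The request struct, the in-memory store and the two handlers are hypothetical; only the idea of looking up an evaluation function by request type comes from the text.

```go
package main

import "fmt"

// request is a hypothetical, minimal KV request.
type request struct {
	Method string
	Key    string
	Value  string
}

// store is an in-memory stand-in for the engine.
type store map[string]string

// evalFunc is the signature shared by all request-specific methods here.
type evalFunc func(s store, r request) (string, error)

// commands maps each request type to the function that evaluates it.
var commands = map[string]evalFunc{
	"Get": func(s store, r request) (string, error) { return s[r.Key], nil },
	"Put": func(s store, r request) (string, error) { s[r.Key] = r.Value; return "", nil },
}

// evaluateCommand looks up the handler for the request type and runs it.
func evaluateCommand(s store, r request) (string, error) {
	eval, ok := commands[r.Method]
	if !ok {
		return "", fmt.Errorf("unrecognized request type %q", r.Method)
	}
	return eval(s, r)
}

// evaluateBatch evaluates every request in order, stopping at the first error.
func evaluateBatch(s store, batch []request) ([]string, error) {
	var out []string
	for _, r := range batch {
		resp, err := evaluateCommand(s, r)
		if err != nil {
			return nil, err
		}
		out = append(out, resp)
	}
	return out, nil
}

func main() {
	s := store{}
	out, err := evaluateBatch(s, []request{{"Put", "a", "1"}, {"Get", "a", ""}})
	fmt.Println(out, err)
}
```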
We're getting to the bottom of the CockroachDB stack - the Engine
is an interface abstracting away different on-disk stores. The only implementation we currently use is RocksDB
, which is a wrapper around the RocksDB C++ library. We won't go into this wrapper other than to say that it uses cgo for interfacing with C++ code. We also won't go into the RocksDB code which, although it's obviously an important part of servicing a request, is not something that CockroachDB devs generally deal with.
我们已经到了CockroachDB堆栈的底部 - Engine是一个抽象出不同的磁盘存储的接口。我们目前使用的唯一实现是RocksDB,它是RocksDB C ++库的包装器。除了说它使用cgo与C ++代码连接之外,我们不会进入这个包装器。我们也不会进入RocksDB代码,尽管它显然是服务请求的重要部分,但并不是CockroachDB开发人员通常会处理的内容。
For reads, the entry point into the engine
package is mvccScanInternal()
. This performs a scan over the KV database, dealing with the data representation we use for MultiVersion Concurrency Control (MVCC). It iterates over the key/vals of the requested range and appends each one to the results. The MVCC details, such as the fact that we keep multiple versions of each key (for different timestamps) and the intents, are handled by MVCCIterate()
, which uses an iterator provided by the Engine
to scan over key/vals. It delegates reading key/vals and advancing the iterator to mvccGetInternal()
.
对于读取,引擎包的入口点是mvccScanInternal()。这将对KV数据库执行扫描,处理我们用于MultiVersion并发控制(MVCC)的数据表示。它迭代所请求范围的键/值,并将每个键附加到结果中。 MVCC的详细信息,例如我们保留每个键的多个版本(针对不同的时间戳)和意图的事实,由MVCCIterate()处理,它使用Engine提供的迭代器来扫描键/值。它委托读取键/值l并将迭代器推进到mvccGetInternal()。
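The core MVCC idea, several timestamped versions per key with a read seeing the newest version at or below its timestamp, can be sketched with an in-memory map. This is an illustration only: mvccStore and its methods are made-up names, intents are left out entirely, and the real code works on an ordered on-disk key encoding rather than a Go map.

```go
package main

import (
	"fmt"
	"sort"
)

// version is one timestamped value of a key.
type version struct {
	TS    int64
	Value string
}

// mvccStore keeps, for every key, its versions sorted by descending timestamp.
type mvccStore map[string][]version

func (s mvccStore) put(key string, ts int64, value string) {
	vs := append(s[key], version{ts, value})
	sort.Slice(vs, func(i, j int) bool { return vs[i].TS > vs[j].TS })
	s[key] = vs
}

// get returns the newest version of key written at or below ts.
func (s mvccStore) get(key string, ts int64) (string, bool) {
	for _, v := range s[key] {
		if v.TS <= ts {
			return v.Value, true
		}
	}
	return "", false
}

// scan returns the visible value of every key in [start, end), in key order.
func (s mvccStore) scan(start, end string, ts int64) []string {
	var keys []string
	for k := range s {
		if k >= start && k < end {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	var out []string
	for _, k := range keys {
		if v, ok := s.get(k, ts); ok {
			out = append(out, k+"="+v)
		}
	}
	return out
}

func main() {
	s := mvccStore{}
	s.put("a", 5, "a@5")
	s.put("a", 9, "a@9")
	s.put("b", 7, "b@7")
	fmt.Println(s.scan("a", "c", 8)) // the version of "a" at 9 is invisible at 8
}
```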
Write requests are conceptually more interesting than reads because they're not simply serviced by one node/replica. Instead, they go through the Raft consensus algorithm, which maintains an ordered commit log, and are then applied by all of a range's replicas (see the Raft section of the design doc for more details). The replica that initiates this process is, just like in the read case, the lease holder. Execution on this lease holder is thus broken into two stages - before ("upstream of" in code) Raft and below ("downstream of") Raft. The upstream stage will eventually block waiting for the corresponding Raft command to be applied locally (after the command has been applied locally, future reads are guaranteed to see its effects).
写请求在概念上比读更有趣,因为它们不只是由一个节点/副本提供服务。相反,它们通过raft共识算法,该算法维护一个有序的提交日志,然后日志由范围的所有副本应用(更多详细信息,请参见设计文档的raft部分)。启动这个过程的副本,和读取一样,是租约持有者。因此,在租约持有者的执行分为两个阶段:Raft前(“代码中的上游”)和Raft后(“代码中的下游”)。上游阶段最终会阻止相应的Raft命令在本地应用(命令在本地应用后,未来的读取保证看到其效果)。
For what follows we'll introduce some terminology. We've already seen that a replica
(and the KV subsystem in general) receives requests. In what follows, these requests will be evaluated, which transforms them to Raft commands. The commands in turn are proposed to the Raft consensus group and, after the Raft group accepts the proposals and commits them, control comes back to the Replica
s (all of the replicas of a range this time, not just the lease holder), which apply them.
接下来我们将介绍一些术语。 我们已经看到副本(以及一般的KV子系统)接收请求。 接下来,将评估这些请求,将其转换为Raft命令。 这些命令反过来被提议给Raft共识小组,并且在Raft小组接受提议并提交它们之后,控制权返回到所有副本(这次是范围的所有复制品,而不仅仅是租赁持有者),所有副本应用提议。
Execution of write commands, mirroring the reads, starts in replica.addWriteCmd()
. This method just contains a retry loop that deals with exceptional cases in which requests need to be evaluated repeatedly and delegates to replica.tryAddWriteCmd
. This guy does a number of things:
执行写命令,镜像读取,从replica.addWriteCmd()开始。 此方法只包含一个重试循环,用于处理需要重复计算请求并委派给replica.tryAddWriteCmd的异常情况。 这家伙做了很多事情:
It waits until overlapping in-flight requests are done and adds the current one as an in-flight request to the CommandQueue
(similar to the reads).
It checks that the current replica is the lease holder by calling the redirectOnOrAcquireLease
method, just like the reads.
It "applies the timestamp cache" - this means that the TimestampCache we've discussed above is checked to see if the write can proceed at the timestamp at which it's trying to modify the database. If it can't (because there's been a more recent overlapping read), the write's timestamp is bumped to a timestamp later than any overlapping read.
这意味着检查我们上面讨论过的TimestampCache,看看写入是否可以在它尝试修改该数据库的时间戳进行。 如果它不能(因为最近有重叠读取),写入的时间戳会比任何重叠读取的时间戳晚。
It evaluates the request and proposes resulting Raft commands. It all starts with a call to replica.propose(). We'll describe the process below (it will be fun) but, before we do, let's see what the current method will do afterwards.
它评估请求并提出Raft命令结果。 这一切都始于对replica.propose()的调用。 我们将在下面描述这个过程(它会很有趣)但是,在我们开始之前,让我们看看当前的方法将会做些什么。
The call to replica.propose
returns a channel that we'll wait on. This is the decoupling point that we've anticipated above - the point where we cede control to the Raft machinery. The replica
doing the proposals accepts its role as merely one of many replicas and waits for the consensus group to make progress in lock-step. The channel will receive a result when the (local) replica has applied the respective commands, which can happen only after the commands have been committed to the shared Raft log (a global operation).
对replica.propose的调用返回了一个我们将等待的通道。 这是我们上面所预期的脱钩点 - 我们将控制权交给Raft机制的地方。 执行提案的副本仅接受其作为许多副本之一的角色,并等待共识小组在锁定步骤中取得进展。 当(本地)副本应用了相应的命令时,通道将接收结果,这可能仅在命令已提交到共享的Raft日志(全局操作)之后才会发生。
As in the reads case, at the end of the tryAddWriteCmd
method, an epilogue will remove the request from the CommandQueue
and add it to the timestamp cache.
与在读取案例一样,在tryAddWriteCmd方法的末尾,结尾将从CommandQueue中删除请求并将其添加到时间戳缓存中。
Evaluation of requests and application of Raft commands 评估Raft命令的请求和应用
As promised, let's see what happens inside replica.propose()
. The first thing is the process of evaluation, i.e. turning a KV request into a Raft command. This is done through the call to requestToProposal()
, which quickly calls evaluateProposal()
, which in turn quickly calls the surprisingly-named applyRaftCommandInBatch
. This last method simulates the execution of the request, if you will, and records all the would-be changes to the Engine
into a "batch" (these batches are how RocksDB models transactions). This batch will be serialized into a Raft command. If we were to commit this batch now, the changes would be live, but just on this one replica, which would be a potential data consistency violation. Instead, we abort it. It will be resurrected when the command "comes out of Raft", as we'll see.
正如所承诺的,让我们看看replica.propose()中发生了什么。 首先是评估过程,即将KV请求转换为Raft命令。 这是通过调用requestToProposal()来完成的,该调用快速调用evaluateProposal(),后者又快速调用名为applyRaftCommandInBatch的命名。 如果愿意,最后一种方法模拟请求的执行,并将引擎的所有可能更改记录到“批处理”中(这些批次是RocksDB模型事务的方式)。 该批次将被序列化为Raft命令。 如果我们现在提交此批处理,则更改将是实时的,但仅在此一个副本上,这将是潜在的数据一致性违规。 相反,我们放弃它。 正如我们所看到的,当命令“从Raft里出来”时它会再次复活。
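The "record the changes, serialize them, but don't commit" idea can be sketched as follows. This is a loose analogy under stated assumptions: the engine is a Go map, the batch type is invented for this example, and JSON stands in for whatever binary representation the real batch uses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// engine is an in-memory stand-in for the on-disk store.
type engine map[string]string

// batch records writes without touching the engine until it is applied.
type batch struct {
	Writes map[string]string
}

func newBatch() *batch { return &batch{Writes: map[string]string{}} }

func (b *batch) put(key, value string) { b.Writes[key] = value }

// encode serializes the recorded writes; this plays the role of the payload
// carried by the Raft command.
func (b *batch) encode() ([]byte, error) { return json.Marshal(b) }

// applyEncoded decodes a payload and commits its writes to the engine.
func applyEncoded(e engine, payload []byte) error {
	var b batch
	if err := json.Unmarshal(payload, &b); err != nil {
		return err
	}
	for k, v := range b.Writes {
		e[k] = v
	}
	return nil
}

func main() {
	e := engine{}
	b := newBatch()
	b.put("k", "v") // evaluation: the write is only recorded
	payload, err := b.encode()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(e)) // the batch was never committed, so the engine is untouched
	if err := applyEncoded(e, payload); err != nil {
		panic(err)
	}
	fmt.Println(e["k"]) // after application the write is visible
}
```

Because only the serialized payload travels through consensus, every replica can apply exactly the same set of changes without re-running the request.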
The simulation part takes place inside the executeWriteBatch()
method. This takes in the roachpb.BatchRequest
(the KV request we've been dealing with all along), allocates an engine.Batch
and delegates to evaluateBatch()
. This fellow finally iterates over the individual requests in the batch and, for each one, calls evaluateCommand
. We've seen evaluateCommand
before, on the read path. It switches over the different types of requests and calls a method specific to each type. One such method would be evalPut
, which writes a value for a key. Inside it we'll see a call to the engine to perform this write (but remember, it's all performed inside a RocksDB transaction, the engine.Batch
).
模拟部分发生在executeWriteBatch()方法中。 这会引入roachpb.BatchRequest(我们一直处理的KV请求),分配engine.Batch并委托给evaluateBatch()。 这个伙伴最终遍历批处理中的各个请求,并为每个请求调用evaluateCommand。 我们之前在读取路径上看到了evaluateCommand。 它切换不同类型的请求,并调用特定于每种类型的方法。 一种这样的方法就是beevalPut,它为Key写入一个值。 在其中我们将看到对引擎的调用以执行此写入(但请记住,它们都在RocksDB事务中执行,即engine.Batch)。
This was all for the purposes of recording the engine changes that need to be proposed to Raft. Let's unwind the stack to replica.propose
(the method that started this section), and see what happens with the result of requestToProposal
. For one, it gets inserted into the "pending proposals map" - a structure that will make the connection between a command being applied and tryAddWriteCmd
which will be blocked on a channel waiting for the local application. More importantly, it gets passed to replica.submitProposalLocked
which, eventually, calls raftGroup.Propose()
. This raftGroup
is a handle to a consensus group, implemented by the Etcd Raft library, a black box to which we submit proposals to have them serialized through majority voting into a coherent distributed log. This library is responsible for passing the commands in order to all the replicas for application.
这一切都是为了引擎记录的改变需要向Raft提出。 让我们展开堆栈replica.propose(本节开始的方法),看看requestToProposal的结果会发生什么。 例如,它被插入到“待定提议map”中 - 这个结构将在正在应用的命令和tryAddWriteCmd之间建立连接,这将在等待本地应用程序的通道上被阻塞。 更重要的是,它传递给replica.submitProposalLocked,最终调用raftGroup.Propose()。 这个raftGroup是一个共识组的句柄,由Etcd Raft库实现,我们提交了一个黑盒子,通过多数投票将它们序列化为连贯的分布式日志。 该库负责将命令传递给应用程序的所有副本。
This concludes the discussion of the part specific to the lease holder replica: how commands are proposed to Raft and how the lease holder is waiting for them to be applied before returning a reply to the (KV) client. What's missing is the discussion on how exactly they are applied.
以上是对租约持有者副本特定部分的讨论:如何向Raft提出命令,以及在向(KV)客户端返回答复之前,租赁持有者如何等待它们被应用。 缺少的是关于它们如何应用的讨论。
Raft command application Raft命令应用程序
We've seen how commands are "put into" Raft. But how do they "come out"? The Etcd Raft library implements a distributed state machine whose description is beyond the present scope. Suffice it to say that we have a raftProcessor interface that state transitions from this library call into. Our old friend the Store implements this interface and the important method is Store.processReady(). This will eventually call back into a specific replica (the replica of a range that's being modified by each command), namely it will call handleRaftReadyRaftMuLocked. This will iterate through newly committed commands, calling processRaftCommand for each one. This will in turn take the serialized engine.Batch and call replica.applyRaftCommand with it. Here the batch is deserialized and applied to the engine and, this time, unlike on the proposer side in applyRaftCommandInBatch, the changes are actually committed to storage. The command has now been applied (on one particular replica, but keep in mind that the process described in this section happens on every replica).
我们已经看到命令是如何“放入”Raft的。但他们如何“走出去”? Etcd Raft库实现了一个分布式状态机,其描述超出了当前的范围。我只想说我们有一个raftProcessor接口,它说明从这个库调用转换为。我们的老朋友Store实现了这个接口,重要的方法是Store.processReady()。这最终会回调到一个特定的副本(每个命令正在修改的范围的副本),即它将调用handleRaftReadyRaftMuLocked。这将遍历新提交的命令,为每个命令调用processRaftCommand。这将采用序列化的engine.Batch并用它调用replica.applyRaftCommand。这里批处理被反序列化并应用于引擎,这次,与applyRaftCommandInBatch中的建议方不同,这些更改实际上已提交给存储。该命令现已应用(在一个特定副本上,但请记住,本节中描述的过程发生在每个副本上)。
We've glossed over something in processRaftCommand
that's important: after applying the command, if the current replica is the proposer (i.e. the lease holder), we need to signal the proposer (which, as we saw in the previous section, is blocked in tryAddWriteCmd
). This happens at the very end. We've now come full circle - the proposer will now be unblocked and receive a response on the channel it was waiting on, and it can unwind the stack letting its client know that the request is complete. This reply can travel through the hierarchy of Sender
s, back from the lease holder node to the SQL gateway node, to a SQL tree of planNode
s, to the SQL Executor, and, through the pgwire
implementation, to the SQL client.
我们掩盖了processRaftCommand中的一些重要信息:在应用命令之后,如果当前副本是提议者(即租约持有者),我们需要发信号通知提交者(正如我们在上一节中看到的那样,在tryAddWriteCmd中被阻止))。 这发生在最后。 我们现在已经完全循环 - 提议者现在将被解除阻塞并在它正在等待的频道上收到响应,它可以解除堆栈,让其客户端知道请求已完成。 此回复可以通过发件人的层次结构,从租约持有者节点返回到SQL网关节点,到planNodes的SQL树,到SQL Executor,以及通过pgwire实现到SQL客户端。
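The handshake between the proposer and the apply path (a pending-proposals map keyed by command ID, with a channel per waiting proposer) can be sketched like this. The replica struct here is a hypothetical miniature: a Go channel plays the role of the Raft log, there is a single replica, and no consensus takes place.

```go
package main

import (
	"fmt"
	"sync"
)

// command is a hypothetical Raft command carrying one write.
type command struct {
	ID         int
	Key, Value string
}

// replica is a hypothetical, minimal proposer and applier.
type replica struct {
	mu      sync.Mutex
	nextID  int
	pending map[int]chan string // proposals awaiting local application
	raftLog chan command        // stand-in for the Raft library
	state   map[string]string
}

// propose registers the command as pending, hands it to "Raft" and returns
// the channel on which the result of applying it will be delivered.
func (r *replica) propose(key, value string) <-chan string {
	r.mu.Lock()
	r.nextID++
	id := r.nextID
	done := make(chan string, 1)
	r.pending[id] = done
	r.mu.Unlock()
	r.raftLog <- command{ID: id, Key: key, Value: value}
	return done
}

// applyLoop applies committed commands in log order and signals any local waiter.
func (r *replica) applyLoop() {
	for cmd := range r.raftLog {
		r.mu.Lock()
		r.state[cmd.Key] = cmd.Value
		done, ok := r.pending[cmd.ID]
		delete(r.pending, cmd.ID)
		r.mu.Unlock()
		if ok {
			done <- "applied " + cmd.Key
		}
	}
}

func newReplica() *replica {
	r := &replica{pending: map[int]chan string{}, raftLog: make(chan command), state: map[string]string{}}
	go r.applyLoop()
	return r
}

func main() {
	r := newReplica()
	fmt.Println(<-r.propose("k", "v")) // blocks until the command is applied
}
```

On a follower replica the lookup in the pending map simply finds nothing, which is why only the proposer gets signalled even though every replica applies the command.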