4.1. Concurrency in the LSM-tree
In general, we are given an LSM-tree of K+1 components, C0, C1, C2, ..., CK-1 and CK, of increasing size, where the C0 component tree is memory resident and all other components are disk resident.
There are asynchronous rolling merge processes in train between all component pairs (Ci-1, Ci) that move entries out from the smaller to the larger component each time the smaller component, Ci-1, exceeds its threshold size.
Note: asynchronous rolling merges move data into the next larger component whenever a component exceeds its size threshold.
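The threshold rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's algorithm: the `Component` class, its `threshold` field, and the `insert` helper are hypothetical names, a sorted Python list stands in for each tree, and the merge runs synchronously and moves the smallest keys, whereas the text describes asynchronous merges driven by a circulating cursor.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One LSM-tree component; a sorted list stands in for the tree (assumption)."""
    threshold: int                                # max entries before merging out
    entries: list = field(default_factory=list)


def insert(components, key):
    """Insert into C0, then cascade merges while a component exceeds its threshold."""
    c0 = components[0]
    c0.entries.append(key)
    c0.entries.sort()
    for i in range(1, len(components)):
        inner, outer = components[i - 1], components[i]
        if len(inner.entries) <= inner.threshold:
            break
        # Move the overflow (here: the smallest keys) from C(i-1) out to C(i).
        overflow = len(inner.entries) - inner.threshold
        moved = inner.entries[:overflow]
        inner.entries = inner.entries[overflow:]
        outer.entries = sorted(outer.entries + moved)
```

With thresholds 2, 4 and 100, inserting a third key into C0 pushes one entry out to C1 while C2 stays untouched until C1 itself overflows.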
Each disk resident component is constructed of page-sized nodes in a B-tree type structure, except that multiple nodes in key sequence order at all levels below the root sit on multi-page blocks.
Directory information in upper levels of the tree channels access down through single page nodes and also indicates which sequence of nodes sits on a multi-page block, so that a read or write of such a block can be performed all at once.
Under most circumstances, each multi-page block is packed full with single page nodes, but as we will see there are a few situations where a smaller number of nodes exist in such a block.
In that case, the active nodes of the LSM-tree will fall on a contiguous set of pages of the multi-page block, though not necessarily the initial pages of the block.
Apart from the fact that such contiguous pages are not necessarily the initial pages on the multi-page block, the structure of an LSM-tree component is identical to the structure of the SB-tree presented in [21], to which the reader is referred for supporting details.
A node of a disk-based component Ci can be individually resident in a single page memory buffer, as when equal match finds are performed, or it can be memory resident within its containing multi-page block.
A multi-page block will be buffered in memory as a result of a long range find or else because the rolling merge cursor is passing through the block in question at a high rate.
In any event, all non-locked nodes of the Ci component are accessible to directory lookup at all times, and disk access will perform lookaside to locate any node in memory, even if it is resident as part of a multi-page block taking part in the rolling merge.
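The lookaside described above might look roughly like the following. It is a sketch only: `BufferPool`, `load_block`, and `get_node` are hypothetical names, a dictionary stands in for the disk, and locking is left out. The point it illustrates is that a node is looked for among individually buffered pages and inside buffered multi-page blocks before any disk read is issued.

```python
class BufferPool:
    """Toy buffer manager with lookaside (all names are illustrative)."""

    def __init__(self, disk):
        self.disk = disk        # page_id -> node contents (stands in for the disk)
        self.pages = {}         # individually buffered single-page nodes
        self.blocks = {}        # block_id -> {page_id: node} for buffered blocks
        self.disk_reads = 0     # number of simulated I/O operations

    def load_block(self, block_id, page_ids):
        """Read a whole multi-page block with a single I/O."""
        self.blocks[block_id] = {p: self.disk[p] for p in page_ids}
        self.disk_reads += 1

    def get_node(self, page_id):
        """Return a node, checking memory first and reading disk only on a miss."""
        if page_id in self.pages:
            return self.pages[page_id]
        for nodes in self.blocks.values():
            if page_id in nodes:            # resident as part of a multi-page block
                return nodes[page_id]
        self.disk_reads += 1                # miss: single-page read
        node = self.disk[page_id]
        self.pages[page_id] = node
        return node
```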
Given these considerations, a concurrency approach for the LSM-tree must mediate three distinct types of physical conflict.
- (i) A find operation should not access a node of a disk-based component at the same time that a different process performing a rolling merge is modifying the contents of the node.
  Note: a find and a merge modification must not touch the same disk node at once.
- (ii) A find or insert into the C0 component should not access the same part of the tree that a different process is simultaneously altering to perform a rolling merge out to C1.
  Note: a find or insert and a merge must not operate on the same part of C0 at once.
- (iii) The cursor for the rolling merge from Ci-1 out to Ci will sometimes need to move past the cursor for the rolling merge from Ci out to Ci+1, since the rate of migration out from the component Ci-1 is always at least as great as the rate of migration out from Ci and this implies a faster rate of circulation of the cursor attached to the smaller component Ci-1. Whatever concurrency method is adopted must permit this passage to take place without one process (migration out to Ci) being blocked behind the other at the point of intersection (migration out from Ci).
Nodes are the unit of locking used in the LSM-tree to avoid physical conflict during concurrent access to disk-based components.
Note: the node is the basic unit of locking for concurrent operations.
Nodes being updated because of rolling merge are locked in write mode and nodes being read during a find are locked in read mode;
Note: merges take write locks; finds take read locks;
methods of directory locking to avoid deadlocks are well understood (see, for example, [3]).
Note: directory locking techniques are used to avoid deadlock.
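The read and write modes on a node can be illustrated with a small shared/exclusive lock. This is a sketch under assumptions: `NodeLock` and its methods are hypothetical names, and it uses non-blocking try-acquire calls so the behavior is easy to follow, whereas a real implementation would more likely block and would follow a deadlock-avoiding directory locking protocol such as those referenced in [3].

```python
import threading


class NodeLock:
    """Shared/exclusive lock for one node (non-blocking try-acquire variant)."""

    def __init__(self):
        self._mutex = threading.Lock()   # protects the two counters below
        self._readers = 0
        self._writer = False

    def try_read(self):
        """Finds take the lock in read mode; many readers may share a node."""
        with self._mutex:
            if self._writer:
                return False
            self._readers += 1
            return True

    def try_write(self):
        """A rolling merge takes the lock in write mode; it excludes everyone."""
        with self._mutex:
            if self._writer or self._readers:
                return False
            self._writer = True
            return True

    def release_read(self):
        with self._mutex:
            self._readers -= 1

    def release_write(self):
        with self._mutex:
            self._writer = False
```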
The locking approach taken in C0 is dependent on the data structure used.
In the case of a (2-3)-tree, for example, we could write lock a subtree falling below a single (2-3)-directory node that contains all entries in the range affected during a merge to a node of C1;
Note: with a (2-3)-tree, the whole subtree affected by the merge is write locked;
simultaneously, find operations would lock all (2-3)-nodes on their access path in read mode so that one type of access will exclude another.
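One way to picture the C0 scheme is to pick the lowest directory node whose subtree covers the whole key range of the merge, write lock it, and let finds read lock every node on their path. The sketch below assumes a generic range-partitioned tree rather than a real (2-3)-tree; `Node`, `covering_node`, `read_path`, and `find_conflicts` are hypothetical names, and keys are assumed to fall inside the root's range.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: int                                   # smallest key in this subtree
    hi: int                                   # largest key in this subtree (inclusive)
    children: list = field(default_factory=list)


def covering_node(root, lo, hi):
    """Lowest node whose subtree contains every key in [lo, hi]."""
    node = root
    while True:
        nxt = next((c for c in node.children if c.lo <= lo and hi <= c.hi), None)
        if nxt is None:
            return node
        node = nxt


def read_path(root, key):
    """Nodes a find would read lock on its way from the root to a leaf."""
    path, node = [root], root
    while node.children:
        node = next(c for c in node.children if c.lo <= key <= c.hi)
        path.append(node)
    return path


def find_conflicts(path, write_locked):
    """A find is excluded if any node on its read path is the write-locked node."""
    return any(n is write_locked for n in path)
```

A find whose path passes through the write-locked directory node is excluded, while finds in other subtrees proceed untouched.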
Note that we are only considering concurrency at the lowest physical level of multi-level locking, in the sense of [28].
We leave to others the question of more abstract locks, such as key range locking to preserve transactional isolation, and avoid for now the problem of phantom updates;
see [4], [14] for a discussion.
Thus read-locks are released as soon as the entries being sought at the leaf level have been scanned.
Write locks for (all) nodes under the cursor are released following each node merged from the larger component.
Note: write locks are released after each node has been merged into the larger component.
This gives an opportunity for a long range find or for a faster cursor to pass a relatively slower cursor position, and thus addresses point (iii) above.
Now assume we are performing a rolling merge between two disk-based components, migrating entries from Ci-1, which we refer to as the inner component of this rolling merge, out to Ci, which we refer to as the outer component.
The cursor always has a well-defined inner component position within a leaf-level node of Ci-1, pointing to the next entry it is about to migrate out to Ci, and simultaneously a position in each of the higher directory levels of Ci-1 along the path of access to the leaf level node position.
The cursor also has an outer component position in Ci, both at the leaf level and at upper levels along the path of access, corresponding to an entry it is about to consider in the merge process.
As the merge cursor progresses through successive entries of the inner and outer components, new leaf nodes of Ci created by the merge are immediately placed in left-to-right sequence in a new buffer resident multi-page block.
Thus the nodes of the Ci component surrounding the current cursor position will in general be split into two partially full multi-page block buffers in memory: the "emptying" block whose entries have been depleted but which retains information not yet reached by the merge cursor, and the "filling" block which reflects the result of the merge up to this moment but is not yet full enough to write on disk.
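The interplay of the emptying and filling blocks can be sketched as one merge step that consumes the next leaf node of the emptying block. The sketch rests on several assumptions: `merge_into_filling` and its parameters are hypothetical names, leaf nodes are plain sorted lists of keys, each outer node is non-empty, and writing a full block to disk is left out.

```python
import heapq


def merge_into_filling(inner_keys, emptying, filling, node_capacity):
    """Consume the first node of the emptying block and extend the filling block.

    inner_keys: sorted keys still to migrate from C(i-1)
    emptying:   sorted leaf nodes of C(i) not yet reached by the cursor
    filling:    newly built leaf nodes of C(i), kept in left-to-right order
    Returns the inner keys that remain for later steps.
    """
    outer_node = emptying.pop(0)
    bound = outer_node[-1]                       # largest key in this outer node
    take = [k for k in inner_keys if k <= bound]  # a prefix, since keys are sorted
    rest = inner_keys[len(take):]
    for key in heapq.merge(take, outer_node):
        # Start a new leaf node whenever the current one is full.
        if not filling or len(filling[-1]) == node_capacity:
            filling.append([])
        filling[-1].append(key)
    return rest
```

After each step both blocks are typically only partly full: the emptying block has lost a node, and the last node of the filling block may still have free space.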
For concurrent access purposes, both the emptying block and the filling block contain an integral number of page-sized nodes of the C1 tree which simply happen to be buffer resident.
During merge step operations restructuring individual nodes, the nodes involved are locked in write mode, blocking other types of concurrent access to the entries.
In the most general approach to a rolling merge, we may wish to retain certain entries in the component Ci-1 rather than migrating all entries out to Ci as the cursor passes over them.
In this case, the nodes in the Ci-1 component surrounding the merge cursor will also be split into two buffer resident multi-page blocks, the "emptying" block that contains nodes of Ci-1 that the merge cursor has not yet reached, and the "filling" block with nodes, placed left-to-right, that contain entries recently passed over by the merge cursor and retained in component Ci-1.
In this most general case then, the merge cursor position is affecting four different nodes at any one time: the inner and outer component nodes in the emptying blocks where the merge is about to occur and the inner and outer component nodes in the filling blocks where new information is being written as the cursor progresses.
Clearly these four nodes may all be less than completely full at any moment, and the same is true of the containing blocks.
We take write locks on all four nodes during the time the merge is actually modifying the node structures and release these locks at quantized instants to allow a faster cursor to pass by;
we choose to release locks each time a node in the emptying block in the outer component has been completely depleted, but the other three nodes will generally be less than full at that time.
This is all right, since we can perform all operations of access on a tree with nodes that are less than completely full as well as blocks that are less than completely full with nodes.
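The quantized release points can be made visible with a small driver that reports each moment at which an outer emptying node has been fully depleted. This is only an illustration: `rolling_merge` and the `on_release` callback are hypothetical, the locks themselves are not modeled, and all inner entries are migrated (none are retained in the inner component).

```python
def rolling_merge(inner_keys, outer_nodes, on_release):
    """Merge sorted inner keys into sorted outer nodes.

    on_release is called with the number of entries merged so far each time
    an outer node has been completely depleted; in the scheme described above
    this is the instant at which the write locks would be released.
    """
    merged = []
    i = 0
    for node in outer_nodes:
        # Write locks on the nodes under the cursor would be held here (not modeled).
        for key in node:
            while i < len(inner_keys) and inner_keys[i] <= key:
                merged.append(inner_keys[i])
                i += 1
            merged.append(key)
        on_release(len(merged))          # quantized release point
    merged.extend(inner_keys[i:])        # inner keys beyond the last outer key
    return merged
```

Between two consecutive release points a faster cursor or a long range find has to wait; at each release point it gets a chance to pass.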
The case where one cursor passes another requires particularly careful thought, because in general the cursor position of the rolling merge being bypassed will be invalidated on its inner component, and provision must be made to reorient the cursor.
Note that all of the above considerations also apply at various directory levels of both components where changes occur because of the moving cursor.
High level directory nodes will not normally be memory resident in a multi-page block buffer, however, so a somewhat different algorithm must be used, but there will still be a "filling" node and an "emptying" node at every instant.
We leave such complex considerations for later work, after an implementation of the LSM-tree has provided additional experience.
Up to now we haven't taken any special account of the situation where the rolling merge under consideration is directed from the inner component C0 to the outer C1 component.
In fact, this is a relatively simple situation by comparison with a disk-based inner component.
As with all such merge steps, one CPU should be totally dedicated to this task so that other accesses are excluded by write locks for as short a time as possible.
The range of C0 entries to be merged should be pre-calculated and a write lock taken on this entry range in advance by the method already explained.
Following this, CPU time is saved by deleting entries from the C0 component in a batch fashion, without attempts to rebalance after each individual entry delete;
the C0 tree can be fully rebalanced after the merge step is complete.
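The batch delete followed by a single rebalance can be sketched as below. The assumptions are significant: a plain binary tree built from nested tuples stands in for the C0 structure, all function names are hypothetical, and "rebalancing" is done by rebuilding the tree from its remaining sorted keys, which is simpler than an in-place rebalance but shows the same idea of paying the balancing cost once per merge step rather than once per deleted entry.

```python
def build_balanced(keys):
    """Build a balanced binary tree (left, key, right) from sorted keys."""
    if not keys:
        return None
    mid = len(keys) // 2
    return (build_balanced(keys[:mid]), keys[mid], build_balanced(keys[mid + 1:]))


def in_order(tree):
    """Return the keys of the tree in sorted order."""
    if tree is None:
        return []
    left, key, right = tree
    return in_order(left) + [key] + in_order(right)


def height(tree):
    return 0 if tree is None else 1 + max(height(tree[0]), height(tree[2]))


def merge_out(tree, lo, hi):
    """Remove all keys in the pre-calculated range [lo, hi] in one batch,
    then rebalance once. Returns (new_tree, migrated_keys)."""
    keys = in_order(tree)
    migrated = [k for k in keys if lo <= k <= hi]
    kept = [k for k in keys if not lo <= k <= hi]
    return build_balanced(kept), migrated
```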