OLTP
) 和 optimized for analytics(OLAP
) 有何不同?LSM-Trees
) and page-oriented storage enginers(如 B-Trees
)?A transaction needn’t necessarily have ACID (atomicity, consis‐ tency, isolation, and durability) properties. Transaction processing just means allowing clients to make low-latency reads and writes— as opposed to batch processing jobs, which only run periodically (for example, once per day).
Storage:
相比随机写入顺序写入简单高效, 因此 Many databases internally use a log(an append-only sequence of records), which is an append-only
(the simplest possible write operation) data file.
Retrieval:
TP 应用通常面向用户,使用 key 检索数据,结果集也通常较小,对响应时间有要求,因此 store engine 使用 index 来加速查询。Disk seek time is often the bottleneck here.
Index:
Index 利用额外的存储(来自 data)来加速查询. 维持 index 会面临一些成本,尤其是当写入时,需要同步更新 index。由于 index 可以加速查询, 降低写入效率,所以索引的建立需开发人员去权衡.
Storage:
数据以 (k,v) pair 的形式存储在磁盘,以 append only 的方式进行追加
Retrieval:
在内存中维护 hash index,数据追加时同步更新 index
以 append-only 方式追加会导致磁盘空间不足,为了解决这个问题:
这个方案一些要关注的点:
Crash recovery
: Bitcask speeds up recovery by storing a snapshot of each segment’s hash map on disk, which can be loaded into memory more quickly.Partially written records
: include checksums, allowing such corrupted parts of the log to be detected and ignored.Concurrency control
: 单线程写入,一个 file 要么是 append-only 要么是 immutable, 所以可以多线程读取优势:
Limitations:
如何使用 SSTables 来解决 hash index 的问题呢
Storage:
在 hash index 中 k,v 数据按照写入顺序存储,且同一个 k 会存在多个 segment file 中。在 SSTable 中, we require that the sequence of key-value pairs is sorted by key
.We also require that each key only appears once
within each merged segment file
Retrieval:
因为 key 按序存储,所以 index 中的一个 key 可以表示一个范围的上/下界,找到数据所在范围后再 scan 者个范围内的数据。先检索内存中的 memtable, 再根据时间检索磁盘上的 SSTables.
[外链图片转存失败(img-8gsxfDsB-1568549252126)(https://miro.medium.com/max/4000/0*6zJnBf20REcr4Uet.png)]
Advantages:
group those records into a block and compress it
before writing it to disk. 这可以节省磁盘同时也能降低 IO 带宽Sorted structure on disk: LSM-tree(Log-Structured Merge-Tree)?
Sorted structure in memory: red-black trees or AVL trees
Work flow:
memtable
disk as an SSTable file
.唯一的问题:
当 database crashed, 还没刷到磁盘的 memetable 就会丢失,为了避免这个问题,写一条数据时,同时以简单的 append 方式追加到一个 log 中,便于 crash 后 memtable 的恢复
写快读慢,因为顺序写入,读取的时候需要检索 memtable 和 sstable.
Storage engines that are based on this principle of merging and compacting sorted files are often called LSM storage engines.
the basic idea of LSM-trees—keeping a cascade of SSTables that are merged in the background is simple and effective
由于数据存储有序,所以 index 可以进行 range query,这保证了再大数据集下不错的检索性能。
由于顺序写入, LSM-tree 也可以支持大的写吞吐;
Reads are typically slower on LSM-trees because they have to check several different data structures and SSTables at different stages of compaction.
Merge 和 compact 方式:
size-tiered: newer and smaller SSTables are successively merged into older and larger SSTables.
leveled compaction: the key range is split up into smaller SSTables and older data is moved into separate “levels,” which allows the compaction to proceed more incrementally and use less disk space
Storage:
(k,v) 按 k 有序存储,which allows efficient key-value lookups and range queries。B-trees break the database down into fixed-size blocks or pages
, traditionally 4 KB in size (sometimes bigger), and read or write one page at a time. This design corresponds more closely to the underlying hardware, as disks are also arranged in fixed-size blocks.
Most databases can fit into a B-tree that is three or four levels deep, so you don’t need to follow many page references to find the page you are look‐ing for. (A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.)
B-tree is to overwrite a page on disk with new data. It is assumed that the overwrite does not change the location of the page;
fault-tolerant when database crash?
In order to make the database resilient to crashes, it is common for B-tree implemen‐ tations to include an additional data structure on disk: a write-ahead log
(WAL, also known as a redo log).
并发问题:
This is typically done by protecting the tree’s data structures with latches (lightweight locks).
B-tree: 写慢读快;每条数据之存在于一个地方,随意可以支持强大的事务能力
Lsm-tree: 写快读慢,同一个 key 可能在不同的 segment 中
相比于 B-trees, LSM-trees 可以维持 higher write throughput:
LSM-trees 磁盘利用率相对较高
: LSM-trees can be compressed better, and thus often produce smaller files on disk than B-trees.
write amplification(写放大):
- A B-tree index must write every piece of data at least twice: index 和 数据
- Log-structured indexes also rewrite data multiple times due to repeated compaction and merging of SSTables
compaction process 有时和 ongoing reads and writes process 会互相干扰
Another issue with compaction arises at high write throughput: the disk’s finite write bandwidth needs to be shared between the initial write (logging and flushing a memtable to disk) and the compaction threads running in the background.
二级索引
数据存储在 heap file 中,索引处存储 heap file 的 reference,这样可以避免重复存储数据.index 中新值可以 overwrite 老值, index 无需更新; 如果空间不够需要生成新的 heapfile, 则需要更新 index。
clustered index(toring all row data within the index):
so it can be desirable to store the indexed row directly within an index. This is known as a clustered index(聚集索引,比如 mysql 主键
).
covering index:(比如 mysql 主键外其他键上索引)
which stores some of a table’s columns within the index
concatenated index(比如 mysql 联合唯一键):
simply combines several fields into one key by appending one column to another (the index definition specifies in which order the fields are concatenated)
Multi-dimensional indexes are a more general way of querying several columns at once, 比如 R 树?
To cope with typos in documents or queries, Lucene is able to search text for words within a certain edit distance
but in Lucene, the in-memory index is a finite state automaton over the characters in the keys, similar to a trie
But other in- memory databases aim for durability, which can be achieved with special hardware (such as battery-powered RAM), by writing a log of changes to disk, by writing periodic snapshots to disk, or by replicating the in-memory state to other machines.
Counterintuitively, the performance advantage of in-memory databases is not due to the fact that they don’t need to read from disk.Rather, they can be faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk [44].?
Besides performance, another interesting area for in-memory databases is providing data models that are difficult to implement with disk-based indexes.
anti-caching approach ?
Hash Index + append only
Range Index + SSTable(LSM-Trees 是其进阶)
Their key idea is that they systematically turn random-access writes into sequential writes on disk, which enables higher write throughput due to the performance characteristics of hard drives and SSDs.
B-Trees(update-in-place)
特点: scan 大量的数据, project 一些字段,在这些数据上进行聚合运算
Instead it becomes important to encode data very compactly, to minimize the amount of data that the query needs to read from disk.
We discussed how column-oriented storage helps achieve this goal.
通过 ETL 将 TP 系统中的数据以合适的方式收集到 data warehouse 中进行 AP 分析.DW 通常是关系模型,因为使用 sql 分析非常合适。
A data warehouse, by contrast, is a separate database that analysts can query to their hearts’ content, without affecting OLTP operations
A big advantage of using a separate data warehouse, rather than querying OLTP sys‐ tems directly for analytics, is that the data warehouse can be optimized for analytic access patterns.
[外链图片转存失败(img-a5szTnbl-1568549252127)(https://panoply.io/uploads/versions/diagram4—x----750-328x—.jpg)]
Stars schema(也称维度建模):
中间是事实表,其通常包含很多字段,有些是属性,有些事指向维度表的外键。
[外链图片转存失败(img-GLKtXL71-1568549252127)(https://i.ytimg.com/vi/KUwOcip7Zzc/maxresdefault.jpg)]
Snowflakes schema:
是 Stars 的变体,维度表维度和以被细分为更小的维度
[外链图片转存失败(img-owDxB68K-1568549252128)(https://www.tutorialspoint.com/dwh/images/snowflake.jpg)]
fact table(事实表): Each row of the fact table represents an event that occurred at a particular time
dimension table(维度表): the dimensions represent the who, what, where, when, how, and why of the event
The idea behind column-oriented storage is simple: don’t store all the values from one row together, but store all the values from each column together instead.
The column-oriented storage layout relies on each column file containing the rows in the same order.
we can further reduce the demands on disk throughput by compressing data
bitmap encoding
vectorized processing
A big bottleneck is the bandwidth for getting data from disk into memory
Developers of analytical databases also worry about efficiently using the bandwidth from main memory into the CPU cache
在列存中顺序无关紧要,存储按照 insert 顺序来就好了.但是仍然可以类似 SSTable 指定顺序用来实现索引
Rather, the data needs to be sorted an entire row at a time, even though it is stored by column.
A second column can determine the sort order of any rows that have the same value in the first column.
Another advantage of sorted order is that it can help with compression of columns
Having multiple sort orders in a column-oriented store is a bit similar to having mul‐ tiple secondary indexes in a row-oriented store.
面向列,压缩和排序增加了写入成本。可以使用类似 LSM-trees 的方式,将数据写入 memtable 再刷到磁盘,刷到磁盘的时候生成 column-oriented 文件。只是这样以来,查询就需要合并 memtable 和磁盘文件,但是查询优化器会对用户屏蔽这些底层细节。
Materialized view:
is and actual copy of the query results, written to disk,When the underlying data changes, a materialized view needs to be updated
A common special case of a materialized view is known as a data cube or OLAP cube, It is a grid of aggregates grouped by different dimensions.
The disadvantage is that a data cube doesn’t have the same flexibility as querying the raw data