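The paper summarized here introduces the two branches of Oracle's in-memory database technology, TimesTen and Database In-Memory. It is well worth reading: it covers the important features of both technologies and quickly gives the reader a complete picture of the key points. After going through these notes, it is worth rereading the original paper listed in the references.
As memory capacity keeps growing and prices keep falling, it has become feasible to place all user data in memory and so avoid expensive I/O.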
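Oracle offers two complementary in-memory database technologies, one for the application tier and one for the database tier:
1) TimesTen
It can be deployed in the database tier as a standalone database, or in the application tier as a cache for a backend Oracle database.
It is mainly used for low-latency OLTP applications.
2) Database In-Memory
It is an option of Oracle Database 12c Enterprise Edition, deployed in the database tier to accelerate analytic workloads. The database size is not limited by memory: tables that need to be analyzed can be loaded into memory, while the rest stay on disk, SSD, or other storage.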
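TimesTen is a relational in-memory database with ACID support. Because the data resides entirely in memory, it delivers high performance and low latency.
Applications connect to TimesTen through JDBC, ODBC, or OCI and issue standard SQL.
TimesTen supports the traditional client/server mode as well as its own, faster direct mode (In direct-link mode, all database API invocations are treated simply as function calls into the TimesTen shared library allowing for in-process execution of database code).
Why an in-memory database is faster than even a fully cached disk-based database is explained as follows: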
Another over-arching design principle is the use of memory-based addressing rather than logical addressing. For instance, indexes in TimesTen contain pointers to the tuples in the base table. The metadata describing the layout of a table contains pointers to the pages comprising the table. Therefore both index scans and tablescans can operate via pointer traversal. This design approach is repeated over and over again in the storage manager, with the result that TimesTen is significantly faster than a disk-oriented database even one that is completely cached – since there is no overhead from having to translate logical rowids to physical memory addresses of buffers in a buffer cache.
The supported index types are Hash, Range, and Bitmap:
Hash Indexes for speeding up lookup queries, Bitmap Indexes for accelerating star joins with complex predicates, as well as Range Indexes for accelerating range scans.
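Within ACID, transactional durability is implemented with checkpointing and write-ahead logging, using two checkpoint files and multiple log files. Each time a checkpoint completes, the next one switches to the other file, so there is always one complete memory image available for backup and as the starting point for roll-forward during recovery. For logging, in addition to the durable commit that Oracle Database also offers, TimesTen provides a delayed-durability mode for performance: log records are written to the log buffer, and a background process flushes the buffer to disk every 200 ms.
If no data loss can be tolerated, use durable commit or the 2-safe replication mode described later.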
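The interplay of the two alternating checkpoint files and the log can be illustrated with a toy model. This is only a sketch under stated assumptions: the class `ToyStore`, its method names, and the use of Python lists in place of files are all hypothetical and say nothing about how TimesTen is actually implemented.

```python
import copy


class ToyStore:
    """Toy in-memory store with a write-ahead log and two alternating checkpoint slots."""

    def __init__(self):
        self.data = {}                   # the "in-memory database"
        self.log = []                    # stands in for the log files: (lsn, key, value)
        self.checkpoints = [None, None]  # stands in for the two checkpoint files
        self.next_slot = 0               # slot the next checkpoint will overwrite
        self.lsn = 0                     # sequence number of the last logged change

    def put(self, key, value):
        # Write-ahead: record the change in the log before applying it in memory.
        self.lsn += 1
        self.log.append((self.lsn, key, value))
        self.data[key] = value

    def checkpoint(self):
        # Overwrite the older slot, so the other slot remains a complete image.
        self.checkpoints[self.next_slot] = (self.lsn, copy.deepcopy(self.data))
        self.next_slot = 1 - self.next_slot

    def recover(self):
        # Start from the newest complete checkpoint, then roll forward through the log.
        images = [c for c in self.checkpoints if c is not None]
        base_lsn, base = max(images, key=lambda c: c[0]) if images else (0, {})
        data = copy.deepcopy(base)
        for lsn, key, value in self.log:
            if lsn > base_lsn:
                data[key] = value
        return data


store = ToyStore()
store.put("a", 1)
store.checkpoint()
store.put("b", 2)
store.checkpoint()
store.put("a", 3)            # logged after the last checkpoint
recovered = store.recover()  # simulate a restart from checkpoint plus log
print(recovered)
```

In this model a delayed-durability commit would correspond to `put` returning before the tail of `self.log` has reached disk, which is why the most recent changes can be lost in that mode.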
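The default isolation level is read committed. Row-level locking is supported and, as in Oracle Database, so is multi-versioning (MVCC, also called MVRC), so readers and writers do not block each other.
TimesTen achieves high availability through log-based replication, which works by log shipping. Asynchronous, semi-synchronous, and synchronous (2-safe) modes are supported.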
With 2-safe replication, a transaction is committed locally only after the commit has been successfully acknowledged by the receiver.
2-safe replication is typically combined with non-durable commit.
This combination allows applications to achieve commit durability in two memories, without requiring any disk IO.
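The ordering that 2-safe imposes (acknowledgement from the receiver first, local commit second) can be sketched as follows. The classes `Receiver` and `Master` and their methods are hypothetical names for illustration, not TimesTen APIs, and the network is replaced by a direct method call.

```python
class Receiver:
    """Hypothetical standby that keeps received transactions in memory only."""

    def __init__(self, reachable=True):
        self.reachable = reachable
        self.applied = []

    def receive(self, txn):
        if not self.reachable:
            return False           # no acknowledgement
        self.applied.append(txn)
        return True                # acknowledgement


class Master:
    """Hypothetical active node using 2-safe commit."""

    def __init__(self, receiver):
        self.receiver = receiver
        self.committed = []

    def commit_2safe(self, txn):
        # Ship the commit first; commit locally only after the receiver acknowledges.
        if not self.receiver.receive(txn):
            raise RuntimeError("commit not acknowledged by receiver")
        # Non-durable local commit: the change now lives in two memories, no disk write.
        self.committed.append(txn)


master = Master(Receiver())
master.commit_2safe("txn-1")
print(master.committed, master.receiver.applied)
```

If the receiver does not acknowledge, the sketch raises instead of committing, so the master never holds a committed transaction that the receiver lacks.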
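Replication can be configured per table (Classic Replication) or for the whole database (Active Standby Pair). It can be bidirectional (Classic Replication) or unidirectional (Classic Replication or Active Standby Pair). The most common and recommended configuration is the Active Standby Pair: the standby node serves read-only applications and can in turn replicate to multiple subscriber nodes to scale reads.
Parallel replication can be used to increase replication speed and throughput.
Deployment as an application-tier cache is the most common way to use TimesTen: it acts as a durable transactional cache in front of a backend Oracle database and greatly accelerates applications.
The high performance of the application-tier cache comes from:
In-Memory Optimizations - the most fundamental factor: a fully in-memory architecture is much faster than a disk-based one.
Application Proximity - TimesTen is deployed in the middle tier, closer to the application, and its direct mode lets the application and TimesTen communicate in-process, which improves efficiency further.
TimesTen uses cache groups to represent the Oracle tables to be cached.
Data in a cache group can be loaded in pre-load mode (loaded in advance) or dynamically loaded mode (loaded on access).
For dynamically loaded mode, the data to be referenced must be identified by an equality predicate on the primary key of the root table.
Dynamically loaded mode can also specify a cache aging policy, either time-based or LRU.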
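Dynamic loading with LRU aging can be pictured with a small sketch. The class `DynamicCache` is hypothetical, a plain dict stands in for the Oracle table, and the capacity limit is an assumption made to trigger aging; none of this reflects TimesTen's actual aging parameters.

```python
from collections import OrderedDict


class DynamicCache:
    """Toy dynamically loaded cache keyed by primary key, with LRU aging."""

    def __init__(self, backend, capacity):
        self.backend = backend       # dict standing in for the Oracle root table
        self.capacity = capacity     # assumed limit that triggers aging
        self.rows = OrderedDict()    # cached rows, least recently used first
        self.loads = 0               # number of loads from the backend

    def get(self, pk):
        # Lookup by equality on the primary key, as dynamic load requires.
        if pk in self.rows:
            self.rows.move_to_end(pk)        # mark as most recently used
            return self.rows[pk]
        row = self.backend[pk]               # cache miss: load from the backend
        self.loads += 1
        self.rows[pk] = row
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)    # age out the least recently used row
        return row


cache = DynamicCache({1: "alice", 2: "bob", 3: "carol"}, capacity=2)
cache.get(1)
cache.get(2)
cache.get(1)   # hit: row 1 becomes the most recently used
cache.get(3)   # miss: row 2 is aged out
print(list(cache.rows), cache.loads)
```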
The most commonly used cache group types are Read-Only Cache Groups and Updatable Cache Groups:
Read-Only Cache Groups -
For data that is infrequently updated, but widely read, a read-only cache group can be created on TimesTen to offload the backend Oracle database. Very hot reference data, as online catalogs, airline gate arrival/departure information, etc. is a candidate for this type of caching. The Oracle side tables corresponding to Read-Only cache groups are updated on Oracle. The updates are periodically refreshed into TimesTen using an automatic refresh mechanism.
Updatable Cache Groups -
For frequently updated data, an updatable cache group with write-through synchronization is appropriate. Account balance information for an online ecommerce application, the location of subscribers in a cellular network, streaming sensor data, etc. are all candidates for write-through caching. TimesTen provides a number of alternative mechanisms for propagating writes to Oracle, but the most commonly used and highest performing mechanism is referred to as Asynchronous Writethrough where the changes are replicated to Oracle using a log-based transport mechanism. This mechanism is also capable of applying changes to Oracle in parallel, in keeping with the parallel-everywhere design theme of the system.
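Note: TimesTen's Asynchronous Writethrough corresponds to what storage systems call write-back, because the commit completes locally before the change reaches Oracle.
Replication from TimesTen to Oracle works on the same principle as replication from TimesTen to TimesTen.
Multiple TimesTen caches can be deployed for one Oracle database; they cooperate with one another, and the result is called an Application-Tier Database Cache Grid.
Cache groups can therefore be further divided into local and global cache groups.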
Local Cache Groups -
Cached data is private to the member; other grid members cannot see it.
This type of cache group is useful when the data can be statically partitioned across grid members; for instance, different ranges of user profile Ids may be cached on different grid members.
Global Cache Groups -
In many cases, an application cannot be statically partitioned and Global Cache Groups allow applications to transparently share cached contents across a grid of independent TimesTen databases. With this type of cache group, cache instances are migrated across the grid on reference. Only consistent (committed) changes are propagated across the grid.
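Cached data can be shared with other members. Whichever TimesTen node the application accesses it from, the cache instance is transferred to that node. So even when the data cannot be statically partitioned, access should be kept as local as possible to avoid excessive data transfer between nodes.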
Thus, the contents of the global cache group are accessible from any location, via data shipping.
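Leaving replication aside, each cache instance exists exactly once in the global cache, so adding TimesTen nodes scales the grid out and provides more capacity and processing power.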
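The "one copy per cache instance, migrated on reference" behaviour can be sketched with a toy grid. The class `Grid`, the ownership map, and the transfer counter are hypothetical illustration devices; the real grid protocol is not described in these notes.

```python
class Grid:
    """Toy cache grid in which each cache instance lives on exactly one member."""

    def __init__(self, members):
        self.members = {name: {} for name in members}  # per-member cache contents
        self.owner = {}                                 # key -> member holding the instance
        self.transfers = 0                              # number of cross-member migrations

    def load(self, member, key, value):
        self.members[member][key] = value
        self.owner[key] = member

    def access(self, member, key):
        holder = self.owner[key]
        if holder != member:
            # Data shipping: move the single copy to the member that referenced it.
            self.members[member][key] = self.members[holder].pop(key)
            self.owner[key] = member
            self.transfers += 1
        return self.members[member][key]


grid = Grid(["tt1", "tt2"])
grid.load("tt1", "profile:42", {"name": "alice"})
grid.access("tt1", "profile:42")   # local access, no transfer
grid.access("tt2", "profile:42")   # migrates the instance to tt2
print(grid.owner, grid.transfers)
```

The transfer counter makes the cost of poor locality visible: alternating accesses from `tt1` and `tt2` would migrate the instance on every call.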
If high availability is needed, an Active Standby Pair can be set up for each node.
A global query runs a federated query across all TimesTen members of the global cache, for example COUNT(*) or MAX.
A global query is a query executed in parallel across multiple grid members.
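A global COUNT(*) and MAX can be thought of as per-member partial aggregates that are then combined. The sketch below assumes each grid member is just a Python list of values and uses a thread pool to stand in for parallel execution; the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def local_aggregate(rows):
    """Partial aggregate computed on one grid member: (count, max or None)."""
    return len(rows), (max(rows) if rows else None)


def global_count_max(members):
    """Run the partial aggregate on every member in parallel, then combine."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(local_aggregate, members))
    total = sum(count for count, _ in partials)
    maxima = [m for _, m in partials if m is not None]
    return total, (max(maxima) if maxima else None)


# Three hypothetical grid members, one of them holding no rows.
result = global_count_max([[3, 9], [], [12, 1, 7]])
print(result)
```

COUNT and MAX combine cleanly because the global answer is the sum, respectively the maximum, of the partial answers; an aggregate such as AVG would need both a sum and a count from each member.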
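Oracle Database In-Memory (DBIM) keeps data in both row and column formats; the row format is the long-established buffer cache. Unlike traditional in-memory databases, DBIM does not require the data to fit entirely in memory.
The innovations of DBIM are:
* Dual format
The row format suits OLTP applications that access a few rows with many columns; the column format suits analytics.
* Unlimited capacity
Important tables and partitions used for analytics can be placed in the IM column store, while the remaining data stays in the buffer cache, on SSD, and on disk.
* Application transparency
Because the column format is embedded seamlessly in the data access layer, all database features such as RAC, Multitenant, and ADG can be combined with it, and the database decides automatically whether to read from the column store or the buffer cache.
The new column format is purely in-memory. An entire tablespace, a table, a subset of a table's columns, partitions, and subpartitions can be placed in memory.
The data can be modified; the database automatically keeps the column-format and row-format data consistent.
Background processes populate the data into memory automatically, without interrupting applications.
Population can be prioritized: data can be loaded in advance at database startup or on first access.
Loaded data is compressed automatically, and different compression levels are available; if OLTP access is frequent, choose a level with lower overhead.
With SIMD (Single Instruction Multiple Data), a single instruction can scan multiple column values in parallel.
The Storage Index further reduces the amount of data that has to be scanned: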
An In-Memory Storage Index keeps track of minimum and maximum values for each column CU.
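Much like partition pruning, a query can skip the CUs (Column Units) that cannot contain matching values, which reduces the amount of data scanned.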
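Min/max pruning can be shown in a few lines. The sketch assumes each column unit is a non-empty Python list and that the predicate is a simple equality; the function names are hypothetical and the real in-memory layout is compressed and far more elaborate.

```python
def build_storage_index(column_units):
    """Record (min, max) for every column unit; units are assumed non-empty."""
    return [(min(cu), max(cu)) for cu in column_units]


def scan_equal(column_units, index, value):
    """Return matching values and the number of units that had to be scanned."""
    hits, scanned = [], 0
    for cu, (lo, hi) in zip(column_units, index):
        if value < lo or value > hi:
            continue                  # value cannot be in this unit: skip it
        scanned += 1
        hits.extend(v for v in cu if v == value)
    return hits, scanned


units = [[1, 5, 3], [10, 12, 11], [20, 25, 22]]
index = build_storage_index(units)
hits, scanned = scan_equal(units, index, 11)
print(hits, scanned)
```

Pruning is most effective when values are clustered, as in this example; if every unit spanned the full value range, no unit could be skipped.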
A typical star join between a fact table and dimension tables can be converted by a Bloom filter into a filtered column scan, which suits columnar storage well and speeds up the scan.
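The idea of turning a join into a filtered scan can be sketched as follows: build a Bloom filter from the dimension keys that satisfy the predicate, then probe it while scanning the fact table's key column. The `BloomFilter` class, its sizing, and the table layout are assumptions for illustration only and do not describe Oracle's implementation.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; a Python int serves as the bit array."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def star_join_sum(dim_rows, fact_keys, fact_amounts, wanted_region):
    """Sum fact amounts whose dimension key belongs to the wanted region."""
    matching = {key for key, region in dim_rows if region == wanted_region}
    bloom = BloomFilter()
    for key in matching:
        bloom.add(key)
    total = 0
    # Scan the fact key column; the Bloom filter rejects most non-matching rows
    # cheaply, and the exact set check removes any false positives.
    for key, amount in zip(fact_keys, fact_amounts):
        if bloom.might_contain(key) and key in matching:
            total += amount
    return total


dims = [(1, "EU"), (2, "US"), (3, "EU")]
total = star_join_sum(dims, [1, 2, 3, 1, 2], [10, 20, 30, 40, 50], "EU")
print(total)
```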
A new optimizer transformation, called Vector Group By, is used to compute multi-dimensional aggregates in real-time. The Vector Group By transformation is a two-part process similar to the well known star transformation.
The default isolation level provided by Oracle Database is known as Consistent Read. With Consistent Read, every transaction in the database is associated with a monotonically increasing timestamp referred to as a System Change Number (SCN). A multi-versioning protocol is employed by the buffer cache to ensure that a given transaction or query only sees changes made by transactions with older SCNs.
With the column store added, the data still remains consistent with the buffer cache.
The IM column store similarly maintains the same consistent read semantics as the buffer cache. Each IMCU is marked with the SCN of the time of its creation. An associated metadata area, known as a Snapshot Metadata Unit (SMU) tracks changes to the rows within the IMCU made beyond that SCN.
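The SMU tracks the validity of the data in the IMCU. If rows have become stale because of modifications, a query automatically combines the IMCU with the journal to obtain the latest data, and the IMCU is then refreshed automatically in the background.
IMCUs are refreshed either when a configured threshold is reached or periodically in small batches.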
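How a snapshot plus a journal of later changes can still give consistent reads is sketched below. `ToyIMCU` is a hypothetical model: rows are a dict keyed by row id, the SMU is reduced to a list of `(scn, rowid, value)` entries appended in SCN order, and refresh is an explicit call rather than a background task.

```python
class ToyIMCU:
    """Toy column unit created at a given SCN, with an SMU-like change journal."""

    def __init__(self, scn, values):
        self.scn = scn                # SCN at which the unit was populated
        self.values = dict(values)    # rowid -> column value as of self.scn
        self.journal = []             # changes after self.scn, in SCN order

    def record_change(self, scn, rowid, value):
        # The snapshot itself is left untouched; only the journal grows.
        self.journal.append((scn, rowid, value))

    def read(self, query_scn):
        # Combine the snapshot with journal entries visible to the query.
        result = dict(self.values)
        for scn, rowid, value in self.journal:
            if scn <= query_scn:
                result[rowid] = value
        return result

    def repopulate(self, scn):
        # Fold visible changes into the snapshot and trim the journal.
        self.values = self.read(scn)
        self.journal = [entry for entry in self.journal if entry[0] > scn]
        self.scn = scn


imcu = ToyIMCU(100, {1: "a", 2: "b"})
imcu.record_change(105, 2, "B")
imcu.record_change(110, 1, "A")
at_107 = imcu.read(107)     # sees the change at SCN 105 but not the one at 110
imcu.repopulate(110)
print(at_107, imcu.values, imcu.journal)
```

The longer the journal grows, the more work each read does, which is the motivation for refreshing the unit once a threshold is reached.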
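Scale-out is achieved by distributing the data across RAC nodes, which also enables parallel query.
The distribution can follow partitions or, for tables without partitions, ROWID; alternatively the choice can be left entirely to the system.
Once the data is distributed, high availability can be added by duplicating it, with the copies placed either on one other RAC node or on all other nodes.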