OLTP & OLAP
OLTP: Online Transaction Processing
OLAP: Online Analytical Processing
What is the relationship between OLTP and OLAP? - Zhihu
OLTP vs. OLAP | Transactional Databases vs. Analytical Databases
OLTP
Categories of OLTP databases
Relational databases
MySQL
MariaDB (a replacement for MySQL [3]; Wikimedia Foundation projects have moved from MySQL to MariaDB [4])
Percona Server (a replacement for MySQL [5][6])
PostgreSQL
Microsoft Access
Microsoft SQL Server
Google Fusion Tables
FileMaker
Oracle Database
Sybase
dBASE
Clipper
FoxPro
foshub
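Almost every database management system ships with an Open Database Connectivity (ODBC) driver, which allows different databases to be integrated with one another.
Non-relational databases (NoSQL)
Bigtable (Google)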
Cassandra
MongoDB
CouchDB
Redis
Key-value databases
Apache Cassandra (used by Facebook [7]): highly scalable [8]
Dynamo
LevelDB (Google)
MySQL & MongoDB
https://en.wikipedia.org/wiki/MongoDB
https://en.wikipedia.org/wiki/MySQL
Feature comparison
https://medium.com/@rsk.saikrishna/when-to-use-mongodb-rather-than-mysql-d03ceff2e922
MySQL: The SQL Relational Database
The following are some MySQL benefits and strengths:
Maturity: MySQL is an extremely established database, meaning that there’s a huge community, extensive testing and quite a bit of stability.
Compatibility: MySQL is available for all major platforms, including Linux, Windows, Mac, BSD and Solaris. It also has connectors to languages like Node.js, Ruby, C#, C++, Java, Perl, Python and PHP, meaning that it’s not limited to SQL query language.
Cost-effective: The database is open source and free.
Replicable: The MySQL database can be replicated across multiple nodes, meaning that the workload can be reduced and the scalability and availability of the application can be increased.
Data Sharding: While sharding cannot be done on most SQL databases, it can be done on MySQL servers. This is both cost-effective and good for business.
MongoDB: The NoSQL Non-Relational Database
The following are some of MongoDB benefits and strengths:
• Dynamic schema: As mentioned, this gives you flexibility to change your data schema without modifying any of your existing data.
• Scalability: MongoDB is horizontally scalable, which helps reduce the workload and scale your business with ease.
• Manageability: The database doesn’t require a database administrator. Since it is fairly user-friendly in this way, it can be used by both developers and administrators.
• Speed: It’s high-performing for simple queries.
• Flexibility: You can add new columns or fields on MongoDB without affecting existing rows or application performance.
Suitable scenarios
Reasons to Use a SQL Database
Not every database fits every business need. That’s why many companies rely on both relational and non-relational databases for different tasks. Although NoSQL databases have gained popularity for their speed and scalability, there are still situations in which a highly structured SQL database might be preferable. Two reasons why you might consider a SQL database are:
- You need ACID compliance (Atomicity, Consistency, Isolation, Durability). ACID compliance reduces anomalies and protects the integrity of your database. It does this by defining exactly how transactions interact with the database, which is not the case with NoSQL databases, which have a primary goal of flexibility and speed, rather than 100% data integrity.
- Your data is structured and unchanging: If your business is not growing exponentially, there may be no reason to use a system designed to support a variety of data types and high traffic volume.
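To make the atomicity part of ACID concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a relational database such as MySQL. The accounts table, the names, and the transfer helper are hypothetical and exist only for illustration. A transfer that would overdraw an account fails as a whole, so the credit that was already applied is rolled back.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This statement violates the CHECK constraint when src is overdrawn.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # fails and leaves no partial update
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer, bob's balance does not include the 500 that was credited before the constraint fired, which is the "all or nothing" behavior the bullet describes.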
Reasons to Use a NoSQL Database
To prevent the database from becoming a system-wide bottleneck, especially in high volume environments, NoSQL databases perform in a way that relational databases cannot.
The following features are driving the popularity of NoSQL databases like MongoDB, CouchDB, Cassandra, and HBase:
- Storing large volumes of data without structure. A NoSQL database doesn’t limit storable data types. Plus, you can add new types as business needs change.
- Using cloud computing and storage. Cloud-based storage is a great solution, but it requires data to be easily spread across multiple servers for scaling. Using affordable hardware on-site for testing and then for production in the cloud is what NoSQL databases are designed for.
- Rapid development. If you are developing using modern agile methodologies, a relational database will slow you down. A NoSQL database doesn’t require the level of preparation typically needed for relational databases.
OLAP
OLAP technologies
https://en.wikipedia.org/wiki/Comparison_of_OLAP_servers
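Choosing the right open-source OLAP engine for you - WeChat official account: 数据社 - OSCHINA - Chinese open-source technology community
The mainstream open-source OLAP engines currently available include, but are not limited to: Hive, HAWQ, Presto, Kylin, Impala, Spark SQL, Druid, ClickHouse, and Greenplum.
Each has its own characteristics, so we group them as follows:
Hive, HAWQ, Impala - SQL on Hadoop
Presto and Spark SQL are similar - both are memory-based engines that parse SQL and generate an execution plan
Kylin - trades space for time through precomputation
Druid - supports real-time data ingestion
ClickHouse - the HBase of the OLAP world, with a huge performance advantage for single-table queries
Greenplum - the PostgreSQL of the OLAP world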
Hive & ClickHouse
Comparison: ClickHouse vs. Hive vs. Impala Comparison
Typical big data analytics architecture --> ClickHouse
Hive
https://en.wikipedia.org/wiki/Apache_Hive
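Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It sits on top of Hadoop to summarize big data and to make querying and analysis easy. It also provides simple SQL query capability and can translate SQL statements into MapReduce jobs for execution.
So what is MapReduce?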
MapReduce
Big Data & Hadoop: MapReduce Framework | EduPristine
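As a rough illustration of the idea, the sketch below runs the equivalent of SELECT city, COUNT(*) FROM visits GROUP BY city as a map phase, a shuffle, and a reduce phase, all in a single Python process. The visits data and the function names are made up for this example, and this is not the plan Hive actually generates; it only shows the shape of the computation.

```python
from collections import defaultdict

def map_phase(rows):
    """Emit one (key, 1) pair per input row, as a mapper would."""
    for row in rows:
        yield row["city"], 1

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Combine the values of each key into one result per key."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input rows standing in for a `visits` table.
rows = [{"city": "Beijing"}, {"city": "Shanghai"}, {"city": "Beijing"}]
counts = reduce_phase(shuffle(map_phase(rows)))
```

In a real cluster the mappers and reducers run on different machines and the shuffle moves data across the network, which is where most of the latency of MapReduce-based queries comes from.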
ClickHouse
Sudo Null - Latest IT News
What is ClickHouse? | ClickHouse Documentation
https://tech.bytedance.net/articles/6853622128044146696#heading3
https://tech.bytedance.net/articles/6908282140950167565
Why is ClickHouse so fast? - 墨天轮
The fastest open-source OLAP engine! The technical evolution of ClickHouse at Toutiao - InfoQ
The history, strengths, and weaknesses of ClickHouse - Xlucas's blog - CSDN Blog
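Column-oriented databases
Column-oriented databases are better suited to OLAP scenarios (for most queries, processing is at least 100 times faster). The reasons are explained in detail below (pictures make this easier to grasp intuitively):
Row-oriented
Column-oriented
See the difference? The following explains in detail why this happens.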
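ClickHouse stores data by column and supports data compression. The expensive parts of disk I/O during a query are seek time, the time to locate the sector, and the read time itself, so seeks should be kept to a minimum. This is why sequential writes achieve much higher throughput than random writes, and why reading one file sequentially needs fewer disk scheduling operations than random reads. With row-based storage, reading many rows requires many seeks; with columnar storage (one file per column), the I/O time is greatly reduced.
With columnar storage, the number of files is also greatly reduced and each column file is larger, so data can be compressed more efficiently and the storage footprint shrinks.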
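The difference in I/O can be sketched in a few lines of Python. The example below lays out the same hypothetical table (id, price, qty) once as packed rows and once as one buffer per column, then sums a single column and reports how many bytes each layout had to touch. The table, the field widths, and the function names are assumptions for illustration, not ClickHouse's on-disk format.

```python
import struct

ROW_FORMAT = struct.Struct("<qdi")  # id: int64, price: float64, qty: int32

rows = [(i, 9.99 + i, i % 5) for i in range(1000)]
n = len(rows)

# Row-oriented layout: whole records stored back to back.
row_store = b"".join(ROW_FORMAT.pack(*r) for r in rows)

# Column-oriented layout: one contiguous buffer per column.
col_store = {
    "id": struct.pack(f"<{n}q", *(r[0] for r in rows)),
    "price": struct.pack(f"<{n}d", *(r[1] for r in rows)),
    "qty": struct.pack(f"<{n}i", *(r[2] for r in rows)),
}

def sum_qty_row_store(buf):
    """Every record must be read, even though only qty is needed."""
    total = 0
    for _id, _price, qty in ROW_FORMAT.iter_unpack(buf):
        total += qty
    return total, len(buf)

def sum_qty_col_store(cols):
    """Only the qty buffer is read; the other columns are never touched."""
    buf = cols["qty"]
    values = struct.unpack(f"<{len(buf) // 4}i", buf)
    return sum(values), len(buf)

row_total, row_bytes = sum_qty_row_store(row_store)
col_total, col_bytes = sum_qty_col_store(col_store)
```

Both layouts give the same sum, but the row layout scans 20 bytes per record while the column layout scans only the 4 bytes that hold qty. Analytical queries that touch a few columns of a wide table benefit from exactly this effect.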
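Vectorized execution engine
In a traditional query engine, when multiple values reach the CPU they are usually processed serially: a register handles the values one after another. The CPU spends most of its time traversing the query operator tree rather than actually processing data, so CPU utilization and data-processing efficiency are both low. If a whole batch of data goes through the same logic, SIMD instructions can be used to process the data in parallel.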
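The contrast between the two execution styles can be sketched in plain Python. The expression tree, the column names, and both evaluators below are hypothetical. Python lists only mimic the batch structure here; in a real engine the per-column loops are compiled code that the CPU can run with SIMD instructions.

```python
import operator

# Expression tree for: price * qty + fee
EXPR = ("add", ("mul", ("col", "price"), ("col", "qty")), ("col", "fee"))
OPS = {"add": operator.add, "mul": operator.mul}

def eval_row(expr, row):
    """Row-at-a-time: the tree is walked again for every single row."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    return OPS[kind](eval_row(expr[1], row), eval_row(expr[2], row))

def eval_batch(expr, columns):
    """Batch-at-a-time: the tree is walked once; each node handles a whole column."""
    kind = expr[0]
    if kind == "col":
        return columns[expr[1]]
    left = eval_batch(expr[1], columns)
    right = eval_batch(expr[2], columns)
    # One tight loop over the batch with the same operation for every element.
    return list(map(OPS[kind], left, right))

columns = {"price": [2, 3, 4], "qty": [10, 20, 30], "fee": [1, 1, 1]}
rows = [dict(zip(columns, values)) for values in zip(*columns.values())]

row_result = [eval_row(EXPR, r) for r in rows]
batch_result = eval_batch(EXPR, columns)
```

Both evaluators return the same values. The batch version pays the cost of interpreting the tree once per batch instead of once per row, and its inner loops apply one operation to contiguous data, which is the pattern that SIMD hardware accelerates.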
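ClickHouse table engines
ClickHouse supports multiple table engines, and different engines offer different features. The MergeTree engine is the foundation of the MergeTree family; the other engines in the family add new features on top of it.
MergeTree: provides data partitioning, data replicas, and related features
ReplacingMergeTree: removes duplicate rows that share the same sorting key
SummingMergeTree: automatically sums rows by key
AggregatingMergeTree: pre-aggregates the data that needs aggregating and stores the result
CollapsingMergeTree: implements row-level deletion by adding rows (inserting instead of deleting). A row is written with sign = 1, and a later row written with sign = -1 marks that data as deleted; the sign = -1 row must be written after the row it deletes.
VersionedCollapsingMergeTree: uses a version number to overcome the write-order limitation of CollapsingMergeTree, so rows can be written in any order.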
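The "insert instead of delete" idea behind CollapsingMergeTree can be modeled with a short Python sketch. This is a deliberately simplified model under stated assumptions, not ClickHouse's actual merge algorithm: rows are (key, value, sign) tuples, and a sign = -1 row cancels the most recent uncancelled sign = 1 row with the same key. The collapse function and the sample data are hypothetical.

```python
def collapse(rows):
    """Return (surviving state rows, cancel rows that found nothing to cancel)."""
    pending = {}     # key -> uncancelled sign = 1 rows, in write order
    unmatched = []   # sign = -1 rows that arrived before any matching state row
    for key, value, sign in rows:
        stack = pending.setdefault(key, [])
        if sign == 1:
            stack.append((key, value, sign))
        elif stack:
            stack.pop()  # the earlier state row is treated as deleted
        else:
            unmatched.append((key, value, sign))
    survivors = [row for stack in pending.values() for row in stack]
    return survivors, unmatched

writes = [
    ("user_1", 5, 1),    # initial state
    ("user_1", 5, -1),   # cancel it ...
    ("user_1", 7, 1),    # ... and write the updated state
    ("user_2", 3, 1),
    ("user_2", 3, -1),   # user_2 is deleted
]
survivors, unmatched = collapse(writes)

# Writing the cancel row first shows the write-order limitation.
out_of_order = collapse([("user_3", 1, -1), ("user_3", 1, 1)])
```

In the ordered case only the latest state of user_1 survives. In the out-of-order case the cancel row has nothing to cancel, so both rows remain, which is the limitation that a version column is meant to remove.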
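MergeTree storage structure
Partition: the partition directory; it is created only when data arrives for a new partition.
checksums.txt: stores the sizes of the partition's files and the hashes of those files.
count.txt: records the total number of rows.
primary.idx: the primary index, used to build the sparse index.
Column.bin: the column file; each column has its own file that stores that column's data.
Column.mrk: records the offsets that map sparse index entries to positions in the data.
partition.dat: if a partition key is set, records the final value of the partition expression.
minmax.idx: the minimum and maximum values of the raw data for the partition field.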
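Index generation and lookup
primary.idx stores the primary index, which is a sparse index: each entry represents a range (a batch) of rows rather than a single row, so a sparse index can describe the data with far fewer entries. The size of a batch is set by the index_granularity parameter (typically 8192).
Assuming countID is the primary key, the index is stored as countID+countID+... (kept as compact as possible).
Index lookup recursively narrows down ranges. If a range cannot contain the queried values, it is pruned immediately. If it can, the length of the range is checked; if the length is greater than 8, the range is split into 8 smaller ranges that are searched recursively, similar to a recursive 8-way search.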
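The build and lookup steps above can be sketched as follows. This is a toy model under stated assumptions: keys are sorted integers, the granularity is 10 instead of 8192 to keep the numbers readable, and the function names are hypothetical. It returns the granules that may contain keys in the queried range, pruning ranges that cannot match and splitting longer ranges into at most 8 parts.

```python
def build_sparse_index(sorted_keys, granularity):
    """Keep only the first key of every `granularity` rows."""
    return sorted_keys[::granularity]

def find_granules(index, last_key, lo_key, hi_key, fanout=8):
    """Return indices of granules that may contain keys in [lo_key, hi_key]."""
    # Granule i spans keys from index[i] up to index[i + 1];
    # the final granule ends at last_key.
    bounds = index + [last_key]
    hits = []

    def may_overlap(start, end):
        # Granules start..end-1 together span bounds[start]..bounds[end].
        return bounds[start] <= hi_key and bounds[end] >= lo_key

    def search(start, end):
        if not may_overlap(start, end):
            return  # prune the whole range
        length = end - start
        if length == 1:
            hits.append(start)
            return
        step = max(1, -(-length // fanout))  # ceil(length / fanout)
        for s in range(start, end, step):
            search(s, min(s + step, end))

    search(0, len(index))
    return hits

keys = list(range(1000))                     # sorted primary-key values
index = build_sparse_index(keys, 10)         # 100 entries instead of 1000
granules = find_granules(index, keys[-1], 255, 262)
nothing = find_granules(index, keys[-1], 2000, 3000)
```

Only the granules that might hold the requested keys are returned, so the engine would read those granules from the column files and skip the rest. The upper bound of each granule is deliberately conservative, because the index only knows where the next granule starts.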
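Data block compression
The data of each column is stored in its column.bin file, but it is not simply written out in full. Instead, data is compressed in batches, and each compressed batch is written to the .bin file as a unit. Each batch needs a marker that records its size before and after compression, so a block consists of a header plus the compressed data. The header is: compression method + compressed size + uncompressed size.
Supported compression methods include ZSTD and LZ4.
Converting long strings to enums, delta compression, and similar techniques are also supported.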
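A block layout of this shape can be sketched with the Python standard library. The sketch assumes a 9-byte header (1 byte for the method, 4 bytes for the compressed size, 4 bytes for the uncompressed size), uses zlib as a stand-in because LZ4 and ZSTD are not in the standard library, and uses a made-up method id. It is not ClickHouse's exact file format.

```python
import struct
import zlib

HEADER = struct.Struct("<BII")  # method id, compressed size, uncompressed size
METHOD_ZLIB = 1                 # hypothetical id; zlib stands in for LZ4/ZSTD

def write_block(raw: bytes) -> bytes:
    """Compress one batch and prefix it with its header."""
    body = zlib.compress(raw)
    return HEADER.pack(METHOD_ZLIB, len(body), len(raw)) + body

def read_blocks(buf: bytes):
    """Walk the file block by block, using the header to find each boundary."""
    offset = 0
    while offset < len(buf):
        method, compressed_size, uncompressed_size = HEADER.unpack_from(buf, offset)
        offset += HEADER.size
        if method != METHOD_ZLIB:
            raise ValueError(f"unknown compression method {method}")
        raw = zlib.decompress(buf[offset:offset + compressed_size])
        if len(raw) != uncompressed_size:
            raise ValueError("uncompressed size does not match header")
        yield raw
        offset += compressed_size

batches = [b"a" * 1000, b"clickhouse" * 50]
bin_file = b"".join(write_block(batch) for batch in batches)
decoded = list(read_blocks(bin_file))
```

Because the header stores the compressed size, a reader can jump from block to block without decompressing anything, and the uncompressed size lets it allocate the output buffer up front.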