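Here is a brief look at a commercial product that claims to be a “third-generation” graph database and to support both OLAP and OLTP workloads. The vendor has coined a new term for it, NPG (Native Parallel Graph). (It feels a bit like marketing copy inventing new vocabulary... o(╯□╰)o)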
Its data store holds nodes, links, and their attributes. Some graph database products on the market are really wrappers built on top of a more generic NoSQL data store. This virtual graph strategy has a double penalty when it comes to performance.
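Incidentally, Neo4j also uses a native graph, index-free design. There seems to be no shortcut: for a graph database engine to run fast, disk I/O and network I/O have to be kept to a minimum, and native storage plus memory is the most natural way to get there, unless the industry comes up with a new breakthrough in storage technology.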
Internally hash indices are used to reference nodes and links. In Big-O terms, our average access time is O(1) and our average index update time is also O(1).
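As a rough illustration of the O(1) claim above, here is a minimal Python sketch that uses built-in dicts as the hash indices. The class `HashIndexedGraph` and its method names are hypothetical and say nothing about TigerGraph's real internal structures; the sketch only shows why lookup and index update are constant time on average.

```python
class HashIndexedGraph:
    """Toy graph whose nodes and links are located through hash indices."""

    def __init__(self):
        self.nodes = {}      # node id -> attribute dict
        self.out_links = {}  # node id -> {target id -> link attribute dict}

    def add_node(self, node_id, **attrs):
        # Average O(1): one hash insert per index.
        self.nodes[node_id] = attrs
        self.out_links.setdefault(node_id, {})

    def add_link(self, src, dst, **attrs):
        # Average O(1): hash lookup of src, then hash insert of dst.
        # Assumes src was added with add_node first.
        self.out_links[src][dst] = attrs

    def get_node(self, node_id):
        # Average O(1) hash lookup, independent of graph size.
        return self.nodes[node_id]

    def neighbors(self, node_id):
        return list(self.out_links[node_id])


g = HashIndexedGraph()
g.add_node("alice", age=30)
g.add_node("bob", age=25)
g.add_link("alice", "bob", since=2015)
```

Python's dict is itself a hash table, so `get_node` and `add_link` do not get slower as more nodes are added, apart from occasional table resizing.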
Users can set parameters that specify how much of the available memory may be used for holding the graph. If the full graph does not fit in memory, then the excess is stored on disk. Best performance is achieved when the full graph fits in memory, of course.
Data values are stored in encoded formats that effectively compress the data. The compression factor varies with the graph structure and data, but typical compression factors are between 2x and 10x. Compression has two advantages: First, a larger amount of graph data can fit in memory and in CPU cache. Such compression reduces not only the memory footprint, but also CPU cache misses, speeding up overall query performance. Second, for users with very large graphs, hardware costs are reduced.
In general, decompression is needed only for displaying the data. When values are used internally, often they may remain encoded and compressed.
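One common way for values to stay encoded during processing is dictionary encoding. The excerpt does not say which encoding TigerGraph uses, so the sketch below is an assumption chosen purely for illustration, and `DictionaryEncoder` is a hypothetical name.

```python
class DictionaryEncoder:
    """Maps each distinct value to a small integer code."""

    def __init__(self):
        self.code_of = {}   # value -> code
        self.value_of = []  # code -> value

    def encode(self, value):
        if value not in self.code_of:
            self.code_of[value] = len(self.value_of)
            self.value_of.append(value)
        return self.code_of[value]

    def decode(self, code):
        return self.value_of[code]


enc = DictionaryEncoder()
cities = ["Shanghai", "Beijing", "Shanghai", "Shanghai", "Beijing"]
encoded = [enc.encode(c) for c in cities]  # the stored column holds small ints

# Filtering works on codes: encode the constant once, then compare integers.
target = enc.encode("Shanghai")
matching_rows = [i for i, code in enumerate(encoded) if code == target]

# Decoding happens only when a value has to be shown to the user.
shown = [enc.decode(encoded[i]) for i in matching_rows]
```

The filter never touches the original strings, which is the sense in which values "may remain encoded" while they are used internally.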
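The compression part is very interesting. From Shannon's paper "A Mathematical Theory of Communication" we know that how well data compresses depends on how the data is distributed, so I doubt that 2x to 10x compression is achievable for datasets of every kind... The idea is somewhat like Google Protocol Buffers: upper-layer data is converted into a compact format at the engine layer, and decoding of course carries some overhead.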
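To make the point about data distribution concrete, the following sketch computes the Shannon entropy of two made-up symbol streams of equal length. The streams and the function name are purely illustrative; entropy gives a lower bound on the average bits per symbol for a code that treats symbols independently.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(data):
    """Shannon entropy of the empirical symbol distribution, in bits."""
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


uniform = "ABCD" * 250            # four symbols, equally likely
skewed = "A" * 970 + "BCD" * 10   # one symbol dominates

h_uniform = entropy_bits_per_symbol(uniform)
h_skewed = entropy_bits_per_symbol(skewed)
```

The uniform stream needs the full 2 bits per symbol, so a plain 2-bit code cannot be improved on, while the skewed stream has much lower entropy and leaves far more room for compression. The achievable factor therefore depends on the dataset.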
TigerGraph also excels at parallelism, employing an MPP (massively parallel processing) design architecture throughout.
The nature of graph queries is to “follow the links.”
Ask each counter to do its share of the work, and then combine their results at the end.
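The "each counter does its share" idea can be sketched as a split, count, and combine pattern. The code below is a hedged Python illustration with hypothetical function names. It uses threads only to show the shape of the pattern, since CPython threads do not give real CPU parallelism for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor


def count_links(adjacency_chunk):
    """One 'counter': counts the links in its own share of the graph."""
    return sum(len(targets) for targets in adjacency_chunk.values())


def parallel_link_count(adjacency, workers=4):
    # Split the nodes into roughly equal shares, one per counter.
    items = list(adjacency.items())
    chunks = [dict(items[i::workers]) for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_links, chunks))
    # Combine the partial results at the end.
    return sum(partial_counts)


graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a", "d", "e"],
    "d": [],
    "e": ["a"],
}
total = parallel_link_count(graph)
```

Because addition is associative and commutative, the order in which the partial counts arrive does not affect the final total.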
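Because the engine uses native graph storage, graph access and iterative computation can start directly from the physical storage interface and simply follow the links. One would expect that following pointers directly in memory is faster than reaching the data through I/O or some other mapping layer.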
TigerGraph has been carefully designed to use memory efficiently and to release unused memory. Careful memory management contributes to TigerGraph’s ability to traverse many links, both in terms of depth and breadth, in a single query.
Many other graph database products are written in Java, which has pros and cons. Java programs run inside a Java Virtual Machine (JVM). The JVM takes care of memory management and garbage collection (freeing up memory that is no longer needed). While this is convenient, it is difficult for the programmer to optimize memory usage or to control when unused memory becomes available.
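Not much to say here: of all the possible paths they chose the hardest one, implementing the whole database in C++, a comparatively low-level language. Running faster than database products implemented in Java is to be expected. Neo4j and OrientDB, for example, are written mainly in Java.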
TigerGraph also has its own graph querying and update language, GSQL.
To reiterate what we have revealed above, the TigerGraph graph is both a storage model and a computational model. Each node and link can be associated with a compute function. Therefore, each node or link acts as a parallel unit of storage and computation simultaneously. This would be unachievable using a generic NoSQL data store or without the use of accumulators.
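A rough Python analogy for "each link has a compute function that feeds an accumulator" is sketched below. This is not GSQL syntax. The `SumAccum` class here is a hypothetical stand-in that borrows the name of a GSQL accumulator type, and the loop runs sequentially where a real engine would run the per-link work in parallel.

```python
from collections import defaultdict


class SumAccum:
    """Accumulator that combines incoming values by addition."""

    def __init__(self):
        self.value = 0

    def accumulate(self, x):
        self.value += x


def count_in_links(links):
    """Attach one SumAccum to every node; each link adds 1 to its target."""
    accum = defaultdict(SumAccum)
    for src, dst in links:
        # Per-link compute function: it depends only on this link, so the
        # iterations are independent and could be executed concurrently.
        accum[dst].accumulate(1)
        accum[src].accumulate(0)  # make sure the source node is listed too
    return {node: acc.value for node, acc in accum.items()}


links = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
in_degree = count_in_links(links)
```

The accumulator is what makes the independent per-link computations safe to combine: each one only contributes a value, and the combining rule (addition here) does not depend on arrival order.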
TigerGraph is designed to automatically partition the graph data across a cluster of servers, and still perform quickly. The hash index is used to determine not only the within-server data location but also which-server. All the links that connect out from a given node are stored on the same server.
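The placement rule described above can be sketched as follows. The excerpt does not name the hash function, so `zlib.crc32` is only an assumed stand-in, and `server_for` and `partition` are hypothetical helper names. Each "server" is modeled as a dict, which in turn acts as the within-server hash index.

```python
import zlib


def server_for(node_id, num_servers):
    """Pick the owning server from a stable hash of the node id."""
    return zlib.crc32(node_id.encode("utf-8")) % num_servers


def partition(links, num_servers):
    """Store each link on the server that owns its source node."""
    servers = [dict() for _ in range(num_servers)]
    for src, dst in links:
        shard = servers[server_for(src, num_servers)]
        shard.setdefault(src, []).append(dst)
    return servers


links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
servers = partition(links, num_servers=3)

# All outgoing links of "a" can be read from a single server.
owner_of_a = servers[server_for("a", 3)]
out_links_of_a = owner_of_a["a"]
```

A stable hash is used instead of Python's built-in `hash`, because string hashes in Python are randomized per process and every server must agree on the placement.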
In distributed query mode, all servers are asked to work on the query; each server’s actual participation is on an as-needed basis. When a traversal path crosses from server A to server B, the minimal amount of information that server B needs to know is passed to it. Since server B already knows about the overall query request, it can easily fit in its contribution.
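The sketch below is one possible reading of this distributed mode, not TigerGraph's actual protocol. In each hop, only the IDs of the frontier nodes are routed to the servers that own them, and servers with an empty inbox do no work. The hash function and all names are assumptions made for illustration.

```python
import zlib


def owner(node_id, num_servers):
    return zlib.crc32(node_id.encode("utf-8")) % num_servers


def build_servers(links, num_servers):
    servers = [dict() for _ in range(num_servers)]
    for src, dst in links:
        servers[owner(src, num_servers)].setdefault(src, []).append(dst)
    return servers


def k_hop(servers, start, hops):
    """Return the nodes reached after exactly `hops` steps from `start`."""
    n = len(servers)
    frontier = {start}
    messages_sent = 0
    for _ in range(hops):
        # Route each frontier node id to the server that owns it.
        inbox = [set() for _ in range(n)]
        for node in frontier:
            inbox[owner(node, n)].add(node)
        next_frontier = set()
        for server_id, nodes in enumerate(inbox):
            if not nodes:
                continue  # as-needed participation: nothing to do here
            messages_sent += 1
            # Each server expands only the nodes it stores locally.
            for node in nodes:
                next_frontier.update(servers[server_id].get(node, []))
        frontier = next_frontier
    return frontier, messages_sent


links = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
servers = build_servers(links, num_servers=3)
two_hop, messages = k_hop(servers, "a", hops=2)
```

The message count depends on how the hash happens to spread the nodes, which is why the sketch returns it rather than assuming a fixed value.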
As the world’s first and only true native parallel graph (NPG) system, TigerGraph is a complete, distributed graph analytics platform supporting web-scale data analytics in real time. The TigerGraph NPG is built around both local storage and computation, supports real-time graph updates, and serves as a parallel computation engine. TigerGraph supports ACID transactions, guaranteeing data consistency and correct results. Its distributed, native parallel graph architecture enables TigerGraph to achieve unequaled performance levels.
GSQL job
GSQL algorithm
GSQL can be compared to other prominent graph query languages in circulation today. This comparison seeks to transcend the particular syntax or the particular way in which semantics are defined, focusing on expressive power classified along the following key dimensions.
1. Accumulation: What is the language support for the storage of (collected or aggregated) data computed by the query?
2. Multi-hop Path Traversal: Does the language support the chaining of multiple traversal steps into paths, with data collected along these steps?
3. Intermediate Result Flow: Does the language support the flow of intermediate results along the steps of the traversal?
4. Control Flow: What control flow primitives are supported?
5. Query-Calling-Query: What is the support for queries invoking other queries?
6. SQL Completeness: Is the language SQL-complete? That is, is it the case that for a graph-based representation G of any relational database D, any SQL query over D can be expressed by a GSQL query over G?
7. Turing completeness: Is the language Turing-complete?
Reference: Native Parallel Graphs - TigerGraph