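This example describes how to import CSV files into OrientDB as a graph. For simplicity, only two entities are considered: Post and Comment.
In an RDBMS, Post and Comment would be stored in 2 separate tables: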
TABLE POST:
+----+----------------+
| id | title          |
+----+----------------+
| 10 | NoSQL movement |
| 20 | New OrientDB   |
+----+----------------+
TABLE COMMENT:
+----+--------+--------------+
| id | postId | text         |
+----+--------+--------------+
| 0  | 10     | First        |
| 1  | 10     | Second       |
| 21 | 10     | Another      |
| 41 | 20     | First again  |
| 82 | 20     | Second Again |
+----+--------+--------------+
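In an RDBMS, the one-to-many relationship is inverted: the reference is stored in the target table (Comment) and points back to the source table (Post). This is because an RDBMS column cannot hold a collection of values.
By contrast, the OrientDB Graph model lets you model relationships the way you think about them when designing the application: Posts have edges to Comments.
So, with an RDBMS you have: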
Table POST <- (foreign key) Table COMMENT
With OrientDB, the Graph model uses edges to manage relationships:
Class POST ->* (collection of edges) Class COMMENT
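To make the inversion concrete, here is a minimal Python sketch that turns the COMMENT.postId foreign key into a Post -> Comments collection. It uses the rows from the two tables above; the function name comments_by_post is only illustrative and is not part of OrientDB.

```python
from collections import defaultdict

# Rows copied from the POST and COMMENT tables above.
posts = [(10, "NoSQL movement"), (20, "New OrientDB")]
comments = [
    (0, 10, "First"),
    (1, 10, "Second"),
    (21, 10, "Another"),
    (41, 20, "First again"),
    (82, 20, "Second Again"),
]


def comments_by_post(comment_rows):
    """Invert the postId foreign key into {post id: [comment ids]}."""
    edges = defaultdict(list)
    for comment_id, post_id, _text in comment_rows:
        edges[post_id].append(comment_id)
    return dict(edges)


has_comments = comments_by_post(comments)
print(has_comments)
```

Each key plays the role of a Post vertex, and each list plays the role of its outgoing HasComments edges.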
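(1) Export to CSV
If your data is in an RDBMS or any other source, export it in CSV format. The ETL module can also extract directly from JSON and, through JDBC drivers, from an RDBMS. For simplicity, however, this example uses CSV as the source format.
Consider 2 CSV files:
a. File posts.csv
The posts.csv file contains all the posts: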
id,title
10,NoSQL movement
20,New OrientDB
b. File comments.csv
The comments.csv file contains all the comments, together with a reference to the post each one belongs to:
id,postId,text
0,10,First
1,10,Second
21,10,Another
41,20,First again
82,20,Second Again
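Before importing, it can help to confirm that every comment points at a post that actually exists. The following is a small sketch using Python's standard csv module; the CSV text is inlined so that the example is self-contained, and the helper names read_rows and orphan_comments are assumptions made for illustration.

```python
import csv
import io

POSTS_CSV = """id,title
10,NoSQL movement
20,New OrientDB
"""

COMMENTS_CSV = """id,postId,text
0,10,First
1,10,Second
21,10,Another
41,20,First again
82,20,Second Again
"""


def read_rows(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def orphan_comments(posts, comments):
    """Return the ids of comments whose postId matches no post id."""
    post_ids = {row["id"] for row in posts}
    return [row["id"] for row in comments if row["postId"] not in post_ids]


posts = read_rows(POSTS_CSV)
comments = read_rows(COMMENTS_CSV)
orphans = orphan_comments(posts, comments)
print(orphans)
```

With real files, the same check can be run by replacing io.StringIO(text) with an open file handle.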
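(2) ETL configuration
The OrientDB ETL tool needs only a JSON file that defines the ETL process: an Extractor, a list of Transformers executed in a pipeline, and a Loader that loads the graph elements into the OrientDB database.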
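The Extractor, Transformers and Loader roles can be pictured with a toy pipeline in Python. This is only a conceptual sketch and not how the OrientDB ETL module is implemented; extract_csv, to_vertex and run_pipeline are hypothetical names, and a plain list stands in for the database.

```python
import csv
import io


def extract_csv(text):
    """Extractor: yield one dict per CSV row."""
    yield from csv.DictReader(io.StringIO(text))


def to_vertex(class_name):
    """Transformer factory: tag each record with a vertex class."""
    def transform(record):
        return {"@class": class_name, **record}
    return transform


def run_pipeline(records, transformers, load):
    """Pass every extracted record through the transformers, then load it."""
    count = 0
    for record in records:
        for transform in transformers:
            record = transform(record)
        load(record)
        count += 1
    return count


database = []  # stand-in for the target database
loaded = run_pipeline(
    extract_csv("id,title\n10,NoSQL movement\n20,New OrientDB\n"),
    [to_vertex("Post")],
    database.append,
)
print(loaded, database)
```

The JSON configuration describes these three stages declaratively instead of in code.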
Below are the 2 ETL files that import the posts and the comments separately.
post.json ETL file
{
  "source": { "file": { "path": "D:/software/orientdb/data/posts.csv" } },
  "extractor": { "csv": {} },
  "transformers": [
    { "vertex": { "class": "Post" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "remote:localhost/demodb",
      "serverUser": "root",
      "serverPassword": "root",
      "dbType": "graph",
      "classes": [
        {"name": "Post", "extends": "V"},
        {"name": "Comment", "extends": "V"},
        {"name": "HasComments", "extends": "E"}
      ],
      "indexes": [
        {"class": "Post", "fields": ["id:integer"], "type": "UNIQUE"}
      ]
    }
  }
}
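The Loader contains all the information needed to connect to the OrientDB database. This configuration connects to a running OrientDB server through a "remote:" URL; if no server is running, a local "plocal:" database can be used instead, which is also faster. Note the classes and indexes declared in the Loader: when the Loader is configured, they are created if they do not already exist. The index on the Post.id field ensures that there are no duplicate posts and that the lookup performed when creating edges (see below) is fast enough.
comments.json ETL file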
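The two jobs of the index on Post.id, rejecting duplicates and making lookups fast, can be sketched with a dictionary. This is an illustrative model only, not OrientDB's index implementation; the UniqueIndex class is hypothetical.

```python
class UniqueIndex:
    """Toy unique index: one record per key, constant-time lookup."""

    def __init__(self, field):
        self.field = field
        self._entries = {}

    def put(self, record):
        key = record[self.field]
        if key in self._entries:
            raise ValueError(f"duplicate {self.field}: {key}")
        self._entries[key] = record

    def get(self, key):
        return self._entries.get(key)


post_id_index = UniqueIndex("id")
post_id_index.put({"id": 10, "title": "NoSQL movement"})
post_id_index.put({"id": 20, "title": "New OrientDB"})

try:
    post_id_index.put({"id": 10, "title": "Duplicate"})
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(duplicate_rejected, post_id_index.get(20))
```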
{
  "source": { "file": { "path": "D:/software/orientdb/data/comments.csv" } },
  "extractor": { "csv": {} },
  "transformers": [
    { "vertex": { "class": "Comment" } },
    { "edge": {
        "class": "HasComments",
        "joinFieldName": "postId",
        "lookup": "Post.id",
        "direction": "in"
      }
    }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "remote:localhost/demodb",
      "serverUser": "root",
      "serverPassword": "root",
      "dbType": "graph",
      "classes": [
        {"name": "Post", "extends": "V"},
        {"name": "Comment", "extends": "V"},
        {"name": "HasComments", "extends": "E"}
      ],
      "indexes": [
        {"class": "Post", "fields": ["id:integer"], "type": "UNIQUE"}
      ]
    }
  }
}
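This file is similar to the previous one, but here the Edge transformer does the work. Because the link in the CSV goes in the opposite direction (Comment -> Post) to the one we want to model (Post -> Comment), we set the direction to "in" (the default is "out").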
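The effect of the direction setting can be illustrated with a small Python sketch. It mimics the idea of joining on postId and looking the value up among the posts; the make_edge function and the "label" field are assumptions made for illustration and are not the ETL module's API.

```python
def make_edge(edge_class, current_vertex, lookup_index, join_field, direction="out"):
    """Link the current vertex to the vertex found through the join field.

    "out": the edge goes from the current vertex to the looked-up vertex.
    "in":  the edge goes from the looked-up vertex to the current vertex.
    """
    target = lookup_index.get(current_vertex[join_field])
    if target is None:
        return None  # nothing to link to
    if direction == "in":
        source, destination = target, current_vertex
    else:
        source, destination = current_vertex, target
    return {"class": edge_class, "from": source["label"], "to": destination["label"]}


# Posts keyed by id, standing in for the lookup on Post.id.
posts_by_id = {10: {"label": "Post 10"}, 20: {"label": "Post 20"}}
comment = {"label": "Comment 0", "postId": 10}

edge_in = make_edge("HasComments", comment, posts_by_id, "postId", direction="in")
edge_out = make_edge("HasComments", comment, posts_by_id, "postId")
print(edge_in)
print(edge_out)
```

With "in", the edge starts at the post and ends at the comment, which is the Post -> Comment direction this example wants.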
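(3) Run the ETL process
Now run the two imports in sequence, posts first so that the edge lookup on Post.id can find them. Open a shell in the OrientDB home directory and execute the following steps:
$ cd bin
$ ./oetl.sh post.json
$ ./oetl.sh comments.json
On Windows, use the batch script instead:
oetl.bat post.json
oetl.bat comments.json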
Once both scripts have executed successfully, your blog has been imported into OrientDB as a graph!
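If you prefer to script the two runs, a minimal Python sketch could look like the following. It assumes it is started from the bin directory and that oetl.sh signals failure through a non-zero exit code, which this example does not verify; build_commands and run_in_order are hypothetical helper names.

```python
import subprocess


def build_commands(script, configs):
    """Build one command per ETL configuration, preserving the given order."""
    return [[script, config] for config in configs]


def run_in_order(commands, runner=subprocess.run):
    """Run the commands one by one and stop at the first failure."""
    for command in commands:
        # check=True raises CalledProcessError on a non-zero exit code.
        runner(command, check=True)


# Posts first, then comments.
commands = build_commands("./oetl.sh", ["post.json", "comments.json"])
print(commands)
```

Calling run_in_order(commands) then executes both imports; on Windows, pass "oetl.bat" as the script name instead.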
(4) Check the database
Open the database in the OrientDB console and execute the following commands to check that the import worked:
$ ./console.sh
OrientDB console v.2.0-SNAPSHOT (build 2565) www.orientechnologies.com
Type 'help' to display all the supported commands.
Installing extensions for GREMLIN language v.2.6.0
orientdb>CONNECT remote:localhost/demodb root root
Connecting to database [plocal:/temp/databases/blog] with user 'admin'...OK
orientdb {db=blog}> select expand( out() ) from Post where id = 10
----+-----+-------+----+------+-------+--------------
# |@RID |@CLASS |id |postId|text |in_HasComments
----+-----+-------+----+------+-------+--------------
0 |#12:0|Comment|0 |10 |First |[size=1]
1 |#12:1|Comment|1 |10 |Second |[size=1]
2 |#12:2|Comment|21 |10 |Another|[size=1]
----+-----+-------+----+------+-------+--------------
3 item(s) found. Query executed in 0.002 sec(s).
orientdb {db=blog}> select expand( out() ) from Post where id = 20
----+-----+-------+----+------+------------+--------------
# |@RID |@CLASS |id |postId|text |in_HasComments
----+-----+-------+----+------+------------+--------------
0 |#12:3|Comment|41 |20 |First again |[size=1]
1 |#12:4|Comment|82 |20 |Second Again|[size=1]
----+-----+-------+----+------+------------+--------------
2 item(s) found. Query executed in 0.001 sec(s).