Importing CSV Data into Neo4j

Table of Contents

  • Importing Data into Neo4j
    • 1. Data Preparation
    • 2. Stop the Service
    • 3. Import the Data
    • 4. Start the Service
    • 5. Problems You May Run Into
      • What if I imported the wrong data and want to drop the database?
      • What if the import fails?
        • Error 1 and its fix
        • Error 2 and its fix
        • Error 3 and its fix
        • Error 4 and its fix
      • What if synchronization fails after a successful import?
        • The service dies after a while
        • The import succeeded but the service won't start
    • 6. Tips (you may regret skipping these during an import)

1. Data Preparation

Prepare the data you want to import as CSV files. Two kinds are needed: node CSV files and relationship CSV files.

For example, for people and phone numbers you would prepare the node files people.csv and phone.csv plus the relationship file relation.csv.

The CSV contents look like this:

# people.csv

people:ID,gender,age,:LABEL
"people1",F,11,people
"people2",F,19,people
"people3",M,14,people
"people4",F,12,people
"people5",F,16,people
"people6",M,29,people
"people7",F,69,people
......................
# phone.csv

phone:ID,province_code,city_code,:LABEL
"phone1",12,23,phone
"phone2",14,57,phone
"phone3",17,89,phone
"phone4",19,23,phone
"phone5",45,89,phone
"phone6",98,36,phone
"phone7",35,69,phone
......................
# relation.csv

:START_ID,:END_ID
"people1","phone3"
"people2","phone1"
"people3","phone4"
"people4","phone2"
"people5","phone6"
"people6","phone5"
"people7","phone7"
......................

That completes the data preparation stage.
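Any tool that emits plain CSV can produce these files. As a quick end-to-end check of the header format, here is a minimal shell sketch that generates toy files equivalent to the samples above (the sizes and attribute values are made up for illustration):

```shell
# Generate toy node and relationship CSVs in the layout shown above.
# The :ID column comes first and :LABEL last, as neo4j-import expects.
{
  echo 'people:ID,gender,age,:LABEL'
  for i in 1 2 3 4 5 6 7; do
    echo "\"people$i\",F,$((10 + i)),people"
  done
} > people.csv

{
  echo 'phone:ID,province_code,city_code,:LABEL'
  for i in 1 2 3 4 5 6 7; do
    echo "\"phone$i\",$((10 + i)),$((20 + i)),phone"
  done
} > phone.csv

{
  echo ':START_ID,:END_ID'
  for i in 1 2 3 4 5 6 7; do
    echo "\"people$i\",\"phone$i\""
  done
} > relation.csv
```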

2. Stop the Service

Stop the Neo4j service with the command neo4j stop.

3. Import the Data

Upload the data to the server, then run the import command:

neo4j-import \
--into ../data/databases/graph.db \
--nodes people.csv \
--nodes phone.csv \
--relationships:have relation.csv \
--skip-duplicate-nodes=true \
--skip-bad-relationships=true \
--stacktrace --bad-tolerance=500000

Note: if the neo4j commands are not on your PATH, you will find them in the bin directory under the Neo4j home directory.

4. Start the Service

Start the Neo4j service with the command neo4j start.
Once it is up, the first thing the cluster does is probably synchronize data: if you check the size of the data directory with du -sb data/, you will likely see it growing continuously.

5. Problems You May Run Into

What if I imported the wrong data and want to drop the database?

If the imported data is wrong and you want to delete the database: Neo4j has no sophisticated deletion mechanism, and the official recommendation is simply to delete the directory from disk. My approach is to delete the configured data directory directly (by default, the data directory under the Neo4j home). Each time, I stop the service, delete that directory on all nodes, and start the service again. The next time you connect through the browser or cypher-shell you will have to initialize the password again -- it is reset to the default neo4j.
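The whole reset can be scripted. Here is a sketch as a function so the install path can vary; the store path is the Neo4j 3.x default, so adjust it if you changed dbms.directories.data in neo4j.conf. Run it on every cluster node:

```shell
# Hypothetical helper: wipe the default graph.db store and restart.
# $1 is the Neo4j home directory (e.g. /bigdata/neo4j-enterprise-3.5.6).
reset_graph_db() {
  home=$1
  "$home/bin/neo4j" stop
  rm -rf "$home/data/databases/graph.db"
  "$home/bin/neo4j" start
  # after the restart, the password is back to the default (neo4j)
}
```

After running it, reconnect through the browser or cypher-shell and set a new password.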

What if the import fails?

Error 1 and its fix

0 (global id space)-[have]->phone0 (global id space) referring to missing node 0

The import writes a log of bad entries to /*/data/databases/graph.db/bad.log. If it contains many missing node lines like the one above, you imported a relationship before its node was created -- here, the relationship referring to phone0 was imported while the node phone0 did not exist yet, hence missing node. Go back and check the data from the preparation stage.
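You can catch this class of problem before the import ever runs. A sketch, assuming the relationship file uses the :START_ID,:END_ID layout from the preparation stage and every node file keeps its :ID in the first column:

```shell
# Print every relationship endpoint with no matching node ID.
# Usage: missing_endpoints relation.csv people.csv phone.csv ...
missing_endpoints() {
  rel=$1; shift
  tmp=$(mktemp -d)
  # collect node IDs: first column of each node file, skipping each header
  awk -F, 'FNR > 1 { print $1 }' "$@" | sort -u > "$tmp/nodes"
  # collect endpoints: both columns of the relationship file
  awk -F, 'FNR > 1 { print $1; print $2 }' "$rel" | sort -u > "$tmp/ends"
  # endpoints that are not in the node set
  comm -23 "$tmp/ends" "$tmp/nodes"
  rm -rf "$tmp"
}
```

An empty output means every relationship endpoint resolves to a known node ID.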

Error 2 and its fix

Data statistics is not available.
Peak memory usage: 0.00 B
Import error: /bigdata/neo4j-enterprise-3.5.6/data/databases/graph.db already contains data, cannot do import here
Caused by:/bigdata/neo4j-enterprise-3.5.6/data/databases/graph.db already contains data, cannot do import here
java.lang.IllegalStateException: /bigdata/neo4j-enterprise-3.5.6/data/databases/graph.db already contains data, cannot do import here
	at org.neo4j.unsafe.impl.batchimport.store.BatchingNeoStores.assertDatabaseIsEmptyOrNonExistent(BatchingNeoStores.java:193)
	at org.neo4j.unsafe.impl.batchimport.store.BatchingNeoStores.createNew(BatchingNeoStores.java:174)
	at com.neo4j.unsafe.impl.batchimport.RestartableParallelBatchImporter.fastForwardToLastCompletedState(RestartableParallelBatchImporter.java:190)
	at com.neo4j.unsafe.impl.batchimport.RestartableParallelBatchImporter.doImport(RestartableParallelBatchImporter.java:113)
	at org.neo4j.tooling.ImportTool.doImport(ImportTool.java:581)
	at org.neo4j.tooling.ImportTool.main(ImportTool.java:482)
	at org.neo4j.tooling.ImportTool.main(ImportTool.java:379)
	Suppressed: java.lang.IllegalStateException: VM pause monitor is not started
		at org.neo4j.util.Preconditions.checkState(Preconditions.java:142)
		at org.neo4j.kernel.monitoring.VmPauseMonitor.stop(VmPauseMonitor.java:71)
		at org.neo4j.unsafe.impl.batchimport.staging.OnDemandDetailsExecutionMonitor.done(OnDemandDetailsExecutionMonitor.java:128)
		at org.neo4j.unsafe.impl.batchimport.staging.MultiExecutionMonitor.done(MultiExecutionMonitor.java:82)
		at org.neo4j.unsafe.impl.batchimport.staging.MultiExecutionMonitor.done(MultiExecutionMonitor.java:82)
		at org.neo4j.unsafe.impl.batchimport.ImportLogic.close(ImportLogic.java:520)
		at com.neo4j.unsafe.impl.batchimport.RestartableParallelBatchImporter.doImport(RestartableParallelBatchImporter.java:118)
		... 3 more


Neo4j can only bulk-import into a database that does not yet exist -- in other words, only the first import works; a second import into the same store fails. Count this as one of Neo4j's drawbacks. Delete the database as described in the deletion section above and the import will work again -- just make absolutely sure what you delete is not a database production depends on (if you delete an important production database, congratulations: time to buy a ticket and run).

Error 3 and its fix

Import error: org.neo4j.io.pagecache.impl.FileLockException: This file is locked by another process, please ensure you don't have another Neo4j process or tool using it: '/bigdata/neo4j-enterprise-3.5.6/data/databases/graph.db/neostore'.'
Caused by:org.neo4j.io.pagecache.impl.FileLockException: This file is locked by another process, please ensure you don't have another Neo4j process or tool using it: '/bigdata/neo4j-enterprise-3.5.6/data/databases/graph.db/neostore'.'
org.neo4j.kernel.impl.store.UnderlyingStorageException: org.neo4j.io.pagecache.impl.FileLockException: This file is locked by another process, please ensure you don't have another Neo4j process or tool using it: '/bigdata/neo4j-enterprise-3.5.6/data/databases/graph.db/neostore'.'
	at org.neo4j.kernel.impl.store.NeoStores.verifyRecordFormat(NeoStores.java:217)
	at org.neo4j.kernel.impl.store.NeoStores.(NeoStores.java:144)
	at org.neo4j.kernel.impl.store.StoreFactory.openNeoStores(StoreFactory.java:129)
	at org.neo4j.kernel.impl.store.StoreFactory.openAllNeoStores(StoreFactory.java:93)
	at org.neo4j.unsafe.impl.batchimport.store.BatchingNeoStores.instantiateStores(BatchingNeoStores.java:237)
	at org.neo4j.unsafe.impl.batchimport.store.BatchingNeoStores.createNew(BatchingNeoStores.java:181)
	at com.neo4j.unsafe.impl.batchimport.RestartableParallelBatchImporter.fastForwardToLastCompletedState(RestartableParallelBatchImporter.java:190)
	at com.neo4j.unsafe.impl.batchimport.RestartableParallelBatchImporter.doImport(RestartableParallelBatchImporter.java:113)
	at org.neo4j.tooling.ImportTool.doImport(ImportTool.java:581)
	at org.neo4j.tooling.ImportTool.main(ImportTool.java:482)
	at org.neo4j.tooling.ImportTool.main(ImportTool.java:379)
	Suppressed: java.lang.IllegalStateException: VM pause monitor is not started
		at org.neo4j.util.Preconditions.checkState(Preconditions.java:142)
		at org.neo4j.kernel.monitoring.VmPauseMonitor.stop(VmPauseMonitor.java:71)
		at org.neo4j.unsafe.impl.batchimport.staging.OnDemandDetailsExecutionMonitor.done(OnDemandDetailsExecutionMonitor.java:128)
		at org.neo4j.unsafe.impl.batchimport.staging.MultiExecutionMonitor.done(MultiExecutionMonitor.java:82)
		at org.neo4j.unsafe.impl.batchimport.staging.MultiExecutionMonitor.done(MultiExecutionMonitor.java:82)
		at org.neo4j.unsafe.impl.batchimport.ImportLogic.close(ImportLogic.java:520)
		at com.neo4j.unsafe.impl.batchimport.RestartableParallelBatchImporter.doImport(RestartableParallelBatchImporter.java:118)
		... 3 more
Caused by: org.neo4j.io.pagecache.impl.FileLockException: This file is locked by another process, please ensure you don't have another Neo4j process or tool using it: '/bigdata/neo4j-enterprise-3.5.6/data/databases/graph.db/neostore'.'
	at org.neo4j.io.pagecache.impl.SingleFilePageSwapper.acquireLock(SingleFilePageSwapper.java:227)
	at org.neo4j.io.pagecache.impl.SingleFilePageSwapper.(SingleFilePageSwapper.java:178)
	at org.neo4j.io.pagecache.impl.SingleFilePageSwapperFactory.createPageSwapper(SingleFilePageSwapperFactory.java:66)
	at org.neo4j.io.pagecache.impl.muninn.MuninnPagedFile.(MuninnPagedFile.java:149)
	at org.neo4j.io.pagecache.impl.muninn.MuninnPageCache.map(MuninnPageCache.java:412)
	at org.neo4j.kernel.impl.store.MetaDataStore.getRecord(MetaDataStore.java:285)
	at org.neo4j.kernel.impl.store.NeoStores.verifyRecordFormat(NeoStores.java:198)
	... 10 more


If you hit the error above, it is certain that your Neo4j service was still running when you started the import. Go back to step 2, stop the service, and start over.
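A cheap guard against this mistake is to check the service state right before importing. A sketch, under the assumption (3.x behaviour) that neo4j status exits 0 only while the server is running; NEO4J_BIN is a hypothetical variable pointing at the neo4j launcher, defaulting to the one on PATH:

```shell
# Complain and return 1 if the server is still up; call this right
# before running neo4j-import.
assert_neo4j_stopped() {
  if "${NEO4J_BIN:-neo4j}" status >/dev/null 2>&1; then
    echo "Neo4j is still running - stop it before importing"
    return 1
  fi
  echo "Neo4j is stopped - safe to import"
}
```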

Error 4 and its fix

Id 'xxx' is defined more than once in group 'global id space'
Id 'xxx' is defined more than once in group 'global id space'
Id 'xxx' is defined more than once in group 'global id space'
Id 'xxx' is defined more than once in group 'global id space'

Lines like these -- Id xxx is defined more than once -- point to one problem: your CSV preparation stage went wrong and node IDs are duplicated, so the import reports an error.
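This, too, can be detected before the import runs. A sketch, assuming the :ID value is the first column of the node file as in the samples above:

```shell
# Print each node ID that appears more than once in a node CSV.
# Usage: duplicate_ids people.csv
duplicate_ids() {
  # skip the header row, take the ID column, report repeated values
  tail -n +2 "$1" | cut -d, -f1 | sort | uniq -d
}
```

An empty output means the file is safe with respect to this error.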

What if synchronization fails after a successful import?

The service dies after a while

Checking processes with jps shows the CommercialEntryPoint process dying on its own after a while; until then the cluster looks perfectly normal. Checking the data directories on the other nodes shows that they really do keep growing:

[root@n07 neo4j]# du -sh data
107G	data
[root@n07 neo4j]# du -sh data
108G	data
[root@n07 neo4j]# du -sh data
109G	data
[root@n07 neo4j]# du -sh data
109G	data
[root@n07 neo4j]# du -sh data
110G	data
[root@n07 neo4j]# du -sh data
127G	data

But checking the directory again some time later, it has shrunk to a few KB or stopped changing. The debug.log under the logs directory reveals the error. On the master of the core members, the key message is:

Server failed to join cluster within catchup time limit [600000 ms]

The other nodes log connection-close errors such as channelClose.
Fix it by changing the configuration:

# The time limit allowed for a new member to attempt to update its data to match the rest of the cluster.
causal_clustering.join_catch_up_timeout=100m

The default is 10m, and 10m = 10 * 60 * 1000 = 600000, exactly the limit shown in the master's error message [600000 ms]. As its comment says, this parameter bounds how long a new member may take to synchronize its data with the rest of the cluster. I raised it straight to 100m because my data set is large; adjust it to your needs.

The import succeeded but the service won't start

After the import succeeded, the data directories on the other nodes contained data and the CommercialEntryPoint process was running, but attempts to enter cypher-shell were always refused, and netstat -ntlp showed that none of ports 7474, 7473 and 7687 were open.
The debug.log and neo4j.log under the logs directory kept growing: debug.log kept emitting WARN messages, and neo4j.log kept emitting INFO lines waiting for a leader:

# hostnames redacted in this log
2019-07-02 09:51:34.430+0000 INFO  Discovering other core members in initial members set: [xxx:5000, xxx:5000, xxx:5000]
2019-07-02 09:51:43.388+0000 INFO  Bound to cluster with id ad9f67ce-9ab7-4cb2-ab37-f76f85fcc1fb
2019-07-02 09:51:43.564+0000 INFO  Discovered core member at xxx:5000
2019-07-02 09:51:43.572+0000 INFO  Discovered core member at xxx:5000
2019-07-02 09:52:13.198+0000 INFO  Waiting to hear from leader...
2019-07-02 09:52:41.198+0000 INFO  Waiting to hear from leader...
2019-07-02 09:53:09.199+0000 INFO  Waiting to hear from leader...
2019-07-02 09:53:37.199+0000 INFO  Waiting to hear from leader...
2019-07-02 09:54:05.200+0000 INFO  Waiting to hear from leader...
2019-07-02 09:54:33.200+0000 INFO  Waiting to hear from leader...
2019-07-02 09:55:01.201+0000 INFO  Waiting to hear from leader...
2019-07-02 09:55:29.201+0000 INFO  Waiting to hear from leader...
2019-07-02 09:55:57.202+0000 INFO  Waiting to hear from leader...
2019-07-02 09:56:25.202+0000 INFO  Waiting to hear from leader...
2019-07-02 09:56:53.203+0000 INFO  Waiting to hear from leader...
2019-07-02 09:57:21.203+0000 INFO  Waiting to hear from leader...
2019-07-02 09:57:49.204+0000 INFO  Waiting to hear from leader...
2019-07-02 09:58:17.204+0000 INFO  Waiting to hear from leader...
2019-07-02 09:58:45.205+0000 INFO  Waiting to hear from leader...
2019-07-02 09:59:13.206+0000 INFO  Waiting to hear from leader...

But no amount of waiting made it healthy. After some testing, I renamed the data directory on every node:

mv data data.copy

Then I restarted the cluster and it came up fine, which showed the data in my data directory was the problem. So I deleted the data on every node except the leader, let them resynchronize, started the cluster again, and the problem was gone.

6. Tips (you may regret skipping these during an import)

The official documentation also provides some import conveniences; the scripts in this section are official demos.
Neo4j requires a CSV header on import, but if a file is huge, appending a header to it is itself a time-consuming task, so Neo4j offers a shortcut. You can create a separate CSV file holding only the header, then at import time list that header file together with the data files -- just make sure the header file comes first. For example, the command can be written as:

neo4j_home$ bin/neo4j-admin import --nodes="import/movies4-header.csv,import/movies4-part1.csv,import/movies4-part2.csv" --nodes="import/actors4-header.csv,import/actors4-part1.csv,import/actors4-part2.csv" --relationships="import/roles4-header.csv,import/roles4-part1.csv,import/roles4-part2.csv"

I was importing files from HDFS, and an import command listing a huge pile of files felt fragile, so I merged them with the hdfs getmerge command -- 514 GB after merging -- and then appended the CSV header. That is obviously a very clumsy approach. After reading the official documentation I realized it had anticipated all of this: besides separate CSV header files, it also lets you match file names with regular expressions. Note: regular expressions, not shell globs -- do not mix them up. For example, the import command can be written as:

neo4j_home$ bin/neo4j-admin import --nodes="import/movies4-header.csv,import/movies4-part.*" --nodes="import/actors4-header.csv,import/actors4-part.*" --relationships="import/roles4-header.csv,import/roles4-part.*"

You may also run into relationships whose start or end node was never created. To keep that from failing the import, try adding the --ignore-missing-nodes option. Demo:

neo4j_home$ bin/neo4j-admin import --nodes=import/movies8a.csv --nodes=import/actors8a.csv --relationships=import/roles8a.csv --ignore-missing-nodes

A node whose ID has already been imported will also cause an error the second time around; the official fix is --ignore-duplicate-nodes, which skips duplicate nodes.
Demo:

neo4j_home$ bin/neo4j-admin import --nodes=import/actors8b.csv --ignore-duplicate-nodes

By now you may be confused: why do the import commands above and below differ, one using neo4j-import and the other neo4j-admin import? In fact both work, and their performance is about the same; only some of the options differ.
Compare:

# Import command 1
neo4j-import \
--into ../data/databases/graph.db \
--nodes people.csv \
--nodes phone.csv \
--relationships:have relation.csv \
--skip-duplicate-nodes=true \
--skip-bad-relationships=true \
--stacktrace --bad-tolerance=500000
# Import command 2
neo4j-admin import  \
--nodes people.csv  \
--nodes phone.csv  \
--relationships:call phone_phone.csv  \
--relationships:have relation.csv \
--ignore-duplicate-nodes=true \
--ignore-missing-nodes=true
