Graph analytics with py2neo

Network data is everywhere and it is becoming increasingly important for data scientists to have a working knowledge of graph analytics. One challenge that data scientists often face is the lack of scalability of graph analytics solutions. In this blog, I discuss how you can use py2neo combined with neo4j to build a scalable graph analytics solution from scratch.

Many python libraries have been created to perform graph analytics. The most popular ones are networkx, scikit-network, and graph-tool. All of these packages are great; however, if you are working with large amounts of data, you might want to consider using the power of neo4j.

Neo4j is a graph database, which means that it is designed specifically for storing and analyzing large graph datasets. Think of the transactional databases of supermarkets or network data from social media platforms. The neo4j community edition is a free version of neo4j that anyone can download.

Py2neo is a python package that allows the programmer to use the power of neo4j in python. It works by establishing a connection to neo4j which allows the programmer to execute queries on the neo4j database and write the results to a pandas dataframe (or other data types). Unfortunately, the documentation of Py2neo is not perfect (see this thread). Below, I have outlined all the steps that need to be taken to start using py2neo for your next graph analytics project.

Getting the py2neo connection to neo4j to work will probably be the hardest part of your graph analysis project. With the steps below, however, you can get started in less than 10 minutes!

Step 1: Download the neo4j community edition

Download the neo4j community edition (image by author)

Step 2: Go through the registration process

This step is pretty straightforward. Just go through the steps.

Step 3: Remove the Movie Database (or any other pre-installed database)

Click on the three dots in the upper right corner of the database. Then press Remove.

Removing the existing database (image by author)

Step 4: Add a new database which will be used for the project

First, click on the grey area with the plus icon and the text Add Database. After clicking, you have the opportunity to create a local database or connect to a remote DBMS. For this project, we will create a local database. Now it's time to give the graph a name and set a password. The graph name can be anything, since it has no influence on the project. The password, however, is more important and will be needed later on (note it down!).

Adding the database consists of three steps (image by author)

Step 5: Manage the database

This is the most complicated step and the one that discourages a lot of people from using neo4j instead of the standard python libraries. Click once again on the three dots in the upper right corner of your database. This time we select Manage. This opens a separate page as shown in the image below (right-hand side).

Go to the management page for your graph (image by author)

Now we need to take a couple of steps. First, we need to activate two libraries: (1) APOC and (2) Graph Data Science Library. This can be done by clicking on the Plugins tab and pressing the install buttons.

Activating APOC and GDS (image by author)

Once we have installed the libraries, we move to the Settings tab to allow the procedures of these two libraries to be executed in neo4j. To do this, we need to add the following text to the settings, with each setting on its own line:

dbms.security.procedures.unrestricted=gds.*,apoc.*
dbms.security.procedures.whitelist=gds.*,apoc.*
apoc.import.file.enabled=true
apoc.export.file.enabled=true

It does not matter where you add this text.

Changing the settings of the neo4j database (image by author)
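
Because each setting has to sit on its own line, a quick check can catch entries that were pasted together by accident. The sketch below is only an illustration: `parse_settings` and `missing_settings` are hypothetical helper names, and it assumes you have copied the text of the Settings tab into a python string.

```python
# Hypothetical helpers (not part of neo4j or py2neo): check pasted settings
# text for the four entries used in this guide.
REQUIRED_SETTINGS = {
    "dbms.security.procedures.unrestricted": "gds.*,apoc.*",
    "dbms.security.procedures.whitelist": "gds.*,apoc.*",
    "apoc.import.file.enabled": "true",
    "apoc.export.file.enabled": "true",
}


def parse_settings(text):
    """Parse 'key=value' lines into a dict, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_settings(text, required=REQUIRED_SETTINGS):
    """Return the required keys that are absent or carry a different value."""
    found = parse_settings(text)
    return sorted(key for key, value in required.items() if found.get(key) != value)
```

An empty list from `missing_settings(settings_text)` suggests all four entries were entered as expected; any key it returns is worth checking by hand.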

If you will be working with very large datasets, you might also want to raise the maximum heap size. This can be done by changing dbms.memory.heap.max_size=1G to whatever size your machine can handle.

Step 6: Copy the path of the import folder

Since we will be working with data that is not yet loaded into neo4j, we would like to import a dataset. In order to do this using python, we need to know the import folder of our neo4j database. This is the folder that the database uses to load new datasets into the database.

To find the import folder, we press the Open Folder button at the top of the screen and then copy the address of the folder that was opened. Save this somewhere because it will be used in the code!

Open the folder and copy the address
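
A wrongly copied path is easy to miss until the import fails, so it can help to check the folder from python first. This is a minimal sketch using only the standard library; `graphml_target` is a hypothetical helper name, and the default file name simply mirrors the `file.graphml` used in the code later in this guide.

```python
from pathlib import Path


def graphml_target(import_folder, filename="file.graphml"):
    """Return the full path for a file inside the neo4j import folder.

    Raises FileNotFoundError early if the folder does not exist, which
    usually means the copied path is wrong.
    """
    folder = Path(import_folder).expanduser()
    if not folder.is_dir():
        raise FileNotFoundError(f"Import folder not found: {folder}")
    return folder / filename
```

Calling `graphml_target(import_folder)` with the address you copied either returns the location where the dataset will be written or fails right away with a readable error.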

Step 7: Start the database and find host location

The next step is to start the database. This is as easy as pressing the start button (the triangle at the left-hand side).

Starting the database (image by author)

Once the database is loaded, you go back to the home screen and open the database in the neo4j browser.

Open the database (image by author)

Then, copy the link where the neo4j database is hosted. In my case this was: bolt://localhost:7687. Save this somewhere as well!

Save the host location (image by author)
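
Before moving on, you may want to confirm that something is actually listening at the address you copied. The sketch below uses only the python standard library; `parse_bolt_url` and `is_listening` are hypothetical helper names, and the check only tells you that a TCP port is open, not that the service behind it is neo4j.

```python
import socket
from urllib.parse import urlparse


def parse_bolt_url(url):
    """Split a URL such as 'bolt://localhost:7687' into (host, port)."""
    parts = urlparse(url)
    if parts.scheme != "bolt" or parts.hostname is None or parts.port is None:
        raise ValueError(f"Expected a URL like bolt://host:port, got {url!r}")
    return parts.hostname, parts.port


def is_listening(url, timeout=2.0):
    """Return True if something accepts TCP connections at the URL's host and port."""
    host, port = parse_bolt_url(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `is_listening('bolt://localhost:7687')` returns False, the database is most likely not started yet or is hosted at a different address.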

Now we are ready to start coding!

Analyzing the karate club network using neo4j

What follows is a brief analysis of the karate club network dataset. The goal is to show you how to connect to the neo4j database, how to load the dataset into neo4j, and how to analyze the network in neo4j; all, of course, using py2neo.

First, we load the required packages and define the connection settings noted down in the steps above:

import networkx as nx
from py2neo import Graph


import_folder = '.../installation-4.1.0/import'
url = 'bolt://localhost:7687'
user = "neo4j" # Default by neo4j
password = '...'

Make sure that your neo4j database is running! Next, we define functions that allow us to connect to the neo4j database and load data onto it.

def connectToNeo4j(user, password):
    '''
    Establishes a connection with neo4j.
    Default neo4j username is 'neo4j'
    '''
    try:
        connection = Graph(url, user=user, password=password)
    except Exception:
        # The first attempt sometimes fails, so try once more
        connection = Graph(url, user=user, password=password)
    return connection




def removeExistingData(connection):
    '''
    Neo4j does not replace a dataset when additional data is added, that is
    why we need to remove all existing data before loading the dataset again
    '''
    connection.run("MATCH (n) DETACH DELETE n")




def loadNetworkToNeo4j(network, connection):
    '''
    Writes the network to the import folder as GraphML, loads it into
    neo4j and creates the GDS graph used by the algorithms
    '''
    nx.write_graphml(network, import_folder + '/file.graphml')
    removeExistingData(connection)
    connection.run("call apoc.import.graphml('file.graphml', {})")
    try:
        connection.run("CALL gds.graph.create('gds_graph', '*', 'RELATED')")
    except Exception: # If name already taken, I assume the gds_graph was already loaded
        print("GDS graph already created")
    return "Dataset loaded and ready to use"

Note that we try to connect to the neo4j database twice. This works around a known issue with py2neo. Also note that I call a procedure named gds.graph.create(). It transforms a neo4j graph into a Graph Data Science graph (remember the Graph Data Science Library we activated?). This transformation is required to analyze the dataset.
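
The two connection attempts can also be written as a small retry helper. This is a sketch, not something py2neo provides: `connect_with_retry` is a hypothetical name, and it accepts any zero-argument callable so that it does not depend on how the connection is created.

```python
import time


def connect_with_retry(connect, attempts=2, delay=1.0):
    """Call connect() until it succeeds or the attempts run out.

    connect is any zero-argument callable, for example
    lambda: Graph(url, user=user, password=password).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)  # brief pause before trying again
```

With the variables defined earlier, `connect_with_retry(lambda: Graph(url, user=user, password=password))` behaves like the function above, but it re-raises the last error instead of hiding it when every attempt fails.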

Finally, we create some functions that can be used to analyze the dataset:

def pageRank(connection):
    query = "CALL gds.pageRank.stream('gds_graph')"
    return connection.run(query).to_data_frame()


def louvain(connection):
    query = "CALL gds.louvain.stream('gds_graph')"
    return connection.run(query).to_data_frame()

I included one function that calculates the PageRank centrality of the nodes and one that assigns each node to a community using the Louvain algorithm. The functionality can easily be extended since all functions follow the same pattern. A full overview of the procedures supported by the neo4j Graph Data Science Library can be found in the library's documentation.
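
Since only the algorithm name changes between these functions, the shared pattern can be captured in one helper. The following is a sketch under assumptions: `stream_query` and `runAlgorithm` are hypothetical names, and whether a given procedure such as gds.labelPropagation.stream is available depends on the Graph Data Science Library version you installed.

```python
import re


def stream_query(algorithm, graph_name="gds_graph"):
    """Build the Cypher call for a GDS algorithm in stream mode."""
    # Only allow simple names so arbitrary text cannot end up in the query string.
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9.]*", algorithm):
        raise ValueError(f"Unexpected algorithm name: {algorithm!r}")
    if not re.fullmatch(r"[A-Za-z0-9_]+", graph_name):
        raise ValueError(f"Unexpected graph name: {graph_name!r}")
    return f"CALL gds.{algorithm}.stream('{graph_name}')"


def runAlgorithm(connection, algorithm, graph_name="gds_graph"):
    """Run a GDS stream procedure and return the rows as a pandas DataFrame."""
    return connection.run(stream_query(algorithm, graph_name)).to_data_frame()
```

Here `runAlgorithm(connection, 'pageRank')` sends the same query as the pageRank function above, and another algorithm can be tried by passing its name instead.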

The result of each algorithm is returned as a pandas.DataFrame.

The functions above can be combined into one script that visualizes the karate club network based on the PageRank centrality and the community of the nodes.

# Load the karate club network that ships with the networkx package
network = nx.karate_club_graph()


# Establish a connection with neo4j and load the dataset
connection = connectToNeo4j(user, password)
loadNetworkToNeo4j(network, connection)


# Calculate metrics
pr = pageRank(connection)
# Use a separate name so the louvain() function is not overwritten
communities = louvain(connection)


# Plot results
pr.score = pr.score.apply(lambda x: x*250)
nx.draw_kamada_kawai(G = network,
        node_size = pr.score,
        node_color = communities.communityId,
        alpha = 0.7,
        width = 0.4)

The resulting plot is shown below. The size of the nodes corresponds to the PageRank centrality and the color of the nodes shows which nodes belong to the same community according to the Louvain algorithm.

Visualization of the result (image by author)
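
One caveat about the plotting code: networkx applies node sizes and colors in the order of the nodes in the graph, so the script relies on the result rows coming back in that same order. If they do not, the values can be reordered first. The helper below is a sketch with a hypothetical name, `align_to_nodes`. It assumes each result row carries a field whose values match the networkx node labels, which is not guaranteed for the `nodeId` column because that holds neo4j's internal ids. The `label` field in the usage note is made up for illustration.

```python
def align_to_nodes(rows, nodes, key, value, default=None):
    """Return one value per node, in the same order as nodes.

    rows  -- list of dicts, e.g. DataFrame.to_dict("records")
    nodes -- node labels in plotting order, e.g. list(network.nodes)
    key   -- name of the field that identifies the node in each row
    value -- name of the field to extract (e.g. "score" or "communityId")
    """
    lookup = {row[key]: row[value] for row in rows}
    return [lookup.get(node, default) for node in nodes]
```

For example, `align_to_nodes(pr.to_dict("records"), list(network.nodes), "label", "score")` would give a list that can be passed as node_size, provided the rows really contain such a `label` field.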

Summary

Even though python already has a number of promising graph analysis libraries, it might still be worth using the power of neo4j when working with large amounts of data. Unfortunately, the documentation of py2neo, the package that links python with neo4j, leaves room for improvement. This guide has shown you how to set up neo4j and do some basic analyses. All methods shown above scale to larger datasets.

Want to know more?

Introduction to various graph algorithms:

Graph analytics in neo4j:

Translated from: https://towardsdatascience.com/graph-analytics-with-py2neo-f629ba71051b
