Visualizing programming languages: the programming language influence graph

A network visualization tutorial with Gephi and Sigma.js

Here’s a preview of what we’ll be making today: the programming languages influence graph. Check out the link to explore the “design influence” relationships between over 250 programming languages past and present!

Your turn!

In today’s hyper-connected world, networks are a ubiquitous aspect of modern life.

Take the start of my day so far — I used London’s transport network to travel into town. Then I went into a branch of my favourite coffee shop and used my Chromebook to connect to their Wi-Fi network. Next, I logged in to the various social networking sites I frequent.

It’s no secret that some of the most influential companies of the last few decades owe their success to the power of networks.

Facebook, Twitter, Instagram, LinkedIn and other social media platforms rely on the small-world properties of social networks. This lets them connect their users with each other (and advertisers) effectively.

Google owes much of its current success to its early dominance of the search engine market, enabled in part by its ability to return relevant results with the help of its PageRank network algorithm.

Amazon’s efficient distribution network allows them to offer same-day delivery in some major cities.

Networks are also super-important in fields such as Artificial Intelligence and Machine Learning. Neural networks are a very active field of research. Many feature detection algorithms, essential in Computer Vision, rely heavily on using networks to model different parts of images.

A wide range of scientific phenomena can also be understood in terms of network models. This includes quantum mechanics, biochemical pathways, and ecological and socio-economic systems.

Given their undeniable importance, then, how can we better understand networks and their properties?

The mathematical study of networks is known as “graph theory”, and is one of the more accessible branches of mathematics. This article aims to provide an introduction, assuming little prior knowledge or experience.

We’ll be using Python 3.x and some awesome open-source software called Gephi to put together a network visualization of how a range of programming languages past and present are linked by influence.

But first…

What exactly is a network?

The examples described above give us some clues. Transport networks are made up of destinations connected by routes. Social networks are made up of individuals, connected through their relationships to one another. Google’s search engine algorithms evaluate the “rank” of different webpages by looking at which pages link out to others.

More generally, a network is any system that can be described in terms of nodes and edges, or in colloquial terms, “dots and lines”.

Some systems are readily abstracted in this manner. Social networks are perhaps the most obvious example. Computer filesystems are another — folders and files are linked by their “parent” and “child” relationships.

But the real power of networks comes from the fact that many, many systems can be abstracted and modelled in network terms, even if at first it isn’t obvious how.

Representing networks

We need to go a little beyond pen-and-paper sketches to analyze and describe networks mathematically. How can we turn pictures of dots and lines into numbers we can crunch?

One solution is to draw up an adjacency matrix to represent our network.

Matrices are one of those concepts that might sound a little intimidating if you’re not familiar with them, but fear not. Think of them as grids of numbers which can be used to perform many calculations all at once. Here’s an example below:

         Python  Java  Scala  C#
Python        0     1      0   0
Java          0     0      0   1
Scala         0     1      0   0
C#            0     1      0   0

In this matrix, the intersection of each row and column is either 0 or 1, depending on whether or not the respective languages are linked. You can check this against the illustration above!

For most purposes, the adjacency matrix is a good way of representing a network mathematically. From a computational perspective, however, it can sometimes be a bit cumbersome.

For instance, with even a relatively modest number of nodes (say 1000), there will be a much larger number of elements in the matrix (e.g., 1000² = 1,000,000).

Many real-world systems yield sparse networks. In these networks, most nodes only connect to a small proportion of all the others.

If we represented a 1000-node sparse network in computer memory as an adjacency matrix, we’d have 1,000,000 entries stored in RAM (about 1,000,000 bytes at one byte per entry). Most will be zeros. There’s got to be a more efficient way of going about this.

An alternative approach is to work with edge lists instead. These are exactly what they say they are. They are simply a list of which node pairs link to each other.

For example, the programming languages network above can be represented as follows:

Java, Python
Java, Scala
Java, C#
C#, Java

For larger networks, this is a much more computationally efficient means of representing them. It is of course possible to generate an adjacency matrix from an edge list (and vice versa). It’s not like we have to pick one or the other.

Another means of representing networks is the adjacency list. This lists every node followed by the nodes it links to. For example:

Java: Python, Scala, C#
C#: Java
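
To make the three representations concrete, here is a minimal Python sketch that starts from the edge list above and derives the adjacency list and the adjacency matrix from it. The function and variable names are illustrative rather than part of the tutorial's script, and the matrix follows the layout of the table above, where each column names the language a link starts from and each row names the language it points to.

```python
from collections import defaultdict

# Edge list from the example above: (source, target) pairs.
edges = [("Java", "Python"), ("Java", "Scala"), ("Java", "C#"), ("C#", "Java")]
names = ["Python", "Java", "Scala", "C#"]


def to_adjacency_list(edge_list):
    """Map each source node to the list of nodes it links to."""
    adjacency = defaultdict(list)
    for source, target in edge_list:
        adjacency[source].append(target)
    return dict(adjacency)


def to_adjacency_matrix(edge_list, node_names):
    """Build a 0/1 grid with one row per target and one column per source."""
    index = {name: i for i, name in enumerate(node_names)}
    size = len(node_names)
    matrix = [[0] * size for _ in range(size)]
    for source, target in edge_list:
        matrix[index[target]][index[source]] = 1
    return matrix


adjacency_list = to_adjacency_list(edges)
adjacency_matrix = to_adjacency_matrix(edges, names)

for name, row in zip(names, adjacency_matrix):
    print(f"{name:<8}", row)
print(adjacency_list)

# The matrix always stores len(names) ** 2 entries; the edge list stores one per link.
print(len(names) ** 2, "matrix entries versus", len(edges), "edges")
```

Even in this four-node example the matrix holds 16 entries for 4 links, which is the gap that grows quickly for large sparse networks.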

Collecting data, making connections

Any network model and visualisation will only be as good as the data used to construct it. This means, as well as ensuring the data is both accurate and complete, we also need to justify a means of inferring edges between nodes.

In many respects, this is the critical step. Any subsequent analysis and inferences made about the network depend on being able to justify the “linkage criterion”.

For example, in social network analysis, you might link people based upon whether they follow one another on social media. In molecular biology, you might link genes based upon their co-expression.

Often, the method used to link nodes will allow for weights to be assigned to the edges, giving a measure of “strength”.

For instance, in the context of online retail, you could link products based upon how often they are purchased together. Products that are frequently bought together would be linked by a higher weighted edge than products which are only sometimes bought together. Products that are bought together no more often than would be expected by chance wouldn’t be linked at all.
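
One possible way to put numbers on this idea is sketched below. It is only an assumption about how such weights might be computed: the baskets are made up, and the weight used is "lift", the ratio of observed co-purchases to the number expected if the two products were bought independently. Pairs that do not beat that chance baseline get no edge.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets; each set is one order.
orders = [
    {"tea", "mug"},
    {"tea", "mug", "kettle"},
    {"tea", "mug"},
    {"kettle", "socks"},
    {"tea", "socks"},
    {"mug", "socks"},
]


def weighted_edges(baskets):
    """Return {(a, b): lift} for pairs bought together more often than chance predicts."""
    total = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    edges = {}
    for (a, b), together in pair_counts.items():
        # Expected number of co-purchases if a and b were bought independently.
        expected = item_counts[a] * item_counts[b] / total
        lift = together / expected
        if lift > 1:  # keep only pairs that beat the chance baseline
            edges[(a, b)] = lift
    return edges


co_purchase_edges = weighted_edges(orders)
print(co_purchase_edges)
```

Other weighting schemes (raw counts, correlation, a statistical test) would fit the same pattern; the point is that the linkage criterion is an explicit, inspectable function.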

As you might imagine, the methods for linking nodes to one another can be as sophisticated as you like.

However, for this tutorial we’ll be using a simpler means of connecting programming languages. We’re gonna rely on the accuracy of Wikipedia.

For our purposes, this should be fine. Wikipedia’s success is a testament to it doing something right. The open-source, collaborative method by which articles are written should ensure some degree of objectivity.

Also, its relatively consistent page structure makes it a convenient playground for trying out web-scraping techniques.

Another bonus is the extensive, well-documented Wikipedia API, which makes information retrieval easier still. Let’s get started.

Step 1: Installing Gephi

Gephi is available on Linux, Mac and Windows. You can download it here.

For this project, I was using Lubuntu. If you’re on Ubuntu/Debian, then you can follow the steps below to get Gephi up and running. Otherwise, the installation process will likely be much the same as whatever you’re familiar with.

Download the latest version (at the time of writing this was v.0.9.1) of Gephi for your system. When it’s ready, you’ll need to extract the files.

cd Downloads
tar -xvzf gephi-0.9.1-linux.tar.gz
cd gephi-0.9.1/bin
./gephi

You may need to check your version of the Java JRE. Gephi requires a recent version. On my relatively fresh install of Lubuntu, I simply installed the default-jre, and everything worked from there.

apt install default-jre
./gephi

There’s one more step before you’re ready to get underway. In order to export the graph to the Web, you can use the Sigma.js plugin for Gephi.

From Gephi’s menu bar, choose the “Tools” option, and select “Plugins”.

Click on the “Available Plugins” tab and select “SigmaExporter” (I also installed JSON Exporter, because it’s another useful plugin to have around).

Hit the “Install” button and you’ll be walked through the process. You’ll need to restart Gephi once you’re done.

Step 2: Writing the Python script

This tutorial will use Python 3.x, plus a few modules to make life easier. Using the pip module installer, run the following command:

pip3 install wikipedia beautifulsoup4

Now, in a new directory, create a file called something like script.py, and open it up in your favourite code editor/IDE. Below is an outline of the main logic:

  1. First, you’ll need a list of programming languages to include.

  2. Next, go through that list and retrieve the HTML of the relevant Wikipedia article.

  3. From this, extract a list of programming languages that each language has influenced. This will be a rough-and-ready linkage criterion.

  4. While you’re at it, it’d be nice to grab some metadata about each language.

  5. Finally, you’ll want to write all the data you’ve collected to a .csv file.

The full script can be found in this gist.

Import some modules

In script.py, start by importing a few modules which will make things easier:

import csv
import wikipedia
import urllib.request
from bs4 import BeautifulSoup as BS
import re

OK — begin by making a list of nodes to include. This is where the Wikipedia module comes in handy. It makes accessing the Wikipedia API super-easy.

Add the following code:

pageTitle = "List of programming languages"
nodes = list(wikipedia.page(pageTitle).links)
print(nodes)

If you save and run this script, you’ll see it prints out all the links from the “List of programming languages” Wikipedia article. Nice!

However, it’s always sensible to manually inspect any automatically collected data. A quick glance will reveal that, as well as many actual programming languages, the script has also picked up a few extra links.

For example, you might see “List of markup languages”, “Comparison of programming languages” and others in there.

Although Gephi lets you remove nodes you’d rather not include, it wouldn’t hurt to “clean” the data before proceeding. If anything, this will save time later on.

removeList = [
    "List of",
    "Lists of",
    "Timeline",
    "Comparison of",
    "History of",
    "Esoteric programming language"
    ]

nodes = [i for i in nodes if not any(r in i for r in removeList)]

These lines define a list of substrings to be removed from the data. The script then goes through the data, removing any elements that contain any of the unwanted substrings.

In Python, this requires just one line of code!
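
To see what that one line does without calling the Wikipedia API, here is a small self-contained sketch. The sample titles are illustrative only; the real list comes from the Wikipedia page.

```python
# Illustrative titles only; the real list comes from the Wikipedia page.
sample_nodes = [
    "Python (programming language)",
    "List of markup languages",
    "Comparison of programming languages",
    "Haskell",
    "Timeline of programming languages",
]
remove_list = ["List of", "Lists of", "Timeline", "Comparison of", "History of"]

# Keep a title only if none of the unwanted substrings occurs anywhere in it.
cleaned = [
    title
    for title in sample_nodes
    if not any(fragment in title for fragment in remove_list)
]
print(cleaned)
```

Because this is plain substring matching, any title that merely contains one of the fragments is dropped as well, so a quick look at what was removed is worthwhile.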

Some helper functions

Now you can start scraping Wikipedia to build up an edge list (and collect any metadata). To make this easier, first define a few functions.

Grabbing HTML

The first function uses the BeautifulSoup module to get hold of the HTML for each language’s Wikipedia page.

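from urllib.parse import quote

base = "https://en.wikipedia.org/wiki/"

def getSoup(n):
    # Return the infobox table for article title n, or None on any failure.
    try:
        # Percent-encode the title so that spaces and symbols form a valid URL.
        url = base + quote(n.replace(" ", "_"))
        with urllib.request.urlopen(url) as response:
            soup = BS(response.read(), 'html.parser')
            table = soup.find_all("table", class_="infobox vevent")[0]
            return table
    except Exception:
        return None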

This function uses the urllib.request module to get hold of the HTML for the page at “https://en.wikipedia.org/wiki/” followed by the article title.

This is then passed to BeautifulSoup, which reads and parses the HTML into an object we can use to search for information.

Next, use the find_all() method to extract the HTML element you’re interested in.

Here, this will be the summary table at the top of each programming language article. How can these be identified?

The easiest way is to visit one of the programming language pages. Here, you can simply use the browser’s Developer Tools to inspect the elements of interest.

The summary table has the HTML tag <table> and the CSS classes "infobox" and "vevent", so you can use these to identify the table in the HTML.

Specify this with the arguments:

  • "table" and

  • class_="infobox vevent"

find_all() returns a list of all elements that match the criteria. In order to actually specify the element you’re interested in, add the index [0]. If the function is successful, it returns the table object. Otherwise, it returns None.

With any automated data collection procedure, it’s always important to handle exceptions thoroughly. If not, then in the best case scenario the script crashes and you’ll need to start over.

In the worst case, you’ll end up with a data set riddled with inconsistencies and errors. This will make it a nightmare to work with down the line.
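
One way to guard against both outcomes is to catch only the errors you expect and keep a record of what failed, so gaps in the data are visible afterwards. The sketch below is an assumption about how that could look, not the tutorial's own code: fetch_html, failures and offline_opener are hypothetical names, and the offline stub exists only so the example runs without a network connection.

```python
import urllib.error
import urllib.request

failures = []  # (title, reason) pairs, kept for inspection after the run


def fetch_html(title, opener=urllib.request.urlopen):
    """Return the page bytes for title, or None if the request fails."""
    url = "https://en.wikipedia.org/wiki/" + title.replace(" ", "_")
    try:
        with opener(url) as response:
            return response.read()
    except urllib.error.URLError as error:
        # URLError also covers HTTPError, which is its subclass.
        failures.append((title, repr(error)))
        return None


def offline_opener(url):
    """Stand-in for urlopen that always fails, so the sketch runs offline."""
    raise urllib.error.URLError("no connection")


result = fetch_html("Python (programming language)", opener=offline_opener)
print(result, failures)
```

After a real run, a non-empty failures list tells you exactly which pages to retry or inspect by hand.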

Retrieve metadata

The next function uses the table object to look for some metadata. Here, it searches the table for the year the language first appeared.

def getYear(t):
    try:
        t = t.get_text()
        year = t[t.find("appear"):t.find("appear")+30]
        year = re.match(r'.*([1-3][0-9]{3})',year).group(1)
        return int(year)
    except:
        return "Could not determine"

This short function takes the table object as its argument, and uses BeautifulSoup’s get_text() function to produce a string.

The next step is to create a substring called year. This takes the 30 characters after the first appearance of the word "appear". This string should contain the year the language first appeared.

In order to extract just the year, use a regular expression (courtesy of the re module) to match any characters that begin with a digit between 1 and 3, and are followed by three digits.

re.match(r'.*([1-3][0-9]{3})',year)
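
The pattern's behaviour is easy to check on a few strings. The snippets below are made up in the style of an infobox row, and year_from_snippet is a hypothetical helper used only for this demonstration.

```python
import re

YEAR_PATTERN = r'.*([1-3][0-9]{3})'  # same pattern as in getYear()


def year_from_snippet(snippet):
    """Return the year captured by the pattern, or None if nothing matches."""
    found = re.match(YEAR_PATTERN, snippet)
    return int(found.group(1)) if found else None


# Made-up snippets in the style of an infobox row.
simple = year_from_snippet("appeared 1995; 29 years ago")
two_years = year_from_snippet("appeared 1972 (revised 1989)")
no_year = year_from_snippet("appeared recently")
print(simple, two_years, no_year)
```

Two details are worth knowing. The leading `.*` is greedy, so when the 30-character window holds more than one candidate the last one is captured. And because `.` does not match a newline by default, a year that sits after a line break will not be found.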

If this is successful, the function returns year as an integer. Otherwise, it returns a sad-looking “Could not determine”. You might wish to scrape further metadata — such as paradigm, designer or typing discipline.

One more function for you. This time, you’ll feed in the table object for a given language, and hopefully get back a list of other programming languages.

def getLinks(t):
    try:
        table_rows = t.find_all("tr")
        for i in range(0,len(table_rows)-1):
            try:
                if table_rows[i].get_text() == "\nInfluenced\n":
                    out = []
                    for j in table_rows[i+1].find_all("a"):
                        try:
                            out.append(j['title'])
                        except:
                            continue
                    return out
            except:
                continue
        return
    except:
        return

Woah, look at all that nesting… What is actually going on here then?

This function makes use of the fact that the table objects have a consistent structure. The information in the table is stored in rows (the relevant HTML tag is <tr>). One of these rows will contain the text "\nInfluenced\n". The first part of the function finds which row this is.

Once this row has been found, you can then be pretty sure the next row contains links to each of the programming languages influenced by the current one. Find these links using find_all("a"), where the argument "a" corresponds to the HTML tag <a>.

For each link j, append its ["title"] attribute to a list called out. The ["title"] attribute is the one to use because it exactly matches the language’s name as stored in nodes.

For example, Java is stored in nodes as "Java (programming language)", so you need to use this exact name throughout the data set.

If successful, getLinks() returns a list of programming languages. The rest of the function deals with exception handling, in case something should go wrong at any stage.

Collecting the data

At last, you’re almost ready to sit back and let the script do its thing. It will collect the data and store it in two list objects.

edgeList = [["Source,Target"]]
meta = [["Id","Year"]]

Now write a loop that will apply the functions defined earlier to every item in nodes, and store the outputs in edgeList and meta.

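for n in nodes:
    # Fetch the infobox table for this language (None if the page could not be parsed).
    try:
        temp = getSoup(n)
    except:
        continue
    # Record an edge for every influenced language that is also in the node list.
    # If no "Influenced" row was found, getLinks() returns None, iterating over it
    # raises an error, and the language is skipped altogether.
    try:
        influenced = getLinks(temp)
        for link in influenced:
            if link in nodes:
                edgeList.append([n+","+link])
                print([n+","+link])
    except:
        continue
    # Store the metadata for languages that made it this far.
    year = getYear(temp)
    meta.append([n,year])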

This loop takes each language in nodes and attempts to retrieve the summary table from its Wikipedia page.

Then, it retrieves all the languages the table lists as having been influenced by the language in question.

For each language that also appears in the nodes list, append an element to edgeList in the form of ["source,target"]. In this way, you’ll build up an edge list to feed into Gephi.

For debugging purposes, print each element added to edgeList — just to be sure everything’s working as it should. If you were being extra thorough, you could add print statements to the except clauses, too.

Next, get the language’s name and year, and append these to the meta list.
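
The edge-building step can also be tried without any scraping. The sketch below is a variation on the loop above, not a copy of it: the scraped dictionary is hypothetical stand-in data, targets are checked against a set rather than a list, a missing "Influenced" list is treated as "no links", and each row keeps source and target as separate fields.

```python
# Hypothetical scraped results: each language mapped to the titles it influenced,
# or None where no "Influenced" row was found.
scraped = {
    "Java (programming language)": [
        "Scala (programming language)",
        "Kotlin",
        "Some language outside the list",
    ],
    "Scala (programming language)": ["Kotlin"],
    "Kotlin": None,
}

known = set(scraped)  # set membership tests stay fast for long node lists
edge_rows = [["Source", "Target"]]

for source, influenced in scraped.items():
    for target in influenced or []:  # treat a missing list as "no links"
        if target in known:          # ignore titles outside the node list
            edge_rows.append([source, target])

print(edge_rows)
```

Filtering on the node list is what keeps stray links (people, companies, concepts) out of the final graph.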

Writing to CSV

Once the loop has run, the final step is to write the contents of edgeList and meta to comma-separated values (CSV) files. This is easily done with the csv module imported earlier.

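# newline="" lets the csv module control line endings, as its documentation recommends.
with open("edge_list.csv","w",newline="",encoding="utf-8") as f:
    wr = csv.writer(f)
    for e in edgeList:
        wr.writerow(e)

with open("metadata.csv","w",newline="",encoding="utf-8") as f2:
    wr = csv.writer(f2)
    for m in meta:
        wr.writerow(m)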
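
It can be worth checking what csv.writer actually emits for the two row shapes used in this script. The sketch below writes to an in-memory buffer instead of a file, and compares a single field that contains a comma (the shape used for edgeList) with two separate fields (the shape used for meta).

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)

# One field that happens to contain a comma, as in edgeList above.
writer.writerow(["Java (programming language),Scala (programming language)"])
# Two separate fields.
writer.writerow(["Java (programming language)", "Scala (programming language)"])

lines = buffer.getvalue().splitlines()
for line in lines:
    print(line)
```

With the default settings, the single-field row comes out wrapped in double quotes, while the two-field row is written as plain comma-separated text. Whether a given importer reads the quoted form as one column or two is something to verify by opening edge_list.csv before importing it; storing source and target as separate list items avoids the question.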

Done! Save the script, and from the terminal run:

$ python3 script.py

You should see the script printing out each source-target pair as it builds up the edge list. Make sure your internet connection is steady, and sit back while the script does its magic.

Step 3: Graph building with Gephi

Hopefully you got Gephi installed and running earlier. Now you can create a new project and use the data you gathered to build a directed graph. This will show how different programming languages have influenced one another!

Start by creating a new project in Gephi, and switch to the “Data Laboratory” view. This provides a spreadsheet-like interface for handling data in Gephi. The first thing to do is import the edge list.

  • Click “Import spreadsheet”.

  • Choose the edge_list.csv file generated by the Python script. Ensure that Gephi knows to use the commas as the separator.

  • Choose “Edge List” from the List type.

  • Click “Next” and check that you are importing both Source and Target columns as strings.

This should update the Data Lab with a list of nodes. Now, import the metadata.csv file. This time, make sure to choose “Nodes list” from the List type.

Switch over to the “Preview” tab, and see how the network looks.

Ah… It’s just a little bit… monochrome. And messy. Like a plate of spaghetti. Let’s fix this.

Making it pretty

There are all sorts of ways you can work on the presentation, and here’s where a little bit of creative freedom comes in. With network visualisations, there are essentially three things to take into consideration:

  1. Positioning: There are several algorithms which can generate layout patterns for a network. A popular choice is the Fruchterman-Reingold algorithm, which is available in Gephi.

  2. Sizing: The size of nodes in a graph can be used to represent some interesting property. Often, this is a centrality measure. There are many ways of measuring centrality, but they all reflect the “importance” of a given node, in terms of how well-connected it is to the rest of the network.

  3. Coloring: It is also possible to use color to show some property of a node. Often, color is used to indicate community structure. This is broadly defined as a “group of nodes which are more connected with each other than with the rest of the graph”. In a social network, this can reveal friendship, family or professional groups. There are several algorithms which can detect community structure. Gephi comes with the Louvain method built-in.

To make these changes, you will need to calculate some statistics. Switch to the “Overview” window. Here you will see a panel on the right. It should contain a “Statistics” tab. Open this, and you will see a range of options.

Gephi comes with many inbuilt statistical capabilities. For each of them, clicking “Run” will generate a report that will reveal insights about the network.

Some useful ones to know include:

  • Average degree: The average language is connected to about four others. The report also shows a degree distribution graph. This reveals that most languages have very few connections, while a small proportion have many. This suggests that this is a scale-free network. Much research has been done on scale-free networks, and the processes that generate them.

  • Diameter: This network has a diameter of 12, meaning that the longest of all the shortest paths between any two languages spans 12 connections. The average path length is just under four. This means that, on average, any two connected languages are separated by about four edges. These figures give a measure of the “size” of the network.

  • Modularity: This is a score that shows how “compartmentalized” the network is. Here, the modularity score is about 0.53. This is relatively high, suggesting there are distinct modules within this network. Again, this indicates something interesting about the underlying system. Languages tend to fall into distinct “influence groups”.
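
To get a feel for what the degree and diameter figures measure, here is a minimal standard-library sketch that computes them for a tiny made-up directed graph. It is an illustration under stated assumptions rather than a description of Gephi's implementation: the average degree is taken as edges divided by nodes, and path lengths follow edge direction and only count pairs that are actually connected.

```python
from collections import defaultdict, deque

# A tiny made-up directed edge list; swap in the rows from edge_list.csv.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "A")]

successors = defaultdict(list)
nodes = set()
for source, target in edges:
    successors[source].append(target)
    nodes.update((source, target))

# Average degree of a directed graph: each edge adds one out-degree.
average_degree = len(edges) / len(nodes)


def shortest_path_lengths(start):
    """Breadth-first search: number of edges from start to each reachable node."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in successors[current]:
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    return distance


# Collect the shortest path length for every ordered pair that is connected.
lengths = [
    length
    for node in nodes
    for other, length in shortest_path_lengths(node).items()
    if other != node
]
diameter = max(lengths)
average_path_length = sum(lengths) / len(lengths)
print(average_degree, diameter, average_path_length)
```

The diameter is simply the largest of those shortest-path lengths, and the average path length is their mean.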

Anyhow, to modify the appearance of the network, head over to the left panel.

In the “Layout” tab, you can select which layout algorithm to use. Hit “Run” and watch the graph shift about in real-time! See which layout algorithm you think works best.

Above the Layout tab is the “Appearance” tab. Here, you can play with different settings for the node and edge colors, sizes and labels. These can be configured based upon attributes (including the stats you get Gephi to calculate).

As a suggestion, you could:

  • Color the nodes by their Modularity attribute. This colors them according to their community membership.

  • Size the nodes by their Degree. Better connected nodes will appear larger than less connected ones.
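
For readers curious about the number behind the Modularity attribute, the sketch below computes the standard Newman modularity score for a small undirected graph with a given community assignment. The graph and the assignment are made up, and this is only the textbook formula: Gephi's own calculation (the Louvain method, with its resolution setting and handling of directed or weighted edges) may give different values.

```python
from collections import Counter

# Two triangles joined by a single edge (made-up data), treated as undirected.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}


def modularity(edge_list, membership):
    """Newman modularity: sum over communities of L_c/m - (d_c / 2m)^2."""
    m = len(edge_list)
    internal_edges = Counter()  # L_c: edges with both ends in community c
    total_degree = Counter()    # d_c: summed degree of the nodes in community c
    for u, v in edge_list:
        total_degree[membership[u]] += 1
        total_degree[membership[v]] += 1
        if membership[u] == membership[v]:
            internal_edges[membership[u]] += 1
    return sum(
        internal_edges[c] / m - (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


score = modularity(edges, community)
print(round(score, 3))
```

A score near zero means the grouping is no better than chance, while higher values mean more of the edges fall inside communities than a random wiring with the same degrees would produce.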

However, you should experiment and come up with a layout you like best.

Once you’re happy with the appearance of your graph, it is time to move on to the final step: exporting to the Web!

Step 4: Sigma.js

Already you have built a network visualisation that can be explored in Gephi. You could choose to take a screenshot, or save the graph in SVG, PDF or PNG format.

However, if you installed the Sigma.js plugin earlier, then why not export the graph to HTML? This will create an interactive visualisation that you can host online, or upload to GitHub and share with others.

To do this, select “Export > Sigma.js template…” from Gephi’s menu bar.

Fill in the details as required. Make sure to choose which directory you export the project to. You can change the title, legend, description, hover behavior and many other details. When you’re ready, click “OK”.

Now, if you navigate to the directory you exported the project to, you will see a folder containing all the files generated by Sigma.js.

Open up index.html in your favorite browser. Ta-da! There’s your network! If you know a little CSS and JavaScript, you can dive into the various generated files to tweak the output as you wish.

And that concludes this tutorial!

Summary

  • Many systems can be modelled and visualised as networks. Graph theory is a branch of math that provides tools to help understand network structures and properties.

  • You used Python to scrape data from Wikipedia to build a programming languages influence graph. The linkage criterion was whether a given language was listed as an influence on another’s design.

  • Gephi and Sigma.js are open-source tools that allow you to analyze and visualize networks. They allow you to export the network in image, PDF or Web formats.

Thanks for reading — I look forward to any comments or questions you might have! For a fantastic resource to learn more about graph theory, see Albert-László Barabási’s interactive online book.

The full code for this tutorial can be found here.

Translated from: https://www.freecodecamp.org/news/how-to-visualize-the-programming-language-influence-graph-7f1b765b44d1/
