Entity Relation Extraction with spaCy
Named Entity Recognition from a Wikipedia Article Using spaCy
In this article we'll try to find the names of people in a Wikipedia article using the Python spaCy library. If you plan to run the source code from this article, I assume you have already installed the spacy and wikipedia libraries from PyPI.
Articles are often too long, and we are only interested in certain information: either a summary, or the major events and major characters associated with the current article. Here we just try to find person names in different articles. Whether a word is the name of a person is determined using pretrained models. spaCy does a good job of labeling these, and that is what we explore in this article.
Steps
- Search for Wikipedia articles
- Use spaCy to create a document object
- Iterate over the entities and keep the ones labeled PERSON
- Count the frequency of each person and plot the counts in descending order
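The filtering and counting steps above can be sketched without spaCy or network access. The helper name `count_person_entities` and the sample data are illustrative assumptions, not part of the original code. The input stands in for the `(ent.text, ent.label_)` pairs a spaCy document would provide.

```python
from collections import Counter


def count_person_entities(entities, min_count=1):
    """Count PERSON entities and return (name, count) pairs, most frequent first.

    `entities` is any iterable of (text, label) pairs, for example
    ((ent.text, ent.label_) for ent in doc.ents) from a spaCy document.
    """
    counts = Counter(text for text, label in entities if label == "PERSON")
    return [(name, n) for name, n in counts.most_common() if n >= min_count]


# Hand-written sample standing in for spaCy output (illustrative only).
sample_entities = [
    ("Krishna", "PERSON"),
    ("India", "GPE"),
    ("Arjuna", "PERSON"),
    ("Krishna", "PERSON"),
    ("two", "CARDINAL"),
]

ranked = count_person_entities(sample_entities)
print(ranked)
```

Keeping the counting separate from the Wikipedia lookup and the plotting makes each step easy to check on small hand-made inputs.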
The following section lists the necessary imports. The wikipedia package is a Python API used to fetch Wikipedia content.
import wikipedia
import requests
import spacy
from collections import Counter
import matplotlib.pyplot as plt

# Load the large English pipeline (install it first with: python -m spacy download en_core_web_lg)
nlp = spacy.load('en_core_web_lg')
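If the large model is not installed, `spacy.load` raises an `OSError`. One possible way to handle that, sketched here as an assumption rather than something the original article does, is to try several model names in order. The helper `load_first_available` and its `loader` parameter are hypothetical. The parameter exists only so the fallback logic can be tried without spaCy installed.

```python
def load_first_available(model_names, loader=None):
    """Return the first pipeline that loads, trying the names in order.

    `loader` defaults to spacy.load; passing another callable is only
    meant for experimenting with the fallback logic.
    """
    if loader is None:
        import spacy  # imported lazily so the helper has no hard dependency

        loader = spacy.load
    for name in model_names:
        try:
            return loader(name)
        except OSError:
            # spacy.load raises OSError when the model package is missing
            continue
    raise OSError(f"None of these models could be loaded: {list(model_names)}")
```

With spaCy installed, a call such as `nlp = load_first_available(["en_core_web_lg", "en_core_web_sm"])` would prefer the large model and fall back to the small one. Smaller models are faster but may label fewer names correctly.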
Search on a specific page
Here we search for the page of a given article. I have chosen Lord Krishna as our starting point. Let's see which persons occur most frequently in the Wikipedia article related to Lord Krishna.
result = wikipedia.search("Krishna")
result

['Krishna',
 'Krishna Krishna',
 'Krishna Janmashtami',
 'Krishna (Telugu actor)',
 'Krishna Vamsi',
 'Krishna Bhagavaan',
 'International Society for Krishna Consciousness',
 'Krishna-Krishna',
 'Hare Krishna',
 'Krishna (TV series)']
We get the page content for the first result of our search.
page = wikipedia.page(result[0], preload= True)
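The first search hit is not always a usable page: the wikipedia package can raise `DisambiguationError` for ambiguous titles or `PageError` for missing ones. The sketch below is an assumption layered on top of the original code. It walks the search results until one loads. The helper name `first_resolvable_page` and its `fetch_page` and `skip_errors` parameters are hypothetical and exist so the loop can be exercised offline.

```python
def first_resolvable_page(titles, fetch_page=None, skip_errors=()):
    """Return the first page that can be fetched, or None if none can.

    When `fetch_page` is omitted, wikipedia.page is used and ambiguous
    or missing titles are skipped.
    """
    if fetch_page is None:
        import wikipedia  # lazy import: only needed for real lookups

        fetch_page = wikipedia.page
        skip_errors = (
            wikipedia.exceptions.DisambiguationError,
            wikipedia.exceptions.PageError,
        )
    for title in titles:
        try:
            return fetch_page(title)
        except skip_errors:
            continue
    return None
```

With the package installed, `page = first_resolvable_page(result)` could replace the direct `wikipedia.page(result[0], ...)` call. Check for a `None` result before using `page.content`.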
We get the parsed document using spaCy.
doc = nlp(page.content)
# from spacy import displacy
Let's find the page URL corresponding to the first result of our search query.
page.url

'https://en.wikipedia.org/wiki/Krishna'

# displacy.serve(doc, style="ent")
Let's explore the part-of-speech tags of different tokens in our page. For illustration purposes I am showing only the first few tokens here.
max_token_display = 10
for idx, token in enumerate(doc):
    # Print the token and its part-of-speech tag
    print(token.text, "-->", token.pos_)
    # idx starts at 0 and the check runs after printing,
    # so max_token_display + 2 tokens are shown
    if idx > max_token_display:
        break

The --> DET
Mahābhārata --> PROPN
( --> PUNCT
US --> PROPN
: --> PUNCT
, --> PUNCT
UK --> PROPN
: --> PUNCT
; --> PUNCT
Sanskrit --> ADJ
: --> PUNCT
महाभारतम् --> X
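To get a quick feel for a page, the tags can also be summarized rather than read one by one. The sketch below counts the tags using the twelve token and tag pairs printed above as sample data. With a real document the same counter could be built from `token.pos_` for each token in `doc`. The variable names are illustrative.

```python
from collections import Counter

# (token, tag) pairs copied from the output printed above.
tagged = [
    ("The", "DET"), ("Mahābhārata", "PROPN"), ("(", "PUNCT"), ("US", "PROPN"),
    (":", "PUNCT"), (",", "PUNCT"), ("UK", "PROPN"), (":", "PUNCT"),
    (";", "PUNCT"), ("Sanskrit", "ADJ"), (":", "PUNCT"), ("महाभारतम्", "X"),
]

# How often each part-of-speech tag occurs in the sample.
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common())

# Proper nouns are the tokens most likely to be part of a name.
proper_nouns = [tok for tok, tag in tagged if tag == "PROPN"]
print(proper_nouns)
```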
Here are some of the entity labels assigned to words appearing in the document, along with their character offsets.
for idx, ent in enumerate(doc.ents):
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
    if idx > max_token_display:
        break

Mahābhārata 4 15 PERSON
US 17 19 GPE
UK 23 25 GPE
Sanskrit 29 37 LANGUAGE
महाभारतम् 39 48 CARDINAL
Mahābhāratam 50 62 PERSON
two 108 111 CARDINAL
Sanskrit 118 126 NORP
India 144 149 GPE
Rāmāyaṇa 171 179 PERSON
two 214 217 CARDINAL
the Kurukshetra War 239 258 EVENT
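The two numbers printed for each entity are character offsets into the page text, so slicing the text with them returns the entity string. The sketch below checks this on the first rows of the output above and groups the entities by label. The `text` prefix is an assumption, reconstructed from the printed tokens and offsets, not fetched from Wikipedia. Note also that the model labels the epic's title as PERSON: the labels are statistical predictions and can be wrong.

```python
from collections import defaultdict

# Opening characters of the page, reconstructed from the offsets above
# (an assumption for illustration; \u0101 is the letter "ā").
text = "The Mah\u0101bh\u0101rata (US: , UK: ; Sanskrit"

# (text, start_char, end_char, label) rows copied from the output above.
rows = [
    ("Mah\u0101bh\u0101rata", 4, 15, "PERSON"),
    ("US", 17, 19, "GPE"),
    ("UK", 23, 25, "GPE"),
    ("Sanskrit", 29, 37, "LANGUAGE"),
]

by_label = defaultdict(list)
for ent_text, start, end, label in rows:
    # start_char and end_char are plain character offsets into the source text
    assert text[start:end] == ent_text
    by_label[label].append(ent_text)

print(dict(by_label))
```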
In the section below we collect all the entities labeled PERSON.
persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
Let's count how often each person name, as identified by spaCy, appears on a particular Wikipedia page.
person_count = Counter(persons)
print(person_count)

{'Pandavas': 31, 'Krishna': 25, 'Mahābhārata': 24, 'Mahabharata': 23, 'Pandu': 17, 'Dhritarashtra': 15, 'Yudhishthira': 14, 'Bhishma': 11, 'Kunti': 11, 'Kaurava': 8, 'Satyavati': 6, 'Madri': 6, 'Gandhari': 6, 'Vyasa': 5, 'Kuru': 5, 'Pandava': 5, 'Vichitravirya': 5, 'Vidura': 5, 'Kauravas': 5, 'Rāmāyaṇa': 4, 'Bhima': 4, 'Draupadi': 4, 'Jain': 4, 'Gupta': 3, 'Janamejaya': 3, 'Jaya': 3, 'Minkowski': 3, 'Parikshit': 3, 'Devavrata': 3, 'Amba': 3, 'Karna': 3, 'Yama': 3, 'Yayati': 3, 'Jarasandha': 3, 'Motilal Banarsidass': 3, 'BCE': 2, 'Ugraśrava Sauti': 2, 'Vasu': 2, 'Oberlies': 2, 'Kālidāsa': 2, 'Mahapadma Nanda': 2, 'Adhisimakrishna': 2, 'Shakuni': 2, 'Dushasana': 2, 'Ghatotkacha': 2, 'J. L. Fitzgerald': 2, 'P. Lal': 2, 'Bibek Debroy': 2, 'Shyam Benegal': 2, 'Vasudeva': 2, 'Jaini': 2, 'Oldenberg': 2}
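In the counts above, spelling variants such as 'Mahābhārata' (24) and 'Mahabharata' (23) are tallied separately. One possible refinement, which the original article does not apply, is to strip diacritics before counting so that such variants merge. The helper `strip_diacritics` is a hypothetical name. It relies only on the standard `unicodedata` module.

```python
import unicodedata
from collections import Counter


def strip_diacritics(name):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# A few counts copied from the output above.
raw_counts = Counter({"Mahābhārata": 24, "Mahabharata": 23, "Krishna": 25, "Kālidāsa": 2})

# Re-count under the normalized spelling.
merged = Counter()
for name, n in raw_counts.items():
    merged[strip_diacritics(name)] += n

print(merged.most_common())
```

This merges only accent variants. Singular and plural forms such as 'Pandava' and 'Pandavas' would still be counted separately.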
Sort the persons from the most to the fewest occurrences on the page, keeping only names that appear more than once.
person_count = {k: v for k, v in sorted(person_count.items(), key=lambda item: item[1], reverse=True) if v > 1}
print(person_count)

{'Pandavas': 31, 'Krishna': 25, 'Mahābhārata': 24, 'Mahabharata': 23, 'Pandu': 17, 'Dhritarashtra': 15, 'Yudhishthira': 14, 'Bhishma': 11, 'Kunti': 11, 'Kaurava': 8, 'Satyavati': 6, 'Madri': 6, 'Gandhari': 6, 'Vyasa': 5, 'Kuru': 5, 'Pandava': 5, 'Vichitravirya': 5, 'Vidura': 5, 'Kauravas': 5, 'Rāmāyaṇa': 4, 'Bhima': 4, 'Draupadi': 4, 'Jain': 4, 'Gupta': 3, 'Janamejaya': 3, 'Jaya': 3, 'Minkowski': 3, 'Parikshit': 3, 'Devavrata': 3, 'Amba': 3, 'Karna': 3, 'Yama': 3, 'Yayati': 3, 'Jarasandha': 3, 'Motilal Banarsidass': 3, 'BCE': 2, 'Ugraśrava Sauti': 2, 'Vasu': 2, 'Oberlies': 2, 'Kālidāsa': 2, 'Mahapadma Nanda': 2, 'Adhisimakrishna': 2, 'Shakuni': 2, 'Dushasana': 2, 'Ghatotkacha': 2, 'J. L. Fitzgerald': 2, 'P. Lal': 2, 'Bibek Debroy': 2, 'Shyam Benegal': 2, 'Vasudeva': 2, 'Jaini': 2, 'Oldenberg': 2}
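`Counter` already provides this ordering through `most_common()`, which could serve as a shorter alternative to the explicit `sorted` call. The sketch below compares the two on a few counts taken from the output above. The entry "Someone Else" is a made-up singleton added only to show the `v > 1` filter at work.

```python
from collections import Counter

# Counts copied from the output above, plus one made-up singleton.
person_count = Counter(
    {"Krishna": 25, "Pandavas": 31, "Bhishma": 11, "Kunti": 11, "Someone Else": 1}
)

# The approach used in the article: sort the items by count, descending.
via_sorted = {
    k: v
    for k, v in sorted(person_count.items(), key=lambda item: item[1], reverse=True)
    if v > 1
}

# Alternative: most_common() returns the items in the same descending order.
via_most_common = {k: v for k, v in person_count.most_common() if v > 1}

print(list(via_most_common.items()))
```

Both versions keep names with equal counts in the order they were first seen, so the resulting dictionaries list their keys identically.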
Here we plot the count for each person appearing on the page as a horizontal bar chart.
fig = plt.gcf()
ax = plt.gca()
fig.set_size_inches(25.5, 25.5)
plt.barh(list(person_count.keys()), person_count.values())
plt.xticks(rotation=0, fontsize=40)
plt.yticks(rotation=0, fontsize=25)
for i, v in enumerate(person_count.values()):
    ax.text(v + 2, i + 0, str(v), color='black', fontsize=20)
plt.show()
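`barh` draws the first item at the bottom of the axis, so a dictionary sorted in descending order puts the most frequent name at the bottom of the chart. If the largest bar should be on top, the data can be reversed before plotting. The helper `barh_series` and its `top_n` parameter below are hypothetical, a sketch rather than part of the original code.

```python
def barh_series(counts, top_n=None):
    """Return (labels, values) ordered so the largest count is drawn at the top."""
    items = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    if top_n is not None:
        items = items[:top_n]  # keep only the most frequent names
    items.reverse()  # barh draws the first item at the bottom of the axis
    labels = [name for name, _ in items]
    values = [n for _, n in items]
    return labels, values


# Counts copied from the output earlier in the article.
labels, values = barh_series(
    {"Pandavas": 31, "Krishna": 25, "Pandu": 17, "Kunti": 11}, top_n=3
)
print(labels, values)
```

The result can be passed straight to `plt.barh(labels, values)`. Calling `ax.invert_yaxis()` after plotting the unreversed data is another way to get the same layout.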
Check other pages
The following code consolidates everything and uses a different search query, 'Jesus'.
result = wikipedia.search("Jesus")
page = wikipedia.page(result[0], preload=True)
doc = nlp(page.content)
persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
person_count = Counter(persons)
person_count = {k: v for k, v in sorted(person_count.items(), key=lambda item: item[1], reverse=True) if v > 1}
fig = plt.gcf()
ax = plt.gca()
fig.set_size_inches(25.5, 25.5)
plt.barh(list(person_count.keys()), person_count.values())
plt.xticks(rotation=0, fontsize=40)
plt.yticks(rotation=0, fontsize=25)
for i, v in enumerate(person_count.values()):
    ax.text(v + 2, i + 0, str(v), color='black', fontsize=20)
plt.show()

result = wikipedia.search("Mahabharat")
page = wikipedia.page(result[0], preload=True)
doc = nlp(page.content)
persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
person_count = Counter(persons)
person_count = {k: v for k, v in sorted(person_count.items(), key=lambda item: item[1], reverse=True) if v > 1}
fig = plt.gcf()
ax = plt.gca()
fig.set_size_inches(25.5, 25.5)
plt.barh(list(person_count.keys()), person_count.values())
plt.xticks(rotation=0, fontsize=40)
plt.yticks(rotation=0, fontsize=25)
for i, v in enumerate(person_count.values()):
    ax.text(v + 2, i + 0, str(v), color='black', fontsize=20)
plt.show()
Create a function including all of the above
Finally, we can create a function that searches for a given term, takes the first page from the search results, and plots the person names found on it. The search title is passed as an argument. The details of each step are covered in the previous sections.
def plot_names_from_page(title="Mahabharat"):
    # Fetch the first search result and run the spaCy pipeline on it
    result = wikipedia.search(title)
    page = wikipedia.page(result[0], preload=True)
    doc = nlp(page.content)
    # Keep PERSON entities, count them, and drop names seen only once
    persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
    person_count = Counter(persons)
    person_count = {k: v for k, v in sorted(person_count.items(), key=lambda item: item[1], reverse=True) if v > 1}
    print(page.url)
    fig = plt.gcf()
    ax = plt.gca()
    fig.set_size_inches(25.5, 25.5)
    plt.barh(list(person_count.keys()), person_count.values())
    plt.xticks(rotation=0, fontsize=40)
    plt.yticks(rotation=0, fontsize=25)
    # plt.title(page.url, fontdict={'size': 20})
    for i, v in enumerate(person_count.values()):
        ax.text(v + 2, i + 0, str(v), color='black', fontsize=20)
    plt.show()
Finally, we can use the function above to get the occurrences of different names on a Wikipedia page. I tried to find names in articles on a variety of topics. The first one is the Iliad by Homer. Most of the names are characters in the book, though the result may also include the writer's name.
plot_names_from_page('Illiad')
The following are the names found in the article on the great Hindu epic Ramayan. As we would expect, the name of Lord Rama appears most often here.
plot_names_from_page('Ramayan')
plot_names_from_page('World_War_I')

https://en.wikipedia.org/wiki/World_War_I

plot_names_from_page('great depression')

https://en.wikipedia.org/wiki/Great_Depression

plot_names_from_page('higgs boson')

https://en.wikipedia.org/wiki/Higgs_boson
Translated from: https://medium.com/@pankaj.tiwari2/named-entity-recognition-from-wikipedia-article-using-spacy-73f8cbdc9851