spaCy official docs: Install spaCy · spaCy Usage Documentation
Contents
Introduction
I. Installation
1. Trained models
II. Features
1. Sentence segmentation (sentencizer)
2. Tokenization
3. Part-of-speech tagging
4. Stop words
5. Named entity recognition
6. Dependency parsing
7. Lemmatization
8. Noun chunks
9. Coreference resolution
III. Visualization
spaCy can be used for tokenization, named entity recognition, part-of-speech tagging, and more.
pip install spacy
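After installing, you also need to download an official trained model. Each language has its own models; this post uses the Chinese model for the demonstration: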
python -m spacy download zh_core_web_sm
Usage in code:
import spacy
nlp = spacy.load("zh_core_web_sm")
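spacy.load raises an error when the model package has not been downloaded. A minimal sketch of a guarded load, assuming the spacy.util.is_package helper; load_if_installed is a hypothetical helper name introduced only for this illustration:

```python
import spacy
from spacy.util import is_package

MODEL = "zh_core_web_sm"


def load_if_installed(name):
    """Return the loaded pipeline, or None when the package is not installed."""
    if not is_package(name):
        print(f"{name} is not installed; run: python -m spacy download {name}")
        return None
    return spacy.load(name)


nlp = load_if_installed(MODEL)
if nlp is not None:
    # Names of the pipeline components, in the order they run
    print(nlp.pipe_names)
```

Checking first keeps the failure message short and tells the reader which download command to run.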
Official model documentation:
Trained Models & Pipelines · spaCy Models Documentation
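Each language also has several models. For Chinese, besides the zh_core_web_sm downloaded above, there are zh_core_web_trf, zh_core_web_md, and others. They differ in accuracy and size: zh_core_web_sm is small but less accurate than zh_core_web_trf, which is correspondingly larger. This lets you pick a model that fits your use case.
This post uses zh_core_web_sm as the example; its pipeline contains the following components: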
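tok2vec: token-to-vector layer (computes the token embeddings shared by the later components; tokenization itself is done by the tokenizer before the pipeline runs)
tagger: part-of-speech tagging
parser: dependency parsing
senter: sentence segmentation
ner: named entity recognition
attribute_ruler: rule-based mapping of token attributes (I have not looked into this in detail)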
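The components above are trained. For comparison, spaCy also ships a rule-based sentencizer that splits on punctuation and needs no trained model. A minimal sketch, assuming spaCy v3 (where add_pipe takes a component name) and a blank Chinese pipeline whose default tokenizer splits text into single characters; the punct_chars list is my own choice for this example:

```python
import spacy

# Blank pipeline: no trained components, nothing to download.
nlp_blank = spacy.blank("zh")
# Rule-based sentence splitter; punct_chars is set explicitly for clarity.
nlp_blank.add_pipe("sentencizer", config={"punct_chars": ["。", "!", "?"]})

text = "小米董事长叶凡决定投资华为。在2002年,他还创作了<遮天>。"
sentences = [sent.text for sent in nlp_blank(text).sents]
for sentence in sentences:
    print(sentence)
```

Because the split is driven only by the listed punctuation marks, the result does not depend on tagging or parsing quality.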
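The model's documentation page lists which POS tags, dependency labels, and entity types it includes.
Here are explanations of some of the phrase and tag labels:
IP: simple clause
NP: noun phrase
VP: verb phrase
PU: punctuation (typically sentence-ending marks such as the period, question mark, and exclamation mark)
LCP: localizer phrase
PP: prepositional phrase
CP: phrase formed with 的 that expresses a modifying relation
DNP: phrase formed with 的 that expresses possession
ADVP: adverb phrase
ADJP: adjective phrase
DP: determiner phrase
QP: quantifier phrase
NN: common noun
NR: proper noun
NT: temporal noun
PN: pronoun
VV: verb
VC: 是 (copula)
CC: coordinating conjunction
VE: 有 (as the main verb)
VA: predicative adjective
AS: aspect marker (e.g. 了)
VRD: verb-resultative compound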
CD: cardinal number
DT: determiner
EX: existential there
FW: foreign word
IN: preposition or subordinating conjunction
JJ: adjective or ordinal numeral
JJR: adjective, comparative
JJS: adjective, superlative
LS: list item marker
MD: modal auxiliary
PDT: pre-determiner
POS: genitive marker
PRP: personal pronoun
RB: adverb
RBR: adverb, comparative
RBS: adverb, superlative
RP: particle
SYM: symbol
TO: "to" as preposition or infinitive marker
WDT: wh-determiner
WP: wh-pronoun
WP$: wh-pronoun, possessive
WRB: wh-adverb
The official glossary of POS tags, dependency labels, and entity types:
def explain(term):
    """Get a description for a given POS tag, dependency label or entity type.

    term (str): The term to explain.
    RETURNS (str): The explanation, or `None` if not found in the glossary.

    EXAMPLE:
        >>> spacy.explain(u'NORP')
        >>> doc = nlp(u'Hello world')
        >>> print([(w.text, w.tag_, spacy.explain(w.tag_)) for w in doc])
    """
    if term in GLOSSARY:
        return GLOSSARY[term]
GLOSSARY = {
# POS tags
# Universal POS Tags
# http://universaldependencies.org/u/pos/
"ADJ": "adjective",
"ADP": "adposition",
"ADV": "adverb",
"AUX": "auxiliary",
"CONJ": "conjunction",
"CCONJ": "coordinating conjunction",
"DET": "determiner",
"INTJ": "interjection",
"NOUN": "noun",
"NUM": "numeral",
"PART": "particle",
"PRON": "pronoun",
"PROPN": "proper noun",
"PUNCT": "punctuation",
"SCONJ": "subordinating conjunction",
"SYM": "symbol",
"VERB": "verb",
"X": "other",
"EOL": "end of line",
"SPACE": "space",
# POS tags (English)
# OntoNotes 5 / Penn Treebank
# https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
".": "punctuation mark, sentence closer",
",": "punctuation mark, comma",
"-LRB-": "left round bracket",
"-RRB-": "right round bracket",
"``": "opening quotation mark",
'""': "closing quotation mark",
"''": "closing quotation mark",
":": "punctuation mark, colon or ellipsis",
"$": "symbol, currency",
"#": "symbol, number sign",
"AFX": "affix",
"CC": "conjunction, coordinating",
"CD": "cardinal number",
"DT": "determiner",
"EX": "existential there",
"FW": "foreign word",
"HYPH": "punctuation mark, hyphen",
"IN": "conjunction, subordinating or preposition",
"JJ": "adjective (English), other noun-modifier (Chinese)",
"JJR": "adjective, comparative",
"JJS": "adjective, superlative",
"LS": "list item marker",
"MD": "verb, modal auxiliary",
"NIL": "missing tag",
"NN": "noun, singular or mass",
"NNP": "noun, proper singular",
"NNPS": "noun, proper plural",
"NNS": "noun, plural",
"PDT": "predeterminer",
"POS": "possessive ending",
"PRP": "pronoun, personal",
"PRP$": "pronoun, possessive",
"RB": "adverb",
"RBR": "adverb, comparative",
"RBS": "adverb, superlative",
"RP": "adverb, particle",
"TO": 'infinitival "to"',
"UH": "interjection",
"VB": "verb, base form",
"VBD": "verb, past tense",
"VBG": "verb, gerund or present participle",
"VBN": "verb, past participle",
"VBP": "verb, non-3rd person singular present",
"VBZ": "verb, 3rd person singular present",
"WDT": "wh-determiner",
"WP": "wh-pronoun, personal",
"WP$": "wh-pronoun, possessive",
"WRB": "wh-adverb",
"SP": "space (English), sentence-final particle (Chinese)",
"ADD": "email",
"NFP": "superfluous punctuation",
"GW": "additional word in multi-word expression",
"XX": "unknown",
"BES": 'auxiliary "be"',
"HVS": 'forms of "have"',
"_SP": "whitespace",
# POS Tags (German)
# TIGER Treebank
# http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
"$(": "other sentence-internal punctuation mark",
"$,": "comma",
"$.": "sentence-final punctuation mark",
"ADJA": "adjective, attributive",
"ADJD": "adjective, adverbial or predicative",
"APPO": "postposition",
"APPR": "preposition; circumposition left",
"APPRART": "preposition with article",
"APZR": "circumposition right",
"ART": "definite or indefinite article",
"CARD": "cardinal number",
"FM": "foreign language material",
"ITJ": "interjection",
"KOKOM": "comparative conjunction",
"KON": "coordinate conjunction",
"KOUI": 'subordinate conjunction with "zu" and infinitive',
"KOUS": "subordinate conjunction with sentence",
"NE": "proper noun",
"NNE": "proper noun",
"PAV": "pronominal adverb",
"PROAV": "pronominal adverb",
"PDAT": "attributive demonstrative pronoun",
"PDS": "substituting demonstrative pronoun",
"PIAT": "attributive indefinite pronoun without determiner",
"PIDAT": "attributive indefinite pronoun with determiner",
"PIS": "substituting indefinite pronoun",
"PPER": "non-reflexive personal pronoun",
"PPOSAT": "attributive possessive pronoun",
"PPOSS": "substituting possessive pronoun",
"PRELAT": "attributive relative pronoun",
"PRELS": "substituting relative pronoun",
"PRF": "reflexive personal pronoun",
"PTKA": "particle with adjective or adverb",
"PTKANT": "answer particle",
"PTKNEG": "negative particle",
"PTKVZ": "separable verbal particle",
"PTKZU": '"zu" before infinitive',
"PWAT": "attributive interrogative pronoun",
"PWAV": "adverbial interrogative or relative pronoun",
"PWS": "substituting interrogative pronoun",
"TRUNC": "word remnant",
"VAFIN": "finite verb, auxiliary",
"VAIMP": "imperative, auxiliary",
"VAINF": "infinitive, auxiliary",
"VAPP": "perfect participle, auxiliary",
"VMFIN": "finite verb, modal",
"VMINF": "infinitive, modal",
"VMPP": "perfect participle, modal",
"VVFIN": "finite verb, full",
"VVIMP": "imperative, full",
"VVINF": "infinitive, full",
"VVIZU": 'infinitive with "zu", full',
"VVPP": "perfect participle, full",
"XY": "non-word containing non-letter",
# POS Tags (Chinese)
# OntoNotes / Chinese Penn Treebank
# https://repository.upenn.edu/cgi/viewcontent.cgi?article=1039&context=ircs_reports
"AD": "adverb",
"AS": "aspect marker",
"BA": "把 in ba-construction",
# "CD": "cardinal number",
"CS": "subordinating conjunction",
"DEC": "的 in a relative clause",
"DEG": "associative 的",
"DER": "得 in V-de const. and V-de-R",
"DEV": "地 before VP",
"ETC": "for words 等, 等等",
# "FW": "foreign words"
"IJ": "interjection",
# "JJ": "other noun-modifier",
"LB": "被 in long bei-const",
"LC": "localizer",
"M": "measure word",
"MSP": "other particle",
# "NN": "common noun",
"NR": "proper noun",
"NT": "temporal noun",
"OD": "ordinal number",
"ON": "onomatopoeia",
"P": "preposition excluding 把 and 被",
"PN": "pronoun",
"PU": "punctuation",
"SB": "被 in short bei-const",
# "SP": "sentence-final particle",
"VA": "predicative adjective",
"VC": "是 (copula)",
"VE": "有 as the main verb",
"VV": "other verb",
# Noun chunks
"NP": "noun phrase",
"PP": "prepositional phrase",
"VP": "verb phrase",
"ADVP": "adverb phrase",
"ADJP": "adjective phrase",
"SBAR": "subordinating conjunction",
"PRT": "particle",
"PNP": "prepositional noun phrase",
# Dependency Labels (English)
# ClearNLP / Universal Dependencies
# https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md
"acl": "clausal modifier of noun (adjectival clause)",
"acomp": "adjectival complement",
"advcl": "adverbial clause modifier",
"advmod": "adverbial modifier",
"agent": "agent",
"amod": "adjectival modifier",
"appos": "appositional modifier",
"attr": "attribute",
"aux": "auxiliary",
"auxpass": "auxiliary (passive)",
"case": "case marking",
"cc": "coordinating conjunction",
"ccomp": "clausal complement",
"clf": "classifier",
"complm": "complementizer",
"compound": "compound",
"conj": "conjunct",
"cop": "copula",
"csubj": "clausal subject",
"csubjpass": "clausal subject (passive)",
"dative": "dative",
"dep": "unclassified dependent",
"det": "determiner",
"discourse": "discourse element",
"dislocated": "dislocated elements",
"dobj": "direct object",
"expl": "expletive",
"fixed": "fixed multiword expression",
"flat": "flat multiword expression",
"goeswith": "goes with",
"hmod": "modifier in hyphenation",
"hyph": "hyphen",
"infmod": "infinitival modifier",
"intj": "interjection",
"iobj": "indirect object",
"list": "list",
"mark": "marker",
"meta": "meta modifier",
"neg": "negation modifier",
"nmod": "modifier of nominal",
"nn": "noun compound modifier",
"npadvmod": "noun phrase as adverbial modifier",
"nsubj": "nominal subject",
"nsubjpass": "nominal subject (passive)",
"nounmod": "modifier of nominal",
"npmod": "noun phrase as adverbial modifier",
"num": "number modifier",
"number": "number compound modifier",
"nummod": "numeric modifier",
"oprd": "object predicate",
"obj": "object",
"obl": "oblique nominal",
"orphan": "orphan",
"parataxis": "parataxis",
"partmod": "participal modifier",
"pcomp": "complement of preposition",
"pobj": "object of preposition",
"poss": "possession modifier",
"possessive": "possessive modifier",
"preconj": "pre-correlative conjunction",
"prep": "prepositional modifier",
"prt": "particle",
"punct": "punctuation",
"quantmod": "modifier of quantifier",
"rcmod": "relative clause modifier",
"relcl": "relative clause modifier",
"reparandum": "overridden disfluency",
"root": "root",
"vocative": "vocative",
"xcomp": "open clausal complement",
# Dependency labels (German)
# TIGER Treebank
# http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
# currently missing: 'cc' (comparative complement) because of conflict
# with English labels
"ac": "adpositional case marker",
"adc": "adjective component",
"ag": "genitive attribute",
"ams": "measure argument of adjective",
"app": "apposition",
"avc": "adverbial phrase component",
"cd": "coordinating conjunction",
"cj": "conjunct",
"cm": "comparative conjunction",
"cp": "complementizer",
"cvc": "collocational verb construction",
"da": "dative",
"dh": "discourse-level head",
"dm": "discourse marker",
"ep": "expletive es",
"hd": "head",
"ju": "junctor",
"mnr": "postnominal modifier",
"mo": "modifier",
"ng": "negation",
"nk": "noun kernel element",
"nmc": "numerical component",
"oa": "accusative object",
"oc": "clausal object",
"og": "genitive object",
"op": "prepositional object",
"par": "parenthetical element",
"pd": "predicate",
"pg": "phrasal genitive",
"ph": "placeholder",
"pm": "morphological particle",
"pnc": "proper noun component",
"rc": "relative clause",
"re": "repeated element",
"rs": "reported speech",
"sb": "subject",
"sbp": "passivized subject (PP)",
"sp": "subject or predicate",
"svp": "separable verb prefix",
"uc": "unit component",
"vo": "vocative",
# Named Entity Recognition
# OntoNotes 5
# https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf
"PERSON": "People, including fictional",
"NORP": "Nationalities or religious or political groups",
"FACILITY": "Buildings, airports, highways, bridges, etc.",
"FAC": "Buildings, airports, highways, bridges, etc.",
"ORG": "Companies, agencies, institutions, etc.",
"GPE": "Countries, cities, states",
"LOC": "Non-GPE locations, mountain ranges, bodies of water",
"PRODUCT": "Objects, vehicles, foods, etc. (not services)",
"EVENT": "Named hurricanes, battles, wars, sports events, etc.",
"WORK_OF_ART": "Titles of books, songs, etc.",
"LAW": "Named documents made into laws.",
"LANGUAGE": "Any named language",
"DATE": "Absolute or relative dates or periods",
"TIME": "Times smaller than a day",
"PERCENT": 'Percentage, including "%"',
"MONEY": "Monetary values, including unit",
"QUANTITY": "Measurements, as of weight or distance",
"ORDINAL": '"first", "second", etc.',
"CARDINAL": "Numerals that do not fall under another type",
# Named Entity Recognition
# Wikipedia
# http://www.sciencedirect.com/science/article/pii/S0004370212000276
# https://pdfs.semanticscholar.org/5744/578cc243d92287f47448870bb426c66cc941.pdf
"PER": "Named person or family.",
"MISC": "Miscellaneous entities, e.g. events, nationalities, products or works of art",
# https://github.com/ltgoslo/norne
"EVT": "Festivals, cultural events, sports events, weather phenomena, wars, etc.",
"PROD": "Product, i.e. artificially produced entities including speeches, radio shows, programming languages, contracts, laws and ideas",
"DRV": "Words (and phrases?) that are dervied from a name, but not a name in themselves, e.g. 'Oslo-mannen' ('the man from Oslo')",
"GPE_LOC": "Geo-political entity, with a locative sense, e.g. 'John lives in Spain'",
"GPE_ORG": "Geo-political entity, with an organisation sense, e.g. 'Spain declined to meet with Belgium'",
}
url: https://github.com/explosion/spaCy/blob/master/spacy/glossary.py
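In practice the glossary is queried through spacy.explain rather than read by hand. A short sketch that looks up a few labels from the glossary above; the last term is deliberately made up to show the not-found case (newer spaCy versions may also emit a warning for it):

```python
import spacy

# Labels from the glossary above, plus one term that does not exist.
terms = ["NR", "VV", "AS", "nsubj", "PERSON", "NOT_A_LABEL"]

for term in terms:
    # spacy.explain returns None when the term is not in the glossary
    description = spacy.explain(term)
    print(f"{term:12} {description}")
```

No model has to be loaded for this, since the glossary ships with the spaCy package itself.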
import spacy
s = "小米董事长叶凡决定投资华为。在2002年,他还创作了<遮天>。"
nlp = spacy.load("zh_core_web_sm")
doc = nlp(s)
# 1. Sentence segmentation (sentencizer)
for sent in doc.sents:
    print(sent)
"""
小米董事长叶凡决定投资华为。
在2002年,他还创作了<遮天>。
"""
# 2. Tokenization
print([w.text for w in doc])
"""
['小米', '董事长', '叶凡', '决定', '投资', '华为', '。', '在', '2002年', ',', '他', '还', '创作', '了', '<遮天>', '。']
"""
# 3. Part-of-speech tagging: fine-grained tags (token.tag_)
print([(w.text, w.tag_) for w in doc])
"""
[('小米', 'NR'), ('董事长', 'NN'), ('叶凡', 'NR'), ('决定', 'VV'), ('投资', 'VV'), ('华为', 'NR'), ('。', 'PU'), ('在', 'P'), ('2002年', 'NT'), (',', 'PU'), ('他', 'PN'), ('还', 'AD'), ('创作', 'VV'), ('了', 'AS'), ('<遮天>', 'NN'), ('。', 'PU')]
"""
# Coarse-grained tags (token.pos_)
print([(w.text, w.pos_) for w in doc])
"""
[('小米', 'PROPN'), ('董事长', 'NOUN'), ('叶凡', 'PROPN'), ('决定', 'VERB'), ('投资', 'VERB'), ('华为', 'PROPN'), ('。', 'PUNCT'), ('在', 'ADP'), ('2002年', 'NOUN'), (',', 'PUNCT'), ('他', 'PRON'), ('还', 'ADV'), ('创作', 'VERB'), ('了', 'PART'), ('<遮天>', 'NOUN'), ('。', 'PUNCT')]
"""
# 4. Stop words
print([(w.text, w.is_stop) for w in doc])
"""
[('小米', False), ('董事长', False), ('叶凡', False), ('决定', True), ('投资', False), ('华为', False), ('。', True), ('在', True), ('2002年', False), (',', True), ('他', True), ('还', True), ('创作', False), ('了', True), ('<遮天>', False), ('。', True)]
"""
# 5. Named entity recognition
print([(e.text, e.label_) for e in doc.ents])
"""
[('小米', 'PERSON'), ('叶凡', 'PERSON'), ('2002年', 'DATE')]
"""
# 6. Dependency parsing
print([(w.text, w.dep_) for w in doc])
"""
[('小米', 'nmod:assmod'), ('董事长', 'appos'), ('叶凡', 'nsubj'), ('决定', 'ROOT'), ('投资', 'ccomp'), ('华为', 'dobj'), ('。', 'punct'), ('在', 'case'), ('2002年', 'nmod:prep'), (',', 'punct'), ('他', 'nsubj'), ('还', 'advmod'), ('创作', 'ROOT'), ('了', 'aux:asp'), ('<遮天>', 'dobj'), ('。', 'punct')]
"""
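The Chinese model does not provide this feature, so an English model is used to demonstrate it.
Lemmatization finds the base form (lemma) of a word: am, is, are, and been are reduced to be, plurals become singular (cats -> cat), and past tense becomes present tense (had -> have).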
import spacy
nlp = spacy.load('en_core_web_sm')
txt = "A magnetic monopole is a hypothetical elementary particle."
doc = nlp(txt)
lem = [token.lemma_ for token in doc]
print(lem)
"""
['a', 'magnetic', 'monopole', 'be', 'a', 'hypothetical', 'elementary', 'particle', '.']
"""
The Chinese model does not provide this feature, so the English model is used to demonstrate it.
noun_chunks = [nc for nc in doc.noun_chunks]
print(noun_chunks)
"""
[A magnetic monopole, a hypothetical elementary particle]
"""
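Coreference resolution finds the entity that a pronoun such as he, she, or it refers to. It requires the pretrained neural coreference model from the neuralcoref package; if it is not installed yet, run: pip install neuralcoref
Note that neuralcoref supports spaCy 2.x only, so this example does not run under spaCy 3.x.
The Chinese model does not provide this feature, so the English model is used to demonstrate it.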
txt = "My sister has a son and she loves him."
# Add the pretrained neural coreference component to the spaCy pipeline
import neuralcoref
neuralcoref.add_to_pipe(nlp)
doc = nlp(txt)
print(doc._.coref_clusters)
"""
[My sister: [My sister, she], a son: [a son, him]]
"""
from spacy import displacy
# Visualize the dependency parse
html_str = displacy.render(doc, style="dep")
# Visualize the named entities
# html_str = displacy.render(doc, style="ent")
with open("D:\\data\\ss.html", "w", encoding="utf8") as f:
    f.write(html_str)
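html_str is an HTML string. Saved to the local file ss.html and opened in a browser, it looks like this:
[Figure: dependency parse visualization]
[Figure: named entity visualization]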
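The hard-coded D:\data path above only works on a Windows machine where that folder exists. A sketch of the entity view that writes to a relative path instead; the Doc is built by hand with the two PERSON spans copied from the NER output earlier, so no trained model is needed, and ents.html is a hypothetical file name:

```python
from pathlib import Path

from spacy import displacy
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

# Build a Doc by hand so the sketch does not depend on a trained model.
words = ["小米", "董事长", "叶凡", "决定", "投资", "华为", "。"]
demo_doc = Doc(Vocab(), words=words, spaces=[False] * len(words))
# Entity spans copied from the NER output earlier (token indices 0 and 2).
demo_doc.ents = [
    Span(demo_doc, 0, 1, label="PERSON"),
    Span(demo_doc, 2, 3, label="PERSON"),
]

# page=True wraps the markup in a full HTML page;
# jupyter=False makes render return the string even inside a notebook.
html = displacy.render(demo_doc, style="ent", page=True, jupyter=False)

out_path = Path("ents.html")  # hypothetical output location
out_path.write_text(html, encoding="utf8")
print(f"wrote {out_path.resolve()}")
```

With a real pipeline, demo_doc would simply be replaced by the doc returned from nlp(s).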
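There is also an official visualization library, spacy-streamlit, dedicated to visualizing spaCy NLP results.
Streamlit itself is a library for building visualization apps.
spacy-streamlit has a demo:
https://share.streamlit.io/ines/spacy-streamlit-demo/app.py
The demo's GitHub repository: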
GitHub - ines/spacy-streamlit-demo