All of the steps above are optional; depending on project needs they can be reordered or skipped.
spaCy, textacy (a library built on top of spaCy), and neuralcoref (a spaCy extension for coreference resolution)
pip install neuralcoref
Comparison of NLTK, spaCy, TextBlob, Textacy, and PyTorch-NLP2
Name | Strengths | Weaknesses | Best suited for |
---|---|---|---|
NLTK | Full-featured, multiple implementations of each component, multilingual support | Represents all data as strings; develops more slowly than the others | When specific algorithms need to be combined at processing time |
spaCy | Fast; a single implementation per component | Exposes results as objects | When high performance is needed and no specific algorithm is required |
TextBlob | An extension of NLTK that also wraps functionality from the Pattern library | | Small projects |
Textacy | Adds input/output handling on top of spaCy's core NLP features | | Same scenarios as spaCy |
PyTorch-NLP | State-of-the-art algorithms | | Researchers / rapid prototyping |
pip install -U spacy
python -m spacy download zh_core_web_sm
# This step effectively reinstalls pip
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py
# Some bloggers say switching to the short model name is enough, but I still had to reinstall pip
# Others say running with administrator privileges works; that didn't help me either :)
python -m spacy download en
# Strangely, the zh shortcut cannot be found either
python -m spacy download zh_core_web_sm
# Alternatively, find the language package you need at
# https://github.com/explosion/spacy-models/releases
# and install it with pip
pip install path/to/*.tar.gz
Code
import spacy

def test_segmentation(text, lang):
    nlp = spacy.load(lang)
    doc = nlp(text)
    print(doc.ents)
    for entity in doc.ents:
        print(entity.text + " : " + entity.label_)

if __name__ == "__main__":
    text = """London is the capital and most populous city of England and
    the United Kingdom. Standing on the River Thames in the south east
    of the island of Great Britain, London has been a major settlement
    for two millennia. It was founded by the Romans, who named it Londinium.
    """
    lang = "en_core_web_sm"
    test_segmentation(text, lang)
Output
London : GPE
England : GPE
the United Kingdom : GPE
the River Thames : ORG
the south east : LOC
Great Britain : GPE
London : GPE
two millennia : DATE
Romans : NORP
Explanation: see the official spaCy documentation for the meaning of each entity label.
Abbreviation | Meaning |
---|---|
GPE | Countries, cities, states. |
ORG | Companies, agencies, institutions, etc. |
LOC | Non-GPE locations, mountain ranges, bodies of water. |
DATE | Absolute or relative dates or periods. |
NORP | Nationalities or religious or political groups. |
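spaCy also exposes these descriptions programmatically via `spacy.explain("GPE")`. As a quick summary that does not require re-running the pipeline, the sketch below tallies the label column of the output shown above with `collections.Counter` (the entity/label pairs are copied from that output):

```python
from collections import Counter

# Entity/label pairs copied from the NER output above.
entities = [
    ("London", "GPE"), ("England", "GPE"), ("the United Kingdom", "GPE"),
    ("the River Thames", "ORG"), ("the south east", "LOC"),
    ("Great Britain", "GPE"), ("London", "GPE"),
    ("two millennia", "DATE"), ("Romans", "NORP"),
]

# Count how often each label occurs.
label_counts = Counter(label for _, label in entities)
print(label_counts)  # GPE dominates because London appears twice
```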
Code
def replace_name_with_token(token):
    if token.ent_iob != 0 and token.ent_type_ == "PERSON":
        return "[REDACTED]"
    else:
        return token.text_with_ws  # token.string was removed in spaCy 3

def test_del_name(text, lang):
    nlp = spacy.load(lang)
    doc = nlp(text)
    # Merge each entity into a single token inside one retokenize context,
    # so token indices stay valid while merging
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    tokens = map(replace_name_with_token, doc)
    return "".join(tokens)
if __name__ == "__main__":
    text = """
    Yes Minister is a political satire British sitcom written by Antony Jay and Jonathan Lynn. Split over three seven-episode series, it was first transmitted on BBC2 from 1980 to 1984. A sequel, Yes, Prime Minister, ran for 16 episodes from 1986 to 1988.
    All but one of the episodes lasted half an hour, and almost all ended with a variation of the title of the series spoken as the answer to a question posed by Minister (later, Prime Minister) Jim Hacker. Several episodes were adapted for BBC Radio;
    the series also spawned a 2010 stage play that led to a new television series on Gold in 2013.
    """
    lang = "en_core_web_sm"
    # Replace person names ("redacted" = blacked out / removed)
    print(test_del_name(text, lang))
Output
Yes Minister is a political satire British sitcom written by [REDACTED]and [REDACTED].
Split over three seven-episode series, it was first transmitted on BBC2 from 1980 to 1984. A sequel, Yes, Prime Minister, ran for 16 episodes from 1986 to 1988. All but one of the episodes lasted half an hour, and almost all ended with a variation of the title of the series spoken as the answer to a question posed by Minister (later, Prime Minister) [REDACTED]. Several episodes were adapted for BBC Radio; the series also spawned a 2010 stage play that led to a new television series on Gold in 2013.
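Note the run-together "[REDACTED]and" in the output: `replace_name_with_token` returns the bare string "[REDACTED]" and drops the trailing whitespace the original token carried. The sketch below reproduces the effect with hand-made tuples standing in for spaCy tokens (the tuples and the `redact` helper are illustrative, not spaCy API); re-attaching the token's own whitespace fixes the spacing:

```python
# Fake tokens: (text, trailing_whitespace, is_person) triples standing in for
# spaCy's token.text / token.whitespace_ / PERSON-entity check.
tokens = [
    ("written", " ", False), ("by", " ", False),
    ("Antony Jay", " ", True), ("and", " ", False),
    ("Jonathan Lynn", "", True), (".", "", False),
]

def redact(tok):
    text, ws, is_person = tok
    # Keeping the token's own trailing whitespace avoids "[REDACTED]and".
    return ("[REDACTED]" + ws) if is_person else (text + ws)

redacted = "".join(map(redact, tokens))
print(redacted)  # written by [REDACTED] and [REDACTED].
```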
Declare a subject and a cue verb, then extract facts about that subject.
def get_semistructured_statement(text, lang, sub):
    import textacy.extract
    nlp = spacy.load(lang)
    doc = nlp(text)
    # Use the sub argument instead of a hardcoded entity string
    statements = textacy.extract.semistructured_statements(doc, sub)
    print("Statements:")
    for subject, verb, fact in statements:
        print("\n".join([subject.text, verb.text, fact.text]))

get_semistructured_statement(text, lang, "Minister")
Testing with the same passage as above, the result is:
Statements:
Minister
is
a political satire British sitcom written by Antony Jay and Jonathan Lynn
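textacy finds these triples from the dependency parse; purely to illustrate the subject / cue-verb / fact split, here is a naive string-level stand-in applied to the first sentence (the regex is a hypothetical simplification, not textacy's algorithm):

```python
import re

sentence = ("Yes Minister is a political satire British sitcom "
            "written by Antony Jay and Jonathan Lynn.")

# Crude stand-in for textacy's dependency-based matching:
# split on the cue verb "is" right after the entity of interest.
m = re.match(r"(.*?\bMinister)\s+(is)\s+(.*)\.$", sentence)
subject, verb, fact = m.groups()
print(subject, verb, fact, sep="\n")
```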
def get_word_freq(text, lang, min_freq):
    import textacy.extract
    nlp = spacy.load(lang)
    doc = nlp(text)
    noun_chunks = textacy.extract.noun_chunks(doc, min_freq=min_freq)
    noun_chunks = map(str, noun_chunks)
    noun_chunks = map(str.lower, noun_chunks)
    for chunk in set(noun_chunks):  # avoid shadowing the iterator name
        n_words = len(chunk.split(" "))  # word count of the chunk, not its frequency
        if n_words > 1:
            print(str(n_words) + " : " + chunk)

get_word_freq(text, lang, 1)
Output
4 : over three seven-episode series
2 : bbc radio
3 : new television series
6 :
yes minister
2 : several episodes
2 : jonathan lynn
3 : 2010 stage play
2 : 16 episodes
3 : (later, prime minister
4 : political satire british sitcom
2 : jim hacker
2 : antony jay
2 : prime minister
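Despite the function's name, the number printed above is the word count of each noun chunk, not its corpus frequency. The dedupe / lowercase / count logic can be seen in isolation on plain strings (the sample chunk list is made up for illustration):

```python
# Made-up noun chunks; duplicates differing only in case
# collapse into one entry after lowercasing.
chunks = ["Jim Hacker", "jim hacker", "BBC Radio", "series", "Prime Minister"]

unique = {c.lower() for c in chunks}
for chunk in sorted(unique):
    n_words = len(chunk.split(" "))  # word count, not frequency
    if n_words > 1:
        print(f"{n_words} : {chunk}")
```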
text1 = """乔·舒马赫曾执导过两部《蝙蝠侠》系列电影——1995年的《永远的蝙蝠侠》和1997年的《蝙蝠侠与罗宾》。据悉舒马赫是在蒂姆·伯顿离开蝙蝠侠系列后加入的。
虽然这两部电影并没有获得评论界的好评,却在票房上大获成功。舒马赫是以服装设计师的身份进入电影行业的。作为一名导演,乔·舒马赫有着自己独特的导演风格。
他在80年代和90年代的一系列主流电影中确立了自己在电影制作行业的地位,如1985年的《圣艾尔摩之火》、1987年的《捉鬼小精灵》和1990年的《别闯阴阳界》,都大获成功。
"""
lang1 = "zh_core_web_sm"
test_segmentation(text1, lang1)
print("====================================================")
# Replace person names ("redacted" = blacked out / removed)
print(test_del_name(text1, lang1))
print("====================================================")
get_semistructured_statement(text1, lang1, "乔·舒马赫")
CSDN News: "NLP too hard? Eight steps to implementing it in code" ↩︎
"12 major open-source tools for natural language processing" ↩︎
Official site: Install spaCy ↩︎