NLP for Beginners, Hands-On Python: A Simple Test of spaCy's Chinese Package

Background [1]

The NLP pipeline

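  1. Sentence segmentation
    Split the text into individual sentences, based on punctuation or other formatting cues.
  2. Tokenization
    Break each sentence into tokens. Punctuation marks count as tokens too.
  3. Part-of-speech tagging
    Feed each token (together with some of the surrounding words) into a part-of-speech classifier to get its tag (noun, verb, adjective, and so on).
    The classifier is purely statistical: it is trained on previously seen sentences.
  4. Lemmatization
    Words can be inflected (plural forms, tenses), so each one is reduced to its base form (lemma).
    This is usually done with a lookup table, plus special rules for words that have not been seen before.
  5. Identifying stop words
    English has many very frequent filler words such as "and", "the", and "a". They add a lot of noise to word-frequency statistics because they occur far more often than other words.
    Some NLP pipelines flag them as stop words, meaning they are filtered out before any statistical analysis.
    This step also works from a word list, but the right stop-word list differs from one application to another.
  6. Dependency parsing
    Build a parse tree whose root is the main verb of the sentence; the relation between each pair of linked words can also be predicted.
    When the exact part of speech of each word does not matter and only the overall meaning does, words can be grouped into noun chunks: for example, "the capital" is kept as one unit rather than split into "the" and "capital".
  7. Named entity recognition
    Named Entity Recognition (NER) detects and labels nouns that refer to real-world concepts, such as company names, person names, geographic locations, product names, dates and times, amounts of money, and event names.
  8. Coreference resolution
    Resolve what pronouns such as he/she/it refer to, which can only be decided from context. Deep-learning approaches have made the most progress on this.
    This step is in fact optional.

None of the steps above is mandatory; they can be reordered or omitted depending on the needs of the project.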
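
The rule-based steps of the pipeline (1, 2, 4 and 5) can be sketched with nothing but the standard library. This is only a toy illustration, not how spaCy implements them: the stop-word set, the lemma table and the plural fallback rule are hypothetical and deliberately tiny.

```python
import re

# Hypothetical toy resources, far smaller than what a real library ships.
STOP_WORDS = {"and", "the", "a", "is", "of"}
LEMMA_TABLE = {"is": "be", "was": "be", "cities": "city", "named": "name"}

def split_sentences(text):
    """Step 1: split after sentence-final punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Step 2: words and punctuation marks both become tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def lemmatize(token):
    """Step 4: lookup table first, then one crude rule for unseen plurals."""
    lower = token.lower()
    if lower in LEMMA_TABLE:
        return LEMMA_TABLE[lower]
    if lower.endswith("s") and len(lower) > 3:
        return lower[:-1]
    return lower

def content_lemmas(text):
    """Steps 1, 2, 4, 5 chained: lemmas of tokens that are neither
    punctuation nor stop words."""
    result = []
    for sentence in split_sentences(text):
        for token in tokenize(sentence):
            if token.isalnum() and token.lower() not in STOP_WORDS:
                result.append(lemmatize(token))
    return result

print(content_lemmas("London is the capital. It was founded by the Romans."))
```

Steps 3, 6, 7 and 8 rely on trained statistical models, which is why the examples later in this post load a spaCy model instead of using hand-written rules.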

Common Python packages

spaCy, textacy (a higher-level library built on top of spaCy), and neuralcoref (a spaCy extension for coreference resolution)
pip install neuralcoref
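Comparison of NLTK, spaCy, TextBlob, textacy, and PyTorch-NLP [2]

Name | Strengths | Weaknesses | Best suited for
NLTK | Full-featured; several implementations of each component; supports many languages | Represents all data as strings; develops somewhat more slowly than the others | Work that needs a specific combination of algorithms
spaCy | Fast; a single implementation of each component | Represents content as objects | Work that needs high performance and no specific algorithm
TextBlob | An extension of NLTK that also includes functionality from the Pattern library | - | Small projects
textacy | Adds input/output handling on top of spaCy's core NLP functionality | - | Same as spaCy
PyTorch-NLP | State-of-the-art algorithms | - | Researchers; rapid prototyping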
spaCy examples

Installation [3]
The official site provides a separate model package for each language; pick the one you need.

pip install -U spacy
python -m spacy download zh_core_web_sm

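An error occurred while installing the language model.
Fix (this was with spaCy 2.3.0):

# These two commands simply reinstall pip
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py
# Some bloggers say switching to the shortcut name is enough, but I still had to reinstall pip first
# Others say running with administrator privileges works, but that did not help me either :)
# Shortcut names such as `en` work in spaCy 2.x only; spaCy 3 requires the full name (en_core_web_sm)
python -m spacy download en
# The `zh` shortcut does not seem to be found, oddly, so use the full model name
python -m spacy download zh_core_web_sm
# Alternatively, go to https://github.com/explosion/spacy-models/releases,
# find the language package you need, and install it with pip
pip install path/to/*.tar.gz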
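
Because spaCy models are installed as ordinary Python packages, one way to confirm that the download worked is to ask Python whether the package can be found. The helper name `is_model_installed` below is made up for this sketch; it only checks that the package exists, not that it is compatible with the installed spaCy version.

```python
import importlib.util

def is_model_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None

# Report the status of the two models used in this post
for name in ("en_core_web_sm", "zh_core_web_sm"):
    status = "installed" if is_model_installed(name) else "missing"
    print(f"{name}: {status}")
```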

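Code

Example 1: Print named entities
import spacy

def test_segmentation(text, lang):
    # Load the model named by `lang`, e.g. "en_core_web_sm"
    nlp = spacy.load(lang)
    doc = nlp(text)
    # doc.ents holds the named-entity spans found by the model
    print(doc.ents)
    for entity in doc.ents:
        # label_ is the entity type as a string, e.g. "GPE"
        print(entity.text + " : " + entity.label_)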

if __name__ == "__main__":
    text = """London is the capital and most populous city of England and 
the United Kingdom.  Standing on the River Thames in the south east 
of the island of Great Britain, London has been a major settlement 
for two millennia. It was founded by the Romans, who named it Londinium.
    """
    lang = "en_core_web_sm"
    test_segmentation(text, lang)

Output

London : GPE
England : GPE
the United Kingdom : GPE
the River Thames : ORG
the south east : LOC
Great Britain : GPE
London : GPE
two millennia : DATE
Romans : NORP

Explanation: the meaning of each entity label is given in the official documentation.

Label | Meaning
GPE | Countries, cities, states.
ORG | Companies, agencies, institutions, etc.
LOC | Non-GPE locations, mountain ranges, bodies of water.
DATE | Absolute or relative dates or periods.
NORP | Nationalities or religious or political groups.
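
To read the result by label instead of by position in the text, the (text, label) pairs can be grouped. The sketch below copies the pairs from the output of Example 1 rather than calling spaCy, and `group_by_label` is a hypothetical helper name. Note that labels are model predictions and can be wrong: "the River Thames" is tagged ORG here although it is a body of water.

```python
from collections import defaultdict

# (text, label) pairs copied from the output of Example 1
entities = [
    ("London", "GPE"), ("England", "GPE"), ("the United Kingdom", "GPE"),
    ("the River Thames", "ORG"), ("the south east", "LOC"),
    ("Great Britain", "GPE"), ("London", "GPE"),
    ("two millennia", "DATE"), ("Romans", "NORP"),
]

def group_by_label(pairs):
    """Map each label to the distinct entity texts, in order of first appearance."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

for label, texts in group_by_label(entities).items():
    print(label, ":", ", ".join(texts))
```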
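Example 2: Replace person names

Code

def replace_name_with_token(token):
    # ent_iob == 0 means no entity tag has been set on this token
    if token.ent_iob != 0 and token.ent_type_ == "PERSON":
        # The token's trailing whitespace is dropped here, which is why
        # the output reads "[REDACTED]and"
        return "[REDACTED]"
    else:
        # text_with_ws is the token text plus its trailing whitespace;
        # it replaces token.string, which newer spaCy versions removed
        return token.text_with_ws

def test_del_name(text, lang):
    nlp = spacy.load(lang)
    doc = nlp(text)
    # Merge each multi-token entity into a single token so that a full
    # name is replaced once rather than word by word
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    tokens = map(replace_name_with_token, doc)
    return "".join(tokens)
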
if __name__ == "__main__":
    text = """
Yes Minister is a political satire British sitcom written by Antony Jay and Jonathan Lynn. Split over three seven-episode series, it was first transmitted on BBC2 from 1980 to 1984. A sequel, Yes, Prime Minister, ran for 16 episodes from 1986 to 1988. 
All but one of the episodes lasted half an hour, and almost all ended with a variation of the title of the series spoken as the answer to a question posed by Minister (later, Prime Minister) Jim Hacker. Several episodes were adapted for BBC Radio; 
the series also spawned a 2010 stage play that led to a new television series on Gold in 2013.
"""
lang = "en_core_web_sm"
# Replace names; "redacted" means removed or blacked out
print(test_del_name(text, lang))

Output

Yes Minister is a political satire British sitcom written by [REDACTED]and [REDACTED]. 
Split over three seven-episode series, it was first transmitted on BBC2 from 1980 to 1984. A sequel, Yes, Prime Minister, ran for 16 episodes from 1986 to 1988. All but one of the episodes lasted half an hour, and almost all ended with a variation of the title of the series spoken as the answer to a question posed by Minister (later, Prime Minister) [REDACTED]. Several episodes were adapted for BBC Radio; the series also spawned a 2010 stage play that led to a new television series on Gold in 2013.
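
The output above loses the space after the first name ("[REDACTED]and"). One alternative that keeps the surrounding text untouched is to replace by character offsets instead of by token. The following is a minimal sketch under that assumption: `redact_spans` is a hypothetical helper, and the offsets are computed by hand here; with spaCy they would come from `ent.start_char`, `ent.end_char` and `ent.label_` for each `ent` in `doc.ents`.

```python
def redact_spans(text, spans, label="PERSON", mask="[REDACTED]"):
    """Replace every (start, end, label) span whose label matches with mask."""
    pieces = []
    cursor = 0
    for start, end, span_label in sorted(spans):
        if span_label != label:
            continue
        pieces.append(text[cursor:start])  # text before the span, unchanged
        pieces.append(mask)
        cursor = end
    pieces.append(text[cursor:])  # whatever follows the last span
    return "".join(pieces)

sample = "written by Antony Jay and Jonathan Lynn."
names = ["Antony Jay", "Jonathan Lynn"]
# Hand-built spans standing in for what an NER model would return
spans = [(sample.index(n), sample.index(n) + len(n), "PERSON") for n in names]
print(redact_spans(sample, spans))
```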
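Example 3: Extract facts

Given a subject and a cue verb, look for statements of fact about that subject.

def get_semistructured_statement(text, lang, sub):
    import textacy.extract
    nlp = spacy.load(lang)
    doc = nlp(text)
    # Use the `sub` argument as the entity (the original hard-coded "Minister").
    # The cue verb is left at its default; this positional call follows the
    # older textacy API that this post was written against.
    statements = textacy.extract.semistructured_statements(doc, sub)
    print("Statements:")
    for s in statements:
        subject, verb, fact = s
        print("\n".join([subject.text, verb.text, fact.text]))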

get_semistructured_statement(text, lang, "Minister")

Tested on the same text as above, the result is:

Statements:
Minister
is
a political satire British sitcom written by Antony Jay and Jonathan Lynn
Example 4: Extract noun chunks
def get_word_freq(text, lang, min_freq):
    import textacy.extract
    nlp = spacy.load(lang)
    doc = nlp(text)
    # Keep only noun chunks that occur at least min_freq times
    noun_chunks = textacy.extract.noun_chunks(doc, min_freq=min_freq)
    noun_chunks = map(str, noun_chunks)
    noun_chunks = map(str.lower, noun_chunks)
    for chunk in set(noun_chunks):
        # Number of space-separated pieces in the chunk (not its frequency)
        n_words = len(chunk.split(" "))
        if n_words > 1:
            print(str(n_words) + " : " + chunk)

get_word_freq(text, lang, 1)

Output

4 : over three seven-episode series
2 : bbc radio
3 : new television series
6 : 
    yes minister
2 : several episodes
2 : jonathan lynn
3 : 2010 stage play
2 : 16 episodes
3 : (later, prime minister
4 : political satire british sitcom
2 : jim hacker
2 : antony jay
2 : prime minister
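
The number printed in front of each chunk is its word count, and the "6 :" line shows that leading whitespace inflates that count. If actual frequencies are wanted, one possible approach is to normalize whitespace and count with `collections.Counter`. This is a sketch on a hand-written list of chunk strings, not on textacy output, and `chunk_frequencies` is a hypothetical helper.

```python
from collections import Counter

def chunk_frequencies(chunks, min_words=2):
    """Count multi-word chunks after lowercasing and collapsing whitespace."""
    normalized = (" ".join(c.lower().split()) for c in chunks)
    return Counter(c for c in normalized if len(c.split()) >= min_words)

# Hand-written sample standing in for the noun chunks of a document
sample_chunks = ["Prime Minister", "\n    Yes Minister", "prime minister",
                 "BBC2", "several episodes"]
for chunk, count in chunk_frequencies(sample_chunks).most_common():
    print(count, ":", chunk)
```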
Testing with Chinese text (Examples 1-3)
text1 = """乔·舒马赫曾执导过两部《蝙蝠侠》系列电影——1995年的《永远的蝙蝠侠》和1997年的《蝙蝠侠与罗宾》。据悉舒马赫是在蒂姆·伯顿离开蝙蝠侠系列后加入的。
虽然这两部电影并没有获得评论界的好评,却在票房上大获成功。舒马赫是以服装设计师的身份进入电影行业的。作为一名导演,乔·舒马赫有着自己独特的导演风格。
他在80年代和90年代的一系列主流电影中确立了自己在电影制作行业的地位,如1985年的《圣艾尔摩之火》、1987年的《捉鬼小精灵》和1990年的《别闯阴阳界》,都大获成功。
"""
lang1 = "zh_core_web_sm"
test_segmentation(text1, lang1)
print("====================================================")
# Replace names; "redacted" means removed or blacked out
print(test_del_name(text1, lang1))
print("====================================================")
get_semistructured_statement(text1, lang1, "乔·舒马赫")

Output: [figure: screenshot of the results for the Chinese text]


  1. CSDN News: "NLP Too Hard? Learn to Write the Code in 8 Steps!"

  2. 12 open source tools for natural language processing

  3. Official site: Install spaCy
