Environment Setup | NLP libraries: installation, usage examples, how they work, and troubleshooting

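1. Learning spaCy

1.1. Introduction

spaCy is a text-preprocessing library for advanced natural language processing, written in Python and Cython. It builds on recent research and was designed from the start for use in real products. spaCy ships with pretrained statistical models and word vectors, and currently supports tokenization for more than 20 languages. It is advertised as having the fastest syntactic parser in the world, convolutional neural network models for tagging, parsing and named entity recognition, and easy integration with deep learning. It is commercial open-source software released under the MIT license.【1】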

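1.2. Installation

Environment: Windows 10, PyCharm, and an Anaconda virtual environment (take care not to install the same package with both pip and conda).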

pip install spacy -i https://pypi.tuna.tsinghua.edu.cn/simple
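
To see whether a package in the active environment came from pip or from conda, the standard library can read the INSTALLER record that installers normally write into a package's metadata. This is a minimal sketch, assuming Python 3.8+ (for importlib.metadata). The helper name installer_of is made up for illustration, and some packages may not carry the record at all.

```python
import sys
from importlib import metadata


def installer_of(package):
    """Return the tool recorded as having installed `package`, or None.

    None means the package is not installed in this environment or
    its metadata has no INSTALLER record.
    """
    try:
        dist = metadata.distribution(package)
    except metadata.PackageNotFoundError:
        return None
    text = dist.read_text("INSTALLER")
    return text.strip() if text else None


if __name__ == "__main__":
    # The interpreter path shows which virtual environment is active.
    print("interpreter:", sys.executable)
    for name in ("spacy", "h5py"):
        print(name, "->", installer_of(name))
```

If the interpreter path does not point into the intended Anaconda environment, the pip command above installed spaCy somewhere else.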

1.3. Usage examples

1.3.1. English tokenization

import spacy  # import the library

######### English tokenization ##########
# Load the English model
nlp = spacy.load("en_core_web_sm")

# Run the model by passing it a sentence
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
# Print the tokenization result
print([token.text for token in doc])

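Result: a list of the token strings. The abbreviation U.K. stays a single token, while $ and 1 are split into separate tokens.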

1.3.2. Chinese tokenization and word embeddings

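######### Chinese tokenization and word embeddings ##########
import spacy  # import the library

# Load the model, excluding the components that are not needed
nlp1 = spacy.load("zh_core_web_sm", exclude=("tagger", "parser", "senter", "attribute_ruler", "ner"))
# Process the sentences
doc = nlp1("自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。")
# Loop over the tokens and print each one with its vector
for token in doc:
    # Only the first 5 values are shown for readability;
    # the model actually encodes each Chinese word as a 96-dimensional vector
    print(token.text, token.tensor[:5])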

Result:

[Figure 1: screenshot of the program output]

1.3.3. Korean tokenization and dependency parsing

######## Dependency parsing for Korean ##########
# Download the Korean model inside the virtual environment with:
#   python -m spacy download ko_core_news_sm

import spacy  # import the library
from spacy.lang.ko.examples import sentences

nlp2 = spacy.load("ko_core_news_sm")
doc = nlp2(sentences[0])
print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)

Result:

[Figure 2: screenshot of the program output]

See 【2】 for reference.

1.3.4. Detecting named entities and entity types in English text

import spacy

# Load the English NLP model
nlp = spacy.load('en_core_web_sm')

# The text we want to examine
text = """London is the capital and most populous city of England and
the United Kingdom. Standing on the River Thames in the south east
of the island of Great Britain, London has been a major settlement
for two millennia. It was founded by the Romans, who named it Londinium.
"""

# Parse the text with spaCy. This runs the entire pipeline.
doc = nlp(text)

# 'doc' now contains a parsed version of text. We can use it to do anything we want!
# For example, this will print out all the named entities that were detected:
for entity in doc.ents:
    print(f"{entity.text} ({entity.label_})")

This prints the named entities detected in the document together with their entity types:

[Figure 3: screenshot of the detected entities and their labels]

1.3.5. Word and text similarity

import spacy

# Download the large model first: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

# Semantic similarity (relatedness) between words
banana = nlp.vocab['banana']
dog = nlp.vocab['dog']
fruit = nlp.vocab['fruit']
animal = nlp.vocab['animal']

print(dog.similarity(animal), dog.similarity(fruit))  # 0.6618534 0.23552845
print(banana.similarity(fruit), banana.similarity(animal))  # 0.67148364 0.2427285

# Semantic similarity (relatedness) between texts
target = nlp("Cats are beautiful animals.")

doc1 = nlp("Dogs are awesome.")
doc2 = nlp("Some gorgeous creatures are felines.")
doc3 = nlp("Dolphins are swimming mammals.")

# Compare the target sentence with each candidate
for doc in (doc1, doc2, doc3):
    print(doc.text, target.similarity(doc))
1.4. How it works

Components: tok2vec, tagger, morphologizer, parser, lemmatizer (trainable_lemmatizer), senter, ner.

spaCy's processing pipeline

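When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps, which together are called the processing pipeline. The pipeline used by the trained models typically includes a tagger, a lemmatizer, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.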

[Figure 4: illustration of the processing pipeline]
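
The flow described above (tokenize first, then hand the same object through each component in turn) can be mimicked in a few lines of plain Python. This is a toy sketch, not spaCy's implementation: ToyDoc, tokenize, tag_capitalized, mark_entities and run_pipeline are hypothetical names, and the "tagging" rules are deliberately trivial.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Stand-in for a Doc: the text plus annotations added along the way."""
    text: str
    tokens: list = field(default_factory=list)
    tags: list = field(default_factory=list)
    ents: list = field(default_factory=list)


def tokenize(text):
    # Step 0: turn raw text into a document object (whitespace split only).
    return ToyDoc(text=text, tokens=text.split())


def tag_capitalized(doc):
    # Component 1: a trivial "tagger" that marks capitalized tokens.
    doc.tags = ["CAP" if tok[:1].isupper() else "LOW" for tok in doc.tokens]
    return doc


def mark_entities(doc):
    # Component 2: relies on the tags set by the previous component.
    doc.ents = [tok for tok, tag in zip(doc.tokens, doc.tags) if tag == "CAP"]
    return doc


def run_pipeline(text, components):
    doc = tokenize(text)
    for component in components:
        doc = component(doc)  # each component returns the processed doc
    return doc


doc = run_pipeline("London is the capital of England", [tag_capitalized, mark_entities])
print(doc.tokens)
print(doc.ents)
```

The order matters here for the same reason it does in a real pipeline: mark_entities reads annotations that tag_capitalized wrote. In spaCy itself, the component names of a loaded pipeline are listed in nlp.pipe_names.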

tok2vec:

1.5. Troubleshooting

Error 1

After pip install spacy, running the script fails with an error reporting that spacy.load() does not exist.

[Figure 5: screenshot of the error message]

Uninstall spaCy:

pip uninstall spacy

Then reinstall it:

pip install spacy -i https://pypi.tuna.tsinghua.edu.cn/simple

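Cause: the error is triggered by naming your own file spacy.py, which creates a name conflict: Python imports that file instead of the installed library.

Solution: rename the file spacy.py so that it no longer shares its name with the spaCy library.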
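
A quick way to confirm this kind of name clash is to look for a local file or folder with the library's name. This is a minimal sketch using only the standard library, assuming the script is run from its own directory. The helper name find_shadowing_file is hypothetical.

```python
from pathlib import Path


def find_shadowing_file(module_name, directory="."):
    """Return a local file or package that would shadow `module_name`, or None.

    When a script is run, its own directory is searched first, so a file
    called <module_name>.py there wins over the installed library.
    """
    base = Path(directory)
    candidates = [base / f"{module_name}.py", base / module_name / "__init__.py"]
    for candidate in candidates:
        if candidate.is_file():
            return candidate.resolve()
    return None


if __name__ == "__main__":
    clash = find_shadowing_file("spacy")
    if clash is None:
        print("No local file named spacy.py in the current directory.")
    else:
        print(f"Rename this file, it hides the spaCy library: {clash}")
```

Printing spacy.__file__ after import spacy gives the same information from the other side: if the path points into your project instead of site-packages, the local file is being imported.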

Error 2

Running python -m spacy download en_core_web_sm produces the following error:

E:\Anaconda3\envs\tf24\lib\site-packages\h5py\__init__.py:39: UserWarning: h5py is running against HDF5 1.10.5 when it was built against 1.10.6, this may cause problems
  '{0}.{1}.{2}'.format(*version.hdf5_built_version_tuple)
Warning! ***HDF5 library version mismatched error***
The HDF5 header files used to compile this application do not match
the version used by the HDF5 library to which this application is linked.
Data corruption or segmentation faults may occur if the application continues.
This can happen when an application was compiled by one version of HDF5 but
linked with a different version of static or shared HDF5 library.
You should recompile the application or check your shared library related
settings such as 'LD_LIBRARY_PATH'.
You can, at your own risk, disable this warning by setting the environment
variable 'HDF5_DISABLE_VERSION_CHECK' to a value of '1'.
Setting it to 2 or higher will suppress the warning messages totally.
Headers are 1.10.6, library is 1.10.5

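Cause: PyCharm upgraded the library to a newer version, so the installed h5py no longer matches the HDF5 library it is linked against.

Solution (my versions: h5py 2.10.0, tensorflow 2.4.0, Python 3.7):

Uninstall: pip uninstall h5py

Install: pip install h5py==2.10.0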

After this change, the download succeeds.

[Figure 6: screenshot of the successful run]
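
To notice when an upgrade has moved a package away from a known-good version, the installed versions can be compared with the expected ones. This is a minimal sketch, assuming Python 3.8+ for importlib.metadata (on Python 3.7 the importlib_metadata backport offers the same functions). The helper name check_versions is hypothetical, and the pins are the versions reported as working above.

```python
from importlib import metadata


def check_versions(expected):
    """Compare installed package versions with the expected ones.

    Returns {name: (installed_or_None, expected, matches)}.
    """
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, wanted, installed == wanted)
    return report


if __name__ == "__main__":
    # Versions reported as working in this post.
    pins = {"h5py": "2.10.0", "tensorflow": "2.4.0"}
    for name, (installed, wanted, ok) in check_versions(pins).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: installed={installed} expected={wanted} {status}")
```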

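2. Learning textacy

textacy is a Python library for a variety of natural language processing tasks. It is built on the high-performance spaCy library and implements several common information-extraction algorithms on top of it.

Example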

import spacy
import textacy.extract

# Load the small English NLP model
nlp = spacy.load('en_core_web_sm')

# The text we want to examine
text = """London is the capital and most populous city of England and the United Kingdom.
Standing on the River Thames in the south east of the island of Great Britain,
London has been a major settlement for two millennia. It was founded by the Romans,
who named it Londinium.
"""

# Parse the document with spaCy
doc = nlp(text)

# Extract semi-structured statements
statements = textacy.extract.semistructured_statements(doc, "London")

# Print the results
print("Here are the things I know about London:")

for statement in statements:
    subject, verb, fact = statement
    print(f" - {fact}")

Error 1

 Traceback (most recent call last):
  File "G:/NLP/bert-master/bert-master/nlpbase/textacypre.py", line 18, in
    statements = textacy.extract.semistructured_statements(doc, "London")
TypeError: semistructured_statements() takes 1 positional argument but 2 were given

[Figure 7: screenshot of the error]
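
This message is what Python raises when a function accepts only its first argument positionally and the remaining ones must be passed by keyword. The sketch below reproduces it with a stand-in function; extract_statements is hypothetical and is not textacy's code.

```python
def extract_statements(doclike, *, entity, cue):
    """Hypothetical stand-in: everything after `*` must be passed by keyword."""
    return [(entity, cue, doclike)]


# Passing the second value positionally reproduces the same kind of error.
try:
    extract_statements("some text", "London")
except TypeError as err:
    print(err)

# Naming the arguments works.
print(extract_statements("some text", entity="London", cue="be"))
```

Recent textacy releases appear to declare entity and cue as keyword-only parameters, in which case the call would become semistructured_statements(doc, entity="London", cue="be"). This is an assumption about the installed version, so check the function's signature (for example with help()) before relying on it.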

References

【1】Trained Models & Pipelines · spaCy Models Documentation

【2】eunjeon / mecab-ko / README.md, Bitbucket (bitbucket.org)

【3】spaCy, an English text-processing toolkit, Jianshu (jianshu.com)
