Natural Language Processing (NLP) is an important direction in both computer science and artificial intelligence. It studies the interaction between human (natural) language and computers. The goal of NLP is to enable computers to understand, parse, and generate human language, and to respond to and act on that information in meaningful ways.
NLP tasks span multiple levels, including but not limited to:
- Lexical analysis: tokenization, part-of-speech (POS) tagging
- Syntactic analysis: parsing the grammatical structure of sentences
- Semantic analysis: named entity recognition, word sense disambiguation
- Application-level tasks: sentiment analysis, machine translation, question answering
A small lexical-level example is sketched right after this list.
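As a concrete taste of the lexical level, here is a minimal sketch (my illustration, not part of the original) that tokenizes a sentence and tags each token's part of speech with NLTK. Note the tagger resource name varies across NLTK versions:

import nltk
nltk.download('punkt_tab')                       # tokenizer models
nltk.download('averaged_perceptron_tagger_eng')  # POS tagger model; named
                                                 # 'averaged_perceptron_tagger' in older NLTK
from nltk import word_tokenize, pos_tag

tokens = word_tokenize('NLP lets computers read text.')
pos_tag(tokens)   # e.g. [('NLP', 'NNP'), ('lets', 'VBZ'), ('computers', 'NNS'), ...]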
Progress in NLP depends on advances in algorithms, growth in computing power, and the availability of large-scale annotated datasets. In recent years, deep learning methods, especially neural-network language models such as BERT and the GPT series, have achieved remarkable success on many NLP tasks. As the technology matures, NLP is being applied in more and more areas, including customer service, intelligent search, content recommendation, and healthcare.
NLTK, short for Natural Language Toolkit, is a leading platform for building Python programs that work with human language data. It offers simple, easy-to-use interfaces along with a rich collection of tools and resources, and it is widely used for text processing, information retrieval, sentiment analysis, machine translation, and many other natural language processing (NLP) tasks.
import nltk
# Download the Punkt sentence tokenizer models
# (the resource is named 'punkt' in older NLTK releases, 'punkt_tab' in newer ones)
nltk.download('punkt_tab')

# Sentence tokenization
from nltk.tokenize import sent_tokenize
paragraph = 'You must follow me carefully. I shall have to controvert one or two ideas that are almost universally accepted. The geometry, for instance, they taught you at school is founded on a misconception.'
tokenized_text = sent_tokenize(paragraph)
tokenized_text
['You must follow me carefully.',
'I shall have to controvert one or two ideas that are almost universally accepted.',
'The geometry, for instance, they taught you at school is founded on a misconception.']
# Word tokenization
from nltk import word_tokenize
text = 'You must follow me carefully.'
tokenized_text = word_tokenize(text)
tokenized_text
['You', 'must', 'follow', 'me', 'carefully', '.']
Removing punctuation. Method 1: map every punctuation character to a space with str.translate:
import string
string.punctuation   # the ASCII punctuation characters
# Replace each punctuation character with a space
text.translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation)))
'You must follow me carefully '
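Note that translate returns a plain string (punctuation becomes spaces, hence the trailing space above), which you would typically tokenize afterwards; a minimal follow-up sketch:

cleaned = text.translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation)))
word_tokenize(cleaned)   # ['You', 'must', 'follow', 'me', 'carefully']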
Method 2: filter the punctuation tokens out of tokenized_text:
punctuation = list(string.punctuation)
[word for word in tokenized_text if word not in punctuation]
['You', 'must', 'follow', 'me', 'carefully']
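One caveat with this filter (my observation, not from the original): word_tokenize can emit multi-character punctuation tokens such as '...' or the quote tokens `` and '', and those are not single characters in string.punctuation, so they survive. A stricter sketch drops any token made up entirely of punctuation characters:

[word for word in tokenized_text if not all(ch in string.punctuation for ch in word)]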
Such words carry little meaning on their own. Common English stopwords: is, am, a, are, the, an, to, for, …
nltk.download('stopwords')   # download the stopword corpora
from nltk.corpus import stopwords
# Load the English stopword list
stop_words = stopwords.words('english')
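It can help to peek at what was loaded; note that the exact contents and length of the list vary a little across NLTK versions:

len(stop_words)   # 179 in older NLTK releases; newer releases add a few more
stop_words[:5]    # ['i', 'me', 'my', 'myself', 'we']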
text = 'I shall have to controvert one or two ideas that are almost universally accepted.'
tokenized_text = word_tokenize(text)
tokenized_text
# Filter out punctuation
tokenized_text = [word for word in tokenized_text if word not in punctuation]
tokenized_text
# Filter out stopwords
tokenized_text = [word for word in tokenized_text if word not in stop_words]
tokenized_text
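A subtlety worth flagging (my addition): NLTK's stopword lists are all lowercase, so the capitalized token 'I' is not matched and survives the filter above. Lowercasing the tokens before comparing is the usual fix; a minimal sketch:

tokens = [word.lower() for word in word_tokenize(text)]
tokens = [word for word in tokens if word not in punctuation]
tokens = [word for word in tokens if word not in stop_words]   # 'i' is dropped this time
tokens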
# Tokenize the whole paragraph again for frequency counting
tokenized_text = nltk.word_tokenize(paragraph)
tokenized_text = [word for word in tokenized_text if word not in punctuation]
tokenized_text
# Build a frequency distribution over lowercased tokens
word_freqs = nltk.FreqDist(w.lower() for w in tokenized_text)
# Plot the word frequencies (FreqDist.plot uses matplotlib)
word_freqs.plot()
# Cumulative frequency plot of the 3 most frequent words
word_freqs.plot(3, cumulative=True)
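If you only need the numbers rather than a chart, FreqDist also supports most_common (it subclasses collections.Counter):

word_freqs.most_common(3)   # e.g. [('you', 2), ('must', 1), ('follow', 1)]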