Below is a comprehensive, in-depth guide to the Python NLTK (Natural Language Toolkit) library, covering its core features, typical use cases, and code examples.
NLTK is a core Python library for natural language processing (NLP), providing a rich set of text-processing tools, algorithms, and corpora. Its main features include tokenization, part-of-speech tagging, stemming and lemmatization, chunking, named-entity recognition, stopword filtering, sentiment analysis, built-in corpora, and text classification.
pip install nltk
# Download the NLTK data packages (required on first use)
import nltk
nltk.download('punkt')  # tokenizer models
nltk.download('averaged_perceptron_tagger')  # POS tagging model
nltk.download('wordnet')  # lexical database (used for lemmatization)
nltk.download('stopwords')  # stopword lists
Sentence tokenization:
from nltk.tokenize import sent_tokenize
text = "Hello world! This is NLTK. Let's learn NLP."
sentences = sent_tokenize(text) # ['Hello world!', 'This is NLTK.', "Let's learn NLP."]
Word tokenization:
from nltk.tokenize import word_tokenize
words = word_tokenize("Hello, world!") # ['Hello', ',', 'world', '!']
Part-of-speech tagging:
from nltk import pos_tag
tokens = word_tokenize("I love NLP.")
tags = pos_tag(tokens) # [('I', 'PRP'), ('love', 'VBP'), ('NLP', 'NNP'), ('.', '.')]
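If the Penn Treebank tag abbreviations are unfamiliar, NLTK can print their definitions (a small aside; this assumes the extra 'tagsets' data package has been downloaded):
nltk.help.upenn_tagset('VBP')  # prints the definition and examples for the VBP tag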
Stemming:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed = stemmer.stem("running") # 'run'
Lemmatization:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemma = lemmatizer.lemmatize("better", pos='a')  # 'good' (the POS must be specified)
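The two normalizers behave quite differently; a quick side-by-side sketch using the objects created above:
print(stemmer.stem("studies"))          # 'studi' (rule-based suffix stripping)
print(lemmatizer.lemmatize("studies"))  # 'study' (dictionary lookup; default POS is noun)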
Chunking (shallow parsing):
from nltk import RegexpParser
grammar = r"NP: {<DT>?<JJ>*<NN>}"  # define a noun-phrase rule
parser = RegexpParser(grammar)
tree = parser.parse(tags)  # build the chunk tree
tree.draw()  # visualize the tree structure (opens a window)
Named-entity recognition (NER):
from nltk import ne_chunk  # requires nltk.download('maxent_ne_chunker') and nltk.download('words')
text = "Apple is headquartered in Cupertino."
tags = pos_tag(word_tokenize(text))
entities = ne_chunk(tags)
# Output: (GPE Apple/NNP) is/VBZ headquartered/VBN in/IN (GPE Cupertino/NNP)
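The result is an nltk.Tree; a minimal sketch of walking it to list the recognized entities (the labels are whatever ne_chunk assigned):
for subtree in entities:
    if hasattr(subtree, 'label'):  # entity chunks are nested Tree objects
        entity = " ".join(token for token, tag in subtree.leaves())
        print(subtree.label(), entity)  # e.g. GPE Apple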
Stopword filtering:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [w for w in word_tokenize(text) if w.lower() not in stop_words]
Edit distance:
from nltk import edit_distance
distance = edit_distance("apple", "appel") # 2
Sentiment analysis (VADER):
from nltk.sentiment import SentimentIntensityAnalyzer  # requires nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
score = sia.polarity_scores("I love this movie!") # {'compound': 0.8316, 'pos': 0.624, ...}
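A commonly used convention for turning the compound score into a label (a thresholding rule from the VADER authors, not something NLTK enforces):
# compound >= 0.05 -> positive, <= -0.05 -> negative, otherwise neutral
label = 'pos' if score['compound'] >= 0.05 else ('neg' if score['compound'] <= -0.05 else 'neu')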
Built-in corpora:
from nltk.corpus import gutenberg  # requires nltk.download('gutenberg')
print(gutenberg.fileids())  # list the built-in Gutenberg texts
emma = gutenberg.words('austen-emma.txt')  # load a text as a list of words
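For quick exploration, the word list can be wrapped in nltk.Text (a small sketch using the emma variable above):
emma_text = nltk.Text(emma)
emma_text.concordance("surprise", lines=5)  # show up to 5 occurrences with surrounding context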
TF-IDF:
from nltk.text import TextCollection
# text1, text2, text3 are tokenized documents (lists of words)
corpus = TextCollection([text1, text2, text3])
tfidf = corpus.tf_idf(word, text)  # TF-IDF score of `word` within the document `text`
N-grams:
from nltk.util import ngrams
bigrams = list(ngrams(tokens, 2))  # generate bigrams
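Counting the generated bigrams with FreqDist gives simple phrase statistics (a minimal sketch):
from nltk import FreqDist
bigram_freq = FreqDist(bigrams)
print(bigram_freq.most_common(3))  # the three most frequent bigrams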
NLTK's Chinese support is relatively weak, so it is usually combined with other tools:
# Example: Chinese word segmentation with jieba
import jieba
words = jieba.lcut("自然语言处理很有趣")  # ['自然语言', '处理', '很', '有趣']
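Once the text has been segmented, the resulting tokens can be fed into NLTK's tools as usual; a minimal sketch combining jieba with FreqDist (the example sentence is made up):
import jieba
from nltk import FreqDist
tokens = jieba.lcut("自然语言处理很有趣,自然语言处理很有用")  # segment Chinese text with jieba
freq = FreqDist(tokens)
print(freq.most_common(3))  # most frequent tokens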
How NLTK compares with other common NLP libraries:

| Feature | NLTK | spaCy | Transformers |
|---|---|---|---|
| Speed | Slow | Fast | Medium |
| Pretrained models | Few | Many | Very many (BERT, etc.) |
| Ease of use | Easy | Easy | Moderate |
| Chinese support | Weak | Moderate | Strong |
A hands-on example: training a sentiment classifier on NLTK's built-in movie reviews corpus:
from nltk.corpus import movie_reviews  # requires nltk.download('movie_reviews')
import random
# Load the data (positive and negative reviews)
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)  # shuffle the order
# Collect all words and build the feature vocabulary
all_words = nltk.FreqDist(word.lower() for word in movie_reviews.words())
# Take the 3000 most frequent words as features (FreqDist keys are not sorted by count)
word_features = [word for word, _ in all_words.most_common(3000)]
# Feature extraction function: mark which of the frequent words appear in a document
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features[f'contains({word})'] = (word in document_words)
    return features
featuresets = [(document_features(doc), category) for (doc, category) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]  # split into training and test sets
classifier = nltk.NaiveBayesClassifier.train(train_set)
# Evaluate the model
accuracy = nltk.classify.accuracy(classifier, test_set)
print(f"Accuracy: {accuracy:.2f}")  # typically around 0.7-0.8
# Inspect the most informative features
classifier.show_most_informative_features(10)
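Once trained, the classifier can label new text through the same feature extractor (a small sketch; the example review is made up):
new_review = word_tokenize("A wonderful film with brilliant acting and a clever plot")
print(classifier.classify(document_features(new_review)))  # 'pos' or 'neg'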