1. 准备工作:分词和清洗
import nltk
from nltk.corpus import stopwords
from nltk.corpus import brown
import numpy as np
#分词
text = "Sentiment analysis is a challenging subject in machine learning.\
People express their emotions in language that is often obscured by sarcasm,\
ambiguity, and plays on words, all of which could be very misleading for \
both humans and computers.".lower()
text_list = nltk.word_tokenize(text)
#去掉标点符号
english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']
text_list = [word for word in text_list if word not in english_punctuations]
#去掉停用词
stops = set(stopwords.words("english"))
text_list = [word for word in text_list if word not in stops]
2. 使用词性标注器:处理一个词序列,为每个词附加一个词性标记
nltk.pos_tag(text_list)
Out[81]:
[('sentiment', 'NN'),
('analysis', 'NN'),
('challenging', 'VBG'),
('subject', 'JJ'),
('machine', 'NN'),
('learning', 'VBG'),
('people', 'NNS'),
('express', 'JJ'),
('emotions', 'NNS'),
('language', 'NN'),
('often', 'RB'),
('obscured', 'VBD'),
('sarcasm', 'JJ'),
('ambiguity', 'NN'),
('plays', 'NNS'),
('words', 'NNS'),
('could', 'MD'),
('misleading', 'VB'),
('humans', 'NNS'),
('computers', 'NNS')]
3. 读取已标注的语料库:NLTK中包括的若干语料库已经标注了词性
brown_taged= nltk.corpus.brown.tagged_words()
4. 自动标注
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
#默认标注
tags = [tag for (word,tag) in brown.tagged_words(categories='news')]
print(nltk.FreqDist(tags).max())
NN
raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
tokens = nltk.word_tokenize(raw)
default_tagger = nltk.DefaultTagger('NN')
print(default_tagger.tag(tokens))
print(default_tagger.evaluate(brown_tagged_sents))
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'), ('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'), ('I', 'NN'), ('am', 'NN'), ('!', 'NN')]
0.13089484257215028
#正则表达式标注器
patterns= [(r'.*ing$','VBG'),(r'.*ed$','VBD'),(r'.*es$','VBZ'),(r'.*ould$','MD'),\
(r'.*\'s$','NN$'),(r'.*s$','NNS'),(r'^-?[0-9]+(.[0-9]+)?$','CD'),(r'.*','NN')]
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(brown_sents[3])
print(regexp_tagger.evaluate(brown_tagged_sents))
0.20326391789486245
#查询标注器:找出100个最频繁的词,存储它们最有可能的标记。然后可以使用这个信息作为
#"查询标注器"(NLTK UnigramTagger)的模型
fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
most_freq_words = list(fd.keys())[:100]
likely_tags = dict((word,cfd[word].max()) for word in most_freq_words)
# baseline_tagger = nltk.UnigramTagger(model=likely_tags)
#许多词都被分配了None标签,因为它们不在100个最频繁的词中,可以使用backoff参数设置这些词的默认词性
baseline_tagger = nltk.UnigramTagger(model=likely_tags,backoff=nltk.DefaultTagger('NN'))
print(baseline_tagger.evaluate(brown_tagged_sents))
0.46063806511923944
5. N-gram标注
(1)一元标注器:利用一种简单的算法,对每个标识符分配最有可能的标记,不考虑上下文
In[87]: unigram_tagger = nltk.UnigramTagger(brown_tagged_sents) #训练一个一元标注器
print(unigram_tagger.tag(brown_sents[2007]))
unigram_tagger.evaluate((brown_tagged_sents))
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
Out[87]: 0.9349006503968017
#分离训练集和测试集
size = int(len(brown_tagged_sents)*0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)
Out[89]: 0.8121200039868434
(2)一般的N-gram标注:n-gram标注器是unigram标注器的一般化,它的上下文是当期词和它前面n-1个标注分的词性标记。
NgramTagger类使用一个已标注的训练语料库来确定每个上下文中哪个词性标记最有可能。
bigram_tagger = nltk.BigramTagger(train_sents)
bigram_tagger.tag(brown_sents[2007])
bigram_tagger.evaluate(test_sents)
Out[90]: 0.10206319146815508
注意,bigram标注器能够标注训练量中它看到过的句子中的所有词,但对一个没见过的句子却不行。只要遇到一个新词就无法给它分配标记,也无法给新词后面的一个词
分配标记,因为在训练过程中从来没有见过哪个词前面有None标记的词。它的整体准确度非常低。
(3)组合标注器
尝试使用bigram标注器标注标识符
如果bigram无法找到标记,尝试unigram标注器
如果unigram也无法找到标记,使用默认标注器
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents,backoff=t0)
t2 = nltk.BigramTagger(train_sents,backoff=t1)
t2.evaluate(test_sents)
Out[92]: 0.8452108043456593
t3 = nltk.BigramTagger(train_sents,cutoff=2,backoff=t1)
t3.evaluate(test_sents)
Out[95]: 0.8424200139539519
cutoff=2表示将丢弃那些只出现一次或者两次的上下文。
(4)存储标注器:在大语料库中训练标注器可能需要大量的时间,保存标注器很有必要
In[101]: #保存标注器
from pickle import dump
output = open('t2.pkl','wb')
dump(t2,output,-1)
output.close()
#加载标注器
from pickle import load
input = open('t2.pkl','rb')
tagger = load(input)
input.close()
#使用标注器
text = "Sentiment analysis is a challenging subject in machine learning."
tokens = text.split()
tagger.tag(tokens)
Out[101]:
[('Sentiment', 'NN'),
('analysis', 'NN'),
('is', 'BEZ'),
('a', 'AT'),
('challenging', 'JJ'),
('subject', 'NN'),
('in', 'IN'),
('machine', 'NN'),
('learning.', 'NN')]
6 基于转换的标注nltk.tag.brill
主要是基于规则进行标注,例如当前面的词是TO时,替换NN为VB,当下一个标记是NNS时,替换TO为IN
7 如何确定一个词的分类
(1)形态学线索:如词的前缀、后缀
(2)句法线索:即词可能出现的典型的上下文语境
(3)语义线索:即一个词的意思
(4)新词:词类的成员会随着时间的推移增加新的词汇,需注意