DefaultTagger assigns the same tag to every token.
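import nltk

# word_tokenize requires NLTK's tokenizer data to be downloaded beforehand
sent = "Thanks for your reading!"
tokens = nltk.word_tokenize(sent)
default_tagger = nltk.DefaultTagger('NN')  # tag every token as NN
tagged_words = default_tagger.tag(tokens)
print(tagged_words)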
result:
[('Thanks', 'NN'), ('for', 'NN'), ('your', 'NN'), ('reading', 'NN'), ('!', 'NN')]
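The evaluate function measures the accuracy of this tagging approach (newer NLTK releases deprecate evaluate in favor of the equivalent accuracy method). Here the test uses the part-of-speech-tagged tagged_sents provided by the brown corpus: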
import nltk
from nltk.corpus import brown  # requires the brown corpus data to be downloaded

brown_tagged_sents = brown.tagged_sents(categories='news')
default_tagger = nltk.DefaultTagger('NN')
print(default_tagger.evaluate(brown_tagged_sents))
result:
0.13089484257215028
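The output shows that tagging every word as NN (singular common noun) is only about 13% accurate. Equivalently, about 13% of the tokens in brown_tagged_sents carry the tag NN.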
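The link between the default tagger's accuracy and the share of NN tokens can be checked on a toy sample. The sketch below is plain Python, not NLTK: `gold_sents`, `constant_tagger_accuracy`, and `tag_share` are hypothetical names, and the tiny hand-made sample only stands in for brown_tagged_sents, so the numbers are illustrative.

```python
from collections import Counter

# Hypothetical gold-standard sample standing in for brown.tagged_sents(...)
gold_sents = [
    [("the", "AT"), ("jury", "NN"), ("said", "VBD")],
    [("a", "AT"), ("report", "NN"), ("was", "BEDZ"), ("published", "VBN")],
    [("voters", "NNS"), ("approved", "VBD"), ("it", "PPO")],
]

def constant_tagger_accuracy(tagged_sents, tag):
    """Tag every word with `tag`, then score token by token against the gold tags."""
    correct = total = 0
    for sent in tagged_sents:
        predicted = [(word, tag) for word, _ in sent]
        for (_, guess), (_, gold) in zip(predicted, sent):
            correct += guess == gold
            total += 1
    return correct / total

def tag_share(tagged_sents, tag):
    """Relative frequency of `tag` among all gold tags."""
    counts = Counter(t for sent in tagged_sents for _, t in sent)
    return counts[tag] / sum(counts.values())

print(constant_tagger_accuracy(gold_sents, "NN"))  # 0.2 (2 of 10 tokens)
print(tag_share(gold_sents, "NN"))                 # 0.2
```

Both functions return the same value because a constant tagger is right exactly on the tokens whose gold tag is the constant.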
N-gram taggers can be chained through the backoff parameter, so that each tagger hands a token to the next simpler one when it cannot tag it:
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger

# Use the first 90% of the sentences for training and the last 10% for testing
split = int(len(brown_tagged_sents) * 0.9)
train_data = brown_tagged_sents[:split]
test_data = brown_tagged_sents[split:]

# Chain: trigram -> bigram -> unigram -> default ('NN')
unigram_tagger = UnigramTagger(train_data, backoff=default_tagger)
print(unigram_tagger.evaluate(test_data))
bigram_tagger = BigramTagger(train_data, backoff=unigram_tagger)
print(bigram_tagger.evaluate(test_data))
trigram_tagger = TrigramTagger(train_data, backoff=bigram_tagger)
print(trigram_tagger.evaluate(test_data))
result:
0.8361407355726104
0.8452108043456593
0.843317053722715
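To make the backoff idea concrete, here is a simplified plain-Python model of an n-gram chain. It is a sketch under stated assumptions, not NLTK's implementation: `train_context_table`, `tag_with_backoff`, and the toy training data are hypothetical, and the model simply assumes that each level looks up (previous n-1 tags, current word) and passes the token on when that context was never seen.

```python
from collections import Counter, defaultdict

def train_context_table(tagged_sents, n):
    """Map (previous n-1 tags, word) to the most frequent tag seen in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        tags = [t for _, t in sent]
        for i, (word, tag) in enumerate(sent):
            history = tuple(tags[max(0, i - n + 1):i])
            counts[(history, word)][tag] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def tag_with_backoff(words, tables, default_tag):
    """tables is a list of (n, table) pairs, ordered from the largest n down."""
    tagged = []
    for i, word in enumerate(words):
        prev_tags = [t for _, t in tagged]
        chosen = None
        for n, table in tables:
            history = tuple(prev_tags[max(0, i - n + 1):i])
            chosen = table.get((history, word))
            if chosen is not None:  # this level knows the context
                break
        # No level knew the context: fall through to the default tag
        tagged.append((word, chosen if chosen is not None else default_tag))
    return tagged

# Toy training data (made up for illustration)
train = [
    [("they", "PPSS"), ("run", "VB")],
    [("a", "AT"), ("run", "NN")],
    [("a", "AT"), ("run", "NN")],
]
tables = [(2, train_context_table(train, 2)), (1, train_context_table(train, 1))]

print(tag_with_backoff(["they", "run", "fast"], tables, "NN"))
# [('they', 'PPSS'), ('run', 'VB'), ('fast', 'NN')]
print(tag_with_backoff(["run"], tables, "NN"))
# [('run', 'NN')]
```

In the first call the bigram level picks VB for "run" because the previous tag is PPSS, and the unseen word "fast" falls all the way through to the default. In the second call "run" never started a training sentence, so the bigram level has no entry and the unigram level answers with the word's most frequent tag.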
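RegexpTagger assigns tags from a list of (pattern, tag) pairs; the patterns are tried in order:
from nltk.tag.sequential import RegexpTagger

# The r prefix creates a raw string, so backslashes are not treated as escapes;
# it is the usual way to write regular expressions.
# Note: the '.' in the first pattern is unescaped, so it matches any character
# (e.g. '1,000' as well as '3.14').
regexp_tagger = RegexpTagger([
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),   # cardinal numbers
    (r'(The|the|A|a|An|an)$', 'AT'),   # articles
    (r'.*able$', 'JJ'),                # adjectives
    (r'.*ness$', 'NN'),                # nouns formed from adjectives
    (r'.*ly$', 'RB'),                  # adverbs
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # past tense verbs
    (r'.*', 'NN'),                     # nouns (default)
])
print(regexp_tagger.evaluate(test_data))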
result:
0.31306687929831556
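The effect of pattern order can be seen without NLTK. The sketch below assumes that the tagger tries the patterns in list order with `re.match` and returns the tag of the first one that matches; `first_match_tag` is a hypothetical helper that models this, not an NLTK function.

```python
import re

# Same patterns as above, in the same order
PATTERNS = [
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
    (r'(The|the|A|a|An|an)$', 'AT'),
    (r'.*able$', 'JJ'),
    (r'.*ness$', 'NN'),
    (r'.*ly$', 'RB'),
    (r'.*s$', 'NNS'),
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*', 'NN'),
]

def first_match_tag(word, patterns=PATTERNS):
    """Return the tag of the first pattern that matches, or None if none does."""
    for pattern, tag in patterns:
        if re.match(pattern, word):
            return tag
    return None

for word in ["happiness", "table", "bus", "1,000", "cat"]:
    print(word, first_match_tag(word))
# happiness NN   ('.*ness$' is tried before '.*s$')
# table JJ       (ends in 'able', although it is a noun)
# bus NNS        (ends in 's', although it is singular)
# 1,000 CD       (the unescaped '.' matches the comma)
# cat NN         (catch-all pattern)

# With the dot escaped, '1,000' would no longer match the number pattern
print(re.match(r'^-?[0-9]+(\.[0-9]+)?$', "1,000"))  # None
```

Suffix rules like these are cheap but coarse, which is consistent with the low accuracy reported above.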
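# Custom patterns work too. There is no catch-all pattern and no backoff here,
# so tokens that match nothing are tagged None.
date_tagger = RegexpTagger([
    (r'(\d{2})[/.-](\d{2})[/.-](\d{4})$', 'DATE'),
    (r'\$', 'MONEY'),
])
test = 'I will be coming on sat 10-02-2014 with around 10 $ '.split()
date_tagger.tag(test)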
result:
[('I', None),
('will', None),
('be', None),
('coming', None),
('on', None),
('sat', None),
('10-02-2014', 'DATE'),
('with', None),
('around', None),
('10', None),
('$', 'MONEY')]
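The two patterns can be probed directly with Python's `re` module to see why '10' stays untagged while '$' is tagged MONEY. This is a small sketch that assumes matching is anchored at the start of the token (as with `re.match`); `describe` is a hypothetical helper used only for illustration.

```python
import re

DATE = re.compile(r'(\d{2})[/.-](\d{2})[/.-](\d{4})$')
MONEY = re.compile(r'\$')

def describe(token):
    """Return (tag, captured groups) for a token, mimicking the two rules above."""
    m = DATE.match(token)
    if m:
        return ('DATE', m.groups())
    if MONEY.match(token):
        return ('MONEY', None)
    return (None, None)

print(describe('10-02-2014'))  # ('DATE', ('10', '02', '2014'))
print(describe('10/02/2014'))  # ('DATE', ('10', '02', '2014'))
print(describe('1-2-2014'))    # (None, None): each of the first two fields needs two digits
print(describe('$'))           # ('MONEY', None)
print(describe('$10'))         # ('MONEY', None): only the start of the token is checked
print(describe('10$'))         # (None, None): the token does not start with '$'
print(describe('10'))          # (None, None)
```

Because the amount and the currency sign are separate tokens in the test sentence, only '$' is tagged; a rule for the number itself would be needed to tag '10' as well.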
# Same n-gram chain as before, but the last fallback is regexp_tagger instead of default_tagger
unigram_tagger = UnigramTagger(train_data, backoff=regexp_tagger)
print(unigram_tagger.evaluate(test_data))
bigram_tagger = BigramTagger(train_data, backoff=unigram_tagger)
print(bigram_tagger.evaluate(test_data))
trigram_tagger = TrigramTagger(train_data, backoff=bigram_tagger)
print(trigram_tagger.evaluate(test_data))
result:
0.8657430479417921
0.8755108143127679
0.8730190371773149