
  • Continuing from the previous post: part-of-speech tags
  • After running the code, I found a problem


a_sentence = 'like hate'
[('like', 'IN'), ('hate', 'NN')]
  • Ideas for a fix
  1. Manually find the keywords that were missed because their tags were wrong, and add them to content.
  2. Modify the dictionary used by pos_tag.
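
A minimal sketch of the second idea, assuming the simplest possible approach: leave pos_tag alone and patch its output with a hand-made dictionary. The helper name `apply_overrides` and the choice of 'VB' for both words are assumptions for illustration, not part of NLTK.

```python
def apply_overrides(tagged, overrides):
    """Replace the tag of any word listed in `overrides` (hypothetical helper)."""
    return [(word, overrides.get(word.lower(), tag)) for word, tag in tagged]

# Output copied from the pos_tag run above.
tagged = [('like', 'IN'), ('hate', 'NN')]

# Hand-made corrections; tagging both words as 'VB' is an assumption.
overrides = {'like': 'VB', 'hate': 'VB'}

fixed = apply_overrides(tagged, overrides)
print(fixed)
```

The lookup is case-insensitive on the word, and any word that is not in the dictionary keeps the tag it already had.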






POS tagging: part-of-speech tagging, also described in terms of word classes or lexical categories. There are many names, but they all mean labeling each word with its part of speech.

NLTK's off-the-shelf tools make it easy to POS-tag a piece of text:

>>> import nltk
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

The API documentation describes this interface as follows:

Use NLTK's currently recommended part of speech tagger to tag the given list of tokens.

I checked the code: pos_tag loads the standard Treebank POS tagger, whose tags come from the Penn Treebank tag set:

1.     CC     Coordinating conjunction
2.     CD     Cardinal number
3.     DT     Determiner
4.     EX     Existential there
5.     FW     Foreign word
6.     IN     Preposition or subordinating conjunction
7.     JJ     Adjective
8.     JJR     Adjective, comparative
9.     JJS     Adjective, superlative
10.     LS     List item marker
11.     MD     Modal
12.     NN     Noun, singular or mass
13.     NNS     Noun, plural
14.     NNP     Proper noun, singular
15.     NNPS     Proper noun, plural
16.     PDT     Predeterminer
17.     POS     Possessive ending
18.     PRP     Personal pronoun
19.     PRP$     Possessive pronoun
20.     RB     Adverb
21.     RBR     Adverb, comparative
22.     RBS     Adverb, superlative
23.     RP     Particle
24.     SYM     Symbol
25.     TO     to
26.     UH     Interjection
27.     VB     Verb, base form
28.     VBD     Verb, past tense
29.     VBG     Verb, gerund or present participle
30.     VBN     Verb, past participle
31.     VBP     Verb, non-3rd person singular present
32.     VBZ     Verb, 3rd person singular present
33.     WDT     Wh-determiner
34.     WP     Wh-pronoun
35.     WP$     Possessive wh-pronoun
36.     WRB     Wh-adverb 



Some of the corpora in nltk.corpus are already annotated with POS tags and can be used as training data. Every tagged corpus has a tagged_words() method.

>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]
>>> nltk.corpus.brown.tagged_words(tagset='universal')  # NLTK 3; NLTK 2 used simplify_tags=True
[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ...]


Automatic Tagging


We'll use the Brown corpus as the example:

>>> from nltk.corpus import brown

>>> brown_tagged_sents = brown.tagged_sents(categories='news')


>>> brown_sents = brown.sents(categories='news')

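This retrieves the tagged sentences and the untagged sentences separately: the tagged sentences serve as the reference (gold) data for checking a tagging algorithm, and the untagged sentences as the input to test it on.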


The Default Tagger

The simplest possible tagger assigns the same tag to each token.

>>> raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
>>> tokens = nltk.word_tokenize(raw)

>>> default_tagger = nltk.DefaultTagger('NN')


>>> default_tagger.tag(tokens)
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'),
('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'),
('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'),
('I', 'NN'), ('am', 'NN'), ('!', 'NN')]
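
A small sketch of how one might pick the default tag and see how far it gets, assuming a tiny hand-made gold sample (the `toy_gold` data below is invented for illustration; on a real corpus you would count tags over brown.tagged_words instead). It is plain Python rather than NLTK's DefaultTagger.

```python
from collections import Counter

# Invented gold-standard sample; the tags follow the Brown tag set.
toy_gold = [('the', 'AT'), ('jury', 'NN'), ('praised', 'VBD'),
            ('the', 'AT'), ('election', 'NN'), ('report', 'NN')]

# The most frequent tag in the sample is the best single guess.
most_common_tag = Counter(tag for _, tag in toy_gold).most_common(1)[0][0]

def default_tag(tokens, tag):
    """Assign the same tag to every token."""
    return [(token, tag) for token in tokens]

tokens = [word for word, _ in toy_gold]
predicted = default_tag(tokens, most_common_tag)

# Fraction of tokens whose predicted tag matches the gold tag.
accuracy = sum(p == g for p, g in zip(predicted, toy_gold)) / len(toy_gold)
print(most_common_tag, accuracy)
```

Even with the best single tag, only the tokens that really carry that tag come out right, which is why the default tagger is mostly useful as a fallback.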



The Regular Expression Tagger

The regular expression tagger assigns tags to tokens on the basis of matching patterns.

>>> patterns = [
... (r'.*ing$', 'VBG'), # gerunds
... (r'.*ed$', 'VBD'), # simple past
... (r'.*es$', 'VBZ'), # 3rd singular present
... (r'.*ould$', 'MD'), # modals
... (r'.*\'s$', 'NN$'), # possessive nouns
... (r'.*s$', 'NNS'), # plural nouns
... (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers
... (r'.*', 'NN') # nouns (default)
... ]


>>> from collections import defaultdict
>>> pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
>>> pos = dict(colorless='ADJ', ideas='N', sleep='V', furiously='ADV')
>>> pos = defaultdict(list)   # missing keys default to an empty list
>>> pos['sleep'] = ['NOUN', 'VERB']
>>> pos['ideas']
[]

>>> regexp_tagger = nltk.RegexpTagger(patterns)


>>> regexp_tagger.tag(brown_sents[3])


[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'),
('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'), ('was', 'NNS'), ('received', 'VBD'),
("''", 'NN'), (',', 'NN'), ('the', 'NN'), ('jury', 'NN'), ('said', 'NN'), (',', 'NN'),
('``', 'NN'), ('considering', 'VBG'), ('the', 'NN'), ('widespread', 'NN'), ...]
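
The output shows 'was' tagged as NNS, which comes from the patterns being tried in order with the first match winning. Here is a rough pure-Python sketch of that behaviour using only the `re` module; the function name `regexp_tag` is made up, and this is an approximation of the idea rather than NLTK's own implementation.

```python
import re

# Same patterns as in the example; order matters because the first match wins.
PATTERNS = [
    (r'.*ing$', 'VBG'),                  # gerunds
    (r'.*ed$', 'VBD'),                   # simple past
    (r'.*es$', 'VBZ'),                   # 3rd singular present
    (r'.*ould$', 'MD'),                  # modals
    (r'.*\'s$', 'NN$'),                  # possessive nouns
    (r'.*s$', 'NNS'),                    # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),    # cardinal numbers
    (r'.*', 'NN'),                       # nouns (default)
]

def regexp_tag(tokens, patterns=PATTERNS):
    """Tag each token with the tag of the first pattern that matches it."""
    compiled = [(re.compile(pattern), tag) for pattern, tag in patterns]
    tagged = []
    for token in tokens:
        for regex, tag in compiled:
            if regex.match(token):
                tagged.append((token, tag))
                break
    return tagged

result = regexp_tag(['reports', 'was', 'received', 'considering', '3.5'])
print(result)
```

'was' ends in "s" and matches no earlier pattern, so the plural-noun rule claims it. Surface patterns alone cannot tell a verb from a plural noun.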



The Lookup Tagger

A lot of high-frequency words do not have the NN tag. Let’s find the hundred most frequent words and store their most likely tag.

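This method starts to be of practical use: it counts the most frequent words in the training corpus, finds the most likely tag for each of them, and uses that to tag text.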

>>> fd = nltk.FreqDist(brown.words(categories='news'))

FreqDist({'The': 806, 'Fulton': 14, 'County': 35, 'Grand': 6, 'Jury': 2, 'said': 402, 'Friday': 41, ...})


>>> cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))


>>> most_freq_words = [word for (word, _) in fd.most_common(100)]  # fd.keys()[:100] only worked in NLTK 2


>>> likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags)

This code takes the top 100 words from the corpus, finds the most frequent tag for each of those 100 words, and builds the likely_tags dictionary from them.



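The biggest problem with this approach is that you have only specified tags for the top 100 words. What about all the other words?

This is where the default tagger from earlier becomes useful:

>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags, backoff=nltk.DefaultTagger('NN'))

This partly solves the problem: any word the lookup table does not know is tagged by the default tagger.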
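
To make the lookup-plus-default idea concrete, here is a small stdlib-only sketch. It assumes a tiny invented corpus (`toy_tagged`) and keeps only the top 2 words instead of the top 100; the name `lookup_tag` is hypothetical, and the real thing would use the FreqDist and ConditionalFreqDist shown above.

```python
from collections import Counter, defaultdict

# Invented tagged corpus, standing in for brown.tagged_words(categories='news').
toy_tagged = [('the', 'AT'), ('jury', 'NN'), ('said', 'VBD'), ('the', 'AT'),
              ('election', 'NN'), ('was', 'BEDZ'), ('said', 'VBN'),
              ('said', 'VBD'), ('the', 'AT')]

# Word frequencies (the role of fd) and per-word tag frequencies (the role of cfd).
word_freq = Counter(word for word, _ in toy_tagged)
tag_freq = defaultdict(Counter)
for word, tag in toy_tagged:
    tag_freq[word][tag] += 1

# Keep only the most frequent words and store their most likely tag.
top_words = [word for word, _ in word_freq.most_common(2)]
likely_tags = {word: tag_freq[word].most_common(1)[0][0] for word in top_words}

def lookup_tag(tokens, model, default='NN'):
    """Use the lookup table when the word is known, else fall back to `default`."""
    return [(token, model.get(token, default)) for token in tokens]

result = lookup_tag(['the', 'jury', 'said', 'hello'], likely_tags)
print(likely_tags)
print(result)
```

'jury' and 'hello' are not in the table, so both get the default tag. That happens to be right for 'jury' and is only a guess for 'hello'.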



N-Gram Tagging

Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token.

The lookup tagger above is built on the unigram tagger. Here is the more general way to use a unigram tagger:

>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')
>>> unigram_tagger = nltk.UnigramTagger(brown_tagged_sents) #Training 
>>> unigram_tagger.tag(brown_sents[2007])
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'),
('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'),
(',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'),
('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'),
('direct', 'JJ'), ('.', '.')]

You can train a unigram tagger on an already-tagged corpus.


An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens.



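>>> size = int(len(brown_tagged_sents) * 0.9)   # assumed 90/10 train/test split
>>> train_sents = brown_tagged_sents[:size]
>>> bigram_tagger = nltk.BigramTagger(train_sents)
>>> bigram_tagger.tag(brown_sents[2007])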


>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
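
The backoff chain can be sketched in plain Python to show what each level contributes. This is an illustration under simplifying assumptions (an invented three-sentence training set, hypothetical function names), not a copy of how NLTK implements BigramTagger.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def train_bigram(tagged_sents):
    """Map each (previous tag, word) context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = None  # None marks the start of a sentence
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def tag_with_backoff(tokens, bigram, unigram, default='NN'):
    """Try the bigram model, then the unigram model, then the default tag."""
    tagged, prev = [], None
    for word in tokens:
        tag = bigram.get((prev, word))
        if tag is None:
            tag = unigram.get(word, default)
        tagged.append((word, tag))
        prev = tag
    return tagged

# Invented training data: 'wind' is usually a noun, but a verb after 'to'.
toy_train = [
    [('the', 'AT'), ('wind', 'NN'), ('blows', 'VBZ')],
    [('to', 'TO'), ('wind', 'VB'), ('the', 'AT'), ('clock', 'NN')],
    [('the', 'AT'), ('wind', 'NN'), ('stopped', 'VBD')],
]
unigram = train_unigram(toy_train)
bigram = train_bigram(toy_train)

seen = tag_with_backoff(['to', 'wind', 'the', 'clock'], bigram, unigram)
unseen = tag_with_backoff(['wind', 'blows', 'loudly'], bigram, unigram)
print(seen)
print(unseen)
```

In the first sentence every context was seen in training, so the bigram model tags 'wind' as a verb. In the second, sentence-initial 'wind' falls back to the unigram model and the unknown word 'loudly' falls all the way back to the default tag.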


Transformation-Based Tagging

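The problems with n-gram taggers are that the model takes up a lot of space, and that when looking at context they only consider the tags of the preceding words, not the words themselves.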



Brill tagging is a kind of transformation-based learning, named after its inventor. The general idea is very simple: guess the tag of each word, then go back and fix the mistakes.


The example below shows how Brill tagging works:

(1) replace NN with VB when the previous word is TO;

(2) replace TO with IN when the next tag is NNS.

Phrase    to   increase  grants  to   states  for  vocational  rehabilitation
Unigram   TO   NN        NNS     TO   NNS     IN   JJ          NN
Rule 1         VB
Rule 2                           IN
Output    TO   VB        NNS     IN   NNS     IN   JJ          NN

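The first step tags every word with a unigram tagger, and many of those tags may be inaccurate. The rules are then applied one after another to correct them.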
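
The worked example can be replayed in a few lines of Python. This is only a sketch of applying the two hand-written rules to the unigram output; the function name `apply_rules` is made up, and a real Brill tagger learns its rules instead of hard-coding them.

```python
def apply_rules(tokens, tags):
    """Apply the two example rules in order and return the corrected tags."""
    tags = list(tags)
    # Rule 1: replace NN with VB when the previous word is 'to'.
    for i in range(1, len(tags)):
        if tags[i] == 'NN' and tokens[i - 1].lower() == 'to':
            tags[i] = 'VB'
    # Rule 2: replace TO with IN when the next tag is NNS.
    for i in range(len(tags) - 1):
        if tags[i] == 'TO' and tags[i + 1] == 'NNS':
            tags[i] = 'IN'
    return tags

tokens = 'to increase grants to states for vocational rehabilitation'.split()
unigram_tags = ['TO', 'NN', 'NNS', 'TO', 'NNS', 'IN', 'JJ', 'NN']

output = apply_rules(tokens, unigram_tags)
print(list(zip(tokens, output)))
```

Rule 1 fixes 'increase' and rule 2 fixes the second 'to', which reproduces the Output row of the example.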




Brill rules follow the template "replace T1 with T2 in the context C". During its training phase, the tagger guesses values for T1, T2, and C, to create thousands of candidate rules. Each rule is scored according to its net benefit: the number of incorrect tags that it corrects, less the number of correct tags it incorrectly modifies.

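In other words, the training phase first creates thousands of candidate rules. These can be generated with simple statistics, so some of them will be inaccurate. Each rule is then used to fix mistakes and the result is compared with the correct tags: the number of tags it fixes minus the number it breaks is the rule's score. High-scoring rules are kept and low-scoring ones are dropped. Here are some example rules: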
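
The scoring step is easy to sketch. The snippet below assumes one candidate rule ("NN -> VB if the preceding tag is TO") and an invented seven-word sample with gold tags; the names `nn_to_vb_after_to` and `score_rule` are hypothetical.

```python
def nn_to_vb_after_to(tags):
    """Candidate rule: NN -> VB if the tag of the preceding word is 'TO'."""
    new_tags = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == 'NN' and tags[i - 1] == 'TO':
            new_tags[i] = 'VB'
    return new_tags

def score_rule(rule, current_tags, gold_tags):
    """Net benefit: tags the rule corrects minus correct tags it breaks."""
    new_tags = rule(current_tags)
    triples = list(zip(current_tags, new_tags, gold_tags))
    fixed = sum(1 for old, new, gold in triples if old != gold and new == gold)
    broken = sum(1 for old, new, gold in triples if old == gold and new != gold)
    return fixed - broken

# Invented sample: "to increase grants to expand to school"
current = ['TO', 'NN', 'NNS', 'TO', 'NN', 'TO', 'NN']   # initial unigram guess
gold    = ['TO', 'VB', 'NNS', 'TO', 'VB', 'TO', 'NN']   # correct tags

score = score_rule(nn_to_vb_after_to, current, gold)
print(score)
```

The rule fixes 'increase' and 'expand' but wrongly changes 'school', so its net benefit here is 2 - 1 = 1.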

NN -> VB if the tag of the preceding word is 'TO'
NN -> VBD if the tag of the following word is 'DT'
NN -> VBD if the tag of the preceding word is 'NNS'
NN -> NNP if the tag of words i-2...i-1 is '-NONE-'
NN -> NNP if the tag of the following word is 'NNP'
NN -> NNP if the text of words i-2...i-1 is 'like'
NN -> VBN if the text of the following word is '*-1'

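  •  However, the methods above only fit cases where you train the rules or write the rules yourself. Is there an already-trained tagger? Or can the default tagger be modified?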


For reference, part of the contents of cfd, the ConditionalFreqDist built from brown.tagged_words(categories='news') for the lookup tagger:

{'The': FreqDist({'AT': 775, 'AT-HL': 3, 'AT-TL': 28}),
 'Fulton': FreqDist({'NP': 4, 'NP-TL': 10}),
 'County': FreqDist({'NN-TL': 35}),
 'Grand': FreqDist({'FW-JJ-TL': 1, 'JJ-TL': 5}),
 'Jury': FreqDist({'NN-TL': 2}),
 'said': FreqDist({'VBD': 382, 'VBN': 20}),
 'Friday': FreqDist({'NR': 41}),
 'an': FreqDist({'AT': 300}),
 'investigation': FreqDist({'NN': 9}),
 'of': FreqDist({'IN': 2716, 'IN-HL': 5, 'IN-TL': 128}),
 "Atlanta's": FreqDist({'NP$': 4}),
 'recent': FreqDist({'JJ': 20}),
 'primary': FreqDist({'JJ': 4, 'NN': 13}),
 'election': FreqDist({'NN': 38}),
 'produced': FreqDist({'VBD': 5, 'VBN': 1}),
 ...}




