Part-of-speech tagger (POS tagger): processes a sequence of words and attaches a part-of-speech tag to each word
nltk.pos_tag(text)
import nltk
text=nltk.word_tokenize("they refuse to permit us to obtain the refuse permit")
print nltk.pos_tag(text)
[('they', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
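Note: word_tokenize() is meant to be applied to one sentence at a time
Note that the tagger distinguishes the different parts of speech of the same word: refuse and permit are each tagged once as a verb and once as a noun
Find words that occur in the same contexts as woman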
text.similar(word)
text=nltk.Text(word.lower() for word in nltk.corpus.brown.words())
text.similar('woman')
man time day year car moment world family house country child boy
state job way war girl place word work
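To make "same context" concrete, here is a rough sketch in Python 3 (standard library only). It assumes a word's context is the pair (previous word, next word) and ranks other words by how many such pairs they share with the target. The helper name similar_words and the toy sentence are made up for illustration; this approximates the idea and is not NLTK's actual implementation.

```python
from collections import defaultdict

def similar_words(tokens, target, top_n=5):
    """Rank words by the number of (previous, next) contexts shared with target."""
    contexts = defaultdict(set)
    # Every word except the first and last has a left and a right neighbour.
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((prev, nxt))
    target_contexts = contexts.get(target, set())
    scores = {}
    for word, seen in contexts.items():
        shared = len(seen & target_contexts)
        if word != target and shared:
            scores[word] = shared
    # Most shared contexts first; ties broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]

# Toy corpus (hypothetical data, not the Brown corpus)
tokens = ("the woman saw the dog . the man saw the cat . "
          "the girl saw the bird . a man ran").split()
print(similar_words(tokens, "woman"))   # ['girl', 'man']
```

Here girl and man are returned because, like woman, they occur between the and saw.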
nltk.tag.str2tuple() converts a tagged token string of the form word/TAG into a (word, tag) tuple
sent='''
The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN other/AP topics/NNS among/IN them/PPO
the/AT
'''
print [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ('of', 'IN'), ('other', 'AP'), ('topics', 'NNS'), ('among', 'IN'), ('them', 'PPO'), ('the', 'AT')]
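The conversion itself is simple string splitting. The sketch below (Python 3, standard library only) shows one way to do it by hand; split_tagged is a hypothetical helper written for these notes, not an NLTK function, and it assumes the tag follows the last "/" in the token.

```python
def split_tagged(token, sep="/"):
    """Split 'word/TAG' at the last separator; return (token, None) if there is none."""
    word, found, tag = token.rpartition(sep)
    if not found:
        return (token, None)
    return (word, tag)

sent = "The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN"
tagged = [split_tagged(t) for t in sent.split()]
print(tagged[:3])   # [('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN')]
```

Splitting at the last separator keeps a word that itself contains a slash intact, e.g. and/or/CC becomes ('and/or', 'CC').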
text=open("e:/python2/Harry Potter1.txt").read()
import re
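def text_clean(text):
    # Replace every non-alphanumeric character with a space, then split on whitespace.
    # Apostrophes are removed too, so "Sorcerer's" becomes the two tokens "Sorcerer" and "s".
    text_subp=re.sub(r'[^a-zA-Z0-9]',' ',text)
    text_split=text_subp.split()
    return text_split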
text=text_clean(text)
text_pos=nltk.pos_tag(text)
print text_pos[:30]
[('Harry', 'NNP'), ('Potter', 'NNP'), ('and', 'CC'), ('the', 'DT'), ('Sorcerer', 'NNP'), ('s', 'VBD'), ('Stone', 'NNP'), ('CHAPTER', 'NNP'), ('ONE', 'NNP'), ('THE', 'NNP'), ('BOY', 'NNP'), ('WHO', 'NNP'), ('LIVED', 'NNP'), ('Mr', 'NNP'), ('and', 'CC'), ('Mrs', 'NNP'), ('Dursley', 'NNP'), ('of', 'IN'), ('number', 'NN'), ('four', 'CD'), ('Privet', 'NNP'), ('Drive', 'NNP'), ('were', 'VBD'), ('proud', 'JJ'), ('to', 'TO'), ('say', 'VB'), ('that', 'IN'), ('they', 'PRP'), ('were', 'VBD'), ('perfectly', 'RB')]
word_tag_pairs=nltk.bigrams(text_pos)
tag_pairs=[]
for (a,b) in word_tag_pairs:
    tag_pairs.append([a[1],b[1]])
print tag_pairs[:20]
[['NNP', 'NNP'], ['NNP', 'CC'], ['CC', 'DT'], ['DT', 'NNP'], ['NNP', 'VBD'], ['VBD', 'NNP'], ['NNP', 'NNP'], ['NNP', 'NNP'], ['NNP', 'NNP'], ['NNP', 'NNP'], ['NNP', 'NNP'], ['NNP', 'NNP'], ['NNP', 'NNP'], ['NNP', 'CC'], ['CC', 'NNP'], ['NNP', 'NNP'], ['NNP', 'IN'], ['IN', 'NN'], ['NN', 'CD'], ['CD', 'NNP']]
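Once the tag pairs exist, the natural next step is to count them. A small sketch in Python 3 (standard library only), using the first ten tagged tokens from the output above as sample data; it stores each pair as a tuple rather than a list, because only hashable items can be counted.

```python
from collections import Counter

# First ten (word, tag) pairs from the tagged text above
tagged = [('Harry', 'NNP'), ('Potter', 'NNP'), ('and', 'CC'), ('the', 'DT'),
          ('Sorcerer', 'NNP'), ('s', 'VBD'), ('Stone', 'NNP'), ('CHAPTER', 'NNP'),
          ('ONE', 'NNP'), ('THE', 'NNP')]

tags = [tag for _, tag in tagged]
# zip(tags, tags[1:]) pairs every tag with the tag that follows it
tag_pair_counts = Counter(zip(tags, tags[1:]))

print(tag_pair_counts[('NNP', 'NNP')])   # 4
```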
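Note: I have not yet managed to get the code on p. 201 of the book working with my own text
Find the most common verbs in the text and sort them by frequency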
print text_pos[:20]
[('Harry', 'NNP'), ('Potter', 'NNP'), ('and', 'CC'), ('the', 'DT'), ('Sorcerer', 'NNP'), ('s', 'VBD'), ('Stone', 'NNP'), ('CHAPTER', 'NNP'), ('ONE', 'NNP'), ('THE', 'NNP'), ('BOY', 'NNP'), ('WHO', 'NNP'), ('LIVED', 'NNP'), ('Mr', 'NNP'), ('and', 'CC'), ('Mrs', 'NNP'), ('Dursley', 'NNP'), ('of', 'IN'), ('number', 'NN'), ('four', 'CD')]
verbs=[]
for i in range(len(text_pos)):
    if text_pos[i][1].startswith('V'):
        verbs.append(text_pos[i])
verb_freq=nltk.FreqDist(verbs)
fdist_sort=sorted(verb_freq.iteritems(), key=lambda d:d[1], reverse = True)
print fdist_sort[:10]
[(('was', 'VBD'), 1178), (('said', 'VBD'), 793), (('had', 'VBD'), 685), (('be', 'VB'), 362), (('were', 'VBD'), 304), (('s', 'VBD'), 228), (('s', 'VBZ'), 219), (('been', 'VBN'), 211), (('got', 'VBD'), 169), (('looked', 'VBD'), 165)]
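The same filter-and-count step can be written more compactly. A sketch in Python 3 (standard library only), with a small made-up tagged sample standing in for the novel; Counter.most_common() returns the items already sorted by descending count, so no separate sorted() call is needed.

```python
from collections import Counter

# Hypothetical sample; in the notes this would be the tagged novel
tagged = [('Harry', 'NNP'), ('was', 'VBD'), ('proud', 'JJ'), ('to', 'TO'), ('say', 'VB'),
          ('that', 'IN'), ('he', 'PRP'), ('was', 'VBD'), ('happy', 'JJ'), ('and', 'CC'),
          ('said', 'VBD'), ('so', 'RB')]

# Keep the (word, tag) pairs whose tag starts with 'V', then count them
verbs = [(word, tag) for word, tag in tagged if tag.startswith('V')]
verb_freq = Counter(verbs)

print(verb_freq.most_common(1))   # [(('was', 'VBD'), 2)]
```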
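In Python a text is treated as a list of words; an important property of a list is that a specific item can be looked up by its index.
Mapping words (keys) to parts of speech (values) gives a dictionary. A dictionary is a mapping, not a sequence, so the order of its entries is not fixed (true of the Python 2 used here; since Python 3.7 a dict keeps insertion order).
The FreqDist used earlier inherits from dict, so a FreqDist object can be handled just like a dictionary
# Create the dictionary pos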
pos={'colorlesss':'ADJ','ideas':'N','sleep':'V','furiously':'ADV'}
# Extract the dictionary's keys into a list
print 'list(pos):',list(pos)
print 'sorted(pos):',sorted(pos)
print 'pos.items():',pos.items()
list(pos): ['colorlesss', 'sleep', 'ideas', 'furiously']
sorted(pos): ['colorlesss', 'furiously', 'ideas', 'sleep']
pos.items(): [('colorlesss', 'ADJ'), ('sleep', 'V'), ('ideas', 'N'), ('furiously', 'ADV')]
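Note: a plain dict raises KeyError when a missing key is accessed; only a defaultdict supplies a default value. In the code below the second assignment rebinds frequency to a new, empty defaultdict, so {'colorless':4} is discarded. With the two assignments in the opposite order, frequency would end up as a plain dict and accessing a missing key would still raise an error
The argument to nltk.defaultdict() is a callable that produces the default value, usually a type such as int
When the argument is a lambda expression, the default value is whatever the lambda returns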
frequency={'colorless':4}
frequency=nltk.defaultdict(int)
print frequency['ideas']
pos=nltk.defaultdict(lambda:'N')
print pos['blog']
0
N
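The difference between a plain dict and a defaultdict can be checked directly. A sketch in Python 3 using collections.defaultdict from the standard library; passing the existing dict as the second argument keeps its counts instead of discarding them.

```python
from collections import defaultdict

plain = {'colorless': 4}
try:
    plain['ideas']                      # missing key in a plain dict
except KeyError:
    print('plain dict: KeyError')

# Start from the existing counts; missing keys default to int() == 0
frequency = defaultdict(int, plain)
frequency['ideas'] += 1
print(frequency['colorless'], frequency['ideas'], frequency['sleep'])   # 4 1 0

# With a lambda, the default is whatever the lambda returns
pos = defaultdict(lambda: 'N')
print(pos['blog'])   # N
```

Note that merely reading a missing key from a defaultdict also inserts it, so 'sleep' and 'blog' are stored after the lookups above.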
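First convert the list into a dictionary
The first argument to sorted() is a list; use .items() to turn a dictionary into a list of (key, value) pairs
The key argument specifies the sort key; itemgetter(1) sorts by the second element of each pair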
from operator import itemgetter
# Named pairs rather than list, so the built-in list() is not shadowed
pairs=[(u'secondly', 2), (u'pardon', 6), (u'Tut', 1), (u'saves', 1), (u'knelt', 1), \
(u'four', 8), (u'Does', 2), (u'sleep', 6), (u'hanging', 3)]
list2dict={}
for (word,count) in pairs:
    list2dict[word]=count
print list2dict
print sorted(pairs,key=itemgetter(1),reverse=True)
{u'secondly': 2, u'pardon': 6, u'saves': 1, u'Tut': 1, u'knelt': 1, u'four': 8, u'Does': 2, u'sleep': 6, u'hanging': 3}
[(u'four', 8), (u'pardon', 6), (u'sleep', 6), (u'hanging', 3), (u'secondly', 2), (u'Does', 2), (u'Tut', 1), (u'saves', 1), (u'knelt', 1)]
# Swap keys and values; keys that share the same value collapse into a single entry
list2=dict((value,key) for (key,value) in list2dict.items())
print list2
{8: u'four', 1: u'knelt', 2: u'Does', 3: u'hanging', 6: u'sleep'}
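The inverted dictionary above keeps only one word per count, because several words share the same value. If all of them are wanted, one option is to group the words by count. A sketch in Python 3 (standard library only); the name words_by_count is just for illustration.

```python
from collections import defaultdict

counts = {'secondly': 2, 'pardon': 6, 'Tut': 1, 'saves': 1, 'knelt': 1,
          'four': 8, 'Does': 2, 'sleep': 6, 'hanging': 3}

# Map each count to the list of all words that have it
words_by_count = defaultdict(list)
for word, count in counts.items():
    words_by_count[count].append(word)

print(sorted(words_by_count[1]))   # ['Tut', 'knelt', 'saves']
```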
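A word's tag depends on the word itself and on its context in the sentence
Since most new words are nouns, the default tagger tags every token as a noun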
from nltk.corpus import brown
brown_tagged_sents=brown.tagged_sents(categories='news')
brown_sents=brown.sents(categories='news')
tags=[tag for (word,tag) in brown.tagged_words(categories='news')]
raw="I do not like green eggs and ham,I do not like them Sam I am !"
tokens=nltk.word_tokenize(raw)
default_tagger=nltk.DefaultTagger('NN')
print default_tagger.tag(tokens)
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'), ('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'), ('I', 'NN'), ('am', 'NN'), ('!', 'NN')]
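patterns=[
    (r'.*ing$','VBG'),# gerunds
    (r'.*ed$','VBD'),# simple past
    (r'.*es$','VBZ'),# third-person singular present
    (r'.*ould$','MD'),# modals
    (r'.*\'s$','NN$'),# possessive nouns
    (r'.*s$','NNS'),# plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$','CD'),# cardinal numbers; must start with ^ (a leading '.^' can never match)
    (r'.*','NN')# nouns (default)
]
regexp_tagger=nltk.RegexpTagger(patterns)
# Accuracy of regexp_tagger on brown_tagged_sents; the score below was recorded with a cardinal-number pattern that never matched
print regexp_tagger.evaluate(brown_tagged_sents)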
0.191419535772
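The pattern list is tried from top to bottom and the first match wins, so the order matters (the possessive rule must come before the plural rule, and the catch-all goes last). A minimal sketch of that idea in Python 3 with only the re module; regexp_tag is a hypothetical helper for illustration, not NLTK's RegexpTagger.

```python
import re

PATTERNS = [
    (r'.*ing$', 'VBG'),                   # gerunds
    (r'.*ed$', 'VBD'),                    # simple past
    (r'.*es$', 'VBZ'),                    # third-person singular present
    (r'.*ould$', 'MD'),                   # modals
    (r".*'s$", 'NN$'),                    # possessive nouns
    (r'.*s$', 'NNS'),                     # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),     # cardinal numbers
    (r'.*', 'NN'),                        # everything else: noun
]
COMPILED = [(re.compile(pattern), tag) for pattern, tag in PATTERNS]

def regexp_tag(tokens):
    """Tag each token with the tag of the first pattern that matches it."""
    tagged = []
    for token in tokens:
        for regex, tag in COMPILED:
            if regex.match(token):
                tagged.append((token, tag))
                break
    return tagged

print(regexp_tag(['running', 'would', "board's", 'eggs', '3.14', 'ham']))
# [('running', 'VBG'), ('would', 'MD'), ("board's", 'NN$'), ('eggs', 'NNS'), ('3.14', 'CD'), ('ham', 'NN')]
```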
Note: in the book, the list obtained from nltk.FreqDist() is sorted by value (frequency), but in practice I found that it is not, so I wrote my own sorted() call to sort it
fd=nltk.FreqDist(brown.words(categories='news'))
fd2list=sorted(fd.items(),key=itemgetter(1),reverse=True)
cfd=nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
print cfd.items()[:10]
[(u'stock', FreqDist({u'NN': 20, u'VB': 1})), (u'sunbonnet', FreqDist({u'NN': 1})), (u'Elevated', FreqDist({u'VBN-TL': 1})), (u'narcotic', FreqDist({u'JJ': 1, u'NN': 1})), (u'four', FreqDist({u'CD': 73})), (u'woods', FreqDist({u'NNS': 4})), (u'railing', FreqDist({u'NN': 1})), (u'Until', FreqDist({u'CS': 3, u'IN': 2})), (u'aggression', FreqDist({u'NN': 1})), (u'marching', FreqDist({u'VBG': 2}))]
print 'fd2list:',fd2list[:10]
print 'fd.keys:',fd.keys()[:10]
fd_most=[]
for (key,value) in fd2list:
    fd_most.append(key)
print 'fd_most:',fd_most[:10]
most_freq_words=fd_most[:100]
fd2list: [(u'the', 5580), (u',', 5188), (u'.', 4030), (u'of', 2849), (u'and', 2146), (u'to', 2116), (u'a', 1993), (u'in', 1893), (u'for', 943), (u'The', 806)]
fd.keys: [u'stock', u'sunbonnet', u'Elevated', u'narcotic', u'four', u'woods', u'railing', u'Until', u'aggression', u'marching']
fd_most: [u'the', u',', u'.', u'of', u'and', u'to', u'a', u'in', u'for', u'The']
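Converting between the various data types is a bit tedious; I will collect these conversions in a separate file later
UnigramTagger is trained on tagged sentence data and assigns each word the tag it carries most often
The algorithm in the book reads more like the underlying implementation of UnigramTagger; UnigramTagger can also be trained on the data directly
likely_tags=dict((word,cfd[word].max()) for word in most_freq_words)# most frequent tag for each of the 100 most common words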
baseline_tagger=nltk.UnigramTagger(model=likely_tags)
baseline_tagger.evaluate(brown_tagged_sents)
0.45578495136941344
baseline_tagger=nltk.UnigramTagger(brown_tagged_sents)
baseline_tagger.evaluate(brown_tagged_sents)
0.9349006503968017
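What the lookup model boils down to is a table from each word to its most frequent tag. A sketch in Python 3 (standard library only) on a tiny made-up sample; tag_counts plays the role of the conditional frequency distribution, and lookup_tag is a hypothetical helper, not part of NLTK. Unknown words get None here, which is an assumption of this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical tagged words standing in for brown.tagged_words()
tagged_words = [('the', 'AT'), ('refuse', 'NN'), ('permit', 'NN'), ('refuse', 'VB'),
                ('the', 'AT'), ('refuse', 'NN'), ('permit', 'VB'), ('permit', 'VB')]

# word -> Counter of the tags seen with that word
tag_counts = defaultdict(Counter)
for word, tag in tagged_words:
    tag_counts[word][tag] += 1

# Keep only the single most frequent tag per word
likely_tags = {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}

def lookup_tag(tokens, model, default=None):
    """Tag each token by table lookup; unknown tokens get the default."""
    return [(token, model.get(token, default)) for token in tokens]

print(lookup_tag(['the', 'refuse', 'permit', 'blog'], likely_tags))
# [('the', 'AT'), ('refuse', 'NN'), ('permit', 'VB'), ('blog', None)]
```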
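The unigram tagger is the UnigramTagger just introduced: it assigns each token the tag that is most likely for that particular token.
Split the data set into a training set and a test set: 90% is training data and 10% is test data.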
size=int(len(brown_tagged_sents)*0.9)
print size
train_sents=brown_tagged_sents[:size]
test_sents=brown_tagged_sents[size:]
unigram_tagger=nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)
4160
0.8120203329014253
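The score drops from about 0.93 to about 0.81 because the tagger is now evaluated on sentences it was not trained on. The sketch below (Python 3, standard library only) reproduces the procedure on made-up data: split 90/10, train a word-to-tag table on the first part, and measure the fraction of correctly tagged tokens on the rest. The functions train_unigram and accuracy are hypothetical helpers, and falling back to 'NN' for unknown words is an assumption of this sketch.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Return a dict mapping each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def accuracy(model, gold_sents, default='NN'):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        for word, gold_tag in sent:
            total += 1
            if model.get(word, default) == gold_tag:
                correct += 1
    return correct / total if total else 0.0

# Toy data: nine identical sentences plus one that differs
tagged_sents = [[('the', 'AT'), ('dog', 'NN'), ('barks', 'VBZ')] for _ in range(9)]
tagged_sents.append([('a', 'AT'), ('dog', 'NN'), ('ran', 'VBD')])

size = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:size], tagged_sents[size:]

model = train_unigram(train_sents)
print(accuracy(model, train_sents))   # 1.0 on the data it was trained on
print(accuracy(model, test_sents))    # lower on the held-out sentence
```

On the held-out sentence only dog is tagged correctly, since a and ran never appeared in the training part.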
An n-gram tagger picks the most likely tag for the n-th word given that word and the tags of the preceding n-1 words
bigram
bigram_tagger=nltk.BigramTagger(train_sents)
print bigram_tagger.tag(brown_sents[2007])
[(u'Various', u'JJ'), (u'of', u'IN'), (u'the', u'AT'), (u'apartments', u'NNS'), (u'are', u'BER'), (u'of', u'IN'), (u'the', u'AT'), (u'terrace', u'NN'), (u'type', u'NN'), (u',', u','), (u'being', u'BEG'), (u'on', u'IN'), (u'the', u'AT'), (u'ground', u'NN'), (u'floor', u'NN'), (u'so', u'CS'), (u'that', u'CS'), (u'entrance', u'NN'), (u'is', u'BEZ'), (u'direct', u'JJ'), (u'.', u'.')]
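The bigram tagger manages to tag every word of a sentence it saw during training, but does very poorly on an unseen sentence.
Build a combined tagger
1. Try to tag the token with the bigram tagger
2. If the bigram tagger cannot find a tag, try the unigram tagger
3. If the unigram tagger cannot find a tag either, use the default tagger
Backoff tagger (backoff)
cutoff: discard contexts that were seen only a few times; cutoff=2 drops contexts seen no more than twice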
t0=nltk.DefaultTagger('NN')
t1=nltk.UnigramTagger(train_sents,backoff=t0)
t2=nltk.BigramTagger(train_sents,backoff=t1,cutoff=2)
t2.evaluate(test_sents)
0.8423203428685339
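The backoff chain can be sketched without NLTK to show the control flow. The Python 3 code below (standard library only) is a simplification under stated assumptions: the bigram context is (previous tag, current word), a context is kept only if its best tag was seen more than cutoff times, and unknown words fall back to 'NN'. The functions train and tag and the toy sentences are hypothetical.

```python
from collections import Counter, defaultdict

def train(tagged_sents, cutoff=0):
    """Build a unigram table and a bigram table from tagged sentences."""
    word_tags = defaultdict(Counter)
    context_tags = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = None                      # no previous tag at sentence start
        for word, tag in sent:
            word_tags[word][tag] += 1
            context_tags[(prev_tag, word)][tag] += 1
            prev_tag = tag
    unigram = {w: c.most_common(1)[0][0] for w, c in word_tags.items()}
    bigram = {}
    for context, c in context_tags.items():
        best_tag, hits = c.most_common(1)[0]
        if hits > cutoff:                    # drop rarely seen contexts
            bigram[context] = best_tag
    return unigram, bigram

def tag(tokens, unigram, bigram, default='NN'):
    """Try the bigram table, then the unigram table, then the default tag."""
    tagged = []
    prev_tag = None
    for word in tokens:
        guess = bigram.get((prev_tag, word))
        if guess is None:
            guess = unigram.get(word)
        if guess is None:
            guess = default
        tagged.append((word, guess))
        prev_tag = guess
    return tagged

train_sents = [
    [('they', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB')],
    [('the', 'AT'), ('refuse', 'NN'), ('permit', 'NN')],
    [('the', 'AT'), ('refuse', 'NN')],
    [('a', 'AT'), ('permit', 'NN')],
]
tokens = ['they', 'refuse', 'the', 'blog']

unigram, bigram = train(train_sents)
print(tag(tokens, unigram, bigram))
# [('they', 'PRP'), ('refuse', 'VBP'), ('the', 'AT'), ('blog', 'NN')]

unigram, bigram_cut = train(train_sents, cutoff=1)
print(tag(tokens, unigram, bigram_cut))
# [('they', 'PRP'), ('refuse', 'NN'), ('the', 'AT'), ('blog', 'NN')]
```

With cutoff=1 the context (PRP, refuse) was seen only once and is dropped, so refuse falls back to its most frequent unigram tag NN.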
from cPickle import dump, load
# Save the trained tagger to disk; -1 selects the highest pickle protocol
with open('t2.pkl','wb') as output:
    dump(t2,output,-1)
# Load it back later instead of retraining
with open('t2.pkl','rb') as saved:
    tagger=load(saved)
text="The board's action shows what free enterprise is up against in our complex maze of regulatory laws."
tokens=text.split()
print tagger.tag(tokens)
[('The', u'AT'), ("board's", u'NN$'), ('action', 'NN'), ('shows', u'NNS'), ('what', u'WDT'), ('free', u'JJ'), ('enterprise', 'NN'), ('is', u'BEZ'), ('up', u'RP'), ('against', u'IN'), ('in', u'IN'), ('our', u'PP$'), ('complex', u'JJ'), ('maze', 'NN'), ('of', u'IN'), ('regulatory', 'NN'), ('laws.', 'NN')]
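For reference, the same save-and-load round trip in Python 3, where cPickle no longer exists and the pickle module is used instead. This is a sketch with a plain dict standing in for the trained tagger and a temporary directory instead of the working directory; both are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for a trained tagger (hypothetical data)
model = {'the': 'AT', 'refuse': 'NN', 'permit': 'NN'}

path = os.path.join(tempfile.mkdtemp(), 't2.pkl')

# Save: binary mode, highest available protocol
with open(path, 'wb') as output:
    pickle.dump(model, output, pickle.HIGHEST_PROTOCOL)

# Load it back
with open(path, 'rb') as saved:
    restored = pickle.load(saved)

print(restored == model)   # True
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.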
1. Morphology
2. Syntax