Python Natural Language Processing: Categorizing and Tagging Words (5.1-5.3)

This part introduces basic NLP techniques, including sequence labeling, N-gram models, backoff, and evaluation.

Classifying words by their parts of speech and labeling them accordingly is called part-of-speech tagging (POS tagging), or simply tagging.

Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is called a tagset.

5.1 Using a Tagger

A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word:

>>> import nltk

>>> text = nltk.word_tokenize('and now for something completely different')

>>> nltk.pos_tag(text)

[('and', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

Here CC is a coordinating conjunction, RB an adverb, IN a preposition, NN a noun, and JJ an adjective.
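If the fine-grained Penn Treebank tags are more detail than we need, recent NLTK releases also let pos_tag map its output onto the coarser universal tagset (this assumes a reasonably up-to-date NLTK version); CC, RB, IN, NN and JJ then become CONJ, ADV, ADP, NOUN and ADJ:

>>> nltk.pos_tag(text, tagset='universal')

[('and', 'CONJ'), ('now', 'ADV'), ('for', 'ADP'), ('something', 'NOUN'), ('completely', 'ADV'), ('different', 'ADJ')]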

NLTK provides documentation for each tag, which we can query by tag name. For example:

>>> nltk.help.upenn_tagset('RB')

RB: adverb

    occasionally unabatingly maddeningly adventurously professedly

    stirringly prominently technologically magisterially predominately

    swiftly fiscally pitilessly ...

This confirms that RB stands for adverb. In the same way we can look up NN, which is a singular or mass common noun:

>>> nltk.help.upenn_tagset('NN')

NN: noun, common, singular or mass

    common-carrier cabbage knuckle-duster Casino afghan shed thermostat

    investment slide humour falloff slick wind hyena override subhumanity

    machinist ...

We can also do a fuzzy lookup with a regular expression: nltk.help.brown_tagset('NN.*') lists every Brown Corpus tag beginning with NN, i.e. the noun tags together with their combined variants:

NN: noun, singular, common

    failure burden court fire appointment awarding compensation Mayor

    interim committee fact effect airport management surveillance jail

    doctor intern extern night weekend duty legislation Tax Office ...

NN$: noun, singular, common, genitive

    season's world's player's night's chapter's golf's football's

    baseball's club's U.'s coach's bride's bridegroom's board's county's

    firm's company's superintendent's mob's Navy's ...

NN+BEZ: noun, singular, common + verb 'to be', present tense, 3rd person singular

text.similar(word) finds other words that are used in contexts similar to word throughout the text; that is, if word1 occurs with the same context words as word, then word1 appears in the list returned by this function.

>>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())

>>> text.similar('women')

people men others the time children that one work man af house girls

and two way state years water this

The words returned are, in some sense, used in the same way as 'women'. The similar() method can also give a rough indication of whether different texts were written by the same author.

>>> from nltk.book import *

*** Introductory Examples for the NLTK Book ***

Loading text1, ..., text9 and sent1, ..., sent9

Type the name of the text or sentence to view it.

Type: 'texts()' or 'sents()' to list the materials.

text1: Moby Dick by Herman Melville 1851

text2: Sense and Sensibility by Jane Austen 1811

text3: The Book of Genesis

text4: Inaugural Address Corpus

text5: Chat Corpus

text6: Monty Python and the Holy Grail

text7: Wall Street Journal

text8: Personals Corpus

text9: The Man Who Was Thursday by G . K . Chesterton 1908

>>> text2.similar('lady')

man house day moment world person brother subject family wife time

woman year case men week colonel park manner sister

>>> text1.similar('lady')

body whale one ship crew pequod world fish english whales deep boat

seas side man harpooneers voyage ribs boats fire

The results for text1 and text2 are quite different, which suggests that the two books were not written by the same author.

5.2 Tagged Corpora

Representing Tagged Tokens

In NLTK, a tagged token is represented as a tuple consisting of the token and its tag. It can be created from the standard string representation with str2tuple():

>>> tagged_token = nltk.tag.str2tuple('fly/NN')

>>> tagged_token

('fly', 'NN')

>>> tagged_token[1]

'NN'

>>> tagged_token[0]

'fly'
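To convert a whole tagged string, we can split it on whitespace and apply str2tuple() to each token. A small sketch, using a shortened tagged sentence in Brown Corpus style:

>>> sent = 'The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN other/AP topics/NNS ./.'

>>> [nltk.tag.str2tuple(t) for t in sent.split()]

[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ('of', 'IN'), ('other', 'AP'), ('topics', 'NNS'), ('.', '.')]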

Reading Tagged Corpora

Whenever a corpus contains tagged text, NLTK's corpus interface provides a tagged_words() method:

>>> nltk.corpus.brown.tagged_words()

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

>>> nltk.corpus.brown.tagged_words(tagset='universal')

[('The', 'DET'), ('Fulton', 'NOUN'), ...]
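Other tagged corpora expose the same interface. For example, assuming the Penn Treebank sample that ships with the NLTK data is installed (output abridged):

>>> nltk.corpus.treebank.tagged_words()

[('Pierre', 'NNP'), ('Vinken', 'NNP'), ...]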

A Simplified Part-of-Speech Tagset

Tag  Meaning             Examples
ADJ  adjective           new, good, high, special, big, local
ADV  adverb              really, already, still, early, now
CNJ  conjunction         and, or, but, if, while, although
DET  determiner          the, a, some, most, every, no
EX   existential         there, there's
FW   foreign word        dolce, ersatz, esprit, quo, maitre
MOD  modal verb          will, can, would, may, must, should
N    noun                year, home, costs, time, education
NP   proper noun         Alison, Africa, April, Washington
NUM  number              twenty-four, fourth, 1991, 14:24
PRO  pronoun             he, their, her, its, my, I, us
P    preposition         on, of, at, with, by, into, under
TO   the word to         to
UH   interjection        ah, bang, ha, whee, hmpf, oops
V    verb                is, has, get, do, make, see, run
VD   past tense          said, took, told, made, asked
VG   present participle  making, going, playing, working
VN   past participle     given, taken, begun, sung
WH   wh determiner       who, which, when, what, where, how
.    punctuation         . , ; !

Let's look at how these tags are distributed in the news category of the Brown Corpus:

>>> from nltk.corpus import brown

>>> brown_news_tagged = brown.tagged_words(categories='news',tagset='universal') 

>>> tag_fd = nltk.FreqDist(tag for (word,tag) in brown_news_tagged)

>>> tag_fd.keys()

dict_keys(['DET', 'NOUN', 'ADJ', 'VERB', 'ADP', '.', 'ADV', 'CONJ', 'PRT', 'PRON', 'NUM', 'X'])

>>> tag_fd.plot()

[Figure 1: frequency plot of the universal POS tags produced by tag_fd.plot()]


We can also call nltk.app.concordance() to open NLTK's built-in graphical concordance tool and search for how a word is used. Note that it takes no arguments (the query is typed into the window), and here only the Brown corpus appears to be available for searching:

>>> nltk.app.concordance('fly/NN')

Traceback (most recent call last):

  File "", line 1, in

TypeError: app() takes 0 positional arguments but 1 was given

>>> nltk.app.concordance()

[Figure 2: the nltk.app.concordance() graphical search window]

Switching to a different corpus inside the tool raises an error:

[Figure 3: error raised after switching the corpus in the concordance tool]

The book's treatment of nouns, verbs, adjectives and adverbs relies on the old simplify_tags=True argument of tagged_words(); in current NLTK this has been replaced by tagset='universal', so the word-tag pairs returned no longer match the reference book. Since future work will not necessarily use NLTK's pre-tagged texts, I will not expand on those sections here.
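For reference, here is a minimal sketch of the kind of analysis those sections perform, restated with the universal tagset: which tags most often precede a noun? It continues with the brown_news_tagged data loaded above (output omitted):

>>> word_tag_pairs = nltk.bigrams(brown_news_tagged)

>>> noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']

>>> nltk.FreqDist(noun_preceders).most_common(5)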

5.3 Mapping Words to Properties Using Python Dictionaries

Define an empty dictionary, add four words by hand together with their parts of speech, and then index the values by key:

>>> import nltk

>>> pairs ={}

>>> pairs['genius']='N'

>>> pairs['monstrous']='ADJ'

>>> pairs['have']='V'

>>> pairs['carelessly']='ADV'

>>> pairs

{'genius': 'N', 'monstrous': 'ADJ', 'have': 'V', 'carelessly': 'ADV'}

Because a dictionary is a mapping rather than a sequence, its key-value pairs are not kept in any sorted order. To get at the keys we can convert the dictionary to a list, or sort them with sorted():

>>> list(pairs)

['genius', 'monstrous', 'have', 'carelessly']

>>> sorted(pairs)

['carelessly', 'genius', 'have', 'monstrous']


>>> for word in sorted(pairs):

...     print(word+':',pairs[word])

... 

carelessly: ADV

genius: N

have: V

monstrous: ADJ


We can also use the dictionary methods keys(), values(), and items() to access the keys, values, and key-value pairs as separate sequences:

>>> pairs.keys()

dict_keys(['genius', 'monstrous', 'have', 'carelessly'])

>>> pairs.values()

dict_values(['N', 'ADJ', 'V', 'ADV'])

>>> pairs.items()

dict_items([('genius', 'N'), ('monstrous', 'ADJ'), ('have', 'V'), ('carelessly', 'ADV')])

>>> for key, val in sorted(pairs.items()):

...     print(key+':',val)

... 

carelessly: ADV

genius: N

have: V

monstrous: ADJ

When a word has more than one part of speech, we can store a list as the value, holding all of its parts of speech. Note that simply reassigning the value overwrites it:

>>> pairs['sleep']='V'

>>> pairs

{'genius': 'N', 'monstrous': 'ADJ', 'have': 'V', 'carelessly': 'ADV', 'sleep': 'V'}

>>> pairs['sleep']='N'

>>> pairs

{'genius': 'N', 'monstrous': 'ADJ', 'have': 'V', 'carelessly': 'ADV', 'sleep': 'N'}

>>> pairs['sleep']=['N','V']

>>> pairs

{'genius': 'N', 'monstrous': 'ADJ', 'have': 'V', 'carelessly': 'ADV', 'sleep': ['N', 'V']}

Dictionary keys must be of an immutable type, such as a string or a tuple; a mutable type such as a list (or a dictionary) cannot be used as a key. Starting from a fresh dictionary:

>>> pairs = {}

>>> pairs['good','nice']='ADJ'

>>> pairs

{('good', 'nice'): 'ADJ'}

>>> pos = {['ideas','blogs','adventures']:'N'}

Traceback (most recent call last):

  File "", line 1, in

TypeError: unhashable type: 'list'

Sometimes we try to access a word (key) that is not present in the dictionary, and the lookup raises a KeyError.
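For example, looking up a key that was never added raises a KeyError, while dict.get() provides a one-off fallback (a small illustration using the hypothetical key 'blog'):

>>> pairs['blog']

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

KeyError: 'blog'

>>> pairs.get('blog', 'UNKNOWN')

'UNKNOWN'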

Since Python 2.5 the standard library has provided a solution, the default dictionary: when a missing key is looked up, a default value of a preset type is created and stored in the dictionary. NLTK exposes it as nltk.defaultdict:

>>> frequency = nltk.defaultdict(int)

>>> frequency['colorless']=4

>>> frequency['ideas']

0

>>> pos = nltk.defaultdict(lambda:'N')

>>> pos['colorless']='ADJ'

>>> pos['apple']

'N'

>>> pos.items()

dict_items([('colorless', 'ADJ'), ('apple', 'N')])

Default dictionaries are useful in larger language-processing tasks. Many such tasks, including tagging, spend a lot of effort getting right the words that occur only once in a text; they tend to work better with a fixed vocabulary and a guarantee that no new words will appear. With a default dictionary we can preprocess a text so that low-frequency words are replaced with a special "out of vocabulary" token, UNK:

>>> alice = nltk.corpus.gutenberg.words('carroll-alice.txt')

>>> vocab = nltk.FreqDist(alice)

>>> v1000 = list(vocab)[:1000]

>>> mapping = nltk.defaultdict(lambda:'UNK')

>>> for v in v1000:

...     mapping[v]=v

... 

>>> alice2 = [mapping[v] for v in alice]

>>> alice2[:100]

['[', 'Alice', "'", 's', 'Adventures', 'in', 'Wonderland', 'by', 'Lewis', 'Carroll', '1865', ']', 'CHAPTER', 'I', '.', 'Down', 'the', 'Rabbit', '-', 'Hole', 'Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', ',', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', ',', "'", 'and', 'what', 'is', 'the', 'use', 'of', 'a', 'book', ",'", 'thought', 'Alice', "'", 'without', 'pictures', 'or', 'conversation', "?'", 'So', 'she', 'was', 'considering', 'in', 'her', 'own', 'mind', '(', 'as', 'well', 'as', 'she', 'could', ',']

>>> len(set(alice2))
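Note that in NLTK 3, FreqDist is a subclass of collections.Counter, so list(vocab)[:1000] is not guaranteed to give the 1000 most frequent words. If that is the intent, most_common() is the safer choice; a small variant sketch of the mapping step:

>>> v1000 = [word for (word, _) in vocab.most_common(1000)]

>>> mapping = nltk.defaultdict(lambda: 'UNK')

>>> for v in v1000:

...     mapping[v] = v

... 

>>> alice2 = [mapping[v] for v in alice]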

Incrementally Updating a Dictionary

We can use a dictionary to count occurrences. First we initialize an empty defaultdict(int), then process each part-of-speech tag in the text. If a tag has not been seen before, its count defaults to zero; each time we encounter a tag we increment its count:

>>> counts = nltk.defaultdict(int)

>>> for (word,tag) in brown.tagged_words(categories='news'):

...     counts[tag] += 1

... 

>>> counts['N']

0

>>> list(counts)

['AT', 'NP-TL', 'NN-TL', 'JJ-TL', 'VBD', 'NR', 'NN', 'IN', 'NP$', 'JJ', '``', "''", 'CS', 'DTI', 'NNS', '.', 'RBR', ',', 'WDT', 'HVD', 'VBZ', 'CC', 'IN-TL', 'BEDZ', 'VBN', 'NP', 'BEN', 'TO', 'VB', 'RB', 'DT', 'PPS', 'DOD', 'AP', 'BER', 'HV', 'DTS', 'VBG', 'PPO', 'QL', 'JJT', 'ABX', 'NN-HL', 'VBN-HL', 'WRB', 'CD', 'MD', 'BE', 'JJR', 'VBG-TL', 'BEZ', 'NN$-TL', 'HVZ', 'ABN', 'PN', 'PPSS', 'PP$', 'DO', 'NN$', 'NNS-HL', 'WPS', '*', 'EX', 'VB-HL', ':', '(', ')', 'NNS-TL', 'NPS', 'JJS', 'RP', '--', 'BED', 'OD', 'BEG', 'AT-HL', 'VBG-HL', 'AT-TL', 'PPL', 'DOZ', 'NP-HL', 'NR$', 'DOD*', 'BEDZ*', ',-HL', 'CC-TL', 'MD*', 'NNS$', 'PPSS+BER', "'", 'PPSS+BEM', 'CD-TL', 'RBT', '(-HL', ')-HL', 'MD-HL', 'VBZ-HL', 'IN-HL', 'JJ-HL', 'PPLS', 'CD-HL', 'WPO', 'JJS-TL', 'ABL', 'BER-HL', 'PPS+HVZ', 'VBD-HL', 'RP-HL', 'MD*-HL', 'AP-HL', 'CS-HL', 'DT$', 'HVN', 'FW-IN', 'FW-DT', 'VBN-TL', 'NR-TL', 'NNS$-TL', 'FW-NN', 'HVG', 'DTX', 'OD-TL', 'BEM', 'RB-HL', 'PPSS+MD', 'NPS-HL', 'NPS$', 'WP$', 'NN-TL-HL', 'CC-HL', 'PPS+BEZ', 'AP-TL', 'UH-TL', 'BEZ-HL', 'TO-HL', 'DO*', 'VBN-TL-HL', 'NNS-TL-HL', 'DT-HL', 'BE-HL', 'DOZ*', 'QLP', 'JJR-HL', 'PPSS+HVD', 'FW-IN+NN', 'PP$$', 'JJT-HL', 'NP-TL-HL', 'NPS-TL', 'MD+HV', 'NP$-TL', 'OD-HL', 'JJR-TL', 'VBD-TL', 'DT+BEZ', 'EX+BEZ', 'PPSS+HV', ':-HL', 'PPS+MD', 'UH', 'FW-CC', 'FW-NNS', 'BEDZ-HL', 'NN$-HL', '.-HL', 'HVD*', 'BEZ*', 'AP$', 'NP+BEZ', 'FW-AT-TL', 'VB-TL', 'RB-TL', 'MD-TL', 'PN+HVZ', 'FW-JJ-TL', 'FW-NN-TL', 'ABN-HL', 'PPS+BEZ-HL', 'NR-HL', 'HVD-HL', 'RB$', 'FW-AT-HL', 'DO-HL', 'PP$-TL', 'FW-IN-TL', 'WPS+BEZ', '*-HL', 'DTI-HL', 'PN-HL', 'CD$', 'BER*', 'NNS$-HL', 'PN$', 'BER-TL', 'TO-TL', 'FW-JJ', 'BED*', 'RB+BEZ', 'VB+PPO', 'PPSS-HL', 'HVZ*', 'FW-IN+NN-TL', 'FW-IN+AT-TL', 'NN-NC', 'JJ-NC', 'NR$-TL', 'FW-PP$-NC', 'FW-VB', 'FW-VB-NC', 'JJR-NC', 'NPS$-TL', 'QL-TL', 'FW-AT', 'FW-*', 'FW-CD', 'WQL', 'FW-WDT', 'WDT+BEZ', 'N']

>>> len(counts)

219

>>> from operator import itemgetter

>>> sorted(counts.items(),key=itemgetter(1),reverse=True)

[('NN', 13162), ('IN', 10616), ('AT', 8893), ('NP', 6866), (',', 5133), ('NNS', 5066), ('.', 4452), ('JJ', 4392), ('CC', 2664), ('VBD', 2524), ('NN-TL', 2486), ('VB', 2440), ('VBN', 2269), ('RB', 2166), ('CD', 2020), ('CS', 1509), ('VBG', 1398), ('TO', 1237), ('PPS', 1056), ('PP$', 1051), ('MD', 1031), ('AP', 923), ('NP-TL', 741), ('``', 732), ('BEZ', 730), ('BEDZ', 716), ("''", 702), ('JJ-TL', 689), ('PPSS', 602), ('DT', 589), ('BE', 525), ('VBZ', 519), ('NR', 495), ('RP', 482), ('QL', 468), ('PPO', 412), ('WPS', 395), ('NNS-TL', 344), ('WDT', 343), ('BER', 328), ('WRB', 328), ('OD', 309), ('HVZ', 301), ('--', 300), ('NP$', 279), ('HV', 265), ('HVD', 262), ('*', 256), ('BED', 252), ('NPS', 215), ('BEN', 212), ('NN$', 210), ('DTI', 205), ('NP-HL', 186), ('ABN', 183), ('NN-HL', 171), ('IN-TL', 164), ('EX', 161), (')', 151), ('(', 148), ('JJR', 145), (':', 137), ('DTS', 136), ('JJT', 100), ('CD-TL', 96), ('NNS-HL', 92), ('PN', 89), ('RBR', 88), ('VBN-TL', 87), ('ABX', 73), ('NN$-TL', 69), ('IN-HL', 65), ('DOD', 64), ('DO', 63), ('BEG', 57), (',-HL', 55), ('VBN-HL', 53), ('AT-TL', 50), ('NNS$', 50), ('CD-HL', 50),  ('PPS+BEZ-HL', 1), ('HVD-HL', 1), ('RB$', 1), ('FW-AT-HL', 1), ('DO-HL', 1), ('PP$-TL', 1), ('FW-IN-TL', 1), ('*-HL', 1), ('PN-HL', 1), ('PN$', 1), ('BER-TL', 1), ('TO-TL', 1), ('BED*', 1), ('RB+BEZ', 1), ('VB+PPO', 1), ('PPSS-HL', 1), ('HVZ*', 1), ('FW-IN+NN-TL', 1), ('FW-IN+AT-TL', 1), ('JJ-NC', 1), ('NR$-TL', 1), ('FW-PP$-NC', 1), ('FW-VB', 1), ('FW-VB-NC', 1), ('JJR-NC', 1), ('NPS$-TL', 1), ('QL-TL', 1), ('FW-*', 1), ('FW-CD', 1), ('WQL', 1), ('FW-WDT', 1), ('WDT+BEZ', 1), ('N', 0)]

The first argument of sorted() is the items to sort: a list of tuples consisting of a POS tag and its frequency. The second argument specifies the sort key using itemgetter(1), i.e. the second element of each tuple. The third argument, reverse=True, asks for the items in reverse order, that is, in decreasing order of frequency.
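The same ranking can be obtained more directly with a FreqDist, whose most_common() method sorts by count for us; the top entries should match the sorted() output above. (The stray ('N', 0) entry there exists only because the earlier lookup counts['N'] created that key in the defaultdict.)

>>> tag_counts = nltk.FreqDist(tag for (word, tag) in brown.tagged_words(categories='news'))

>>> tag_counts.most_common(5)

[('NN', 13162), ('IN', 10616), ('AT', 8893), ('NP', 6866), (',', 5133)]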

>>> last_letters = nltk.defaultdict(list)

>>> words = nltk.corpus.words.words('en')

>>> for word in words:

...     key = word[-2:]

...     last_letters[key].append(word)

... 

>>> last_letters['lly']

[]

>>> len(last_letters['ly'])

11523


>>> anagrams = nltk.defaultdict(list)

>>> for word in words:

...     key = ''.join(sorted(word))

...     anagrams[key].append(word)

... 

>>> anagrams['aeilnrt']

['entrail', 'latrine', 'ratline', 'reliant', 'retinal', 'trenail']

Accumulating words into a dictionary of lists like this, indexed by some derived key, is such a common task that NLTK provides a more convenient way to create such a dictionary:

>>> anagrams2 = nltk.Index((''.join(sorted(w)), w) for w in words)

>>> anagrams2['aeilnrt']

['entrail', 'latrine', 'ratline', 'reliant', 'retinal', 'trenail']

In other words, nltk.Index is a defaultdict(list) with extra support for initialization from a sequence of (key, value) pairs, and nltk.FreqDist is essentially a defaultdict(int) with similar initialization support, plus sorting and plotting methods.
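As a quick check of that claim, the word-ending counts computed above with a defaultdict can be reproduced with a FreqDist built from the same words list:

>>> fd = nltk.FreqDist(w[-2:] for w in words)

>>> fd['ly']

11523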


Dictionaries support efficient lookup of the value for any given key, but going the other way, from a value to its key, requires a search. If we need to do this kind of reverse lookup many times, it helps to build a dictionary that maps values to keys. As long as no two keys share the same value, we can simply take all the key-value pairs and create a new dictionary of value-key pairs:

>>> pos = {'colorless':'ADJ','ideas':'N','sleep':'V','furiously':'ADV'}

>>> pos2 = dict((value,key) for (key,value) in pos.items())

>>> pos2['N']

'ideas'


>>> pos.update({'cats':'N','search':'V','peaceful':'ADV','old':'ADJ'})

>>> pos2 = nltk.defaultdict(list)

>>> for key,value in pos.items():

...     pos2[value].append(key)

... 

>>> pos2['ADV']

['furiously', 'peaceful']

The update() method added several more words to pos, creating a situation where several keys share the same value. The simple inversion used earlier no longer works, since a dictionary of value-key pairs keeps only one key per value; instead, we use append() to accumulate, for each part of speech, a list of all the words with that part of speech.
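The same inverted index can also be built in one step with nltk.Index, which behaves like the append() loop above (a small sketch reusing pos; pos3 is just a new name for the result):

>>> pos3 = nltk.Index((value, key) for (key, value) in pos.items())

>>> pos3['ADV']

['furiously', 'peaceful']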

Summary of Python dictionary methods:

d1.update(d2): add all the items from d2 to d1

defaultdict(int): a dictionary whose default value is 0
