Common Operations
- Part-Of-Speech Tagging and POS Tagger
POS tagging labels each word in a text with its part of speech, that is, its grammatical role. In NLTK it is used as follows:
>>> import nltk
>>> text = nltk.word_tokenize("Dive into NLTK: Part-of-speech tagging and POS Tagger")
>>> text
['Dive', 'into', 'NLTK', ':', 'Part-of-speech', 'tagging', 'and', 'POS', 'Tagger']
>>> nltk.pos_tag(text)
[('Dive', 'JJ'), ('into', 'IN'), ('NLTK', 'NNP'), (':', ':'), ('Part-of-speech', 'JJ'), ('tagging', 'NN'), ('and', 'CC'), ('POS', 'NNP'), ('Tagger', 'NNP')]
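Note that the text is first tokenized into words, and POS tagging is applied only afterwards. NLTK also provides documentation for every tag, which can be printed as follows:
>>> nltk.help.upenn_tagset('JJ')
>>> nltk.help.upenn_tagset('IN')
>>> nltk.help.upenn_tagset('NNP')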
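The result of nltk.pos_tag is a plain list of (word, tag) tuples, so ordinary Python is enough to post-process it. As a minimal sketch, the snippet below groups tagged tokens by tag. The tagged list is copied from the pos_tag output above rather than recomputed, and the helper name group_by_tag is made up for this illustration.

```python
from collections import defaultdict

# Copied from the pos_tag output above (not recomputed here).
tagged = [('Dive', 'JJ'), ('into', 'IN'), ('NLTK', 'NNP'), (':', ':'),
          ('Part-of-speech', 'JJ'), ('tagging', 'NN'), ('and', 'CC'),
          ('POS', 'NNP'), ('Tagger', 'NNP')]

def group_by_tag(tagged_tokens):
    """Hypothetical helper: map each tag to the words that received it."""
    groups = defaultdict(list)
    for word, tag in tagged_tokens:
        groups[tag].append(word)
    return dict(groups)

groups = group_by_tag(tagged)
print(groups['NNP'])  # ['NLTK', 'POS', 'Tagger']
```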
In addition, NLTK can tag several tokenized sentences in one call. Older releases exposed this as nltk.batch_pos_tag; in NLTK 3 the function is named nltk.pos_tag_sents:
>>> nltk.pos_tag_sents([['this', 'is', 'batch', 'tag', 'test'], ['nltk', 'is', 'text', 'analysis', 'tool']])
[[('this', 'DT'), ('is', 'VBZ'), ('batch', 'NN'), ('tag', 'NN'), ('test', 'NN')], [('nltk', 'NN'), ('is', 'VBZ'), ('text', 'JJ'), ('analysis', 'NN'), ('tool', 'NN')]]
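NLTK also ships pre-trained POS tagging models under nltk_data/taggers. In the NLTK version described here, the default tagging model is maxent_treebank_pos_tagger (newer releases default to an averaged perceptron tagger); the relevant code is in nltk-master/nltk/tag/__init__.py. Beyond the default, we can train other models such as CRF, HMM, Brill, and TnT, and NLTK provides interfaces to the Stanford POS tagger, the HunPos POS tagger, and the SENNA POS tagger. The code for training a model is as follows:
Split the data into training and test sets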
>>> from nltk.corpus import treebank
>>> len(treebank.tagged_sents())
3914
>>> train_data = treebank.tagged_sents()[:3000]
>>> test_data = treebank.tagged_sents()[3000:]
Train and evaluate the model
>>> from nltk.tag import tnt
>>> tnt_pos_tagger = tnt.TnT()
>>> tnt_pos_tagger.train(train_data)
>>> tnt_pos_tagger.evaluate(test_data)
0.8755881718109216
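The number returned by evaluate is token-level accuracy: the share of test tokens whose predicted tag matches the gold tag. To make that concrete without needing the treebank corpus, here is a rough sketch using a most-frequent-tag baseline on a tiny hand-made dataset. This is not the TnT algorithm; the class MostFrequentTagTagger, the accuracy function, and the toy sentences are all invented for illustration.

```python
from collections import Counter, defaultdict

class MostFrequentTagTagger:
    """Toy baseline (not TnT): tag each word with its most frequent training tag."""

    def __init__(self, unknown_tag='Unk'):
        self.unknown_tag = unknown_tag
        self.best_tag = {}

    def train(self, tagged_sents):
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        self.best_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag(self, words):
        return [(w, self.best_tag.get(w, self.unknown_tag)) for w in words]

def accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [w for w, _ in sent]
        for (_, predicted), (_, gold) in zip(tagger.tag(words), sent):
            correct += predicted == gold
            total += 1
    return correct / total if total else 0.0

# Made-up data, standing in for treebank.tagged_sents().
train_data = [[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN')],
              [('nltk', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('tool', 'NN')]]
test_data = [[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('tagger', 'NN')]]

tagger = MostFrequentTagTagger()
tagger.train(train_data)
print(accuracy(tagger, test_data))  # 0.75: 'tagger' is unseen, so it gets 'Unk'
```

Words never seen in training are the main source of errors in this toy setup, which is also why unknown words show up with the tag 'Unk' when a TnT tagger is applied to out-of-vocabulary text.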
Save the model
>>> import pickle
>>> f = open('tnt_treebank_pos_tagger.pickle', 'wb')
>>> pickle.dump(tnt_pos_tagger, f)
>>> f.close()
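A pickled tagger has to be read back with pickle.load before it can be used in a new session, and both the dump and the load should open the file in binary mode. The following is a small round-trip sketch. To keep it runnable without a trained tagger, a plain dictionary stands in for the model and the file is written to a temporary directory; both are assumptions of this example.

```python
import os
import pickle
import tempfile

# Stand-in for a trained tagger; any picklable object works the same way.
model = {'this': 'DT', 'is': 'VBZ', 'a': 'DT'}

path = os.path.join(tempfile.mkdtemp(), 'tnt_treebank_pos_tagger.pickle')

# Write in binary mode ('wb'); the context manager closes the file.
with open(path, 'wb') as f:
    pickle.dump(model, f)

# Read it back in binary mode ('rb').
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model)  # True
```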
Apply the model
>>> tnt_pos_tagger.tag(nltk.word_tokenize("this is a tnt treebank tnt tagger"))
[('this', u'DT'), ('is', u'VBZ'), ('a', u'DT'), ('tnt', 'Unk'), ('treebank', 'Unk'), ('tnt', 'Unk'), ('tagger', 'Unk')]
- Stemming
As I understand it, stemming extracts the stem of a word, while lemmatization reduces a word to its base (dictionary) form.
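To illustrate the difference, here is a deliberately crude sketch: a stemmer that just chops suffixes by rule (and may return a non-word), next to a lemmatizer that looks words up in a dictionary. The suffix list, the tiny lookup table, and the function names are all invented for this example; the real Porter algorithm and the WordNet lemmatizer are considerably more elaborate.

```python
# Invented suffix rules, checked in order; real stemmers use many more.
SUFFIX_RULES = [('ies', 'i'), ('ing', ''), ('ly', ''), ('ed', ''), ('s', '')]

# Invented mini dictionary mapping inflected forms to base forms.
LEMMA_TABLE = {'dogs': 'dog', 'churches': 'church', 'abaci': 'abacus'}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:len(word) - len(suffix)] + replacement
    return word

def toy_lemmatize(word):
    """Return the dictionary form if known, else the word unchanged."""
    return LEMMA_TABLE.get(word, word)

print(toy_stem('saying'))         # say
print(toy_stem('churches'))       # churche (a stem need not be a real word)
print(toy_lemmatize('churches'))  # church
print(toy_lemmatize('abaci'))     # abacus (no suffix rule would find this)
```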
2.1 Porter Stemmer
>>> from nltk.stem.porter import PorterStemmer
>>> porter_stemmer = PorterStemmer()
>>> porter_stemmer.stem('maximum')
u'maximum'
>>> porter_stemmer.stem('presumably')
u'presum'
>>> porter_stemmer.stem('multiply')
u'multipli'
>>> porter_stemmer.stem('provision')
u'provis'
>>> porter_stemmer.stem('owed')
u'owe'
>>> porter_stemmer.stem('ear')
u'ear'
>>> porter_stemmer.stem('saying')
u'say'
>>> porter_stemmer.stem('crying')
u'cri'
>>> porter_stemmer.stem('string')
u'string'
>>> porter_stemmer.stem('meant')
u'meant'
>>> porter_stemmer.stem('cement')
u'cement'
2.2 Lancaster Stemmer
>>> from nltk.stem.lancaster import LancasterStemmer
>>> lancaster_stemmer = LancasterStemmer()
>>> lancaster_stemmer.stem('maximum')
'maxim'
>>> lancaster_stemmer.stem('presumably')
'presum'
>>> lancaster_stemmer.stem('multiply')
'multiply'
>>> lancaster_stemmer.stem('provision')
u'provid'
>>> lancaster_stemmer.stem('owed')
'ow'
>>> lancaster_stemmer.stem('ear')
'ear'
>>> lancaster_stemmer.stem('saying')
'say'
>>> lancaster_stemmer.stem('crying')
'cry'
>>> lancaster_stemmer.stem('string')
'string'
>>> lancaster_stemmer.stem('meant')
'meant'
>>> lancaster_stemmer.stem('cement')
'cem'
2.3 Snowball Stemmer
>>> from nltk.stem import SnowballStemmer
>>> snowball_stemmer = SnowballStemmer("english")
>>> snowball_stemmer.stem('maximum')
u'maximum'
>>> snowball_stemmer.stem('presumably')
u'presum'
>>> snowball_stemmer.stem('multiply')
u'multipli'
>>> snowball_stemmer.stem('provision')
u'provis'
>>> snowball_stemmer.stem('owed')
u'owe'
>>> snowball_stemmer.stem('ear')
u'ear'
>>> snowball_stemmer.stem('saying')
u'say'
>>> snowball_stemmer.stem('crying')
u'cri'
>>> snowball_stemmer.stem('string')
u'string'
>>> snowball_stemmer.stem('meant')
u'meant'
>>> snowball_stemmer.stem('cement')
u'cement'
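Putting the three sessions side by side makes the differences easier to see. The sketch below does not call NLTK; it only tabulates the outputs recorded in the sessions above and lists the words on which two stemmers disagree. The names WORDS, RECORDED, and disagreements are made up for this example.

```python
WORDS = ['maximum', 'presumably', 'multiply', 'provision', 'owed', 'ear',
         'saying', 'crying', 'string', 'meant', 'cement']

# Outputs copied from the sessions above, in the same order as WORDS.
RECORDED = {
    'porter':    ['maximum', 'presum', 'multipli', 'provis', 'owe', 'ear',
                  'say', 'cri', 'string', 'meant', 'cement'],
    'lancaster': ['maxim', 'presum', 'multiply', 'provid', 'ow', 'ear',
                  'say', 'cry', 'string', 'meant', 'cem'],
    'snowball':  ['maximum', 'presum', 'multipli', 'provis', 'owe', 'ear',
                  'say', 'cri', 'string', 'meant', 'cement'],
}

def disagreements(name_a, name_b):
    """Return (word, stem_a, stem_b) for words the two stemmers treat differently."""
    return [(w, a, b)
            for w, a, b in zip(WORDS, RECORDED[name_a], RECORDED[name_b])
            if a != b]

for word, porter, lancaster in disagreements('porter', 'lancaster'):
    print(f'{word:10} porter={porter:10} lancaster={lancaster}')

print(disagreements('porter', 'snowball'))  # [] for these eleven words
```

On this small sample, Porter and Snowball agree on every word, while Lancaster is noticeably more aggressive (for example, cement becomes cem).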
- Lemmatization
Lemmatization in NLTK is implemented on top of WordNet.
>>> from nltk.stem import WordNetLemmatizer
>>> wordnet_lemmatizer = WordNetLemmatizer()
>>> wordnet_lemmatizer.lemmatize('dogs')
u'dog'
>>> wordnet_lemmatizer.lemmatize('churches')
u'church'
>>> wordnet_lemmatizer.lemmatize('aardwolves')
u'aardwolf'
>>> wordnet_lemmatizer.lemmatize('abaci')
u'abacus'
>>> wordnet_lemmatizer.lemmatize('hardrock')
'hardrock'
>>> wordnet_lemmatizer.lemmatize('are')
'are'
>>> wordnet_lemmatizer.lemmatize('is')
'is'
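The lemmatizer treats every word as a noun by default (pos='n'), which is why 'is' and 'are' come back unchanged. Passing a different pos argument, for example pos='v' for verbs, changes the result:
>>> wordnet_lemmatizer.lemmatize('is', pos='v')
u'be'
>>> wordnet_lemmatizer.lemmatize('are', pos='v')
u'be'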
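Since the right pos value depends on how a word is used, a common pattern is to derive it from the Penn Treebank tag produced by the POS tagger. The helper below is one possible sketch of that mapping. The function name penn_to_wordnet_pos is hypothetical; the single-letter codes 'n', 'v', 'a', and 'r' are the WordNet codes for noun, verb, adjective, and adverb, and falling back to 'n' mirrors the lemmatizer's default.

```python
def penn_to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to a WordNet pos code (hypothetical helper)."""
    if penn_tag.startswith('J'):
        return 'a'   # adjectives: JJ, JJR, JJS
    if penn_tag.startswith('V'):
        return 'v'   # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if penn_tag.startswith('RB'):
        return 'r'   # adverbs: RB, RBR, RBS
    return 'n'       # nouns and everything else

# Tagged tokens in the (word, tag) shape that nltk.pos_tag returns.
tagged = [('this', 'DT'), ('is', 'VBZ'), ('text', 'JJ'), ('analysis', 'NN')]
mapped = [(word, penn_to_wordnet_pos(tag)) for word, tag in tagged]
print(mapped)  # [('this', 'n'), ('is', 'v'), ('text', 'a'), ('analysis', 'n')]
```

The mapped code can then be passed along for each token, for instance wordnet_lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag)).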