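nltk.word_tokenize(text): tokenizes the given sentence and returns a list of tokens (words and punctuation).
nltk.pos_tag(words): tags each token in the given list with its part of speech and returns a list of (word, tag) tuples.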
import nltk
words = nltk.word_tokenize('And now for something completely different')
print(words)
word_tag = nltk.pos_tag(words)
print(word_tag)
['And', 'now', 'for', 'something', 'completely', 'different']
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
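Because pos_tag returns an ordinary list of (word, tag) tuples, the result can be processed with plain Python. The following is a minimal sketch that reuses the output printed above as a literal, so it runs without NLTK's tagger models; the helper name group_by_tag is made up for illustration and is not part of NLTK.

```python
from collections import defaultdict

# Output of nltk.pos_tag, copied from the example above.
word_tag = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
            ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]


def group_by_tag(tagged):
    """Map each tag to the words carrying it, in order of appearance.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for word, tag in tagged:
        groups[tag].append(word)
    return dict(groups)


by_tag = group_by_tag(word_tag)
print(by_tag['RB'])  # ['now', 'completely']
```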
str2tuple(s, sep='/')
Given the string representation of a tagged token, return the
corresponding tuple representation. The rightmost occurrence of
*sep* in *s* will be used to divide *s* into a word string and
a tag string. If *sep* does not occur in *s*, return (s, None).
>>> from nltk.tag.util import str2tuple
>>> str2tuple('fly/NN')
('fly', 'NN')
:type s: str
:param s: The string representation of a tagged token.
:type sep: str
:param sep: The separator string used to separate word strings
from tags.
The tag is converted to uppercase.
The default separator is sep='/'.
t = nltk.str2tuple('fly~abc',sep='~')
t
Out[26]: ('fly', 'ABC')
t = nltk.str2tuple('fly/abc')
t
Out[28]: ('fly', 'ABC')
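The behaviour described in the docstring (split at the rightmost separator, uppercase the tag, return None as the tag when the separator is missing) can be sketched in a few lines of plain Python. The function split_tagged_token below is an illustrative stand-in written for this note, not NLTK's own source, and the sample sentence string is made up.

```python
def split_tagged_token(s, sep='/'):
    """Illustrative stand-in for str2tuple (an assumption, not NLTK's code).

    Splits at the rightmost occurrence of sep and uppercases the tag.
    Returns (s, None) when sep does not occur in s.
    """
    word, found, tag = s.rpartition(sep)
    if not found:
        return (s, None)
    return (word, tag.upper())


print(split_tagged_token('fly~abc', sep='~'))  # ('fly', 'ABC')
print(split_tagged_token('and/or/cc'))         # ('and/or', 'CC')
print(split_tagged_token('fly'))               # ('fly', None)

# Parsing a whole hand-written tagged sentence, one token at a time.
sent = 'The/at grand/jj jury/nn'
tagged = [split_tagged_token(tok) for tok in sent.split()]
print(tagged)  # [('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN')]
```

With NLTK installed, the same list comprehension should work with nltk.str2tuple in place of the stand-in.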
from nltk.corpus import brown
words_tag = brown.tagged_words(categories='news')
print(words_tag[:10])
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]
Simplified tags: the old simplify_tags parameter was replaced by the tagset parameter in NLTK 3.
words_tag = brown.tagged_words(categories='news',tagset = 'universal')
print(words_tag[:10])
[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'), ('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), ('of', 'ADP')]
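A common next step with the simplified tags is counting how often each one occurs. The following is a rough sketch using collections.Counter on the ten pairs printed above, copied in as a literal so that the Brown corpus does not need to be downloaded; on the full corpus the same counting would apply to the whole words_tag sequence.

```python
from collections import Counter

# First ten (word, tag) pairs from the universal-tagset output above.
words_tag = [('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'),
             ('Grand', 'ADJ'), ('Jury', 'NOUN'), ('said', 'VERB'),
             ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'),
             ('of', 'ADP')]

# Count tags only, ignoring the words themselves.
tag_counts = Counter(tag for _, tag in words_tag)
print(tag_counts.most_common(2))  # [('NOUN', 5), ('DET', 2)]
```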
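brown can be regarded as an instance of CategorizedTaggedCorpusReader.
CategorizedTaggedCorpusReader::tagged_words(fileids, categories): takes file identifiers or category names as arguments and returns the words of those texts as a list of (word, tag) tuples.
CategorizedTaggedCorpusReader::tagged_sents(fileids, categories): takes file identifiers or category names as arguments and returns the sentences of those texts as a list, where each sentence is itself a list of (word, tag) tuples.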
tagged_sents = brown.tagged_sents(categories='news')
print(tagged_sents)
[[('The', 'AT'), ... ('.', '.')],
[('The', 'AT'), ...('jury', 'NN').. ],
...]
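The nested shape returned by tagged_sents (a list of sentences, each a list of (word, tag) tuples) is easy to flatten or strip of tags. The following is a small sketch on a hand-made sample; the two sentences and their tags are invented for illustration and are not actual Brown corpus output.

```python
# Hand-made sample in the same shape as brown.tagged_sents(...);
# the sentences and tags here are illustrative assumptions.
tagged_sents = [
    [('The', 'AT'), ('jury', 'NN'), ('said', 'VBD'), ('.', '.')],
    [('It', 'PPS'), ('recommended', 'VBD'), ('changes', 'NNS'), ('.', '.')],
]

# Flatten sentences into one list of (word, tag) pairs,
# which is conceptually the shape tagged_words() returns.
flat = [pair for sent in tagged_sents for pair in sent]
print(len(flat))  # 8

# Drop the tags to recover plain tokenized sentences.
plain_sents = [[word for word, _ in sent] for sent in tagged_sents]
print(plain_sents[0])  # ['The', 'jury', 'said', '.']
```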