Tokenizing text with NLTK

When tokenizing annotated text, the WordPunctTokenizer and WhitespaceTokenizer classes from the nltk library are useful. For example:

from nltk.tokenize import WordPunctTokenizer, WhitespaceTokenizer

# Annotated text: the <emotion>...</emotion> tags mark the labeled span
txt = 'red foxes <emotion>scare</emotion> me.'

# WordPunctTokenizer splits on every run of punctuation, so the tags are broken apart
tokens = WordPunctTokenizer().tokenize(txt)
print(tokens)

# WhitespaceTokenizer splits on whitespace only, leaving each tagged chunk intact
tokens_ws = WhitespaceTokenizer().tokenize(txt)
print(tokens_ws)

The two outputs are:

['red', 'foxes', '<', 'emotion', '>', 'scare', '</', 'emotion', '>', 'me', '.']
['red', 'foxes', '<emotion>scare</emotion>', 'me.']
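Neither tokenizer above treats an annotation tag as a single token while still splitting words from punctuation. If that is what you need, one option is NLTK's RegexpTokenizer with a pattern that matches whole tags first. This is a sketch, not part of the original post; the pattern shown is an assumption about what the tags look like:

```python
from nltk.tokenize import RegexpTokenizer

# Hypothetical pattern: match a whole <tag> or </tag> first, then words,
# then single punctuation characters.
tokenizer = RegexpTokenizer(r'</?\w+>|\w+|[^\w\s]')

txt = 'red foxes <emotion>scare</emotion> me.'
print(tokenizer.tokenize(txt))
# ['red', 'foxes', '<emotion>', 'scare', '</emotion>', 'me', '.']
```

Because alternatives in a regex are tried left to right, putting the tag pattern first keeps `<emotion>` and `</emotion>` whole while `me.` is still split into `me` and `.`.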
