When tokenizing annotated text (text that contains markup tags such as <emotion>), the nltk library's WordPunctTokenizer and WhitespaceTokenizer behave differently. For example:
from nltk.tokenize import WordPunctTokenizer, WhitespaceTokenizer

# annotated text: the word "scare" is wrapped in <emotion> tags
txt = 'red foxes <emotion> scare </emotion> me.'

# splits on word/punctuation boundaries, breaking each tag into pieces
token = WordPunctTokenizer().tokenize(txt)
print(token)

# splits on whitespace only, keeping each tag as a single token
token1 = WhitespaceTokenizer().tokenize(txt)
print(token1)
The outputs are, respectively:
['red', 'foxes', '<', 'emotion', '>', 'scare', '</', 'emotion', '>', 'me', '.']
['red', 'foxes', '<emotion>', 'scare', '</emotion>', 'me.']
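As the outputs show, WordPunctTokenizer (which matches the regexp \w+|[^\w\s]+) breaks each tag into '<', 'emotion', '>' pieces, while WhitespaceTokenizer keeps the tags whole but leaves the period attached to 'me.'. If you want the tags preserved as single tokens and ordinary punctuation split off, a custom RegexpTokenizer pattern can combine both behaviors. The pattern below is a minimal illustrative sketch, not part of the original example:

from nltk.tokenize import RegexpTokenizer

# illustrative pattern (an assumption, not from the original example):
#   </?\w+>   matches a whole tag such as <emotion> or </emotion>
#   \w+       matches a word
#   [^\w\s]+  matches a run of punctuation
tag_tokenizer = RegexpTokenizer(r'</?\w+>|\w+|[^\w\s]+')
print(tag_tokenizer.tokenize('red foxes <emotion> scare </emotion> me.'))
# ['red', 'foxes', '<emotion>', 'scare', '</emotion>', 'me', '.']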