[Notes] The difference between stemming and lemmatization in Python NLTK

Usage
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
# WordNetLemmatizer needs the WordNet corpus; run nltk.download('wordnet') once beforehand.
from nltk.corpus import wordnet

porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()
snowball_stemmer = SnowballStemmer('english')
wordnet_lemmatizer = WordNetLemmatizer()


words = [('bottles', wordnet.NOUN), ('vases', wordnet.NOUN), ('lit', wordnet.VERB), ('said', wordnet.VERB), ('earlier', wordnet.ADJ)]

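# Each entry is a (word, part-of-speech) pair; the trailing comments list the
# results for the five words in order.
for word, pos in words:
    # Stemmers strip suffixes by rule, so the result need not be a real word.
    print(porter_stemmer.stem(word))     # 'bottl', 'vase', 'lit', 'said', 'earlier'
    print(lancaster_stemmer.stem(word))  # 'bottl', 'vas', 'lit', 'said', 'ear'
    print(snowball_stemmer.stem(word))   # 'bottl', 'vase', 'lit', 'said', 'earlier'

    # Without pos, lemmatize() treats every word as a noun.
    print(wordnet_lemmatizer.lemmatize(word))           # 'bottle', 'vas', 'lit', 'said', 'earlier'
    # With the correct part of speech, irregular forms are resolved too.
    print(wordnet_lemmatizer.lemmatize(word, pos=pos))  # 'bottle', 'vas', 'light', 'say', 'early'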

Conclusion

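Judging from the example above alone, WordNetLemmatizer recovers the base form of English words better than the three stemmers, provided the part of speech is supplied. Without it, 'lit', 'said' and 'earlier' come back unchanged, and even with it 'vases' becomes 'vas' rather than 'vase'.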
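Since the result depends on passing a part of speech, here is a minimal sketch of one way to obtain it automatically. It assumes the tags come from nltk.pos_tag (Penn Treebank tags) and maps them to the single-letter codes that WordNetLemmatizer accepts. The helper names penn_to_wordnet and lemmatize_tokens are made up for this note and are not part of NLTK, and falling back to noun for unknown tags is just one possible choice.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code (hypothetical helper)."""
    if tag.startswith('J'):
        return 'a'  # adjective, same value as wordnet.ADJ
    if tag.startswith('V'):
        return 'v'  # verb, same value as wordnet.VERB
    if tag.startswith('R'):
        return 'r'  # adverb, same value as wordnet.ADV
    return 'n'      # assumption: treat everything else as a noun (wordnet.NOUN)


def lemmatize_tokens(tokens):
    """Tag the tokens, then lemmatize each one with its mapped POS."""
    # Imported here so the mapping above works even without NLTK data installed.
    from nltk import pos_tag
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))
            for token, tag in pos_tag(tokens)]
```

Calling lemmatize_tokens(['He', 'said', 'it', 'earlier']) requires the WordNet corpus and a tagger model to be downloaded via nltk.download first. The quality of the lemmas then depends on how well the tagger labels each token.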

[Note] Comparison of lemmatization tools
