Splitting English text into sentences with NLTK, without abbreviation periods causing spurious breaks

For an English corpus, when we want to extract sentences we can split the text either with regular expressions or with NLTK. For example, with NLTK:

from nltk.tokenize import sent_tokenize

document = ''  # the English text to split
sentences = sent_tokenize(document)
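For comparison, the regex route mentioned above can be sketched as a naive split on sentence-final punctuation (a minimal sketch; it suffers from exactly the abbreviation problem described next):

```python
import re

text = 'First sentence. Second sentence? Third!'
# Naive rule: split at whitespace that follows ".", "?" or "!",
# keeping the punctuation attached to the preceding sentence.
sentences = re.split(r'(?<=[.!?])\s+', text)
# ['First sentence.', 'Second sentence?', 'Third!']
```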

NLTK splits on punctuation such as ".", "?", and "!". However, when a sentence contains an abbreviation, the split can go wrong:

sent_tokenize('fight among communists and anarchists (i.e. at a series of events named May Days).')

Output:
['fight among communists and anarchists (i.e.',
 'at a series of events named May Days).']

The sentence gets split after "i.e.". To avoid this, we can use nltk.tokenize.punkt and supply a custom abbreviation list:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

punkt_param = PunktParameters()
abbreviation = ['i.e']  # note: no trailing "."
punkt_param.abbrev_types = set(abbreviation)
tokenizer = PunktSentenceTokenizer(punkt_param)
tokenizer.tokenize('fight among communists and anarchists (i.e. at a series of events named May Days).')

Output:
['fight among communists and anarchists (i.e. at a series of events named May Days).']

You can add multiple abbreviations. Note that each abbreviation should be added without the final ".": for example, i.e. is written as i.e
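Since Punkt matches abbreviation types in lowercase and without the trailing period, a small helper (hypothetical, shown here only for illustration) can normalize a human-readable abbreviation list before assigning it to abbrev_types:

```python
def normalize_abbreviations(abbrevs):
    # Punkt's abbrev_types expects entries lowercased and without the final "."
    return set(a.lower().rstrip('.') for a in abbrevs)

normalize_abbreviations(['i.e.', 'e.g.', 'Dr.', 'vs.'])
# {'i.e', 'e.g', 'dr', 'vs'}
```

The resulting set can then be passed directly: punkt_param.abbrev_types = normalize_abbreviations([...]).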


References

https://stackoverflow.com/questions/34805790/how-to-avoid-nltks-sentence-tokenizer-splitting-on-abbreviations

http://www.nltk.org/api/nltk.tokenize.html
