NTLK是著名的Python自然语言处理工具包,但是主要针对的是英文处理。NLTK配套有文档,有语料库,有书籍。
在NLTK的主页详细介绍了如何在Mac, Linux和Windows下安装NLTK:http://nltk.org/install.html ,最主要的还是要先装好依赖NumPy和PyYAML,其他没什么问题。不过可以直接下载Anaconda,省去了大部分包的安装,安装NLTK完毕,可以import nltk测试一下,如果没有问题,还有下载NLTK官方提供的相关语料
安装步骤:
1. 下载NLTK包
2.运行Python,并输入下面的指令(当然,第一条指令还是要导入NLTK包):
然后,Python Launcher会弹出下面这个界面,建议你选择安装所有的Packages,以免去日后一而再、再而三的进行安装,也为你的后续开发提供一个稳定的环境。某些包的Status显示“out of date”,你可以不必理会,它基本不影响你的使用与开发。
试着享受自己的成果吧!
测试以下布朗语料
>>>from nltk.corpus import brown
>>>brown,readme()
>>> brown.words()[0:10]
[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of']
>>> len(brown.words())
1161192
小试牛刀:
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!"
sentences = sent_tokenizer.tokenize(paragraph)
print sentences
['The first time I heard that song was in Hawaii on radio.', 'I was just a kid, and loved it very much!', 'What a fantastic song!']
2.分词
from nltk.tokenize import WordPunctTokenizer
sentence = "Are you old enough to remember Michael Jackson attending. the Grammys with Brooke Shields and Webster sat on his lap during the show?"
words = WordPunctTokenizer().tokenize(sentence.lower())
print words
['are', 'you', 'old', 'enough', 'to', 'remember', 'michael', 'jackson', 'attending', '.', 'the', 'grammys', 'with', 'brooke', 'shields', 'and', 'webster', 'sat', 'on', 'his', 'lap', 'during', 'the', 'show', '?']