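Many introductions to NLP mention the NLTK library, which led me to assume it was an indispensable tool. After working through it, my impression is that NLTK is of limited use in real projects. Much of its content approaches NLP problems from the semantic and syntactic side, which does not seem very reliable to me in practice. It also ships with few Chinese corpora, and many of the books and blog posts about NLTK are dated.
《NLTK基础教程--用NLTK和Python库构建机器学习应用》 (NLTK Essentials) is a first edition dated June 2017, yet most of its content is still dated. Nearly all of its material is in English, and the book contains many typesetting and textual errors.
《Python自然语言处理》 (Natural Language Processing with Python) by Steven Bird, Ewan Klein & Edward Loper, translated by 陈涛, 张旭, 崔杨 and 刘海平, gives a more complete introduction. Its code is extremely dated, but its coverage of the concepts is thorough.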
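The code for chapters 1, 2, 3, 4, 6, and 8 is collected below. It runs correctly on win10 with nltk3.2.4 and python3.5.3/python3.6.1. Be sure to download the nltk_data packages the code needs, and install any missing libraries as required. The installers pywin32-221.win-amd64-py3.6.exe/pywin32-221.win-amd64-py3.5.exe must be downloaded manually from [https://sourceforge.net/projects/pywin32/files/pywin32/Build%20221/].
Any data that has to be downloaded is linked or described in the code.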
# 《NLTK基础教程--用NLTK和Python库构建机器学习应用》 (NLTK Essentials) 01 Introduction to Natural Language Processing
# win10 nltk3.2.4 python3.5.3/python3.6.1
# filename:NLTKEssentials01.py # Introduction to Natural Language Processing
import nltk
#nltk.download() # a full download takes a long time and may need several attempts before it succeeds
print("Python and NLTK installed successfully")
'''Python and NLTK installed successfully'''
# 1.2 Getting started with Python
# 1.2.1 Lists
lst = [1, 2, 3, 4]
print(lst)
'''[1, 2, 3, 4]'''
# Concatenating a str with an int fails, so the element must be converted with str()
# print('First element: ' + lst[0])
# '''TypeError: must be str, not int'''
print('First element: ' + str(lst[0]))
print('last element: ' + str(lst[-1]))
# The stop index of a slice is exclusive, so the first three elements are lst[0:3]
print('first three elements: ' + str(lst[0:3]))
print('last three elements: ' + str(lst[-3:]))
'''
First element: 1
last element: 4
first three elements: [1, 2, 3]
last three elements: [2, 3, 4]
'''
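The slicing rules used above are easy to get wrong by one, so here is a small sketch of how `start`, `stop`, and `step` behave on the same list. The variable names are only illustrative and are not from the book.

```python
lst = [1, 2, 3, 4]

# A slice lst[start:stop] includes index start but excludes index stop.
first_three = lst[0:3]     # indexes 0, 1, 2
last_three = lst[-3:]      # from the third-to-last element to the end
every_other = lst[::2]     # the optional third field is the step
reversed_copy = lst[::-1]  # a negative step walks the list backwards

print(first_three)    # [1, 2, 3]
print(last_three)     # [2, 3, 4]
print(every_other)    # [1, 3]
print(reversed_copy)  # [4, 3, 2, 1]
```

Slices always return a new list, so `lst` itself is left unchanged.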
# 1.2.2 Helping yourself (dir and help)
print(dir(lst))
'''
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']
'''
print(' , '.join(dir(lst)))
'''
__add__ , __class__ , __contains__ , __delattr__ , __delitem__ , __dir__ , __doc__ , __eq__ , __format__ , __ge__ , __getattribute__ , __getitem__ , __gt__ , __hash__ , __iadd__ , __imul__ , __init__ , __init_subclass__ , __iter__ , __le__ , __len__ , __lt__ , __mul__ , __ne__ , __new__ , __reduce__ , __reduce_ex__ , __repr__ , __reversed__ , __rmul__ , __setattr__ , __setitem__ , __sizeof__ , __str__ , __subclasshook__ , append , clear , copy , count , extend , index , insert , pop , remove , reverse , sort
'''
help(lst.index)
'''
Help on built-in function index:
index(...) method of builtins.list instance
L.index(value, [start, [stop]]) -> integer -- return first index of value.
Raises ValueError if the value is not present.
'''
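The `dir()` output is dominated by double-underscore names, which makes the everyday methods hard to spot. A minimal sketch that filters them out follows; the name `public_methods` is only illustrative.

```python
lst = [1, 2, 3, 4]

# Keep only the names that do not start with an underscore.
public_methods = [name for name in dir(lst) if not name.startswith('_')]
print(public_methods)

# __doc__ holds the same text that help() displays, as a plain string.
print(lst.index.__doc__)
```

For a list this leaves the eleven public methods shown at the end of the `dir()` output, from `append` to `sort`.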
mystring = "Monty Python ! And the holy Grail ! \n"
print(mystring.split())
'''
['Monty', 'Python', '!', 'And', 'the', 'holy', 'Grail', '!']
'''
print(mystring.strip())
'''Monty Python ! And the holy Grail !'''
print(mystring.lstrip())
'''
Monty Python ! And the holy Grail !
'''
print(mystring.rstrip())
'''Monty Python ! And the holy Grail !'''
print(mystring.upper())
'''
MONTY PYTHON ! AND THE HOLY GRAIL !
'''
print(mystring.replace('!', ''))
'''
Monty Python And the holy Grail
'''
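# 1.2.3 Regular expressions
import re
# re.search returns a match object (truthy) if the pattern occurs anywhere in the string
if re.search('Python', mystring):
    print("We found python ")
else:
    print("No ")
'''We found python '''
# re.findall returns every non-overlapping match as a list
print(re.findall('!', mystring))
'''['!', '!']'''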
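The two calls above only match literal text. As a rough sketch of what patterns add, the example below uses a character class, a flag, and a substitution on the same sentence. It relies only on the standard `re` module, and the variable names are illustrative.

```python
import re

mystring = "Monty Python ! And the holy Grail ! \n"

# \w+ matches runs of letters, digits, and underscores, so '!' is skipped.
words = re.findall(r'\w+', mystring)
print(words)  # ['Monty', 'Python', 'And', 'the', 'holy', 'Grail']

# re.IGNORECASE lets the lowercase pattern 'python' match 'Python'.
match = re.search('python', mystring, flags=re.IGNORECASE)
print(match.group(0), match.start())  # Python 6

# re.sub replaces every match; here each run of whitespace becomes one space.
collapsed = re.sub(r'\s+', ' ', mystring).strip()
print(collapsed)  # Monty Python ! And the holy Grail !
```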
# 1.2.4 Dictionaries
# Count how often each whitespace-separated token occurs
word_freq = {}
for tok in mystring.split():
    if tok in word_freq:
        word_freq[tok] += 1
    else:
        word_freq[tok] = 1
print(word_freq)
'''{'Monty': 1, 'Python': 1, '!': 2, 'And': 1, 'the': 1, 'holy': 1, 'Grail': 1}'''
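The counting loop above can also be written with `collections.Counter` from the standard library. This is an alternative sketch, not the book's code, and it produces the same counts.

```python
from collections import Counter

mystring = "Monty Python ! And the holy Grail ! \n"

# Counter builds the token -> count mapping in one call.
word_freq = Counter(mystring.split())

# most_common(n) returns the n highest counts as (token, count) pairs.
print(word_freq.most_common(1))  # [('!', 2)]

# Missing keys return 0 instead of raising KeyError.
print(word_freq['Grail'], word_freq['missing'])  # 1 0
```

`Counter` is a `dict` subclass, so code that expects the plain dictionary keeps working.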
# 1.2.5 Writing functions
def wordfreq(mystring):
    '''
    Function to generate the frequency distribution of the given text
    '''
    print(mystring)
    word_freq = {}
    for tok in mystring.split():
        if tok in word_freq:
            word_freq[tok] += 1
        else:
            word_freq[tok] = 1
    print(word_freq)

def main():
    # Named text rather than str to avoid shadowing the built-in type
    text = "This is my first python program"
    wordfreq(text)

if __name__ == '__main__':
    main()
'''
This is my first python program
{'This': 1, 'is': 1, 'my': 1, 'first': 1, 'python': 1, 'program': 1}
'''
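The function above prints its result and treats `This` and `this`, or `program` and `program.`, as different tokens. One possible variation returns the dictionary and normalizes tokens first. The helper name `wordfreq_normalized` and the sample sentence are made up for this sketch.

```python
import string

def wordfreq_normalized(text):
    """Return token counts, ignoring case and surrounding punctuation."""
    freq = {}
    for tok in text.split():
        # Strip punctuation from both ends, then lowercase.
        tok = tok.strip(string.punctuation).lower()
        if not tok:  # the token was pure punctuation, e.g. '!'
            continue
        freq[tok] = freq.get(tok, 0) + 1
    return freq

counts = wordfreq_normalized("This is my first program. This program is short!")
print(counts)
```

Returning the dictionary instead of printing it lets the caller sort, filter, or test the counts.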
# 1.3 Diving into NLTK
from urllib import request
response = request.urlopen('http://python.org/')
html = response.read()
html = html.decode('utf-8')
print(len(html))
'''48141'''
#print(html)
# Splitting on whitespace alone leaves HTML markup inside the tokens
tokens = html.split()
print("Total no of tokens :" + str(len(tokens)))
'''Total no of tokens :2901'''
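Because the page is split as raw HTML, many of these tokens are markup fragments rather than words. One way to drop the tags before tokenizing is sketched below with the standard library's `html.parser`. The `TextExtractor` class and the `sample_html` string are hypothetical and stand in for the downloaded page so that the example runs offline.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)

# Hypothetical stand-in for the html string fetched from python.org
sample_html = ('<html><head><title>Welcome</title>'
               '<style>p {color: red}</style></head>'
               '<body><p class="intro">The official home of Python</p></body></html>')

parser = TextExtractor()
parser.feed(sample_html)
parser.close()

clean_tokens = ' '.join(parser.chunks).split()
print(clean_tokens)  # ['Welcome', 'The', 'official', 'home', 'of', 'Python']
```

Feeding the real `html` string to the same parser should work the same way, although the token count will depend on the page content at download time.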
print(tokens[0: 100])
'''
['', 'html>', '', '', '', '', 'class="no-js"', 'lang="en"', 'dir="ltr">', '', '', '', 'charset="utf-8">', '', 'http-equiv="X-UA-Compatible"', 'content="IE=edge">', '<link', 'rel="prefetch"', 'href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">', '', 'name="application-name"', 'content="Python.org">', '', 'name="msapplication-tooltip"', 'content="The', 'official', 'home', 'of', 'the', 'Python', 'Programming', 'Language">', '', 'name="apple-mobile-web-app-title"', 'content="Python.org">', '', 'name="apple-mobile-web-app-capable"', 'content="yes">', '', 'name="apple-mobile-web-app-status-bar-style"', 'content="black">', '', 'name="viewport"', 'content="width=device-width,', 'initial-scale=1.0">', '', 'name="HandheldFriendly"', 'content="True">', '', 'name="format-detection"', 'content="telephone=no">', '', 'http-equiv="cleartype"', 'content="on">', '', 'http-equiv="imagetoolbar"', 'content="false">', '