Practical NLP Toolkits (3) NLTK: English tokenization, the Text object, stopwords, POS tagging, named entity recognition, and a data-cleaning example

NLTK is a very practical text-processing toolkit, mainly geared toward English data, and it has a long history.

import nltk
# nltk.download()              # open the interactive downloader (run once)
# nltk.download('punkt')       # Punkt models used by word_tokenize
# nltk.download('stopwords')   # stopword lists used later in this article
from nltk.tokenize import word_tokenize
from nltk.text import Text
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.chunk import RegexpParser
from nltk import ne_chunk

1 Tokenization

str1 = "Today's weather is good, very windy and sunny, we have no classes in the afternoon, we have to play basketball tomorrow."
tokens = word_tokenize(str1)
tokens = [word.lower() for word in tokens]
print(tokens[:5])  # ['today', "'s", 'weather', 'is', 'good']
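
Note that word_tokenize keeps punctuation tokens (',' and '.') and splits the contraction "Today's" into "today" and "'s". If only alphabetic word tokens are wanted, a minimal sketch using str.isalpha() (this also drops the "'s" fragment):

words_only = [w for w in tokens if w.isalpha()]  # keep purely alphabetic tokens
print(words_only[:5])  # ['today', 'weather', 'is', 'good', 'very']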

2 The Text object

# help(nltk.text)  # view the module's documentation
# Wrap the tokens in a Text object to make the follow-up operations easier
t = Text(tokens)
print(t.count('good'))  # 1
print(t.index('good'))  # 4
t.plot(8)  # plot the 8 most frequent tokens (requires matplotlib)
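
If you only need the counts behind the chart (or matplotlib is unavailable), the same frequencies can be read from a FreqDist; a short sketch assuming the tokens list built above:

from nltk import FreqDist

fdist = FreqDist(tokens)       # the same counts that t.plot() visualizes
print(fdist.most_common(8))    # (token, count) pairs, most frequent first
# t.vocab() returns an equivalent FreqDist built from the Text object

Text also provides t.concordance('good'), which prints every occurrence of a word together with its surrounding context.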
