English tokenization: cleaning text with NLTK

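1. Installation

import nltk
nltk.download('punkt')  # the default tokenizer model; other models can be used instead
# (recent NLTK releases may ask for 'punkt_tab' instead)

After the download you may see a message about unzipping; you can ignore it. Running the command again reports that the package is already up to date.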
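
The later sections also rely on the wordnet, averaged_perceptron_tagger and stopwords resources, as their code comments note. As a minimal sketch, all of them could be fetched in one pass. `REQUIRED_RESOURCES` and `download_resources` are hypothetical names introduced here, and the list assumes the resource names used in this article (recent NLTK releases may name some of them differently).

```python
# Resource names taken from the code comments in this article.
REQUIRED_RESOURCES = ("punkt", "wordnet", "averaged_perceptron_tagger", "stopwords")


def download_resources(names=REQUIRED_RESOURCES):
    """Download each named NLTK resource and report which calls succeeded."""
    import nltk  # imported lazily so the name list can be inspected without NLTK

    # nltk.download() leaves packages alone when they are already up to date.
    return {name: nltk.download(name, quiet=True) for name in names}
```

Calling `download_resources()` once before running the other sections should be enough.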

2. Tokenization

import nltk

sentence = "python is a widely used high-level programming language"
tokens = nltk.word_tokenize(sentence)  # uses the punkt model by default
print(tokens)

Result: ['python', 'is', 'a', 'widely', 'used', 'high-level', 'programming', 'language']

3. Stemming

from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()
print(porter_stemmer.stem('looked'))   # look
print(porter_stemmer.stem('looking'))  # look
print(porter_stemmer.stem('went'))     # went (irregular forms are left unchanged)

4. Lemmatization

from nltk.stem import WordNetLemmatizer  # requires the wordnet corpus

wordnet_lemmatizer = WordNetLemmatizer()
print(wordnet_lemmatizer.lemmatize('cats'))   # cat
print(wordnet_lemmatizer.lemmatize('boxes'))  # box
print(wordnet_lemmatizer.lemmatize('went'))   # went
# Specifying the part of speech makes lemmatization more accurate.
# lemmatize() assumes a noun by default, so 'went' is treated as a noun above.
print(wordnet_lemmatizer.lemmatize('are', pos='v'))   # be
print(wordnet_lemmatizer.lemmatize('went', pos='v'))  # go

5. Part-of-speech tagging

import nltk

words = nltk.word_tokenize('Python is a widely used programming language.')
print(nltk.pos_tag(words))  # requires averaged_perceptron_tagger

Result: [('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('widely', 'RB'), ('used', 'VBN'), ('programming', 'NN'), ('language', 'NN'), ('.', '.')]
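
Section 4 noted that passing the part of speech gives better lemmas, and pos_tag supplies that information, although as Penn Treebank tags rather than the single letters that lemmatize() expects. The sketch below shows one common way to bridge the two. `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names, and the mapping (anything that is not an adjective, verb or adverb falls back to noun) is a simplifying assumption.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'r' or 'n')."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # assumption: treat everything else as a noun


def lemmatize_tagged(tagged_words, lemmatize):
    """Apply lemmatize(word, pos) to each (word, tag) pair from pos_tag."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_words]


tagged = [("Python", "NNP"), ("is", "VBZ"), ("widely", "RB"), ("used", "VBN")]
print([penn_to_wordnet(tag) for _, tag in tagged])  # ['n', 'v', 'r', 'v']
```

With NLTK loaded, `lemmatize_tagged(nltk.pos_tag(words), WordNetLemmatizer().lemmatize)` would plug in the real tagger and lemmatizer.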

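6. Removing stop words

from nltk.corpus import stopwords  # requires the stopwords corpus

# `words` is the token list produced in section 5.
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
filtered_words = [word for word in words if word not in stop_words]
print('Original words:', words)
print('After removing stop words:', filtered_words)

Result:
Original words: ['Python', 'is', 'a', 'widely', 'used', 'programming', 'language', '.']
After removing stop words: ['Python', 'widely', 'used', 'programming', 'language', '.']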
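
Note that the '.' token survives the filter above, because the stop-word list contains words only. If punctuation is unwanted too, one possible approach is to drop tokens made up entirely of punctuation characters. This is a sketch, not part of NLTK: `remove_stopwords` is a hypothetical helper, and the small stop-word set in the example stands in for `stopwords.words('english')`.

```python
import string


def remove_stopwords(tokens, stop_words):
    """Drop stop words and tokens that consist only of punctuation."""
    stop_set = set(stop_words)  # set membership tests are fast
    punctuation = set(string.punctuation)
    return [
        token
        for token in tokens
        if token not in stop_set
        and not all(ch in punctuation for ch in token)
    ]


words = ['Python', 'is', 'a', 'widely', 'used', 'programming', 'language', '.']
sample_stop_words = {'is', 'a'}  # stand-in for the NLTK English list
print(remove_stopwords(words, sample_stop_words))
# ['Python', 'widely', 'used', 'programming', 'language']
```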

7. A typical text-cleaning pipeline

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Raw text
raw_text = "Life is like a box of chocolates. You never know what you're gonna get."

# Tokenize
raw_words = nltk.word_tokenize(raw_text)

# Lemmatize (every token is treated as a noun by default)
wordnet_lemmatizer = WordNetLemmatizer()
words = [wordnet_lemmatizer.lemmatize(raw_word) for raw_word in raw_words]

# Remove stop words
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]

print('Raw text:', raw_text)
print('Preprocessed result:', filtered_words)

Result:
Raw text: Life is like a box of chocolates. You never know what you're gonna get.
Preprocessed result: ['Life', 'like', 'box', 'chocolate', '.', 'You', 'never', 'know', "'re", 'gon', 'na', 'get', '.']
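
In the result above, 'You' is kept even though 'you' was removed: the NLTK English stop-word list is lower-case and the comparison is case-sensitive. A minimal sketch of a case-insensitive filter follows. `remove_stopwords_ignore_case` is a hypothetical name, and the short stop-word set is an illustrative stand-in for the real list.

```python
def remove_stopwords_ignore_case(tokens, stop_words):
    """Drop tokens whose lower-cased form is a stop word; keep original casing."""
    stop_set = {word.lower() for word in stop_words}
    return [token for token in tokens if token.lower() not in stop_set]


tokens = ['Life', 'is', 'like', 'a', 'box', 'You', 'never', 'know', 'you']
sample_stop_words = {'is', 'a', 'you'}  # stand-in for stopwords.words('english')
print(remove_stopwords_ignore_case(tokens, sample_stop_words))
# ['Life', 'like', 'box', 'never', 'know']
```

Lower-casing only for the comparison keeps the original capitalisation of the surviving tokens, which may matter if a later step such as part-of-speech tagging depends on it.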
