English Text Tokenization with NLTK

Table of Contents

        • 1. Installing NLTK
        • 2. Word and sentence tokenization
        • 3. Removing punctuation after tokenization
        • 4. Removing stop words after tokenization
        • 5. POS tagging after tokenization
        • 6. Stemming after tokenization
        • 7. Lemmatization after tokenization

1. Installing NLTK

First, open a terminal (e.g. Anaconda Prompt) and install nltk:

pip install nltk

Then open a Python shell (or Spyder in Anaconda) and run the following to download the NLTK data packages:

import nltk
nltk.download()

Note: for detailed steps and alternative installation methods, see "Installing the jieba and NLTK libraries under Anaconda3".

2. Word and Sentence Tokenization

 Since English sentences are essentially words separated by spaces and punctuation, splitting them into a list of tokens is relatively straightforward:
(1) Word tokenization:

from nltk import word_tokenize     # split text into word-level tokens
paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!"
words = word_tokenize(paragraph)
print(words)

Output:

['The', 'first', 'time', 'I', 'heard', 'that', 'song', 'was', 'in', 'Hawaii', 'on', 'radio', '.', 'I', 'was', 'just', 'a', 'kid', ',', 'and', 'loved', 'it', 'very', 'much', '!', 'What', 'a', 'fantastic', 'song', '!']

(2) Sentence tokenization:

from nltk import sent_tokenize    # split text into sentences
sentences = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!"
sentence = sent_tokenize(sentences)
print(sentence)

Output:

['The first time I heard that song was in Hawaii on radio.', 'I was just a kid, and loved it very much!', 'What a fantastic song!']

Note: both word_tokenize and sent_tokenize return Python lists.
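If NLTK is not available, the basic idea of word tokenization can be approximated with the standard-library re module. This is only a rough sketch, assuming words are runs of word characters; it does not handle contractions the way NLTK's tokenizer does:

```python
import re

def simple_tokenize(text):
    # Capture runs of word characters as tokens, and each
    # non-space punctuation mark as its own token
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("What a fantastic song!")
print(tokens)  # ['What', 'a', 'fantastic', 'song', '!']
```

The function name simple_tokenize is my own; it only illustrates the split-on-spaces-and-punctuation intuition described above.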

3. Removing Punctuation after Tokenization

from nltk import word_tokenize
paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!".lower()
cutwords1 = word_tokenize(paragraph)   # tokenize
print('[NLTK tokenization result:]')
print(cutwords1)

interpunctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']   # define the punctuation list
cutwords2 = [word for word in cutwords1 if word not in interpunctuations]   # remove punctuation
print('\n[After punctuation removal:]')
print(cutwords2)

Output:

[NLTK tokenization result:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', '.', 'i', 'was', 'just', 'a', 'kid', ',', 'and', 'loved', 'it', 'very', 'much', '!', 'what', 'a', 'fantastic', 'song', '!']

[After punctuation removal:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', 'i', 'was', 'just', 'a', 'kid', 'and', 'loved', 'it', 'very', 'much', 'what', 'a', 'fantastic', 'song']
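Instead of maintaining the punctuation list by hand, the standard library's string.punctuation already covers all 32 ASCII punctuation characters. A sketch of the same filtering step on a small token list:

```python
import string

tokens = ['the', 'first', 'time', '.', 'i', 'was', ',', 'a', 'kid', '!']
# string.punctuation contains the ASCII punctuation characters,
# so single-character punctuation tokens are filtered by membership
cleaned = [tok for tok in tokens if tok not in string.punctuation]
print(cleaned)  # ['the', 'first', 'time', 'i', 'was', 'a', 'kid']
```

Note that multi-character tokens such as "n't" or "..." are not members of string.punctuation, so they would survive this filter.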

4. Removing Stop Words after Tokenization

from nltk import word_tokenize
from nltk.corpus import stopwords

paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!".lower()
cutwords1 = word_tokenize(paragraph)   # tokenize
print('[NLTK tokenization result:]')
print(cutwords1)

interpunctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']   # define the punctuation list
cutwords2 = [word for word in cutwords1 if word not in interpunctuations]   # remove punctuation
print('\n[After punctuation removal:]')
print(cutwords2)

stops = set(stopwords.words("english"))
cutwords3 = [word for word in cutwords2 if word not in stops]   # keep tokens not in the stop-word list
print('\n[After stop-word removal:]')
print(cutwords3)

Output:

[NLTK tokenization result:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', '.', 'i', 'was', 'just', 'a', 'kid', ',', 'and', 'loved', 'it', 'very', 'much', '!', 'what', 'a', 'fantastic', 'song', '!']

[After punctuation removal:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', 'i', 'was', 'just', 'a', 'kid', 'and', 'loved', 'it', 'very', 'much', 'what', 'a', 'fantastic', 'song']

[After stop-word removal:]
['first', 'time', 'heard', 'song', 'hawaii', 'radio', 'kid', 'loved', 'much', 'fantastic', 'song']
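The filtering step itself is plain set membership, independent of NLTK. A self-contained sketch with a tiny illustrative stop list (NLTK's real English list has well over a hundred entries):

```python
# A hand-picked miniature stop list, standing in for
# stopwords.words("english"), which is much larger
stops = {'the', 'i', 'was', 'in', 'on', 'just', 'a', 'and',
         'it', 'very', 'that', 'what', 'much'}

tokens = ['the', 'first', 'time', 'i', 'heard', 'that', 'song']
content_words = [tok for tok in tokens if tok not in stops]
print(content_words)  # ['first', 'time', 'heard', 'song']
```

Using a set (rather than a list) for the stop words makes each membership test O(1), which matters when filtering long documents.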

5. POS Tagging after Tokenization

from nltk import word_tokenize,pos_tag
from nltk.corpus import stopwords

paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!".lower()
cutwords1 = word_tokenize(paragraph)   # tokenize
print('[NLTK tokenization result:]')
print(cutwords1)

interpunctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']   # define the punctuation list
cutwords2 = [word for word in cutwords1 if word not in interpunctuations]   # remove punctuation
print('\n[After punctuation removal:]')
print(cutwords2)

stops = set(stopwords.words("english"))
cutwords3 = [word for word in cutwords2 if word not in stops]   # keep tokens not in the stop-word list
print('\n[After stop-word removal:]')
print(cutwords3)

print('\n[POS tagging after stop-word removal:]')
print(pos_tag(cutwords3))

Output:

[NLTK tokenization result:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', '.', 'i', 'was', 'just', 'a', 'kid', ',', 'and', 'loved', 'it', 'very', 'much', '!', 'what', 'a', 'fantastic', 'song', '!']

[After punctuation removal:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', 'i', 'was', 'just', 'a', 'kid', 'and', 'loved', 'it', 'very', 'much', 'what', 'a', 'fantastic', 'song']

[After stop-word removal:]
['first', 'time', 'heard', 'song', 'hawaii', 'radio', 'kid', 'loved', 'much', 'fantastic', 'song']

[POS tagging after stop-word removal:]
[('first', 'JJ'), ('time', 'NN'), ('heard', 'NN'), ('song', 'NN'), ('hawaii', 'NN'), ('radio', 'NN'), ('kid', 'NN'), ('loved', 'VBD'), ('much', 'JJ'), ('fantastic', 'NN'), ('song', 'NN')]

Note: the second element of each tuple is that word's POS tag. For the meaning of every tag, run "nltk.help.upenn_tagset()" or see the NLTK POS-tagging documentation.
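For quick reference, the tags appearing in the output above can be looked up from a small dictionary. This is a hand-picked subset of the full Penn Treebank tag set, not the complete list:

```python
# Partial glossary of Penn Treebank POS tags
# (full list: nltk.help.upenn_tagset())
PENN_TAGS = {
    'NN':  'noun, singular',
    'NNS': 'noun, plural',
    'JJ':  'adjective',
    'VB':  'verb, base form',
    'VBD': 'verb, past tense',
    'RB':  'adverb',
}

for word, tag in [('first', 'JJ'), ('loved', 'VBD')]:
    print(word, '->', PENN_TAGS.get(tag, 'unknown'))
```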

6. Stemming after Tokenization

 Stemming strips affixes from a word and returns its stem. Search engines use this technique when indexing pages, so that searches for different forms of the same word all return pages about the same stem.
 There are several stemming algorithms:

# Porter stemming algorithm
from nltk.stem.porter import PorterStemmer
print(PorterStemmer().stem('leaves'))

# Lancaster stemming algorithm
from nltk.stem.lancaster import LancasterStemmer
print(LancasterStemmer().stem('leaves'))

# Snowball stemming algorithm
from nltk.stem import SnowballStemmer
print(SnowballStemmer('english').stem('leaves'))

Output:
leav
leav
leav
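All three stemmers are built on the same core idea: rule-based suffix stripping. The following toy sketch illustrates only that principle; the real Porter, Lancaster, and Snowball algorithms apply many ordered, conditional rules and measure word structure before stripping:

```python
def toy_stem(word):
    # Strip one common suffix, keeping at least 3 characters of stem.
    # This is a deliberately naive illustration, not a real stemmer.
    for suffix in ('ing', 'es', 'ed', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(toy_stem('leaves'))   # 'leav'
print(toy_stem('playing'))  # 'play'
```

Notice that, like the real stemmers, this produces stems ('leav') that are not dictionary words, which is exactly the limitation lemmatization addresses in section 7.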

 The most commonly used algorithm is Porter's. NLTK's PorterStemmer class implements it:

from nltk import word_tokenize,pos_tag
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!".lower()
cutwords1 = word_tokenize(paragraph)   # tokenize
print('[NLTK tokenization result:]')
print(cutwords1)

interpunctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']   # define the punctuation list
cutwords2 = [word for word in cutwords1 if word not in interpunctuations]   # remove punctuation
print('\n[After punctuation removal:]')
print(cutwords2)

stops = set(stopwords.words("english"))
cutwords3 = [word for word in cutwords2 if word not in stops]  # keep tokens not in the stop-word list
print('\n[After stop-word removal:]')
print(cutwords3)

print('\n[POS tagging after stop-word removal:]')
print(pos_tag(cutwords3))      # POS tagging

print('\n[Stemming result:]')
cutwords4 = []
for cutword in cutwords3:
    cutwords4.append(PorterStemmer().stem(cutword))    # stemming
print(cutwords4)

Output:

[NLTK tokenization result:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', '.', 'i', 'was', 'just', 'a', 'kid', ',', 'and', 'loved', 'it', 'very', 'much', '!', 'what', 'a', 'fantastic', 'song', '!']

[After punctuation removal:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', 'i', 'was', 'just', 'a', 'kid', 'and', 'loved', 'it', 'very', 'much', 'what', 'a', 'fantastic', 'song']

[After stop-word removal:]
['first', 'time', 'heard', 'song', 'hawaii', 'radio', 'kid', 'loved', 'much', 'fantastic', 'song']

[POS tagging after stop-word removal:]
[('first', 'JJ'), ('time', 'NN'), ('heard', 'NN'), ('song', 'NN'), ('hawaii', 'NN'), ('radio', 'NN'), ('kid', 'NN'), ('loved', 'VBD'), ('much', 'JJ'), ('fantastic', 'NN'), ('song', 'NN')]

[Stemming result:]
['first', 'time', 'heard', 'song', 'hawaii', 'radio', 'kid', 'love', 'much', 'fantast', 'song']

7. Lemmatization after Tokenization

 Lemmatization is similar to stemming, but where stemming often produces stems that are not real words, lemmatization always returns an actual dictionary word.

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing'))

Output:
playing

 By default, NLTK's lemmatizer treats every word as a noun. To lemmatize a word as a verb, pass the pos argument:

from nltk.stem import WordNetLemmatizer    # lemmatization
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))   # lemmatize as a verb

Output:
play

Note: different words need different POS hints to recover their base forms correctly, so it is best to specify the POS when lemmatizing.
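A common pattern is to derive that pos argument from the Penn Treebank tag produced by pos_tag. A minimal mapping sketch (the helper name penn_to_wordnet is my own, not an NLTK function):

```python
def penn_to_wordnet(penn_tag):
    # The WordNet lemmatizer accepts 'n' (noun), 'v' (verb),
    # 'a' (adjective), and 'r' (adverb); map Penn tag prefixes to these
    if penn_tag.startswith('V'):
        return 'v'
    if penn_tag.startswith('J'):
        return 'a'
    if penn_tag.startswith('R'):
        return 'r'
    return 'n'  # default to noun, matching the lemmatizer's own default

print(penn_to_wordnet('VBD'))  # 'v'
print(penn_to_wordnet('NN'))   # 'n'
```

With this helper, each (word, tag) pair from pos_tag can be lemmatized as lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)) instead of hard-coding pos='v' for every token.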

from nltk import word_tokenize,pos_tag   # tokenization, POS tagging
from nltk.corpus import stopwords    # stop words
from nltk.stem import PorterStemmer    # stemming
from nltk.stem import WordNetLemmatizer    # lemmatization

paragraph = "I went to   the gymnasium yesterday  , when I had finished my homework !".lower()
cutwords1 = word_tokenize(paragraph)   # tokenize
print('[NLTK tokenization result:]')
print(cutwords1)

interpunctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']   # define the punctuation list
cutwords2 = [word for word in cutwords1 if word not in interpunctuations]   # remove punctuation
print('\n[After punctuation removal:]')
print(cutwords2)

stops = set(stopwords.words("english"))
cutwords3 = [word for word in cutwords2 if word not in stops]  # keep tokens not in the stop-word list
print('\n[After stop-word removal:]')
print(cutwords3)

print('\n[POS tagging after stop-word removal:]')
print(pos_tag(cutwords3))      # POS tagging

print('\n[Stemming result:]')
cutwords4 = []
for cutword1 in cutwords3:
    cutwords4.append(PorterStemmer().stem(cutword1))    # stemming
print(cutwords4)

print('\n[Lemmatization result:]')
cutwords5 = []
for cutword2 in cutwords4:
    cutwords5.append(WordNetLemmatizer().lemmatize(cutword2,pos='v'))   # lemmatize as a verb
print(cutwords5)

Output:

[NLTK tokenization result:]
['i', 'went', 'to', 'the', 'gymnasium', 'yesterday', ',', 'when', 'i', 'had', 'finished', 'my', 'homework', '!']

[After punctuation removal:]
['i', 'went', 'to', 'the', 'gymnasium', 'yesterday', 'when', 'i', 'had', 'finished', 'my', 'homework']

[After stop-word removal:]
['went', 'gymnasium', 'yesterday', 'finished', 'homework']

[POS tagging after stop-word removal:]
[('went', 'VBD'), ('gymnasium', 'NN'), ('yesterday', 'NN'), ('finished', 'VBD'), ('homework', 'NN')]

[Stemming result:]
['went', 'gymnasium', 'yesterday', 'finish', 'homework']

[Lemmatization result:]
['go', 'gymnasium', 'yesterday', 'finish', 'homework']

 That's it for basic English text preprocessing. A follow-up will cover English keyword extraction and analysis. Thanks for reading!
