Natural language processing with NLTK --- stopword removal with the stopwords corpus

A test script for stopword removal with the NLTK stopwords corpus; first we run it on a sample string:

import pandas as pd  # only needed for the commented-out tsv read below
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the stopword list and the punkt tokenizer models
# (these are no-ops once the data is cached locally).
nltk.download('stopwords')
nltk.download('punkt')

# The sample string below was pulled from a PubMed 1-gram file, e.g.:
# example_sent = pd.read_csv('D:/set/PubMed/1813/1-grams-1813.tsv')
example_sent = "the respecting  Spasmodic SecondaryVenereal off from Fluids	portions partly Nerve Example	some Natives  Metacarpal Contracted Constitutions	Instance jat by severe double Appendix contained Joints Disorders  Tumour Vascular Tongue Bone case Liver Account Diseases History Explanation A  Soldiery Human Brain betweenHumor operation , cyst Tabular Radial attended situated Inflammation Puberty attached sawing evacuating Dissection DiseaseMouth Groin Some Bones cases circumstances posterior Cataract	intoStrangulatedAqueous Observations . was to which Aneurism Paralysis	beneficial	Eyes Opium Ossium Effects Hemorrhage Appearance succeeded On a with Synopsis Fon in successfully"
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

# Keep only the tokens that are not in the stopword set.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sentence)
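
If you are curious what is actually being filtered out, here is a minimal sketch for inspecting the stopwords corpus (the exact contents and length vary with the nltk_data version you have installed):

from nltk.corpus import stopwords

# Languages shipped with the corpus, e.g. 'english', 'french', ...
print(stopwords.fileids())
sw = stopwords.words('english')
print(len(sw))   # a list of short function words; length varies by version
print(sw[:10])   # the first few entries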

Running it produces:
(Figure 1: console output showing the full token list and the filtered list)
If the nltk_data download fails with an error, try switching networks (for example from Wi-Fi to a phone hotspot); that usually resolves it. The error looks like this:

LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')
  
  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

  Searched in:
    - 'C:\\Users\\86166/nltk_data'
    - 'C:\\Users\\86166\\Miniconda3\\nltk_data'
    - 'C:\\Users\\86166\\Miniconda3\\share\\nltk_data'
    - 'C:\\Users\\86166\\Miniconda3\\lib\\nltk_data'
    - 'C:\\Users\\86166\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************

When this resource-loading error appears, follow the hint in the message and add the download call to your code, e.g. nltk.download('punkt'); alternatively, download the data manually from the URL and move it into one of the searched paths.
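
If none of the default locations are writable, here is a minimal sketch for keeping the data in a custom directory (D:/nltk_data is just one of the searched paths above; both calls are standard NLTK API):

import nltk

# Download directly into a directory you control.
nltk.download('punkt', download_dir='D:/nltk_data')

# Or, if you unpacked the data manually, add the directory to the search path.
nltk.data.path.append('D:/nltk_data')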

The raw corpus we received is shown below; it contains digits and other noise, so we need a short script to clean it up.
(Figure 2: the raw 1-gram tsv data, with digits and other noise)
The cleaning script:

import glob

# Collect the tsv files in the target folder. Globbing '*.tsv' instead of '*'
# keeps the output file 1.txt from being re-read on a second run.
tsv_list = glob.glob('C:/Users/86166/Desktop/111/*.tsv')

# 1.txt is the output file; 'a' appends, and newline='' avoids blank lines.
with open('C:/Users/86166/Desktop/111/1.txt', 'a', newline='') as out:
    for path in tsv_list:
        # Read one tsv file line by line.
        with open(path) as tsv_file:
            for one_line in tsv_file:
                # Keep only English letters; runs of other characters
                # (digits, tabs, punctuation) become a single space so
                # that adjacent words do not fuse together.
                cleaned = ''.join(c if c.isalpha() else ' ' for c in one_line)
                out.write(' '.join(cleaned.split()) + ' ')
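
The per-character filter can also be written with a regex; a minimal equivalent sketch using only the standard library:

import re

# Replace every run of non-letter characters with a single space.
def clean_line(one_line):
    return re.sub(r'[^A-Za-z]+', ' ', one_line).strip()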

The cleaned data, shown below, is far more regular: a single running English paragraph:
(Figure 3: the cleaned text)
Then we run stopword removal on it:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

# Read the cleaned corpus; the variable is named text rather than str
# to avoid shadowing the built-in type.
with open('C:/Users/86166/Desktop/111/1.txt') as infile:
    text = infile.read()

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(text)

# Keep only the tokens that are not in the stopword set.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sentence)
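
To keep the result, one option is to join the filtered tokens back into a string and write them to a file (the output filename here is just an example):

# Re-assemble the filtered tokens and save them (example path).
cleaned = ' '.join(filtered_sentence)
with open('C:/Users/86166/Desktop/111/1_no_stopwords.txt', 'w') as f:
    f.write(cleaned)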
