The txt file looks like this:
We want to read out every word and store the words in a list, which takes the following steps:
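# Read the file line by line; `with` closes it automatically
with open(r'E:\Program Files\PyCharm 2019.2\machinelearning\homework\Emails\Training\spam\3.txt') as data:
    cab = []
    for line in data.readlines():
        # Strip the trailing newline, then split the line on commas
        cab.append(line.strip().split(','))
print(cab)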
Output of cab:
[['You Have Everything To Gain!'], [''], ['Incredib1e gains in length of 3-4 inches to yourPenis', ' PERMANANTLY'], [''], ['Amazing increase in thickness of yourPenis', ' up to 30%'], ['BetterEjacu1ation control'], ['Experience Rock-HardErecetions'], ['Explosive', ' intenseOrgasns'], ['Increase volume ofEjacu1ate'], ['Doctor designed and endorsed'], ['100% herbal', ' 100% Natural', ' 100% Safe'], ['The proven NaturalPenisEnhancement that works!'], ['100% MoneyBack Guaranteeed']]
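We can see that cab[1] is an anomalous entry: a list holding only an empty string, left by a blank line in the file (cab[3] is the same).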
cab_f = []
for i in range(len(cab)):
    for j in range(len(cab[i])):
        # Skip the empty strings left by blank lines
        if cab[i][j] != '':
            cab_f.append(cab[i][j].strip())
Output of cab_f:
['You Have Everything To Gain!', 'Incredib1e gains in length of 3-4 inches to yourPenis', 'PERMANANTLY', 'Amazing increase in thickness of yourPenis', 'up to 30%', 'BetterEjacu1ation control', 'Experience Rock-HardErecetions', 'Explosive', 'intenseOrgasns', 'Increase volume ofEjacu1ate', 'Doctor designed and endorsed', '100% herbal', '100% Natural', '100% Safe', 'The proven NaturalPenisEnhancement that works!', '100% MoneyBack Guaranteeed']
We can see that the list is now one-dimensional and the anomalous entries have been removed.
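As a side note, the same flatten-and-filter step can be sketched as a single list comprehension. The `cab` value below is a shortened stand-in for the real list, used only for illustration. One small difference from the loop above: the filter here tests the stripped string, so pieces made up only of whitespace are dropped as well.

```python
# Shortened stand-in for the `cab` list built earlier (illustrative data only).
cab = [['You Have Everything To Gain!'], [''], ['100% herbal', ' 100% Natural', ' 100% Safe']]

# Flatten the nested list, strip surrounding whitespace, and drop empty pieces in one pass.
cab_f = [piece.strip() for row in cab for piece in row if piece.strip()]

print(cab_f)
# ['You Have Everything To Gain!', '100% herbal', '100% Natural', '100% Safe']
```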
cab_final = []
for i in cab_f:
    # Split each phrase on single spaces to get the individual words
    for j in i.split(' '):
        cab_final.append(j)
Output of cab_final:
['You', 'Have', 'Everything', 'To', 'Gain!', 'Incredib1e', 'gains', 'in', 'length', 'of', '3-4', 'inches', 'to', 'yourPenis', 'PERMANANTLY', 'Amazing', 'increase', 'in', 'thickness', 'of', 'yourPenis', 'up', 'to', '30%', 'BetterEjacu1ation', 'control', 'Experience', 'Rock-HardErecetions', 'Explosive', 'intenseOrgasns', 'Increase', 'volume', 'ofEjacu1ate', 'Doctor', 'designed', 'and', 'endorsed', '100%', 'herbal', '100%', 'Natural', '100%', 'Safe', 'The', 'proven', 'NaturalPenisEnhancement', 'that', 'works!', '100%', 'MoneyBack', 'Guaranteeed']
As we can see, this is exactly the result we wanted!
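One caveat about `split(' ')`: it splits on every single space, so a phrase containing two consecutive spaces would yield an empty string in the result. The sample file happens not to trigger this, but a small sketch (with a made-up input string) shows the difference from calling `split()` with no argument, which splits on any run of whitespace:

```python
text = 'up  to 30%'  # made-up example: two spaces between 'up' and 'to'

by_space = text.split(' ')   # splits at every single space
by_whitespace = text.split() # splits on runs of whitespace

print(by_space)       # ['up', '', 'to', '30%']
print(by_whitespace)  # ['up', 'to', '30%']
```

If other files in the dataset contain irregular spacing, `split()` is probably the safer choice.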
Complete code:
def read_txt():
    # Read every line of the file; `with` closes it automatically
    with open(r'E:\Program Files\PyCharm 2019.2\machinelearning\homework\Emails\Training\spam\3.txt') as data:
        cab = []
        for line in data.readlines():
            cab.append(line.strip().split(','))
    # Flatten to one dimension and drop the empty strings
    cab_f = []
    for i in range(len(cab)):
        for j in range(len(cab[i])):
            if cab[i][j] != '':
                cab_f.append(cab[i][j].strip())
    # Split each phrase into individual words
    cab_final = []
    for i in cab_f:
        for j in i.split(' '):
            cab_final.append(j)
    return cab_final


if __name__ == '__main__':
    print(read_txt())
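For reuse on other files, a more compact variant might look like the sketch below. It is not the code above, just one possible alternative: the function name `read_words` and its `path` and `encoding` parameters are hypothetical, and the demo writes a small throwaway file instead of reading the real email. It replaces commas with spaces and then splits on whitespace, so unlike the step-by-step version it can never return empty strings.

```python
import os
import tempfile


def read_words(path, encoding=None):
    """Return all words in the file at `path`, splitting on commas and whitespace.

    `encoding=None` keeps the platform default, as a plain open(path) would.
    """
    words = []
    with open(path, encoding=encoding) as f:  # closed automatically on exit
        for line in f:
            # Treat commas as separators, then split on any run of whitespace.
            words.extend(line.replace(',', ' ').split())
    return words


# Demo on a temporary file (illustrative content, not the real dataset).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('You Have Everything To Gain!\n\n100% herbal, 100% Natural, 100% Safe\n')

demo_words = read_words(tmp.name)
os.remove(tmp.name)
print(demo_words)
# ['You', 'Have', 'Everything', 'To', 'Gain!', '100%', 'herbal', '100%', 'Natural', '100%', 'Safe']
```

Passing the path as an argument also makes it straightforward to loop over every file in the spam folder rather than hard-coding a single one.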