[Python3] Data Cleaning for Beginners, Part Ⅰ: Processing Text with Built-in Functions

This tutorial shows how to do simple data collection and storage from text documents.

Prerequisites

  1. String, List, Dictionary, and Tuple functions
  2. File I/O functions
    For details, see my other post:
    Python for Informatics (File & String & List & Dictionary & Tuple)

Project examples

  • Read a remote document, pull out the confidence values, and compute their average (exercise from "Python for Informatics")
from urllib.request import urlopen

file_url = 'http://www.py4inf.com/code/mbox-short.txt'
file_list = urlopen(file_url)
conf_list = []
sign = "X-DSPAM-Confidence: "

for line in file_list:
    line = str(line, 'utf-8')  # urlopen() yields bytes, so decode each line to str
    if line.startswith(sign):  # skip every line that is not a confidence header
        confidence = line[len(sign):].strip()  # the value is everything after the prefix
        print(confidence)
        conf_list.append(float(confidence))

# The built-in sum() and len() replace the manual loop and avoid shadowing sum
print("Average spam confidence: " + str(sum(conf_list) / len(conf_list)))
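
The extraction step can also be tried without network access. The following is a minimal sketch, assuming a hypothetical helper named `parse_confidences` and a few made-up sample lines in the same header format; neither comes from the original exercise.

```python
PREFIX = "X-DSPAM-Confidence: "


def parse_confidences(lines):
    """Return the confidence values found in an iterable of text lines.

    Hypothetical helper for illustration; not part of the original exercise.
    """
    values = []
    for line in lines:
        if line.startswith(PREFIX):
            # Everything after the prefix is the number, plus a trailing newline
            values.append(float(line[len(PREFIX):].strip()))
    return values


# Made-up sample lines, for illustration only
sample = [
    "From: someone@example.com\n",
    "X-DSPAM-Confidence: 0.8475\n",
    "X-DSPAM-Probability: 0.0000\n",
    "X-DSPAM-Confidence: 0.6178\n",
]

values = parse_confidences(sample)
# Guard against an empty list to avoid dividing by zero
average = sum(values) / len(values) if values else 0.0
print("Average spam confidence:", average)
```

Keeping the parsing in a function that accepts any iterable of strings means the same code works on a local file, a decoded network response, or a plain list like the one above.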
  • Read a remote document, collect every distinct word into a list, and sort the list alphabetically (exercise from "Python for Informatics")
from urllib.request import urlopen

url = "http://www.py4inf.com/code/romeo.txt"
url_file = urlopen(url)
words = []

for line in url_file:
    line = str(line,'utf-8')
    temp_words = line.split()
    for word in temp_words:
        if word not in words:
            words.append(word)

words.sort()
print(words)
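
The `word not in words` check scans the whole list each time, which gets slow on large texts. One possible alternative, sketched here with a hypothetical helper `unique_sorted_words` and made-up sample lines, collects the words in a set and sorts once at the end.

```python
def unique_sorted_words(lines):
    """Return the distinct whitespace-separated words in lines, sorted.

    Hypothetical helper for illustration; not part of the original exercise.
    """
    seen = set()
    for line in lines:
        # A set ignores duplicates, so no membership check is needed
        seen.update(line.split())
    return sorted(seen)


# Made-up sample lines, for illustration only
sample = [
    "But soft what light\n",
    "what light through yonder window\n",
]

words = unique_sorted_words(sample)
print(words)
```

Note that the default sort is case-sensitive, so capitalized words such as "But" come before all lowercase words. This matches the behavior of `words.sort()` in the list-based version.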

  • Find the ten most frequent words in a text (exercise from "Python for Informatics")
import string

words = dict()
table = str.maketrans('', '', string.punctuation)  # build the deletion table once

with open('text.txt') as fhand:
    for line in fhand:
        # Strip all punctuation and lowercase the line. String methods return a
        # new string, so the result must be assigned back.
        # (In Python 3, translate() takes a single argument.)
        line = line.translate(table).lower()
        for word in line.split():
            words[word] = words.get(word, 0) + 1

# Build (count, word) tuples so that sorting orders by count first
words_cooked = list()
for key, value in words.items():
    words_cooked.append((value, key))

words_cooked.sort(reverse=True)

for count, word in words_cooked[:10]:
    print(count, word)
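
The count-then-sort pattern can also be written with `collections.Counter` from the standard library. The following is a minimal sketch, assuming a hypothetical helper `top_words` and made-up sample lines; it is an alternative to the dictionary approach, not the method the exercise asks for.

```python
import string
from collections import Counter


def top_words(lines, n=10):
    """Return the n most common words as (word, count) pairs.

    Hypothetical helper for illustration; not part of the original exercise.
    """
    # Translation table that deletes every ASCII punctuation character
    table = str.maketrans('', '', string.punctuation)
    counts = Counter()
    for line in lines:
        cleaned = line.translate(table).lower()
        counts.update(cleaned.split())
    return counts.most_common(n)


# Made-up sample lines, for illustration only
sample = ["The cat, the dog.\n", "The bird!\n"]

result = top_words(sample, n=2)
for word, count in result:
    print(count, word)
```

Unlike the tuple-sorting version, `most_common()` returns `(word, count)` pairs, so the unpacking order in the final loop is reversed.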
