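A fairly simple assignment:
Assignment details: click here
Approach:
(1) In Chrome, select the part of the page that holds the data to scrape, then right-click and choose Inspect (or press Ctrl+Shift+I) to find the corresponding HTML.
(2) Analyze that HTML first, then use BeautifulSoup to scrape a single row of data.
(3) Wrap the single-row logic in a function to simplify the code.
(4) Test repeatedly and look for special cases to make the code more robust.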
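As a rough illustration of steps (2) and (3), the sketch below parses a single table row from an inline HTML string instead of a live page. The markup in SAMPLE_HTML, the country name "Examplia" and the helper parse_row are all made up for this example; the real Wikipedia tables may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating one data row; not copied from Wikipedia.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Country</th><th>GDP (US$MM)</th></tr>
  <tr>
    <td>1</td>
    <td><a href="/wiki/Examplia">Examplia</a></td>
    <td><span style="display:none">0001</span>1,234,567</td>
  </tr>
</table>
"""


def parse_row(row):
    """Return [country, gdp_value] for one <tr>, or None if it is not a data row."""
    tds = row.find_all("td")
    if len(tds) < 3 or tds[1].a is None:
        return None  # header rows contain <th> cells only
    country = tds[1].a.get_text(strip=True)
    # Text nodes are joined with ":" so the number shown on the page is the last piece.
    gdp = tds[2].get_text(separator=":")
    return [country, gdp.split(":")[-1].replace(",", "").strip()]


# The built-in "html.parser" is used here so the sketch does not need lxml.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = [parse_row(tr) for tr in soup.find_all("tr")]
parsed = [r for r in rows if r is not None]
print(parsed)
```

Once a single row parses correctly, the same helper can be applied to every row of a table, which is roughly what the function in the code section does.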
Code:
(1) Load the required packages
import csv
from bs4 import BeautifulSoup
import requests
(2) Store the URL and filename strings in lists
domain_name = "https://en.wikipedia.org/wiki/"
url = ["List_of_countries_by_GDP_(nominal)",
       "List_of_countries_by_GDP_(nominal)_per_capita",
       "List_of_countries_by_GDP_(PPP)"]
out_filenames = ["gdp_imf", "gdp_wb", "gdp_un",
                 "gdp_cap_imf", "gdp_cap_wb", "gdp_cap_un",
                 "gdp_ppp_imf", "gdp_ppp_wb", "gdp_ppp_cia"]
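(3) Scraping functions
def extract_html(url):
    # Download the page and parse it (the 'lxml' parser requires the lxml package)
    content = requests.get(url).text
    bsObj = BeautifulSoup(content, 'lxml')
    return bsObj

def extract_table(url, tno, out_filename):
    # Take the table at index tno on the page and write (country, GDP) rows to a CSV file
    table = extract_html(url).find_all('table')[tno]
    # newline='' keeps the csv module from writing blank lines on Windows
    with open(out_filename + ".csv", 'w', encoding='utf-8', newline='') as out_file:
        writer = csv.writer(out_file)
        for row in table.find_all('tr')[1:]:
            tds = row.find_all('td')
            # Skip rows without the expected cells or without a country link
            if len(tds) < 3 or tds[1].a is None:
                continue
            country = tds[1].a.get_text()
            # Join the cell's text nodes with ":" and keep only the last piece
            gdp = tds[2].get_text(separator=":")
            gdp_val = gdp.split(':')[-1].replace(",", "")
            writer.writerow([country, gdp_val])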
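One special case worth testing for is a cell whose text is not a clean number, for example a value followed by a footnote marker. The helper below is a minimal sketch of one way to normalize such strings; clean_gdp_value is a hypothetical name, and the sample inputs are invented rather than taken from the scraped pages.

```python
import re


def clean_gdp_value(text):
    """Convert a cell string such as '1,234,567[n 1]' to a number, or None."""
    text = re.sub(r"\[.*?\]", "", text)      # drop footnote markers like [n 1]
    text = text.replace(",", "").strip()     # remove thousands separators
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return None                          # no digits at all, e.g. 'N/A'
    number = match.group()
    return float(number) if "." in number else int(number)


for raw in ["1,234,567", "19,390,604[n 1]", "12.5", "N/A"]:
    print(repr(raw), "->", clean_gdp_value(raw))
```

Returning None for unparseable cells leaves it to the caller to decide whether to skip the row or write an empty value.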
(4) Call the function to scrape each table (note: not optimized)
extract_table(domain_name + url[0],2,out_filenames[0])
extract_table(domain_name + url[0],3,out_filenames[1])
extract_table(domain_name + url[0],4,out_filenames[2])
extract_table(domain_name + url[1],2,out_filenames[3])
extract_table(domain_name + url[1],3,out_filenames[4])
extract_table(domain_name + url[1],4,out_filenames[5])
extract_table(domain_name + url[2],2,out_filenames[6])
extract_table(domain_name + url[2],3,out_filenames[7])
extract_table(domain_name + url[2],4,out_filenames[8])
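The nine calls above follow a regular pattern: three pages, table indices 2 to 4 on each, and one filename per combination. A possible way to generate them in a loop is sketched below. The names pages, table_numbers and build_jobs are introduced only for this illustration, and the sketch prints the jobs instead of hitting the network.

```python
from itertools import product

domain_name = "https://en.wikipedia.org/wiki/"
pages = ["List_of_countries_by_GDP_(nominal)",
         "List_of_countries_by_GDP_(nominal)_per_capita",
         "List_of_countries_by_GDP_(PPP)"]
out_filenames = ["gdp_imf", "gdp_wb", "gdp_un",
                 "gdp_cap_imf", "gdp_cap_wb", "gdp_cap_un",
                 "gdp_ppp_imf", "gdp_ppp_wb", "gdp_ppp_cia"]
table_numbers = [2, 3, 4]  # table indices used in the calls above


def build_jobs(domain, pages, table_numbers, filenames):
    """Pair every (page, table index) combination with its output filename."""
    combos = list(product(pages, table_numbers))
    if len(combos) != len(filenames):
        raise ValueError("need exactly one filename per (page, table) pair")
    return [(domain + page, tno, name)
            for (page, tno), name in zip(combos, filenames)]


jobs = build_jobs(domain_name, pages, table_numbers, out_filenames)
for job_url, tno, name in jobs:
    # Replace print with extract_table(job_url, tno, name) to actually scrape.
    print(job_url, tno, name)
```

product iterates the pages in the outer position and the table indices in the inner one, so the jobs come out in the same order as the hand-written calls.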
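(5) Process and store the scraped data
data_dd = {}
for cname in out_filenames:
    # Read with the same encoding the files were written in
    with open(cname + ".csv", encoding='utf-8', newline='') as file:
        reader = csv.reader(file)
        for row in reader:
            country = row[0]
            value = row[1]
            if country not in data_dd:
                data_dd[country] = {}
            data_dd[country][cname] = value

# Turn {country: {column: value}} into a list of row dicts
data = []
for k, v in data_dd.items():
    v['country'] = k
    data.append(v)

with open("all_data.csv", "w", encoding='utf-8', newline='') as output:
    writer = csv.DictWriter(output, ['country'] + out_filenames)
    writer.writeheader()  # column names as the first row
    writer.writerows(data)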
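To see what the merge step produces without running the scraper, the sketch below applies the same idea to a small in-memory sample and writes to a string buffer instead of a file. The function merge_columns and the countries "Examplia" and "Testland" are invented for this illustration.

```python
import csv
import io


def merge_columns(columns):
    """columns maps a column name to a list of (country, value) rows."""
    merged = {}
    for name, rows in columns.items():
        for country, value in rows:
            # One dict per country, created on first sight of that country.
            merged.setdefault(country, {"country": country})[name] = value
    return list(merged.values())


sample = {
    "gdp_imf": [("Examplia", "100"), ("Testland", "50")],
    "gdp_wb": [("Examplia", "110")],  # Testland is missing from this source
}
records = merge_columns(sample)

buffer = io.StringIO()
# restval="" fills the cell when a country is absent from one source.
writer = csv.DictWriter(buffer, ["country"] + list(sample), restval="")
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```

A country that appears in only some sources still gets a row, with empty cells for the columns it lacks.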