Scraping data in Python with BeautifulSoup

A fairly simple assignment:

Assignment details: click here


Tools used: Jupyter Notebook, Google Chrome


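Approach:

(1) In Chrome, select the part of the page that holds the data to scrape, then right-click and choose Inspect (or press Ctrl+Shift+I) to locate its HTML.

(2) Analyze that HTML first, then use BeautifulSoup to scrape a single group of data.

(3) Generalize the single-group code into a function to simplify the code.

(4) Test repeatedly and look for special cases to make the code more robust.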
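Step (2) can be tried on a small hand-written fragment before touching the live page. The sketch below is only an illustration: the HTML string and the country name are made up to resemble one table row, and it uses Python's built-in `html.parser` (an assumption, chosen so the snippet runs without `lxml`).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating a header row and one data row of a GDP table.
sample_html = """
<table>
  <tr><th>Rank</th><th>Country</th><th>GDP (US$ million)</th></tr>
  <tr><td>1</td><td><a href="/wiki/Exampleland">Exampleland</a></td><td>1,234,567</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
first_row = soup.find("table").find_all("tr")[1]  # index 0 is the header row
cells = first_row.find_all("td")

country = cells[1].a.get_text()                   # text of the link in the country cell
gdp_val = cells[2].get_text().replace(",", "")    # drop thousands separators
print(country, gdp_val)
```

Once a single row parses as expected, the same lookups can be moved into a loop over all rows, which is what step (3) describes.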


Code:
(1) Load the required packages

import csv
from bs4 import BeautifulSoup
import requests
(2) Store the URL and filename strings in lists

domain_name = "https://en.wikipedia.org/wiki/"
url = ["List_of_countries_by_GDP_(nominal)",
       "List_of_countries_by_GDP_(nominal)_per_capita",
       "List_of_countries_by_GDP_(PPP)"]
out_filenames = ["gdp_imf","gdp_wb","gdp_un",
                "gdp_cap_imf","gdp_cap_wb","gdp_cap_un", 
                "gdp_ppp_imf","gdp_ppp_wb","gdp_ppp_cia"]
(3) The scraping functions

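def extract_html(url):
    # Download the page and parse it with the lxml parser.
    content = requests.get(url).text
    bsObj = BeautifulSoup(content,'lxml')
    return bsObj

def extract_table(url, tno , out_filename):
    # Pick the table at index tno on the page and write (country, value) rows to CSV.
    table = extract_html(url).find_all('table')[tno]
    # newline='' keeps the csv module from emitting blank lines on Windows.
    with open(out_filename+".csv",'w',encoding='utf-8',newline='') as out_file:
        writer = csv.writer(out_file)
        for row in table.find_all('tr')[1:]:
            tds = row.find_all('td')
            # Skip header or malformed rows that lack a linked country cell.
            if len(tds) < 3 or tds[1].a is None:
                continue
            country = tds[1].a.get_text()
            # Join the cell's text nodes with ':' and keep only the last one.
            gdp = tds[2].get_text(separator=":")
            gdp_val = gdp.split(':')[-1].replace(",","")
            writer.writerow([country,gdp_val])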
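The `get_text(separator=":")` call followed by `split(':')[-1]` is easier to follow on a single cell. This is a minimal sketch under assumptions: the cell markup below is invented (a hidden sort key followed by the visible figure), and `html.parser` is used only to avoid the `lxml` dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical cell: a hidden sort key, then the figure the reader sees.
cell_html = '<td><span style="display:none">0012</span>20,494,050</td>'
cell = BeautifulSoup(cell_html, "html.parser").td

joined = cell.get_text(separator=":")   # text nodes joined with ":"
pieces = joined.split(":")              # one entry per text node
gdp_val = pieces[-1].replace(",", "")   # keep the last node, drop commas
print(joined, gdp_val)
```

Because the last text node is kept, a cell whose figure is followed by something else (a footnote marker, for instance) would yield that trailing text instead, which is one of the special cases worth testing for.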
(4) Call the function to scrape each table (PS: not optimized)
extract_table(domain_name + url[0],2,out_filenames[0])
extract_table(domain_name + url[0],3,out_filenames[1])
extract_table(domain_name + url[0],4,out_filenames[2])
extract_table(domain_name + url[1],2,out_filenames[3])
extract_table(domain_name + url[1],3,out_filenames[4])
extract_table(domain_name + url[1],4,out_filenames[5])
extract_table(domain_name + url[2],2,out_filenames[6])
extract_table(domain_name + url[2],3,out_filenames[7])
extract_table(domain_name + url[2],4,out_filenames[8])
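The nine calls follow a regular pattern: three pages, table indexes 2 to 4 on each, and the filenames in the same order. One possible way to tidy this up is sketched below; `build_jobs` is a hypothetical helper, not part of the original code, and it only builds the argument tuples so that it can be checked without any network access.

```python
def build_jobs(domain, pages, filenames, table_numbers=(2, 3, 4)):
    """Pair every (page, table index) with the next output filename, in order."""
    names = iter(filenames)
    jobs = []
    for page in pages:
        for tno in table_numbers:
            jobs.append((domain + page, tno, next(names)))
    return jobs

domain_name = "https://en.wikipedia.org/wiki/"
url = ["List_of_countries_by_GDP_(nominal)",
       "List_of_countries_by_GDP_(nominal)_per_capita",
       "List_of_countries_by_GDP_(PPP)"]
out_filenames = ["gdp_imf", "gdp_wb", "gdp_un",
                 "gdp_cap_imf", "gdp_cap_wb", "gdp_cap_un",
                 "gdp_ppp_imf", "gdp_ppp_wb", "gdp_ppp_cia"]

jobs = build_jobs(domain_name, url, out_filenames)

# Each tuple matches the arguments of extract_table(url, tno, out_filename):
# for page_url, tno, name in jobs:
#     extract_table(page_url, tno, name)
```

Keeping the job list separate from the scraping call also makes it easy to print and inspect the pairings before any request is sent.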
(5) Process and store the scraped data

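# Merge the per-source CSV files into one dict keyed by country.
data_dd = {}
for cname in out_filenames:
    # Read with the same encoding that was used when the files were written.
    with open(cname + ".csv", encoding='utf-8', newline='') as file:
        reader = csv.reader(file)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or incomplete lines
            country = row[0]
            value = row[1]
            if country not in data_dd:
                data_dd[country] = {}
            data_dd[country][cname] = value
data = []
for k, v in data_dd.items():
    v['country'] = k
    data.append(v)

# Countries missing from a source get an empty cell (DictWriter's default restval).
with open("all_data.csv","w",encoding='utf-8',newline='') as output:
    writer = csv.DictWriter(output,['country'] + out_filenames)
    writer.writeheader()  # label the columns in the combined file
    writer.writerows(data)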
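The merge step can be checked on in-memory data without scraping anything. The sketch below is illustrative only: the country names and values are made up, `io.StringIO` stands in for the output file, and a header row is written so the columns are identifiable.

```python
import csv
import io

# Made-up values, purely for illustration; "Samplestan" is missing from gdp_wb.
per_source = {
    "gdp_imf": [("Exampleland", "100"), ("Samplestan", "50")],
    "gdp_wb": [("Exampleland", "110")],
}
fieldnames = ["country"] + list(per_source)

# One dict per country, filled in source by source.
merged = {}
for source, rows in per_source.items():
    for country, value in rows:
        merged.setdefault(country, {"country": country})[source] = value

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames, restval="")  # blank for missing sources
writer.writeheader()
writer.writerows(merged.values())

lines = buffer.getvalue().splitlines()
print(lines)
```

A country that is absent from one source simply gets an empty cell in that column, so rows stay aligned across all sources.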

Because Wikipedia is updated from time to time, the code needs to be adjusted to match the latest version of the page's HTML.
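One way to reduce the dependence on fixed table positions is to filter tables by CSS class before indexing. This is a sketch under assumptions, not the method used above: it assumes the data tables carry the `wikitable` class, and the HTML below is an invented stand-in for a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical page: one layout table followed by two data tables.
page_html = """
<table class="infobox"><tr><td>navigation</td></tr></table>
<table class="wikitable sortable"><tr><td>data A</td></tr></table>
<table class="wikitable"><tr><td>data B</td></tr></table>
"""

soup = BeautifulSoup(page_html, "html.parser")

# class_ matches any table whose class list contains "wikitable".
data_tables = soup.find_all("table", class_="wikitable")
first_cells = [t.td.get_text() for t in data_tables]
print(first_cells)
```

The indexes into `data_tables` would still need checking against the live page, but unrelated layout tables added or removed elsewhere would no longer shift them.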





