Python Web Scraping in Practice: The Perpetual Calendar

Scraping the perpetual calendar

Target URL: https://wannianrili.bmcx.com/

Goal: collect the Heavenly Stems and Earthly Branches (天干地支) for every day from 1970 through 2021.

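Open the Network tab of the browser's developer tools on the perpetual calendar page. Switching the month or the year makes new requests appear there.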
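Opening one of these requests shows that its response is exactly the data we want. The URL is also easy to adapt: changing the year and month in it returns the data for the corresponding month.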
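As a minimal sketch of that idea, the snippet below builds one URL per month for the whole range. The endpoint pattern and the `v=20031912` parameter are copied from the request used in the script in this post; the helper name `month_urls` is made up for illustration, and whether the site still accepts this exact pattern is an assumption.

```python
from itertools import product

# Endpoint pattern as used in this post's script; the year and month are filled in.
BASE = 'https://wannianrili.bmcx.com/ajax/?q={}-{}&v=20031912'


def month_urls(start_year=1970, end_year=2021):
    """Yield (year, month, url) for every month in the inclusive year range."""
    years = range(start_year, end_year + 1)
    months = range(1, 13)
    for year, month in product(years, months):
        yield year, month, BASE.format(year, month)


urls = list(month_urls())
print(len(urls))   # number of monthly requests needed
print(urls[0][2])  # first URL in the range
```

Generating the list up front makes it easy to see how many requests the job needs (52 years times 12 months) before sending any of them.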

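Because the scrape covers every day from 1970 to 2021, the volume of data is fairly large, so the script should pause between requests (with time.sleep) to limit the crawl rate. The full script is given below.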
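The pacing idea can be sketched on its own before looking at the script. The helper below waits a base interval plus a little random jitter between requests; the name `polite_delay`, the default values, and the injectable `sleep` argument are all assumptions made for illustration, not part of the original script.

```python
import random
import time


def polite_delay(base=2.0, jitter=1.0, sleep=time.sleep):
    """Pause for `base` seconds plus a random extra of up to `jitter` seconds.

    `sleep` is injectable so the helper can be exercised without really waiting.
    Returns the delay that was used.
    """
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay


# Dry run: record the delays instead of actually sleeping.
recorded = []
for _ in range(3):
    polite_delay(base=2.0, jitter=1.0, sleep=recorded.append)
print(recorded)
```

In a real crawl, one such call per HTTP request is usually enough; sleeping once per parsed row would multiply the total run time by the number of days in each month.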

import time

import openpyxl
import requests
from bs4 import BeautifulSoup

wb = openpyxl.Workbook()
sheet = wb.active
sheet.title = '数据'  # custom sheet name
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36',
    'Referer': 'https://wannianrili.bmcx.com/',
}
failed = False  # set when a request fails, so the outer loop stops too
for year in range(1970, 2022):
    for month in range(1, 13):
        # Build the target URL for this year and month
        url = 'https://wannianrili.bmcx.com/ajax/?q={}-{}&v=20031912'.format(year, month)
        res = requests.get(url, headers=headers, timeout=10)
        if res.status_code != 200:
            print('Error', year, month, res.status_code)
            failed = True
            break
        soup = BeautifulSoup(res.text, 'html.parser')
        # Each element with class "wnrl_k_you" holds the details for one day
        for data in soup.find_all(class_="wnrl_k_you"):
            date = data.find(class_="wnrl_k_you_id_biaoti").text.replace(' ', '').replace('\n', '')
            tian = data.find(class_="wnrl_k_you_id_wnrl_riqi").text.replace(' ', '').replace('\n', '')
            text = data.find(class_="wnrl_k_you_id_wnrl_nongli_ganzhi").text.replace(' ', '').replace('\n', '')
            date1 = date[:8] + tian + date[8:]  # insert the day number into the title
            gz_year = text[:7]     # stem-branch of the year
            gz_month = text[7:10]  # stem-branch of the month
            gz_day = text[10:]     # stem-branch of the day
            sheet.append([date1, gz_year, gz_month, gz_day])
        print(year, month)
        time.sleep(5)  # pause once per request, not once per day row
    if failed:
        break
wb.save('万年历.xlsx')  # data fetched before a failure is still saved
wb.close()
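The fixed slices in the script (`text[:7]`, `text[7:10]`, `text[10:]`) imply that the cleaned string has a 7-character year part, a 3-character month part, and the day part after that. The sketch below applies the same offsets to a sample string. The sample is hypothetical: its layout (year with the zodiac in brackets, then month, then day) is an assumption inferred from those slice widths, not captured output from the site, and `split_ganzhi` is an illustrative name.

```python
def split_ganzhi(text):
    """Split a ganzhi string using the same fixed offsets as the script."""
    cleaned = text.replace(' ', '').replace('\n', '')
    return cleaned[:7], cleaned[7:10], cleaned[10:]


# Hypothetical sample with the assumed layout: year + zodiac, month, day.
sample = '己酉年【鸡年】 丙子月 辛巳日'
gz_year, gz_month, gz_day = split_ganzhi(sample)
print(gz_year, gz_month, gz_day)
```

Fixed offsets are fragile: if the page ever changes the length of any part, the columns in the spreadsheet shift silently, so it is worth spot-checking a few rows of the output.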
