The previous post covered scraping data from a single page; this time we add click-driven page navigation.
This post scrapes data across multiple pages, drilling into the sub-elements under each container element.
Scrape Center (https://scrape.center/) is well suited for beginners to practice on: it offers plenty of targets, and the pages are not overly complex.
On to the code!
import xlwt
from playwright.sync_api import sync_playwright


def run(playwright):
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://spa5.scrape.center/page/502")
    listMsg = []
    # range(3) yields 0, 1, 2, i.e. three iterations; adjust as needed
    for i in range(3):
        # page.wait_for_timeout(2000)
        # Each iteration ends with a page navigation, so wait for the page
        # to finish loading; otherwise data that has not loaded yet cannot
        # be retrieved and the selectors error out.
        page.wait_for_load_state("networkidle")
        # XPath selector for the container elements
        msgs = page.query_selector_all("//*[@id='index']/div[1]/div/div/div")
        print(f"Page {i+1}, {len(msgs)} items:")
        # Drill into each container's child elements
        for msg in msgs:
            img_href = msg.query_selector("//a/img").get_attribute("src")
            book_name = msg.query_selector("//a/h3").text_content()
            if msg.query_selector("//p") is None:
                book_author = "Anonymous"
            else:
                book_author = msg.query_selector("//p").text_content().replace(" ", "")
            print('\t', img_href, book_name, book_author)
            listMsg.append((img_href, book_name, book_author))
        print(listMsg)
        print(f"Collected {len(listMsg)} items so far")
        # Check whether the "next page" button is still active; a disabled
        # button exposes the disabled attribute (its value may be "" or "disabled")
        if page.query_selector("//button[2]").get_attribute("disabled") is not None:
            print("No more pages... scraping finished")
            break
        else:
            page.click("//button[2]/i")
    # Write the collected rows to an Excel workbook
    workbook = xlwt.Workbook(encoding='utf-8')
    wordsheet = workbook.add_sheet("Book scraping site")
    colName = ["Image URL", "Book title", "Book author"]
    wordsheet.write(0, 0, "All book information")
    wordsheet.col(0).width = 10000
    wordsheet.col(1).width = 4000
    wordsheet.col(2).width = 8000
    for i in range(len(colName)):
        wordsheet.write(1, i, colName[i])
    # Iterate over all rows; prints "index (item)" for each entry
    for i, items in enumerate(listMsg):
        print(i, items)
        wordsheet.write(2 + i, 0, items[0])
        wordsheet.write(2 + i, 1, items[1])
        wordsheet.write(2 + i, 2, items[2].replace(" ", ""))
    workbook.save("Excel_books.xls")
    page.close()
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)
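Note that xlwt only produces the legacy .xls format. If a plain text export is enough, the same rows can be written with the standard library's csv module instead; a minimal sketch, where the sample rows and the output filename are illustrative, not data from the real site:

```python
import csv

# Illustrative rows in the same (image URL, title, author) shape as listMsg
listMsg = [
    ("https://example.com/cover1.jpg", "Book One", "Author A"),
    ("https://example.com/cover2.jpg", "Book Two", "Author B"),
]

# newline="" prevents blank lines on Windows; utf-8 keeps non-ASCII titles intact
with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Image URL", "Book title", "Book author"])  # header row
    writer.writerows(listMsg)
```

This sidesteps the xlwt dependency entirely, and the resulting file still opens in Excel.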
Before implementing, you need a plan, and then you follow it step by step:
① Launch the browser
② Create a lightweight browser context
③ Open a new page
④ Load the target URL
import xlwt
from playwright.sync_api import sync_playwright


def run(playwright):
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://spa5.scrape.center/")
⑤ Use selectors to pick out the content you need
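The relative XPath patterns used in step ⑤ (`//a/img`, `//a/h3`, `//p`) can be exercised offline with the standard library's ElementTree, which supports a limited XPath subset (relative paths start with `.//` there). A sketch against an illustrative fragment, not the real page markup:

```python
import xml.etree.ElementTree as ET

# Illustrative fragment mimicking one book card on the listing page
card = ET.fromstring(
    "<div>"
    "<a href='/detail/1'><img src='/cover.jpg'/><h3>Book One</h3></a>"
    "<p> Author A </p>"
    "</div>"
)

# Same extraction logic as the scraper, using ElementTree's XPath subset
img_href = card.find(".//a/img").get("src")
book_name = card.find(".//a/h3").text
p = card.find(".//p")
book_author = "Anonymous" if p is None else p.text.replace(" ", "")

print(img_href, book_name, book_author)  # → /cover.jpg Book One AuthorA
```

Prototyping the XPath expressions on a static snippet like this avoids launching a browser on every iteration while you refine the selectors.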