For static websites, scraping data is relatively easy.
The approach: first inspect the site's layout. We want to scrape the Top 100 list, and the site shows only ten entries per page, so we build a urlList holding the ten target URLs. Next we look at a single page; its HTML is shown in the figure below. Once we know which tag holds which field, we can use BeautifulSoup's CSS selectors directly instead of writing greedy regular expressions. That's enough to start coding!
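To show the CSS-selector idea in isolation, here is a minimal sketch run against a made-up HTML fragment modeled on the structure of one list entry on the page (the fragment and its values are illustrative, not the site's actual response):

```python
from bs4 import BeautifulSoup

# A made-up fragment mirroring one <dd> entry of the Top 100 board
html = '''
<dd>
  <p class="name"><a href="/films/1203">霸王别姬</a></p>
  <p class="star">主演:张国荣,张丰毅,巩俐</p>
  <p class="releasetime">上映时间:1993-01-01</p>
</dd>
'''

soup = BeautifulSoup(html, 'html.parser')
# CSS attribute selectors pick out each field directly; no regex needed
name = soup.select('p[class=name]')[0].get_text().strip()
star = soup.select('p[class=star]')[0].get_text().strip()
print(name)  # 霸王别姬
print(star)  # 主演:张国荣,张丰毅,巩俐
```

The same `select()` calls appear in the full script below, only run against the live pages instead of a hard-coded string.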
Source code:
# -*- encoding:utf-8 -*-
import requests
from bs4 import BeautifulSoup
import xlwt

# Build the ten target URLs (each page shows ten entries)
urlList = []
for i in range(0, 100, 10):
    url = 'http://maoyan.com/board/4?offset=' + str(i)
    urlList.append(url)

# Set request headers; without a User-Agent the site detects the
# requests as a bot and returns no data
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134'
header = {
    'User-Agent': user_agent,
    'Host': 'maoyan.com'
}

title = []        # film titles
star = []         # lead actors
releaseTime = []  # release dates
score = []        # ratings
movies = [title, star, releaseTime, score]

for url in urlList:  # visit each page
    content = requests.get(url, headers=header).text
    soup = BeautifulSoup(content, 'lxml')
    stars = soup.select('p[class=star]')         # lead actors
    titles = soup.select('p[class=name]')        # film titles
    times = soup.select('p[class=releasetime]')  # release dates
    scores = soup.select('p[class=score]')       # ratings
    for t in titles:
        movies[0].append(t.get_text())
    for s in stars:
        movies[1].append(s.get_text().strip()[3:])   # drop the '主演:' prefix
    for rt in times:
        movies[2].append(rt.get_text().strip()[5:])  # drop the '上映时间:' prefix
    for sc in scores:
        movies[3].append(sc.get_text())

# Create an Excel file to store the scraped data
book = xlwt.Workbook(encoding='utf-8', style_compression=0)
sheet = book.add_sheet('猫眼', cell_overwrite_ok=True)
sheet.write(0, 0, '影片名')
sheet.write(0, 1, '主演')
sheet.write(0, 2, '上映时间')
sheet.write(0, 3, '评分')
for j in range(1, 101):
    sheet.write(j, 0, movies[0][j-1])
    sheet.write(j, 1, movies[1][j-1])
    sheet.write(j, 2, movies[2][j-1])
    sheet.write(j, 3, movies[3][j-1])
book.save('猫眼Top100.xls')
Result:
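Two small details in the script are easy to gloss over: the page offsets step by ten, and the `[3:]`/`[5:]` slices strip the Chinese field labels ('主演:' is three characters, '上映时间:' is five). A quick stdlib-only sketch, using sample text in the same format as the site's fields:

```python
# Offsets 0, 10, ..., 90 give the ten pages of the Top 100 board
urls = ['http://maoyan.com/board/4?offset=' + str(i) for i in range(0, 100, 10)]
print(len(urls))  # 10
print(urls[-1])   # http://maoyan.com/board/4?offset=90

# '主演:' is 3 characters, so [3:] leaves only the actor names;
# '上映时间:' is 5 characters, so [5:] leaves only the date
star_text = '主演:张国荣,张丰毅,巩俐'
time_text = '上映时间:1993-01-01'
print(star_text[3:])  # 张国荣,张丰毅,巩俐
print(time_text[5:])  # 1993-01-01
```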