Python crawler: scraping the Maoyan Movies Top 100 board

For a static website, scraping data with a crawler is relatively easy.

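Here is the approach. First, look at how the site is laid out: we want the Top 100 board, but each page shows only ten entries, so we build a urlList that holds the target URLs, one per page. Next, examine a single page and read its HTML, shown in the figure below, to learn which tag holds which piece of data. With that mapping in hand, the code can use BeautifulSoup's CSS selectors directly instead of hand-written greedy regular expressions. That is all we need to know before writing the code.

[Figure 1: HTML source of one board page]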
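The pagination idea above can be sketched on its own, without touching the network. This is only an illustration: the helper `build_page_urls`, the constant `BASE_URL`, and the parameters `total` and `page_size` are names made up for this sketch, and it assumes the `offset` query parameter is the index of the first entry on each page, as the board URL in this post suggests.

```python
from urllib.parse import urlencode

BASE_URL = 'http://maoyan.com/board/4'  # board URL used in this post


def build_page_urls(total=100, page_size=10):
    """Return one URL per page (hypothetical helper).

    Assumes `offset` is the index of the first entry shown on the page.
    """
    return [
        BASE_URL + '?' + urlencode({'offset': offset})
        for offset in range(0, total, page_size)
    ]


urls = build_page_urls()
print(len(urls))   # number of pages to request
print(urls[0])     # first page
print(urls[-1])    # last page
```

Keeping the page size as a parameter means the same helper still works if the board ever shows a different number of entries per page.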

Source code:

# -*- encoding:utf-8 -*-
import requests
from bs4 import BeautifulSoup
import xlwt
urlList = []
# Build the target URLs: one page per offset 0, 10, ..., 90
for i in range(0, 100, 10):
    url = 'http://maoyan.com/board/4?offset=' + str(i)
    urlList.append(url)
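# Set request headers with a browser User-Agent (this is not a proxy); without this step the site detects the crawler as malicious and returns no data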
user_agent = 'Mozilla/6.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134'
header = {
    'User-Agent': user_agent,
    'Host': 'maoyan.com'
}
title = []  # movie titles
star = []   # leading actors
releaseTime = []  # release dates
score = []  # ratings
movies = [title, star, releaseTime, score]
for url in urlList:   # visit each page
    content = requests.get(url, headers=header, timeout=10).text
    soup = BeautifulSoup(content, 'lxml')
    stars = soup.select('p[class=star]')  # leading actors
    titles = soup.select('p[class=name]')  # titles
    times = soup.select('p[class=releasetime]')  # release dates
    scores = soup.select('p[class=score]')  # ratings
    # Loop variables carry a _tag suffix so they do not shadow the lists above
    for title_tag in titles:
        movies[0].append(title_tag.get_text())
    for star_tag in stars:
        movies[1].append(star_tag.get_text().strip()[3:])  # drop the 3-character label prefix
    for time_tag in times:
        movies[2].append(time_tag.get_text().strip()[5:])  # drop the 5-character label prefix
    for score_tag in scores:
        movies[3].append(score_tag.get_text())
# Create an Excel file to store the scraped data
book = xlwt.Workbook(encoding='utf-8', style_compression=0)
sheet = book.add_sheet('猫眼', cell_overwrite_ok=True)
# Header row: title, leading actors, release date, rating
sheet.write(0, 0, '影片名')
sheet.write(0, 1, '主演')
sheet.write(0, 2, '上映时间')
sheet.write(0, 3, '评分')
# zip stops at the shortest list, so a page that returned fewer rows cannot raise IndexError
for j, row in enumerate(zip(*movies), start=1):
    for col, value in enumerate(row):
        sheet.write(j, col, value)
book.save('猫眼Top100.xls')
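The slices `[3:]` and `[5:]` in the source code drop a fixed number of characters, which would silently cut into real data if a page ever omitted the label. Below is a minimal sketch of a more defensive variant. It assumes the labels are `主演：` and `上映时间：`, which is inferred from the slice lengths rather than stated in the post; the helper `strip_label` and the sample strings are made up for illustration.

```python
def strip_label(text, label):
    """Remove `label` from the start of `text` if present (hypothetical helper).

    Text without the label is returned unchanged apart from whitespace trimming.
    """
    text = text.strip()
    if text.startswith(label):
        return text[len(label):].strip()
    return text


# Made-up sample strings, shaped like the text the selectors are assumed to return
sample_star = '\n                主演：演员甲,演员乙\n        '
sample_time = '上映时间：2000-01-01'

print(strip_label(sample_star, '主演：'))
print(strip_label(sample_time, '上映时间：'))
print(strip_label('2000-01-01', '上映时间：'))  # no label: returned unchanged
```

In the scraping loop this could stand in for the fixed slices, for example `strip_label(star.get_text(), '主演：')`, at the cost of spelling out each label.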

Result:

[Figure 2: output of the run]
