Web Scraping in Practice: Douban Movie Top 250

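Preparation

Page analysis

On the target page, press Ctrl+U to view the page source (or F12 to open the inspector). Douban serves the list as plain HTML, which makes it easy to scrape: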


<html lang="zh-CN" class="ua-windows ua-webkit">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <meta name="renderer" content="webkit">
    <meta name="referrer" content="always">
    <meta name="google-site-verification" content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" />
    <title>
豆瓣电影 Top 250
</title>
    
    <meta name="baidu-site-verification" content="cZdR4xxR7RxmM4zE" />
    <meta http-equiv="Pragma" content="no-cache">
    ......

Scrolling down to around line 330 of the source, we find the following code:

<ol class="grid_view">
        <li>
            <div class="item">
                <div class="pic">
                    <em class="">1</em>
                    <a href="https://movie.douban.com/subject/1292052/">
                        <img width="100" alt="肖申克的救赎" src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p480747492.webp" class="">
                    </a>
                </div>
                <div class="info">
                    <div class="hd">
                        <a href="https://movie.douban.com/subject/1292052/" class="">
                            <span class="title">肖申克的救赎</span>
                                    <span class="title"> / The Shawshank Redemption</span>
                                <span class="other"> / 月黑高飞(港)  /  刺激1995(台)</span>
                        </a>


                            <span class="playable">[可播放]</span>
                    </div>
                    <div class="bd">
                        <p class="">
                            导演: 弗兰克·德拉邦特 Frank Darabont   主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
                            1994 / 美国 / 犯罪 剧情
                        </p>


                        <div class="star">
                                <span class="rating5-t"></span>
                                <span class="rating_num" property="v:average">9.7</span>
                                <span property="v:best" content="10.0"></span>
                                <span>2304569人评价</span>
                        </div>

                            <p class="quote">
                                <span class="inq">希望让人自由。</span>
                            </p>


    <p>

        <span class="gact">
            <a href="https://movie.douban.com/wish/224683240/update?add=1292052" target="_blank" class="j a_collect_btn" name="sbtn-1292052-wish" rel="nofollow">想看</a>
        </span>

        <span class="gact">
            <a href="https://movie.douban.com/collection/224683240/update?add=1292052" target="_blank" class="j a_collect_btn" name="sbtn-1292052-collection" rel="nofollow">看过</a>
        </span>
    </p>

                    </div>
                </div>
            </div>
        </li>
        <li>
            <div class="item">
                <div class="pic">
                    <em class="">2</em>
                    <a href="https://movie.douban.com/subject/1291546/">
                        <img width="100" alt="霸王别姬" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2561716440.webp" class="">
                    </a>
                </div>
                <div class="info">
                    <div class="hd">
                        <a href="https://movie.douban.com/subject/1291546/" class="">
                            <span class="title">霸王别姬</span>
                                <span class="other"> / 再见,我的妾  /  Farewell My Concubine</span>
                        </a>


                            <span class="playable">[可播放]</span>
                    </div>
                    <div class="bd">
                        <p class="">
                            导演: 陈凯歌 Kaige Chen   主演: 张国荣 Leslie Cheung / 张丰毅 Fengyi Zha...<br>
                            1993 / 中国大陆 中国香港 / 剧情 爱情 同性
                        </p>


                        <div class="star">
                                <span class="rating5-t"></span>
                                <span class="rating_num" property="v:average">9.6</span>
                                <span property="v:best" content="10.0"></span>
                                <span>1709666人评价</span>
                        </div>

                            <p class="quote">
                                <span class="inq">风华绝代。</span>
                            </p>

This markup contains everything we want to scrape.

Now let's look at the HTML for a single movie, starting from its item block:

<div class="item">
                <div class="pic">
                    <em class="">1</em>
                    <a href="https://movie.douban.com/subject/1292052/">
                        <img width="100" alt="肖申克的救赎" src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p480747492.webp" class="">
                    </a>
                </div>
                <div class="info">
                    <div class="hd">
                        <a href="https://movie.douban.com/subject/1292052/" class="">
                            <span class="title">肖申克的救赎</span>
                                    <span class="title"> / The Shawshank Redemption</span>
                                <span class="other"> / 月黑高飞(港)  /  刺激1995(台)</span>
                        </a>


                            <span class="playable">[可播放]</span>
                    </div>
                    <div class="bd">
                        <p class="">
                            导演: 弗兰克·德拉邦特 Frank Darabont   主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
                            1994 / 美国 / 犯罪 剧情
                        </p>


                        <div class="star">
                                <span class="rating5-t"></span>
                                <span class="rating_num" property="v:average">9.7</span>
                                <span property="v:best" content="10.0"></span>
                                <span>2304569人评价</span>
                        </div>

                            <p class="quote">
                                <span class="inq">希望让人自由。</span>
                            </p>


    <p>

        <span class="gact">
            <a href="https://movie.douban.com/wish/224683240/update?add=1292052" target="_blank" class="j a_collect_btn" name="sbtn-1292052-wish" rel="nofollow">想看</a>
        </span>

        <span class="gact">
            <a href="https://movie.douban.com/collection/224683240/update?add=1292052" target="_blank" class="j a_collect_btn" name="sbtn-1292052-collection" rel="nofollow">看过</a>
        </span>
    </p>

                    </div>
                </div>
            </div>

From the HTML for The Shawshank Redemption (肖申克的救赎), the pieces we need to extract are:

<a href="https://movie.douban.com/subject/1292052/" class="">
<span class="title">肖申克的救赎</span>
<span class="title"> / The Shawshank Redemption</span>
<span class="other"> / 月黑高飞(港)  /  刺激1995(台)</span>
<span class="rating_num" property="v:average">9.7</span>
<span class="inq">希望让人自由。</span>

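With this analysis done, we can settle on an implementation plan.

Working out the approach

1. Build the page URLs. Each page lists 25 movies, so the start parameter advances in steps of 25: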

start_url = 'https://movie.douban.com/top250?start={:d}&filter='
size = 10
for i in range(size):
    url = start_url.format(i * 25)  # url is the link for page i
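
As a quick check of the pagination logic, the sketch below collects all ten URLs in a list. The helper name build_page_urls and its parameters are illustrative and not part of the original script.

```python
def build_page_urls(pages=10, page_size=25):
    """Return the URL of every page in the Top 250 list."""
    start_url = 'https://movie.douban.com/top250?start={:d}&filter='
    # The start parameter is the zero-based index of the first movie on a page.
    return [start_url.format(i * page_size) for i in range(pages)]


urls = build_page_urls()
print(urls[0])    # first page, start=0
print(urls[-1])   # last page, start=225
```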

2. Fetch each page with requests.get(), sending a browser-like User-Agent header:

headers = {
     'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'}
html = requests.get(url, headers=headers)

Requests can fail, so we check the status code to make sure only valid data is passed on to the next step:

if html.status_code == 200:
    # continue with the next step
    pass
else:
    print("error!!!")

3. The raw text returned by get() can be awkward to process directly, so we parse it with BeautifulSoup (although for this page it is not strictly necessary):

soup = BeautifulSoup(html.text, 'html.parser')
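
To make the parsing step concrete, here is a small sketch that runs BeautifulSoup on a hand-trimmed copy of the markup from the page analysis instead of a live page. The sample string is an assumption made for the demo; the find_all call is the same kind the script uses to locate each movie.

```python
from bs4 import BeautifulSoup

# A hand-trimmed fragment of the page source from the analysis section.
sample = '''
<div class="item">
  <div class="info">
    <div class="hd">
      <a href="https://movie.douban.com/subject/1292052/" class="">
        <span class="title">肖申克的救赎</span>
      </a>
    </div>
    <div class="bd">
      <span class="rating_num" property="v:average">9.7</span>
    </div>
  </div>
</div>
'''

soup = BeautifulSoup(sample, 'html.parser')
infos = soup.find_all(name='div', class_='info')  # one entry per movie
first = infos[0]
print(len(infos))
print(first.find('a')['href'])
print(first.find('span', class_='title').get_text())
print(first.find('span', class_='rating_num').get_text())
```

Reading attributes and text directly like this is an alternative to regular expressions; the article itself sticks with regular expressions for the extraction step.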

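4. Now we can use regular expressions to clean the data. Since the requirements were settled above, we extract each field in turn. In the snippets below, test stands for the string form of the tag being parsed.

4.1 Match the URL

res = r'^[\[a-z<="\s]*href="([^"]*)"'  # capture only the value of href
m = re.match(res, test)
url = m.group(1) if m else 'None'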

4.2 Match the title

res = r'^<span[^>]*>(.*)</span>$'  # text between the opening and closing span tags
m = re.match(res, test)
movie_name = m.group(1) if m else 'None'

4.3 Match the rating

res = r'^[\[=a-z"<>\s:_]*(.*)</span>]$'  # skip the opening tag, capture up to the closing tag
m = re.match(res, test)
rating = m.group(1) if m else 'None'

4.4 Match the one-line quote

res = r'^[\[=a-z"<>\s]*(.*)</span>]$'  # same idea as the rating pattern
m = re.match(res, test)
inq = m.group(1) if m else 'None'
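
To see the four extractions at work, the sketch below applies them to tag strings taken from The Shawshank Redemption entry. The patterns and the helper name first_group are assumptions written for this demo; they target the same fields as 4.1 to 4.4, and the inputs imitate what str() of a BeautifulSoup result looks like.

```python
import re


def first_group(pattern, text):
    """Return group 1 of the match, or 'None' when the pattern does not match."""
    m = re.match(pattern, text)
    return m.group(1) if m else 'None'


# Strings shaped like the tags listed in the page analysis.
a_tag = '[<a href="https://movie.douban.com/subject/1292052/" class="">'
title_tag = '<span class="title">肖申克的救赎</span>'
rating_tag = '[<span class="rating_num" property="v:average">9.7</span>]'
inq_tag = '[<span class="inq">希望让人自由。</span>]'
no_inq = '[]'  # a movie without a quote yields an empty result list

span_list = r'^\[<span[^>]*>(.*)</span>]$'  # a one-element list holding a span

url = first_group(r'^\[<a[^>]*href="([^"]*)"', a_tag)
movie_name = first_group(r'^<span[^>]*>(.*)</span>$', title_tag)
rating = first_group(span_list, rating_tag)
inq = first_group(span_list, inq_tag)
missing = first_group(span_list, no_inq)

print(url, movie_name, rating, inq, missing)
```

The last case shows why the fallback matters: not every movie has a quote, and the empty list then falls through to 'None' instead of raising an error.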

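5. At this point the scraped data is available in the program's output, but to make it easier to read and keep, we save it to a file. For simplicity, we write it straight to CSV: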

with open('res.csv', 'w', encoding='utf-8', newline='') as f:  # newline='' avoids blank rows on Windows
    writer = csv.writer(f)
    writer.writerow(['Title', 'Rating', 'Quote', 'URL'])
    for i in res:
        writer.writerow(i)
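
The csv module documentation recommends opening the output file with newline='' so that no extra blank rows appear on Windows. The sketch below writes two sample rows to an in-memory buffer instead of a file so the result can be inspected directly; the sample rows and the English column names are illustrative.

```python
import csv
import io

rows = [
    ('肖申克的救赎', '9.7', '希望让人自由。', 'https://movie.douban.com/subject/1292052/'),
    ('霸王别姬', '9.6', '风华绝代。', 'https://movie.douban.com/subject/1291546/'),
]

buffer = io.StringIO(newline='')  # behaves like a file opened with newline=''
writer = csv.writer(buffer)
writer.writerow(['Title', 'Rating', 'Quote', 'URL'])
writer.writerows(rows)  # write all data rows at once

# Read the text back to confirm the round trip.
parsed = list(csv.reader(io.StringIO(buffer.getvalue(), newline='')))
print(parsed[0])
print(len(parsed) - 1, "data rows")
```

If the file is meant to be opened in Excel, encoding='utf-8-sig' is a common choice so that the Chinese text is detected correctly.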

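Implementation

The preparation stage fixed the whole scraping workflow, so all that remains is to write the code.

1. Write getHtmlDiv(url), which fetches a page and returns the parsed div tags: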

def getHtmlDiv(url):
    headers = {
     'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'}
    html = requests.get(url, headers=headers)
    if html.status_code == 200:
        soup = BeautifulSoup(html.text, 'html.parser')
        return soup.find_all(name='div', class_='info')
    else:
        print(html.status_code)
        return []  # nothing to parse, so callers can still iterate safely

2. Write writeToCSV(res, filename), which saves the results:

def writeToCSV(res, filename):
    with open(filename, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Title', 'Rating', 'Quote', 'URL'])
        for i in res:
            writer.writerow(i)

3. Write getRes(ans, url), which pulls the fields we need out of each div:

def getRes(ans, url):
    div = getHtmlDiv(url) or []  # treat a failed fetch as an empty page
    for i in range(len(div)):
        # first line of the stringified <a> list holds the opening tag with href
        s_url = str(div[i].find_all(name='a')).split('\n')[0]
        res_url = r'^[\[a-z<="\s]*href="([^"]*)"'
        if re.match(res_url, s_url):
            movie_url = re.match(res_url, s_url).group(1)
        else:
            movie_url = "None"

        s_title_span = div[i].find_all(name='span', class_='title') + div[i].find_all(name='span', class_='other')
        res_title = r'^<span[^>]*>(.*)</span>$'
        movie_name = ''
        for j in range(len(s_title_span)):
            if re.match(res_title, str(s_title_span[j])):
                m = re.match(res_title, str(s_title_span[j])).group(1)
                m = ''.join(m.split())  # strip all whitespace, including \xa0
            else:
                m = "None"
            movie_name += m

        s_rating = str(div[i].find_all(name='span', class_='rating_num'))
        res_rating = r'^[\[=a-z"<>\s:_]*(.*)</span>]$'
        if re.match(res_rating, s_rating):
            movie_rating = re.match(res_rating, s_rating).group(1)
        else:
            movie_rating = "None"

        s_inq = str(div[i].find_all(name='span', class_='inq'))
        res_inq = r'^[\[=a-z"<>\s]*(.*)</span>]$'
        if re.match(res_inq, s_inq):
            movie_inq = re.match(res_inq, s_inq).group(1)
        else:
            movie_inq = "None"

        item = (movie_name, movie_rating, movie_inq, movie_url)
        ans.append(item)
    return ans
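
One detail in getRes deserves a closer look: ''.join(m.split()) removes every whitespace character, not only the \xa0 (non-breaking space) around the slashes, so the spaces inside English titles disappear as well. The sketch below compares it with ' '.join(m.split()), which collapses whitespace to single spaces instead. The sample string assumes the second title span contains \xa0 on both sides of the slash.

```python
raw = '\xa0/\xa0The Shawshank Redemption'  # assumed text of the second title span

squeezed = ''.join(raw.split())   # drops all whitespace
spaced = ' '.join(raw.split())    # keeps one space between words

print(squeezed)  # /TheShawshankRedemption
print(spaced)    # / The Shawshank Redemption
```

Which form to use depends on how the title column should read; the script keeps the first.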

4. Write main():

def main():
    start_url = "https://movie.douban.com/top250?start={:d}&filter="
    size = 10
    res = []
    for i in range(size):
        try:
            url = start_url.format(i * 25)
            res = getRes(res, url)
        except Exception as e:  # a bare except would also swallow KeyboardInterrupt
            print("main() error! i = %d (%s)" % (i, e))
            continue
    writeToCSV(res, 'test.csv')

5. Full source

# -*- coding: utf-8 -*-
# author:Egoist
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
import re
import csv


def getHtmlDiv(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'}
    # send a browser-like User-Agent so the request is less likely to be flagged by anti-scraping checks
    html = requests.get(url, headers=headers)  # fetch the page
    if html.status_code == 200:  # 200 means success; 403 may mean a temporary ban that clears after a while
        soup = BeautifulSoup(html.text, 'html.parser')  # parse the page (slightly redundant in this project)
        return soup.find_all(name='div', class_='info')  # return the info div of every movie
    else:
        print(html.status_code)  # print the status code
        return []  # nothing to parse for this page


def writeToCSV(res, filename):
    # newline='' avoids blank rows on Windows
    with open(filename, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)  # CSV writer
        writer.writerow(['Title', 'Rating', 'Quote', 'URL'])  # header row
        for i in res:  # write the data rows one by one
            writer.writerow(i)


def getRes(ans, url):
    div = getHtmlDiv(url)  # info divs for this page
    for i in tqdm(range(len(div))):  # loop over the movies
        s_url = str(div[i].find_all(name='a')).split('\n')[0]  # opening <a> tag containing the url
        res_url = r'^[\[a-z<="\s]*href="([^"]*)"'  # regular expression that captures the href value
        if re.match(res_url, s_url):  # fall back to "None" when nothing matches
            movie_url = re.match(res_url, s_url).group(1)
        else:
            movie_url = "None"

        s_title_span = div[i].find_all(name='span', class_='title') + div[i].find_all(name='span', class_='other')
        res_title = r'^<span[^>]*>(.*)</span>$'  # text between the span tags
        movie_name = ''
        for j in range(len(s_title_span)):
            if re.match(res_title, str(s_title_span[j])):
                m = re.match(res_title, str(s_title_span[j])).group(1)
                m = ''.join(m.split())  # strip all whitespace, including \xa0
            else:
                m = "None"
            movie_name += m

        s_rating = str(div[i].find_all(name='span', class_='rating_num'))
        res_rating = r'^[\[=a-z"<>\s:_]*(.*)</span>]$'
        if re.match(res_rating, s_rating):
            movie_rating = re.match(res_rating, s_rating).group(1)
        else:
            movie_rating = "None"

        s_inq = str(div[i].find_all(name='span', class_='inq'))
        res_inq = r'^[\[=a-z"<>\s]*(.*)</span>]$'
        if re.match(res_inq, s_inq):
            movie_inq = re.match(res_inq, s_inq).group(1)
        else:
            movie_inq = "None"

        item = (movie_name, movie_rating, movie_inq, movie_url)  # store the result as a tuple
        ans.append(item)  # add this movie to the result list
    return ans  # return the results


def main():
    start_url = "https://movie.douban.com/top250?start={:d}&filter="  # URL template
    size = 10  # number of pages
    res = []  # collected data
    for i in range(size):
        try:
            url = start_url.format(i * 25)  # URL of page i
            res = getRes(res, url)  # accumulate the data
        except Exception as e:  # keep going if one page fails
            print("main() error! i = %d (%s)" % (i, e))
            continue
    for i in res:
        print(i)
    writeToCSV(res, 'test.csv')  # write the results


if __name__ == '__main__':
    main()  # entry point

Checking the results

A .csv file (test.csv) now appears in the folder that contains the script; it holds the scraped results.
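
A quick way to confirm the run worked is to read the file back and count the rows. The sketch below uses a hypothetical helper, count_movies, which is not part of the original script. It writes a tiny sample file first so that it can run on its own; in practice you would point it at test.csv and expect 250 rows.

```python
import csv
import os
import tempfile


def count_movies(path):
    """Return the number of data rows in the CSV, excluding the header."""
    with open(path, encoding='utf-8', newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(1 for row in reader if row)  # ignore blank rows


# Build a small sample file so the example is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(sample_path, 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Rating', 'Quote', 'URL'])
    writer.writerow(['肖申克的救赎', '9.7', '希望让人自由。',
                     'https://movie.douban.com/subject/1292052/'])

count = count_movies(sample_path)
print(count)
```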
