3. Crawler Basics: Batch-Crawling Images


1. Regular Expressions


Metacharacters (match a single character)

. [] \d \D \s \S

Quantifiers (repetition modifiers)

* + ? {m} {m,n} {m,}

Boundary anchors

^ $ \A \B
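A quick sketch tying the metacharacters, quantifiers, and anchors above together; the sample strings are made up for illustration:

import re

# \d digit, \D non-digit, \s whitespace, \S non-whitespace, . any char except newline
re.findall(r'\d', 'a1b22c')       # ['1', '2', '2']
re.findall(r'\d{2}', 'a1b22c')    # ['22']        {m} repeats exactly m times
re.findall(r'\d+', 'a1b22c')      # ['1', '22']   + repeats one or more times
re.findall(r'[abc]', 'a1b22c')    # ['a', 'b', 'c']   [] is a character class

# ^ and $ pin the match to the start/end of the string (\A pins to the very start)
re.match(r'^a\d+$', 'a123')       # matches
re.match(r'^a\d+$', 'a123x')      # None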

Greedy matching

.*

Non-greedy (lazy) matching

.*?
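The difference between the two shows up clearly on HTML-like text; the sample string is made up for illustration:

import re

html = '<a href="1.jpg"></a><a href="2.jpg"></a>'

# Greedy: .* grabs as much as possible, so the match runs to the last closing quote
re.findall(r'href="(.*)"', html)    # ['1.jpg"></a><a href="2.jpg']

# Non-greedy: .*? stops at the first closing quote, giving one URL per match
re.findall(r'href="(.*?)"', html)   # ['1.jpg', '2.jpg']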

Pattern modifiers (flags)

re.S  single-line mode (DOTALL): . also matches newlines

re.M  multi-line mode: ^ and $ match at every line

re.I  ignore case
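A short sketch of what each flag changes; the sample text is made up:

import re

text = 'First line\nsecond LINE'

# re.S (DOTALL): . also matches newlines, so a pattern can span lines
re.findall(r'First(.*)LINE', text)           # []  because . stops at \n
re.findall(r'First(.*)LINE', text, re.S)     # [' line\nsecond ']

# re.M (MULTILINE): ^ and $ match at every line, not just the whole string
re.findall(r'^\w+', text, re.M)              # ['First', 'second']

# re.I: ignore case
re.findall(r'line', text, re.I)              # ['line', 'LINE']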


2. XPath Syntax


Hierarchical positioning: locate tags by their level/nesting relationship (/ for a direct child, // for any depth)

Attribute positioning: locate tags by their attributes, e.g. [@class="..."] (both are shown in the sketch below)
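A minimal lxml sketch of both kinds of positioning; the HTML snippet is made up for illustration:

from lxml import etree

html = etree.HTML('<div class="pic"><ul><li><img src="1.jpg"/></li></ul></div>')

# Hierarchical positioning: walk down the tag levels (/ one level, // any depth)
html.xpath('//div/ul/li/img/@src')            # ['1.jpg']

# Attribute positioning: pick a tag by an attribute value with [@attr="value"]
html.xpath('//div[@class="pic"]//img/@src')   # ['1.jpg']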


4. Code to Batch-Crawl Images from Meizitu


from time import sleep
from urllib import request, parse
import re

# Helper: build a Request for the given gallery page (and picture number)
def handler_url(url, page, num):
    # The first picture lives at <url><page>; later ones at <url><page>/<num>
    if num == 1:
        page_url = url + str(page)
    else:
        page_url = url + str(page) + '/' + str(num)
    # Request headers: a browser User-Agent plus a Referer so the request looks like a normal visit
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
        'Referer': 'https://www.baidu.com/link?url=dORiYkjnb0AkMxSoE4UzQYAiVlhvcutBR6sSxgYQY-y&wd=&eqid=961cc7e80003f1a6000000065bd05902'
    }
    return request.Request(url=page_url, headers=headers)

# Helper: send the request and return the decoded page source
def request_data(req):
    res = request.urlopen(req)
    return res.read().decode('utf-8')

# Helper: parse the HTML and pull out the image URLs
def anylasis(html):
    # Regex for the image URLs. NOTE: the pattern in the original post was cut off
    # when it was published; this <img ... src="..."> pattern is a reconstructed
    # guess and may need adjusting to the real page markup.
    pat = re.compile(r'<img.*?src="(.*?)".*?>', re.S)
    return pat.findall(html)
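The post breaks off before the download step, so the driver below is only a sketch of how the three helpers could be wired together; the base URL, page/picture ranges, file names, and the download helper itself are assumptions, not the original author's code:

# Hypothetical driver: the base URL, ranges, and file names below are assumed values
def download(img_url, filename):
    headers = {'User-Agent': 'Mozilla/5.0', 'Referer': img_url}
    req = request.Request(url=img_url, headers=headers)
    with open(filename, 'wb') as f:
        f.write(request.urlopen(req).read())

if __name__ == '__main__':
    base_url = 'https://www.mzitu.com/'     # assumed gallery base URL
    for page in range(100000, 100002):      # assumed gallery ids
        for num in range(1, 4):             # assumed picture index within a gallery
            req = handler_url(base_url, page, num)
            html = request_data(req)
            for i, img_url in enumerate(anylasis(html)):
                download(img_url, '%s_%s_%s.jpg' % (page, num, i))
            sleep(1)                        # be polite: pause between requests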
