Python Crawler Series: Getting Our Hands Dirty (Part 4)

High-Performance Asynchronous Crawlers

  • Synchronous crawler
  • Asynchronous crawler
    • Thread pool principle
    • Hands-on example

Synchronous Crawler

Example: the URLs are fetched in a blocking fashion, so the next image download starts only after the previous one finishes. The whole script runs on a single thread.

import requests

header = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)'}

urls = [
    'http://pic.netbian.com/uploads/allimg/210122/195550-1611316550d711.jpg',
    'http://pic.netbian.com/uploads/allimg/180803/084010-15332568107994.jpg',
    'http://pic.netbian.com/uploads/allimg/190415/214606-15553359663cd8.jpg'
]

# Helper: fetch the raw bytes of a url
def get_content(url):
    print('crawling', url)
    response = requests.get(url=url, headers=header)
    if response.status_code == 200:
        return response.content

def parse_content(content):
    print('length of response data:', len(content))

for url in urls:
    content = get_content(url)
    if content:  # get_content returns None on a non-200 status
        parse_content(content)
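To see what the sequential approach costs, you can time each blocking step. A minimal sketch, with time.sleep standing in for the real network call so it runs offline (the file names are made up):

```python
import time

def fake_get_content(url):
    time.sleep(0.5)        # stands in for the blocking requests.get call
    return b'x' * 1024

fake_urls = ['img1.jpg', 'img2.jpg', 'img3.jpg']
start = time.time()
for url in fake_urls:
    content = fake_get_content(url)
elapsed = time.time() - start
# Three sequential 0.5 s downloads cost about 1.5 s in total
print(f'elapsed: {elapsed:.2f}s')
```

The total time is the sum of every download; nothing overlaps, which is exactly what the asynchronous approaches below avoid.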

Asynchronous Crawler

Approaches:

  • Multithreading / multiprocessing (not recommended):
    • Benefit: each blocking operation can get its own thread or process, so blocking work runs asynchronously
    • Drawback: you cannot spawn threads or processes without limit
  • Thread pools / process pools (use in moderation):
    • Benefit: reduces the cost of creating and destroying threads or processes, which keeps system overhead down
    • Drawback: the number of threads or processes in the pool is capped
  • Single thread + coroutines (recommended):
    • event_loop: the event loop, effectively an infinite loop; functions registered on it are executed by the loop once their conditions are met
    • coroutine: a coroutine object that can be registered on the event loop, which then drives it. A method defined with the async keyword does not run when called; the call returns a coroutine object instead
    • task: a further wrapper around a coroutine object that also tracks the task's state
    • future: represents work that will run or has not yet run; in practice there is no essential difference from a task
    • async: defines a coroutine
    • await: suspends execution at a blocking call
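The pieces above fit together like this. A runnable sketch (the url is a placeholder, and the coroutine just returns a string instead of doing real I/O):

```python
import asyncio

async def request(url):
    # 'async def' means calling request(...) only builds a coroutine object
    return 'response from ' + url

c = request('www.example.com')       # coroutine object; nothing has run yet
loop = asyncio.new_event_loop()
task = loop.create_task(c)           # task: the coroutine plus state tracking
print('done before run:', task.done())
loop.run_until_complete(task)        # drive the loop until the task finishes
print('done after run:', task.done())
print(task.result())
loop.close()
```

Before the loop runs, the task's state is pending (`done()` is False); afterwards it is finished and `result()` returns the coroutine's return value.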

Note

Inside coroutines, the aiohttp module must replace requests for network requests: requests.get is synchronous and would block the entire event loop.
Code:

import aiohttp
# Use the ClientSession class from this module
async def get_page(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url=url) as response:
            # text() returns the response body as a string
            # read() returns the response body as bytes
            # json() returns a JSON object
            # Note: always use await before reading the response body
            page_text = await response.text()
            print(page_text)
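The payoff of awaitable I/O is that many waits overlap on one thread. A sketch that simulates each request with asyncio.sleep (so it runs without aiohttp or a network), gathering ten "requests" onto one event loop:

```python
import asyncio
import time

async def fake_request(url):
    await asyncio.sleep(0.5)   # stands in for an awaited aiohttp call
    return url

async def crawl(urls):
    # gather schedules all coroutines concurrently on the running loop
    return await asyncio.gather(*(fake_request(u) for u in urls))

start = time.time()
results = asyncio.run(crawl([f'url{i}' for i in range(10)]))
elapsed = time.time() - start
# Ten overlapping 0.5 s waits take about 0.5 s total, not 5 s
print(f'{len(results)} pages in {elapsed:.2f}s')
```

With real aiohttp calls the structure is the same: replace fake_request with the get_page coroutine above and pass the real URLs to gather.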

Instantiating a thread pool object in Python

from multiprocessing.dummy import Pool
# Instantiate a thread pool object
pool = Pool(4)
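Despite living under multiprocessing, multiprocessing.dummy.Pool is a thread pool with the familiar Pool API. A sketch of pool.map dispatching blocking work across 4 threads (the "downloads" are simulated with time.sleep and made-up file names):

```python
import time
from multiprocessing.dummy import Pool  # thread pool, despite the module name

def blocking_download(url):
    time.sleep(0.5)          # stands in for a blocking network call
    return url + ' saved'

urls = ['a.mp4', 'b.mp4', 'c.mp4', 'd.mp4']
start = time.time()
with Pool(4) as pool:
    # map blocks until every job finishes and returns results in input order
    results = pool.map(blocking_download, urls)
elapsed = time.time() - start
# Four 0.5 s jobs on four threads finish in roughly 0.5 s, not 2 s
print(results, f'{elapsed:.2f}s')
```

This is the same pattern the hands-on example below uses with saveVideo.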

Coroutines:

import asyncio

async def request(url):
    print('requesting url:', url)
# Calling a function defined with async returns a coroutine object
c = request('www.wzc.com')
# Create an event loop object
loop = asyncio.get_event_loop()
# Register the coroutine object on the event loop and start the loop
loop.run_until_complete(c)
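For reference, on Python 3.10+ calling asyncio.get_event_loop() outside a running loop is deprecated; the same example can be written with asyncio.run. A sketch that also attaches a done-callback, showing the task state machine in action (the callback and return value are additions for illustration):

```python
import asyncio

async def request(url):
    print('requesting url:', url)
    return url

def on_done(task):
    # Fires once the task finishes; task.result() holds the return value
    print('finished:', task.result())

async def main():
    task = asyncio.create_task(request('www.wzc.com'))
    task.add_done_callback(on_done)
    return await task

result = asyncio.run(main())
```

asyncio.run creates the loop, runs the coroutine to completion, and closes the loop, replacing the get_event_loop / run_until_complete boilerplate.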

Thread Pool Principle

A thread pool should be reserved for operations that are blocking and time-consuming.

Hands-on example

Target site: https://www.pearvideo.com/category_8

Note: everything up to the individual video page is basic scraping. On the video page itself the video URL is loaded dynamically, so the data has to be fetched through the site's Ajax endpoint.

When requesting the Ajax endpoint, extra parameters are required, and a Referer header must be added:

post_url = 'https://www.pearvideo.com/videoStatus.jsp'
data = {
    'contId': id_,
    'mrd': str(random.random()),
}
ajax_headers = {
    'User-Agent': random.choice(user_agent_list),
    'Referer': 'https://www.pearvideo.com/video_' + id_
}
response = requests.post(post_url, data, headers=ajax_headers)

Code

import requests
from lxml import etree
import random
import os
import time
from multiprocessing.dummy import Pool

user_agent_list = [
            'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
            'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
            'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
            'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
            'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
            'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
            'Opera/8.0 (Windows NT 5.1; U; en)',
            'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
            'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
            'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
            'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
            'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)',
            'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
        ]

# Fix up the obfuscated video url and return the real playable path
def videoUrlDeal(video_url, id_):
    # Build the real url with string surgery
    video_true_url = ''
    s_list = str(video_url).split('/')
    for i in range(0, len(s_list)):
        if i < len(s_list) - 1:
            video_true_url += s_list[i] + '/'
        else:
            ss_list = s_list[i].split('-')
            for j in range(0, len(ss_list)):
                if j == 0:
                    video_true_url += 'cont-' + id_ + '-'
                elif j == len(ss_list) - 1:
                    video_true_url += ss_list[j]
                else:
                    video_true_url += ss_list[j] + '-'
    return video_true_url

def testPost(id_):
    post_url = 'https://www.pearvideo.com/videoStatus.jsp'
    data = {
        'contId': id_,
        'mrd': str(random.random()),
    }
    ajax_headers = {
        'User-Agent': random.choice(user_agent_list),
        'Referer': 'https://www.pearvideo.com/video_' + id_
    }
    response = requests.post(post_url, data, headers=ajax_headers)
    page_json = response.json()
    # print(page_json['videoInfo']['videos']['srcUrl'])
    return videoUrlDeal(page_json['videoInfo']['videos']['srcUrl'], id_)

# Save a video to disk
def saveVideo(data):
    true_url = data[0]
    videoTitle = data[1]
    content = requests.get(url=true_url, headers=header).content
    with open('./video/' + videoTitle + '.mp4', 'wb') as fp:
        fp.write(content)
        print(true_url, videoTitle, 'saved')

if __name__ == '__main__':
    # Create a folder to hold all the downloaded videos
    if not os.path.exists('./video'):
        os.mkdir('./video')
    header = {
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)'}
    url = 'https://www.pearvideo.com/category_8'

    # Request the listing url and parse out each video's detail-page url and title
    response = requests.get(url=url, headers=header)
    tree = etree.HTML(response.text)
    li_list = tree.xpath('//ul[@class="category-list clearfix"]/li')
    true_url_list = []
    for li in li_list:
        videoTitle = li.xpath('./div[@class="vervideo-bd"]/a/div[@class="vervideo-title"]/text()')[0]
        videoHref = 'https://www.pearvideo.com/' + li.xpath('./div[@class="vervideo-bd"]/a/@href')[0]

        # Request the detail page (note: this page has since changed)
        # videoText = requests.get(url=videoHref, headers=header).text
        # Parse the video address via the id embedded in the detail-page url
        video_id = videoHref.split('_')[1]  # renamed to avoid shadowing the built-in id
        true_url_list.append((testPost(video_id), videoTitle))
    # print(true_url_list)

    # Instantiate a thread pool and save the videos on multiple threads
    pool = Pool(5)
    pool.map(saveVideo, true_url_list)
    pool.close()
    pool.join()
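The string surgery in videoUrlDeal replaces the first dash-separated segment of the file name (a timestamp) with cont-<id>. The same transform can be written more compactly with a regex. A sketch; the sample srcUrl below is made up to match the shape the API returned at the time of writing:

```python
import re

def video_url_deal(video_url, id_):
    # Replace the leading digit run of the file name with 'cont-<id>'
    head, _, tail = video_url.rpartition('/')
    fixed = re.sub(r'^\d+', 'cont-' + id_, tail, count=1)
    return head + '/' + fixed

fake = 'https://video.pearvideo.com/mp4/adshort/20210101/1609490000000-15500000_adpkg-ad_hd.mp4'
print(video_url_deal(fake, '1720000'))
# -> https://video.pearvideo.com/mp4/adshort/20210101/cont-1720000-15500000_adpkg-ad_hd.mp4
```

For srcUrls of this shape the result matches the original loop-based version, with less index bookkeeping.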

Crawl results: the videos end up in the local ./video directory (screenshots omitted).
