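Concept | Definition |
---|---|
Blocking | The program is suspended while it waits for a resource or result it needs; during that time it cannot do anything else |
Non-blocking | The program is not suspended while it waits for a resource or result; during that time it can still do other work |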
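
As a small illustration of the two definitions, the sketch below uses the standard library's `queue.Queue`, chosen here only as a convenient stand-in for "a resource that is not ready yet". The helper names `blocking_get` and `non_blocking_get` are hypothetical.

```python
import queue
import time

q = queue.Queue()  # an empty thread-safe queue: nothing is available yet


def blocking_get(timeout):
    """Blocking call: the caller is suspended until an item arrives or the timeout expires."""
    start = time.perf_counter()
    try:
        q.get(timeout=timeout)
    except queue.Empty:
        pass
    return time.perf_counter() - start


def non_blocking_get():
    """Non-blocking call: returns at once, so the caller is free to do other work."""
    start = time.perf_counter()
    try:
        q.get_nowait()
    except queue.Empty:
        pass
    return time.perf_counter() - start


print(f"blocking wait:     {blocking_get(0.2):.2f}s")
print(f"non-blocking wait: {non_blocking_get():.2f}s")
```

The blocking call spends the whole timeout doing nothing, while the non-blocking call reports "not ready" immediately and leaves the decision about what to do next to the caller.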
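
Concept | Definition | Scenario |
---|---|---|
Synchronous | Program units that must coordinate through some form of communication while carrying out a task are said to execute synchronously | Updating product inventory in a shopping system, with a row lock as the communication signal |
Asynchronous | Program units that can complete a task without communicating or coordinating along the way; unrelated program units can run asynchronously | A web crawler |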
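
The inventory scenario in the table can be sketched with threads. This is only an analogy: a `threading.Lock` stands in for the database row lock, and the names `stock`, `stock_lock` and `buy` are made up for the example.

```python
import threading

stock = 100                    # hypothetical inventory count
stock_lock = threading.Lock()  # stands in for the row lock


def buy(times):
    """Try to buy one item `times` times, coordinating through the lock."""
    global stock
    for _ in range(times):
        with stock_lock:       # only one thread may update the stock at a time
            if stock > 0:
                stock -= 1


# 4 buyers x 30 attempts = 120 attempts against 100 items
threads = [threading.Thread(target=buy, args=(30,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(stock)  # 0
```

Because every update happens while holding the lock, the stock never drops below zero no matter how the threads are interleaved.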
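A coroutine is a lightweight thread that is scheduled in user space.
1. Because of CPython's GIL, only one thread executes Python bytecode at a time, so running multiple threads means frequently acquiring and releasing the lock and switching threads, which greatly reduces concurrency performance.
2. Coroutines run inside a single thread, so there is no thread context-switch overhead and no need for atomic operations or locks.
3. When the scheduler switches coroutines, it saves the register context and stack elsewhere and restores them when it switches back. This is much cheaper than a thread switch, which improves concurrency performance considerably.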
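
The idea in point 3, that a suspended coroutine keeps its state and resumes where it left off, can be shown with a plain Python generator. This is a rough analogy rather than what greenlet or gevent do internally: CPython keeps the generator's frame alive instead of copying registers and a stack. The name `counter` is hypothetical.

```python
def counter():
    """Generator-based coroutine: each yield suspends it and keeps its local state alive."""
    total = 0
    while True:
        step = yield total   # suspend here; resume when send() is called
        total += step


a = counter()
b = counter()
next(a)  # run each coroutine up to its first yield
next(b)

# Switch back and forth; each coroutine resumes exactly where it stopped.
results = [a.send(1), b.send(10), a.send(2), b.send(20)]
print(results)  # [1, 10, 3, 30]
```

Each generator keeps its own `total` across switches, and all of this happens in one thread without any lock.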
Switching coroutines manually with greenlet
from greenlet import greenlet
import time


def test1():
    while True:
        print("---A--")
        gr2.switch()  # hand control over to gr2
        time.sleep(0.5)


def test2():
    while True:
        print("---B--")
        gr1.switch()  # hand control back to gr1
        time.sleep(0.5)


gr1 = greenlet(test1)
gr2 = greenlet(test2)
# Manually switch into gr1 to start running
gr1.switch()
Output (the loop repeats until the program is interrupted)
---A--
---B--
---A--
---B--
---A--
---B--
---A--
---B--

import gevent
import random
import time


def coroutine_work(coroutine_name):
    for i in range(10):
        print(coroutine_name, i)
        # gevent does not treat time.sleep as a point where it can switch greenlets
        time.sleep(random.random())


gevent.joinall([
    gevent.spawn(coroutine_work, "work1"),
    gevent.spawn(coroutine_work, "work2")
])
Result: the two greenlets run one after the other, because time.sleep blocks the whole thread and never yields control to gevent
work1 0
work1 1
work1 2
work1 3
work1 4
work1 5
work1 6
work1 7
work1 8
work1 9
work2 0
work2 1
work2 2
work2 3
work2 4
work2 5
work2 6
work2 7
work2 8
work2 9

import gevent
import random


def coroutine_work(coroutine_name):
    for i in range(10):
        print(coroutine_name, i)
        # Simulate a time-consuming operation; gevent.sleep yields to other greenlets
        gevent.sleep(random.random())


gevent.joinall([
    gevent.spawn(coroutine_work, "work1"),
    gevent.spawn(coroutine_work, "work2")
])
Output: the two greenlets alternate. The exact interleaving depends on the random sleep durations and differs from run to run; only the first two lines are fixed:
work1 0
work2 0

from gevent import monkey
import gevent
import random
import time

# When the program contains blocking calls that gevent does not know about,
# monkey patching replaces them with gevent's cooperative versions.
monkey.patch_all()


def coroutine_work(coroutine_name):
    for i in range(10):
        print(coroutine_name, i)
        time.sleep(random.random())


gevent.joinall([
    gevent.spawn(coroutine_work, "work1"),
    gevent.spawn(coroutine_work, "work2")
])
Output: the two greenlets alternate
work1 0
work2 0
work1 1
work2 1
work2 2
work1 2
work1 3
work1 4
work2 3
work2 4
work1 5
work1 6
work2 5
work2 6
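
The reason `monkey.patch_all()` changes the behaviour of `time.sleep` is that a monkey patch swaps a module attribute at runtime. The sketch below shows only that general mechanism with the standard library; it is not gevent's implementation, and `fake_sleep`, `worker` and `calls` are hypothetical names.

```python
import time

calls = []


def fake_sleep(seconds):
    """Stand-in for a cooperative sleep: records the request instead of blocking."""
    calls.append(seconds)


def worker():
    # time.sleep is looked up at call time, so it picks up whatever is installed.
    time.sleep(0.5)
    return "done"


original_sleep = time.sleep
time.sleep = fake_sleep          # apply the patch
try:
    result = worker()
finally:
    time.sleep = original_sleep  # always restore the real function

print(result, calls)  # done [0.5]
```

The code in `worker` is unchanged, yet it ends up calling the replacement. gevent relies on the same effect to make existing blocking calls yield to other greenlets.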
Iteration 1: a serial crawler
import time


def crawl_page(url):
    print('crawling {}'.format(url))
    # The number after the underscore is the simulated download time in seconds
    sleep_time = int(url.split('_')[-1])
    time.sleep(sleep_time)
    print('OK {}'.format(url))


def main(urls):
    for url in urls:
        crawl_page(url)


begin_time = time.perf_counter()
main(['url_1', 'url_2', 'url_3', 'url_4'])
end_time = time.perf_counter()
print(end_time - begin_time)
Result
crawling url_1
OK url_1
crawling url_2
OK url_2
crawling url_3
OK url_3
crawling url_4
OK url_4
10.015771199949086
Iteration 2: mark the functions with async and call them with await
import time
import asyncio


async def crawl_page(url):
    print('crawling {}'.format(url))
    sleep_time = int(url.split('_')[-1])
    await asyncio.sleep(sleep_time)
    print('OK {}'.format(url))


async def main(urls):
    for url in urls:
        # Each await finishes before the next iteration starts
        await crawl_page(url)


begin_time = time.perf_counter()
asyncio.run(main(['url_1', 'url_2', 'url_3', 'url_4']))
end_time = time.perf_counter()
print(end_time - begin_time)
Result
crawling url_1
OK url_1
crawling url_2
OK url_2
crawling url_3
OK url_3
crawling url_4
OK url_4
10.009807399939746
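
Iteration 2 takes as long as the serial version because calling an `async def` function only creates a coroutine object, and `await` then runs it to completion before the loop continues. A minimal sketch of that behaviour, with `log` and `outcome` as names invented for the example:

```python
import asyncio

log = []


async def crawl_page(url):
    log.append(f"crawling {url}")
    await asyncio.sleep(0)
    return f"OK {url}"


async def main():
    coro = crawl_page("url_1")   # nothing runs yet; this is only a coroutine object
    started_early = bool(log)    # False: the function body has not started
    result = await coro          # now the body runs to completion
    return asyncio.iscoroutine(coro), started_early, result


outcome = asyncio.run(main())
print(outcome)  # (True, False, 'OK url_1')
```

Nothing is scheduled in the background here, so awaiting coroutines one by one in a loop cannot overlap their waiting time.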
Iteration 3: using tasks (Task)
import time
import asyncio


async def crawl_page(url):
    print('crawling {}'.format(url))
    sleep_time = int(url.split('_')[-1])
    await asyncio.sleep(sleep_time)
    print('OK {}'.format(url))


async def main(urls):
    # create_task schedules every coroutine on the event loop right away
    tasks = [asyncio.create_task(crawl_page(url)) for url in urls]
    for task in tasks:
        await task


begin_time = time.perf_counter()
asyncio.run(main(['url_1', 'url_2', 'url_3', 'url_4']))
end_time = time.perf_counter()
print(end_time - begin_time)
Result
crawling url_1
crawling url_2
crawling url_3
crawling url_4
OK url_1
OK url_2
OK url_3
OK url_4
3.9802477001212537
Iteration 4: using gather
import time
import asyncio


async def crawl_page(url):
    print('crawling {}'.format(url))
    sleep_time = int(url.split('_')[-1])
    await asyncio.sleep(sleep_time)
    print('OK {}'.format(url))


async def main(urls):
    tasks = [asyncio.create_task(crawl_page(url)) for url in urls]
    # Wait for all tasks at once instead of awaiting them in a loop
    await asyncio.gather(*tasks)


begin_time = time.perf_counter()
asyncio.run(main(['url_1', 'url_2', 'url_3', 'url_4']))
end_time = time.perf_counter()
print(end_time - begin_time)
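
Iteration 4 discards what `gather` returns. As a sketch of a variant in which `crawl_page` returns a value, the example below shows that `gather` hands back the results in the order the awaitables were passed in, not in the order they finished. The sleep times are scaled down to hundredths of a second as an assumption to keep the example quick.

```python
import asyncio


async def crawl_page(url):
    # Scaled down: 'url_3' sleeps 0.03 s instead of 3 s
    sleep_time = int(url.split('_')[-1]) / 100
    await asyncio.sleep(sleep_time)
    return 'OK {}'.format(url)


async def main(urls):
    # Bare coroutines passed to gather are wrapped in tasks automatically
    return await asyncio.gather(*(crawl_page(url) for url in urls))


results = asyncio.run(main(['url_3', 'url_1', 'url_2']))
print(results)  # ['OK url_3', 'OK url_1', 'OK url_2']
```

`url_1` finishes first, yet its result stays in second position, which is what makes it safe to pair the results with the input list afterwards.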
Iteration 1: awaiting coroutines without creating tasks is equivalent to serial execution
import asyncio
import time


async def worker_1():
    print('worker_1 start')
    await asyncio.sleep(1)
    print('worker_1 done')


async def worker_2():
    print('worker_2 start')
    await asyncio.sleep(2)
    print('worker_2 done')


async def main():
    print('before await')
    await worker_1()
    print('awaited worker_1')
    await worker_2()
    print('awaited worker_2')


begin_time = time.perf_counter()
asyncio.run(main())
end_time = time.perf_counter()
print(end_time - begin_time)
Result:
before await
worker_1 start
worker_1 done
awaited worker_1
worker_2 start
worker_2 done
awaited worker_2
2.9887187001295388
Iteration 2: creating tasks so that the workers run concurrently
import time
import asyncio


async def worker_1():
    print('worker_1 start')
    await asyncio.sleep(1)
    print('worker_1 done')


async def worker_2():
    print('worker_2 start')
    await asyncio.sleep(2)
    print('worker_2 done')


async def main():
    # Both workers are scheduled here, but neither starts until main() awaits
    task1 = asyncio.create_task(worker_1())
    task2 = asyncio.create_task(worker_2())
    print('before await')
    await task1
    print('awaited worker_1')
    await task2
    print('awaited worker_2')


begin_time = time.perf_counter()
asyncio.run(main())
end_time = time.perf_counter()
print(end_time - begin_time)
Result
before await
worker_1 start
worker_2 start
worker_1 done
awaited worker_1
worker_2 done
awaited worker_2
2.002005099784583
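
In the result above, 'before await' is printed before either worker starts. That ordering follows from how `create_task` works: it only schedules the coroutine, and the event loop gets a chance to run it once the current coroutine reaches an `await`. A reduced sketch that records the order of events, using a hypothetical `worker` and `order` list:

```python
import asyncio

order = []


async def worker(name):
    order.append(f"{name} start")
    await asyncio.sleep(0)
    order.append(f"{name} done")


async def main():
    task = asyncio.create_task(worker("w1"))  # scheduled, not yet running
    order.append("task created")
    await task                                # main yields; the loop runs the worker
    order.append("awaited")


asyncio.run(main())
print(order)  # ['task created', 'w1 start', 'w1 done', 'awaited']
```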

import asyncio
import time
from asyncio import Queue


async def customer(id, queue: Queue):
    while True:
        # queue.get() is a coroutine; await it to receive the value
        val = await queue.get()
        print(f"{id} get a val:{val}")
        await asyncio.sleep(1)


async def producer(id, queue: Queue):
    for i in range(5):
        await queue.put(i)
        print(f"{id} put a val:{i}")
        await asyncio.sleep(1)


async def main():
    queue = asyncio.Queue()
    # Two consumers read from the shared queue
    cust_1 = asyncio.create_task(customer('customer_1', queue))
    cust_2 = asyncio.create_task(customer('customer_2', queue))
    # Two producers feed the shared queue
    prod_1 = asyncio.create_task(producer('prod_1', queue))
    prod_2 = asyncio.create_task(producer('prod_2', queue))
    await asyncio.sleep(1)
    # The consumers loop forever, so cancel them before gathering
    cust_1.cancel()
    cust_2.cancel()
    await asyncio.gather(cust_1, cust_2, prod_1, prod_2, return_exceptions=True)


begin_time = time.perf_counter()
asyncio.run(main())
end_time = time.perf_counter()
print(end_time - begin_time)
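
The consumers above run an endless loop, so they stop only because `main()` cancels them. As a minimal sketch of what happens inside a cancelled task, the example below catches `asyncio.CancelledError` to do cleanup and then re-raises it. The names `consumer` and `events` are illustrative and not part of the code above.

```python
import asyncio

events = []


async def consumer(queue):
    try:
        while True:
            val = await queue.get()   # cancellation is delivered at an await like this one
            events.append(f"got {val}")
    except asyncio.CancelledError:
        events.append("cleanup")      # release resources here
        raise                         # re-raise so the task is marked as cancelled


async def main():
    queue = asyncio.Queue()
    task = asyncio.create_task(consumer(queue))
    await queue.put(1)
    await asyncio.sleep(0.01)         # give the consumer a chance to run
    task.cancel()                     # only requests cancellation
    results = await asyncio.gather(task, return_exceptions=True)
    return task.cancelled(), results


cancelled, results = asyncio.run(main())
print(cancelled, events)  # True ['got 1', 'cleanup']
```

With `return_exceptions=True`, `gather` returns the `CancelledError` as an ordinary result instead of raising it, which is why the example above can gather cancelled and finished tasks together.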
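In the code above, cancel() is called on the two consumer tasks because they loop forever; without it, the final gather() would never return. cancel() does not stop a task immediately. It only requests cancellation, and the task keeps running until it reaches its next await (or yield, in a generator-based coroutine), where the cancellation is delivered. A task that is no longer needed but has not been cancelled still takes up CPU and memory without doing anything useful, which can lead to resource leaks and extra running time. So when a task has to be cancelled, call cancel() as early as possible so that it stops as early as possible. One key point to keep in mind: cancel() only sends a cancellation signal to the task; the task itself has to handle that signal and do its own cleanup.

Prerequisites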
pip install lxml
pip install beautifulsoup4
Iteration 1: synchronous requests
import time
import requests
from bs4 import BeautifulSoup


def main():
    url = "https://movie.douban.com/cinema/later/beijing/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/52.0.2743.116 Safari/537.36 '
    }
    response = requests.get(url, headers=headers)
    init_page = response.text
    init_soup = BeautifulSoup(init_page, 'lxml')
    all_movies = init_soup.find('div', id="showing-soon")
    for each_movie in all_movies.find_all('div', class_="item"):
        all_a_tag = each_movie.find_all('a')
        all_li_tag = each_movie.find_all('li')
        movie_name = all_a_tag[1].text
        url_to_fetch = all_a_tag[1]['href']
        movie_date = all_li_tag[0].text
        # Fetch each movie's detail page one at a time
        response_item = requests.get(url_to_fetch, headers=headers).content
        soup_item = BeautifulSoup(response_item, 'lxml')
        img_tag = soup_item.find('img')
        print('{} {} {}'.format(movie_name, movie_date, img_tag['src']))


begin_time = time.perf_counter()
main()
end_time = time.perf_counter()
print(end_time - begin_time)
Result
灌篮高手 04月20日 https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2888398295.jpg
长空之王 04月28日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2889598060.jpg
人生路不熟 04月28日 https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2889864501.jpg
这么多年 04月28日 https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2890327372.jpg
长沙夜生活 04月28日 https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2888648134.jpg
倒数说爱你 04月28日 https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2888751814.jpg
惊天救援 04月28日 https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2890135815.jpg
检察风云 04月29日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2890643247.jpg
天堂谷大冒险 04月29日 https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2890601753.jpg
新猪猪侠大电影·超级赛车 04月29日 https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2890529454.jpg
魔幻奇缘之宝石公主 04月29日 https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2890340206.jpg
宇宙护卫队:风暴力量 04月29日 https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2881560195.jpg
鲛在水中央 05月01日 https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2890526786.jpg
马庄村 05月01日 https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2890134283.jpg
银河护卫队3 05月05日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2889358680.jpg
我和妈妈的最后一年 05月12日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2890599819.jpg
荒野狂兽 05月12日 https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2890557435.jpg
贫民窟之王 05月12日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2889912399.jpg
速度与激情10 05月17日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2890343870.jpg
余生那些年 05月20日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2888332880.jpg
请别相信她 05月20日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2886540928.jpg
69.0037998999469
Iteration 2: asynchronous requests with aiohttp
import asyncio
import aiohttp
import time
from bs4 import BeautifulSoup

header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/52.0.2743.116 Safari/537.36 '
}


async def fetch_content(url):
    async with aiohttp.ClientSession(
        headers=header, connector=aiohttp.TCPConnector(ssl=False)
    ) as session:
        async with session.get(url) as response:
            return await response.text()


async def main():
    url = "https://movie.douban.com/cinema/later/beijing/"
    init_page = await fetch_content(url)
    init_soup = BeautifulSoup(init_page, 'lxml')
    movie_names, urls_to_fetch, movie_dates = [], [], []
    all_movies = init_soup.find('div', id="showing-soon")
    for each_movie in all_movies.find_all('div', class_="item"):
        all_a_tag = each_movie.find_all('a')
        all_li_tag = each_movie.find_all('li')
        movie_names.append(all_a_tag[1].text)
        urls_to_fetch.append(all_a_tag[1]['href'])
        movie_dates.append(all_li_tag[0].text)
    # Fetch all detail pages concurrently; gather keeps the results in input order
    tasks = [fetch_content(url) for url in urls_to_fetch]
    pages = await asyncio.gather(*tasks)
    for movie_name, movie_date, page in zip(movie_names, movie_dates, pages):
        soup_item = BeautifulSoup(page, 'lxml')
        img_tag = soup_item.find('img')
        print('{} {} {}'.format(movie_name, movie_date, img_tag['src']))


begin_time = time.perf_counter()
asyncio.run(main())
end_time = time.perf_counter()
print(end_time - begin_time)
Result
灌篮高手 04月20日 https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2888398295.jpg
长空之王 04月28日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2889598060.jpg
人生路不熟 04月28日 https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2889864501.jpg
这么多年 04月28日 https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2890327372.jpg
长沙夜生活 04月28日 https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2888648134.jpg
倒数说爱你 04月28日 https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2888751814.jpg
惊天救援 04月28日 https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2890135815.jpg
检察风云 04月29日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2890643247.jpg
天堂谷大冒险 04月29日 https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2890601753.jpg
新猪猪侠大电影·超级赛车 04月29日 https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2890529454.jpg
魔幻奇缘之宝石公主 04月29日 https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2890340206.jpg
宇宙护卫队:风暴力量 04月29日 https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2881560195.jpg
鲛在水中央 05月01日 https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2890526786.jpg
马庄村 05月01日 https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2890134283.jpg
银河护卫队3 05月05日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2889358680.jpg
我和妈妈的最后一年 05月12日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2890599819.jpg
荒野狂兽 05月12日 https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2890557435.jpg
贫民窟之王 05月12日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2889912399.jpg
速度与激情10 05月17日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2890343870.jpg
余生那些年 05月20日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2888332880.jpg
请别相信她 05月20日 https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2886540928.jpg
4.7766728000715375