14. The Relationship Between the GIL and Lock (Part 2)
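With the GIL in place, why is a Lock still needed?
The GIL only guarantees that one thread is executing at any given moment across the whole interpreter; it does not make the code that thread runs atomic. A single operation may be carried out in several steps, and a thread switch between those steps can corrupt shared data. For example, incrementing a shared counter is a read, an add, and a write-back, so two threads can both read the same old value and one update is lost. A Lock is therefore still required to make such operations atomic.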
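The point above can be illustrated with a minimal sketch. The names counter, counter_lock, add_many and run_workers are hypothetical and not part of the original article; the sketch only shows how a threading.Lock turns a multi-step update into a single unit with respect to other threads.

```python
import threading

counter = 0                      # shared state (hypothetical name)
counter_lock = threading.Lock()  # protects counter (hypothetical name)


def add_many(n):
    """Increment the shared counter n times, holding the lock for each update."""
    global counter
    for _ in range(n):
        # counter += 1 is a read, an add and a write-back; the lock makes
        # the three steps run as one unit with respect to other threads.
        with counter_lock:
            counter += 1


def run_workers(num_threads=4, n=100_000):
    """Reset the counter, run the workers to completion, and return the total."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(n,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == '__main__':
    print(run_workers())  # 400000: every increment is protected by the lock
```

If the with counter_lock line is removed, the total may come out smaller than 400000. Whether a lost update actually shows up depends on the interpreter version and on timing, so a correct result from one run without the lock does not prove the code is safe.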
Homework:
1. Use multiple threads to scrape the text jokes from "百思不得姐" (budejie).
2. URL: http://www.budejie.com/text/
3. Save the scraped results to a CSV file.
Sample code:
import csv
import threading
from queue import Queue, Empty

import requests
from lxml import etree


class BSSpider(threading.Thread):
    """Producer: downloads list pages and puts (joke, link) tuples on joke_queue."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def __init__(self, page_queue, joke_queue, *args, **kwargs):
        super(BSSpider, self).__init__(*args, **kwargs)
        self.base_domain = 'http://www.budejie.com'
        self.page_queue = page_queue
        self.joke_queue = joke_queue

    def run(self) -> None:
        while True:
            # Checking empty() and then calling get() is not atomic: another
            # thread could take the last URL in between and leave this one
            # blocked forever, so fetch without blocking instead.
            try:
                url = self.page_queue.get_nowait()
            except Empty:
                break
            response = requests.get(url, headers=self.headers)
            text = response.text
            html = etree.HTML(text)
            descs = html.xpath("//div[@class='j-r-list-c-desc']")
            for desc in descs:
                jokes = desc.xpath(".//text()")
                joke = "\n".join(jokes).strip()
                link = self.base_domain + desc.xpath(".//a/@href")[0]
                self.joke_queue.put((joke, link))
            print('=' * 30 + "Page %s downloaded!" % url.split('/')[-1] + "=" * 30)


class BSWriter(threading.Thread):
    """Consumer: takes (joke, link) tuples off joke_queue and writes them to the CSV file."""

    def __init__(self, joke_queue, writer, gLock, *args, **kwargs):
        super(BSWriter, self).__init__(*args, **kwargs)
        self.joke_queue = joke_queue
        self.writer = writer
        self.lock = gLock

    def run(self) -> None:
        while True:
            try:
                # Stop once no joke has arrived for 40 seconds.
                joke, link = self.joke_queue.get(timeout=40)
            except Empty:
                break
            # The csv writer is shared by all writer threads, so guard each row.
            with self.lock:
                self.writer.writerow((joke, link))
            print('Saved one record')


def main():
    page_queue = Queue(10)
    joke_queue = Queue(500)
    gLock = threading.Lock()

    with open('bsbdj.csv', 'a', newline='', encoding='utf-8') as fp:
        writer = csv.writer(fp)
        writer.writerow(('content', 'link'))

        for x in range(1, 11):
            url = 'http://www.budejie.com/text/%d' % x
            page_queue.put(url)

        threads = []
        for x in range(5):
            t = BSSpider(page_queue, joke_queue)
            t.start()
            threads.append(t)

        # Start the writer threads that drain joke_queue into the CSV file.
        for x in range(5):
            t = BSWriter(joke_queue, writer, gLock)
            t.start()
            threads.append(t)

        # Keep the file open until every thread has finished.
        for t in threads:
            t.join()


if __name__ == '__main__':
    main()
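The writer threads in the sample code stop only after the queue has stayed empty for 40 seconds. One common alternative, sketched here under stated assumptions, is to push a sentinel value onto the queue once all producers have finished, so each consumer exits as soon as the work is done. Everything in this sketch is hypothetical and not from the original article: SENTINEL, producer, consumer and run_pipeline are made-up names, the rows are fake data rather than scraped pages, and an in-memory buffer stands in for the CSV file.

```python
import csv
import io
import threading
from queue import Queue

SENTINEL = None  # hypothetical marker meaning "no more items"


def producer(page_numbers, joke_queue):
    """Stand-in for a spider: emits one fake (joke, link) row per page."""
    for page in page_numbers:
        joke_queue.put(('joke from page %d' % page, '/detail-%d.html' % page))


def consumer(joke_queue, writer, lock):
    """Write rows until the sentinel arrives, then exit immediately."""
    while True:
        item = joke_queue.get()
        if item is SENTINEL:
            break
        with lock:  # the csv writer is shared between consumers
            writer.writerow(item)


def run_pipeline(num_producers=2, num_consumers=3, pages_per_producer=5):
    """Run producers and consumers to completion and return the CSV text."""
    joke_queue = Queue(500)
    lock = threading.Lock()
    buffer = io.StringIO(newline='')  # in-memory stand-in for the CSV file
    writer = csv.writer(buffer)
    writer.writerow(('content', 'link'))

    consumers = [threading.Thread(target=consumer, args=(joke_queue, writer, lock))
                 for _ in range(num_consumers)]
    producers = []
    for i in range(num_producers):
        first = i * pages_per_producer + 1
        pages = range(first, first + pages_per_producer)
        producers.append(threading.Thread(target=producer, args=(pages, joke_queue)))

    for t in consumers + producers:
        t.start()
    for t in producers:
        t.join()
    # Every producer is done, so queue exactly one sentinel per consumer.
    for _ in consumers:
        joke_queue.put(SENTINEL)
    for t in consumers:
        t.join()
    return buffer.getvalue()


if __name__ == '__main__':
    print(run_pipeline())
```

The sentinels are queued only after every producer has been joined, so they are guaranteed to sit behind all real rows, and each consumer takes exactly one sentinel before exiting. The order of the data rows in the output still depends on thread scheduling.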
Previous article: Chapter 5, Advanced Web Scraping (13), 2020-01-30. Link:
https://www.jianshu.com/p/42f852ebdcaa
Next article: Chapter 5, Advanced Web Scraping (15), 2020-02-01. Link:
https://www.jianshu.com/p/ca48fa8c11ce
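The material above was collected from the internet and is shared for learning and discussion only. If it infringes your rights, please message me privately and I will remove it. Thank you.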