14. The Relationship Between the GIL and Lock (Part 2)

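Why do we still need a Lock when there is a GIL?

In CPython, the GIL only guarantees that a single thread is executing Python bytecode at any given moment; it does not make your code atomic. A statement such as value += 1 is carried out in several steps (read, add, write back), and the interpreter may switch threads between those steps, so two threads can overwrite each other's results and leave shared data in a wrong state. A Lock is therefore still needed to make such an operation atomic.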
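The point above can be sketched with a shared counter. This is a minimal illustration, not part of the original article: the names counter, counter_lock, add_many and run_demo are hypothetical, and the thread and iteration counts are arbitrary. Each increment is wrapped in a Lock so that the read, add and write-back steps cannot be interleaved with another thread's.

```python
import threading

counter = 0                      # shared state touched by every thread
counter_lock = threading.Lock()  # hypothetical name; guards `counter`


def add_many(times: int) -> None:
    """Increment the shared counter `times` times, one locked step at a time."""
    global counter
    for _ in range(times):
        # `counter += 1` is a read, an add and a write; holding the lock
        # keeps another thread from running between those steps.
        with counter_lock:
            counter += 1


def run_demo(n_threads: int = 4, times: int = 100_000) -> int:
    """Reset the counter, run the workers to completion, return the total."""
    global counter
    counter = 0
    workers = [
        threading.Thread(target=add_many, args=(times,))
        for _ in range(n_threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter


if __name__ == "__main__":
    print(run_demo())  # with the lock in place: 400000
```

If the with counter_lock: line is removed, the total may come out lower than expected on some interpreter versions, because increments from different threads can overwrite each other; whether that actually happens in a given run depends on the interpreter and on timing.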

Exercise:

1. Use multiple threads to scrape the text jokes from the "budejie" site.

2. URL: http://www.budejie.com/text/

3. Save the scraped jokes to a CSV file.

Sample code:

import csv
import threading
from queue import Queue, Empty

import requests
from lxml import etree


class BSSpider(threading.Thread):
    """Producer: downloads list pages and puts (joke, link) pairs on joke_queue."""

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def __init__(self, page_queue, joke_queue, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.base_domain = 'http://www.budejie.com'
        self.page_queue = page_queue
        self.joke_queue = joke_queue

    def run(self) -> None:
        while True:
            # get_nowait() avoids the gap between empty() and get(), in which
            # another thread could take the last URL and leave this one blocked.
            try:
                url = self.page_queue.get_nowait()
            except Empty:
                break
            response = requests.get(url, headers=self.headers)
            html = etree.HTML(response.text)
            descs = html.xpath("//div[@class='j-r-list-c-desc']")
            for desc in descs:
                jokes = desc.xpath(".//text()")
                joke = "\n".join(jokes).strip()
                link = self.base_domain + desc.xpath(".//a/@href")[0]
                self.joke_queue.put((joke, link))
            print('=' * 30 + "Page %s downloaded!" % url.split('/')[-1] + "=" * 30)


class BSWriter(threading.Thread):
    """Consumer: takes (joke, link) pairs off joke_queue and writes them to the CSV."""

    def __init__(self, joke_queue, writer, gLock, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.joke_queue = joke_queue
        self.writer = writer
        self.lock = gLock

    def run(self) -> None:
        while True:
            try:
                # Give up once no joke has arrived for 40 seconds.
                joke, link = self.joke_queue.get(timeout=40)
            except Empty:
                break
            # The csv writer is shared by all writer threads, so only one
            # thread at a time may write a row.
            with self.lock:
                self.writer.writerow((joke, link))
            print('Saved one row')


def main():
    page_queue = Queue(10)
    joke_queue = Queue(500)
    gLock = threading.Lock()
    fp = open('bsbdj.csv', 'a', newline='', encoding='utf-8')
    writer = csv.writer(fp)
    writer.writerow(('content', 'link'))

    for x in range(1, 11):
        url = 'http://www.budejie.com/text/%d' % x
        page_queue.put(url)

    threads = []
    for x in range(5):
        t = BSSpider(page_queue, joke_queue)
        t.start()
        threads.append(t)

    # Writer threads drain joke_queue into the CSV file
    # (the count of 5 is an arbitrary choice).
    for x in range(5):
        t = BSWriter(joke_queue, writer, gLock)
        t.start()
        threads.append(t)

    # Wait for every thread to finish before closing the file.
    for t in threads:
        t.join()
    fp.close()


if __name__ == '__main__':
    main()

Previous article: Chapter 5: Advanced Web Scraping (13), 2020-01-30:

https://www.jianshu.com/p/42f852ebdcaa

Next article: Chapter 5: Advanced Web Scraping (15), 2020-02-01:

https://www.jianshu.com/p/ca48fa8c11ce

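The material above was collected from the internet and is shared for learning and discussion only. If anything here infringes your rights, please message me privately and I will remove it. Thank you.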

Chapter 5: Advanced Web Scraping (14), 2020-01-31
