Chapter 5: Advanced Web Scraping (14) 2020-01-31

14. The Relationship Between the GIL and Lock (Part 2)


With the GIL in place, why do we still need a Lock?


The GIL only guarantees that a single thread executes Python bytecode at any given moment; it does not guarantee that an operation is atomic. A single statement may be split into several bytecode steps, and the interpreter can switch threads between those steps, which corrupts shared data. That is why a Lock is still needed to make the whole operation atomic.
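The point above can be demonstrated with a short sketch (names like `add_unsafe`/`add_safe` are illustrative, not from the original): `counter += 1` compiles to a load, an add, and a store, and the GIL may hand control to another thread between those steps, so unguarded increments can lose updates, while the locked version never does.

```python
import threading

counter = 0
lock = threading.Lock()

def add_unsafe(n):
    # Read-modify-write without a Lock: updates can be lost
    # when the interpreter switches threads mid-increment.
    global counter
    for _ in range(n):
        counter += 1

def add_safe(n):
    # The Lock makes the whole read-modify-write atomic.
    global counter
    for _ in range(n):
        with lock:
            counter += 1

def run(worker, n=100000, threads=4):
    global counter
    counter = 0
    ts = [threading.Thread(target=worker, args=(n,)) for _ in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return counter

print(run(add_safe))    # always 400000
print(run(add_unsafe))  # may come out below 400000 on some runs
```

Note that the unsafe version may still happen to produce the correct total on a given run; the bug is intermittent, which is exactly what makes missing locks dangerous.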


Homework:


1. Crawl the jokes from the "budejie" (百思不得姐) site using multiple threads.

2. URL: http://www.budejie.com/text/

3. Save the scraped results as a CSV file.


Example code:


import csv
import threading
from queue import Queue

import requests
from lxml import etree


class BSSpider(threading.Thread):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def __init__(self, page_queue, joke_queue, *args, **kwargs):
        super(BSSpider, self).__init__(*args, **kwargs)
        self.base_domain = 'http://www.budejie.com'
        self.page_queue = page_queue
        self.joke_queue = joke_queue

    def run(self) -> None:
        while True:
            if self.page_queue.empty():
                break
            url = self.page_queue.get()
            response = requests.get(url, headers=self.headers)
            html = etree.HTML(response.text)
            descs = html.xpath("//div[@class='j-r-list-c-desc']")
            for desc in descs:
                jokes = desc.xpath(".//text()")
                joke = "\n".join(jokes).strip()
                link = self.base_domain + desc.xpath(".//a/@href")[0]
                self.joke_queue.put((joke, link))
            print('=' * 30 + "Page %s downloaded!" % url.split('/')[-1] + '=' * 30)


class BSWriter(threading.Thread):
    def __init__(self, joke_queue, writer, gLock, *args, **kwargs):
        super(BSWriter, self).__init__(*args, **kwargs)
        self.joke_queue = joke_queue
        self.writer = writer
        self.lock = gLock

    def run(self) -> None:
        while True:
            try:
                joke, link = self.joke_queue.get(timeout=40)
                # csv.writer is not thread-safe, so guard writerow with the Lock
                self.lock.acquire()
                self.writer.writerow((joke, link))
                self.lock.release()
                print('Saved one joke')
            except Exception:
                break


def main():
    page_queue = Queue(10)
    joke_queue = Queue(500)
    gLock = threading.Lock()
    fp = open('bsbdj.csv', 'a', newline='', encoding='utf-8')
    writer = csv.writer(fp)
    writer.writerow(('content', 'link'))

    for x in range(1, 11):
        url = 'http://www.budejie.com/text/%d' % x
        page_queue.put(url)

    # Start 5 spider threads to fetch pages
    for x in range(5):
        t = BSSpider(page_queue, joke_queue)
        t.start()

    # Start a writer thread to drain joke_queue into the CSV file
    for x in range(1):
        t = BSWriter(joke_queue, writer, gLock)
        t.start()


if __name__ == '__main__':
    main()



Previous article: Chapter 5: Advanced Web Scraping (13) 2020-01-30, at:

https://www.jianshu.com/p/42f852ebdcaa

Next article: Chapter 5: Advanced Web Scraping (15) 2020-02-01, at:

https://www.jianshu.com/p/ca48fa8c11ce



The material above was collected from the internet and is for learning and discussion only. If it infringes your rights, please message me privately for removal. Thank you.
