Multithreaded Crawler

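I have been learning web scraping recently, and after reading about multithreaded crawlers I wrote this article in the hope that it helps beginners. I wrote two crawlers, both of which fetch movie information from Douban Movies. In the end, the multithreaded version did save a great deal of time compared with the single-threaded one. I will post the code first and then analyze it briefly. (PS: some basic crawler experience is recommended; Python version 2.7.)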

#coding:utf-8
import requests
import json
import time

def get_html(url):
    header = {
        'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.89 Safari/537.36',
    }
    try:
        res = requests.get(url, headers = header)
        return res.text
    except requests.RequestException:
        # Network or HTTP-level failure: report it and return None
        print u"fetch failed"
        return None

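def get_data(html):
    # get_html returns None on failure, so there is nothing to parse
    if html is None:
        return
    a = json.loads(html)
    a_list = a[u'subjects']
    # An empty page would make a_list[0] raise IndexError
    if not a_list:
        return
    item = a_list[0]
    print item[u'title'], item[u'rate']
    save_data(a_list)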

def save_data(a_list):
    res = u""
    for item in a_list:
        s1 = item[u'title']
        s2 = item[u'rate']
        res += s1 + s2
    # Append one line per page; the with block closes the file even on error
    with open('t1.txt', 'ab+') as f:
        f.write(res.encode('utf-8'))
        f.write('\r\n')

if __name__ == '__main__':
    begin = time.time()

    for i in range(0,300,20):
        url  = 'https://movie.douban.com/j/search_subjects?type=movie&tag=%E6%9C%80%E6%96%B0&page_limit=20&page_start='+str(i)
        html = get_html(url)
        get_data(html)

    end = time.time()
    print end - begin

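The code above is the single-threaded version that crawls the movie information. The response data is in JSON format, which is why the json module is imported. Anyone with a little crawler experience should be able to follow it, so I will not explain it further.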
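
To make the JSON handling concrete, here is a minimal sketch of the parsing step on its own. The sample payload is made up for illustration; only the `subjects`, `title`, and `rate` keys are taken from the code above, and `extract_titles` is a hypothetical helper name. It is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import json

# Hypothetical sample shaped like the response the crawler parses:
# a top-level "subjects" list whose items carry "title" and "rate".
SAMPLE = ('{"subjects": [{"title": "Movie A", "rate": "8.1"}, '
          '{"title": "Movie B", "rate": "7.4"}]}')


def extract_titles(text):
    """Return (title, rate) pairs from a JSON string, or [] if text is None."""
    if text is None:
        return []
    data = json.loads(text)
    return [(item["title"], item["rate"]) for item in data.get("subjects", [])]


pairs = extract_titles(SAMPLE)
for title, rate in pairs:
    print("%s %s" % (title, rate))
```

Returning an empty list for a missing response or a missing `subjects` key lets the caller loop over the result without extra checks.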

Single-threaded run time:

[Figure 1: screenshot of the single-threaded run time]

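Below is the code for the multithreaded crawler. Compared with the single-threaded version above, it uses two more modules, Queue and threading. For an introduction to these two modules, see: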

Queue: https://www.cnblogs.com/itogo/p/5635629.html

threading: https://www.cnblogs.com/fnng/p/3670789.html
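
As a quick orientation before the full program, the sketch below shows the two building blocks on their own: a FIFO queue that is safe to share between threads, and a thread that is started and then joined. The `try/except ImportError` import is an assumption added so the snippet also runs on Python 3, where the module is named `queue`; the names `q`, `results`, and `consume` are illustrative only.

```python
try:
    import queue            # Python 3 name
except ImportError:
    import Queue as queue   # Python 2.7 name, as used in this article
import threading

q = queue.Queue()
for n in (1, 2, 3):
    q.put(n)                # items come back out in FIFO order

results = []


def consume():
    # get_nowait() raises queue.Empty instead of blocking when nothing is left
    while True:
        try:
            results.append(q.get_nowait() * 10)
        except queue.Empty:
            break


t = threading.Thread(target=consume)
t.start()
t.join()                    # wait for the thread to finish
print(results)              # prints [10, 20, 30]
```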

#coding:utf-8
import requests
import json
import time
import Queue
import threading

Share_Q = Queue.Queue()
THREAD_NUM = 5
title_list = []
def get_html(url):
    header = {
        'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.89 Safari/537.36',
    }
    try:
        res = requests.get(url, headers = header)
        return res.text
    except requests.RequestException:
        # Network or HTTP-level failure: report it and return None
        print u"fetch failed"
        return None

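def get_data(html):
    # get_html returns None on failure, so there is nothing to parse
    if html is None:
        return
    a = json.loads(html)
    a_list = a[u'subjects']
    # An empty page would make a_list[0] raise IndexError
    if not a_list:
        return
    item = a_list[0]
    print item[u'title'], item[u'rate']
    # Named "titles" rather than "list" to avoid shadowing the built-in
    titles = []
    for item in a_list:
        titles.append(item[u'title'] + u',')
    title_list.append(titles)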

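def worker():
    while True:
        try:
            # get_nowait() avoids blocking forever if another thread
            # takes the last URL between an empty() check and get()
            url = Share_Q.get_nowait()
        except Queue.Empty:
            break
        html = get_html(url)
        get_data(html)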

if __name__ == '__main__':
    begin = time.time()
    threads = []
    for i in range(0,300,20):
        url  = 'https://movie.douban.com/j/search_subjects?type=movie&tag=%E6%9C%80%E6%96%B0&page_limit=20&page_start='+str(i)
        Share_Q.put(url)
    for i in range(THREAD_NUM):
        thread = threading.Thread(target=worker)
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()

    # Write one line of titles per page; the with block closes the file
    with open('t2.txt', 'ab') as f:
        for title in title_list:
            res = u"".join(title)
            f.write(res.encode('utf-8'))
            f.write('\r\n')
    end = time.time()
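    print end - begin

In this program, we first put the URLs into a queue, and then several threads take URLs from the queue and fetch the pages at the same time. Most of each request is spent waiting on the network, so the threads' waits overlap, and the result is much faster than a single thread fetching the URLs one by one.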
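
The pattern described above can be tried without touching the network. The following is a minimal sketch, not the crawler itself: `fake_fetch` is a hypothetical stand-in that sleeps briefly to imitate waiting on a response, and `run_sequential` and `run_threaded` are illustrative names. The import shim is an assumption so the snippet runs on both Python 2.7 and Python 3.

```python
import threading
import time
try:
    import queue
except ImportError:
    import Queue as queue

DELAY = 0.05        # simulated time per request, in seconds
THREAD_NUM = 5


def fake_fetch(url):
    time.sleep(DELAY)            # stands in for requests.get(url)
    return "page for " + url


def run_sequential(urls):
    return [fake_fetch(u) for u in urls]


def run_threaded(urls, thread_num=THREAD_NUM):
    q = queue.Queue()
    for u in urls:
        q.put(u)
    pages = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = q.get_nowait()   # take a URL or stop, in one step
            except queue.Empty:
                return
            page = fake_fetch(url)
            with lock:                 # guard the shared list explicitly
                pages.append(page)

    threads = [threading.Thread(target=worker) for _ in range(thread_num)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pages


urls = ["url-%d" % i for i in range(0, 300, 20)]   # 15 placeholder URLs

start = time.time()
seq_pages = run_sequential(urls)
seq_time = time.time() - start

start = time.time()
thr_pages = run_threaded(urls)
thr_time = time.time() - start

print("sequential: %.2fs, threaded: %.2fs" % (seq_time, thr_time))
```

The threaded results arrive in completion order rather than URL order, so sort them or store the URL alongside each page if the order matters.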

[Figure 2: screenshot of the multithreaded run]
