A Multithreaded Web Crawler for Douban Movie Pages (Python Implementation)

This program uses multiple threads to download and parse Douban movie pages, starting from the Douban page for Star Trek (2009).
For each page it extracts the movie's title, year, and rating, appends them to a CSV file, and then follows the links in the page's recommendation list to reach the next movies to crawl.
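
For reference, the three fields are located via Douban's microdata-style attributes (property="v:itemreviewed", class="year", property="v:average"). The fragment below is a simplified, hypothetical stand-in for the relevant markup, only meant to show what the selectors in main.py match:

from bs4 import BeautifulSoup

# Trimmed-down, hypothetical stand-in for the relevant part of a subject page.
sample = """
<h1>
  <span property="v:itemreviewed">星际迷航 Star Trek</span>
  <span class="year">(2009)</span>
</h1>
<strong class="ll rating_num" property="v:average">7.9</strong>
"""
soup = BeautifulSoup(sample, "html.parser")
print(soup.find("span", {"property": "v:itemreviewed"}).get_text())  # 星际迷航 Star Trek
print(soup.find("span", {"class": "year"}).get_text())               # (2009)
print(soup.find("strong", {"property": "v:average"}).get_text())     # 7.9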

main.py: creates 30 threads, each of which independently downloads pages, parses them, and writes results to the file. The Handle class extends Thread and holds the code each thread runs.

# -*- coding:utf-8 -*-
from HTMLDownloader import download
import URLManager
from bs4 import BeautifulSoup
from threading import Thread, Lock

cnt = 0                               # number of pages processed so far
urlmanager = URLManager.URLManager()  # shared URL frontier


class Handle(Thread):
    """Worker thread: fetches a page, parses it, records the movie,
    and queues the recommended movies for other workers."""

    def run(self):
        global cnt
        while True:
            # Racy unlocked read of cnt is tolerable: it only loosely bounds the crawl.
            if cnt > 1000:
                return
            with urlLock:
                if not urlmanager.has_url():
                    continue  # frontier is empty for now; spin and retry
                url = urlmanager.get_new_url()

            html = download(url)
            if html is None:
                continue
            soup = BeautifulSoup(html, "html.parser")
            # Douban marks title, year, and rating with microdata attributes.
            item = soup.find_all("span", {"property": "v:itemreviewed"})
            year = soup.find_all("span", {"class": "year"})
            score = soup.find_all("strong", {"class": "ll rating_num", "property": "v:average"})
            if not (item and year and score):
                continue  # not a regular movie page; skip it

            with fileLock:
                # utf-8 keeps Chinese titles intact; the with-block closes the file.
                with open("movies.csv", "a", encoding="utf-8") as f:
                    f.write("%s,\t%s,\t%s\n" % (item[0].get_text(), year[0].get_text(), score[0].get_text()))

            # Recommended movies live in class-less <dl> blocks on the page.
            recommendation = soup.find_all("dl", {"class": ""})
            with urlLock:
                for entry in recommendation:
                    link = entry.find("a")
                    if link is not None and link.get("href"):
                        urlmanager.add_new_url(link.get("href"))

            with cntLock:
                print(cnt)
                cnt += 1

if __name__ == "__main__":
    # Seed the crawl with the Star Trek (2009) subject page.
    urlmanager.add_new_url("https://movie.douban.com/subject/2132932/?from=subject-page")
    cntLock = Lock()
    urlLock = Lock()
    fileLock = Lock()

    threads = [Handle() for _ in range(30)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
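
As an aside, the acquire/release-plus-continue polling above can be avoided entirely: the standard library's queue.Queue is a thread-safe work queue whose get() blocks until an item arrives. A minimal sketch of that alternative design follows (enqueue and worker are illustrative names, not part of the original code):

import queue
from threading import Thread, Lock

task_queue = queue.Queue()
seen = set()
seen_lock = Lock()

def enqueue(url):
    # Deduplicate before putting work on the queue.
    with seen_lock:
        if url in seen:
            return
        seen.add(url)
    task_queue.put(url)

def worker():
    while True:
        url = task_queue.get()  # blocks until work arrives; no busy-waiting
        try:
            pass  # download(url), parse, write a CSV row, enqueue(...) new links
        finally:
            task_queue.task_done()

enqueue("https://movie.douban.com/subject/2132932/?from=subject-page")
for _ in range(30):
    Thread(target=worker, daemon=True).start()
task_queue.join()  # returns once every enqueued URL has been processed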

HTMLDownloader.py: downloads the page at a given URL, returning the page content on success and None on any error.

from urllib import request, error


def download(url):
    # Present a desktop-browser User-Agent instead of urllib's default.
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36"}
    req = request.Request(url, headers=header)
    try:
        response = request.urlopen(req)
    except error.HTTPError as e:   # server answered with an error status
        print(e.code)
        return None
    except error.URLError as e:    # network-level failure (DNS, refused, ...)
        print(e.reason)
        return None
    else:
        return response.read()     # raw bytes; BeautifulSoup detects the encoding
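
A quick standalone check of the downloader, using the crawler's seed page. Note that request.urlopen also accepts a timeout argument; passing one (e.g. timeout=10) would keep a worker thread from hanging on a stalled connection:

if __name__ == "__main__":
    page = download("https://movie.douban.com/subject/2132932/")
    if page is not None:
        print("downloaded %d bytes" % len(page))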

URLManager.py: manages the crawler's URLs: the list of pages still to visit and the set of pages already visited.

class URLManager:
    """Tracks the crawl frontier (newUrls) and already-visited URLs (oldUrls)."""

    def __init__(self):
        self.newUrls = []      # URLs waiting to be crawled
        self.oldUrls = set()   # URLs already handed out to a worker

    def add_new_url(self, url):
        # Ignore empty URLs and anything already queued or visited.
        if url is None:
            return
        if url not in self.newUrls and url not in self.oldUrls:
            self.newUrls.append(url)

    def add_new_urls(self, urls):
        if not urls:
            return
        for url in urls:
            self.add_new_url(url)

    def has_url(self):
        return len(self.newUrls) != 0

    def get_new_url(self):
        # Pop from the tail (LIFO), so the crawl is roughly depth-first.
        new = self.newUrls.pop()
        self.oldUrls.add(new)
        return new
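
A short illustration of the deduplication behavior, usable as a quick sanity check:

if __name__ == "__main__":
    m = URLManager()
    m.add_new_url("https://movie.douban.com/subject/2132932/")
    m.add_new_url("https://movie.douban.com/subject/2132932/")  # duplicate: ignored
    print(m.has_url())      # True (one URL waiting)
    print(m.get_new_url())  # hands out the URL and marks it as visited
    m.add_new_url("https://movie.douban.com/subject/2132932/")  # visited: ignored
    print(m.has_url())      # False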

Warning: do not run this program as-is. Douban caps the number of requests allowed per unit of time, and hammering it with 30 concurrent threads will get your IP address blocked by Douban!

Mitigation: consider routing the requests through proxy IPs.
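
A minimal sketch of what that could look like with urllib's ProxyHandler. The proxy address is a placeholder; in practice you would rotate through a pool of working proxies and fold this into download():

from urllib import request

def download_via_proxy(url, proxy):
    # proxy is e.g. "http://203.0.113.5:8080" (placeholder address).
    handler = request.ProxyHandler({"http": proxy, "https": proxy})
    opener = request.build_opener(handler)
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    try:
        return opener.open(url, timeout=10).read()
    except Exception as e:
        print(e)
        return None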
