Building a proxy IP pool with a Python crawler and automatically testing whether the proxies work

Crawling proxy IPs in Python with non-blocking multithreading and automatically testing whether they work

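One site worth recommending is Xici Proxy (西刺代理), which publishes a fresh batch of high-anonymity proxy IPs every day: https://www.xicidaili.com/
One page lists 99 IPs, but in my tests usually only 7 or 8 of them are usable. That is still enough (they are free, after all). Below I show how to crawl the proxy IPs and test whether each one works.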

Import the required modules

import requests, os, re, time, random
from threading import Thread

Only requests has to be downloaded and installed; the other modules are part of the standard library.

Define the findProxy class

class findProxy:  # uses multiple threads to check whether the crawled proxies are usable
    def __init__(self):
        self.canUseProxy = []
        self.USER_AGENTS = [
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
            "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
            "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
            "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
            "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
            "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
            "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
            "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
            "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
            "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
            "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"
        ]

self.USER_AGENTS is the list of User-Agent request headers I provide; one is chosen at random for each request.

Fetch the Xici Proxy HTML

    def getProxy(self):  # fetch the whole page
        head = {
            'User-Agent': random.choice(self.USER_AGENTS),
            'Host': 'www.xicidaili.com',
            'Accept': 'text/html, application/xhtml+xml, image/jxr, */*',
            'Accept-Encoding': 'gzip,deflate',
            'Connection': 'keep-alive'
        }  # the Host header helps the request look like a browser's
        r = requests.get('http://www.xicidaili.com/nn/', headers=head)
        return r.text

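This function returns the HTML source of the first page of Xici Proxy. The second page mostly holds yesterday's proxies, almost none of which still work, so only the first page is crawled.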

Extract the IP addresses and ports from the HTML with regular expressions and store them in the IP pool dictionary

    def proxy_dict(self, html):
        # Match the IP addresses (findall works on text data)
        proxyIp = re.findall(r'\d+\.\d+\.\d+\.\d+', html)
        # Match the ports. Assumption: each port sits in its own <td> cell. The tags
        # appear to have been lost from the published pattern, and a bare r'\d+'
        # would also match the digits inside every IP address.
        proxyPort = re.findall(r'<td>(\d+)</td>', html)
        proxyDict = {'http': [], 'https': []}

        # zip pairs each IP with its port and stops at the shorter list
        for ip, port in zip(proxyIp, proxyPort):
            # The dict passed to requests uses 'http://{0}:{1}' for both http and https
            proxyUrl = 'http://{0}:{1}'.format(ip, port)
            proxyDict['http'].append(proxyUrl)
            proxyDict['https'].append(proxyUrl)

        return proxyDict

The function above finds the 99 IPs and port numbers on the first page and stores them in the proxyDict dictionary.
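
As an alternative sketch, the IP and the port can be captured together from each table row with a single pattern, so the two lists cannot drift out of step. SAMPLE_HTML, ROW_PATTERN, and parse_proxies are hypothetical names introduced here, and the sample markup is an assumption; the real page layout may differ.

```python
import re

# Hypothetical sample mimicking a proxy-list table; the real markup may differ.
SAMPLE_HTML = """
<tr><td>1.2.3.4</td><td>8080</td><td>HTTPS</td></tr>
<tr><td>10.20.30.40</td><td>3128</td><td>HTTP</td></tr>
"""

# One pattern captures the IP cell and the port cell that follows it.
ROW_PATTERN = re.compile(r'<td>(\d+\.\d+\.\d+\.\d+)</td>\s*<td>(\d+)</td>')


def parse_proxies(html):
    """Return a list of 'http://ip:port' strings found in html."""
    return ['http://{0}:{1}'.format(ip, port)
            for ip, port in ROW_PATTERN.findall(html)]


print(parse_proxies(SAMPLE_HTML))
```

Because each match yields an (ip, port) pair, a row with a missing port is simply skipped rather than shifting every later port onto the wrong IP.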

Check whether an IP is usable

    def openProxy(self, Dict, i):
        try:
            head = {
                "User-Agent": random.choice(self.USER_AGENTS),
                "Referer": "http://ip.tool.chinaz.com/",
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
                'Connection': 'keep-alive'
            }
            proxy = Dict['http'][i]
            proxies = {'http': proxy,
                       'https': proxy}
            content = requests.get('http://ip.tool.chinaz.com/' + proxy, headers=head, proxies=proxies, timeout=50)
            # requests.get returns a Response object
            # content.encode()
            print(proxy)
            self.canUseProxy.append(proxy)
        except Exception as e:
            pass  # the request failed or timed out, so the proxy is treated as unusable

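This function checks whether an IP is usable by sending a request through it to the Chinaz (站长工具) IP test page. If the IP is not usable, the connection attempt blocks for a while. Run on a single thread, the program would waste a great deal of time here.
Next comes the most important part: with multiple threads, the total time of this I/O-bound work drops to roughly the time of the slowest single IP test, that is, the blocking time of one IP.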
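
The timing claim can be illustrated without any network access. The sketch below is not the article's code: fake_check, run_sequential, and run_threaded are hypothetical helpers, and time.sleep stands in for a blocking proxy test.

```python
import time
from threading import Thread


def fake_check(delay, results, index):
    # Stand-in for a blocking proxy test; time.sleep simulates the network wait.
    time.sleep(delay)
    results[index] = delay


def run_sequential(delays):
    """Run the checks one after another and return the elapsed seconds."""
    results = [None] * len(delays)
    start = time.time()
    for i, d in enumerate(delays):
        fake_check(d, results, i)
    return time.time() - start


def run_threaded(delays):
    """Run each check in its own thread and return the elapsed seconds."""
    results = [None] * len(delays)
    threads = [Thread(target=fake_check, args=(d, results, i))
               for i, d in enumerate(delays)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start


delays = [0.1, 0.2, 0.3]
print('sequential:', round(run_sequential(delays), 2), 'seconds')
print('threaded:  ', round(run_threaded(delays), 2), 'seconds')
```

The sequential run takes about the sum of the delays, while the threaded run takes about as long as the longest one.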

Start the threads

    def rightProxy(self):
        html = self.getProxy()
        myDict = self.proxy_dict(html)
        nLoops = len(myDict['http'])
        threads = []
        sTime = time.time()
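        # Start the threads
        for j in range(nLoops):
            t = Thread(target=self.openProxy, args=(myDict, j,))
            threads.append(t)
        for j in range(nLoops):
            # One pitfall: if N child threads are joined with join(timeout), the main thread
            # can end up waiting as long as N * timeout, because each thread's timeout only
            # starts counting once the join on the previous thread has returned.
            threads[j].daemon = True  # setDaemon() is deprecated since Python 3.10
            threads[j].start()
        for j in range(nLoops):     # wait for all threads to finish
            threads[j].join()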

        print('use ' + str(time.time() - sTime) + ' seconds')
        return self.canUseProxy
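
One way around the join(timeout) pitfall mentioned in the code comment is to share a single deadline across all the joins. This is a minimal sketch under that assumption, not part of the original class; join_all is a hypothetical helper, and time.sleep again stands in for the proxy test.

```python
import time
from threading import Thread


def join_all(threads, total_timeout):
    """Wait for all threads, but no longer than total_timeout seconds overall."""
    deadline = time.monotonic() + total_timeout
    for t in threads:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        t.join(remaining)  # each join only gets the time that is left
    return [t for t in threads if t.is_alive()]


# Two fast workers and one slow one; daemon threads do not block interpreter exit.
workers = [Thread(target=time.sleep, args=(d,), daemon=True)
           for d in (0.05, 0.1, 2.0)]
for w in workers:
    w.start()

start = time.monotonic()
still_running = join_all(workers, total_timeout=0.5)
elapsed = time.monotonic() - start
print(len(still_running), 'thread(s) still running after', round(elapsed, 2), 'seconds')
```

With a shared deadline the main thread waits at most total_timeout in total, however many threads there are.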

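Note that the threads started here all run on a single core (in CPython, the global interpreter lock lets only one thread execute Python bytecode at a time). To use multiple cores, see the multiprocessing module.
Finally, adding the line fProxies = findProxy().rightProxy() gives you all the usable proxy IPs.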
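
Once the list of usable proxies exists, one possible way to use it is to pick an entry at random for each request. The sketch below assumes the pool is a list of 'http://ip:port' strings like the one rightProxy() returns; pick_proxies and the sample pool are hypothetical.

```python
import random


def pick_proxies(pool):
    """Build a requests-style proxies dict from a random entry in pool."""
    if not pool:
        raise ValueError('proxy pool is empty')
    proxy = random.choice(pool)
    return {'http': proxy, 'https': proxy}


# Hypothetical pool; in practice this would be the list returned by rightProxy().
pool = ['http://1.2.3.4:8080', 'http://10.20.30.40:3128']
proxies = pick_proxies(pool)
print(proxies)
# The dict can then be passed as requests.get(url, proxies=proxies, timeout=10).
```

Free proxies tend to stop working quickly, so it is probably worth rebuilding the pool regularly rather than reusing an old list.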
