Implementing a Crawler Proxy IP Pool

The Idea Behind the Proxy Pool

I previously wrote a post about proxies, 爬虫和IP代理 (Crawlers and IP Proxies), which covered some proxy basics; in its second half I also briefly showed how to scrape free proxy IPs with Python and verify that they work. This post builds on that one and attempts a complete implementation of a proxy IP pool:

  • The backend is sqlite3. I had used it before as the backend when scraping NetEase Cloud Music comments, where it ended up storing roughly ten thousand comments with good read and write times, and I expect the IPs in the proxy pool to stay at about that scale. sqlite3 is also very simple and convenient to use. Even so, I wrapped it in Sqlite3api.py.
  • Besides sqlite3, the program maintains a list named iplist, whose size is set by the config item iplist_pool_size. The list acts as a cache in front of the db and only stores IPs that have passed validation. A caller obtains the IP at the head of the list via the API function proxy_get_one_ip(), which then appends that IP to the tail so the list rotates round-robin. The idea is that proxy_get_one_ip() only makes a best effort to return a valid IP; if the caller finds an IP doesn't work, it can report it via proxy_report_invalid_ip(), and the pool removes the IP from both the iplist list and the DB.
    • How is consistency between the iplist list and the DB maintained? I borrowed the idea of cache synchronization: reads only touch the cache, while writes go to both the list and the db.
  • Currently I fetch IPs directly from the 大象代理 (daxiangdaili) API. A while ago I paid 5 RMB for 10,000 IPs and have used fewer than 2,000 so far. There are also plenty of free proxy IPs to scrape, but it's worth noting that even the IPs I paid for are only about 20-30% usable, and I don't know how long they stay valid. The method for scraping free proxy IPs is in the earlier post, so I won't repeat it. Personally I'd still recommend spending the 5 RMB; it's cheap anyway.
  • The program provides another API function, proxy_fill_db(), which fetches IPs and writes the valid ones into the database; its main purpose is to grow the number of IPs in the db.
  • When proxy_get_one_ip() is called, the pool first checks whether iplist has content. If so, it returns the first element directly; if not, it calls the internal function __refresh().
    • The goal of __refresh() is to top iplist up to at least iplist_pool_size entries. It first pulls the needed number of IPs from the database; if the database doesn't have enough, it fetches N IPs from the proxy API via __get_ip_thr_api().
    • To determine N, a sample is taken first: fetch 10 IPs from the proxy API and check their validity. Say 4 of the 10 work, giving ratio = 4/10 = 0.4. If we still need missing_ip_num IPs, then N = missing_ip_num / ratio * factor, where factor can be any positive integer, since the ratio isn't always that high (in my tests it's usually around 0.2-0.3, and sometimes 0.0). In short, a larger factor makes it more likely to obtain enough valid IPs, but also takes more time. Finally, the pool validates each fetched IP and writes the valid ones into iplist and the db.
    • One more point: in the earlier post I described an alternative design where, to keep the validity rate up, the pool probabilistically decides whether to re-check an IP before handing it out. For example, with p1 = 0.4, N = 5000 and p2 = 0.9, x/N = 0.83, meaning there is an 83% chance the IP is checked before being returned; if the check fails, a new IP is fetched. Since p1 is only 0.4, or even lower, it usually takes 2-3 checks to land on a valid IP. I think both approaches work, and the total waiting time is the same; it's just a matter of waiting at initialization versus waiting when an IP is returned.
  • To check whether an IP is valid, the pool currently uses it to request the Baidu homepage and checks whether the response code is 200 OK.
  • Some test code is included as well.
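The N calculation described above can be sketched as a small helper. This is an illustrative standalone version (the function name estimate_fetch_size is mine, not from the pool's code):

```python
def estimate_fetch_size(missing_ip_num, ratio, factor=1):
    """Roughly how many raw IPs to request from the API so that,
    at the observed validity ratio, about missing_ip_num of them
    will survive validation."""
    if ratio == 0.0:
        # extremely bad sample: fall back to a fixed multiple
        return missing_ip_num * factor * 2
    return int(missing_ip_num / ratio * factor)

# If half of the sampled IPs were valid and 6 more are needed,
# request 12 raw IPs from the API.
print(estimate_fetch_size(6, 0.5))  # 12
```

A larger factor simply scales the request up, trading fetch-and-validate time for a better chance of filling the pool in one pass.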

One more aside, about the unittest library: when I wrote C I had to write a lot of test code and usually named the test function unittest, and I carried the habit over to Python, assuming it worked the same way. Only recently did I learn that Python actually has a testing framework called unittest, which can be used to write test cases and automate testing. I won't go into it here.
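For reference, here is a minimal, hypothetical example of what a test case in that framework looks like (not part of this project):

```python
import unittest

class TestArithmetic(unittest.TestCase):
    def test_addition(self):
        self.assertEqual(1 + 1, 2)

    def test_membership(self):
        self.assertIn(3, [1, 2, 3])

if __name__ == '__main__':
    unittest.main(argv=['prog'], exit=False)
```

Running the module discovers every test_* method, runs each one, and reports failures automatically.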

Problems

  • The biggest issue right now is the lack of multithreading support; I'll add it as soon as I can.
  • If there's demand, I'll also add the part that scrapes free proxy IPs.
  • There is no duplicate detection for IPs. Having thought it over, duplicates are nearly impossible in the current implementation, but once multiple sources are added, a duplicate check will be needed before writing to the db.
  • Also, since I've long been used to writing C, I rarely use classes in Python, and I'm not sure whether that's a problem.
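For the future multi-source case, a duplicate check before insertion could look like this sketch. It reuses the proxy table schema from proxy.py but runs against an in-memory database; the helper name insert_ip_if_new is mine:

```python
import sqlite3
import time

def insert_ip_if_new(conn, ip):
    """Insert ip into the proxy table only if it is not already there.
    Returns True if a new row was inserted."""
    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM proxy WHERE ip = ?", (ip,))
    if cur.fetchone()[0] > 0:
        return False
    cur.execute("INSERT INTO proxy VALUES (?, 1, ?)",
                (int(time.time() * 1000), ip))
    conn.commit()
    return True

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE proxy (t real, valid real, ip text)")
print(insert_ip_if_new(conn, '1.2.3.4:8080'))  # True
print(insert_ip_if_new(conn, '1.2.3.4:8080'))  # False (duplicate)
```

Declaring the ip column UNIQUE and relying on INSERT OR IGNORE would be an equivalent, db-enforced alternative.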

Code

Configure.py

# proxy related
daxiang_proxy_tid = 0  # placeholder: replace with your own daxiangdaili order number
iplist_pool_size = 10
#iplist_thread_num = 1

# user agents for validation requests (proxy.py picks one at random)
FakeUserAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15',
]

# DB
db_name = 'proxy.db'

Sqlite3api.py

import sqlite3
import os
import Configure

def sqlite3_init():
    conn = None
    try:
        conn = sqlite3.connect(Configure.db_name)
    except Exception as e:
        print ('sqlite3 init fail.')
        print (e)

    return conn

def sqlite3_execute(conn, sql, args=None):
    data = []
    try:
        cur = conn.cursor()
        if args:
            cur.execute(sql, args)
        else:
            cur.execute(sql)
        data = cur.fetchall()
        conn.commit()
    except Exception as e:
        print (e, "[SQL]:" + sql.strip())
        conn.rollback()

    return data

def sqlite3_close(conn):
    conn.close()

def unittest():
    conn = sqlite3_init()

    sqlite3_execute(conn, "CREATE TABLE stocks (date text, trans text, symbol text, qty real, price real)")

    sqlite3_execute(conn, "INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
    sqlite3_execute(conn, "INSERT INTO stocks VALUES ('2006-03-28', 'BUY', 'IBM', 1000, 45.00)")
    sqlite3_execute(conn, "INSERT INTO stocks VALUES ('2006-04-05', 'BUY', 'MSFT', 1000, 72.00)")
    sqlite3_execute(conn, "INSERT INTO stocks VALUES ('2006-04-06', 'SELL', 'IBM', 500, 53.00)")

    assert 4 == sqlite3_execute(conn, "SELECT count(*) FROM stocks")[0][0]

    sqlite3_execute(conn, "DROP TABLE stocks")

    sqlite3_close(conn)

if __name__ == '__main__':
    unittest()
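Note that sqlite3_execute also accepts an optional args tuple, letting sqlite3 bind the values itself rather than splicing them into the SQL string. A standalone sketch of the same idea against an in-memory database:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("CREATE TABLE stocks (date text, trans text, symbol text, qty real, price real)")

# Each ? is bound to the matching tuple element by sqlite3 itself,
# which also protects against SQL injection.
cur.execute("INSERT INTO stocks VALUES (?, ?, ?, ?, ?)",
            ('2006-01-05', 'BUY', 'RHAT', 100, 35.14))
conn.commit()

cur.execute("SELECT price FROM stocks WHERE symbol = ?", ('RHAT',))
print(cur.fetchone()[0])  # 35.14
```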

proxy.py

import requests
import ast
from random import choice
import os
import time

import Configure
import Sqlite3api as sqlite3

iplist = []

def proxy_init():
    conn = sqlite3.sqlite3_init()
    sqlite3.sqlite3_execute(conn, "CREATE TABLE IF NOT EXISTS proxy (t real, valid real, ip text)")
    sqlite3.sqlite3_close(conn)

# main entry: return an ip supporting the given protocol (http or https)
def proxy_get_one_ip(protocol='all'):
    global iplist

    if not iplist:
        __refresh(protocol)

    # pop() removes and returns the last element by default;
    # here we take the first element, then rotate it to the tail
    ip = iplist.pop(0)
    iplist.append(ip)

    return ip

def proxy_report_invalid_ip(ip, protocol=None):
    # sync write op
    # remove it from the list (if it is still there)
    if ip in iplist:
        iplist.remove(ip)

    # delete it from the db
    conn = sqlite3.sqlite3_init()
    sqlite3.sqlite3_execute(conn, "DELETE FROM proxy WHERE ip = ?", (ip,))
    sqlite3.sqlite3_close(conn)

    return proxy_get_one_ip(protocol) if protocol else None


def proxy_fill_db(num=20):
    samples = __get_ip_thr_api(num)
    cnt = 0

    conn = sqlite3.sqlite3_init()

    for sample in samples:
        ip = "{0:s}:{1:d}".format(sample.get('host'), sample.get('port'))
        if __validation(ip):
            sqlite3.sqlite3_execute(conn, "INSERT INTO proxy VALUES (?, 1, ?)",
                                    (int(time.time() * 1000), ip))
            #print ("Found one: {0:s}.".format(ip))
            cnt += 1

    sqlite3.sqlite3_close(conn)
    return cnt

# get ip through api from http://www.daxiangdaili.com
def __get_ip_thr_api(num=1, protocol='all'):
    tid = Configure.daxiang_proxy_tid

    url = "http://tvp.daxiangdaili.com/ip/?tid={0:d}&num={1:d}&delay=3&category=2&sortby=time&filter=on&format=json".format(tid, num)

    if protocol == 'https':
        url += "&protocol=https"

    content = None
    try:
        response = requests.get(url)
        if response.status_code == requests.codes.ok:
            content = response.text
    except Exception as e:
        print (e)

    # return an empty list instead of crashing when the request failed
    if not content:
        return []

    return ast.literal_eval(content.strip())

# validate ip, addr format: "ip:port"
def __validation(addr):
    proxies = {
        "http": "http://{0}".format(addr),
        "https": "http://{0}".format(addr)
    }

    header = {}
    header['user-agent'] = choice(Configure.FakeUserAgents)

    try:
        response = requests.get("https://www.baidu.com", headers=header, proxies=proxies, timeout=5)
    except Exception as e:
        #print (e)
        return False
    else:
        return response.status_code == requests.codes.ok

def __cal_sample_ratio(sample_size, protocol):
    # collect a sample to calculate the validity ratio
    cnt, valid = sample_size, 0
    addrs = __get_ip_thr_api(cnt, protocol)
    conn = sqlite3.sqlite3_init()

    for addr in addrs:
        ip = "{0:s}:{1:d}".format(addr.get('host'), addr.get('port'))
        if __validation(ip):
            valid += 1
            # Logically this ip should not be added to the db or iplist,
            # because this func is only for ratio calculation.
            # However, valid ips are hard to come by, so don't waste any.
            iplist.append(ip)
            sqlite3.sqlite3_execute(conn, "INSERT INTO proxy VALUES (?, 1, ?)",
                                    (int(time.time() * 1000), ip))

    sqlite3.sqlite3_close(conn)

    # calculate the validity ratio
    ratio = float(valid) / cnt
    print ("ratio is {0}".format(ratio))
    return ratio

def __refresh(protocol='all'):
    print ("[Start updating pool]")

    global iplist

    # need this number of valid ips to fill the pool
    missing_ip_num = Configure.iplist_pool_size - len(iplist)

    # (1) get them from db
    #       normally the db should return enough valid ips

    conn = sqlite3.sqlite3_init()
    data = sqlite3.sqlite3_execute(conn, "SELECT ip, valid FROM proxy WHERE valid = 1 ORDER BY t LIMIT ?", (missing_ip_num,))
    sqlite3.sqlite3_close(conn)

    for item in data or []:
        iplist.append(item[0])

    # check the results
    if len(iplist) >= Configure.iplist_pool_size:
        print ("Enough valid ips in pool. Refresh finished.")
        return
    else:
        missing_ip_num = Configure.iplist_pool_size - len(iplist)
        print ("Still need {0:d} ips after retrieving from db.".format(missing_ip_num))

    # (2) this means even used all valid ips in db, pool is still not full
    #       then get more valid ips from api or other sources
    # calculate ratio
    ratio = __cal_sample_ratio(10, protocol)
    
    # Since inside __cal_sample_ratio() would add new ips to db
    # I did one more time check here
    if len(iplist) >= Configure.iplist_pool_size:
        print ("Enough valid ips in pool. Refresh finished.")
        return

    # higher means more valid ips, but slower to process
    factor = 1

    # __sample_size__ means based on the validity ratio, this number of ips are needed
    # to filter enough valid ips, approximately
    if ratio == 0.00:
        # extremely bad sample
        sample_size = missing_ip_num * factor * 2 
    else:
        sample_size = int(1/ratio * missing_ip_num * factor)

    print ("Need to collect {0:d} ips for validation test".format(sample_size))

    samples = __get_ip_thr_api(sample_size, protocol)
    # TODO multi threads
    # if sample_size > a certain number, do multi threads
    # else:
    conn = sqlite3.sqlite3_init()
        
    for sample in samples:
        ip = "{0:s}:{1:d}".format(sample.get('host'), sample.get('port'))
        if __validation(ip):
            # sync write operation
            iplist.append(ip)
            sqlite3.sqlite3_execute(conn, "INSERT INTO proxy VALUES (?, 1, ?)", (int(time.time() * 1000), ip))
            print ("Found one: {0:s}.".format(ip))

            if len(iplist) >= Configure.iplist_pool_size:
                print (len(iplist))
                break

    sqlite3.sqlite3_close(conn)

    # Note that the list length is not checked again here, so the list
    # may still not be full if the sample didn't contain enough valid ips.
    # Generally this should not happen often; call proxy_fill_db() enough
    # times beforehand to make sure the db already has enough valid ips.
    print ("[pool updated.]")

def unittest():
    global iplist
    assert 1 == len(__get_ip_thr_api(1, 'https'))
    assert 1 == len(__get_ip_thr_api(1))

    proxy_init()

    conn = sqlite3.sqlite3_init()
    cnt = sqlite3.sqlite3_execute(conn, "SELECT count(*) FROM proxy")[0][0]
    res = proxy_fill_db(10)

    assert (cnt + res) == sqlite3.sqlite3_execute(conn, "SELECT count(*) FROM proxy")[0][0]

    assert res > 0
    ip = proxy_get_one_ip()

    cnt1 = sqlite3.sqlite3_execute(conn, "SELECT count(*) FROM proxy")[0][0]
    proxy_report_invalid_ip(ip)
    cnt2 = sqlite3.sqlite3_execute(conn, "SELECT count(*) FROM proxy")[0][0]

    assert 1 == (cnt1 - cnt2)
    assert 0 == sqlite3.sqlite3_execute(conn, "SELECT count(*) FROM proxy WHERE ip = ?", (ip,))[0][0]
    sqlite3.sqlite3_close(conn)

if __name__ == '__main__':
    unittest()
