The Idea Behind the Proxy Pool
I previously wrote a post about proxies, Crawlers and IP Proxies, which covered some proxy basics; in its second half I also briefly described how to scrape free proxy IPs with Python and verify that they work. This article builds on that and attempts a complete implementation of a proxy IP pool:
- The backend is sqlite3, because I used it when I scraped NetEase Cloud Music comments: I ended up storing roughly ten thousand comments and both read and write times were quite good, and I expect the IPs in this pool to stay at about the same scale. sqlite3 is also very simple and convenient to use. Still, I wrapped it in `Sqlite3api.py`.
- Besides sqlite3, the program also maintains a list named `iplist`, whose size is set by the configuration item `iplist_pool_size`. The list acts as a cache in front of the db and only ever holds IPs that have been verified as valid. A caller fetches the IP at the head of the list via the API method `proxy_get_one_ip()`, and that IP is then appended to the end of the list, so IPs are reused in rotation. The idea is that `proxy_get_one_ip()` only tries its best to return a valid IP; if the caller finds the IP is invalid, it can report it via `proxy_report_invalid_ip()`, and the pool deletes that IP from both the `iplist` list and the db.
- How do we keep the `iplist` list and the db consistent? I borrowed the synchronization idea from caches: reads only ever touch the cache, while writes go to both the list and the db.
- Currently I fetch IPs directly from the Daxiang Daili API; a while ago I paid 5 RMB for 10,000 IPs and have used fewer than 2,000 so far. You can of course scrape plenty of free proxy IPs instead, but even the IPs I paid for are only valid around 20-30% of the time, and I don't know how long they stay valid. The method for scraping free proxy IPs is in the earlier article, so I won't repeat it here; personally I'd still recommend spending the 5 RMB, it's cheap enough.
- The program provides another API method, `proxy_fill_db()`, which fetches IPs and writes the valid ones into the database; its main purpose is to grow the number of IPs stored in the db.
- When `proxy_get_one_ip()` is called, the pool first checks whether `iplist` has any content; if it does, the first element is returned directly, otherwise the internal function `__refresh()` is called.
- The ultimate goal of `__refresh()` is to top `iplist` up to at least `iplist_pool_size` IPs. It first pulls the needed number of IPs from the database; if the database does not have enough, it fetches `N` IPs from the proxy API via `__get_ip_thr_api()`.
- To determine `N`, the pool first takes a sample: it fetches 10 IPs from the proxy API and checks their validity. Say 4 of the 10 work; it then computes `ratio = 4/10 = 0.4`. If we still need `missing_ip_num` IPs, `N` is `missing_ip_num/ratio*factor`, where `factor` can be any positive integer, because `ratio` is not always this high: in my tests it is mostly around 0.2 to 0.3, and sometimes even 0.0. In short, a larger `factor` is likely to yield more valid IPs but also takes longer to process. Finally, the pool validates every fetched IP, and the valid ones are written into both `iplist` and the db. (A small sketch of this calculation follows the list below.)
- One more point. In the earlier article I sketched an alternative design for keeping the validity rate up: when the pool hands out an IP, it probabilistically decides whether to re-check it first. For example, with p1=0.4, N=5000, p2=0.9 we get x/N=0.83, i.e. an 83% chance that an IP is checked before being handed out (this follows from requiring (x/N)·1 + (1−x/N)·p1 = p2, so x/N = (p2−p1)/(1−p1) ≈ 0.83); if the check fails, a new IP must be fetched. Since p1 is only 0.4, or even lower, it usually takes 2-3 checks to guarantee a valid IP. I think both approaches work and the total waiting time is the same; one waits at initialization, the other waits when the IP is returned.
- To check whether an IP works, the pool currently uses it to request the Baidu homepage and checks that the response code is 200 OK.
- Some test code is also included.
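As promised above, here is a minimal sketch of the sample-size estimate, factored out for illustration. The function and variable names are made up; the real logic lives in `__refresh()` and `__cal_sample_ratio()` in `proxy.py` below:

```python
def estimate_sample_size(missing_ip_num, ratio, factor=1):
    """How many ips to request so that, at the observed validity
    ratio, we can expect at least missing_ip_num valid ones."""
    if ratio == 0.0:
        # extremely bad sample: fall back to a fixed multiple
        return missing_ip_num * factor * 2
    return int(missing_ip_num / ratio * factor)

# e.g. 6 ips still missing, 4 out of 10 sampled ips were valid:
print(estimate_sample_size(6, 0.4))  # -> 15
```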
One more thing, about the `unittest` library. Back when I wrote C, I had to write a lot of test code, and I usually named the test function `unittest`. I later noticed similar functions in Python code and assumed they worked the same way as in C; only recently did I learn that there is actually a testing framework called `unittest` for writing test cases and automating tests. I won't go into it further here, though.
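For reference, a minimal example of that framework might look like this (a made-up test case, not part of the proxy pool code):

```python
import unittest

class TestArithmetic(unittest.TestCase):
    def test_add(self):
        # assertEqual fails the test if the two values differ
        self.assertEqual(2 + 2, 4)

if __name__ == '__main__':
    unittest.main()  # discovers and runs the TestCase above
```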
Problems
- The biggest problem right now is the lack of multithreading support; I will add it as soon as I can.
- If there is demand, I will also add the part that scrapes free proxy IPs.
- There is no duplicate check on IPs. Having thought about it, duplicates are almost impossible in the current implementation; but if more sources are added later, a duplicate check will be needed before writing to the db.
- One more thing: since I have long been used to writing C, I rarely use classes in Python, and I am not sure whether that is a problem.
Code
Configure.py
```python
# proxy related
daxiang_proxy_tid =   # order number (fill in your own tid)
iplist_pool_size = 10
#iplist_thread_num = 1

# assumed entry: proxy.py picks a random user-agent from this list
# via choice(Configure.FakeUserAgents); put your own strings here
FakeUserAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
]

# DB
db_name = 'proxy.db'
```
Sqlite3api.py
```python
import sqlite3

import Configure

def sqlite3_init():
    conn = None   # avoid returning an unbound name if connect() fails
    try:
        conn = sqlite3.connect(Configure.db_name)
    except Exception as e:
        print('sqlite3 init fail.')
        print(e)
    return conn

def sqlite3_execute(conn, sql, args=None):
    data = None
    try:
        cur = conn.cursor()
        if args:
            cur.execute(sql, args)
        else:
            cur.execute(sql)
        data = cur.fetchall()
    except Exception as e:
        print(e, "[SQL]:" + sql.strip())
        conn.rollback()
    conn.commit()
    if data:
        return data
    return None

def sqlite3_close(conn):
    conn.close()

def unittest():
    conn = sqlite3_init()
    sqlite3_execute(conn, "CREATE TABLE stocks (date text, trans text, symbol text, qty real, price real)")
    sqlite3_execute(conn, "INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
    sqlite3_execute(conn, "INSERT INTO stocks VALUES ('2006-03-28', 'BUY', 'IBM', 1000, 45.00)")
    sqlite3_execute(conn, "INSERT INTO stocks VALUES ('2006-04-05', 'BUY', 'MSFT', 1000, 72.00)")
    sqlite3_execute(conn, "INSERT INTO stocks VALUES ('2006-04-06', 'SELL', 'IBM', 500, 53.00)")
    assert 4 == sqlite3_execute(conn, "SELECT count(*) FROM stocks")[0][0]
    sqlite3_execute(conn, "DROP TABLE stocks")
    sqlite3_close(conn)

if __name__ == '__main__':
    unittest()
```
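A side note on `sqlite3_execute()`: it already accepts an optional `args` tuple and passes it through to `cursor.execute()`, so callers could use parameterized queries instead of building SQL with `str.format()` as the code below does. For instance:

```python
conn = sqlite3_init()
sqlite3_execute(conn, "CREATE TABLE IF NOT EXISTS stocks (date text, trans text, symbol text, qty real, price real)")
# '?' placeholders avoid quoting problems with the inserted values
sqlite3_execute(conn, "INSERT INTO stocks VALUES (?,?,?,?,?)",
                ('2006-01-05', 'BUY', 'RHAT', 100, 35.14))
sqlite3_close(conn)
```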
proxy.py
```python
import requests
import ast
from random import choice
import time

import Configure
import Sqlite3api as sqlite3

iplist = []

def proxy_init():
    conn = sqlite3.sqlite3_init()
    sqlite3.sqlite3_execute(conn, "CREATE TABLE proxy (t real, valid real, ip text)")
    sqlite3.sqlite3_close(conn)
# main entry, return an ip supporting certain protocols (either http or https)
def proxy_get_one_ip(protocol='all'):
    global iplist
    if not iplist:
        __refresh(protocol)
    # pop() returns and removes the last element by default, amazing!
    # here we need the first element, then recycle it to the tail
    ip = iplist.pop(0)
    iplist.append(ip)
    return ip
def proxy_report_invalid_ip(ip, protocol=None):
    # sync write op: remove it from the list ...
    if ip in iplist:
        iplist.remove(ip)
    # ... and delete it from the db
    conn = sqlite3.sqlite3_init()
    sqlite3.sqlite3_execute(conn, "DELETE FROM proxy WHERE ip = '{0:s}'".format(ip))
    sqlite3.sqlite3_close(conn)
    # optionally hand back a replacement ip
    return proxy_get_one_ip(protocol) if protocol else None
def proxy_fill_db(num=20):
    samples = __get_ip_thr_api(num)
    cnt = 0
    conn = sqlite3.sqlite3_init()
    for sample in samples:
        ip = "{0:s}:{1:d}".format(sample.get('host'), sample.get('port'))
        if __validation(ip):
            sqlite3.sqlite3_execute(conn, "INSERT INTO proxy VALUES ({0:d},1,'{1:s}')".format(int(time.time()*1000), ip))
            #print ("Found one: {0:s}.".format(ip))
            cnt += 1
    sqlite3.sqlite3_close(conn)
    return cnt
# get ip through api from http://www.daxiangdaili.com
def __get_ip_thr_api(num=1, protocol='all'):
    tid = Configure.daxiang_proxy_tid
    url = "http://tvp.daxiangdaili.com/ip/?tid={0:d}&num={1:d}&delay=3&category=2&sortby=time&filter=on&format=json".format(tid, num)
    if protocol == 'https':
        url += "&protocol=https"
    content = None   # initialized here so a failed request doesn't leave it unbound
    try:
        response = requests.get(url)
        if response.status_code == requests.codes.ok:
            content = response.text
    except Exception as e:
        print(e)
    if not content:
        return []
    # the api returns a json-like list of dicts; parse it into Python objects
    return ast.literal_eval(content.strip())
# validate ip, addr format: [ip:port]
def __validation(addr):
    # both http and https traffic go through the same http proxy endpoint
    proxies = {
        "http": "http://{0}".format(addr),
        "https": "http://{0}".format(addr)
    }
    header = {'user-agent': choice(Configure.FakeUserAgents)}
    try:
        response = requests.get("https://www.baidu.com", headers=header, proxies=proxies, timeout=5)
    except Exception:
        return False
    # a non-200 response also counts as invalid
    return response.status_code == requests.codes.ok
def __cal_sample_ratio(sample_size, protocol):
    # collect a sample to calculate the validity ratio
    cnt, valid = sample_size, 0
    addrs = __get_ip_thr_api(cnt, protocol)
    conn = sqlite3.sqlite3_init()
    for addr in addrs:
        ip = "{0:s}:{1:d}".format(addr.get('host'), addr.get('port'))
        if __validation(ip):
            valid += 1
            # Logically this ip should not be added to the db or iplist,
            # because this func is only for ratio calculation.
            # However, valid ips are hard to come by, so don't waste any.
            iplist.append(ip)
            sqlite3.sqlite3_execute(conn, "INSERT INTO proxy VALUES ({0:d},1,'{1:s}')".format(int(time.time()*1000), ip))
    sqlite3.sqlite3_close(conn)
    # calculate validity ratio
    ratio = valid / float(cnt)
    print("ratio is {0}".format(ratio))
    return ratio
def __refresh(protocol='all'):
    print("[Start updating pool]")
    global iplist
    # need this number of valid ips to fill the pool
    missing_ip_num = Configure.iplist_pool_size - len(iplist)
    # (1) get them from the db
    # normally the db should return enough valid ips
    conn = sqlite3.sqlite3_init()
    data = sqlite3.sqlite3_execute(conn, "SELECT ip, valid FROM proxy WHERE valid = 1 ORDER BY t LIMIT {0:d}".format(missing_ip_num))
    sqlite3.sqlite3_close(conn)
    # sqlite3_execute() returns None when there are no rows
    for item in data or []:
        iplist.append(item[0])
    # check the results
    if len(iplist) >= Configure.iplist_pool_size:
        print("Enough valid ips in pool. Refresh finished.")
        return
    missing_ip_num = Configure.iplist_pool_size - len(iplist)
    print("Still need {0:d} ips after retrieving from db.".format(missing_ip_num))
    # (2) even after using all valid ips in the db, the pool is still not full,
    # so get more valid ips from the api or other sources.
    # calculate the validity ratio first
    ratio = __cal_sample_ratio(10, protocol)
    # __cal_sample_ratio() adds new ips to the db and iplist as a side effect,
    # so check once more here
    if len(iplist) >= Configure.iplist_pool_size:
        print("Enough valid ips in pool. Refresh finished.")
        return
    # higher means more valid ips, but slower to process
    factor = 1
    # sample_size: based on the validity ratio, approximately this many ips
    # are needed to filter out enough valid ones
    if ratio == 0.00:
        # extremely bad sample
        sample_size = missing_ip_num * factor * 2
    else:
        sample_size = int(1/ratio * missing_ip_num * factor)
    print("Need to collect {0:d} ips for validation test".format(sample_size))
    samples = __get_ip_thr_api(sample_size, protocol)
    # TODO multi threads
    # if sample_size > a certain number, do multi threads
    # else:
    conn = sqlite3.sqlite3_init()
    for sample in samples:
        ip = "{0:s}:{1:d}".format(sample.get('host'), sample.get('port'))
        if __validation(ip):
            # sync write operation
            iplist.append(ip)
            sqlite3.sqlite3_execute(conn, "INSERT INTO proxy VALUES ({0:d},1,'{1:s}')".format(int(time.time()*1000), ip))
            print("Found one: {0:s}.".format(ip))
            if len(iplist) >= Configure.iplist_pool_size:
                print(len(iplist))
                break
    sqlite3.sqlite3_close(conn)
    # Note the list length is not checked again here, so the list may
    # still not be full if the sample did not contain enough valid ips.
    # Generally this should not happen often; call proxy_fill_db()
    # enough times to make sure the db already has enough valid ips.
    print("[pool updated.]")
def unittest():
    global iplist
    assert 1 == len(__get_ip_thr_api(1, 'https'))
    assert 1 == len(__get_ip_thr_api(1))
    proxy_init()
    conn = sqlite3.sqlite3_init()
    cnt = sqlite3.sqlite3_execute(conn, "SELECT count(*) FROM proxy")[0][0]
    res = proxy_fill_db(10)
    assert (cnt + res) == sqlite3.sqlite3_execute(conn, "SELECT count(*) FROM proxy")[0][0]
    if res > 0:
        ip = proxy_get_one_ip()
        cnt1 = sqlite3.sqlite3_execute(conn, "SELECT count(*) FROM proxy")[0][0]
        proxy_report_invalid_ip(ip)
        cnt2 = sqlite3.sqlite3_execute(conn, "SELECT count(*) FROM proxy")[0][0]
        assert 1 == (cnt1 - cnt2)
        assert 0 == sqlite3.sqlite3_execute(conn, "SELECT count(*) FROM proxy WHERE ip = '{0:s}'".format(ip))[0][0]
    sqlite3.sqlite3_close(conn)

if __name__ == '__main__':
    unittest()
```
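Finally, a quick sketch of how a crawler might consume the pool (hypothetical usage built on the API methods above, not code from the project itself):

```python
import requests
import proxy

proxy.proxy_init()        # first run only: creates the proxy table
proxy.proxy_fill_db(50)   # stock the db with some validated ips

ip = proxy.proxy_get_one_ip()
try:
    requests.get("http://example.com",
                 proxies={"http": "http://" + ip}, timeout=5)
except requests.RequestException:
    # report the dead ip; passing a protocol makes the pool hand back a fresh one
    ip = proxy.proxy_report_invalid_ip(ip, protocol='all')
```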