However, it's not very practical; it merely works.
Known bugs:
Because of the GFW, it blows up all the time: network instability produces all sorts of Connection Aborted / SSLError: EOF occurred in violation of protocol errors.
A newly introduced bug: errors can't be logged, argh!
Fixes:
Fixed: added exception handling. A request that times out is retried, up to three attempts, with the timeout set to 1 s; after three timeouts the crawler logs the error and moves on to the next page (a loop-style sketch follows this list).
Fixed: tweaked the code a bit.
Crawl speed is now much better [roughly 3 pv/s per thread].
Bug fixed.
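This is not the author's code, just a minimal loop-style sketch of the retry-then-log behavior described above; the generated scripts below implement the same thing with nested try/except, and the helper name fetch is made up here:

import requests

def fetch(sess, url, headers, retries=3, timeout=1):
    # Try the request up to `retries` times with a short timeout.
    for _ in range(retries):
        try:
            return sess.get(url, headers=headers, timeout=timeout)
        except requests.exceptions.RequestException:
            continue
    return None  # all attempts failed; the caller logs the id and moves on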
Thrown together + learned on the fly.
Thanks: Baiduspider, for its selfless contribution (Anti-Anti-Spider Technology: the crawler borrows its User-Agent).
Results (speed is not very stable, fluctuating between roughly 1 s/pv and 10 s/pv):
(This has since changed considerably.)
Multi-process script generator:
# coding: utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf8')   # Python 2: needed to write the UTF-8 literal below

for i in range(1, 7000):
    # Each generated file crawls one 10,000-id slice of the illust_id space.
    f = open(str(i * 10) + 'k.py', 'w')
    f.write('''# coding:utf-8
import re
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)',
}
sess = requests.Session()
adapter = requests.adapters.HTTPAdapter(max_retries=20)
sess.mount('https://', adapter)

for i in range(''' + str(10000 * i) + ''', ''' + str((i + 1) * 10000) + '''):
    print(str(i))
    url = "https://www.pixiv.net/member_illust.php?mode=medium&illust_id=" + str(i)
    # Up to three attempts with a 1 s timeout; after the third failure,
    # log the id to an error file and move on to the next page.
    try:
        r = sess.get(url, headers=headers, timeout=1)
    except requests.exceptions.RequestException:
        try:
            r = sess.get(url, headers=headers, timeout=1)
        except requests.exceptions.RequestException:
            try:
                r = sess.get(url, headers=headers, timeout=1)
            except requests.exceptions.RequestException:
                err = open('err' + str(i) + '.log', "a")
                err.write(str(i) + "\\n")
                err.close()
                continue
    if r.status_code != 200:
        continue
    data = r.text
    pattern = u'class="text">初音ミク'
    piclist = re.findall(pattern, data)
    if len(piclist):
        f = open("''' + str(i * 10) + 'k-' + str((i + 1) * 10) + 'k' + '''.txt", "a")
        f.write(str(i) + '\\n')
        f.close()''')
    f.close()
Generated instance:
# coding:utf-8
import re
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)',
}
sess = requests.Session()
adapter = requests.adapters.HTTPAdapter(max_retries=20)
sess.mount('https://', adapter)

for i in range(69990000, 70000000):
    print(str(i))
    url = "https://www.pixiv.net/member_illust.php?mode=medium&illust_id=" + str(i)
    # Up to three attempts with a 1 s timeout; after the third failure,
    # log the id to an error file and move on to the next page.
    try:
        r = sess.get(url, headers=headers, timeout=1)
    except requests.exceptions.RequestException:
        try:
            r = sess.get(url, headers=headers, timeout=1)
        except requests.exceptions.RequestException:
            try:
                r = sess.get(url, headers=headers, timeout=1)
            except requests.exceptions.RequestException:
                err = open('err' + str(i) + '.log', "a")
                err.write(str(i) + "\n")
                err.close()
                continue
    if r.status_code != 200:
        continue
    data = r.text
    pattern = u'class="text">初音ミク'
    piclist = re.findall(pattern, data)
    if len(piclist):
        f = open("69990k-70000k.txt", "a")
        f.write(str(i) + '\n')
        f.close()
UPDATE: multi-process version, updated after getting home; speed is roughly 250k pv/h, combined with a bash script for uninterrupted crawling. Still slow, though (shrug). The code was heavily refactored and some missing information filled in (the run should finish once I'm home, heh). Stability has improved as well (10M pv with zero errors)?!
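The bash supervisor itself isn't shown here, so below is a hedged Python sketch of the same idea, not the actual script: run the generated worker files 100 at a time and restart any that die, which is what "uninterrupted crawling" amounts to. The run_worker name and the *k.py glob pattern are assumptions based on the generator above.

import glob
import subprocess
from multiprocessing.dummy import Pool  # thread pool; each thread just waits on one child process

def run_worker(path):
    # Crude restart-on-crash loop: rerun the script until it exits cleanly.
    # (A crashed worker restarts its whole id range, so some pages get re-fetched.)
    while subprocess.call(['python', path]) != 0:
        pass

if __name__ == '__main__':
    pool = Pool(100)  # 100 concurrent workers proved to be enough in practice
    pool.map(run_worker, sorted(glob.glob('*k.py')))
    pool.close()
    pool.join()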
System requirements:
A mainstream Linux distribution
8 GB of RAM (main process: about 2 GB virtual, 2 GB physical; not sure about the worker processes, but it runs; total memory consumption is around 4.4 GB)
Quad-core CPU (utilization roughly 170%-200%+)
40 Mbps network (external link; speed is not very stable, roughly 10-30 Mbps)
------------------------------------------------------
150 processes (though in practice 100 turned out to be plenty)
New bug introduced: with a single file at the 70M-line scale, the garbage collector blows up. orz
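One plausible workaround (my guess, not the author's fix; the file name all_ids.txt is hypothetical): never materialize the ~70M lines as one in-memory list, and instead stream the file so memory use stays constant regardless of size:

def iter_ids(path):
    # Yields one id at a time; file objects iterate lazily, so memory use is O(1).
    with open(path) as f:
        for line in f:
            yield int(line)

# Usage: for illust_id in iter_ids('all_ids.txt'): ...

Alternatively, keep the per-range output files sharded (as the generator already does) instead of merging them into one giant file.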