Using Scrapy without getting our crawler banned by the target site

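The approach is to write something that imitates a browser, since a site is unlikely to ban browsers outright. Besides imitating a browser, a few supporting measures are needed: cookie handling, IP proxies, and an interval between requests.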

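At the thought of a browser, my first reaction was: a browser is a full-blown tool, so how much code would it take to build one? It made my head spin. In fact there is no need to be as complex and complete as a real browser. It is enough to send the essentials that make the target site believe a browser is visiting. Those essentials are the request Headers: as long as we send the headers a browser would send, the disguise works.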
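To make the idea concrete, here is a minimal sketch that attaches browser-like headers to a request using only the standard library, without Scrapy. The header values are taken from the settings shown later in this post; the helper name `build_request` and the URL `http://example.com/` are placeholders chosen for illustration. Nothing is sent over the network.

```python
from urllib.request import Request

# Header values copied from the settings used later in this post.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 "
                  "(KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Hypothetical helper: return a request object carrying browser-like headers."""
    return Request(url, headers=headers)


# Placeholder URL; the request is only built, never opened.
req = build_request("http://example.com/")
for name, value in req.header_items():
    print(name, ":", value)
```

Note that `urllib` stores header names in capitalized form (for example `User-agent`), which does not matter on the wire because HTTP header names are case-insensitive.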

As shown in the figure:

[Figure 1]

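The concrete implementation is as follows.

First, create a .py file (the settings further down refer to it as cnblogs.middlewares, i.e. middlewares.py in the cnblogs project):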

# encoding: utf-8
import base64
import random

from cnblogs.settings import PROXIES


class RandomUserAgent(object):
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the candidate list from USER_AGENTS in settings.py.
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # Choose once, so the printed value is the one actually sent.
        agent = random.choice(self.agents)
        print("**************************" + agent)
        request.headers.setdefault('User-Agent', agent)


class ProxyMiddleware(object):
    """Downloader middleware that sends each request through a random proxy."""

    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        request.meta['proxy'] = "http://%s" % proxy['ip_port']
        # An empty user_pass means the proxy needs no authentication.
        if proxy['user_pass']:
            # b64encode adds no trailing newline, unlike encodestring.
            credentials = proxy['user_pass'].encode('utf-8')
            encoded_user_pass = base64.b64encode(credentials).decode('ascii')
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
            print("**************ProxyMiddleware have pass************" + proxy['ip_port'])
            print("Request headers:", request.headers)
        else:
            print("**************ProxyMiddleware no pass************" + proxy['ip_port'])

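The `Proxy-Authorization` value set by the proxy middleware is a Basic token: the word `Basic` followed by the base64 encoding of `user:password`. The sketch below shows how such a value can be built in Python 3. The helper name `basic_proxy_auth` and the credential `user:pass` are made up for illustration. It also shows why the legacy encoder is risky here: `base64.encodebytes` (the Python 3 name for the old `encodestring`) appends a newline, which does not belong in a header value.

```python
import base64


def basic_proxy_auth(user_pass):
    """Hypothetical helper: build a Proxy-Authorization value from 'user:password'."""
    token = base64.b64encode(user_pass.encode("utf-8")).decode("ascii")
    return "Basic " + token


# 'user:pass' is a made-up credential used only for this example.
header = basic_proxy_auth("user:pass")
print(header)

# The legacy-style encoder appends a trailing newline to its output.
legacy = base64.encodebytes(b"user:pass")
print(repr(legacy))
```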

Second, enable or add the necessary settings in settings.py.

Add USER_AGENTS

USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]
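The middleware fills in the header with `setdefault`, so a request that already carries its own User-Agent keeps it, and only requests without one receive a random entry from the list. The sketch below reproduces that behaviour with a plain dict standing in for the request headers. The function name `pick_user_agent` and the value `my-spider/1.0` are hypothetical, and the list is shortened to two entries from USER_AGENTS.

```python
import random

# Two entries taken from the USER_AGENTS list, shortened for the example.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]


def pick_user_agent(headers, agents, rng=random):
    """Hypothetical stand-in: fill in User-Agent only when it is missing."""
    headers.setdefault("User-Agent", rng.choice(agents))
    return headers["User-Agent"]


# A request without a User-Agent gets a random one from the list.
fresh = {}
chosen = pick_user_agent(fresh, USER_AGENTS)

# A request that already set its own User-Agent is left untouched.
preset = {"User-Agent": "my-spider/1.0"}
kept = pick_user_agent(preset, USER_AGENTS)

print(chosen)
print(kept)
```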

Add the proxy IP setting PROXIES

PROXIES = [
    {'ip_port': '111.11.228.75:80', 'user_pass': ''},
    {'ip_port': '120.198.243.22:80', 'user_pass': ''},
    {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
    {'ip_port': '101.71.27.120:80', 'user_pass': ''},
    {'ip_port': '122.96.59.104:80', 'user_pass': ''},
    {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
]

Disable cookies

COOKIES_ENABLED = False

Set the download delay

DOWNLOAD_DELAY = 3
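A fixed interval is itself a recognizable pattern. Scrapy's `RANDOMIZE_DOWNLOAD_DELAY` setting, which is enabled by default, is documented as waiting a random time between 0.5 and 1.5 times `DOWNLOAD_DELAY`. The sketch below only illustrates that range for a delay of 3 seconds. The function `randomized_delay` is a hypothetical illustration, not Scrapy's own code.

```python
import random

DOWNLOAD_DELAY = 3  # seconds, as in settings.py


def randomized_delay(base_delay, rng=random):
    """Hypothetical illustration: a wait between 0.5x and 1.5x the base delay."""
    return rng.uniform(0.5 * base_delay, 1.5 * base_delay)


# Draw a few sample waits; each falls between 1.5 and 4.5 seconds.
samples = [randomized_delay(DOWNLOAD_DELAY) for _ in range(5)]
for wait in samples:
    print("would wait %.2f s" % wait)
```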

Add DOWNLOADER_MIDDLEWARES

DOWNLOADER_MIDDLEWARES = {
    'cnblogs.middlewares.RandomUserAgent': 1,  # random user agent
    # 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'cnblogs.middlewares.ProxyMiddleware': 100,  # needed for the proxy
}
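The numbers in this dict are ordering values. Scrapy calls each middleware's `process_request` in increasing order of its value, and a value of `None` disables a middleware. The sketch below shows the resulting order for the two custom middlewares only. It ignores the built-in middlewares that Scrapy merges in from its defaults, and the function `request_order` is a hypothetical helper for illustration.

```python
DOWNLOADER_MIDDLEWARES = {
    'cnblogs.middlewares.RandomUserAgent': 1,
    'cnblogs.middlewares.ProxyMiddleware': 100,
}


def request_order(middlewares):
    """Hypothetical helper: list enabled middlewares in process_request order."""
    # Entries mapped to None are treated as disabled and skipped.
    enabled = {path: order for path, order in middlewares.items() if order is not None}
    # Lower values run first when a request is processed.
    return sorted(enabled, key=enabled.get)


order = request_order(DOWNLOADER_MIDDLEWARES)
for path in order:
    print(path)
```

With these values the User-Agent is chosen before the proxy is assigned.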

Add DEFAULT_REQUEST_HEADERS

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

That is all; at this point the crawler can be run.

Free proxy list: http://www.xicidaili.com/

Project code: https://github.com/tangyi1234/cnblogs
