A list of HTTP User-Agent keywords for bad spiders, search-engine bots, and crawlers


This article is reposted from http://www.mr-fu.com/4532/

The array below lists crawlers, spiders, and bots that are of no practical value to a website.

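As soon as any keyword from the array below appears in HTTP_USER_AGENT, the request can be rejected outright. Spiders that bring traffic (Baidu, Google, 360, and so on) have already been excluded from the list. Yandex brings essentially no traffic to Chinese-language sites, so it is included.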

This array is updated continuously. In several months of use it has not blocked a legitimate visitor by mistake.

$bad_spiders_array=array(
    'Crawler','Barkrowler','CakePHP','GarlikCrawler','Go-http-client','ias_crawler','ICC-Crawler','PotPlayer',
    'Riddler','Scrapy','WINAMP','viz/viz','ZXing','Castro','Jakarta Commons','ltx71','NativeHost',
    'SalesIntelligent','Xenu Link Sleuth','Y!J-ASR','BUbiNG','CRAZYWEBCRAWLER','http Cnrdn',
    'Lavf','NSPlayer','spray-can','stagefright','voltron','LibVLC','A6-Indexer','crawler4j',
    'wsr-agent','DigitalPebble Crawler','MBCrawler','AhrefsBot','GrapeshotCrawler','proximic','SemrushBot',
    'ahoy!','alkaline','ananzi','anthill','arachnophilia','arale','araneo','aretha','ariadne','arks','askjeeves',
    'atn worldwide','auresys','backrub','big brother','bjaaland','blackwidow','bloodhound','calif','cassandra',
    'christcrawler.com','churl','cienciaficcion.net','cmc/0.01','collective','combine system',
    'computingsite robi/1.0','crawler.feedback','cusco','cyberspyder link test','katalog/index',
    'die blinde kuh','digger','direct hit grabber','download express','dwcp','ebiness','e-collector',
    'emacs-w3 search engine','esculapio','esther','evliya celebi','fastcrawler','felix ide','fetchrover',
    'fido','fish search','fouineur','freecrawl','funnelweb','gazz','gcreep','getterroboplus puu',
    'geturl','golem','grapnel/0.01 experiment','griffon','gromit','Gluten','hämähäkki','harvest','havindex',
    'hi (html index) search','hku www octopus','ht://dig','html_analyzer','htmlgobble','hyper-decontextualizer',
    'ia_archiver','ibm_planetwide','image.kapsi.net','imagelock','incywincy','informant','infoseek sidewinder',
    'ingrid','inktomi slurp','inspector web','intelliagent','internet shinchakubin','iron33','israeli-search',
    'javabee','jcrawler','jumpstation','katipo','kdd-explorer','kilroy','kit-fireball','labelgrabber','larbin',
    'legs','link validator','linkscan','linkwalker','lockon','logo.gif crawler','lycos','mac wwwworm','magpie',
    'marvin/infoseek','mattie','mediafox','merzscope','mindcrawler','mnogosearch search engine software',
    'moget','monster','motor','muncher','muninn','muscat ferret','mwd.search','nec-meshexplorer','nederland.zoek',
    'netcarta webmap engine','netmechanic','netscoop','newscan-online','nhse web forager','nomad',
    'northern light gulliver','nzexplorer','objectssearch','occam','OOZBOT','openfind data gatherer','orb search',
    'pack rat','pageboy','parasite','patric','pegasus','perlcrawler 1.0','pgp key agent','phpdig','piltdownman',
    'pioneer','plumtreewebaccessor','poppi','popular iconoclast','raven search','roadhouse crawling system',
    'robofox','robozilla','rules','scooter','search.aus-au.com','searchprocess','senrigan','sg-scout','shagseeker',
    'sift','site searcher','site valet','sitetech-rover','skymob.com','slcrawler','sleek','snooper','suke',
    'suntek search engine','sven','sygol','tach black widow','tarantula','templeton','the peregrinator',
    'the web moose','the web wombat','the world wide web wanderer','the world wide web worm','titan','titin',
    'ucsd crawl','udmsearch','unnamed','url check','valkyrie','verticrawl','victoria','vision-search',
    'voyager','w3m2','w3mir','walhello appie','wallpaper (alias crawlpaper)','web core / roots',
    'webcatcher','webcopy','webfetcher','webinator','weblayers','weblinker','weblog monitor',
    'webmirror','webquest','webreaper','websnarf','webstolperer','webvac','webwalk','webwalker','webwatch',
    'webzinger','wget','whatuseek winona','wild ferret web hopper','wired digital','wwwc ver',
    'xget','daumoa','jobo','echo!','linkchecker','bloglines','twiceler','appie','sun4u','httrack','sisi',
    'robi','webster pro','webster','zeus','scirus','picosearch','plucker','disco pump','gulliver','emailsiphon',
    'teleport pro','fetch','pamuk','webcopier','webcapture','mass downloader','awv0.8d',
    'crescent internet toolpak','webstripper','sitesucker','webdup','python-urllib','python',
    'franklin locator','ck-sillydog','pockethttp','java','kototoi.org','teragramwebcrawler','vagabondo',
    'nogoop-httpclient','myoperatb','accoona-ai-agent','arachmo','b-l-i-t-z-b-o-t','boitho.com-dc',
    'cerberian drtrs','charlotte','converacrawler','cosmos','covario ids','dataparksearch','earthcom.info',
    'fast enterprise crawler','fast-webcrawler','findlinks','g2crawler','holmes','htdig','iccrawler','ichiro',
    'igdespyder','issuecrawler','l.webis','lwp-trivial','mabontland','magpie-crawler','mnogosearch','mogimogi',
    'morning paper','mvaclient','netresearchserver','netseer crawler','newsgator','ng-search','nutchcvs',
    'nymesis','oegp','orbiter','peew','pompos','postpost','pycurl','qseero','radian6','sandcrawler','sbider',
    'scoutjet','scrubby','searchsight','seekbot','semanticdiscovery','sensis web crawler','shim-crawler','shopwiki',
    'snappy','sqworm','stackrambler','teoma','tineye','truwogps','updated','vortex','vyu2','webcollage',
    'websquash.com','wf84','womlpefactory','yacy','yahooseeker','yahooseeker-testing','yandeximages',
    'yandexmetrika','yeti','yooglifetchagent','zyborg','wordpress','a6-indexer',
    'Microsoft Office','JDatabaseDriver','facebookexternalhit','The Knowledge AI','Twitterbot',
    'VenusCrawler','aria2','GetCode','CCBot','NetTrack','IAS crawler','POE-Component',
    'VelenPublicWebCrawler','www.ru','Nutch Master Test','Wotbox','orion-semantics.com','lwp-request',
    'ShortLinkTranslate','mj12bot','WinHttpRequest','Exabot','Auto Spider','Applebot','DuckDuckGo','SeznamBot',
    'moatbot','DotBot','SurdotlyBot','28logsSpider','zgrab','Windows-Media-Player','spbot','Mail.RU_Bot',
    'Backlink','SiteExplorer','SEOkicks','linkdexbot','Qwantify','DataXu','ExtLinksBot','gvfs/','evc-batch',
    'Cliqzbot','YandexBot','YandexMobileBot','newspaper','Clickagy','Chicken laser','coccocbot',
    'Microsoft Windows Network Diagnostics','spuhex.com','smtbot','Dataprovider','HybridBot','Sky-Wapproxy','SafeDNSBot',
    'HatenaBookmark','Meta_Bot','ToutiaoSpider','HttpComponents','ips-agent','yandex.com/bots','(ziva)','Jersey',
    'Auto Shell Spider','User-Agent','curl/','MPlayer','internal request','Grammarly','package','TrendsmapResolver',
    'PaperLiBot','startmebot','WebFuck','GStreamer','httpsrc','AntennaPod','panscient.com','webscan',
    'Screaming Frog','WFilter Live','trendictionbot','nsrbot','PlurkBot','Mojolicious','AlphaBot','tracemyfile',
    'VCTestClient','heritrix','MiniRedir','Iframely','rest-client','Cappuccino','FirmsBot','BOT for JCE',
    'Nimbostratus-Bot','Emacs-w3m','WordupinfoSearch','Dispatch','Paracrawl','Mr.4x3','axios','Typhoeus',
    'tools.random','WhatCMSBot','InetURL','NetpeakCheckerBot','Goose','lua-resty','WhatWeb','special_archiver',
    'XoviBot','Wappalyzer','OK-Search-Bot','abot','Mechanize','uipbot','GnowitNewsbot','PostmanRuntime','HoneyBee',
    'gobuster','Bidtellect','Sonos','RankingBot','Uptimebot','Synapse','Re-re Studio','Mappy','Statastico',
    'Linguee Bot','PocketImageCache','colly','YunSecurityBot','archive.org_bot','CheckMarkNetwork'
);
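The article does not show how the array is applied, so the following is only a minimal sketch of one way to do it. The function name `is_bad_spider`, the case-insensitive `stripos()` comparison, the 403 response, and the shortened `$sample_keywords` list are all assumptions made for illustration. In practice the full `$bad_spiders_array` would be passed instead.

```php
<?php
// Hypothetical helper (not from the original article): returns true when the
// User-Agent string contains any keyword from the list.
// stripos() is used so the comparison ignores case; this is an assumption.
function is_bad_spider(string $userAgent, array $keywords): bool
{
    foreach ($keywords as $keyword) {
        // Skip empty keywords: stripos() with an empty needle would match everything.
        if ($keyword !== '' && stripos($userAgent, $keyword) !== false) {
            return true;
        }
    }
    return false;
}

// Shortened sample list for illustration only.
$sample_keywords = array('AhrefsBot', 'SemrushBot', 'Scrapy', 'curl/');

// HTTP_USER_AGENT may be missing (for example on the command line), so default to ''.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (is_bad_spider($ua, $sample_keywords)) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
```

Because this is plain substring matching, short generic entries in the full list such as 'java', 'python', or 'rules' will match any User-Agent that contains those letters anywhere, so it is worth checking the access log after enabling it.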

At the server level, these /24 (class C) address ranges are recommended for blocking with iptables:
iptables -I INPUT -s 54.36.148.0/24 -j DROP
iptables -I INPUT -s 54.36.149.0/24 -j DROP
iptables -I INPUT -s 47.74.240.0/24 -j DROP
iptables -I INPUT -s 46.229.168.0/24 -j DROP
iptables-save > /etc/sysconfig/iptables
service iptables restart

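1. 54.36.148.* and 54.36.149.* are AhrefsBot IP ranges.
2. 47.74.240.* is an Alibaba Cloud Singapore range. Hosts in this range continuously scan the site root for .rar and .zip files (for example /www.rar and /web.zip) while disguising themselves as baiduspider.
3. 46.229.168.* is the SemRush bot IP range.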
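Where iptables is not available (shared hosting, for instance), the same ranges could be checked at the application level. The sketch below is an illustration under assumptions: the names `ip_in_cidr`, `is_blocked_ip`, and `$blocked_ranges` are hypothetical, only IPv4 is handled, and a 64-bit PHP build is assumed so that `ip2long()` returns non-negative integers.

```php
<?php
// Hypothetical helper: true when an IPv4 address falls inside a CIDR block.
function ip_in_cidr(string $ip, string $cidr): bool
{
    list($subnet, $bits) = explode('/', $cidr);
    $ipLong = ip2long($ip);
    $subnetLong = ip2long($subnet);
    if ($ipLong === false || $subnetLong === false) {
        return false; // not a valid IPv4 address
    }
    $bits = (int) $bits;
    // Build the network mask, e.g. /24 -> 0xFFFFFF00 (assumes 64-bit integers).
    $mask = $bits === 0 ? 0 : (~0 << (32 - $bits)) & 0xFFFFFFFF;
    return ($ipLong & $mask) === ($subnetLong & $mask);
}

// The four ranges listed in the iptables rules above.
$blocked_ranges = array(
    '54.36.148.0/24',
    '54.36.149.0/24',
    '47.74.240.0/24',
    '46.229.168.0/24',
);

// Hypothetical helper: true when the address is in any blocked range.
function is_blocked_ip(string $ip, array $ranges): bool
{
    foreach ($ranges as $range) {
        if (ip_in_cidr($ip, $range)) {
            return true;
        }
    }
    return false;
}
```

A caller would pass `$_SERVER['REMOTE_ADDR']` and `$blocked_ranges` to `is_blocked_ip()` and reject the request when it returns true. Blocking in iptables is still cheaper, because the packet is dropped before it ever reaches PHP.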

One IP to watch: 51.55.248.7
This IP repeatedly poses as baiduspider and sends POST requests to the home page.
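A forged baiduspider User-Agent can usually be told apart from the real one with a forward-confirmed reverse DNS lookup. The sketch below assumes that genuine Baiduspider hosts resolve to names ending in `.baidu.com` or `.baidu.jp`. That suffix list is an assumption to verify against Baidu's own documentation, and the function names are hypothetical. It relies only on PHP's built-in `gethostbyaddr()` and `gethostbynamel()`, so it handles IPv4 only.

```php
<?php
// Pure string check: does a reverse-DNS hostname sit under an expected Baidu domain?
// The suffix list is an assumption, not something stated in the article.
function hostname_looks_like_baidu(string $hostname): bool
{
    $hostname = strtolower(rtrim($hostname, '.'));
    foreach (array('.baidu.com', '.baidu.jp') as $suffix) {
        if (substr($hostname, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

// Forward-confirmed reverse DNS: IP -> hostname -> IPs, which must include the original IP.
function is_genuine_baiduspider(string $ip): bool
{
    // gethostbyaddr() returns the unmodified IP on failure and false on malformed input.
    $hostname = gethostbyaddr($ip);
    if ($hostname === false || $hostname === $ip || !hostname_looks_like_baidu($hostname)) {
        return false;
    }
    // gethostbynamel() returns a list of IPv4 addresses, or false on failure.
    $forward = gethostbynamel($hostname);
    return is_array($forward) && in_array($ip, $forward, true);
}
```

DNS lookups are slow, so a check like this is best run only for requests whose User-Agent claims to be baiduspider, with the result cached per IP.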

Another IP: 51.255.65.46
This may be a new IP that AhrefsBot has started using:
51.255.65.46 - Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)

 

 


Reposted from: https://my.oschina.net/lwkai/blog/3015419
