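To be honest, I came to search engines fairly late. I first heard of the robots protocol through the 2012 "3B war", the dispute between 360 and Baidu.
In 2012, 360 launched its own search engine, "360 Search", and soon after release it jumped to second place among search engines in China, overtaking Sogou and trailing only Baidu.
Baidu, however, pointed out that its robots.txt did not allow the 360 crawler in, yet the 360 crawler kept fetching content from Baidu sites such as "Baidu Zhidao" and "Baidu Baike",
in violation of the internationally accepted "Robots protocol". You can read more about the dispute here: http://baike.baidu.com/view/9230864.htm That was how I first learned about the "Robots protocol".
I then looked it up on Baidu.
The takeaway from Baidu's explanation: Robots is only a protocol. If a crawler does not follow it, nothing technical stops it, and the only remedy is to go to court.
Let's take a look at the robots.txt files of a few major sites.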
www.baidu.com/robots.txt
User-agent: Baiduspider
Disallow: /w?

User-agent: Googlebot
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: MSNBot
Allow: /

User-agent: Baiduspider-image
Disallow: /w?

User-agent: YoudaoBot
Allow: /

User-agent: Sogou web spider
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: Sogou inst spider
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: Sogou spider2
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: Sogou blog
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: Sogou News Spider
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: Sogou Orion spider
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: JikeSpider
Allow: /

User-agent: Sosospider
Allow: /

User-agent: YYspider
Allow: /

User-agent: PangusoSpider
Allow: /

User-agent: yisouspider
Allow: /

User-agent: EasouSpider
Allow: /

User-agent: *
Disallow: /
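The meaning should be fairly obvious: the name after User-agent is the name of the web crawler that the following rules apply to. The last group, "User-agent: *" with "Disallow: /", shuts out every crawler that is not listed by name.
So, just as Baidu said, 360spider really is not allowed to crawl the site.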
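To check this conclusion mechanically, here is a minimal sketch using Python's standard urllib.robotparser module. The rules are a shortened excerpt of the Baidu file quoted above, and the test URLs are only illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the Baidu rules quoted above (not the full file).
ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /w?

User-agent: Googlebot
Disallow: /update
Disallow: /history

User-agent: MSNBot
Allow: /

User-agent: *
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Illustrative URL; any path on the site would serve for the check.
URL = "http://www.baidu.com/index.html"

for agent in ("Baiduspider", "Googlebot", "MSNBot", "360Spider"):
    # 360Spider has no group of its own, so it falls under "User-agent: *".
    print(agent, parser.can_fetch(agent, URL))
```

A crawler that honors the protocol would run a check like this before every request. Nothing forces it to, which is exactly the point made earlier about robots being only a protocol.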
www.google.com/robots.txt
User-agent: *
Disallow: /search
Disallow: /sdch
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Allow: /catalogs/about
Allow: /catalogs/p?
Disallow: /catalogues
Disallow: /news
Allow: /news/directory
Disallow: /nwshp
Disallow: /setnewsprefs?
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
Disallow: /relcontent
Disallow: /imgres
Disallow: /imglanding
Disallow: /sbd
Disallow: /keyword/
Disallow: /u/
Disallow: /univ/
Disallow: /cobrand
Disallow: /custom
Disallow: /advanced_group_search
Disallow: /googlesite
Disallow: /preferences
Disallow: /setprefs
Disallow: /swr
Disallow: /url
Disallow: /default
Disallow: /m?
Disallow: /m/
Disallow: /wml?
Disallow: /wml/?
Disallow: /wml/search?
Disallow: /xhtml?
Disallow: /xhtml/?
Disallow: /xhtml/search?
Disallow: /xml?
Disallow: /imode?
Disallow: /imode/?
Disallow: /imode/search?
Disallow: /jsky?
Disallow: /jsky/?
Disallow: /jsky/search?
Disallow: /pda?
Disallow: /pda/?
Disallow: /pda/search?
Disallow: /sprint_xhtml
Disallow: /sprint_wml
Disallow: /pqa
Disallow: /palm
Disallow: /gwt/
Disallow: /purchases
Disallow: /hws
Disallow: /bsd?
Disallow: /linux?
Disallow: /mac?
Disallow: /microsoft?
Disallow: /unclesam?
Disallow: /answers/search?q=
Disallow: /local?
Disallow: /local_url
Disallow: /shihui?
Disallow: /shihui/
Disallow: /froogle?
Disallow: /products?
Disallow: /products/
Disallow: /froogle_
Disallow: /product_
Disallow: /products_
Disallow: /products;
Disallow: /print
Disallow: /books/
Disallow: /bkshp?*q=*
Disallow: /books?*q=*
Disallow: /books?*output=*
Disallow: /books?*pg=*
Disallow: /books?*jtp=*
Disallow: /books?*jscmd=*
Disallow: /books?*buy=*
Disallow: /books?*zoom=*
Allow: /books?*q=related:*
Allow: /books?*q=editions:*
Allow: /books?*q=subject:*
Allow: /books/about
Allow: /booksrightsholders
Allow: /books?*zoom=1*
Allow: /books?*zoom=5*
Disallow: /ebooks/
Disallow: /ebooks?*q=*
Disallow: /ebooks?*output=*
Disallow: /ebooks?*pg=*
Disallow: /ebooks?*jscmd=*
Disallow: /ebooks?*buy=*
Disallow: /ebooks?*zoom=*
Allow: /ebooks?*q=related:*
Allow: /ebooks?*q=editions:*
Allow: /ebooks?*q=subject:*
Allow: /ebooks?*zoom=1*
Allow: /ebooks?*zoom=5*
Disallow: /patents?
Disallow: /patents/download/
Disallow: /patents/pdf/
Disallow: /patents/related/
Disallow: /scholar
Disallow: /citations?
Allow: /citations?user=
Allow: /citations?view_op=new_profile
Allow: /citations?view_op=top_venues
Disallow: /complete
Disallow: /s?
Disallow: /sponsoredlinks
Disallow: /videosearch?
Disallow: /videopreview?
Disallow: /videoprograminfo?
Allow: /maps?hq=http://maps.google.com/help/maps/directions/biking/mapleft.kml&ie=UTF8&ll=37.687624,-122.319717&spn=0.346132,0.727158&z=11&lci=bike&dirflg=b&f=d
Allow: /maps/api/js?
Disallow: /maps?
Disallow: /mapstt?
Disallow: /mapslt?
Disallow: /maps/stk/
Disallow: /maps/br?
Disallow: /mapabcpoi?
Disallow: /maphp?
Disallow: /mapprint?
Disallow: /maps/api/js/
Disallow: /maps/api/staticmap?
Disallow: /mld?
Disallow: /staticmap?
Disallow: /places/
Allow: /places/$
Disallow: /maps/preview
Disallow: /maps/place
Disallow: /help/maps/streetview/partners/welcome/
Disallow: /help/maps/indoormaps/partners/
Disallow: /lochp?
Disallow: /center
Disallow: /ie?
Disallow: /sms/demo?
Disallow: /katrina?
Disallow: /blogsearch?
Disallow: /blogsearch/
Disallow: /blogsearch_feeds
Disallow: /advanced_blog_search
Disallow: /uds/
Disallow: /chart?
Disallow: /transit?
Disallow: /mbd?
Disallow: /extern_js/
Disallow: /xjs/
Disallow: /calendar/feeds/
Disallow: /calendar/ical/
Disallow: /cl2/feeds/
Disallow: /cl2/ical/
Disallow: /coop/directory
Disallow: /coop/manage
Disallow: /trends?
Disallow: /trends/music?
Disallow: /trends/hottrends?
Disallow: /trends/viz?
Disallow: /notebook/search?
Disallow: /musica
Disallow: /musicad
Disallow: /musicas
Disallow: /musicl
Disallow: /musics
Disallow: /musicsearch
Disallow: /musicsp
Disallow: /musiclp
Disallow: /browsersync
Disallow: /call
Disallow: /archivesearch?
Disallow: /archivesearch/url
Disallow: /archivesearch/advanced_search
Disallow: /base/reportbadoffer
Disallow: /urchin_test/
Disallow: /movies?
Disallow: /codesearch?
Disallow: /codesearch/feeds/search?
Disallow: /wapsearch?
Disallow: /safebrowsing
Allow: /safebrowsing/diagnostic
Allow: /safebrowsing/report_badware/
Allow: /safebrowsing/report_error/
Allow: /safebrowsing/report_phish/
Disallow: /reviews/search?
Disallow: /orkut/albums
Allow: /jsapi
Disallow: /views?
Disallow: /c/
Disallow: /cbk
Allow: /cbk?output=tile&cb_client=maps_sv
Disallow: /recharge/dashboard/car
Disallow: /recharge/dashboard/static/
Disallow: /translate_a/
Disallow: /translate_c
Disallow: /translate_f
Disallow: /translate_static/
Disallow: /translate_suggestion
Disallow: /profiles/me
Allow: /profiles
Disallow: /s2/profiles/me
Allow: /s2/profiles
Allow: /s2/photos
Allow: /s2/static
Disallow: /s2
Allow: /s2/search/social
Disallow: /transconsole/portal/
Disallow: /gcc/
Disallow: /aclk
Disallow: /cse?
Disallow: /cse/home
Disallow: /cse/panel
Disallow: /cse/manage
Disallow: /tbproxy/
Disallow: /imesync/
Disallow: /shenghuo/search?
Disallow: /support/forum/search?
Disallow: /reviews/polls/
Disallow: /hosted/images/
Disallow: /ppob/?
Disallow: /ppob?
Disallow: /ig/add?
Disallow: /adwordsresellers
Disallow: /accounts/o8
Allow: /accounts/o8/id
Disallow: /topicsearch?q=
Disallow: /xfx7/
Disallow: /squared/api
Disallow: /squared/search
Disallow: /squared/table
Disallow: /toolkit/
Allow: /toolkit/*.html
Disallow: /globalmarketfinder/
Allow: /globalmarketfinder/*.html
Disallow: /qnasearch?
Disallow: /app/updates
Disallow: /sidewiki/entry/
Disallow: /quality_form?
Disallow: /labs/popgadget/search
Disallow: /buzz/post
Disallow: /compressiontest/
Disallow: /analytics/reporting/
Disallow: /analytics/admin/
Disallow: /analytics/web/
Disallow: /analytics/feeds/
Disallow: /analytics/settings/
Disallow: /alerts/
Disallow: /ads/search
Disallow: /phone/compare/?
Allow: /alerts/manage
Allow: /alerts/remove
Disallow: /travel/clk
Disallow: /hotelfinder/rpc
Disallow: /hotels/rpc
Disallow: /flights/rpc
Disallow: /commercesearch/services/
Disallow: /evaluation/
Disallow: /chrome/browser/mobile/tour
Disallow: /compare/*/apply*
Disallow: /forms/perks/
Disallow: /baraza/*/search
Disallow: /baraza/*/report
Disallow: /shopping/suppliers/search
Disallow: /ct/
Disallow: /edu/cs4hs/
Disallow: /trustedstores/s/
Disallow: /trustedstores/tm2
Disallow: /trustedstores/verify
Disallow: /adwords/proposal
Disallow: /shopping/product/
Disallow: /shopping/seller
Disallow: /shopping/reviewer

Sitemap: http://www.gstatic.com/culturalinstitute/sitemaps/www_google_com_culturalinstitute/sitemap-index.xml
Sitemap: http://www.google.com/hostednews/sitemap_index.xml
Sitemap: http://www.google.com/sitemaps_webmasters.xml
Sitemap: http://www.gstatic.com/sitemaps/websearch_hreflang/sitemap_index.xml
Sitemap: http://www.google.com/ventures/sitemap_ventures.xml
Sitemap: http://www.gstatic.com/dictionary/static/sitemaps/sitemap_index.xml
Sitemap: http://www.gstatic.com/earth/gallery/sitemaps/sitemap.xml
Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml
Sitemap: http://www.gstatic.com/trends/websites/sitemaps/sitemapindex.xml
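Huh? Why is there an extra Sitemap entry here?
As mentioned earlier, a crawler discovers new pages by following the links inside pages it already knows. But what about pages that no link points to? Or dynamic pages generated from conditions the user types in? Can a site administrator tell the search engine which pages on the site are available for crawling? That is what a sitemap is for. The simplest form of sitemap is an XML file that lists the URLs of the site together with extra data about each URL (when it was last updated, how often it changes, how important it is relative to the other URLs on the site, and so on). With this information a search engine can crawl the site more intelligently.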
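As a rough illustration of that XML form, the sketch below builds a tiny sitemap with Python's standard library. The element names (urlset, url, loc, lastmod, changefreq, priority) and the namespace come from the sitemaps.org protocol; the example.com pages and their values are made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from dicts with 'loc' and optional metadata fields."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        # Only 'loc' is required; the other fields are optional hints.
        for field in ("loc", "lastmod", "changefreq", "priority"):
            if field in page:
                ET.SubElement(url, field).text = str(page[field])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages, including a dynamic one that no link may point to.
pages = [
    {"loc": "http://www.example.com/", "lastmod": "2013-01-01",
     "changefreq": "daily", "priority": 1.0},
    {"loc": "http://www.example.com/search?q=robots",
     "changefreq": "weekly", "priority": 0.5},
]

sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```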
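Sitemaps are a topic of their own, big enough for a separate article, so I won't go into detail here. Interested readers can read up on sitemap.
That raises a new question: how does a crawler know whether a site provides a sitemap file at all? Or, if the administrator has generated a sitemap (possibly several files), how does the crawler know where they are?
Since the location of robots.txt is fixed, people had the idea of putting the sitemap locations into robots.txt. That is how Sitemap became a new member of robots.txt.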
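A crawler can therefore find the sitemaps by scanning robots.txt for Sitemap lines. The sketch below does this by hand on a short excerpt of the Google file quoted above; the function name extract_sitemaps is just a name chosen for this example.

```python
from typing import List


def extract_sitemaps(robots_text: str) -> List[str]:
    """Return the URLs of all 'Sitemap:' lines found in robots.txt text."""
    sitemaps = []
    for raw_line in robots_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Split only at the first colon so the URL's own colons survive.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            sitemaps.append(value.strip())
    return sitemaps


# Short excerpt of the Google robots.txt shown above.
SAMPLE = """\
User-agent: *
Disallow: /search
Allow: /alerts/manage
Sitemap: http://www.google.com/hostednews/sitemap_index.xml
Sitemap: http://www.google.com/sitemaps_webmasters.xml
"""

found = extract_sitemaps(SAMPLE)
for url in found:
    print(url)
```

Sitemap lines are independent of the User-agent groups, which is why the sketch ignores which group a line appears under.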
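The Sitemap entries above all point to XML files; feel free to open them and have a look. A Sitemap entry can also point to a txt file, for example:
Open it and see for yourself. A sitemap can also be served as a compressed archive. Let's look at Amazon's: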
http://www.amazon.cn/robots.txt
User-agent: *
Disallow: /buycar
Disallow: /cart
Disallow: /checkout
Disallow: /class
Disallow: /com
Disallow: /common
Disallow: /css
Disallow: /dll
Disallow: /doc
Disallow: /dp/e-mail-friend/
Disallow: /dp/manual-submit/
Disallow: /dp/product-availability/
Disallow: /dp/rate-this-item/
Disallow: /dp/shipping/
Disallow: /dp/twister-update/
Disallow: /gp/aws/ssop
Disallow: /gp/cart
Disallow: /gp/css/homepage.html
Disallow: /gp/customer-reviews/common/du
Disallow: /gp/flex
Disallow: /gp/gfix
Disallow: /gp/history
Disallow: /gp/item-dispatch
Disallow: /gp/music/clipserve
Disallow: /gp/music/wma-pop-up
Disallow: /gp/offer-listing
Disallow: /gp/product/e-mail-friend
Disallow: /gp/product/product-availability
Disallow: /gp/product/rate-this-item
Disallow: /gp/recsradio
Disallow: /gp/slredirect
Disallow: /gp/twitter/
Disallow: /gp/vote
Disallow: /gp/voting/
Disallow: /gp/yourstore
Disallow: /inc
Disallow: /js
Disallow: /lib
Disallow: /mn/bookLookInsideApp
Disallow: /mn/checkInitApp
Disallow: /mn/checkoutAlertMsgApp
Disallow: /mn/checkoutredirectApp
Disallow: /mn/giftCardApp
Disallow: /mn/loginApplication
Disallow: /mn/loyaltyApp
Disallow: /mn/orderAddrApp
Disallow: /mn/orderCfmApp
Disallow: /mn/orderDetailApp
Disallow: /mn/orderFailApp
Disallow: /mn/orderHistoryApp
Disallow: /mn/orderModifyApp
Disallow: /mn/orderSummaryApp
Disallow: /mn/paymentRedriveApp
Disallow: /mn/recommendReviewApp
Disallow: /mn/releaseReviewApp
Disallow: /mn/reviewVoteApplication
Disallow: /mn/selectPaymentMethodApp
Disallow: /mn/selectShippingOpptionApplication
Disallow: /mn/shipmentTraceApp
Disallow: /mn/shoppingCartApplication
Disallow: /mn/tellFriend
Disallow: /mn/thankYouApplication
Disallow: /mn/virtualAccountApp
Disallow: /mn/yourAccountApp
Disallow: /paper
Disallow: /xml
Disallow: /youraccount
Disallow: /ap/signin
Disallow: /gp/registry/wishlist/
Disallow: /wishlist/
Allow: /wishlist/universal*
Allow: /wishlist/vendor-button*
Allow: /wishlist/get-button*
Disallow: /gp/wishlist/
Allow: /gp/wishlist/universal*
Allow: /gp/wishlist/vendor-button*
Allow: /gp/wishlist/ipad-install*
Disallow: /registry/wishlist/
Disallow: /gp/help/customer/display.html*nodeId=200843370
Disallow: /gp/help/customer/display.html*nodeId=200877580
Disallow: /gp/help/customer/display.html*nodeId=200877590
Disallow: /gp/help/customer/display.html*nodeId=200879080
Disallow: /gp/help/customer/display.html*nodeId=200879100
Disallow: /gp/help/customer/display.html*nodeId=200879120
Disallow: /gp/help/customer/display.html*nodeId=200879160
Disallow: /gp/help/customer/display.html*nodeId=200879140
Disallow: /gp/help/customer/display.html*nodeId=200877610
Disallow: /gp/help/customer/display.html*nodeId=200878960
Disallow: /gp/help/customer/display.html*nodeId=200878980
Disallow: /gp/help/customer/display.html*nodeId=200879000
Disallow: /gp/help/customer/display.html*nodeId=200879040
Disallow: /gp/help/customer/display.html*nodeId=200879020
Disallow: /gp/help/customer/display.html*nodeId=200877630
Disallow: /gp/help/customer/display.html*nodeId=200879200
Disallow: /gp/help/customer/display.html*nodeId=200879220
Disallow: /gp/help/customer/display.html*nodeId=200879240
Disallow: /gp/help/customer/display.html*nodeId=200879280
Disallow: /gp/help/customer/display.html*nodeId=200879260
Disallow: /gp/help/customer/display.html*nodeId=200877650
Disallow: /gp/help/customer/display.html*nodeId=200879320
Disallow: /gp/help/customer/display.html*nodeId=200879340
Disallow: /gp/help/customer/display.html*nodeId=200879360
Disallow: /gp/help/customer/display.html*nodeId=200879400
Disallow: /gp/help/customer/display.html*nodeId=200879380
Disallow: /gp/help/customer/display.html*nodeId=200877560
Disallow: /gp/help/customer/display.html*nodeId=200843460
Disallow: /gp/help/customer/display.html*nodeId=200843440
Disallow: /gp/help/customer/display.html*nodeId=200899270
Disallow: /gp/help/customer/display.html*nodeId=200879440
Disallow: /gp/help/customer/display.html*nodeId=200899330
Disallow: /gp/help/customer/display.html*nodeId=200899350
Disallow: /gp/help/customer/display.html*nodeId=200899390
Disallow: /gp/help/customer/display.html*nodeId=200899410
Disallow: /gp/help/customer/display.html*nodeId=200899430
Disallow: /gp/help/customer/display.html*nodeId=200899220
Disallow: /gp/help/customer/display.html*nodeId=200899450
Disallow: /gp/help/customer/display.html*nodeId=200899670
Disallow: /gp/help/customer/display.html*nodeId=200899530
Disallow: /gp/help/customer/display.html*nodeId=200899470
Disallow: /gp/help/customer/display.html*nodeId=200899550
Disallow: /gp/help/customer/display.html*nodeId=200899570
Disallow: /gp/help/customer/display.html*nodeId=200899590
Disallow: /gp/help/customer/display.html*nodeId=200899490
Disallow: /gp/help/customer/display.html*nodeId=200899510
Disallow: /gp/help/customer/display.html*nodeId=200899610
Disallow: /gp/help/customer/display.html*nodeId=200899630
Disallow: /gp/help/customer/display.html*nodeId=200899650
Disallow: /gp/help/customer/display.html*nodeId=200879180
Disallow: /gp/help/customer/display.html*nodeId=200879060
Disallow: /gp/help/customer/display.html*nodeId=200879300
Disallow: /gp/help/customer/display.html*nodeId=200879420
Disallow: /gp/help/customer/display.html*nodeId=200899290
Disallow: /gp/help/customer/display.html*nodeId=200899310
Disallow: /gp/help/customer/display.html*nodeId=200843380
Disallow: /gp/help/customer/display.html*nodeId=200843420
Disallow: /gp/help/customer/display.html*nodeId=200899230
Disallow: /gp/help/customer/display.html*nodeId=200899250
Disallow: /gp/help/customer/display.html*nodeId=200899370
Disallow: /gp/help/contact-us/general-questions.html*?type&email&skip=true
Disallow: /gp/help/customer/accessibility?ie=UTF8&initialIssue=forgotpw&skip=true
Disallow: /gp/registry/search.html
Disallow: /gp/orc/rml/
Disallow: /gp/digital/fiona/manage
Disallow: /gp/entity-alert/external
Disallow: /gp/customer-reviews/dynamic/sims-box
Disallow: /review/dynamic/sims-box
Disallow: /gp/redirect.html

# Sitemap files
Sitemap: http://www.amazon.cn/sitemap_feed_index1.xml
Sitemap: http://www.amazon.cn/sitemaps.f3053414d236e84.SitemapIndex_0.xml.gz
Sitemap: http://www.amazon.cn/sitemaps.1946f6b8171de60.SitemapIndex_0.xml.gz
Sitemap: http://www.amazon.cn/sitemaps.c21f969b5f03d33.SitemapIndex_0.xml.gz
We can download one of the compressed files and unpack it; what comes out is an XML file.
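The unpacking step can also be done in code. Here is a small sketch, assuming the .xml.gz file is an ordinary gzip stream wrapping a sitemap index. To stay self-contained it compresses a made-up index in memory instead of downloading the real Amazon file, so the example.com URLs are placeholders.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sitemap index standing in for a downloaded .xml.gz file.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://www.example.com/sitemap_1.xml.gz</loc></sitemap>
  <sitemap><loc>http://www.example.com/sitemap_2.xml.gz</loc></sitemap>
</sitemapindex>
"""


def read_gzipped_sitemap(data):
    """Decompress gzip bytes and return every <loc> value in the XML."""
    xml_bytes = gzip.decompress(data)
    root = ET.fromstring(xml_bytes)
    return [loc.text for loc in root.iterfind(".//sm:loc", NS)]


compressed = gzip.compress(SAMPLE_XML)   # what a server would send
locs = read_gzipped_sitemap(compressed)  # what a crawler would recover
print(locs)
```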
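Let's look at one more:
Huh? What kind of crawler is ia_archiver? I've never seen it before.
Time to Baidu it.
That's pretty much all there is.
There are some other fun examples too; you can find them here: http://lusongsong.com/reed/732.html
That's it for the robots protocol.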