Improving Nutch crawl efficiency

Here are the things that could potentially slow down fetching:

 

1) DNS setup
2) The number of crawlers you have (too many or too few)
3) Bandwidth limitations
4) Number of threads per host (politeness)
5) Uneven distribution of URLs to fetch, combined with politeness
6) High crawl-delays from robots.txt (usually along with an uneven distribution of URLs)
7) Many slow websites (again, usually with an uneven distribution)
8) Downloading lots of content (PDFs, very large HTML pages; again, possibly an uneven distribution)
9) Others

 

Now, how do we fix them?


1) Set up DNS on each local crawling machine. With multiple crawling machines and a single centralized DNS server, the lookups can act like a DoS attack on that server and slow down the entire system. We always used a two-layer setup: lookups hit the local DNS cache first, then a large DNS cache such as OpenDNS or Verizon.

 

 

2) The number of concurrent fetchers is the number of map tasks * fetcher.threads.fetch, so 10 map tasks * 20 threads = 200 fetchers at once. With too many you overload your system; with too few, other factors dominate and the machines sit idle. You will need to experiment with this setting for your setup.
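The arithmetic above can be written out as a minimal sketch. The function name total_fetcher_threads is hypothetical and not part of Nutch; it only multiplies the two figures from the text.

```python
def total_fetcher_threads(map_tasks: int, threads_per_task: int) -> int:
    """Concurrent fetcher threads across all map tasks.

    threads_per_task stands for the fetcher.threads.fetch setting.
    """
    if map_tasks < 0 or threads_per_task < 0:
        raise ValueError("counts must be non-negative")
    return map_tasks * threads_per_task


if __name__ == "__main__":
    # The example from the text: 10 map tasks with 20 threads each.
    print(total_fetcher_threads(10, 20))  # 200
```

Trying a few values this way before changing the cluster configuration gives a quick sense of how many simultaneous connections a setting implies.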

 

 

3) Bandwidth limitations. Use ntop, ganglia, and other monitoring tools to determine how much bandwidth you are using, and account for both inbound and outbound bandwidth. A simple test: use a server that is inside the fetching network but is not itself fetching. If it is very slow to connect or to download content while fetching is running, it is a good bet you are maxing out your bandwidth. If you set http.timeout as described in 7) and are maxing out your bandwidth, you will start seeing many HTTP timeout errors.

 

 

4) Politeness combined with an uneven distribution of URLs is probably the biggest limiting factor. If one thread is processing a single site and there are a lot of URLs from that site to fetch, all other threads will sit idle while that one thread finishes. Some solutions: use fetcher.server.delay to shorten the time between page fetches, and use fetcher.threads.per.host to increase the number of threads fetching from a single site (these threads still run in the same map task, and hence in the same JVM child task process). If you increase fetcher.threads.per.host above 1, you can also set fetcher.server.min.delay to some value > 0 for politeness, so that the request rate is bounded on both the minimum and maximum side.
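To see why one large host can stall a whole fetch, here is a rough estimate of how long a single-host queue takes. This is a simplified model and an assumption on my part, not Nutch's actual scheduler: every thread waits a fixed delay between successive requests, and download time is ignored. The function name and the example figures are illustrative.

```python
import math


def single_host_fetch_seconds(pages, delay_seconds, threads_per_host=1):
    """Rough lower bound on the time to fetch `pages` URLs from one host.

    delay_seconds stands for the politeness delay between requests and
    threads_per_host for the number of threads allowed on that host.
    """
    if threads_per_host < 1:
        raise ValueError("threads_per_host must be at least 1")
    if pages <= 0:
        return 0.0
    # Each thread handles its share of the pages one after another.
    rounds = math.ceil(pages / threads_per_host)
    # The first request needs no wait, so there are rounds - 1 delays.
    return float((rounds - 1) * delay_seconds)


if __name__ == "__main__":
    # 1000 pages from one host with an illustrative 5 second delay.
    print(single_host_fetch_seconds(1000, 5))     # 4995.0
    # The same queue with 4 threads on that host.
    print(single_host_fetch_seconds(1000, 5, 4))  # 1245.0
```

Even in this optimistic model, the other fetcher threads have nothing to do for most of that time once their own queues are empty.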

 

 

5) Fetching a lot of pages from a single site, or from only a few sites, will slow down fetching dramatically. For full web crawls you want an even distribution so that all fetching threads can be active. Setting generate.max.per.host to a value > 0 limits the number of pages fetched from a single host/domain.
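The idea of a per-host cap can be sketched as follows. This is only an illustration of the concept, not how Nutch's generator is implemented; the function name cap_urls_per_host and the example URLs are made up, and treating a value <= 0 as "no limit" is a convention of this sketch.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cap_urls_per_host(urls, max_per_host):
    """Keep at most max_per_host URLs for each host, preserving order."""
    if max_per_host <= 0:
        return list(urls)
    seen = defaultdict(int)  # host -> number of URLs kept so far
    kept = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        if seen[host] < max_per_host:
            seen[host] += 1
            kept.append(url)
    return kept


if __name__ == "__main__":
    fetch_list = [
        "http://a.example/1",
        "http://a.example/2",
        "http://a.example/3",
        "http://b.example/1",
    ]
    # With a cap of 2, the third URL from a.example is dropped.
    print(cap_urls_per_host(fetch_list, 2))
```

The URLs that are dropped here are not lost in a real crawl; they simply wait for a later fetch list, which keeps each individual fetch more evenly spread across hosts.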

 

 

6) Crawl-delay in robots.txt is obeyed by Nutch. Most sites don't use this setting, but a few do (some of them maliciously). I have seen crawl-delays as high as two days, expressed in seconds. The fetcher.max.crawl.delay setting makes the fetcher skip pages whose crawl-delay is greater than x seconds. I usually set this to 10 seconds; the default is 30. Even at 10 seconds, if you have a lot of pages from a site from which you can only fetch one page every 10 seconds, it is going to be slow. On the flip side, setting this to a low value means the pages from sites with a higher crawl-delay are ignored and never fetched.
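The trade-off can be made concrete with a small sketch. The helper names and host names below are hypothetical; the code only mimics the rule described above (skip anything whose crawl-delay exceeds the maximum) and estimates how long a politely crawled site takes, ignoring download time.

```python
def split_by_crawl_delay(crawl_delays, max_crawl_delay):
    """Partition hosts into (kept, skipped) by their Crawl-delay.

    crawl_delays maps host -> Crawl-delay in seconds, and
    max_crawl_delay plays the role of fetcher.max.crawl.delay.
    """
    kept = {h: d for h, d in crawl_delays.items() if d <= max_crawl_delay}
    skipped = {h: d for h, d in crawl_delays.items() if d > max_crawl_delay}
    return kept, skipped


def hours_to_fetch(pages, crawl_delay_seconds):
    """Hours spent waiting when fetching one page per crawl delay."""
    return pages * crawl_delay_seconds / 3600


if __name__ == "__main__":
    delays = {
        "fast.example": 1,
        "slow.example": 10,
        "hostile.example": 2 * 24 * 3600,  # two days, in seconds
    }
    kept, skipped = split_by_crawl_delay(delays, 10)
    print(sorted(kept))     # ['fast.example', 'slow.example']
    print(sorted(skipped))  # ['hostile.example']
    # 1000 pages at one page every 10 seconds: just under 3 hours.
    print(round(hours_to_fetch(1000, 10), 2))  # 2.78
```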

 

 

7) Many websites are simply slow. Setting a low value for http.timeout helps. The default is 10 seconds. If you don't mind losing slow pages and want as many pages as fast as possible, set it lower. Some websites, Digg for instance, limit your bandwidth on their side and allow only x connections per given time frame. So even if you have only, say, 50 pages from a single site (which I still think is too many), the fetcher may be waiting 10 seconds on each page. ftp.timeout can also be set if you are fetching FTP content.

 

 

8) Lots of content means slower fetching. This is especially true when downloading PDFs and other non-HTML documents. To avoid non-HTML content you can use the URL filters; I prefer the prefix and suffix filters. http.content.limit and ftp.content.limit can be used to limit the amount of content downloaded for a single document.
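What a suffix filter does can be shown in a few lines. This is a stand-alone illustration, not the Nutch filter plugin or its configuration format; the function name is_allowed and the suffix list are assumptions chosen for the example.

```python
from urllib.parse import urlsplit

# Illustrative list of suffixes to skip; adjust to your own crawl.
SKIPPED_SUFFIXES = (".pdf", ".zip", ".doc", ".mp4")


def is_allowed(url, skipped_suffixes=SKIPPED_SUFFIXES):
    """Return False when the URL path ends in a suffix we want to skip."""
    # Only the path is checked, so a query string does not hide the suffix.
    path = urlsplit(url).path.lower()
    return not path.endswith(skipped_suffixes)


if __name__ == "__main__":
    candidates = [
        "http://example.com/index.html",
        "http://example.com/files/report.PDF?download=1",
        "http://example.com/archive.zip",
    ]
    # Only the HTML page survives the filter.
    print([u for u in candidates if is_allowed(u)])
```

Filtering by suffix is cheap because it needs no network request, but it cannot catch non-HTML content served from URLs without a telling extension; the content limits cover that case.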

 

 

9) Other things that could be causing slow fetching:

Hitting the maximum number of open sockets/files on a machine. You will start seeing IO errors or "can't open socket" errors.
Poor routing. Bad routers or home routers might not be able to handle the number of connections going through at once. An incorrect routing setup could also cause problems, but those are usually much more complex to diagnose. Use network trace and mapping tools if you think this is happening. Upstream routing at your network provider can also be a problem.
Bad network cards. I have seen network cards flip out once they reach a certain bandwidth point. This was more prevalent on what were, at the time, newer gigabit cards. It is not usually my first thought, but it is always a possibility. Use tcpdump and network monitoring tools on the single interface.
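For the open sockets/files limit, a quick check is possible from Python on Unix-like systems, where sockets count against the open file descriptor limit. This is a minimal sketch using the standard resource module (not available on Windows); the helper names and the figure of 200 connections are illustrative.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open file descriptor limits for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def has_headroom(needed_descriptors, soft_limit):
    """True if the soft limit covers the descriptors we expect to need."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return needed_descriptors <= soft_limit


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("soft limit:", soft, "hard limit:", hard)
    # Example: 200 concurrent fetcher connections on one machine.
    print("enough for 200 connections:", has_headroom(200, soft))
```

The limit reported is for the process running the check, so it should be run as the same user and in the same environment as the fetcher to be meaningful.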

That is about it from my perspective. Feel free to add anything if you think of other factors.
