Heritrix Intro
Virgil
黄新宇
Introduction to Web Crawlers
• Search "Free Web Crawlers" on Amazon:
• Free Web Crawlers:
  – Wget, Curl, Heritrix,
  – DataparkSearch, Nutch, YaCy,
  – Axel, Arachnode.net, Grub,
  – HTTrack, mnoGoSearch, Methabot, Gwget
Why Do We Need Crawlers?
• See Iron Man (《钢铁侠》):
  – one of an engineer's greatest values is
  – being able to build from scratch
Material from
• Material from: Google, "An Introduction to Heritrix"
• And from simple hands-on use of Heritrix
The Essentials
• Use the web console (a web page) to create a crawl job.
• Configurable options include:
  1. different pre-crawl behavior
  2. different storage formats
• Once the job is started, crawling begins.
Features
• Collects content via HTTP recursively from multiple
websites in a single crawl run, spanning hundreds to
thousands of independent websites, and millions to tens of
millions of distinct resources, over a week or more of non-stop
collection.
• Collects by site domains, exact host, or configurable
URI patterns, starting from an operator-provided "seed" set of
URIs.
• Executes a primarily breadth-first, order-of-discovery policy
for choosing URIs to process, with an option to prefer
finishing sites in progress to beginning new sites ("site-first"
scheduling).
• Highly extensible, with all of the major Heritrix components
  – the scheduling Frontier, the Scope, the protocol-based Fetch
processors, the filtering rules, the format-based Extract processors,
the content Write processors, and more
  – replaceable by alternate implementations or extensions. Documented
APIs and HOW-TOs explain the extension options.
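The scoping behavior described above, deciding whether a URI belongs to the crawl by exact host, site domain, or a configured URI pattern, can be sketched as a simple predicate. The class and rules below are illustrative stand-ins, not Heritrix's actual Scope classes:

```java
import java.net.URI;
import java.util.Set;
import java.util.regex.Pattern;

// Illustrative scope check: accept a URI if its host exactly matches a seed
// host, falls under a seed domain, or matches a configured pattern. Names
// are hypothetical; Heritrix's real Scope implementations are more elaborate.
public class SimpleScope {
    private final Set<String> exactHosts;
    private final Set<String> domains;    // accept host == domain or *.domain
    private final Pattern uriPattern;     // optional extra pattern, may be null

    public SimpleScope(Set<String> exactHosts, Set<String> domains, Pattern uriPattern) {
        this.exactHosts = exactHosts;
        this.domains = domains;
        this.uriPattern = uriPattern;
    }

    public boolean accepts(String uri) {
        String host = URI.create(uri).getHost();
        if (host == null) return false;
        if (exactHosts.contains(host)) return true;
        for (String d : domains) {
            if (host.equals(d) || host.endsWith("." + d)) return true;
        }
        return uriPattern != null && uriPattern.matcher(uri).matches();
    }
}
```

A real scope also has to handle URI normalization and redirects, which are omitted here.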
Features - Highly configurable
• Settable output locations for logs, archive files,
reports, and temporary files.
• Settable maximum bytes to download, maximum
number of documents to download, and maximum time
to spend crawling.
• Settable number of 'worker' crawling threads.
• Settable upper bound on bandwidth usage.
• Politeness configuration that allows setting
minimum/maximum time between requests, as well as an
option to base the lag between requests on a multiple
of the time elapsed fulfilling the most recent request.
• Configurable inclusion/exclusion filtering mechanism.
Includes regular-expression, URI path-depth, and link
hop-count filters that can be combined variously and
attached at key points along the processing chain to
enable fine-tuned inclusion/exclusion.
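The adaptive politeness rule described above, a lag computed as a multiple of the last request's duration and clamped between configured minimum and maximum delays, can be sketched as follows. The class and parameter names are illustrative, not Heritrix's actual configuration keys:

```java
// Illustrative sketch of the adaptive politeness delay described above;
// names are hypothetical, not Heritrix's real configuration settings.
public class PolitenessPolicy {
    private final long minDelayMs;    // minimum wait between requests to one host
    private final long maxDelayMs;    // maximum wait between requests to one host
    private final double delayFactor; // multiple of the last request's duration

    public PolitenessPolicy(long minDelayMs, long maxDelayMs, double delayFactor) {
        this.minDelayMs = minDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.delayFactor = delayFactor;
    }

    /** Wait time before the next request, based on how long the last fetch took. */
    public long delayFor(long lastFetchDurationMs) {
        long delay = (long) (lastFetchDurationMs * delayFactor);
        return Math.max(minDelayMs, Math.min(maxDelayMs, delay));
    }
}
```

The effect is that slow, struggling servers are automatically visited less often, while the min/max bounds keep the lag within operator-chosen limits.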
Key Components
• The Web Administrative Console
  – is a standalone web application, hosted by the embedded Jetty Java
HTTP server. Its web pages allow the operator to choose a crawl's
components and parameters by composing a CrawlOrder, a
configuration object that also has an external XML representation.
• A crawl
  – is initiated by passing this CrawlOrder to the CrawlController, a
component which instantiates and holds references to all
configured crawl components. The CrawlController is the crawl's
global context: all subcomponents can reach each other through it.
• The CrawlOrder
  – contains sufficient information to create the Scope. The Scope
seeds the Frontier with initial URIs and is consulted to decide
which later-discovered URIs should also be scheduled.
• The Frontier
  – has responsibility for ordering the URIs to be visited,
ensuring URIs are not revisited unnecessarily, and moderating the
crawler's visits to any one remote site. It achieves these goals by
maintaining a series of internal queues of URIs to be visited, and a
list of all URIs already visited or queued. URIs are only released
from queues for fetching in a manner compatible with the configured
politeness policy. The default Frontier implementation offers a
primarily breadth-first, order-of-discovery policy for choosing URIs
to process, with an option to prefer finishing sites in progress to
beginning new sites. Other Frontier implementations are possible.
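A toy version of the Frontier's core bookkeeping, a FIFO discovery queue plus an already-seen set, which together yield breadth-first, order-of-discovery behavior, might look like this. It deliberately omits per-host queues and politeness timing, and is not Heritrix's actual API:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Toy breadth-first frontier: a FIFO queue of pending URIs plus a set of
// everything ever queued, so no URI is scheduled twice. Hypothetical
// sketch only; Heritrix's real Frontier also moderates per-host visits.
public class SimpleFrontier {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    /** Schedule a URI unless it was already queued or visited. */
    public void schedule(String uri) {
        if (seen.add(uri)) {
            pending.add(uri);
        }
    }

    /** Next URI in order of discovery, or null when the frontier is empty. */
    public String next() {
        return pending.poll();
    }

    public boolean isEmpty() {
        return pending.isEmpty();
    }
}
```

The FIFO queue gives the "order of discovery" property; the seen-set gives the "not revisited unnecessarily" property.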
Multithreaded crawling
• The Heritrix crawler is
  – multithreaded in order to make progress on many URIs in
parallel during network and local disk I/O lags. Each
worker thread is called a ToeThread, and while a crawl
is active, each ToeThread loops through steps that
roughly correspond to the generic process outlined
previously:
    • Ask the Frontier for a next() URI
    • Pass the URI to each Processor in turn. (Distinct processors
perform the fetching, analysis, and selection steps.)
    • Report the completion of the finished() URI
  – The number of ToeThreads in a running crawler is
adjustable to achieve maximum throughput given local
resources. The number of ToeThreads usually ranges in
the hundreds.
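The three-step ToeThread loop above can be sketched roughly as follows. The interfaces are illustrative stand-ins, not Heritrix's actual class signatures:

```java
import java.util.List;

// Rough sketch of a ToeThread's work loop: take a URI from the Frontier,
// run it through every Processor, then report completion. Interface and
// method names here are illustrative, not Heritrix's real API.
public class ToeThreadSketch implements Runnable {
    interface Frontier {
        String next();              // blocking in a real crawler; null = done here
        void finished(String uri);
    }
    interface Processor {
        void process(String uri);
    }

    private final Frontier frontier;
    private final List<Processor> processors;

    public ToeThreadSketch(Frontier frontier, List<Processor> processors) {
        this.frontier = frontier;
        this.processors = processors;
    }

    @Override
    public void run() {
        String uri;
        while ((uri = frontier.next()) != null) {
            for (Processor p : processors) {
                p.process(uri);      // fetch, analyze, extract, write, ...
            }
            frontier.finished(uri);  // let the Frontier update its bookkeeping
        }
    }
}
```

Running many such threads lets other URIs make progress while any one thread is blocked on network or disk I/O.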
CrawlURI instance / ServerCache
• Each URI is represented by a CrawlURI instance, which
packages the URI with additional information collected
during the crawling process, including arbitrary nested
named attributes. The loosely-coupled system components
communicate their progress and output through the
CrawlURI, which carries the results of earlier
processing to later processors and finally back to
the Frontier, to influence future retries or scheduling.
• The ServerCache holds persistent data about servers
that can be shared across CrawlURIs and over time. It
contains any number of CrawlServer entities, collecting
information such as
  – IP addresses,
  – robots exclusion policies,
  – historical responsiveness, and
  – per-host crawl statistics.
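As a rough illustration of the data-carrying roles described above, a CrawlURI-like object can be modeled as a URI plus a mutable map of named attributes, and a ServerCache as a host-keyed map of shared per-server records. All names here are simplified stand-ins, not the real Heritrix classes:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for CrawlURI and ServerCache: a URI plus a bag of
// named attributes that processors annotate, and a host-keyed cache of
// per-server records shared across URIs. Not Heritrix's actual classes.
public class CrawlState {
    public static class CrawlUriSketch {
        public final String uri;
        private final Map<String, Object> attributes = new HashMap<>();

        public CrawlUriSketch(String uri) { this.uri = uri; }
        public void put(String key, Object value) { attributes.put(key, value); }
        public Object get(String key) { return attributes.get(key); }
    }

    public static class ServerRecord {
        public String ip;             // resolved IP address
        public String robotsPolicy;   // cached robots exclusion rules
        public int fetchCount;        // per-host crawl statistics
    }

    // One shared record per host, reused by every CrawlURI for that host.
    private final Map<String, ServerRecord> serverCache = new HashMap<>();

    public ServerRecord recordFor(String host) {
        return serverCache.computeIfAbsent(host, h -> new ServerRecord());
    }
}
```

The key design point is the split: per-URI state travels with the URI, while per-server state is cached once and shared.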
Processors
• The overall functionality of a crawler with
respect to a scheduled URI is largely
specified by the series of Processors
configured to run.
• Each Processor in turn performs its
tasks, marks up the CrawlURI state, and
returns. The tasks performed will often
vary conditionally based on URI type,
history, or retrieved content. Certain
CrawlURI state also affects whether and
which further processing occurs. (For
example, earlier Processors may cause
later processing to be skipped.)
5 Chains - Prefetch/Fetch
• Processors in the Prefetch Chain receive the
CrawlURI before any network activity to resolve
or fetch the URI. Such Processors typically delay,
reorder, or veto the subsequent processing of
a CrawlURI, for example to ensure that robots
exclusion policy rules are fetched and considered
before a URI is processed.
• Processors in the Fetch Chain attempt network
activity to acquire the resource referred to by a
CrawlURI. In the typical case of an HTTP
transaction, a Fetcher Processor will fill the
"request" and "response" buffers of the
CrawlURI, or indicate whatever error condition
prevented those buffers from being filled.
5 Chains - Extract/Write/Postprocess
• Processors in the Extract Chain perform follow-up
processing on a CrawlURI for which a fetch has already
completed, extracting features of interest. Most commonly,
these are new URIs that may also be eligible for visitation.
URIs are only discovered at this step, not evaluated.
• Processors in the Write Chain store the crawl results,
returned content or extracted features, to permanent
storage. Our standard crawler merely writes data to the
Internet Archive's ARC file format, but third parties have
created Processors to write other data formats or index the
crawled data.
• Finally, Processors in the Postprocess Chain perform
final crawl-maintenance actions on the CrawlURI, such
as testing discovered URIs against the Scope, scheduling
them into the Frontier if necessary, and updating internal
crawler information caches.
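The five chains can be pictured as an ordered pipeline in which an earlier stage may mark the URI so that the remaining stages are skipped, for instance when the prefetch stage vetoes a URI. A minimal sketch, with invented names rather than Heritrix's real chain classes:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the five-chain pipeline: stages run in order, and any
// stage may mark the URI so the remaining stages are skipped. Names are
// invented for illustration; Heritrix's real chains are Processor lists.
public class ChainPipeline {
    public static class Uri {
        public final String value;
        public boolean skipRest = false;      // set by a stage to veto later ones
        public final List<String> trace = new ArrayList<>();
        public Uri(String value) { this.value = value; }
    }

    public interface Stage {
        void run(Uri uri);
    }

    private final List<Stage> stages = new ArrayList<>();

    public ChainPipeline add(Stage s) { stages.add(s); return this; }

    public void process(Uri uri) {
        for (Stage s : stages) {
            if (uri.skipRest) break;          // an earlier stage vetoed the rest
            s.run(uri);
        }
    }
}
```

In this sketch, a prefetch stage that sets `skipRest` (say, for a robots-excluded URI) prevents the fetch, extract, write, and postprocess stages from running, mirroring the "earlier Processors may cause later processing to be skipped" behavior described above.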
Limitations
• Heritrix has been used primarily for focused
crawls to date. The broad and continuous use
cases are to be tackled in the next phase of
development (see below). Key current limitations
to keep in mind are:
• Single instance only: cannot coordinate
crawling amongst multiple Heritrix instances,
whether all instances are run on a single
machine or spread across multiple machines.
• Requires sophisticated operator tuning to run
large crawls within machine resource limits.
Limitations
• Only officially supported and tested on
Linux.
• Each crawl run is independent, without
support for scheduled revisits to areas of
interest or incremental archival of changed
material.
• Limited ability to recover from in-crawl
hardware/system failure.
• Minimal time spent profiling and
optimizing leaves Heritrix coming up short on
performance requirements (see Crawler
Performance below).