Heritrix Study Slides

See Attachment


Heritrix Intro
Virgil
黄新宇
Introduction to Crawlers
• Search for "Free Web Crawlers" on Amazon:
• Free web crawlers:
– Wget, Curl, Heritrix,
– Dataparksearch, Nutch, Yacy,
– Axel, Arachnode.net, Grub,
– Httrack, Mnogosearch, Methabot, Gwget
Why Have Crawlers?
• See "Iron Man": one of an engineer's greatest strengths is being able to build things from scratch.
Material from
• Material from: Google, "An Introduction to Heritrix"
• Written after some simple hands-on use of Heritrix
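The Essentials
• Use the web page (the web console) to create a crawl job.
• Configurable options:
1. Different behavior at the pre-fetch stage
2. Different storage formats
• Once the job is started, crawling begins.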
Features
• Collects content via HTTP recursively from multiple websites in a single crawl run, spanning hundreds to thousands of independent websites and millions to tens of millions of distinct resources, over a week or more of non-stop collection.
• Collects by site domains, exact host, or configurable URI patterns, starting from an operator-provided "seed" set of URIs.
• Executes a primarily breadth-first, order-of-discovery policy for choosing URIs to process, with an option to prefer finishing sites in progress to beginning new sites ("site-first" scheduling).
• Highly extensible: all of the major Heritrix components (the scheduling Frontier, the Scope, the protocol-based Fetch processors, filtering rules, format-based Extract processors, content Write processors, and more) are replaceable by alternate implementations or extensions. Documented APIs and HOW-TOs explain extension options.
Features - Highly configurable
• Settable output locations for logs, archive files, reports, and temporary files.
• Settable maximum bytes to download, maximum number of documents to download, and maximum time to spend crawling.
• Settable number of "worker" crawling threads.
• Settable upper bound on bandwidth usage.
• Politeness configuration that allows setting the minimum/maximum time between requests, as well as an option to base the lag between requests on a multiple of the time elapsed fulfilling the most recent request.
• Configurable inclusion/exclusion filtering mechanism. Includes regular expression, URI path depth, and link hop count filters that can be combined variously and attached at key points along the processing chain to enable fine-tuned inclusion/exclusion.
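
The politeness option above (lag as a multiple of the last request's duration, bounded by a minimum and maximum) can be sketched roughly as follows. The function name, parameter names, and default values are illustrative assumptions, not Heritrix's actual settings.

```python
def politeness_delay_ms(last_fetch_ms, delay_factor=5.0,
                        min_delay_ms=2000, max_delay_ms=30000):
    """Return how long to wait before the next request to the same host.

    The wait is a multiple of the time the most recent request took,
    clamped to the configured minimum and maximum. All names and
    defaults here are hypothetical.
    """
    scaled = delay_factor * last_fetch_ms
    return int(max(min_delay_ms, min(max_delay_ms, scaled)))


if __name__ == "__main__":
    for fetch_ms in (100, 1000, 10000):
        print(fetch_ms, "->", politeness_delay_ms(fetch_ms))
```

A slow server therefore gets proportionally longer pauses, while the bounds keep very fast or very slow responses from producing extreme delays.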

Key Components
• The Web Administrative Console is a standalone web application, hosted by the embedded Jetty Java HTTP server. Its web pages allow the operator to choose a crawl's components and parameters by composing a CrawlOrder, a configuration object that also has an external XML representation.
• A crawl is initiated by passing this CrawlOrder to the CrawlController, a component which instantiates and holds references to all configured crawl components. The CrawlController is the crawl's global context: all subcomponents can reach each other through it.
• The CrawlOrder contains sufficient information to create the Scope. The Scope seeds the Frontier with initial URIs and is consulted to decide which later-discovered URIs should also be scheduled.
• The Frontier has responsibility for ordering the URIs to be visited, ensuring URIs are not revisited unnecessarily, and moderating the crawler's visits to any one remote site. It achieves these goals by maintaining a series of internal queues of URIs to be visited, and a list of all URIs already visited or queued. URIs are only released from queues for fetching in a manner compatible with the configured politeness policy. The default provided Frontier implementation offers a primarily breadth-first, order-of-discovery policy for choosing URIs to process, with an option to prefer finishing sites in progress to beginning new sites. Other Frontier implementations are possible.
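
As a rough illustration of the Frontier's bookkeeping (per-site queues plus a record of every URI already visited or queued), here is a minimal sketch. The class `SimpleFrontier` and its methods are hypothetical and omit politeness timing; this is not Heritrix's implementation.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


class SimpleFrontier:
    """Toy frontier: per-host FIFO queues plus an already-seen set."""

    def __init__(self, seeds):
        self.seen = set()            # every URI already visited or queued
        self.queues = OrderedDict()  # host -> deque of pending URIs
        for uri in seeds:
            self.schedule(uri)

    def schedule(self, uri):
        """Queue a URI unless it was seen before; return True if queued."""
        if uri in self.seen:
            return False
        self.seen.add(uri)
        host = urlsplit(uri).netloc
        self.queues.setdefault(host, deque()).append(uri)
        return True

    def next_uri(self):
        """Return the next URI, rotating across hosts; None when empty."""
        if not self.queues:
            return None
        host, pending = self.queues.popitem(last=False)  # oldest host first
        uri = pending.popleft()
        if pending:
            self.queues[host] = pending  # host goes to the back of the line
        return uri
```

Rotating across hosts is one simple way to avoid hammering a single site; a "site-first" variant would instead keep draining the same host's queue before moving on.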
Multithreaded Crawler
• The Heritrix crawler is multithreaded in order to make progress on many URIs in parallel during network and local disk I/O lags. Each worker thread is called a ToeThread, and while a crawl is active, each ToeThread loops through steps that roughly correspond to the generic process outlined previously:
– Ask the Frontier for a next() URI.
– Pass the URI to each Processor in turn. (Distinct processors perform the fetching, analysis, and selection steps.)
– Report the completion of the finished() URI.
• The number of ToeThreads in a running crawler is adjustable to achieve maximum throughput given local resources. The number of ToeThreads usually ranges in the hundreds.
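
The three-step worker loop can be sketched with Python's standard threading tools. The names `toe_thread` and `run_crawl`, the dict used as per-URI state, and the `None` sentinel are assumptions made for illustration; a plain queue stands in for the Frontier.

```python
import queue
import threading


def toe_thread(frontier, processors, finished):
    """Worker loop: take a URI, run every processor, report completion."""
    while True:
        uri = frontier.get()
        if uri is None:          # sentinel: no more work for this thread
            break
        state = {"uri": uri}     # stand-in for per-URI crawl state
        for process in processors:
            process(state)
        finished.put(state)


def run_crawl(uris, processors, n_threads=4):
    """Process all URIs with n_threads workers and return their states."""
    frontier = queue.Queue()
    finished = queue.Queue()
    for uri in uris:
        frontier.put(uri)
    for _ in range(n_threads):
        frontier.put(None)       # one sentinel per worker
    workers = [threading.Thread(target=toe_thread,
                                args=(frontier, processors, finished))
               for _ in range(n_threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    results = []
    while not finished.empty():
        results.append(finished.get())
    return results
```

Raising `n_threads` lets more URIs wait on I/O at the same time, which is the point of making the worker count adjustable.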
CrawlURI instance / ServerCache
• Each URI is represented by a CrawlURI instance, which packages the URI with additional information collected during the crawling process, including arbitrary nested named attributes. The loosely coupled system components communicate their progress and output through the CrawlURI, which carries the results of earlier processing to later processors and, finally, back to the Frontier to influence future retries or scheduling.
• The ServerCache holds persistent data about servers that can be shared across CrawlURIs and time. It contains any number of CrawlServer entities, collecting information such as
– IP addresses,
– robots exclusion policies,
– historical responsiveness, and
– per-host crawl statistics.
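
One way to picture the relationship between these objects is the sketch below. The field names and the `server_for` method are hypothetical, chosen only to mirror the kinds of data listed above; they are not Heritrix's real class members.

```python
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class CrawlURI:
    uri: str
    # Arbitrary (possibly nested) named attributes added by processors.
    attributes: dict = field(default_factory=dict)


@dataclass
class CrawlServer:
    host: str
    ip_address: Optional[str] = None
    robots_rules: list = field(default_factory=list)
    fetch_count: int = 0         # simple per-host crawl statistic


class ServerCache:
    """Keeps one CrawlServer per host, shared by all CrawlURIs."""

    def __init__(self):
        self._servers = {}

    def server_for(self, curi):
        host = urlsplit(curi.uri).netloc
        if host not in self._servers:
            self._servers[host] = CrawlServer(host)
        return self._servers[host]
```

Two CrawlURIs on the same host resolve to the same CrawlServer object, so data such as robots rules only needs to be collected once per host.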
Processors
• The overall functionality of a crawler with respect to a scheduled URI is largely specified by the series of Processors configured to run.
• Each Processor in turn performs its tasks, marks up the CrawlURI state, and returns. The tasks performed will often vary conditionally based on URI type, history, or retrieved content. Certain CrawlURI state also affects whether and which further processing occurs. (For example, earlier Processors may cause later processing to be skipped.)
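
A minimal sketch of this idea follows, with a dict standing in for the CrawlURI and a hypothetical `skip_remaining` flag that an early processor sets to stop later ones. The processor functions and the hard-coded page content are made up for illustration; no network access takes place.

```python
import re


def run_processors(curi, processors):
    """Run processors in order until one marks the CrawlURI as skipped."""
    for processor in processors:
        if curi.get("skip_remaining"):
            break
        processor(curi)
    return curi


def check_robots(curi):
    # Pretend robots rules exclude everything on blocked.example.
    if curi["uri"].startswith("http://blocked.example/"):
        curi["status"] = "robots-excluded"
        curi["skip_remaining"] = True


def fetch(curi):
    # Fake fetch: fills in canned content instead of using the network.
    curi["status"] = "fetched"
    curi["content"] = "<a href='http://ok.example/next'>next</a>"


def extract_links(curi):
    curi["links"] = re.findall(r"href='([^']+)'", curi["content"])


CHAIN = [check_robots, fetch, extract_links]
```

Calling `run_processors({"uri": "http://blocked.example/x"}, CHAIN)` stops after `check_robots`, so the fetch and extraction steps never run for that URI.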
5 Chains - Prefetch/Fetch
• Processors in the Prefetch Chain receive the CrawlURI before any network activity to resolve or fetch the URI. Such Processors typically delay, reorder, or veto the subsequent processing of a CrawlURI, for example to ensure that robots exclusion policy rules are fetched and considered before a URI is processed.
• Processors in the Fetch Chain attempt network activity to acquire the resource referred to by a CrawlURI. In the typical case of an HTTP transaction, a Fetcher Processor will fill the "request" and "response" buffers of the CrawlURI, or indicate whatever error condition prevented those buffers from being filled.
5 Chains - Extract/Write/Postprocess
• Processors in the Extract Chain perform follow-up processing on a CrawlURI for which a fetch has already completed, extracting features of interest. Most commonly, these are new URIs that may also be eligible for visitation. URIs are only discovered at this step, not evaluated.
• Processors in the Write Chain store the crawl results (returned content or extracted features) to permanent storage. Our standard crawler merely writes data to the Internet Archive's ARC file format, but third parties have created Processors to write other data formats or index the crawled data.
• Finally, Processors in the Postprocess Chain perform final crawl-maintenance actions on the CrawlURI, such as testing discovered URIs against the Scope, scheduling them into the Frontier if necessary, and updating internal crawler information caches.
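
The Postprocess step of testing discovered URIs against the Scope can be sketched by combining the three filter types mentioned earlier (regular expression, path depth, hop count). The function names and parameters are hypothetical, and real scope rules are considerably richer than this.

```python
import re
from urllib.parse import urlsplit


def in_scope(uri, hops, *, pattern, max_path_depth, max_hops):
    """Accept a URI only if it passes all three illustrative filters."""
    path = urlsplit(uri).path
    depth = len([segment for segment in path.split("/") if segment])
    return (re.search(pattern, uri) is not None
            and depth <= max_path_depth
            and hops <= max_hops)


def postprocess(discovered, parent_hops, schedule, **scope):
    """Schedule each discovered URI that passes the scope test."""
    accepted = []
    for uri in discovered:
        hops = parent_hops + 1   # one link hop further than the parent
        if in_scope(uri, hops, **scope):
            schedule(uri)
            accepted.append(uri)
    return accepted
```

Here `schedule` is any callable that hands a URI to the Frontier; in a quick experiment it can simply be the `append` method of a list.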
Limitations
• Heritrix has been used primarily for doing focused crawls to date. The broad and continuous use cases are to be tackled in the next phase of development (see below). Key current limitations to keep in mind are:
– Single instance only: cannot coordinate crawling amongst multiple Heritrix instances, whether all instances are run on a single machine or spread across multiple machines.
– Requires sophisticated operator tuning to run large crawls within machine resource limits.
– Only officially supported and tested on Linux.
– Each crawl run is independent, without support for scheduled revisits to areas of interest or incremental archival of changed material.
– Limited ability to recover from in-crawl hardware/system failure.
– Minimal time spent profiling and optimizing has Heritrix coming up short on performance requirements (see Crawler Performance below).

