The Crawl Database is a data store where Nutch stores every URL, together with the metadata that it knows about. In Hadoop terms it's a SequenceFile (meaning all records are stored in a sequential manner) consisting of tuples of URL and CrawlDatum.
Operations (like inserts, deletes and updates) on the Crawl Database, as on Nutch's other data structures, are processed in batch mode. Here is an example of the contents of a crawldb:
http://www.example.com/page1.html -> status=..., fetchTime=..., retries=..., fetchInterval=..., ...
http://www.example.com/page2.html -> status=..., fetchTime=..., retries=..., fetchInterval=..., ...
http://www.example.com/page3.html -> status=..., fetchTime=..., retries=..., fetchInterval=..., ...
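To make the on-disk layout concrete, here is a minimal sketch that reads those URL, CrawlDatum tuples straight out of a crawldb part file with the plain Hadoop SequenceFile API. The class name and the part-file path are assumptions (the exact layout, e.g. crawl/crawldb/current/part-00000/data, varies between Nutch versions); in practice the bundled bin/nutch readdb command does this for you.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class CrawlDbDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // args[0]: path to one part file inside the crawldb directory.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    try {
      Text url = new Text();               // key: the URL
      CrawlDatum datum = new CrawlDatum(); // value: status, fetchTime, retries, ...
      while (reader.next(url, datum)) {
        // CrawlDatum.toString() is multi-line; flatten it to one line per record.
        System.out.println(url + " -> " + datum.toString().replace('\n', ' '));
      }
    } finally {
      reader.close();
    }
  }
}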
The Link Database is a data structure (a SequenceFile of URL -> Inlinks tuples) that contains all inverted links. In the parsing phase Nutch can extract outlinks from a document and store them in the format source_url -> target_url,anchor_text.
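To illustrate what "inverted" means here, the following sketch is a self-contained, in-memory illustration (not Nutch's actual MapReduce implementation; all URLs and anchor texts are made up). It turns per-source outlinks, as produced by parsing, into the per-target inlinks view that the Link Database stores:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertLinksSketch {
  public static void main(String[] args) {
    // source URL -> list of (target URL, anchor text) pairs, from parsing
    Map<String, String[][]> outlinks = Map.of(
        "http://a.example/", new String[][] {
            { "http://b.example/", "B home" },
            { "http://c.example/", "see C" } },
        "http://b.example/", new String[][] {
            { "http://c.example/", "C again" } });

    // target URL -> list of (source URL, anchor text): the LinkDB view
    Map<String, List<String>> inlinks = new TreeMap<>();
    for (Map.Entry<String, String[][]> e : outlinks.entrySet()) {
      for (String[] link : e.getValue()) {
        inlinks.computeIfAbsent(link[0], k -> new ArrayList<>())
               .add(e.getKey() + " (anchor: \"" + link[1] + "\")");
      }
    }
    inlinks.forEach((target, sources) ->
        System.out.println(target + " <- " + sources));
  }
}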
The Inject command in Nutch has one responsibility: inject more URLs into the Crawl Database. Normally you should collect a set of URLs to add and then process them in one batch, so that the fixed cost of a batch run is paid once rather than per URL. Injection runs as two MapReduce jobs:
Job 1: Convert the plain-text URL list into URL,CrawlDatum tuples and dedupe (MapReduce job)
Job 2: Merge with the existing CrawlDB and dedupe (MapReduce job)
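Both jobs are launched by the Injector tool. As a minimal sketch (the crawl/crawldb and urls paths are assumptions; urls/ holds plain-text files with one URL per line), this is the programmatic equivalent of running bin/nutch inject:

import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.util.NutchConfiguration;

public class InjectExample {
  public static void main(String[] args) throws Exception {
    // Equivalent to `bin/nutch inject crawl/crawldb urls`.
    int res = ToolRunner.run(NutchConfiguration.create(), new Injector(),
        new String[] { "crawl/crawldb", "urls" });
    System.exit(res);
  }
}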
The Generate command in Nutch is used to generate a list of URLs to fetch from the Crawl Database; URLs with the highest scores are preferred.
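A sketch of invoking it programmatically, with the same assumed paths as above; -topN is the Generator option that caps the fetch list at the N best-scoring URLs, and the output is a new timestamped segment under the segments directory:

import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.util.NutchConfiguration;

public class GenerateExample {
  public static void main(String[] args) throws Exception {
    // Equivalent to `bin/nutch generate crawl/crawldb crawl/segments -topN 1000`:
    // select up to the 1000 highest-scoring eligible URLs into a new
    // timestamped segment under crawl/segments.
    int res = ToolRunner.run(NutchConfiguration.create(), new Generator(),
        new String[] { "crawl/crawldb", "crawl/segments", "-topN", "1000" });
    System.exit(res);
  }
}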
The Fetcher is responsible for fetching content from URLs and writing it to disk. It also optionally parses the content. URLs are read from a Fetch List generated by the Generator.
The Parser reads the raw fetched content, parses it, and stores the results.
The UpdateDB command reads the CrawlDatums from a segment (including those of newly extracted URLs) and merges them into the existing CrawlDB.
The Invertlinks command inverts link information, so that we can use the anchor texts from other documents that point to a document together with the rest of that document's data.
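Tying the remaining steps together, here is a sketch of one full fetch cycle after inject and generate. It assumes a local filesystem layout under crawl/ and that the newest directory under crawl/segments is the segment Generate just created (segment names are timestamps, so the lexicographically last one is the newest); the class name CrawlCycle is made up:

import java.io.File;
import java.util.Arrays;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.LinkDb;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlCycle {
  public static void main(String[] args) throws Exception {
    // Pick the newest segment directory (assumption: local filesystem).
    File[] segs = new File("crawl/segments").listFiles(File::isDirectory);
    Arrays.sort(segs);
    String segment = segs[segs.length - 1].getPath();

    // bin/nutch fetch <segment>: fetch the listed URLs, write raw content.
    ToolRunner.run(NutchConfiguration.create(), new Fetcher(),
        new String[] { segment });
    // bin/nutch parse <segment>: parse fetched content, extract outlinks.
    ToolRunner.run(NutchConfiguration.create(), new ParseSegment(),
        new String[] { segment });
    // bin/nutch updatedb <crawldb> <segment>: merge new/updated CrawlDatums.
    ToolRunner.run(NutchConfiguration.create(), new CrawlDb(),
        new String[] { "crawl/crawldb", segment });
    // bin/nutch invertlinks <linkdb> <segment>: add inverted links to the LinkDB.
    ToolRunner.run(NutchConfiguration.create(), new LinkDb(),
        new String[] { "crawl/linkdb", segment });
  }
}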