Nutch Project Configuration 2: Whole-Web Search (original)


First, let's look at Nutch's overall workflow.


The rest of this post walks through the whole-web search part of http://lucene.apache.org/nutch/tutorial8.html:

Whole-web: Bootstrapping the Web Database

The injector adds urls to the crawldb. Let's inject URLs from the DMOZ Open Directory. First we must download and uncompress the file listing all of the DMOZ pages. (This is a 200+Mb file, so this will take a few minutes.)

wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz

Next we select a random subset of these pages. (We use a random subset so that everyone who runs this tutorial doesn't hammer the same sites.) DMOZ contains around three million URLs. We select one out of every 5000, so that we end up with around 1000 URLs:

mkdir dmoz
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls

 

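Download content.rdf.u8.gz directly from http://rdf.dmoz.org/rdf/content.rdf.u8.gz. It provides a very large set of URLs for testing. Then extract the downloaded archive into your Nutch root directory, for example d:\nutch\nutch-0.9.

Run the following commands in Cygwin:

mkdir dmoz: creates the dmoz directory under the current directory.

bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls: generates the URLs used for testing. It uses DmozParser, a tool shipped with Nutch, to parse content.rdf.u8 and keep one out of every 5000 URLs. DMOZ contains a little over three million URLs, so by the tutorial's estimate this yields roughly 1000 URLs, which are written to the urls file in the dmoz directory.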
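To get a feel for what "one out of every N" means, here is a small sketch. It is only an illustration of the arithmetic and does not claim to reproduce how DmozParser picks its subset; the function name sample_one_in_n and the example.com URLs are made up for the demo.

```shell
#!/bin/sh
# Illustration only: keep one line out of every N from a list on stdin.
# This is NOT DmozParser's implementation, just the sampling arithmetic.
sample_one_in_n() {
    # $1 = N; prints every N-th input line
    awk -v n="$1" 'NR % n == 0'
}

# Demo on 20 made-up URLs: one in five leaves pages 5, 10, 15 and 20.
seq 1 20 | sed 's|^|http://example.com/page|' | sample_one_in_n 5

# Expected size of a 1-in-5000 subset of exactly 3,000,000 URLs:
echo $((3000000 / 5000))    # prints 600
```

The real count in dmoz/urls depends on how many URLs the dump actually contains, so wc -l dmoz/urls is the quickest way to see what you got.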

 

 

The parser also takes a few minutes, as it must parse the full file. Finally, we initialize the crawl db with the selected urls.

bin/nutch inject crawl/crawldb dmoz

Now we have a web database with around 1000 as-yet unfetched URLs in it.

 

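bin/nutch inject crawl/crawldb dmoz: injects the URLs in dmoz into crawl/crawldb, which initializes the crawldb. This is the "add initial URLs" step in the Nutch workflow diagram above: the URLs are written to the crawldb directory, which stores the URL information.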

 

Whole-web: Fetching

Starting from 0.8 nutch user agent identifier needs to be configured before fetching. To do this you must edit the file conf/nutch-site.xml , insert at minimum following properties into it and edit in proper values for the properties:


<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
  http.robots.agents
  http.agent.description
  http.agent.url
  http.agent.email
  http.agent.version
  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
 

 

 

To fetch, we first generate a fetchlist from the database:

bin/nutch generate crawl/crawldb crawl/segments

 

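Before fetching actually starts, Nutch has to be configured. These settings mainly tell the sites being crawled who the crawler is, and descriptive values make it easier for site owners to understand the crawler.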


<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
  http.robots.agents
  http.agent.description
  http.agent.url
  http.agent.email
  http.agent.version
  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>MyNutch</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>www.XX.com</value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>[email protected]</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
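Since the description says http.agent.name must not be empty, a quick check before fetching can save a failed run. The following is a rough sketch, not a real XML parser: it assumes conf/nutch-site.xml uses the usual layout with each <name> and <value> element on its own line, and the function name agent_name_value is made up here.

```shell
#!/bin/sh
# Rough check, assuming <name> and <value> each sit on their own line.
# agent_name_value is a hypothetical helper, not part of Nutch.
agent_name_value() {
    # $1 = path to nutch-site.xml; prints the value of http.agent.name
    awk '
        /<name>http\.agent\.name<\/name>/ { found = 1; next }
        found && /<value>/ {
            sub(/.*<value>/, ""); sub(/<\/value>.*/, "")
            print; exit
        }
    ' "$1"
}

conf=${1:-conf/nutch-site.xml}
if [ -f "$conf" ]; then
    if [ -n "$(agent_name_value "$conf")" ]; then
        echo "http.agent.name is set"
    else
        echo "http.agent.name is empty; set it before fetching" >&2
    fi
fi
```

Run it from the Nutch root directory, or pass the path to the config file as the first argument.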
 

 

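bin/nutch generate crawl/crawldb crawl/segments: generates a fetchlist of pages to fetch from the crawldb.

This is the second step in the diagram above: creating a new segment.

When the command completes normally, it creates the following directory: crawl/segments/20080701162119/crawl_generate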

 

This generates a fetchlist for all of the pages due to be fetched. The fetchlist is placed in a newly created segment directory. The segment directory is named by the time it's created. We save the name of this segment in the shell variable s1 :

s1=`ls -d crawl/segments/2* | tail -1`
echo $s1

Now we run the fetcher on this segment with:

bin/nutch fetch $s1

When this is complete, we update the database with the results of the fetch:

bin/nutch updatedb crawl/crawldb $s1

Now the database has entries for all of the pages referenced by the initial set.

Now we fetch a new segment with the top-scoring 1000 pages:

bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2

bin/nutch fetch $s2
bin/nutch updatedb crawl/crawldb $s2

Let's fetch one more round:

bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
echo $s3

bin/nutch fetch $s3
bin/nutch updatedb crawl/crawldb $s3

By this point we've fetched a few thousand pages. Let's index them!

 

 

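Explanation 1:

Round 1:

s1=`ls -d crawl/segments/2* | tail -1`
echo $s1

stores the path of the newest segment directory, 20080701162119, in the variable s1,

bin/nutch fetch $s1

starts the first round of fetching,

bin/nutch updatedb crawl/crawldb $s1

updates the database, storing the information about the fetched pages in it.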
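The ls -d ... | tail -1 idiom works because segment names are timestamps, so sorting them as text also sorts them by creation time. The sketch below shows this in a throwaway directory; the helper name latest_segment is made up, and the three directory names are the ones that appear in this post.

```shell
#!/bin/sh
# Demo in a temporary directory: timestamp names sort chronologically,
# so the last entry listed by ls is the newest segment.
demo=$(mktemp -d)
mkdir -p "$demo/crawl/segments/20080701162119" \
         "$demo/crawl/segments/20080701162531" \
         "$demo/crawl/segments/20080701163439"

latest_segment() {
    # $1 = segments directory; prints the path of the most recent segment
    ls -d "$1"/2* | tail -1
}

s=$(latest_segment "$demo/crawl/segments")
basename "$s"    # prints 20080701163439
rm -rf "$demo"
```

Note that the pattern 2* only matches names starting with the digit 2, which holds for any segment created in the years 2000 to 2999.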

 

 

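Explanation 2:

Round 2:

bin/nutch generate crawl/crawldb crawl/segments -topN 1000

creates a new segment, selecting the 1000 top-scoring URLs for the second fetch,

s2=`ls -d crawl/segments/2* | tail -1`
echo $s2

stores the path of the new segment directory, 20080701162531, in the variable s2,

bin/nutch fetch $s2

starts the second round of fetching,

bin/nutch updatedb crawl/crawldb $s2

updates the database, storing the information about the new pages in it.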

 

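Round 3:

bin/nutch generate crawl/crawldb crawl/segments -topN 1000

as in round 2, creates a new segment, selecting the 1000 top-scoring URLs to fetch,

s3=`ls -d crawl/segments/2* | tail -1`
echo $s3

stores the path of the new segment directory, 20080701163439, in the variable s3,

bin/nutch fetch $s3

starts the third round of fetching,

bin/nutch updatedb crawl/crawldb $s3

updates the database, storing the information about the newly fetched pages in it.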

 

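When this finishes, the following directories exist under the segments directory:

20080701162119/content,crawl_fetch,crawl_parse,parse_data,parse_text

This part covers the third step ("crawl and fetch") and the fourth step ("content analysis") in the diagram above.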
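Rounds 2 and 3 are the same three commands, so they can be wrapped in a loop. The following is a sketch under assumptions: the names NUTCH, CRAWL and crawl_rounds are chosen here and are not part of Nutch, and only the generate, fetch and updatedb invocations shown in the tutorial are used.

```shell
#!/bin/sh
# Sketch: repeat the generate / fetch / updatedb cycle several times.
# NUTCH, CRAWL and crawl_rounds are names made up for this example.
NUTCH=${NUTCH:-bin/nutch}
CRAWL=${CRAWL:-crawl}

crawl_rounds() {
    # $1 = number of rounds, $2 = topN per round
    i=1
    while [ "$i" -le "$1" ]; do
        $NUTCH generate "$CRAWL/crawldb" "$CRAWL/segments" -topN "$2" || return 1
        # The newest segment is the one generate has just created.
        seg=$(ls -d "$CRAWL"/segments/2* | tail -1)
        $NUTCH fetch "$seg" || return 1
        $NUTCH updatedb "$CRAWL/crawldb" "$seg" || return 1
        i=$((i + 1))
    done
}

# Example (uncomment inside a Nutch installation): two rounds of 1000 pages.
# crawl_rounds 2 1000
```

One caveat: if generate creates no new segment, tail -1 returns the previous one and the loop would fetch it again, so a more careful version would compare the segment name before and after generate.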

Whole-web: Indexing

Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages.

bin/nutch invertlinks crawl/linkdb crawl/segments/*

To index the segments we use the index command, as follows:

bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

Now we're ready to search!

 

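Building the index takes two steps:

bin/nutch invertlinks crawl/linkdb crawl/segments/*: inverts all of the links. The segments record each page's outgoing links; after inversion, every URL has a list of the links pointing to it, together with their anchor text, so that this incoming anchor text can be indexed with the page.

After a successful run, the directory crawl/linkdb is created.

bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*: builds the index.

After a successful run, the directory crawl/indexes is created.

This is the indexing step in the diagram above.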
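Link inversion is easier to see on a toy example. The sketch below is not how Nutch implements invertlinks; it only shows the idea of turning "source links to target" records into "target is linked from source" records. The input format, the function name invert_links and the example URLs are all made up.

```shell
#!/bin/sh
# Toy illustration of link inversion (not Nutch's implementation).
# Input lines:  <source-url> <target-url> <anchor text...>
# Output lines: <target-url> <- <source-url> [anchor text]
invert_links() {
    awk '{
        src = $1; dst = $2
        anchor = $0
        sub(/^[^ ]+ +[^ ]+ */, "", anchor)   # drop the two URLs, keep the anchor
        print dst " <- " src " [" anchor "]"
    }' | LC_ALL=C sort
}

invert_links <<'EOF'
http://a.example/ http://c.example/ great search engine
http://b.example/ http://c.example/ Nutch home
http://a.example/ http://b.example/ my friend
EOF
# prints:
# http://b.example/ <- http://a.example/ [my friend]
# http://c.example/ <- http://a.example/ [great search engine]
# http://c.example/ <- http://b.example/ [Nutch home]
```

After sorting, all incoming links of one page sit next to each other, which is the view the indexer needs to attach anchor text such as "Nutch home" to the page it points at.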

 

Finally, a note on what each directory is for:

Whole-web: Concepts

Nutch data is composed of:

  1. The crawl database, or crawldb . This contains information about every url known to Nutch, including whether it was fetched, and, if so, when.
  2. The link database, or linkdb . This contains the list of known links to each url, including both the source url and anchor text of the link.
  3. A set of segments . Each segment is a set of urls that are fetched as a unit. Segments are directories with the following subdirectories:
    • a crawl_generate names a set of urls to be fetched
    • a crawl_fetch contains the status of fetching each url
    • a content contains the content of each url
    • a parse_text contains the parsed text of each url
    • a parse_data contains outlinks and metadata parsed from each url
    • a crawl_parse contains the outlink urls, used to update the crawldb
  4. The indexes are Lucene-format indexes.

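crawldb: information about every URL Nutch knows of, including whether it has been fetched and, if so, when.

linkdb: the list of known links to each URL, including the source URL and the anchor text of each link.

Directories under segments:

crawl_generate: the list of URLs to be fetched

crawl_fetch: the fetch status of each URL

content: the content of each URL

parse_text: the parsed text of each URL

parse_data: the outlinks and metadata parsed from each URL

crawl_parse: the outlink URLs, used to update the crawldb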
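A fetched and parsed segment should contain all six subdirectories listed above, so a small check can show how far a segment got. This is a sketch; the function name check_segment is made up for this example.

```shell
#!/bin/sh
# Sketch: report which of the six segment subdirectories are missing.
# check_segment is a hypothetical helper, not a Nutch command.
check_segment() {
    # $1 = segment directory; returns non-zero if anything is missing
    missing=0
    for d in crawl_generate crawl_fetch content parse_text parse_data crawl_parse; do
        if [ ! -d "$1/$d" ]; then
            echo "missing: $1/$d"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check every segment of the crawl, if there are any.
for seg in crawl/segments/2*; do
    [ -d "$seg" ] && check_segment "$seg"
done
```

A segment that has only been generated but not yet fetched would be reported as missing everything except crawl_generate.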

 

 

 

 

 
