
Getting Started with Nutch

Introduction to Nutch:

Nutch is an open-source search engine implemented in Java. It provides all of the tools we need to run our own search engine, including full-text search and a web crawler.


1. Installing Nutch

1) Install Subversion

  • Set the root password:

sudo passwd root
This command sets a password for the root user.
Then run su root
to switch to the root user.

For example:

jiangzl@ubuntu:~$ sudo passwd root

[sudo] password for jiangzl:

Enter new UNIX password:

Retype new UNIX password:

passwd: password updated successfully

jiangzl@ubuntu:~$ su root

Password:

root@ubuntu:/home/jiangzl#

  • apt-get install subversion

2) Download the nutch-1.6 source with svn

  • svn co https://svn.apache.org/repos/asf/nutch/tags/release-1.6/
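After the checkout finishes you can sanity-check the working copy; the exact listing may vary, but the tree should at least contain build.xml and conf/:

# confirm the source tree is in place
ls release-1.6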

3) Configure nutch-site.xml

  • ls ~/release-1.6/conf | grep nutch

  • nutch-conf.xml

  • nutch-default.xml   // Nutch's default configuration

  • nutch-site.xml      // overrides values from the default configuration

  • nutch-site.xml.template  // template

  • Copy the template nutch-site.xml.template to create nutch-site.xml:

cp nutch-site.xml.template nutch-site.xml

Configure the agent in nutch-site.xml

The format of the entry can be found in nutch-default.xml (the default configuration):

more -10 nutch-default.xml

/http.agent.name     (inside more, search for the property, then copy the block out)
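If you would rather not page through the file with more, an equivalent non-interactive way to pull the block out (a sketch; the -A line count is a guess at the property's length):

# print the http.agent.name property plus the lines that follow it
grep -A 6 '<name>http.agent.name</name>' nutch-default.xml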

Add the HTTP agent: the http.agent.name setting

root@localhost:/home/jiangzl/release-1.6/conf# vim nutch-site.xml

<configuration>

<property>

  <name>http.agent.name</name>

  <value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0</value>

  <description>HTTP 'User-Agent' request header. MUST NOT be empty -

  please set this to a single word uniquely related to your organization.

 

  NOTE: You should also check other related properties:

 

        http.robots.agents

        http.agent.description

        http.agent.url

        http.agent.email

        http.agent.version

 

  and set their values appropriately.

 

  </description>

</property>

</configuration>

4) Build the nutch-1.6 source with ant

  • cd release-1.6/    (enter the Nutch root directory)

  • apt-get install ant  (install the ant build tool)

  • ant             (build from the Nutch root; two new directories appear: build and runtime)

  • cd runtime       (contains two directories, deploy (Hadoop cluster mode) and local (local mode), one for each of Nutch's two run modes)

deploy -> Hadoop cluster mode

So how are Nutch and Hadoop connected?

Script location: ls runtime/deploy/bin   (the script is named nutch)

Job location: ls runtime/deploy   (the job file is named apache-nutch-1.6.job)

The hadoop command inside the nutch script submits apache-nutch-1.6.job to Hadoop's JobTracker.
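In other words, deploy mode is just a Hadoop job submission. A minimal sketch of the equivalent hand-typed command, assuming the org.apache.nutch.crawl.Crawl class that backs the (deprecated) crawl sub-command:

# submit the crawl job to the Hadoop cluster by hand
hadoop jar apache-nutch-1.6.job org.apache.nutch.crawl.Crawl urls -dir data -depth 3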

local -> local mode

root@localhost:/home/jiangzl/release-1.6/runtime/local# ls

bin  conf  lib  logs  plugins  test

2. Running a Nutch crawl

1) Run the crawl command

  • cd runtime/local/

  • mkdir urls  // create a directory for the seed URLs

  • vim urls/url   // write the URL(s) into it

  • (any site will do, e.g. http://blog.tianya.cn)

  • ./bin/nutch    (run it bare to see what arguments it accepts -> you can also read the script itself: vim bin/nutch)

  • ./bin/nutch crawl    (if you don't know a sub-command's options, just press Enter and it prints the usage)

root@localhost:/home/jiangzl/release-1.6/runtime/local# bin/nutch crawl

Usage: Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]

  • <urlDir>  the directory holding the seed URLs to crawl

  • -solr <solrURL>  build a Solr index  (optional)

  • [-dir d]  the directory for the crawl output; d can be a relative or an absolute path

  • [-threads n]  the number of fetcher threads, 10 by default

  • [-depth i]   the depth, i.e. how many levels to crawl

  • [-topN N]   for this crawl round, each map task fetches at most N URLs

  • Run the command: bin/nutch crawl urls -dir data2 -threads 10 -depth 3

  • Run in the background: nohup bin/nutch crawl urls -dir data -threads 10 -depth 3 &   (a complete end-to-end sketch follows this list)
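Putting the pieces together, a minimal end-to-end sketch of the whole run, assuming the urls/url seed file and data/ output directory used throughout this walkthrough:

cd runtime/local/

# create the seed list without an interactive editor
mkdir -p urls
echo 'http://blog.tianya.cn/' > urls/url

# crawl 3 levels deep with 10 threads, output under data/, in the background
nohup bin/nutch crawl urls -dir data -threads 10 -depth 3 &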

2) View the logs

  • cat runtime/local/nohup.out    (the crawl's console output, captured by nohup)

  • cat runtime/local/logs/hadoop.log   (the detailed log of Nutch's Hadoop operations)
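While the crawl runs, a quick way to scan the detailed log for trouble (a sketch; the patterns are just the usual log keywords):

grep -iE 'error|exception' runtime/local/logs/hadoop.log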

3) Inspect the results

 

root@localhost:/home/jiangzl/release-1.6/runtime/local# ls data/ -l

crawldb    all of the URLs the crawler knows about during the crawl

linkdb      the link database (the inverted link structure for each URL)

segments   the sets of page data gathered by each crawl round

           

root@localhost:/home/jiangzl/release-1.6/runtime/local# ls data/crawldb/

current  the latest URL data as of the most recent crawl round

old     on each round, the previous current is rotated into old

root@localhost:/home/jiangzl/release-1.6/runtime/local# ls data/crawldb/current/

part-00000

root@localhost:/home/jiangzl/release-1.6/runtime/local# ls data/crawldb/current/part-00000/

data  holds the data for all of the URLs

index  an index over the URLs, so later reads are fast

root@localhost:/home/jiangzl/release-1.6/runtime/local# ls data/segments/

20140512055159  20140512055242

root@localhost:/home/jiangzl/release-1.6/runtime/local# ls data/segments/20140512055159/

content         the complete fetched content of each page

crawl_fetch      the fetch status of each URL

crawl_generate   the set of URLs scheduled for fetching in this round

crawl_parse      the parsed-out URL data used to update the crawldb

parse_data       each page's metadata

parse_text       each page's extracted text
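Note that these directories hold Hadoop map/sequence files rather than plain text, so inspect them with the read commands from section 5) instead of cat. A short sketch, reusing the segment timestamp from this run:

# print summary statistics for the crawldb (URL counts by status)
bin/nutch readdb data/crawldb -stats

# dump one segment to plain text under segdump/ for inspection
bin/nutch readseg -dump data/segments/20140512055159 segdump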

4) Architecture diagram

root@localhost:/home/jiangzl/release-1.6/runtime/local# cat nohup.out | grep elapsed

Injector: finished at 2014-05-12 05:51:51, elapsed: 00:00:18

Generator: finished at 2014-05-12 05:52:06, elapsed: 00:00:15

Fetcher: finished at 2014-05-12 05:52:13, elapsed: 00:00:07

ParseSegment: finished at 2014-05-12 05:52:20, elapsed: 00:00:07

CrawlDb update: finished at 2014-05-12 05:52:34, elapsed: 00:00:13

 

Generator: finished at 2014-05-12 05:52:49, elapsed: 00:00:15

Fetcher: finished at 2014-05-12 05:59:19, elapsed: 00:06:30

ParseSegment: finished at 2014-05-12 05:59:29, elapsed: 00:00:10

CrawlDb update: finished at 2014-05-12 05:59:42, elapsed: 00:00:13

 

LinkDb: finished at 2014-05-12 05:59:52, elapsed: 00:00:10

(Order: an Injector run, then the crawl loop of Generator/Fetcher/ParseSegment/CrawlDb update repeated once per depth level, then LinkDb)
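The same pipeline can also be driven by hand with the individual sub-commands listed in section 5). A sketch of one loop iteration, under the same data/ layout as above:

bin/nutch inject data/crawldb urls                    # Injector: seed the crawldb
bin/nutch generate data/crawldb data/segments         # Generator: create a new segment
s=$(ls -d data/segments/* | tail -1)                  # pick the newest segment
bin/nutch fetch $s                                    # Fetcher: download the pages
bin/nutch parse $s                                    # ParseSegment: parse what was fetched
bin/nutch updatedb data/crawldb $s                    # CrawlDb update: merge results back
bin/nutch invertlinks data/linkdb -dir data/segments  # LinkDb: build the link database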

 

root@localhost:/home/jiangzl/release-1.6/runtime/local# cat nohup.out | grep elapsed | grep Injector

Injector: finished at 2014-05-12 05:51:51, elapsed: 00:00:18

root@localhost:/home/jiangzl/release-1.6/runtime/local# cat nohup.out | grep elapsed | grep Generator

Generator: finished at 2014-05-12 05:52:06, elapsed: 00:00:15

Generator: finished at 2014-05-12 05:52:49, elapsed: 00:00:15

root@localhost:/home/jiangzl/release-1.6/runtime/local# cat nohup.out | grep elapsed | grep Fetcher

Fetcher: finished at 2014-05-12 05:52:13, elapsed: 00:00:07

Fetcher: finished at 2014-05-12 05:59:19, elapsed: 00:06:30

root@localhost:/home/jiangzl/release-1.6/runtime/local# cat nohup.out | grep elapsed | grep ParseSegment

ParseSegment: finished at 2014-05-12 05:52:20, elapsed: 00:00:07

ParseSegment: finished at 2014-05-12 05:59:29, elapsed: 00:00:10

root@localhost:/home/jiangzl/release-1.6/runtime/local# cat nohup.out | grep elapsed | grep update

CrawlDb update: finished at 2014-05-12 05:52:34, elapsed: 00:00:13

CrawlDb update: finished at 2014-05-12 05:59:42, elapsed: 00:00:13

root@localhost:/home/jiangzl/release-1.6/runtime/local# cat nohup.out | grep elapsed | grep LinkDb

LinkDb: finished at 2014-05-12 05:59:52, elapsed: 00:00:10

 

Segment N -> the -depth i set at crawl time is the number of loop iterations (the crawl depth),

and it also tells you which level a given segment belongs to.

[Figure 1: Nutch crawl architecture diagram]

5) What every nutch sub-command does

root@localhost:/home/jiangzl/release-1.6/runtime/local# bin/nutch

Usage: nutch COMMAND

where COMMAND is one of:

  crawl             one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)

  readdb            read / dump crawl db

  mergedb           merge crawldb-s, with optional filtering

  readlinkdb        read / dump link db

  inject            inject new urls into the database

  generate          generate new segments to fetch from crawl db

  freegen           generate new segments to fetch from text files

  fetch             fetch a segment's pages

  parse             parse a segment's pages

  readseg           read / dump segment data

  mergesegs         merge several segments, with optional filtering and slicing

  updatedb          update crawl db from segments after fetching

  invertlinks       create a linkdb from parsed segments

  mergelinkdb       merge linkdb-s, with optional filtering

  solrindex         run the solr indexer on parsed segments and linkdb

  solrdedup         remove duplicates from solr

  solrclean         remove HTTP 301 and 404 documents from solr

  parsechecker      check the parser for a given url

  indexchecker      check the indexing filters for a given url

  domainstats       calculate domain statistics from crawldb

  webgraph          generate a web graph from existing segments

  linkrank          run a link analysis program on the generated web graph

  scoreupdater      updates the crawldb with linkrank scores

  nodedumper        dumps the web graph's node scores

  plugin            load a plugin and run one of its classes main()

  junit             runs the given JUnit test
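Two of these are handy for spot-checking before a full crawl; for example, against the seed site used above:

# verify that a single URL fetches and parses correctly
bin/nutch parsechecker http://blog.tianya.cn/

# see what the indexing filters would produce for that URL
bin/nutch indexchecker http://blog.tianya.cn/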

