Steps for Crawling Web Pages with Nutch
1. Create the seed URL list (a sketch of writing it to a seed file follows the URLs below)
http://www.qq.com/
http://www.sina.com.cn/
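A minimal sketch of preparing the seed directory that the inject command below reads. The file name seed.txt is only a placeholder, since Nutch's Injector reads every file found under the given directory:

# seed.txt is a placeholder name; the Injector reads every file under urls/
mkdir -p urls
cat > urls/seed.txt <<'EOF'
http://www.qq.com/
http://www.sina.com.cn/
EOF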
2. Inject the seed URLs into the Nutch crawldb
hadoop@slave5:~/nutch$ nutch inject crawl/crawldb urls/
Injector: starting at 2013-07-14 17:19:07
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 2
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-07-14 17:19:10, elapsed: 00:00:02
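As a quick sanity check (not part of the original steps), the readdb tool can print crawldb statistics, which should now show the injected URLs:

bin/nutch readdb crawl/crawldb -stats   # prints total URL count, status counts, min/avg/max scores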
3. Generate a fetch list from the crawldb (this creates a new segment)

hadoop@slave5:~/nutch$ nutch generate crawl/crawldb crawl/segments
Generator: starting at 2013-07-14 17:19:14
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130714171917
Generator: finished at 2013-07-14 17:19:18, elapsed: 00:00:03
4. We need the newest segment directory as a parameter, so store it in the SEGMENT environment variable
export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
hadoop@slave5:~/nutch$ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
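To confirm the variable points at the segment just generated, echo it; the value should match the Generator log above:

echo $SEGMENT   # expect crawl/segments/20130714171917 for this run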
5. Start the fetcher to actually begin fetching content
nutch fetch $SEGMENT -noParsing
6. Parse the fetched content
bin/nutch parse $SEGMENT
hadoop@slave5:~/nutch$ bin/nutch fetch $SEGMENT -noParsing
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-07-14 17:20:18
Fetcher: segment: crawl/segments/20130714171917
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 2 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.qq.com/ (queue crawl delay=5000ms)
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.sina.com.cn/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-07-14 17:20:21, elapsed: 00:00:02
hadoop@slave5:~/nutch$ nutch parse $SEGMENT
ParseSegment: starting at 2013-07-14 17:21:35
ParseSegment: segment: crawl/segments/20130714171917
Parsed (10ms):http://www.qq.com/
http://www.sina.com.cn/ skipped. Content of size 135311 was truncated to 65536
ParseSegment: finished at 2013-07-14 17:21:37, elapsed: 00:00:01
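Two messages in this log are worth a note: the Fetcher warning means http.agent.name must be configured to identify your crawler, and the "truncated to 65536" line comes from Nutch's default http.content.limit of 64 KB, which is also why the parse of that page was skipped. Both properties live in conf/nutch-site.xml; a minimal sketch of that file is below (the agent name "mycrawler" and the 131072-byte limit are placeholder values, and writing the file this way replaces any existing settings):

cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- identify the crawler; Nutch will not fetch without this -->
  <property>
    <name>http.agent.name</name>
    <value>mycrawler</value>
  </property>
  <!-- raise the 65536-byte default so large pages are not truncated -->
  <property>
    <name>http.content.limit</name>
    <value>131072</value>
  </property>
</configuration>
EOF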
7. Update the Nutch crawldb. The updatedb command takes the new URLs discovered by the fetch and parse steps above from the latest segment and stores them in the crawldb so crawling can continue from them. Besides the URLs themselves, Nutch also records the fetch status of each page, so the same URLs are not fetched over and over again.
bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
At this point one complete crawl cycle is finished; you can repeat the generate/fetch/parse/updatedb steps many times to crawl more content (a loop sketch follows the updatedb output below).
hadoop@slave5:~/nutch$ bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
CrawlDb update: starting at 2013-07-14 17:33:55
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20130714171917]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-07-14 17:33:56, elapsed: 00:00:01
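A minimal sketch of repeating the generate/fetch/parse/updatedb cycle in a loop, using the same commands and crawl/ layout as above (the three rounds are arbitrary):

for i in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments
  SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`   # pick the newest segment, as in step 4
  bin/nutch fetch $SEGMENT -noParsing
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
done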
8. Invert the links, so that incoming anchor text can be indexed together with the pages

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
hadoop@slave5:~/nutch$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting at 2013-07-14 17:37:43
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/home/hadoop/apache-nutch-1.7/crawl/segments/20130714171917
LinkDb: finished at 2013-07-14 17:37:44, elapsed: 00:00:01
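If you want to inspect the inverted links, the readlinkdb tool can dump the linkdb as plain text (the output directory linkdb-dump is just a placeholder name):

bin/nutch readlinkdb crawl/linkdb -dump linkdb-dump   # writes one text record per URL with its inlinks and anchor text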
9. Index the contents of all segments into Solr
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
Now all of the content crawled by Nutch has been indexed by Solr, and you can run queries through the Solr Admin interface:
http://127.0.0.1:8983/solr/admin
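Queries can also be issued directly over HTTP with curl against the same Solr instance; a minimal sketch (the query term and the default /solr/select handler are assumptions about this particular Solr setup):

curl "http://127.0.0.1:8983/solr/select?q=qq&wt=json"   # search the index for "qq" and return JSON results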