1. What exactly is stored in each folder and file under Nutch's data storage directory?
crawl one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)
readdb read / dump crawl db
mergedb merge crawldb-s, with optional filtering
readlinkdb read / dump link db
inject inject new urls into the database
generate generate new segments to fetch from crawl db
freegen generate new segments to fetch from text files
fetch fetch a segment's pages
parse parse a segment's pages
readseg read / dump segment data
mergesegs merge several segments, with optional filtering and slicing
updatedb update crawl db from segments after fetching
invertlinks create a linkdb from parsed segments
mergelinkdb merge linkdb-s, with optional filtering
solrindex run the solr indexer on parsed segments and linkdb
solrdedup remove duplicates from solr
solrclean remove HTTP 301 and 404 documents from solr
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
domainstats calculate domain statistics from crawldb
webgraph generate a web graph from existing segments
linkrank run a link analysis program on the generated web graph
scoreupdater updates the crawldb with linkrank scores
nodedumper dumps the web graph's node scores
plugin load a plugin and run one of its classes main()
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
This time we will use the readdb, readseg, and readlinkdb commands to inspect the contents of these directories.
crawldb
bin/nutch | grep read
bin/nutch readdb data/crawldb -stats
bin/nutch readdb data/crawldb -dump data/crawldb/crawldb_dump
bin/nutch readdb data/crawldb -url http://4008209999.tianyaclub.com/
bin/nutch readdb data/crawldb -topN 10 data/crawldb/crawldb_topN
bin/nutch readdb data/crawldb -topN 10 data/crawldb/crawldb_topN_m 1
segments
crawl_generate:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nocontent -nofetch -noparse -noparsedata -noparsetext
crawl_fetch:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nocontent -nogenerate -noparse -noparsedata -noparsetext
content:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nofetch -nogenerate -noparse -noparsedata -noparsetext
crawl_parse:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nofetch -nogenerate -nocontent -noparsedata -noparsetext
parse_data:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nofetch -nogenerate -nocontent -noparse -noparsetext
parse_text:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nofetch -nogenerate -nocontent -noparse -noparsedata
All:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump
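The dump commands above all follow one pattern: to see a single segment subdirectory, pass the -no* flag for each of the other five. A minimal sketch of that pattern as a shell helper follows; readseg_only_flags is a hypothetical name for illustration and is not part of Nutch.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): print the readseg -no* flags that
# suppress every segment subdirectory except the one named in $1.
readseg_only_flags() {
    keep="$1"
    flags=""
    found=0
    # "subdirectory:flag that suppresses it", taken from the commands above.
    for pair in crawl_generate:-nogenerate crawl_fetch:-nofetch \
                content:-nocontent crawl_parse:-noparse \
                parse_data:-noparsedata parse_text:-noparsetext; do
        dir="${pair%%:*}"    # text before the first colon
        flag="${pair#*:}"    # text after the first colon
        if [ "$dir" = "$keep" ]; then
            found=1
        else
            flags="$flags $flag"
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "unknown segment subdirectory: $keep" >&2
        return 1
    fi
    # Drop the leading space before printing.
    printf '%s\n' "${flags# }"
}

# Example use (segment path taken from the notes above):
#   bin/nutch readseg -dump data/segments/20130325042858 \
#       data/segments/20130325042858_dump $(readseg_only_flags content)
```

The command substitution is left unquoted on purpose so that the shell splits the flags into separate arguments.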
segments
bin/nutch readseg -list -dir data/segments
bin/nutch readseg -list data/segments/20130325043023
bin/nutch readseg -get data/segments/20130325042858 http://blog.tianya.cn/
linkdb
bin/nutch readlinkdb data/linkdb -url http://4008209999.tianyaclub.com/
bin/nutch readlinkdb data/linkdb -dump data/linkdb_dump
2. Implementing the Nutch crawl workflow with commands
Step 1: inject (load the seed URLs)
bin/nutch inject
Usage: Injector <crawldb> <url_dir>
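The first argument is the directory where the crawldb will be created; the second is the directory containing the initial seed URLs.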
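Step 2: generate (build the fetch list)
Step 3: fetch (fetch the pages)
Step 4: parse (parse the fetched results)
Step 5: updatedb (update the crawldb with the results)
To run multiple crawl rounds, repeat steps 2-5.
Finally, once crawling is finished, run invertlinks to generate the linkdb.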
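The steps above can be sketched as a small shell script. This is only an illustration under stated assumptions: the function names latest_segment and crawl_rounds, the NUTCH override variable, and the directory layout are hypothetical, and the script assumes Nutch 1.x, where segment directories are named with a 14-digit timestamp.

```shell
#!/bin/sh
# Command used to invoke Nutch; set NUTCH=echo for a dry run that only
# prints the commands (an assumption of this sketch, not a Nutch feature).
NUTCH="${NUTCH:-bin/nutch}"

# Print the newest segment under $1. Only 14-digit timestamp names are
# considered, so dump directories such as 20130325042858_dump are skipped.
latest_segment() {
    name=$(ls "$1" 2>/dev/null | grep '^[0-9]\{14\}$' | sort | tail -n 1)
    if [ -n "$name" ]; then
        printf '%s\n' "$1/$name"
    fi
}

# crawl_rounds <crawldb> <segments_dir> <linkdb> <seed_url_dir> <rounds>
crawl_rounds() {
    crawldb="$1"; segments="$2"; linkdb="$3"; seeds="$4"; rounds="$5"

    # Step 1: inject the seed URLs into the crawldb.
    $NUTCH inject "$crawldb" "$seeds" || return 1

    round=1
    while [ "$round" -le "$rounds" ]; do
        # Step 2: generate a new segment holding the fetch list.
        $NUTCH generate "$crawldb" "$segments" || return 1
        seg=$(latest_segment "$segments")
        if [ -z "$seg" ]; then
            echo "no segment found in $segments" >&2
            return 1
        fi
        # Steps 3-5: fetch, parse, then update the crawldb from the segment.
        $NUTCH fetch "$seg" || return 1
        $NUTCH parse "$seg" || return 1
        $NUTCH updatedb "$crawldb" "$seg" || return 1
        round=$((round + 1))
    done

    # Final step: build the linkdb from all segments. If dump directories
    # sit next to the segments, pass the segment paths explicitly instead.
    $NUTCH invertlinks "$linkdb" -dir "$segments"
}

# Example use (paths are assumptions):
#   crawl_rounds data/crawldb data/segments data/linkdb urls 3
```

Stopping at the first failed command keeps a broken round from updating the crawldb with an incomplete segment.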