1. Environment Information
Hardware: virtual machine
OS: CentOS 6.4, 64-bit
IP: 10.51.121.10
Hostname: datanode-4
Installation user: root
Nutch: Nutch 2.3, installed at /root/nutch/apache-nutch-2.3
HBase: HBase 0.94.14, installed at /root/hadoop/hbase-0.94.14
Solr: Solr 4.10.3, installed at /root/nutch/solr-4.10.3
For the single-machine installation and integration of Nutch 2.3 + HBase 0.94 + Solr 4.10.3, see: http://blog.csdn.net/freedomboy319/article/details/44172277
2. Using the crawl Command
Go to the /root/nutch/apache-nutch-2.3/runtime/local directory and run the crawl script without arguments to see its usage:
# ./bin/crawl
Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
<seedDir>: directory containing the seed file(s); see the sketch after this list
<crawlID>: ID of the crawl job
<solrUrl>: URL of the Solr server used for indexing and search
<numberOfRounds>: number of iterations, i.e. the crawl depth
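The seed directory simply holds one or more plain-text files with one URL per line. A minimal way to prepare one (the directory name myUrls/ and the seed URL are only examples):
# mkdir -p myUrls
# echo "http://nutch.apache.org/" > myUrls/seed.txt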
The core of the crawl script is as follows:
commonOptions="-D mapred.reduce.tasks=$numTasks -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true"
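These Hadoop options mean, in order: mapred.reduce.tasks sets the number of reduce tasks per job; mapred.child.java.opts caps each task JVM at a 1000 MB heap; the two speculative.execution flags disable speculative task attempts (so the same URL batch is not fetched twice by duplicate attempts); mapred.compress.map.output compresses intermediate map output.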
1) Initial injection:
$bin/nutch inject "$SEEDDIR" -crawlId "$CRAWL_ID"
2) Loop over generate - fetch - parse - updatedb - index; the number of iterations is set by the numberOfRounds parameter:
1)$bin/nutch generate $commonOptions -topN $sizeFetchlist -noNorm -noFilter -adddays $addDays -crawlId "$CRAWL_ID" -batchId $batchId
2)$bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $batchId -crawlId "$CRAWL_ID" -threads 50
3)$bin/nutch parse $commonOptions $skipRecordsOptions $batchId -crawlId "$CRAWL_ID"
4)$bin/nutch updatedb $commonOptions $batchId -crawlId "$CRAWL_ID"
5) If the solrUrl parameter is non-empty, also run:
$bin/nutch index $commonOptions -D solr.server.url=$SOLRURL -all -crawlId "$CRAWL_ID"
$bin/nutch solrdedup $commonOptions $SOLRURL
If the solrUrl parameter is empty, execution ends here.
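Putting the pieces together, the loop body of the crawl script looks roughly like this (a simplified sketch; variable names follow the script shown above, with logging and error handling omitted, and the batchId generation taken from the stock 2.x script):
for ((a=1; a <= LIMIT; a++))
do
  # a unique batch id per round, so each job only touches this round's records
  batchId=`date +%s`-$RANDOM
  $bin/nutch generate $commonOptions -topN $sizeFetchlist -noNorm -noFilter -adddays $addDays -crawlId "$CRAWL_ID" -batchId $batchId
  $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $batchId -crawlId "$CRAWL_ID" -threads 50
  $bin/nutch parse $commonOptions $skipRecordsOptions $batchId -crawlId "$CRAWL_ID"
  $bin/nutch updatedb $commonOptions $batchId -crawlId "$CRAWL_ID"
  if [ -n "$SOLRURL" ]; then
    $bin/nutch index $commonOptions -D solr.server.url=$SOLRURL -all -crawlId "$CRAWL_ID"
    $bin/nutch solrdedup $commonOptions $SOLRURL
  fi
done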
The classes involved in the crawl command are:
(1)org.apache.nutch.crawl.InjectorJob
(2)org.apache.nutch.crawl.GeneratorJob
(3)org.apache.nutch.fetcher.FetcherJob
(4)org.apache.nutch.parse.ParserJob
(5)org.apache.nutch.crawl.DbUpdaterJob
(6)org.apache.nutch.indexer.IndexingJob
(7)org.apache.nutch.indexer.solr.SolrDeleteDuplicates
3. The nutch Command
Go to the /root/nutch/apache-nutch-2.3/runtime/local directory and run the nutch script without arguments:
# ./bin/nutch
Usage: nutch COMMAND
where COMMAND is one of:
inject inject new urls into the database
hostinject creates or updates an existing host table from a text file
generate generate new batches to fetch from crawl db
fetch fetch URLs marked during generate
parse parse URLs marked during fetch
updatedb update web table after parsing
updatehostdb update host table after parsing
readdb read/dump records from page database
readhostdb display entries from the hostDB
index run the plugin-based indexer on parsed batches
elasticindex run the elasticsearch indexer - DEPRECATED use the index command instead
solrindex run the solr indexer on parsed batches - DEPRECATED use the index command instead
solrdedup remove duplicates from solr
solrclean remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
clean remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
plugin load a plugin and run one of its classes main()
nutchserver run a (local) Nutch server on a user defined port
webapp run a local Nutch web application
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
Run any command without parameters to see its detailed usage, for example:
# ./bin/nutch inject
Usage: InjectorJob <url_dir> [-crawlId <id>]
Each command maps to a class as follows:
inject=org.apache.nutch.crawl.InjectorJob
hostinject=org.apache.nutch.host.HostInjectorJob
generate=org.apache.nutch.crawl.GeneratorJob
fetch=org.apache.nutch.fetcher.FetcherJob
parse=org.apache.nutch.parse.ParserJob
updatedb=org.apache.nutch.crawl.DbUpdaterJob
updatehostdb=org.apache.nutch.host.HostDbUpdateJob
readdb=org.apache.nutch.crawl.WebTableReader
readhostdb=org.apache.nutch.host.HostDbReader
elasticindex=org.apache.nutch.indexer.elastic.ElasticIndexerJob
solrindex="org.apache.nutch.indexer.IndexingJob -D solr.server.url=$1"
index=org.apache.nutch.indexer.IndexingJob
solrdedup=org.apache.nutch.indexer.solr.SolrDeleteDuplicates
solrclean="org.apache.nutch.indexer.CleaningJob -D solr.server.url=$2 $1"
clean=org.apache.nutch.indexer.CleaningJob
parsechecker=org.apache.nutch.parse.ParserChecker
indexchecker=org.apache.nutch.indexer.IndexingFiltersChecker
plugin=org.apache.nutch.plugin.PluginRepository
webapp=org.apache.nutch.webui.NutchUiServer
nutchserver=org.apache.nutch.api.NutchServer
junit=org.junit.runner.JUnitCore
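Inside the nutch script this mapping is plain shell dispatch: the first argument selects a Java class and the remaining arguments are passed through. A simplified sketch (the real script uses an if/elif chain and first assembles the classpath and JVM options):
COMMAND=$1
shift
case "$COMMAND" in
  inject)   CLASS=org.apache.nutch.crawl.InjectorJob ;;
  generate) CLASS=org.apache.nutch.crawl.GeneratorJob ;;
  fetch)    CLASS=org.apache.nutch.fetcher.FetcherJob ;;
  # ...the remaining commands map exactly as listed above...
  *)        CLASS="$COMMAND" ;;   # anything else is run as a CLASSNAME
esac
exec "$JAVA" $JAVA_HEAP_MAX -classpath "$CLASSPATH" "$CLASS" "$@"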
4. Running the crawl Command
Go to the /root/nutch/apache-nutch-2.3/runtime/local directory and run:
# ./bin/crawl ./myUrls/ mycrawl1 http://localhost:8983/solr/ 2
This command is roughly equivalent to the following sequence:
# ./bin/nutch inject ./myUrls/ -crawlId mycrawl1
# ./bin/nutch generate -topN 5 -crawlId mycrawl1
# ./bin/nutch fetch -all -crawlId mycrawl1 -threads 5
# ./bin/nutch parse -all -crawlId mycrawl1
# ./bin/nutch updatedb -all -crawlId mycrawl1
# ./bin/nutch solrindex http://localhost:8983/solr/ -all -crawlId mycrawl1
# ./bin/nutch solrdedup http://localhost:8983/solr
You can execute the steps above one at a time and check the mycrawl1_webpage table in HBase as you go.
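Besides the HBase shell, the web table can be summarized with readdb, for example (assuming the -crawlId and -stats options shown in the WebTableReader usage printed by ./bin/nutch readdb):
# ./bin/nutch readdb -crawlId mycrawl1 -stats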
Go to the HBase installation directory and start the HBase shell:
# ./bin/hbase shell
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 0.94.14, r1543222, Mon Nov 18 23:23:33 UTC 2013
hbase(main):001:0> list
TABLE
mycrawl1_webpage
1 row(s) in 0.5500 seconds
hbase(main):002:0> describe 'mycrawl1_webpage'
The describe output (heavily line-wrapped in the shell) shows that the table is ENABLED and has eight column families, 'f', 'h', 'il', 'mk', 'mtdt', 'ol', 'p' and 's', all created with identical settings:
{DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}
1 row(s) in 0.1820 seconds
hbase(main):003:0> scan 'mycrawl1_webpage'