Nutch is an open-source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler.
Set a password for root:
sudo passwd root
This command assigns a password to the root user.
Then run su root
to switch to the root user.
For example:
jiangzl@ubuntu:~$ sudo passwd root
[sudo] password for jiangzl:
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
jiangzl@ubuntu:~$ su root
Password:
root@ubuntu:/home/jiangzl#
Install subversion and check out the Nutch 1.6 source:
apt-get install subversion
svn co https://svn.apache.org/repos/asf/nutch/tags/release-1.6/
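If subversion is inconvenient, the release source tarball can also be fetched over plain HTTP; the archive URL below is an assumption based on the usual Apache archive layout, so verify it before relying on it.
wget http://archive.apache.org/dist/nutch/1.6/apache-nutch-1.6-src.tar.gz    # URL is an assumption, not taken from this guide
tar xzf apache-nutch-1.6-src.tar.gz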
ls ~/release-1.6/conf | grep nutch
nutch-conf.xml
nutch-default.xml    // the default Nutch configuration
nutch-site.xml    // overrides the default configuration
nutch-site.xml.template    // template for nutch-site.xml
Copy the template nutch-site.xml.template to create nutch-site.xml:
cp nutch-site.xml.template nutch-site.xml
2. Configure the agent name in nutch-site.xml
The format of the property can be copied from nutch-default.xml (the default configuration):
more -10 nutch-default.xml
/ http.agent.name    (search for the property inside more and copy the block out)
Add the http.agent.name setting:
root@localhost:/home/jiangzl/release-1.6/conf# vim nutch-site.xml
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0</value>
    <description>HTTP 'User-Agent' request header. MUST NOT be empty -
    please set this to a single word uniquely related to your organization.
    NOTE: You should also check other related properties:
      http.robots.agents
      http.agent.description
      http.agent.url
      http.agent.email
      http.agent.version
    and set their values appropriately.
    </description>
  </property>
</configuration>
cd release-1.6/    (enter the Nutch root directory)
apt-get install ant    (install the ant build tool)
ant    (build in the Nutch root directory; two new directories appear: build and runtime)
cd runtime    (contains two directories, deploy (Hadoop cluster mode) and local (local mode), the two ways Nutch can run)
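Note that ant needs a JDK with JAVA_HOME set, otherwise the build fails; a minimal sketch of that preparation (the package name and JVM path are assumptions for this Ubuntu setup, adjust them to whatever JDK you install):
apt-get install openjdk-7-jdk    # any recent JDK works for Nutch 1.6; package name is an assumption
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64    # path is an assumption, point it at your JDK
cd ~/release-1.6 && ant    # produces build/ and runtime/
ls runtime/    # expect to see: deploy  local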
So how are Nutch and Hadoop tied together?
Script location: ls runtime/deploy/bin    (the script is called nutch)
Job location: ls runtime/deploy    (the job file is apache-nutch-1.6.job)
The hadoop command inside the nutch script submits apache-nutch-1.6.job to Hadoop's JobTracker.
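Put differently, in deploy mode the nutch script is essentially a wrapper around hadoop jar; roughly what it ends up executing for a crawl looks like the sketch below (reconstructed for illustration, not copied from the script; in this mode urls and data would be HDFS paths):
cd runtime/deploy
hadoop jar apache-nutch-1.6.job org.apache.nutch.crawl.Crawl urls -dir data -threads 10 -depth 3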
root@localhost:/home/jiangzl/release-1.6/runtime/local# ls
bin conf lib logs plugins test
cd runtime/local/
mkdir urls    // create a directory for the seed URLs
vim urls/url    // put the seed URL(s) in it
(any site will do, e.g. http://blog.tianya.cn)
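The same two steps can be scripted in one go; a minimal sketch:
mkdir -p urls
echo 'http://blog.tianya.cn/' > urls/url    # one seed URL per line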
./bin/nutch    (run it with no arguments to see which commands it accepts; you can also read the script itself: vim bin/nutch)
./bin/nutch crawl    (if you don't know a command's arguments, run it bare and it prints its usage)
root@localhost:/home/jiangzl/release-1.6/runtime/local# bin/nutch crawl
Usage: Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]
<urlDir>    directory containing the file(s) with the seed URLs to crawl
-solr <solrURL>    send the crawled pages to Solr for indexing (optional)
[-dir d]    directory to store the crawl output; d can be a relative or absolute path
[-threads n]    number of fetcher threads, default 10
[-depth i]    crawl depth, i.e. how many link levels to follow
[-topN N]    fetch at most the N top-scoring URLs in each round
Run the crawl: bin/nutch crawl urls -dir data2 -threads 10 -depth 3
Run it in the background: nohup bin/nutch crawl urls -dir data -threads 10 -depth 3 &
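nohup writes its output to nohup.out in the current directory by default; to use an explicit log file instead, a variant like this works (the name crawl.log is just an example):
nohup bin/nutch crawl urls -dir data -threads 10 -depth 3 > crawl.log 2>&1 &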
cat runtime/local/nohup.out    (view the crawl output)
cat runtime/local/logs/hadoop.log    (view the detailed log of Nutch's Hadoop operations)
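To watch progress while the crawl is still running, tailing the Hadoop log works well; a small sketch (the 'fetching' pattern is an assumption about how the Fetcher logs each URL):
tail -f runtime/local/logs/hadoop.log    # follow the crawl live
grep -c fetching runtime/local/logs/hadoop.log    # rough count of fetch attempts so far; pattern is an assumption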
root@localhost:/home/jiangzl/release-1.6/runtime/local# ls data/ -l
crawldb    all URLs the crawler knows about, together with their fetch status
linkdb    the link database: for each URL, the inlinks that point to it
segments    the sets of pages fetched in each crawl round
root@localhost:/home/jiangzl/release-1.6/runtime/local# ls data/crawldb/
current    the latest version of the URL database, rewritten on each crawl round
old    the previous version; each round the old current becomes old
root@localhost:/home/jiangzl/release-1.6/runtime/local# ls data/crawldb/current/
part-00000
root@localhost:/home/jiangzl/release-1.6/runtime/local# ls data/crawldb/current/part-00000/
data    holds the URL records themselves
index    an index over the URLs, so the next read is fast
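Rather than poking at these files directly, the crawldb can be inspected with the readdb command that bin/nutch lists; a short sketch:
bin/nutch readdb data/crawldb -stats    # URL counts grouped by fetch status
bin/nutch readdb data/crawldb -dump crawldb_dump    # dump the entries as plain text into crawldb_dump/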
root@localhost:/home/jiangzl/release-1.6/runtime/local# ls data/segments/
20140512055159 20140512055242
root@localhost:/home/jiangzl/release-1.6/runtime/local# ls data/segments/20140512055159/
content    the raw content of every fetched page
crawl_fetch    the fetch status of every URL in this segment
crawl_generate    the set of URLs that was scheduled to be fetched in this round
crawl_parse    the outlinks extracted during parsing, later used to update the crawldb
parse_data    metadata and outlinks parsed from each page
parse_text    the plain text extracted from each page
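Segments can be examined the same way with readseg; a sketch, reusing the segment name from the listing above:
bin/nutch readseg -list -dir data/segments    # one summary line per segment
bin/nutch readseg -dump data/segments/20140512055159 segdump    # dump that segment's records into segdump/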
root@localhost:/home/jiangzl/release-1.6/runtime/local# cat nohup.out | grep elapsed
Injector: finished at 2014-05-12 05:51:51, elapsed: 00:00:18
Generator: finished at 2014-05-12 05:52:06, elapsed: 00:00:15
Fetcher: finished at 2014-05-12 05:52:13, elapsed: 00:00:07
ParseSegment: finished at 2014-05-12 05:52:20, elapsed: 00:00:07
CrawlDb update: finished at 2014-05-12 05:52:34, elapsed: 00:00:13
Generator: finished at 2014-05-12 05:52:49, elapsed: 00:00:15
Fetcher: finished at 2014-05-12 05:59:19, elapsed: 00:06:30
ParseSegment: finished at 2014-05-12 05:59:29, elapsed: 00:00:10
CrawlDb update: finished at 2014-05-12 05:59:42, elapsed: 00:00:13
LinkDb: finished at 2014-05-12 05:59:52, elapsed: 00:00:10
(Order: Injector → crawl loop, one round per depth level (Generator → Fetcher → ParseSegment → CrawlDb update) → LinkDb)
root@localhost:/home/jiangzl/release-1.6/runtime/local# cat nohup.out | grep elapsed | grep Injector
Injector: finished at 2014-05-12 05:51:51, elapsed: 00:00:18
root@localhost:/home/jiangzl/release-1.6/runtime/local# cat nohup.out | grep elapsed | grep Generator
Generator: finished at 2014-05-12 05:52:06, elapsed: 00:00:15
Generator: finished at 2014-05-12 05:52:49, elapsed: 00:00:15
root@localhost:/home/jiangzl/release-1.6/runtime/local# cat nohup.out | grep elapsed | grep Fetcher
Fetcher: finished at 2014-05-12 05:52:13, elapsed: 00:00:07
Fetcher: finished at 2014-05-12 05:59:19, elapsed: 00:06:30
root@localhost:/home/jiangzl/release-1.6/runtime/local# cat nohup.out | grep elapsed | grep ParseSegment
ParseSegment: finished at 2014-05-12 05:52:20, elapsed: 00:00:07
ParseSegment: finished at 2014-05-12 05:59:29, elapsed: 00:00:10
root@localhost:/home/jiangzl/release-1.6/runtime/local# cat nohup.out | grep elapsed | grep update
CrawlDb update: finished at 2014-05-12 05:52:34, elapsed: 00:00:13
CrawlDb update: finished at 2014-05-12 05:59:42, elapsed: 00:00:13
root@localhost:/home/jiangzl/release-1.6/runtime/local# cat nohup.out | grep elapsed | grep LinkDb
LinkDb: finished at 2014-05-12 05:59:52, elapsed: 00:00:10
Number of segments → the -depth i given at crawl time sets the number of crawl rounds (the crawl depth);
each segment corresponds to one round, i.e. one level of the crawl.
root@localhost:/home/jiangzl/release-1.6/runtime/local# bin/nutch
Usage: nutch COMMAND
where COMMAND is one of:
crawl one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)
readdb read / dump crawl db
mergedb merge crawldb-s, with optional filtering
readlinkdb read / dump link db
inject inject new urls into the database
generate generate new segments to fetch from crawl db
freegen generate new segments to fetch from text files
fetch fetch a segment's pages
parse parse a segment's pages
readseg read / dump segment data
mergesegs merge several segments, with optional filtering and slicing
updatedb update crawl db from segments after fetching
invertlinks create a linkdb from parsed segments
mergelinkdb merge linkdb-s, with optional filtering
solrindex run the solr indexer on parsed segments and linkdb
solrdedup remove duplicates from solr
solrclean remove HTTP 301 and 404 documents from solr
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
domainstats calculate domain statistics from crawldb
webgraph generate a web graph from existing segments
linkrank run a link analysis program on the generated web graph
scoreupdater updates the crawldb with linkrank scores
nodedumper dumps the web graph's node scores
plugin load a plugin and run one of its classes main()
junit runs the given JUnit test
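The deprecation note on crawl hints at how these commands fit together: the one-shot crawl used above simply chains them. A hedged sketch of one crawl round done by hand, reusing the urls/ and data/ directories from earlier:
bin/nutch inject data/crawldb urls    # seed the crawldb from urls/
bin/nutch generate data/crawldb data/segments -topN 1000    # create a segment of URLs to fetch
seg=$(ls -d data/segments/2* | tail -1)    # the segment just generated
bin/nutch fetch $seg
bin/nutch parse $seg
bin/nutch updatedb data/crawldb $seg    # fold newly discovered URLs back into the crawldb
bin/nutch invertlinks data/linkdb -dir data/segments    # build the linkdb from all segments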