Creating files on the Hadoop cluster
[nutch@gc01vm13 /]$ cd ./home/nutch/nutchinstall/nutch-1.0/
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -ls
Found 1 items
drwxr-xr-x - nutch supergroup 0 2010-06-09 20:10 /user/nutch/zklin
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -mkdir crawldatatest    // the path given to -mkdir is relative to /user/nutch
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -ls
Found 2 items
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:40 /user/nutch/crawldatatest
drwxr-xr-x - nutch supergroup 0 2010-06-09 20:10 /user/nutch/zklin
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -mkdir urls
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -ls
Found 3 items
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:40 /user/nutch/crawldatatest
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:45 /user/nutch/urls
drwxr-xr-x - nutch supergroup 0 2010-06-09 20:10 /user/nutch/zklin
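As a quick sanity check of how relative HDFS paths resolve, the two commands below should list the same directory; this is a minimal illustration, not part of the original session:

bin/hadoop fs -ls crawldatatest                 // relative path, resolved against /user/nutch
bin/hadoop fs -ls /user/nutch/crawldatatest     // equivalent absolute path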
[nutch@gc01vm13 nutchinstall]$ mkdir urls    // first create a urls directory on the local filesystem
[nutch@gc01vm13 nutchinstall]$ cd ./urls/
[nutch@gc01vm13 urls]$ ls
[nutch@gc01vm13 urls]$ vim urls1    // write the seed (entry) URLs into urls1
[nutch@gc01vm13 urls]$ cd ..
[nutch@gc01vm13 nutchinstall]$ cd ./nutch-1.0/
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -copyFromLocal /home/nutch/nutchinstall/urls/urls1 urls    // copy from local to the cluster; the cluster-side path is relative
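The actual contents of urls1 are not shown above; it is simply a text file with one entry URL per line, which can later be checked on the cluster with bin/hadoop fs -cat urls/urls1. A hypothetical example (the URL below is a placeholder, not the one actually used):

http://lucene.apache.org/nutch/    // one seed URL per line

Note that the seed URLs also have to pass the filter patterns in conf/crawl-urlfilter.txt, otherwise the Generator will have nothing to select.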
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -lsr    // verify urls1 on the cluster
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:40 /user/nutch/crawldatatest
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:46 /user/nutch/urls
-rw-r--r-- 2 nutch supergroup 31 2010-06-11 00:46 /user/nutch/urls/urls1
drwxr-xr-x - nutch supergroup 0 2010-06-09 20:10 /user/nutch/zklin
[nutch@gc01vm13 nutch-1.0]$ bin/nutch crawl urls1 -dir crawldatatest -depth 3 -topN 10
crawl started in: crawldatatest
rootUrlDir = urls1
threads = 10
depth = 3
topN = 10
Injector: starting
Injector: crawlDb: crawldatatest/crawldb
Injector: urlDir: urls1
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://gc01vm13:9000/user/nutch/urls1
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:160)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:113)
The crawl fails because the relative path urls1 resolves to /user/nutch/urls1, which does not exist: the seed file was copied into the urls directory. Passing the full HDFS path (or the correct relative path urls/urls1) avoids the error.
Log analysis
[nutch@gc01vm13 nutch-1.0]$ bin/nutch crawl /user/nutch/urls/urls1 -dir crawldatatest -depth 3 -topN 10
// crawldatatest is where the crawl output is stored; a relative path, consistent with the searcher.dir property in nutch-site.xml
crawl started in: crawldatatest    // the directory the crawl data is written to
rootUrlDir = /user/nutch/urls/urls1    // the file (or directory) listing the URLs to be downloaded
threads = 10
depth = 3
topN = 10
Injector: starting    // inject the download list
Injector: crawlDb: crawldatatest/crawldb
Injector: urlDir: /user/nutch/urls/urls1
Injector: Converting injected urls to crawl db entries.    // build the database of URLs to be downloaded from the injected list
Injector: Merging injected urls into crawl db.    // merge them into the existing crawl db
Injector: done
Generator: Selecting best-scoring urls due for fetch.    // rank pages by importance to decide the download order
Generator: starting
Generator: segment: crawldatatest/segments/20100611004927    // create the segment that will hold this round's fetch results
Generator: filtering: true
Generator: topN: 10
Generator: Partitioning selected urls by host, for politeness.    // the fetch list is partitioned by host across the datanodes defined in the Hadoop slaves file
Generator: done.
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: crawldatatest/segments/20100611004927    // download the selected pages into this segment
Fetcher: done
CrawlDb update: starting    // after fetching, update the crawl db with newly discovered URLs
CrawlDb update: db: crawldatatest/crawldb
CrawlDb update: segments: [crawldatatest/segments/20100611004927]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
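At this point the state of the crawl db can be inspected with Nutch's readdb tool; a minimal sketch, assuming the same relative paths as above:

bin/nutch readdb crawldatatest/crawldb -stats    // prints the number of URLs per status (fetched, unfetched, ...)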
// the download loop repeats: second round
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawldatatest/segments/20100611005051
Generator: filtering: true
Generator: topN: 10
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: crawldatatest/segments/20100611005051
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawldatatest/crawldb
CrawlDb update: segments: [crawldatatest/segments/20100611005051]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
// third round of the download loop
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawldatatest/segments/20100611005212
Generator: filtering: true
Generator: topN: 10
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: crawldatatest/segments/20100611005212
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawldatatest/crawldb
CrawlDb update: segments: [crawldatatest/segments/20100611005212]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
// The loop runs depth times in total. Nutch's intranet (local) crawl mode uses a breadth-first strategy: only after all of the second-level pages have been fetched does it move on to the third level.
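For reference, one pass of this loop corresponds roughly to the following individual Nutch commands (a hedged sketch; <segment> stands for the segment directory created by the generate step):

bin/nutch inject crawldatatest/crawldb urls                                  // run once, before the loop
bin/nutch generate crawldatatest/crawldb crawldatatest/segments -topN 10    // pick the next fetch list
bin/nutch fetch crawldatatest/segments/<segment>                             // download it
bin/nutch updatedb crawldatatest/crawldb crawldatatest/segments/<segment>    // feed new links back into the crawl db

After the last pass, the crawl command continues with the LinkDb and Indexer steps shown below.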
LinkDb: starting    // analyze the link relationships between the fetched pages
LinkDb: linkdb: crawldatatest/linkdb
LinkDb: URL normalize: true    // normalization
LinkDb: URL filter: true    // filtered according to crawl-urlfilter.txt
LinkDb: adding segment: hdfs://gc01vm13:9000/user/nutch/crawldatatest/segments/20100611004927
LinkDb: adding segment: hdfs://gc01vm13:9000/user/nutch/crawldatatest/segments/20100611005051
LinkDb: adding segment: hdfs://gc01vm13:9000/user/nutch/crawldatatest/segments/20100611005212
LinkDb: done    // link analysis finished
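The inlink data collected here can be dumped as text with the readlinkdb tool; a minimal sketch, where linkdump is an arbitrary output directory on HDFS:

bin/nutch readlinkdb crawldatatest/linkdb -dump linkdump    // write URL -> inlinks as text under /user/nutch/linkdump
bin/hadoop fs -cat linkdump/part-00000 | head               // peek at the first few records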
Indexer: starting    // start building the index
Indexer: done
Dedup: starting    // de-duplicate pages
Dedup: adding indexes in: crawldatatest/indexes
Dedup: done
merging indexes to: crawldatatest/index    // merge the per-run indexes
Adding hdfs://gc01vm13:9000/user/nutch/crawldatatest/indexes/part-00000
done merging
crawl finished: crawldatatest    // finished
[nutch@gc01vm13 nutch-1.0]$
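Once the crawl has finished, the merged index can be queried from the command line with NutchBean, assuming searcher.dir in nutch-site.xml points at the crawldatatest directory; the query term below is just an example:

bin/nutch org.apache.nutch.searcher.NutchBean hadoop    // prints the total number of hits and the top results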
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -copyToLocal crawldatatest /home/nutch/nutchinstall/
[nutch@gc01vm13 nutch-1.0]$ cd ..
[nutch@gc01vm13 nutchinstall]$ ls
confbak crawldatatest filesystem hadoopscheduler hadooptmp nutch-1.0 urls
Note that the source passed to copyToLocal is a relative HDFS path (resolved against /user/nutch), while the destination is a local path.
Analysis of the generated data
bin/nutch crawl /user/nutch/urls/urls1 -dir crawldatatest -depth 3 -topN 10
The crawl builds its output directories under the cluster user's home directory (/user/nutch):
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -ls
Found 3 items
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:55 /user/nutch/crawldatatest
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:46 /user/nutch/urls
drwxr-xr-x - nutch supergroup 0 2010-06-09 20:10 /user/nutch/zklin
Five directories are created under crawldatatest: crawldb, segments, index, indexes, and linkdb.
[nutch@gc01vm13 nutch-1.0]$ bin/hadoop fs -ls /user/nutch/crawldatatest
Found 5 items
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:53 /user/nutch/crawldatatest/crawldb
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:55 /user/nutch/crawldatatest/index
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:54 /user/nutch/crawldatatest/indexes
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:53 /user/nutch/crawldatatest/linkdb
drwxr-xr-x - nutch supergroup 0 2010-06-11 00:52 /user/nutch/crawldatatest/segments
1) The crawldb directory stores the URLs to be downloaded along with their fetch dates, which are used to decide when a page is due for a re-fetch check.
2) The linkdb directory stores the link relationships between URLs; it is created by the link analysis step after fetching completes, and these relationships can support PageRank-style scoring similar to Google's.
3) The segments directory stores the fetched pages. The number of segment subdirectories matches the crawl depth; since -depth was set to 3, there are three of them here. Each segment contains six subdirectories (see the readseg sketch after this list):
content: the raw content of the downloaded pages;
crawl_fetch: the fetch status of each URL;
crawl_generate: the set of URLs scheduled for fetching, produced when the generate job runs and refined during the download;
crawl_parse: the outlink data used to update the crawldb;
parse_data: the outlinks and metadata parsed from each URL;
parse_text: the extracted text of each parsed URL;
4) The index directory holds a Lucene-format index, the complete result of merging everything under indexes. A quick look shows that the file names here differ from those produced by a plain Lucene demo; this still needs further investigation.
5) The indexes directory holds the partial index produced by each run (the part-00000 directory).
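To look at what actually landed in a segment, Nutch's readseg tool can dump its contents as text; a minimal sketch, where the segment timestamp is taken from the log above and segdump is an arbitrary HDFS output directory:

bin/nutch readseg -dump crawldatatest/segments/20100611004927 segdump    // dumps content, fetch status, parse data and parse text as plain text
bin/hadoop fs -cat segdump/dump | head                                   // the dumped text should end up in a file named dump under segdump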