Before following this tutorial, make sure you have:
1) A Linux machine or a Linux virtual machine
2) A JDK installed (1.7 recommended)
3) Apache Ant installed
Nutch 1.9 is recommended; official download: http://mirrors.hust.edu.cn/apache/nutch/1.9/apache-nutch-1.9-src.zip
Unpack the source, then run the following command in the apache-nutch-1.9 directory to generate an Eclipse project:
ant eclipse -verbose
Then wait patiently: during this step Ant uses Ivy to download all the dependency jars from the central repository, which can take ten minutes or more.
Import the generated project into Eclipse. Create a seed directory /tmp/urls and put the seed URL in it:

http://www.cnbeta.com/

Then run the Injector, passing the crawldb directory and the seed directory as its arguments:

args=new String[]{"/tmp/crawldb","/tmp/urls"};
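If you prefer not to edit the run configuration by hand, a minimal sketch of driving the Injector from a tiny wrapper class of your own would look like this (assumptions: the class name InjectorRunner is mine, not part of Nutch, and the two paths are the ones used above):

import org.apache.nutch.crawl.Injector;

// Hypothetical wrapper, not part of Nutch: launches the Injector with the
// crawldb directory and the seed directory used in this tutorial.
public class InjectorRunner {
    public static void main(String[] args) throws Exception {
        Injector.main(new String[]{"/tmp/crawldb", "/tmp/urls"});
    }
}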
Running the Injector at this point fails with the following exception:

Caused by: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
    at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:123)
    at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:84)
    ... 23 more
The cause is that Nutch cannot find its plugins: in nutch-default.xml the plugin directory is given as a relative path:

<property>
  <name>plugin.folders</name>
  <value>plugins</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>
Change the value to an absolute path: the apache-nutch-1.9 directory plus "/src/plugin". For example, my configuration:
<property>
  <name>plugin.folders</name>
  <value>/home/hu/apache/apache-nutch-1.9/src/plugin</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>
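To confirm the new value is actually picked up, a small check might look like this (assumptions: the class PluginFoldersCheck is mine, not part of Nutch, and plugin.folders holds a single absolute path as configured above):

import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical check, not part of Nutch: print the configured plugin
// directory and whether it exists on disk.
public class PluginFoldersCheck {
    public static void main(String[] args) {
        Configuration conf = NutchConfiguration.create();
        String folders = conf.get("plugin.folders");
        System.out.println("plugin.folders = " + folders);
        System.out.println("directory exists: " + new File(folders).isDirectory());
    }
}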
With plugin.folders fixed, the Injector now runs successfully. Open the data file it produced:

vim /tmp/crawldb/current/part-00000/data

It is a Hadoop SequenceFile, which logically holds a sequence of records like:
key0 value0
key1 value1
key2 value2
......
keyn valuen
A SequenceFile stores a sequence of objects as key/value pairs, and the key and value types can be read from the SequenceFile header. The following small program reads the crawldb with those types:
package org.apache.nutch.example;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

import java.io.IOException;

/**
 * Created by hu on 15-2-9.
 */
public class InjectorReader {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path dataPath = new Path("/tmp/crawldb/current/part-00000/data");
        FileSystem fs = dataPath.getFileSystem(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, dataPath, conf);
        // In the crawldb the key is the URL (Text) and the value is a CrawlDatum.
        Text key = new Text();
        CrawlDatum value = new CrawlDatum();
        while (reader.next(key, value)) {
            System.out.println("key:" + key);
            System.out.println("value:" + value);
        }
        reader.close();
    }
}
key:http://www.cnbeta.com/
value:Version: 7
Status: 1 (db_unfetched)
Fetch time: Mon Feb 09 13:20:36 CST 2015
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
_maxdepth_=1000
_depth_=1
Status: 1 (db_unfetched)
This status means the URL has not been fetched yet; in the later stages of the crawl, unfetched URLs are taken from the crawldb and fetched.
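As a small variation on InjectorReader (assumption: reusing the reader, key and value variables from the loop above), you could print only the entries that are still waiting to be fetched:

// Minimal sketch, reusing the read loop from InjectorReader: print only
// URLs whose CrawlDatum is still in the db_unfetched state.
while (reader.next(key, value)) {
    if (value.getStatus() == CrawlDatum.STATUS_DB_UNFETCHED) {
        System.out.println("unfetched: " + key);
    }
}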
The next step is fetching, and the fetcher refuses to run while http.agent.name is empty. The default in nutch-default.xml is:

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>
Set it to a single identifying word, for example:

<property>
  <name>http.agent.name</name>
  <value>test</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>
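To double-check which value the crawler will actually see, a small sketch can read it back through the configuration (assumptions: the class AgentNameCheck is mine, not part of Nutch, and it is run from the same project so nutch-default.xml and nutch-site.xml are on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical check, not part of Nutch: print the User-Agent name the
// fetcher will read from the configuration.
public class AgentNameCheck {
    public static void main(String[] args) {
        Configuration conf = NutchConfiguration.create();
        System.out.println("http.agent.name = " + conf.get("http.agent.name"));
    }
}

With the agent name set, the whole crawl can be driven from the Crawl class below, which hard-codes the seed directory, the output directory, the depth and topN, and loops over generate / fetch / parse / updatedb: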
package org.apache.nutch.crawl;
import java.util.*;
import java.text.*;
// Commons Logging imports
import org.apache.commons.lang.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.indexer.IndexingJob;
//import org.apache.nutch.indexer.solr.SolrDeleteDuplicates;
import org.apache.nutch.util.HadoopFSUtil;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchJob;
import org.apache.nutch.fetcher.Fetcher;
public class Crawl extends Configured implements Tool {
  public static final Logger LOG = LoggerFactory.getLogger(Crawl.class);

  private static String getDate() {
    return new SimpleDateFormat("yyyyMMddHHmmss").format
      (new Date(System.currentTimeMillis()));
  }

  /* Perform complete crawling and indexing (to Solr) given a set of root urls and the -solr
     parameter respectively. More information and Usage parameters can be found below. */
  public static void main(String args[]) throws Exception {
    Configuration conf = NutchConfiguration.create();
    int res = ToolRunner.run(conf, new Crawl(), args);
    System.exit(res);
  }

  @Override
  public int run(String[] args) throws Exception {
    /* directory containing the seed URLs */
    Path rootUrlDir = new Path("/tmp/urls");
    /* directory where the crawl data is stored */
    Path dir = new Path("/tmp", "crawl-" + getDate());
    int threads = 50;
    /* crawl depth of the breadth-first traversal, i.e. the number of levels in the traversal tree */
    int depth = 2;
    long topN = 10;

    JobConf job = new NutchJob(getConf());
    FileSystem fs = FileSystem.get(job);

    if (LOG.isInfoEnabled()) {
      LOG.info("crawl started in: " + dir);
      LOG.info("rootUrlDir = " + rootUrlDir);
      LOG.info("threads = " + threads);
      LOG.info("depth = " + depth);
      if (topN != Long.MAX_VALUE)
        LOG.info("topN = " + topN);
    }

    Path crawlDb = new Path(dir + "/crawldb");
    Path linkDb = new Path(dir + "/linkdb");
    Path segments = new Path(dir + "/segments");
    Path indexes = new Path(dir + "/indexes");
    Path index = new Path(dir + "/index");

    Path tmpDir = job.getLocalPath("crawl" + Path.SEPARATOR + getDate());
    Injector injector = new Injector(getConf());
    Generator generator = new Generator(getConf());
    Fetcher fetcher = new Fetcher(getConf());
    ParseSegment parseSegment = new ParseSegment(getConf());
    CrawlDb crawlDbTool = new CrawlDb(getConf());
    LinkDb linkDbTool = new LinkDb(getConf());

    // initialize crawlDb
    injector.inject(crawlDb, rootUrlDir);
    int i;
    for (i = 0; i < depth; i++) {             // generate new segment
      Path[] segs = generator.generate(crawlDb, segments, -1, topN, System
          .currentTimeMillis());
      if (segs == null) {
        LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
        break;
      }
      fetcher.fetch(segs[0], threads);  // fetch it
      if (!Fetcher.isParsing(job)) {
        parseSegment.parse(segs[0]);    // parse it, if needed
      }
      crawlDbTool.update(crawlDb, segs, true, true); // update crawldb
    }
    /*
    if (i > 0) {
      linkDbTool.invert(linkDb, segments, true, true, false); // invert links

      if (solrUrl != null) {
        // index, dedup & merge
        FileStatus[] fstats = fs.listStatus(segments, HadoopFSUtil.getPassDirectoriesFilter(fs));

        IndexingJob indexer = new IndexingJob(getConf());
        indexer.index(crawlDb, linkDb,
            Arrays.asList(HadoopFSUtil.getPaths(fstats)));

        SolrDeleteDuplicates dedup = new SolrDeleteDuplicates();
        dedup.setConf(getConf());
        dedup.dedup(solrUrl);
      }
    } else {
      LOG.warn("No URLs to fetch - check your seed list and URL filters.");
    }
    */
    if (LOG.isInfoEnabled()) { LOG.info("crawl finished: " + dir); }
    return 0;
  }
}
The run succeeds: it performs a 2-level crawl of the site, and all the crawl data is saved under the /tmp/crawl-<timestamp> directory. The end of the log looks like this:
2015-02-09 14:23:17,171 INFO crawl.CrawlDb (CrawlDb.java:update(115)) - CrawlDb update: finished at 2015-02-09 14:23:17, elapsed: 00:00:01
2015-02-09 14:23:17,171 INFO crawl.Crawl (Crawl.java:run(117)) - crawl finished: /tmp/crawl-20150209142212
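For a quick summary of what ended up in the new crawldb, one option is CrawlDbReader's -stats mode, the tool behind bin/nutch readdb (assumptions: the wrapper class CrawlDbStats is mine, not part of Nutch, and the path below comes from my run, so substitute your own timestamp):

import org.apache.nutch.crawl.CrawlDbReader;

// Hypothetical wrapper, not part of Nutch: print crawldb statistics for
// the directory produced by the crawl above.
public class CrawlDbStats {
    public static void main(String[] args) throws Exception {
        CrawlDbReader.main(new String[]{"/tmp/crawl-20150209142212/crawldb", "-stats"});
    }
}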
One more setting worth knowing about: by default, downloaded content is truncated at 65536 bytes, as set by http.content.limit in nutch-default.xml:

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
To disable truncation entirely, set it to -1:

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
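As a rough picture of how the limit behaves, here is a simplified sketch (assumptions: this re-states the configuration lookup in my own words, it is not the actual protocol plugin source, and the class ContentLimitCheck and the example page size are mine):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

// Simplified sketch, not the real protocol plugin code: read the limit and
// decide whether a page of 'contentLength' bytes would be truncated.
public class ContentLimitCheck {
    public static void main(String[] args) {
        Configuration conf = NutchConfiguration.create();
        int maxContent = conf.getInt("http.content.limit", 65536);
        long contentLength = 200000; // example page size in bytes
        boolean truncated = maxContent >= 0 && contentLength > maxContent;
        System.out.println("http.content.limit = " + maxContent
            + ", would truncate a " + contentLength + "-byte page: " + truncated);
    }
}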
// In Nutch's Fetcher (FetcherThread): URLs disallowed by robots.txt are
// reported as robots_denied, marked as gone, and skipped.
if (!rules.isAllowed(fit.u.toString())) {
  // unblock
  fetchQueues.finishFetchItem(fit, true);
  if (LOG.isDebugEnabled()) {
    LOG.debug("Denied by robots.txt: " + fit.url);
  }
  output(fit.url, fit.datum, null, ProtocolStatus.STATUS_ROBOTS_DENIED, CrawlDatum.STATUS_FETCH_GONE);
  reporter.incrCounter("FetcherStatus", "robots_denied", 1);
  continue;
}