As analyzed in the previous chapter, the command "bin/nutch fetch crawl/segments/*" ultimately invokes the main function of org.apache.nutch.fetcher.Fetcher.
public static void main(String[] args) throws Exception {
  int res = ToolRunner.run(NutchConfiguration.create(), new Fetcher(), args);
  System.exit(res);
}
ToolRunner's run function in turn calls Fetcher's run function.
Fetcher::run
public int run(String[] args) throws Exception {
  Path segment = new Path(args[0]);
  int threads = getConf().getInt("fetcher.threads.fetch", 10);
  for (int i = 1; i < args.length; i++) {
    if (args[i].equals("-threads")) {
      threads = Integer.parseInt(args[++i]);
    }
  }
  getConf().setInt("fetcher.threads.fetch", threads);
  fetch(segment, threads);
  return 0;
}
The run function reads the number of fetcher threads (threads), which defaults to 10 and can be overridden with the -threads option; segment is the path of a crawl/segments/2* directory taken from the first argument. Finally it calls the fetch function.
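As a usage sketch (the class name FetchDemo, the segment timestamp and the thread count below are made up for illustration), the following shows how the argument array maps to segment and threads:
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.util.NutchConfiguration;

public class FetchDemo {
  public static void main(String[] unused) throws Exception {
    // Equivalent to: bin/nutch fetch crawl/segments/20240101000000 -threads 20
    String[] args = { "crawl/segments/20240101000000", "-threads", "20" };
    // Inside run(): segment = crawl/segments/20240101000000, threads = 20.
    int res = ToolRunner.run(NutchConfiguration.create(), new Fetcher(), args);
    System.exit(res);
  }
}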
Fetcher::run->fetch
public void fetch(Path segment, int threads) throws IOException {
  checkConfiguration();
  JobConf job = new NutchJob(getConf());
  job.setJobName("fetch " + segment);
  job.setInt("fetcher.threads.fetch", threads);
  job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());
  job.setSpeculativeExecution(false);
  FileInputFormat.addInputPath(job, new Path(segment,
      CrawlDatum.GENERATE_DIR_NAME));
  job.setInputFormat(InputFormat.class);
  job.setMapRunnerClass(Fetcher.class);
  FileOutputFormat.setOutputPath(job, segment);
  job.setOutputFormat(FetcherOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(NutchWritable.class);
  JobClient.runJob(job);
}
checkConfiguration checks whether the http.agent.name property is set in the configuration and throws an exception if it is not. A Hadoop job is then created: its input is the crawl_generate directory under crawl/segments/2*, produced earlier by the generate command; the map runner is Fetcher itself (its run function below); the output goes back into the same crawl/segments/2* directory, and FetcherOutputFormat defines how the results are finally written.
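For reference, here is a minimal sketch of the kind of check checkConfiguration performs, assuming only what is stated above (that it validates http.agent.name and throws if it is missing); the exact message and exception type in Nutch may differ:
// Sketch only: the real Fetcher.checkConfiguration may differ in details.
private void checkConfiguration() {
  String agentName = getConf().get("http.agent.name");
  if (agentName == null || agentName.trim().length() == 0) {
    // Refuse to fetch without an agent name configured.
    throw new RuntimeException("No agents listed in 'http.agent.name' property.");
  }
}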
Fetcher::run
public void run(RecordReader input,
    OutputCollector output, Reporter reporter)
    throws IOException {
  ...
  feeder = new QueueFeeder(input, fetchQueues, threadCount
      * queueDepthMuliplier);
  feeder.start();
  for (int i = 0; i < threadCount; i++) {
    FetcherThread t = new FetcherThread(getConf(), getActiveThreads(), fetchQueues,
        feeder, spinWaiting, lastRequestStart, reporter, errors, segmentName,
        parsing, output, storingContent, pages, bytes);
    fetcherThreads.add(t);
    t.start();
  }
  ...
}
Fetcher's run function first creates the shared queue FetchItemQueues (fetchQueues), then creates a QueueFeeder (feeder) that reads url/CrawlDatum pairs from the crawl_generate directory under crawl/segments/2* and places them into the shared queue.
It then creates the FetcherThread instances and calls their start functions to begin fetching pages.
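This is a classic single-producer / multi-consumer arrangement. The standalone sketch below is not Nutch code (the class name, the queue capacity and the fake urls are invented for illustration); it merely shows the same shape on top of java.util.concurrent: one feeder thread fills a bounded queue while a pool of worker threads drains it.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumerSketch {
  public static void main(String[] args) {
    int threadCount = 10;                              // plays the role of fetcher.threads.fetch
    BlockingQueue<String> queue = new ArrayBlockingQueue<>(threadCount * 50);

    Thread feeder = new Thread(() -> {                 // plays the role of QueueFeeder
      for (int i = 0; i < 1000; i++) {
        try {
          queue.put("http://example.com/page" + i);    // blocks when the queue is full
        } catch (InterruptedException e) {
          return;
        }
      }
    });
    feeder.start();

    for (int i = 0; i < threadCount; i++) {            // play the role of FetcherThread
      new Thread(() -> {
        while (true) {
          try {
            String url = queue.take();                 // blocks when the queue is empty
            // a real FetcherThread would fetch the page and write the output here
          } catch (InterruptedException e) {
            return;
          }
        }
      }).start();
    }
  }
}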
Fetcher::run->QueueFeeder::run
public void run() {
  boolean hasMore = true;
  while (hasMore) {
    ...
    int feed = size - queues.getTotalSize();
    if (feed <= 0) {
      // queues are full - wait a bit before checking again
      try {
        Thread.sleep(1000);
      } catch (InterruptedException e) {
      }
      continue;
    } else {
      while (feed > 0 && hasMore) {
        Text url = new Text();
        CrawlDatum datum = new CrawlDatum();
        hasMore = reader.next(url, datum);
        if (hasMore) {
          queues.addFetchItem(url, datum);
          feed--;
        }
      }
    }
  }
}
The feed variable indicates how many free slots the shared queue FetchItemQueues still has for url/CrawlDatum pairs waiting to be fetched. If feed is less than or equal to 0, there is no free space and the feeder thread sleeps for a second before trying again; if feed is greater than 0, there is room, so the RecordReader (reader)'s next function reads url/CrawlDatum pairs one by one from the crawl_generate directory under crawl/segments/2*, and addFetchItem wraps each pair into a FetchItem and adds it to the shared queue.
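To put numbers on the feed calculation (assuming, purely for illustration, threadCount = 10 and queueDepthMuliplier = 50): size = 10 * 50 = 500, so the feeder tries to keep at most 500 items buffered; if queues.getTotalSize() currently returns 480, then feed = 500 - 480 = 20, and at most 20 new url/CrawlDatum pairs are read before the capacity is checked again.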
Fetcher::run->FetcherThread::run
public void run() {
  FetchItem fit = null;
  try {
    while (true) {
      ...
      fit = ((FetchItemQueues) fetchQueues).getFetchItem();
      ...
      try {
        do {
          Protocol protocol = this.protocolFactory.getProtocol(fit.url
              .toString());
          BaseRobotRules rules = protocol.getRobotRules(fit.url, fit.datum);
          ...
          ProtocolOutput output = protocol.getProtocolOutput(fit.url,
              fit.datum);
          ProtocolStatus status = output.getStatus();
          Content content = output.getContent();
          ParseStatus pstatus = null;
          ((FetchItemQueues) fetchQueues).finishFetchItem(fit);
          String urlString = fit.url.toString();
          switch (status.getCode()) {
          ...
          case ProtocolStatus.SUCCESS:
            pstatus = output(fit.url, fit.datum, content, status,
                CrawlDatum.STATUS_FETCH_SUCCESS, fit.outlinkDepth);
            ...
            break;
          ...
          }
          ...
        } while (redirecting && (redirectCount <= maxRedirect));
      } catch (Throwable t) {
      }
    }
  } catch (Throwable e) {
  } finally {
  }
}
protocolFactory is a ProtocolFactory: based on the url's protocol (http, ftp, and so on) it looks up the matching protocol implementation from the plugin repository, for example org.apache.nutch.protocol.http.Http. Http's getRobotRules function fetches the robots.txt file of the url's site; robots.txt is the robots exclusion protocol, which a crawler should follow when deciding its fetch policy, for example which pages may or may not be fetched. For this walkthrough we assume it imposes no restrictions.
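The code elided right after getRobotRules is what consults the returned rules before any network fetch happens. A simplified sketch of such a gate, using crawler-commons' BaseRobotRules.isAllowed (the real FetcherThread also handles the robots crawl delay and records why a url was skipped, which is omitted here):
// Sketch of the robots.txt gate inside the fetch loop (simplified).
if (!rules.isAllowed(fit.url.toString())) {
  // robots.txt forbids this url: release it from the queue and move on
  // instead of calling getProtocolOutput.
  ((FetchItemQueues) fetchQueues).finishFetchItem(fit);
  continue;
}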
Next, Http's getProtocolOutput function reads the content at the url's address and returns a ProtocolOutput, which wraps the data read from the url together with a status code.
Finally, output is called to write the data just obtained to files.
Fetcher::run->FetcherThread::run->Http::getProtocolOutput
public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
  String urlString = url.toString();
  try {
    URL u = new URL(urlString);
    long startTime = System.currentTimeMillis();
    Response response = getResponse(u, datum, false);
    if (this.responseTime) {
      int elapsedTime = (int) (System.currentTimeMillis() - startTime);
      datum.getMetaData().put(RESPONSE_TIME, new IntWritable(elapsedTime));
    }
    int code = response.getCode();
    datum.getMetaData().put(Nutch.PROTOCOL_STATUS_CODE_KEY,
        new Text(Integer.toString(code)));
    byte[] content = response.getContent();
    Content c = new Content(u.toString(), u.toString(),
        (content == null ? EMPTY_CONTENT : content),
        response.getHeader("Content-Type"), response.getHeaders(), this.conf);
    if (code == 200) {
      return new ProtocolOutput(c);
    }
    ...
  } catch (Throwable e) {
  }
}
getProtocolOutput's main task is to retrieve data from the url according to its protocol (normally this means downloading the page the url points to), refresh the information in CrawlDatum, and finally build and return a ProtocolOutput. A ProtocolOutput holds two important things: the content retrieved from the url (for example the HTML of the page) and the status of this request, such as whether it succeeded.
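As a rough, self-contained illustration of what the getResponse step amounts to, here is a plain java.net.HttpURLConnection version (this is not Nutch's protocol-http plugin; redirects, size limits and full header handling are omitted, and the url and user agent are hypothetical):
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SimpleHttpFetch {
  public static void main(String[] args) throws Exception {
    URL u = new URL("http://example.com/");                   // hypothetical url
    long startTime = System.currentTimeMillis();
    HttpURLConnection conn = (HttpURLConnection) u.openConnection();
    conn.setRequestProperty("User-Agent", "my-test-crawler"); // http.agent.name analogue
    int code = conn.getResponseCode();                        // becomes the protocol status code
    byte[] content;
    try (InputStream in = conn.getInputStream();
         ByteArrayOutputStream buf = new ByteArrayOutputStream()) {
      byte[] b = new byte[4096];
      int n;
      while ((n = in.read(b)) != -1) {
        buf.write(b, 0, n);
      }
      content = buf.toByteArray();                            // becomes Content's payload
    }
    long elapsed = System.currentTimeMillis() - startTime;    // RESPONSE_TIME analogue
    System.out.println(code + " " + content.length + " bytes in " + elapsed + " ms");
  }
}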
Fetcher::run->FetcherThread::run->output
private ParseStatus output(Text key, CrawlDatum datum, Content content,
    ProtocolStatus pstatus, int status, int outlinkDepth) {
  datum.setStatus(status);
  datum.setFetchTime(System.currentTimeMillis());
  if (pstatus != null)
    datum.getMetaData().put(Nutch.WRITABLE_PROTO_STATUS_KEY, pstatus);
  ParseResult parseResult = null;
  if (content != null) {
    Metadata metadata = content.getMetadata();
    if (content.getContentType() != null)
      datum.getMetaData().put(new Text(Metadata.CONTENT_TYPE),
          new Text(content.getContentType()));
    metadata.set(Nutch.SEGMENT_NAME_KEY, segmentName);
    try {
      scfilters.passScoreBeforeParsing(key, datum, content);
    } catch (Exception e) {
    }
    ...
    content.getMetadata().add(Nutch.FETCH_STATUS_KEY,
        Integer.toString(status));
  }
  try {
    output.collect(key, new NutchWritable(datum));
    if (content != null && storingContent)
      output.collect(key, new NutchWritable(content));
    ...
  } catch (IOException e) {
  }
  return null;
}
The omitted code is related to parsing and is covered in the next chapter. The output function records the various pieces of information into CrawlDatum, then hands the CrawlDatum and Content to collect; how they are finally written out is defined by FetcherOutputFormat.
FetcherOutputFormat::getRecordWriter
public RecordWriter getRecordWriter(final FileSystem fs,
    final JobConf job, final String name, final Progressable progress)
    throws IOException {
  Path out = FileOutputFormat.getOutputPath(job);
  final Path fetch = new Path(new Path(out, CrawlDatum.FETCH_DIR_NAME), name);
  final Path content = new Path(new Path(out, Content.DIR_NAME), name);
  final CompressionType compType = SequenceFileOutputFormat
      .getOutputCompressionType(job);
  Option fKeyClassOpt = MapFile.Writer.keyClass(Text.class);
  org.apache.hadoop.io.SequenceFile.Writer.Option fValClassOpt = SequenceFile.Writer.valueClass(CrawlDatum.class);
  org.apache.hadoop.io.SequenceFile.Writer.Option fProgressOpt = SequenceFile.Writer.progressable(progress);
  org.apache.hadoop.io.SequenceFile.Writer.Option fCompOpt = SequenceFile.Writer.compression(compType);
  final MapFile.Writer fetchOut = new MapFile.Writer(job,
      fetch, fKeyClassOpt, fValClassOpt, fCompOpt, fProgressOpt);
  return new RecordWriter() {
    private MapFile.Writer contentOut;
    private RecordWriter parseOut;

    {
      if (Fetcher.isStoringContent(job)) {
        Option cKeyClassOpt = MapFile.Writer.keyClass(Text.class);
        org.apache.hadoop.io.SequenceFile.Writer.Option cValClassOpt = SequenceFile.Writer.valueClass(Content.class);
        org.apache.hadoop.io.SequenceFile.Writer.Option cProgressOpt = SequenceFile.Writer.progressable(progress);
        org.apache.hadoop.io.SequenceFile.Writer.Option cCompOpt = SequenceFile.Writer.compression(compType);
        contentOut = new MapFile.Writer(job, content,
            cKeyClassOpt, cValClassOpt, cCompOpt, cProgressOpt);
      }
      if (Fetcher.isParsing(job)) {
        parseOut = new ParseOutputFormat().getRecordWriter(fs, job, name,
            progress);
      }
    }

    public void write(Text key, NutchWritable value) throws IOException {
      Writable w = value.get();
      if (w instanceof CrawlDatum)
        fetchOut.append(key, w);
      else if (w instanceof Content && contentOut != null)
        contentOut.append(key, w);
      else if (w instanceof Parse && parseOut != null)
        parseOut.write(key, (Parse) w);
    }

    public void close(Reporter reporter) throws IOException {
      fetchOut.close();
      if (contentOut != null) {
        contentOut.close();
      }
      if (parseOut != null) {
        parseOut.close(reporter);
      }
    }
  };
}
The constants FETCH_DIR_NAME and DIR_NAME are crawl_fetch and content respectively; getRecordWriter creates these two directories under crawl/segments/2* together with the corresponding output writers. It finally returns a RecordWriter whose write function routes each record to a different file according to its type.
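To inspect the result of these writers, the crawl_fetch MapFile can be read back with Hadoop's MapFile.Reader. A sketch, assuming a hypothetical segment path and part name (the actual part name depends on the job):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class ReadCrawlFetch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical segment and part directory; adjust to an actual fetch output.
    Path dir = new Path("crawl/segments/20240101000000/crawl_fetch/part-00000");
    MapFile.Reader reader = new MapFile.Reader(dir, conf);
    Text key = new Text();
    CrawlDatum value = new CrawlDatum();
    while (reader.next(key, value)) {
      // Print each fetched url together with its fetch status.
      System.out.println(key + "\t" + value.getStatus());
    }
    reader.close();
  }
}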