In the previous article we walked through the entire inject process; this one covers generate.
generate: produces the set of URLs that will actually be fetched in this round.
Let's start with the code:
int i;
for (i = 0; i < depth; i++) {               // generate new segment
  Path[] segs = generator.generate(crawlDb, segments, -1, topN,
      System.currentTimeMillis());
  if (segs == null) {
    LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
    break;
  }
  fetcher.fetch(segs[0], threads);          // fetch it
  if (!Fetcher.isParsing(job)) {
    parseSegment.parse(segs[0]);            // parse it, if needed
  }
  crawlDbTool.update(crawlDb, segs, true, true); // update crawldb
}
This loop is essentially the core of Nutch, and of course the code behind it contains a lot of detail.
One more thing to note: the number of iterations is controlled by depth, a parameter you set yourself on the command line.
-----------------------------------------------------------------------------------------------------------------
Below we take
Path[] segs = generator.generate(crawlDb, segments, -1, topN, System
.currentTimeMillis());
as the entry point for walking through the generate process.
Generator generator = new Generator(getConf()); constructs the Generator object. Its constructor is:
public Generator(Configuration conf) {
  setConf(conf);
}
Next, look at the call
Path[] segs = generator.generate(crawlDb, segments, -1, topN, System
.currentTimeMillis());
Stepping into the method:
public Path[] generate(Path dbDir, Path segments, int numLists, long topN, long curTime)
    throws IOException {
  JobConf job = new NutchJob(getConf());
  boolean filter = job.getBoolean(GENERATOR_FILTER, true);
  boolean normalise = job.getBoolean(GENERATOR_NORMALISE, true);
  return generate(dbDir, segments, numLists, topN, curTime, filter, normalise, false, 1);
}
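For reference, GENERATOR_FILTER and GENERATOR_NORMALISE are just keys in the job configuration (I believe they correspond to the properties generate.filter and generate.normalise, but treat the exact names as an assumption). A minimal sketch of turning filtering off programmatically before calling generate:

Configuration conf = NutchConfiguration.create();
conf.setBoolean("generate.filter", false);    // assumed key behind GENERATOR_FILTER
conf.setBoolean("generate.normalise", true);  // assumed key behind GENERATOR_NORMALISE
Generator generator = new Generator(conf);
Path[] segs = generator.generate(new Path("crawl/crawldb"),   // hypothetical crawldb path
    new Path("crawl/segments"),                               // hypothetical segments path
    -1, 1000L, System.currentTimeMillis());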
Now follow the return generate(...) call at the end of that method into the full implementation:
Path tempDir = new Path(getConf().get("mapred.temp.dir", ".")
    + "/generate-temp-" + java.util.UUID.randomUUID().toString()); // create a temporary directory
Path lock = new Path(dbDir, CrawlDb.LOCK_NAME); // path of the lock file
FileSystem fs = FileSystem.get(getConf());      // get the file system
LockUtil.createLockFile(fs, lock, force);       // create the lock file
----------------
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
long start = System.currentTimeMillis();
LOG.info("Generator: starting at " + sdf.format(start));
LOG.info("Generator: Selecting best-scoring urls due for fetch.");
LOG.info("Generator: filtering: " + filter);
LOG.info("Generator: normalizing: " + norm);
if (topN != Long.MAX_VALUE) {
  LOG.info("Generator: topN: " + topN);
}
if ("true".equals(getConf().get(GENERATE_MAX_PER_HOST_BY_IP))) {
  LOG.info("Generator: GENERATE_MAX_PER_HOST_BY_IP will be ignored, use partition.url.mode instead");
}
This block just prints log messages; nothing complicated.
--------------------------------------------The next piece of code is:
// map to inverted subset due for fetch, sort by score
JobConf job = new NutchJob(getConf());              // create the NutchJob
job.setJobName("generate: select from " + dbDir);   // set the job name
if (numLists == -1) {                               // for politeness make
  numLists = job.getNumMapTasks();                  // a partition per fetch task
}                                                   // here the value is 2
if ("local".equals(job.get("mapred.job.tracker")) && numLists != 1) {
  // override
  LOG.info("Generator: jobtracker is 'local', generating exactly one partition.");
  numLists = 1;
}                                                   // in local mode, force it back to 1
job.setLong(GENERATOR_CUR_TIME, curTime);           // set to the current time
// record real generation time
long generateTime = System.currentTimeMillis();
job.setLong(Nutch.GENERATE_TIME_KEY, generateTime); // set to the current time
job.setLong(GENERATOR_TOP_N, topN);                 // max number of URLs per round
job.setBoolean(GENERATOR_FILTER, filter);           // whether to filter URLs
job.setBoolean(GENERATOR_NORMALISE, norm);          // whether to normalize URLs
job.setInt(GENERATOR_MAX_NUM_SEGMENTS, maxNumSegments); // set to 1
FileInputFormat
    .addInputPath(job, new Path(dbDir, CrawlDb.CURRENT_NAME)); // add the input directory
job.setInputFormat(SequenceFileInputFormat.class);  // input format
job.setMapperClass(Selector.class);
job.setPartitionerClass(Selector.class);
job.setReducerClass(Selector.class);
FileOutputFormat.setOutputPath(job, tempDir);       // output directory
job.setOutputFormat(SequenceFileOutputFormat.class); // output format
job.setOutputKeyClass(FloatWritable.class);
job.setOutputKeyComparatorClass(DecreasingFloatComparator.class);
job.setOutputValueClass(SelectorEntry.class);
job.setOutputFormat(GeneratorOutputFormat.class);   // output format (overrides the previous setting)
try {
  JobClient.runJob(job);                            // run the job
} catch (IOException e) {
  throw e;
}
Several parameters are set here:
the input path and input format, plus the map, partition and reduce classes;
the output path and output format, plus the key, value and comparator classes.
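The value class here, SelectorEntry, is a small Writable defined inside Generator that bundles a URL with its CrawlDatum and a segment number; roughly (a sketch from memory, not the exact Nutch source) it looks like:

public static class SelectorEntry implements Writable {
  public Text url = new Text();                 // the candidate URL
  public CrawlDatum datum = new CrawlDatum();   // its crawldb entry
  public IntWritable segnum = new IntWritable(0); // segment assigned by the reducer

  public void readFields(DataInput in) throws IOException {
    url.readFields(in);
    datum.readFields(in);
    segnum.readFields(in);
  }

  public void write(DataOutput out) throws IOException {
    url.write(out);
    datum.write(out);
    segnum.write(out);
  }
}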
Below we go through the code of these classes one by one.
------------------------------------------------------------------------------------
First, the map code:
Text url = key;
if (filter) {
  // If filtering is on don't generate URLs that don't pass
  // URLFilters
  try {
    if (filters.filter(url.toString()) == null)
      return;
  } catch (URLFilterException e) {
    if (LOG.isWarnEnabled()) {
      LOG.warn("Couldn't filter url: " + url + " (" + e.getMessage() + ")");
    }
  }
}
CrawlDatum crawlDatum = value;
The code above is straightforward: if filtering is enabled, run the URL through the filters, and return immediately when the result is null (the URL is rejected).
Once past this check, we have our key/value pair.
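The filters object is a URLFilters instance that runs every active URL filter plugin; a filter rejects a URL by returning null. As an illustration only (real crawls use the bundled plugins such as urlfilter-regex), a minimal custom filter against the org.apache.nutch.net.URLFilter interface could look like:

public class HttpOnlyFilter implements URLFilter {
  private Configuration conf;

  // Return the URL to keep it, or null to reject it.
  public String filter(String urlString) {
    return urlString.startsWith("http://") ? urlString : null;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}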
--------------------------------------------------------------
// check fetch schedule
if (!schedule.shouldFetch(url, crawlDatum, curTime)) {
  LOG.debug("-shouldFetch rejected '" + url + "', fetchTime="
      + crawlDatum.getFetchTime() + ", curTime=" + curTime);
  return;
}
Here schedule is an instance of org.apache.nutch.crawl.DefaultFetchSchedule (the trailing @74328c1c is just its runtime identity hash). Its shouldFetch implementation, which actually lives in AbstractFetchSchedule, is:
public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
  System.out.println("output by AbstractFetchSchedule...");
  // pages are never truly GONE - we have to check them from time to time.
  // pages with too long fetchInterval are adjusted so that they fit within
  // maximum fetchInterval (segment retention period).
  if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
    if (datum.getFetchInterval() > maxInterval) {
      datum.setFetchInterval(maxInterval * 0.9f);
    }
    datum.setFetchTime(curTime);
  }
  if (datum.getFetchTime() > curTime) {
    return false; // not time yet
  }
  return true;
}
In short, it compares the page's scheduled next-fetch time with the current time to decide whether the page should be fetched,
adjusting fetch times that lie more than maxInterval in the future.
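To make the decision concrete, here is the same logic on plain numbers (hypothetical values; maxInterval is in seconds, as in the code above):

long curTime     = System.currentTimeMillis();        // "now", in milliseconds
long maxInterval = 7776000L;                          // e.g. 90 days, in seconds
long fetchTime   = curTime + 200L * 24 * 3600 * 1000; // next fetch scheduled ~200 days out

if (fetchTime - curTime > maxInterval * 1000) {       // further away than maxInterval?
  fetchTime = curTime;                                 // pull it back so it can be fetched now
}
boolean shouldFetch = fetchTime <= curTime;            // true here, so the URL is generated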
------------------
The next piece of code is:
LongWritable oldGenTime = (LongWritable) crawlDatum.getMetaData()
    .get(Nutch.WRITABLE_GENERATE_TIME_KEY);
if (oldGenTime != null) { // awaiting fetch & update
  if (oldGenTime.get() + genDelay > curTime) // still wait for update
    return;
}
We'll set this code aside for now (briefly: a URL that was generated recently and is still awaiting fetch and crawldb update is skipped until genDelay has elapsed).
===============================================================
If all the checks pass, we compute a score for this URL.
float sort = 1.0f;
try {
  sort = scfilters.generatorSortValue(key, crawlDatum, sort);
} catch (ScoringFilterException sfe) {
  if (LOG.isWarnEnabled()) {
    LOG.warn("Couldn't filter generatorSortValue for " + key + ": " + sfe);
  }
}
The scfilters here contains only OPICScoringFilter, whose implementation is:
public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
    throws ScoringFilterException {
  return datum.getScore() * initSort;
}
By default this is 1.0 * 1.0 = 1.0.
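When several scoring filters are active, ScoringFilters chains them, feeding each filter's result into the next one's initSort. The idea, as a rough sketch (not the actual ScoringFilters code):

float sort = 1.0f;                                     // initial sort value, as in Selector.map()
for (ScoringFilter f : activeFilters) {                // hypothetical array of active filters
  sort = f.generatorSortValue(url, crawlDatum, sort);  // each filter sees the previous result
}
// with only OPICScoringFilter active: sort == crawlDatum.getScore() * 1.0f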
--------------------------------------------
if (restrictStatus != null
    && !restrictStatus.equalsIgnoreCase(CrawlDatum.getStatusName(crawlDatum.getStatus())))
  return;
This is a status check; one does wonder why it isn't performed earlier, before the score is computed...
=========================================
// consider only entries with a score superior to the threshold
if (scoreThreshold != Float.NaN && sort < scoreThreshold)
  return;
This one is simple: a minimum score threshold, and entries scoring below it are dropped.
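One subtlety worth noting: in Java, Float.NaN is never equal to anything, itself included, so the guard scoreThreshold != Float.NaN is always true. The check still behaves sensibly, because any comparison against NaN is false:

float scoreThreshold = Float.NaN; // the default when no minimum score is configured (assumption)
float sort = 1.0f;
System.out.println(scoreThreshold != Float.NaN); // true  -- this guard never short-circuits
System.out.println(sort < scoreThreshold);       // false -- so nothing is dropped when NaN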
-------------------------------------------
// sort by decreasing score, using DecreasingFloatComparator
sortValue.set(sort);                        // set the score as the output key
// record generation time
crawlDatum.getMetaData().put(Nutch.WRITABLE_GENERATE_TIME_KEY,
    genTime);                               // record the generation timestamp
entry.datum = crawlDatum;
entry.url = key;                            // the URL and its crawldb entry
output.collect(sortValue, entry);           // invert for sort by score
// final output: key = score, value = SelectorEntry
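The inversion works together with the DecreasingFloatComparator registered earlier via setOutputKeyComparatorClass: the score becomes the map output key, and the comparator sorts keys in descending order so the highest-scoring URLs come first and topN keeps the best of them. Its idea is simply a reversed FloatWritable comparison; roughly (a sketch, not necessarily the exact Nutch source):

public static class DecreasingFloatComparator extends FloatWritable.Comparator {
  // Compare with the arguments swapped, so larger scores sort first.
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    return super.compare(b2, s2, l2, b1, s1, l1);
  }
}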