The previous section covered part of the Fetch code. This section turns to the part of Fetch that actually runs as a Hadoop job.
-----------------------------------------------------------------------------
The code is as follows:
JobConf job = new NutchJob(getConf());
job.setJobName("fetch " + segment);
job.setInt("fetcher.threads.fetch", threads);
job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());
// for politeness, don't permit parallel execution of a single task
job.setSpeculativeExecution(false);
FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.GENERATE_DIR_NAME));
job.setInputFormat(InputFormat.class);
job.setMapRunnerClass(Fetcher.class);
FileOutputFormat.setOutputPath(job, segment);
job.setOutputFormat(FetcherOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NutchWritable.class);
JobClient.runJob(job);
Most of this is routine job setup. The one line that really matters is:
job.setMapRunnerClass(Fetcher.class);
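Why this one line matters can be shown with a small sketch. In Hadoop's old mapred API, the default runner reads records and calls map() once per record, while a custom map runner receives the whole input and decides for itself how to consume it. The interfaces and class names below (SimpleMapper, SimpleMapRunnable, DefaultRunner, CustomRunner) are simplified stand-ins invented for illustration, not the real Hadoop or Nutch types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for the Hadoop types; NOT the real interfaces.
interface SimpleMapper {
    String map(String record);
}

interface SimpleMapRunnable {
    void run(Iterator<String> input, List<String> output);
}

// Default style: the runner owns the loop and calls map() once per record.
class DefaultRunner implements SimpleMapRunnable {
    private final SimpleMapper mapper;

    DefaultRunner(SimpleMapper mapper) {
        this.mapper = mapper;
    }

    @Override
    public void run(Iterator<String> input, List<String> output) {
        while (input.hasNext()) {
            output.add(mapper.map(input.next()));
        }
    }
}

// Custom style: the class itself controls how the input is consumed,
// so it is free to hand records to its own queues and threads.
class CustomRunner implements SimpleMapRunnable {
    @Override
    public void run(Iterator<String> input, List<String> output) {
        List<String> pending = new ArrayList<>();
        while (input.hasNext()) {
            pending.add(input.next()); // a real fetcher would enqueue here
        }
        for (String url : pending) {
            output.add("fetched:" + url); // placeholder for the real work
        }
    }
}

class MapRunnerSketch {
    static List<String> runWith(SimpleMapRunnable runner, List<String> records) {
        List<String> out = new ArrayList<>();
        runner.run(records.iterator(), out);
        return out;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://a.example/", "http://b.example/");
        System.out.println(runWith(new DefaultRunner(r -> "mapped:" + r), urls));
        System.out.println(runWith(new CustomRunner(), urls));
    }
}
```

The point of the sketch is only the shift of control: once a class is installed as the runner, everything that happens to the input is decided inside its run method.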
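So next we look at the code of this class. Because Fetcher is registered as the map runner, Hadoop hands the whole input to its run method, which begins as follows:
this.output = output; // keep a reference
this.reporter = reporter; // keep a reference
this.fetchQueues = new FetchItemQueues(getConf()); // construct a FetchItemQueues object
The constructor called in the last statement is as follows: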
public FetchItemQueues(Configuration conf) {
this.conf = conf; // keep a reference to conf
this.maxThreads = conf.getInt("fetcher.threads.per.queue", 1); // default: 1
queueMode = conf.get("fetcher.queue.mode", QUEUE_MODE_HOST); // default: byHost
// check that the mode is known
if (!queueMode.equals(QUEUE_MODE_IP) && !queueMode.equals(QUEUE_MODE_DOMAIN)
&& !queueMode.equals(QUEUE_MODE_HOST)) {
LOG.error("Unknown partition mode : " + queueMode + " - forcing to byHost");
queueMode = QUEUE_MODE_HOST;
} // validate that the configured mode is a known one
LOG.info("Using queue mode : "+queueMode); // log the mode in use
this.crawlDelay = (long) (conf.getFloat("fetcher.server.delay", 1.0f) * 1000); // fallback in code: 1000 ms (nutch-default.xml usually sets 5.0, i.e. 5000 ms)
this.minCrawlDelay = (long) (conf.getFloat("fetcher.server.min.delay", 0.0f) * 1000); // default: 0 ms
this.timelimit = conf.getLong("fetcher.timelimit", -1);
// default: -1
this.maxExceptionsPerQueue = conf.getInt("fetcher.max.exceptions.per.queue", -1);
// default: -1
} // initialization complete
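The two less obvious steps in this constructor, falling back to byHost on an unknown mode and converting a delay in seconds to milliseconds, can be tried in isolation. This is a minimal sketch that uses a plain Map in place of Hadoop's Configuration; the class name, the helper names, and the literal strings "byIP" and "byDomain" are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class FetchQueueConfigSketch {
    static final String BY_HOST = "byHost";
    static final String BY_IP = "byIP";         // assumed literal value
    static final String BY_DOMAIN = "byDomain"; // assumed literal value

    // Mirrors the validation above: any unknown mode is forced to byHost.
    static String resolveQueueMode(String configured) {
        String mode = (configured == null) ? BY_HOST : configured;
        if (!mode.equals(BY_IP) && !mode.equals(BY_DOMAIN) && !mode.equals(BY_HOST)) {
            return BY_HOST;
        }
        return mode;
    }

    // Reads a delay given in seconds and returns it in milliseconds,
    // using the fallback when the key is not configured.
    static long delayMillis(Map<String, String> conf, String key, float fallbackSeconds) {
        String raw = conf.get(key);
        float seconds = (raw == null) ? fallbackSeconds : Float.parseFloat(raw);
        return (long) (seconds * 1000);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(resolveQueueMode("bySomethingElse"));
        System.out.println(delayMillis(conf, "fetcher.server.delay", 1.0f));
        conf.put("fetcher.server.delay", "5.0");
        System.out.println(delayMillis(conf, "fetcher.server.delay", 1.0f));
    }
}
```

With nothing configured, the fallback of 1.0 second yields 1000 ms. A value of 5.0 in the configuration yields 5000 ms, so the effective delay depends on which configuration files are loaded.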
-----------------------------------------------------------------------------------------------------------------------------------
Next, several variables are set:
int threadCount = getConf().getInt("fetcher.threads.fetch", 10);
if (LOG.isInfoEnabled()) {
LOG.info("Fetcher: threads: " + threadCount);
} // this is the thread count passed on the command line
int timeoutDivisor = getConf().getInt(
"fetcher.threads.timeout.divisor", 2);
if (LOG.isInfoEnabled()) {
LOG.info("Fetcher: time-out divisor: " + timeoutDivisor);
} // default: 2
int queueDepthMuliplier = getConf().getInt(
"fetcher.queue.depth.multiplier", 50); // default: 50
----------------------------------------------------------------------------------------------
Next, the QueueFeeder thread is constructed and started. Its constructor is as follows:
public QueueFeeder(RecordReader<Text, CrawlDatum> reader,
FetchItemQueues queues, int size) {
this.reader = reader;
this.queues = queues;
this.size = size;
this.setDaemon(true);
this.setName("QueueFeeder");
}
Then the time limit is set and the thread is started:
feeder = new QueueFeeder(input, fetchQueues, threadCount
* queueDepthMuliplier);
// feeder.setPriority((Thread.MAX_PRIORITY + Thread.NORM_PRIORITY) / 2);
// the value of the time limit is either -1 or the time where it should
// finish
long timelimit = getConf().getLong("fetcher.timelimit", -1);
if (timelimit != -1)
feeder.setTimeLimit(timelimit);
feeder.start();
We will come back to what this thread does once it is running. For now, keep reading.
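The comment in the code says the time limit is either -1 or the time at which feeding should finish. A minimal sketch of how such a sentinel value might be checked follows. The class and method names are hypothetical, and the assumption that the limit is an absolute timestamp in milliseconds comes from that comment, not from the QueueFeeder code itself, which is analyzed later.

```java
class TimeLimitSketch {
    static final long NO_LIMIT = -1L; // sentinel meaning "no time limit"

    // Returns true when a limit is set and the given moment has reached it.
    static boolean limitReached(long timelimit, long nowMillis) {
        return timelimit != NO_LIMIT && nowMillis >= timelimit;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(limitReached(NO_LIMIT, now));   // no limit configured
        System.out.println(limitReached(now + 60_000L, now)); // limit still in the future
        System.out.println(limitReached(now - 1L, now));      // limit already passed
    }
}
```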
----------------------------------------------------------------
Then a number of FetcherThread instances are started:
// set non-blocking & no-robots mode for HTTP protocol plugins.
getConf().setBoolean(Protocol.CHECK_BLOCKING, false);
getConf().setBoolean(Protocol.CHECK_ROBOTS, false);
for (int i = 0; i < threadCount; i++) { // spawn threads
new FetcherThread(getConf()).start();
}
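At this point one feeder thread and threadCount fetcher threads are running, which is a producer-consumer arrangement. The sketch below illustrates that shape only: it swaps in a java.util.concurrent BlockingQueue for Nutch's FetchItemQueues, and the class and method names are invented for illustration. It does not reproduce how QueueFeeder or FetcherThread are actually implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class FeederWorkerSketch {
    // Feeds all urls through a bounded queue to threadCount workers
    // and returns how many items the workers consumed.
    static int fetchAll(List<String> urls, int threadCount, int depthMultiplier) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(threadCount * depthMultiplier);
        AtomicInteger fetched = new AtomicInteger();
        AtomicBoolean feederDone = new AtomicBoolean(false);

        Thread feeder = new Thread(() -> {
            try {
                for (String url : urls) {
                    queue.put(url); // blocks while the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                feederDone.set(true);
            }
        });
        feeder.setDaemon(true); // same daemon/name setup as the constructor above
        feeder.setName("QueueFeeder");
        feeder.start();

        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        String url = queue.poll(50, TimeUnit.MILLISECONDS);
                        if (url != null) {
                            fetched.incrementAndGet(); // placeholder for a real fetch
                        } else if (feederDone.get() && queue.isEmpty()) {
                            return; // nothing left and no more input coming
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.start();
            workers.add(worker);
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return fetched.get();
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            urls.add("http://example.com/page" + i);
        }
        System.out.println("fetched " + fetchAll(urls, 3, 2) + " items");
    }
}
```

The bounded capacity plays the role of threadCount * queueDepthMuliplier in the real code: the feeder cannot run arbitrarily far ahead of the threads that consume its output.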
-------------------------------------------------------------------------
Next come some miscellaneous settings:
// select a timeout that avoids a task timeout
long timeout = getConf().getInt("mapred.task.timeout", 10 * 60 * 1000)
/ timeoutDivisor;
// Used for threshold check, holds pages and bytes processed in the last
// second
int pagesLastSec;
int bytesLastSec;
// Set to true whenever the threshold has been exceeded for the first
// time
boolean throughputThresholdExceeded = false;
int throughputThresholdNumRetries = 0;
int throughputThresholdPages = getConf().getInt(
"fetcher.throughput.threshold.pages", -1);
if (LOG.isInfoEnabled()) {
LOG.info("Fetcher: throughput threshold: "
+ throughputThresholdPages);
}
int throughputThresholdMaxRetries = getConf().getInt(
"fetcher.throughput.threshold.retries", 5);
if (LOG.isInfoEnabled()) {
LOG.info("Fetcher: throughput threshold retries: "
+ throughputThresholdMaxRetries);
}
long throughputThresholdTimeLimit = getConf().getLong(
"fetcher.throughput.threshold.check.after", -1);
The values of these variables are not analyzed for now, because we have not yet reached the code that uses them.
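Two derived values in this walkthrough can already be worked out from the defaults shown in the code: the size passed to QueueFeeder and the timeout computed from mapred.task.timeout. The sketch below only repeats that arithmetic with the fallback values from the getInt calls. The class and method names are hypothetical.

```java
class FetcherDerivedValuesSketch {
    // Size handed to QueueFeeder: threadCount * queueDepthMuliplier.
    static int feederQueueSize(int threadCount, int queueDepthMultiplier) {
        return threadCount * queueDepthMultiplier;
    }

    // Timeout chosen to stay below the task timeout: taskTimeout / divisor.
    static long timeoutMillis(int taskTimeoutMillis, int timeoutDivisor) {
        return taskTimeoutMillis / timeoutDivisor;
    }

    public static void main(String[] args) {
        int threadCount = 10;                  // fallback for fetcher.threads.fetch
        int queueDepthMultiplier = 50;         // fallback for fetcher.queue.depth.multiplier
        int taskTimeoutMillis = 10 * 60 * 1000; // fallback for mapred.task.timeout
        int timeoutDivisor = 2;                // fallback for fetcher.threads.timeout.divisor

        System.out.println("feeder queue size: "
                + feederQueueSize(threadCount, queueDepthMultiplier));
        System.out.println("timeout (ms): "
                + timeoutMillis(taskTimeoutMillis, timeoutDivisor));
    }
}
```

With the fallbacks, the feeder size is 10 * 50 = 500 and the timeout is 600000 / 2 = 300000 ms, that is, five minutes, half of the ten-minute task timeout.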
-----------------------
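Two kinds of threads, QueueFeeder and FetcherThread, have now been started without their code being analyzed. The next two articles look at each of them in detail.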