The previous sections finished analyzing the source code of inject and generate. Now it is time to analyze fetch.
-----------------------------------------------------------------------------------------------
fetcher.fetch(segs[0], threads); // fetch it
This is where the actual fetch process starts. Here threads is the value passed on the command line.
-------------------------------------- Validate the configuration
checkConfiguration();
The code is as follows:
private void checkConfiguration() {
  // ensure that a value has been set for the agent name and that that
  // agent name is the first value in the agents we advertise for robot
  // rules parsing
  String agentName = getConf().get("http.agent.name");
  if (agentName == null || agentName.trim().length() == 0) {
    // check that the http.agent.name property is set
    String message = "Fetcher: No agents listed in 'http.agent.name'"
        + " property.";
    if (LOG.isErrorEnabled()) {
      LOG.error(message);
    }
    throw new IllegalArgumentException(message);
  } else {
    // get all of the agents that we advertise
    String agentNames = getConf().get("http.robots.agents");
    StringTokenizer tok = new StringTokenizer(agentNames, ",");
    ArrayList<String> agents = new ArrayList<String>();
    while (tok.hasMoreTokens()) {
      agents.add(tok.nextToken().trim());
    } // collect every entry configured in http.robots.agents
    // if the first one is not equal to our agent name, log fatal and throw
    // an exception
    if (!(agents.get(0)).equalsIgnoreCase(agentName)) {
      // the first entry is expected to match http.agent.name;
      // as written, a mismatch only logs a warning and does not throw
      String message = "Fetcher: Your 'http.agent.name' value should be "
          + "listed first in 'http.robots.agents' property.";
      if (LOG.isWarnEnabled()) {
        LOG.warn(message);
      }
    }
  }
}
If http.agent.name is not set, the following error is reported:
Fetcher: No agents listed in 'http.agent.name' property.
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1397)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1282)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:221)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:66)
Solution:
Add the following to nutch-site.xml:
<property>
<name>http.agent.name</name>
<value>Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1</value>
</property>
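The two checks in checkConfiguration() can be tried outside of Nutch with a small standalone sketch. This is not the Nutch implementation: the class name AgentConfigCheck and its helper methods are made up for illustration, and unlike the original it treats an unset http.robots.agents as an empty list instead of passing null to a tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the two checks above; all names here are hypothetical.
final class AgentConfigCheck {

  /** Splits a comma-separated agent list, trimming entries and dropping empty ones. */
  static List<String> parseAgents(String robotsAgents) {
    List<String> agents = new ArrayList<>();
    if (robotsAgents == null) {
      return agents; // assumption: an unset property is treated as an empty list
    }
    for (String token : robotsAgents.split(",")) {
      String trimmed = token.trim();
      if (!trimmed.isEmpty()) {
        agents.add(trimmed);
      }
    }
    return agents;
  }

  /**
   * Returns true when agentName is the first advertised agent (case-insensitive).
   * Throws IllegalArgumentException when agentName is null or blank, mirroring
   * the hard failure in the first branch of checkConfiguration().
   */
  static boolean isAgentListedFirst(String agentName, String robotsAgents) {
    if (agentName == null || agentName.trim().isEmpty()) {
      throw new IllegalArgumentException(
          "No agents listed in 'http.agent.name' property.");
    }
    List<String> agents = parseAgents(robotsAgents);
    return !agents.isEmpty() && agents.get(0).equalsIgnoreCase(agentName.trim());
  }

  public static void main(String[] args) {
    System.out.println(isAgentListedFirst("MyCrawler", "mycrawler,*")); // true
    System.out.println(isAgentListedFirst("MyCrawler", "*,MyCrawler")); // false
    System.out.println(isAgentListedFirst("MyCrawler", null));          // false
  }
}
```

A false result corresponds to the case where the Fetcher only logs a warning; the exception corresponds to the stack trace shown earlier.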
--------------------------- Next, the start time is logged
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
long start = System.currentTimeMillis();
if (LOG.isInfoEnabled()) {
LOG.info("Fetcher: starting at " + sdf.format(start));
LOG.info("Fetcher: segment: " + segment);
}
------------------------------------------------- Next is the time-limit code
// set the actual time for the timelimit relative
// to the beginning of the whole job and not of a specific task
// otherwise it keeps trying again if a task fails
long timelimit = getConf().getLong("fetcher.timelimit.mins", -1);
if (timelimit != -1) {
timelimit = System.currentTimeMillis() + (timelimit * 60 * 1000);
LOG.info("Fetcher Timelimit set for : " + timelimit);
getConf().setLong("fetcher.timelimit", timelimit);
}
The default is -1, which means no time limit is applied, so this block is skipped for now.
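The point of the comment in the code is that the limit in minutes is turned into one absolute deadline when the job starts, so a retried task inherits the same deadline instead of getting a fresh one. A minimal sketch of that conversion follows. The class FetchDeadline and both methods are hypothetical, and how the fetcher later consults the stored value is not shown in this section, so isExpired is only an assumption about a plausible check.

```java
import java.util.concurrent.TimeUnit;

// Sketch of turning a relative limit (minutes) into an absolute deadline (epoch millis).
final class FetchDeadline {
  static final long NO_LIMIT = -1L;

  /** Converts limitMins into an absolute deadline; -1 is passed through as "no limit". */
  static long toDeadlineMillis(long limitMins, long nowMillis) {
    if (limitMins == NO_LIMIT) {
      return NO_LIMIT;
    }
    return nowMillis + TimeUnit.MINUTES.toMillis(limitMins);
  }

  /** Hypothetical check: true once the deadline has been reached. */
  static boolean isExpired(long deadlineMillis, long nowMillis) {
    return deadlineMillis != NO_LIMIT && nowMillis >= deadlineMillis;
  }

  public static void main(String[] args) {
    long jobStart = System.currentTimeMillis();
    long deadline = toDeadlineMillis(30, jobStart);
    System.out.println("deadline is " + (deadline - jobStart) + " ms after job start");
    System.out.println("expired at job start: " + isExpired(deadline, jobStart));
  }
}
```

Passing the current time in as a parameter keeps the helpers easy to test with fixed values.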
----------------------------------------------------
// Set the time limit after which the throughput threshold feature is enabled
timelimit = getConf().getLong("fetcher.throughput.threshold.check.after", 10);
timelimit = System.currentTimeMillis() + (timelimit * 60 * 1000);
getConf().setLong("fetcher.throughput.threshold.check.after", timelimit);
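The value read from the configuration (in minutes, defaulting to 10 in this code) is converted to an absolute timestamp and written back to fetcher.throughput.threshold.check.after. The throughput threshold check is enabled only after that point in time.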
-----------------------------------------------------------------------------------------------------------
Next, maxOutlinkDepth is set.
The code is as follows:
int maxOutlinkDepth = getConf().getInt("fetcher.follow.outlinks.depth", -1);
if (maxOutlinkDepth > 0) {
LOG.info("Fetcher: following outlinks up to depth: " + Integer.toString(maxOutlinkDepth));
int maxOutlinkDepthNumLinks = getConf().getInt("fetcher.follow.outlinks.num.links", 4);
int outlinksDepthDivisor = getConf().getInt("fetcher.follow.outlinks.depth.divisor", 2);
int totalOutlinksToFollow = 0;
for (int i = 0; i < maxOutlinkDepth; i++) {
totalOutlinksToFollow += (int)Math.floor(outlinksDepthDivisor / (i + 1) * maxOutlinkDepthNumLinks);
}
LOG.info("Fetcher: maximum outlinks to follow: " + Integer.toString(totalOutlinksToFollow));
}
This value is read from the fetcher.follow.outlinks.depth configuration property:
<property>--------------------------------------------------------------
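To see what the loop over maxOutlinkDepth actually sums, the arithmetic can be isolated in a small sketch. The class OutlinkBudget is hypothetical; the expression inside the loop is copied from the code above. Note that outlinksDepthDivisor / (i + 1) is integer division, so it truncates before the multiplication and Math.floor has no further effect.

```java
// Isolates the "maximum outlinks to follow" arithmetic from the Fetcher code above.
final class OutlinkBudget {

  /** Sums the per-depth outlink allowance using the same expression as the Fetcher. */
  static int totalOutlinksToFollow(int maxDepth, int numLinks, int divisor) {
    int total = 0;
    for (int i = 0; i < maxDepth; i++) {
      // divisor / (i + 1) is int division: 2/1 = 2, 2/2 = 1, 2/3 = 0, ...
      total += (int) Math.floor(divisor / (i + 1) * numLinks);
    }
    return total;
  }

  public static void main(String[] args) {
    // Defaults taken from the code above: num.links = 4, depth.divisor = 2.
    for (int depth = 1; depth <= 4; depth++) {
      System.out.println("depth " + depth + " -> " + totalOutlinksToFollow(depth, 4, 2));
    }
    // Prints 8, 12, 12, 12: beyond depth 2 the integer division yields 0.
  }
}
```

With the default values the logged total therefore stops growing at 12 once the depth exceeds the divisor.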
What follows is the actual Hadoop job; see the next section.