Nutch 1.7 Source Code Revisited, Part 9: Analyzing the Fetch Flow

The previous sections walked through the inject and generate source code. Now it is time to analyze fetch.

-----------------------------------------------------------------------------------------------

fetcher.fetch(segs[0], threads);  // fetch it

This is where the real fetch flow begins. threads is the thread count passed on the command line.

-------------------------------------- Checking the configuration

 checkConfiguration();

The code is as follows:

 

private void checkConfiguration() {
    // ensure that a value has been set for the agent name and that that
    // agent name is the first value in the agents we advertise for robot
    // rules parsing
    String agentName = getConf().get("http.agent.name");
    if (agentName == null || agentName.trim().length() == 0) {
      // check that the http.agent.name property has been set
      String message = "Fetcher: No agents listed in 'http.agent.name'"
          + " property.";
      if (LOG.isErrorEnabled()) {
        LOG.error(message);
      }
      throw new IllegalArgumentException(message);
    } else {
      // get all of the agents that we advertise
      String agentNames = getConf().get("http.robots.agents");
      StringTokenizer tok = new StringTokenizer(agentNames, ",");
      ArrayList<String> agents = new ArrayList<String>();
      while (tok.hasMoreTokens()) {
        agents.add(tok.nextToken().trim());
      } // collect every agent listed in http.robots.agents
      // if the first one is not equal to our agent name, log fatal and throw
      // an exception
      if (!(agents.get(0)).equalsIgnoreCase(agentName)) {
        // the first entry is expected to match http.agent.name
        String message = "Fetcher: Your 'http.agent.name' value should be "
            + "listed first in 'http.robots.agents' property.";
        if (LOG.isWarnEnabled()) {
          LOG.warn(message);
        }
      }
    }
  }

If the property is not set, the following error is thrown:

Fetcher: No agents listed in 'http.agent.name' property.
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
  at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1397)
  at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1282)
  at org.apache.nutch.crawl.Crawl.run(Crawl.java:221)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.nutch.crawl.Crawl.main(Crawl.java:66)

Solution:

Add the following to nutch-site.xml:

<property>
 <name>http.agent.name</name>
 <value>Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1</value>
</property>
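Note that checkConfiguration() also expects the value of http.agent.name to appear first in http.robots.agents; if it does not, the fetcher only logs a warning rather than aborting. The small standalone sketch below (the agent names here are hypothetical, not taken from any real configuration) shows how that first-entry comparison works:

import java.util.ArrayList;
import java.util.StringTokenizer;

public class AgentCheckDemo {
  public static void main(String[] args) {
    String agentName = "MyNutchSpider";               // hypothetical http.agent.name
    String agentNames = "MyNutchSpider,Mozilla/5.0";  // hypothetical http.robots.agents

    // same parsing as checkConfiguration(): split on commas and trim each entry
    StringTokenizer tok = new StringTokenizer(agentNames, ",");
    ArrayList<String> agents = new ArrayList<String>();
    while (tok.hasMoreTokens()) {
      agents.add(tok.nextToken().trim());
    }
    // the fetcher only warns (it does not abort) when the first entry differs from http.agent.name
    System.out.println("first agent matches: " + agents.get(0).equalsIgnoreCase(agentName));
  }
}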

--------------------------- Next: logging the start time

 

SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    long start = System.currentTimeMillis();
    if (LOG.isInfoEnabled()) {
      LOG.info("Fetcher: starting at " + sdf.format(start));
      LOG.info("Fetcher: segment: " + segment);
    }

------------------------------------------------- Next: the time-limit code

 

// set the actual time for the timelimit relative
    // to the beginning of the whole job and not of a specific task
    // otherwise it keeps trying again if a task fails
    long timelimit = getConf().getLong("fetcher.timelimit.mins", -1);
    if (timelimit != -1) {
      timelimit = System.currentTimeMillis() + (timelimit * 60 * 1000);
      LOG.info("Fetcher Timelimit set for : " + timelimit);
      getConf().setLong("fetcher.timelimit", timelimit);
    }

The default is -1, so this block is normally skipped. Note that fetcher.timelimit.mins is a duration in minutes, while the value written back under fetcher.timelimit is an absolute timestamp in milliseconds, computed once at the start of the whole job so that a failed and retried task does not push the limit forward.
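As a quick illustration (a standalone sketch, not Nutch code, with a hypothetical 30-minute limit), this is what the conversion produces:

import org.apache.hadoop.conf.Configuration;

public class TimelimitDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setLong("fetcher.timelimit.mins", 30);   // hypothetical value; the default -1 disables the limit

    long timelimit = conf.getLong("fetcher.timelimit.mins", -1);
    if (timelimit != -1) {
      // convert the duration in minutes into an absolute deadline in epoch milliseconds
      timelimit = System.currentTimeMillis() + (timelimit * 60 * 1000);
      conf.setLong("fetcher.timelimit", timelimit);  // the rest of the fetch job compares against this
    }
    System.out.println("fetcher.timelimit (absolute deadline, ms): " + conf.getLong("fetcher.timelimit", -1));
  }
}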

----------------------------------------------------

 

// Set the time limit after which the throughput threshold feature is enabled
    timelimit = getConf().getLong("fetcher.throughput.threshold.check.after", 10);
    timelimit = System.currentTimeMillis() + (timelimit * 60 * 1000);
    getConf().setLong("fetcher.throughput.threshold.check.after", timelimit);

The fetcher.throughput.threshold.check.after setting (in minutes, default 10) is converted in the same way into an absolute timestamp; only after that point does the fetcher start checking whether its throughput has fallen below the configured minimum.
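Roughly speaking, the check amounts to comparing recent pages-per-second against a minimum once the grace period has passed (the minimum itself comes from a separate property, fetcher.throughput.threshold.pages). The snippet below is only an illustrative sketch of that idea; the method and variable names are made up and this is not the fetcher's actual implementation:

// Illustrative only -- not Nutch code; all names here are hypothetical.
static boolean throughputTooLow(long startMillis, long checkAfterMillis,
                                long pagesFetched, int minPagesPerSecond) {
  long now = System.currentTimeMillis();
  if (now < checkAfterMillis) {
    return false;                               // still inside the grace period: do not check yet
  }
  double elapsedSeconds = (now - startMillis) / 1000.0;
  double pagesPerSecond = pagesFetched / Math.max(elapsedSeconds, 1.0);
  return pagesPerSecond < minPagesPerSecond;    // below the threshold: the caller may stop fetching
}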

-----------------------------------------------------------------------------------------------------------

Next comes the setup of maxOutlinkDepth.

The code is as follows:

int maxOutlinkDepth = getConf().getInt("fetcher.follow.outlinks.depth", -1);
    if (maxOutlinkDepth > 0) {
      LOG.info("Fetcher: following outlinks up to depth: " + Integer.toString(maxOutlinkDepth));
      int maxOutlinkDepthNumLinks = getConf().getInt("fetcher.follow.outlinks.num.links", 4);
      int outlinksDepthDivisor = getConf().getInt("fetcher.follow.outlinks.depth.divisor", 2);
      int totalOutlinksToFollow = 0;
      for (int i = 0; i < maxOutlinkDepth; i++) {
        totalOutlinksToFollow += (int)Math.floor(outlinksDepthDivisor / (i + 1) * maxOutlinkDepthNumLinks);
      }
      LOG.info("Fetcher: maximum outlinks to follow: " + Integer.toString(totalOutlinksToFollow));
    }

The depth is controlled by the following configuration property:

<property>
  <name>fetcher.follow.outlinks.depth</name>
  <value>-1</value>
  <description>(EXPERT)When fetcher.parse is true and this value is greater than 0 the fetcher will extract outlinks
  and follow until the desired depth is reached. A value of 1 means all generated pages are fetched and their first degree
  outlinks are fetched and parsed too. Be careful, this feature is in itself agnostic of the state of the CrawlDB and does not
  know about already fetched pages. A setting larger than 2 will most likely fetch home pages twice in the same fetch cycle.
  It is highly recommended to set db.ignore.external.links to true to restrict the outlink follower to URL's within the same
  domain. When disabled (false) the feature is likely to follow duplicates even when depth=1.
  A value of -1 or 0 disables this feature.
  </description>
</property>
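Because the expression outlinksDepthDivisor / (i + 1) uses integer division, deeper levels quickly contribute nothing to the budget. A standalone sketch with the default num.links=4 and divisor=2 and a hypothetical depth of 3 shows the arithmetic:

public class OutlinkBudgetDemo {
  public static void main(String[] args) {
    int maxOutlinkDepth = 3;          // hypothetical fetcher.follow.outlinks.depth
    int maxOutlinkDepthNumLinks = 4;  // default fetcher.follow.outlinks.num.links
    int outlinksDepthDivisor = 2;     // default fetcher.follow.outlinks.depth.divisor

    int totalOutlinksToFollow = 0;
    for (int i = 0; i < maxOutlinkDepth; i++) {
      // integer division: 2/1 = 2, 2/2 = 1, 2/3 = 0, so the per-level terms are 8, 4, 0
      totalOutlinksToFollow += (int) Math.floor(outlinksDepthDivisor / (i + 1) * maxOutlinkDepthNumLinks);
    }
    System.out.println("maximum outlinks to follow: " + totalOutlinksToFollow); // prints 12
  }
}

With these values the per-level contributions are 8, 4 and 0, so at most 12 outlinks would be followed in total.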

--------------------------------------------------------------

What follows is the actual Hadoop job; see the next section.

 
