The Nutch Workflow: Fetch

1.      Overview

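The Fetch step takes URLs from the fetch list, fetches them, and parses them, producing the crawl_parse, crawl_fetch, parse_data, and parse_text directories along the way. This article walks through the overall Fetch flow, focusing on how each directory is produced and what it contains. The producer-consumer model used by Fetch is not covered here.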

2.      Details

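The fetch() method of the Fetcher class sets up the job that performs the fetch. In that setup, the call

job.setOutputFormat(GeneralChannelFetcherOutputFormat.class);

is the important one, because it controls how all of the directories discussed below are produced. (GeneralChannelFetcherOutputFormat is code modified from the Nutch source.)

The fetching itself is done in the run() method of the FetchThread class.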

 

ProtocolOutput output = protocol.getProtocolOutput(fit.url,fit.datum);

ProtocolStatus status = output.getStatus();

Content content = output.getContent();

These lines fetch the page source for the URL and place the result in a Content object.

 

Next, the code branches on the status code.

switch(status.getCode())
case ProtocolStatus.SUCCESS:        // got a page
                pstatus = output(fit.url, fit.datum, content, status,
                                 CrawlDatum.STATUS_FETCH_SUCCESS);

When the status is SUCCESS, the output method runs first. output is another important method, so let's look at it next.

private ParseStatus output(Text key, CrawlDatum datum,
                       Content content, ProtocolStatus pstatus, int status) {

      datum.setStatus(status);
      datum.setFetchTime(System.currentTimeMillis());
      if (pstatus != null) datum.getMetaData().put(Nutch.WRITABLE_PROTO_STATUS_KEY, pstatus);
      // The code above records the fetch state on the value (the CrawlDatum), e.g. the fetch time.

      ParseResult parseResult = null;
      if (content != null) {
        Metadata metadata = content.getMetadata();
        // add segment to metadata
        metadata.set(Nutch.SEGMENT_NAME_KEY, segmentName);
        // add score to content metadata so that ParseSegment can pick it up.
        try {
          scfilters.passScoreBeforeParsing(key, datum, content);
        } catch (Exception e) {
          if (LOG.isWarnEnabled()) {
            e.printStackTrace(LogUtil.getWarnStream(LOG));
            LOG.warn("Couldn't pass score, url " + key + " (" + e + ")");
          }
        }
        /* Note: Fetcher will only follow meta-redirects coming from the
         * original URL. */
        if (parsing && status == CrawlDatum.STATUS_FETCH_SUCCESS) {
          try {
            parseResult = this.parseUtil.parse(content); // parse the fetched page source
          } catch (Exception e) {
            LOG.warn("Error parsing: " + key + ": " + StringUtils.stringifyException(e));
          }

          if (parseResult == null) {
            byte[] signature =
              SignatureFactory.getSignature(getConf()).calculate(content,
                  new ParseStatus().getEmptyParse(conf));
            datum.setSignature(signature);
          }
        }

        /* Store status code in content So we can read this value during
         * parsing (as a separate job) and decide to parse or not.
         */
        content.getMetadata().add(Nutch.FETCH_STATUS_KEY, Integer.toString(status));
      }

      // This is where the class passed to setOutputFormat comes into play.
      try {
        output.collect(key, new NutchWritable(datum));
        if (content != null && storingContent)
          output.collect(key, new NutchWritable(content));
        if (parseResult != null) {
          for (Entry<Text, Parse> entry : parseResult) {
            Text url = entry.getKey();
            Parse parse = entry.getValue();
            ParseStatus parseStatus = parse.getData().getStatus();

            if (!parseStatus.isSuccess()) {
              LOG.warn("Error parsing: " + key + ": " + parseStatus);
              parse = parseStatus.getEmptyParse(getConf());
            }

            // Calculate page signature. For non-parsing fetchers this will
            // be done in ParseSegment
            byte[] signature =
              SignatureFactory.getSignature(getConf()).calculate(content, parse);
            // Ensure segment name and score are in parseData metadata
            parse.getData().getContentMeta().set(Nutch.SEGMENT_NAME_KEY,
                segmentName);
            parse.getData().getContentMeta().set(Nutch.SIGNATURE_KEY,
                StringUtil.toHexString(signature));
            // Pass fetch time to content meta
            parse.getData().getContentMeta().set(Nutch.FETCH_TIME_KEY,
                Long.toString(datum.getFetchTime()));
            if (url.equals(key))
              datum.setSignature(signature);
            try {
              scfilters.passScoreAfterParsing(url, content, parse);
            } catch (Exception e) {
              if (LOG.isWarnEnabled()) {
                e.printStackTrace(LogUtil.getWarnStream(LOG));
                LOG.warn("Couldn't pass score, url " + key + " (" + e + ")");
              }
            }
            output.collect(url, new NutchWritable(
                    new ParseImpl(new ParseText(parse.getText()),
                                  parse.getData(), parse.isCanonical())));
          }
        }
      } catch (IOException e) {
        if (LOG.isFatalEnabled()) {
          e.printStackTrace(LogUtil.getFatalStream(LOG));
          LOG.fatal("fetcher caught:" + e.toString());
        }
      }

 

The output.collect() calls are where the files get written. Let's see what GeneralChannelFetcherOutputFormat does.

 

public RecordWriter<Text, NutchWritable> getRecordWriter(final FileSystem fs,
                                     final JobConf job,
                                     final String name,
                                     final Progressable progress) throws IOException {

    Path out = FileOutputFormat.getOutputPath(job);
    final Path fetch =
      new Path(new Path(out, CrawlDatum.FETCH_DIR_NAME), name);
    /* crawl_fetch: a MapFile of key/datum pairs that holds each URL's status information */

    final Path content =
      new Path(new Path(out, Content.DIR_NAME), name);

    final CompressionType compType =
      SequenceFileOutputFormat.getOutputCompressionType(job);

    final MapFile.Writer fetchOut =
      new MapFile.Writer(job, fs, fetch.toString(), Text.class, CrawlDatum.class,
          compType, progress);

    return new RecordWriter<Text, NutchWritable>() {
        private MapFile.Writer contentOut;
        private RecordWriter<Text, Parse> parseOut;

        {
          if (GeneralChannelFetcher.isStoringContent(job)) {
            contentOut = new MapFile.Writer(job, fs, content.toString(),
                                            Text.class, Content.class,
                                            compType, progress);
          }

          if (GeneralChannelFetcher.isParsing(job)) {
            parseOut = new GeneralChannelParseOutputFormat().getRecordWriter(fs, job, name, progress);
          }
        }

        public void write(Text key, NutchWritable value)
          throws IOException {

          Writable w = value.get();

          if (w instanceof CrawlDatum)
            fetchOut.append(key, w);
          else if (w instanceof Content)
            contentOut.append(key, w);
          else if (w instanceof Parse)
            parseOut.write(key, (Parse)w);
        }

        public void close(Reporter reporter) throws IOException {
          fetchOut.close();
          if (contentOut != null) {
            contentOut.close();
          }
          if (parseOut != null) {
            parseOut.close(reporter);
          }
        }

      };

  }
}

 

As this shows, each value is written to a different directory depending on its type.

          if (w instanceof CrawlDatum)
            fetchOut.append(key, w);
          else if (w instanceof Content)
            contentOut.append(key, w);
          else if (w instanceof Parse)
            parseOut.write(key, (Parse)w);

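This snippet shows that crawl_fetch holds the CrawlDatum values, that is, each URL's fetch status information, and that content holds the page source. The remaining directories under segments are produced by another class, GeneralChannelParseOutputFormat.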
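The type-based routing can be sketched without Hadoop or Nutch on the classpath. In this minimal sketch, Datum, Page and Parsed are hypothetical stand-ins for CrawlDatum, Content and Parse, and the three lists stand in for the crawl_fetch writer, the content writer and the delegate parse writer. None of these names come from Nutch.

```java
import java.util.ArrayList;
import java.util.List;

final class RoutingSketch {
    // Hypothetical stand-ins for Nutch's CrawlDatum, Content and Parse.
    interface Value {}
    static final class Datum implements Value {}
    static final class Page implements Value {}
    static final class Parsed implements Value {}

    // In-memory lists standing in for the three output destinations.
    final List<String> crawlFetch = new ArrayList<>();
    final List<String> content = new ArrayList<>();
    final List<String> parseOut = new ArrayList<>();

    // Same shape as the instanceof chain in write(): the value type picks the destination.
    void write(String url, Value w) {
        if (w instanceof Datum) {
            crawlFetch.add(url);
        } else if (w instanceof Page) {
            content.add(url);
        } else if (w instanceof Parsed) {
            parseOut.add(url);
        }
    }

    public static void main(String[] args) {
        RoutingSketch out = new RoutingSketch();
        out.write("http://example.com/", new Datum());
        out.write("http://example.com/", new Page());
        out.write("http://example.com/", new Parsed());
        System.out.println("crawl_fetch=" + out.crawlFetch
            + " content=" + out.content + " parse=" + out.parseOut);
    }
}
```

One key can therefore show up in several directories, once per value type collected for it.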

Let's look at GeneralChannelParseOutputFormat next.

public RecordWriter<Text, Parse> getRecordWriter(FileSystem fs, JobConf job,
                                     String name, Progressable progress) throws IOException {

    this.filters = new URLFilters(job);
    this.normalizers = new URLNormalizers(job, URLNormalizers.SCOPE_OUTLINK);
    this.scfilters = new ScoringFilters(job);
    final int interval = job.getInt("db.fetch.interval.default", 2592000);
    final boolean ignoreExternalLinks = job.getBoolean("db.ignore.external.links", false);
    int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
    final int maxOutlinks = (maxOutlinksPerPage < 0) ? Integer.MAX_VALUE
                                                     : maxOutlinksPerPage;
    final CompressionType compType = SequenceFileOutputFormat.getOutputCompressionType(job);
    Path out = FileOutputFormat.getOutputPath(job);

    Path text = new Path(new Path(out, ParseText.DIR_NAME), name);
    Path data = new Path(new Path(out, ParseData.DIR_NAME), name);
    Path crawl = new Path(new Path(out, CrawlDatum.PARSE_DIR_NAME), name);

   

    final String[] parseMDtoCrawlDB = job.get("db.parsemeta.to.crawldb", "").split(" *, *");

   

    final MapFile.Writer textOut =
      new MapFile.Writer(job, fs, text.toString(), Text.class, ParseText.class,
          CompressionType.RECORD, progress);

    final MapFile.Writer dataOut =
      new MapFile.Writer(job, fs, data.toString(), Text.class, ParseData.class,
          compType, progress);

    final SequenceFile.Writer crawlOut =
      SequenceFile.createWriter(fs, job, crawl, Text.class, CrawlDatum.class,
          compType, progress);

    return new RecordWriter<Text, Parse>() {

        public void write(Text key, Parse parse)
          throws IOException {

          String[] secondleveldoamin = new String[]{"org","com","edu","net","ac","gov"};
          String fromUrl = key.toString();
          String fromHost = null;
          String toHost = null;
          String fromdomain = null;
          String todomain = null;

          textOut.append(key, new ParseText(parse.getText()));

          ParseData parseData = parse.getData();
          // recover the signature prepared by Fetcher or ParseSegment
          String sig = parseData.getContentMeta().get(Nutch.SIGNATURE_KEY);
          if (sig != null) {
            byte[] signature = StringUtil.fromHexString(sig);
            if (signature != null) {
              // append a CrawlDatum with a signature
              CrawlDatum d = new CrawlDatum(CrawlDatum.STATUS_SIGNATURE, 0);
              d.setSignature(signature);
              crawlOut.append(key, d);
            }
          }

          // see if the parse metadata contain things that we'd like
          // to pass to the metadata of the crawlDB entry
          CrawlDatum parseMDCrawlDatum = null;
          for (String mdname : parseMDtoCrawlDB) {
            String mdvalue = parse.getData().getParseMeta().get(mdname);
            if (mdvalue != null) {
              if (parseMDCrawlDatum == null) parseMDCrawlDatum = new CrawlDatum(
                  CrawlDatum.STATUS_PARSE_META, 0);
              parseMDCrawlDatum.getMetaData().put(new Text(mdname),
                  new Text(mdvalue));
            }
          }
          if (parseMDCrawlDatum != null) crawlOut.append(key, parseMDCrawlDatum);

          try {
            ParseStatus pstatus = parseData.getStatus();
            if (pstatus != null && pstatus.isSuccess() &&
                pstatus.getMinorCode() == ParseStatus.SUCCESS_REDIRECT) {
              String newUrl = pstatus.getMessage();
              int refreshTime = Integer.valueOf(pstatus.getArgs()[1]);
              try {
                newUrl = normalizers.normalize(newUrl,
                    URLNormalizers.SCOPE_FETCHER);
              } catch (MalformedURLException mfue) {
                newUrl = null;
              }
              if (newUrl != null) newUrl = filters.filter(newUrl);
              String url = key.toString();
              if (newUrl != null && !newUrl.equals(url)) {
                String reprUrl =
                  URLUtil.chooseRepr(url, newUrl,
                                     refreshTime < Fetcher.PERM_REFRESH_TIME);
                CrawlDatum newDatum = new CrawlDatum();
                newDatum.setStatus(CrawlDatum.STATUS_LINKED);
                if (reprUrl != null && !reprUrl.equals(newUrl)) {
                  newDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY,
                                             new Text(reprUrl));
                }
                crawlOut.append(new Text(newUrl), newDatum);
              }
            }
          } catch (URLFilterException e) {
            // ignore
          }

 

          // collect outlinks for subsequent db update
          Outlink[] links = parseData.getOutlinks();
          int outlinksToStore = Math.min(maxOutlinks, links.length);
          if (ignoreExternalLinks) {
            try {
              /* Modified here: external links are filtered by domain, keeping
               * links that share the source page's domain. Stock Nutch keeps
               * links that share the same host. */
              fromHost = new URL(fromUrl).getHost().toLowerCase();
              String[] fromHosts = fromHost.split("\\.");
              int i = fromHosts.length - 1;
              if (fromHosts[i].equals("cn")) {
                for (int j = 0; j < secondleveldoamin.length; j++) {
                  if (fromHosts[i-1].equals(secondleveldoamin[j])) {
                    fromdomain = fromHosts[i-2];
                    break;
                  }
                  else
                    continue;
                }
                if (fromdomain == null)
                  fromdomain = fromHosts[i-1];
              }
              if (fromHosts[i].equals("org") || fromHosts[i].equals("com")
                  || fromHosts[i].equals("net"))
                fromdomain = fromHosts[i-1];
            } catch (MalformedURLException e) {
              fromHost = null;
            }
          } else {
            fromHost = null;
          }

 

          int validCount = 0;
          CrawlDatum adjust = null;
          List<Entry<Text, CrawlDatum>> targets = new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore);
          List<Outlink> outlinkList = new ArrayList<Outlink>(outlinksToStore);
          for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
            String toUrl = links[i].getToUrl();
            // ignore links to self (or anchors within the page)
            if (fromUrl.equals(toUrl)) {
              continue;
            }
            if (ignoreExternalLinks) {
              todomain = null; // reset so the previous link's domain does not carry over
              try {
                toHost = new URL(toUrl).getHost().toLowerCase();
                String[] toHosts = toHost.split("\\.");
                int k = toHosts.length - 1;
                if (toHosts[k].equals("cn")) {
                  for (int j = 0; j < secondleveldoamin.length; j++) {
                    if (toHosts[k-1].equals(secondleveldoamin[j])) {
                      todomain = toHosts[k-2];
                      break;
                    }
                    else
                      continue;
                  }
                  if (todomain == null)
                    todomain = toHosts[k-1];
                }
                if (toHosts[k].equals("org") || toHosts[k].equals("com")
                    || toHosts[k].equals("net"))
                  todomain = toHosts[k-1];

              } catch (MalformedURLException e) {
                toHost = null;
              }
              if (todomain == null || !todomain.equals(fromdomain)) { // external links
                continue; // skip it
              }
//             if(toHost==null||!toHost.equals(fromHost)){
//              continue;
//              }
            }
            try {
              toUrl = normalizers.normalize(toUrl,
                          URLNormalizers.SCOPE_OUTLINK); // normalize the url
              toUrl = filters.filter(toUrl);   // filter the url
              if (toUrl == null) {
                continue;
              }
            } catch (Exception e) {
              continue;
            }
            CrawlDatum target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
            Text targetUrl = new Text(toUrl);
            try {
              scfilters.initialScore(targetUrl, target);
            } catch (ScoringFilterException e) {
              LOG.warn("Cannot filter init score for url " + key +
                       ", using default: " + e.getMessage());
              target.setScore(0.0f);
            }

            targets.add(new SimpleEntry(targetUrl, target));
            outlinkList.add(links[i]);
            validCount++;
          }

          try {
            // compute score contributions and adjustment to the original score
            adjust = scfilters.distributeScoreToOutlinks((Text)key, parseData,
                      targets, null, links.length);
          } catch (ScoringFilterException e) {
            LOG.warn("Cannot distribute score from " + key + ":" + e.getMessage());
          }
          for (Entry<Text, CrawlDatum> target : targets) {
            crawlOut.append(target.getKey(), target.getValue());
          }
          if (adjust != null) crawlOut.append(key, adjust);

          Outlink[] filteredLinks = outlinkList.toArray(new Outlink[outlinkList.size()]);
          parseData = new ParseData(parseData.getStatus(), parseData.getTitle(),
                                    filteredLinks, parseData.getContentMeta(),
                                    parseData.getParseMeta());
          dataOut.append(key, parseData);
          if (!parse.isCanonical()) {
            CrawlDatum datum = new CrawlDatum();
            datum.setStatus(CrawlDatum.STATUS_FETCH_SUCCESS);
            String timeString = parse.getData().getContentMeta().get(Nutch.FETCH_TIME_KEY);
            try {
              datum.setFetchTime(Long.parseLong(timeString));
            } catch (Exception e) {
              LOG.warn("Can't read fetch time for: " + key);
              datum.setFetchTime(System.currentTimeMillis());
            }
            crawlOut.append(key, datum);
          }
        }

        public void close(Reporter reporter) throws IOException {
          textOut.close();
          dataOut.close();
          crawlOut.close();
        }

      };

  }

}

 

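This class does roughly the following. It produces three directories: crawl_parse, parse_text, and parse_data. parse_text holds the text extracted from the page. The most important content of crawl_parse is the outlink information, taken from the Outlink entries in ParseData and converted to CrawlDatum records, each marked with CrawlDatum.STATUS_LINKED.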

crawl_parse also contains some other entries, but if all you need is the outlinks, selecting the entries whose status is STATUS_LINKED is enough.
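Selecting the linked entries amounts to a filter on the status field. The following is a minimal sketch of that idea only: Status and Row are hypothetical stand-ins for the CrawlDatum status constants and for a (URL, CrawlDatum) record in crawl_parse, and the sketch does not read a real SequenceFile.

```java
import java.util.ArrayList;
import java.util.List;

final class LinkedFilterSketch {
    // Hypothetical stand-in for the CrawlDatum statuses written to crawl_parse.
    enum Status { SIGNATURE, PARSE_META, LINKED, FETCH_SUCCESS }

    // Hypothetical stand-in for one (url, CrawlDatum) record.
    static final class Row {
        final String url;
        final Status status;
        Row(String url, Status status) {
            this.url = url;
            this.status = status;
        }
    }

    // Keeps only the URLs whose record is marked LINKED, in input order.
    static List<String> linkedUrls(List<Row> crawlParse) {
        List<String> urls = new ArrayList<>();
        for (Row r : crawlParse) {
            if (r.status == Status.LINKED) {
                urls.add(r.url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("http://example.com/", Status.SIGNATURE));
        rows.add(new Row("http://example.com/a", Status.LINKED));
        rows.add(new Row("http://example.com/b", Status.LINKED));
        System.out.println(linkedUrls(rows));
    }
}
```

Note that the write() method above also gives meta-redirect targets the STATUS_LINKED status, so a filter like this returns those URLs together with ordinary outlinks.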

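  The code also has a configurable setting, ignoreExternalLinks, which is read from db.ignore.external.links. This boolean controls whether outlinks that point outside the source site are dropped. Outlinks are what is used to update the crawldb; you can also set db.update.additions.allowed to control whether newly discovered outlinks are added to the crawldb.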
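How the two outlink settings are read can be sketched with java.util.Properties standing in for Hadoop's JobConf. The property names and default values are the ones in the listing above; the class name OutlinkSettingsSketch and the use of Properties are assumptions made for this illustration.

```java
import java.util.Properties;

final class OutlinkSettingsSketch {
    final boolean ignoreExternalLinks;
    final int maxOutlinks;

    // Properties is a stand-in for JobConf here, not what Nutch itself uses.
    OutlinkSettingsSketch(Properties conf) {
        ignoreExternalLinks =
            Boolean.parseBoolean(conf.getProperty("db.ignore.external.links", "false"));
        int perPage =
            Integer.parseInt(conf.getProperty("db.max.outlinks.per.page", "100"));
        // As in the listing, a negative value means "no limit".
        maxOutlinks = (perPage < 0) ? Integer.MAX_VALUE : perPage;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("db.ignore.external.links", "true");
        conf.setProperty("db.max.outlinks.per.page", "-1");
        OutlinkSettingsSketch s = new OutlinkSettingsSketch(conf);
        System.out.println(s.ignoreExternalLinks + " " + s.maxOutlinks);
    }
}
```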

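  When ignoreExternalLinks is true, you can change the selection rule to keep the outlinks you want. Stock Nutch keeps outlinks whose host matches the source page's host; the code above keeps outlinks whose domain matches.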
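The difference between the host rule and the domain rule can be sketched as a pair of standalone functions. This is a simplified restatement of the rule in the listing, not the listing itself: the names DomainRuleSketch, domainLabel, sameDomain and sameHost are invented for the example, and the bounds checks are an addition that the original code does not have.

```java
import java.util.Arrays;
import java.util.List;

final class DomainRuleSketch {
    // Second-level labels the listing recognises under .cn.
    static final List<String> SECOND_LEVEL =
        Arrays.asList("org", "com", "edu", "net", "ac", "gov");

    // Returns the label used as the "domain", or null when the TLD is not handled.
    static String domainLabel(String host) {
        String[] parts = host.toLowerCase().split("\\.");
        int i = parts.length - 1;
        if (i < 1) {
            return null; // bounds check added in this sketch
        }
        String tld = parts[i];
        if (tld.equals("cn")) {
            if (i >= 2 && SECOND_LEVEL.contains(parts[i - 1])) {
                return parts[i - 2];
            }
            return parts[i - 1];
        }
        if (tld.equals("org") || tld.equals("com") || tld.equals("net")) {
            return parts[i - 1];
        }
        return null;
    }

    // Domain rule: keep the link when both hosts reduce to the same label.
    static boolean sameDomain(String fromHost, String toHost) {
        String from = domainLabel(fromHost);
        String to = domainLabel(toHost);
        return to != null && to.equals(from);
    }

    // Host rule, as in stock Nutch: keep the link only when the hosts are equal.
    static boolean sameHost(String fromHost, String toHost) {
        return fromHost.equalsIgnoreCase(toHost);
    }

    public static void main(String[] args) {
        System.out.println(sameHost("www.example.com.cn", "news.example.com.cn"));
        System.out.println(sameDomain("www.example.com.cn", "news.example.com.cn"));
    }
}
```

Two consequences of this rule are worth knowing. Hosts under any other top-level domain yield null and are always treated as external, and because only one label is compared, example.com and example.org count as the same domain.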

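  Once the outlinks have been selected, a new ParseData is constructed from the array of selected outlinks and written out, which produces the parse_data directory.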
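The selection loop that feeds the rebuilt ParseData can be reduced to its two basic rules: skip links back to the page itself, and stop once the per-page cap is reached. The sketch below shows only those two rules on plain strings and leaves out the domain check, normalization and URL filtering. OutlinkSelectSketch and select are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OutlinkSelectSketch {
    // Returns at most min(maxOutlinks, links.size()) links, skipping self-links.
    static List<String> select(String fromUrl, List<String> links, int maxOutlinks) {
        int limit = Math.min(maxOutlinks, links.size());
        List<String> kept = new ArrayList<>(limit);
        for (int i = 0; i < links.size() && kept.size() < limit; i++) {
            String toUrl = links.get(i);
            if (fromUrl.equals(toUrl)) {
                continue; // ignore links to self
            }
            kept.add(toUrl);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
            "http://example.com/", "http://example.com/a",
            "http://example.com/b", "http://example.com/c");
        System.out.println(select("http://example.com/", links, 2));
    }
}
```

Because the filtered list replaces the original outlink array, parse_data ends up holding only the links that passed selection.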
