Arachnid Web Spider Framework (2)

PageInfo.java defines the page object. It abstracts the main elements of a web page and encapsulates the methods for retrieving them.

public URL getUrl() { return(url); }              // the page's own URL
public URL getParentUrl() { return(parentUrl); }  // the URL of the page that linked here
public String getTitle() { return(title); }
public URL[] getLinks() { return(links); }
public URL[] getImages() { return(images); }
public String getContentType() { return(contentType); }
public boolean isValid() { return(valid); }
public int getResponseCode() { return responseCode; }
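
For orientation, these getters correspond to fields roughly like the following. This is a sketch reconstructed from the getters and from the dummy array used in extract() below; the exact declarations in the original source may differ:

private URL url;               // the page's own URL
private URL parentUrl;         // the page that linked to this one
private String title;          // text of the first <title> tag, or null
private URL[] links;           // absolute link targets, or null
private URL[] images;          // absolute image sources, or null
private String contentType;    // e.g. "text/html"
private int responseCode;      // HTTP status code
private boolean valid;         // set true only when extract() succeeds
private static final URL[] dummy = new URL[0];  // type token for toArray()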

The most important method in the class is extract:

public void extract(Reader reader) throws IOException
 {
  // Note: contentLength of -1 means UNKNOWN
  if (reader == null  || url == null ||
   responseCode != HttpURLConnection.HTTP_OK ||
   contentLength == 0 || contentType.equalsIgnoreCase(HTML) == false) { 
   valid = false;
   return;
  }
  WebPageXtractor x = new WebPageXtractor();
  try { x.parse(reader); }
  catch(EOFException e) {
   valid = false;
   return;
  }
  catch(SocketTimeoutException e) {
   valid = false;
   throw(e);
  }
  catch(IOException e) {
   valid = false;
   return;
  }
  ArrayList rawlinks = x.getLinks();
  ArrayList rawimages = x.getImages();
  
  // Get web page title (1st title if more than one!)
  ArrayList rawtitle = x.getTitle();
  if (rawtitle.isEmpty()) title = null;
  else title = new String((String)rawtitle.get(0));
  
  // Get links
  int numelem = rawlinks.size();
  if (numelem == 0) links = null;
  else {
   ArrayList t = new ArrayList();
   for (int i = 0; i < numelem; i++) {
    String slink = (String)rawlinks.get(i);
    try {
     URL link = new URL(url,slink);
     t.add(link);
    }
    catch(MalformedURLException e) { /* Ignore */ }
   }
   if (t.isEmpty()) links = null;
   else links = (URL[])t.toArray(dummy);
  }
  
  // Get images
  numelem = rawimages.size();
  if (numelem == 0) images = null;
  else {
   ArrayList t = new ArrayList();
   for (int i = 0; i < numelem; i++) {
    String simage = (String)rawimages.get(i);
    try {
     URL image = new URL(url,simage);
     t.add(image);
    }
    catch(MalformedURLException e) { /* Ignore */ }
   }
   if (t.isEmpty()) images = null;
   else images = (URL[])t.toArray(dummy);
  }

  // Set valid flag
  valid = true;
 }
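
Note the two-argument java.net.URL constructor used above, new URL(url, slink): it resolves a possibly relative href or src against the current page's URL, so relative and absolute links alike come out as absolute URLs. A quick illustration with made-up addresses:

URL base = new URL("http://example.com/docs/index.html");
new URL(base, "page2.html");         // http://example.com/docs/page2.html
new URL(base, "../img/logo.gif");    // http://example.com/img/logo.gif
new URL(base, "http://other.org/");  // http://other.org/ (absolute spec wins)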

This method delegates the actual parsing of the web page to WebPageXtractor, then stores the extracted values (title, links, images) in the page's fields so they can be read back through the getters above. Note the asymmetric error handling: EOFException and plain IOException just mark the page invalid and return, while SocketTimeoutException marks it invalid and is rethrown to the caller.
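
To see where extract() fits into a crawl, here is a minimal hypothetical driver. How PageInfo actually receives its url, responseCode, contentLength and contentType (constructor vs. setters) is not shown in this excerpt, so that part is assumed:

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URL;

URL target = new URL("http://example.com/");
HttpURLConnection conn = (HttpURLConnection) target.openConnection();
conn.setReadTimeout(5000);

PageInfo page = new PageInfo(target, null);  // assumed signature: (url, parentUrl)
// extract() also checks responseCode, contentLength and contentType,
// so those would have to be copied from conn into page first (mechanism assumed).

Reader reader = new InputStreamReader(conn.getInputStream());
try {
    page.extract(reader);
} catch (SocketTimeoutException e) {
    // extract() marks the page invalid and rethrows timeouts
} finally {
    reader.close();
}

if (page.isValid() && page.getLinks() != null) {
    for (int i = 0; i < page.getLinks().length; i++) {
        // enqueue page.getLinks()[i] for crawling ...
    }
}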

WebPageXtractor extends SimpleHTMLParser, so the actual decomposition of the page into its elements is implemented by SimpleHTMLParser.
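
The division of labor is the classic template-method pattern: the base parser walks the HTML stream and fires callbacks, and the subclass collects what it cares about. The callback name below (handleStartTag) and its signature are hypothetical placeholders, not SimpleHTMLParser's real API, which is examined next:

import java.util.ArrayList;
import java.util.Map;

// Hypothetical sketch of the subclass pattern; the real callback names differ.
public class WebPageXtractor extends SimpleHTMLParser {
    private final ArrayList links = new ArrayList();
    private final ArrayList images = new ArrayList();

    // Assumed callback invoked by the base parser for each opening tag
    public void handleStartTag(String tag, Map attrs) {
        if (tag.equalsIgnoreCase("a") && attrs.get("href") != null)
            links.add(attrs.get("href"));
        else if (tag.equalsIgnoreCase("img") && attrs.get("src") != null)
            images.add(attrs.get("src"));
    }

    public ArrayList getLinks()  { return links; }
    public ArrayList getImages() { return images; }
}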

The next step is to take a closer look at some of SimpleHTMLParser's methods.
