PHPCrawl webcrawler (web crawler/spider)

 

1. PHPCrawl

PHPCrawl is a framework for crawling/spidering websites, written in PHP, so you can simply think of it as a webcrawler library or crawler engine for PHP.

PHPCrawl "spiders" websites and passes information about all found documents (pages, links, files ans so on) for futher processing to users of the library.

It provides several options to control the behaviour of the crawler, such as URL and content-type filters, cookie handling, robots.txt handling, limiting options, multiprocessing and much more.

PHPCrawl is completely free open-source software and is licensed under the GNU GENERAL PUBLIC LICENSE v2.

To get a first impression of how to use the crawler, you may want to take a look at the quickstart guide or an example in the manual section.
A complete reference and documentation of all available options and methods of the framework can be found in the class-references section.

The current version of the phpcrawl package and older releases can be downloaded from a SourceForge mirror.

Note to users of phpcrawl version 0.7x or before: Although some method names and parameters have changed in version 0.8, it should be fully compatible with older versions of phpcrawl.

 

Installation & Quickstart

The following steps show how to use phpcrawl:
  1. Unpack the phpcrawl-package somewhere. That's all you have to do for installation.
  2. Include the phpcrawl main class in your script or project. It's located in the "libs" path of the package.

     

    include("libs/PHPCrawler.class.php"); 

     

    There are no other includes needed.
  3. Extend the PHPCrawler class and override the handleDocumentInfo() method with your own code to process the information about every document the crawler finds on its way.

    class MyCrawler extends PHPCrawler
    {
      function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
      {
        // Your code comes here!
        // Do something with the $PageInfo-object that
        // contains all information about the currently 
        // received document.
    
        // As example we just print out the URL of the document
        echo $PageInfo->url."\n";
      }
    } 

     

    For a list of all available information about a page or file within the handleDocumentInfo-method see the PHPCrawlerDocumentInfo-reference.

    Note to users of phpcrawl 0.7x or before: The old, overridable method "handlePageData()", that receives the document-information as an array, still is present and gets called. PHPcrawl 0.8 is fully compatible with scripts written for earlier versions.
  4. Create an instance of that class in your script or project, define the behaviour of the crawler and start the crawling-process.

    $crawler = new MyCrawler();
    $crawler->setURL("www.foo.com");
    $crawler->addContentTypeReceiveRule("#text/html#");
    // ...
    
    $crawler->go();  

     

    For a list of all available setup-options/methods of the crawler take a look at the PHPCrawler-classreference.
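
    Beyond setURL() and addContentTypeReceiveRule(), the crawler offers further options such as robots.txt handling, page limits and follow modes. The following lines are only a sketch: the method names obeyRobotsTxt(), setPageLimit() and setFollowMode() are taken from the class reference and should be checked against the version you actually use.

    $crawler = new MyCrawler();
    $crawler->setURL("www.foo.com");

    // Respect the site's robots.txt (method name assumed from the class reference)
    $crawler->obeyRobotsTxt(true);

    // Stop after 50 requested pages (method name assumed from the class reference)
    $crawler->setPageLimit(50);

    // Only follow links that stay on the same host
    // (follow-mode value assumed from the class reference)
    $crawler->setFollowMode(2);

    $crawler->go();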

Tutorial: Example Script

The following code is a simple example of using phpcrawl.

The listed script just "spiders" some pages of www.php.net until a traffic limit of 1 MB is reached and prints out some information about all documents found.

Please note that this example script (and others) also comes in a file called "example.php" with the phpcrawl package. It's recommended to run it from the command line (PHP CLI).
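
If the package has been unpacked as described in the installation steps above (so that example.php sits next to the "libs" directory), a typical way to run it would be to change into that directory and call the PHP CLI binary, for example: php example.php (the file name is taken from the paragraph above; adjust the path to your setup).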

 <?php

// It may take a while to crawl a site ...
set_time_limit(10000);

// Include the phpcrawl main class
include("libs/PHPCrawler.class.php");

// Extend the class and override the handleDocumentInfo()-method 
class MyCrawler extends PHPCrawler 
{
  function handleDocumentInfo($DocInfo) 
  {
    // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";

    // Print the URL and the HTTP-status-Code
    echo "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")".$lb;
    
    // Print the referring URL
    echo "Referer-page: ".$DocInfo->referer_url.$lb;
    
    // Print whether the content of the document was received or not
    if ($DocInfo->received == true)
      echo "Content received: ".$DocInfo->bytes_received." bytes".$lb;
    else
      echo "Content not received".$lb; 
    
    // Now you could do something with the content of the actually
    // received page or file ($DocInfo->source); we skip it in this example
    
    echo $lb;
    
    flush();
  } 
}

// Now, create an instance of your class, define the behaviour
// of the crawler (see class-reference for more options and details)
// and start the crawling-process. 

$crawler = new MyCrawler();

// URL to crawl
$crawler->setURL("www.php.net");

// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");

// Ignore links to pictures, don't even request pictures
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$#i");

// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);

// Set the traffic-limit to 1 MB (in bytes,
// for testing we don't want to "suck" the whole site)
$crawler->setTrafficLimit(1000 * 1024);

// That's enough, now here we go
$crawler->go();

// At the end, after the process is finished, we print a short
// report (see method getProcessReport() for more information)
$report = $crawler->getProcessReport();

if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";
    
echo "Summary:".$lb;
echo "Links followed: ".$report->links_followed.$lb;
echo "Documents received: ".$report->files_received.$lb;
echo "Bytes received: ".$report->bytes_received." bytes".$lb;
echo "Process runtime: ".$report->process_runtime." sec".$lb; 
?> 
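
The multiprocessing capability mentioned in the introduction can replace the single-process go() call. The following is only a sketch based on the class reference for version 0.8: goMultiProcessed() is assumed to exist and to require the PHP CLI together with the PCNTL, POSIX and semaphore extensions, so verify this against the class reference before relying on it.

<?php
// Sketch only: spider a site with several processes instead of a single go() call.
// goMultiProcessed() and its requirements (CLI, PCNTL/POSIX/semaphore extensions)
// are taken from the class reference and should be checked for your version.
include("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
  function handleDocumentInfo($DocInfo)
  {
    echo $DocInfo->url."\n";
  }
}

$crawler = new MyCrawler();
$crawler->setURL("www.php.net");
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->setTrafficLimit(1000 * 1024);

// Spider the site with 5 parallel processes (assumed signature)
$crawler->goMultiProcessed(5);

$report = $crawler->getProcessReport();
echo "Links followed: ".$report->links_followed."\n";
echo "Documents received: ".$report->files_received."\n";
?>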

 

 

Source: http://cuab.de/

Download: http://sourceforge.net/projects/phpcrawl/files/PHPCrawl/

 

2. PHP Crawler

PHP Crawler is a simple website search script for small-to-medium websites. The only requirements are PHP and MySQL; no shell access is needed.

 

Source/Download: http://sourceforge.net/projects/php-crawler/

 

3. Crawl Web Pages In PHP And Jquery

Google crawls web pages and indexes them into its enormous database using a program called a spider. In the same spirit, this is a simple mini web crawler, written with PHP and jQuery, that crawls a given web page, extracts the links it contains and displays them: when a user types in a URL and clicks "crawl", the script fetches the whole page and lists the links found in it. A demo is available on the source page below, and the script can be downloaded for free.
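
The original demo script is not reproduced here. The following is only a rough sketch of the same idea in plain PHP (without the jQuery front-end), using file_get_contents() and DOMDocument from the PHP standard library to fetch one page and list the links it contains; the URL is just a placeholder.

<?php
// Sketch of a one-page link extractor (not the original demo code).
$url = "http://www.example.com/";

// Fetch the HTML of the requested page.
$html = @file_get_contents($url);
if ($html === false) {
    die("Could not fetch ".$url."\n");
}

// Parse the HTML and collect the href attribute of every <a> tag.
$dom = new DOMDocument();
@$dom->loadHTML($html);   // suppress warnings caused by malformed HTML

foreach ($dom->getElementsByTagName("a") as $anchor) {
    $href = $anchor->getAttribute("href");
    if ($href !== "") {
        echo $href."\n";
    }
}
?>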

 


 

Source: http://www.spixup.org/2013/04/crawl-web-pages-in-php-and-jquery.html

 

 

 
