Multithreaded web crawling with Perl

Perl is very capable at fetching web pages, so I tried using multiple threads to crawl them.

 

#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;
use LWP::UserAgent;
use LWP::ConnCache;
use HTML::TreeBuilder;
use URI;    # used below to resolve relative links

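# Queue of URLs still to fetch and the set of URLs already seen,
# both shared between all threads.
my @urls :shared;
my %uniq_url :shared;
my $starturl = $ARGV[0] or die "Usage: $0 <start-url>\n";
push @urls, $starturl;
$uniq_url{$starturl} = 1;
# Each thread gets its own copy of this user agent when it is created.
my $browser = LWP::UserAgent->new();
$browser->timeout(10);
$browser->protocols_allowed(['http','gopher']);    # other schemes are rejected
$browser->conn_cache(LWP::ConnCache->new());
my $num = 1;    # number of threads in the current batch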

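# Work through the queue in batches of at most 20 URLs, one thread per URL.
while (@urls > 0)
{
  if (@urls == 1)
  {
    # A single URL (such as the start page) is handled in the main thread.
    my $url = shift @urls;
    &parse($url, 'old');
  }
  if (@urls >= 2)
  {
    $num = @urls >= 20 ? 20 : scalar @urls;
    my @thread;
    for (my $j = 0; $j < $num; $j++)
    {
      # Threads started earlier in this batch may already be pushing new links.
      my $url = do { lock(@urls); shift @urls };
      $thread[$j] = threads->create(\&parse, $url, "thread$j");
    }
    # Wait for the whole batch before starting the next one.
    $_->join() for @thread;
  }
}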
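# Fetch one page, queue its unseen links, and print "url<TAB>title<TAB>type".
sub parse
{
  my ($url, $type) = @_;
  $| = 1;    # unbuffered output, so lines from different threads appear promptly
  my $response = $browser->get($url);
  unless ($response->is_success)
  {
    print "can't access $url: ", $response->status_line, "\n";
    return 0;
  }
  my $html = $response->content;
  # Stop collecting links once the queue holds 300 URLs.
  if (@urls < 300)
  {
    while ($html =~ /href="(.*?)"/ig)
    {
      my $new_url = URI->new_abs($1, $response->base)->as_string;
      # Hold the lock so the "seen?" check and the insert are atomic across threads.
      lock(%uniq_url);
      next if exists $uniq_url{$new_url};
      $uniq_url{$new_url} = 1;
      lock(@urls);
      push @urls, $new_url;
    }
  }
  my $root = HTML::TreeBuilder->new_from_content($html);
  my $title = $root->find_by_tag_name('title');
  my $str_title = $title ? $title->as_text() : 'no title';
  print "$url\t$str_title\t$type\n";
  $root->delete;    # free the parse tree; it is not released automatically by default
  return 1;
}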
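
The script above starts a fresh batch of threads for every round and joins them all before continuing. A different technique, not the one used above, is a fixed pool of workers fed from a Thread::Queue. The following is only a minimal sketch under some assumptions: handle_url is a hypothetical stand-in for the real fetch-and-parse step, the list of URLs is fixed up front (the crawler above keeps discovering new ones, which makes shutdown harder), and the end() method needs Thread::Queue 3.01 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical stand-in for the real work (fetching and parsing a page).
sub handle_url {
    my ($url) = @_;
    return length $url;
}

# Run a fixed number of worker threads over a fixed list of items.
sub run_pool {
    my ($worker_count, @items) = @_;
    my $todo = Thread::Queue->new(@items);
    my $done = Thread::Queue->new();
    $todo->end();    # nothing more will be added; dequeue returns undef once empty

    my @workers = map {
        threads->create(sub {
            while (defined(my $url = $todo->dequeue())) {
                $done->enqueue("$url\t" . handle_url($url));
            }
        });
    } 1 .. $worker_count;

    $_->join() for @workers;

    # All workers have finished, so the result queue can be drained.
    $done->end();
    my @results;
    while (defined(my $line = $done->dequeue())) {
        push @results, $line;
    }
    my @sorted = sort @results;    # worker order is not deterministic
    return @sorted;
}

print "$_\n" for run_pool(2, 'http://a.example/', 'http://b.example/page');
```

With a pool like this the number of threads stays constant, and a slow page holds up only one worker rather than the whole batch.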

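Crawling Baidu Music with this works reasonably well. The links and song titles could then be stored in MySQL, with a CGI + SQL fuzzy query on top, to get a simple search. It is all quite crude, so please bear with me.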
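
As a rough illustration of the fuzzy-query idea, here is a minimal sketch of building a LIKE pattern from a search term. The table songs and its columns url and title are assumptions of mine, not something the script above creates, and the backslash escaping relies on MySQL's default LIKE escape character.

```perl
use strict;
use warnings;

# Hypothetical schema: songs(url, title). The placeholder keeps user input out of the SQL text.
my $SEARCH_SQL = 'SELECT url, title FROM songs WHERE title LIKE ?';

# Turn a search term into a "contains" pattern, escaping LIKE wildcards.
sub like_pattern {
    my ($term) = @_;
    $term =~ s/([\\%_])/\\$1/g;    # escape backslash, % and _
    return "%$term%";
}

print like_pattern('100%_love'), "\n";

# With DBI and a connected handle $dbh, the query could be run as:
#   my $rows = $dbh->selectall_arrayref($SEARCH_SQL, undef, like_pattern($term));
```

Passing the pattern as a bind value rather than interpolating it into the SQL string matters in a CGI script, where the search term comes straight from the user.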
