Extracting page links and image links with LWP

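#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
use LWP::Simple;

# Download the page to a local file so that it can be parsed from disk.
my $url      = "http://www.51cto.com/";
my $filename = "51cto.html";
my $status_code = getstore($url, $filename);
die "Can't download $url (HTTP status $status_code)\n" unless is_success($status_code);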


my $stream = HTML::TokeParser->new($filename)|| die "Couldn't read HTML file $filename: $!";
my @href_array;
my @img_array;

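# Collect every link (href) and image (src) URL in one pass over the token stream
while (my $token = $stream->get_token()) {
    next unless $token->[0] eq 'S';    # only start tags carry attributes
    my ($tag, $attr) = @{$token}[1, 2];
    if ($tag eq 'a' && defined $attr->{'href'}) {
        push @href_array, $attr->{'href'};
    }
    if ($tag eq 'img' && defined $attr->{'src'}) {
        push @img_array, $attr->{'src'};
    }
}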
print "print inner herf\n============================\n";
foreach my $href_one(@href_array){
     print $href_one."\n";
}

print "print pic herf\n============================\n";
foreach my $img_one(@img_array){
    print $img_one."\n";
}

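# The get_token loop above has already consumed the whole stream, so each
# alternative below opens a fresh parser on the saved file.

# Alternative: collect all image links with get_tag
$stream = HTML::TokeParser->new($filename) or die "Couldn't read HTML file $filename: $!";
while (my $img_tag = $stream->get_tag('img')) {
    print $img_tag->[1]{'src'} . "\n" if defined $img_tag->[1]{'src'};
}

# Alternative: collect all links together with their link text
$stream = HTML::TokeParser->new($filename) or die "Couldn't read HTML file $filename: $!";
while (my $href_tag = $stream->get_tag('a')) {
    next unless defined $href_tag->[1]{'href'};
    # Prints lines such as: home => default.html
    print $stream->get_text('/a') . " => " . $href_tag->[1]{'href'} . "\n";
}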
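
An HTML::TokeParser object reads its input once, front to back, so each pass over the page needs its own parser. As a minimal sketch, the extraction can be wrapped in a function that parses an in-memory string (passing a reference to a scalar makes the parser treat it as the document itself). The helper name extract_links and the sample HTML are made up for illustration and are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns (\@hrefs, \@srcs) for an HTML string.
sub extract_links {
    my ($html) = @_;
    # A scalar reference tells TokeParser to parse the string, not a file.
    my $parser = HTML::TokeParser->new(\$html)
        or die "Couldn't create parser";
    my (@hrefs, @srcs);
    # get_tag skips ahead to the next <a> or <img> start tag.
    while (my $tag = $parser->get_tag('a', 'img')) {
        my ($name, $attr) = @{$tag}[0, 1];
        push @hrefs, $attr->{href} if $name eq 'a'   && defined $attr->{href};
        push @srcs,  $attr->{src}  if $name eq 'img' && defined $attr->{src};
    }
    return (\@hrefs, \@srcs);
}

# Made-up sample document; the last anchor has no href and is skipped.
my $sample = '<a href="/about.html">About</a><img src="logo.gif"><a name="top">x</a>';
my ($hrefs, $srcs) = extract_links($sample);
print "link:  $_\n" for @$hrefs;
print "image: $_\n" for @$srcs;
```

Because the function builds a new parser on every call, it can be called repeatedly on the same content without running into an exhausted stream.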
 

 

Output:

......

http://cn.made-in-china.com/
http://www.139shop.com/
http://www.51cto.com/about/aboutus_e.html
http://www.51cto.com/about/aboutus.html
http://www.51cto.com/about/zhaopin.html
http://www.51cto.com/about/contactus.html
http://www.51cto.com/about/history.html
http://www.51cto.com/about/links.html
http://www.51cto.com/php/guestbook
http://www.51cto.com/about/map.html
http://www.hd315.gov.cn/beian/view.asp?bianhao=010202006111400015
print pic herf
============================
http://home.51cto.com/public/themes/blue/images/top_bg_xian.gif
http://images.51cto.com/images/art/top_images/nav_ico1.gif
http://home.51cto.com/public/themes/blue/images/top_bg_xian.gif
http://images.51cto.com/images/art/top_images/nav_ico1.gif
http://home.51cto.com/public/themes/blue/images/top_bg_xian.gif
http://home.51cto.com/public/themes/blue/images/top_shequ.gif
http://images.51cto.com/images/art/top_images/nav_ico1.gif
http://images.51cto.com/images/index/Images/Logo.gif
http://www.51cto.com/book/s_waidian.gif
http://www.51cto.com/book/icon_yuanchuang.gif
http://www.51cto.com/book/icon_dujia.gif
http://images.51cto.com/images/tuijian_tit_bg.gif
http://images.51cto.com/images/index/Images/wrap_bg6_logo.gif
http://img1.51cto.com/images/expert/122234011545.jpg
http://img1.51cto.com/images/expert/130622222590.51
http://img1.51cto.com/images/expert/129983960396.jpg
......
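
The listing above repeats several image URLs and also includes links to other sites. One possible way to post-process the collected arrays is sketched below: drop duplicates, then keep only URLs on the 51cto.com domain. The helper names unique_urls and same_site are hypothetical, and the URI module is assumed to be installed (it normally comes along with LWP).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Hypothetical helper: drop repeated URLs, keeping first-seen order.
sub unique_urls {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Hypothetical helper: keep URLs whose host is $domain or a subdomain of it.
sub same_site {
    my ($domain, @urls) = @_;
    return grep {
        my $uri  = URI->new($_);
        # Relative or non-server URIs (e.g. mailto:) have no host method.
        my $host = $uri->can('host') ? $uri->host : undef;
        defined $host
            && (lc($host) eq lc($domain) || $host =~ /\.\Q$domain\E\z/i);
    } @urls;
}

# A few URLs taken from the output above.
my @urls = qw(
    http://home.51cto.com/public/themes/blue/images/top_bg_xian.gif
    http://images.51cto.com/images/art/top_images/nav_ico1.gif
    http://home.51cto.com/public/themes/blue/images/top_bg_xian.gif
    http://cn.made-in-china.com/
);
print "$_\n" for same_site('51cto.com', unique_urls(@urls));
```

Matching on the host rather than on the raw string avoids false positives such as a foreign URL that merely contains "51cto.com" somewhere in its path.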


 
