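Recently I wanted to download some documents from our company website, but the site enforces a limit on total download size, so I had no choice but to write a script to fetch them.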
Environment 1: Windows XP, Strawberry Perl 5.20 (wiki: http://win32.perl.org/wiki/index.php?title=Strawberry_Perl)
Environment 2: Linux, perl 5.8.8
Packet capture tools:
HttpWatch Professional v6.0.14 (IE plug-in)
I also tried Firebug and Live HTTP Headers on Firefox and the Developer Tools in Chrome, but HttpWatch gave me the most satisfactory captures.
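Until now I logged in by pasting the target link into the browser, which then popped up a dialog asking for my account and password.
To script this I needed to capture the traffic. I had always used Firebug for that, but it only showed GET messages. The last page it captured, with status 200, had the following content: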
<HTML>
<!-- File: llgettz.html -->
<SCRIPT Language="Javascript1.2">
function getTime()
{
    var llglogin_CurrentClientTime = new Date()
    var llglogin_year = ( llglogin_CurrentClientTime.getFullYear == null ) ? llglogin_CurrentClientTime.getYear() : llglogin_CurrentClientTime.getFullYear()
    var llglogin_month = llglogin_CurrentClientTime.getMonth() + 1
    var llglogin_date = llglogin_CurrentClientTime.getDate()
    var llglogin_hour = llglogin_CurrentClientTime.getHours()
    var llglogin_minute = llglogin_CurrentClientTime.getMinutes()
    var llglogin_second = llglogin_CurrentClientTime.getSeconds()
    document.LoginForm.CurrentClientTime.value = 'D/' + llglogin_year + '/' + llglogin_month + '/' + llglogin_date
    document.LoginForm.CurrentClientTime.value += ':' + llglogin_hour + ':' + llglogin_minute + ':' + llglogin_second
    document.LoginForm.submit()
}
</SCRIPT>
<BODY BGCOLOR="#FFFFFF" BACKGROUND="/img/pattern.gif" onLoad="getTime()">
<FORM NAME="LoginForm" METHOD="POST" ACTION="/livelink/livelink.exe">
    <INPUT TYPE="HIDDEN" NAME="func" VALUE="ll.login">
    <INPUT TYPE="HIDDEN" NAME="CurrentClientTime" VALUE="">
    <INPUT TYPE="HIDDEN" NAME="NextURL" VALUE="/livelink/livelink.exe?Redirect=1">
</FORM>
Your browser should have redirected you to the next Livelink page. Please click on the link below to continue.
<BR><BR>
<A HREF="/livelink/livelink.exe?Redirect=1">/livelink/livelink.exe?Redirect=1</A>
</BODY>
</HTML>
<!-- End File: llgettz.html -->
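In fact, the capture shows that the site really does not return a cookie at this point. The cookie only comes back after the POST to the login page. So I still do not understand why the cookie can be obtained at this stage under Windows. (I asked on both CU and CSDN and nobody answered, so if anyone knows, I would appreciate an explanation.)
Capturing with HttpWatch does show the POST, whereas neither Firebug nor Live HTTP Headers caught it.
At this point the browser runs the JavaScript in the returned page, which fills in and submits this form: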
<FORM NAME="LoginForm" METHOD="POST" ACTION="/livelink/livelink.exe">
    <INPUT TYPE="HIDDEN" NAME="func" VALUE="ll.login">
    <INPUT TYPE="HIDDEN" NAME="CurrentClientTime" VALUE="">
    <INPUT TYPE="HIDDEN" NAME="NextURL" VALUE="/livelink/livelink.exe?Redirect=1">
</FORM>
So when the script initializes, it should simulate this POST to complete the login (none of the earlier GET steps are needed) and save the cookie. After that it can GET or POST files freely.
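The hidden inputs of that form map directly onto a POST body. The full script further down posts only func=ll.login, and that was enough for this site. In case a server insists on the CurrentClientTime field, here is a minimal sketch that builds it the same way the page's JavaScript does. The helper names client_time_field and login_form_fields are hypothetical, made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: format a timestamp the way the page's getTime() does,
# i.e. 'D/year/month/day:hour:minute:second' with no zero padding.
sub client_time_field {
    my ($epoch) = @_;
    $epoch = time unless defined $epoch;
    my ($sec, $min, $hour, $mday, $mon, $year) = localtime($epoch);
    return sprintf 'D/%d/%d/%d:%d:%d:%d',
        $year + 1900, $mon + 1, $mday, $hour, $min, $sec;
}

# Hypothetical helper: the three hidden inputs of LoginForm as a
# key/value list, in the order they appear in the form.
sub login_form_fields {
    my ($epoch) = @_;
    return [
        func              => 'll.login',
        CurrentClientTime => client_time_field($epoch),
        NextURL           => '/livelink/livelink.exe?Redirect=1',
    ];
}

my $fields = login_form_fields();
print "$fields->[2] = $fields->[3]\n";
```

An array reference like this can be handed to LWP::UserAgent's post method as the form argument, which URL-encodes it and sets the content type for you.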
From now on, whenever I need to log in to a site and then pull information from it, I can simply follow the same pattern.
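The script that follows authenticates by sending an Authorization header built from login:password. Here is a small sketch of that one step in isolation, using only MIME::Base64. The function name basic_auth_header is hypothetical. One detail to be aware of: encode_base64 appends a line ending by default, so the sketch passes an empty string as the second argument to get a bare value.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Hypothetical helper: build the value of an HTTP Basic Authorization header.
sub basic_auth_header {
    my ($login, $password) = @_;
    die "Error: Login NONE.\n"    unless defined $login    && length $login;
    die "Error: Password NONE.\n" unless defined $password && length $password;
    # The empty second argument suppresses the trailing newline.
    return 'Basic ' . encode_base64("$login:$password", '');
}

print basic_auth_header('user', 'pass'), "\n";   # prints: Basic dXNlcjpwYXNz
```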
#!/usr/bin/perl
package Livelink;

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Cookies;
use MIME::Base64 qw(encode_base64);
use URI;

my $Basic_Url = "https://wcdma-ll.app.alcatel-lucent.com/livelink/livelink.exe";

sub new {
    my $invocant = shift;
    my $class = ref $invocant || $invocant;
    my $self = {
        login    => shift,
        password => shift,
    };
    bless $self, $class;
    $self->init();
    return $self;
}

sub init {
    my $self = shift;
    $self->{browser} = LWP::UserAgent->new(ssl_opts => {verify_hostname => 0});
    # Set the HTTP proxy
    $self->{browser}->proxy('http', 'http://135.251.33.31:80');

    # Base64-encode the user name and password
    die "Error: Login NONE.\n" if (! $self->{login});
    die "Error: Password NONE.\n" if (! $self->{password});
    # The empty second argument keeps encode_base64 from appending a newline
    $self->{encode_login} = encode_base64($self->{login} . ":" . $self->{password}, "");

    # Imitate a browser's headers. The site I access is a little unusual:
    # it authenticates through the Authorization field of the request header.
    # Logging in through the credentials method is also possible.
    my @headers = (
        'User-Agent'  => 'Mozilla/5.0',
        Authorization => "Basic $self->{encode_login}",
    );
    # default_header is called on the LWP::UserAgent object, so it is global:
    # every later request carries the @headers information.
    # HTTP::Request objects have a header method that is used the same way,
    # but its headers apply to that one request only.
    $self->{browser}->default_header(@headers);

    $self->{cookie_jar} = HTTP::Cookies->new;
    # Set the cookie jar. As above, a setting made on the LWP::UserAgent
    # object applies to every later request.
    $self->{browser}->cookie_jar( $self->{cookie_jar} );

    # Build the content of the POST from the form submitted by the JavaScript above.
    my $content = {
        # NextURL => '/livelink/livelink.exe?Redirect=1',
        func => 'll.login',
    };
    # Using $self->{browser}->post here is the simpler option.
    # If you POST through HTTP::Request->new(POST => $Basic_Url) instead,
    # you must set $req->content_type before filling in $req->content.
    #my $resp = $self->{browser}->post( $Basic_Url, Content => $content );
    my $req = HTTP::Request->new(POST => $Basic_Url);
    $req->content_type('application/x-www-form-urlencoded');
    $req->content('func=ll.login');
    my $resp = $self->{browser}->request($req);
    if ($resp->is_success) {
        print "Login Success.\n";
    } else {
        print "Login Error.\n" . $resp->status_line . "\n";
        exit 1;
    }
}

sub get_web_content {
    my $self = shift;
    my ($url) = @_;
    print "\tNow, processing $url\n";
    while (1) {
        # Issue the GET through an HTTP::Request object.
        # The effect is the same as $self->{browser}->get($url); it is done this
        # way so that $request->headers_as_string can be printed afterwards.
        my $request = HTTP::Request->new('GET', $url);
        # default_header is already set, so there is no need to repeat it here
        #$request->header(Authorization => "Basic $self->{encode_login}");
        my $response = $self->{browser}->request($request);
        $self->{cookie_jar}->extract_cookies($response);
        print "=== Cookies:\n", $self->{cookie_jar}->as_string, "\n";
        if ($response->is_success) {
            # Print the status, request headers and response headers.
            # Once the request succeeds (200 OK) the cookie value is available.
            print "Login Success.\n** " . $response->status_line . " **\n";
            print "=== Request header: \n", $request->headers_as_string, "\n";
            print "=== Response header: \n", $response->headers_as_string, "\n";
            return $response->content;
        }
        print "Login Error.\n** " . $response->status_line . " **\n";
        print $response->content . "\n";
        # On a redirect, rebuild the URL from the link in the body and try again
        my ($redirect_url) = ($response->content =~ m/<[aA] (?:[hH]ref|HREF)=\"(.*?)\"/);
        if ($response->code == 302 && defined $redirect_url) {
            print "\n=== redirect url is $redirect_url\n";
            if ($redirect_url =~ /http/) {
                $url = $redirect_url;
            } else {
                # Some redirect links contain only the trailing part, so use
                # URI->new_abs to resolve them against $response->base
                $url = URI->new_abs($redirect_url, $response->base);
            }
            next;
        }
        exit 1;
    }
}

sub get_file {
    my $self = shift;
    my $object_id = shift;
    die "Invalid object id!\n" unless $object_id;
    my $ll_url = "$Basic_Url/open/$object_id";
    return $self->get_web_content($ll_url);
}

package main;

my $ll = Livelink->new('username', 'password');
my $content = $ll->get_file('50967038');
open my $fh, '>', "myfile.html" or die "Can't open file, $!\n";
binmode $fh;    # the download may not be plain text
print $fh $content;
close $fh;
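1. If a 501 error is returned, make sure Crypt::SSLeay is installed.
2. If a 500 error is returned when accessing an HTTPS URL, add ssl_opts => { verify_hostname => 0 } when creating the UserAgent. Note that this turns off hostname verification for the server certificate.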
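Before chasing either error, it can help to confirm which of the relevant modules Perl can actually load. This is a minimal sketch under my own assumptions: the helper name module_available is hypothetical, and the list of modules checked is only a guess at what a typical LWP HTTPS setup uses (LWP::Protocol::https and IO::Socket::SSL on newer installations, Crypt::SSLeay on older ones).

```perl
use strict;
use warnings;

# Hypothetical helper: return 1 if the named module can be loaded, else 0.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Foo::Bar -> Foo/Bar.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# Assumed list of modules worth checking; adjust for your installation.
for my $module (qw(LWP::UserAgent LWP::Protocol::https IO::Socket::SSL Crypt::SSLeay)) {
    printf "%-22s %s\n", $module, module_available($module) ? 'installed' : 'missing';
}
```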
When the program runs on Windows, the following error can also appear:
Status read failed: A non-blocking socket operation could not be completed immediately. at C:/Strawberry/perl/vendor/lib/Net/HTTP/Methods.pm line 276.
The explanation I found for this error mentions a bug fix:
Since 2.006 IO::Socket::SSL supports non-blocking on Windows. But on Windows the relevant $ERRNO is EWOULDBLOCK, not EAGAIN. On Unix these both are the same. Thus the code needs be changed to use EWOULDBLOCK instead.
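So I located the line containing EAGAIN in C:/Strawberry/perl/vendor/lib/Net/HTTP/Methods.pm and added the $!{EWOULDBLOCK} check to it (highlighted in the original post), giving: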
redo READ if $!{EINTR} || $!{EAGAIN} || $!{EWOULDBLOCK};
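To see what that condition tests, here is a small standalone sketch using Perl's %! hash from the Errno module. It is not taken from Net::HTTP::Methods; the function name is_retryable_errno is hypothetical, and the sketch only reproduces the three-way check on a given error number and prints the platform's values so you can see whether EAGAIN and EWOULDBLOCK coincide.

```perl
use strict;
use warnings;
use Errno qw(EINTR EAGAIN ENOENT);

# Hypothetical helper: would the patched condition retry for this errno?
sub is_retryable_errno {
    my ($code) = @_;
    local $! = $code;   # set errno for the duration of this call only
    return ($!{EINTR} || $!{EAGAIN} || $!{EWOULDBLOCK}) ? 1 : 0;
}

# Show the numeric values on the current platform.
printf "EAGAIN=%d EWOULDBLOCK=%s\n",
    EAGAIN, exists $!{EWOULDBLOCK} ? Errno::EWOULDBLOCK() : 'n/a';

printf "EINTR retryable:  %d\n", is_retryable_errno(EINTR);
printf "ENOENT retryable: %d\n", is_retryable_errno(ENOENT);
```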