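For ordinary page scraping over plain HTTP, see the example at http://blog.csdn.net/jj_liuxin/archive/2009/02/19/3911533.aspx.
Here I only discuss logging in to HTTPS pages and scraping them. Python 2 and Python 3 differ considerably: Python 2 has two libraries, urllib and urllib2, while Python 3 has only urllib, and many of its functions are called differently. The scripts in this post are written for Python 3.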
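To make the difference concrete, here is a minimal sketch of an import shim that picks the right modules on either version. The fallback branch assumes the usual Python 2 layout (urllib, urllib2, cookielib); it is only an illustration, and nothing is sent over the network.

```python
# Import the same names on Python 3 and Python 2.
try:
    # Python 3: everything lives under the urllib and http packages
    from urllib.request import build_opener, HTTPCookieProcessor
    from urllib.parse import urlencode
    from http.cookiejar import CookieJar
except ImportError:
    # Python 2: the same names are split across urllib, urllib2 and cookielib
    from urllib2 import build_opener, HTTPCookieProcessor
    from urllib import urlencode
    from cookielib import CookieJar

# Build an opener that remembers cookies; no request is made here
opener = build_opener(HTTPCookieProcessor(CookieJar()))

# urlencode behaves the same way on both versions
form = urlencode({"username": "admin"})
print(form)
```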
#!/usr/bin/env python
# coding=utf-8
import urllib.parse
import urllib.request
import http.cookiejar

cookie = http.cookiejar.CookieJar()  # stores cookies so that later requests stay logged in
cjhdr = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(cjhdr)
url = "https://192.168.1.227/"
# Username, password and the Submit button; some pages require a non-empty Submit value
postdata = {'username': 'admin', 'pwd': '123456', 'Submit': ''}
# print(opener.open(url).read().decode("gbk"))  # print the login page
# Encode the form as "username=admin&pwd=123456&Submit="; Python 3 needs the POST body as bytes
params = urllib.parse.urlencode(postdata).encode("ascii")
opener.open(url, params)  # log in
# After a successful login, fetch another page
print(opener.open("https://192.168.1.227/about.php").read().decode("gbk"))
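The step that most often breaks when moving this kind of script to Python 3 is the request body: urllib.parse.urlencode returns a str, while opener.open expects bytes for POST data. Here is a small sketch of just that step, with no network access, using the same form fields as the script above.

```python
import urllib.parse

postdata = {"username": "admin", "pwd": "123456", "Submit": ""}

# urlencode returns text in "key=value&key=value" form
query = urllib.parse.urlencode(postdata)
print(query)

# Python 3 wants bytes for a POST body, so encode before sending
body = query.encode("ascii")
print(type(body).__name__)

# parse_qs reverses the encoding; keep_blank_values keeps the empty Submit field
decoded = urllib.parse.parse_qs(query, keep_blank_values=True)
print(decoded["Submit"])
```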
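At http://fly5.com.cn/p/p-like/python_https.html I found another snippet, written by someone far more experienced, that uses http.client.HTTPSConnection to log in over HTTPS and fetch information. It looked very useful, but when I tried it, it did not quite work, probably again because of the Python version. I made a few small changes: the lowercase get passed to conn.request became uppercase GET, login= in the POST data became Submit=, and after a successful login I call conn=http.client.HTTPSConnection(m_host) a second time. With these tweaks it worked as intended. I also found that on some pages the server-side code checks whether the Submit value in the POST data is non-empty, among other things.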
#!/usr/bin/env python
# coding=utf-8
import http.client

try:
    m_host = "192.168.1.227"
    m_user = "admin"
    m_passwd = "123456"
    data = "username=%s&pwd=%s&Submit=" % (m_user, m_passwd)
    # Headers for GET requests
    Getheaders = {"Host": m_host, "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.20) Gecko/20081217 (FoxPlus) Firefox/2.0.0.20", "Content-Type": "application/x-www-form-urlencoded"}
    # Headers for the POST request
    Postheaders = {"Host": m_host, "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.20) Gecko/20081217 (FoxPlus) Firefox/2.0.0.20", "Content-Length": str(len(data)), "Content-Type": "application/x-www-form-urlencoded"}
    # Connect to the server
    conn = http.client.HTTPSConnection(m_host)
    conn.connect()
    # Fetch the login page
    conn.request("GET", "/login.php", None, Getheaders)
    res = conn.getresponse()
    print(res.read().decode("gbk"))
    print("\n\n---------------------------------------------------\n\n")
    # First GET done; now log in
    conn.request("POST", "/login.php", data, Postheaders)
    # Read the cookie from the login response
    resp = conn.getresponse()
    # print(resp.read().decode("gbk"))  # login result; may be empty, an error message, or the login page again
    # Keep only the leading "name=value" pair and drop attributes such as path
    m_cookie = resp.getheader("Set-Cookie").split(';')[0]
    Infoheader = {"Host": m_host, "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.20) Gecko/20081217 (FoxPlus) Firefox/2.0.0.20", "Cookie": m_cookie, "Content-Type": "application/x-www-form-urlencoded"}
    # POST done; open a fresh connection and fetch another page as the logged-in user
    conn = http.client.HTTPSConnection(m_host)
    conn.request("GET", "/about.php", None, Infoheader)
    res2 = conn.getresponse()
    print(res2.read().decode("gbk"))
except http.client.HTTPException as ex:
    print("HTTP exception occurred:", ex)
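Pulling the cookie out of the Set-Cookie header by splitting the string is fragile. As an alternative sketch, the standard library's http.cookies.SimpleCookie can parse the header instead. The helper name cookie_header_from and the header value below are made up for illustration and are not output from the real server; the sketch also assumes the response carries a single Set-Cookie header.

```python
from http.cookies import SimpleCookie


def cookie_header_from(set_cookie):
    """Turn a Set-Cookie header value into a value for the Cookie request header."""
    jar = SimpleCookie()
    jar.load(set_cookie)
    # Keep only name=value pairs; attributes such as path are not sent back
    return "; ".join("%s=%s" % (name, morsel.value) for name, morsel in jar.items())


# Hypothetical header value, for illustration only
example = "PHPSESSID=abc123; path=/"
print(cookie_header_from(example))
```

The returned string can be placed directly in the "Cookie" entry of a header dictionary such as Infoheader.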
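A further revision: I found that some pages require the cookie to be sent along with the POST that submits the password; otherwise they respond with "Cookies are disabled in your browser; cannot log in". So the cookie has to be obtained when GETting the login page and then carried in the POST that sends the password. The code becomes: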
#!/usr/bin/env python
# coding=utf-8
import http.client

try:
    m_host = "192.168.1.227"
    m_user = "admin"
    m_passwd = "123456"
    data = "username=%s&pwd=%s&Submit=" % (m_user, m_passwd)
    # Headers for GET requests
    Getheaders = {"Host": m_host, "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.20) Gecko/20081217 (FoxPlus) Firefox/2.0.0.20", "Content-Type": "application/x-www-form-urlencoded"}
    # Connect to the server
    conn = http.client.HTTPSConnection(m_host)
    conn.connect()
    # Fetch the login page
    conn.request("GET", "/login.php", None, Getheaders)
    res = conn.getresponse()
    # Take the cookie from the login page response
    m_cookie = res.getheader("Set-Cookie")
    # Headers for the POST request, this time carrying the cookie
    Postheaders = {"Host": m_host, "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.20) Gecko/20081217 (FoxPlus) Firefox/2.0.0.20", "Cookie": m_cookie, "Content-Length": str(len(data)), "Content-Type": "application/x-www-form-urlencoded"}
    print(res.read().decode("gbk"))
    print("\n\n---------------------------------------------------\n\n")
    # First GET done; now log in
    conn.request("POST", "/login.php", data, Postheaders)
    # Read the login response
    resp = conn.getresponse()
    print(resp.read().decode("gbk"))
    Infoheader = {"Host": m_host, "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.20) Gecko/20081217 (FoxPlus) Firefox/2.0.0.20", "Cookie": m_cookie, "Content-Type": "application/x-www-form-urlencoded"}
    # POST done; open a fresh connection and fetch another page as the logged-in user
    conn = http.client.HTTPSConnection(m_host)
    conn.request("GET", "/about.php", None, Infoheader)
    res2 = conn.getresponse()
    print(res2.read().decode("gbk"))
except http.client.HTTPException as ex:
    print("HTTP exception occurred:", ex)
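The header dictionaries in these scripts differ only in whether they carry Cookie and Content-Length, so they could be built by one small function. Below is a rough sketch under that assumption. build_headers and USER_AGENT are hypothetical names of my own, not part of http.client, and the cookie value is a placeholder. The helper takes the body as already-encoded bytes so that Content-Length counts bytes rather than characters.

```python
USER_AGENT = "Mozilla/5.0"  # placeholder; the scripts above send a full Firefox string


def build_headers(host, cookie=None, body=None):
    """Build request headers; cookie (str) and body (bytes) are optional."""
    headers = {
        "Host": host,
        "User-Agent": USER_AGENT,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    if cookie:
        headers["Cookie"] = cookie
    if body is not None:
        # body is bytes, so len() is the number of bytes on the wire
        headers["Content-Length"] = str(len(body))
    return headers


# First GET: no cookie yet, no body
get_headers = build_headers("192.168.1.227")

# Login POST: carries the cookie obtained from the GET (placeholder value here)
form = "username=admin&pwd=123456&Submit=".encode("ascii")
post_headers = build_headers("192.168.1.227", cookie="PHPSESSID=abc123", body=form)
print(post_headers)
```

The resulting dictionaries can be passed as the fourth argument of conn.request in place of Getheaders, Postheaders and Infoheader.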