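I have been studying web crawlers for a while. There is plenty of material online about Python crawlers, but it is messy: some programs use urllib2, others use http.client, and the explanations tend to be vague. It took me a week to work out how to POST a login and fetch the logged-in data, along with the main things a crawler has to watch out for.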
Extracting key fields from a web page
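You can extract fields with regular expressions (the re module) or with the BeautifulSoup class from bs4; the two work about equally well here. I won't go into them, since the official documentation (a Chinese translation is available) is more detailed and thorough.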
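As a rough illustration of the two approaches, the sketch below pulls hidden form fields out of a page, first with re and then with an HTML parser. To keep it runnable without third-party packages it uses the standard library's html.parser in place of BeautifulSoup; the idea is the same. SAMPLE_HTML, the function names, and the field values are made up for the example and are not taken from any real login page.

```python
import re
from html.parser import HTMLParser

# Hypothetical login form, loosely modeled on a page with hidden fields.
SAMPLE_HTML = '''
<form method="post">
  <input type="text" name="username" />
  <input type="hidden" name="lt" value="LT-12345-abcdef" />
  <input type="hidden" name="execution" value="e1s1" />
  <input type="hidden" name="_eventId" value="submit" />
</form>
'''


def hidden_fields_with_re(html):
    """Regex approach: short, but only matches name="..." directly followed by value="..."."""
    return dict(re.findall(r'name="([^"]+)" value="([^"]*)"', html))


class HiddenInputParser(HTMLParser):
    """Parser approach: collects every <input type="hidden"> regardless of attribute order."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            self.fields[attrs["name"]] = attrs.get("value") or ""


def hidden_fields_with_parser(html):
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


if __name__ == "__main__":
    print(hidden_fields_with_re(SAMPLE_HTML))
    print(hidden_fields_with_parser(SAMPLE_HTML))
```

The regex version is fine for a quick script against a page whose markup you have inspected; the parser version is less likely to break if the site reorders attributes.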
Comparing the modules you can use to POST to a web page
requests: simple to use and easy to configure; see the official site for details
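# requests_login.py (do not name the script requests.py, or "import requests" would import this file itself)
import re
import requests

url = "https://passport.csdn.net"
s = requests.Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
    'Accept': '*/*',
}
# update() adds these headers without discarding the session's other defaults
s.headers.update(headers)
# verify=False skips TLS certificate verification for this request
doc = s.get(url, verify=False).text
# The login form carries hidden fields that must be sent back with the credentials
lt = re.findall(r'name="lt" value="(.*?)"', doc)[0]
execution = re.findall(r'name="execution" value="(.*?)"', doc)[0]
value = {
    'username': 'acsunqi',
    'password': '************',
    'lt': lt,
    'execution': execution,
    '_eventId': 'submit',
}
print(value)
# POST through the same session so the login cookies are kept
r = s.post(url, data=value)
print(r.text)
# Fetch the personal home page with the logged-in session
mysd = s.get('http://my.csdn.net/my/mycsdn').text
print(mysd)
# Write as UTF-8 so Chinese text does not fail under the Windows default encoding
with open(r'C:\Users\sunqi\Desktop\webdata.txt', 'w', encoding='utf-8') as f:
    f.write(mysd)
data = r.text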
On success, r.text contains a document that loads the login JavaScript:
<script src="/content/loginbox/loginapi.js"></script>
<script>
    function redirect_back(){
        var redirect = "http://www.csdn.net/";
        var data = {"userId":55698829,"isLocked":false,"mobile":"(redacted)","userName":"acsunqi","email":"","password":"(redacted)","registerIP":"182.48.104.153","isDeleted":false,"isActived":true,"role":0,"registerTime":"Aug 3, 2015 1:29:17 AM","userType":0,"lastLoginIP":"119.103.224.105","lastLoginTime":"Feb 6, 2016 2:17:00 AM","loginTimes":29,"user_status":0,"activeTime":"Aug 3, 2015 1:29:17 AM","passwordStrongLevel":2,"ucSyncStatus":true,"nickName":"acsunqi","avatar":"http://avatar.csdn.net/1/D/F/1_acsunqi.jpg"};
        var userInfo = "8ClRT6oCiTXgNIcPJ97lMU+KK/SUT2KJjvcbryNkav4QTfDzxjb2kY06MJi3aIaAoo/7YFQ/wBXJCXppS/GrtmdKb8NR770lzdstcC4vWy8=";
        data.userName = data.userName;
        data.encryptUserInfo = userInfo;
        csdn.login_param.call = function (){
            location.href = redirect;
        }
        var _data = {};
        _data.status = true;
        _data.data = data;
        var oauth = "";
        if(oauth == "true"){
            csdn.login_back(_data);
        }else{
            csdn.login_data = data;
            csdn.login_end();
        };
    }
</script>
While I was at it, I also did a GET on my own personal home page. It came back normally, so the login succeeded.
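Rather than reading the returned script by eye, a crawler could check for the embedded `var data = {...};` object and parse it as JSON. The following is a minimal sketch under that assumption: extract_login_data and SAMPLE_RESPONSE are hypothetical names, the sample is a shortened stand-in for the response shown above, and what the site returns on a failed login is not covered by this post, so the function simply returns None when the object is absent.

```python
import json
import re


def extract_login_data(script_text):
    """Return the object assigned to `var data` in the login response, or None if absent."""
    match = re.search(r'var data\s*=\s*', script_text)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index
        # and ignores whatever follows it (here, the trailing ";").
        obj, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    except ValueError:
        return None
    return obj


# Shortened, made-up stand-in for the response body.
SAMPLE_RESPONSE = '''<script>
function redirect_back(){
var redirect = "http://www.csdn.net/";
var data = {"userId":1,"isLocked":false,"userName":"example_user","loginTimes":29};
var userInfo = "abc";
}
</script>'''

if __name__ == "__main__":
    info = extract_login_data(SAMPLE_RESPONSE)
    if info is not None:
        print("logged in as", info["userName"])
    else:
        print("no login data found in response")
```

In the script above this would be called as extract_login_data(r.text) right after the POST.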