Reference: Web Crawler with Python - 08. Simulated Login (Zhihu)
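When I tried it out, the following line raised an error:
_xsrf = BeautifulSoup(session.get('https://www.zhihu.com/#signin').content).find('input', attrs={'name': '_xsrf'})['value']
After analyzing the login flow again with the Chrome developer tools (F12), I added a User-Agent to the requests headers, and the _xsrf field could then be retrieved.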
In the same way, add the User-Agent header when fetching the captcha and when sending the login request.
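The chained lookup for _xsrf fails hard when the page does not contain the hidden field, presumably because the lookup finds nothing to index into. A minimal sketch of a more defensive extraction follows. It swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages, and the names XsrfFinder, extract_xsrf, SAMPLE_PAGE and BLOCKED_PAGE are made up for illustration; the HTML fragments are hypothetical, not real Zhihu responses.

```python
from html.parser import HTMLParser


class XsrfFinder(HTMLParser):
    """Remembers the value of the first <input name="_xsrf"> it sees."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "input"
                and attributes.get("name") == "_xsrf"
                and self.value is None):
            self.value = attributes.get("value")


def extract_xsrf(html):
    """Return the _xsrf token, or None when the field is missing."""
    finder = XsrfFinder()
    finder.feed(html)
    return finder.value


# Hypothetical page fragments (assumptions, not captured from Zhihu).
SAMPLE_PAGE = '<form><input type="hidden" name="_xsrf" value="abc123"/></form>'
BLOCKED_PAGE = "<html><body>error</body></html>"

print(extract_xsrf(SAMPLE_PAGE))
print(extract_xsrf(BLOCKED_PAGE))
```

Returning None for a missing field lets the caller report "no _xsrf in the page, check the request headers" instead of failing on the subscript.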
After that, fetching the captcha returned the following result:
ERR_VERIFY_CAPTCHA_SESSION_INVALID
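I then analyzed the request that fetches the captcha once more (the one sent when the captcha is refreshed).
Because the requests go through a requests session, the cookie information is already being sent along, so I suspected that the URL was the problem.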
Changing the request as follows solved it:
captcha_content = session.get('http://www.zhihu.com/captcha.gif?r=%d&type=login' % (time.time() * 1000), headers=headers).content
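As a small sketch of how that URL is put together, the query string can be built from a dictionary instead of string formatting, which makes the type=login parameter harder to drop by accident. The helper name build_captcha_url and the constant CAPTCHA_ENDPOINT are illustrative assumptions; only the endpoint and the two parameters come from the line above.

```python
import time
from urllib.parse import urlencode

# Endpoint taken from the request above.
CAPTCHA_ENDPOINT = "http://www.zhihu.com/captcha.gif"


def build_captcha_url(now=None):
    """Build the captcha URL with a millisecond timestamp and type=login."""
    if now is None:
        now = time.time()
    query = urlencode({
        "r": int(now * 1000),  # millisecond timestamp
        "type": "login",       # parameter that fixed the error in this post
    })
    return CAPTCHA_ENDPOINT + "?" + query


print(build_captcha_url())
```

The resulting string can be passed to session.get(...) together with the same headers as before.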
Finally, I modified the assertion. The resulting script is as follows:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import time

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36',
    # 'Referer': 'https://www.zhihu.com/',
    # 'X-Requested-With': 'XMLHttpRequest',
    # 'Origin': 'https://www.zhihu.com'
}


def login(username, password, kill_captcha):
    session = requests.session()
    _xsrf = BeautifulSoup(session.get('https://www.zhihu.com/#signin', headers=headers).content).find('input', attrs={'name': '_xsrf'})['value']
    session.headers.update({'_xsrf': str(_xsrf)})
    # Add type=login, otherwise: ERR_VERIFY_CAPTCHA_SESSION_INVALID
    captcha_content = session.get('http://www.zhihu.com/captcha.gif?r=%d&type=login' % (time.time() * 1000), headers=headers).content
    data = {
        '_xsrf': _xsrf,
        'password': password,
        'captcha': kill_captcha(captcha_content),
        'email': username,
        'remember_me': 'true'
        # The order of the key-value pairs in the dict does not matter
    }
    print data
    resp = session.post('http://www.zhihu.com/login/email', data=data, headers=headers).content
    # Login succeeded: the escaped text below is "登录成功"
    print 'resp\n', resp
    assert r'\u767b\u5f55\u6210\u529f' in resp
    return session


def kill_captcha(data):
    # Save the captcha image to disk and ask the user to type it in
    with open('1.gif', 'wb') as fp:
        fp.write(data)
    return raw_input('captcha : ')


if __name__ == '__main__':
    session = login('email', 'password', kill_captcha)
    print BeautifulSoup(session.get("https://www.zhihu.com", headers=headers).content).find('span', class_='name').getText()
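The assertion in the script matches the escaped text \u767b\u5f55\u6210\u529f against the raw response body. A possible alternative, sketched here in Python 3, is to decode the body as JSON and compare the readable message. The response shape is an assumption: the field names "r" and "msg" and the sample bodies are hypothetical and are not taken from a captured Zhihu response; login_succeeded is an illustrative helper name.

```python
import json


def login_succeeded(body, expected="登录成功"):
    """Return True if the JSON body carries the expected success message."""
    try:
        payload = json.loads(body)
    except ValueError:
        # The body was not JSON at all (for example an HTML error page).
        return False
    # "msg" is an assumed field name for the human-readable message.
    return isinstance(payload, dict) and payload.get("msg") == expected


# Hypothetical response bodies; the raw string keeps the \u escapes literal,
# the way they would appear on the wire.
SAMPLE_OK = r'{"r": 0, "msg": "\u767b\u5f55\u6210\u529f"}'
SAMPLE_FAIL = r'{"r": 1, "msg": "error"}'

print(login_succeeded(SAMPLE_OK))
print(login_succeeded(SAMPLE_FAIL))
```

Decoding first means the check no longer depends on how the server happens to escape non-ASCII characters.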