Crawlers: 6. Packet Capture Analysis

Packet Capture Analysis

Packet capture analysis is an essential skill for writing crawlers. Common tools include Fiddler 4, Charles, Wireshark, and the browser's built-in developer tools.
When do you need packet capture analysis?

- Scraping data from mobile apps, usually combined with decompiling the app (a later article covers scraping app data)
- Pages that require login
- Complex scrapes, e.g. analyzing request headers and response headers, or diagnosing why a request failed
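One practical tip: to make your own crawler's traffic show up in the capture tool, you can route requests through the tool's local proxy. A minimal sketch, assuming Fiddler's default listening address of 127.0.0.1:8888 (Charles can be configured similarly):

```python
# Route requests through a local capture proxy so Fiddler/Charles
# sees your crawler's traffic. 127.0.0.1:8888 is Fiddler's default.
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}

# Usage (requires the proxy tool to be running):
# import requests
# resp = requests.get("http://www.kanzhun.com/", proxies=proxies)
```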

Login

Here I use Fiddler 4. Logging in on the web page at http://www.kanzhun.com/login/, the captured request looks like this:

POST http://www.kanzhun.com/login.json HTTP/1.1
Host: www.kanzhun.com
Proxy-Connection: keep-alive
Content-Length: 69
Accept: application/json, text/javascript, */*; q=0.01
Origin: http://www.kanzhun.com
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Referer: http://www.kanzhun.com/login/
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.8,en;q=0.6
Cookie: W_CITY_S_V=0; ac="[email protected]"; t=EhhloI4AGnoXJMz; aliyungf_tc=AQAAAOkujCD0yw4ASSL+myvvTxkg1TH/; __c=1465718622; __g=-; __l=l=%2F&r=; __a=74808725.1465379010.1465379010.1465718622.6.2.3.6; AB_T=abvb

redirect=%2F&account=casd1%40sina.com&password=123456&remember=true
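The body is ordinary application/x-www-form-urlencoded data; decoding it, for example with the standard library's `parse_qs` (from `urllib.parse` in Python 3, `urlparse` in Python 2), recovers the same form fields Fiddler shows on its WebForms tab:

```python
from urllib.parse import parse_qs  # urlparse.parse_qs in Python 2

# The URL-encoded body from the capture above
body = "redirect=%2F&account=casd1%40sina.com&password=123456&remember=true"

fields = parse_qs(body)
print(fields["account"])   # ['casd1@sina.com']
print(fields["redirect"])  # ['/']
```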

Clicking the WebForms tab shows the submitted form fields (I masked parts of the values with *):

Name      Value
redirect  /
account   c_***@sina.com
password  1111****
remember  true

We can then log in by simulating this form submission with requests:

# -*- coding:utf-8 -*-

"""
File Name : test3.py
Description: log in to kanzhun.com by replaying the captured form POST
Author: chengwei
Date: 2016/5/24 14:08
python: 2.7.10
"""

import requests


def main():
    s = requests.Session()
    # Form fields taken from the Fiddler capture above
    data = {
        "redirect": '/',
        "account": 'username',
        "password": 'passwd',
        "remember": 'true',
    }
    # Headers copied from the captured request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Accept-Encoding': 'gzip, deflate',
        'X-Requested-With': 'XMLHttpRequest',
        'Accept': 'application/json, text/javascript, */*; q=0.01'
    }
    # Log in; the session keeps the returned cookies for later requests
    s.post('http://www.kanzhun.com/login.json', headers=headers, data=data)

    # A salary page that requires login now returns its full content
    res = s.get('http://www.kanzhun.com/gsx3195.html?ka=com-blocker1-salary', headers=headers)
    print(res.status_code)


if __name__ == '__main__':
    main()

Without logging in, the salary page does not show its full content; once we log in via the form POST, any later request on the same session returns the full page, because the session automatically carries the login cookies.
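Since the login state lives entirely in the session's cookie jar, you can also persist the cookies and skip the login POST on later runs. A minimal sketch using requests' cookie-jar helpers (the file name `cookies.json` and the sample cookie value are my own choices for illustration):

```python
import json

import requests


def save_cookies(session, path):
    # Dump the session's cookie jar to a JSON file
    with open(path, "w") as f:
        json.dump(requests.utils.dict_from_cookiejar(session.cookies), f)


def load_cookies(session, path):
    # Restore a previously saved cookie jar into a fresh session
    with open(path) as f:
        session.cookies = requests.utils.cookiejar_from_dict(json.load(f))


s = requests.Session()
s.cookies.set("t", "EhhloI4AGnoXJMz")  # stand-in for a real login cookie
save_cookies(s, "cookies.json")

s2 = requests.Session()          # e.g. a later run of the crawler
load_cookies(s2, "cookies.json")
print(s2.cookies.get("t"))       # EhhloI4AGnoXJMz
```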
