Crawling websites that require login and capturing the form data

1. Inspect the fields submitted as form data during login and build the payload accordingly:
payload = {
    "username": USERNAME,
    "password": PASSWORD,
    .......
}
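
If it is not obvious which fields the form submits, one quick check is to parse the login page and list every <input> element it contains. The sketch below uses the same requests and lxml libraries as the script in step 2; which fields it finds (and whether they match what the browser actually posts) depends on the target page, so treat it as a starting point rather than a definitive list.

# coding=utf-8
import requests
from lxml import html

LOGIN_URL = "https://auth.dxy.cn/accounts/login?service=http%3A%2F%2Fwww.dxy.cn%2Fuser%2Findex.do%3Fdone%3Dhttp%3A%2F%2Fwww.dxy.cn%2F"

def list_form_fields(url):
    # Fetch the login page and collect the name/value of every <input> in its forms.
    page = requests.get(url)
    tree = html.fromstring(page.text)
    fields = {}
    for node in tree.xpath("//form//input"):
        name = node.get("name")
        if name:
            fields[name] = node.get("value", "")
    return fields

if __name__ == "__main__":
    for name, value in list_form_fields(LOGIN_URL).items():
        print(name, "=", value)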

2. The full script:

# coding=utf-8
import requests
from lxml import html

# Login page URL
LOGIN_URL = "https://auth.dxy.cn/accounts/login?service=http%3A%2F%2Fwww.dxy.cn%2Fuser%2Findex.do%3Fdone%3Dhttp%3A%2F%2Fwww.dxy.cn%2F"
# URL of the page to scrape
URL = "http://rehab.dxy.cn/"

def main():
    session_requests = requests.session()

    # Fetch the login page first so the session picks up its cookies;
    # the parsed tree can also be used to extract dynamic form tokens
    result = session_requests.get(LOGIN_URL)
    tree = html.fromstring(result.text)

    # Create payload
    payload = {
        "username": "****@qq.com",
        "password": "***",
        "loginType": 1,
        "validateCode":"mdgc",
        "keepOnlineType": 2,
        "trys": 0,
        "nlt": "_c2AABF5AD-CFCC-4DEC-0434-F8E7FB827921_k00905A80-3969-E3D8-A79D-5ACFA9048738",
        "_eventId": "submit"
    }

    # Perform login
    result = session_requests.post(LOGIN_URL, data=payload, headers={"Referer": LOGIN_URL})

    # Scrape url
    result = session_requests.get(URL, headers={"Referer": URL})
    tree = html.fromstring(result.content)
    bucket_names = tree.xpath("//a[@class='h5 dq-stat-zone']/@title")

    # Print the scraped article titles
    for title in bucket_names:
        print(title)

if __name__ == '__main__':
    main()
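
Note that the nlt value in the payload above was copied from a single captured request and is likely to expire. Assuming the login form exposes it as a hidden <input name="nlt">, a sketch like the one below could pull it from the page the session just fetched instead of hardcoding it (the field name and XPath are assumptions and should be checked against the actual form):

# Sketch: fill the dynamic "nlt" token from the freshly fetched login page
# instead of hardcoding it (would go inside main(), before the POST).
# Assumes the form carries it as <input name="nlt" value="...">.
nlt_values = tree.xpath("//input[@name='nlt']/@value")
if nlt_values:
    payload["nlt"] = nlt_values[0]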

3. Sample output:

加速康复外科中国专家共识及路径管理指南(2018版)
胆战心惊—我值班那些年的血泪教训!
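
To avoid logging in on every run, the cookies held by session_requests can be persisted and reloaded later. A minimal sketch, assuming the site keeps its login state in cookies:

import pickle
import requests

# Save the logged-in session's cookies to disk ...
with open("cookies.pkl", "wb") as f:
    pickle.dump(session_requests.cookies, f)

# ... and restore them into a fresh session on the next run.
session_requests = requests.session()
with open("cookies.pkl", "rb") as f:
    session_requests.cookies.update(pickle.load(f))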
