Web Scraping Practice: Logging in to GitHub

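Goal: log in to GitHub using the Requests package.

1.## Identify the form parameters

Capture the login request several times and compare the submitted form parameters.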

First capture:

commit: Sign in
utf8: ✓
authenticity_token: sO34KvtovZgqSKQsVIkEdWbwX6ykeuzCMxuZbWul6wUmlpz/3Hc4SaeuRB5WEWbL1JbkgYL3r9Na1ivFxM+o+w==
ga_id: 1192443032.1565138303
login: username
password: plaintext password
webauthn-support: supported
webauthn-iuvpaa-support: unsupported
required_field_34aa: 
timestamp: 1573029556609
timestamp_secret: bc3d494a0b7f36c58e7b3dc07c52fcd3e149456f46aff70797e3709c766434c7

Second capture:

commit: Sign in
utf8: ✓
authenticity_token: M0Xosj8ILvss0InDr0iNNiVylyczk06WBKmc6mfRbjKefRzgRUiPVzOKmu3CeVu4rAbQd7mj1EC99oP5yLCNoQ==
ga_id: 1192443032.1565138303
login: username
password: plaintext password
webauthn-support: supported
webauthn-iuvpaa-support: unsupported
required_field_90c0: 
timestamp: 1573029878918
timestamp_secret: 68be517605bc020dbc20be18cb90323267ac88ff650b912fbe087df1be9fe117

Comparing the two captures:
Fixed values:
1. commit: Sign in
2. utf8: ✓
3. login: username (supplied by the user)
4. password: plaintext password (supplied by the user)
5. webauthn-support: supported
6. webauthn-iuvpaa-support: unsupported
7. ga_id: 1192443032.1565138303
Changing values:
1. authenticity_token: M0Xosj8ILvss0InDr0iNNiVylyczk06WBKmc6mfRbjKefRzgRUiPVzOKmu3CeVu4rAbQd7mj1EC99oP5yLCNoQ==
2. required_field_90c0: (the value is empty, but the suffix of the field name changes)
3. timestamp: 1573029878918
4. timestamp_secret: 68be517605bc020dbc20be18cb90323267ac88ff650b912fbe087df1be9fe117
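
The comparison above can also be done mechanically. The following is a minimal sketch, not part of the original post: `diff_forms` is a hypothetical helper, and the token strings are placeholders standing in for the long captured values. It splits the keys of two captured forms into those with identical values, those whose values differ, and those that appear in only one capture.

```python
def diff_forms(first, second):
    """Classify the keys of two captured forms.

    Returns three sets: keys with equal values, keys with different
    values, and keys that exist in only one of the two captures.
    """
    shared = first.keys() & second.keys()
    fixed = {key for key in shared if first[key] == second[key]}
    changed = {key for key in shared if first[key] != second[key]}
    unmatched = set(first.keys() ^ second.keys())
    return fixed, changed, unmatched


# Abbreviated captures: the token values are placeholders, not real tokens.
capture_1 = {
    "commit": "Sign in",
    "utf8": "✓",
    "ga_id": "1192443032.1565138303",
    "authenticity_token": "PLACEHOLDER_TOKEN_1",
    "required_field_34aa": "",
    "timestamp": "1573029556609",
}
capture_2 = {
    "commit": "Sign in",
    "utf8": "✓",
    "ga_id": "1192443032.1565138303",
    "authenticity_token": "PLACEHOLDER_TOKEN_2",
    "required_field_90c0": "",
    "timestamp": "1573029878918",
}

fixed, changed, unmatched = diff_forms(capture_1, capture_2)
print("fixed:", sorted(fixed))
print("changed:", sorted(changed))
print("only in one capture:", sorted(unmatched))
```

Keys that show up in only one capture, such as the `required_field_*` pair, point to a field whose name itself is generated per page load.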

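2.## Analyze the form parameters
Fetching the source of the login page shows that the changing parameters are embedded in it as form inputs: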

<input type="hidden" name="authenticity_token" value="K4gFC3qrPOfJVi8kLoPtjJg2dUp6Yisz4YG2sHktnw8Yu1nAo2n7vVVlupbmMQyTt5iKRLTJZb/+wA6FqPPV4g==">
<input type="hidden" name="ga_id" class="js-octo-ga-id-input" value="1192443032.1565138303">
<input type="text" name="required_field_01fb" id="required_field_01fb" hidden="hidden" class="form-control">
<input type="hidden" name="timestamp" value="1573029624696" class="form-control">
<input type="hidden" name="timestamp_secret" value="25e1caaf1d72b3184f9b551d96750d01eb6871d0adf2e795fb211e58c82f9958" class="form-control">

So we can request the login page first and read these changing form parameters from it.
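
As an alternative to regular expressions, the inputs can be read with the standard library's `html.parser`. This is only a sketch under assumptions: `InputCollector` and `extract_inputs` are hypothetical names, the sample markup is a shortened copy of the inputs shown above with a placeholder token, and on a real page the collector would pick up every named `<input>`, not only those of the login form.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect name -> value for every <input> tag that has a name."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # Inputs without a value attribute (such as required_field_*)
            # are stored as an empty string.
            self.fields[name] = attributes.get("value") or ""


def extract_inputs(html):
    """Return a dict of all named inputs found in the given HTML text."""
    collector = InputCollector()
    collector.feed(html)
    return collector.fields


# Shortened sample; the token is a placeholder, not a real value.
sample = """
<input type="hidden" name="authenticity_token" value="PLACEHOLDER_TOKEN==">
<input type="text" name="required_field_01fb" id="required_field_01fb" hidden="hidden" class="form-control">
<input type="hidden" name="timestamp" value="1573029624696" class="form-control">
"""

fields = extract_inputs(sample)
print(fields)
```

Because the dynamic `required_field_*` name arrives as an ordinary dictionary key, no separate lookup for the field name is needed with this approach.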

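3.## Outline of the code

  1. Request the GitHub login page first
  2. Extract the fields from the page source and build the form
  3. Submit the form
  4. Verify the login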

4.## Writing the code

import requests
import re

class Github(object):

    def __init__(self, username, password):
        self.headers = {
            'Referer': 'https://github.com/',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36',
            'Host': 'github.com'
        }  # Send browser-like request headers
        self.session = requests.Session()  # Create a Session so cookies persist across requests
        self.username = username
        self.password = password


    def get_login_sources(self):
        login_url = "https://github.com/login"
        responses = self.session.get(url=login_url,headers=self.headers)
        return responses


    def get_form_data(self, responses):
        html = responses.content.decode()  # Decode the login page once and reuse it
        form_data = {}
        form_data["commit"] = "Sign in"
        form_data["utf8"] = "✓"
        # re.findall returns a list, so take the first match
        form_data["authenticity_token"] = re.findall('name="authenticity_token" value="(.*?)"', html)[0]
        form_data["ga_id"] = "1192443032.1565138303"  # Value seen in the captures above
        form_data["login"] = self.username
        form_data["password"] = self.password
        form_data["webauthn-support"] = "supported"
        form_data["webauthn-iuvpaa-support"] = "unsupported"
        # The suffix of the empty required_field_* input changes on every page load
        required_field_name = re.findall(r'name="(required_field_\w+)"', html)[0]
        form_data[required_field_name] = ""
        form_data["timestamp"] = re.findall('name="timestamp" value="(.*?)"', html)[0]
        form_data["timestamp_secret"] = re.findall('name="timestamp_secret" value="(.*?)"', html)[0]
        return form_data


    def post_form_data(self, form_data):
        post_url = "https://github.com/session"
        responses_2 = self.session.post(url=post_url, data=form_data, headers=self.headers)
        return responses_2


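    def are_you_logged_in(self):
        logged_url = "https://github.com/" + self.username
        response_3 = self.session.get(logged_url)

        # Save the user page
        with open("GitHubasd.html", "wb") as f:
            f.write(response_3.content)

        # Verify the login
        # NOTE: the pattern in the original post was lost (only '(.*?)' survived);
        # matching the page <title> is an assumed reconstruction.
        titles = re.findall('<title>(.*?)</title>', response_3.content.decode(), re.S)
        title = titles[0].strip() if titles else ""
        print(title)
        # Assumption: the title of a profile page starts with the username
        if title.startswith(self.username):
            print("Login succeeded")
        else:
            print("Login failed")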


    def run(self):
        # 1. Request the login page first
        responses = self.get_login_sources()

        # 2. Extract the fields and build the form data
        form_data = self.get_form_data(responses)

        # 3. Submit the form
        responses_2 = self.post_form_data(form_data)

        # 4. Verify the login
        self.are_you_logged_in()

if __name__ == '__main__':
    username = input("Username: ")
    password = input("Password: ")
    user = Github(username, password)
    user.run()
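
A profile page is usually public, so its title alone may not prove that the session is authenticated. One possible stricter check is sketched below. It rests on an assumption not confirmed by this post: that pages served to a signed-in user contain a `<meta name="user-login" content="...">` tag holding the login name, and that the content is empty otherwise. `logged_in_as` is a hypothetical helper, and `octocat` is only sample data.

```python
import re


def logged_in_as(html):
    """Return the login named in a <meta name="user-login"> tag, or None.

    Assumes the tag is written with name before content and that an
    empty content attribute means nobody is signed in.
    """
    match = re.search(r'<meta\s+name="user-login"\s+content="([^"]*)"', html)
    if match and match.group(1):
        return match.group(1)
    return None


# Sample fragments, not real GitHub responses.
signed_in_page = '<head><meta name="user-login" content="octocat"></head>'
signed_out_page = '<head><meta name="user-login" content=""></head>'

print(logged_in_as(signed_in_page))
print(logged_in_as(signed_out_page))
```

Inside `are_you_logged_in`, the result could be compared with `self.username` in place of the title check, provided the assumption about the markup holds for the pages actually returned.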
