Simulating a GitHub Login

This article does not cover the GitHub API; GitHub simply serves as a concrete example of how to simulate a login.

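Start by logging in with a real browser, then capture the session request in the Network tab of the developer tools. It shows that logging in takes more than an email (or username) and a password: three additional fields are submitted as well. An ordinary user cannot fill these in from the browser, and does not need to, because they are attached to the form automatically when it is submitted.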

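Of these, the values of commit and utf8 never change. Only authenticity_token differs on every login: it is an anti-forgery (CSRF) token generated per page load, which also keeps a naive crawler from posting the form blindly. The authenticity_token field is a hidden input (type="hidden", so the browser does not display it) inside the form element of https://github.com/login, the login page as served to a logged-out visitor.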

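The figure below shows the source of the login page (loaded again while logged out). The value attribute of the hidden input named authenticity_token is what has to be extracted and submitted to the server as part of the form data.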
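As a minimal sketch of that extraction step, the snippet below pulls every hidden input out of a form using only the standard library. The markup in SAMPLE_LOGIN_HTML and the token value are made up for illustration and are not GitHub's real page; the article's own code uses pyquery for the same job.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect the name/value pairs of every <input type="hidden"> element."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("type") == "hidden" and "name" in attributes:
            self.hidden_fields[attributes["name"]] = attributes.get("value") or ""


# Hypothetical markup modelled on the fields named in the text (assumption).
SAMPLE_LOGIN_HTML = """
<form action="/session" method="post">
  <input type="hidden" name="utf8" value="&#x2713;">
  <input type="hidden" name="authenticity_token" value="made-up-token==">
  <input type="text" name="login">
  <input type="password" name="password">
  <input type="submit" name="commit" value="Sign in">
</form>
"""

parser = HiddenInputParser()
parser.feed(SAMPLE_LOGIN_HTML)
hidden_fields = parser.hidden_fields
print(hidden_fields)
```

Only the hidden inputs are collected; the visible login and password fields still have to be supplied by the script itself.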

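The figure below shows that the response status code is 302 Found, a redirect to another URL, in this case https://github.com. In other words, a successful login redirects to the GitHub home page (the user's personal dashboard).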

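Although the login happens on the https://github.com/login page, the form data is actually posted to https://github.com/session, so the submitted form data can be inspected in the session entry of the Network tab.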

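The figure above shows the form data submitted for a successful login. The value of authenticity_token matches the value found in the page before logging in (the email and password fields are sent in plain text, so they are masked here).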
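The post-then-redirect exchange described above can be reproduced offline. The sketch below is an assumption-laden stand-in, not GitHub's behaviour: FakeLoginHandler is a hypothetical local server that answers a form post to /session with a 302 and a cookie, and http.client is used because it does not follow redirects, so the 302 itself stays visible.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlencode


class FakeLoginHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a login endpoint (assumption, not GitHub)."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = parse_qs(self.rfile.read(length).decode("utf-8"))
        if self.path == "/session" and form.get("authenticity_token") == ["made-up-token"]:
            # Successful login: redirect to the home page and set a session cookie.
            self.send_response(302)
            self.send_header("Location", "/")
            self.send_header("Set-Cookie", "user_session=fake; Path=/")
        else:
            self.send_response(422)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), FakeLoginHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

form_body = urlencode({
    "commit": "Sign in",
    "utf8": "✓",
    "authenticity_token": "made-up-token",
    "login": "someone@example.com",
    "password": "not-a-real-password",
})
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("POST", "/session", body=form_body,
             headers={"Content-Type": "application/x-www-form-urlencoded"})
response = conn.getresponse()
status = response.status
location = response.getheader("Location")
set_cookie = response.getheader("Set-Cookie")
response.read()
conn.close()
server.shutdown()
server.server_close()

print(status, location, set_cookie)
```

A client that follows redirects automatically, as requests does by default, would report only the final page, which is why the 302 is easy to miss outside the browser's Network tab.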

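The logged-in state persists thanks to the cookies issued after a successful login. Cookies usually do not stay valid forever, though. To remain logged in over a long period, check periodically whether the cookies still work (or handle the resulting errors), and once they expire, submit the form again to obtain fresh cookies.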
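One way to perform that periodic check is to request a login-only page without following redirects and look at where the server sends you. The helper below is a hedged sketch: cookies_look_expired is a hypothetical name, and it assumes the site answers an unauthenticated request with a redirect to a /login path or with 401/403, which should be confirmed against the real site.

```python
from urllib.parse import urlparse

REDIRECT_CODES = {301, 302, 303, 307, 308}


def cookies_look_expired(status_code, location=None):
    """Heuristic check on the response to a login-only page.

    Assumption: an unauthenticated request is redirected to /login,
    or rejected outright with 401 or 403.
    """
    if status_code in REDIRECT_CODES and location:
        return urlparse(location).path.rstrip("/") == "/login"
    return status_code in (401, 403)


print(cookies_look_expired(302, "https://github.com/login?return_to=%2Fsettings"))
print(cookies_look_expired(200))
```

With requests, the two arguments would come from a call made with allow_redirects=False: the response's status_code and headers.get("Location"). When the helper returns True, the script repeats the token extraction and form post.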

Implementation

Libraries used

requests
pyquery

Example: simulating a GitHub login by passing cookies explicitly

In the form data post_data, replace the login and password fields with your own email (or username) and password.

import requests
from pyquery import PyQuery as pq

headers = {
    'Referer': 'https://github.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Host': 'github.com',
}
login_url = 'https://github.com/login'
post_url = 'https://github.com/session'
logined_url = 'https://github.com/settings/profile'
keys_url = "https://github.com/settings/keys"

# Extract the hidden authenticity_token value; the login form cannot be submitted without it
login_r = requests.get(login_url, headers=headers)
doc = pq(login_r.text)
token = doc('input[name="authenticity_token"]').attr("value").strip()
print(token)

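# Placeholders: replace with your own email (or username) and password
email_or_name = 'your_email_or_username'
password = 'your_password'

# Build the form data
post_data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': token,
    'login': email_or_name,
    'password': password,
}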
# The login POST must carry the cookies set by the login page
post_r = requests.post(post_url, data=post_data, headers=headers, cookies=login_r.cookies.get_dict())
# The response URL is https://github.com rather than https://github.com/session,
# because a successful login is 302-redirected to "https://github.com"
print(post_r.url)
doc = pq(post_r.text)
# Print the repository list
print(doc("div.Box-body > ul > li").text().split())

# Request other GitHub pages: any request that carries cookies holding the logged-in state can read login-only content
logined_r = requests.get(logined_url, headers=headers, cookies=post_r.cookies.get_dict())
doc = pq(logined_r.text)
page_title = doc("title").text()
user_profile_bio = doc("#user_profile_bio").text()
user_profile_company = doc("#user_profile_company").attr("value")
user_profile_location = doc("#user_profile_location").attr("value")
print(f"Page title: {page_title}")
print(f"Profile bio: {user_profile_bio}")
print(f"Profile company: {user_profile_company}")
print(f"Profile location: {user_profile_location}")

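# The cookies from logined_r would work here as well
keys_r = requests.get(keys_url, headers=headers, cookies=post_r.cookies.get_dict())
doc = pq(keys_r.text)
# Title of an SSH key; the element id is specific to the author's account, so change it to match your own
print(doc('#ssh-key-29454773 strong.d-block').text())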

Passing cookies and headers explicitly on every call is tedious, and a single request that lacks the complete set of cookies may not get the expected response.

To avoid passing cookies by hand each time, the next example simulates the GitHub login in a different way.

Keeping the simulated GitHub login alive with a Session object

Create a session object;
Make every request through that session object.
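To illustrate why a session object removes the bookkeeping, here is a self-contained sketch that uses the standard library's CookieJar as a stand-in for requests.Session. CookieEchoHandler is a hypothetical local server, not GitHub: /login issues a cookie, and every other path echoes back whatever Cookie header it received.

```python
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieEchoHandler(BaseHTTPRequestHandler):
    """Hypothetical server: /login sets a cookie, other paths echo the Cookie header."""

    def do_GET(self):
        self.send_response(200)
        if self.path == "/login":
            body = b"cookie issued"
            self.send_header("Set-Cookie", "user_session=fake; Path=/")
        else:
            body = (self.headers.get("Cookie") or "no cookie").encode("utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), CookieEchoHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = f"http://127.0.0.1:{server.server_port}"

# ProxyHandler({}) keeps the local requests away from any configured proxy.
no_proxy = urllib.request.ProxyHandler({})

# Opener with a cookie jar: cookies from one response are replayed on later requests.
jar = CookieJar()
session_like = urllib.request.build_opener(no_proxy, urllib.request.HTTPCookieProcessor(jar))
session_like.open(base_url + "/login").read()
with_jar = session_like.open(base_url + "/profile").read().decode("utf-8")

# Opener without a jar: nothing is remembered between requests.
stateless = urllib.request.build_opener(no_proxy)
stateless.open(base_url + "/login").read()
without_jar = stateless.open(base_url + "/profile").read().decode("utf-8")

server.shutdown()
server.server_close()
print(with_jar, "|", without_jar)
```

requests.Session plays the role of the jar-backed opener: cookies set by any response, including redirects along the way, are stored on the session and sent with subsequent requests.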

Code

Here session.headers keeps the same headers on every request made through the session.

For safety, the built-in getpass module reads the password without echoing it (type carefully, since nothing you enter is shown).

import getpass

import requests
from pyquery import PyQuery as pq

class Login(object):
    def __init__(self):
        base_url = 'https://github.com/'
        # Login page URL
        self.login_url = base_url + 'login'
        # Endpoint that receives the login form
        self.post_url = base_url + 'session'
        # Profile settings page URL
        self.logined_url = base_url + 'settings/profile'
        # Create a session object
        self.session = requests.Session()
        # Custom request headers, merged into the session defaults
        self.session.headers.update({
            'Referer': 'https://github.com/',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
            'Host': 'github.com'
        })

    def token(self):
        # Request the login page
        response = self.session.get(self.login_url)
        # Extract the value of authenticity_token
        doc = pq(response.text)
        token = doc('input[name="authenticity_token"]').attr("value").strip()
        return token
    
    def login(self, email, password):
        token = self.token()
        # Build the form data
        post_data = {
            'commit': 'Sign in',
            'utf8': '✓',
            'authenticity_token': token,
            'login': email,
            'password': password
        }
        # Send the POST request; on success it is 302-redirected to 'https://github.com/',
        # so the response holds the page at 'https://github.com/'
        response = self.session.post(self.post_url, data=post_data)
        # The final URL confirms the 302 redirect to 'https://github.com/'
        print(f"\nFinal URL: {response.url}")
        if response.status_code == 200:
            print("status_code: 200")
            self.home(response.text)

        # Request the profile settings page
        response = self.session.get(self.logined_url)
        if response.status_code == 200:
            print("status_code: 200")
            self.profile(response.text)

    def home(self, html):
        doc = pq(html)
        # Extract the user name
        user_name = doc("summary > span").text().strip()
        print(f"User name: {user_name}")

        # Extract the repository list
        repositories = doc("div.Box-body > ul > li").text().split()
        for repository in repositories:
            print(repository)
    
    def profile(self, html):
        doc = pq(html)
        page_title = doc("title").text()
        user_profile_bio = doc("#user_profile_bio").text()
        user_profile_company = doc("#user_profile_company").attr("value")
        user_profile_location = doc("#user_profile_location").attr("value")
        print(f"Page title: {page_title}")
        print(f"Profile bio: {user_profile_bio}")
        print(f"Profile company: {user_profile_company}")
        print(f"Profile location: {user_profile_location}")

    def main(self):
        email = input("email or username: ")
        # The password is not echoed while typing, so enter it carefully
        password = getpass.getpass("password:")
        self.login(email=email, password=password)

if __name__ == "__main__":
    login = Login()
    login.main()

Sample run

References

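This article draws on 《Python 3 网络爬虫开发实战》 (Python 3 Web Crawler Development in Practice), Section 10.1, "Simulating login and crawling GitHub".
The discussion of hidden fields draws on 《Python网络数据采集》 (Web Scraping with Python), Section 12.3, "Common form security measures".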