Crawler Learning Diary, Part 7: Fetching GitHub's repository-search API (not really crawling, strictly speaking)

GitHub exposes a REST API (base URL https://api.github.com/); repository search lives at the /search/repositories endpoint.
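Before wiring it into the database loop, here is a minimal sketch of a single call to the search endpoint (the query parameters and result fields follow the GitHub REST API docs; the CVE ID in the query is just an illustration):

import requests

# One unauthenticated call to the repository-search endpoint;
# q is the search query, sort=stars ranks results by star count
url = "https://api.github.com/search/repositories"
params = {"q": "CVE-2021-44228", "sort": "stars"}  # example CVE ID, purely illustrative
r = requests.get(url, params=params)
data = r.json()

print(data["total_count"])          # number of matching repositories
for repo in data["items"][:5]:      # each item is one repository object
    print(repo["full_name"], repo["html_url"], repo["stargazers_count"])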

# Imports for this script (sleep is used later for rate limiting)
import mysql.connector
import requests
from time import sleep

# Connect to the database
db = mysql.connector.connect(
    host="***",
    user="***",
    password="***",
    database="***"
)
# Create a cursor
cursor = db.cursor()
# Read CVE IDs from the database (the Chinese string literals are the
# stored placeholder values for "no CVE" and "no exploit code available")
cursor.execute("SELECT cve_id FROM vules WHERE cve_id != '无CVE' AND poc != '暂无可利用代码'")
cve_ids = cursor.fetchall()
print(cve_ids)

# Iterate over the CVE ID list
for cve_id in cve_ids:
    cve_id = cve_id[0]  # fetchall() returns tuples; take the first field
    # Search GitHub for the CVE ID
    URL = f'https://api.github.com/search/repositories?q={cve_id}&sort=stars'
    r = requests.get(URL)
    response_dict = r.json()
    print(response_dict)
    repo_dicts = response_dict['items']
    results = []
    for repo in repo_dicts:
        results.append(repo["html_url"])
    print(results)
# Close the database connection
db.close()

Error: GitHub throttled the requests with its API rate limit:

{'message': "API rate limit exceeded for ******. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)", 'documentation_url': 'https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting'}
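Before adding authentication, you can check how much quota is actually left. A small sketch using the standard /rate_limit endpoint (note that search requests draw from their own, much smaller bucket than the core API):

import requests

# Calling /rate_limit does not itself count against the quota
r = requests.get("https://api.github.com/rate_limit")
limits = r.json()["resources"]

# The search endpoints use the "search" bucket, not "core"
search = limits["search"]
print("limit:", search["limit"])          # requests allowed per window
print("remaining:", search["remaining"])  # requests left in this window
print("reset:", search["reset"])          # Unix timestamp when the window resets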

The requests need an Authorization header.
On GitHub, go to your profile: Settings / Developer settings / Personal access tokens / Generate new token, then copy the generated token and save it somewhere safe.

headers = {'User-Agent': 'Mozilla/5.0',
           'Authorization': 'token ef802a122df2e4d29d9b1b868a6fefb14f22b272',  # paste your own token here
           'Content-Type': 'application/json',
           'Accept': 'application/json'
          }
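A quick way to confirm the token is actually being sent is to hit the /user endpoint with the same headers; an authenticated request returns your account info, while an unauthenticated one returns 401 (a minimal sanity check, reusing the headers dict above):

# /user requires authentication, so a 200 here means the
# Authorization header is being picked up correctly
r = requests.get("https://api.github.com/user", headers=headers)
print(r.status_code)           # expect 200 with a valid token, 401 without
print(r.json().get("login"))   # your GitHub username on success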

With the token added the rate limit loosened up a bit, but the same error eventually came back:

{'message': 'API rate limit exceeded for user ID ******. If you reach out to GitHub Support for help, please include the request ID FCA0:25083D:2521AF:27C7D1:6528B0F7.', 'documentation_url': 'https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting'}

Multithreading is out of the question here, so the only option left is to slow the loop down with sleep() and wrap the body in try/except.

for cve_id in cve_ids:
    try:
        cve_id = cve_id[0]  # fetchall() returns tuples; take the first field
        print(cve_id)
        # Search GitHub for the CVE ID, now with the authenticated headers
        URL = f'https://api.github.com/search/repositories?q={cve_id}&sort=stars'
        r = requests.get(URL, headers=headers)
        response_dict = r.json()
        print(response_dict)
        repo_dicts = response_dict['items']
        results = []
        for repo in repo_dicts:
            results.append(repo["html_url"])
        result = ','.join(results)
        # Write the found repository URLs back to the database
        sql = "UPDATE vules SET repositories=%s WHERE cve_id=%s;"
        values = (result, cve_id)
        cursor.execute(sql, values)
        db.commit()

        print(results)
        sleep(1)
    except Exception as e:
        # On any failure (rate limit, missing 'items', DB error),
        # log it, back off for a few seconds, then move on
        print("Exception occurred:", str(e))
        sleep(5)
        continue
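sleep(1) plus a blanket except gets the job done, but a cleaner alternative would be to read GitHub's rate-limit response headers and sleep only when the quota is actually exhausted. A sketch of that idea (the header names are from the GitHub docs; wiring it into the loop above is a straightforward adjustment):

import time
import requests

def search_with_backoff(query, headers):
    """GET the search endpoint; if the search quota is exhausted,
    sleep until the window resets, then retry once."""
    url = f"https://api.github.com/search/repositories?q={query}&sort=stars"
    r = requests.get(url, headers=headers)
    # GitHub reports the remaining quota and the reset time in response headers
    remaining = int(r.headers.get("X-RateLimit-Remaining", 1))
    if r.status_code == 403 and remaining == 0:
        reset_at = int(r.headers.get("X-RateLimit-Reset", time.time() + 60))
        wait = max(reset_at - time.time(), 0) + 1  # small buffer past the reset
        print(f"rate limited, sleeping {wait:.0f}s")
        time.sleep(wait)
        r = requests.get(url, headers=headers)
    return r.json()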
