In this section we write a crawler that scrapes every article under all of a given CSDN user's columns.
Operating system: Windows 10 Professional
IDE: PyCharm Community 2022.3
Python interpreter: Python 3.8
Third-party libraries: requests, bs4
For how to install the third-party libraries, see the following article:
Python第三方库安装——使用vscode、pycharm安装Python第三方库 (installing Python third-party libraries from VS Code / PyCharm)
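If you prefer the command line, the usual pip commands (assuming pip points at your Python 3.8 interpreter) are:

pip install requests
pip install beautifulsoup4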
import requests
from bs4 import BeautifulSoup
import datetime  # imported in the original script but not used below

# Small debugging helper: print a value and stop the script immediately
def printk(value):
    print(value)
    exit()

user = "qq_53381910"
url = "https://blog.csdn.net/{}".format(user)
# Build the request headers (a browser User-Agent so CSDN serves the normal page)
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
# Send the GET request for the user's profile page
r = requests.get(url=url, headers=headers, timeout=5)
# Parse the response
soup = BeautifulSoup(r.text, 'html.parser')
# Get the username
div = soup.find("div", class_="user-profile-head-name")
username = div.text.split(" ")[0]
# Find the column links (the second "aside-common-box-content" box on the profile page holds the columns)
divs = soup.find_all("div", class_="aside-common-box-content")
div = divs[1]
lis = div.find_all("li")
titles = []
infos = {}
# Collect each column's link and name
for li in lis:
    # print("####")
    url = li.find("a").attrs['href']
    title = li.find("span").attrs['title']
    titles.append(title)
    infos[title] = {"url": url}
    # print("[+]" + title + url)

# Output file: "<username>的博文.md" ("<username>'s blog posts")
with open("{}的博文.md".format(username), "w+", encoding="utf-8") as f:
    # Scrape the articles under each column
    for title in titles:
        # if True:
        f.write("# {}".format(title.replace(" ", "").replace("\n", "")) + "\n")
        r = requests.get(url=infos[title]["url"], headers=headers, timeout=5)
        soup = BeautifulSoup(r.text, 'html.parser')
        lis = soup.find("ul", class_="column_article_list").find_all("li")
        print("###########{}##########\n".format(title))
        for li in lis:
            # print(li)
            # Article title, with whitespace stripped
            title = li.find("div", class_="column_article_desc").text.replace("\n", "").replace(" ", "")
            href = li.find("a").attrs['href']
            print(title, href)
            f.write("[{}]({})".format(title, href) + "\n")
            # Read counts etc. live in "column_article_data"; left unused here
            status = li.find("div", class_="column_article_data").find_all("span")
            # for statu in status:
            #     print(statu.text.replace("\n", "").replace(" ", ""))
The script also generates a markdown file named after the blogger; each entry is a clickable link that jumps to the corresponding article.
Replace user in the code with someone else's id to crawl their columns instead; qq_36232611, for example, is a certain "Nezha".
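Judging from the f.write calls above, the generated file should look roughly like this (column and article names here are placeholders):

# SomeColumnName
[FirstArticleTitle](https://blog.csdn.net/qq_53381910/article/details/xxxxxxxxx)
[SecondArticleTitle](https://blog.csdn.net/qq_53381910/article/details/xxxxxxxxx)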
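If you switch users often, a small tweak (a sketch, not part of the original script) is to read the id from the command line instead of editing the source each time:

import sys

# Hypothetical variant: pass the CSDN id as the first argument,
# e.g.  python csdn_columns.py qq_36232611  (script name is just an example)
user = sys.argv[1] if len(sys.argv) > 1 else "qq_53381910"
url = "https://blog.csdn.net/{}".format(user)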