Python: scraping and parsing post data from the cnblogs homepage with BeautifulSoup

Video tutorial: master Python web scraping in one day [basics], covering requests, BeautifulSoup, and Selenium.

We will scrape all the post information from the https://www.cnblogs.com/ homepage, including each post's title, URL, and author.

First fetch the page with requests, then parse it with bs4 (BeautifulSoup).

Reference code:

import requests
from bs4 import BeautifulSoup

url = "https://www.cnblogs.com/"

r = requests.get(url)

# Set the encoding of the response object
r.encoding = "utf-8"

# print(r.text)

soup = BeautifulSoup(r.text, 'lxml')

# Each post on the homepage is an <article class="post-item"> element
article_list = soup.select("article.post-item")
# print(article_list)

for article in article_list:
    print("==========")
    author = article.find("a", class_="post-item-author")
    print(author.get_text())
    link = article.find("a", class_="post-item-title")
    print(link.get_text())
    print(link.attrs["href"])
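Since the live site's layout can change (and may be unreachable when you run the script), the same selector logic can be verified offline against a small HTML snippet that imitates the cnblogs markup. The snippet below is an invented sample, not the real page, and it uses the built-in "html.parser" so no lxml install is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample mimicking the cnblogs homepage structure
html = """
<article class="post-item">
  <a class="post-item-title" href="https://www.cnblogs.com/demo/p/1.html">Hello BeautifulSoup</a>
  <a class="post-item-author">alice</a>
</article>
<article class="post-item">
  <a class="post-item-title" href="https://www.cnblogs.com/demo/p/2.html">Parsing posts</a>
  <a class="post-item-author">bob</a>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

posts = []
for article in soup.select("article.post-item"):
    title_link = article.find("a", class_="post-item-title")
    author = article.find("a", class_="post-item-author")
    posts.append({
        "title": title_link.get_text(strip=True),
        "url": title_link.attrs["href"],
        # Guard against posts where the author link is missing
        "author": author.get_text(strip=True) if author else None,
    })

print(posts)
```

Collecting the fields into a list of dicts (instead of printing them one by one) also makes it easy to write the results to JSON or CSV later.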
