1. Required Python libraries
BeautifulSoup is installed with pip install beautifulsoup4. If needed, you can also install the lxml and html5lib parsers; see their respective documentation for details.
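As a quick sanity check after installing, the sketch below parses a tiny made-up HTML fragment (illustrative only, not real Weibo markup). The built-in "html.parser" ships with Python; "lxml" and "html5lib" are the optional alternatives mentioned above, chosen for speed and leniency respectively:

```python
from bs4 import BeautifulSoup

# Tiny fragment standing in for a real page (fabricated for illustration)
html = '<div><span class="ctt">hello world</span></div>'

# "html.parser" needs no extra install; swap in "lxml" once it is installed
soup = BeautifulSoup(html, "html.parser")
span = soup.find('span', class_='ctt')
print(span.get_text())  # → hello world
```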
2. Scraping page content, using Weibo as an example
Note:
Weibo can actually be viewed in several forms, as covered in various tutorials. What we usually see is the desktop web version shown in the figure above, served from weibo.com. Switching to the weibo.cn version makes scraping considerably easier, because its markup is much simpler.
import re
import requests
from bs4 import BeautifulSoup

cookie = {"Cookie": "Your Cookie!!!!"}

def GetText_WB_COM(URL, filePath):
    # Fetch the page, sending the logged-in cookie along
    res = requests.get(URL, cookies=cookie)
    status_code = res.status_code
    if status_code // 100 != 2:
        return status_code
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.content, "lxml")
    # On weibo.cn, every post/comment body sits in a <span class="ctt">
    links = soup.find_all('span', class_='ctt')
    with open(filePath, 'w', encoding='utf8') as fp:
        fp.write(URL)
        fp.write("\n\n")
        for link in links:
            content = GetContent(link)
            if content:
                print(content)
                fp.write(content + "\n")

def GetContent(text):
    # Replies start with a "回复@user:" prefix, so take what follows the colon;
    # otherwise anchor on '>' so the match starts inside the tag
    if '回复' in str(text):
        result = re.findall(":(.*?)<", str(text))
    else:
        result = re.findall(">(.*?)<", str(text))
    if not result:
        return None
    value = result[0]
    return value if len(value) > 1 else None

print(GetText_WB_COM('https://weibo.cn/2656274875/JcnqqgXnC?refer_flag=1001030103_&page=2', 'GetText3.txt'))
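The regex extraction at the heart of GetContent can be exercised offline, without a cookie or a network request. The sketch below is a standalone variant of that logic (the non-reply branch here anchors on '>' so the match starts inside the tag, and the span strings are fabricated for illustration):

```python
import re

def extract(text):
    # Replies carry a "回复@user:" prefix; take the text after the colon.
    # For ordinary posts, capture the text between the tag's '>' and
    # the next '<' (the non-greedy .*? stops at the first one).
    if '回复' in text:
        result = re.findall(":(.*?)<", text)
    else:
        result = re.findall(">(.*?)<", text)
    return result[0] if result and len(result[0]) > 1 else None

reply = '<span class="ctt">回复@user:thanks a lot<br/></span>'
plain = '<span class="ctt">just a comment<br/></span>'
print(extract(reply))  # → thanks a lot
print(extract(plain))  # → just a comment
```

Note that a regex over str(tag) is fragile; since the span element is already in hand, link.get_text() would be the more robust BeautifulSoup-native way to read its text.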