Clarify the goal and the requirements, taking 剑来吧 (the "剑来" Tieba forum) as the example.
For this practice exercise, the plan is to scrape the title, link, author, posting time, and reply count of every post in 剑来吧. With the page open, the first thing to do is jump into the browser's developer tools and study the structure of the HTML document, looking for patterns.
Because the structure is hard to make out in the raw page source, paste the source into your editor and reformat it before inspecting further; it then becomes obvious that each post lives inside its own li tag:
<div class="t_con cleafix">
<div class="col2_left j_threadlist_li_left">
<span class="threadlist_rep_num center_text" title="回复">3span>
div>
<div class="col2_right j_threadlist_li_right ">
<div class="threadlist_lz clearfix">
<div class="threadlist_title pull_left j_th_tit ">
<a rel="noreferrer" href="/p/7330657625" title="老是讽刺大骊铁骑的,真的没脑子,他们是低品武夫啊" target="_blank" class="j_th_tit ">老是讽刺大骊铁骑的,真的没脑子,他们是低品武夫啊a>
div>
<div class="threadlist_author pull_right">
<span class="tb_icon_author " title="主题作者: nicloste" data-field='{"user_id":1184496600}' ><i class="icon_author">i><span class="frs-author-name-wrap"><a rel="noreferrer" data-field='{"un":"nicloste","id":"tb.1.13e78be1.hhmhKGZZ_b3PPalBqMlM-w"}' class="frs-author-name j_user_card " href="/home/main/?un=nicloste&ie=utf-8&id=tb.1.13e78be1.hhmhKGZZ_b3PPalBqMlM-w&fr=frs" target="_blank">niclostea>span><span class="icon_wrap icon_wrap_theme1 frs_bright_icons ">span> span>
<span class="pull-right is_show_create_time" title="创建时间">14:21span>
div>
div>
<div class="threadlist_detail clearfix">
<div class="threadlist_text pull_left">
<div class="threadlist_abs threadlist_abs_onlyline ">大骊铁骑很多都是三四境的武夫啊,打低境界的妖族没问题啊。 说大妖境界高的,浩然天下又不是没有高品级修士, 大妖不是来跟同级修士拼命的,是为了占地盘的,搬山老祖碰上大天师还不是逃了, 奥,就你厉害,在战场身先士卒都露大妖威风,就不怕冒出个齐静春苏子柳七把你碾压了,给其他大妖当了垫脚石? 都不动脑子吗div>
div>
<div class="threadlist_author pull_right">
<span class="tb_icon_author_rely j_replyer " title="最后回复人: 夏至几何">
<i class="icon_replyer">i>
<a rel="noreferrer" data-field='{"un":"\u963f\u767e\u963f\u5ea6","id":"tb.1.c8aa0b21.BBjYRnXI-gsLN13AFTZVSw"}' class="frs-author-name j_user_card " href="/home/main/?un=%E9%98%BF%E7%99%BE%E9%98%BF%E5%BA%A6&ie=utf-8&id=tb.1.c8aa0b21.BBjYRnXI-gsLN13AFTZVSw&fr=frs" target="_blank">夏至几何<img src="//tb1.bdstatic.com/tb/cms/nickemoji/1-25.png" class="nicknameEmoji" style="width:13px;height:13px"/>a> span>
<span class="threadlist_reply_date pull_right j_reply_data" title="最后回复时间">
14:28
span>
div>
div>
div>
div>
li>
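(A side note: instead of hand-formatting the pasted source, you can let BeautifulSoup lay it out for you. A minimal sketch, assuming the same first list page that the final script requests:)

# Minimal sketch: fetch the raw source and pretty-print it for inspection.
import requests
from bs4 import BeautifulSoup

url = "https://tieba.baidu.com/f?kw=%E5%89%91%E6%9D%A5&ie=utf-8&pn=0"
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
req.encoding = 'utf-8'
print(BeautifulSoup(req.text, 'lxml').prettify())  # indented, one tag per line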
Once you spot this, the goal is clear: each li tag corresponds to one post, and a look at the source shows that the li tag holds all the information we need, so all we have to do is pull out every li tag on each page and process them one by one.
from fake_useragent import UserAgent
import requests
from bs4 import BeautifulSoup
import codecs
def getPageHtml(url):
    headers = {'User-Agent': UserAgent().random}  # build a randomized request header
    try:
        req = requests.get(url, headers=headers)
        req.encoding = 'utf-8'  # set the page encoding
        return req.text  # return the HTML text
    except requests.RequestException:
        print("Failed to fetch the page!")
def get_Info(html):
    informations = []  # list holding one dict per post
    soup = BeautifulSoup(html, 'lxml')
    li_tags = soup.find_all('li', class_="j_thread_list clearfix thread_item_box")
    for li in li_tags:  # each li tag found is one post
        info = {}  # store the extracted fields in a dict
        try:
            a_tag = li.find('a', class_='j_th_tit')  # the title anchor also carries the link
            info['title'] = a_tag['title']
            info['link'] = "https://tieba.baidu.com" + a_tag['href']
            info['author'] = li.find(
                'span', class_='tb_icon_author'
            )['title'].replace('主题作者: ', '')  # drop the label prefix
            info['time'] = li.find(
                'span', class_='pull-right is_show_create_time'
            ).text.strip()
            info['replyNum'] = li.find(
                'span', class_='threadlist_rep_num center_text'
            ).text.strip()
            informations.append(info)
        except (AttributeError, TypeError, KeyError):
            print("Failed to extract a field from this tag!")
    print(informations)  # debug print
    return informations
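A quick driver along these lines exercises the two functions (a sketch; the URL is the first page of the post list, matching the base_url used in the final script):

# Test drive: fetch the first list page and print how many posts were parsed.
url = "https://tieba.baidu.com/f?kw=%E5%89%91%E6%9D%A5&ie=utf-8&pn=0"
posts = get_Info(getPageHtml(url))
print("Extracted %d posts" % len(posts))  # initially this printed 0 — the notes below explain why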
I ran into quite a few problems while debugging this. At first the printed list of information was always empty; after repeated debugging, the causes turned out to be the following:
1. When locating elements with BeautifulSoup, e.g. when extracting the li tags, the class attribute in the page source begins with a space. If you paste it into the code with that space intact, nothing is matched. The reason, as far as I can tell, is that when the string passed to class_ contains whitespace, BeautifulSoup compares it as an exact string against the tag's space-joined class list, so the extra leading space makes the comparison fail. It is a very easy point to overlook (see the sketch after the two code lines below).
The source:
<li class=" j_thread_list clearfix thread_item_box"
Code with the space left in (matches nothing):
li_tags = soup.find_all('li', class_=" j_thread_list clearfix thread_item_box")
Code with the space removed (works):
li_tags = soup.find_all('li', class_="j_thread_list clearfix thread_item_box")
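To see the exact-match behavior in isolation, here is a self-contained sketch (the one-item HTML snippet is made up for illustration):

# Self-contained demo of the class_ whitespace pitfall.
from bs4 import BeautifulSoup

html = '<ul><li class=" j_thread_list clearfix thread_item_box">post</li></ul>'
soup = BeautifulSoup(html, 'lxml')

# With whitespace in the class_ string, BeautifulSoup compares it as an
# exact string against the tag's space-joined class list.
print(len(soup.find_all('li', class_=" j_thread_list clearfix thread_item_box")))  # 0: the leading space breaks the match
print(len(soup.find_all('li', class_="j_thread_list clearfix thread_item_box")))   # 1: the joined string matches exactly
print(len(soup.find_all('li', class_="j_thread_list")))                            # 1: a single class matches membership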
2. Part of the HTML content is commented out in the page source. This is an extremely treacherous trap: unless you deal with it first, the extracted content will always be empty.
How to tell whether it is commented out: right-click the page and view its source, or look at the reformatted source; you will find that the post-list content is almost entirely wrapped in <!-- ... --> comment markers. The sketch below checks this programmatically.
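A minimal sketch (the request header is simplified here) that checks whether the li tags live inside comment nodes:

# Check whether the post list is hidden inside HTML comment nodes.
import requests
from bs4 import BeautifulSoup, Comment

url = "https://tieba.baidu.com/f?kw=%E5%89%91%E6%9D%A5&ie=utf-8&pn=0"
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
req.encoding = 'utf-8'
soup = BeautifulSoup(req.text, 'lxml')

comments = soup.find_all(string=lambda s: isinstance(s, Comment))
# If 'j_thread_list' shows up inside comment text, the parser is skipping
# exactly the content we want, which explains the empty results.
print(any('j_thread_list' in c for c in comments))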
So the fetched HTML needs to be fixed up as follows. Add one line of code that replaces the comment markers:
html = req.text.replace(r'<!--', '"').replace(r'-->', '"')
The getPageHtml function then becomes:
def getPageHtml(url):
    headers = {'User-Agent': UserAgent().random}  # build a randomized request header
    try:
        req = requests.get(url, headers=headers)
        req.encoding = 'utf-8'  # set the page encoding
        html = req.text
        html = html.replace(r'<!--', '"').replace(r'-->', '"')  # replace the comment markers so the content can be parsed
        return html  # return the HTML text
    except requests.RequestException:
        print("Failed to fetch the page!")
i = 0  # page counter

def writeInfo(infoList):
    global i  # the page counter lives at module level
    # open the txt file in append mode
    with codecs.open(r'剑来吧帖子爬取.txt', 'a+', 'utf-8') as f:
        for info in infoList:  # iterate over every dict in the list
            f.write(
                "Title:{}\tLink:{}\tAuthor:{}\tPosted:{}\tReplies:{}\n\n".format(
                    info['title'], info['link'], info['author'], info['time'], info['replyNum']
                )
            )
    i += 1
    print("Page %d written!" % i)
Finally, the complete code, wrapped in a class:
# -*- coding: utf-8 -*-
from fake_useragent import UserAgent
import requests
from bs4 import BeautifulSoup
import codecs
class tiebaSpider(object):
    def __init__(self, base_url, maxPage):
        self.headers = {'User-Agent': UserAgent().random}
        self.i = 0  # page counter
        self.main(base_url, maxPage)  # kick off the crawl on construction

    def getPageHtml(self, url):
        try:
            req = requests.get(url, headers=self.headers)
            req.encoding = 'utf-8'
            html = req.text
            html = html.replace(r'<!--', '"').replace(r'-->', '"')  # replace the comment markers
            return html
        except requests.RequestException:
            print("Failed to fetch the page!")
    def get_Info(self, html):
        informations = []  # list holding one dict per post
        soup = BeautifulSoup(html, 'lxml')
        li_tags = soup.find_all('li', class_="j_thread_list clearfix thread_item_box")
        for li in li_tags:  # each li tag found is one post
            info = {}  # store the extracted fields in a dict
            try:
                a_tag = li.find('a', class_='j_th_tit')  # the title anchor also carries the link
                info['title'] = a_tag['title']
                info['link'] = "https://tieba.baidu.com" + a_tag['href']
                info['author'] = li.find(
                    'span', class_='tb_icon_author'
                )['title'].replace('主题作者: ', '')  # drop the label prefix
                info['time'] = li.find(
                    'span', class_='pull-right is_show_create_time'
                ).text.strip()
                info['replyNum'] = li.find(
                    'span', class_='threadlist_rep_num center_text'
                ).text.strip()
                informations.append(info)
            except (AttributeError, TypeError, KeyError):
                print("Failed to extract a field from this tag!")
        print(informations)  # debug print
        return informations
    def writeInfo(self, infoList):  # write the results to the txt file
        with codecs.open(r'剑来吧帖子爬取.txt', 'a+', 'utf-8') as f:
            for info in infoList:
                f.write(
                    "Title:{}\tLink:{}\tAuthor:{}\tPosted:{}\tReplies:{}\n\n".format(
                        info['title'], info['link'], info['author'], info['time'], info['replyNum']
                    )
                )
        self.i += 1
        print("Page %d written!" % self.i)
    def main(self, base_url, maxPage):
        for p in range(0, maxPage):
            url = base_url + str(p * 50)  # pn steps by 50: one list page holds 50 posts
            html = self.getPageHtml(url)
            infoList = self.get_Info(html)
            self.writeInfo(infoList)
if __name__ == '__main__':
    base_url = "https://tieba.baidu.com/f?kw=%E5%89%91%E6%9D%A5&ie=utf-8&pn="
    tiebaSpider(base_url, 10)  # scrape the first 10 list pages
I'm just a beginner; this post is simply a record of my own learning~