This blog post exists only to record my learning process. If it infringes on anyone's rights, please let me know and it will be deleted immediately. Thank you!
多多看书 novel site: https://xiaoshuo.sogou.com/
1. Go to the 多多看书 site and pick a novel that is currently free for a limited time; here I chose 《龙武帝尊》. Open the novel's first chapter and press Ctrl+S to save the HTML.
2. Open the downloaded HTML in Notepad++ and locate the body text of the novel.
3. Fetch the HTML with requests
# -*- coding:UTF-8 -*-
import requests

if __name__ == '__main__':
    target = 'https://xiaoshuo.sogou.com/chapter/9027407560_295809430075282/'
    req = requests.get(url=target)
    print(req.text)
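If req.text comes back garbled, requests has probably guessed the response encoding wrong, which is common with Chinese pages. A small workaround (my addition; this particular page may not need it) is to let requests re-detect the encoding from the body:

    req = requests.get(url=target)
    req.encoding = req.apparent_encoding  # re-detect the encoding from the response body
    print(req.text)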
4. Parse the HTML with BeautifulSoup
Inspecting the HTML shows that the chapter title and the chapter body both sit inside a single div tag carrying the attribute class="paper-box paper-article".
Filtering the div tags on class="paper-box paper-article" gives the following code:
# -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
import requests

if __name__ == "__main__":
    target = 'https://xiaoshuo.sogou.com/chapter/9027407560_295809444117044/'
    req = requests.get(url=target)
    html = req.text
    # Parse the HTML
    bf_total = BeautifulSoup(html, 'html.parser')
    paper_texts = bf_total.find_all('div', class_="paper-box paper-article")
    print(paper_texts)
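One caveat: passing class_="paper-box paper-article" asks BeautifulSoup to match that exact class string, so a tag whose classes appeared in a different order would be missed. A CSS selector is order-insensitive; an equivalent lookup (an alternative, not what the rest of this post uses) would be:

    paper_texts = bf_total.select('div.paper-box.paper-article')  # matches both classes in any order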
Within the filtered result, the chapter title sits in an h1 tag and the body text in a div with id="contentWp".
5. Extract them one step further
# -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
import requests

if __name__ == "__main__":
    target = 'https://xiaoshuo.sogou.com/chapter/9027407560_295809430075282/'
    req = requests.get(url=target)
    html = req.text
    # Parse the HTML
    bf_total = BeautifulSoup(html, 'html.parser')
    paper_texts = bf_total.find_all('div', class_="paper-box paper-article")
    # Get the chapter title
    bf_paper = BeautifulSoup(str(paper_texts), 'html.parser')
    h1_texts = bf_paper.find_all('h1')
    print(h1_texts[0].string)
    # Get the body text
    contents_texts = bf_paper.find_all('div', id="contentWp")
    print(contents_texts[0].text)
Running this prints the chapter title followed by the body text.
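Re-parsing str(paper_texts) with a second BeautifulSoup works, but the first match is already a tag that can be searched directly. A slightly leaner variant (same output, just skipping the second parse):

    paper = paper_texts[0]
    print(paper.find('h1').string)                  # chapter title
    print(paper.find('div', id='contentWp').text)   # body text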
6. So far this fetches only a single chapter; how do we get all of them? Looking at the page, every chapter ends with a "next page" button, and clicking it jumps to the next chapter. Inspecting the downloaded HTML shows a div tag that contains exactly the information we need.
7. Extract it with BeautifulSoup
# -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
import requests

if __name__ == "__main__":
    target = 'https://xiaoshuo.sogou.com/chapter/9027407560_295809430075282/'
    req = requests.get(url=target)
    html = req.text
    # Parse the HTML
    bf_total = BeautifulSoup(html, 'html.parser')
    paper_texts = bf_total.find_all('div', class_="paper-box paper-article")
    # Get the chapter title
    bf_paper = BeautifulSoup(str(paper_texts), 'html.parser')
    h1_texts = bf_paper.find_all('h1')
    print(h1_texts[0].string)
    # Get the body text
    contents_texts = bf_paper.find_all('div', id="contentWp")
    # print(contents_texts[0].text)
    # Get the next-chapter link (relative, without the server address)
    next_chapter = bf_total.find_all('div', class_="paper-footer")
    next_chapter_bf = BeautifulSoup(str(next_chapter), 'html.parser')
    next_chapter_link = next_chapter_bf.find_all('a', class_="next")
    print(next_chapter_link[0].get('href'))
The next-chapter link is not a complete URL; it lacks the site address, so the server address has to be prepended to build a working link.
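Plain concatenation works here because the server address ends with a slash and the href is relative to the site root (as the final code below assumes), but urllib.parse.urljoin from the standard library handles relative links more generally; a sketch:

    from urllib.parse import urljoin

    href = next_chapter_link[0].get('href')
    next_url = urljoin(target, href)  # resolves the relative href against the current chapter URL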
When the last chapter is reached, there is no next chapter to follow, and crawling should stop there. Open the last chapter and compare its next-chapter link with the ones on earlier chapters.
# -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
import requests

if __name__ == "__main__":
    target = 'https://xiaoshuo.sogou.com/chapter/9027407560_295809444117044/'
    req = requests.get(url=target)
    html = req.text
    # Parse the HTML
    bf_total = BeautifulSoup(html, 'html.parser')
    paper_texts = bf_total.find_all('div', class_="paper-box paper-article")
    # Get the chapter title
    bf_paper = BeautifulSoup(str(paper_texts), 'html.parser')
    h1_texts = bf_paper.find_all('h1')
    print(h1_texts[0].string)
    # Get the body text
    contents_texts = bf_paper.find_all('div', id="contentWp")
    # print(contents_texts[0].text)
    # Get the next-chapter link (relative, without the server address)
    next_chapter = bf_total.find_all('div', class_="paper-footer")
    next_chapter_bf = BeautifulSoup(str(next_chapter), 'html.parser')
    next_chapter_link = next_chapter_bf.find_all('a', class_="next")
    print(next_chapter_link[0].get('href'))
On the last chapter, the next-chapter link is just a hash sign (#). In other words, once the next-chapter link is '#', we stop crawling.
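So the stop test can also be written directly against the href value instead of comparing URL lengths; a minimal sketch:

    href = next_chapter_link[0].get('href')
    if href == '#':
        print('Reached the last chapter, stop crawling.')
    else:
        target = 'https://xiaoshuo.sogou.com/' + href  # move on to the next chapter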
The complete code is as follows:
# -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
import requests

class downloader(object):
    def __init__(self):
        self.server = 'https://xiaoshuo.sogou.com/'
        self.first_chapter = 'https://xiaoshuo.sogou.com/chapter/9027407560_295809430075282/'
        self.next_chapter_url = self.first_chapter

    def download(self):
        target = self.next_chapter_url
        req = requests.get(url=target)
        html = req.text
        # Parse the HTML
        bf_total = BeautifulSoup(html, 'html.parser')
        paper_texts = bf_total.find_all('div', class_="paper-box paper-article")
        # Get the chapter title
        bf_paper = BeautifulSoup(str(paper_texts), 'html.parser')
        h1_texts = bf_paper.find_all('h1')
        chapter_name = h1_texts[0].string
        # Get the body text
        contents_texts = bf_paper.find_all('div', id="contentWp")
        chapter_contents = contents_texts[0].text
        # Get the next-chapter link (relative) and prepend the server address
        next_chapter = bf_total.find_all('div', class_="paper-footer")
        next_chapter_bf = BeautifulSoup(str(next_chapter), 'html.parser')
        next_chapter_link = next_chapter_bf.find_all('a', class_="next")
        self.next_chapter_url = self.server + next_chapter_link[0].get('href')
        return chapter_name, chapter_contents

if __name__ == "__main__":
    # Create a downloader
    dl = downloader()
    print('《龙武帝尊》 download started:')
    # Stop once the next link is just '#': the path after the server
    # address is then only 1 character, far shorter than any real chapter path
    while len(dl.next_chapter_url) - len(dl.server) > 5:
        chapter_name, chapter_contents = dl.download()
        with open('D:\\龙武帝尊.txt', 'a', encoding='utf-8') as f:
            f.write(chapter_name + '\n')
            f.writelines(chapter_contents)
            f.write('\n\n')
    print('《龙武帝尊》 download finished')
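One practical note: this loop fires requests back to back, which a site may throttle. A gentler variant of the main loop (my addition, not part of the original) pauses briefly between chapters and reports progress:

    import time

    while len(dl.next_chapter_url) - len(dl.server) > 5:
        chapter_name, chapter_contents = dl.download()
        print('Downloaded:', chapter_name)   # progress feedback per chapter
        with open('D:\\龙武帝尊.txt', 'a', encoding='utf-8') as f:
            f.write(chapter_name + '\n')
            f.writelines(chapter_contents)
            f.write('\n\n')
        time.sleep(1)                        # be polite to the server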