Downloading Novels with Python and Calibrating Chapter Names

Table of Contents

  • Overview
  • Code Example
  • Final Notes

Overview

  • Bookworms often like to read novels locally as txt files; some want to keep
    a collection, and some pages don't expose the actual chapter divisions.
    In those cases we can download the novel straight from the web page.
    When the source labels chapters but serves the text as numbered sections,
    a regex match-and-replace can calibrate the chapter names, as the sketch
    below shows.
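
For example, a minimal calibration sketch; the heading pattern and the sample
titles here are hypothetical, and a real page will need its own regex:

import re

def calibrate(title, index):
    # Rewrite a leading number such as '12.' or '12 ' into '第12章 ';
    # fall back to the running index when the title carries no number.
    m = re.match(r'^(\d+)[.、\s]*', title)
    num = m.group(1) if m else str(index)
    rest = re.sub(r'^\d+[.、\s]*', '', title)
    return '第' + num + '章 ' + rest

print(calibrate('12.风云突变', 12))  # -> 第12章 风云突变
print(calibrate('楔子', 1))          # -> 第1章 楔子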

Code Example

import re  # used by the optional chapter-name calibration
import threading
import time
import codecs
import requests_html
import pyperclip
import urllib.parse

'''
Copy the novel site's URL to the clipboard; pyperclip pastes it in, e.g.:
https://www.xxx.xxx

requests_html.py was patched at html = html.decode(DEFAULT_ENCODING,'replace')
so that the page encoding is gb18030.
'''
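# Note: an alternative to patching the library source (an untested sketch) is
# to override the encoding on each response before its first use, e.g.:
#     page = session.get(url)
#     page.html.encoding = 'gb18030'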
# CSS selectors for the chapter list, the book title, and the chapter body
elList = 'ul.clearfix'
elName = 'div.desc h1'
elContent = 'div#mycontent'
begin = time.perf_counter()

# Worker thread: thread st downloads the chapters at indices
# st, st + thread_count, st + 2 * thread_count, ...
class MyThread(threading.Thread):
    def __init__(self, threadID, name, st):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.st = st

    def run(self):
        print(self.st)
        threadget(self.st)

txtcontent = {}       # all chapter bodies, keyed by chapter index

chaptername = []      # chapter titles
chapteraddress = []   # chapter URLs

# Collect every chapter title and its URL from the chapter-list element
def getchapter(chapter_list):
    a_s = chapter_list.find('a')
    for i, a in enumerate(a_s):
        href = a.attrs['href']
        chap = a.text
        chaptername.append(chap)
        # Optional calibration: strip dots and spaces, then rewrite the
        # leading number as '第N章 ' (uses the re import above), e.g.:
        # chap = chap.replace('.', '')
        # chap = re.sub(r'^\d+', '第' + str(i + 1) + '章 ', chap)
        content_url = 'https://' + _ip + href
        chapteraddress.append(content_url)

# Fetch one chapter page and return its cleaned body text
def getdetail(url):
    page = session.get(url)
    page.encoding = 'utf-8'
    text = page.html.find(elContent, first=True).text
    # Strip the inline JS calls the site embeds in the body
    content = text.replace('app2();\nread2();', '')
    return content

# Each worker starts at index st and strides by thread_count,
# so the chapter list is split evenly across the threads
def threadget(st):
    total = len(chaptername)
    while st < total:
        url = chapteraddress[st]
        content = getdetail(url)
        txtcontent[st] = content
        st += thread_count

# Build the session; the target site is GBK-encoded, hence gb18030
requests_html.DEFAULT_ENCODING = "gb18030"
session = requests_html.HTMLSession()
url = pyperclip.paste()
# Send a GET request to the site
page = session.get(url)
page.encoding = 'utf-8'
chapter_list = page.html.find(elList, first=True)
name = page.html.find(elName, first=True).text
# Extract the host (domain or IP) from the URL
_ip = urllib.parse.urlsplit(url).netloc
getchapter(chapter_list)

thread_list = []
# thread_count = int(input('Number of threads to start: '))
thread_count = 16
# Create the worker threads
for i in range(thread_count):
    thread_list.append(MyThread(i, str(i), i))
# Start them all
for t in thread_list:
    t.daemon = False
    t.start()
# Wait for every worker to finish
for t in thread_list:
    t.join()
print('\nAll worker threads finished')
sorted_keys = sorted(txtcontent)  # chapter indices in reading order
file = codecs.open('D:/爬虫/待替换/' + name + '.txt', 'w', 'utf-8')  # local save path for the novel
chaptercount = len(chaptername)

# Write the chapters to the file in order
for ch in range(chaptercount):
    # title = '\n           第' + str(ch + 1) + '节  ' + '         \n\n'
    content = str(txtcontent[sorted_keys[ch]])
    file.write(content)
file.close()
end = time.perf_counter()
print('Download finished in', end - begin, 'seconds')
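
The in-loop calibration is left commented out above, so the saved txt can also
be calibrated afterwards. A minimal post-processing sketch, assuming the output
file path and a '第N节' heading pattern (both hypothetical):

import re

src = 'D:/爬虫/待替换/book.txt'  # hypothetical output file from the script above

with open(src, encoding='utf-8') as f:
    text = f.read()

# Rewrite headings like '第3节' into '第3章' so reader apps split chapters correctly
text = re.sub(r'第(\d+)节', r'第\1章', text)

with open(src, 'w', encoding='utf-8') as f:
    f.write(text)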


Final Notes

Code repository: my-py/爬虫/novel
