Scraping different novels calls for slightly different code.
Today's script goes from grabbing a single chapter ahead of time to grabbing the whole book ahead of time.
Once it's on our own computer, we can read it however we like.
The code:
import re
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"
}
mainUrl = "https://www.xzmncy.com/list/53005/"
mainText = requests.get(mainUrl, headers=headers).text
# Pull (href, title) pairs out of the chapter list; the pattern is a guess
# at the site's markup, adjust it to the actual page source as needed
my_re = re.findall('<dd><a href="(.*?)">(.*?)</a></dd>', mainText)
for info in my_re:
    url = "https://www.xzmncy.com" + info[0]
    response = requests.get(url, headers=headers).text
    # Grab the chapter body from its container div (pattern is a guess too)
    content = re.findall('<div id="chaptercontent".*?>(.*?)</div>', response, re.S)
    if content == []:
        continue
    con = content[0]
    # The site separates lines with <br/> tags
    con = con.replace("<br/><br/>", "\n")
    Mcontent = '\n\n' + info[1] + '\n\n' + con + "\n--=====================---------------\n"
    print(Mcontent)
    with open('逆命相师(全).txt', mode='a', encoding='utf-8') as f:
        f.write(Mcontent)
Sigh, the code is short, but the setbacks were constant: an error here, an error there, and some chapters get written out in a different way.
A walkthrough of the code:
1. This header isn't strictly needed, since the site is just a biquge mirror, hahaha!!!
headers = {
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"
}
2. This is the chapter-list URL; we request it to get the page text for parsing.
mainUrl = "https://www.xzmncy.com/list/53005/"
mainText = requests.get(mainUrl,headers=headers).text
3. Use a regular expression to pull out each chapter's URL and title.
my_re = re.findall('<dd><a href="(.*?)">(.*?)</a></dd>', mainText)
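A side note on this step: when the pattern contains two capture groups, re.findall returns a list of (group1, group2) tuples, which is why info[0] is the href and info[1] is the chapter title later on. A minimal sketch on made-up HTML (the tag layout here is illustrative, not the site's exact markup):

```python
import re

# Made-up snippet shaped like a chapter list
html = ('<dd><a href="/list/53005/1.html">Chapter 1</a></dd>'
        '<dd><a href="/list/53005/2.html">Chapter 2</a></dd>')

# Two capture groups per match, so findall returns (href, title) tuples
pairs = re.findall('<dd><a href="(.*?)">(.*?)</a></dd>', html)
print(pairs)  # [('/list/53005/1.html', 'Chapter 1'), ('/list/53005/2.html', 'Chapter 2')]
```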
4. Prepend the site root to build the full chapter URL, then request that page.
for info in my_re:
    url = "https://www.xzmncy.com" + info[0]
    response = requests.get(url, headers=headers).text
5. The two lines above are what actually fetch the content!
6. I added this check on the fly because the damn thing kept misbehaving; every case is a little different.
    content = re.findall('<div id="chaptercontent".*?>(.*?)</div>', response, re.S)
    if content == []:
        continue
7. content[0] is a string.
Just do a string replacement on it, and that's it.
Then write it out:
    con = content[0]
    con = con.replace("<br/><br/>", "\n")
    Mcontent = '\n\n' + info[1] + '\n\n' + con + "\n--=====================---------------\n"
    print(Mcontent)
    with open('逆命相师(全).txt', mode='a', encoding='utf-8') as f:
        f.write(Mcontent)
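One detail on the writing step: opening the file with mode='a' on every loop iteration and never closing it relies on interpreter exit to flush the data. A with block closes and flushes automatically. A small sketch with a throwaway filename and made-up chapter text:

```python
# Made-up chapter text using the same <br/> separators as the site
raw = "First line<br/><br/>Second line"
text = raw.replace("<br/><br/>", "\n")

# mode='a' appends; the with block closes and flushes the file automatically
with open('demo_chapter.txt', mode='a', encoding='utf-8') as f:
    f.write(text)

print(text)
```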
OK, that was yesterday's struggle. What follows is unrelated to the above: just a failed attempt.
import re
import os
import requests
from bs4 import BeautifulSoup
list_1=[]
list_2=[]
list_3=[]
list_4=[]
list_5=[]
list_6=[]
url = "https://b.faloo.com/1389943.html"
headers = {
"Cookie":"host4chongzhi=https%3a%2f%2fwww.faloo.com%2f; readline=1; fontFamily=1; fontsize=16; font_Color=666666; curr_url=https%3A//b.faloo.com/1389943.html; bgcolor=%23FFFFFE; vip_img_width=5",
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"
}
html = requests.get(url,headers=headers)
html=html.text
soup = BeautifulSoup(html,'html.parser')
all_ico=soup.find(class_="DivTable")
# novel title
title = soup.find(class_="fs23 colorHei")
title = title.string
#print(f"Novel title: {title}")
list_1.append(title)
# chapter titles
all_title_a = all_ico.find_all("a")
for i in all_title_a:
    s_title = i["title"]
    s_title = s_title[:-11:1]
    s_title = s_title[15::1]
    #print(s_title)
    list_2.append(s_title)
    # chapter URL
    href = i["href"]
    list_3.append(href)
    # e.g. https://b.faloo.com/1389943.html
    href_url = 'https:' + href
    list_4.append(href_url)
This one doesn't use regular expressions; it parses with BeautifulSoup instead: find the element by class, then find the "a" tags inside it, pull each one's "title" attribute into a list, then pull out the "href", join it into a full URL, and store that in a list too.
At that point the content was essentially in hand, but the errors just wouldn't stop, so I didn't continue. re is more convenient: it finds the content directly, and taking [0] from the resulting list gives the chapter text as a string. So I skipped this approach.
I also found that CSS selectors are quite easy to use.
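Since CSS selectors came up: BeautifulSoup's select() takes a CSS selector string directly, which can replace the find(class_=...) / find_all("a") chain above. A sketch on invented HTML (the DivTable class matches the code above, but the snippet itself is made up):

```python
from bs4 import BeautifulSoup

# Invented HTML in the same spirit as the chapter table
html = '''
<div class="DivTable">
  <a href="//b.faloo.com/1389943_1.html" title="Chapter 1">Chapter 1</a>
  <a href="//b.faloo.com/1389943_2.html" title="Chapter 2">Chapter 2</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# One CSS selector replaces find(class_="DivTable") + find_all("a")
links = soup.select('div.DivTable a')
urls = ['https:' + a['href'] for a in links]
print(urls)
```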
The following script grabs a single chapter ahead of time:
import requests
import re

url = "https://www.biqg.cc/book/6909/1.html"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"
}
response = requests.get(url, headers=headers).text
# Grab the chapter body; the pattern is a guess at the site's markup
content = re.findall('<div id="chaptercontent".*?>(.*?)</div>', response, re.S)[0]
# The site separates lines with <br /> tags
content = content.replace("<br /><br />", "\n")
print(content)
with open('人道大圣.txt', mode='a', encoding='utf-8') as f:
    f.write(content)
And here is another one:
import requests
import re

url = "https://www.xzmncy.com/list/53005/24924612.html"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"
}
response = requests.get(url, headers=headers).text
# Grab the chapter body; the pattern is a guess at the site's markup
content = re.findall('<div id="chaptercontent".*?>(.*?)</div>', response, re.S)[0]
# The site separates lines with <br/> tags
content = content.replace("<br/><br/>", "\n")
print(content)
with open('逆命相师.txt', mode='a', encoding='utf-8') as f:
    f.write(content)
The difference between them: one was for learning, the other was the real thing.
OK, that's a wrap. Time to start today's study.