Scraping a Novel Today: A Web-Scraping Exercise

Different novels (and different sites) need slight tweaks to the code.

Today's step up: from extracting a single chapter to extracting the whole book.

[Screenshot 1]

[Screenshot 2]

Now the novel sits on my own computer, and I can read it however I like.

The code:

import re
import requests

headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"
}
mainUrl = "https://www.xzmncy.com/list/53005/"
mainText = requests.get(mainUrl,headers=headers).text
# Grab (chapter URL, chapter title) pairs from the index page.
# NOTE: the blog's renderer ate the HTML tags inside the original regexes;
# the <dd><a ...> and <div id="chaptercontent"> patterns below are
# reconstructions of the usual biquge-style markup, not the verbatim original.
my_re = re.findall('<dd><a href="(.*?)">(.*?)</a></dd>', mainText)
for info in my_re:
    url = "https://www.xzmncy.com"+info[0]
    response = requests.get(url,headers=headers).text
    content = re.findall('<div id="chaptercontent"[^>]*>(.*?)</div>', response)
    if content == []:
        continue
    else:
        con = content[0]
        con = con.replace("<br/><br/>","\n")  # the <br/> pattern is likewise reconstructed
        Mcontent = '\n\n'+info[1]+'\n\n'+con+"\n--=====================---------------\n"
        print(Mcontent)
        f = open('逆命相师(全).txt', mode='a', encoding='utf-8')
        f.write(Mcontent)
        f.close()  # close each iteration so nothing is lost if a later request fails

So little code, yet the setbacks never stopped: an error here, an error there, and some chapters are written into the page a bit differently.

A walkthrough of the code:

1. This header isn't strictly needed, since the site is just a biquge clone, haha!

headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"
}
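If you want to verify that claim, here is a minimal sketch (assuming the site really doesn't filter the default requests User-Agent):

import requests

mainUrl = "https://www.xzmncy.com/list/53005/"
# No custom User-Agent at all; if the site doesn't check, this prints 200.
print(requests.get(mainUrl).status_code)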

2. This is the URL of the chapter index, which we fetch and then parse for the chapter list.

mainUrl = "https://www.xzmncy.com/list/53005/"
mainText = requests.get(mainUrl,headers=headers).text

3. Use a regular expression to find the chapter URLs and titles.

# The tag pattern here is a reconstruction; the blog stripped the original HTML.
my_re = re.findall('<dd><a href="(.*?)">(.*?)</a></dd>', mainText)
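Because the pattern has two capture groups, re.findall returns a list of (URL, title) tuples. A tiny illustration on a made-up line of index HTML (the markup and the chapter title are assumptions):

import re

sample = '<dd><a href="/list/53005/24924612.html">第1章</a></dd>'
pairs = re.findall('<dd><a href="(.*?)">(.*?)</a></dd>', sample)
print(pairs)  # [('/list/53005/24924612.html', '第1章')]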

4. Prepend the site root so each URL is properly formed, then fetch the chapter page.

for info in my_re:
    url = "https://www.xzmncy.com"+info[0]
    response = requests.get(url,headers=headers).text
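The plain concatenation works because every href in the list is root-relative; urljoin is a slightly safer alternative if that ever changes (a sketch, not part of the original script):

from urllib.parse import urljoin

# Handles relative, root-relative, and absolute hrefs alike.
url = urljoin("https://www.xzmncy.com", info[0])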


5. Those two lines above are what actually fetch the content!

6. This check was a last-minute addition because, damn it, there were problems: not every chapter page behaves the same.

# Again, the content-div pattern is reconstructed, not the verbatim original.
content = re.findall('<div id="chaptercontent"[^>]*>(.*?)</div>', response)
if content == []:
    continue
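One plausible reason for the empty matches: '.' doesn't match newlines by default, so a chapter body spread over several lines won't match at all. Passing re.S is worth a try (a guess at the cause, not a confirmed fix):

# re.S lets '.' cross newlines, so multi-line chapter bodies also match.
content = re.findall('<div id="chaptercontent"[^>]*>(.*?)</div>', response, re.S)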

7. content[0] is a single string. Do the string replacement on it, and then write the result to the file.

con = content[0]
con = con.replace("<br/><br/>","\n")  # <br/> tags reconstructed; the blog ate the markup
Mcontent = '\n\n'+info[1]+'\n\n'+con+"\n--=====================---------------\n"
print(Mcontent)
f = open('逆命相师(全).txt', mode='a', encoding='utf-8')
f.write(Mcontent)
f.close()

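Opening and closing the file on every chapter works, but a with block handles the closing automatically and is the more idiomatic form:

with open('逆命相师(全).txt', mode='a', encoding='utf-8') as f:
    f.write(Mcontent)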
OK, that was yesterday's struggle. What follows is unrelated to the above: a failed attempt.


import re
import os
import requests
from bs4 import BeautifulSoup

list_1=[]
list_2=[]
list_3=[]
list_4=[]
list_5=[]
list_6=[]

url = "https://b.faloo.com/1389943.html"
headers = {
    "Cookie":"host4chongzhi=https%3a%2f%2fwww.faloo.com%2f; readline=1; fontFamily=1; fontsize=16; font_Color=666666; curr_url=https%3A//b.faloo.com/1389943.html; bgcolor=%23FFFFFE; vip_img_width=5",
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"
}

html = requests.get(url,headers=headers)
html=html.text
soup = BeautifulSoup(html,'html.parser')
all_ico=soup.find(class_="DivTable")
# novel title
title = soup.find(class_="fs23 colorHei")
title = title.string
#print(f"小说名:{title}")
list_1.append(title)
# chapter titles
all_title_a = all_ico.find_all("a")
for i in all_title_a:
    s_title = i["title"]
    s_title = s_title[:-11:1]
    s_title = s_title[15::1]
    #print(s_title)
    list_2.append(s_title)
# chapter URLs
    href = i["href"]
    list_3.append(href)
    # https://b.faloo.com/1389943.html
    href_url = 'https:' + href
    list_4.append(href_url)

This one uses BeautifulSoup for parsing instead of regular expressions: find an element by class, then another by class, then all the "a" tags inside it; pull each "title" attribute into a list, pull out each "href", join it into a full URL, and store those in lists as well.

It had actually already gotten to the content, but the errors just would not stop, so I didn't push on. re is more convenient anyway: match the content directly, take element [0] of the result list, and you have the chapter text as a string. So I moved on.

I also found that CSS selectors make this very easy.
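For example, the chain of find calls above collapses into a single selector (a sketch reusing the soup object and class name from the failed attempt):

# One CSS selector replaces find(class_="DivTable") + find_all("a").
for a in soup.select(".DivTable a"):
    print(a["title"], a["href"])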


The next one extracts a single chapter:

import requests
import re

url = "https://www.biqg.cc/book/6909/1.html"
headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"
}

response = requests.get(url,headers=headers).text

# The tag in the original regex was stripped by the blog; the content div
# below is a reconstruction of the usual biquge markup.
content = re.findall('<div id="chaptercontent"[^>]*>(.*)', response)[0]
content = content.replace("<br /><br />","\n")  # exact <br> form is an assumption
print(content)
f = open('人道大圣.txt', mode='a', encoding='utf-8')
f.write(content)
f.close()

And the same again, for the other site:

import requests
import re

url = "https://www.xzmncy.com/list/53005/24924612.html"
headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"
}

response = requests.get(url,headers=headers).text
# Content-div pattern reconstructed, as above.
content = re.findall('<div id="chaptercontent"[^>]*>(.*?)</div>', response)[0]
content = content.replace("<br/><br/>","\n")
print(content)
f = open('逆命相师.txt', mode='a', encoding='utf-8')
f.write(content)
f.close()

The difference: one was practice, the other was the real thing.

OK, that's it. Time to start today's studying.
