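The user discussion board of Jinjiang Literature City (晋江文学城), commonly known as 兔区 (the "Rabbit Board"), is an anonymous forum devoted mainly to celebrity gossip.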

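1: Threads on this board have the following characteristics:

First: every reply in a thread displays exactly one id;
Second: within a single thread, the id of a given logged-in account stays fixed.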

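2: To count how many fixed ids appear in a thread, proceed as follows:

First: find out how many pages the thread has;
Second: find the id of every reply;
Third: deduplicate the ids of accounts that replied more than once.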
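
The third step can be sketched on its own. This is a minimal example, assuming the ids from all pages have already been collected into one list; the function name `distinct_ids` and the sample ids are hypothetical and do not come from a real thread.

```python
def distinct_ids(ids):
    """Return the ids with duplicates removed, in order of first appearance."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(ids))


# hypothetical ids collected from every page of one thread
all_ids = ['= =', 'abc', '= =', 'xyz', 'abc']
unique = distinct_ids(all_ids)
print(len(unique), unique)
```

Deduplicating over the whole thread rather than page by page matters here, because the same account can reply on several pages.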

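3: Analyzing the page:

First: find out how many pages the thread has:

Start by opening the first page of the thread.
Take this thread as an example (it is hard to find a thread that will not start a flame war; I figured an anime thread would be safer):
URL: http://bbs.jjwxc.net/showmsg.php?board=2&boardpagemsg=1&id=6577455
The first page shows "共5页" (5 pages in total), so we know the thread has 5 pages; all we need to do is extract that number.
To see where it lives in the markup, right-click and choose "View page source":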

1.PNG

2.PNG

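A thread with only one page does not have this pager element at all, so that case falls back to a page count of 1. The code is as follows: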

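#1: first find out how many pages the thread has
# (requests, BeautifulSoup, re, headers and cookies are set up in the complete code)
url = input("Enter the URL of the thread's first page: ")
req = requests.get(url, cookies=cookies, headers=headers)
text = req.content.decode('GB2312', 'ignore')
soup = BeautifulSoup(text, features='lxml')
try:
    page_top = soup.find(name='div', attrs={'id': 'pager_top'}).text
    # take the first number in the pager text as the page count
    page_count_text = re.findall(r'\d+', page_top)
    page_count = int(page_count_text[0])
except (AttributeError, IndexError):
    # no pager element, or no number in it: the thread has a single page
    page_count = 1
print('This thread has ' + str(page_count) + ' page(s)')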
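
The parsing step can also be tried without any network access. The sketch below is not the code above but a slightly stricter variant: it assumes the pager text contains the literal phrase "共N页" shown on the page and looks for that phrase instead of taking the first number it finds. The function name `parse_page_count` is hypothetical.

```python
import re


def parse_page_count(pager_text):
    """Return N from pager text such as '共5页'; default to 1 if absent."""
    match = re.search(r'共\s*(\d+)\s*页', pager_text or '')
    return int(match.group(1)) if match else 1


print(parse_page_count('共5页'))  # thread with several pages
print(parse_page_count(''))       # single-page thread: no pager text
```

Anchoring on the phrase guards against a pager whose text starts with other numbers, such as links to individual pages.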

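Second: find the fixed ids

The id is shown in the screenshot below; right-click and choose "View page source" to see the markup that holds it: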

3.PNG

4.PNG

The code is therefore as follows:

count_id = 1
dis_li = []  # distinct ids across all pages, in order of first appearance
for i in range(page_count):
    # the page parameter is assumed to be 0-based (page=0 is the first page)
    url_2 = url + '&page=' + str(i)
    req = requests.get(url_2, cookies=cookies, headers=headers)
    text = req.content.decode('GB2312', 'ignore')
    soup = BeautifulSoup(text, features='lxml')
    authorname = soup.find_all(name='td', attrs={'class': 'authorname'})
    id_list = []
    for td in authorname:
        tag = td.find(color="#999999")
        if tag is not None and tag.string:
            id_list.append(tag.string)
    # the same person may reply many times in one thread, so ids repeat and must be deduplicated
    for ev_id in id_list:
        if ev_id not in dis_li:
            dis_li.append(ev_id)
    time.sleep(5)  # pause between page requests
# print the deduplicated ids one by one
print('ids in this thread:')
for ev_id in dis_li:
    print(str(count_id) + '______' + ev_id)
    count_id = count_id + 1
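
Building the per-page URL by string concatenation works only when the URL the user enters has no `page` parameter yet. As a hedged alternative, the sketch below uses the standard library's `urllib.parse` to set the parameter explicitly. The helper name `page_url` is hypothetical, and the 0-based numbering of `page` is an assumption about the site, not something confirmed above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(first_page_url, page):
    """Return the thread URL with its 'page' query parameter set to `page`."""
    parts = urlsplit(first_page_url)
    # keep every existing parameter except a previous 'page' value
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != 'page']
    query.append(('page', str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


base = 'http://bbs.jjwxc.net/showmsg.php?board=2&boardpagemsg=1&id=6577455'
for page in range(3):  # assumed 0-based page numbering
    print(page_url(base, page))
```

With this helper, pasting the URL of a later page still produces correct URLs for all pages.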

Output of a run:

5.PNG

The complete code is as follows:

# -*- coding: utf-8 -*-

import re
import time

import requests
from bs4 import BeautifulSoup

# fill in your own browser's User-Agent string
headers = {'user-agent':
          'fill in yourself'
          }

# fill in your own cookies
cookies = {'cookies':
           'fill in yourself'
          }

#1: first find out how many pages the thread has
url = input("Enter the URL of the thread's first page: ")
req = requests.get(url, cookies=cookies, headers=headers)
text = req.content.decode('GB2312', 'ignore')
soup = BeautifulSoup(text, features='lxml')
try:
    page_top = soup.find(name='div', attrs={'id': 'pager_top'}).text
    # take the first number in the pager text as the page count
    page_count_text = re.findall(r'\d+', page_top)
    page_count = int(page_count_text[0])  # thread with more than one page
except (AttributeError, IndexError):
    page_count = 1  # no pager element: the thread has a single page
print('This thread has ' + str(page_count) + ' page(s)')



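#2: extract the id of every reply
count_id = 1
dis_li = []  # distinct ids across all pages, in order of first appearance
for i in range(page_count):
    # the page parameter is assumed to be 0-based (page=0 is the first page)
    url_2 = url + '&page=' + str(i)
    req = requests.get(url_2, cookies=cookies, headers=headers)
    text = req.content.decode('GB2312', 'ignore')
    soup = BeautifulSoup(text, features='lxml')
    authorname = soup.find_all(name='td', attrs={'class': 'authorname'})
    id_list = []
    for td in authorname:
        tag = td.find(color="#999999")
        if tag is not None and tag.string:
            id_list.append(tag.string)
    # the same person may reply many times in one thread, so ids repeat and must be deduplicated
    for ev_id in id_list:
        if ev_id not in dis_li:
            dis_li.append(ev_id)
    time.sleep(5)  # pause between page requests
# print the deduplicated ids one by one
print('ids in this thread:')
for ev_id in dis_li:
    print(str(count_id) + '______' + ev_id)
    count_id = count_id + 1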
