NGA has just enabled public display of users' IP locations. I had long wanted to take a census of the forum's regulars, so I spent several hours overnight writing an IP crawler to see where the active users of the 网事杂谈 (大漩涡) board are posting from.
First, configure the request headers:
import requests as req
from lxml import etree
import time
import re
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)  # the requests below use verify=False

headers = {
    # copy these values from the Network tab of your browser's dev tools (F12)
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.62',
    'Cookie': '',
    'Connection': 'close',
}
# API documentation: https://github.com/wolfcon/NGA-API-Documents
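Every lite=js endpoint used below wraps its payload in window.script_muti_get_var_store=, so each snippet strips that prefix by hand. Purely as an optional convenience (the code in this post does not use it), something like the following hypothetical helper could fold the GET, the polite delay and the prefix-stripping into one place:

def nga_get_lite(url, pause=1.0):
    """Hypothetical helper: GET an NGA lite=js URL and return the payload text
    with the 'window.script_muti_get_var_store=' prefix removed."""
    time.sleep(pause)  # be polite to the server
    resp = req.get(url, headers=headers, verify=False)
    return resp.text.replace('window.script_muti_get_var_store=', '')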
Next, crawl the first few pages of the 网事杂谈 board and collect the link (thread id) of every thread; the API parameters are documented in the repo linked above.
Use F12 to locate the matching element for the thread links (the XPath below may not be exact, adjust it as needed).
urls = [] # collected thread ids (tid)
limit = 5 # number of board pages to crawl; keep this small
for i in range(1, limit+1): # threads on the first `limit` pages of 网事杂谈, ordered by latest reply
    time.sleep(1)
    mainPage = req.get('https://bbs.nga.cn/thread.php?fid=-7&order_by=lastpostdesc&page='+str(i), headers=headers, verify=False)
    doc = etree.HTML(mainPage.text)
    pages_url = doc.xpath('//td[@class="c1"]/a') # thread link elements
    for pg in pages_url:
        r = re.search(r'[0-9]+', pg.attrib['href']).group() # thread id (tid) taken from the href
        urls.append(r)
        print('no.'+str(i)+' : '+str(r))
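For reference, the anchors matched by that XPath link to each thread's read.php page, so the first run of digits in the href is the thread id. A quick illustration with a made-up href:

example_href = '/read.php?tid=12345678'           # illustrative only, made-up tid
print(re.search(r'[0-9]+', example_href).group()) # -> 12345678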
Then deduplicate the collected thread ids:
urls = set(urls) # deduplicate threads; note that this shuffles the original order
urls = list(urls)
print(len(urls))
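If you would rather keep the original crawl order (the set round-trip above loses it), dict.fromkeys on Python 3.7+ deduplicates while preserving first-seen order:

urls = list(dict.fromkeys(urls)) # deduplicate, keeping first-seen order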
Next, fetch the first page of each thread (the default) through the lite=js interface, read the __ROWS field to compute how many pages the thread has, and then collect the user uid from every page. Deduplicating the uids is optional.
uid = []
for item in urls: # collect the user uids appearing in each thread
    time.sleep(1)
    page_url = 'https://bbs.nga.cn/read.php?tid='+str(item)+'&lite=js'
    mainPage = req.get(page_url, headers=headers, verify=False) # first page, used to read the page count
    txt = str(mainPage.text).replace('window.script_muti_get_var_store=','')
    Rows = re.findall(r'"__ROWS":[0-9]+', txt)
    if not Rows: # NGA occasionally returns a truncated js payload; skip those threads
        continue
    pageNum = int(int(Rows[0].replace('"__ROWS":',''))/20 + 1) # page count, assuming 20 entries per page
    print(str(item)+" pages: "+str(pageNum))
    if pageNum > 100: # skip threads longer than 100 pages
        continue
    for i in range(1, pageNum+1): # collect uids page by page
        u = page_url+'&page='+str(i)
        mainPage = req.get(u, headers=headers, verify=False)
        txt = str(mainPage.text).replace('window.script_muti_get_var_store=','')
        tmp = re.findall(r'"uid":[0-9]+', txt)
        for t in tmp[1:]: # skip the first "uid" match in the payload
            uid.append(t.replace('"uid":',''))
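As noted above, deduplicating the uids is optional: keeping duplicates weights the final statistics by how often each user posts, while deduplicating counts each user once. If you want one entry per user:

uid = list(set(uid)) # optional: count each user only once instead of once per post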
Finally, query each user's profile by uid and filter out the ipLoc field:
url = 'https://bbs.nga.cn/nuke.php?lite=js&__lib=ucp&__act=get&uid=' # user profile / IP location lookup
ips = [] # collected IP locations
nums = 0
for person in uid:
    time.sleep(0.1)
    person_page_url = url + person
    mainPage = req.get(person_page_url, headers=headers, verify=False)
    txt = str(mainPage.text).replace('window.script_muti_get_var_store=','')
    tmp = re.findall(r'"ipLoc":"[\u4e00-\u9fa5]+', txt) # grab the ipLoc value (Chinese characters)
    if not tmp: # NGA occasionally returns a truncated js payload; skip those users
        continue
    tmp = tmp[0].replace('"ipLoc":"','')
    ips.append(tmp)
    print(nums, tmp) # print the current index so the run can be resumed after a network interruption
    nums = nums+1
with open('.\\area.txt', mode='a', encoding='utf-8') as f: # append results to a file
    for i in ips:
        f.write(i+'\n')
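With area.txt written, a quick tally shows which IP locations dominate. A minimal sketch using collections.Counter (reading back the same file written above):

from collections import Counter

with open('.\\area.txt', encoding='utf-8') as f:
    locations = [line.strip() for line in f if line.strip()]
counts = Counter(locations)
for loc, n in counts.most_common(10): # ten most common IP locations
    print(loc, n)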
Feel free to visit my personal blog: chen0495.top
The blog's images are hosted on GitHub, so loading them may require a special network environment.