When learning web scraping, how could you not practice on Honor of Kings? There are plenty of CSDN posts about scraping it, which makes it convenient to learn: whenever I get stuck, I look at the experts' code. That is the fun of browsing CSDN.
https://pvp.qq.com/web201605/wallpaper.shtml
The gallery is paginated, but I could not find a next-page link anywhere in the HTML, no matter how I looked. After reading how other CSDN authors scraped this site, I went straight to capturing the network request and got a working version first.
http://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page=10&iOrder=0&iSortNumClose=1&jsoncallback=jQuery17106927574791770883_1525742053044&iAMSActivityId=51991&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&_=1525742856493
The link is long, so look at the parameters directly. They are easy to read: to request a different page, just pass a different number as the page parameter, with 0 being the first page.
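Rather than concatenating a giant query string, the same request can be built from a params dict (a sketch; the parameter names and values are copied from the captured URL above, with the jsoncallback wrapper and the `_` cache-buster omitted so the endpoint can return plain JSON):

```python
import requests

BASE = 'https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi'

def build_url(page):
    # Fixed parameters taken verbatim from the captured request; only `page` varies.
    params = {
        'activityId': 2735, 'sVerifyCode': 'ABCD', 'sDataType': 'JSON',
        'iListNum': 20, 'totalpage': 0, 'page': page,
        'iOrder': 0, 'iSortNumClose': 1, 'iAMSActivityId': 51991,
        '_everyRead': 'true', 'iTypeId': 2, 'iFlowId': 267733,
        'iActId': 2735, 'iModuleId': 2735,
    }
    # prepare() encodes the query string without sending anything over the network
    return requests.Request('GET', BASE, params=params).prepare().url
```

Passing `params=params` directly to `requests.get` works the same way; `build_url` is only split out here to show the resulting URL.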
'''
Scraping practice --- multithreaded download of Honor of Kings wallpapers
version: 01
author: 金鞍少年
date: 2020-04-29
'''
import os
import random

import requests
from concurrent.futures import ThreadPoolExecutor


class Wallpapers:
    def __init__(self, path):
        self.pool = ThreadPoolExecutor(10)  # thread pool with 10 workers
        self.path = path
        self.headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
            'referer': 'https://pvp.qq.com/web201605/wallpaper.shtml'}
        # Proxy IPs -- find some free proxies yourself; see my other posts for examples
        self.all_proxies = [
            {'http': '183.166.20.179:9999'}, {'http': '125.108.124.168:9000'},
            {'http': '182.92.113.148:8118'}, {'http': '163.204.243.51:9999'},
            {'http': '175.42.158.45:9999'}]

    # Fetch the JSON wallpaper list for one page
    def get_page(self, count):
        try:
            url = 'https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?' \
                  'activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0' \
                  '&page=' + str(count) + '&iOrder=0&iSortNumClose=1&iAMSActivityId=51991' \
                  '&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735' \
                  '&_=1582113303429'
            res = requests.get(url, headers=self.headers, proxies=random.choice(self.all_proxies))
            res.raise_for_status()  # raise an exception on HTTP error status codes
            return res.json()['List']
        except Exception as e:
            print('Request failed!', e)

    # Callback: write every wallpaper in one page's result to disk
    def write_data(self, future):
        lists_data = future.result()
        if not lists_data:  # get_page failed and returned None
            return
        for lis in lists_data:
            name = requests.utils.unquote(lis['sProdName'])  # URL-decode the name
            name = name.replace(':', '-')  # ':' is illegal in Windows paths
            dir_path = self.path + name + '/'
            if not os.path.exists(dir_path):  # one folder per wallpaper
                os.makedirs(dir_path)
            for i in range(1, 9):  # 8 image sizes per wallpaper
                # URL-decode; replacing '200' with '0' switches to the full-size image
                jpg_url = requests.utils.unquote(lis['sProdImgNo_{}'.format(i)]).replace('200', '0')
                img = requests.get(jpg_url)
                with open(dir_path + '%s.jpg' % i, 'wb') as f:
                    f.write(img.content)
                print('Image {} of wallpaper "{}" downloaded!'.format(i, name))

    # Core logic: submit one task per page, write results from a callback
    def fun(self, total_pages):
        for page in range(1, total_pages):  # note: page numbering starts at 0, so page 0 is skipped here
            try:
                self.pool.submit(self.get_page, page).add_done_callback(self.write_data)
            except Exception as e:
                print('Error:', e)
                continue


if __name__ == '__main__':
    g = Wallpapers(r'./res/王者荣耀/')
    g.fun(5)
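The thread-pool-plus-callback pattern used above (`submit` returns a `Future`, and `add_done_callback` fires once it completes) can be isolated in a minimal, network-free sketch; `fetch` and `handle` are dummy stand-ins for `get_page` and `write_data`:

```python
from concurrent.futures import ThreadPoolExecutor

results = []

def fetch(page):
    # Stand-in for get_page: pretend to download and parse one page.
    return 'page-%d' % page

def handle(future):
    # Runs in a worker thread when fetch() finishes; it receives the
    # Future object, so the actual return value comes from .result().
    results.append(future.result())

pool = ThreadPoolExecutor(4)
for page in range(1, 4):
    pool.submit(fetch, page).add_done_callback(handle)
pool.shutdown(wait=True)  # block until all tasks (and their callbacks) are done
```

This is why `write_data` starts with `future.result()`: the callback is handed the `Future`, not the list that `get_page` returned.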
Finally, a note on the two helpers used above: requests.utils.unquote(url) URL-decodes a string, and requests.utils.quote(url) URL-encodes one.
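A quick illustration of the pair (the sample strings here are made up for the demo, not taken from the real API response):

```python
import requests

# Fields like sProdName come back percent-encoded, so they must be decoded
# before being used as a folder name:
name = requests.utils.unquote('some%20wallpaper%20name')  # -> 'some wallpaper name'

# quote() is the inverse, useful when building URLs by hand;
# the round trip is lossless even for Chinese characters:
encoded = requests.utils.quote('王者荣耀')
assert requests.utils.unquote(encoded) == '王者荣耀'
```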