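In my spare time I went through some web-scraping courses. I was supposed to carry on with K8S, but after the scraping material I wanted some hands-on practice. I work in the games industry and was curious about TapTap's charts, so I scraped its hot chart.
This time I scraped the Android client of TapTap. The difference from the web version is that the responses returned to the mobile client contain no rank field (possibly I just don't know Fiddler well enough), so I had to add the rank myself.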
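Step 1. Install Fiddler. Download it from the official site; it is free.
Step 2. Install an Android emulator (I use the Nox emulator, 夜神模拟器), then point it at the Fiddler proxy.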
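I. Configure the Fiddler proxy
1. Click Tools-Fiddler Options to open the Fiddler Options page.
2. Click Connections, set "Fiddler listens on port" to 8888, and tick "Allow remote computers to connect".
3. Click OK. The proxy is now configured; restart Fiddler for the setting to take effect.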
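II. Configure the proxy in the Nox emulator
1. Open Settings and go to the Wi-Fi connection options.
2. Tap Wi-Fi to open the Wi-Fi options, then long-press the hotspot until the "Modify network" dialog appears.
3. Tap "Modify network", tick the advanced options, set the proxy to Manual, enter the computer's IP address as the proxy hostname and 8888 as the port.
4. Tap Save.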
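After that there is still the certificate question: to capture HTTPS traffic, the emulator has to trust Fiddler's root certificate.
Once Fiddler can capture the app's traffic, you are ready to start.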
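Step 3. The scraper
The screenshots show the emulator and the URL captured by Fiddler. Click the captured request in Fiddler,
take the response data shown below it, and paste it into json.cn to parse it and pick out the key fields.
Every game entry has much the same structure, so one game is enough as an example.
From the screenshot I can read the fields I need: game name, category, publisher, total downloads, score, follower count, post count, and so on.
Now the scraper itself can be written.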
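Start by analysing the requested URL: https://api.taptapdada.com/app-top/v1/hits?from=0&limit=10&X-UA=V%3D1%26PN%3DTapPad%26VN_CODE%3D9%26LOC%3DCN%26LANG%3Dzh_CN%26CH%3DPadEmu%26UID%3Dfe192707-bff8-4f72-8502-4a613f2a2322&type_name=android_pad_hot_cn
In the URL, from=0 is the zero-based chart position to start at, and limit=10 is how many entries are returned per request. While testing I found that limit cannot exceed 20; larger values return an error. To make the rank easy to compute, I set limit to 1 and loop from 0 up to 149, so each request returns exactly one game whose rank is from + 1.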
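The paging described above can be sketched as follows. This is only an illustration: `build_url`, `rank_of`, `BASE_URL` and `X_UA` are names made up for this sketch, and `rank_of` assumes the API returns entries in chart order, which the article does not confirm for pages larger than one.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.taptapdada.com/app-top/v1/hits"
# X-UA value from the captured request, in decoded form; urlencode() re-escapes it.
X_UA = ("V=1&PN=TapPad&VN_CODE=9&LOC=CN&LANG=zh_CN&CH=PadEmu"
        "&UID=fe192707-bff8-4f72-8502-4a613f2a2322")


def build_url(offset, limit=1):
    """Return the chart URL for entries offset .. offset + limit - 1."""
    query = urlencode({
        "from": offset,
        "limit": limit,
        "X-UA": X_UA,
        "type_name": "android_pad_hot_cn",
    })
    return f"{BASE_URL}?{query}"


def rank_of(offset, index):
    """Rank of the index-th item (0-based) on a page that starts at offset.

    Assumption: items come back in chart order.
    """
    return offset + index + 1


# One URL per rank, as in the article: from=0 .. from=149 with limit=1.
urls = [build_url(offset) for offset in range(150)]
```

With limit=1 the index is always 0, so `rank_of(offset, 0)` reduces to the `from + 1` rule used in the article. If the ordering assumption holds, the same helper would also allow larger pages (up to the limit of 20) and fewer requests.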
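Write the headers that imitate the app's request
The capture shows that this is a GET request and that some of the headers are unnecessary; time-related headers in particular can generally be left out.
Code: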
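def handle_request(url, data):
    # Headers copied from the request captured in Fiddler
    header = {
        "Host": "api.taptapdada.com",
        "Connection": "Keep-Alive",
        "Accept-Encoding": "gzip",
        "User-Agent": "okhttp/3.10.0",
    }
    # timeout keeps a stalled connection from hanging the script forever
    response = requests.get(url=url, headers=header, data=data, timeout=10)
    return response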
With the headers faked, test whether the request goes through:
def handle_index():
    # Current timestamp (used later for the record time)
    timestamp = time.time()
    # Fixed URL for this first test, so there is nothing to format in yet
    url = 'https://api.taptapdada.com/app-top/v1/hits?from=0&limit=10&X-UA=V%3D1%26PN%3DTapPad%26VN_CODE%3D9%26LOC%3DCN%26LANG%3Dzh_CN%26CH%3DPadEmu%26UID%3Dfe192707-bff8-4f72-8502-4a613f2a2322&type_name=android_pad_hot_cn'
    response = handle_request(url=url, data=None)
    # Parsing fails loudly here if the body is not JSON
    index_response_dict = json.loads(response.text)
    print(response.text)


handle_index()
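If running this returns data, the request works and you can move on.
The next step is extracting the data. I collect each game's rank, name, publisher, download count, category, score, follower count, review count, post count, and so on. The chart is presumably built from real-time or same-day downloads, yet the response contains no same-day download figure. Some games on the hot chart also show no download count at all. Games that are in a test-server phase always show 0 downloads. Beyond that there may be some rule I don't know about, because some very popular games show none either: Game for Peace (和平精英) ranks near the top of the chart but has 0 downloads.
The extraction code is as follows: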
# rank and onTime are set earlier in the loop (see the complete code)
for item in index_response_dict['data']['list']:
    game_rank_info = {}
    game_rank_info['排名'] = rank                      # rank
    game_rank_info['游戏名'] = item['title']            # game name
    game_rank_info['厂商'] = item['author']             # publisher
    game_rank_info['下载量'] = item['stat']['hits_total']  # total downloads
    game_rank_info['分类'] = item['category']           # category
    game_rank_info['分数'] = item['stat']['rating']['score']  # score
    game_rank_info['关注'] = item['stat']['fans_count']  # followers
    game_rank_info['新版本分数'] = item['stat']['rating']['latest_version_score']  # latest-version score
    game_rank_info['评论数'] = item['stat']['review_count']  # reviews
    game_rank_info['帖子数'] = item['stat']['topic_count']   # posts
    # Number of votes for each star rating, 5 down to 1
    game_rank_info['评分5'] = item['stat']['vote_info']['5']
    game_rank_info['评分4'] = item['stat']['vote_info']['4']
    game_rank_info['评分3'] = item['stat']['vote_info']['3']
    game_rank_info['评分2'] = item['stat']['vote_info']['2']
    game_rank_info['评分1'] = item['stat']['vote_info']['1']
    game_rank_info['时间'] = onTime                     # record time
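To try the extraction without hitting the API, the same field paths can be wrapped in a function and run against a hand-written item. This is a sketch only: `extract_game` is a hypothetical helper, every value in `sample_item` is made up (only the field layout follows the code above), and falling back to 0 or an empty string for missing fields is an assumption, not something the API is known to require.

```python
def extract_game(item, rank, on_time):
    """Flatten one entry of data.list into the record layout used above."""
    stat = item.get("stat", {})
    rating = stat.get("rating", {})
    votes = stat.get("vote_info", {})
    record = {
        "排名": rank,
        "游戏名": item.get("title", ""),
        "厂商": item.get("author", ""),
        "下载量": stat.get("hits_total", 0),
        "分类": item.get("category", ""),
        "分数": rating.get("score", 0),
        "关注": stat.get("fans_count", 0),
        "新版本分数": rating.get("latest_version_score", 0),
        "评论数": stat.get("review_count", 0),
        "帖子数": stat.get("topic_count", 0),
    }
    # Vote counts per star rating, stored as 评分5 .. 评分1
    for stars in ("5", "4", "3", "2", "1"):
        record["评分" + stars] = votes.get(stars, 0)
    record["时间"] = on_time
    return record


# Made-up sample: the values are invented, only the structure mirrors the article.
sample_item = {
    "title": "Example Game",
    "author": "Example Studio",
    "category": "Strategy",
    "stat": {
        "hits_total": 0,
        "fans_count": 1200,
        "review_count": 34,
        "topic_count": 5,
        "rating": {"score": "8.5", "latest_version_score": "8.1"},
        "vote_info": {"5": 20, "4": 8, "3": 3, "2": 1, "1": 2},
    },
}

record = extract_game(sample_item, rank=1, on_time="2019-07-01_15")
```

The keys are inserted in the same order as the columns of the insert statement used later, which matters because the original code relies on that order.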
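What remains is saving the data. I store it in MySQL because that makes it convenient for me to look through later; purely for writing the data, MongoDB would have been easier.
Based on the fields above, my table structure is: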
Field | Type |
---|---|
id | int(11) |
rank | varchar(10) |
game_name | varchar(50) |
author | varchar(50) |
download | varchar(20) |
category | varchar(50) |
score | varchar(10) |
fans_count | varchar(10) |
latest_version_score | varchar(10) |
review_count | varchar(10) |
topic_count | varchar(10) |
vote_info_5 | varchar(10) |
vote_info_4 | varchar(10) |
vote_info_3 | varchar(10) |
vote_info_2 | varchar(10) |
vote_info_1 | varchar(10) |
record_time | varchar(20) |
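A careful reader will notice that my handling of time is not very rigorous. For now it is just a record, so I will change the timestamp format when I actually need to.
Why did I make every column a varchar? Laziness: it makes inserting easier, haha. If the data needs processing later, Python can convert the types anyway.
Once the table exists, connect to the database and insert the data.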
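As a rough illustration of that later conversion, the sketch below turns one all-varchar row back into typed values. `parse_row`, `INT_COLUMNS` and `FLOAT_COLUMNS` are hypothetical names, and the split between integer and decimal columns is an assumption based on the column names rather than on confirmed API output.

```python
from datetime import datetime

# Assumed to hold whole numbers
INT_COLUMNS = ("rank", "download", "fans_count", "review_count", "topic_count",
               "vote_info_5", "vote_info_4", "vote_info_3", "vote_info_2",
               "vote_info_1")
# Assumed to hold decimal scores
FLOAT_COLUMNS = ("score", "latest_version_score")


def parse_row(row):
    """Convert a dict of column -> string into typed Python values."""
    typed = dict(row)
    for col in INT_COLUMNS:
        if col in row:
            typed[col] = int(row[col])
    for col in FLOAT_COLUMNS:
        if col in row:
            typed[col] = float(row[col])
    if "record_time" in row:
        # Same format string the scraper uses when it writes the row
        typed["record_time"] = datetime.strptime(row["record_time"], "%Y-%m-%d_%H")
    return typed


typed = parse_row({"rank": "1", "game_name": "Example Game",
                   "score": "8.5", "record_time": "2019-07-01_15"})
```

Columns not listed, such as game_name, pass through unchanged. A value that is not numeric would raise ValueError here, which is the cost of postponing type checks until read time.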
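conn = pymysql.connect(host="localhost", user="root", password="***", database="taptest", charset="utf8")
cursor = conn.cursor()
# `rank` is a reserved word in MySQL 8.0, so it is quoted with backticks
sql = "insert into TAPTAP(`rank`, game_name,author, download,category,score,fans_count,latest_version_score,review_count,topic_count,vote_info_5,vote_info_4,vote_info_3,vote_info_2,vote_info_1,record_time) " \
      "values (%s);"
query, params = dict2(game_rank_info, sql)
# Passing the values separately lets the driver quote them, so a quote
# character inside a game name cannot break the statement
cursor.execute(query, params)
conn.commit()
cursor.close()
conn.close()
The helper dict2 expands the single %s in the statement into one placeholder per value and returns the values alongside it:
def dict2(dic, sql):
    # Relies on the dict's insertion order matching the column order above
    placeholders = ', '.join(['%s'] * len(dic))
    return sql % placeholders, tuple(dic.values())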
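The insert depends on the dictionary's key order matching the column order written in the SQL text. One way to make that mapping explicit is sketched below. `COLUMN_MAP` and `build_insert` are hypothetical names, not part of the original script; the backticks are there because `rank` is a reserved word in MySQL 8.0.

```python
# Dictionary key -> table column, in the column order of the table
COLUMN_MAP = {
    "排名": "rank",
    "游戏名": "game_name",
    "厂商": "author",
    "下载量": "download",
    "分类": "category",
    "分数": "score",
    "关注": "fans_count",
    "新版本分数": "latest_version_score",
    "评论数": "review_count",
    "帖子数": "topic_count",
    "评分5": "vote_info_5",
    "评分4": "vote_info_4",
    "评分3": "vote_info_3",
    "评分2": "vote_info_2",
    "评分1": "vote_info_1",
    "时间": "record_time",
}


def build_insert(table, record):
    """Return (sql, params) for a parameterized insert of one record.

    The values are looked up by key, so the order of keys inside
    `record` no longer matters.
    """
    columns = ", ".join(f"`{col}`" for col in COLUMN_MAP.values())
    placeholders = ", ".join(["%s"] * len(COLUMN_MAP))
    sql = f"INSERT INTO `{table}` ({columns}) VALUES ({placeholders})"
    params = tuple(record[key] for key in COLUMN_MAP)
    return sql, params
```

The pair would then be passed to pymysql as `cursor.execute(sql, params)`, leaving the quoting of each value to the driver.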
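After running it, the result looks like this:
That is how the data from TapTap's mobile client gets scraped. If I do any further analysis later, I will post it when I have time for anyone who is interested.
The complete code: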
# coding=utf-8
import json
import time

import pymysql
import requests


def handle_request(url, data):
    # Headers copied from the request captured in Fiddler
    header = {
        "Host": "api.taptapdada.com",
        "Connection": "Keep-Alive",
        "Accept-Encoding": "gzip",
        "User-Agent": "okhttp/3.10.0",
    }
    # timeout keeps a stalled connection from hanging the script forever
    response = requests.get(url=url, headers=header, data=data, timeout=10)
    return response


def dict2(dic, sql):
    # Expand the single %s into one placeholder per value and return the
    # values separately, so the driver does the quoting.
    # Relies on the dict's insertion order matching the column order.
    placeholders = ', '.join(['%s'] * len(dic))
    return sql % placeholders, tuple(dic.values())


def handle_index():
    # Take the time once so every row of this run shares one record_time
    timestamp = time.time()
    #timestamp = index_response_dict['now']
    dateArray = time.localtime(timestamp)
    onTime = time.strftime("%Y-%m-%d_%H", dateArray)
    # `rank` is a reserved word in MySQL 8.0, so it is quoted with backticks
    sql = "insert into TAPTAP(`rank`, game_name,author, download,category,score,fans_count,latest_version_score,review_count,topic_count,vote_info_5,vote_info_4,vote_info_3,vote_info_2,vote_info_1,record_time) " \
          "values (%s);"
    # One connection for the whole run instead of one per game
    conn = pymysql.connect(host="localhost", user="***", password="**", database="***", charset="utf8")
    cursor = conn.cursor()
    page = 0
    while page < 150:
        url = 'https://api.taptapdada.com/app-top/v1/hits?from={}&limit=1&X-UA=V%3D1%26PN%3DTapPad%26VN_CODE%3D9%26LOC%3DCN%26LANG%3Dzh_CN%26CH%3DPadEmu%26UID%3Dfe192707-bff8-4f72-8502-4a613f2a2322&type_name=android_pad_hot_cn'.format(page)
        response = handle_request(url=url, data=None)
        index_response_dict = json.loads(response.text)
        # from is zero-based, so the rank is page + 1
        rank = page + 1
        page += 1
        #print(response.text)
        for item in index_response_dict['data']['list']:
            game_rank_info = {}
            game_rank_info['排名'] = rank
            #game_rank_info['id'] = item['id']
            game_rank_info['游戏名'] = item['title']
            game_rank_info['厂商'] = item['author']
            game_rank_info['下载量'] = item['stat']['hits_total']
            #game_rank_info['reserve_count'] = item['stat']['reserve_count']
            #game_rank_info['今日下载量'] = item['stat']['play_total']
            game_rank_info['分类'] = item['category']
            game_rank_info['分数'] = item['stat']['rating']['score']
            game_rank_info['关注'] = item['stat']['fans_count']
            game_rank_info['新版本分数'] = item['stat']['rating']['latest_version_score']
            game_rank_info['评论数'] = item['stat']['review_count']
            game_rank_info['帖子数'] = item['stat']['topic_count']
            game_rank_info['评分5'] = item['stat']['vote_info']['5']
            game_rank_info['评分4'] = item['stat']['vote_info']['4']
            game_rank_info['评分3'] = item['stat']['vote_info']['3']
            game_rank_info['评分2'] = item['stat']['vote_info']['2']
            game_rank_info['评分1'] = item['stat']['vote_info']['1']
            game_rank_info['时间'] = onTime
            query, params = dict2(game_rank_info, sql)
            cursor.execute(query, params)
            conn.commit()
        # Pause between requests to go easy on the server
        time.sleep(0.5)
    cursor.close()
    conn.close()


handle_index()