Scraping App Data

Installing and Configuring the Emulator and Fiddler 4

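We use the Nox emulator (夜神模拟器). Download it (download link) and install it, then launch it, open Settings, and enable root.
[Screenshot 1]
Download Fiddler (link). The official site is slow, so downloading from there is not recommended. Note the installation directory during setup, because the installer does not create a desktop shortcut.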
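Configuring Fiddler: click Tools and open Options.
[Screenshot 2]
On the General tab, tick the three options shown in the screenshot.
[Screenshot 3]
On the HTTPS tab, tick the options that allow HTTPS capture (in Fiddler 4 these are Capture HTTPS CONNECTs and Decrypt HTTPS traffic). If any dialogs pop up, accept them all.
[Screenshot 4]
Then click Actions on the right, choose Trust Root Certificate, and accept the dialogs that follow.
[Screenshot 5]
On the Connections tab, tick the options shown in the screenshot and note the port number. You can change the port freely as long as it does not conflict with another service. The emulator can only reach Fiddler if Allow remote computers to connect is enabled.
[Screenshot 6]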
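Browser configuration is not covered here. Launch the emulator, open the Android Settings app, and tap WLAN.
[Screenshot 7]
Long-press WiredSSID with the left mouse button and choose Modify network.
[Screenshot 8]
Set the proxy to Manual. The proxy hostname is your computer's IP address, and the port is the one configured in Fiddler.
[Screenshot 9]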
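If you are unsure which IP address to enter as the proxy hostname, the following is a minimal sketch for looking it up from Python. The function name `local_ip` and the probe address `8.8.8.8` are assumptions chosen for illustration; any routable address would do, and the result should be checked against what your operating system reports.

```python
import socket


def local_ip(probe_host="8.8.8.8", probe_port=80):
    """Return the IPv4 address of the interface used for outbound traffic."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only selects a route
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (for example, the machine is offline)
        return "127.0.0.1"
    finally:
        sock.close()
```

Calling `print(local_ip())` on the computer running Fiddler should print the address to type into the emulator's proxy settings. On machines with several network adapters, the emulator may need a different address than the one returned here.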
That completes the installation and configuration of Fiddler and the emulator. If the emulator cannot reach the internet, restart Fiddler.
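Before restarting anything, it can help to confirm whether Fiddler is actually accepting connections on the configured port. The sketch below is one way to check, using only the standard library; `proxy_is_listening` is a hypothetical helper name, and a successful connection shows only that something is listening on that port, not that it is Fiddler.

```python
import socket


def proxy_is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False
```

For example, `proxy_is_listening("192.168.1.10", 8888)` would test a hypothetical computer address with Fiddler's default port 8888; substitute your own IP and port. If it returns False while Fiddler is running, a firewall rule is another possible cause.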
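In the emulator, install the Wandoujia (豌豆荚) app store, which makes installing apps easier, then search for Douguo Meishi (豆果美食) and install it.
[Screenshot 10]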

Analyzing the App's API

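After you open the app, Fiddler shows a large number of captured requests. Feel free to explore them, but they are not our focus here.
[Screenshot 11]
Clear all sessions, then tap the spot shown in the screenshot. Fiddler captures one new request.
[Screenshot 12]
[Screenshot 13]
This is a POST request that returns JSON containing every subcategory of the 14 top-level categories.
[Screenshot 14]
Let's fetch these subcategories and put them in a queue: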

import requests
from queue import Queue  # thread-safe; the workers used later are threads, not processes

queue = Queue()
url = 'http://api.douguo.net/recipe/flatcatalogs'
# Version header sent with every request in this post
headers = {'version': '6954.2'}
response = requests.post(url=url, headers=headers, data={}, timeout=10)
response.raise_for_status()
response_json = response.json()
# The catalog is nested three levels deep under repeated 'cs' keys
count = 0
for caipu_catalog in response_json['result']['cs']:
    for food_catalog in caipu_catalog['cs']:
        for food in food_catalog['cs']:
            # Keep only the fields needed for the search step
            caipu = {'name': food['name'], 'id': food['id']}
            queue.put(caipu)
            count += 1
            print(count, food['name'])
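To experiment with the traversal without calling the API, the nested loops can be wrapped in a generator and run against a small sample. This is only a sketch: `iter_leaf_categories` is a hypothetical helper, and the `sample` dict is invented to mimic the three-level `cs` nesting described above, not real data from the service.

```python
def iter_leaf_categories(payload):
    """Yield a {'name', 'id'} dict for every third-level entry under 'cs'."""
    for top in payload["result"]["cs"]:
        for middle in top.get("cs", []):
            for leaf in middle.get("cs", []):
                yield {"name": leaf["name"], "id": leaf["id"]}


# Hypothetical sample shaped like the captured response (not real API data)
sample = {
    "result": {
        "cs": [
            {"name": "Vegetables", "cs": [
                {"name": "Fruiting vegetables", "cs": [
                    {"name": "Eggplant", "id": "1"},
                    {"name": "Tomato", "id": "2"},
                ]},
            ]},
            {"name": "Meat", "cs": [
                {"name": "Pork", "cs": [{"name": "Ribs", "id": "3"}]},
            ]},
        ]
    }
}

leaves = list(iter_leaf_categories(sample))
```

Here `leaves` holds three dicts, one per leaf category, in document order. Using `.get("cs", [])` for the inner levels means a category without children is skipped rather than raising `KeyError`.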

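Next, tap 茄子 (eggplant) and then 菜谱 (recipes). Fiddler captures four requests, and the last one is the one we need.
[Screenshot 15]
Looking at the request: it is a POST whose form data contains two important parameters, keyword and order, and the response is again JSON.
[Screenshot 16]
Write a function to fetch these recipes, where info is a dict taken from the queue: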

def dou_guo(info):
    # Search endpoint seen in Fiddler; headers is the dict defined earlier
    url1 = 'http://api.douguo.net/recipe/v2/search/0/20'
    data1 = {'keyword': info['name'],
             'order': 0}
    resp = requests.post(url=url1, headers=headers, data=data1, timeout=10)
    print(info['name'])
    print(resp.json())
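The search URL ends in `/0/20`. The post does not say what these two numbers mean; a plausible guess is a start offset followed by a page size, which would allow paging through results. The sketch below builds requests under that assumption only. `build_search_request` and `page_requests` are hypothetical helpers, and the guess should be verified by scrolling in the app and watching the URLs Fiddler captures.

```python
SEARCH_URL = "http://api.douguo.net/recipe/v2/search/{offset}/{limit}"


def build_search_request(keyword, offset=0, limit=20, order=0):
    """Return (url, form_data) for one page of search results.

    Assumption: the trailing path segments are a start offset and a page size.
    """
    url = SEARCH_URL.format(offset=offset, limit=limit)
    data = {"keyword": keyword, "order": order}
    return url, data


def page_requests(keyword, pages, limit=20):
    """Return the (url, form_data) pairs for the first `pages` pages."""
    return [
        build_search_request(keyword, offset=page * limit, limit=limit)
        for page in range(pages)
    ]
```

With the defaults, `build_search_request("茄子")` reproduces the URL and form data used in `dou_guo` above. Each pair can be passed to `requests.post(url, headers=headers, data=data)`.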

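Tap any recipe and look at the first request Fiddler captures. It is a POST, and the number at the end of the URL is the recipe id. Build the request URL from that id to fetch the recipe details.
[Screenshot 17]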

    # Continues inside dou_guo(); resp_json = resp.json() is the parsed search response
    for item in resp_json['result']['list']:
        try:
            each_caipu = {}
            each_caipu['author'] = item['r']['an']
            each_caipu['id'] = item['r']['id']
            each_caipu['cookstory'] = item['r']['cookstory']
            # The detail URL ends with the recipe id
            url2 = 'http://api.douguo.net/recipe/detail/' + str(each_caipu['id'])
            res = requests.post(url=url2, headers=headers, data={}, timeout=10)
        except (KeyError, TypeError, requests.RequestException):
            # Skip entries that lack the expected fields or whose request failed
            continue
        else:
            res_json = res.json()
            each_caipu['name'] = res_json['result']['recipe']['title']
            each_caipu['tips'] = res_json['result']['recipe']['tips']
            each_caipu['cook_step'] = res_json['result']['recipe']['cookstep']
            print(each_caipu)
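The `else` branch above indexes straight into the detail response, so one missing key would raise `KeyError`. A more forgiving alternative is sketched below. `dig` and `parse_detail` are hypothetical helpers, and the two sample dicts are invented for illustration; only the key names `result`, `recipe`, `title`, `tips`, and `cookstep` come from the code in this post.

```python
def dig(data, *keys, default=None):
    """Follow keys through nested dicts; return default if any step is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def parse_detail(res_json):
    """Extract the fields used in this post from a detail response."""
    return {
        "name": dig(res_json, "result", "recipe", "title"),
        "tips": dig(res_json, "result", "recipe", "tips", default=""),
        "cook_step": dig(res_json, "result", "recipe", "cookstep", default=[]),
    }


# Hypothetical responses: one complete, one missing optional fields
complete = {"result": {"recipe": {
    "title": "Braised eggplant",
    "tips": "Salt the slices first",
    "cookstep": [{"content": "Slice the eggplant"}],
}}}
partial = {"result": {"recipe": {"title": "Plain rice"}}}

parsed_complete = parse_detail(complete)
parsed_partial = parse_detail(partial)
```

With this approach a recipe that has no tips or steps still yields a record with empty values, and the caller can decide whether to store it.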

Complete Code

import pymongo
import requests
from concurrent.futures import ThreadPoolExecutor
from queue import Queue  # thread-safe queue; the workers below are threads

headers = {'version': '6954.2'}

# Create one client and reuse it; MongoClient is thread-safe
client = pymongo.MongoClient(host='your-mongodb-host', port=27017)
db_collection = client['dou_guo_mei_shi']['dou_guo_mei_shi_item']


def insert(item):
    # insert_one replaces Collection.insert, which was removed in PyMongo 4
    db_collection.insert_one(item)


def dou_guo(info):
    # Search for recipes matching the category name
    url1 = 'http://api.douguo.net/recipe/v2/search/0/20'
    data1 = {'keyword': info['name'],
             'order': 0}
    resp = requests.post(url=url1, headers=headers, data=data1, timeout=10)
    resp_json = resp.json()
    for item in resp_json['result']['list']:
        try:
            each_caipu = {}
            each_caipu['author'] = item['r']['an']
            each_caipu['id'] = item['r']['id']
            each_caipu['cookstory'] = item['r']['cookstory']
            # The detail URL ends with the recipe id
            url2 = 'http://api.douguo.net/recipe/detail/' + str(each_caipu['id'])
            res = requests.post(url=url2, headers=headers, data={}, timeout=10)
        except (KeyError, TypeError, requests.RequestException):
            # Skip entries that lack the expected fields or whose request failed
            continue
        else:
            res_json = res.json()
            each_caipu['name'] = res_json['result']['recipe']['title']
            each_caipu['tips'] = res_json['result']['recipe']['tips']
            each_caipu['cook_step'] = res_json['result']['recipe']['cookstep']
            print(each_caipu)
            insert(each_caipu)


# Fill the queue with every leaf category from the catalog
queue = Queue()
url = 'http://api.douguo.net/recipe/flatcatalogs'
response = requests.post(url=url, headers=headers, data={}, timeout=10)
response_json = response.json()
for caipu_catalog in response_json['result']['cs']:
    for food_catalog in caipu_catalog['cs']:
        for food in food_catalog['cs']:
            queue.put({'name': food['name']})

# The with block waits for all submitted tasks to finish before exiting
with ThreadPoolExecutor(max_workers=20) as pool:
    while not queue.empty():
        pool.submit(dou_guo, queue.get())
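One thing to be aware of with `pool.submit` is that an exception raised inside a worker is stored on the returned future and never printed unless the future is inspected. The sketch below shows one way to drain a queue through a thread pool while collecting both results and failures. `run_all` and `fake_worker` are hypothetical names; `fake_worker` stands in for `dou_guo` so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from queue import Queue


def run_all(task_queue, worker, max_workers=20):
    """Run worker on every queued item; return (results, errors)."""
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {}
        while not task_queue.empty():
            item = task_queue.get()
            futures[pool.submit(worker, item)] = item
        for future in as_completed(futures):
            item = futures[future]
            try:
                results.append((item, future.result()))
            except Exception as exc:
                # Record the failure instead of losing it silently
                errors.append((item, exc))
    return results, errors


def fake_worker(info):
    # Stand-in for dou_guo(): fails for one name to demonstrate error capture
    if info["name"] == "bad":
        raise ValueError("simulated failure")
    return len(info["name"])


tasks = Queue()
for name in ["eggplant", "bad", "tomato"]:
    tasks.put({"name": name})

results, errors = run_all(tasks, fake_worker, max_workers=3)
```

After this runs, `results` contains the two successful items and `errors` contains the failing item paired with its exception, so failed categories can be logged or retried.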

If you have any questions, feel free to get in touch.
