Python Crawler: Scraping Ctrip Flight Ticket Information

Prerequisites

IDE: PyCharm Community Edition
Python version: Python 3.7.4
Libraries used: json (built in), requests (needs installing), pymysql (needs installing)

pip install requests
pip install pymysql

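Steps

1. Inspect the Ctrip page
The flight data is not embedded in the HTML. Looking at the XHR requests shows that it arrives as JSON, so simply fetching the page is not enough: you need the URL, the request headers (headers), and the request payload (payload).
(1) The API endpoint: https://flights.ctrip.com/itinerary/api/12808/products
The endpoint cannot be opened directly as a web page, but it still exchanges JSON data.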
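(2) It is a POST request.
(3) The request headers (headers) need to be captured.
(4) The request carries parameters, so a request payload (payload) is needed.

The Referer request header is: https://flights.ctrip.com/itinerary/oneway/bjs-sha?date=2019-07-18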
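(5) The response can also be inspected, but when it is large the Firefox developer tools truncate it and cannot show it in full (my guess is that only responses under 1 MB are shown untruncated). The data can still be retrieved, though. (IE and Chrome have no such limit, but they do not lay the information out in a clear hierarchy.)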

2. Fetch the JSON data
Calling requests.post returns the JSON, that is, all of the data.

import requests
import json


if __name__ == "__main__":

    url = "https://flights.ctrip.com/itinerary/api/12808/products"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0",
        "Referer": "https://flights.ctrip.com/itinerary/oneway/bjs-sha?date=2019-07-18",
        "Content-Type": "application/json"
    }
    request_payload = {
        "flightWay": "Oneway",
        "classType": "ALL",
        "hasChild": False,
        "hasBaby": False,
        "searchIndex": 1,
        "airportParams": [
            {"dcity": "BJS", "acity": "SHA", "dcityname": "北京", "acityname": "上海", "date": "2019-07-18", "dcityid": 1, "acityid": 2}
        ]
    }

    # Send the POST request
    response = requests.post(url, data=json.dumps(request_payload), headers=headers).text
    print(response)
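
Printing the raw text gives one long line that is hard to read, which is the same problem as the browser panels described in step 1. A minimal sketch of re-indenting the JSON with the standard library follows; the helper name pretty_json and the sample string are made up for illustration and are not real Ctrip output.

```python
import json


def pretty_json(text: str) -> str:
    """Re-indent a JSON string so that each nesting level is easy to see."""
    # ensure_ascii=False keeps Chinese city names readable instead of \uXXXX escapes
    return json.dumps(json.loads(text), indent=2, ensure_ascii=False)


# Made-up stand-in for the response text; only the nesting mirrors the real data
sample_text = '{"data": {"routeList": [{"legs": [{"flight": {"departureAirportInfo": {"cityName": "北京"}}}]}]}}'

formatted = pretty_json(sample_text)
print(formatted)
```

In the script above, pretty_json(response) could be printed or written to a file instead of the raw text.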

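3. Extract the information
We only want certain fields, so we peel the response apart layer by layer and pick them out.
The fields we want sit at:
data -> routeList (loop over this) -> legs -> flight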

    # Send the POST request
    response = requests.post(url, data=json.dumps(request_payload), headers=headers).text
    # print(response)
    # The response holds many flights; pull out the list of routes
    routeList = json.loads(response).get('data').get('routeList')
    # print(routeList)
    # Read each route in turn
    for route in routeList:
        # Some routes carry no leg data, which would raise an error, so fall back to an empty list
        legs = route.get('legs') or []
        if len(legs) == 1:
            flight = legs[0].get('flight')
            # Extract the fields we want
            airlineName = flight.get('airlineName')
            flightNumber = flight.get('flightNumber')
            departureDate = flight.get('departureDate')
            arrivalDate = flight.get('arrivalDate')
            departureCityName = flight.get('departureAirportInfo').get('cityName')
            departureAirportName = flight.get('departureAirportInfo').get('airportName')
            arrivalCityName = flight.get('arrivalAirportInfo').get('cityName')
            arrivalAirportName = flight.get('arrivalAirportInfo').get('airportName')

            print(airlineName, "\t",
                  flightNumber, "\t",
                  departureDate, "\t",
                  arrivalDate, "\t",
                  departureCityName, "\t",
                  departureAirportName, "\t",
                  arrivalCityName, "\t",
                  arrivalAirportName)
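
The same walk through data -> routeList -> legs -> flight can be wrapped in a function that returns the fields rather than printing them, and that tolerates missing levels. This is only a sketch: extract_flights is a hypothetical helper name, and the sample dictionary below is invented to mimic the nesting, not captured from Ctrip.

```python
def extract_flights(payload: dict) -> list:
    """Return one summary dict per single-leg route, skipping incomplete entries."""
    flights = []
    # "or {}" / "or []" guard against keys that are missing or set to None
    route_list = (payload.get("data") or {}).get("routeList") or []
    for route in route_list:
        legs = route.get("legs") or []
        if len(legs) != 1:
            continue
        flight = legs[0].get("flight") or {}
        departure = flight.get("departureAirportInfo") or {}
        arrival = flight.get("arrivalAirportInfo") or {}
        flights.append({
            "airlineName": flight.get("airlineName"),
            "flightNumber": flight.get("flightNumber"),
            "departureDate": flight.get("departureDate"),
            "arrivalDate": flight.get("arrivalDate"),
            "departureCityName": departure.get("cityName"),
            "departureAirportName": departure.get("airportName"),
            "arrivalCityName": arrival.get("cityName"),
            "arrivalAirportName": arrival.get("airportName"),
        })
    return flights


# Invented sample data: the values are placeholders, not real flights
sample = {
    "data": {
        "routeList": [
            {"legs": [{"flight": {
                "airlineName": "Example Air",
                "flightNumber": "EX123",
                "departureAirportInfo": {"cityName": "北京", "airportName": "Airport A"},
                "arrivalAirportInfo": {"cityName": "上海", "airportName": "Airport B"},
            }}]},
            {"legs": None},      # no leg data: skipped
            {"legs": [{}, {}]},  # more than one leg: skipped
        ]
    }
}

flights = extract_flights(sample)
for item in flights:
    print(item["airlineName"], item["flightNumber"], item["departureCityName"], item["arrivalCityName"])
```

Returning a list of dicts keeps the extraction separate from the output, so the rows could later be printed, filtered, or stored.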


Complete code

import requests
import json


if __name__ == "__main__":

    url = "https://flights.ctrip.com/itinerary/api/12808/products"
    # Referer = "https://flights.ctrip.com/itinerary/oneway/bjs-sha?date=2019-07-18"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0",
        "Referer": "https://flights.ctrip.com/itinerary/oneway/bjs-sha?date=2019-07-18",
        "Content-Type": "application/json"
    }
    request_payload = {
        "flightWay": "Oneway",
        "classType": "ALL",
        "hasChild": False,
        "hasBaby": False,
        "searchIndex": 1,
        "airportParams": [
            {"dcity": "BJS", "acity": "SHA", "dcityname": "北京", "acityname": "上海", "date": "2019-07-18", "dcityid": 1, "acityid": 2}
        ]
    }

    # Send the POST request
    response = requests.post(url, data=json.dumps(request_payload), headers=headers).text
    # print(response)
    # The response holds many flights; pull out the list of routes
    routeList = json.loads(response).get('data').get('routeList')
    # print(routeList)
    # Read each route in turn
    for route in routeList:
        # Some routes carry no leg data, which would raise an error, so fall back to an empty list
        legs = route.get('legs') or []
        if len(legs) == 1:
            flight = legs[0].get('flight')
            # Extract the fields we want
            airlineName = flight.get('airlineName')
            flightNumber = flight.get('flightNumber')
            departureDate = flight.get('departureDate')
            arrivalDate = flight.get('arrivalDate')
            departureCityName = flight.get('departureAirportInfo').get('cityName')
            departureAirportName = flight.get('departureAirportInfo').get('airportName')
            arrivalCityName = flight.get('arrivalAirportInfo').get('cityName')
            arrivalAirportName = flight.get('arrivalAirportInfo').get('airportName')

            print(airlineName, "\t",
                  flightNumber, "\t",
                  departureDate, "\t",
                  arrivalDate, "\t",
                  departureCityName, "\t",
                  departureAirportName, "\t",
                  arrivalCityName, "\t",
                  arrivalAirportName)

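Summary and next steps

1. The data is in JSON format and is fetched with a POST request.
2. The information is filtered layer by layer, looping where necessary.

The next step is to make the program foolproof. As written, it can only fetch the data for the one hard-coded request. To automate it and adapt it to different queries, the following still need to be made configurable:
(1) Trip type: one-way, round-trip, multi-city
(2) Specifics: origin and destination, date, ID numbers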
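
As a first step toward that, the one-way payload can be built by a function instead of being hard-coded. The sketch below rests on assumptions: build_oneway_payload and the CITIES table are hypothetical names, the table holds only the two cities that appear in the example request, and the payload values for round-trip or multi-city searches are not shown in this article, so they are left out.

```python
# City details copied from the example payload above. Any other city's name and
# id would have to be read from the browser's network panel (assumption).
CITIES = {
    "BJS": {"name": "北京", "id": 1},
    "SHA": {"name": "上海", "id": 2},
}


def build_oneway_payload(dcity: str, acity: str, date: str) -> dict:
    """Build the request payload for a one-way search between two known cities."""
    departure, arrival = CITIES[dcity], CITIES[acity]
    return {
        "flightWay": "Oneway",
        "classType": "ALL",
        "hasChild": False,
        "hasBaby": False,
        "searchIndex": 1,
        "airportParams": [
            {
                "dcity": dcity,
                "acity": acity,
                "dcityname": departure["name"],
                "acityname": arrival["name"],
                "date": date,
                "dcityid": departure["id"],
                "acityid": arrival["id"],
            }
        ],
    }


request_payload = build_oneway_payload("BJS", "SHA", "2019-07-18")
print(request_payload)
```

Called with the values from the example, this reproduces the hard-coded request_payload used in the complete code.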

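Issue

The problem turned out to be the Referer header, https://flights.ctrip.com/itinerary/oneway/bjs-sha?date=2019-07-18. The site changed its code: if you access the parent URL, a token is added, so the Referer can no longer be shortened to https://flights.ctrip.com/itinerary. Writing out the full URL fixes it.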
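
Since the full Referer is required, it can be generated from the same route and date as the search rather than typed by hand. This is a sketch under one assumption: that the URL pattern of the single example above (lowercase city codes joined by a hyphen, plus a date query parameter) holds for other one-way searches. build_referer and build_headers are hypothetical helper names.

```python
def build_referer(dcity: str, acity: str, date: str) -> str:
    """Build the full one-way Referer URL, following the pattern of the example."""
    # Assumption: other routes use the same path layout as bjs-sha
    return "https://flights.ctrip.com/itinerary/oneway/{}-{}?date={}".format(
        dcity.lower(), acity.lower(), date
    )


def build_headers(dcity: str, acity: str, date: str) -> dict:
    """Return the request headers used in this article with a matching Referer."""
    return {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0",
        "Referer": build_referer(dcity, acity, date),
        "Content-Type": "application/json",
    }


headers = build_headers("BJS", "SHA", "2019-07-18")
print(headers["Referer"])
```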
