IDE: PyCharm Community Edition
Python version: Python 3.7.4
Libraries used: json (standard library), requests (installed with pip), pymysql (installed with pip)
pip install requests
pip install pymysql
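1. Inspect the Ctrip flight data
The flight data is not embedded in the HTML. Looking at the XHR requests shows that it arrives as JSON, so simply fetching the page is not enough: we need the URL, the request headers, and the request payload.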
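(1) The API endpoint: https://flights.ctrip.com/itinerary/api/12808/products
The endpoint cannot be opened as a page in the browser, but it does exchange JSON data.
(2) The request is a POST.
(3) We need the request headers.
(4) The Params tab is not empty, which means the request carries a payload.
The Referer request header is: https://flights.ctrip.com/itinerary/oneway/bjs-sha?date=2019-07-18
(5) The response can be inspected as well, but when it is too large the Firefox developer tools truncate it (my guess is that only responses under about 1 MB are shown in full). The full data can still be retrieved. (IE and Chrome have no such limit, but they do not lay the data out as a cleanly nested tree.)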
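Since the browser may truncate a large response, one way to inspect it is to re-indent the JSON text locally and open it in an editor. The following is a minimal sketch using only the standard json module; the helper names pretty_json and save_pretty_json and the sample string are made up for illustration and are not part of the original script.

```python
import json


def pretty_json(text):
    """Re-serialize a JSON string with indentation, keeping Chinese text readable."""
    return json.dumps(json.loads(text), indent=2, ensure_ascii=False)


def save_pretty_json(text, path):
    """Write the indented JSON to a UTF-8 file so it can be browsed in an editor."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(pretty_json(text))


if __name__ == "__main__":
    # Stand-in for the response text (hypothetical); the real response is much larger.
    sample = '{"data": {"routeList": [{"legs": [{"flight": {"airlineName": "example"}}]}]}}'
    print(pretty_json(sample))
```

Passing ensure_ascii=False keeps city and airport names as Chinese characters instead of \u escapes, which makes the nesting easier to read.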
2. Fetch the JSON data
Calling requests.post returns the JSON, i.e. all of the data:
import requests
import json

if __name__ == "__main__":
    url = "https://flights.ctrip.com/itinerary/api/12808/products"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0",
        "Referer": "https://flights.ctrip.com/itinerary/oneway/bjs-sha?date=2019-07-18",
        "Content-Type": "application/json"
    }
    request_payload = {
        "flightWay": "Oneway",
        "classType": "ALL",
        "hasChild": False,
        "hasBaby": False,
        "searchIndex": 1,
        "airportParams": [
            {"dcity": "BJS", "acity": "SHA", "dcityname": "北京", "acityname": "上海", "date": "2019-07-18", "dcityid": 1, "acityid": 2}
        ]
    }
    # Send the POST request with the payload serialized as JSON
    response = requests.post(url, data=json.dumps(request_payload), headers=headers).text
    print(response)
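3. Extract the information
We only want certain fields, so we peel the response apart layer by layer and pull out what we need.
The fields we want are located at:
data -> routeList (loop over this) -> legs -> flight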
    # Send the POST request
    response = requests.post(url, data=json.dumps(request_payload), headers=headers).text
    # print(response)
    # The response holds many flights; take the list of routes
    routeList = json.loads(response).get('data').get('routeList')
    # print(routeList)
    # Read each route in turn
    for route in routeList:
        # Only handle routes with exactly one leg; other entries can raise errors
        if len(route.get('legs')) == 1:
            legs = route.get('legs')
            flight = legs[0].get('flight')
            # Extract the fields we want
            airlineName = flight.get('airlineName')
            flightNumber = flight.get('flightNumber')
            departureDate = flight.get('departureDate')
            arrivalDate = flight.get('arrivalDate')
            departureCityName = flight.get('departureAirportInfo').get('cityName')
            departureAirportName = flight.get('departureAirportInfo').get('airportName')
            arrivalCityName = flight.get('arrivalAirportInfo').get('cityName')
            arrivalAirportName = flight.get('arrivalAirportInfo').get('airportName')
            print(airlineName, "\t",
                  flightNumber, "\t",
                  departureDate, "\t",
                  arrivalDate, "\t",
                  departureCityName, "\t",
                  departureAirportName, "\t",
                  arrivalCityName, "\t",
                  arrivalAirportName)
Complete code:
import requests
import json

if __name__ == "__main__":
    url = "https://flights.ctrip.com/itinerary/api/12808/products"
    # Referer = "https://flights.ctrip.com/itinerary/oneway/bjs-sha?date=2019-07-18"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0",
        "Referer": "https://flights.ctrip.com/itinerary/oneway/bjs-sha?date=2019-07-18",
        "Content-Type": "application/json"
    }
    request_payload = {
        "flightWay": "Oneway",
        "classType": "ALL",
        "hasChild": False,
        "hasBaby": False,
        "searchIndex": 1,
        "airportParams": [
            {"dcity": "BJS", "acity": "SHA", "dcityname": "北京", "acityname": "上海", "date": "2019-07-18", "dcityid": 1, "acityid": 2}
        ]
    }
    # Send the POST request
    response = requests.post(url, data=json.dumps(request_payload), headers=headers).text
    # print(response)
    # The response holds many flights; take the list of routes
    routeList = json.loads(response).get('data').get('routeList')
    # print(routeList)
    # Read each route in turn
    for route in routeList:
        # Only handle routes with exactly one leg; other entries can raise errors
        if len(route.get('legs')) == 1:
            legs = route.get('legs')
            flight = legs[0].get('flight')
            # Extract the fields we want
            airlineName = flight.get('airlineName')
            flightNumber = flight.get('flightNumber')
            departureDate = flight.get('departureDate')
            arrivalDate = flight.get('arrivalDate')
            departureCityName = flight.get('departureAirportInfo').get('cityName')
            departureAirportName = flight.get('departureAirportInfo').get('airportName')
            arrivalCityName = flight.get('arrivalAirportInfo').get('cityName')
            arrivalAirportName = flight.get('arrivalAirportInfo').get('airportName')
            print(airlineName, "\t",
                  flightNumber, "\t",
                  departureDate, "\t",
                  arrivalDate, "\t",
                  departureCityName, "\t",
                  departureAirportName, "\t",
                  arrivalCityName, "\t",
                  arrivalAirportName)
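The chained .get() calls above raise an AttributeError as soon as one level is missing, for example when data or routeList comes back as null. A more defensive walk over data -> routeList -> legs -> flight might look like the sketch below. The function name extract_flights and the sample values are invented for illustration; only the key names come from the script above.

```python
def extract_flights(payload):
    """Collect flight fields from a parsed response, skipping incomplete routes."""
    data = payload.get("data") or {}
    flights = []
    for route in data.get("routeList") or []:
        legs = route.get("legs") or []
        if len(legs) != 1:
            continue  # same rule as the loop above: only single-leg routes
        flight = legs[0].get("flight")
        if not flight:
            continue  # leg without flight details
        dep = flight.get("departureAirportInfo") or {}
        arr = flight.get("arrivalAirportInfo") or {}
        flights.append({
            "airlineName": flight.get("airlineName"),
            "flightNumber": flight.get("flightNumber"),
            "departureDate": flight.get("departureDate"),
            "arrivalDate": flight.get("arrivalDate"),
            "departureCityName": dep.get("cityName"),
            "departureAirportName": dep.get("airportName"),
            "arrivalCityName": arr.get("cityName"),
            "arrivalAirportName": arr.get("airportName"),
        })
    return flights


if __name__ == "__main__":
    # Hypothetical sample shaped like the response; values are made up.
    sample = {"data": {"routeList": [
        {"legs": [{"flight": {"airlineName": "ExampleAir", "flightNumber": "EX123",
                              "departureAirportInfo": {"cityName": "北京"},
                              "arrivalAirportInfo": {"cityName": "上海"}}}]},
        {"legs": []},
    ]}}
    for f in extract_flights(sample):
        print(f["airlineName"], f["flightNumber"], f["departureCityName"], f["arrivalCityName"])
```

Returning a list of dicts instead of printing directly also makes it easier to store the rows later, for example with pymysql.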
1. The data is in JSON format and is fetched with a POST request.
2. Filter the information layer by layer, looping where necessary.
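The next step is to make the program foolproof. As written, it can only fetch the data for one hard-coded request. To automate it and handle arbitrary queries, the following still need to be parameterized:
(1) Trip type: one-way, round-trip, multi-city
(2) Details: origin and destination, date, ID numbers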
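As a starting point for that parameterization, the hard-coded payload could be produced by a small function. This is only a sketch: build_payload is a hypothetical helper, and "Oneway" is the only flightWay value confirmed by the request captured above; the values for round-trip and multi-city requests would have to be read from the browser's network panel.

```python
def build_payload(dcity, acity, dcityname, acityname, date,
                  dcityid, acityid, flight_way="Oneway"):
    """Build the request payload for one origin/destination/date combination."""
    return {
        "flightWay": flight_way,  # only "Oneway" is confirmed by the captured request
        "classType": "ALL",
        "hasChild": False,
        "hasBaby": False,
        "searchIndex": 1,
        "airportParams": [
            {"dcity": dcity, "acity": acity,
             "dcityname": dcityname, "acityname": acityname,
             "date": date, "dcityid": dcityid, "acityid": acityid}
        ],
    }


if __name__ == "__main__":
    # Reproduces the Beijing -> Shanghai request used earlier.
    payload = build_payload("BJS", "SHA", "北京", "上海", "2019-07-18", 1, 2)
    print(payload["airportParams"][0]["date"])
```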
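The problem turned out to be the Referer header, https://flights.ctrip.com/itinerary/oneway/bjs-sha?date=2019-07-18. The site was updated with additional code: accessing the parent URL now adds a token, so the Referer can no longer be shortened to https://flights.ctrip.com/itinerary. Writing it out in full fixes the problem.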
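If the route and date become parameters, the full Referer has to change with them. The sketch below assumes the URL pattern can be generalized from the single example above (lowercase city codes joined by a hyphen, followed by the date); that pattern is inferred, not documented, and build_referer is a hypothetical helper. Only the oneway path segment is confirmed by the source.

```python
def build_referer(dcity, acity, date, trip="oneway"):
    """Build the full Referer URL for a route, following the pattern of the example URL."""
    return "https://flights.ctrip.com/itinerary/{}/{}-{}?date={}".format(
        trip, dcity.lower(), acity.lower(), date
    )


if __name__ == "__main__":
    print(build_referer("BJS", "SHA", "2019-07-18"))
```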