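Preface: Because Ctrip keeps changing its pages and stepping up its anti-scraping measures, many existing Ctrip scrapers can no longer retrieve any data.
Core idea of this article: obtain Ctrip hotel data by replacing the cookie value.
The article is organized into four main parts.
Environment: Python 3.6 + requests
The script also includes some file-writing operations.
A scraper has to imitate a browser when it makes requests, so the headers are essential. They are easy to find by inspecting the request in the browser.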
headers = {
    "Connection": "keep-alive",
    "Cookie": cookies,
    "origin": "https://hotels.ctrip.com",
    "Host": "hotels.ctrip.com",
    "referer": "https://hotels.ctrip.com/hotel/qamdo575",
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded; charset=utf-8"
}
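The most important field here is the cookies. Without them, verification fails and the response comes back empty. The cookies must also be taken from a logged-in session.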
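Since everything depends on the cookie string, it can help to fail early when it is missing. The following is a minimal sketch rather than part of the original script: `build_headers` is a hypothetical helper that wraps the header dictionary shown above and rejects an empty cookie string.

```python
def build_headers(cookie_string):
    """Return request headers for the hotel list endpoint.

    Hypothetical helper (not in the original script): it wraps the header
    dictionary from the article and rejects an empty cookie string.
    """
    cookie_string = cookie_string.strip()
    if not cookie_string:
        raise ValueError("a cookie string from a logged-in session is required")
    return {
        "Connection": "keep-alive",
        "Cookie": cookie_string,
        "origin": "https://hotels.ctrip.com",
        "Host": "hotels.ctrip.com",
        "referer": "https://hotels.ctrip.com/hotel/qamdo575",
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
    }
```

Raising an error here is a design choice: a missing cookie otherwise only shows up later as an empty result.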
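Because we scrape through the data API instead of parsing the page, the request must carry the right data fields to get an accurate response. When inspecting the request in the browser, the data fields we need are listed under its headers.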
data = {
    "StartTime": "2020-10-09",
    "DepTime": "2020-10-10",
    "RoomGuestCount": "1,1,0",
    "cityId": 575,
    "cityPY": "qamdo",
    "cityCode": "0895",
    "page": page
}
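The check-in and check-out dates are easy to get out of order when typed by hand. As a sketch, the hypothetical helper `build_payload` below derives the check-out date from the check-in date. The field names and city values are copied from the dictionary above; computing `DepTime` from a number of nights is an assumption, not something the original script does.

```python
from datetime import date, timedelta


def build_payload(check_in, nights, page):
    """Build the form data for one result page.

    Hypothetical helper: field names and city values come from the article;
    deriving DepTime from a night count is an assumption.
    """
    if nights < 1:
        raise ValueError("nights must be at least 1")
    check_out = check_in + timedelta(days=nights)
    return {
        "StartTime": check_in.isoformat(),  # formatted as YYYY-MM-DD
        "DepTime": check_out.isoformat(),
        "RoomGuestCount": "1,1,0",
        "cityId": 575,
        "cityPY": "qamdo",
        "cityCode": "0895",
        "page": page,
    }


payload = build_payload(date(2020, 10, 9), nights=1, page=1)
```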
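Once the correct data endpoint is found, we use the requests library to send a GET or POST request that carries the headers and data built above, and get the corresponding JSON back.
Each attribute in the returned JSON, such as the link, score, or address, can then be read by key.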
html = requests.post(url, headers=headers, data=data)
hotel_list = html.json()["hotelPositionJSON"]
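If the cookies are rejected or the site changes, the `hotelPositionJSON` key or individual fields may be missing. As a sketch under that assumption, the hypothetical function `extract_hotels` reads the decoded JSON defensively. The field names are the ones the full script collects; the sample record is made up for illustration.

```python
FIELDS = ["id", "name", "score", "url", "address", "star",
          "stardesc", "lat", "lon", "dpcount", "dpscore"]


def extract_hotels(payload):
    """Turn a decoded response into a list of flat row dictionaries."""
    rows = []
    # A missing or null hotel list is treated as "no hotels".
    for item in payload.get("hotelPositionJSON") or []:
        row = {field: item.get(field, "") for field in FIELDS}
        # The full script stores 'NaN' when the star field is empty.
        if row["star"] == "":
            row["star"] = "NaN"
        rows.append(row)
    return rows


# Made-up sample record, not real Ctrip data.
sample = {"hotelPositionJSON": [{"id": "1", "name": "Example Hotel", "score": "4.5", "star": ""}]}
rows = extract_hotels(sample)
```

In a real run the argument would be the result of `html.json()`.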
# coding=utf8
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
import random
import time
import csv
import json
import re
from tqdm import tqdm
# Pandas display option
pd.set_option('display.max_columns', 10000)
pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_colwidth', 10000)
pd.set_option('display.width',1000)
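# Hotel list data endpoint
url = "https://hotels.ctrip.com/Domestic/Tool/AjaxHotelList.aspx"
# Output CSV path; adjust to a directory that exists on your machine
filename = "F:\\aaa\\changdu.csv"
# Quick check that the endpoint responds (sent without headers or data)
print(requests.post(url))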
def Scrap_hotel_lists():
    # Paste the cookie string from a logged-in browser session here
    cookies = '''......'''
    headers = {
        "Connection": "keep-alive",
        "Cookie": cookies,
        "origin": "https://hotels.ctrip.com",
        "Host": "hotels.ctrip.com",
        "referer": "https://hotels.ctrip.com/hotel/qamdo575",
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8"
    }
    # One list per output column
    id = []
    name = []
    hotel_url = []
    address = []
    score = []
    star = []
    stardesc = []
    lat = []
    lon = []
    dpcount = []
    dpscore = []
    for page in tqdm(range(1, 13), desc='Progress', ncols=10):
        data = {
            "StartTime": "2020-10-09",
            "DepTime": "2020-10-10",
            "RoomGuestCount": "1,1,0",
            "cityId": 575,
            "cityPY": "qamdo",
            "cityCode": "0895",
            "page": page
        }
        html = requests.post(url, headers=headers, data=data)
        hotel_list = html.json()["hotelPositionJSON"]
        for item in hotel_list:
            print(item)
            id.append(item['id'])
            name.append(item['name'])
            hotel_url.append(item['url'])
            address.append(item['address'])
            score.append(item['score'])
            stardesc.append(item['stardesc'])
            lat.append(item['lat'])
            lon.append(item['lon'])
            dpcount.append(item['dpcount'])
            dpscore.append(item['dpscore'])
            # Mark hotels without a star rating
            if item['star'] == '':
                star.append('NaN')
            else:
                star.append(item['star'])
        # Pause 3 to 5 seconds between pages
        time.sleep(random.randint(3, 5))
    # Stack the columns into rows and put the header row on top
    hotel_array = np.array((id, name, score, hotel_url, address, star, stardesc, lat, lon, dpcount, dpscore)).T
    list_header = ['id', 'name', 'score', 'url', 'address',
                   'star', 'stardesc', 'lat', 'lon', 'dpcount', 'dpscore']
    array_header = np.array((list_header))
    hotellists = np.vstack((array_header, hotel_array))
    with open(filename, 'w', encoding="utf-8-sig", newline="") as f:
        csvwriter = csv.writer(f, dialect='excel')
        csvwriter.writerows(hotellists)
if __name__ == "__main__":
    Scrap_hotel_lists()
    # Read back with the encoding used for writing, so the BOM is stripped
    df = pd.read_csv(filename, encoding='utf-8-sig')
    print(df)
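The script stacks its lists into a NumPy array before writing the CSV. As an alternative sketch that uses only the standard library, rows kept as dictionaries can be written with `csv.DictWriter`. The function name `write_hotels` and the sample row are hypothetical; an in-memory buffer stands in for the output file.

```python
import csv
import io


def write_hotels(rows, fieldnames, file_obj):
    """Write a header row followed by one CSV line per hotel dictionary."""
    writer = csv.DictWriter(file_obj, fieldnames=fieldnames, dialect="excel")
    writer.writeheader()
    writer.writerows(rows)


# Made-up row, written to memory instead of a file on disk.
buffer = io.StringIO()
write_hotels([{"id": "1", "name": "Example Hotel"}], ["id", "name"], buffer)
```

To write a real file, pass a handle opened the same way as in the script: `open(filename, 'w', encoding="utf-8-sig", newline="")`.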
Note: the Ctrip (xiecheng) website is redesigned frequently. This program is for learning purposes only.