The rental listings on Fang.com are loaded dynamically via JavaScript. Analyzing the page's network requests turns up the AJAX request URL:
https://m.fang.com/zf/?purpose=%D7%A1%D5%AC&notGetPurpose=1&city=%B9%E3%D6%DD&renttype=cz&c=zf&a=ajaxGetList&city=gz&r=0.0021985656734149206&page=3
The page parameter is the variable part: changing it simulates the browser refreshing the list, and the request is sent as a GET.
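To confirm which parameter drives pagination, the query string can be unpacked with the standard library; a quick sketch (note the Chinese values in the query are GBK percent-encoded, so parse_qs needs encoding='gbk'):

```python
from urllib.parse import urlparse, parse_qs

# The AJAX URL captured from the Network tab.
ajax_url = ('https://m.fang.com/zf/?purpose=%D7%A1%D5%AC&notGetPurpose=1'
            '&city=%B9%E3%D6%DD&renttype=cz&c=zf&a=ajaxGetList'
            '&city=gz&r=0.0021985656734149206&page=3')

# The Chinese parameter values are GBK percent-encoded, so decode with gbk.
params = parse_qs(urlparse(ajax_url).query, encoding='gbk')
print(params['page'])  # the page number we will vary
print(params['city'])  # 'city' appears twice in the query string
```

This also makes it obvious that only `page` needs to change between requests; everything else can stay fixed.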
While fetching the data I hit a pitfall worth recording here. Following the normal requests GET pattern, I added headers to the request; at first only a User-Agent, but that returned no data at all. Going back to the AJAX request under the Network tab, the full request headers were:
:authority: m.fang.com
:method: GET
:path: /zf/?purpose=%D7%A1%D5%AC&notGetPurpose=1&city=%B9%E3%D6%DD&renttype=cz&c=zf&a=ajaxGetList&city=gz&r=0.0021985656734149206&page=3
:scheme: https
accept: */*
accept-encoding: gzip, deflate, br
accept-language: zh-CN,zh;q=0.9
cookie: JSESSIONID=aaaCdLQQAcDgcdgfXxQZw; global_cookie=5d011072-1567300194704-1588c24a; unique_cookie=U_5d011072-1567300194704-1588c24a; global_wapandm_cookie=7fgcyp8vatwbii0vjq5yg3rnz1wk00a2j29; zflistonedayClose=1; cityHistory=%E5%B9%BF%E5%B7%9E%2Cgz; csrfToken=eB7RgTWl8dkYe4EvaZiHqyE7; zhcity=%E5%B9%BF%E5%B7%9E; encity=gz; appdklj_refer=refer; zfdetailonedayClose=1; times=5; mencity=gz; g_sourcepage=zf_fy%5Elb_wap; unique_wapandm_cookie=U_7fgcyp8vatwbii0vjq5yg3rnz1wk00a2j29*16
referer: https://m.fang.com/zf/gz/
sec-fetch-mode: cors
sec-fetch-site: same-origin
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36
x-requested-with: XMLHttpRequest
Sending only the User-Agent apparently gets the request flagged as a crawler, so I added the other header fields in the code and sent the request again; this time it succeeded. The next step is to pull the detail-page URLs out of the returned data.
import requests
import re
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
'referer': 'https://m.fang.com/zf/bj/?jhtype=zf',
'authority': 'm.fang.com',
'x-requested-with': 'XMLHttpRequest',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-origin'
}
url = 'https://m.fang.com/zf/?purpose=%D7%A1%D5%AC&notGetPurpose=1&city=%B9%E3%D6%DD&renttype=cz&c=zf&a=ajaxGetList&city=gz&r=0.0021985656734149206&page='
domainurl = 'https://m.fang.com/zf/gz/'
for i in range(1, 15):
    urls = url + str(i)  # append the page number to build each page's URL
    res = requests.get(url=urls, headers=headers).content.decode('gbk')  # the response body is GBK-encoded; decode() gives a str
    results = re.findall(r'
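Once `results` holds the relative paths matched by the regex, they still need to be joined onto `domainurl` to give absolute detail-page URLs. A minimal sketch using urljoin (the sample hrefs below are made-up placeholders, not real listing paths):

```python
from urllib.parse import urljoin

domainurl = 'https://m.fang.com/zf/gz/'
# Hypothetical relative paths, standing in for what re.findall would return.
results = ['/zf/gz/AB123.html', 'CD456.html']

# urljoin resolves both absolute paths and paths relative to domainurl.
detail_urls = [urljoin(domainurl, path) for path in results]
print(detail_urls)
```

urljoin handles both forms correctly, which is safer than naive string concatenation when the regex captures a mix of absolute and relative hrefs.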