Finding the real URL: scraping AJAX dynamic pages

http://www.santostang.com/2018/07/14/4-2-%E8%A7%A3%E6%9E%90%E7%9C%9F%E5%AE%9E%E5%9C%B0%E5%9D%80%E6%8A%93%E5%8F%96/

Target page to scrape: http://www.santostang.com/2018/07/04/hello-world/

Capture the network traffic to find the URL that delivers the data

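Start by capturing traffic: F12 -> Network -> F5. AJAX data usually arrives as JSON.


Filter by XHR, then click Preview to check the data: it turns out to be empty.

So the comments do not come back as a plain JSON XHR response.

The only option is to go back to All and look again.

Searching this way is tedious; Selenium may be the easier route.
After clicking through the requests one by one, the right one finally turns up: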


https://api-zero.livere.com/v1/comments/list?callback=jQuery112409131255202867523_1543847210853&limit=10&repSeq=4272904&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1543847210855

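The response carries JSON, from which the data can be parsed.

Opening the response shows the following:


typeof jQuery112409131255202867523_1543847210853 === 'function' && jQuery112409131255202867523_1543847210853({the JSON data being passed sits between these braces});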

import json
import requests

link = "https://api-zero.livere.com/v1/comments/list?callback=jQuery1124049866736766120545_1506309304525&limit=10&offset=1&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1506309304527"
# Send a browser-like User-Agent header with the request
headers = {'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

r = requests.get(link, headers=headers)
json_string = r.text
# Keep only the JSON object: drop the callback prefix and the trailing ");"
json_string = json_string[json_string.find('{'):-2]
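
The slice above depends on the response ending in exactly `);`. A slightly more defensive way to unwrap the callback is sketched below. `strip_jsonp` is a hypothetical helper name, and the sample string is made up to mimic the wrapper format shown earlier; it is not real API output.

```python
import json


def strip_jsonp(text):
    """Return the JSON payload inside a JSONP response such as callback({...});"""
    start = text.find('(')   # opening parenthesis of the callback call
    end = text.rfind(')')    # closing parenthesis of the callback call
    if start == -1 or end <= start:
        raise ValueError('response does not look like JSONP')
    return text[start + 1:end]


# Made-up sample in the same shape as the wrapper shown earlier
sample = "typeof cb === 'function' && cb({\"results\": {\"parents\": [{\"content\": \"121212\"}]}});"
data = json.loads(strip_jsonp(sample))
print(data['results']['parents'][0]['content'])
```

Searching for the last `)` instead of cutting a fixed number of characters keeps the helper working if the server adds or drops the trailing semicolon or a newline.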

Look at the JSON structure

"results":  { 
   "parents": 
   [
    {   ....
       "content": "评论试试啊 :smiley:",
      ....
    },
    { ....
         "content": "121212",
      ....
    },
  ]
}
# Parse the JSON string and print the content of each comment
json_data = json.loads(json_string)
comment_list = json_data['results']['parents']

for eachone in comment_list:
    message = eachone['content']
    print(message)
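
Indexing with `['results']['parents']` raises `KeyError` if the response ever lacks those keys. A minimal sketch of a more tolerant traversal follows; `extract_contents` is a hypothetical helper, and the sample dictionaries are illustrative only.

```python
def extract_contents(json_data):
    """Collect the 'content' field of each parent comment, skipping entries without one."""
    parents = json_data.get('results', {}).get('parents', [])
    return [item['content'] for item in parents if 'content' in item]


# Illustrative data shaped like the structure shown above
sample = {'results': {'parents': [{'content': '121212'}, {'other': 1}]}}
print(extract_contents(sample))
print(extract_contents({}))
```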

The pattern in the URL

The URL above, https://api-zero.livere.com/v1/comments/list?callback=jQuery112409131255202867523_1543847210853&limit=10&repSeq=4272904&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1543847210855,
returns only part of the comments.
Now request the second page.


Its URL is
https://api-zero.livere.com/v1/comments/list?callback=jQuery1124010814306767104775_1543848692464&limit=10&offset=2&repSeq=4272904&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1543848692469

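Compare the parameters of the first and second pages.

The parameter that matters is offset: it is absent from the first-page URL and set to 2 for the second page.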
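
The comparison can also be done programmatically instead of by eye. The sketch below uses the standard library's `urllib.parse` to list the query parameters that differ between the two URLs. `changed_params` is a hypothetical helper name, and the second URL is written on the assumption that its last parameter is `_`, as in the first URL.

```python
from urllib.parse import urlparse, parse_qs


def changed_params(url_a, url_b):
    """Return {name: (value_in_a, value_in_b)} for query parameters that differ."""
    qa = parse_qs(urlparse(url_a).query)
    qb = parse_qs(urlparse(url_b).query)
    diff = {}
    for name in sorted(set(qa) | set(qb)):
        if qa.get(name) != qb.get(name):
            diff[name] = (qa.get(name), qb.get(name))
    return diff


page1 = ("https://api-zero.livere.com/v1/comments/list?callback=jQuery112409131255202867523_1543847210853"
         "&limit=10&repSeq=4272904&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020"
         "&livereSeq=28583&smartloginSeq=5154&_=1543847210855")
page2 = ("https://api-zero.livere.com/v1/comments/list?callback=jQuery1124010814306767104775_1543848692464"
         "&limit=10&offset=2&repSeq=4272904&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020"
         "&livereSeq=28583&smartloginSeq=5154&_=1543848692469")

diff = changed_params(page1, page2)
for name, (first, second) in diff.items():
    print(name, first, second)
```

Three parameters differ: `callback`, `_` and `offset`. The first two also differ in every other URL captured in these notes, which leaves `offset` as the one tied to the page.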

for page in range(1, 4):
    link1 = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112407875296433383039_1506267778283&limit=10&offset="
    link2 = "&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1506267778285"
    page_str = str(page)

    # Concatenate the pieces to get the URL for this page
    link = link1 + page_str + link2
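
As an alternative to string concatenation, the query string can be built from a dictionary with the standard library's `urllib.parse.urlencode`, which makes it obvious that only `offset` varies. This is a sketch, not the original author's method; `BASE_URL` and `build_link` are hypothetical names, and the parameter values are copied from the loop above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api-zero.livere.com/v1/comments/list"


def build_link(page):
    """Build the comment-list URL for one page; only offset changes between pages."""
    params = {
        'callback': 'jQuery112407875296433383039_1506267778283',
        'limit': 10,
        'offset': page,
        'repSeq': 3871836,
        'requestPath': '/v1/comments/list',  # urlencode escapes "/" as %2F
        'consumerSeq': 1020,
        'livereSeq': 28583,
        'smartloginSeq': 5154,
        '_': 1506267778285,
    }
    return BASE_URL + '?' + urlencode(params)


print(build_link(2))
```

`requests.get` also accepts such a dictionary directly through its `params` argument, which avoids building the URL by hand.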

Complete code

import requests
import json

# Print the comments on a single page
def single_page_comment(link):
    headers = {'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    r = requests.get(link, headers=headers)
    # Get the JSON string: drop the callback prefix and the trailing ");"
    json_string = r.text
    json_string = json_string[json_string.find('{'):-2]
    json_data = json.loads(json_string)
    comment_list = json_data['results']['parents']

    for eachone in comment_list:
        message = eachone['content']
        print(message)

# Comments from pages 1, 2 and 3 (range(1, 4) stops before 4)
for page in range(1, 4):

    link1 = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112407875296433383039_1506267778283&limit=10&offset="
    link2 = "&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1506267778285"
    page_str = str(page)
    link = link1 + page_str + link2
    print(link)
    single_page_comment(link)
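
The loop above hard-codes the number of pages. One possible variation is to keep requesting pages until one comes back empty. The sketch below assumes, without having verified it against the API, that a page past the last one returns an empty `parents` list. `collect_comments` and `fetch_page` are hypothetical names, and a stand-in fetcher replaces the network call so the example runs offline.

```python
def collect_comments(fetch_page, max_pages=50):
    """Call fetch_page(page) for page 1, 2, ... until a page comes back empty."""
    all_comments = []
    for page in range(1, max_pages + 1):
        comments = fetch_page(page)
        if not comments:
            break  # assumed end-of-data signal: an empty page
        all_comments.extend(comments)
    return all_comments


# Stand-in for a real fetcher: only pages 1 and 2 have data
fake_pages = {1: ['first', 'second'], 2: ['third']}
result = collect_comments(lambda page: fake_pages.get(page, []))
print(result)
```

In real use, `fetch_page` would wrap the request and parsing steps and return the list of comment strings for that page; `max_pages` is a safety cap in case the empty-page assumption does not hold.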
