[Python Crawler] Scraping Dynamic Web Pages


When a page loads its content dynamically via AJAX, how do we scrape that dynamically loaded content? There are two approaches:

  1. Inspect the page with the browser's developer tools and parse the real request address
  2. Use Selenium to simulate a browser and scrape the rendered page

Parsing the real address: 

#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
@File  : Dynamic01.py
@Author: Xinzhe.Pang
@Date  : 2019/7/5 15:32
@Desc  : 
'''
import requests
import json

def single_page_comment(link):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    r = requests.get(link, headers=headers)
    # The response is JSONP: callback({...});
    # slice off the callback wrapper to get the bare JSON string
    json_string = r.text
    json_string = json_string[json_string.find('{'):-2]
    json_data = json.loads(json_string)
    comment_list = json_data['results']['parents']
    for eachone in comment_list:
        message = eachone['content']
        print(message)

for page in range(1, 4):
    link1 = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112403473268296510956_1531502963311&limit=10&offset="
    link2 = "&repSeq=4272904&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1531502963316"
    page_str = str(page)
    link = link1 + page_str + link2
    print(link)
    single_page_comment(link)
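The slicing trick above strips the JSONP wrapper (`jQuery…({...});`) down to bare JSON before parsing. A minimal, self-contained sketch of the same idea, using `rfind` so it does not depend on the exact number of trailing characters (the sample payload below is invented for illustration, not real API output):

```python
import json

def strip_jsonp(text):
    # Hypothetical helper: cut from the first '{' to the last '}'
    # to drop the callback name and the trailing ");".
    start = text.find('{')
    end = text.rfind('}')
    return json.loads(text[start:end + 1])

sample = ('jQuery112403473268296510956_1531502963311('
          '{"results": {"parents": [{"content": "hello"}]}});')
data = strip_jsonp(sample)
print(data['results']['parents'][0]['content'])  # hello
```

Using `rfind('}')` is slightly more robust than a fixed `[:-2]` slice, since some endpoints end the wrapper with `);`, others with `) ;` or a newline.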

Scraping by simulating a browser with Selenium

#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
@File  : Dynamic02.py
@Author: Xinzhe.Pang
@Date  : 2019/7/5 17:27
@Desc  : 
'''
# Scrape by driving a real browser with Selenium
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import time

# caps = webdriver.DesiredCapabilities().FIREFOX
# caps["marionette"] = False
# binary = FirefoxBinary(r'D:\firefox-51.0.win64.sdk\firefox-sdk\bin\firefox.exe')

driver = webdriver.Firefox(executable_path = r'D:\Anaconda3\Scripts\geckodriver.exe')
driver.get("http://www.santostang.com/2018/07/04/hello-world/")

comment = driver.find_element_by_css_selector('div.reply-content')
content = comment.find_element_by_tag_name('p')
print(content.text)

Running the script, however, raises a NoSuchElementException:

D:\Anaconda3\python.exe E:/python_learning/Python_Spyder/DynamicPage/Dynamic02.py
Traceback (most recent call last):
  File "E:/python_learning/Python_Spyder/DynamicPage/Dynamic02.py", line 21, in
    comment = driver.find_element_by_css_selector('div.reply-content')
  File "D:\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 598, in find_element_by_css_selector
    return self.find_element(by=By.CSS_SELECTOR, value=css_selector)
  File "D:\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 978, in find_element
    'value': value})['value']
  File "D:\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "D:\Anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: div.reply-content


Process finished with exit code 1
On investigation we found that the JavaScript on the page renders the comments inside an iframe, so the driver cannot locate `div.reply-content` until it switches into that frame first.
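A sketch of the likely fix: switch into the iframe before querying, and wait for it to load instead of querying immediately. This is untested without a live browser; selecting the first `iframe` on the page is an assumption — inspect the page to find the right frame, and the URL and driver path are carried over from the listing above.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox(executable_path=r'D:\Anaconda3\Scripts\geckodriver.exe')
driver.get("http://www.santostang.com/2018/07/04/hello-world/")

# Wait until an iframe appears and switch the driver's context into it.
# Assumption: the comment widget is the first iframe on the page.
WebDriverWait(driver, 10).until(
    EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME, 'iframe')))

# Inside the frame, the original selector should now resolve.
comment = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.reply-content')))
print(comment.find_element_by_tag_name('p').text)

# Switch back to the top-level document when done with the frame.
driver.switch_to.default_content()
```

The explicit waits also guard against the comments arriving a moment after page load, which would otherwise produce the same NoSuchElementException even with the correct frame.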