In the previous article, we used Python Scrapy to crawl all of the text in a static web page: https://blog.csdn.net/sinat_40431164/article/details/81102476
But there is a problem: if we change the URL to http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2, we find that the crawled content is missing the "车型论坛" (model forums) and "主题论坛" (topic forums) sections.
Sometimes, when we naively download an HTML page with urllib or Scrapy, we discover that the elements we want to extract are not in the HTML we downloaded, even though they look within easy reach in the browser.
This means the elements we want are generated dynamically by JavaScript events triggered by our interactions. For example, when you keep scrolling through Qzone or Weibo comments, the page grows longer and more content keeps appearing; that is the dynamic loading we love to hate. At the moment there are two ways to crawl such dynamic pages: analyze the underlying Ajax/JavaScript requests and call those data interfaces directly, or drive a real browser with Selenium and let it render the page for you.
Below, we'll walk through how to use Selenium to simulate browser behavior.
Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:
scrapy startproject URLCrawler
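This creates a project skeleton roughly like the following (file names come from the standard Scrapy template); middlewares.py and settings.py are the files we will edit later:

URLCrawler/
    scrapy.cfg
    URLCrawler/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py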
This is the code for our first Spider. Save it in a file named my_spider.py under the URLCrawler/spiders directory in your project:
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 18 17:55:45 2018
@author: Administrator
"""
from scrapy import Spider, Request
from selenium import webdriver


class MySpider(Spider):
    name = "my_spider"

    def __init__(self):
        # Path to the local geckodriver binary; use a raw string so the
        # backslashes are not treated as escape sequences.
        self.browser = webdriver.Firefox(
            executable_path=r'E:\software\python\geckodriver-v0.21.0-win64\geckodriver.exe')
        self.browser.set_page_load_timeout(30)

    def closed(self, reason):
        # Called by Scrapy when the spider closes; shut the browser down too.
        print("spider closed")
        self.browser.close()

    def start_requests(self):
        # Only one page here; the format()/range() pattern is kept so more
        # pages can be added later by putting a {} placeholder in the URL.
        start_urls = ['http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2'.format(str(i))
                      for i in range(1, 2, 2)]
        for url in start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        # Save the rendered HTML to a file named after part of the URL.
        domain = response.url.split("/")[-2]
        filename = '%s.html' % domain
        with open(filename, 'wb') as f:
            f.write(response.body)
        print('---------------------------------------------------')
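If you don't want a Firefox window popping up during the crawl, you can optionally run the browser headless. A minimal sketch, assuming a Selenium 3.x release that accepts the options keyword:

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument('-headless')  # run Firefox without opening a window
browser = webdriver.Firefox(
    options=options,
    executable_path=r'E:\software\python\geckodriver-v0.21.0-win64\geckodriver.exe')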
In URLCrawler/middlewares.py, add the following:
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
import time


class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        if spider.name == 'my_spider':
            try:
                # Let the spider's Firefox instance load the page, then scroll
                # to the bottom to trigger JS-driven content.
                spider.browser.get(request.url)
                spider.browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            except TimeoutException:
                print('page load timed out')
                spider.browser.execute_script('window.stop()')
            # Give the page a moment to finish rendering.
            time.sleep(2)
            # Return the rendered DOM instead of letting Scrapy download the raw HTML.
            return HtmlResponse(url=spider.browser.current_url,
                                body=spider.browser.page_source,
                                encoding="utf-8",
                                request=request)
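If the fixed time.sleep(2) feels fragile, one alternative is an explicit wait for a specific element before grabbing the page source. A small sketch; the '.forum-list' selector is only a hypothetical placeholder for whatever element the dynamically loaded sections actually render on the real page:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the (hypothetical) '.forum-list' element to appear;
# adjust the selector to match the real page before using this.
WebDriverWait(spider.browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.forum-list')))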
In URLCrawler/settings.py, enable the middleware by adding:
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'URLCrawler.middlewares.SeleniumMiddleware': 543,
}
To put our spider to work, go to the project's top-level directory and run:
scrapy crawl my_spider
You'll find that the downloaded page now contains the same content as when you visit it in a browser!
If you only need the text content, change the parse method in the spider to:
    def parse(self, response):
        '''domain = response.url.split("/")[-2]
        filename = '%s.html' % domain
        with open(filename, 'wb') as f:
            f.write(response.body)'''
        # Extract all non-empty text nodes; the commented-out variant skips
        # the contents of <script> and <style> tags.
        #textlist_no_scripts = response.selector.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
        textlist_with_scripts = response.selector.xpath('//text()[normalize-space(.)]').extract()
        #with open('filename_no_scripts', 'w', encoding='utf-8') as f:
        with open('filename_with_scripts', 'w', encoding='utf-8') as f:
            for i in range(0, len(textlist_with_scripts)):
                text = textlist_with_scripts[i].strip()
                f.write(text + '\n')
        print('---------------------------------------------------')
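To compare the two XPath variants without re-running the crawl, you can load the HTML file saved by the first parse method into a standalone Scrapy Selector. A small sketch, assuming the earlier crawl saved club.haval.com.cn.html (which is what response.url.split("/")[-2] yields for this URL):

from scrapy.selector import Selector

# Adjust the filename to whatever the first parse() actually produced on disk.
with open('club.haval.com.cn.html', encoding='utf-8') as f:
    sel = Selector(text=f.read())

# Text nodes excluding <script>/<style> contents:
clean_text = sel.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
print(len(clean_text))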
The End.