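Simulating login with selenium: when a crawler needs to scrape a site whose data is visible only after logging in, selenium can drive a real browser through the login. The following code simulates logging in to Weibo: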
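# Note: this uses the Selenium 3 API (executable_path, find_element_by_css_selector);
# newer Selenium 4 releases drop both in favor of a Service object and find_element(By.CSS_SELECTOR, ...).
import time
from selenium import webdriver
from scrapy.selector import Selector
browser = webdriver.Chrome(executable_path=r"C:\Mycode\爬虫资源\chromedriver.exe")  # path to chromedriver.exe (raw string keeps the backslashes literal)
browser.get("https://weibo.com/")  # open the login page
time.sleep(10)  # give a slow page time to load
browser.find_element_by_css_selector("#loginname").send_keys("username")  # fill in the account; replace username with a working account
browser.find_element_by_css_selector(".info_list.password input[node-type='password']").send_keys("password")  # fill in the password; replace password with a working password
browser.find_element_by_css_selector(".info_list.login_btn a[node-type='submitBtn']").click()  # simulate clicking "Log in"
# simulate scrolling down with the mouse
for i in range(3):  # number of scrolls
    browser.execute_script("window.scrollTo(0,document.body.scrollHeight); var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    time.sleep(3)  # pause after each scroll so new content can load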
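The loop above scrolls a fixed three times. As a possible variation, the sketch below keeps scrolling until the page height stops growing, which may suit feeds of unknown length. The function name `scroll_until_stable` and its parameters are made up for illustration; `driver` is assumed to be any object with a Selenium-style `execute_script` method, such as the `browser` above.

```python
import time

# Scroll to the bottom, then report the page height measured right after scrolling.
SCROLL_JS = "window.scrollTo(0, document.body.scrollHeight); return document.body.scrollHeight;"


def scroll_until_stable(driver, max_scrolls=10, pause=3.0):
    """Scroll down repeatedly until the page height stops growing.

    Hypothetical helper, not part of selenium. Returns the number of
    scrolls performed (at most max_scrolls).
    """
    last_height = None
    for count in range(1, max_scrolls + 1):
        height = driver.execute_script(SCROLL_JS)
        if height == last_height:
            # Nothing new was loaded during the previous pause, so stop.
            return count
        last_height = height
        time.sleep(pause)  # wait for lazily loaded content
    return max_scrolls
```

Calling `scroll_until_stable(browser)` after the login step would replace the fixed `for i in range(3)` loop. The `max_scrolls` cap guards against pages that never stop loading.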
Skipping image loading: in some cases you can disable image loading to make pages load faster, and selenium can do this as well:
# configure chromedriver not to load images
chrome_opt = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}  # 2 = block images
chrome_opt.add_experimental_option("prefs", prefs)
browser = webdriver.Chrome(executable_path=r"C:\Mycode\爬虫资源\chromedriver.exe", chrome_options=chrome_opt)
browser.get("https://www.taobao.com")
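While debugging selectors it can help to turn images back on without editing the preference by hand. The following is a small sketch of that idea; `image_prefs` and `make_chrome_options` are hypothetical helper names, and the only selenium calls used are `webdriver.ChromeOptions()` and `add_experimental_option`, both of which appear in the code above.

```python
IMAGES_PREF = "profile.managed_default_content_settings.images"


def image_prefs(load_images):
    """Build the Chrome "prefs" dict: 1 allows images, 2 blocks them."""
    return {IMAGES_PREF: 1 if load_images else 2}


def make_chrome_options(load_images=False):
    """Return ChromeOptions with image loading switched on or off."""
    # Imported lazily so image_prefs() stays usable without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_experimental_option("prefs", image_prefs(load_images))
    return options
```

The returned object can be passed to `webdriver.Chrome` in place of `chrome_opt`.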
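Reusing cookies: when the crawler visits each user's URL, it is impractical to have selenium log in every time. Instead, let selenium obtain the cookies once and attach them to the Request sent to the site: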
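# Assumed context: this fragment sits inside a spider method (for example start_requests),
# right after the login steps, with scrapy imported at module level.
import pickle
import time
time.sleep(10)  # after logging in, wait 10s for the page to load
Cookies = browser.get_cookies()  # fetch the cookies
print(Cookies)
cookie_dict = {}
for cookie in Cookies:  # save each cookie locally
    # write it to a file; the with-block closes the file handle
    with open(r'C:\Mycode\ArticleSpider\cookies\weibo' + cookie['name'] + '.weibo', 'wb') as f:
        pickle.dump(cookie, f)
    cookie_dict[cookie['name']] = cookie['value']
browser.close()
return [scrapy.Request(url=self.start_urls[0], dont_filter=True, headers=self.headers, cookies=cookie_dict)]  # send the cookies to the site along with the Request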
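The code above writes the cookies to disk but never reads them back. A minimal sketch of the full round trip follows, assuming the cookies have the list-of-dicts shape that `browser.get_cookies()` returns. The helper names `cookies_to_dict`, `save_cookies`, and `load_cookies` are illustrative, and unlike the original this version stores all cookies in one file rather than one file per cookie.

```python
import pickle


def cookies_to_dict(cookies):
    """Collapse a list of cookie dicts into the {name: value} form passed to scrapy.Request."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


def save_cookies(cookies, path):
    """Pickle the whole cookie list into a single file."""
    with open(path, "wb") as f:
        pickle.dump(cookies, f)


def load_cookies(path):
    """Read back a cookie list written by save_cookies()."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

On a later run, `cookies_to_dict(load_cookies(path))` could supply the `cookies=` argument without opening a browser, for as long as the site still accepts the saved session.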
Integrating selenium into scrapy: define a class in the middlewares module that fetches pages through selenium:
import time
from scrapy.http import HtmlResponse
class JSPageMiddleware(object):
    # fetch dynamic pages through Chrome
    def process_request(self, request, spider):
        if spider.name == "weibo":  # the spider's name
            spider.browser.get(request.url)
            time.sleep(3)
            print("Visiting {0}".format(request.url))
            # returning a response here makes scrapy skip its own downloader for this request
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source, encoding="utf-8",
                                request=request)
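The middleware pauses a fixed 3 seconds after every page load. One possible alternative is to poll the page's `document.readyState` and continue as soon as it reports `complete`. The sketch below assumes `driver` exposes a Selenium-style `execute_script`; `wait_for_page_ready` is a hypothetical helper name. Note that `complete` only covers the initial document load, so content fetched later by Ajax may still need its own wait.

```python
import time


def wait_for_page_ready(driver, timeout=10.0, poll=0.5):
    """Poll document.readyState until it is "complete" or the timeout expires.

    Hypothetical helper. Returns True if the page reported ready in time,
    False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if driver.execute_script("return document.readyState") == "complete":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)  # brief pause before checking again
```

Inside `process_request`, `wait_for_page_ready(spider.browser)` could take the place of `time.sleep(3)`.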
For the middleware class to take effect, it must be registered in the settings file:
DOWNLOADER_MIDDLEWARES = {
    'ArticleSpider.middlewares.JSPageMiddleware': 1,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
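The numbers in this setting are priorities: scrapy calls `process_request` on middlewares in increasing order of their value, and a value of `None` disables that middleware. The sketch below only illustrates that ordering rule. It is not scrapy's own code, it ignores the built-in defaults that scrapy merges in, and the `OtherMiddleware` entry is a made-up example.

```python
def active_middleware_order(middlewares):
    """Return the enabled middleware paths sorted by priority (lowest first).

    Illustration only: entries whose value is None are treated as disabled.
    """
    enabled = {path: prio for path, prio in middlewares.items() if prio is not None}
    return sorted(enabled, key=enabled.get)


example_settings = {
    'ArticleSpider.middlewares.OtherMiddleware': 543,  # hypothetical extra entry
    'ArticleSpider.middlewares.JSPageMiddleware': 1,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

for path in active_middleware_order(example_settings):
    print(path)
```

With priority 1, `JSPageMiddleware` sees each request before any middleware with a higher number, which lets it return the selenium-rendered response early.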
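To avoid launching a new Chrome for every request, create the browser once in the spider's __init__ and define a method in the spider that closes Chrome when the spider exits: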
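# These methods belong inside the spider class (JobboleSpider here).
# They need: from scrapy import signals, and from pydispatch import dispatcher
# (older scrapy versions shipped the latter as scrapy.xlib.pydispatch).
def __init__(self):
    self.browser = webdriver.Chrome(executable_path=r"C:\Mycode\爬虫资源\chromedriver.exe")
    super(JobboleSpider, self).__init__()
    dispatcher.connect(self.spider_closed, signals.spider_closed)
def spider_closed(self, spider):
    # close Chrome when the spider exits
    print("spider closed")
    self.browser.quit()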