Simulating login with Selenium

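Simulated login with Selenium: when a crawler needs data from a site that is only visible after logging in, Selenium can drive a real browser through the login flow. The following code logs in to Weibo: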

import time

from selenium import webdriver
from scrapy.selector import Selector

# Path to chromedriver.exe (raw string so the backslashes are kept literally)
browser = webdriver.Chrome(executable_path=r"C:\Mycode\爬虫资源\chromedriver.exe")

browser.get("https://weibo.com/")    # open the login page
time.sleep(10)                       # wait in case the page loads slowly
browser.find_element_by_css_selector("#loginname").send_keys("username")   # enter the account name; replace username with a real account
browser.find_element_by_css_selector(".info_list.password input[node-type='password']").send_keys("password")   # enter the password; replace password with the real one
browser.find_element_by_css_selector(".info_list.login_btn a[node-type='submitBtn']").click()   # click the "Log in" button

# Simulate scrolling down with the mouse
for i in range(3):      # number of times to scroll
    browser.execute_script("window.scrollTo(0,document.body.scrollHeight); var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    time.sleep(3)       # pause after each scroll so new content can load
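
The loop above always scrolls exactly three times. A possible variation is to keep scrolling until the page height stops growing. The sketch below assumes only that the browser object exposes Selenium's execute_script method; the function name scroll_until_stable and its parameters are hypothetical and not part of Selenium.

```python
import time


def scroll_until_stable(browser, max_scrolls=10, pause=3.0):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed (at most max_scrolls).
    """
    # Height of the page before any scrolling
    last_height = browser.execute_script("return document.body.scrollHeight;")
    for scrolls in range(1, max_scrolls + 1):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = browser.execute_script("return document.body.scrollHeight;")
        if new_height == last_height:
            # Nothing new was loaded by this scroll, so stop here
            return scrolls
        last_height = new_height
    return max_scrolls
```

With the browser from the login example, this could be called as scroll_until_stable(browser, max_scrolls=20, pause=3). The pause still has to be long enough for the site to load more content, so it should be tuned per site.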

Disabling image loading: in some cases, blocking images speeds up page loading. This can also be done with Selenium:

# Configure chromedriver not to load images
chrome_opt = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}   # 2 = block images
chrome_opt.add_experimental_option("prefs", prefs)
browser = webdriver.Chrome(executable_path=r"C:\Mycode\爬虫资源\chromedriver.exe", chrome_options=chrome_opt)
browser.get("https://www.taobao.com")

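Reusing cookies: when crawling many URLs, it is impractical to have Selenium log in every time. Instead, let Selenium obtain the cookies once and attach them to each Request sent to the site: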

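        # Fragment from inside a spider method (it uses self and return); runs after the login steps
        import pickle
        import time

        time.sleep(10)                         # after logging in, wait 10 s for the page to load
        cookies = browser.get_cookies()        # get the cookies
        print(cookies)
        cookie_dict = {}
        for cookie in cookies:                 # save each cookie to a local file
            # write to file; "with" closes the file handle after each dump
            with open(r'C:\Mycode\ArticleSpider\cookies\weibo' + cookie['name'] + '.weibo', 'wb') as f:
                pickle.dump(cookie, f)
            cookie_dict[cookie['name']] = cookie['value']
        browser.close()
        return [scrapy.Request(url=self.start_urls[0], dont_filter=True, headers=self.headers, cookies=cookie_dict)]        # send the cookies along with the Request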
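
The snippet above writes one file per cookie. As an alternative sketch, the whole cookie list could be saved to a single file and converted to the name/value dict only when a Request is built. The helper names cookies_to_dict, save_cookies and load_cookies are hypothetical. The only assumption about Selenium is that get_cookies() returns a list of dicts with "name" and "value" keys, as used in the code above.

```python
import pickle


def cookies_to_dict(cookies):
    """Turn Selenium's list of cookie dicts into a {name: value} mapping."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


def save_cookies(cookies, path):
    """Pickle the full cookie list into a single file."""
    with open(path, "wb") as f:
        pickle.dump(cookies, f)


def load_cookies(path):
    """Load a cookie list previously written by save_cookies."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

A later run could then call cookies_to_dict(load_cookies(path)) and pass the result as the cookies argument of scrapy.Request, skipping the browser login as long as the saved cookies are still valid.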

Integrating Selenium into Scrapy: define a downloader middleware class that fetches pages through Selenium:

import time

from scrapy.http import HtmlResponse


class JSPageMiddleware(object):
    # Fetch dynamic pages through Chrome

    def process_request(self, request, spider):
        if spider.name == "weibo":                # the spider's name
            spider.browser.get(request.url)
            time.sleep(3)
            print("Visiting {0}".format(request.url))

            # Returning a response here stops Scrapy from downloading the page itself
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source, encoding="utf-8",
                                request=request)
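
The middleware above waits a fixed three seconds after every page load. One possible refinement is to poll a condition and return as soon as it holds. The helper below is a dependency-free sketch; wait_until and its parameters are hypothetical names, not part of Selenium or Scrapy. Selenium also ships WebDriverWait in selenium.webdriver.support.ui for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns True if the condition held before the timeout, else False.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

In process_request, time.sleep(3) could then be replaced by something like wait_until(lambda: spider.browser.execute_script("return document.readyState") == "complete"). Note that readyState only reflects the initial document load; content loaded later by JavaScript may need a more specific condition.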

For the middleware class to take effect, it must be registered in the project settings:

DOWNLOADER_MIDDLEWARES = {
   'ArticleSpider.middlewares.JSPageMiddleware': 1,
   'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

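To avoid launching a new Chrome instance for every request, create the browser once in the spider's __init__ and define a method in the spider that closes Chrome when the spider finishes: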

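    # Needs "from scrapy import signals" and a dispatcher object; the dispatcher is
    # assumed here to come from PyDispatcher ("from pydispatch import dispatcher")
    def __init__(self):
        self.browser = webdriver.Chrome(executable_path=r"C:\Mycode\爬虫资源\chromedriver.exe")
        super(JobboleSpider, self).__init__()
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        # close Chrome when the spider exits
        print("spider closed")
        self.browser.quit()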

 
