Scrapy Simulated Login with Username and Password

1. Steps for Crawling with a Simulated Login

1.1 Use the browser's developer tools to check whether any hidden inputs are submitted along with the form

1.1.1 Request the login page first, parse it, and extract the values of the hidden inputs

1.2 Use the browser tools to record every form field that gets submitted

1.3 Submit the fields from 1.2 together with the hidden inputs

1.4 Once the login response comes back, just request the target URL.

 

2. Use the Browser Tools to Check for Hidden Inputs

This time our target is the CSDN login page: https://passport.csdn.net/account/login

[Screenshot 1: the CSDN login page source, showing the hidden lt input]

We can see a hidden input named lt. The CSDN programmers even thoughtfully left a hint in the page: every user gets a serial number, and it exists precisely to stop machines from logging in directly with just a username and password.

 

3. Parse the Login Page First to Get the Hidden Input

 

    def start_requests(self):
        start_url = 'https://passport.csdn.net/account/login?from=https://mp.csdn.net/postlist/list/all'
        return [
            Request(start_url, callback=self.parseWelcome)
        ]

    def parseWelcome(self, response):
        lt = response.xpath('//input[@name="lt"]/@value').extract_first()
        logging.info('lt:' + lt)
        return FormRequest.from_response(
            response,
            url='https://passport.csdn.net/account/login?from=https://mp.csdn.net/postlist/list/all',
            #meta={'cookiejar': response.meta['cookiejar']},
            formdata={"username": "fox64194167", "password": "*****", "lt": lt},
            callback=self.afterLogin
        )

4. Full Code

 

import scrapy

from tutorial.items import CSDNItem
import logging
from scrapy.http import Request, FormRequest, HtmlResponse

class CSDNLoginSpider(scrapy.Spider):
    name = "csdnLogin"

    target_url = 'https://mp.csdn.net/postlist/list/all'

    def start_requests(self):
        start_url = 'https://passport.csdn.net/account/login?from=https://mp.csdn.net/postlist/list/all'
        return [
            Request(start_url, callback=self.parseWelcome)
        ]

    def parseWelcome(self, response):
        lt = response.xpath('//input[@name="lt"]/@value').extract_first()
        logging.info('lt:' + lt)
        return FormRequest.from_response(
            response,
            url='https://passport.csdn.net/account/login?from=https://mp.csdn.net/postlist/list/all',
            #meta={'cookiejar': response.meta['cookiejar']},
            formdata={"username": "fox64194167", "password": "*****", "lt": lt},
            callback=self.afterLogin
        )
    def afterLogin(self, response):
        yield Request(self.target_url)

    def parseDetail(self, response):
        item = CSDNItem()
        item['title'] = response.css('.csdn_top::text').extract_first()
        item['body'] = response.css('#article_content .htmledit_views').extract_first()
        yield item

    def parse(self, response):
        for article in response.css('.list-item-title .article-list-item-txt'):
            articleId = article.css('a::attr("href")').extract_first()
            if articleId is not None:
                articleId = str(articleId)
                articleId = articleId[articleId.rfind("/") + 1: len(articleId)]
                next_page = 'https://blog.csdn.net/fox64194167/article/details/%s' % articleId
                yield response.follow(next_page, self.parseDetail)


        bottomNavNum = response.css('.page-item.active a::text').extract_first()

        if bottomNavNum is not None:
            # only convert after the None check; int(None) would raise a TypeError
            logging.info(int(bottomNavNum))
            next_page = 'https://mp.csdn.net/postlist/list/all/%d' % (int(bottomNavNum) + 1)
            logging.info('next_page:' + next_page)
            yield response.follow(next_page, self.parse)
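The slicing in `parse` simply takes everything after the last `/` in the article link to get the article id. The same logic in isolation, run against a made-up href of the shape the spider sees on the post list page:

```python
# Hypothetical article link, shaped like the hrefs on the CSDN post list page
href = 'https://blog.csdn.net/fox64194167/article/details/80286745'

# Everything after the last '/' is the article id
article_id = href[href.rfind('/') + 1:]
print(article_id)  # → 80286745

# rsplit does the same thing and reads a little more clearly
assert href.rsplit('/', 1)[-1] == article_id
```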

http://www.waitingfy.com/archives/2005

For explanations of the rest, see the previous article:

http://www.waitingfy.com/archives/1996
