[Scrapy] A Simple Spider: Scraping Vulnerabilities from Anquanke (Part 1)

0x01 Create the project
scrapy startproject YOUR_PROJECT_NAME

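The command generates a project skeleton that includes:

• items.py: defines the model for the fields to be scraped.
• settings.py: holds settings such as the user agent and the crawl delay.
• spiders/: the directory that stores the actual spider code.
Scrapy also uses scrapy.cfg for project configuration and pipelines.py to process the scraped fields, but neither file needs to be changed for now.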
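
As a rough illustration of the two settings mentioned above, a settings.py fragment might look like the following. USER_AGENT and DOWNLOAD_DELAY are standard Scrapy setting names; the values are assumptions chosen only for the example.

```python
# settings.py (fragment); both values are illustrative assumptions

# User agent string sent with every request.
USER_AGENT = 'example-bot (+http://www.example.com)'

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 2
```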

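0x02 Define the model
By default, the generated items.py contains an empty item class; this is where the fields the spider will scrape are defined.

The Exam123Item class is a template: replace its contents with the fields you want to store when the spider runs.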


0x03 Create the spider file

scrapy genspider SPIDER_NAME SPIDER_DOMAIN
For example: scrapy genspider bobao bobao.360.cn


0x03 Complete code example
Scraping the vulnerability list from 360 Bobao

The spider, bobao.py:

# -*- coding: utf-8 -*-
import scrapy


class BobaoSpider(scrapy.Spider):
    # A plain scrapy.Spider is enough here: no link-following rules are
    # defined, and CrawlSpider is not meant to have parse() overridden.
    name = 'bobao'
    allowed_domains = ['bobao.360.cn']
    start_urls = ['http://bobao.360.cn/vul/index']

    def parse(self, response):
        # Each <li> in the list holds one vulnerability entry.
        vuls = response.xpath('/html/body/div[2]/div[2]/div[2]/div[1]/div/div[3]/ul/li')

        for vul in vuls:
            yield {
                # The href is site-relative, so prepend the host.
                'url': 'http://bobao.360.cn' + vul.xpath('.//div/div/a/@href').extract_first(default=''),
                'title': vul.xpath('.//div/div/a/text()').extract_first(),
                # Fall back to "null" when the entry has no severity label.
                'label_danger': vul.xpath('.//div/div/span/text()').extract_first(default='null'),
                'ori': vul.xpath('.//div/span[2]/text()').extract_first(),
            }
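
The spider builds each absolute URL by prepending the host to the extracted href. A slightly more general approach is to resolve the href against the page URL, which is what Scrapy's response.urljoin(href) does. Below is a minimal standard-library sketch of the idea; the helper name absolute_url and the detail path are hypothetical, used only for illustration.

```python
from urllib.parse import urljoin

# URL of the listing page the hrefs were extracted from.
PAGE_URL = 'http://bobao.360.cn/vul/index'


def absolute_url(href, page_url=PAGE_URL):
    """Resolve a possibly relative href against the page it came from."""
    return urljoin(page_url, href)


# A site-relative href gets the scheme and host of the page (hypothetical path).
relative = absolute_url('/vul/detail/id/1')
# An href that is already absolute is returned unchanged.
already_absolute = absolute_url('http://bobao.360.cn/vul/detail/id/1')

print(relative)
print(already_absolute)
```

Resolving against the page URL keeps working if the site ever mixes relative and absolute links, whereas plain string concatenation would produce a malformed URL for the absolute ones.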

items.py


import scrapy


class ExampleItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    
    url = scrapy.Field()
    title = scrapy.Field()
    label_danger = scrapy.Field()
    ori = scrapy.Field()

Run result: [screenshot of the crawl output]

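0x04 Fetching content with the shell command
Some sites' anti-crawling measures block requests that identify themselves as Scrapy. Overriding the user agent, as below, gets around the restriction.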
scrapy shell -s USER_AGENT='custom user agent' 'http://www.example.com'
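
Handling Chinese text

# Python 3: a string holding literal \uXXXX escape sequences
f = r'\u53eb\u6211'
print(f)  # prints the raw escape sequences
print(f.encode('ascii').decode('unicode_escape'))  # prints the decoded Chinese characters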

Modify pipelines.py:

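import json


class ExamplePipeline(object):
    def open_spider(self, spider):
        # Open the output file once, when the spider starts (Python 3).
        self.file = open('vul.json', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Close the file so buffered lines are flushed to disk.
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False writes Chinese characters as-is instead of \uXXXX escapes.
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item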
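
The pipeline needs special handling for Chinese text because json.dumps escapes non-ASCII characters by default. A minimal sketch of that behaviour, using a made-up item with a single title field:

```python
import json

# Hypothetical item; the title is the same two characters used in the example above.
item = {'title': '叫我'}

# Default behaviour: non-ASCII characters become \uXXXX escape sequences.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings are valid JSON and load back to the same data; the second is simply easier to read in a text editor.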

Then edit settings.py and uncomment the ITEM_PIPELINES setting.
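
Once uncommented, the setting would look roughly like this. The package name example is an assumption (it depends on the name passed to scrapy startproject); the class name matches the ExamplePipeline defined in pipelines.py.

```python
# settings.py (fragment); 'example' is an assumed project package name
ITEM_PIPELINES = {
    # The integer (conventionally 0-1000) orders the pipelines; lower values run first.
    'example.pipelines.ExamplePipeline': 300,
}
```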


The final result: [screenshot of the final output]
