1. Installing Scrapy
On Windows with Anaconda, open a command window and run conda install scrapy, type y to confirm, and wait for the installation to finish. Afterwards, run scrapy version; if a version number is printed, Scrapy is ready to use.
2. Scrapy Commands
Running scrapy -h lists the available commands; the command line is summarized in more detail later.
Scrapy 1.3.3 - project: quotetutorial
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
check Check spider contracts
commands
crawl Run a spider
edit Edit spider
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
list List available spiders
parse Parse URL (using its spider) and print the results
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
Use "scrapy -h" to see more info about a command
3. Creating a New Project
We will crawl http://quotes.toscrape.com/, a site built for testing Scrapy.
Goal: extract each quote's text, author, and tags.
1. In the command window, use cd to move into the folder where you want to keep the project.
2. Still in the command window, run scrapy startproject followed by your project name; here that is scrapy startproject quotetutorial.
The output ends with two hints, cd quotetutorial and scrapy genspider example example.com (that is, cd into your project folder, then scrapy genspider with your spider name and the target domain). Follow these hints to continue.
C:\Users\m1812>scrapy startproject quotetutorial
New Scrapy project 'quotetutorial', using template directory 'C:\\Users\\m1812\\Anaconda3\\lib\\site-packages\\scrapy\\templates\\project', created in:
C:\Users\m1812\quotetutorial
You can start your first spider with:
cd quotetutorial
scrapy genspider example example.com
3. Run cd quotetutorial to move into the newly created project folder.
C:\Users\m1812>cd quotetutorial
4. scrapy genspider quotes quotes.toscrape.com generates a spider file named quotes.py with quotes.toscrape.com as its target domain.
C:\Users\m1812\quotetutorial>scrapy genspider quotes quotes.toscrape.com
Created spider 'quotes' using template 'basic' in module:
quotetutorial.spiders.quotes
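At this point the project layout looks roughly like this (exact files vary slightly between Scrapy versions); the generated spider sits under spiders/:
quotetutorial/
    scrapy.cfg            # deploy configuration
    quotetutorial/
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            quotes.py     # the spider created by genspider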
4. A First Look at Scrapy
1. Modify the parse function in quotes.py so that it prints the page's HTML. For this page, print(response.text) raises an encoding error on the Windows console, so print the raw bytes instead. parse is the default callback: Scrapy calls it once the response for each start URL has been downloaded, and response holds the result of that request.
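A minimal sketch of this step, assuming the default spider generated by genspider (only the body of parse is changed):
# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # print the raw bytes; print(response.text) trips over the console encoding here
        print(response.body)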
2. In the command window, run scrapy crawl quotes to start the spider. Besides the page's HTML, Scrapy prints plenty of other information.
C:\Users\m1812\quotetutorial>scrapy crawl quotes
2019-04-05 19:50:11 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: quotetutorial)
2019-04-05 19:50:11 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'quotetutorial.spiders', 'SPIDER_MODULES': ['quotetutorial.spi
ders'], 'BOT_NAME': 'quotetutorial', 'ROBOTSTXT_OBEY': True}
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.logstats.LogStats']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-05 19:50:11 [scrapy.core.engine] INFO: Spider opened
2019-04-05 19:50:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-05 19:50:11 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-05 19:50:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-05 19:50:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
b'... (raw HTML of http://quotes.toscrape.com/: the ten quotes with their authors and tags; the quotation marks appear as escaped byte sequences because the body is printed as a bytes object; output truncated) ...'
2019-04-05 19:50:12 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-05 19:50:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 444,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2701,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 5, 11, 50, 12, 560342),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 4, 5, 11, 50, 11, 713697)}
2019-04-05 19:50:12 [scrapy.core.engine] INFO: Spider closed (finished)
5. Start Crawling
First, look at the HTML structure of the target data: each quote sits in a div with class quote, which contains a span with class text, a small element with class author, and a div with class tags holding several a elements with class tag.
1. Edit items.py and declare the three fields to be extracted in the required format; a sketch follows.
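The items.py screenshot is not reproduced here; the following minimal sketch is consistent with the spider code in the next step (the field names come from the item['text'], item['author'] and item['tags'] assignments):
# -*- coding: utf-8 -*-
import scrapy


class QuotetutorialItem(scrapy.Item):
    # the three fields filled in by the spider
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()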
2. Edit quotes.py, add the extraction rules, and map them onto the fields configured in items.py in step 1.
# -*- coding: utf-8 -*-
import scrapy
from quotetutorial.items import QuotetutorialItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuotetutorialItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item
This uses Scrapy's built-in CSS selectors.
In the command window, the shell command gives you an interactive console for experimenting: scrapy shell "quotes.toscrape.com" (note the double quotes). There we can see exactly what the CSS selectors above pick out and what the difference between extract_first() and extract() is.
C:\Users\m1812\quotetutorial>scrapy shell "quotes.toscrape.com"
2019-04-05 20:08:39 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: quotetutorial)
2019-04-05 20:08:39 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'quotetutorial', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilte
r', 'SPIDER_MODULES': ['quotetutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'quotetutorial.spiders'}
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-05 20:08:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-05 20:08:39 [scrapy.core.engine] INFO: Spider opened
2019-04-05 20:08:40 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-05 20:08:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com> (referer: None)
2019-04-05 20:08:46 [traitlets] DEBUG: Using default logger
2019-04-05 20:08:46 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x...>
[s] item {}
[s] request <GET http://quotes.toscrape.com>
[s] response <200 http://quotes.toscrape.com>
[s] settings <scrapy.settings.Settings object at 0x...>
[s] spider <QuotesSpider 'quotes' at 0x...>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]: response
Out[1]: <200 http://quotes.toscrape.com>
In [2]: quotes = response.css('.quote')
In [3]: quotes
Out[3]:
[<Selector xpath=... data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath=... data='<div class="quote" itemscope itemtype="h'>,
 ...
 <Selector xpath=... data='<div class="quote" itemscope itemtype="h'>]
In [4]: quotes[0]
Out[4]: <Selector xpath=... data='<div class="quote" itemscope itemtype="h'>
In [5]: quotes[0].css('.text')
Out[5]: [<Selector xpath=... data='<span class="text" itemprop="text">“The '>]
In [6]: quotes[0].css('.text::text')
Out[6]: [<Selector xpath=... data='“The world as we have created it is a pr'>]
In [7]: quotes[0].css('.text::text').extract()
Out[7]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
In [8]: quotes[0].css('.text').extract()
Out[8]: ['<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>']
In [9]: quotes[0].css('.text::text').extract_first()
Out[9]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
In [10]: quotes[0].css('.tags .tag::text').extract()
Out[10]: ['change', 'deep-thoughts', 'thinking', 'world']
In [11]: exit()
Now run the spider again with scrapy crawl quotes; the extracted items appear in the terminal alongside the log output.
3. Crawling a single page works; next comes pagination. The page URLs follow the pattern /page/2/, /page/3/ and so on, and the address of the next page can be read from the href attribute of the Next button.
Modify the code in quotes.py:
# -*- coding: utf-8 -*-
import scrapy
from quotetutorial.items import QuotetutorialItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuotetutorialItem()  # behaves like a dict
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

        # relative href of the Next button, e.g. /page/2/
        next = response.css('.pager .next a::attr(href)').extract_first()
        url = response.urljoin(next)  # build the absolute URL
        # request the next page with the same callback; on the last page next is None,
        # urljoin returns the current URL and the duplicate filter drops the request
        yield scrapy.Request(url=url, callback=self.parse)
4. Saving the data
On the command line, scrapy crawl quotes -o quotes.json saves the scraped items as JSON.
The other feed formats work the same way:
scrapy crawl quotes -o quotes.jl (JSON lines)
scrapy crawl quotes -o quotes.csv (CSV)
scrapy crawl quotes -o quotes.xml (XML)
scrapy crawl quotes -o quotes.pickle (pickle)
scrapy crawl quotes -o quotes.marshal (marshal)
5. Filtering out unwanted items or saving them to a database
For this, modify the code in pipelines.py.
The first pipeline below caps the text at 50 characters and appends ... to anything longer; the second saves the items to MongoDB.
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.exceptions import DropItem


# Cap the text at 50 characters; anything longer gets an ellipsis appended
class QuotetutorialPipeline(object):
    def __init__(self):
        self.limit = 50

    # process_item must either return the item or raise DropItem
    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # one collection per item class, named after the class
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # close the MongoDB connection
        self.client.close()
You also need to adjust the related settings in settings.py; a sketch of the additions follows.
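The settings screenshot is not reproduced here. A minimal sketch of the additions, assuming both pipelines are enabled and MongoDB runs locally (the priority numbers 300/400 and the MONGO_URI / MONGO_DB values are placeholders; the setting names themselves come from MongoPipeline.from_crawler above):
# settings.py (additions)
ITEM_PIPELINES = {
    'quotetutorial.pipelines.QuotetutorialPipeline': 300,
    'quotetutorial.pipelines.MongoPipeline': 400,
}
MONGO_URI = 'localhost'
MONGO_DB = 'quotetutorial'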
Re-run scrapy crawl quotes from the command line.
The saved data now also shows up in MongoDB.
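A quick way to check the stored data from Python, assuming the MONGO_URI and MONGO_DB values sketched above (the collection name is the item class name, as written by MongoPipeline.process_item):
import pymongo

client = pymongo.MongoClient('localhost')
db = client['quotetutorial']
# print a few stored quotes
for doc in db['QuotetutorialItem'].find().limit(3):
    print(doc['author'], '-', doc['text'])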
Adapted from the Scrapy tutorial by 崔庆才 (Cui Qingcai).