Python 3 Crawler Notes 13: A First Look at Scrapy

1. Installing Scrapy

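On Windows with Anaconda, run conda install scrapy in a command prompt, type y to confirm, and wait for the installation to finish. Then run scrapy version; if it prints a version number, Scrapy is ready to use.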
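
The same check can also be done from Python. The following is a minimal sketch that uses only the standard library (Python 3.8 or later); the helper name installed_version is made up for illustration and is not part of Scrapy.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version string when Scrapy is installed in this environment.
    print(installed_version("scrapy") or "Scrapy is not installed in this environment")
```

Run it with the interpreter of the Anaconda environment in which conda install scrapy was executed; a different interpreter may report that Scrapy is missing.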


2. Scrapy commands

Run scrapy -h to list the available commands. The command line will be covered in more detail in a later note.

Scrapy 1.3.3 - project: quotetutorial

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  commands
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command

3. Creating a project

The site to crawl is http://quotes.toscrape.com/, a site built for testing Scrapy.
Goal: extract each quote together with its author and tags.

[Figure: appearance of the site]

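1. In the command prompt, use cd to move to the folder where the project should be stored.
2. Run scrapy startproject <project name>. Here the command is scrapy startproject quotetutorial.
Scrapy prints two suggested follow-up commands, cd quotetutorial and scrapy genspider example example.com (that is, cd <project folder> and scrapy genspider <spider name> <target domain>). The next steps follow these hints.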

C:\Users\m1812>scrapy startproject quotetutorial
New Scrapy project 'quotetutorial', using template directory 'C:\\Users\\m1812\\Anaconda3\\lib\\site-packages\\scrapy\\templates\\project', created in:
    C:\Users\m1812\quotetutorial

You can start your first spider with:
    cd quotetutorial
    scrapy genspider example example.com

3. Run cd quotetutorial to move into the newly created folder.

C:\Users\m1812>cd quotetutorial

4. Run scrapy genspider quotes quotes.toscrape.com. This generates a spider file named quotes.py whose target domain is quotes.toscrape.com.

C:\Users\m1812\quotetutorial>scrapy genspider quotes quotes.toscrape.com
Created spider 'quotes' using template 'basic' in module:
  quotetutorial.spiders.quotes
Open the project in PyCharm; the generated structure is shown in the figure.

[Figure: project structure in PyCharm]
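
The steps above can also be scripted. The sketch below assumes that the scrapy executable is on PATH (as it normally is inside an activated Anaconda environment); the function names setup_commands and run_setup are hypothetical helpers, not part of Scrapy.

```python
import subprocess
from pathlib import Path


def setup_commands(project: str, spider: str, domain: str):
    """Return (argv, working directory) pairs mirroring steps 2 to 4."""
    return [
        # Step 2: create the project in the current folder.
        (["scrapy", "startproject", project], Path(".")),
        # Steps 3 and 4: generate the spider from inside the project folder.
        (["scrapy", "genspider", spider, domain], Path(project)),
    ]


def run_setup(project: str, spider: str, domain: str, base: Path = Path(".")) -> None:
    """Run the commands in order; base plays the role of the folder chosen in step 1."""
    for argv, cwd in setup_commands(project, spider, domain):
        # check=True raises CalledProcessError at the first failing command.
        subprocess.run(argv, cwd=base / cwd, check=True)
```

Calling run_setup("quotetutorial", "quotes", "quotes.toscrape.com") would then be equivalent to typing the commands by hand.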

4. A first look at Scrapy

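1. Edit the parse function in quotes.py so that it prints the page's HTML. For this page, a plain print(response.text) fails with an encoding error on the console. parse is the spider's default callback: Scrapy calls it once a request has been downloaded, and the response argument is the result returned for that request.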
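
The original code screenshot is not reproduced here, so the following is only a sketch of what such a parse function might look like. Encoding to GBK and ignoring unencodable characters is an assumption, chosen because it is consistent with the \xa1\xb0 and \xa1\xb1 bytes in the output of step 2. To keep the sketch runnable without Scrapy, the class does not inherit from scrapy.Spider as the real spider in quotes.py does, and FakeResponse and printable_html are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in for Scrapy's response object, for offline runs."""
    text: str  # Scrapy text responses expose the decoded page as .text


def printable_html(text: str, console_encoding: str = "gbk") -> bytes:
    """Encode page text for a console that cannot display every character.

    Characters missing from the console encoding are dropped ("ignore"),
    so printing the result cannot raise UnicodeEncodeError.
    """
    return text.encode(console_encoding, errors="ignore")


class QuotesSpider:  # in the project: class QuotesSpider(scrapy.Spider)
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Scrapy calls this with the downloaded response for each start URL.
        print(printable_html(response.text))


if __name__ == "__main__":
    page = FakeResponse(text="<span>\u201cIt is our choices, Harry\u201d</span>")
    QuotesSpider().parse(page)
```

Because the text is encoded before printing, the console shows a bytes literal (b'...') in which the curly quotation marks appear as \xa1\xb0 and \xa1\xb1 rather than as readable characters.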


2. Run the spider from the command prompt with scrapy crawl quotes. Besides the page's HTML, Scrapy prints a lot of other information.
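
One of the extra lines in the output that follows is "Overridden settings", which lists the values that the project's settings.py changes from Scrapy's defaults. As a sketch reconstructed only from that log line (the generated file also contains further commented-out options), the relevant entries would look roughly like this:

```python
# settings.py (excerpt, reconstructed from the "Overridden settings" log line)

BOT_NAME = "quotetutorial"

# Packages where Scrapy looks for spiders, and where genspider creates new ones.
SPIDER_MODULES = ["quotetutorial.spiders"]
NEWSPIDER_MODULE = "quotetutorial.spiders"

# With this enabled, Scrapy requests the site's robots.txt before crawling.
ROBOTSTXT_OBEY = True
```

ROBOTSTXT_OBEY being True would explain why the log shows one request, answered with 404, before the start page is fetched with status 200.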

C:\Users\m1812\quotetutorial>scrapy crawl quotes
2019-04-05 19:50:11 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: quotetutorial)
2019-04-05 19:50:11 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'quotetutorial.spiders', 'SPIDER_MODULES': ['quotetutorial.spiders'], 'BOT_NAME': 'quotetutorial', 'ROBOTSTXT_OBEY': True}
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-05 19:50:11 [scrapy.core.engine] INFO: Spider opened
2019-04-05 19:50:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-05 19:50:11 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-05 19:50:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-05 19:50:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
b'\n\n\n\t\n\tQuotes to Scrape\n    \n    \n\n\n    
\n
\n
\n

\n Quotes to Scrape\n

\n
\n
\n

\n \n Login\n \n

\n
\n
\n \n\n
\n
\n\n
\n \xa1\xb0The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\xa1\xb1\n by \n (about)\n \n
\n Tags:\n \n \n change\n \n deep-thoughts\n \n thinking\n \n world\n \n
\n
\n\n
\n \xa1\xb0It is our choices, Harry, that show what we truly are, far more than our abilities.\xa1\xb1\n by \n (about)\n \n
\n Tags:\n \n \n abilities\n \n choices\n \n
\n
\n\n
\n \xa1\xb0There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\xa1\xb1\n by \n (about)\n \n
\n Tags:\n \n \n inspirational\n \n life\n \n live\n \n miracle\n \n miracles\n \n
\n
\n\n
\n \xa1\xb0The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\xa1\xb1\n by \n (about)\n \n
\n Tags:\n \n \n aliteracy\n \n books\n \n classic\n \n humor\n \n
\n
\n\n
\n \xa1\xb0Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\xa1\xb1\n by \n (about)\n \n \n
\n\n
\n \xa1\xb0Try not to become a man of success. Rather become a man of value.\xa1\xb1\n by \n (about)\n \n
\n Tags:\n \n \n adulthood\n \n success\n \n value\n \n
\n
\n\n
\n \xa1\xb0It is better to be hated for what you are than to be loved for what you are not.\xa1\xb1\n by \n (about)\n \n
\n Tags:\n \n \n life\n \n love\n \n
\n
\n\n
\n \xa1\xb0I have not failed. I've just found 10,000 ways that won&#39;t work.\xa1\xb1\n by \n (about)\n \n
\n Tags:\n \n \n edison\n \n failure\n \n inspirational\n \n paraphrased\n \n
\n
\n\n
\n \xa1\xb0A woman is like a tea bag; you never know how strong it is until it's in hot water.\xa1\xb1\n by