Scrapy Tutorial

Install

sudo apt-get install libxml2-dev libxslt1-dev libffi-dev
sudo pip install lxml
git clone git://github.com/scrapy/scrapy.git
cd /path/to/scrapy/
sudo python setup.py install
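
Once setup.py finishes, the version subcommand (listed under Usage below) is a
quick sanity check that the install worked; the version printed should match
the checkout you built:

[nixawk@core ~]$ scrapy version
Scrapy 0.25.1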

Usage

[nixawk@core tutorial]$ scrapy -h
Scrapy 0.25.1 - project: tutorial

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  deploy        Deploy project in Scrapyd target
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy -h" to see more info about a command

Start a new project

[nixawk@core ~]$ scrapy startproject tutorial
2015-01-20 03:07:20+0000 [scrapy] INFO: Scrapy 0.25.1 started (bot: scrapybot)
2015-01-20 03:07:20+0000 [scrapy] INFO: Optional features available: ssl, http11
2015-01-20 03:07:20+0000 [scrapy] INFO: Overridden settings: {}
New Scrapy project 'tutorial' created in:
    /home/notfound/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
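
genspider fills in a spider skeleton from a template; running the suggested
command would produce something close to the sketch below (the exact template
text varies between Scrapy versions):

# example.py, as generated by "scrapy genspider example example.com" (approximate)
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        pass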

Files

[nixawk@core share]$ tree ./tutorial/
./tutorial/
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── __init__.pyc
    ├── items.py
    ├── items.pyc
    ├── pipelines.py
    ├── settings.py
    ├── settings.pyc
    └── spiders
        ├── __init__.py
        ├── __init__.pyc
        ├── tutorial_spider.py
        └── tutorial_spider.pyc

2 directories, 12 files
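
scrapy.cfg holds deploy settings, items.py declares the data model, settings.py
the project configuration, and spiders/ the crawl code. A minimal item pipeline
looks like the sketch below (close to the stub the project template generates;
the priority value 300 is an arbitrary example):

# tutorial/pipelines.py - minimal sketch of an item pipeline
class TutorialPipeline(object):
    def process_item(self, item, spider):
        # clean, validate, or drop the item here, then pass it along;
        # items only arrive here when a spider yields them
        return item

To activate it, settings.py would need:

ITEM_PIPELINES = {'tutorial.pipelines.TutorialPipeline': 300}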

Demo – a simple spider

[nixawk@core tutorial]$ cat ./tutorial/items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
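
An Item behaves like a dict whose keys are restricted to the declared fields,
which catches typos early. A quick interactive sketch (the field values here
are examples):

>>> from tutorial.items import TutorialItem
>>> item = TutorialItem(title=[u'Preface'], link=[u'preface.html'])
>>> item['title']
[u'Preface']
>>> item['author'] = 'x'
KeyError: 'TutorialItem does not support field: author'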

[nixawk@core tutorial]$ cat ./tutorial/spiders/tutorial_spider.py
import scrapy
from tutorial.items import TutorialItem
from pprint import pprint


class TutorialSpider(scrapy.Spider):
    name = "tutorial"
    allowed_domains = ["learnpythonthehardway.org"]
    start_urls = [
        "http://learnpythonthehardway.org/book/"
    ]

    def parse(self, response):
        # a Response exposes selector shortcuts:
        #   response.selector          - the Selector built for this response
        #   response.selector.xpath()  - query it with XPath
        #   response.selector.css()    - query it with CSS
        #   response.xpath() and response.css() are shorthands for the two above

        for sel in response.xpath('//ul[@class="simple"]'):
            item = TutorialItem()

            item['title'] = sel.xpath(
                'li/a[@class="reference external"]/text()').extract()

            item['link'] = sel.xpath(
                'li/a[@class="reference external"]/@href').extract()

            pprint(item)
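
Note that this parse() only pretty-prints each item. Yielding the items instead
hands them to the item pipelines and feed exporters; a sketch of an equivalent
loop using the response.css() shortcut mentioned in the comments above:

    def parse(self, response):
        for sel in response.css('ul.simple'):
            item = TutorialItem()
            # CSS equivalents of the XPath expressions used above
            item['title'] = sel.css('li a.reference.external::text').extract()
            item['link'] = sel.css('li a.reference.external::attr(href)').extract()
            yield item  # yielded items flow on to pipelines and exporters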

Result

[nixawk@core tutorial]$ scrapy crawl tutorial
2015-01-20 03:00:11+0000 [scrapy] INFO: Scrapy 0.25.1 started (bot: tutorial)
2015-01-20 03:00:11+0000 [scrapy] INFO: Optional features available: ssl, http11
2015-01-20 03:00:11+0000 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
/usr/lib/python2.7/site-packages/Twisted-14.0.2-py2.7-linux-x86_64.egg/twisted/internet/_sslverify.py:184: UserWarning: You do not have the service_identity module installed. Please install it from <https://pypi.python.org/pypi/service_identity>. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification.  Many valid certificate/hostname mappings may be rejected.
  verifyHostname, VerificationError = _selectVerifyImplementation()
2015-01-20 03:00:17+0000 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, CoreStats, SpiderState
2015-01-20 03:00:17+0000 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-01-20 03:00:17+0000 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-01-20 03:00:17+0000 [scrapy] INFO: Enabled item pipelines:
2015-01-20 03:00:17+0000 [tutorial] INFO: Spider opened
2015-01-20 03:00:17+0000 [tutorial] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-01-20 03:00:17+0000 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-01-20 03:00:18+0000 [tutorial] DEBUG: Crawled (200) <GET http://learnpythonthehardway.org/book/> (referer: None)
{'link': [u'preface.html',
          u'intro.html',
          u'ex0.html',
          u'ex1.html',
          u'ex2.html',
          u'ex3.html',
          u'ex4.html',
          u'ex5.html',
          u'ex6.html',
          u'ex7.html',
          u'ex8.html',
          u'ex9.html',
          u'ex10.html',
          u'ex11.html',
          u'ex12.html',
          u'ex13.html',
          u'ex14.html',
          u'ex15.html',
          u'ex16.html',
          u'ex17.html',
          u'ex18.html',
          u'ex19.html',
          u'ex20.html',
          u'ex21.html',
          u'ex22.html',
          u'ex23.html',
          u'ex24.html',
          u'ex25.html',
          u'ex26.html',
          u'ex27.html',
          u'ex28.html',
          u'ex29.html',
          u'ex30.html',
          u'ex31.html',
          u'ex32.html',
          u'ex33.html',
          u'ex34.html',
          u'ex35.html',
          u'ex36.html',
          u'ex37.html',
          u'ex38.html',
          u'ex39.html',
          u'ex40.html',
          u'ex41.html',
          u'ex42.html',
          u'ex43.html',
          u'ex44.html',
          u'ex45.html',
          u'ex46.html',
          u'ex47.html',
          u'ex48.html',
          u'ex49.html',
          u'ex50.html',
          u'ex51.html',
          u'ex52.html',
          u'advice.html',
          u'next.html',
          u'appendixa.html'],
 'title': [u'Preface',
           u'Introduction: The Hard Way Is Easier',
           u'Exercise 0: The Setup',
           u'Exercise 1: A Good First Program',
           u'Exercise 2: Comments And Pound Characters',
           u'Exercise 3: Numbers And Math',
           u'Exercise 4: Variables And Names',
           u'Exercise 5: More Variables And Printing',
           u'Exercise 6: Strings And Text',
           u'Exercise 7: More Printing',
           u'Exercise 8: Printing, Printing',
           u'Exercise 9: Printing, Printing, Printing',
           u'Exercise 10: What Was That?',
           u'Exercise 11: Asking Questions',
           u'Exercise 12: Prompting People',
           u'Exercise 13: Parameters, Unpacking, Variables',
           u'Exercise 14: Prompting And Passing',
           u'Exercise 15: Reading Files',
           u'Exercise 16: Reading And Writing Files',
           u'Exercise 17: More Files',
           u'Exercise 18: Names, Variables, Code, Functions',
           u'Exercise 19: Functions And Variables',
           u'Exercise 20: Functions And Files',
           u'Exercise 21: Functions Can Return Something',
           u'Exercise 22: What Do You Know So Far?',
           u'Exercise 23: Read Some Code',
           u'Exercise 24: More Practice',
           u'Exercise 25: Even More Practice',
           u'Exercise 26: Congratulations, Take A Test!',
           u'Exercise 27: Memorizing Logic',
           u'Exercise 28: Boolean Practice',
           u'Exercise 29: What If',
           u'Exercise 30: Else And If',
           u'Exercise 31: Making Decisions',
           u'Exercise 32: Loops And Lists',
           u'Exercise 33: While Loops',
           u'Exercise 34: Accessing Elements Of Lists',
           u'Exercise 35: Branches and Functions',
           u'Exercise 36: Designing and Debugging',
           u'Exercise 37: Symbol Review',
           u'Exercise 38: Doing Things To Lists',
           u'Exercise 39: Dictionaries, Oh Lovely Dictionaries',
           u'Exercise 40: Modules, Classes, And Objects',
           u'Exercise 41: Learning To Speak Object Oriented',
           u'Exercise 42: Is-A, Has-A, Objects, and Classes',
           u'Exercise 43: Gothons From Planet Percal #25',
           u'Exercise 44: Inheritance Vs. Composition',
           u'Exercise 45: You Make A Game',
           u'Exercise 46: A Project Skeleton',
           u'Exercise 47: Automated Testing',
           u'Exercise 48: Advanced User Input',
           u'Exercise 49: Making Sentences',
           u'Exercise 50: Your First Website',
           u'Exercise 51: Getting Input From A Browser',
           u'Exercise 52: The Start Of Your Web Game',
           u'Advice From An Old Programmer',
           u'Next Steps',
           u'Appendix A: Command Line Crash Course']}
2015-01-20 03:00:18+0000 [tutorial] INFO: Closing spider (finished)
2015-01-20 03:00:18+0000 [tutorial] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 229,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 4297,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 1, 20, 3, 0, 18, 468030),
     'log_count/DEBUG': 1,
     'log_count/INFO': 3,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2015, 1, 20, 3, 0, 17, 501193)}
2015-01-20 03:00:18+0000 [tutorial] INFO: Spider closed (finished)
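
With the yielding variant of parse() shown earlier, the built-in feed exporters
can write the scraped items straight to a file:

[nixawk@core tutorial]$ scrapy crawl tutorial -o items.json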