Install
```
sudo apt-get install libxml2-dev libxslt1-dev libffi-dev   # system build dependencies (Debian/Ubuntu package names)
sudo pip install lxml
git clone git://github.com/scrapy/scrapy.git
cd /path/to/scrapy/
sudo python setup.py install
```
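A quick sanity check that the install worked is to ask the freshly installed command for its version; with the checkout used in this tutorial it should report something like:

```
$ scrapy version
Scrapy 0.25.1
```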
Usage
```
[nixawk@core tutorial]$ scrapy -h
Scrapy 0.25.1 - project: tutorial

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  deploy        Deploy project in Scrapyd target
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
```
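Two of these are handy while developing a spider: `fetch` downloads a page through Scrapy's downloader and prints the raw response, and `view` opens the downloaded page in a browser exactly as Scrapy received it (useful for spotting content that only appears via JavaScript). For example, against the site this tutorial scrapes:

```
scrapy fetch http://learnpythonthehardway.org/book/
scrapy view http://learnpythonthehardway.org/book/
```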
Start a new project
```
[nixawk@core ~]$ scrapy startproject tutorial
2015-01-20 03:07:20+0000 [scrapy] INFO: Scrapy 0.25.1 started (bot: scrapybot)
2015-01-20 03:07:20+0000 [scrapy] INFO: Optional features available: ssl, http11
2015-01-20 03:07:20+0000 [scrapy] INFO: Overridden settings: {}
New Scrapy project 'tutorial' created in:
    /home/notfound/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
```
Files
```
[nixawk@core share]$ tree ./tutorial/
./tutorial/
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── __init__.pyc
    ├── items.py
    ├── items.pyc
    ├── pipelines.py
    ├── settings.py
    ├── settings.pyc
    └── spiders
        ├── __init__.py
        ├── __init__.pyc
        ├── tutorial_spider.py
        └── tutorial_spider.pyc

2 directories, 12 files
```
Demo – a simple spider
```python
[nixawk@core tutorial]$ cat ./tutorial/items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
```
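A `scrapy.Item` behaves like a dict with a fixed set of keys, which is why the spider below can assign to `item['title']` and `item['link']`. A minimal sketch of that behavior (run from the project root; the values here are made up):

```python
from tutorial.items import TutorialItem

item = TutorialItem(title=u'Preface', link=u'preface.html')
item['desc'] = u'front matter'   # declared fields are set/read with dict syntax
print item['title']              # -> Preface
print dict(item)                 # plain-dict view of the populated fields
# item['author'] = u'x'          # would raise KeyError: not a declared Field
```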
```python
[nixawk@core tutorial]$ cat ./tutorial/spiders/tutorial_spider.py
import scrapy
from tutorial.items import TutorialItem
from pprint import pprint


class TutorialSpider(scrapy.spider.Spider):
    name = "tutorial"
    allowed_domains = ["learnpythonthehardway.org"]
    start_urls = [
        "http://learnpythonthehardway.org/book/"
    ]

    def parse(self, response):
        # response.xpath() and response.css() are shortcuts for
        # response.selector.xpath() and response.selector.css()
        for sel in response.xpath('//ul[@class="simple"]'):
            item = TutorialItem()
            item['title'] = sel.xpath(
                'li/a[@class="reference external"]/text()').extract()
            item['link'] = sel.xpath(
                'li/a[@class="reference external"]/@href').extract()
            pprint(item)
```
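As written, `parse()` only pretty-prints each item, so nothing reaches Scrapy's item pipelines or feed exports (note the empty "Enabled item pipelines:" line and the absence of an item count in the log below). If you want Scrapy to collect the items, yield them instead; a sketch of the same callback with only that change:

```python
    def parse(self, response):
        for sel in response.xpath('//ul[@class="simple"]'):
            item = TutorialItem()
            item['title'] = sel.xpath(
                'li/a[@class="reference external"]/text()').extract()
            item['link'] = sel.xpath(
                'li/a[@class="reference external"]/@href').extract()
            yield item   # hand the item to pipelines and feed exports
```

With that change, `scrapy crawl tutorial -o items.json` writes the scraped items straight to a JSON file.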
Result
```
[nixawk@core tutorial]$ scrapy crawl tutorial
2015-01-20 03:00:11+0000 [scrapy] INFO: Scrapy 0.25.1 started (bot: tutorial)
2015-01-20 03:00:11+0000 [scrapy] INFO: Optional features available: ssl, http11
2015-01-20 03:00:11+0000 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
/usr/lib/python2.7/site-packages/Twisted-14.0.2-py2.7-linux-x86_64.egg/twisted/internet/_sslverify.py:184: UserWarning: You do not have the service_identity module installed. Please install it from <https://pypi.python.org/pypi/service_identity>.
  verifyHostname, VerificationError = _selectVerifyImplementation()
2015-01-20 03:00:17+0000 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, CoreStats, SpiderState
2015-01-20 03:00:17+0000 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-01-20 03:00:17+0000 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-01-20 03:00:17+0000 [scrapy] INFO: Enabled item pipelines:
2015-01-20 03:00:17+0000 [tutorial] INFO: Spider opened
2015-01-20 03:00:17+0000 [tutorial] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-01-20 03:00:17+0000 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-01-20 03:00:18+0000 [tutorial] DEBUG: Crawled (200) <GET http://learnpythonthehardway.org/book/> (referer: None)
{'link': [u'preface.html', u'intro.html', u'ex0.html', u'ex1.html',
          u'ex2.html', u'ex3.html', u'ex4.html', u'ex5.html', u'ex6.html',
          u'ex7.html', u'ex8.html', u'ex9.html', u'ex10.html', u'ex11.html',
          u'ex12.html', u'ex13.html', u'ex14.html', u'ex15.html', u'ex16.html',
          u'ex17.html', u'ex18.html', u'ex19.html', u'ex20.html', u'ex21.html',
          u'ex22.html', u'ex23.html', u'ex24.html', u'ex25.html', u'ex26.html',
          u'ex27.html', u'ex28.html', u'ex29.html', u'ex30.html', u'ex31.html',
          u'ex32.html', u'ex33.html', u'ex34.html', u'ex35.html', u'ex36.html',
          u'ex37.html', u'ex38.html', u'ex39.html', u'ex40.html', u'ex41.html',
          u'ex42.html', u'ex43.html', u'ex44.html', u'ex45.html', u'ex46.html',
          u'ex47.html', u'ex48.html', u'ex49.html', u'ex50.html', u'ex51.html',
          u'ex52.html', u'advice.html', u'next.html', u'appendixa.html'],
 'title': [u'Preface', u'Introduction: The Hard Way Is Easier',
           u'Exercise 0: The Setup', u'Exercise 1: A Good First Program',
           u'Exercise 2: Comments And Pound Characters',
           u'Exercise 3: Numbers And Math', u'Exercise 4: Variables And Names',
           u'Exercise 5: More Variables And Printing',
           u'Exercise 6: Strings And Text', u'Exercise 7: More Printing',
           u'Exercise 8: Printing, Printing',
           u'Exercise 9: Printing, Printing, Printing',
           u'Exercise 10: What Was That?', u'Exercise 11: Asking Questions',
           u'Exercise 12: Prompting People',
           u'Exercise 13: Parameters, Unpacking, Variables',
           u'Exercise 14: Prompting And Passing', u'Exercise 15: Reading Files',
           u'Exercise 16: Reading And Writing Files',
           u'Exercise 17: More Files',
           u'Exercise 18: Names, Variables, Code, Functions',
           u'Exercise 19: Functions And Variables',
           u'Exercise 20: Functions And Files',
           u'Exercise 21: Functions Can Return Something',
           u'Exercise 22: What Do You Know So Far?',
           u'Exercise 23: Read Some Code', u'Exercise 24: More Practice',
           u'Exercise 25: Even More Practice',
           u'Exercise 26: Congratulations, Take A Test!',
           u'Exercise 27: Memorizing Logic', u'Exercise 28: Boolean Practice',
           u'Exercise 29: What If', u'Exercise 30: Else And If',
           u'Exercise 31: Making Decisions', u'Exercise 32: Loops And Lists',
           u'Exercise 33: While Loops',
           u'Exercise 34: Accessing Elements Of Lists',
           u'Exercise 35: Branches and Functions',
           u'Exercise 36: Designing and Debugging',
           u'Exercise 37: Symbol Review',
           u'Exercise 38: Doing Things To Lists',
           u'Exercise 39: Dictionaries, Oh Lovely Dictionaries',
           u'Exercise 40: Modules, Classes, And Objects',
           u'Exercise 41: Learning To Speak Object Oriented',
           u'Exercise 42: Is-A, Has-A, Objects, and Classes',
           u'Exercise 43: Gothons From Planet Percal #25',
           u'Exercise 44: Inheritance Vs. Composition',
           u'Exercise 45: You Make A Game',
           u'Exercise 46: A Project Skeleton',
           u'Exercise 47: Automated Testing',
           u'Exercise 48: Advanced User Input',
           u'Exercise 49: Making Sentences',
           u'Exercise 50: Your First Website',
           u'Exercise 51: Getting Input From A Browser',
           u'Exercise 52: The Start Of Your Web Game',
           u'Advice From An Old Programmer', u'Next Steps',
           u'Appendix A: Command Line Crash Course']}
2015-01-20 03:00:18+0000 [tutorial] INFO: Closing spider (finished)
2015-01-20 03:00:18+0000 [tutorial] INFO: Dumping Scrapy stats:
	{'downloader/request_bytes': 229,
	 'downloader/request_count': 1,
	 'downloader/request_method_count/GET': 1,
	 'downloader/response_bytes': 4297,
	 'downloader/response_count': 1,
	 'downloader/response_status_count/200': 1,
	 'finish_reason': 'finished',
	 'finish_time': datetime.datetime(2015, 1, 20, 3, 0, 18, 468030),
	 'log_count/DEBUG': 1,
	 'log_count/INFO': 3,
	 'response_received_count': 1,
	 'scheduler/dequeued': 1,
	 'scheduler/dequeued/memory': 1,
	 'scheduler/enqueued': 1,
	 'scheduler/enqueued/memory': 1,
	 'start_time': datetime.datetime(2015, 1, 20, 3, 0, 17, 501193)}
2015-01-20 03:00:18+0000 [tutorial] INFO: Spider closed (finished)
```
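The XPath expressions in the spider are easiest to work out interactively before committing them to code. `scrapy shell` fetches a page and drops you into a console with `response` already bound, so you can iterate on selectors until they return what you expect; a sketch of such a session (output elided):

```
$ scrapy shell http://learnpythonthehardway.org/book/
>>> response.xpath('//ul[@class="simple"]/li/a[@class="reference external"]/text()').extract()
>>> response.xpath('//ul[@class="simple"]/li/a[@class="reference external"]/@href').extract()
```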