Using Scrapy: a correction to "Building a Crawler System with Scrapy and MongoDB in Python" (《Python下用Scrapy和MongoDB构建爬虫系统》)

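This post fixes a small problem in the article "Building a Crawler System with Scrapy and MongoDB in Python" (《Python下用Scrapy和MongoDB构建爬虫系统》), http://python.jobbole.com/81320/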
1. Create the project: scrapy startproject air2 (the project is named air2).
The spider crawls the first page of the question list on stackoverflow.com: http://stackoverflow.com/questions?pagesize=50&sort=newest
2. Directory structure

├── scrapy.cfg
└── air2
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── air2_spider.py
        └── __init__.py

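3. What was changed: in each file, every class that comes from Scrapy is now referenced through the scrapy namespace (scrapy.Item, scrapy.Field, scrapy.Selector), which avoids the message "Crawled 0 pages (at 0 pages/min)".
Two files were modified: air2/spiders/air2_spider.py and air2/items.py.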

The code of air2/spiders/air2_spider.py is as follows:

import sys

# Make the parent directory (air2/air2) importable so that items.py can be
# found; this relies on the crawl being started from air2/air2/spiders/.
sys.path.insert(0, '..')
import items
import scrapy
from scrapy import Spider


class Air2Spider(Spider):
    name = "air2"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["http://stackoverflow.com/questions?pagesize=50&sort=newest"]

    def parse(self, response):
        # Each question title sits in an <h3> inside <div class="summary">.
        sel = scrapy.Selector(response)
        questions = sel.xpath('//div[@class="summary"]/h3')

        for question in questions:
            item = items.Air2Item()
            # extract() returns a list of matches; take the first one.
            title = question.xpath('a[@class="question-hyperlink"]/text()').extract()[0]
            url = question.xpath('a[@class="question-hyperlink"]/@href').extract()[0]
            print(title)
            print(url)
            item['title'] = title
            item['url'] = url
            yield item
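
The extraction step in parse() can be tried without running a crawl. The sketch below is only an illustration: it swaps Scrapy's Selector for the standard-library xml.etree.ElementTree, which supports a small subset of XPath (no /text() or /@href steps, so .text and .get() are used instead). SAMPLE_HTML and extract_questions are hypothetical names, and the markup is a simplified stand-in for the real page, not a copy of it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that imitates the structure the spider
# expects: <div class="summary"><h3><a class="question-hyperlink" ...>.
SAMPLE_HTML = """
<html><body>
  <div class="summary">
    <h3><a class="question-hyperlink" href="/questions/1/first-example">First example</a></h3>
  </div>
  <div class="summary">
    <h3><a class="question-hyperlink" href="/questions/2/second-example">Second example</a></h3>
  </div>
</body></html>
"""


def extract_questions(markup):
    """Return a list of {'title', 'url'} dicts, one per question heading."""
    root = ET.fromstring(markup)
    results = []
    # Same path as in the spider, written in ElementTree's XPath subset.
    for heading in root.findall(".//div[@class='summary']/h3"):
        link = heading.find("a[@class='question-hyperlink']")
        if link is None:
            continue  # heading without a question link: skip it
        results.append({"title": link.text, "url": link.get("href")})
    return results


questions = extract_questions(SAMPLE_HTML)
for q in questions:
    print(q["title"], q["url"])
```

ElementTree only parses well-formed XML, so this works on the tidy sample above but is not a replacement for Scrapy's selectors on real pages.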

The code of air2/items.py is as follows:

import scrapy

class Air2Item(scrapy.Item):
    # define the fields for your item here like:
    title=scrapy.Field()
    url=scrapy.Field()

4. Run
cd air2/air2/spiders/
scrapy crawl air2 -o items.json -t json
This produces the file items.json:

[{"url": "/questions/29971480/facebook-unity-multi-friend-selector-not-showing-pictures-on-developer-users-whe", "title": "Facebook Unity multi friend selector not showing pictures on developer users when doing FB.AppRequest"},
{"url": "/questions/29971477/reading-information-from-memory-array-into-excel-sheet-formula", "title": "Reading information from Memory array into excel sheet formula"},
...]
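
The url values in items.json are relative paths. As a minimal sketch of one possible post-processing step (not part of the original article), they can be turned into absolute links with the standard library's urljoin. BASE_URL and absolutize are hypothetical names, and the sample record is copied from the output above rather than read from the file.

```python
import json
from urllib.parse import urljoin

# Assumption: the relative URLs are resolved against the site that was crawled.
BASE_URL = "http://stackoverflow.com"


def absolutize(records, base=BASE_URL):
    """Return copies of the records with 'url' made absolute."""
    return [
        {"title": rec["title"], "url": urljoin(base, rec["url"])}
        for rec in records
    ]


# One record taken from the items.json output shown above.
sample = json.loads(
    '[{"url": "/questions/29971477/reading-information-from-memory-array-into-excel-sheet-formula",'
    ' "title": "Reading information from Memory array into excel sheet formula"}]'
)

result = absolutize(sample)
print(result[0]["url"])
```

To process the real file, the sample could be replaced by json.load() on an open handle to items.json.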
