Python 2.7.3: a Scrapy example

0. Install Scrapy

For details, see the installation guide:

http://doc.scrapy.org/en/0.14/intro/install.html

1. Create the project

scrapy startproject dmoz

2. Write the spider

In the dmoz/spiders directory under the dmoz project root, create dmoz_spider.py

with the following content:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from dmoz.items import DmozItem


class DmozSpider(BaseSpider):
    # The name is what "scrapy crawl" expects on the command line.
    name = "dmoz.org"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Called once for each downloaded start URL.
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            # extract() returns a list of strings for each XPath match.
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items

3. Run the spider

From the project's top-level directory, run:

scrapy crawl dmoz.org --set FEED_URI=items.json --set FEED_FORMAT=json
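
The two --set options override settings for a single run. As a sketch, the same values could instead be placed in the project's settings module, assuming the dmoz/settings.py file generated by startproject and the Scrapy 0.14 feed export settings named in the command above:

```python
# dmoz/settings.py (excerpt; assumed location of the project settings)
# Where the exported items are written, relative to the working directory.
FEED_URI = "items.json"
# Serialization format used by the feed exporter.
FEED_FORMAT = "json"
```

With these in place, running scrapy crawl dmoz.org without the --set options should produce the same file.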

4. Check the results

An items.json file appears in the directory where you ran the command:

cat items.json
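
Beyond cat, the output can also be inspected from Python. The following is a minimal sketch that assumes items.json holds a single JSON array of objects with title, link and desc keys, each mapped to a list of strings (as produced by extract() in the spider). The helper names load_items and summarize are made up for illustration.

```python
import io
import json


def load_items(path="items.json"):
    """Load the list of scraped items from the JSON feed file."""
    with io.open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize(items):
    """Return (title, link) pairs, skipping entries that have no link."""
    rows = []
    for item in items:
        # Every field is a list, because extract() returns a list of strings.
        titles = item.get("title") or [""]
        links = item.get("link") or []
        if links:
            rows.append((titles[0].strip(), links[0]))
    return rows


if __name__ == "__main__":
    for title, link in summarize(load_items()):
        print("%s -> %s" % (title, link))
```

Entries without a link are skipped because the //ul/li XPath also matches list items that are not directory entries.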

 
