Scrapy Spider Prelude

  1. Inspect the page and identify the data to scrape
  2. Extract the data with XPath
  3. Run the spider to fetch the site's data and store it as JSON/XML, or use an item pipeline to store items in a database (a minimal pipeline sketch appears at the end of this post)

A programmer's daily dose:

scrapy 0.24
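If Scrapy isn't installed yet, pinning the 0.24 release line with pip should work (the exact point release is just an example; any 0.24.x should do):

    pip install scrapy==0.24.4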


scrapy startproject tutorial
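startproject generates a project skeleton; under 0.24 the layout looks roughly like this:

    tutorial/
        scrapy.cfg            # deploy configuration file
        tutorial/             # the project's Python module
            __init__.py
            items.py          # item definitions go here
            pipelines.py      # item pipelines go here
            settings.py       # project settings
            spiders/          # spider code goes here
                __init__.py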

cd tutorial

vim tutorial/items.py

    import scrapy

    class TutorialItem(scrapy.Item):
        # one Field per piece of data to scrape
        title = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()
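Scrapy items behave like dicts, so a quick sanity check in an interactive session might look like this (the field values here are made up):

    >>> from tutorial.items import TutorialItem
    >>> item = TutorialItem(title=['Example Book'])
    >>> item['title']
    ['Example Book']
    >>> item['link'] = ['http://example.com/']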

scrapy genspider dmoz dmoz.org

    # -*- coding: utf-8 -*-
    import scrapy


    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = (
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        )

        def parse(self, response):
            # dump the raw body of each response into a local file
            filename = response.url.split("/")[-2]
            with open(filename, 'wb') as f:
                f.write(response.body)

Note that start_urls is a tuple () here, unlike the list in the official reference docs, so the trailing "," must not be omitted. Running the spider produces a file named Books containing the body of the given URL. It's similar to how Evernote saves web pages; ha, now I could write my own web clipper~
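To try it, run the spider from the project root using the name defined above:

    scrapy crawl dmoz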

    # -*- coding: utf-8 -*-
    import scrapy
    from tutorial.items import TutorialItem

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = (
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        )

        def parse(self, response):
            # each <li> under a <ul> is one book entry
            for sel in response.xpath('//ul/li'):
                item = TutorialItem()
                item['title'] = sel.xpath('a/text()').extract()
                item['link'] = sel.xpath('a/@href').extract()
                item['desc'] = sel.xpath('text()').extract()
                yield item

Extract the data with XPath and fill it into the item.
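Before baking XPath expressions into the spider, Scrapy's interactive shell is a convenient place to test them (these are the same expressions used in parse() above):

    scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
    >>> response.xpath('//ul/li/a/text()').extract()
    >>> response.xpath('//ul/li/a/@href').extract()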

scrapy crawl dmoz -o items.json

The -o flag uses Scrapy's feed exports to store the extracted items in a JSON file.
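For step 3's database option, an item pipeline is the usual route. Below is a minimal sketch, not a definitive implementation: the SqlitePipeline name, the items.db path, and the books table layout are all made up for illustration. The class would go in tutorial/pipelines.py, and the last line enables it in settings.py:

    # tutorial/pipelines.py
    import sqlite3

    class SqlitePipeline(object):
        def open_spider(self, spider):
            # hypothetical database file, created if absent
            self.conn = sqlite3.connect('items.db')
            self.conn.execute(
                'CREATE TABLE IF NOT EXISTS books (title TEXT, link TEXT, desc TEXT)')

        def process_item(self, item, spider):
            # extract() returns lists, so join them into plain strings
            self.conn.execute(
                'INSERT INTO books VALUES (?, ?, ?)',
                (u''.join(item['title']), u''.join(item['link']), u''.join(item['desc'])))
            return item

        def close_spider(self, spider):
            self.conn.commit()
            self.conn.close()

    # settings.py
    ITEM_PIPELINES = {'tutorial.pipelines.SqlitePipeline': 300}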
