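A First Look at Scrapy, a Python Framework

What is Scrapy

Scrapy is an application framework for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.

Chinese tutorial link

Upgrading pip on macOS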

pip install -U pip

Installing mysql-python on macOS

pip install mysql-python

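Fixing the "mysql_config not found" error

With a default installation of MySQL on macOS, the mysql_config file is placed in /usr/local/mysql/bin. The "mysql_config not found" error can be fixed by adding that directory to the OS X environment variables.

Modifying the OS X environment variables:

Open a terminal, open ~/.bash_profile with vim, and add the following lines to it: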

PATH="/usr/local/mysql/bin:${PATH}"
export PATH
export DYLD_LIBRARY_PATH=/usr/local/mysql/lib/
export VERSIONER_PYTHON_PREFER_64_BIT=yes
export VERSIONER_PYTHON_PREFER_32_BIT=no

Apply the change:

source ~/.bash_profile

Run the installation again; the "mysql_config not found" error should now be gone:

pip install mysql-python
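
If the installation still fails, it can help to confirm that mysql_config is really reachable through PATH. The snippet below is a minimal sketch of such a check; the helper name find_on_path is made up for this illustration and is not part of pip or MySQL.

```python
import os


def find_on_path(program, path_value):
    """Return the first executable named `program` in a PATH-style string, or None."""
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, program)
        # Accept only regular files that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


# Check the PATH of the current shell session.
found = find_on_path("mysql_config", os.environ.get("PATH", ""))
print(found or "mysql_config is not on PATH")
```

Run it in a new terminal (or after source ~/.bash_profile) so that the updated PATH is the one being inspected.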

Installing Scrapy

pip install scrapy

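Getting started with Scrapy

This walkthrough covers the following tasks:
1. Create a Scrapy project
2. Define the Item to extract
3. Write a spider that crawls the site and extracts Items
4. Write an Item Pipeline that stores the extracted Items (that is, the data)

Scrapy is written in Python.

Creating a project

Before you start crawling, you must create a new Scrapy project. Change into the directory where you want to keep the code and run: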

scrapy startproject jianshu

This command creates a jianshu directory with the following contents:

jianshu/
    scrapy.cfg
    jianshu/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

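These files are:
scrapy.cfg: the project's configuration file
jianshu/: the project's Python module. You will add your code here later.
jianshu/items.py: the project's items file.
jianshu/pipelines.py: the project's pipelines file.
jianshu/settings.py: the project's settings file.
jianshu/spiders/: the directory that holds the spider code.

Defining an Item

An Item is a container for the scraped data. It is used much like a Python dict, and it adds a safeguard: assigning to a field that was never declared raises an error, which catches misspelled field names.
As with an ORM, you define an Item by creating a class that subclasses scrapy.Item and declaring class attributes of type scrapy.Field. (If you are not familiar with ORMs, don't worry; this step is very simple.)
Start by modeling the item on the data to be fetched from dmoz.org. We need each site's name, URL, and description, so the item declares a field for each of them. Edit the items.py file in the project's module directory: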

import scrapy

class DmozItem(scrapy.Item): 
    title = scrapy.Field() 
    link = scrapy.Field() 
    desc = scrapy.Field()
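
To make the "protection against misspelled fields" idea concrete, here is a small stand-in written with the standard library only. It is not Scrapy's implementation; the names StrictItem and DmozLikeItem are invented for this sketch. It only mimics the observable behavior that assigning to an undeclared field raises KeyError.

```python
class StrictItem(dict):
    """Dict that only accepts keys listed in `fields` (illustrative, not Scrapy's code)."""

    fields = ()

    def __setitem__(self, key, value):
        # Reject any key that was not declared up front.
        if key not in self.fields:
            raise KeyError("%s does not support field: %s" % (type(self).__name__, key))
        super(StrictItem, self).__setitem__(key, value)


class DmozLikeItem(StrictItem):
    # Same three fields as the DmozItem above.
    fields = ("title", "link", "desc")


item = DmozLikeItem()
item["title"] = "Example title"   # declared field: accepted

try:
    item["titel"] = "typo"        # misspelled field: rejected
except KeyError as exc:
    print("rejected:", exc)
```

This sketch only guards item[key] = value; unlike a full implementation it does not intercept dict.update or constructor arguments.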

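A class defined in pipelines.py only takes effect once settings.py references it

In pipelines.py:

class WebcrawlerScrapyPipeline(object):
    '''Pipeline class that saves items to the database.
    1. Register it in settings.py.
    2. Items yielded by your spider class are then passed to it automatically.'''

In settings.py (note that jianshu here is the project name):

ITEM_PIPELINES = {
    'jianshu.pipelines.WebcrawlerScrapyPipeline': 300,  # save to the MySQL database
}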
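
The pipeline class above contains only a docstring. As a rough sketch of what a storing pipeline could look like, the example below writes each item to SQLite, used here purely as a standard-library stand-in for MySQL. The class name SqliteStorePipeline, the table name articles, and its two columns are assumptions made for this illustration; process_item, open_spider, and close_spider are the hook names Scrapy calls on a pipeline.

```python
import sqlite3


class SqliteStorePipeline(object):
    """Hypothetical pipeline: stores items in SQLite as a stand-in for MySQL."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection, create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS articles (title TEXT, url TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields; returning the item
        # passes it on to any later pipeline.
        self.conn.execute(
            "INSERT INTO articles (title, url) VALUES (?, ?)",
            (item.get("title"), item.get("url")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()


# Drive the pipeline by hand, outside Scrapy, with a plain dict as the item.
pipeline = SqliteStorePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "Hello", "url": "http://example.com/p/1"}, spider=None)
rows = pipeline.conn.execute("SELECT title, url FROM articles").fetchall()
print(rows)
```

To enable a class like this in a project it would be listed in ITEM_PIPELINES the same way as above. The number is the run order: pipelines with lower values run first, and values are conventionally chosen between 0 and 1000.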

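Implementation: crawling the homepage's hot articles into a CSV file

Development tool: PyCharm

Project directory structure:

[Image: directory structure.png]

Core code [jianshuSpider.py]: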

# -*- coding: UTF-8 -*-

import scrapy
from jianshu.items import JianshuItem

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2


class Jianshu(scrapy.Spider):
    # A plain Spider is used because parse() is overridden here;
    # CrawlSpider reserves parse() for its own rule handling.
    name = 'jianshu'
    start_urls = ['http://www.jianshu.com/top/monthly']
    url = 'http://www.jianshu.com'

    def parse(self, response):
        articles = response.xpath('//ul[@class="article-list thumbnails"]/li')

        for article in articles:
            # Create a fresh item per article so yielded items do not share state.
            item = JianshuItem()
            title = article.xpath('div/h4/a/text()').extract()
            url = article.xpath('div/h4/a/@href').extract()
            author = article.xpath('div/p/a/text()').extract()

            # Download the thumbnail of every hot article; note that some articles have no image.
            try:
                image = article.xpath("a/img/@src").extract()
                urlretrieve(image[0], '/Users/wangshuang/Desktop/Documents/images/%s-%s.jpg' % (author[0], title[0]))
            except (IndexError, IOError, ValueError):
                print("--no---image--")

            listtop = article.xpath('div/div/a/text()').extract()
            likeNum = article.xpath('div/div/span/text()').extract()

            item['title'] = title
            # urljoin avoids a doubled slash when the href already starts with "/".
            item['url'] = response.urljoin(url[0])
            item['author'] = author

            item['readNum'] = listtop[0]
            # Some articles have comments disabled.
            try:
                item['commentNum'] = listtop[1]
            except IndexError:
                item['commentNum'] = ''
            item['likeNum'] = likeNum
            yield item

        next_link = response.xpath('//*[@id="list-container"]/div/button/@data-url').extract()

        if len(next_link) == 1:
            next_link = self.url + str(next_link[0])
            print("----" + next_link)
            yield scrapy.Request(next_link, callback=self.parse)
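
The spider has to turn the href values it extracts into absolute URLs. The sketch below shows, with the standard library only, why plain string concatenation is fragile and how urljoin handles both leading-slash and relative paths. The path /p/abc123 is a made-up example, and the assumption that hrefs on the page start with "/" is not confirmed by the original code.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

BASE = "http://www.jianshu.com"


def absolute_url(href, base=BASE):
    """Resolve an href found on the page against the site's base URL."""
    return urljoin(base, href)


# Concatenating with a hard-coded "/" doubles the slash if the href has one already.
naive = BASE + "/" + "/p/abc123"

# urljoin gives the same result with or without the leading slash.
joined = absolute_url("/p/abc123")
joined_relative = absolute_url("p/abc123")

print(naive)
print(joined)
print(joined_relative)
```

Inside a Scrapy callback, response.urljoin(href) does the same resolution against the URL of the page being parsed.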

Core code [items.py]:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item,Field

# Crawl the homepage's hot articles into a CSV file
class JianshuItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Declare the data to be parsed here in items.py, much like an entity class in Java:
    title = Field()
    author = Field()
    url = Field()
    readNum = Field()
    commentNum = Field()
    likeNum = Field()

Core code [pipelines.py]:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class JianshuPipeline(object):
    def process_item(self, item, spider):
        return item
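
The generated pipeline passes items through unchanged. Because extract() returns lists, fields such as title and author reach the CSV as list values. One possible clean-up step, sketched below, joins list-valued fields into a single trimmed string. The class name FlattenFieldsPipeline and the sample values are assumptions made for this example; it is not part of the original project.

```python
class FlattenFieldsPipeline(object):
    """Hypothetical pipeline: turns list-valued fields into single trimmed strings."""

    def process_item(self, item, spider):
        for key in list(item.keys()):
            value = item[key]
            if isinstance(value, (list, tuple)):
                # Strip each text fragment, drop empty ones, join the rest with a space.
                parts = [part.strip() for part in value if part.strip()]
                item[key] = " ".join(parts)
        return item


# Try it by hand with a plain dict standing in for a scraped item.
raw = {
    "title": [" Hello ", "Scrapy "],
    "author": ["someone"],
    "likeNum": [" 12"],
    "url": "http://www.jianshu.com/p/abc123",
}
clean = FlattenFieldsPipeline().process_item(raw, spider=None)
print(clean)
```

String fields such as url are left untouched, so the step is safe to run before a pipeline that stores the item.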

Core code [settings.py]:

# Scrapy settings for jianshu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jianshu'

SPIDER_MODULES = ['jianshu.spiders']
NEWSPIDER_MODULE = 'jianshu.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jianshu (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

FEED_URI=u'/Users/wangshuang/Desktop/Documents/jianshu-monthly.csv'
FEED_FORMAT='CSV'

# If the crawl logs "Ignoring response <403 http://www.jianshu.com/top/monthly>: HTTP status code is not handled or not allowed",
# the workaround used here is to disable Scrapy's built-in UserAgentMiddleware:
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
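
FEED_URI and FEED_FORMAT ask Scrapy to export every yielded item to a CSV file. To give an idea of the resulting file's shape, the sketch below writes dict-like items to CSV text by hand with the csv module. It is not Scrapy's exporter; the function name items_to_csv, the column order, and the sample row are assumptions for illustration, and the code assumes Python 3.

```python
import csv
import io

# Column order chosen for this sketch; it mirrors the fields of JianshuItem.
FIELDS = ["title", "author", "url", "readNum", "commentNum", "likeNum"]


def items_to_csv(items, fields=FIELDS):
    """Serialize dict-like items to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for item in items:
        # Missing fields become empty cells, e.g. when comments are disabled.
        writer.writerow({name: item.get(name, "") for name in fields})
    return buffer.getvalue()


sample = [{
    "title": "Hello",
    "author": "someone",
    "url": "http://www.jianshu.com/p/abc123",
    "readNum": "10",
    "commentNum": "",
    "likeNum": "3",
}]
text = items_to_csv(sample)
print(text)
```

In Scrapy itself the column order of the export can be fixed with the FEED_EXPORT_FIELDS setting.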

Core code [main.py]:

from scrapy import cmdline

# Runs the same command as typing "scrapy crawl jianshu" in the project root.
cmdline.execute("scrapy crawl jianshu".split())

Final result:

[Image: result.png]

Project repository:

https://github.com/wangxiaoda513/python_scrapy