I. [Using Python Scrapy] Create a new crawler project and scrape a single page

Scrapy official documentation

https://doc.scrapy.org/en/latest/

http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html

1. Install Scrapy

Terminal command:

pip3 install scrapy
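
To confirm that the install landed in the interpreter you intend to use, a small standard-library check like the following may help. This is only a sketch; the helper name is_installed is made up for illustration and is not part of Scrapy.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be imported by this interpreter."""
    # find_spec returns None when no top-level module of that name is found
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("scrapy installed:", is_installed("scrapy"))
```

If this prints False even though pip3 reported success, pip3 and the python you are running probably belong to different environments.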

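2. Create the project

Terminal commands:

scrapy startproject <projectname>
cd <projectname>                     # enter the project directory
scrapy genspider <spidername> <url domain>

<url domain> is the domain of the site you want to crawl.
startproject creates a folder named <projectname> in the current path. The rest of this post uses projectname=mySpider, spidername=epilepsy_spider and url domain=baidu.com as the example. The folder structure is shown in the figure: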

[Figure: Scrapy project directory structure]

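scrapy.cfg: project configuration. It mainly gives the Scrapy command-line tool its basic settings; the crawler-specific configuration lives in settings.py.

items.py: defines the objects that hold the scraped data.

pipelines.py: item processing, typically persisting the structured data.

settings.py: the settings file, e.g. crawl depth, concurrency, download delay.

spiders: the spider directory, where you create the spider files and write the crawl rules.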
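
As an illustration of the kind of options settings.py holds, the fragment below sets the three settings mentioned above. The setting names are Scrapy's; the values are arbitrary examples chosen for this sketch, not recommendations.

```python
# settings.py (fragment): the values below are illustrative only

# Maximum link depth to follow from the start URLs; 0 means no limit
DEPTH_LIMIT = 2

# Maximum number of requests Scrapy performs concurrently
CONCURRENT_REQUESTS = 8

# Minimum number of seconds to wait between consecutive requests to the same site
DOWNLOAD_DELAY = 1.0
```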

3. Write items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()

4. Write epilepsy_spider.py, which contains the start URL of the page to crawl and the XPath expression

# -*- coding: utf-8 -*-
import scrapy
from mySpider.items import MovieItem

class MeijuSpider(scrapy.Spider):
    name = 'epilepsy_spider'
    allowed_domains = ['baidu.com']
    start_urls = ['https://baike.baidu.com/medicine/disease/%E7%99%AB%E7%97%AB/1613?from=lemma']

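    def parse(self, response):
        # Select every <ul> directly under the element with id="medical_content";
        # extract() returns the matches as a list of HTML strings
        movies = response.xpath('//*[@id="medical_content"]/ul').extract()
        item = MovieItem()
        item['name'] = movies
        print(item['name'])  # debug output of the extracted fragments

        # Hand the item to the pipelines
        yield item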

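When creating the Spider project inside a PyCharm project, mark the Spider project's parent folder as a Sources Root so that imports such as mySpider.items resolve: right-click the folder -> Mark Directory as -> Sources Root.
XPath expressions are covered in the next post.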

5. Write pipelines.py to define where the scraped data is saved

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


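class MoviePipeline(object):
    def process_item(self, item, spider):
        # Append each extracted HTML fragment to a UTF-8 text file
        with open("./epilepsy.txt", "a", encoding="utf-8") as fp:
            for i in item['name']:
                fp.write(i + '\n')
        # Return the item so that any later pipeline still receives it
        return item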
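
As the comment generated at the top of the file notes, the pipeline only runs once it is listed in the ITEM_PIPELINES setting. A minimal sketch of that entry in settings.py, assuming the project is named mySpider as in this post (the priority 300 is just an example value):

```python
# settings.py (fragment)
# The key is the import path of the pipeline class; the value (0-1000)
# sets the order, and pipelines with lower values run first.
ITEM_PIPELINES = {
    "mySpider.pipelines.MoviePipeline": 300,
}
```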

6. Run the spider

Terminal command:

scrapy crawl <spidername>            # in this example: scrapy crawl epilepsy_spider
