The site we'll scrape: http://www.imooc.com/course/list?sort=pop
First, create a Scrapy project:
$ scrapy startproject mooc_subjects
New Scrapy project 'mooc_subjects', using template directory '/home/pit-yk/anaconda3/lib/python3.6/site-packages/scrapy/templates/project', created in:
/media/pit-yk/办公/python/codes/知乎专栏---Ehco/Scrapy/mooc_subjects
You can start your first spider with:
cd mooc_subjects
scrapy genspider example example.com
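Following that hint, the spider skeleton could also be generated from the command line (in this project I wrote spiders/MySpider.py by hand instead, which is why it shows up in the tree below):

$ cd mooc_subjects
$ scrapy genspider MySpider imooc.com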
$ tree
.
├── mooc_subjects
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   │   ├── __init__.cpython-36.pyc
│   │   ├── items.cpython-36.pyc
│   │   ├── pipelines.cpython-36.pyc
│   │   └── settings.cpython-36.pyc
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── MySpider.py
│       └── __pycache__
│           ├── __init__.cpython-36.pyc
│           └── MySpider.cpython-36.pyc
├── mooc_subjects.txt
└── scrapy.cfg

4 directories, 15 files
This is the tree after my program finished running (which is why the .pyc files and mooc_subjects.txt are already there). When I started building this scraper:
Step 1: browse the page's HTML first and see which pieces of information can be extracted.
From the markup, the information we can obtain is:
Course title: title
Course video link: url
Course image: image_url
Course introduction: introduction
Total number of students: student
That maps directly onto items.py:
$ cat items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class MoocSubjectsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # course title
    title = scrapy.Field()
    # course url
    url = scrapy.Field()
    # course cover image
    image_url = scrapy.Field()
    # course introduction
    introduction = scrapy.Field()
    # number of students
    student = scrapy.Field()
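As an aside, a scrapy.Item behaves much like a dict, which is exactly how the spider below uses it. A quick illustration in a plain Python session ('Python入门' is just a made-up title; this assumes the project package is importable):

from mooc_subjects.items import MoocSubjectsItem

item = MoocSubjectsItem()
item['title'] = 'Python入门'   # fields are set and read like dict keys
print(item['title'])
print(dict(item))              # an Item converts cleanly to a plain dict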
$ cat MySpider.py
#!/usr/bin/env python
# coding=utf-8
import scrapy
from mooc_subjects.items import MoocSubjectsItem


class MySpider(scrapy.Spider):
    name = "MySpider"
    # domain(s) the spider is allowed to crawl
    allowed_domains = ['imooc.com']
    # URL(s) to start crawling from
    start_urls = ["http://www.imooc.com/course/list"]

    # parsing method
    def parse(self, response):
        for box in response.xpath('//div[@class="course-card-container"]'):
            # instantiate a fresh container per course, so one item's
            # fields are never overwritten by the next iteration
            item = MoocSubjectsItem()
            item['url'] = 'http://www.imooc.com' + box.xpath('.//@href').extract()[0]
            item['image_url'] = box.xpath('.//@src').extract()[0]
            item['title'] = box.xpath('.//h3[@class="course-card-name"]/text()').extract()[0]
            item['introduction'] = box.xpath('.//p/text()').extract()[0]
            item['student'] = box.xpath('.//div[@class="course-card-info"]/span[2]/text()').extract()[0]
            # hand the item back to the engine
            yield item
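One fragility worth flagging before moving on: extract()[0] raises an IndexError whenever an XPath matches nothing, so a single malformed course card aborts the whole parse. A more defensive sketch of the same loop, using the selector API's extract_first() with a default:

for box in response.xpath('//div[@class="course-card-container"]'):
    item = MoocSubjectsItem()
    # extract_first() returns the default instead of raising when
    # the XPath matches nothing
    item['url'] = 'http://www.imooc.com' + box.xpath('.//@href').extract_first(default='')
    item['image_url'] = box.xpath('.//@src').extract_first(default='')
    item['title'] = box.xpath('.//h3[@class="course-card-name"]/text()').extract_first(default='')
    item['introduction'] = box.xpath('.//p/text()').extract_first(default='')
    item['student'] = box.xpath('.//div[@class="course-card-info"]/span[2]/text()').extract_first(default='')
    yield item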
Next, pipelines.py follows naturally:
$ cat pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os


class MoocSubjectsPipeline(object):
    def process_item(self, item, spider):
        # current working directory
        base_dir = os.getcwd()
        # where to store the output file
        filename = base_dir + '/mooc_subjects.txt'
        # open the file in append mode and write one record per item
        with open(filename, 'a') as f:
            f.write(item['title'] + '\t')
            f.write(item['student'] + '人学习' + '\n')
            f.write(item['introduction'] + '\n')
            f.write(item['url'] + '\n\n')
        # TODO: download the course image
        return item
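That image-download TODO could be filled with Scrapy's built-in ImagesPipeline. A minimal sketch, assuming Pillow is installed; MoocImagesPipeline is a hypothetical class name, while get_media_requests is the standard ImagesPipeline hook:

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class MoocImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # our item carries a single image URL, so yield one request for it
        yield scrapy.Request(item['image_url'])

To activate it, add it to ITEM_PIPELINES next to the text pipeline and set IMAGES_STORE in settings.py to the directory where the images should land.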
Last, but most important: the settings file. Configure it to your own needs; for this project, all we really need is:
ITEM_PIPELINES = {
    'mooc_subjects.pipelines.MoocSubjectsPipeline': 300,
}
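With the pipeline registered, the spider is started from the project root; the name after crawl is the name attribute defined in MySpider.py:

$ scrapy crawl MySpider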
And then... that's it: we get the scraped data in mooc_subjects.txt.
But later something felt missing: apparently only the first page was crawled. What if I want to crawl more pages?
Method 1:
# In MySpider.py, replace the original start_urls = [] with the
# following, so we can crawl exactly the pages we specify
start_urls = []
# build the list of pages to crawl with a simple loop;
# here we crawl pages 1 through 5
for i in range(1, 6):
    start_urls.append('http://www.imooc.com/course/list?page=' + str(i))
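Equivalently, the whole list can be built in one line with a list comprehension:

start_urls = ['http://www.imooc.com/course/list?page=' + str(i) for i in range(1, 6)]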
Method 2: follow the "下一页" (next page) link from inside parse, so the spider keeps going until there is no next page.

def parse(self, response):
    ......
    # URL following starts here
    # grab the href of the next-page link
    url = response.xpath("//a[contains(text(),'下一页')]/@href").extract()
    if url:
        # assemble the full URL of the next page
        page = 'http://www.imooc.com' + url[0]
        # request it, to be parsed by this same method
        yield scrapy.Request(page, callback=self.parse)
    # URL following ends here
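Putting the two pieces together, the complete parse method would look roughly like this (a sketch that simply fills the "......" with the extraction loop from MySpider.py above):

def parse(self, response):
    # extract every course card on the current page
    for box in response.xpath('//div[@class="course-card-container"]'):
        item = MoocSubjectsItem()
        item['url'] = 'http://www.imooc.com' + box.xpath('.//@href').extract()[0]
        item['image_url'] = box.xpath('.//@src').extract()[0]
        item['title'] = box.xpath('.//h3[@class="course-card-name"]/text()').extract()[0]
        item['introduction'] = box.xpath('.//p/text()').extract()[0]
        item['student'] = box.xpath('.//div[@class="course-card-info"]/span[2]/text()').extract()[0]
        yield item
    # then follow the next-page link, if any
    url = response.xpath("//a[contains(text(),'下一页')]/@href").extract()
    if url:
        yield scrapy.Request('http://www.imooc.com' + url[0], callback=self.parse)

On the last page the XPath matches nothing, so no further request is yielded and the crawl ends on its own; Scrapy's default duplicate filter also keeps the same page from being fetched twice if this is combined with Method 1's page list.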