Scraping College News with Scrapy

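Collection Strategy

Task: collect all news items from the School of Public Administration at Sichuan University.
Strategy: analyze the pages first. The news list page only links to the articles, so the crawler has to follow each link from the news list into the news detail page to get the actual content.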

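Collection Workflow

Analyze the pages, decide what to collect, and name the entities.
For each entity, locate the matching tags on the page and define the extraction rules.
Create a new project named ggnews.
Define items: one item holding the four entities.
Write the spider code in ggnews_1.py.
Run the crawler.
Save the result as JSON or XML.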

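1. Determine the Collection Target

Start from the official site of the School of Public Administration at Sichuan University. The home page does not expose all of the news, so click "more" to reach the news list.

[Image: School of Public Administration home page]

The first page of the news list shows only the titles and dates of 12 news items; the full text of each item is available only after clicking through.

[Image: first page of the news list]

[Image: news detail page]

The news detail page contains everything we need: title, publication time, images, and body text. This gives four entities: title, time, img, and content.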

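2. Define the Extraction Rules

As found above, the body of a news item can only be scraped after following its link from the news list into the news detail page. To collect every news item, the crawler therefore has to collect all of the news links from the news list first, then visit each detail link and extract the content there.
Scraping the news list page:
Inspect the first page of the news list and use the browser's developer tools to locate the tag that holds each link.

[Image: link tag]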

response.xpath(
    "//ul[@class='newsinfo_list_ul mobile_dn']/li/div/div[@class='news_c fr']/h3/a/@href")

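To make the link rule concrete, here is a minimal sketch that applies the same path to a small sample. It is not the Scrapy selector: it swaps in the standard library's ElementTree, which supports only a subset of XPath (hence the leading ".//" and the separate read of the href attribute). The markup and the href values are hypothetical stand-ins for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the news list markup.
# It is written as well-formed XML so that ElementTree can parse it.
SAMPLE = """
<html><body>
<ul class="newsinfo_list_ul mobile_dn">
  <li><div><div class="news_c fr"><h3><a href="index.php?c=article&amp;id=1">News one</a></h3></div></div></li>
  <li><div><div class="news_c fr"><h3><a href="index.php?c=article&amp;id=2">News two</a></h3></div></div></li>
</ul>
</body></html>
"""


def extract_links(markup):
    """Return the href of every news link matched by the list-page rule."""
    root = ET.fromstring(markup.strip())
    # Same structure as the Scrapy XPath, minus the trailing /@href step.
    path = ".//ul[@class='newsinfo_list_ul mobile_dn']/li/div/div[@class='news_c fr']/h3/a"
    return [a.get("href") for a in root.findall(path)]


links = extract_links(SAMPLE)
print(links)
```

In the spider itself, the relative hrefs returned by the rule still have to be turned into absolute URLs before they can be requested.
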
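Because the news list is paginated, we also need to work out how the next page is represented.

[Image: next-page button]

[Image: next-page button 2]

The pattern appears to be that the second-to-last entry in the pager is the "next page" link, so following it repeatedly walks through every page.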

next_page = response.xpath(
    "//div[@class='pager cf tr pt10 pb10 mt30 mobile_dn']/li[last()-1]/a/@href").extract_first()

if next_page is not None:
    next_pages = response.urljoin(next_page)
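
The "follow next until there is none" idea can be sketched without Scrapy. The example below uses urllib.parse.urljoin to resolve relative links against the current page URL, which is the same kind of resolution response.urljoin performs. The FAKE_NEXT table and the "page=" query parameter are hypothetical; the real site's next-page href format was not checked.

```python
from urllib.parse import urljoin

START = "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1"

# Hypothetical page graph: each page URL maps to the relative href of its
# "next page" link, or None on the last page.
FAKE_NEXT = {
    START: "index.php?c=special&sid=1&page=2",
    START + "&page=2": "index.php?c=special&sid=1&page=3",
    START + "&page=3": None,
}


def walk_pages(start_url, next_href_of):
    """Follow next-page links from start_url and return the pages visited."""
    visited = []
    url = start_url
    # Stop when there is no next link, or when a link leads back to a page
    # that was already seen (guards against a pager that loops).
    while url is not None and url not in visited:
        visited.append(url)
        href = next_href_of.get(url)
        url = urljoin(url, href) if href is not None else None
    return visited


pages = walk_pages(START, FAKE_NEXT)
print(len(pages), pages[-1])
```

In a real spider the explicit "already visited" check is usually unnecessary, since Scrapy filters duplicate requests by default.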

News detail page:
Locate the four entities:

[Image: title and time]

item['time'] = response.xpath('//div[@class="detail_zy_title"]/p/text()').extract()
item['title'] = response.xpath('//div[@class="detail_zy_title"]/h1/text()').extract()

[Image: content]

item['content'] = response.xpath("//div[@class='detail_zy_c pb30 mb30']/p/span/text()").extract()

[Image: images]

item['img'] = response.xpath('//div/img/@src').extract()

First collect every news link from the news list pages, then follow each link to collect the four entities of that news item.
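
Since extract() returns a list of strings for every field, the raw items tend to need a small cleanup step before they are useful. The helper below is a hypothetical sketch of such a step (clean_item is not part of Scrapy or of the original spider): it takes the first match for title and time, joins the content fragments, and drops empty image entries. The sample values are made up.

```python
def clean_item(raw):
    """Collapse the lists returned by extract() into plain values."""

    def first(values):
        # First match with surrounding whitespace removed, or "" if none.
        return values[0].strip() if values else ""

    return {
        "title": first(raw.get("title", [])),
        "time": first(raw.get("time", [])),
        # Join non-empty text fragments into one body, one fragment per line.
        "content": "\n".join(s.strip() for s in raw.get("content", []) if s.strip()),
        "img": [src for src in raw.get("img", []) if src],
    }


# Made-up sample shaped like the output of the four XPath rules.
sample = {
    "title": ["  Sample headline  "],
    "time": ["2017-05-01"],
    "content": ["First paragraph. ", "  ", "Second paragraph."],
    "img": ["/uploads/a.jpg"],
}
cleaned = clean_item(sample)
print(cleaned)
```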

3. Writing the Code Locally

Create a new project

scrapy startproject ggnews
cd ggnews/ggnews

Define items: declare the entity fields on the item

import scrapy


class GgnewsItem(scrapy.Item):
    # The four entities collected from each news detail page.
    title = scrapy.Field()
    time = scrapy.Field()
    content = scrapy.Field()
    img = scrapy.Field()

ggnews_1.py

import scrapy

from ggnews.items import GgnewsItem


class GgnewsSpider(scrapy.Spider):
    name = "ggnews"
    # Entry point: the first page of the news list.
    start_urls = [
        'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1',
    ]

    def parse(self, response):
        # Follow every news link on the current list page.
        for href in response.xpath(
                "//ul[@class='newsinfo_list_ul mobile_dn']/li/div/div[@class='news_c fr']/h3/a/@href"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_details)

        # The second-to-last pager entry holds the "next page" link.
        next_page = response.xpath(
            "//div[@class='pager cf tr pt10 pb10 mt30 mobile_dn']/li[last()-1]/a/@href").extract_first()
        if next_page is not None:
            next_pages = response.urljoin(next_page)
            yield scrapy.Request(next_pages, callback=self.parse)

    def parse_details(self, response):
        # Extract the four entities from a news detail page.
        item = GgnewsItem()
        item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract()
        item['time'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract()
        item['content'] = response.xpath("//div[@class='detail_zy_c pb30 mb30']/p/span/text()").extract()
        item['img'] = response.xpath('//div/img/@src').extract()
        yield item

Run the crawler

scrapy crawl ggnews -o ggnews.xml

[Image: crawler run]

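Analysis of the Final Result

The run produced an XML file containing the title, body text, image URLs, and publication time of the news items. However, the text came out garbled, which probably means some encoding conversion needs to be added to the code. Two items also raised errors, and there was no time to analyze them.

[Image: garbled result]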
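
One thing that may be worth checking for the garbled text is the export encoding. The fragment below is a sketch of a line for the project's settings.py, using Scrapy's FEED_EXPORT_ENCODING setting. Whether it resolves the garbling seen here was not verified; the cause may instead lie in how the output file is opened or viewed.

```python
# settings.py (fragment)
# Assumption: the garbling comes from the encoding of the exported feed.
# FEED_EXPORT_ENCODING is a Scrapy setting that controls the encoding
# used when items are written with -o.
FEED_EXPORT_ENCODING = "utf-8"
```

After changing the setting, rerunning scrapy crawl ggnews -o ggnews.xml and opening the file in an editor set to UTF-8 would show whether the text is readable.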