Scraping arxiv.org papers with Scrapy

A classmate and I want to build a site for searching arxiv.org papers; this is a demo.
Github地址:https://github.com/Joovo/Arxiv

After putting this post off for far too long, here is the catch-up. The Scrapy techniques covered:

  • use scrapy shell to check that an XPath expression is correct
  • response.xpath().extract() converts the matches to a list of strings
  • str.strip() cleans the extracted data
  • getting all the text under an XPath node's children
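
The cleaning steps above can be shown without a live response. The sample strings below are hypothetical, and xml.etree stands in here for what string(...) does in XPath:

```python
import xml.etree.ElementTree as ET

# Hypothetical extract() result: whitespace noise around the real text
raw = ['\n      ', '  Deep Learning for X  ', '']

# str.strip() removes surrounding whitespace, then empty strings are dropped
cleaned = [s.strip() for s in raw]
cleaned = [s for s in cleaned if s != '']
print(cleaned)  # ['Deep Learning for X']

# Collecting all text under a node, which is what string(...) does in XPath:
node = ET.fromstring('<dd><span>Subjects: </span><i>cs.CV</i></dd>')
print(''.join(node.itertext()))  # 'Subjects: cs.CV'
```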

arxiv.org itself is easy to crawl by constructing URLs directly: each listing URL is built from a year-month stamp plus the number of entries to show per page.
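That URL construction could be sketched like this; the month-rollover logic and the helper name are my own assumptions based on the yymm listing format, not code from the repo:

```python
# Build arxiv listing URLs for a range of yymm stamps (e.g. 1801 = Jan 2018).
def listing_urls(category, start_ym, end_ym, show=1000):
    urls = []
    ym = start_ym
    while ym <= end_ym:
        urls.append('https://arxiv.org/list/%s/%04d?show=%d' % (category, ym, show))
        # advance one month, rolling month 12 over to next year's 01
        ym = ym + 1 if ym % 100 < 12 else (ym // 100 + 1) * 100 + 1
    return urls

print(listing_urls('cs.CV', 1811, 1901))
```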

python3 -m scrapy startproject Arxiv
cd Arxiv
# quick start a simple spider
scrapy genspider arxiv arxiv.org

# how to crawl 
scrapy crawl arxiv

With the basic skeleton in place, edit items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class ArxivItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    authors = scrapy.Field()
    comments = scrapy.Field()
    subjects = scrapy.Field()

Next, edit pipelines.py to write each scraped item to a JSON-lines file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class ArxivPipeline(object):
    def __init__(self):
        # text mode with explicit UTF-8 so JSON strings can be written directly
        self.file = open('./items.json', 'a+', encoding='utf-8')

    def process_item(self, item, spider):
        # one JSON object per line (JSON Lines format)
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
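
As the boilerplate comment says, the pipeline only runs if it is registered in settings.py; a minimal entry might look like this (300 is an arbitrary priority between 0 and 1000):

```python
# in Arxiv/settings.py
ITEM_PIPELINES = {
    'Arxiv.pipelines.ArxivPipeline': 300,
}
```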

Create ./spiders/Arxiv.py

Arxiv.py subclasses scrapy.Spider (there are a few other base classes worth looking up in the docs); the method to implement here is parse.

  • Import the Item defined earlier to hold the parsed page data; Item objects are what "flow through" the framework internally.
  • parse is a generator that yields either an item or a new Request.
  • Yielded Requests are added to the scheduling queue and crawled later.
# -*- coding: utf-8 -*-
import scrapy
from Arxiv.items import *
import re


class ArxivSpider(scrapy.Spider):
    name = 'arxiv'
    allowed_domains = ['arxiv.org']
    start_urls = ['https://arxiv.org/list/cs.CV/1801?show=1000']

    def parse(self, response):
        self.logger.info('A response from %s just arrived' % response.url)
        # get num line
        num = response.xpath('//*[@id="dlpage"]/small[1]/text()[1]').extract()[0]
        # get max_index
        max_index = int(re.search(r'\d+', num).group(0))
        for index in range(1, max_index + 1):
            item = ArxivItem()
            # get title and clean data
            title = response.xpath('//*[@id="dlpage"]/dl/dd[' + str(index) + ']/div/div[1]/text()').extract()
            # remove blank char
            title = [i.strip() for i in title]
            # remove blank str
            title = [i for i in title if i != '']
            # insert title
            item['title'] = title[0]

            # authors: join the text of every <a> under the second div
            xpath_authors = '//*[@id="dlpage"]/dl/dd[' + str(index) + ']/div/div[2]//a/text()'
            author_list = response.xpath(xpath_authors).getall()
            item['authors'] = ', '.join(author_list)

            item['subjects'] = response.xpath(
                'string(//*[@id="dlpage"]/dl/dd[' + str(index) + ']/div/div[5]/span[2])').extract_first()
            
            yield item
        # the next request points at 1802; turn this into a loop over months to crawl everything
        yield scrapy.Request('https://arxiv.org/list/cs.CV/1802?show=1000', callback=self.parse)

items.json

{"title": "Deep Reinforcement Learning for Unsupervised Video Summarization with  Diversity-Representativeness Reward", "authors": "Kaiyang Zhou, Kaiyang Zhou, Kaiyang Zhou", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Deformable GANs for Pose-based Human Image Generation", "authors": "Aliaksandr Siarohin, Aliaksandr Siarohin, Aliaksandr Siarohin, Aliaksandr Siarohin", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Face Synthesis from Visual Attributes via Sketch using Conditional VAEs  and GANs", "authors": "Xing Di, Xing Di", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "A PDE-based log-agnostic illumination correction algorithm", "authors": "U. A. Nnolim", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "A Real-time and Registration-free Framework for Dynamic Shape  Instantiation", "authors": "Xiao-Yun Zhou, Xiao-Yun Zhou, Xiao-Yun Zhou", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Fractional Local Neighborhood Intensity Pattern for Image Retrieval  using Genetic Algorithm", "authors": "Avirup Bhattacharyya, Avirup Bhattacharyya, Avirup Bhattacharyya, Avirup Bhattacharyya, Avirup Bhattacharyya", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "A Unified Method for First and Third Person Action Recognition", "authors": "Ali Javidani, Ali Javidani", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Integrating semi-supervised label propagation and random forests for  multi-atlas based hippocampus segmentation", "authors": "Qiang Zheng, Qiang Zheng", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Transfer learning for diagnosis of congenital abnormalities of the  kidney and urinary tract in children based on Ultrasound imaging data", "authors": "Qiang Zheng, Qiang Zheng, Qiang Zheng", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Context aware saliency map generation using semantic segmentation", "authors": "Mahdi Ahmadi, Mahdi Ahmadi, Mahdi Ahmadi, Mahdi Ahmadi", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}

The site has changed since this was written and the code has bit-rotted. As reader @一念逍遥、 pointed out, the authors extraction needed a fix; that has been corrected, and the rest is left unchanged.
