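Some classmates and I want to build a website for searching arxiv.org papers; this is a demo.
GitHub: https://github.com/Joovo/Arxiv
I put this off for a long time, so here is the overdue blog post. The scrapy steps:
arxiv.org is fairly easy to crawl because the listing URL can be constructed directly: it is built from a year-month stamp and the number of entries to show per page.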
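As a rough illustration of the URL construction described above, here is a minimal sketch. The helper name `build_list_url` and its parameters are hypothetical (they do not appear in the project); the URL pattern is simply the one used in the spider's `start_urls`, and arxiv.org may have changed its layout since.

```python
def build_list_url(category: str, year: int, month: int, show: int = 1000) -> str:
    """Build a listing URL such as https://arxiv.org/list/cs.CV/1801?show=1000."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    # Two-digit year followed by two-digit month, e.g. (2018, 1) -> "1801"
    stamp = f"{year % 100:02d}{month:02d}"
    return f"https://arxiv.org/list/{category}/{stamp}?show={show}"


print(build_list_url("cs.CV", 2018, 1))
# https://arxiv.org/list/cs.CV/1801?show=1000
```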
python3 -m scrapy startproject Arxiv
cd Arxiv
# quick start a simple spider
scrapy genspider arxiv arxiv.org
# how to crawl
scrapy crawl arxiv
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class ArxivItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    authors = scrapy.Field()
    comments = scrapy.Field()
    subjects = scrapy.Field()
pipelines.py, used for downloading:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
class ArxivPipeline(object):
    def __init__(self):
        # open in text mode with an explicit encoding so str can be written directly
        self.file = open('./items.json', 'a+', encoding='utf-8')

    def process_item(self, item, spider):
        # write one JSON object per line
        content = json.dumps(dict(item)) + "\n"
        self.file.write(content)
        return item
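The pipeline above never closes its file. One possible variant, sketched here under assumptions, opens and closes the file in Scrapy's `open_spider` and `close_spider` pipeline hooks. The class name `JsonLinesPipeline` and the `path` parameter are hypothetical and not part of the project.

```python
import json


class JsonLinesPipeline:
    """Hypothetical variant of ArxivPipeline that closes its output file."""

    def __init__(self, path="./items.json"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called by Scrapy when the spider starts
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # One JSON object per line
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```

Either pipeline only runs if it is listed in the `ITEM_PIPELINES` setting in settings.py, as the generated comment in pipelines.py reminds you.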
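./spider/Arxiv.py
The spider in Arxiv.py subclasses scrapy.Spider. There are several other base classes you can inherit from, which are worth looking up in the documentation. What has to be implemented here is the parse method, which parses the page. What "circulates" inside the framework is the Item class: parse works as a generator and yields either an item object or a request, and each request is added to the queue to wait for processing.
# -*- coding: utf-8 -*-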
import scrapy
from Arxiv.items import ArxivItem
import re


class ArxivSpider(scrapy.Spider):
    name = 'arxiv'
    allowed_domains = ['arxiv.org']
    start_urls = ['https://arxiv.org/list/cs.CV/1801?show=1000']

    def parse(self, response):
        self.logger.info('A response from %s just arrived' % response.url)
        # get num line
        num = response.xpath('//*[@id="dlpage"]/small[1]/text()[1]').extract()[0]
        # get max_index
        max_index = int(re.search(r'\d+', num).group(0))
        for index in range(1, max_index + 1):
            item = ArxivItem()
            # get title and clean data
            title = response.xpath('//*[@id="dlpage"]/dl/dd[' + str(index) + ']/div/div[1]/text()').extract()
            # remove blank char
            title = [i.strip() for i in title]
            # remove blank str (compare by value, not identity)
            title = [i for i in title if i != '']
            # insert title
            item['title'] = title[0]
            # authors' father node
            xpath_fa = '//*[@id="dlpage"]/dl/dd[' + str(index) + ']/div/div[2]//a/text()'
            author_list = response.xpath(xpath_fa).getall()
            # join the names with a separator
            authors = ', '.join(author_list)
            item['authors'] = authors
            # subjects of the current entry (use index rather than a fixed entry number)
            item['subjects'] = response.xpath('string(//*[@id="dlpage"]/dl/dd[' + str(index) + ']/div/div[5]/span[2])').extract_first()
            yield item
        # The next URL here points to 1802; change this into a loop to crawl everything
        yield scrapy.Request('https://arxiv.org/list/cs.CV/1802?show=1000', callback=self.parse)
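The closing comment in parse suggests looping over months instead of hard-coding 1802. A minimal sketch of that idea follows; the helper `month_stamps` and the date range are hypothetical, and the URL pattern is just the one already used in `start_urls`. The resulting list could be assigned to the spider's `start_urls`, which would make the final hard-coded request unnecessary.

```python
def month_stamps(start_year, start_month, end_year, end_month):
    """Yield 'YYMM' stamps from the start month to the end month, inclusive."""
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        yield f"{year % 100:02d}{month:02d}"
        month += 1
        if month > 12:
            # roll over to January of the next year
            year, month = year + 1, 1


# Assumed range for illustration: January to March 2018
start_urls = [
    f"https://arxiv.org/list/cs.CV/{stamp}?show=1000"
    for stamp in month_stamps(2018, 1, 2018, 3)
]
print(start_urls)
```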
{"title": "Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward", "authors": "Kaiyang Zhou, Kaiyang Zhou, Kaiyang Zhou", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Deformable GANs for Pose-based Human Image Generation", "authors": "Aliaksandr Siarohin, Aliaksandr Siarohin, Aliaksandr Siarohin, Aliaksandr Siarohin", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Face Synthesis from Visual Attributes via Sketch using Conditional VAEs and GANs", "authors": "Xing Di, Xing Di", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "A PDE-based log-agnostic illumination correction algorithm", "authors": "U. A. Nnolim", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "A Real-time and Registration-free Framework for Dynamic Shape Instantiation", "authors": "Xiao-Yun Zhou, Xiao-Yun Zhou, Xiao-Yun Zhou", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Fractional Local Neighborhood Intensity Pattern for Image Retrieval using Genetic Algorithm", "authors": "Avirup Bhattacharyya, Avirup Bhattacharyya, Avirup Bhattacharyya, Avirup Bhattacharyya, Avirup Bhattacharyya", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "A Unified Method for First and Third Person Action Recognition", "authors": "Ali Javidani, Ali Javidani", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Integrating semi-supervised label propagation and random forests for multi-atlas based hippocampus segmentation", "authors": "Qiang Zheng, Qiang Zheng", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Transfer learning for diagnosis of congenital abnormalities of the kidney and urinary tract in children based on Ultrasound imaging data", "authors": "Qiang Zheng, Qiang Zheng, Qiang Zheng", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Context aware saliency map generation using semantic segmentation", "authors": "Mahdi Ahmadi, Mahdi Ahmadi, Mahdi Ahmadi, Mahdi Ahmadi", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
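Since the output is one JSON object per line, it can be read back for the search site with a few lines of standard-library code. This is only a sketch: `load_items` and `search_titles` are hypothetical helpers, not part of the project, and the keyword match is a plain case-insensitive substring test.

```python
import json


def load_items(path):
    """Read one JSON object per line, skipping blank lines."""
    items = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                items.append(json.loads(line))
    return items


def search_titles(items, keyword):
    """Return the items whose title contains keyword, ignoring case."""
    keyword = keyword.lower()
    return [it for it in items if keyword in it.get("title", "").lower()]
```

For example, `search_titles(load_items("./items.json"), "gan")` would return the entries whose titles mention GANs.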
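As the site has been updated over time, this code has fallen out of date. Reader @一念逍遥、 pointed out that the authors part needed correcting; that issue has been fixed, and the rest is left unchanged.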