Arxiv Paper Crawler

Table of Contents

  • readme
    • Arxiv Interesting Papers Crawler
      • Description:
      • The time range of the paper downloading:
      • The mode of the downloading:
      • The root of the downloading:
      • The domain of the downloading:
      • The customized keywords:
      • The customized keywords conference
  • Code

readme

Arxiv Interesting Papers Crawler

Description:

A customized personal paper downloader for arxiv:
users can download the papers they are interested in.

The time range of the paper downloading:

The user can choose to download the papers of interest daily, from the past week,
or from a certain month of a certain year (e.g. 2020.04). query_word: 'recent' (daily),
'pastweek' (past week), '2004' (2020.04).

The mode of the downloading:

If the user chooses to download the papers of interest daily, the mode must be 'daily'; otherwise,
the mode must be 'all'. query_mode: 'all', 'daily'.

The root of the downloading:

download_root_dir: e.g. ‘/Users/zhangzilong/Desktop/arxiv/’ (Need to reset)

The domain of the downloading:

domain: Computer Vision and Pattern Recognition 'cs.CV/', Machine Learning 'cs.LG/'. Default: 'cs.CV/'.
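
As the constructor in the Code section shows, the listing URL is built by concatenating the base URL, the domain, and the query word. A minimal sketch (the function name here is illustrative):

def build_listing_url(domain: str = 'cs.CV/', query_word: str = 'recent') -> str:
    # Mirrors self.domain_url in main_arxiv.__init__ below.
    return 'https://arxiv.org/' + 'list/' + domain + query_word

# build_listing_url('cs.CV/', 'recent') -> 'https://arxiv.org/list/cs.CV/recent'
# build_listing_url('cs.LG/', '2004')   -> 'https://arxiv.org/list/cs.LG/2004'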

The customized keywords:

The keywords the user is interested in, matched against the title of the paper. key_words: 'self-supervised', 'contrastive learning', 'anomaly detection',
'novelty detection', 'representation learning', 'out-of-distribution'. (Need to reset)

The customized keywords conference

The conference keywords, matched against the comments field of the submitted paper.
key_words_conference: 'ICLR', 'CVPR', 'ICML', 'ICCV'.


Usage: to download the papers of interest daily, run:

python3 main_arxiv.py recent daily
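
The command above implies that query_word and query_mode come from the command line. A minimal, hypothetical entry point consistent with that usage (the source below only shows the constructor, so this argument handling is an assumption):

import sys

if __name__ == '__main__':
    # sys.argv[1]: query word ('recent', 'pastweek', or e.g. '2004');
    # sys.argv[2]: query mode ('daily' or 'all'). Both assumed here.
    query_word = sys.argv[1] if len(sys.argv) > 1 else 'recent'
    query_mode = sys.argv[2] if len(sys.argv) > 2 else 'daily'
    main_arxiv(query_word, query_mode=query_mode)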

Code

#!/usr/bin/python
# -*- coding:utf-8 -*-
import urllib.parse
import urllib.request
import lxml
from bs4 import BeautifulSoup
import re
import ssl
import time
import os
"""
1. add comments

2. add daily
"""


class main_arxiv(object):

    def __init__(self, query_word: str, domain='cs.CV/', query_mode='all',
                 key_words=['deep learning'],     # keywords of interest (need to reset)
                 key_words_conference=['PST'],    # conferences or journals (need to reset)
                 download_root_dir=r'D:\data\Arxiv-paper-crawler-daily-master\results'):   # save path for the crawled files (need to reset)
        """query_word: month_year, recent, pastweek"""
        self.original_url = 'https://arxiv.org/'
        self.domain_url = self.original_url + 'list/' + domain + query_word
        assert 'all' in query_mode or 'daily' in query_mode, 'please input a correct query mode (all, daily)'
        self.query_mode = query_mode
        self.headers = {
            'User-Agent':
                'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
        }
        self.key_words = key_words
        self.key_words_conference = key_words_conference
        self.root_dir = download_root_dir
        current_time = time.strftime("%Y-%m-%d", time.localtime())  # date format assumed; the source listing is cut off at this line
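
The source listing ends inside __init__. For completeness, here is a minimal, hypothetical sketch of the fetch-and-filter step such a crawler could perform with the imports above. The function name, the HTML selector, and the filtering logic are assumptions, not the author's implementation; classic arxiv /list pages wrap each paper title in a div with class 'list-title':

def fetch_matching_titles(url, headers, key_words):
    # Download the listing page and parse it with BeautifulSoup + lxml.
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        soup = BeautifulSoup(response.read(), 'lxml')
    # Collect the paper titles and keep those containing any key word.
    titles = [div.get_text(strip=True).replace('Title:', '').strip()
              for div in soup.find_all('div', class_='list-title')]
    return [t for t in titles if any(kw.lower() in t.lower() for kw in key_words)]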
