Customized paper downloading from arXiv:
users can download the papers they are interested in.
The user can choose to download papers of interest daily, from the past week,
or from a given month of a given year (e.g. 2020.04). query_word: ‘recent’ (daily),
‘pastweek’ (past week), ‘2004’ (2020.04, i.e. two-digit year followed by month).
If the user chooses to download papers daily, the mode must be ‘daily’; otherwise
the mode must be ‘all’. query_mode: ‘all’, ‘daily’.
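The pairing rule above can be sketched as a small check. This is only an illustration: the helper name check_query and the YYMM pattern are assumptions, not part of the script.

```python
import re


def check_query(query_word: str, query_mode: str) -> None:
    """Raise ValueError when query_word and query_mode do not fit together.

    Hypothetical helper: the name and the exact checks are assumptions
    based on the description above, not code taken from main_arxiv.py.
    """
    # A month query is assumed to be YYMM, e.g. '2004' for 2020.04.
    is_month = re.fullmatch(r"\d{2}(0[1-9]|1[0-2])", query_word) is not None
    if query_word not in ("recent", "pastweek") and not is_month:
        raise ValueError(f"unknown query_word: {query_word!r}")
    if query_mode not in ("all", "daily"):
        raise ValueError(f"unknown query_mode: {query_mode!r}")
    # 'recent' (daily download) requires 'daily'; everything else requires 'all'.
    expected = "daily" if query_word == "recent" else "all"
    if query_mode != expected:
        raise ValueError(
            f"query_word {query_word!r} requires query_mode {expected!r}"
        )
```

Calling check_query('recent', 'daily') or check_query('2004', 'all') returns silently, while check_query('recent', 'all') raises ValueError.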
download_root_dir: e.g. ‘/Users/zhangzilong/Desktop/arxiv/’ (must be changed to a directory on your machine).
domain: Computer Vision and Pattern Recognition ‘cs.CV/’, Machine Learning ‘cs.LG/’. Default: ‘cs.CV/’.
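The script's constructor joins the site root, 'list/', the domain and the query word into the listing URL. A minimal sketch of that step follows; the function name build_list_url and the trailing-slash handling are assumptions added for illustration.

```python
ARXIV_URL = "https://arxiv.org/"


def build_list_url(query_word: str, domain: str = "cs.CV/") -> str:
    """Return the arXiv listing URL for a domain and query word.

    Hypothetical helper mirroring the concatenation in the constructor.
    """
    # Assumption: tolerate a domain given without the trailing slash.
    if not domain.endswith("/"):
        domain += "/"
    return ARXIV_URL + "list/" + domain + query_word


print(build_list_url("recent"))         # https://arxiv.org/list/cs.CV/recent
print(build_list_url("2004", "cs.LG"))  # https://arxiv.org/list/cs.LG/2004
```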
key_words: keywords of interest, matched against the paper title, e.g. ‘self-supervised’, ‘contrastive learning’, ‘anomaly detection’,
‘novelty detection’, ‘representation learning’, ‘out-of-distribution’. (Must be changed to your own keywords.)
key_words_conference: conference names, matched against the comments field of the submitted paper,
e.g. ‘ICLR’, ‘CVPR’, ‘ICML’, ‘ICCV’.
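The two filters can be pictured as simple substring tests, one on the title and one on the comments. The sketch below is an assumption about how such matching might work: the helper matches_any, the case-insensitive comparison, and the sample title and comment strings are all made up for illustration, and how the script combines the two results is not specified here.

```python
from typing import Iterable, Optional


def matches_any(text: Optional[str], words: Iterable[str]) -> bool:
    """Return True if any of the words occurs in text, ignoring case.

    Hypothetical helper; a missing field (None or '') never matches.
    """
    if not text:
        return False
    lowered = text.lower()
    return any(word.lower() in lowered for word in words)


key_words = ["self-supervised", "anomaly detection"]
key_words_conference = ["ICLR", "CVPR", "ICML", "ICCV"]

# Made-up title and comments, used only to exercise the helper.
title = "Self-Supervised Learning for Anomaly Detection"
comments = "Accepted to CVPR 2021"

title_hit = matches_any(title, key_words)                     # title filter
conference_hit = matches_any(comments, key_words_conference)  # comments filter
```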
Usage: to download the papers of interest daily, run:
python3 main_arxiv.py
recent
daily
#!/usr/bin/python
# -*- coding:utf-8 -*-
import urllib.parse
import urllib.request
import lxml
from bs4 import BeautifulSoup
import re
import ssl
import time
import os
"""
1. add comments
2. add daily
"""
class main_arxiv(object):
    def __init__(self, query_word: str, domain='cs.CV/', query_mode='all',
                 key_words=['deep learning'],  # keywords (change as needed)
                 key_words_conference=['PST'],  # conferences or journals (change as needed)
                 download_root_dir=r'D:\data\Arxiv-paper-crawler-daily-master\results'):  # directory where downloaded files are saved
        """query_word: month_year, recent, pastweek"""
        self.original_url = 'https://arxiv.org/'
        # Listing page URL, e.g. https://arxiv.org/list/cs.CV/recent
        self.domain_url = self.original_url + 'list/' + domain + query_word
        # Exact match, so that strings merely containing 'all' or 'daily' are rejected
        assert query_mode in ('all', 'daily'), 'please input correct query mode(all, daily)'
        self.query_mode = query_mode
        # Browser-like User-Agent sent with each request
        self.headers = {
            'User-Agent':
                'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
        }
        self.key_words = key_words
        self.key_words_conference = key_words_conference
        self.root_dir = download_root_dir
        current_time = time.strftime("