Tags: IEEE Xplore, bash
Test environment: Ubuntu 15.04, Sun Yat-sen University
Let's start with a single paper. Download any paper from IEEE Xplore and copy its download link, for example:
http://ieeexplore.ieee.org/ielx7/6875427/6877223/06877226.pdf?tp=&arnumber=6877226&isnumber=6877223
Keep only the part before the ?:
http://ieeexplore.ieee.org/ielx7/6875427/6877223/06877226.pdf
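The truncation above can also be done in the shell rather than by hand. Here is a minimal sketch using bash parameter expansion; the function name strip_query is made up for illustration and is not part of the scripts in this post.

```shell
# strip_query URL: print the URL without its query string,
# i.e. without everything from the first "?" onward.
# (Hypothetical helper name, used only for this illustration.)
strip_query() {
    local url="$1"
    # ${url%%\?*} removes the longest suffix that starts with a literal "?"
    printf '%s\n' "${url%%\?*}"
}

# Example call (prints the truncated link, no network access involved):
# strip_query "http://ieeexplore.ieee.org/ielx7/6875427/6877223/06877226.pdf?tp=&arnumber=6877226&isnumber=6877223"
```

A URL without a ? passes through unchanged, so the helper is safe to apply to links that are already truncated.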
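On Linux, the wget command downloads a file from a given URL, so passing it the truncated URL above fetches the PDF (the same command is used later for the batch download).
That is one paper downloaded. To download in bulk, we need the download URL of every paper. If you download a few more papers and compare their links, a pattern shows up: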
http://ieeexplore.ieee.org/ielx7/6875427/6877223/06877326.pdf?tp=&arnumber=6877326&isnumber=6877223
http://ieeexplore.ieee.org/ielx7/6875427/6877223/06877325.pdf?tp=&arnumber=6877325&isnumber=6877223
http://ieeexplore.ieee.org/ielx7/6875427/6877223/06877324.pdf?tp=&arnumber=6877324&isnumber=6877223
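The download link has the following format (the first two numbers, "6875427" and "6877223", are the same for every paper of the same conference, so you only need to look them up once):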
http://ieeexplore.ieee.org/ielx7/6875427/6877223/0{arnumber}.pdf
So the download link can be split into two parts. Note the extra 0 in front of the arnumber:
http://ieeexplore.ieee.org/ielx7/6875427/6877223/ and 0{arnumber}.pdf
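Putting the two parts back together is a one-liner. The sketch below assumes the pattern observed above (a single leading 0 before a seven-digit arnumber). Whether IEEE Xplore pads shorter or longer arnumbers the same way is not something this post verifies. The name build_pdf_url is hypothetical.

```shell
# build_pdf_url BASE ARNUMBER: join the per-conference base URL and the
# per-paper file name "0{arnumber}.pdf".
# (Hypothetical helper; assumes the single leading 0 seen in the examples.)
build_pdf_url() {
    local base="${1%/}"   # drop one trailing slash, if any, to avoid "//"
    local arnumber="$2"
    printf '%s/0%s.pdf\n' "$base" "$arnumber"
}

# Example call:
# build_pdf_url "http://ieeexplore.ieee.org/ielx7/6875427/6877223/" 6877226
```

Because the trailing slash is stripped first, the base URL can be given with or without it.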
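The problem now becomes how to get the arnumber of every paper. There are two ways. One is to write a crawler that parses the web pages, but that takes more code. Here we use the other one: IEEE Xplore provides a Download Citations feature, as shown in the figure:
Download the citations and save them to a file. Each entry looks like this: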
Thangavel, M.; Chandrasekaran, M.; Madheswaran, M., "Analysis of B-mode transverse ultrasound common carotid artery images using contour tracking by particle filtering technique," in Devices, Circuits and Systems (ICDCS), 2012 International Conference on , vol., no., pp.470-473, 15-16 March 2012
doi: 10.1109/ICDCSyst.2012.6188759
keywords: {biodiffusion;biomedical ultrasonics;blood vessels;cardiovascular system;diseases;filters;image denoising;image segmentation;medical image processing;particle filtering (numerical methods);speckle;ultrasonic imaging;B-mode transverse ultrasound common carotid artery images;atherosclerosis;cardiovascular diseases;contour tracking;edge preserving anisotropic diffusion filter;image segmentation;medical image analysis;particle filtering technique;speckle noises;speckle reduction;Fitting;Image segmentation;Image Segmentation;Medical imaging;Particle filtering;Ultrasound image},
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6188759&isnumber=6188639
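The file contains the details of every paper, including the title (which can be used as the file name), keywords, and a URL. That URL cannot be passed to wget directly to download the paper, but it does contain the arnumber we are after. The next step is to extract the arnumber and the title from this information.
Looking at the downloaded citations, each paper title sits between double quotes, in the form "title", and the arnumber appears as "arnumber=6188759". So let's match them with regular expressions: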
grep -o -e "arnumber=[0-9]*" -e '"[^"]*"' "{citations file}" >> "{save file}"
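This extracts the paper titles and arnumbers from the citations file downloaded earlier and saves them to another file. The two regular expressions match the arnumber and the paper title respectively. The result looks like this: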
"Algorithm Engineering for Scalable Parallel External Sorting,"
arnumber=6012805
"Power-Aware Replica Placement and Update Strategies in Tree Networks,"
arnumber=6012820
"Minimum Cost Resource Allocation for Meeting Job Requirements,"
arnumber=6012821
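To try the extraction step on its own, the grep call can be wrapped in a small function. This is only a sketch: extract_pairs is a made-up name, and it assumes that the only double-quoted text in each citation entry is the title, which holds for the samples shown here but may not hold for every export.

```shell
# extract_pairs FILE: print every quoted title and every "arnumber=..." match
# from a citations file, one match per line, in the order they appear.
# (Hypothetical helper; assumes titles are the only double-quoted strings.)
extract_pairs() {
    # -o prints only the matched text; each -e adds one pattern
    grep -o -e 'arnumber=[0-9]*' -e '"[^"]*"' "$1"
}

# Example call:
# extract_pairs downloadCitations.txt > downlist.txt
```

Since the title line comes before the URL line in each entry, the output alternates between a title and its arnumber.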
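Every two lines give the title and arnumber of one paper. From here it is easy: write a shell script that reads this information in a loop, downloads each paper by its arnumber, and saves it with the paper title as the file name. So how do we read it?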
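#!/bin/bash
base="http://ieeexplore.ieee.org/ielx7/6875427/6877223" # no trailing slash, one is added below
file="filename.txt" # the file with titles and arnumbers produced by grep
while read -r title && read -r arnumber # read two lines per iteration
do
    # take the text between the quotes, drop the trailing comma and every slash
    title=$(echo "$title" | cut -d '"' -f 2 | sed -e 's/,$//' -e 's/\///g')
    arnumber=$(echo "$arnumber" | cut -d "=" -f 2) # keep only the number
    wget "$base/0$arnumber.pdf" # download
    mv "0$arnumber.pdf" "$title.pdf" # save under the paper title
done < "$file"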
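Save it as download.sh and make it executable:
chmod +x download.sh
Then run it with ./download.sh and wait for it to finish.
The script also uses two other commands: cut, which extracts part of a string, and sed, which removes slashes from the title, since a slash cannot appear in a file name. I won't go into their usage here.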
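For readers who want to see the title cleanup in isolation, here is a small sketch. The function name clean_title is hypothetical. This variant removes only the final comma with sed, so a title that contains commas is not cut short, and it removes every slash rather than just the first one.

```shell
# clean_title LINE: turn a line such as "Some Title," (quotes included)
# into a string that is safe to use as a file name.
# (Hypothetical helper, shown only to illustrate cut and sed.)
clean_title() {
    # cut: keep the text between the first pair of double quotes
    # sed: delete the trailing comma, then delete all slashes
    printf '%s\n' "$1" | cut -d '"' -f 2 | sed -e 's/,$//' -e 's#/##g'
}

# Example call:
# clean_title '"Minimum Cost Resource Allocation for Meeting Job Requirements,"'
```

Other characters that are awkward in file names, such as colons or question marks, are left untouched here, so treat this as a starting point.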
I have tested this myself and it works for ICDCS 2012 and IPDPS 2012-2015.
Here is my complete script:
#!/bin/bash
base=
file=
tempfile1="downlist.txt" # temporary file, deleted when done
tempfile2="urls.txt" # temporary file, deleted when done
# remove leftovers from a previous run (rm -f ignores missing files)
rm -f "$tempfile1" "$tempfile2"
usage()
{
echo "Usage: `basename $0` -b url_base_string -f input_file [-h help]"
exit 1
}
while getopts "b:f:h" arg # a colon after an option letter means the option takes an argument
do
    case $arg in
        b)
            base=$OPTARG
            ;;
        f)
            file=$OPTARG
            ;;
        h)
            usage
            ;;
        ?) # getopts sets arg to ? when it meets an unrecognized option
            echo "unknown argument"
            exit 1
            ;;
    esac
done
if [ -z "$base" ]; then # the -b option is required
    echo "You must specify base with -b option"
    exit 1
fi
if [ -z "$file" ]; then # the -f option is required
    echo "You must specify file with -f option"
    exit 1
fi
grep -o -e "arnumber=[0-9]*" -e '"[^"]*"' "$file" >> "$tempfile1"
while read -r title && read -r arnumber # read the title and arnumber, two lines at a time
do
    arnumber=$(echo "$arnumber" | cut -d "=" -f 2)
    echo "$base/0$arnumber.pdf" >> "$tempfile2" # build every download link first and save it to a temporary file
done < "$tempfile1"
wget -i "$tempfile2" # download all papers in one batch
echo $? # print wget's exit status
while read -r title && read -r arnumber # rename each downloaded file to its title
do
    title=$(echo "$title" | cut -d '"' -f 2 | sed -e 's/,$//' -e 's/\///g')
    arnumber=$(echo "$arnumber" | cut -d "=" -f 2)
    mv "0$arnumber.pdf" "$title.pdf"
done < "$tempfile1"
rm -f "$tempfile1" "$tempfile2" # clean up the temporary files
Usage: ./download.sh -b {base URL, which you have to look up yourself} -f {Citations file downloaded from IEEE Xplore}
./download.sh -b http://ieeexplore.ieee.org/ielx5/6180033/6188639 -f downloadCitations.txt
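Before starting a long batch, it can be handy to preview which URL will end up under which file name. The sketch below is not part of the script above. preview_downloads is a made-up helper that takes the base URL and a file of alternating title and arnumber lines (the format produced by the grep step) and prints one tab-separated pair per paper without downloading anything.

```shell
# preview_downloads BASE PAIRS_FILE: print "URL<TAB>target file name" for each
# paper listed in PAIRS_FILE, without touching the network.
# (Hypothetical helper; PAIRS_FILE must alternate title line / arnumber line.)
preview_downloads() {
    local base="${1%/}" pairs="$2" title arnumber
    while read -r title && read -r arnumber; do
        # same cleanup idea as the rename step: unquote, drop trailing comma and slashes
        title=$(printf '%s\n' "$title" | cut -d '"' -f 2 | sed -e 's/,$//' -e 's#/##g')
        arnumber=${arnumber#arnumber=}   # strip the "arnumber=" prefix
        printf '%s/0%s.pdf\t%s.pdf\n' "$base" "$arnumber" "$title"
    done < "$pairs"
}

# Example call, assuming downlist.txt was produced by the grep step:
# preview_downloads http://ieeexplore.ieee.org/ielx5/6180033/6188639 downlist.txt
```

If two papers map to the same file name in the preview, the later mv would overwrite the earlier file, so this is a cheap way to spot collisions first.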
Feel free to ask me if you need any help. @maxuan