Downloading Data from Overseas (Alibaba Cloud + Qiniu Cloud)

Problem Description

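Requirement: batch-download CALIPSO data from NASA; each file is between 400 MB and 500 MB (the directory and bucket names below use the spelling CAPLISO).
Problem: direct downloads ran at a few KB/s to a few dozen KB/s; later they could reach 1 MB/s.
Approaches tried:
VPN: unstable;
Cloud server: hosted overseas. Downloads from NASA to the server were fast, but transfers from the server to the local machine were slow, with FTP speeds between 20 KB/s and 200 KB/s.
Solution:
Download the data to an Alibaba Cloud server, upload it to Qiniu Cloud, and then download it from Qiniu Cloud to the local machine. (I also tried fetching the resources directly with qshell sync/fetch, but that failed.)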
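
To put those speeds in perspective, here is a rough calculation of how long a single file takes at each rate. The 450 MB file size and the helper name transfer_minutes are illustrative assumptions; the rates are the ones quoted above, taking 1 MB as 1024 KB.

```python
def transfer_minutes(size_mb, rate_kb_per_s):
    """Return the minutes needed to move size_mb megabytes at rate_kb_per_s KB/s."""
    seconds = size_mb * 1024 / rate_kb_per_s
    return seconds / 60


if __name__ == "__main__":
    size_mb = 450  # assumed typical file size, within the 400-500 MB range
    rates = [("FTP, worst case", 20), ("FTP, best case", 200), ("about 1 MB/s", 1024)]
    for label, rate in rates:
        print("{:<16} {:>6.1f} min".format(label, transfer_minutes(size_mb, rate)))
```

At 20 KB/s one 450 MB file takes roughly 6.4 hours, against about 7.5 minutes at 1 MB/s, which is why the extra hop through Qiniu Cloud pays off.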

Downloading the data on the cloud server:

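With command-line wget, the file name does not tell you whether a download succeeded or failed. I could not find a suitable download tool, so I decided to download the data with Python's wget package: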

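import glob
import threading

import wget  # third-party package: pip install wget

MAX_CONNECTIONS = 3  # maximum number of concurrent download threads


def downUrl(url):
    """Download one URL and always free the semaphore slot afterwards."""
    try:
        print(url)
        wget.download(url)
    except Exception as exc:
        # Report the failure but let the remaining downloads continue
        print("Failed: {} ({})".format(url, exc))
    finally:
        semaphore.release()


if __name__ == "__main__":
    semaphore = threading.Semaphore(MAX_CONNECTIONS)
    # Use the first .txt file in the current directory as the URL list
    urlFile = glob.glob("*.txt")[0]
    # Read one URL per line, skipping blank lines
    with open(urlFile, "r") as f:
        urls = [line.strip() for line in f if line.strip()]

    # Multi-threaded download: acquire a slot before starting each thread
    threads = []
    for url in urls:
        semaphore.acquire()
        t = threading.Thread(target=downUrl, args=(url,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()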
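
Because interrupted downloads can leave partial files behind, it can help to work out which URLs still need fetching before re-running the script. The sketch below assumes that a finished file is saved under the last path component of its URL and that unfinished downloads carry a .tmp suffix, as the directory listing later in this post suggests. pending_urls is a hypothetical helper, not part of the wget package.

```python
import os
from urllib.parse import urlsplit


def pending_urls(urls, download_dir="."):
    """Return the URLs whose target file is not yet present in download_dir.

    Assumption: a completed download is stored under the last path
    component of its URL; names ending in .tmp are treated as partial.
    """
    finished = {
        name for name in os.listdir(download_dir)
        if not name.endswith(".tmp")
    }
    todo = []
    for url in urls:
        filename = os.path.basename(urlsplit(url).path)
        if filename not in finished:
            todo.append(url)
    return todo
```

Feeding the result back into the download loop, instead of the full URL list, would skip files that are already complete.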

Automatic Backup to Qiniu Cloud

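References: "Speeding up file downloads with an overseas server + Qiniu Cloud" (借助海外服务器+七牛云加速文件下载) and the Qiniu Cloud command-line tool (qshell).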

  1. Download and set up qshell
mkdir qshell
cd qshell
wget http://devtools.qiniu.com/qshell-linux-x86-v2.4.1.zip
unzip qshell-linux-x86-v2.4.1.zip 
mv qshell-linux-x86-v2.4.1 qshell
export PATH=$PATH:/root/qshell
qshell account AK SK name
  2. File synchronization
    Consider using qshell qupload to synchronize the files.
    2.1 Use dircache to list all files under the specified local path
qshell dircache /root/CAPLISO -o CAL.txt
cat CAL.txt
root@iZrj95owmqogpx359ex82rZ:~/qshell# cat CAL.txt 
CAL_LID_L1-Standard-V4-10.2019-01-01T00-25-44ZN.hdf     450167668       15897894038522642
CAL_LID_L1-Standard-V4-10.2019-01-01T01-11-54ZD.hdf     507582814       15897895655303009
CAL_LID_L1-Standard-V4-10.2019-01-01T02-04-19ZN.hdf     450560900       15897894846512810
CAL_LID_L1-Standard-V4-10.2019-01-01T02-50-29ZD.hdf     507976024       15897900092148671
CAL_LID_L1-Standard-V4-10.2019-01-01T03-42-49ZN.hdf     450167690       15897899708814358
CAL_LID_L1-Standard-V4-10.2019-01-01T04-29-00ZD.hdf     507976019       15897901334875073
CAL_LID_L1-Standard-V4-10.2019-01-01T05-21-20ZN.hdf     450167690       15897904515833847
CAL_LID_L1-Standard-V4-10.2019-01-01T06-07-30ZD.hdf     507976018       15897907486464768
CAL_LID_L1-Standard-V4-10.2019-01-01T06-59-55ZN.hdf     450167693       15897906621112456
CAL_LID_L1-Standard-V4-10.2019-01-01T07-46-05ZD.hdf     507976022       15897912499531996
CAL_LID_L1-Standard-V4-10.2019-01-01T08-38-25ZN.hdf     450560903       15897911334968496
CAL_LID_L1-Standard-V4-10.2019-01-01T09-24-35ZD.hdf     507582804       15897915489883694
CAL_LID_L1-Standard-V4-10.2019-01-01T10-17-00ZN.hdf     450560902       15897915841776838
CAL_LID_L1-Standard-V4-10.2019-01-01T11-03-10ZD.hdfs9howd3_.tmp 396845056       15897916606685410
CAL_LID_L1-Standard-V4-10.2019-01-01T11-55-31ZN.hdfpnon2cqk.tmp 88195072        15897916606965420
CAL_LID_L1-Standard-V4-10.2019-01-01T12-41-41ZD.hdf9uzdqnk6.tmp 89456640        15897916607005422
LID_L1_2019_01.txt      116242  15897867594335127
batchDownload.py        840     15897889706120881

Only the files ending in .hdf have finished downloading, so filter the list:

cat CAL.txt  | grep 'hdf' | grep  -v '.tmp' > filelist.txt
cat filelist.txt 
CAL_LID_L1-Standard-V4-10.2019-01-01T00-25-44ZN.hdf     450167668       15897894038522642
CAL_LID_L1-Standard-V4-10.2019-01-01T01-11-54ZD.hdf     507582814       15897895655303009
CAL_LID_L1-Standard-V4-10.2019-01-01T02-04-19ZN.hdf     450560900       15897894846512810
CAL_LID_L1-Standard-V4-10.2019-01-01T02-50-29ZD.hdf     507976024       15897900092148671
CAL_LID_L1-Standard-V4-10.2019-01-01T03-42-49ZN.hdf     450167690       15897899708814358
CAL_LID_L1-Standard-V4-10.2019-01-01T04-29-00ZD.hdf     507976019       15897901334875073
CAL_LID_L1-Standard-V4-10.2019-01-01T05-21-20ZN.hdf     450167690       15897904515833847
CAL_LID_L1-Standard-V4-10.2019-01-01T06-07-30ZD.hdf     507976018       15897907486464768
CAL_LID_L1-Standard-V4-10.2019-01-01T06-59-55ZN.hdf     450167693       15897906621112456
CAL_LID_L1-Standard-V4-10.2019-01-01T07-46-05ZD.hdf     507976022       15897912499531996
CAL_LID_L1-Standard-V4-10.2019-01-01T08-38-25ZN.hdf     450560903       15897911334968496
CAL_LID_L1-Standard-V4-10.2019-01-01T09-24-35ZD.hdf     507582804       15897915489883694
CAL_LID_L1-Standard-V4-10.2019-01-01T10-17-00ZN.hdf     450560902       15897915841776838
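
The same filter can be written in Python, which avoids one subtlety of the grep pipeline: in '.tmp' the dot is a regular-expression wildcard, so the pattern matches any character followed by tmp. This is a minimal sketch, assuming each dircache line holds a name, a size in bytes, and a timestamp separated by whitespace, as in the listing above. finished_hdf is a hypothetical helper name.

```python
def finished_hdf(lines):
    """Return (name, size) pairs for entries whose name ends in .hdf."""
    result = []
    for line in lines:
        # Split from the right so the last two fields are size and timestamp
        parts = line.rstrip("\n").rsplit(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, size, _timestamp = parts
        if name.endswith(".hdf"):
            result.append((name, int(size)))
    return result


if __name__ == "__main__":
    with open("CAL.txt", "r") as f:
        for name, size in finished_hdf(f):
            print(name, size)
```

Matching on the suffix rather than on a substring also excludes the .tmp files without a second filter.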

Create the configuration file up.conf:

{
    "src_dir" : "/root/CAPLISO",
    "bucket" : "capliso-download",
    "file_list" : "filelist.txt",
    "ignore_dir" : false,
    "overwrite" : false,
    "check_exists" : false,
    "check_hash" : false,
    "check_size" : false,
    "rescan_local" : true,
    "skip_file_prefixes" : "test,demo,",
    "skip_path_prefixes" : "hello/,temp/",
    "skip_fixed_strings" : ".svn,.git",
    "skip_suffixes" : ".DS_Store,.exe",
    "log_file" : "upload.log",
    "log_level" : "info",
    "log_rotate" : 1,
    "log_stdout" : false,
    "file_type" : 0,
    "delete_on_success" : true
}
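
JSON requires plain double quotes, so a config whose quotes were turned into typographic quotes (“ ”) while being copied from a web page may fail to parse. One way to sidestep this is to generate the file with Python's json module. The sketch below writes a subset of the keys shown above with the same values; write_upload_config is a hypothetical helper name.

```python
import json


def write_upload_config(path, src_dir, bucket, file_list):
    """Write a qupload-style config as valid JSON and return the dict written."""
    config = {
        "src_dir": src_dir,
        "bucket": bucket,
        "file_list": file_list,
        "overwrite": False,
        "check_exists": False,
        "rescan_local": True,
        "log_file": "upload.log",
        "log_level": "info",
        "delete_on_success": True,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=4)
    return config


if __name__ == "__main__":
    write_upload_config("up.conf", "/root/CAPLISO", "capliso-download", "filelist.txt")
```

Running json.load on the resulting file is a quick way to confirm that it parses before handing it to qshell.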

2.2 Synchronize:

./qshell dircache /root/CAPLISO -o CAL.txt
cat CAL.txt  | grep 'hdf' | grep  -v '.tmp' > filelist.txt
cat filelist.txt 
./qshell qupload --success-list success.txt --failure-list failure.txt up.conf
  3. Download from the cloud to the local machine:
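import glob
import os
import subprocess

# Use the first .txt file in the current directory as the file list
File = glob.glob("*.txt")[0]
with open(File, "r") as f:
    for line in f:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # The second whitespace-separated field is used as the file name (key)
        name = fields[1]
        if not os.path.exists(name):
            print("Downloading: {}".format(name))
            subprocess.run(["qshell", "get", "capliso-download", name])
        else:
            print("Saved: {}".format(name))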
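
Since large transfers can be cut short, it may be worth checking each downloaded file against the size recorded in the dircache listing. This is a minimal sketch, assuming the listing format shown earlier (name, size in bytes, timestamp); size_mismatches is a hypothetical helper.

```python
import os


def size_mismatches(listing_path, local_dir="."):
    """Return names whose local file is missing or differs in size from the listing."""
    bad = []
    with open(listing_path, "r") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            name, expected = parts[0], int(parts[1])
            local_path = os.path.join(local_dir, name)
            if not os.path.isfile(local_path) or os.path.getsize(local_path) != expected:
                bad.append(name)
    return bad
```

Any name returned here is a candidate for deleting and downloading again.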

Troubleshooting Notes

  1. SSH reports:

Socket error Event: 32 Error: 10053.
Connection closing…Socket close.

Fix:

chmod 400 /etc/ssh/*
service sshd restart
chmod 770 /etc/ssh/ssh_host_dsa_key.pub
chmod 770 /etc/ssh/ssh_host_rsa_key.pub
service network restart
  2. In python3, wget.download(url) raises:

socket.gaierror: [Errno -3] Temporary failure in name resolution

The DNS resolver is failing. Add Google's public DNS servers to /etc/resolv.conf:

nameserver 8.8.8.8
nameserver 8.8.4.4
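
To confirm that name resolution works again before restarting a long batch, a quick check from Python can help. This is a minimal sketch using the standard socket module; can_resolve is a hypothetical helper, and example.com is only a placeholder for the host you actually download from.

```python
import socket


def can_resolve(hostname):
    """Return True if hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        # The same exception type that wget.download surfaced above
        return False


if __name__ == "__main__":
    # Placeholder host; substitute the server you download from
    print(can_resolve("example.com"))
```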
