Batch-Downloading GOCI Level-2 Products

Task Description

  • Download GOCI Level-2 product data (Chla and $\rho_{rc}$)
  • Time span: April 2011 to November 2019

Implementation

1. Collecting the download links

import requests
from bs4 import BeautifulSoup
import time
import random

def find_data(url):
    ip = "http://222.236.46.45"
    time.sleep(random.uniform(3, 5))  # random 3-5 s pause; requesting too fast gets the IP banned
    res = requests.get(url=url)  # fetch the page with requests
    html = BeautifulSoup(res.text, "html.parser")  # parse it with BeautifulSoup
    for link in html.find_all('a')[1:]:  # all <a> tags, skipping the first (the parent-directory link)
        full_link = ip + link.get('href')  # build the absolute link from the href
        if full_link.endswith(".zip"):  # a data file: hand it to the filter
            find_chl2_rc2(full_link)
        else:
            find_data(full_link)  # a sub-directory: recurse into it
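As a side note, `ip + link.get('href')` only works while every href on this server is an absolute path; the standard library's `urljoin` resolves both absolute-path and relative hrefs against the page URL, so it is a safer way to build `full_link`. A minimal sketch:

```python
from urllib.parse import urljoin

# Page URL of one directory listing on the GOCI server
base = "http://222.236.46.45/nfsdb/COMS/GOCI/2.0/2019/"

# An absolute-path href is resolved against the host...
print(urljoin(base, "/nfsdb/COMS/GOCI/2.0/2019/04/"))
# ...and a relative href against the current directory; both give the same URL here
print(urljoin(base, "04/"))
```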

2. Filtering

My senior labmate only needed each day's first Chla scene and third $\rho_{rc}$ scene (data quality is best around noon), so a filtering step was added.

def find_chl2_rc2(full_link):
    # Keep only Chla and rho_rc files whose two-digit slot field (at a fixed
    # offset in these URLs) is "02".
    if "CHL" in full_link:
        if full_link[80:82] == "02":
            downlist.append(full_link)
            print(full_link)
    elif "CDOM" in full_link:
        pass  # skip CDOM products
    elif "TSS" in full_link:
        pass  # skip TSS products
    elif "RRS" in full_link:
        pass  # skip RRS products
    else:  # everything left is a rho_rc product
        if full_link[80:82] == "02":
            downlist.append(full_link)
            print(full_link)
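The fixed-offset slice `full_link[80:82]` breaks if the URL length ever changes; matching the file name with a regular expression is more robust. A minimal sketch, assuming (hypothetically) that the two-digit slot number sits right before the `.zip` extension — the actual pattern must be verified against real GOCI links:

```python
import re
from os.path import basename

# Hypothetical pattern: a two-digit slot number just before ".zip".
# Check it against real GOCI L2 file names before relying on it.
SLOT_RE = re.compile(r'(\d{2})\.zip$')

def slot_of(link):
    """Return the two-digit slot string from a download link, or None."""
    m = SLOT_RE.search(basename(link))
    return m.group(1) if m else None

print(slot_of("http://example.com/GOCI_CHL_hypothetical_02.zip"))  # prints: 02
```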

3. Downloading

Downloading with IDM

Import the collected download links into IDM.
Choose the download folder, click "Select All" and then "OK", and the download starts.

Downloading with wget

Alternatively, wget can download every link listed in downlist.txt: -nc skips files that already exist, and -c resumes interrupted downloads.

wget --input-file=downlist.txt -nc -c
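The files can also be fetched straight from Python, reusing the `requests` library the crawler already imports. A minimal sketch with no retry or resume logic (unlike wget -c); the `goci_data` directory name is just an example:

```python
import os
import requests

def download(url, out_dir="goci_data"):
    """Stream one file to disk, skipping files that already exist (like wget -nc)."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, url.rsplit("/", 1)[-1])  # file name taken from the URL
    if os.path.exists(path):
        return path  # already downloaded, skip
    with requests.get(url, stream=True, timeout=60) as res:
        res.raise_for_status()
        with open(path, "wb") as f:
            for chunk in res.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                f.write(chunk)
    return path
```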

Limitations and Future Work

  • requests gets the IP banned when it fires too quickly; a proxy (IP) pool is worth trying later
  • the link-filtering code looks clumsy; a regular expression would be cleaner
  • I was short on time, so instead of writing a multi-threaded downloader I simply handed the links to IDM
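The multi-threaded download mentioned in the last bullet could be sketched with the standard library's thread pool; `fetch` here stands in for any single-file download function (for example, a wrapper around requests):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_all(links, fetch, max_workers=4):
    """Run fetch(url) for every link on a small thread pool and
    collect {url: result-or-exception}; a few workers are enough,
    since too many parallel requests risk an IP ban."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in links}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception as exc:  # keep going if one file fails
                results[url] = exc
    return results
```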

Complete Code

import requests
from bs4 import BeautifulSoup
import time
import random


def find_chl2_rc2(full_link):
    if "CHL" in full_link:
        if full_link[80:82] == "02":
            downlist.append(full_link)
            print(full_link)
    elif "CDOM" in full_link:
        pass  # skip CDOM products
    elif "TSS" in full_link:
        pass  # skip TSS products
    elif "RRS" in full_link:
        pass  # skip RRS products
    else:  # everything left is a rho_rc product
        if full_link[80:82] == "02":
            downlist.append(full_link)
            print(full_link)


def find_data(url):
    ip = "http://222.236.46.45"
    time.sleep(random.uniform(3, 5))
    res = requests.get(url=url)
    html = BeautifulSoup(res.text, "html.parser")
    for link in html.find_all('a')[1:]:
        full_link = ip + link.get('href')
        if full_link.endswith(".zip"):
            find_chl2_rc2(full_link)
        else:
            find_data(full_link)


if __name__ == '__main__':
    BASE_URL = 'http://222.236.46.45/nfsdb/COMS/GOCI/2.0/2019'
    downlist = []
    find_data(BASE_URL)
    with open("downlist.txt", 'w') as f:
        for line in downlist:
            f.write(line + '\n')
