Downloading TCGA data with a Python script: very simple, and open source

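Preface

TCGA's official download tool, gdc-client, is sometimes very slow and often crashes, so I decided to write a small downloader of my own. The result is a script that downloads TCGA data through the TCGA (GDC) API. Like gdc-client, it still needs a manifest file that you download first.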

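Environment

Later in this post the program is also packaged as an EXE, in both a command-line and a GUI version, so that people without Python can use it too.

Environment: Python 3.6
Packages: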

  • os
  • pandas
  • requests
  • sys
  • argparse
  • signal
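
Of the packages listed, only pandas and requests are third-party; the others ship with Python. A small check along these lines can report what still needs to be installed. This is only a sketch, and the helper name `missing_packages` is made up for illustration:

```python
import importlib.util

# Packages imported by the download script (taken from the list above)
REQUIRED = ["os", "pandas", "requests", "sys", "argparse", "signal"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Please install: pip install " + " ".join(missing))
    else:
        print("All required packages are available.")
```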

Code

# coding:utf-8
'''
This tool simplifies the steps needed to download TCGA data. It has two main parameters:
-m is the manifest file path.
-s is the folder where the downloaded files are saved (it is best to create a new folder for the downloaded data).
The tool supports resuming. If the program is interrupted, restart it and it will continue from the last downloaded file. Note that instead of the previous one-folder-per-file layout, this tool saves each file directly as a .txt file named after the file's UUID in TCGA. If necessary, press Ctrl+C to stop the program.
author: chenwi
date: 2018/07/10
mail: [email protected]
'''
import os
import pandas as pd
import requests
import sys
import argparse
import signal

print(__doc__)

requests.packages.urllib3.disable_warnings()


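def download(url, file_path):
    # Stream the response so large files are never held in memory at once
    r = requests.get(url, stream=True, verify=False)
    # Fail early on HTTP errors instead of saving an error page as data
    r.raise_for_status()
    # content-length may be absent; 0 means the total size is unknown
    total_size = int(r.headers.get('content-length', 0))
    temp_size = 0

    with open(file_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                temp_size += len(chunk)
                f.write(chunk)
                # Draw the progress bar only when the total size is known
                if total_size:
                    done = int(50 * temp_size / total_size)
                    sys.stdout.write("\r[%s%s] %d%%" % ('#' * done, ' ' * (50 - done), 100 * temp_size / total_size))
                    sys.stdout.flush()
    print()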


def get_UUID_list(manifest_path):
    UUID_list = pd.read_table(manifest_path, sep='\t', encoding='utf-8')['id']
    UUID_list = list(UUID_list)
    return UUID_list


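def get_last_UUID(file_path):
    # Find the most recently modified file in the save folder
    dir_list = os.listdir(file_path)
    if not dir_list:
        # Empty folder: nothing has been downloaded yet
        return None
    else:
        dir_list = sorted(dir_list, key=lambda x: os.path.getmtime(os.path.join(file_path, x)))

        # Strip the 4-character '.txt' extension to recover the UUID
        return dir_list[-1][:-4]


def get_lastUUID_index(UUID_list, last_UUID):
    # Return the position of the last downloaded UUID in the manifest.
    # That file is downloaded again on resume, because it may be incomplete.
    for i, UUID in enumerate(UUID_list):
        if UUID == last_UUID:
            return i
    # Not found (or nothing downloaded yet): start from the beginning
    return 0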


def quit(signum, frame):
    # Signal handler for Ctrl+C (SIGINT) and SIGTERM: print a message and exit
    print('You chose to stop me.')
    sys.exit(0)


if __name__ == '__main__':

    signal.signal(signal.SIGINT, quit)
    signal.signal(signal.SIGTERM, quit)

    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--manifest", dest="M", type=str, default="gdc_manifest.txt",
                        help="gdc_manifest.txt file path")
    parser.add_argument("-s", "--save", dest="S", type=str, default=os.curdir,
                        help="Which folder is the download file saved to?")
    args = parser.parse_args()

    link = r'https://api.gdc.cancer.gov/data/'

    # args
    manifest_path = args.M
    save_path = args.S

    print("Save file to {}".format(save_path))

    UUID_list = get_UUID_list(manifest_path)
    last_UUID = get_last_UUID(save_path)
    print("Last download file {}".format(last_UUID))
    last_UUID_index = get_lastUUID_index(UUID_list, last_UUID)

    for UUID in UUID_list[last_UUID_index:]:
        # Build the URL by plain concatenation; os.path.join is meant for file paths, not URLs
        url = link + UUID
        file_path = os.path.join(save_path, UUID + '.txt')
        download(url, file_path)
        print(f'{UUID} has been downloaded')
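
The resume logic in the script relies on file modification times and starts again from the most recently written file. As an alternative sketch (this is not the script's own method, and `pending_uuids` is a hypothetical helper), one could instead skip every UUID that already has a `.txt` file in the save folder. Note that this would also skip a file left incomplete by an interruption, so it assumes partial files are deleted or verified separately.

```python
import os


def pending_uuids(uuid_list, save_path):
    """Return the UUIDs from uuid_list that have no '<UUID>.txt' in save_path.

    Manifest order is preserved, so downloads continue in the original order.
    """
    existing = {
        os.path.splitext(name)[0]
        for name in os.listdir(save_path)
        if name.endswith(".txt")
    }
    return [uuid for uuid in uuid_list if uuid not in existing]
```

With such a helper, the main loop could iterate over `pending_uuids(UUID_list, save_path)` rather than over a slice of `UUID_list`.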

Usage

Just run the following on the command line:

python tcga_download.py -m manifest-xx.txt -s xxx

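Explanation:
manifest-xx.txt is the path to the manifest file you downloaded
xxx is the folder where you want the downloaded files to be saved (preferably a newly created, empty folder)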
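
After a run finishes, the files in the save folder can be checked against the manifest. The sketch below assumes the manifest is tab-separated and contains `id` and `md5` columns, and that each file was saved as `<UUID>.txt` as the script does. The function names `file_md5` and `find_bad_downloads` are illustrative and not part of the original script.

```python
import csv
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_bad_downloads(manifest_path, save_path):
    """Return UUIDs whose '<UUID>.txt' file is missing or fails the MD5 check.

    Assumes a tab-separated manifest with 'id' and 'md5' columns.
    """
    bad = []
    with open(manifest_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            path = os.path.join(save_path, row["id"] + ".txt")
            if not os.path.exists(path) or file_md5(path) != row["md5"]:
                bad.append(row["id"])
    return bad
```

Any UUID returned by `find_bad_downloads("manifest-xx.txt", "xxx")` would be a candidate for deleting and downloading again.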

Demo:
[Image 1]

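Packaging the program as an EXE

Finally, for those who do not have Python installed, I have packaged the tool as tcga_download.exe for downloading TCGA data. It is simple and convenient, somewhat like gdc-client, haha, but there is a sense of accomplishment in having written it myself. Later I plan to make a Qt GUI version where a few mouse clicks are enough.
tcga_download.exe is on a network drive; download it yourself if you need it.
Link: https://pan.baidu.com/s/1AGyZ5cAyPUK06zqiQGx-nQ Password: 3os4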

Demo:
[Image 2]

GUI version of the download EXE

A small exe tool that downloads with just a few mouse clicks:
Download: https://github.com/chenwi/TCGAD

Demo:
[Image 3]
