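TCGA's official download tool, gdc-client, is sometimes very slow and often drops the connection, so I decided to write a small downloader of my own. I used the GDC API to write a script that downloads TCGA data; like gdc-client, the script needs a manifest file that you download first.
Later I also packaged the program as EXE files, in both command-line and GUI versions, so that people without Python can use it too.
Environment: Python 3.6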
Packages: pandas and requests, plus the standard library. The script:
# coding:utf-8
'''
This tool simplifies downloading TCGA data. It has two main parameters:
-m is the manifest file path.
-s is the folder where the downloaded files are saved (it is best to create a new folder for the downloaded data).
This tool supports resuming interrupted downloads. After the program is interrupted, it can be restarted and will continue from the most recently saved file (which is downloaded again in case it was incomplete). Note that, unlike the folder-per-file layout used previously, this tool saves each file directly as a .txt file named after the file's UUID in TCGA. If necessary, press Ctrl+C to terminate the program.
author: chenwi
date: 2018/07/10
mail: [email protected]
'''
import os
import pandas as pd
import requests
import sys
import argparse
import signal

print(__doc__)
# Certificate verification is disabled in download(), so silence the resulting warnings
requests.packages.urllib3.disable_warnings()
def download(url, file_path):
    # Stream the file to disk in 1 KB chunks while drawing a progress bar
    r = requests.get(url, stream=True, verify=False)
    r.raise_for_status()  # stop instead of saving an HTTP error page as data
    total_size = int(r.headers.get('content-length', 0))
    temp_size = 0
    with open(file_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                temp_size += len(chunk)
                f.write(chunk)
                if total_size:  # skip the bar if the server sent no content-length
                    done = int(50 * temp_size / total_size)
                    sys.stdout.write("\r[%s%s] %d%%" % ('#' * done, ' ' * (50 - done), 100 * temp_size / total_size))
                    sys.stdout.flush()
    print()
def get_UUID_list(manifest_path):
    # The manifest is tab-separated; the 'id' column holds the file UUIDs
    UUID_list = pd.read_table(manifest_path, sep='\t', encoding='utf-8')['id']
    UUID_list = list(UUID_list)
    return UUID_list
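def get_last_UUID(file_path):
    # Return the UUID of the most recently modified file, or None if the folder is empty
    dir_list = os.listdir(file_path)
    if not dir_list:
        return
    else:
        dir_list = sorted(dir_list, key=lambda x: os.path.getmtime(os.path.join(file_path, x)))
        return dir_list[-1][:-4]  # strip the '.txt' suffix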
def get_lastUUID_index(UUID_list, last_UUID):
    # Index to resume from; 0 (start over) if the UUID is not in the manifest
    for i, UUID in enumerate(UUID_list):
        if UUID == last_UUID:
            return i
    return 0
def quit(signum, frame):
    # Ctrl+C quit
    print('You choose to stop me.')
    sys.exit()
if __name__ == '__main__':
    signal.signal(signal.SIGINT, quit)
    signal.signal(signal.SIGTERM, quit)
    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--manifest", dest="M", type=str, default="gdc_manifest.txt",
                        help="gdc_manifest.txt file path")
    parser.add_argument("-s", "--save", dest="S", type=str, default=os.curdir,
                        help="Which folder is the download file saved to?")
    args = parser.parse_args()
    link = r'https://api.gdc.cancer.gov/data/'
    # args
    manifest_path = args.M
    save_path = args.S
    print("Save file to {}".format(save_path))
    UUID_list = get_UUID_list(manifest_path)
    last_UUID = get_last_UUID(save_path)
    print("Last download file {}".format(last_UUID))
    last_UUID_index = get_lastUUID_index(UUID_list, last_UUID)
    for UUID in UUID_list[last_UUID_index:]:
        # Plain concatenation: os.path.join would insert a backslash on Windows
        url = link + UUID
        file_path = os.path.join(save_path, UUID + '.txt')
        download(url, file_path)
        print(f'{UUID} has been downloaded')
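The script decides where to resume by looking at the most recently modified file in the save folder. As a possible alternative, and not part of the original script, one could instead check every saved file against the md5 column that GDC manifests carry next to id. The sketch below assumes a tab-separated manifest with id and md5 columns and files saved as <UUID>.txt; the helper names md5_of_file and pending_uuids are made up for illustration.

```python
import csv
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pending_uuids(manifest_path, save_dir):
    """List UUIDs whose saved file is missing or fails the MD5 check.

    Assumption: the manifest is tab-separated with 'id' and 'md5'
    columns, and files are saved as '<UUID>.txt'.
    """
    pending = []
    with open(manifest_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            target = os.path.join(save_dir, row["id"] + ".txt")
            if not os.path.isfile(target) or md5_of_file(target) != row["md5"]:
                pending.append(row["id"])
    return pending
```

Looping over pending_uuids(manifest_path, save_path) instead of a slice of the UUID list would also catch files that were cut off mid-download, at the cost of hashing everything already on disk.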
Just run it from the command line:
python tcga_download.py -m manifest-xx.txt -s xxx
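Explanation:
manifest-xx.txt is the path to the manifest file you downloaded
xxx is the folder where you want the downloaded files to be saved (preferably a newly created, empty folder)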
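To make the two arguments concrete, here is a small sketch of how a manifest and a save folder translate into download URLs and output paths. It mirrors what the script does but is not taken from it: plan_downloads is a hypothetical helper, the manifest is assumed to be tab-separated with an id column, and no network request is made.

```python
import csv
import os

# Same data endpoint the script uses
GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data/"


def plan_downloads(manifest_path, save_dir):
    """Return (url, output_path) pairs, one per manifest row."""
    plan = []
    with open(manifest_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            uuid = row["id"]
            url = GDC_DATA_ENDPOINT + uuid
            output_path = os.path.join(save_dir, uuid + ".txt")
            plan.append((url, output_path))
    return plan
```

Printing the result of plan_downloads("manifest-xx.txt", "xxx") before a long run is a cheap way to confirm that the manifest parses and that the files will land where you expect.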
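Finally, if you don't have Python installed, you can use my packaged tool tcga_download.exe to download TCGA data. It is simple and convenient, a bit like gdc-client, haha, though something you wrote yourself still feels like an achievement. Later I plan to build a Qt GUI version where a few mouse clicks are all it takes.
I put tcga_download.exe on a network drive; download it if you need it.
Link: https://pan.baidu.com/s/1AGyZ5cAyPUK06zqiQGx-nQ Password: 3os4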
The cute little point-and-click exe:
Download: https://github.com/chenwi/TCGAD