Study notes: TCGA.GDC data processing series
Step 1: download the data
1. Download the manifest files from the database
Data portal: https://portal.gdc.cancer.gov/
In the Repository tab, select the case and file types you need:
(1) select cases
(2) select files
Three manifest files are needed, one each for the miRNA, isoform, and clinical data.
miRNA:
- Data Category -- Transcriptome Profiling
- Data Type -- miRNA Expression Quantification
- Experimental Strategy -- miRNA-Seq
Click Manifest and save the file as gdc....-miRNA.txt
Isoform:
- Data Category -- Transcriptome Profiling
- Data Type -- Isoform Expression Quantification
- Experimental Strategy -- miRNA-Seq
Click Manifest and save the file as gdc....-isoform.txt
Clinical data:
- Data Category -- Clinical
- Data Format -- BCR XML
Click Manifest and save the file as gdc....-clinical.txt
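2. Inspect the data
Count the lines in the three manifest files. Each count is the number of samples plus 1, because the first line is the header (column names) rather than a data row.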
$ wc -l gdc_manifest.2020-01-06*
117 gdc_manifest.2020-01-06-clinical.txt
117 gdc_manifest.2020-01-06-isoform.txt
117 gdc_manifest.2020-01-06-miRNA.txt
351 total
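With 117 lines per file, each manifest lists 116 entries. A minimal sketch for getting that number directly, assuming the manifests are tab-separated text whose first line is the header; the function name count_manifest_entries is a hypothetical helper, not part of any GDC tool:

```shell
# Count data rows (one per file) in a GDC manifest, ignoring the header line
# and any blank lines. Assumes the first line is the column header.
count_manifest_entries() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Report the count for each manifest found in the current directory.
for f in gdc_manifest.*.txt; do
    [ -f "$f" ] || continue   # the glob matched nothing
    printf '%s\t%s\n' "$f" "$(count_manifest_entries "$f")"
done
```

Unlike a raw wc -l, this does not depend on whether the file ends with a trailing newline.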
3. Learn the download tool
https://gdc.cancer.gov/access-data/gdc-data-transfer-tool
Binary Distributions
Links to the binary distributions for supported platforms are provided below.
- gdc-client_v1.4.0_OSX_x64_10.12.6.zip
- gdc-client_v1.4.0_Windows_x64.zip
- gdc-client_v1.4.0_Ubuntu_x64.zip
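Use gdc-client, the tool provided on the official site. Download the build that matches your operating system, place it in your working directory, and unzip it.
View the help text: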
$ ./gdc-client --help
usage: gdc-client [-h] [--version] {download,upload,settings} ...
The Genomic Data Commons Command Line Client
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
commands:
{download,upload,settings}
for more information, specify -h after a command
download download data from the GDC
upload upload data to the GDC
settings display default settings
The tool has three subcommands; the one we need is download.
View the help for download:
$ ./gdc-client download --help
usage: gdc-client download [-h] [--debug] [--log-file LOG_FILE] [--color_off]
[-t TOKEN_FILE] [-d DIR] [-s server]
[--no-segment-md5sums] [--no-file-md5sum]
[-n N_PROCESSES]
[--http-chunk-size HTTP_CHUNK_SIZE]
[--save-interval SAVE_INTERVAL] [--no-verify]
[--no-related-files] [--no-annotations]
[--no-auto-retry] [--retry-amount RETRY_AMOUNT]
[--wait-time WAIT_TIME] [--latest] [--config FILE]
[-u] [-m MANIFEST]
[file_id [file_id ...]]
positional arguments:
file_id The GDC UUID of the file(s) to download
optional arguments:
-h, --help show this help message and exit
--debug Enable debug logging. If a failure occurs, the program
will stop.
--log-file LOG_FILE Save logs to file. Amount logged affected by --debug
--color_off Disable colored output
-t TOKEN_FILE, --token-file TOKEN_FILE
GDC API auth token file
-d DIR, --dir DIR Directory to download files to. Defaults to current
directory
-s server, --server server
The TCP server address server[:port]
--no-segment-md5sums Do not calculate inbound segment md5sums and/or do not
verify md5sums on restart
--no-file-md5sum Do not verify file md5sum after download
-n N_PROCESSES, --n-processes N_PROCESSES
Number of client connections.
--http-chunk-size HTTP_CHUNK_SIZE, -c HTTP_CHUNK_SIZE
Size in bytes of standard HTTP block size.
--save-interval SAVE_INTERVAL
The number of chunks after which to flush state file.
A lower save interval will result in more frequent
printout but lower performance.
--no-verify Perform insecure SSL connection and transfer
--no-related-files Do not download related files.
--no-annotations Do not download annotations.
--no-auto-retry Ask before retrying to download a file
--retry-amount RETRY_AMOUNT
Number of times to retry a download
--wait-time WAIT_TIME
Amount of seconds to wait before retrying
--latest Download latest version of a file if it exists
--config FILE Path to INI-type config file
-u, --udt Use the UDT protocol.
-m MANIFEST, --manifest MANIFEST
GDC download manifest file
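Two options are useful here: -d and -m. -d sets the directory to download into, and -m gives the manifest file that lists what to download.
So the download commands are: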
mkdir clinical
mkdir mirna
mkdir isoform
./gdc-client download -m gdc_manifest.2020-01-06-clinical.txt -d clinical
./gdc-client download -m gdc_manifest.2020-01-06-miRNA.txt -d mirna
./gdc-client download -m gdc_manifest.2020-01-06-isoform.txt -d isoform
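The six commands above follow one pattern, so they can be folded into a small loop. This is only a sketch under a few assumptions: the client sits at ./gdc-client (overridable through a GDC_CLIENT variable introduced here for illustration), the manifests use the file names shown above, and download_manifest is a hypothetical helper name. It uses only the -m and -d options from the help text.

```shell
# Path to the client binary; override if it lives elsewhere (assumption).
GDC_CLIENT="${GDC_CLIENT:-./gdc-client}"
# Manifest file name prefix used in this walkthrough.
MANIFEST_PREFIX="gdc_manifest.2020-01-06"

# download_manifest KIND OUTDIR
# Downloads everything listed in ${MANIFEST_PREFIX}-KIND.txt into OUTDIR.
download_manifest() {
    manifest="${MANIFEST_PREFIX}-$1.txt"
    outdir="$2"
    if [ ! -f "$manifest" ]; then
        echo "skip: $manifest not found" >&2
        return 1
    fi
    mkdir -p "$outdir"   # -p: no error if the directory already exists
    "$GDC_CLIENT" download -m "$manifest" -d "$outdir"
}

# kind:directory pairs matching the three commands above
for pair in clinical:clinical miRNA:mirna isoform:isoform; do
    download_manifest "${pair%%:*}" "${pair#*:}" || true
done
```

A missing manifest is skipped with a message instead of stopping the loop, so one absent file does not block the other two downloads.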
The first gdc-client command (the clinical manifest) finishes with output like this:
(base) Cheng-MacBook-Pro:Downloads chelsea$ mkdir clinical
(base) Cheng-MacBook-Pro:Downloads chelsea$ mkdir mirna
(base) Cheng-MacBook-Pro:Downloads chelsea$ mkdir isoform
(base) Cheng-MacBook-Pro:Downloads chelsea$ ./gdc-client download -m gdc_manifest.2020-01-06-clinical.txt -d clinical
100% [##############################################] Time: 0:00:06 0.15 B/s
100% [##############################################] Time: 0:00:07 0.13 B/s
100% [##############################################] Time: 0:00:07 0.13 B/s
100% [##############################################] Time: 0:00:08 0.12 B/s
100% [##############################################] Time: 0:00:03 0.28 B/s
100% [##############################################] Time: 0:00:02 15.51 kB/s
100% [##############################################] Time: 0:00:01 23.08 kB/s
100% [##############################################] Time: 0:00:02 11.91 kB/s
100% [##############################################] Time: 0:00:02 16.33 kB/s
100% [##############################################] Time: 0:00:01 32.72 kB/s
100% [##############################################] Time: 0:00:02 16.35 kB/s
Successfully downloaded: 116
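The 116 reported here matches the 117 manifest lines minus the header. To check individual files rather than only the total, something like the following could be used. It is a sketch that assumes the first manifest column holds the file UUID and that gdc-client places each file in a sub-directory named after that UUID; missing_downloads is a hypothetical helper name.

```shell
# Print manifest IDs that have no matching sub-directory under the download
# directory. Assumption: column 1 of the manifest is the file UUID and
# gdc-client stores each file in <dir>/<UUID>/.
missing_downloads() {
    manifest="$1"
    dir="$2"
    awk -F '\t' 'NR > 1 && $1 != "" { print $1 }' "$manifest" |
    while IFS= read -r id; do
        [ -d "$dir/$id" ] || printf '%s\n' "$id"
    done
}

# Check the clinical download if its manifest is present.
if [ -f gdc_manifest.2020-01-06-clinical.txt ]; then
    missing_downloads gdc_manifest.2020-01-06-clinical.txt clinical
fi
```

Empty output means every manifest entry has a directory; any IDs printed could be passed back to gdc-client download as file_id arguments.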
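小洁 has written up how to handle the XML files in later notes. Two tools are relevant.
One is xml2, a package written by Hadley Wickham: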
https://blog.rstudio.com/2015/04/21/xml2/
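The other is the R package TCGAbiolinks, which can process the XML files downloaded from TCGA and extract information from them.
See the TCGA data processing series.
Reference:
Author: 小洁忘了怎么分身
Link: https://www.jianshu.com/p/559d9604fcdf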