2021-03-22 Raw transcriptome sequencing data: downloading raw data with ascp

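GEO and SRA are linked, but GEO itself does not store fastq files; it only holds processed downstream results submitted by the authors, such as FPKM values.
For TCGA, access to the raw fastq data requires an application, but count data are available.
Raw data therefore have to be downloaded from one of the three major sequence archives: SRA (NCBI, USA), ENA (Europe), or DDBJ (Japan).
Papers usually cite a GEO accession, but downloading with ascp requires an Aspera link. GEO does not provide one, and neither does the linked SRA record, so we turn to ENA, which is linked to SRA.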

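Walkthrough with example data

Right-click to open the GEO record in a new tab (GEO is also one of the NCBI databases):
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778

https://www.ncbi.nlm.nih.gov/bioproject/PRJNA229998
The BioProject page links to the other databases.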

SRA database:
https://www.ncbi.nlm.nih.gov/sra?term=SRP033351
Each sample has one record.

Open a record and look mainly at the Library section.

Open the ENA database:
Enter the BioProject accession (delete any leading space);
click View to search.

Click Show Column Selection and tick the columns you need.

Click TSV to download the report.

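study_accession: project (study) accession;
sample_accession: sample accession;
secondary_sample_accession: the SRA accession of the sample, available because the three major nucleotide databases are linked;
experiment_accession: experiment accession;
run_accession: run accession;
tax_id: NCBI taxonomy ID of the species (the species name itself is in scientific_name);
sra_aspera: link for downloading the SRA file with Aspera;
sra_md5: MD5 checksum of the SRA file, used to verify the download.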
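
The same report can probably be fetched without the browser. The sketch below only builds the request URL; it assumes the ENA Portal API `filereport` endpoint with `result=read_run`, and the helper name `ena_report_url` and the default field list are made up for illustration. The field names should match the columns ticked under Show Column Selection, but check them against the current ENA documentation before relying on them.

```shell
#!/usr/bin/env bash
# Build the URL of an ENA run-level file report (endpoint layout is an assumption).
# Usage: ena_report_url ACCESSION [COMMA_SEPARATED_FIELDS]
ena_report_url() {
    local accession="$1"
    local fields="${2:-run_accession,sra_md5,sra_aspera}"   # hypothetical default selection
    printf 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=%s&result=read_run&fields=%s&format=tsv\n' \
        "$accession" "$fields"
}

# The download itself needs network access, so it is left commented out:
# curl -sS "$(ena_report_url PRJNA229998)" -o filereport_read_run_PRJNA229998_tsv.txt
ena_report_url PRJNA229998
```

Building the URL separately from the download makes it easy to inspect the request before running it.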

(base) Mar23 23:11:03 ~
$ pwd
/trainee2/Mar23
(base) Mar23 23:22:31 ~
$ ln -s /teach/t_rna/data/airway/sra/filereport_read_run_PRJNA229998_tsv.txt ./
(base) Mar23 23:23:25 ~
$ ls
biosoft   filereport_read_run_PRJNA229998_tsv.txt  project_backup
catfile   miniconda3                               readme.txt
Data      Miniconda3-latest-Linux-x86_64.sh        t_linux
database  pipline
download  project
(base) Mar23 23:23:36 ~
$ ll
total 34380
drwxr-xr-x 13 Mar23 trainee     4096 Apr  6 23:23 ./
drwxr-xr-x 28 root  root        4096 Apr  6 23:24 ../
-rw-------  1 Mar23 Mar23      40859 Apr  6 23:23 .bash_history
-rw-r--r--  1 Mar23 root        4512 Mar 22 12:43 .bashrc
-rw-r--r--  1 Mar23 Mar23      16384 Mar 27 21:37 .bashrc.swp
drwxrwxr-x  5 Mar23 Mar23       4096 Mar 27 21:54 biosoft/
drwx------  2 Mar23 Mar23       4096 Mar 20 13:04 .cache/
-rw-rw-r--  1 Mar23 Mar23          0 Mar 20 22:57 catfile
drwxrwxr-x  2 Mar23 Mar23       4096 Mar 22 12:42 .conda/
-rw-rw-r--  1 Mar23 Mar23        255 Mar 26 22:51 .condarc
drwxrwxr-x  2 Mar23 Mar23       4096 Mar 25 22:27 .continuum/
drwxr-xr-x  3 Mar23 Mar23       4096 Apr  4 23:19 Data/
drwxrwxr-x  2 Mar23 Mar23       4096 Apr  2 23:15 database/
-rw-rw-r--  1 Mar23 Mar23   35050467 Mar 27 01:21 download
lrwxrwxrwx  1 Mar23 Mar23         68 Apr  6 23:23 filereport_read_run_PRJNA229998_tsv.txt -> /teach/t_rna/data/airway/sra/filereport_read_run_PRJNA229998_tsv.txt
drwxrwxr-x 18 Mar23 Mar23       4096 Mar 26 22:30 miniconda3/
lrwxrwxrwx  1 Mar23 Mar23         48 Mar 23 19:43 Miniconda3-latest-Linux-x86_64.sh -> /teach/t_linux/Miniconda3-latest-Linux-x86_64.sh
drwx------  2 Mar23 Mar23       4096 Mar 26 23:24 .ncbi/
drwxrwxr-x  2 Mar23 Mar23       4096 Apr  2 23:15 pipline/
-rw-r--r--  1 Mar23 root         655 Mar 15 07:18 .profile
drwxrwxr-x  2 Mar23 Mar23       4096 Apr  4 23:16 project/
drwxrwxr-x  2 Mar23 Mar23       4096 Apr  2 23:16 project_backup/
-rw-r--r--  1 Mar23 root         206 Mar 15 07:18 readme.txt
lrwxrwxrwx  1 Mar23 Mar23         13 Mar 20 20:34 t_linux -> /home/t_linux/
-rw-------  1 Mar23 Mar23       4628 Mar 27 23:03 .viminfo
-rw-rw-r--  1 Mar23 Mar23        329 Mar 27 01:21 .wget-hsts
(base) Mar23 23:24:13 ~
$ mv filereport_read_run_PRJNA229998_tsv.txt Data/rawdata/sra/
(base) Mar23 23:25:20 ~
$ cd Data/rawdata/sra/
(base) Mar23 23:25:31 ~/Data/rawdata/sra
$ ls
filereport_read_run_PRJNA229998_tsv.txt
(base) Mar23 23:25:33 ~/Data/rawdata/sra
$ head -n 1 filereport_read_run_PRJNA229998_tsv.txt 
study_accession sample_accession    experiment_accession    run_accession   tax_id  scientific_name fastq_md5   fastq_aspera    submitted_ftp   sra_bytes   sra_md5 sra_ftp sra_aspera
(base) Mar23 23:25:59 ~/Data/rawdata/sra
$ head -n 1 filereport_read_run_PRJNA229998_tsv.txt | tr '\t' '\n'
study_accession
sample_accession
experiment_accession
run_accession
tax_id
scientific_name
fastq_md5
fastq_aspera
submitted_ftp
sra_bytes
sra_md5
sra_ftp
sra_aspera
(base) Mar23 23:26:21 ~/Data/rawdata/sra
$ head -n 1 filereport_read_run_PRJNA229998_tsv.txt | tr '\t' '\n'| cat -n 
     1  study_accession
     2  sample_accession
     3  experiment_accession
     4  run_accession
     5  tax_id
     6  scientific_name
     7  fastq_md5
     8  fastq_aspera
     9  submitted_ftp
    10  sra_bytes
    11  sra_md5
    12  sra_ftp
    13  sra_aspera
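
Numbering the header this way is what makes `cut -f 13` possible, but the column number changes whenever a different set of columns is ticked on the ENA page. A minimal sketch of selecting a column by its header name instead; `tsv_column` is a hypothetical helper, not part of the original workflow.

```shell
#!/usr/bin/env bash
# Print one column of a tab-separated file, chosen by header name.
# The header row itself is skipped; an unknown name prints nothing.
# Usage: tsv_column FILE COLUMN_NAME
tsv_column() {
    awk -F'\t' -v name="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
        col     { print $col }
    ' "$1"
}

# Example: tsv_column filereport_read_run_PRJNA229998_tsv.txt sra_aspera > sra.url
```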
(base) Mar23 23:26:27 ~/Data/rawdata/sra
$ less -S filereport_read_run_PRJNA229998_tsv.txt | cut -f 13
sra_aspera
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/008/SRR1039508
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/009/SRR1039509
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/000/SRR1039510
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/001/SRR1039511
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/002/SRR1039512
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/003/SRR1039513
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/004/SRR1039514
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/005/SRR1039515
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/006/SRR1039516
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/007/SRR1039517
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/008/SRR1039518
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/009/SRR1039519
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/000/SRR1039520
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/001/SRR1039521
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/002/SRR1039522
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/003/SRR1039523
(base) Mar23 23:28:39 ~/Data/rawdata/sra
$ less -S filereport_read_run_PRJNA229998_tsv.txt | cut -f 13 | awk 'NR>1{print}'
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/008/SRR1039508
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/009/SRR1039509
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/000/SRR1039510
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/001/SRR1039511
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/002/SRR1039512
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/003/SRR1039513
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/004/SRR1039514
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/005/SRR1039515
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/006/SRR1039516
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/007/SRR1039517
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/008/SRR1039518
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/009/SRR1039519
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/000/SRR1039520
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/001/SRR1039521
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/002/SRR1039522
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/003/SRR1039523
(base) Mar23 23:28:56 ~/Data/rawdata/sra
$ less -S filereport_read_run_PRJNA229998_tsv.txt | cut -f 13 | awk 'NR>1{print}' >sra.url
(base) Mar23 23:29:52 ~/Data/rawdata/sra
$ cat -A sra.url 
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/008/SRR1039508$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/009/SRR1039509$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/000/SRR1039510$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/001/SRR1039511$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/002/SRR1039512$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/003/SRR1039513$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/004/SRR1039514$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/005/SRR1039515$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/006/SRR1039516$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/007/SRR1039517$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/008/SRR1039518$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/009/SRR1039519$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/000/SRR1039520$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/001/SRR1039521$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/002/SRR1039522$
fasp.sra.ebi.ac.uk:/vol1/srr/SRR103/003/SRR1039523$
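
With a clean `sra.url` in hand, the download step could look roughly like the sketch below. Everything specific here is an assumption: the private key path depends on how Aspera was installed, the `300m` rate cap is arbitrary, `era-fasp` is the account ENA has used for Aspera access, and `ascp_batch`, `ASPERA_KEY` and `DRY_RUN` are names invented for this example. With `DRY_RUN=1` the function only prints the commands.

```shell
#!/usr/bin/env bash
# Print (DRY_RUN=1) or run one ascp command per line of a URL list such as sra.url.
# ASSUMPTIONS: key path, rate cap and ENA user name; adjust to your installation.
# Usage: ascp_batch URL_LIST [OUTPUT_DIR]
ascp_batch() {
    local url_list="$1" outdir="${2:-.}"
    local key="${ASPERA_KEY:-$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh}"
    local run="${DRY_RUN:+echo}"    # expands to "echo" only when DRY_RUN is non-empty
    local url
    while IFS= read -r url; do
        [ -n "$url" ] || continue   # skip blank lines
        # -Q fair policy, -T no encryption, -l rate cap, -P ssh port, -k 1 resume
        $run ascp -QT -l 300m -P 33001 -k 1 -i "$key" "era-fasp@${url}" "$outdir"
    done < "$url_list"
}

# Preview the commands first:
# DRY_RUN=1 ascp_batch sra.url .
```

Previewing with `DRY_RUN=1` is cheap insurance against a wrong key path or a malformed line in the URL list.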

If something goes wrong, press Ctrl+C to exit.

sed -i "s/\s*$//": strips trailing whitespace from each line
$: matches the end of a line
^: matches the start of a line
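
The end-of-line anchor matters because stray trailing characters in a URL list break downloads: `cat -A` would show them as a space or `^M` before the final `$`. A small sketch of the cleanup, using the POSIX class `[[:space:]]` rather than `\s` so that it also covers carriage returns; `strip_trailing_ws` is a hypothetical helper and writes to standard output instead of editing in place.

```shell
#!/usr/bin/env bash
# Strip trailing whitespace (spaces, tabs, carriage returns) from every line.
# Usage: strip_trailing_ws FILE > CLEANED_FILE
strip_trailing_ws() {
    sed 's/[[:space:]]*$//' "$1"
}

# Example: strip_trailing_ws sra.url > sra.clean.url && cat -A sra.clean.url
```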

(base) Mar23 21:16:19 ~/Data/rawdata/sra
$ cat filereport_read_run_PRJNA229998_tsv.txt | cut -f 11,4
run_accession   sra_md5
SRR1039508  b55775f72aa66e2632adf9d5a5bf0e84
SRR1039509  436e98885ef79c0c722a8065aedc1bc4
SRR1039510  b7af76fb67fa0424f7fe763bb447e330
SRR1039511  97ee3a81fc3efdb96368ac5e283f31b1
SRR1039512  4498c5b7ecef41896eb86741aa92acde
SRR1039513  83262fe5042240e0f746e4c370e1a3ed
SRR1039514  c88901e8e32fb0b1f1751a6cb73fff64
SRR1039515  813c7d6f3ebb53f39c381ca5c09f70e3
SRR1039516  141e4b140ddd1b45e468255a4edf4609
SRR1039517  5fa424d477310838e1d65073908acb2c
SRR1039518  fc03abf20ea8a455e2595c4e15a6a78c
SRR1039519  8facbd57cafd8d5059ad992e5c027815
SRR1039520  dd724892de776dc5b3e30771d92e0916
SRR1039521  3f6532f491497ab9c7132b0624961a85
SRR1039522  80489e66e342ea35163025c68c2cb7ab
SRR1039523  3f05d4761772965d2d25997ff34db371

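For md5sum -c, the checksum (sra_md5) must come first and the file name (run_accession) second, so the two columns are swapped with awk: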

(base) Mar23 21:16:39 ~/Data/rawdata/sra
$ cat filereport_read_run_PRJNA229998_tsv.txt | awk 'NR>1{print$11"  "$4}'
b55775f72aa66e2632adf9d5a5bf0e84  SRR1039508
436e98885ef79c0c722a8065aedc1bc4  SRR1039509
b7af76fb67fa0424f7fe763bb447e330  SRR1039510
97ee3a81fc3efdb96368ac5e283f31b1  SRR1039511
4498c5b7ecef41896eb86741aa92acde  SRR1039512
83262fe5042240e0f746e4c370e1a3ed  SRR1039513
c88901e8e32fb0b1f1751a6cb73fff64  SRR1039514
813c7d6f3ebb53f39c381ca5c09f70e3  SRR1039515
141e4b140ddd1b45e468255a4edf4609  SRR1039516
5fa424d477310838e1d65073908acb2c  SRR1039517
fc03abf20ea8a455e2595c4e15a6a78c  SRR1039518
8facbd57cafd8d5059ad992e5c027815  SRR1039519
dd724892de776dc5b3e30771d92e0916  SRR1039520
3f6532f491497ab9c7132b0624961a85  SRR1039521
80489e66e342ea35163025c68c2cb7ab  SRR1039522
3f05d4761772965d2d25997ff34db371  SRR1039523
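
The layout printed here, checksum, two spaces, file name, is the one `md5sum` itself writes and the one `md5sum -c` reads back. A toy round trip, run in a throwaway directory so nothing in the project is touched, illustrates this; `demo_md5_roundtrip` and the file names are invented for the example.

```shell
#!/usr/bin/env bash
# Write a checksum file with md5sum, then verify it with md5sum -c.
demo_md5_roundtrip() {
    local dir
    dir=$(mktemp -d) || return 1
    (
        cd "$dir" || exit 1
        printf 'toy data\n' > toy_run
        md5sum toy_run > toy.md5    # one line: "<32 hex digits>  toy_run"
        md5sum -c toy.md5           # prints "toy_run: OK"
    )
    local status=$?                 # exit status of the subshell above
    rm -rf "$dir"
    return "$status"
}

demo_md5_roundtrip
```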

Data integrity check: verifying MD5 checksums

(rna) Mar23 21:20:08 ~/Data/rawdata/sra
$ ln -s /teach/t_rna/data/airway/sra/SRR103951* ./ # symlink the files from the instructor's directory into the current directory
(rna) Mar23 21:21:13 ~/Data/rawdata/sra
$ ls
filereport_read_run_PRJNA229998_tsv.txt  SRR1039510  SRR1039512
sra.url                                  SRR1039511
(rna) Mar23 21:21:30 ~/Data/rawdata/sra
$ cat filereport_read_run_PRJNA229998_tsv.txt | awk 'NR>1{print$11"  "$4}' >md5.txt
(rna) Mar23 21:23:22 ~/Data/rawdata/sra
$ cat md5.txt
b55775f72aa66e2632adf9d5a5bf0e84  SRR1039508
436e98885ef79c0c722a8065aedc1bc4  SRR1039509
b7af76fb67fa0424f7fe763bb447e330  SRR1039510
97ee3a81fc3efdb96368ac5e283f31b1  SRR1039511
4498c5b7ecef41896eb86741aa92acde  SRR1039512
83262fe5042240e0f746e4c370e1a3ed  SRR1039513
c88901e8e32fb0b1f1751a6cb73fff64  SRR1039514
813c7d6f3ebb53f39c381ca5c09f70e3  SRR1039515
141e4b140ddd1b45e468255a4edf4609  SRR1039516
5fa424d477310838e1d65073908acb2c  SRR1039517
fc03abf20ea8a455e2595c4e15a6a78c  SRR1039518
8facbd57cafd8d5059ad992e5c027815  SRR1039519
dd724892de776dc5b3e30771d92e0916  SRR1039520
3f6532f491497ab9c7132b0624961a85  SRR1039521
80489e66e342ea35163025c68c2cb7ab  SRR1039522
3f05d4761772965d2d25997ff34db371  SRR1039523
(rna) Mar23 21:23:37 ~/Data/rawdata/sra
$ cat filereport_read_run_PRJNA229998_tsv.txt | awk 'NR>3&&NR<7{print$11"  "$4}'
b7af76fb67fa0424f7fe763bb447e330  SRR1039510
97ee3a81fc3efdb96368ac5e283f31b1  SRR1039511
4498c5b7ecef41896eb86741aa92acde  SRR1039512
(rna) Mar23 21:27:35 ~/Data/rawdata/sra
$ cat filereport_read_run_PRJNA229998_tsv.txt | awk 'NR>3&&NR<7{print$11"  "$4}' >md5.txt 
(rna) Mar23 21:30:56 ~/Data/rawdata/sra
$ ls
filereport_read_run_PRJNA229998_tsv.txt  sra.url     SRR1039511
md5.txt                                  SRR1039510  SRR1039512
(rna) Mar23 21:31:11 ~/Data/rawdata/sra
$ ll
total 20
drwxrwxr-x 2 Mar23 Mar23 4096 Apr 10 21:23 ./
drwxrwxr-x 3 Mar23 Mar23 4096 Apr  4 23:19 ../
lrwxrwxrwx 1 Mar23 Mar23   68 Apr  6 23:23 filereport_read_run_PRJNA229998_tsv.txt -> /teach/t_rna/data/airway/sra/filereport_read_run_PRJNA229998_tsv.txt
-rw-rw-r-- 1 Mar23 Mar23  135 Apr 10 21:30 md5.txt
-rw-rw-r-- 1 Mar23 Mar23  816 Apr  6 23:29 sra.url
lrwxrwxrwx 1 Mar23 Mar23   39 Apr 10 21:21 SRR1039510 -> /teach/t_rna/data/airway/sra/SRR1039510
lrwxrwxrwx 1 Mar23 Mar23   39 Apr 10 21:21 SRR1039511 -> /teach/t_rna/data/airway/sra/SRR1039511
lrwxrwxrwx 1 Mar23 Mar23   39 Apr 10 21:21 SRR1039512 -> /teach/t_rna/data/airway/sra/SRR1039512
(rna) Mar23 21:32:21 ~/Data/rawdata/sra
$ cat md5.txt 
b7af76fb67fa0424f7fe763bb447e330  SRR1039510
97ee3a81fc3efdb96368ac5e283f31b1  SRR1039511
4498c5b7ecef41896eb86741aa92acde  SRR1039512
(rna) Mar23 21:33:40 ~/Data/rawdata/sra
$ md5sum -c md5.txt 
SRR1039510: OK
SRR1039511: OK
SRR1039512: OK
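
The `NR>3&&NR<7` filter above works only because the three linked runs happen to sit on rows 4 to 6 of the report. A possible alternative, sketched under the same column layout (run_accession in column 4, sra_md5 in column 11), keeps only the runs that actually exist in the current directory; `md5_list_for_local_runs` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Emit "md5  filename" lines only for runs present in the current directory.
# Assumes column 4 = run_accession and column 11 = sra_md5, as in the report above.
# Usage: md5_list_for_local_runs REPORT.tsv > md5.txt
md5_list_for_local_runs() {
    awk -F'\t' 'NR > 1 { print $11 "\t" $4 }' "$1" |
    while IFS=$'\t' read -r md5 run; do
        if [ -e "$run" ]; then
            printf '%s  %s\n' "$md5" "$run"
        fi
    done
}

# Example: md5_list_for_local_runs filereport_read_run_PRJNA229998_tsv.txt > md5.txt && md5sum -c md5.txt
```

Without such a filter, `md5sum -c` reports every run that has not been downloaded yet as a failed read.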

MD5 checksums when uploading data

Generating MD5 checksums

Printing the checksums to the screen

(rna) Mar23 22:30:39 ~/Data/rawdata/sra
$ md5sum filereport_read_run_PRJNA229998_tsv.txt 
553c8bb68676be08026e6dc5950c429f  filereport_read_run_PRJNA229998_tsv.txt
(rna) Mar23 22:36:39 ~/Data/rawdata/sra
$ md5sum SRR*
b7af76fb67fa0424f7fe763bb447e330  SRR1039510
97ee3a81fc3efdb96368ac5e283f31b1  SRR1039511
4498c5b7ecef41896eb86741aa92acde  SRR1039512

Saving the checksums to a file

(rna) Mar23 22:37:36 ~/Data/rawdata/sra
$ md5sum SRR* >raw_md5.txt &  # & sends the job to the background
[1] 28129
(rna) Mar23 22:39:23 ~/Data/rawdata/sra
$ jobs 
[1]+  Done                    md5sum SRR* > raw_md5.txt
(rna) Mar23 22:39:51 ~/Data/rawdata/sra
$ ll
total 28
drwxrwxr-x 2 Mar23 Mar23 4096 Apr 10 22:39 ./
drwxrwxr-x 3 Mar23 Mar23 4096 Apr  4 23:19 ../
-rw-rw-r-- 1 Mar23 Mar23   45 Apr 10 22:25 CHECK
lrwxrwxrwx 1 Mar23 Mar23   68 Apr  6 23:23 filereport_read_run_PRJNA229998_tsv.txt -> /teach/t_rna/data/airway/sra/filereport_read_run_PRJNA229998_tsv.txt
-rw-rw-r-- 1 Mar23 Mar23  720 Apr 10 22:30 md5.txt
-rw-rw-r-- 1 Mar23 Mar23  135 Apr 10 22:39 raw_md5.txt
-rw-rw-r-- 1 Mar23 Mar23  816 Apr  6 23:29 sra.url
lrwxrwxrwx 1 Mar23 Mar23   39 Apr 10 21:21 SRR1039510 -> /teach/t_rna/data/airway/sra/SRR1039510
lrwxrwxrwx 1 Mar23 Mar23   39 Apr 10 21:21 SRR1039511 -> /teach/t_rna/data/airway/sra/SRR1039511
lrwxrwxrwx 1 Mar23 Mar23   39 Apr 10 21:21 SRR1039512 -> /teach/t_rna/data/airway/sra/SRR1039512
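
In an interactive shell, `jobs` is enough to see that the background job has finished. In a script, one way to get the same guarantee is to remember the process ID from `$!` and `wait` for it, as in this sketch; `md5_in_background` is a name made up for the example.

```shell
#!/usr/bin/env bash
# Start md5sum in the background, then block until it finishes.
# The function returns md5sum's exit status via wait.
# Usage: md5_in_background OUTPUT_FILE FILE...
md5_in_background() {
    local out="$1"
    shift
    md5sum "$@" > "$out" &
    local pid=$!        # process ID of the background job just started
    wait "$pid"
}

# Example: md5_in_background raw_md5.txt SRR*
```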

Generating checksums from the parent directory

(rna) Mar23 22:40:08 ~/Data/rawdata/sra
$ cd ../
(rna) Mar23 22:41:30 ~/Data/rawdata
$ ls
sra
(rna) Mar23 22:41:40 ~/Data/rawdata
$ md5sum sra/SRR103951*
b7af76fb67fa0424f7fe763bb447e330  sra/SRR1039510
97ee3a81fc3efdb96368ac5e283f31b1  sra/SRR1039511
4498c5b7ecef41896eb86741aa92acde  sra/SRR1039512
(rna) Mar23 22:43:57 ~/Data/rawdata
$ ls sra/SRR103951*
sra/SRR1039510  sra/SRR1039511  sra/SRR1039512
(rna) Mar23 22:44:46 ~/Data/rawdata
$ pwd
/trainee2/Mar23/Data/rawdata
(rna) Mar23 22:48:33 ~/Data/rawdata
$ cd sra/
(rna) Mar23 22:48:38 ~/Data/rawdata/sra
$ ls
CHECK                                    raw_md5.txt  SRR1039511
filereport_read_run_PRJNA229998_tsv.txt  sra.url      SRR1039512
md5.txt                                  SRR1039510
(rna) Mar23 22:48:40 ~/Data/rawdata/sra
$ md5sum SRR103951*
b7af76fb67fa0424f7fe763bb447e330  SRR1039510
97ee3a81fc3efdb96368ac5e283f31b1  SRR1039511
4498c5b7ecef41896eb86741aa92acde  SRR1039512
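
The two runs differ only in the file names recorded: `sra/SRR1039510` from the parent directory versus `SRR1039510` from inside `sra/`. Since `md5sum -c` resolves those names relative to wherever it is run, it can help to fix the directory explicitly. A minimal sketch, with `md5_relative` as a hypothetical helper:

```shell
#!/usr/bin/env bash
# Generate checksums with file names relative to DIR, whatever the caller's directory.
# Usage: md5_relative DIR 'GLOB_PATTERN' > sums.md5
md5_relative() {
    local dir="$1" pattern="$2"
    # The subshell keeps the cd local; $pattern is unquoted on purpose so it globs inside DIR.
    ( cd "$dir" && md5sum -- $pattern )
}

# Example, from ~/Data/rawdata: md5_relative sra 'SRR103951*' > sra/raw_md5.txt
```

A checksum file produced this way can then be verified with `md5sum -c` from inside that same directory.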
