Converting SRA files to FASTQ with sratoolkit on Ubuntu
Environment: Ubuntu 14.04
sratoolkit.2.5.5-ubuntu64
1. Download
Download URL:
http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software#
2. Convert SRA to FASTQ:
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump SRR003161
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ls
SRR002664.fastq SRR002664.sra SRR003161.fastq SRR003161.sra
For the data files, see: http://blog.csdn.net/xubo245/article/details/50507222
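The command above passes a bare accession (SRR003161) to fastq-dump. fastq-dump also accepts a path to a local .sra file, so a directory of downloaded files can be converted in a loop. The following is a minimal sketch, not the procedure used in this post: the function name sra_to_fastq and the FASTQ_DUMP variable are hypothetical, and the sketch assumes fastq-dump is on PATH unless FASTQ_DUMP names the binary.

```shell
# Hypothetical helper (not part of sratoolkit): convert every .sra file
# in a directory to FASTQ, writing the output next to the input.
# Assumes fastq-dump is on PATH, or that FASTQ_DUMP names the binary.
sra_to_fastq() {
    dir="$1"
    for sra in "$dir"/*.sra; do
        # Skip the unexpanded pattern when the directory has no .sra files.
        [ -e "$sra" ] || continue
        # --outdir selects the directory the .fastq file is written to.
        "${FASTQ_DUMP:-fastq-dump}" --outdir "$dir" "$sra" || return 1
    done
}
```

With the directory layout used here it could be called as FASTQ_DUMP=../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump sra_to_fastq . from inside the SRA directory.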
3. Inspect the FASTQ file:
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ more SRR003161.fastq
@SRR003161.1 FEKQ5UX01AS5XC length=124
TCAGATGCAATCATCGAATGGTCTCGAATGGAATCNTCTANAGAGATGGAATGTATCNCTCGCCANACGACACNCGAACAGGGNAAGGCAAGCAGNAGGNAGNNNANNNNNNNNNNNNNNNNNN
+SRR003161.1 FEKQ5UX01AS5XC length=124
AAAAAAAAAAAAAAAA:::BAAFAABAAB?>>=44!39=77//39AC666666C22CAAAA93333///7-0017
>9999>>A???ACCCCCCC2239322>9977????CCCCCCCCC877777777111111::::5555:555:::::::::;:555:;;::::0040-----***--467::::;;;;;;:::511155555:555:::;::::::7777744-------///245::;;;::::::;;;;;;;;:5555
4774----------44-----064---------6---522451115247644255-----,4---24464422---------!,,,4464224!11:::7:::111111--7777---!----
@SRR003161.3 FEKQ5UX01ARXN7 length=645
TCAGCATGCTAGACAGAAGAATTCTCAGTAACTTTCTTTGTGCTGTGTGTATTCAACTCACAGAGTTGGAACCGTTCCTTTGTCAACAGAGCTAGAATTTGAAACCNCTCTTGAGGACTACGCGAAANAGGGGANAAGGTCCAAAGGCCAGTANAGGGNTCGGANGTANAAGATNCTNAAAATAAAACNGA
NAGAATCATTCTNAAGAAACTTNTTGNATGTNTGCCCTTTCAAACTCAACAGGAGTTTACCAAACCTTTTCTTTTCTAAAGGAGACTAAGGTTTTAAGAAAACCACTTACTCGGTCTTTGGTTAATGTCTGCAAAGGTGGATTATTGGACCTTCTTGAGGTCCCTTTCGTTGCGTAAAACCGGGGTTTCTT
CCTTTCACTTAGTCGTACGTAACGTAAACGTAAAAGGTAAAGGTTACGTTACGTTAACGTTTAAACGTTTTTTTAACGTTTTGGTTTGGTTTGGTTGTTAGTTTACTTAACCTTAACCTAACCTAAACGTAAAGGTTTAACGGTTAAACCGTTAACGTTACGTTTAACGTTAAGGTAAGGAAGGACGAGTA
AGTTAAGTTAAACTAAACTACTAGTAGACGACGACAACGAAGGAGAGAGAGACGACACGAGGAGGAGNGNNN
+SRR003161.3 FEKQ5UX01ARXN7 length=645
AAAAAAAAAAAAAAAAAAAAAAIFAABA?7792222.,,:3<<<<:0222276:220::20020028662222022000002,220006666=9000669600000!0699788...4877873...!,.333.!......4447........!....!....4!...!..66.!..!....4+++*.!..
!.33333686--!---------!--3!332,!,,,,,,,,*,,,,2,,,,,,,,,2,,,,,,,,,,,,.,,((((,(,,,,,),,,,,,,,,,..000----,,(,,,,,,,,,,,,,,,,)),,,,,10,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,),,,1..,,,,,,,,,,,,))
,,,,,,,,,,,,,,,,,,,03330,,,,,,,)))),,0(((,,,,,100,,,,,,,,0,,,,,,-03,----)))),,'''',,(((,,))),,)),,,,,,,,,,))00,,,,,,,,000,,,,,,,,,))),,,,,)),,,)),,,,,0,,)),,11133-,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,-,,))),,),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10000,,,,,,,,,,!,!!!
@SRR003161.4 FEKQ5UX01AMUAT length=587
TCAGGTTTGGAATGTGGGCTCTGAAGCCATACAACACAGTTTCTACTCTTTATCTTACACCTCCTGACTTTGTGACATTGGTTAAATATTTTATTTATTATNNCATAACTTACTACTTTGTTAAATTAGAAGTACGACTGTCTACACTCTTAGGTAGTTGGTCTGTTGAAATTAAATAATAGNACTTTAAC
TTACTTAAATAGANATACACACGACTTAGTTAGTTGTTGGCTGGAAATTAGGTATNTGTTTTAGTTCCTACACCTTACTTAACCCTAACCTACCATNTAATACTTTTACTTGTTCTCNGANANATNATAGTNTCTACGTTGAGTATATTACTTATATTACACGGTACGACGGACCGACGTCGTACACGTCT
CGTCTTCTNCNANNATGTAGTGAGTCTNTTTATTNTTTCTTAACTACTACTACTCGTTGTAGTAAGTAATAATAANTNNTCTACACCTACGACTGTATTGTAAGTACAAGAAGGACCGACGTTTCGTTACCTTTCTTCTTCGTCCTCTACTTAACCTGTTACTACGTACGCGAACACGGACGTAGGAGGAG
GAGGACACGAACGG
+SRR003161.4 FEKQ5UX01AMUAT length=587
AAAAAAAAAAAAAAAAAAAAAAIEEAIIIIIIAAIIIA:666AAE???<<<@AA===A=>>AAAAAAAAAAAAA?@???980000040....0/**04490!!00000600.........,,.....,.....74..............33.....7.....4..............++664!.000000.
135855----*--!3------------33,,,,,,,,,2222222,,,,*,,,,,!,,,,,,3,((,,,,00,,,,,,,,,,,,,1,,)),,,,01!333001,,,,03((,,,,,,!,,!,!,,!,,3,,!,1,,,,,,,,,,,,,,,,,,,,,,,,3,,,,,433,,,,,,,,13,,,,,,,,,04,,,
,,,,,,,,!,!,!!,,,,10,,,311,!,,,1))!,,,,)),,30,,,0330,,,,,,003333,,,,,0003,,!,!!,,01,,,033,,,,,1,,,,,,,,00,,,,,,,,,1331313/.,,,)),,,,,,,)),,,,,,,,,,010,,,,,,,,,,3303,,,,0000000,,,,03,,,,,0,,,,
,,,,34333,,,,,
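Each FASTQ record spans four lines: an @ header, the bases, a + separator, and a quality string with one character per base (the long records above appear to be wrapped by the terminal). Paging through a 538M file with more is not a practical way to check that, so here is a minimal sketch of a structural check. The function name fastq_check is hypothetical, and the sketch assumes unwrapped four-line records.

```shell
# Hypothetical helper: sanity-check the four-line FASTQ record layout.
# Prints "<records> <problems>", where problems counts layout violations.
fastq_check() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { bad++ }   # header line
        NR % 4 == 2 { len = length($0) }                    # sequence line
        NR % 4 == 3 && substr($0, 1, 1) != "+" { bad++ }   # separator line
        NR % 4 == 0 && length($0) != len { bad++ }          # quality line
        END {
            if (NR % 4 != 0) bad++                          # truncated record
            printf "%d %d\n", NR / 4, bad
        }' "$1"
}
```

Running fastq_check SRR003161.fastq prints the record count followed by the number of problems found; a second number of 0 means every record had the expected shape.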
4. Convert SRA to FASTA:
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --fasta 20 SRR003161
2016-01-13T05:33:42 fastq-dump.2.5.5 err: timeout exhausted while reading file within network system module - failed SRR003161
=============================================================
An error occurred during processing.
A report was generated into the file '/home/hadoop/ncbi_error_report.xml'.
If the problem persists, you may consider sending the file
to 'sra@ncbi.nlm.nih.gov' for assistance.
=============================================================
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ more SRR003161.fasta
>SRR003161.1 FEKQ5UX01AS5XC length=124
TCAGATGCAATCATCGAATG
GTCTCGAATGGAATCNTCTA
NAGAGATGGAATGTATCNCT
CGCCANACGACACNCGAACA
GGGNAAGGCAAGCAGNAGGN
AGNNNANNNNNNNNNNNNNN
NNNN
>SRR003161.2 FEKQ5UX01AOE96 length=505
TCAGTTTGAGATGGAGTTTC
ATTCTTGTTGCCCAGGCTGG
AGTGCAATGGCGCAATCTCA
GCTCACAGCAACCTCCGCCT
CCCGGGTTCAAGCGATTCTC
CTGCCTCAGCCTCTCGAGTA
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --fasta 50 SRR003161
2016-01-13T05:36:52 fastq-dump.2.5.5 err: timeout exhausted while reading file within network system module - failed SRR003161
=============================================================
An error occurred during processing.
A report was generated into the file '/home/hadoop/ncbi_error_report.xml'.
If the problem persists, you may consider sending the file
to 'sra@ncbi.nlm.nih.gov' for assistance.
=============================================================
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ls
SRR002664.fastq SRR002664.sra SRR003161.fasta SRR003161.fastq SRR003161.sra
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ more SRR003161.fasta
>SRR003161.1 FEKQ5UX01AS5XC length=124
TCAGATGCAATCATCGAATGGTCTCGAATGGAATCNTCTANAGAGATGGA
ATGTATCNCTCGCCANACGACACNCGAACAGGGNAAGGCAAGCAGNAGGN
AGNNNANNNNNNNNNNNNNNNNNN
>SRR003161.2 FEKQ5UX01AOE96 length=505
TCAGTTTGAGATGGAGTTTCATTCTTGTTGCCCAGGCTGGAGTGCAATGG
CGCAATCTCAGCTCACAGCAACCTCCGCCTCCCGGGTTCAAGCGATTCTC
CTGCCTCAGCCTCTCGAGTAGCTGGGATTACAGGCATGCACCATCACGCC
CAGCTAATTTGCATTTTTTATTAGAGATGGGGTTTCTCCACATTGGTCAG
GCTGATCTCGAACTCCTGACCTCAGGTGATCTGCCTGCCTTGGCCTCCCA
AAGTGCTGGGATTACAGGCATGAGCCTGAGCCCAACCTATTTACTTTCAA
TCCATCTTTTCAATAACTTAAATACAAGTGTCAATATATACAATCTTTTC
CTCCCTGGTTATCAAGCTTTCTAATATATATGGATGTATCTTCCAAGGTT
TTTGATCCCATTTTACTTTACAGGCTCACTGCTGTGGAACCCAGAGAGCA
GTCTCTTTTCAAGGNGGGCTGAGACNCGCAACAGGGGATTAGGCCAAGGC
NCAGG
>SRR003161.3 FEKQ5UX01ARXN7 length=645
TCAGCATGCTAGACAGAAGAATTCTCAGTAACTTTCTTTGTGCTGTGTGT
ATTCAACTCACAGAGTTGGAACCGTTCCTTTGTCAACAGAGCTAGAATTT
GAAACCNCTCTTGAGGACTACGCGAAANAGGGGANAAGGTCCAAAGGCCA
GTANAGGGNTCGGANGTANAAGATNCTNAAAATAAAACNGANAGAATCAT
TCTNAAGAAACTTNTTGNATGTNTGCCCTTTCAAACTCAACAGGAGTTTA
CCAAACCTTTTCTTTTCTAAAGGAGACTAAGGTTTTAAGAAAACCACTTA
Switching to a different dataset worked.
Successful run (--fasta 50 writes 50 bases per line):
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --fasta 50 SRR002664
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ls
back SRR002664.fasta SRR002664.sra SRR003161.fasta SRR003161.fastq SRR003161.sra
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ll -h
total 986M
drwxrwxr-x 3 hadoop hadoop 4.0K 1月 13 13:40 ./
drwxrwxr-x 5 hadoop hadoop 4.0K 1月 12 21:31 ../
drwxrwxr-x 2 hadoop hadoop 4.0K 1月 13 13:39 back/
-rw-rw-r-- 1 hadoop hadoop 150M 1月 13 13:40 SRR002664.fasta
-rw-r--r-- 1 hadoop hadoop 17M 12月 15 22:13 SRR002664.sra
-rw-rw-r-- 1 hadoop hadoop 274M 1月 13 13:36 SRR003161.fasta
-rw-rw-r-- 1 hadoop hadoop 538M 1月 13 13:00 SRR003161.fastq
-rw-r--r-- 1 hadoop hadoop 9.0M 12月 15 23:12 SRR003161.sra
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ more SRR002664.fasta
>SRR002664.1 FC20KVN01EFCX9 length=192
TCAGCTCACGTCTGTAATCCTAGCATTTTGGGAGGCTGAGACGGGCAGAT
CACTTGAGGTCATGAGTTCGAGACCAGCCTGGCAACCATGGCGAAACCCT
GTCTCTACTAAAATACAAAATTAGCCAGGCATGGTGGCGCATGCCTGTCT
GAGACACGCAACAGGGGATAGGCAAGGCACACAGGGGATAGG
>SRR002664.2 FC20KVN01ELL46 length=127
TCAGCAAAGAAAACAAATTCCTTTCTGGCACCACCTCAAAGAAGAATTTC
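The number after --fasta is the line width: with 20, the 124-base read SRR003161.1 spans seven lines, and with 50 it spans three. The same wrapping can be reproduced on any sequence with the standard fold utility. This is only a sketch to illustrate the layout; the function name wrap_seq is hypothetical.

```shell
# Hypothetical helper: wrap a sequence string at a fixed number of columns,
# mimicking the layout that "--fasta 50" produced above.
wrap_seq() {
    # $1 = sequence, $2 = bases per line
    printf '%s\n' "$1" | fold -w "$2"
}
```

For example, wrap_seq ACGTAC 4 prints ACGT on the first line and AC on the second.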
Verify again with FASTQ output:
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump SRR002664
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
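Both runs report 487522 spots for SRR002664, so the FASTA and FASTQ outputs would be expected to hold the same number of records. A minimal sketch for checking that from the files themselves follows. The function names fasta_count and fastq_count are hypothetical, and the FASTQ count assumes unwrapped four-line records.

```shell
# Hypothetical helpers: count records in FASTA and FASTQ files.
fasta_count() {
    # Every FASTA record starts with a ">" header line.
    grep -c '^>' "$1"
}

fastq_count() {
    # Count lines rather than "@" headers, because a quality string
    # may itself begin with "@". Assumes four lines per record.
    awk 'END { printf "%d\n", NR / 4 }' "$1"
}
```

If the conversions completed, fasta_count SRR002664.fasta and fastq_count SRR002664.fastq should print the same number as the Written line.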
5. Splitting
Separate paired-end reads into separate files.
(1) --split-files generates two FASTQ files
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --split-files SRR002664
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ll -h
total 924M
drwxrwxr-x 3 hadoop hadoop 4.0K 1月 13 14:05 ./
drwxrwxr-x 5 hadoop hadoop 4.0K 1月 12 21:31 ../
drwxrwxr-x 2 hadoop hadoop 4.0K 1月 13 13:52 back/
-rw-rw-r-- 1 hadoop hadoop 44M 1月 13 14:05 SRR002664_1.fastq
-rw-rw-r-- 1 hadoop hadoop 291M 1月 13 14:05 SRR002664_2.fastq
-rw-rw-r-- 1 hadoop hadoop 291M 1月 13 14:02 SRR002664.fastq
-rw-r--r-- 1 hadoop hadoop 17M 12月 15 22:13 SRR002664.sra
-rw-rw-r-- 1 hadoop hadoop 274M 1月 13 13:56 SRR003161.fasta
-rw-r--r-- 1 hadoop hadoop 9.0M 12月 15 23:12 SRR003161.sra
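The two files written by --split-files differ a lot in size (44M versus 291M), which suggests that the first read of each spot is much shorter than the second. A minimal sketch for summarizing read lengths per file follows. The function name read_length_stats is hypothetical, and four-line FASTQ records are assumed.

```shell
# Hypothetical helper: print "<reads> <mean length>" for a FASTQ file.
read_length_stats() {
    awk 'NR % 4 == 2 { n++; total += length($0) }   # sequence lines only
         END { if (n) printf "%d %.1f\n", n, total / n }' "$1"
}
```

Running it on SRR002664_1.fastq and SRR002664_2.fastq in turn shows how the bases are distributed between the two files.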
(2) --split-3
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --split-3 SRR002664
Rejected 487522 READS because of filtering out non-biological READS
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ll
total 1192100
drwxrwxr-x 3 hadoop hadoop 4096 1月 13 14:21 ./
drwxrwxr-x 5 hadoop hadoop 4096 1月 12 21:31 ../
drwxrwxr-x 2 hadoop hadoop 4096 1月 13 14:21 back/
-rw-rw-r-- 1 hadoop hadoop 304893796 1月 13 14:21 SRR002664.fastq
-rw-r--r-- 1 hadoop hadoop 16874064 12月 15 22:13 SRR002664.sra
-rw-rw-r-- 1 hadoop hadoop 42893052 1月 13 14:16 SRR003161_1.fastq
-rw-rw-r-- 1 hadoop hadoop 559892770 1月 13 14:16 SRR003161_2.fastq
-rw-rw-r-- 1 hadoop hadoop 286773153 1月 13 13:56 SRR003161.fasta
-rw-r--r-- 1 hadoop hadoop 9353980 12月 15 23:12 SRR003161.sra
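The documentation describes the --split-3 option as follows:
Legacy 3-file splitting for mate-pairs: first biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq. If only one biological read is present it is placed in *.fastq. Biological reads 3 and above are ignored.
In other words, if each spot in the SRA file contains only one biological read, no splitting takes place and everything goes to a single *.fastq file. If spots contain two biological reads, the mates are written separately to *_1.fastq and *_2.fastq. If a third file, *.fastq, also appears, it holds the reads that have no mate, possibly because the data were filtered before submission and part of it was removed.
Adapted from reference 【4】.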
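When mates are written to *_1.fastq and *_2.fastq, downstream tools generally expect the two files to list the same spots in the same order. A minimal sketch of that check follows. The function name mates_in_sync is hypothetical, and the sketch assumes four-line records whose header starts with the spot ID (for example @SRR002664.1) followed by a space.

```shell
# Hypothetical helper: succeed only if two mate files list the same
# spot IDs in the same order. Assumes four-line FASTQ records.
mates_in_sync() {
    # paste joins line N of both files with a tab between them.
    paste "$1" "$2" | awk -F '\t' '
        NR % 4 == 1 {
            # Compare only the first word of each header (the spot ID).
            split($1, a, " "); split($2, b, " ")
            if (a[1] != b[1]) { bad = 1; exit }
        }
        END { exit bad + 0 }'
}
```

It can be used as mates_in_sync SRR002664_1.fastq SRR002664_2.fastq && echo paired. A file that is shorter than its mate also fails the check, because the missing header compares as empty.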
(3) --split-spot
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --split-spot SRR002664
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ll
total 1236636
drwxrwxr-x 3 hadoop hadoop 4096 1月 13 14:53 ./
drwxrwxr-x 5 hadoop hadoop 4096 1月 12 21:31 ../
drwxrwxr-x 2 hadoop hadoop 4096 1月 13 14:21 back/
-rw-rw-r-- 1 hadoop hadoop 350498654 1月 13 14:54 SRR002664.fastq
-rw-r--r-- 1 hadoop hadoop 16874064 12月 15 22:13 SRR002664.sra
-rw-rw-r-- 1 hadoop hadoop 42893052 1月 13 14:16 SRR003161_1.fastq
-rw-rw-r-- 1 hadoop hadoop 559892770 1月 13 14:16 SRR003161_2.fastq
-rw-rw-r-- 1 hadoop hadoop 286773153 1月 13 13:56 SRR003161.fasta
-rw-r--r-- 1 hadoop hadoop 9353980 12月 15 23:12 SRR003161.sra
--split-spot: Split spots into individual reads.
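With --split-spot, SRR002664.fastq grows from 304893796 to 350498654 bytes, which is consistent with each spot being written as more than one record in the same file. A minimal sketch for comparing the number of distinct spots with the number of records follows. The function name spot_summary is hypothetical, and it rests on the assumption that reads from one spot share the same header ID (no read ID appended).

```shell
# Hypothetical helper: print "<distinct spots> <records>" for a FASTQ file.
# Assumes reads split from one spot keep the same header ID.
spot_summary() {
    awk 'NR % 4 == 1 && !($1 in seen) { seen[$1] = 1; spots++ }
         END { printf "%d %d\n", spots, NR / 4 }' "$1"
}
```

If the second number is larger than the first, some spots were split into several reads.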
References:
【1】 http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc&f=fastq-dump
【2】 http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc
【3】 http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software#
【4】 http://www.bbioo.com/lifesciences/40-112832-1.html