Converting SRA files to FASTQ with sratoolkit on Ubuntu
Environment: Ubuntu 14.04
sratoolkit.2.5.5-ubuntu64
1. Download
Download page:
http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software#
2. Convert SRA to FASTQ:
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump SRR003161
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ls
SRR002664.fastq SRR002664.sra SRR003161.fastq SRR003161.sra
For the data files, see: http://blog.csdn.net/xubo245/article/details/50507222
3. View the FASTQ:
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ more SRR003161.fastq
@SRR003161.1 FEKQ5UX01AS5XC length=124
TCAGATGCAATCATCGAATGGTCTCGAATGGAATCNTCTANAGAGATGGAATGTATCNCTCGCCANACGACACNCGAACAGGGNAAGGCAAGCAGNAGGNAGNNNANNNNNNNNNNNNNNNNNN
+SRR003161.1 FEKQ5UX01AS5XC length=124
AAAAAAAAAAAAAAAA:::BAAFAABAAB?>>=44!39=<!:866699888220862!08:8002!0200000!022200800!20660000600!000!06!!!6!!!!!!!!!!!!!!!!!!
@SRR003161.2 FEKQ5UX01AOE96 length=505
TCAGTTTGAGATGGAGTTTCATTCTTGTTGCCCAGGCTGGAGTGCAATGGCGCAATCTCAGCTCACAGCAACCTCCGCCTCCCGGGTTCAAGCGATTCTCCTGCCTCAGCCTCTCGAGTAGCTGGGATTACAGGCATGCACCATCACGCCCAGCTAATTTGCATTTTTTATTAGAGATGGGGTTTCTCCAC
ATTGGTCAGGCTGATCTCGAACTCCTGACCTCAGGTGATCTGCCTGCCTTGGCCTCCCAAAGTGCTGGGATTACAGGCATGAGCCTGAGCCCAACCTATTTACTTTCAATCCATCTTTTCAATAACTTAAATACAAGTGTCAATATATACAATCTTTTCCTCCCTGGTTATCAAGCTTTCTAATATATATG
GATGTATCTTCCAAGGTTTTTGATCCCATTTTACTTTACAGGCTCACTGCTGTGGAACCCAGAGAGCAGTCTCTTTTCAAGGNGGGCTGAGACNCGCAACAGGGGATTAGGCCAAGGCNCAGG
+SRR003161.2 FEKQ5UX01AOE96 length=505
CCCCCCCCCCCCCCCC@@@CCCFEEEFEEG888EEEFFEEEEFGGGGGGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCA<777@@CCCBCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCAAACCCCCCCCCCCCCCCCCCCCCCC:93339@A>77//39AC666666C22CAAAA93333///7-0017
>9999>>A???ACCCCCCC2239322>9977<?????CCCCCCCCC877777777111111::::5555:555:::::::::;:555:;;::::0040-----***--467::::;;;;;;:::511155555:555:::;::::::7777744-------///245::;;;::::::;;;;;;;;:5555
4774----------44-----064---------6---522451115247644255-----,4---24464422---------!,,,4464224!11:::7:::111111--7777---!----
@SRR003161.3 FEKQ5UX01ARXN7 length=645
TCAGCATGCTAGACAGAAGAATTCTCAGTAACTTTCTTTGTGCTGTGTGTATTCAACTCACAGAGTTGGAACCGTTCCTTTGTCAACAGAGCTAGAATTTGAAACCNCTCTTGAGGACTACGCGAAANAGGGGANAAGGTCCAAAGGCCAGTANAGGGNTCGGANGTANAAGATNCTNAAAATAAAACNGA
NAGAATCATTCTNAAGAAACTTNTTGNATGTNTGCCCTTTCAAACTCAACAGGAGTTTACCAAACCTTTTCTTTTCTAAAGGAGACTAAGGTTTTAAGAAAACCACTTACTCGGTCTTTGGTTAATGTCTGCAAAGGTGGATTATTGGACCTTCTTGAGGTCCCTTTCGTTGCGTAAAACCGGGGTTTCTT
CCTTTCACTTAGTCGTACGTAACGTAAACGTAAAAGGTAAAGGTTACGTTACGTTAACGTTTAAACGTTTTTTTAACGTTTTGGTTTGGTTTGGTTGTTAGTTTACTTAACCTTAACCTAACCTAAACGTAAAGGTTTAACGGTTAAACCGTTAACGTTACGTTTAACGTTAAGGTAAGGAAGGACGAGTA
AGTTAAGTTAAACTAAACTACTAGTAGACGACGACAACGAAGGAGAGAGAGACGACACGAGGAGGAGNGNNN
+SRR003161.3 FEKQ5UX01ARXN7 length=645
AAAAAAAAAAAAAAAAAAAAAAIFAABA?7792222.,,:3<<<<:0222276:220::20020028662222022000002,220006666=9000669600000!0699788...4877873...!,.333.!......4447........!....!....4!...!..66.!..!....4+++*.!..
!.33333686--!---------!--3!332,!,,,,,,,,*,,,,2,,,,,,,,,2,,,,,,,,,,,,.,,((((,(,,,,,),,,,,,,,,,..000----,,(,,,,,,,,,,,,,,,,)),,,,,10,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,),,,1..,,,,,,,,,,,,))
,,,,,,,,,,,,,,,,,,,03330,,,,,,,)))),,0(((,,,,,100,,,,,,,,0,,,,,,-03,----)))),,'''',,(((,,))),,)),,,,,,,,,,))00,,,,,,,,000,,,,,,,,,))),,,,,)),,,)),,,,,0,,)),,11133-,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,-,,))),,),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10000,,,,,,,,,,!,!!!
@SRR003161.4 FEKQ5UX01AMUAT length=587
TCAGGTTTGGAATGTGGGCTCTGAAGCCATACAACACAGTTTCTACTCTTTATCTTACACCTCCTGACTTTGTGACATTGGTTAAATATTTTATTTATTATNNCATAACTTACTACTTTGTTAAATTAGAAGTACGACTGTCTACACTCTTAGGTAGTTGGTCTGTTGAAATTAAATAATAGNACTTTAAC
TTACTTAAATAGANATACACACGACTTAGTTAGTTGTTGGCTGGAAATTAGGTATNTGTTTTAGTTCCTACACCTTACTTAACCCTAACCTACCATNTAATACTTTTACTTGTTCTCNGANANATNATAGTNTCTACGTTGAGTATATTACTTATATTACACGGTACGACGGACCGACGTCGTACACGTCT
CGTCTTCTNCNANNATGTAGTGAGTCTNTTTATTNTTTCTTAACTACTACTACTCGTTGTAGTAAGTAATAATAANTNNTCTACACCTACGACTGTATTGTAAGTACAAGAAGGACCGACGTTTCGTTACCTTTCTTCTTCGTCCTCTACTTAACCTGTTACTACGTACGCGAACACGGACGTAGGAGGAG
GAGGACACGAACGG
+SRR003161.4 FEKQ5UX01AMUAT length=587
AAAAAAAAAAAAAAAAAAAAAAIEEAIIIIIIAAIIIA:666AAE???<<<@AA===A=>>AAAAAAAAAAAAA?@???980000040....0/**04490!!00000600.........,,.....,.....74..............33.....7.....4..............++664!.000000.
135855----*--!3------------33,,,,,,,,,2222222,,,,*,,,,,!,,,,,,3,((,,,,00,,,,,,,,,,,,,1,,)),,,,01!333001,,,,03((,,,,,,!,,!,!,,!,,3,,!,1,,,,,,,,,,,,,,,,,,,,,,,,3,,,,,433,,,,,,,,13,,,,,,,,,04,,,
,,,,,,,,!,!,!!,,,,10,,,311,!,,,1))!,,,,)),,30,,,0330,,,,,,003333,,,,,0003,,!,!!,,01,,,033,,,,,1,,,,,,,,00,,,,,,,,,1331313/.,,,)),,,,,,,)),,,,,,,,,,010,,,,,,,,,,3303,,,,0000000,,,,03,,,,,0,,,,
,,,,34333,,,,,
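In the file itself each FASTQ record takes four lines: an @ header, the sequence, a + line, and a quality string with one character per base. A minimal sketch for sanity-checking a converted file under that assumption follows. `check_fastq` and the made-up sample are hypothetical and not part of sratoolkit.

```shell
# Hypothetical helper (not part of sratoolkit): count the records in a FASTQ
# file and flag records whose sequence and quality lines differ in length.
# Assumes the plain four-line record layout.
check_fastq() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { bad++ }   # header must start with @
        NR % 4 == 2 { seqlen = length($0) }                # remember sequence length
        NR % 4 == 3 && substr($0, 1, 1) != "+" { bad++ }   # separator must start with +
        NR % 4 == 0 { n++; if (length($0) != seqlen) bad++ }
        END { printf "%d records, %d problems\n", n, bad }
    ' "$1"
}

# Tiny made-up sample so the sketch runs without any downloaded data.
sample=$(mktemp)
printf '%s\n' '@demo.1 length=8' 'TCAGATGC' '+demo.1 length=8' 'AAAA:::B' \
              '@demo.2 length=4' 'TCAG' '+demo.2 length=4' 'CCC' > "$sample"
check_fastq "$sample"
```

On the made-up sample the second record is counted as a problem because its quality string is one character shorter than its sequence. On real data the same call would take a file such as SRR003161.fastq as its argument.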
4. Convert SRA to FASTA:
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --fasta 20 SRR003161
2016-01-13T05:33:42 fastq-dump.2.5.5 err: timeout exhausted while reading file within network system module - failed SRR003161
=============================================================
An error occurred during processing.
A report was generated into the file '/home/hadoop/ncbi_error_report.xml'.
If the problem persists, you may consider sending the file
to '[email protected]' for assistance.
=============================================================
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ more SRR003161.fasta
>SRR003161.1 FEKQ5UX01AS5XC length=124
TCAGATGCAATCATCGAATG
GTCTCGAATGGAATCNTCTA
NAGAGATGGAATGTATCNCT
CGCCANACGACACNCGAACA
GGGNAAGGCAAGCAGNAGGN
AGNNNANNNNNNNNNNNNNN
NNNN
>SRR003161.2 FEKQ5UX01AOE96 length=505
TCAGTTTGAGATGGAGTTTC
ATTCTTGTTGCCCAGGCTGG
AGTGCAATGGCGCAATCTCA
GCTCACAGCAACCTCCGCCT
CCCGGGTTCAAGCGATTCTC
CTGCCTCAGCCTCTCGAGTA
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --fasta 50 SRR003161
2016-01-13T05:36:52 fastq-dump.2.5.5 err: timeout exhausted while reading file within network system module - failed SRR003161
=============================================================
An error occurred during processing.
A report was generated into the file '/home/hadoop/ncbi_error_report.xml'.
If the problem persists, you may consider sending the file
to '[email protected]' for assistance.
=============================================================
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ls
SRR002664.fastq SRR002664.sra SRR003161.fasta SRR003161.fastq SRR003161.sra
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ more SRR003161.fasta
>SRR003161.1 FEKQ5UX01AS5XC length=124
TCAGATGCAATCATCGAATGGTCTCGAATGGAATCNTCTANAGAGATGGA
ATGTATCNCTCGCCANACGACACNCGAACAGGGNAAGGCAAGCAGNAGGN
AGNNNANNNNNNNNNNNNNNNNNN
>SRR003161.2 FEKQ5UX01AOE96 length=505
TCAGTTTGAGATGGAGTTTCATTCTTGTTGCCCAGGCTGGAGTGCAATGG
CGCAATCTCAGCTCACAGCAACCTCCGCCTCCCGGGTTCAAGCGATTCTC
CTGCCTCAGCCTCTCGAGTAGCTGGGATTACAGGCATGCACCATCACGCC
CAGCTAATTTGCATTTTTTATTAGAGATGGGGTTTCTCCACATTGGTCAG
GCTGATCTCGAACTCCTGACCTCAGGTGATCTGCCTGCCTTGGCCTCCCA
AAGTGCTGGGATTACAGGCATGAGCCTGAGCCCAACCTATTTACTTTCAA
TCCATCTTTTCAATAACTTAAATACAAGTGTCAATATATACAATCTTTTC
CTCCCTGGTTATCAAGCTTTCTAATATATATGGATGTATCTTCCAAGGTT
TTTGATCCCATTTTACTTTACAGGCTCACTGCTGTGGAACCCAGAGAGCA
GTCTCTTTTCAAGGNGGGCTGAGACNCGCAACAGGGGATTAGGCCAAGGC
NCAGG
>SRR003161.3 FEKQ5UX01ARXN7 length=645
TCAGCATGCTAGACAGAAGAATTCTCAGTAACTTTCTTTGTGCTGTGTGT
ATTCAACTCACAGAGTTGGAACCGTTCCTTTGTCAACAGAGCTAGAATTT
GAAACCNCTCTTGAGGACTACGCGAAANAGGGGANAAGGTCCAAAGGCCA
GTANAGGGNTCGGANGTANAAGATNCTNAAAATAAAACNGANAGAATCAT
TCTNAAGAAACTTNTTGNATGTNTGCCCTTTCAAACTCAACAGGAGTTTA
CCAAACCTTTTCTTTTCTAAAGGAGACTAAGGTTTTAAGAAAACCACTTA
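The number after --fasta only changes where the sequence lines are broken; the bases themselves are the same in both runs. A rough sketch of that wrapping, using the first read of SRR003161 from the output above, could look like this. `wrap_fasta` is a hypothetical helper written for illustration, not something sratoolkit provides.

```shell
# Hypothetical helper: re-wrap the sequence lines of a FASTA stream to a
# given width, leaving the ">" header lines untouched.
# Assumes each record arrives with its sequence on a single line.
wrap_fasta() {
    awk -v w="$1" '
        /^>/ { print; next }
        { for (i = 1; i <= length($0); i += w) print substr($0, i, w) }
    '
}

# First read of SRR003161 as shown in this post (length=124).
read1=TCAGATGCAATCATCGAATGGTCTCGAATGGAATCNTCTANAGAGATGGAATGTATCNCTCGCCANACGACACNCGAACAGGGNAAGGCAAGCAGNAGGNAGNNNANNNNNNNNNNNNNNNNNN

printf '%s\n' '>SRR003161.1 FEKQ5UX01AS5XC length=124' "$read1" | wrap_fasta 20
printf '%s\n' '>SRR003161.1 FEKQ5UX01AS5XC length=124' "$read1" | wrap_fasta 50
```

The two calls should reproduce the line breaks of the --fasta 20 and --fasta 50 listings for this read.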
Switching to a different dataset works.
Successful run (--fasta 50 writes 50 bases per line):
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --fasta 50 SRR002664
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ls
back SRR002664.fasta SRR002664.sra SRR003161.fasta SRR003161.fastq SRR003161.sra
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ll -h
total 986M
drwxrwxr-x 3 hadoop hadoop 4.0K 1月 13 13:40 ./
drwxrwxr-x 5 hadoop hadoop 4.0K 1月 12 21:31 ../
drwxrwxr-x 2 hadoop hadoop 4.0K 1月 13 13:39 back/
-rw-rw-r-- 1 hadoop hadoop 150M 1月 13 13:40 SRR002664.fasta
-rw-r--r-- 1 hadoop hadoop 17M 12月 15 22:13 SRR002664.sra
-rw-rw-r-- 1 hadoop hadoop 274M 1月 13 13:36 SRR003161.fasta
-rw-rw-r-- 1 hadoop hadoop 538M 1月 13 13:00 SRR003161.fastq
-rw-r--r-- 1 hadoop hadoop 9.0M 12月 15 23:12 SRR003161.sra
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ more SRR002664.fasta
>SRR002664.1 FC20KVN01EFCX9 length=192
TCAGCTCACGTCTGTAATCCTAGCATTTTGGGAGGCTGAGACGGGCAGAT
CACTTGAGGTCATGAGTTCGAGACCAGCCTGGCAACCATGGCGAAACCCT
GTCTCTACTAAAATACAAAATTAGCCAGGCATGGTGGCGCATGCCTGTCT
GAGACACGCAACAGGGGATAGGCAAGGCACACAGGGGATAGG
>SRR002664.2 FC20KVN01ELL46 length=127
TCAGCAAAGAAAACAAATTCCTTTCTGGCACCACCTCAAAGAAGAATTTC
Then verify with FASTQ output:
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump SRR002664
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
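Both runs report 487522 spots written, so the FASTA and FASTQ files for SRR002664 would be expected to hold the same number of records if each conversion writes one record per spot. A small sketch for comparing the counts follows. `count_fasta`, `count_fastq`, and the demo files are hypothetical, and the FASTQ count assumes four lines per record.

```shell
# Hypothetical helpers: count records in a FASTA file (one ">" header per
# record) and in a FASTQ file (assumed to use four lines per record).
count_fasta() { grep -c '^>' "$1"; }
count_fastq() { awk 'END { print NR / 4 }' "$1"; }

# Made-up demo files so the sketch runs without downloaded data.
workdir=$(mktemp -d)
printf '%s\n' '>demo.1 length=4' 'TCAG' '>demo.2 length=4' 'ACGT' > "$workdir/demo.fasta"
printf '%s\n' '@demo.1 length=4' 'TCAG' '+demo.1 length=4' 'AAAA' \
              '@demo.2 length=4' 'ACGT' '+demo.2 length=4' 'CCCC' > "$workdir/demo.fastq"

if [ "$(count_fasta "$workdir/demo.fasta")" = "$(count_fastq "$workdir/demo.fastq")" ]; then
    echo "record counts match"
else
    echo "record counts differ"
fi
```

A FASTQ line count that is not a multiple of four makes `count_fastq` print a fraction, which is itself a hint that the file is truncated or not in the four-line layout.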
5. Split
Separate the reads of a paired-end sequencing run into different files.
(1) --split-files generates two FASTQ files
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --split-files SRR002664
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ll -h
total 924M
drwxrwxr-x 3 hadoop hadoop 4.0K 1月 13 14:05 ./
drwxrwxr-x 5 hadoop hadoop 4.0K 1月 12 21:31 ../
drwxrwxr-x 2 hadoop hadoop 4.0K 1月 13 13:52 back/
-rw-rw-r-- 1 hadoop hadoop 44M 1月 13 14:05 SRR002664_1.fastq
-rw-rw-r-- 1 hadoop hadoop 291M 1月 13 14:05 SRR002664_2.fastq
-rw-rw-r-- 1 hadoop hadoop 291M 1月 13 14:02 SRR002664.fastq
-rw-r--r-- 1 hadoop hadoop 17M 12月 15 22:13 SRR002664.sra
-rw-rw-r-- 1 hadoop hadoop 274M 1月 13 13:56 SRR003161.fasta
-rw-r--r-- 1 hadoop hadoop 9.0M 12月 15 23:12 SRR003161.sra
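After splitting, a useful check before downstream tools consume the _1 and _2 files is whether both list the same spots in the same order. The following is a minimal sketch. It assumes the header layout seen in this post, where the first field is the accession plus spot number, and four lines per record; `spot_names`, `pairs_in_sync`, and the demo files are hypothetical.

```shell
# Hypothetical helper: print the first field of every header line,
# i.e. the "@accession.spot" part, assuming four-line records.
spot_names() { awk 'NR % 4 == 1 { print $1 }' "$1"; }

# Succeeds when both files list the same spots in the same order.
# Holds all names in memory, which is fine for a sketch.
pairs_in_sync() { [ "$(spot_names "$1")" = "$(spot_names "$2")" ]; }

# Made-up demo files: a matching pair and one file missing a spot.
workdir=$(mktemp -d)
printf '%s\n' '@demo.1 length=4' 'TCAG' '+demo.1 length=4' 'AAAA' \
              '@demo.2 length=4' 'TCAG' '+demo.2 length=4' 'AAAA' > "$workdir/demo_1.fastq"
printf '%s\n' '@demo.1 length=6' 'ACGTAC' '+demo.1 length=6' 'CCCCCC' \
              '@demo.2 length=6' 'GGATCC' '+demo.2 length=6' 'CCCCCC' > "$workdir/demo_2.fastq"
printf '%s\n' '@demo.2 length=6' 'GGATCC' '+demo.2 length=6' 'CCCCCC' > "$workdir/demo_orphan.fastq"

for other in "$workdir/demo_2.fastq" "$workdir/demo_orphan.fastq"; do
    if pairs_in_sync "$workdir/demo_1.fastq" "$other"; then
        echo "$other: in sync"
    else
        echo "$other: out of sync"
    fi
done
```

If the headers were written with read numbers appended to the spot name, the first fields of the two files would differ and the comparison would need to strip that suffix first.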
(2) --split-3
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --split-3 SRR002664
Rejected 487522 READS because of filtering out non-biological READS
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ll
total 1192100
drwxrwxr-x 3 hadoop hadoop 4096 1月 13 14:21 ./
drwxrwxr-x 5 hadoop hadoop 4096 1月 12 21:31 ../
drwxrwxr-x 2 hadoop hadoop 4096 1月 13 14:21 back/
-rw-rw-r-- 1 hadoop hadoop 304893796 1月 13 14:21 SRR002664.fastq
-rw-r--r-- 1 hadoop hadoop 16874064 12月 15 22:13 SRR002664.sra
-rw-rw-r-- 1 hadoop hadoop 42893052 1月 13 14:16 SRR003161_1.fastq
-rw-rw-r-- 1 hadoop hadoop 559892770 1月 13 14:16 SRR003161_2.fastq
-rw-rw-r-- 1 hadoop hadoop 286773153 1月 13 13:56 SRR003161.fasta
-rw-r--r-- 1 hadoop hadoop 9353980 12月 15 23:12 SRR003161.sra
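The --split-3 option is documented as follows:
Legacy 3-file splitting for mate-pairs: First biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq. If only one biological read is present it is placed in *.fastq. Biological reads 3 and above are ignored.
In other words, if the SRA file holds only one biological read per spot, the option does no splitting and everything goes to *.fastq. If it holds paired reads, they are separated into *_1.fastq and *_2.fastq. If a third file, *.fastq, also appears, it contains the reads that have no mate, possibly because the data were filtered before submission and part of them was removed.
Adapted from reference 【4】.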
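The rule in the help text quoted above can be written out as a small decision function. This is only a sketch of the documented behavior, not sratoolkit code. `split3_targets` and the `DEMO` accession are hypothetical, and the number of biological reads per spot is passed in by hand.

```shell
# Hypothetical helper: name the output file(s) a spot would be written to
# under --split-3, given the accession and the number of biological reads
# in that spot. Mirrors the help text quoted above.
split3_targets() {
    acc=$1
    nreads=$2
    if [ "$nreads" -ge 2 ]; then
        # reads 1 and 2 go to the paired files; reads 3 and above are ignored
        echo "${acc}_1.fastq ${acc}_2.fastq"
    elif [ "$nreads" -eq 1 ]; then
        # a single biological read goes to the unpaired file
        echo "${acc}.fastq"
    fi
}

split3_targets SRR002664 1
split3_targets DEMO 2
split3_targets DEMO 3
```

Under this reading, a run in which every spot has a single biological read ends up entirely in *.fastq, which would be consistent with the transcript above, where --split-3 on SRR002664 produced only SRR002664.fastq.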
(3) --split-spot
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --split-spot SRR002664
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ll
total 1236636
drwxrwxr-x 3 hadoop hadoop 4096 1月 13 14:53 ./
drwxrwxr-x 5 hadoop hadoop 4096 1月 12 21:31 ../
drwxrwxr-x 2 hadoop hadoop 4096 1月 13 14:21 back/
-rw-rw-r-- 1 hadoop hadoop 350498654 1月 13 14:54 SRR002664.fastq
-rw-r--r-- 1 hadoop hadoop 16874064 12月 15 22:13 SRR002664.sra
-rw-rw-r-- 1 hadoop hadoop 42893052 1月 13 14:16 SRR003161_1.fastq
-rw-rw-r-- 1 hadoop hadoop 559892770 1月 13 14:16 SRR003161_2.fastq
-rw-rw-r-- 1 hadoop hadoop 286773153 1月 13 13:56 SRR003161.fasta
-rw-r--r-- 1 hadoop hadoop 9353980 12月 15 23:12 SRR003161.sra
--split-spot: Split spots into individual reads.
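Conceptually, a spot is one stretch of sequence that covers several reads, and splitting it means cutting that stretch at the read boundaries. A toy sketch of the idea follows. `split_spot` is hypothetical, and the read lengths are passed by hand here, whereas a real run takes the boundaries from the SRA file itself.

```shell
# Hypothetical helper: cut one spot sequence into consecutive reads.
# $1 is the spot sequence, the remaining arguments are the read lengths.
split_spot() {
    spot=$1
    shift
    printf '%s\n' "$spot" | awk -v lens="$*" '{
        n = split(lens, len, " ")
        pos = 1
        for (i = 1; i <= n; i++) {
            print substr($0, pos, len[i])   # one read per output line
            pos += len[i]
        }
    }'
}

# Made-up 12-base spot holding a 4-base read followed by an 8-base read.
split_spot TCAGACGTACGT 4 8
```

Each read then becomes its own record with its own header and quality lines, which may be part of why SRR002664.fastq is larger after --split-spot (350498654 bytes) than in the earlier listing (304893796 bytes), although the transcripts alone do not confirm this.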
References:
【1】 http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc&f=fastq-dump
【2】 http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc
【3】 http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software#
【4】 http://www.bbioo.com/lifesciences/40-112832-1.html