Converting SRA files to FASTQ with sratoolkit on Ubuntu

Environment: Ubuntu 14.04, sratoolkit.2.5.5-ubuntu64


1. Download

Download URL:

http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software#


2. Convert SRA to FASTQ:

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump SRR003161
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ls
SRR002664.fastq  SRR002664.sra  SRR003161.fastq  SRR003161.sra
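
The same conversion can be scripted. The following is only a minimal sketch: the tool path is copied from the relative path in the transcript above and is an assumption about your layout, and build_command and convert are hypothetical helper names, not part of sratoolkit.

```python
import subprocess

# Assumed location of the tool, copied from the transcript above;
# adjust it to match your own installation.
FASTQ_DUMP = "../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump"


def build_command(accession, options=(), tool=FASTQ_DUMP):
    """Return the argument list for one fastq-dump invocation."""
    return [tool, *options, accession]


def convert(accession, options=(), tool=FASTQ_DUMP):
    """Run fastq-dump in the current directory; raise if it exits non-zero."""
    subprocess.run(build_command(accession, options, tool), check=True)


# Print the command instead of running it, so the sketch works without the tool.
print(" ".join(build_command("SRR003161")))
```

Calling convert("SRR003161") from the directory that holds SRR003161.sra would then correspond to the command shown above.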
 

For the data files, see: http://blog.csdn.net/xubo245/article/details/50507222
  

3. Inspect the FASTQ file:

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ more SRR003161.fastq 

@SRR003161.1 FEKQ5UX01AS5XC length=124
TCAGATGCAATCATCGAATGGTCTCGAATGGAATCNTCTANAGAGATGGAATGTATCNCTCGCCANACGACACNCGAACAGGGNAAGGCAAGCAGNAGGNAGNNNANNNNNNNNNNNNNNNNNN
+SRR003161.1 FEKQ5UX01AS5XC length=124
AAAAAAAAAAAAAAAA:::BAAFAABAAB?>>=44!39=77//39AC666666C22CAAAA93333///7-0017
>9999>>A???ACCCCCCC2239322>9977>AAAAAAAAAAAAA?@???980000040....0/**04490!!00000600.........,,.....,.....74..............33.....7.....4..............++664!.000000.
135855----*--!3------------33,,,,,,,,,2222222,,,,*,,,,,!,,,,,,3,((,,,,00,,,,,,,,,,,,,1,,)),,,,01!333001,,,,03((,,,,,,!,,!,!,,!,,3,,!,1,,,,,,,,,,,,,,,,,,,,,,,,3,,,,,433,,,,,,,,13,,,,,,,,,04,,,
,,,,,,,,!,!,!!,,,,10,,,311,!,,,1))!,,,,)),,30,,,0330,,,,,,003333,,,,,0003,,!,!!,,01,,,033,,,,,1,,,,,,,,00,,,,,,,,,1331313/.,,,)),,,,,,,)),,,,,,,,,,010,,,,,,,,,,3303,,,,0000000,,,,03,,,,,0,,,,
,,,,34333,,,,,
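
Instead of paging through the file with more, the records can be read programmatically. This is a sketch that assumes the plain four-line FASTQ layout (header, sequence, "+" line, quality) with no line wrapping; read_fastq is a hypothetical helper and the sample record is made up.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        block = [line.rstrip("\n") for line in islice(handle, 4)]
        if not block:
            return  # clean end of input
        if len(block) < 4 or not block[0].startswith("@") or not block[2].startswith("+"):
            raise ValueError("malformed or truncated FASTQ record")
        header, sequence, _, quality = block
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], sequence, quality


# Tiny made-up record, used only to exercise the parser.
sample = io.StringIO("@demo.1 length=8\nACGTNACG\n+demo.1 length=8\nIIII!III\n")
for name, seq, qual in read_fastq(sample):
    print(name, len(seq), qual)
```

For a real file, open it and wrap the generator in islice(read_fastq(handle), 3) to look at only the first three records.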

4. Convert SRA to FASTA:

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --fasta 20 SRR003161
2016-01-13T05:33:42 fastq-dump.2.5.5 err: timeout exhausted while reading file within network system module - failed SRR003161

=============================================================
An error occurred during processing.
A report was generated into the file '/home/hadoop/ncbi_error_report.xml'.
If the problem persists, you may consider sending the file
to '[email protected]' for assistance.
=============================================================
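
A failure like this is easy to miss in a script. The sketch below flags a run as failed when the exit status is non-zero or when any output line contains " err: ", as in the message above. Whether fastq-dump returns a non-zero status for this particular error is an assumption, and run_and_check is a hypothetical helper. The demo uses a stand-in command rather than the real tool.

```python
import subprocess
import sys


def run_and_check(command):
    """Run a command; return (ok, errors) where errors holds lines containing ' err: '."""
    result = subprocess.run(command, capture_output=True, text=True)
    output = result.stdout + result.stderr
    errors = [line for line in output.splitlines() if " err: " in line]
    return result.returncode == 0 and not errors, errors


# Stand-in command that imitates a failing tool, so the sketch runs anywhere.
fake = [sys.executable, "-c",
        "import sys; print('demo-tool err: simulated failure', file=sys.stderr); sys.exit(3)"]
ok, errors = run_and_check(fake)
print(ok, errors)
```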


hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ more SRR003161.fasta
>SRR003161.1 FEKQ5UX01AS5XC length=124
TCAGATGCAATCATCGAATG
GTCTCGAATGGAATCNTCTA
NAGAGATGGAATGTATCNCT
CGCCANACGACACNCGAACA
GGGNAAGGCAAGCAGNAGGN
AGNNNANNNNNNNNNNNNNN
NNNN
>SRR003161.2 FEKQ5UX01AOE96 length=505
TCAGTTTGAGATGGAGTTTC
ATTCTTGTTGCCCAGGCTGG
AGTGCAATGGCGCAATCTCA
GCTCACAGCAACCTCCGCCT
CCCGGGTTCAAGCGATTCTC
CTGCCTCAGCCTCTCGAGTA

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --fasta 50 SRR003161
2016-01-13T05:36:52 fastq-dump.2.5.5 err: timeout exhausted while reading file within network system module - failed SRR003161

=============================================================
An error occurred during processing.
A report was generated into the file '/home/hadoop/ncbi_error_report.xml'.
If the problem persists, you may consider sending the file
to '[email protected]' for assistance.
=============================================================

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ls
SRR002664.fastq  SRR002664.sra  SRR003161.fasta  SRR003161.fastq  SRR003161.sra
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ more SRR003161.fasta
>SRR003161.1 FEKQ5UX01AS5XC length=124
TCAGATGCAATCATCGAATGGTCTCGAATGGAATCNTCTANAGAGATGGA
ATGTATCNCTCGCCANACGACACNCGAACAGGGNAAGGCAAGCAGNAGGN
AGNNNANNNNNNNNNNNNNNNNNN
>SRR003161.2 FEKQ5UX01AOE96 length=505
TCAGTTTGAGATGGAGTTTCATTCTTGTTGCCCAGGCTGGAGTGCAATGG
CGCAATCTCAGCTCACAGCAACCTCCGCCTCCCGGGTTCAAGCGATTCTC
CTGCCTCAGCCTCTCGAGTAGCTGGGATTACAGGCATGCACCATCACGCC
CAGCTAATTTGCATTTTTTATTAGAGATGGGGTTTCTCCACATTGGTCAG
GCTGATCTCGAACTCCTGACCTCAGGTGATCTGCCTGCCTTGGCCTCCCA
AAGTGCTGGGATTACAGGCATGAGCCTGAGCCCAACCTATTTACTTTCAA
TCCATCTTTTCAATAACTTAAATACAAGTGTCAATATATACAATCTTTTC
CTCCCTGGTTATCAAGCTTTCTAATATATATGGATGTATCTTCCAAGGTT
TTTGATCCCATTTTACTTTACAGGCTCACTGCTGTGGAACCCAGAGAGCA
GTCTCTTTTCAAGGNGGGCTGAGACNCGCAACAGGGGATTAGGCCAAGGC
NCAGG
>SRR003161.3 FEKQ5UX01ARXN7 length=645
TCAGCATGCTAGACAGAAGAATTCTCAGTAACTTTCTTTGTGCTGTGTGT
ATTCAACTCACAGAGTTGGAACCGTTCCTTTGTCAACAGAGCTAGAATTT
GAAACCNCTCTTGAGGACTACGCGAAANAGGGGANAAGGTCCAAAGGCCA
GTANAGGGNTCGGANGTANAAGATNCTNAAAATAAAACNGANAGAATCAT
TCTNAAGAAACTTNTTGNATGTNTGCCCTTTCAAACTCAACAGGAGTTTA
CCAAACCTTTTCTTTTCTAAAGGAGACTAAGGTTTTAAGAAAACCACTTA
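
The two listings differ only in line width: the number after --fasta sets how many bases go on each line. A small sketch of that wrapping, with wrap_sequence and to_fasta as hypothetical helpers, applied to the first read shown above:

```python
def wrap_sequence(sequence, width):
    """Split a sequence into lines of at most `width` bases."""
    if width <= 0:
        raise ValueError("width must be positive")
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def to_fasta(name, sequence, width):
    """Format one record as a header line followed by wrapped sequence lines."""
    return "\n".join([f">{name}", *wrap_sequence(sequence, width)])


# First read from the listings above (SRR003161.1).
read = ("TCAGATGCAATCATCGAATGGTCTCGAATGGAATCNTCTANAGAGATGGA"
        "ATGTATCNCTCGCCANACGACACNCGAACAGGGNAAGGCAAGCAGNAGGN"
        "AGNNNANNNNNNNNNNNNNNNNNN")
for width in (20, 50):
    print(width, [len(line) for line in wrap_sequence(read, width)])
```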

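The error above has not been resolved yet, although a SRR003161.fasta file was still written in both runs.

Switching to a different dataset works.

Successful run (--fasta 50 writes 50 bases per line):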

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --fasta 50 SRR002664
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ls
back  SRR002664.fasta  SRR002664.sra  SRR003161.fasta  SRR003161.fastq  SRR003161.sra
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ll -h
total 986M
drwxrwxr-x 3 hadoop hadoop 4.0K  1月 13 13:40 ./
drwxrwxr-x 5 hadoop hadoop 4.0K  1月 12 21:31 ../
drwxrwxr-x 2 hadoop hadoop 4.0K  1月 13 13:39 back/
-rw-rw-r-- 1 hadoop hadoop 150M  1月 13 13:40 SRR002664.fasta
-rw-r--r-- 1 hadoop hadoop  17M 12月 15 22:13 SRR002664.sra
-rw-rw-r-- 1 hadoop hadoop 274M  1月 13 13:36 SRR003161.fasta
-rw-rw-r-- 1 hadoop hadoop 538M  1月 13 13:00 SRR003161.fastq
-rw-r--r-- 1 hadoop hadoop 9.0M 12月 15 23:12 SRR003161.sra
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ more SRR002664.fasta 
>SRR002664.1 FC20KVN01EFCX9 length=192
TCAGCTCACGTCTGTAATCCTAGCATTTTGGGAGGCTGAGACGGGCAGAT
CACTTGAGGTCATGAGTTCGAGACCAGCCTGGCAACCATGGCGAAACCCT
GTCTCTACTAAAATACAAAATTAGCCAGGCATGGTGGCGCATGCCTGTCT
GAGACACGCAACAGGGGATAGGCAAGGCACACAGGGGATAGG
>SRR002664.2 FC20KVN01ELL46 length=127
TCAGCAAAGAAAACAAATTCCTTTCTGGCACCACCTCAAAGAAGAATTTC

Verify again with FASTQ output:

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump  SRR002664
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
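
The "Read ... spots" and "Written ... spots" lines give a quick sanity check. The sketch below parses them from captured output and compares the two counts. spot_counts and is_complete are hypothetical helpers, and equal counts only suggest, not prove, that the dump is complete.

```python
import re

# Matches the two summary lines shown in the transcripts above.
SUMMARY = re.compile(r"^(Read|Written) (\d+) spots for (\S+)$")


def spot_counts(output):
    """Return {accession: {"Read": n, "Written": n}} parsed from tool output."""
    counts = {}
    for line in output.splitlines():
        match = SUMMARY.match(line.strip())
        if match:
            kind, number, accession = match.groups()
            counts.setdefault(accession, {})[kind] = int(number)
    return counts


def is_complete(output, accession):
    """True when both summary lines exist and report the same number of spots."""
    entry = spot_counts(output).get(accession, {})
    return "Read" in entry and entry.get("Read") == entry.get("Written")


log = "Read 487522 spots for SRR002664\nWritten 487522 spots for SRR002664\n"
print(spot_counts(log), is_complete(log, "SRR002664"))
```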

5. Splitting

Split paired-end sequencing data into separate files.

(1) --split-files produces two FASTQ files

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --split-files SRR002664
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ll -h
total 924M
drwxrwxr-x 3 hadoop hadoop 4.0K  1月 13 14:05 ./
drwxrwxr-x 5 hadoop hadoop 4.0K  1月 12 21:31 ../
drwxrwxr-x 2 hadoop hadoop 4.0K  1月 13 13:52 back/
-rw-rw-r-- 1 hadoop hadoop  44M  1月 13 14:05 SRR002664_1.fastq
-rw-rw-r-- 1 hadoop hadoop 291M  1月 13 14:05 SRR002664_2.fastq
-rw-rw-r-- 1 hadoop hadoop 291M  1月 13 14:02 SRR002664.fastq
-rw-r--r-- 1 hadoop hadoop  17M 12月 15 22:13 SRR002664.sra
-rw-rw-r-- 1 hadoop hadoop 274M  1月 13 13:56 SRR003161.fasta
-rw-r--r-- 1 hadoop hadoop 9.0M 12月 15 23:12 SRR003161.sra
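
The two files differ a lot in size (44M vs 291M), so it can be worth checking that they still hold the same number of records. The sketch below assumes four lines per record and that every spot contributes one read to each file. The helper names are hypothetical and the demo files are made up.

```python
import tempfile
from pathlib import Path


def count_fastq_records(path):
    """Count records, assuming exactly four lines per record (no wrapping)."""
    with open(path) as handle:
        lines = sum(1 for _ in handle)
    if lines % 4:
        raise ValueError(f"{path}: line count {lines} is not a multiple of 4")
    return lines // 4


def same_record_count(first, second):
    """True when both FASTQ files contain the same number of records."""
    return count_fastq_records(first) == count_fastq_records(second)


# Demo on two tiny made-up files; for real data pass e.g.
# "SRR002664_1.fastq" and "SRR002664_2.fastq".
with tempfile.TemporaryDirectory() as tmp:
    one = Path(tmp, "demo_1.fastq")
    two = Path(tmp, "demo_2.fastq")
    one.write_text("@d.1\nTCAG\n+d.1\nIIII\n")
    two.write_text("@d.1\nACGTACGT\n+d.1\nIIIIIIII\n")
    print(same_record_count(one, two))
```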

(2) --split-3

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --split-3 SRR002664
Rejected 487522 READS because of filtering out non-biological READS
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ll
total 1192100
drwxrwxr-x 3 hadoop hadoop      4096  1月 13 14:21 ./
drwxrwxr-x 5 hadoop hadoop      4096  1月 12 21:31 ../
drwxrwxr-x 2 hadoop hadoop      4096  1月 13 14:21 back/
-rw-rw-r-- 1 hadoop hadoop 304893796  1月 13 14:21 SRR002664.fastq
-rw-r--r-- 1 hadoop hadoop  16874064 12月 15 22:13 SRR002664.sra
-rw-rw-r-- 1 hadoop hadoop  42893052  1月 13 14:16 SRR003161_1.fastq
-rw-rw-r-- 1 hadoop hadoop 559892770  1月 13 14:16 SRR003161_2.fastq
-rw-rw-r-- 1 hadoop hadoop 286773153  1月 13 13:56 SRR003161.fasta
-rw-r--r-- 1 hadoop hadoop   9353980 12月 15 23:12 SRR003161.sra


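The --split-3 option is documented as follows:
Legacy 3-file splitting for mate-pairs: first biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq. If only one biological read is present it is placed in *.fastq. Biological reads 3 and above are ignored.

In other words, if each spot in the SRA file holds only one read, this option has no effect and a single *.fastq file is written. If each spot holds two reads, the pairs are split into *_1.fastq and *_2.fastq. If a third file appears as well, it contains reads that have no mate, possibly because the data was filtered before submission and part of it was removed.

Adapted from reference 【4】.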
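
The routing rule in the help text can be written down as a small function. This is a sketch of the documented behavior only, not the toolkit's actual code, and route_split3 is a hypothetical name.

```python
def route_split3(accession, biological_reads):
    """Map output file name -> read for one spot, following the help text.

    Two or more biological reads: the first two go to *_1.fastq and *_2.fastq,
    any further reads are ignored. Exactly one read: it goes to *.fastq.
    """
    if len(biological_reads) >= 2:
        return {
            f"{accession}_1.fastq": biological_reads[0],
            f"{accession}_2.fastq": biological_reads[1],
        }
    if len(biological_reads) == 1:
        return {f"{accession}.fastq": biological_reads[0]}
    return {}


# Made-up reads, used only to show the three cases.
print(route_split3("DEMO", ["ACGT", "TTGA"]))
print(route_split3("DEMO", ["ACGT"]))
print(route_split3("DEMO", ["ACGT", "TTGA", "GGCC"]))
```

This is consistent with the run above, where only SRR002664.fastq was written.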



(3) --split-spot

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --split-spot SRR002664
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ll
total 1236636
drwxrwxr-x 3 hadoop hadoop      4096  1月 13 14:53 ./
drwxrwxr-x 5 hadoop hadoop      4096  1月 12 21:31 ../
drwxrwxr-x 2 hadoop hadoop      4096  1月 13 14:21 back/
-rw-rw-r-- 1 hadoop hadoop 350498654  1月 13 14:54 SRR002664.fastq
-rw-r--r-- 1 hadoop hadoop  16874064 12月 15 22:13 SRR002664.sra
-rw-rw-r-- 1 hadoop hadoop  42893052  1月 13 14:16 SRR003161_1.fastq
-rw-rw-r-- 1 hadoop hadoop 559892770  1月 13 14:16 SRR003161_2.fastq
-rw-rw-r-- 1 hadoop hadoop 286773153  1月 13 13:56 SRR003161.fasta
-rw-r--r-- 1 hadoop hadoop   9353980 12月 15 23:12 SRR003161.sra

    --split-spot Split spots into individual reads.
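
To illustrate what splitting a spot into individual reads means, here is a sketch that cuts one spot at given read lengths. How the toolkit obtains the read boundaries is not shown in this post, so the lengths are simply passed in. split_spot is a hypothetical helper and the spot is made up.

```python
def split_spot(sequence, quality, read_lengths):
    """Cut one spot into consecutive (sequence, quality) reads of the given lengths."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    if sum(read_lengths) != len(sequence):
        raise ValueError("read lengths must add up to the spot length")
    reads, start = [], 0
    for length in read_lengths:
        end = start + length
        reads.append((sequence[start:end], quality[start:end]))
        start = end
    return reads


# Made-up 12-base spot split into a 4-base read and an 8-base read.
for seq, qual in split_spot("TCAGACGTACGT", "IIIIHHHHHHHH", [4, 8]):
    print(seq, qual)
```

Each spot then yields several FASTQ records in the same file, which would fit the larger SRR002664.fastq in the listing above (350498654 bytes, versus 304893796 bytes in the --split-3 listing).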





References:

【1】 http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc&f=fastq-dump

【2】 http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc

【3】 http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software#

【4】 http://www.bbioo.com/lifesciences/40-112832-1.html
