Linux011: Installing and Using the SRA Toolkit

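SRA (Sequence Read Archive) is a database for storing raw data from next-generation sequencing platforms, including 454, Illumina, SOLiD, IonTorrent, Helicos, and Complete Genomics. Besides raw sequence data, SRA now also stores alignment information for raw reads mapped to a reference genome.
Based on how the data are produced, SRA records fall into four types: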

  • Studies: research projects
  • Experiments: experimental designs
  • Runs: sets of sequencing results
  • Samples: sample information
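The four record types above can be told apart by accession prefix. The sketch below is illustrative only: the helper name sra_record_type is hypothetical, and the prefix convention (SRP/SRX/SRR/SRS, with E or D as the first letter for records submitted through EBI or DDBJ) is general SRA knowledge rather than something stated in this post.

```shell
# Classify an SRA accession by its prefix.
# sra_record_type is a hypothetical helper, not part of SRA Toolkit.
sra_record_type() {
    case "$1" in
        [SED]RP[0-9]*) echo "study" ;;
        [SED]RX[0-9]*) echo "experiment" ;;
        [SED]RR[0-9]*) echo "run" ;;
        [SED]RS[0-9]*) echo "sample" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

sra_record_type SRR11097713   # prints: run
```

Run accessions (SRR...) are the ones passed to prefetch and fastq-dump later in this post.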
SRA Toolkit is a tool for downloading sra files from the NCBI database and converting them to .fastq.gz files.
Go to the NCBI website and select the SRA database
Find the SRA Toolkit download page
Copy the download link
Download it on Linux with wget
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.10.8/sratoolkit.2.10.8-ubuntu64.tar.gz
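If a different version or platform build is needed, the link can be assembled from its parts. This is a minimal sketch: the helper name sratoolkit_url is hypothetical, and the URL pattern is inferred from the single link above, so other versions and platform names are only assumed to follow it.

```shell
# Build the download URL for a given toolkit version and platform.
# The pattern is inferred from the 2.10.8 ubuntu64 link; check the download
# page before relying on it for other releases.
sratoolkit_url() {
    local version="$1"
    local platform="${2:-ubuntu64}"
    echo "https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/${version}/sratoolkit.${version}-${platform}.tar.gz"
}

url="$(sratoolkit_url 2.10.8 ubuntu64)"
echo "$url"
# To download, resuming a partial file if one exists:
# wget -c "$url"
```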
Move the file to a directory of your choice, e.g. /home/sratoolkit
mkdir /home/sratoolkit
mv sratoolkit.2.10.8-ubuntu64.tar.gz /home/sratoolkit
Extract the archive
cd /home/sratoolkit
tar xzvf sratoolkit.2.10.8-ubuntu64.tar.gz
Edit .bashrc to add the toolkit's bin directory to PATH
echo "export PATH=\$PATH:/home/sratoolkit/sratoolkit.2.10.8-ubuntu64/bin" >> ~/.bashrc
source ~/.bashrc
fastq-dump -h
If fastq-dump -h reports an error at this point, run vdb-config --interactive as the error message instructs; that resolves it.
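One thing to watch with the echo ... >> ~/.bashrc step: running it again appends a duplicate line each time. A small sketch of an idempotent variant follows; add_path_once is a hypothetical helper name, and the demo writes to a scratch file rather than the real ~/.bashrc.

```shell
# Append a PATH export to an rc file only if that exact line is not already there.
# add_path_once is a hypothetical helper, not part of SRA Toolkit.
add_path_once() {
    local dir="$1"
    local rcfile="$2"
    local line="export PATH=\$PATH:${dir}"
    touch "$rcfile"
    # -x: match whole lines only, -F: treat the pattern as a fixed string
    if ! grep -qxF -- "$line" "$rcfile"; then
        echo "$line" >> "$rcfile"
    fi
}

# Demonstration on a scratch file.
demo_rc="$(mktemp)"
add_path_once /home/sratoolkit/sratoolkit.2.10.8-ubuntu64/bin "$demo_rc"
add_path_once /home/sratoolkit/sratoolkit.2.10.8-ubuntu64/bin "$demo_rc"
cat "$demo_rc"    # the export line appears once, despite two calls
```

For the real setup, pass ~/.bashrc as the second argument and then run source ~/.bashrc as before.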

Using the SRA Toolkit

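  • Searching SRA: taking brca as an example, a search of the NCBI SRA database returns a large number of sequencing datasets. Papers also usually give the SRA accession numbers of their sequencing data, which can be searched for directly.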


Use the prefetch command to download a file, for example: prefetch SRR11097713

prefetch Usage:
prefetch [options] <path/SRA file | path/kart file> [<path/file> ...]
prefetch [options] <SRA accession>
prefetch [options] --list <kart_file>
Frequently Used Options:
General:
-h | --help Displays ALL options, general usage, and version information.
-V | --version Display the version of the program.
Data transfer:
-f | --force Force object download. One of: no, yes, all. no [default]: Skip download if the object is found and complete; yes: Download it even if it is found and is complete; all: Ignore lock files (stale locks or if it is currently being downloaded: use at your own risk!).
--transport Value one of: ascp (only), http (only), both (first try ascp, fallback to http). Default: both.
-l | --list List the contents of a kart file.
-s | --list-sizes List the content of kart file with target file sizes.
-N | --min-size Minimum file size to download in KB (inclusive).
-X | --max-size Maximum file size to download in KB (exclusive). Default: 20G.
-o | --order Kart prefetch order. One of: kart (in kart order), size (by file size: smallest first). default: size.
-a | --ascp-path Path to ascp program and private key file (asperaweb_id_dsa.openssh).
-p | --progress Time period in minutes to display download progress (0: no progress). Default: 1.
--option-file Read more options and parameters from the file.
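When a paper lists several run accessions, the single prefetch call shown earlier can be wrapped in a loop. The following is a sketch under stated assumptions: prefetch_list and the DRY_RUN variable are hypothetical names introduced here, the list file holds one accession per line, and the SRR/ERR/DRR pattern used for validation is general SRA convention rather than something from the help text above. By default it only prints the commands.

```shell
# Print (or run) one prefetch command per valid run accession in a list file.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
prefetch_list() {
    local list="$1"
    local acc
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines and comment lines.
        case "$acc" in ''|'#'*) continue ;; esac
        if ! printf '%s\n' "$acc" | grep -Eq '^[SED]RR[0-9]+$'; then
            echo "skipping invalid accession: $acc" >&2
            continue
        fi
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "prefetch $acc"
        else
            # Keep prefetch from consuming the list file on stdin.
            prefetch "$acc" < /dev/null
        fi
    done < "$list"
}

# Example with a throwaway list file.
demo_list="$(mktemp)"
printf 'SRR11097713\n# a comment\nnot-an-id\n' > "$demo_list"
prefetch_list "$demo_list"   # stdout: prefetch SRR11097713 (a warning for not-an-id goes to stderr)
```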

fastq-dump
  • Convert sra to fastq: fastq-dump SRR11097713
  • Convert sra to fasta (50 is the line-wrap width): fastq-dump --fasta 50 SRR11097713
  • Write paired-end reads to separate files: fastq-dump --split-files SRR11097713

fastq-dump Usage:
fastq-dump [options] <path/file> [<path/file> ...]
fastq-dump [options] <accession>
Frequently Used Options:
General:
-h | --help Displays ALL options, general usage, and version information.
-V | --version Display the version of the program.
Data formatting:
--split-files Dump each read into separate file. Files will receive suffix corresponding to read number.
--split-spot Split spots into individual reads.
--fasta <[line width]> FASTA only, no qualities. Optional line wrap width (set to zero for no wrapping).
-I | --readids Append read id after spot id as 'accession.spot.readid' on defline.
-F | --origfmt Defline contains only original sequence name.
-C | --dumpcs <[cskey]> Formats sequence using color space (default for SOLiD). "cskey" may be specified for translation.
-B | --dumpbase Formats sequence using base space (default for other than SOLiD).
-Q | --offset Offset to use for ASCII quality scores. Default is 33 ("!").
Filtering:
-N | --minSpotId Minimum spot id to be dumped. Use with "X" to dump a range.
-X | --maxSpotId Maximum spot id to be dumped. Use with "N" to dump a range.
-M | --minReadLen <len> Filter by sequence length >= <len>.
--skip-technical Dump only biological reads.
--aligned Dump only aligned sequences. Aligned datasets only; see sra-stat.
--unaligned Dump only unaligned sequences. Will dump all for unaligned datasets.
Workflow and piping:
-O | --outdir Output directory, default is current working directory ('.').
-Z | --stdout Output to stdout, all split data become joined into single stream.
--gzip Compress output using gzip.
--bzip2 Compress output using bzip2.
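Several of the options listed above are commonly combined. The sketch below assembles one such call: split paired reads, gzip the output, write to a chosen directory, and optionally stop at a maximum spot id for a quick test run. dump_fastq and DRY_RUN are hypothetical names introduced for illustration; only the fastq-dump flags themselves come from the help text. By default the command is printed, not executed.

```shell
# Print (or run) a fastq-dump call built from the options listed above.
# Arguments: accession, output directory (default '.'), optional max spot id.
dump_fastq() {
    local acc="$1"
    local outdir="${2:-.}"
    local max_spots="${3:-}"
    # Collect the command in the positional parameters so quoting is preserved.
    set -- fastq-dump --split-files --gzip -O "$outdir"
    if [ -n "$max_spots" ]; then
        set -- "$@" -X "$max_spots"
    fi
    set -- "$@" "$acc"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

dump_fastq SRR11097713 fastq 10000
# prints: fastq-dump --split-files --gzip -O fastq -X 10000 SRR11097713
```

Run with DRY_RUN=0 to execute the command. For a paired-end run, --split-files gives each output file a suffix matching its read number, as described in the option list.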
