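Goal: use the installed sratoolkit to convert SRA files into FASTQ sequencing files, then check the quality of those files with FastQC.
Homework: understand sequencing reads, GC content, quality values, adapters, indexes, and every section of the FastQC report; search for Chinese-language tutorials.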
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc
fastq-dump: Convert SRA data into fastq format
prefetch: Allows command-line downloading of SRA, dbGaP, and ADSP data
sam-dump: Convert SRA data to sam format
sra-pileup: Generate pileup statistics on aligned SRA data
vdb-config: Display and modify VDB configuration information
vdb-decrypt: Decrypt non-SRA dbGaP data ("phenotype data")
abi-dump: Convert SRA data into ABI format (csfasta / qual)
illumina-dump: Convert SRA data into Illumina native formats (qseq, etc.)
sff-dump: Convert SRA data to sff format
sra-stat: Generate statistics about SRA data (quality distribution, etc.)
vdb-dump: Output the native VDB format of SRA data.
vdb-encrypt: Encrypt non-SRA dbGaP data ("phenotype data")
vdb-validate: Validate the integrity of downloaded SRA data
Download repository (Linux): /home/[user_name]/ncbi/public
For the test, we are using an arbitrary dataset, SRR390728 (RNA-Seq (polyA+) analysis of DLBCL cell line HS0798), from the National Cancer Institute’s Cancer Genome Characterization Initiative (CGCI) Project. It is a reasonably small SRA dataset that contains aligned (reference-compressed) data, allowing us to test multiple aspects of the toolkit simultaneously.
./fastq-dump -X 5 -Z SRR390728
If the command fails with an error such as "fastq-dump.2.x err: item not found while constructing within virtual database module - the path 'SRR390728' cannot be opened as database or table", configure the toolkit first:
Go to the “bin” subdirectory for the Toolkit and run the following command:
./vdb-config -i
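Before running the test command, it can help to confirm that the tools are actually on the PATH. The following is a minimal sketch, not part of the NCBI documentation; the helper name `missing_tools` is hypothetical, and the list of tools checked is simply the set used in these notes.

```shell
# missing_tools NAME...: print every command that cannot be found on PATH.
# (Hypothetical helper for illustration; it prints nothing if all are found.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the tools used in these notes.
missing_tools fastq-dump prefetch vdb-config fastqc
```

If a name is printed, either add the toolkit's "bin" directory to PATH or call the tool with an explicit path such as ./fastq-dump.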
fastq-dump -h  # show help
Usage:
  fastq-dump [options] <path> [<path>...]
  fastq-dump [options] <accession>

INPUT
  -A|--accession <accession>   Replaces accession derived from <path> in
                               filename(s) and deflines (only for single
                               table dump)
  --table <table-name>         Table name within cSRA object, default is
                               "SEQUENCE"

OUTPUT
  -O|--outdir <path>           Output directory, default is working
                               directory '.' )
  -Z|--stdout                  Output to stdout, all split data become
                               joined into single stream
  --gzip                       Compress output using gzip: deprecated, not
                               recommended
  --bzip2                      Compress output using bzip2: deprecated,
                               not recommended

Multiple File Options          Setting these options will produce more
                               than 1 file, each of which will be suffixed
                               according to splitting criteria.
Each sequence in a FASTQ file normally takes four lines:
Line 1 begins with a ‘@’ character and is followed by a sequence identifier and an optional description (like a FASTA title line).
Line 2 is the raw sequence letters.
Line 3 begins with a ‘+’ character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
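The four-line layout can be checked with a short awk script. This is a minimal sketch under stated assumptions: the two reads in `example.fastq` are made up for illustration, the function name `fastq_stats` is hypothetical, every record is assumed to take exactly four lines, and quality characters are assumed to use the Phred+33 encoding (quality = ASCII code minus 33).

```shell
# Illustrative two-read FASTQ file (made-up reads, not real data).
cat > example.fastq <<'EOF'
@read1 first example
GATTACGG
+
IIIIIIII
@read2 second example
CCGGAATT
+read2
!!!!IIII
EOF

# fastq_stats FILE: print the read count, the number of records whose
# sequence and quality lengths differ, the GC percentage, and the mean
# Phred quality (assumes 4 lines per record and Phred+33 encoding).
fastq_stats() {
    awk '
        # Build a character -> ASCII code lookup table.
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
        # Line 1 of each record: the "@" header.
        NR % 4 == 1 { reads++ }
        # Line 2: the bases; count G and C.
        NR % 4 == 2 {
            seqlen = length($0)
            tmp = $0
            gc += gsub(/[GCgc]/, "", tmp)
            bases += seqlen
        }
        # Line 4: the quality string; must be as long as line 2.
        NR % 4 == 0 {
            if (length($0) != seqlen) mismatched++
            for (j = 1; j <= length($0); j++) {
                qsum += ord[substr($0, j, 1)] - 33
                qn++
            }
        }
        END {
            if (bases == 0 || qn == 0) { print "no reads found"; exit }
            printf "reads=%d mismatched=%d gc=%.1f mean_q=%.1f\n",
                   reads, mismatched, 100 * gc / bases, qsum / qn
        }
    ' "$1"
}

fastq_stats example.fastq
# prints: reads=2 mismatched=0 gc=50.0 mean_q=30.0
```

In the example, "I" is ASCII 73, so it encodes quality 40, and "!" is ASCII 33, so it encodes quality 0. FastQC reports the same kinds of numbers (GC content, per-base quality) in much more detail.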
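fastq-dump --split-3 -O "$outdir" SRR35899$i.sra  # -O takes the output directory ($outdir); the .sra file is the input; $i comes from the surrounding loop
# Learned the hard way: the uncompressed output came to 120 GB, so compressing with --gzip is recommended
# Use a loop to avoid typing the same command for every file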
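One way to write such a loop is sketched below. The run numbers 56 to 58, the output directory ./fastq, the function name `convert_all`, and the `DRY_RUN` switch are all assumptions made for illustration; only the SRR35899 prefix and the fastq-dump options come from the notes above. With `DRY_RUN=1` (the default here) the commands are only printed, so the loop can be checked before committing disk space.

```shell
# ASSUMPTIONS: OUTDIR and the run numbers passed below are hypothetical.
# DRY_RUN=1 prints the commands; DRY_RUN=0 actually runs fastq-dump.
DRY_RUN="${DRY_RUN:-1}"
OUTDIR="${OUTDIR:-./fastq}"

# convert_all NUMBER...: convert SRR35899<NUMBER>.sra for each number given.
convert_all() {
    for i in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "fastq-dump --gzip --split-3 -O $OUTDIR SRR35899${i}.sra"
        else
            fastq-dump --gzip --split-3 -O "$OUTDIR" "SRR35899${i}.sra"
        fi
    done
}

convert_all 56 57 58
```

For paired-end data, --split-3 writes the two mates to files ending in _1 and _2, which is where a name such as SRR3589956_1.fastq comes from.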
fastqc SRR3589956_1.fastq
This produces a zip archive and an HTML file.
Open the HTML file to view the quality report.
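The zip archive also contains a tab-separated summary.txt (status, module name, file name), which is handy for a quick look without opening a browser. The sketch below lists the modules that did not pass. The summary contents are made up for illustration and are not results from SRR3589956; the function name `flagged_modules` and the file name `example_summary.txt` are hypothetical.

```shell
# Made-up summary in the FastQC layout: STATUS<TAB>module<TAB>file name.
{
    printf 'PASS\tBasic Statistics\texample.fastq\n'
    printf 'WARN\tPer sequence GC content\texample.fastq\n'
    printf 'FAIL\tAdapter Content\texample.fastq\n'
} > example_summary.txt

# flagged_modules FILE: print every module whose status is not PASS.
flagged_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }' "$1"
}

flagged_modules example_summary.txt

# For a real report, the summary can be read straight from the archive, e.g.
#   unzip -p SRR3589956_1_fastqc.zip SRR3589956_1_fastqc/summary.txt
```

A WARN or FAIL is a prompt to look at that section of the HTML report, not necessarily a sign that the data is unusable.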
Column by 青山屋主:
http://fbb84b26.wiz03.com/share/s/3XK4IC0cm4CL22pUr1HPcQQ1iRTvV2GwkwL2AaxYi2fXHP7
conda install -c bioconda multiqc
multiqc .  # scan the current directory