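Two modes: mapping-based mode and alignment-based mode
Features: fast, with a built-in selective-alignment mapping algorithm
In the official documentation, "transcript" means a target sequence and "fragment" means a read (or read pair)
The current version uses selective alignment. The use of selective alignment implies the use of range factorization.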
Sequence alignment: Alignment and mapping methodology influence transcript abundance estimation | Genome Biology | Full Text (biomedcentral.com)
range factorization:Improved data-driven likelihood factorizations for transcript abundance estimation | Bioinformatics | Oxford Academic (oup.com)
Salmon expects that the reads have been aligned directly to the transcriptome (like RSEM, eXpress, etc.) rather than to the genome (as does, e.g. Cufflinks)
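So if you use alignment-based mode and the reads were aligned to the genome, the alignments first have to be turned into transcriptome alignments. The official documentation offers three options:
3.1. Convert the SAM/BAM file to FAST{A/Q} and use the lightweight-alignment-based mode
3.2. Convert the SAM/BAM file to FAST{A/Q}, realign the reads to the transcriptome with an aligner, and then run salmon
3.3. Tools such as sam-xlate can convert a genome-coordinate BAM file directly into transcriptome coordinates, but such tools are not yet widely used.
Open question: is the genome vs. transcriptome requirement really this strict?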
1. Basics:
1.1. Strand Matching
When it is said that the read “comes from” a strand, we mean that the read should align with / map to that strand. For example, for libraries having the OSR protocol as described above, we expect that read1 maps to the reverse strand, and read2 maps to the forward strand.
1.2. Library type
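As a reading aid for library type strings, here is a minimal sketch that spells out a string such as OSR in words. It assumes the three-part convention from the Salmon documentation: relative orientation (I, O or M, paired-end only), strandedness (S or U), and, for stranded libraries, the strand (F or R) that read 1 comes from. The function name describe_libtype is hypothetical and not part of Salmon.

```shell
# describe_libtype is a hypothetical helper, not a Salmon command.
describe_libtype() {
    lt=$1
    orient=""
    who="the read"
    # Paired-end types start with the relative orientation of the two mates.
    case $lt in
        I*) orient="inward";   who="read 1"; lt=${lt#I} ;;
        O*) orient="outward";  who="read 1"; lt=${lt#O} ;;
        M*) orient="matching"; who="read 1"; lt=${lt#M} ;;
    esac
    # What remains is the strandedness, plus the strand for stranded libraries.
    case $lt in
        U)  strand="unstranded" ;;
        SF) strand="stranded, $who from the forward strand" ;;
        SR) strand="stranded, $who from the reverse strand" ;;
        *)  echo "unrecognised library type: $1" >&2; return 1 ;;
    esac
    if [ -n "$orient" ]; then
        echo "paired-end, $orient, $strand"
    else
        echo "single-end, $strand"
    fi
}

describe_libtype OSR
```

For OSR this reports an outward-facing, stranded paired-end library with read 1 from the reverse strand, which matches the strand-matching example above.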
2. mapping-based mode
2.1. Building the index
using selective alignment with a decoy-aware transcriptome, to mitigate potential spurious mapping of reads that actually arise from some unannotated genomic locus that is sequence-similar to an annotated transcriptome
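2.1.1. Use MashMap2 to align the annotated transcripts against the organism's genome and take the sequence-similar genomic regions as decoys. The generateDecoyTranscriptome.sh script simplifies this step; instructions are in its README
2.1.2. Use the entire genome as the decoy sequence for the index, see https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/
2.1.3. Pre-built indexes for common species: http://refgenomes.databio.org/
2.1.4. Running the indexer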
./bin/salmon index -t transcripts.fa -i transcripts_index --decoys decoys.txt -k 31
# k=31 works well for reads of 75 bp or longer; for shorter reads, consider a smaller -k
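For the whole-genome decoy approach in 2.1.2, the decoy list and the combined reference can be prepared roughly as below. This is a sketch on toy FASTA files, following the steps in the linked tutorial; the file names and sequences are placeholders, and real inputs are usually gzipped.

```shell
# Toy inputs standing in for the real transcriptome and genome (placeholders).
workdir=$(mktemp -d)
printf '>tx1 gene=A\nACGTACGT\n>tx2 gene=B\nTTGCA\n' > "$workdir/transcripts.fa"
printf '>chr1 primary\nACGTACGTTTGCA\n>chr2 primary\nGGGCCC\n' > "$workdir/genome.fa"

# Decoy list: genome sequence names, without ">" and without the description.
grep '^>' "$workdir/genome.fa" | cut -d ' ' -f 1 | sed 's/^>//' > "$workdir/decoys.txt"

# Combined reference: transcripts first, genome (the decoys) after them.
cat "$workdir/transcripts.fa" "$workdir/genome.fa" > "$workdir/gentrome.fa"

cat "$workdir/decoys.txt"
# The index would then be built on the combined file, e.g.:
#   salmon index -t gentrome.fa -i transcripts_index --decoys decoys.txt -k 31
```

The names in decoys.txt have to match the genome record names in the combined FASTA exactly, which is why the description after the first space is cut off.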
2.2. quantifying
quasi-mapping, quantification, and bootstrapping / posterior sampling
# Paired-end
./bin/salmon quant -i transcripts_index -l <LIBTYPE> -1 reads1.fq -2 reads2.fq --validateMappings -o transcripts_quant
# Single-end
./bin/salmon quant -i transcripts_index -l <LIBTYPE> -r reads.fq --validateMappings -o transcripts_quant
# Paired-end library split across several files (list the files in the same order for -1 and -2)
salmon quant -i index -l IU -1 lib_1_1.fq.gz lib_2_1.fq.gz -2 lib_1_2.fq.gz lib_2_2.fq.gz --validateMappings -o out
salmon quant -i index -l IU -1 <(cat lib_1_1.fq lib_2_1.fq) -2 <(cat lib_1_2.fq lib_2_2.fq) --validateMappings -o out
salmon quant -i index -l IU -1 <(gunzip -c lib_1_1.fq.gz lib_2_1.fq.gz) -2 <(gunzip -c lib_1_2.fq.gz lib_2_2.fq.gz) --validateMappings -o out
2.3. quasi-mapping
3. alignment-based mode
./bin/salmon quant -t transcripts.fa -l <LIBTYPE> -a aln.bam -o salmon_quant
# Show the options for this mode
salmon quant --help-alignment
After the run there is a quant.sf file, whose format is similar to the Sailfish output format
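4. Output
4.1. Quantification file: quant.sf
Columns: Name (target sequence name), Length (target sequence length), EffectiveLength, TPM, NumReads (estimated read count)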
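To make the relation between the quant.sf columns concrete, the sketch below recomputes TPM from the read count and effective length columns using the usual TPM definition (count divided by effective length, rescaled so the values sum to one million). The toy file and its values are made up, and recompute_tpm is a hypothetical helper name.

```shell
# Toy quant.sf with the five columns described above (values are made up).
quant=$(mktemp)
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' > "$quant"
printf 'tx1\t1200\t1000.0\t250000.0\t100.0\n' >> "$quant"
printf 'tx2\t700\t500.0\t750000.0\t150.0\n' >> "$quant"

# recompute_tpm is a hypothetical helper: column 5 is the read count,
# column 3 the effective length; the header line is skipped.
recompute_tpm() {
    awk -F '\t' '
        NR > 1 { name[NR] = $1; rate[NR] = $5 / $3; total += rate[NR] }
        END    { for (i = 2; i <= NR; i++)
                     printf "%s\t%.1f\n", name[i], rate[i] / total * 1000000 }
    ' "$1"
}

recompute_tpm "$quant"
```

On the toy data the recomputed values agree with the TPM column, since tx1 has a rate of 0.1 and tx2 a rate of 0.3 reads per effective base.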
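4.2. Command information file: cmd_info.json
Records the main command used to run salmon
4.3. Auxiliary directory: aux_info
Contains several files that record how the salmon run went
4.3.1. Meta information: the main file. It holds meta information about the run, including stats such as the number of observed and mapped fragments, details of the bias modeling etc.
4.3.2. Unique and ambiguous count file: for each transcript, the number of reads that map to it uniquely and the number that map to it ambiguously
4.3.3. Observed library format counts: compares the mappings against the expected library type, giving the number of mappings observed in each format and how much they deviate from the expectation
4.3.4. Fragment length distribution: the observed distribution of fragment lengths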
4.3.5. Sequence-specific bias files
4.3.6. Fragment-GC bias files: fragment-level GC bias
5. Other options
5.1. salmon quant -h shows all options
5.2. --mimicBT2: mimic alignment using Bowtie2 (with the flags --no-discordant and --no-mixed), allowing both mismatches and indels in alignments.
5.3. --mimicStrictBT2: These settings essentially disallow indels in the resulting alignments.
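5.4. --recoverOrphans: when only one end of a pair maps, Salmon looks upstream or downstream of that mapping for a match for the other end
5.5. --hardFilter: keep only the best-scoring mappings and discard sub-optimal ones
5.6. --skipQuant: stop before the quantification step
5.7. --allowDovetail: allow dovetailing mappings and alignments
5.8. -p / --threads: number of threads used for quantification; by default Salmon uses all available hardware threads
5.9. --dumpEq: write the equivalence class file
5.10. --incompatPrior: If an incompatible mapping is the only mapping for a fragment, Salmon will still assign this fragment to the transcript. To prevent this, you can set --incompatPrior 0.0. This will cause Salmon to only consider mappings (or alignments) that are compatible with the prescribed or inferred library type.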
5.11. --fldMean:Since the empirical fragment length distribution cannot be estimated from the mappings of single-end reads, the --fldMean allows the user to set the expected mean fragment length of the sequencing library.
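5.12. --fldSD: the expected standard deviation of the fragment length distribution. Both --fldMean and --fldSD only affect single-end libraries
5.13. --minScoreFraction: the minimum alignment score a mapping needs, given as a fraction of the maximum possible score. The maximum possible score for a fragment is ms = read_len * ma (or ms = (left_read_len + right_read_len) * ma for paired-end reads).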
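The formula for --minScoreFraction can be worked through with a small sketch. The helper name min_score is hypothetical, and the match score and fraction used in the call are illustrative values passed in explicitly; check salmon quant -h for the defaults of your version.

```shell
# min_score is a hypothetical helper, not a Salmon command.
# $1 = total read length (read_len, or left_read_len + right_read_len)
# $2 = match score (--ma)
# $3 = --minScoreFraction
min_score() {
    awk -v len="$1" -v ma="$2" -v frac="$3" \
        'BEGIN { printf "%.1f\n", len * ma * frac }'
}

# Paired-end 2 x 100 bp, match score 2, fraction 0.65 (illustrative values):
# ms = 200 * 2 = 400, so a mapping needs a score of at least 400 * 0.65.
min_score 200 2 0.65
```

With these values the threshold is 260.0.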
5.14. --bandwidth:
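5.16. --maxMMPExtension: limits how far a maximal mappable prefix (MMP) of a read may be extended before a new search along the read is started
5.17. --ma: score for a match
5.18. --mp: penalty for a mismatch
5.19. --go: gap open penalty
5.20. --ge: gap extension penalty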
5.21. --rangeFactorizationBins:Currently, this feature interacts best (i.e., yields the most considerable improvements) when either (1) using alignment-based mode and simultaneously enabling error modeling with --useErrorModel or (2) when enabling --validateMappings in quasi-mapping-based mode. We recommend 4 as a reasonable parameter for this option
5.21. --useEM: Use the “standard” EM algorithm to optimize abundance estimates instead of the variational Bayesian EM algorithm.
However, preliminary testing suggests that the sparsity-inducing effect of running the VBEM with a small prior may lead, in general, to more accurate estimates (the current testing was performed mostly through simulation). Hence, the VBEM is the default, and the standard EM algorithm is accessed via the --useEM flag.
5.22. --numBootstraps:
5.23. --numGibbsSamples:in this case the samples are generated using posterior Gibbs sampling over the fragment equivalence classes rather than bootstrapping. The --numBootstraps and --numGibbsSamples options are mutually exclusive
5.24. --seqBias:Salmon learns the sequence-specific bias parameters using 1,000,000 reads from the beginning of the input to learn and correct for sequence-specific biases in the input data. Salmon uses a variable-length Markov Model (VLMM) to model the sequence specific biases at both the 5’ and 3’ end of sequenced fragments.
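5.25. --gcBias: Use FastQC to check whether the GC bias of the reads is large; if it is, enable this option. Passing --gcBias enables Salmon to learn and correct for fragment-level GC biases in the input data
5.26. --posBias: Passing --posBias enables modeling of a position-specific fragment start distribution. This is meant to model non-uniform coverage biases that are sometimes present in RNA-seq data (e.g. 5’ or 3’ positional bias).
5.27. --biasSpeedSamp: speeds up the bias modeling above by down-sampling the fragment length distribution when the bias models are evaluated; larger values are faster but may reduce the fidelity of the bias modeling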
5.28. --writeUnmappedNames:By reading through the file of unmapped reads and selecting the appropriate sequences from the input FASTA/Q files, you can build an “unmapped” file that can then be used to investigate why these reads may not have mapped
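Building such an "unmapped" file could look roughly like this. The sketch assumes the names file has one read name per line, optionally followed by a status code, and that the reads are in an uncompressed four-line-per-record FASTQ whose header names match the listed names exactly. The toy files and the helper name extract_unmapped are hypothetical.

```shell
workdir=$(mktemp -d)
# Toy stand-ins for the names file written by Salmon and for the input reads.
printf 'read2 u\n' > "$workdir/unmapped_names.txt"
printf '@read1\nACGT\n+\nIIII\n@read2 extra\nTTTT\n+\nIIII\n' > "$workdir/reads.fq"

# extract_unmapped is a hypothetical helper.
# $1 = names file (must be non-empty), $2 = FASTQ file
extract_unmapped() {
    awk 'NR == FNR    { keep[$1] = 1; next }   # first file: remember the names
         FNR % 4 == 1 { hit = (substr($1, 2) in keep) }  # header: drop "@", look up
         hit' "$1" "$2"                        # print all 4 lines of a kept record
}

extract_unmapped "$workdir/unmapped_names.txt" "$workdir/reads.fq" > "$workdir/unmapped.fq"
cat "$workdir/unmapped.fq"
```

For paired-end data the same call would be repeated for each mate file, assuming both mates carry the same read name.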
5.29. --writeMappings:
5.30. -l A or --libType A: allow Salmon to automatically infer the library type
Also the automatic library type detection is performed on the basis of the alignments in the file. Thus, for example, if the upstream aligner has been told to perform strand-aware mapping (i.e. to ignore potential alignments that don’t map in the expected manner), but the actual library is unstranded, automatic library type detection cannot detect this. It will attempt to detect the library type that is most consistent with the alignments that are provided.