宏转录组学习笔记(二)

上次竟然忘记把学习的教程地址放上了,这里首先放上:2018-cicese-metatranscriptomics

继续前面的学习,前面已经把软件安装完成,数据库准备好,下面就是分析的过程了,基本上按照原文的命令进行的,由于教程中没有给出tara_f135_full_megahit.fasta这个文件,这里我们就把这几个样本的组装提到了前面,自己组装获得这个序列,然后再进行物种注释。

质控

把不合格的数据去除,这里使用了fastqc对每个样本进行检查,multiqc汇兑结果,然后Trimmomatic进行去除。

1.fastqc检查,multiqc汇兑结果

#激活工作环境
conda activate tera
#为了方便,设置变量,临时有用,退出失效,所以再运行一次
export PROJECT=~/work
#进入文件夹
cd ${PROJECT}
#建立质控文件夹并进入,每个步骤建立文件夹是个好习惯
mkdir -p quality
cd quality
#软链接数据,省去路径问题,也是一个好方法
ln -s ../data/*.fq.gz ./
#再次check一切就绪
printf "I see $(ls -1 *.fq.gz | wc -l) files here.\n"
#开始质控,原文没有加入多线程, 这里我使用4线程
fastqc *.fq.gz -t 4
#查看确认运行完成,一般没有问题
ls -d *fastqc.zip*
'''
TARA_135_DCM_5-20_rep1_1m_1_fastqc.zip TARA_135_SRF_5-20_rep2_1m_1_fastqc.zip 
...
TARA_135_SRF_5-20_rep1_1m_2_fastqc.zip TARA_136_SRF_5-20_rep2_1m_2_fastqc.zip
'''
#multiqc汇兑结果
multiqc .

运行完成后使用浏览器查看下结果如TARA_135_SRF_5-20_rep1_1m_1_fastqc.html,就是我们熟悉的网页了。

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-8yujkrcM-1585208089194)(https://jiawen.zd200572.com/wp-content/uploads/2020/03/qc.jpg)]

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-Q4SkJ6LV-1585208089199)(https://jiawen.zd200572.com/wp-content/uploads/2020/03/multiqc.jpg)]

2.根据质量修剪和过滤

#建立新的文件夹,软链文件,准备adapters
cd $PROJECT
mkdir -p trim
cd trim
ln -s ../data/*.fq.gz .
cat ~/Miniconda/envs/tera/share/trimmomatic-*/adapters/* > combined.fa
#一个大的for循环解决问题
for filename in *1.fq.gz;
do
#使用 basename 变量去除 _1.fq.gz 产生 base变量,第一次见这种操作
base=$(basename $filename _1.fq.gz)
echo $base

# 运行 Trimmomatic
trimmomatic PE ${base}_1.fq.gz \
              ${base}_2.fq.gz \
     ${base}_1.qc.fq.gz s1_se \
     ${base}_2.qc.fq.gz s2_se \
     ILLUMINACLIP:combined.fa:2:40:15 \
     LEADING:2 TRAILING:2 \
     SLIDINGWINDOW:4:2 \
     MINLEN:25
# 保存orphans序列
gzip -9c s1_se s2_se >> orphans.qc.fq.gz
rm -f s1_se s2_se

done
#看下质控效果,同样是fastqc-multiqc,一起完成,稍微看了眼,只去掉了很少序列数
fastqc *.qc.fq.gz -t 4 
multiqc .
#取消目录下的所有文件可写权限
chmod a-w ${PROJECT}/trim/*.qc.fq.gz

更多质控trim策略看这, MacManes, 2014 ,这是转录组的教程,但是原理应该通用。

Khmer错误修剪

前面的trimmomatic修剪后,依然有序列包含错误,Q30意味着1000个Q值是30的碱基中,约有一个是错的。高质量碱基也会小概率的错误,反过来,许多低质量碱基也是正确的。然后,我们修剪地很轻微,只去除了质量极低的碱基。这是有意为之,因为组装需要保留尽可能大的测序深度,组装软件可以通过测序深度判断正确与否。另一个修剪选择是基于k-mer丰度(kmer spectral error),文章显示结果优于基于质量修剪。

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-vqzESdQs-1585208089200)(https://ngs-docs.github.io/2018-cicese-metatranscriptomics/files/2014-zhang.png)]

它的逻辑是这样的:如果你在高测序深度发现低丰度的K-mers,这些很可能就是错误。当然,菌株变异也会引起这样。在宏转录组中可能遇到极低丰度的菌的情况,所以我们不必须做这一步。我们实验室开发了一个方法,把序列按丰度高低分成两个组,仅修剪高丰度的序列,这样可以兼得低丰度序列,又能去除错误

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-wuKj6rjC-1585208089201)(https://ngs-docs.github.io/2018-cicese-metatranscriptomics/files/kmer-trimming.png)]

#建立文件夹并进入
cd ${PROJECT}
mkdir -p khmer_trim
cd khmer_trim
#选择一个样本开始,链接文件
ln -s ${PROJECT}/trim/TARA_135_SRF_5-20_*qc.fq.gz ./
#还是for循环运行,这一步是最耗内存和时间的,运行需要8G以上内存,否则会使用swap,运行速度变得超慢
#实测在8G内存上,如果有图形界面的系统会卡死,开的无图形界面的去服务器使用内存7.8G
#参数里已经设置了内存8G,可能改小点也能运行,速度差别,没有测试
for filename in *_1.qc.fq.gz
do
  #Use the program basename to remove _1.qc.fq.gz to generate the base
  base=$(basename $filename _1.qc.fq.gz)
  echo $base

  interleave-reads.py ${base}_1.qc.fq.gz ${base}_2.qc.fq.gz | \
  trim-low-abund.py - -V -Z 10 -C 3 -o - --gzip -M 8e9 | \
  extract-paired-reads.py --gzip -p ${base}.khmer.pe.fq.gz -s ${base}.khmer.se.fq.gz

done
'''
removing ./tmph8zrazrjkhmer/-.pass2
removing temp directory & contents (./tmph8zrazrjkhmer)
read 1995938 reads, 194677930 bp
wrote 1980263 reads, 192516464 bp
looked at 1910590 reads twice (1.96 passes)
removed 15675 reads and trimmed 36041 reads (2.59%)
trimmed or removed 1.11%% of bases (2161466 total)
1995938 reads were high coverage (100.00%);
skipped 0 reads/0 bases because of low coverage
fp rate estimated to be 0.000
output streamed to stdout
DONE; read 1980263 sequences, 982692 pairs and 14879 singletons
wrote to: TARA_135_SRF_5-20_rep2_1m.khmer.pe.fq.gz and TARA_135_SRF_5-20_rep2_1m.khmer.se.fq.gz
'''
#查看k-mer丰度变化,去除了少部分序列(几万条)
unique-kmers.py TARA_135_SRF_5-20_rep1_1m_1.qc.fq.gz TARA_135_SRF_5-20_rep1_1m_2.qc.fq.gz
'''
Estimated number of unique 32-mers in TARA_135_SRF_5-20_rep1_1m_1.qc.fq.gz: 58871706
Estimated number of unique 32-mers in TARA_135_SRF_5-20_rep1_1m_2.qc.fq.gz: 57626073
Total estimated number of unique 32-mers: 100687408
'''
unique-kmers.py TARA_135_SRF_5-20_rep1_1m.khmer.pe.fq.gz
'''
Estimated number of unique 32-mers in TARA_135_SRF_5-20_rep1_1m.khmer.pe.fq.gz: 100296061
Total estimated number of unique 32-mers: 100296061
'''

因为这是专门用来培训的小数据,如果真实数据去除的序列数应该比这个大得多。这里有更多相关知识。

宏转录组组装

基本过程就是你丢给软件一些reads,获得一些代表你的转录组的contigs,或者拉长没有太多明显多态性和长重复的reads。你使用修剪好的reads运行一个转录组组装程序,得到一些组装好的RNA。这些contigs代表环境中真核生物中的转录本(Poly-A mRNA)。

#建立文件夹,链接文件
cd $PROJECT
mkdir -p assembly
cd assembly
ln -fs ${PROJECT}/khmer_trim/*.khmer.pe.fq.gz .
ls
#组装,这里我采用12G内存,2线程,已经是我的笔记本的极限,实际运行下来只使用了几百M的样子
time megahit --12 TARA_135_SRF_5-20_rep1_1m.khmer.pe.fq.gz,TARA_135_SRF_5-20_rep2_1m.khmer.pe.fq.gz  --memory 12e9 --num-cpu-threads 4 --out-prefix TARA_135_SRF --out-dir ./TARA_135_SRF_khmer -f
详细输出如下:
'''
2020-03-15 13:29:51 - MEGAHIT v1.2.9
2020-03-15 13:29:51 - Using megahit_core with POPCNT and BMI2 support
2020-03-15 13:29:51 - Convert reads to binary library
2020-03-15 13:29:55 - b'INFO  sequence/io/sequence_lib.cpp  :   77 - Lib 0 (/home/ubuntu/work/assembly/TARA_135_SRF_5-20_rep1_1m.khmer.pe.fq.gz): interleaved, 1958670 reads, 101 max length'
2020-03-15 13:29:58 - b'INFO  sequence/io/sequence_lib.cpp  :   77 - Lib 1 (/home/ubuntu/work/assembly/TARA_135_SRF_5-20_rep2_1m.khmer.pe.fq.gz): interleaved, 1965384 reads, 101 max length'
2020-03-15 13:29:58 - b'INFO  utils/utils.h                 :  152 - Real: 7.6951\tuser: 2.6234\tsys: 0.4533\tmaxrss: 102644'
2020-03-15 13:29:58 - k-max reset to: 119 
2020-03-15 13:29:58 - Start assembly. Number of CPU threads 4 
2020-03-15 13:29:58 - k list: 21,29,39,59,79,99,119 
2020-03-15 13:29:58 - Memory used: 8000000000
2020-03-15 13:29:58 - Extract solid (k+1)-mers for k = 21 
2020-03-15 13:30:55 - Build graph for k = 21 
......
2020-03-15 13:42:29 - Build graph for k = 119 
2020-03-15 13:42:31 - Assemble contigs from SdBG for k = 119
2020-03-15 13:42:41 - Merging to output final contigs 
2020-03-15 13:42:41 - 16707 contigs, total 8656583 bp, min 202 bp, max 9085 bp, avg 518 bp, N50 520 bp
2020-03-15 13:42:41 - ALL DONE. Time elapsed: 770.087650 seconds 
#时间有点长,有十二分钟。可以休息下,或者继续阅读后面的内容,准备下。
real    12m50.137s
user    23m11.702s
sys     0m7.025s
'''

查看组装文件

输出文件会是TARA_135_SRF_khmer/TARA_135_SRF.contigs.fa

#复制并重命名组装文件
cp ./TARA_135_SRF_khmer/*contigs.fa tara135_SRF_megahit.fasta
head tara135_SRF_megahit.fasta 

我们为什么组装,组装的优点是什么,对于组装文件,我们能对它做些什么呢?

  • 组装集减少读取冗余,因此组装每个转录本应具有大约一个序列,而不是读取,因为读取每个脚本具有许多读取。
  • 组装的连续体比读取时间长,因此更容易对它们进行基因搜索。我们明天再报道!
  • 程序集的错误也比读取少,因此样本比较等可能更准确。但是,如果数据覆盖范围确实很低,并且大量信息也会丢失,则程序集也可能消除某些数据。

进一步参考

还有其他的组装软件,你可以用您的数据尝试,其中一些列在我们的参考页。我们感兴趣的一个是PLASS,它在蛋白质水平组装!

组装评估和定量

Transrate是一个可用于几种不同类型的组装评估的程序。第一个也是最简单的方法是计算有关组装基于长度的指标,例如总碱基数和contigs的 N50。Transrate还可用于比较两个组装或给出一个分数,该分数表示为组装提供积极支持的输入reads的比例。有关指标以及如何运行基于参考的Transrate的更多信息,请参阅Smith-Unna 等人 2016 年的文档和论文。

#检查$PROJECT变量
echo $PROJECT
cd $PROJECT
#建立文件夹并进入
mkdir -p evaluation
cd evaluation
#把组装文件复制过来,应该是文件比较小就不链接了
cp $PROJECT/assembly/tara135_SRF_megahit.fasta ./
#使用transrate产生组装统计数据,这里我没有加环境变量,直接绝对路径运行了
./transrate-1.0.3-osx/transrate -h #查看帮助,测试软件可用
#统计
time ./transrate-1.0.3-osx/transrate --assembly tara135_SRF_megahit.fasta
'''
[ INFO] 2020-03-15 14:15:45 : Loading assembly: /Users/zd200572/work/evaluation/evaluation/tara135_SRF_megahit.fasta
[ INFO] 2020-03-15 14:15:46 : Analysing assembly: /Users/zd200572/work/evaluation/evaluation/tara135_SRF_megahit.fasta
[ INFO] 2020-03-15 14:15:46 : Results will be saved in /Users/zd200572/work/evaluation/evaluation/transrate_results/tara135_SRF_megahit
[ INFO] 2020-03-15 14:15:46 : Calculating contig metrics...
[ INFO] 2020-03-15 14:15:48 : Contig metrics:
[ INFO] 2020-03-15 14:15:48 : -----------------------------------
[ INFO] 2020-03-15 14:15:48 : n seqs                        16707
[ INFO] 2020-03-15 14:15:48 : smallest                        202
[ INFO] 2020-03-15 14:15:48 : largest                        9085
[ INFO] 2020-03-15 14:15:48 : n bases                     8656583
[ INFO] 2020-03-15 14:15:48 : mean len                     518.14
[ INFO] 2020-03-15 14:15:48 : n under 200                       0
[ INFO] 2020-03-15 14:15:48 : n over 1k                       909
[ INFO] 2020-03-15 14:15:48 : n over 10k                        0
[ INFO] 2020-03-15 14:15:48 : n with orf                     5531
[ INFO] 2020-03-15 14:15:48 : mean orf percent              84.25
[ INFO] 2020-03-15 14:15:48 : n90                             331
[ INFO] 2020-03-15 14:15:48 : n70                             408
[ INFO] 2020-03-15 14:15:48 : n50                             520
[ INFO] 2020-03-15 14:15:48 : n30                             711
[ INFO] 2020-03-15 14:15:48 : n10                            1208
[ INFO] 2020-03-15 14:15:48 : gc                              0.5
[ INFO] 2020-03-15 14:15:48 : bases n                           0
[ INFO] 2020-03-15 14:15:48 : proportion n                    0.0
[ INFO] 2020-03-15 14:15:48 : Contig metrics done in 2 seconds
[ INFO] 2020-03-15 14:15:48 : No reads provided, skipping read diagnostics
[ INFO] 2020-03-15 14:15:48 : No reference provided, skipping comparative diagnostics
[ INFO] 2020-03-15 14:15:48 : Writing contig metrics for each contig to /Users/zd200572/work/evaluation/evaluation/transrate_results/tara135_SRF_megahit/contigs.csv
[ INFO] 2020-03-15 14:15:48 : Writing analysis results to assemblies.csv

real    0m4.218s
user    0m3.837s
sys     0m0.291s
'''
ln -s ../assembly/tara_f135_full_megahit.fasta
# full vs. subset 因为这里没有全部的,这步没做
./transrate-1.0.3-osx/transrate --reference=tara_f135_full_megahit.fasta --assembly=tara135_SRF_megahit.fasta --output=full_v_subset

# subset vs. full 因为这里没有全部的,这步没做
./transrate-1.0.3-osx/transrate --assembly=tara_f135_full_megahit.fasta --reference=tara135_SRF_megahit.fasta --output=subset_v_full

用salmon量化reads

这个组装对我们测序reads的代表性如何,我们将使用鲑鱼来量化表达。Salmon 是一种用于量化 RNAseq reads的新软件,它既运行快,又考虑了转录本的长度(Patro等人,2017 年)。

您可以在此处阅读更多有关salmon类似的"pseudoalignment":

  • 博客文章简介: http://robpatro.com/blog/?p=248
  • 2016年博客文章评估和比较方法在这里
  • salmonGithub
  • 2018年论文:[A direct comparison of genome alignment and transcriptome pseudoalignment](https://www.biorxiv.org/content/early/2018/10/16/444620)
#首先链接文件
ln -s ${PROJECT}/trim/TARA_135_SRF_5-20_*.qc.fq.gz ./ 
#用salmon进行定量
#检查salmon是否可用,并查看运行选项:
salmon -h
'''
salmon v1.1.0

Usage:  salmon -h|--help or
        salmon -v|--version or
        salmon -c|--cite or
        salmon [--no-version-check]  [-h | options]

Commands:
     index Create a salmon index
     quant Quantify a sample
     alevin single cell analysis
     swim  Perform super-secret operation
     quantmerge Merge multiple quantifications into a single file
'''
#set -u 让你知道是否有任何未设置的变量,即如果变量未定义。
set -u
printf "\nMy trimmed data is in $PROJECT/trim/, and consists of $(ls -1 ${PROJECT}/trim/*.qc.fq.gz | wc -l) files\n\n"
set +u
cd ${PROJECT}
mkdir -p quant
cd quant
#文章中原文使用的是全部的数据集,因为手上没有,就用前面组装的那个代替了 
ln -s  $PROJECT/assembly/tara135_SRF_megahit.fasta ./tara_f135_full_megahit.fasta
#首先 索引
salmon index --index tara135 --transcripts tara_f135_full_megahit.fasta
'''
index ["tara135"] did not previously exist  . . . creating it
[2020-03-15 15:18:00.055] [jLog] [info] building index
out : tara135
[2020-03-15 15:18:00.077] [puff::index::jointLog] [info] Running fixFasta
[Step 1 of 4] : counting k-mers
counted k-mers for 10000 transcripts
[2020-03-15 15:18:00.705] [puff::index::jointLog] [info] Replaced 0 non-ATCG nucleotides
[2020-03-15 15:18:00.705] [puff::index::jointLog] [info] Clipped poly-A tails from 22 transcripts
[2020-03-15 15:18:01.220] [puff::index::jointLog] [info] ntHll estimated 7962681 distinct k-mers, setting filter sizeto 2^28
Threads = 2
Vertex length = 31
Hash functions = 5
Filter size = 268435456
Capacity = 2
Files:
tara135/ref_k31_fixed.fa
--------------------------------------------------------------------------------
Round 0, 0:268435456
Pass    Filling Filtering
1       3       87
2       76      0
True junctions count = 13582
False junctions count = 42694
Hash table size = 56276
Candidate marks count = 75419
--------------------------------------------------------------------------------
Reallocating bifurcations time: 0
True marks count: 61643
Edges construction time: 49
--------------------------------------------------------------------------------
Distinct junctions = 13582

approximateContigTotalLength: 7171187
counters:
245 39 48 2
contig count: 34564 element count: 9003711 complex nodes: 334
# of ones in rank vector: 34563
[2020-03-15 15:21:38.689] [puff::index::jointLog] [info] Setting the index/BinaryGfa directory tara135
size = 9003711
-----------------------------------------
| Loading contigs | Time = 2.9299 ms
-----------------------------------------
size = 9003711
-----------------------------------------
| Loading contig boundaries | Time = 2.6109 ms
-----------------------------------------
Number of ones: 34563
Number of ones per inventory item: 512
Inventory entries filled: 68
[2020-03-15 15:21:38.735] [puff::index::jointLog] [info] Done wrapping the rank vector with a rank9sel structure.
[2020-03-15 15:21:38.761] [puff::index::jointLog] [info] contig count for validation: 34563
[2020-03-15 15:21:38.792] [puff::index::jointLog] [info] Total # of Contigs : 34,563
[2020-03-15 15:21:38.792] [puff::index::jointLog] [info] Total # of numerical Contigs : 34,563
[2020-03-15 15:21:38.839] [puff::index::jointLog] [info]
Total # of segments we have position for : 34,563
[2020-03-15 15:21:38.845] [puff::index::jointLog] [info] total contig vec entries 45,794
[2020-03-15 15:21:38.845] [puff::index::jointLog] [info] bits per offset entry 16
[2020-03-15 15:21:38.896] [puff::index::jointLog] [info] there were 21,081  equivalence classes
[2020-03-15 15:21:38.952] [puff::index::jointLog] [info] # segments = 34,563
[2020-03-15 15:21:38.952] [puff::index::jointLog] [info] total length = 9,003,711
[2020-03-15 15:21:38.962] [puff::index::jointLog] [info] Reading the reference files ...
[2020-03-15 15:21:39.035] [puff::index::jointLog] [info] positional integer width = 24
[2020-03-15 15:21:39.035] [puff::index::jointLog] [info] seqSize = 9,003,711
[2020-03-15 15:21:39.035] [puff::index::jointLog] [info] rankSize = 9,003,711
[2020-03-15 15:21:39.035] [puff::index::jointLog] [info] edgeVecSize = 0
[2020-03-15 15:21:39.035] [puff::index::jointLog] [info] num keys = 7,966,821
for info, total work write each  : 2.331    total work inram from level 3 : 4.322  total work raw : 25.000
[Building BooPHF]  100  %   elapsed:   0 min 4  sec   remaining:   0 min 0  sec
Bitarray        41749312  bits (100.00 %)   (array + ranks )
final hash             0  bits (0.00 %) (nb in final hash 0)
[2020-03-15 15:21:43.505] [puff::index::jointLog] [info] mphf size = 4.97691 MB
[2020-03-15 15:21:43.514] [puff::index::jointLog] [info] chunk size = 4,501,856
[2020-03-15 15:21:43.516] [puff::index::jointLog] [info] chunk 0 = [0, 4,501,856)
[2020-03-15 15:21:43.525] [puff::index::jointLog] [info] chunk 1 = [4,501,856, 9,003,681)
[2020-03-15 15:21:44.832] [puff::index::jointLog] [info] finished populating pos vector
[2020-03-15 15:21:44.832] [puff::index::jointLog] [info] writing index components
[2020-03-15 15:21:51.498] [puff::index::jointLog] [info] finished writing dense pufferfish index
[2020-03-15 15:21:51.507] [jLog] [info] done building index
'''
#然后,比对定量
 ln -s ../trim/*.qc.fq.gz ./
for sample in *1.qc.fq.gz
do
  base=$(basename $sample _1.qc.fq.gz)
  echo $base
  salmon quant -i tara135 -p 2 -l A -1 ${base}_1.qc.fq.gz -2 ${base}_2.qc.fq.gz -o ${base}_quant
done 
'''
将建立一系列文件夹,每个包含几个文件,
   aux_info
   cmd_info.json
   lib_format_counts.json
   libParams
   logs
   quant.sf
#以下是输出
TARA_135_DCM_5-20_rep1_1m
Version Info: Could not resolve upgrade information in the alotted time.
Check for upgrades manually at https://combine-lab.github.io/salmon
### salmon (mapping-based) v1.1.0
### [ program ] => salmon
### [ command ] => quant
### [ index ] => { tara135 }
### [ threads ] => { 2 }
### [ libType ] => { A }
### [ mates1 ] => { TARA_135_DCM_5-20_rep1_1m_1.qc.fq.gz }
### [ mates2 ] => { TARA_135_DCM_5-20_rep1_1m_2.qc.fq.gz }
### [ output ] => { TARA_135_DCM_5-20_rep1_1m_quant }
......
[2020-03-15 15:23:42.811] [jointLog] [info] Mapping rate = 7.51527%
[2020-03-15 15:23:42.811] [jointLog] [info] finished quantifyLibrary()
[2020-03-15 15:23:42.811] [jointLog] [info] Starting optimizer
[2020-03-15 15:23:42.838] [jointLog] [info] Marked 0 weighted equivalence classes as degenerate
[2020-03-15 15:23:42.839] [jointLog] [info] iteration = 0 | max rel diff. = 389.945
[2020-03-15 15:23:43.007] [jointLog] [info] iteration = 100 | max rel diff. = 4.15249e-16
[2020-03-15 15:23:43.010] [jointLog] [info] Finished optimizer
[2020-03-15 15:23:43.010] [jointLog] [info] writing output
'''
#查看下contig表达的相关信息
 head -20 TARA_135_SRF_5-20_rep1_1m_quant/quant.sf 
 '''
 Name    Length  EffectiveLength TPM     NumReads
k119_0  520     307.784 37.218819       6.000
k119_1  365     183.149 31.273300       3.000
k119_8414       1160    975.533 27.399576       14.000
k119_2  585     380.586 45.148977       9.000
k119_3  429     245.318 38.913383       5.000
k119_8415       971     786.533 38.838355       16.000
k119_8416       391     208.092 64.224392       7.000
k119_4  1078    878.533 14266.484104    6564.736
k119_8417       434     250.228 30.519799       4.000
k119_8418       419     235.507 32.427598       4.000
k119_5  881     696.533 38.374625       14.000
k119_8419       308     130.702 58.429974       4.000
k119_6  341     160.515 59.471764       5.000
k119_8420       411     227.646 83.868114       10.000
k119_7  365     183.149 106.604899      10.226
k119_8421       818     633.533 75.340516       25.000
k119_8422       321     142.271 67.098405       5.000
k119_8423       371     188.900 70.749454       7.000
k119_8424       491     306.791 24.892887       4.000
'''
#查看下总比对率,因为没有真正的数据,所以比对率极低。。。。。。
less TARA_135_SRF_5-20_rep1_1m_quant/logs/salmon_quant.log
grep Mapping *quant/logs/*
'''
TARA_135_DCM_5-20_rep1_1m_quant/logs/salmon_quant.log:[2020-03-15 15:23:15.464] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
...
TARA_137_DCM_5-20_rep2_1m_quant/logs/salmon_quant.log:[2020-03-15 15:27:18.742] [jointLog] [info] Mapping rate = 1.65539%
'''

评估读取映射的另一种方法

Transrate 实际上具有一种用于将reads"比对"到转录组的模式,并在读取映射上生成一些指标。read assessmentsalmon

./transrate --assembly=tara135_SRF_megahit.fasta --threads=2 --left=*_1.qc.fq.gz --right *_2.qc.fq.gz --output=${PROJECT}/evaluation/tara135_SRF_transrate

其他有用的教程和参考资料

  • https://github.com/ngs-docs/2015-nov-adv-rna/blob/master/salmon.rst
  • http://angus.readthedocs.io/en/2016/rob_quant/tut.html
  • https://2016-aug-nonmodel-rnaseq.readthedocs.io/en/latest/quantification.html

你可能感兴趣的:(宏转录组学习笔记(二))