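I recently needed to run a QC pipeline and used the sunbeam QC workflow. In the resulting preprocess.tsv file, the number of reads in the decontam fastq files did not match the recorded non-host count. It looked like host reads had not been removed from the generated fastq files. Tracing the problem led to the decontamination rule: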
cat /programs/sunbeam-stable/rules/qc/decontaminate.rules
rule filter_reads:
    input:
        hostreads = str(QC_FP/'decontam'/'intermediates'/'{sample}_hostreads.ids'),
        reads = str(QC_FP/'cleaned'/'{sample}_{rp}.fastq.gz'),
        hostids = expand(str(QC_FP/'decontam'/'intermediates'/'{host}'/'{{sample}}.ids'), host=HostGenomes.keys())
    output:
        reads = str(QC_FP/'decontam'/'{sample}_{rp}.fastq.gz'),
        log = str(QC_FP/'log'/'decontam'/'{sample}_{rp}.txt')
    run:
        original = int(str(subprocess.getoutput(
            "zcat {} | wc -l".format(input.reads))).strip())//4
        host = int(subprocess.getoutput(
            "cat {} | wc -l".format(input.hostreads)).strip())
        nonhost = int(original-host)
        shell("""
        gzip -dc {input.reads} | \
        rbt fastq-filter {input.hostreads} | \
        gzip > {output.reads}
        """)
        hostdict = OrderedDict()
        for hostid in input.hostids:
            hostname = os.path.basename(os.path.dirname(hostid))
            hostcts = int(subprocess.getoutput("cat {} | wc -l".format(hostid)).strip())
            hostdict[hostname] = hostcts
        with open(output.log, 'w') as log:
            log.write("{}\n".format("\t".join(list(hostdict.keys()) + ["host","nonhost"] )))
            log.write("{}\n".format("\t".join( map(str, list(hostdict.values()) + [host, nonhost]) )))
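The filter_reads rule works from a generated ID file, which lists the IDs of all reads that mapped to a host genome. Note that the nonhost count written to the log is computed as original - host rather than counted from the filtered output, so the log can disagree with the fastq file that is actually produced.
The ID file looks like this: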
A00709:489:HCK2KDSX5:2:1101:49781:1000
A00709:489:HCK2KDSX5:2:1101:81782:1000
A00709:489:HCK2KDSX5:2:1101:89383:1000
A00709:489:HCK2KDSX5:2:1101:122644:1000
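The rule then calls rbt (https://github.com/rust-bio/rust-bio-tools/blob/master/README.md)
to remove the reads listed in the given ID file.
I entered the corresponding docker environment, took the first 100 lines of the input file, and ran the command. It finished without an error: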
cat D0head100.fastq | rbt fastq-filter D0head100.ids |wc -l
rbt: Relink `/opt/conda/envs/sunbeam/bin/../lib/./libgfortran.so.5' with `/lib/x86_64-linux-gnu/librt.so.1' for IFUNC symbol `clock_gettime'
100
So nothing was removed: the output is still 100 lines.
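A quick way to check whether the ID file and the fastq file agree at all is to count how many record IDs in the fastq appear in the ID file. The sketch below is only an illustration: it assumes a record ID is the header text after `@` up to the first whitespace (which is how I understand rbt to compare IDs, not something confirmed here), and the reads in the demo are toy data, not the real sample.

```python
def record_id(header):
    """Return the header text after '@', up to the first whitespace (assumed ID rule)."""
    text = header[1:] if header.startswith("@") else header
    parts = text.split()
    return parts[0] if parts else ""


def fastq_ids(lines):
    """Yield record IDs from an iterable of FASTQ lines (4 lines per record)."""
    for i, line in enumerate(lines):
        if i % 4 == 0:
            yield record_id(line.rstrip("\n"))


def count_matches(fastq_lines, id_lines):
    """Count FASTQ records whose ID appears verbatim in the ID list."""
    wanted = {line.strip() for line in id_lines if line.strip()}
    return sum(1 for rid in fastq_ids(fastq_lines) if rid in wanted)


# Toy data: two records and one host ID.
demo_fastq = [
    "@A00709:489:HCK2KDSX5:2:1101:49781:1000/1\n", "ACGT\n", "+\n", "FFFF\n",
    "@A00709:489:HCK2KDSX5:2:1101:13801:1031/1\n", "ACGT\n", "+\n", "FFFF\n",
]
demo_ids = ["A00709:489:HCK2KDSX5:2:1101:49781:1000\n"]

# A count of 0 points to an ID format mismatch rather than "no host reads".
print(count_matches(demo_fastq, demo_ids))
```

On real files, the same functions can be fed `open("D0head100.fastq")` and `open("D0head100.ids")` directly.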
Looking at the fastq headers, I found:
@A00709:489:HCK2KDSX5:2:1101:13801:1031/1
AAGTGTGTATTTCTCATTTTCCGTGATTTTCAGTTTTCTCGCCATATTTCAGGTCCTACAGTGTGCATTTCTCATTTTTCACATTTTTCATTGATTTCGTCATTTTTCAAGGGTTCAAGTGGTTGTTTCTCATTTTTAATGATTTTCATT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF,FFFFFFFFFFFF:FFFFFFFFFFFF:FFFFF:FFFFFFFFFFF,F,FFFFFFF:FFFFFFFFF:,FFFF,FF,FFF,FFF,:
@A00709:489:HCK2KDSX5:2:1101:14109:1031/1
GACTCATACCTTCAATCTCAGCCTTTAGGAGGCAGAGGCAGATGCATGTC
+
FFFFFFFF,FFFFFFFFFFF:F:F:FFF:FFFFFFFFF::FF,FFF,F,F
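The headers carry a /1 or /2 suffix. My guess was that this is why rbt does not recognize the IDs, so I edited the ID file and appended /1 by hand.
The new ID file: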
A00709:489:HCK2KDSX5:2:1101:49781:1000/1
A00709:489:HCK2KDSX5:2:1101:81782:1000/1
A00709:489:HCK2KDSX5:2:1101:89383:1000/1
A00709:489:HCK2KDSX5:2:1101:122644:1000/1
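The manual edit can also be scripted. This is a minimal sketch of appending the mate suffix to every ID; the function name `add_mate_suffix` is hypothetical and is not part of sunbeam or rbt.

```python
def add_mate_suffix(id_lines, mate):
    """Return the IDs with '/<mate>' appended, skipping blanks and IDs that already have it."""
    suffix = "/{}".format(mate)
    result = []
    for line in id_lines:
        rid = line.strip()
        if not rid:
            continue
        result.append(rid if rid.endswith(suffix) else rid + suffix)
    return result


ids = [
    "A00709:489:HCK2KDSX5:2:1101:49781:1000\n",
    "A00709:489:HCK2KDSX5:2:1101:81782:1000\n",
]
for rid in add_mate_suffix(ids, 1):
    print(rid)
```

For paired-end data the same ID list would need a /1 version for the forward reads and a /2 version for the reverse reads.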
Running the same command again:
(sunbeam) user@84ed9ba9a5a3:/analysis$ cat D0head100.fastq | rbt fastq-filter D0head100.ids |wc -l
rbt: Relink `/opt/conda/envs/sunbeam/bin/../lib/./libgfortran.so.5' with `/lib/x86_64-linux-gnu/librt.so.1' for IFUNC symbol `clock_gettime'
84
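Removal worked: the output dropped from 100 to 84 lines, so 4 reads (16 lines) were removed.
So the bug is that the /1 or /2 is attached directly to the read name in the header, while the IDs in the ID file do not have it, so nothing matches. If the header separates the mate number with a space instead, filtering works.
The idea for this fix came from this issue: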
https://github.com/sunbeam-labs/sunbeam/issues/221
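As an alternative to editing the ID file, the comparison itself can ignore the mate suffix. The sketch below is a pure-Python stand-in for the filtering step, not sunbeam's fix and not how rbt is implemented; `normalized_id` and `filter_fastq` are hypothetical names, and it assumes uncompressed 4-line FASTQ records.

```python
import re

# Matches a trailing /1 or /2 on a read name.
MATE_SUFFIX = re.compile(r"/[12]$")


def normalized_id(header):
    """Return the read name without '@', description, or trailing /1 or /2."""
    text = header[1:] if header.startswith("@") else header
    parts = text.split()
    rid = parts[0] if parts else ""
    return MATE_SUFFIX.sub("", rid)


def filter_fastq(lines, host_ids):
    """Yield the lines of every 4-line record whose normalized ID is not a host ID."""
    host = {normalized_id(h) for h in host_ids if h.strip()}
    record = []
    for line in lines:
        record.append(line)
        if len(record) == 4:
            if normalized_id(record[0]) not in host:
                yield from record
            record = []


# Toy data: the first record is listed as host, the second is not.
demo = [
    "@A00709:489:HCK2KDSX5:2:1101:49781:1000/1\n", "ACGT\n", "+\n", "FFFF\n",
    "@A00709:489:HCK2KDSX5:2:1101:13801:1031/1\n", "ACGT\n", "+\n", "FFFF\n",
]
kept = list(filter_fastq(demo, ["A00709:489:HCK2KDSX5:2:1101:49781:1000"]))
print(len(kept))  # 4 lines remain: one record was removed
```

Normalizing both sides means the same ID file works for the /1 and /2 read files, and also for headers that put the mate number after a space.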