踩坑sunbeam rbt 去除host reads

最近需要跑qc流程,用sunbeam的qc流程。然后再preprocess.tsv文件里,deconta文件的fastq数目和文件里记录的non host不一致。貌似生成的fastq文件没有去除host。找问题。溯源到decontamination rule

cat /programs/sunbeam-stable/rules/qc/decontaminate.rules
rule filter_reads:
    input:
        hostreads = str(QC_FP/'decontam'/'intermediates'/'{sample}_hostreads.ids'),
        reads = str(QC_FP/'cleaned'/'{sample}_{rp}.fastq.gz'),
        hostids = expand(str(QC_FP/'decontam'/'intermediates'/'{host}'/'{{sample}}.ids'), host=HostGenomes.keys())
    output:
        reads = str(QC_FP/'decontam'/'{sample}_{rp}.fastq.gz'),
        log = str(QC_FP/'log'/'decontam'/'{sample}_{rp}.txt')
    run:
        original = int(str(subprocess.getoutput(
            "zcat {} | wc -l".format(input.reads))).strip())//4
        host = int(subprocess.getoutput(
            "cat {} | wc -l".format(input.hostreads)).strip())
        nonhost = int(original-host)
        shell("""
        gzip -dc {input.reads} | \
        rbt fastq-filter {input.hostreads} | \
        gzip > {output.reads}
        """)
        
        hostdict = OrderedDict()
        for hostid in input.hostids:
            hostname = os.path.basename(os.path.dirname(hostid))
            hostcts = int(subprocess.getoutput("cat {} | wc -l".format(hostid)).strip())
            hostdict[hostname] = hostcts
        
        with open(output.log, 'w') as log:
            log.write("{}\n".format("\t".join(list(hostdict.keys()) + ["host","nonhost"] )))
            log.write("{}\n".format("\t".join( map(str, list(hostdict.values()) + [host, nonhost]) )))

可以发现这个filter_reads的rule,通过生成的id文件(这个id文件是所有map到host 基因组的reads ID号码。)

如下:

A00709:489:HCK2KDSX5:2:1101:49781:1000
A00709:489:HCK2KDSX5:2:1101:81782:1000
A00709:489:HCK2KDSX5:2:1101:89383:1000
A00709:489:HCK2KDSX5:2:1101:122644:1000
 

然后命令行是用https://github.com/rust-bio/rust-bio-tools/blob/master/README.md

rbt来去掉给定的id文件。 

我进入到它对应的docker环境,截取输入文件的前100行,执行不报错,

cat D0head100.fastq | rbt fastq-filter D0head100.ids |wc -l 
rbt: Relink `/opt/conda/envs/sunbeam/bin/../lib/./libgfortran.so.5' with `/lib/x86_64-linux-gnu/librt.so.1' for IFUNC symbol `clock_gettime'
100

所以没有成功去除还是一百行。

查看fastq的header,我发现

@A00709:489:HCK2KDSX5:2:1101:13801:1031/1
AAGTGTGTATTTCTCATTTTCCGTGATTTTCAGTTTTCTCGCCATATTTCAGGTCCTACAGTGTGCATTTCTCATTTTTCACATTTTTCATTGATTTCGTCATTTTTCAAGGGTTCAAGTGGTTGTTTCTCATTTTTAATGATTTTCATT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF,FFFFFFFFFFFF:FFFFFFFFFFFF:FFFFF:FFFFFFFFFFF,F,FFFFFFF:FFFFFFFFF:,FFFF,FF,FFF,FFF,:
@A00709:489:HCK2KDSX5:2:1101:14109:1031/1
GACTCATACCTTCAATCTCAGCCTTTAGGAGGCAGAGGCAGATGCATGTC
+
FFFFFFFF,FFFFFFFFFFF:F:F:FFF:FFFFFFFFF::FF,FFF,F,F

header文件是带有/1 &/2的。猜测是这个原因导致rbt不识别。更改id文件,手动加上/1。

新的id文件 

A00709:489:HCK2KDSX5:2:1101:49781:1000/1
A00709:489:HCK2KDSX5:2:1101:81782:1000/1
A00709:489:HCK2KDSX5:2:1101:89383:1000/1
A00709:489:HCK2KDSX5:2:1101:122644:1000/1

再次重复上述命令。

(sunbeam) user@84ed9ba9a5a3:/analysis$ cat D0head100.fastq | rbt fastq-filter D0head100.ids |wc -l 
rbt: Relink `/opt/conda/envs/sunbeam/bin/../lib/./libgfortran.so.5' with `/lib/x86_64-linux-gnu/librt.so.1' for IFUNC symbol `clock_gettime'
84

去除成功。4条reads去除成功。

所以bug就是这个header不能直接连接着/1和/2, 如果是space中间空着是可以的。

解决这个bug的灵感来源于这里。

https://github.com/sunbeam-labs/sunbeam/issues/221

你可能感兴趣的:(报错,生物信息)