Introduction

How Snakemake works

Recommended project layout

├── config.yaml
├── environment.yaml
├── scripts
│   ├── script1.py
│   └── script2.R
└── Snakefile

Install

conda create -n py35 python=3.5
source activate py35
conda install -c bioconda -c conda-forge snakemake

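Install Snakemake with conda into a Python 3 environment.

After that, if python3 works normally and is reachable (check with which python3), snakemake can be run directly, without activating the py35 environment.

If python3 is not available, activate the py35 environment first and then run snakemake.

Usage

Common options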

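--snakefile, -s specify the Snakefile; defaults to the file named Snakefile in the current directory
--dryrun, -n do not execute anything; typically used to check the Snakefile for errors
--printshellcmds, -p print the shell commands that will be executed
--reason, -r print the reason each rule is executed (off by default)
--cores, --jobs, -j number of cores to use; when the flag is given without a number, all available cores are used
--force, -f re-run the first rule or the specified rule, or produce a specified output file (given by its full path)
--forceall, -F re-run all rules, whether or not their outputs already exist
--forcerun, -R re-run the specified rule and all downstream jobs; no target is needed, but the rule's outputs must be required by rule all
--keep-going, -k go on with independent jobs if a job fails
snakemake --dag | dot -Tpdf > dag.pdf visualize the DAG of jobs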

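Forcing a run with -f

Rule whose output has no wildcards: -f rulename
Rule whose output has wildcards: -f outputfile

If a rule has several outputs, naming one of them is enough. Give the full path of the output file. For example, if the rule's output uses wildcards, as below: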

join(outdir0, "{sample}_1.fastq"), join(outdir0, "{sample}_2.fastq")

then you can name a target, that is, the output file of one particular sample:

snakemake -np -f projects/02snakeReq/00.prepare/B_1.fastq

Cluster submission

snakemake --cluster "qsub -V -cwd -q v01" -j 10
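--cluster: the command used to submit jobs to the cluster
qsub -V -cwd -q: export the current environment variables (-V), run in the current directory (-cwd), and submit to the given queue (-q); if no queue is given, any available queue is used
--local-cores N: in cluster mode, use at most N cores of the host machine in parallel (for rules that run locally)
--cluster-config/-u FILE: cluster configuration file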

snakemake -j 99 --debug --immediate-submit --cluster-config cluster.json --cluster 'bsub_cluster.py {dependencies}'

snakemake --cluster "qsub -q res -cwd  -o logs -e logs" --jobs 32

the maximum number of jobs to be queued or executed at the same time

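At most 32 jobs are queued or running at the same time. Within what the available CPUs allow, a larger --jobs value makes the workflow finish sooner.

Examples

Everyday commands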

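# Show the whole workflow (dry run, printing shell commands)
snakemake -np
# Specify the Snakefile
snakemake --snakefile myfile.py
# Re-run rule dedup_realn (dry run); note that any step Snakemake considers necessary (inferred from the output files of rule all) will still run
snakemake --snakefile res-snake.py -np -R dedup_realn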

After modifying the code, re-run the rules that changed

snakemake -n -R `snakemake --list-input-changes`
snakemake -n -R `snakemake --list-params-changes`
snakemake -n -R `snakemake --list-code-changes`

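Deleting the files Snakemake produced

Only files declared in a rule's output are deleted. It is best to confirm with --dry-run first. There is also --delete-temp-output, which works the same way for temp files.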

 snakemake some_target --delete-all-output

snakemake index --delete-all-output -np --snakefile res-snake.py

Building DAG of jobs...
Would delete /02snakeReq/00.prepare/ref/salmonella.dict
Would delete /02snakeReq/00.prepare/ref/salmonella.fa

When the workflow is too large, skip the per-job output and show only the final summary

snakemake -n --quiet

Enable zsh auto-completion for snakemake by putting the following line in ~/.zshrc:
compdef _gnu_generic snakemake

Partial runs

## Every file related to this rule is re-run
./pyflow-ChIPseq -R call_peaks_macs2

## Re-run a single sample by naming its target file
./pyflow-ATACseq -R 04aln/m280.sorted.bam

## Re-run only the align rule
./pyflow-ATACseq -f align

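-f
Forced run: upstream files that were not regenerated are left alone; as long as the old files exist, the rule can run.

A rule with wildcards cannot be run by name with -f: since rule all is not evaluated, the sample cannot be inferred.

It can, however, be run with -f by naming one of its output files.

Inspecting results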

snakemake --summary | sort -k1,1 | less -S

# or: the detailed summary also shows the command used to generate each output and the input it used
snakemake --detailed-summary | sort -k1,1 > snakemake_run_summary.txt

output_file date rule version log-file(s) status plan

Other features

Referring to values declared in another rule

rule salmon_quant:
    input:
        r1 = lambda wildcards: FILES[wildcards.sample]['R1'],
        r2 = lambda wildcards: FILES[wildcards.sample]['R2'],
        index = rules.salmon_index.output.index
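
To make the input functions above concrete, here is a minimal Python sketch of how such a lambda resolves. The contents of FILES, the file paths, and the sample name are assumptions made up for illustration; SimpleNamespace merely stands in for the wildcards object that Snakemake passes to an input function.

```python
from types import SimpleNamespace

# Hypothetical sample table: sample name -> paths of the paired reads.
FILES = {
    "A": {"R1": "raw/A_R1.fastq.gz", "R2": "raw/A_R2.fastq.gz"},
}

# Same shape as the input functions in the rule above.
r1 = lambda wildcards: FILES[wildcards.sample]["R1"]
r2 = lambda wildcards: FILES[wildcards.sample]["R2"]

# Stand-in for the wildcards object of a job whose sample is "A".
wc = SimpleNamespace(sample="A")
print(r1(wc), r2(wc))
```

The lookup happens per job, which is why the sample name does not need to be known when the rule is written.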

protected

Marks files that must be kept: the output is write-protected once it has been created, e.g. output: protected("f1.bam").

expand :

Equivalent to a list comprehension.
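
The equivalence can be sketched in plain Python. The helper expand_like below is a hypothetical stand-in written for this note, not Snakemake's own expand; it assumes the default behaviour of combining every value of every wildcard, and the sample and read names are made up.

```python
from itertools import product

def expand_like(pattern, **wildcards):
    """Hypothetical stand-in for expand(): fill the pattern with every combination."""
    keys = list(wildcards)
    return [
        pattern.format(**dict(zip(keys, combo)))
        for combo in product(*(wildcards[k] for k in keys))
    ]

samples = ["A", "B"]
reads = ["1", "2"]

paths = expand_like("{sample}_{read}.fastq", sample=samples, read=reads)

# The same list written directly as a list comprehension.
same = ["{}_{}.fastq".format(s, r) for s in samples for r in reads]

print(paths)
```

Both expressions yield the four file names A_1.fastq, A_2.fastq, B_1.fastq, B_2.fastq in that order.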

Email notification

config

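Config values can be set on the snakemake command line, where they override the same keys in config.yaml. A value is interpreted as a number first; if it must be treated as a string, wrap it in quotes.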

snakemake mytarget --config foo=bar
snakemake --config 'version="2018_1"'
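
Why the quotes matter can be illustrated with a small Python sketch of "try a number first, then fall back to a string". The function parse_config_value is a hypothetical helper written for this note and is not Snakemake's actual parser; it only mimics the behaviour described above.

```python
def parse_config_value(text):
    """Hypothetical helper: try numeric types first, then fall back to a string."""
    for parser in (int, float):
        try:
            return parser(text)
        except ValueError:
            pass
    # One pair of surrounding quotes marks the value as a string.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        return text[1:-1]
    return text

print(parse_config_value("bar"))       # plain word: stays a string
print(parse_config_value("2018_1"))    # Python's int() accepts underscores
print(parse_config_value('"2018_1"'))  # quoted: kept as the string 2018_1
```

Because Python's int() accepts underscores between digits, an unquoted 2018_1 becomes the number 20181, which is presumably why the version string in the command above is quoted.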

priorities

temp

temp: wrapping an output in temp() lets Snakemake delete that intermediate file once it is no longer needed, e.g. output: temp("f1.bam").

Further, an output file marked as temp is deleted after all rules that use it as an input are completed:

rule NAME:
    input:
        "path/to/inputfile"
    output:
        temp("path/to/outputfile")
    shell:
        "somecommand {input} {output}"

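Errors and fixes

Escaping

When using awk in a rule's shell section, note that braces must be doubled and the tab escape must be escaped once more.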

grep -v "#"  ${{smp}}_merged.sv.vcf|awk '
            BEGIN{{FS=OFS="\\t"}}
            $10!~/NaN/{{$10=NR}}
            ...
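
The doubling rule can be checked in plain Python, since the shell string is a Python string that then has its {...} fields filled in. The sketch below uses str.format as a stand-in for Snakemake's own formatting; the awk program and the file name in.vcf are made-up examples.

```python
# A shell template as it would be written inside a rule: literal awk braces
# are doubled, and the tab is written with two backslashes in the source.
template = "awk 'BEGIN{{FS=OFS=\"\\t\"}} {{print $1}}' {input}"

# str.format stands in here for Snakemake's formatting of the command.
command = template.format(input="in.vcf")
print(command)
```

After formatting, the doubled braces have collapsed to single ones and awk receives the two characters backslash and t, which it interprets as a tab itself.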

Directories as output

Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory().

ref_split = directory(join(outdir0, "ref/ref_split"))

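What if the output is a directory: does Snakemake create it automatically?
Apparently not. It seems directories are only created automatically when the output is a file.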

Unable to set utime on symlink /projects/02snakeReq/00.prepare/ref/salmonella.fa. Your Python build does not support it.

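Declare output files so that Snakemake can work out the order of the analysis.
An output must be a file rather than a directory (unless it is flagged with directory()).
The input of a later rule must pick up the output of an earlier rule. I do not yet know whether those files must also be listed in rule all.
The output of snp_filter connects to snp_indel_anno.
The output of annovar_db connects to snp_indel_anno.

Errors

Check that the outputs of the rules and the inputs of rule all correspond one to one.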

wildcards

Inside shell, {sample} must be written as {wildcards.sample}.
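
A plain-Python analogue shows why the bare name fails. Here str.format and SimpleNamespace stand in for Snakemake's formatting and its wildcards object; the command text and sample name are assumptions, and Snakemake reports its own error message rather than the KeyError seen here.

```python
from types import SimpleNamespace

wildcards = SimpleNamespace(sample="A")  # assumed sample name

# Works: the field is looked up as an attribute of the wildcards object.
ok = "bwa mem ref.fa {wildcards.sample}_1.fastq".format(wildcards=wildcards)

# Fails: no top-level name called "sample" is passed to the formatter.
try:
    "bwa mem ref.fa {sample}_1.fastq".format(wildcards=wildcards)
    failed = False
except KeyError:
    failed = True

print(ok, failed)
```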

MissingOutputException

Waiting at most 50 seconds for missing files.
MissingOutputException in line 525 of projects/02snakeReq/res-snake.py:
Missing files after 50 seconds:
projects/03svTest/04.sv/D-korps6_svaba/D-korps6.sv.vcf

Every file declared in a rule's output must actually be among the files the job produces.

KeyError

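Result files of different rules must not have exactly the same name (this applies to the files listed in input and output).

Example: under 05.sv there are lumpy and manta directories, each containing an A.sv.vcf file, and later an A.sv.vcf is also written directly under 05.sv. These three files belong to different rules, and the sample wildcard gets resolved incorrectly.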

KeyError: 'A-korps6_pindel/A-korps6'
Wildcards:
sample=A-korps6_pindel/A-korps6
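
The wrong sample value above can be reproduced with ordinary regular expressions, because an unconstrained wildcard behaves like the pattern .+ and may therefore swallow a directory name. The sketch below is an illustration only: FILES, its contents, and the helper match are hypothetical, and the regexes are hand-written equivalents of a {sample}.sv.vcf pattern.

```python
import re

# Hypothetical sample table; only plain sample names are keys.
FILES = {"A-korps6": "raw/A-korps6.bam"}

def match(pattern_regex, path):
    """Return the value captured for 'sample', or None if the path does not match."""
    m = re.fullmatch(pattern_regex, path)
    return m.group("sample") if m else None

path = "A-korps6_pindel/A-korps6.sv.vcf"

# Unconstrained wildcard: behaves like ".+", so it can include the directory.
greedy = match(r"(?P<sample>.+)\.sv\.vcf", path)

# Constrained wildcard: "[^/]+" cannot cross a slash, so this path is rejected.
strict = match(r"(?P<sample>[^/]+)\.sv\.vcf", path)

print(greedy, greedy in FILES, strict)
```

In a Snakefile, one way to get the stricter behaviour is a wildcard constraint such as {sample,[^/]+} or a wildcard_constraints entry; giving the files of each rule distinct names, as suggested above, avoids the ambiguity altogether.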

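Re-running the workflow
When a re-run would overwrite files produced earlier, a ProtectedOutputException may be raised. This happens for outputs that were declared with protected().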

ProtectedOutputException in line 114 of Resequencing/03.SNP_INDEL_detection_v2/snakefile_v2.py:
Write-protected output files for rule index:
/projects/04liver_cancer/00.prepare/ref/hg19.fa

