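The process is the basic unit of a Nextflow pipeline. To my mind the term is best rendered as "step" (or possibly "procedure"), so to keep things from sounding awkward these notes use "step" and "process" interchangeably.
A step starts with the process keyword and contains all the code for that step: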
process sayHello {
"""
echo 'Hello world!' > file
"""
}
Overall, a process definition is made up of five parts (directives, input, output, when, and the script itself):
process < name > {
[ directives ]
input:
< process inputs >
output:
< process outputs >
when:
< condition >
[script|shell|exec]:
< user script to be executed >
}
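To make the skeleton above concrete, here is a minimal sketch that fills in all five parts, written in the same DSL1 syntax as the rest of these notes. The data/ directory, the "sample" file name prefix, and the names countLines and counts are hypothetical, chosen only for illustration.

```nextflow
// Hypothetical source channel: every .fa file in a made-up data/ directory
sequences = Channel.fromPath('data/*.fa')

process countLines {
    // directives: settings that control how each task is executed
    cpus 2
    memory '1 GB'

    input:
    file fasta from sequences

    output:
    file 'count.txt' into counts

    // when: the task only runs if this condition is true
    when:
    fasta.name.startsWith('sample')

    // script: the command to execute, placed last
    script:
    """
    wc -l < $fasta > count.txt
    """
}

// Show the path of each result file as it arrives
counts.view { "Wrote: $it" }
```

Files whose names do not start with "sample" are skipped by the when condition, so no task is launched for them.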
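The script block
This block holds the commands the process executes. By default they are interpreted as Bash.
If the process declares inputs or outputs, the script block must come last.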
process doMoreThings {
"""
blastp -db $db -query query.fa -outfmt 6 > blast_result
cat blast_result | head -n 10 | cut -f 2 > top_hits
blastdbcmd -db $db -entry_batch top_hits > sequences
"""
}
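Triple double quotes give a multi-line string with variable interpolation, so $db above is replaced by the value of the Nextflow variable db.
To use system (shell environment) variables instead, use triple single quotes, which turn interpolation off: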
process printPath {
'''
echo The path is: $PATH
'''
}
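Another option is to escape the dollar sign with a backslash. Below, \$DB is left for the shell to expand, while $MAX is still interpolated by Nextflow: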
process doOtherThings {
"""
blastp -db \$DB -query query.fa -outfmt 6 > blast_result
cat blast_result | head -n $MAX | cut -f 2 > top_hits
blastdbcmd -db \$DB -entry_batch top_hits > sequences
"""
}
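Scripts in other languages, and mixing languages
The script block runs Bash by default, but a shebang line on the first line of the script lets you use another scripting language instead, such as Perl, Python, Ruby or R: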
process perlStuff {
"""
#!/usr/bin/perl
print 'Hi there!' . '\n';
"""
}
process pyStuff {
"""
#!/usr/bin/python
x = 'Hello'
y = 'world!'
print("%s - %s" % (x, y))
"""
}
Conditional scripts
seq_to_align = ...
mode = 'tcoffee'
process align {
input:
file seq_to_aln from sequences
script:
if( mode == 'tcoffee' )
"""
t_coffee -in $seq_to_aln > out_file
"""
else if( mode == 'mafft' )
"""
mafft --anysymbol --parttree --quiet $seq_to_aln > out_file
"""
else if( mode == 'clustalo' )
"""
clustalo -i $seq_to_aln -o out_file
"""
else
error "Invalid alignment mode: ${mode}"
}
Templates
In other words, you can write a script template once and reuse it from several processes:
process template_example {
input:
val STR from 'this', 'that'
script:
template 'my_script.sh'
}
The templates directory, next to the pipeline script, contains the file my_script.sh with the following content:
#!/bin/bash
echo "process started at `date`"
echo $STR
:
echo "process completed"
To test a template, you can run it directly in a shell terminal, supplying the variables it expects:
STR='foo' bash templates/my_script.sh
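The shell block
The shell block forces a shell context: $ keeps its normal shell meaning, and Nextflow variables are written with a ! prefix instead, as !{...}. The script has to be in single quotes: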
process myTask {
input:
val str from 'Hello', 'Hola', 'Bonjour'
shell:
'''
echo User $USER says !{str}
'''
}
Here $USER is a shell variable, while !{str} is a Nextflow variable.
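For comparison, the same task could be sketched as an ordinary script block, using the backslash escape from the earlier section. The process name myTaskScript is made up; the rest mirrors the example above.

```nextflow
process myTaskScript {
    input:
    val str from 'Hello', 'Hola', 'Bonjour'

    script:
    // Double quotes: ${str} is filled in by Nextflow,
    // while \$USER is escaped and left for the shell to expand
    """
    echo User \$USER says ${str}
    """
}
```

The shell block is mostly a convenience: once a script uses many shell variables, escaping each one gets tedious and easy to forget.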
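Native execution (the exec block)
Nextflow is itself an extension of Groovy, so a process can run Groovy code directly instead of a shell script: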
x = Channel.from( 'a', 'b', 'c')
process simpleSum {
input:
val x
exec:
println "Hello Mr. $x"
}
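Inputs
Nextflow processes are isolated from one another and communicate through channels. The input block defines where a process gets its data from. Each process can have only one input block, but that block can contain several input declarations, so multiple inputs are allowed.
The general syntax is:
input:
<input qualifier> <input name> [from <source channel>] [attributes]
Value inputs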
num = Channel.from( 1, 2, 3 )
process basicExample {
input:
val x from num
"echo process job $x"
}
This outputs the following (the order is not guaranteed, because the tasks run in parallel):
process job 3
process job 1
process job 2
When the input has the same name as the channel it reads from, the from clause can be omitted:
num = Channel.from( 1, 2, 3 )
process basicExample {
input:
val num
"echo process job $num"
}
File inputs
proteins = Channel.fromPath( '/some/path/*.fa' )
process blastThemAll {
input:
file query_file from proteins
"blastp -query ${query_file} -db nr"
}
Again, when the input name matches the channel name, from can be omitted:
proteins = Channel.fromPath( '/some/path/*.fa' )
process blastThemAll {
input:
file proteins
"blastp -query $proteins -db nr"
}
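You can also give the input file a fixed name, under which it is staged in the task's working directory. The script can then refer to that literal file name instead of a $ variable, which is handy when an external command expects a particular file name: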
input:
file query_file name 'query.fa' from proteins
or, more concisely:
input:
file 'query.fa' from proteins
In use:
proteins = Channel.fromPath( '/some/path/*.fa' )
process blastThemAll {
input:
file 'query.fa' from proteins
"blastp -query query.fa -db nr"
}
Multiple input files
fasta = Channel.fromPath( "/some/path/*.fa" ).buffer(size:3)
process blastThemAll {
input:
file 'seq' from fasta
"echo seq*"
}
This outputs:
seq1 seq2 seq3
seq1 seq2 seq3
...
The staged names can also be controlled with a wildcard pattern. Here each ? is replaced by the file's index within the group:
fasta = Channel.fromPath( "/some/path/*.fa" ).buffer(size:3)
process blastThemAll {
input:
file 'seq?.fa' from fasta
"cat seq1.fa seq2.fa seq3.fa"
}
Dynamic input file names
process simpleCount {
input:
val x from species
file "${x}.fa" from genomes
"""
cat ${x}.fa | grep '>'
"""
}
Inputs of type stdin
str = Channel.from('hello', 'hola', 'bonjour', 'ciao').map { it+'\n' }
process printAll {
input:
stdin str
"""
cat -
"""
}
This outputs:
hola
bonjour
ciao
hello
Inputs of type env
str = Channel.from('hello', 'hola', 'bonjour', 'ciao')
process printEnv {
input:
env HELLO from str
'''
echo $HELLO world!
'''
}
This outputs:
hello world!
ciao world!
bonjour world!
hola world!
Input of type 'set'
The set qualifier allows you to group multiple parameters in a single parameter definition. It is useful when a process receives tuples of values as input and the elements need to be handled separately. Each element in the tuple is associated with the corresponding element in the set definition. For example:
tuple = Channel.from( [1, 'alpha'], [2, 'beta'], [3, 'delta'] )
process setExample {
input:
set val(x), file('latin.txt') from tuple
"""
echo Processing $x
cat - latin.txt > copy
"""
}
In the above example the set parameter is used to define the value x and the file latin.txt, which will receive a value from the same channel.
In the set declaration, items can be defined by using the following qualifiers: val, env, file and stdin.
A shorter notation can be used by applying the following substitution rules:
long | short |
---|---|
val(x) | x |
file(x) | (not supported) |
file('name') | 'name' |
file(x:'name') | x:'name' |
stdin | '-' |
env(x) | (not supported) |
Thus the previous example could be rewritten as follows:
tuple = Channel.from( [1, 'alpha'], [2, 'beta'], [3, 'delta'] )
process setExample {
input:
set x, 'latin.txt' from tuple
"""
echo Processing $x
cat - latin.txt > copy
"""
}
File names can be defined dynamically, as explained in the Dynamic input file names section.
Input repeaters (a highlight!)
The each qualifier is an efficient way to repeat a step for every item of a collection, for example:
sequences = Channel.fromPath('*.fa')
methods = ['regular', 'expresso', 'psicoffee']
process alignSequences {
input:
file seq from sequences
each mode from methods
"""
t_coffee -in $seq -mode $mode > result
"""
}
This runs the alignment on every sequence file once for each of the three modes.
How multiple input channels work together
process foo {
echo true
input:
val x from Channel.from(1,2)
val y from Channel.from('a','b','c')
script:
"""
echo $x and $y
"""
}
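This prints only two lines, because items are paired one by one and execution stops once the shorter channel runs out (c is never consumed):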
1 and a
2 and b
whereas
process bar {
echo true
input:
val x from Channel.value(1)
val y from Channel.from('a','b','c')
script:
"""
echo $x and $y
"""
}
reuses the value 1 for every item, because a value channel can be read any number of times:
1 and a
1 and b
1 and c
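If every combination of the two inputs is wanted rather than one-to-one pairing, the each repeater from the previous section seems to be the natural fit. A minimal sketch in the same DSL1 syntax, with a made-up process name baz:

```nextflow
process baz {
    echo true

    input:
    val x from Channel.from(1, 2)
    // each: run the task once per list item, for every value of x
    each y from ['a', 'b', 'c']

    script:
    """
    echo $x and $y
    """
}
```

This should launch six tasks, one per combination of x and y, in no particular order.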
There is more to this....
Outputs
methods = ['prot','dna', 'rna']
process foo {
input:
val x from methods
output:
val x into receiver
"""
echo $x > file
"""
}
receiver.println { "Received: $it" }
process align {
input:
val x from species
file seq from sequences
output:
file "${x}.aln" into genomes
"""
t_coffee -in $seq > ${x}.aln
"""
}
When
Directive
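Closing remarks
After these two or three hours of reading the documentation, I think I have a working grasp of Nextflow. There is no need to keep reading the docs for now.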