Nextflow Processes (Part 4 of the Series)

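A process is the basic unit of a Nextflow pipeline. In Chinese, I think the most fitting rendering is probably 步骤 ("step"), or perhaps 工序 ("work procedure"). To avoid awkwardness, the rest of these notes treats a process as a step.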

A process definition starts with the keyword process and contains all of the code for that step:

process sayHello {

    """
    echo 'Hello world!' > file
    """

}

Overall, a process body has five parts: directives, input, output, a when condition, and the script itself.

process < name > {

   [ directives ]

   input:
    < process inputs >

   output:
    < process outputs >

   when:
    < condition >

   [script|shell|exec]:
   < user script to be executed >

}
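
As a rough illustration of how the five parts fit together, here is a minimal sketch in the same pre-DSL2 syntax used throughout these notes. The process name countLines, the channels samples and counts, and the data/*.txt glob are all made up for this example.

```groovy
// Hypothetical input channel; the glob is an assumption for illustration.
samples = Channel.fromPath('data/*.txt')

process countLines {
    // directives: optional settings that control how each task runs
    tag "$sample"
    cpus 1

    input:
    file sample from samples

    output:
    file 'count.txt' into counts

    when:
    // the task is skipped unless this expression is true
    sample.name.endsWith('.txt')

    script:
    """
    wc -l < $sample > count.txt
    """
}
```

Only the script is mandatory; the other four parts can be left out when they are not needed, as in the sayHello example.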
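
The script block

This block holds the commands that the process runs. By default it is interpreted as a Bash script.
If the process declares inputs or outputs, the script block must come last.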

process doMoreThings {

  """
  blastp -db $db -query query.fa -outfmt 6 > blast_result
  cat blast_result | head -n 10 | cut -f 2 > top_hits
  blastdbcmd -db $db -entry_batch top_hits > sequences
  """

}

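Triple double quotes allow multi-line strings with variable interpolation, so $db above is a Nextflow variable.
To use system environment variables instead, use triple single quotes, which turn interpolation off: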

process printPath {

   '''
   echo The path is: $PATH
   '''

}

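Another option is to keep the double quotes and escape shell variables with a backslash. Below, \$DB is left for Bash to resolve, while $MAX is interpolated by Nextflow: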

process doOtherThings {

  """
  blastp -db \$DB -query query.fa -outfmt 6 > blast_result
  cat blast_result | head -n $MAX | cut -f 2 > top_hits
  blastdbcmd -db \$DB -entry_batch top_hits > sequences
  """

}
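
Scripts in other languages, and mixing languages

The script block runs as Bash by default, but a shebang on its first line makes it run in another scripting language, such as Perl, Python, Ruby, or R: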

process perlStuff {

    """
    #!/usr/bin/perl

    print 'Hi there!' . '\n';
    """

}

process pyStuff {

    """
    #!/usr/bin/python

    x = 'Hello'
    y = 'world!'
    print("%s - %s" % (x, y))
    """

}
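
Conditional scripts

After the script: keyword, the body is evaluated as code that must return the command string, so it can choose the command with if/else:
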
seq_to_align = ...
mode = 'tcoffee'

process align {
    input:
    file seq_to_aln from sequences

    script:
    if( mode == 'tcoffee' )
        """
        t_coffee -in $seq_to_aln > out_file
        """

    else if( mode == 'mafft' )
        """
        mafft --anysymbol --parttree --quiet $seq_to_aln > out_file
        """

    else if( mode == 'clustalo' )
        """
        clustalo -i $seq_to_aln -o out_file
        """

    else
        error "Invalid alignment mode: ${mode}"

}

Templates

In other words, you can write script templates once and call them repeatedly from processes:

process template_example {

    input:
    val STR from 'this', 'that'

    script:
    template 'my_script.sh'

}

The templates directory contains the file my_script.sh, with this content:

#!/bin/bash
echo "process started at `date`"
echo $STR
:
echo "process completed"

One way to test a template is to run it directly in a shell terminal, supplying the variables it expects:

STR='foo' bash templates/my_script.sh
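
The shell block

The shell block forces a shell context: $ belongs to the shell, and Nextflow variables must be written as !{...} instead. The script should be enclosed in single quotes.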

process myTask {

    input:
    val str from 'Hello', 'Hola', 'Bonjour'

    shell:
    '''
    echo User $USER says !{str}
    '''

}

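Here $USER is a shell variable, whereas !{str} is a Nextflow variable.

Native execution (the exec block)

Nextflow is itself an extension of Groovy, so a process can run Groovy code directly instead of a shell script: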

x = Channel.from( 'a', 'b', 'c')

process simpleSum {
    input:
    val x

    exec:
    println "Hello Mr. $x"
}
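
Inputs

Nextflow processes are isolated from one another and communicate through channels. The input block defines where a process gets its data. Each process can have at most one input block, but that block may contain several input declarations (that is, several inputs are allowed).
The general syntax is:

input:
  <input qualifier> <input name> [from <source channel>] [attributes]

Value inputs (val)

The val qualifier receives a value of any type, which the script can then reference by name:
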
num = Channel.from( 1, 2, 3 )

process basicExample {
  input:
  val x from num

  "echo process job $x"

}

This prints the following. The order is not guaranteed, because the tasks run in parallel:

process job 3
process job 1
process job 2

When the input name is the same as the channel name, the from clause can be omitted:

num = Channel.from( 1, 2, 3 )

process basicExample {
  input:
  val num

  "echo process job $num"

}
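
File inputs (file)

The file qualifier stages each file emitted by the channel into the task's working directory:
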
proteins = Channel.fromPath( '/some/path/*.fa' )

process blastThemAll {
  input:
  file query_file from proteins

  "blastp -query ${query_file} -db nr"

}

When the input name is the same as the channel name, the from clause can again be omitted:

proteins = Channel.fromPath( '/some/path/*.fa' )

process blastThemAll {
  input:
  file proteins

  "blastp -query $proteins -db nr"

}

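An input file can also be staged under a fixed local name, so the script can refer to it literally rather than through a $ variable. This helps when an external command expects a specific file name: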

input:
    file query_file name 'query.fa' from proteins

or, in the short form:

input:
    file 'query.fa' from proteins

In a complete process it looks like this:

proteins = Channel.fromPath( '/some/path/*.fa' )

process blastThemAll {
  input:
  file 'query.fa' from proteins

  "blastp -query query.fa -db nr"

}
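
Multiple input files

When a channel item is a collection of files, they are staged under the given name with a numeric suffix:
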
fasta = Channel.fromPath( "/some/path/*.fa" ).buffer(size:3)

process blastThemAll {
    input:
    file 'seq' from fasta

    "echo seq*"

}

This prints:

seq1 seq2 seq3
seq1 seq2 seq3
...

A ? wildcard in the name controls where the index goes in each staged file name:

fasta = Channel.fromPath( "/some/path/*.fa" ).buffer(size:3)

process blastThemAll {
    input:
    file 'seq?.fa' from fasta

    "cat seq1.fa seq2.fa seq3.fa"

}

Dynamic input file names

The staged file name can be built from another input value:

process simpleCount {
  input:
  val x from species
  file "${x}.fa" from genomes

  """
  cat ${x}.fa | grep '>'
  """
}

Standard input (stdin)

The stdin qualifier pipes each channel value to the standard input of the script:

str = Channel.from('hello', 'hola', 'bonjour', 'ciao').map { it+'\n' }

process printAll {
   input:
   stdin str

   """
   cat -
   """

}

This prints (in no guaranteed order):

hola
bonjour
ciao
hello
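
Environment variable inputs (env)

The env qualifier exports each channel value as an environment variable in the task:
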
str = Channel.from('hello', 'hola', 'bonjour', 'ciao')

process printEnv {

    input:
    env HELLO from str

    '''
    echo $HELLO world!
    '''

}

This prints (in no guaranteed order):

hello world!
ciao world!
bonjour world!
hola world!

Input of type 'set'

The set qualifier groups multiple parameters in a single parameter definition. It is useful when a process receives tuples of values whose elements need to be handled separately. Each element in the tuple is associated with the corresponding element in the set definition. For example:

tuple = Channel.from( [1, 'alpha'], [2, 'beta'], [3, 'delta'] )

process setExample {
    input:
    set val(x), file('latin.txt') from tuple

    """
    echo Processing $x
    cat - latin.txt > copy
    """

}

In the above example the set parameter is used to define the value x and the file latin.txt, which will receive a value from the same channel.

In the set declaration items can be defined by using the following qualifiers: val, env, file and stdin.

A shorter notation can be used by applying the following substitution rules:

long              short
val(x)            x
file(x)           (not supported)
file('name')      'name'
file(x:'name')    x:'name'
stdin             '-'
env(x)            (not supported)

Thus the previous example could be rewritten as follows:

tuple = Channel.from( [1, 'alpha'], [2, 'beta'], [3, 'delta'] )

process setExample {
    input:
    set x, 'latin.txt' from tuple

    """
    echo Processing $x
    cat - latin.txt > copy
    """

}

File names can be defined in dynamic manner as explained in the Dynamic input file names section.

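Input repeaters (a highlight!)

The each qualifier repeats the process for every item in a collection, which is an efficient way to generate repeated runs. For example: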

sequences = Channel.fromPath('*.fa')
methods = ['regular', 'expresso', 'psicoffee']

process alignSequences {
  input:
  file seq from sequences
  each mode from methods

  """
  t_coffee -in $seq -mode $mode > result
  """
}

This runs the alignment three times for each sequence file, once per mode.

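How multiple input channels work together

When a process declares several input channels, each task takes one value from every channel:
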
process foo {
  echo true
  input:
  val x from Channel.from(1,2)
  val y from Channel.from('a','b','c')
  script:
   """
   echo $x and $y
   """
}

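This prints only two lines. The process stops as soon as the shorter channel is exhausted, so c is never used: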

1 and a
2 and b

process bar {
  echo true
  input:
  val x from Channel.value(1)
  val y from Channel.from('a','b','c')
  script:
   """
   echo $x and $y
   """
}

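Here Channel.value(1) creates a value channel, which can be read any number of times, so 1 is repeated for every item of the other channel: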

1 and a
1 and b
1 and c

There are other cases as well...

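Outputs

The output block declares what a process emits and which channel receives it. A val output simply forwards a value:
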
methods = ['prot','dna', 'rna']

process foo {
  input:
  val x from methods

  output:
  val x into receiver

  """
  echo $x > file
  """

}

receiver.println { "Received: $it" }

A file output can use a name built from an input value:

process align {
  input:
  val x from species
  file seq from sequences

  output:
  file "${x}.aln" into genomes

  """
  t_coffee -in $seq > ${x}.aln
  """
}
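
Outputs can also be grouped with the set qualifier, mirroring the set inputs described earlier. The sketch below is only an illustration in the same pre-DSL2 syntax: the process makeReport and the channels ids and reports are invented names, and it assumes a Nextflow version that still accepts this syntax.

```groovy
// Hypothetical sample IDs, used only for this illustration.
ids = Channel.from('s1', 's2')

process makeReport {
    input:
    val id from ids

    output:
    // Emit each ID together with the file written for it.
    set val(id), file("${id}.txt") into reports

    """
    echo "report for $id" > ${id}.txt
    """
}

// Each received item is a pair: the ID and the path of its report file.
reports.subscribe { println "Received: $it" }
```

Keeping the ID and the file in one tuple means a downstream process can still tell which file belongs to which sample.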

When
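
The notes leave this section empty, so the following is only a sketch of how a when block is typically written. A task runs only if the expression evaluates to true. The channel names proteins and dbtype and the BB11 file-name prefix are assumptions chosen for illustration.

```groovy
// Hypothetical channels for illustration only.
proteins = Channel.fromPath('/some/path/*.fa')
dbtype   = Channel.value('nr')

process find {
    input:
    file query from proteins
    val type from dbtype

    when:
    // Run only for files whose name starts with BB11, and only for the nr database.
    query.name =~ /^BB11.*/ && type == 'nr'

    script:
    """
    blastp -query $query -db nr
    """
}
```

Files that do not match are skipped without being treated as errors.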

Directives
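
This section is also empty in the notes. As a hedged sketch, directives are optional settings placed at the top of the process body, before input. The process name alignWithLimits, the resource values, and the results/alignments path below are arbitrary choices for illustration, not recommendations.

```groovy
// Same kind of input channel as in the each example above.
sequences = Channel.fromPath('*.fa')

process alignWithLimits {
    // Resources requested for each task.
    cpus 4
    memory '8 GB'

    // Retry a failed task up to two more times before giving up.
    errorStrategy 'retry'
    maxRetries 2

    // Copy declared outputs into a results directory (the path is an assumption).
    publishDir 'results/alignments', mode: 'copy'

    input:
    file seq from sequences

    output:
    file 'out.aln' into alignments

    """
    echo "Running with ${task.cpus} CPUs"
    t_coffee -in $seq > out.aln
    """
}
```

Inside the script, the implicit task object exposes directive values such as task.cpus, so the command can adapt to whatever was requested.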

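Closing thoughts

After these two or three hours of reading the documentation, I think I have a working grasp of how to use Nextflow. For now there is no need to keep reading the docs.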
