Literature Reading 2.7: The bioRxiv Preprint and the Nature Methods Published Version of Bioconda

bioRxiv preprint

Bioconda: A sustainable and comprehensive software distribution for the life sciences

Abstract

We present Bioconda (https://bioconda.github.io), a distribution of bioinformatics software for the lightweight, multi-platform and language-agnostic package manager Conda. Currently, Bioconda offers a collection of over 3000 software packages, which is continuously maintained, updated, and extended by a growing global community of more than 200 contributors. Bioconda improves analysis reproducibility by allowing users to define isolated environments with defined software versions, all of which are easily installed and managed without administrative privileges.

Introduction

Thousands of new software tools have been released for bioinformatics in recent years, in a variety of programming languages. Accompanying this diversity of construction is an array of installation methods. Often, software has to be compiled manually for different hardware architectures and operating systems, with management left to the user or system administrator. Scripting languages usually deliver their own package management tools for installing, updating, and removing packages, though these are often limited in scope to packages written in the same scripting language and cannot handle external dependencies. Published scientific software often consists of simple collections of custom scripts distributed with textual descriptions of the manual steps required to install the software. New analyses often require novel combinations of multiple tools, and the heterogeneity of scientific software makes management of a software stack complicated and error-prone. Moreover, it inhibits reproducible science, because it is hard to reproduce a software stack on different machines. System-wide deployment of software has traditionally been handled by administrators, but reproducibility often requires that the researcher (who is often not an expert in administration) is able to maintain full control of the software environment and rapidly modify it without administrative privileges.

The Conda package manager (https://conda.io) has become an increasingly popular approach to overcome these challenges. Conda normalizes software installations across language ecosystems by describing each software package with a recipe that defines meta-information and dependencies, as well as a build script that performs the steps necessary to build and install the software. Conda prepares and builds software packages within an isolated environment, transforming them into relocatable binaries. Conda packages can be built for all three major operating systems: Linux, macOS, and Windows. Importantly, installation and management of packages requires no administrative privileges, such that a researcher can control the available software tools regardless of the underlying infrastructure. Moreover, Conda obviates reliance on system-wide installation by allowing users to generate isolated software environments, within which versions and tools can be managed per-project, without generating conflicts or incompatibilities (see online methods). These environments support reproducibility, as they can be rapidly exchanged via files that describe their installation state. Conda is tightly integrated into popular solutions for reproducible scientific data analysis like Galaxy, bcbio-nextgen (https://github.com/chapmanb/bcbio-nextgen), and Snakemake. Finally, while Conda provides many commonly used packages by default, it also allows users to optionally include additional repositories (termed channels) of packages that can be installed.

Results

In order to unlock the benefits of Conda for the life sciences, the Bioconda project was founded in 2015. The mission of Bioconda is to make bioinformatics software easily installable and manageable via the Conda package manager. Via its channel for the Conda package manager, Bioconda currently provides over 3000 software packages for Linux and macOS. Development is driven by an open community of over 200 international scientists. In the prior two years, package count and the number of contributors have increased linearly, on average, with no sign of saturation (Fig. 1a,b). The barrier to entry is low, requiring a willingness to participate and adherence to community guidelines. Many software developers contribute recipes for their own tools, and many Bioconda contributors are invested in the project as they are also users of Conda and Bioconda. Bioconda provides packages from various language ecosystems like Python, R (CRAN and Bioconductor), Perl, Haskell, as well as a plethora of C/C++ programs (Fig. 1c). Many of these packages have complex dependency structures that require various manual steps to install when not relying on a package manager like Conda (Fig. 2a, Online Methods). With over 6.3 million downloads, the service has become a backbone of bioinformatics infrastructure (Fig. 1d). Bioconda is complemented by the conda-forge project (https://conda-forge.github.io), which hosts software not specifically related to the biological sciences. The two projects collaborate closely, and the Bioconda team maintains over 500 packages hosted by conda-forge. Among all currently available distributions of bioinformatics software, Bioconda is by far the most comprehensive, while being among the youngest (Fig. 2d).

Figure 1

Figure 1: Bioconda development and usage since the beginning of the project. (a) contributing authors and added recipes over time. (b) code line additions and deletions per week. (c) package count per language ecosystem (saturated colors on bottom represent explicitly life science related packages). (d) total downloads per language ecosystem. The term “other” entails all recipes that do not fall into one of the specific categories. Note that a subset of packages that started in Bioconda have since been migrated to the more appropriate, general-purpose conda-forge channel. Older versions of such packages still reside in the Bioconda channel, and as such are included in the recipe count (a) and download count (d). Statistics obtained Oct. 25, 2017.

To ensure reliable maintenance of such numbers of packages, we use a semi-automatic, agent-assisted development workflow (Fig. 2b). All Bioconda recipes are hosted in a GitHub repository (https://github.com/bioconda/bioconda-recipes). Both the addition of new recipes and the update of existing recipes in Bioconda are handled via pull requests. Thereby, a modified version of one or more recipes is compared against the current state of Bioconda. Once a pull request arrives, our infrastructure performs several automatic checks. Problems discovered in any step are reported to the contributor and further progress is blocked until they are resolved. First, the modified recipes are checked for syntactic anti-patterns, i.e., formulations that are syntactically correct but bad style (termed linting). Second, the modified recipes are built on Linux and macOS, via a cloud-based, free-of-charge service (https://travis-ci.org). Successfully built recipes are tested (e.g., by running the generated executable). Since Bioconda packages must be able to run on any supported system, it is important to check that the built packages do not rely on particular elements from the build environment. Therefore, testing happens in two stages: (a) test cases are executed in the build environment, and (b) test cases are executed in a minimal Docker (https://docker.com) container which purposefully lacks all non-common system libraries (hence, a dependency that is not explicitly defined will lead to a failure). Once the build and test steps have succeeded, a member of the Bioconda team reviews the proposed changes and, if acceptable, merges the modifications into the official repository. Upon merging, the recipes are built again and uploaded to the hosted Bioconda channel (https://anaconda.org/bioconda), where they become available via the Conda package manager. When a Bioconda package is updated to a new version, older builds are generally preserved, and recipes for multiple older versions may be maintained in the Bioconda repository. The usual turnaround time of the above workflow is short (Fig. 2c). 61% of the pull requests are merged within 5 hours. Of those, 36% are even merged within 1 hour. Only 18% of the pull requests need more than a day. Hence, publishing software in Bioconda or updating already existing packages can be accomplished typically within minutes to a few hours.

Figure 2

Figure 2: Dependency structure, workflow, comparison with other resources, and turnaround time. (a) largest connected component of directed acyclic graph of Bioconda packages (nodes) and dependencies (edges). Highlighted is the induced subgraph of the CNVkit package and its dependencies (node coloring as defined in Fig. 1c, squared node represents CNVkit). (b) GitHub based development workflow: a contributor provides a pull request that undergoes several build and test steps, followed by a human review. If any of these checks does not succeed, the contributor can update the pull request accordingly. Once all steps have passed, the changes can be merged. (c) Turnaround time from submission to merge of pull requests in Bioconda. (d) Comparison of explicitly life science related packages in Bioconda with Debian Med (https://www.debian.org/devel/debian-med), Gentoo Science Overlay (category sci-biology, https://github.com/gentoo/sci), EasyBuild (module bio, https://easybuilders.github.io/easybuild), Biolinux, Homebrew Science (tag bioinformatics, https://brew.sh), GNU Guix (category bioinformatics, https://www.gnu.org/s/guix), and BioBuilds (https://biobuilds.org). The lower panel shows the project age since the first release or commit. Statistics obtained Oct. 23, 2017.

Reproducible software management and distribution is enhanced by other current technologies. Conda integrates itself well with environment modules (http://modules.sourceforge.net/), a technology used nearly universally across HPC systems. An administrator can use Conda to easily define software stacks for multiple labs and project-specific configurations. Popularized by Docker, containers provide another way to publish an entire software stack, down to the operating system. They provide greater isolation and control over the environment a software is executed in, at the expense of some customizability. Conda complements container based approaches. Where flexibility is needed, Conda packages can be used and combined directly. Where the uniformity of containers is required, Conda can be used to build images without having to reproduce the nuanced installation steps that would ordinarily be required to build and install a software within an image. In fact, for each Bioconda package, our build system automatically builds a minimal Docker image containing that package and its dependencies, which is subsequently uploaded and made available via the Biocontainers project. As a consequence, every built Bioconda package is available not only for installation via Conda, but also as a container via Docker, Rkt (https://coreos.com/rkt), and Singularity, such that the desired level of reproducibility can be chosen freely.

Discussion

By turning the arduous and error-prone process of installing bioinformatics software, previously repeated endlessly by scientists around the globe, into a concerted community effort, Bioconda frees significant resources to instead be invested into productive research. The new simplicity of deploying even complex software stacks with strictly controlled software versions enables software authors to safely rely on existing methods. Where previously the cost of depending on a third party tool - requiring its installation and maintaining compatibility with new versions - was often higher than the effort to re-implement its methods, authors can now simply specify the tool and version required, incurring only negligible costs even for large requirement sets.

For reproducible data science, it is crucial that software libraries and tools are provided via an easy to use, unified interface, such that they can be easily deployed and sustainably managed. With its ability to maintain isolated software environments, the integration into major workflow management systems and the fact that no administration privileges are needed, the Conda package manager is the ideal tool to ensure sustainable and reproducible software management. With Bioconda, we unlock Conda for the life sciences while coordinating closely with other related projects such as conda-forge and Biocontainers. Bioconda offers a comprehensive resource of thousands of software libraries and tools that is maintained by hundreds of international contributors. Although it is among the youngest, it outperforms all competing projects by far in the number of available packages. With almost six million downloads so far, Bioconda packages have been well received by the community. We invite everybody to participate in reaching the goal of a central, comprehensive, and language agnostic collection of easily installable software by maintaining existing or publishing new software in Bioconda.

Online Methods

Security Considerations

Using Bioconda as a service to obtain packages for local installation entails trusting that (a) the provided software itself is not harmful and (b) it has not been modified in a harmful way. Ensuring (a) is up to the user. In contrast, (b) is handled by our workflow. First, source code or binary files defined in recipes are checked for integrity via MD5 or SHA256 hash values. Second, all review and testing steps are enforced via the GitHub interface. This guarantees that all packages have been tested automatically and reviewed by a human being. Third, all changes to the repository of recipes are publicly tracked, and all build and test steps are transparently visible to the user. Finally, the automatic parts of the development workflow are implemented in the open-source software bioconda-utils (https://github.com/bioconda/bioconda-utils). In the future, we will further explore the possibility to sign packages cryptographically.
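The hash-based integrity check mentioned above can be illustrated with a small shell sketch. This is not the code used by bioconda-utils or conda-build; the helper names sha256_of and verify_sha256 are hypothetical, and the sketch only shows the general idea of comparing a downloaded file's SHA256 digest with the value recorded alongside it.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: compare a file's SHA256 digest with an expected value.
# sha256_of and verify_sha256 are hypothetical helpers, not part of Conda.

sha256_of() {
    # Print the hex digest of the file given as $1.
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | awk '{print $1}'      # GNU coreutils (typical on Linux)
    else
        shasum -a 256 "$1" | awk '{print $1}'  # fallback commonly present on macOS
    fi
}

verify_sha256() {
    # $1: path to the file, $2: expected hex digest
    local actual
    actual=$(sha256_of "$1")
    if [ "$actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1" >&2
        return 1
    fi
}

# Example call (file name and digest are placeholders):
#   verify_sha256 some-tool.tar.gz "<expected digest from the recipe>"
```

A non-zero return value signals that the file differs from what was expected, which is the point at which a build would be expected to stop.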

Software management with Conda

Via the Conda package manager, installing software from Bioconda becomes very simple. In the following, we describe the basic functionality assuming that the user has access to a Linux or macOS terminal. After installing Conda, the first step is to set up the Bioconda channel via:

$ conda config --add channels conda-forge

$ conda config --add channels bioconda

Now, all Bioconda packages are visible to the Conda package manager. For example, the software CNVkit can be searched for with

$ conda search cnvkit

in order to check if and in which versions it is available. It can be installed with:

$ conda install cnvkit

CNVkit needs various dependencies from Python and R, which would otherwise have to be installed in separate manual steps (Fig. 2a). Furthermore, Conda enables updating and removing all these dependencies via one unified interface. A key value of Conda is the ability to define isolated, shareable software environments. This can happen ad-hoc, or via YAML (https://yaml.org) files. For example, the following defines an environment consisting of Salmon and DESeq2:

channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - bioconductor-deseq2 =1.16.1
  - salmon =0.8.2
  - r-base =3.4.1

Given that the above environment specification is stored in the file env.yaml, an environment my-env meeting the specified requirements can be created via the command:

$ conda env create --name my-env --file env.yaml

To use the commands installed in this environment, it must first be “activated” by issuing the following command:

$ source activate my-env

Within the environment, R, Salmon, and DESeq2 are available in exactly the defined versions. For example, salmon can be executed with:

$ salmon --help

It is possible to modify an existing environment by using conda update, conda install and conda remove. For example, we could add a particular version of Kallisto and update Salmon to the latest available version with:

$ conda install kallisto=0.43.1

$ conda update salmon

Finally, the environment can be deactivated again with:

$ source deactivate
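The walkthrough above creates the environment from a YAML file. The ad-hoc route mentioned earlier can be sketched roughly as follows, assuming a working Conda installation with access to the conda-forge and bioconda channels. The wrapper name create_adhoc_env and the environment name are hypothetical; the pinned versions are simply those of the YAML example.

```shell
#!/usr/bin/env bash
# Illustrative sketch: create an environment ad hoc instead of from a YAML file.
# create_adhoc_env is a hypothetical wrapper around the real `conda create` command.

create_adhoc_env() {
    # $1: environment name; remaining arguments: package specs such as salmon=0.8.2
    local name="$1"
    shift
    conda create --yes --name "$name" \
        --channel conda-forge --channel bioconda \
        "$@"
}

# Example call (requires a working Conda installation):
#   create_adhoc_env my-adhoc-env salmon=0.8.2 bioconductor-deseq2=1.16.1 r-base=3.4.1
```

Passing the channels on the command line keeps the example independent of the global channel configuration; a YAML file remains the better choice when the environment is meant to be shared.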

How isolated software environments enable reproducible research

With isolated software environments as shown above, it is possible to define an exact version for each package. This increases reproducibility by eliminating differences due to implementation changes. Note that above we also pin an R version, although the latest compatible one would also be automatically installed without mentioning it. To further increase reproducibility, this pattern can be extended to all dependencies of DESeq2 and Salmon and recursively down to basic system libraries like zlib and boost (https://www.boost.org). Environments are isolated from the rest of the system, while still allowing interaction with it: e.g., tools inside the environment are preferred over system tools, while system tools that are not available from within the environment can still be used. Conda also supports the automatic creation of environment definitions from already existing environments. This allows rapid exploration of the needed combination of packages before it is finalized into an environment definition. When used with workflow management systems like Galaxy, bcbio-nextgen (https://github.com/chapmanb/bcbio-nextgen), and Snakemake that interact directly with Conda, a data analysis can be shipped and deployed in a fully reproducible way, from description and automatic execution of every analysis step down to the description and automatic installation of any required software.
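The automatic creation of an environment definition from an existing environment can be sketched with `conda env export`. The wrapper name export_env and the file names are hypothetical, and the sketch assumes that an environment called my-env already exists.

```shell
#!/usr/bin/env bash
# Illustrative sketch: write the definition of an existing environment to a file.
# export_env is a hypothetical wrapper around the real `conda env export` command.

export_env() {
    # $1: name of an existing environment, $2: output file
    conda env export --name "$1" > "$2"
}

# Example calls (require a working Conda installation and an environment my-env):
#   export_env my-env my-env.exported.yaml
#   conda env create --name my-env-copy --file my-env.exported.yaml
```

The exported file typically lists every installed package, including indirect dependencies, so it pins far more than the hand-written YAML example; because exact builds can be platform-specific, it may not recreate cleanly on a different operating system.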

Data analysis

The presented figures and numbers have been generated via a fully automated, reproducible Snakemake workflow that is freely available under https://github.com/bioconda/bioconda-paper.

Next, let us read through the formally published article together and see how much it differs from the preprint.

Journal: Nature Methods (47.990/Q1)

Bioconda: sustainable and comprehensive software distribution for the life sciences

To the Editor: Bioinformatics software comes in a variety of programming languages and requires diverse installation methods. This heterogeneity makes management of a software stack complicated, error-prone, and inordinately time-consuming. Whereas software deployment has traditionally been handled by administrators, ensuring the reproducibility of data analyses [1–3] requires that the researcher be able to maintain full control of the software environment, rapidly modify it without administrative privileges, and reproduce the same software stack on different machines.

The Conda package manager (https://conda.io) has become an increasingly popular means to overcome these challenges for all major operating systems. Conda normalizes software installations across language ecosystems by describing each software with a human readable ‘recipe’ that defines meta-information and dependencies, as well as a simple ‘build script’ that performs the steps necessary to build and install the software. Conda builds software packages in an isolated environment, transforming them into relocatable binaries. Importantly, it obviates reliance on system-wide administration privileges by allowing users to generate isolated software environments in which they can manage software versions by project, without generating incompatibilities and side-effects (Supplementary Results). These environments support reproducibility, as they can be rapidly exchanged via files that describe their installation state. Conda is tightly integrated into popular solutions for reproducible data analysis such as Galaxy, bcbio-nextgen (https://github.com/chapmanb/bcbio-nextgen), and Snakemake. To further enhance reproducibility guarantees, Conda can be combined with container or virtual machine-based approaches and archive facilities such as Zenodo (Supplementary Results). Finally, although Conda provides many commonly used packages by default, it also allows users to optionally include additional, community-managed repositories of packages (termed channels).

To unlock the benefits of Conda for the life sciences, we present the Bioconda project (https://bioconda.github.io). The Bioconda project provides over 3,000 Conda software packages for Linux and macOS. Rapid turnaround times (Supplementary Results) and extensive documentation (https://bioconda.github.io/contributing.html) have led to a growing community of over 200 international scientists working in the project (Supplementary Results). The project is led by a core team, which is complemented by interest groups for particular language ecosystems. Unlimited (in time and space) storage for generated packages is donated by Anaconda Inc. All other used infrastructure is free of charge. Bioconda provides packages from various language ecosystems such as Python, R (CRAN and Bioconductor), Perl, Haskell, Java, and C/C++ (Fig. 1a). Many of the packages have complex dependency structures that require various manual steps for installation when not relying on a package manager like Conda (Supplementary Results). With over 6.3 million downloads, Bioconda has become a backbone of bioinformatics infrastructure that is used heavily across all language ecosystems (Fig. 1b). It is complemented by the conda-forge project (https://conda-forge.github.io), which hosts software not specifically related to the biological sciences. This separation has proven beneficial, because the focused nature of the Bioconda community allows for fast turnaround times and support when a user needs to contribute packages or fix problems. Nevertheless, the two projects collaborate closely, and the Bioconda team maintains over 500 packages hosted by conda-forge.

Figure 1

Fig. 1 | Package numbers and usage. a, Package count per language ecosystem (saturated colors on the lower portions of the bars represent explicitly life-science-related packages). b, Distribution of per-package downloads, separated by language ecosystem. The term “other” encompasses all packages that do not fall into one of the specific categories named. White dots represent the mean; dark bars represent the interval between upper and lower quartiles. c, Comparison of the number of explicitly life-science-related packages in Bioconda with that in Debian Med (https://www.debian.org/devel/debian-med), Gentoo Science Overlay (category sci-biology; https://github.com/gentoo/sci), EasyBuild (module bio; https://easybuilders.github.io/easybuild), Biolinux [6], Homebrew Science (tag bioinformatics; https://brew.sh), GNU Guix (category bioinformatics; https://www.gnu.org/s/guix), and BioBuilds (https://biobuilds.org). The lower graph shows the project age since the first release or commit. Statistics obtained 25 October 2017.

Bioconda is not the only effort to distribute bioinformatics software (Fig. 1c). The alternatives can be categorized into system-wide (Debian-Med, Gentoo Science, Biolinux, and Homebrew) and per-user (EasyBuild, GNU Guix, and BioBuilds) installation mechanisms. The system-wide approaches lack the ability to put the scientist in control of the installed software stack, and thus do not meet the requirements for reproducibility outlined above. All per-user-based approaches provide a similar feature set (BioBuilds is also using the Conda package manager). However, among all available approaches, Bioconda, despite being the most recent, is by far the most comprehensive, with thousands of software libraries and tools that are maintained by hundreds of international contributors (Fig. 1c).

For reproducible data science, it is crucial that software libraries and tools be provided via an easy-to-use, unified interface, so that they can be easily deployed and sustainably managed. With its ability to maintain isolated software environments, integration into major workflow management systems, and lack of requirement for any administration privileges for use, the Conda package manager is the ideal tool to ensure sustainable and reproducible software management. Bioconda packages have been well received by the community, with over six million downloads so far. We invite everybody to join the Bioconda community, participate in maintaining or publishing new software, and work toward the goal of a central, comprehensive, and language-agnostic collection of easily installable software for the life sciences.

Reporting Summary. Further information on experimental design is available in the Nature Research Reporting Summary linked to this article.

Data availability. Data and code underlying the presented results are enclosed in a Snakemake workflow archive available at https://doi.org/10.5281/zenodo.1068297. The archive can also be used to automatically reproduce all results and figures presented in this paper.

References

1. Mesirov, J. P. Science 327, 415–416 (2010).

2. Baker, M. Nature 533, 452–454 (2016).

3. Munafò, M. R. et al. Nat. Hum. Behav. 1, 0021 (2017).

4. Afgan, E. et al. Nucleic Acids Res. 44, W3–W10 (2016).

5. Köster, J. & Rahmann, S. Bioinformatics 28, 2520–2522 (2012).

6. Field, D. et al. Nat. Biotechnol. 24, 801–803 (2006).

Acknowledgements

We thank all contributors, the conda-forge team, and Anaconda Inc. for excellent cooperation. Further, we thank Travis CI (https://travis-ci.com) and Circle CI (https://circleci.com) for providing free Linux and macOS computing capacity. Finally, we thank ELIXIR (https://www.elixir-europe.org) for constant support and donation of staff. This work was supported by the Intramural Program of the National Institute of Diabetes and Digestive and Kidney Diseases, US National Institutes of Health (R.D.), the Netherlands Organisation for Scientific Research (NWO) (VENI grant 016.Veni.173.076 to J.K.), the German Research Foundation (SFB 876 to J.K.), and the NYU Abu Dhabi Research Institute for the NYU Abu Dhabi Center for Genomics and Systems Biology, program number CGSB1 (grant to J.R. and A. Yousif).

Author contributions

J.K. and R.D. wrote the manuscript and conducted the data analysis. K. Beauchamp, C. Brueffer, B.A.C., F. Eggenhofer, B.G., E. Pruesse, M. Raden, J.R., D. Ryan, I. Shlyakter, A.S., C.H.T.-T., and R.V. (in alphabetical order) contributed to writing of the manuscript. D.A. Søndergaard supervised student programmers on writing Conda package recipes and maintaining the connection with ELIXIR. All other members of the Bioconda Team contributed or maintained recipes (author order was determined by the number of commits in October 2017).

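In summary, the preprint and the published version differ considerably. The published version is more concise and polished, with a much lower word count, and the two figures have been merged into one: the original Fig. 1c,d and Fig. 2d now form a single figure. How could one ever match such groundbreaking work? Perhaps only in dreams.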
