Translated from Jenny Bryan: Project-oriented workflow
I (Jenny Bryan) was honored to speak this week at the IASC-ARS/NZSA Conference, hosted by the Stats Department at The University of Auckland. One of the conference themes is to celebrate the accomplishments of Ross Ihaka, who got R started back in 1992, along with Robert Gentleman. My talk included advice on setting up your R life to maximize effectiveness and reduce frustration.
Two specific slides generated much discussion and consternation in #rstats Twitter:
If the first line of your R script is
setwd("C:\Users\jenny\path\that\only\I\have")
I will come into your office and SET YOUR COMPUTER ON FIRE.
If the first line of your R script is
rm(list = ls())
I will come into your office and SET YOUR COMPUTER ON FIRE.
I stand by these strong opinions, but on their own, threats to commit arson aren’t terribly helpful! Here I explain why these habits can be harmful and may be indicative of an awkward workflow. Feel free to discuss more on community.rstudio.com.
Caveat: only you can decide how much you care about this. The importance of these practices has a lot to do with whether your code will be run by other people, on other machines, and in the future. If your current practices serve your purposes, then go forth and be happy.
Workflow versus Product
Let’s make a distinction between things you do because of personal taste and habits (“workflow”) versus the logic and output that is the essence of your project (“product”). These are part of your workflow:
- The editor you use to write your R code.
- The name of your home directory.
- The R code you ran before lunch.
I consider these to be clearly product:
- The raw data.
- The R code someone needs to run on your raw data to get your results, including the explicit library() calls to load necessary packages.
Ideally, you don’t hardwire anything about your workflow into your product. Workflow-related operations should be executed by you interactively, using whatever means is appropriate to your setup, but not built into the scripts themselves.
Self-contained projects
I suggest organizing each data analysis into a project: a folder on your computer that holds all the files relevant to that particular piece of work. I’m not assuming this is an RStudio Project, though this is a nice implementation discussed below.
Any resident R script is written assuming that it will be run from a fresh R process with working directory set to the project directory. It creates everything it needs, in its own workspace or folder, and it touches nothing it did not create. For example, it does not install additional packages (another pet peeve of mine).
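One way a script can respect the "does not install packages" rule is to check its dependencies up front and stop with a helpful message if something is missing. The following is a minimal sketch in base R; the helper name check_packages is hypothetical and not part of the original talk.

```r
# Hypothetical helper: report missing packages instead of installing them.
check_packages <- function(pkgs) {
  # requireNamespace() returns FALSE (rather than erroring) when a package
  # is not installed; quietly = TRUE suppresses its messages.
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing <- pkgs[!installed]
  if (length(missing) > 0) {
    stop(
      "Please install these packages before running this script: ",
      paste(missing, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# These packages ship with R, so this call should pass on any installation.
check_packages(c("stats", "utils"))
```

The person running the script then decides how and where to install anything that is missing, which keeps that choice in their workflow rather than in your product.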
This convention guarantees that the project can be moved around on your computer or onto other computers and will still “just work”. I argue that this is the only practical convention that creates reliable, polite behavior across different computers or users and over time. This convention is neither new, nor unique to R.
It’s like agreeing that we will all drive on the left or the right. A hallmark of civilization is following conventions that constrain your behavior a little, in the name of public safety.
Use of a development environment
You will notice that the workflow recommendations given here are easier to implement if you use an IDE (integrated development environment). RStudio is a great example (what I use today), but there are many others, including: Emacs + ESS (what I used for ~15 years before RStudio), vim + Nvim-R, Visual Studio + RTVS.
Direction of causality: long-time coders don’t organize their work into self-contained projects and use relative paths because they use an IDE. It is the other way around: they use an IDE because it makes it easier to follow standard practices, such as these.
What’s wrong with setwd()?
I run a lot of student code in STAT 545 and, at the start, I see a lot of R scripts that look like this:
library(ggplot2)
setwd("/Users/jenny/cuddly_broccoli/verbose_funicular/foofy/data")
df <- read.delim("raw_foofy_data.csv")
p <- ggplot(df, aes(x, y)) + geom_point()
ggsave("../figs/foofy_scatterplot.png")
The chance of the setwd() command having the desired effect (making the file paths work) for anyone besides its author is 0%. It’s also unlikely to work for the author one or two years or computers from now. The project is not self-contained and portable. To recreate and perhaps extend this plot, the lucky recipient will need to hand edit one or more paths to reflect where the project has landed on their machine. When you do this for the 73rd time in 2 days, while marking an assignment, you start to fantasize about lighting the perpetrator’s computer on fire.
This use of setwd() is also highly suggestive that the useR does all of their work in one R process and manually switches gears when they shift from one project to another. That sort of workflow makes it unpleasant to work on more than one project at a time and also makes it easy for work done on one project to accidentally leak into subsequent work on another (e.g., objects, loaded packages, session options).
Use projects and the here package
How can you avoid setwd() at the top of every script?
- Organize each logical project into a folder on your computer.
- Make sure the top-level folder advertises itself as such. This can be as simple as having an empty file named .here. Or, if you use RStudio and/or Git, those both leave characteristic files behind that will get the job done.
- Use the here() function from the here package to build the path when you read or write a file. Create paths relative to the top-level directory.
- Whenever you work on this project, launch the R process from the project’s top-level directory. If you launch R from the shell, cd to the correct folder first.
To continue our example, start R in the foofy directory, wherever that may be. Now the code looks like so:
library(ggplot2)
library(here)
df <- read.delim(here("data", "raw_foofy_data.csv"))
p <- ggplot(df, aes(x, y)) + geom_point()
ggsave(here("figs", "foofy_scatterplot.png"))
This will run, with no edits, for anyone who follows the convention about launching R in the project folder. In fact, it will even work if R’s working directory is anywhere inside the project, i.e. it will work from sub-folders. This plays well with knitr/rmarkdown’s default behavior around working directory and in package development/checking workflows.
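To build intuition for why this works from sub-folders, here is a simplified base-R sketch of the underlying idea: walk up from the working directory until you reach a folder that contains a marker file. This is only an illustration and not the here package’s actual implementation; the function name find_project_root and the exact set of markers it checks are assumptions made for this example.

```r
# Hypothetical illustration of root-finding; not the here package's real code.
find_project_root <- function(start = getwd()) {
  dir <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    # A folder counts as the project root if it holds a `.here` file,
    # a `.git` entry, or any `*.Rproj` file.
    entries <- list.files(dir, all.files = TRUE)
    if (any(entries %in% c(".here", ".git")) ||
        any(grepl("\\.Rproj$", entries))) {
      return(dir)
    }
    parent <- dirname(dir)
    if (identical(parent, dir)) {
      stop("No project root found above ", start, call. = FALSE)
    }
    dir <- parent
  }
}

# Demo in a throwaway folder: mark `foofy` as the root,
# then search starting from its `data` sub-folder.
project <- file.path(tempdir(), "foofy")
dir.create(file.path(project, "data"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(project, ".here"))
root <- find_project_root(file.path(project, "data"))
basename(root)
```

Because the search starts wherever R happens to be and moves upward, any path built from the discovered root is the same whether you launch R in foofy or in foofy/data.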
Read up on the here package to learn about more features, such as additional ways to mark the top directory and troubleshooting with dr_here(). I have also written a more detailed paean to this package before.
RStudio Projects
This work style is so crucial that RStudio has an official notion of a Project (with a capital “P”). You can designate a new or existing folder as a Project. All this means is that RStudio leaves a file, e.g., foofy.Rproj, in the folder, which is used to store settings specific to that project.
Double-click on a .Rproj file to open a fresh instance of RStudio, with the working directory and file browser pointed at the project folder. The here package is aware of this and the presence of an .Rproj file is one of the ways it recognizes the top-level folder for a project.
RStudio fully supports Project-based workflows, making it easy to switch from one to another, have many projects open at once, re-launch recently used Projects, etc.
What’s wrong with rm(list = ls())?
It’s also fairly common to see data analysis scripts that begin with this object-nuking command:
rm(list = ls())
Just like hard-wiring the working directory, this is highly suggestive that the useR works in one R process and manually switches gears when they shift from one project to another. That, in turn, suggests that development frequently happens in a long-running, well-used R process rather than a fresh and clean one.
The problem is that rm(list = ls()) does NOT, in fact, create a fresh R process. All it does is delete user-created objects from the global workspace.
Many other changes to the R landscape persist invisibly and can have profound effects on subsequent development. Any packages that have been loaded are still available. Any options that have been set to non-default values remain that way. Working directory is not affected (which is, of course, why we see setwd() so often here too!).
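A small demonstration makes this concrete. The sketch below simulates leftovers from earlier work and then shows what survives the command; the object name foofy_data and the particular option and package are arbitrary choices made for illustration.

```r
# Simulate leftovers from earlier work in the same R process.
options(digits = 3)        # an option set to a non-default value
library(tools)             # an attached package (ships with R)
foofy_data <- 1:10         # a user-created object

rm(list = ls())            # the object-nuking command

# The object is gone, but the option and the attached package persist.
exists("foofy_data")             # FALSE
getOption("digits")              # 3
"package:tools" %in% search()    # TRUE
```

Restarting R is what actually resets the option and detaches the package.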
Why does this matter? It makes your script vulnerable to hidden dependencies on things you ran in this R process before you executed rm(list = ls()).
- You might use functions from a package without including the necessary library() call. Your collaborator won’t be able to run this script.
- You might code up an analysis assuming that stringsAsFactors = FALSE but next week, when you have restarted R, everything will inexplicably be broken.
- You might write paths relative to some random working directory, then be puzzled next month when nothing can be found or results don’t appear where you expect.
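One habit that can reduce exposure to the first two problems is to state your assumptions in the script itself rather than inheriting them from the session. The following is a small sketch using only base R; the file contents and variable names are made up for illustration.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "a,1", "b,2"), csv_path)

# Pass the argument explicitly rather than relying on a session-wide option
# that may or may not have been set earlier in this R process.
df <- read.csv(csv_path, stringsAsFactors = FALSE)

# Calling a function through its namespace means the script does not
# depend on a package that merely happened to be attached earlier.
first_rows <- utils::head(df, 1)
class(df$name)   # "character"
```

This does not replace restarting R often, but it makes a script less sensitive to whatever state it happens to be run in.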
The solution is to write every script assuming it will be run in a fresh R process. How do you adopt this style? Key steps:
- User-level setup: Do not save .RData when you quit R and don’t load .RData when you fire up R.
  - In RStudio, this behavior can be requested in the General tab of Preferences.
  - If you run R from the shell, put something like this in your .bash_profile: alias R='R --no-save --no-restore-data'.
- Don’t do things in your .Rprofile that affect how R code runs, such as loading a package like dplyr or ggplot2 or setting an option such as stringsAsFactors = FALSE.
- Daily work habit: Restart R very often and re-run your under-development script from the top.
  - If you use RStudio, use the menu item Session > Restart R or the associated keyboard shortcut Ctrl+Shift+F10 (Windows and Linux) or Command+Shift+F10 (Mac OS). You can re-run all code up to the current line with Ctrl+Alt+B (Windows and Linux) or Command+Option+B (Mac OS).
  - If you run R from the shell, use Ctrl+D to quit, then R to restart.
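If you want to check that a script really does run in a fresh process without leaving your current session, one option is to launch it with Rscript. The sketch below is an illustration only: the throwaway script and the object name leftover are invented for the example, and --vanilla is used so that saved workspaces and startup files are skipped.

```r
# An object in the current session that a fresh process should NOT see.
leftover <- 42

# Write a throwaway script; in practice this would be your analysis script.
script <- tempfile(fileext = ".R")
writeLines('cat("leftover exists:", exists("leftover"), "\\n")', script)

# Rscript starts a brand-new R process; --vanilla skips saved workspaces,
# .Rprofile, and other startup files.
rscript <- file.path(R.home("bin"), "Rscript")
output <- system2(rscript, c("--vanilla", shQuote(script)), stdout = TRUE)
output
```

The child process reports that leftover does not exist, which is exactly the situation your collaborator will be in when they run your script.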
This requires that you fully embrace the idea that source is real:
The source code is real. The objects are realizations of the source code. Source for EVERY user modified object is placed in a particular directory or directories, for later editing and retrieval. (from the ESS manual)
This doesn’t mean that your scripts need to be perfectly polished and ready to run unattended on a remote server. Scripts can be messy, anticipating interactive execution, but still be complete. Clean them up when and if you need to.
What about objects that take a long time to create? Isolate that bit in its own script and write the precious object to file with saveRDS(my_precious, here("results", "my_precious.rds")). Now you can develop scripts to do downstream work that reload the precious object via my_precious <- readRDS(here("results", "my_precious.rds")). It is a good idea to break data analysis into logical, isolated pieces anyway.
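Here is how that split might look, condensed into one runnable sketch. To keep it self-contained it uses base R only: project_dir is a temporary stand-in for the project root (in a real project here() would supply it), and the linear model is just a placeholder for a genuinely slow computation.

```r
# Stand-in for the project root; in a real project, here() would supply this.
project_dir <- file.path(tempdir(), "foofy")
dir.create(file.path(project_dir, "results"),
           recursive = TRUE, showWarnings = FALSE)
rds_path <- file.path(project_dir, "results", "my_precious.rds")

# --- Upstream script: do the slow work once and save the result ---
my_precious <- lm(dist ~ speed, data = cars)  # placeholder for slow work
saveRDS(my_precious, rds_path)
rm(my_precious)

# --- Downstream script: reload the object instead of recomputing it ---
my_precious <- readRDS(rds_path)
coef(my_precious)
```

The downstream script can now be restarted and re-run from the top as often as you like without paying for the slow step again.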
Lastly, rm(list = ls()) is hostile to anyone that you ask to help you with your R problems. If they take a short break from their own work to help debug your code, their generosity is rewarded by losing all of their previous work. Now granted, if your helper has bought into all the practices recommended here, this is easy to recover from, but it’s still irritating. When this happens for the 100th time in a semester, it rekindles the computer arson fantasies triggered by last week’s fiascos with setwd().