Translated from: Using IPython notebooks under version control
What is a good strategy for keeping IPython notebooks under version control?
The notebook format is quite amenable to version control: if one wants to version control the notebook and the outputs, then this works quite well. The annoyance comes when one wants to version control only the input, excluding the cell outputs (a.k.a. "build products"), which can be large binary blobs, especially for movies and plots. In particular, I am trying to find a good workflow that:
As mentioned, if I choose to include the outputs (which is desirable when using nbviewer, for example), then everything is fine. The problem is when I do not want to version control the output. There are some tools and scripts for stripping the output of the notebook, but frequently I encounter the following issues:
Some scripts that strip output change the format slightly compared to using the Cell/All Output/Clear menu option, thereby creating unwanted noise in the diffs. This is resolved by some of the answers.
I have considered several options that I shall discuss below, but have yet to find a good comprehensive solution. A full solution might require some changes to IPython, or may rely on some simple external scripts. I currently use mercurial, but would like a solution that also works with git: an ideal solution would be version-control agnostic.
This issue has been discussed many times, but there is no definitive or clear solution from the user's perspective. The answer to this question should provide the definitive strategy. It is fine if it requires a recent (even development) version of IPython or an easily installed extension.
Update: I have been playing with my modified notebook version, which optionally saves a .clean version with every save, using Gregory Crosswhite's suggestions. This satisfies most of my constraints but leaves the following unresolved:
These will go into the .clean file, and then need to be integrated somehow into my working version. (Of course, I can always re-execute the notebook, but this can be a pain, especially if some of the results depend on long calculations, parallel computations, etc.) I do not have a good idea about how to resolve this yet. Perhaps a workflow involving an extension like ipycache might work, but that seems a little too complicated.
When the notebook is running, one can use the Cell/All Output/Clear menu option for removing the output.
Reference: https://stackoom.com/question/1GblD/在版本控制下使用IPython笔记本
Unfortunately, I do not know much about Mercurial, but I can give you a possible solution that works with Git, in the hope that you might be able to translate my Git commands into their Mercurial equivalents.
For background, in Git the add command stores the changes that have been made to a file into a staging area. Once you have done this, any subsequent changes to the file are ignored by Git unless you tell it to stage them as well. Hence, the following script, which, for each of the given files, strips out all of the outputs and prompt_number sections, stages the stripped file, and then restores the original:
NOTE: If running this gets you an error message like ImportError: No module named IPython.nbformat, then use ipython to run the script instead of python.
from IPython.nbformat import current
import io
from os import remove, rename
from shutil import copyfile
from subprocess import Popen
from sys import argv

for filename in argv[1:]:
    # Backup the current file
    backup_filename = filename + ".backup"
    copyfile(filename, backup_filename)

    try:
        # Read in the notebook
        with io.open(filename, 'r', encoding='utf-8') as f:
            notebook = current.reads(f.read(), format="ipynb")

        # Strip out all of the output and prompt_number sections
        for worksheet in notebook["worksheets"]:
            for cell in worksheet["cells"]:
                cell.outputs = []
                if "prompt_number" in cell:
                    del cell["prompt_number"]

        # Write the stripped file
        with io.open(filename, 'w', encoding='utf-8') as f:
            current.write(notebook, f, format='ipynb')

        # Run git add to stage the non-output changes
        print("git add", filename)
        Popen(["git", "add", filename]).wait()

    finally:
        # Restore the original file; remove is needed in case
        # we are running in windows.
        remove(filename)
        rename(backup_filename, filename)
Once the script has been run on the files whose changes you wanted to commit, just run git commit.
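The script above targets the older notebook layout, where cells live inside worksheets and the counter is called prompt_number. In nbformat 4 the cells sit at the top level and the counter is execution_count. A minimal sketch of the same stripping step for that layout, using only the standard library, might look as follows. The strip_outputs name and the sample notebook are made up for illustration, and the indent/sort_keys settings are an assumption chosen to keep diffs stable, not something taken from the answer.

```python
import json


def strip_outputs(notebook):
    """Clear outputs and execution counts of code cells, in place.

    Assumes the nbformat 4 layout: a top-level "cells" list in which
    code cells carry "outputs" and "execution_count".
    """
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


# Hypothetical two-cell notebook used only to exercise the function.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "source": ["print(1 + 1)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
    ],
}

stripped = strip_outputs(example)
# Fixed indentation and key order (an assumption) keep later diffs quiet.
print(json.dumps(stripped, indent=1, sort_keys=True))
```

Only code cells are touched, so markdown cells do not gain an empty outputs key that would show up as diff noise.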
Here is my solution with git. It allows you to just add and commit (and diff) as usual: those operations will not alter your working tree, and at the same time (re)running a notebook will not alter your git history.
Although this can probably be adapted to other VCSs, I know it doesn't satisfy your requirements (at least the VCS agnosticity). Still, it is perfect for me, and although it's nothing particularly brilliant, and many people probably already use it, I didn't find clear instructions about how to implement it by googling around. So it may be useful to other people.
Save a file with this content somewhere (for the following, let us assume ~/bin/ipynb_output_filter.py)
Make it executable (chmod +x ~/bin/ipynb_output_filter.py)
Create the file ~/.gitattributes, with the following content
*.ipynb filter=dropoutput_ipynb
Run the following commands:
git config --global core.attributesfile ~/.gitattributes
git config --global filter.dropoutput_ipynb.clean ~/bin/ipynb_output_filter.py
git config --global filter.dropoutput_ipynb.smudge cat
Done!
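The content of the filter script is not reproduced in the steps above. As a rough idea of what a clean filter has to do, here is a minimal standard-library sketch. It is not the author's script: the clean and run_filter names are hypothetical, and the handling of both the worksheet-based (nbformat 3) and top-level (nbformat 4) layouts is an assumption about the files you might feed it. Git passes the working-tree file on standard input and stages whatever the filter writes to standard output.

```python
import io
import json
import sys


def clean(text):
    """Return notebook JSON text with outputs and counters removed."""
    nb = json.loads(text)
    # nbformat 3 nests cells in worksheets; nbformat 4 keeps them top-level.
    cells = list(nb.get("cells", []))
    for worksheet in nb.get("worksheets", []):
        cells.extend(worksheet.get("cells", []))
    for cell in cells:
        if cell.get("cell_type") != "code":
            continue
        cell["outputs"] = []
        cell.pop("prompt_number", None)   # nbformat 3 counter
        if "execution_count" in cell:     # nbformat 4 counter
            cell["execution_count"] = None
    return json.dumps(nb, indent=1, sort_keys=True, ensure_ascii=False) + "\n"


def run_filter(stdin, stdout):
    # Read the whole notebook, write the cleaned version.
    stdout.write(clean(stdin.read()))


# Demonstration with in-memory streams. A real filter script would end with
# run_filter(sys.stdin, sys.stdout) instead.
sample = json.dumps({
    "nbformat": 4, "nbformat_minor": 5, "metadata": {},
    "cells": [{
        "cell_type": "code", "execution_count": 7, "metadata": {},
        "source": ["1 + 1"],
        "outputs": [{"output_type": "execute_result", "execution_count": 7,
                     "data": {"text/plain": ["2"]}, "metadata": {}}],
    }],
})
out = io.StringIO()
run_filter(io.StringIO(sample), out)
print(out.getvalue())
```

Because the smudge side is plain cat, nothing is restored on checkout; the working copy keeps its outputs only as long as git does not rewrite the file.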
Limitations:
In git, if you are in branch somebranch and you do git checkout otherbranch; git checkout somebranch, you usually expect the working tree to be unchanged. Here instead you will have lost the output and cell numbering of notebooks whose source differs between the two branches.
To avoid throwing the output away every time you do anything involving a checkout, the approach could be changed to store it in separate files (but notice that at the time the above code is run, the commit id is not known!), and possibly to version them (but notice this would require something more than a git commit notebook_file.ipynb, although it would at least keep git diff notebook_file.ipynb free from base64 garbage).
My solution reflects the fact that I personally don't like to keep generated stuff versioned: notice that doing merges involving the output is almost guaranteed to invalidate the output or your productivity or both.
EDIT:
If you do adopt the solution as I suggested it (that is, globally), you will have trouble if there is some git repo for which you want to version output. So if you want to disable the output filtering for a specific git repository, simply create inside it a file .git/info/attributes, with
**.ipynb filter=
as content. Clearly, in the same way it is possible to do the opposite: enable the filtering only for a specific repository.
The code is now maintained in its own git repo.
If the instructions above result in ImportErrors, try adding "ipython" before the path of the script:
git config --global filter.dropoutput_ipynb.clean ipython ~/bin/ipynb_output_filter.py
EDIT: May 2016 (updated February 2017): there are several alternatives to my script. For completeness, here is a list of those I know: nbstripout (other variants), nbstrip, jq.
I use a very pragmatic approach, which works well for several notebooks, at several sites. It even enables me to 'transfer' notebooks around. It works on Windows as well as on Unix/MacOS.
Although it is simple, it solves the problems above...
Basically, do not track the .ipynb files, only the corresponding .py files.
By starting the notebook server with the --script option, that file is automatically created/saved when the notebook is saved.
Those .py files do contain all input; non-code is saved into comments, as are the cell borders. Those files can be read/imported (and dragged) into the notebook server to (re)create a notebook. Only the output is gone, until it is re-run.
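To make the "cell borders as comments" idea concrete, here is a small sketch that splits such a .py export back into cells. It assumes the borders are written as comment lines of the form # <codecell> and # <markdowncell>, which is how the old --script export is believed to mark them; the split_cells name and the sample text are made up for illustration.

```python
import re

# Assumed marker format for cell borders in a --script style export.
MARKER = re.compile(r"^# <(codecell|markdowncell)>\s*$")


def split_cells(script_text):
    """Split an exported .py file into (cell_type, source) pairs."""
    cells = []
    kind, lines = None, []
    for line in script_text.splitlines():
        match = MARKER.match(line)
        if match:
            # A new border closes the cell collected so far, if any.
            if kind is not None:
                cells.append((kind, "\n".join(lines).strip()))
            kind, lines = match.group(1), []
        elif kind is not None:
            lines.append(line)
    if kind is not None:
        cells.append((kind, "\n".join(lines).strip()))
    return cells


sample = """# -*- coding: utf-8 -*-
# <nbformat>3.0</nbformat>

# <markdowncell>

# Load the data

# <codecell>

x = 1
print(x)
"""

for kind, source in split_cells(sample):
    print(kind, repr(source))
```

Lines before the first border (the header comments) are ignored, and markdown text stays commented, so the file remains valid Python that a plain diff can handle.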
Personally I use mercurial to version-track the .py files, and use the normal (command-line) commands to add, check in (etc.) for that. Most other (D)VCSs will allow this too.
It's simple to track the history now; the .py files are small, textual and simple to diff. Once in a while, we need a clone (just branch; start a second notebook server there), or an older version (check it out and import it into a notebook server), etc.
Create a (bash) script to start the server (with the --script option) and version-track it.
Saving a notebook saves the .py file, but does not check it in.
A checkout to (for example) file@date+rev.py should be helpful. It would be too much work to add that; maybe I will do so one day. Until now, I just do that by hand.
We have a collaborative project where the product is Jupyter Notebooks, and for the last six months we have used an approach that is working great: we activate saving the .py files automatically and track both the .ipynb files and the .py files.
That way if someone wants to view/download the latest notebook they can do that via github or nbviewer, and if someone wants to see how the notebook code has changed, they can just look at the changes to the .py files.
For Jupyter notebook servers, this can be accomplished by adding the lines
import os
from subprocess import check_call

def post_save(model, os_path, contents_manager):
    """post-save hook for converting notebooks to .py scripts"""
    if model['type'] != 'notebook':
        return # only do this for notebooks
    d, fname = os.path.split(os_path)
    check_call(['jupyter', 'nbconvert', '--to', 'script', fname], cwd=d)

c.FileContentsManager.post_save_hook = post_save
to the jupyter_notebook_config.py file and restarting the notebook server.
If you aren't sure in which directory to find your jupyter_notebook_config.py file, you can type jupyter --config-dir, and if you don't find the file there, you can create it by typing jupyter notebook --generate-config.
For IPython 3 notebook servers, this can be accomplished by adding the lines
import os
from subprocess import check_call

def post_save(model, os_path, contents_manager):
    """post-save hook for converting notebooks to .py scripts"""
    if model['type'] != 'notebook':
        return # only do this for notebooks
    d, fname = os.path.split(os_path)
    check_call(['ipython', 'nbconvert', '--to', 'script', fname], cwd=d)

c.FileContentsManager.post_save_hook = post_save
to the ipython_notebook_config.py file and restarting the notebook server. These lines are from a github issues answer that @minrk provided, and @dror includes them in his SO answer as well.
For IPython 2 notebook servers, this can be accomplished by starting the server using:
ipython notebook --script
or by adding the line
c.FileNotebookManager.save_script = True
to the ipython_notebook_config.py file and restarting the notebook server.
If you aren't sure in which directory to find your ipython_notebook_config.py file, you can type ipython locate profile default, and if you don't find the file there, you can create it by typing ipython profile create.
Here's our project on github that is using this approach, and here's a github example of exploring recent changes to a notebook.
We've been very happy with this.
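The point of tracking the .py files is that a diff then shows only code changes. The same view can be approximated directly from two versions of an .ipynb file, as the sketch below does with the standard library. This is an illustration of the idea, not part of the workflow described above; it assumes the nbformat 4 layout, and the code_lines, code_diff and make_notebook names are hypothetical.

```python
import difflib
import json


def code_lines(notebook_json):
    """Return the source lines of all code cells, in order."""
    nb = json.loads(notebook_json)
    lines = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            source = cell.get("source", [])
            # "source" may be one string or a list of line strings.
            text = source if isinstance(source, str) else "".join(source)
            lines.extend(text.splitlines())
    return lines


def code_diff(old_json, new_json):
    """Unified diff of code-cell sources only; outputs never appear."""
    return list(difflib.unified_diff(
        code_lines(old_json), code_lines(new_json),
        fromfile="old", tofile="new", lineterm=""))


def make_notebook(source, output_text):
    # Hypothetical helper that builds a one-cell notebook for the demo.
    return json.dumps({
        "nbformat": 4, "nbformat_minor": 5, "metadata": {},
        "cells": [{
            "cell_type": "code", "execution_count": 1, "metadata": {},
            "source": source,
            "outputs": [{"output_type": "stream", "name": "stdout",
                         "text": output_text}],
        }],
    })


old = make_notebook("x = 1\nprint(x)", "1\n")
new = make_notebook("x = 2\nprint(x)", "2\n")
print("\n".join(code_diff(old, new)))
```

The outputs differ between the two versions as well, but only the changed assignment shows up in the diff.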
I did what Albert & Rich did: don't version .ipynb files (as these can contain images, which gets messy). Instead, either always run ipython notebook --script or put c.FileNotebookManager.save_script = True in your config file, so that a (versionable) .py file is always created when you save your notebook.
To regenerate notebooks (after checking out a repo or switching a branch) I put the script py_file_to_notebooks.py in the directory where I store my notebooks.
Now, after checking out a repo, just run python py_file_to_notebooks.py to generate the ipynb files. After switching branches, you may have to run python py_file_to_notebooks.py -ov to overwrite the existing ipynb files.
Just to be on the safe side, it's good to also add *.ipynb to your .gitignore file.
Edit: I no longer do this because (A) you have to regenerate your notebooks from py files every time you check out a branch and (B) there's other stuff like markdown in notebooks that you lose. I instead strip output from notebooks using a git filter. Discussion on how to do this is here.