How to use Colab with git on your local machine
In this post you will learn an efficient way to use Colab on a project while keeping control of the files locally, on your own computer.
I will show you how to do it when the project is maintained in a git repository (even a private one), so that you can handle all the git actions from your computer, the way you are already used to.
I will share the code and file structure that I use, which allow a quick initialization of the Colab notebook.
The benefits that I see in this method are:
- It allows one-click re-initialization of the notebook
- It simplifies the process of working in a team with Colab and git
- It also allows running the same notebook locally without any changes
I love Google Colab. I think Google did the ML/AI community an amazing service when it opened the option for anyone to work, test and play with Python code, ML models and concepts, totally free of charge, even with GPUs, which are an important part of developing some of the common architectures in use today.
I know many people, myself included, who would have thought twice and probably given up on their attempt to learn ML if they had to pay for that first experience. Running GPU training can be costly, and having the option to run it for free lowers the barrier for people who have the ambition but lack the resources to take the first steps.
That being said, working with Google Colab has its "annoyances", which haunted me for quite some time. I don't know about you, but in my experience it pays to be organized when working on a project, so working with version control such as git is very important.
There are several problems when trying to combine Google Colab and git.
- First, how do you load the project repository into Colab (and, if the repo is private, how do you do it securely, without sharing your credentials)?
- On Colab, where do you store your notebook in relation to your repo?
- Once you make changes to your code, how do you push them from your Colab-hosted notebook into your repo?
- If you (or someone else in your team/group) changed some files in the repo and you want to use them in your notebook, how do you pull the updates so that they are available to your notebook? And can you do that without reloading the notebook?
- In Colab, each time we open the notebook it starts `fresh`. We need to connect to files stored on our gdrive, and how do we run the project when the files are stored somewhere deep inside the gdrive file system?
I believe that the system I came up with gives a good solution to all of these problems.
The Google Colab and Git system:
The first step of the solution is using Google Drive for desktop. If you haven't checked it out yet, go ahead and do it right now, because this application gives you a local folder on your computer that is synced to your Google Drive.
You can choose exactly which gdrive folders you would like to have on your local hard disk, so you don't need a local copy of the entire gdrive. Once you specify the folder, the sync of files between Colab and your local folder is fast. Any change that you make in Colab is transferred almost immediately to your local file, and the other way around. So you can work locally with the IDE that you are used to, which makes the workflow much more convenient.
This also solves the issue of how you get the repo into the gdrive folder in the first place. Once you have connected gdrive to your disk, create a folder that will hold your project. You can do that in gdrive or locally; it doesn't matter, since the folder is tracked and the sync process will make sure that it exists both on your computer and in the cloud. Then, on your local machine, open your shell/terminal in that folder and simply `git clone` the repo.
Since the entire repo is located in a local folder on your machine, you can use your favorite git manager to handle the repo (personally, I use SourceTree).
This is the project folder structure that I use:
myProject
|-- notebooks
|-- src
|   |-- __init__.py
|   |-- other python files
|-- __init__.py
|-- credentials.sample.ini
You can store the project anywhere you want on your gdrive (no matter how many folder levels deep). The 'trick' is that we 'change directory' (`cd`) to the work folder once we open the notebook.
The notebook that we run in Colab is the one stored in the `notebooks` folder in the repo. This means that any change you make to the notebook in Colab is synced to your local Google Drive folder, which allows you to push the changes to the repo server from your local machine.
One important thing to remember: even though the notebook is located in the `notebooks` folder, the working folder when running the notebook doesn't have to be! In fact, using the root folder of the project as the working folder makes a lot more sense, since it gives access to the files in the `src` folder. The reason for the `__init__.py` files is that they define `src` as a Python package, which allows normal Python imports of any file located inside the folder.
In order to streamline the process of changing the directory to the root folder, I recommend using a config file (credentials.colab.ini). I store this file locally, and once I connect to a new Colab instance I manually upload the config file to the `/content/` folder. This is the root folder of the Colab notebook instance, so just drag and drop the file from your local folder to the UI. See image below.
Note that the actual credentials.colab.ini must not be pushed to the git repo! Depending on your project, you can use this file to store any other important configuration that you may need.
An important reason for the config file is to hold the root path of the project. I then use the awesome Python configparser module to read the config file, so that the configuration parameters are accessible anywhere in the notebook.
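import configparser
from importlib import reload
from os import path

WANDB_enable = False

# Candidate config files, tried in order:
#   "../credentials.ini"    - local Jupyter run (working folder is notebooks/)
#   "credentials.colab.ini" - Colab run (file uploaded to /content/)
creds_path_ar = ["../credentials.ini", "credentials.colab.ini"]
root_path = ""
data_path = ""
ENV = ""  # stays empty if no config file is found

for creds_path in creds_path_ar:
    if path.exists(creds_path):
        config_parser = configparser.ConfigParser()
        config_parser.read(creds_path)
        root_path = config_parser['MAIN']["PATH_ROOT"]
        data_path = config_parser['MAIN']["PATH_DATA"]
        ENV = config_parser['MAIN']["ENV"]
        break  # use the first config file that exists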
Other settings that you might find beneficial to add to the configuration file are the data location (if you store the data outside of your repo) and the location of embedding files. These NLP embedding files take a lot of disk space, and since I have more than one project that uses them, I prefer to store them in one location, so that I don't have multiple giant copies of them in my gdrive, which is free only up to 15 GB.
For this reason, keep `credentials.example.ini` in your repo, holding only a template of the parameters that can be used. Each member of the team should make their own copy of the file and update the values they need inside that copy. Put the actual credentials file into .gitignore so that it does not get into the repo by mistake.
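One way to make the "copy the template" step hard to forget is a small helper that creates the personal file from the template only when it does not exist yet. This is a minimal sketch and not part of the original setup: the function name `ensure_personal_config` is hypothetical, and the file names are simply the ones used in this post.

```python
import shutil
from pathlib import Path


def ensure_personal_config(template, target):
    """Copy the template to the personal config file unless it already exists.

    Returns True when a new file was created, and False when an existing
    file was left untouched.
    """
    template, target = Path(template), Path(target)
    if target.exists():
        return False  # never overwrite values the user already filled in
    shutil.copyfile(template, target)
    return True
```

For example, `ensure_personal_config("credentials.example.ini", "credentials.ini")` would create the personal copy on first use; the values still have to be edited by hand, and the personal file still has to be listed in .gitignore.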
Reason for this line:
creds_path_ar = ["../credentials.ini", "credentials.colab.ini"]
This line allows using the exact same notebook on Colab and with a locally hosted Jupyter engine. When you open the notebook locally, the working folder is the folder where the notebook is located (`notebooks`), so `../credentials.ini` points to the root folder of the project. So, create another configuration file to be used when running the notebook locally, and put it in the root folder of the project.
A typical configuration file looks like this:
[DEFAULT]
WANDB_ENABLE = FALSE
ENV = LOCAL

[MAIN]
WANDB_LOGIN = ……………………………
PATH_ROOT = /content/drive/My Drive/WORK/ML/MyProject
PATH_DATA = /content/drive/My Drive/WORK/ML/data
ENV = COLAB
As you can see, you can use the configparser DEFAULT section if you need it. Another important parameter in the configuration file is the ENV variable. It allows controlling specific code that will run only on Colab, or only locally.
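To make the role of the DEFAULT section concrete, here is a small sketch that parses a shortened copy of the configuration above from a string (reading from a string instead of a file is only for illustration). A key that is set in `[MAIN]` overrides the default, and a key that is missing from `[MAIN]` falls back to `[DEFAULT]`.

```python
import configparser

# Shortened copy of the sample configuration, inlined for illustration only.
SAMPLE = """
[DEFAULT]
WANDB_ENABLE = FALSE
ENV = LOCAL

[MAIN]
PATH_ROOT = /content/drive/My Drive/WORK/ML/MyProject
ENV = COLAB
"""

config_parser = configparser.ConfigParser()
config_parser.read_string(SAMPLE)
main = config_parser["MAIN"]

env = main["ENV"]                                # set in [MAIN], overrides DEFAULT
wandb_enabled = main.getboolean("WANDB_ENABLE")  # not in [MAIN], falls back to DEFAULT
root_path = main["PATH_ROOT"]

if env == "COLAB":
    print("Colab-only setup would run here")
```

Note that configparser returns every value as a string; `getboolean` is what turns `FALSE` into a real Python `False`.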
One piece of code that definitely needs to run only on Colab is the following, which mounts gdrive to the notebook instance.
if ENV == "COLAB":
    # Mount Google Drive inside the Colab instance (Colab only)
    from google.colab import drive
    drive.mount('/content/drive')
When running the notebook for the first time after connecting to the Colab instance, you will need to authorize the access. After that, as long as you are not completely disconnected from Colab, if you need to restart the runtime you can simply rerun the cell and the drive will already be mounted.
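If you would rather not depend on the ENV value for this one step, an alternative is to detect Colab by checking whether the `google.colab` package is available. This is a sketch of a different technique from the ENV switch described above, and the helper names `running_in_colab` and `mount_drive_if_colab` are hypothetical.

```python
import importlib.util


def running_in_colab():
    """Return True when the google.colab package can be found."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ModuleNotFoundError:
        # The parent package "google" is not installed at all.
        return False


def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount gdrive on Colab; do nothing (and return False) elsewhere."""
    if not running_in_colab():
        return False
    from google.colab import drive
    drive.mount(mount_point)
    return True
```

Calling `mount_drive_if_colab()` in a cell then behaves the same way on both machines: it mounts the drive on Colab and is a no-op locally.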
Finally, both locally and on Colab, we change to the work folder by running this cell:
cd {root_path}
That concludes our process. Running these cells at the beginning of the notebook allows an easy one-click setup of the notebook directly from git!
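The `cd {root_path}` cell relies on IPython magic, so it only works inside a notebook. A plain-Python equivalent, sketched below, changes the working folder and also adds the project root to `sys.path` so that `from src import ...` resolves. The function name `enter_project_root` is hypothetical.

```python
import os
import sys


def enter_project_root(root_path):
    """Make root_path the working folder and make its packages importable."""
    root_path = os.path.abspath(root_path)
    os.chdir(root_path)
    if root_path not in sys.path:
        sys.path.insert(0, root_path)  # lets `from src import ...` resolve
    return root_path
```

With the variables from the config cell, the call would be `enter_project_root(root_path)`.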
If you have more than one person on the team, this method allows them all to share the same notebook and run it as they want. They can run it on Colab, or locally if they have access to better resources.
Additional tips to use this method effectively
A few more subtle productivity tips are worth mentioning; they will improve your workflow when using this system:
Immediate gdrive file open without the cumbersome navigation in the gdrive UI
Have you ever tried to navigate to a deep folder somewhere inside your gdrive? I used to get really frustrated because it is a slow process. You need to navigate the folders one by one, and each folder change takes a few seconds. You might think I am crazy, but once you do it a few times a day it starts to add up. With Google Drive for Desktop you can jump directly to a folder/file location, which saves you these expensive seconds. This option is located in the right-click context menu of your file explorer, see image below.
Working locally, pulling changed files and reloading modules
Another benefit of having your files locally is that you can use an IDE to edit the normal Python files that are used within your notebook. What I mean is that if you structure your project correctly, the most important functions should live inside a Python module that is imported into the notebook.
Stripping methods out of the notebook into Python files is good practice anyway, as it supports a more organized code structure and increases the reusability of the code. When used with this Colab system, it also adds the ability to make changes to files locally. I can open the project straight from my local Google Drive folder in vscode, and any change I make to the files is synced to gdrive.
For example, I can have experiment_utils.py inside the src folder and import it with:
from src import experiment_utils as utils
and then call utils.myFunc() inside the notebook.
You can even make changes on the fly. You don't need to restart the notebook on every change to your module files. To achieve that, put these lines at the beginning of your notebook:
%load_ext autoreload
%autoreload 2
If you are still having trouble, you can use the reload method:
from importlib import reload
reload(moduleName)
(Just create a new code cell and execute it once for the module that you are trying to import. Rerunning the cell that imports the module can also help.)
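To see what `reload` actually does, here is a self-contained sketch that writes a throwaway module to a temporary folder, imports it, "edits" it on disk, and reloads it. The module name `demo_utils` and its contents are made up for the demonstration; in the real project the module would be a file under `src`.

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True  # keep the demo independent of cached .pyc files

# Create a throwaway module on disk (a stand-in for a file under src/).
module_dir = tempfile.mkdtemp()
module_file = os.path.join(module_dir, "demo_utils.py")
with open(module_file, "w") as f:
    f.write("def my_func():\n    return 'first version'\n")

sys.path.insert(0, module_dir)
import demo_utils

before = demo_utils.my_func()

# Simulate editing the file in an IDE while the notebook keeps running.
with open(module_file, "w") as f:
    f.write("def my_func():\n    return 'second version, edited'\n")

importlib.reload(demo_utils)  # re-executes the module's source in place
after = demo_utils.my_func()
print(before, "->", after)
```

One caveat worth knowing: names imported with `from module import name` before the reload keep pointing at the old objects, which is one reason to import the module itself (as in `from src import experiment_utils as utils`) and call functions through it.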
Of course, if someone changed the repo, you can pull the changes on your local computer, and the updates will be synced to gdrive automatically. Google Drive for desktop shows a little green V icon next to the folder/file name when the sync is complete.
I decided to call my Python file folder `src`. You can choose another name, but be careful: don't name the folder `code`, since the Colab environment has a pre-existing code.py file, and the name clash will give you a strange error message. (It took me some time to figure this one out.)
Conclusion
I created a template of the project that is explained in this post. You can access it from my GitHub account: hershkoy/colab_git_template.
I really hope that you found this post informative. If you liked it, give it some likes and share it with whoever you think can benefit from it. If you find a way to improve this system further, let me know!
Translated from: https://medium.com/@hershkoy/how-to-use-colab-with-git-on-your-local-machine-1c95586967e