Chapter 4: Repositories
This is part of an online book called Source Control HOWTO, a best practices guide on source control, version control, and configuration management.
Cars and clocks
In previous chapters I have mentioned the concept of a repository, but I haven't said much further about it. In this chapter, I want to provide a lot more detail. Please bear with me as I spend a little time talking about how an SCM tool works "under the hood". I am doing this because an SCM tool is more like a car than a clock.
- An SCM tool is not like a clock. Clock users have no need to know how a clock works inside. We just want to know what time it is. Those who understand the inner workings of a clock cannot tell time any more skillfully than the rest of us.
- An SCM tool is more like a car. Lots of people do use cars without knowing how they work. However, people who really understand cars tend to get better performance out of them.
Rest assured that this book is still a "HOWTO". My goal here remains to create a practical explanation of how to do source control. However, I believe that you can use an SCM tool more effectively if you know a little bit about what's happening inside.
Repository = File System * Time
A repository is the official place where you store all your source code. It keeps track of all your files, as well as the layout of the directories in which they are stored. It resides on a server where it can be shared by all the members of your team.
But there has to be more. If the definition in the previous paragraph were the whole story, then an SCM repository would be no more than a network file system. A repository is much more than that. A repository contains history.
A file system is two-dimensional: its space is defined by directories and files. In contrast, a repository is three-dimensional: it exists in a continuum defined by directories, files and time. An SCM repository contains every version of your source code that has ever existed. The additional dimension creates some rather interesting challenges in the architecture of a repository and the decisions about how it manages data.
How do we store all those old versions of everything?
As a first guess, let's not be terribly clever. We need to store every version of the source tree. Why not just keep a complete copy of the entire tree for every change that has happened?
We obviously use Vault as the SCM tool for our own development of Vault. We began development of Vault in the fall of 2001. In the summer of 2002, we started "dogfooding". On October 25th, 2002, we abandoned our repository history and started a fresh repository for the core components of Vault. Since that day, this tree has been modified 4,686 times.
This repository contains approximately 40 MB of source code. If we chose to store the entire tree for every change, those 4,686 copies of the source tree would consume approximately 183 GB, without compression. At today's prices for disk space, this option is worth considering.
However, this particular repository is just not very large. We have several others as well, but the sum total of all the code we have ever written still doesn't qualify as "large". Many of our Vault customers have trees which are a lot bigger.
As an example, consider the source tree for OpenOffice.org. This tree is approximately 634 MB. Based on their claim of 270 developers and the fact that their repository is almost four years old, I'm going to conservatively estimate that they have made perhaps 20,000 checkins. So, if we used the dumb approach of storing a full copy of their tree for every change, we would need around 12 TB of disk space. That's 12 terabytes.
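The arithmetic behind those two estimates is easy to check. Here is a minimal sketch in Python; the function name is mine, and it assumes binary units (1 GB = 1024 MB), which is what makes the numbers above come out.

```python
def full_copy_storage_mb(tree_size_mb: float, checkins: int) -> float:
    """Disk space in MB if every checkin stores a complete copy of the tree."""
    return tree_size_mb * checkins


# Figures quoted in the text: tree size in MB and number of checkins.
vault_mb = full_copy_storage_mb(40, 4_686)
openoffice_mb = full_copy_storage_mb(634, 20_000)

print(f"Vault: about {vault_mb / 1024:.0f} GB")                    # about 183 GB
print(f"OpenOffice.org: about {openoffice_mb / 1024**2:.1f} TB")   # about 12.1 TB
```

The cost grows with the product of tree size and checkin count, so a tree that is both larger and busier gets expensive very quickly.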
At this point, the argument that "disk space is cheap" starts to break down. The disk space for 12 TB of data is cheaper than it has ever been in the history of the planet. But this is mission critical data. We have to consider things like performance and backups and RAID and administration. The cost of storing 12 TB of ultra-important data is more than just the cost of the actual disk platters.
So we actually do have an incentive to store this information a bit more efficiently. Fortunately, there is an obvious reason why this is going to be easy to do. We observe that tree N is often not terribly different from tree N-1. By definition, each version of the tree is derived from its predecessor. A checkin might be as simple as a one-line fix to a single file. All of the other files are unchanged, so we don't really need to store another copy of them.
So, we don't want to store the full contents of the tree for every single change. Instead, we want a way to store a tree represented as a set of changes to another tree. We call this a "delta".
Delta direction
As we decide to store our repositories using deltas, we must be concerned about performance. Retrieving a tree which is in a deltified representation requires more effort than retrieving one which is stored in full. For example, let's suppose that version 1 of the tree is stored in full, but every subsequent revision is represented as a delta from its predecessor. This means that in order to retrieve version 4,686, we must first retrieve version 1 and then apply 4,685 deltas. Obviously, this approach would mean that retrieving some versions will be faster than others. When using this approach we say that we are using "forward deltas", because each delta expresses the set of changes from one version to the next.
We observe that not all versions of the tree are equally likely to be retrieved. For example, version 83 of the Vault tree is not special in any way. It is likely that we have not retrieved that version in over a year. I suspect that we will never retrieve it again. However, we retrieve the latest version of the tree many times per day. In fact, as a broad generalization, we can say that at any given moment, the most recent version of the tree is probably the most likely one to be needed.
The simplistic use of forward deltas delivers its worst performance for the most common case. Not good.
Another idea is to use "reverse deltas". In this approach, we store the most recent tree in full. Every other tree N is represented as a set of differences from tree N+1. This approach delivers its best performance for the most common case, but it can still take an awfully long time to retrieve older trees.
Some SCM tools use some sort of a compromise design. In one approach, instead of storing just one full tree and representing every other tree as a delta, we sprinkle a few more full trees along the way. For example, suppose that we store a full tree for every 10th version. This approach uses more disk space, but the SCM server never has to apply more than 9 deltas to retrieve any tree.
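To compare the three designs, it helps to count how many deltas the server must apply to rebuild a given version. The sketch below is only an illustration, not how any particular tool works. The scheme names are mine, and for the compromise design I assume full trees are stored at versions 1, 11, 21 and so on, with forward deltas in between.

```python
def deltas_to_apply(version: int, latest: int, scheme: str, interval: int = 10) -> int:
    """Number of deltas needed to reconstruct `version` (1-based) under a storage scheme."""
    if not 1 <= version <= latest:
        raise ValueError("version out of range")
    if scheme == "forward":
        # Version 1 is stored in full; every later version is a delta from its predecessor.
        return version - 1
    if scheme == "reverse":
        # The latest version is stored in full; version N is a delta from version N+1.
        return latest - version
    if scheme == "keyframes":
        # Assumption: a full tree every `interval` versions, forward deltas in between.
        return (version - 1) % interval
    raise ValueError(f"unknown scheme: {scheme}")


latest = 4_686
for scheme in ("forward", "reverse", "keyframes"):
    print(scheme, deltas_to_apply(latest, latest, scheme), deltas_to_apply(83, latest, scheme))
```

With forward deltas the latest tree is the slowest one to fetch. With reverse deltas it costs nothing, and version 83 pays the price instead.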
What is a delta?
I've been throwing around this concept of deltas, but I haven't stopped to describe them.
A tree is a hierarchy of folders and files. A delta is the difference between two trees. In theory, those two trees do not need to be related. However, in practice, the only reason we calculate the difference between them is because one of them is derived from the other. Some developer started with tree N and made one or more changes, resulting in tree N+1.
We can think of the delta as a set of changes. In fact, many SCM tools use the term "changeset" for exactly this purpose. A changeset is merely a list of the changes which express the difference between two trees.
For example, let's suppose that Wilbur starts with tree N and makes the following changes:
- He deletes $/top/subfolder/foo.c because it is no longer needed.
- He edits $/top/subfolder/Makefile to remove foo.c from the list of file names
- He edits $/top/bar.c to remove all the calls to the functions in foo.c
- He renames $/top/hello.c and gives it the new name hola.c
- He adds a new file called feature_creep.c to $/top/
- He edits $/top/Makefile to add feature_creep.c to the list of filenames
- He moves $/top/subfolder/readme.txt into $/top
At this point, he commits all of these changes to the repository as a single transaction. When the SCM server stores this delta, it must remember all of these changes.
For changeset item 1 above, the delete of foo.c is easily represented. We simply remember that foo.c existed in tree N but does not exist in tree N+1.
For changeset item 4, the rename of hello.c is a bit more complex. To handle renames, we need each object in the repository to have an identifier which never changes, even when the name or location of the item changes.
For changeset item 7, the move of readme.txt is another example of why repositories need IDs for each item. If we simply remember every item by its path, we cannot remember the occasions when that path changes.
Changeset item 5 is going to be a lot bulkier than some of the other items here. For this item we need to remember that tree N+1 has a file called feature_creep.c which was never present in tree N. However, a full representation of this changeset item needs to contain the entire contents of that file.
Changeset items 2, 3 and 6 represent situations where a file which already existed has been modified in some way. We could handle these items the same way as item 5, by storing the entire contents of the new version of the file. However, we will be happier if we can do deltas at the file level just as we are doing deltas at the tree level.
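Here is a rough sketch of why stable IDs make renames and moves easy to record. Nothing here reflects the internals of Vault or any other tool. The `Item` class, the integer IDs and the tuple format for changes are all made up for illustration, and file content edits (items 2, 3 and 6) are left out.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class Item:
    name: str
    parent: Optional[int]  # ID of the containing folder; None for the root


def path_of(tree: Dict[int, Item], item_id: int) -> str:
    """Build the current path of an item by walking up through its parents."""
    item = tree[item_id]
    if item.parent is None:
        return item.name
    return path_of(tree, item.parent) + "/" + item.name


def apply_changeset(tree: Dict[int, Item], changes: list) -> Dict[int, Item]:
    """Return tree N+1, leaving tree N untouched."""
    new_tree = dict(tree)
    for change in changes:
        kind, item_id = change[0], change[1]
        if kind == "delete":
            del new_tree[item_id]
        elif kind == "rename":
            new_tree[item_id] = replace(new_tree[item_id], name=change[2])
        elif kind == "move":
            new_tree[item_id] = replace(new_tree[item_id], parent=change[2])
        elif kind == "add":
            new_tree[item_id] = Item(name=change[2], parent=change[3])
        else:
            raise ValueError(f"unknown change: {kind}")
    return new_tree


tree_n = {
    1: Item("$", None), 2: Item("top", 1), 3: Item("subfolder", 2),
    4: Item("foo.c", 3), 5: Item("hello.c", 2), 6: Item("readme.txt", 3),
}
changeset = [
    ("delete", 4),                     # item 1: foo.c is gone
    ("rename", 5, "hola.c"),           # item 4: same ID, new name
    ("add", 7, "feature_creep.c", 2),  # item 5: new ID, new file
    ("move", 6, 2),                    # item 7: same ID, new parent
]
tree_n1 = apply_changeset(tree_n, changeset)
print(path_of(tree_n1, 5))  # $/top/hola.c
print(path_of(tree_n1, 6))  # $/top/readme.txt
```

Because item 5 keeps its ID across the rename, the history of hello.c and hola.c stays attached to one object. A path-based store would see one file vanish and an unrelated one appear.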
File deltas
A file delta merely expresses the difference between two files. Once again, the reason we calculate a file delta is because we believe it will be smaller than the file itself, usually because one of the files is derived from the other.
For text files, a well-known approach to the file delta problem is to compare line-by-line and output a list of lines which have been modified, inserted or deleted. This is the same kind of result produced by the Unix 'diff' command. The bad news is that this approach only works for text files. The good news is that software developers and web developers have a lot of text files.
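To make the line-by-line idea concrete, here is a small sketch using Python's standard `difflib` module. It is not the algorithm used by 'diff' or by any SCM tool, just a convenient stand-in. The sample file contents and the `line_changes` helper are invented for the example.

```python
import difflib

old_lines = ["int main(void) {", "    foo();", "    bar();", "    return 0;", "}"]
new_lines = ["int main(void) {", "    bar();", "    feature_creep();", "    return 0;", "}"]


def line_changes(old, new):
    """List the non-matching regions between two files, compared line by line."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # unchanged lines do not need to be stored
    ]


for tag, removed, added in line_changes(old_lines, new_lines):
    print(tag, removed, added)
```

Only the changed lines appear in the result. Everything the two versions have in common is implied by the old file.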
CVS and Perforce use this approach for repository storage. Text files are deltified using a line-oriented diff. Binary files are not deltified at all, although Perforce does reduce the penalty somewhat by compressing them.
Subversion and Vault are examples of tools which use binary file deltas for repository storage. Vault uses a file delta algorithm called VCDiff, as described in RFC 3284. This algorithm is byte-oriented, not line-oriented. It outputs a list of byte ranges which have been changed. This means it can handle any kind of file, binary or text. As an ancillary benefit, the VCDiff algorithm compresses the data at the same time.
Binary deltas are a critical feature for some SCM tool users, especially in situations where the binary files are large. Consider the case where a user checks out a 10 MB file, changes a few bytes, and checks it back in. In CVS, the size of the repository will increase by 10 MB. In Subversion and Vault, the repository will only grow by a small amount.
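The following sketch shows the general shape of a byte-oriented delta: a list of "copy this range from the old file" and "add these new bytes" instructions. To be clear, this is not VCDiff and not what Vault or Subversion actually do. It borrows `difflib.SequenceMatcher` from the Python standard library to find the matching byte ranges, and the instruction format is made up.

```python
import difflib
import random


def make_delta(old: bytes, new: bytes) -> list:
    """Describe `new` as copies of byte ranges from `old` plus literal new bytes."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2 - i1))   # offset and length in the old file
        elif tag in ("replace", "insert"):
            delta.append(("add", new[j1:j2]))     # bytes that exist only in the new file
        # "delete" needs no instruction: those old bytes are simply never copied
    return delta


def apply_delta(old: bytes, delta: list) -> bytes:
    """Rebuild the new file from the old file and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            _, offset, length = op
            out += old[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


# A 4 KB "binary file" of pseudo-random bytes, then a version with 3 bytes changed.
rng = random.Random(0)
old = bytes(rng.randrange(256) for _ in range(4096))
changed = bytearray(old)
for k in range(2000, 2003):
    changed[k] ^= 0xFF
new = bytes(changed)

delta = make_delta(old, new)
added = sum(len(op[1]) for op in delta if op[0] == "add")
print(f"file size: {len(new)} bytes, literal bytes stored in delta: {added}")
assert apply_delta(old, delta) == new
```

The delta carries only a few literal bytes plus some bookkeeping, which is why the repository grows by a small amount instead of by the size of the whole file.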
Deltas and diffs are different
Please note that I make a distinction between the terms "delta" and "diff".
- A "delta" is the difference between two versions. If we have one full file and a delta, then we can construct the other full file. A delta is used primarily because it is smaller than the full file, not because it is useful for a human being to read. The purpose of a delta is efficiency. When deltas are done at the level of bytes instead of textual lines, that efficiency becomes available to all kinds of files, not just text files.
- A "diff" is the human-readable difference between two versions of a text file. It is usually line-oriented, but really cool visual diff tools can also highlight the specific characters on a line which differ. The purpose of a diff is to show a developer exactly what has changed between two versions of a file. Diffs are really useful for text files, because human beings tend to read text files. Most human beings don't read binary files, and human-readable diffs of binary files are similarly uninteresting.
As mentioned above, some SCM tools use binary deltas for repository storage or to improve performance over slow network lines. However, those tools also support textual diffs. Deltas and diffs serve two distinct purposes, both of which are important. It is merely coincidence that some SCM tools use textual diffs as their repository deltas.
The evolution of source control technology
At this point I should admit that I have presented a somewhat idealized view of the world. Not all SCM tools work the way I have described. In fact, I have presented things exactly backwards, discussing tree-wide deltas before file deltas. That is not the way the history of the world unfolded.
Prehistoric ancestors of modern programmers had to live with extremely primitive tools. Early version control systems like RCS only handled file deltas. There was no way for the system to remember folder-level operations like adding, renaming or deleting files.
Over time, the design of SCM tools matured. CVS is probably the most popular source control tool in the world today. It was originally developed as a set of wrappers around RCS which essentially provided support for some folder-level operations. Although CVS still has some important limitations, it was a big step forward.
Today, several modern source control systems are designed around the notion of tree-wide deltas. By accurately remembering every possible operation which can happen to a repository, these tools provide a truly complete history of a project.
What can be stored in a repository?
Best Practice: Checkin all the canonical stuff, and nothing else
Although you can store anything you want in a repository, that doesn't mean you should. The best practice here is to store everything which is necessary to do a build, and nothing else. I call this "the canonical stuff".
To put this another way, I recommend that you do not store any file which is automatically generated. Checkin your hand-edited source code. Don't checkin EXEs and DLLs. If you use a code generation tool, checkin the input file, not the generated code file. If you generate your product documentation in several different formats, checkin the original format, the one that you manually edit.
If you have two files, one of which is automatically generated from the other, then you just don't need to checkin both of them. You would in effect be managing two expressions of the same thing. If one of them gets out of sync with the other, then you have a problem.
People sometimes ask us what kind of things can be stored in a repository. In general, the answer is: "Any file". It is true that I am focusing on tools which are designed for software developers and web developers. However, those tools don't really care what kind of file you store inside them. Vault doesn't care. Perforce, Subversion and CVS don't care. Any of these tools will gratefully accept any file you want to store.
If you will be storing a lot of binary files, it is helpful to know how your SCM tool handles them. A tool which uses binary deltas in the repository may be a better choice.
If all of your files are binary, you may want to explore other solutions. Tools like Vault and Subversion were designed for programmers. These products contain features designed specifically for use with source code, including diff and automerge. You can use these systems to store all of your Excel spreadsheets, but they are probably not the best tool for the job. Consider exploring "document management" systems instead.
How is the repository itself stored?
We need to descend through one more layer of abstraction before we turn our attention back to more practical matters. So far I have been talking about how things are stored and managed within a repository, but I have not broached the subject of how the repository itself is stored.
A repository must store every version of every file. It must remember the hierarchy of files and folders for every version of the tree. It must remember metadata, information about every file and folder. It must remember checkin comments, explanations provided by the developer for each checkin. For large trees and trees with very many revisions, this can be a lot of data that needs to be managed efficiently and reliably. There are several different ways of approaching the problem.
RCS kept one archive file for every file being managed. If your file was called "foo.c" then the archive file was called "foo.c,v". Usually these archive files were kept in a subdirectory of the working directory, just one level down. RCS files were plain text; you could just look at them with any editor. Inside the file you would find a bunch of metadata and a full copy of the latest version of the file, plus a series of line-oriented file deltas, one for each previous version. (Please forgive me for speaking of RCS in the past tense. Despite all the fond memories, that particular phase of my life is over.)
CVS uses a similar design, albeit with a lot more capabilities. A CVS repository is distinct, completely separate from the working directory, but it still uses ",v" files just like RCS. The directory structure of a CVS repository contains some additional metadata.
When managing larger and larger source trees, it becomes clear that the storage challenges of a repository are exactly the same as the storage challenges of a database. For this reason, many SCM tools use an actual database as the backend data store. Subversion uses Berkeley DB. Vault uses SQL Server 2000. The benefit of this approach is enormous, especially for SCM tools which support atomic transactions. Microsoft has invested lots of time and money to ensure that SQL Server is a safe place to store important information. Data corruption simply doesn't happen. All of the ultra-tricky details of transactions are handled by the underlying database.
Perforce uses somewhat of a hybrid approach, storing all of the metadata in a database but keeping all of the actual file contents in RCS files. This approach trades some safety for speed. Since Perforce manages its own archive files, it has to take responsibility for all the strange things that threaten to corrupt them. On the other hand, writing a file is a bit faster than writing a blob into a SQL database. Perforce has the reputation of being one of the fastest SCM tools.
Managing repositories
Best Practice: Use separate repositories for things which are truly separate
Most SCM tools offer the ability to have multiple distinct repositories. Vault can even host multiple repositories on the same Vault server. People often ask us when this capability should be used.
In general, you should store related items in the same repository. Start a separate repository only in situations where the contents of the two are completely unrelated. In a small ISV, it may be quite logical to have only one repository which contains every project.
Creating a source control repository is kind of a special event. It's a little bit like adopting a cat. People often get a cat without realizing the animal is going to be around for 10-20 years. Your repository may have similar longevity, or even longer.
Shortly after SourceGear was founded in 1997, we created a SourceSafe repository. Over seven years later, that repository is still in use, almost every day. (Along with a whole bunch of legacy projects, it contains the source code for SourceOffSite. We never migrated that project to Vault because we wanted the SourceOffSite developers to continue eating their own dogfood.)
That repository is well over a gigabyte in size (which is actually rather small, but then SourceGear has never been a very big company). It contains thousands of files, thousands of checkins, and has been backed up thousands of times.
Treat your repository well and it will serve you well:
- Obviously you should do regular backups. That repository contains everything your fussy and expensive programmers have ever created. Don't risk losing it.
- Just for fun, take an hour this week and check your backup to see if it actually works. It's shocking how many people are doing daily backups that cannot actually be restored when they are needed.
- Put your repository on a reliable server. If your repository goes down, your entire team is blocked from doing work. Disk drives like to fail, so use RAID. Power supplies like to fail, so get a server with redundant power supplies. The electrical grid likes to fail, so get a good Uninterruptible Power Supply (UPS).
- Be conservative in the way your SCM server machine is managed. Don't put anything on that machine that doesn't need to be there. Don't feel the need to install every single Service Pack on the day it gets released. I've been shocked how many times one of our servers went south simply because we installed a service pack or hotfix from Windows Update. Obviously I want our machines to be kept current with the latest security fixes, but I've been burned too many times not to be cautious. Install those patches on some other machine before you put them on critical servers.
- Keep your SCM server inside a firewall. If you need to allow your developers to access the repository from home, carefully poke a hole, but leave everything else as tight as you can. Make sure your developers are using some sort of bulk encryption. Vault uses SSL. Tools like Perforce, CVS and Subversion can be tunneled through ssh or something similar.
This brief list of tips is hardly a complete guide for administrators. I am merely trying to describe the level of care and caution which should be used for your SCM repository.
Undo
As I have mentioned, one of the best things about source control is that it contains your entire history. Every version of everything is stored. Nothing is ever deleted.
However, sometimes this benefit can be a real pain. What if I made a mistake and checked in something that should not be checked in? My history contains something I would rather forget. I want to pretend that it never happened. Isn't there some way to really delete from a repository?
In general, the recommended way to fix a problem is to checkin a new version which fixes it. Try not to worry about the fact that your repository contains a full history of the error. Your mistakes are a part of your past. Accept them and move on with your life.
However, most SCM tools do provide one or more ways of dealing with this situation. First, there is a command I call "rollback". This command is essentially an "undo" for revisions of a file. For example, let's say that a certain file is at version 7 and we want to go back to version 6. In Vault, we select version 6 and choose the Rollback command.
To be fair, I should admit that the rollback command is not always destructive. In some SCM tools, the rollback feature really does make version 7 disappear forever. Vault's rollback is non-destructive. It simply creates a version 8 which is identical to version 6. The designers of Vault are fanatical purists, or at the very least, one of them is.
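The non-destructive flavor of rollback is simple enough to sketch. This toy model treats a file's history as a plain Python list of contents, which is my simplification and says nothing about how Vault stores versions. The `rollback` function is hypothetical.

```python
from typing import List


def rollback(history: List[bytes], target_version: int) -> List[bytes]:
    """Non-destructive rollback: append a new version identical to `target_version`.

    Versions are 1-based. Nothing is removed from the history.
    """
    if not 1 <= target_version <= len(history):
        raise ValueError("no such version")
    return history + [history[target_version - 1]]


# A file at version 7, rolled back to version 6.
history = [f"contents of version {n}".encode() for n in range(1, 8)]
after = rollback(history, 6)

print(len(after))             # 8
print(after[7] == after[5])   # True: version 8 is identical to version 6
print(after[6] == history[6]) # True: version 7 is still in the history
```

A destructive rollback would instead truncate the list, and version 7 would be gone for good.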
As a concession to those who are less fanatical, Vault does support a way to truly destroy things in a repository. We call this feature "obliterate". I believe Subversion and Perforce use the same term. The obliterate command is the only way to delete something and make it truly gone forever.
Best Practice: Never obliterate anything that was real work
The purist in me wants to recommend that nothing should ever be obliterated. However, my pragmatist side prevails. There are situations where obliterate is not sinful.
However, obliterate should never be used to delete actual work. Don't obliterate a file simply because you discovered it to be a bad idea. Don't obliterate a file simply because you don't need it anymore. Obliterate is for situations where something in the repository should never have been there at all. For example, if you accidentally checkin a gigabyte of MP3s alongside your C++ include files, obliterate is a justifiable choice.
当然,删除应该决不用于删除真正的工作。不要因为你发现它不好就删除一个文件。也不要因为不再需要就删除。删除是为了一些在库中根本不需要的。例如,如果你意外的签入一个MP3的文件到你的C++文件里面,那删除就是一个正确的选择。
In my original spec for Vault, I had decided that we would not implement any form of destructive delete. We eventually decided to compromise and implement this command, but I really wanted to discourage its use. SourceSafe makes it far too easy to rewrite history and pretend that something never happened. In the Delete dialog box, SourceSafe includes a checkbox called "Destroy Permanently". This is an atrocious design decision, roughly equivalent to leaving a sledgehammer next to the server machine so that people can bash the hard disks with it every once in a while. This checkbox is almost irresistible. It simply begs to be checked, even though it is very rarely the right thing to do.
在Vault的原始规则里面,我曾经确定我们不会执行任何破坏性的删除。我们最后决定妥协并使用这个命令,但是我真正的希望阻止它的使用。SourceSafe使这个命令很简单快速的重写历史和假设什么都没有发生过。在删除对话框,SourceSafe包含了一个成为“永久破坏”的选择框。这是一个很凶悍的设计思想,粗糙的等于拿一个大的锤子让人们可以在硬盘旋转中去敲打服务器。这个选择框是相当有诱惑的。它简单的要求检查,尽管很少有正确的事情来做。
When we first designed the obliterate command for Vault, I wanted its user interface to somehow make the user feel guilty. I argued that the obliterate dialog box should include a photograph of a 75-year-old Catholic nun scowling and holding a yardstick.
当我们开始为Vault设计删除命令的时候,我希望它的用户界面能够使用户莫名其妙的觉得不舒服。我辩论说这个删除对话框包含了一个拿着一根绳子的75岁的修女。
The rest of the team agreed that we should discourage people from using this command, but in the end, we settled on a less graphical approach. In Vault, the obliterate command is available only in the Admin client, not the regular client people use every day. In effect, we made the obliterate command available, but inconvenient. People who really need to obliterate can find the command and get it done. Everyone else has to think twice before they try to rewrite history and pretend something never happened.
其他的团队成员同意我应该劝阻人民不要使用这个命令,但是到最后,我们决定采取了一个小的图形方式。在Vault里面,删除命令是仅仅在管理员端可以使用的,不是其他的客户端的客户可以每天使用的。我们还使这个命令可用,却并不方便。真正需要删除的人们可以找这个命令然后执行。其他的人在他们试图重写历史并伪装什么事情都没有发生之前需要思考两次。
Kimchi again?
再来点韩国泡菜?
Recently, when I asked my fifth-grade daughter what she had learned in school, she proudly informed me that "everyone in Korea eats kimchi at every meal, every day". In the world of a ten-year-old, things are simpler. Rules don't have exceptions. Generalizations always apply.
最近我问我五年级的女儿她从学校学到了什么,她骄傲的告诉我“在韩国的人每天、每顿都吃韩国泡菜”。在一个十岁的年纪,事情非常简单。规则没有例外。通常总是被运用。
This is how we learn. We understand the basic rules first and see the finer points later. First we learn that memory leaks are impossible in the CLR. Later, when our app consumes all available RAM, we learn more.
这就是我们如何来学习。我们首先了解了基本规则,然后再看重点。首先我们认识到内存泄漏在语音录音器里面是不可能的。后来,当我们的程序消耗了所有可用的RAM,我们就学到了更多。
My habit as I write these chapters is to first present the basics in a "matter of fact" fashion, rarely acknowledging that there are exceptions to my broad generalizations. I did this during the chapter on checkins, failing to mention the "edit-merge-commit" model until I had thoroughly explored "checkout-edit-checkin".
我的习惯就象我写这些文章一样,首先以一种事实方式呈现基础,我的宽泛的概括很罕见的得到认可。我在章节签入里面做这些事情,直到我彻底的研究了“签出-编辑-签入”之前我都没有提及“编辑-合并-提交”。
In this chapter, I have written everything from the perspective of just one specific architecture. SCM tools like Vault, Perforce, CVS and Subversion are based on the concept of a centralized server which hosts a single repository. Each client has a working folder. All clients contact the same server.
在这个章节,我只以一个特定结构的看法去描述每件事情。配置管理工具,比如Vault,Perforce,CVS和Subversion都是基于集中只有一个单独的库的服务器的概念。每个客户端有一个工作目录,所有的客户端同同一台服务器联系。
I confess that not all SCM tools work this way. Tools like BitKeeper and Arch are based on the concept of distributed repositories. Instead of one repository, there can be several, or even many. Things can be retrieved from or committed to any repository at any time. The repositories are synchronized by migrating changesets from one repository to another. This results in a merge situation which is not altogether different from merging branches.
我承认不是所有的配置管理工具都是用那种方式工作。比如BitKeeper 和Arch都是基于分布式数据库的。一个库可以有好几个,甚至更多。工作能够在任何时间从任何库中获得或提交。这个库是通过从一个库移动变更到另一个库同步的。在一个合并的地方这个结果不是同合并分支差异相同的。
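The idea of synchronizing by migrating changesets can be sketched very roughly as follows. This is a toy illustration only, not a description of how BitKeeper or Arch actually work: the Repository class, the pull_from method and the changeset ids are all hypothetical, and the merge work that real tools must do when changesets conflict is left out entirely.

```python
class Repository:
    """Toy repository: just a set of changesets keyed by id (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.changesets = {}  # changeset id -> description

    def commit(self, changeset_id, description):
        """Record a new changeset in this repository only."""
        self.changesets[changeset_id] = description

    def pull_from(self, other):
        """Copy over the changesets this repository lacks; return their ids."""
        missing = sorted(set(other.changesets) - set(self.changesets))
        for changeset_id in missing:
            self.changesets[changeset_id] = other.changesets[changeset_id]
        return missing


alice = Repository("alice")
bob = Repository("bob")

# Each repository accepts commits independently of the other.
alice.commit("a1", "fix parser bug")
bob.commit("b1", "add logging")

# Synchronize by migrating changesets in both directions.
moved_to_bob = bob.pull_from(alice)
moved_to_alice = alice.pull_from(bob)

print(moved_to_bob)                                     # ['a1']
print(moved_to_alice)                                   # ['b1']
print(sorted(alice.changesets) == sorted(bob.changesets))  # True
```

Once the changesets have moved in both directions, the two repositories hold the same history. A real tool would also have to reconcile the contents of the files, which is where the merge situation comes in.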
From the perspective of this SCM geek, distributed repositories are an attractive concept. Admittedly, they are advanced and complex, requiring a bit more of a learning curve on the part of the end user. But for the power user, this paradigm for source control is very cool.
关于这个配置管理讨厌的看法,分布库是一个吸引人的概念。诚然,他们是高级和复杂的,需要终端用户更多的学习。但是对高级用户,这个例子对版本控制非常酷。
Having no experience in the implementation of these systems, I will not be explaining their behavior in any detail. Suffice it to say that this approach is similar in some ways, but very different in others. This series of articles will continue to focus on the more mainstream architecture for source control.
还没有执行这些系统的经验,我将不会解释他们的行为。有力的说明这个方式在某些地方是相同的,但是又同其他的非常不同。这个系列文章将继续关注主流结构的版本控制工具。
Looking ahead
In this chapter, I discussed the details of repositories. In the next chapter, I'll go back over to the client side and dive into the details of working folders.
这一章节,我论述了关于库的情况。下一章节,我将回头来描述客户端和深入钻研工作目录