IPFS and many other distributed systems take advantage of a data structure called a directed acyclic graph, or DAG. Specifically, they use Merkle DAGs, which are DAGs where each node has a unique identifier that is a hash of the node’s contents. Sound familiar? This refers back to the CID concept that we covered in the previous section. Put another way: identifying a data object (like a Merkle DAG node) by the value of its hash is content addressing.
IPFS uses a Merkle DAG that is optimized for representing directories and files, but you can structure a Merkle DAG in many different ways. For example, Git uses a Merkle DAG that holds many versions of your repo.
To build a Merkle DAG representation of your content, IPFS often first splits it into blocks. Splitting content into blocks means that different parts of a file can come from different sources and be authenticated quickly. (If you’ve ever used BitTorrent, you may have noticed that when you download a file, BitTorrent can fetch it from multiple peers at once; this is the same idea.)
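As a rough illustration of why blocks help, here is a minimal Python sketch. The tiny `CHUNK_SIZE`, the function names, and the use of a bare SHA-256 hex digest as the block identifier are assumptions made for readability; real IPFS uses much larger blocks and full CIDs.

```python
import hashlib

CHUNK_SIZE = 4  # assumption: tiny on purpose so the example is easy to follow


def split_into_blocks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut data into consecutive fixed-size blocks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def block_id(block: bytes) -> str:
    """Identify a block by the SHA-256 digest of its bytes."""
    return hashlib.sha256(block).hexdigest()


def verify_block(block: bytes, expected_id: str) -> bool:
    """Check a block received from any peer against the identifier we asked for."""
    return block_id(block) == expected_id


content = b"hello merkle dag"
blocks = split_into_blocks(content)
ids = [block_id(b) for b in blocks]

# Each block can be fetched from a different source and checked on its own.
all_valid = all(verify_block(b, i) for b, i in zip(blocks, ids))
```

Because each block is verified independently, a bad block from one peer can be rejected and requested again elsewhere without discarding the rest of the download.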
Merkle DAGs are a bit of a “turtles all the way down” scenario; that is, everything has a CID. Let’s say you have a file, and its CID identifies it. What if that file is in a folder with several other files? Those files will have CIDs too. What about that folder’s CID? It would be a hash of the CIDs from the files underneath (i.e., the folder’s content). In turn, those files are made up of blocks, and each of those blocks has a CID. You can see how a file system on your computer could be represented as a DAG. You can also see, hopefully, how Merkle DAGs start to form. For a visual exploration of this concept, take a look at the IPLD Explorer.
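The file-in-a-folder picture can be sketched in a few lines of Python. This is a simplified model, not the format IPFS actually uses: the helper names (`h`, `file_id`, `folder_id`) and the text-based encoding of entries are hypothetical, chosen only to show identifiers being derived from the identifiers below them.

```python
import hashlib


def h(data: bytes) -> str:
    """Hex SHA-256 digest, standing in for a CID in this sketch."""
    return hashlib.sha256(data).hexdigest()


def file_id(block_ids: list[str]) -> str:
    """A file's identifier covers the ordered identifiers of its blocks."""
    return h("\n".join(block_ids).encode())


def folder_id(entries: dict[str, str]) -> str:
    """A folder's identifier covers the names and identifiers of its entries."""
    listing = "\n".join(f"{name}:{cid}" for name, cid in sorted(entries.items()))
    return h(listing.encode())


readme = file_id([h(b"# Title\n"), h(b"Some text\n")])
logo = file_id([h(b"fake image bytes")])
site = folder_id({"README.md": readme, "logo.png": logo})

# Editing one block changes the file's identifier, and therefore the folder's.
readme_v2 = file_id([h(b"# Title\n"), h(b"Other text\n")])
site_v2 = folder_id({"README.md": readme_v2, "logo.png": logo})
```

Note that `logo` is untouched by the edit, so both versions of the folder point at the same identifier for it.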
Another useful feature of Merkle DAGs and breaking content into blocks is that if you have two similar files, they can share parts of the Merkle DAG; that is, parts of different Merkle DAGs can reference the same subset of data. For example, if you update a website, only updated files receive new content addresses. Your old version and your new version can refer to the same blocks for everything else. This can make transferring versions of large datasets (such as genomics research or weather data) more efficient because you only need to transfer the parts that are new or have changed instead of creating entirely new files each time.
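To make the sharing concrete, the following sketch compares two versions of a small byte string. The 4-byte fixed-size chunking and the `chunk_ids` helper are assumptions for illustration only.

```python
import hashlib


def chunk_ids(data: bytes, size: int) -> list[str]:
    """Return the SHA-256 identifier of every fixed-size chunk of data."""
    return [
        hashlib.sha256(data[i:i + size]).hexdigest()
        for i in range(0, len(data), size)
    ]


v1 = b"AAAABBBBCCCCDDDD"
v2 = b"AAAABBBBXXXXDDDD"  # only the third chunk differs

ids_v1 = chunk_ids(v1, 4)
ids_v2 = chunk_ids(v2, 4)

# Blocks whose identifiers appear in both versions are stored and sent once.
shared = set(ids_v1) & set(ids_v2)

# A peer that already holds v1 only needs the blocks it has never seen.
to_transfer = [cid for cid in ids_v2 if cid not in set(ids_v1)]
```

With fixed-size chunks this works well for in-place edits; an insertion that shifts every later byte would change every later chunk, which is one reason other chunking strategies exist.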
So, to recap, IPFS lets you give CIDs to content and link that content together in a Merkle DAG. Now let’s move on to the last piece: how you find and move content.
A Merkle DAG is a DAG where each node has an identifier that is the result of hashing the node’s contents (any opaque payload carried by the node, plus the list of identifiers of its children) using a cryptographic hash function like SHA-256. This brings some important considerations:
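Merkle DAGs can only be constructed from the leaves, that is, from nodes without children. Parents are added after children because the children’s identifiers must be computed in advance to be able to link them.
Every node in a Merkle DAG is itself the root of a (sub-)Merkle DAG, and this subgraph is contained in the parent DAG.
Merkle DAG nodes are immutable. Any change in a node changes its identifier and thus affects all of its ancestors in the DAG, essentially creating a different DAG.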
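These considerations can be seen in a small Python model. This is a sketch under stated assumptions: the `Node` class, its `cid` property, and the way the payload and child identifiers are fed to SHA-256 are all hypothetical and do not reproduce the IPFS node encoding.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Node:
    payload: bytes
    children: tuple["Node", ...] = ()

    @property
    def cid(self) -> str:
        """Hash of the payload followed by the identifiers of the children."""
        hasher = hashlib.sha256()
        # A length prefix keeps the payload distinct from the child identifiers.
        hasher.update(len(self.payload).to_bytes(8, "big"))
        hasher.update(self.payload)
        for child in self.children:
            hasher.update(bytes.fromhex(child.cid))
        return hasher.hexdigest()


# Leaves first: their identifiers must exist before a parent can link to them.
leaf = Node(b"leaf")
shared = Node(b"shared")
left = Node(b"left", (leaf, shared))
right = Node(b"right", (shared,))  # "shared" now has two parents
root = Node(b"root", (left, right))

# "Editing" the leaf really means building a new leaf and new ancestors.
leaf_v2 = Node(b"leaf, edited")
left_v2 = Node(b"left", (leaf_v2, shared))
root_v2 = Node(b"root", (left_v2, right))
```

The `right` branch is reused unchanged by `root_v2`, while `left_v2` and `root_v2` get new identifiers: the edit produced a different DAG rather than modifying the old one.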
Merkle DAGs are similar to Merkle trees, but there are no balance requirements and every node can carry a payload. In DAGs, several branches can re-converge or, in other words, a node can have several parents.
Identifying a data object (like a Merkle DAG node) by the value of its hash is referred to as content addressing. Thus, we call the node identifier a Content Identifier, or CID.
For example, the previous linked list, assuming that the payload of each node is just the CID of its descendant, would be: A=Hash(B)→B=Hash(C)→C=Hash(∅). The properties of the hash function ensure that no cycles can exist when creating Merkle DAGs. (Note: Hash functions are one-way functions. Creating a cycle should therefore be computationally infeasible, unless some weakness is discovered and exploited.)
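The chain can be computed directly. In this sketch `H` is plain SHA-256 and ∅ is modeled as the empty byte string; both are assumptions made for illustration.

```python
import hashlib


def H(data: bytes) -> bytes:
    """SHA-256 digest, standing in for the Hash function in the text."""
    return hashlib.sha256(data).digest()


# The chain must be built from the end: C has no descendant, so it goes first.
C = H(b"")
B = H(C)
A = H(B)
```

A cycle would require C to depend on A, but A cannot be computed until C is already fixed. Closing the loop would mean finding an input whose hash leads back to itself, which is exactly what a one-way hash function is designed to make infeasible.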
Merkle DAGs are self-verifying structures. The CID of a node is uniquely tied to the contents of its payload and those of all its descendants. Thus two nodes with the same CID represent exactly the same DAG. This will be a key property for efficiently syncing Merkle-CRDTs without having to copy the full DAG, as exploited by systems like IPFS. Merkle DAGs are very widely used. Source control systems like Git use them to efficiently store the repository history, in a way that enables de-duplicating objects and detecting conflicts between branches.
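Self-verification can be sketched as a recursive check over an untrusted block store. The dictionary-based store, the JSON encoding of nodes, and the `node_cid`, `put`, and `verify` helpers are hypothetical; they only illustrate the idea that a trusted root identifier is enough to validate everything beneath it.

```python
import hashlib
import json


def node_cid(payload: str, links: list[str]) -> str:
    """Identifier of a node: hash of its payload and its children's identifiers."""
    encoded = json.dumps({"payload": payload, "links": links}, sort_keys=True)
    return hashlib.sha256(encoded.encode()).hexdigest()


def put(store: dict, payload: str, links=()) -> str:
    """Add a node to the store under its own identifier and return that identifier."""
    links = list(links)
    cid = node_cid(payload, links)
    store[cid] = {"payload": payload, "links": links}
    return cid


def verify(store: dict, cid: str) -> bool:
    """Check that the node stored under cid, and all its descendants, hash correctly."""
    node = store.get(cid)
    if node is None:
        return False
    if node_cid(node["payload"], node["links"]) != cid:
        return False
    return all(verify(store, link) for link in node["links"])


store = {}
leaf_cid = put(store, "leaf")
root_cid = put(store, "root", [leaf_cid])
ok_before = verify(store, root_cid)

# Simulate a corrupted or malicious store: the leaf no longer matches its identifier.
store[leaf_cid]["payload"] = "tampered"
ok_after = verify(store, root_cid)
```

Two replicas that compare root identifiers and find them equal can skip the whole subtree, which is what makes syncing without copying the full DAG possible.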
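P.S.: When a file is stored on the IPFS network, it is first split into chunks of 256 KB. The (MerkleDAG.Add) method is then called in a loop to build the file’s Merkle DAG. The file hash is created as follows: 1. Compute the SHA-256 digest of the chunked data. 2. Take bytes 0 to 31 of the result, that is, the full 32-byte digest. 3. Prefix the digest with the two multihash header bytes 0x12 0x20 (SHA-256, 32 bytes) and encode the result in base58. Because of that header, the encoded string always begins with Qm and is 46 characters long; this string is the file’s hash.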
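The encoding steps can be tried out with the standard library alone. This is a sketch, not the IPFS implementation: `base58btc` and `qm_hash` are hypothetical helper names, and the input here is raw bytes, whereas IPFS hashes the serialized DAG node that wraps the chunk, so the output will not match what `ipfs add` prints for the same data.

```python
import hashlib

# Base58 (Bitcoin) alphabet: no 0, O, I, or l.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58btc(data: bytes) -> str:
    """Encode bytes as base58, treating them as one big-endian integer."""
    num = int.from_bytes(data, "big")
    out = ""
    while num > 0:
        num, rem = divmod(num, 58)
        out = ALPHABET[rem] + out
    # Each leading zero byte is represented by a leading '1'.
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out


def qm_hash(block: bytes) -> str:
    """SHA-256 the bytes, add the multihash header, and base58-encode the result."""
    digest = hashlib.sha256(block).digest()  # 32 bytes
    multihash = bytes([0x12, 0x20]) + digest  # 0x12 = sha2-256, 0x20 = length 32
    return base58btc(multihash)


example = qm_hash(b"hello")
```

The `Qm` prefix is not appended as text; it falls out of base58-encoding a 34-byte value whose first two bytes are 0x12 0x20.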