How to think in graphs: An illustrative introduction to Graph Theory and its applications

by Vardan Grigoryan (vardanator)

Graph theory represents one of the most important and interesting areas in computer science. But at the same time it’s one of the most misunderstood (at least it was to me).

Understanding, using and thinking in graphs makes us better programmers. At least that’s how we’re supposed to think. A graph is a set of vertices V and a set of edges E, comprising an ordered pair G=(V, E).

While trying to study graph theory and implement some algorithms, I regularly got stuck, just because it was so boring.

The best way to understand something is to understand its applications. In this article, we’re going to demonstrate various applications of graph theory. But more importantly, these applications will contain detailed illustrations.

This approach might seem too detailed (to seasoned programmers), but believe me, as someone who was once there and tried to understand graph theory, detailed explanations are always preferable to succinct definitions.

So, if you’ve been looking for a “graph theory and everything about it tutorial for absolute unbelievable dummies”, then you’ve come to the right place. Or at least I hope. So let’s get started and dive in.

Table of Contents

  • Disclaimers
  • Seven Bridges of Königsberg
  • Graph representation: Intro
  • Intro to Graph representation and binary trees (Airbnb example)
  • Graph representation: Outro
  • Twitter example: tweet delivery problem
  • Graph Algorithms: intro
  • Netflix and Amazon: inverted index example
  • Traversals: DFS and BFS
  • Uber and the shortest path problem (Dijkstra’s algorithm)

Disclaimers

DISCLAIMER 1: I am not an expert in CS, algorithms, data structures and especially in graph theory. I am not involved in any project for the companies discussed in this article. Solutions to the problems are not final and could be improved drastically. If you find any issue or something unreasonable, you are more than welcome to leave a comment. If you work at one of the mentioned companies or are involved in corresponding software projects, please respond with the actual solution (it will be helpful to others). To all others, be patient readers, this is a pretty LONG article.

DISCLAIMER 2: This article is somewhat different in the way information is presented. Sometimes it might seem to digress a bit from the sub-topic, but patient readers will eventually find themselves with a complete understanding of the bigger picture.

DISCLAIMER 3: This article is written for a broad audience of programmers. While having junior programmers as the target audience, I hope it will be interesting to experienced professionals as well.

Seven Bridges of Königsberg

Let’s start with something that I used to regularly encounter in graph theory books that discuss “the origins of graph theory”: the Seven Bridges of Königsberg (not really sure, but you can pronounce it as “qyonigsberg”). There were seven bridges in Kaliningrad, connecting two big islands surrounded by the Pregolya river and two portions of mainland divided by the same river.

In the 18th century the city was called Königsberg (part of Prussia) and the area above had a lot more bridges. The problem, or just a brain teaser, with Königsberg’s bridges was to walk through the city crossing each of the seven bridges exactly once. They didn’t have an internet connection at that time, so it should have been entertaining. Here’s the illustrated view of the seven bridges of Königsberg in the 18th century.

Try it. See if you can walk through the city by crossing each bridge only once.

  • There should not be any uncrossed bridge(s).

  • Each bridge must not be crossed more than once.

If you are familiar with this problem, you know that it’s impossible to do it. Although you were trying hard enough and you may try even harder now, you’ll eventually give up.

Sometimes it’s reasonable to give up fast. That’s how Euler solved this problem - he gave up pretty soon. Instead of trying to solve it, he adopted a different approach: trying to prove that it’s not possible to walk through the city by crossing each bridge one and only one time.

Let’s try to understand how Euler was thinking and how he came up with the solution (if there isn’t a solution, it still needs a proof). That is a real challenge here, because walking through the thought process of such a venerable mathematician is kind of dishonorable. (So venerable that Knuth and friends dedicated their book to Leonhard Euler.) Rather, we will pretend to “think like Euler”. Let’s start with picturing the impossible.

There are four distinct places, two islands and two parts of mainland. And seven bridges. It’s interesting to find out if there is any pattern regarding the number of bridges connected to islands or mainland (we will use the term “land” to refer to the four distinct places).

At first glance, there seems to be some sort of a pattern. There is an odd number of bridges connected to each land. If you have to cross each bridge once, then you can enter a land and leave it if it has 2 bridges.

It’s easy to see in the illustrations above that if you enter a land by crossing one bridge, you can always leave the land by crossing its second bridge. Whenever a third bridge appears, you won’t be able to leave a land once you enter it by crossing all its bridges. If you try to generalize this reasoning for a single piece of land, you’ll be able to show that, in case of an even number of bridges it’s always possible to leave the land and in case of an odd number of bridges it isn’t. Try it in your mind!

Let’s add a new bridge to see how the number of overall connected bridges changes and whether it solves the problem.

Now that we have two even (4 and 4) and two odd (3 and 5) numbers of bridges connecting the four pieces of land, let’s draw a new route with the addition of this new bridge.

We saw that whether the number of bridges is even or odd played a role in determining if the solution was possible. Here’s a question. Does the number of bridges solve the problem? Should it be even all the time? Turns out that it’s not the case. That’s what Euler did. He found a way to show that the number of bridges matters. And more interestingly, the number of pieces of land with an odd number of connected bridges also matters. That’s when Euler started to “convert” lands and bridges into something we know as graphs. Here’s what a graph representing the Königsberg bridges problem could look like (note that our “temporarily” added bridge isn’t there).

One important thing to note is the generalization/abstraction of a problem. Whenever you solve a specific problem, the most important thing is to generalize the solution for similar problems. In this particular case, Euler’s task was to generalize the bridge crossing problem to be able to solve similar problems in the future, i.e. for all the bridges in the world. Visualization also helps to view the problem from a different angle. The following graphs are all various representations of the same Königsberg bridge problem shown above.

So yes, visually, graphs are a good choice for picturing problems. But now we need to find out how the Königsberg problem can be solved using graphs. Pay attention to the number of lines coming out of each circle. And yes, let’s name them as seasoned professionals would: from now on we will call the circles vertices and the lines connecting them edges. You might’ve seen letter notations, V for (vendetta?) vertex, E for edge.

The next important thing is the so-called degree of a vertex: the number of edges incident to the vertex. In our example above, the number of bridges connected to each land can be expressed as the degree of the corresponding graph vertex.
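
To make the idea of a degree concrete, here is a minimal C++ sketch that counts degrees from a list of edges. The vertex numbering, the names Edge, degrees and konigsberg, and the choice of an edge list are all assumptions made for illustration; the bridge layout follows the usual textbook description of the puzzle.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// An undirected edge between two vertices, identified by index.
using Edge = std::pair<std::size_t, std::size_t>;

// Returns the degree of every vertex: the number of edge endpoints at it.
std::vector<std::size_t> degrees(std::size_t vertex_count,
                                 const std::vector<Edge>& edges) {
    std::vector<std::size_t> result(vertex_count, 0);
    for (const Edge& e : edges) {
        ++result[e.first];
        ++result[e.second];
    }
    return result;
}

// Assumed numbering: 0 = north bank, 1 = south bank,
// 2 = central island, 3 = eastern island.
const std::vector<Edge> konigsberg = {
    {0, 2}, {0, 2},  // two bridges: north bank <-> central island
    {1, 2}, {1, 2},  // two bridges: south bank <-> central island
    {0, 3},          // north bank <-> eastern island
    {1, 3},          // south bank <-> eastern island
    {2, 3},          // central island <-> eastern island
};
```

With this numbering, degrees(4, konigsberg) yields 3, 3, 5 and 3, which matches the earlier observation that every piece of land has an odd number of bridges.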

In his endeavor Euler showed that the possibility of a walk through a graph (city) traversing each edge (bridge) one and only one time is strictly dependent on the degrees of vertices (lands). The path consisting of such edges is called (in his honor) an Euler path. The length of an Euler path is the number of edges. Get ready for some strict language.

An Euler path of a finite undirected graph G(V, E) is a path such that every edge of G appears on it once. If G has an Euler path, then it is called an Euler graph. [1]

Theorem. A finite undirected connected graph is an Euler graph if and only if exactly two vertices are of odd degree or all vertices are of even degree. In the latter case, every Euler path of the graph is a circuit, and in the former case, none is. [1]
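
The theorem translates almost directly into code. The following is a rough sketch, not a reference implementation: it assumes vertices are numbered from 0, stores the graph as an edge list, and uses made-up names (Edge, has_euler_path). It checks connectivity with a depth-first search and then counts the vertices of odd degree.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Edge = std::pair<std::size_t, std::size_t>;

// Returns true if the undirected (multi)graph is connected and has either
// zero or exactly two vertices of odd degree, as the theorem requires.
bool has_euler_path(std::size_t vertex_count, const std::vector<Edge>& edges) {
    if (vertex_count == 0) return false;  // assumption: empty graph counts as "no"

    std::vector<std::vector<std::size_t>> adjacent(vertex_count);
    for (const Edge& e : edges) {
        adjacent[e.first].push_back(e.second);
        adjacent[e.second].push_back(e.first);
    }

    // Depth-first search from vertex 0 to find every reachable vertex.
    std::vector<bool> visited(vertex_count, false);
    std::vector<std::size_t> stack;
    stack.push_back(0);
    visited[0] = true;
    std::size_t reached = 1;
    while (!stack.empty()) {
        std::size_t v = stack.back();
        stack.pop_back();
        for (std::size_t next : adjacent[v]) {
            if (!visited[next]) {
                visited[next] = true;
                ++reached;
                stack.push_back(next);
            }
        }
    }
    if (reached != vertex_count) return false;  // not connected

    // The degree of a vertex is the size of its adjacency list.
    std::size_t odd = 0;
    for (const auto& neighbours : adjacent) {
        if (neighbours.size() % 2 == 1) ++odd;
    }
    return odd == 0 || odd == 2;
}
```

For the seven bridges (degrees 3, 3, 5, 3) the function returns false. Adding one more bridge between two of the degree-3 lands gives degrees 4, 4, 5, 3, and the function returns true.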

I used “Euler path” instead of “Eulerian path” just to be consistent with the referenced book’s [1] definition. If you know someone who differentiates Euler path and Eulerian path, and Euler graph and Eulerian graph, let them know to leave a comment.

First of all, let’s clarify the new terms in the above definition and theorem.

  • Undirected graph - a graph whose edges have no particular direction.

  • Directed graph - a graph in which edges have a particular direction.

  • Connected graph - a graph with no unreachable vertex; there is a path between every pair of vertices.

  • Disconnected graph - a graph with unreachable vertices; at least one pair of vertices has no path between them.

  • Finite graph - a graph with a finite number of nodes and edges.

  • Infinite graph - a graph where an end of the graph in a particular direction(s) extends to infinity.

We’ll discuss some of these terms in the coming paragraphs.

Graphs can be directed or undirected, and that’s one of the interesting properties of graphs. You must’ve seen the popular Facebook vs Twitter example for directed and undirected graphs. A Facebook friendship relation may be easily represented as an undirected graph, because if Alice is friends with Bob, then Bob must be friends with Alice, too. There is no direction; both are friends with each other.

Also note the vertex labeled “Patrick”: it is kind of special (he’s got no friends), as it doesn’t have any incident edges. It is still a part of the graph, but in this case we will say that this graph is not connected; it is a disconnected graph (the same goes for “John”, “Ashot” and “Beth”, as they are interconnected with each other but separated from the others). In a connected graph there is no unreachable vertex; there must be a path between every pair of vertices.

Contrary to the Facebook example, if Alice follows Bob on Twitter, that doesn’t require Bob to follow Alice back. So a “follow” relation must have a direction indicator, showing which vertex (user) has a directed edge (follows) to the other vertex.
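
The difference between the two relations shows up the moment you store an edge. Here is a small sketch, assuming users are identified by name and the graph is kept as a map from a user to the set of users they point to; Graph, add_friendship, add_follow and has_edge are hypothetical names.

```cpp
#include <map>
#include <set>
#include <string>

// Adjacency sets keyed by user name; an entry a -> b means "edge from a to b".
using Graph = std::map<std::string, std::set<std::string>>;

// Undirected edge (friendship): stored in both directions.
void add_friendship(Graph& g, const std::string& a, const std::string& b) {
    g[a].insert(b);
    g[b].insert(a);
}

// Directed edge (follow): stored only from the follower to the followee.
void add_follow(Graph& g, const std::string& follower,
                const std::string& followee) {
    g[follower].insert(followee);
    g[followee];  // make sure the followee exists as a vertex too
}

// True if there is an edge going from `from` to `to`.
bool has_edge(const Graph& g, const std::string& from, const std::string& to) {
    auto it = g.find(from);
    return it != g.end() && it->second.count(to) > 0;
}
```

After add_friendship(g, "Alice", "Bob") the edge exists in both directions, while after add_follow(g, "Alice", "Bob") only the Alice to Bob edge exists.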

Now, knowing what a finite connected undirected graph is, let’s get back to Euler’s graph:

So why did we discuss the Königsberg bridges problem and Euler graphs in the first place? Well, it’s not so boring, and by investigating the problem and the foregoing solution we touched on the elements behind graphs (vertex, edge, directed, undirected) while avoiding a dry theoretical approach. And no, we are not done with Euler graphs and the problem above, yet.

We should now move on to the computer representation of graphs as that is the topic of interest for us programmers. By representing a graph in a computer program, we will be able to devise an algorithm for tracing graph path(s), and therefore find out if it is an Euler path. Before that, try to think of a good application for an Euler graph (besides fiddling around with bridges).

Graph representation: Intro

Now this is quite a tedious task, so be patient. Remember the fight between Arrays and Linked Lists? Use arrays if you need fast element access, use lists if you need fast element insertion/deletion, etc. I hardly believe you ever struggled with something like “how to represent lists”. Well, in the case of graphs the actual representation is really bothersome, because first you have to decide how exactly you are going to represent a graph. And believe me, you are not going to like this. Adjacency list, adjacency matrix, maybe edge lists? Toss a coin.
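
To see what the coin toss is choosing between, here is a minimal sketch of the three representations for an undirected graph with vertices numbered from 0. The type aliases and the two conversion helpers are assumptions for illustration only.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Edge = std::pair<std::size_t, std::size_t>;

// Edge list: just the pairs of connected vertices.
using EdgeList = std::vector<Edge>;

// Adjacency list: for every vertex, the vertices it is connected to.
using AdjacencyList = std::vector<std::vector<std::size_t>>;

// Adjacency matrix: matrix[u][v] is true when u and v share an edge.
using AdjacencyMatrix = std::vector<std::vector<bool>>;

AdjacencyList to_adjacency_list(std::size_t n, const EdgeList& edges) {
    AdjacencyList list(n);
    for (const Edge& e : edges) {
        list[e.first].push_back(e.second);
        list[e.second].push_back(e.first);  // undirected: store both ways
    }
    return list;
}

AdjacencyMatrix to_adjacency_matrix(std::size_t n, const EdgeList& edges) {
    AdjacencyMatrix matrix(n, std::vector<bool>(n, false));
    for (const Edge& e : edges) {
        matrix[e.first][e.second] = true;
        matrix[e.second][e.first] = true;
    }
    return matrix;
}
```

Roughly speaking, the matrix answers “is there an edge between u and v?” in constant time but always needs n * n cells, while the adjacency list only needs space proportional to the number of vertices plus edges and makes it cheap to walk over a vertex’s neighbours.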

You should have tossed hard, because we are starting with a tree. You must have seen a binary tree (or BT for short) at least once (the following is not a binary search tree).

Just because it consists of vertices and edges, it’s a graph. You may also recall how a binary tree is most commonly represented (at least in textbooks).
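
A typical textbook-style representation might look like the sketch below. It is only an illustration: the names BinTreeNode and count_nodes, the std::string payload and the use of std::unique_ptr for ownership are assumptions, not the only way to write it.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// One node of a binary tree: a value plus optional left and right children.
struct BinTreeNode {
    std::string value;
    std::unique_ptr<BinTreeNode> left;
    std::unique_ptr<BinTreeNode> right;

    explicit BinTreeNode(std::string v) : value(std::move(v)) {}
};

// Counts the nodes in the tree rooted at `node` (0 for an empty tree).
std::size_t count_nodes(const BinTreeNode* node) {
    if (node == nullptr) return 0;
    return 1 + count_nodes(node->left.get()) + count_nodes(node->right.get());
}
```

Each node is a vertex, and each non-null child pointer is an edge from a parent to a child, which is why a tree fits the definition of a graph.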

It might seem too basic for people who are already familiar with binary trees, but I still have to illustrate it to make sure we are on the same page (note that we are still dealing with pseudocode).

If you are new to trees, read the pseudocode above carefully, then follow the steps in the illustration below.

A binary tree is a simple “collection” of nodes, each of which has left and right child nodes. A binary search tree is much more useful, as it applies one simple rule which allows fast key lookups. Binary search trees (BST) keep their keys in sorted order. You are free to implement your BT with any rule you want (although it might change its name based on the rule, for instance, min-heap or max-heap). The most important expectation for a BST is that it satisfies the binary search property (that’s where the name comes from). Each node’s key must be greater than any key in its left sub-tree and less than any key in its right sub-tree.

I’d like to point out a very interesting point regarding the statement “greater than” that’s crucial to understanding how BSTs function. Whenever you change the property to “greater than or equal”, your BST will be able to save duplicate keys when inserting new nodes; otherwise it will keep only nodes with unique keys. You can find really good articles on the web about binary search trees. We won’t be providing a full implementation of a binary search tree, but for the sake of consistency, we’ll illustrate a simple binary search tree here.
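
As a rough illustration of the binary search property with the strict “greater than” rule, here is a minimal sketch with integer keys. BstNode, bst_insert and bst_contains are hypothetical names, and the tree is not balanced, so this is far from a full implementation.

```cpp
#include <memory>

struct BstNode {
    int key;
    std::unique_ptr<BstNode> left;
    std::unique_ptr<BstNode> right;

    explicit BstNode(int k) : key(k) {}
};

// Inserts `key`, keeping smaller keys on the left and greater keys on the
// right. Returns false (and changes nothing) if the key is already present.
bool bst_insert(std::unique_ptr<BstNode>& node, int key) {
    if (!node) {
        node.reset(new BstNode(key));
        return true;
    }
    if (key < node->key) return bst_insert(node->left, key);
    if (key > node->key) return bst_insert(node->right, key);
    return false;  // strict "greater than": duplicates are rejected
}

// Follows a single root-to-leaf path, discarding one sub-tree at each step.
bool bst_contains(const BstNode* node, int key) {
    while (node != nullptr) {
        if (key == node->key) return true;
        node = (key < node->key) ? node->left.get() : node->right.get();
    }
    return false;
}
```

Replacing the last comparison in bst_insert with a “greater than or equal” branch that descends to the right would let the tree keep duplicate keys, which is the variation described above.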

Intro to Graph representation and binary trees (Airbnb example)

Trees are very useful data structures. You might not have implemented a tree from scratch in your projects. But you’ve probably used them even without noticing. Let’s look at an artificial yet valuable example and try to answer the “why” question, “Why use a binary search tree in the first place”.

As you’ve noticed, there is a “search” in binary search tree. So basically, everything that needs a fast lookup should be placed in a binary search tree. “Should” doesn’t mean “must”; the most important thing to keep in mind in programming is to solve a problem with the proper tools. There are tons of cases where a simple linked list with its O(N) lookup might be preferable to a BST with its O(logN) lookup.

Typically we would use a library implementation of a BST, most likely std::set or std::map in C++. However in this tutorial we are free to reinvent our own wheel. BSTs are implemented in almost any general-purpose programming language library. You can find them in the corresponding documentation of your favorite language. Approaching a “real-life example”, here’s the problem we’ll try to tackle - Airbnb Home Search.
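
For reference, this is roughly what leaning on the library looks like. std::set and std::map are ordered containers with logarithmic lookups, and the common standard library implementations build them on balanced binary search trees (red-black trees). The helper names has_price and name_for are made up for this sketch.

```cpp
#include <map>
#include <set>
#include <string>

// Returns true if `price` is one of the known prices (O(log N) lookup).
bool has_price(const std::set<int>& prices, int price) {
    return prices.find(price) != prices.end();
}

// Returns the name stored under `id`, or an empty string if there is none.
std::string name_for(const std::map<int, std::string>& names, int id) {
    auto it = names.find(id);
    return it != names.end() ? it->second : std::string();
}
```

Both containers also iterate over their keys in sorted order, which is exactly the property a BST gives us.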

How do we search for homes based on some query with a bunch of filters as fast as possible? This is a hard task. It becomes harder if we consider that Airbnb stores 4 million listings.

So when users search for homes, there is a chance that they might “touch” 4 million records stored in the database. Sure, the results are limited to the “top listings” shown on the website’s home page, and a user is almost never curious “enough” to view millions of listings. I don’t have any analytics regarding Airbnb, but we can use a powerful tool in programming called “assumptions”. So we will assume that a single user finds a good home by viewing at most ~1K homes.

The most important factor here is the number of real-time users, as it makes a difference in data structures and database(s) choices and the project architecture overall. As obvious as it might seem, if there are just 100 users overall, then we may not bother at all.

On the contrary, if the number of users overall and real-time users in particular is far beyond the million threshold, we have to think really wisely about each decision. “Each” is used exactly right, that’s why companies hire the best while striving for excellence in service provision.

Google, Facebook, Airbnb, Netflix, Amazon, Twitter, and many others deal with huge amounts of data, and the right choice for serving millions of bytes of data each second to millions of real-time users starts with hiring the right engineers. That’s why we, the programmers, struggle with these data structures, algorithms and problem solving in interviews: all these companies need is an engineer with the ability to solve such big problems in the fastest and most efficient way possible.

So here’s a use case. A user visits the home page (we’re still talking about Airbnb) and tries to filter out homes to find the best possible fit. How would we deal with this problem? (Note that this problem is rather backend-side, so we won’t care about front-end or the network traffic or https over http or Amazon EC2 over home cluster and so on).

First of all, as we are already familiar with one of the most powerful tools in a programmer’s inventory (talking about assumptions rather than abstractions), let’s start with a few assumptions:

  • We’re dealing with data that completely fits in the RAM.

  • Our RAM is big enough.

Big enough to hold, hmm, how much? Well, that’s a good question. How much memory will be required to store the actual data? If we are dealing with 4 million units of data (again, an assumption), and if we know each unit’s size, then we can easily derive the required memory size, i.e. 4M * sizeof(one_unit).

Let’s consider a “home” object and its properties. Actually, let’s consider at least those properties that we will deal with while solving our problem (a “home” is our unit). We will represent it as a C++ structure in some pseudocode. You can easily convert it to a MongoDB schema object or anything you want. We just discuss the property names and types (try to think about using bitfields or bitsets for space economy).
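
A minimal sketch of what such a structure might look like, limited to the properties discussed in this section. Every field name and width here is an assumption for illustration rather than Airbnb’s real schema.

```cpp
#include <bitset>
#include <cstdint>
#include <string>
#include <vector>

// A hypothetical "home" record; field names and sizes are assumptions.
// Remark: latitude and longitude are left out on purpose, so that we
// don't have to deal with geo-spatial queries here.
struct AirbnbHome {
    std::wstring name;                   // wide string for multilingual titles
    std::uint32_t price = 0;             // price as a whole number
    std::vector<std::string> photo_ids;  // storage IDs rather than full URLs
    std::string host_id;                 // ID of the user who hosts the home
    std::bitset<32> amenities;           // one bit per filterable amenity
    std::bitset<32> house_rules;         // multiple choice, same idea
    std::bitset<32> home_type;           // multiple choice, same idea
    std::uint16_t country_code = 0;      // numeric country code
    std::string city;                    // city name
};
```

Note that sizeof(AirbnbHome) only covers the fixed part of the object; the characters of the strings and the elements of the vector live on the heap, which is why the estimate that follows counts them separately.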

The above structure is not perfect (obviously) and there are many assumptions and/or incomplete parts. I just looked at Airbnb’s filters and devised property lists that should exist to satisfy search queries. It’s just an example.

Now we should calculate how many bytes of memory each AirbnbHome object will take.

  • Home name - name is a wstring to support multilingual names/titles, which means each character will take 2 bytes (we might not have to bother with character size in another language, but in C++ char is a 1-byte character and we assume wchar_t is a 2-byte character; its actual size depends on the platform). A quick look at Airbnb’s listings allows us to assume that the name of a home takes up to 100 characters (though mostly it is around 50 rather than 100). We’ll assume 100 characters as a maximum value, which leads to ~200 bytes of memory. uint is 4 bytes, uchar is 1 byte, ushort is 2 bytes (again, in our assumptions).

  • Photos - Photos are residing within some storage service, like Amazon S3 (as far as I know, this assumption is most likely to be true for Airbnb, but again, Amazon S3 is just an assumption)

  • Photo URLs - We have those photo URLs. There is no standard size limit on URLs, but there is a well-known practical limit of 2083 characters, so we’ll take it as the max size of any URL. Taking into account that each home has 5 photos on average, they would take up to ~10 KB.

  • Photo IDs - Let’s rethink this. Usually storage services serve content under the same base URL, like http(s)://s3.amazonaws.com/<bucket>/<object>, i.e. there is a common pattern for constructing URLs and we need to store only the actual photo ID. Let’s say we use some unique ID generator which returns a 20-byte unique string ID for photo objects, so the URL pattern for a particular photo looks like https://s3.amazonaws.com/some-know-bucket/<unique-photo-id>. This gives us good space efficiency: for storing the string IDs of five photos we will need only 100 bytes of memory.

  • Host ID - The same “trick” (above) could be done with the host_id, i.e. the ID of the user who hosts the home, which takes 20 bytes of memory (actually we could just use integer IDs for users, but considering that some DB systems like MongoDB have a rather specific unique ID generator, we’re assuming a 20-byte string ID as some “median” value which fits into almost any DB system with a little change. Mongo’s ID is 24 characters long when written as a string). And finally, we’ll take a bitset of up to 32 bits in size as a 4-byte object and a bitset of between 32 and 64 bits as an 8-byte object. Mind the assumptions. We used a bitset in this example for any property that expresses an enum but is able to take more than one value, in other words a kind of multiple choice checkbox.

  • Amenities - Each Airbnb home keeps a list of available amenities, e.g. “iron”, “washer”, “tv”, “wifi”, “hangers”, “smoke detector” and even “laptop friendly workspace” and so on. There might be more than 20 amenities; we stick to 20 just because that’s the number of filterable amenities on the Airbnb filters page. A bitset saves us some good space if we keep a proper ordering of amenities. For instance, if a home has all the above-mentioned amenities (see the checked ones in the screenshot), we will just set the bit at the corresponding position in the bitset.

For example, checking if a home has a “washer”:
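
A minimal sketch of such a check, assuming the amenities are packed into a 32-bit mask and that “washer” sits at bit 1. Both the bit position and the names kWasherBit and has_washer are made up for illustration.

```cpp
#include <cstdint>

// Assumed bit position of the "washer" amenity within the amenities mask.
const std::uint32_t kWasherBit = 1u << 1;

// True if the "washer" bit is set in the home's amenities mask.
bool has_washer(std::uint32_t amenities) {
    return (amenities & kWasherBit) != 0;
}
```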

Or a little more professionally:
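
One way to read “more professionally”, sketched here as an assumption: give every amenity a named position in an enum and keep the flags in a std::bitset. The enumerator order and the helper names (Amenity, Amenities, add_amenity, has_amenity, matches_filter) are hypothetical.

```cpp
#include <bitset>
#include <cstddef>

// Assumed ordering of amenities; each enumerator is a bit position.
enum class Amenity : std::size_t {
    Iron,
    Washer,
    Tv,
    Wifi,
    Hangers,
    SmokeDetector,
    LaptopFriendlyWorkspace,
    Count  // number of amenities listed above
};

using Amenities = std::bitset<static_cast<std::size_t>(Amenity::Count)>;

// Marks the amenity as available.
void add_amenity(Amenities& set, Amenity a) {
    set.set(static_cast<std::size_t>(a));
}

// True if the amenity is available.
bool has_amenity(const Amenities& set, Amenity a) {
    return set.test(static_cast<std::size_t>(a));
}

// True if the home offers every amenity requested in the `wanted` filter.
bool matches_filter(const Amenities& home, const Amenities& wanted) {
    return (home & wanted) == wanted;
}
```

The last helper hints at why bitsets are handy for filtering: checking a whole set of checkboxes against a home is a single bitwise AND and a comparison.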

You can improve the code as much as you want (and fix compile errors). We just wanted to emphasize the idea behind bitsets in this problem context.

  • House rules, Home Type - The same idea (that we implemented for the amenities field) goes with “house rules”, “home type” and others.

  • Country code, City name - Finally, the country code and city name. As mentioned in the comments of the code above (see remarks), we won’t store latitude and longitude, to avoid geo-spatial queries (a subject of another article). Instead, we save the country code and city name to narrow down the search by location (omitting streets for the sake of simplicity, please forgive me). A country code could be represented as 2 characters, 3 characters or 3 digits; we’ll save the numeric representation and use a ushort for it. (Un)fortunately there are many more cities than countries, so we can’t use a “city code” (though we can make one for internal use). Instead we’ll store the actual city name, reserving 50 bytes on average for a city name. For super-specific cases like Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu (85 letter city), we’d better use an additional boolean variable which indicates that this is that specific super-long city (don’t try to pronounce it). So, keeping in mind the memory overhead of strings and vectors, we’ll add an additional 32 bytes (just in case) to the final size of the struct. We will also assume that we work on a 64-bit system, although we chose very compact values for int and short.

So, 420+ bytes plus an overhead of 32 bytes gives 452 bytes, and considering the fact that some of you might just be obsessed with the alignment factor, let’s round it up to 500 bytes. So each “home” object takes up to 500 bytes, and for all home listings (there could be some confusion between the listings count and the actual home count, just let me know if I got something wrong), 500 bytes * 4 million = 1.86 GB ~ 2 GB. Seems plausible. We made many assumptions while constructing the struct to make it cheaper to keep in memory. I really expected much more than 2 gigabytes, and if I made a mistake in the calculations, let me know. Anyway, moving forward: whatever we are going to do with this data, we will need at least 2 GB of memory. If you got bored, deal with it. We are just starting.
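The estimate above is easy to double-check in code. This is only a sketch of the arithmetic: the constants are the figures assumed in the text (500 bytes per home, 4 million listings), the names are made up for the example, and a gigabyte is taken as 1024^3 bytes.

```cpp
#include <cstdint>

// Figures assumed in the text: ~500 bytes per home, 4 million listings.
constexpr std::uint64_t kBytesPerHome = 500;
constexpr std::uint64_t kHomeCount = 4000000;

// Total size of all home objects, in bytes.
constexpr std::uint64_t TotalBytes() { return kBytesPerHome * kHomeCount; }

// Converts bytes to gigabytes, with 1 GB taken as 1024^3 bytes.
constexpr double ToGigabytes(std::uint64_t bytes) {
  return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}
```

With these numbers `ToGigabytes(TotalBytes())` comes out at roughly 1.86, which is where the "about 2 GB" figure comes from.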

Now the hardest part of the task. Choosing the right data structure for this problem (filtering the listings as efficiently as possible) is not the hardest part. The hardest part (for me) is searching listings by a bunch of filters at once. If there were just one search key (just one filter), we would solve it easily. Suppose the only thing users care about is the price, so all we need is to find AirbnbHome objects with prices falling in the provided range. If we use a binary search tree for that, here’s how it might look.
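In code, a price-keyed BST with a range query could be sketched roughly as follows. This is not the article's own listing: `Home` is a hypothetical, trimmed-down stand-in for the AirbnbHome struct, and `PriceNode`, `Insert` and `CollectInRange` are names invented for the example. The tree is not balanced, which a real implementation would have to take care of.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for AirbnbHome: only the fields this sketch needs.
struct Home {
  unsigned long long id;
  unsigned price;
};

struct PriceNode {
  Home home;
  std::unique_ptr<PriceNode> left;   // homes with a smaller price
  std::unique_ptr<PriceNode> right;  // homes with an equal or larger price
  explicit PriceNode(const Home& h) : home(h) {}
};

// Walks down from the root to the first empty slot and puts the home there.
void Insert(std::unique_ptr<PriceNode>& root, const Home& home) {
  std::unique_ptr<PriceNode>* slot = &root;
  while (*slot) {
    slot = home.price < (*slot)->home.price ? &(*slot)->left : &(*slot)->right;
  }
  slot->reset(new PriceNode(home));
}

// In-order traversal that skips sub-trees lying entirely outside [low, high].
// Appends the matching home IDs to out in ascending price order.
void CollectInRange(const PriceNode* node, unsigned low, unsigned high,
                    std::vector<unsigned long long>& out) {
  if (node == nullptr) return;
  if (low < node->home.price) CollectInRange(node->left.get(), low, high, out);
  if (low <= node->home.price && node->home.price <= high) {
    out.push_back(node->home.id);
  }
  if (node->home.price <= high) CollectInRange(node->right.get(), low, high, out);
}
```

Calling `CollectInRange(root.get(), 80, 162, ids)` fills `ids` with the homes priced from $80 to $162, cheapest first.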

If you imagine all 4 million objects, this tree grows very, very big. By the way, the memory overhead grows as well, just because we used a BST to store the objects. Each tree node has two additional pointers, to its left and right children, which adds up to 8 additional bytes per child pointer (assuming a 64-bit system). For 4 million nodes that sums up to 4 million * 2 * 8 bytes = ~61 MB, which looks quite small in comparison to 2 GB of object data, though it is not something that we can “omit” easily.

The tree in the last illustration shows that any item can be found in O(logN) time. If you aren’t familiar with big-O notation, or aren’t confident enough to chit-chat in big-ohs, we’ll clarify it below; otherwise skip the complexity subsection.

Algorithmic complexity - Let’s make this quick, as there will be a long and detailed explanation in an upcoming article: “Algorithmic Complexity and Software Performance: The Missing Manual”. In most cases, finding the big-O complexity of an algorithm is fairly easy. The first thing to note is that we always consider the worst case, i.e. the maximum number of operations that an algorithm performs to produce an outcome (to solve the problem).

Suppose an array has 100 elements in unsorted order. How many comparisons would it take to find an element (also taking into account that the required element could be missing)? It will take up to 100 comparisons, as we have to compare each element’s value with the value we are looking for. And despite the fact that the element might be the first one in the array (leading to a single comparison), we consider only the worst possible case (the element is either missing or resides at the last position of the array).

The point of “calculating” algorithmic complexity is finding a dependency between the number of operations and the size of the input. For instance, the array above had 100 elements and the number of operations was also 100; if the number of array elements (its input) increases to 1423, the number of operations needed to find any element also increases to 1423 (in the worst case). So the relationship between input size and number of operations is clear in this case: it is so-called linear, meaning the number of operations grows as much as the array’s input grows. Growth. That’s the key point in complexity. We say that searching for an element in an unsorted array takes O(N) time to emphasize that the process of finding it will take up to N operations (or even up to N operations times some constant value, such as 3N). On the other hand, accessing any element of an array by its index takes constant time, i.e. O(1). That’s because of the array’s structure. It’s a contiguous data structure that holds elements of the same type (JS arrays are a different story), so “jumping” to a particular element requires only calculating its position relative to the array’s first element.
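To make the "count the operations" idea concrete, here is a small sketch of a linear search that reports how many comparisons it made. The `SearchResult` struct and the function name are invented for this example.

```cpp
#include <cstddef>
#include <vector>

// Where the value was found (-1 if missing) and how many comparisons it took.
struct SearchResult {
  long index;
  std::size_t comparisons;
};

// Checks the elements one by one: O(N) comparisons in the worst case.
SearchResult LinearSearch(const std::vector<int>& items, int target) {
  std::size_t comparisons = 0;
  for (std::size_t i = 0; i < items.size(); ++i) {
    ++comparisons;
    if (items[i] == target) {
      return SearchResult{static_cast<long>(i), comparisons};
    }
  }
  return SearchResult{-1, comparisons};
}
```

For a 100-element array the best case is a single comparison (the target sits at index 0), while a missing target or one stored in the last slot costs all 100. Reading `items[42]` directly, by contrast, involves no search at all.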

One thing is very clear: a binary search tree keeps its nodes in sorted order. So what would be the algorithmic complexity of searching for an element in a binary search tree? We should calculate the number of operations required to find an element (in the worst case).

Look at the illustration above. When we start our search at the root, the first comparison may lead to one of three cases:

  1. The node is found.
  2. The search continues in the node’s left sub-tree if the required element is less than the node’s value.
  3. The search continues in the node’s right sub-tree if the required element is greater than the node’s value.

At each step we cut the number of nodes that need to be considered in half. The number of operations (i.e. comparisons) needed to find an element in the BST is at most the height of the tree. The height of a tree is the number of nodes on the longest path from the root to a leaf. In this case it’s 4. For a balanced tree with N nodes the height is log2(N) + 1 (rounded down), as shown. So the complexity of search is O(logN + 1) = O(logN). This means that searching for something among 4 million nodes requires log2(4,000,000) = ~22 comparisons in the worst case.
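The "~22 comparisons" figure can be reproduced with a few lines of integer arithmetic. The sketch below assumes a perfectly balanced tree and counts the height in nodes, as the text does; `BalancedHeight` is a name made up for the example.

```cpp
#include <cstdint>

// Height (counted in nodes) of a perfectly balanced binary tree with n nodes:
// floor(log2(n)) + 1 for n >= 1, and 0 for an empty tree.
inline unsigned BalancedHeight(std::uint64_t n) {
  unsigned height = 0;
  while (n > 0) {
    ++height;  // one more level is needed
    n /= 2;    // each level holds about half of the remaining nodes
  }
  return height;
}
```

`BalancedHeight(15)` gives 4, matching a full four-level tree like the one described above, and `BalancedHeight(4000000)` gives 22.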

Back to the tree - Element access time in a binary search tree is O(logN). Why not use hashtables? Hashtables have constant access time on average, which makes it reasonable to use hashtables almost everywhere.

In this problem we must take into account an important requirement: we must be able to make range searches, e.g. homes with prices from $80 to $162. In the case of a BST, it’s easy to get all the nodes in a range just by doing an in-order traversal of the tree and keeping a counter. For a hashtable a range search is somewhat expensive, because the keys are stored in no particular order, which makes it reasonable to stick with BSTs in this case.
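The contrast can be sketched with standard containers. As a stand-in for a hand-written BST this example uses `std::multimap` (an ordered container, typically implemented as a balanced tree), and `std::unordered_multimap` plays the hashtable. The function names and the `HomeId` alias are assumptions made for the example.

```cpp
#include <cstdint>
#include <map>
#include <unordered_map>
#include <vector>

using HomeId = std::uint64_t;

// Ordered container: jump to the first price >= low, then walk in order
// until the price exceeds high. Only the matching entries are visited.
std::vector<HomeId> RangeFromTree(const std::multimap<unsigned, HomeId>& by_price,
                                  unsigned low, unsigned high) {
  std::vector<HomeId> ids;
  if (low > high) return ids;  // empty range
  const auto last = by_price.upper_bound(high);
  for (auto it = by_price.lower_bound(low); it != last; ++it) {
    ids.push_back(it->second);
  }
  return ids;
}

// Hash container: keys have no order, so every entry has to be inspected.
std::vector<HomeId> RangeFromHash(
    const std::unordered_multimap<unsigned, HomeId>& by_price,
    unsigned low, unsigned high) {
  std::vector<HomeId> ids;
  for (const auto& entry : by_price) {
    if (low <= entry.first && entry.first <= high) ids.push_back(entry.second);
  }
  return ids;
}
```

Both functions return the same homes, but the tree version touches only the entries inside the range (plus O(logN) steps to find its start), while the hash version scans all of them.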

There is another point, though, which leads us to rethink hashtables: the density. Prices won’t go up “forever”; most of the homes sit in the same price range. Look at the screenshot: the histogram shows us the real picture of the prices. Millions of homes fall in the same range (roughly $18 to $212) and cluster around the same average price. Simple arrays may play a good role here. Using the index of an array as the price and the value as the list of homes with that price, we can access any price range in constant time (well, almost constant). Here’s how it looks (way abstract):

Just like with a hashtable, we access each set of homes by its price. All homes having the same price are grouped under a separate BST. It will also save us some space if we store home IDs instead of the whole object defined above (the AirbnbHome struct). The most likely scenario is to keep all full home objects in a hashtable mapping home ID to the full home object, and to keep another hashtable (or better, an array) which maps prices to home IDs.

So when users request a price range, we fetch the home IDs from the price table, cut the results to a fixed size (i.e. pagination; usually around 10 - 30 items are shown on one page) and then fetch the full home objects using each home ID.
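The lookup flow described above (price table, then pagination, then full objects by ID) could be sketched like this. Everything here is illustrative: `Home` is a hypothetical stand-in for the AirbnbHome struct, `PriceIndex` is an invented name, and each price slot holds a plain vector of IDs instead of the per-price BST mentioned earlier, purely to keep the example short.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using HomeId = std::uint64_t;

// Hypothetical stand-in for the full AirbnbHome struct.
struct Home {
  HomeId id;
  unsigned price;
};

class PriceIndex {
 public:
  explicit PriceIndex(unsigned max_price)
      : ids_by_price_(static_cast<std::size_t>(max_price) + 1) {}

  void Add(const Home& home) {
    if (home.price >= ids_by_price_.size()) return;  // outside the assumed range
    homes_[home.id] = home;
    ids_by_price_[home.price].push_back(home.id);
  }

  // Returns at most page_size homes priced in [low, high], cheapest first.
  std::vector<Home> Search(unsigned low, unsigned high,
                           std::size_t page_size) const {
    std::vector<Home> page;
    for (std::size_t price = low;
         price <= high && price < ids_by_price_.size(); ++price) {
      for (HomeId id : ids_by_price_[price]) {
        if (page.size() == page_size) return page;  // the page is full
        page.push_back(homes_.at(id));              // ID -> full object
      }
    }
    return page;
  }

 private:
  std::vector<std::vector<HomeId>> ids_by_price_;  // index = price
  std::unordered_map<HomeId, Home> homes_;         // home ID -> full object
};
```

A call such as `index.Search(80, 162, 30)` then returns the first page of up to 30 homes in that price range.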

Just keep another thing in mind (think of it in the background). Balancing is crucial for a BST, because it’s the only guarantee that tree operations are done in O(logN). The problem with an unbalanced BST is obvious when you insert elements in sorted order: eventually the tree becomes just a linked list, which obviously leads to linear-time operations. Forget this for now and suppose all our trees are perfectly balanced. Take a look at the illustration above once again. Each array element represents a big tree. What if we change the illustration to something like this:

It resembles a “more realistic” graph. This illustration shows how data structures and graphs can hide in disguise, which takes us to the next section.

Graph representation: Outro

The bad news about graphs is that there isn’t a single definition for the graph representation. That’s why you can’t find a std::graph in the library. We already had a chance to represent a “special” graph called a BST. The point is, every tree is a graph, but not every graph is a tree. The last illustration shows us that we have a lot of trees under a single abstraction, “prices vs homes”, and some of the vertices “differ” in their type: prices are graph nodes that hold only the price value and refer to the whole tree of home IDs (home vertices) that match the particular price. It is more like a hybrid data structure than the simple graph that we’re used to seeing in textbook examples.

That’s the key point in graph representation: there isn’t a fixed, “de jure” structure for representing a graph (unlike BSTs, with their conventional node-based representation with left/right child pointers, though you can represent a BST with a single array as well). You can represent a graph in whichever way is most convenient for the particular problem; the main thing is that you “see” it as a graph. And by “seeing a graph” we mean applying algorithms that are specific to graphs.

What about an N-ary tree? It is more likely to resemble a graph.

And the first thing that comes to mind to represent an N-ary tree node is something like this:

This structure represents just a single node of a tree. The full tree looks more like this:

This class is an abstraction around a single tree node named root_. That’s all we need to build a tree of any size; it is the starting point of the tree. To add a new tree node, we need to allocate memory for it and attach it to the tree, starting from the root.
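A minimal sketch of such a node and of the class built around `root_` might look as follows. Only the `root_` member name comes from the text; `NTreeNode`, `NTree`, `AddChild` and `Size` are assumed names, and the node simply holds a string value.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// A single node of an N-ary tree: a value plus any number of children.
struct NTreeNode {
  std::string value;
  std::vector<std::unique_ptr<NTreeNode>> children;
  explicit NTreeNode(std::string v) : value(std::move(v)) {}
};

// The tree itself is just an abstraction around one node, root_.
class NTree {
 public:
  explicit NTree(std::string root_value)
      : root_(new NTreeNode(std::move(root_value))) {}

  NTreeNode* root() { return root_.get(); }

  // Allocates a new node, attaches it under parent and returns it.
  static NTreeNode* AddChild(NTreeNode* parent, std::string value) {
    parent->children.push_back(
        std::unique_ptr<NTreeNode>(new NTreeNode(std::move(value))));
    return parent->children.back().get();
  }

  // Counts all nodes reachable from the root.
  std::size_t Size() const { return Count(root_.get()); }

 private:
  static std::size_t Count(const NTreeNode* node) {
    std::size_t total = 1;
    for (const auto& child : node->children) total += Count(child.get());
    return total;
  }

  std::unique_ptr<NTreeNode> root_;
};
```

Each node owns its children through `std::unique_ptr`, so destroying the tree object releases every node. That ownership model works for trees precisely because there are no cycles.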

A graph is much like an N-ary tree, with a slight difference. Try to spot it.

Is this a graph? No. I mean yes, but it’s the same N-ary tree from the previous illustration, just rotated a little. As a rule of thumb, whenever you see a tree (even if it is an apple tree, a lemon tree or a binary search tree), you can be sure that it is also a graph. So, devising a structure for a graph node (graph vertex), we can come up with the same structure:

Is this enough to construct a graph? Well, no. And here’s why. Look at these two graphs from the previous illustrations and find the difference:

The graph in the illustration on the left has no single point of “entry” (it’s a forest rather than a single tree); in contrast, the graph in the illustration on the right doesn’t have unreachable vertices. Sounds familiar.

A graph is connected when there is a path between every pair of vertices. [Wikipedia]

Obviously, there isn’t a path between every pair of vertices in the “prices vs homes” graph (if it isn’t obvious from the illustration, just assume that prices are not connected with each other). Although it’s just an example to show that we aren’t able to construct every graph with a single GraphNode struct, there are cases where we have to deal with disconnected graphs like this. Take a look at this class:

Just like an N-ary tree is built around a single node (the root node), a connected graph can also be built around a root node. It’s said that trees are “rooted”, i.e. they have a starting point. A connected graph can be represented like a rooted tree (with a couple more properties). That’s already obvious, but keep in mind that the actual representation may differ from algorithm to algorithm and from problem to problem, even for a connected graph. However, considering the node-based nature of graphs, a disconnected graph can be represented like this:
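One way to sketch both ideas in code is shown below. Only the `GraphNode` name appears in the text; `DisconnectedGraph`, `AddRoot`, `AddNeighbor` and `CountReachable` are assumptions made for this example. Because graph nodes may form cycles, the graph owns all nodes in a pool and the adjacency links are plain non-owning pointers. A connected graph is then simply the special case with exactly one root.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

struct GraphNode {
  std::string value;
  std::vector<GraphNode*> adjacent;  // non-owning links; cycles are allowed
  explicit GraphNode(std::string v) : value(std::move(v)) {}
};

// Counts the nodes reachable from start with an iterative depth-first walk.
std::size_t CountReachable(const GraphNode* start) {
  std::unordered_set<const GraphNode*> seen;
  std::vector<const GraphNode*> stack{start};
  while (!stack.empty()) {
    const GraphNode* node = stack.back();
    stack.pop_back();
    if (!seen.insert(node).second) continue;  // already visited
    for (const GraphNode* next : node->adjacent) stack.push_back(next);
  }
  return seen.size();
}

// Keeps one entry node per connected component (a "forest" of roots).
class DisconnectedGraph {
 public:
  // Creates a node that starts a new component.
  GraphNode* AddRoot(std::string value) {
    roots_.push_back(NewNode(std::move(value)));
    return roots_.back();
  }

  // Creates a node joined to an existing one by an undirected edge.
  GraphNode* AddNeighbor(GraphNode* existing, std::string value) {
    GraphNode* node = NewNode(std::move(value));
    existing->adjacent.push_back(node);
    node->adjacent.push_back(existing);
    return node;
  }

  const std::vector<GraphNode*>& roots() const { return roots_; }
  std::size_t NodeCount() const { return pool_.size(); }

 private:
  GraphNode* NewNode(std::string value) {
    pool_.push_back(std::unique_ptr<GraphNode>(new GraphNode(std::move(value))));
    return pool_.back().get();
  }

  std::vector<std::unique_ptr<GraphNode>> pool_;  // owns every node
  std::vector<GraphNode*> roots_;                 // one entry per component
};
```

In the "prices vs homes" picture each price would be a root and its homes would be the neighbors, so a traversal started at one price never reaches the homes of another.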

For graph traversals like DFS/BFS it’s natural to use a tree-like representation. Helps a lot. However, cases like efficient path tracing require a different representation. Remember Euler’s graph? To track down a graph’s “eulerness”, we should trace an Euler path within it. That means traversing every edge exactly once (visiting vertices as many times as needed); if the tracing finishes and we still have untraversed edges, then the graph doesn’t have an Euler path, and therefore is not an Euler graph.

There is an even faster method: we can check the degrees of the vertices (suppose each vertex stores its degree) and, just as the definition says, if a graph has vertices of odd degree and there aren’t exactly two of them, then it is not an Euler graph. The complexity of such a check is O(|V|), where |V| is the number of graph vertices. We can even keep track of the number of odd-degree vertices while inserting new edges, which brings the check down to O(1). Lightning fast. Never mind, we’re just going to trace a graph, that’s it. Below is both the representation of a graph and the Trace() function returning a path.
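The original listing is not reproduced here, so what follows is a reconstruction under assumptions rather than the author's code. It keeps the pieces the text names (the `VELOGraph` class, a table from vertex label to incident edges, an edge list with a flag, and `Trace()`), but the tracing itself uses Hierholzer's algorithm, which I swapped in because a naive "follow any unused edge" walk can get stuck even when an Euler path exists. `AddEdge` and `OddDegreeCount` are invented names, and the graph is treated as undirected.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// VELOGraph: Vertex Edge Label Only Graph. Vertices are just string labels.
class VELOGraph {
 public:
  // Adds an undirected edge; parallel edges are allowed.
  void AddEdge(const std::string& a, const std::string& b) {
    edges_.push_back(Edge{a, b, false});
    incident_[a].push_back(edges_.size() - 1);
    incident_[b].push_back(edges_.size() - 1);
  }

  // The O(|V|) degree check: number of vertices with an odd degree.
  std::size_t OddDegreeCount() const {
    std::size_t odd = 0;
    for (const auto& entry : incident_) odd += entry.second.size() % 2;
    return odd;
  }

  // Returns the vertex labels along an Euler path, or an empty vector if the
  // graph has none.
  std::vector<std::string> Trace() {
    std::vector<std::string> path;
    const std::size_t odd = OddDegreeCount();
    if (edges_.empty() || (odd != 0 && odd != 2)) return path;

    // Start at an odd-degree vertex if there is one, otherwise anywhere.
    std::string start = edges_.front().from;
    for (const auto& entry : incident_) {
      if (entry.second.size() % 2 == 1) { start = entry.first; break; }
    }

    std::unordered_map<std::string, std::size_t> cursor;  // next edge to try
    std::vector<std::string> stack{start};
    while (!stack.empty()) {
      const std::string v = stack.back();
      const std::vector<std::size_t>& incident = incident_.at(v);
      std::size_t& i = cursor[v];
      while (i < incident.size() && edges_[incident[i]].traversed) ++i;
      if (i == incident.size()) {  // no unused edges left at v
        path.push_back(v);
        stack.pop_back();
      } else {
        Edge& edge = edges_[incident[i]];
        edge.traversed = true;
        stack.push_back(edge.from == v ? edge.to : edge.from);
      }
    }

    for (Edge& edge : edges_) edge.traversed = false;    // reset the flags
    if (path.size() != edges_.size() + 1) path.clear();  // unreachable edges
    std::reverse(path.begin(), path.end());
    return path;
  }

 private:
  struct Edge {
    std::string from;
    std::string to;
    bool traversed;  // used only by Trace()
  };

  std::vector<Edge> edges_;                                             // all edges
  std::unordered_map<std::string, std::vector<std::size_t>> incident_;  // label -> edge indices
};
```

For the seven bridges of Königsberg (four land masses, seven edges) `OddDegreeCount()` is 4, so `Trace()` returns an empty path without walking anything.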

Mind the bugs, bugs are everywhere. This code contains a lot of assumptions. For instance, the labeling: by a vertex we mean a string label. Sure, you can easily update it to be anything you want; it doesn’t matter in the context of this example. Next, the naming. As mentioned in the comments, VELOGraph stands for Vertex Edge Label Only Graph (I made this up). The point is, this graph representation contains a table mapping a vertex label to the edges incident to that vertex, and a list of edges, each containing a pair of vertices (connected by that particular edge) and a flag which is used only by the Trace() function. Take a look at the Trace() function implementation. It uses the edge’s flag to mark an already traversed edge (flags should be reset after any Trace() call).

Twitter Example: Tweet Delivery Problem

Here’s another representation called an adjacency matrix, which can be useful for directed graphs, like the one we used for the Twitter follower graph.

There are 8 vertices in this Twitter example. So all we need to represent this graph is a |V|x|V| square matrix (|V| rows and |V| columns). If there is a directed edge from v to u, then the matrix’s [v][u] element is true, otherwise it’s false.

As you can see, this matrix is way too sparse; what we get in exchange is fast access. To see if Patrick follows Sponge Bob, we just check the value of matrix["Patrick"]["Sponge Bob"]. To get the list of Ann’s followers, we just process the entire “Ann” column (its title is in yellow). To find who is being followed (sounds strange) by Sponge Bob, we process the entire “Sponge Bob” row. An adjacency matrix can be used for undirected graphs as well: instead of setting a single 1 when there is an edge from v to u, we set both values to 1, e.g. adj_matrix[v][u] = 1, adj_matrix[u][v] = 1. An undirected graph’s adjacency matrix is symmetric.
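A sketch of such a matrix for the follower example is shown below. Since a C++ matrix cannot be indexed by name directly, the example assumes a lookup table from user name to row/column index. The class and method names are invented for the illustration.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

class FollowMatrix {
 public:
  explicit FollowMatrix(const std::vector<std::string>& users)
      : names_(users),
        matrix_(users.size(), std::vector<bool>(users.size(), false)) {
    for (std::size_t i = 0; i < users.size(); ++i) index_[users[i]] = i;
  }

  // Directed edge: follower -> followee. Unknown names make at() throw.
  void Follow(const std::string& follower, const std::string& followee) {
    matrix_[index_.at(follower)][index_.at(followee)] = true;
  }

  // A single cell lookup, the matrix["Patrick"]["Sponge Bob"] of the text.
  bool Follows(const std::string& follower, const std::string& followee) const {
    return matrix_[index_.at(follower)][index_.at(followee)];
  }

  // Scans the user's column: everyone who follows this user.
  std::vector<std::string> FollowersOf(const std::string& user) const {
    const std::size_t column = index_.at(user);
    std::vector<std::string> result;
    for (std::size_t row = 0; row < names_.size(); ++row) {
      if (matrix_[row][column]) result.push_back(names_[row]);
    }
    return result;
  }

  // Scans the user's row: everyone this user follows.
  std::vector<std::string> FollowedBy(const std::string& user) const {
    const std::size_t row = index_.at(user);
    std::vector<std::string> result;
    for (std::size_t column = 0; column < names_.size(); ++column) {
      if (matrix_[row][column]) result.push_back(names_[column]);
    }
    return result;
  }

 private:
  std::vector<std::string> names_;                      // index -> name
  std::vector<std::vector<bool>> matrix_;               // [follower][followee]
  std::unordered_map<std::string, std::size_t> index_;  // name -> index
};
```

The single-cell check is O(1), while both scans are O(|V|) no matter how few edges the row or column actually contains.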

Note that instead of storing ones and zeroes in an adjacency matrix, we can store something “more useful”, like edge weights. One of the best examples might be a graph of places with distance information.

The graph above represents distances between the houses of Patrick, Sponge Bob and others (this is also known as a weighted graph). We put “infinity” signs where there isn’t a direct route between vertices. That doesn’t mean that there are no routes at all, and at the same time it doesn’t mean that there must necessarily be routes. That is determined by applying an algorithm for finding a route between vertices (there is another way to store vertices and the edges incident to them, called an incidence matrix).

While an adjacency matrix seemed a good fit for Twitter’s follows graph, keeping a square matrix for nearly 300 million users (monthly active users) takes 300 million * 300 million * 1 byte (storing boolean values). That is ~82,000 TB (terabytes), or 1024 * 82,000 GB. Well, I don’t know about your home cluster, but my laptop doesn’t have that much RAM. Bitsets? A BitBoard could help us a little, reducing the required size to ~10,000 TB. Still way too big. As mentioned above, an adjacency matrix is too sparse. It forces us to use more space than is actually needed. That’s why using a list of edges incident to vertices may be useful. The point is, an adjacency matrix keeps both the “follows” and “doesn’t follow” information, while all we need to know is the follows, something like this:
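As a quick aside, the matrix sizes quoted in this paragraph can be re-derived with a few constants. The sketch assumes 300 million users and takes a terabyte as 1024^4 bytes. The names are made up for the example.

```cpp
#include <cstdint>

constexpr std::uint64_t kUsers = 300000000ULL;  // ~300 million active users
constexpr std::uint64_t kBytesPerTerabyte = 1024ULL * 1024 * 1024 * 1024;

// One byte (a bool) per matrix cell. 9e16 still fits in 64 bits.
constexpr std::uint64_t MatrixBytesAsBools() { return kUsers * kUsers; }

// One bit per matrix cell, i.e. a bitset.
constexpr std::uint64_t MatrixBytesAsBits() { return kUsers * kUsers / 8; }

constexpr double ToTerabytes(std::uint64_t bytes) {
  return static_cast<double>(bytes) / static_cast<double>(kBytesPerTerabyte);
}
```

`ToTerabytes(MatrixBytesAsBools())` is a little under 82,000 and the bitset variant a little over 10,000, in line with the two figures in the text.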

The illustration on the right is called an adjacency list. Each list describes the set of neighbors of a vertex in the graph. By the way, the actual implementation of a graph as an adjacency list, again, differs from case to case (ridiculous facts). In the illustration we highlighted the use of a hashtable, which is reasonable, as access to any vertex will be O(1). For the list of neighbor vertices we didn’t specify the exact data structure; it can be anything from linked lists to vectors. The choice is yours.

The point is, to find out whether Patrick follows Liz, we access the hashtable (constant time) and go through all the items in Patrick’s list, comparing each element with the “Liz” element (linear time). Linear time isn’t that bad here, because we have to loop only over the vertices adjacent to “Patrick”, not over all users. What about the space complexity, is it OK to use at Twitter? Well, we need at least 300 million hashtable records, each of which points to a vector (choosing a vector to avoid the memory overhead of a linked list’s next/previous pointers) containing, how many elements? No stats here, I found just the average number of Twitter followers, 707 (googled).
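A minimal adjacency-list sketch of this lookup, assuming numeric user IDs instead of names. `FollowGraph`, `Follow` and `Follows` are names invented for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

using UserId = std::uint64_t;

class FollowGraph {
 public:
  // Records that follower follows followee.
  void Follow(UserId follower, UserId followee) {
    following_[follower].push_back(followee);
  }

  // Hashtable access (constant time on average), then a linear scan over
  // this one user's list only.
  bool Follows(UserId follower, UserId followee) const {
    const auto it = following_.find(follower);
    if (it == following_.end()) return false;
    return std::find(it->second.begin(), it->second.end(), followee) !=
           it->second.end();
  }

 private:
  std::unordered_map<UserId, std::vector<UserId>> following_;  // user -> followed users
};
```

The cost of `Follows` depends on how many accounts that one user follows, not on the total number of users.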

So let’s consider that each hashtable record points to an array of 707 user IDs (each weighing 8 bytes), and let’s assume that the hashtable’s only overhead is its keys, which are, again, user IDs, so the hashtable itself takes 300 million * 8 bytes. Overall, we have 300 million * 8 bytes for the hashtable + 707 * 8 bytes for each hashtable key. Added up, that is 300 million * (8 + 707 * 8) bytes = ~1.5 TB; multiplying the terms instead, 300 million * 8 * 707 * 8 bytes, gives ~12 TB, and that pessimistic figure is the one used in the rest of this estimate. Well, can’t say that feels great, but yes, it feels much better than 10,000 TB.
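The arithmetic here is worth spelling out, because the two ways of combining the terms give very different results. The sketch below uses the figures assumed in the text (300 million users, 8-byte IDs, 707 IDs per list) and takes a terabyte as 1024^4 bytes. The function names are made up.

```cpp
#include <cstdint>

constexpr std::uint64_t kUsers = 300000000ULL;
constexpr std::uint64_t kIdBytes = 8;
constexpr std::uint64_t kAverageListLength = 707;

// Keys plus one list of IDs per key: the two terms added together.
constexpr std::uint64_t SumBytes() {
  return kUsers * kIdBytes + kUsers * kAverageListLength * kIdBytes;
}

// The two terms multiplied, which is what yields the ~12 TB figure.
constexpr std::uint64_t ProductBytes() {
  return kUsers * kIdBytes * kAverageListLength * kIdBytes;
}

constexpr double ToTerabytes(std::uint64_t bytes) {
  return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0 * 1024.0);
}
```

`ToTerabytes(SumBytes())` is about 1.5, while `ToTerabytes(ProductBytes())` is about 12.3, so the 12 TB used in the server estimate is best read as a generous upper bound.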

Honestly, I don’t know whether this 12 TB is a reasonable number. But considering the fact that I’m spending around $30 a month on a dedicated server machine with 32 GB of RAM, storing (sharded) 12 TB requires at least 385 such servers; adding a couple of control servers (for data distribution control) rounds that up to 400. So it will cost me $12K (monthly).

Well, considering the fact that the data should be replicated, and that something can always go wrong, we’ll triple the number of servers and then, again, add some control servers. Let’s say we need at least 1500 servers, which will cost us $45K monthly. Well, definitely not good for me, as I can hardly keep one server, but it seems okay for Twitter (it’s really nothing compared to the real Twitter servers). Let’s assume it is really okay for Twitter.

Now, are we okay here? Not yet, that was just the data about who follows whom. What is the main thing in Twitter? I mean, technically, what is its biggest problem? You won’t be alone if you say it’s the fast delivery of tweets. I will definitely second that. And not just fast, but lightning fast. Say Patrick tweeted something about his thoughts on food; all his followers should receive that very tweet in a reasonable time. How long will it take? We are free to make any assumptions here and use any abstractions we want, but we are interested in real-world production systems, so let’s dig. Here’s what typically happens when someone tweets.

Again, I don’t know much about how long it takes for one tweet to reach all followers, but publicly available statistics tell us that there are about 500 million tweets daily. Daily!

So the process above happens 500 million times every day. I can’t really find anything on tweet delivery speed. I vaguely recall something about a maximum of 5 seconds for a tweet to reach all its followers. Also note the “heavy cases”: celebrities with more than a million followers. They might tweet something about their wonderful breakfast at the beach house, but Twitter has to sweat a lot to deliver that super-useful content to millions of followers.

To solve the tweet delivery problem we don’t really need the graph of followings (the users each user follows); instead we need a graph of followers. The previous graph (with a hashtable and a bunch of lists) allows us to efficiently find all users followed by any particular user. But it does not allow us to efficiently find all users who are following one particular user; for that case we would have to scan all the hashtable keys. That’s why we should construct another graph, which is kind of a symmetric opposite of the one we constructed for followings. This new graph will again consist of a hashtable containing all 300 million vertices, each of which points to a list of adjacent vertices (the structure remains the same), but this time the list of adjacent vertices will represent followers.
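
As a rough illustration of that "symmetric opposite", the sketch below inverts a followings graph into a followers graph. The dictionary-of-lists layout and the function name `build_followers_graph` are assumptions made for this example, and the sample users are only placeholders.

```python
from collections import defaultdict

def build_followers_graph(following):
    """Invert a 'who do I follow' adjacency list into a 'who follows me' one."""
    followers = defaultdict(list)
    for user, followed_users in following.items():
        followers[user]  # touch the key so users without followers still get an entry
        for followed in followed_users:
            followers[followed].append(user)
    return dict(followers)

# Hypothetical sample data: keys are users, values are the users they follow.
following = {
    "Liz": ["Patrick"],
    "Sponge Bob": ["Liz", "Patrick"],
    "Ann": ["Liz"],
    "Patrick": [],
}
followers = build_followers_graph(following)
print(followers["Liz"])
```

Building the inverted graph takes a single pass over every edge, after which finding the followers of one user is a single hashtable lookup instead of a scan over all keys.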

So based on this illustration, whenever Liz tweets something, Sponge Bob and Ann must see that very tweet on their timelines. A common technique to accomplish this is to keep a separate structure for each user’s timeline. In the case of Twitter’s 300+ million users, we might assume there are at least 300+ million timelines (one for each user). Basically, whenever a user tweets, we should get the list of that user’s followers and update their timelines (insert that same tweet into each one of them). A timeline might be represented as a linked list or a balanced tree (with tweet datetimes as node keys).

This is just a basic idea, abstracted from the actual timeline representation, and of course we can make the actual delivery faster if we use multithreading. This is crucial for “heavy cases”, because with millions of followers, the ones that reside closer to the end of the list are processed later than the ones residing closer to the front.

The following pseudocode tries to illuminate this multithreading delivery idea:
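
A minimal sketch of the idea in Python, under stated assumptions: the `Timeline` class, the `deliver_tweet` function, and the fixed worker count are all hypothetical names and choices for this example, and a timeline is reduced to a plain list with the newest tweet first. In CPython, threads mostly pay off when updating a timeline involves I/O (for instance a network call to the server that holds it), so treat this as an outline of the chunking scheme rather than a performance claim.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

class Timeline:
    """A toy timeline: newest tweet first, guarded by a lock."""
    def __init__(self):
        self._lock = Lock()
        self.tweets = []

    def insert(self, tweet):
        with self._lock:
            self.tweets.insert(0, tweet)

def deliver_chunk(tweet, follower_ids, timelines):
    # Each worker handles one contiguous slice of the followers list.
    for follower_id in follower_ids:
        timelines[follower_id].insert(tweet)

def deliver_tweet(tweet, author, followers, timelines, workers=4):
    follower_ids = followers.get(author, [])
    if not follower_ids:
        return
    chunk_size = -(-len(follower_ids) // workers)  # ceiling division
    chunks = [follower_ids[i:i + chunk_size]
              for i in range(0, len(follower_ids), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(deliver_chunk, tweet, chunk, timelines)
                   for chunk in chunks]
        for future in futures:
            future.result()  # re-raise any exception from a worker

followers = {"Liz": ["Sponge Bob", "Ann"]}
timelines = {name: Timeline() for name in ["Liz", "Sponge Bob", "Ann"]}
deliver_tweet("Hello from Liz", "Liz", followers, timelines)
```

Splitting the followers list into slices means the followers near the end of a very long list no longer wait for everyone in front of them.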

So whenever followers refresh their timelines, they will receive the new tweet.

It’s fair to say that we have merely touched the tip of the iceberg of real problems at Airbnb or Twitter. It takes a really long time and the hard work of really talented engineers to accomplish such great results in complex systems like Twitter, Google, Facebook, Amazon, Airbnb and others. Just keep this in mind while reading this article.

The point of demonstrating Twitter’s tweet delivery problem is to embrace the usage of graphs, even though we didn’t use any graph algorithm; we just used a representation of the graph. Sure, we pseudocoded a function for delivering tweets, but that is something we came up with during the process of searching for a solution.

What I meant by “any graph algorithm” is any algorithm from this list. As something big enough to make programmers cry, graph theory and the applications of graph algorithms are somewhat difficult to spot at a glance. We were discussing the Airbnb homes and efficient filtering before finishing with graph representations, and the main obvious problem was the inability to efficiently filter homes with more than one filter key. Is there anything that could be done using graph algorithms? Well, we can’t tell for sure, but at least we can try. What if we represent each filter as a separate vertex?

Literally each filter: every price from $10 to $1000+, every city name, country code, amenity (TV, Wi-Fi, and all the others) and number of adults, each as a separate graph vertex.

We can even make this set of vertices more “friendly” if we add “type” vertices too, like “Amenities” connected to all vertices representing an amenity filter.

Now, what if we represent Airbnb homes as vertices and then connect each home with a “filter” vertex if that home supports the corresponding filter (for example, connecting “home 1” with “kitchen” if “home 1” has “kitchen” in its amenities)?

A subtle change to this illustration makes it resemble a special type of graph, called a bipartite graph.

Bipartite graph or just bigraph is a graph whose vertices can be divided into two disjoint and independent sets such that every edge connects a vertex in one set to one in other set. - Wikipedia.

In our example, one of the sets represents filters (we’ll denote it by F) and the other is the set of homes (denoted by H). For example, if there are 100 thousand homes with the price value $62, then the price vertex labeled “$62” will have 100 thousand edges, each incident to one of those home vertices. If we measure the worst case of space complexity, i.e. each home has properties satisfying all the filters, then the total number of edges to be stored will be 70,000 * 4 million. If we represent each edge as a pair of two IDs, {filter_id; home_id}, and if we rethink the IDs and use a 4-byte (int) numeric ID for filters and an 8-byte (long) ID for homes, then each edge will require at least 12 bytes. So storing 70,000 * 4 million 12-byte values will require around 3 TB of memory. We made a small mistake in this calculation, you see.
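
The worst-case estimate can be checked with a few lines of arithmetic. The constant and function names are made up for this example, and a terabyte is taken as 10^12 bytes.

```python
FILTER_ID_BYTES = 4   # int id for a filter vertex
HOME_ID_BYTES = 8     # long id for a home vertex
EDGE_BYTES = FILTER_ID_BYTES + HOME_ID_BYTES

def edge_storage_bytes(filter_count, home_count):
    """Bytes needed if every one of the filters is connected to every home."""
    return filter_count * home_count * EDGE_BYTES

worst_case = edge_storage_bytes(70_000, 4_000_000)
print(worst_case / 10**12, "TB")
```

The same function can be reused with a smaller filter count once the number of edges per home is bounded more tightly.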

The number of filters is around 70,000 because of the 65 thousand cities active in Airbnb (stats). And the good news is that the same home cannot be located in more than one city. That is, the actual number of edges pairing homes with cities is 4 million (each home is located in one city). So we’ll calculate for 70k - 65k = 5 thousand filters, which means we need 5000 * 4 million * 12 bytes of memory, which is less than 0.3 TB. Sounds good. But what does this bipartite graph give us? Most commonly, a website/mobile request will consist of several filters, for example like this:

house_type: "entire_place",
adults_number: 2,
price_range_start: 56,
price_range_end: 80,
beds_number: 2,
amenities: ["tv", "wifi", "laptop friendly workspace"],
facilities: ["gym"]

And all we need is to find all the “filter vertices” above and process all the “home vertices” that are adjacent to these “filter vertices”. This takes us to a scary subject.

Graph Algorithms: Intro

Any processing done with graphs might be categorized as a “graph algorithm”. You can literally implement a function printing all the vertices of a graph and name it “[your name]’s vertex printing algorithm”. Most of us are scared of the graph algorithms listed in textbooks (see the full list here). Let’s try to apply a bipartite graph matching algorithm, such as the Hopcroft-Karp algorithm, to our Airbnb homes filtering problem:

Given a bipartite graph of Airbnb homes (H) and filters (F), where every element (vertex) of H can have more than one adjacent element (vertex) of F (sharing a common edge), find a subset of H consisting of vertices that are adjacent to all vertices in a given subset of F.

It’s a confusing problem definition, and we can’t be sure at this point whether the Hopcroft-Karp algorithm solves our problem. But I assure you that this journey will teach us many crucial ideas behind graph algorithms. And the journey is not so short, so be patient.

The Hopcroft–Karp algorithm is an algorithm that takes as input, a bipartite graph and produces as output, a maximum cardinality matching - a set of as many edges as possible with the property that no two edges share an endpoint - Wikipedia.

Readers familiar with this algorithm are already aware that this doesn’t solve our problem, because matching requires that no two edges share a common vertex.

Let’s look at an example illustration, where there are just 4 filters and 8 homes (for the sake of simplicity).

  • Homes are denoted by letters from A through H; filters are chosen randomly.
  • Home A has a price ($50) and 1 bed (that’s all we got for the price).
  • All homes from A through H have a $50 per night price tag and 1 bed, but only a few of them have “Wi-Fi” and/or “TV”.

So the following illustration tries to show which homes we should “return” for a request asking for homes that have all four filters available (that is, they cost $50 per night, they have 1 bed, and they also have Wi-Fi and TV).

The solution to our problem requires edges that share common endpoints: several distinct home vertices are all adjacent to the same subset of filter vertices. The Hopcroft-Karp algorithm, in contrast, eliminates edges with common endpoints and produces a set of edges in which each vertex of either subset appears at most once.
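
To see what a matching actually returns on this example, here is a small sketch. It uses the simpler augmenting-path method (often called Kuhn’s algorithm) rather than Hopcroft-Karp itself; both produce a maximum matching, Hopcroft-Karp just gets there faster. Which homes have Wi-Fi or TV is an assumption made for this example (only D and G having both comes from the text), and the function name is hypothetical.

```python
def maximum_matching(filter_to_homes):
    """Return a dict home -> filter in which no vertex is used twice."""
    match_of_home = {}

    def try_assign(filter_vertex, visited):
        for home in filter_to_homes[filter_vertex]:
            if home in visited:
                continue
            visited.add(home)
            # Take a free home, or re-route the filter currently holding it.
            if home not in match_of_home or try_assign(match_of_home[home], visited):
                match_of_home[home] = filter_vertex
                return True
        return False

    for filter_vertex in filter_to_homes:
        try_assign(filter_vertex, set())
    return match_of_home

# Assumed edges: every home costs $50 and has 1 bed; Wi-Fi and TV are guesses.
filter_to_homes = {
    "$50": list("ABCDEFGH"),
    "1 bed": list("ABCDEFGH"),
    "Wi-Fi": ["B", "D", "G"],
    "TV": ["D", "F", "G"],
}
matching = maximum_matching(filter_to_homes)
print(len(matching), "edges in the matching")
```

With four filter vertices, a matching can never contain more than four edges, each pairing one filter with one home, so it cannot express “homes adjacent to all four filters”.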

Take a look at the illustration above: all we need are homes D and G, which both satisfy all four filter values. What we really need is to get all the edges that share a common endpoint.
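
One naive way to collect those homes is to walk the edges of each requested filter vertex and count how many times each home is reached. This is only a sketch: the function name is hypothetical, the Wi-Fi and TV edges are assumed (the text only tells us that D and G have both), and each filter is assumed to list a home at most once.

```python
from collections import Counter

def homes_matching_all(filter_to_homes, requested_filters):
    """Homes adjacent to every requested filter vertex."""
    reached = Counter()
    for filter_vertex in requested_filters:
        for home in filter_to_homes.get(filter_vertex, ()):
            reached[home] += 1
    wanted = len(requested_filters)
    return sorted(home for home, count in reached.items() if count == wanted)

filter_to_homes = {
    "$50": list("ABCDEFGH"),
    "1 bed": list("ABCDEFGH"),
    "Wi-Fi": ["B", "D", "G"],   # assumed
    "TV": ["D", "F", "G"],      # assumed
}
result = homes_matching_all(filter_to_homes, ["$50", "1 bed", "Wi-Fi", "TV"])
print(result)
```

The cost is proportional to the total number of edges leaving the requested filter vertices, which is exactly why popular filters such as a common price make this approach slow at Airbnb’s scale.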

We could devise an algorithm for this approach, but its processing time would arguably not meet users’ needs (users’ needs = lightning fast, right here, right now). It would probably be faster to create a balanced binary search tree with multiple sort keys, almost like a database index file, which maps primary/foreign keys to a set of satisfying records.

Balanced binary search trees and database indexing will be discussed in a separate article, where we will return to the Airbnb home filtering problem again.

The Hopcroft-Karp algorithm (like many others) is based on both the DFS (Depth-First Search) and BFS (Breadth-First Search) graph traversal algorithms. To be honest, the actual reason for introducing the Hopcroft-Karp algorithm here was to surreptitiously switch to graph traversals, which are best introduced with the nicest kind of graphs: binary trees.

Binary tree traversals are really beautiful, mostly because of their recursive nature. There are three basic traversals called in-order, post-order and pre-order (you may come up with your own traversal algorithm). They are easy to understand if you have ever traversed a linked list. In linked lists you just print the current node’s value (named item in the code below) and continue to the next node.
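
A linked list traversal can be sketched in Python as follows; the class and function names are chosen for this example, with the node value stored in a field called `item` as in the text.

```python
class ListNode:
    def __init__(self, item, next_node=None):
        self.item = item
        self.next = next_node

def traverse_list(head):
    """Visit each node once, front to back, and collect the items."""
    items = []
    node = head
    while node is not None:
        items.append(node.item)   # "print" the current node's value
        node = node.next          # continue to the next node
    return items

head = ListNode(1, ListNode(2, ListNode(3)))
print(traverse_list(head))
```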

Almost the same goes for binary trees: you print the node value (or do whatever else you need to do with it) and then continue to the next node, but in this case there are “two next” nodes, left and right. So you should do the same for both the left and right nodes. But you have three different choices here:

  • print the node value, then go to the left node, and then go to the right node (pre-order), or

  • go to the left node, print the node value, and then go to the right node (in-order), or

  • go to the left node, then go to the right node, and then print the value of the node (post-order).
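
The three choices differ only in where the “print” step sits relative to the two recursive calls. A minimal Python sketch, with class and function names chosen for this example:

```python
class TreeNode:
    def __init__(self, item, left=None, right=None):
        self.item = item
        self.left = left
        self.right = right

def pre_order(node, out):
    if node is None:
        return
    out.append(node.item)        # visit first
    pre_order(node.left, out)
    pre_order(node.right, out)

def in_order(node, out):
    if node is None:
        return
    in_order(node.left, out)
    out.append(node.item)        # visit between the two subtrees
    in_order(node.right, out)

def post_order(node, out):
    if node is None:
        return
    post_order(node.left, out)
    post_order(node.right, out)
    out.append(node.item)        # visit last

# A small binary search tree:      4
#                                2   6
#                               1 3 5 7
root = TreeNode(4,
                TreeNode(2, TreeNode(1), TreeNode(3)),
                TreeNode(6, TreeNode(5), TreeNode(7)))

def run(traversal):
    out = []
    traversal(root, out)
    return out

print(run(pre_order), run(in_order), run(post_order))
```

On a binary search tree the in-order variant yields the values in ascending order, which is the property used later for serving sorted results.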

Recursive functions look very elegant, but they come at a cost. Each time we call a function recursively, we call a completely “new” function (see the illustration above). By “new” we mean that another stack frame has to be “allocated” for the function arguments and local variables. That’s why recursive calls are expensive (the extra stack space and the many function calls) and dangerous (mind the stack overflow), and why iterative implementations are often suggested instead. In mission-critical systems programming (aircraft, NASA rovers and so on), recursion is said to be completely prohibited (no stats, no experience, just telling you the rumors).
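
For comparison, here is a sketch of an iterative in-order traversal that keeps its own explicit stack instead of relying on the call stack. The names are chosen for this example. The deliberately lopsided tree at the end is 10,000 levels deep, far deeper than a recursive version could go under Python’s default recursion limit.

```python
class TreeNode:
    def __init__(self, item, left=None, right=None):
        self.item = item
        self.left = left
        self.right = right

def in_order_iterative(root):
    """In-order traversal using an explicit stack of pending nodes."""
    result = []
    stack = []
    node = root
    while stack or node is not None:
        while node is not None:      # walk as far left as possible
            stack.append(node)
            node = node.left
        node = stack.pop()           # leftmost unvisited node
        result.append(node.item)
        node = node.right            # then handle its right subtree
    return result

small = TreeNode(2, TreeNode(1), TreeNode(3))
print(in_order_iterative(small))

# A degenerate tree: every node has only a left child.
deep = None
for value in range(10_000):
    deep = TreeNode(value, left=deep)
deep_items = in_order_iterative(deep)
```

The explicit stack lives on the heap, so its size is limited by available memory rather than by the depth of the call stack.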

Netflix and Amazon: Inverted Index Example

Let’s say we want to store all Netflix movies in a binary search tree with movie titles as sort keys. So whenever a user types something like “Inter”, we will return a list of movies starting with “Inter” (for instance, “Interstellar”, “Interceptor”, “Interrogation of Walter White”).

Now, it would be great if we returned every movie that contains “Inter” in its title (not only the ones that start with “Inter”), with the list sorted by movie rating or by something relevant to that particular user (like preferring thrillers over dramas). The point of this example is to make efficient range queries on a BST.

But as usual, we won’t dive deeper into the cold water to spot the rest of the iceberg. Basically, we need a fast lookup by search keywords and then get a list of results sorted by some key, which most likely should be a movie rating and/or some internal ranking based on a user’s personalized data. We’ll try to stick to the KISK principle (Keep It Simple, Karl) as much as possible.

“KISK” or “let’s keep it simple” or “for the sake of simplicity”, a super excuse for tutorial writers to abstract from the real problem and make tons of assumptions by bringing an “abc” easy example and its solution in pseudocode that works even on your grandma’s laptop.

This problem could easily be applied to Amazon’s product search, as we most commonly search for something on Amazon by typing text that describes our interest (like “Graph Algorithms”) and get results sorted by product ratings. I haven’t noticed personalized results in Amazon’s search, but I’m pretty sure Amazon does that stuff too. So it will be fair to change the title of this subtopic to…

Netflix and Amazon. Netflix serves movies, Amazon serves products; we’ll call them “items”, so whenever you read “item”, think of a movie on Netflix or any [viable] product on Amazon.

What is most commonly done with items is the parsing of their title and description (we’ll stick to the title only). So if an operator (usually a human being inserting an item’s data into the Netflix/Amazon database via an admin dashboard) inserts a new item into the database, its title is processed by some “ItemTitleProcessor” to produce keywords.

Each item has a unique ID, which is linked to each keyword found in its title. This is what search engines do while crawling websites all over the world. They analyze each document’s content, tokenize it (break it into smaller entities called words) and add it to a table, which maps each token (word) to the document ID (website) where the token has been “seen”.
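
A minimal sketch of such a table in Python. The tokenizer here simply lowercases the title and splits on anything that is not a letter or digit, which is an assumption for this example (real processors also handle stop words, stemming and so on), and the titles are made up.

```python
import re
from collections import defaultdict

def tokenize(title):
    """Break a title into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())

def build_inverted_index(titles_by_id):
    """Map each token to the set of item IDs whose title contains it."""
    index = defaultdict(set)
    for item_id, title in titles_by_id.items():
        for token in tokenize(title):
            index[token].add(item_id)
    return index

# Hypothetical items.
titles_by_id = {
    1: "Hello World",
    2: "Hello, Graph Theory!",
    3: "Graph Algorithms",
}
index = build_inverted_index(titles_by_id)
print(sorted(index["hello"]), sorted(index["graph"]))
```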

So whenever you search for “hello”, the search engine fetches all documents mapped to the keyword “hello” (reality is much more complex, because the most important thing is search relevancy, which is why Google Search is so awesome). A similar table for Netflix/Amazon may look like this (again, think of Movies or Products when reading Items).

Hashtables, again. Yes, we will keep a hashtable for this inverted index (an index structure storing a mapping from content, such as words, to its locations in documents - Wikipedia). The hashtable will map a keyword to a BST of items. Why a BST? Because we want to keep the items sorted and at the same time serve them (respond to frontend requests) in sequential sorted portions (for instance, 100 items per request using pagination). That’s not really something that shows the power of BSTs. But let’s pretend that we also need a fast lookup within the search results, say you want all 3-star movies with the keyword “machine”.

Note that it’s okay to have duplicate items in different trees, because an item usually can be found with more than one keyword.

We’ll operate with items defined as follows:
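
A minimal sketch of such a definition in Python. The field names (`id`, `title`, `rating`) and the `ItemNode` and `insert` helpers are assumptions for this example, and the tree is a plain unbalanced BST ordered by rating.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Item:
    id: int
    title: str
    rating: float

@dataclass
class ItemNode:
    item: Item
    left: Optional["ItemNode"] = None
    right: Optional["ItemNode"] = None

def insert(root, item):
    """Insert an item into a BST ordered by rating and return the root."""
    if root is None:
        return ItemNode(item)
    node = root
    while True:
        if item.rating < node.item.rating:
            if node.left is None:
                node.left = ItemNode(item)
                return root
            node = node.left
        else:  # equal ratings go to the right subtree
            if node.right is None:
                node.right = ItemNode(item)
                return root
            node = node.right

root = None
for item in [Item(1, "First", 3.0), Item(2, "Second", 4.5), Item(3, "Third", 2.0)]:
    root = insert(root, item)
```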

Each time a new item is inserted into a database, the title is processed and added to the big index table, which maps a keyword to an item. There could be many items sharing the same keyword, so we keep these items in a BST sorted by their rating.

When users search for some keyword, they get a list of items sorted by their rating. How can we get a list from a tree in a sorted order? By doing an in-order traversal.

Here’s how an implementation of InOrderProduceVector() might look:
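
A Python rendering of the idea, named `in_order_produce_vector` after the function mentioned above. The `Node` class, keyed directly by rating, is an assumption for this example.

```python
class Node:
    def __init__(self, rating, left=None, right=None):
        self.rating = rating
        self.left = left
        self.right = right

def in_order_produce_vector(node, out=None):
    """Append the ratings of the subtree rooted at node in ascending order."""
    if out is None:
        out = []
    if node is not None:
        in_order_produce_vector(node.left, out)
        out.append(node.rating)
        in_order_produce_vector(node.right, out)
    return out

tree = Node(3.0, Node(2.0, Node(1.5)), Node(4.5, Node(4.0)))
print(in_order_produce_vector(tree))
```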

But, but… We need the highest rated items first, and our in-order traversal produces the lowest rated items first. That’s because of its nature. In-order traversal works “bottom up”, from the lowest to the highest item. To get what we wanted, i.e. the list in descending order instead of ascending, we should take a closer look at the in-order traversal implementation.

What we are doing is going through the left node, then printing the current node’s value, and then going through the right node. The leftmost node is the node with the smallest value. So simply changing the implementation to go through the right node first will give us the list in descending order. We’ll name it as others do: a reverse in-order traversal.

Let’s update the code above (presented as a single listing). Warning: bugs ahead!
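A minimal Python sketch of the reverse in-order traversal, under the assumption that each node stores only its rating; the `Node` dataclass and the function name `reverse_in_order` are illustrative, not the author's listing.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    rating: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def reverse_in_order(node, out):
    """Append ratings to `out` in descending order: right, node, left."""
    if node is None:
        return out
    reverse_in_order(node.right, out)   # larger values first
    out.append(node.rating)
    reverse_in_order(node.left, out)    # smaller values last
    return out

# A small hand-built BST with made-up ratings.
tree = Node(4.2, left=Node(3.7), right=Node(4.9, left=Node(4.5)))
descending = reverse_in_order(tree, [])
print(descending)
```

The only change from a plain in-order traversal is the order of the two recursive calls.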

That’s it. We can serve item search results pretty fast. As mentioned above, inverted indexing is used mostly in search engines, like Google. Although Google Search is a very complex search engine, it does use some simple ideas (though heavily modernized) to match search queries to documents and serve the results as fast as possible.

We used tree traversals to serve results in sorted order. At this point it might seem that pre/in/post-order traversals are more than enough, but sometimes there is a need for another type of traversal.

Let’s tackle this well-known programming interview question: “How would you print a [binary] tree level by level?”

Traversals: DFS and BFS

If you are not familiar with this problem, think of some data structure that you could use to store nodes while traversing the tree. If we compare the level-by-level traversal of a tree with the others above (pre-, in- and post-order traversals), we’ll eventually arrive at the two main graph traversals: depth-first search (DFS) and breadth-first search (BFS).

Depth-first search hunts for the farthest node; breadth-first search explores the nearest nodes first.

  • Depth-first search - more action, less thought.

  • Breadth-first search - take a good look around you before going farther.

DFS is much like the pre-, in- and post-order traversals, while BFS is what we need if we want to print a tree’s nodes level by level.

To accomplish this, we would need a queue (data structure) to store the “level” of the graph while printing (visiting) its “parent level”. In the previous illustration nodes that are placed in the queue are in light blue.

Basically, going level by level, nodes on each level are fetched from the queue, and while visiting each fetched node, we also should insert its children into the queue (for the next level). The following code is simple enough to get the main idea of BFS. It is assumed that the graph is connected, although it can be modified to apply to disconnected graphs.
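As a minimal sketch of that idea in Python, here is a level-by-level traversal of a binary tree. The `Node` class and the function name `level_order` are assumptions made for this example; the queue is a `collections.deque`.

```python
from collections import deque

class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def level_order(root):
    """Return node values grouped by level, using a queue (BFS)."""
    levels = []
    queue = deque([root]) if root is not None else deque()
    while queue:
        level = []
        # Everything currently in the queue belongs to the same level.
        for _ in range(len(queue)):
            node = queue.popleft()
            level.append(node.value)
            # Children are queued for the next level.
            if node.left is not None:
                queue.append(node.left)
            if node.right is not None:
                queue.append(node.right)
        levels.append(level)
    return levels

tree = Node(1, Node(2, Node(4), Node(5)), Node(3, None, Node(6)))
print(level_order(tree))
```

Counting the queue's length before the inner loop is what separates one level from the next.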

The basic idea is easy to show on a node-based connected graph representation. Just keep in mind that the implementation of the graph traversal differs from representation to representation.

BFS and DFS are important tools for tackling graph search problems (but remember that there are tons of graph search algorithms). While DFS has an elegant recursive implementation, it is reasonable to implement it iteratively. For the iterative implementation of BFS we used a queue; for DFS you will need a stack. One of the most popular problems in graphs, and at the same time one of the most likely reasons you are reading this article, is the problem of finding the shortest path between graph vertices. And this takes us to our last thought experiment.
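To make the stack-based variant concrete, here is a minimal Python sketch of an iterative DFS. It assumes the graph is given as an adjacency-list dictionary; the function name `dfs_iterative` and the sample graph are made up for this example.

```python
def dfs_iterative(graph, start):
    """Return vertices in the order an iterative depth-first search visits them."""
    visited = []
    seen = set()
    stack = [start]
    while stack:
        vertex = stack.pop()
        if vertex in seen:
            continue            # a vertex may be pushed more than once
        seen.add(vertex)
        visited.append(vertex)
        # Push neighbors in reverse so the first listed neighbor is explored first.
        for neighbor in reversed(graph[vertex]):
            if neighbor not in seen:
                stack.append(neighbor)
    return visited

# A small undirected graph, written as adjacency lists.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}
order = dfs_iterative(graph, "A")
print(order)
```

Swapping the stack for a queue (and `pop()` for `popleft()` on a deque) would turn the same loop into a BFS.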

Uber and the Shortest Path Problem (Dijkstra’s Algorithm)

With its 50 million users and 7 million drivers (source), one of the things most critical to Uber’s functioning is the ability to match drivers with riders in an efficient way. The problem starts with locations.

The backend should process millions of user requests, sending each of the requests to one or more (usually more) drivers nearby. While it is easier and sometimes even smarter to send the user request to all nearby drivers, some pre-processing might actually help.

Besides processing incoming requests, finding the location area based on the user’s coordinates and then finding the drivers with the nearest coordinates, we also need to find the right driver for the ride. To avoid geospatial request processing (fetching nearby cars by comparing their current coordinates with the user’s coordinates), let’s say we already have a segment of the map with the user and several nearby cars.

Something like this:

Possible paths from a car to a user are in yellow. The problem is to calculate the minimum distance the car has to travel to reach the user; in other words, to find the shortest path between them. While this is more about Google Maps than Uber, we’ll try to solve it for this particular and very simplified case, mostly because there is usually more than one car nearby and Uber might want to find the nearest car with the highest rating to send to the user.

For this illustration, that means calculating the shortest path to the user for all three cars and deciding which car would be the optimal one to send. To make things really simple, we’ll discuss the case with just one car. Here are some possible routes to reach the user.

Cutting to the chase, we’ll represent this segment as a graph:

This is an undirected weighted graph (edge-weighted, to be more specific). To find the shortest path between vertices B (the car) and A (the user), we should find a route between them whose edge weights add up to the smallest possible total. You are free to devise your own version of the solution. We’ll stick with Dijkstra’s version. The following steps of Dijkstra’s algorithm are from Wikipedia.
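One simple way to hold such a graph in code is a dictionary of dictionaries. The sketch below is only an illustration: the illustration's actual weights are not available in the text, so every weight except B–F = 20 (mentioned later in the walkthrough) is invented, as are the names `edges`, `graph` and `route_length`.

```python
# (vertex, vertex, weight) triples. Only B-F = 20 comes from the article;
# all other edges and weights are hypothetical.
edges = [
    ("B", "F", 20), ("B", "C", 40), ("F", "G", 10), ("G", "A", 5),
    ("C", "D", 10), ("D", "E", 10), ("E", "A", 15),
]

# Undirected graph: store every edge in both directions.
graph = {}
for u, v, w in edges:
    graph.setdefault(u, {})[v] = w
    graph.setdefault(v, {})[u] = w

def route_length(graph, route):
    """Sum the edge weights along a route given as a list of vertices."""
    return sum(graph[u][v] for u, v in zip(route, route[1:]))

print(route_length(graph, ["B", "F", "G", "A"]))
print(route_length(graph, ["B", "C", "D", "E", "A"]))
```

Comparing the totals of candidate routes by hand like this is exactly the work that a shortest-path algorithm automates.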

Let the node at which we are starting be called the initial node. Let the distance of node Y be the distance from the initial node to Y. Dijkstra’s algorithm will assign some initial distance values and will try to improve them step by step.

  1. Mark all nodes unvisited. Create a set of all the unvisited nodes called the unvisited set.

  2. Assign to every node a tentative distance value: set it to zero for our initial node and to infinity for all other nodes. Set the initial node as current.

  3. For the current node, consider all of its unvisited neighbors and calculate their tentative distances through the current node. Compare the newly calculated tentative distance to the current assigned value and assign the smaller one. For example, if the current node A is marked with a distance of 6, and the edge connecting it with a neighbor B has length 2, then the distance to B through A will be 6 + 2 = 8. If B was previously marked with a distance greater than 8 then change it to 8. Otherwise, keep the current value.

  4. When we are done considering all of the neighbors of the current node, mark the current node as visited and remove it from the unvisited set. A visited node will never be checked again.

  5. If the destination node has been marked visited (when planning a route between two specific nodes) or if the smallest tentative distance among the nodes in the unvisited set is infinity (when planning a complete traversal; occurs when there is no connection between the initial node and remaining unvisited nodes), then stop. The algorithm has finished.

  6. Otherwise, select the unvisited node that is marked with the smallest tentative distance, set it as the new “current node”, and go back to step 3.
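The six steps above can be sketched in Python roughly as follows. This is a direct, unoptimized transcription (real implementations usually use a priority queue). The function name `dijkstra` is an assumption, and the sample graph uses invented weights: only B–F = 20 is taken from the walkthrough in this article.

```python
import math

def dijkstra(graph, initial, destination):
    """Follow the six steps literally; return (tentative distances, prev table)."""
    unvisited = set(graph)                         # step 1
    dist = {v: math.inf for v in graph}            # step 2
    dist[initial] = 0
    prev = {}
    current = initial
    while True:
        for neighbor, weight in graph[current].items():   # step 3
            candidate = dist[current] + weight
            if neighbor in unvisited and candidate < dist[neighbor]:
                dist[neighbor] = candidate
                prev[neighbor] = current
        unvisited.remove(current)                  # step 4
        if current == destination or not unvisited:       # step 5
            break
        nearest = min(unvisited, key=dist.get)
        if dist[nearest] == math.inf:              # the rest is unreachable
            break
        current = nearest                          # step 6
    return dist, prev

# Hypothetical weights; only B-F = 20 comes from the article.
graph = {
    "A": {"G": 5, "E": 15},
    "B": {"F": 20, "C": 40},
    "C": {"B": 40, "D": 10},
    "D": {"C": 10, "E": 10},
    "E": {"D": 10, "A": 15},
    "F": {"B": 20, "G": 10},
    "G": {"F": 10, "A": 5},
}
dist, prev = dijkstra(graph, "B", "A")
print(dist["A"], prev)
```

Because the loop stops as soon as the destination is visited, distances of nodes that were never visited remain tentative rather than final.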

Applying this to our example, we’ll start with vertex B (the car) as the initial node. For the first two steps:

Our unvisited set consists of all vertices. Also note the table at the left side of the illustration. For every vertex, it will contain the shortest distance from B and the previous vertex (marked “Prev”) that leads to it. For instance, the distance from B to F is 20, and the previous vertex is B.

We mark B as visited and move on to its neighbor F.

Now we mark F as visited and choose the next unvisited node with the smallest tentative distance, which is G. Also note the table on the left side. In the previous illustration, nodes C, F and G already have their tentative distances set, along with the previous nodes that lead to them.

As stated in the algorithm, if the destination node has been marked visited (when planning a route between two specific nodes as in our case) then we can stop. So our next step stops the algorithm with the following values.

So we have both the shortest distance from B to A and the route through nodes F and G.
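The route itself can be read off the “Prev” column by walking backwards from the destination. A minimal Python sketch, assuming the table is stored as a dictionary; the name `reconstruct_route` is made up, and the `prev` entries simply mirror the B → F → G → A route described above.

```python
def reconstruct_route(prev, initial, destination):
    """Walk the Prev table backwards from the destination to the initial node."""
    route = [destination]
    while route[-1] != initial:
        # Raises KeyError if the destination was never reached.
        route.append(prev[route[-1]])
    route.reverse()
    return route

# Prev entries matching the route from the walkthrough.
prev = {"F": "B", "G": "F", "A": "G"}
route = reconstruct_route(prev, "B", "A")
print(route)
```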

This is really the simplest possible example of the problems Uber might face. Going back to our iceberg analogy, we are at the tip of the tip of the iceberg. However, this is a good start for exploring the real world of graph theory and its applications. I didn’t complete everything I initially planned for this article, but it will most probably be continued in the near future (including database indexing internals).

There is still so much to tell about graphs (and I still have a lot to study). Take this article as another tip of the iceberg. If you have read this far, you definitely deserve a cookie. Don’t forget to clap and share. Thank you.

Resources

[1] Sh. Even, G. Even, Graph Algorithms

Further reading

R. Sedgewick, K. Wayne, Algorithms

T. Cormen, Ch. Leiserson, R. Rivest, C. Stein, Introduction to Algorithms

Airbnb Engineering, AirbnbEng

Netflix Tech Blog, Netflix Technology Blog

Twitter Engineering Blog

Uber Engineering Blog

Translated from: https://www.freecodecamp.org/news/i-dont-understand-graph-theory-1c96572a1401/
