The HyperLogLog algorithm
by Alex Nadalin
Every now and then I bump into a concept that’s so simple and powerful that I wish I’d discovered such an incredible and beautiful idea myself.
I discovered HyperLogLog (HLL) a couple of years ago, and fell in love with it right after reading how Redis decided to add an HLL data structure.
The idea behind HLL is devastatingly simple but extremely powerful. This is what makes it such a widespread algorithm, used by giants of the internet such as Google and Reddit.
My friend Tommy and I planned to go to a conference. While heading to its location, we decided to wager on who would meet the most new people. Once we reached the place, we’d start chatting around and keep a count of how many people we talked to.
At the end of the event, Tommy comes to me with his figure — say, 17 — and I tell him that I had a word with 46 people.
Clearly, I am the winner, but Tommy’s frustrated as he thinks I’ve counted the same people multiple times. He believes he only saw me talking to maybe 15–20 people in total.
So, the wager’s off. We decide that for our next event, we’ll be taking down names instead, to be sure we’re counting unique people, and not just the total number of conversations.
At the end of the following conference, we meet each other with a very long list of names and — guess what? Tommy had a couple more encounters than I did! We laugh it off, and while discussing our approach to counting uniques, Tommy comes up with a great idea:
“Alex, you know what? We can’t go around with pen and paper and track down a list of names, it’s really impractical! Today I spoke to 65 different people and counting their names on this paper was a real pain. I lost count 3 times and had to start from scratch!”
“Yeah, I know, but do we even have an alternative?”
“What if, for our next conference, instead of asking for names, we ask people the last 5 digits of their phone number? Instead of winning by counting their names, the winner will be the one who spoke to someone with the longest sequence of leading zeroes in those digits.”
“Wait Tommy, you’re going too fast! Slow down a second and give me an example…”
“Sure, just ask each person for those last 5 digits, ok? Let’s suppose they reply ‘54701’. There’s no leading zero, so the longest sequence of leading zeroes is 0. The next person you talk to says ‘02561’ — that’s a leading zero! So your longest sequence is now 1.”
“You’re starting to make sense to me…”
“Yeah, so if we speak to only a couple of people, chances are that our longest zero-sequence will be 0. But if we talk to maybe 10 people, we have better chances of it being 1.”
“Now, imagine you tell me your longest zero-sequence is 5 — you must have spoken to thousands of people to find someone with 00000 in their phone number!”
“Dude, you’re a damn genius!”
This, broadly speaking, is what HLL does: it allows us to estimate the number of unique items in a large dataset by recording the longest sequence of leading zeroes it observes.
This ends up creating an incredible advantage over keeping track of each and every element in the set. It is an incredibly efficient way to count unique values, with relatively high accuracy.
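To make the trick concrete, here is a toy Python sketch of the wager (my own illustration, not HLL itself): we keep only the longest run of leading zeroes seen among everyone’s 5 digits, and guess that a run of length k took on the order of 10**k conversations to find.

```python
import random

def leading_zeroes(digits: str) -> int:
    """Length of the run of '0's at the start of the string."""
    return len(digits) - len(digits.lstrip("0"))

def estimate_uniques(phone_endings) -> int:
    """If the longest run of leading zeroes is k, we have probably
    seen on the order of 10**k distinct people."""
    longest = max(leading_zeroes(d) for d in phone_endings)
    return 10 ** longest

# Simulate the conference: each person hands us 5 random digits.
people = ["%05d" % random.randrange(100_000) for _ in range(5_000)]
print(estimate_uniques(people))  # right order of magnitude, most of the time
```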
“The HyperLogLog algorithm can estimate cardinalities well beyond 10⁹ with a relative accuracy (standard error) of 2% while only using 1.5KB of memory”
Fangjin Yang — Fast, Cheap, and 98% Right: Cardinality Estimation for Big Data
Since I may be oversimplifying, let’s have a look at some more details of HLL.
HLL is part of a family of algorithms that aim to address cardinality estimation — otherwise known as the “count-distinct problem.” How can we efficiently count the number of unique objects in a data set?
This is extremely useful for lots of today’s web applications. For example, when you want to count how many unique views an article on your site has generated.
When HLL runs, it takes your input data and hashes it, turning it into a bit sequence:
```
IP address of the viewer: 54.134.45.789
HLL hash: 010010101010101010111010...
```
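As a minimal sketch of that step (assuming Python, with SHA-1 from the standard library’s hashlib standing in for whatever hash a real HLL implementation uses):

```python
import hashlib

def hash_to_bits(value: str, bits: int = 32) -> str:
    """Hash a value and return its first `bits` bits as a '0'/'1' string."""
    digest = hashlib.sha1(value.encode()).digest()  # 20 bytes = 160 bits
    return format(int.from_bytes(digest, "big"), "0160b")[:bits]

print(hash_to_bits("54.134.45.789"))
# a 32-character bit string; HLL records its run of leading zeroes
```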
Now, an important part of HLL is to make sure that your hashing function distributes bits as evenly as possible. You don’t want to use a weak function such as this one (a made-up sketch of what a bad choice looks like):
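```python
def weak_hash(ip: str) -> str:
    """A deliberately bad, hypothetical 'hash': it just re-encodes each octet
    of the IP, so addresses from the same network share long bit prefixes."""
    return "".join(format(int(octet), "b").zfill(8) for octet in ip.split("."))
```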
An HLL using this hashing function would return biased results if, for example, the distribution of your visitors is tied to a specific geographic region.
The original paper has a few more details on what a good hashing function means for HLL:
“All known efficient cardinality estimators rely on randomization, which is ensured by the use of hash functions.
The elements to be counted belonging to a certain data domain D, we assume given a hash function, h : D → {0, 1}∞; that is, we assimilate hashed values to infinite binary strings of {0, 1}∞, or equivalently to real numbers of the unit interval.
[…]
We postulate that the hash function has been designed in such a way that the hashed values closely resemble a uniform model of randomness, namely, bits of hashed values are assumed to be independent and to have each probability [0.5] of occurring.”
Philippe Flajolet — HyperLogLog: The Analysis of a Near-optimal Cardinality Estimation Algorithm
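As a quick, informal sanity check of that uniformity assumption (a toy experiment in Python, not a substitute for the paper’s analysis), we can hash many inputs and confirm each bit comes up set about half the time:

```python
import hashlib

# Count how often each of the first 8 bits is set across many hashed inputs;
# a hash suitable for HLL should set each bit roughly half the time.
n, ones = 100_000, [0] * 8
for i in range(n):
    first_byte = hashlib.sha1(str(i).encode()).digest()[0]
    for bit in range(8):
        ones[bit] += (first_byte >> (7 - bit)) & 1

print([round(c / n, 3) for c in ones])  # each entry hovers around 0.5
```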
Now, after we’ve picked a suitable hash function, we need to address another pitfall: variance.
Going back to our example, imagine that the first person you talk to at the conference tells you their number ends with 00004 — jackpot!
You might have won a wager against Tommy, but if you use this method in real life, chances are that specific data in your set will negatively influence the estimation.
Fear no more, as this is a problem HLL was born to solve.
Not many are aware that Philippe Flajolet, one of the brains behind HLL, has been involved in cardinality-estimation problems for a long time. Long enough to have come up with the Flajolet-Martin algorithm in 1984 and (super-)LogLog in 2003.
These algorithms already addressed some of the problems with outlying hashed values, by dividing measurements into buckets, and (somewhat) averaging values across buckets.
If you got lost here, let me go back to our original example.
Instead of just taking the last 5 digits of a phone number, we now take 6 of them. We use the first digit to pick a ‘bucket’, and store, per bucket, the longest sequence of leading zeroes found in the remaining 5 digits.
This means that our data will look like:
```
Input:
708942 --> in the 7th bucket, the longest sequence of 0s is 1
518942 --> in the 5th bucket, the longest sequence of 0s is 0
500973 --> in the 5th bucket, the longest sequence of 0s is now 2
900000 --> in the 9th bucket, the longest sequence of 0s is 5
900672 --> in the 9th bucket, the longest sequence of 0s stays 5

Buckets:
0: 0
1: 0
2: 0
3: 0
4: 0
5: 2
6: 0
7: 1
8: 0
9: 5

Output:
avg(buckets) = 0.8
```
As you can see, if we weren’t employing buckets, we would instead use 5 as the longest sequence of zeroes. This would negatively impact our estimation.
Although I simplified the math behind buckets (it’s not just a simple average), you can totally see how this approach makes sense.
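Here is the same bucketing step as a small Python sketch (matching the toy example above, simple average included, which again glosses over the real math):

```python
def bucketed_zeroes(phone_endings):
    """Route each 6-digit string into a bucket by its first digit, keeping
    the longest run of leading zeroes seen in the remaining 5 digits."""
    buckets = [0] * 10
    for digits in phone_endings:
        bucket = int(digits[0])
        rest = digits[1:]
        zeroes = len(rest) - len(rest.lstrip("0"))
        buckets[bucket] = max(buckets[bucket], zeroes)
    return buckets

buckets = bucketed_zeroes(["708942", "518942", "500973", "900000", "900672"])
print(buckets)                      # [0, 0, 0, 0, 0, 2, 0, 1, 0, 5]
print(sum(buckets) / len(buckets))  # 0.8
```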
It’s interesting to see how Flajolet addresses variance throughout his works:
“While we’ve got an estimate that’s already pretty good, it’s possible to get a lot better. Durand and Flajolet make the observation that outlying values do a lot to decrease the accuracy of the estimate; by throwing out the largest values before averaging, accuracy can be improved.
Specifically, by throwing out the 30% of buckets with the largest values, and averaging only 70% of buckets with the smaller values, accuracy can be improved from 1.30/sqrt(m) to only 1.05/sqrt(m)! That means that our earlier example, with 640 bytes of state and an average error of 4% now has an average error of about 3.2%, with no additional increase in space required.
Finally, the major contribution of Flajolet et al in the HyperLogLog paper is to use a different type of averaging, taking the harmonic mean instead of the geometric mean we just applied. By doing this, they’re able to edge down the error to 1.04/sqrt(m), again with no increase in state required”
Nick Johnson — Improving Accuracy: SuperLogLog and HyperLogLog
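To get a feel for why the harmonic mean tames outliers, compare it with the arithmetic mean on the bucket values from our toy example (treating each bucket’s longest run b as a rough per-bucket estimate of 2**b, as HLL does on bit strings; this only illustrates the averaging, not the full estimator):

```python
from statistics import harmonic_mean

runs = [0, 0, 0, 0, 0, 2, 0, 1, 0, 5]  # longest runs per bucket
values = [2 ** b for b in runs]         # rough per-bucket estimates

print(sum(values) / len(values))  # arithmetic mean: 4.5, dragged up by the lone 32
print(harmonic_mean(values))      # ~1.29, the outlier barely moves it
```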
So, where can we find HLLs in action? Two great web-scale examples are:
BigQuery, to efficiently count uniques in a table (APPROX_COUNT_DISTINCT())
Reddit, where it’s used to calculate how many unique views a post has gathered
In particular, see how HLL impacts queries on BigQuery:
```sql
SELECT COUNT(DISTINCT actor.login) exact_cnt
FROM `githubarchive.year.2016`
-- 6,610,026 (4.1s elapsed, 3.39 GB processed, 320,825,029 rows scanned)

SELECT APPROX_COUNT_DISTINCT(actor.login) approx_cnt
FROM `githubarchive.year.2016`
-- 6,643,627 (2.6s elapsed, 3.39 GB processed, 320,825,029 rows scanned)
```
The second result is an approximation (with an error rate of ~0.5%), but takes a fraction of the time.
Long story short: HyperLogLog is amazing!
Now you know what it is and when it can be used, so go out and do incredible stuff with it!
A few links, if you want to dig deeper:

HyperLogLog on Wikipedia
the original paper
HyperLogLog++, Google’s improved implementation of HLL
Redis new data structure: the HyperLogLog
Damn Cool Algorithms: Cardinality Estimation
HLL data types in Riak
HyperLogLog and MinHash
Originally published at odino.org (13th Jan 2018).
Translated from: https://www.freecodecamp.org/news/my-favorite-algorithm-and-data-structure-hyperloglog-6583a25c8a4f/