How Reddit ranking algorithms workThis is a follow up post to How Hacker News ranking algorithm works. This time around I will examine how Reddit's default story and comment rankings work. Reddit's algorithms are fairly simple to understand and to implement and in this post I'll dig deeper into them. The first part of this post will focus on story ranking, i.e. how are Reddit stories ranked? The second part of this post will focus on comment ranking, which does not use the same ranking as stories (unlikeHacker News), Reddit's comment ranking algorithm is quite interesting and the idea guy behind it is Randall Munroe (the author ofxkcd). Digging into the story ranking codeReddit is open sourced and the code is freely available. Reddit is implemented in Python and theircode is located here. Their sorting algorithms are implemented inPyrex, which is a language to write Python C extensions. They have used Pyrex for speed reasons. I have rewrittentheir Pyrex implementation into pure Python since it's easier to read. The default story algorithm called the hot ranking is implemented like this: #Rewritten code from /r2/r2/lib/db/_sorts.pyx
from datetime import datetime, timedelta
from math import log
epoch = datetime(1970, 1, 1)
def epoch_seconds(date):
"""Returns the number of seconds from the epoch to date."""
td = date - epoch
return td.days * 86400 + td.seconds + (float(td.microseconds) / 1000000)
def score(ups, downs):
return ups - downs
def hot(ups, downs, date):
"""The hot formula. Should match the equivalent function in postgres."""
s = score(ups, downs)
order = log(max(abs(s), 1), 10)
sign = 1 if s > 0 else -1 if s < 0 else 0
seconds = epoch_seconds(date) - 1134028003
return round(sign * order + seconds / 45000, 7)
In mathematical notation the hot algorithm looks like this (I have this fromSEOmoz, but I doubt they are the author of this): Effects of submission timeFollowing things can be said about submission time related to story ranking:
Here is a visualization of the score for a story that has same amount of up and downvotes, but different submission time: The logarithm scaleReddit's hot ranking uses the logarithm function to weight the first votes higher than the rest. Generally this applies:
Here is a visualization: Without using the logarithm scale the score would look like this: Effects of downvotesReddit is one of the few sites that has downvotes. As you can read in the code a story's "score" is defined to be:
The meaning of this can be visualized like this:
Conclusion of Reddit's story ranking
How Reddit's comment ranking worksRandall Munroe of xkcd is the idea guy behind Reddit'sbest ranking. He has written a great blog post about it:
You should read his blog post as it explains the algorithm in a very understandable way. The outline of his blog post is following:
Digging into the comment ranking codeThe confidence sort algorithm is implemented in _sorts.pyx, I have rewritten their Pyrex implementation into pure Python (do also note that I have removed their caching optimization): #Rewritten code from /r2/r2/lib/db/_sorts.pyx
from math import sqrt
def _confidence(ups, downs):
n = ups + downs
if n == 0:
return 0
z = 1.0 #1.0 = 85%, 1.6 = 95%
phat = float(ups) / n
return sqrt(phat+z*z/(2*n)-z*((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
def confidence(ups, downs):
if ups + downs == 0:
return 0
else:
return _confidence(ups, downs)
The confidence sort uses Wilson score interval and the mathematical notation looks like this:
Let's summarize the above in a following manner:
Randall has a great example of how the confidence sort ranks comments in his blog post:
Effects of submission time: there are none!The great thing about the confidence sort is that submission time is irrelevant (much unlike thehot sort or Hacker News's ranking algorithm). Comments are ranked by confidence and by data sampling - - i.e. the more votes a comment gets the more accurate its score will become. VisualizationLet's visualize the confidence sort and see how it ranks comments. We can use Randall's example:
Application outside of rankingLike Evan Miller notes Wilson's score interval has applications outside of ranking. He lists 3 examples:
To use it you only need two things:
Given how powerful and simple this is, it's amazing that most sites today use the naive ways to rank their content. This includes billion dollar companies likeAmazon.com, which defineAverage rating = (Positive ratings) / (Total ratings). ConclusionI hope you have found this useful and leave comments if you have any questions or remarks. Happy hacking as always :) Related
23. Nov 2010
•
Code·Design·Python
|
|