Reservoir Sampling

Refs:

  1. https://gregable.com/2007/10/reservoir-sampling.html
  2. http://blog.jobbole.com/42550/

Problem:

Reservoir Sampling is an algorithm for sampling elements from a stream of data. Imagine you are given a really large stream of data elements, for example:

  • Queries on google searches in May
  • Products bought at Walmart during the Christmas season
  • Names in a phone book
    Your goal is to efficiently return a random sample of 1,000 elements evenly distributed from the original stream. How would you do it?

简而言之就是data stream,最好是one pass,只遍历数据一遍就抽样。

解法:stream里面抽k个样,维持一个reservoir (k大小),之后每进(k+i)一个数A,以 k/(k+i)的比例抽样留下A,然后再在reservoir里面均匀抽样某一个数被A替换。这样子无论是新树A还是之前的所有数,在reservoir中survive的probability都是 k/(k+i). (之前的数: 1/(k+i-1) * k/(k+i) )


image.png

你可能感兴趣的:(Reservoir Sampling)