Python random sample with a generator / iterable / iterator

I'm trying to get a random sample from a very large text corpus.

Your excellent synthesis answer currently shows `iter_sample_fast(gen, pop)` as the winner. However, I tried Katriel's recommendation of `random.sample(list(gen), pop)`, and it's blazingly fast by comparison:

```python
import random

def iter_sample_easy(iterable, samplesize):
    # Materialize the whole iterable into a list, then sample from it.
    return random.sample(list(iterable), samplesize)
```

```
Sampling 1000 from 10000
Using iter_sample_fast 0.0192 s
Using iter_sample_easy 0.0009 s

Sampling 10000 from 100000
Using iter_sample_fast 0.1807 s
Using iter_sample_easy 0.0103 s

Sampling 100000 from 1000000
Using iter_sample_fast 1.8192 s
Using iter_sample_easy 0.2268 s

Sampling 200000 from 1000000
Using iter_sample_fast 1.7467 s
Using iter_sample_easy 0.3297 s

Sampling 500000 from 1000000
Using iter_sample_easy 0.5628 s

Sampling 2000000 from 5000000
Using iter_sample_easy 2.7147 s
```
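(For reference: the `iter_sample_fast` baseline comes from the earlier answers in this thread and its body isn't reproduced here. It is a one-pass reservoir-style sampler; the sketch below has that shape, but the exact implementation being timed may differ.)

```python
import random

def iter_sample_fast(iterable, samplesize):
    """One-pass reservoir sample: O(samplesize) memory, uniform."""
    iterator = iter(iterable)
    # Fill the reservoir with the first samplesize elements.
    results = [next(iterator) for _ in range(samplesize)]
    random.shuffle(results)
    # Each later element i replaces a random reservoir slot with
    # probability samplesize / (i + 1), keeping the sample uniform
    # (this is Waterman's Algorithm R).
    for i, v in enumerate(iterator, samplesize):
        r = random.randint(0, i)
        if r < samplesize:
            results[r] = v
    return results
```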

Now, as your corpus gets very large, materializing the whole iterable into a `list` will use prohibitive amounts of memory. But if we can chunk the problem, we can still exploit Python's blazing speed: basically, we pick a `CHUNKSIZE` that is "reasonably small", run `random.sample` on chunks of that size, and then use `random.sample` again to merge them together. We just have to get the boundary conditions right.

I can see how to do it when the length of `list(iterable)` is an exact multiple of `CHUNKSIZE` and no bigger than `samplesize*CHUNKSIZE`:

```python
import itertools
import random

def iter_sample_dist_naive(iterable, samplesize):
    # Depends on iter_sample_easy defined above.
    CHUNKSIZE = 10000
    samples = []
    it = iter(iterable)
    try:
        while True:
            first = next(it)  # raises StopIteration when the iterator is exhausted
            chunk = itertools.chain([first], itertools.islice(it, CHUNKSIZE - 1))
            samples += iter_sample_easy(chunk, samplesize)
    except StopIteration:
        return random.sample(samples, samplesize)
```

However, the code above samples non-uniformly when `len(list(iterable)) % CHUNKSIZE != 0`, and it runs out of memory as `len(list(iterable)) * samplesize / CHUNKSIZE` gets "big". Fixing these bugs is above my pay grade, I'm afraid, but a solution is described in this blog post that sounds quite reasonable to me. (Search terms: "distributed random sampling", "distributed reservoir sampling".)
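One of those boundary problems, the final chunk being shorter than `samplesize` (which makes `random.sample` raise `ValueError`), can at least be guarded against. The sketch below is my own variant, not the fix from the blog post, and it only prevents the crash; the statistical bias on a short final chunk remains:

```python
import itertools
import random

def iter_sample_dist_guarded(iterable, samplesize, chunksize=10000):
    """Like iter_sample_dist_naive, but a short final chunk no longer crashes.

    NOTE: still biased when the total length is not a multiple of chunksize;
    see the distributed-reservoir-sampling references for a uniform fix.
    """
    samples = []
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, chunksize))
        if not chunk:
            break
        # Never ask random.sample for more items than the chunk holds.
        samples += random.sample(chunk, min(samplesize, len(chunk)))
    return random.sample(samples, samplesize)
```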

```
Sampling 1000 from 10000
Using iter_sample_fast 0.0182 s
Using iter_sample_dist_naive 0.0017 s
Using iter_sample_easy 0.0009 s

Sampling 10000 from 100000
Using iter_sample_fast 0.1830 s
Using iter_sample_dist_naive 0.0402 s
Using iter_sample_easy 0.0103 s

Sampling 100000 from 1000000
Using iter_sample_fast 1.7965 s
Using iter_sample_dist_naive 0.6726 s
Using iter_sample_easy 0.2268 s

Sampling 200000 from 1000000
Using iter_sample_fast 1.7467 s
Using iter_sample_dist_naive 0.8209 s
Using iter_sample_easy 0.3297 s
```

Where we really win is when `samplesize` is small relative to `len(list(iterable))`:

```
Sampling 20 from 10000
Using iterSample 0.0202 s
Using sample_from_iterable 0.0047 s
Using iter_sample_fast 0.0196 s
Using iter_sample_easy 0.0001 s
Using iter_sample_dist_naive 0.0004 s

Sampling 20 from 100000
Using iterSample 0.2004 s
Using sample_from_iterable 0.0522 s
Using iter_sample_fast 0.1903 s
Using iter_sample_easy 0.0016 s
Using iter_sample_dist_naive 0.0029 s

Sampling 20 from 1000000
Using iterSample 1.9343 s
Using sample_from_iterable 0.4907 s
Using iter_sample_fast 1.9533 s
Using iter_sample_easy 0.0211 s
Using iter_sample_dist_naive 0.0319 s

Sampling 20 from 10000000
Using iterSample 18.6686 s
Using sample_from_iterable 4.8120 s
Using iter_sample_fast 19.3525 s
Using iter_sample_easy 0.3162 s
Using iter_sample_dist_naive 0.3210 s

Sampling 20 from 100000000
Using iter_sample_easy 2.8248 s
Using iter_sample_dist_naive 3.3817 s
```
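The absolute numbers above are machine- and Python-version-dependent. A minimal harness along these lines reproduces the shape of the experiment (the `bench` helper and the use of `time.perf_counter` are my assumptions, not the original measurement code):

```python
import random
import time

def iter_sample_easy(iterable, samplesize):
    return random.sample(list(iterable), samplesize)

def bench(fn, population, samplesize):
    """Time one call of fn on a fresh iterator over `population` ints."""
    gen = iter(range(population))
    start = time.perf_counter()
    result = fn(gen, samplesize)
    elapsed = time.perf_counter() - start
    assert len(result) == samplesize
    return elapsed

if __name__ == "__main__":
    for population, samplesize in [(10000, 1000), (100000, 10000)]:
        print("Sampling %d from %d" % (samplesize, population))
        print("Using iter_sample_easy %.4f s"
              % bench(iter_sample_easy, population, samplesize))
```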
