Choosing a Sorting Algorithm

There's no one algorithm that's clearly the "best" algorithm. It depends on a bunch of factors.

 

For starters, can you fit your data into main memory? If you can't, then you'd need to rely on an external sorting algorithm. These algorithms are often based on quicksort and mergesort. [Translator's note: if the data doesn't fit, depending on the size and type of the data set, load it with a dedicated database or use a cloud-based service such as Google's BigQuery.]

 
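The chunk-and-merge idea behind external mergesort can be sketched entirely in memory. This is a toy illustration (the name `chunked_merge_sort` is made up for this sketch); a real external sort would stream each sorted run to a temporary file and merge from disk:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

// Sketch of external mergesort: sort fixed-size chunks ("runs")
// independently, then k-way merge them. In a real external sort the runs
// would live in temporary files rather than in memory.
std::vector<int> chunked_merge_sort(std::vector<int> data, std::size_t chunk) {
    std::vector<std::vector<int>> runs;
    for (std::size_t i = 0; i < data.size(); i += chunk) {
        std::size_t end = std::min(i + chunk, data.size());
        std::vector<int> run(data.begin() + i, data.begin() + end);
        std::sort(run.begin(), run.end());   // each run fits "in memory"
        runs.push_back(std::move(run));
    }
    // k-way merge with a min-heap of (value, run index, offset) triples.
    using Item = std::tuple<int, std::size_t, std::size_t>;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.emplace(runs[r][0], r, 0);
    std::vector<int> out;
    while (!heap.empty()) {
        auto [v, r, off] = heap.top();
        heap.pop();
        out.push_back(v);
        if (off + 1 < runs[r].size()) heap.emplace(runs[r][off + 1], r, off + 1);
    }
    return out;
}
```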

Second, do you know anything about your input distribution? If it's mostly sorted, then something like Timsort might be a great option, since it's designed to work well on sorted data. If it's mostly random, Timsort is probably not a good choice.

 
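Timsort itself is not in the C++ standard library (it's the default sort in Python and Java), but you can cheaply estimate how pre-sorted the input is before committing to an algorithm. The helper below (`mostly_sorted` is a hypothetical name and a 90% threshold is an arbitrary choice) shows one way:

```cpp
#include <algorithm>
#include <vector>

// Estimate pre-sortedness by finding the longest sorted prefix. A more
// thorough check would count sorted runs, but this captures the idea.
bool mostly_sorted(const std::vector<int>& v, double threshold = 0.9) {
    if (v.empty()) return true;
    auto it = std::is_sorted_until(v.begin(), v.end());
    double sorted_fraction =
        static_cast<double>(it - v.begin()) / static_cast<double>(v.size());
    return sorted_fraction >= threshold;
}
```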

Third, what kind of elements are you sorting? If you are sorting generic objects, then you're pretty much locked into comparison sorting. If not, perhaps you could use a non-comparison sort like counting sort or radix sort.

 
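For instance, when the keys are small non-negative integers, counting sort runs in O(n + k) time with no comparisons at all (a minimal sketch; it only applies when the key range k is known and modest):

```cpp
#include <vector>

// Counting sort for keys in [0, max_key]: build a histogram of key
// frequencies, then emit each key as many times as it occurred.
std::vector<int> counting_sort(const std::vector<int>& data, int max_key) {
    std::vector<int> counts(max_key + 1, 0);
    for (int x : data) ++counts[x];          // histogram pass
    std::vector<int> out;
    out.reserve(data.size());
    for (int key = 0; key <= max_key; ++key)
        out.insert(out.end(), counts[key], key);
    return out;
}
```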

Fourth, how many cores do you have? Some sorting algorithms (quicksort, mergesort, MSD radix sort) parallelize really well, while others (such as heapsort) do not.

 
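A minimal sketch of why mergesort parallelizes well: the two halves are independent, so they can be sorted on separate threads and then merged. Real parallel sorts recurse and manage a thread pool; this two-way split just shows the idea (the name `parallel_merge_sort` is illustrative):

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Sort the two halves concurrently, then merge. The halves share no data,
// so no synchronization beyond joining the future is needed.
void parallel_merge_sort(std::vector<int>& v) {
    if (v.size() < 2) return;
    auto mid = v.begin() + v.size() / 2;
    std::vector<int> left(v.begin(), mid), right(mid, v.end());
    auto fut = std::async(std::launch::async,
                          [&left] { std::sort(left.begin(), left.end()); });
    std::sort(right.begin(), right.end());   // sort the other half here
    fut.wait();
    std::merge(left.begin(), left.end(), right.begin(), right.end(), v.begin());
}
```

In practice, since C++17 you would more likely reach for `std::sort(std::execution::par, ...)` and let the standard library handle the parallel decomposition.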

Fifth, how are your data represented? If they're stored in an array, quicksort or a quicksort variant will likely do well because of locality of reference, while mergesort might be slow due to the extra memory needed. If they're in a linked list, though, the locality of reference from quicksort goes away and mergesort suddenly becomes competitive again.

 
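The C++ standard library reflects this split directly: `std::sort` requires random-access iterators, so it works on vectors (and is typically an introsort), while `std::list` provides its own `sort()` member, usually implemented as a mergesort, since mergesort only needs sequential access and can relink nodes instead of copying elements:

```cpp
#include <algorithm>
#include <list>
#include <vector>

// The container dictates the algorithm: random-access introsort for the
// vector, node-splicing mergesort for the list.
void sort_both(std::vector<int>& v, std::list<int>& l) {
    std::sort(v.begin(), v.end());  // needs random-access iterators
    l.sort();                       // member mergesort on linked nodes
}
```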

The best option is probably to take a lot of different factors into account and then make a decision from there. One of the reasons it's so fun to design and study algorithms is that there's rarely one single best choice; often, the best option depends a ton on your particular situation and changes based on what you're seeing.

 

(You mentioned a few details about quicksort, heapsort, and mergesort that I wanted to touch on before wrapping up this answer. While quicksort does have a degenerate O(n²) worst case, there are many ways to avoid it. The introsort algorithm keeps track of the recursion depth and switches to heapsort if the quicksort looks like it's degenerating. This guarantees O(n log n) worst-case behavior with low memory overhead and maximizes the benefit you get from quicksort. Randomized quicksort, while still having an O(n²) worst case, has a vanishingly small probability of actually hitting that worst case.

 
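`std::sort` is typically implemented this way. A toy version of the depth-limit trick (using a simple last-element Lomuto partition for brevity; a production introsort picks pivots more carefully and also switches to insertion sort on tiny ranges) might look like:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Quicksort that tracks recursion depth and falls back to heapsort past
// roughly 2*log2(n) levels, guaranteeing O(n log n) in the worst case.
void introsort_impl(std::vector<int>& v, int lo, int hi, int depth_limit) {
    if (hi - lo < 2) return;
    if (depth_limit == 0) {                      // quicksort degenerating:
        std::make_heap(v.begin() + lo, v.begin() + hi);
        std::sort_heap(v.begin() + lo, v.begin() + hi);  // finish as heapsort
        return;
    }
    int pivot = v[hi - 1], i = lo;               // Lomuto partition
    for (int j = lo; j < hi - 1; ++j)
        if (v[j] < pivot) std::swap(v[i++], v[j]);
    std::swap(v[i], v[hi - 1]);
    introsort_impl(v, lo, i, depth_limit - 1);
    introsort_impl(v, i + 1, hi, depth_limit - 1);
}

void introsort(std::vector<int>& v) {
    int limit = 2 * static_cast<int>(
        std::log2(std::max<std::size_t>(v.size(), 1)) + 1);
    introsort_impl(v, 0, static_cast<int>(v.size()), limit);
}
```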

Heapsort is a good algorithm in practice, but isn't as fast as the other algorithms in some cases because it doesn't have good locality of reference. That said, the fact that it never degenerates and needs only O(1) auxiliary space is a huge selling point.

 
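In C++, heapsort is two standard-library calls, which also makes the O(1) auxiliary-space claim concrete: everything happens in place within the input range:

```cpp
#include <algorithm>
#include <vector>

// Build a max-heap in place, then repeatedly move the max to the back.
// O(n log n) always, O(1) auxiliary space; the trade-off is the heap's
// jumpy access pattern, which has poor cache locality.
void heapsort(std::vector<int>& v) {
    std::make_heap(v.begin(), v.end());  // O(n) heapify, in place
    std::sort_heap(v.begin(), v.end());  // n pops, each O(log n)
}
```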

Mergesort does need a lot of auxiliary memory, which is one reason you might not want to use it when you have a huge amount of data to sort. It's worth knowing about, though, since its variants are widely used.)

 
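A top-down mergesort sketch makes it explicit where that auxiliary memory goes: the merge step writes into an O(n) scratch buffer. (In-place merging is possible but substantially more complicated and slower in practice.)

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Recursive mergesort over v[lo, hi); the merge step needs the O(n)
// scratch buffer, which is the algorithm's auxiliary-memory cost.
void merge_sort(std::vector<int>& v, std::vector<int>& scratch, int lo, int hi) {
    if (hi - lo < 2) return;
    int mid = lo + (hi - lo) / 2;
    merge_sort(v, scratch, lo, mid);
    merge_sort(v, scratch, mid, hi);
    std::merge(v.begin() + lo, v.begin() + mid,
               v.begin() + mid, v.begin() + hi,
               scratch.begin() + lo);            // merge into extra space
    std::copy(scratch.begin() + lo, scratch.begin() + hi, v.begin() + lo);
}

void merge_sort(std::vector<int>& v) {
    std::vector<int> scratch(v.size());          // the O(n) auxiliary buffer
    merge_sort(v, scratch, 0, static_cast<int>(v.size()));
}
```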

