[Repost] External Sorting for sorting large files in disk

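Most textbooks on external sorting explain it in terms of tape-based sorting. That theory serves as a foundation, but sorting on disk can differ. Below is an excerpt from a fairly simple article; for more detail, consult a book such as TAOCP. :-)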

http://exceptional-code.blogspot.com/2011/07/external-sorting-for-sorting-large.html

 

External Sorting for sorting large files in disk

 

Sorting is a fundamental programming task. Given the abundance of built-in libraries that perform tasks like sorting and binary search, we often become forgetful of exactly how these tasks are accomplished.

When the data is so large that it cannot be processed in memory all at once, we need to resort to the file system to store part or all of the data during the sorting process. We then need to perform another layer of disk operations on top of the regular sorting algorithms to manage the data as it gets sorted.

External Sorting is precisely the technique we described in the previous paragraph.

Let us describe in some detail how external sorting can be done in Java:

First the algorithm:

Say we have one file on disk containing N numbers (there could be more than one file, but a single file keeps the illustration simple). And suppose the memory in our computer can hold M numbers at a time.

1. Start reading the input file from the beginning.
2. Read M numbers (or fewer, if fewer than M entries remain in the file) and store them in a temp buffer.
3. Sort the numbers in the buffer from step 2, using any good sorting method (Quicksort, for example).
4. Create a temp file on disk and write the sorted numbers from step 3 to it. Save the name of the temp file.
5. Repeat steps 2 to 4 until all numbers from the input file have been read, sorted, and written out to temp files.
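
Steps 1 to 5 could be sketched in Java roughly as follows. This is only an illustration under some assumptions that the article does not state: the input is a text file with one integer per line, the numbers fit in a `long`, and the names `RunCreator`, `createSortedRuns` and `writeRun` are made up for this example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for steps 1-5; the class and method names are illustrative.
class RunCreator {

    /**
     * Splits the input into sorted temp files of at most m numbers each.
     * Assumes one decimal integer per line; blank lines are skipped.
     */
    static List<Path> createSortedRuns(Path input, int m) throws IOException {
        if (m <= 0) {
            throw new IllegalArgumentException("m must be positive");
        }
        List<Path> runs = new ArrayList<>();
        long[] buffer = new long[m];                 // the temp buffer of step 2
        try (BufferedReader in = Files.newBufferedReader(input)) {
            int count = 0;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue;
                }
                buffer[count++] = Long.parseLong(line);
                if (count == m) {                    // buffer full: do steps 3 and 4
                    runs.add(writeRun(buffer, count));
                    count = 0;
                }
            }
            if (count > 0) {                         // last chunk, possibly shorter than m
                runs.add(writeRun(buffer, count));
            }
        }
        return runs;                                 // the saved temp file names
    }

    private static Path writeRun(long[] buffer, int count) throws IOException {
        Arrays.sort(buffer, 0, count);               // step 3: in-memory sort
        Path run = Files.createTempFile("run-", ".txt");   // step 4: temp file on disk
        try (BufferedWriter out = Files.newBufferedWriter(run)) {
            for (int i = 0; i < count; i++) {
                out.write(Long.toString(buffer[i]));
                out.newLine();
            }
        }
        return run;
    }
}
```

For example, `RunCreator.createSortedRuns(inputPath, 1_000_000)` would return the list of temp files, each holding up to a million sorted numbers. The same `long[]` buffer is reused for every chunk, so memory use stays at M numbers no matter how large the input file is.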

At this point, we have sorted chunks of at most M numbers each (roughly N/M of them; only the last one may be shorter), stored in temp files on disk. We need to merge all these sorted files into one single sorted file. We will apply the merging algorithm from Merge Sort to join the numbers from these sorted files together.

6. Open all the temp files (and set each read pointer to the beginning of its file).
7. Find the minimum among the numbers currently pointed to by the files' read pointers.
8. Write that number to the output file on disk. (To increase efficiency you could write the number to a buffer first and flush the buffer to disk when it is full, but buffered I/O classes such as Java's BufferedWriter already do this for you.)
9. Read the next number from the file that contained the minimum in step 7. If that file has no numbers left, drop it from the set.
10. Repeat steps 7 to 9 until all numbers from all the temp files have been processed, merged, and written out to disk.
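
Steps 6 to 10 might look something like the sketch below. Note one substitution: instead of scanning every file's current number to find the minimum in step 7, it keeps those numbers in a `java.util.PriorityQueue` (a min-heap), which is a common way to do the same job when there are many temp files. The names `RunMerger`, `Head`, `mergeRuns` and `offerNext` are invented for this example, and the temp files are assumed to hold one integer per line in ascending order.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper for steps 6-10; the class and method names are illustrative.
class RunMerger {

    /** The number a file's read pointer is currently at, plus the reader it came from. */
    private static final class Head {
        final long value;
        final BufferedReader source;

        Head(long value, BufferedReader source) {
            this.value = value;
            this.source = source;
        }
    }

    /** Merges sorted files (one integer per line, ascending) into one sorted output file. */
    static void mergeRuns(List<Path> runs, Path output) throws IOException {
        List<BufferedReader> readers = new ArrayList<>();
        PriorityQueue<Head> heap =
                new PriorityQueue<>(Comparator.comparingLong((Head h) -> h.value));
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            for (Path run : runs) {                  // step 6: open every temp file
                BufferedReader reader = Files.newBufferedReader(run);
                readers.add(reader);
                offerNext(heap, reader);             // load each file's first number
            }
            while (!heap.isEmpty()) {                // step 10: loop until all are consumed
                Head min = heap.poll();              // step 7: smallest current number
                out.write(Long.toString(min.value)); // step 8: buffered write to disk
                out.newLine();
                offerNext(heap, min.source);         // step 9: refill from the same file
            }
        } finally {
            for (BufferedReader reader : readers) {
                reader.close();
            }
        }
    }

    /** Reads the next number from the reader, if any, and adds it to the heap. */
    private static void offerNext(PriorityQueue<Head> heap, BufferedReader reader)
            throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (!line.isEmpty()) {
                heap.add(new Head(Long.parseLong(line), reader));
                return;
            }
        }
        // End of file: nothing is added, so this file simply drops out of the merge.
    }
}
```

Only one number per temp file is held in the heap at any moment, so the merge needs very little memory beyond the I/O buffers. With k temp files, each output number costs O(log k) heap work rather than the O(k) of a plain scan.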

The new file on disk now contains a sorted list of the numbers supplied in the initial input file.


Happy external sorting!
