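As the opening chapter, the author's emphasis is on how to "state the problem precisely", which is also the first step of all program development: requirements analysis.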
1 Precise problem statement
1) input
A file containing at most 10^7 positive integers (each < 10^7); any integer occurring twice is an error; no other data is associated with the integers.
2) output
a sorted list in increasing order
3) constraints
at most 1 MB is available in main memory; ample disk storage is available; the run time can be at most several minutes, and a run time of 10 seconds need not be decreased
2 Program design
1) merge sort with work files
read the file once from the input, sort it with the aid of work files that are read and written many times, and then write it once
2) 40-pass algorithm
if we store each number in 4 bytes (a 32-bit int), 1 MB holds 250,000 numbers (1 MB / 4 bytes).
so we use a program that makes 40 passes over the input file. The first pass reads into memory the integers between 0 and 249,999, sorts them and writes them to the output; the 40th pass handles 9,750,000 to 9,999,999
3) read once without intermediate files
only if we could represent all the integers in the input file in 1MB of main memory
In fact, even with the bitmap data structure described below, 10^7 integers still need 10^7 bits = 1.25 MB (> 1 MB)
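The multi-pass idea in 2) can be sketched in a few lines. This is only an illustration under assumptions: the function name multipass_sort, the read_input callable (standing in for re-reading the input file on every pass) and the small chunk size are all made up for the example, not taken from the book.

```python
def multipass_sort(read_input, n, chunk):
    """Yield the integers of the input in increasing order.

    read_input: zero-argument callable returning a fresh iterator over the
                input (hypothetical stand-in for re-reading the file).
    n:          exclusive upper bound on the values.
    chunk:      how many distinct values we allow in memory per pass.
    """
    for lo in range(0, n, chunk):
        hi = min(lo + chunk, n)
        # Keep only the values that fall in this pass's range [lo, hi).
        in_range = [x for x in read_input() if lo <= x < hi]
        in_range.sort()
        yield from in_range


data = [7, 3, 9, 0, 5, 2]
result = list(multipass_sort(lambda: iter(data), n=10, chunk=4))
print(result)

# Number of passes for the real problem: ceil(10,000,000 / 250,000)
passes = -(-10_000_000 // 250_000)
print(passes)
```

Each pass reads the whole input but keeps at most chunk values in memory, which is the time-space tradeoff: less memory, more passes.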
3 Implementation sketch
1) bitmap
we use a bitmap data structure to represent the file as a string of ten million bits in which the ith bit is on if and only if the integer i is in the file
E.g. store the set {1, 2, 3, 5, 8, 13} in a string of 20 bits
0 1 1 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0
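Plain explanation: the integers 0 ~ 19 correspond to the 1st ~ 20th positions of the string; a position holds 1 if its integer is present and 0 otherwise. For example, if 12 were present, its position (the 13th position of the string) would hold 1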
2) pseudocode
// n is the number of bits in the vector (in this case 10,000,000)
// 1) initialize set to empty
for i = [0, n)
    bit[i] = 0
// 2) insert present elements into the set
for each i in the input file
    bit[i] = 1
// 3) write sorted output
for i = [0, n)
    if bit[i] == 1
        write i on the output file
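The pseudocode above maps fairly directly onto a packed bit vector. A minimal sketch, assuming the values are already available as a Python iterable and that a bytearray with 8 bits per byte is an acceptable bit vector; the name bitmap_sort is made up here. Like the pseudocode, it does not detect duplicates, it just collapses them.

```python
def bitmap_sort(values, n):
    """Return the integers in `values` (each in [0, n)) in increasing order."""
    # 1) initialize set to empty: n bits, packed 8 per byte, all 0
    bits = bytearray((n + 7) // 8)

    # 2) insert present elements into the set
    for v in values:
        if not 0 <= v < n:
            raise ValueError(f"value out of range: {v}")
        bits[v >> 3] |= 1 << (v & 7)

    # 3) write sorted output: scan the bits in index order
    return [i for i in range(n) if bits[i >> 3] & (1 << (i & 7))]


# The 20-bit example from 1): the set {1, 2, 3, 5, 8, 13}, given unsorted
example = bitmap_sort([13, 1, 8, 3, 5, 2], 20)
print(example)

# Size of the bit vector for the real problem, in bytes
vector_bytes = (10_000_000 + 7) // 8
print(vector_bytes)
```

The last line gives 1,250,000 bytes for n = 10,000,000, which is the 1.25 MB figure noted in section 2.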
4 Principles
a simple design = the right problem + bitmap data structure + multiple-pass algorithm + time-space tradeoff
5 Merge sort
it works as follows (from Wikipedia)
1) divide the unsorted list into n sublists, each containing 1 element (a list of 1 element is considered sorted).
2) repeatedly merge sublists to produce new sorted sublists until there is only 1 sublist remaining.
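The two steps above can be sketched as a bottom-up merge sort working in memory. This is an illustration only, not the disk-based variant with work files from section 2; the names merge and merge_sort are chosen for the example.

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    # One of the two lists is exhausted; append the rest of the other.
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def merge_sort(items):
    # 1) divide into sublists of one element each
    sublists = [[x] for x in items]
    if not sublists:
        return []
    # 2) repeatedly merge neighbouring sublists until one remains
    while len(sublists) > 1:
        merged = []
        for k in range(0, len(sublists), 2):
            if k + 1 < len(sublists):
                merged.append(merge(sublists[k], sublists[k + 1]))
            else:
                merged.append(sublists[k])  # odd one out, carried over
        sublists = merged
    return sublists[0]


sorted_items = merge_sort([5, 2, 8, 1, 9, 3])
print(sorted_items)
```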