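Implementing external merge sort and its requirements
In-memory (internal) merge sort is not hard to write; the common forms are recursive and iterative. To keep things simple, we start from an iterative in-memory implementation (the following is a Java merge sort template found online):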
// Merge sort (Java, iterative version)
public static void merge_sort(int[] arr) {
    int[] orig = arr; // the caller's array; arr and result are swapped after every pass
    int len = arr.length;
    int[] result = new int[len];
    int block, start;
    for (block = 1; block < len * 2; block *= 2) {
        for (start = 0; start < len; start += 2 * block) {
            int low = start;
            int mid = (start + block) < len ? (start + block) : len;
            int high = (start + 2 * block) < len ? (start + 2 * block) : len;
            // start and end indices of the two blocks
            int start1 = low, end1 = mid;
            int start2 = mid, end2 = high;
            // merge the two sorted blocks
            while (start1 < end1 && start2 < end2) {
                result[low++] = arr[start1] < arr[start2] ? arr[start1++] : arr[start2++];
            }
            while (start1 < end1) {
                result[low++] = arr[start1++];
            }
            while (start2 < end2) {
                result[low++] = arr[start2++];
            }
        }
        int[] temp = arr;
        arr = result;
        result = temp;
    }
    // The sorted data may now live in the scratch array; copy it back to the caller's array.
    if (arr != orig) {
        System.arraycopy(arr, 0, orig, 0, len);
    }
}
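What we do next is make a few simple modifications to this algorithm and put a proxy layer in front of its memory accesses.
The code above starts with block = 1, but our initial runs are blocks that are already sorted, so we set the initial run length to block = intNumPerBlock.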
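The merge loop only works if every block is a sorted run before the first pass. The post does not show how the blocks are prepared, so the following is just a minimal sketch, assuming the simulated disk is an int[][] with one row per block. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the original code.
final class RunFormationSketch {
    /** Sorts every block on its own so that each block becomes one sorted run. */
    static void sortEachBlock(int[][] disk) {
        for (int[] block : disk) {
            Arrays.sort(block); // a single block is assumed to fit in memory
        }
    }
}
```

Calling RunFormationSketch.sortEachBlock(disk) once before the merge passes would leave runs of length intNumPerBlock, which is what the initial value of block assumes.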
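The in-memory algorithm above keeps the data being sorted in two arrays, arr[] and result[], whereas in an external merge the data lives in external storage.
In practice, each time a value is needed we first look for it in the cache buffers input0 and input1. On a hit, the value is merged into output, and whenever output fills up it is written to external storage.
To reuse the existing code, we introduce proxy functions and change it as follows: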
public void mergeSort() {
    /*
    Imitate the in-memory merge sort by hiding all memory access behind the buffer interface.
    */
    int len = blockNum * intNumPerBlock;
    int block, start;
    for (block = intNumPerBlock; block < len * 2; block *= 2) {
        output.resetBlock();
        for (start = 0; start < len; start += 2 * block) {
            int low = start;
            int mid = (start + block) < len ? (start + block) : len;
            int high = (start + 2 * block) < len ? (start + 2 * block) : len;
            int start1 = low, end1 = mid;
            int start2 = mid, end2 = high;
            while (start1 < end1 && start2 < end2) {
                output.set(low++, input0.get(start1) < input1.get(start2) ? input0.get(start1++) : input1.get(start2++));
            }
            while (start1 < end1) {
                output.set(low++, input0.get(start1++));
            }
            while (start2 < end2) {
                output.set(low++, input1.get(start2++));
            }
        }
        /*
        Some sorted results may still be in the cache because no further write triggers a flush,
        so the remainder has to be flushed to the disk.
        */
        output.flushToDisk();
        System.out.println();
        Disk.printDisk();
        /*
        Copy the result from temp to the two-dimensional disk matrix.
        Note: temp[][] and disk[][] are both assumed to reside in external memory.
        */
        Disk.switchTheDisk();
    }
}
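mergeSort() calls output.resetBlock() and output.flushToDisk(), which are not listed in this post. The following is only a sketch of how an output-side buffer with those two operations could behave, assuming a window of consecutive blocks that is written out whenever a write falls outside it. The class name, the target parameter (standing in for Disk.temp) and the flush policy are assumptions, not the author's code.

```java
// Hypothetical output buffer; `target` stands in for Disk.temp.
final class OutputBufferSketch {
    private final int[][] cache;        // window of buffered blocks
    private final int intNumPerBlock;
    private final int[][] target;
    private int startBlock = -1;        // first buffered block, -1 when empty
    private int endBlock = -1;          // one past the last buffered block
    int writeCount = 0;                 // number of window flushes

    OutputBufferSketch(int blockNum, int intNumPerBlock, int[][] target) {
        this.cache = new int[blockNum][intNumPerBlock];
        this.intNumPerBlock = intNumPerBlock;
        this.target = target;
    }

    /** Stores x at global index i, flushing the old window first on a miss. */
    void set(int i, int x) {
        int b = i / intNumPerBlock;
        if (b < startBlock || b >= endBlock) {
            flushToDisk();
            startBlock = b;
            endBlock = Math.min(b + cache.length, target.length);
        }
        cache[b - startBlock][i % intNumPerBlock] = x;
    }

    /** Writes the buffered window to the target; does nothing if the window is empty. */
    void flushToDisk() {
        if (startBlock < 0) {
            return;
        }
        writeCount++;
        for (int b = startBlock; b < endBlock; b++) {
            target[b] = cache[b - startBlock].clone();
        }
        resetBlock();
    }

    /** Marks the window as empty. */
    void resetBlock() {
        startBlock = -1;
        endBlock = -1;
    }
}
```

Because a merge pass writes indices in increasing order, every window is completely filled before it is flushed. A buffer that is written in arbitrary order would also need to track which rows are valid.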
Because the cache is refreshed as a whole on every update, startBlock and endBlock record which file blocks it currently holds.
import lombok.Data;

@Data
public class Buffer {
    private int blockNum;           // number of blocks the cache can hold
    private int startBlock;         // first cached disk block (inclusive), -1 when empty
    private int endBlock;           // end of the cached window (exclusive)
    private int intNumPerBlock = 5;
    private int[][] cache;
    private int readCount;
    private int writeCount;

    Buffer(int blockNum, int intNumPerBlock) {
        this.blockNum = blockNum;
        this.intNumPerBlock = intNumPerBlock;
        cache = new int[blockNum][intNumPerBlock];
        startBlock = -1;
        endBlock = -1;
        readCount = 0;
        writeCount = 0;
    }

    public int get(int i) {
        /*
        i is the index of the value on the disk;
        translate it into a two-dimensional index into the matrix that simulates the disk.
        */
        int targetBlock = i / intNumPerBlock;
        if (targetBlock < endBlock && targetBlock >= startBlock) {
            // Cache hit: the target int is already in the input cache.
            return cache[targetBlock - startBlock][i % intNumPerBlock];
        } else {
            // Cache miss: read the corresponding block(s) from the disk first.
            readBlockFromDisk(targetBlock);
            return cache[targetBlock - startBlock][i % intNumPerBlock];
        }
    }

    private void readBlockFromDisk(int targetBlock) {
        readCount++;
        startBlock = targetBlock;
        endBlock = Math.min(targetBlock + blockNum, Disk.blockNum);
        for (int i = 0; i < endBlock - startBlock; i++) {
            cache[i] = Disk.disk[startBlock + i].clone();
        }
    }

    public void set(int i, int x) {
        int targetBlock = i / intNumPerBlock;
        if (targetBlock < startBlock || targetBlock >= endBlock) {
            // Miss: write the current window out (if there is one), then move the
            // window so that it starts at the target block. Output data is never read back.
            if (startBlock >= 0) {
                writeBlockToDisk();
            }
            startBlock = targetBlock;
            endBlock = Math.min(targetBlock + blockNum, Disk.blockNum);
        }
        // Write to the output buffer.
        cache[targetBlock - startBlock][i % intNumPerBlock] = x;
    }

    private void writeBlockToDisk() {
        writeCount++;
        for (int i = startBlock; i < endBlock; i++) {
            Disk.temp[i] = cache[i - startBlock].clone();
        }
    }
}
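The index arithmetic shared by get() and set() can be looked at in isolation. The sketch below separates it into small helper functions, assuming the same layout as above (a global index i, intNumPerBlock values per block, and a cached window [startBlock, endBlock)). The class and method names are hypothetical.

```java
// Hypothetical helpers that mirror the index translation used in get() and set().
final class BlockIndexSketch {
    /** Block (row) that holds global index i. */
    static int blockOf(int i, int intNumPerBlock) {
        return i / intNumPerBlock;
    }

    /** Position of global index i inside its block. */
    static int offsetOf(int i, int intNumPerBlock) {
        return i % intNumPerBlock;
    }

    /** True if the block that holds i lies inside the cached window [startBlock, endBlock). */
    static boolean isHit(int i, int intNumPerBlock, int startBlock, int endBlock) {
        int block = blockOf(i, intNumPerBlock);
        return block >= startBlock && block < endBlock;
    }
}
```

For example, with intNumPerBlock = 5, index 13 maps to block 2 at offset 3. With the initial window of startBlock = endBlock = -1 nothing is a hit, which is why the first access always goes to the disk.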
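To keep the experiment simple, external storage is modelled as a two-dimensional array. blockNum is the number of blocks that the data to be sorted occupies in external storage, and intNumPerBlock is the number of values per block, so blockNum*intNumPerBlock values are sorted in total.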
public class Disk {
    public static int blockNum;
    private static int intNumPerBlock;
    public static int[][] disk;
    public static int[][] temp;

    Disk(int blockNum, int intNumPerBlock) {
        // These fields are static, so assign them through the class name rather than `this`.
        Disk.blockNum = blockNum;
        Disk.intNumPerBlock = intNumPerBlock;
        disk = new int[blockNum][intNumPerBlock];
        temp = new int[blockNum][intNumPerBlock];
        initializeDisk();
    }
    // initializeDisk(), printDisk() and switchTheDisk() are not shown here.
}
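Disk.switchTheDisk() is not listed in the post. One plausible way to make the output of a pass the input of the next one is to swap the two array references instead of copying block by block. This is a sketch under that assumption, with a hypothetical class name and fixed toy sizes.

```java
// Hypothetical stand-in for the Disk class; only the swap is sketched.
final class DiskSwapSketch {
    static int[][] disk = new int[2][2]; // input of the current pass
    static int[][] temp = new int[2][2]; // output of the current pass

    /** Swaps the roles of disk and temp so the next pass reads what was just written. */
    static void switchTheDisk() {
        int[][] previous = disk;
        disk = temp;
        temp = previous;
    }
}
```

With this approach, any input buffer that still caches blocks of the old disk array holds stale data after the swap, so its window would need to be invalidated (for example by resetting startBlock and endBlock to -1) before the next pass.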
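After implementing this idea and modifying the code, we tested it by generating 5*10 values to sort, and the program met the goal of the experiment.
For a more concrete analysis, we varied the parameters and compared the results:
With everything else unchanged, doubling outputBlockNum halves writeCount;
doubling BlockNum doubles both writeCount and readCount. Both observations are roughly consistent with the theoretical values.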
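As a rough cross-check of these trends, the following sketch estimates how many times the output window is written in one pass and how many passes the loop in mergeSort() performs. It assumes that the output buffer is written once each time its window of outputBlockNum blocks fills up, which is an assumption about the counting and not the measured figures from the experiment. The class and method names are hypothetical.

```java
// Hypothetical estimate of pass and flush counts; not measured data.
final class IoEstimateSketch {
    /** Number of passes, mirroring the loop bounds used in mergeSort(). */
    static int passes(int blockNum, int intNumPerBlock) {
        int len = blockNum * intNumPerBlock;
        int count = 0;
        for (int block = intNumPerBlock; block < len * 2; block *= 2) {
            count++;
        }
        return count;
    }

    /** Output flushes in one pass: ceil(blockNum / outputBlockNum). */
    static int flushesPerPass(int blockNum, int outputBlockNum) {
        return (blockNum + outputBlockNum - 1) / outputBlockNum;
    }
}
```

Under this assumption the per-pass flush count is inversely proportional to outputBlockNum and proportional to blockNum. The number of passes also grows with blockNum, but only logarithmically.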
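The readCount mentioned above is the sum of the readCount values of input0 and input1. When we look at them separately, input0 often reads more than input1. This is the expected result when a pass has an odd number of runs: the last run has no partner, so it is copied to the output through input0 alone.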
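The imbalance can be reproduced without any buffers by counting how many values each side contributes in one pass. The sketch below uses the same start/mid/high arithmetic as mergeSort(). It counts values consumed per side, which is only a proxy for the block reads counted by readCount, and the class and method names are hypothetical.

```java
// Hypothetical helper that counts the values taken from each input in one pass.
final class OddRunSketch {
    /** Returns {values taken through input0, values taken through input1}. */
    static int[] valuesPerSide(int len, int block) {
        int left = 0;
        int right = 0;
        for (int start = 0; start < len; start += 2 * block) {
            int mid = Math.min(start + block, len);
            int high = Math.min(start + 2 * block, len);
            left += mid - start;  // first run of the pair, read through input0
            right += high - mid;  // second run of the pair, read through input1
        }
        return new int[] {left, right};
    }
}
```

With 50 values and runs of length 10 there are five runs, so valuesPerSide(50, 10) gives 30 values on the input0 side and 20 on the input1 side. With an even number of runs, such as valuesPerSide(40, 10), both sides get 20.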
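This experiment implemented external merge sort. The main idea is to route all data exchange between memory and external storage through proxy functions, which makes the algorithm much easier to reason about and fits a layered programming style.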