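I. Preface:
1. Terminology:
· QS: QuickSort.
· HS: HeapSort.
· Hadoop version: the analysis is based on the Version 2.7.1 source code.
2. A recap of the HS (heap sort) idea:
Basics of an array-based heap with 0-based indices: the parent of node i is (i-1)/2 (integer division), the left child of node i is 2i+1, and the right child of node i is 2i+2.
An illustrated explanation is available at https://www.jianshu.com/p/7d037c332a9d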
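The index arithmetic above can be sketched in a few lines of Java. The class and method names below are hypothetical and exist only for illustration. The sketch also shows the 1-based variant (children at 2i and 2i+1), since the Hadoop downHeap code analyzed in this post computes children with i << 1 and then shifts by an offset to reach the array.

```java
// Sketch only: HeapIndexDemo and its method names are hypothetical.
class HeapIndexDemo {
    // 0-based layout: the root is at index 0.
    static int parent0(int i) { return (i - 1) / 2; }
    static int left0(int i)   { return 2 * i + 1; }
    static int right0(int i)  { return 2 * i + 2; }

    // 1-based layout: the root is node 1, children of node i are 2i and 2i+1.
    static int parent1(int i) { return i >>> 1; }
    static int left1(int i)   { return i << 1; }
    static int right1(int i)  { return (i << 1) + 1; }

    public static void main(String[] args) {
        // Array position 3 is node 4 in the 1-based layout.
        // Both computations name the same two array slots.
        System.out.println(left0(3) + " " + right0(3));             // prints "7 8"
        System.out.println((left1(4) - 1) + " " + (right1(4) - 1)); // prints "7 8"
    }
}
```

Subtracting 1 from a 1-based node number gives the 0-based array position, which is the role an offset of -1 plays when the range starts at index 0.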
3. Purpose
To analyze whether Hadoop's HS differs from the standard HS, and whether it contains any optimizations.
II. Content
Source code:
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.util;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

/**
 * An implementation of the core algorithm of HeapSort.
 */
@InterfaceAudience.Private
@InterfaceStability.Unstable
public final class HeapSort implements IndexedSorter {

  public HeapSort() { }

  private static void downHeap(final IndexedSortable s, final int b,
      int i, final int N) {
    for (int idx = i << 1; idx < N; idx = i << 1) {
      if (idx + 1 < N && s.compare(b + idx, b + idx + 1) < 0) {
        if (s.compare(b + i, b + idx + 1) < 0) {
          s.swap(b + i, b + idx + 1);
        } else return;
        i = idx + 1;
      } else if (s.compare(b + i, b + idx) < 0) {
        s.swap(b + i, b + idx);
        i = idx;
      } else return;
    }
  }

  /**
   * Sort the given range of items using heap sort.
   * {@inheritDoc}
   */
  @Override
  public void sort(IndexedSortable s, int p, int r) {
    sort(s, p, r, null);
  }

  @Override
  public void sort(final IndexedSortable s, final int p, final int r,
      final Progressable rep) {
    final int N = r - p;
    // build heap w/ reverse comparator, then write in-place from end
    final int t = Integer.highestOneBit(N);
    for (int i = t; i > 1; i >>>= 1) {
      for (int j = i >>> 1; j < i; ++j) {
        downHeap(s, p-1, j, N + 1);
      }
      if (null != rep) {
        rep.progress();
      }
    }
    for (int i = r - 1; i > p; --i) {
      s.swap(p, i);
      downHeap(s, p - 1, 1, i - p + 1);
    }
  }
}
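1. The sift-down function used to build the max-heap. The value p-1 passed in by sort(...) is assigned to b; since p is a 0-based start index, b is effectively -1 when the whole array is sorted (p = 0). b acts as an offset that maps the 1-based heap positions used inside the function onto array indices, which is why the children of node i are 2i (i << 1) and 2i+1 here rather than 2i+1 and 2i+2. i is the 1-based position of the branch node from which the max-heap is adjusted downward (the build phase starts each level at a power of two, 2^n). N is the array length + 1, i.e. an exclusive upper bound on the 1-based positions.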
private static void downHeap(final IndexedSortable s, final int b,
    int i, final int N) {
  for (int idx = i << 1; idx < N; idx = i << 1) {
    if (idx + 1 < N && s.compare(b + idx, b + idx + 1) < 0) {
      if (s.compare(b + i, b + idx + 1) < 0) {
        s.swap(b + i, b + idx + 1);
      } else return;
      i = idx + 1;
    } else if (s.compare(b + i, b + idx) < 0) {
      s.swap(b + i, b + idx);
      i = idx;
    } else return;
  }
}
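To see the sift-down in action without a Hadoop dependency, here is a minimal sketch. Sortable and IntArraySortable are hypothetical stand-ins for Hadoop's IndexedSortable, reduced to the compare and swap methods that downHeap calls; the downHeap body follows the same logic as the function discussed here. The input is chosen so that only the root violates the heap property.

```java
// Sketch only: Sortable and IntArraySortable are hypothetical stand-ins
// for Hadoop's IndexedSortable, not part of the Hadoop API.
class DownHeapDemo {
    interface Sortable {
        int compare(int i, int j);
        void swap(int i, int j);
    }

    static final class IntArraySortable implements Sortable {
        final int[] a;
        IntArraySortable(int[] a) { this.a = a; }
        public int compare(int i, int j) { return Integer.compare(a[i], a[j]); }
        public void swap(int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
    }

    // b is the offset (p - 1), i the 1-based node to sift down,
    // n the exclusive upper bound on 1-based positions.
    static void downHeap(Sortable s, int b, int i, int n) {
        for (int idx = i << 1; idx < n; idx = i << 1) {
            if (idx + 1 < n && s.compare(b + idx, b + idx + 1) < 0) {
                // Right child is the larger child.
                if (s.compare(b + i, b + idx + 1) < 0) {
                    s.swap(b + i, b + idx + 1);
                } else {
                    return;
                }
                i = idx + 1;
            } else if (s.compare(b + i, b + idx) < 0) {
                // Left child is the larger (or only) child.
                s.swap(b + i, b + idx);
                i = idx;
            } else {
                return;
            }
        }
    }

    // Sifts the root of a copy of data and returns the copy.
    static int[] siftRoot(int[] data) {
        int[] copy = data.clone();
        downHeap(new IntArraySortable(copy), -1, 1, copy.length + 1);
        return copy;
    }

    public static void main(String[] args) {
        int[] out = siftRoot(new int[] {1, 9, 8, 7, 6, 5, 4});
        System.out.println(java.util.Arrays.toString(out)); // prints [9, 7, 8, 1, 6, 5, 4]
    }
}
```

The value 1 moves from the root to array position 3: it is first swapped with 9 (its larger child), then with 7, and the loop stops because node 4 has no children within the bound.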
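2. The statement final int t = Integer.highestOneBit(N); computes the largest power of two that does not exceed the array length. The max-heap is then built level by level, with each power of two marking the start of a level. For example, with an array of length 10, Integer.highestOneBit(10) is 8, so the first pass of the outer loop runs downHeap from j=4 through j=7. The call downHeap(…) with j=4 adjusts array[3] and its left and right children array[7]/array[8]. Unlike the standard HS, which builds the max-heap in a single for loop running from the parent of the last leaf back to the root, Hadoop's HS proceeds one power-of-two level at a time and calls rep.progress() once per level. My understanding is that this ordering is meant to improve memory performance: when the array is large, for example for the nodes from j=512 to j=1023, memory performance should improve.
The code is as follows: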
for (int i = t; i > 1; i >>>= 1) {
  for (int j = i >>> 1; j < i; ++j) {
    downHeap(s, p-1, j, N + 1);
  }
  if (null != rep) {
    rep.progress();
  }
}
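The difference in visiting order is easier to see when the two loop shapes are run side by side. The sketch below only records which 1-based nodes would be passed to downHeap; the class and method names are hypothetical, and textbookOrder represents the common single-loop heap construction rather than any code from Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: BuildOrderDemo and its method names are hypothetical.
class BuildOrderDemo {
    // 1-based nodes in the order the level-by-level loops visit them.
    static List<Integer> levelOrder(int n) {
        List<Integer> visited = new ArrayList<>();
        for (int i = Integer.highestOneBit(n); i > 1; i >>>= 1) {
            for (int j = i >>> 1; j < i; ++j) {
                visited.add(j);
            }
        }
        return visited;
    }

    // 1-based nodes in the order of the textbook single loop:
    // from the parent of the last leaf down to the root.
    static List<Integer> textbookOrder(int n) {
        List<Integer> visited = new ArrayList<>();
        for (int j = n / 2; j >= 1; --j) {
            visited.add(j);
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(levelOrder(10));    // prints [4, 5, 6, 7, 2, 3, 1]
        System.out.println(textbookOrder(10)); // prints [5, 4, 3, 2, 1]
    }
}
```

For a length of 10, the level-by-level loops also visit nodes 6 and 7, which have no children; downHeap returns immediately for them because their first child index is already past the bound. Within each level the nodes are visited in ascending order, whereas the textbook loop walks strictly backward.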
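The maximum is now at the top of the tree, so it is swapped to array[length-1]. The heap then shrinks by one element and the new root is sifted down. Once the loop has worked its way back to the first element, the array is sorted in ascending order. The code is as follows: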
for (int i = r - 1; i > p; --i) {
  s.swap(p, i);
  downHeap(s, p - 1, 1, i - p + 1);
}
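Putting the build phase and the extraction phase together, the whole procedure can be tried on a plain int[]. This is a sketch under simplifying assumptions, not the Hadoop class: it works directly on an array instead of an IndexedSortable, omits progress reporting, and writes downHeap in a slightly different but equivalent form (pick the larger child first, then compare once with the parent). The class name is hypothetical.

```java
import java.util.Arrays;

// Sketch only: RangeHeapSortDemo is a hypothetical, simplified adaptation.
class RangeHeapSortDemo {
    static void swap(int[] a, int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    // b is the offset (p - 1), i the 1-based node to sift down,
    // n the exclusive upper bound on 1-based positions.
    static void downHeap(int[] a, int b, int i, int n) {
        for (int idx = i << 1; idx < n; idx = i << 1) {
            // Pick the larger of the (up to two) children.
            int child = (idx + 1 < n && a[b + idx] < a[b + idx + 1]) ? idx + 1 : idx;
            if (a[b + i] >= a[b + child]) {
                return; // heap property already holds here
            }
            swap(a, b + i, b + child);
            i = child;
        }
    }

    // Sorts a[p..r) in ascending order; r is exclusive.
    static void sort(int[] a, int p, int r) {
        final int n = r - p;
        // Build phase: one power-of-two level at a time, deepest level first.
        for (int i = Integer.highestOneBit(n); i > 1; i >>>= 1) {
            for (int j = i >>> 1; j < i; ++j) {
                downHeap(a, p - 1, j, n + 1);
            }
        }
        // Extraction phase: move the maximum to the end, shrink the heap, repair the root.
        for (int i = r - 1; i > p; --i) {
            swap(a, p, i);
            downHeap(a, p - 1, 1, i - p + 1);
        }
    }

    public static void main(String[] args) {
        int[] data = {5, 3, 9, 1, 7, 2, 8, 6, 4, 0};
        sort(data, 0, data.length);
        System.out.println(Arrays.toString(data)); // prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

        int[] part = {9, 8, 7, 6, 5, 4, 3};
        sort(part, 2, 6); // only positions 2..5 are touched
        System.out.println(Arrays.toString(part)); // prints [9, 8, 4, 5, 6, 7, 3]
    }
}
```

The second call shows why the offset is p-1 rather than a fixed -1: with p = 2 the heap root (node 1) maps to array position 2, so elements outside the range stay where they are.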