A Brief Analysis of the Basic Sorting Algorithm in Hadoop (HeapSort)

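I. Preface:

1.        Terminology:

·        QS: QuickSort

·        HS: HeapSort

·        Hadoop version: the analysis is based on the Version 2.7.1 source code

2.       Review of the HS (heap sort) idea:

Basics of an array-based heap (with 0-based indices): the parent of node i is (i-1)/2 using integer division, the left child of node i is 2i+1, and the right child of node i is 2i+2.

An illustrated explanation is available at https://www.jianshu.com/p/7d037c332a9d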
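The index arithmetic above can be sketched in a few lines of Java. This is only an illustration: the class name HeapIndexDemo and the helper isMaxHeap are made up for this example and are not part of Hadoop.

```java
final class HeapIndexDemo {
    // 0-based array heap: index of the parent and of both children of node i.
    static int parent(int i) { return (i - 1) / 2; }
    static int left(int i)   { return 2 * i + 1; }
    static int right(int i)  { return 2 * i + 2; }

    // Returns true if every node in a[0..n) is <= its parent (max-heap property).
    static boolean isMaxHeap(int[] a, int n) {
        for (int i = 1; i < n; i++) {
            if (a[parent(i)] < a[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] heap = {9, 7, 8, 3, 5, 6, 1};
        // Node 4 has parent 1; node 1 has children 3 and 4.
        System.out.println(parent(4) + " " + left(1) + " " + right(1)); // prints "1 3 4"
        System.out.println(isMaxHeap(heap, heap.length));               // prints "true"
    }
}
```

With 1-based indices the same relations become parent i/2, left child 2i and right child 2i+1.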

3.       Purpose

Does Hadoop's HS differ from the standard HS? Does it contain any optimizations?

II. Content

Source code:

/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.util;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

/**
 * An implementation of the core algorithm of HeapSort.
 */
@InterfaceAudience.Private
@InterfaceStability.Unstable
public final class HeapSort implements IndexedSorter {

  public HeapSort() { }

  private static void downHeap(final IndexedSortable s, final int b,
      int i, final int N) {
    for (int idx = i << 1; idx < N; idx = i << 1) {
      if (idx + 1 < N && s.compare(b + idx, b + idx + 1) < 0) {
        if (s.compare(b + i, b + idx + 1) < 0) {
          s.swap(b + i, b + idx + 1);
        } else return;
        i = idx + 1;
      } else if (s.compare(b + i, b + idx) < 0) {
        s.swap(b + i, b + idx);
        i = idx;
      } else return;
    }
  }

  /**
   * Sort the given range of items using heap sort.
   * {@inheritDoc}
   */
  @Override
  public void sort(IndexedSortable s, int p, int r) {
    sort(s, p, r, null);
  }

  @Override
  public void sort(final IndexedSortable s, final int p, final int r,
      final Progressable rep) {
    final int N = r - p;
    // build heap w/ reverse comparator, then write in-place from end
    final int t = Integer.highestOneBit(N);
    for (int i = t; i > 1; i >>>= 1) {
      for (int j = i >>> 1; j < i; ++j) {
        downHeap(s, p-1, j, N + 1);
      }
      if (null != rep) {
        rep.progress();
      }
    }
    for (int i = r - 1; i > p; --i) {
      s.swap(p, i);
      downHeap(s, p - 1, 1, i - p + 1);
    }
  }
}

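1.        The sift-down function used to build the max-heap. sort(), the entry point that callers such as Hadoop's QS use, passes p-1 as b, so that the 1-based heap index k maps to array position b+k; when p, the 0-based start index of the range, is 0, b is simply -1. i is the 1-based heap index of the node to sift down, and its children are 2i (i << 1) and 2i+1. During the build phase it is called for the nodes of one tree level at a time, that is, for j from 2^(n-1) to 2^n - 1. N is the number of elements in the range plus 1, which makes it an exclusive upper bound on the 1-based heap indices.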

  private static void downHeap(final IndexedSortable s, final int b,
      int i, final int N) {
    for (int idx = i << 1; idx < N; idx = i << 1) {
      if (idx + 1 < N && s.compare(b + idx, b + idx + 1) < 0) {
        if (s.compare(b + i, b + idx + 1) < 0) {
          s.swap(b + i, b + idx + 1);
        } else return;
        i = idx + 1;
      } else if (s.compare(b + i, b + idx) < 0) {
        s.swap(b + i, b + idx);
        i = idx;
      } else return;
    }
  }
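To make the roles of b, i and N concrete, here is a minimal sketch that ports the same control flow to a plain int[]. It assumes direct int comparison in place of IndexedSortable.compare, and the class name DownHeapDemo is hypothetical.

```java
import java.util.Arrays;

final class DownHeapDemo {
    // Sift the node at 1-based heap index i down.
    // Heap index k lives at a[b + k]; n is one past the last valid heap index.
    static void downHeap(int[] a, int b, int i, int n) {
        for (int idx = i << 1; idx < n; idx = i << 1) {
            if (idx + 1 < n && a[b + idx] < a[b + idx + 1]) {
                // The right child is the larger child.
                if (a[b + i] < a[b + idx + 1]) {
                    swap(a, b + i, b + idx + 1);
                } else {
                    return;
                }
                i = idx + 1;
            } else if (a[b + i] < a[b + idx]) {
                // The left child is the larger (or only) child.
                swap(a, b + i, b + idx);
                i = idx;
            } else {
                return;
            }
        }
    }

    static void swap(int[] a, int x, int y) {
        int tmp = a[x];
        a[x] = a[y];
        a[y] = tmp;
    }

    public static void main(String[] args) {
        int[] a = {1, 9, 8, 3, 5};
        // The range starts at p = 0, so b = p - 1 = -1 and N = length + 1.
        downHeap(a, -1, 1, a.length + 1);
        System.out.println(Arrays.toString(a)); // prints "[9, 5, 8, 3, 1]"
    }
}
```

Here the root value 1 is first swapped with its larger child 9 and then with 5, which leaves a valid max-heap.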

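2.        The statement final int t = Integer.highestOneBit(N); computes the largest power of two that does not exceed the number of elements N. The max-heap is then built one power-of-two block of heap indices, that is, one tree level, at a time. For example, if the array length is 10, Integer.highestOneBit(10) is 8, so the first pass of the outer loop runs from j=4 through j=7, and the first call to downHeap (j=4) adjusts array[3] together with its left and right children array[7] and array[8]. This differs from the standard HS, which builds the max-heap in a single descending for loop that starts at the parent of the last leaf. Hadoop's HS works in power-of-two blocks, level by level from the bottom of the tree upward, and calls rep.progress() once after each level. My understanding is that this is done to improve memory access performance: when the array is large, for example for the block from j=512 to j=1023, I would expect memory performance to improve.

The code is as follows: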

    for (int i = t; i > 1; i >>>= 1) {
      for (int j = i >>> 1; j < i; ++j) {
        downHeap(s, p-1, j, N + 1);
      }
      if (null != rep) {
        rep.progress();
      }
    }
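The difference in visiting order can be checked with a small sketch. It assumes a plain int[] with 1-based heap index k stored at a[k - 1]; the class LevelOrderBuildDemo and all of its methods are hypothetical names for this illustration, and siftDown is a compact equivalent of downHeap rather than Hadoop's code.

```java
import java.util.ArrayList;
import java.util.List;

final class LevelOrderBuildDemo {
    // 1-based heap indices in the order the level-by-level loop sifts them.
    static List<Integer> visitOrder(int n) {
        List<Integer> order = new ArrayList<>();
        for (int i = Integer.highestOneBit(n); i > 1; i >>>= 1) {
            for (int j = i >>> 1; j < i; ++j) {
                order.add(j);
            }
        }
        return order;
    }

    // Order used by the textbook build: from the last parent down to the root.
    static List<Integer> standardOrder(int n) {
        List<Integer> order = new ArrayList<>();
        for (int j = n / 2; j >= 1; --j) {
            order.add(j);
        }
        return order;
    }

    // Sift 1-based heap index i down in a heap of n elements.
    static void siftDown(int[] a, int i, int n) {
        while (2 * i <= n) {
            int c = 2 * i;
            if (c + 1 <= n && a[c - 1] < a[c]) {
                c++; // pick the larger child
            }
            if (a[i - 1] >= a[c - 1]) {
                return;
            }
            int tmp = a[i - 1];
            a[i - 1] = a[c - 1];
            a[c - 1] = tmp;
            i = c;
        }
    }

    // Build a max-heap using the level-by-level order.
    static void buildLevelByLevel(int[] a) {
        for (int j : visitOrder(a.length)) {
            siftDown(a, j, a.length);
        }
    }

    static boolean isMaxHeap(int[] a) {
        for (int k = 2; k <= a.length; k++) {
            if (a[k / 2 - 1] < a[k - 1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(visitOrder(10));    // prints "[4, 5, 6, 7, 2, 3, 1]"
        System.out.println(standardOrder(10)); // prints "[5, 4, 3, 2, 1]"
        int[] a = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3};
        buildLevelByLevel(a);
        System.out.println(isMaxHeap(a));      // prints "true"
    }
}
```

For length 10 the level-by-level order also visits nodes 6 and 7, which have no children, so those calls return immediately. Both orders sift a node only after all deeper nodes, which is why both yield a valid max-heap.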

 

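Now the maximum value is at the top of the tree, so it is swapped to array[length-1], and the new root is sifted down over the remaining elements. Once the loop has worked its way down to the first element, the array is sorted in ascending order. The code is as follows: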

    for (int i = r - 1; i > p; --i) {
      s.swap(p, i);
      downHeap(s, p - 1, 1, i - p + 1);
    }
  }
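Putting the two phases together, the following sketch sorts a sub-range [p, r) of an int[] in the same way. It is a simplified port under stated assumptions, not Hadoop's class: int comparison replaces IndexedSortable, progress reporting is left out, downHeap is condensed, and the class name RangeHeapSortDemo is hypothetical.

```java
import java.util.Arrays;

final class RangeHeapSortDemo {
    // Condensed sift-down: heap index k lives at a[b + k], n is exclusive.
    static void downHeap(int[] a, int b, int i, int n) {
        for (int idx = i << 1; idx < n; idx = i << 1) {
            // c is the larger child of i (or the only child).
            int c = (idx + 1 < n && a[b + idx] < a[b + idx + 1]) ? idx + 1 : idx;
            if (a[b + i] >= a[b + c]) {
                return;
            }
            swap(a, b + i, b + c);
            i = c;
        }
    }

    static void swap(int[] a, int x, int y) {
        int tmp = a[x];
        a[x] = a[y];
        a[y] = tmp;
    }

    // Sort a[p..r) in ascending order; elements outside the range are untouched.
    static void sort(int[] a, int p, int r) {
        final int n = r - p;
        // Phase 1: build the max-heap level by level, deepest parents first.
        for (int i = Integer.highestOneBit(n); i > 1; i >>>= 1) {
            for (int j = i >>> 1; j < i; ++j) {
                downHeap(a, p - 1, j, n + 1);
            }
        }
        // Phase 2: move the maximum to the end, then restore the shrunken heap.
        for (int i = r - 1; i > p; --i) {
            swap(a, p, i);
            downHeap(a, p - 1, 1, i - p + 1);
        }
    }

    public static void main(String[] args) {
        int[] a = {99, 5, 2, 8, 1, 9, 3, -1};
        sort(a, 1, 7); // here p = 1, so b = p - 1 = 0 rather than -1
        System.out.println(Arrays.toString(a)); // prints "[99, 1, 2, 3, 5, 8, 9, -1]"
    }
}
```

The example deliberately uses p = 1 to show that b is -1 only when the range starts at index 0.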


