Heap Sort and the Top-K Problem

Find the K largest values in an array of one billion numbers.

Step 1: remove duplicates with a hash set
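A minimal sketch of the deduplication step, assuming the values fit in memory and that each distinct value should be counted once. The function name `dedupe` is chosen here for illustration only; Python's built-in `set` is hash-based, which is what makes the membership test cheap.

```python
def dedupe(values):
    # Keep the first occurrence of each value, preserving input order.
    seen = set()      # hash-based set: average O(1) membership test
    unique = []
    for v in values:
        if v not in seen:
            seen.add(v)
            unique.append(v)
    return unique

print(dedupe([3, 2, 7, 3, 2, 9]))  # [3, 2, 7, 9]
```

For an input that does not fit in memory, the same idea would have to be applied per chunk or per hash bucket; that is outside this sketch.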

Solution 1: Partitioning

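def partition(L, left, right):
    # Partition L[left..right] in descending order around the pivot L[left]
    # and return the pivot's final index.
    low = left
    if left < right:
        key = L[left]
        high = right
        while low < high:
            # Scan from the right for a value larger than the pivot.
            while low < high and L[high] <= key:
                high -= 1
            L[low] = L[high]
            # Scan from the left for a value smaller than the pivot.
            while low < high and L[low] >= key:
                low += 1
            L[high] = L[low]
        L[low] = key
    # Return outside the if, so a one-element range yields its index, not None.
    return low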

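def topK(L, K):
    # Rearrange L in place so that L[0:K] holds the K largest values.
    if K <= 0 or K >= len(L):
        return  # nothing to select: the answer is empty or the whole list
    low = 0
    high = len(L) - 1
    j = partition(L, low, high)
    while j != K and low < high:
        if K > j:
            low = j + 1   # L[0..j] already belong to the K largest
        else:
            high = j - 1  # the boundary lies to the left of the pivot
        j = partition(L, low, high)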


if __name__ == "__main__":
    L = [3,2,7,4,6,5,1,8,0, 19, 23, 4, 5, 23, 3, 4, 0,1,2,3,45,6,5,34,212,3234,234,3,4,4,3,43,43,343,34,34,343,43,2]
    n = 2  # number of largest values to find
    topK(L, n)
    print('result: %s' % L[0:n])

result: [3234, 343]


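Idea: this borrows the partitioning step of quicksort. Each pass takes the leftmost value of the current range, L[left], as the pivot and finds a split position low such that every value to the left of L[low] is greater than or equal to the pivot and every value to the right is less than or equal to it. The search then continues in whichever side contains index K, until the split position equals K; at that point L[0:K] holds the K largest values. Each partition pass scans the current range once and the range shrinks after every pass, so the average time complexity is O(N); with consistently poor pivots it degrades to O(N^2).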
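A rough sketch of where the O(N) figure comes from, under the idealized assumption that every partition halves the range still being searched:

```latex
T(N) \approx N + \frac{N}{2} + \frac{N}{4} + \cdots \le 2N = O(N)
```

If instead every pivot lands at one end of its range (for example, an already sorted input with the leftmost value as the pivot), the range shrinks by only one element per pass and the cost becomes N + (N-1) + ... + 1, which is O(N^2).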

Advantage: the lowest average time complexity of the two solutions.

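Solution 2: Heap

Idea: build a heap from the first K values, then scan the rest of the array. To keep the K largest values, the heap has to be a min-heap, so that its top is the smallest of the K current candidates (a max-heap of size K is the mirror-image tool for finding the K smallest values). If the next value is larger than the top, it replaces the top and the heap is adjusted immediately; otherwise it cannot be among the K largest and is discarded. Each adjustment costs O(log(K)), so the time complexity is O(N * log(K)), which is at most O(N * log(N)).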
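The size-K heap idea can be sketched with the standard library's `heapq` module, which implements a min-heap. This is an illustrative sketch, not the heap code that follows in this post; the function name `top_k_heap` is an assumption made for the example.

```python
import heapq

def top_k_heap(values, k):
    # Return the k largest values, largest first, using a size-k min-heap.
    if k <= 0:
        return []
    heap = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)       # still filling the first k slots
        elif v > heap[0]:
            heapq.heapreplace(heap, v)    # evict the smallest candidate
        # otherwise v cannot be among the k largest, so it is discarded
    return sorted(heap, reverse=True)

print(top_k_heap([3, 2, 7, 4, 6, 5, 1, 8, 0, 19, 23], 3))  # [23, 19, 8]
```

Only K values are held at any time, so this variant also works when the input is streamed rather than loaded as one array.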

A basic heap sort implementation:

#!/usr/bin/python
# coding: utf-8

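# Build a max-heap from lists[0:size].
def build_heap(lists, size):
    # Sift down every non-leaf node, starting from the last one.
    for i in range(size // 2 - 1, -1, -1):
        adjust_heap(lists, i, size)

# Sift the node at index i down until the subtree rooted there is a max-heap.
def adjust_heap(lists, i, size):
    lchild = 2 * i + 1
    rchild = 2 * i + 2
    largest = i  # named largest to avoid shadowing the built-in max()
    if i < size // 2:
        if lchild < size and lists[lchild] > lists[largest]:
            largest = lchild
        if rchild < size and lists[rchild] > lists[largest]:
            largest = rchild
        if largest != i:
            lists[largest], lists[i] = lists[i], lists[largest]
            adjust_heap(lists, largest, size)

# Heap sort: sorts lists in place in ascending order and returns it.
def heap_sort(lists):
    size = len(lists)
    build_heap(lists, size)
    for i in range(size - 1, 0, -1):
        # Move the current maximum to the end, then re-heapify the rest.
        lists[0], lists[i] = lists[i], lists[0]
        adjust_heap(lists, 0, i)
    return lists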


a = [2,3,4,5,6,7,8,9,1,2,34,5,4,54,5,45,4,5,45,4,5,646,456,45,6,45,645,6,45,6,456,45,6,323,412,3,25,5,7,68,6,78,678]
print("began sort:%s" %a)
b = heap_sort(a)  # sorts a in place; b refers to the same list
print("end sort:%s" %b)

began sort:[2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 34, 5, 4, 54, 5, 45, 4, 5, 45, 4, 5, 646, 456, 45, 6, 45, 645, 6, 45, 6, 456, 45, 6, 323, 412, 3, 25, 5, 7, 68, 6, 78, 678]
end sort:[1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 8, 9, 25, 34, 45, 45, 45, 45, 45, 45, 54, 68, 78, 323, 412, 456, 456, 645, 646, 678]
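A heap sort can also be stopped early to answer the top-K question: build a max-heap over all N values, then remove the top K times. The sketch below is self-contained and uses an iterative sift-down; the names `sift_down` and `top_k_by_partial_heap_sort` are assumptions made for this example. The cost is roughly O(N) to build the heap plus O(K * log(N)) for the removals.

```python
def sift_down(heap, i, size):
    # Restore the max-heap property for the subtree rooted at index i.
    while True:
        largest = i
        left, right = 2 * i + 1, 2 * i + 2
        if left < size and heap[left] > heap[largest]:
            largest = left
        if right < size and heap[right] > heap[largest]:
            largest = right
        if largest == i:
            return
        heap[i], heap[largest] = heap[largest], heap[i]
        i = largest

def top_k_by_partial_heap_sort(values, k):
    heap = list(values)                      # work on a copy of the input
    size = len(heap)
    for i in range(size // 2 - 1, -1, -1):   # build the max-heap bottom-up
        sift_down(heap, i, size)
    result = []
    for _ in range(min(k, size)):
        result.append(heap[0])               # current maximum
        size -= 1
        heap[0] = heap[size]                 # move the last leaf to the root
        sift_down(heap, 0, size)
    return result

print(top_k_by_partial_heap_sort([2, 3, 4, 5, 646, 456, 678, 45], 3))  # [678, 646, 456]
```

Unlike the size-K heap, this variant needs the whole array in memory, so it fits best when N is moderate and K is small.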

