Find the K largest values in an array of one billion numbers
Step 1: remove duplicates with a hash
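The deduplication step is only named here, so the following is a minimal sketch rather than the author's code. It assumes the values fit in memory and uses Python's built-in set as the hash table; the function name dedupe is hypothetical.

```python
def dedupe(values):
    """Return the distinct values in first-seen order (hypothetical helper)."""
    seen = set()      # hash set of values already encountered
    unique = []
    for v in values:
        if v not in seen:
            seen.add(v)
            unique.append(v)
    return unique


print(dedupe([3, 1, 3, 2, 1]))
```

If the original order does not matter, list(set(values)) would do the same job in one line.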
Solution 1: partition method
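def partition(L, left, right):
    # Partition L[left..right] in descending order around the pivot L[left]
    # and return the pivot's final index
    low = left
    if left < right:
        key = L[left]  # pivot
        high = right
        while low < high:
            # scan from the right for a value larger than the pivot
            while low < high and L[high] <= key:
                high -= 1
            L[low] = L[high]
            # scan from the left for a value smaller than the pivot
            while low < high and L[low] >= key:
                low += 1
            L[high] = L[low]
        L[low] = key
    return low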
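def topK(L, K):
    # Rearrange L in place so that L[0:K] holds the K largest values
    if K <= 0 or K >= len(L):
        return  # nothing to select: no values are wanted, or all of them are
    low = 0
    high = len(L) - 1
    j = partition(L, low, high)
    while j != K and low < high:
        if K > j:
            low = j + 1   # the boundary lies to the right of the pivot
        else:
            high = j - 1  # the boundary lies to the left of the pivot
        j = partition(L, low, high)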
if __name__ == "__main__":
    L = [3,2,7,4,6,5,1,8,0, 19, 23, 4, 5, 23, 3, 4, 0,1,2,3,45,6,5,34,212,3234,234,3,4,4,3,43,43,343,34,34,343,43,2]
    n = 2  # number of largest values to find
    topK(L, n)
    print('result:', L[0:n])
result: [3234, 343]
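Idea: reuse the partitioning step of quicksort. Each round takes L[left] as the pivot and finds a split position low such that every value to the left of L[low] is greater than or equal to the pivot and every value to the right is less than or equal to it. The search then continues on whichever side contains index K, until low equals K; at that point L[0:K] holds the K largest values. Each round scans only the current range, and on average that range shrinks by a constant factor, so the expected time complexity is O(N). With consistently bad pivots (for example, input that is already sorted) it degrades to O(N^2).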
Advantage: the lowest average time complexity.
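The code above always takes the first element of the range as the pivot. A common variation, which is not part of the original code, is to pick the pivot at random so that sorted input no longer triggers the slow case. The sketch below assumes the data fits in memory; the name top_k_quickselect is hypothetical, and it works on a copy instead of modifying its argument.

```python
import random


def top_k_quickselect(values, k):
    """Return the k largest values, in no particular order (illustrative sketch)."""
    data = list(values)  # work on a copy so the caller's list is untouched
    if k <= 0:
        return []
    if k >= len(data):
        return data
    low, high = 0, len(data) - 1
    while low < high:
        # Assumption: a random pivot, swapped into the first slot of the range
        p = random.randint(low, high)
        data[low], data[p] = data[p], data[low]
        key = data[low]
        i, j = low, high
        while i < j:
            while i < j and data[j] <= key:
                j -= 1
            data[i] = data[j]
            while i < j and data[i] >= key:
                i += 1
            data[j] = data[i]
        data[i] = key  # pivot is now at its final position in descending order
        if i == k:
            break
        if i < k:
            low = i + 1
        else:
            high = i - 1
    return data[:k]


print(sorted(top_k_quickselect([3, 2, 7, 4, 6, 5, 1, 8, 0, 19, 23, 4, 5, 23], 3), reverse=True))
```

The result is sorted only for display; the selection itself makes no promise about the order of the K values.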
Solution 2: max-heap method
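Idea: build a heap from the first K values, then compare every remaining value with the smallest value kept so far. If the new value is smaller, discard it; otherwise it replaces that smallest value and the heap is adjusted. With a max-heap (largest value on top), the smallest kept value is only known to sit in one of the leaves, so it has to be searched for. Keeping a min-heap of size K instead puts that value at the top, and the whole pass then costs O(N * log(K)). Sorting the entire array with heap sort costs O(N * log(N)).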
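The size-K heap idea can be sketched with Python's standard heapq module, which implements a min-heap. This is an illustration under that assumption, not the author's implementation, and the function name top_k_min_heap is hypothetical.

```python
import heapq


def top_k_min_heap(values, k):
    """Return the k largest values in descending order (illustrative sketch)."""
    if k <= 0:
        return []
    heap = []  # min-heap holding the k largest values seen so far
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # v beats the smallest kept value: swap it in and re-adjust
            heapq.heapreplace(heap, v)
        # otherwise v is no larger than everything kept and is discarded
    return sorted(heap, reverse=True)


print(top_k_min_heap([3, 2, 7, 4, 6, 5, 1, 8, 0, 19], 3))
```

Because only K values are held at any time, the input can be any iterable, including a stream that is too large to load at once.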
Basic heap sort implementation:
#!/usr/bin/python
# coding: utf-8

# Build a max-heap: sift down every non-leaf node, starting from the last one
def build_heap(lists, size):
    for i in range(size // 2 - 1, -1, -1):
        adjust_heap(lists, i, size)
# Sift the node at index i down until the subtree rooted there is a max-heap
def adjust_heap(lists, i, size):
    lchild = 2 * i + 1
    rchild = 2 * i + 2
    largest = i  # index of the largest among node i and its children
    if i < size / 2:
        if lchild < size and lists[lchild] > lists[largest]:
            largest = lchild
        if rchild < size and lists[rchild] > lists[largest]:
            largest = rchild
        if largest != i:
            lists[largest], lists[i] = lists[i], lists[largest]
            adjust_heap(lists, largest, size)
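# Heap sort: sorts lists in place in ascending order and returns it
def heap_sort(lists):
    size = len(lists)
    build_heap(lists, size)
    for i in range(size - 1, -1, -1):
        # move the current maximum to the end, then restore the heap on lists[0:i]
        lists[0], lists[i] = lists[i], lists[0]
        adjust_heap(lists, 0, i)
    return lists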
a = [2,3,4,5,6,7,8,9,1,2,34,5,4,54,5,45,4,5,45,4,5,646,456,45,6,45,645,6,45,6,456,45,6,323,412,3,25,5,7,68,6,78,678]
print("began sort:%s" %a)
b = heap_sort(a)
print("end sort:%s" %b)
began sort:[2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 34, 5, 4, 54, 5, 45, 4, 5, 45, 4, 5, 646, 456, 45, 6, 45, 645, 6, 45, 6, 456, 45, 6, 323, 412, 3, 25, 5, 7, 68, 6, 78, 678]