LeetCode Focused Practice: Notes on Two Heaps

This post records what I learned while doing focused practice based on 穷码农's LeetCode practice recommendations.

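There are only three problems this time, but all of them are Hard, and they took me a lot of time. The main reason is that I had never used the "heap" data structure before, so I needed a good while to understand it from the ground up and implement it. There were also many ideas I could not come up with while solving the problems, so I first had to consult other people's solution notes.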

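However, once I understood the properties of a heap, I found it not hard to grasp: it is a priority queue implemented as a series of operations on a list. The topic is called "Two Heaps" because these problems generally use a "max-heap" and a "min-heap" at the same time, especially when a median is involved. For the concrete implementation of a heap, I referred to the following article: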

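【算法日积月累】9-堆与优先队列 | 算法与数据结构、机器学习、深度学习 ([Algorithms Day by Day] 9: Heaps and Priority Queues | Algorithms and Data Structures, Machine Learning, Deep Learning)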

Below is my from-scratch implementation of the heap:

"""
    The implementation of data structure 'heap'(max and min) through "priority queue".
    Reference: https://www.liwei.party/2019/01/10/algorithms-and-data-structures/priority-queue/
"""
class MaxHeap:
    def __init__(self, capability):
        """
        Initiate the queue.
        Note: because the heap behaves like a tree, we use a list to "record" this tree. In addition, the first
        element starts with index 1, not 0.
        :return:
        """
        # define how many elements it can contain
        self.capability = capability
        # define the list to store data (with pre-defined space). '+1': because index starts with 1
        self.data = [None for _ in range(capability + 1)]
        # define the number of elements
        self._count = 0

    def get_size(self) -> int:
        """
        get the size of the heap
        :return:
        """
        return self._count

    def set_size(self, size):
        """
        set the size of the heap
        :return:
        """
        self._count = size

    def is_empty(self) -> bool:
        """
        determine whether it is empty
        :return:
        """
        return self._count == 0

    def insert(self, num):
        """
        insert new data to heap
        :param num: the new element
        :return:
        """
        if self._count == self.capability:
            raise Exception("Heap reaches the limitation.")

        # insert the element to the tail first
        self._count += 1
        self.data[self._count] = num

        # see if it can be moved up
        self.shift_up(self._count)

    def shift_up(self, itemPos):
        """
        'swim' the element to higher place if it is larger than others
        :param itemPos: the index of the element
        :return:
        """
        try:
            target = self.data[itemPos]

            # father: itemPos // 2
            while itemPos > 1 and self.data[itemPos // 2] < target:
                self.data[itemPos] = self.data[itemPos // 2]
                itemPos //= 2
            self.data[itemPos] = target

        except TypeError as err:
            # report the actual error message (e.g. elements of incomparable types)
            print(err)
            return

    def shift_down(self, itemPos):
        """
        'sink' the element to lower place if it is smaller than others
        method: as long as the 'tree' has children, it will keep replacing.
        :param itemPos: the index of the element
        :return:
        """
        try:
            target = self.data[itemPos]

            # children (left): itemPos * 2
            # children (right): itemPos * 2 + 1
            while itemPos * 2 <= self._count:
                children = itemPos * 2
                # attention to the next half: left children and right children needs to be compared
                if children + 1 <= self._count and self.data[children + 1] > self.data[children]:
                    children = children + 1

                # attention: check the stop condition first. If the larger child is not larger than target,
                # stop sinking (this test cannot be folded into the while condition because 'children' is
                # only known inside the loop)
                if self.data[children] <= target:
                    break

                self.data[itemPos] = self.data[children]
                itemPos = children

            self.data[itemPos] = target
        except TypeError as err:
            # report the actual error message (e.g. elements of incomparable types)
            print(err)
            return

    def extract_max(self):
        """
        get the maximum element at the top
        :return: the maximum value of the heap
        """
        if self.is_empty():
            raise Exception("Empty heap.")

        maxVal = self.data[1]

        # use the last element to replace the empty position (natural way)
        self.data[1], self.data[self._count] = self.data[self._count], self.data[1]

        self.data[self._count] = None
        self._count -= 1
        self.shift_down(1)

        return maxVal


# similar to MaxHeap
class MinHeap:
    def __init__(self, capability):
        self.capability = capability
        self.data = [None for _ in range(capability + 1)]
        self._count = 0

    def get_size(self):
        return self._count

    def set_size(self, size):
        """
        set the size of the heap
        :return:
        """
        self._count = size

    def is_empty(self):
        return self._count == 0

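    def insert(self, item):
        # same capacity check as MaxHeap; without it a full heap fails with an IndexError
        if self._count == self.capability:
            raise Exception("Heap reaches the limitation.")

        # add to the tail
        self._count += 1
        self.data[self._count] = item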

        # swim
        self.shift_up(self._count)

    def shift_up(self, itemPos):
        try:
            target = self.data[itemPos]

            # father: itemPos // 2
            while itemPos > 1 and self.data[itemPos // 2] > target:
                self.data[itemPos] = self.data[itemPos // 2]
                itemPos //= 2
            self.data[itemPos] = target

        except TypeError as err:
            # report the actual error message (e.g. elements of incomparable types)
            print(err)
            return

    def extract_min(self):
        if self.is_empty():
            raise Exception("Empty heap.")

        minVal = self.data[1]

        # replace with the tail item
        self.data[1] = self.data[self._count]
        self.data[self._count] = None
        self._count -= 1

        # sink
        self.shift_down(1)

        return minVal

    def shift_down(self, itemPos):
        try:
            target = self.data[itemPos]

            # children (left): itemPos * 2
            # children (right): itemPos * 2 + 1
            while itemPos * 2 <= self._count:
                children = itemPos * 2
                # attention to the next half: left children and right children needs to be compared
                if children + 1 <= self._count and self.data[children + 1] < self.data[children]:
                    children = children + 1

                # attention: check the stop condition first. If the smaller child is not smaller than target,
                # stop sinking (this test cannot be folded into the while condition because 'children' is
                # only known inside the loop)
                if self.data[children] >= target:
                    break

                self.data[itemPos] = self.data[children]
                itemPos = children

            self.data[itemPos] = target
        except TypeError as err:
            # report the actual error message, consistent with the other handlers
            print(err)
            return

Although I did not use my own heap in every problem, implementing it from scratch first still deepened my understanding of heaps.
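The index arithmetic used by the classes above (parent at `i // 2`, children at `2 * i` and `2 * i + 1`, with index 0 unused) can be checked with a small helper. This is only a sketch: `is_max_heap` is a name made up for this note and is not part of the classes above.

```python
def is_max_heap(data, count):
    """Return True if data[1..count] satisfies the max-heap property (1-indexed layout)."""
    for pos in range(2, count + 1):
        parent = pos // 2
        # every element must be less than or equal to its parent
        if data[parent] < data[pos]:
            return False
    return True


# index 0 is unused, matching the 1-indexed layout of the classes above
print(is_max_heap([None, 9, 7, 8, 3, 5], 5))   # True
print(is_max_heap([None, 9, 7, 8, 3, 10], 5))  # False: 10 is larger than its parent 7
```

A check like this can be run after every insert or extract while debugging a hand-written heap.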

Today's notes cover 3 problems of the Two Heaps type. Their LeetCode numbers and titles are:

  • 295 - Find Median from Data Stream
  • 480 - Sliding Window Median
  • 502 - IPO

The code and the corresponding takeaways are recorded below in the order above. The IDE used is PyCharm (Python 3).


Find Median from Data Stream

Median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value. So the median is the mean of the two middle values.

For example,
[2,3,4], the median is 3
[2,3], the median is (2 + 3) / 2 = 2.5

Design a data structure that supports the following two operations:
1. void addNum(int num) - Add an integer number from the data stream to the data structure.
2. double findMedian() - Return the median of all elements so far.

Example:

addNum(1)
addNum(2)
findMedian() -> 1.5
addNum(3) 
findMedian() -> 2

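I used my own heap for this problem. At first, since I did not know how a heap is structured, I tried insertion sort plus picking the median. It works in principle, but unfortunately it timed out. That is to be expected: inserting one number into a sorted list costs O(n) in the worst case (when the incoming numbers arrive in the opposite of the required order), so n insertions cost O(n^2) in total, which hurts a lot in Python. This is where the heap shows its advantage: inserting into a heap costs O(log N) and reading the top element costs O(1), so both adding a number and finding the median are faster.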
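For comparison, the sorted-list idea can be sketched as follows. This is not the original timed-out code, only a minimal baseline under the assumption that `bisect.insort` from the standard library is used to keep the list ordered; the class and method names are made up for this note.

```python
import bisect


class SortedListMedian:
    """Baseline: keep every value in a sorted list (O(n) per insert because elements are shifted)."""

    def __init__(self):
        self.values = []

    def add_num(self, num):
        # binary search finds the slot, but inserting still shifts the tail of the list
        bisect.insort(self.values, num)

    def find_median(self):
        n = len(self.values)
        mid = n // 2
        if n % 2 == 1:
            return self.values[mid]
        return (self.values[mid - 1] + self.values[mid]) / 2


finder = SortedListMedian()
finder.add_num(1)
finder.add_num(2)
print(finder.find_median())  # 1.5
finder.add_num(3)
print(finder.find_median())  # 2
```

A baseline like this is also handy as a reference answer when testing a heap-based version.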

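With heaps, an extreme value can be taken directly each time without looking at the other elements. For this problem, given the nature of the median, we can split the data into two parts: the part smaller than the median (a max-heap) and the part larger than the median (a min-heap). Each heap has its own property: the top of the max-heap is its largest element, and every other element is less than or equal to it; the min-heap is exactly the opposite.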

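In addition, both heaps "dynamically" push their largest/smallest element to the top. By convention, when an odd number of elements has been read from the data stream, the median is the top of the max-heap, and the max-heap holds one more element than the min-heap. When an even number has been read, the median is the average of the two tops. The max-heap and the min-heap therefore satisfy the following relations: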

  1. top of the max-heap <= top of the min-heap;
  2. size of the max-heap = size of the min-heap, or (size of the min-heap + 1).

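With this rule, the approach becomes clear: push each incoming element into the max-heap, move the top of the max-heap to the min-heap, and finally check whether the min-heap now has more elements than the max-heap; if so, move the min-heap's top element back.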

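As for why a heap can adjust element positions "dynamically": it implements the abstract data type called a "priority queue". As the name suggests, the elements of a priority queue have priorities, and the "best" element (largest, smallest, etc.) can be taken out directly. Each time a new element is put into the queue, the heap compares it with other elements on its own and updates the current "best" element.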
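A small illustration of this priority-queue behavior, using `heapq` from the standard library rather than the hand-written classes (so it is a min-heap, and the "best" element is the smallest one):

```python
import heapq

queue = []
for value in [5, 2, 8, 1]:
    heapq.heappush(queue, value)
    # queue[0] is always the current smallest element, whatever the push order was
    print("pushed", value, "-> current minimum:", queue[0])

# popping repeatedly yields the elements in ascending order
drained = [heapq.heappop(queue) for _ in range(len(queue))]
print(drained)  # [1, 2, 5, 8]
```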

from Data_Structure.Heap import MaxHeap, MinHeap


class MedianFinder:
    # correct solution: two heaps (a max-heap and a min-heap).
    def __init__(self):
        """
        initialize your data structure here.
        """
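        # each heap needs room for about half of the numbers plus one;
        # 10 is only a placeholder, so raise it to match the expected input size
        self.maxHeap = MaxHeap(10)
        self.minHeap = MinHeap(10)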

    def addNum(self, num: int) -> None:
        # insert to max heap and extract the largest number to min heap
        self.maxHeap.insert(num)
        self.minHeap.insert(self.maxHeap.extract_max())

        # check the balance
        if self.minHeap.get_size() > self.maxHeap.get_size():
            self.maxHeap.insert(self.minHeap.extract_min())

    def findMedian(self) -> float:
        # odd or even
        if self.minHeap.get_size() == self.maxHeap.get_size():
            # get the first item directly
            return (self.maxHeap.data[1] + self.minHeap.data[1]) / 2
        else:
            return self.maxHeap.data[1]
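The same add-then-rebalance steps can also be sketched with the standard library's `heapq`, which removes the fixed capacity. Since `heapq` only provides a min-heap, the lower half is stored negated to act as a max-heap. The class and attribute names below are made up for this note.

```python
import heapq


class HeapqMedianFinder:
    def __init__(self):
        self.small = []  # max-heap of the lower half, stored as negated values
        self.large = []  # min-heap of the upper half

    def add_num(self, num):
        # push into the max-heap, then move its largest element to the min-heap
        heapq.heappush(self.small, -num)
        heapq.heappush(self.large, -heapq.heappop(self.small))
        # keep the max-heap the same size as the min-heap, or one larger
        if len(self.large) > len(self.small):
            heapq.heappush(self.small, -heapq.heappop(self.large))

    def find_median(self):
        if len(self.small) == len(self.large):
            return (-self.small[0] + self.large[0]) / 2
        return -self.small[0]


finder = HeapqMedianFinder()
finder.add_num(1)
finder.add_num(2)
print(finder.find_median())  # 1.5
finder.add_num(3)
print(finder.find_median())  # 2
```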

Sliding Window Median

Median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value. So the median is the mean of the two middle values.

Examples:
[2,3,4] , the median is 3
[2,3], the median is (2 + 3) / 2 = 2.5

Given an array nums, there is a sliding window of size k which is moving from the very left of the array to the very right. You can only see the k numbers in the window. Each time the sliding window moves right by one position. Your job is to output the median array for each window in the original array.

For example,
Given nums = [1,3,-1,-3,5,3,6,7], and k = 3.

Window position                Median
---------------               -----
[1  3  -1] -3  5  3  6  7       1
 1 [3  -1  -3] 5  3  6  7       -1
 1  3 [-1  -3  5] 3  6  7       -1
 1  3  -1 [-3  5  3] 6  7       3
 1  3  -1  -3 [5  3  6] 7       5
 1  3  -1  -3  5 [3  6  7]      6
Therefore, return the median sliding window as [1,-1,-1,3,5,6].

Note: 
You may assume k is always valid, ie: k is always smaller than input array's size for non-empty array.
Answers within 10^-5 of the actual value will be accepted as correct.

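I originally wanted to use my own data structure for this problem too, but in the end it failed. Not only was it slow (a list length has to be defined up front at initialization), it also went wrong once the input reached a certain size (for example, a list of length 1000). It did return a list of medians, but part of that list did not match the expected answer, and there was no clear pattern, so it was hard to tell quickly which part was at fault. After debugging for a while I found that around the 160th iteration one value suddenly went missing. My current guess is that the problem is not in my data structure but in some detail between "adding a value" and "deleting a value". Since the goal here is to deepen my understanding through practice, and I had already spent a day and a half, I finally used Python's built-in heap to finish this problem and stopped worrying about the details.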

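Python's built-in heap is a min-heap by default, so creating and maintaining a max-heap basically comes down to negation: every number must be negated before it is pushed into the max-heap or compared with the max-heap's elements. The approach to this problem is roughly as follows: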

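First, traverse all elements of the list and 'heappush' them one by one (insert and restore the heap order) into the max-heap or the min-heap, depending on how each compares with the max-heap's top. Next, rebalance the sizes of the two heaps according to the relations between them. Then, once the window is full, compute and record the current window's median with the method from the previous problem. Finally, delete the window's first element (it leaves the window in the next iteration). When deleting, first compare the number to be removed with the max-heap's maximum: if it is greater than that maximum, delete it from the min-heap; otherwise delete it from the max-heap.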

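One thing to note: after deleting an element, first sift the element that took its place up and down so that the heap order is correct, and only then rebalance the sizes of the two heaps.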
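Deleting an arbitrary value can also be sketched with public `heapq` functions only. This is an alternative under stated assumptions, not the method used in the solution below: `remove_value` is a made-up helper that rebuilds the heap with `heapq.heapify` (O(n)), which is no worse asymptotically than the linear `index` search it already performs.

```python
import heapq


def remove_value(heap, value):
    """Remove one occurrence of value from a heapq-style list, then restore the heap order."""
    index = heap.index(value)  # linear search; raises ValueError if value is absent
    heap[index] = heap[-1]     # overwrite the removed slot with the last element
    heap.pop()                 # drop the now-duplicated last element
    heapq.heapify(heap)        # rebuild the heap order from scratch


window = [1, 3, 5, 7, 9]
heapq.heapify(window)
remove_value(window, 3)
print(sorted(window))  # [1, 5, 7, 9]
print(window[0])       # 1
```

For a max-heap stored as negated values, the value passed in has to be negated as well.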

import heapq
from heapq import *


class Solution:
    def __init__(self):
        # parameters
        self.ans = []
        self.maxHeap = []
        self.minHeap = []

    def medianSlidingWindow(self, nums: list, k: int) -> list:
        # official solution: use Python's built-in 'heapq' module.
        # traverse
        for i in range(len(nums)):
            # push the element to maxHeap/minHeap
            if not self.maxHeap or -self.maxHeap[0] >= nums[i]:
                # heappush restores the heap order (the list itself is not fully sorted)
                heappush(self.maxHeap, -nums[i])
            else:
                heappush(self.minHeap, nums[i])

            # balance two heaps
            self.balance()

            # check the window
            if i + 1 - k >= 0:
                # get current median
                self.ans.append(self.getMedian())

                # remove the first element in the window (it will move outside the window at next round)
                # remember to reverse any number in maxHeap when comparing or searching
                removeNum = nums[i + 1 - k]
                # essential!!!!!!!!!! There is a 'equal' symbol for comparison
                if removeNum <= -self.maxHeap[0]:
                    self.delete(self.maxHeap, -removeNum)
                else:
                    self.delete(self.minHeap, removeNum)

                # balance again because the number of elements has declined
                self.balance()

        return self.ans

    def getMedian(self):
        # even
        if len(self.maxHeap) == len(self.minHeap):
            return -self.maxHeap[0] / 2 + self.minHeap[0] / 2
        # odd
        else:
            return -self.maxHeap[0]

    def delete(self, heap, num):
        # change the delete number to last element in heap and remove the last element
        index = heap.index(num)
        heap[index] = heap[-1]
        del heap[-1]

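        # attention: restore the heap order around the replaced slot. In heapq, the private helper
        # _siftup moves an element toward the leaves and _siftdown moves it toward the root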
        if index < len(heap):
            heapq._siftup(heap, index)
            heapq._siftdown(heap, 0, index)


    def balance(self):
        """
        principle:
            1. the top element in maxHeap is always smaller than or equal to any element in minHeap
            2. maxHeap holds either the same number of elements as minHeap or exactly one more
        :return: None
        """
        if len(self.minHeap) > len(self.maxHeap):
            # note: 'heappush' restores the heap order after inserting a number,
            # and so does 'heappop' after removing the top
            heappush(self.maxHeap, -heappop(self.minHeap))
        elif len(self.maxHeap) > len(self.minHeap) + 1:
            heappush(self.minHeap, -heappop(self.maxHeap))
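When part of the output does not match the expected answer, a brute-force reference makes it easier to find the first window that differs. A minimal sketch (the function name is made up for this note; it sorts every window, so it is only meant for testing, not for submission):

```python
def brute_force_medians(nums, k):
    """Reference answer: sort each window of size k and read off its median."""
    medians = []
    for start in range(len(nums) - k + 1):
        window = sorted(nums[start:start + k])
        mid = k // 2
        if k % 2 == 1:
            medians.append(window[mid])
        else:
            medians.append((window[mid - 1] + window[mid]) / 2)
    return medians


print(brute_force_medians([1, 3, -1, -3, 5, 3, 6, 7], 3))  # [1, -1, -1, 3, 5, 6]
```

Comparing its result element by element with the heap-based result on random inputs shows the index at which the two first diverge.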

IPO

Suppose LeetCode will start its IPO soon. In order to sell a good price of its shares to Venture Capital, LeetCode would like to work on some projects to increase its capital before the IPO. Since it has limited resources, it can only finish at most k distinct projects before the IPO. Help LeetCode design the best way to maximize its total capital after finishing at most k distinct projects.

You are given several projects. For each project i, it has a pure profit Pi and a minimum capital of Ci is needed to start the corresponding project. Initially, you have W capital. When you finish a project, you will obtain its pure profit and the profit will be added to your total capital.

To sum up, pick a list of at most k distinct projects from given projects to maximize your final capital, and output your final maximized capital.

Example 1:

Input: k=2, W=0, Profits=[1,2,3], Capital=[0,1,1].

Output: 4

Explanation: Since your initial capital is 0, you can only start the project indexed 0.
             After finishing it you will obtain profit 1 and your capital becomes 1.
             With capital 1, you can either start the project indexed 1 or the project indexed 2.
             Since you can choose at most 2 projects, you need to finish the project indexed 2 to get the maximum capital.
             Therefore, output the final maximized capital, which is 0 + 1 + 3 = 4.

Note:
You may assume all numbers in the input are non-negative integers.
The length of Profits array and Capital array will not exceed 50,000.
The answer is guaranteed to fit in a 32-bit signed integer.

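This problem is framed as a fairly concrete scenario. Unlike the earlier exercises, it does not ask its question directly: you have to abstract the logic from the statement, find the pattern, and handle the relevant variables correctly to get the required answer.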

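From the statement we can derive the condition "the lower the cost the better, the higher the profit the better". Abstracting this and combining it with the heap data structure gives: "build a min-heap on 'cost' and a max-heap on 'profit', then accumulate the best values according to k".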

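One thing to note, though: "cost" and "profit" are bound together here, so they have to be brought in together from the start when building the heap. Python has a built-in function 'zip()' that pairs up two lists, combining corresponding elements into tuples. In Python 3 it returns an iterator, so the result is wrapped in list() to obtain a list (reference: Python3 zip() function).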
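A short illustration of binding the two lists with `zip()` and letting a heap order the resulting tuples. The sample numbers are made up for this note.

```python
import heapq

capital = [1, 0, 1]
profits = [2, 1, 3]

# zip() yields (capital, profit) pairs; list() materializes the iterator
projects = list(zip(capital, profits))
print(projects)  # [(1, 2), (0, 1), (1, 3)]

# tuples compare by their first element, then by the second on ties
heapq.heapify(projects)
order = [heapq.heappop(projects) for _ in range(3)]
print(order)  # [(0, 1), (1, 2), (1, 3)]
```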

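Knowing this, we can bind "cost" and "profit" with zip at the start and put them all into a min-heap, which orders them on its own (tuples are compared by their first element first; if those are equal, by the second element, and so on), then enter the loop. Next, within the k iterations, use the current capital W to determine which projects are affordable right now, pop all of them, and push their profits into the max-heap. Then pop the most profitable project (the maximum) from the max-heap and add its profit to the variable holding the final maximized capital that we return. Finally, keep looping until we run out of capacity (k = 0) or no project is available (the max-heap is empty).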

from heapq import *


class Solution:
    def __init__(self):
        # parameters
        self.maxCapital = 0
        self.minHeap = []
        self.maxHeap = []

    def findMaximizedCapital(self, k: int, W: int, Profits: list, Capital: list) -> int:
        # solution: two heaps. Heap ordering plus extracting the best value under a constraint.

        # bind the profits and corresponding capital, then order them in minHeap
        # note: 'heapify' compares tuples element by element, so the capital (first element) decides first
        self.minHeap = list(zip(Capital, Profits))
        heapify(self.minHeap)
        self.maxCapital = W

        curProject = (0, 0)
        curMaxProfit = 0
        # Process projects
        while k > 0:
            # get every project that we can currently carry out (some projects have same capital but different profits)
            while self.minHeap and W >= self.minHeap[0][0]:
                curProject = heappop(self.minHeap)
                heappush(self.maxHeap, -curProject[1])

            # choose the biggest profit among the affordable projects
            if self.maxHeap:
                curMaxProfit = -heappop(self.maxHeap)
                # the profit of the chosen project (not of the last popped one) increases the available capital
                W += curMaxProfit
                self.maxCapital += curMaxProfit
            # if there are no more projects we can do (k is too large and there are no more potential projects)
            else:
                break

            k -= 1

        return self.maxCapital


solution = Solution()
print(solution.findMaximizedCapital(11, 11, [1, 2, 3], [11, 12, 13]))
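The same greedy loop can be sketched as a standalone function that keeps no state between calls. The function and variable names are made up for this note; the second call is a case where the most profitable affordable project is not the last one unlocked, so the capital must grow by the chosen profit.

```python
import heapq


def find_maximized_capital(k, w, profits, capital):
    projects = list(zip(capital, profits))  # min-heap ordered by required capital
    heapq.heapify(projects)
    affordable = []                         # max-heap of profits, stored negated

    for _ in range(k):
        # unlock every project whose required capital is covered by w
        while projects and projects[0][0] <= w:
            _, profit = heapq.heappop(projects)
            heapq.heappush(affordable, -profit)
        if not affordable:
            break
        # take the most profitable affordable project; its profit is added to the capital
        w += -heapq.heappop(affordable)
    return w


print(find_maximized_capital(2, 0, [1, 2, 3], [0, 1, 1]))   # 4
print(find_maximized_capital(2, 1, [5, 1, 10], [0, 1, 4]))  # 16
```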

Summary

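This is my first time using the heap data structure and I need some time to digest it, so these notes may not be entirely correct; they only reflect my current understanding of heaps.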

A few points that help in recognizing the Two Heaps pattern (these come from 穷码农's summary, which I find very useful):

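  1. This pattern works wonders in priority-queue and scheduling problems;
  2. The problem asks you to find the largest/smallest/median of a set of numbers;
  3. Sometimes the pattern is useful when a binary tree data structure is involved (e.g. a complete binary tree).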

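One extra remark: the wildcard "from ** import *" form used in the code above is not recommended style; a plain "import ***" is preferable. I will pay attention to code style in future practice.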

 

If there are problems in these notes, I will correct them as soon as I find them.

*Note: all problems in this post come from LeetCode.
