Chapter 1 Data Structures and Algorithms
1.1. Unpacking a Sequence into Separate Variables
Problem
You have an N-element tuple or sequence that you would like to unpack into a collection of N variables.
Solution
Any sequence (or iterable) can be unpacked into variables using a simple assignment operation.
name, shares, price, (year, mon, day) = [ 'ACME', 50, 91.1, (2012, 12, 21) ]
Discussion
Unpacking actually works with any object that happens to be iterable, not just tuples or lists. This includes strings, files, iterators, and generators. When unpacking, Python has no special syntax for discarding certain values, but you can often just pick a throwaway variable name for them.
_, shares, price, _ = [ 'ACME', 50, 91.1, (2012, 12, 21) ]
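Since strings are iterable too, the same assignment syntax works on them; a minimal sketch:

```python
# Any iterable can be unpacked, including a string
s = 'Hey'
a, b, c = s
# a, b, c now hold 'H', 'e', 'y'
```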
1.2. Unpacking Elements from Iterables of Arbitrary Length
Problem
You need to unpack N elements from an iterable, but the iterable may be longer than N elements, causing a "too many values to unpack" exception.
Solution
Python "star expressions" can used to address this problem.
def drop_first_last(grades):
first, *middle, last = grades
return avg(middle)
record = ('Dave', '[email protected]', '773-555-1212', '847-555-1212')
name, email, *phone_numbers = user_record
Discussion
It is worth noting that the phone_numbers variable will always be a list, regardless of how many phone numbers are unpacked (including none).
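A quick sketch illustrating this point: the starred variable is a list whether it captures several items, one item, or none (the addresses are the same redacted placeholders used above):

```python
# One trailing item: still unpacked into a list
record = ('Dave', '[email protected]', '773-555-1212')
name, email, *phone_numbers = record

# No trailing items: an empty list, not an error
short_record = ('Bob', '[email protected]')
name2, email2, *phones2 = short_record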
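A quick sketch illustrating this point: the starred variable is a list whether it captures several items, one item, or none (the addresses are the same redacted placeholders used above):

```python
# One trailing item: still unpacked into a list
record = ('Dave', '[email protected]', '773-555-1212')
name, email, *phone_numbers = record

# No trailing items: an empty list, not an error
short_record = ('Bob', '[email protected]')
name2, email2, *phones2 = short_record
```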
1.3. Keeping the Last N Items
Problem
You want to keep a limited history of the last few items seen during iteration or during some other kind of processing.
Solution
Keeping a limited history is a perfect use for a collections.deque. The name is pronounced "deck" and is short for "double-ended queue".
from collections import deque
def search(lines, pattern, history=5):
    previous_lines = deque(maxlen=history)
    for line in lines:
        if pattern in line:
            yield line, previous_lines
        previous_lines.append(line)

# Example use on a file
if __name__ == '__main__':
    with open('somefile.txt') as f:
        for line, prevlines in search(f, 'python', 5):
            for pline in prevlines:
                print(pline, end='')
            print(line, end='')
            print('-'*20)
Discussion
A deque supports thread-safe, memory-efficient appends and pops from either side of the deque. Although you could manually perform such operations on a list (e.g., appending, deleting, etc.), the deque solution is far more elegant and runs a lot faster.
Adding or popping items from either end of a queue has O(1) complexity. This is unlike a list where inserting or removing items from the front of the list is O(N).
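A short sketch of those O(1) end operations, including appendleft() for pushing onto the front:

```python
from collections import deque

q = deque(maxlen=3)
for x in [1, 2, 3, 4]:
    q.append(x)        # O(1); once full, the oldest item falls off the left
# q now holds the last three items seen: 2, 3, 4
q.appendleft(0)        # also O(1); with maxlen set, the rightmost item falls off
```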
1.4. Finding the Largest or Smallest N Items
Problem
You want to make a list of the largest or smallest N items in a collection.
Solution
The heapq module has two functions, nlargest() and nsmallest(), that do exactly what you want.
import heapq
nums = [1, 8, 2, 23, 7, -4, 18, 23, 42, 37, 2]
print(heapq.nlargest(3, nums)) # Prints [42, 37, 23]
print(heapq.nsmallest(3, nums)) # Prints [-4, 1, 2]
portfolio = [
    {'name':'IBM', 'shares': 100, 'price': 91.1},
    {'name':'AAPL', 'shares': 50, 'price': 543.22},
    {'name':'FB', 'shares': 200, 'price': 21.09},
    {'name':'HPQ', 'shares': 35, 'price': 31.75},
    {'name':'YHOO', 'shares': 45, 'price': 16.35},
    {'name':'ACME', 'shares': 75, 'price': 115.65}
]
cheap = heapq.nsmallest(3, portfolio, key=lambda s: s['price'])
expensive = heapq.nlargest(3, portfolio, key=lambda s: s['price'])
Discussion
If you are looking for the N smallest or largest items and N is small compared to the overall size of the collection, these functions provide superior performance.
The nlargest() and nsmallest() functions are most appropriate if you are trying to find a relatively small number of items. If you are simply trying to find the single smallest or largest item (N=1), it is faster to use min() and max(). Similarly, if N is about the same size as the collection itself, it is usually faster to sort it first and take a slice.
The implementation of a heap is an interesting and worthwhile subject of study.
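For instance, the nums list above can be turned into a heap in place with heapify(); this sketch shows the key invariant that heap[0] is always the smallest item:

```python
import heapq

nums = [1, 8, 2, 23, 7, -4, 18, 23, 42, 37, 2]
heap = list(nums)
heapq.heapify(heap)              # O(N); reorders the list into heap order
# heap[0] is always the smallest item
smallest = heapq.heappop(heap)   # pops -4; each pop is O(log N)
next_smallest = heap[0]          # 1 is now at the front
```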
1.5. Implementing a Priority Queue
Problem
You want to implement a queue that sorts items by a given priority and always returns the item with the highest priority on each pop operation.
Solution
import heapq
class PriorityQueue:
    def __init__(self):
        self._queue = []
        self._index = 0

    def push(self, item, priority):
        heapq.heappush(self._queue, (-priority, self._index, item))
        self._index += 1

    def pop(self):
        return heapq.heappop(self._queue)[-1]
Discussion
The core of this recipe concerns the use of the heapq module. The functions heapq.heappush() and heapq.heappop() insert and remove items from the list _queue in a way such that the first item in the list always has the smallest value. Since the priorities are negated on push, heappop() returning the "smallest" item is exactly what makes the queue pop the highest-priority items. Moreover, since the push and pop operations have O(log N) complexity, where N is the number of items in the heap, they are fairly efficient even for fairly large values of N.
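A usage sketch (repeating the class so the snippet runs on its own; the pushed strings are illustrative):

```python
import heapq

class PriorityQueue:
    def __init__(self):
        self._queue = []
        self._index = 0

    def push(self, item, priority):
        # Negate the priority so the highest priority pops first;
        # _index breaks ties in insertion order
        heapq.heappush(self._queue, (-priority, self._index, item))
        self._index += 1

    def pop(self):
        return heapq.heappop(self._queue)[-1]

q = PriorityQueue()
q.push('foo', 1)
q.push('bar', 5)
q.push('spam', 4)
q.push('grok', 1)
# Pops come out in priority order; equal priorities keep insertion order
first = q.pop()    # 'bar'  (priority 5)
second = q.pop()   # 'spam' (priority 4)
third = q.pop()    # 'foo'  (priority 1, inserted before 'grok')
```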
1.6. Mapping Keys to Multiple Values in a Dictionary
Problem
You want to make a dictionary that maps keys to more than one value (a so-called "multidict").
Solution
To easily construct such dictionaries, you can use defaultdict in the collections module. A feature of defaultdict is that it automatically initializes the first value so you can simply focus on adding items.
from collections import defaultdict
d = defaultdict(list)
d['a'].append(1)
d['a'].append(2)
d['b'].append(4)
d = defaultdict(set)
d['a'].add(1)
d['a'].add(2)
d['b'].add(4)
Discussion
In principle, constructing a multivalued dictionary is simple. However, initialization of the first value can be messy if you try to do it yourself. Using a defaultdict simply leads to much cleaner code.
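A sketch contrasting the manual approach with the defaultdict version:

```python
from collections import defaultdict

pairs = [('a', 1), ('a', 2), ('b', 4)]

# Manual initialization: you must check for the first value yourself
d = {}
for key, value in pairs:
    if key not in d:
        d[key] = []
    d[key].append(value)

# With defaultdict, the empty list is created automatically
d2 = defaultdict(list)
for key, value in pairs:
    d2[key].append(value)
```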
1.7. Keeping Dictionaries in Order
Problem
You want to create a dictionary, and you also want to control the order of items when iterating or serializing.
Solution
To control the order of items in a dictionary, you can use an OrderedDict from the collections module. It exactly preserves the original insertion order of data when iterating.
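A minimal sketch, useful for example when you want serialization with json.dumps() to follow a controlled field order (the keys here are illustrative):

```python
import json
from collections import OrderedDict

d = OrderedDict()
d['foo'] = 1
d['bar'] = 2
d['spam'] = 3
d['grok'] = 4
# Iteration and serialization both follow insertion order
encoded = json.dumps(d)
```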
Discussion
An OrderedDict internally maintains a doubly linked list that orders the keys according to insertion order. When a new item is first inserted, it is placed at the end of this list. Subsequent reassignment of an existing key doesn't change the order.
Be aware that the size of an OrderedDict is more than twice as large as that of a normal dictionary due to the extra linked list that's created.
1.8. Calculating with Dictionaries
Problem
You want to perform various calculations (e.g., minimum value, maximum value, sorting, etc.) on a dictionary of data.
Solution
prices = {
    'ACME': 45.23,
    'AAPL': 612.78,
    'IBM': 205.55,
    'HPQ': 37.20,
    'FB': 10.75
}
min_price = min(zip(prices.values(), prices.keys()))
# min_price is (10.75, 'FB')
max_price = max(zip(prices.values(), prices.keys()))
# max_price is (612.78, 'AAPL')
prices_sorted = sorted(zip(prices.values(), prices.keys()))
# prices_sorted is [(10.75, 'FB'), (37.2, 'HPQ'),
# (45.23, 'ACME'), (205.55, 'IBM'),
# (612.78, 'AAPL')]
Discussion
The solution involving zip() solves the problem by "inverting" the dictionary into a sequence of (value, key) pairs. When performing comparisons on such tuples, the value element is compared first, followed by the key. This gives you exactly the behavior that you want and allows reductions and sorting to be easily performed on the dictionary contents using a single statement.
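If you only want the key of the minimum entry, an alternative sketch is to supply a key function to min() and do one extra lookup for the value:

```python
prices = {
    'ACME': 45.23,
    'AAPL': 612.78,
    'IBM': 205.55,
    'HPQ': 37.20,
    'FB': 10.75
}
cheapest = min(prices, key=lambda k: prices[k])   # key with the minimum value
cheapest_price = prices[cheapest]                 # one extra lookup for the value
```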
1.9. Finding Commonalities in Two Dictionaries
Problem
You have two dictionaries and want to find what they might have in common (same keys, same values, etc.).
Solution
To find out what the two dictionaries have in common, simply perform common set operations using the keys()
or items()
methods.
a = { 'x' : 1, 'y' : 2, 'z' : 3 }
b = { 'w' : 10, 'x' : 11, 'y' : 2 }
# Find keys in common
a.keys() & b.keys()
# { 'x', 'y' }
# Find keys in a that are not in b
a.keys() - b.keys()
# { 'z' }
# Find (key,value) pairs in common
a.items() & b.items() # { ('y', 2) }
These kinds of operations can also be used to alter or filter dictionary contents. For example, suppose you want to make a new dictionary with selected keys removed. Here is some sample code using a dictionary comprehension:
# Make a new dictionary with certain keys removed
c = {key:a[key] for key in a.keys() - {'z', 'w'}}
# c is {'x': 1, 'y': 2}
Discussion
The keys() method of a dictionary returns a keys-view object that exposes the keys. A little-known feature of keys views is that they also support common set operations such as unions, intersections, and differences. Thus, if you need to perform common set operations with dictionary keys, you can often just use the keys-view objects directly without first converting them into a set.
The items() method of a dictionary returns an items-view object consisting of (key, value) pairs. This object supports similar set operations and can be used to perform operations such as finding out which key-value pairs two dictionaries have in common.
The values() method of a dictionary does not support the set operations described in this recipe.
1.10. Removing Duplicates from a Sequence while Maintaining Order
Problem
You want to eliminate the duplicated values in a sequence, but preserve the order of the remaining items.
Solution
If the values in the sequence are hashable, the problem can be easily solved using a set and a generator.
def dedupe(items):
    seen = set()
    for item in items:
        if item not in seen:
            yield item
            seen.add(item)
>>> a = [1, 5, 2, 1, 9, 1, 5, 10]
>>> list(dedupe(a))
[1, 5, 2, 9, 10]
This only works if the items in the sequence are hashable. If you are trying to eliminate duplicates in a sequence of unhashable types (such as dicts), you can make a slight change to this recipe, as follows:
def dedupe(items, key=None):
    seen = set()
    for item in items:
        val = item if key is None else key(item)
        if val not in seen:
            yield item
            seen.add(val)
>>> a = [ {'x':1, 'y':2}, {'x':1, 'y':3}, {'x':1, 'y':2}, {'x':2, 'y':4}]
>>> list(dedupe(a, key=lambda d: (d['x'],d['y'])))
[{'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 2, 'y': 4}]
>>> list(dedupe(a, key=lambda d: d['x']))
[{'x': 1, 'y': 2}, {'x': 2, 'y': 4}]
Discussion
If all you want to do is eliminate duplicates, it is often easy enough to make a set.
>>> a = [1, 5, 2, 1, 9, 1, 5, 10]
>>> set(a)
{1, 2, 10, 5, 9}
However, this approach doesn't preserve any kind of ordering. So, the resulting data will be scrambled afterward. The solution shown avoids this.
The use of a generator function in this recipe reflects the fact that you might want the function to be extremely general purpose, not necessarily tied directly to list processing. For example, if you want to read a file, eliminating duplicated lines, you could simply do this:
with open(somefile, 'r') as f:
    for line in dedupe(f):
        ...
1.11. Naming a Slice
Problem
Your program has become an unreadable mess of hardcoded slice indices and you want to clean it up.
Solution
Suppose you have some code that is pulling specific data out of a record string with fixed fields.
###### 0123456789012345678901234567890123456789012345678901234567890
record = '....................100 .......513.25 ..........'
cost = int(record[20:32]) * float(record[40:48])
Instead of doing that, why not name the slices like this?
SHARES = slice(20,32)
PRICE = slice(40,48)
cost = int(record[SHARES]) * float(record[PRICE])
In the latter version, you avoid having a lot of mysterious hardcoded indices, and what you're doing becomes much clearer.
Discussion
In addition, you can map a slice onto a sequence of a specific size by using its indices(size) method. This returns a tuple (start, stop, step) where all values have been suitably limited to fit within bounds (so as to avoid IndexError exceptions when indexing).
>>> a = slice(5, 50, 2)
>>> s = 'HelloWorld'
>>> a.indices(len(s))
(5, 10, 2)
1.12. Determining the Most Frequently Occurring Items in a Sequence
Problem
You have a sequence of items, and you'd like to determine the most frequently occurring items in the sequence.
Solution
The collections.Counter class is designed for just such a problem. It even comes with a handy most_common() method that will give you the answer.
words = [
'look', 'into', 'my', 'eyes', 'look', 'into', 'my', 'eyes',
'the', 'eyes', 'the', 'eyes', 'the', 'eyes', 'not', 'around', 'the',
'eyes', "don't", 'look', 'around', 'the', 'eyes', 'look', 'into',
'my', 'eyes', "you're", 'under'
]
from collections import Counter
word_counts = Counter(words)
top_three = word_counts.most_common(3)
Discussion
Under the covers, a Counter is a dictionary that maps the items to the number of occurrences. If you want to increment the counts manually, simply use addition:
morewords = ['why','are','you','not','looking','in','my','eyes']
for word in morewords:
word_counts[word] += 1
Or, alternatively, you could use the update() method:
word_counts.update(morewords)
A little-known feature of Counter instances is that they can be easily combined using various mathematical operations.
>>> a = Counter(words)
>>> b = Counter(morewords)
>>> a
Counter({'eyes': 8, 'the': 5, 'look': 4, 'into': 3, 'my': 3, 'around': 2,
"you're": 1, "don't": 1, 'under': 1, 'not': 1})
>>> b
Counter({'eyes': 1, 'looking': 1, 'are': 1, 'in': 1, 'not': 1, 'you': 1,
'my': 1, 'why': 1})
>>> # Combine counts
>>> c = a + b
>>> c
Counter({'eyes': 9, 'the': 5, 'look': 4, 'my': 4, 'into': 3, 'not': 2,
'around': 2, "you're": 1, "don't": 1, 'in': 1, 'why': 1,
'looking': 1, 'are': 1, 'under': 1, 'you': 1})
>>> # Subtract counts
>>> d = a - b
>>> d
Counter({'eyes': 7, 'the': 5, 'look': 4, 'into': 3, 'my': 2, 'around': 2,
"you're": 1, "don't": 1, 'under': 1})
>>>
Needless to say, Counter objects are a tremendously useful tool for almost any kind of problem where you need to tabulate and count data. You should prefer them over manually written solutions involving dictionaries.
1.13. Sorting a List of Dictionaries by a Common Key
Problem
You have a list of dictionaries and you would like to sort the entries according to one or more of the dictionary values.
Solution
rows = [
{'fname': 'Brian', 'lname': 'Jones', 'uid': 1003},
{'fname': 'David', 'lname': 'Beazley', 'uid': 1002},
{'fname': 'John', 'lname': 'Cleese', 'uid': 1001},
{'fname': 'Big', 'lname': 'Jones', 'uid': 1004}
]
from operator import itemgetter
rows_by_fname = sorted(rows, key=itemgetter('fname'))
rows_by_uid = sorted(rows, key=itemgetter('uid'))
rows_by_lfname = sorted(rows, key=itemgetter('lname','fname'))
Discussion
The itemgetter() function creates a callable that accepts a single item from rows as input and returns a value that will be used as the basis for sorting.
The functionality of itemgetter() is sometimes replaced by lambda expressions, which often work just fine. However, the solution involving itemgetter() typically runs a bit faster. Thus, you might prefer it if performance is a concern.
Last, but not least, don't forget that the technique shown in this recipe can be applied to functions such as min() and max().
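For example, with the rows list above, the same itemgetter() key picks out the rows with the smallest and largest uid:

```python
from operator import itemgetter

rows = [
    {'fname': 'Brian', 'lname': 'Jones', 'uid': 1003},
    {'fname': 'David', 'lname': 'Beazley', 'uid': 1002},
    {'fname': 'John', 'lname': 'Cleese', 'uid': 1001},
    {'fname': 'Big', 'lname': 'Jones', 'uid': 1004}
]
lowest = min(rows, key=itemgetter('uid'))
highest = max(rows, key=itemgetter('uid'))
```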
1.14. Sorting Objects Without Native Comparison Support
Problem
You want to compare objects of the same class, but they don't natively support comparison operations.
Solution
Instead of using lambda, an alternative approach is to use operator.attrgetter().
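A minimal sketch using a hypothetical User class (the class name and user_id attribute are illustrative, not from the original):

```python
from operator import attrgetter

class User:
    def __init__(self, user_id):
        self.user_id = user_id

    def __repr__(self):
        return 'User({})'.format(self.user_id)

users = [User(23), User(3), User(99)]
# lambda works, but attrgetter() is typically a bit faster
by_lambda = sorted(users, key=lambda u: u.user_id)
by_attr = sorted(users, key=attrgetter('user_id'))
```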
Discussion
The choice of whether to use lambda or attrgetter() may be one of personal preference. However, attrgetter() is often a tad bit faster and also has the added feature of allowing multiple fields to be extracted simultaneously.
1.15. Grouping Records Together Based on a Field
Problem
You have a sequence of dictionaries or instances and you want to iterate over the data in groups based on the value of a particular field, such as date.
Solution
The itertools.groupby() function is particularly useful for grouping data together like this.
rows = [
    {'address': '5412 N CLARK', 'date': '07/01/2012'},
    {'address': '5148 N CLARK', 'date': '07/04/2012'},
    {'address': '5800 E 58TH', 'date': '07/02/2012'},
    {'address': '2122 N CLARK', 'date': '07/03/2012'},
    {'address': '5645 N RAVENSWOOD', 'date': '07/02/2012'},
    {'address': '1060 W ADDISON', 'date': '07/02/2012'},
    {'address': '4801 N BROADWAY', 'date': '07/01/2012'},
    {'address': '1039 W GRANVILLE', 'date': '07/04/2012'},
]
from operator import itemgetter
from itertools import groupby
# Sort by the desired field first
rows.sort(key=itemgetter('date'))
# Iterate in groups
for date, items in groupby(rows, key=itemgetter('date')):
    print(date)
    for i in items:
        print(' ', i)
Discussion
The groupby() function works by scanning a sequence and finding sequential "runs" of identical values (or values returned by the given key function). On each iteration, it returns the value along with an iterator that produces all of the items in a group with the same value.
An important preliminary step is sorting the data according to the field of interest. Since groupby() only examines consecutive items, failing to sort first won't group the records as you want.
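If you merely want random access to the records by date (and memory is no concern), an alternative sketch is to build a multidict, which requires no sorting at all:

```python
from collections import defaultdict

rows = [
    {'address': '5412 N CLARK', 'date': '07/01/2012'},
    {'address': '5800 E 58TH', 'date': '07/02/2012'},
    {'address': '4801 N BROADWAY', 'date': '07/01/2012'},
]
rows_by_date = defaultdict(list)
for row in rows:
    rows_by_date[row['date']].append(row)
# All records for a given date are now one lookup away
```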
1.16. Filtering Sequence Elements
Problem
You have data inside of a sequence, and need to extract values or reduce the sequence using some criteria.
Solution
The easiest way to filter sequence data is often to use a list comprehension.
>>> mylist = [1, 4, -5, 10, -7, 2, 3, -1]
>>> [n for n in mylist if n > 0]
[1, 4, 10, 2, 3]
>>> [n for n in mylist if n < 0]
[-5, -7, -1]
>>>
One potential downside of using a list comprehension is that it might produce a large result if the original input is large. If this is a concern, you can use generator expressions to produce the filtered values iteratively.
>>> pos = (n for n in mylist if n > 0)
>>> pos
<generator object <genexpr> at 0x1006a0eb0>
>>> for x in pos:
...     print(x)
...
1
4
10
2
3
Discussion
List comprehensions and generator expressions are often the easiest and most straightforward ways to filter simple data. They also have the added power to transform the data at the same time.
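For instance, a sketch that filters and transforms in one expression, or substitutes a value instead of discarding it:

```python
import math

mylist = [1, 4, -5, 10, -7, 2, 3, -1]

# Keep only the positives, transformed by a square root
roots = [math.sqrt(n) for n in mylist if n > 0]

# Or clip the negatives to 0 instead of dropping them
clipped = [n if n > 0 else 0 for n in mylist]
```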
Another notable filtering tool is itertools.compress(), which takes an iterable and an accompanying Boolean selector sequence as input. As output, it gives you all of the items in the iterable where the corresponding element in the selector is True. This can be useful if you're trying to apply the results of filtering one sequence to another related sequence.
addresses = [
    '5412 N CLARK',
    '5148 N CLARK',
    '5800 E 58TH',
    '2122 N CLARK',
    '5645 N RAVENSWOOD',
    '1060 W ADDISON',
    '4801 N BROADWAY',
    '1039 W GRANVILLE',
]
counts = [0, 3, 10, 4, 1, 7, 6, 1]
>>> from itertools import compress
>>> more5 = [n > 5 for n in counts]
>>> more5
[False, False, True, False, False, True, True, False]
>>> list(compress(addresses, more5))
['5800 E 58TH', '1060 W ADDISON', '4801 N BROADWAY']
1.17. Extracting a Subset of a Dictionary
Problem
You want to make a dictionary that is a subset of another dictionary.
Solution
This is easily accomplished by using a dictionary comprehension.
prices = {
    'ACME': 45.23,
    'AAPL': 612.78,
    'IBM': 205.55,
    'HPQ': 37.20,
    'FB': 10.75
}
# Make a dictionary of all prices over 200
p1 = { key:value for key, value in prices.items() if value > 200 }
# Make a dictionary of tech stocks
tech_names = { 'AAPL', 'IBM', 'HPQ', 'MSFT' }
p2 = { key:value for key,value in prices.items() if key in tech_names }
Discussion
Much of what can be accomplished with a dictionary comprehension might also be done by creating a sequence of tuples and passing them to the dict() function.
p1 = dict((key, value) for key, value in prices.items() if value > 200)
However, the dictionary comprehension solution is a bit clearer and actually runs quite a bit faster.
1.18. Mapping Names to Sequence Elements
Problem
You have code that accesses list or tuple elements by position, but this makes the code somewhat difficult to read at times. You'd also like to be less dependent on position in the structure, by accessing the elements by name.
Solution
collections.namedtuple() provides these benefits. It is actually a factory function that returns a subclass of the standard Python tuple type. You feed it a type name and the fields it should have, and it returns a class that you can instantiate, passing in values for the fields you've defined, and so on.
>>> from collections import namedtuple
>>> Subscriber = namedtuple('Subscriber', ['addr', 'joined'])
>>> sub = Subscriber('[email protected]', '2012-10-19')
>>> sub
Subscriber(addr='[email protected]', joined='2012-10-19')
>>> sub.addr
'[email protected]'
>>> sub.joined
'2012-10-19'
>>>
Although an instance of a namedtuple looks like a normal class instance, it is interchangeable with a tuple and supports all of the usual operations such as indexing and unpacking.
A major use case for named tuples is decoupling your code from the position of the elements it manipulates.
To illustrate, here is some code using ordinary tuples:
def compute_cost(records):
    total = 0.0
    for rec in records:
        total += rec[1] * rec[2]
    return total
Here is a version that uses a namedtuple:
from collections import namedtuple
Stock = namedtuple('Stock', ['name', 'shares', 'price'])
def compute_cost(records):
    total = 0.0
    for rec in records:
        s = Stock(*rec)
        total += s.shares * s.price
    return total
Discussion
One possible use of a namedtuple is as a replacement for a dictionary, which requires more space to store. Thus, if you are building large data structures involving dictionaries, use of a namedtuple will be more efficient. However, be aware that unlike a dictionary, a namedtuple is immutable.
>>> s = Stock('ACME', 100, 123.45)
>>> s
Stock(name='ACME', shares=100, price=123.45)
>>> s.shares = 75
Traceback (most recent call last):
File "", line 1, in
AttributeError: can't set attribute
>>>
If you need to change any of the attributes, it can be done using the _replace() method of a namedtuple instance, which makes an entirely new namedtuple with the specified values replaced.
>>> s = s._replace(shares=75)
>>> s
Stock(name='ACME', shares=75, price=123.45)
>>>
A subtle use of the _replace() method is that it can be a convenient way to populate named tuples that have optional or missing fields. To do this, you make a prototype tuple containing the default values and then use _replace() to create new instances with values replaced.
from collections import namedtuple
Stock = namedtuple('Stock', ['name', 'shares', 'price', 'date', 'time'])
# Create a prototype instance
stock_prototype = Stock('', 0, 0.0, None, None)
# Function to convert a dictionary to a Stock
def dict_to_stock(s):
    return stock_prototype._replace(**s)
Here is an example of how this code would work:
>>> a = {'name': 'ACME', 'shares': 100, 'price': 123.45}
>>> dict_to_stock(a)
Stock(name='ACME', shares=100, price=123.45, date=None, time=None)
>>> b = {'name': 'ACME', 'shares': 100, 'price': 123.45, 'date': '12/17/2012'}
>>> dict_to_stock(b)
Stock(name='ACME', shares=100, price=123.45, date='12/17/2012', time=None)
>>>
Last, but not least, it should be noted that if your goal is to define an efficient data structure where you will be changing various instance attributes, using namedtuple is not your best choice. Instead, consider defining a class using __slots__.
1.19. Transforming and Reducing Data at the Same Time
Problem
You need to execute a reduction function, but first need to transform or filter the data.
Solution
A very elegant way to combine a data reduction and a transformation is to use a generator-expression argument.
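For example, a sum of squares computed without ever materializing the list of squares (the nums values here are illustrative):

```python
nums = [1, 2, 3, 4, 5]
# The generator expression feeds transformed values to sum() one at a time,
# so no intermediate list of squares is built
s = sum(x * x for x in nums)
```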
Discussion
The solution shows a subtle syntactic aspect of generator expressions when supplied as the single argument to a function (i.e., you don't need repeated parentheses).
s = sum((x * x for x in nums)) # Pass generator-expr as argument
s = sum(x * x for x in nums) # More elegant syntax
1.20. Combining Multiple Mappings into a Single Mapping
Problem
You have multiple dictionaries or mappings that you want to logically combine into a single mapping to perform certain operations, such as looking up values or checking for the existence of keys.
Solution
Now suppose you want to perform lookups where you have to check both dictionaries (e.g., first checking in a and then in b if not found). An easy way to do this is to use the ChainMap class from the collections module.
a = {'x': 1, 'z': 3 }
b = {'y': 2, 'z': 4 }
from collections import ChainMap
c = ChainMap(a,b)
print(c['x']) # Outputs 1 (from a)
print(c['y']) # Outputs 2 (from b)
print(c['z']) # Outputs 3 (from a)
Discussion
A ChainMap takes multiple mappings and makes them logically appear as one. However, the mappings are not literally merged together. Instead, a ChainMap simply keeps a list of the underlying mappings and redefines common dictionary operations to scan the list.
If there are duplicate keys, the values from the first mapping get used.
Operations that mutate the mapping always affect the first mapping listed.
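Both behaviors can be sketched briefly:

```python
from collections import ChainMap

a = {'x': 1, 'z': 3}
b = {'y': 2, 'z': 4}
c = ChainMap(a, b)

dup = c['z']    # 3: for a duplicate key, the first mapping (a) wins
c['w'] = 40     # mutation goes into a, never into b
del c['x']      # deletion also applies to a
```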