Python Cookbook Study Notes: Chapters 1 and 2

I am sharing my Python Cookbook study notes here; for now they cover Chapters 1 and 2.
Cross-posted from Zhihu: https://zhuanlan.zhihu.com/p/53457483

Jupyter notebook link

1. Chapter 1: Data Structures and Algorithms
2. Chapter 2: Strings and Text

1. Chapter 1: Data Structures and Algorithms

1.1 Unpacking a sequence into separate variables

Any iterable object can be unpacked into separate variables with a simple assignment.

p = (2, 3)
x, y = p
x
2
y
3
data = ['acme', 90, 40, (2018,12,12)]
name, price, shares, date = data
name
'acme'
date
(2018, 12, 12)
name, price, shares, (a,d,c) = data
a
2018
d
12
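# If the number of variables does not match the number of items
a, c = data
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

 in ()
      1 # If the number of variables does not match the number of items
----> 2 a, c = data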

ValueError: too many values to unpack (expected 2)

Unpacking works on any iterable object, not just tuples and lists.

h, e, l , l, o = 'hello'
h
'h'
l
'l'

You can use a throwaway variable name such as _ to discard values you do not need.

h, _,_,_,_ = 'hello'
h
'h'

1.2 Unpacking elements from an iterable of arbitrary length

Solution: a star expression (*name) solves this problem.

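# Drop the first and last scores (the lowest and highest if the data is sorted) and average the rest
def drop_first_last(data):
    first, *middle, last = data
    return sum(middle) / len(middle)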
record = ('libo', '[email protected]', '110', '120')
name, email, phone, *others = record
name
'libo'
*others
File "", line 1
    *others
           ^
SyntaxError: can't use starred expression here
others
['120']
record = ('libo', '[email protected]', '110', '120')
*others, phone = record
others
# The starred name always receives a list
['libo', '[email protected]', '110']

Star expressions are especially useful when iterating over a sequence of tuples of varying length.

record = [('a',1,2), ('b',1,2,3), ('c',1,2,3,4)]

for i, *other in record:
    print(i, other)
a [1, 2]
b [1, 2, 3]
c [1, 2, 3, 4]
# Combined usage
record = ['abc',50,123.3,(2018,12,20)]
name, *ignore, (year, *ignore) = record
name
'abc'
year
2018
# A neat recursion built on a star expression
def _sum(items):
    head, *_it = items
    # Recurse on the remaining items; stop when nothing is left
    return head + _sum(_it) if _it else head
_sum([i for i in range(10)])
45
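
As a further illustration of star unpacking, here is a minimal sketch that splits a delimited string. The sample line and the variable names are invented for this example and do not come from the notes above.

```python
# Hypothetical passwd-style record, invented for illustration
line = "nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false"

# The starred name collects however many fields sit in the middle
uname, *fields, homedir, shell = line.split(":")

print(uname)    # nobody
print(fields)   # ['*', '-2', '-2', 'Unprivileged User']
print(homedir)  # /var/empty
print(shell)    # /usr/bin/false
```

The starred name is always a list, even when it ends up empty, so code that consumes it does not need a special case.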
1.3 Keeping the last N items

from collections import deque

def search(lines, pattern, history=5):
    previous_lines = deque(maxlen=history)
    for line in lines:
        if pattern in line:
            yield line, previous_lines
        previous_lines.append(line)

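Note that previous_lines.append(line) comes after the yield and is still executed when the generator resumes, which is different from code placed after a return. If this statement were moved before the yield, the matching 'python' line would be printed twice.

deque(maxlen=number) creates a fixed-length queue: when a new item arrives and the queue is already full, the oldest item is pushed out and the new one is added. Items are added with the append method.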

with open('untitled.txt', encoding='utf-8') as f:
    for line, previlines in search(f, 'python', 5):
        for pline in previlines:
            print(pline)
        print(line, end='')
        print('-'*10)
12345

678910

python
----------
678910

python

1112

1

2

python
----------
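
The same idea can be tried without a file. The following is a minimal sketch, assuming an in-memory list of made-up lines in place of untitled.txt; it also yields a copy of the history, which is a small change from the version above, not part of the original notes.

```python
from collections import deque

def search(lines, pattern, history=2):
    previous_lines = deque(maxlen=history)
    for line in lines:
        if pattern in line:
            # Yield a snapshot, so later appends do not change what the caller received
            yield line, list(previous_lines)
        # Runs when the generator resumes after the yield
        previous_lines.append(line)

# Sample data invented for this example
sample_lines = ["one", "two", "three", "python rocks", "four", "more python"]
results = list(search(sample_lines, "python"))

for line, history in results:
    print(line, history)
# python rocks ['two', 'three']
# more python ['python rocks', 'four']
```

Yielding the deque itself is fine when each result is consumed immediately, as in the loop above the output. Collecting results into a list first would leave every entry pointing at the same deque, which is why this sketch copies it.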
q = deque(maxlen=3)
q.append(1)
q.append(2)
q.append(3)
q.append(4)
q
deque([2, 3, 4])
# Without maxlen, the queue is unbounded
q = deque()
q.append(1)
q.append(2)
q
deque([1, 2])
q.appendleft(10)
q
deque([10, 1, 2])
q.pop()
2
q.popleft()
10
q
deque([1])

1.4 Finding the largest or smallest N items

Find the N largest or smallest items in a collection.

# heapq provides two functions, nlargest() and nsmallest(), that do exactly this
import heapq
num = [-1.2,3,3-45,33-4,44,0]
print(heapq.nlargest(3, num))
print(heapq.nsmallest(2, num))
[44, 29, 3]
[-42, -1.2]
# Both functions also accept a key parameter: pass a function, and items are ranked by its return value
portfolio = [
    {'name':'libo', 'price':10, 'grades':3},
    {'name':'libobo','price':11, 'grades':4}
]
print(heapq.nlargest(2, portfolio, key=lambda s: s['grades']))
print(heapq.nsmallest(1, portfolio, key=lambda s: s['grades']))
[{'name': 'libobo', 'price': 11, 'grades': 4}, {'name': 'libo', 'price': 10, 'grades': 3}]
[{'name': 'libo', 'price': 10, 'grades': 3}]

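Using a heap is more efficient: when the data set is large and only the few smallest items are needed, this approach is faster than sorting everything.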

import math
nums = [int(math.sin(3*i)*15) for i in range(10)]
nums
[0, 2, -4, 6, -8, 9, -11, 12, -13, 14]
heap = list(nums)
heapq.heapify(heap)  # heapify() rearranges the list into a heap in place
heap
[-13, -8, -11, 2, 0, 9, -4, 12, 6, 14]
heapq.heappop(heap)
-13
heapq.heappop(heap)
# the second pop returns the next smallest item
-11

Note that when N is close to the size of the collection, sorting first and slicing is the better approach: sorted(items)[:N].
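
The choice between these approaches can be sketched as follows. The scores list is made up for illustration, and all three calls are standard library functions.

```python
import heapq

# Sample data invented for this example
scores = [23, 7, 42, -5, 18, 0, 37]

# N == 1: min() and max() are the simplest choice
lowest = min(scores)

# N small relative to the data: heapq.nsmallest() / heapq.nlargest()
three_lowest = heapq.nsmallest(3, scores)

# N close to len(scores): sort once and slice
five_lowest = sorted(scores)[:5]

print(lowest)        # -5
print(three_lowest)  # [-5, 0, 7]
print(five_lowest)   # [-5, 0, 7, 18, 23]
```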

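1.5 Implementing a priority queue

Order items by priority, so that the item with the highest priority is always popped first.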

class Item():
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return 'Item({!r})'.format(self.name)
item1 = Item('book')
item1
# Note the effect of !r: without it the result would be Item(book), with no quotes
Item('book')
import heapq
class PriorityQueue():
    def __init__(self):
        self.queue = []
        self.index = 0
    def push(self, item, priority):
        heapq.heappush(self.queue, (-priority, self.index, item))  # priority is negated so the highest priority pops first
        self.index += 1  # index is the secondary sort key
    def pop(self):
        return heapq.heappop(self.queue)
q = PriorityQueue()
q.push(Item('foo'), 2)
q.push(Item('fooo'), 2)
q.push(Item('foooo'), 3)
q.push('123', 4)
q.pop()
(-4, 3, '123')
q.pop()
(-3, 2, Item('foooo'))
q.pop()
(-2, 0, Item('foo'))
q.pop()
(-2, 1, Item('fooo'))

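A (priority, item) tuple can also be compared by priority, but a problem arises when two priorities are equal, because Python then tries to compare the items themselves. Adding an index as a tiebreaker avoids this.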

a = (1, 'a')
b = (2, 'b')
a < b
True
b < (2, 'c')
True
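
The tie problem can be seen directly. This is a minimal sketch that redefines a small Item class so the snippet runs on its own; the variable names tie_failed and with_index are made up for illustration.

```python
class Item:
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return 'Item({!r})'.format(self.name)

# Equal priorities force Python to compare the Item objects themselves.
# Item defines no ordering, so the comparison raises TypeError.
try:
    (1, Item('foo')) < (1, Item('bar'))
    tie_failed = False
except TypeError:
    tie_failed = True

# With an index in the middle, the tie is settled before the Items are reached.
with_index = (1, 0, Item('foo')) < (1, 1, Item('bar'))

print(tie_failed)  # True
print(with_index)  # True
```

Because every pushed entry gets a distinct index, two entries never compare equal on the first two fields, so the items are never compared.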

1.6 Mapping keys to multiple values in a dictionary

d = {
    'a':[1,2],
    'b':[3,4]
}
d['c']=[5,6]
d
{'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}
from collections import defaultdict

d = defaultdict(list)
d['a'].append(1)
d['a'].append(2)
d['b'].append(3)
d
defaultdict(list, {'a': [1, 2], 'b': [3]})
# You can also build a multi-valued mapping by hand
d = {
    'a':[2]
}
pairs = [
    ('a',1),
    (2,2)
]
for key, values in pairs:
    if key not in d:
        d[key] = []
    d[key].append(values)
d
{2: [2], 'a': [2, 1]}

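1.7 Keeping dictionaries in order

You can use the OrderedDict class from the collections module. Iterating over it yields items in the order they were inserted.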

from collections import OrderedDict

d = OrderedDict()
d['a'] = 1
d['b'] = 2
d
OrderedDict([('a', 1), ('b', 2)])
import json
json.dumps(d)
'{"a": 1, "b": 2}'
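
For context, a short sketch of how OrderedDict differs from a plain dict on current Python versions. Since Python 3.7 a plain dict also preserves insertion order, so the remaining differences are extras such as move_to_end() and order-sensitive equality. The variable names are made up for illustration.

```python
from collections import OrderedDict

plain = {}
plain['b'] = 2
plain['a'] = 1
# Since Python 3.7, a plain dict also iterates in insertion order
print(list(plain))  # ['b', 'a']

ordered = OrderedDict(plain)
ordered.move_to_end('b')  # move an existing key to the end
print(list(ordered))  # ['a', 'b']

# OrderedDict equality is order-sensitive; plain dict equality is not
print(OrderedDict(a=1, b=2) == OrderedDict(b=2, a=1))  # False
print({'a': 1, 'b': 2} == {'b': 2, 'a': 1})            # True
```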

1.8 Sorting-related problems with dictionaries

# Find the stock with the lowest price
price ={
    'Alibaba':30,
    'Tenxun':20,
    'IBM':21,
    'FB':35
}
min_price = min(zip(price.values(), price.keys()))
min_price
(20, 'Tenxun')
# How zip() works
z = zip(['a', 'b', 'c'], [1, 2, 3])
print(list(z))
[('a', 1), ('b', 2), ('c', 3)]

Other common approaches (note that min() and max() on a dictionary compare its keys, not its values)

min(price)
'Alibaba'
max(price)
'Tenxun'
min(price.values())
20
key = min(price, key=lambda x: price[x]) # 'Tenxun'
price[key]
20
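
The zip() trick extends to ranking all entries. A minimal sketch, reusing the same price data; the names ranked, pairs, first and leftover are made up for illustration.

```python
price = {'Alibaba': 30, 'Tenxun': 20, 'IBM': 21, 'FB': 35}

# Pair each value with its key, then sort: tuples compare by value first
ranked = sorted(zip(price.values(), price.keys()))
print(ranked)  # [(20, 'Tenxun'), (21, 'IBM'), (30, 'Alibaba'), (35, 'FB')]

max_price = max(zip(price.values(), price.keys()))
print(max_price)  # (35, 'FB')

# zip() returns an iterator that can be consumed only once
pairs = zip(price.values(), price.keys())
first = min(pairs)
leftover = list(pairs)
print(first)     # (20, 'Tenxun')
print(leftover)  # []
```

The last part is a common pitfall: calling max(pairs) after min(pairs) on the same zip object would raise ValueError because the iterator is already exhausted.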

1.9 Finding what two dictionaries have in common (keys, values)

a = {
    'a':1,
    'y':2,
    'c':3
}
b = {
    'x':3,
    'y':2,
    'z':1
}
# Find keys in common
a.keys() & b.keys()
{'y'}
a.values() & b.values()
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

 in ()
----> 1 a.values() & b.values()

TypeError: unsupported operand type(s) for &: 'dict_values' and 'dict_values'
a.keys() | b.keys()
{'a', 'c', 'x', 'y', 'z'}
a.keys() - b.keys()
{'a', 'c'}
# Find (key, value) pairs in common
a.items() & b.items()
{('y', 2)}
a.items() | b.items()
{('a', 1), ('c', 3), ('x', 3), ('y', 2), ('z', 1)}
list(a.items())
[('a', 1), ('y', 2), ('c', 3)]
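
Two follow-ups to the set operations above, as a minimal sketch: filtering a dictionary with a key-view difference, and comparing values by converting them to sets first (values views do not support & directly, as the TypeError shows). The names trimmed and common_values are made up for illustration.

```python
a = {'a': 1, 'y': 2, 'c': 3}
b = {'x': 3, 'y': 2, 'z': 1}

# Build a new dict without selected keys; keys missing from a are ignored
trimmed = {key: a[key] for key in a.keys() - {'c', 'w'}}
print(trimmed == {'a': 1, 'y': 2})  # True

# Values are not guaranteed unique or hashable, so convert explicitly
common_values = set(a.values()) & set(b.values())
print(sorted(common_values))  # [1, 2, 3]
```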

1.10 Removing duplicates from a sequence while preserving order

# If the values in the sequence are hashable
def dedupe(items):
    seen = set()
    for item in items:
        if item not in seen:
            yield item
            seen.add(item)
a = [i*i for i in range(-5,5)]
list(dedupe(a))
[25, 16, 9, 4, 1, 0]
a.append([1,2])
a
[25, 16, 9, 4, 1, 0, 1, 4, 9, 16, [1, 2]]
[1,2] in a
True
list(dedupe(a))
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

 in ()
----> 1 list(dedupe(a))

 in dedupe(items)
      3     seen = set()
      4     for item in items:
----> 5         if item not in seen:
      6             yield item
      7             seen.add(item)

TypeError: unhashable type: 'list'
set(a)  # lists are not hashable
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

 in ()
----> 1 set(a)  # lists are not hashable

TypeError: unhashable type: 'list'

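1. Hashable and immutable

An object is hashable if it has a hash value that never changes during its lifetime. Every hashable object implements the __hash__() method, so hashable objects can be compared through their hash values and can be used as dictionary keys and as set members. Python's built-in immutable objects, such as strings and tuples, are hashable (a tuple only if all of its elements are hashable), whereas mutable containers such as dictionaries and lists are unhashable. Instances of user-defined classes are hashable by default: each instance compares equal only to itself, and its hash value is derived from its id().

Hashing

Hashing converts a large amount of data into a much smaller value, possibly just a single number, so that it can be looked up in constant time. This is why hashing matters for efficient algorithms and data structures.

Immutability

Immutability means that an object cannot be changed in any way after it is created, in particular not in any way that would change its hash value.

Connection:

Because hash keys must be immutable, their hash values do not change either. If they were allowed to change, their storage location in a data structure such as a hash table would change too, which contradicts the idea of hashing and would badly hurt efficiency.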
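
These rules can be checked interactively. The following is a minimal sketch; is_hashable and Point are helper names made up for this example.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable('text'))       # True
print(is_hashable((1, 2)))       # True
print(is_hashable([1, 2]))       # False
print(is_hashable({'x': 1}))     # False
print(is_hashable((1, [2, 3])))  # False: a tuple holding a list is not hashable

# Equal immutable values produce equal hashes
print(hash((1, 2)) == hash((1, 2)))  # True

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

# User-defined instances are hashable by default and compare by identity
print(is_hashable(Point(1, 2)))    # True
print(Point(1, 2) == Point(1, 2))  # False
```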

# A small change: a key function turns each item into something hashable
def dedupe(items, key=None):
    seen = set()
    for item in items:
        val = item if key is None else key(item)
        if val not in seen:
            yield item
            seen.add(val)
b = [{'x':1,'y':2}, {'x':4,'y':2}, {'x':1,'y':3}, {'x':1,'y':2}]
de = dedupe(b, key=lambda x:(x['x'],x['y']))
list(de)  # the duplicate {'x':1,'y':2} has been removed
[{'x': 1, 'y': 2}, {'x': 4, 'y': 2}, {'x': 1, 'y': 3}]

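For hashable items a set alone can remove duplicates, but the point of defining a function is that it works on any iterable and keeps the original order, for example: f = open('aa.txt'), then for line in dedupe(f): ……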

set(b)
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

 in ()
----> 1 set(b)

TypeError: unhashable type: 'dict'

1.11 Naming slices

item = [i for i in range(10)]
a = slice(2, 4)
item[a]
[2, 3]
del item[a]
item
[0, 1, 4, 5, 6, 7, 8, 9]
a.start
2
a.stop
4
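
Named slices are most useful when they replace hard-coded indices in record parsing. This is a minimal sketch; the record layout and the names DATE, NAME and SHARES are invented for illustration.

```python
# Hypothetical fixed-width record, invented for this example
line = "2018-12-20 ACME 00050"

# Give each column range a name instead of repeating magic numbers
DATE = slice(0, 10)
NAME = slice(11, 15)
SHARES = slice(16, 21)

print(line[DATE])         # 2018-12-20
print(line[NAME])         # ACME
print(int(line[SHARES]))  # 50

# indices(length) clips a slice to fit a sequence of that length
clipped = slice(2, 50, 2).indices(10)
print(clipped)  # (2, 10, 2)
```

The tuple returned by indices() can be passed straight to range(*clipped) to iterate over the positions without risking an IndexError.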

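1.12 Finding the most frequently occurring items in a sequence

The Counter class in the collections module is designed for exactly this problem. Its most_common() method gives the answer quickly.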

word = [str(int(math.sin(i)*5)) for i in range(20)]
word
['0',
 '4',
 '4',
 '0',
 '-3',
 '-4',
 '-1',
 '3',
 '4',
 '2',
 '-2',
 '-4',
 '-2',
 '2',
 '4',
 '3',
 '-1',
 '-4',
 '-3',
 '0']
from collections import Counter

word_counter = Counter(word)
top_3 = word_counter.most_common(3)
top_3
[('4', 4), ('0', 3), ('-4', 3)]

A Counter can be fed any sequence of hashable items. Under the hood, a Counter is a dictionary that maps each item to the number of times it occurs.

word_counter['2']
2
# To update the counter, you can increment the counts by hand
for w in word:
    word_counter[w] += 1
word_counter['2']
4
# You can also use the update() method
word.append('2')
word_counter.update(word)
word_counter['2'] #4+2+1
7
# Counter objects also support addition and subtraction
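
Counter addition and subtraction can be sketched as follows; the two sample strings are made up for illustration.

```python
from collections import Counter

a = Counter('aabbbc')  # a: 2, b: 3, c: 1
b = Counter('abdd')    # a: 1, b: 1, d: 2

# Addition sums the counts of matching items
total = a + b
print(total == Counter({'a': 3, 'b': 4, 'c': 1, 'd': 2}))  # True

# Subtraction keeps only positive counts, so 'd' (0 - 2) is dropped
diff = a - b
print(diff == Counter({'a': 1, 'b': 2, 'c': 1}))  # True
print('d' in diff)  # False
```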

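1.13 Sorting a list of dictionaries by a common key

# Earlier we covered sorting by dictionary values, for example ranking stocks by price. Given a list of dictionaries, each holding several key-value pairs, how do we sort the list by one of the keys?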

rows = [
    {'fname':'li5', 'lname':'bo', 'uid':101},
    {'fname':'li2', 'lname':'bo2', 'uid':102},
    {'fname':'li3', 'lname':'bo3', 'uid':103},
    {'fname':'li4', 'lname':'bo4', 'uid':104},
]
from operator import itemgetter

rows_by_fname = sorted(rows, key=itemgetter('fname'))
rows_by_fname
[{'fname': 'li2', 'lname': 'bo2', 'uid': 102},
 {'fname': 'li3', 'lname': 'bo3', 'uid': 103},
 {'fname': 'li4', 'lname': 'bo4', 'uid': 104},
 {'fname': 'li5', 'lname': 'bo', 'uid': 101}]

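In this example, rows is passed to sorted(), which accepts a keyword argument key. key must be a callable: it receives a single item from rows and returns the value used to order that item.

# Following this idea, a lambda works just as well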
rows_by_fname = sorted(rows, key=lambda x: x['fname'])
rows_by_fname
[{'fname': 'li2', 'lname': 'bo2', 'uid': 102},
 {'fname': 'li3', 'lname': 'bo3', 'uid': 103},
 {'fname': 'li4', 'lname': 'bo4', 'uid': 104},
 {'fname': 'li5', 'lname': 'bo', 'uid': 101}]
# Compare with the earlier example
price
{'Alibaba': 30, 'FB': 35, 'IBM': 21, 'Tenxun': 20}
sorted(price, key=lambda k:price[k])
['Tenxun', 'IBM', 'Alibaba', 'FB']
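
itemgetter() also accepts several keys at once, which gives a multi-level sort, and the same key functions work with min() and max(). A minimal sketch follows; the sample people list is illustrative data, not the rows used above.

```python
from operator import itemgetter

# Illustrative sample data
people = [
    {'fname': 'Brian', 'lname': 'Jones', 'uid': 1003},
    {'fname': 'David', 'lname': 'Beazley', 'uid': 1002},
    {'fname': 'John', 'lname': 'Cleese', 'uid': 1001},
    {'fname': 'Big', 'lname': 'Jones', 'uid': 1004},
]

# With two keys, itemgetter returns a tuple: sort by lname, then fname
by_name = sorted(people, key=itemgetter('lname', 'fname'))
print([p['uid'] for p in by_name])  # [1002, 1001, 1004, 1003]

# The same key function works with min() and max()
lowest_uid = min(people, key=itemgetter('uid'))
print(lowest_uid['fname'])  # John
```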

1.14 Sorting objects that do not natively support comparison

# Look closely at how sorted() uses key as a callback here

class User():
    def __init__(self,id):
        self.id = id
    def __repr__(self):
        return 'U{}'.format(self.id)
uses = [User(40), User(20), User(30)]
uses
[U40, U20, U30]
sorted(uses, key=lambda x:x.id)
[U20, U30, U40]

operator.attrgetter() is the attribute-based counterpart of operator.itemgetter().

import operator
sorted(uses, key=operator.attrgetter('id'))
[U20, U30, U40]

These helpers exist partly because they tend to run slightly faster than an equivalent hand-written lambda.

1.15 Grouping records by a field (such as a date)

rows = [
    {'address':'555N 444W', 'date':'2018/12/27'},
    {'address':'555N 44W', 'date':'2018/12/2'},
    {'address':'555N 4W', 'date':'2018/1/27'},
    {'address':'55N 444W', 'date':'2018/1/27'},
    {'address':'5N 444W', 'date':'2018/2/27'},
    {'address':'5N 444W', 'date':'2018/12/7'},
    {'address':'55N 44W', 'date':'2018/12/27'},

]
from operator import itemgetter
from itertools import groupby

rows.sort(key=itemgetter('date'))
rows
[{'address': '555N 4W', 'date': '2018/1/27'},
 {'address': '55N 444W', 'date': '2018/1/27'},
 {'address': '555N 44W', 'date': '2018/12/2'},
 {'address': '555N 444W', 'date': '2018/12/27'},
 {'address': '55N 44W', 'date': '2018/12/27'},
 {'address': '5N 444W', 'date': '2018/12/7'},
 {'address': '5N 444W', 'date': '2018/2/27'}]
for date, item in groupby(rows, key=itemgetter('date')):
    print(date)
    for i in item:
        print(' ', i)
2018/1/27
  {'address': '555N 4W', 'date': '2018/1/27'}
  {'address': '55N 444W', 'date': '2018/1/27'}
2018/12/2
  {'address': '555N 44W', 'date': '2018/12/2'}
2018/12/27
  {'address': '555N 444W', 'date': '2018/12/27'}
  {'address': '55N 44W', 'date': '2018/12/27'}
2018/12/7
  {'address': '5N 444W', 'date': '2018/12/7'}
2018/2/27
  {'address': '5N 444W', 'date': '2018/2/27'}

Without sorting first, the result looks like this:

rows = [
    {'address':'555N 444W', 'date':'2018/12/27'},
    {'address':'555N 44W', 'date':'2018/12/2'},
    {'address':'555N 4W', 'date':'2018/1/27'},
    {'address':'55N 444W', 'date':'2018/1/27'},
    {'address':'5N 444W', 'date':'2018/2/27'},
    {'address':'5N 444W', 'date':'2018/12/7'},
    {'address':'55N 44W', 'date':'2018/12/27'},

]
for date, item in groupby(rows, key=itemgetter('date')):
    print(date)
    for i in item:
        print(' ', i)
2018/12/27
  {'address': '555N 444W', 'date': '2018/12/27'}
2018/12/2
  {'address': '555N 44W', 'date': '2018/12/2'}
2018/1/27
  {'address': '555N 4W', 'date': '2018/1/27'}
  {'address': '55N 444W', 'date': '2018/1/27'}
2018/2/27
  {'address': '5N 444W', 'date': '2018/2/27'}
2018/12/7
  {'address': '5N 444W', 'date': '2018/12/7'}
2018/12/27
  {'address': '55N 44W', 'date': '2018/12/27'}

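This shows how groupby() works. groupby() scans the sequence and yields (key, sub-iterator) pairs, much like the items of a dictionary: consecutive items with the same key are grouped together, and the value is a sub-iterator over them. Items with the same key that are not adjacent end up in separate groups, which is why the data must be sorted first.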
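
One detail worth noting in the sorted output earlier: the dates are strings, so they sort lexicographically and '2018/2/27' lands after '2018/12/7'. The following is a minimal sketch of a chronological variant, assuming the dates follow the year/month/day layout shown; date_key is a helper name made up for this example.

```python
from datetime import datetime
from itertools import groupby

rows = [
    {'address': '555N 444W', 'date': '2018/12/27'},
    {'address': '555N 44W', 'date': '2018/12/2'},
    {'address': '555N 4W', 'date': '2018/1/27'},
    {'address': '5N 444W', 'date': '2018/2/27'},
    {'address': '55N 44W', 'date': '2018/12/27'},
]

def date_key(row):
    # Parse the string so that ordering is chronological, not alphabetical
    return datetime.strptime(row['date'], '%Y/%m/%d')

# Sort and group with the same key function
rows.sort(key=date_key)
grouped = [
    (day.strftime('%Y-%m-%d'), [row['address'] for row in items])
    for day, items in groupby(rows, key=date_key)
]

for day, addresses in grouped:
    print(day, addresses)
# 2018-01-27 ['555N 4W']
# 2018-02-27 ['5N 444W']
# 2018-12-02 ['555N 44W']
# 2018-12-27 ['555N 444W', '55N 44W']
```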

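Discussion: we can also group by date another way. Use the date as the key and the matching rows as the value, building a dictionary that maps each key to multiple values.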

dic = {}

for row in rows:
    dic[row['date']] = []
for row in rows:
    dic[row['date']].append(row)
for key in dic:
    print(key)
    print(dic[key])
2018/12/27
[{'address': '555N 444W', 'date': '2018/12/27'}, {'address': '55N 44W', 'date': '2018/12/27'}]
2018/12/2
[{'address': '555N 44W', 'date': '2018/12/2'}]
2018/1/27
[{'address': '555N 4W', 'date': '2018/1/27'}, {'address': '55N 444W', 'date': '2018/1/27'}]
2018/2/27
[{'address': '5N 444W', 'date': '2018/2/27'}]
2018/12/7
[{'address': '5N 444W', 'date': '2018/12/7'}]
# In fact, the defaultdict() introduced earlier for one-key-to-many-values mappings is more convenient

from collections import defaultdict

row_by_date = defaultdict(list)
for row in rows:
    row_by_date[row['date']].append(row)
for key in row_by_date:
    print(key)
    print(row_by_date[key])
2018/12/27
[{'address': '555N 444W', 'date': '2018/12/27'}, {'address': '55N 44W', 'date': '2018/12/27'}]
2018/12/2
[{'address': '555N 44W', 'date': '2018/12/2'}]
2018/1/27
[{'address': '555N 4W', 'date': '2018/1/27'}, {'address': '55N 444W', 'date': '2018/1/27'}]
2018/2/27
[{'address': '5N 444W', 'date': '2018/2/27'}]
2018/12/7
[{'address': '5N 444W', 'date': '2018/12/7'}]

1.16 Filtering list elements by a condition

# List comprehension
mylist = [i for i in range(10)]
[n for n in mylist if n>5]
[6, 7, 8, 9]

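If the list is large, the result can take up a lot of memory. In that case, use a generator expression to produce the filtered values one at a time.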

pos = (n for n in mylist if n>5 )
pos
<generator object <genexpr> at 0x0000024B2F8D4780>
for x in pos:
    print(x)
6
7
8
9
pos = [n for n in mylist if n>5]
pos
[6, 7, 8, 9]

When the filter condition is too complex to express in a simple expression, consider the built-in filter() function.

values = [1,2,3,'3','#','na']

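def is_int(n):
    try:
        int(n)
        return True
    except (TypeError, ValueError):
        # int() raises ValueError for bad strings and TypeError for unsupported types
        return False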
newlist = list(filter(is_int, values))
print(newlist)
[1, 2, 3, '3']
# Filter and replace values in one step
clip_neg = [n if n > 5 else 0 for n in mylist]
clip_neg
[0, 0, 0, 0, 0, 0, 6, 7, 8, 9]
# Using compress(): keep the items whose selector is True
from itertools import compress
list(compress(['a','b','c'], [True, False, True]))
['a', 'c']

1.17 Extracting a subset of a dictionary (filtering a dictionary)

price = {
    'Aliba': 10,
    'Ten':20,
    'Fb':10.2,
    'IBM': 11.2,
    'BB':42
}
pl = {key:value for key,value in price.items() if value>15}
pl
{'BB': 42, 'Ten': 20}
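# Another approach: pass a generator of (key, value) pairs to dict() (no filter is applied here, so every entry is kept)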
pos = ((key,value) for key,value in price.items())
pos
<generator object <genexpr> at 0x0000024B2F8D4048>
dict(pos)
{'Aliba': 10, 'BB': 42, 'Fb': 10.2, 'IBM': 11.2, 'Ten': 20}
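
Two variations on the subset idea, as a minimal sketch: selecting by a set of wanted names, and adding the value filter to the dict() form. The set wanted and the result names are made up for illustration.

```python
price = {'Aliba': 10, 'Ten': 20, 'Fb': 10.2, 'IBM': 11.2, 'BB': 42}

# Select by name; names that are not in the dictionary are ignored
wanted = {'Ten', 'IBM', 'Missing'}
by_name = {key: price[key] for key in price.keys() & wanted}
print(by_name == {'Ten': 20, 'IBM': 11.2})  # True

# The dict() form with the same filter as the comprehension
over_15 = dict((key, value) for key, value in price.items() if value > 15)
print(over_15)  # {'Ten': 20, 'BB': 42}
```

The comprehension is usually the clearer of the two forms; the dict() version is shown only to mirror the cell above.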

1.19 Transforming and reducing data at the same time

num = [i for i in range(10)]
sum(x*x for x in num)
285
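# A generator expression passed as the only argument needs no extra parentheses and avoids building a temporary list; the line below is the equivalent form with explicit parentheses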
s = sum((x*x for x in num))
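
The same pattern works with other reducing functions. A minimal sketch follows; the portfolio data is invented for illustration.

```python
# Illustrative sample data
portfolio = [
    {'name': 'GOOG', 'shares': 50},
    {'name': 'YHOO', 'shares': 75},
    {'name': 'AOL', 'shares': 20},
]

# Reduce over a transformed value: returns the number itself
min_shares = min(s['shares'] for s in portfolio)
print(min_shares)  # 20

# Reduce with key=: returns the whole record instead
min_record = min(portfolio, key=lambda s: s['shares'])
print(min_record['name'])  # AOL

# Convert and join in one pass
joined = ','.join(str(x) for x in ('ACME', 50, 123.45))
print(joined)  # ACME,50,123.45

# any() stops at the first match
has_large = any(s['shares'] > 70 for s in portfolio)
print(has_large)  # True
```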

1.20 Combining multiple mappings into a single mapping

a = {'x':1,'y':2}
b = {'x':2,'z':3}
from collections import ChainMap
c = ChainMap(a,b)
c
ChainMap({'x': 1, 'y': 2}, {'x': 2, 'z': 3})
c['x']
1
list(c.keys())
['z', 'x', 'y']

The update() method achieves a similar effect, but it modifies the dictionary in place.

a.update(b)
a
{'x': 2, 'y': 2, 'z': 3}
# The advantage of ChainMap is that it does not alter the original dictionaries (unlike update() above)

merged = ChainMap(a,b)
merged
ChainMap({'x': 2, 'y': 2, 'z': 3}, {'x': 2, 'z': 3})
merged['z']
3
a['z']=100
merged['z']
100
a['z']
100
b['z']
3
# Writes to a ChainMap only affect the first mapping, here a
merged['z']=106
a['z']
106
b['z']
3
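
To contrast a ChainMap with a merged copy, here is a minimal sketch using fresh dictionaries. new_child() and parents are ChainMap features from the standard library; the variable names are made up for illustration.

```python
from collections import ChainMap

a = {'x': 1, 'y': 2}
b = {'x': 2, 'z': 3}

# A merged copy is a snapshot; a ChainMap is a live view
merged_copy = {**b, **a}  # later entries win, so a takes precedence
chain = ChainMap(a, b)

a['x'] = 42
print(merged_copy['x'])  # 1
print(chain['x'])        # 42

# new_child() puts an empty dict in front, like entering a nested scope
scoped = chain.new_child()
scoped['x'] = 'local'
print(scoped['x'])          # local
print(a['x'])               # 42
print(scoped.parents['x'])  # 42

# Deletes also touch only the first mapping, so a key found only in b fails
try:
    del chain['z']
    delete_failed = False
except KeyError:
    delete_failed = True
print(delete_failed)  # True
```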

Edited at 23:38