Python Standard Library Study Notes 2: Data Structures


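1. Overview

    The collections module contains implementations of several data structures that extend those found in other modules. For example, deque is a double-ended queue that allows elements to be added or removed at either end. defaultdict is a dictionary that responds with a default value when a key is missing, while OrderedDict remembers the order in which items were added. namedtuple extends the ordinary tuple by giving each member an attribute name in addition to a numeric index.

    For large amounts of data, array can use memory more efficiently than list. Because an array is limited to a single data type, it can use a more compact memory representation than a general-purpose list. Many list methods also work on array.

    list has a sort() method, but it is sometimes more efficient to keep a list ordered than to re-sort it after every change. The functions in heapq modify the contents of a list while maintaining its order at low cost.

    bisect is another option for building sorted lists or arrays.

    Simulating a queue with the insert() and pop() methods of list is not thread-safe. For ordered communication between threads, use the Queue module. multiprocessing includes a version of Queue that handles communication between processes, which makes it easier to convert a multithreaded program to use processes instead.

    struct is useful for decoding data from another application (for example, binary data produced on Windows) into Python's built-in types so that it can be processed.

    For highly interconnected data structures such as graphs and trees, weakref can maintain references while still allowing the garbage collector to clean up objects that are no longer needed. The functions in copy duplicate data structures and their contents, including recursive copies made with deepcopy().

    pprint can be used to create easy-to-read representations.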

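2. collections---Container Data Types

Purpose: container data types

Python version: 2.4 and later

The collections module contains container data types beyond the built-in types list, dict, and tuple.

1. Counter

    A Counter is a container that keeps track of how many times equivalent values are added.

    Counter supports three forms of initialization. Its constructor can be called with a sequence of items, with a dictionary containing keys and counts, or with keyword arguments that map string names to counts: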

>>> import collections
>>> collections.Counter(['a', 'b', 'c', 'a', 'b', 'b'])
Counter({'b': 3, 'a': 2, 'c': 1})
>>> collections.Counter({'a':2, 'b':3, 'c':1})
Counter({'b': 3, 'a': 2, 'c': 1})
>>> collections.Counter(a = 2, b = 3, c = 1)
Counter({'b': 3, 'a': 2, 'c': 1})
    Because a Counter is a dictionary subclass, we can add data with update(), inspect the counts with items(), and list every element with elements():
>>> c = collections.Counter()
>>> c
Counter()
>>> c.update('abcdaab')
>>> c
Counter({'a': 3, 'b': 2, 'c': 1, 'd': 1})
>>> c.update({'a':1, 'd':5})
>>> c
Counter({'d': 6, 'a': 4, 'b': 2, 'c': 1})
>>> for key, value in c.items():
	print key, ' => ', value

	
a  =>  4
c  =>  1
b  =>  2
d  =>  6
>>> c.elements()

>>> list(c.elements())
['a', 'a', 'a', 'a', 'c', 'b', 'b', 'd', 'd', 'd', 'd', 'd', 'd']
    most_common() produces a sequence of the n most frequently encountered input values and their counts (similar to sorting a dictionary by value):
>>> c = collections.Counter()
>>> c.update({'a':5, 'b':3, 'c':11, 'd':23, 'e':2})
>>> for letter, count in c.most_common(3):
	print '%s: %d' % (letter, count)

	
d: 23
c: 11
a: 5
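    A plain dict is a hash table and keeps no ordering by count, so it has no equivalent method; getting the same result from a dict means sorting its items by value yourself.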
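For comparison, a plain dict can produce the same ranking if its items are sorted by value explicitly. The following is a minimal sketch in Python 3 syntax (the notes themselves use Python 2); the variable names are made up for the example.

```python
import collections
from operator import itemgetter

counts = {'a': 5, 'b': 3, 'c': 11, 'd': 23, 'e': 2}

# Counter does the ranking for us.
top_counter = collections.Counter(counts).most_common(3)

# With a plain dict, sort the (key, value) pairs by value, largest first.
top_sorted = sorted(counts.items(), key=itemgetter(1), reverse=True)[:3]

print(top_counter)
print(top_sorted)
```

Both approaches give the same three pairs here; most_common() is simply the shorter way to say it.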

    Counter also supports arithmetic and set operations for aggregating results:

>>> import collections
>>> c1 = collections.Counter(['a', 'b', 'c', 'a', 'b', 'b'])
>>> c2 = collections.Counter('alphabet')
>>> c1
Counter({'b': 3, 'a': 2, 'c': 1})
>>> c2
Counter({'a': 2, 'b': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1})
>>> c1 + c2
Counter({'a': 4, 'b': 4, 'c': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1})
>>> c1 - c2
Counter({'b': 2, 'c': 1})
>>> c1 & c2
Counter({'a': 2, 'b': 1})

2. defaultdict

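    The standard dictionary includes the method setdefault() for retrieving a value and establishing a default if the value does not exist. defaultdict instead lets the caller specify the default up front, when the container is initialized.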

>>> import collections
>>> def default_factory():
	return 'default value'

>>> d = collections.defaultdict(default_factory, foo='bar')
>>> d
defaultdict(<function default_factory at 0x...>, {'foo': 'bar'})
>>> d['foo']
'bar'
>>> d['bar']
'default value'
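A common use of this is grouping values by key. The sketch below (Python 3 syntax, with made-up sample data) builds the same grouping twice, once with setdefault() on a plain dict and once with defaultdict(list), to show what the factory saves.

```python
import collections

pairs = [('fruit', 'apple'), ('veg', 'carrot'), ('fruit', 'pear')]

# Plain dict: setdefault() creates the list on first use.
plain = {}
for kind, name in pairs:
    plain.setdefault(kind, []).append(name)

# defaultdict: the factory (list) is called automatically for missing keys.
grouped = collections.defaultdict(list)
for kind, name in pairs:
    grouped[kind].append(name)

print(dict(grouped))
```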

3. deque

    deque (double-ended queue) supports adding and removing elements from either end.

>>> d = collections.deque('abcdefg')
>>> d
deque(['a', 'b', 'c', 'd', 'e', 'f', 'g'])
>>> del d
>>> d = collections.deque()
>>> d.extend('abcdefg')
>>> d.append('h')
>>> d
deque(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
>>> d2 = collections.deque()
>>> d2.extendleft(range(6))
>>> d2.appendleft(6)
>>> d2
deque([6, 5, 4, 3, 2, 1, 0])
>>> d2.pop()
0
>>> d2.popleft()
6
>>> d2
deque([5, 4, 3, 2, 1])
    Because a deque is thread-safe, its contents can even be consumed from both ends at the same time in different threads:
import collections
import threading
import time

candle = collections.deque(range(5))

def burn(direction, nextSource):
    while True:
        try:
            next = nextSource()
        except IndexError:
            break
        else:
            print '%8s: %s' % (direction, next)
            time.sleep(0.1)
    print '%8s done' % direction
    return

if __name__ == "__main__":
    left = threading.Thread(target=burn, args=('Left', candle.popleft))
    right = threading.Thread(target=burn, args=('Right', candle.pop))

    left.start()
    right.start()

    left.join()
    right.join()
    The interpreter output is as follows (lines from the two threads may interleave):
>>> 
    Left: 0   Right: 4

   Right: 3    Left: 1

   Right: 2    Left done

   Right done
    Another useful capability of deque is rotation in either direction, to skip over some elements.
>>> import collections
>>> d = collections.deque(range(10))
>>> d
deque([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> d.rotate(2)
>>> d
deque([8, 9, 0, 1, 2, 3, 4, 5, 6, 7])
>>> d = collections.deque(range(10))
>>> d.rotate(-2)
>>> d
deque([2, 3, 4, 5, 6, 7, 8, 9, 0, 1])
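To make the behaviour of rotate() concrete, here is a small sketch in Python 3 syntax. The helper name rotate_right is made up for illustration; it shows that rotating right by n is equivalent to popping from the right end and pushing onto the left end n times.

```python
import collections

def rotate_right(d, n):
    """Hypothetical helper: rotate d right by n steps using only pop/appendleft."""
    for _ in range(n):
        d.appendleft(d.pop())

a = collections.deque(range(10))
b = collections.deque(range(10))

a.rotate(2)          # built-in rotation
rotate_right(b, 2)   # the same effect, spelled out

print(a)
print(a == b)
```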

4. namedtuple

    A standard tuple uses numeric indexes to access its members:

bob = ('Bob', 30, 'male')
print 'Representation:', bob

jane = ('jane', 29, 'female')
print '\nFields by index:', jane[0]

print '\nFields by index:'
for p in [bob, jane]:
    print '%s is a %d year old %s' % p
    The interpreter output is as follows:
>>> 
Representation: ('Bob', 30, 'male')

Fields by index: jane

Fields by index:
Bob is a 30 year old male
jane is a 29 year old female
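    Using a tuple requires remembering which index to use for each value, which can lead to errors, especially when the tuple has many fields or is constructed far from where it is used. A namedtuple assigns a name to each member in addition to its numeric index (it can be thought of as an ordered, dictionary-like structure):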
import collections

Person = collections.namedtuple('Person', 'name age gender')

print 'Type of Person', type(Person)

bob = Person(name='Bob', age=30, gender='male')
print '\nRepresentation:', bob

jane = Person(name='jane', age=29, gender='female')
print '\nField by name:', jane.name

print '\nFields by index:'
for p in [bob, jane]:
    print '%s is a %d year old %s' % p
    The interpreter output is as follows:
>>> 
Type of Person <type 'type'>

Representation: Person(name='Bob', age=30, gender='male')

Field by name: jane

Fields by index:
Bob is a 30 year old male
jane is a 29 year old female
    Field names are invalid if they are repeated or conflict with Python keywords:
import collections

try:
    collections.namedtuple('Person', 'name class age gender')
except ValueError, err:
    print err

try:
    collections.namedtuple('Person', 'name age gender age')
except ValueError, err:
    print err
    The interpreter output is as follows:
>>> 
Type names and field names cannot be a keyword: 'class'
Encountered duplicate field name: 'age'
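    When a namedtuple is created from values outside the program's control (for example, to represent rows returned by a database query whose schema is not known in advance), set the rename option to True so that invalid field names are renamed: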
import collections

with_class = collections.namedtuple('Person', 'name class age gender', rename = True)
print with_class._fields

two_ages = collections.namedtuple('Person', 'name age gender age', rename = True)
print two_ages._fields
    The interpreter output is as follows:
>>> 
('name', '_1', 'age', 'gender')
('name', 'age', 'gender', '_3')
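As a rough sketch of the database scenario (Python 3 syntax), the column list below is hypothetical and stands in for names discovered at run time. Renamed fields are still reachable by position or by their generated name.

```python
import collections

# Hypothetical column names, as if read from a query result at run time.
columns = ['id', 'class', 'name', 'id']

# 'class' is a keyword and the second 'id' is a duplicate; both get renamed.
Row = collections.namedtuple('Row', columns, rename=True)
print(Row._fields)

row = Row(1, 'A', 'Bob', 1)
print(row.name, row[1], row._1)
```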

5. OrderedDict

    OrderedDict is a dictionary subclass that remembers the order in which its contents were added.

>>> import collections
>>> d1 = {}
>>> d1['a'] = 'A'
>>> d1['b'] = 'B'
>>> d1['c'] = 'C'
>>> d2 = collections.OrderedDict()
>>> d2['a'] = 'A'
>>> d2['b'] = 'B'
>>> d2['c'] = 'C'
>>> d1
{'a': 'A', 'c': 'C', 'b': 'B'}
>>> d2
OrderedDict([('a', 'A'), ('b', 'B'), ('c', 'C')])
    Insertion order is also taken into account when testing OrderedDict instances for equality:
>>> import collections
>>> d1 = {}
>>> d1['a'] = 'A'
>>> d1['b'] = 'B'
>>> d1['c'] = 'C'
>>> d2 = {}
>>> d2['c'] = 'C'
>>> d2['b'] = 'B'
>>> d2['a'] = 'A'
>>> d1 == d2
True
>>> d1 = collections.OrderedDict()
>>> d1['a'] = 'A'
>>> d1['b'] = 'B'
>>> d1['c'] = 'C'
>>> d2 = collections.OrderedDict()
>>> d2['c'] = 'C'
>>> d2['b'] = 'B'
>>> d2['a'] = 'A'
>>> d1 == d2
False
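The same comparison can be written compactly in Python 3 syntax. This sketch also adds one case the session above does not cover: comparing an OrderedDict with a plain dict, which ignores order.

```python
import collections

d1 = {'a': 'A', 'b': 'B'}
d2 = {'b': 'B', 'a': 'A'}

o1 = collections.OrderedDict([('a', 'A'), ('b', 'B')])
o2 = collections.OrderedDict([('b', 'B'), ('a', 'A')])

dicts_equal = (d1 == d2)     # plain dicts ignore order when comparing
ordered_equal = (o1 == o2)   # OrderedDicts compare order as well
mixed_equal = (o1 == d2)     # OrderedDict vs. plain dict ignores order

print(dicts_equal, ordered_equal, mixed_equal)
```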

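3. array---Sequence of Fixed-Type Data

Purpose: manage sequences of fixed-type numeric data efficiently

Python version: 1.4 and later

The array module defines a sequence data structure that is similar to list, except that all members must be of the same primitive type.

    An array is instantiated with an argument describing the type of data allowed, and optionally an initial sequence of data to store in the array. The operations supported by array include slicing, iteration, and appending elements to the end.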

import array

a = array.array('i', range(3))
print 'Initial:', a

a.extend(range(4, 6))
print 'Extended:', a

print 'Slice:', a[2:5]

print 'Iterator:'
print list(enumerate(a))
    The interpreter output is as follows:
>>> 
Initial: array('i', [0, 1, 2])
Extended: array('i', [0, 1, 2, 4, 5])
Slice: array('i', [2, 4, 5])
Iterator:
[(0, 0), (1, 1), (2, 2), (3, 4), (4, 5)]
    The contents of an array can be written to and read from files using efficient built-in methods written for that purpose:
import array
import binascii
import tempfile

a = array.array('i', range(5))
print 'A1:', a

output = open('test.txt', 'wb')  # binary mode: tofile() writes raw bytes
a.tofile(output)
output.flush()
output.close()

with open('test.txt', 'rb') as input:
    raw_data = input.read()
    print 'Raw Contents:', binascii.hexlify(raw_data)

    input.seek(0)
    a2 = array.array('i')
    a2.fromfile(input, len(a))
    print 'A2:', a2
    The interpreter output is as follows:
>>> 
A1: array('i', [0, 1, 2, 3, 4])
Raw Contents: 0000000001000000020000000300000004000000
A2: array('i', [0, 1, 2, 3, 4])
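The same round trip can be sketched without a file. This version uses Python 3 syntax and the tobytes()/frombytes() methods, which produce and consume the raw machine representation of the array in memory.

```python
import array

a = array.array('i', range(5))
raw = a.tobytes()      # the raw machine representation of the items

b = array.array('i')
b.frombytes(raw)       # rebuild an array from the raw bytes

# Every item occupies exactly itemsize bytes, with no per-item overhead.
print(len(raw) == a.itemsize * len(a))
print(a == b)
```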
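    If the data in the array is not in the native byte order, or its byte order needs to be swapped before it is sent to a system with a different byte order (or over the network), the entire array can be converted from Python without iterating over each element: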
import array
import binascii

def to_hex(a):
    # Each byte is two hex characters, so one item spans itemsize * 2 characters
    chars_per_item = a.itemsize * 2
    hex_version = binascii.hexlify(a)
    num_chunks = len(hex_version) // chars_per_item
    for i in range(num_chunks):
        start = i * chars_per_item
        end = start + chars_per_item
        yield hex_version[start:end]

a1 = array.array('i', range(5))
a2 = array.array('i', range(5))
a2.byteswap()

fmt = '%10s %10s %10s %10s'
print fmt % ('A1 hex', 'A1', 'A2 hex', 'A2')
print fmt % (('-' * 10, ) * 4)
for values in zip(to_hex(a1), a1, to_hex(a2), a2):
    print fmt % values
    The interpreter output is as follows (run on a 64-bit system, with 4-byte little-endian integers):
>>> 
    A1 hex         A1     A2 hex         A2
---------- ---------- ---------- ----------
  00000000          0   00000000          0
  01000000          1   00000001   16777216
  02000000          2   00000002   33554432
  03000000          3   00000003   50331648
  04000000          4   00000004   67108864

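4. heapq---Heap Sort Algorithm

    A max-heap ensures that a parent is greater than or equal to both of its children. A min-heap requires that a parent be less than or equal to its children. Python's heapq module implements a min-heap.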

heapq_heapdata.py:

data = [19, 9, 4, 10, 11]
heapq_showtree.py:
import math
from cStringIO import StringIO

def show_tree(tree, total_width=36, fill=' '):
    """Pretty-print a tree."""
    output = StringIO()
    last_row = -1
    for i, n in enumerate(tree):
        if i:
            row = int(math.floor(math.log(i + 1, 2)))
        else:
            row = 0
        if row != last_row:
            output.write('\n')
        columns = 2 ** row
        col_width = int(math.floor((total_width * 1.0) / columns))
        output.write(str(n).center(col_width, fill))
        last_row = row
    print output.getvalue()
    print '-' * total_width
    print
    return
    There are two basic ways to create a heap: heappush() and heapify():
import heapq
from heapq_showtree import show_tree
from heapq_heapdata import data

heap = []
print 'random:', data
print

for n in data:
    print 'add %3d:' % n
    heapq.heappush(heap, n)
    show_tree(heap)
    The interpreter output is as follows:
>>> 
random: [19, 9, 4, 10, 11]

add  19:

                 19                 
------------------------------------

add   9:

                 9                  
        19        
------------------------------------

add   4:

                 4                  
        19                9         
------------------------------------

add  10:

                 4                  
        10                9         
    19   
------------------------------------

add  11:

                 4                  
        10                9         
    19       11   
------------------------------------
    When the data is already in a list, heapify() is more efficient because it rearranges the list in place:
import heapq
from heapq_showtree import show_tree
from heapq_heapdata import data

print 'random   :', data
heapq.heapify(data)
print 'heapified:'
show_tree(data)
    The interpreter output is as follows:
>>>
random   : [19, 9, 4, 10, 11]
heapified:

                 4                  
        9                 19        
    10       11   
------------------------------------
    Once the heap is organized correctly, use heappop() to remove the element with the smallest value:
import heapq
from heapq_showtree import show_tree
from heapq_heapdata import data

print 'random   :', data
heapq.heapify(data)
print 'heapified:'
show_tree(data)
print

for i in range(2):
    smallest = heapq.heappop(data)
    print 'pop  %3d:' % smallest
    show_tree(data)
    The interpreter output is as follows:
>>> 
random   : [19, 9, 4, 10, 11]
heapified:

                 4                  
        9                 19        
    10       11   
------------------------------------


pop    4:

                 9                  
        10                19        
    11   
------------------------------------

pop    9:

                 10                 
        11                19        
------------------------------------
    heapreplace() removes the smallest element and replaces it with a new value in a single operation:
import heapq
from heapq_showtree import show_tree
from heapq_heapdata import data

heapq.heapify(data)
print 'start:'
show_tree(data)
print

for i in [0, 13]:
    smallest = heapq.heapreplace(data, i)
    print 'replace %2d with %2d:' % (smallest, i)
    show_tree(data)
    The interpreter output is as follows:
>>> 
start:

                 4                  
        9                 19        
    10       11   
------------------------------------


replace  4 with  0:

                 0                  
        9                 19        
    10       11   
------------------------------------

replace  0 with 13:

                 9                  
        10                19        
    13       11   
------------------------------------
    heapq also includes two functions that examine an iterable to find a range of the largest or smallest values it contains:
import heapq
from heapq_showtree import show_tree
from heapq_heapdata import data

print 'all      :', data
print '3 largest:', heapq.nlargest(3, data)
print 'from sort:', list(reversed(sorted(data)[-3:]))
print '3 smallest:', heapq.nsmallest(3, data)
print 'from sort :', sorted(data)[:3]
    The interpreter output is as follows:
>>> 
all      : [19, 9, 4, 10, 11]
3 largest: [19, 11, 10]
from sort: [19, 11, 10]
3 smallest: [4, 9, 10]
from sort : [4, 9, 10]
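Putting heapify() and heappop() together gives a simple heap sort. The sketch below is in Python 3 syntax and the function name heap_sort is made up for illustration. It copies the input so the caller's list is left untouched.

```python
import heapq

def heap_sort(iterable):
    """Return a new sorted list by heapifying a copy and popping until empty."""
    heap = list(iterable)
    heapq.heapify(heap)   # rearrange the copy into heap order in place
    return [heapq.heappop(heap) for _ in range(len(heap))]

data = [19, 9, 4, 10, 11]
result = heap_sort(data)

print(result)
print(heapq.nsmallest(3, data) == result[:3])
```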

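5. bisect---Maintain Lists in Sorted Order

    The bisect module implements an algorithm for inserting elements into a list while keeping the list sorted. Elements are inserted into a list with insort().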

import bisect
import random

random.seed(1)

print 'New Pos Contents'
print '--- --- --------'

lst = []
for i in range(1, 15):
    r = random.randint(1, 100)
    # Use bisect() to find the position where r would be inserted
    position = bisect.bisect(lst, r)
    # Insert r into the list
    bisect.insort(lst, r)
    print '%3d %3d' % (r, position), lst
    The interpreter output is as follows:
>>> 
New Pos Contents
--- --- --------
 14   0 [14]
 85   1 [14, 85]
 77   1 [14, 77, 85]
 26   1 [14, 26, 77, 85]
 50   2 [14, 26, 50, 77, 85]
 45   2 [14, 26, 45, 50, 77, 85]
 66   4 [14, 26, 45, 50, 66, 77, 85]
 79   6 [14, 26, 45, 50, 66, 77, 79, 85]
 10   0 [10, 14, 26, 45, 50, 66, 77, 79, 85]
  3   0 [3, 10, 14, 26, 45, 50, 66, 77, 79, 85]
 84   9 [3, 10, 14, 26, 45, 50, 66, 77, 79, 84, 85]
 44   4 [3, 10, 14, 26, 44, 45, 50, 66, 77, 79, 84, 85]
 77   9 [3, 10, 14, 26, 44, 45, 50, 66, 77, 77, 79, 84, 85]
  1   0 [1, 3, 10, 14, 26, 44, 45, 50, 66, 77, 77, 79, 84, 85]
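    insort() is actually an alias for insort_right(), which inserts a new value after existing equal values. insort_left() inserts the new value before existing equal values instead.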
import bisect
import random

random.seed(1)

print 'New Pos Contents'
print '--- --- --------'

lst = []
for i in range(1, 15):
    r = random.randint(1, 100)
    position = bisect.bisect_left(lst, r)
    bisect.insort_left(lst, r)
    print '%3d %3d' % (r, position), lst
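The difference between the left and right variants only shows up when the value is already in the list. A minimal sketch in Python 3 syntax, using a short hand-written list rather than random data:

```python
import bisect

lst = [14, 26, 77, 85]

left = bisect.bisect_left(lst, 77)     # index of the existing 77
right = bisect.bisect_right(lst, 77)   # index just past the existing 77
print(left, right)

# Either insertion keeps the list sorted; only the position among equals differs.
bisect.insort_left(lst, 77)
print(lst)
```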

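6. Queue---Thread-Safe FIFO Implementation

Purpose: provide a thread-safe FIFO implementation

Python version: at least 1.4

    The Queue module provides a first-in, first-out data structure suitable for multithreaded programming. It can be used to pass messages or other data safely between producer and consumer threads. Locking is handled for the caller, so multiple threads can safely work with the same Queue instance. The size of a Queue (the number of elements it contains) may be restricted to limit memory use or processing.

1. Basic FIFO Queue

    Elements are added to one end of the sequence with put() and removed from the other end with get().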

import Queue

q = Queue.Queue()

for i in range(5):
    q.put(i)

while not q.empty():
    print q.get(),
print
    The interpreter output is as follows:
>>> 
0 1 2 3 4

2. LIFO Queue

    LifoQueue uses last-in, first-out ordering:

import Queue

q = Queue.LifoQueue()

for i in range(5):
    q.put(i)

while not q.empty():
    print q.get(),
print
    The interpreter output is as follows:
>>> 
4 3 2 1 0

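3. Priority Queue

    In a priority queue, the processing order of the elements is determined by characteristics of those elements rather than by the order in which they were added.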

import Queue
import threading

class Job(object):
    def __init__(self, priority, description):
        self.priority = priority
        self.description = description
        print 'New job:', description
    def __cmp__(self, other):
        return cmp(self.priority, other.priority)

q = Queue.PriorityQueue()
q.put(Job(3, 'Mid-level job'))
q.put(Job(10, 'Low-level job'))
q.put(Job(1, 'Important job'))

def process_job(q):
    while True:
        next_job = q.get()
        # Build the whole line with 'A %s' % B rather than printing 'A ', B: with multiple threads, the second form can interleave output
        print 'Processing job:%s\n' % next_job.description
        q.task_done()

workers = [threading.Thread(target=process_job, args = (q,)),
           threading.Thread(target=process_job, args = (q,)),]

for w in workers:
    w.setDaemon(True)
    w.start()

q.join()
    The interpreter output is as follows:
>>> 
New job: Mid-level job
New job: Low-level job
New job: Important job
Processing job:Important job
Processing job:Mid-level job


Processing job:Low-level job
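In Python 3 the module is named queue and __cmp__() is no longer supported, so a common alternative is to put (priority, item) tuples on the queue. The following single-threaded sketch is an adaptation for illustration, not the code used above; it shows only the ordering.

```python
import queue

q = queue.PriorityQueue()

# (priority, description) tuples compare by priority first.
q.put((3, 'Mid-level job'))
q.put((10, 'Low-level job'))
q.put((1, 'Important job'))

order = []
while not q.empty():
    priority, description = q.get()
    order.append(description)
    q.task_done()

print(order)
```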

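7. struct---Binary Data Structures

Purpose: convert between strings and binary data

Python version: 1.4 and later

    The struct module includes functions for converting between strings of bytes and native Python data types such as numbers and strings.

1. Packing and Unpacking

    Struct supports packing data into strings and unpacking data from strings using format specifiers, which are made up of characters representing the data type plus optional count and endianness indicators.

We use pack() to pack data and unpack() to unpack it: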

import struct
import binascii

values = (1, 'ab', 2.7)
s = struct.Struct('I 2s f')
packed_data = s.pack(*values)
print 'Original values:', values
print 'Format string  :', s.format
print 'Uses           :', s.size, 'bytes'
print 'Packed Value   :', binascii.hexlify(packed_data)

packed_data = binascii.unhexlify(binascii.hexlify(packed_data))
unpacked_data = s.unpack(packed_data)
print 'Unpacked Values:', unpacked_data

    The interpreter output is as follows:

>>> 
Original values: (1, 'ab', 2.7)
Format string  : I 2s f
Uses           : 12 bytes
Packed Value   : 0100000061620000cdcc2c40
Unpacked Values: (1, 'ab', 2.700000047683716)

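2. Endianness

    By default, values are encoded using the native C library's notion of endianness. This default is easily overridden by providing an explicit endianness directive in the format string: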

import struct
import binascii

values = (1, 'ab', 2.7)
print 'Original values:', values
endianness = [
    ('@', 'native, native'),
    ('=', 'native, standard'),
    ('<', 'little-endian'),
    ('>', 'big-endian'),
    ('!', 'network'),
    ]

for code, name in endianness:
    s = struct.Struct(code + ' I 2s f')
    packed_data = s.pack(*values)
    print
    print 'Format string    :', s.format, 'for', name
    print 'Uses             :', s.size, 'bytes'
    print 'Packed Value     :', binascii.hexlify(packed_data)
    print 'Unpacked Value   :', s.unpack(packed_data)
    The interpreter output is as follows:
>>> 
Original values: (1, 'ab', 2.7)

Format string    : @ I 2s f for native, native
Uses             : 12 bytes
Packed Value     : 0100000061620000cdcc2c40
Unpacked Value   : (1, 'ab', 2.700000047683716)

Format string    : = I 2s f for native, standard
Uses             : 10 bytes
Packed Value     : 010000006162cdcc2c40
Unpacked Value   : (1, 'ab', 2.700000047683716)

Format string    : < I 2s f for little-endian
Uses             : 10 bytes
Packed Value     : 010000006162cdcc2c40
Unpacked Value   : (1, 'ab', 2.700000047683716)

Format string    : > I 2s f for big-endian
Uses             : 10 bytes
Packed Value     : 000000016162402ccccd
Unpacked Value   : (1, 'ab', 2.700000047683716)

Format string    : ! I 2s f for network
Uses             : 10 bytes
Packed Value     : 000000016162402ccccd
Unpacked Value   : (1, 'ab', 2.700000047683716)
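The size difference between 12 and 10 bytes in the output above comes from alignment padding in native mode. A small sketch in Python 3 syntax (where string fields are bytes) makes this visible with struct.calcsize(). The native size depends on the platform, so the code only prints it rather than assuming a value.

```python
import struct

# Standard sizes: no padding is inserted between fields.
standard = struct.calcsize('=I2sf')

# Native mode may add padding so that the float is aligned.
native = struct.calcsize('@I2sf')

print(standard, native)

# An explicit byte order gives the same layout on every machine.
packed = struct.pack('>I2sf', 1, b'ab', 2.7)
unpacked = struct.unpack('>I2sf', packed)
print(len(packed), unpacked)
```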

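3. Buffers

    Binary packed data is typically handled in performance-sensitive situations or when passing data into and out of extension modules. These situations can be optimized by avoiding the overhead of allocating a new buffer for each packed structure. The pack_into() and unpack_from() methods support writing to and reading from preallocated buffers directly.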

import struct
import binascii

s = struct.Struct('I 2s f')
values = (1, 'ab', 2.7)
print 'Original:', values

print
print 'ctypes string buffer'

import ctypes
b = ctypes.create_string_buffer(s.size)
print 'Before   :', binascii.hexlify(b.raw)
s.pack_into(b, 0, *values)
print 'After    :', binascii.hexlify(b.raw)
print 'Unpacked:', s.unpack_from(b, 0)
print
print 'array'

import array
a = array.array('c', '\0' * s.size)
print 'Before   :', binascii.hexlify(a)
s.pack_into(a, 0, *values)
print 'After    :', binascii.hexlify(a)
print 'Unpacked :', s.unpack_from(a, 0)
    The interpreter output is as follows:
>>> 
Original: (1, 'ab', 2.7)

ctypes string buffer
Before   : 000000000000000000000000
After    : 0100000061620000cdcc2c40
Unpacked: (1, 'ab', 2.700000047683716)

array
Before   : 000000000000000000000000
After    : 0100000061620000cdcc2c40
Unpacked : (1, 'ab', 2.700000047683716)
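A bytearray is another writable buffer that works with these methods. The sketch below (Python 3 syntax, with made-up record values) packs two records into one preallocated buffer at different offsets. The values 2.5 and 4.0 are chosen because they are exactly representable as 32-bit floats.

```python
import struct

s = struct.Struct('<I2sf')    # explicit little-endian, no padding
buf = bytearray(s.size * 2)   # preallocated room for two records

s.pack_into(buf, 0, 1, b'ab', 2.5)        # first record at offset 0
s.pack_into(buf, s.size, 2, b'cd', 4.0)   # second record right after it

first = s.unpack_from(buf, 0)
second = s.unpack_from(buf, s.size)
print(first, second)
```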

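8. weakref---Impermanent References to Objects

Purpose: refer to an "expensive" object while allowing the garbage collector to reclaim its memory if no other non-weak references remain.

Python version: 2.1 and later

    The weakref module supports weak references to objects. A normal reference increments the reference count on the object and prevents it from being garbage collected. That is not always desirable, for example when a circular reference might be present or when building a cache of objects that should be deleted when memory is needed. A weak reference is a handle to an object that does not keep it from being cleaned up automatically.

1. References

    Weak references to objects are managed through the ref class. To retrieve the original object, call the reference object.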

import weakref

class ExpensiveObject(object):
    def __del__(self):
        print '(Deleting %s)' % self

obj = ExpensiveObject()
r = weakref.ref(obj)

print 'obj:', obj
print 'ref:', r
print 'r():', r()

print 'deleting obj'
del obj
# The referent has been destroyed, so calling the weak reference now returns None instead of the object
print 'r():', r()
    The interpreter output is as follows:
>>> 
obj: <__main__.ExpensiveObject object at 0x0000000002CE07B8>
ref: 
r(): <__main__.ExpensiveObject object at 0x0000000002CE07B8>
deleting obj
(Deleting <__main__.ExpensiveObject object at 0x0000000002CE07B8>)
r(): None

2. Reference Callbacks

    The ref constructor accepts an optional callback function, which is invoked when the referenced object is deleted:

import weakref

class ExpensiveObject(object):
    def __del__(self):
        print '(Deleting %s)' % self

def callback(reference):
    """Invoked when referenced object is deleted"""
    print 'callback(', reference, ')'
    
obj = ExpensiveObject()
r = weakref.ref(obj, callback)

print 'obj:', obj
print 'ref:', r
print 'r():', r()

print 'deleting obj'
del obj
print 'r():', r()
    The interpreter output is as follows:
>>> 
obj: <__main__.ExpensiveObject object at 0x0000000002C50828>
ref: 
r(): <__main__.ExpensiveObject object at 0x0000000002C50828>
deleting obj
callback(  )
(Deleting <__main__.ExpensiveObject object at 0x0000000002C50828>)
r(): None

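3. Proxies

    A proxy is more convenient than a weak reference because it can be used as though it were the original object, without being called first. It is still only a reference, though, not the real object: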

import weakref

class ExpensiveObject(object):
    def __init__(self, name):
        self.name = name
    def __del__(self):
        print '(Deleting %s)' % self
    
obj = ExpensiveObject('My Object')
r = weakref.ref(obj)
p = weakref.proxy(obj)

print 'via obj:', obj.name
print 'via ref:', r().name
print 'via proxy:', p.name
del obj
print 'via proxy:', p.name
    The interpreter output is as follows:
>>> 
via obj: My Object
via ref: My Object
via proxy: My Object
(Deleting <__main__.ExpensiveObject object at 0x0000000002BC07B8>)
via proxy:

Traceback (most recent call last):
  File "C:\Python27\test.py", line 17, in <module>
    print 'via proxy:', p.name
ReferenceError: weakly-referenced object no longer exists

4. Cyclic References

    One use for weak references is to allow cyclic references without preventing garbage collection.

weakref_graph.py:

import gc
from pprint import pprint
import weakref

class Graph(object):
    def __init__(self, name):
        self.name = name
        self.other = None
    def set_next(self, other):
        print '%s.set_next(%r)' % (self.name, other)
        self.other = other
    def all_nodes(self):
        "Generate the nodes in the graph sequence."
        yield self
        n = self.other
        while n and n.name != self.name:
            yield n
            n = n.other
        if n is self:
            yield n
        return
    def __str__(self):
        return '->'.join(n.name for n in self.all_nodes())
    def __repr__(self):
        return '<%s at 0x%x name=%s>' % (self.__class__.__name__, id(self), self.name)
    def __del__(self):
        print '(Deleting %s)' % self.name
        self.set_next(None)
def collect_and_show_garbage():
    "Show what garbage is present."
    print 'Collecting...'
    n = gc.collect()
    print 'Unreachable objects:', n
    print 'Garbage:',
    pprint(gc.garbage)
def demo(graph_factory):
    print 'Set up graph:'
    one = graph_factory('one')
    two = graph_factory('two')
    three = graph_factory('three')
    one.set_next(two)
    two.set_next(three)
    three.set_next(one)

    print
    print 'Graph:'
    print str(one)
    collect_and_show_garbage()

    print
    three = None
    two = None
    print 'After 2 references removed:'
    print str(one)
    collect_and_show_garbage()

    print
    print 'Removing last reference:'
    one = None
    collect_and_show_garbage()
weakref_cycle.py:
import gc
from pprint import pprint
import weakref

from weakref_graph import Graph, demo, collect_and_show_garbage

gc.set_debug(gc.DEBUG_LEAK)

print 'Setting up the cycle'
print
demo(Graph)
print
print 'Breaking the cycle and cleaning up garbage'
print
gc.garbage[0].set_next(None)
while gc.garbage:
    del gc.garbage[0]
print
collect_and_show_garbage()
    The interpreter output is as follows:
>>> 
Setting up the cycle

Set up graph:
one.set_next(two->three->one
Collecting...
Unreachable objects: 0
Garbage:[]

After 2 references removed:
one->two->three->one
Collecting...
Unreachable objects: 0
Garbage:[]

Removing last reference:
Collecting...
gc: uncollectable 
gc: uncollectable 
gc: uncollectable 
gc: uncollectable 
gc: uncollectable 
gc: uncollectable 
Unreachable objects: 6
Garbage:[
    A proxy can be used so that the cycle no longer prevents collection:
import gc
from pprint import pprint
import weakref

from weakref_graph import Graph, demo

class WeakGraph(Graph):
    def set_next(self, other):
        if other is not None:
            if self in other.all_nodes():
                other = weakref.proxy(other)
        super(WeakGraph, self).set_next(other)
        return
demo(WeakGraph)
    The interpreter output is as follows:
>>> 
Set up graph:
one.set_next()

Graph:
one->two->three
Collecting...
Unreachable objects: 0
Garbage:[]

After 2 references removed:
one->two->three
Collecting...
Unreachable objects: 0
Garbage:[]

Removing last reference:
(Deleting one)
one.set_next(None)
(Deleting two)
two.set_next(None)
(Deleting three)
three.set_next(None)
Collecting...
Unreachable objects: 0
Garbage:[]

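5. Caching Objects

    WeakValueDictionary holds weak references to the values it stores, allowing them to be garbage collected when other code is no longer using them. The example below calls the garbage collector explicitly to show the difference in memory handling between a regular dictionary and a WeakValueDictionary.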


import gc
from pprint import pprint
import weakref

gc.set_debug(gc.DEBUG_LEAK)

class ExpensiveObject(object):
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return 'ExpensiveObject(%s)' % self.name
    def __del__(self):
        print '     (Deleting %s)' % self

def demo(cache_factory):
    all_refs = {}
    print 'CACHE TYPE:', cache_factory
    cache = cache_factory()
    for name in ['one', 'two', 'three']:
        o = ExpensiveObject(name)
        cache[name] = o
        all_refs[name] = o
        del o
    print '     all_refs =',
    pprint(all_refs)
    print '\n   Before, cache contains:', cache.keys()
    for name, value in cache.items():
        print '     %s = %s' % (name, value)
        del value

    print '\n   Cleanup:'
    del all_refs
    gc.collect()

    print '\n   After, cache contains:', cache.keys()
    for name, value in cache.items():
        print '     %s = %s' % (name, value)
    print '     demo returning'
    return
demo(dict)
print

demo(weakref.WeakValueDictionary)
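    Any loop variables that refer to the cached values must be cleared explicitly so that the reference count of the object is decremented. Otherwise, the garbage collector will not remove the objects and they will remain in the cache. Similarly, the all_refs variable is used to hold references and prevent them from being garbage collected prematurely.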



>>> 
CACHE TYPE: <type 'dict'>
     all_refs ={'one': ExpensiveObject(one),
 'three': ExpensiveObject(three),
 'two': ExpensiveObject(two)}
 
   Before, cache contains: ['three', 'two', 'one']
     three = ExpensiveObject(three)
     two = ExpensiveObject(two)
     one = ExpensiveObject(one)

   Cleanup:

   After, cache contains: ['three', 'two', 'one']
     three = ExpensiveObject(three)
     two = ExpensiveObject(two)
     one = ExpensiveObject(one)
     demo returning
     (Deleting ExpensiveObject(three))
     (Deleting ExpensiveObject(two))
     (Deleting ExpensiveObject(one))

CACHE TYPE: weakref.WeakValueDictionary
     all_refs ={'one': ExpensiveObject(one),
 'three': ExpensiveObject(three),
 'two': ExpensiveObject(two)}
 
   Before, cache contains: ['three', 'two', 'one']
     three = ExpensiveObject(three)
     two = ExpensiveObject(two)
     one = ExpensiveObject(one)

   Cleanup:
     (Deleting ExpensiveObject(three))
     (Deleting ExpensiveObject(two))
     (Deleting ExpensiveObject(one))

   After, cache contains: []
     demo returning
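The essential behaviour can be reduced to a few lines. This is a minimal sketch in Python 3 syntax, assuming CPython's reference counting; the class Expensive and the variable names are made up for the example. An entry disappears from the cache as soon as the last strong reference to its value goes away.

```python
import gc
import weakref

class Expensive:
    """Hypothetical stand-in for a costly object."""
    def __init__(self, name):
        self.name = name

cache = weakref.WeakValueDictionary()
strong = {name: Expensive(name) for name in ('one', 'two')}
for name, value in strong.items():
    cache[name] = value
del name, value   # drop loop variables so they do not hold a reference

before = sorted(cache.keys())
del strong['one']   # remove the only strong reference to 'one'
gc.collect()        # not required on CPython, but harmless
after = sorted(cache.keys())

print(before, after)
```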


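9. copy---Duplicate Objects

Purpose: provide functions for duplicating objects using shallow or deep copy semantics

Python version: 1.4 and later

    The copy module includes two functions, copy() and deepcopy(), for duplicating existing objects.

1. Shallow Copies

    copy() creates a shallow copy: a new container populated with references to the contents of the original object: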


import copy

class MyClass:
    def __init__(self, name):
        self.name = name
    def __cmp__(self, other):
        return cmp(self.name, other.name)

a = MyClass('a')
my_list = [a]
dup = copy.copy(my_list)

print [id(x) for x in [my_list, dup]]
print [id(y) for x in [my_list, dup] for y in x]
    The interpreter output is as follows:



>>> 
[44632008L, 44573384L]
[44573512L, 44573512L]


2. Deep Copies

    A deep copy is an entirely new copy in which the contents are copied as well.


import copy

class MyClass:
    def __init__(self, name):
        self.name = name
    def __cmp__(self, other):
        return cmp(self.name, other.name)

a = MyClass('a')
my_list = [a]
dup = copy.deepcopy(my_list)

print [id(x) for x in [my_list, dup]]
print [id(y) for x in [my_list, dup] for y in x]
    The interpreter output is as follows:



>>> 
[45615048L, 36209224L]
[45556552L, 45556424L]
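The id() values above show which objects are shared. The practical consequence is easier to see by mutating a nested object, as in this short sketch (Python 3 syntax, with made-up data):

```python
import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)
deep = copy.deepcopy(original)

original[0].append(99)   # mutate a nested list through the original

print(shallow[0])   # the shallow copy shares the inner list, so it changes
print(deep[0])      # the deep copy has its own inner list, so it does not
```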


3. Customizing Copy Behavior

    The __copy__() and __deepcopy__() methods can be defined to customize how an object is copied:


import copy

class MyClass:
    def __init__(self, name):
        self.name = name
    def __cmp__(self, other):
        return cmp(self.name, other.name)
    def __copy__(self):
        print '__copy__()'
        return MyClass(self.name)
    def __deepcopy__(self, memo):
        print '__deepcopy__(%s)' % str(memo)
        return MyClass(copy.deepcopy(self.name, memo))

a = MyClass('a')

sc = copy.copy(a)
dc = copy.deepcopy(a)
    The interpreter output is as follows:



>>> 
__copy__()
__deepcopy__({})


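4. Recursion in Deep Copy

    To avoid problems with duplicating recursive data structures, deepcopy() uses a dictionary to track objects that have already been copied. This dictionary is passed to the __deepcopy__() method so it can be examined there as well:

Note: I do not fully understand this code.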


import copy
import pprint

class Graph:
    def __init__(self, name, connections):
        self.name = name
        self.connections = connections

    def add_connection(self, other):
        self.connections.append(other)

    def __repr__(self):
        return 'Graph(name=%s, id=%s)' % (self.name, id(self))

    def __deepcopy__(self, memo):
        print '\nCalling __deepcopy__ for %r' % self
        if self in memo:
            existing = memo.get(self)
            print '     Already copied to %r' % existing
            return existing
        print '     Memo dictionary:'
        pprint.pprint(memo, indent=4, width=40)
        dup = Graph(copy.deepcopy(self.name, memo), [])
        print '     Copying to new object %s' % dup
        memo[self] = dup
        for c in self.connections:
            dup.add_connection(copy.deepcopy(c, memo))
        return dup

root = Graph('root', [])
a = Graph('a', [root])
b = Graph('b', [a, root])
root.add_connection(a)
root.add_connection(b)

dup = copy.deepcopy(root)
    The interpreter output is as follows:



>>> 

Calling __deepcopy__ for Graph(name=root, id=45364872)
     Memo dictionary:
{   }
     Copying to new object Graph(name=root, id=45359816)

Calling __deepcopy__ for Graph(name=a, id=45363848)
     Memo dictionary:
{   Graph(name=root, id=45364872): Graph(name=root, id=45359816),
    34200192L: 'root',
    46032552L: ['root']}
     Copying to new object Graph(name=a, id=45361160)

Calling __deepcopy__ for Graph(name=root, id=45364872)
     Already copied to Graph(name=root, id=45359816)

Calling __deepcopy__ for Graph(name=b, id=45365512)
     Memo dictionary:
{   Graph(name=a, id=45363848): Graph(name=a, id=45361160),
    Graph(name=root, id=45364872): Graph(name=root, id=45359816),
    33255512L: 'a',
    34200192L: 'root',
    45363848L: Graph(name=a, id=45361160),
    45364872L: Graph(name=root, id=45359816),
    46032552L: [   'root',
                   'a',
                   Graph(name=root, id=45364872),
                   Graph(name=a, id=45363848)]}
     Copying to new object Graph(name=b, id=45331720)
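A smaller case may make the role of the memo dictionary easier to see. The sketch below (Python 3 syntax, with made-up dictionaries standing in for graph nodes) copies a structure that refers back to itself. Because deepcopy() remembers what it has already copied, the cycle is reproduced inside the copy instead of being followed forever.

```python
import copy

node = {'name': 'root', 'children': []}
child = {'name': 'a', 'parent': node}
node['children'].append(child)   # node -> child -> node is a cycle

dup = copy.deepcopy(node)

# The copy is a separate object...
print(dup is not node)
# ...and its child points back to the copy, not to the original.
print(dup['children'][0]['parent'] is dup)
```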


10. pprint---Pretty-Print Data Structures

Purpose: pretty-print data structures

Python version: 1.4 and later

Test data, pprint_data.py:


data = [(1, {'a' : 'A', 'b' : 'B', 'c' : 'C', 'd' : 'D'}),
        (2, {'e' : 'E', 'f' : 'F', 'g' : 'G', 'h' : 'H',
             'i' : 'I', 'j' : 'J', 'k' : 'K', 'l' : 'L'}),]


1. Printing


from pprint import pprint

from pprint_data import data

print 'PRINT:'
print data
print
print 'PPRINT:'
pprint(data)
    The interpreter output is as follows:



>>> 
PRINT:
[(1, {'a': 'A', 'c': 'C', 'b': 'B', 'd': 'D'}), (2, {'e': 'E', 'g': 'G', 'f': 'F', 'i': 'I', 'h': 'H', 'k': 'K', 'j': 'J', 'l': 'L'})]

PPRINT:
[(1, {'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D'}),
 (2,
  {'e': 'E',
   'f': 'F',
   'g': 'G',
   'h': 'H',
   'i': 'I',
   'j': 'J',
   'k': 'K',
   'l': 'L'})]


2. Formatting

    To format a data structure without writing it directly to a stream, use pformat() to build a string representation.


from pprint import pformat
from pprint_data import data
import logging

logging.basicConfig(level=logging.DEBUG, format='%(levelname)-8s %(message)s',)

logging.debug('Logging pformatted data')
formatted = pformat(data)
for line in formatted.splitlines():
    logging.debug(line.rstrip())
    The interpreter output is as follows:



>>> 
DEBUG    Logging pformatted data
DEBUG    [(1, {'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D'}),
DEBUG     (2,
DEBUG      {'e': 'E',
DEBUG       'f': 'F',
DEBUG       'g': 'G',
DEBUG       'h': 'H',
DEBUG       'i': 'I',
DEBUG       'j': 'J',
DEBUG       'k': 'K',
DEBUG       'l': 'L'})]


3. Arbitrary Classes

    Output for a custom class can be controlled by defining __repr__():


from pprint import pprint

class node(object):
    def __init__(self, name, contents = []):
        self.name = name
        self.contents = contents[:]
    def __repr__(self):
        return ('node(' + repr(self.name) + ', ' +
                repr(self.contents) + ')')

trees = [node('node-1'),
         node('node-2', [node('node-2-1')]),
         node('node-3', [node('node-3-1')]),]

pprint(trees)
    The interpreter output is as follows:



>>> 
[node('node-1', []),
 node('node-2', [node('node-2-1', [])]),
 node('node-3', [node('node-3-1', [])])]


4. Recursion

    Recursive data structures are represented with a reference to the original source of the data:


>>> ll = [1, 2]
>>> ll.append(ll)
>>> ll
[1, 2, [...]]
>>> pprint(ll)
[1, 2, <Recursion on list with id=...>]


5. Limiting Nested Output

    The depth argument limits how many levels of nesting are printed:


>>> ll = [1, 2, [3, 4, [5, 6]]]
>>> pprint(ll, depth=2)
[1, 2, [3, 4, [...]]]
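To see how the depth value maps to the output, the following sketch (Python 3 syntax) formats the same list at several depths with pformat(), which returns the string instead of printing it.

```python
from pprint import pformat

nested = [1, 2, [3, 4, [5, 6]]]

# Anything nested deeper than `depth` levels is replaced by '...'.
results = {depth: pformat(nested, depth=depth) for depth in (1, 2, 3)}

for depth, text in results.items():
    print(depth, text)
```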


6. Controlling Output Width

    Use width to control the output width:


from pprint import pprint
from pprint_data import data

for width in [80, 5]:
    print 'WIDTH =', width
    pprint(data, width=width)
    print
    The interpreter output is as follows:



>>> 
WIDTH = 80
[(1, {'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D'}),
 (2,
  {'e': 'E',
   'f': 'F',
   'g': 'G',
   'h': 'H',
   'i': 'I',
   'j': 'J',
   'k': 'K',
   'l': 'L'})]

WIDTH = 5
[(1,
  {'a': 'A',
   'b': 'B',
   'c': 'C',
   'd': 'D'}),
 (2,
  {'e': 'E',
   'f': 'F',
   'g': 'G',
   'h': 'H',
   'i': 'I',
   'j': 'J',
   'k': 'K',
   'l': 'L'})]



Reposted from: https://my.oschina.net/voler/blog/381362
