about

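These are my study notes on section 7.3 of the official Python 2 documentation, "struct: Interpret strings as packed binary data".
The official documentation introduces the module as follows: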

This module performs conversions between Python values and C structs represented as Python strings. This can be used in handling binary data stored in files or from network connections, among other sources.

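Put simply, the module converts between Python values (e.g. int, float, string) and strings that hold raw binary data.

The main functions of the struct module are:

pack(fmt, v1, v2, ...)
unpack(fmt, string)
pack_into(fmt, buffer, offset, v1, v2, ...)
unpack_from(fmt, buffer, offset=0)

Each is covered in turn below.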

pack() and unpack()

pack()

First, the official description:

pack(fmt, v1, v2, ...):

Return a string containing the values v1, v2, ... packed according to the given format. The arguments must match the values required by the format exactly.

In other words, it converts the values v1, v2, ... into a string according to the format fmt.

An example:

>>> import struct
>>> 
>>> v1 = 1
>>> v2 = 'abc'
>>> packed = struct.pack('i3s', v1, v2)
>>> packed
'\x01\x00\x00\x00abc'

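The fmt here is 'i3s'. What does it mean? The i stands for an integer (a C int, 4 bytes here) and the s stands for a string. The number before s is the length of the string in bytes, not a repeat count. In the example above, 'abc' is a string of length 3, hence 3s.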
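To make the meaning of the count before s concrete, here is a minimal sketch. It assumes a '<' prefix (not used in the example above) so that the bytes are identical on every machine, and it uses b'...' literals and print() so that it should run on both Python 2.7 and Python 3. The variable names are only illustrative.

```python
import struct

# '<' pins little-endian order and standard sizes (an assumption made
# here for reproducibility; the article's example uses native mode).
exact = struct.pack('<i3s', 1, b'abc')      # the string is exactly 3 bytes
padded = struct.pack('<i5s', 1, b'abc')     # shorter than 5: padded with null bytes
truncated = struct.pack('<i2s', 1, b'abc')  # longer than 2: truncated to fit

print(repr(exact))
print(repr(padded))
print(repr(truncated))
```

The string is padded or cut to the declared length, so the packed size depends only on fmt and never on the actual string passed in.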

Here is the complete list of format characters:

fmt.png

unpack()

Again, start with the official documentation:

unpack(fmt, string)

Unpack the string (presumably packed by pack(fmt, ...)) according to the given format. The result is a tuple even if it contains exactly one item. The string must contain exactly the amount of data required by the format (len(string) must equal calcsize(fmt)).

Put simply, it parses the string according to fmt. Note that the result is always a tuple.

An example:

>>> packed = '\x01\x00\x00\x00abc'
>>> v1, v2 = struct.unpack('i3s', packed)
>>> v1
1
>>> v2
'abc'

This recovers the v1 and v2 from the pack() example.
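The documentation quoted above requires len(string) to equal calcsize(fmt). The following sketch shows what happens when it does not. It assumes the '<' prefix for a machine-independent size and uses b'...' literals so that it should run on Python 2.7 and Python 3; mismatch_detected is just an illustrative name.

```python
import struct

fmt = '<i3s'
data = struct.pack(fmt, 1, b'abc')
# 7 bytes: 4 for the int plus 3 for the string
assert len(data) == struct.calcsize(fmt)

try:
    struct.unpack(fmt, data + b'\x00')   # one byte too many
    mismatch_detected = False
except struct.error:
    mismatch_detected = True             # unpack refuses data of the wrong length

print(mismatch_detected)
```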

Note what happens when there is only one return value:

>>> a = 2
>>> a_pack = struct.pack('i',a)
>>> a_unpack = struct.unpack('i',a_pack)  # a_unpack is a tuple here
>>> a_unpack
(2,)
>>> a_unpack, = struct.unpack('i',a_pack) # a_unpack is an int here
>>> a_unpack
2

Byte Order, Size, and Alignment

A short detour into byte order, size, and alignment.

byte order

See the table below:

order.png

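If the fmt string starts with '<', the bytes are laid out in little-endian order; if it starts with '>', they are laid out in big-endian order. The default is '@', which uses the machine's native byte order, sizes, and alignment.

An example: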

>>> a = 2
>>> a_pack = struct.pack('i',a)      # default (native) order; it varies by machine, and mine is little-endian
>>> a_pack
'\x02\x00\x00\x00'
>>> 
>>> a_pack2 = struct.pack('>i',a)    # '>' means big-endian
>>> a_pack2
'\x00\x00\x00\x02'
>>> 
>>> a_pack3 = struct.pack('<i',a)    # '<' means little-endian
>>> a_pack3
'\x02\x00\x00\x00'

If you pack with an explicit '<' or '>' instead of the default byte order, you have to unpack with the same one:

>>> a = 2
>>> a_pack2 = struct.pack('>i',a)   # big-endian
>>> a_pack2
'\x00\x00\x00\x02'
>>> a_unpack, = struct.unpack('<i', a_pack2)   # little-endian
>>> a_unpack
33554432
>>> a_unpack2, = struct.unpack('>i', a_pack2)   # big-endian
>>> a_unpack2
2

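As shown above, if pack and unpack use different byte orders, mixing little-endian with big-endian, the data comes back corrupted. The bytes 00 00 00 02 read as little-endian give 0x02000000, which is 33554432.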
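The sketch below puts the byte-order prefixes side by side. It also uses '!' (network order, which is big-endian) and sys.byteorder, neither of which appears in the examples above; both are standard, but treat their use here as an illustration rather than part of the original notes.

```python
import struct
import sys

value = 2
little = struct.pack('<i', value)
big = struct.pack('>i', value)
network = struct.pack('!i', value)   # '!' is network order, i.e. big-endian

print(sys.byteorder)                 # 'little' or 'big', depending on the machine

# Unpacking with the matching prefix round-trips the value.
same_little = struct.unpack('<i', little)[0]
same_big = struct.unpack('>i', big)[0]

# Unpacking big-endian bytes as little-endian reverses the byte significance.
misread = struct.unpack('<i', big)[0]
print(misread)
```

Spelling out the prefix on both sides is the simplest way to keep data portable between machines with different native orders.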

size and alignment

struct stores data the way a C struct does, so data alignment comes into play. On a typical 32-bit machine, an int is aligned to a 4-byte boundary, and the CPU fetches memory in aligned 4-byte words.

An example:

struct A{
  char c1;
  int a;
  char c2;
};

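How much memory does struct A occupy? Intuition says 1 + 4 + 1 = 6 bytes, but it is usually 12. After the char c1 takes 1 byte, the int a cannot follow it directly because of 4-byte alignment. Instead, 3 padding bytes are implicitly added after c1 and a goes in the next 4-byte row. The char c2 goes in the row after a, and 3 more padding bytes follow it so that the total size is a multiple of 4.
Now consider this one: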

struct A{
  char c1;
  char c2;
  int a;
};

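How much memory does struct A occupy in this case? The answer is 8 bytes. By the same reasoning, c1 goes first, leaving 3 bytes in its row. The next variable, c2, needs only 1 byte, so it sits right after c1. That leaves 2 bytes, which is not enough for an int, so they are filled with padding and a starts a new row.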
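The two layouts can be checked from Python with ctypes, which follows the platform's C alignment rules. This is only a sketch: the class names CharIntChar and CharCharInt are made up for illustration, and the sizes in the comments assume a typical platform where int is 4 bytes with 4-byte alignment.

```python
import ctypes

class CharIntChar(ctypes.Structure):
    # Mirrors the first struct: char c1; int a; char c2;
    _fields_ = [('c1', ctypes.c_char),
                ('a', ctypes.c_int),
                ('c2', ctypes.c_char)]

class CharCharInt(ctypes.Structure):
    # Mirrors the second struct: char c1; char c2; int a;
    _fields_ = [('c1', ctypes.c_char),
                ('c2', ctypes.c_char),
                ('a', ctypes.c_int)]

size_first = ctypes.sizeof(CharIntChar)    # 12 on typical platforms
size_second = ctypes.sizeof(CharCharInt)   # 8 on typical platforms
print(size_first)
print(size_second)

# Field offsets show where the padding sits.
print(CharIntChar.a.offset)    # a starts after c1 plus padding
print(CharCharInt.a.offset)    # a starts after c1, c2 plus padding
```

Reordering the members so that the small ones sit together is the usual way to reduce padding.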

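Why do it this way? Doesn't it waste memory? In a sense it does, but it makes the CPU more efficient.
Picture this scenario: a row of memory already holds a 1-byte char c, and next comes a 4-byte int a. Without alignment, 3 of its bytes would go right after c and the last byte would spill into the next row.
To read a, the CPU would then need two reads, one for 3 bytes and one for 1 byte, so the access takes about twice as long. If a starts on a new row, a single read fetches all 4 bytes.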

calcsize()

With that background, it is easy to see what this function does.

The documentation says:

struct.calcsize(fmt)

Return the size of the struct (and hence of the string) corresponding to the given format.

Put simply, it computes how many bytes the struct described by fmt occupies.

An example:

>>> struct.calcsize('ci')
8
>>> struct.calcsize('ic')
5

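Per the format table above, c is a char of 1 byte and i is an int of 4 bytes, which explains the first result for the reasons already covered. 'ic' gives 5 because struct adds padding only between members and never at the end of the structure. To pad the end to a type's alignment, end the format with that type's code and a repeat count of zero, for example 'ic0i'.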
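A quick way to see the difference between interior and trailing padding is to compare the sizes directly. The sketch below uses native mode, so the numbers in the comments assume a typical platform with a 4-byte, 4-byte-aligned int; the zero-count form 'ic0i' comes from the struct documentation rather than from the original notes.

```python
import struct

size_ci = struct.calcsize('ci')      # 1 + 3 padding + 4 = 8
size_ic = struct.calcsize('ic')      # 4 + 1 = 5, no trailing padding
size_ic0i = struct.calcsize('ic0i')  # '0i' adds no data, only end alignment: 8

print(size_ci)
print(size_ic)
print(size_ic0i)
```

With the trailing '0i', the result matches what a C compiler would report with sizeof for the same members.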

The examples above all use padding. What happens without it?

>>> struct.calcsize('<ci')
5
>>> struct.calcsize('@ci')
8

If fmt starts with '<', '>', '=', or '!', no padding is added. With the default, or with an explicit '@', padding is added.
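As a small check of this rule, the sketch below computes the size of 'ci' under each of the four non-native prefixes and packs one value to show that the int follows the char immediately. The names sizes and packed_le are illustrative only.

```python
import struct

# Size of 'ci' under each prefix that disables padding.
sizes = {prefix: struct.calcsize(prefix + 'ci') for prefix in '<>=!'}
print(sorted(sizes.items()))

# With '<', the 4-byte int starts right after the 1-byte char.
packed_le = struct.pack('<ci', b'a', 1)
print(repr(packed_le))
```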

pack_into() and unpack_from()

Before getting into the details, here are a few functions to warm up with.

binascii module

This module converts between binary data and ASCII. A few of its functions are introduced below.

binascii.b2a_hex(data)
binascii.hexlify(data)

Return the hexadecimal representation of the binary data. Every byte of data is converted into the corresponding 2-digit hex representation. The resulting string is therefore twice as long as the length of data.

Put simply, it represents binary data in hexadecimal.

An example:

>>> import binascii
>>> s = 'abc'
>>> binascii.b2a_hex(s)
'616263'
>>> binascii.hexlify(s)
'616263'

binascii.a2b_hex(hexstr)
binascii.unhexlify(hexstr)

Return the binary data represented by the hexadecimal string hexstr. This function is the inverse of b2a_hex()
hexstr must contain an even number of hexadecimal digits (which can be upper or lower case), otherwise a TypeError is raised.

Put simply, it is the inverse of the function above: it turns a hexadecimal string back into binary data.

An example:

>>> binascii.a2b_hex('616263')
'abc'
>>> binascii.unhexlify('616263')
'abc'
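These functions are handy for inspecting what struct.pack() produced. The following sketch combines the two modules; the value 0xdeadbeef and the variable names are arbitrary choices for illustration, and the b'...' results in the test reflect Python 3, where hexlify returns bytes (on Python 2 the same literals are plain strings).

```python
import binascii
import struct

packed = struct.pack('>I', 0xdeadbeef)   # 4 bytes, big-endian unsigned int
hex_text = binascii.hexlify(packed)      # two hex digits per byte
restored = binascii.unhexlify(hex_text)  # back to the raw bytes

print(hex_text)
print(restored == packed)
print(struct.unpack('>I', restored)[0] == 0xdeadbeef)
```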

pack_into() and unpack_from()

The documentation says:

struct.pack_into(fmt, buffer, offset, v1, v2, ...)

Pack the values v1, v2, ... according to the given format, write the packed bytes into the writable buffer starting at offset. Note that the offset is a required argument.

Put simply, it packs the values v1, v2, ... according to fmt and writes the result into the given buffer. The offset says where in the buffer to start writing.

struct.unpack_from(fmt, buffer[, offset=0])

Unpack the buffer according to the given format. The result is a tuple even if it contains exactly one item. The buffer must contain at least the amount of data required by the format (len(buffer[offset:]) must be at least calcsize(fmt)).

Put simply, it reads from the given buffer and parses the data according to fmt. The offset says where in the buffer to start reading.

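What do these two functions add over pack and unpack? The visible difference is the extra buffer argument, a region of memory that the caller provides. pack has to create a new string for its result on every call, so the caller has no control over where the packed bytes go. When values arrive a few at a time, allocating a new string for each batch costs time. With pack_into we allocate one buffer up front, sized to exactly what we need, and write each batch into it at the right offset.

An example: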

import struct
import binascii
import ctypes

vals1 = (1, 'hello', 1.2)
vals2 = ('world', 2)
s1 = struct.Struct('I5sf')
s2 = struct.Struct('5sI')
print 's1 format: ', s1.format
print 's2 format: ', s2.format

b_buffer = ctypes.create_string_buffer(s1.size+s2.size)  # allocate one buffer big enough for both structs
print 'Before pack:',binascii.hexlify(b_buffer)
s1.pack_into(b_buffer,0,*vals1)
s2.pack_into(b_buffer,s1.size,*vals2)
print 'After pack:',binascii.hexlify(b_buffer)
print 'vals1 is:', s1.unpack_from(b_buffer,0)
print 'vals2 is:', s2.unpack_from(b_buffer,s1.size)

Output:

s1 format:  I5sf
s2 format:  5sI
Before pack: 00000000000000000000000000000000000000000000000000000000
After pack: 0100000068656c6c6f0000009a99993f776f726c6400000002000000
vals1 is: (1, 'hello', 1.2000000476837158)
vals2 is: ('world', 2)

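At first glance, the difference is that this example uses the class struct.Struct(format). The earlier examples were procedural and this one is object-oriented, but the methods do the same jobs as the functions.
One thing to note: the float comes back from unpack with a different precision. The f format is a 4-byte single-precision float, so 1.2 is rounded to the nearest value that fits, which prints as 1.2000000476837158.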
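The precision change can be isolated with a two-line round trip. This sketch compares the f format with the 8-byte d format; d is not used in the example above and is shown only for contrast.

```python
import struct

# 'f' is a 4-byte single-precision float, 'd' an 8-byte double.
single = struct.unpack('<f', struct.pack('<f', 1.2))[0]
double = struct.unpack('<d', struct.pack('<d', 1.2))[0]

print(single == 1.2)   # the value was rounded to single precision
print(double == 1.2)   # a Python float is a double, so nothing is lost
```

If the exact value has to survive the round trip, d is the safer choice at the cost of 4 extra bytes per value.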

Because vals1 and vals2 are tuples, they are passed as *vals1 and *vals2. The star unpacks the tuple into separate positional arguments. Without it, the call fails with an argument error.
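As a variation on the example above, here is a sketch that writes several records of the same shape into one preallocated buffer. It makes a few assumptions that are not in the original: a bytearray is used instead of ctypes.create_string_buffer, the '<I5s' record layout and the sample rows are hypothetical, and the code targets Python 2.7 or later (it should also run on Python 3).

```python
import struct

record = struct.Struct('<I5s')   # hypothetical record: an id and a 5-byte name
rows = [(1, b'hello'), (2, b'world'), (3, b'again')]

# One zero-filled buffer, sized for all records up front.
buf = bytearray(record.size * len(rows))

for index, row in enumerate(rows):
    # *row spreads the tuple into the v1, v2 arguments of pack_into.
    record.pack_into(buf, index * record.size, *row)

# Read every record back from its offset.
decoded = [record.unpack_from(buf, i * record.size) for i in range(len(rows))]
print(decoded == rows)
```

Each record lands at a multiple of record.size, so any single record can be read or overwritten without touching the others.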

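References

Official Python 2 documentation, 7.3. struct: Interpret strings as packed binary data
糖拌咸鱼, "A brief look at the struct module in Python"
Official Python 2 documentation, 18.14. binascii: Convert between binary and ASCII
zhangxinrun, "Natural alignment of struct members (a classic)"
郑诚's answer on Zhihu to "What does a star in front of a tuple reference mean?"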

Python study notes: the struct module