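Using NumPy

Reference
NumPy is a Python package for efficiently loading, processing, storing, and computing on data. Its core is implemented in C. This post covers some of NumPy's basic operations; for anything deeper, see the official documentation.

Contents:

Data types
Accessing data
Fast functions
Boolean masks and fancy indexing
Structured arrays

Importing NumPy and viewing its documentation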

import numpy as np

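In [1]: np.<TAB>   # press Tab after the dot to list the contents of the namespace
In [2]: np?        # show NumPy's built-in documentation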

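Data types in Python

In statically typed languages such as C, a variable's type must be declared before a value is assigned to it. Python is dynamically typed: a variable takes on the type of whatever value is bound to it at run time.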

// C code
int total = 0;
for (int i = 0; i < 100; i++) {
    total += i;
}

# Python code
total = 0
for i in range(100):
    total += i

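A Python integer is more than a raw number. In CPython it is a full object that also carries ob_refcnt, ob_type, ob_size, and ob_digit.
A C integer stores none of this extra information: it is just a fixed-size slot in memory (typically 32 bits, i.e. 4 bytes) whose bits encode the value. This overhead is one of the reasons Python is slow.

A Python list is likewise more than just a list.
Each element is a pointer to an object, and each of those objects carries the full set of information described above. A C array, by contrast, has its element type fixed when it is defined, and each slot holds the data itself. This is why a Python list can hold elements of mixed types.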

In [10] : L3 = [True, "2", 3.0, 4]
In [11] : [type(item) for item in L3]
Out[11] : [bool, str, float, int]

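To speed things up, Python's standard library provides the array module, which drops the per-element overhead of a list and stores values of a single type (the type must be specified with a type code).

import array
L = list(range(10))
A = array.array('i', L)
A

Result: array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
The type code 'i' stands for an integer type.

Even so, array is too basic: operations such as sums, lengths, and dimensions are inconvenient. That is where NumPy comes in.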
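To make the step from list to array.array to NumPy concrete, here is a minimal sketch. The variable names are illustrative and not from the original; it only uses the standard array module and basic NumPy calls.

```python
import array
import numpy as np

values = list(range(10))            # plain list: every element is a full Python object
typed = array.array('i', values)    # compact buffer of C ints, but few numeric operations
arr = np.array(values)              # compact storage plus vectorized operations

# Multiplying an array.array by an integer repeats it;
# multiplying a NumPy array works element by element.
repeated = typed * 2
doubled = arr * 2

total = arr.sum()     # sum of all elements
n_dims = arr.ndim     # number of dimensions
length = arr.size     # number of elements

print(len(repeated))  # 20
print(doubled)        # [ 0  2  4  6  8 10 12 14 16 18]
print(total, n_dims, length)  # 45 1 10
```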

Creating arrays from Python lists

# integer array:
np.array([1, 4, 2, 5, 3])

array([1, 4, 2, 5, 3])

Implicit conversion (the integers are upcast to floating point)

np.array([3.14, 4, 2, 3])

Explicit conversion with dtype

np.array([3.14, 4, 2, 3], dtype='float32')
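The effect of implicit and explicit conversion can be checked through the dtype attribute. This is a small sketch; the astype call is an addition not shown in the original, included only to illustrate a conversion in the other direction.

```python
import numpy as np

mixed = np.array([3.14, 4, 2, 3])                    # the integers are upcast to float
as_f32 = np.array([3.14, 4, 2, 3], dtype='float32')  # dtype chosen explicitly
as_int = mixed.astype(int)                           # explicit downcast, truncates toward zero

print(mixed.dtype)   # float64
print(as_f32.dtype)  # float32
print(as_int)        # [3 4 2 3]
```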

Using a list comprehension

np.array([range(i-1, i + 3) for i in [4, 4, 6]])

array([[3, 4, 5, 6],
[3, 4, 5, 6],
[5, 6, 7, 8]])

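Arrays can also be generated quickly with built-in functions

Anyone who has used MATLAB or Octave will find these familiar.

Array of zeros
Array of ones
Identity matrix
Array filled with a constant
Evenly spaced vector
Uniform random matrix
Normally distributed random matrix
Random integer matrix

np.zeros((3, 3), dtype='int')       # 3x3 array of zeros
np.ones((3, 4), dtype='float32')    # 3x4 array of ones
np.eye(3, dtype='int')              # 3x3 identity matrix
np.full((3, 5), 1.2)                # 3x5 array filled with the value 1.2
np.linspace(0, 0.1, 5)              # 5 evenly spaced values from 0 to 0.1
np.random.random((3, 3))            # uniform random values in [0, 1)
np.random.normal(0, 1, (3, 3))      # normal distribution with mean 0 and standard deviation 1
np.random.randint(0, 10, (3, 3))    # random integers in [0, 10)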

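NumPy's standard data types

bool_
int_
intc
intp
int8
int16
int32
int64
uint8
uint16
float32
complex64
The prefix names the kind of value and the trailing number gives how many bits are used to store it. A name ending in an underscore, such as int_, is the default type of that kind; int_ is typically 64 bits on 64-bit platforms.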

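NumPy basics

Array attributes

When x is a NumPy array, typing x.<TAB> lists its attributes.

x.ndim number of dimensions
x.shape size of each dimension
x.size total number of elements
x.dtype data type of the elements
x.itemsize size of each element in bytes
x.nbytes total size of the array in bytes  # x.nbytes == x.itemsize * x.size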
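The attributes listed above can be inspected on a small array. The shape and dtype below are arbitrary choices for illustration.

```python
import numpy as np

x = np.zeros((3, 4, 5), dtype=np.int32)

print(x.ndim)      # 3
print(x.shape)     # (3, 4, 5)
print(x.size)      # 60
print(x.dtype)     # int32
print(x.itemsize)  # 4
print(x.nbytes)    # 240, which equals itemsize * size
```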

Indexing

How are array elements accessed?

x1 = np.arange(10)
x1[1]    # the second element
x1[-1]   # the last element
x2 = np.random.randint(10, size=(3, 3))
x2[2, 2]
x2[-1, -1]

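Slicing

x = np.arange(10)
x[:5]    # the first five elements (indices 0 to 4)
x[3:]    # everything from index 3 onward
x[2:-1]  # from index 2 up to, but not including, the last element
x[start:stop:step]  # general form; by default start is the beginning, stop is the end, and step is 1
# Two dimensions
x2[::2, ::1]  # same rules as in one dimension, applied to each axis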
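The slicing forms above, run on concrete arrays. This is a short sketch; grid is a hypothetical 3x4 array introduced here so that the two-dimensional result is reproducible.

```python
import numpy as np

x = np.arange(10)
first_five = x[:5]     # [0 1 2 3 4]
from_three = x[3:]     # [3 4 5 6 7 8 9]
middle = x[2:-1]       # [2 3 4 5 6 7 8]
every_other = x[::2]   # [0 2 4 6 8]
reversed_x = x[::-1]   # a negative step walks the array backwards

grid = np.arange(12).reshape(3, 4)
sub = grid[::2, ::1]   # every other row, every column
print(sub)
# [[ 0  1  2  3]
#  [ 8  9 10 11]]
```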

Copying arrays
Slicing out a subarray returns a view onto the original data rather than a copy, so modifying the subarray also modifies the original array.

>>> print(x2)
[[12  5  2  4]
 [ 7  6  8  8]
 [ 1  6  7  7]]
>>> x2_sub = x2[:2, :2]
>>> print(x2_sub)
[[12  5]
 [ 7  6]]
>>> x2_sub[0, 0] = 99
>>> print(x2_sub)
[[99  5]
 [ 7  6]]
>>> print(x2)
[[99  5  2  4]
 [ 7  6  8  8]
 [ 1  6  7  7]]
=====
To get an independent subarray, use the copy() method
=====
>>> x2_sub_copy = x2[:2, :2].copy()
>>> print(x2_sub_copy)
[[99  5]
 [ 7  6]]
>>> x2_sub_copy[0, 0] = 42
>>> print(x2_sub_copy)
[[42  5]
 [ 7  6]]
>>> print(x2)
[[99  5  2  4]
 [ 7  6  8  8]
 [ 1  6  7  7]]

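Reshaping arrays

x.reshape((3,3))
Note: the array must have the same total size before and after reshaping.

A related idiom: to turn a one-dimensional vector of length n into a two-dimensional row or column matrix, use x.reshape(1,n) or x.reshape(n,1). The shorter expressions x[np.newaxis,:] and x[:,np.newaxis] have the same effect.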
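A brief sketch of reshape and np.newaxis, using a nine-element vector chosen so that the 3x3 reshape is valid.

```python
import numpy as np

x = np.arange(1, 10)       # 9 elements, so a 3x3 reshape is valid
grid = x.reshape((3, 3))

row = x[np.newaxis, :]     # same as x.reshape(1, 9)
col = x[:, np.newaxis]     # same as x.reshape(9, 1)

print(grid.shape, row.shape, col.shape)  # (3, 3) (1, 9) (9, 1)
```

Asking for a shape whose total size differs from 9, for example x.reshape((2, 5)), raises a ValueError.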

Concatenating arrays

np.concatenate()
np.hstack() stacks horizontally (column-wise)
np.vstack() stacks vertically (row-wise)
P.S. np.dstack() stacks along the third axis

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])
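The stacking helpers differ only in the axis they join along. The sketch below uses small arrays (grid and col are illustrative names, not from the original) to show the resulting shapes.

```python
import numpy as np

x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])
col = np.array([[99],
                [99]])

joined = np.concatenate([x, x])                # [1 2 3 1 2 3]
stacked_rows = np.vstack([x, grid])            # x becomes a new first row
stacked_cols = np.hstack([grid, col])          # col becomes a new last column
side_by_side = np.concatenate([grid, grid], axis=1)

print(stacked_rows.shape)   # (3, 3)
print(stacked_cols.shape)   # (2, 4)
print(side_by_side.shape)   # (2, 6)
```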

Splitting arrays

np.split()
np.hsplit()
np.vsplit()
P.S. np.dsplit()

>>> grid = np.arange(16).reshape((4, 4))
>>> grid
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
>>> upper, lower = np.vsplit(grid, [2])  # split into top and bottom halves
>>> left, right = np.hsplit(grid, [2])   # split into left and right halves

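NumPy's numerical functions

Computing with explicit Python loops is very slow. NumPy provides many optimized functions implemented in a lower-level language to speed this up; they are called universal functions (ufuncs).

Array arithmetic in NumPy feels natural: the ordinary operators work on whole arrays with no special handling, and they are already optimized.
The operators are:

+
-
*
/
//   # floor division
**   # exponentiation
%    # remainder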

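The equivalent functions can also be called directly:

Operator	Equivalent function	Description
+	np.add()	Addition: 1 + 1 = 2
-	np.subtract()	Subtraction: 3 - 2 = 1
-	np.negative()	Unary negation: -2
*	np.multiply()	Multiplication: 2 * 2 = 4
/	np.divide()	Division: 4 / 2 = 2
//	np.floor_divide()	Floor division: 5 // 2 = 2
**	np.power()	Exponentiation: 2 ** 2 = 4
%	np.mod()	Modulus (remainder): 9 % 4 = 1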
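As a quick check that each operator and its named function agree, here is a small sketch on a four-element array.

```python
import numpy as np

x = np.arange(4)   # [0 1 2 3]

# Each pair gives the same result element by element.
same_add = np.array_equal(x + 5, np.add(x, 5))
same_sub = np.array_equal(x - 5, np.subtract(x, 5))
same_neg = np.array_equal(-x, np.negative(x))
same_mul = np.array_equal(x * 2, np.multiply(x, 2))
same_div = np.array_equal(x / 2, np.divide(x, 2))
same_floor = np.array_equal(x // 2, np.floor_divide(x, 2))
same_pow = np.array_equal(x ** 2, np.power(x, 2))
same_mod = np.array_equal(x % 2, np.mod(x, 2))

print(x // 2)   # [0 0 1 1]
print(x ** 2)   # [0 1 4 9]
print(x % 2)    # [0 1 0 1]
```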

Other functions:

np.abs()
np.log()
np.exp()
np.exp2()
np.sin()
np.arcsin()
np.log1p()
and more

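More specialized functions are available in the SciPy package

from scipy import special
special.gamma(x)  # gamma function
special.erf(x)    # error function (the integral of the Gaussian)
See the documentation for more.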

Specifying the output
Ufuncs accept an out argument that writes the result directly into an existing array.

x = np.arange(5)
y = np.zeros(10)
np.power(2, x, out=y[::2])
>>> print(y)

[  1.   0.   2.   0.   4.   0.   8.   0.  16.   0.]

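Sums, products, running totals, and outer products
Ufuncs provide the reduce and accumulate methods for this.
The same results are available directly from the following functions: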
np.sum, np.prod, np.cumsum, np.cumprod

>>> x = np.arange(1, 8)
>>> np.add.reduce(x)
28
>>> np.multiply.reduce(x)
5040
>>> np.add.accumulate(x)
array([ 1,  3,  6, 10, 15, 21, 28], dtype=int32)
>>> np.multiply.accumulate(x)
array([   1,    2,    6,   24,  120,  720, 5040], dtype=int32)
# outer product: every pairwise product 1*1 1*2 1*3 ... 2*1 2*2 ...
>>> x = np.arange(1, 6)
>>> np.multiply.outer(x, x)
array([[ 1,  2,  3,  4,  5],
       [ 2,  4,  6,  8, 10],
       [ 3,  6,  9, 12, 15],
       [ 4,  8, 12, 16, 20],
       [ 5, 10, 15, 20, 25]])

Aggregation functions: max, min, and other statistics

Sums, maxima, and minima can all be computed without NumPy, but the NumPy versions take much less time to run.

import numpy as np
big_array = np.random.rand(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)
# compare the timings
10 loops, best of 3: 104 ms per loop
1000 loops, best of 3: 442 µs per loop

Maximum and minimum

min(big_array), max(big_array)
np.min(big_array), np.max(big_array)
%timeit min(big_array)
%timeit np.min(big_array)
63.7 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
499 µs ± 7.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

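For multidimensional arrays, the axis argument selects the direction along which to aggregate; by default the aggregate is taken over all elements.
axis=0 collapses the rows (one result per column); axis=1 collapses the columns (one result per row).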
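The axis argument names the dimension that gets collapsed, which is easy to misread. A small sketch with a 2x3 array chosen for illustration:

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])

total = M.sum()            # over all elements
col_sums = M.sum(axis=0)   # collapse the rows: one value per column
row_sums = M.sum(axis=1)   # collapse the columns: one value per row
col_max = M.max(axis=0)

print(total)      # 21
print(col_sums)   # [5 7 9]
print(row_sums)   # [ 6 15]
print(col_max)    # [4 5 6]
```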

Other aggregation functions

Function Name	NaN-safe Version	Description
`np.sum`	`np.nansum`	Compute sum of elements
`np.prod`	`np.nanprod`	Compute product of elements
`np.mean`	`np.nanmean`	Compute mean of elements
`np.std`	`np.nanstd`	Compute standard deviation
`np.var`	`np.nanvar`	Compute variance
`np.min`	`np.nanmin`	Find minimum value
`np.max`	`np.nanmax`	Find maximum value
`np.argmin`	`np.nanargmin`	Find index of minimum value
`np.argmax`	`np.nanargmax`	Find index of maximum value
`np.median`	`np.nanmedian`	Compute median of elements
`np.percentile`	`np.nanpercentile`	Compute rank-based statistics of elements
`np.any`	N/A	Evaluate whether any elements are true
`np.all`	N/A	Evaluate whether all elements are true

A NaN-safe version ignores NaN values when computing the result, which makes it safer to use on data with missing entries.

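Broadcasting

NumPy's universal functions extend what we can compute, and broadcasting extends it a little further by letting them operate on arrays of different shapes. As far as I know, I have not come across this kind of shorthand in other languages.

The rules of broadcasting:
Broadcasting in NumPy follows a strict set of rules that determine how two arrays interact:
Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
Rule 2: If the shapes of the two arrays do not match in some dimension, the array whose size is 1 in that dimension is stretched to match the other shape.
Rule 3: If the sizes disagree in any dimension and neither is equal to 1, an error is raised.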

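An example:

>>> M = np.ones((2, 3))
>>> M
array([[1., 1., 1.],
       [1., 1., 1.]])
>>> a = np.arange(3)
>>> a
array([0, 1, 2])
>>> a + M
array([[1., 2., 3.],
       [1., 2., 3.]])

Although a and M have different shapes, so that mathematically they could not be added, broadcasting makes their shapes match and the addition goes through.
In effect, a is first stretched downward into a second row, and then the two arrays are added.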
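The three rules can be traced through shapes alone. The sketch below is illustrative; the array b and the failing case are additions not in the original, and the comments track how each shape is padded and stretched.

```python
import numpy as np

M = np.ones((2, 3))
a = np.arange(3)                   # shape (3,)
b = np.arange(2)[:, np.newaxis]    # shape (2, 1)

# Rule 1 pads a to (1, 3); rule 2 stretches it to (2, 3).
sum_ma = M + a

# a: (3,) -> (1, 3) -> (2, 3);  b: (2, 1) -> (2, 3)
sum_ab = a + b

# Rule 3: (2,) is padded to (1, 2); the last dimensions are 3 and 2,
# and neither is 1, so the shapes are incompatible.
try:
    M + np.arange(2)
    failed = False
except ValueError:
    failed = True

print(sum_ma.shape)  # (2, 3)
print(sum_ab)
# [[0 1 2]
#  [1 2 3]]
print(failed)        # True
```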

Boolean masks and Boolean logic

This part is simple. There are two logical values, False and True.
Comparison operators produce them.

>>> x = np.array([1, 2, 3, 4, 5])
>>> x < 3  # less than
array([ True,  True, False, False, False], dtype=bool)
>>> x > 3  # greater than
array([False, False, False,  True,  True], dtype=bool)
>>> x <= 3  # less than or equal
array([ True,  True,  True, False, False], dtype=bool)
>>> x >= 3  # greater than or equal
array([False, False,  True,  True,  True], dtype=bool)
>>> x != 3  # not equal
array([ True,  True, False,  True,  True], dtype=bool)
>>> x == 3  # equal
array([False, False,  True, False, False], dtype=bool)
# even compound expressions work:
>>> (2 * x) == (x ** 2)
array([False,  True, False, False, False], dtype=bool)

The equivalent functions can also be used: np.equal(), np.not_equal(), ...
Multidimensional arrays work the same way.

Working with Boolean arrays

Counting entries
any() and all()

>>> x
array([[9, 7, 2],
       [6, 2, 3],
       [9, 5, 9]])
>>> np.count_nonzero(x < 6)
4
>>> np.sum(x < 6)
4
>>> np.all(x < 10)
True
>>> np.all(x < 0)
False

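Boolean operators
& | ~ ^
and, or, not, exclusive or (all applied element by element)
Boolean arrays as masks
Indexing an array with a Boolean array selects elements from it directly.
[Elements where the mask is True are returned in a one-dimensional array; elements where it is False are left out.]

>>> x
array([[5, 0, 3, 3],
       [7, 9, 3, 5],
       [2, 4, 7, 6]])
>>> x < 5
array([[False,  True,  True,  True],
       [False, False,  True, False],
       [ True,  True, False, False]], dtype=bool)
>>> x[x < 5]
array([0, 3, 3, 3, 2, 4])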
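Masks become more useful when combined with the Boolean operators. The following sketch reuses the array from the example above; the particular thresholds are arbitrary choices for illustration. Note the parentheses, which are needed because & and | bind more tightly than comparisons.

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

between = x[(x > 2) & (x < 7)]     # greater than 2 AND less than 7
extremes = x[(x < 1) | (x > 8)]    # less than 1 OR greater than 8
not_small = x[~(x < 5)]            # NOT less than 5
small_per_row = np.sum(x < 5, axis=1)   # count of values below 5 in each row

print(between)        # [5 3 3 3 5 4 6]
print(extremes)       # [0 9]
print(not_small)      # [5 7 9 5 7 6]
print(small_per_row)  # [3 1 2]
```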

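bool, and, or, and bin

bool() returns the truth value of an object directly:
any nonzero value gives True, and zero gives False.
and and or act on the truth value of whole objects, whereas & and | act on individual bits (or array elements).
bin() gives the binary representation.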

>>> bool(42), bool(0)
(True, False)
>>> bool(42 and 0)
False
>>> bool(42 or 0)
True
>>> bin(42)
'0b101010'

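Note that and and or cannot be applied element by element between two arrays: they ask for the truth value of each array as a whole, which raises a ValueError for arrays with more than one element. Use & and | on Boolean arrays instead.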
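The difference between the keyword operators and the element-wise operators is easiest to see on two small Boolean arrays. A and B are illustrative names introduced here.

```python
import numpy as np

A = np.array([1, 0, 1, 0], dtype=bool)
B = np.array([1, 1, 0, 0], dtype=bool)

elementwise_or = A | B     # compares the arrays position by position
elementwise_and = A & B

# The keyword form needs a single truth value for each whole array,
# which is ambiguous for arrays with more than one element.
try:
    A or B
    raised = False
except ValueError:
    raised = True

print(elementwise_or)    # [ True  True  True False]
print(elementwise_and)   # [ True False False False]
print(raised)            # True
```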

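Fancy indexing

So far we have covered the simple ways to access and modify parts of an array: single indices such as arr[1], slices, and Boolean masks. Fancy indexing is another style: passing an array of indices in place of a single scalar, which gives very fast access to, and modification of, complicated subsets of an array's values.

A simple example:

>>> rand = np.random.RandomState(42)  # assumed seed; rand is not defined in the original
>>> x = rand.randint(100, size=10)
>>> print(x)
[51 92 14 71 60 20 82 86 74 74]
>>> index = [3, 4, 6]
>>> x[index]
array([71, 60, 82])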

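Fancy indexing can be combined with simple indices and slices, and the broadcasting rules apply here too

>>> X = np.arange(12).reshape((3, 4))
>>> print(X)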
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
>>> X[2, [2, 0, 1]]
array([10,  8,  9])
>>> X[1:, [2, 0, 1]]
array([[ 6,  4,  5],
       [10,  8,  9]])

Boolean masks can be combined with it as well

>>> mask = np.array([1, 0, 1, 0], dtype=bool)
>>> X[[[0],[1],[2]], mask]
array([[ 0,  2],
       [ 4,  6],
       [ 8, 10]])

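Two simple related functions are worth introducing. The first is np.searchsorted(bins, x),
which returns the positions at which the values of x would have to be inserted into the sorted array bins to keep it in order.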

>>> np.searchsorted([1,2,3,4,5], 3)
2
>>> np.searchsorted([1,2,3,4,5], 3, side='right')
3
>>> np.searchsorted([1,2,3,4,5], [-10, 10, 2, 3])
array([0, 5, 1, 2])

The second is np.histogram(), which counts how many values fall into each bin.
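The two functions are closely related: a histogram can be built by hand from np.searchsorted and fancy indexing. The sketch below uses a tiny hand-picked data set (x and bins are illustrative values) in which no value lies exactly on a bin edge; for values on an edge, the two approaches may assign a different bin.

```python
import numpy as np

x = np.array([0.5, 1.5, 1.7, 2.2, 3.9])
bins = np.array([0, 1, 2, 3, 4])

# For each value, find the index of the first bin edge that is not smaller.
idx = np.searchsorted(bins, x)

# Add 1 at each of those indices; np.add.at handles repeated indices correctly.
counts = np.zeros(len(bins), dtype=int)
np.add.at(counts, idx, 1)

# np.histogram returns one count per interval between consecutive edges.
hist, edges = np.histogram(x, bins)

print(idx)      # [1 2 2 3 4]
print(counts)   # [0 1 2 1 1]
print(hist)     # [1 2 1 1]
```

Here counts[1:] matches the output of np.histogram, because index i corresponds to the interval between bins[i-1] and bins[i].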

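Sorting arrays

Python's built-in sorted() function can sort, but it only handles lists and other one-dimensional sequences, and it is not very efficient.

NumPy provides fast sorting functions:
np.sort and np.argsort
np.sort returns a sorted copy of the array
np.argsort returns the indices that would sort the array

>>> x = np.random.randint(1, 100, 10); x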
array([79, 15, 72,  4, 87, 83, 30,  5, 72, 25])
>>> np.sort(x)
array([ 4,  5, 15, 25, 30, 72, 72, 79, 83, 87])
>>> np.argsort(x)
array([3, 7, 1, 9, 6, 2, 8, 0, 5, 4], dtype=int64)

For a multidimensional array, sorting can be done along a single axis.

In [37]: x = np.random.randint(1,100,(3,3,3))

In [38]: x
Out[38]:
array([[[46, 56, 17],
        [64, 97, 89],
        [16, 85, 60]],

       [[ 3, 40, 64],
        [35, 13, 29],
        [99, 66, 68]],

       [[66, 28, 39],
        [ 3, 12, 70],
        [11, 46, 78]]])
In [39]: np.sort(x, axis=0)
Out[39]:
array([[[ 3, 28, 17],
        [ 3, 12, 29],
        [11, 46, 60]],

       [[46, 40, 39],
        [35, 13, 70],
        [16, 66, 68]],

       [[66, 56, 64],
        [64, 97, 89],
        [99, 85, 78]]])
In [40]: np.sort(x,axis=1)
Out[40]:
array([[[16, 56, 17],
        [46, 85, 60],
        [64, 97, 89]],

       [[ 3, 13, 29],
        [35, 40, 64],
        [99, 66, 68]],

       [[ 3, 12, 39],
        [11, 28, 70],
        [66, 46, 78]]])

Partial sorting

When a full sort is not needed, the array can be partially sorted instead.

np.partition() np.argpartition()

>>> x = np.array([7, 2, 3, 1, 6, 5, 4])
>>> np.partition(x, 3)
array([2, 1, 3, 4, 6, 5, 7])

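np.partition(x, K) moves the K smallest values to the left of the array; within each of the two partitions the order is arbitrary.
np.argpartition() returns the corresponding indices instead.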
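Because the order inside each partition is arbitrary, it is safer to check the guarantee than the exact output. A minimal sketch (k is an illustrative name for the partition index):

```python
import numpy as np

x = np.array([7, 2, 3, 1, 6, 5, 4])
k = 3

part = np.partition(x, k)
idx = np.argpartition(x, k)

# The first k entries are the k smallest values, in no particular order,
# so sort them before comparing.
smallest = np.sort(part[:k])
smallest_via_idx = np.sort(x[idx[:k]])

# The element at position k is the one a full sort would place there.
pivot = part[k]

print(smallest)          # [1 2 3]
print(smallest_via_idx)  # [1 2 3]
print(pivot)             # 4
```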

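Structured arrays in NumPy

NumPy also supports structured arrays, whose elements are records with named fields.
Suppose we have the following data:

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

A structured array for it can be created with a compound dtype: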

data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                          'formats':('U10', 'i4', 'f8')})

or:
np.dtype({'names':('name', 'age', 'weight'),
          'formats':('U10', 'i4', 'f8')})
or:
np.dtype({'names':('name', 'age', 'weight'),
          'formats':((np.str_, 10), int, np.float32)})

Meaning of the format codes:
The trailing number gives the size in bytes.

Character	Description	Example
`'b'`	Byte	`np.dtype('b')`
`'i'`	Signed integer	`np.dtype('i4') == np.int32`
`'u'`	Unsigned integer	`np.dtype('u1') == np.uint8`
`'f'`	Floating point	`np.dtype('f8') == np.float64`
`'c'`	Complex floating point	`np.dtype('c16') == np.complex128`
`'S'`, `'a'`	String	`np.dtype('S5')`
`'U'`	Unicode string	`np.dtype('U') == np.str_`
`'V'`	Raw data (void)	`np.dtype('V') == np.void`

Accessing a structured array:

data['field']
The np.recarray class additionally allows fields to be accessed as attributes

>>> data['age']
array([25, 45, 37, 19], dtype=int32)
>>> data_rec = data.view(np.recarray)
>>> data_rec.age
array([25, 45, 37, 19], dtype=int32)
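The steps above leave out how the empty structured array gets its values. The sketch below fills it from the three lists and then accesses it by field, by row, and with a mask; the filling step and the under_30 query are additions for illustration, not part of the original.

```python
import numpy as np

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})

# Fill each named field from the matching list.
data['name'] = name
data['age'] = age
data['weight'] = weight

first_record = data[0]                       # one whole record
last_name = data[-1]['name']                 # a single field of one record
under_30 = data[data['age'] < 30]['name']    # Boolean mask on one field

data_rec = data.view(np.recarray)            # attribute-style access

print(first_record)   # ('Alice', 25, 55.)
print(last_name)      # Doug
print(under_30)       # ['Alice' 'Doug']
print(data_rec.age)   # [25 45 37 19]
```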

For anything beyond basic use of structured arrays, pandas is probably a better choice.