Notes, Summary, and Examples from "Python for Data Analysis": NumPy (Part 1)

1. NumPy

(0) Introduction

import numpy as np
array = np.array([[1,2,3],
                  [2,3,4]])
print(array)
print('number of dimensions: ', array.ndim) # number of dimensions
print('shape: ', array.shape)               # shape
print('size: ', array.size)                 # total number of elements
[[1 2 3]
 [2 3 4]]
number of dimensions:  2
shape:  (2, 3)
size:  6

(1) ndarray: the multidimensional array object

(a) Creating an ndarray

(i) Using the array function with another sequence such as a tuple or list

data1 = [2,3,4] # one-dimensional
arr1 = np.array(data1)
arr1
array([2, 3, 4])
arr2 = np.array([[1,2,3,4], # two-dimensional
                 [5,6,7,8]])
arr2
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

Checking the number of dimensions, shape, and size

arr2.ndim
2
arr2.shape
(2, 4)
arr2.size
8

(ii) Using zeros/zeros_like

# create an all-zeros matrix
b = np.zeros((3,4))
print(b)
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
b1 = np.zeros_like(arr2)
b1
array([[0, 0, 0, 0],
       [0, 0, 0, 0]])

(iii) Using ones/ones_like

# create an all-ones matrix
c = np.ones((2,3))
print(c)
[[1. 1. 1.]
 [1. 1. 1.]]
c1 = np.ones_like(arr2)
c1
array([[1, 1, 1, 1],
       [1, 1, 1, 1]])

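(iv) Using empty (allocates memory without initializing it, so the values are arbitrary and not guaranteed to be 0)

# create an uninitialized matrix; the printed values are whatever was in memory and often just happen to look like 0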
d = np.empty((3,5),dtype=np.float64)
print(d)
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
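
Because np.empty does not initialize anything, a reasonable habit is to overwrite every element before reading from the array. The snippet below is only a small sketch of that pattern; the name buf is made up for illustration.

```python
import numpy as np

# np.empty only allocates memory; its initial contents are arbitrary.
buf = np.empty((3, 5), dtype=np.float64)

# Overwrite every element before reading anything from the array.
buf[:] = 0.0
buf[1] = np.arange(5)  # fill the second row with 0..4

print(buf.shape)
print(buf[1])
```

If the array is meant to start at zero anyway, np.zeros is the simpler choice; np.empty mainly saves the initialization cost when every element will be overwritten.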

(v) Combining arange and reshape

# create the integers from 10 up to (but not including) 22 with step 1, then reshape them into a 3x4 matrix
e = np.arange(10,22,1).reshape((3,4))
print(e)
[[10 11 12 13]
 [14 15 16 17]
 [18 19 20 21]]

(vi) Combining linspace(start, end, number_of_points) and reshape

# generate 6 evenly spaced points from 1 to 10 (both endpoints included)
f = np.linspace(1,10,6).reshape((2,3))
print(f)
[[ 1.   2.8  4.6]
 [ 6.4  8.2 10. ]]
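
arange and linspace are easy to mix up: arange takes a step and leaves out the stop value, while linspace takes a number of points and includes the stop value by default. A quick sketch of the difference (the variable names are illustrative only):

```python
import numpy as np

# arange takes a step and excludes the stop value.
by_step = np.arange(10, 22, 2)

# linspace takes a number of points and includes the stop value by default.
by_count = np.linspace(1, 10, 6)

# endpoint=False makes linspace exclude the stop value, like arange.
half_open = np.linspace(0, 1, 5, endpoint=False)

print(by_step)    # 10, 12, ..., 20
print(by_count)   # 1.0, 2.8, ..., 10.0
print(half_open)  # 0.0, 0.2, ..., 0.8
```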

(vii) Using identity/eye (an N*N matrix with 1 on the diagonal and 0 elsewhere)

g = np.identity(5)
g
array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])
g1 = np.eye(4)
g1
array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

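(viii) Random generation

  • randn: samples from the standard normal distribution
  • random: uniform samples from [0, 1)
  • randint: random integers

There are many more…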

np.random.randn(2,3)
array([[ 2.57030105,  1.39206901, -0.30316474],
       [-0.42712186, -0.40631921, -1.71738395]])
np.random.random((2,4))
array([[0.95846336, 0.24449218, 0.70448788, 0.99157708],
       [0.75955074, 0.30768819, 0.4732268 , 0.23431481]])
np.random.randint(100,size=(4,4))
array([[25, 42, 96, 78],
       [22, 40, 82, 37],
       [67, 89, 91, 45],
       [97, 77, 39, 51]])

(b) The data type of an ndarray: dtype

(i) Inspecting the dtype

arr1 = np.array([1,2,3],dtype=np.float64)
arr2 = np.array([1,2,3],dtype=np.int32)
arr1.dtype
dtype('float64')
arr2.dtype
dtype('int32')

(ii) Converting the dtype with astype

arr = np.array([1,2,3,4,5])
arr.dtype
dtype('int32')
float_arr = arr.astype(np.float64)
float_arr.dtype
dtype('float64')

When floating-point numbers are converted to integers, the fractional part is truncated, for example:

arr = np.random.randn(3,4)
arr
array([[ 0.85802555,  0.05701799,  0.077082  , -0.38405232],
       [ 0.20142852, -0.87765168,  1.69069697,  0.99902233],
       [ 0.52238487, -1.7109163 , -0.58673934,  1.68724587]])
arr.astype(np.int32)
array([[ 0,  0,  0,  0],
       [ 0,  0,  1,  0],
       [ 0, -1,  0,  1]])
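
Truncation means rounding toward zero, which is not the same as rounding down or rounding to the nearest integer. The following sketch compares the three on a few hand-picked values (the array values is an example, not data from these notes):

```python
import numpy as np

values = np.array([-1.7, -0.5, 0.5, 1.7])

truncated = values.astype(np.int32)          # drops the fractional part (toward zero)
floored = np.floor(values).astype(np.int32)  # always rounds down
rounded = np.rint(values).astype(np.int32)   # nearest integer, ties go to the even one

print(truncated)  # [-1  0  0  1]
print(floored)    # [-2 -1  0  1]
print(rounded)    # [-2  0  0  2]
```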

Converting an ndarray to the dtype of another ndarray

int_arr = np.arange(10)
int_arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
int_arr.astype(float_arr.dtype)
array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

(c) Basic arithmetic (key property: vectorization)

a = np.arange(1,13).reshape(3,4)
a
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

(i) Addition

a+a
array([[ 2,  4,  6,  8],
       [10, 12, 14, 16],
       [18, 20, 22, 24]])

(ii) Subtraction

a-a
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])

(iii) Multiplication (element-wise)

a*a
array([[  1,   4,   9,  16],
       [ 25,  36,  49,  64],
       [ 81, 100, 121, 144]])

(iv) Division (take care not to divide by 0)

a/a
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

(v) Propagation to every element (broadcasting)

  • An operation between an ndarray and a scalar is applied to every element
1/a
array([[1.        , 0.5       , 0.33333333, 0.25      ],
       [0.2       , 0.16666667, 0.14285714, 0.125     ],
       [0.11111111, 0.1       , 0.09090909, 0.08333333]])
a ** 2
array([[  1,   4,   9,  16],
       [ 25,  36,  49,  64],
       [ 81, 100, 121, 144]], dtype=int32)
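
Scalars are the simplest case of broadcasting. The same mechanism also stretches a row or a column across a matrix when the shapes are compatible. A small sketch, where row and col are made-up example arrays:

```python
import numpy as np

a = np.arange(1, 13).reshape(3, 4)
row = np.array([10, 20, 30, 40])       # shape (4,)
col = np.array([[100], [200], [300]])  # shape (3, 1)

plus_row = a + row  # row is repeated for each of the 3 rows
plus_col = a + col  # col is repeated for each of the 4 columns

print(plus_row)
print(plus_col)
```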

(vi) Comparison

  • Comparison with another array
  • Comparison with a scalar
b = np.array([[1,7,4,3],
              [12,4,54,23],
              [9,7,6,44]])
b
array([[ 1,  7,  4,  3],
       [12,  4, 54, 23],
       [ 9,  7,  6, 44]])
a
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])
# comparing an ndarray with an ndarray
a > b
array([[False, False, False,  True],
       [False,  True, False, False],
       [False,  True,  True, False]])
# comparing an ndarray with a scalar (uses broadcasting)
a < 6
array([[ True,  True,  True,  True],
       [ True, False, False, False],
       [False, False, False, False]])

(vii)sin/cos/tan …

#sin/cos/tan ...
10*np.sin(a)
array([[ 8.41470985,  9.09297427,  1.41120008, -7.56802495],
       [-9.58924275, -2.79415498,  6.56986599,  9.89358247],
       [ 4.12118485, -5.44021111, -9.99990207, -5.36572918]])

(viii) Matrix multiplication

  • a.T is the transpose of the matrix
  • a.dot(b) or np.dot(a,b) computes the matrix product of a and b
a.dot(a.T)
array([[ 30,  70, 110],
       [ 70, 174, 278],
       [110, 278, 446]])
np.dot(a,a.T)
array([[ 30,  70, 110],
       [ 70, 174, 278],
       [110, 278, 446]])

Contrast with a*b, which multiplies element by element

a*a
array([[  1,   4,   9,  16],
       [ 25,  36,  49,  64],
       [ 81, 100, 121, 144]])

(ix) Minimum, maximum, sum

  • Minimum: min
  • Maximum: max
  • Sum: sum
    **The axis argument computes any of these three per column or per row
# minimum, maximum, sum
print(a)
print(np.min(a))
print(np.max(a))
print(np.sum(a))

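# pass axis to reduce along one direction: axis = 0 collapses the rows and returns one value per column, axis = 1 collapses the columns and returns one value per row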
print(np.min(a,axis=0))
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
1
12
78
[1 2 3 4]
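
The axis argument names the dimension that disappears from the result. A short sketch on the same 3x4 matrix makes this concrete (the result names are illustrative):

```python
import numpy as np

a = np.arange(1, 13).reshape(3, 4)

col_min = a.min(axis=0)                 # collapses the rows: one value per column
row_min = a.min(axis=1)                 # collapses the columns: one value per row
row_sum = a.sum(axis=1, keepdims=True)  # keepdims leaves a length-1 axis in place

print(col_min, col_min.shape)  # [1 2 3 4] (4,)
print(row_min, row_min.shape)  # [1 5 9] (3,)
print(row_sum.shape)           # (3, 1)
```

Keeping the reduced axis with keepdims=True is handy when the result has to be broadcast back against the original matrix, for example a / row_sum.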

(d) Indexing and slicing

arr = np.arange(3,15)
arr
array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

(i)start:end:step

arr[3]
6
arr[3:6]
array([6, 7, 8])
arr[3:6]=12 # broadcasting: the scalar is assigned to every element of the slice
arr
array([ 3,  4,  5, 12, 12, 12,  9, 10, 11, 12, 13, 14])
arr[3:10:2]
array([12, 12, 10, 12])

For multidimensional arrays

arr2d = np.arange(1,10).reshape((3,3))
arr2d
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
# rows 1 and 2 (indices 0 and 1)
arr2d[:2]
array([[1, 2, 3],
       [4, 5, 6]])
# columns 2 and 3 (indices 1 and 2)
arr2d[:,1:]
array([[2, 3],
       [5, 6],
       [8, 9]])
# the elements at positions [2,2], [2,3], [3,2], [3,3], counting from 1
arr2d[1:,1:]
array([[5, 6],
       [8, 9]])

(ii) Slices are views (shallow copies), not deep copies

arr_slice = arr[3:6] # a plain slice is a view that shares memory with arr
arr_slice[2]=20
arr
array([ 3,  4,  5, 12, 12, 20,  9, 10, 11, 12, 13, 14])
arr_slice2 = arr[3:6].copy() # copy() makes an independent deep copy
arr_slice2[2]=200
arr
array([ 3,  4,  5, 12, 12, 20,  9, 10, 11, 12, 13, 14])
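
Whether two arrays share the same underlying data can be checked directly with np.shares_memory. A minimal sketch, using a fresh array so it can be run on its own (the names view and independent are illustrative):

```python
import numpy as np

arr = np.arange(3, 15)
view = arr[3:6]                # slice: a view onto arr's data
independent = arr[3:6].copy()  # explicit copy: separate data

print(np.shares_memory(arr, view))         # True
print(np.shares_memory(arr, independent))  # False

view[:] = 0                    # writes through to arr
print(arr[3:6])                # [0 0 0]
print(independent)             # [6 7 8], unaffected
```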

(iii) Indexing

arr3d = np.arange(1,13).reshape((2,2,3))
arr3d
array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])
arr3d[0]
array([[1, 2, 3],
       [4, 5, 6]])
arr3d[0][1]
# arr3d[0,1] works too
array([4, 5, 6])
arr3d[0,1]
array([4, 5, 6])
arr3d[0,1,2]
6

(iv) Boolean indexing

names = np.array(['Bob','Joe','Will','Bob','Joe','Will','Joe'])
data = np.random.randn(7,4)
names
array(['Bob', 'Joe', 'Will', 'Bob', 'Joe', 'Will', 'Joe'], dtype='<U4')
data
array([[-0.67049599, -0.26062926, -2.18117138, -0.12773764],
       [-0.58401462, -1.32029132,  2.58490103,  0.70486619],
       [-0.88625639,  0.5378594 ,  1.43502916,  0.16710821],
       [ 2.37130421,  0.24034913,  1.3959    , -0.66045837],
       [ 1.56083756,  0.55371859, -0.79279555,  1.38047116],
       [ 0.69331481, -0.48217467,  0.04360928,  0.38047942],
       [ 0.09901121,  0.43095669, -0.20241129,  0.13803783]])
names == 'Bob'
array([ True, False, False,  True, False, False, False])
data[names == 'Bob'] # a boolean array can be used as an index
array([[-0.67049599, -0.26062926, -2.18117138, -0.12773764],
       [ 2.37130421,  0.24034913,  1.3959    , -0.66045837]])
data[~(names == 'Bob')] # invert the selection
array([[-0.58401462, -1.32029132,  2.58490103,  0.70486619],
       [-0.88625639,  0.5378594 ,  1.43502916,  0.16710821],
       [ 1.56083756,  0.55371859, -0.79279555,  1.38047116],
       [ 0.69331481, -0.48217467,  0.04360928,  0.38047942],
       [ 0.09901121,  0.43095669, -0.20241129,  0.13803783]])
data[names!='Bob'] # invert the selection, same as above
array([[-0.58401462, -1.32029132,  2.58490103,  0.70486619],
       [-0.88625639,  0.5378594 ,  1.43502916,  0.16710821],
       [ 1.56083756,  0.55371859, -0.79279555,  1.38047116],
       [ 0.69331481, -0.48217467,  0.04360928,  0.38047942],
       [ 0.09901121,  0.43095669, -0.20241129,  0.13803783]])
# combine conditions with boolean operators such as & and |
mask = (names=='Bob')| (names =='Will')

The Python keywords and and or do not work with boolean arrays. Use & and | instead.

mask
array([ True, False,  True,  True, False,  True, False])
data[mask]
array([[-0.67049599, -0.26062926, -2.18117138, -0.12773764],
       [-0.88625639,  0.5378594 ,  1.43502916,  0.16710821],
       [ 2.37130421,  0.24034913,  1.3959    , -0.66045837],
       [ 0.69331481, -0.48217467,  0.04360928,  0.38047942]])
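
To see why and/or cannot be used here: they ask Python for a single truth value of the whole array, which NumPy refuses to give for arrays with more than one element. A small sketch (the shortened names array and the error_message variable are illustrative):

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob'])

error_message = None
try:
    # `or` needs one True/False for each whole array, which is ambiguous.
    mask = (names == 'Bob') or (names == 'Will')
except ValueError as exc:
    error_message = str(exc)

# `|` works element by element. The parentheses are required because
# | binds more tightly than ==.
mask = (names == 'Bob') | (names == 'Will')

print(error_message)
print(mask)
```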

Setting values with a boolean array

# replace every negative value with 0
data[data<0]=0
data
array([[0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 2.58490103, 0.70486619],
       [0.        , 0.5378594 , 1.43502916, 0.16710821],
       [2.37130421, 0.24034913, 1.3959    , 0.        ],
       [1.56083756, 0.55371859, 0.        , 1.38047116],
       [0.69331481, 0.        , 0.04360928, 0.38047942],
       [0.09901121, 0.43095669, 0.        , 0.13803783]])

Setting entire rows with a one-dimensional boolean array

data[names != 'Joe']=6
data
array([[6.        , 6.        , 6.        , 6.        ],
       [0.        , 0.        , 2.58490103, 0.70486619],
       [6.        , 6.        , 6.        , 6.        ],
       [6.        , 6.        , 6.        , 6.        ],
       [1.56083756, 0.55371859, 0.        , 1.38047116],
       [6.        , 6.        , 6.        , 6.        ],
       [0.09901121, 0.43095669, 0.        , 0.13803783]])

(v) Fancy indexing

  • Indexing with integer arrays
arr = np.zeros((8,4))
for i in range(8):
    arr[i]=i
arr
array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [2., 2., 2., 2.],
       [3., 3., 3., 3.],
       [4., 4., 4., 4.],
       [5., 5., 5., 5.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]])
arr[[4,3,0,6]] # [4,3,0,6] selects rows 4, 3, 0, and 6, in that order
array([[4., 4., 4., 4.],
       [3., 3., 3., 3.],
       [0., 0., 0., 0.],
       [6., 6., 6., 6.]])
arr[[-3,-1,-2]] # negative indices count from the end, just like ordinary negative subscripts
array([[5., 5., 5., 5.],
       [7., 7., 7., 7.],
       [6., 6., 6., 6.]])
arr = np.arange(32).reshape((8,4))
arr
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])
arr[[4,3,0,6],[0,3,1,2]] # selects [4,0], [3,3], [0,1], [6,2]: row indices [4,3,0,6] are paired one-to-one with column indices [0,3,1,2]
array([16, 15,  1, 26])
arr[[1,5,7,2]][:,[0,3,1,2]] # first select rows [1,5,7,2] in that order; then, in [:,[0,3,1,2]], ":" keeps all of those rows and the columns are reordered as [0,3,1,2]
# worth studying carefully!
array([[ 4,  7,  5,  6],
       [20, 23, 21, 22],
       [28, 31, 29, 30],
       [ 8, 11,  9, 10]])
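
The two-step selection above can also be written with np.ix_, which builds the row/column grid in one indexing operation. The sketch below compares it with the chained form and with the paired form (rows and cols are illustrative names):

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

chained = arr[rows][:, cols]        # rows first, then reorder the columns
with_ix = arr[np.ix_(rows, cols)]   # same rectangular selection in one step
pairs = arr[rows, cols]             # pairs up indices: (1,0), (5,3), (7,1), (2,2)

print(np.array_equal(chained, with_ix))  # True
print(pairs)                             # [ 4 23 29 10]
```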

(vi) Transposing arrays and swapping axes (already touched on in the arithmetic section)

arr = np.arange(15).reshape((3,5))
arr
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
arr.T
array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])
np.dot(arr,arr.T)
array([[ 30,  80, 130],
       [ 80, 255, 430],
       [130, 430, 730]])

Transposing

  • arr.T (a simple axis swap)
  • transpose() (takes the axis numbers)
  • swapaxes() (takes the axis numbers)
arr=np.arange(16).reshape((2,2,4))
arr
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]]])
arr.transpose((1,0,2)) # (1,0,2) is the new order of axes 0,1,2; here axis 0 and axis 1 are swapped
array([[[ 0,  1,  2,  3],
        [ 8,  9, 10, 11]],

       [[ 4,  5,  6,  7],
        [12, 13, 14, 15]]])
arr.swapaxes(1,2)
array([[[ 0,  4],
        [ 1,  5],
        [ 2,  6],
        [ 3,  7]],

       [[ 8, 12],
        [ 9, 13],
        [10, 14],
        [11, 15]]])
arr.swapaxes(1,2).shape
(2, 4, 2)

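(2) Universal functions (for element-wise operations)

  • Functions that perform element-wise operations on the elements of an ndarray
  • Fast and simple

(a) Unary universal functions (unary ufunc)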

arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

(i) Square root

np.sqrt(arr)
array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])

(ii) Exponential

np.exp(arr)
array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03])

(b) Binary universal functions (binary ufunc)

x = np.random.randn(8)
y = np.random.randn(8)
x
array([ 1.69534398, -0.20536174,  1.23043455,  1.96227903, -0.79695811,
        0.47007701, -0.06528047,  2.29325038])
y
array([ 0.40641935,  1.22983554, -0.18501079,  0.92984369,  1.01300343,
        0.11191783, -1.83815574, -1.67414424])

(i) Element-wise maximum/minimum

np.maximum(x,y)
array([ 1.69534398,  1.22983554,  1.23043455,  1.96227903,  1.01300343,
        0.47007701, -0.06528047,  2.29325038])
np.minimum(x,y)
array([ 0.40641935, -0.20536174, -0.18501079,  0.92984369, -0.79695811,
        0.11191783, -1.83815574, -1.67414424])

(ii) The modf function: returns the fractional and integer parts

arr = np.random.randn(7)*5
arr
array([ 5.39802994, -4.68762509,  6.4421605 , -1.19891402, 12.72916784,
        0.49999429, 10.21618861])
frac_part, int_part = np.modf(arr) # fractional parts first, then integer parts
frac_part
array([ 0.39802994, -0.68762509,  0.4421605 , -0.19891402,  0.72916784,
        0.49999429,  0.21618861])
int_part
array([ 5., -4.,  6., -1., 12.,  0., 10.])
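
np.modf splits each number into a fractional part and an integer part, both carrying the sign of the input, so the two always add back up to the original. A tiny check on hand-picked values (the names frac and whole are illustrative):

```python
import numpy as np

arr = np.array([5.4, -4.7, 0.5])
frac, whole = np.modf(arr)  # fractional parts, integer parts (both float arrays)

print(frac)
print(whole)                         # [ 5. -4.  0.]
print(np.allclose(frac + whole, arr))  # True
```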

(3) Data processing with ndarrays

(a) Example


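import matplotlib.pyplot as plt

points = np.arange(-5,5,0.01) # 1000 evenly spaced points
xs, ys = np.meshgrid(points,points) # coordinate grids built from the points
z = np.sqrt(xs**2+ys**2) # distance from the origin at every grid point
plt.title(r"Image of $\sqrt{x^2+y^2}$") # raw string keeps the LaTeX backslash intact
plt.imshow(z,cmap=plt.cm.gray)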

[Figure: grayscale image of sqrt(x^2 + y^2) over the grid]

(b) Expressing conditional logic as array operations (np.where)

  • where accepts both arrays and scalars

(i) With arrays

xarr = np.arange(1.1,1.6,0.1)
yarr = np.arange(2.1,2.6,0.1)
cond = np.array([True, False, True, True, False])
result = np.where(cond,xarr,yarr)
result
array([1.1, 2.2, 1.3, 1.4, 2.5])

(ii) With scalars: replace every positive value with 2 and every other value with -2

arr = np.random.randn(4,4)
arr
array([[-1.09102617,  0.14488428,  0.39996343, -0.58025741],
       [ 0.16935005, -0.35147731,  0.12913876, -1.627593  ],
       [-0.91612171, -1.43681774, -0.20800336, -0.25200059],
       [-0.73166757,  1.37763498,  0.31321662, -0.44070821]])
np.where(arr > 0, 2, -2)
array([[-2,  2,  2, -2],
       [ 2, -2,  2, -2],
       [-2, -2, -2, -2],
       [-2,  2,  2, -2]])

(iii) Array plus scalar

np.where(arr > 0, 2, arr) # elements of arr greater than 0 become 2; the rest are unchanged
array([[-1.09102617,  2.        ,  2.        , -0.58025741],
       [ 2.        , -0.35147731,  2.        , -1.627593  ],
       [-0.91612171, -1.43681774, -0.20800336, -0.25200059],
       [-0.73166757,  2.        ,  2.        , -0.44070821]])
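
np.where(cond, x, y) is the vectorized form of the expression x if cond else y. The sketch below puts the pure-Python version next to it for comparison; the names slow and fast are illustrative, and the arrays are written out literally instead of using arange.

```python
import numpy as np

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

# Pure-Python version: one element at a time.
slow = [x if c else y for x, y, c in zip(xarr, yarr, cond)]

# Vectorized version: the whole array at once.
fast = np.where(cond, xarr, yarr)

print(fast)  # [1.1 2.2 1.3 1.4 2.5]
```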

(c) Mathematical and statistical methods

  • A set of mathematical functions computes statistics over the whole array or along one axis
arr = np.arange(20).reshape(5,4)
arr
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

(i) Mean

arr.mean()
9.5
np.mean(arr)
9.5
np.average(arr)
9.5

Mean along an axis

arr.mean(0) # mean along axis 0: the rows are collapsed, giving the mean of each column
array([ 8.,  9., 10., 11.])
np.mean(arr,axis=0)
array([ 8.,  9., 10., 11.])

(ii) Sum

arr.sum()
190
np.sum(arr)
190

Sum along an axis

np.sum(arr,axis=1)
array([ 6, 22, 38, 54, 70])
arr.sum(1)
array([ 6, 22, 38, 54, 70])

(iii) Cumulative sum/cumulative product

arr.cumsum()
array([  0,   1,   3,   6,  10,  15,  21,  28,  36,  45,  55,  66,  78,
        91, 105, 120, 136, 153, 171, 190], dtype=int32)
np.cumprod(arr) # the first element is 0, so every cumulative product is 0
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int32)

Cumulative sum/cumulative product along an axis

arr.cumsum(1)
array([[ 0,  1,  3,  6],
       [ 4,  9, 15, 22],
       [ 8, 17, 27, 38],
       [12, 25, 39, 54],
       [16, 33, 51, 70]], dtype=int32)
np.cumprod(arr,axis=1)
array([[    0,     0,     0,     0],
       [    4,    20,   120,   840],
       [    8,    72,   720,  7920],
       [   12,   156,  2184, 32760],
       [   16,   272,  4896, 93024]], dtype=int32)

(iv) Median

np.median(arr)
9.5

(v) Maximum/minimum

np.max(arr)
19
np.min(arr)
0
np.max(arr,axis=0)
array([16, 17, 18, 19])

(vi) Indices of the maximum and minimum

np.argmin(arr)
0
np.argmax(arr)
19
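
Without an axis, argmin and argmax return a position in the flattened array (that is why the answer above is 19, not a row/column pair). np.unravel_index converts it back into coordinates. A short sketch with illustrative names:

```python
import numpy as np

arr = np.arange(20).reshape(5, 4)

flat = np.argmax(arr)                    # position in the flattened array
pos = np.unravel_index(flat, arr.shape)  # (row, column) of the maximum
per_col = arr.argmax(axis=0)             # row index of the maximum in each column

print(flat)     # 19
print(pos)      # row 4, column 3
print(per_col)  # [4 4 4 4]
```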

(vii) Standard deviation and variance

np.std(arr)
5.766281297335398
np.var(arr)
33.25

(d) Methods for boolean arrays

arr = np.random.randn(100)

(i) Counting with sum

(arr > 0).sum() # how many elements of arr are greater than 0
52

(ii) any (at least one value is True) / all (every value is True)

bools = arr>0
bools
array([False, False, False, False,  True,  True,  True,  True,  True,
       False, False, False,  True, False,  True, False,  True, False,
       False, False, False, False, False, False,  True,  True,  True,
        True, False,  True, False,  True,  True,  True,  True,  True,
       False,  True, False,  True,  True, False, False, False,  True,
       False,  True, False,  True,  True,  True,  True,  True, False,
       False, False,  True, False,  True,  True, False,  True, False,
        True,  True, False,  True,  True,  True, False, False,  True,
        True, False,  True,  True, False,  True, False,  True, False,
       False, False, False,  True, False,  True, False,  True, False,
       False,  True,  True,  True, False,  True,  True, False,  True,
       False])
bools.any()
True
bools.all()
False

(iii) Sorting

arr = np.array(np.random.randn(6)*10, dtype=np.int32)
arr
array([ 2, 21, -9, -7,  7, -2])
arr.sort()
arr
array([-9, -7, -2,  2,  7, 21])

Sorting along an axis

arr = np.array(np.random.randn(5,3)*10,dtype=np.int32)
arr
array([[  4,   9,   9],
       [-11,   3, -18],
       [  8,   9,   5],
       [ -5,   7,  -1],
       [ 16,   3, -10]])
arr.sort(1)
arr
array([[  4,   9,   9],
       [-18, -11,   3],
       [  5,   8,   9],
       [ -5,  -1,   7],
       [-10,   3,  16]])

(e) Unique values and set logic

(i) np.unique

names = np.array(['Amy','Bob','Carol','Dark','Amy','Carol','Sky','Dark'])
np.unique(names) # sorted unique values
array(['Amy', 'Bob', 'Carol', 'Dark', 'Sky'], dtype='<U5')

(ii)np.in1d

values1 = np.array(np.random.randn(5)*10,dtype=np.int32)
values1
array([ 0,  9, -9, 11, -4])
values2 = np.arange(6)
values2
array([0, 1, 2, 3, 4, 5])
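np.in1d(values1,values2) # for each element of values1, whether it appears in values2 (np.isin is the recommended replacement in newer NumPy versions)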
array([ True, False, False, False, False])
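
np.isin performs the same membership test as np.in1d, and NumPy has a few more set-style helpers for one-dimensional arrays. The sketch below reuses the sample values printed above, written out literally so the results are reproducible:

```python
import numpy as np

values1 = np.array([0, 9, -9, 11, -4])
values2 = np.arange(6)

member = np.isin(values1, values2)        # is each element of values1 in values2?
common = np.intersect1d(values1, values2) # sorted values present in both
either = np.union1d(values1, values2)     # sorted values present in at least one
only_1 = np.setdiff1d(values1, values2)   # sorted values in values1 but not values2

print(member)  # [ True False False False False]
print(common)  # [0]
print(either)
print(only_1)  # [-9 -4  9 11]
```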

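(4) File input and output for arrays

  • save: saves one array
  • savez: saves several arrays in one uncompressed .npz archive
  • savez_compressed: like savez, but compresses the archive; useful when the data compresses well
  • load: loads saved arrays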
arr = np.arange(10)
np.save('some_array',arr) # save; if the file path does not end in .npy, the extension is appended automatically
np.load('some_array.npy') # read the array back from disk
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.savez('mul_arrays.npz',a=arr,b=arr)
arch = np.load('mul_arrays.npz')
arch['a']
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arch['b']
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
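
savez_compressed is listed above but not demonstrated. Here is a minimal round-trip sketch; it writes into a temporary directory so nothing is left behind, and the file name arrays_compressed.npz is just an example.

```python
import os
import tempfile

import numpy as np

arr = np.arange(10)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'arrays_compressed.npz')

    # Each keyword argument becomes a named array inside the archive.
    np.savez_compressed(path, a=arr, b=arr * 2)

    # The loaded archive behaves like a dict; closing it releases the file.
    with np.load(path) as arch:
        keys = sorted(arch.files)
        a_loaded = arch['a']
        b_loaded = arch['b']

print(keys)      # ['a', 'b']
print(b_loaded)  # [ 0  2  4 ... 18]
```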

(5) Linear algebra

Several of these methods have already appeared above

x = np.arange(1,7).reshape(2,3)
y = np.array([[6., 23.],
              [-1,7],
              [8,9]])
x
array([[1, 2, 3],
       [4, 5, 6]])
y
array([[ 6., 23.],
       [-1.,  7.],
       [ 8.,  9.]])

(a) Matrix multiplication

x.dot(y)
array([[ 28.,  64.],
       [ 67., 181.]])
np.dot(x,y)
array([[ 28.,  64.],
       [ 67., 181.]])
x@y
array([[ 28.,  64.],
       [ 67., 181.]])

(b)numpy.linalg

from numpy.linalg import inv, qr
x = np.random.randn(5,5)
mat = x.T.dot(x)

(i)inv()

inv(mat) # matrix inverse
array([[ 891.89281934, -170.61175297,  309.30921484,  -20.04646306,
         331.54866714],
       [-170.61175297,   34.31559149,  -58.38536593,    4.86214551,
         -63.16345462],
       [ 309.30921484,  -58.38536593,  108.49295298,   -6.21220496,
         115.01146399],
       [ -20.04646306,    4.86214551,   -6.21220496,    1.31252171,
          -7.36221202],
       [ 331.54866714,  -63.16345462,  115.01146399,   -7.36221202,
         123.49445396]])
a = mat.dot(inv(mat))
a # floating-point rounding error leaves tiny nonzero off-diagonal entries, so it does not look exactly like I
array([[ 1.00000000e+00, -7.89016539e-15,  2.58787478e-14,
        -5.90488249e-15,  3.81666466e-14],
       [ 2.88347099e-16,  1.00000000e+00, -1.53280197e-14,
        -5.64103998e-15, -2.88340488e-14],
       [-3.41925805e-14,  7.92307257e-15,  1.00000000e+00,
         3.11646821e-15, -1.01181538e-15],
       [-1.21304008e-13, -5.78606747e-16, -1.37942448e-14,
         1.00000000e+00, -2.22511130e-14],
       [ 3.98501266e-14,  1.39518727e-14,  5.26455377e-14,
         3.24456882e-15,  1.00000000e+00]])
a.dtype
dtype('float64')
a.round()
array([[ 1., -0.,  0., -0.,  0.],
       [ 0.,  1., -0., -0., -0.],
       [-0.,  0.,  1.,  0., -0.],
       [-0., -0., -0.,  1., -0.],
       [ 0.,  0.,  0.,  0.,  1.]])
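
Rather than rounding and inspecting by eye, np.allclose can compare the product against the identity within a tolerance. The sketch below uses a small hand-picked matrix (an assumption for illustration, not the random matrix above) and also shows np.linalg.solve, which solves a linear system without forming the inverse explicitly.

```python
import numpy as np
from numpy.linalg import inv, solve

mat = np.array([[4.0, 1.0],
                [2.0, 3.0]])
rhs = np.array([1.0, 2.0])

# mat @ inv(mat) should equal the identity up to rounding error.
identity_ok = np.allclose(mat @ inv(mat), np.eye(2))

# Two ways to solve mat @ x = rhs.
x_via_inv = inv(mat) @ rhs
x_via_solve = solve(mat, rhs)

print(identity_ok)  # True
print(x_via_solve)  # [0.1 0.6]
```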

(ii)qr()

  • Compute the qr factorization of a matrix.
  • Factor the matrix a as qr, where q is orthonormal and r is upper-triangular.
q, r = qr(mat)
r
array([[-2.96659044e+00, -3.21767296e+00,  3.93922588e-01,
         2.99766871e+00,  6.13702819e+00],
       [ 0.00000000e+00, -3.82625378e+00, -2.29884715e+00,
         7.41479928e+00,  6.24315251e-01],
       [ 0.00000000e+00,  0.00000000e+00, -1.40167295e+00,
         1.16181054e+00,  1.37841479e+00],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        -6.45930290e-01, -3.89372289e-02],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  2.64955879e-03]])
q
array([[-0.43313575,  0.09964799,  0.17103318,  0.03903137,  0.87845768],
       [-0.34127079, -0.46320476,  0.42064025, -0.68119717, -0.16735529],
       [ 0.21554375, -0.27033019, -0.7555402 , -0.46557635,  0.30472964],
       [ 0.13012615,  0.81285083,  0.0841993 , -0.5611334 , -0.01950661],
       [ 0.79532116, -0.2042223 ,  0.46462772, -0.05305607,  0.32720582]])
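
The defining properties of the factorization can be checked numerically: q times r reproduces the matrix, q has orthonormal columns, and r is upper triangular. A sketch using a seeded generator so that it is reproducible (the seed 0 is an arbitrary choice):

```python
import numpy as np
from numpy.linalg import qr

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5))
mat = x.T @ x

q, r = qr(mat)

reconstructs = np.allclose(q @ r, mat)           # q r == mat
orthonormal = np.allclose(q.T @ q, np.eye(5))    # q^T q == I
upper_triangular = np.allclose(r, np.triu(r))    # nothing below the diagonal

print(reconstructs, orthonormal, upper_triangular)  # True True True
```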

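(6) Pseudorandom number generation (numpy.random)

Python's built-in random module generates only one sample at a time. When many samples are needed, numpy.random is more than an order of magnitude faster, as the following test shows: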

from random import normalvariate
N = 1000000
%timeit samples = [normalvariate(0,1) for i in range(N)]
1.87 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.random.normal(size=N)
55 ms ± 3.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

(a) Standard normal distribution

samples = np.random.normal(size=(4,4))
samples
array([[ 1.79400576, -0.34332551, -0.7315326 ,  0.37596174],
       [ 0.68897482, -1.01607419, -0.18746967,  0.03575278],
       [ 0.04196738,  0.96214581, -1.3443093 ,  1.14355111],
       [ 0.32311173, -1.22932036,  0.14297192,  1.8289397 ]])

(b) Uniform distribution

samples2 = np.random.uniform(size=(4,4))
samples2
array([[0.33197232, 0.50249425, 0.20872139, 0.44100725],
       [0.79970626, 0.8493364 , 0.38371009, 0.80270876],
       [0.81254287, 0.80318489, 0.18548665, 0.48484211],
       [0.60264807, 0.41739885, 0.62637336, 0.27848417]])
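
The samples above differ on every run. When reproducible numbers are needed, one option is a seeded Generator created with np.random.default_rng, which is a different interface from the np.random.* functions used in these notes. A sketch, with the seed 12345 chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(seed=12345)

normal_samples = rng.standard_normal((4, 4))  # standard normal
uniform_samples = rng.uniform(size=(4, 4))    # uniform on [0, 1)
int_samples = rng.integers(0, 100, size=(4, 4))  # integers in [0, 100)

# A second generator with the same seed reproduces the same stream.
rng_again = np.random.default_rng(seed=12345)
repeat = rng_again.standard_normal((4, 4))

print(np.array_equal(normal_samples, repeat))  # True
```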

(7) Random walks

A simple random walk example: start at 0, with steps of 1 and -1 equally likely

import random
position = 0
walk = [position]
steps = 1000
for i in range(steps):
    step = 1 if random.randint(0,1) else -1
    position += step
    walk.append(position)
plt.plot(walk[:100])
[Figure: line plot of the first 100 steps of the random walk]

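It is easy to see that the walk is simply the cumulative sum of the individual steps, so it can be computed with a single array operation.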

nsteps = 1000
draws = np.random.randint(0,2,size = nsteps)
steps = np.where(draws>0,1,-1)
walk = steps.cumsum()
plt.plot(walk[:100])
[Figure: line plot of the first 100 steps of the vectorized random walk]

Simulating many random walks at once

nwalks = 50 # simulate 50 random walks
nsteps = 1000
draws = np.random.randint(0,2,size = (nwalks, nsteps))
steps = np.where(draws>0,1,-1)
walks = steps.cumsum(1) # cumulative sum along axis 1, i.e. within each row (one row per walk)
walks
array([[ -1,  -2,  -3, ...,  -2,  -1,  -2],
       [ -1,   0,   1, ...,  12,  11,  12],
       [ -1,  -2,  -1, ..., -18, -17, -18],
       ...,
       [  1,   0,   1, ...,   8,   7,   8],
       [  1,   2,   3, ..., -42, -41, -42],
       [ -1,   0,  -1, ...,  16,  15,  16]], dtype=int32)
walks.max() # maximum over all the random walks
79
walks.min() # minimum over all the random walks
-100
hits30 = (np.abs(walks) >= 30).any(1) # for each row, whether that walk ever reached ±30
hits30
array([False, False,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
       False,  True, False, False,  True,  True,  True, False,  True,
        True, False,  True,  True, False,  True, False,  True,  True,
        True, False,  True, False, False,  True,  True,  True,  True,
       False,  True, False,  True, False])
hits30.sum() # number of True values
33
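
A natural follow-up is how many steps each of those walks needed to first reach ±30. Since argmax on a boolean array returns the position of the first True, it can serve as a "first crossing" lookup. The sketch below uses a seeded generator and rng.choice for the steps, which is a substitution for the randint/where combination above; crossing_times is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(0)
nwalks, nsteps = 50, 1000

steps = rng.choice([-1, 1], size=(nwalks, nsteps))  # +1 or -1 with equal probability
walks = steps.cumsum(axis=1)                        # one walk per row

reached = np.abs(walks) >= 30
hits30 = reached.any(axis=1)                        # walks that reached ±30 at least once

# For those walks only, argmax finds the index of the first True in each row.
crossing_times = reached[hits30].argmax(axis=1)

print(hits30.sum())
if crossing_times.size:
    print(crossing_times.mean())
```

Restricting to hits30 first matters: for a row with no True at all, argmax would return 0, which would look like an immediate crossing.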

(8) Combining NumPy arrays

A=np.array([1,1,1])
B=np.array([2,2,2])
# vertical stack
print(np.vstack((A,B)))
print(np.vstack((A,B)).shape)
[[1 1 1]
 [2 2 2]]
(2, 3)
# horizontal stack
print(np.hstack((A,B)))
print(np.hstack((A,B)).shape)
[1 1 1 2 2 2]
(6,)
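# how to turn a row sequence into a column: transpose cannot do it, because the transpose of a 1-D array is the same 1-D array
print(A[:,np.newaxis]) # add a new axis so that each element becomes its own row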

A1 = np.array([1,1,1])[:,np.newaxis]
B1 = np.array([2,2,2])[:,np.newaxis]
print(A1)
print(B1)
C1 = np.vstack((A1,B1))
print(C1)
print(C1.shape)
D1 = np.hstack((A1,B1,B1,A1))
print(D1)
print(D1.shape)
[[1]
 [1]
 [1]]
[[1]
 [1]
 [1]]
[[2]
 [2]
 [2]]
[[1]
 [1]
 [1]
 [2]
 [2]
 [2]]
(6, 1)
[[1 2 2 1]
 [1 2 2 1]
 [1 2 2 1]]
(3, 4)
# combining several arrays vertically or horizontally
C2 = np.concatenate((A1,B1,B1,A1),axis=0)
print(C2)
[[1]
 [1]
 [1]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [1]
 [1]
 [1]]
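
The example above only concatenates along axis=0. For the horizontal direction, the sketch below shows three equivalent ways to place two 1-D arrays side by side as columns (the result names are illustrative):

```python
import numpy as np

A = np.array([1, 1, 1])
B = np.array([2, 2, 2])

# Explicit: add a column axis, then join along axis=1.
by_concat = np.concatenate((A[:, np.newaxis], B[:, np.newaxis]), axis=1)

# Shortcuts that add the new axis automatically.
by_column_stack = np.column_stack((A, B))
by_stack = np.stack((A, B), axis=1)

print(by_concat)
print(by_concat.shape)  # (3, 2)
```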

(9) Splitting NumPy arrays

A = np.arange(12).reshape(3,4)
A
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
# equal split; axis chooses the direction
np.split(A,2,axis=1)
[array([[0, 1],
        [4, 5],
        [8, 9]]),
 array([[ 2,  3],
        [ 6,  7],
        [10, 11]])]
# unequal split; axis chooses the direction
np.array_split(A,3,axis=1)
[array([[0, 1],
        [4, 5],
        [8, 9]]),
 array([[ 2],
        [ 6],
        [10]]),
 array([[ 3],
        [ 7],
        [11]])]
# equal vertical split (by rows)
np.vsplit(A,3)
[array([[0, 1, 2, 3]]), array([[4, 5, 6, 7]]), array([[ 8,  9, 10, 11]])]
# equal horizontal split (by columns)
np.hsplit(A,2)
[array([[0, 1],
        [4, 5],
        [8, 9]]),
 array([[ 2,  3],
        [ 6,  7],
        [10, 11]])]

(10) Shallow and deep copies

a = np.arange(4)
a
array([0, 1, 2, 3])
# the assignment operator does not copy anything: every name refers to the same array object
b = a
c = a
d = b
a[0]=100
print(b is a)
print("b: ",b)
print(c is a)
print("c: ",c)
print(d is a)
print("d: ",d)
True
b:  [100   1   2   3]
True
c:  [100   1   2   3]
True
d:  [100   1   2   3]
# copy() makes a deep copy
b=a.copy()
print(b is a)
print("b: ",b)
False
b:  [100   1   2   3]
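
Strictly speaking there are three cases: plain assignment (no new array at all), a view (a new array object that shares the data), and a copy (a new array object with its own data). A compact sketch that tells them apart; the names alias, view, and deep are illustrative:

```python
import numpy as np

a = np.arange(4)

alias = a        # assignment: the same object under another name
view = a.view()  # view: a new object that shares a's data
deep = a.copy()  # copy: a new object with its own data

a[0] = 100

print(alias is a, alias[0])                      # True 100
print(view is a, np.shares_memory(view, a), view[0])  # False True 100
print(deep is a, np.shares_memory(deep, a), deep[0])  # False False 0
```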
