Task01: Prerequisites

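    • Ex1: Matrix multiplication with list comprehensions
    • Ex2: Updating a matrix
    • Ex3: Chi-square statistic
    • Ex4: Improving the performance of a matrix computation
    • Ex5: Maximum length of consecutive integers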

Reference: https://datawhalechina.github.io/joyful-pandas/build/html/%E7%9B%AE%E5%BD%95/ch1.html

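Ex1: Matrix multiplication with list comprehensions

Following its defining formula, ordinary matrix multiplication can be written as a triple loop. Rewrite it as a list comprehension.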

import numpy as np
M1 = np.random.rand(2,3)
M2 = np.random.rand(3,4)
print(M1)
print(M2)
M1@M2
[[0.21882773 0.83856712 0.10216554]  # M1, shape = (2,3)
 [0.8178865  0.82343903 0.14219665]]
[[0.4981882  0.81916687 0.285332   0.69347725]  # M2, shape = (3,4)
 [0.92874779 0.18476286 0.51331421 0.50188801]
 [0.08611562 0.91474668 0.99530425 0.47614143]]
array([[0.8966328 , 0.42764808, 0.59457277, 0.62126409],  # result, shape = (2,4)
       [1.18447394, 0.95220039, 0.79758107, 1.04816558]])
# zip() packs several iterables into an iterable of tuples. It returns a zip
# object; tuple() or list() turns it into the packed result.
# zip(*M2) therefore yields the columns of M2.
list(zip(*M2))
[(0.49818820178117007, 0.9287477924393945, 0.08611562119504934),
 (0.8191668746611285, 0.1847628644683621, 0.9147466808414475),
 (0.28533199558653055, 0.5133142118732157, 0.9953042512832437),
 (0.6934772523076975, 0.5018880125960207, 0.47614143326355496)]
# Method 1:
# M1 @ M2: the inner comprehension iterates over the columns of M2, the outer one over the rows of M1
z1 = [[sum(map(lambda x: x[0]*x[1],zip(i,j))) for i in zip(*M2)] for j in M1]
z1
[[0.8966328043025076,
  0.4276480815824257,
  0.5945727688239985,
  0.6212640864654319],
 [1.1844739389352992,
  0.9522003935787853,
  0.7975810730177366,
  1.0481655767133908]]
# Method 2:
z2 = [[sum([M1[i][k]*M2[k][j] for k in range(M1.shape[1])]) for j in range(M2.shape[1])] for i in range(M1.shape[0])]
z2
[[0.8966328043025076,
  0.4276480815824257,
  0.5945727688239985,
  0.6212640864654319],
 [1.1844739389352992,
  0.9522003935787853,
  0.7975810730177366,
  1.0481655767133908]]
 
(np.abs(M1@M2 - z2) < 1e-15).all()
True
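The two methods above can also be packaged as a small helper and checked on a deterministic input. This is only a sketch: the name `matmul_listcomp` and the test matrices `X`, `Y` are illustrative and not part of the original exercise.

```python
import numpy as np

def matmul_listcomp(X, Y):
    """Multiply two 2-D array-likes using only list comprehensions."""
    # zip(*Y) iterates over the columns of Y; zip(row, col) pairs up the
    # elements whose products are summed to give one entry of the result.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

X = np.arange(6).reshape(2, 3)    # [[0, 1, 2], [3, 4, 5]]
Y = np.arange(12).reshape(3, 4)   # rows 0..3, 4..7, 8..11
Z = matmul_listcomp(X, Y)

# Compare against NumPy's own matrix product.
print(np.allclose(Z, X @ Y))
```

For floating-point inputs, `np.allclose` is usually a safer comparison than a fixed absolute threshold such as `1e-15`, because the rounding error of a sum grows with the size of the values and the number of terms.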

Ex2: Updating a matrix

Let $A_{m\times n}$ be a matrix. Update every element of $A$ to produce a matrix $B$, where $B_{ij}=A_{ij}\sum_{k=1}^n\frac{1}{A_{ik}}$. For example, with the matrix $A$ below, $B_{2,2}=5\times(\frac{1}{4}+\frac{1}{5}+\frac{1}{6})=\frac{37}{12}$. Implement this efficiently with NumPy.
$$A= \begin{bmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 9 \end{bmatrix}$$

A = np.arange(1,10).reshape(3,3)
A
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
# Method 1: loop over the rows
for i in range(A.shape[0]):
    print(A[i]*(1/A[i]).sum(0))

[1.83333333 3.66666667 5.5       ]
[2.46666667 3.08333333 3.7       ]
[2.65277778 3.03174603 3.41071429]
# Method 2: sum the reciprocals of each row, then broadcast
(1/A).sum(1)
array([1.83333333, 0.61666667, 0.37896825])

B = A*((1/A).sum(1).reshape(3,1))
B
array([[1.83333333, 3.66666667, 5.5       ],
       [2.46666667, 3.08333333, 3.7       ],
       [2.65277778, 3.03174603, 3.41071429]])
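A possible variant of Method 2 uses `keepdims=True` so that the row sums keep shape `(m, 1)` and no hard-coded `reshape(3,1)` is needed. The variable name `row_inv_sum` is an illustrative choice.

```python
import numpy as np

A = np.arange(1, 10).reshape(3, 3)

# Sum of reciprocals along each row; keepdims=True keeps shape (3, 1),
# so the result broadcasts against A row by row.
row_inv_sum = (1 / A).sum(axis=1, keepdims=True)
B = A * row_inv_sum

print(B)
# Check the worked example from the problem statement: B_{2,2} = 37/12.
print(np.isclose(B[1, 1], 37 / 12))
```

Because the shape is derived from `A` itself, the same two lines work for any $m\times n$ matrix without zero entries.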

Ex3: Chi-square statistic

Let $A_{m\times n}$ be a matrix and let $B_{ij} = \frac{(\sum_{i=1}^mA_{ij})\times (\sum_{j=1}^nA_{ij})}{\sum_{i=1}^m\sum_{j=1}^nA_{ij}}$. The chi-square value is defined as:
$$\chi^2 = \sum_{i=1}^m\sum_{j=1}^n\frac{(A_{ij}-B_{ij})^2}{B_{ij}}$$
Use NumPy to compute $\chi^2$ for the given matrix $A$.

np.random.seed(0)
A = np.random.randint(10, 20, (8, 5))
A
array([[15, 10, 13, 13, 17],
       [19, 13, 15, 12, 14],
       [17, 16, 18, 18, 11],
       [16, 17, 17, 18, 11],
       [15, 19, 18, 19, 14],
       [13, 10, 13, 15, 10],
       [12, 13, 18, 11, 13],
       [13, 13, 17, 10, 11]])
A.sum(1).reshape(-1,1)*A.sum(0)  # shape = (8,5)
array([[ 8160,  7548,  8772,  7888,  6868],
       [ 8760,  8103,  9417,  8468,  7373],
       [ 9600,  8880, 10320,  9280,  8080],
       [ 9480,  8769, 10191,  9164,  7979],
       [10200,  9435, 10965,  9860,  8585],
       [ 7320,  6771,  7869,  7076,  6161],
       [ 8040,  7437,  8643,  7772,  6767],
       [ 7680,  7104,  8256,  7424,  6464]])
B =(A.sum(1).reshape(-1,1)*A.sum(0))/A.sum()
B 
array([[14.14211438, 13.08145581, 15.20277296, 13.67071057, 11.90294627],
       [15.18197574, 14.04332756, 16.32062392, 14.67590988, 12.77816291],
       [16.63778163, 15.38994801, 17.88561525, 16.08318891, 14.0034662 ],
       [16.42980936, 15.19757366, 17.66204506, 15.88214905, 13.82842288],
       [17.67764298, 16.35181976, 19.0034662 , 17.08838821, 14.87868284],
       [12.68630849, 11.73483536, 13.63778163, 12.26343154, 10.67764298],
       [13.93414211, 12.88908146, 14.97920277, 13.46967071, 11.72790295],
       [13.3102253 , 12.31195841, 14.3084922 , 12.86655113, 11.20277296]])   

chiq2 = (np.power((A-B),2)/B).sum()
chiq2 
11.842696601945802         
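The product of row sums and column sums used for `B` is an outer product, so `np.outer` is one way to express it without `reshape`. The sketch below hard-codes the matrix `A` printed above rather than relying on the random seed; the names `row_sums`, `col_sums` and `chi2` are illustrative.

```python
import numpy as np

# The matrix A shown above, copied literally.
A = np.array([[15, 10, 13, 13, 17],
              [19, 13, 15, 12, 14],
              [17, 16, 18, 18, 11],
              [16, 17, 17, 18, 11],
              [15, 19, 18, 19, 14],
              [13, 10, 13, 15, 10],
              [12, 13, 18, 11, 13],
              [13, 13, 17, 10, 11]])

row_sums = A.sum(axis=1)   # shape (8,)
col_sums = A.sum(axis=0)   # shape (5,)

# B[i, j] = row_sums[i] * col_sums[j] / total
B = np.outer(row_sums, col_sums) / A.sum()
chi2 = ((A - B) ** 2 / B).sum()
print(chi2)
```

A quick consistency check on this construction is that each row of `B` sums to the corresponding row sum of `A`, and likewise for columns.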

Ex4: Improving the performance of a matrix computation

Let $Z$ be an $m\times n$ matrix, and let $B$ and $U$ be $m\times p$ and $p\times n$ matrices respectively. $B_i$ denotes the $i$-th row of $B$ and $U_j$ the $j$-th column of $U$. Define $\displaystyle R=\sum_{i=1}^m\sum_{j=1}^n\|B_i-U_j\|_2^2Z_{ij}$, where $\|\mathbf{a}\|_2^2$ denotes the sum of the squared components of the vector $\mathbf{a}$, that is $\sum_i a_i^2$.

Someone computes the value of $R$ from the sample data below using the following code. Making full use of NumPy functions, improve the performance of this code.

np.random.seed(0)
m, n, p = 100, 80, 50
B = np.random.randint(0, 2, (m, p))
U = np.random.randint(0, 2, (p, n))
Z = np.random.randint(0, 2, (m, n))
def solution1(B=B, U=U, Z=Z):
    L_res = []
    for i in range(m):
        for j in range(n):
            norm_value = ((B[i]-U[:,j])**2).sum()
            L_res.append(norm_value*Z[i][j])
    return sum(L_res)
solution1(B, U, Z)

100566
def solution2(B=B, U=U, Z=Z):
    T = np.array([[np.power([(B[i][k] - U[k][j]) for k in range(p)],2).sum() for j in range(n)] for i in range(m)])
    R = (T*Z).sum()
    return R
%timeit L1 = solution1(B, U, Z)
%timeit L2 = solution2(B, U, Z)  # Method 2 is roughly an order of magnitude slower than Method 1
37.5 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
311 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The reference answer provides a more efficient method:

def solution3(B=B, U=U, Z=Z):
    T = (np.power(B,2).sum(1).reshape(-1,1))+(np.power(U,2).sum(0))-2*B@U
    R = (T*Z).sum()
    return R
%timeit L3 = solution3(B, U, Z)  # two orders of magnitude faster than Method 1
260 µs ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
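A short derivation, reconstructed from the code of solution3 rather than quoted from the reference answer, shows why no explicit loop is needed. Expanding the squared norm term by term gives:

```latex
\|B_i-U_j\|_2^2
  = \sum_{k=1}^p (B_{ik}-U_{kj})^2
  = \sum_{k=1}^p B_{ik}^2 + \sum_{k=1}^p U_{kj}^2 - 2\sum_{k=1}^p B_{ik}U_{kj}
  = \|B_i\|_2^2 + \|U_j\|_2^2 - 2\,(BU)_{ij}
```

The three terms correspond to `np.power(B,2).sum(1).reshape(-1,1)`, `np.power(U,2).sum(0)` and `2*B@U`. Broadcasting the `(m, 1)` column against the `(n,)` row produces the full $m\times n$ matrix `T`, which is then weighted by `Z` and summed.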

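Compared with explicit Python loops, NumPy's built-in vectorized operations cut the run time substantially (here from 37.5 ms to 260 µs), so using NumPy functions where appropriate makes the program more efficient.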
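Before trusting a faster version, it helps to confirm that it returns the same value as the loop. The sketch below does this on small random inputs and also includes a direct broadcasting variant for comparison. The function names `r_loop`, `r_broadcast`, `r_expanded` and the sizes are illustrative choices, not from the original.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 6, 5, 4
B = rng.integers(0, 2, (m, p))
U = rng.integers(0, 2, (p, n))
Z = rng.integers(0, 2, (m, n))

def r_loop(B, U, Z):
    """Reference implementation with two explicit Python loops."""
    total = 0
    for i in range(B.shape[0]):
        for j in range(U.shape[1]):
            total += ((B[i] - U[:, j]) ** 2).sum() * Z[i, j]
    return total

def r_broadcast(B, U, Z):
    """Direct broadcasting; builds an (m, n, p) intermediate array."""
    # diff[i, j, k] = B[i, k] - U[k, j]
    diff = B[:, None, :] - U.T[None, :, :]
    return ((diff ** 2).sum(axis=2) * Z).sum()

def r_expanded(B, U, Z):
    """Expanded squared norm; needs only (m, n) intermediates."""
    T = (B ** 2).sum(axis=1)[:, None] + (U ** 2).sum(axis=0) - 2 * B @ U
    return (T * Z).sum()

results = (r_loop(B, U, Z), r_broadcast(B, U, Z), r_expanded(B, U, Z))
print(results)
```

The broadcasting variant is easy to read but allocates $m\times n\times p$ elements, whereas the expanded form only ever holds $m\times n$ matrices, which is one reason to prefer it for larger inputs.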

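Ex5: Maximum length of consecutive integers

Given a NumPy array of integers, return the maximum length of a subarray of consecutive increasing integers, that is, a run in which each element is exactly 1 greater than the previous one. For example, for the input [1,2,5,6,7] the longest such subarray is [5,6,7], so the output is 3; for the input [3,2,1,2,3,4,6] it is [1,2,3,4], so the output is 4. Make full use of NumPy's built-in functions. (Hint: consider the nonzero and diff functions.)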

a = np.array([5,3,2,1,0,-1,2,3,4,6])
print(np.diff(a))
array_zero = np.diff(np.diff(a))  # second difference: look for runs of zeros
print(array_zero)
print(np.nonzero(array_zero))  # indices of the nonzero elements
[-2 -1 -1 -1 -1  3  1  1  2]
[ 1  0  0  0  4 -2  0  1]
(array([0, 4, 5, 7], dtype=int64),)
np.diff(np.nonzero(array_zero))  # gap between nonzero indices of array_zero, plus 1, is the length of an increasing or decreasing run in a
array([[4, 1, 2]], dtype=int64)
np.diff(np.nonzero(array_zero)).max()+1  # maximum length of an increasing or decreasing run of consecutive integers in a
5

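The problem asks only for the maximum length of an increasing run of consecutive integers, whereas the attempt above also counts decreasing runs (here 3, 2, 1, 0, -1). I could not think of a suitable method myself; the method given in the video is: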

np.diff(a)!=1
array([ True,  True,  True,  True,  True,  True, False, False,  True])
np.r_[1,np.diff(a)!=1,1]
array([1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1], dtype=int32)
np.nonzero(np.r_[1,np.diff(a)!=1,1])
(array([ 0,  1,  2,  3,  4,  5,  6,  9, 10], dtype=int64),)
np.diff(np.nonzero(np.r_[1,np.diff(a)!=1,1]))
array([[1, 1, 1, 1, 1, 1, 3, 1]], dtype=int64)
np.diff(np.nonzero(np.r_[1,np.diff(a)!=1,1])).max()
3
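The steps of the video's method can be collected into one function and checked against the two examples in the problem statement. This is a sketch assuming a non-empty one-dimensional integer array; the name `longest_consecutive_run` is illustrative.

```python
import numpy as np

def longest_consecutive_run(a):
    """Length of the longest run in which each element is the previous one plus 1.

    Assumes a non-empty 1-D array of integers.
    """
    a = np.asarray(a)
    # A 1 marks a position where a run ends: wherever the step is not +1.
    # The extra 1 at each end closes the first and the last run.
    breaks = np.r_[1, np.diff(a) != 1, 1]
    # The distance between consecutive break positions is a run length.
    return int(np.diff(np.nonzero(breaks)[0]).max())

print(longest_consecutive_run([1, 2, 5, 6, 7]))        # 3
print(longest_consecutive_run([3, 2, 1, 2, 3, 4, 6]))  # 4
```

Padding with a 1 on both sides is what makes runs touching the start or the end of the array count correctly; without it, a run such as [5,6,7] at the end of the first example would be missed.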
