Algorithm Design and Intelligent Computing || Topic 3: Similarity Measures Between Data

Similarity Measures Between Data

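Contents

  • Similarity Measures Between Data
    • 1. Computing the Euclidean distance
      • 1.1 Euclidean distance between one-dimensional data
      • 1.2 Euclidean distance between multi-dimensional data
      • 1.3 Taking advantage of the NumPy library
    • 2. Matrix form of the Euclidean distance
      • 2.1 Code implementation
      • 2.2 Using a machine learning library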

Common distances


  • Euclidean distance:

$$d(\boldsymbol{x},\boldsymbol{y})=\sqrt{\sum_{i=1}^n(x_i-y_i)^2}=\Vert\boldsymbol{x}-\boldsymbol{y}\Vert_2$$

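  • Cosine distance (the inner product $\boldsymbol{x}^\top\boldsymbol{y}$ alone is a similarity, and it equals the cosine only for unit-norm vectors):

$$d(\boldsymbol{x},\boldsymbol{y})=1-\frac{\boldsymbol{x}^\top\boldsymbol{y}}{\Vert\boldsymbol{x}\Vert_2\,\Vert\boldsymbol{y}\Vert_2}$$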

  • Manhattan distance:

$$d(\boldsymbol{x},\boldsymbol{y})=\sum_{i=1}^n\vert x_i-y_i\vert=\Vert\boldsymbol{x}-\boldsymbol{y}\Vert_1$$

  • Minkowski distance:

$$d(\boldsymbol{x},\boldsymbol{y})=\sqrt[p]{\sum_{i=1}^n\vert x_i-y_i\vert^p}=\Vert\boldsymbol{x}-\boldsymbol{y}\Vert_p$$
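
The four distances listed above can be sketched in a few lines of NumPy. The function names are illustrative and not part of the original article, and the cosine distance is assumed here to mean one minus the cosine similarity.

```python
import numpy as np

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def cosine_distance(x, y):
    # assumption: distance = 1 - cosine similarity; undefined for zero vectors
    return 1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def manhattan(x, y):
    # sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

def minkowski(x, y, p):
    # p = 1 gives the Manhattan distance, p = 2 the Euclidean distance
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

print(euclidean(x, y))        # 5.0
print(manhattan(x, y))        # 7.0
print(minkowski(x, y, 2))     # same value as the Euclidean distance
print(cosine_distance(x, y))  # small positive value: the vectors point in similar directions
```

With p = 1 and p = 2 the Minkowski function reproduces the Manhattan and Euclidean results, which is a quick sanity check on all three.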

1. Computing the Euclidean distance

1.1 Euclidean distance between one-dimensional data

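import numpy as np

def distance_1d(s):
    """Return the n x n matrix of pairwise distances |s[i] - s[j]| for 1-D data."""
    n = len(s)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # in one dimension the Euclidean distance is the absolute difference
            A[i, j] = abs(s[i] - s[j])
    return A

if __name__ == "__main__":
    x = np.array([1, 2, 3, 4, 5])
    D = distance_1d(x)
    print(D)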
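
The double loop above can also be written with NumPy broadcasting. The following is a minimal sketch; `distance_1d_vectorized` is an illustrative name, not a function from the original article.

```python
import numpy as np

def distance_1d_vectorized(s):
    # s[:, None] has shape (n, 1) and s[None, :] has shape (1, n);
    # broadcasting their difference yields all n x n pairwise differences
    s = np.asarray(s, dtype=float)
    return np.abs(s[:, None] - s[None, :])

x = np.array([1, 2, 3, 4, 5])
D = distance_1d_vectorized(x)
print(D)
```

The result is symmetric with a zero diagonal, as a distance matrix should be.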

1.2 Euclidean distance between multi-dimensional data

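import numpy as np

def distance_nd(S):
    """Return the pairwise Euclidean distances between the rows of S."""
    nrow, ncol = S.shape  # number of rows (samples) and columns (features)
    A = np.zeros((nrow, nrow))
    for i in range(nrow):
        for j in range(nrow):
            summ = 0.0
            for k in range(ncol):
                # accumulate the squared difference in each coordinate
                summ = summ + (S[i, k] - S[j, k]) ** 2
            A[i, j] = np.sqrt(summ)
    return A

if __name__ == "__main__":
    X = np.random.randn(5, 10)  # 5 samples with 10 features each
    D = distance_nd(X)
    print(D)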

1.3 Taking advantage of the NumPy library

import numpy as np

def Euclid_dist(x, y):
    """Euclidean distance between two vectors, with the inner loop replaced by NumPy calls."""
    dist = np.sqrt(np.sum(np.square(x - y)))
    return dist

if __name__ == "__main__":
    X = np.random.randn(5, 10)
    nrow, ncol = X.shape  # number of rows (samples) and columns (features)
    A = np.zeros((nrow, nrow))
    for i in range(nrow):
        for j in range(nrow):
            A[i, j] = Euclid_dist(X[i, :], X[j, :])
    print(A)
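
The two remaining loops over sample pairs can be removed as well. The sketch below uses broadcasting and checks the result against an explicit loop. `distance_nd_vectorized` is an illustrative name, and the intermediate array takes O(nrow² · ncol) memory, so this suits small data sets only.

```python
import numpy as np

def distance_nd_vectorized(S):
    # diff[i, j, :] = S[i, :] - S[j, :], shape (nrow, nrow, ncol)
    diff = S[:, None, :] - S[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 10))  # 5 samples (rows), 10 features
A = distance_nd_vectorized(X)

# cross-check against an explicit loop over all pairs of rows
A_loop = np.array([[np.sqrt(np.sum((X[i] - X[j]) ** 2)) for j in range(5)]
                   for i in range(5)])
print(np.allclose(A, A_loop))  # True
```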

2. Matrix form of the Euclidean distance

The squared Euclidean distance between two column vectors (samples) is
$$d^2(\boldsymbol{x},\boldsymbol{y})=\sum_k(x_k-y_k)^2=(\boldsymbol{x}-\boldsymbol{y})^\top(\boldsymbol{x}-\boldsymbol{y})=\boldsymbol{x}^\top\boldsymbol{x}-2\boldsymbol{x}^\top\boldsymbol{y}+\boldsymbol{y}^\top\boldsymbol{y}$$
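
As a quick numerical check of this expansion, the sketch below compares both sides for two random vectors. The vectors and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
y = rng.standard_normal(4)

lhs = np.sum((x - y) ** 2)         # sum_k (x_k - y_k)^2
rhs = x @ x - 2 * (x @ y) + y @ y  # x^T x - 2 x^T y + y^T y
print(np.isclose(lhs, rhs))        # True
```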

For two matrices $X=[\boldsymbol{x}_1,\cdots,\boldsymbol{x}_M]$ and $Y=[\boldsymbol{y}_1,\cdots,\boldsymbol{y}_N]$ whose columns are samples, the squared Euclidean distances between their columns are
$$
\begin{aligned}
d^2(X,Y)&=
\left[\begin{array}{cccc}
\boldsymbol{x}_1^\top\boldsymbol{x}_1-2\boldsymbol{x}_1^\top\boldsymbol{y}_1+\boldsymbol{y}_1^\top\boldsymbol{y}_1 & \boldsymbol{x}_1^\top\boldsymbol{x}_1-2\boldsymbol{x}_1^\top\boldsymbol{y}_2+\boldsymbol{y}_2^\top\boldsymbol{y}_2 & \cdots & \boldsymbol{x}_1^\top\boldsymbol{x}_1-2\boldsymbol{x}_1^\top\boldsymbol{y}_N+\boldsymbol{y}_N^\top\boldsymbol{y}_N\\
\boldsymbol{x}_2^\top\boldsymbol{x}_2-2\boldsymbol{x}_2^\top\boldsymbol{y}_1+\boldsymbol{y}_1^\top\boldsymbol{y}_1 & \boldsymbol{x}_2^\top\boldsymbol{x}_2-2\boldsymbol{x}_2^\top\boldsymbol{y}_2+\boldsymbol{y}_2^\top\boldsymbol{y}_2 & \cdots & \boldsymbol{x}_2^\top\boldsymbol{x}_2-2\boldsymbol{x}_2^\top\boldsymbol{y}_N+\boldsymbol{y}_N^\top\boldsymbol{y}_N\\
\vdots & \vdots & \ddots & \vdots\\
\boldsymbol{x}_M^\top\boldsymbol{x}_M-2\boldsymbol{x}_M^\top\boldsymbol{y}_1+\boldsymbol{y}_1^\top\boldsymbol{y}_1 & \boldsymbol{x}_M^\top\boldsymbol{x}_M-2\boldsymbol{x}_M^\top\boldsymbol{y}_2+\boldsymbol{y}_2^\top\boldsymbol{y}_2 & \cdots & \boldsymbol{x}_M^\top\boldsymbol{x}_M-2\boldsymbol{x}_M^\top\boldsymbol{y}_N+\boldsymbol{y}_N^\top\boldsymbol{y}_N
\end{array}\right]\\[2ex]
&=\left[\begin{array}{cccc}
\boldsymbol{x}_1^\top\boldsymbol{x}_1 & \boldsymbol{x}_1^\top\boldsymbol{x}_1 & \cdots & \boldsymbol{x}_1^\top\boldsymbol{x}_1\\
\boldsymbol{x}_2^\top\boldsymbol{x}_2 & \boldsymbol{x}_2^\top\boldsymbol{x}_2 & \cdots & \boldsymbol{x}_2^\top\boldsymbol{x}_2\\
\vdots & \vdots & \ddots & \vdots\\
\boldsymbol{x}_M^\top\boldsymbol{x}_M & \boldsymbol{x}_M^\top\boldsymbol{x}_M & \cdots & \boldsymbol{x}_M^\top\boldsymbol{x}_M
\end{array}\right]
+\left[\begin{array}{cccc}
\boldsymbol{y}_1^\top\boldsymbol{y}_1 & \boldsymbol{y}_2^\top\boldsymbol{y}_2 & \cdots & \boldsymbol{y}_N^\top\boldsymbol{y}_N\\
\boldsymbol{y}_1^\top\boldsymbol{y}_1 & \boldsymbol{y}_2^\top\boldsymbol{y}_2 & \cdots & \boldsymbol{y}_N^\top\boldsymbol{y}_N\\
\vdots & \vdots & \ddots & \vdots\\
\boldsymbol{y}_1^\top\boldsymbol{y}_1 & \boldsymbol{y}_2^\top\boldsymbol{y}_2 & \cdots & \boldsymbol{y}_N^\top\boldsymbol{y}_N
\end{array}\right]\\[2ex]
&\qquad -2\left[\begin{array}{cccc}
\boldsymbol{x}_1^\top\boldsymbol{y}_1 & \boldsymbol{x}_1^\top\boldsymbol{y}_2 & \cdots & \boldsymbol{x}_1^\top\boldsymbol{y}_N\\
\boldsymbol{x}_2^\top\boldsymbol{y}_1 & \boldsymbol{x}_2^\top\boldsymbol{y}_2 & \cdots & \boldsymbol{x}_2^\top\boldsymbol{y}_N\\
\vdots & \vdots & \ddots & \vdots\\
\boldsymbol{x}_M^\top\boldsymbol{y}_1 & \boldsymbol{x}_M^\top\boldsymbol{y}_2 & \cdots & \boldsymbol{x}_M^\top\boldsymbol{y}_N
\end{array}\right]\\[2ex]
&=\left[\begin{array}{c}
\boldsymbol{x}_1^\top\boldsymbol{x}_1\\ \boldsymbol{x}_2^\top\boldsymbol{x}_2\\ \vdots\\ \boldsymbol{x}_M^\top\boldsymbol{x}_M
\end{array}\right]\cdot[1,1,\cdots,1]_N
+\left[\begin{array}{c}
1\\ 1\\ \vdots\\ 1
\end{array}\right]_M\cdot[\boldsymbol{y}_1^\top\boldsymbol{y}_1,\boldsymbol{y}_2^\top\boldsymbol{y}_2,\cdots,\boldsymbol{y}_N^\top\boldsymbol{y}_N]
-2\left[\begin{array}{c}
\boldsymbol{x}_1^\top\\ \boldsymbol{x}_2^\top\\ \vdots\\ \boldsymbol{x}_M^\top
\end{array}\right]\cdot[\boldsymbol{y}_1,\boldsymbol{y}_2,\cdots,\boldsymbol{y}_N]
\end{aligned}
$$
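
The last line of this derivation maps directly onto matrix products. The following is a minimal sketch, assuming columns are samples as in the formula; `pairwise_sq_dists` is an illustrative name, not a function from the original article.

```python
import numpy as np

def pairwise_sq_dists(X, Y):
    """X is d x M, Y is d x N, columns are samples; returns the M x N squared distances."""
    M, N = X.shape[1], Y.shape[1]
    xx = np.sum(X * X, axis=0).reshape(-1, 1)  # M x 1 column of x_i^T x_i
    yy = np.sum(Y * Y, axis=0).reshape(1, -1)  # 1 x N row of y_j^T y_j
    D2 = xx @ np.ones((1, N)) + np.ones((M, 1)) @ yy - 2 * X.T @ Y
    return np.maximum(D2, 0)  # clip tiny negatives caused by round-off

# columns of X are the points (0,0) and (3,4);
# columns of Y are the points (0,0), (3,0) and (6,8)
X = np.array([[0.0, 3.0],
              [0.0, 4.0]])
Y = np.array([[0.0, 3.0, 6.0],
              [0.0, 0.0, 8.0]])
D2 = pairwise_sq_dists(X, Y)
print(D2)
# [[  0.   9. 100.]
#  [ 25.  16.  25.]]
```

Taking `np.sqrt(D2)` gives the distances themselves. The clipping step matters because the expanded form can produce slightly negative values for nearly identical points.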

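When the pairwise distances are taken within a single data matrix $X$ (each column is a sample point), the expression simplifies to

$$d^2(X,X)=\text{diag}(X^\top X)\cdot\boldsymbol{1}^\top+\boldsymbol{1}\cdot\text{diag}(X^\top X)^\top-2X^\top X$$

where $\text{diag}(X^\top X)$ is the column vector of diagonal entries of $X^\top X$ and $\boldsymbol{1}$ is the all-ones column vector.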

2.1 Code implementation

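import numpy as np

# columns are samples: D dimensions, N points
X = np.array([[1,2,3,4,5,6,7,8,9],[1,1,1,1,1,1,1,1,1]])
D, N = X.shape
print('LLE running on {} points in {} dimensions\n'.format(N,D))

G = X.T@X                                       # Gram matrix X^T X
H = np.diag(G).reshape(-1,1)@np.ones((1,N))     # diag(G) times a row of ones
dist = H+H.T-2*G                                # squared pairwise distances
dist = np.sqrt(np.maximum(dist, 0))             # clip round-off negatives before the square root
print(dist)
print('\n')
index = np.argsort(dist,axis=0)                 # sort each column by increasing distance
neighborhood = index[1:5,:]                     # rows 1 to 4: the four nearest neighbours, skipping the point itself
print(neighborhood)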
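
The neighbour selection at the end of this listing can be wrapped in a small helper. This is a sketch under the assumption that all points are distinct, so that each point's own zero distance sorts first in its column. `k_nearest` is an illustrative name, not a function from the original article.

```python
import numpy as np

def k_nearest(dist, k):
    """dist is an N x N distance matrix; returns a k x N array whose
    column j holds the indices of the k nearest neighbours of point j."""
    order = np.argsort(dist, axis=0, kind="stable")
    return order[1:k + 1, :]  # drop row 0, which is the point itself

pts = np.array([0.0, 1.0, 3.0, 7.0])
dist = np.abs(pts[:, None] - pts[None, :])
print(k_nearest(dist, 2))
# [[1 0 1 2]
#  [2 2 0 1]]
```

For example, the point at 3.0 (column 2) has the point at 1.0 as its nearest neighbour and the point at 0.0 as its second nearest.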

2.2 Using a machine learning library

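import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

X = np.array([[1,2,3,4,5,6,7,8,9],[1,1,1,1,1,1,1,1,1]])
# pairwise_distances expects one sample per row, so pass X.T (columns of X are samples);
# paired_distances would only compare row i of one argument with row i of the other
dist = pairwise_distances(X.T, X.T)
print(dist)
print('\n')
index = np.argsort(dist,axis=0)   # sort each column by increasing distance
print(index)
print('\n')
neighborhood = index[1:5,:]       # the four nearest neighbours of each point
print(neighborhood)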
