The distances between samples in the original space are preserved in the low-dimensional space, as the figure above illustrates; this is Multidimensional Scaling (MDS), a classical dimensionality-reduction method.

Idea: for the original sample matrix $A_{n \times p}$, require that the Euclidean distance between any two samples in the low-dimensional space equal their distance in the original space.

Input: the distance matrix $D_{n \times n}$, where $D_{ij}$ is the distance from sample $x_i$ to sample $x_j$; the target dimension $k$.

Output: the representation $Z_{n \times k}$ of the samples in the low-dimensional space.

Procedure: let $B = Z Z^{T}$ be the inner-product matrix of the reduced samples, with $b_{ij} = z_{(i)}^{T} z_{(j)}$. Then
$$D_{ij}^2 = \|z_{(i)} - z_{(j)}\|^2 = \|z_{(i)}\|^2 + \|z_{(j)}\|^2 - 2 z_{(i)}^{T} z_{(j)} = b_{ii} + b_{jj} - 2 b_{ij}$$
Assume the reduced samples $Z$ are centered, i.e. $\sum_{i=1}^{n} z_{(i)} = 0$. Then every row sum and column sum of $B$ is zero, i.e. $\sum_{i=1}^{n} b_{ij} = \sum_{j=1}^{n} b_{ij} = 0$. It follows that
$$\sum_{i=1}^{n} D_{ij}^2 = \operatorname{tr}(B) + n\,b_{jj}, \qquad \sum_{j=1}^{n} D_{ij}^2 = \operatorname{tr}(B) + n\,b_{ii}, \qquad \sum_{i=1}^{n} \sum_{j=1}^{n} D_{ij}^2 = 2n \operatorname{tr}(B)$$
which gives:

$$b_{ij} = \frac{1}{2n} \sum_{i=1}^{n} D_{ij}^2 + \frac{1}{2n} \sum_{j=1}^{n} D_{ij}^2 - \frac{1}{2n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} D_{ij}^2 - \frac{1}{2} D_{ij}^2$$
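As a sanity check on this recovery formula, the sketch below builds a small centered sample set $Z$ (made-up data, for illustration only), computes $D_{ij}^2$ directly, and reconstructs $B = ZZ^{T}$ from the formula:

```python
import numpy as np

# Small centered sample set: 5 made-up points in 2 dimensions
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 2))
Z -= Z.mean(axis=0)                 # enforce the centering assumption
B = Z @ Z.T                         # true inner-product matrix

# Squared Euclidean distance matrix D_ij^2
D2 = np.sum((Z[:, None] - Z[None, :]) ** 2, axis=-1)

# b_ij = (1/2n) sum_i D_ij^2 + (1/2n) sum_j D_ij^2
#        - (1/2n^2) sum_ij D_ij^2 - (1/2) D_ij^2
row = D2.mean(axis=1, keepdims=True)   # (1/n) sum over j
col = D2.mean(axis=0, keepdims=True)   # (1/n) sum over i
tot = D2.mean()                        # (1/n^2) sum over i, j
B_rec = 0.5 * (row + col - tot - D2)

print(np.allclose(B, B_rec))  # True
```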
So by requiring the distance matrix $D_{n \times n}$ to be preserved through the reduction, we obtain the inner-product matrix $B_{n \times n}$. Eigendecompose $B$: $B = V \Lambda V^{T} = V \Lambda^{\frac{1}{2}} (V \Lambda^{\frac{1}{2}})^{T}$, where $\Lambda$ is the diagonal matrix of eigenvalues sorted from largest to smallest (the zero eigenvalues may be dropped) and $V$ is the matrix of corresponding eigenvectors. As an approximation, finally take $Z = \tilde{V} \tilde{\Lambda}^{\frac{1}{2}}$, where $\tilde{\Lambda}$ is the diagonal matrix of the $k$ largest eigenvalues and $\tilde{V}$ the matrix of their eigenvectors.
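The whole procedure can also be written compactly with the centering matrix $J = I - \frac{1}{n}\mathbf{1}\mathbf{1}^{T}$, using the identity $B = -\frac{1}{2} J D^{(2)} J$ where $D^{(2)}$ holds the squared distances. A minimal matrix-form sketch (not the loop-based listings below; the test data are made up):

```python
import numpy as np

def classical_mds(D, k):
    """Embed an n x n Euclidean distance matrix D into k dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered inner products
    vals, vecs = np.linalg.eigh(B)        # eigh: real, ascending eigenvalues
    idx = np.argsort(vals)[::-1][:k]      # keep the k largest
    root = np.sqrt(np.clip(vals[idx], 0, None))  # clip tiny negatives
    return vecs[:, idx] * root            # Z = V~ Lambda~^(1/2)

# Check on made-up 2-D points: with k = 2 the distances are preserved exactly
X = np.array([[1.0, 2.0], [3.0, 1.0], [0.0, 0.0], [2.0, 4.0]])
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Z = classical_mds(D, 2)
D_new = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
print(np.allclose(D, D_new))  # True
```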
A thought: this is in effect also a linear transformation of the original matrix, $Z = A \times W$ with $W$ a $p \times k$ transformation matrix, which from this angle resembles PCA. (Needs further study.)
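The PCA connection can be checked numerically: for Euclidean distances, $B = A_c A_c^{T}$ with $A_c$ the row-centered data, so the MDS embedding $\tilde{V}\tilde{\Lambda}^{\frac{1}{2}}$ coincides with the PCA score matrix $A_c W$ up to an orthogonal transform. A sketch on a made-up matrix $A$; the agreement is tested via pairwise distances, which are invariant to that transform:

```python
import numpy as np

# Made-up data matrix A (n samples x p features)
A = np.array([[1.0, 2.0, 0.5], [3.0, 1.0, 2.0],
              [0.0, 0.0, 1.0], [2.0, 4.0, 3.0]])
n, k = A.shape[0], 2

# PCA route: project centered data onto the top-k right singular vectors
Ac = A - A.mean(axis=0)
U, s, Vt = np.linalg.svd(Ac, full_matrices=False)
Z_pca = Ac @ Vt[:k].T

# MDS route: double-center the squared distances, then eigendecompose
D = np.linalg.norm(A[:, None] - A[None, :], axis=-1)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
vals, vecs = np.linalg.eigh(B)
idx = np.argsort(vals)[::-1][:k]
Z_mds = vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0, None))

# The embeddings agree up to an orthogonal transform,
# so their pairwise distance matrices are identical
d_pca = np.linalg.norm(Z_pca[:, None] - Z_pca[None], axis=-1)
d_mds = np.linalg.norm(Z_mds[:, None] - Z_mds[None], axis=-1)
print(np.allclose(d_pca, d_mds))  # True
```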
(1) Drop the zero eigenvalues from the diagonal matrix, and remember to verify that the two distance matrices (original and reconstructed) agree.
```python
# -*- coding: utf-8 -*-
import numpy as np
from scipy.spatial.distance import cdist


class MDS:
    def _cal_dist(self, matrix, row, col):
        """Sums of squared entries of the distance matrix; '*' sums over that index."""
        if row == '*' and col == '*':
            return np.sum(matrix ** 2)
        elif row == '*':
            return np.sum(matrix[:, col] ** 2)
        elif col == '*':
            return np.sum(matrix[row, :] ** 2)
        return matrix[row, col] ** 2

    def fit(self, dataset):
        length = dataset.shape[0]
        # Distance matrix in the original space
        original_matrix = cdist(dataset, dataset, 'euclidean')
        # Build B entrywise from dist_ij^2, dist_i., dist_.j and dist_..
        dist_matrix = np.zeros(original_matrix.shape)
        rows, cols = dist_matrix.shape
        for row in range(rows):
            for col in range(cols):
                distij = self._cal_dist(original_matrix, row, col)
                dist_i = self._cal_dist(original_matrix, row, '*') / length
                dist_j = self._cal_dist(original_matrix, '*', col) / length
                dist_all = self._cal_dist(original_matrix, '*', '*') / length ** 2
                dist_matrix[row, col] = -(distij - dist_i - dist_j + dist_all) / 2
        # B is symmetric, so eigh gives real eigenvalues in ascending order
        feature_values, feature_vectors = np.linalg.eigh(dist_matrix)
        order = np.argsort(feature_values)[::-1]   # sort descending
        feature_values = feature_values[order]
        feature_vectors = feature_vectors[:, order]
        # Keep only strictly positive eigenvalues; tiny negative ones would
        # give NaN under the square root
        keep = feature_values > 1e-8
        return feature_vectors[:, keep] @ np.diag(np.sqrt(feature_values[keep]))


mds = MDS()
a = mds.fit(np.array([
    [1, 1, 1],
    [2, 2, 2],
    [3, 3, 3]
]))
print(a)
```
(2) A second, script-style implementation.
```python
import numpy as np

# run this to get a test matrix
# A = np.random.randint(1, 100, (5, 20))
# np.save('mat.npy', A)

A = np.array([
    [1, 1, 1],
    [2, 2, 2],
    [3, 3, 3]
])
n, m = A.shape

# Squared-distance matrix
Dist = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        Dist[i][j] = sum((ix - jx) ** 2 for ix, jx in zip(A[i], A[j]))

# Row means, column means and grand mean of the squared distances.
# These must be float arrays: with np.array([0]*n) (int dtype) the means
# get truncated, which was the source of the "sqrt(6) == 2" symptom.
disti2 = np.zeros(n)
distj2 = np.zeros(n)
for x in range(n):
    disti2[x] = np.mean([Dist[x][j] for j in range(n)])
    distj2[x] = np.mean([Dist[i][x] for i in range(n)])
distij2 = np.mean([Dist[i][j] for i in range(n) for j in range(n)])

# Inner-product matrix B
B = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        B[i][j] = -0.5 * (Dist[i][j] - disti2[i] - distj2[j] + distij2)

# B is symmetric, so eigh gives real eigenvalues and orthonormal eigenvectors
w, v = np.linalg.eigh(B)
v = v.transpose()          # now v[i] is the eigenvector for w[i]

U = [{'eVal': w[i], 'eVec': v[i]} for i in range(n)]
U.sort(key=lambda obj: obj.get('eVal'), reverse=True)  # descending eigenvalues

# k may now be up to n: clipping tiny negative eigenvalues at zero avoids
# the NaN that a negative float raised to the power 0.5 would produce
k = 2
w = np.zeros(k)            # float dtype
v = np.zeros((k, n))
for i in range(k):
    w[i] = max(U[i].get('eVal'), 0.0) ** 0.5
    v[i] = U[i].get('eVec')

ans = np.dot(v.transpose(), np.diag(w))  # embedded coordinates, n x k

# Verify: squared distances before and after should match
ans_dist = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        ans_dist[i][j] = sum((ix - jx) ** 2 for ix, jx in zip(ans[i], ans[j]))

# np.set_printoptions(suppress=True)  # disable scientific notation
print("Original dist[][]:")
print(Dist)
print("MDS dist[][]:")
print(ans_dist)
```
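Why $k = 3$ can produce NaNs here: the three toy points are collinear, so $B$ has rank 1 and its spectrum is $(6, 0, 0)$. The zero eigenvalues can come out of the solver as tiny negative floats, and a negative NumPy float raised to the power $0.5$ is `nan`; clipping at zero before the square root avoids this. A quick check of the spectrum:

```python
import numpy as np

A = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]], dtype=float)
n = A.shape[0]

# Squared distances and double-centered inner-product matrix
D2 = np.sum((A[:, None] - A[None, :]) ** 2, axis=-1)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J

evals = np.sort(np.linalg.eigvalsh(B))[::-1]
print(evals)                             # ~ [6, 0, 0]: rank-1 B
print(np.sqrt(np.clip(evals, 0, None)))  # safe square roots, no NaN
```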