Sliced Inverse Regression (SIR) is a classical sufficient dimension reduction method proposed by Li (1991)[1]. Instead of regressing the one-dimensional response $y$ on the multi-dimensional predictor $x$, SIR studies the inverse regression of $x$ on $y$, so that the information in $y$ is fully exploited while reducing the dimension of the explanatory variables $x$.
General model: $y=f(\beta_{1}^{T}x,\dots,\beta_{K}^{T}x,\varepsilon)$
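For instance, with $K=1$ the single-index model $y=(\beta_{1}^{T}x)^{3}+\varepsilon$ is a special case: $y$ depends on the $p$-dimensional predictor $x$ only through the one-dimensional projection $\beta_{1}^{T}x$, so the effective dimension of the regression is 1 rather than $p$.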
Basic theorem and algorithm:
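Under a linearity condition on the predictors (for every $b$, $E(b^{T}x\mid\beta_{1}^{T}x,\dots,\beta_{K}^{T}x)$ is linear in the projections $\beta_{k}^{T}x$), Li (1991) shows that the centered inverse regression curve $E(x\mid y)-E(x)$ lies in the linear subspace spanned by $\Sigma_{xx}\beta_{k},\ k=1,\dots,K$. This is what allows the directions $\beta_{k}$ to be recovered from the slice means of $x$ given $y$, as in the following steps.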
SIR operates on the data $(Y_{i},X_{i}),\ i=1,\dots,n$;
Standardize $X_{i}$ by the affine transformation $\widetilde{X}_{i}=\widehat{\Sigma}_{xx}^{-1/2}(X_{i}-\overline{X}),\ i=1,\dots,n$, where $\widehat{\Sigma}_{xx}$ and $\overline{X}$ are the sample covariance matrix and the sample mean of $X$, respectively;
Sort the values $Y_{i}$ of $Y$ in ascending order and cut them into $H$ slices, denoted $I_{1},I_{2},\dots,I_{H}$, where the proportion of observations falling into slice $I_{h}$ is $P_{h}$:
$$P_{h}=\frac{1}{n}\sum_{i=1}^{n}\delta_{h}(Y_{i})$$
$$\delta_{h}(Y_{i})=\begin{cases}1, & \text{if } Y_{i}\in I_{h}\\ 0, & \text{if } Y_{i}\notin I_{h}\end{cases}$$
For each slice, compute the sample mean $m_{h}$ of the standardized predictors within it, $m_{h}=\frac{1}{nP_{h}}\sum_{Y_{i}\in I_{h}}\widetilde{X}_{i}$;
Perform a (weighted) principal component analysis of the slice means $m_{h}$: form the weighted covariance matrix $V=\sum_{h=1}^{H}P_{h}\,m_{h}m_{h}^{T}$ and find the eigenvalues and eigenvectors of $V$.
Let $\eta_{k}\ (k=1,2,\dots,K)$ be the eigenvectors corresponding to the $K$ largest eigenvalues; the SIR directions are then $\beta_{k}=\eta_{k}\Sigma_{xx}^{-\frac{1}{2}},\ k=1,2,\dots,K$.
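The procedure above can be implemented directly with NumPy and SciPy. The class below standardizes $X$, slices on the sorted response, computes the weighted covariance matrix of the slice means, and extracts its leading eigenvectors, using fractional_matrix_power from scipy.linalg and eigs from scipy.sparse.linalg.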
import numpy as np
import pandas as pd
from scipy.linalg import fractional_matrix_power
from scipy.sparse.linalg import eigs


class SIR:
    def __init__(self, H, K, estdim=0, Cn=1):
        self.H = H            # number of slices
        self.K = K            # number of directions to retain
        self.estdim = estdim  # 0: use the given K; otherwise estimate K from the data
        self.Cn = Cn          # penalty constant used when estimating K
    def fit(self, X, Y):
        self.X = X
        self.Y = Y
        self.mean_x = np.mean(X, axis=0)        # sample mean
        self.sigma_x = np.cov(X, rowvar=False)  # sample covariance matrix
        # standardized data matrix Z = (X - mean) @ Sigma^{-1/2}
        self.Z = np.matmul(X - np.tile(self.mean_x, (X.shape[0], 1)),
                           fractional_matrix_power(self.sigma_x, -0.5))
        n, p = self.Z.shape
        if self.Y.ndim == 1:                    # make the response a column vector
            self.Y = self.Y.reshape(-1, 1)
        ny, py = self.Y.shape
        W = np.ones((n, 1)) / n                 # observation weights, 1/n each
        nw, pw = W.shape
        # input checks
        if n != ny:
            raise ValueError('X and Y must have the same number of samples')
        elif p == 1:
            raise ValueError('X must have at least 2 dimensions')
        elif py != 1:
            raise ValueError('Y must have only 1 dimension')
        # cut Y into H slices; c holds the number of samples in each slice
        c = np.ones((1, self.H)) * (n // self.H) + np.hstack(
            [np.ones((1, n % self.H)),
             np.zeros((1, self.H - n % self.H))])
        cumc = np.cumsum(c)                     # cumulative slice boundaries
        # sort the observations by the value of Y, in ascending order
        temp = np.hstack((self.Z, self.Y, W))
        temp = temp[np.argsort(temp[:, p])]
        # recover the sorted z, y, w
        z, y, w = temp[:, :p], temp[:, p:p + 1], temp[:, p + 1]
        muh = np.zeros((self.H, p))             # per-slice sums, later slice means
        wh = np.zeros((self.H, 1))              # weight (sample proportion P_h) of each slice
        k = 0                                   # current slice index
        for i in range(n):
            if i >= cumc[k]:                    # crossed a slice boundary: move to the next slice
                k += 1
            muh[k, :] = muh[k, :] + z[i, :]     # accumulate the predictors within the slice
            wh[k] = wh[k] + w[i]                # accumulate the slice weight P_h
        # slice means m_h: divide the within-slice sums by n * P_h
        muh = muh / (np.tile(wh, (1, p)) * n)
        # weighted principal component analysis
        self.M = np.zeros((p, p))               # weighted covariance matrix of the slice means
        for i in range(self.H):
            self.M = self.M + wh[i] * muh[i, :].reshape(-1, 1) * muh[i, :]
        if self.estdim == 0:                    # K is given: take the K leading eigenpairs of M
            self.D, self.V = eigs(A=self.M, k=self.K, which='LM')
        else:
            # estimate K from the eigenvalues of M with a penalized criterion
            lam = np.sort(np.abs(np.linalg.eigvals(self.M)))[::-1]
            L = np.log(lam + 1) - lam
            if self.Cn == 1:
                Cn = n ** (1 / 2)
            elif self.Cn == 2:
                Cn = n ** (1 / 3)
            else:
                Cn = 0.5 * n ** 0.25
            G = np.zeros(p)
            for k in range(1, p + 1):
                G[k - 1] = n / 2 * np.sum(L[:k]) / np.sum(L) - Cn * k * (k + 1) / p
            self.K = int(np.argmax(G)) + 1      # estimated number of directions
            self.D, self.V = eigs(A=self.M, k=self.K, which='LM')
        return self.V, self.K, self.M, self.D
    def transform(self):
        # map the eigenvectors back to the original scale: beta_k = Sigma^{-1/2} eta_k
        hatbeta = np.matmul(fractional_matrix_power(self.sigma_x, -0.5), self.V)
        return hatbeta
# load the predictors and the response, then fit SIR with 6 slices and 4 directions
X = pd.read_csv('X.csv', header=None).values
Y = pd.read_csv('Y.csv', header=None).values
model = SIR(H=6, K=4, estdim=0, Cn=1)
V, K, M, D = model.fit(X, Y)
hatbeta = model.transform()
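As a quick sanity check, the fitted directions can be compared against a known model on synthetic data. The sketch below is illustrative and not part of the original script: it assumes a single-index model $y=(\beta^{T}x)^{3}+\varepsilon$ with a hypothetical true direction beta_true, and checks that the first SIR direction is close to it up to sign and scale.

# minimal synthetic check (illustrative; beta_true, X_sim, Y_sim are made-up names)
rng = np.random.default_rng(0)
n_sim, p_sim = 1000, 5
beta_true = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
X_sim = rng.normal(size=(n_sim, p_sim))                  # x ~ N(0, I_p)
Y_sim = (X_sim @ beta_true) ** 3 + 0.1 * rng.normal(size=n_sim)

sim_model = SIR(H=10, K=1, estdim=0, Cn=1)
sim_model.fit(X_sim, Y_sim)
beta_hat = sim_model.transform()[:, 0].real              # eigs may return complex entries

# cosine similarity between the estimated and the true direction; should be close to 1
cos = beta_hat @ beta_true / (np.linalg.norm(beta_hat) * np.linalg.norm(beta_true))
print(abs(cos))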
[1] Li, Ker-Chau. “Sliced Inverse Regression for Dimension Reduction: Rejoinder.” Journal of the American Statistical Association, vol. 86, no. 414, 1991, pp. 337–42. JSTOR, https://doi.org/10.2307/2290568. Accessed 17 Jul. 2022.