The Johnson-Lindenstrauss lemma states that any high-dimensional dataset
can be randomly projected into a lower-dimensional Euclidean space
while controlling the distortion in the pairwise distances
(the distortion of the transform is measured on the pairwise distances).
The distortion introduced by a random projection p is asserted
by the fact that p defines an eps-embedding with good probability:
(1 - eps) ||u - v|| ** 2 <= ||p(u) - p(v)|| ** 2 <= (1 + eps) ||u - v|| ** 2
where u, v are any rows taken from a dataset of shape [n_samples, n_features]
and p is a projection by a random Gaussian N(0, 1) matrix of
shape [n_components, n_features] (or a sparse Achlioptas matrix).
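As a quick sanity check (my own sketch, not part of the sklearn example reproduced
below), we can draw a Gaussian random projection on synthetic data and verify that
the pairwise squared-distance ratios stay inside [1 - eps, 1 + eps]:
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.metrics.pairwise import euclidean_distances
rng = np.random.RandomState(42)
X = rng.randn(100, 3000)                        # [n_samples, n_features], synthetic data
eps = 0.2
# 1200 components is above johnson_lindenstrauss_min_dim(100, eps=0.2) ~ 1063,
# so the eps-embedding should hold with good probability.
p = GaussianRandomProjection(n_components=1200, random_state=42)
X_p = p.fit_transform(X)
d_orig = euclidean_distances(X, squared=True).ravel()
d_proj = euclidean_distances(X_p, squared=True).ravel()
ratio = d_proj[d_orig != 0] / d_orig[d_orig != 0]
print(ratio.min(), ratio.max())                 # expected to lie within [1 - eps, 1 + eps]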
Wikipedia's interpretation:
The lemma states that a small set of points in a high-dimensional
space can be embedded into a space of much lower dimension in such
a way that distances between the points are nearly preserved.
The map used for the embedding is at least Lipschitz, and can even
be taken to be an orthogonal projection.
(That is, for a small set of high-dimensional points there is a
dimension-reducing map, stronger than merely Lipschitz, that nearly
preserves distances, and the map can be taken to be orthogonal.)
The lemma has uses in compressed sensing, manifold learning,
dimensionality reduction, and graph embedding. Much of the data stored
and manipulated on computers, including text and images, can be represented
as points in a high-dimensional space (see the vector space model for the
case of text). However, the essential algorithms for working with
such data tend to become bogged down very quickly
as dimension increases. It is therefore desirable to reduce
the dimensionality of the data in a way that preserves its
relevant structure. The Johnson-Lindenstrauss lemma is a classic
result in this vein.
Lemma
Given 0 < a < 1, a set X of m points in RN, and a number n >
8 ln(m) / a ** 2, there is a linear map f: RN -> Rn such that
(1 - a) ||u - v|| ** 2 <= ||f(u) - f(v)|| ** 2 <= (1 + a) ||u - v|| ** 2
for all u, v in X.
Note that f is a linear map; that is, such a map can be found for any
given finite subset of points from the sample.
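For concreteness, a quick numeric instance of the bound above (my own
illustration, not from the original notes):
import numpy as np
m, a = 500, 0.5
print(8 * np.log(m) / a ** 2)   # ~198.9, so about 199 dimensions suffice for 500 points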
One proof of the lemma takes f to be a suitable multiple of the
orthogonal projection onto a random n-dimensional subspace of RN
and exploits the phenomenon of concentration of measure.
(That is, one projects onto a randomly generated set of orthogonal
vectors and rescales by a suitable constant; a sketch follows.)
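A minimal sketch of that construction (my own illustration, assuming the usual
recipe: take the Q factor of a random Gaussian matrix as an orthonormal basis of
a random subspace, project, and rescale by sqrt(N / n)):
import numpy as np
rng = np.random.RandomState(0)
N, n, m = 2000, 300, 20                 # ambient dim, target dim, number of points
X = rng.randn(m, N)
Q, _ = np.linalg.qr(rng.randn(N, n))    # orthonormal basis of a random n-dim subspace
Y = (X @ Q) * np.sqrt(N / n)            # orthogonal projection, then rescaling
d_orig = np.sum((X[0] - X[1]) ** 2)
d_proj = np.sum((Y[0] - Y[1]) ** 2)
print(d_proj / d_orig)                  # close to 1 with high probability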
Obviously an orthogonal projection will, in general, reduce the average
distance between points, but the lemma can be viewed as dealing with
relative distances, which do not change under scaling. In a nutshell,
you roll the dice and obtain a random projection, which will
reduce the average distance, and then you scale up the distances so that
the average distance returns to its previous value. If you keep rolling
the dice, you will, in polynomial random time, find a projection for
which the (scaled) distances satisfy the lemma.
(This is an intuitive description of the process; an orthogonal
projection can be seen as a special case of such a random projection.)
Back to sklearn:
The minimum number of components that guarantees the eps-embedding is
given by:
n_components >= 4 log(n_samples) / (eps ** 2 / 2 - eps ** 3 / 3)
The resulting proportionality (more samples or a smaller distortion eps
requires more components) is intuitive.
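A quick check of this bound against the helper implemented in sklearn (my own
sketch; the formula and johnson_lindenstrauss_min_dim should agree up to rounding):
import numpy as np
from sklearn.random_projection import johnson_lindenstrauss_min_dim
n_samples, eps = 500, 0.5
by_formula = 4 * np.log(n_samples) / (eps ** 2 / 2 - eps ** 3 / 3)
print(int(by_formula))                                      # about 298
print(johnson_lindenstrauss_min_dim(n_samples, eps=eps))    # same bound from sklearn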
Empirical validation
We validate the above bounds on the digits dataset or on the 20 newsgroups
text documents (TF-IDF word frequencies) dataset:
for the digits dataset, the 8x8 gray-level pixel data
for 500 handwritten digit pictures are randomly projected
to spaces of various larger numbers of dimensions n_components.
......
sklearn.random_projection::johnson_lindenstrauss_min_dim(n_samples, eps):
finds a 'safe' number of components to randomly project to
plt.loglog
applies a logarithmic scale to both axes, which removes large
order-of-magnitude differences from the display.
plt.semilogy
applies a logarithmic scale to the y axis only.
ndarray.ravel():
returns the array flattened into a 1-D vector (row-major order).
sklearn.random_projection::SparseRandomProjection
a class that performs random projection with a sparse random matrix;
the parameter n_components specifies the dimensionality after projection,
and calling fit_transform transforms the data.
The code below is the sklearn example that validates the theorem above, except
that here the dimensionality is increased (from 64 to [300, 1000, 10000]
dimensions), i.e. the validation is taken from the inverse-problem point of view.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.random_projection import johnson_lindenstrauss_min_dim
eps_range = np.linspace(0.1, 0.99, 5)
colors = plt.cm.Blues(np.linspace(0.3, 1.0, len(eps_range)))
n_samples_range = np.logspace(1, 9, 9)
plt.figure()
for eps, color in zip(eps_range, colors):
    min_n_components = johnson_lindenstrauss_min_dim(n_samples_range, eps = eps)
    plt.loglog(n_samples_range, min_n_components, color = color)
plt.legend(["eps = %0.1f" % eps for eps in eps_range], loc = "lower right")
plt.xlabel("Number of observations to eps-embed")
plt.ylabel("Minimum number of dimensions")
plt.title("Johnson-Lindenstrauss bounds:\nn_samples vs n_components")
n_samples_range = np.logspace(2, 6, 5)
colors = plt.cm.Blues(np.linspace(0.3, 1.0, len(n_samples_range)))
plt.figure()
for n_samples, color in zip(n_samples_range, colors):
    min_n_components = johnson_lindenstrauss_min_dim(n_samples, eps = eps_range)
    plt.semilogy(eps_range, min_n_components, color = color)
plt.legend(["n_samples = %d" % n for n in n_samples_range], loc = "upper right")
plt.xlabel("Distortion eps")
plt.ylabel("Minimum number of dimensions")
plt.title("Johnson-Lindenstrauss bounds:\nn_components vs eps")
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.datasets import load_digits
#data = fetch_20newsgroups_vectorized().data[:500]
data = load_digits().data[:500]
n_samples, n_features = data.shape
print()
print("data shape :")
print(data.shape)
print()
print("Embedding %d samples with dim %d using various random projections" % (n_samples, n_features))
from sklearn.metrics.pairwise import euclidean_distances
n_components_range = np.array([300, 1000, 10000])
dists = euclidean_distances(data, squared = True).ravel()
nonzero = dists != 0
dists = dists[nonzero]
from time import time
from sklearn.random_projection import SparseRandomProjection
for n_components in n_components_range:
    t0 = time()
    rp = SparseRandomProjection(n_components = n_components)
    projected_data = rp.fit_transform(data)
    print()
    print("projected_data shape :")
    print(projected_data.shape)
    print()
    print("Projected %d samples from %d to %d in %0.3fs" % (n_samples, n_features, n_components, time() - t0))
    if hasattr(rp, 'components_'):
        n_bytes = rp.components_.data.nbytes
        n_bytes += rp.components_.indices.nbytes
        print("Random matrix with size: %0.3fMB" % (n_bytes / 1e6))
    projected_dists = euclidean_distances(projected_data, squared = True).ravel()[nonzero]
    plt.figure()
    plt.hexbin(dists, projected_dists, gridsize = 100, cmap = plt.cm.PuBu)
    plt.xlabel("Pairwise squared distance in original space")
    plt.ylabel("Pairwise squared distance in projected space")
    plt.title("Pairwise distance distribution for n_components = %d" % n_components)
    cb = plt.colorbar()
    cb.set_label("Sample pairs counts")
    rates = projected_dists / dists
    print("Mean distance rate: %0.2f (%0.2f)" % (np.mean(rates), np.std(rates)))
    plt.figure()
    plt.hist(rates, bins = 50, density = True, range = (0., 2.))
    plt.xlabel("Squared distance rate: projected / original")
    plt.ylabel("Distribution of samples pairs")
    plt.title("Histogram of pairwise distance rates for n_components = %d" % n_components)
#plt.show()
Random projection is meant to solve the dimensionality-reduction problem for
high-dimensional data; it does not reduce collinearity. However, when we want
to decide whether the data contain strong factors, we can project the data down
first and then run the analysis on the reduced data. Because the random
projection (approximately) preserves collinearity, it speeds up computations on
large matrices; for example, it can accelerate the calculations related to a
large-dimensional generalized factor model as follows:
import numpy as np
from numpy.linalg import svd   # assumed here; svd may already be imported elsewhere in these notes
from sklearn.random_projection import SparseRandomProjection
from sklearn.random_projection import johnson_lindenstrauss_min_dim
# generate_X is the user's own data-generating function (defined outside this snippet); kept as-is.
X = generate_X(1000, 5, 1000, 10, 10, 5, 5)[0]
# we use a random projection, which may give the result faster
min_n_components = johnson_lindenstrauss_min_dim(1000, eps = 0.5)
print("min_n_components :")
print(min_n_components)
n_components = min_n_components
rp = SparseRandomProjection(n_components = n_components)
projected_X = rp.fit_transform(X)
eigen_v = svd(projected_X)[1]               # singular values of the projected data
eigen_v1 = eigen_v / (1 + eigen_v)
print(eigen_v1[:-1] / eigen_v1[1:])         # ratios of consecutive transformed singular values
print(np.argmax((eigen_v1[:-1] / eigen_v1[1:])[: 20]))   # position of the largest ratio among the first 20