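Machine Learning Basics: Wikipedia Translation of Johnson-Lindenstrauss Dimensionality Reduction, Applied to Generalized Factor Models, with a Simple sklearn Example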

The Johnson-Lindenstrauss lemma states that any high-dimensional dataset
can be randomly projected into a lower-dimensional Euclidean space
while controlling the distortion in the pairwise distances.
(In other words, the distortion of the transform is bounded in terms of the pairwise distances.)

The distortion introduced by a random projection p is bounded
by the fact that p defines an eps-embedding
with good probability:

 (1 - eps) ||u - v|| ** 2 < ||p(u) - p(v)|| ** 2 < (1 + eps) ||u - v|| ** 2

where u and v are any rows taken from a dataset of shape [n_samples, n_features]
and p is a projection by a random Gaussian N(0, 1) matrix of
shape [n_components, n_features] (or a sparse Achlioptas matrix).
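
The eps-embedding condition can be checked numerically: for every pair of rows, the ratio of projected to original squared distance should fall between 1 - eps and 1 + eps. The following is a minimal sketch, assuming a dense Gaussian matrix scaled by 1 / sqrt(n_components) and synthetic data. The sizes, the seed, and the helper name squared_distance_ratios are illustrative choices, not part of the sklearn documentation.

```python
import numpy as np


def squared_distance_ratios(data, projected):
    """Return projected / original squared distances over all distinct pairs."""
    i, j = np.triu_indices(data.shape[0], k=1)
    orig = np.sum((data[i] - data[j]) ** 2, axis=1)
    proj = np.sum((projected[i] - projected[j]) ** 2, axis=1)
    return proj / orig


rng = np.random.default_rng(0)
n_samples, n_features, n_components, eps = 50, 2000, 400, 0.5

data = rng.normal(size=(n_samples, n_features))
# Gaussian N(0, 1) matrix of shape [n_components, n_features], scaled so that
# squared norms are preserved in expectation.
p = rng.normal(size=(n_components, n_features)) / np.sqrt(n_components)
projected = data @ p.T

ratios = squared_distance_ratios(data, projected)
inside = (ratios > 1 - eps) & (ratios < 1 + eps)
print("min ratio: %.3f, max ratio: %.3f" % (ratios.min(), ratios.max()))
print("eps-embedding holds for all pairs:", bool(np.all(inside)))
```

The scaling factor matters: without it the projected distances grow with n_components and the ratios no longer center on 1.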

Wikipedia's explanation:

The lemma states that a small set of points in a high-dimensional
space can be embedded into a space of much lower dimension in such
a way that distances between the points are nearly preserved.
The map used for the embedding is at least Lipschitz, and can even
be taken to be an orthogonal projection.
(That is, a small set of high-dimensional points admits a dimension-reducing map that is at least Lipschitz continuous and nearly preserves distances.)
(The map can be orthogonal.)

The lemma has uses in compressed sensing, manifold learning,
dimensionality reduction, and graph embedding. Much of the data stored
and manipulated on computers, including text and images, can be represented
as points in a high-dimensional space (see vector space model for the
case of text). However, the essential algorithms for working with
such data tend to become bogged down very quickly
as dimension increases. It is therefore desirable to reduce
the dimensionality of the data in a way that preserves its
relevant structure. The Johnson-Lindenstrauss lemma is a classic
result in this vein.

Lemma
Given 0 < a < 1, a set X of m points in R^N, and a number n >
8 * ln(m) / (a ** 2), there is a linear map f: R^N -> R^n
such that, for all u, v in X:
 (1 - a) ||u - v|| ** 2 <= ||f(u) - f(v)|| ** 2 <= (1 + a) ||u - v|| ** 2

Note that the map here is linear. In other words, such a map can be found for a subset of the whole sample.
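
The bound on n in the lemma depends only on the number of points m and the distortion a, not on the original dimension N. A small sketch of the arithmetic follows; the function name lemma_min_dim is a hypothetical helper introduced here for illustration.

```python
import math


def lemma_min_dim(m, a):
    """Smallest integer n with n > 8 * ln(m) / a ** 2 (the lemma's condition)."""
    if not 0 < a < 1:
        raise ValueError("a must lie strictly between 0 and 1")
    return math.floor(8 * math.log(m) / a ** 2) + 1


# The target dimension grows only logarithmically with the number of points.
for m in (100, 1000, 1000000):
    print(m, lemma_min_dim(m, 0.5))
```

Halving a roughly quadruples the required dimension, since the bound scales with 1 / a ** 2.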

One proof of the lemma takes f to be a suitable multiple
of the orthogonal projection onto a random subspace of dimension n in R^N
and exploits the phenomenon of concentration of measure.
(That is, one proof projects onto randomly generated orthogonal vectors and rescales the result by a suitable multiple.)

Obviously an orthogonal projection will, in general, reduce the average
distance between points, but the lemma can be viewed as dealing with
relative distances, which do not change under scaling. In a nutshell,
you roll the dice and obtain a random projection, which will
reduce the average distance, and then you scale up the distances so that
the average distance returns to its previous value. If you keep rolling
the dice, you will, in randomized polynomial time, find a projection for
which the (scaled) distances satisfy the lemma.
This gives a vivid description of the procedure.
An orthogonal projection can be seen as a special case.
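
The dice-rolling procedure can be sketched as follows, assuming synthetic Gaussian points and a random subspace obtained from the QR decomposition of a Gaussian matrix. The helper names, the sizes, and the retry limit are illustrative assumptions; this is one possible reading of the description, not a reference implementation.

```python
import numpy as np


def scaled_random_projection(points, n, rng):
    """Project onto a random n-dimensional subspace, then rescale distances."""
    big_n = points.shape[1]
    # Orthonormal basis of a random subspace: QR of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.normal(size=(big_n, n)))
    # The projection shrinks squared lengths by about n / N on average,
    # so scale by sqrt(N / n) to restore them.
    return np.sqrt(big_n / n) * (points @ q)


def satisfies_lemma(points, mapped, a):
    """Check the lemma's two-sided bound on squared distances for all pairs."""
    i, j = np.triu_indices(points.shape[0], k=1)
    orig = np.sum((points[i] - points[j]) ** 2, axis=1)
    new = np.sum((mapped[i] - mapped[j]) ** 2, axis=1)
    return bool(np.all((new >= (1 - a) * orig) & (new <= (1 + a) * orig)))


def roll_until_good(points, n, a, rng, max_tries=100):
    """Keep drawing random projections until one satisfies the lemma."""
    for tries in range(1, max_tries + 1):
        mapped = scaled_random_projection(points, n, rng)
        if satisfies_lemma(points, mapped, a):
            return mapped, tries
    raise RuntimeError("no suitable projection found in %d tries" % max_tries)


rng = np.random.default_rng(0)
points = rng.normal(size=(30, 1000))  # m = 30 points in R^1000
a = 0.5
n = int(np.floor(8 * np.log(30) / a ** 2)) + 1  # lemma's bound on n
mapped, tries = roll_until_good(points, n, a, rng)
print("n = %d, found a valid projection after %d draw(s)" % (n, tries))
```

The retry loop is the "keep rolling" step: a single draw can fail for a few pairs, but a fresh draw is cheap and each one succeeds with good probability.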

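Back to sklearn:
The minimum number of components that guarantees the eps-embedding is
given by (with a playing the role of eps):
 n_components >= 4 * log(n_samples) / (a ** 2 / 2 - a ** 3 / 3)
The proportionality here is easy to understand: more samples or a smaller distortion require more components.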
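
As a quick check, the formula can be evaluated by hand and compared with johnson_lindenstrauss_min_dim. This is a small sketch; min_dim_from_formula is a hypothetical helper, and rounding down to an integer is an assumption chosen to match the values the library returns.

```python
import numpy as np
from sklearn.random_projection import johnson_lindenstrauss_min_dim


def min_dim_from_formula(n_samples, a):
    """Evaluate 4 * log(n_samples) / (a ** 2 / 2 - a ** 3 / 3), rounded down."""
    return int(4 * np.log(n_samples) / (a ** 2 / 2 - a ** 3 / 3))


for n_samples, a in [(1000, 0.5), (1000000, 0.5), (1000000, 0.1)]:
    by_hand = min_dim_from_formula(n_samples, a)
    by_sklearn = int(johnson_lindenstrauss_min_dim(n_samples, eps=a))
    print(n_samples, a, by_hand, by_sklearn)
```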

Empirical validation
We validate the above bounds on the digits dataset or on the 20 newsgroups
text document (TF-IDF word frequencies) dataset:

For the digits dataset, the 8x8 gray-level pixel data
of 500 handwritten digit pictures are randomly projected
to spaces with various larger numbers of dimensions n_components.

......

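sklearn.random_projection::johnson_lindenstrauss_min_dim(n_samples, eps):
 finds a 'safe' number of components to randomly project to

plt.loglog
 plots with logarithmic scaling on both axes, which evens out large differences in magnitude on the display.

plt.semilogy
 applies logarithmic scaling to the y axis only.

ndarray.ravel():
 returns the flattened vector (flattened row by row)

sklearn.random_projection::SparseRandomProjection
 class that performs random projection with a sparse matrix
 the n_components parameter sets the dimensionality after projection
 calling fit_transform transforms the data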
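
A short usage sketch of SparseRandomProjection and ravel on synthetic data follows. The array sizes and the fixed random_state are illustrative assumptions.

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))  # 100 samples, 5000 features (synthetic)

# n_components sets the dimensionality after projection.
rp = SparseRandomProjection(n_components=300, random_state=0)
X_small = rp.fit_transform(X)
print(X_small.shape)

# The fitted random matrix is stored in components_.
print(rp.components_.shape)

# ravel() flattens a 2-D array into a 1-D vector, row by row.
grid = np.arange(6).reshape(2, 3)
print(grid.ravel())
```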

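The code below is the sklearn example that validates the lemma above. Note that it increases the dimension here (from 64 to
[300, 1000, 10000]), so it verifies the result from the viewpoint of the inverse problem.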

import numpy as np
import matplotlib.pyplot as plt
from sklearn.random_projection import johnson_lindenstrauss_min_dim


# Part 1: minimum n_components as a function of n_samples, one curve per eps
eps_range = np.linspace(0.1, 0.99, 5)
colors = plt.cm.Blues(np.linspace(0.3, 1.0, len(eps_range)))

n_samples_range = np.logspace(1, 9, 9)

plt.figure()
for eps, color in zip(eps_range, colors):
    min_n_components = johnson_lindenstrauss_min_dim(n_samples_range, eps=eps)
    plt.loglog(n_samples_range, min_n_components, color=color)

plt.legend(["eps = %0.1f" % eps for eps in eps_range], loc="lower right")
plt.xlabel("Number of observations to eps-embed")
plt.ylabel("Minimum number of dimensions")
plt.title("Johnson-Lindenstrauss bounds:\nn_samples vs n_components")

# Part 2: minimum n_components as a function of eps, one curve per n_samples
n_samples_range = np.logspace(2, 6, 5)
colors = plt.cm.Blues(np.linspace(0.3, 1.0, len(n_samples_range)))

plt.figure()
for n_samples, color in zip(n_samples_range, colors):
    min_n_components = johnson_lindenstrauss_min_dim(n_samples, eps=eps_range)
    plt.semilogy(eps_range, min_n_components, color=color)

plt.legend(["n_samples = %d" % n for n in n_samples_range], loc="upper right")
plt.xlabel("Distortion eps")
plt.ylabel("Minimum number of dimensions")
plt.title("Johnson-Lindenstrauss bounds:\nn_components vs eps")

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.datasets import load_digits

# Alternative dataset: 20 newsgroups TF-IDF features
# data = fetch_20newsgroups_vectorized().data[:500]
data = load_digits().data[:500]
n_samples, n_features = data.shape

print("")
print("data shape :")
print(data.shape)
print("")

print("Embedding %d samples with dim %d using various random projections" % (n_samples, n_features))

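from sklearn.metrics.pairwise import euclidean_distances

n_components_range = np.array([300, 1000, 10000])
# Squared pairwise distances in the original space, flattened to 1-D
dists = euclidean_distances(data, squared=True).ravel()

# Keep only pairs of non-identical samples (avoids division by zero below)
nonzero = dists != 0
dists = dists[nonzero]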

from time import time
from sklearn.random_projection import SparseRandomProjection

for n_components in n_components_range:
    t0 = time()
    rp = SparseRandomProjection(n_components=n_components)
    projected_data = rp.fit_transform(data)

    print("")
    print("projected_data shape :")
    print(projected_data.shape)
    print("")

    print("Projected %d samples from %d to %d in %0.3fs" % (n_samples, n_features, n_components, time() - t0))

    # Report the memory footprint of the sparse random matrix
    if hasattr(rp, 'components_'):
        n_bytes = rp.components_.data.nbytes
        n_bytes += rp.components_.indices.nbytes
        print("Random matrix with size: %0.3fMB" % (n_bytes / 1e6))

    projected_dists = euclidean_distances(projected_data, squared=True).ravel()[nonzero]
    plt.figure()
    plt.hexbin(dists, projected_dists, gridsize=100, cmap=plt.cm.PuBu)
    plt.xlabel("Pairwise squared distance in original space")
    plt.ylabel("Pairwise squared distance in projected space")
    plt.title("Pairwise distance distribution for n_components = %d" % n_components)

    cb = plt.colorbar()
    cb.set_label("Sample pairs counts")

    # Ratio of projected to original squared distance for every pair
    rates = projected_dists / dists
    print("Mean distance rate: %0.2f (%0.2f)" % (np.mean(rates), np.std(rates)))

    plt.figure()
    # density=True normalizes the histogram (the older normed argument is gone from current matplotlib)
    plt.hist(rates, bins=50, density=True, range=(0., 2.))
    plt.xlabel("Squared distance rate: projected / original")
    plt.ylabel("Distribution of samples pairs")
    plt.title("Histogram of pairwise distance rates for n_components = %d" % n_components)

#plt.show()


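Random projection is a way to reduce the dimensionality of high-dimensional data, but that does not mean it reduces collinearity.
When we want to determine whether the data has strong factors, however, we can reduce the dimension with this method first and then process the result.
Because random projection preserves collinearity, it speeds up computations on large matrices, and it can speed up the computations for a high-dimensional generalized factor model
as follows: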
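# generate_X is the author's own data-generating helper; its definition is not shown here
X = generate_X(1000, 5, 1000, 10, 10, 5, 5)[0]

# Use random projection, which may reach a conclusion faster
import numpy as np
from numpy.linalg import svd  # assumed source of svd; the original does not show this import
from sklearn.random_projection import SparseRandomProjection
from sklearn.random_projection import johnson_lindenstrauss_min_dim

min_n_components = johnson_lindenstrauss_min_dim(1000, eps=0.5)
print("min_n_components :")
print(min_n_components)


n_components = min_n_components
rp = SparseRandomProjection(n_components=n_components)
projected_X = rp.fit_transform(X)
# Singular values of the projected matrix, in descending order
eigen_v = svd(projected_X)[1]

# Ratios of consecutive transformed values; the position of the largest
# ratio among the first 20 suggests where the strong factors end
eigen_v1 = eigen_v / (1 + eigen_v)
print(eigen_v1[:-1] / eigen_v1[1:])
print(np.argmax((eigen_v1[:-1] / eigen_v1[1:])[:20]))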
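
Since generate_X is not shown, the idea can be tried on a hypothetical stand-in. The sketch below assumes a plain linear factor model (X = F L' + noise with 3 strong factors) and uses the raw singular-value ratios rather than the eigen_v / (1 + eigen_v) transform above. The model, the sizes, and the seed are all illustrative assumptions.

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection
from sklearn.random_projection import johnson_lindenstrauss_min_dim

rng = np.random.default_rng(0)
n_obs, n_features, n_factors = 500, 1000, 3

# Hypothetical stand-in for generate_X: X = F L' + noise with 3 strong factors.
factors = rng.normal(size=(n_obs, n_factors))
loadings = rng.normal(size=(n_features, n_factors))
X = factors @ loadings.T + rng.normal(size=(n_obs, n_features))

# Reduce the feature dimension before the more expensive decomposition.
n_components = int(johnson_lindenstrauss_min_dim(n_obs, eps=0.5))
rp = SparseRandomProjection(n_components=n_components, random_state=0)
projected_X = rp.fit_transform(X)

# Singular values in descending order; a large drop marks the last strong factor.
s = np.linalg.svd(projected_X, compute_uv=False)
ratios = s[:-1] / s[1:]
estimated = int(np.argmax(ratios[:20])) + 1
print("n_components:", n_components)
print("estimated number of factors:", estimated)
```

The factor part of X has much larger singular values than the noise part, and the projection roughly preserves that gap, so the largest ratio still sits at the boundary between the two.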











