For an introduction to PCA and how to use the program, see the following article:
http://blog.csdn.net/watkinsong/article/details/8234766
[COEFF,SCORE,latent] = princomp(X) returns latent, a vector containing the eigenvalues of the covariance matrix of X.
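princomp performs principal component analysis on the n-by-p data matrix X and returns the principal component coefficients. Each row of X is one observation (sample) and each column is one feature variable.
COEFF is a p-by-p matrix. Each column holds the coefficients of one principal component, and the columns are ordered by decreasing component variance.
SCORE is the representation of X in principal component space. Each row corresponds to an observation and each column to a principal component, so without the 'econ' option it has the same number of rows and columns as X.
latent is a vector of the eigenvalues in descending order. You can use it to decide by hand how many leading columns to keep after dimensionality reduction, for example by looking at the cumulative fraction of variance: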
cumsum(latent)./sum(latent)
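To make the three outputs and the cumulative ratio concrete, here is a minimal sketch that reproduces them with base functions only (svd, mean, cumsum). It is not princomp's own implementation, the data matrix is made up, and the signs of the columns may differ from what princomp returns; the 0.95 threshold is the one used in the code further down.

```matlab
% Sketch: PCA via SVD on a small made-up data set (6 samples, 3 features).
X = [2.5 2.4 1.0;
     0.5 0.7 0.3;
     2.2 2.9 0.8;
     1.9 2.2 1.1;
     3.1 3.0 0.2;
     2.3 2.7 0.9];
[n, p] = size(X);

mu = mean(X, 1);                    % 1-by-p column means
Xc = X - repmat(mu, n, 1);          % center every feature

[U, S, V] = svd(Xc, 'econ');        % Xc = U*S*V'
coeff  = V;                         % p-by-p, one component per column
score  = Xc * coeff;                % n-by-p, the data in component space
latent = diag(S).^2 / (n - 1);      % eigenvalues of cov(X), descending

% Cumulative fraction of variance explained by the first k components
explained = cumsum(latent) ./ sum(latent);
k = find(explained > 0.95, 1, 'first');   % smallest k above 95%
reduced = score(:, 1:k);                  % n-by-k reduced data
```

The find call replaces an explicit loop over latent: it returns the first index at which the cumulative ratio exceeds the threshold.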
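Note: when the number of features p exceeds the number of samples n, call [...] = princomp(X,'econ'). This can be significantly faster, and it avoids the out-of-memory errors that the full computation often runs into.
For example, with 30 samples of 1000000 features each (a 30-by-1000000 matrix), the full COEFF would be 1000000-by-1000000, which is about 8 TB in double precision. With princomp(x,'econ') only the components that can have nonzero variance are returned, so COEFF is 1000000-by-29.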
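The size argument can be checked on a smaller example. The sketch below uses the base svd function with its 'econ' flag as an assumed stand-in for princomp(X,'econ'); the data is made up, and n and p are deliberately small so that it runs quickly.

```matlab
% Sketch: economy-size PCA when p is much larger than n.
n = 5;  p = 2000;
X  = sin((1:n)' * (1:p) / 10);        % deterministic made-up data, n-by-p
Xc = X - repmat(mean(X, 1), n, 1);    % center every feature

[U, S, V] = svd(Xc, 'econ');          % V is p-by-n, never p-by-p
s = diag(S);

% Centered data has rank at most n-1, so the last singular value is
% numerically zero and only n-1 components are worth keeping.
coeff  = V(:, 1:n-1);                 % p-by-(n-1)
score  = Xc * coeff;                  % n-by-(n-1)
latent = s(1:n-1).^2 / (n - 1);       % nonzero eigenvalues of cov(X)
```

With n = 30 and p = 1000000 the same reasoning gives a 1000000-by-29 coefficient matrix, roughly 232 MB in double precision.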
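To reduce the dimensionality of test samples with the MATLAB built-in approach, first subtract the training-set mean from each test sample. The training data was centered before it was projected, so the test data has to be centered with the same mean. Then multiply the centered test sample by coeff to get its reduced representation. For example, a 1-by-1000000 test sample multiplied by a 1000000-by-29 projection matrix gives a 1-by-29 reduced test sample.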
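A minimal sketch of that test-time projection, assuming the components come from an SVD of the centered training data rather than from princomp itself. The matrices Xtrain and Xtest and the choice k = 2 are made up for illustration.

```matlab
% Sketch: project test samples with the training mean and training coeff.
Xtrain = [2.5 2.4 1.0;
          0.5 0.7 0.3;
          2.2 2.9 0.8;
          1.9 2.2 1.1;
          3.1 3.0 0.2];
Xtest  = [2.0 1.6 0.7;
          1.0 1.1 0.4];

mu = mean(Xtrain, 1);                               % training mean, reused below
Xc = Xtrain - repmat(mu, size(Xtrain, 1), 1);
[U, S, V] = svd(Xc, 'econ');

k = 2;                                              % number of components kept
coeff = V(:, 1:k);                                  % p-by-k projection matrix
train_score = Xc * coeff;                           % reduced training data

% Test data: subtract the TRAINING mean, then apply the same coeff.
test_centered = Xtest - repmat(mu, size(Xtest, 1), 1);
test_score    = test_centered * coeff;              % (number of test rows)-by-k
```

Using the test set's own mean instead of mu would shift the test points relative to the training points in the reduced space.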
My PCA code is as follows:
%% PCA process using Princomp based on training data
% training
train_features=trainData(:,1:numOfColumns-1);
train_labels=trainData(:,numOfColumns);
train_features_tfidf = train_features(:,108:size(train_features,2));
mean_value_train_tfidf = mean(train_features_tfidf,1); % will be used on testing data
[coeff, score, latent, Tsquared] = princomp(train_features_tfidf,'econ');
% coeff
% score
%% select the number of principal components (95% of the variance)
latent= latent/sum(latent);
sumRate=0;
selection_index = [];
for k = 1:length(latent)
sumRate = sumRate + latent(k);
selection_index(k) = k;
if sumRate>0.95
break;
end
end
% The following check shows that projecting the centered training data
% by hand gives the same result as score (up to floating-point error):
% train_centered = train_features_tfidf - ones(size(train_features_tfidf,1),1) * mean_value_train_tfidf;
% score_check = train_centered * coeff;
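% All returned components are kept here. To keep only the selected ones,
% use score(:,selection_index), and coeff(:,selection_index) for the test data.
train_features = [train_features(:,1:107), score];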
% testing
test_features=testData(:,1:numOfColumns-1);
test_labels=testData(:,numOfColumns);
test_features_tfidf = test_features(:,108:size(test_features,2));
test_features_mean = test_features_tfidf - ones(size(test_features_tfidf,1),1) * mean_value_train_tfidf;
test_features_score = test_features_mean * coeff; %test_features_mean * coeff(:,selection_index);
test_features = [test_features(:,1:107),test_features_score];