本文讲介绍垃圾邮件分类,其中用到SVM算法、Logistic回归、SEA-Logistic深度网络分类。下面分别讲解这几个算法在垃圾邮件分类中的用法。
数据集为spamData.mat,训练集有3065个样本,测试集有1536个样本,每个样本的维度为57.
数据集下载地址:https://github.com/probml/pmtkdata/tree/master/spamData
一、SVM分类
二、Logistic分类
三、SEA-Logistic分类
一、SVM分类
原理我就不讲了,前面博文有,而且网上有太多了资料,还有我没有别人讲的好,不能献丑了,我现在只讲如何利用Libsvm、MATLAB自带的svm工具箱分类垃圾邮件。
1.libsvm垃圾邮件分类
libsvm库下载:http://www.csie.ntu.edu.tw/~cjlin/libsvm/
详解:http://www.matlabsky.com/thread-11925-1-1.html
下载好的libsvm,然后添加到主路径下File->set path ->add with subfolders->加入libsvm-3.11文件夹的路径。
1.首先在MATLAB命令窗【Commond Window】中输入:mex -setup
2.出现 Would you like mex to locate installed compilers [y]/n? 选择y
3.Select a compiler:
[1] Microsoft Visual C++ 2010 in E:\VS2010
[0] None 选择:1
4.Are these correct [y]/n? 选择y
好了现在就可以用了
load('spamData.mat');
model = svmtrain(ytrain,Xtrain,'-t 0');
[predict_label,accuracy] = svmpredict(ytest,Xtest,model);
上面加红色标注的-t x x可以取0,1,2,3,4。系统默认为2,如果不加-t x
0)线性核函数
1)多项式核函数
2)RBF核函数
3)sigmoid核函数
4)自定义核函数
从上表可以看出,线性核函数效果最好,能够达到91.1458%。
数据要这样处理下:
ytrain(ytrain==0) = -1;
ytest(ytest==0) = -1;
2.MATLAB自带的svm工具箱分类垃圾邮件
Matlab自带了svm工具箱,现在我就介绍如何利用这个工具箱来做垃圾邮件分类。下面我先给出程序,再来解释。
load spamData
svmStruct = svmtrain(Xtrain,ytrain,'showplot',true);
classes=svmclassify(svmStruct,Xtest,'showplot',true);
nCorrect=sum(classes==ytest);
accuracy = nCorrect/length(classes);
accuracy = 100*accuracy;
accuracy = double(accuracy);
fprintf('accuracy=%s%%\n',accuracy);
运行会出现这个结果就忽略它,对结果没有什么影响。
在程序中点击右键打开svmtrain.m这个文件。在文件的dflts ={'linear',.......},我得是287行。可以改变核函数,可以选择okfuns = {'linear','quadratic', 'radial','rbf','polynomial','mlp'}; 上面dflts ={'linear',.......}可以改变。
linear是线性核。
quadratic是二次核函数。
radial是什么核?
rbf 是径向基核,通常叫做高斯核,但是Ng说跟高斯没有什么关系。
polynomial是多项式核。
mlp多层感知器核函数。
下面来看使用各个核在垃圾邮件分类中的识别率
可以看到二次核函数效果最好,能达到85.55%。
整理的数据集见资源,三种不同的预处理方式
给出Excise8.1的作业结果
logistic回归
SVM
二、Logistic分类
可以参考这篇文献:http://www.docin.com/p-160363677.html
前面的博文讲过softmax分类,可以改为Logistic回归。http://blog.csdn.net/hlx371240/article/details/40015395
这篇博文是在《最优化计算方法》这门课写的,当时用LBFGS和SD法优化参数,现在我用工具箱直接进行分类。
%% STEP 0: Initialise constants and parameters
inputSize = 57; % Size of input vector
numClasses = 2; % Number of classes
lambda = 1e-4; % Weight decay parameter
%%=====================================================================
%% STEP 1: Load data
load('D:\机器学习课程\作业三\spamData.mat');
Xtrain=Xtrain';
ytrain(ytrain==0) = 2; % Remap 0 to 10
inputData = Xtrain;
DEBUG = false;
if DEBUG
inputSize = 8;
inputData = randn(8, 100);
labels = randi(10, 100, 1);
end
% Randomly initialise theta
theta = 0.005 * randn(numClasses * inputSize, 1);%输入的是一个列向量
%%======================================================================
%% STEP 2: Implement softmaxCost
[cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, inputData, ytrain);
%%======================================================================
%% STEP 3: Learning parameters
options.maxIter = 100;
softmaxModel = softmaxTrain(inputSize, numClasses, lambda, ...
inputData, ytrain, options);
%%======================================================================
%% STEP 4: Testing
Xtest=Xtest';
ytest(ytest==0) = 2; % Remap 0 to 10
size(softmaxModel.optTheta)
size(inputData)
[pred] = softmaxPredict(softmaxModel, Xtest);
acc = mean(ytest(:) == pred(:));
fprintf('Accuracy: %0.3f%%\n', acc * 100);
softmaxTrain.m
function [softmaxModel] = softmaxTrain(inputSize, numClasses, lambda, inputData, labels, options)
if ~exist('options', 'var')
options = struct;
end
if ~isfield(options, 'maxIter')
options.maxIter = 400;
end
% initialize parameters
theta = 0.005 * randn(numClasses * inputSize, 1);
% Use minFunc to minimize the function
addpath minFunc/
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
% function. Generally, for minFunc to work, you
% need a function pointer with two outputs: the
% function value and the gradient. In our problem,
% softmaxCost.m satisfies this.
minFuncOptions.display = 'on';
[softmaxOptTheta, cost] = minFunc( @(p) softmaxCost(p, ...
numClasses, inputSize, lambda, ...
inputData, labels), ...
theta, options);
% Fold softmaxOptTheta into a nicer format
softmaxModel.optTheta = reshape(softmaxOptTheta, numClasses, inputSize);
softmaxModel.inputSize = inputSize;
softmaxModel.numClasses = numClasses;
end
softmaxCost.m
function [cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, data, labels)
% numClasses - the number of classes
% inputSize - the size N of the input vector
% lambda - weight decay parameter
% data - the N x M input matrix, where each column data(:, i) corresponds to
% a single test set
% labels - an M x 1 matrix containing the labels corresponding for the input data
%
% Unroll the parameters from theta
theta = reshape(theta, numClasses, inputSize);%将输入的参数列向量变成一个矩阵
numCases = size(data, 2);%输入样本的个数
groundTruth = full(sparse(labels, 1:numCases, 1));%这里sparse是生成一个稀疏矩阵,该矩阵中的值都是第三个值1
%稀疏矩阵的小标由labels和1:numCases对应值构成
cost = 0;
thetagrad = zeros(numClasses, inputSize);
M = bsxfun(@minus,theta*data,max(theta*data, [], 1));
M = exp(M);
p = bsxfun(@rdivide, M, sum(M));
cost = -1/numCases * groundTruth(:)' * log(p(:)) + lambda/2 * sum(theta(:) .^ 2);
thetagrad = -1/numCases * (groundTruth - p) * data' + lambda * theta;
grad = [thetagrad(:)];
end
softmaxPredict.m
function [pred] = softmaxPredict(softmaxModel, data)
% Unroll the parameters from theta
theta = softmaxModel.optTheta; % this provides a numClasses x inputSize matrix
pred = zeros(1, size(data, 2));
[nop, pred] = max(theta * data);
end
initializeParameters.m
function theta = initializeParameters(hiddenSize, visibleSize)
%% Initialize parameters randomly based on layer sizes.
r = sqrt(6) / sqrt(hiddenSize+visibleSize+1); % we'll choose weights uniformly from the interval [-r, r]
W1 = rand(hiddenSize, visibleSize) * 2 * r - r;
W2 = rand(visibleSize, hiddenSize) * 2 * r - r;
b1 = zeros(hiddenSize, 1);
b2 = zeros(visibleSize, 1);
% Convert weights and bias gradients to the vector form.
% This step will "unroll" (flatten and concatenate together) all
% your parameters into a vector, which can then be used with minFunc.
theta = [W1(:) ; W2(:) ; b1(:) ; b2(:)];
end
sigmoidInv.m
function sigmInv = sigmoidInv(x)
sigmInv = sigmoid(x).*(1-sigmoid(x));
end
这个算法利用的LBFGS,拟牛顿法,可以节约内存,用近似的Hessian矩阵代替精确的Hessian矩阵,这个是我的美女老师讲的,建议同学们选下学期美女老师的《数值优化》的课程,我觉得她讲得很好,听课绝对认真。
还有一个工具箱(minfunc)到资源下载。最后得到的识别率为:92.057%。但是不是每次跑出来的程序都是这个识别率,因为参数是随机产生的。
三、SAE-Logistic分类
SAE全称为Sparse Auto Encoder,是加了一层自学习层进一步提取特征,因为邮件中每个词的出现可能存在某种关联,然后再分类,这也是神经网络提取特征。
可以参考这篇文献:http://nlp.stanford.edu/~socherr/sparseAutoencoder_2011new.pdf
%STEP 2: 初始化参数和load数据
clear all;
clc;
load('D:\机器学习课程\作业三\spamData.mat');
Xtrain=Xtrain';
ytrain(ytrain==0) = 2;
Xtest=Xtest';
ytest(ytest == 0) = 2; % Remap 0 to 10
inputSize = 57;
numLabels = 2;
a=[57 50 45 40 35 30 25 20 15 10 5];
sparsityParam = 0.1;
lambda = 3e-3; % weight decay parameter
beta = 3; % weight of sparsity penalty term
numClasses = 2; % Number of classes (MNIST images fall into 10 classes)
lambda = 1e-4; % Weight decay parameter
%% ======================================================================
%STEP 2: 训练自学习层SAE
for i=1:11
hiddenSize = a(i);
theta = initializeParameters(hiddenSize, inputSize);
%-------------------------------------------------------------------
opttheta = theta;
addpath minFunc/
options.Method = 'lbfgs';
options.maxIter = 400;
options.display = 'on';
[opttheta, loss] = minFunc( @(p) sparseAutoencoderCost(p, ...
inputSize, hiddenSize, ...
lambda, sparsityParam, ...
beta, Xtrain), ...
theta, options);
trainFeatures = feedForwardAutoencoder(opttheta, hiddenSize, inputSize, ...
Xtrain);
%% ================================================
%STEP 3: 训练Softmax分类器
saeSoftmaxTheta = 0.005 * randn(hiddenSize * numClasses, 1);
softmaxLambda = 1e-4;
numClasses = 2;
softoptions = struct;
softoptions.maxIter = 500;
softmaxModel = softmaxTrain(hiddenSize,numClasses,softmaxLambda,...
trainFeatures,ytrain,softoptions);
theta_new = softmaxModel.optTheta(:);
%% ============================================================
stack = cell(1,1);
stack{1}.w = reshape(opttheta(1:hiddenSize * inputSize), hiddenSize, inputSize);
stack{1}.b =opttheta(2*hiddenSize*inputSize+1:2*hiddenSize*inputSize+hiddenSize);
[stackparams, netconfig] = stack2params(stack);
stackedAETheta = [theta_new;stackparams];
addpath minFunc/;
options = struct;
options.Method = 'lbfgs';
options.maxIter = 400;
options.display = 'on';
[stackedAEOptTheta,cost] = minFunc(@(p)stackedAECost(p,inputSize,hiddenSize,numClasses, netconfig,lambda, Xtrain, ytrain),stackedAETheta,options);
%% =================================================================
%STEP 4: 测试
[pred] = stackedAEPredict(stackedAEOptTheta, inputSize, hiddenSize, ...
numClasses, netconfig, Xtest);
acc = mean(ytest(:) == pred(:));
fprintf('Accuracy = %0.3f%%\n', acc * 100);
result(i)=acc * 100;
end
stackedAEPredict.m
function [pred] = stackedAEPredict(theta, inputSize, hiddenSize, numClasses, netconfig, data)
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);
depth = numel(stack);
z = cell(depth+1,1);
a = cell(depth+1, 1);
a{1} = data;
for layer = (1:depth)
z{layer+1} = stack{layer}.w * a{layer} + repmat(stack{layer}.b, [1, size(a{layer},2)]);
a{layer+1} = sigmoid(z{layer+1});
end
[~, pred] = max(softmaxTheta * a{depth+1});
end
% You might find this useful
function sigm = sigmoid(x)
sigm = 1 ./ (1 + exp(-x));
end
stackedAECost.m
function [ cost, grad ] = stackedAECost(theta, inputSize, hiddenSize, ...
numClasses, netconfig, ...
lambda, data, labels)
% stackedAECost: Takes a trained softmaxTheta and a training data set with labels,
% and returns cost and gradient using a stacked autoencoder model. Used for
% finetuning.
% theta: trained weights from the autoencoder
% visibleSize: the number of input units
% hiddenSize: the number of hidden units *at the 2nd layer*
% numClasses: the number of categories
% netconfig: the network configuration of the stack
% lambda: the weight regularization penalty
% data: Our matrix containing the training data as columns. So, data(:,i) is the i-th training example.
% labels: A vector containing labels, where labels(i) is the label for the
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);
% You will need to compute the following gradients
softmaxThetaGrad = zeros(size(softmaxTheta));
stackgrad = cell(size(stack));
for d = 1:numel(stack)
stackgrad{d}.w = zeros(size(stack{d}.w));
stackgrad{d}.b = zeros(size(stack{d}.b));
end
cost = 0; % You need to compute this
numCases = size(data, 2);%输入样本的个数
groundTruth = full(sparse(labels, 1:numCases, 1));
depth = numel(stack);
z = cell(depth+1,1);
a = cell(depth+1, 1);
a{1} = data;
for layer = (1:depth)
z{layer+1} = stack{layer}.w * a{layer} + repmat(stack{layer}.b, [1, size(a{layer},2)]);
a{layer+1} = sigmoid(z{layer+1});
end
M = softmaxTheta * a{depth+1};
M = bsxfun(@minus, M, max(M));
p = bsxfun(@rdivide, exp(M), sum(exp(M)));
cost = -1/numClasses * groundTruth(:)' * log(p(:)) + lambda/2 * sum(softmaxTheta(:) .^ 2);
softmaxThetaGrad = -1/numClasses * (groundTruth - p) * a{depth+1}' + lambda * softmaxTheta;
d = cell(depth+1);
d{depth+1} = -(softmaxTheta' * (groundTruth - p)) .* a{depth+1} .* (1-a{depth+1});
for layer = (depth:-1:2)
d{layer} = (stack{layer}.w' * d{layer+1}) .* a{layer} .* (1-a{layer});
end
for layer = (depth:-1:1)
stackgrad{layer}.w = (1/numClasses) * d{layer+1} * a{layer}';
stackgrad{layer}.b = (1/numClasses) * sum(d{layer+1}, 2);
end
% -------------------------------------------------------------------------
%% Roll gradient vector
grad = [softmaxThetaGrad(:) ; stack2params(stackgrad)];
end
% You might find this useful
function sigm = sigmoid(x)
sigm = 1 ./ (1 + exp(-x));
end
feedForwardAutoencoder.m
function [activation] = feedForwardAutoencoder(theta, hiddenSize, visibleSize, data)
% theta: trained weights from the autoencoder
% visibleSize: the number of input units (probably 64)
% hiddenSize: the number of hidden units (probably 25)
% data: Our matrix containing the training data as columns. So, data(:,i) is the i-th training example.
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this
% follows the notation convention of the lecture notes.
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
% Instructions: Compute the activation of the hidden layer for the Sparse Autoencoder.
activation = sigmoid(W1*data+repmat(b1,[1,size(data,2)]));
end
function sigm = sigmoid(x)
sigm = 1 ./ (1 + exp(-x));
end
sparseAutoencoderCost.m
function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
lambda, sparsityParam, beta, data)
% visibleSize: the number of input units (probably 64)
% hiddenSize: the number of hidden units (probably 25)
% lambda: weight decay parameter
% sparsityParam: The desired average activation for the hidden units (denoted in the lecture
% notes by the greek alphabet rho, which looks like a lower-case "p").
% beta: weight of sparsity penalty term
% data: Our 64x10000 matrix containing the training data. So, data(:,i) is the i-th training example.
% The input theta is a vector (because minFunc expects the parameters to be a vector).
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this
% follows the notation convention of the lecture notes.
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);
% Cost and gradient variables (your code needs to compute these values).
% Here, we initialize them to zeros.
cost = 0;
W1grad = zeros(size(W1));
W2grad = zeros(size(W2));
b1grad = zeros(size(b1));
b2grad = zeros(size(b2));
Jcost = 0;%直接误差
Jweight = 0;%权值惩罚
Jsparse = 0;%稀疏性惩罚
[n m] = size(data);%m为样本的个数,n为样本的特征数
%前向算法计算各神经网络节点的线性组合值和active值
z2 = W1*data+repmat(b1,1,m);%注意这里一定要将b1向量复制扩展成m列的矩阵
a2 = sigmoid(z2);
z3 = W2*a2+repmat(b2,1,m);
a3 = sigmoid(z3);
% 计算预测产生的误差
Jcost = (0.5/m)*sum(sum((a3-data).^2));
%计算权值惩罚项
Jweight = (1/2)*(sum(sum(W1.^2))+sum(sum(W2.^2)));
%计算稀释性规则项
rho = (1/m).*sum(a2,2);%求出第一个隐含层的平均值向量
Jsparse = sum(sparsityParam.*log(sparsityParam./rho)+ ...
(1-sparsityParam).*log((1-sparsityParam)./(1-rho)));
%损失函数的总表达式
cost = Jcost+lambda*Jweight+beta*Jsparse;
%反向算法求出每个节点的误差值
d3 = -(data-a3).*sigmoidInv(z3);
sterm = beta*(-sparsityParam./rho+(1-sparsityParam)./(1-rho));%因为加入了稀疏规则项,所以
%计算偏导时需要引入该项
d2 = (W2'*d3+repmat(sterm,1,m)).*sigmoidInv(z2);
%计算W1grad
W1grad = W1grad+d2*data';
W1grad = (1/m)*W1grad+lambda*W1;
%计算W2grad
W2grad = W2grad+d3*a2';
W2grad = (1/m).*W2grad+lambda*W2;
%计算b1grad
b1grad = b1grad+sum(d2,2);
b1grad = (1/m)*b1grad;%注意b的偏导是一个向量,所以这里应该把每一行的值累加起来
%计算b2grad
b2grad = b2grad+sum(d3,2);
b2grad = (1/m)*b2grad;
grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];
end
function sigm = sigmoid(x)
sigm = 1 ./ (1 + exp(-x));
end
function sigmInv = sigmoidInv(x)
sigmInv = sigmoid(x).*(1-sigmoid(x));
end
stack2params.m
function [params, netconfig] = stack2params(stack)
params = [];
for d = 1:numel(stack)
params = [params ; stack{d}.w(:) ; stack{d}.b(:) ];
assert(size(stack{d}.w, 1) == size(stack{d}.b, 1), ...
['The bias should be a *column* vector of ' ...
int2str(size(stack{d}.w, 1)) 'x1']);
if d < numel(stack)
assert(size(stack{d}.w, 1) == size(stack{d+1}.w, 2), ...
['The adjacent layers L' int2str(d) ' and L' int2str(d+1) ...
' should have matching sizes.']);
end
end
if nargout > 1
% Setup netconfig
if numel(stack) == 0
netconfig.inputsize = 0;
netconfig.layersizes = {};
else
netconfig.inputsize = size(stack{1}.w, 2);
netconfig.layersizes = {};
for d = 1:numel(stack)
netconfig.layersizes = [netconfig.layersizes ; size(stack{d}.w,1)];
end
end
end
end