Using libsvm 3.20 on Windows

I. Official resources

libsvm home page: https://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html

libsvm introduction paper: http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf

Official guide to using libsvm more effectively: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf (well worth reading)

Data sets: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Binary classification examples: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

Multi-class classification examples: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html

FAQ: http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html (answers many common questions)

List of utility tools: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/ (the liblinear mentioned in the guide is here)


II. Required software

libsvm-3.20: http://www.csie.ntu.edu.tw/~cjlin/libsvm/libsvm-3.20.zip
python-2.7.10: https://www.python.org/ftp/python/2.7.10/python-2.7.10.msi (needed to run the Python tools)
gnuplot 5.0.1: http://jaist.dl.sourceforge.net/project/gnuplot/gnuplot/5.0.1/gp501-win32-mingw.exe (plots the progress of the search for the best parameters)



III. Training workflow
(The commands below can be saved in a .bat file for later reuse.)

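1. Extract the features in libsvm's data format, one sample per line: (class label, then feature index:feature value pairs)
1 1:2.111 2:3.567 3:-0.125
...
0 1:2.156 2:3.259 3:0.258
...
Save the training samples in a file named train and the test samples in a file named test (the names only serve to tell them apart).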
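As a minimal sketch of the data format above, the following Python helpers write and read one sample line. The function names are hypothetical (they are not part of libsvm), and the sketch assumes dense feature lists with 1-based indices; zero values are omitted, which the sparse format allows.

```python
def to_libsvm_line(label, features):
    """Format one sample as 'label index:value ...' using 1-based indices.

    Zero-valued features are skipped because the format is sparse.
    """
    parts = [str(label)]
    for index, value in enumerate(features, start=1):
        if value != 0:
            parts.append("%d:%g" % (index, value))
    return " ".join(parts)


def parse_libsvm_line(line):
    """Parse 'label index:value ...' into (label, {index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


if __name__ == "__main__":
    print(to_libsvm_line(1, [2.111, 3.567, -0.125]))
    print(parse_libsvm_line("0 1:2.156 2:3.259 3:0.258"))
```

Writing one such line per sample into the train and test files gives input that svm-scale and svm-train can read.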

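2. Scale the feature data (speeds up computation)
svm-scale -l -1 -u 1 -s range train > train.scale (-l and -u set the lower and upper bounds of the target range, here -1 to 1; -s saves the scaling parameters to the file range; train is the raw feature data; train.scale is the scaled output)
svm-scale -r range test > test.scale (-r reads the saved parameters, so the test data is scaled with the same range as the training data)
Note: with the RBF kernel and parameter selection, scaling to [0,1] and to [-1,1] gives the same results; [0,1] can be faster because it keeps zero entries at zero when the data is sparse.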
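To illustrate what the two svm-scale calls do, here is a small sketch of linear scaling in Python. It is not svm-scale's code: the function names are hypothetical, rows are assumed to be dense lists of equal length, and a constant feature is mapped to 0.0, which is an arbitrary choice made for this sketch.

```python
def fit_ranges(rows):
    """Return the (min, max) of every feature column, computed on training rows."""
    return [(min(column), max(column)) for column in zip(*rows)]


def scale_rows(rows, ranges, lower=-1.0, upper=1.0):
    """Map each feature linearly so that its training min/max land on lower/upper."""
    scaled = []
    for row in rows:
        new_row = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:
                new_row.append(0.0)  # constant feature: arbitrary choice in this sketch
            else:
                new_row.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(new_row)
    return scaled


if __name__ == "__main__":
    train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
    test = [[15.0, 10.0]]
    ranges = fit_ranges(train)          # plays the role of the saved range file
    print(scale_rows(train, ranges))    # training data lands inside [-1, 1]
    print(scale_rows(test, ranges))     # test data reuses the training ranges
```

Because the test set reuses the training ranges, a test value can fall outside [-1,1]; that is expected and is the reason the range file is saved and restored instead of being recomputed on the test data.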


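3. Search for the best c and g parameters
python grid.py train.scale (when the search finishes it reports the best c and g; for example, an output of 2.0 1.0 96.8922 means c=2 and g=1, with 96.8922 being the cross-validation accuracy in percent)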
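The search can be pictured as trying every (c, g) pair on a grid of powers of two and keeping the pair with the highest cross-validation accuracy. The sketch below shows only that bookkeeping. It makes two assumptions: the exponent ranges mirror what I understand to be grid.py's defaults (log2 c from -5 to 15 in steps of 2, log2 g from 3 to -15 in steps of -2), and svm-train -v prints a line of the form "Cross Validation Accuracy = 96.8922%". The helper names are hypothetical, and ties are not handled the way grid.py handles them.

```python
import re


def exponent_range(begin, end, step):
    """List the exponents begin, begin+step, ... up to and including end."""
    values = []
    value = begin
    while (step > 0 and value <= end) or (step < 0 and value >= end):
        values.append(value)
        value += step
    return values


def parse_cv_accuracy(output):
    """Extract the accuracy (in percent) from svm-train -v output, or None."""
    match = re.search(r"Cross Validation Accuracy = ([0-9.]+)%", output)
    return float(match.group(1)) if match else None


def pick_best(results):
    """results is a list of (c, g, accuracy); return the most accurate entry."""
    return max(results, key=lambda entry: entry[2])


if __name__ == "__main__":
    # Candidate (c, g) pairs; each would be passed to svm-train -v as -c and -g.
    grid = [(2.0 ** c_exp, 2.0 ** g_exp)
            for c_exp in exponent_range(-5, 15, 2)
            for g_exp in exponent_range(3, -15, -2)]
    print(len(grid), "parameter pairs to try")
```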

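4. Train with the best parameters
svm-train -c 2 -g 1 train.scale (this produces a file named train.scale.model; its fields are explained in the supplementary notes below. The default RBF kernel is used here, which is usually a good first choice)

5. Test with the trained model
svm-predict test.scale train.scale.model test.predict (writes the predictions to test.predict and prints the accuracy)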
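The accuracy printed by svm-predict can be double-checked from the two files. This is a minimal sketch under stated assumptions: test.predict holds one predicted label per line (that is, -b 1 was not used, so there is no header line), and the helper names are hypothetical.

```python
def read_labels(lines):
    """Take the first token of every non-empty line as a numeric label."""
    return [float(line.split()[0]) for line in lines if line.strip()]


def accuracy(true_labels, predicted_labels):
    """Return the percentage of positions where both label lists agree."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists differ in length")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return 100.0 * correct / len(true_labels)


if __name__ == "__main__":
    # In practice the lines would come from open("test.scale") and open("test.predict").
    truth = read_labels(["1 1:0.5 2:-0.1", "0 1:-0.2", "1 2:0.9", "1 1:0.3"])
    predicted = read_labels(["1", "0", "0", "1"])
    print("Accuracy = %g%%" % accuracy(truth, predicted))
```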



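IV. Supplementary notes:

1. Changing the cross-validation setting
svm-scale -l -1 -u 1 train >train.scale 

svm-train -v 6 train.scale (-v 6 runs 6-fold cross validation; it is used to find better parameters and reports accuracy only, without writing a model file)

python grid.py train.scale
svm-train -c 2 -g 2 train.scale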

2. About easy.py and grid.py in /libsvm-3.20/tools/
After installing python and gnuplot, add the three folders E:\Program Files\Python, F:\libsvm-3.20\windows and E:\Program Files\gnuplot\bin to the system PATH, then edit the libsvm and gnuplot paths in the two .py files.
In easy.py: gnuplot_exe = r"e:\Program Files\gnuplot\bin\gnuplot.exe"
In grid.py: #svmtrain_pathname = r'f:\libsvm-3.20\windows\svm-train.exe'
            self.gnuplot_pathname = r'e:\Program Files\gnuplot\bin\gnuplot.exe'
You can then follow guide.pdf and use easy.py to run the examples from the guide. Data for the guide's experiments: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/data/

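3. Fields of the model file
svm_type c_svc   (svc means the SVM is used as a classifier, svr means it is used for regression; c_svc is soft-margin classification with the penalty parameter C)
kernel_type rbf     (radial basis function kernel, a good choice in most cases: K(x,y) = exp(-gamma*|x-y|^2))
gamma 0.03125 (parameter of the kernel function)
nr_class 2 (number of classes)
total_sv 287 (total number of support vectors)
rho 102.102 (constant term of the decision function; libsvm writes the decision function as sgn(w^T x - rho), so rho = -b)
label 1 0 (class labels)
nr_sv 144 143 (number of support vectors in each class)
SV (all support vectors are listed below, one per line: the coefficient followed by index:value pairs)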
8192 1:-1 2:-0.688314 3:0.595954 4:0.416735

...
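The header of the model file is plain "key values" text up to the SV line, so it is easy to read programmatically. Below is a minimal sketch with a hypothetical helper name; it keeps every value as a string and stops at the SV marker without parsing the support vectors themselves.

```python
def parse_model_header(lines):
    """Collect 'key value...' pairs from a libsvm model file until the 'SV' line."""
    header = {}
    for line in lines:
        line = line.strip()
        if line == "SV":
            break  # support vectors start here; the header is complete
        if not line:
            continue
        key, _, rest = line.partition(" ")
        header[key] = rest.split()
    return header


if __name__ == "__main__":
    sample = [
        "svm_type c_svc",
        "kernel_type rbf",
        "gamma 0.03125",
        "nr_class 2",
        "total_sv 287",
        "rho 102.102",
        "label 1 0",
        "nr_sv 144 143",
        "SV",
        "8192 1:-1 2:-0.688314 3:0.595954 4:0.416735",
    ]
    header = parse_model_header(sample)
    # The per-class counts should add up to the total number of support vectors.
    print(sum(int(n) for n in header["nr_sv"]) == int(header["total_sv"][0]))
```

In practice the lines would come from open("train.scale.model"); the sample list simply repeats the header shown above.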


4. svm-scale.exe options

"-l lower : x scaling lower limit (default -1)\n"
"-u upper : x scaling upper limit (default +1)\n"
"-y y_lower y_upper : y scaling limits (default: no y scaling)\n"
"-s save_filename : save scaling parameters to save_filename\n"
"-r restore_filename : restore scaling parameters from restore_filename\n"


5. svm-train.exe options

"-s svm_type : set type of SVM (default 0)\n"
" 0 -- C-SVC(multi-class classification)\n"
" 1 -- nu-SVC(multi-class classification)\n"
" 2 -- one-class SVM\n"
" 3 -- epsilon-SVR(regression)\n"
" 4 -- nu-SVR(regression)\n"
"-t kernel_type : set type of kernel function (default 2)\n"
" 0 -- linear: u'*v\n"
" 1 -- polynomial: (gamma*u'*v + coef0)^degree\n"
" 2 -- radial basis function: exp(-gamma*|u-v|^2)\n"
" 3 -- sigmoid: tanh(gamma*u'*v + coef0)\n"
" 4 -- precomputed kernel (kernel values in training_set_file)\n"
"-d degree : set degree in kernel function (default 3)\n"
"-g gamma : set gamma in kernel function (default 1/num_features)\n"
"-r coef0 : set coef0 in kernel function (default 0)\n"
"-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)\n"
"-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)\n"
"-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)\n"
"-m cachesize : set cache memory size in MB (default 100)\n"
"-e epsilon : set tolerance of termination criterion (default 0.001)\n"
"-h shrinking : whether to use the shrinking heuristics, 0 or 1 (default 1)\n"
"-b probability_estimates : whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)\n"
"-wi weight : set the parameter C of class i to weight*C, for C-SVC (default 1)\n"
"-v n: n-fold cross validation mode\n"
"-q : quiet mode (no outputs)\n"


6. Frequently used FAQ entries

Q1: Is there a program to check if my data are in the correct format? 
The svm-train program in libsvm conducts only a simple check of the input data. To do a detailed check, after libsvm 2.85, you can use the python script tools/checkdata.py. See tools/README for details.

Q2: The output of training C-SVM is like the following. What do they mean? 
optimization finished, #iter = 219 
nu = 0.431030 
obj = -100.877286, rho = 0.424632 
nSV = 132, nBSV = 107 
Total nSV = 132
obj is the optimal objective value of the dual SVM problem. rho is the bias term in the decision function sgn(w^Tx - rho). nSV and nBSV are number of support vectors and bounded support vectors (i.e., alpha_i = C). nu-svm is a somewhat equivalent form of C-SVM where C is replaced by nu. nu simply shows the corresponding parameter. More details are in libsvm document.
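With a kernel, the decision function sgn(w^T x - rho) is evaluated as a sum over support vectors. The sketch below shows this for the two-class RBF case. The function names are hypothetical, and it assumes the coefficients are the values stored in front of each support vector in the model file and that a positive value corresponds to the first label listed in the model.

```python
import math


def rbf_kernel(u, v, gamma):
    """exp(-gamma * |u - v|^2) for two dense vectors of equal length."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, coefficients, rho, gamma):
    """Sum of coefficient * K(sv, x) over all support vectors, minus rho."""
    total = sum(coef * rbf_kernel(sv, x, gamma)
                for sv, coef in zip(support_vectors, coefficients))
    return total - rho


if __name__ == "__main__":
    svs = [[0.0, 0.0], [1.0, 1.0]]   # toy support vectors, not real model data
    coefs = [1.0, -1.0]
    for point in ([0.0, 0.0], [1.0, 1.0]):
        value = decision_value(point, svs, coefs, rho=0.0, gamma=0.5)
        print(point, "positive side" if value > 0 else "negative side")
```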

Q3: Should I use float or double to store numbers in the cache ? 
We have float as the default as you can store more numbers in the cache. In general this is good enough but for few difficult cases (e.g. C very very large) where solutions are huge numbers, it might be possible that the numerical precision is not enough using only float.

Q4: Does it make a big difference if I scale each attribute to [0,1] instead of [-1,1]? 
For the linear scaling method, if the RBF kernel is used and parameter selection is conducted, there is no difference. Assume Mi and mi are respectively the maximal and minimal values of the ith attribute. Scaling to [0,1] means
                x'=(x-mi)/(Mi-mi)
For [-1,1],
                x''=2(x-mi)/(Mi-mi)-1.
In the RBF kernel,
                x'-y'=(x-y)/(Mi-mi), x''-y''=2(x-y)/(Mi-mi).
Hence, since the RBF kernel depends on the squared distance, using (C,g) on the [0,1]-scaled data is the same as (C,g/4) on the [-1,1]-scaled data.
Though the performance is the same, the computational time may be different. For data with many zero entries, [0,1]-scaling keeps the sparsity of input data and hence may save the time.
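The relation between the two scalings can be checked numerically. Since the RBF kernel uses the squared distance, doubling every difference multiplies the exponent by 4, so gamma has to be divided by 4 on the [-1,1] data to get the same kernel value. The sketch below verifies this for one pair of points; the data values and helper names are made up for illustration.

```python
import math


def rbf(u, v, gamma):
    """exp(-gamma * |u - v|^2) for two dense vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def scale(vector, mins, maxs, lower, upper):
    """Linearly map each attribute from [min, max] to [lower, upper]."""
    return [lower + (upper - lower) * (x - m) / (M - m)
            for x, m, M in zip(vector, mins, maxs)]


# Hypothetical raw points and per-attribute minima/maxima.
x, y = [2.0, 30.0], [5.0, 10.0]
mins, maxs = [0.0, 10.0], [10.0, 50.0]
g = 0.8

k_unit = rbf(scale(x, mins, maxs, 0.0, 1.0), scale(y, mins, maxs, 0.0, 1.0), g)
k_sym = rbf(scale(x, mins, maxs, -1.0, 1.0), scale(y, mins, maxs, -1.0, 1.0), g / 4)

if __name__ == "__main__":
    print(abs(k_unit - k_sym) < 1e-12)
```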

Q5: My data are unbalanced. Could libsvm handle such problems? 
Yes, there is a -wi option. For example, if you use
> svm-train -s 0 -c 10 -w1 1 -w-1 5 data_file
the penalty for class "-1" is larger. Note that this -w option is for C-SVC only.
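According to the option list above, -wi sets the parameter C of class i to weight*C. A tiny sketch of that arithmetic for the command in this answer follows; the helper name is hypothetical, and classes without a -w option are assumed to keep weight 1.

```python
def effective_costs(cost, weights, labels):
    """Return the per-class penalty weight*C; unlisted classes use weight 1."""
    return {label: cost * weights.get(label, 1.0) for label in labels}


if __name__ == "__main__":
    # Mirrors: svm-train -s 0 -c 10 -w1 1 -w-1 5 data_file
    print(effective_costs(10, {1: 1, -1: 5}, [1, -1]))
```

With these options, errors on class -1 are penalised five times as heavily as errors on class 1.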

Q6: How can I use OpenMP to parallelize LIBSVM on a multicore/shared-memory computer? 
It is very easy if you are using GCC 4.2 or after.
In Makefile, add -fopenmp to CFLAGS.
In class SVC_Q of svm.cpp, modify the for loop of get_Q to:
#pragma omp parallel for private(j) 
for(j=start;j<len;j++)
In the subroutine svm_predict_values of svm.cpp, add one line to the for loop:
#pragma omp parallel for private(i) 
for(i=0;i<l;i++)
	kvalue[i] = Kernel::k_function(x,model->SV[i],model->param);
For regression, you need to modify class SVR_Q instead. The loop in svm_predict_values is also different because you need a reduction clause for the variable sum:
#pragma omp parallel for private(i) reduction(+:sum) 
for(i=0;i<model->l;i++)
	sum += sv_coef[i] * Kernel::k_function(x,model->SV[i],model->param);
Then rebuild the package. Kernel evaluations in training/testing will be parallelized. An example of running this modification on an 8-core machine using the data set ijcnn1:
8 cores:
%setenv OMP_NUM_THREADS 8
%time svm-train -c 16 -g 4 -m 400 ijcnn1
27.1sec
1 core:
%setenv OMP_NUM_THREADS 1
%time svm-train -c 16 -g 4 -m 400 ijcnn1
79.8sec
For this data, kernel evaluations take 80% of training time. In the above example, we assume you use csh. For bash, use
export OMP_NUM_THREADS=8
instead.
For Python interface, you need to add the -lgomp link option:
$(CXX) -lgomp -shared -dynamiclib svm.o -o libsvm.so.$(SHVER)
For MS Windows, you need to add /openmp in CFLAGS of Makefile.win

Q7: How could I know which training instances are support vectors? 
It's very simple. Since version 3.13, you can use the function
void svm_get_sv_indices(const struct svm_model *model, int *sv_indices)
to get indices of support vectors. For example, in svm-train.c, after
model = svm_train(&prob, &param);
you can add
int nr_sv = svm_get_nr_sv(model);
int *sv_indices = Malloc(int, nr_sv);
svm_get_sv_indices(model, sv_indices);
for (int i=0; i<nr_sv; i++)
	printf("instance %d is a support vector\n", sv_indices[i]);
If you use matlab interface, you can directly check
model.sv_indices

Q8: After doing cross validation, why there is no model file outputted ? 
Cross validation is used for selecting good parameters. After finding them, you want to re-train the whole data without the -v option.

Q9: How do I choose the kernel? 
In general we suggest you try the RBF kernel first. A result by Keerthi and Lin shows that if RBF is used with model selection, then there is no need to consider the linear kernel. The kernel matrix using sigmoid may not be positive definite, and in general its accuracy is not better than RBF (see the paper by Lin and Lin). Polynomial kernels are ok, but if a high degree is used, numerical difficulties tend to happen (consider that the dth power of a value <1 goes to 0 and that of a value >1 goes to infinity).

Q10: I press the "load" button to load data points but why svm-toy does not draw them ? 
The program svm-toy assumes both attributes (i.e. x-axis and y-axis values) are in (0,1). Hence you want to scale your data to between a small positive number and a number less than but very close to 1. Moreover, class labels must be 1, 2, or 3 (not 1.0, 2.0 or anything else).


Q11:Feature selection tool
This is a simple python script to use F-score for selecting features. To run it, please put it in the sub-directory "tools" of LIBSVM.
Usage: ./fselect.py training_file [testing_file]
Output files: .fscore shows importance of features, .select gives the running log, and .pred gives testing results.
More information about this implementation can be found in Y.-W. Chen and C.-J. Lin,Combining SVMs with various feature selection strategies. To appear in the book "Feature extraction, foundations and applications." 2005. This implementation is still preliminary. More comments are very welcome.
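For orientation, the F-score of a single feature can be sketched as below. This follows the definition as I understand it from Chen and Lin (squared distances of the two class means from the overall mean, divided by the sum of the two within-class sample variances); it is not the code of fselect.py, the function name is hypothetical, and the sketch assumes a two-class problem with at least two samples per class and a non-zero denominator.

```python
def f_score(positive, negative):
    """F-score of one feature, given its values in the positive and negative class."""
    def mean(values):
        return sum(values) / float(len(values))

    def sample_variance(values, center):
        return sum((v - center) ** 2 for v in values) / float(len(values) - 1)

    overall_mean = mean(list(positive) + list(negative))
    pos_mean, neg_mean = mean(positive), mean(negative)
    between = (pos_mean - overall_mean) ** 2 + (neg_mean - overall_mean) ** 2
    within = sample_variance(positive, pos_mean) + sample_variance(negative, neg_mean)
    return between / within


if __name__ == "__main__":
    print(f_score([1.0, 2.0, 3.0], [7.0, 8.0, 9.0]))   # well-separated feature
    print(f_score([1.0, 5.0, 9.0], [2.0, 6.0, 10.0]))  # heavily overlapping feature
```

A larger score suggests that the feature separates the two classes better when considered on its own.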


Q12:Weights for data instances
Users can give a weight to each data instance. For LIBSVM users, please download the zip file from the libsvmtools page (MATLAB and Python interfaces are included). You must store weights in a separate file and specify -W your_weight_file. This setting is different from earlier versions where weights are in the first column of training data.
1)Training/testing sets are the same as those for standard LIBSVM/LIBLINEAR.
2)We do not support weights for test data.
3)All solvers are supported.
4)Matlab/Python interfaces for both LIBSVM/LIBLINEAR are supported.


Q13:Binary-class Cross Validation with Different Criteria

Reference: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/eval/index.html
