Last time we looked at Mahout's math module, mahout-math, which contains many commonly used mathematical and statistical utilities; quite a few of them come up in practice, so a solid grasp of these basics pays off. Mahout also exposes many of its tools through the command line. All of the commands are listed below (the list changes between versions, and each command takes its own parameters). Many of the commands resemble one another, and mastering every one takes real effort, but even a quick survey shows what Mahout can actually do and which capabilities can be used directly. For reference:
| Command | Description |
| --- | --- |
| arff.vector | Generate Vectors from an ARFF file or directory |
| baumwelch | Baum-Welch algorithm for unsupervised HMM training |
| buildforest | Build the random forest classifier |
| canopy | Canopy clustering |
| cat | Print a file or resource as the logistic regression models would see it |
| cleansvd | Cleanup and verification of SVD output |
| clusterdump | Dump cluster output to text |
| clusterpp | Groups Clustering Output In Clusters |
| cmdump | Dump confusion matrix in HTML or text formats |
| concatmatrices | Concatenates 2 matrices of same cardinality into a single matrix |
| cvb | LDA via Collapsed Variation Bayes (0th deriv. approx) |
| cvb0_local | LDA via Collapsed Variation Bayes, in memory locally |
| describe | Describe the fields and target variable in a data set |
| evaluateFactorization | Compute RMSE and MAE of a rating matrix factorization against probes |
| fkmeans | Fuzzy K-means clustering |
| hmmpredict | Generate random sequence of observations by given HMM |
| itemsimilarity | Compute the item-item-similarities for item-based collaborative filtering |
| kmeans | K-means clustering |
| lucene.vector | Generate Vectors from a Lucene index |
| lucene2seq | Generate Text SequenceFiles from a Lucene index |
| matrixdump | Dump matrix in CSV format |
| matrixmult | Take the product of two matrices |
| parallelALS | ALS-WR factorization of a rating matrix |
| qualcluster | Runs clustering experiments and summarizes results in a CSV |
| recommendfactorized | Compute recommendations using the factorization of a rating matrix |
| recommenditembased | Compute recommendations using item-based collaborative filtering |
| regexconverter | Convert text files on a per line basis based on regular expressions |
| resplit | Splits a set of SequenceFiles into a number of equal splits |
| rowid | Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>} |
| rowsimilarity | Compute the pairwise similarities of the rows of a matrix |
| runAdaptiveLogistic | Score new production data using a probably trained and validated AdaptiveLogisticRegression model |
| runlogistic | Run a logistic regression model against CSV data |
| seq2encoded | Encoded Sparse Vector generation from Text sequence files |
| seq2sparse | Sparse Vector generation from Text sequence files |
| seqdirectory | Generate sequence files (of Text) from a directory |
| seqdumper | Generic Sequence File dumper |
| seqmailarchives | Creates SequenceFile from a directory containing gzipped mail archives |
| seqwiki | Wikipedia XML dump to sequence file |
| spectralkmeans | Spectral k-means clustering |
| split | Split input data into test and train sets |
| splitDataset | Split a rating dataset into training and probe parts |
| ssvd | Stochastic SVD |
| streamingkmeans | Streaming k-means clustering |
| svd | Lanczos Singular Value Decomposition |
| testforest | Test the random forest classifier |
| testnb | Test the Vector-based Bayes classifier |
| trainAdaptiveLogistic | Train an AdaptiveLogisticRegression model |
| trainlogistic | Train a logistic regression using stochastic gradient descent |
| trainnb | Train the Vector-based Bayes classifier |
| transpose | Take the transpose of a matrix |
| validateAdaptiveLogistic | Validate an AdaptiveLogisticRegression model against a hold-out data set |
| vecdist | Compute the distances between a set of Vectors (or Cluster or Canopy; they must fit in memory) and a list of Vectors |
| vectordump | Dump vectors from a sequence file to text |
| viterbi | Viterbi decoding of hidden states from given output states sequence |
I have not used every one of these commands myself, and each has many usage details beyond this one-line summary.
Mahout provides many algorithms for clustering, classification, and recommendation (collaborative filtering), and these are genuinely useful for data analysis. The most mature area in practice is recommendation: it has been deployed in many real systems with good results. Clustering and classification, by comparison, see more limited use and deserve further study.
Earlier posts already covered recommendation, from theory to hands-on practice; below is an example of a logistic regression model.
1. Preparing the data
The example uses the iris dataset, one of the most widely used datasets for data-analysis experiments, so it needs no introduction. Open R and type `iris` to see what the data looks like, then export it with:

```r
write.csv(iris, file="D:/work_doc/Doc/iris.csv")
```
The exported data looks like this:

```
"ID","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
"1",5.1,3.5,1.4,0.2,"setosa"
"2",4.9,3,1.4,0.2,"setosa"
"3",4.7,3.2,1.3,0.2,"setosa"
"4",4.6,3.1,1.5,0.2,"setosa"
```
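As a minimal pure-Java sketch of what this format implies (no Mahout involved), one exported row can be split into four numeric predictors plus a class index. The `IrisLineParser` class, its `parseIrisLine` helper, and the hard-coded species list below are illustrative assumptions of mine, loosely mirroring what Mahout's CsvRecordFactory does internally:

```java
import java.util.Arrays;
import java.util.List;

public class IrisLineParser {

    // The three iris species, in a fixed order so each maps to a class index.
    static final List<String> SPECIES = Arrays.asList("setosa", "versicolor", "virginica");

    // Parse one exported iris row, e.g. "1",5.1,3.5,1.4,0.2,"setosa".
    // Returns the 4 numeric predictors; the species index is written to target[0].
    static double[] parseIrisLine(String line, int[] target) {
        String[] cols = line.split(",");
        double[] features = new double[4];
        for (int i = 0; i < 4; i++) {
            features[i] = Double.parseDouble(cols[i + 1]); // skip the leading ID column
        }
        target[0] = SPECIES.indexOf(cols[5].replace("\"", ""));
        return features;
    }

    public static void main(String[] args) {
        int[] target = new int[1];
        double[] f = parseIrisLine("\"1\",5.1,3.5,1.4,0.2,\"setosa\"", target);
        System.out.println(Arrays.toString(f) + " -> class " + target[0]);
    }
}
```

In the real example below, this bookkeeping is handled by Mahout's CsvRecordFactory, which is configured with the predictor names and types instead.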
2. Working through it in Java
```java
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.util.List;
import java.util.Locale;

import org.apache.commons.io.FileUtils;
import org.apache.mahout.classifier.sgd.CsvRecordFactory;
import org.apache.mahout.classifier.sgd.LogisticModelParameters;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;

import com.google.common.base.Charsets;
import com.google.common.collect.Lists;

public class IrisLRTest {

    private static LogisticModelParameters lmp;
    private static PrintWriter output;

    public static void main(String[] args) throws IOException {
        // Initialization
        lmp = new LogisticModelParameters();
        output = new PrintWriter(new OutputStreamWriter(System.out, Charsets.UTF_8), true);
        lmp.setLambda(0.001);
        lmp.setLearningRate(50);
        lmp.setMaxTargetCategories(3);
        lmp.setNumFeatures(4);
        List<String> targetCategories = Lists.newArrayList("setosa", "versicolor", "virginica");
        lmp.setTargetCategories(targetCategories);
        lmp.setTargetVariable("Species"); // the attribute to predict is Species
        List<String> predictorList = Lists.newArrayList(
                "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width");
        List<String> typeList = Lists.newArrayList("numeric", "numeric", "numeric", "numeric");
        lmp.setTypeMap(predictorList, typeList);

        // Read the data
        List<String> raw = FileUtils.readLines(new File("D:\\work_doc\\Doc\\iris.csv"));
        String header = raw.get(0);
        List<String> content = raw.subList(1, raw.size());
        CsvRecordFactory csv = lmp.getCsvRecordFactory();
        csv.firstLine(header);

        // Training
        OnlineLogisticRegression lr = lmp.createRegression();
        for (int i = 0; i < 100; i++) { // number of training passes
            for (String line : content) {
                Vector input = new RandomAccessSparseVector(lmp.getNumFeatures());
                int targetValue = csv.processLine(line, input);
                lr.train(targetValue, input);
            }
        }

        // Evaluate the classifier (note: on the training data itself)
        double correctRate = 0;
        double sampleCount = content.size();
        for (String line : content) {
            Vector v = new SequentialAccessSparseVector(lmp.getNumFeatures());
            int target = csv.processLine(line, v);
            int score = lr.classifyFull(v).maxValueIndex();
            // System.out.println("Target:" + target + "\tPredicted:" + score);
            if (score == target) {
                correctRate++;
            }
        }
        output.printf(Locale.ENGLISH, "Rate = %.2f%n", correctRate / sampleCount);
    }
}
```
The code is commented and the flow is easy to follow. This model is not unique in this respect: many other algorithms follow the same overall process of setting up parameters, training, and evaluating, differing only in the specific training method and algorithm.
The code here is based on Mahout, but many of the same models can be built just as well in R, with similar basic steps.
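To make the training loop concrete without the Mahout dependency, here is a sketch of the same core idea — logistic regression trained by stochastic gradient descent — in plain Java on a made-up two-class toy problem. The class name, data, and parameters are all illustrative assumptions; Mahout's OnlineLogisticRegression additionally handles regularization, learning-rate annealing, and more than two classes:

```java
import java.util.Arrays;

public class SgdLogisticSketch {

    // The logistic (sigmoid) link function.
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Train by SGD: one weight update per example per pass.
    // Returns {w0, w1, bias}.
    static double[] train(double[][] x, int[] y, int passes, double rate) {
        double[] w = new double[3];
        for (int pass = 0; pass < passes; pass++) {
            for (int i = 0; i < x.length; i++) {
                double p = sigmoid(w[0] * x[i][0] + w[1] * x[i][1] + w[2]);
                double err = y[i] - p; // gradient of the log-likelihood
                w[0] += rate * err * x[i][0];
                w[1] += rate * err * x[i][1];
                w[2] += rate * err;
            }
        }
        return w;
    }

    // Fraction of examples classified correctly (evaluated on the
    // training data itself, as in the Mahout example above).
    static double accuracy(double[] w, double[][] x, int[] y) {
        int correct = 0;
        for (int i = 0; i < x.length; i++) {
            double p = sigmoid(w[0] * x[i][0] + w[1] * x[i][1] + w[2]);
            if ((p > 0.5 ? 1 : 0) == y[i]) correct++;
        }
        return (double) correct / x.length;
    }

    public static void main(String[] args) {
        // Toy two-feature, two-class data (illustrative, not iris)
        double[][] x = {{0.5, 0.5}, {1.0, 0.5}, {2.0, 1.5}, {1.5, 2.0}, {0.2, 1.0}, {2.5, 2.5}};
        int[] y = {0, 0, 1, 1, 0, 1};
        double[] w = train(x, y, 1000, 0.1);
        System.out.println("weights = " + Arrays.toString(w) + ", accuracy = " + accuracy(w, x, y));
    }
}
```

The nested loop structure — outer passes over the data, inner per-example updates — is exactly the shape of the training section in the Mahout example; only the update rule is spelled out here instead of hidden behind `lr.train(...)`.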