1. mahout seqdirectory
$ mahout seqdirectory --input (-i) input Path to job input directory(原始文本文件). --output (-o) output The directory pathname for output.(<Text,Text>Sequence File) -ow
功能: 将原始文本数据集转换为< Text, Text > SequenceFile
2. mahout seq2sparke
功能: Convert and preprocesses the dataset(<Text,Text> SequenceFile) into a < Text, VectorWritable > SequenceFile containing term frequencies for each document.
即根据Sequence File转换为tfidf向量文件
说明:If we wanted to use different parsing methods or transformations on the term frequency vectors we could supply different options here e.g.: -ng 2 for bigrams or -n 2 for L2 length normalization
mahout seq2sparse --output (-o) output The directory pathname for output. --input (-i) input Path to job input directory. --weight (-wt) weight The kind of weight to use. Currently TF or TFIDF. Default: TFIDF --norm (-n) norm The norm to use, expressed as either a float or "INF" if you want to use the Infinite norm. Must be greater or equal to 0. The default is not to normalize --overwrite (-ow) If set, overwrite the output directory --sequentialAccessVector (-seq) (Optional) Whether output vectors should be SequentialAccessVectors. If set true else false --namedVector (-nv) (Optional) Whether output vectors should be NamedVectors. If set true else false
-i Sequence File文件目录
-o 向量文件输出目录
-wt 权重类型,支持TF或者TFIDF两种选项,默认TFIDF
-n 使用的正规化,使用浮点数或者"INF"表示,
-ow 指定该参数,将覆盖已有的输出目录
-seq 指定该参数,那么输出的向量是SequentialAccessVectors
-nv 指定该参数,那么输出的向量是NamedVectors
3. mahout split
功能:Split the preprocessed dataset into training and testing sets.
将预处理的tfidf向量集转换为training和testing向量集
$ mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/20news-train-vectors --testOutput ${WORK_DIR}/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
说明:如上是将向量数据集分为训练数据和检测数据,以随机40-60拆分
3. mahout trainnb
功能:训练分类器
mahout trainnb --input (-i) input Path to job input directory. --output (-o) output The directory pathname for output. --alphaI (-a) alphaI Smoothing parameter. Default is 1.0 --trainComplementary (-c) Train complementary? Default is false. --labelIndex (-li) labelIndex The path to store the label index in --overwrite (-ow) If present, overwrite the output directory before running job --help (-h) Print out help --tempDir tempDir Intermediate output directory --startPhase startPhase First phase to run --endPhase endPhase Last phase to run
-i 输入路径
-o 输出路径
-a
-c 补偿性训练
-li label index文件的目录
-ow 指定该参数,删除输出目录
tempDir MapReduce作业的中间结果
startPhase 运行的第一个阶段
endPhase 运行的最后一个阶段
4. mahout testnb
功能:检验Bayes分类器
mahout testnb --input (-i) input Path to job input directory. --output (-o) output The directory pathname for output. --overwrite (-ow) If present, overwrite the output directory before running job --model (-m) model The path to the model built during training --testComplementary (-c) Test complementary? Default is false. --runSequential (-seq) Run sequential? --labelIndex (-l) labelIndex The path to the location of the label index --help (-h) Print out help --tempDir tempDir Intermediate output directory --startPhase startPhase First phase to run --endPhase endPhase Last phase to run
-i 输入路径
-o 输出路径
-ow 覆盖输出目录
-c