OnlineLogisticRegression

Mahout source code analysis

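AdaptiveLogisticRegression.java implements the OnlineLearner interface.
It maintains a pool of ordinary OnlineLogisticRegression learners, and each member of the pool has a different learning rate.
One wrinkle is that the pool actually holds CrossFoldLearners, each of which contains several OnlineLogisticRegression objects.
These pools allow performance to be estimated on the fly even when many passes are made over the data. If you already have good hyper-parameters, you may prefer to run a single CrossFoldLearner with those settings.
The fitness measure used here is AUC, which means that AdaptiveLogisticRegression is best suited to classification problems with a binary target variable.
Non-binary cases could be handled by extending OnlineAuc.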
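To make the AUC fitness measure concrete, here is a minimal sketch of the batch definition: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The class name AucSketch and its method are hypothetical and use plain arrays; this is not Mahout's OnlineAuc, which has to produce its estimate incrementally while training.

```java
/** Hypothetical helper, not part of Mahout: exact pairwise AUC for a binary target. */
final class AucSketch {
    /**
     * @param actual target values, each 0 or 1
     * @param score  classifier score for category 1, same length as actual
     * @return fraction of (positive, negative) pairs ranked correctly, ties counted as 0.5
     */
    static double auc(int[] actual, double[] score) {
        if (actual.length != score.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double wins = 0.0;
        long pairs = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] != 1) {
                continue; // outer loop visits positives only
            }
            for (int j = 0; j < actual.length; j++) {
                if (actual[j] != 0) {
                    continue; // inner loop visits negatives only
                }
                pairs++;
                if (score[i] > score[j]) {
                    wins += 1.0;
                } else if (score[i] == score[j]) {
                    wins += 0.5;
                }
            }
        }
        if (pairs == 0) {
            throw new IllegalArgumentException("need at least one example of each class");
        }
        return wins / pairs;
    }
}
```

Because the definition compares one class against the other, it only makes sense as written for two categories, which is why the AUC-based fitness ties the adaptive learner to binary targets.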
Mahout classification source code analysis
Interface OnlineLearner:
Implementing classes:
AbstractOnlineLogisticRegression, AdaptiveLogisticRegression, CrossFoldLearner, OnlineLogisticRegression
Methods:
void train(int actual,
           Vector instance)
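Updates the model using a particular target variable value and a feature vector. Note the assumption made here: if multiple passes are made over the training data, the training examples should be presented in the same order each time.
Parameters:
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.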
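The contract of train(actual, instance) can be illustrated with a toy binary learner that takes one stochastic gradient step per example. Everything here is an assumption made for illustration: the class TinyOnlineLogistic, the use of double[] in place of Mahout's Vector, and the fixed learning rate with no regularization. It is a sketch of the general technique, not the update rule of OnlineLogisticRegression.

```java
/** Hypothetical stand-in for an online learner; not Mahout's OnlineLogisticRegression. */
final class TinyOnlineLogistic {
    private final double[] beta;       // one weight per feature
    private final double learningRate; // fixed step size (an assumption; no annealing here)

    TinyOnlineLogistic(int numFeatures, double learningRate) {
        this.beta = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** Returns the estimated probability of category 1. */
    double classifyScalar(double[] instance) {
        double dot = 0.0;
        for (int i = 0; i < beta.length; i++) {
            dot += beta[i] * instance[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    /** One gradient step; actual must lie in [0..n) with n = 2 for this binary sketch. */
    void train(int actual, double[] instance) {
        if (actual < 0 || actual >= 2) {
            throw new IllegalArgumentException("actual must be 0 or 1, got " + actual);
        }
        double error = actual - classifyScalar(instance);
        for (int i = 0; i < beta.length; i++) {
            beta[i] += learningRate * error * instance[i];
        }
    }
}
```

A caller would feed examples one at a time, for instance `learner.train(1, new double[] {1.0, 0.7})`, and may call classifyScalar at any point between updates.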
void train(long trackingKey,
           java.lang.String groupKey,
           int actual,
           Vector instance)
Updates the model using a particular target variable value and a feature vector.
Parameters:
trackingKey - The tracking key for this training example.
groupKey - An optional value that allows examples to be grouped in the computation of the update to the model.
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.

void train(long trackingKey,
           int actual,
           Vector instance)
Updates the model using a particular target variable value and a feature vector.
Parameters:
trackingKey - The tracking key for this training example.
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.
void close()
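Prepares the classifier for classification and discards any temporary data structures. An online classifier should still be able to accept more training after close() has been called, but closing the classifier may make classification more efficient.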

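Class summary
public abstract class AbstractVectorClassifier
Defines the interface for classifiers that take a vector as input. It is implemented as an abstract class so that it can provide many of the methods related to classifying vectors.
Constructor: public AbstractVectorClassifier()
Methods: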
1 public abstract int numCategories()
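Returns the number of categories of the target variable. A vector classifier encodes its output using a zero-based 1-of-numCategories encoding.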
2 public abstract Vector classify(Vector instance)
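Description: classifies a vector and returns a vector of scores, one for each category except category 0. It is assumed that the score for the missing category is one minus the sum of the scores that are returned. Note that the missing score is the 0-th score.
Parameter: the feature vector to classify.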
3 public Vector classifyNoLink(Vector features)
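Classifies a vector without applying the inverse link function. For logistic regression and other generalized linear models, this is just the linear part of the classification.
Parameter: the feature vector to classify.
Returns: a vector of scores. If transformed by the link function, these will become probabilities.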
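The relation between classifyNoLink and classify can be sketched as follows: the first yields raw linear scores, and applying the inverse link turns them into probabilities. The class LinkSketch below is hypothetical and works on double[] rather than Mahout's Vector. It assumes the n-1 coding described above, where category 0 is implicit with a linear score of 0, and it is not a copy of Mahout's own link implementation.

```java
/** Hypothetical helper, not part of Mahout: inverse link for an n-1 coded logistic model. */
final class LinkSketch {
    /** Binary case: maps one linear score to the probability of category 1. */
    static double logistic(double score) {
        return 1.0 / (1.0 + Math.exp(-score));
    }

    /**
     * Maps n-1 linear scores (categories 1..n-1) to n-1 probabilities.
     * Computes exp(s_i) / (1 + sum_j exp(s_j)), shifted by the largest score
     * so that large scores do not overflow.
     */
    static double[] link(double[] scores) {
        double max = 0.0; // the implicit category 0 has score 0
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double denominator = Math.exp(-max); // contribution of category 0
        double[] p = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            p[i] = Math.exp(scores[i] - max);
            denominator += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= denominator;
        }
        return p;
    }
}
```

With a single score the general formula reduces to the familiar logistic function, so `link(new double[] {s})[0]` and `logistic(s)` agree up to rounding.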
4 public abstract double classifyScalar(Vector instance)
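Classifies a vector in the special case of a binary classifier, where classify would return a vector with only one element. Using this method avoids allocating that vector.
Parameter: the vector to classify.
Returns: the score for category 1.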
5 public Vector classifyFull(Vector instance)

Returns n probabilities, one for each category. If you can use an n-1 coding, and are touchy about allocation performance, then the classify method is probably better to use. The 0-th element of the score vector returned by this method is the missing score as computed by the classify method.
Parameter: the feature vector to classify.
返回:A vector of probabilities, one for each category.
6 public Vector classifyFull(Vector r,
                           Vector instance)
Returns n probabilities, one for each category into a pre-allocated vector. One vector allocation is still done in the process of multiplying by the coefficient matrix, but that is hard to avoid. The cost of such an ephemeral allocation is very small in any case compared to the multiplication itself.

Parameters:
r - Where to put the results.
instance - A vector of features to be classified.
Returns:
A vector of probabilities, one for each category
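The two classifyFull variants above differ only in who allocates the result. A minimal sketch, assuming the n-1 scores from classify are already available as a double[]: the 0-th entry is filled in as one minus the sum of the others. The class FullScores and its method names are hypothetical, and double[] stands in for Mahout's Vector.

```java
/** Hypothetical helper, not part of Mahout: expands n-1 coded scores to all n categories. */
final class FullScores {
    /** Allocates a new array of length n and fills it from the n-1 partial scores. */
    static double[] classifyFull(double[] partial) {
        return classifyFull(new double[partial.length + 1], partial);
    }

    /** Writes the n probabilities into the pre-allocated array r and returns it. */
    static double[] classifyFull(double[] r, double[] partial) {
        if (r.length != partial.length + 1) {
            throw new IllegalArgumentException("r must have one more element than partial");
        }
        double sum = 0.0;
        for (int i = 0; i < partial.length; i++) {
            r[i + 1] = partial[i]; // categories 1..n-1 are copied as-is
            sum += partial[i];
        }
        r[0] = 1.0 - sum; // the missing 0-th score
        return r;
    }
}
```

Reusing one result array across many calls is the point of the second form: a tight scoring loop can pass the same r every time instead of allocating per example.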
7 classify
public Matrix classify(Matrix data)
Returns n-1 probabilities, one for each category but the 0-th, for each row of a matrix. The probability of the missing 0-th category is 1 - rowSum(this result).

Parameters:
data - The matrix whose rows are vectors to classify
Returns:
A matrix of scores, one row per row of the input matrix, one column for each category but the 0-th.
8 classifyFull
public Matrix classifyFull(Matrix data)
Returns n probabilities, one for each category, for each row of a matrix.

Parameters:
data - The matrix whose rows are vectors to classify
Returns:
A matrix of scores, one row per row of the input matrix, one column for each category.
9 classifyScalar
public Vector classifyScalar(Matrix data)
Returns a vector of probabilities of category 1, one for each row of a matrix. This only makes sense if there are exactly two categories, but calling this method in that case can save a number of vector allocations.

Parameters:
data - The matrix whose rows are vectors to classify
Returns:
A vector of scores, with one value per row of the input matrix.
10 logLikelihood
public double logLikelihood(int actual,
                            Vector data)
Returns a measure of how good the classification for a particular example actually is.

Parameters:
actual - The correct category for the example.
data - The vector to be classified.
Returns:
The log likelihood of the correct answer as estimated by the current model. This will always be <= 0 and larger (closer to 0) indicates better accuracy. In order to simplify code that maintains running averages, we bound this value at -100.
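The bound at -100 is easy to see in a small sketch. It assumes the full probability vector (as classifyFull would return it) has already been computed, so unlike the Mahout method it takes probabilities rather than a feature vector; the class LogLikelihoodSketch and the constant FLOOR are hypothetical names.

```java
/** Hypothetical helper, not part of Mahout: bounded log-likelihood of the correct category. */
final class LogLikelihoodSketch {
    static final double FLOOR = -100.0; // lower bound described in the text above

    /**
     * @param actual            index of the correct category
     * @param fullProbabilities one probability per category, summing to 1
     * @return log of the probability assigned to the correct category, never below FLOOR
     */
    static double logLikelihood(int actual, double[] fullProbabilities) {
        double p = fullProbabilities[actual];
        // Math.log(0) is negative infinity; the max keeps running averages finite.
        return Math.max(FLOOR, Math.log(p));
    }
}
```

Without the floor, a single example to which the model assigns probability 0 would drive any running average of the log-likelihood to negative infinity and keep it there.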


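II. Class BayesFileFormatter
Formats a file so that it can be read by the Bayes map/reduce job.
Each line of the file is one document: the first token is the label and the remaining tokens are the terms.
Methods: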
1 public static void collapse(java.lang.String label,
                            org.apache.lucene.analysis.Analyzer analyzer,
                            java.io.File inputDir,
                            java.nio.charset.Charset charset,
                            java.io.File outputFile)
                     throws java.io.IOException
Collapses all the files in an input directory into a single file in the format the Bayes classifier can process, one document per line.
Parameters:
label - The label
analyzer - The analyzer to use
inputDir - The input Directory
charset - The charset of the input files
outputFile - The file to collapse to
Throws:
java.io.IOException
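The one-document-per-line layout can be sketched without Lucene. In the hypothetical class below, tokenize is a crude stand-in for the Analyzer (lower-casing and splitting on anything that is not a letter or digit), and the separators, a tab after the label and single spaces between terms, are assumptions made for illustration; the text above only states that the label comes first and the terms follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of Mahout: builds one "label then terms" line per document. */
final class BayesLineSketch {
    /** Crude stand-in for a Lucene Analyzer: lower-cases and splits on non-alphanumerics. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Turns one document into one line, so line breaks inside the document disappear. */
    static String toLine(String label, String documentText) {
        StringBuilder line = new StringBuilder(label);
        String separator = "\t"; // assumed separator between label and first term
        for (String token : tokenize(documentText)) {
            line.append(separator).append(token);
            separator = " "; // assumed separator between terms
        }
        return line.toString();
    }
}
```

Collapsing a directory would then amount to calling toLine once per input file and writing each result as a line of the output file.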
2 format
public static void format(java.lang.String label,
                          org.apache.lucene.analysis.Analyzer analyzer,
                          java.io.File input,
                          java.nio.charset.Charset charset,
                          java.io.File outDir)
                   throws java.io.IOException
Writes the input files to the output directory, one output file per input file.
Parameters:
label - The label of the file
analyzer - The analyzer to use
input - The input file or directory. May not be null
charset - The Character set of the input files
outDir - The output directory. Files will be written there with the same name as the input file
Throws:
java.io.IOException
3 readerToDocument
public static java.lang.String[] readerToDocument(org.apache.lucene.analysis.Analyzer analyzer,
                                                  java.io.Reader reader)
                                           throws java.io.IOException
Converts the contents of a Reader into an array of tokens.
Parameters:
analyzer - The Analyzer to use
reader - The reader to feed to the Analyzer
Returns:
An array of unique tokens
Throws:
java.io.IOException
III. Class ClassifierResult
The result of classifying a document: the label and its associated score (probability).
IV. Classify: runs the Bayes classifier using the given model location (HDFS/HBase).
V. ConfusionMatrix: stores the result of classifying a test dataset.
VI. ResultAnalyzer: captures the classification statistics and displays them in a tabular manner.
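What a confusion matrix stores can be shown in a few lines: a count for every (actual label, predicted label) pair seen on the test set. The class ConfusionMatrixSketch and its methods are hypothetical and written from scratch for illustration; they are not the API of Mahout's ConfusionMatrix.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Mahout: counts of (actual, predicted) label pairs. */
final class ConfusionMatrixSketch {
    private final List<String> labels;
    private final int[][] counts; // counts[actual][predicted]

    ConfusionMatrixSketch(List<String> labels) {
        this.labels = new ArrayList<>(labels);
        this.counts = new int[labels.size()][labels.size()];
    }

    private int indexOf(String label) {
        int i = labels.indexOf(label);
        if (i < 0) {
            throw new IllegalArgumentException("unknown label: " + label);
        }
        return i;
    }

    /** Records one classified test example. */
    void addInstance(String actual, String predicted) {
        counts[indexOf(actual)][indexOf(predicted)]++;
    }

    int getCount(String actual, String predicted) {
        return counts[indexOf(actual)][indexOf(predicted)];
    }

    /** Fraction of examples on the diagonal; 0.0 when nothing has been recorded. */
    double accuracy() {
        int correct = 0;
        int total = 0;
        for (int a = 0; a < counts.length; a++) {
            for (int p = 0; p < counts.length; p++) {
                total += counts[a][p];
                if (a == p) {
                    correct += counts[a][p];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }
}
```

Summary statistics such as accuracy fall out of the counts directly, which is the kind of tabular report a result analyzer can build on top of the matrix.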
