Twenty Newsgroups Classification, Task 2: seq2sparse

seq2sparse corresponds to org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles in Mahout. The job-monitoring page from yesterday's run shows that this step consists of 7 jobs: (1) DocumentTokenizer, (2) WordCount, (3) MakePartialVectors, (4) MergePartialVectors, (5) VectorTfIdf Document Frequency Count, (6) MakePartialVectors, (7) MergePartialVectors. Roughly speaking, (1) and (2) tokenize the documents and count terms to build the dictionary, (3) and (4) assemble the TF vectors, (5) counts document frequencies, and (6) and (7) assemble the final TF-IDF vectors. Printing the help message of SparseVectorsFromSequenceFiles shows the following:

Usage:
 [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize
<chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFSigma
<maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm>
--minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize>
--overwrite --help --sequentialAccessVector --namedVector --logNormalize]
Options
  --minSupport (-s) minSupport        (Optional) Minimum Support. Default
                                      Value: 2
  --analyzerName (-a) analyzerName    The class name of the analyzer
  --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000 MB
  --output (-o) output                The directory pathname for output.
  --input (-i) input                  Path to job input directory.
  --minDF (-md) minDF                 The minimum document frequency. Default
                                      is 1
  --maxDFSigma (-xs) maxDFSigma       What portion of the tf (tf-idf) vectors
                                      to be used, expressed in times the
                                      standard deviation (sigma) of the
                                      document frequencies of these vectors.
                                      Can be used to remove really high
                                      frequency terms. Expressed as a double
                                      value. Good value to be specified is 3.0.
                                      In case the value is less than 0 no
                                      vectors will be filtered out. Default is
                                      -1.0. Overrides maxDFPercent
  --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF.
                                      Can be used to remove really high
                                      frequency terms. Expressed as an integer
                                      between 0 and 100. Default is 99. If
                                      maxDFSigma is also set, it will override
                                      this value.
  --weight (-wt) weight               The kind of weight to use. Currently TF
                                      or TFIDF
  --norm (-n) norm                    The norm to use, expressed as either a
                                      float or "INF" if you want to use the
                                      Infinite norm. Must be greater or equal
                                      to 0. The default is not to normalize
  --minLLR (-ml) minLLR               (Optional) The minimum Log Likelihood
                                      Ratio (Float). Default is 1.0
  --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.
                                      Default Value: 1
  --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to
                                      create (2 = bigrams, 3 = trigrams, etc).
                                      Default Value: 1
  --overwrite (-ow)                   If set, overwrite the output directory
  --help (-h)                         Print out help
  --sequentialAccessVector (-seq)     (Optional) Whether output vectors should
                                      be SequentialAccessVectors. If set true
                                      else false
  --namedVector (-nv)                 (Optional) Whether output vectors should
                                      be NamedVectors. If set true else false
  --logNormalize (-lnorm)             (Optional) Whether output vectors should
                                      be logNormalized. If set true else false

The terminal output from yesterday's run shows that this step was invoked with the following command:

./bin/mahout seq2sparse -i /home/mahout/mahout-work-mahout/20news-seq -o /home/mahout/mahout-work-mahout/20news-vectors -lnorm -nv -wt tfidf

Let us look only at the flags we actually passed. -lnorm means the output vectors are normalized with the log function (setting the flag makes it true). -nv means the output vectors are NamedVectors; a NamedVector is simply a Vector bundled with a String name (here the document key), so each output vector can be traced back to its source document. -wt tfidf selects the weighting scheme: TF-IDF weights a term by its frequency in the document, discounted by how common the term is across the whole corpus (classically w = tf * log(N/df)); see http://zh.wikipedia.org/wiki/TF-IDF.
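To make -nv concrete, here is a tiny sketch using Mahout's math classes (the document id and the weights below are made up for illustration; this is not code from seq2sparse itself):

import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class NamedVectorExample {
  public static void main(String[] args) {
    // A made-up sparse term-weight vector with cardinality 5.
    Vector v = new RandomAccessSparseVector(5);
    v.setQuick(0, 3.0);
    v.setQuick(2, 1.0);
    // -nv wraps each output vector in a NamedVector: the same vector plus a
    // String name. seq2sparse uses the SequenceFile key (the document id),
    // so every vector can be traced back to its source document.
    NamedVector named = new NamedVector(v, "/comp.graphics/38343");
    System.out.println(named.getName() + " -> " + named);
  }
}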

 

Step (1) is kicked off at line 253 of SparseVectorsFromSequenceFiles:

DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);

Stepping into this call, you can see that the job uses SequenceFileTokenizerMapper as its Mapper and has no Reducer. The Mapper code is:

protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
  // Run the document text through the Lucene analyzer.
  TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
  CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
  StringTuple document = new StringTuple();
  stream.reset();
  // Collect every non-empty token into a StringTuple.
  while (stream.incrementToken()) {
    if (termAtt.length() > 0) {
      document.add(new String(termAtt.buffer(), 0, termAtt.length()));
    }
  }
  // Emit (document id, token list); there is no Reducer for this job.
  context.write(key, document);
}
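So each map call writes one record, with the document id as key and a StringTuple of its tokens as value, into a SequenceFile under the tokenized-documents directory of the output path. Below is a minimal sketch for dumping those records; the tokenized-documents path and the part-m-00000 file name are assumptions based on the run above, so adjust them to your own cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.StringTuple;

public class DumpTokenizedDocs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed location of the tokenizer output from the seq2sparse run above.
    Path path = new Path("/home/mahout/mahout-work-mahout/20news-vectors/tokenized-documents/part-m-00000");
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Text key = new Text();
    StringTuple value = new StringTuple();
    while (reader.next(key, value)) {
      // One record per document: (document id, token list).
      System.out.println(key + " -> " + value.getEntries());
    }
    reader.close();
  }
}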

The Mapper's setup method mainly instantiates the Analyzer; for the Analyzer API see http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/Analyzer.html. The method used in map is reusableTokenStream(String fieldName, Reader reader): "Creates a TokenStream that is allowed to be re-used from the previous time that the same thread called this method."
To see the tokenizer in action, I wrote the following test program:

package mahout.fansy.test.bayes;

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.mahout.common.ClassUtils;
import org.apache.mahout.common.StringTuple;

public class TestSequenceFileTokenizerMapper {

  // Same analyzer the job uses by default: Mahout's DefaultAnalyzer.
  private static Analyzer analyzer = ClassUtils.instantiateAs(
      "org.apache.mahout.vectorizer.DefaultAnalyzer", Analyzer.class);

  public static void main(String[] args) throws IOException {
    testMap();
  }

  public static void testMap() throws IOException {
    // A fake (key, value) record, mimicking what the Mapper receives.
    Text key = new Text("4096");
    Text value = new Text("today is also late.what about tomorrow?");
    TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    StringTuple document = new StringTuple();
    stream.reset();
    while (stream.incrementToken()) {
      if (termAtt.length() > 0) {
        document.add(new String(termAtt.buffer(), 0, termAtt.length()));
      }
    }
    System.out.println("key:" + key.toString() + ",document:" + document);
  }
}

Running it produces:

key:4096,document:[today, also, late.what, about, tomorrow]

Note that the analyzer's token stream includes a stop-word filter whose stopwords set is: [but, be, with, such, then, for, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, these, by, to, of]. These are Lucene's standard English stop words, so whenever one of them appears it is simply dropped; that is why "is" is missing from the output above.
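If you want to verify that stop-word set yourself, Lucene's StandardAnalyzer (which Mahout's DefaultAnalyzer delegates to) exposes it as a constant; a one-liner sketch, assuming Lucene 3.x on the classpath:

import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class PrintStopWords {
  public static void main(String[] args) {
    // The default English stop-word set used by StandardAnalyzer, and hence
    // by the DefaultAnalyzer that seq2sparse uses when no -a flag is given.
    System.out.println(StandardAnalyzer.STOP_WORDS_SET);
  }
}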

 

Ugh, too late again. So sleepy... time to go brush my teeth and floss...



Share, be happy, grow


Please credit the source when reposting: http://blog.csdn.net/fansy1990


