Twenty Newsgroups Classification Task, Part 2: seq2sparse (1)

seq2sparse corresponds to org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles in Mahout. From the job monitoring page of yesterday's run, this step consists of 7 jobs: (1) DocumentTokenizer, (2) WordCount, (3) MakePartialVectors, (4) MergePartialVectors, (5) VectorTfIdf Document Frequency Count, (6) MakePartialVectors, (7) MergePartialVectors. Printing the parameter help of SparseVectorsFromSequenceFiles shows the following:

 

Usage:
 [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize
<chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFSigma
<maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm>
--minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize>
--overwrite --help --sequentialAccessVector --namedVector --logNormalize]
Options
  --minSupport (-s) minSupport        (Optional) Minimum Support. Default
                                      Value: 2
  --analyzerName (-a) analyzerName    The class name of the analyzer
  --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000 MB
  --output (-o) output                The directory pathname for output.
  --input (-i) input                  Path to job input directory.
  --minDF (-md) minDF                 The minimum document frequency.  Default
                                      is 1
  --maxDFSigma (-xs) maxDFSigma       What portion of the tf (tf-idf) vectors
                                      to be used, expressed in times the
                                      standard deviation (sigma) of the
                                      document frequencies of these vectors.
                                      Can be used to remove really high
                                      frequency terms. Expressed as a double
                                      value. Good value to be specified is 3.0.
                                      In case the value is less then 0 no
                                      vectors will be filtered out. Default is
                                      -1.0.  Overrides maxDFPercent
  --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF.
                                      Can be used to remove really high
                                      frequency terms. Expressed as an integer
                                      between 0 and 100. Default is 99.  If
                                      maxDFSigma is also set, it will override
                                      this value.
  --weight (-wt) weight               The kind of weight to use. Currently TF
                                      or TFIDF
  --norm (-n) norm                    The norm to use, expressed as either a
                                      float or "INF" if you want to use the
                                      Infinite norm.  Must be greater or equal
                                      to 0.  The default is not to normalize
  --minLLR (-ml) minLLR               (Optional)The minimum Log Likelihood
                                      Ratio(Float)  Default is 1.0
  --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.
                                      Default Value: 1
  --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to
                                      create (2 = bigrams, 3 = trigrams, etc)
                                      Default Value: 1
  --overwrite (-ow)                   If set, overwrite the output directory
  --help (-h)                         Print out help
  --sequentialAccessVector (-seq)     (Optional) Whether output vectors should
                                      be SequentialAccessVectors. If set true
                                      else false
  --namedVector (-nv)                 (Optional) Whether output vectors should
                                      be NamedVectors. If set true else false
  --logNormalize (-lnorm)             (Optional) Whether output vectors should
                                      be logNormalize. If set true else false

In the terminal output of yesterday's run, this step was invoked with the following command:

 

 

./bin/mahout seq2sparse -i /home/mahout/mahout-work-mahout/20news-seq -o /home/mahout/mahout-work-mahout/20news-vectors -lnorm -nv -wt tfidf

Let us look only at the parameters that are actually passed. -lnorm means the output vectors are log-normalized (true when the flag is set); -nv means the output vectors are NamedVectors, although what "named" means here is not obvious at first glance (see the sketch below); -wt tfidf selects the weighting scheme, see http://zh.wikipedia.org/wiki/TF-IDF for background on TF-IDF.
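
As a rough illustration of the "named" part (an assumption on my side, based on Mahout's org.apache.mahout.math.NamedVector class, which is what -nv switches on): a NamedVector simply wraps another Vector together with a String name, typically the document key, so later stages can still tell which document a vector belongs to. A minimal sketch:

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.Vector;

public class NamedVectorSketch {
    public static void main(String[] args) {
        // A plain vector of term weights (values made up for illustration).
        Vector weights = new DenseVector(new double[] {0.5, 1.2, 0.0});
        // Wrap it with a name; with -nv the name would be the document key from the sequence file.
        NamedVector doc = new NamedVector(weights, "doc-4096");   // "doc-4096" is a hypothetical key
        System.out.println(doc.getName());  // prints the attached name
        System.out.println(doc.get(1));     // all other Vector operations delegate to the wrapped vector
    }
}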

 

Step (1) corresponds to line 253 of SparseVectorsFromSequenceFiles:

 

DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);
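
For context, here is a rough sketch of the kind of map-only Hadoop job this call wires up. It is a reconstruction under the assumption that tokenizeDocuments follows the usual Mahout driver pattern (SequenceFile in, SequenceFile out, analyzer class passed through the Configuration); it is not the verbatim Mahout source.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.lucene.analysis.Analyzer;
import org.apache.mahout.common.StringTuple;
import org.apache.mahout.vectorizer.DocumentProcessor;
import org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper;

public class TokenizeJobSketch {
    // Sketch of a map-only tokenization job in the spirit of DocumentProcessor.tokenizeDocuments;
    // the real implementation may differ in details.
    public static void runTokenizeJob(Path inputDir, Class<? extends Analyzer> analyzerClass,
                                      Path tokenizedPath, Configuration baseConf) throws Exception {
        Configuration conf = new Configuration(baseConf);
        // The mapper reads this property in setup() to instantiate the Analyzer.
        conf.set(DocumentProcessor.ANALYZER_CLASS, analyzerClass.getName());

        Job job = new Job(conf, "DocumentTokenizer: " + inputDir);
        job.setJarByClass(DocumentProcessor.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);    // input:  <Text docId, Text text>
        job.setOutputFormatClass(SequenceFileOutputFormat.class);  // output: <Text docId, StringTuple tokens>
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(StringTuple.class);
        job.setMapperClass(SequenceFileTokenizerMapper.class);
        job.setNumReduceTasks(0);                                  // map-only: no Reducer, as noted below
        FileInputFormat.setInputPaths(job, inputDir);
        FileOutputFormat.setOutputPath(job, tokenizedPath);
        job.waitForCompletion(true);
    }
}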

Stepping into this call, you can see that the Mapper used is SequenceFileTokenizerMapper and that no Reducer is used. The Mapper code is as follows:

 

 

protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
    // Run the document text through the configured Analyzer.
    TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    StringTuple document = new StringTuple();
    stream.reset();
    // Collect every non-empty token into a StringTuple.
    while (stream.incrementToken()) {
        if (termAtt.length() > 0) {
            document.add(new String(termAtt.buffer(), 0, termAtt.length()));
        }
    }
    // Emit <document id, token list>.
    context.write(key, document);
}

The setup function of this Mapper mainly sets up the Analyzer. For the Analyzer API, see http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/Analyzer.html. The method used in map is reusableTokenStream(String fieldName, Reader reader): "Creates a TokenStream that is allowed to be re-used from the previous time that the same thread called this method."
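
For reference, setup looks roughly like this (paraphrased rather than copied verbatim from the Mahout source): it reads the analyzer class name that the driver stored in the job Configuration and instantiates it reflectively, exactly the way the test program below does.

// Paraphrased sketch of SequenceFileTokenizerMapper.setup(); the real source may differ in details.
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    Configuration conf = context.getConfiguration();
    // The driver put the analyzer class name into the Configuration under DocumentProcessor.ANALYZER_CLASS.
    String analyzerClassName = conf.get(DocumentProcessor.ANALYZER_CLASS, DefaultAnalyzer.class.getName());
    analyzer = ClassUtils.instantiateAs(analyzerClassName, Analyzer.class);
}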
To exercise the same tokenization path outside Hadoop, write the following test program:

 

 

package mahout.fansy.test.bayes;

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.mahout.common.ClassUtils;
import org.apache.mahout.common.StringTuple;

public class TestSequenceFileTokenizerMapper {

    // Same reflective instantiation that the mapper's setup() performs.
    private static Analyzer analyzer = ClassUtils.instantiateAs(
            "org.apache.mahout.vectorizer.DefaultAnalyzer", Analyzer.class);

    public static void main(String[] args) throws IOException {
        testMap();
    }

    // Replays the body of SequenceFileTokenizerMapper.map() on a single key/value pair.
    public static void testMap() throws IOException {
        Text key = new Text("4096");
        Text value = new Text("today is also late.what about tomorrow?");
        TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
        CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
        StringTuple document = new StringTuple();
        stream.reset();
        while (stream.incrementToken()) {
            if (termAtt.length() > 0) {
                document.add(new String(termAtt.buffer(), 0, termAtt.length()));
            }
        }
        System.out.println("key:" + key.toString() + ",document" + document);
    }

}

The output is as follows:

 

 

key:4096,document[today, also, late.what, about, tomorrow]

Note that the TokenStream has a stopwords attribute whose value is [but, be, with, such, then, for, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, these, by, to, of], so whenever one of these words is encountered it is dropped and never reaches the output.
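
That list is simply Lucene's default English stop-word set. Assuming DefaultAnalyzer delegates to Lucene's StandardAnalyzer (which appears to be the case for this Lucene 3.x / Mahout combination), you can print the set directly to verify:

import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class PrintStopWords {
    public static void main(String[] args) {
        // STOP_WORDS_SET is Lucene 3.x's default English stop-word set;
        // tokens contained in it are filtered out before reaching the StringTuple.
        System.out.println(StandardAnalyzer.STOP_WORDS_SET);
    }
}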

http://blog.csdn.net/fansy1990/article/details/10478515
