Mahout: Source Code Analysis for Document Clustering

A. Convert text files to sequence files

Class: SequenceFilesFromDirectory

Functions: Converts a directory of text documents into SequenceFiles of the specified chunkSize. This class takes a parent directory containing subfolders of text documents, recursively reads the files, and creates {@link org.apache.hadoop.io.SequenceFile}s of docid => content. The docid is the relative path of the document from the parent directory, prepended with a specified prefix. You can also specify the input encoding of the text files. The content of the output SequenceFiles is encoded as UTF-8 text.
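A minimal sketch of the output format, written against the plain Hadoop SequenceFile API rather than Mahout's own code: it writes one docid => content pair the way SequenceFilesFromDirectory does. The paths, the class name SeqDirSketch, and the single hard-coded document are hypothetical; the real class additionally handles recursion into subfolders, the key prefix, charset conversion, and rolling over to a new chunk file once chunkSize is reached.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqDirSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical output chunk; the real class rolls over to chunk-1, chunk-2, ...
    Path out = new Path("seqfiles/chunk-0");
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    // docid = relative path of the document prepended with the key prefix,
    // value = the file content as UTF-8 text
    String docId = "/2435";
    String content = new String(
        Files.readAllBytes(Paths.get("input/2435")), StandardCharsets.UTF_8);
    writer.append(new Text(docId), new Text(content));
    writer.close();
  }
}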

 

B. Convert sequence files to document vectors

Class: SparseVectorsFromSequenceFiles

Functions: Converts a given set of sequence files into SparseVectors
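This class is the driver behind the mahout seq2sparse command and can also be launched programmatically through Hadoop's ToolRunner, which runs all of the steps analyzed below. A hedged sketch follows; the input/output directories are hypothetical, and the flag spellings are the ones printed by the seq2sparse help, so verify them against your Mahout version.

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

public class Seq2SparseDriver {
  public static void main(String[] args) throws Exception {
    ToolRunner.run(new SparseVectorsFromSequenceFiles(), new String[] {
        "-i",  "seqfiles",   // input dir of SequenceFiles (docid => text)
        "-o",  "vectors",    // output dir for dictionary, tf-vectors, tfidf-vectors
        "-wt", "tfidf",      // weighting: tf or tfidf
        "-s",  "2",          // minSupport: drop terms occurring fewer than 2 times
        "-x",  "90",         // maxDFPercent: prune terms appearing in >90% of docs
        "-ow"                // overwrite existing output
    });
  }
}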

1. Tokenize the documents in the sequence files (sketched after the sample below)

DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);

 

Key: /2435: Value: 为什么很多看起来不是很复杂的网站,比如 Facebook 需要大量顶尖高手来开发?
Key: /2436: Value: iNOKNOK敲门网络是怎样的一个网站?
Key: /2437: Value: 北京上海生活成本为何超过了巴黎和纽约?

 ======>

Key: /2435: Value: [为什么, 很多, 看起来, 不是, 很复杂, 网站, 比如, facebook, 需要, 大量, 顶尖, 高手, 开发]
Key: /2436: Value: [inoknok, 敲门, 网络, 怎样, 一个, 网站]
Key: /2437: Value: [北京, 上海, 生活, 成本, 为何, 超过, 巴黎, 纽约]
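The tokenization job simply pushes each document's text through the Lucene Analyzer passed in as analyzerClass and writes the resulting token list back under the same docid. A minimal per-document sketch, assuming a Lucene 4.x-style TokenStream API; for Chinese output like the sample above, a Chinese-aware analyzer would have to be supplied as analyzerClass.

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizeSketch {
  // Mirrors what the tokenization mapper does for one document:
  // run the text through the configured Analyzer and collect the terms.
  static List<String> tokenize(Analyzer analyzer, String text) throws IOException {
    List<String> tokens = new ArrayList<>();
    TokenStream stream = analyzer.tokenStream("text", new StringReader(text));
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      tokens.add(term.toString());
    }
    stream.end();
    stream.close();
    return tokens;
  }
}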

 

2. Create term frequency (TF) vectors from the input set of documents

DictionaryVectorizer.createTermFrequencyVectors

 

2-1 startWordCounting(input, dictionaryJobPath, baseConf, minSupport);

Counts the frequencies of words in parallel using Map/Reduce. If a term's frequency in the corpus is smaller than minSupport, the term is stripped, i.e. it is not included in the dictionary (sketched after the word-count sample below).

tokenized-documents:

Key: /2435: Value: [为什么, 很多, 看起来, 不是, 很复杂, 网站, 比如, facebook, 需要, 大量, 顶尖, 高手, 开发]
Key: /2436: Value: [inoknok, 敲门, 网络, 怎样, 一个, 网站]
Key: /2437: Value: [北京, 上海, 生活, 成本, 为何, 超过, 巴黎, 纽约]

.......

=============>

wordcount:

Key: 专用: Value: 1
Key: 世外桃源: Value: 1
Key: 世界: Value: 13
Key: 世界上: Value: 6
Key: 世界杯: Value: 20
Key: 世界观: Value: 1
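Conceptually this is an ordinary word count with a minSupport cut in the reducer. A generic sketch of the mapper/reducer logic follows; these are not Mahout's actual classes, the real mapper reads a StringTuple of tokens rather than a whitespace-separated Text value, and minSupport would come from the job Configuration.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {
  // Mapper: emit (term, 1) for every token of every document.
  public static class TermMapper extends Mapper<Text, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    @Override
    protected void map(Text docId, Text tokens, Context ctx)
        throws IOException, InterruptedException {
      for (String term : tokens.toString().split("\\s+")) {
        ctx.write(new Text(term), ONE);
      }
    }
  }

  // Reducer: sum the counts and drop terms below minSupport.
  public static class TermReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private long minSupport = 2;  // hypothetical value; set via the Configuration in the real job
    @Override
    protected void reduce(Text term, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable c : counts) sum += c.get();
      if (sum >= minSupport) ctx.write(term, new LongWritable(sum));
    }
  }
}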

2-2 Read the feature-frequency list built at the end of the word-count job and assign integer ids to the features (sketched after the dictionary sample below)

createDictionaryChunks(dictionaryJobPath, output, baseConf, chunkSizeInMegabytes, maxTermDimension);

 

wordcount:

Key:  80、90后最大疑惑: Value: 1
Key: -727379968: Value: 1
Key: 0.25%: Value: 1
Key: 0.9秒: Value: 1
Key: 00: Value: 1
Key: 001: Value: 1
Key: 0day: Value: 1
Key: 1.8: Value: 1
Key: 1/100: Value: 1
Key: 10: Value: 4
......

===========>

dictionary:

Key:  80、90后最大疑惑: Value: 0
Key: -727379968: Value: 1
Key: 0.25%: Value: 2
Key: 0.9秒: Value: 3
Key: 00: Value: 4
Key: 001: Value: 5
Key: 0day: Value: 6
Key: 1.8: Value: 7
Key: 1/100: Value: 8
Key: 10: Value: 9
Key: 100: Value: 10
.........
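The id assignment itself is sequential: the word-count output is streamed in key order, each term gets the next integer id, and a new dictionary chunk file is started whenever chunkSizeInMegabytes is reached. A plain-Java sketch of the id assignment only (the chunk splitting is omitted):

import java.util.LinkedHashMap;
import java.util.Map;

public class DictionarySketch {
  // Assign consecutive integer ids to terms in the order they are read
  // from the word-count output, as in the dictionary sample above.
  static Map<String, Integer> buildDictionary(Iterable<String> termsFromWordCount) {
    Map<String, Integer> dictionary = new LinkedHashMap<>();
    int nextId = 0;
    for (String term : termsFromWordCount) {
      dictionary.put(term, nextId++);
    }
    return dictionary;
  }
}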

 

2-3 Create partial vectors using chunks of features from the input documents, then merge the partial vectors into complete document vectors (a per-document sketch follows the sample below)

 

  makePartialVectors

 PartialVectorMerger.mergePartialVectors

dictionary

Key:  80、90后最大疑惑: Value: 0
Key: -727379968: Value: 1
Key: 0.25%: Value: 2
Key: 0.9秒: Value: 3

 

 

tokenized-documents

Key: /2435: Value: [为什么, 很多, 看起来, 不是, 很复杂, 网站, 比如, facebook, 需要, 大量, 顶尖, 高手, 开发]
Key: /2436: Value: [inoknok, 敲门, 网络, 怎样, 一个, 网站]
Key: /2437: Value: [北京, 上海, 生活, 成本, 为何, 超过, 巴黎, 纽约]

......

======>

Key: /2435: Value: {181:1.0,512:1.0,618:1.0,1738:1.0,2142:1.0,2221:1.0,2222:1.0,3072:1.0,3517:1.0,3776:1.0,4518:1.0,4545:1.0,4631:1.0}
Key: /2436: Value: {210:1.0,329:1.0,2296:1.0,2666:1.0,3776:1.0,3777:1.0}
Key: /2437: Value: {441:1.0,619:1.0,1208:1.0,2066:1.0,2390:1.0,3375:1.0,3714:1.0,4173:1.0}

.......
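Each partial-vector pass looks up a document's tokens in one dictionary chunk and counts occurrences, producing a sparse vector over term ids; the merger then combines the partial vectors that share a docid into the complete TF vector shown above. A sketch of the per-document counting, assuming the Mahout math Vector API (RandomAccessSparseVector):

import java.util.List;
import java.util.Map;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class TfVectorSketch {
  // Turn one tokenized document into a sparse TF vector using a dictionary
  // chunk (term -> id). Tokens missing from this chunk are skipped here and
  // picked up by the partial vector built from another chunk.
  static Vector toTfVector(List<String> tokens, Map<String, Integer> dictionary, int dimension) {
    Vector vector = new RandomAccessSparseVector(dimension);
    for (String token : tokens) {
      Integer id = dictionary.get(token);
      if (id != null) {
        vector.set(id, vector.get(id) + 1.0);  // raw term frequency
      }
    }
    return vector;
  }
}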

 

 

3. Calculate the document frequencies of all terms from the input set of vectors

TFIDFConverter.calculateDF(new Path(outputDir, tfDirName), outputDir, conf, chunkSize);

3-1 startDFCounting(input, wordCountPath, baseConf);

Count, for each term id, the number of documents in which it appears (sketched after the df-count sample below).

tf-vectors:

Key: /2435: Value: {181:1.0,512:1.0,618:1.0,1738:1.0,2142:1.0,2221:1.0,2222:1.0,3072:1.0,3517:1.0,3776:1.0,4518:1.0,4545:1.0,4631:1.0}
Key: /2436: Value: {210:1.0,329:1.0,2296:1.0,2666:1.0,3776:1.0,3777:1.0}
Key: /2437: Value: {441:1.0,619:1.0,1208:1.0,2066:1.0,2390:1.0,3375:1.0,3714:1.0,4173:1.0}
....

=======>

 df-count:

Key: 0: Value: 1
Key: 1: Value: 1
Key: 2: Value: 1
Key: 3: Value: 1
Key: 4: Value: 1
Key: 5: Value: 1
Key: 6: Value: 1
Key: 7: Value: 1
Key: 8: Value: 1
Key: 9: Value: 4
Key: 10: Value: 3
..........
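The document frequency of a term id is the number of documents whose TF vector contains it, regardless of how often it occurs inside each document; the real job also records the total number of documents under a reserved key. A plain-Java sketch over vectors represented as {termId: tf} maps, mirroring the printout above:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DfCountSketch {
  // Each term id that appears in a document's TF vector counts once for
  // that document, no matter how large its term frequency is.
  static Map<Integer, Integer> countDf(List<Map<Integer, Double>> tfVectors) {
    Map<Integer, Integer> df = new HashMap<>();
    for (Map<Integer, Double> doc : tfVectors) {
      for (Integer termId : doc.keySet()) {
        Integer current = df.get(termId);
        df.put(termId, current == null ? 1 : current + 1);
      }
    }
    return df;
  }
}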
3-2 Read the document frequency list and split it into size-limited chunks.

createDictionaryChunks(wordCountPath, output, baseConf, chunkSizeInMegabytes)

 
4. Prune the term frequency vectors

 HighDFWordsPruner.pruneVectors
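Terms with a very high document frequency behave like stopwords and carry little weight for clustering, so they are stripped from the TF vectors before TF-IDF weighting; the cutoff is derived from options such as maxDFPercent. A minimal sketch of the per-vector pruning, again over {termId: value} maps, assuming a precomputed df map and an absolute maxDF cutoff:

import java.util.HashMap;
import java.util.Map;

public class PruneSketch {
  // Drop entries whose document frequency exceeds maxDF; such near-stopword
  // terms add little discriminative information.
  static Map<Integer, Double> prune(Map<Integer, Double> tfVector,
                                    Map<Integer, Integer> df, int maxDF) {
    Map<Integer, Double> pruned = new HashMap<>();
    for (Map.Entry<Integer, Double> e : tfVector.entrySet()) {
      Integer termDf = df.get(e.getKey());
      if (termDf != null && termDf <= maxDF) {
        pruned.put(e.getKey(), e.getValue());
      }
    }
    return pruned;
  }
}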

 

5. Convert the TF vectors into TF-IDF vectors

TFIDFConverter.processTfIdf
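Each surviving TF entry is rescaled by the inverse document frequency computed in step 3. Mahout's TFIDF class delegates the actual weight to Lucene's classic similarity; the sketch below assumes that formula (tf = square root of the in-document frequency, idf = 1 + ln(numDocs / (df + 1))) and should be verified against the TFIDF class in your Mahout version.

public class TfIdfSketch {
  // Weight one entry of a TF vector; the formula assumes Lucene's classic
  // similarity, which Mahout's TFIDF class delegates to:
  //   tf  = sqrt(termFreqInDoc)
  //   idf = 1 + ln(numDocs / (docFreq + 1))
  static double tfIdf(double termFreqInDoc, long docFreq, long numDocs) {
    double tf = Math.sqrt(termFreqInDoc);
    double idf = 1.0 + Math.log((double) numDocs / (docFreq + 1));
    return tf * idf;
  }
}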
