Introduction
To find these topics in a particular set of documents, we could modify our clustering code to work with word vectors instead of the document vectors we've been using so far. A word vector is simply a vector for each word, whose features are the IDs of the other words that occur along with it in the corpus, and whose weights are the number of documents in which the two words occur together.
Latent Dirichlet Allocation (LDA) is more than just this type of clustering. If two words that have the same meaning or form never occur together, clustering won't be able to associate them based on other instances. This is where LDA shines: it can sift through the patterns in which words occur and figure out which words have similar meanings or are being used in similar contexts. Each such group of words can be thought of as a concept or a topic.
The LDA algorithm works much like Dirichlet clustering. It starts with an empty topic model, reads all the documents in a mapper phase in parallel, and calculates the probability of each topic for each word in each document. Once this is done, the counts of these probabilities are sent to the reducer, where they're summed and the whole model is normalized. This process is run repeatedly until the model stops improving, that is, until the sum of the (log) probabilities of the documents stops changing. The amount of change that counts as convergence is set by a threshold parameter, similar to the convergence threshold in k-means clustering. But instead of measuring the relative change in the centroids, LDA estimates how well the model fits the data; if the likelihood doesn't change by more than this threshold, the iterations stop.
TF-IDF vs. LDA
While clustering documents, we used TF-IDF weighting to bring out the important words within a document. One drawback of TF-IDF is that it fails to recognize the co-occurrence or correlation between words, such as Coca and Cola. Moreover, TF-IDF isn't able to bring out subtle and intrinsic relations between words based on their occurrence and distribution. LDA brings out these relations based on the input word frequencies, so it's important to give term-frequency (TF) vectors as input to the algorithm, not TF-IDF vectors.
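As a minimal sketch, term-frequency vectors can be produced with the dictionary-based vectorizer by selecting TF weighting; the input and output paths here (reuters-seqfiles, reuters-vectors) are placeholders for your own data:

# Create term-frequency (not TF-IDF) vectors from a SequenceFile of documents.
# -wt tf selects plain term-frequency weighting; paths are placeholders.
bin/mahout seq2sparse \
    -i reuters-seqfiles \
    -o reuters-vectors \
    -wt tf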
Tuning the parameters of LDA
- the number of topics
- the number of words in the corpus

If you need to speed up LDA, apart from decreasing the number of topics, you can keep the number of features to a minimum; but if you need the complete probability distribution of all words over the topics, leave this parameter alone. If you're interested in a topic model containing only the key words of a large corpus, you can prune away the high-frequency words while creating the vectors by lowering the maximum-document-frequency percentage parameter (--maxDFPercent) of the dictionary-based vectorizer, as shown below. A value of 70 removes all words that occur in more than 70 percent of the documents.
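As a sketch of that pruning, the same vectorizer call can be given --maxDFPercent (paths again placeholders):

# Drop words that appear in more than 70 percent of the documents
# while building the term-frequency vectors.
bin/mahout seq2sparse \
    -i reuters-seqfiles \
    -o reuters-vectors-pruned \
    -wt tf \
    --maxDFPercent 70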
Invocation and Usage
Mahout's implementation of LDA operates on a collection of SparseVectors of word counts. These word counts should be non-negative integers, though things will probably work fine if you use non-negative reals (note that the probabilistic model doesn't make sense if you do). To create these vectors, it's recommended that you follow the instructions in Creating Vectors From Text, making sure to use TF and not TF-IDF as the scorer.
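For context, one possible preparation pipeline looks like the sketch below: plain-text files are first turned into a SequenceFile, vectorized with TF weighting as shown earlier, and then converted with the rowid job into the integer-keyed matrix that the cvb driver reads. The directory names are placeholders, and whether the rowid step is required depends on your Mahout version, so treat this as an assumption to verify:

# Convert a directory of plain-text files into a SequenceFile.
bin/mahout seqdirectory -i reuters-text/ -o reuters-seqfiles
# After running seq2sparse with -wt tf (see above), convert the resulting
# tf-vectors into an integer-keyed matrix for cvb.
bin/mahout rowid -i reuters-vectors/tf-vectors -o reuters-matrix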
Invocation takes the form:
bin/mahout cvb \
    -i <input path for document vectors> \
    -dict <path to term-dictionary file(s), glob expression supported> \
    -o <output path for topic-term distributions> \
    -dt <output path for doc-topic distributions> \
    -k <number of latent topics> \
    -nt <number of unique features defined by input document vectors> \
    -mt <path to store model state after each iteration> \
    -maxIter <max number of iterations> \
    -mipd <max number of iterations per doc for learning> \
    -a <smoothing for doc-topic distributions> \
    -e <smoothing for term-topic distributions> \
    -seed <random seed> \
    -tf <fraction of data to hold for testing> \
    -block <number of iterations per perplexity check; ignored unless test_set_percentage > 0>
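A hypothetical concrete invocation, reusing the placeholder paths from the sketches above with 20 topics and a doc-topic smoothing of 50/20 = 2.5; the vocabulary size, smoothing values, and file names (matrix, dictionary.file-0) are assumptions that should be checked against your own vectorizer output:

# Hypothetical run: 20 latent topics over the matrix produced above.
bin/mahout cvb \
    -i reuters-matrix/matrix \
    -dict reuters-vectors/dictionary.file-0 \
    -o reuters-lda-topics \
    -dt reuters-lda-doc-topics \
    -k 20 \
    -nt 50000 \
    -mt reuters-lda-states \
    -maxIter 20 \
    -a 2.5 \
    -e 0.01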
Topic smoothing should generally be about 50/K, where K is the number of topics (for example, 2.5 for 20 topics). The number of words in the vocabulary can serve as an upper bound for the number of unique features, though it shouldn't be set much higher than necessary, for memory reasons.
Choosing the number of topics is more art than science, and it's recommended that you try several values.
After running LDA, you can print the computed topics using the LDAPrintTopics utility:
bin/mahout ldatopics \
    -i <input vectors directory> \
    -d <input dictionary file> \
    -w <optional number of words to print> \
    -o <optional output working directory; default is to print to the console> \
    -h <print out help> \
    -dt <optional dictionary type (text|sequencefile); default is text>
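For example, a hypothetical call that prints the top 10 words of each topic from the run above; whether the utility reads your particular topic-term output directly depends on the Mahout version, so treat the paths and compatibility as assumptions:

# Hypothetical example: print the top 10 words per topic to the console.
bin/mahout ldatopics \
    -i reuters-lda-topics \
    -d reuters-vectors/dictionary.file-0 \
    -w 10 \
    -dt sequencefile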