Non-negative Matrix Factorization and Probabilistic Latent Semantic Analysis

Introduction

Non-negative Matrix Factorization (NMF) is frequently confused with Probabilistic Latent Semantic Analysis (PLSA). Both methods are applied to document clustering; they have a lot in common but are not the same. This article explains both methods in the simplest way possible.

How it was discovered

Imagine having 5 documents: 2 of them about the environment, 2 about the U.S. Congress, and 1 about both, that is, about the government's legislative process for protecting the environment. We need to write a program that reliably identifies the category of each document and also returns the degree to which each document belongs to each category. For this elementary example we limit our vocabulary to 5 words: AIR, WATER, POLLUTION, DEMOCRAT, REPUBLICAN. Category ENVIRONMENT and category CONGRESS may each contain all 5 words, but with different probabilities. We understand that the word POLLUTION is more likely to appear in an article about ENVIRONMENT than in an article about CONGRESS, but it can theoretically appear in both. Suppose that after examining our data we built the following document-term table:

document/word   air   water   pollution   democrat   republican
doc 1             3       2           8          0            0
doc 2             1       4          12          0            0
doc 3             0       0           0         10           11
doc 4             0       0           0          8            5
doc 5             1       1           1          1            1


We distinguish our categories by the groups of words assigned to them. We decide that category ENVIRONMENT should normally contain only the words AIR, WATER, POLLUTION and category CONGRESS should contain only the words DEMOCRAT and REPUBLICAN. We build another matrix in which each row represents a category and contains the counts, summed over all documents, of only the words assigned to that category.

categories    air   water   pollution   democrat   republican
ENVIRONMENT     5       7          21          0            0
CONGRESS        0       0           0         19           17


We change the values from frequencies to probabilities by dividing them by the row sums, which turns each row into a probability distribution.

Matrix W
categories     air   water   pollution   democrat   republican
ENVIRONMENT   0.15    0.21        0.64       0.00         0.00
CONGRESS      0.00    0.00        0.00       0.53         0.47
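The two steps above (summing the counts of each hand-picked word group, then dividing by the row sums) can be sketched in a few lines of NumPy. This is only an illustration of the arithmetic; the variable names and the 0/1 masks that encode the word groups are assumptions made for this sketch.

```python
import numpy as np

# Document-term counts from the table above (rows: doc 1..5;
# columns: air, water, pollution, democrat, republican).
counts = np.array([
    [3, 2, 8, 0, 0],
    [1, 4, 12, 0, 0],
    [0, 0, 0, 10, 11],
    [0, 0, 0, 8, 5],
    [1, 1, 1, 1, 1],
], dtype=float)

# Hand-picked word groups (an assumption of this sketch): 1 marks a word
# that belongs to the category, 0 marks a word that does not.
env_mask = np.array([1, 1, 1, 0, 0], dtype=float)
con_mask = np.array([0, 0, 0, 1, 1], dtype=float)

# Sum the counts over all documents, keeping only each category's words.
column_totals = counts.sum(axis=0)
category_counts = np.vstack([column_totals * env_mask,
                             column_totals * con_mask])

# Divide each row by its sum so that every row is a probability distribution.
W = category_counts / category_counts.sum(axis=1, keepdims=True)

print(category_counts)
print(np.round(W, 2))
```

Rounded to two decimals, `W` agrees with the matrix W shown above.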


Now we create another matrix that contains the probability distribution over categories within each document. It looks as follows:

Matrix D
documents   ENVIRONMENT   CONGRESS
doc 1               1.0        0.0
doc 2               1.0        0.0
doc 3               0.0        1.0
doc 4               0.0        1.0
doc 5               0.6        0.4


It shows that the top two documents are about the environment, the next two are about Congress, and the last document is about both. The ratios 0.6 and 0.4 for the last document come from its 3 words from the ENVIRONMENT category and 2 words from the CONGRESS category. Now we multiply both matrices and compare the result with the original data in normalized form. Normalization in this case means dividing each row by the sum of its elements. The comparison is shown below:

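Product of D * W
         air   water   pollution   democrat   republican
doc 1   0.15    0.21        0.64       0.00         0.00
doc 2   0.15    0.21        0.64       0.00         0.00
doc 3   0.00    0.00        0.00       0.53         0.47
doc 4   0.00    0.00        0.00       0.53         0.47
doc 5   0.09    0.13        0.38       0.21         0.19

Normalized data N
         air   water   pollution   democrat   republican
doc 1   0.23    0.15        0.62       0.00         0.00
doc 2   0.06    0.24        0.71       0.00         0.00
doc 3   0.00    0.00        0.00       0.48         0.52
doc 4   0.00    0.00        0.00       0.62         0.38
doc 5   0.20    0.20        0.20       0.20         0.20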

The correlation is obvious. The technical problem is to find constrained matrices W and D (non-negative, with rows that sum to 1, and given only the number of categories) whose product is the best match to the original data in normalized form N.
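The comparison can be repeated numerically. The sketch below takes the hand-built D and the category counts behind W from the example, multiplies D by W, and measures how far the product is from the row-normalized data. The variable names, and the choice of the largest absolute difference as the measure of mismatch, are assumptions of this sketch.

```python
import numpy as np

# Document-term counts (rows: doc 1..5; columns: air, water, pollution,
# democrat, republican).
counts = np.array([
    [3, 2, 8, 0, 0],
    [1, 4, 12, 0, 0],
    [0, 0, 0, 10, 11],
    [0, 0, 0, 8, 5],
    [1, 1, 1, 1, 1],
], dtype=float)

# W: category-by-word probabilities, built from the grouped counts.
W = np.array([[5, 7, 21, 0, 0],
              [0, 0, 0, 19, 17]], dtype=float)
W /= W.sum(axis=1, keepdims=True)

# D: document-by-category probabilities, assigned by hand in the example.
D = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0],
              [0.6, 0.4]])

# N: the data with every row divided by its own sum.
N = counts / counts.sum(axis=1, keepdims=True)

product = D @ W
max_abs_error = np.abs(product - N).max()

print(np.round(product, 2))
print(np.round(N, 2))
print(max_abs_error)
```

Every row of `product` is itself a probability distribution, because it is a mixture of the rows of `W` with weights that sum to 1. The largest mismatch comes from doc 5, whose five words are spread evenly over the vocabulary.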

Likelihood functions

We obtained the decomposition matrices in the above example by combining words into groups by hand. Obviously, we cannot do that when the vocabulary is over a hundred thousand words, the number of documents is over a million, and the number of categories is unknown but presumed to be relatively large (let us say 100). The generic approach used in both NMF and PLSA is maximization of a likelihood function. For a beginner it is very hard to understand what a likelihood function means and how such functions are constructed. I can say that they are introduced by experts in probability theory and are considered a technical subject. Here I try to explain one of them.

Let us say we know that in document one the word AIR occurs 3 times, the word WATER occurs 2 times and the word POLLUTION occurs 8 times. If we ask what the probability is that a randomly selected word from document one is AIR, the answer is simple: it is 3/13. We reach the same simple conclusion for the word WATER, P{word = WATER} = 2/13, and for the word POLLUTION, P{word = POLLUTION} = 8/13. Let us consider a function of three unknown probabilities

L = 3 * log(P1) + 2 * log(P2) + 8 * log(P3)

where P1 is the probability of the word AIR, P2 is the probability of the word WATER and P3 is the probability of the word POLLUTION. Presume we do not know the probabilities. Having this function and the constraint P1 + P2 + P3 = 1.0, we can estimate the probabilities by looking for the constrained maximum of L. The Lagrange method works well in this case: setting the derivatives of L - λ * (P1 + P2 + P3 - 1) to zero gives P1 = 3/λ, P2 = 2/λ and P3 = 8/λ, and the constraint then gives λ = 13. So the values that maximize L are exactly the above probabilities 3/13, 2/13 and 8/13. This is how it works, and the above function is called the likelihood (strictly speaking, it is the logarithm of the likelihood). Obviously, we do not need to do that for this simple case; it was only an illustration of what a likelihood function is. Generally, likelihood functions are used to estimate probabilities after an experiment has been conducted and the frequencies of occurrence are known, as in the above example: we know the frequencies and are looking for the probabilities.
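The claim that the relative frequencies maximize the likelihood can be checked numerically. The sketch below takes the likelihood in its usual logarithmic form, L = 3 log(P1) + 2 log(P2) + 8 log(P3), and compares its value at (3/13, 2/13, 8/13) with its value at many random probability vectors. The function name and the use of random sampling in place of an analytical proof are choices made for this sketch.

```python
import numpy as np

# Word counts in document one: AIR, WATER, POLLUTION.
counts = np.array([3.0, 2.0, 8.0])


def log_likelihood(p):
    """L = sum_i n_i * log(p_i) for a probability vector p."""
    return float(np.sum(counts * np.log(p)))


# Candidate maximum: the relative frequencies 3/13, 2/13, 8/13.
p_hat = counts / counts.sum()

# Compare against many random probability vectors that also sum to 1.
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(3), size=10_000)
best_random = max(log_likelihood(p) for p in candidates)

print(p_hat)
print(log_likelihood(p_hat), best_random)
print(log_likelihood(p_hat) >= best_random)
```

No random candidate can beat `p_hat`, which is what the Lagrange method predicts.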

PLSA

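Some articles present Non-negative Matrix Factorization as Probabilistic Latent Semantic Analysis (example), but the two are not the same. The likelihood function for NMF is the following:

$$L_{NMF} = \sum_{i}\sum_{j} n(w_i, d_j)\,\log P(w_i \mid d_j)$$

and the likelihood function for PLSA is different:

$$L_{PLSA} = \sum_{i}\sum_{j} n(w_i, d_j)\,\log P(w_i, d_j)$$

Here n(wi, dj) is the number of times word wi occurs in document dj.
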
The conditional probability P(wi | dj) for the first element in the data is, for example, 3/13, but the joint probability P(wi , dj) is 3/69. The NMF algorithm is designed on the presumption that the probabilities in each row sum to 1.0, which is not true for joint probabilities. Maximizing the likelihood function for PLSA and applying the Expectation-Maximization algorithm to obtain a numerical solution lead to the following set of equations:

E-step:

$$P(z \mid d, w) = \frac{P(z)\,P(d \mid z)\,P(w \mid z)}{\sum_{z'} P(z')\,P(d \mid z')\,P(w \mid z')}$$

M-step:

$$P(w \mid z) = \frac{\sum_{d} n(w, d)\,P(z \mid d, w)}{\sum_{d}\sum_{w'} n(w', d)\,P(z \mid d, w')}$$

$$P(d \mid z) = \frac{\sum_{w} n(w, d)\,P(z \mid d, w)}{\sum_{d'}\sum_{w} n(w, d')\,P(z \mid d', w)}$$

$$P(z) = \frac{\sum_{d}\sum_{w} n(w, d)\,P(z \mid d, w)}{\sum_{d}\sum_{w} n(w, d)}$$

These formulas express everything via functions. P(w|z) and P(d|z) are functions of two variables; their values are similar to the elements of matrices W and D. P(z) is the distribution function for categories. In matrix notation it is a diagonal matrix Z with the probability of each category on the principal diagonal. P(z|d,w) is a function of three variables. The numerator of the E-step expression is the product of single elements of W, D and Z. The denominator of the E-step is a scalar: the inner product of a row of D and a column of W, with each term weighted by the corresponding diagonal element of Z. We can think of P(z|d,w) as a set of matrices, one for each given z, each matching the size of the document-term matrix; the numerator alone is a matrix of the first rank for each z. The numerators in the M-step expressions are built from Hadamard products of the source data and these matrices. In the computation of P(z) we add all elements of this Hadamard product. In the computation of P(w|z) we add down the columns only (over documents), and in the computation of P(d|z) we add along the rows only (over words). The denominators in the M-step perform normalization, i.e. they make each distribution sum to 1.0 to match the definition of probabilities.
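The E-step and M-step described above can be sketched with NumPy broadcasting. This is a minimal textbook-style EM loop under stated assumptions, not the code of the DEMO: the function name `plsa_em`, the random initialization, the fixed number of iterations and the small floor `1e-300` used to avoid division by zero are all choices made for this illustration. P(z) is updated on every iteration here.

```python
import numpy as np


def plsa_em(counts, n_topics, n_iter=200, seed=0):
    """Plain EM for PLSA on a document-term count matrix (docs x words)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape

    # Random positive start; every distribution is normalized to sum to 1.
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_d_z = rng.random((n_topics, n_docs))
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_z = np.full(n_topics, 1.0 / n_topics)

    log_likelihoods = []
    for _ in range(n_iter):
        # E-step numerator: P(z) P(d|z) P(w|z), one rank-one matrix per z.
        joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        # E-step denominator: the model's P(w, d), summed over z.
        p_dw = np.maximum(joint.sum(axis=0), 1e-300)
        posterior = joint / p_dw                      # P(z|d,w)
        log_likelihoods.append(float(np.sum(counts * np.log(p_dw))))

        # M-step: Hadamard product of the data with each posterior slice.
        weighted = counts[None, :, :] * posterior     # shape (z, d, w)
        totals = weighted.sum(axis=(1, 2))            # all elements, per z
        p_w_z = weighted.sum(axis=1) / totals[:, None]  # add over documents
        p_d_z = weighted.sum(axis=2) / totals[:, None]  # add over words
        p_z = totals / counts.sum()

    return p_w_z, p_d_z, p_z, log_likelihoods


counts = np.array([
    [3, 2, 8, 0, 0],
    [1, 4, 12, 0, 0],
    [0, 0, 0, 10, 11],
    [0, 0, 0, 8, 5],
    [1, 1, 1, 1, 1],
], dtype=float)

p_w_z, p_d_z, p_z, log_likelihoods = plsa_em(counts, n_topics=2)
print(np.round(p_w_z, 2))
print(np.round(p_d_z, 2))
print(np.round(p_z, 2))
```

The list `log_likelihoods` is returned so that convergence can be inspected: under EM the log-likelihood should not decrease from one iteration to the next.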

Both algorithms, NMF and PLSA, return approximately the same result if Z is not involved (set to a constant and not updated). Since NMF uses relative values, normalized within rows, and PLSA uses absolute values, the results match when the row sums of the original document-term matrix are approximately equal. The result of PLSA is skewed when some documents are larger than others. This is neither good nor bad, it is just different. P(z) brings up some computational stability issues. It does not affect P(w|z), because it is filtered out in the normalization step, but it affects the computation of P(d|z) by skewing values in columns and destroying the result. When the computation does converge, it converges to the same result as without P(z). For that reason I simply removed P(z) from the computation of P(d|z). This is the only difference I introduced in my DEMO (link at the top). I found this computational issue mentioned in other papers as well. Some even introduce a new parameter called inverse computational temperature and use it to stabilize P(z). That is overkill. There is no need for a new theory. The problem is the fast changes in P(z). It can be solved by damping these values during the computation, for example by averaging them with the values from a few previous steps. Some implementations look as if they use P(z) but actually don't. I found one example of PLSA in Java: although P(z) is computed in the iterations, its values are always the same, so it does not affect the result.
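One possible reading of the damping idea is a moving average over the last few values of P(z). The helper below is hypothetical: the class name `DampedPz`, the window of 3 steps and the final renormalization are assumptions of this sketch, and it is not taken from the DEMO.

```python
from collections import deque

import numpy as np


class DampedPz:
    """Average each new P(z) with the values from a few previous steps."""

    def __init__(self, initial_p_z, window=3):
        # Keep at most `window` recent raw values of P(z).
        self.history = deque([np.asarray(initial_p_z, dtype=float)],
                             maxlen=window)

    def update(self, new_p_z):
        """Store the new value and return the damped distribution."""
        self.history.append(np.asarray(new_p_z, dtype=float))
        averaged = np.stack(list(self.history)).mean(axis=0)
        # Renormalize so the damped values still sum to 1.
        return averaged / averaged.sum()


damper = DampedPz([0.5, 0.5], window=3)
first = damper.update([0.9, 0.1])   # average of two stored values
second = damper.update([0.9, 0.1])  # average of three stored values
third = damper.update([0.9, 0.1])   # the initial value has dropped out
print(first, second, third)
```

In an EM loop, the value returned by `update` would be used in place of the freshly computed P(z), so a sudden jump in P(z) reaches the next E-step only gradually.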

I'm not the only one who noticed that NMF and PLSA are computationally close. There is even a theoretical proof that these two methods are equivalent. I found them close but not equivalent. To see the difference, you can simply multiply any row in the data matrix by a constant. The difference comes from the use of the additional term P(z).
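The row-scaling experiment can already be tried on the inputs of the two methods, before any factorization is run. The sketch below computes the conditional probabilities P(w|d), which are the row-normalized data used by NMF, and the joint probabilities P(w,d) used by PLSA, then multiplies the first row of the data by 10. The function names and the factor 10 are assumptions of this sketch.

```python
import numpy as np

counts = np.array([
    [3, 2, 8, 0, 0],
    [1, 4, 12, 0, 0],
    [0, 0, 0, 10, 11],
    [0, 0, 0, 8, 5],
    [1, 1, 1, 1, 1],
], dtype=float)


def conditional(n):
    """P(w|d): every row divided by its own sum."""
    return n / n.sum(axis=1, keepdims=True)


def joint(n):
    """P(w,d): every cell divided by the grand total."""
    return n / n.sum()


# Make document one ten times longer without changing its word proportions.
scaled = counts.copy()
scaled[0] *= 10

# First element of the data: 3/13 as a conditional, 3/69 as a joint probability.
print(conditional(counts)[0, 0], joint(counts)[0, 0])

same_conditional = np.allclose(conditional(counts), conditional(scaled))
same_joint = np.allclose(joint(counts), joint(scaled))
print(same_conditional, same_joint)
```

The conditional probabilities are unchanged by the scaling, while the joint probabilities are not, so a method built on joint probabilities sees a different problem after the scaling.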

Original article: http://ezcodesample.com/plsaidiots/NMFPLSA.html
