Non-negative Matrix Factorization and Probabilistic Latent Semantic Analysis



October 12, 2011


http://ezcodesample.com/plsaidiots/NMFPLSA.html

Introduction

Non-negative Matrix Factorization is frequently confused with Probabilistic Latent Semantic Analysis. Both methods are used for document clustering and have similarities and differences. This article explains both methods as simply as possible and provides a side-by-side coding example for each.

How it was discovered

Imagine having 5 documents: 2 about the environment, 2 about the U.S. Congress, and 1 about both, i.e. about the government's legislative process for protecting the environment. We need to write a program that unmistakably identifies the category of each document and also returns the degree to which each document belongs to a particular category. For this elementary example we limit our vocabulary to 5 words: AIR, WATER, POLLUTION, DEMOCRAT, REPUBLICAN. The category ENVIRONMENT and the category CONGRESS may each contain all 5 words, but with different probabilities. We understand that the word POLLUTION is more likely to appear in an article about the ENVIRONMENT than in an article about CONGRESS, but it can theoretically appear in both. Presume that after an examination of our data we built the following document-term table:

document/word   air   water   pollution   democrat   republican
doc 1             3       2           8          0            0
doc 2             1       4          12          0            0
doc 3             0       0           0         10           11
doc 4             0       0           0          8            5
doc 5             1       1           1          1            1


We distinguish our categories by the group of words assigned to them. We decide that the category ENVIRONMENT should normally contain only the words AIR, WATER, POLLUTION, and the category CONGRESS only the words DEMOCRAT and REPUBLICAN. We build another matrix, each row of which represents a category and contains counts only for the words assigned to that category.

categories    air   water   pollution   democrat   republican
ENVIRONMENT     5       7          21          0            0
CONGRESS        0       0           0         19           17


We change the values from frequencies to probabilities by dividing each row by its sum, which turns each row into a probability distribution.

Matrix  W
categories    air    water   pollution   democrat   republican
ENVIRONMENT   0.15   0.21    0.64        0          0
CONGRESS      0      0       0           0.53       0.47
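The row normalization above is a one-liner in, say, NumPy (a sketch of the step, not the article's original demo code):

```python
import numpy as np

# Category-word counts from the table above (rows: ENVIRONMENT, CONGRESS)
counts = np.array([[5, 7, 21,  0,  0],
                   [0, 0,  0, 19, 17]], dtype=float)

# Divide each row by its sum to turn it into a probability distribution
W = counts / counts.sum(axis=1, keepdims=True)
print(W.round(2))  # matches Matrix W above
```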


Now we create another matrix that contains the probability distribution of categories within each document. It looks as follows:

Matrix  D
documents   ENVIRONMENT   CONGRESS
doc 1       1.0           0.0
doc 2       1.0           0.0
doc 3       0.0           1.0
doc 4       0.0           1.0
doc 5       0.6           0.4


It shows that the top two documents are about the environment, the next two about Congress, and the last document about both. The ratios 0.6 and 0.4 for the last document come from its 3 words in the environment category and 2 words in the congress category. Now we multiply both matrices and compare the result with the original data in normalized form. Normalization here means dividing each row by the sum of its elements. The comparison is shown side by side below:

Product of  D * W
0.15   0.21   0.64   0.0    0.0
0.15   0.21   0.64   0.0    0.0
0.0    0.0    0.0    0.53   0.47
0.0    0.0    0.0    0.53   0.47
0.09   0.13   0.38   0.21   0.19

Normalized data  N
0.23   0.15   0.62   0.0    0.0
0.06   0.24   0.70   0.0    0.0
0.0    0.0    0.0    0.48   0.52
0.0    0.0    0.0    0.61   0.39
0.2    0.2    0.2    0.2    0.2
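This comparison is easy to reproduce numerically. A small NumPy sketch (the variable names are mine, not from the article's demo):

```python
import numpy as np

# Document-term counts from the first table
F = np.array([[3, 2,  8,  0,  0],
              [1, 4, 12,  0,  0],
              [0, 0,  0, 10, 11],
              [0, 0,  0,  8,  5],
              [1, 1,  1,  1,  1]], dtype=float)

# W: category-word counts, row-normalized; D: category mix per document
W = np.array([[5, 7, 21, 0, 0], [0, 0, 0, 19, 17]], dtype=float)
W /= W.sum(axis=1, keepdims=True)
D = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0.6, 0.4]], dtype=float)

N = F / F.sum(axis=1, keepdims=True)  # normalized data
print((D @ W).round(2))               # product D * W
print(N.round(2))                     # compare row by row
```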

The correlation is obvious. The problem is to find constrained matrices  W  and  D  (given the number of categories) whose product best matches the normalized data  N . When the approximation is found, matrix  D  will contain the sought categories.

A more scientific explanation

When the matrices in a product are non-negative and the elements of each row sum to 1.0, the resulting matrix has the same properties. The same holds for columns. That means that if we are looking for normalized  W  and  D , we know their product is also a normalized matrix. This is a fundamental property of the matrix product.

The scientific explanation usually involves maximizing a likelihood function: the sum, over all elements of the matrix, of the product of each word's frequency of occurrence and the logarithm of the corresponding element of the product  D * W . In our example it is

L = 3 * log((DW)[0,0]) + 2 * log((DW)[0,1]) + ... + 1 * log((DW)[4,4])

Likelihood functions are used to estimate unknown probabilities from experimental data, which is exactly the case here. The experimental data are the frequencies shown as numbers, and the probabilities are the elements of  D  and  W  that appear in the product in the last expression. This is correct but not really necessary, because it is known that the maximum of the functional

L = f0 * log(p0) + f1 * log(p1) + ... + fn * log(pn)

with the constraint  p0 + p1 + ... + pn = 1.0  is attained at values of  pi  proportional to  fi . So the normalized matrix  N  is the one that maximizes the likelihood function, and we simply have to find  W  and  D  whose product is as close to  N  as possible.
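The claim that the constrained maximum sits at pi proportional to fi can be checked numerically. A quick sketch (the random-search check is my own illustration, not part of the article's demo):

```python
import numpy as np

f = np.array([3.0, 2.0, 8.0])                    # observed frequencies
L = lambda p: float(np.sum(f * np.log(p)))       # the functional above

p_best = f / f.sum()                             # claimed maximizer: p_i = f_i / sum(f)
rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(3))                # random point on the simplex
    assert L(p) <= L(p_best) + 1e-9              # never beats the normalized frequencies
```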

Different numerical techniques may be applied. For a quick jump to an approximate solution, least squares may be used. We compute matrix  D  as

D = (N W^T) (W W^T)^-1

then having  D  we compute  W  as 

W = (D^T D)^-1 D^T N

then loop until  D  and  W  converge. The inversion can be done approximately, and both computed matrices are normalized after each step. Convergence is extremely fast. The matrices to be inverted have size equal to the number of categories, which is much smaller than the number of files or words. A realistic scenario has 50,000 words, 1,000,000 files and 100 categories, and this is the scale at which the suggested algorithm must work. Inverting a 100 x 100 matrix now takes less than a second on a regular laptop, and the other matrix multiplications can be done concurrently on multi-processor hardware. The coding example can be found at the link at the top. This approximate solution may also be used as an initial approximation for PLSA.
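A sketch of this alternating least-squares loop in NumPy (my own illustration of the scheme described above; clipping to keep the factors non-negative is an assumption the text leaves implicit):

```python
import numpy as np

def als_nmf(N, k, iters=50, eps=1e-9):
    """Alternate D = (N W^T)(W W^T)^-1 and W = (D^T D)^-1 D^T N,
    clipping to non-negative values and row-normalizing after each step."""
    rng = np.random.default_rng(1)
    W = rng.random((k, N.shape[1]))
    W /= W.sum(axis=1, keepdims=True)
    for _ in range(iters):
        D = N @ W.T @ np.linalg.inv(W @ W.T)      # least-squares step for D
        D = np.clip(D, eps, None)                 # enforce non-negativity
        D /= D.sum(axis=1, keepdims=True)         # renormalize rows
        W = np.linalg.inv(D.T @ D) @ D.T @ N      # least-squares step for W
        W = np.clip(W, eps, None)
        W /= W.sum(axis=1, keepdims=True)
    return D, W

F = np.array([[3, 2, 8, 0, 0], [1, 4, 12, 0, 0], [0, 0, 0, 10, 11],
              [0, 0, 0, 8, 5], [1, 1, 1, 1, 1]], dtype=float)
N = F / F.sum(axis=1, keepdims=True)
D, W = als_nmf(N, k=2)
print(D.round(2))  # each row: the category mix of one document
```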

PLSA

Some articles present this Non-negative Matrix Factorization as Probabilistic Latent Semantic Analysis ( example ), but it is not the same. The likelihood function for NMF is the following:

L = Σ(i,j) n(dj, wi) * log P(wi | dj),   where  P(wi | dj) = Σz P(z|dj) P(wi|z)

and the likelihood function for PLSA is different:

L = Σ(i,j) n(dj, wi) * log P(wi, dj),   where  P(wi, dj) = Σz P(z) P(wi|z) P(dj|z)

The conditional probability  P(wi | dj)  for the first element in the data is, for example, 3/13, but the joint probability  P(wi, dj)  is 3/69. The NMF algorithm is designed on the presumption that the probabilities in each row sum to 1.0, which is not true for the joint probability. Maximizing the likelihood function for PLSA and applying the Expectation-Maximization algorithm to obtain a numerical solution leads to the following set of equations:

E-step:  P(z | d, w) = P(z) P(w|z) P(d|z) / Σz' P(z') P(w|z') P(d|z')

M-step:  P(w|z) = Σd n(d, w) P(z | d, w) / Σw' Σd n(d, w') P(z | d, w')
         P(d|z) = Σw n(d, w) P(z | d, w) / Σd' Σw n(d', w) P(z | d', w)
         P(z)   = Σd,w n(d, w) P(z | d, w) / Σd,w n(d, w)

These formulas express everything via functions.  P(w|z)  and  P(d|z)  are functions of two variables; their values are similar to the elements of matrices  W  and  D .  P(z)  is the distribution over categories; in matrix notation it is a diagonal matrix  Z  with the probability of each category on the principal diagonal.  P(z|d,w)  is a function of three variables. The numerator of the E-step expression is a product of single elements of  W ,  D  and  Z . The denominator of the E-step is a scalar: the inner product of a row of  D  with a column of  W , weighted by the corresponding diagonal elements of  Z . We can think of  P(z|d,w)  as a set of rank-one matrices, one for each given  z , each matching the size of the document-term matrix. The numerators in the M-step expressions are Hadamard products of the source data with these rank-one matrices. In the computation of  P(z)  we add all elements of this Hadamard product; in the computation of  P(w|z)  we add only columns, and in the computation of  P(d|z)  we add only rows. The denominators in the M-step perform normalization, i.e. they make the elements of each row sum to 1.0 to match the definition of probabilities.
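These E-M iterations can be written compactly in NumPy. A sketch (my own compact illustration, not the article's DEMO code; the variable names are assumptions):

```python
import numpy as np

def plsa(F, k, iters=100, seed=2):
    """PLSA via EM. F is the document-term count matrix.
    Returns P(z), P(w|z), P(d|z)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = F.shape
    Pz = np.full(k, 1.0 / k)
    Pw_z = rng.random((k, n_words)); Pw_z /= Pw_z.sum(axis=1, keepdims=True)
    Pd_z = rng.random((k, n_docs));  Pd_z /= Pd_z.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: P(z|d,w) proportional to P(z) P(w|z) P(d|z)
        Pz_dw = np.einsum('z,zw,zd->zdw', Pz, Pw_z, Pd_z)
        Pz_dw /= Pz_dw.sum(axis=0, keepdims=True) + 1e-12
        # M-step: Hadamard product of counts with each rank-one slice, then sum
        T = Pz_dw * F[None, :, :]                 # shape (k, n_docs, n_words)
        Pw_z = T.sum(axis=1); Pw_z /= Pw_z.sum(axis=1, keepdims=True)
        Pd_z = T.sum(axis=2); Pd_z /= Pd_z.sum(axis=1, keepdims=True)
        Pz = T.sum(axis=(1, 2)); Pz /= Pz.sum()
    return Pz, Pw_z, Pd_z

F = np.array([[3, 2, 8, 0, 0], [1, 4, 12, 0, 0], [0, 0, 0, 10, 11],
              [0, 0, 0, 8, 5], [1, 1, 1, 1, 1]], dtype=float)
Pz, Pw_z, Pd_z = plsa(F, k=2)
```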

Both algorithms, NMF and PLSA, return approximately the same result if Z is not involved (set to a constant and not updated). Since NMF uses relative (row-normalized) values and PLSA uses absolute values, the results match when the row sums of the document-term matrix N are approximately equal. The result of PLSA is skewed when some documents are larger in size. That is neither good nor bad, it is just different.

P(z) brings some computational stability issues. It does not affect  P(w|z), because it is filtered out in the normalization step, but it affects the computation of  P(d|z)  by skewing values in columns and destroying the result. When it converges, it converges to the same result as without it. For that reason I simply removed  P(z)  from the computation of  P(d|z). This is the only difference I introduced in my DEMO (link at the top). I found this computational issue mentioned in other papers as well. Some even introduce a new parameter called the inverse computational temperature and use it to stabilize  P(z). That is overkill; there is no need for a new theory. The problem is fast changes in  P(z), and it can be solved by damping these values during computation, for example by averaging them with values from a few previous steps. Some implementations look like they use  P(z)  but actually don't. I found one example of PLSA in Java: although  P(z)  is computed in the iterations, its values are always the same, so it does not affect the result.

I'm not the only one who has noticed that NMF and PLSA are computationally close. There is even a theoretical proof that these two methods are equivalent. I found them close but not equivalent. To see the difference, simply multiply any row of the data matrix by a constant; the difference comes from the use of the additional term  P(z).
