SVD and LSI Tutorial (1): Understanding SVD and LSI
SVD and LSI Tutorial (2): Computing Singular Values
SVD and LSI Tutorial (3): Computing All Singular Values of a Matrix
SVD and LSI Tutorial (4): LSI Calculations
SVD and LSI Tutorial (5): LSI Keyword Research and Co-Occurrence Theory
Dr. E. Garcia
Mi Islita.com | Last Update: 01/07/07
I wrote this tutorial for the reasons explained below.
As with many IR topics, LSI is a subject that from time to time surfaces in the search engine marketing industry through forums, conferences and events. Often, these discussions are limited to partially quoting or interpreting IR papers and patents.
At the time of writing, this industry doesn't provide its members with stepwise how-to instructions for implementing LSI, even though most of the information is available online. Consequently, search marketers don't understand LSI. In particular, they don't seem to grasp the advantages and limitations of the technique, what LSI is and is not, or what it can and cannot do for them or their clients.
The result is the dissemination of inaccurate information. For instance, some marketers have assigned meanings to the terms "latent" and "semantic" that are not found in the LSI literature. Others have become "experts" at quoting each other's hearsay. In an effort to sell their services, still others have come up with "LSI-based" software, videos, "lessons", tools, etc., that are at best a caricature of how a search engine or IR system implements LSI. Whatever these tools score is probably not what a search engine like Google or Yahoo! might be scoring.
Some of these marketing companies even display a tag cloud of words and try to sell the idea that they have a real or unique "LSI technology". Such clouds are easy to construct and link to search result pages: they can be built from any lookup list, thesaurus or search log file. No SVD is needed. In an effort to save face and avoid litigation from consumers, some of these purveyors of falsehood play with words and call theirs an "LSI-like", "LSI-based" or "LSI-driven" technology, or use similar snake-oil phrases. The funny thing is that other SEOs, bloggers and marketers fall for these tactics. And how can one forget the "LSI and Link Popularity" half-truths promoted by those that offer link-based services? As usual, SEO bloggers repeat such hearsay like parrots, since most of them don't really know how to SVD a simple matrix.
Since there is now a crew of search marketing firms claiming to sell all sorts of LSI-based SEO services and making a profit out of the ignorance of consumers, I am making public my case against these firms in the post
Latest SEO Incoherences (LSI) - My Case Against "LSI based" Snakeoil Marketers
Stay away from such marketing firms, their claims and businesses.
By providing an SVD-LSI tutorial, complete with step-by-step how-to calculations and examples, I hope to put to rest the many myths and misquotes disseminated by these search marketers. Here is a list of the most common misconceptions:
Latent Semantic Indexing (LSI) ...
This list of misconceptions, myths or plain lies was compiled from SEOBook, SearchEngineWatch, Cre8asiteforums, SEOChat, SEOMOZ, SeoRoundTable, Webmasterworld and similar forums. A sample of the Latest SEO Incoherences ("LSI") is available online.
This spreading of incorrect knowledge through electronic forums gives rise to a bursting phenomenon that in the past we have referred to as blogonomies. In our view, knowledge, citation importance or link weight transmitted through such bursts can be considered corrupted. Thus, this tutorial intends to dispel search marketing blogonomies relevant to SVD and LSI.
The fact is that query operators are not part of LSI. In addition to English text, LSI has been applied to text in Spanish and to text in other languages. LSI makes no presumptions regarding words in documents or queries, whether these are or should be nouns, verbs, adjectives or other forms of tokens. LSI is not on-topic analysis or what SEOs like to call "theming". Current LSI algorithms ignore word order (term sequences), though a Syntagmatic Paradigmatic Model and Predication Algorithm has been proposed to work around this.
Another misconception is that latent semantics is co-occurrence. Actually, it is not; at least, not first-order co-occurrence. LSI works well at identifying terms that induce similarity in a reduced space, but research from Dr. Tom Landauer and his group at the University of Colorado (19) indicates that over 99% of word pairs whose similarity is induced never appear together in a paragraph. Readers should be reminded that synonyms, or terms conveying a synonymity association, don't tend to co-occur, but tend to occur in the same, similar or related contexts. While LSI itself is not co-occurrence, term co-occurrence is important in LSI studies.
A persistent myth in search marketing circles is that LSI grants contextuality; i.e., that terms occur in the same context. This is not always the case. Consider two documents X and Y and three terms A, B and C, where:
A and B do not co-occur.
X mentions terms A and C
Y mentions terms B and C.
∴ A --- C --- B
The common denominator is C, so we define this relation as in-transit co-occurrence, since both A and B occur while in transit with C. This is called second-order co-occurrence and is a special case of high-order co-occurrence.
However, the mere fact that terms A and B are in-transit with C does not grant contextuality, as the terms can be mentioned in different contexts in documents X and Y. For example, this would be the case if X and Y discuss different topics. Long documents are more prone to this.
Even if X and Y are monotopic, they might be discussing different subjects. Thus, it would be fallacious to assume that high-order co-occurrence between A and B while in-transit with C equates to a contextuality relationship between the terms. Add polysemy to this and the scenario worsens, as LSI can fail to address polysemy.
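To make the in-transit idea concrete, here is a minimal sketch in Python. The tiny binary term-by-document matrix, the term names and the interpretation of "counts" are all hypothetical; real LSI studies work with weighted matrices rather than raw counts.

```python
import numpy as np

# Hypothetical binary term-by-document matrix: rows = terms A, C, B; columns = docs X, Y.
terms = ["A", "C", "B"]
M = np.array([[1, 0],    # A appears only in X
              [1, 1],    # C appears in both X and Y
              [0, 1]])   # B appears only in Y

first_order = M @ M.T                       # direct co-occurrence counts between term pairs
second_order = first_order @ first_order    # paths of length two (in-transit links)

print(first_order[0, 2])    # 0 -> A and B never co-occur directly
print(second_order[0, 2])   # 1 -> A and B are linked in transit through C
```

As the comments note, a nonzero second-order count does not by itself establish that A and B share a context; it only records the existence of a bridging term.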
There are other things to consider. LSI is computationally expensive and its overhead is amplified with large-scale collections. Certainly LSI is not associative indexing or root (stem) indexing, as some have suggested. It is not document indexing either; it is used with already indexed collections whose document terms have been prescored according to a particular term weight scheme. Furthermore, understanding a query, i.e., the assumption that the query must be of the natural language type, is not a requirement for implementing LSI.
In addition, the claim that terms must come from a specific portion of a document, like title tags, anchor text, links or a specific URL domain, plays no role and is not a requisite for implementing LSI. These false notions have been spread for a while, mostly by those that sell link-based services, who conveniently don't provide mathematical evidence on how LSI works since they cannot do the math.
It is true that some papers on large-scale distributed LSI mention the word "domain" in connection with LSI, but the term is used in reference to information domains, not URL domains or what are known as web sites. It is also true that LSI can be applied to collections that have been precategorized by web site domains, but this is merely filtering and preclassification and is not part of the SVD algorithm used in LSI.
Let me mention that the technique of singular value decomposition used in LSI is not an AI algorithm, but a matrix decomposition technique developed in the sixties, though SVD has been used in many environments, including AI (1-14). Roughly speaking, SVD itself is just one matrix decomposition technique. Certainly there is more than one way of decomposing and analyzing a given matrix; plenty of alternative techniques are described online (e.g., LU, QR, etc.).
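As a quick illustration of that point, the same matrix can be factored in several ways. The sketch below assumes NumPy and SciPy are available and uses a small arbitrary matrix; it simply checks that each factorization reproduces the original matrix.

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [0.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # SVD: A = U diag(s) V^T
Q, R = np.linalg.qr(A)                             # QR:  A = Q R
P, L, Uu = lu(A)                                   # LU:  A = P L U

for name, product in [("SVD", U @ np.diag(s) @ Vt),
                      ("QR ", Q @ R),
                      ("LU ", P @ L @ Uu)]:
    print(name, np.allclose(A, product))           # True for all three
```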
It is true that SVD, like NMF (non-negative matrix factorization), has been used to conduct email forensics, and that SVD has been used as an eavesdropping tool for identifying word patterns from web communities (15-18), but LSI is not a secret weapon from the Government designed to read your mind --at least not yet. :)
Another misconception is that LSI is CIRCA, a technology developed by Applied Semantics (acquired by Google). As mentioned at this IR Thoughts blog, this is another SEO blogonomy. CIRCA is based on ontologies, not on SVD. LSI is not based on ontologies, but on SVD.
When you think it through, there is no such thing as "LSI-friendly" documents. This is just another SEO myth promoted by certain search engine marketing firms to better market whatever they sell. In the last tutorial of this series (SVD and LSI Tutorial 5: LSI Keyword Research and Co-Occurrence Theory), we explain in detail why there is no such thing as an "LSI-friendly" document and why SEOs cannot use LSI to optimize any document for ranking purposes.
Last, but not least, while marketers should not lose any sleep over LSI, they should not go to the other extreme and advise others to brush off the whole thing or ignore the power of language manipulation, as some have suggested.
A recent neuropsychology study (11/28/2006) presented at the Annual Meeting of the Radiological Society of North America found that consumers' brain activity can be conditioned even before they set foot in a store. This occurs by associating brands with keywords, which requires proper language manipulation techniques. The study found that popular brands activated parts of the brain linked to self-identity and reward. The findings help us understand how the brain perceives and processes brands, aiding marketing initiatives.
fMRI revealed that well-known brands, regardless of the product, activated parts of the brain associated with positive emotional processing, self-identity and reward. Less well-known brands activated parts of the brain associated with negative emotional response. Lead researcher Dr. Christine Born, a radiologist at University Hospital, part of the Ludwig-Maximilians University in Munich, Germany, said: "This is the first functional magnetic resonance imaging test examining the power of brands. We found that strong brands activate certain areas of the brain independent of product categories. ... The vision of this research is to better understand the needs of people and to create markets which are more oriented towards satisfaction of those needs."
Marketing psychologist Paul Buckley, from the University of Wales Institute, Cardiff, said: "Marketing is all about learning by association - companies constantly push the same image over and over again from a range of media. So people associate a famous brand with positive imagery, and you would expect that positive imagery to trigger off positive emotional responses." In the December 2006 issue of IR Watch - The Newsletter we covered these studies in detail for our subscribers. Also, in the January 2007 issue we explained why marketers that dismiss the power of language manipulation don't really understand consumers' behavior and how it is conditioned. If you are a subscriber, read the complete reports in:
IR Watch 2006-4: C-Indices and Keyword-Brand Associations
IR Watch 2007-1: Query Reformulations: How Do Users Search?
This tutorial is organized as follows.
Part I is an introduction to SVD and LSI --you are reading it right now.
Part II shows some justifications used when computing singular values.
Part III provides SVD stepwise how-to-calculations.
Part IV covers LSI and shows readers how documents and queries are scored.
Part V explains that co-occurrence, not the nature of the terms, is what makes LSI form clusters.
All parts include plenty of illustrations, examples and exercises.
Assumptions and Prerequisites
I assume readers have a linear algebra background or understand the material covered in my earlier matrix tutorials.
You may skip those tutorials if you are familiar with matrix decomposition. Most of the basic matrix operations and terminology to be used are explained or covered in them. If you are not familiar with linear algebra or have not read the tutorials, please STOP AND READ THEM.
The teaching technique I will be using is the one described in previous tutorials, which is intended to mimic tf*IDF schemes: global knowledge is presented first and then associated with local, more specific knowledge.
In addition, instead of lecturing about SVD I want to show you how things work --step by step. So, grab a pencil and a stack of papers. After manually computing SVDs for small matrices, you might want to use software to double check results.
With large matrices (of order greater than 3) you might want to use a software package like MatLab or an open source alternative like SciLab, which is a free download. For relatively small matrices I recommend matrix calculators like Bluebit and its Matrix ActiveX Component. You can use this component in your SVD or LSI project to impress others. To showcase on the Web the knowledge you are about to acquire, I recommend JavaScript utilities like this Singular Value Decomposition Calculator. Be aware that some of these tools come with their own learning curves.
Grab your favorite drink and let the fun begin.
In 1965 G. Golub and W. Kahan introduced Singular Value Decomposition (SVD) as a decomposition technique for calculating the singular values, pseudo-inverse and rank of a matrix (1). The conventional way of doing this was to convert the matrix to row-echelon form; the rank of the matrix is then given by the number of nonzero rows of the echelon form.
SVD is an entirely different approach. The technique decomposes a matrix A into three new matrices
Equation 1: A = U S V^T

where

U is a matrix whose columns are the eigenvectors of the AA^T matrix. These are termed the left eigenvectors.
S is a matrix whose diagonal elements are the singular values of A. This is a diagonal matrix, so its nondiagonal elements are zero by definition.
V is a matrix whose columns are the eigenvectors of the A^TA matrix. These are termed the right eigenvectors.
V^T is the transpose of V.
The decomposition not only provides a direct method for computing the rank of a matrix, but exposes other equally interesting properties and features of matrices.
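A minimal numerical sketch of Equation 1, assuming NumPy and a small made-up term-by-document matrix (any of the matrix calculators mentioned earlier will produce the same factors, up to sign):

```python
import numpy as np

# A small hypothetical term-by-document matrix (rows = terms, columns = documents).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # Equation 1: A = U S V^T
S = np.diag(s)

print("singular values:", s)                        # ordered from largest to smallest
print("rank of A      :", int(np.sum(s > 1e-10)))   # rank = number of nonzero singular values
print("A recovered    :", np.allclose(A, U @ S @ Vt))
```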
When computing the SVD of a matrix, it is desirable to reduce its dimensions by keeping its first k singular values. Since these are ordered in decreasing order along the diagonal of S, and this ordering is preserved when constructing U and V^T, keeping the first k singular values is equivalent to keeping the first k rows of S and V^T and the first k columns of U. Equation 1 then reduces to
Equation 2: A* = U* S* V*^T
This process is termed dimensionality reduction, and A* is referred to as the Rank k Approximation of A, or the "Reduced SVD" of A. This is exactly what is done in LSI. The top k singular values are selected as a means of developing a "latent semantics" representation of A that is now free of noisy dimensions. This "latent semantics" representation is a specific data structure in a low-dimensional space in which documents, terms and queries are embedded and compared. This hidden or "latent" data structure is masked by noisy dimensions and becomes evident after the SVD.
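A hedged sketch of Equation 2 follows; the matrix and the choice k = 2 are arbitrary. Dimensionality reduction amounts to slicing the three factor matrices:

```python
import numpy as np

def rank_k_approximation(A, k):
    """Return A*, the Rank k Approximation of A (Equation 2)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the first k columns of U, the first k singular values, and the first k rows of V^T.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

A_star = rank_k_approximation(A, 2)   # the "Reduced SVD" used in LSI
print(A_star)
```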
This discovery of "latent" data structures (or masked information) can also be achieved when SVD is applied to problems and systems that have nothing to do with documents, queries, textual data or word semantics at all. Examples are systems encountered in Biology, Chemistry, Engineering, Medical Sciences and other Applied Sciences. A search in Google reveals that SVD is a well-known chemometric technique that has even been applied to HPLC Chemistry (high performance -or pressure- liquid chromatography).
This is not surprising. In a sense, dimensionality reduction is a noise reduction process. Thus, SVD belongs to a class of dimensionality reduction techniques that deal with the uncovering of latent data structures. Other techniques for the discovery of latency have been developed, and these compare favorably with SVD. An example is Non-Negative Matrix Factorization (NMF) (15, 17).
Dimensionality reduction is a somewhat arbitrary process. Deciding how many k dimensions to keep can lead to the so-called "dimensionality reduction curse", in which performance is affected. To deal with this "curse", workarounds based on Fractal Geometry have been proposed. For instance, Kumaraswamy has suggested that the loss of performance of a method -in LSI, using dimensionality reduction and in data storage, using vector quantization- can be related to the fractal dimension of the data set under consideration (2). This and other workarounds (19) for dealing with the "dimensionality reduction curse" are outside the scope of this tutorial.
Since its introduction more than 40 years ago, SVD has become a standard decomposition technique in linear algebra. It is a great technique for uncovering hidden or "latent" data structures while removing noise. Many problems encountered in Engineering and the Sciences have been addressed with SVD (3-7).
In 1988 Deerwester et al. used SVD to deal with the vocabulary problem in human-computer interaction and called their approach Latent Semantic Indexing (LSI) (8, 9), also known as LSA (Latent Semantic Analysis). Thus, LSI is one application of SVD, in the same way that IS (Information Space) is only one application of PCA (Principal Component Analysis), a technique often mistaken for LSI. The LSI, SVD, PCA and IS acronyms have been discussed in our IR Thoughts blog (10).
LSI has several bottlenecks associated with storage considerations, difficulties in updating the underlying database, and other speed and reliability considerations. Since both SVD and LSI are computationally expensive, LSI is not yet a practical solution for large commercial collections that must return results in less than a second. Even though Telcordia has proposed a Distributed LSI approach based on partitioning collections into conceptual information domains, it remains to be seen whether such an approach is feasible in a commercial environment full of marketing noise and vested interests (11).
On the Web, SVD and LSI have been used by search engines as auxiliary techniques on representative samples or small collections that have been pretreated. An example of this is Yahoo!'s pay-for-performance paid collection (12). It remains to be seen whether advances in the field and better architectures will make LSI suitable for search engine collections consisting of billions of documents, or for searchable collections of any size.
LSI is great at addressing synonymy (different words with same meaning), but can fail to address polysemy (same word with different meanings). A solution that combines LSI with SI (synchronic imprints) has been proposed (13). A Yahoo! Research group (14) has proposed an improved algorithm known as VLSI (Variable Latent Semantic Indexing).
Now that we have a brief background on SVD and LSI, let's discuss how singular value decomposition works. The best way to grasp this is through a visual example.
The geometrical interpretation of SVD has been described elsewhere. I'm going to use a simple analogy. Picture an arrow rotating in two dimensions and describing a circle. See Figure 1.
Figure 1. A rotating arrow of fixed length describing a circle.
Now suppose that, while rotating this arrow, we stretch and squeeze it in a variable manner, up to a maximum and a minimum length. Instead of a circle, the arrow now describes a two-dimensional ellipsoid.
Figure 2. A rotating arrow of variable length describing an ellipsoid.
This is exactly what a matrix does to a vector.
When we multiply a vector X by a matrix A to create a new vector AX = Y, the matrix performs two operations on the vector: rotation (the vector changes coordinates) and scaling (the length of the vector changes). In terms of SVD, the maximum stretching and minimum squeezing are determined by the singular values of the matrix. These are values inherent, unique or "singular" to A that can be uncovered by the SVD algorithm. In the figure, the effect of the two largest singular values, s1 and s2, has been labeled. In Part 2 of this tutorial we will compute these values.
For now, the important thing to remember is this: singular values describe the extent to which the matrix distorts the original vector and, thus, can be used to highlight which dimension(s) is/are affected the most by the matrix. Evidently, the largest singular value has the greatest influence on the overall dimensionality of the ellipsoid. In geometric terms, Figures 1 and 2 show that multiplying a vector by a matrix is equivalent to transforming (mapping) a circle into an ellipsoid.
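The geometric claim can be checked numerically. In the sketch below (with an arbitrary 2 x 2 matrix), the longest and shortest vectors of the mapped circle match the largest and smallest singular values:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

theta = np.linspace(0.0, 2.0 * np.pi, 2000)
circle = np.vstack([np.cos(theta), np.sin(theta)])   # unit vectors on a circle (Figure 1)
ellipse = A @ circle                                 # the same vectors after A acts on them (Figure 2)
lengths = np.linalg.norm(ellipse, axis=0)

s1, s2 = np.linalg.svd(A, compute_uv=False)
print(lengths.max(), "~", s1)   # maximum stretching = largest singular value
print(lengths.min(), "~", s2)   # minimum squeezing  = smallest singular value
```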
Now, instead of a rotating vector, picture a set of m-dimensional vectors, all perpendicular (orthogonal) to each other, with different lengths and defining an m-dimensional ellipsoid. The ellipsoid is embedded in an m-dimensional space, where m = k, with k being the number of nonzero singular values. If we eliminate dimensions by keeping the three largest singular values, a three-dimensional ellipsoid is obtained. This is a Rank 3 Approximation.
Now suppose that, compared with the two largest singular values, the third one is small enough that we can ignore it. This is equivalent to removing one dimension: geometrically, the 3-dimensional ellipsoid collapses into a 2-dimensional ellipsoid, and we obtain a Rank 2 Approximation. Obviously, if we keep only one singular value the ellipsoid collapses into a one-dimensional shape (a line). Figure 3 roughly illustrates this reduction.
Figure 3. Dimensionality Reduction.
This is why we say that keeping the top k singular values is a dimensionality reduction process. The main idea is this: singular values can be used to highlight which dimensions are affected the most when a vector is multiplied by a matrix. Thus, the decomposition allows us to determine which singular values can be retained; hence the expression "singular value decomposition".
In the case of LSI, SVD unveils the hidden or "latent" data structure, in which terms, documents and queries are embedded in the same low-dimensional space. This is a significant departure from term vector models. In the vector space model, each unique term of the index is considered a dimension of a term space. Documents and queries are then represented as vectors in this term space. In this space terms are considered independent of each other. However, this runs contrary to common knowledge, since term dependency can arise in written or spoken language in many different ways. The most common are through (a) synonymity and (b) polysemy.
LSI performs better than vector space models by embedding documents, terms and queries in the same space. This accounts for high-order co-occurrence phenomena like in-transit co-occurrence. In-transit co-occurrence accounts for the fact that terms do not need to actually be in a document to be relevant to its content. They can co-occur while "in transit".
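Part IV works through this scoring in detail; as a preview, here is a minimal hedged sketch of how documents and a query might be embedded in the same rank-k space. The matrix, the query and the choice k = 2 are made up, and term weighting is ignored.

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],      # hypothetical term-by-document matrix
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
q = np.array([1.0, 0.0, 1.0, 0.0])  # hypothetical query vector over the same terms

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

docs_k = (Sk @ Vtk).T               # document coordinates in the reduced space
q_k = q @ Uk @ np.linalg.inv(Sk)    # the query folded into the same reduced space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(q_k, d) for d in docs_k]   # rank documents against the query
print(scores)
```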
An example of in-transit co-occurrence is synonyms. Synonyms don't tend to co-occur together (often they have both low pairwise c-indices and low Jaccard indices from a cluster association matrix), but they tend to co-occur in the same context --this can be inspected by measuring their high-order c-indices.
It should be pointed out that "in-transit" co-occurrence is not unique to synonyms. Terms not related by a synonymity association can be in transit, and their high-order co-occurrence can be examined with c-indices.
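For readers who want to check this on their own data, the Jaccard index between two terms' document sets is easy to compute. The postings below are hypothetical; the author's c-indices are covered in his newsletter and are not reproduced here.

```python
def jaccard(docs_a, docs_b):
    """Jaccard index between the sets of documents containing term a and term b."""
    a, b = set(docs_a), set(docs_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Hypothetical postings: synonyms rarely share documents, so their Jaccard index is low.
print(jaccard({1, 2, 3}, {4, 5}))      # 0.0 -> no first-order co-occurrence
print(jaccard({1, 2, 3}, {3, 4, 5}))   # 0.2 -> weak direct co-occurrence
```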
Figure 4 illustrates these relations. The Venn diagrams represent documents relevant to specific terms. The (dog, canine) and (canine, molar) pairs have a synonymity association, but this association is not present in the (dog, molar) pair or between the other pairs. The figure also shows that in-transit co-occurrence can be observed not only between individual tokens, but also between distinct contexts.
Figure 4. In-transit co-occurrence between tokens and contexts.
To simplify, in the figure we have ignored other forms of co-occurrences between n-grams like multiple overlapping regions, which can actually give rise to an LSI "curse".
While LSI addresses synonyms, it can fail to address polysems (terms with different meanings), since dimensionality reduction can incorrectly conflate critical dimensions. So the popular opinion that LSI perfectly addresses synonyms and polysems is incorrect. Selection of the k value and the right cluster is done arbitrarily, by trial and error, and as such might become a "curse". Several workarounds have been proposed, but I prefer not to discuss these at this time.
In this part of the tutorial we have introduced SVD and LSI. Several myths and misconceptions regarding LSI have been mentioned. Some limitations, advantages and applications of SVD and LSI have been discussed. A geometrical visualization for interpreting the effect of a matrix on a vector has been provided.
It is now time to go over the mathematical basis and justifications of the SVD technique. This will be covered in Parts II and III of the tutorial. In particular, we want to show you how singular values are computed, along with the stepwise how-to calculations leading to the decomposition, reconstruction and approximation of matrices.