Latent Semantic Indexing (LSI)

 

Because of the tremendous diversity in the words people use to describe the same document, lexical matching methods are necessarily incomplete and imprecise. Using the singular value decomposition (SVD), one can take advantage of the implicit higher-order structure in the association of terms with documents by computing the SVD of large, sparse term-by-document matrices. Terms and documents represented by the 200-300 largest singular vectors are then matched against user queries. This retrieval method is called Latent Semantic Indexing (LSI) because the subspace captures important associative relationships between terms and documents that are not evident in individual documents.

 

 

LSI assumes that there is some underlying or latent structure in word usage that is partially obscured by variability in word choice.
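This latent structure can be seen in a tiny example. Below is a minimal numpy sketch with an invented term-document matrix: "car" and "auto" never co-occur in any document, yet because both co-occur with "engine", a rank-2 SVD places them close together in the latent space.

```python
import numpy as np

# Toy term-document matrix (terms x documents); the counts are invented
# purely for illustration.
terms = ["car", "auto", "engine", "flower"]
A = np.array([
    [2., 0., 0.],   # car    appears only in d0
    [0., 2., 0.],   # auto   appears only in d1
    [1., 1., 0.],   # engine appears in d0 and d1
    [0., 0., 3.],   # flower appears only in d2
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]   # scaled term coordinates in the latent space

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos(term_vecs[0], term_vecs[1]))   # car vs auto: ~1.0 despite no co-occurrence
print(cos(term_vecs[0], term_vecs[3]))   # car vs flower: ~0.0
```

In the raw matrix the "car" and "auto" rows are orthogonal; only the shared association with "engine" makes them similar after projection.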

 

 

SVD (Singular Value Decomposition):

 

Given an m×n matrix A, where without loss of generality m ≥ n and rank(A) = r, the singular value decomposition of A, denoted SVD(A), is defined as

A = UΣVᵀ

where UᵀU = VᵀV = Iₙ

and Σ = diag(σ₁, ..., σₙ),

σᵢ > 0 for 1 ≤ i ≤ r,

σⱼ = 0 for j ≥ r + 1.
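These defining properties can be checked numerically. The sketch below uses numpy's thin SVD on a small matrix with arbitrary illustrative entries:

```python
import numpy as np

# A small m x n matrix with m >= n; entries are arbitrary illustrative values.
A = np.array([[3., 1.],
              [1., 3.],
              [1., 1.],
              [0., 0.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD: U is m x n

# Orthonormal columns: U^T U = V^T V = I_n
assert np.allclose(U.T @ U, np.eye(2))
assert np.allclose(Vt @ Vt.T, np.eye(2))

# Reconstruction: A = U Sigma V^T, singular values in non-increasing order
assert np.allclose(U @ np.diag(s) @ Vt, A)
assert s[0] >= s[1] >= 0
```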

 

The first r columns of the orthogonal matrices U and V define the orthonormal eigenvectors associated with the nonzero eigenvalues of AAᵀ and AᵀA, respectively. The columns of U and V are referred to as the left and right singular vectors, respectively, and the singular values of A are defined as the diagonal elements of Σ, which are the nonnegative square roots of the n eigenvalues of AᵀA.
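The eigenvector characterization above is easy to verify on a small example (entries chosen only so the singular values are distinct):

```python
import numpy as np

# Columns of V are eigenvectors of A^T A with eigenvalues sigma_i^2,
# and columns of U are eigenvectors of A A^T with the same eigenvalues.
A = np.array([[2., 0.],
              [1., 1.],
              [0., 2.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

assert np.allclose(A.T @ A @ V, V @ np.diag(s ** 2))
assert np.allclose(A @ A.T @ U, U @ np.diag(s ** 2))
```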

 

 

Interpretation of SVD components within LSI.

 

Aₖ = best rank-k approximation to A

U = term vectors

Σ = singular values

V = document vectors

m = number of terms

n = number of documents

k = number of factors

r = rank of A
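The truncation Aₖ = UₖΣₖVₖᵀ keeps only the k largest singular triplets; by the Eckart-Young theorem it is the best rank-k approximation to A in the Frobenius norm. A minimal sketch on a random matrix (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 4, 2          # m terms, n documents, k factors
A = rng.random((m, n))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A_k = U_k Sigma_k V_k^T: keep the k largest singular triplets.
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

assert np.linalg.matrix_rank(Ak) == k
# The Frobenius error equals the root of the sum of squared discarded
# singular values (Eckart-Young).
err = np.linalg.norm(A - Ak, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```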

 

The user query can be represented as a pseudo-document in the k-dimensional space by

q̂ = qᵀUₖΣₖ⁻¹

where q is the vector of term frequencies of the query and Uₖ, Σₖ are the rank-k truncations of U and Σ.
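Query folding can be sketched as follows; the term-document counts and the query are invented for illustration, and documents are then ranked by cosine similarity between q̂ and the rows of Vₖ:

```python
import numpy as np

# Toy term-document matrix (5 terms x 4 documents); counts are illustrative.
A = np.array([[1., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 1.],
              [1., 0., 0., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T   # rows of Vk = document vectors

# Fold the query's raw term counts into the latent space: q_hat = q^T U_k S_k^{-1}
q = np.array([1., 1., 0., 0., 0.])       # query containing terms 0 and 1
q_hat = q @ Uk @ np.linalg.inv(Sk)

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Rank documents by cosine similarity to the folded-in query.
scores = [cos(q_hat, Vk[j]) for j in range(Vk.shape[0])]
best = int(np.argmax(scores))
```

Dividing by Σₖ rescales the query to the same coordinate system as the document vectors in Vₖ, so cosine comparisons are meaningful.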

