2.1 Seung-Taek Park's Studies on Lexical Signatures
Before reviewing the related studies, the term "Lexical Signature" (LS) needs to be defined. In Chapter 1, an LS was treated simply as an equivalent of "key words/terms/phrases", but the related works have described it more precisely. Thomas A. Phelps and Robert Wilensky defined an LS as a relatively small set of terms that can effectively discriminate a given document from all the others in a large collection [2]. They also proposed a way to create an LS that meets this criterion: select the first few terms of the document that have the highest "term frequency-inverse document frequency" (TF-IDF) values [2] (a code sketch of this selection follows the two lists below). Martin Klein and Michael L. Nelson described the LS as a small set of terms derived from a document that captures the "aboutness" of that document [3]. S. T. Park analyzed Phelps and Wilensky's theory and concluded from their paper that LSs have the following characteristics [4][5]:
(1) LSs should extract the desired document and only that document [5].
(2) LSs should be robust enough to find documents that have been slightly modified [5].
(3) New LSs should have minimal overlap with existing LSs [5].
(4) LSs should have minimal search engine dependency [5].
Park also raised two criteria of his own, aimed at helping users find similar or relevant documents:
(1) LSs should easily extract the desired document. When a search engine returns more than one document, the desired document should be the top-ranked one [5].
(2) LSs should be useful enough to find relevant information when the precise documents being searched for are lost [5].
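Phelps and Wilensky's TF-IDF selection rule can be made concrete with a short sketch. The Java method below uses the common weighting tf(t) * log(N / df(t)); the parameter names docFreq (collection-wide document frequencies) and totalDocs are assumptions made for illustration, not Phelps and Wilensky's exact implementation.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class LexicalSignature {

        // Returns the k terms of a document with the highest TF-IDF
        // scores, using the common weighting tf(t) * log(N / df(t)).
        public static List<String> topTerms(List<String> docTerms,
                                            Map<String, Integer> docFreq,
                                            int totalDocs,
                                            int k) {
            // Term frequencies within this document.
            Map<String, Integer> tf = new HashMap<>();
            for (String t : docTerms) {
                tf.merge(t, 1, Integer::sum);
            }

            // Score every distinct term by TF-IDF.
            List<Map.Entry<String, Double>> scored = new ArrayList<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                int df = Math.max(1, docFreq.getOrDefault(e.getKey(), 1));
                double idf = Math.log((double) totalDocs / df);
                scored.add(Map.entry(e.getKey(), e.getValue() * idf));
            }
            scored.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));

            // Keep the k best-scoring terms as the signature.
            List<String> signature = new ArrayList<>();
            for (int i = 0; i < Math.min(k, scored.size()); i++) {
                signature.add(scored.get(i).getKey());
            }
            return signature;
        }
    }

For example, topTerms(docTerms, docFreq, totalDocs, 5) would produce a five-term LS for the document.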
S. T. Park's studies on LSs are especially insightful and helpful for this project. Typing "Lexical Signature" as a search query into Google is likely to return, among the first ten results, both of his papers: "Analysis of lexical signatures for finding lost or related documents" [4] and "Analysis of lexical signatures for improving information persistence on the www" [5].
S. T. Park conducted a large number of experiments with the LS generation methods TF, DF, TFIDF, PW, TF3DF2, TF4DF1, TFIDF3DF2 and TFIDF4DF1, both separately and in combination [4][5], and compared the results from Yahoo, MSN and AltaVista in histograms covering unique results, first-ranked results and top-10 results [5]. When a retrieval counts as successful either when the two URLs match or when the cosine similarity of the two documents exceeds 0.95, the re-finding success rate falls between 60% and 70%. If only an exact URL match counted as success, the re-finding/re-locating rate would probably be lower.
Figures 2.1, 2.2 and 2.3: histograms of Park's re-finding results for Yahoo, MSN and AltaVista [5].
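The cosine measure used as the success criterion above can be sketched as follows; this is a minimal illustration over raw term-frequency vectors, whereas the experiments in [4][5] may weight the vectors differently.

    import java.util.Map;

    public class CosineSimilarity {

        // Cosine similarity between two documents represented as
        // term-frequency vectors; 1.0 means identical direction, and
        // the experiments above treat > 0.95 as "the same document".
        public static double cosine(Map<String, Integer> a,
                                    Map<String, Integer> b) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (Map.Entry<String, Integer> e : a.entrySet()) {
                dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
                normA += (double) e.getValue() * e.getValue();
            }
            for (int v : b.values()) {
                normB += (double) v * v;
            }
            return (normA == 0.0 || normB == 0.0)
                    ? 0.0
                    : dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }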
In this project, the definition of LS follows S. T. Park's theory: LSs are the key terms from a web page that both identify the page uniquely among others and let search engines retrieve the most relevant page effectively. At the same time, in the experiments an LS cannot simply be taken as the unchanged terms (words) of the document. Some preprocessing and transformation steps must be applied before processing the web pages/documents in the information-retrieval fashion, such as removing stop words and merging words with different forms but close meanings into one unique term, like "lexica" and "lexical" to "lex". Beyond this, picking out only nouns and verbs, or nouns and adjectives, from the text is also feasible based on a word-form database. These steps are implemented in Chapter 4 with LUCENE and WORDNET, two open-source Java projects that are well accepted in industry; a sketch of such a preprocessing pipeline is given below.
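As a minimal sketch of the LUCENE side of this preprocessing, the method below tokenizes a page's text, lower-cases it, removes English stop words and stems each word, so that different word forms collapse into one term. It assumes a recent Lucene release in which EnglishAnalyzer takes no constructor arguments; the field name "content" is arbitrary, and the WORDNET-based part-of-speech filtering would be a separate pass over the resulting terms.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class PageTermExtractor {

        // Tokenizes, lower-cases, removes English stop words and stems
        // the text; EnglishAnalyzer bundles all of these filters.
        public static List<String> extractTerms(String text) throws IOException {
            List<String> terms = new ArrayList<>();
            try (Analyzer analyzer = new EnglishAnalyzer();
                 TokenStream stream = analyzer.tokenStream("content", text)) {
                CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
                stream.reset();
                while (stream.incrementToken()) {
                    terms.add(term.toString());
                }
                stream.end();
            }
            return terms;
        }
    }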