2.2 Martin Klein and Michael Nelson’s study on Lexical Signature
Researchers have spent considerable effort exploring how many terms an LS should contain to give the best result. After extensive experiments, Martin Klein and Michael L. Nelson concluded that LSs of 5 to 7 terms are good enough for robust hyperlinks [2]. Klein and Nelson not only described an LS as a small set of terms derived from a document that captures the “aboutness” of that document [3], but also showed that an LS derived from a web page can be used to rediscover the page at a different URL as well as to find relevant pages on the Internet [3]. In experiments on a large number of web pages from 1996 to 2007, downloaded from the Internet Archive (http://www.archive.org/index.php), they found that 5-, 6- and 7-term LSs performed best at returning the URLs of interest among the top 10 results from Google, Yahoo, MSN Live, Internet Archive, European Archive, CiteSeer and NSDL [3]. Applying equations 2-1 to 2-2 gives the LS score versus the number of terms in each query, shown in Figure 2.4.
Figure 2.4 LS Performance by Number of Terms [3]
Their experiments, which revisited Phelps and Wilensky’s research, also showed that when LS terms were chosen in decreasing TF-IDF order, 50% of URLs were returned as the top-1 result and 30% of URLs could not be relocated [3]. They also carefully studied techniques for estimating IDF values, which is a non-trivial problem when generating LSs for web pages. In their 2008 paper, “A comparison of techniques for estimating IDF values to generate lexical signatures for the web” [19], they introduced three quite different ways to estimate term IDF values and examined their performance:
1. A local universe: a set of pages downloaded from 98 websites, covering each month from 1996 to September 2007 [19].
2. Screen scraping of the Google web interface, performed in January 2008 [19].
3. The Google N-Gram (NG) data, distributed in 2006 [19].
They compared these three IDF estimation techniques and claimed that the local-universe data and the screen-scraping data both give results similar to their baseline, the Google N-Gram data.
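The local-universe approach can be illustrated with a minimal sketch: IDF values are estimated from a small in-memory corpus, and the LS is taken as the top-k terms in decreasing TF-IDF order. This is not Klein and Nelson's implementation. The function names (`tokenize`, `estimate_idf`, `lexical_signature`), the simple regular-expression tokenizer, the unsmoothed `log(N / df)` form of IDF and the handling of unseen terms are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def estimate_idf(corpus):
    """Estimate IDF from a local corpus (a list of document strings).

    Uses the plain log(N / df) form; real systems often add smoothing.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        # Count each term once per document.
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def lexical_signature(document, idf, k=5, default_idf=0.0):
    """Return the k terms of the document with the highest TF-IDF values.

    Terms missing from the IDF table get default_idf (an assumption;
    a real system would need a policy for unseen terms).
    Ties are broken alphabetically so the result is deterministic.
    """
    tf = Counter(tokenize(document))
    scores = {term: freq * idf.get(term, default_idf) for term, freq in tf.items()}
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "the bird flew over the mat",
    ]
    idf = estimate_idf(corpus)
    print(lexical_signature(corpus[0], idf, k=2))
```

Swapping in a different IDF source (for example, document frequencies taken from search engine result counts or from an N-gram collection) would only change `estimate_idf`; the term selection step stays the same.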
Besides listing the detailed percentages of successful and failed URL retrievals, they used the following two equations from [3] to evaluate LSs: the fair score and the optimistic score.
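$$S_{fair}(i) = \frac{R_{max} + 1 - R(i)}{R_{max}}$$ [3] (2-1)
$$\overline{S_{fair}} = \frac{1}{N} \sum_{i=1}^{N} S_{fair}(i)$$ [3] (2-2)
R(i) is the rank of the ith page in the results returned by the SE for the query; the larger R(i) is, the lower the fair score. N is the total number of sample pages in their experiments, which is 98, and $\overline{S_{fair}}$ is the average fair score.
$$S_{opt}(i) = \frac{1}{R(i)}$$ [3] (2-3)
$$\overline{S_{opt}} = \frac{1}{N} \sum_{i=1}^{N} S_{opt}(i)$$ [3] (2-4)
Unlike Sfair, the optimistic score Sopt is determined only by the page’s rank. $\overline{S_{opt}}$ is the average optimistic score.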
They set Rmax = 100, which guarantees that Sfair is positive whenever the desired page appears in the first 100 results from the SE. If R(i) > Rmax, that is, the desired page does not appear in the first 100 results, then Sfair = 0 and Sopt = 0. The final scores were computed for queries of 2 to 15 terms and ranged from 0.2 to 0.8. They also reported that the scores for one page over the years 1996 to 2007 ranged from 0.1 to 0.6 [3]. Further details and the score curves from their paper are not included in this project report.
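The scoring rules can be made concrete with a short sketch. It assumes the linear form Sfair = (Rmax + 1 - R(i)) / Rmax and the reciprocal form Sopt = 1 / R(i), with both scores set to 0 when the page is not found within Rmax results. The function names and the use of `None` for "not returned at all" are illustrative choices, not part of [3].

```python
R_MAX = 100  # maximum rank considered, as set in the experiments


def fair_score(rank, r_max=R_MAX):
    """Linear score: 1.0 at rank 1, 1/r_max at rank r_max, 0 beyond it.

    rank is None when the page was not returned at all (assumed convention).
    """
    if rank is None or rank > r_max:
        return 0.0
    return (r_max + 1 - rank) / r_max


def optimistic_score(rank, r_max=R_MAX):
    """Reciprocal-rank score: rewards top ranks far more than lower ones."""
    if rank is None or rank > r_max:
        return 0.0
    return 1.0 / rank


def mean_score(ranks, score_fn):
    """Average a per-page score over all N sample pages."""
    return sum(score_fn(rank) for rank in ranks) / len(ranks)


if __name__ == "__main__":
    # Hypothetical ranks for four pages: two found, one missing, one beyond R_MAX.
    ranks = [1, 2, None, 101]
    print(mean_score(ranks, fair_score))
    print(mean_score(ranks, optimistic_score))
```

Under these assumptions, moving a page from rank 1 to rank 2 costs only 0.01 of fair score but halves the optimistic score, which is why the two averages can differ noticeably on the same set of ranks.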