3.1 Experiment Steps
The project is designed to find the best query (the LS query) that represents a web page and can easily be found again through a search engine. The test pages are taken from the surface web, which can be accessed without any authorization restrictions. The experiments are arranged in the following general steps:
(1) Collect a large number of web pages to be used for LS extraction and re-finding/relocation, download them to local disk as test pages, and save their URLs at the same time. The source of the test pages requires careful consideration: among the many kinds of web sites online today, some are ill-formatted or carry little information and are clearly not ideal experimental sources. The details of page selection and crawling are non-trivial and are introduced in section 3.3.1.
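As a minimal sketch of this step, the loop below downloads each page from a seed list and records its URL alongside the saved file. The file names and the seed list "urls.txt" are assumptions for illustration, not the project's actual tooling.

    import hashlib
    import pathlib
    import urllib.request

    OUT_DIR = pathlib.Path("test_pages")
    OUT_DIR.mkdir(exist_ok=True)

    def download_pages(url_file="urls.txt"):
        for url in pathlib.Path(url_file).read_text().splitlines():
            url = url.strip()
            if not url:
                continue
            try:
                html = urllib.request.urlopen(url, timeout=10).read()
            except OSError:
                continue  # skip dead or unreachable links
            name = hashlib.md5(url.encode()).hexdigest() + ".html"
            (OUT_DIR / name).write_bytes(html)
            # keep the URL so retrieval success can be checked later (step 5)
            with open(OUT_DIR / "index.tsv", "a", encoding="utf-8") as f:
                f.write(f"{name}\t{url}\n")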
(2) HTML parsing and text extraction are needed before extracting LSs. The reasons for this preprocessing and its steps are given in section 3.3.2.
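One common way to do this preprocessing is sketched below, using BeautifulSoup as an assumed parser; the project's actual parsing pipeline is the one described in section 3.3.2.

    from bs4 import BeautifulSoup

    def extract_text(html):
        soup = BeautifulSoup(html, "html.parser")
        # drop non-content elements that would pollute the lexical signature
        for tag in soup(["script", "style", "noscript"]):
            tag.decompose()
        return soup.get_text(separator=" ", strip=True)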
(3) Apply the algorithms of chapter 2 to extract lexical signatures from the downloaded pages. In chapter 4, TF, DF, TFIDF, TF3DF2, TF4DF1, TFIDF3DF2, and TFIDF4DF1, as well as word rank and sentence rank, are applied to the pages produced by step 2. Unlike S. T. Park’s experiments, this project accepts a varying number of terms per query; TF3DF2 and TFIDF3DF2 are therefore generalized so that the 3:2 ratio between TF- or TFIDF-ordered terms and DF-ordered terms is kept while the total number of terms changes. For convenience, the selections based on TF, DF, and their mixed forms are listed from (a) to (h) below, exactly as described in S. T. Park’s paper (a compact sketch of these rules follows the list). The question of how many terms per query produce better search results was studied by Martin and Michael, who claim that a 5-term query is the most efficient for getting the desired result into a search engine’s top 10 [3]. More terms per query also makes sentence rank more feasible. Meanwhile, Google raised its web search limit to 32 words back in 2005. In this project, up to 15 words are allowed per query, and any further words are ignored. The effect of the number of terms per query is an interesting topic and is explored in chapter 4.
a) TF: Select terms in decreasing term frequency (TF) order. If there is a tie, then pick terms based on increasing document frequency (DF). If there is another tie, randomly select the terms [5].
b) DF: Select terms in increasing DF order. If there is a tie, then pick terms based on decreasing TF order. If there is another tie, randomly select the terms [5].
c) TFIDF: Select terms in decreasing term-frequency inverse-document frequency (TFIDF) order. If there is a tie, then pick terms based on increasing DF order. If there is another tie, randomly select the terms [5].
d) PW: Select terms based on Phelps and Wilensky’s [3] method: first, list terms in TF order and then pick the LS in decreasing TFIDF order (i.e., decreasing TFIDF order where the TF value is capped at five). If there is a tie, then pick terms based on increasing DF order. If there is another tie, randomly select the terms [5].
e) TF3DF2: Select two terms in increasing DF order. If there is a tie, then pick terms based on decreasing TF order. If there is another tie, randomly select the terms. Then filter out all terms which have DF value 1. Select three terms maximizing TF. If there is a tie, it is resolved the same way as in the TF method [5].
f) TF4DF1: Select one term based on increasing DF order first. Then filter out all terms which have DF value 1. Select four terms maximizing TF [5].
g) TFIDF3DF2: Select two terms based on increasing DF order first. Then filter out all terms which have DF value 1. Select three terms maximizing TFIDF [5].
h) TFIDF4DF1: Select one term based on increasing DF order first. Then filter out all terms which have DF value 1. Select four terms maximizing TFIDF [5].
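For concreteness, the following Python sketch illustrates rules (a), (b), and (c) and the generalized mixed forms (e) and (g). It is a minimal illustration under stated assumptions: the function names are invented for this sketch, the random tie-break is omitted, and tf and df are assumed to be precomputed term-frequency and document-frequency maps over a collection of N documents.

    import math

    def tfidf(term, tf, df, N):
        # classic TF * IDF weight; N is the number of documents
        return tf[term] * math.log(N / df[term])

    def select_tf(tf, df, k):
        # rule (a): decreasing TF, ties broken by increasing DF
        return sorted(tf, key=lambda t: (-tf[t], df[t]))[:k]

    def select_df(tf, df, k):
        # rule (b): increasing DF, ties broken by decreasing TF
        return sorted(tf, key=lambda t: (df[t], -tf[t]))[:k]

    def select_tfidf(tf, df, N, k):
        # rule (c): decreasing TFIDF, ties broken by increasing DF
        return sorted(tf, key=lambda t: (-tfidf(t, tf, df, N), df[t]))[:k]

    def select_mixed(tf, df, N, k, df_share=2/5, use_tfidf=False):
        # generalized TF3DF2 / TFIDF3DF2: the 2:3 DF share is kept as k varies
        n_df = round(k * df_share)
        chosen = select_df(tf, df, n_df)
        # as in rules (e)-(h), drop terms whose DF value is 1
        rest = {t: v for t, v in tf.items() if t not in chosen and df[t] > 1}
        if use_tfidf:
            return chosen + select_tfidf(rest, df, N, k - n_df)
        return chosen + select_tf(rest, df, k - n_df)

With k = 5 and df_share = 2/5, select_mixed reproduces TF3DF2 (or TFIDF3DF2 with use_tfidf=True); other values of k keep the same ratio, matching the generalization described in step 3.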
(4) Select several search engines that offer a developer mode (API access). Google web search and news search and Yahoo web search and news search are chosen as the general search engines for the test. More details are given in section 3.2.
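The exact request format depends on each engine's developer API, so the sketch below only shows the general shape of a query call. SEARCH_URL and the JSON response layout are hypothetical placeholders, not the actual Google or Yahoo interfaces described in section 3.2.

    import json
    import urllib.parse
    import urllib.request

    SEARCH_URL = "https://example.com/search"  # hypothetical endpoint

    def top10_urls(query):
        # assumes the endpoint returns a JSON list of results with a "url" field
        q = urllib.parse.urlencode({"q": query, "n": 10})
        with urllib.request.urlopen(f"{SEARCH_URL}?{q}", timeout=10) as resp:
            results = json.load(resp)
        return [r["url"] for r in results][:10]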
(5) Download the result links on the first page returned by the search engine, together with their corresponding HTML pages. If the test URL matches a result URL, a successful retrieval is recorded. If there is no match among the first 10 result URLs, the test page is compared with the result pages to see whether they share the same topic or content. This comparison is similar to step 2, since both require parsing the HTML, extracting the content, and converting it to clean plain text free of the noise from HTML tags. The criterion for deciding whether two pages match in this sense is discussed later in this chapter.
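A minimal sketch of this matching step might look as follows; the cosine measure over raw term counts and the 0.75 threshold are illustrative assumptions, not the project's actual criterion.

    import math
    from collections import Counter

    def cosine(a, b):
        # cosine similarity between two term-count Counters
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = math.sqrt(sum(v * v for v in a.values())) * \
               math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def is_success(test_url, test_text, results):
        # results: list of (url, extracted_text) for the top-10 hits
        if any(url == test_url for url, _ in results):
            return True  # exact URL match
        probe = Counter(test_text.lower().split())
        return any(cosine(probe, Counter(text.lower().split())) > 0.75
                   for _, text in results)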
(6) Repeat steps 2 to 5 on all test pages downloaded in step 1, count all successful retrievals, and compute the success retrieval rate defined in Equation 3.1. This is the most straightforward measurement and is also adopted in both S. T. Park’s and Xiaojun Wang’s studies. Martin and Michael’s LS score is another good way to evaluate query generation; however, their score is not as straightforward as Equation 3.1.
\[ \text{success retrieval rate} = \frac{\text{number of successful retrievals}}{\text{total number of test pages}} \tag{3.1} \]
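Tying the steps together, the step 6 loop could be sketched as below; extract_text, select_tf, top10_urls, and is_success refer to the earlier sketches, and fetch_text (download a result URL and extract its text) is a hypothetical helper.

    from collections import Counter

    def success_rate(pages, fetch_text):
        # pages: list of (url, raw_html) pairs from step 1
        texts = {url: extract_text(html) for url, html in pages}
        # document frequency over the whole test collection
        df = Counter(t for text in texts.values()
                     for t in set(text.lower().split()))
        hits = 0
        for url, text in texts.items():
            tf = Counter(text.lower().split())
            query = " ".join(select_tf(tf, df, k=5))  # e.g. rule (a), 5 terms
            results = [(u, fetch_text(u)) for u in top10_urls(query)]
            hits += is_success(url, text, results)
        return hits / len(pages)  # Equation 3.1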
(7) As the ultimate requirement, a stable and reliable LS extraction algorithm should push the success retrieval rate above 90%. Traditional methods that do not preserve linguistic meaning, such as TF, show quite different results from graph-based ranking algorithms, which do preserve it. For example, after step 3 the terms of a traditional LS can be listed in alphabetical order and the query’s terms picked from a to z, so the query keeps no linguistic structure. The results from Google and Yahoo show that both search engines are much better at returning the desired page for a meaningful query than for discrete terms simply ordered by TF, IDF, and so on, even when some stop words are included. The results and analysis are presented in chapter 4.
[Figure 3.1: Flow chart of the overall experimental procedure, steps 1 to 7.]
Figure 3.1 is a flow chart of the overall experimental procedure, covering steps 1 through 7. The URL store in the top-left rectangle of Figure 3.1 is a directory holding a large number of valid URLs that can be accessed online; the corresponding pages are downloaded and saved to local disk.
The LS extraction algorithms appear in the three rectangles at the top right of Figure 3.1; they are the core of this project. Another key programming issue is shown in the two rectangles at the bottom of Figure 3.1: page content comparison and result recording. The middle rectangles on the left and right of Figure 3.1 form a pipeline showing all the operations.