2.5 Graph-Based ranking algorithm
In the previous sections, all the methods of extracting LSs disregard the natural language’s consistency in the web pages by only considering the discrete terms. The document’s terms are arranged in alphabetic order before applying TF or DF, which totally destroys the linguistic information in web pages. Take an example from Thomas A. Phelps and Robert Wilensk’s paper [3],
http://www.cs.berkeley.edu/˜wilensky/NLP.html cannot be located using this query “texttiling wilensky disambiguation subtopic iago” in Google at current time, as shown in Figure2.8 , Google only return 4 results with highly related content but none of them have the same URL as required, however it was claimed as successful in January, 2008. Meanwhile it can be returned by Yahoo search with the same query but a different address like this in Figure2.9 :
http://www.eecs.berkeley.edu/Faculty/Homepages/wilensky.html/NLP.html?lexical-signature=texttiling+wilensky+disambiguation+subtopic+iago
This shows the fact that 2 different URLs can open the same page. About different URLs binding on the same web page, it is not discussed in this project. It is supposed to be a successful retrieval if document similarity is taken as a measurement, however, the URL matching measurement will clearly put it into a false retrieval.
Figure2. 8
Figure2. 9
From Figure2.8 and Figure2.9 , we can see the un-stable performance from traditional LS generation techniques even when they were as typical samples as in papers years before. The studies on URL and its page content change have been researched before, such as Martin Klein and Michael L. Nelson studied the pages ranging from 1996 to 2007 [3], but they are not included in this project. One single page bound by different URL actually happens quite often, taking Binghamton University ’s home page as an example, http://www.binghamton.edu/index.html and http://www2.binghamton.edu actually connect to the same page. In chapter 3, section 3.4 shows typical examples in Yahoo news pages that Yahoo changes the same news page’s URL all the time.
One approach of solving such kind of difficulty, which mainly focuses on finding relative or similar web pages rather than locating web pages only by URLs matching, is referenced by automatically generating key words and summarizations for academic papers. They are introduced as having capabilities by reserving both the underlying language information and relatively stable performance, because there are fewer chances to have two documents with the same content but different URLs, and even this happens, Martin and Michael studied the graph-based algorithms and concluded that they actually had the ability to improve the relative/similar web page re-finding/re-location when the original copy is lost [3].
Graph-based ranking algorithm is a way of deciding the importance of a vertex within a graph, by taking all text as global information and recursively computing from the entire graph rather than relying only on local vertex-specific information [9][10].The basic idea implemented by a graph-based ranking model is a vertex can receive and cast “voting” or “recommendation” to the others [12][13]. When one vertex links to another one, it is basically casting a vote to the other in the graph. The higher the number of votes is received by a vertex, the higher the importance of the vertex is taken [9]. Figure2.10 (a) is an example showing how it works when a vertex casts all its weight to the other vertices. Figure2.10 (b) is an example a vertex casts 90% of its weight to the others while it keeps 10%. Page-Rank is a typical implementation of this graph-based ranking algorithm. The score of a vertex Vi is defined as:
[11] 2- 5
Where d is a damping factor that can be set between 0 and 1. In 2-5 , j is the vertex points to i and Out(Vj) is the score delivered from j to i.
(a) (b)
Figure2. 10
Meanwhile, graph-based ranking algorithm can also be split into 2 groups as Figure2.11 shows the combinations: weighted (on edges) graph and un-weighted (on edges) graph, undirected-graph and directed-graph. One group’s condition can be combined with the other group’s condition.
Figure2. 11
Because undirected un-weighted (on both edges and vertices) graph does not have actual meaning in this project, it is not discussed. Figure2.10 (a) and (b) are 2 examples of directed graph, with weights on vertex, but without value on the edges. Figure2.12 is an example of undirected weighted (on edges) graph. It is the case that assuming out-degree of a vertex is equal to the in-degree of the vertex, take undirected edges like bi-directions edges [9][10].. The weight from i to j and from j to i are same. The weight’s recursive computation formula:
[9] 2- 6
and show Vi, Vj and Vk are connected but cannot show any direction among Vi, Vj and Vk. Figure2.13 is an example of directed weighted graph. It is the case that a weight is added according to the direction from one vertex to another. The weight from i to j is wij, but the weight from j to i is 0.
Figure2. 12
Figure2. 13
[9] 2- 7
Compared to the un-directed weighted graph’s formula 2-6 , and in formula 2-7 show the direction between Vj, Vi and Vk, Vj.