2.4 Michal Cutler’s Study on HTML Structure
In 1997, Michal Cutler proposed a method that uses the structure and hyperlinks of HTML documents to improve the effectiveness of retrieving them [6]. She grouped the text of an HTML document into classes based on the enclosing tags, such as Title, H1, H2, H3, H4, H5 and H6, and claimed that terms in different tags should carry different weights. Based on this idea, a new method for extracting lexical signatures from a web page can select the terms with the highest weights, where the weights are computed with the HTML tag structure taken into consideration [6].
Both of Cutler's papers need to be outlined here: “Using the Structure of HTML Documents to Improve Retrieval” [6] and “A New Study on Using HTML Structures to Improve Retrieval” [7].
First of all, she raised the excellent idea of assigning different term weights to different HTML tags. The first paper divided the text of an HTML page into the classes listed in Table 2.1. The detailed specification and function of each tag are not listed in this section. She also reported the order of tag importance as Anchor > H1-H2 > H3-H6 > Strong > Title > Plain Text [6].
Class Name | HTML tags
Anchor | <a href=>…</a>
H1-H2 | <h1>…</h1>, <h2>…</h2>
H3-H6 | <h3>…</h3>, <h4>…</h4>, <h5>…</h5>, <h6>…</h6>
Strong | <strong>…</strong>, <b>…</b>, <em>…</em>, <i>…</i>, <u>…</u>, <dl>…</dl>, <ol>…</ol>, <ul>…</ul>
Title | <title>…</title>
Plain Text | None of the above
Table 2.1 [6]
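The idea of weighting a term by the class it occurs in can be illustrated with a minimal Python sketch. The numeric importance values below are hypothetical; they only respect the ranking Anchor > H1-H2 > H3-H6 > Strong > Title > Plain Text and are not the values used in [6]. The names `IMPORTANCE`, `term_weights` and `top_terms` are likewise introduced here for illustration only.

```python
from collections import Counter

# Hypothetical importance values (an assumption, not taken from [6]).
# They only preserve the ranking
# Anchor > H1-H2 > H3-H6 > Strong > Title > Plain Text.
IMPORTANCE = {
    "Anchor": 6.0,
    "H1-H2": 5.0,
    "H3-H6": 4.0,
    "Strong": 3.0,
    "Title": 2.0,
    "Plain Text": 1.0,
}


def term_weights(occurrences):
    """Sum class importance over every occurrence of each term.

    occurrences: iterable of (term, class_name) pairs, one pair per
    occurrence of a term in the page.
    """
    weights = Counter()
    for term, class_name in occurrences:
        weights[term.lower()] += IMPORTANCE[class_name]
    return weights


def top_terms(occurrences, k):
    """Return the k highest-weighted terms as a candidate lexical signature."""
    weights = term_weights(occurrences)
    # Sort by weight (descending), then alphabetically so ties are stable.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

With these assumed values, a term that occurs once inside an anchor outweighs a term that occurs twice in plain text, which is the effect the class ranking is meant to produce.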
The second paper divided the text of an HTML page into the classes listed in Table 2.2. It merged all the header tags into a single class but split the earlier Strong class into two classes, List and Strong. The second paper also treated text in the Title and Header classes as more important than the rest, whereas Anchor and H1-H2 are the two most important classes in Table 2.1 [6].
The functions of the <dl>, <ol> and <ul> tags are listed in Appendix A.
Class Name | HTML tags
Title | <title>…</title>
Header | <h1>…</h1>, <h2>…</h2>, <h3>…</h3>, <h4>…</h4>, <h5>…</h5>, <h6>…</h6>
List | <dl>…</dl>, <ol>…</ol>, <ul>…</ul>
Strong | <strong>…</strong>, <b>…</b>, <em>…</em>, <i>…</i>, <u>…</u>
Anchor | <a href=>…</a>
Plain Text | None of the above
Table 2.2 [6]
The basic idea behind the classes in both papers is the same: split the text into classes according to the enclosing tags and then associate each class with a different weight. When a term falls into more than one class, for example because of nested tags, it is counted only in the highest-level class. For example, in <H1><A href="http://www.binghamton.edu">university</A></H1>, ‘university’ is assigned to the Header class rather than the Anchor class according to Table 2.2 [6], but to the Anchor class according to Table 2.1 [6].
Figure 2.5 is a snapshot from http://research.binghamton.edu/. The text in the squares is in either a Strong tag or an Anchor tag and is highlighted with a larger font size or a color other than regular black. This is consistent with the page author's intention: he or she wants readers to notice these lines, so the highlighted content should draw more attention and carry more weight than the un-highlighted text.
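The rule for nested tags can be sketched in Python with the standard library's `html.parser`. This is only an illustration, not the implementation used in [6] or [7]. The ranking for Table 2.1 follows the importance order quoted earlier; the ranking for Table 2.2 is assumed here to follow the row order of that table. All names (`ClassAssigner`, `assign`, `TAGS_T21`, and so on) are hypothetical, and every `<a>` tag is treated as an anchor for simplicity.

```python
from html.parser import HTMLParser

STRONG_TAGS = ["strong", "b", "em", "i", "u"]
LIST_TAGS = ["dl", "ol", "ul"]

# Tag-to-class mapping for Table 2.1.
TAGS_T21 = {"a": "Anchor", "h1": "H1-H2", "h2": "H1-H2", "title": "Title"}
TAGS_T21.update({t: "H3-H6" for t in ["h3", "h4", "h5", "h6"]})
TAGS_T21.update({t: "Strong" for t in STRONG_TAGS + LIST_TAGS})
RANK_T21 = ["Anchor", "H1-H2", "H3-H6", "Strong", "Title"]

# Tag-to-class mapping for Table 2.2.
TAGS_T22 = {"a": "Anchor", "title": "Title"}
TAGS_T22.update({t: "Header" for t in ["h1", "h2", "h3", "h4", "h5", "h6"]})
TAGS_T22.update({t: "List" for t in LIST_TAGS})
TAGS_T22.update({t: "Strong" for t in STRONG_TAGS})
# Assumption: the ranking follows the row order of Table 2.2.
RANK_T22 = ["Title", "Header", "List", "Strong", "Anchor"]


class ClassAssigner(HTMLParser):
    """Assign each term to the highest-ranked class among its open tags."""

    def __init__(self, tag_class, ranking):
        super().__init__()
        self.tag_class = tag_class
        self.rank = {name: i for i, name in enumerate(ranking)}
        self.open_tags = []  # classified tags that are currently open
        self.assigned = []   # (term, class_name) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.tag_class:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag; tolerate sloppy markup.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        classes = [self.tag_class[t] for t in self.open_tags]
        best = min(classes, key=self.rank.__getitem__, default="Plain Text")
        for term in data.split():
            self.assigned.append((term, best))


def assign(html, tag_class, ranking):
    parser = ClassAssigner(tag_class, ranking)
    parser.feed(html)
    parser.close()
    return parser.assigned
```

Feeding the example markup above to `assign` with `TAGS_T21, RANK_T21` places ‘university’ in the Anchor class, while `TAGS_T22, RANK_T22` places it in the Header class.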
Figure 2.5
However, applying different weights to different HTML tags comes with difficulties. Take the piece of HTML in Figure 2.6, which is from a Yahoo news page, as an example:
Figure 2.6
Look carefully at the red square and the orange square. The sentence “Mario left a comment: Obama’s ….” is split into two parts. The terms in blue are in Anchor tags whose HREF attributes link to other pages, while ‘left a comment’ in the orange square is taken out of the Anchor tag and clearly shown in a Strong text style compared with “to see what your Connections are…”. However, Yahoo put ‘left a comment’ into a pre-defined <P> tag and styled it as Strong text rather than using a Strong tag. This can make conventional ways of parsing HTML inaccurate and destroy the original order of the text. As Figure 2.7 shows, the <P> tags and <A> tags are interleaved, which can lead to confusion in telling apart the text in those two kinds of tags if the program is not designed carefully.
Figure 2.7
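One way to avoid losing the original order is to record every text segment in document order together with its enclosing tag, instead of collecting the text of each tag type separately. The Python sketch below illustrates this with `html.parser`. The fragment is hypothetical and only imitates the interleaving of <P> and <A> tags described above: the href values and the class name `strong-style` are assumptions, not Yahoo's actual markup.

```python
from html.parser import HTMLParser

# Tags that never enclose text; they are ignored when tracking nesting.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class OrderedSegments(HTMLParser):
    """Collect (tag, css_class, text) triples in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []     # (tag, css_class) of the currently open elements
        self.segments = []  # text segments in the order they appear

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            tag, css_class = self.stack[-1] if self.stack else (None, None)
            self.segments.append((tag, css_class, text))


# Hypothetical fragment; hrefs and the class name are illustrative only.
FRAGMENT = (
    '<a href="/profile">Mario</a>'
    '<p class="strong-style">left a comment:</p>'
    '<a href="/story">Obama\'s ...</a>'
)

parser = OrderedSegments()
parser.feed(FRAGMENT)
parser.close()
sentence = " ".join(text for _, _, text in parser.segments)
```

Because the segments stay in source order, the sentence can be reassembled, and a later step can still decide per segment whether a styled <P> should be treated like Strong text. That decision depends on the page's style rules and is outside this sketch.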
On the other hand, these two papers focused on the test search engine WEBOR [7], which was developed by Weiyi Meng and Michal Cutler, so Cutler's theory and research were carried out with a clear understanding of WEBOR's working mechanism. Cutler also had access to control and modify WEBOR itself whenever changes to the CIV required it [6][7].
It is therefore unclear whether these conclusions hold when this LS extraction method is applied to Google, Yahoo or other commercial SEs, which keep their search mechanisms secret.