2.5.4 Sentence-Rank on Web Pages

Similar to Xiaojun Wang’s application of word rank to web pages, this project applies sentence rank to web pages. The passages in a web page can be extracted, as shown by the red square in Figure 2.19.

Figure 2.19

However, there are conditions under which sentence rank cannot work. Some web pages contain no sentences or passages, only titles or phrases, which makes sentence rank ineffective on them. Take Figure 2.20 as an example: there is no passage on Yahoo’s home page. Although the three red squares in the center contain several sentences, each shown in a separate anchor tag, it is difficult to construct connections among these independent sentences because they come from completely different topics. Meanwhile, the blue squares on the left contain a number of single words and phrases, such as “answer”, “auto” and “finance”, which are hard to combine with the sentences or to score with sentence rank. Therefore, a page like the one in Figure 2.20 is not a typical page to which sentence rank can be applied.

Figure 2.20 A typical example of a link-based page

There is a simple way to exclude pages that are not suitable for sentence rank. A threshold p is defined to separate pages into two categories, link-based pages and content-based pages, according to the value computed by formula 2-11.

Formula 2-11

A page like the one in Figure 2.20 can be classified as a link-based page, in which a high proportion of the text is inside links. Link-based pages are commonly found among websites’ home pages and index pages. In contrast to Figure 2.20, a content-based page such as the one in Figure 2.19 has a high proportion of plain text outside links.
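
The classification described above can be sketched roughly as follows. The exact form of formula 2-11 is not reproduced in this text, so the sketch assumes the measure is the share of visible text characters that sit inside anchor tags, and it treats a page as link-based when that share exceeds the threshold p. The names LinkTextCounter, link_text_ratio and classify_page, as well as the default p = 0.5, are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Counts visible text characters inside and outside <a> tags."""

    SKIPPED = {"script", "style"}  # content of these tags is not visible text

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.skip_depth = 0
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        n = len("".join(data.split()))  # count characters, ignoring whitespace
        self.total_chars += n
        if self.anchor_depth:
            self.link_chars += n


def link_text_ratio(html):
    """Assumed measure: link text characters divided by all text characters."""
    counter = LinkTextCounter()
    counter.feed(html)
    counter.close()
    if counter.total_chars == 0:
        return 0.0  # a page without any text has no link text either
    return counter.link_chars / counter.total_chars


def classify_page(html, p=0.5):
    """Labels a page using the threshold p (0.5 is an illustrative value)."""
    return "link-based" if link_text_ratio(html) > p else "content-based"


portal = '<body><a href="/a">Finance</a><a href="/b">Autos</a><p>Hi</p></body>'
article = "<body><p>" + "word " * 50 + '</p><a href="/x">more</a></body>'
print(classify_page(portal), classify_page(article))
```

Under these assumptions, only pages labelled content-based would be passed on to sentence rank. A suitable value of p would have to be tuned on real pages.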
