5 Conclusion

5.1   Summaries

On Google web search, the overall average successful-retrieval rates of DF, TF3DF2, TF5DF5, TFIDF3DF2, TFIDF5DF5, WordRank3DF2 and WordRank5DF5 all exceed 70%. Directed-forward sentence rank, by contrast, achieves better results on Yahoo search than on Google.

The sentence-based methods (random sentence pick, undirected-graph sentence rank, directed-forward sentence rank and directed-backward sentence rank) reach maximum successful-retrieval rates above 74%, higher than the term-based algorithms. The average success rate on Google web search is higher than on Yahoo web search for every method except directed-forward sentence rank, which scores 68.44% on Google and 68.99% on Yahoo, a difference of less than 0.6 percentage points. When a query contains fewer than 5 terms, the DF-related algorithms with a high DF ratio, such as DF, TF3DF2, TF5DF5, TFIDF3DF2, TFIDF5DF5, WordRank3DF2 and WordRank5DF5, have higher successful-retrieval rates than the others.

 

No.  Method            Google             Yahoo
1    Title              94.67   42.07%     84.00   37.33%
2    TF                 88.33   39.26%     80.67   35.85%
3    DF                157.33   69.93%    138.67   61.63%
4    TFIDF             106.33   47.26%     96.00   42.67%
5    PW                 89.00   39.56%     78.67   34.96%
6    TF3DF2            139.67   62.07%    130.00   57.78%
7    TF4DF1             95.33   42.37%     89.67   39.85%
8    TF5DF5            143.67   63.85%    133.00   59.11%
9    TFIDF3DF2         142.67   63.41%    134.00   59.56%
10   TFIDF4DF1         112.00   49.78%    104.00   46.22%
11   TFIDF5DF5         147.67   65.63%    137.67   61.19%
12   WordRank           71.00   31.56%     67.33   29.93%
13   NounVerbsRank      51.00   22.67%     47.33   21.04%
14   WordRank3DF2      148.00   65.78%    136.33   60.59%
15   WordRank4DF1       90.00   40.00%     84.67   37.63%
16   WordRank5DF5      145.33   64.59%    134.00   59.56%
17   WordRank3TFIDF2    59.33   26.37%     57.00   25.33%
18   WordRank4TFIDF1    66.33   29.48%     61.33   27.26%
19   WordRank5TFIDF5    57.00   25.33%     50.67   22.51%
20   RandomSentence    112.00   49.78%    105.33   46.81%
21   SentenceRank      111.00   49.33%    117.00   52.00%
22   ForwardSentence   109.67   48.74%    121.33   53.93%
23   BackwardSentence  105.33   46.81%    103.00   45.78%

Table 5.1 Average success rate for queries of 3, 4 and 5 terms

Figure 5.1

 

No.  Method            Google             Yahoo
1    Title             120.60   53.60%    106.00   47.11%
2    TF                157.60   70.04%    129.20   57.42%
3    DF                160.80   71.47%    127.00   56.44%
4    TFIDF             168.00   74.67%    134.00   59.56%
5    PW                159.00   70.67%    123.60   54.93%
6    TF3DF2            166.40   73.96%    138.80   61.69%
7    TF4DF1            168.20   74.76%    141.80   63.08%
8    TF5DF5            163.60   72.71%    134.40   59.73%
9    TFIDF3DF2         165.60   73.60%    144.00   64.00%
10   TFIDF4DF2         169.20   75.20%    144.20   64.09%
11   TFIDF5DF5         163.80   72.80%    138.00   61.33%
12   WordRank          154.20   68.53%    130.40   57.96%
13   NounVerbsRank     136.40   60.62%    125.00   55.56%
14   WR3DF2            162.80   72.36%    130.60   58.04%
15   WR4DF1            170.40   75.73%    148.60   66.04%
16   WR5DF5            163.40   72.62%    134.40   59.73%
17   WR3TFIDF2         154.20   68.53%    127.60   56.71%
18   WR4TFIDF1         158.80   70.58%    133.40   59.29%
19   WR5TFIDF5         148.80   66.13%    122.40   54.40%
20   RandomSentence    171.20   76.09%    166.00   73.78%
21   SentenceRank      165.40   73.51%    162.40   72.17%
22   ForwardSentence   171.20   76.09%    169.80   75.47%
23   BackwardSentence  166.20   73.87%    159.20   70.76%

Table 5.2 Average success rate for queries of 11, 12, 13, 14 and 15 terms

Figure 5.2

The gap between Google and Yahoo is smallest for the sentence-based methods (random sentence, undirected-graph sentence rank, forward sentence rank and backward sentence rank): in Table 5.2 it is below 5 percentage points for each of them, while for all the other methods it exceeds 5 percentage points.
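The comparison between the two engines can be checked directly from the percentages in Table 5.2. The short sketch below does this for a subset of the methods; the dictionary and function names are illustrative and are not part of the original experiment code.

```python
# Success rates in percent, copied from Table 5.2 (Google, Yahoo).
# Only a subset of the 23 methods is listed to keep the example short.
RATES = {
    "Title": (53.60, 47.11),
    "TFIDF": (74.67, 59.56),
    "NounVerbsRank": (60.62, 55.56),
    "RandomSentence": (76.09, 73.78),
    "SentenceRank": (73.51, 72.17),
    "ForwardSentence": (76.09, 75.47),
    "BackwardSentence": (73.87, 70.76),
}


def engine_gap(google_pct, yahoo_pct):
    """Return Google minus Yahoo, in percentage points."""
    return round(google_pct - yahoo_pct, 2)


gaps = {method: engine_gap(g, y) for method, (g, y) in RATES.items()}

# Methods whose gap between the two engines is below 5 percentage points.
small_gap = sorted(m for m, gap in gaps.items() if abs(gap) < 5.0)

for method, gap in gaps.items():
    print(f"{method:<18}{gap:6.2f}")
```

With these figures, only the four sentence-based methods end up in `small_gap`; NounVerbsRank is the closest of the term-based methods, at just over 5 points.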

 

5.2   Limitations

In this project, LS performance depends largely on the search engines and on web page extraction. Google is well known for its PageRank strategy, in contrast to the pure text-ranking strategies adopted by engines such as AltaVista [12]. There are also other link-based ranking strategies, such as hubs and authorities, which likewise treat link information as a key element in ranking URLs. During ranking, term information such as TF or DF probably cannot help a page reach a high rank when a large portion of pages all contain the terms in the search query. In this project, URL matching considers only the first 10 URLs returned. These results satisfy not only term matching but also the link-based criteria, so after ranking the top 10 results are very likely pages referenced by many other pages with similar content. Therefore, a summary of a web page that uses fewer than 15 terms may overlap heavily with other pages, and successful retrieval then depends on ranking algorithms that weigh link structure heavily rather than relying on pure text analysis. As a result, the chance of the original page appearing in the top 10 URLs returned by the search engine is limited.
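The top-10 matching criterion described above can be sketched as follows. This is a minimal illustration rather than the code used in the project: the function names and the URL normalization rule (ignoring the scheme, a leading "www." and a trailing slash) are assumptions introduced here, since the report does not say how URLs were compared.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Hypothetical normalization: drop scheme, leading 'www.' and trailing '/'."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def is_retrieved(original_url, result_urls, top_k=10):
    """True if the original page is among the first top_k result URLs."""
    target = normalize(original_url)
    return any(normalize(u) == target for u in result_urls[:top_k])


def success_rate(outcomes):
    """Percentage of queries whose original page was retrieved."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0
```

A page ranked eleventh or lower counts as a failure under this criterion, which is why heavy term overlap with better-linked pages lowers the measured success rate.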

 

5.2.1     HTML Parsing and Text Extraction

The other challenge arises at the very start of the experiment: HTML parsing and text extraction. More accurate text extraction yields a higher successful-retrieval rate, so the success rate depends largely on how well the content of a given web page can be extracted. Isolating the main content or topic of a web page is a non-trivial problem. In the rest of this section, I use some of today's most-viewed news pages as examples to show how the topic or main content of a web page can be extracted.

The experiments in this report show that many news web pages use <div> elements to divide their content, so the text in each division can be extracted in a way that follows human visual cues. Here are two pages, from Yahoo and ABC News.

 

Figure 5.3 (a) and (b)

In Figure 5.3, only the text in the red box is the main content; the text in the boxes of other colors consists of navigation or advertisement links. Fortunately, besides Yahoo and ABC, many other news websites such as CNN, BBC and Google News also use <div> to separate web page content.

The next step is to determine whether a given piece of text is the main content, as opposed to other phrases, passages, blocks and links. My current approach is to classify the web page first, into one of two categories: text-based and link-based. The first category is illustrated by Figure 5.3 (a) and (b) above; the second covers home pages and index pages, such as the one in Figure 5.4.

Figure 5.4

Figure 5.4 contains many titles with links but hardly any paragraphs or passages. A simple measure can be used to distinguish the two kinds of pages: the ratio between the text inside links and the text outside links. In Figure 5.4, almost 90% of the text is enclosed in <A> (anchor) tags, whereas in Figure 5.3 (a) and (b) a large part of the text lies outside anchor tags, and that text is the main content. For a content-based page, the main content, excluding navigation and advertisements, should be identified and extracted properly. For a link-based page, not only the phrases in the anchor text but also the URLs of the anchors are significant and cannot simply be removed during extraction.
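The link-text ratio described above can be sketched with Python's standard html.parser module. This is a minimal illustration under stated assumptions, not the extraction code used in the experiments: the class and function names are hypothetical, text is measured in non-whitespace characters, and the 0.5 threshold is an arbitrary placeholder that would have to be tuned on real pages.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Counts visible characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # > 0 while inside an <a> element
        self.skip_depth = 0     # > 0 while inside <script> or <style>
        self.linked_chars = 0
        self.plain_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return  # script and style bodies are not visible text
        n = len("".join(data.split()))  # ignore whitespace
        if self.anchor_depth:
            self.linked_chars += n
        else:
            self.plain_chars += n


def link_text_ratio(html):
    """Fraction of visible text that sits inside anchor tags."""
    counter = LinkTextCounter()
    counter.feed(html)
    counter.close()
    total = counter.linked_chars + counter.plain_chars
    return counter.linked_chars / total if total else 0.0


def classify_page(html, threshold=0.5):
    """Label a page 'link-based' or 'content-based' (threshold is an assumption)."""
    return "link-based" if link_text_ratio(html) >= threshold else "content-based"
```

An index page like the one in Figure 5.4 would give a ratio close to 0.9 and be labeled link-based, while an article page with long unlinked paragraphs would fall well below the threshold.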

Figure 5.5

Noise such as the copyright notice in Figure 5.5 can also cause distraction. To make query generation more accurate, the text of the main content or topic must be extracted and retained according to the type of the web page and the position of the text within it.

 

5.2.2     Solution

Figure 5.6 (a) and (b)

Figure 5.6 shows a demo of “Crunch”, software developed at Columbia University [20]. It is a good model for extracting a web page and converting it into a pure-text version according to the page's own structure. In Figure 5.6 (b), the page retains only font styles, font sizes and HREFs, which are sufficient for further processing, for example with Michal Cutler’s theory. Owing to time limitations, the experiments on HTML extraction did not go as deep as “Crunch”. In addition, because ranking takes link information into account, the domain a page belongs to can be made part of the query, which helps locate the page more accurately; however, adding an HREF address increases the number of terms in the query. The performance of queries that include a domain or link has not yet been widely tested.
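As an illustration of the last idea, the sketch below appends a page's domain to an existing list of query terms. It is only a hypothetical example: the function name is introduced here, the domain is added as one ordinary extra term, and, as noted above, the effect on retrieval has not been evaluated.

```python
from urllib.parse import urlsplit


def query_with_domain(terms, page_url):
    """Join the query terms and append the page's host name as an extra term."""
    host = urlsplit(page_url).hostname or ""
    if host.startswith("www."):
        host = host[4:]  # assumption: the bare domain is enough to locate the site
    parts = list(terms)
    if host:
        parts.append(host)
    return " ".join(parts)


query = query_with_domain(
    ["election", "results", "senate"],
    "http://www.example.com/news/1.html",
)
print(query)
```

The resulting query has one more term than the original list, which is the trade-off mentioned above between locating the page more precisely and keeping the query short.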
