Facilitating wrapper generation with page analysis

Authors:

Bo Wu, Xueqi Cheng, Yu Wang, Gang Zhang, Guodong Ding

Abstract:

Extracting structured data from web pages is a necessary step for in-depth web mining. Much work has focused on inducing separate wrappers for different sites with human guidance. However, information on the internet is distributed across many websites and web pages, so current approaches require a large number of labeled training pages to obtain satisfactory results. On the other hand, the quality of data extracted by fully automatic methods is not reliable. In this paper, we propose a novel method that facilitates wrapper generation by combining wrapper induction with page analysis. In addition to manually labeled data, we take advantage of a set of unlabeled pages to improve the quality of the induced wrappers. Our experiments demonstrate that our system achieves satisfactory results with fewer manually labeled training pages.

Conference paper, IEEE ISI (Intelligence and Security Informatics) 2009
