Text mining and web mining

 

Text mining and web mining are two interrelated fields that have received a lot of attention in recent years. Text mining [1, 2] is concerned with the analysis of very large document collections and the extraction of hidden knowledge from text-based data. Web mining [3] refers to the analysis and mining of all web-related data, including web content, hyperlink structure, and web access statistics. Among the three aspects of web mining, text mining is most closely related to web content mining. However, whereas text mining deals with text documents in general, such as emails, letters, reports, and articles, that exist in both intranet and internet environments, web content mining is primarily concerned with materials on the web only. Dealing with free-form unstructured and semi-structured text, text mining can be envisaged as an immediate extension of data mining or knowledge discovery from databases [2]. Web content mining, on the other hand, covers a wider scope, dealing with rich multimedia content, including text, image, audio, and video, intermixed with HTML formatting tags and hyperlinks. Nevertheless, as text constitutes a large portion of web content, text mining is still recognized by many as a key enabling technology for web resource management and mining.

Text and web mining are both technically interesting and commercially relevant. A good number of companies, including high-tech start-ups and established players such as Verity, Autonomy, Megaputer, Microsoft, and IBM, have released a range of text and web mining related products and services over the past few years. The functions supported include search and retrieval, document navigation/exploration, text analysis, and knowledge management.
We have also witnessed a convergence of interests into text and web mining from many established academic fields, including statistics, pattern recognition, machine learning, databases, data mining, natural language processing, and computational linguistics. To name a few, well-known efforts in the research communities include the World Wide Knowledge Base (Web->KB) [5] and WebWatcher [6] projects by the Text Learning Group at Carnegie Mellon University, the Natural Language Learning [7] research by the Machine Learning Group at the University of Texas at Austin, and WebBase [8] at Stanford University, which culminated in the PageRank algorithm [9] used in Google.

This special issue contains eight technical papers selected by a panel of over thirty international experts through a rigorous peer review process. The articles include three openly solicited papers as well as expanded versions of five papers presented at the International Workshop on Text and Web Mining, held in conjunction with the Sixth Pacific Rim International Conference on Artificial Intelligence (PRICAI’2000) at the Melbourne Convention Centre, Australia, on 28 August 2000. The papers collected here focus primarily on text and web content mining, covering such topics as document retrieval, text/web categorization, tagging, schema extraction, clustering, and information discovery.

The first two articles focus on the problem of document retrieval. "Genetic Mining of HTML Structures in Web-Document Retrieval" by Kim and Zhang exploits the inherent HTML structure of web documents to facilitate document retrieval. Tan et al., on the other hand, present a novel approach to text retrieval from document images based on word shape analysis. We include two articles on document clustering that, coincidentally, are both based on a popular class of unsupervised neural networks known as the Self-Organizing Map (SOM) [10]. Rauber and Merkl describe a SOM-based digital library and discuss the representational issues of topics and genres. Lee and Yang present a framework for performing multilingual text mining based on Self-Organizing Maps. On the supervised learning side, we have a paper by He, Tan, and Tan that reports benchmark comparisons of three state-of-the-art machine learning methods for Chinese document categorization.

While the first five articles focus on techniques for organizing textual information, the remaining three papers investigate the problem of extracting information and knowledge from documents. The contributed paper of Velardi and Missikoff describes various text mining techniques for automatically enriching a domain ontology. Carchiolo, Longheu, and Malgeri propose a method for extracting a logical schema from the Web. The last paper of the issue describes an application of text mining to medical decision support. By formulating rule mining as a categorization problem, Loh, Oliveira, and Gameiro present a method for constructing automatic decision systems from patient records.

We hope this issue provides a snapshot of some of the latest advances in the fields of text and web mining. More collections of text and web mining related articles can be found in the workshop proceedings of several major conferences, such as the KDD’2000 Workshop on Text Mining [11] and the PRICAI’2000 Workshop on Text and Web Mining [12].
We believe text and web mining will continue to gain importance in the years to come. Promising research directions include content personalization, multilingual content mining, and web resource management/mining. We also expect to see more domain-specific applications of text and web mining technologies in building personalized/customized vertical portals and competitive intelligence systems. We hope you enjoy reading the special issue and look forward to more exciting developments in the fields of text and web mining.

 
