The following presents the task of finding relevant information in large document collections, together with the WEBSOM method developed at the Neural Networks Research Centre of the Helsinki University of Technology. The presentation is based on [].
One of the most common computing tasks nowadays is information retrieval. Especially in the rapidly growing World Wide Web (WWW) there is a vast amount of potentially useful information available, but reaching it is not straightforward. It is important to develop more powerful methods for the exploration of miscellaneous document collections.
Searching for relevant documents has traditionally been based on keywords and Boolean expressions of them. The search results often show high recall and low precision, or vice versa. Considerable effort has been devoted to developing alternative methods, but their practical applicability has been low.
The WEBSOM method is based on an algorithm called the Self-Organizing Map (SOM). The latter, developed in our laboratory, is a general unsupervised learning algorithm for analyzing and visualizing high-dimensional statistical data. It is one of the most widespread artificial neural network models used in application areas like process monitoring, image analysis, telecommunications, and categorization of economic data. The SOM, its mathematical basis, and about one thousand applications are presented in the recent monograph [].
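As an illustration of the kind of learning rule the SOM uses, the following is a minimal sketch in plain Python. It is a generic textbook-style formulation, not the code of the SOM program package. The function names, the rectangular grid, the Gaussian neighborhood, and the linearly decreasing learning rate and neighborhood radius are all assumptions made for this example.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the node whose model vector is closest to x."""
    return min(
        weights,
        key=lambda node: sum((a - b) ** 2 for a, b in zip(weights[node], x)),
    )


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a small rectangular SOM; returns {(row, col): model_vector}.

    Assumed schedule: learning rate and neighborhood radius shrink linearly.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    # Initialise every node's model vector with random values in [0, 1).
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        samples = list(data)
        rng.shuffle(samples)
        for x in samples:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)
            sigma = max(sigma0 * (1.0 - frac), 0.5)
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                # Squared grid distance between this node and the winner.
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma ** 2))
                # Move the model vector towards the input, scaled by h.
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights
```

After training, calling `best_matching_unit(weights, x)` gives the map location of a data item `x`. Because neighboring nodes are updated together, similar items tend to end up at nearby locations, which is what makes the map useful for visualization.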
The basic WEBSOM architecture consists of two levels. The word category map [] first learns, in a self-organizing process, to represent relations between words based on their averaged short contexts. The words are mapped onto the two-dimensional map grid, ordered according to the similarities in their usage. The word category map is then used to form a word histogram of the textual document to be analyzed. The histogram, a "fingerprint" of the document, is used as input to the second SOM, the document map. The document map self-organizes to represent the similarities between the contents of the documents; each document attains a location on the map based on its contents. Different areas of the map specialize in different topics, and the topics change smoothly along the map.
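The two data transformations described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the WEBSOM implementation. The helper names `averaged_contexts` and `document_fingerprint` are hypothetical. The context is assumed to be one word on either side, with each word represented by a given code vector. The `word_to_node` table stands in for the result of looking up each word on a trained word category map.

```python
from collections import Counter, defaultdict


def averaged_contexts(tokens, codes):
    """For each word, average the code vectors of its left and right neighbors.

    tokens: list of words in running text.
    codes:  dict mapping each word to a fixed code vector (assumed given).
    Returns {word: averaged [left_code + right_code] vector}.
    """
    dim = len(next(iter(codes.values())))
    sums = defaultdict(lambda: [0.0] * (2 * dim))
    counts = Counter()
    for i in range(1, len(tokens) - 1):
        # Concatenate the codes of the preceding and following word.
        context = codes[tokens[i - 1]] + codes[tokens[i + 1]]
        acc = sums[tokens[i]]
        for j, value in enumerate(context):
            acc[j] += value
        counts[tokens[i]] += 1
    return {w: [v / counts[w] for v in acc] for w, acc in sums.items()}


def document_fingerprint(text, word_to_node, n_nodes):
    """Histogram of word-category-map nodes hit by the words of a document.

    word_to_node: dict mapping a word to the index of its map node
                  (hypothetical lookup table from the first-level map).
    Returns a list of length n_nodes with relative frequencies.
    """
    hits = Counter(
        word_to_node[w] for w in text.lower().split() if w in word_to_node
    )
    total = sum(hits.values())
    return [hits[i] / total if total else 0.0 for i in range(n_nodes)]
```

In this sketch, the vectors returned by `averaged_contexts` would be the training data for the word category map, and the vectors returned by `document_fingerprint` would be the training data for the document map. Normalizing the histogram by document length is a design choice made here, so that long and short documents are comparable.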
The WEBSOM demo is available on the Internet. To make it easy and practical to explore the organized document collections, we have developed a WWW-based browsing environment. The self-organized document map offers a general idea of the underlying document space. The user may view any area of the map in detail by simply pointing to the map image with the mouse. The WEBSOM browsing interface is implemented as a set of HTML documents that can be viewed using a graphical WWW browser, such as Mosaic or Netscape, at the WEBSOM home page at http://websom.hut.fi/websom/ [].
The WEBSOM method is applicable, in principle, to any kind of collection of textual documents. It is especially suitable for exploration tasks in which the users either do not know the domain very well or have only a limited idea of the contents of the full-text database being examined. With WEBSOM, the documents are ordered meaningfully according to their contents. The maps also help the exploration by giving an overall view of what the information space looks like.
In the World Wide Web, one application could be the organization of home pages instead of newsgroup articles. Electronic mail messages could also be positioned automatically on a suitable map according to personal interests. Relevant areas and single nodes on the map can then be used as "mailboxes" in which specified information is gathered automatically.
For more detailed information on the WEBSOM method in general, its variants, and application examples see, e.g., [,,]. A detailed description of the SOM as a method and tool for exploring numerical or textual data can be found in []. The SOM has previously been used for creating document maps, e.g., by Lin et al. [] to form a map based on the titles of scientific documents. Scholtes has developed, based on the SOM, a neural filter and a neural interest map for information retrieval [,]. Merkl [] has clustered textual descriptions of software library components. In comparison, one of the novel features of the WEBSOM method is the idea of applying the SOM algorithm twice: first for word category analysis and then for creating document maps based on that analysis. The natural language processing model of Miikkulainen [] contains the SOM as a central component.
The SOM program package with documentation is available for non-commercial purposes []. The original technical report and some WEBSOM demonstrations are also available [].