Publisher's Note: we are very pleased to feature this valuable research on free and open source search engines on SearchTools.com. It was originally written around 2004 and revised in April 2006.
Abstract: This paper reviews nine search engine software packages (Alkaline, Fluid Dynamic, ht://Dig, Juggernautsearch, mnoGoSearch, Perlfect, SWISH-E, Webinator, and Webglimpse) which are free to users. Their features and functionalities are compared and contrasted, with emphasis on searching mechanisms, crawler and indexer features, and searching features.
The Internet and computer technology have immeasurably increased the availability of information. However, as the size of information systems increases, it becomes harder for users to retrieve relevant information. Search engines have been developed to facilitate fast information retrieval. There are many software packages for search engine construction on the Internet. The website searchtools.com alone lists more than 170 search tools, many of which are free or free for noncommercial use. With so many software packages available, selecting suitable search engine software is as hard as, if not harder than, retrieving relevant information efficiently from websites. Motivated by a desire to aid website administrators in choosing a suitable search engine, this paper reviews the basic information, features, and functionalities of nine free search engine software packages: Alkaline, Fluid Dynamic, ht://Dig, Juggernautsearch 1.0.1, mnoGoSearch, Perlfect, SWISH-E, Webinator, and Webglimpse 2.x.
The remainder of the paper starts with an introduction to free search engine software. Then we summarize basic information, such as source code availability and platform compatibility, for the nine software packages. After that, their features and functionalities are compared and contrasted. Finally, we summarize our comparison results.
Free search engine software can be found at websites such as searchtools.com, sourceforge.net, searchenginewatch.com, and codebeach.com. Some packages are freeware with only binary files distributed, while others are open source software. In general, however, free search engine software is not well documented and has undergone few formal tests, which makes it difficult to understand the functionality it provides.
Depending on who provides the actual search service, free search tools can be categorized into remote site search services and server-side search engines. In the former, the indexer and query engine run on a remote server that stores the index file. At search time, a form on a user's local web page sends a message to the remote search engine, which then sends the query results back to the user. A server-side search engine is what we usually think of as a search engine: it runs on the user's server and consumes that server's CPU time and disk space. In this paper, the term search engine refers only to server-side search engines.
According to what is indexed, search engines are classified as file system search engines and website search engines. File system search engines index only files in the server's local file system. Website search engines can index remote servers by feeding URLs to web crawlers. Most search engines combine the two functions and can index both local file systems and remote servers. The nine search engine software packages compared here are all website search engines, some of which can also index local file systems.
A fully functional website search engine software package should have the following four blocks:

- a web crawler
- an indexer
- a query engine
- a query interface
The nine software packages we compare either have all four blocks or allow adding the missing blocks.
This section intends to provide some basic information about each search engine software package. The information includes licensing, where to find, source code availability, documentation availability, what is written in, platform compatibility, completeness of the package, and who built it.
Licensing refers to whether the software is freeware or is free under certain conditions. Source code availability provides the website address for downloading the source code if it is available. Documentation availability indicates where to find the documentation files. What is written in tells what programming language was used to implement the software. Platform compatibility specifies what operating systems the software can run on. If the software package is fully functional, i.e., it has a web crawler, an indexer, a query engine, and a query interface, we consider the package to be complete. Who built it identifies the developers of the software.
A website administrator looking for a suitable software package can consult this information first to decide whether a package is a potential candidate. For example, if the search engine software cannot be installed on the platform on which the web server is running, there is no need for the administrator to look into the specific features of the software. We summarize the basic information of the nine software packages in Table 1.
Table 1. Basic information of the nine search engine software packages.

| Software Name | Licensing | Where to Find? | Source Code Availability | Documentation Availability | What Is It Written in? | Platform Compatibility | Completeness of Package | Who Built It? |
|---|---|---|---|---|---|---|---|---|
| Alkaline | Free for non-commercial use | alkaline.vestris.com | Not available in the public domain; can be purchased under license | User's Guide (PDF); FAQ (PDF) | C++ | Linux, Solaris, IRIX, BSDI, FreeBSD, Win NT/2000/XP Pro | Complete | Daniel Doubrovkine, founder of Vestris Inc., and Hassan Sultan, who developed the cellular expansion algorithm |
| Fluid Dynamics | Freeware available; free trial shareware | www.xav.com | Available @ www.xav.com/scripts/search/install.html | Some information | Perl | Unix, Linux, Win 95/98/ME/NT/2000 | Complete | Copyrighted by Zoltan Milosevic, Fluid Dynamics Software Corporation |
| ht://Dig | Free | htdig.sourceforge.net or www.htdig.org | Stable release 3.1.6 @ www.htdig.org/files/htdig-3.1.6.tar.gz; beta release 3.2.0b6 @ www.htdig.org/files/htdig-3.2.0b6.tar.gz | Available @ www.htdig.org | C, C++ | Solaris, HP/UX, IRIX, SunOS, Linux, Mac OS X, Mac OS 9 | Complete | Loic Dachary and Geoff Hutchinson, San Diego State University |
| Juggernautsearch 1.0.1 | Free for personal use | juggernautsearch.com | Available @ juggernautsearch.com/JS.1.0.1.tgz | Installation and Operation Guide @ juggernautsearch.com/JSInstall.htm; Executive Summary @ juggernautsearch.com | Perl | Unix, Linux; Win NT/2000/XP for the non-free version | Complete | Donald Kasper et al. of HyperProject, Inc. |
| mnoGoSearch | Free Unix version | mnogosearch.org | Available @ mnogosearch.org/download.html | Reference manual @ mnogosearch.org/doc/ | C | Unix, Linux, FreeBSD, Mac OS X | Complete | Alexander Barkov, Mark Napartovich, Ramil Kalimullin, Aleksey Botchkov, Sergei Kartashoff, et al. of Lavtech.Corp |
| Perlfect | Free | perlfect.com/freescripts/search/ | Available @ perlfect.com/freescripts/search/ | Readme, FAQ, and example configuration file @ perlfect.com/freescripts/search/ | Perl | Win NT, Unix, Linux | Complete | N. Moraitakis and G. Zervas |
| SWISH-E | Free | swish-e.org | Available @ swish-e.org/download/index.html | Available @ swish-e.org/docs/index.html | C, Perl | SunOS, FreeBSD, NetBSD, Linux, OSF/1, AIX, Windows NT | Needs an additional CGI to invoke searching | The original version, SWISH, was built by Kevin Hughes; in Fall 1996 the UC Berkeley Library received permission from Kevin Hughes to implement bug fixes and enhancements, hence SWISH-E |
| Webinator | Free for up to 10,000 pages and 10,000 hits per day | www.thunderstone.com | Not in the public domain | Available @ www.thunderstone.com/site/webinator5man/ or www.thunderstone.com/site/webinator5man/webinator5.pdf | Vortex, Texis' Web script language | Unix (Solaris SPARC and x86, Linux Intel, SGI IRIX 4/5/6, UnixWare, BSDI, AT&T SVR4 386, SunOS 4, SCO 5/5.02, DEC Alpha Unix 3/4, HP-UX 10, IBM AIX 4.2); Windows NT and 2000 | Complete | Thunderstone - EPI Inc. |
| Webglimpse 2.x | Free version for educational and governmental use | www.webglimpse.net | Glimpse & Webglimpse available @ webglimpse.net/download.php | Available @ www.webglimpse.net/subdocs/ or webglimpse.org/pubs/webglimpse.pdf | Glimpse: C full text search engine; Webglimpse: Perl spider and indexer | Solaris, SunOS, OpenBSD, AIX, IRIX, Mach, OSF, Rhapsody (Mac OS X) | Complete | University of Arizona |
We compare and contrast the nine software packages from the following four perspectives: searching mechanism, crawler and indexer features, searching features, and other features.
We consider the indexing method and the ranking method to be the searching mechanism of a search engine, since these two methods usually determine how much disk space the search engine requires, how fast the indexing process is, and how fast and accurate the search process is.
Most search engines operate on the principle that pre-indexed data is easier and faster to search than raw data. The form and quality of the index created from the original documents is of paramount importance to how searches are performed. The most commonly used indexing method is the full text inverted index. It takes a large amount of disk space and the indexing process is slow, because it keeps most of the information in a document. Another method is to index only the title, keywords, description, and author parts of a document; the indexing process can then be very fast and the resulting index is relatively small. Some search engines use novel indexing methods of their own: WebGlimpse uses two-level indexing, which we introduce later, and Alkaline applies its Cellular Expansion Algorithm, which is still kept as a technical secret.
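To make the inverted index concrete, here is a minimal sketch of building one; the whitespace tokenization and integer document IDs are simplifying assumptions for illustration, not taken from any of the reviewed packages.

```python
# A minimal full text inverted index: map each word to the set of
# document IDs in which it occurs.
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {word: {doc_ids containing word}}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():   # naive whitespace tokenization
            index[word].add(doc_id)
    return index

docs = {1: "free search engine software", 2: "search engine indexing"}
index = build_inverted_index(docs)
print(index["search"])  # -> {1, 2}
```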
The ranking method is the method that decides a document's relevance to a query. Factors such as word frequency in the document, word position in the text, and link popularity are usually considered, and different search engines take different factors into consideration.
We compare the following functionalities of built-in web crawlers and indexers: support for the Robot Exclusion Standard, crawler retrieval depth control, duplicate page detection, file formats that can be indexed, and the ability to index password-protected servers.
Searching features are considered from ten aspects: Boolean search, phrase matching, attribute search, fuzzy search, word forms, wild cards, regular expressions, numeric data search, case sensitivity, and natural language query.
We consider only three features in this category: international language support, page limit, and customizable result formatting.
We compare and contrast the features of the nine search engine software packages according to the above comparison criteria. The main results are summarized in Table 2, and individual analyses are provided in the subsections that follow.
Table 2. Feature comparison of the nine search engine software packages.

| Feature | Alkaline | Fluid Dynamic | ht://Dig | Juggernautsearch 1.0.1 | mnoGoSearch | Perlfect | SWISH-E | Webinator | Webglimpse 2.x |
|---|---|---|---|---|---|---|---|---|---|
| *Searching Mechanism* | | | | | | | | | |
| Indexing Method | Cellular Expansion Algorithm | Attribute indexing | Inverted index | Keywords index | Inverted index | Inverted index | Don't know | Inverted index | Two-level query |
| Relevance Ranking | Word weight | Word frequency; word weight | Word weight | Word weight | Word weight | Gerald Salton's algorithm | Don't know | See details in text | Don't know |
| *Crawler and Indexer Features* | | | | | | | | | |
| Robot Exclusion Standard Support | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes |
| Crawler Retrieval Depth Control | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes |
| Duplicate Page Detection | Yes | Yes | Yes | Yes | Yes | Don't know | Yes | Yes | Don't know |
| File Formats to Be Indexed | html, htm, text, shtml, PDF, embedded Shockwave Flash objects, doc, rtf, LaTeX/TeX, WordPerfect, XML, and MPEG Layer 3 | html, htm, shtml, shtm, stm, txt, mp3, and PDF | html, txt, PDF, MS Word, PowerPoint, PostScript, and Excel | txt, htm, html, shtm, shtml, ppt, doc, xls, ps, rtf, BAT, C, CGI, CXX, CPP, H, Java, PHP, PL | html, txt, pdf, ps, doc, mp3, SQL database text fields | html, txt, pdf | html, XML, txt, doc, PDF, and gzipped files | html, htm, txt, pdf, doc, swf, WordPerfect, asp, jsp, shtml, jhtml, phtml | HTML documents, Word, PDF, and any other documents that can be filtered to plain text |
| Index Protected Server | Yes | No | Yes | No | Yes | No | No | No | No |
| *Searching Features* | | | | | | | | | |
| Boolean Search | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes |
| Phrase Matching | No | Yes | No | No | Yes | No | Yes | Yes | No |
| Attribute Search | Yes | Yes | Yes | Yes | Yes | No | Yes | No | No |
| Fuzzy Search | No | No | Yes | Don't know | Yes | No | Yes | Yes | Yes |
| Word Forms | Yes | Yes | Yes | Don't know | Yes | No | Yes | Yes | No |
| Wild Card | Yes | Yes | Yes | Don't know | Yes | No | Yes | Yes | Yes |
| Regular Expression | No | No | No | Don't know | No | No | No | Yes | Yes |
| Numeric Data Search | Yes | No | No | No | No | No | No | Yes | No |
| Case Sensitivity | Yes | No | No | Don't know | No | No | No | No | Yes |
| Natural Language Query | No | No | No | No | No | No | No | Yes | No |
| *Other Features* | | | | | | | | | |
| International Language | No | Latin-extended languages | Yes | No | Yes | Yes | Yes | No | Yes |
| Page Limit | Theoretical limit: 2 billion documents; recommended: 50,000-500,000 pages | No "hard" limit; "soft" limit: 100,000 documents | No theoretical limit; can be over 100,000 pages | Unlimited | Several million | 1,000+ pages | Don't know | 10,000 pages for free | Don't know |
| Customizable Result Formatting | Yes | Yes | Yes | Don't know | Yes | Yes | Don't know | Yes | No |
4.2.1 Alkaline

Alkaline is a powerful search server. It supports most of the features we discussed here [2] [3].
Alkaline uses the concept of "cellular expansion" to index and search documents. The cellular expansion algorithm is a technique for hashing and quickly finding short binary blobs. It is claimed that the algorithm makes searching incomplete word forms across 500,000 documents extremely fast, but I have not been able to find any published description of this algorithm.
Alkaline uses an adaptive mechanism that is said to closely match the results to the elements searched: the more extensive the search query, the better the relevance the user gets. The word weight ranking gives different weights to words in the title, meta keywords, description, and text body. Alkaline provides the Weight option to modify ranking weights. Another option for changing the ranking is WeakWords; words in the WeakWords list are assigned lower weight.
Alkaline supports robot directives; AlkalineBOT is its registered robot name. Alkaline complies with /robots.txt directives. It will not follow links on a page if a robots NOFOLLOW meta tag is found, and it will not index document contents if a NOINDEX meta tag is found. Alkaline's robots support can be disabled by specifying Robots=N in the configuration file.
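As an illustration of the robots.txt check a standards-compliant crawler performs (not Alkaline's actual code), the following sketch uses Python's standard urllib.robotparser; the site URL and paths are placeholders.

```python
# Fetch and honor /robots.txt before crawling a URL.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()  # download and parse the directives

# can_fetch(user_agent, url) applies the Allow/Disallow rules.
if rp.can_fetch("AlkalineBOT", "http://www.example.com/private/page.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")
```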
Alkaline allows administrators to define the maximum depth of URLs to follow. The MD5 digest mechanism [4] within Alkaline can identify and ignore symbolic links and duplicated documents, such as http://www.abc.com and http://www.abc.com/index.html.
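The sketch below illustrates digest-based duplicate detection in the spirit of Alkaline's MD5 mechanism [4]; it is a simplified assumption of how such a check might look, using Python's hashlib, with made-up sample content.

```python
# Detect duplicate pages by hashing their content: two pages with the
# same digest are treated as the same document.
import hashlib

def is_duplicate(content, seen_digests):
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False

seen = set()
print(is_duplicate("<html>same page</html>", seen))  # False (first time)
print(is_duplicate("<html>same page</html>", seen))  # True (duplicate)
```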
Alkaline can index html, htm, text, and shtml files. To index PDF, embedded Shockwave Flash objects, doc, rtf, LaTeX/TeX, WordPerfect, XML, and MPEG Layer 3 files, Alkaline needs external document filters. A retrieved document of these kinds can be passed to any external filter, processed by the filter, and then indexed based on the filter's HTML output.
Alkaline supports retrieval of secured pages on password-protected sites (HTTP/1.0 BASIC authentication, NTLM support in the Windows NT versions, no SSL support).
Alkaline supports Boolean Search, Attribute Search, Word Forms, Wild Card, Numeric Data Search, and Case Sensitivity. It does not support Phrase Matching, Fuzzy Search, Regular Expression, and Natural Language Query.
- Boolean search: To express the fact that a page must contain a word, a '+' sign is placed in front of the word. To search for all pages not containing a word, a '-' sign is used. (A sketch of evaluating such queries appears after this list.)
- Attribute Search: Alkaline can define search scopes such as Host Scope (host:abc.com), Path Scope (path:abc/directory), URL Scope (url:www.abc.com/abc/directory), File Extension Scope (ext:cpp,h), and Meta Scope to allow searching within these scopes.
- Word forms: Alkaline supports word stemming. Searching for light will find all pages containing light, lightning, delighted, etc.
- Wild card: Alkaline can use * to return a list of all indexed documents.
- Numeric Data Search: Alkaline indexes words such as quantity=15 in a special manner. Thus it can support search such as quantity < 15, quantity =15, or quantity > 15.
- Case Sensitivity: Alkaline chooses a case-sensitive search when at least one upper-case letter is present in a word.
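The sketch below shows how '+'/'-' query terms of this kind can be evaluated against a word-to-documents inverted index (like the one sketched earlier); it is a simplified model with made-up data, not Alkaline's implementation, and it ignores unprefixed terms.

```python
# Evaluate '+'/'-' prefixed query terms against a word -> doc-IDs map.
def boolean_search(query, index, all_docs):
    results = set(all_docs)
    for term in query.split():
        if term.startswith("+"):        # page must contain the word
            results &= index.get(term[1:], set())
        elif term.startswith("-"):      # page must not contain the word
            results -= index.get(term[1:], set())
    return results

index = {"apple": {1, 2}, "pie": {2, 3}}
print(boolean_search("+apple -pie", index, {1, 2, 3}))  # -> {1}
```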
Alkaline does not support languages other than English. There is a theoretical limit of two billion documents that Alkaline can index, but the recommended usage is to index around 50,000-500,000 pages and 250,000 word forms. The layout of search results is fully customizable.
For detailed features of Alkaline, please refer to Appendix 1, which is the feature summary from the documentation of Alkaline [2].
4.2.2 Fluid Dynamic
The Fluid Dynamic search engine uses attribute indexing [5]. A document's text, keywords, description, title, and address are all extracted and used for searching. Essentially, this is full text indexing, but the "Max Characters: File" option allows one to set the maximum number of bytes read from any document. Keeping it at a low value saves indexing time at the expense of search accuracy.
The ranking of documents is decided by the frequency of query words in the documents. Query words found in the title, keywords, or description parts of a document are given additional weight, which can be modified by changing the values of the "Multiplier: Title", "Multiplier: Keywords", and "Multiplier: Description" settings. Every time a search term is found in the web page text, one point is added to the page's relevance. Every time a search term is found in the title, the value of the "Multiplier: Title" setting is added to the relevance, and similar additions are made for the META keywords and description. Results can also be ranked by last modified time, time the page was last indexed, and their inverses.
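A minimal sketch of this multiplier-based scoring follows; the field names, sample page, and multiplier values are illustrative assumptions, not Fluid Dynamic's actual settings or code.

```python
# Score one point per body occurrence of the term, plus a configurable
# multiplier for each occurrence in title, keywords, or description.
def relevance(term, page, multipliers):
    score = page["body"].lower().split().count(term)
    for field in ("title", "keywords", "description"):
        score += page[field].lower().split().count(term) * multipliers[field]
    return score

page = {"title": "search tools", "keywords": "search engine",
        "description": "free search software", "body": "search search text"}
mult = {"title": 10, "keywords": 5, "description": 3}
print(relevance("search", page, mult))  # 2 + 10 + 5 + 3 = 20
```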
Fluid Dynamic supports the Robot Exclusion Standard, i.e., it respects both the robots.txt file and Robots META tags. The crawler can stop after each level of crawling to wait for manual approval, so an administrator can control the depth of crawling. It can detect duplicated pages and will not index them.
Fluid Dynamic can index html, htm, shtml, shtm, stm, and mp3 files. To index PDF files, it needs the xpdf helper utility from www.foolabs.com/xpdf. It cannot index servers protected by passwords.
Fluid Dynamic supports Boolean Search, Phrase Matching, Attribute Search, Word Forms, and Wild Card [5] [6]. It does not support Fuzzy Search, Regular Expression, Numeric Data Search, Case Sensitivity, and Natural language Query.
- Boolean search: To express the fact that a page must contain a word, a '+' sign or "and" is placed in front of the word. To search for all pages not containing a word, a '-' sign or "not" is used. "or" or '|' means that the search term is preferred; additional preferred terms increase the ranking.
- Phrase Matching: Enclosing words in quotation marks causes them to be evaluated as a phrase.
- Attribute Search: Fluid Dynamic is able to limit search scopes to URLs, titles, texts, or links by using url:value (host:value or domain:value), title:value, text:value, or link:value.
- Word forms: Fluid Dynamic supports approximate English-language plural forms of words.
- Wild card: Fluid Dynamic uses * to represent one or more characters or symbols.
Fluid Dynamic is designed to search languages that use the Latin character set, including English, German, and Dutch. All Latin-extended characters are reduced to their English equivalents. The query interface and result display are template-based and thus easy to customize; it is also easy to translate the user interface into non-English languages. There is no theoretical page limit for Fluid Dynamic, but the soft limit imposed by disk space and CPU load is about 100,000 documents.
For detailed features of Fluid Dynamic, please refer to Appendix 2, which is the feature summary from the documentation of Fluid Dynamic [5].
4.2.3 ht://Dig
ht://Dig uses the most standard indexing method: the full text inverted index. The relevance ranking method is word weight; word weights are generally determined by the importance of the word in a document.
The crawler of ht://Dig supports the Robot Exclusion Standard. The depth of crawling can be limited by setting the maxhops option when running the crawling program, htdig. ht://Dig uses the signature of a document to detect duplicated pages, but it has been reported that ht://Dig did not remove duplicates [7].
ht://Dig can index html and txt files by default. PDF, MS Word, PowerPoint, PostScript, and Excel files can be indexed with the aid of external parsers or converters; the path of the external parser or converter must be put in the configuration file.
ht://Dig can index protected servers. It can be configured to use a specific username and password when it retrieves documents on a password protected server.
ht://Dig supports Boolean Search, Attribute Search, Fuzzy Search, Word Forms, and Wild Card. It does not support Phrase Matching, Regular Expression, Numeric Data Search, Case Sensitivity, and Natural Language Query.
- Boolean Search: AND is used to require all keywords; OR is used to search for any of the keywords.
- Attribute Search: ht://Dig can be set to perform a search that only returns documents whose URLs match a certain pattern. This is different from attribute search proper; we list it here because it is similar to searching within a URL scope.
- Fuzzy Search: ht://Dig supports soundex, metaphone, accents, and synonyms search.
- Word Forms: ht://Dig supports word stemming.
- Wild Card: Wild card usage is not described in any of the ht://Dig documentation, but the search engine at the Kennedy Space Center website [8], which is built using ht://Dig, supports powerful wild cards. More specifically:
  - '*' is used to substitute one or more characters.
  - '?' is used to substitute one character.
  - '#', '[ ]', '{ }', and '-' patterns are also supported.
Both SGML entities, such as 'à', and ISO-Latin-1 characters can be indexed and searched by ht://Dig. To support a specific language, ht://Dig must be configured to use dictionary and affix files for the language of choice by setting the locale attribute. There is no theoretical page limit; ht://Dig can usually index more than 100,000 pages. The output of a search can be easily customized using HTML templates.
For detailed features of ht://Dig, please refer to Appendix 3, which is the feature summary from the documentation of ht://Dig [9].
4.2.4 Juggernautsearch 1.0.1

In the documentation of Juggernautsearch, I could not find enough information to draw conclusions about whether it supports some of the features we discuss here. Juggernautsearch does, however, use a special indexing method that makes the indexing and searching process very fast.
Juggernautsearch extracts the top keywords from a document and indexes only these keywords. The keywords are assigned word weights according to their frequency of appearance in the document, and the index file stores them in order of decreasing weight. At search time, only the keywords stored in the index file are examined, and the weights of the words in a document are used to calculate the relevance ranking. Since only the keywords are indexed and searched, indexing and searching are very fast and the index files take little disk space.
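The following sketch illustrates keyword-only indexing of this kind; the cutoff N, the toy document, and the absence of stopword handling are simplifying assumptions, not Juggernautsearch's actual implementation.

```python
# Keep only the top-N most frequent words per document, stored in
# order of decreasing weight (here, raw frequency).
from collections import Counter

def top_keywords(text, n=10):
    counts = Counter(text.lower().split())
    return counts.most_common(n)   # most frequent words first

doc = "search engine search index fast index search"
print(top_keywords(doc, n=3))  # [('search', 3), ('index', 2), ('engine', 1)]
```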
Juggernautsearch supports the Robot Exclusion Standard. Its crawler is called Pagerunner. It does not provide control over the depth of crawling. Juggernautsearch can detect duplicated pages: it pre-scans retrieved URLs to remove unwanted URLs and URLs that have already been visited, and ensures that once a URL is indexed it will not be crawled again in later web crawl iterations. It cannot index protected sites.
The file formats that Juggernautsearch can index are as follows:
- Text and HTML files (.TXT, .HTM, .HTML, .SHTM, .SHTML, others)
- Microsoft PowerPoint files (.PPT)
- Microsoft Word files (.DOC)
- Microsoft Excel files (.XLS)
- Computer language source files (.BAT, .C, .CGI, .CXX, .CPP, .H, Java, .PHP, .PL, others)
- Postscript files (.PS)
- Rich Text Format files (.RTF)
Juggernautsearch supports Attribute Search: it can restrict a search to URLs only. Juggernautsearch does not support Boolean Search, which is a consequence of its indexing method. A Boolean search that returns pages omitting a keyword can work only when the full document is available to search; because Juggernautsearch extracts only the top few keywords, excluding a word from a search cannot guarantee that the word is absent from the document. The lack of Boolean Search is the price paid for fast indexing and searching. In addition, Juggernautsearch does not support Phrase Matching, Numeric Data Search, or Natural Language Query.
Juggernautsearch does not support languages other than English. It does not have a page limit, because the index file is very small.
Juggernautsearch issued a challenge to ht://Dig in response to criticism from some of the ht://Dig developers. An interesting comparison between Juggernautsearch and ht://Dig can be found in [10]; the comparison table is attached as Appendix 4.
4.2.5 mnoGoSearch

mnoGoSearch uses a full text inverted index. Words in different parts of a document are assigned different weights. To determine the relevance of a document, mnoGoSearch considers several factors: the number of complete phrases (taking word weights into account), the number of query words found in the document, and the number of incomplete phrases (again taking word weights into account).
mnoGoSearch supports the Robot Exclusion Standard, and the crawling depth of its crawler can be limited. By default, it can index html and txt files; with the aid of external parsers, pdf, ps, and doc files can be indexed as well. On servers supporting HTTP 1.1, mnoGoSearch can index mp3 files, and it can also index SQL database text fields. mnoGoSearch has the ability to index password-protected servers.
mnoGoSearch supports Boolean Search, Phrase Matching, Attribute Search, Fuzzy Search, Word Forms, and Wild Card. It does not support Regular Expression, Numeric Data Search, Case Sensitivity, and Natural Language Query.
- Boolean Search: '&' represents logical AND; '|' represents logical OR; '~' represents logical NOT.
- Phrase Matching: Words enclosed in double quotation marks are treated as a phrase when searching.
- Attribute Search: mnoGoSearch can limit search within documents with given tags, or with given URL substrings.
- Fuzzy Search: Supports synonyms and substring search.
- Word Forms: Supports word stemming.
- Wild Card: '%' can be used as a wild card when defining URL limits, but it cannot be used in ordinary search words.
mnoGoSearch supports almost all known 8-bit character sets as well as some multi-byte character sets, including Korean euc-kr, Chinese big5 and gb2312, Japanese shift-jis, and utf8. The euc-kr, big5, gb2312, and shift-jis character sets are not enabled by default, because their conversion tables are rather large and would increase the size of the executable files [11]. mnoGoSearch also supports the following Macintosh character sets: MacCE, MacCroatian, MacGreek, MacRoman, MacTurkish, MacIceland, MacRomania, MacThai, MacArabic, MacHebrew, and MacCyrillic. In terms of languages rather than character sets, mnoGoSearch can support around 700 languages, including most of the frequently used languages in the world.
mnoGoSearch can index several million documents. It provides PHP3, Perl, and C CGI access to the search engine, offering significant flexibility and options in arranging search results.
For detailed features of mnoGoSearch, please refer to Appendix 5 which is the feature summary from the documentation of mnoGoSearch [11].
4.2.6 Perlfect

Perlfect implements the most standard indexing and ranking algorithms. It uses an inverted index. To calculate word weights, it applies Gerald Salton's algorithm [12]: the weight W of a term T in a document D is

W(T, D) = tf(T, D) * log(DN / df(T)),

where tf(T, D) is the term frequency of T in D, DN is the total number of documents, and df(T) is the number of documents in which T appears, called the document frequency of T.
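The formula can be computed directly; the sketch below is a toy illustration of Salton's weighting with a made-up corpus (not Perlfect's code), using the natural logarithm.

```python
# W(T, D) = tf(T, D) * log(DN / df(T)) over a corpus of token lists.
import math

def weight(term, doc, docs):
    tf = doc.count(term)                    # term frequency tf(T, D)
    df = sum(1 for d in docs if term in d)  # document frequency df(T)
    return tf * math.log(len(docs) / df) if df else 0.0

docs = [["search", "engine"], ["search", "tools"], ["free", "software"]]
print(weight("search", docs[0], docs))  # 1 * log(3/2) = 0.405...
```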
Perlfect is the only search engine among the nine that does not support the Robot Exclusion Standard; it is mainly designed for adding a search function to a single website. The depth of crawling cannot be controlled, and it cannot index protected servers.
Perlfect only supports the Boolean Search feature. A '+' sign is used to include a word, while a '-' sign is used to exclude a word.
The result page of Perlfect can be shown in many languages, such as German, French, and Italian. The user interface is fully customizable using the provided templates. Perlfect is a lightweight search engine; it can index only on the order of 1,000 documents.
For detailed features of Perlfect, please refer to Appendix 6, which is the feature summary from the documentation of Perlfect [13].
4.2.7 SWISH-E

I have not been able to determine from publicly available documents what indexing and ranking methods SWISH-E uses.
The crawler supports the Robot Exclusion Standard. Its maximum depth of crawling can be controlled. SWISH-E cannot index protected servers.
SWISH-E can index html, xml, and txt files. With filters that convert other file types, such as MS Word documents, PDF, or gzipped files, into one of the types SWISH-E understands, those files can be indexed as well. Files with extensions gif, xbm, au, mov, and mpg can be indexed, but their content cannot be.
SWISH-E supports Boolean Search, Phrase Matching, Attribute Search, Fuzzy Search, Word Forms, and Wild Card. It doesn't support Regular Expression, Numerical Data Search, Case Sensitivity, and Natural Language Query.
- Boolean Search: and, or, and not are three logical operators of SWISH-E. The operators are case sensitive.
- Phrase Matching: Words in double quotation marks are treated as a phrase when searching.
- Attribute Search: SWISH-E allows users to specify certain META tags that can be used as document properties. Search can be limited to documents with specified properties.
- Fuzzy Search: SWISH-E supports soundex search.
- Word Forms: SWISH-E supports word stemming.
- Wild Card: * is used to replace single or multiple characters.
SWISH-E supports all languages that use single-byte characters.
For detailed features of SWISH-E, please refer to Appendix 7, which is the feature summary from the documentation of SWISH-E [14].
4.2.8 Webinator

Webinator uses an inverted index. The ranking algorithm takes into consideration relative word ordering, word proximity, database frequency, document frequency, and position in text. The relative importance of these factors in computing the quality of a hit can be altered under the Ranking Factors option.
The crawler supports the Robot Exclusion Standard. Its maximum depth of crawling can be controlled. Webinator cannot index protected servers.
Webinator detects duplicates by hashing the textual content of each page and not storing any page whose hash code is already in the database. Files with extensions html, htm, txt, pdf, doc, swf, asp, jsp, shtml, jhtml, or phtml can be indexed by Webinator.
Webinator supports Boolean Search, Phrase Matching, Fuzzy Search, Word Forms, Wild Card, Regular Expression, Numerical Data Search, and Natural Language Query. It does not support Attribute Search and Case Sensitivity.
- Boolean Search: A '-' is used to exclude a word; a '+' is used to include a word.
- Phrase Matching: Enclose the words in double quotation marks or hyphenate them together.
- Fuzzy Search: It lets you find "looks roughly like" or "sounds like" information. To invoke a fuzzy match, precede the word or pattern with the '%' character.
- Word Forms: Word stemming is supported.
- Wild Card: * can be used to match just the prefix of a word or to ignore the middle of something.
- Regular Expression: Users can find those items that cannot be located with a simple wildcard search using regular expression pattern matcher. To invoke the REX regular expression pattern matcher within a query, precede the expression with a '/'. For example, we can use /19[789][0-9] to find years between 1970 and 1999.
- Numerical Data Search: This allows you to find quantities in textual information however they may be represented. To invoke a numeric value search within a query, precede the value with a '#'. For example, the query #>5000 may return the match "2.2 million".
- Natural Language Query: A query can be in the form of a sentence or question.
Webinator does not support languages other than English. The free version of Webinator can only index about 10,000 pages. It has a customizable user interface.
For detailed features of Webinator, please refer to Appendix 8, which is the feature summary from the documentation of Webinator [15].
4.2.9 Webglimpse 2.x

The indexer and query engine of WebGlimpse is Glimpse. Glimpse implements a two-level query method, which leads to small index files and fast index construction, and supports arbitrarily approximate matching. The two-level query method is a hybrid of the inverted index and sequential search with no indexing [16] [17].
The first step of the indexing process is to divide the whole collection into small pieces called blocks. The number of blocks cannot exceed 256, so that the address of a block can be stored in one byte. The whole collection is scanned word by word, and an index similar to a regular inverted index is created, with one notable exception. In an inverted index, every occurrence of every word is indexed with a pointer to the exact location of the occurrence. In Glimpse's index, every word is indexed, but not every occurrence: each entry contains a word and the numbers of the blocks in which that word occurs. Since each block can be identified with one byte, and many occurrences of the same word are combined into one entry, the index is typically quite small.
The searching process consists of two phases. First, Glimpse searches the index for a list of all blocks that may contain a match to the query. Then each such block is searched separately and sequentially using agrep. Because the second phase is a sequential search, arbitrarily approximate searches such as fuzzy search, word forms, regular expressions, and wild cards are easily supported.
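The sketch below illustrates the two-level idea with toy data: a word-to-block index, then a scan of only the candidate blocks. It is a simplified model, not Glimpse's code; in real Glimpse the second phase runs agrep, for which a plain substring test stands in here.

```python
# Glimpse-style two-level index: map words to block numbers only
# (one byte each, hence at most 256 blocks), then scan candidate blocks.
from collections import defaultdict

def build_block_index(blocks):
    index = defaultdict(set)
    for block_no, text in enumerate(blocks):   # block_no fits in one byte
        for word in text.lower().split():
            index[word].add(block_no)
    return index

def two_level_search(word, blocks, index):
    hits = []
    for block_no in index.get(word.lower(), ()):    # phase 1: index lookup
        if word.lower() in blocks[block_no].lower():  # phase 2: sequential scan
            hits.append(block_no)
    return hits

blocks = ["fast index construction", "approximate matching with agrep"]
index = build_block_index(blocks)
print(two_level_search("agrep", blocks, index))  # -> [1]
```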
WebGlimpse supports the Robot Exclusion Standard. The crawling depth can be controlled. It cannot index protected servers. By default, it can index html and txt files; with the aid of filters, it can index PDF and any other documents that can be filtered to plain text.
WebGlimpse supports Boolean Search, Fuzzy Search, Word Form, Wild Card, Regular Expression, and Case Sensitivity. It does not support Phrase Matching, Attribute Search, Numeric Data Search, and Natural Language Query.
- Boolean Search: 'AND' operation is denoted by the symbol ';'. 'OR' operation is denoted by the symbol ','.
- Fuzzy Search: Supports misspellings and partial word matches.
- Word Form: Supports common word endings.
- Wild Card: The symbol '#' is used to denote a sequence of any number (including 0) of arbitrary characters. '*' works too.
- Regular Expression: The union operation '|', Kleene closure '*', and parentheses ( ) are supported to form regular expressions.
- Case Sensitivity: It supports case sensitive search.
WebGlimpse can index all single-byte languages. The output of the interface, however, is not configurable unless the commercial version of the software is purchased.
For detailed features of WebGlimpse, please refer to Appendix 9, which is the feature summary from the documentation of WebGlimpse [18].
We have compared and contrasted nine free search engine software packages. Each package has its pros and cons. Most support Boolean Search, Phrase Search, and Word Forms. ht://Dig has powerful wild cards. Juggernautsearch and WebGlimpse have small index files and fast indexing processes. Webinator supports natural language query, and it is the only search engine reviewed that can search numeric values in a textual environment. mnoGoSearch excels in supporting multiple languages. Perl-script search engines such as Perlfect and RuterSearch are usually lightweight; they have less functionality, but they are easy to install and use. In a nutshell, choosing a search engine software package is a decision that should be based on matching requirements against software features.
[1] The original version of this paper was finished in 2002 as a project paper for the course Information Sciences and Technology 511: Information Management (Information and Technology), taught by Dr. Lee Giles at the College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA. The author was a graduate student at that college at the time. All obsolete content of the original version has been removed from this paper.
[2] Alkaline: a UNIX/NT Search Engine - Alkaline 1.9 Users Guide. Vestris Inc., Switzerland. http://alkaline.vestris.com/docs/pdf/alkaline.pdf
[3] Alkaline: a UNIX/NT Search Engine - Alkaline 1.5 Frequently Asked Questions. Vestris Inc., Switzerland. http://alkaline.vestris.com/docs/pdf/alkaline-faq.pdf
[4] http://www.faqs.org/rfcs/rfc1321.html
[5] http://www.xav.com/scripts/search/features.html
[6] http://www.xav.com/scripts/search/help/
[7] Comparing Open Source Indexers. http://www.infomotions.com/musings/opensource-indexers/
[8] http://kscsearch.ksc.nasa.gov/htdig/
[9] http://www.htdig.org/
[10] Donald T. Kasper, Juggernautsearch Internet Search Engine 1.0.1 Technical Responses and Comparison to HTDIG (HT://DIG), May 2001. http://juggernautsearch.com/htdig.htm.
[11] http://mnogosearch.org/doc/
[12] http://www.perlfect.com/freescripts/search/development.shtml
[13] http://perlfect.com/freescripts/search/
[14] http://swish-e.org/docs/index.html
[15] Webinator WWW Site Indexer Version 5.0. http://www.thunderstone.com/site/webinator5man/webinator5.pdf
[16] Manber, U. and Wu, S., "GLIMPSE: A Tool to Search Through Entire File Systems", TR 93-34, Department of Computer Science, University of Arizona, Tucson, Arizona, 1993.
[17] Udi Manber, Mike Smith, and Burra Gopal, "WebGlimpse: Combining Browsing and Searching", Proceedings of the 1997 USENIX Technical Conference, January 6-10, 1997.
[18] http://www.webglimpse.net/features.html