Building a Simple Search Engine with C#

Everyone is talking about Search technology at the moment - how Google has risen to the top of the heap, how Yahoo is trying to regain its former number one spot, and how Microsoft is playing catch-up. But for the average ASP.NET developer, those sites are really about helping people find you on the Web. Once they've visited your website, how do you provide a cheap, fast, customised search to maximise the usability of your content?

There are a number of options available:

Microsoft Index Server
  • Advantages: comes with Windows 2000, XP and 2003
  • Disadvantages: file-system indexing only; doesn't spider website links or database-driven pages (there are tricks around this)

Other server-side software (eg. DTSearch, mnoGoSearch)
  • Advantages: shop around for the features you need, including multiple-language support
  • Disadvantages: cost; may be difficult to set up/customise

'Hosted services' (eg. Google, PicoSearch)
  • Advantages: often free or low cost; easy to set up
  • Disadvantages: lack of control; often template-driven or host ads which may distract your users

Most website operators will find at least one of these products can meet their needs, but it will always be a trade-off between cost, features and flexibility.

This article describes a simple, free, easy-to-install search feature. The goal is a search tool that can be set up just by placing three files on a website, and that could easily be extended to rival the features of the products listed above.

There are two main parts to a Search engine:

  • the build process, which processes files, indexing their contents and creating the 'catalog'
  • the search process, which uses the 'catalog' to find the search term and the names of the files it appears in

Design

A Catalog contains a collection of Words,
and each Word contains a reference to every File that it appears in

The first step was to think about how to implement the catalog objects. A Binary Search Tree seemed like a good idea (see the great articles on MSDN), but to keep things simple, Hashtables will do the job. We can always refactor the code to use a more sophisticated Collection class later on.
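As a rough picture of what that means (the names and values below are purely illustrative, not the final classes), the catalog boils down to a Hashtable keyed by word, where each entry holds another Hashtable recording which files contain that word and how often:

using System.Collections;

// Illustrative sketch only - the real classes are defined in Searcharoo.cs below
Hashtable index = new Hashtable();                 // word -> files-containing-word

Hashtable filesContainingWord = new Hashtable();   // file -> occurrence count
filesContainingWord["factbook/australia.html"] = 3;

index["canberra"] = filesContainingWord;           // 'canberra' occurs 3 times in that file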

The simple object model looks like this:

You can see that some assumptions have been made in this model.
Firstly, we store limited information about the File - just enough to produce a familiar search results page:

  • Url - a web-based address for the file (this will become important later)
  • Title - the &lt;title&gt; of the document
  • FileDate - date the file was last modified
  • Size - in bytes
  • Description - a 'summary' of the document

The Word object is even simpler - the properties are:

  • Text - the actual word! We will standardise on lowercase for all the data we store
  • InFiles - the collection of Files that this Word was found in

Lastly, the Catalog itself has a single property - the collection of Words called index. It also has two methods: one to add Words to the catalog, and another to search the catalog and get back a list of files (the search results).

There are two important assumptions which aren't immediately apparent from the model - there should only be ONE File object for each physical file, and ONE Word object for each word (so there will only be one Word object that represents the word "microsoft", for example), although that word will appear in many of the files we search. Why this is so, and how we manage it, is covered in the catalog build process.

Code Structure

Searcharoo.cs
    Implementation of the object model; compiled into both ASPX pages.

SearcharooCrawler.aspx
    <%@ Page Language="C#" Src="Searcharoo.cs" %>
    <%@ import Namespace="Searcharoo.Net" %>
    Code to build the catalog using the common classes, placing the resulting Catalog object in the ASP.NET Application Cache.

Searcharoo.aspx
    <%@ Page Language="C#" Src="Searcharoo.cs" %>
    <%@ import Namespace="Searcharoo.Net" %>
    Retrieves the Catalog object from the Cache and allows searching via an HTML form.

Object Model [Searcharoo.cs]

This file contains the C# code that defines the object model for our catalog, including the methods to add and search Words. These objects are used by both the crawler and the search page.
namespace Searcharoo.Net {
    public class Catalog {
        private System.Collections.Hashtable index;
        public Catalog() { }
        public bool Add(string word, File infile, int position) { }
        public Hashtable Search(string searchWord) { }
    }

    public class Word {
        public string Text;
        private System.Collections.Hashtable fileCollection;
        public Word(string text, File infile, int position) { }
        public void Add(File infile, int position) { }
        public Hashtable InFiles() { }
    }

    public class File {
        public string Url;
        public string Title;
        public string Description;
        public DateTime CrawledDate;
        public long Size;
        public File(string url, string title, string description, DateTime datecrawl, long length) { }
    }
}
Listing 1 - Overview of the object model (interfaces only - implementation code has been removed)

Build the Crawler [SearcharooCrawler.aspx]

Now that we have a model and structure, what next? In the interests of 'getting something working', the first build task is to simulate how our 'build' process is going to find the files we want to search. There are two ways we can look for files:

  i. Spidering - following 'the web' of links in HTML pages to search an entire website (or sites)
  ii. Crawling - traversing a set of files and folders and indexing all the files in those folders, using the file system. This can only work when the files are locally accessible.

The big search engines - Yahoo, Google, MSN - all spider the internet to build their search catalogs. However, following links to find documents requires us to write an HTML parser that can find and interpret the links, and then follow them! That's a little too much for one article, so we're going to start with some simple file crawling code to populate our catalog. The great thing about our object model is that it doesn't really care whether it is populated by spidering or crawling - it will work for either method; only the code that populates it will change.
Here is a simple method that we can use to locate the files we want to search by traversing the file system:

private void CrawlPath(string root, string path) {
    System.IO.DirectoryInfo m_dir = new System.IO.DirectoryInfo(path);
    // ### Look for matching files to summarise what will be catalogued ###
    foreach (System.IO.FileInfo f in m_dir.GetFiles(m_filter)) {
        Response.Write(path.Substring(root.Length) + @"/" + f.Name + "<br>");
    } // foreach file
    // ### Recurse into subdirectories ###
    foreach (System.IO.DirectoryInfo d in m_dir.GetDirectories()) {
        CrawlPath(root, path + @"/" + d.Name);
    } // foreach directory
}
Listing 2 - Crawling the filesystem

Screenshot 1 - To test the file crawler we downloaded the HTML from the CIA World FactBook

Now that we are confident we can access the files, we need to process each one in order to populate the catalog.
Firstly, let's be clear about what that process is:

  1. get the list of files and folders in the root directory (done)
  2. open the first file and read its contents
  3. look for the file's Title and Description, and calculate its size
  4. generate the file's Url (because we're crawling the file-system, but we want the file to have a web address to click on)
  5. clean up the text into a collection of words
  6. add each word to the catalog, linked to this file
  7. close the file and open the next one (or open a directory once all the files are processed)
  8. repeat until no more files are found

There are three different coding tasks to do:

  a. opening the files we find - we'll use the System.IO namespace for this
  b. finding specific text in the file (the Title and Description) - either the System.String static methods or the System.Text.RegularExpressions namespace might help here
  c. cleaning up the text and parsing it into individual words - definitely a job for Regular Expressions

Getting (a) working was easy:

System.IO.DirectoryInfo m_dir = new System.IO.DirectoryInfo(path);
// Look for matching files
foreach (System.IO.FileInfo f in m_dir.GetFiles(m_filter)) {
    Response.Write(DateTime.Now.ToString("t") + " "
        + path.Substring(root.Length) + @"/" + f.Name);
    Response.Flush();
    fileurl = m_url + path.Substring(root.Length).Replace(@"\", "/") + "/" + f.Name;

    System.IO.StreamReader reader = System.IO.File.OpenText(path + @"/" + f.Name);
    fileContents = reader.ReadToEnd();
    reader.Close();   // now use the fileContents to build the catalog...
Listing 3 - Opening the files

A quick Google helped find a solution to (b).

// ### Grab the <TITLE> ###
Match TitleMatch = Regex.Match(fileContents, "<title>([^<]*)",
    RegexOptions.IgnoreCase | RegexOptions.Multiline);
filetitle = TitleMatch.Groups[1].Value;
// ### Parse out META data ###
Match DescriptionMatch = Regex.Match(fileContents,
    "<meta name=\"description\" content=\"([^<]*)\">",
    RegexOptions.IgnoreCase | RegexOptions.Multiline);
filedesc = DescriptionMatch.Groups[1].Value;
// ### Get the file SIZE ###
filesize = fileContents.Length;
// ### Now remove HTML, convert to array, clean up words and index them ###
fileContents = stripHtml(fileContents);

Regex r = new Regex(@"\s+");   // whitespace pattern - used to compress whitespace in Listing 6
string wordsOnly = stripHtml(fileContents);

// ### If no META DESC, grab start of file text ###
if (null == filedesc || String.Empty == filedesc) {
    if (wordsOnly.Length > 350)
        filedesc = wordsOnly.Substring(0, 350);
    else if (wordsOnly.Length > 100)
        filedesc = wordsOnly.Substring(0, 100);
    else
        filedesc = wordsOnly;   // file is only short!
}
Listing 4 - Massage the file contents

And finally (c) involved a very simple Regular Expression or two, and suddenly we have the document as an Array of words, ready for processing!

protected string stripHtml(string strHtml) {
    // Strips the HTML tags from strHtml
    System.Text.RegularExpressions.Regex objRegExp
        = new System.Text.RegularExpressions.Regex("<(.|\n)+?>");

    // Replace all tags with a space, otherwise words either side
    // of a tag might be concatenated
    string strOutput = objRegExp.Replace(strHtml, " ");

    // Replace the &lt; and &gt; entities with < and >
    strOutput = strOutput.Replace("&lt;", "<");
    strOutput = strOutput.Replace("&gt;", ">");

    return strOutput;
}
Listing 5 - Remove HTML
and
Regex r = new Regex(@"\s+");                 // remove all whitespace
wordsOnly = r.Replace(wordsOnly, " ");       // compress all whitespace to one space
string[] wordsOnlyA = wordsOnly.Split(' ');  // results in an array of words
Listing 6 - Remove unnecessary whitespace
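To make the combined effect of Listings 5 and 6 concrete, here is a small worked example. The sample HTML string (and the output shown in the comments) is invented purely for illustration, and a using System.Text.RegularExpressions; directive is assumed, as in Listing 4:

string fileContents = "<html><head><title>Australia</title></head>" +
    "<body><h1>Geography</h1><p>Canberra is   the\ncapital.</p></body></html>";

string wordsOnly = stripHtml(fileContents);    // every tag becomes a single space
Regex r = new Regex(@"\s+");
wordsOnly = r.Replace(wordsOnly, " ");         // " Australia Geography Canberra is the capital. "
string[] wordsOnlyA = wordsOnly.Split(' ');    // { "", "Australia", "Geography", "Canberra", "is", "the", "capital.", "" }

Note that "capital." still carries its full stop and the leading/trailing spaces produce empty entries - the Trim() and ToLower() in Listing 7 clean those up as each word is added to the catalog.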

To recap - we have the code that, given a starting directory, will crawl through it (and its subdirectories), opening each HTML file, removing the HTML tags and putting the words into an array of strings.

Now that we can parse each document into words, we can populate our Catalog!

Build the Catalog

All the hard work has been done in parsing the file - building the catalog is as simple as looping through the word array and adding each word to the Catalog, along with a reference to the file it was found in:

// ### Loop through words in the file ###
int i = 0;        // position of the word in the file (starts at zero)
string key = "";  // the 'word' itself
// Now loop through the words and add to the catalog
foreach (string word in wordsOnlyA) {
    key = word.Trim(' ', '?', '"', ',', '\'', ';', ':', '.', '(', ')').ToLower();
    m_catalog.Add(key, infile, i);
    i++;
} // foreach word in the file
Listing 7 - Add words to the catalog
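Listing 7 relies on Catalog.Add(), whose body was omitted from Listing 1. A minimal sketch of how it (and the Word class it feeds) might be implemented - honouring the 'one Word object per unique word' rule described earlier - could look like the following; the details are an illustration, not the article's exact code:

// Inside the Catalog class (sketch)
public bool Add(string word, File infile, int position) {
    if (index == null) index = new System.Collections.Hashtable();
    if (index.ContainsKey(word)) {
        // word already catalogued: record another occurrence against this file
        Word theword = (Word)index[word];
        theword.Add(infile, position);
    } else {
        // first time we've seen this word: create the single Word object for it
        index.Add(word, new Word(word, infile, position));
    }
    return true;
}

// Inside the Word class (sketch): fileCollection maps File -> occurrence count,
// which becomes the 'rank' returned by Search(). 'position' is not used in this
// simple sketch (it would be needed for phrase searching later).
public Word(string text, File infile, int position) {
    Text = text;
    fileCollection = new System.Collections.Hashtable();
    Add(infile, position);
}

public void Add(File infile, int position) {
    if (fileCollection.ContainsKey(infile)) {
        fileCollection[infile] = (int)fileCollection[infile] + 1;   // one more occurrence
    } else {
        fileCollection.Add(infile, 1);                              // first occurrence in this file
    }
}

public Hashtable InFiles() {
    return fileCollection;   // key: File, value: count of occurrences
}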

As each file is processed a line is written to the browser to indicate the catalog build progress, showing the File.Url and the number of words parsed.

Screenshot 2 - Processing the CIA World FactBook - it contains 40,056 words according to our code.

After the last file is processed, the Catalog object is added to the Application Cache object, and is ready for searching!
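The article doesn't spell out that line, but storing the finished catalog in the ASP.NET Cache is a one-liner along these lines (the cache key name "Searcharoo_Catalog" is an assumption, not taken from the article):

// At the end of the crawl in SearcharooCrawler.aspx: keep the Catalog in memory
// so Searcharoo.aspx can use it without re-crawling on every search
Cache["Searcharoo_Catalog"] = m_catalog;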

Build the Search

The finished Catalog now contains a collection of Words, and each Word object has a collection of the Files it was found in. The Search method of the Catalog takes a single word as the search parameter, and returns the Hashtable of File objects where that Word was found. The returned Hashtable's keys are File objects and its values are the rank (i.e. a count of the number of times the word appears in that file).

Because the hard work was done while building the catalog, the Search method itself only has to look up the word in the index and return its collection of Files:

/// <summary>Returns all the Files which contain the searchWord</summary>
/// <returns>Hashtable of File objects</returns>
public Hashtable Search(string searchWord) {
    // apply the same 'trim' as when we're building the catalog
    searchWord = searchWord.Trim('?', '"', ',', '\'', ';', ':', '.', '(', ')').ToLower();
    Hashtable retval = null;
    if (index.ContainsKey(searchWord)) {   // does all the work!!!
        Word thematch = (Word)index[searchWord];
        retval = thematch.InFiles();       // return the collection of File objects
    }
    return retval;
}
Listing 8 - the Search method

The key point is how simple the Search method can be, because of the amount of work performed during the cataloging.

Obviously there are a number of enhancements we could make here, starting with multiple word searches (finding the intersection of the File Hashtables for each Word), implementing Boolean searches, fuzzy matches (or matching word stems/roots)... the list is (almost) endless, but beyond the scope of this article.
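As a taste of the first of those enhancements, a two-word 'AND' search could be built on top of the existing Search() method by intersecting the two Hashtables of Files and summing their ranks. This is a sketch only - SearchBoth is not part of the article's code:

// Sketch: files that contain BOTH words, ranked by the combined word count
public Hashtable SearchBoth(string word1, string word2) {
    Hashtable first = Search(word1);
    Hashtable second = Search(word2);
    if (first == null || second == null) return null;   // either word missing = no AND match

    Hashtable combined = new Hashtable();
    foreach (DictionaryEntry entry in first) {
        if (second.ContainsKey(entry.Key)) {             // File appears in both result sets
            combined.Add(entry.Key, (int)entry.Value + (int)second[entry.Key]);
        }
    }
    return (combined.Count > 0) ? combined : null;
}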

Build the Results [Searcharoo.aspx]

Searcharoo.aspx initially displays an HTML form to allow the user to enter the search term.

Screenshot 3 - Enter the search term

When this form is submitted, we look for the Word in the index Hashtable using the ContainsKey() method, relying on the .NET Framework's efficient hash-based lookup of a key via its HashCode. The Hashtable.ContainsKey() method is actually doing the search for us.

The Catalog.Search() method returns a Hashtable containing the matching File objects, so all we have to do is display them in HTML format!

The display process has been broken into a few steps below:

Firstly, we call the Search method to get the result Hashtable. If the result is null, we skip straight to Listing 13 because there were no matches; otherwise we have a little more work to do...

// Do the search
Hashtable searchResultsArray = m_catalog.Search(searchterm);
// Format the results
if (null != searchResultsArray) {
Listing 9 - The actual search is the easy bit

The Hashtable returned from the Search() method has File objects as keys and the page rank as values. The problem is that they are not in any particular order!
To access these objects in the foreach loop, we need to cast each key to a File and each value to an int.

To do that, we loop through the result Hashtable, building the HTML for each entry and adding it to a SortedList keyed by rank, so the output ends up ordered:

// intermediate data-structure for 'ranked' result HTML
SortedList output = new SortedList(searchResultsArray.Count); // empty sorted list
DictionaryEntry fo;
File infile;
string result = "";
// build each result row
foreach (object foundInFile in searchResultsArray) {
    // build the HTML output in the sorted list, so the 'unsorted'
    // searchResults are 'sorted' as they're added to the SortedList
    fo = (DictionaryEntry)foundInFile;

    infile = (File)fo.Key;
    int rank = (int)fo.Value;
Listing 10 - Processing the results

With the File and its rank in hand, we build the HTML fragment for that result row:

    // Create the formatted output HTML
    result = "<a href=\"" + infile.Url + "\">";
    result += "<b>" + infile.Title + "</b></a>";
    result += " <a href=\"" + infile.Url + "\" target=\"_TOP\" ";
    result += "title=\"open in new window\" style=\"font-size:xx-small\">&uarr;</a>";
    result += " (" + rank + ")";
    result += "<br>" + infile.Description + "...";
    result += "<br>" + infile.Url + " - " + infile.Size;
    result += " bytes - " + infile.CrawledDate + "<p>";

Listing 11 - Pure formatting

Before we can output the results, we need to get them into some order. We'll use a SortedList and add the HTML result string to it using the page rank as the key. If there is already a result with the same rank, we'll concatenate the results together (they'll appear one after the other).

    int sortrank = (rank * -1); // multiply by -1 so larger scores go to the top
    if (output.Contains(sortrank)) { // rank exists; concatenate same-rank output strings
        output[sortrank] = ((string)output[sortrank]) + result;
    } else {
        output.Add(sortrank, result);
    }
    result = ""; // clear string for next loop
Listing 12 - Sorting the results by rank

To make sure the highest rank appears at the top of the list, the rank is multiplied by -1!
Now all we have to do is Response.Write the SortedList, string by string, followed by the number of matches.

    // Now output to the HTML Response
    foreach (object rows in output) { // Already sorted!
        Response.Write((string)((DictionaryEntry)rows).Value);
    }
    Response.Write("<p>Matches: " + searchResultsArray.Count);
} else {
    Response.Write("<p>Matches: 0");
}
Response.Write("<p><a href=\"#top\">&uarr; top</a>");
Response.End(); // Stop here

Listing 13 - Output the results

The output should look familiar to any web search engine user. We've implemented a simple ranking mechanism (a word count, shown in parentheses after the Title/Url); however, it doesn't support paging.

Screenshot 4 - Search results contain a familiar amount of information, and the word-count-rank value. Clicking a link opens the local copy of the HTML file (the ↑ opens in a new window).

Using the sample code

The goal of this article was to build a simple search engine that you can install just by placing some files on your website; so you can copy Searcharoo.cs, SearcharooCrawler.aspx and Searcharoo.aspx to your web root and away you go!
However that means you accept all the default settings, such as only searching .HTML files, and the search starting from the location of the Searcharoo files.

To change those defaults you need to add some settings to web.config:

<appSettings>
    <!-- physical folder where the catalog crawl starts -->
    <add key="Searcharoo_PhysicalPath" value="c:\Inetpub\wwwroot\" />
    <!-- web address corresponding to the physical path (used to build each File's Url) -->
    <add key="Searcharoo_VirtualRoot" value="http://localhost/" />
    <!-- which files to include in the catalog -->
    <add key="Searcharoo_FileFilter" value="*.html" />
</appSettings>
Listing 14 - web.config
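Inside the ASPX pages those settings would be read with the standard .NET 1.x configuration API; a sketch of what that might look like follows - the fallback defaults are assumptions, not taken from the article:

using System.Configuration;

// Read the Searcharoo settings, falling back to defaults when a key is missing
string physicalPath = ConfigurationSettings.AppSettings["Searcharoo_PhysicalPath"];
if (physicalPath == null) { physicalPath = Server.MapPath("./"); }  // start from the page's own folder

string virtualRoot = ConfigurationSettings.AppSettings["Searcharoo_VirtualRoot"];
if (virtualRoot == null) { virtualRoot = "http://localhost/"; }

string fileFilter = ConfigurationSettings.AppSettings["Searcharoo_FileFilter"];
if (fileFilter == null) { fileFilter = "*.html"; }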
Then simply navigate to http://localhost/Searcharoo.aspx (or wherever you put the Searcharoo files) and it will build the catalog for the first time.

If your application re-starts for any reason (e.g. you compile code into the /bin/ folder, or change web.config settings) the catalog will need to be rebuilt - the next user who performs a search will trigger the catalog build. This is accomplished by checking whether the Cache contains a valid Catalog and, if not, using Server.Transfer to start the crawler.
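A sketch of that check at the top of Searcharoo.aspx might look like this (using the same assumed cache key as above):

// In Searcharoo.aspx, before handling a search: make sure the Catalog exists
Catalog m_catalog = (Catalog)Cache["Searcharoo_Catalog"];
if (m_catalog == null) {
    // the application restarted (or this is the first run) - rebuild the catalog
    Server.Transfer("SearcharooCrawler.aspx");
}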

Future

In the real world, most ASP.NET websites probably have more than just HTML pages, including links to DOC, PDF or other external files and ASPX dynamic/database-generated pages.
The other issue you might have is storing a large blob of data in your Application Cache. For most websites the size of this object will be manageable - but if you've got a lot of content you might not want that in memory all the time.
The good news is the code above can be easily extended to cope with these additional scenarios (including spidering web links, and using a database to store the catalog)... check back for future articles.

Postscript : What about code-behind and Visual-Studio.NET?

You'll notice the two ASPX pages use the Src="Searcharoo.cs" @Page attribute to share the common object model without compiling it into an assembly, with the page-specific code written 'inline' in each ASPX page rather than in Visual Studio .NET code-behind files.
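For anyone who hasn't used the Src attribute before, the top of each page looks roughly like this - only the two directives are taken from the article, the rest is a sketch:

<%@ Page Language="C#" Src="Searcharoo.cs" %>
<%@ Import Namespace="Searcharoo.Net" %>
<script runat="server">
    // page-specific 'inline' code (e.g. Page_Load and the search handling) lives here;
    // ASP.NET compiles it together with Searcharoo.cs on the fly - no pre-built assembly needed
</script>

Either page could be rewritten with code-behind and a compiled assembly in Visual Studio .NET, but keeping everything in three plain text files is what makes the 'copy three files and go' deployment possible.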
