Full-Text Search in ASP.NET using Lucene.NET

This post is about the full-text search engine Lucene.NET and how I integrated it into BugTracker.NET. If you are thinking of adding full-text search to your application, you might find this post useful. I'm not saying this is THE way of using Lucene.NET, but it is an example of ONE way.
 
Lucene.NET is a C# port of the original Lucene, an Apache Foundation open source project written in Java.

Why did I use Lucene.NET instead of the SQL Server full-text search engine? Well, I'd like to say that I did some research into the pros and cons of the two choices, but actually I didn't do any comparative research. What happened was that during a Stack Overflow podcast I heard Joel Spolsky mention that FogBugz uses Lucene as its engine and that he was happy with it. I trust him, and I was curious, so one weekend I downloaded Lucene.NET and played with it a bit, and before the weekend was over I was already done integrating it into BugTracker.NET. I never looked at the SQL Server alternative at all, so I can't tell you anything about it.

Lucene itself is a class library, not an executable. You call Lucene functions to do the search. There is an open source standalone server built on Lucene called Solr. You send Solr messages to do the search. One way of using Lucene would have been to have my users run Solr side-by-side with SQL Server. As with SQL Server full-text search, I can't tell you anything about Solr because I didn't try it. It wouldn't have made sense to use Solr for BugTracker.NET, I think, because Solr would have been an additional installation hassle. And running a server wouldn't have been doable at all at a cheap shared host like GoDaddy, where my own BugTracker.NET demo lives. So, instead of using Solr, I used the Lucene class libraries directly.

To integrate Lucene, I had to build the following, which I list here and then describe in more detail below.

1) How Lucene would build its searchable index.   Lucene doesn't search my SQL Server database directly.   Instead, it searches its own "database", its own index.
2) The design of the Lucene index
3) How I would update Lucene's index whenever data in my database changes.
4) Sending the search query to Lucene.
5) Displaying the results.


Now the details.  I've simplified my code for this post, so that you can more easily see the overall design and understand the concepts and my design choices.



1) Building the index.

When an ASP.NET application receives its first HTTP request after having been shut down, the Application_OnStart event fires, which I handle in Global.asax.   There I call my "build_lucene_index" method.    Notice that I have a configuration setting "EnableLucene".    I was nervous about my understanding of Lucene and whether my way of using it was the right architecture, and so I wanted to make sure I gave my users a way of turning Lucene off in case it was causing trouble.    More on that in a bit.

For a really big database, you wouldn't want to necessarily build the search index from scratch over and over, but I'm counting on BugTracker.NET databases being on the small side.   Is that a safe assumption?  A bug database shouldn't be that big or else you're doing it wrong, right?


public void Application_OnStart(Object sender, EventArgs e)
{
    if (btnet.Util.get_setting("EnableLucene", "1") == "1")
    {
        build_lucene_index(this.Application);
    }
}



The build_lucene_index method starts a new worker thread, where the real work is done.


public static void build_lucene_index(System.Web.HttpApplicationState app)
{
    System.Threading.Thread thread = new System.Threading.Thread(threadproc_build);
    thread.Start(app);
}


The worker thread first grabs a lock so that it can build the index without being disturbed by other threads.   The other threads would be the result of users either searching or users updating text, triggering a modification to Lucene's index.    I don't want those threads to be dealing with a partially built index, so I make those threads wait for the one-and-only lock.

My way of handling multithreading was one of the things that I was nervous about.   I feared some sort of hard-to-reproduce deadlock condition, or race condition, but so far, there have been no reports from BugTracker.NET users of any trouble, so my design appears to be solid.

To create the index, I create a Lucene "IndexWriter".   I run a SQL query against my database to fetch the text I want to be able to search and the database keys that go with that text.   Then I loop through the query results adding a Lucene "Document" for each row.   Actually, in my real code, I get the searchable text from several different fields in my database, but in the snippet below I have simplified my harvesting of text from my database.  



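// The one-and-only lock shared by the build, update, and search code
static readonly object my_lock = new object();

// static, because the static thread procs below use it
static Lucene.Net.Analysis.Standard.StandardAnalyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer();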

static void threadproc_build(object obj)
{
    lock (my_lock)
    {
        try
        {

            Lucene.Net.Index.IndexWriter writer = new Lucene.Net.Index.IndexWriter("c:\\folder_where_lucene_index_lives", analyzer , true);
            
            DataSet ds = btnet.DbUtil.get_dataset("select bug_id, bug_text from bugs");
            
            foreach (DataRow dr in ds.Tables[0].Rows)
            {
                writer.AddDocument(create_doc(
                    (int)dr["bug_id"],
                    (string)dr["bug_text"]));
            }
            
            writer.Optimize();
            writer.Close();
        }
        catch (Exception e)
        {
            btnet.Util.write_to_log("exception building Lucene index: " + e.Message);
        }
    }
}
 



2) The design of the Lucene Index

Here's where I create a Lucene "Document".    An index contains a list of documents.   A doc has fields that you define.   My doc shown here has three fields.   The first field "text" is what Lucene will analyze and index, the searchable text.  The second field is the key I will use to link the Lucene data to the rows in my database.   Notice I tell Lucene that this key should be UN_TOKENIZED, stored as is.   That's all you need for a minimal Lucene doc, a key for you and some text to search on for Lucene.  The third field in my example is the text again, but this time, UN_TOKENIZED, stored as is.  I will use that text for having Lucene highlight in my results page the snippets where the hits are.   More on highlighting later.

One of the decisions you'll have to make when using Lucene is what text to index and how to package it for Lucene.    In my database, the text doesn't just live in one field.    A bug has a short text description, a list of comments, a list of incoming and outgoing emails, and even Digg-style tags.   In my real code, as opposed to the snippets here, I fetch text from all these places.   My real Lucene doc has four fields, the fourth being another database key that I can use to link to the specific comment or email where the search hit is.   BugTracker.NET supports custom text fields and in the future I hope to harvest that text from the database and add it to the Lucene doc.

So, if your app is like mine, with text in many different places, then you'll have a challenge like mine, how to package the text into a Lucene doc.


static Lucene.Net.Documents.Document create_doc(int bug_id, string text)
{     
    Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
    
    doc.Add(new Lucene.Net.Documents.Field(
        "text",
        new System.IO.StringReader(text)));
    
    doc.Add(new Lucene.Net.Documents.Field(
        "bug_id",
        Convert.ToString(bug_id),
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.UN_TOKENIZED));
    
    // For the highlighter, store the raw text
    doc.Add(new Lucene.Net.Documents.Field(
        "raw_text",
        text,
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.UN_TOKENIZED));

    return doc;
}
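
As a rough illustration of that packaging problem, here is a minimal sketch that flattens one bug's description and comments into one record per piece of text, each carrying the bug key plus a second key for the comment it came from. Each record could then become one Lucene doc. The SearchableText and TextHarvester names, and the convention of using 0 as the source key for the bug's own description, are assumptions made up for this example. This is not the real BugTracker.NET code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical record: one searchable piece of text belonging to a bug.
public class SearchableText
{
    public int BugId;     // key back to the bug row
    public int SourceId;  // key of the comment or email row; 0 means the bug's own description
    public string Text;   // the text Lucene would analyze
}

public static class TextHarvester
{
    // Flattens a bug's description and its comments into one record per piece of text.
    public static List<SearchableText> Harvest(
        int bugId, string description, IDictionary<int, string> commentsById)
    {
        var pieces = new List<SearchableText>();
        pieces.Add(new SearchableText { BugId = bugId, SourceId = 0, Text = description });

        foreach (KeyValuePair<int, string> comment in commentsById)
        {
            if (string.IsNullOrEmpty(comment.Value))
            {
                continue; // nothing to index for an empty comment
            }
            pieces.Add(new SearchableText { BugId = bugId, SourceId = comment.Key, Text = comment.Value });
        }
        return pieces;
    }
}
```

With one doc per piece of text, the same bug can show up in several hits, which is why the display code has to list each bug only once.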



3) Updating the index

Whenever a user updates text in a bug I launch a worker thread to update the index.    The worker thread grabs a lock so that only one thread is updating the index at a time.   

The worker thread creates a Lucene "IndexModifier", deletes the old doc, and replaces it with a new one.

Notice that the thread closes the "searcher".   The searcher is a Lucene "Searcher".   The life cycle of a Searcher is that it first loads the index and then does its searches using that loaded, cached version of the index.   If the real index changes on disk, the searcher wouldn't know about it.   It would continue searching the out-of-date cached copy of the index in its memory.    That might be ok for your situation, and if your index is very big and the cost of creating a new searcher is high, you might be forced to use a searcher with a stale index.   BugTracker.NET databases tend to be small, so I can get away with making sure my searcher always has an up-to-date index to work with.

The official Lucene FAQ says that a Searcher (aka IndexSearcher) "is thread-safe. Multiple search threads may use the same instance of IndexSearcher concurrently without any problems. It is recommended to use only one IndexSearcher from all threads in order to save memory."



// static, because the static thread proc below closes and resets it
static Lucene.Net.Search.Searcher searcher = null;

static void threadproc_update(object obj)
{
    lock (my_lock) // If a thread is updating the index, no other thread should be doing anything with it.
    {
        
        try
        {
            if (searcher != null)
            {
                try
                {
                    searcher.Close();
                }
                catch (Exception e)
                {
                    btnet.Util.write_to_log("Exception closing lucene searcher:" + e.Message);
                }
                searcher = null;
            }
            
            Lucene.Net.Index.IndexModifier modifier = new Lucene.Net.Index.IndexModifier("c:\\folder_where_lucene_index_lives", analyzer, false);
            
            // same as build, but uses "modifier" instead of "writer".
            // uses additional "where" clause for bugid
            
            int bug_id = (int)obj;
            
            modifier.DeleteDocuments(new Lucene.Net.Index.Term("bug_id", Convert.ToString(bug_id)));
            
            DataSet ds = btnet.DbUtil.get_dataset("select bug_id, bug_text from bugs where bug_id = " + Convert.ToString(bug_id));
            
            foreach (DataRow dr in ds.Tables[0].Rows) // one row...
            {
                modifier.AddDocument(create_doc(
                    (int)dr["bug_id"],
                    (string)dr["bug_text"]));
            }
            
            modifier.Flush();
            modifier.Close();
            
        }
        catch (Exception e)
        {
            btnet.Util.write_to_log("exception updating Lucene index: " + e.Message);
        }
    }
}




4) Sending the search query to Lucene

To search, create a Lucene "QueryParser".    Call its Parse() method passing the text the user typed in.   The Parse() method returns a "Query".   Call the Searcher's Search() method passing the Query.   The Search() method returns a Lucene "Hits" object, a collection of the search hits.    
          
As I've mentioned, I want my searcher to always be using the most up-to-date index, so whenever I do update the index, I destroy the old searcher, and then recreate it again the next time it's needed.     

Since IIS is handling the HTTP requests with multiple threads, these searches are happening on multiple threads.   Each search tries to grab my one-and-only lock, the one that keeps the updating threads from conflicting with each other and that keeps the updating threads from conflicting with searches.     Because there is just this one-and-only lock, all the searches on the website have to line up in single-file to get through this bottleneck.   Sounds terrible, doesn't it?   But so far, no reports of any problems.   It's just a bug tracker, not Twitter, and so I can get away with this design, and there's no confusion ever about people doing searches with out-of-date indexes.

     
Lucene.Net.QueryParsers.QueryParser parser = new Lucene.Net.QueryParsers.QueryParser("text", analyzer );
Lucene.Net.Search.Query query = null;

try
{
    if (string.IsNullOrEmpty(text_user_entered))
    {
        throw new Exception("You forgot to enter something to search for...");
    }
    
    query = parser.Parse(text_user_entered);
    
}
catch (Exception e)
{
    display_exception(e);
}


lock (my_lock)
{
    
    Lucene.Net.Search.Hits hits = null;
    try
    {
        if (searcher == null)
        {
            searcher = new Lucene.Net.Search.IndexSearcher("c:\\folder_where_lucene_index_lives");
        }

        hits = searcher.Search(query);

    }
    catch (Exception e)
    {
        display_exception(e);
    }
    
    // hits is still null if the search above threw, so guard before reading it
    for (int i = 0; hits != null && i < hits.Length(); i++)
    {
        Lucene.Net.Documents.Document doc = hits.Doc(i);
        // ... more processing of the hits and the Lucene docs here ...
    }
}
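
The pattern in sections 3 and 4 (one lock, a searcher that is created lazily and thrown away on every update) can be sketched without Lucene at all. The LockedCache class below is a hypothetical helper written for this post, not something from Lucene.NET or BugTracker.NET. The create and close delegates stand in for "new IndexSearcher(...)" and "searcher.Close()".

```csharp
using System;

// Hypothetical helper: a single lock guards both the cached resource and every update.
public class LockedCache<T> where T : class
{
    private readonly object _lock = new object();
    private readonly Func<T> _create;
    private readonly Action<T> _close;
    private T _current;

    public LockedCache(Func<T> create, Action<T> close)
    {
        _create = create;
        _close = close;
    }

    // Search path: create the resource if needed, then use it while holding the lock.
    public TResult Use<TResult>(Func<T, TResult> work)
    {
        lock (_lock)
        {
            if (_current == null)
            {
                _current = _create();
            }
            return work(_current);
        }
    }

    // Update path: discard the stale resource, then apply the change under the same lock.
    public void Update(Action change)
    {
        lock (_lock)
        {
            if (_current != null)
            {
                // _current is cleared even if closing throws
                try { _close(_current); }
                finally { _current = null; }
            }
            change();
        }
    }
}
```

Because Use() and Update() take the same lock, a search can never see a half-updated index, at the cost of searches waiting in line behind each other.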



5) Displaying the results

If you didn't like my design prior to this point, what with the locking and the bottleneck, then you are going to really hate it now, because it gets weird now.    The search results I get back from Lucene are in the form of a Hits object, a collection of hits that you access by index.   The collection is ordered by relevance score, best hit first, and you can get each hit's score using the Hits.Score() method.   You can also get at the Lucene Document related to the hit via the Hits.Doc() method.

Now, back when I was designing my Lucene Document, I had to be thinking ahead regarding how I would display the results.   Would I display the results based purely on what's in the document?   If so, then I would have had to add fields to the doc for everything I wanted to eventually be displaying, not just the fields I needed for search.   The more fields I put in the doc, the more I would have to be updating the doc and the index to keep it in sync with my database, and the more I would be duplicating database data in the Lucene index.   So, there was a downside to relying strictly on the Lucene doc for my display.

Also, and for me more importantly, I already have a page in my app that knows how to display a list of bugs based on the result of a SQL query.   I didn't want to have to adapt that page to work with a Lucene Hits object.    I wanted to somehow convert the Lucene results into the format expected by that existing page.

So, I decided to try importing the Hits into the database, then letting my existing page fetch the hits out of the database, joining the hits to my bugs table to pick up the fields that I had not bothered to duplicate in the Lucene doc as fields.

The code below shows how I imported the Lucene hits into the database.    In short, I create a big batch of SQL Statements and execute them in one trip to the server.    The batch of SQL Statements creates a temporary table with a unique name plus a bunch of insert statements, one for every Lucene hit I want to import and display.    I import the best 100 hits, which is more than enough.   Lucene can find multiple hits in the same document, but I only want to list a given bug once in the search results, so I have logic for that below, the dict_already_seen_ids.

You will probably want to show your users the text around where the hit is, with the searched-for words highlighted, displayed in their context.   Lucene can prepare that displayable snippet of text for you.   You have to create a bunch of Lucene objects, a Formatter, a SimpleFragmenter, a QueryScorer, a Highlighter, etc, as does my code below.   I specified a snippet length of 400 characters and I specified the highlighting to be done using this HTML:  <span style='background:yellow;'></span>.    I feed to the highlighter the original Query and the raw text that I had saved in the doc.   Lucene then gave me the formatted, highlighted snippets, which I inserted into my temporary database table.

You might think that the import of the Lucene hits into the database would perform poorly, but actually, it's fast.    Had this not worked, then my plan B would have been to create a more complete Lucene Doc, and then somehow programmatically synthesize an ADO.NET recordset for my page downstream that displays results.



Lucene.Net.Highlight.Formatter formatter = new Lucene.Net.Highlight.SimpleHTMLFormatter(
    "<span style='background:yellow;'>",
    "</span>");

Lucene.Net.Highlight.SimpleFragmenter fragmenter = new Lucene.Net.Highlight.SimpleFragmenter(400);
Lucene.Net.Highlight.QueryScorer scorer = new Lucene.Net.Highlight.QueryScorer(query);
Lucene.Net.Highlight.Highlighter highlighter = new Lucene.Net.Highlight.Highlighter(formatter, scorer);
highlighter.SetTextFragmenter(fragmenter); 

StringBuilder sb = new StringBuilder();
string guid = Guid.NewGuid().ToString().Replace("-", "");
Dictionary<string, int> dict_already_seen_ids = new Dictionary<string, int>();
// Name the temp table with the same guid that the insert statements use
sb.Append("create table #" + guid + @"
    (
        temp_bg_id int,
        temp_score float,
        temp_text nvarchar(3000)
    )
");


// insert the search results into a temp table which we will join with what's in the database
for (int i = 0; i < hits.Length(); i++)
{
    if (dict_already_seen_ids.Count < 100)
    {
        Lucene.Net.Documents.Document doc = hits.Doc(i);
        string bg_id = doc.Get("bug_id"); // the key field name used in create_doc
        if (!dict_already_seen_ids.ContainsKey(bg_id))
        {
            dict_already_seen_ids[bg_id] = 1;
            sb.Append("insert into #");
            sb.Append(guid);
            sb.Append(" values(");
            sb.Append(bg_id);
            sb.Append(",");
            // Format the score with the invariant culture so the decimal separator is always ".".
            // On servers whose locale uses "," the default formatting would break the SQL.
            sb.Append(hits.Score(i).ToString(System.Globalization.CultureInfo.InvariantCulture));
            sb.Append(",N'");
            
            string raw_text = Server.HtmlEncode(doc.Get("raw_text"));


            Lucene.Net.Analysis.TokenStream stream = analyzer.TokenStream("", new System.IO.StringReader(raw_text));

            string highlighted_text = highlighter.GetBestFragments(stream, raw_text, 1, "...").Replace("'", "''");


            if (highlighted_text == "") // sometimes the highlighter fails to emit text...
            {
                highlighted_text = raw_text.Replace("'","''");
            }
            if (highlighted_text.Length > 3000)
            {
                highlighted_text = highlighted_text.Substring(0,3000);
            }
            sb.Append(highlighted_text);
            sb.Append("'");
            sb.Append(")\n");
        }
    }
    else
    {
        break;
    }
}  
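
The snippet above stops once the insert statements are built. Here is a minimal sketch of the remaining pieces, under some loudly stated assumptions: the HitSql class and its method names are made up for this example, the bug id is passed as an int rather than the string that doc.Get() returns, and the final select assumes the simplified bugs table from earlier in this post with a bug_id column. It also truncates the snippet before doubling the single quotes, so the 3000-character cut can never land in the middle of an escaped pair of quotes.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helpers for building the batch of SQL; names are illustrative only.
public static class HitSql
{
    // Formats the score with "." as the decimal separator regardless of the server's culture.
    public static string FormatScore(float score)
    {
        return score.ToString(CultureInfo.InvariantCulture);
    }

    // Truncates first, then doubles single quotes, so the stored text never exceeds
    // maxLength and a doubled quote is never cut in half.
    public static string ToSqlLiteralBody(string text, int maxLength)
    {
        if (text.Length > maxLength)
        {
            text = text.Substring(0, maxLength);
        }
        return text.Replace("'", "''");
    }

    // Builds one insert statement for the temp table named "#" + guid.
    public static string BuildInsert(string guid, int bugId, float score, string snippet)
    {
        var sb = new StringBuilder();
        sb.Append("insert into #").Append(guid);
        sb.Append(" values(").Append(bugId.ToString(CultureInfo.InvariantCulture));
        sb.Append(",").Append(FormatScore(score));
        sb.Append(",N'").Append(ToSqlLiteralBody(snippet, 3000)).Append("')\n");
        return sb.ToString();
    }

    // Last statement of the batch: join the imported hits to the bugs table.
    // Assumes the simplified schema used in this post (bugs.bug_id).
    public static string BuildSelect(string guid)
    {
        return "select b.*, t.temp_score, t.temp_text from #" + guid
            + " t inner join bugs b on b.bug_id = t.temp_bg_id"
            + " order by t.temp_score desc\n";
    }
}
```

Appending BuildSelect(guid) to the end of the batch and running the whole string in one trip to the server would hand the existing results page an ordinary result set, best hit first.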
 

We're done.  I'd be very interested in your feedback.   Was my explanation here helpful to you?   Were my design choices stupid?   I'd like to hear from you.
 
