Vivek's blog - Roll your own autocomplete solution using Tries.

Vivek's blog - Roll your own autocomplete solution using Tries.

Roll your own autocomplete solution using Tries.

You might have come across many websites with autocomplete suggestions, most notably Google.

Adding such an option to your site or application might seem daunting but there is a very simple recursive data structure that solves the problem. There is a ton of literature on the net on how to do this using black box approaches like Lucene, Solr, Sphinx, Redis etc. But all these packages require a lot of configuration and you also lose flexibility. Tries can be implemented in a few lines of code in any language of your choice.

A trie is basically a tree, with each node representing a letter as illustrated in the figure above. Words are paths along this tree and the root node has no characters associated with it.

  1. The value of each node is the path or the character sequence leading upto it.
  2. The children at each node are ideally represented using a hash table mapping the next character to the child nodes.
  3. Its also useful to set a flag at every node to indicate whether a word/phrase ends there.

Now we can define methods to insert a string and to search for one.

       
       
class Trie ( object ):
     def __init__ ( self , value = None ):
         self . children = {}
         self . value = value
         self . flag = False # Flag to indicate that a word ends at this node
     def add ( self , char ):
         val = self . value + char if self . value else char
         self . children [ char ] = Trie ( val )
     def insert ( self , word ):
         node = self
         for char in word :
             if char not in node . children :
                 node . add ( char )
             node = node . children [ char ]
         node . flag = True
     def find ( self , word ):
         node = self
         for char in word :
             if char not in node . children :
                 return None
             node = node . children [ char ]
         return node . value
view raw trie.py This Gist brought to you by GitHub.
The insertion cost has a linear relationship with the string length. Now lets define the methods for listing all the strings which start with a certain prefix. The idea is to traverse down to the node representing the prefix and from there do a breadth first search of all the nodes that are descendants of the prefix node. We check if the node is at a word boundary from the flag we defined earlier and append the node’s value to the results. A caveat of the recursive approach, is you might run into stack depth problems when the maximum length of a string reaches tens of thousands of characters.

 

 

       
       
class Trie ( object ):
     ...
     def all_prefixes ( self ):
         results = set ()
         if self . flag :
             results . add ( self . value )
         if not self . children : return results
         return reduce ( lambda a , b : a | b ,
                      [ node . all_prefixes () for
                       node in self . children . values ()]) | results
     def autocomplete ( self , prefix ):
         node = self
         for char in prefix :
             if char not in node . children :
                 return set ()
             node = node . children [ char ]
         return node . all_prefixes ()
view raw
trie.py
This Gist brought to you by GitHub.

To delete an item,  first traverse down to the leaf of the keyword and then work backwards, checking for common paths.

And there you have it, an efficient way of retrieving strings with a common prefix in less than 40 lines of python code. I also wrote a C++ version using the STL data structures in about 90 lines, the time complexity for this version however is O(n log n) as the STL uses a red-black tree implementation for an associative array. Colin Dean has wrote a Ruby version and Marcus McCurdy, a Java one. You can find them all over here - https://github.com/vivekn/autocomplete. Read more about tries at Wikipedia.

Update: Thanks to bmahler on reddit for pointing out that the wordlist approach was unnecessary and space inefficient. It has been corrected.

Vivek's blog - Roll your own autocomplete solution using Tries.

Roll your own autocomplete solution using Tries.

You might have come across many websites with autocomplete suggestions, most notably Google.

Adding such an option to your site or application might seem daunting but there is a very simple recursive data structure that solves the problem. There is a ton of literature on the net on how to do this using black box approaches like Lucene, Solr, Sphinx, Redis etc. But all these packages require a lot of configuration and you also lose flexibility. Tries can be implemented in a few lines of code in any language of your choice.

A trie is basically a tree, with each node representing a letter as illustrated in the figure above. Words are paths along this tree and the root node has no characters associated with it.

  1. The value of each node is the path or the character sequence leading upto it.
  2. The children at each node are ideally represented using a hash table mapping the next character to the child nodes.
  3. Its also useful to set a flag at every node to indicate whether a word/phrase ends there.

Now we can define methods to insert a string and to search for one.

       
       
class Trie ( object ):
     def __init__ ( self , value = None ):
         self . children = {}
         self . value = value
         self . flag = False # Flag to indicate that a word ends at this node
     def add ( self , char ):
         val = self . value + char if self . value else char
         self . children [ char ] = Trie ( val )
     def insert ( self , word ):
         node = self
         for char in word :
             if char not in node . children :
                 node . add ( char )
             node = node . children [ char ]
         node . flag = True
     def find ( self , word ):
         node = self
         for char in word :
             if char not in node . children :
                 return None
             node = node . children [ char ]
         return node . value
view raw trie.py This Gist brought to you by GitHub.
The insertion cost has a linear relationship with the string length. Now lets define the methods for listing all the strings which start with a certain prefix. The idea is to traverse down to the node representing the prefix and from there do a breadth first search of all the nodes that are descendants of the prefix node. We check if the node is at a word boundary from the flag we defined earlier and append the node’s value to the results. A caveat of the recursive approach, is you might run into stack depth problems when the maximum length of a string reaches tens of thousands of characters.

 

       
       
class Trie ( object ):
     ...
     def all_prefixes ( self ):
         results = set ()
         if self . flag :
             results . add ( self . value )
         if not self . children : return results
         return reduce ( lambda a , b : a | b ,
                      [ node . all_prefixes () for
                       node in self . children . values ()]) | results
     def autocomplete ( self , prefix ):
         node = self
         for char in prefix :
             if char not in node . children :
                 return set ()
             node = node . children [ char ]
         return node . all_prefixes ()
view raw
trie.py
This Gist brought to you by GitHub.

To delete an item,  first traverse down to the leaf of the keyword and then work backwards, checking for common paths.

And there you have it, an efficient way of retrieving strings with a common prefix in less than 40 lines of python code. I also wrote a C++ version using the STL data structures in about 90 lines, the time complexity for this version however is O(n log n) as the STL uses a red-black tree implementation for an associative array. Colin Dean has wrote a Ruby version and Marcus McCurdy, a Java one. You can find them all over here - https://github.com/vivekn/autocomplete. Read more about tries at Wikipedia.

Update: Thanks to bmahler on reddit for pointing out that the wordlist approach was unnecessary and space inefficient. It has been corrected.

你可能感兴趣的:(autocomplete)