Creating a real-time search engine with IndexTank and Heroku

In 1998 I created my first search engine. It was very simple: it would crawl sites for mp3 files and generate a search index twice every hour. It took me about six weeks to develop, and I also had to buy a Pentium server to host it. I paid $125/month to have it co-located with an ISP.

Today you can do the same thing in a fraction of the time using existing cloud services. More importantly, you can get started for free and only spend money to grow the service once it gains traction. I will show you how to build a real-time search engine using IndexTank as the back end and Heroku as your application host.

The only requirement for this tutorial is to have a Heroku account with the capability of using add-ons (i.e. validated with your credit card). Of course, knowing Ruby and Rails will be helpful to understand what’s going on. Let’s do it!

First off, let’s choose something to search. Presumably we are interested in a real-time stream such as Twitter updates, blog content, etc. Any text stream with a public API will do. For this example I chose Plixi, a social photo-sharing application. My idea was to search the text associated with the pictures that people post to the site.

You can read all about the Plixi API in its documentation. For this tutorial we are interested in the JSON version of the real-time photo stream:

http://api.plixi.com/api/tpapi.svc/json/photos?getuser=true

Here’s a snippet of code to parse that stream and extract some useful fields. Try it out (make sure you have ‘json’ installed in your gems):

fetcher.rb

require 'rubygems'
require 'json'
require 'net/http'

plixi_url = 'http://api.plixi.com/api/tpapi.svc/json/photos?getuser=true'
photos = JSON.parse(Net::HTTP.get_response(URI.parse(plixi_url)).body)
count, list = photos['Count'], photos['List']
list.each_with_index do |p, i|
  u = p['User']
  # only want photos that come with some text
  if p.has_key?('Message')
    id = p['GdAlias']
    text = p['Message']
    timestamp = Integer(p['UploadDate'])
    screen_name = u['ScreenName']
    thumbnail_url = p['ThumbnailUrl']
    printf "%s,%s,%s\n", id, screen_name, text
  end
end
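
To experiment with the parsing logic without calling the Plixi API, the same field extraction can be run against a hand-written payload. This is only a sketch: the sample values are invented, the numeric format of UploadDate is an assumption, and extract_photos is a hypothetical helper name. Only the field names come from fetcher.rb.

```ruby
require 'json'

# Hypothetical sample payload. Values are made up; field names follow fetcher.rb.
SAMPLE = <<-PAYLOAD
{
  "Count": 2,
  "List": [
    {"GdAlias": "abc123", "Message": "sunset at the beach", "UploadDate": 1290000000,
     "ThumbnailUrl": "http://example.com/t/abc123.jpg", "User": {"ScreenName": "alice"}},
    {"GdAlias": "def456", "UploadDate": 1290000060,
     "ThumbnailUrl": "http://example.com/t/def456.jpg", "User": {"ScreenName": "bob"}}
  ]
}
PAYLOAD

# Returns one hash per photo that has a text message; photos without text are skipped.
def extract_photos(json_text)
  photos = JSON.parse(json_text)
  photos['List'].select { |p| p.key?('Message') }.map do |p|
    {
      :plixi_id      => p['GdAlias'],
      :text          => p['Message'],
      :timestamp     => Integer(p['UploadDate']),
      :screen_name   => p['User']['ScreenName'],
      :thumbnail_url => p['ThumbnailUrl']
    }
  end
end

extract_photos(SAMPLE).each do |doc|
  printf "%s,%s,%s\n", doc[:plixi_id], doc[:screen_name], doc[:text]
end
```

The second photo in the sample has no Message, so it is dropped, which mirrors the has_key? check in fetcher.rb.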

So we have an interesting data stream. How do we index it and search it? First, let’s create a new Rails app and associate it with a Heroku app. We are using Rails 3 here; the commands differ for older versions. Make sure the heroku gem is installed; if it is not, run $ sudo gem install heroku.

$ rails new plixidemo
$ cd plixidemo

Associate this Rails app with a Heroku app:

$ git init
$ heroku create

Request the IndexTank add-on for this app and install the IndexTank gem for local testing:

$ heroku addons:add indextank:trial
$ gem install indextank

We need to add the following lines to plixidemo/Gemfile so that our app can use the IndexTank client:

gem 'indextank', '1.0.10'
gem 'json_pure', '1.4.6', :require => 'json'

Let’s create two models under app/models. This will be the heart of our application. First, a photo:

app/models/photo.rb

class Photo
  def initialize(data)
    @data = data
  end

  def id
    @data['Id']
  end

  def screen_name
    self.user['ScreenName']
  end

  # Fields sent to IndexTank for this photo
  def to_document
    {
      :plixi_id      => self.id,
      :text          => self.message,
      :timestamp     => self.upload_date.to_i,
      :screen_name   => self.screen_name,
      :thumbnail_url => self.thumbnail_url
    }
  end

  # Maps snake_case method names to the CamelCase keys of the Plixi JSON,
  # e.g. photo.thumbnail_url reads @data['ThumbnailUrl']
  def method_missing(name, *args, &block)
    key = name.to_s.camelize
    if @data.key?(key)
      @data[key]
    else
      super
    end
  end
end
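
The method_missing trick is what lets to_document call message, upload_date and thumbnail_url without defining them. Here is a plain-Ruby sketch of that lookup that runs outside Rails. PhotoSketch and its camelize helper are hypothetical stand-ins (the helper is a simplified replacement for the ActiveSupport inflector), and the sample data is invented.

```ruby
# Plain-Ruby stand-in for the Photo model so it runs without Rails.
class PhotoSketch
  def initialize(data)
    @data = data
  end

  # Unknown methods are looked up as CamelCase keys in the JSON hash.
  def method_missing(name, *args, &block)
    key = camelize(name.to_s)
    @data.key?(key) ? @data[key] : super
  end

  # Keeps respond_to? consistent with method_missing.
  def respond_to_missing?(name, include_private = false)
    @data.key?(camelize(name.to_s)) || super
  end

  private

  # Simplified inflection: "thumbnail_url" -> "ThumbnailUrl"
  def camelize(snake)
    snake.split('_').map(&:capitalize).join
  end
end

photo = PhotoSketch.new(
  'Message'      => 'sunset at the beach',
  'ThumbnailUrl' => 'http://example.com/t/abc123.jpg',
  'User'         => { 'ScreenName' => 'alice' }
)

puts photo.message            # read from @data['Message']
puts photo.thumbnail_url      # read from @data['ThumbnailUrl']
puts photo.user['ScreenName'] # nested hashes are returned as-is
```

Calling a method with no matching key falls through to super and raises NoMethodError, so typos still fail loudly.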

Second, a searcher that knows how to communicate with the IndexTank API:

app/models/photo_searcher.rb

class PhotoSearcher
  # Memoized handle to the 'idx' index, so the client is only built once
  def self.index
    @api   ||= IndexTank::Client.new(ENV['INDEXTANK_API_URL'] || 'http://your_api_url')
    @index ||= @api.indexes('idx')
  end

  # Retrieve photos from IndexTank
  def self.search(query)
    index.search(query, :fetch => 'text,thumbnail_url,screen_name,plixi_id,timestamp')
  end
end

note: ‘your_api_url’ can be found on heroku.com:

My Apps -> [your app] -> Add-ons -> IndexTank Search

or, from the command line:

$ heroku config --long | grep INDEXTANK

And one controller, app/controllers/photos_controller.rb:

class PhotosController < ApplicationController
  def index
    @docs = PhotoSearcher.search(params[:query]) if params[:query].present?
  end
end

Of course, we need a view for our search page: app/views/photos/index.html.erb

<%= form_tag photos_path, :method => :get do %>
  <%= text_field_tag :query %>
  <button type="submit">Search</button>
<% end %>

<% if @docs %>
  <p id="result-count">Your search for "<%= params[:query] %>" returned <%= pluralize @docs['matches'], 'result' %></p>

  <ul id="results">
    <% @docs['results'].each do |doc| %>
      <li>
        <%= link_to "http://plixi.com/p/#{doc['plixi_id']}" do %>
          <%= image_tag doc['thumbnail_url'] %>
          <%= doc['screen_name'] %> -
          <%= time_ago_in_words Time.at(doc['timestamp'].to_i) %> ago

          <%= simple_format doc['text'] %>
        <% end %>
      </li>
    <% end %>
  </ul>
<% end %>
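
The view reads two keys from the search response: 'matches' for the total and 'results' for the list of documents, each carrying the fetched fields. To work on the page before any documents are indexed, a stub that returns the same shape can stand in for the real index. FakeIndex below is hypothetical and its sample data is invented; it does a naive substring match and says nothing about how IndexTank actually matches or ranks.

```ruby
# Hypothetical in-memory stand-in for an IndexTank index, for local UI work only.
class FakeIndex
  def initialize(docs)
    @docs = docs # array of hashes keyed by field name, like the fetched fields
  end

  # Returns the shape the view expects: {'matches' => n, 'results' => [...]}.
  # The options hash (e.g. :fetch) is accepted but ignored here.
  def search(query, options = {})
    hits = @docs.select { |d| d['text'].downcase.include?(query.downcase) }
    { 'matches' => hits.size, 'results' => hits }
  end
end

index = FakeIndex.new([
  { 'plixi_id' => 'abc123', 'text' => 'Sunset at the beach', 'screen_name' => 'alice',
    'thumbnail_url' => 'http://example.com/t/abc123.jpg', 'timestamp' => '1290000000' },
  { 'plixi_id' => 'def456', 'text' => 'Morning coffee', 'screen_name' => 'bob',
    'thumbnail_url' => 'http://example.com/t/def456.jpg', 'timestamp' => '1290000060' }
])

docs = index.search('sunset', :fetch => 'text,screen_name')
puts "#{docs['matches']} result(s)"
docs['results'].each { |doc| puts "#{doc['screen_name']}: #{doc['text']}" }
```

Because it answers search(query, options) the same way, such a stub could be returned from PhotoSearcher.index in development without touching the controller or the view.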

And add the following to your config/routes.rb:

resources :photos, :only => [:index]
root :to => 'photos#index'

Remember to remove public/index.html, then launch your app (for example with rails server). If you point your browser to http://localhost:3000, you should see a search box. Your index is empty, so all queries should return zero results.

It’s time to index some documents. Let’s go back to fetcher.rb from the beginning of the tutorial and add the following lines:

After the rubygems line:

require 'indextank'

After the plixi_url line:

api = IndexTank::Client.new(ENV['INDEXTANK_API_URL'] || '<API_URL>')
index = api.indexes 'idx'

[note: you can find out your <API_URL> by selecting IndexTank Search from the add-ons pull-down menu for your Heroku app]

And the most important line of all, after the printf statement:

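index.document(i.to_s).add({:plixi_id => id, :text => text, :timestamp => timestamp, :screen_name => screen_name, :thumbnail_url => thumbnail_url})

Note that the loop index i is used as the document id here, so each run of the script overwrites the documents added by the previous run. To accumulate documents across runs, use a stable value such as id instead.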

Now run fetcher.rb, and all the output you see will be indexed. Go back to your app and search for anything you just saw. It will show up in the search results right away!
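
Put together, the indexing step of fetcher.rb can be sketched as a single function. This version takes the parsed photo list and the index as arguments so it can be tried without network access. index_photos and RecordingIndex are hypothetical names; the only thing assumed about the index is the document(docid).add(fields) call used above.

```ruby
# Indexes every photo that has a message and returns how many were indexed.
# `index` is anything that responds to document(docid).add(fields).
def index_photos(list, index)
  indexed = 0
  list.each_with_index do |p, i|
    next unless p.key?('Message') # only want photos that come with some text
    fields = {
      :plixi_id      => p['GdAlias'],
      :text          => p['Message'],
      :timestamp     => Integer(p['UploadDate']),
      :screen_name   => p['User']['ScreenName'],
      :thumbnail_url => p['ThumbnailUrl']
    }
    index.document(i.to_s).add(fields) # same docid scheme as fetcher.rb
    indexed += 1
  end
  indexed
end

# Hypothetical recorder that mimics document(docid).add(fields) for dry runs.
class RecordingIndex
  attr_reader :added

  Document = Struct.new(:docid, :store) do
    def add(fields)
      store[docid] = fields
    end
  end

  def initialize
    @added = {}
  end

  def document(docid)
    Document.new(docid, @added)
  end
end

# Invented sample data; field names follow fetcher.rb.
list = [
  { 'GdAlias' => 'abc123', 'Message' => 'sunset', 'UploadDate' => 1290000000,
    'ThumbnailUrl' => 'http://example.com/t/abc123.jpg', 'User' => { 'ScreenName' => 'alice' } },
  { 'GdAlias' => 'def456', 'UploadDate' => 1290000060,
    'ThumbnailUrl' => 'http://example.com/t/def456.jpg', 'User' => { 'ScreenName' => 'bob' } }
]

recorder = RecordingIndex.new
puts "indexed #{index_photos(list, recorder)} photo(s)"
```

With the real client, the second argument would be the index object from api.indexes 'idx' and the first would be photos['List'].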

We are ready to upload this to Heroku.

$ git add .
$ git commit -a -m "Initial commit"
$ git push heroku master

You now have a real-time photo search engine running at [your_app].heroku.com! Next steps:

  • Change the fetcher to keep updating your index (remember that you can index up to 5000 documents). Be polite and do not hit Plixi more than a few times per minute!
  • Learn how to use our auto-complete service and build a pretty web app.
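
For the first of these steps, one possible shape for a polite updater is a loop that polls at a fixed interval and only passes along photos it has not seen before. This is a sketch under stated assumptions: poll_photos is a hypothetical helper, fetch_page stands for whatever wraps the HTTP request and JSON parsing, and the 30-second default is just one reading of "a few times per minute".

```ruby
require 'set'

# Polls fetch_page `rounds` times and yields each photo the first time it is seen.
# fetch_page: callable returning an array of photo hashes (hypothetical wrapper
#             around the Net::HTTP + JSON.parse call in fetcher.rb)
# interval:   seconds to wait between requests
# sleeper:    injectable so the loop can be exercised without real waiting
# Returns the number of distinct photos seen.
def poll_photos(fetch_page, rounds, interval = 30, sleeper = method(:sleep))
  seen = Set.new
  rounds.times do |n|
    fetch_page.call.each do |photo|
      next unless seen.add?(photo['GdAlias']) # add? returns nil for duplicates
      yield photo
    end
    sleeper.call(interval) unless n == rounds - 1 # no wait after the last round
  end
  seen.size
end

# Two invented pages that overlap on photo 'b', as consecutive polls of a stream would.
pages = [
  [{ 'GdAlias' => 'a' }, { 'GdAlias' => 'b' }],
  [{ 'GdAlias' => 'b' }, { 'GdAlias' => 'c' }]
]
fetch = lambda { pages.shift || [] }

total = poll_photos(fetch, 2, 0) { |photo| puts photo['GdAlias'] }
puts "#{total} new photos"
```

In a real fetcher the block would index the photo instead of printing it, and the loop would also need to respect the 5000-document limit mentioned above.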

Check out our demo app here:

http://plixitank.heroku.com/
