In 1998 I created my first search engine. It was very simple: it would crawl sites for mp3 files and generate a search index twice every hour. It took me about six weeks to develop, and I also had to buy a Pentium server to host it. I paid $125/month to have it co-located with an ISP.
Today you can do the same thing in a fraction of the time using existing cloud services. More importantly, you can start for free and spend money only once the service gains traction and needs to grow. I will show you how to build a real-time search engine using IndexTank as the back-end and Heroku as your application host.
The only requirement for this tutorial is to have a Heroku account with the capability of using add-ons (i.e. validated with your credit card). Of course, knowing Ruby and Rails will be helpful to understand what’s going on. Let’s do it!
First off, let’s choose something to search. Presumably we are interested in a real-time stream such as Twitter updates, blog content, etc. Any text stream with a public API will do. For this example I chose Plixi, a social photo sharing application. My idea was to search the text associated with pictures that people post to the site.
You can read all about the Plixi API here. For this tutorial we are interested in the JSON version of the real-time photo stream:
http://api.plixi.com/api/tpapi.svc/json/photos?getuser=true
Here’s a snippet of code to parse that stream and extract some useful fields. Try it out (make sure the ‘json’ gem is installed):
fetcher.rb
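The listing itself is not reproduced here, so what follows is only a minimal sketch of what such a fetcher might look like. The key names (`List`, `GdAlias`, `Message`, `ThumbnailUrl`, `User`, `ScreenName`) are assumptions about the shape of the Plixi response, not taken from the API documentation, so adjust them to whatever the stream actually returns. The helper names `extract_photos` and `fetch_and_print` are also made up for illustration.

```ruby
require 'rubygems'
require 'json'
require 'net/http'
require 'uri'

PLIXI_URL = 'http://api.plixi.com/api/tpapi.svc/json/photos?getuser=true'

# Turn the raw JSON text into an array of simple hashes.
# NOTE: every key name read from the response is an assumption.
def extract_photos(json_text)
  data = JSON.parse(json_text)
  (data['List'] || []).map do |item|
    {
      :id          => item['GdAlias'].to_s,
      :text        => item['Message'].to_s,
      :thumbnail   => item['ThumbnailUrl'].to_s,
      :screen_name => (item['User'] || {})['ScreenName'].to_s
    }
  end
end

# Fetch the live stream once and print one line per photo.
def fetch_and_print(url = PLIXI_URL)
  body = Net::HTTP.get(URI.parse(url))
  extract_photos(body).each do |photo|
    printf("%s (%s): %s\n", photo[:id], photo[:screen_name], photo[:text])
  end
end

# Call fetch_and_print to hit the live stream.
```

Keeping the parsing in its own method means it can be tried against a saved response without touching the network.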
So we have an interesting data stream. How do we index it and search it? First, let’s create a new Rails app (we are using Rails 3 for this; the commands differ for older versions) and associate it with a Heroku app. Make sure the heroku gem is installed; if it is not, run $ sudo gem install heroku.
$ rails new plixidemo
$ cd plixidemo
Associate this Rails app with a Heroku app
$ git init
$ heroku create
Request the IndexTank add-on for this app and download the IndexTank gem for local testing
$ heroku addons:add indextank:trial
$ gem install indextank
We need to add the following lines to plixidemo/Gemfile so that our app can use the IndexTank client:
gem 'indextank', '1.0.10'
gem 'json_pure', '1.4.6', :require => 'json'
Let’s create two models under app/models. This will be the heart of our application. First, a photo:
app/models/photo.rb
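As a rough idea of what the photo model could contain, here is a sketch written as a plain Ruby class rather than a database-backed model, on the assumption that the search index is the only data store. The attribute names and the `to_document` and `from_result` helpers are hypothetical, and the `'docid'` key is an assumption about how search results identify a document.

```ruby
# A plain Ruby object describing one photo.
# Attribute names are illustrative, not taken from the original listing.
class Photo
  attr_reader :id, :text, :thumbnail, :screen_name

  def initialize(attrs = {})
    @id          = attrs[:id].to_s
    @text        = attrs[:text].to_s
    @thumbnail   = attrs[:thumbnail].to_s
    @screen_name = attrs[:screen_name].to_s
  end

  # Fields to send to the search index for this photo.
  def to_document
    { :text => text, :thumbnail => thumbnail, :screen_name => screen_name }
  end

  # Build a Photo from one search result, assumed to be a hash with
  # string keys and the document id under 'docid'.
  def self.from_result(result)
    new(:id          => result['docid'],
        :text        => result['text'],
        :thumbnail   => result['thumbnail'],
        :screen_name => result['screen_name'])
  end
end
```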
Second, a searcher that knows how to communicate with the IndexTank API:
app/models/photo_searcher.rb
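A searcher along these lines might look like the sketch below. It assumes the index object responds to `search(query, options)` and returns a hash with `'matches'` and `'results'` keys, which is how I recall the IndexTank client behaving; check this against the gem version you installed. The index is passed in rather than built inside the class so that the logic can be tried without a network connection. The field names are hypothetical.

```ruby
# Wraps an index object and turns raw search responses into simple hashes.
class PhotoSearcher
  # Stored fields to ask the index to return (assumed names).
  FETCH_FIELDS = 'text,thumbnail,screen_name'

  def initialize(index)
    @index = index
  end

  # Returns { :matches => Integer, :photos => [Hash, ...] }.
  def search(query)
    # Skip the round trip for blank queries.
    if query.nil? || query.strip.empty?
      return { :matches => 0, :photos => [] }
    end

    response = @index.search(query, :fetch => FETCH_FIELDS)
    photos = (response['results'] || []).map do |result|
      {
        :id          => result['docid'].to_s,
        :text        => result['text'].to_s,
        :thumbnail   => result['thumbnail'].to_s,
        :screen_name => result['screen_name'].to_s
      }
    end
    { :matches => response['matches'].to_i, :photos => photos }
  end
end
```

In the app, the index would presumably come from something like `IndexTank::Client.new(your_api_url).indexes('idx')`, where the index name `'idx'` is a placeholder.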
note: ‘your_api_url’ can be found on heroku.com:
My Apps -> [your app] -> Add-ons -> IndexTank Search
or, from the command line:
$ heroku config --long|grep INDEXTANK
And one controller, app/controllers/photos_controller.rb
Of course, we need a view for our search page: app/views/photos/index.html.erb
And add the following to your config/routes.rb
Remember to remove public/index.html, then launch your app. Point your browser at http://localhost:3000 and you should see a search box. Your index is empty, so every query should return zero results.
It’s time to index some documents. Let’s go back to fetcher.rb from the beginning of the tutorial and add the following lines:
After rubygems:
After the plixi_url line:
[note: you can find out your <API_URL> by selecting IndexTank Search from the add-ons pull-down menu for your Heroku app]
And the most important line of all, after the printf statement:
Now run fetcher.rb, and everything you see in the output will be indexed. Go back to your app and search for anything you just saw. It will show up in the search results right away!
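The indexing step described above can be sketched as a small helper. It assumes the index object offers `document(docid).add(fields)`, which is how I recall the IndexTank client working; verify this against the gem you installed. The helper name `index_photos` and the field names are hypothetical.

```ruby
# Send each photo to the search index and return how many were indexed.
#
# index:  an object assumed to behave like an IndexTank index, i.e. it
#         responds to document(docid) and the result responds to add(fields)
# photos: an array of hashes with :id, :text, :thumbnail and :screen_name
def index_photos(index, photos)
  indexed = 0
  photos.each do |photo|
    docid = photo[:id].to_s
    next if docid.empty? # a document needs an id to be addressable

    fields = {
      :text        => photo[:text].to_s,
      :thumbnail   => photo[:thumbnail].to_s,
      :screen_name => photo[:screen_name].to_s
    }
    index.document(docid).add(fields)
    indexed += 1
  end
  indexed
end
```

In fetcher.rb the index would presumably be obtained once, near the top, with something like `IndexTank::Client.new('<API_URL>').indexes('idx')` after `require 'indextank'`. The index name `'idx'` is a placeholder.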
We are ready to upload this to Heroku.
$ git add .
$ git commit -a
$ git push heroku master
You now have a real-time photo search engine running at [your_app].heroku.com! Next steps:
Check out our demo app here:
http://plixitank.heroku.com/