For about two months, we’ve been working on a static website that exposes the results of a complicated economics model to non-economists. We decided to make the site static because of the overhead involved in computing the results and the proprietary nature of the model. We would simply pre-generate the output for all valid permutations of the inputs. The visitor could then choose her inputs from a questionnaire, click a button and immediately be shown the results.
The caveat of this decision is that in addition to the numerical outputs, three graphs and a summary (both in HTML and PDF) would need to be generated for each permutation. Since there were 3600 permutations, this would amount to 18000 files in total. Initial local runs of our generation process took about 30 seconds for each permutation, mostly due to embedding the graph images into the PDF. On a single machine, that would take 30 hours of uninterrupted processing! Clearly, this was a job for “the cloud”.
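The arithmetic behind those figures is quick to check. Here is a minimal sketch that uses only the numbers quoted above; the variable names are ours, chosen for illustration:

```ruby
# Back-of-the-envelope check of the workload, using the figures quoted above.
permutations          = 3600
files_per_permutation = 3 + 2   # three graphs, plus the summary as HTML and as PDF
seconds_per_run       = 30      # observed in our initial local runs

total_files = permutations * files_per_permutation
total_hours = permutations * seconds_per_run / 3600.0

puts "#{total_files} files, about #{total_hours} hours on one machine"
```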
Before we get into a discussion of the process of configuring and running the jobs, here’s an overview of the tools we used to tackle the problem.
We initially considered using Amazon’s Elastic MapReduce to run the generation jobs, but it requires Java and Hadoop, and we had already invested a lot of time in our Ruby tool chain. It is nigh impossible to automatically install Ruby and ImageMagick on an EMR node. Thus, we decided to use vanilla EC2 with the tools shown below.
Prawn is the new kid in town for generating PDFs in Ruby. Prawn is pretty well-written and easy to start using, and greatly improves on PDF::Writer.
Gruff was not the most obvious choice for this project. We liked the flexibility and hackability of Scruffy, but translating its output to PDF was a nightmare and there were some strange inconsistencies in it. In the end, Gruff proved fast, reliable, and simple. The major caveat, as described above, is that embedding images in Prawn is orders of magnitude slower than simply drawing on the canvas.
Haml has been around for 3 years now. Many people cringe at the indentation-sensitive syntax, but it prevents so much frustration that it was a good fit for the project. Naturally, we also used its cousin Sass, and the new-ish CSS/Sass meta-framework Compass. The combination of these three made it really quick to get started with the static site and make design changes as we iterated.
You may have already heard of the awesome configuration management tool, Chef. Chef allows you to ensure consistent configuration of your servers using a nice Ruby DSL and a huge library of community-developed “cookbooks” that covers many common use-cases. We were given the chance to try out an alpha of their “Chef Platform”, which is essentially a scalable, hosted, multi-tenant version of the server component of Chef and uses the pre-release version of Chef 0.8. With that, “knife”–the new CLI tool for interacting with the Chef server API–and the custom Opscode AMI, we were well-equipped to quickly deploy a bunch of EC2 nodes. We’ll talk more about the details of the Chef recipes below.
What’s the best way to distribute a bunch of one-time jobs to a slew of independent machines? A message queue, of course! Despite the version packaged with Ubuntu 9.04 being pretty old, we chose RabbitMQ, having used it on another project. AMQP is also well supported in Ruby.
The first step to start our processing job was to get the data up to S3. You could do this any number of ways, but we created a bucket solely for the data and uploaded all 3600 CSV files with a desktop client.
Next, we created the scripts for the workers and the job initiator. We would potentially need to run the process multiple times, so we chose Aman Gupta’s EventMachine-based AMQP client.
Here’s the worker script, which was set up as a daemon using runit:
#!/usr/bin/env ruby
$: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'lib'))

require 'rubygems'
require 'eventmachine'
require 'mq'
require 'custom_libraries'

Signal.trap('INT') { AMQP.stop{ EM.stop } }
Signal.trap('TERM') { AMQP.stop{ EM.stop } }

AMQP.start(:host => ARGV.shift) do
  MQ.prefetch(1)
  MQ.queue('jobs').bind(MQ.direct('jobs')).subscribe do |header, body|
    GenerationJob.new(body).generate
  end
end
Basically, it connects to the RabbitMQ host specified on the command line, subscribes to the job queue, and starts processing messages.
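The pattern at work is several consumers competing for messages on one shared queue, each taking a single job at a time. As a rough in-process analogy only (threads and Ruby’s standard Queue instead of separate machines and RabbitMQ, with made-up file names), it can be sketched like this:

```ruby
require 'thread'

# In-process stand-in for the 'jobs' queue. Each worker thread takes one
# message at a time, loosely like a consumer with a prefetch count of 1.
jobs = Queue.new
10.times { |i| jobs << "dataset-#{i}.csv" }   # hypothetical file names

results = Queue.new
worker_count = 3

workers = worker_count.times.map do |id|
  Thread.new do
    loop do
      begin
        file = jobs.pop(true)   # non-blocking pop; raises ThreadError when empty
      rescue ThreadError
        break                   # nothing left to do, so this worker exits
      end
      results << [id, file]     # record which worker handled which file
    end
  end
end
workers.each(&:join)

processed = []
processed << results.pop until results.empty?
puts "#{processed.size} jobs handled by #{processed.map(&:first).uniq.size} worker(s)"
```

How the jobs are split between workers varies from run to run; the point is only that every job is handled exactly once, and a slow job never holds up the jobs behind it.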
The job initiation script is almost as simple:
#!/usr/bin/env ruby
$: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'lib'))

require 'rubygems'
require 'eventmachine'
require 'mq'

AWSID = (ENV['AMAZON_ACCESS_KEY_ID'] || 'XXXXXXXXXXXXXXXXXXXX')
AWSKEY = (ENV['AMAZON_SECRET_ACCESS_KEY'] || 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXX')

Signal.trap('INT') { AMQP.stop{ EM.stop } }
Signal.trap('TERM') { AMQP.stop{ EM.stop } }

host = ARGV.shift
input_bucket = "custom-data"
output_bucket = "custom-output"
output_prefix = Time.now.strftime("/%Y%m%d%H%M%S")
count = 0

AMQP.start(:host => host) do
  exchange = MQ.direct('jobs')
  STDIN.each_line do |file|
    count += 1
    $stdout.print "."; $stdout.flush
    payload = {
      :input => [input_bucket, file.strip],
      :output => [output_bucket, output_prefix],
      :s3id => AWSID,
      :s3key => AWSKEY
    }
    exchange.publish(Marshal.dump(payload))
  end
  AMQP.stop { EM.stop }
end

puts "#{count} data enqueued for generation."
The script reads from STDIN the names of the data files stored in the S3 bucket and enqueues one message per file. Before running the job, we created a text file listing each of the 3600 files, one per line, which could then be piped to the script on the command line. Each message carries everything a worker needs: where to find the input data, where to put the output when it is finished, and the S3 credentials. We scoped the output by the time the job was enqueued, making it easier to tell older runs from newer ones.
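To make the message format concrete, here is a small sketch of the round trip with no queue involved. The manifest contents and credentials are placeholders, and since GenerationJob’s internals aren’t shown in this post, the unpacking step at the end is only our guess at what a worker would do first:

```ruby
require 'stringio'

# Stand-ins for the real inputs: a three-line manifest and dummy credentials.
manifest = StringIO.new("a.csv\nb.csv\nc.csv\n")   # hypothetical file names
output_prefix = Time.now.strftime("/%Y%m%d%H%M%S")

messages = manifest.each_line.map do |file|
  payload = {
    :input  => ["custom-data", file.strip],
    :output => ["custom-output", output_prefix],
    :s3id   => "XXXX",
    :s3key  => "XXXX"
  }
  Marshal.dump(payload)   # the same serialization the initiator script uses
end

# Assumed first step on the worker side: turn the message body back into a hash.
job = Marshal.load(messages.first)
bucket, key = job[:input]
puts "fetch #{key} from #{bucket}, write under #{job[:output].join}"
```

Marshal is convenient when both ends are Ruby, but it should only ever be loaded from a queue you control, since unmarshaling untrusted data is unsafe.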
Now that the meat of the job was ready, we dived into configuring the servers with Chef. We created a Chef repository, added the Opscode cookbooks as a submodule, and uploaded these default cookbooks to the server:
We created some additional cookbooks to fill out the generic setup:
Lastly we created our custom cookbook, which sets up all the libraries we need, downloads the code, and sets up the worker process as a runit service. Let’s walk through the default recipe in that cookbook:
%w{haml gruff fastercsv activesupport prawn prawn-core prawn-format prawn-layout eventmachine amqp aws-s3}.each do |g|
  gem_package g
end
This simply installs all of the gems that we need to run the job.
# Find the node that has the job queue
q = search(:node, "run_list:role*job_queue*")[0].first
Here we use Chef’s search feature to find the node that has RabbitMQ installed and running so we can pass it to the worker script.
# Create directory to put the code in
directory "/srv"

# Unzip the code if necessary
execute "Unpack code" do
  command "tar xzf generationjobs.tar.gz"
  cwd "/srv"
  action :nothing
end

# Download the code
remote_file "/srv/generationjobs.tar.gz" do
  source "generationjobs.tar.gz"
  notifies :run, resources(:execute => "Unpack code"), :immediate
end

# Create the directory where output goes
directory "/srv/generationjobs/tmp" do
  recursive true
end
In these four resources, we set up the working directory for the worker process, download the project code (stored on the Chef server as a tarball), and unpack it. The interesting thing about this sequence is that we don’t automatically unpack the tarball. Since the Chef client runs periodically in the background, we don’t want to be unpacking the code every time, but only when it has changed. We use an immediate notification from the remote_file resource to tell the unpacking to run when the tarball is a new version; remote_file won’t download the tarball unless the file checksum has changed.
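The “only act when the file changed” idea is easy to see in plain Ruby. The sketch below mimics the behaviour with a checksum comparison; it is not Chef’s implementation, and sync_file is a hypothetical helper we made up for the illustration:

```ruby
require 'digest/sha2'
require 'tmpdir'

# Write new_content to path only if its checksum differs from what is on disk.
# Returns true when the file changed, i.e. when a "notification" should fire.
def sync_file(path, new_content)
  current = File.exist?(path) ? Digest::SHA256.file(path).hexdigest : nil
  return false if current == Digest::SHA256.hexdigest(new_content)
  File.open(path, 'wb') { |f| f.write(new_content) }
  true
end

unpack_runs = 0
Dir.mktmpdir do |dir|
  tarball = File.join(dir, 'generationjobs.tar.gz')
  # Three simulated Chef client runs; the "tarball" only changes on the third.
  ['v1', 'v1', 'v2'].each do |content|
    unpack_runs += 1 if sync_file(tarball, content)
  end
end

puts "unpacked #{unpack_runs} times in 3 runs"
```

The second run finds an identical checksum and does nothing, which is exactly why the unpack step can safely default to doing nothing until it is notified.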
# Create runit service for worker
runit_service "generationworker" do
  options({:worker_bin => "/srv/generationjobs/bin/worker", :queue_host => q})
  only_if { q }
end
The last step is a pseudo-resource defined in the “runit” cookbook that creates all the pieces of a runit daemon for you; we only had to create the configuration templates for the daemon and put them in our cookbook. The additional options passed to runit_service tell the templates the location of the worker code and the RabbitMQ host. We also take advantage of the only_if option so the service won’t be created if there’s no host with RabbitMQ on it yet.
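For a feel of what such a template does, here is a sketch that renders a run script with Ruby’s standard ERB library. The template text is hypothetical rather than our actual file, the host name is a placeholder, and exposing the options to the template as @options is an assumption about how the runit cookbook passes them along:

```ruby
require 'erb'

# Hypothetical contents of the run-script template for the worker service.
template = <<-'TEMPLATE'
#!/bin/sh
exec 2>&1
exec <%= @options[:worker_bin] %> <%= @options[:queue_host] %>
TEMPLATE

# Mirrors the options given to runit_service; the host name is a placeholder.
@options = {
  :worker_bin => "/srv/generationjobs/bin/worker",
  :queue_host => "queue.example.internal"
}

run_script = ERB.new(template).result(binding)
puts run_script
```

The rendered script simply starts the worker with the queue host as its only argument, which matches how the worker script reads the host from ARGV.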
The last step in the Chef configuration was to create two roles, one for the queue and one for the worker. Naturally, the node that has the queue can also act as a worker. Here’s what the role JSON documents look like:
// The queue role
{
  "name": "job_queue",
  "chef_type": "role",
  "json_class": "Chef::Role",
  "default_attributes": {},
  "description": "Provides a message queue for sending jobs out to the workers.",
  "recipes": [
    "erlang",
    "rabbitmq"
  ],
  "override_attributes": {}
}
// The worker role
{
  "name": "job_worker",
  "chef_type": "role",
  "json_class": "Chef::Role",
  "default_attributes": {},
  "description": "Processes the data from a queue into the PDF, PNG and HTML output.",
  "recipes": [
    "apt",
    "build-essential",
    "ruby",
    "gemcutter",
    "imagemagick::rmagick",
    "runit",
    "custom"
  ],
  "override_attributes": {}
}
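Since a node can carry both roles, it helps to see how a run list of roles flattens into recipes. This is a simplified illustration and not Chef’s own expansion code; the roles hash just restates the two documents above, and expand is a name we made up:

```ruby
# Recipe lists copied from the two role documents.
roles = {
  "job_queue"  => %w{erlang rabbitmq},
  "job_worker" => %w{apt build-essential ruby gemcutter imagemagick::rmagick runit custom}
}

# Pull each role name out of a run list string and concatenate its recipes in order.
expand = lambda do |run_list|
  run_list.scan(/role\[(\w+)\]/).flatten.flat_map { |name| roles.fetch(name) }
end

queue_node  = expand.call("role[job_queue] role[job_worker]")
worker_node = expand.call("role[job_worker]")

puts "queue node:  #{queue_node.join(', ')}"
puts "worker node: #{worker_node.join(', ')}"
```

The queue node ends up with the RabbitMQ recipes followed by everything a worker has, which is why it can process jobs alongside the others.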
Now comes the fun (and easy) part! Armed with an AWS account, an EC2 certificate, and knife, we began firing up nodes to run the job. With Opscode’s preconfigured Chef AMI, you can pass a JSON node configuration in the EC2 initial data. First we generated the configuration for the job queue node:
$ knife instance_data --run-list="role[job_queue] role[job_worker]" | pbcopy
With the JSON configuration in the clipboard, we could paste it into ElasticFox (or the AWS Management Console) and fire up the first EC2 node. Several minutes later, the node was ready to go. Next, we created a similar configuration, but with only the worker role:
$ knife instance_data --run-list="role[job_worker]" | pbcopy
Then we fired up nine of the nodes with that configuration and proceeded to initiate the job:
$ ssh -i ~/ec2-keys/my-ec2-cert.pem root@ec2-public-hostname
[root@ec2-public-hostname]$ cd /srv/generationjobs
[root@ec2-public-hostname]$ bin/startjobs localhost < manifest.txt
After all the preparation, that’s all there was to it! A little over an hour later, we had generated PNG graphs, PDF, and HTML from all 3600 datasets.
It’s no mystery why “cloud computing” is so popular. The ability to quickly and cheaply access computational power, utilize it, and then dispose of it is really appealing, and tools like Chef and EC2 make it really easy to accomplish. What can you cook up?