We have been doing realtime processing for a long time at BackType. We recently developed Storm, a new system for realtime processing, to replace our old system of queues and workers. Storm is a distributed, reliable, and fault-tolerant stream processing system. Its use cases are so broad that we consider it a fundamental new primitive for data processing. That's why we call it the Hadoop of realtime: it does for realtime processing what Hadoop does for batch processing. We are planning to open-source Storm sometime in the next few months.
Like most people doing realtime processing, we built our old system as a complicated graph of queues and workers. Worker processes would take messages off a queue and then update a database, fire off new messages to other queues, or both.
There was a lot of pain in doing realtime processing this way. We found that we spent most of our time worrying about where to send messages, where to receive messages from, deploying workers, and deploying queues. Worst of all, the system wasn't fault tolerant: we had to make sure our queues and workers stayed up.
Storm solves these issues completely. It abstracts the message passing away, automatically parallelizes the stream computation on a cluster of machines, and lets you focus on your realtime processing logic. Even more interesting, Storm enables a whole new range of applications we didn't anticipate when we initially designed it.
Properties of Storm
Here are the key properties of Storm:
1. Simple programming model: Just as MapReduce dramatically lowers the complexity of parallel batch processing, Storm's programming model dramatically lowers the complexity of realtime processing.
2. Runs any programming language: Even though Storm runs on the JVM (and is written in Clojure), you can use any programming language on top of Storm. We've added support for Ruby and Python, and support can easily be added for any language -- all you need to do is write a ~100-line library that implements a simple communication protocol with Storm.
3. Fault-tolerant: To launch a processing topology on Storm, all you have to do is provide a jar containing all your code. Storm then distributes that jar, assigns workers across the cluster to execute the topology, monitors the topology, and automatically reassigns workers that go down.
4. Horizontally scalable: All computations are done in parallel. To scale a realtime computation, all you have to do is add more machines and Storm takes care of the rest.
5. Reliable: Storm guarantees that each message will be fully processed at least once. As long as there are no errors, each message is processed exactly once; after a failure, a message may be replayed and processed more than once.
6. Fast: Storm is built with speed in mind. ZeroMQ is used for the underlying message passing, and care has been taken so that messages are processed extremely quickly.
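To make the at-least-once guarantee in item 5 concrete, here is a minimal Python sketch of the general idea: a message is replayed until its handler succeeds, so a failure can lead to the same message being delivered twice. This illustrates the semantics only and is not how Storm implements reliability; `deliver_at_least_once` and `make_flaky_handler` are hypothetical names invented for this example.

```python
def deliver_at_least_once(messages, handler, max_attempts=3):
    """Replay each message until the handler succeeds (hypothetical helper).

    Returns the full delivery log so duplicates are visible.
    """
    deliveries = []
    for message in messages:
        for _ in range(max_attempts):
            deliveries.append(message)
            try:
                handler(message)
            except Exception:
                continue  # not acknowledged: replay the message
            break  # acknowledged: move on to the next message
        else:
            raise RuntimeError(f"gave up on {message!r}")
    return deliveries


def make_flaky_handler(fail_once_on):
    """Build a handler that fails the first time it sees one message."""
    already_failed = set()

    def handler(message):
        if message == fail_once_on and message not in already_failed:
            already_failed.add(message)
            raise IOError("simulated failure")

    return handler


deliveries = deliver_at_least_once(["a", "b", "c"], make_flaky_handler("b"))
print(deliveries)  # ['a', 'b', 'b', 'c']
```

Message "b" shows up twice in the delivery log because its first attempt failed. Processing logic that runs under at-least-once delivery therefore has to tolerate an occasional replay.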
Use cases for Storm
There are three broad use cases for Storm:
1. Stream processing: This is the traditional realtime processing use case: process messages and update a variety of databases.
2. Continuous computation: Storm can be used to do a continuous computation and stream out the results as they're computed. For example, we used Storm the other day to compute trending users on Twitter from the Twitter firehose. Every second, Storm streams out the 50 users with the most retweets in the last few minutes with perfect accuracy. We stream this information directly into a webpage which visualizes and animates the trending users in realtime.
3. Distributed RPC: Distributed RPC is perhaps the most unexpected and most compelling use case for Storm. There are a lot of queries that are both hard to precompute and too intense to compute on the fly on a single machine. Traditionally, you have to do an approximation of some sort to lower the cost of a query like this. Storm lets you parallelize an intense query so that you can compute it in realtime.
An example of a query that is only possible with distributed RPC is "reach": computing the number of unique people exposed to a URL on Twitter. To compute reach, you need to get all the people who tweeted the URL, get all the followers of all those people, deduplicate that set of followers, and then count the unique followers. It's an intense computation that potentially involves thousands of database calls and tens of millions of follower records. It can take minutes or worse to compute on a single machine. With Storm, you can do every step of the reach computation in parallel and compute reach for any URL in seconds (and less than a second for most URLs).
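The steps of the reach computation can be sketched in a few lines of Python. This is only an illustration of the logic on a single machine, not Storm code: the in-memory dictionaries stand in for the database calls, and `get_tweeters`, `get_followers`, and `reach` are hypothetical names. A thread pool fans out the follower lookups to hint at where the parallelism comes from.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the real database lookups.
TWEETERS_BY_URL = {"http://example.com/post": ["alice", "bob"]}
FOLLOWERS_BY_USER = {
    "alice": ["carol", "dave"],
    "bob": ["dave", "erin"],
}


def get_tweeters(url):
    """Everyone who tweeted the URL."""
    return TWEETERS_BY_URL.get(url, [])


def get_followers(user):
    """All followers of one user."""
    return FOLLOWERS_BY_USER.get(user, [])


def reach(url, max_workers=4):
    """Count the unique followers of everyone who tweeted the URL."""
    tweeters = get_tweeters(url)
    # Fan out the follower lookups, one per tweeter.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        follower_lists = list(pool.map(get_followers, tweeters))
    # Deduplicate followers across all tweeters, then count them.
    unique_followers = set()
    for followers in follower_lists:
        unique_followers.update(followers)
    return len(unique_followers)


print(reach("http://example.com/post"))  # 3
```

Here "dave" follows both tweeters but is counted once, so the reach is 3. On real data the follower lookups and the deduplication dominate the cost, and those are the steps that benefit from being spread across a cluster.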
The idea behind distributed RPC is that you run a processing topology on Storm that implements the RPC function and waits for RPC invocations. An RPC invocation is a message containing the parameters of the RPC request and information about where Storm should send the results. The topology picks up messages, computes the RPC call in parallel, and returns the results to the return address.
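The shape of an RPC invocation can be sketched with Python's standard `queue` and `threading` modules. This is a toy, single-process stand-in under stated assumptions: the `serve` function plays the role of the topology, the `"args"` and `"return_to"` keys are a hypothetical message layout, and the function is computed on one thread rather than in parallel across a cluster.

```python
import queue
import threading


def serve(requests, function):
    """Hypothetical stand-in for a topology waiting for RPC invocations."""
    while True:
        invocation = requests.get()
        if invocation is None:  # shutdown signal for this sketch
            break
        result = function(*invocation["args"])
        # Send the result to the return address carried in the message.
        invocation["return_to"].put(result)


requests = queue.Queue()
worker = threading.Thread(target=serve, args=(requests, lambda a, b: a + b))
worker.start()

# An invocation carries the parameters plus where to send the result.
reply = queue.Queue()
requests.put({"args": (2, 3), "return_to": reply})
result = reply.get(timeout=5)
print(result)  # 5

requests.put(None)
worker.join()
```

The caller never talks to the worker directly. It only needs to know where to drop the invocation and where to wait for the answer, which is the property that lets the computation behind the function be spread over many machines.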
Summary
Storm is already enormously useful for us at BackType. It reduces a ton of complexity in our realtime processing and lets us do things we couldn't do before. We look forward to the day we open source it.
You should follow the BackType tech team on Twitter here.