[Stream Computing] Twitter Storm Rationale[1]





The past decade has seen a revolution in data processing. MapReduce, Hadoop, and related technologies have made it possible to store and process data at scales previously unthinkable. Unfortunately, these data processing technologies are not realtime systems, nor are they meant to be. There's no hack that will turn Hadoop into a realtime system; realtime data processing has a fundamentally different set of requirements than batch processing.


However, realtime data processing at massive scale is becoming more and more of a requirement for businesses. The lack of a "Hadoop of realtime" has become the biggest hole in the data processing ecosystem.


Storm fills that hole.


Before Storm, you would typically have to manually build a network of queues and workers to do realtime processing. Workers would process messages off a queue, update databases, and send new messages to other queues for further processing. Unfortunately, this approach has serious limitations:


  1. Tedious: You spend most of your development time configuring where to send messages, deploying workers, and deploying intermediate queues. The realtime processing logic that you care about corresponds to a relatively small percentage of your codebase.


  2. Brittle: There's little fault-tolerance. You're responsible for keeping each worker and queue up.


  3. Painful to scale: When the message throughput gets too high for a single worker or queue, you need to partition how the data is spread around. You need to reconfigure the other workers to know the new locations to send messages. This introduces moving parts and new pieces that can fail.
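
To make the queues-and-workers pattern concrete, here is a minimal single-process sketch in Python. It is only an illustration of the hand-wired approach described above, not code from Storm or any real deployment: the names `run_worker`, `run_pipeline`, and `STOP` are hypothetical, threads stand in for worker processes, `queue.Queue` stands in for a message broker, and a plain dict stands in for the database.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel telling a worker to shut down


def run_worker(inbox, outbox, handle):
    """Pull messages off `inbox`, process them, and forward results to `outbox`."""
    while True:
        msg = inbox.get()
        if msg is STOP:
            if outbox is not None:
                outbox.put(STOP)  # propagate shutdown to the next stage
            break
        result = handle(msg)
        if outbox is not None:
            outbox.put(result)


def run_pipeline(messages):
    """Wire two workers together by hand: parse -> store."""
    raw, parsed = queue.Queue(), queue.Queue()
    counts = {}  # stands in for the database

    def parse(line):
        return line.strip().lower()

    def store(word):
        counts[word] = counts.get(word, 0) + 1

    # Every routing decision (who reads which queue) is spelled out manually.
    workers = [
        threading.Thread(target=run_worker, args=(raw, parsed, parse)),
        threading.Thread(target=run_worker, args=(parsed, None, store)),
    ]
    for w in workers:
        w.start()
    for m in messages:
        raw.put(m)
    raw.put(STOP)
    for w in workers:
        w.join()
    return counts
```

Even in this toy version, most of the code is wiring and shutdown handling rather than processing logic, and nothing restarts a worker whose handler raises an exception. Adding a second `parse` worker would also mean rethinking how the shutdown sentinel and the messages are distributed.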


Although the queues and workers paradigm breaks down for large numbers of messages, message processing is clearly the fundamental paradigm for realtime computation. The question is: how do you do it in a way that doesn't lose data, scales to huge volumes of messages, and is dead-simple to use and operate?


Storm satisfies these goals.


Why Storm is important

Storm exposes a set of primitives for doing realtime computation. Just as MapReduce greatly eases the writing of parallel batch processing, Storm's primitives greatly ease the writing of parallel realtime computation.


The key properties of Storm are:

  1. Extremely broad set of use cases: Storm can be used for processing messages and updating databases (stream processing), doing a continuous query on data streams and streaming the results into clients (continuous computation), parallelizing an intense query like a search query on the fly (distributed RPC), and more. Storm's small set of primitives satisfies a stunning number of use cases.


  2. Scalable: Storm scales to massive numbers of messages per second. To scale a topology, all you have to do is add machines and increase the parallelism settings of the topology. As an example of Storm's scale, one of Storm's initial applications processed 1,000,000 messages per second on a 10-node cluster, including hundreds of database calls per second as part of the topology. Storm's use of ZooKeeper for cluster coordination makes it scale to much larger cluster sizes.


  3. Guarantees no data loss: A realtime system must have strong guarantees about data being successfully processed. A system that drops data has a very limited set of use cases. Storm guarantees that every message will be processed, and this is in direct contrast with other systems like S4.


  4. Extremely robust: Unlike systems like Hadoop, which are notorious for being difficult to manage, Storm clusters just work. It is an explicit goal of the Storm project to make the user experience of managing Storm clusters as painless as possible.


  5. Fault-tolerant: If there are faults during execution of your computation, Storm will reassign tasks as necessary. Storm makes sure that a computation can run forever (or until you kill the computation).


  6. Programming language agnostic: Robust and scalable realtime processing shouldn't be limited to a single platform. Storm topologies and processing components can be defined in any language, making Storm accessible to nearly anyone.

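
The "no data loss" property in the list above can be pictured with a toy model of replay-on-failure. The Python sketch below is an assumption-laden illustration, not Storm's API or its internal mechanism: `process_with_replay`, `flaky_double`, and `max_attempts` are hypothetical names, everything runs in one process, and a handler returning normally is treated as the acknowledgement. It shows the general idea that a message stays pending until it has been processed successfully, so a transient failure leads to a retry instead of a dropped message.

```python
from collections import deque


def process_with_replay(messages, handler, max_attempts=3):
    """Deliver each message until `handler` succeeds, up to `max_attempts` tries.

    Returns (results, failed): results maps message id -> handler output,
    failed lists the ids that never succeeded.
    """
    pending = deque((msg_id, msg, 1) for msg_id, msg in enumerate(messages))
    results = {}
    failed = []
    while pending:
        msg_id, msg, attempt = pending.popleft()
        try:
            results[msg_id] = handler(msg)  # a normal return acts as the "ack"
        except Exception:
            if attempt < max_attempts:
                pending.append((msg_id, msg, attempt + 1))  # replay later
            else:
                failed.append(msg_id)  # give up after max_attempts tries
    return results, failed


seen = set()


def flaky_double(x):
    """Hypothetical handler: fails the first time it sees each odd number."""
    if x % 2 and x not in seen:
        seen.add(x)
        raise RuntimeError("transient failure")
    return 2 * x


results, failed = process_with_replay([1, 2, 3], flaky_double)
```

Note that replay gives at-least-once behavior: a handler with side effects may run more than once for the same message, so in a scheme like this the handler should be safe to repeat.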





