[Stream Computing] Twitter Storm Rationale [1]

Source: https://github.com/nathanmarz/storm/wiki/Rationale


Translated 2012-04-14

Rationale


The past decade has seen a revolution in data processing. MapReduce, Hadoop, and related technologies have made it possible to store and process data at scales previously unthinkable. Unfortunately, these data processing technologies are not realtime systems, nor are they meant to be. There's no hack that will turn Hadoop into a realtime system; realtime data processing has a fundamentally different set of requirements than batch processing.

However, realtime data processing at massive scale is becoming more and more of a requirement for businesses. The lack of a "Hadoop of realtime" has become the biggest hole in the data processing ecosystem.

Storm fills that hole.

Before Storm, you would typically have to manually build a network of queues and workers to do realtime processing. Workers would process messages off a queue, update databases, and send new messages to other queues for further processing. Unfortunately, this approach has serious limitations:

  1. Tedious: You spend most of your development time configuring where to send messages, deploying workers, and deploying intermediate queues. The realtime processing logic that you care about corresponds to a relatively small percentage of your codebase.

  2. Brittle: There's little fault-tolerance. You're responsible for keeping each worker and queue up.

  3. Painful to scale: When the message throughput gets too high for a single worker or queue, you need to partition how the data is spread around. You need to reconfigure the other workers to know the new locations to send messages. This introduces moving parts and new pieces that can fail.
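The hand-built approach described above can be sketched in a few dozen lines. This is only an illustration under stated assumptions: it uses in-process threads and `queue.Queue` in place of real message brokers and machines, and the names `splitter`, `counter`, `partition_for`, `run_pipeline`, and `NUM_COUNTERS` are hypothetical. It counts words in two stages, with the second stage partitioned by a hash of the word.

```python
import queue
import threading
import zlib

NUM_COUNTERS = 2  # assumed number of partitions for the counting stage


def partition_for(key: str, n: int) -> int:
    """Pick a stable partition for a key (simple hash partitioning)."""
    return zlib.crc32(key.encode("utf-8")) % n


def splitter(inbox, outboxes):
    """Stage 1: split sentences into words and route each word to a counter queue."""
    while True:
        sentence = inbox.get()
        if sentence is None:  # sentinel: stop and tell every downstream worker
            for q in outboxes:
                q.put(None)
            return
        for word in sentence.split():
            outboxes[partition_for(word, len(outboxes))].put(word)


def counter(inbox, counts):
    """Stage 2: count the words routed to this partition."""
    while True:
        word = inbox.get()
        if word is None:
            return
        counts[word] = counts.get(word, 0) + 1


def run_pipeline(sentences):
    # Wiring: every queue, worker, and route is set up by hand.
    sentence_q = queue.Queue()
    word_qs = [queue.Queue() for _ in range(NUM_COUNTERS)]
    partials = [{} for _ in range(NUM_COUNTERS)]
    workers = [threading.Thread(target=splitter, args=(sentence_q, word_qs))]
    workers += [
        threading.Thread(target=counter, args=(q, c))
        for q, c in zip(word_qs, partials)
    ]
    for w in workers:
        w.start()
    for s in sentences:
        sentence_q.put(s)
    sentence_q.put(None)
    for w in workers:
        w.join()
    # Each word always lands in the same partition, so the partial counts are disjoint.
    merged = {}
    for part in partials:
        merged.update(part)
    return merged
```

Only the bodies of `splitter` and `counter` are processing logic; most of the rest is wiring. Changing the number of counting workers means every upstream worker has to learn the new routing, and nothing here restarts a worker that dies.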

Although the queues and workers paradigm breaks down for large numbers of messages, message processing is clearly the fundamental paradigm for realtime computation. The question is: how do you do it in a way that doesn't lose data, scales to huge volumes of messages, and is dead-simple to use and operate?

Storm satisfies these goals.

Why Storm is important

Storm exposes a set of primitives for doing realtime computation. Just as MapReduce greatly eases the writing of parallel batch processing, Storm's primitives greatly ease the writing of parallel realtime computation.

The key properties of Storm are:

  1. Extremely broad set of use cases: Storm can be used for processing messages and updating databases (stream processing), doing a continuous query on data streams and streaming the results into clients (continuous computation), parallelizing an intense query like a search query on the fly (distributed RPC), and more. Storm's small set of primitives satisfies a stunning number of use cases.

  2. Scalable: Storm scales to massive numbers of messages per second. To scale a topology, all you have to do is add machines and increase the parallelism settings of the topology. As an example of Storm's scale, one of Storm's initial applications processed 1,000,000 messages per second on a 10 node cluster, including hundreds of database calls per second as part of the topology. Storm's usage of ZooKeeper for cluster coordination makes it scale to much larger cluster sizes.

  3. Guarantees no data loss: A realtime system must have strong guarantees about data being successfully processed. A system that drops data has a very limited set of use cases. Storm guarantees that every message will be processed, and this is in direct contrast with other systems like S4.

  4. Extremely robust: Unlike systems like Hadoop, which are notorious for being difficult to manage, Storm clusters just work. It is an explicit goal of the Storm project to make the user experience of managing Storm clusters as painless as possible.

  5. Fault-tolerant: If there are faults during execution of your computation, Storm will reassign tasks as necessary. Storm makes sure that a computation can run forever (or until you kill the computation).

  6. Programming language agnostic: Robust and scalable realtime processing shouldn't be limited to a single platform. Storm topologies and processing components can be defined in any language, making Storm accessible to nearly anyone.
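The "every message will be processed" property can be illustrated with a generic acknowledge-and-replay loop. The text above does not describe how Storm implements its guarantee, so this sketch should not be read as Storm's mechanism or API: `ReplayingSource` and `process_all` are hypothetical names, and the example assumes a single process with an in-memory message list.

```python
from collections import deque


class ReplayingSource:
    """Hypothetical source that retains each message until it is acknowledged."""

    def __init__(self, messages):
        self._ready = deque(enumerate(messages))  # (message id, payload) awaiting delivery
        self._pending = {}                        # delivered but not yet acknowledged

    def next(self):
        """Hand out the next message, or None when nothing is waiting."""
        if not self._ready:
            return None
        msg_id, payload = self._ready.popleft()
        self._pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        """Processing succeeded: forget the message."""
        del self._pending[msg_id]

    def fail(self, msg_id):
        """Processing failed: queue the message for redelivery."""
        self._ready.append((msg_id, self._pending.pop(msg_id)))


def process_all(source, handler):
    """Drive messages through handler until every one has been acknowledged."""
    attempts = 0
    while (item := source.next()) is not None:
        msg_id, payload = item
        attempts += 1
        try:
            handler(payload)
        except Exception:
            source.fail(msg_id)  # keep the message and retry it later
        else:
            source.ack(msg_id)
    return attempts
```

With this scheme a message is never dropped, but it may be handled more than once, so handlers with side effects need to tolerate retries. A handler that always fails would loop forever here; a production system would need a retry limit or a dead-letter path.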


