Original article: https://github.com/nathanmarz/storm/wiki/Concepts
Chinese translation: http://xumingming.sinaapp.com/117/twitter-storm%E7%9A%84%E4%B8%80%E4%BA%9B%E5%85%B3%E9%94%AE%E6%A6%82%E5%BF%B5/
This page lists the main concepts of Storm and links to resources where you can find more information. The concepts discussed are: topologies, streams, spouts, bolts, stream groupings, reliability, tasks, workers, and configuration.
The logic for a realtime application is packaged into a Storm topology. A Storm topology is analogous to a MapReduce job. One key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (or until you kill it, of course). A topology is a graph of spouts and bolts that are connected with stream groupings. These concepts are described below.
For example, here is a topology built with TopologyBuilder and submitted to a cluster with StormSubmitter:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(1, new TestWordSpout(true), 5);
builder.setSpout(2, new TestWordSpout(true), 3);
builder.setBolt(3, new TestWordCounter(), 3)
        .fieldsGrouping(1, new Fields("word"))
        .fieldsGrouping(2, new Fields("word"));
builder.setBolt(4, new TestGlobalCount())
        .globalGrouping(1);

Map conf = new HashMap();
conf.put(Config.TOPOLOGY_WORKERS, 4);

StormSubmitter.submitTopology("mytopology", conf, builder.createTopology());
If you are developing against the Storm source rather than a release, add src/jvm/ as a source path. To test a topology in local mode, build it the same way but submit it with LocalCluster instead of StormSubmitter:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(1, new TestWordSpout(true), 5);
builder.setSpout(2, new TestWordSpout(true), 3);
builder.setBolt(3, new TestWordCounter(), 3)
        .fieldsGrouping(1, new Fields("word"))
        .fieldsGrouping(2, new Fields("word"));
builder.setBolt(4, new TestGlobalCount())
        .globalGrouping(1);

Map conf = new HashMap();
conf.put(Config.TOPOLOGY_WORKERS, 4);
conf.put(Config.TOPOLOGY_DEBUG, true);

LocalCluster cluster = new LocalCluster();
cluster.submitTopology("mytopology", conf, builder.createTopology());
Utils.sleep(10000);
cluster.shutdown();
The stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion. Streams are defined with a schema that names the fields in the stream's tuples. Additionally, corresponding fields across tuples must have the same type. That is, the first field must always have the same type and the second field must always have the same type, but different fields can have different types. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define your own serializers so that custom types can be used natively within tuples.
Every stream is given an id when declared. Since single-stream spouts and bolts are so common, OutputFieldsDeclarer has convenience methods for declaring a single stream without specifying an id. In this case, the stream is given the default id of 1.
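For illustration, here is a sketch (with made-up field names) of declaring a stream's schema in a component's declareOutputFields method:

public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // No stream id given, so this stream gets the default id of 1
    declarer.declare(new Fields("word", "count"));
}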
A spout is a source of streams in a topology. Generally spouts will read tuples from an external source (e.g. a Kestrel queue or the Twitter API) and emit them into the topology. Spouts can either be reliable or unreliable. A reliable spout is capable of replaying a tuple if it failed to be processed by Storm, whereas an unreliable spout forgets about the tuple as soon as it is emitted.
Spouts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on SpoutOutputCollector.
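As a sketch, with illustrative stream ids and field names, a spout that emits two streams might declare and emit them like this:

public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declareStream(1, new Fields("word"));  // stream of words
    declarer.declareStream(2, new Fields("error")); // stream of error messages
}

// In nextTuple, pass the stream id as the first argument to emit:
_collector.emit(1, new Values("apple"));
_collector.emit(2, new Values("could not parse input"));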
The main method on spouts is nextTuple. nextTuple either emits a new tuple into the topology or simply returns if there are no new tuples to emit. It is imperative that nextTuple does not block for any spout implementation, because Storm calls all the spout methods on the same thread.
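One way to honor that contract is to poll a non-blocking queue and return right away when it is empty. A minimal sketch, assuming a _queue field (e.g. a LinkedBlockingQueue) that some other thread fills:

public void nextTuple() {
    String sentence = _queue.poll(); // non-blocking; returns null when empty
    if (sentence == null) {
        Utils.sleep(1); // back off briefly rather than block
        return;
    }
    _collector.emit(new Values(sentence));
}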
The other main methods on spouts are ack and fail. These are called when Storm detects that a tuple emitted from the spout either successfully completed through the topology or failed to be completed. ack and fail are only called for reliable spouts. See the Javadoc for more information.
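To make a spout reliable, tag each emitted tuple with a message id; Storm hands that same id back in ack or fail. A sketch (using the sentence itself as the message id, purely for illustration):

// In nextTuple, pass a message id along with the tuple:
_collector.emit(new Values(sentence), sentence);

// Storm later calls back with that same id:
@Override
public void ack(Object msgId) {
    // the tuple tree rooted at this tuple completed successfully
}

@Override
public void fail(Object msgId) {
    // processing failed or timed out; a reliable spout would re-emit it
}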
Here is an example of a simple spout that emits random sentences:
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.IRichSpout;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

import java.util.Map;
import java.util.Random;

public class RandomSentenceSpout implements IRichSpout {
    SpoutOutputCollector _collector;
    Random _rand;

    /**
     * Determines whether this spout may run as multiple tasks.
     */
    @Override
    public boolean isDistributed() {
        return true;
    }

    /**
     * Called when this task is initialized within a worker;
     * the collector is used to emit tuples.
     */
    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;
        _rand = new Random();
    }

    /**
     * Make sure nextTuple does not block.
     */
    @Override
    public void nextTuple() {
        Utils.sleep(100);
        String[] sentences = new String[] {
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away",
            "four score and seven years ago",
            "snow white and the seven dwarfs",
            "i am at two with nature"};
        String sentence = sentences[_rand.nextInt(sentences.length)];
        _collector.emit(new Values(sentence));
    }

    @Override
    public void close() {
    }

    /**
     * Storm tells the spout that the tuple identified by msgId
     * completed successfully.
     */
    @Override
    public void ack(Object msgId) {
    }

    /**
     * Storm tells the spout that the tuple identified by msgId failed.
     */
    @Override
    public void fail(Object msgId) {
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
All processing in topologies is done in bolts. Bolts can do anything: filtering, functions, aggregations, joins, talking to databases, and more.
Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and thus multiple bolts. For example, transforming a stream of tweets into a stream of trending images requires at least two steps: a bolt to do a rolling count of retweets for each image, and one or more bolts to stream out the top X images (you can do this particular stream transformation in a more scalable way with three bolts than with two).
Bolts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on OutputCollector.
When you declare a bolt's input streams, you always subscribe to specific streams of another component. If you want to subscribe to all the streams of another component, you have to subscribe to each one individually. InputDeclarer has syntactic sugar for subscribing to streams declared on the default stream id. Saying declarer.shuffleGrouping(1) subscribes to the default stream on component 1 and is equivalent to declarer.shuffleGrouping(1, DEFAULT_STREAM_ID).
The main method in bolts is the execute method, which takes a new tuple as input. Bolts emit new tuples using the OutputCollector object. Bolts must call the ack method on the OutputCollector for every tuple they process so that Storm knows when tuples are completed (and can eventually determine that it's safe to ack the original spout tuples). For the common case of processing an input tuple, emitting zero or more tuples based on that tuple, and then acking the input tuple, Storm provides an IBasicBolt interface which does the acking automatically.
It's perfectly fine to launch new threads in bolts that do processing asynchronously. OutputCollector is thread-safe and can be called at any time.
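For contrast with the IBasicBolt example below, here is a sketch of the explicit pattern on OutputCollector (the sentence-splitting logic is illustrative): pass the input tuple to emit to anchor the new tuples, then ack the input:

public void execute(Tuple tuple) {
    for (String word : tuple.getString(0).split(" ")) {
        // Anchoring: the input tuple ties each new tuple into the tuple tree
        _collector.emit(tuple, new Values(word));
    }
    // Tell Storm this bolt has finished with the input tuple
    _collector.ack(tuple);
}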
Here is an example bolt that uses IBasicBolt to count words:
public static class WordCount implements IBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void prepare(Map conf, TopologyContext context) {
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) count = 0;
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void cleanup() {
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
Part of defining a topology is specifying for each bolt which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt's tasks.
There are six types of stream groupings in Storm:

1. Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in such a way that each task is guaranteed to get an equal number of tuples.
2. Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id"s may go to different tasks.
3. All grouping: The stream is replicated across all the bolt's tasks. Use this grouping with care.
4. Global grouping: The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id.
5. None grouping: This grouping specifies that you don't care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings.
6. Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive it. Direct groupings can only be declared on streams that have been declared as direct streams, and tuples must be emitted using one of the emitDirect methods. A bolt can get the task ids of its consumers either by using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
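Concretely, groupings are declared when wiring a bolt's inputs. A sketch with illustrative component ids and a hypothetical MyBolt:

builder.setBolt(5, new MyBolt(), 4)
        .shuffleGrouping(1)                        // random, even distribution
        .fieldsGrouping(2, new Fields("user-id"))  // partition by the "user-id" field
        .allGrouping(3)                            // replicate to every task
        .globalGrouping(4);                        // entire stream to a single task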
The InputDeclarer object, which is returned whenever setBolt is called on TopologyBuilder, is used for declaring a bolt's input streams and how those streams should be grouped.

Storm guarantees that every spout tuple will be fully processed by the topology. It does this by tracking the tree of tuples triggered by every spout tuple and determining when that tree of tuples has been successfully completed. Every topology has a "message timeout" associated with it. If Storm fails to detect that a spout tuple has been completed within that timeout, then it fails the tuple and replays it later.
To take advantage of Storm's reliability capabilities, you must tell Storm when new edges in a tuple tree are being created and tell Storm whenever you've finished processing an individual tuple. These are done using the OutputCollector object that bolts use to emit tuples. Anchoring is done in the emit method, and you declare that you're finished with a tuple using the ack method.
This is all explained in much more detail on Guaranteeing message processing.
Each spout or bolt executes as many tasks across the cluster. Each task corresponds to one thread of execution, and stream groupings define how to send tuples from one set of tasks to another set of tasks. You set the parallelism for each spout or bolt in the setSpout and setBolt methods of TopologyBuilder.
Topologies execute across one or more worker processes. Each worker process executes a subset of all the tasks for the topology. For example, if the combined parallelism of the topology is 300 and 50 workers are allocated, then each worker will execute 6 tasks (as threads within the worker). Storm tries to spread the tasks evenly across all the workers.
Storm has a variety of configurations for tweaking the behavior of nimbus, supervisors, and running topologies. Some configurations are system configurations and cannot be modified on a topology by topology basis, whereas other configurations can be modified per topology.
Every configuration has a default value defined in defaults.yaml in the Storm codebase. You can override these configurations by defining a storm.yaml in the classpath of Nimbus and the supervisors. Finally, you can define a topology-specific configuration that you submit along with your topology when using StormSubmitter.
The preference order for configuration values is defaults.yaml < storm.yaml < topology-specific configuration. However, the topology-specific configuration can only override configs prefixed with "TOPOLOGY".
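For example (the values here are illustrative), a topology-specific configuration is just a map handed to StormSubmitter; only "TOPOLOGY"-prefixed settings like these can be overridden this way:

Map conf = new HashMap();
conf.put(Config.TOPOLOGY_WORKERS, 4);               // number of worker processes
conf.put(Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS, 30); // per-topology message timeout
StormSubmitter.submitTopology("mytopology", conf, builder.createTopology());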