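The original article uses a diagram to illustrate how parallelism works when a topology runs in a Storm cluster.
Put simply, when a topology runs in a Storm cluster, its parallelism depends on three logical entities: workers, executors, and tasks.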
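1. A worker is a process that runs on a worker node and is created by the Supervisor daemon to do the actual work. Each worker executes a subset of one given topology. Put the other way round, a single worker never runs anything that belongs to a different topology.
2. An executor can be thought of as a worker thread inside a worker process. An executor only runs tasks that belong to the same component (spout or bolt). A worker process can contain one or more executor threads. By default, one executor runs one task.
3. A task is the actual work done by a spout or bolt. One executor can be responsible for one or more tasks. The number of tasks of a component (spout or bolt) is the upper bound on that component's parallelism, because a component can never have more executors than tasks. Tasks are also the unit by which stream groupings partition tuples between components.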
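The relationship between executors and tasks can be made concrete with a small sketch. It assumes that a component's tasks are split as evenly as possible across its executors, which is a simplification for illustration. `TaskLayoutSketch` and `assignTasks` are hypothetical names and are not part of the Storm API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; it does not call the Storm API.
class TaskLayoutSketch {
    // Split task ids 0..numTasks-1 as evenly as possible across the executors
    // of one component (assumed layout, not Storm's actual scheduler).
    static List<List<Integer>> assignTasks(int numTasks, int numExecutors) {
        if (numExecutors < 1 || numTasks < numExecutors) {
            throw new IllegalArgumentException("need 1 <= executors <= tasks");
        }
        List<List<Integer>> executors = new ArrayList<>();
        int base = numTasks / numExecutors;   // tasks every executor gets
        int extra = numTasks % numExecutors;  // first 'extra' executors get one more
        int nextTask = 0;
        for (int e = 0; e < numExecutors; e++) {
            int count = base + (e < extra ? 1 : 0);
            List<Integer> tasks = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                tasks.add(nextTask++);
            }
            executors.add(tasks);
        }
        return executors;
    }

    public static void main(String[] args) {
        // Default case: one executor runs one task.
        System.out.println(assignTasks(1, 1));
        // Four tasks on two executors: each thread runs two tasks.
        System.out.println(assignTasks(4, 2));
    }
}
```

With four tasks and two executors, each executor thread ends up with two tasks of the same component.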
Parallelism can be configured in several ways. In order of increasing precedence:
defaults.yaml < storm.yaml < topology-specific configuration < component-level (spout/bolt) configuration
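One way to picture this precedence is as a layered merge in which each later source overrides the earlier ones. The sketch below is only a model of that rule, not how Storm loads its configuration. `ConfigPrecedenceSketch`, `merge`, and `example` are hypothetical names, and the values are made up for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the precedence rule; it does not call the Storm API.
class ConfigPrecedenceSketch {
    // Merge layers given from lowest to highest precedence:
    // a key set in a later layer replaces the value from an earlier one.
    static Map<String, Object> merge(List<Map<String, Object>> layersLowToHigh) {
        Map<String, Object> merged = new LinkedHashMap<>();
        for (Map<String, Object> layer : layersLowToHigh) {
            merged.putAll(layer);
        }
        return merged;
    }

    // Illustrative values only (assumed, not read from real files).
    static Map<String, Object> example() {
        Map<String, Object> defaultsYaml = new HashMap<>();
        defaultsYaml.put("topology.workers", 1);
        defaultsYaml.put("topology.debug", false);

        Map<String, Object> stormYaml = new HashMap<>();
        stormYaml.put("topology.workers", 2);

        Map<String, Object> topologyConf = new HashMap<>();
        topologyConf.put("topology.workers", 4);

        return merge(List.of(defaultsYaml, stormYaml, topologyConf));
    }

    public static void main(String[] args) {
        // The topology-level value wins; untouched keys keep their default.
        System.out.println(example());
    }
}
```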
As for how to configure it in practice, the relevant passages are copied here directly and should be self-explanatory:
The parallelism_hint parameter now specifies the initial number of executors (not tasks!) for that bolt. Here is an example code snippet to show these settings in practice:
topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2) // parallelism hint: 2 executors
    .setNumTasks(4)                                        // 4 tasks, i.e. 2 tasks per executor
    .shuffleGrouping("blue-spout");
The GreenBolt was configured as per the code snippet above whereas BlueSpout and YellowBolt only set the parallelism hint (number of executors). Here is the relevant code:
Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes

TopologyBuilder topologyBuilder = new TopologyBuilder();

topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // set parallelism hint to 2

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)   // 2 executors
    .setNumTasks(4)                                         // 4 tasks in total
    .shuffleGrouping("blue-spout");

topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6) // 6 executors, one task each
    .shuffleGrouping("green-bolt");

StormSubmitter.submitTopology(
    "mytopology",
    conf,
    topologyBuilder.createTopology()
);
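The arithmetic behind this example can be sketched as follows. The sketch assumes that executors are spread evenly over the worker processes. `ParallelismMathSketch` and its methods are hypothetical names, and only the numbers are taken from the topology above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; it does not call the Storm API.
class ParallelismMathSketch {
    // Parallelism hints (initial executor counts) of the example topology.
    static Map<String, Integer> exampleExecutors() {
        Map<String, Integer> executors = new LinkedHashMap<>();
        executors.put("blue-spout", 2);
        executors.put("green-bolt", 2);
        executors.put("yellow-bolt", 6);
        return executors;
    }

    // Sum a per-component count over all components.
    static int total(Map<String, Integer> perComponent) {
        int sum = 0;
        for (int n : perComponent.values()) {
            sum += n;
        }
        return sum;
    }

    public static void main(String[] args) {
        int numWorkers = 2; // conf.setNumWorkers(2)
        int executors = total(exampleExecutors());
        System.out.println("executors in total: " + executors);
        // Assumes an even spread of executors over the workers.
        System.out.println("executors per worker: " + executors / numWorkers);
    }
}
```

For the example topology this gives 2 + 2 + 6 = 10 executors, so each of the two worker processes runs 5 executor threads.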
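And of course Storm comes with additional configuration settings to control the parallelism of a topology, including TOPOLOGY_MAX_TASK_PARALLELISM, which puts a ceiling on the number of executors that can be spawned for a single component and can be set via Config#setMaxTaskParallelism().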
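There are two main ways to rebalance a topology: through the Storm web UI, or with the storm rebalance command-line tool, for example: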
# Reconfigure the topology "mytopology" to use 5 worker processes,
# the spout "blue-spout" to use 3 executors and
# the bolt "yellow-bolt" to use 10 executors.
storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10
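When choosing executor counts for a rebalance, keep in mind that a component's number of tasks stays fixed for the lifetime of the topology, so its executor count cannot exceed its task count. The sketch below assumes that a request above the task count is simply capped. That capping behaviour is an assumption made for illustration, and `RebalanceBoundSketch` and `effectiveExecutors` are hypothetical names, not part of the Storm API.

```java
// Hypothetical helper for illustration; it does not call the Storm API.
class RebalanceBoundSketch {
    // Executor count a component could end up with, assuming a request
    // larger than the fixed task count is capped at that task count.
    static int effectiveExecutors(int requestedExecutors, int numTasks) {
        if (requestedExecutors < 1 || numTasks < 1) {
            throw new IllegalArgumentException("counts must be positive");
        }
        return Math.min(requestedExecutors, numTasks);
    }

    public static void main(String[] args) {
        int greenBoltTasks = 4; // from setNumTasks(4) in the example topology
        // Asking for 3 executors fits within the 4 tasks.
        System.out.println(effectiveExecutors(3, greenBoltTasks));
        // Asking for 8 executors is limited by the 4 tasks.
        System.out.println(effectiveExecutors(8, greenBoltTasks));
    }
}
```

This is why setting a higher task count up front with setNumTasks leaves room to scale a component's executors later without resubmitting the topology.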