Storm in Practice: WordCount

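With the Storm environment deployed and started correctly, we can begin actual Storm development. By convention, we start with WordCount.
The example is simple. Its core components are one spout, two bolts, and one topology.

The spout reads files from a directory, reads each file line by line, and emits every line to a bolt. Once a file has been processed, it is renamed so that it is not processed again.

The first bolt splits each string received from the spout on spaces to produce words, and emits each word to the next bolt.

The second bolt receives each word, counts it, and stores the count in a HashMap.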
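
The data flow described above can be sketched in plain Java, without any Storm dependency. The names used here (PipelineSketch, splitLine, countWords) are hypothetical and only illustrate what the two bolts compute; in the real topology each stage runs as a separate component and the lines come from the spout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PipelineSketch {

    // Stage 2 (first bolt): split one line on spaces, drop empty tokens, lowercase.
    public static List<String> splitLine(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.split(" ")) {
            String word = token.trim();
            if (!word.isEmpty()) {
                words.add(word.toLowerCase());
            }
        }
        return words;
    }

    // Stage 3 (second bolt): accumulate one counter per word in a HashMap.
    public static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counters = new HashMap<>();
        for (String line : lines) { // stage 1 (the spout) would emit these lines
            for (String word : splitLine(line)) {
                counters.merge(word, 1, Integer::sum);
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(Arrays.asList("Hello Storm", "hello  world"));
        // HashMap iteration order is unspecified, so the print order may vary.
        System.out.println(counts);
    }
}
```

With the two sample lines, "hello" is counted twice and "storm" and "world" once each.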

1. Define a spout that continuously emits strings to the bolt.

import java.io.File;
import java.io.IOException;
import java.util.Collection;
import java.util.List;
import java.util.Map;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.FileFilterUtils;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class WordReader extends BaseRichSpout {
    private static final long serialVersionUID = 2197521792014017918L;
    private String inputPath;
    private SpoutOutputCollector collector;

    @Override
    @SuppressWarnings("rawtypes")
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        // The input directory is passed in through the topology configuration.
        inputPath = (String) conf.get("INPUT_PATH");
    }

    @Override
    public void nextTuple() {
        // List every file in the input directory that does not end in ".bak"
        // (a null directory filter means subdirectories are not searched).
        Collection<File> files = FileUtils.listFiles(new File(inputPath),
                FileFilterUtils.notFileFilter(FileFilterUtils.suffixFileFilter(".bak")), null);
        for (File f : files) {
            try {
                // Emit the file one line per tuple.
                List<String> lines = FileUtils.readLines(f, "UTF-8");
                for (String line : lines) {
                    collector.emit(new Values(line));
                }
                // Rename the processed file so the next call skips it.
                FileUtils.moveFile(f, new File(f.getPath() + System.currentTimeMillis() + ".bak"));
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Each emitted tuple has a single field named "line".
        declarer.declare(new Fields("line"));
    }

}
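
The "read, then rename" idea used by the spout can be tried outside Storm. The following is a minimal sketch using only java.nio.file; the class RenameAfterReadSketch and the method pollOnce are hypothetical names, and the renamed file gets a "." before the timestamp, which differs slightly from the spout above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class RenameAfterReadSketch {

    // Read every regular file in dir that does not end in ".bak",
    // then rename it so that the next poll skips it.
    public static List<String> pollOnce(Path dir) throws IOException {
        // Collect the candidates first so the directory is not modified while it is iterated.
        List<Path> pending = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p) && !p.getFileName().toString().endsWith(".bak")) {
                    pending.add(p);
                }
            }
        }
        List<String> lines = new ArrayList<>();
        for (Path p : pending) {
            lines.addAll(Files.readAllLines(p, StandardCharsets.UTF_8));
            Path done = p.resolveSibling(p.getFileName() + "." + System.currentTimeMillis() + ".bak");
            Files.move(p, done);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("wordcount-input");
        Files.write(dir.resolve("a.txt"), "hello storm\nhello world\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(pollOnce(dir).size()); // prints 2: the file is read on the first poll
        System.out.println(pollOnce(dir).size()); // prints 0: only the .bak file is left
    }
}
```

Calling pollOnce repeatedly mirrors how Storm calls nextTuple over and over: a file dropped into the directory later is picked up by the next call.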



2. Define a bolt that receives the strings sent by the spout, splits them into words, and emits each word to the next bolt.

import org.apache.commons.lang.StringUtils;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class WordSpliter extends BaseBasicBolt {

    private static final long serialVersionUID = -5653803832498574866L;

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // Field 0 is the "line" emitted by the spout.
        String line = input.getString(0);
        String[] words = line.split(" ");
        for (String word : words) {
            word = word.trim();
            // Consecutive spaces produce empty tokens; skip them.
            if (StringUtils.isNotBlank(word)) {
                word = word.toLowerCase();
                collector.emit(new Values(word));
            }
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Each emitted tuple has a single field named "word".
        declarer.declare(new Fields("word"));
    }

}
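
The bolt above splits on a single space only, so two words separated by a tab stay together as one token. If that matters for your input, one possible variant is to split on any run of whitespace. This is a sketch with hypothetical names (SplitSketch, tokenize), not part of the original example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SplitSketch {

    // Split on any run of whitespace (spaces, tabs, ...) and lowercase each token.
    public static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            // An empty or all-whitespace line yields a single empty token; skip it.
            if (!token.isEmpty()) {
                words.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("  Hello\tStorm   hello ")); // prints [hello, storm, hello]
    }
}
```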



3. Define a bolt that receives the words and counts them.

import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

public class WordCounter extends BaseBasicBolt {
    private static final long serialVersionUID = 5683648523524179434L;
    // Note: HashMap is not thread-safe, yet it is written by execute()
    // and read by the reporting thread started in prepare().
    private HashMap<String, Integer> counters = new HashMap<String, Integer>();
    // Set by execute() whenever the counters change; cleared after each report.
    private volatile boolean edit = false;

    @Override
    @SuppressWarnings("rawtypes")
    public void prepare(Map stormConf, TopologyContext context) {
        // Reporting interval in seconds, passed in through the topology configuration.
        final long timeOffset = Long.parseLong(stormConf.get("TIME_OFFSET").toString());
        new Thread(new Runnable() {
            @Override
            public void run() {
                while (true) {
                    // Print the counters only if something changed since the last report.
                    if (edit) {
                        for (Entry<String, Integer> entry : counters.entrySet()) {
                            System.out.println(entry.getKey() + " : " + entry.getValue());
                        }
                        System.out.println("WordCounter---------------------------------------");
                        edit = false;
                    }
                    try {
                        Thread.sleep(timeOffset * 1000);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            }
        }).start();
    }

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // Field 0 is the "word" emitted by WordSpliter.
        String str = input.getString(0);
        if (!counters.containsKey(str)) {
            counters.put(str, 1);
        } else {
            Integer c = counters.get(str) + 1;
            counters.put(str, c);
        }
        edit = true;
        System.out.println("WordCounter+++++++++++++++++++++++++++++++++++++++++++");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // This is the last bolt in the topology, so it declares no output fields.
    }
}



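Note the prepare method of the WordCounter class: it starts a Thread that keeps monitoring the container for changes (a word's count increasing, or a new word being added).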
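
Because the monitoring thread reads the map while execute writes to it, a thread-safe variant may be preferable. The following sketch shows one possible approach in plain Java, assuming a ConcurrentHashMap, an AtomicBoolean "dirty" flag, and a ScheduledExecutorService in place of the hand-written loop. The names ReporterSketch, count, and snapshotIfChanged are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class ReporterSketch {
    private final Map<String, Integer> counters = new ConcurrentHashMap<>();
    private final AtomicBoolean dirty = new AtomicBoolean(false);

    // Called by the processing thread for every word.
    public void count(String word) {
        counters.merge(word, 1, Integer::sum);
        dirty.set(true);
    }

    // Returns a sorted copy of the counters if anything changed since the last call, else null.
    public Map<String, Integer> snapshotIfChanged() {
        return dirty.getAndSet(false) ? new TreeMap<>(counters) : null;
    }

    public static void main(String[] args) throws InterruptedException {
        ReporterSketch sketch = new ReporterSketch();
        ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();
        // Check for changes every 100 ms; the interval plays the role of TIME_OFFSET.
        reporter.scheduleWithFixedDelay(() -> {
            Map<String, Integer> snapshot = sketch.snapshotIfChanged();
            if (snapshot != null) {
                System.out.println(snapshot);
            }
        }, 100, 100, TimeUnit.MILLISECONDS);
        for (String word : "a b a".split(" ")) {
            sketch.count(word);
        }
        Thread.sleep(300);
        reporter.shutdownNow();
    }
}
```

The reporter prints a copy of the map rather than iterating the live one, so a word arriving mid-report cannot disturb the printout.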

4. Define a topology and submit the job.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;

public class WordCountTopo {
    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Usage: inputPath timeOffset");
            System.err.println("such as : java -jar WordCount.jar D://input/ 2");
            System.exit(2);
        }
        // Wire the components: word-reader -> word-splitter -> word-counter.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("word-reader", new WordReader());
        builder.setBolt("word-splitter", new WordSpliter()).shuffleGrouping("word-reader");
        builder.setBolt("word-counter", new WordCounter()).shuffleGrouping("word-splitter");
        String inputPath = args[0];
        String timeOffset = args[1];
        // These entries are read back in WordReader.open() and WordCounter.prepare().
        Config conf = new Config();
        conf.put("INPUT_PATH", inputPath);
        conf.put("TIME_OFFSET", timeOffset);
        conf.setDebug(false);
        // LocalCluster runs the topology in-process.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("WordCount", conf, builder.createTopology());
    }
}
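
The topology above feeds the counter through shuffleGrouping, which is fine while the counter runs as a single task. If the counter's parallelism were raised (an assumption that goes beyond this example), the same word could reach different tasks and each task would hold only a partial count. Storm offers fieldsGrouping for that case. The plain-Java sketch below only illustrates the idea of routing by a hash of the word; GroupingSketch, taskFor, and route are hypothetical names, and this is not Storm's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupingSketch {

    // Pick a task index from the word's hash, so equal words always map to the same task.
    public static int taskFor(String word, int numTasks) {
        return Math.floorMod(word.hashCode(), numTasks);
    }

    // Route each word to the counter map of its task and count it there.
    public static List<Map<String, Integer>> route(List<String> words, int numTasks) {
        List<Map<String, Integer>> tasks = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            tasks.add(new HashMap<>());
        }
        for (String word : words) {
            tasks.get(taskFor(word, numTasks)).merge(word, 1, Integer::sum);
        }
        return tasks;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("storm", "bolt", "storm", "spout", "storm");
        // Whichever task receives "storm" holds its complete count of 3.
        System.out.println(route(words, 2));
    }
}
```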



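5. Once the code is complete, export a jar (do not specify a Main class when exporting), upload it to the Storm cluster, and submit the job with a command of the form ./storm jar <path-to-jar> com.x.x.WordCountTopo /data/tianzhen/input 2 (the storm jar command takes the jar path before the main class).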

The topology starts, and the spout and bolts run:

[Figure 1: screenshot of the topology starting and the spout and bolts running]


Statistics printed by the monitoring Thread:


[Figure 2: screenshot of the word counts printed by the monitoring Thread]


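After processing, the source file is renamed to *.bak.

Unlike a Hadoop job, the topology does not stop once the work is done: the spout keeps monitoring the data source and keeps emitting data to the bolts.
So if the source data changes, the change should show up right away. I put another text file under the path, and the result was: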


[Figure 3: screenshot of the updated word counts after adding another text file]


As you can see, the result was updated immediately. This is where Storm's real-time nature shows.
