Some useful Storm material collected from the web

IMHO, Flume has lots of built-in data sources, decorators (batching,
archiving) and sinks (HDFS etc.). So it's easy to adopt on an existing
system and easy to extend, especially for log files.

Kafka can be used for basic log aggregation and is a general-purpose
producer-consumer messaging system, and it has high throughput.

I feel Flume focuses more on aggregation whereas Kafka focuses on super-fast
messaging. Even though both can be sources for Storm, if you want to do
log aggregation and log data processing together, Flume might be
better; if you want extremely low-latency processing of the data coming from
the logs, Kafka would be the winner.


Min, so it sounds like Kafka is more suitable for the real-time solution
provided by Storm. Aggregation calculations can already be done
inside a Storm bolt with a real ESP engine, for example Drools
Fusion or Esper. I am building a solution with Drools Fusion.

Sam, I'm starting from an AMQP solution for now.
Thank you for the reply. I thought the problem was in the parameters of the
exchange, but it turned out it was in the parameters of the queue. Solved.

Several strategies for collecting data sources for a spout:

We have several log files on different remote machines, sharded
by nginx and all in the same format.
How can I tail these files in parallel as the input of a spout?

1. Is there some way I can make a spout run on a specific
machine that holds a log file? For example, I have 4 log servers, and with
setSpout(id, class, 4) I would just pin the 4 spout tasks to the 4 log servers
(see the setSpout sketch after this post).

2. Or I could merge all the log files into one and tail that with a
spout, which doesn't sound efficient.

3. Also, I heard Facebook has developed PTail for their real-time
system; how could I imitate it?

4. Of course I could put the contents of the logs into an MQ, which means I would
have to write a daemon process and run it on each machine.

Thank you for your advice!
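For reference, option 1's setSpout call would look like the following with Storm's Java TopologyBuilder. This is only a minimal sketch: LogTailSpout and LogParserBolt are hypothetical classes, and a parallelism hint of 4 gives you 4 spout tasks but does not, by itself, pin each task to a particular log server (which is exactly the limitation discussed below).

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class LogTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // 4 spout tasks, ideally one per log server -- but Storm's default
        // scheduler decides where they actually run.
        builder.setSpout("log-spout", new LogTailSpout(), 4);    // LogTailSpout is hypothetical
        builder.setBolt("log-parser", new LogParserBolt(), 8)    // LogParserBolt is hypothetical
               .shuffleGrouping("log-spout");

        Config conf = new Config();
        conf.setNumWorkers(4);
        StormSubmitter.submitTopology("log-processing", conf, builder.createTopology());
    }
}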


Another option would be using Flume.


You might have to build a FlumeSinkSpout.

Some tools:
* Flume (we use this in our performance monitoring and analytics
services)
* Scribe
* Kafka
* Fluentd
* ...


Advice from the Storm author:

How you get log data into Storm depends on what you want your message processing guarantees to be. If you don't care about dropping messages, then you could do logs -> Scribe -> ScribeSpout. Scribe would have to be able to discover where the ScribeSpout tasks are (since they could be started on any machine), which could be accomplished using Zookeeper. I believe Scribe has this discovery functionality built in already.
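If your Scribe setup did not have that discovery piece, one possible approach is for each ScribeSpout task to announce itself by registering an ephemeral node in ZooKeeper from the spout's open() method. Everything below is an assumed sketch rather than something from this thread: the /scribe-spouts path, the SpoutRegistry helper, and the idea that the Scribe side would read those nodes to find live spout endpoints.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical helper a ScribeSpout could call from open(), so that the Scribe
// side can discover which hosts/ports are currently accepting messages.
public class SpoutRegistry {
    public static void register(String zkConnect, String host, int port) throws Exception {
        ZooKeeper zk = new ZooKeeper(zkConnect, 30000, event -> { /* ignore watch events */ });
        byte[] address = (host + ":" + port).getBytes("UTF-8");
        // Ephemeral + sequential: the node disappears automatically if the task dies.
        // Assumes the /scribe-spouts parent node already exists.
        zk.create("/scribe-spouts/task-", address,
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    }
}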


If you care about processing every message, then you should do an architecture like logs -> Scribe -> Kestrel -> KestrelSpout.
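On the Storm side, reading from Kestrel is usually done with the storm-kestrel project. A rough sketch, assuming its KestrelThriftSpout is constructed with (host, port, queue, scheme); the host name, queue name, and LogProcessorBolt are placeholders:

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.KestrelThriftSpout;
import backtype.storm.spout.StringScheme;
import backtype.storm.topology.TopologyBuilder;

public class KestrelLogTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Read log lines that Scribe wrote into the "logs" Kestrel queue.
        // 2229 is Kestrel's default thrift port; adjust for your deployment.
        builder.setSpout("kestrel-spout",
                new KestrelThriftSpout("kestrel.example.com", 2229, "logs", new StringScheme()));
        builder.setBolt("log-processor", new LogProcessorBolt(), 8)   // LogProcessorBolt is hypothetical
               .shuffleGrouping("kestrel-spout");

        Config conf = new Config();
        StormSubmitter.submitTopology("scribe-kestrel-logs", conf, builder.createTopology());
    }
}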


Others:

I have a very similar requirement where I have to pipe data from many log files to a bunch of aggregators so that they can do the aggregation. I was thinking of setting up the log file readers as spouts and the aggregators as bolts.

So now I have to run the log file reader's main() and, in it, emit the log file lines.
A few questions
1. Do you recommend doing this?
2. How can I have a main() from which I can emit tuples? I can't package this class as a regular spout because it HAS to run on the right host, and Storm does not give you that control as far as I know.
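To make question 2 concrete, a log-tailing spout written against Storm's normal BaseRichSpout API could look roughly like the sketch below (the path /var/log/app.log and the "line" field name are placeholders). The catch raised in this thread still applies: nothing here forces the spout onto the machine that actually has the file.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

// Sketch of a spout that tails a local log file and emits one tuple per line.
public class LogTailSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private BufferedReader reader;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        try {
            // Placeholder path: only works if the worker runs on the log server.
            this.reader = new BufferedReader(new FileReader("/var/log/app.log"));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void nextTuple() {
        try {
            String line = reader.readLine();
            if (line == null) {
                Utils.sleep(100);   // reached EOF; wait for new lines to be appended
            } else {
                collector.emit(new Values(line));
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}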


In 0.8.0 you can write a custom class to dispatch your log tailing spout to the right boxes.


We have a small daemon tailing logs and feeding Kafka. It is much lighter weight than Flume, gives us transactionality, and easily allows multiple topologies to share the same data source.
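Such a daemon can indeed be tiny. A sketch of the idea using Kafka's Java producer API (org.apache.kafka.clients.producer); the broker address, topic name, and log path are placeholders, and the poster's actual daemon may be built quite differently:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal "tail -f into Kafka" daemon: one Kafka message per log line.
public class LogToKafka {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");            // placeholder broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        BufferedReader reader = new BufferedReader(new FileReader("/var/log/nginx/access.log"));

        while (true) {
            String line = reader.readLine();
            if (line == null) {
                Thread.sleep(100);                                 // wait for new lines
            } else {
                producer.send(new ProducerRecord<>("weblogs", line));  // "weblogs" is a placeholder topic
            }
        }
    }
}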


Hi =)  My options/answers are inline (mind you, I'm learning Storm right now), but I hope I can help some.

I looked at this post and my question is related to it https://groups.google.com/forum/#!msg/storm-user/Zvy6RrT8RHo/OVB03TlT2HgJ

I am looking to tail log files from many servers into Storm and then carry out staged processing on the messages from these log files. So what I would like is to have a spout on each of the servers on which the log files exist that will emit the log lines, and then have the bolts that consume them running on different machines within the same data center.

Because you can't (currently) define which machine a spout should run on (check out issue #164, https://github.com/nathanmarz/storm/issues/164), I think this could be an issue.  Once this is complete, though, you should be able to define this in the topology and, using a library as previously suggested in that linked thread, monitor logs via a custom spout.


The questions I have are:
1. Is this not recommended? From the post mentioned above it seems to me that it is not. If so, why?
 
I don't see why not; this looks like a good use case.  I agree with some of the other respondents in the referenced post: using something like Flume or another library to build a spout that does this is a good approach.  Plus, you can contribute it back =)

2. If we are to do this, I need to be able to write a spout which has a main() so that I can read the log and emit the tuples. I can't package these in the topology since I need to control exactly where they run (where the log files are).

Right now, that's kinda true (see the reference above about issue #164).  Once you can define your spout's deployment location, this could be done in a normal spout IMO.  For now, you can build a solution where you read the data from the logs (using a library as previously identified) and push it to Kestrel, for example.  Then, using the storm-kestrel spout, you can bring the lines into Storm for processing.


