I see that even the included pattern sets for Cisco and other devices make extensive use of greedy matches. How do other folks work with syslog data without GREEDYDATA? For folks processing syslog-style messages, are you achieving 10K+ EPS rates (as others seem to talk about) without massive horizontal scaling? I can just eke out 2-2.5K EPS on an 8-core AWS instance. While horizontal scaling works well enough for me, that's a lot more compute resources than I expected to need. My single-node commercial SIEM product can do much higher rates with a whole lot less, and that's what I'm looking to largely replace with ELK!
I'm running 8 worker threads on my 8-core instance, with all cores nearly pegged and the vast majority of the time going to the worker threads. Apart from a bit of mutate (casting and renaming), the only thing going on is the grok filter followed by the KV filter (a template version of the filter config is at https://github.com/davidski/cookbooks/blob/master/sch-logstash/templates/default/config/filter_sidewinder.conf.erb). The consensus I hear from #logstash on Freenode, and mentioned by elasticsearch.org most recently in the ELK for DevOps webinar, is to avoid greedy data. It's possible there's other back pressure in the pipeline, but my ES index rates scale linearly when I add more indexer nodes, and I don't know of any way to directly monitor internal Logstash pipeline queues to see where any pressure is coming from.
I'm really mostly curious about what sorts of rates folks are seeing on syslog-style traffic ingestion. Most of the benchmarks I've seen are either off generated events or with well-structured data. Are folks really getting 10K+ EPS rates with typical syslog ingestion without big farms of indexers? So far, that seems to be a unicorn.
URL:http://comments.gmane.org/gmane.comp.sysutils.logstash.user/10463
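In practice, "avoiding GREEDYDATA" mostly means anchoring the expression and using tighter building blocks for each field, so a non-matching line fails fast instead of forcing the regex engine to backtrack across the whole message. A minimal sketch of the idea (the fields are illustrative, not the poster's Sidewinder config):

filter {
  grok {
    # Slow on misses: %{GREEDYDATA} in the middle of a pattern backtracks heavily, e.g.
    #   %{SYSLOGTIMESTAMP:timestamp} %{GREEDYDATA:body} %{NUMBER:duration}
    # Tighter: anchor the start and end, and give every field a specific pattern.
    match => ["message", "^%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:host} %{NOTSPACE:program}: %{DATA:action} took %{NUMBER:duration:float} seconds$"]
  }
}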
Thanks to @wood in the group for this information: under heavy load, Grok really does tend to be one of the first components to become a bottleneck. So if you control the log format, it is best to find a way to skip the Grok step entirely. Earlier cookbooks already showed how to do this by customizing the LogFormat to emit JSON. Two years ago I also wrote up an example on my blog of doing this with nginx: http://chenlinux.com/2012/09/21/json-event-for-logstash/index.html.
url:http://chenlinux.com/2014/06/11/nginx-access-log-to-elasticsearch/
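The idea in that post is to have nginx write its access log as JSON so Logstash can skip grok entirely and just decode each line. A minimal sketch along those lines (paths and field names are illustrative, not the exact config from the article):

# nginx side: emit access log lines that are already JSON, e.g.
#   log_format json_log '{"@timestamp":"$time_iso8601","host":"$host","request":"$request","status":$status}';
#   access_log /var/log/nginx/access.json json_log;
# Logstash side: read the file with the json codec, no grok involved.
input {
  file {
    path  => "/var/log/nginx/access.json"
    codec => "json"
  }
}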
Logstash and its Grok filter are excellent and I love them, but it was going so slowly that the data was useless by the time I had finally ingested it for review. Here's what was wrong and how I fixed it.
The Grok filter in Logstash was only able to handle between 10 and 30 log lines per second, given my many regex patterns. That is slower than the rate at which we generate log lines (~50/sec).
At Pindrop we send all log messages to Syslog for aggregation. This means that all logs for all purposes are in the same file. I set up an ELK Stack on my laptop to review this data.
The grok filter config was immediately very complicated, due to the widely varying nature of our log lines, the mixture of applications sending log lines, and the detail I needed to pluck from these lines. For example, some of our log lines have a Python array pretty-printed, whereas others did a JSON serialization of the data. Extracting phone numbers and usernames from this data is a dizzying set of regex queries.
Thankfully, to help develop these grok lines, the Grok Debugger exists, which will save your sanity.
Here's an example log line:
Dec 18 07:49:05 box-004 MainProcess[36381]: pindrop.profiler INFO [req bdcecd58a4ab7c2e@box-004] fetch_data - finished at 1418888945.046632, taking 0.006296 seconds (0.00 cpu)
This is one of the easier examples. The data I want from this line is:
Dec 18 07:49:05
pindrop.profiler
fetch_data
0.006296
So I have a grok configuration that looks like this:
filter {
  grok {
    match => ["message", "%{pd_SYSLOG_NORMAL} %{pd_REQUEST} %{DATA:proc_detail} - %{DATA}, taking %{NUMBER:duration_seconds:float} seconds"]
    patterns_dir => "patterns/"
  }
}
I also have all these reusable patterns that I wrote/modified from the Logstash base.
pd_UNIQUE_ID [a-f0-9]{12,16}
pd_SOMECLASS (?:[a-zA-Z0-9-_]+[.]?)+[A-Za-z0-9$_]+
pd_SYSLOG_BASE %{SYSLOGTIMESTAMP:timestamp} %{HOST} %{SYSLOGPROG}:
pd_SYSLOG_NORMAL %{pd_SYSLOG_BASE} %{pd_SOMECLASS} (INFO|ERROR|WARN)
pd_REQUEST \[req %{pd_UNIQUE_ID:unique_id}@%{HOST}\]
This is a pretty specific pattern, I realize. But with our many different types of log lines, it's hard to be more generic, capture all the data I need, and still be able to come back and edit them later while maintaining my sanity (which is already wearing thin).
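For later reference, resolving the custom patterns above shows how much of the line the syslog prefix alone accounts for: %{pd_SYSLOG_NORMAL} by itself expands into five grok patterns (counting each %{...} once):

%{pd_SYSLOG_NORMAL}
  => %{pd_SYSLOG_BASE} %{pd_SOMECLASS} (INFO|ERROR|WARN)
  => %{SYSLOGTIMESTAMP:timestamp} %{HOST} %{SYSLOGPROG}: %{pd_SOMECLASS} (INFO|ERROR|WARN)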
I knew that, since I had about 15 different styles of patterns all tried in sequence from most specific to least (with break_on_match => true), testing each of these in sequence was going to be the biggest time sink. I optimized and limited these patterns as much as I could, but with little success.
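That stack of patterns looked roughly like a single grok trying a list of alternatives in order; the fallback patterns below are invented for illustration, not the actual Pindrop config:

filter {
  grok {
    patterns_dir   => "patterns/"
    break_on_match => true
    # Alternatives are tried top to bottom until one matches, so every miss
    # re-tests the same %{pd_SYSLOG_NORMAL} prefix before giving up.
    match => ["message", "%{pd_SYSLOG_NORMAL} %{pd_REQUEST} %{DATA:proc_detail} - %{DATA}, taking %{NUMBER:duration_seconds:float} seconds",
              "message", "%{pd_SYSLOG_NORMAL} %{pd_REQUEST} %{DATA:proc_detail} - %{DATA}",
              "message", "%{pd_SYSLOG_NORMAL} %{GREEDYDATA:detail}"]
  }
}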
I dove into Logstash's Grok Filter and started timing the regex match, like so:
now = Time.now
grok, match = grok.match(input.to_s) # Original Line
total = Time.now - now
if total > @average_time
  @logger.warn("Took #{total}s which is #{total - @average_time} longer than average of #{@average_time}", :event => input)
  @average_time = (@average_time + total) / 2
end
This gave me some good insight. My longest offenders were taking 1.5s, whereas the quick ones took less than 0.01s.
In the end, I knew that simplifying the many patterns we would inevitably test these log lines against was the only thing that was going to make a difference. As I was scouring the Logstash Docs, I noticed that there was this section on overwriting a field with a match from a grok line.
This gave me a 'eureka' moment in which I realized I should match the Syslog base of the line once and only once, and then in a separate grok filter, match my speciality parts. Now my grok line is just:
match => ["message", "%{pd_REQUEST} %{DATA:proc_detail} - %{DATA}, taking %{NUMBER:duration_seconds:float} seconds"]
pd_SYSLOG_NORMAL actually expands into 5 different patterns. The result of removing it is that I now have only 5 patterns left in the match line (pd_REQUEST counts as 2). That's a 50% decrease in patterns for just this one line, but more importantly, in the worst case a line was running against those initial 5 patterns 15 times. So this takes 70 pattern matches out of the mix (assuming that one of them eventually has to match).
The speed-up was dramatic: I went from matching 10-30 lines/s to 500 lines/s. So, if you're trying to match the same thing over and over again, pull it out into a separate grok filter; it will save you time and sanity.
URL:http://ghost.frodux.in/logstash-grok-speeds/
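Spelled out as configuration, the two-stage approach from that post reads roughly like this (a reconstruction, not the author's exact filters; capturing the remainder with GREEDYDATA is an assumption about how the trimmed message is carried forward):

filter {
  # Stage 1: match the syslog envelope once, then overwrite "message" with
  # only the part that follows it, so later groks never re-test that prefix.
  grok {
    patterns_dir => "patterns/"
    match        => ["message", "%{pd_SYSLOG_NORMAL} %{GREEDYDATA:message}"]
    overwrite    => ["message"]
  }
  # Stage 2: the speciality patterns now run against the already-trimmed message.
  grok {
    patterns_dir => "patterns/"
    match => ["message", "%{pd_REQUEST} %{DATA:proc_detail} - %{DATA}, taking %{NUMBER:duration_seconds:float} seconds"]
  }
}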
The problem is that if I disable the only filter I have enabled, my StatsD counter says the rate of processed events is 12k events/s. If I add the grok filter, Logstash is unable to catch up with its position in the Redis queue. Server specs are the following:
Grok disabled:
12k events/s
CPU usage 70-75%

Grok enabled:
2.9k events/s
CPU usage 50%
I use the StatsD counter as the metric. Here is a graph of the different scenarios:
http://imgur.com/KHPQYHr
So, is there a real performance issue when using grok filters, or am I screwing something up? I have no experience with Java apps, so please let me know what information may be useful for you in order to see what the problem is.
URL:https://logstash.jira.com/browse/LOGSTASH-480
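The StatsD counter mentioned there is straightforward to reproduce with Logstash's statsd output; a minimal sketch, assuming a local StatsD daemon (the host and metric name are placeholders):

output {
  statsd {
    host      => "localhost"
    namespace => "logstash"
    # One increment per event that reaches this output; graph the resulting
    # rate to compare throughput with and without the grok filter.
    increment => ["events.processed"]
  }
}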