slivelight

Streaming 101: The world beyond batch

By Tyler Akidau

August 5, 2015

Editor's note: This is the first post in a two-part series about the evolution of data processing, with a focus on streaming systems, unbounded data sets, and the future of big data. See part two. Also, check out "Streaming Systems," by Tyler Akidau, Slava Chernyak, and Reuven Lax.

Check out the resource centers on O'Reilly online learning for more ways to expand your skillset.

Streaming data processing is a big deal in big data these days, and for good reasons. Amongst them:

Businesses crave ever more timely data, and switching to streaming is a good way to achieve lower latency.
The massive, unbounded data sets that are increasingly common in modern business are more easily tamed using a system designed for such never-ending volumes of data.
Processing data as they arrive spreads workloads out more evenly over time, yielding more consistent and predictable consumption of resources.

Despite this business-driven surge of interest in streaming, the majority of streaming systems in existence remain relatively immature compared to their batch brethren, which has resulted in a lot of exciting, active development in the space recently.

As someone who’s worked on massive-scale streaming systems at Google for the last five+ years (MillWheel, Cloud Dataflow), I’m delighted by this streaming zeitgeist, to say the least. I’m also interested in making sure that folks understand everything that streaming systems are capable of and how they are best put to use, particularly given the semantic gap that remains between most existing batch and streaming systems. To that end, the fine folks at O’Reilly have invited me to contribute a written rendition of my Say Goodbye to Batch talk from Strata + Hadoop World London 2015. Since I have quite a bit to cover, I’ll be splitting this across two separate posts:

Streaming 101: This first post will cover some basic background information and clarify some terminology before diving into details about time domains and a high-level overview of common approaches to data processing, both batch and streaming.
The Dataflow Model: The second post will consist primarily of a whirlwind tour of the unified batch + streaming model used by Cloud Dataflow, facilitated by a concrete example applied across a diverse set of use cases. After that, I’ll conclude with a brief semantic comparison of existing batch and streaming systems.

So, long-winded introductions out of the way, let’s get nerdy.

Background

To begin with, I’ll cover some important background information that will help frame the rest of the topics I want to discuss. We’ll do this in three specific sections:

Terminology: To talk precisely about complex topics requires precise definitions of terms. For some terms that have overloaded interpretations in current use, I’ll try to nail down exactly what I mean when I say them.
Capabilities: I’ll remark on the oft-perceived shortcomings of streaming systems. I’ll also propose the frame of mind that I believe data processing system builders need to adopt in order to address the needs of modern data consumers going forward.
Time domains: I’ll introduce the two primary domains of time that are relevant in data processing, show how they relate, and point out some of the difficulties these two domains impose.

Terminology: What is streaming?

Before going any further, I’d like to get one thing out of the way: what is streaming? The term “streaming” is used today to mean a variety of different things (and for simplicity, I’ve been using it somewhat loosely up until now), which can lead to misunderstandings about what streaming really is, or what streaming systems are actually capable of. As such, I would prefer to define the term somewhat precisely.

The crux of the problem is that many things that ought to be described by what they are (e.g., unbounded data processing, approximate results, etc.), have come to be described colloquially by how they historically have been accomplished (i.e., via streaming execution engines). This lack of precision in terminology clouds what streaming really means, and in some cases, burdens streaming systems themselves with the implication that their capabilities are limited to characteristics frequently described as “streaming,” such as approximate or speculative results. Given that well-designed streaming systems are just as capable (technically more so) of producing correct, consistent, repeatable results as any existing batch engine, I prefer to isolate the term streaming to a very specific meaning: a type of data processing engine that is designed with infinite data sets in mind. Nothing more. (For completeness, it’s perhaps worth calling out that this definition includes both true streaming and micro-batch implementations.)

As to other common uses of “streaming,” here are a few that I hear regularly, each presented with the more precise, descriptive terms that I suggest we as a community should try to adopt:

Unbounded data: A type of ever-growing, essentially infinite data set. These are often referred to as “streaming data.” However, the terms streaming or batch are problematic when applied to data sets, because as noted above, they imply the use of a certain type of execution engine for processing those data sets. The key distinction between the two types of data sets in question is, in reality, their finiteness, and it’s thus preferable to characterize them by terms that capture this distinction. As such, I will refer to infinite “streaming” data sets as unbounded data, and finite “batch” data sets as bounded data.
Unbounded data processing: An ongoing mode of data processing, applied to the aforementioned type of unbounded data. As much as I personally like the use of the term streaming to describe this type of data processing, its use in this context again implies the employment of a streaming execution engine, which is at best misleading; repeated runs of batch engines have been used to process unbounded data since batch systems were first conceived (and conversely, well-designed streaming systems are more than capable of handling “batch” workloads over bounded data). As such, for the sake of clarity, I will simply refer to this as unbounded data processing.
Low-latency, approximate, and/or speculative results: These types of results are most often associated with streaming engines. The fact that batch systems have traditionally not been designed with low-latency or speculative results in mind is a historical artifact, and nothing more. And of course, batch engines are perfectly capable of producing approximate results if instructed to. Thus, as with the terms above, it’s far better describing these results as what they are (low-latency, approximate, and/or speculative) than by how they have historically been manifested (via streaming engines).

From here on out, any time I use the term “streaming,” you can safely assume I mean an execution engine designed for unbounded data sets, and nothing more. When I mean any of the other terms above, I will explicitly say unbounded data, unbounded data processing, or low-latency / approximate / speculative results. These are the terms we’ve adopted within Cloud Dataflow, and I encourage others to take a similar stance.

On the greatly exaggerated limitations of streaming

Next up, let’s talk a bit about what streaming systems can and can’t do, with an emphasis on can; one of the biggest things I want to get across in these posts is just how capable a well-designed streaming system can be. Streaming systems have long been relegated to a somewhat niche market of providing low-latency, inaccurate/speculative results, often in conjunction with a more capable batch system to provide eventually correct results, i.e. the Lambda Architecture.

For those of you not already familiar with the Lambda Architecture, the basic idea is that you run a streaming system alongside a batch system, both performing essentially the same calculation. The streaming system gives you low-latency, inaccurate results (either because of the use of an approximation algorithm, or because the streaming system itself does not provide correctness), and some time later a batch system rolls along and provides you with correct output. Originally proposed by Twitter’s Nathan Marz (creator of Storm), it ended up being quite successful because it was, in fact, a fantastic idea for the time; streaming engines were a bit of a letdown in the correctness department, and batch engines were as inherently unwieldy as you’d expect, so Lambda gave you a way to have your proverbial cake and eat it, too. Unfortunately, maintaining a Lambda system is a hassle: you need to build, provision, and maintain two independent versions of your pipeline, and then also somehow merge the results from the two pipelines at the end.

As someone who has spent years working on a strongly-consistent streaming engine, I also found the entire principle of the Lambda Architecture a bit unsavory. Unsurprisingly, I was a huge fan of Jay Kreps’ Questioning the Lambda Architecture post when it came out. Here was one of the first highly visible statements against the necessity of dual-mode execution; delightful. Kreps addressed the issue of repeatability in the context of using a replayable system like Kafka as the streaming interconnect, and went so far as to propose the Kappa Architecture, which basically means running a single pipeline using a well-designed system that’s appropriately built for the job at hand. I’m not convinced that notion itself requires a name, but I fully support the idea in principle.

Quite honestly, I’d take things a step further. I would argue that well-designed streaming systems actually provide a strict superset of batch functionality. Modulo perhaps an efficiency delta[1], there should be no need for batch systems as they exist today. And kudos to the Flink folks for taking this idea to heart and building a system that’s all-streaming-all-the-time under the covers, even in “batch” mode; I love it.

The corollary of all this is that broad maturation of streaming systems combined with robust frameworks for unbounded data processing will, in time, allow the relegation of the Lambda Architecture to the antiquity of big data history where it belongs. I believe the time has come to make this a reality. Because to do so, i.e. to beat batch at its own game, you really only need two things:

Correctness — This gets you parity with batch.At the core, correctness boils down to consistent storage. Streaming systems need a method for checkpointing persistent state over time (something Kreps has talked about in his Why local state is a fundamental primitive in stream processing post), and it must be well-designed enough to remain consistent in light of machine failures. When Spark Streaming first appeared in the public big data scene a few years ago, it was a beacon of consistency in an otherwise dark streaming world. Thankfully, things have improved somewhat since then, but it is remarkable how many streaming systems still try to get by without strong consistency; I seriously cannot believe that at-most-once processing is still a thing, but it is. To reiterate, because this point is important: strong consistency is required for exactly-once processing, which is required for correctness, which is a requirement for any system that’s going to have a chance at meeting or exceeding the capabilities of batch systems. Unless you just truly don’t care about your results, I implore you to shun any streaming system that doesn’t provide strongly consistent state. Batch systems don’t require you to verify ahead of time if they are capable of producing correct answers; don’t waste your time on streaming systems that can’t meet that same bar. If you’re curious to learn more about what it takes to get strong consistency in a streaming system, I recommend you check out the MillWheel and Spark Streaming papers. Both papers spend a significant amount of time discussing consistency. Given the amount of quality information on this topic in the literature and elsewhere, I won’t be covering it any further in these posts.
Tools for reasoning about time — This gets you beyond batch.Good tools for reasoning about time are essential for dealing with unbounded, unordered data of varying event-time skew. An increasing number of modern data sets exhibit these characteristics, and existing batch systems (as well as most streaming systems) lack the necessary tools to cope with the difficulties they impose. I will spend the remainder of this post, and the bulk of the next post, explaining and focusing on this point. To begin with, we’ll get a basic understanding of the important concept of time domains, after which we’ll take a deeper look at what I mean by unbounded, unordered data of varying event-time skew. We’ll then spend the rest of this post looking at common approaches to bounded and unbounded data processing, using both batch and streaming systems.

Event time vs. processing time

To speak cogently about unbounded data processing requires a clear understanding of the domains of time involved. Within any data processing system, there are typically two domains of time we care about:

Event time, which is the time at which events actually occurred.
Processing time, which is the time at which events are observed in the system.

Not all use cases care about event times (and if yours doesn’t, hooray! — your life is easier), but many do. Examples include characterizing user behavior over time, most billing applications, and many types of anomaly detection, to name a few.

In an ideal world, event time and processing time would always be equal, with events being processed immediately as they occur. Reality is not so kind, however, and the skew between event time and processing time is not only non-zero, but often a highly variable function of the characteristics of the underlying input sources, execution engine, and hardware. Things that can affect the level of skew include:

Shared resource limitations, such as network congestion, network partitions, or shared CPU in a non-dedicated environment.
Software causes, such as distributed system logic, contention, etc.
Features of the data themselves, including key distribution, variance in throughput, or variance in disorder (e.g., a plane full of people taking their phones out of airplane mode after having used them offline for the entire flight).

As a result, if you plot the progress of event time and processing time in any real-world system, you typically end up with something that looks a bit like the red line in Figure 1.

Figure 1: Example time domain mapping. The X-axis represents event time completeness in the system, i.e. the time X in event time up to which all data with event times less than X have been observed. The Y-axis represents the progress of processing time, i.e. normal clock time as observed by the data processing system as it executes. Image: Tyler Akidau.

The black dashed line with a slope of one represents the ideal, where processing time and event time are exactly equal; the red line represents reality. In this example, the system lags a bit at the beginning of processing time, veers closer toward the ideal in the middle, then lags again a bit toward the end. The horizontal distance between the ideal and the red line is the skew between processing time and event time. That skew is essentially the latency introduced by the processing pipeline.

Since the mapping between event time and processing time is not static, this means you cannot analyze your data solely within the context of when they are observed in your pipeline if you care about their event times (i.e., when the events actually occurred). Unfortunately, this is the way most existing systems designed for unbounded data operate. To cope with the infinite nature of unbounded data sets, these systems typically provide some notion of windowing the incoming data. We’ll discuss windowing in great depth below, but it essentially means chopping up a data set into finite pieces along temporal boundaries.

If you care about correctness and are interested in analyzing your data in the context of their event times, you cannot define those temporal boundaries using processing time (i.e., processing time windowing), as most existing systems do; with no consistent correlation between processing time and event time, some of your event time data are going to end up in the wrong processing time windows (due to the inherent lag in distributed systems, the online/offline nature of many types of input sources, etc.), throwing correctness out the window, as it were. We’ll look at this problem in more detail in a number of examples below as well as in the next post.

Unfortunately, the picture isn’t exactly rosy when windowing by event time, either. In the context of unbounded data, disorder and variable skew induce a completeness problem for event time windows: lacking a predictable mapping between processing time and event time, how can you determine when you’ve observed all the data for a given event time X? For many real-world data sources, you simply can’t. The vast majority of data processing systems in use today rely on some notion of completeness, which puts them at a severe disadvantage when applied to unbounded data sets.

I propose that instead of attempting to groom unbounded data into finite batches of information that eventually become complete, we should be designing tools that allow us to live in the world of uncertainty imposed by these complex data sets. New data will arrive, old data may be retracted or updated, and any system we build should be able to cope with these facts on its own, with notions of completeness being a convenient optimization rather than a semantic necessity.

Before diving into how we’ve tried to build such a system with the Dataflow Model used in Cloud Dataflow, let’s finish up one more useful piece of background: common data processing patterns.

Data processing patterns

At this point in time, we have enough background established that we can start looking at the core types of usage patterns common across bounded and unbounded data processing today. We’ll look at both types of processing, and where relevant, within the context of the two main types of engines we care about (batch and streaming, where in this context, I’m essentially lumping micro-batch in with streaming since the differences between the two aren’t terribly important at this level).

Bounded data

Processing bounded data is quite straightforward, and likely familiar to everyone. In the diagram below, we start out on the left with a data set full of entropy. We run it through some data processing engine (typically batch, though a well-designed streaming engine would work just as well), such as MapReduce, and on the right end up with a new structured data set with greater inherent value:

Figure 2: Bounded data processing with a classic batch engine. A finite pool of unstructured data on the left is run through a data processing engine, resulting in corresponding structured data on the right. Image: Tyler Akidau.

Though there are, of course, infinite variations on what you can actually calculate as part of this scheme, the overall model is quite simple. Much more interesting is the task of processing an unbounded data set. Let’s now look at the various ways unbounded data are typically processed, starting with the approaches used with traditional batch engines, and then ending up with the approaches one can take with a system designed for unbounded data, such as most streaming or micro-batch engines.

Unbounded data — batch

Batch engines, though not explicitly designed with unbounded data in mind, have been used to process unbounded data sets since batch systems were first conceived. As one might expect, such approaches revolve around slicing up the unbounded data into a collection of bounded data sets appropriate for batch processing.

Fixed windows

The most common way to process an unbounded data set using repeated runs of a batch engine is by windowing the input data into fixed-sized windows, then processing each of those windows as a separate, bounded data source. Particularly for input sources like logs, where events can be written into directory and file hierarchies whose names encode the window they correspond to, this sort of thing appears quite straightforward at first blush since you’ve essentially performed the time-based shuffle to get data into the appropriate event time windows ahead of time.

In reality, however, most systems still have a completeness problem to deal with: what if some of your events are delayed en route to the logs due to a network partition? What if your events are collected globally and must be transferred to a common location before processing? What if your events come from mobile devices? This means some sort of mitigation may be necessary (e.g., delaying processing until you’re sure all events have been collected, or re-processing the entire batch for a given window whenever data arrive late).

Figure 3: Unbounded data processing via ad hoc fixed windows with a classic batch engine. An unbounded data set is collected up front into finite, fixed-size windows of bounded data that are then processed via successive runs a of classic batch engine. Image: Tyler Akidau.

Sessions

This approach breaks down even more when you try to use a batch engine to process unbounded data into more sophisticated windowing strategies, like sessions. Sessions are typically defined as periods of activity (e.g., for a specific user) terminated by a gap of inactivity. When calculating sessions using a typical batch engine, you often end up with sessions that are split across batches, as indicated by the red marks in the diagram below. The number of splits can be reduced by increasing batch sizes, but at the cost of increased latency. Another option is to add additional logic to stitch up sessions from previous runs, but at the cost of further complexity.

Figure 4: Unbounded data processing into sessions via ad hoc fixed windows with a classic batch engine. An unbounded data set is collected up front into finite, fixed-size windows of bounded data that are then subdivided into dynamic session windows via successive runs a of classic batch engine. Image: Tyler Akidau.

Either way, using a classic batch engine to calculate sessions is less than ideal. A nicer way would be to build up sessions in a streaming manner, which we’ll look at later on.

Unbounded data — streaming

Contrary to the ad hoc nature of most batch-based unbounded data processing approaches, streaming systems are built for unbounded data. As I noted earlier, for many real-world, distributed input sources, you not only find yourself dealing with unbounded data, but also data that are:

Highly unordered with respect to event times, meaning you need some sort of time-based shuffle in your pipeline if you want to analyze the data in the context in which they occurred.
Of varying event time skew, meaning you can’t just assume you’ll always see most of the data for a given event time X within some constant epsilon of time Y.

There are a handful of approaches one can take when dealing with data that have these characteristics. I generally categorize these approaches into four groups:

Time-agnostic
Approximation
Windowing by processing time
Windowing by event time

We’ll now spend a little bit of time looking at each of these approaches.

Time-agnostic

Time-agnostic processing is used in cases where time is essentially irrelevant — i.e., all relevant logic is data driven. Since everything about such use cases is dictated by the arrival of more data, there’s really nothing special a streaming engine has to support other than basic data delivery. As a result, essentially all streaming systems in existence support time-agnostic use cases out of the box (modulo system-to-system variances in consistency guarantees, of course, for those of you that care about correctness). Batch systems are also well suited for time-agnostic processing of unbounded data sources, by simply chopping the unbounded source into an arbitrary sequence of bounded data sets and processing those data sets independently. We’ll look at a couple of concrete examples in this section, but given the straightforwardness of handling time-agnostic processing, won’t spend much more time on it beyond that.

Filtering

A very basic form of time-agnostic processing is filtering. Imagine you’re processing Web traffic logs, and you want to filter out all traffic that didn’t originate from a specific domain. You would look at each record as it arrived, see if it belonged to the domain of interest, and drop it if not. Since this sort of thing depends only on a single element at any time, the fact that the data source is unbounded, unordered, and of varying event time skew is irrelevant.

Figure 5: Filtering unbounded data. A collection of data (flowing left to right) of varying types is filtered into a homogeneous collection containing a single type. Image: Tyler Akidau.

Inner-joins

Another time-agnostic example is an inner-join (or hash-join). When joining two unbounded data sources, if you only care about the results of a join when an element from both sources arrive, there’s no temporal element to the logic. Upon seeing a value from one source, you can simply buffer it up in persistent state; you only need to emit the joined record once the second value from the other source arrives. (In truth, you’d likely want some sort of garbage collection policy for unemitted partial joins, which would likely be time based. But for a use case with little or no uncompleted joins, such a thing might not be an issue.)

Figure 6: Performing an inner join on unbounded data. Joins are produced when matching elements from both sources are observed. Image: Tyler Akidau.

Switching semantics to some sort of outer join introduces the data completeness problem we’ve talked about: once you’ve seen one side of the join, how do you know whether the other side is ever going to arrive or not? Truth be told, you don’t, so you have to introduce some notion of a timeout, which introduces an element of time. That element of time is essentially a form of windowing, which we’ll look at more closely in a moment.

Approximation algorithms

Figure 7: Computing approximations on unbounded data. Data are run through a complex algorithm, yielding output data that look more or less like the desired result on the other side. Image: Tyler Akidau.

The second major category of approaches is approximation algorithms, such as approximate Top-N, streaming K-means, etc. They take an unbounded source of input and provide output data that, if you squint at them, look more or less like what you were hoping to get. The upside of approximation algorithms is that, by design, they are low overhead and designed for unbounded data. The downsides are that a limited set of them exist, the algorithms themselves are often complicated (which makes it difficult to conjure up new ones), and their approximate nature limits their utility.

It’s worth noting: these algorithms typically do have some element of time in their design (e.g., some sort of built-in decay). And since they process elements as they arrive, that element of time is usually processing-time based. This is particularly important for algorithms that provide some sort of provable error bounds on their approximations. If those error bounds are predicated on data arriving in order, they mean essentially nothing when you feed the algorithm unordered data with varying event-time skew. Something to keep in mind.

Approximation algorithms themselves are a fascinating subject, but as they are essentially another example of time-agnostic processing (modulo the temporal features of the algorithms themselves), they’re quite straightforward to use, and thus not worth further attention given our current focus.

Windowing

The remaining two approaches for unbounded data processing are both variations of windowing. Before diving into the differences between them, I should make it clear exactly what I mean by windowing since I’ve only touched on it briefly. Windowing is simply the notion of taking a data source (either unbounded or bounded), and chopping it up along temporal boundaries into finite chunks for processing. The following diagram shows three different windowing patterns:

Figure 8: Example windowing strategies. Each example is shown for three different keys, highlighting the difference between aligned windows (which apply across all the data) and unaligned windows (which apply across a subset of the data). Image: Tyler Akidau.

Fixed windows: Fixed windows slice up time into segments with a fixed-size temporal length. Typically (as in Figure 8), the segments for fixed windows are applied uniformly across the entire data set, which is an example of aligned windows. In some cases, it’s desirable to phase-shift the windows for different subsets of the data (e.g., per key) to spread window completion load more evenly over time, which instead is an example of unaligned windows since they vary across the data.
Sliding windows: A generalization of fixed windows, sliding windows are defined by a fixed length and a fixed period. If the period is less than the length, then the windows overlap. If the period equals the length, you have fixed windows. And if the period is greater than the length, you have a weird sort of sampling window that only looks at subsets of the data over time. As with fixed windows, sliding windows are typically aligned, though may be unaligned as a performance optimization in certain use cases. Note that the sliding windows in the Figure 8 are drawn as they are to give a sense of sliding motion; in reality, all five windows would apply across the entire data set.
Sessions: An example of dynamic windows, sessions are composed of sequences of events terminated by a gap of inactivity greater than some timeout. Sessions are commonly used for analyzing user behavior over time, by grouping together a series of temporally-related events (e.g., a sequence of videos viewed in one sitting). Sessions are interesting because their lengths cannot be defined a priori; they are dependent upon the actual data involved. They’re also the canonical example of unaligned windows since sessions are practically never identical across different subsets of data (e.g., different users).

The two domains of time discussed — processing time and event time — are essentially the two we care about[2]. Windowing makes sense in both domains, so we’ll look at each in detail and see how they differ. Since processing time windowing is vastly more common in existing systems, I’ll start there.

Windowing by processing time

Figure 9: Windowing into fixed windows by processing time. Data are collected into windows based on the order they arrive in the pipeline. Image: Tyler Akidau.

When windowing by processing time, the system essentially buffers up incoming data into windows until some amount of processing time has passed. For example, in the case of five-minute fixed windows, the system would buffer up data for five minutes of processing time, after which it would treat all the data it had observed in those five minutes as a window and send them downstream for processing.

There are a few nice properties of processing time windowing:

It’s simple. The implementation is extremely straightforward since you never worry about shuffling data within time. You just buffer things up as they arrive and send them downstream when the window closes.
Judging window completeness is straightforward. Since the system has perfect knowledge of whether all inputs for a window have been seen or not, it can make perfect decisions about whether a given window is complete or not. This means there is no need to be able to deal with “late” data in any way when windowing by processing time.
If you’re wanting to infer information about the source as it is observed, processing time windowing is exactly what you want. Many monitoring scenarios fall into this category. Imagine tracking the number of requests per second sent to a global-scale Web service. Calculating a rate of these requests for the purpose of detecting outages is a perfect use of processing time windowing.

Good points aside, there is one very big downside to processing time windowing: if the data in question have event times associated with them, those data must arrive in event time order if the processing time windows are to reflect the reality of when those events actually happened. Unfortunately, event-time ordered data are uncommon in many real-world, distributed input sources.

As a simple example, imagine any mobile app that gathers usage statistics for later processing. In cases where a given mobile device goes offline for any amount of time (brief loss of connectivity, airplane mode while flying across the country, etc.), the data recorded during that period won’t be uploaded until the device comes online again. That means data might arrive with an event time skew of minutes, hours, days, weeks, or more. It’s essentially impossible to draw any sort of useful inferences from such a data set when windowed by processing time.

As another example, many distributed input sources may seem to provide event-time ordered (or very nearly so) data when the overall system is healthy. Unfortunately, the fact that event-time skew is low for the input source when healthy does not mean it will always stay that way. Consider a global service that processes data collected on multiple continents. If network issues across a bandwidth-constrained transcontinental line (which, sadly, are surprisingly common) further decrease bandwidth and/or increase latency, suddenly a portion of your input data may start arriving with much greater skew than before. If you are windowing that data by processing time, your windows are no longer representative of the data that actually occurred within them; instead, they represent the windows of time as the events arrived at the processing pipeline, which is some arbitrary mix of old and current data.

What we really want in both of those cases is to window data by their event times in a way that is robust to the order of arrival of events. What we really want is event time windowing.

Windowing by event time

Event time windowing is what you use when you need to observe a data source in finite chunks that reflect the times at which those events actually happened. It’s the gold standard of windowing. Sadly, most data processing systems in use today lack native support for it (though any system with a decent consistency model, like Hadoop or Spark Streaming, could act as a reasonable substrate for building such a windowing system).

This diagram shows an example of windowing an unbounded source into one-hour fixed windows:

Figure 10: Windowing into fixed windows by event time. Data are collected into windows based on the times they occurred. The white arrows call out example data that arrived in processing time windows that differed from the event time windows to which they belonged. Image: Tyler Akidau.

The solid white lines in the diagram call out two particular data of interest. Those two data both arrived in processing time windows that did not match the event time windows to which they belonged. As such, if these data had been windowed into processing time windows for a use case that cared about event times, the calculated results would have been incorrect. As one would expect, event time correctness is one nice thing about using event time windows.

Another nice thing about event time windowing over an unbounded data source is that you can create dynamically sized windows, such as sessions, without the arbitrary splits observed when generating sessions over fixed windows (as we saw previously in the sessions example from the “Unbounded data — batch” section):

Figure 11: Windowing into session windows by event time. Data are collected into session windows capturing bursts of activity based on the times that the corresponding events occurred. The white arrows again call out the temporal shuffle necessary to put the data into their correct event-time locations. Image: Tyler Akidau.

Of course, powerful semantics rarely come for free, and event time windows are no exception. Event time windows have two notable drawbacks due to the fact that windows must often live longer (in processing time) than the actual length of the window itself:

Buffering: Due to extended window lifetimes, more buffering of data is required. Thankfully, persistent storage is generally the cheapest of the resource types most data processing systems depend on (the others being primarily CPU, network bandwidth, and RAM). As such, this problem is typically much less of a concern than one might think when using any well-designed data-processing system with strongly consistent persistent state and a decent in-memory caching layer. Also, many useful aggregations do not require the entire input set to be buffered (e.g., sum, or average), but instead can be performed incrementally, with a much smaller, intermediate aggregate stored in persistent state.
Completeness: Given that we often have no good way of knowing when we’ve seen all the data for a given window, how do we know when the results for the window are ready to materialize? In truth, we simply don’t. For many types of inputs, the system can give a reasonably accurate heuristic estimate of window completion via something like MillWheel’s watermarks (which I’ll talk about more in Part 2). But in cases where absolute correctness is paramount (again, think billing), the only real option is to provide a way for the pipeline builder to express when they want results for windows to be materialized, and how those results should be refined over time. Dealing with window completeness (or lack, thereof), is a fascinating topic, but one perhaps best explored in the context of concrete examples, which we’ll look at next time.

Conclusion

Whew! That was a lot of information. To those of you that have made it this far: you are to be commended! At this point we are roughly halfway through the material I want to cover, so it’s probably reasonable to step back, recap what I’ve covered so far, and let things settle a bit before diving into Part 2. The upside of all this is that Part 1 is the boring post; Part 2 is where the fun really begins.

Recap

To summarize, in this post I’ve:

Clarified terminology, specifically narrowing the definition of “streaming” to apply to execution engines only, while using more descriptive terms like unbounded data and approximate/speculative results for distinct concepts often categorized under the “streaming” umbrella.
Assessed the relative capabilities of well-designed batch and streaming systems, positing that streaming is in fact a strict superset of batch, and that notions like the Lambda Architecture, which are predicated on streaming being inferior to batch, are destined for retirement as streaming systems mature.
Proposed two high-level concepts necessary for streaming systems to both catch up to and ultimately surpass batch, those being correctness and tools for reasoning about time, respectively.
Established the important differences between event time and processing time, characterized the difficulties those differences impose when analyzing data in the context of when they occurred, and proposed a shift in approach away from notions of completeness and toward simply adapting to changes in data over time.
Looked at the major data processing approaches in common use today for bounded and unbounded data, via both batch and streaming engines, roughly categorizing the unbounded approaches into: time-agnostic, approximation, windowing by processing time, and windowing by event time.

Next time

This post provides the context necessary for the concrete examples I’ll be exploring in Part 2. That post will consist of roughly the following:

A conceptual look at how we’ve broken up the notion of data processing in the Dataflow Model across four related axes: what, where, when, and how.
A detailed look at processing a simple, concrete example data set across multiple scenarios, highlighting the plurality of use cases enabled by the Dataflow Model, and the concrete APIs involved. These examples will help drive home the notions of event time and processing time introduced in this post, while additionally exploring new concepts, such as watermarks.
A comparison of existing data-processing systems across the important characteristics covered in both posts, to better enable educated choice amongst them, and to encourage improvement in areas that are lacking, with my ultimate goal being the betterment of data processing systems in general, and streaming systems in particular, across the entire big data community.

Should be a good time. See you then!

[1] One which I propose is not an inherent limitation of streaming systems, but simply a consequence of design choices made in most streaming systems thus far. The efficiency delta between batch and streaming is largely the result of the increased bundling and more efficient shuffle transports found in batch systems. Modern batch systems go to great lengths to implement sophisticated optimizations that allow for remarkable levels of throughput using surprisingly modest compute resources. There’s no reason the types of clever insights that make batch systems the efficiency heavyweights they are today couldn’t be incorporated into a system designed for unbounded data, providing users flexible choice between what we typically consider to be high-latency, higher-efficiency “batch” processing and low-latency, lower-efficiency “streaming” processing. This is effectively what we’ve done with Cloud Dataflow by providing both batch and streaming runners under the same unified model. In our case, we use separate runners because we happen to have two independently designed systems optimized for their specific use cases. Long-term, from an engineering perspective, I’d love to see us merge the two into a single system which incorporates the best parts of both, while still maintaining the flexibility of choosing an appropriate efficiency level. But that’s not what we have today. And honestly, thanks to the unified Dataflow Model, it’s not even strictly necessary; so it may well never happen. (Return)

[2] If you poke around enough in the academic literature or SQL-based streaming systems, you’ll also come across a third windowing time domain: tuple-based windowing (i.e., windows whose sizes are counted in numbers of elements). However, tuple-based windowing is essentially a form of processing-time windowing where elements are assigned monotonically increasing timestamps as they arrive at the system. As such, we won’t discuss tuple-based windowing in detail here (though we will see an example of it in Part 2). (Return)

你可能感兴趣的:(数据技术,batch,big,data)

WPF中的ComboBox控件几种数据绑定的方式互联网打工人no1 wpf c#
一、用字典给ItemsSource赋值（此绑定用的地方很多，建议熟练掌握）在XMAL中：在CS文件中privatevoidBindData(){DictionarydicItem=newDictionary();dicItem.add(1,"北京");dicItem.add(2,"上海");dicItem.add(3,"广州");cmb_list.ItemsSource=dicItem;cmb_l
将cmd中命令输出保存为txt文本文件落难Coder Windows cmd window
最近深度学习本地的训练中我们常常要在命令行中运行自己的代码，无可厚非，我们有必要保存我们的炼丹结果，但是复制命令行输出到txt是非常麻烦的，其实Windows下的命令行为我们提供了相应的操作。其基本的调用格式就是：运行指令>输出到的文件名称或者具体保存路径测试下，我打开cmd并且ping一下百度：pingwww.baidu.com>./data.txt看下相同目录下data.txt的输出：如果你再
Linux MariaDB使用OpenSSL安装SSL证书 Meta39 MySQL Oracle MariaDB Linux Windows ssl linux mariadb
进入到证书存放目录，批量删除.pem证书警告：确保已经进入到证书存放目录find.-typef-iname\*.pem-delete查看是否安装OpenSSLopensslversion没有则安装yuminstallopensslopenssl-devel开启SSL编辑/etc/my.cnf文件（没有的话就创建，但是要注意，在/etc/my.cnf.d/server.cnf配置了datadir的，
网络编程基础记得开心一点啊网络
目录♫什么是网络编程♫Socket套接字♪什么是Socket套接字♪数据报套接字♪流套接字♫数据报套接字通信模型♪数据报套接字通讯模型♪DatagramSocket♪DatagramPacket♪实现UDP的服务端代码♪实现UDP的客户端代码♫流套接字通信模型♪流套接字通讯模型♪ServerSocket♪Socket♪实现TCP的服务端代码♪实现TCP的客户端代码♫什么是网络编程网络编程，指网络上
K近邻算法_分类鸢尾花数据集 _feivirus_ 算法机器学习和数学分类机器学习 K近邻
importnumpyasnpimportpandasaspdfromsklearn.datasetsimportload_irisfromsklearn.model_selectionimporttrain_test_splitfromsklearn.metricsimportaccuracy_score1.数据预处理iris=load_iris()df=pd.DataFrame(data=ir
4.C_数据结构_队列荣世蓥数据结构数据结构
概述什么是队列：队列是限定在两端进行插入操作和删除操作的线性表。具有先入先出(FIFO)的特点相关名词：队尾：写入数据的一段队头：读取数据的一段空队：队列中没有数据，队头指针=队尾指针满队：队列中存满了数据，队尾指针+1=队头指针循环队列1、基本内容循环队列是以数组形式构成的队列数据结构。循环队列的结构体如下：typedefintdata_t;//队列数据类型#defineN64//队列容量typ
vue项目element-ui的table表格单元格合并酋长哈哈 vue.js elementui javascript 前端
一、合并效果二全部代码exportdefault{name:'CellMerge',data(){return{tableData:[{id:'1',name:'王小虎',amount1:'165',amount2:'3.2',amount3:10},{id:'1',name:'王小虎',amount1:'162',amount2:'4.43',amount3:12},{id:'1',name:'
python tif转png Python与遥感 python 开发语言
importosfromosgeoimportgdalimportnumpyasnpfromPILimportImage#提取432三波段fromspectralimport*#输入文件夹路径defget_img(dataset_img):width=dataset_img.RasterXSize#获取行列数height=dataset_img.RasterYSizebands=dataset_i
MongoDB知识概括 GeorgeLin98 持久层 mongodb
MongoDB知识概括MongoDB相关概念单机部署基本常用命令索引-IndexSpirngDataMongoDB集成副本集分片集群安全认证MongoDB相关概念业务应用场景：传统的关系型数据库（如MySQL），在数据操作的“三高”需求以及应对Web2.0的网站需求面前，显得力不从心。解释：“三高”需求：①Highperformance-对数据库高并发读写的需求。②HugeStorage-对海量数
Vue中table合并单元格用法 weixin_30613343 javascript ViewUI
地名结果人名性别{{item.name}}已完成未完成{{item.groups[0].name}}{{item.groups[0].sex}}{{item.groups[son].name}}{{item.groups[son].sex}}exportdefault{data(){return{list:[{name:'地名1',result:'1',groups:[{name:'张三',sex
06选课支付模块之基于消息队列发送支付通知消息 echo 云清学成在线 java rabbitmq 消息队列支付通知学成在线
消息队列发送支付通知消息需求分析订单服务作为通用服务，在订单支付成功后需要将支付结果异步通知给其他对接的微服务，微服务收到支付结果根据订单的类型去更新自己的业务数据技术方案使用消息队列进行异步通知需要保证消息的可靠性即生产端将消息成功通知到服务端：消息发送到交换机-->由交换机发送到队列-->消费者监听队列，收到消息进行处理，参考文章02-使用Docker安装RabbitMQ-CSDN博客生产者确
uniapp map组件自定义markers标记点以对_ uni-app学习记录 uni-app javascript 前端
需求是根据后端返回数据在地图上显示标记点，并且根据数据状态控制标记点颜色，标记点背景通过两张图片实现控制{{item.options.labelName}}exportdefault{data(){return{storeIndex:0,locaInfo:{longitude:120.445172,latitude:36.111387},markers:[//标点列表{id:1,//标记点idin
放松的一天 4da9b7687fa0
20190325总结起床07:20图片发自App睡觉:23:00天气:晴今日任务清单学习·信息·阅读•水滴阅读Day40Alice’sAdventuresinWonderlandChapter6.2图片发自App•BBC跟读训练营Day24图片发自App图片发自App图片发自App•潘多拉口语训练营Day6Wow.Whatabigboy!•文化知识学习今日无•阅读时间地狱健康·饮食·锻炼•饮食目标
博客网站制作教程 2401_85194651 java maven
首先就是技术框架：后端：Java+SpringBoot数据库：MySQL前端：Vue.js数据库连接：JPA(JavaPersistenceAPI)1.项目结构blog-app/├──backend/│├──src/main/java/com/example/blogapp/││├──BlogApplication.java││├──config/│││└──DatabaseConfig.java
vue + Element UI table动态合并单元格我家媳妇儿萌哒哒 element UI vue.js 前端 javascript
一、功能需求1、根据名称相同的合并工作阶段和主要任务合并这两列，但主要任务内容一样，但要考虑主要任务一样，但工作阶段不一样的情况。（枞向合并）2、落实情况里的定量内容和定性内容值一样则合并。（横向合并）二、功能实现exportdefault{data(){return{tableData:[{name:'a',address:'1',age:'1',six:'2'},{name:'a',addre
Python实现TIFF 文件转换为 PNG 和 JPG 格式 sand&wich python 开发语言
在日常的图像处理工作中，可能会遇到需要将TIFF格式的图像转换为其他格式的情况，例如PNG和JPG。下面，本文将介绍如何使用Python和GDAL库实现这一功能。准备工作在开始之前，请确保已经安装了必要的库：GDAL（GeospatialDataAbstractionLibrary）可以使用以下命令安装GDAL：pipinstallgdal代码实现以下是一个将TIFF文件转换为PNG文件的示例代码
如何利用大数据与AI技术革新相亲交友体验 h17711347205 回归算法安全系统架构交友小程序
在数字化时代，大数据和人工智能（AI）技术正逐渐革新相亲交友体验，为寻找爱情的过程带来前所未有的变革（编辑h17711347205）。通过精准分析和智能匹配，这些技术能够极大地提高相亲交友系统的效率和用户体验。大数据的力量大数据技术能够收集和分析用户的行为模式、偏好和互动数据，为相亲交友系统提供丰富的信息资源。通过分析用户的搜索历史、浏览记录和点击行为，系统能够深入了解用户的兴趣和需求，从而提供更
使用datepicker和uploadify的冲突解决（IE双击才能打开附件上传对话框） zhanglb12
在开发的过程当中，IE的兼容无疑是我们的一块绊脚石，在我们使用的如期的datepicker插件和使用上传附件的uploadify插件的时候，两者就产生冲突，只要点击过时间的插件，uploadify上传框要双才能打开ie浏览器提示错误Missinginstancedataforthisdatepicker解决方案//if(.browser.msie&&'9.0'===.browser.version
golang获取用户输入的几种方式余生逆风飞翔 golang 开发语言后端
一、定义结构体typeUserInfostruct{Namestring`json:"name"`Ageint`json:"age"`Addstring`json:"add"`}typeReturnDatastruct{Messagestring`json:"message"`Statusstring`json:"status"`DataUserInfo`json:"data"`}二、get请求的
【Java】已解决：org.springframework.jdbc.datasource.lookup.DataSourceLookupFailureException 屿小夏 java 开发语言
文章目录一、分析问题背景问题背景描述出现问题的场景二、可能出错的原因三、错误代码示例四、正确代码示例五、注意事项已解决：org.springframework.jdbc.datasource.lookup.DataSourceLookupFailureException在使用Spring框架进行开发时，数据源的配置和使用是非常关键的一环。然而，有时候我们可能会遇到org.springframewo
el-table实现全选整表，单元一页复选框功能周bro vue.js elementui javascript 前端
全选整表单选一页0":popper-append-to-body="false":total="tableData.length":page-size="pageObj.pagesize":page-sizes="[10,50,100]"layout="total,sizes,prev,pager,next"@size-change="handleSizeChange"@current-chang
Vue + Express实现一个表单提交九旬大爷的梦
最近在折腾一个cms系统，用的vue+express，但是就一个表单提交就弄了好久，记录一下。环境：Node10+前端：Vue服务端：Express依赖包：vueexpressaxiosexpress-formidableelement-ui（可选）前言：axiosget请求参数是：paramsaxiospost请求参数是：dataexpressget接受参数是req.queryexpresspo
未来软件市场是怎么样的？做开发的生存空间如何？ cesske 软件需求
目录前言一、未来软件市场的发展趋势二、软件开发人员的生存空间前言未来软件市场是怎么样的？做开发的生存空间如何？一、未来软件市场的发展趋势技术趋势：人工智能与机器学习：随着技术的不断成熟，人工智能将在更多领域得到应用，如智能客服、自动驾驶、智能制造等，这将极大地推动软件市场的增长。云计算与大数据：云计算服务将继续普及，大数据技术的应用也将更加广泛。企业将更加依赖云计算和大数据来优化运营、提升效率，并
Kubernetes部署MySQL数据持久化沫殇-MS Kubernetes MySQL数据库 kubernetes mysql 容器
一、安装配置NFS服务端1、安装nfs-kernel-server：sudoapt-yinstallnfs-kernel-server2、服务端创建共享目录#列出所有可用块设备的信息lsblk#格式化磁盘sudomkfs-text4/dev/sdb#创建一个目录：sudomkdir-p/data/nfs/mysql#更改目录权限：sudochown-Rnobody:nogroup/data/nfs
Hadoop架构 henan程序媛 hadoop 大数据分布式
一、案列分析1.1案例概述现在已经进入了大数据(BigData)时代，数以万计用户的互联网服务时时刻刻都在产生大量的交互，要处理的数据量实在是太大了，以传统的数据库技术等其他手段根本无法应对数据处理的实时性、有效性的需求。HDFS顺应时代出现，在解决大数据存储和计算方面有很多的优势。1.2案列前置知识点1.什么是大数据大数据是指无法在一定时间范围内用常规软件工具进行捕捉、管理和处理的大量数据集合，
使用input[type=file]遇上的一些问题刘圣凯
项目遇到一个需要，如下image.png功能大致就是添加图片，展示出来，然后在用户点击提交的时候把图片传给后台，在和后台交涉之后，决定在用户选择图片之后转成formdata传给后台，后台返回一个url，提交的时候将url返回给后台/**转formdata*/varformdata=newFormData();formdata.append("file1",$("#pic")[0].files[0]
详解mybatis的一二级缓存以及缓存失效原因仰望天花板缓存数据库 mybatis java mysql
数据库的大部分场景下是从磁盘读取，如果数据从内存进行读取，速度较比磁盘要快得多。但因为内存的容量有限，所以一般只会把使用和查询较多的数据缓存起来，以便快速反应，其他使用率不太多的继续存放在磁盘。mybatis分为一级缓存和二级缓存1.一级缓存一级缓存存放在SqlSqeeion上，默认开启1.1pojo@DatapublicclassRole{privateLongid;privateStringr
小程序通过js控制页面字体颜色属性祈澈菇凉
需求：当电量少于百分之20的时候，显示电量的字体显示为红色。1：在wxml里面设置属性batStyle：style="{{item.batStyle}}"电量:{{item.battery}}%2：当复合逻辑条件的时候，在js里面carList[i].batStyle="color:red";success:function(res){constcarList=res.data.list;for(
Golang Channel PandaSkr golang
Channel解析1.Channel源码分析1.1Channel数据结构typehchanstruct{qcountuint//channel的元素数量dataqsizuint//channel循环队列长度bufunsafe.Pointer//指向循环队列的指针elemsizeuint16//元素大小closeduint32//channel是否关闭0-未关闭elemtype*_type//元素类
matlab游标标注移动,matlab实现图形窗口的数据游标莫白想 matlab游标标注移动
DatacursorsforfigurewindowSeveralrelatedfunctions:CreateCursorsetsupaverticalcursoronallaxesinafigure.Thecursorscanbemovedaroundusingthemouse.MultiplecursorsaresupportedineachfigureGetCursorLocationre
算法单链的创建与删除换个号韩国红果果 c 算法
先创建结构体 struct student { int data; //int tag;//标记这是第几个 struct student *next; }; // addone 用于将一个数插入已从小到大排好序的链中 struct student *addone(struct student *h,int x){ if(h==NULL) //??????
《大型网站系统与Java中间件实践》第2章读后感白糖_ java中间件
断断续续花了两天时间试读了《大型网站系统与Java中间件实践》的第2章，这章总述了从一个小型单机构建的网站发展到大型网站的演化过程---整个过程会遇到很多困难，但每一个屏障都会有解决方案，最终就是依靠这些个解决方案汇聚到一起组成了一个健壮稳定高效的大型系统。看完整章内容，
zeus持久层spring事务单元测试 deng520159 java DAO spring jdbc
今天把zeus事务单元测试放出来,让大家指出他的毛病, 1.ZeusTransactionTest.java 单元测试 package com.dengliang.zeus.webdemo.test; import java.util.ArrayList; import java.util.List; import org.junit.Test; import
Rss 订阅开发周凡杨 html xml 订阅 rss 规范
RSS是 Really Simple Syndication的缩写（对rss2.0而言，是这三个词的缩写，对rss1.0而言则是RDF Site Summary的缩写，1.0与2.0走的是两个体系）。 RSS
分页查询实现 g21121 分页查询
在查询列表时我们常常会用到分页，分页的好处就是减少数据交换，每次查询一定数量减少数据库压力等等。按实现形式分前台分页和服务器分页：前台分页就是一次查询出所有记录，在页面中用js进行虚拟分页，这种形式在数据量较小时优势比较明显，一次加载就不必再访问服务器了，但当数据量较大时会对页面造成压力，传输速度也会大幅下降。服务器分页就是每次请求相同数量记录，按一定规则排序，每次取一定序号直接的数据
spring jms异步消息处理 510888780 jms
spring JMS对于异步消息处理基本上只需配置下就能进行高效的处理。其核心就是消息侦听器容器，常用的类就是DefaultMessageListenerContainer。该容器可配置侦听器的并发数量，以及配合MessageListenerAdapter使用消息驱动POJO进行消息处理。且消息驱动POJO是放入TaskExecutor中进行处理，进一步提高性能，减少侦听器的阻塞。具体配置如下：
highCharts柱状图布衣凌宇 hightCharts 柱图
第一步：导入 exporting.js,grid.js,highcharts.js;第二步：写controller @Controller@RequestMapping(value="${adminPath}/statistick")public class StatistickController { private UserServi
我的spring学习笔记2-IoC（反向控制依赖注入） aijuans spring mvc Spring 教程 spring3 教程 Spring 入门
IoC（反向控制依赖注入）这是Spring提出来了，这也是Spring一大特色。这里我不用多说，我们看Spring教程就可以了解。当然我们不用Spring也可以用IoC，下面我将介绍不用Spring的IoC。 IoC不是框架，她是java的技术，如今大多数轻量级的容器都会用到IoC技术。这里我就用一个例子来说明：如：程序中有 Mysql.calss 、Oracle.class 、SqlSe
TLS java简单实现 antlove java ssl keystore tls secure
1. SSLServer.java package ssl; import java.io.FileInputStream; import java.io.InputStream; import java.net.ServerSocket; import java.net.Socket; import java.security.KeyStore; import
Zip解压压缩文件百合不是茶 Zip格式解压 Zip流的使用文件解压
ZIP文件的解压缩实质上就是从输入流中读取数据。Java.util.zip包提供了类ZipInputStream来读取ZIP文件,下面的代码段创建了一个输入流来读取ZIP格式的文件; ZipInputStream in = new ZipInputStream(new FileInputStream(zipFileName)); &n
underscore.js 学习（一） bijian1013 JavaScript underscore
工作中需要用到underscore.js，发现这是一个包括了很多基本功能函数的js库，里面有很多实用的函数。而且它没有扩展 javascript的原生对象。主要涉及对Collection、Object、Array、Function的操作。学
java jvm常用命令工具——jstatd命令(Java Statistics Monitoring Daemon) bijian1013 java jvm jstatd
1.介绍 jstatd是一个基于RMI（Remove Method Invocation）的服务程序，它用于监控基于HotSpot的JVM中资源的创建及销毁，并且提供了一个远程接口允许远程的监控工具连接到本地的JVM执行命令。 jstatd是基于RMI的，所以在运行jstatd的服务
【Spring框架三】Spring常用注解之Transactional bit1129 transactional
Spring可以通过注解@Transactional来为业务逻辑层的方法(调用DAO完成持久化动作)添加事务能力，如下是@Transactional注解的定义： /* * Copyright 2002-2010 the original author or authors. * * Licensed under the Apache License, Version
我(程序员)的前进方向 bitray 程序员
作为一个普通的程序员,我一直游走在java语言中,java也确实让我有了很多的体会.不过随着学习的深入,java语言的新技术产生的越来越多,从最初期的javase,我逐渐开始转变到ssh,ssi,这种主流的码农,.过了几天为了解决新问题,webservice的大旗也被我祭出来了,又过了些日子jms架构的activemq也开始必须学习了.再后来开始了一系列技术学习,osgi,restful.....
nginx lua开发经验总结 ronin47
使用nginx lua已经两三个月了，项目接开发完毕了，这几天准备上线并且跟高德地图对接。回顾下来lua在项目中占得必中还是比较大的，跟PHP的占比差不多持平了，因此在开发中遇到一些问题备忘一下 1：content_by_lua中代码容量有限制，一般不要写太多代码，正常编写代码一般在100行左右（具体容量没有细心测哈哈，在4kb左右），如果超出了则重启nginx的时候会报 too long pa
java-66-用递归颠倒一个栈。例如输入栈{1,2,3,4,5}，1在栈顶。颠倒之后的栈为{5,4,3,2,1}，5处在栈顶 bylijinnan java
import java.util.Stack; public class ReverseStackRecursive { /** * Q 66.颠倒栈。 * 题目：用递归颠倒一个栈。例如输入栈{1,2,3,4,5}，1在栈顶。 * 颠倒之后的栈为{5,4,3,2,1}，5处在栈顶。 *1. Pop the top element *2. Revers
正确理解Linux内存占用过高的问题 cfyme linux
Linux开机后，使用top命令查看，4G物理内存发现已使用的多大3.2G，占用率高达80%以上： Mem: 3889836k total, 3341868k used, 547968k free, 286044k buffers Swap: 6127608k total,&nb
[JWFD开源工作流]当前流程引擎设计的一个急需解决的问题 comsci 工作流
当我们的流程引擎进入IRC阶段的时候，当循环反馈模型出现之后，每次循环都会导致一大堆节点内存数据残留在系统内存中，循环的次数越多，这些残留数据将导致系统内存溢出，并使得引擎崩溃。。。。。。而解决办法就是利用汇编语言或者其它系统编程语言，在引擎运行时，把这些残留数据清除掉。
自定义类的equals函数 dai_lm equals
仅作笔记使用 public class VectorQueue { private final Vector<VectorItem> queue; private class VectorItem { private final Object item; private final int quantity; public VectorI
Linux下安装R语言 datageek R语言 linux
命令如下：sudo gedit /etc/apt/sources.list1、deb http://mirrors.ustc.edu.cn/CRAN/bin/linux/ubuntu/ precise/ 2、deb http://dk.archive.ubuntu.com/ubuntu hardy universesudo apt-key adv --keyserver ke
如何修改mysql 并发数(连接数)最大值 dcj3sjt126com mysql
MySQL的连接数最大值跟MySQL没关系，主要看系统和业务逻辑了方法一：进入MYSQL安装目录打开MYSQL配置文件 my.ini 或 my.cnf查找 max_connections=100 修改为 max_connections=1000 服务里重起MYSQL即可　　方法二：MySQL的最大连接数默认是100客户端登录：mysql -uusername -ppass
单一功能原则 dcj3sjt126com 面向对象的程序设计软件设计编程原则
单一功能原则[ 编辑] SOLID 原则单一功能原则开闭原则 Liskov代换原则接口隔离原则依赖反转原则查论编在面向对象编程领域中，单一功能原则（Single responsibility principle）规定每个类都应该有
POJO、VO和JavaBean区别和联系 fanmingxing VO POJO javabean
POJO和JavaBean是我们常见的两个关键字，一般容易混淆，POJO全称是Plain Ordinary Java Object / Plain Old Java Object，中文可以翻译成：普通Java类，具有一部分getter/setter方法的那种类就可以称作POJO，但是JavaBean则比POJO复杂很多，JavaBean是一种组件技术，就好像你做了一个扳子，而这个扳子会在很多地方被
SpringSecurity3.X--LDAP：AD配置 hanqunfeng SpringSecurity
前面介绍过基于本地数据库验证的方式，参考http://hanqunfeng.iteye.com/blog/1155226，这里说一下如何修改为使用AD进行身份验证【只对用户名和密码进行验证，权限依旧存储在本地数据库中】。将配置文件中的如下部分删除：
mac mysql 修改密码 IXHONG mysql
$ sudo /usr/local/mysql/bin/mysqld_safe –user=root & //启动MySQL(也可以通过偏好设置面板来启动)$ sudo /usr/local/mysql/bin/mysqladmin -uroot password yourpassword //设置MySQL密码（注意，这是第一次MySQL密码为空的时候的设置命令，如果是修改密码，还需在-
设计模式--抽象工厂模式 kerryg 设计模式
抽象工厂模式：工厂模式有一个问题就是，类的创建依赖于工厂类，也就是说，如果想要拓展程序，必须对工厂类进行修改，这违背了闭包原则。我们采用抽象工厂模式，创建多个工厂类，这样一旦需要增加新的功能，直接增加新的工厂类就可以了，不需要修改之前的代码。总结：这个模式的好处就是，如果想增加一个功能，就需要做一个实现类，
评"高中女生军训期跳楼” nannan408
首先，先抛出我的观点，各位看官少点砖头。那就是，中国的差异化教育必须做起来。孔圣人有云：有教无类。不同类型的人，都应该有对应的教育方法。目前中国的一体化教育，不知道已经扼杀了多少创造性人才。我们出不了爱迪生，出不了爱因斯坦，很大原因，是我们的培养思路错了，我们是第一要“顺从”。如果不顺从，我们的学校，就会用各种方法，罚站，罚写作业，各种罚。军
scala如何读取和写入文件内容？ qindongliang1922 java jvm scala
直接看如下代码： package file import java.io.RandomAccessFile import java.nio.charset.Charset import scala.io.Source import scala.reflect.io.{File, Path} /** * Created by qindongliang on 2015/
C语言算法之百元买百鸡 qiufeihu c 算法
中国古代数学家张丘建在他的《算经》中提出了一个著名的“百钱买百鸡问题”，鸡翁一，值钱五，鸡母一，值钱三，鸡雏三，值钱一，百钱买百鸡，问翁，母，雏各几何？代码如下： #include <stdio.h> int main() { int cock,hen,chick; /*定义变量为基本整型*/ for(coc
Hadoop集群安全性：Hadoop中Namenode单点故障的解决方案及详细介绍AvatarNode wyz2009107220 NameNode
正如大家所知，NameNode在Hadoop系统中存在单点故障问题，这个对于标榜高可用性的Hadoop来说一直是个软肋。本文讨论一下为了解决这个问题而存在的几个solution。 1. Secondary NameNode 原理：Secondary NN会定期的从NN中读取editlog，与自己存储的Image进行合并形成新的metadata image 优点：Hadoop较早的版本都自带，