Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark

Abstract

With the ubiquity of real-time data, organizations need streaming systems that are scalable, easy to use, and easy to integrate into business applications. Structured Streaming is a new high-level streaming API in Apache Spark based on our experience with Spark Streaming. Structured Streaming differs from other recent streaming APIs, such as Google Dataflow, in two main ways. First, it is a purely declarative API based on automatically incrementalizing a static relational query (expressed using SQL or DataFrames), in contrast to APIs that ask the user to build a DAG of physical operators. Second, Structured Streaming aims to support end-to-end real-time applications that integrate streaming with batch and interactive analysis. We found that this integration was often a key challenge in practice. Structured Streaming achieves high performance via Spark SQL's code generation engine and can outperform Apache Flink by up to 2× and Apache Kafka Streams by 90×. It also offers rich operational features such as rollbacks, code updates, and mixed streaming/batch execution. We describe the system's design and use cases from several hundred production deployments on Databricks, the largest of which process over 1 PB of data per month.

ACM Reference Format:

M. Armbrust et al. 2018. Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. In SIGMOD'18: 2018 International Conference on Management of Data, June 10–15, 2018, Houston, TX, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3183713.3190664

1      Introduction

Many high-volume data sources operate in real time, including sensors, logs from mobile applications, and the Internet of Things. As organizations have gotten better at capturing this data, they also want to process it in real time, whether to give human analysts the freshest possible data or drive automated decisions. Enabling broad access to streaming computation requires systems that are scalable, easy to use and easy to integrate into business applications.

While there has been tremendous progress in distributed stream processing systems in the past few years [2, 15, 17, 27, 32], these systems still remain fairly challenging to use in practice. In this paper, we begin by describing these challenges, based on our experience with Spark Streaming [37], one of the earliest stream processing systems to provide a high-level, functional API. We found that two challenges frequently came up with users. First, streaming systems often ask users to think in terms of complex physical execution concepts, such as at-least-once delivery, state storage and triggering modes, that are unique to streaming. Second, many systems focus only on streaming computation, but in real use cases, streaming is often part of a larger business application that also includes batch analytics, joins with static data, and interactive queries. Integrating streaming systems with these other workloads (e.g., maintaining transactionality) requires significant engineering effort.

Motivated by these challenges, we describe Structured Streaming, a new high-level API for stream processing that was developed in Apache Spark starting in 2016. Structured Streaming builds on many ideas in recent stream processing systems, such as separating processing time from event time and triggers in Google Dataflow [2], using a relational execution engine for performance [12], and offering a language-integrated API [17, 37], but aims to make them simpler to use and integrated with the rest of Apache Spark. Specifically, Structured Streaming differs from other widely used open source streaming APIs in two ways:

•  Incremental query model: Structured Streaming automatically incrementalizes queries on static datasets expressed through Spark's SQL and DataFrame APIs [8], meaning that users typically only need to understand Spark's batch APIs to write a streaming query. Event time concepts are especially easy to express and understand in this model. Although incremental query execution and view maintenance are well studied [11, 24, 29, 38], we believe Structured Streaming is the first effort to adopt them in a widely used open source system. We found that this incremental API generally worked well for both novice and advanced users. For example, advanced users can use a set of stateful processing operators that give fine-grained control to implement custom logic while fitting into the incremental model.

•  Support for end-to-end applications: Structured Streaming's API and built-in connectors make it easy to write code that is “correct by default” when interacting with external systems and can be integrated into larger applications using Spark and other software. Data sources and sinks follow a simple transactional model that enables “exactly-once” computation by default. The incrementalization-based API naturally makes it easy to run a streaming query as a batch job or develop hybrid applications that join streams with static data computed through Spark's batch APIs. In addition, users can manage multiple streaming queries dynamically and run interactive queries on consistent snapshots of stream output, making it possible to write applications that go beyond computing a fixed result to let users refine and drill into streaming data.

Beyond these design decisions, we made several other design choices in Structured Streaming that simplify operation and increase performance. First, Structured Streaming reuses the Spark SQL execution engine [8], including its optimizer and runtime code generator. This leads to high throughput compared to other streaming systems (e.g., 2× the throughput of Apache Flink and 90× that of Apache Kafka Streams in the Yahoo! Streaming Benchmark [14]), as in Trill [12], and also lets Structured Streaming automatically leverage new SQL functionality added to Spark. The engine runs in a microbatch execution mode by default [37], but it can also use low-latency continuous operators for some queries because the API is agnostic to execution strategy [6].

Second, we found that operating a streaming application can be challenging, so we designed the engine to support failures, code updates and recomputation of already outputted data. For example, one common issue is that new data in a stream causes an application to crash, or worse, to output an incorrect result that users do not notice until much later (e.g., due to mis-parsing an input field). In Structured Streaming, each application maintains a write-ahead event log in human-readable JSON format that administrators can use to restart it from an arbitrary point. If the application crashes due to an error in a user-defined function, administrators can update the UDF and restart from where it left off, which happens automatically when the restarted application reads the log. If the application was outputting incorrect data instead, administrators can manually roll it back to a point before the problem started and recompute its results starting from there.

Our team has been running Structured Streaming applications for customers of Databricks' cloud service since 2016, as well as using the system internally, so we end the paper with some example use cases. Production applications range from interactive network security analysis and automated alerts to incremental Extract, Transform and Load (ETL). Users often leverage the design of the engine in interesting ways, e.g., by running a streaming query “discontinuously” as a series of single-microbatch jobs to leverage Structured Streaming's transactional input and output without having to pay for cloud servers running 24/7. The largest customer applications we discuss process over 1 PB of data per month on hundreds of machines. We also show that Structured Streaming outperforms Apache Flink and Kafka Streams by 2× and 90× respectively in the widely used Yahoo! Streaming Benchmark [14].

The rest of this paper is organized as follows. We start by discussing the stream processing challenges reported by users in Section 2. Next, we give an overview of Structured Streaming (Section 3), then describe its API (Section 4), query planning (Section 5), execution (Section 6) and operational features (Section 7). In Section 8, we describe several large use cases at Databricks and its customers. We then measure the system's performance in Section 9, discuss related work in Section 10 and conclude in Section 11.

2      Stream Processing Challenges

Despite extensive progress in the past few years, distributed streaming applications are still generally considered difficult to develop and operate. Before designing Structured Streaming, we spent time discussing these challenges with users and designers of other streaming systems, including Spark Streaming, Truviso, Storm, Dataflow and Flink. This section details the challenges we saw.

2.1        Complex and Low-Level APIs

Streaming systems were invariably considered more difficult to use than batch ones due to complex API semantics. Some complexity is to be expected due to new concerns that arise only in streaming: for example, the user needs to think about what type of intermediate results the system should output before it has received all the data relevant to a particular entity, e.g., to a customer's browsing session on a website. However, other complexity arises due to the low-level nature of many streaming APIs: these APIs often ask users to specify applications at the level of physical operators with complex semantics instead of a more declarative level.

As a concrete example, the Google Dataflow model [2] has a powerful API with a rich set of options for handling event time aggregation, windowing and out-of-order data. However, in this model, users need to specify a windowing mode, triggering mode and trigger refinement mode (essentially, whether the operator outputs deltas or accumulated results) for each aggregation operator. Adding an operator that expects deltas after an aggregation that outputs accumulated results will lead to unexpected results. In essence, the raw API [10] asks the user to write a physical operator graph, not a logical query, so every user of the system needs to understand the intricacies of incremental processing.

Other APIs, such as Spark Streaming [37] and Flink's DataStream API [18], are also based on writing DAGs of physical operators and offer a complex array of options for managing state [20]. In addition, reasoning about applications becomes even more complex in systems that relax exactly-once semantics [32], effectively requiring the user to design and implement a consistency model.

To address this issue, we designed Structured Streaming to make simple applications simple to express using its incremental query model. In addition, we found that adding customizable stateful processing operators to this model still enabled advanced users to build their own processing logic, such as custom session-based windows, while staying within the incremental model (e.g., these same operators also work in batch jobs). Other open source systems have also recently added incremental SQL queries [15, 19], and of course databases have long supported them [11, 24, 29, 38].

2.2         Integration in End-to-End Applications

The second challenge we found was that nearly every streaming workload must run in the context of a larger application, and this integration often requires significant engineering effort. Many streaming APIs focus primarily on reading streaming input from a source and writing streaming output to a sink, but end-to-end business applications need to perform other tasks. Examples include:

(1)  The business purpose of the application may be to enable interactive queries on fresh data. In this case, a streaming job is used to update summary tables in a structured storage system such as an RDBMS or Apache Hive [33]. It is important that when the streaming job updates its result, it does so atomically, so users do not see partial results. This can be difficult with file-based big data systems like Hive, where tables are partitioned across files, or even with parallel loads into a data warehouse.

(2)  An Extract, Transform and Load (ETL) job might need to join a stream with static data loaded from another storage system or transformed using a batch computation. In this case, it is important to be able to reason about consistency across the two systems (e.g., what happens when the static data is updated?), and it is useful to write the whole computation in a single API.

(3)  A team may occasionally need to run its streaming business logic as a batch application, e.g., to backfill a result on old data or test alternate versions of the code. Rewriting the code in a separate system would be time-consuming and error-prone.

We address this challenge by integrating Structured Streaming closely with Spark's batch and interactive APIs.

2.3       Operational Challenges

One of the largest challenges to deploying streaming applications in practice is management and operation. Some key issues include:

•  Failures: This is the most heavily studied issue in the research literature. In addition to single node failures, systems also need to support graceful shutdown and restart of the whole application, e.g., to let operators migrate it to a new cluster.

•  Code Updates: Applications are rarely perfect, so developers may need to update their code. After an update, they may want the application to restart where it left off, or possibly to recompute past results that were erroneous due to a bug. Both cases need to be supported in the streaming system's state management and fault recovery mechanisms. Systems should also support updating the runtime itself (e.g., patching Spark).

•  Rescaling: Applications see varying load over time, and generally increasing load in the long term, so operators may want to scale them up and down dynamically, especially in the cloud. Systems based on a static communication topology, while conceptually simple, are difficult to scale dynamically.

•  Stragglers: Instead of outright failing, nodes in the streaming system can slow down due to hardware or software issues and degrade the throughput of the whole application. Systems should automatically handle this situation.

•  Monitoring: Streaming systems need to give operators clear visibility into system load, backlogs, state size and other metrics.

2.4       Cost and Performance Challenges

Beyond operational and engineering issues, the cost-performance of streaming applications can be an obstacle because these applications run 24/7. For example, without dynamic rescaling, an application will waste resources outside peak hours; and even with rescaling, it may be more expensive to compute a result continuously than to run a periodic batch job. We thus designed Structured Streaming to leverage all the execution optimizations in Spark SQL [8].

So far, we chose to optimize throughput as our main performance metric because we found that it was often the most important metric in large-scale streaming applications. Applications that require a distributed streaming system usually work with large data volumes coming from external sources (e.g., mobile devices, sensors or IoT), where data may already incur a delay just getting to the system. This is one reason why event time processing is an important feature in these systems [2]. In contrast, latency-sensitive applications such as high-frequency trading or physical system control loops often run on a single scale-up processor, or even custom hardware like ASICs and FPGAs [3]. However, we also designed Structured Streaming to support executing over latency-optimized engines, and implemented a continuous processing mode for this task, which we describe in Section 6.3. This is a change over Spark Streaming, where microbatching was “baked into” the API.

Figure 1: The components of Structured Streaming.

3      Structured Streaming Overview

Structured Streaming aims to tackle the stream processing challenges we identified through a combination of API and execution engine design. In this section, we give a brief overview of the overall system. Figure 1 shows Structured Streaming's main components.

Input and Output. Structured Streaming connects to a variety of input sources and output sinks for I/O. To provide “exactly-once” output and fault tolerance, it places two restrictions on sources and sinks, which are similar to other exactly-once systems [17, 37]:

(1)  Input sources must be replayable, allowing the system to re-read recent input data if a node crashes. In practice, organizations use a reliable message bus such as Amazon Kinesis or Apache Kafka [5, 23] for this purpose, or simply a durable file system.

(2)  Output sinks must support idempotent writes, to ensure reliable recovery if a node fails while writing. Structured Streaming can also provide atomic output for certain sinks that support it, where the entire update to the job's output appears atomically even if it was written by multiple nodes working in parallel.
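To make these two requirements concrete, the following is a minimal Scala sketch of a pipeline that reads from a replayable Kafka topic and writes to a file sink whose output can be rewritten idempotently; the broker address, topic name and paths are illustrative assumptions, not taken from the paper:

// A minimal sketch: a replayable source (Kafka) feeding an idempotent sink (files).
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("subscribe", "events")
  .load()

events.writeStream
  .format("parquet")                                   // file sinks can be rewritten idempotently
  .option("checkpointLocation", "/checkpoints/events")
  .start("/out")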

In addition to external systems, Structured Streaming also supports input and output from tables in Spark SQL. For example, users can compute a static table from any of Spark's batch input sources and join it with a stream, or ask Structured Streaming to output to an in-memory Spark table that users can query interactively.

API. Users program Structured Streaming by writing a query against one or more streams and tables using Spark SQL's batch APIs: SQL and DataFrames [8]. This query defines an output table that the user wants to compute, assuming that each input stream is replaced by a table holding all the data received from that stream so far. The engine then determines how to compute and write this output table into a sink incrementally, using similar techniques to incremental view maintenance [11, 29]. Different sinks also support different output modes, which determine how the system may write out its results: for example, some sinks are append-only by nature, while others allow updating records in place by key.

To support streaming specifically, Structured Streaming also adds several API features that fit in the existing Spark SQL API:

(1)  Triggers control how often the engine will attempt to compute a new result and update the output sink, as in Dataflow [2].

(2)  Users can mark a column as denoting event time (a timestamp set at the data source), and set a watermark policy to determine when enough data has been received to output a result for a specific event time, as in [2].

(3)  Stateful operators allow users to track and update mutable state by key in order to implement complex processing, such as custom session-based windows. These are similar to Spark Streaming's updateStateByKey API [37].

Note that windowing, another key feature for streaming, is done using Spark SQL's existing aggregation APIs. In addition, all the new APIs added by Structured Streaming also work in batch jobs.

Execution. Once it has received a query, Structured Streaming optimizes it, incrementalizes it, and begins executing it. By default, the system uses a microbatch model similar to Discretized Streams in Spark Streaming, which supports dynamic load balancing, rescaling, fault recovery and straggler mitigation by dividing work into small tasks [37]. In addition, it can use a continuous processing mode based on traditional long-running operators (Section 6.3).

In both cases, Structured Streaming uses two forms of durable storage to achieve fault tolerance. First, a write-ahead log keeps track of which data has been processed and reliably written to the output sink from each input source. For some output sinks, this log can be integrated with the sink to make updates to the sink atomic. Second, the system uses a larger-scale state store to hold snapshots of operator states for long-running aggregation operators. These are written asynchronously, and may be “behind” the latest data written to the output sink. The system will automatically track which state it has last updated in its log, and recompute state starting from that point in the data on failure. Both the log and state store can run over pluggable storage systems (e.g., HDFS or S3).

Operational Features. Using the durability of the write-ahead log and state store, users can achieve several forms of rollback and recovery. An entire Structured Streaming application can be shut down and restarted on new hardware. Running applications also tolerate node crashes, additions and stragglers automatically, by sending tasks to new nodes. For code updates to UDFs, it is sufficient to stop and restart the application, and it will begin using the new code. In addition, users can manually roll back the application to a previous point in the log and redo the part of the computation starting then, beginning from an older snapshot of the state store.

In addition, Structured Streaming's ability to execute with microbatches lets it “adaptively batch” data so that it can quickly catch up with input data if the load spikes or if a job is rolled back, then return to low latency later. This makes operation significantly simpler (e.g., users can safely update job code more often).

The next sections go into detail about Structured Streaming's API (§4), query planning (§5) and job execution and operation (§6).

4      Programming Model

Structured Streaming combines elements of Google Dataflow [2], incremental queries [11, 29, 38] and Spark Streaming [37] to enable stream processing beneath the Spark SQL API. In this section, we start by showing a short example, then describe the semantics of the model and the streaming-specific operators we added in Spark SQL to support streaming use cases (e.g., stateful operators).

4.1      A Short Example

Structured Streaming operates within Spark's structured data APIs: SQL, DataFrames and Datasets [8]. The main abstraction users work with is tables (represented by the DataFrame or Dataset classes), which each represent a view to be computed from input sources to the system.[1] When users create a table/DataFrame from a streaming input source and attempt to compute it, Spark will automatically launch a streaming computation.

As a simple example, let us start with a batch job that counts clicks by country of origin for a web application. Suppose that the input data is JSON files and the output should be Parquet. This job can be written with Spark DataFrames in Scala as follows:

// Define a DataFrame to read from static data
data = spark.read.format("json").load("/in")

// Transform it to compute a result
counts = data.groupBy($"country").count()

// Write to a static data sink
counts.write.format("parquet").save("/counts")

Changing this job to use Structured Streaming only requires modifying the input and output sources, not the transformation in the middle. For example, if new JSON files are going to continually be uploaded to the /in directory, we can modify our job to continually update /counts by changing only the first and last lines:

// Define a DataFrame to read streaming data
data = spark.readStream.format("json").load("/in")

// Transform it to compute a result
counts = data.groupBy($"country").count()

// Write to a streaming data sink
counts.writeStream.format("parquet")
  .outputMode("complete").start("/counts")

The output mode parameter on the sink here specifies how Structured Streaming should update the sink. In this case, the complete mode means to write a complete result file for each update, because the file output sink chosen does not support fine-grained updates. However, other sinks, such as key-value stores, support additional output modes (e.g., updating just the changed keys).

Under the hood, Structured Streaming will automatically incrementalize the query specified by the transformation(s) from input sources to data sinks, and execute it in a streaming fashion. The engine will also automatically maintain state and checkpoint it to external storage as needed: in this case, for example, we have a running count aggregation since the start of the stream, so the engine will keep track of the running counts for each country.

Finally, the API also naturally supports windowing and event time through Spark SQL's existing aggregation operators. For example, instead of counting data by country, we could count it in 1-hour sliding windows advancing every 5 minutes by changing just the middle line of the computation as follows:

// Count events by windows on the "time" field
data.groupBy(window($"time", "1h", "5min")).count()

The time field here (event time) is just a field in the data, similar to country earlier. Users can also set a watermark on this field to let the system forget state for old windows after a timeout (§4.3.1).

4.2      Programming Model Semantics

Formally, we define the semantics of Structured Streaming's programming model as follows:

(1)  Each input source provides a partially ordered set of records over time. We assume partial orders here because some message bus systems are parallel and do not define a total order across records; for example, Kafka divides streams into “partitions” that are each ordered.

(2)  The user provides a query to execute across the input data that can output a result table at any given point in processing time. Structured Streaming will always produce results consistent with running this query on a prefix of the data in all input sources. That is, it will never show results that incorporate one input record but do not incorporate its ancestors in the partial order. Moreover, these prefixes will be increasing over time.

(3)  Triggers tell the system when to run a new incremental computation and update the result table. For example, in microbatch mode, the user may wish to trigger an incremental update every minute (in processing time); a short sketch combining a trigger with an output mode follows this list.

(4)  The sink's output mode specifies how the result table is written to the output system. The engine supports three distinct modes:

•  Complete: The engine writes the whole result table at once, e.g., replacing a whole file in HDFS with a new version. This is of course inefficient when the result is large.

•  Append: The engine can only add records to the sink. For example, a map-only job on a set of input files results in monotonically increasing output.

•  Update: The engine updates the sink in place based on a key for each record, updating only keys whose values changed.
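As a combined sketch of items (3) and (4), the query from Section 4.1 can be given an explicit trigger and output mode when it is started; the one-minute interval and the paths are illustrative choices:

import org.apache.spark.sql.streaming.Trigger

counts.writeStream
  .format("parquet")
  .outputMode("complete")                        // write the whole result table on each trigger
  .trigger(Trigger.ProcessingTime("1 minute"))   // recompute once per minute of processing time
  .start("/counts")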

Figure 2 illustrates the model visually. One attractive property of the model is that the contents of the result table (which is logically just a view that need never be materialized) are defined independently of the output mode (whether we output the whole table on every trigger, or only deltas). In contrast, APIs such as Dataflow require the equivalent of an output mode on every operator, so users must plan the whole operator DAG keeping in mind whether each operator is outputting complete results or positive or negative deltas, effectively incrementalizing the query by hand.

Figure 2: Structured Streaming's semantics for two output modes. Logically, all input data received up to a point in processing time is viewed as a large input table, and the user provides a query that defines a result table based on this input. Physically, Structured Streaming computes changes to the result table incrementally (without having to store all input data) and outputs results based on its output mode. For complete mode, it outputs the whole result table (left), while for append mode, it only outputs newly added records (right).

A second attractive property is that the model has strong consistency semantics, which we call prefix consistency. First, it guarantees that when input records are relatively ordered within a source (e.g., log records from the same device), the system will only produce results that incorporate them in the same order (e.g., never skipping a record). Second, because the result table is defined based on all data in the input prefix at once, we know that all rows in the result table reflect all input records. In contrast, in some systems based on message-passing between nodes, the node that receives a record might send an update to two downstream nodes, but there is no guarantee that the outputs from these two are synchronized. Prefix consistency also makes operation easier, as users can roll back the system to a specific point in the write-ahead log (i.e., a specific prefix of the data) and recompute outputs from that point.

In summary, with the Structured Streaming model, as long as users understand a regular Spark or DataFrame query, they can understand the content of the result table for their job and the values that will be written to the sink. Users need not worry about consistency, failures or incorrect processing orders.

Finally, the reader might notice that some of the output modes we defined are incompatible with certain types of query. For example, suppose we are aggregating counts by country, as in our code example in the previous section, and we want to use the append output mode. There is no way for the system to guarantee it has stopped receiving records for a given country, so this combination of query and output mode will not be allowed by the system. We describe which combinations are allowed in Section 5.1.

4.3      Streaming-Specific Operators

Many Structured Streaming queries can be written using just the standard operators in Spark SQL, such as selection, aggregation and joins. However, to support some requirements unique to streaming, we added two new types of operators to Spark SQL: watermarking operators tell the system when to “close” an event time window and output results or forget state, and stateful operators let users write custom logic to implement complex processing. Crucially, both of these new operators still fit in Structured Streaming's incremental semantics (§4.2), and both can also be used in batch jobs.

4.3.1 Event Time Watermarks. From a logical point of view, the key idea in event time is to treat application-specified timestamps as an arbitrary field in the data, allowing records to arrive out-of-order [2, 24]. We can then use standard operators and incremental processing to update results grouped by event time. In practice, however, it is useful for the processing system to have some loose bounds on how late data can arrive, for two reasons:

(1)  Allowing arbitrarily late data might require storing arbitrarily large state. For example, if we count data by 1-minute event time window, the system needs to remember a count for every 1-minute window since the application began, because a late record might still arrive for any particular minute. This can quickly lead to large amounts of state, especially if combined with another grouping key. The same issue happens with joins.

(2)  Some sinks do not support data retraction, making it useful to be able to write the results for a given event time after a timeout. For example, custom downstream applications want to start working with a “final” result and might not support retractions. Append-mode sinks also do not support retractions.

Structured Streaming lets developers set a watermark [2] for event time columns using the withWatermark operator. This operator gives the system a delay threshold tC for a given timestamp column C. At any point in time, the watermark for C is max(C) − tC, that is, tC seconds before the maximum event time seen so far in C. Note that this choice of watermark is naturally robust to backlogged data: if the system cannot keep up with the input rate for a period of time, the watermark will not move forward arbitrarily during that time, and all events that arrived within at most tC seconds of being produced will still be processed.

When present, watermarks affect when stateful operators can forget old state (e.g., if grouping by a window derived from a watermarked column), and when Structured Streaming will output data with an event time key to append-mode sinks. Different input streams can have different watermarks.
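For example, a minimal sketch that extends the windowed count from Section 4.1 with a watermark; the 10-minute delay threshold is an illustrative choice:

// Count events in 1-hour windows sliding every 5 minutes, and allow the engine
// to drop a window's state once event time moves 10 minutes past its end.
val counts = data
  .withWatermark("time", "10 minutes")
  .groupBy(window($"time", "1h", "5min"))
  .count()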

4.3.2 Stateful Operators. For developers who want to write custom stream processing logic, Structured Streaming's stateful operators are “UDFs with state” that give users control over the computation while fitting into Structured Streaming's semantics and fault tolerance mechanisms. There are two stateful operators, mapGroupsWithState and flatMapGroupsWithState. Both operators act on data that has been assigned a key using groupByKey, and let the developers track and update a state for each key using custom logic, as well as output records for each key. They are closely based on Spark Streaming's updateStateByKey operator [37].

The mapGroupsWithState operator, on a grouped dataset with keys of type K and values of type V, takes in a user-defined update function with the following arguments:

•  key of type K
•  newValues of type Iterator[V]
•  state of type GroupState[S], where S is a user-specified class.

// Define an update function that simply tracks the
// number of events for each key as its state, returns
// that as its result, and times out keys after 30 min.
def updateFunc(key: UserId, newValues: Iterator[Event],
    state: GroupState[Int]): Int = {
  val totalEvents = state.get() + newValues.size()
  state.update(totalEvents)
  state.setTimeoutDuration("30 min")
  return totalEvents
}

// Use this update function on a stream, returning a
// new table lens that contains the session lengths.
lens = events.groupByKey(event => event.userId)
  .mapGroupsWithState(updateFunc)

Figure 3: Using mapGroupsWithState to track the number of events per session, timing out sessions after 30 minutes.

The operator will invoke this function whenever one or more new values are received for a key. On each call, the function receives all of the values that were received for that key since the last call (multiple values may be batched for efficiency). It also receives a state object that wraps around a user-defined data type S, and allows the user to update the state, drop this key from state tracking, or set a timeout for this specific key (either in event time or processing time). This allows the user to store arbitrary data for the key, as well as implement custom logic for dropping state (e.g., custom exit conditions when implementing session-based windows).

Finally, the update function returns a user-specified return type R for its key. The return value of mapGroupsWithState is a new table with the final R record outputted for each group in the data (when the group is closed or times out). For example, the developer may wish to track user sessions on a website using mapGroupsWithState, and output the total number of pages clicked for each session.

To illustrate, Figure 3 shows how to use mapGroupsWithState to track user sessions, where a session is defined as a series of events with the same userId and gaps less than 30 minutes between them. We output the final number of events in each session as our return value R. A job could then compute metrics such as the average number of events per session by aggregating the result table lens.

The second stateful operator, flatMapGroupsWithState, is very similar to mapGroupsWithState, except that the update function can return zero or more values of type R per update instead of one. For example, this operator could be used to manually implement a stream-to-table join. The return values can either be returned all at once, when the group is closed, or incrementally across calls to the update function. Both operators also work in batch mode, in which case the update function will only be called once.
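As a rough sketch of the difference, the following reuses the hypothetical UserId and Event types from Figure 3 and emits an updated running count per key on every call rather than a single final value; the explicit output mode and timeout arguments reflect the Scala signature of flatMapGroupsWithState as we understand it, and the example is an illustration rather than code from the paper:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Emit a running event count for each user on every invocation, instead of
// one final value per key when the group closes.
def emitCounts(key: UserId, newValues: Iterator[Event],
    state: GroupState[Int]): Iterator[(UserId, Int)] = {
  val total = state.getOption.getOrElse(0) + newValues.size
  state.update(total)
  Iterator((key, total))   // zero or more records may be returned per call
}

val updates = events.groupByKey(event => event.userId)
  .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.NoTimeout())(emitCounts)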

5      Query Planning

We implemented Structured Streaming's query planning using the Catalyst extensible optimizer in Spark SQL [8], which allows writing composable rules using pattern matching in Scala. Query planning proceeds in three stages: analysis to determine whether the query is valid, incrementalization and optimization.

5.1     Analysis

The first stage of query planning is analysis, where the engine validates the user's query and resolves the attributes and data types referred to in the query. Structured Streaming uses Spark SQL's existing analysis passes to resolve attributes and types, but adds new rules to check that the query can be executed incrementally by the engine. It also checks that the user's chosen output mode is valid for this specific query. For example, the Append output mode can only be used with queries whose output is monotonic [4]: that is, where a given output record will not be removed once it is written. In this mode, only selections, joins, and aggregations over keys that include event time are allowed (in which case the engine will only output the value for a given event time once its watermark has passed). Similarly, in the Complete output mode, where the whole output table needs to be written on each trigger, Structured Streaming only permits aggregation queries where the amount of state that needs to be tracked is proportional to the number of keys in the result. A full description of the supported modes is available in the Structured Streaming documentation [31].
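To make the monotonicity rule concrete, a minimal sketch contrasting a rejected and an accepted use of the Append mode, reusing the queries from Section 4; the outcome is decided at analysis time, and the comments only restate the rule above:

// Rejected: counts by country can change as new data arrives, so the query
// is not monotonic in Append mode and fails analysis.
counts.writeStream.format("parquet").outputMode("append").start("/counts")

// Accepted: once the watermark passes a window's end, its count is final,
// so the windowed aggregate can safely be appended.
val windowed = data
  .withWatermark("time", "10 minutes")
  .groupBy(window($"time", "1h"))
  .count()
windowed.writeStream.format("parquet").outputMode("append").start("/windowCounts")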

5.2      Incrementalization

The next step of the query planning process is incrementalizing the static query provided by the user to efficiently update results in response to new data. In general, Structured Streaming's incrementalizer aims to ensure that the query's result can be updated in time proportional to the amount of new data received before each trigger or to the amount of new rows that have to be produced, without a dependence on the total amount of data received so far.

The engine can incrementalize a restricted, but growing, class of queries. As of Spark 2.3.0, the supported queries can contain:

•  Any number of selections, projections and SELECT DISTINCTs.

•  Inner, left-outer and right-outer joins between a stream and a table or between two streams. For outer joins against a stream, the join condition must involve a watermarked column.

•  Stateful operators like mapGroupsWithState (§4.3.2).

•  Up to one aggregation (possibly on compound keys).

•  Sorting after an aggregation, only in complete output mode.

The engine uses Catalyst transformation rules to map these supported queries into trees of physical operators that perform both computation and state management. For example, an aggregation in the user query might be mapped to a StatefulAggregate operator that tracks open groups inside Structured Streaming's state store (§6.1) and outputs the desired result. Internally, Structured Streaming also tracks an output mode for each physical operator in the DAG produced during incrementalization, similar to the refinement mode for aggregation operators in Dataflow [2]. For example, some operators may update emitted records (equivalent to update mode), while others may only emit new records (append mode). Crucially, in Structured Streaming, users do not have to specify these intra-DAG modes manually.

Incrementalization is an active area of work in Structured Streaming, but we have found that even the restricted set of queries available today is suitable for many use cases (§8). In other cases, users have leveraged Structured Streaming's stateful operators (§4.3.2) to implement custom incremental processing logic that maintains state of their choice. We expect to add more advanced automatic incrementalization techniques into the engine over time.

Figure 4: State management during the execution of Structured Streaming. Input operators are responsible for defining epochs in each input source and saving information about them (e.g., offsets) reliably in the write-ahead log. Stateful operators also checkpoint state asynchronously, marking it with its epoch, but this does not need to happen on every epoch. Finally, output operators log which epochs' outputs have been reliably committed to the idempotent output sink; the very last epoch may be rewritten on failure.

5.3      Query Optimization

The final stage of planning is optimization. Structured Streaming applies most of the optimization rules in Spark SQL [8], such as predicate pushdown, projection pushdown, expression simplification and others. In addition, it uses Spark SQL's Tungsten binary format for data in memory (avoiding the overhead of Java objects), and its runtime code generator to compile chains of operators to Java bytecode that runs over this format. This design means that most of the work in logical and execution optimization for analytical workloads in Spark SQL automatically applies to streaming.

6      Application Execution

The final component of Structured Streaming is its execution strategy. In this section, we describe how the engine tracks state, and then the two execution modes: microbatching via fine-grained tasks and continuous processing using long-lived operators. We then discuss operational features to simplify managing and deploying Structured Streaming applications.

6.1       State Management and Recovery

At a high level, Structured Streaming tracks state in a manner similar to Spark Streaming [37], in both its microbatch and continuous modes. The state of an application is tracked using two external storage systems: a write-ahead log that supports durable, atomic writes at low latency, and a state store that can store larger amounts of data durably and allows parallel access (e.g., S3 or HDFS). Structured Streaming uses these systems together to recover on failure.

The engine places two requirements on input sources and output sinks to provide fault tolerance. First, input sources should be replayable, i.e., allow re-reading recent data using some form of identifier, such as a stream offset. Durable message bus systems like Kafka and Kinesis meet this need. Second, output sinks should be idempotent, allowing Structured Streaming to rewrite some already written data on failure. Sinks can implement this in different ways.

Given these properties, Structured Streaming performs state tracking using the following mechanism, as shown in Figure 4:

(1)  As input operators read data, the master node of the Spark application defines epochs based on offsets in each input source. For example, Kafka and Kinesis present topics as a series of partitions, each of which are byte streams, and allow reading data using offsets in these partitions. The master writes the start and end offsets of each epoch durably to the log.

(2)  Any operators requiring state checkpoint their state periodically and asynchronously to the state store, using incremental checkpoints when possible. They store the epoch ID along with each checkpoint written. These checkpoints do not need to happen on every epoch or to block processing.[2]

(3)  Output operators write the epochs they committed to the log. The master waits for all nodes running an operator to report a commit for a given epoch before allowing commits for the next epoch. Depending on the sink, the master can also run an operation to finalize the writes from multiple nodes if the sink supports this. This means that if the streaming application fails, only one epoch may be partially written.[3]

(4)  Upon recovery, the new instance of the application starts by reading the log to find the last epoch that has not been committed to the sink, including its start and end offsets. It then uses the offsets of earlier epochs to reconstruct the application's in-memory state from the last epoch written to the state store. This just requires loading the old state and running those epochs with the same offsets while disabling output. Finally, the system reruns the last epoch and relies on the sink's idempotence to write its results, then starts defining new epochs.

Finally, all of the state management in this design is transparent to user code. Both the aggregation operators and custom stateful processing operators (e.g., mapGroupsWithState) automatically checkpoint state to the state store, without requiring custom code to do it. The user's data types only need to be serializable.

6.2       Microbatch Execution Mode

Structured Streaming jobs can execute in two modes: microbatching or continuous operators. The microbatch mode uses the discretized streams execution model from Spark Streaming [37], and inherits its benefits, such as dynamic load balancing, rescaling, straggler mitigation and fault recovery without whole-system rollback.

In this mode, epochs are typically set to be a few hundred milliseconds to a few seconds, and each epoch executes as a traditional Spark job composed of a DAG of independent tasks [36]. For example, a query doing selection followed by stateful aggregation might execute as a set of “map” tasks for the selection and “reduce” tasks for the aggregation, where the reduce tasks track state in memory on worker nodes and periodically checkpoint it to the state store. As in Spark Streaming, this mode provides the following benefits:

•  Dynamic load balancing: Each operator's work is divided into small, independent tasks that can be scheduled on any node, so the system can automatically balance these across nodes if some are executing slower than others.

•  Fine-grained fault recovery: If a node fails, only its tasks need to be rerun, instead of having to roll back the whole cluster to a checkpoint as in most systems based on topologies of long-lived operators. Moreover, the lost tasks can be rerun in parallel, further reducing recovery time [37].

•  Straggler mitigation: Spark will launch backup copies of slow tasks as it does in batch jobs, and downstream tasks will simply use the output from whichever copy finishes first.

•  Rescaling: Adding or removing a node is simple, as tasks will automatically be scheduled on all the available nodes.

•  Scale and throughput: Because this mode reuses Spark's batch execution engine, it inherits all the optimizations in this engine, such as a high-performance shuffle implementation [34] and the ability to run on thousands of nodes.

The main disadvantage of this mode is a higher minimum latency, as there is overhead to launching a DAG of tasks in Spark. In practice, however, latencies of a few seconds are achievable even on large clusters running multi-step computations. Depending on the application, these are on a similar time scale to data collection and alerting systems.

6.3       Continuous Processing Mode

A new continuous processing mode added in Apache Spark 2.3 [6] executes Structured Streaming jobs using long-lived operators, as in traditional streaming systems such as Telegraph and Borealis [1, 13]. This mode enables lower latency at the cost of less operational flexibility (e.g., limited support for rescaling the job at runtime).

The key enabler for this execution mode was choosing a declarative API for Structured Streaming that is not tied to the execution strategy. For example, the original Spark Streaming API had some operators based on processing time that leaked the concept of microbatches into the programming model, making it hard to move programs to another type of engine. In contrast, Structured Streaming's API and semantics are independent of the execution engine: continuous execution is similar to having a much larger number of triggers. Note that unlike systems based purely on unsynchronized message passing, such as Storm [32], we do retain the concept of triggers and epochs in this mode, so the output from multiple nodes can be coordinated and committed together to the sink.

Because the API supports fine-grained execution, Structured Streaming jobs could theoretically run on any existing distributed streaming engine design [1, 13, 17]. In continuous processing, we built a simple continuous operator engine that lives inside Spark and can reuse Spark's scheduling infrastructure and per-node operators (e.g., code-generated operators). The first version released in Spark 2.3.0 only supports “map-like” jobs (i.e., no shuffle operations), which were one of the most common scenarios where users wanted lower latency, but the design can be extended to support shuffles.

Compared to microbatch execution, there are two differences when using continuous processing:

(1)  The master launches long-running tasks on each partition using Spark's scheduler that each read one partition of the input source (e.g., a Kinesis stream) but execute multiple epochs. If one of these tasks fails, Spark will simply relaunch it.

(2)  Epochs are coordinated differently. The master periodically tells nodes to start a new epoch, and receives a start offset for the epoch on each input partition, which it inserts into the write-ahead log. When it asks them to start the next epoch, it also receives end offsets for the previous one, writes these to the log, and tells nodes to commit the epoch when it has written all the end offsets. Thus, the master is not on the critical path for inspecting all the input sources and defining start/end offsets.

We found that the most common use case where organizations wanted low latency and the scale of a distributed processing engine was “stream to stream” map operations to transform data before it is used in other streaming applications. For example, an organization might upload events to Kafka, run some simple ETL transformations as a streaming job, and write the transformed data to Kafka again for consumption by other streaming applications. In this type of design, each transformation job will add latency to all downstream steps, so organizations wish to minimize this latency.

7      Operational Features

We used several properties of our execution strategy and API to design a number of operational features in Structured Streaming that tackle common problems in deployments. Perhaps most importantly across these features, we aimed to make both Structured Streaming's semantics and its fault tolerance model easy to understand. With a simple design, operators can form an accurate model of how the system runs and what various actions will do to it.

7.1     Code Updates

Developers can update User-Defined Functions (UDFs) in their program and simply restart the application to use the new version of the code. For example, if a UDF is crashing on a particular input record, that epoch of processing will fail, so the developer can update the code and restart the application again to continue processing. This also applies to stateful operator UDFs, which can be updated as long as they retain the same schema for their state objects. We also designed Spark's log and state store formats to be binary compatible across Spark framework updates.

7.2      Manual Rollback

Sometimes, an application outputs wrong results for some time before a user notices: for example, a field that fails to parse might simply be reported as NULL. Therefore, rollbacks are a fact of life for many operators. In Structured Streaming, it is easy to determine which records went into each epoch from the write-ahead log and roll back the application to the epoch where a problem started occurring. We chose to store the write-ahead log as JSON to let administrators perform these operations manually.[3] As long as the input sources and state store still have data from the failed epoch, the job can start again from a previous point. Message buses like Kafka are typically configured for several weeks of retention, so rollbacks are often possible.

Manual rollbacks interact well with Structured Streaming's prefix consistency guarantee for execution semantics (§4.2). Specifically, when an administrator rolls back the job to a point in the write-ahead log, she knows which prefix of the input streams this point corresponds to, and the job can recompute output from that point on while retaining consistency within the new output. Beyond this guarantee, Structured Streaming's support for running the same code as a batch job and for rescaling means that administrators can run the recovery on a temporarily larger cluster to catch up quickly, further reducing the operational complexity of manual rollbacks.

7.3        Hybrid Batch and Streaming Execution

The most obvious benefit of Structured Streaming's unified API is that users can share code between batch and streaming jobs, or run the same program as a batch job for testing. However, we have also found this useful for purely streaming scenarios in two ways:

•  “Run-once” triggers for cost savings: Many Databricks customers wanted the transactionality and state management properties of a streaming engine without running servers 24/7. Virtually all ETL workloads require tracking how far in the input one has gotten and which results have been saved reliably, which can be difficult to implement by hand. These functions are exactly what Structured Streaming's state management provides. Thus, several customers implemented ETL jobs by running a single epoch of a Structured Streaming job every few hours as a batch computation, using the provided “run once” trigger that was originally designed for testing (a short sketch follows this list). This leads to significant cost savings (in one case, up to 10× [35]) for lower-volume applications. With all the major cloud providers now supporting per-second or per-minute billing [9], we believe this type of “discontinuous processing” will become more common.

•  Adaptive batching: Even streaming applications occasionally experience large backlogs. For example, a link between two datacenters might go down, temporarily delaying data transfer, or there might simply be a spike in user activity. In these cases, Structured Streaming will automatically execute longer epochs in order to catch up with the input streams, often achieving similar throughput to Spark's batch jobs. This will not greatly increase latency, given that data is already backlogged, but will let the system catch up faster. In cloud environments, operators can also add extra nodes to the cluster temporarily.
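A minimal sketch of the run-once pattern, using the query from Section 4.1; an external scheduler relaunches this program every few hours, and the paths are illustrative:

import org.apache.spark.sql.streaming.Trigger

// Process whatever input has arrived since the last run as a single epoch,
// record progress in the checkpoint, and stop.
counts.writeStream
  .format("parquet")
  .outputMode("complete")
  .trigger(Trigger.Once())
  .option("checkpointLocation", "/checkpoints/counts")
  .start("/counts")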

7.4     Monitoring

Structured Streaming uses Spark's existing metrics API and structured event log to report information such as the number of records processed, bytes shuffled across the network, etc. These interfaces are familiar to operators and easy to connect to a variety of UI tools.

7.5     Fault and Straggler Recovery

As discussed in §6.2, Structured Streaming's microbatch mode can recover from node failures, stragglers and load imbalances using Spark's fine-grained task execution model. The continuous processing mode recovers from node failures, but does not yet protect against stragglers or load imbalance.

Figure 5: Information security platform use case. Using Structured Streaming and Spark SQL, a team of analysts can query both streaming and historical data and easily install queries for new attack patterns as streaming alerts.

8      Production Use Cases

We have supported Structured Streaming on Databricks' managed cloud service [16] since 2016, and today, our cloud is running hundreds of production streaming applications at a given time (i.e., applications running 24/7). The largest of these applications ingest over 1 PB of data per month and run on hundreds of servers. We also use Structured Streaming internally to monitor our services, including the execution of Structured Streaming itself. In this section, we describe three customer workloads that leverage various aspects of Structured Streaming, as well as our internal use case.

8.1        Information Security Platform

A large customer has used Structured Streaming to develop a large-scale security platform to enable over 100 analysts to scour through network traffic logs to quickly identify and respond to security incidents, as well as to generate automated alerts. This platform combines streaming with batch and interactive queries and is thus a great example of the system's support for end-to-end applications.

Figure 5 shows the architecture of the platform. Intrusion Detection Systems (IDSes) monitor all the network traffic in the organization and output logs to S3. From here, a Structured Streaming job ETLs these logs into a compact Apache Parquet based table stored on Databricks Delta [7] to enable fast and concurrent access from multiple downstream applications. Other Structured Streaming jobs then process these logs to produce additional tables (e.g., by joining them with other data). Analysts query these tables interactively, using SQL or DataFrames, to detect and diagnose new attack patterns. If they identify a compromise, they also look back through historical data to trace previous actions from that attacker. Finally, in parallel, the Parquet logs are processed by another Structured Streaming cluster that generates real-time alerts based on pre-written rules.

The key challenges in realizing this platform are (1) building a robust and scalable streaming pipeline, while (2) providing the analysts with an effective environment to query both fresh and historical data. Using standard tools and services available on AWS, a team of 20 people took over six months to build and deploy a previous version of this platform in production. This previous version had several limitations, including only being able to store a small amount of data for historical queries because it used a traditional data warehouse for the interactive queries. In contrast, a team of five engineers was able to reimplement the platform using Structured Streaming in two weeks. The new platform was simultaneously more scalable and able to support more complex analysis using Spark’s ML APIs. Next, we provide a few examples to illustrate the advantages of Structured Streaming that made this possible.

First, Structured Streaming’s ability to adaptively vary the batch size enabled the developers to build a streaming pipeline that deals not only with spikes in the workload, but also with failures and code upgrades. Consider a streaming job that goes offline either due to failure or upgrades. When the cluster is brought back online, it automatically starts processing the data from the moment it went offline. Initially, the cluster uses large batches to maximize throughput. Once it catches up, it switches to small batches for low latency. This allows administrators to regularly upgrade clusters without the fear of excessive downtime.

Second, the ability to join a stream with other streams, as well as with historical tables, has considerably simplified the analysis. Consider the simple task of figuring out which device a TCP connection originates from. It turns out that this task is challenging in the presence of mobile devices, as these devices are given dynamic IP addresses every time they join the network. Hence, from TCP logs alone, it is not possible to track down the end-points of a connection. With Structured Streaming, an analyst can easily solve this problem: she can simply join the TCP logs with DHCP logs to map the IP address to the MAC address, and then use the organization’s internal database of network devices to map the MAC address to a particular machine and user. In addition, users were able to do this join in real time using stateful operators as both the TCP and DHCP logs were being streamed in.
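A minimal sketch of this pattern is shown below; the paths, column names, watermark intervals, and time-range condition are illustrative, since the real log schemas are specific to the deployment. It joins the two live streams on IP address within a bounded time range, then joins the result against a static device table.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder.appName("ConnectionAttribution").getOrCreate()

// Streaming sources and the static device table (illustrative paths/columns).
val tcpLogs  = spark.readStream.format("delta").load("/logs/tcp")   // srcIp, tcpTime, ...
val dhcpLogs = spark.readStream.format("delta").load("/logs/dhcp")  // leaseIp, macAddress, dhcpTime
val devices  = spark.read.table("network_devices")                  // macAddress, hostname, owner

// Watermarks bound the state that each side of the stream-stream join must keep.
val tcp  = tcpLogs.withWatermark("tcpTime", "10 minutes")
val dhcp = dhcpLogs.withWatermark("dhcpTime", "30 minutes")

// Stream-stream join: match each connection to the DHCP lease that assigned its
// source IP, constrained to a time range so old state can be dropped.
val withMac = tcp.join(dhcp, expr(
  "srcIp = leaseIp AND tcpTime BETWEEN dhcpTime AND dhcpTime + interval 1 hour"))

// Stream-static join: resolve the MAC address to a concrete device and user.
val attributed = withMac.join(devices, Seq("macAddress"))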

Finally, using the same system for streaming, interactive queries and ETL has provided developers with the ability to quickly iterate and deploy new alerts. In particular, it enables analysts to build and test queries for detecting new attacks on offline data, and then deploy these queries directly on the alerting cluster. In one example, an analyst developed a query to identify exfiltration attacks via DNS. In this attack, malware leaks confidential information from the compromised host by piggybacking this information into DNS requests sent to an external DNS server owned by the attacker. One simplified query to detect such an attack essentially computes the aggregate size of the DNS requests sent by every host over a time interval. If the aggregate is greater than a given threshold, the query flags the corresponding host as potentially being compromised. The analyst used historical data to set this threshold, so as to achieve the desired balance between false positive and false negative rates. Once satisfied with the result, the analyst simply pushed the query to the alerting cluster. The ability to use the same system and the same API for data analysis and for implementing the alerts led not only to significant engineering cost savings, but also to better security, as it is significantly easier to deploy new rules.
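A hedged sketch of such a detection query appears below; the column names, window length, and threshold are illustrative. Because the same DataFrame code also runs in batch mode, the analyst can tune the threshold on historical data and then start the identical query as a streaming alert.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{window, sum, col}

val spark = SparkSession.builder.appName("DnsExfilDetector").getOrCreate()

// dnsLogs has columns such as eventTime, srcHost and requestBytes (illustrative).
val dnsLogs = spark.readStream.format("delta").load("/logs/dns")

val suspiciousHosts = dnsLogs
  .withWatermark("eventTime", "1 hour")
  .groupBy(window(col("eventTime"), "10 minutes"), col("srcHost"))
  .agg(sum(col("requestBytes")).as("totalBytes"))
  .where(col("totalBytes") > 10 * 1024 * 1024)   // threshold tuned on historical data

// Each output row flags one (window, host) pair whose DNS traffic exceeded the threshold.
suspiciousHosts.writeStream
  .format("console")       // illustrative sink; the alerting cluster would notify analysts
  .outputMode("update")
  .start()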

8.2        Monitoring Live Video Delivery

A large media company is using Structured Streaming to compute quality metrics for their live video traffic and interactively identify delivery problems. Live video delivery is especially challenging because network problems can severely disrupt utility. For prerecorded video, clients can use large buffers to mask issues, and a degradation at most results in extra buffering time; but for live video, a problem may mean missing a critical moment in a sports match or similar event. This organization collects video quality metrics from clients in real time, performs ETL operations and aggregation using Structured Streaming, then stores the results in a data warehouse. This allows operations engineers to interactively query fresh data to detect and diagnose quality issues (e.g., determine whether an issue is tied to a specific ISP, video server or other cause).

8.3       Analyzing Game Performance

A large gaming company uses Structured Streaming to monitor the latency experienced by players in a popular online game with tens of millions of monthly active users. As in the video use case, high network performance is essential for the user experience when gaming, and repeated problems can quickly lead to player churn. This organization collects latency logs from its game clients to cloud storage and then performs a variety of streaming analyses.

For example, one job joins the measurements with a table of Internet Autonomous Systems (ASes) and then aggregates the performance by AS over time to identify poorly performing ASes. When such an AS is identified, the streaming job triggers an alert, and IT staff can contact the AS in question to remediate the issue.

8.4        Cloud Monitoring at Databricks

At Databricks, we have been using Apache Spark since the start of the company to monitor our own cloud service, understand workload statistics, trigger alerts, and let our engineers interactively debug issues. The monitoring pipeline produces dozens of interactive dashboards as well as structured Parquet tables for ad-hoc SQL queries. These dashboards also play a key role for business users to understand which customers have increasing or decreasing usage, prioritize feature development, and proactively identify customers that are experiencing problems.

We built at least three versions of a monitoring pipeline using a combination of batch and streaming APIs starting four years ago, and in all cases, we found that the major challenges were operational. Despite our best efforts, pipelines could be brittle, experiencing frequent failures when aspects of our input data changed (e.g., new schemas or reading from more locations than before), and upgrading them was a daunting exercise. Worse yet, failures and upgrades often resulted in missing data, so we had to manually go back and re-run jobs to reconstruct it. Testing pipelines was also challenging due to their reliance on multiple distinct Spark jobs and storage systems. Our experience with Structured Streaming shows that it successfully addresses many of these challenges. Not only were we able to reimplement our pipelines in weeks, but the management overhead decreased drastically. Restartability coupled with adaptive batching, transactional sources/sinks and well-defined consistency semantics has enabled simpler fault recovery, upgrades, and rollbacks to repair old results. Moreover, we can test the same code in batch mode on data samples or use many of the same functions in interactive queries.

Figure 6: Throughput results on the Yahoo! benchmark. (a) vs. other systems; (b) system scaling.

Our pipelines with Structured Streaming also regularly combine its batch and streaming capabilities. For example, the pipeline to monitor streaming jobs starts with an ETL job that reads JSON events from Kafka and writes them to a columnar Parquet table in S3. Dozens of other batch and streaming jobs then query this table to produce dashboards and other reports. Because Parquet is a compact and column-oriented format, this architecture consumes drastically fewer resources than having every job read directly from Kafka, and it simultaneously places less load on the Kafka brokers. Overall, streaming jobs’ latencies range from seconds to minutes, and users can also query the Parquet table interactively in seconds.
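As a rough sketch of the first stage of this pipeline (brokers, topic, schema, and paths are illustrative), the Kafka-to-Parquet ETL job can be written as follows.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, col}
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("KafkaToParquet").getOrCreate()

// Illustrative schema for the JSON monitoring events.
val schema = new StructType()
  .add("timestamp", TimestampType)
  .add("service", StringType)
  .add("metric", StringType)
  .add("value", DoubleType)

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // illustrative brokers
  .option("subscribe", "monitoring-events")           // illustrative topic
  .load()
  .select(from_json(col("value").cast("string"), schema).as("e"))
  .select("e.*")

// Continuously append the parsed events to a columnar Parquet table on S3.
events.writeStream
  .format("parquet")
  .option("path", "s3://monitoring/events/")          // illustrative path
  .option("checkpointLocation", "s3://monitoring/ckpt/")
  .start()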

9      Performance Evaluation

In this section, we measure the performance of Structured Streaming using controlled benchmarks. We study performance vs. other systems on the Yahoo! Streaming Benchmark [14], scalability, and the throughput-latency tradeoff with continuous processing.

9.1       Performance vs. Other Streaming Systems

To evaluate performance compared to other streaming engines, we used the Yahoo! Streaming Benchmark [14], a widely used workload that has also been evaluated in other open source systems. This benchmark requires systems to read ad click events, join them against a static table of ad campaigns by campaign ID, and output counts by campaign on 10-second event-time windows.
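In DataFrame form, the core of this benchmark query looks roughly like the sketch below (event parsing is simplified and column names are illustrative; this is not necessarily the exact code used in our runs). Notably, everything is expressed with built-in relational operators.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{window, col}

val spark = SparkSession.builder.appName("YahooBenchmarkSketch").getOrCreate()

// Static ad-campaign table; in our setup it replaces the original Redis store.
val campaigns = spark.read.parquet("/benchmark/campaigns")   // adId, campaignId

// Streaming click events from Kafka, parsed into adId and an event-time column.
val clicks = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "ad-events")
  .load()
  .selectExpr("CAST(value AS STRING) AS adId", "timestamp AS eventTime")  // simplified parsing

// Join against the static table and count per campaign on 10-second event-time windows.
val counts = clicks
  .withWatermark("eventTime", "10 seconds")
  .join(campaigns, "adId")
  .groupBy(window(col("eventTime"), "10 seconds"), col("campaignId"))
  .count()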

We compared Kafka Streams 0.10.2, Apache Flink 1.2.1 and Spark 2.3.0 on a cluster with five c3.2xlarge Amazon EC2 workers (each with 8 virtual cores and 15 GB RAM) and one master. For Flink, we used the optimized version of the benchmark published by dataArtisans for a similar cluster [22]. As in that benchmark, the systems read data from a Kafka cluster running on the workers with 40 partitions (one per core), and write results to Kafka. The original Yahoo! benchmark used Redis to hold the static table for joining ad campaigns, but we found that Redis could be a bottleneck, so we replaced it with a table in each system (a KTable in Kafka Streams, a DataFrame in Spark, and an in-memory hash map in Flink).

Figure 6a shows each system’s maximum stable throughput, i.e., the throughput it can process before a backlog begins to form. We see that streaming system performance can vary significantly. Kafka Streams implements a simple message-passing model through the Kafka message bus, but only attains 700,000 records/second on our 40-core cluster. Apache Flink reaches 33 million records/s. Finally, Structured Streaming reaches 65 million records/s, nearly 2× the throughput of Flink. This particular Structured Streaming query is implemented using just DataFrame operations with no UDF code. The performance thus comes solely from Spark SQL’s built-in execution optimizations, including storing data in a compact binary format and runtime code generation. As pointed out by the authors of Trill [12] and others, execution optimizations can make a large difference in streaming workloads, and many systems based on per-record operations do not maximize performance.

Figure 7: Latency of continuous processing vs. input rate (records/s). Dashed line shows max throughput in microbatch mode.

9.2     Scalability

Figure 6b shows how Structured Streaming’s performance scales for the Yahoo! benchmark as we vary the size of our cluster. We used 1, 5, 10 and 20 c3.2xlarge Amazon EC2 workers (with 8 virtual cores and 15 GB RAM each) and the same experimental setup as in §9.1, including one Kafka partition per core. We see that throughput scales close to linearly, from 11.5 million records/s on 1 node to 225 million records/s on 20 nodes (i.e., 160 cores).

9.3       Continuous Processing

We benchmarked Structured Streaming’s continuous processing mode on a 4-core server to show the latency-throughput tradeoffs it can achieve. (Because partitions run independently in this mode, we expect the latency to stay the same as more nodes are added.) Figure 7 shows the results for a map job reading from Kafka, with the dashed line showing the maximum throughput achievable by microbatch mode. We see that continuous mode is able to achieve much lower latency without a large drop in throughput (e.g., less than 10 ms latency at half the maximum throughput of microbatching). Its maximum stable throughput is also slightly higher because microbatch mode incurs latency due to task scheduling.
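Switching a query between microbatch and continuous execution only requires changing its trigger. A minimal map-only sketch (brokers, topics, and checkpoint path are illustrative) is shown below; Trigger.Continuous takes a checkpoint interval, not a batch interval.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("ContinuousMap").getOrCreate()

val in = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "in-topic")
  .load()

// A record-at-a-time transformation that continuous mode can run with millisecond latency.
in.selectExpr("key", "UPPER(CAST(value AS STRING)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "out-topic")
  .option("checkpointLocation", "/tmp/ckpt-continuous")
  .trigger(Trigger.Continuous("1 second"))
  .start()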

10      Related Work

Structured Streaming builds on many existing systems for stream processing and big data analytics, including Spark SQL’s DataFrame API [8], Spark Streaming [37], Dataflow [2], incremental query systems [11, 24, 29, 38] and distributed stream processing [21]. At a high level, the main contributions of this work are:

•  An account of real-world user challenges with streaming systems, including operational challenges that are not always discussed in the research literature (§2).

•  A simple, declarative programming model that incrementalizes a widely used batch API (Spark DataFrames/SQL) to provide similar capabilities to Dataflow [2] and other streaming systems.

•  An execution engine providing high throughput, fault tolerance, and rich operational features that combines with the rest of Apache Spark to let users easily build end-to-end applications.

From an API standpoint, the closest work is incremental query systems [11, 24, 29, 38], including recent distributed systems such as Stateful Bulk Processing [25] and Naiad [26]. Structured Streaming’s API is an extension of Spark SQL [8], including its declarative DataFrame interface for programmatic construction of relational queries. Apache Flink also recently added a table API (currently in beta) for defining relational queries that can map to either streaming or batch execution [19], but this API lacks some of the features of Structured Streaming, such as custom stateful operators (§4.3.2).

Other recent streaming systems have language-integrated APIs that operate at a lower, more “imperative" level. In particular, Spark Streaming [37], Google Dataflow [2] and Flink’s DataStream API [18] provide various functional operators but require users to choose the right DAG of operators to implement a particular incrementalization strategy (e.g., when to pass on deltas versus complete results); essentially, these are equivalent to writing a physical execution plan. Structured Streaming’s API is simpler for users who are not experts on incrementalization. Structured Streaming adopts the definitions of event time, processing time, watermarks and triggers from Dataflow but incorporates them in an incremental model.

For execution, Structured Streaming uses concepts similar to discretized streams for microbatch mode [37] and traditional streaming engines for continuous processing mode [1, 13, 21]. It also builds on an analytical engine for performance, like Trill [12]. The most unique contribution here is the integration of batch and streaming queries to enable sophisticated end-to-end applications. As described in §8, Structured Streaming users can easily write applications that combine batch, interactive and stream processing using the same code (e.g., security log analysis). In addition, they leverage powerful operational features such as run-once triggers (running a streaming application “discontinuously" as batch jobs to retain its transactional features but lower costs), code updates, and batch processing to handle backlogs or code rollbacks (§7).

11      Conclusion

Stream processing is a powerful tool, but streaming systems are still difficult to use, operate and integrate into larger applications. We designed Structured Streaming to simplify all three of these tasks while integrating with the rest of Apache Spark. Unlike many other open source streaming engines, Structured Streaming purposefully adopts a very high-level API: incrementalizing an existing Spark SQL or DataFrame query. This makes it accessible to a wide range of users. Although Structured Streaming’s API is more declarative and constrained, we found that it works well for a diverse range of applications, including those that require custom logic for stateful processing. Beyond this focus on a high-level API, Structured Streaming also includes several powerful operational features and achieves high performance using the Spark SQL engine. Experience across hundreds of customer use cases shows that users can leverage the system to build sophisticated business applications.

12      Acknowledgements

We would like to thank the diverse Apache Spark developer community that has contributed to Structured Streaming, Spark Streaming and Spark SQL over the years. We also thank the SIGMOD reviewers for their detailed feedback on the paper.

References

[1] Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Mitch Cherniack, Jeong-Hyon Hwang, Wolfgang Lindner, Anurag S. Maskey, Alexander Rasin, Esther Ryvkina, Nesime Tatbul, Ying Xing, and Stan Zdonik. 2005. The design of the Borealis stream processing engine. In CIDR. 277–289.

[2] Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. 2015. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-scale, Unbounded, Out-of-order Data Processing. Proc. VLDB Endow. 8, 12 (Aug. 2015), 1792–1803. https://doi.org/10.14778/2824032.2824076

[3] Intel Altera. 2017. Financial/HPC – Financial Offload. https://www.altera.com/solutions/industry/computer-and-storage/applications/computer/financial-offload.html. (2017).

[4] Peter Alvaro, Neil Conway, Joseph M. Hellerstein, and William R. Marczak. 2011. Consistency analysis in Bloom: A CALM and collected approach. In Proceedings 5th Biennial Conference on Innovative Data Systems Research. 249–260.

[5] Amazon. 2017. Amazon Kinesis. https://aws.amazon.com/kinesis/. (2017).

[6] Michael Armbrust. 2017. SPARK-20928: Continuous Processing Mode for Structured Streaming. https://issues.apache.org/jira/browse/SPARK-20928. (2017).

[7] Michael Armbrust, Bill Chambers, and Matei Zaharia. 2017. Databricks Delta: A Unified Data Management System for Real-time Big Data. https://databricks.com/blog/2017/10/25/databricks-delta-a-unified-management-system-for-real-time-big-data.html. (2017).

[8] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. 1383–1394. https://doi.org/10.1145/2723372.2742797

[9] Jeff Barr. 2017. New – Per-Second Billing for EC2 Instances and EBS Volumes. https://aws.amazon.com/blogs/aws/new-per-second-billing-for-ec2-instances-and-ebs-volumes/. (2017).

[10] Apache Beam. 2017. Apache Beam programming guide. https://beam.apache.org/documentation/programming-guide/. (2017).

[11] Jose A. Blakeley, Per-Ake Larson, and Frank Wm Tompa. 1986. Efficiently Updating Materialized Views. SIGMOD Rec. 15, 2 (June 1986), 61–71. https://doi.org/10.1145/16856.16861

[12] Badrish Chandramouli, Jonathan Goldstein, Mike Barnett, Robert DeLine, Danyel Fisher, John C. Platt, James F. Terwilliger, and John Wernsing. 2014. Trill: A High-performance Incremental Query Processor for Diverse Analytics. Proc. VLDB Endow. 8, 4 (Dec. 2014), 401–412. https://doi.org/10.14778/2735496.2735503

[13] Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Samuel R. Madden, Fred Reiss, and Mehul A. Shah. 2003. TelegraphCQ: Continuous Dataflow Processing. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD ’03). ACM, New York, NY, USA, 668–668. https://doi.org/10.1145/872757.872857

[14] Sanket Chintapalli, Derek Dagit, Bobby Evans, Reza Farivar, Tom Graves, Mark Holderbaugh, Zhuo Liu, Kyle Nusbaum, Kishorkumar Patil, Boyang Peng, and Paul Poulosky. 2015. Benchmarking Streaming Computation Engines at Yahoo! https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at. (2015).

[15] Confluent. 2017. KSQL: Streaming SQL for Kafka. https://www.confluent.io/product/ksql/. (2017).

[16] Databricks. 2017. Databricks unified analytics platform. https://databricks.com/product/unified-analytics-platform. (2017).

[17] Apache Flink. 2017. Apache Flink. http://flink.apache.org. (2017).

[18] Apache Flink. 2017. Flink DataStream API Programming Guide. https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/datastream_api.html. (2017).

[19] Apache Flink. 2017. Flink Table & SQL API Beta. https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/table/index.html. (2017).

[20] Apache Flink. 2017. Working with State. https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/state.html. (2017).

[21] Lukasz Golab and M. Tamer Özsu. 2003. Issues in Data Stream Management. SIGMOD Rec. 32, 2 (June 2003), 5–14. https://doi.org/10.1145/776985.776986

[22] Jamie Grier. 2016. Extending the Yahoo! Streaming Benchmark. https://data-artisans.com/blog/extending-the-yahoo-streaming-benchmark. (2016).

[23] Apache Kafka. 2017. Kafka. http://kafka.apache.org. (2017).

[24] Sailesh Krishnamurthy, Michael J. Franklin, Jeffrey Davis, Daniel Farina, Pasha Golovko, Alan Li, and Neil Thombre. 2010. Continuous Analytics over Discontinuous Streams. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD ’10). ACM, New York, NY, USA, 1081–1092. https://doi.org/10.1145/1807167.1807290

[25] Dionysios Logothetis, Christopher Olston, Benjamin Reed, Kevin C. Webb, and Ken Yocum. 2010. Stateful Bulk Processing for Incremental Analytics. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC ’10). ACM, New York, NY, USA, 51–62. https://doi.org/10.1145/1807128.1807138

[26] Frank McSherry, Derek Murray, Rebecca Isaacs, and Michael Isard. 2013. Differential dataflow. In Proceedings of CIDR 2013. https://www.microsoft.com/en-us/research/publication/differential-dataflow/

[27] Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A Timely Dataflow System. 439–455. https://doi.org/10.1145/2517349.2522738

[28] Pandas. 2017. pandas Python data analysis library. http://pandas.pydata.org. (2017).

[29] X. Qian and Gio Wiederhold. 1991. Incremental Recomputation of Active Relational Expressions. IEEE Trans. on Knowl. and Data Eng. 3, 3 (Sept. 1991), 337–341. https://doi.org/10.1109/69.91063

[30] R. [n. d.]. R project for statistical computing. http://www.r-project.org. ([n. d.]).

[31] Apache Spark. 2017. Spark Documentation. http://spark.apache.org/docs/latest. (2017).

[32] Apache Storm. 2017. Apache Storm. http://storm.apache.org. (2017).

[33] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Anthony, Hao Liu, and Raghotham Murthy. 2010. Hive - a petabyte scale data warehouse using Hadoop. In ICDE, Feifei Li, Mirella M. Moro, Shahram Ghandeharizadeh, Jayant R. Haritsa, Gerhard Weikum, Michael J. Carey, Fabio Casati, Edward Y. Chang, Ioana Manolescu, Sharad Mehrotra, Umeshwar Dayal, and Vassilis J. Tsotras (Eds.). IEEE, 996–1005. http://infolab.stanford.edu/~ragho/hive-icde2010.pdf

[34] Reynold Xin et al. [n. d.]. GraySort on Apache Spark by Databricks. http://sortbenchmark.org/ApacheSpark2014.pdf. ([n. d.]).

[35] Burak Yavuz and Tyson Condie. 2017. Running Streaming Jobs Once a Day For 10x Cost Savings. https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html. (2017).

[36] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. 15–28.

[37] Matei Zaharia, Tathagata Das, Haoyuan Li, Tim Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized Streams: Fault-Tolerant Streaming Computation at Scale. In SOSP.

[38] Yue Zhuge, Héctor García-Molina, Joachim Hammer, and Jennifer Widom. 1995. View Maintenance in a Warehousing Environment. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD ’95). ACM, New York, NY, USA, 316–327. https://doi.org/10.1145/223784.223848



[1] Spark SQL offers several slightly different APIs that map to the same query engine. The DataFrame API, modeled after data frames in R and Pandas [28, 30], offers a simple interface to build relational queries programmatically that is familiar to many users. The Dataset API adds static typing over DataFrames, similar to RDDs [36]. Alternatively, users can write SQL directly. All APIs produce a relational query plan.

[2] In Spark 2.3.0, we actually make one checkpoint per epoch, but we plan to make them less frequent in a future release, as is already done in Spark Streaming.

[3] Some sinks, such as Amazon S3, provide no way to atomically commit multiple writes from different writer nodes. In such cases, we have also created Spark data sources that add transactions over the underlying storage system. For example, Databricks Delta [7] offers a consistent view of S3 data for both streaming and batch queries, along with additional features such as index maintenance.

[4] One additional step they may have to do is remove faulty data from the output sink, depending on the sink chosen. For the file sink, for example, it is straightforward to find which files were written in a particular epoch and remove those.

你可能感兴趣的:(StructuredStreaming: A Declarative API for Real-Time)