Following are some excerpts from the paper Spark SQL: Relational Data Processing in Spark by Michael Armbrust et al. These excerpts summarize the main ideas of the paper.
Paper name: Spark SQL: Relational Data Processing in Spark
Paper published time: June 2015
Paper authors: Michael Armbrust et al.
Key words: Databases; Data Warehouse; Machine Learning; Spark; Hadoop; SQL; Distributed Processing; Structured and Semistructured data; DataFrame; Optimizer
[TOC]
Spark SQL integrates relational processing with Spark’s functional programming API. Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g., declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g., machine learning).
Compared to previous systems, Spark SQL makes two main additions:
It offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code.
It includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points.
Using Catalyst, we have built a variety of features (e.g., schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis.
We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
Most data pipelines would ideally be expressed with a combination of both relational queries and complex procedural algorithms, for the following reasons:
MapReduce gave users a powerful, but low-level, procedural programming interface. Programming such systems was onerous and required manual optimization by the user to achieve high performance.
Systems like Pig, Hive, Dremel and Shark all take advantage of declarative queries to provide richer automatic optimizations. But the relational approach is insufficient for many big data applications: users often need custom ETL code to and from semi-structured or unstructured data sources, and they want to perform advanced analytics (such as machine learning and graph processing) that are hard to express in relational systems.
Goals for Spark SQL:
Rather than forcing users to choose between a relational and a procedural API, Spark SQL lets users seamlessly intermix the two through two contributions:
Spark SQL provides a DataFrame API that can perform relational operations on both external data sources and Spark’s built-in distributed collections. This API is similar to the widely used data frame concept in R, but evaluates operations lazily so that it can perform relational optimizations.
To support the wide range of data sources and algorithms in big data, Spark SQL introduces a novel extensible optimizer called Catalyst. Catalyst makes it easy to add data sources, optimization rules, and data types for domains such as machine learning.
Spark SQL runs as a library on top of Spark. It exposes SQL interfaces, which can be accessed through JDBC/ODBC or through a command-line console, as well as the DataFrame API integrated into Spark’s supported programming languages.
The main abstraction in Spark SQL’s API is a DataFrame, a distributed collection of rows with a homogeneous schema. Unlike RDDs, DataFrames keep track of their schema and support various relational operations that lead to more optimized execution.
DataFrames can be constructed from tables in a system catalog (based on external data sources) or from existing RDDs of native Java/Python objects. Each DataFrame can also be viewed as an RDD of Row objects, allowing users to call procedural Spark APIs such as map.
DataFrames support all common relational operators, including projection (select), filter (where), join, and aggregations (groupBy). These operators all take expression objects in a limited DSL that lets Spark capture the structure of the expression.
All of these operators build up an abstract syntax tree (AST) of the expression, which is then passed to Catalyst for optimization. This is unlike the native Spark API that takes functions containing arbitrary Scala/Java/Python code, which are then opaque to the runtime engine.
Apart from the relational DSL, DataFrames can be registered as temporary tables in the system catalog and queried using SQL.
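As a minimal sketch of how this looks in practice (assuming a SQLContext named `ctx` and a catalog table "users" with hypothetical columns `name` and `age`):

```scala
// Relational operators take expression objects in a limited DSL, so Spark can
// see the structure of each expression and build a logical plan lazily.
val users = ctx.table("users")                 // DataFrame backed by a catalog table
val young = users.where(users("age") < 21)     // expression object, not an opaque closure
val names = young.select(young("name"))        // still only builds the plan

// A DataFrame can also be registered as a temporary table and queried with SQL.
young.registerTempTable("young")
val counted = ctx.sql("SELECT COUNT(*) FROM young")

println(names.count())                         // output operation triggers optimized execution
```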
Spark SQL uses a nested data model based on Hive for tables and DataFrames. It supports all major SQL data types, including boolean, integer, double, decimal, string, date, and timestamp, as well as complex (i.e., non-atomic) data types: structs, arrays, maps and unions. Complex data types can also be nested together to create more powerful types.
Unlike many traditional DBMSes, Spark SQL provides first-class support for complex data types in the query language and the API. In addition, Spark SQL also supports user-defined types.
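For instance, a nested schema mixing structs, arrays, and maps can be written directly with Spark SQL's type API (class names as in Spark 1.3+; the tweet-like fields are purely illustrative):

```scala
import org.apache.spark.sql.types._

// A struct of two doubles, nested below inside a larger record schema.
val locationType = StructType(Seq(
  StructField("latitude", DoubleType),
  StructField("longitude", DoubleType)))

val tweetSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("text", StringType),
  StructField("loc", locationType),                          // nested struct
  StructField("hashtags", ArrayType(StringType)),            // array of atoms
  StructField("counts", MapType(StringType, IntegerType))))  // map type
```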
DataFrames can be significantly easier for users to work with than relational query languages thanks to their integration in a full programming language.
For example, users can break up their code into Scala, Java or Python functions that pass DataFrames between them to build a logical plan, and will still benefit from optimizations across the whole plan when they run an output operation. Likewise, developers can use control structures like if statements and loops to structure their work.
One user said that the DataFrame API is “concise and declarative like SQL, except I can name intermediate results,” referring to how it is easier to structure computations and debug intermediate steps.
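A hedged sketch of that style of composition, with a hypothetical "events" table and column names:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

def events(ctx: SQLContext): DataFrame = ctx.table("events")

// An ordinary Scala loop builds up one logical plan covering all requested years.
def filterByYears(df: DataFrame, years: Seq[Int]): DataFrame = {
  var result = df.where(df("year") === years.head)
  for (y <- years.tail) result = result.unionAll(df.where(df("year") === y))
  result
}

// Ordinary control flow decides the shape of the final aggregation.
def summarize(df: DataFrame, byCountry: Boolean): DataFrame =
  if (byCountry) df.groupBy("country").count() else df.groupBy("year").count()

// The whole plan is optimized as a unit only when an output operation runs:
// summarize(filterByYears(events(ctx), Seq(2014, 2015)), byCountry = true).show()
```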
The DataFrame API analyzes logical plans eagerly (i.e., to identify whether the column names used in expressions exist in the underlying tables, and whether their data types are appropriate), even though query results are computed lazily. Thus, Spark SQL reports an error as soon as the user types an invalid line of code instead of waiting until execution. This is again easier to work with than a large SQL statement.
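A short illustration of this eager analysis (table and column names hypothetical):

```scala
// Referencing a column that does not exist fails as soon as the expression is
// built, not when the query is finally executed.
val users = ctx.table("users")
users.select(users("name"))   // fine, assuming the table has a "name" column
users.select(users("nmae"))   // analysis error reported immediately, before any job runs
```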
Database systems often require User-Defined Functions (UDFs) to be defined in a separate programming environment that is different from the primary query interfaces. Spark SQL’s DataFrame API supports inline definition of UDFs, without the complicated packaging and registration process found in other database systems.
In Spark SQL, UDFs can be registered inline by passing Scala, Java or Python functions, which may use the full Spark API internally.
Once registered, the UDF can also be used via the JDBC/ODBC interface by business intelligence tools.
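For example, a UDF can be defined and registered inline as a plain Scala function (assuming a SQLContext `ctx` and a registered "tweets" table with a `text` column, both hypothetical):

```scala
// Inline registration: no separate packaging or deployment step.
ctx.udf.register("strLen", (s: String) => s.length)

// Once registered, the UDF is callable from SQL (and therefore from
// JDBC/ODBC clients such as BI tools).
val lengths = ctx.sql("SELECT text, strLen(text) FROM tweets")
```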
Three features were added to Spark SQL specifically to handle challenges in “big data” environments.
In Spark SQL, we added a JSON data source that automatically infers a schema from a set of records. Our schema inference algorithm works in one pass over the data, and can also be run on a sample of the data if desired.
Specifically, the algorithm attempts to infer a tree of STRUCT types, each of which may contain atoms, arrays, or other STRUCTs. For each field defined by a distinct path from the root JSON object (e.g., tweet.loc.latitude), the algorithm finds the most specific Spark SQL data type that matches observed instances of the field.
We implement this algorithm using a single reduce operation over the data, which starts with schemata (i.e., trees of types) from each individual record and merges them using an associative “most specific supertype” function that generalizes the types of each field. This makes the algorithm both single-pass and communication-efficient, as a high degree of reduction happens locally on each node.
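The sketch below shows the flavor of such a merge on a toy type lattice; it is a standalone illustration, not Spark SQL's actual implementation:

```scala
// Per-record type trees are combined with an associative
// "most specific supertype" function, so inference is a single reduce.
sealed trait InferredType
case object IntT extends InferredType
case object DoubleT extends InferredType
case object StringT extends InferredType
case class StructT(fields: Map[String, InferredType]) extends InferredType

def mostSpecificSupertype(a: InferredType, b: InferredType): InferredType = (a, b) match {
  case (x, y) if x == y                  => x
  case (IntT, DoubleT) | (DoubleT, IntT) => DoubleT       // widen numeric types
  case (StructT(fa), StructT(fb))        =>               // merge structs field by field
    StructT((fa.keySet ++ fb.keySet).map { k =>
      (fa.get(k), fb.get(k)) match {
        case (Some(x), Some(y)) => k -> mostSpecificSupertype(x, y)
        case (Some(x), None)    => k -> x
        case (None, y)          => k -> y.getOrElse(StringT)
      }
    }.toMap)
  case _                                 => StringT       // incompatible: fall back to string
}

// Hypothetical usage: rdd.map(inferSchemaOfRecord).reduce(mostSpecificSupertype)
```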
In Spark SQL, we also use the same algorithm for inferring schemas of RDDs of Python objects, as Python is not statically typed so an RDD can contain multiple object types. In the future, we plan to add similar inference for CSV files and XML.
Spark ML introduced a new high-level API that uses DataFrames. This new API is based on the concept of machine learning pipelines, an abstraction also found in other high-level ML libraries like scikit-learn.
A pipeline is a graph of transformations on data, such as feature extraction, normalization, dimensionality reduction, and model training, each of which exchanges datasets.
To represent datasets, the new API uses DataFrames, where each column represents a feature of the data. All algorithms that can be called in pipelines take a name for the input column(s) and output column(s), and can thus be called on any subset of the fields and produce new ones. This makes it easy for developers to build complex pipelines while retaining the original data for each record.
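A sketch of such a pipeline for text classification, using the spark.ml API (class names as in Spark 1.4-era releases; the "text" and "label" columns are hypothetical):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Each stage names its input and output columns, so every stage adds new
// columns while the original fields of each record are carried along.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, tf, lr))
val model = pipeline.fit(trainingDF)   // trainingDF: DataFrame with "text" and "label"
val scored = model.transform(testDF)   // appends prediction columns to testDF
```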
As data sources often reside on different machines or in different geographic locations, naively querying them can be prohibitively expensive. Spark SQL data sources leverage Catalyst to push predicates down into the data sources whenever possible.
For example, we can use the JDBC data source and the JSON data source to join two tables together. Conveniently, both data sources can automatically infer the schema without users having to define it. The JDBC data source will also push the filter predicate down into MySQL to reduce the amount of data transferred.
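A sketch of that federation example, close to the paper's description (URLs, paths, and column names are placeholders):

```scala
// Register one table backed by MySQL over JDBC and one backed by a JSON file.
ctx.sql("""
  CREATE TEMPORARY TABLE users USING jdbc
  OPTIONS (url "jdbc:mysql://userDB/users", dbtable "users")
""")
ctx.sql("""
  CREATE TEMPORARY TABLE logs USING json
  OPTIONS (path "logs.json")
""")

// Catalyst pushes the filter on users down into MySQL, so only matching rows
// are transferred before the join is evaluated in Spark.
val joined = ctx.sql("""
  SELECT users.id, users.name, logs.message
  FROM users JOIN logs ON users.id = logs.userId
  WHERE users.registrationDate > '2015-01-01'
""")
```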
To implement Spark SQL, we designed a new extensible optimizer, Catalyst, based on functional programming constructs in Scala. Catalyst’s extensible design had two purposes.
First, we wanted to make it easy to add new optimization techniques and features to Spark SQL, especially to tackle various problems we were seeing specifically with “big data” (e.g., semistructured data and advanced analytics).
Second, we wanted to enable external developers to extend the optimizer, for example by adding data-source-specific rules that can push filtering or aggregation into external storage systems, or support for new data types.
Catalyst supports both rule-based and cost-based optimization. Cost-based optimization is performed by generating multiple plans using rules, and then computing their costs.
Catalyst uses standard features of the Scala programming language, such as pattern-matching, to let developers use the full programming language while still making rules easy to specify.
At its core, Catalyst contains a general library for representing trees and applying rules to manipulate them. On top of this framework, we have built libraries specific to relational query processing (e.g., expressions, logical query plans), and several sets of rules that handle different phases of query execution: analysis, logical optimization, physical planning, and code generation to compile parts of queries to Java bytecode. Catalyst also offers several public extension points, including external data sources and user-defined types.
The main data type in Catalyst is a tree composed of node objects. Each node has a node type and zero or more children. New node types are defined in Scala as subclasses of the TreeNode class. These objects are immutable and can be manipulated using functional transformations.
Trees can be manipulated using rules, which are functions from a tree to another tree. While a rule can run arbitrary code on its input tree (given that this tree is just a Scala object), the most common approach is to use a set of pattern matching functions that find and replace subtrees with a specific structure.
In Catalyst, trees offer a transform method that applies a pattern matching function recursively on all nodes of the tree, transforming the ones that match each pattern to a result.
Catalyst tests which parts of a tree a given rule applies to, automatically skipping over and descending into subtrees that do not match. This ability means that rules only need to reason about the trees where a given optimization applies and not those that do not match. Thus, rules do not need to be modified as new types of operators are added to the system.
In practice, rules may need to execute multiple times to fully transform a tree. Catalyst groups rules into batches, and executes each batch until it reaches a fixed point, that is, until the tree stops changing after applying its rules. Running rules to fixed point means that each rule can be simple and self-contained, and yet still eventually have larger global effects on a tree.
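The standalone sketch below captures this style on toy expression classes (not the real TreeNode hierarchy): immutable nodes, a transform driven by a pattern-matching rule, and a batch run to fixed point:

```scala
// Toy expression trees with a bottom-up transform.
sealed trait Expr {
  def transformUp(rule: PartialFunction[Expr, Expr]): Expr = {
    val withNewChildren = this match {
      case Add(l, r) => Add(l.transformUp(rule), r.transformUp(rule))
      case leaf      => leaf
    }
    // Nodes the rule does not match are simply left unchanged.
    rule.applyOrElse(withNewChildren, identity[Expr])
  }
}
case class Literal(value: Int) extends Expr
case class Attribute(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// Constant folding plus x+0 elimination, written as one pattern-matching rule.
val simplify: PartialFunction[Expr, Expr] = {
  case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
  case Add(left, Literal(0))         => left
  case Add(Literal(0), right)        => right
}

// Run the rule repeatedly until the tree stops changing (fixed point).
def fixedPoint(e: Expr, rule: PartialFunction[Expr, Expr]): Expr = {
  val next = e.transformUp(rule)
  if (next == e) e else fixedPoint(next, rule)
}

// Add(x, Add(1, Add(2, 0))) simplifies to Add(x, 3).
val optimized =
  fixedPoint(Add(Attribute("x"), Add(Literal(1), Add(Literal(2), Literal(0)))), simplify)
```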
Catalyst’s general tree transformation framework is used in four phases: analysis, logical optimization, physical planning, and code generation.
In the physical planning phase, Catalyst may generate multiple plans and compare them based on cost. All other phases are purely rule-based. Each phase uses different types of tree nodes; Catalyst includes libraries of nodes for expressions, data types, and logical and physical operators.
Catalyst’s design around composable rules makes it easy for users and third-party libraries to extend. Developers can add batches of rules to each phase of query optimization at runtime, as long as they adhere to the contract of each phase (e.g., ensuring that analysis resolves all attributes). However, to make it even simpler to add some types of extensions without understanding Catalyst rules, we have also defined two narrower public extension points: data sources and user-defined types. These still rely on facilities in the core engine to interact with the rest of the optimizer.
Developers can define a new data source for Spark SQL using several APIs that expose varying degrees of possible optimization, from simple full-table scans to scans with column pruning and filter pushdown. These interfaces make it easy to add simple data sources of virtually any type, while still letting more sophisticated sources participate in optimization.
Similar interfaces exist for writing data to an existing or new table. These are simpler because Spark SQL just provides an RDD of Row objects to be written.
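As a rough sketch of the simplest read path (interfaces as in the Spark 1.3+ external data source API; the toy "range" source is hypothetical):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// A read-only data source that produces the integers [0, n) as a one-column table.
class RangeRelation(val sqlContext: SQLContext, n: Int)
    extends BaseRelation with TableScan {
  override def schema: StructType = StructType(Seq(StructField("i", IntegerType)))
  // TableScan is the simplest interface: a full scan returning an RDD of Rows.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(0 until n).map(Row(_))
}

// Entry point looked up when the source is referenced by name in a query.
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new RangeRelation(sqlContext, parameters.getOrElse("n", "10").toInt)
}
```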
Adding new types can be challenging, however, as data types pervade all aspects of the execution engine.
In Catalyst, we solve this issue by mapping user-defined types to structures composed of Catalyst’s built-in types. To register a Scala type as a UDT, users provide a mapping from an object of their class to a Catalyst Row of built-in types, and an inverse mapping back.
In user code, they can now use the Scala type in objects that they query with Spark SQL, and it will be converted to built-in types under the hood. Likewise, they can register UDFs that operate directly on their type.
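A hedged sketch of such a mapping for a 2-D point (the real UserDefinedType API differs in details across Spark versions, so this mirrors only the structure of the mapping rather than extending the actual class):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

case class Point(x: Double, y: Double)

class PointUDT {
  // Catalyst-visible representation: a struct of built-in types.
  def dataType: StructType =
    StructType(Seq(StructField("x", DoubleType), StructField("y", DoubleType)))
  // Mapping from the user's class to a Row of built-in types...
  def serialize(p: Point): Row = Row(p.x, p.y)
  // ...and the inverse mapping back to the user's class.
  def deserialize(r: Row): Point = Point(r.getDouble(0), r.getDouble(1))
}
```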