Using such APIs, functional programming certainly feels like true declarative programming. The most popular true declarative programming language is SQL. When you join two tables, you don’t tell the RDBMS how to implement that join. It may decide at its discretion whether a nested loop, merge join, hash join, or some other algorithm is the most suitable in the context of the complete query and of all the available meta information. This is extremely powerful because the performance assumptions that are valid for a simple join may no longer be valid for a complex one, where a different algorithm would outperform the original one. By this abstraction, you can just easily modify a query in 30 seconds, without worrying about low-level details like algorithms or performance. When an API allows you to combine both (e.g. jOOQ and Streams), you will get the best of both worlds – and those worlds aren’t too different. In the following sections, we’ll compare common SQL constructs with their equivalent expressions written in Java 8 using Streams and jOOλ, in case the Stream API doesn’t offer enough functionality.
For the sake of this article, we’re going to assume that SQL rows / records have an equivalent representation in Java. For this, we’ll be using jOOλ’s Tuple type, which is essentially:
|
… plus a lot of useful gimmicks like Tuple being Comparable
, etc. Note that we’re assuming the following imports in this and all subsequent examples.
|
Much like SQL rows, a tuple is a “value-based” type, meaning that it doesn’t really have an identity. Two tuples (1, 'A')
and (1, 'A')
can be considered exactly equivalent. Removing identity from the game makes SQL and functional programming with immutable data structures extremely elegant.
In SQL, the FROM
clause logically (but not syntactically) precedes all the other clauses. It is used to produce a set of tuples from at least one table, possibly multiple joined tables. A single-table FROM
clause can be trivially mapped to Stream.of(), for instance, or to any other method that simply produces a stream: SQL
|
yielding
+----+----+ | v1 | v2 | +----+----+ | 1 | 1 | | 2 | 2 | +----+----+
Java
|
yielding
(1, 1) (2, 2)
Selecting from multiple tables is already more interesting. The easiest way to combine two tables in SQL is by producing a cartesian product, either via a table list or using a CROSS JOIN
. The following two are equivalent SQL statements: SQL
|
yielding
+----+----+ | v1 | v2 | +----+----+ | 1 | A | | 1 | B | | 2 | A | | 2 | B | +----+----+
In a cross join (or cartesian product), every value from t1
is combined with every value from t2
producing size(t1) * size(t2)
rows in total. Java In functional programming using Java 8’s Stream
, the Stream.flatMap() method corresponds to SQL CROSS JOIN
as can be seen in the following example:
|
yielding
(1, A) (1, B) (2, A) (2, B)
Note how we have to wrap the second stream in a Supplier
because streams can be consumed only once, but the above algorithm is really implementing a nested loop, combining all elements of stream s2
with each element from stream s1
. An alternative would be not to use streams but lists (which we will do in subsequent examples, for simplicity):
|
In fact, CROSS JOIN
can be chained easily both in SQL and in Java: SQL
|
yielding
+----+----+----+ | v1 | v2 | v3 | +----+----+----+ | 1 | A | X | | 1 | A | Y | | 1 | B | X | | 1 | B | Y | | 2 | A | X | | 2 | A | Y | | 2 | B | X | | 2 | B | Y | +----+----+----+
Java
|
yielding
(1, A, X) (1, A, Y) (1, B, X) (1, B, Y) (2, A, X) (2, A, Y) (2, B, X) (2, B, Y)
Note how we explicitly unnested the tuples from the first CROSS JOIN
operation to form “flat” tuples in the second operation. This is optional, of course. Java with jOOλ’s crossJoin() Us jOOQ developers, we’re a very SQL-oriented people, so it is only natural to have added a crossJoin() convenience method for the above use-case. So our triple-cross join can be written like this:
|
yielding
((1, A), X) ((1, A), Y) ((1, B), X) ((1, B), Y) ((2, A), X) ((2, A), Y) ((2, B), X) ((2, B), Y)
In this case, we didn’t unnest the tuple produced in the first cross join. From a merely relational perspective, this doesn’t matter either. Nested tuples are the same thing as flat tuples. In SQL, we just don’t see the nesting. Of course, we could still unnest as well by adding a single additional mapping:
|
yielding, again
(1, A, X) (1, A, Y) (1, B, X) (1, B, Y) (2, A, X) (2, A, Y) (2, B, X) (2, B, Y)
(You may have noticed that map()
corresponds to SELECT
as we’ll see again later on)
The SQL INNER JOIN
is essentially just syntactic sugar for a SQL CROSS JOIN
with a predicate that reduces the tuple set after cross-joining. In SQL, the following two ways of inner joining are equivalent: SQL
|
yielding
+----+----+ | v1 | v2 | +----+----+ | 1 | 1 | +----+----+
(note that the keyword INNER
is optional). So, the values 2
from t1
and the values 3
from t2
are “thrown away”, as they produce any rows for which the join predicate yields true. The same can be expressed easily, yet more verbosely in Java Java (inefficient solution!)
|
The above correctly yields
(1, 1)
But beware that you’re attaining this result after producing a cartesian product, the nightmare of every DBA! As mentioned at the beginning of this article, unlike in declarative programming, in functional programming you instruct your program to do exactly the order of operations that you specify. In other words:
In functional programming, you define the exact “execution plan” of your query. In declarative programming, an optimiser may reorganise your “program”
There is no optimiser to transform the above into the much more efficient: Java (more efficient)
|
The above also yields
(1, 1)
Notice, how the join predicate has moved from the “outer” stream into the “inner” stream, that is produced in the function passed to flatMap()
. Java (optimal) As mentioned previously, functional programming doesn’t necessarily allow you to rewrite algorithms depending on knowledge of the actual data. The above presented implementation for joins always implement nested loop joins going from the first stream to the second. If you join more than two streams, or if the second stream is very large, this approach can be terribly inefficient. A sophisticated RDBMS would never blindly apply nested loop joins like that, but consider constraints, indexes, and histograms on actual data. Going deeper into that topic would be out of scope for this article, though. Java with jOOλ’s innerJoin() Again, inspired by our work on jOOQ we’ve also added an innerJoin() convenience method for the above use-case:
|
yielding
(1, 1)
… because after all, when joining two streams, the only really interesting operation is the join Predicate
. All else (flatmapping, etc.) is just boilerplate.
SQL’s OUTER JOIN
works like INNER JOIN
, except that additional “default” rows are produced in case the JOIN
predicate yields false
for a pair of tuples. In terms of set theory / relational algebra, this can be expressed as such:
Or in a SQL-esque dialect:
R LEFT OUTER JOIN S ::= R INNER JOIN S UNION ( (R EXCEPT (SELECT R.* FROM R INNER JOIN S)) CROSS JOIN (null, null, ..., null) )
This simply means that when left outer joining S
to R
, there will be at least one row in the result for each row in R
, with possibly an empty value for S
. Inversely, when right outer joining S
to R
, there will be at least one row in the result for each row in S
, with possibly an empty value for R
. And finally, when full outer joining S
to R
, there will be at least one row in the result for each row in R
with possibly an empty value for S
AND for each row in S
with possibly an empty value for R
. Let us look at LEFT OUTER JOIN
, which is used most often in SQL. SQL
|
yielding
+----+------+ | v1 | v2 | +----+------+ | 1 | 1 | | 2 | null | +----+------+
(note that the keyword OUTER
is optional). Java Unfortunately, the JDK’s Stream API doesn’t provide us with an easy way to produce “at least” one value from a stream, in case the stream is empty. We could be writing a utility function as explained by Stuart Marks on Stack Overflow:
|
Or, we just use jOOλ’s Seq.onEmpty()
|
(notice, we’re putting null
in a stream. This might not always be a good idea. We’ll follow up with that in a future blog post) The above also yields
(1, 1) (2, null)
How to read the implicit left outer join?
v1
from the left stream s1
v1
, we flatmap the right stream s2
to produce a tuple (v1, v2)
(a cartesian product, cross join)(v1, v2)
v2
, we’ll generate a single tuple containing the value of the left stream v1
and null
Java with jOOλ For convenience, jOOλ also supports leftOuterJoin() which works as described above:
|
yielding
(1, 1) (2, null)
Trivially, a RIGHT OUTER JOIN
is just the inverse of the previous LEFT OUTER JOIN
. The jOOλ implementation of rightOuterJoin() looks like this:
|
As you can see, the RIGHT OUTER JOIN
inverses the results of a LEFT OUTER JOIN
, that’s it. For example:
|
yielding
(1, 1) (null, 3)
The most straight-forward mapping is probably SQL’s WHERE
clause having an exact equivalent in the Stream
API: Stream.filter(). SQL
|
yielding
+---+ | v | +---+ | 2 | +---+
Java
|
yielding
2
The interesting thing with filter()
and the Stream API in general is that the operation can apply at any place in the call chain, unlike the WHERE
clause, which is limited to be placed right after the FROM
clause – even if SQL’s JOIN .. ON
or HAVING
clauses are semantically similar.
The least straight-forward mapping is GROUP BY
vs. Stream.collect(). First off, SQL’s GROUP BY may be a bit tricky to fully understand. It is really part of the FROM
clause, transforming the set of tuples produced by FROM .. JOIN .. WHERE
into groups of tuples, where each group has an associated set of aggregatable tuples, which can be aggregated in the HAVING
, SELECT
, and ORDER BY
clauses. Things get even more interesting when you use OLAP features like GROUPING SETS, which allow for duplicating tuples according to several grouping combinations. In most SQL implementations that don’t support ARRAY
or MULTISET
, the aggregatable tuples are not available as such (i.e. as nested collections) in the SELECT
. Here, the Stream
API’s feature set excels. On the other hand, the Stream
API can group values only as a terminal operation, where in SQL, GROUP BY
is applied purely declaratively (and thus, lazily). The execution planner may choose not to execute the GROUP BY
at all if it is not needed. For instance:
|
The above query is semantically equivalent to
|
The grouping in the subquery was unnecessary. Someone may have copy-pasted that subquery in there from somewhere else, or refactored the query as a whole. In Java, using the Stream
API, each operation is always executed. For the sake of simplicity, we’ll stick to the most simple examples here
A special case is when we do not specify any GROUP BY
clause. In that case, we can specify aggregations on all columns of the FROM
clause, producing always exactly one record. For instance: SQL
|
yielding
+-----+ | sum | +-----+ | 6 | +-----+
Java
|
yielding
6
A more common case of aggregation in SQL is to specify an explicit GROUP BY
clause as explained before. For instance, we may want to group by even and odd numbers: SQL
|
yielding
+-------+-------+-----+ | v % 2 | count | sum | +-------+-------+-----+ | 0 | 1 | 2 | | 1 | 2 | 4 | +-------+-------+-----+
Java For this simple grouping / collection use-case, luckily, the JDK offers a utility method called Collectors.groupingBy(), which produces a collector that generates a Map
type like this:
|
yielding
{0=[2], 1=[1, 3]}
This certainly takes care of the grouping. Now we want to produce aggregations for each group. The slightly awkward JDK way to do this would be:
|
we’ll now get:
{0=IntSummaryStatistics{count=1, sum=2, min=2, average=2.000000, max=2}, 1=IntSummaryStatistics{count=2, sum=4, min=1, average=2.000000, max=3}}
As you can see, the count()
and sum()
values have been calculated somewhere along the lines of the above.
When doing multiple aggregations with Java 8’s Stream
API, you will quickly be forced to wrestle low-level API implementing complicated collectors and accumulators yourself. This is tedious and unnecessary. Consider the following SQL statement: SQL
|
In one go, we want to:
Java In a previous article, we’ve explained in detail how this can be achieved using convenience API from jOOλ via Seq.groupBy()
|
The above yields
{(1, 1)=(IntSummaryStatistics{count=2, sum=3, min=1, average=1.500000, max=2}, IntSummaryStatistics{count=2, sum=4, min=1, average=2.000000, max=3}), (4, 9)=(IntSummaryStatistics{count=2, sum=17, min=8, average=8.500000, max=9}, IntSummaryStatistics{count=2, sum=13, min=6, average=6.500000, max=7}), (5, 2)=(IntSummaryStatistics{count=3, sum=12, min=3, average=4.000000, max=5}, IntSummaryStatistics{count=3, sum=13, min=4, average=4.333333, max=5})}
For more details, read the full article here. Notice how using Stream.collect(), or Seq.groupBy() already makes for an implicit SELECT
clause, which we are no longer needed to obtain via map()
(see below).
As mentioned before, there aren’t really different ways of applying predicates with the Stream
API, there is only Stream.filter()
. In SQL, HAVING
is a “special” predicate clause that is syntactically put after the GROUP BY
clause. For instance: SQL
|
yielding
+-------+-------+ | v % 2 | count | +-------+-------+ | 1 | 2 | +-------+-------+
Java Unfortunately, as we have seen before, collect()
is a terminal operation in the Stream
API, which means that it eagerly produces a Map
, instead of transforming the Stream
into a Stream
, which would compose much better in complex Stream
. This means that any operation that we’d like to implement right after collecting will have to be implemented on a new stream produced from the output Map
:
|
yielding
1=IntSummaryStatistics{count=2, sum=4, min=1, average=2.000000, max=3}
As you can see, the type transformation that is applied is:
Map
Set>
Stream>
The SELECT
clause in SQL is nothing more than a tuple transformation function that takes the cartesian product of tuples produced by the FROM
clause and transforms it into a new tuple expression, which is fed either to the client, or to some higher-level query if this is a nested SELECT. An illustration: FROM output
+------+------+------+------+------+ | T1.A | T1.B | T1.C | T2.A | T2.D | +------+------+------+------+------+ | 1 | A | a | 1 | X | | 1 | B | b | 1 | Y | | 2 | C | c | 2 | X | | 2 | D | d | 2 | Y | +------+------+------+------+------+
Applying SELECT
SELECT t1.a, t1.c, t1.b || t1.d +------+------+--------------+ | T1.A | T1.C | T1.B || T1.D | +------+------+--------------+ | 1 | a | AX | | 1 | b | BY | | 2 | c | CX | | 2 | d | DY | +------+------+--------------+
Using Java 8 Streams, SELECT
can be achieved very simply by using Stream.map(), as we’ve already seen in previous examples, where we unnested tuples using map()
. The following examples are functionally equivalent: SQL
|
yielding
+----+----+ | c1 | c2 | +----+----+ | 3 | 6 | | 6 | 7 | +----+----+
Java
|
yielding
(3, 6) (6, 7)
The DISTINCT
keyword that can be supplied with the SELECT
clause simply removes duplicate tuples right after they have been produced by the SELECT
clause. An illustration: FROM output
+------+------+------+------+------+ | T1.A | T1.B | T1.C | T2.A | T2.D | +------+------+------+------+------+ | 1 | A | a | 1 | X | | 1 | B | b | 1 | Y | | 2 | C | c | 2 | X | | 2 | D | d | 2 | Y | +------+------+------+------+------+
Applying SELECT DISTINCT
SELECT DISTINCT t1.a +------+ | T1.A | +------+ | 1 | | 2 | +------+
Using Java 8 Streams, SELECT DISTINCT
can be achieved very simply by using Stream.distinct() right after Stream.map(). The following examples are functionally equivalent: SQL
|
yielding
+----+----+ | c1 | c2 | +----+----+ | 3 | 6 | | 6 | 7 | +----+----+
Java
|
yielding
(3, 6) (6, 7)
Set operations are powerful both in SQL and using the Stream
API. The UNION ALL
operation maps to Stream.concat()
, as can be seen below: SQL
|
yielding
+---+ | v | +---+ | 1 | | 2 | | 1 | | 3 | +---+
Java
|
yielding
1 2 1 3
Java (using jOOλ) Unfortunately, concat()
exists in Stream
only as a static
method, while Seq.concat() also exists on instances when working with jOOλ.
|
In SQL, UNION
is defined to remove duplicates after concatenating the two sets via UNION ALL
. The following two statements are equivalent:
SELECT * FROM t UNION SELECT * FROM u; -- equivalent SELECT DISTINCT * FROM ( SELECT * FROM t UNION ALL SELECT * FROM u );
Let’s put this in action: SQL
|
yielding
+---+ | v | +---+ | 1 | | 2 | | 3 | +---+
Java
|
The ORDER BY
mapping is trivial SQL
|
yielding
+---+ | v | +---+ | 1 | | 3 | | 4 | +---+
Java
|
yielding
1 3 4
The LIMIT
mapping is even more trivial SQL
|
yielding
+---+ | v | +---+ | 1 | | 4 | +---+
Java
|
yielding
1 4
The OFFSET
mapping is trivial as well SQL
|
yielding
+---+ | v | +---+ | 4 | | 3 | +---+
Java
|
yielding
4 3
In the above article, we’ve seen pretty much all the useful SQL SELECT
query clauses and how they can be mapped to the Java 8 Stream
API, or to jOOλ’s Seq
API, in case Stream
doesn’t offer sufficient functionality. The article shows that SQL’s declarative world is not that much different from Java 8’s functional world. SQL clauses can compose ad-hoc queries just as well as Stream
methods can be used to compose functional transformation pipelines. But there is a fundamental difference. While SQL is truly declarative, functional programming is still very instructive. The Stream
API does not make optimisation decisions based on constraints, indexes, histograms and other meta information about the data that you’re transforming. Using the Stream
API is like using all possible optimisation hints in SQL to force the SQL engine to choose one particular execution plan over another. However, while SQL is a higher level algorithm abstraction, the Stream
API may allow you to implement more customisable algorithms.
原文链接:
Common SQL Clauses and Their Equivalents in Java 8 Streams – Java, SQL and jOOQ.