Apache Flink offers a Table API as a unified, relational API for batch and stream processing, i.e., queries are executed with the same semantics on unbounded, real-time streams or bounded, batch data sets and produce the same results.
The Table API in Flink is commonly used to ease the definition of data analytics, data pipelining, and ETL applications.
Since I didn't know what ETL meant yet, here is the Baidu Baike definition.
ETL (data warehouse technology)
ETL is short for Extract-Transform-Load. It describes the process of extracting data from a source, transforming it, and loading it into a destination. The term is most often used in the context of data warehouses, but its scope is not limited to them.
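In SQL terms, an ETL step often boils down to a single INSERT ... SELECT; the sketch below is purely illustrative (raw_orders and clean_orders are hypothetical tables, not part of this walkthrough):
INSERT INTO clean_orders           -- load into the destination
SELECT order_id, amount * 100      -- transform the extracted values (e.g. dollars to cents)
FROM raw_orders                    -- extract from the source
WHERE amount IS NOT NULL;          -- basic cleansing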
In this tutorial, you will learn how to build a real-time dashboard to track financial transactions by account.
The pipeline will read data from Kafka and write the results to MySQL, where they are visualized via Grafana.
(I haven't used Grafana yet, but let's take it one step at a time; here is the official site for now.)
Official site: https://grafana.com/
This walkthrough assumes that you have some familiarity with Java or Scala, but you should be able to follow along even if you come from a different programming language. It also assumes that you are familiar with basic relational concepts such as SELECT and GROUP BY clauses.
In short: you should know one of Java, Scala, or Python, and be comfortable with basic SQL.
As with the previous post, search around or ask on a forum if you get stuck.
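If you need a quick refresher on SELECT and GROUP BY, a generic query like the one below (against a hypothetical orders table, unrelated to this walkthrough) is all the relational background the tutorial assumes:
SELECT customer_id, SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id;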
If you want to follow along, you will require a computer with Java, Maven, and Docker installed.
The required configuration files are available in the flink-playgrounds repository.
Once downloaded, open the project flink-playground/table-walkthrough in your IDE and navigate to the file SpendReport.
EnvironmentSettings settings = EnvironmentSettings.newInstance().build();
TableEnvironment tEnv = TableEnvironment.create(settings);
tEnv.executeSql("CREATE TABLE transactions (\n" +
" account_id BIGINT,\n" +
" amount BIGINT,\n" +
" transaction_time TIMESTAMP(3),\n" +
" WATERMARK FOR transaction_time AS transaction_time - INTERVAL '5' SECOND\n" +
") WITH (\n" +
" 'connector' = 'kafka',\n" +
" 'topic' = 'transactions',\n" +
" 'properties.bootstrap.servers' = 'kafka:9092',\n" +
" 'format' = 'csv'\n" +
")");
tEnv.executeSql("CREATE TABLE spend_report (\n" +
" account_id BIGINT,\n" +
" log_ts TIMESTAMP(3),\n" +
" amount BIGINT\n," +
" PRIMARY KEY (account_id, log_ts) NOT ENFORCED" +
") WITH (\n" +
" 'connector' = 'jdbc',\n" +
" 'url' = 'jdbc:mysql://mysql:3306/sql-demo',\n" +
" 'table-name' = 'spend_report',\n" +
" 'driver' = 'com.mysql.jdbc.Driver',\n" +
" 'username' = 'sql-demo',\n" +
" 'password' = 'demo-sql'\n" +
")");
Table transactions = tEnv.from("transactions");
report(transactions).executeInsert("spend_report");
The Execution Environment
The first two lines set up your TableEnvironment. The table environment is how you can set properties for your Job, specify whether you are writing a batch or a streaming application, and create your sources.
This walkthrough creates a standard table environment that uses streaming execution.
EnvironmentSettings settings = EnvironmentSettings.newInstance().build();
TableEnvironment tEnv = TableEnvironment.create(settings);
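Building the settings with no further options gives you a streaming environment by default. If you prefer to state the mode explicitly (or to switch to batch execution, as the test later in this walkthrough does), the builder exposes it directly; a minimal variant:
EnvironmentSettings settings = EnvironmentSettings.newInstance()
    .inStreamingMode() // use .inBatchMode() for bounded, batch-style execution
    .build();
TableEnvironment tEnv = TableEnvironment.create(settings);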
Registering Tables
Next, tables are registered in the current catalog that you can use to connect to external systems for reading and writing both batch and streaming data.
A table source provides access to data stored in external systems, such as a database, a key-value store, a message queue, or a file system.
A table sink emits a table to an external storage system.
Depending on the type of source and sink, they support different formats such as CSV, JSON, Avro, or Parquet.
tEnv.executeSql("CREATE TABLE transactions (\n" +
" account_id BIGINT,\n" +
" amount BIGINT,\n" +
" transaction_time TIMESTAMP(3),\n" +
" WATERMARK FOR transaction_time AS transaction_time - INTERVAL '5' SECOND\n" +
") WITH (\n" +
" 'connector' = 'kafka',\n" +
" 'topic' = 'transactions',\n" +
" 'properties.bootstrap.servers' = 'kafka:9092',\n" +
" 'format' = 'csv'\n" +
")");
Two tables are registered: a transaction input table and a spend report output table.
The transactions (transactions) table lets us read credit card transactions, which contain account IDs (account_id), timestamps (transaction_time), and US$ amounts (amount).
The table is a logical view over a Kafka topic called transactions containing CSV data.
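With 'format' = 'csv', the fields of each Kafka record are mapped by position onto the declared columns; a made-up record like the one below would therefore be read as account_id = 3, amount = 210, and transaction_time = 2021-01-01 01:23:47:
3,210,2021-01-01 01:23:47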
tEnv.executeSql("CREATE TABLE spend_report (\n" +
" account_id BIGINT,\n" +
" log_ts TIMESTAMP(3),\n" +
" amount BIGINT\n," +
" PRIMARY KEY (account_id, log_ts) NOT ENFORCED" +
") WITH (\n" +
" 'connector' = 'jdbc',\n" +
" 'url' = 'jdbc:mysql://mysql:3306/sql-demo',\n" +
" 'table-name' = 'spend_report',\n" +
" 'driver' = 'com.mysql.jdbc.Driver',\n" +
" 'username' = 'sql-demo',\n" +
" 'password' = 'demo-sql'\n" +
")");
The second table, spend_report, stores the final results of the aggregation.
Its underlying storage is a table in a MySQL database.
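The JDBC connector writes into an existing table on the MySQL side (the playground's Docker setup provisions it for you). Its schema presumably mirrors the declaration above; a sketch of what that MySQL DDL plausibly looks like:
CREATE TABLE spend_report (
    account_id BIGINT NOT NULL,
    log_ts     TIMESTAMP(3) NOT NULL,
    amount     BIGINT NOT NULL,
    PRIMARY KEY (account_id, log_ts)
);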
The Query
With the environment configured and tables registered, you are ready to build your first application.
From the TableEnvironment you can read the rows of an input table and then write those results into an output table using executeInsert.
The report function is where you will implement your business logic. It is currently unimplemented.
Table transactions = tEnv.from("transactions");
report(transactions).executeInsert("spend_report");
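In the starter project the method body is still a placeholder; a minimal sketch of what that stub might look like (the playground's actual placeholder exception may differ):
public static Table report(Table transactions) {
    // TODO: replace this with the aggregation logic developed below
    throw new UnsupportedOperationException("report() is not implemented yet");
}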
The project contains a secondary testing class SpendReportTest that validates the logic of the report. It creates a table environment in batch mode.
EnvironmentSettings settings = EnvironmentSettings.newInstance().inBatchMode().build();
TableEnvironment tEnv = TableEnvironment.create(settings);
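The real SpendReportTest ships with the project, but conceptually it does something along these lines: build a small bounded table with fromValues, run report, and inspect the aggregated rows. The class below is a simplified, self-contained sketch with made-up values, not the playground's actual test code:
import static org.apache.flink.table.api.Expressions.row;

import java.time.LocalDateTime;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.types.Row;
import org.apache.flink.util.CloseableIterator;

public class SpendReportSketch {
    public static void main(String[] args) throws Exception {
        // Batch-mode environment: the same report() logic runs on a bounded, in-memory table.
        EnvironmentSettings settings = EnvironmentSettings.newInstance().inBatchMode().build();
        TableEnvironment tEnv = TableEnvironment.create(settings);

        // Two transactions for the same account within the same hour (values are made up).
        Table transactions = tEnv.fromValues(
            DataTypes.ROW(
                DataTypes.FIELD("account_id", DataTypes.BIGINT()),
                DataTypes.FIELD("amount", DataTypes.BIGINT()),
                DataTypes.FIELD("transaction_time", DataTypes.TIMESTAMP(3))),
            row(1L, 100L, LocalDateTime.parse("2021-01-01T01:10:00")),
            row(1L, 50L, LocalDateTime.parse("2021-01-01T01:40:00")));

        // Expect one aggregated row: account 1, hour 2021-01-01 01:00:00, amount 150.
        try (CloseableIterator<Row> rows = SpendReport.report(transactions).execute().collect()) {
            rows.forEachRemaining(System.out::println);
        }
    }
}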
One of Flink’s unique properties is that it provides consistent semantics across batch and streaming.
This means you can develop and test applications in batch mode on static datasets, and deploy to production as streaming applications.
Now, with the skeleton of a Job set up, you are ready to add some business logic.
The goal is to build a report that shows the total spend for each account across each hour of the day.
This means the timestamp column needs to be rounded down from millisecond to hour granularity.
Flink supports developing relational applications in pure SQL or using the Table API.
The Table API is a fluent DSL inspired by SQL that can be written in Python, Java, or Scala and supports strong IDE integration.
Just like a SQL query, Table programs can select the required fields and group by your keys.
With these features, along with built-in functions like floor and sum, you can write this report.
public static Table report(Table transactions) {
return transactions.select(
$("account_id"),
$("transaction_time").floor(TimeIntervalUnit.HOUR).as("log_ts"),
$("amount"))
.groupBy($("account_id"), $("log_ts"))
.select(
$("account_id"),
$("log_ts"),
$("amount").sum().as("amount"));
}
Flink contains a limited number of built-in functions, and sometimes you need to extend it with a user-defined function. If floor wasn’t predefined, you could implement it yourself.
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;
import org.apache.flink.table.annotation.DataTypeHint;
import org.apache.flink.table.functions.ScalarFunction;
public class MyFloor extends ScalarFunction {
public @DataTypeHint("TIMESTAMP(3)") LocalDateTime eval(
@DataTypeHint("TIMESTAMP(3)") LocalDateTime timestamp) {
return timestamp.truncatedTo(ChronoUnit.HOURS);
}
}
And then quickly integrate it into your application.
public static Table report(Table transactions) {
return transactions.select(
$("account_id"),
call(MyFloor.class, $("transaction_time")).as("log_ts"),
$("amount"))
.groupBy($("account_id"), $("log_ts"))
.select(
$("account_id"),
$("log_ts"),
$("amount").sum().as("amount"));
}
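Passing the class to call(...) uses the function inline. If you would rather register the function once under a name and reuse it elsewhere (including from SQL), the table environment also supports that; a small sketch, assuming the same static imports ($ and call) as above:
tEnv.createTemporarySystemFunction("MyFloor", MyFloor.class);

Table withHour = transactions.select(
    $("account_id"),
    call("MyFloor", $("transaction_time")).as("log_ts"),
    $("amount"));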
This query consumes all records from the transactions table, calculates the report, and outputs the results in an efficient, scalable manner.
Running the test with this implementation will pass.
Grouping data based on time is a typical operation in data processing, especially when working with infinite streams.
A grouping based on time is called a window and Flink offers flexible windowing semantics.
The most basic type of window is called a Tumble window, which has a fixed size and whose buckets do not overlap.
public static Table report(Table transactions) {
return transactions
.window(Tumble.over(lit(1).hour()).on($("transaction_time")).as("log_ts"))
.groupBy($("account_id"), $("log_ts"))
.select(
$("account_id"),
$("log_ts").start().as("log_ts"),
$("amount").sum().as("amount"));
}
This defines your application as using one hour tumbling windows based on the timestamp column.
So a row with timestamp 2019-06-01 01:23:47 is put in the 2019-06-01 01:00:00 window.
Aggregations based on time are unique because time, as opposed to other attributes, generally moves forward in a continuous streaming application.
Unlike floor and your UDF, window functions are intrinsics, which allows the runtime to apply additional optimizations.
In a batch context, windows offer a convenient API for grouping records by a timestamp attribute.
Running the test with this implementation will also pass.
And that’s it, a fully functional, stateful, distributed streaming application!
The query continuously consumes the stream of transactions from Kafka, computes the hourly spendings, and emits results as soon as they are ready.
Since the input is unbounded, the query keeps running until it is manually stopped.
And because the Job uses time window-based aggregations, Flink can perform specific optimizations such as state clean up when the framework knows that no more records will arrive for a particular window.
The table playground is fully dockerized and runnable locally as a streaming application.
The environment contains a Kafka topic, a continuous data generator, MySql, and Grafana.
From within the table-walkthrough folder, start the docker-compose script.
$ docker-compose build
$ docker-compose up -d
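When you are done with the playground, you can tear it down again from the same folder; the -v flag also removes the volumes so no generated data lingers:
$ docker-compose down -v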
You can see information on the running job via the Flink console.
http://localhost:8082/
Explore the results from inside MySQL.
$ docker-compose exec mysql mysql -Dsql-demo -usql-demo -pdemo-sql
mysql> use sql-demo;
Database changed
mysql> select count(*) from spend_report;
+----------+
| count(*) |
+----------+
|      110 |
+----------+
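The count grows as new windows are written. You can also run any other ad-hoc query against the report while the job is running, for example a (purely illustrative) per-account total:
mysql> SELECT account_id, SUM(amount) AS total_spend FROM spend_report GROUP BY account_id;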
Finally, go to Grafana to see the fully visualized result!