I. Flink Column
The Flink column systematically introduces individual knowledge points, each illustrated with concrete examples.
1. Flink Deployment Series
This part covers the basics of deploying and configuring Flink.
2. Flink Fundamentals Series
This part covers Flink fundamentals such as terminology, architecture, the programming model, the programming guide, basic DataStream API usage, and the four cornerstones.
3. Flink Table API and SQL Fundamentals Series
This part covers basic usage of the Flink Table API and SQL, such as creating databases and tables, queries, window functions, catalogs, and so on.
4. Flink Table API and SQL Advanced Topics and Applications Series
This part covers applied Table API and SQL content that is closely tied to real production use, as well as material with some development difficulty.
5. Flink Monitoring Series
This part relates to day-to-day operations and monitoring work.
II. Flink Examples Column
The Flink Examples column supplements the Flink column. It generally does not introduce conceptual material; instead it provides concrete, directly usable examples. This column has no sub-directories; each article's content is clear from its link.
For the entry point to all articles in both columns, see: Flink 系列文章汇总索引 (Flink series article index).
This article introduces Flink's important concepts, detailed introductory DataSet and DataStream examples, and how to submit a job to run on YARN.
The important concepts include: tasks and operator chains, task slots and resources, slot sharing, and Flink's layered execution graphs.
This article assumes working Flink and Hadoop environments.
Some images in this article come from the internet.
This article has five parts: important Flink concepts, a DataSet WordCount example, a DataStream WordCount example using anonymous inner classes, a DataStream WordCount example using lambdas, and submitting the WordCount example to run on YARN.
Flink has supported unified stream and batch processing since Flink 1.12, and the DataSet API is no longer recommended.
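As a minimal, hedged sketch of that recommendation (assuming only the dependencies listed later in this article), the DataStream API can run a bounded job with batch semantics instead of using the DataSet API:

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchModeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Run this DataStream program with batch semantics on bounded input,
        // instead of writing it against the deprecated DataSet API.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);
        env.fromElements("flink", "hadoop", "hive").print();
        env.execute();
    }
}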
For more terminology, see: https://nightlies.apache.org/flink/flink-docs-release-1.17/zh/docs/concepts/glossary/
For distributed execution, Flink chains operator subtasks together into tasks. Each task is executed by one thread. Chaining operators into tasks is a useful optimization: it reduces the overhead of thread-to-thread handover and buffering, and it increases overall throughput while decreasing latency. Chaining behavior is configurable; see the chaining documentation for details.
The sample dataflow in the figure below is executed with five subtasks, and therefore with five parallel threads.
Data is passed between two operators in one of two modes: one-to-one (forwarding) mode, which preserves the partitioning and ordering of elements (for example, between two map operators), and redistributing mode, which changes the partitioning of the stream (for example, keyBy or rebalance).
When the client submits a job, it optimizes the operators: operators that can be merged are merged into a single operator, called an operator chain. An operator chain is effectively an execution chain, and each execution chain runs in its own thread on a TaskManager; that is a subtask.
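Chaining can also be influenced from user code. A minimal, hedged sketch of the DataStream API's chaining controls (the pipeline itself is made up for illustration; env is assumed to be a StreamExecutionEnvironment and lines a DataStream&lt;String&gt;):

// Disable chaining for the whole job:
env.disableOperatorChaining();

// Or control chaining per operator:
lines.map(String::trim)
        .startNewChain()     // begin a new chain starting at this operator
        .map(String::toLowerCase)
        .disableChaining();  // do not chain this operator to its neighbors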
Each worker (TaskManager) is a JVM process, and may execute one or more subtasks in separate threads. To control how many tasks a TaskManager accepts, it has so called task slots (at least one).
Each task slot represents a fixed subset of resources of the TaskManager. A TaskManager with three slots, for example, will dedicate 1/3 of its managed memory to each slot. Slotting the resources means that a subtask will not compete with subtasks from other jobs for managed memory, but instead has a certain amount of reserved managed memory. Note that no CPU isolation happens here; currently slots only separate the managed memory of tasks.
By adjusting the number of task slots, users can define how subtasks are isolated from each other. Having one slot per TaskManager means that each task group runs in a separate JVM (which can be started in a separate container, for example). Having multiple slots means more subtasks share the same JVM. Tasks in the same JVM share TCP connections (via multiplexing) and heartbeat messages. They may also share data sets and data structures, thus reducing the per-task overhead.
Each TaskManager is a JVM process. To control how many tasks a TaskManager (worker) accepts, Flink uses task slots.
The number of task slots limits how many worker threads can run concurrently in one TaskManager process. A task slot is the smallest unit of resource allocation within a TaskManager: the number of task slots in a TaskManager is the number of tasks it can process concurrently.
Flink divides the process's managed memory across the slots. As described above, this means a subtask does not compete with subtasks from other jobs for managed memory but instead has a reserved share of it (note that there is no CPU isolation; slots currently only separate managed memory).
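In a cluster, the slot count is set per TaskManager in flink-conf.yaml via taskmanager.numberOfTaskSlots. For local experimentation, a hedged sketch (the slot count of 2 is an arbitrary illustration):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.TaskManagerOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

Configuration conf = new Configuration();
conf.setInteger(TaskManagerOptions.NUM_TASK_SLOTS, 2); // slots per TaskManager
StreamExecutionEnvironment env =
        StreamExecutionEnvironment.createLocalEnvironment(2, conf);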
By default, Flink allows subtasks to share slots even if they are subtasks of different tasks, so long as they are from the same job. The result is that one slot may hold an entire pipeline of the job. Allowing this slot sharing has two main benefits:
A Flink cluster needs exactly as many task slots as the highest parallelism used in the job; there is no need to calculate how many tasks (with varying parallelism) a program contains in total.
It is easier to get better resource utilization: without slot sharing, the non-intensive source/map() subtasks would block as many resources as the resource-intensive window subtasks; with slot sharing, the heavy subtasks are fairly distributed among the TaskManagers while the slotted resources stay fully utilized.
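Slot sharing can be steered per operator through slot sharing groups; a hedged sketch (the group name "red" is made up; operators inherit the group of their inputs unless set explicitly, and the default group is "default"):

lines.filter(s -> !s.isEmpty())
        .slotSharingGroup("red"); // place this operator (and, by inheritance, its successors) in group "red"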
The dataflow graph mapped directly from a Flink program is the StreamGraph, also called the logical stream graph, because it is a high-level view of the computation logic.
To execute a streaming program, Flink translates the logical stream graph into a physical dataflow graph (also called the execution graph), which specifies in detail how the program is executed.
The execution graph in Flink has four layers: StreamGraph -> JobGraph -> ExecutionGraph -> physical execution graph.
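As a small sketch, the logical plan (the StreamGraph) of a program can be dumped as JSON and pasted into https://flink.apache.org/visualizer/ for inspection:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.fromElements("flink", "hadoop").map(String::toUpperCase).print();
// Print the StreamGraph as a JSON plan instead of executing the job.
System.out.println(env.getExecutionPlan());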
A Flink application consists of three main parts, Source/Transformation/Sink, as shown in the figure below:
<properties>
    <encoding>UTF-8</encoding>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <java.version>1.8</java.version>
    <scala.version>2.12</scala.version>
    <flink.version>1.12.0</flink.version>
</properties>
<dependencies>
    <dependency>
        <groupId>jdk.tools</groupId>
        <artifactId>jdk.tools</artifactId>
        <version>1.8</version>
        <scope>system</scope>
        <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-scala_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-api-scala-bridge_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-api-java-bridge_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-planner-blink_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-common</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.7</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.17</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.18.2</version>
        <scope>provided</scope>
    </dependency>
</dependencies>
Official documentation: Apache Flink 1.12 Documentation: Flink DataStream API Programming Guide
Flink programs look like regular programs that transform DataStreams/DataSets. Each program consists of the same basic parts:
1. Obtain an execution environment,
2. Load/create the initial data,
3. Specify transformations on this data,
4. Specify where to put the results of your computations,
5. Trigger the program execution.
The execution environment can be obtained in three ways:
getExecutionEnvironment() (recommended)
createLocalEnvironment()
createRemoteEnvironment(String host, int port, String... jarFiles)
Example:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
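A hedged sketch of targeting a running remote cluster instead (host, port, and jar path are made-up placeholders):

final StreamExecutionEnvironment remoteEnv = StreamExecutionEnvironment
        .createRemoteEnvironment("server1", 8081, "/path/to/wordcount.jar");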
env can load many kinds of sources, such as files, sockets, fromElements, and so on.
Example:
DataSet<String> lines = env.fromElements("flink hadoop hive", "flink hadoop hive", "flink hadoop", "flink");
DataStream<String> text = env.readTextFile("file:///path/to/file");
Transformations are one of Flink's core capabilities, and many implementations are available.
Example:
DataStream<Integer> parsed = input.map(new MapFunction<String, Integer>() {
@Override
public Integer map(String value) {
return Integer.parseInt(value);
}
});
A sink can write to many kinds of systems, such as relational databases, message queues, HDFS, Redis, and so on.
Example:
writeAsText(String path)
print()
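Beyond these built-in convenience methods, a custom sink can be attached with addSink. A minimal hedged sketch, assuming result is a DataStream&lt;Tuple2&lt;String, Integer&gt;&gt; like the one produced in the WordCount examples below:

// assumes: import org.apache.flink.streaming.api.functions.sink.SinkFunction;
result.addSink(new SinkFunction<Tuple2<String, Integer>>() {
    @Override
    public void invoke(Tuple2<String, Integer> value, Context context) {
        // Write each record wherever needed; here we only log it.
        System.out.println("sink received: " + value);
    }
});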
Once you specified the complete program you need to trigger the program execution by calling execute() on the StreamExecutionEnvironment. Depending on the type of the ExecutionEnvironment the execution will be triggered on your local machine or submit your program for execution on a cluster.
The execute() method will wait for the job to finish and then return a JobExecutionResult, which contains execution times and accumulator results.
If you don’t want to wait for the job to finish, you can trigger asynchronous job execution by calling executeAsync() on the StreamExecutionEnvironment. It will return a JobClient with which you can communicate with the job you just submitted. For instance, here is how to implement the semantics of execute() by using executeAsync().
Example:
final JobClient jobClient = env.executeAsync();
final JobExecutionResult jobExecutionResult = jobClient.getJobExecutionResult().get();
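The JobClient can also be used to manage the submitted job directly; a small sketch:

// Query the job's status, or request cancellation and wait for it to take effect.
System.out.println(jobClient.getJobStatus().get());
jobClient.cancel().get();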
That last part about program execution is crucial to understanding when and how Flink operations are executed. All Flink programs are executed lazily: When the program’s main method is executed, the data loading and transformations do not happen directly. Rather, each operation is created and added to a dataflow graph. The operations are actually executed when the execution is explicitly triggered by an execute() call on the execution environment. Whether the program is executed locally or on a cluster depends on the type of execution environment.
The lazy evaluation lets you construct sophisticated programs that Flink executes as one holistically planned unit.
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.AggregateOperator;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
/**
* @author alanchan
*
*/
public class WordCount_DataSet {
public static void main(String[] args) throws Exception {
// 1. Set up the execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// 2. Set up the source
DataSet<String> lines = env.fromElements("flink,hadoop,hive", "flink,hadoop,hive", "flink,hadoop", "flink");
// 3. Transformation: split each line into individual words
DataSet<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
@Override
public void flatMap(String value, Collector<String> out) throws Exception {
String[] words = value.split(",");
for (String word : words) {
out.collect(word);
}
}
});
words.print();
System.out.println("-------------------------------------------------------------------------------");
// 3. Transformation: map each word to a count of 1
DataSet<Tuple2<String, Integer>> wordCount = words.map(new MapFunction<String, Tuple2<String, Integer>>() {
@Override
public Tuple2<String, Integer> map(String value) throws Exception {
return new Tuple2<>(value, 1);
}
});
wordCount.print();
System.out.println("-------------------------------------------------------------------------------");
// 3. Transformation: group by word and sum the counts
AggregateOperator<Tuple2<String, Integer>> result = wordCount.groupBy(0).sum(1);
// 4. Sink: output
result.print();
// 5. Trigger execution (execute)
// When print() is used, a DataSet program does not need to call execute(); a DataStream program does.
}
}
flink
hadoop
hive
flink
hadoop
hive
flink
hadoop
flink
-------------------------------------------------------------------------------
(flink,1)
(hadoop,1)
(hive,1)
(flink,1)
(hadoop,1)
(hive,1)
(flink,1)
(hadoop,1)
(flink,1)
-------------------------------------------------------------------------------
(hadoop,3)
(flink,4)
(hive,2)
For the Maven dependencies, see Example 1.
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
/**
* @author alanchan
*
*/
public class WordCount_DataStream {
/**
* @param args
* @throws Exception
*/
public static void main(String[] args) throws Exception {
// 1. Prepare the environment (env)
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
// env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
// env.setRuntimeMode(RuntimeExecutionMode.BATCH);
// The default is RuntimeExecutionMode.STREAMING
// 2. Prepare the data (source)
DataStream<String> linesDS = env.fromElements("flink,hadoop,hive", "flink,hadoop,hive", "flink,hadoop", "flink");
// 3. Process the data (transformation)
DataStream<String> wordsDS = linesDS.flatMap(new FlatMapFunction<String, String>() {
@Override
public void flatMap(String value, Collector<String> out) throws Exception {
// value is one line of input
String[] words = value.split(",");
for (String word : words) {
out.collect(word); // collect and emit each word produced by the split
}
}
});
DataStream<Tuple2<String, Integer>> wordAndOnesDS = wordsDS.map(new MapFunction<String, Tuple2<String, Integer>>() {
@Override
public Tuple2<String, Integer> map(String value) throws Exception {
// value is a single incoming word
return Tuple2.of(value, 1);
}
});
// DataSet groups with groupBy; DataStream groups with keyBy
KeyedStream<Tuple2<String, Integer>, String> groupedDS = wordAndOnesDS.keyBy(t -> t.f0);
DataStream<Tuple2<String, Integer>> result = groupedDS.sum(1);
// 4. Output the result (sink)
result.print();
// 5. Trigger execution (execute)
env.execute(); // DataStream programs must call execute()
}
}
15> (hadoop,3)
13> (flink,4)
2> (hive,2)
For the Maven dependencies, see Example 1.
import java.util.Arrays;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
/**
* @author alanchan
*
*/
public class WordCount_DataStream_Lambda {
/**
* @param args
*/
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// String ip = "192.168.10.42";
// int port = 9999;
// DataStream lines = env.socketTextStream(ip, port);
DataStream<String> linesDS = env.fromElements("flink,hadoop,hive", "flink,hadoop,hive", "flink,hadoop", "flink");
wc(env, linesDS);
// execute: start the job and wait for it to finish
env.execute();
}
// implemented with lambda expressions
public static void wc(StreamExecutionEnvironment env,DataStream<String> lines) throws Exception {
// @Override
// public void flatMap(String value, Collector out) throws Exception {
// String[] words = value.split(" ");
// for (String word : words) {
// out.collect(word);
// }
// }
lines.flatMap((String value, Collector<String> out) -> Arrays.stream(value.split(",")).forEach(out::collect)).returns(Types.STRING)
// @Override
// public Tuple2 map(String value) throws Exception {
// return new Tuple2(value, 1);
// }
.map((String value) -> new Tuple2<>(value, 1)).returns(Types.TUPLE(Types.STRING, Types.INT))
// @Override
// public String getKey(Tuple2 value) throws Exception {
// return value.f0;
// }
.keyBy(t -> t.f0).sum(1).print();
}
}
15> (hadoop,1)
2> (hive,1)
15> (hadoop,2)
2> (hive,2)
13> (flink,1)
13> (flink,2)
13> (flink,3)
15> (hadoop,3)
13> (flink,4)
Submit the job to the Flink environment to run on YARN; this example uses per-job mode.
For the Maven dependencies, see Example 1.
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
/**
* @author alanchan
*
*/
public class WordCount_DataStream_onyarn {
public static void main(String[] args) throws Exception {
// 1. Prepare the environment (env)
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
// env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
// env.setRuntimeMode(RuntimeExecutionMode.BATCH);
// 2. Prepare the data (source)
DataStream<String> linesDS = env.fromElements("flink,hadoop,hive", "flink,hadoop,hive", "flink,hadoop", "flink");
// 3. Process the data (transformation)
DataStream<String> wordsDS = linesDS.flatMap(new FlatMapFunction<String, String>() {
@Override
public void flatMap(String value, Collector<String> out) throws Exception {
// value is one line of input
String[] words = value.split(",");
for (String word : words) {
out.collect(word); // collect and emit each word produced by the split
}
}
});
DataStream<Tuple2<String, Integer>> wordAndOnesDS = wordsDS.map(new MapFunction<String, Tuple2<String, Integer>>() {
@Override
public Tuple2<String, Integer> map(String value) throws Exception {
// value is a single incoming word
return Tuple2.of(value, 1);
}
});
KeyedStream<Tuple2<String, Integer>, String> groupedDS = wordAndOnesDS.keyBy(t -> t.f0);
DataStream<Tuple2<String, Integer>> result = groupedDS.sum(1);
// 4. Output the result (sink)
// Set the HDFS user name
System.setProperty("HADOOP_USER_NAME", "alanchan");
// Read command-line parameters
ParameterTool params = ParameterTool.fromArgs(args);
// Write the result to HDFS
String output = params.get("output") + "-" + System.currentTimeMillis();
result.writeAsText(output).setParallelism(1);
// 5. Trigger execution (execute)
env.execute(); // DataStream programs must call execute()
}
}
# In the project directory containing pom.xml, type cmd in the Explorer address bar and press Enter,
# or cd to that directory in a cmd window, then run:
mvn package -Dmaven.test.skip=true
Upload the packaged jar to the server where Flink is installed, or submit the job through the Flink web UI.
In this example the jar was renamed to original-wc.jar (not required) and uploaded to the /usr/local/bigdata/testdata directory on server1.
Official documentation: Apache Flink 1.12 Documentation: Execution Mode (Batch/Streaming)
/usr/local/flink-1.13.5/bin/flink run -Dexecution.runtime-mode=BATCH -m yarn-cluster -yjm 2048 -ytm 2048 -c org.wc.WordCount_DataStream_onyarn /usr/local/bigdata/testdata/original-wc.jar --output hdfs://server2:8020//flinktest/wc
YARN web UI: http://server1:8088/cluster
Check the job's running status.
Flink web UI: http://server1:8082/#/job/completed
Check the job's running status.
HDFS web UI: http://192.168.10.42:9870/explorer.html#/flinktest
Check the output.
This article introduced Flink's important concepts, detailed introductory DataSet and DataStream examples, and how to submit a job to run on YARN.