Apache Flink是一个框架和分布式处理引擎,用于在无边界和有边界数据流上进行有状态的计算。
官网:https://flink.apache.org/zh/flink-architecture.html
有谁在用?
Apache Flink 为全球许多公司和企业的关键业务提供支持。在这个页面上,我们展示了一些著名的 Flink 用户,他们在生产中运行着有意思的用例,并提供了展示更详细信息的链接。
在项目的 wiki 页面中有一个 谁在使用 Flink 的页面,展示了更多的 Flink 用户。请注意,该列表并不全面。我们只添加明确要求列出的用户。
大厂一般做实时数仓建设、实时数据监控、实时反作弊风控、画像系统等。
(1)无界流:有定义流的开始,但没有定义流的结束。它们会无休止地产生数据。无界流的数据必须持续处理,即数据被摄取后需要立刻处理。我们不能等到所有数据都到达再处理,因为输入是无限的,在任何时候输入都不会完成。处理无界数据通常要求以特定顺序摄取事件,例如事件发生的顺序,以便能够推断结果的完整性。
(2)有界流:有定义流的开始,也有定义流的结束。有界流可以在摄取所有数据后再进行计算。有界流的所有数据可以被排序,所以并不需要有序摄取。有界流处理通常被称为批处理。
Apache Flink 擅长处理无界和有界数据集。精确的时间控制和状态化使得 Flink 的运行时(runtime)能够运行任何处理无界流的应用。有界流则由一些专为固定大小数据集特殊设计的算法和数据结构进行内部处理,产生了出色的性能。
有状态的 Flink 程序针对本地状态访问进行了优化。任务的状态始终保留在内存中,如果状态大小超过可用内存,则会保存在能高效访问的磁盘数据结构中。任务通过访问本地(通常在内存中)状态来进行所有的计算,从而产生非常低的处理延迟。Flink 通过定期和异步地对本地状态进行持久化存储来保证故障场景下精确一次的状态一致性。
(1)事件驱动型应用
什么是事件驱动型应用?
事件驱动型应用是一类具有状态的应用,它从一个或多个事件流提取数据,并根据到来的事件触发计算、状态更新或其他外部动作。
事件驱动型应用是在计算存储分离的传统应用基础上进化而来。在传统架构中,应用需要读写远程事务型数据库。
相反,事件驱动型应用是基于状态化流处理来完成。在该设计中,数据和计算不会分离,应用只需访问本地(内存或磁盘)即可获取数据。系统容错性的实现依赖于定期向远程持久化存储写入 checkpoint。下图描述了传统应用和事件驱动型应用架构的区别。
事件驱动型应用的优势?
事件驱动型应用无须查询远程数据库,本地数据访问使得它具有更高的吞吐和更低的延迟。而由于定期向远程持久化存储的 checkpoint 工作可以异步、增量式完成,因此对于正常事件处理的影响甚微。事件驱动型应用的优势不仅限于本地数据访问。传统分层架构下,通常多个应用会共享同一个数据库,因而任何对数据库自身的更改(例如:由应用更新或服务扩容导致数据布局发生改变)都需要谨慎协调。反观事件驱动型应用,由于只需考虑自身数据,因此在更改数据表示或服务扩容时所需的协调工作将大大减少。
典型的事件驱动型应用实例
(2)数据分析应用
什么是数据分析应用?
数据分析任务需要从原始数据中提取有价值的信息和指标。传统的分析方式通常是利用批查询,或将事件记录下来并基于此有限数据集构建应用来完成。为了得到最新数据的分析结果,必须先将它们加入分析数据集并重新执行查询或运行应用,随后将结果写入存储系统或生成报告。
借助一些先进的流处理引擎,还可以实时地进行数据分析。和传统模式下读取有限数据集不同,流式查询或应用会接入实时事件流,并随着事件消费持续产生和更新结果。这些结果数据可能会写入外部数据库系统或以内部状态的形式维护。仪表展示应用可以相应地从外部数据库读取数据或直接查询应用的内部状态。
流式分析应用的优势?
和批量分析相比,由于流式分析省掉了周期性的数据导入和查询过程,因此从事件中获取指标的延迟更低。不仅如此,批量查询必须处理那些由定期导入和输入有界性导致的人工数据边界,而流式查询则无须考虑该问题。
另一方面,流式分析会简化应用抽象。批量查询的流水线通常由多个独立部件组成,需要周期性地调度提取数据和执行查询。如此复杂的流水线操作起来并不容易,一旦某个组件出错将会影响流水线的后续步骤。而流式分析应用整体运行在 Flink 之类的高端流处理系统之上,涵盖了从数据接入到连续结果计算的所有步骤,因此可以依赖底层引擎提供的故障恢复机制。
Flink 如何支持数据分析类应用?
Flink 为持续流式分析和批量分析都提供了良好的支持。具体而言,它内置了一个符合 ANSI 标准的 SQL 接口,将批、流查询的语义统一起来。无论是在记录事件的静态数据集上还是实时事件流上,相同 SQL 查询都会得到一致的结果。同时 Flink 还支持丰富的用户自定义函数,允许在 SQL 中执行定制化代码。如果还需进一步定制逻辑,可以利用 Flink DataStream API 和 DataSet API 进行更低层次的控制。此外,Flink 的 Gelly 库为基于批量数据集的大规模高性能图分析提供了算法和构建模块支持。
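下面给出一个极简的 Table API / SQL 示意,体现"同一条 SQL 在批和流上语义一致"的说法。注意:这段代码并非上文工程里的代码,其中的表名 video_order、datagen 连接器,以及 flink-table-api-java-bridge、flink-table-planner-blink 等依赖,都是为了演示而做的假设。
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
/**
 * Flink SQL 示意:同一条聚合 SQL 既可作用于有界数据集,也可作用于无界流
 */
public class FlinkSqlSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
        //用 datagen 连接器造一张演示表,实际可替换为 Kafka、文件等任意连接器
        tEnv.executeSql("CREATE TABLE video_order (title STRING, money INT) " +
                "WITH ('connector' = 'datagen', 'rows-per-second' = '1')");
        //按 title 分组统计金额,在流上会以持续更新的方式输出结果
        Table result = tEnv.sqlQuery("SELECT title, SUM(money) AS total FROM video_order GROUP BY title");
        result.execute().print();
    }
}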
(3)数据管道应用
什么是数据管道?
提取-转换-加载(ETL)是一种在存储系统之间进行数据转换和迁移的常用方法。ETL 作业通常会周期性地触发,将数据从事务型数据库拷贝到分析型数据库或数据仓库。
数据管道和 ETL 作业的用途相似,都可以转换、丰富数据,并将其从某个存储系统移动到另一个。但数据管道是以持续流模式运行,而非周期性触发。因此它支持从一个不断生成数据的源头读取记录,并将它们以低延迟移动到终点。例如:数据管道可以用来监控文件系统目录中的新文件,并将其数据写入事件日志;另一个应用可能会将事件流物化到数据库或增量构建和优化查询索引。
数据管道的优势?
和周期性 ETL 作业相比,持续数据管道可以明显降低将数据移动到目的端的延迟。此外,由于它能够持续消费和发送数据,因此用途更广,支持用例更多。
Flink 如何支持数据管道应用?
很多常见的数据转换和增强操作可以利用 Flink 的 SQL 接口(或 Table API)及用户自定义函数解决。如果数据管道有更高级的需求,可以选择更通用的 DataStream API 来实现。Flink 为多种数据存储系统(如:Kafka、Kinesis、Elasticsearch、JDBC数据库系统等)内置了连接器。同时它还提供了文件系统的连续型数据源及数据汇,可用来监控目录变化和以时间分区的方式写入文件。
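下面是一个持续监控目录的简化示意,对应上文"文件系统的连续型数据源"的说法。目录路径为假设值,实际使用时替换成自己的路径即可。
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
/**
 * 数据管道示意:持续监控目录并读取新文件(路径仅为示例)
 */
public class ContinuousFilePipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        String path = "file:///tmp/input-dir";
        //每隔 1 秒扫描一次目录,发现新文件就读取并下发
        DataStream<String> lines = env.readFile(
                new TextInputFormat(new Path(path)),
                path,
                FileProcessingMode.PROCESS_CONTINUOUSLY,
                1000L);
        //这里简单打印,实际可以写入 Kafka、数据库等下游系统
        lines.print();
        env.execute("continuous file pipeline");
    }
}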
典型的数据管道应用实例
<properties>
    <encoding>UTF-8</encoding>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <java.version>1.8</java.version>
    <scala.version>2.12</scala.version>
    <flink.version>1.13.1</flink.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.18.16</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_${scala.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-scala_${scala.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_${scala.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_${scala.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.7</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.17</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.44</version>
    </dependency>
</dependencies>
新建log4j.properties,日志文件
### 配置appender名称
log4j.rootLogger = debug, debugFile, errorFile
### debug级别以上的日志到:src/logs/debug.log
log4j.appender.debugFile = org.apache.log4j.DailyRollingFileAppender
log4j.appender.debugFile.File = src/logs/flink.log
log4j.appender.debugFile.Append = true
### Threshold属性指定输出等级
log4j.appender.debugFile.Threshold = info
log4j.appender.debugFile.layout = org.apache.log4j.PatternLayout
log4j.appender.debugFile.layout.ConversionPattern = %-d{yyyy-MM-dd HH:mm:ss} [ %t:%r ] - [ %p ] %n%m%n
### error级别以上的日志 src/logs/error.log
log4j.appender.errorFile = org.apache.log4j.DailyRollingFileAppender
log4j.appender.errorFile.File = src/logs/error.log
log4j.appender.errorFile.Append = true
log4j.appender.errorFile.Threshold = error
log4j.appender.errorFile.layout = org.apache.log4j.PatternLayout
log4j.appender.errorFile.layout.ConversionPattern = %-d{yyyy-MM-dd HH:mm:ss} [ %t:%r ] - [ %p ] %n%m%n
/**
* @author lixiang
*/
public class Test {
private static List<String> list = new ArrayList<>();
static {
list.add("SpringBoot,Docker");
list.add("Netty,SpringCloud");
list.add("Flink,Linux");
}
public static void test1() {
Tuple3<Integer, String, Integer> tuple3 = Tuple3.of(1, "lixiang", 23);
System.out.println(tuple3.f0);
System.out.println(tuple3.f1);
System.out.println(tuple3.f2);
}
public static void test2() {
List<String> collect = list.stream().map(obj -> obj + "拼接").collect(Collectors.toList());
System.out.println(collect);
}
public static void test3() {
List<String> collect = list.stream().flatMap(obj -> Arrays.stream(obj.split(","))).collect(Collectors.toList());
System.out.println(collect);
}
public static void main(String[] args) {
test1();
test2();
test3();
}
}
/**
* @author lixiang
*/
public class FlinkDemo {
public static void main(String[] args) throws Exception {
//构建执行任务环境以及任务的启动入口,存储全局相关的参数
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//设置并行度
env.setParallelism(1);
//相同类型元素的数据流
DataStream<String> stringDataStream = env.fromElements("Java,SpringBoot,SpringCloud", "Java,Linux,Docker");
stringDataStream.print("处理前");
//FlatMapFunction<T, O>:T是输入类型,O是Collector收集的输出类型,看源码的注释,也就是返回的DataStream里面的泛型类型
DataStream<String> flatMapDataStream = stringDataStream.flatMap(new FlatMapFunction<String, String>() {
@Override
public void flatMap(String value, Collector<String> out) throws Exception {
String[] arr = value.split(",");
for (String s : arr) {
out.collect(s);
}
}
});
flatMapDataStream.print("处理后");
//DataStream需要execute,可以取个名称
env.execute("data stream job");
}
}
可以设置多个线程并行执行
//设置并行度
env.setParallelism(3);
(1)Flink和Blink关系
(2)算子Operator
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
(3)Flink在生产环境中的用法
(4)Flink 部署方式是灵活的,主要区别在于对Flink计算时所需资源的管理方式不同
(1)增加maven依赖
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-runtime-web_${scala.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
访问方式:ip:8081
(2)代码开发
/**
* flink UI demo
* @author lixiang
*/
public class FlinkUI {
public static void main(String[] args) throws Exception {
//构建执行任务环境以及任务的启动入口
final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
//设置并行度
env.setParallelism(1);
//监听192.168.139.80:8888输送过来的数据流
DataStreamSource<String> stream = env.socketTextStream("192.168.139.80", 8888);
//流处理
SingleOutputStreamOperator<String> streamOperator = stream.flatMap(new FlatMapFunction<String, String>() {
@Override
public void flatMap(String value, Collector<String> out) throws Exception {
String[] split = value.split(",");
for (String s : split) {
out.collect(s);
}
}
});
streamOperator.print("处理后");
//执行任务
env.execute("data stream job");
}
}
访问:127.0.0.1:8081
(1)运行流程
(1)Flink是一个分布式系统,需要有效分配和管理计算资源才能执行流应用程序。
(2)什么是JobManager
(3)什么是TaskManager
(4)JobManager进程由三个不同的组件组成
(5)TaskManager中task slot的数量表示并发处理task的数量
(6)Task Slots任务槽
(1)Flink是分布式流式计算框架
(2)流程
(3)并行度的调整配置
(4)一个很重要的区分:TaskSlot和parallelism并行度配置
(5)Flink有3种运行模式
env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
(1)Flink的API层级为流式/批式处理应用程序的开发提供了不同级别的抽象
(2)Flink编程模型
(3)Source来源
(4)Connectors与第三方系统进行对接(用于source或者sink都可以)
(5)Apache Bahir连接器
元素集合
代码实战
/**
* @author lixiang
*/
public class FlinkSourceDemo {
public static void main(String[] args) throws Exception {
//构建执行任务环境以及任务的启动入口
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStreamSource<String> ds1 = env.fromElements("java,springboot", "kafka,redis", "openstack,k8s,docker");
ds1.print("ds1");
DataStreamSource<String> ds2 = env.fromCollection(Arrays.asList("hive", "hadoop", "hbase", "rabbitmq", "java"));
ds2.print("ds2");
DataStreamSource<Long> ds3 = env.fromSequence(0, 10);
ds3.print("ds3");
//执行任务
env.execute("data job");
}
}
文件/文件系统
/**
* @author lixiang
*/
public class FlinkSourceDemo2 {
public static void main(String[] args) throws Exception {
//构建执行任务环境以及任务的启动的入口
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStream<String> textDS = env.readTextFile("E:\\软件资料\\log.txt");
//DataStream hdfsDS = env.readTextFile("hdfs://lixiang:8010/file/log/words.txt");
textDS.print("textDS");
env.execute("text job");
}
}
基于Socket
/**
* flink UI demo
* @author lixiang
*/
public class FlinkUI {
public static void main(String[] args) throws Exception {
//构建执行任务环境以及任务的启动入口
final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
//设置并行度
env.setParallelism(1);
//监听192.168.139.80:8888输送过来的数据流
DataStreamSource<String> stream = env.socketTextStream("192.168.139.80", 8888);
//流处理
SingleOutputStreamOperator<String> streamOperator = stream.flatMap(new FlatMapFunction<String, String>() {
@Override
public void flatMap(String value, Collector<String> out) throws Exception {
String[] split = value.split(",");
for (String s : split) {
out.collect(s);
}
}
});
streamOperator.print("处理后");
//执行任务
env.execute("data stream job");
}
}
自定义Source,实现接口自定义数据源
设置实体VideoOrder
/**
* 订单实体类
* @author lixiang
*/
@Data
@NoArgsConstructor
@AllArgsConstructor
public class VideoOrder
{
private String tradeNo;
private String title;
private int money;
private int userId;
private Date createTime;
}
设置VideoOrderSource
/**
* @author lixiang
*/
public class VideoOrderSource extends RichParallelSourceFunction<VideoOrder>
{
private volatile Boolean flag = true;
private Random random = new Random();
private static List<String> list = new ArrayList<>();
static
{
list.add("SpringBoot2.x课程");
list.add("Linux入到到精通");
list.add("Flink流式技术课程");
list.add("Kafka流式处理消息平台");
list.add("微服务SpringCloud教程");
}
@Override
public void run(SourceContext<VideoOrder> sourceContext) throws Exception {
int x = 0;
while (flag)
{
Thread.sleep(1000);
String id = UUID.randomUUID().toString();
int userId = random.nextInt(10);
int money = random.nextInt(100);
int videoNum = random.nextInt(list.size());
String title = list.get(videoNum);
sourceContext.collect(new VideoOrder(id, title, money, userId, new Date()));
x++;
if (x == 10)
{
cancel();
}
}
}
/**
* 取消任务
*/
@Override
public void cancel()
{
flag = false;
}
}
编写Flink任务
/**
* @author lixiang
*/
public class FlinkMainSource {
public static void main(String[] args) throws Exception {
//构建执行任务环境以及任务的启动入口
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStreamSource<VideoOrder> source = env.addSource(new VideoOrderSource());
source.print("接入的数据");
SingleOutputStreamOperator<Integer> streamOperator = source.flatMap(new FlatMapFunction<VideoOrder, Integer>() {
@Override
public void flatMap(VideoOrder value, Collector<Integer> out) throws Exception {
out.collect(value.getMoney());
}
});
streamOperator.print("处理后");
//流程启动
env.execute("custom source job");
}
}
(1)开启WebUI
final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
(2)设置不同并行度
(1)Sink输出源
(1)部署MySQL环境
docker pull mysql:5.7
docker run -itd -p 3306:3306 --name my-mysql -e MYSQL_ROOT_PASSWORD=123456 mysql:5.7
(2)连接MySQL创建表
CREATE TABLE `video_order` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`user_id` int(11) DEFAULT NULL,
`money` int(11) DEFAULT NULL,
`title` varchar(32) DEFAULT NULL,
`trade_no` varchar(64) DEFAULT NULL,
`create_time` date DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
(3)加入flink-mysql依赖
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-jdbc_2.12</artifactId>
    <version>1.12.0</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.25</version>
</dependency>
(4)编写MySQLSink
/**
* @author lixiang
*/
public class MysqlSink extends RichSinkFunction<VideoOrderDO> {
private Connection conn = null;
private PreparedStatement ps = null;
@Override
public void invoke(VideoOrderDO videoOrder, Context context) throws Exception {
//给ps中的?设置具体值
ps.setInt(1, videoOrder.getUserId());
ps.setInt(2, videoOrder.getMoney());
ps.setString(3, videoOrder.getTitle());
ps.setString(4, videoOrder.getTradeNo());
ps.setDate(5, new Date(videoOrder.getCreateTime().getTime()));
int i = ps.executeUpdate();
System.out.println("处理数据,插入数据库结果:" + (i > 0));
}
@Override
public void open(Configuration parameters) throws Exception {
System.out.println("---open---");
conn = DriverManager.getConnection("jdbc:mysql://192.168.139.20:3306/flink?useUnicode=true&characterEncoding=utf8&allowMultiQueries=true&serverTimezone=Asia/Shanghai", "root", "123456");
String sql = "INSERT INTO `video_order` (`user_id`, `money`, `title`, `trade_no`, `create_time`) VALUES(?,?,?,?,?);";
ps = conn.prepareStatement(sql);
}
@Override
public void close() throws Exception {
if (conn != null) {
conn.close();
}
if (ps != null) {
ps.close();
}
System.out.println("---close---");
}
}
(5)整合MySQLSink
public class FlinkMainSink {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
env.setParallelism(1);
DataStreamSource<VideoOrderDO> source = env.addSource(new VideoOrderSource());
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
3, // 尝试重启的次数
Time.of(10, TimeUnit.SECONDS) // 间隔
));
source.print("接收的数据");
source.addSink(new MysqlSink());
//流程启动
env.execute("custom sink job");
}
}
(6)运行结果
(1)部署Redis环境
拉取镜像:docker pull redis
启动容器:docker run -d --name redis -p 6379:6379 redis --requirepass "123456"
(2)Flink怎么操作redis?
(3)Redis Sink 的核心是RedisMapper,它是一个接口,使用时要编写自己的redis操作类实现这个接口中的三个方法
(4)添加redis的connector依赖,使用connector整合redis
<dependency>
    <groupId>org.apache.bahir</groupId>
    <artifactId>flink-connector-redis_2.11</artifactId>
    <version>1.0</version>
</dependency>
(5)自定义RedisSink
/**
* 定义泛型,就是要返回的类型
* @author lixiang
*/
public class MyRedisSink implements RedisMapper<Tuple2<String,Integer>> {
/**
* 选择对应的数据结构,和key的名称
* @return
*/
@Override
public RedisCommandDescription getCommandDescription() {
return new RedisCommandDescription(RedisCommand.HSET,"VIDEO_ORDER_COUNTER");
}
/**
* 返回key
* @param value
* @return
*/
@Override
public String getKeyFromData(Tuple2<String,Integer> value) {
return value.f0;
}
/**
* 返回value
* @param value
* @return
*/
@Override
public String getValueFromData(Tuple2<String,Integer> value) {
return value.f1.toString();
}
}
(6)编写Flink任务类
/**
* @author lixiang
*/
public class FlinkRedisDemo {
public static void main(String[] args) throws Exception {
//构建执行任务环境以及任务的启动入口
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
//自己构建的数据源
/*DataStream ds = env.fromElements(new VideoOrderDO(5, 32, "java", "2123143432", new Date()),
new VideoOrderDO(5, 40, "spring", "2123143432", new Date()),
new VideoOrderDO(5, 60, "springBoot", "2233143432", new Date()),
new VideoOrderDO(5, 29, "springBoot", "2125643432", new Date()),
new VideoOrderDO(5, 67, "docker", "2129843432", new Date()),
new VideoOrderDO(5, 89, "java", "2120943432", new Date()));*/
//使用自定义的source
DataStream<VideoOrderDO> ds = env.addSource(new VideoOrderSource());
//map转换。来一个记录一个,方便后续统计
DataStream<Tuple2<String,Integer>> mapDS = ds.map(new MapFunction<VideoOrderDO, Tuple2<String,Integer>>() {
@Override
public Tuple2<String,Integer> map(VideoOrderDO value) throws Exception {
return new Tuple2<>(value.getTitle(), 1);
}
});
KeyedStream<Tuple2<String, Integer>, String> keyedStream = mapDS.keyBy(new KeySelector<Tuple2<String, Integer>, String>() {
@Override
public String getKey(Tuple2<String, Integer> value) throws Exception {
return value.f0;
}
});
DataStream<Tuple2<String, Integer>> sumDS = keyedStream.sum(1);
//输出统计
sumDS.print();
FlinkJedisPoolConfig conf = new FlinkJedisPoolConfig.Builder().setHost("192.168.139.20").setPassword("123456").setPort(6379).build();
sumDS.addSink(new RedisSink<>(conf,new MyRedisSink()));
//DataStream需要调用execute,可以取这个名称
env.execute("custom redis job");
}
}
/**
* 自定义的source
* @author lixiang
*/
public class VideoOrderSource extends RichParallelSourceFunction<VideoOrderDO>
{
private volatile Boolean flag = true;
private Random random = new Random();
private static List<String> list = new ArrayList<>();
static
{
list.add("SpringBoot2.x");
list.add("Linux");
list.add("Flink");
list.add("Kafka");
list.add("SpringCloud");
list.add("SpringBoot");
list.add("Docker");
list.add("Netty");
}
@Override
public void run(SourceContext<VideoOrderDO> sourceContext) throws Exception {
int x = 0;
while (flag)
{
Thread.sleep(1000);
String id = UUID.randomUUID().toString();
int userId = random.nextInt(10);
int money = random.nextInt(100);
int videoNum = random.nextInt(list.size());
String title = list.get(videoNum);
String uuid = UUID.randomUUID().toString();
sourceContext.collect(new VideoOrderDO(userId, money, title,uuid, new Date()));
}
}
/**
* 取消任务
*/
@Override
public void cancel()
{
flag = false;
}
}
(1)Kafka环境搭建
#zk部署
docker run -d --name zookeeper -p 2181:2181 -t wurstmeister/zookeeper
#kafka部署,换成自己的IP
docker run -d --name kafka \
-p 9092:9092 \
-e KAFKA_BROKER_ID=0 \
-e KAFKA_ZOOKEEPER_CONNECT=192.168.139.20:2181 \
-e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://192.168.139.20:9092 \
-e KAFKA_LISTENERS=PLAINTEXT://0.0.0.0:9092 wurstmeister/kafka
#进入容器内部,创建topic
docker exec -it kafka /bin/bash
cd /opt/kafka
bin/kafka-topics.sh --create --zookeeper 192.168.139.20:2181 --replication-factor 1 --partitions 1 --topic test-topic
#创建生产者发送消息
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test-topic
#运行一个消费者,注意--from-beginning从开头第一个开始消费
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-topic --from-beginning
(2)Flink整合Kafka读取消息,发送消息
前面我们自定义了SourceFunction,Flink官方也提供了对接外部系统的连接器,比如读取Kafka
flink官方提供的连接器
添加依赖
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_${scala.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
/**
* @author lixiang
*/
public class FlinkKafka {
public static void main(String[] args) throws Exception {
//构建执行任务环境以及任务的启动入口
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties props = new Properties();
//kafka地址
props.setProperty("bootstrap.servers", "192.168.139.20:9092");
//组名
props.setProperty("group.id", "video-order-group");
//字符串序列化和反序列化规则
props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
//offset重置规则
props.setProperty("auto.offset.reset", "latest");
//自动提交
props.setProperty("enable.auto.commit", "true");
props.setProperty("auto.commit.interval.ms", "2000");
//有后台线程每隔10s检测一下Kafka的分区变化情况
props.setProperty("flink.partition-discovery.interval-millis","10000");
//监听test-topic发送的消息
FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<>("test-topic",new SimpleStringSchema(),props);
consumer.setStartFromGroupOffsets();
DataStream<String> consumerDS = env.addSource(consumer);
consumerDS.print("test-topic接收的消息");
//接到消息后,处理
DataStream<String> dataStream = consumerDS.map(new MapFunction<String, String>() {
@Override
public String map(String value) throws Exception {
return "新来一个订单课程:"+value;
}
});
//处理后的消息发送到order-topic
FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>("order-topic",new SimpleStringSchema(),props);
dataStream.addSink(producer);
env.execute("kafka job");
}
}
#创建生产者发送消息
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test-topic
#运行一个消费者
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic order-topic --from-beginning
(1)java里面的Map操作
/**
* @author lixiang
* flink map算子demo
*/
public class FlinkMapDemo {
public static void main(String[] args) throws Exception {
//构建环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStream<VideoOrderDO> streamSource = env.addSource(new VideoOrderSource());
streamSource.print("处理前数据");
DataStream<Tuple2<String,Integer>> dataStream = streamSource.map(new MapFunction<VideoOrderDO, Tuple2<String,Integer>>() {
@Override
public Tuple2<String, Integer> map(VideoOrderDO value) throws Exception {
return new Tuple2<>(value.getTitle(),value.getMoney());
}
});
dataStream.print("处理后");
env.execute("map job");
}
}
(2)java里面的FlatMap操作
/**
* @author lixiang
* flatMap 算子demo
*/
public class FlinkFlatMapDemo {
public static void main(String[] args) throws Exception {
//构建环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStream<String> ds = env.fromElements("java&35,spring&20,springboot&30", "springcloud&21,shiro&39,docker&56,linux&87", "netty&98,kafka&48");
ds.print("处理前");
DataStream<Tuple2<String,Integer>> out = ds.flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
@Override
public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
String[] element = value.split(",");
for (String s : element) {
String[] eles = s.split("&");
out.collect(new Tuple2<>(eles[0],Integer.parseInt(eles[1])));
}
}
});
out.print("处理后");
env.execute("flatMap job");
}
}
(1)RichMap实战
/**
* @author lixiang
* flink map算子demo
*/
public class FlinkMapDemo {
public static void main(String[] args) throws Exception {
//构建环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStream<VideoOrderDO> streamSource = env.addSource(new VideoOrderSource());
streamSource.print("处理前数据");
DataStream<Tuple2<String,Integer>> dataStream = streamSource.map(new RichMapFunction<VideoOrderDO, Tuple2<String,Integer>>() {
@Override
public Tuple2<String, Integer> map(VideoOrderDO value) throws Exception {
return new Tuple2<>(value.getTitle(),value.getMoney());
}
@Override
public void open(Configuration parameters) throws Exception {
System.out.println("open方法执行");
}
@Override
public void close() throws Exception {
System.out.println("close方法执行");
}
});
dataStream.print("处理后");
env.execute("map job");
}
}
(2)RichFlatMapFunction实战
/**
* @author lixiang
* flatMap 算子demo
*/
public class FlinkFlatMapDemo {
public static void main(String[] args) throws Exception {
//构建环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStream<String> ds = env.fromElements("java&35,spring&20,springboot&30", "springcloud&21,shiro&39,docker&56,linux&87", "netty&98,kafka&48");
ds.print("处理前");
DataStream<Tuple2<String,Integer>> out = ds.flatMap(new RichFlatMapFunction<String, Tuple2<String,Integer>>() {
@Override
public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
String[] element = value.split(",");
for (String s : element) {
String[] eles = s.split("&");
out.collect(new Tuple2<>(eles[0],Integer.parseInt(eles[1])));
}
}
@Override
public void open(Configuration parameters) throws Exception {
System.out.println("open方法执行");
}
@Override
public void close() throws Exception {
System.out.println("close方法执行");
}
});
out.print("处理后");
env.execute("flatMap job");
}
}
/**
* @author lixiang
* keyBy 算子demo
*/
public class FlinkKeyByDemo {
public static void main(String[] args) throws Exception {
//构建环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStream<VideoOrderDO> dataStream = env.addSource(new VideoOrderSource());
//根据title进行分组
KeyedStream<VideoOrderDO,String> keyedStream = dataStream.keyBy(new KeySelector<VideoOrderDO, String>() {
@Override
public String getKey(VideoOrderDO value) throws Exception {
return value.getTitle();
}
});
//分组后将相同标题的money进行累加
SingleOutputStreamOperator<VideoOrderDO> sumDS = keyedStream.sum("money");
//map转换
DataStream<Tuple2<String, Integer>> outputStreamOperator = sumDS.map(new MapFunction<VideoOrderDO, Tuple2<String,Integer>>() {
@Override
public Tuple2<String, Integer> map(VideoOrderDO value) throws Exception {
return new Tuple2<>(value.getTitle(),value.getMoney());
}
});
outputStreamOperator.print();
env.execute("keyBy job");
}
}
/**
* @author lixiang
* flink filter算子demo
* 先过滤money大于30的,然后根据标题进行分组,然后求每组money总和,最后map转换
*/
public class FlinkFliterDemo {
public static void main(String[] args) throws Exception {
//构建执行任务环境以及任务的启动的入口, 存储全局相关的参数
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
env.setParallelism(1);
DataStreamSource<VideoOrderDO> ds = env.addSource(new VideoOrderSource());
DataStream<Tuple2<String,Integer>> out = ds.filter(new FilterFunction<VideoOrderDO>() {
@Override
public boolean filter(VideoOrderDO value) throws Exception {
return value.getMoney()>30;
}
}).keyBy(new KeySelector<VideoOrderDO, String>() {
@Override
public String getKey(VideoOrderDO value) throws Exception {
return value.getTitle();
}
}).sum("money").map(new MapFunction<VideoOrderDO, Tuple2<String,Integer>>() {
@Override
public Tuple2<String, Integer> map(VideoOrderDO value) throws Exception {
return new Tuple2<>(value.getTitle(), value.getMoney());
}
});
out.print();
env.execute("filter sum job");
}
}
/**
* @author lixiang
* reduce 算子demo
*/
public class FlinkReduceDemo {
public static void main(String[] args) throws Exception {
//构建执行任务环境以及任务的启动的入口, 存储全局相关的参数
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
env.setParallelism(1);
DataStreamSource<VideoOrderDO> ds = env.addSource(new VideoOrderSource());
SingleOutputStreamOperator<Tuple2<String,Integer>> reduce = ds.keyBy(new KeySelector<VideoOrderDO, String>() {
@Override
public String getKey(VideoOrderDO value) throws Exception {
return value.getTitle();
}
}).reduce(new AggregationFunction<VideoOrderDO>() {
//value1是历史对象,value2是加入统计的对象,所以value1.f1是历史值,value2.f1是新值,不断累加
@Override
public VideoOrderDO reduce(VideoOrderDO value1, VideoOrderDO value2) throws Exception {
value1.setMoney(value1.getMoney() + value2.getMoney());
return value1;
}
}).map(new MapFunction<VideoOrderDO, Tuple2<String,Integer>>() {
@Override
public Tuple2<String,Integer> map(VideoOrderDO value) throws Exception {
return new Tuple2<>(value.getTitle(),value.getMoney());
}
});
reduce.print();
env.execute("reduce job");
}
}
/**
* @author lixiang
* maxBy-max-minBy-min的使用
*/
public class FlinkMinMaxDemo {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
//env.setParallelism(1);
DataStream<VideoOrderDO> ds = env.fromElements(new VideoOrderDO(5, 32, "java", "2123143432", new Date()),
new VideoOrderDO(25, 40, "spring", "2123143432", new Date()),
new VideoOrderDO(45, 60, "springBoot", "2233143432", new Date()),
new VideoOrderDO(15, 29, "springBoot", "2125643432", new Date()),
new VideoOrderDO(54, 67, "java", "2129843432", new Date()),
new VideoOrderDO(59, 89, "java", "2120943432", new Date()));
SingleOutputStreamOperator<VideoOrderDO> out = ds.keyBy(new KeySelector<VideoOrderDO, String>() {
@Override
public String getKey(VideoOrderDO value) throws Exception {
return value.getTitle();
}
}).max("money");
out.print();
env.execute("max job");
}
}
/**
* @author lixiang
* maxBy-max-minBy-min的使用
*/
public class FlinkMinMaxDemo {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
//env.setParallelism(1);
DataStream<VideoOrderDO> ds = env.fromElements(new VideoOrderDO(5, 32, "java", "2123143432", new Date()),
new VideoOrderDO(25, 40, "spring", "2123143432", new Date()),
new VideoOrderDO(45, 60, "springBoot", "2233143432", new Date()),
new VideoOrderDO(15, 29, "springBoot", "2125643432", new Date()),
new VideoOrderDO(54, 67, "java", "2129843432", new Date()),
new VideoOrderDO(59, 89, "java", "2120943432", new Date()));
SingleOutputStreamOperator<VideoOrderDO> out = ds.keyBy(new KeySelector<VideoOrderDO, String>() {
@Override
public String getKey(VideoOrderDO value) throws Exception {
return value.getTitle();
}
}).maxBy("money");
out.print();
env.execute("max job");
}
}
(1)窗口属性
(2)窗口大小size 和 滑动间隔 slide
(1)什么情况下才可以使用WindowAPI
(2)窗口分配器Window Assigners
(3)窗口触发器trigger
(4)窗口window function,对窗口内的数据操作
aggregate(agg函数, WindowFunction(){ })
AggregateFunction<IN, ACC, OUT>
IN是输入类型,ACC是中间聚合状态类型,OUT是输出类型,用于聚合统计当前窗口的数据
apply(new WindowFunction(){ })
WindowFunction<IN, OUT, KEY, W>
IN是输入类型,OUT是输出类型,KEY是分组类型,W是时间窗
如果想用更底层的API处理每个元素,可以使用process,process对每个元素进行处理,相当于 map+flatMap+filter
process(new KeyedProcessFunction(){processElement、onTimer})
滚动窗口 Tumbling Windows
窗口具有固定大小
窗口数据不重叠
比如指定了一个5分钟大小的滚动窗口,无限流的数据会根据时间划分为[0:00, 0:05)、[0:05, 0:10)、[0:10, 0:15)等窗口
代码实战
/**
* @author lixiang
* Tumbling-Window滚动窗口
*/
public class FlinkTumblingDemo {
public static void main(String[] args) throws Exception {
//构建环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
DataStream<VideoOrderDO> ds = env.addSource(new VideoOrderSource());
KeyedStream<VideoOrderDO, String> keyedStream = ds.keyBy(new KeySelector<VideoOrderDO, String>() {
@Override
public String getKey(VideoOrderDO value) throws Exception {
return value.getTitle();
}
});
SingleOutputStreamOperator<Map<String, Object>> map = keyedStream.window(TumblingProcessingTimeWindows.of(Time.seconds(5))).sum("money").map(new MapFunction<VideoOrderDO, Map<String, Object>>() {
@Override
public Map<String, Object> map(VideoOrderDO value) throws Exception {
Map<String, Object> map = new HashMap<>();
map.put("title", value.getTitle());
map.put("money", value.getMoney());
map.put("createDate", TimeUtil.toDate(value.getCreateTime()));
return map;
}
});
map.print();
env.execute("Tumbling Window job");
}
}
/**
* @author lixiang
* Sliding-Window滑动窗口
*/
public class FlinkSlidingDemo {
public static void main(String[] args) throws Exception {
//构建环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
DataStream<VideoOrderDO> ds = env.addSource(new VideoOrderSource());
KeyedStream<VideoOrderDO, String> keyedStream = ds.keyBy(new KeySelector<VideoOrderDO, String>() {
@Override
public String getKey(VideoOrderDO value) throws Exception {
return value.getTitle();
}
});
//每5s去统计过去20s的数据
SingleOutputStreamOperator<Map<String, Object>> map = keyedStream.window(SlidingProcessingTimeWindows.of(Time.seconds(20),Time.seconds(5))).sum("money").map(new MapFunction<VideoOrderDO, Map<String, Object>>() {
@Override
public Map<String, Object> map(VideoOrderDO value) throws Exception {
Map<String, Object> map = new HashMap<>();
map.put("title", value.getTitle());
map.put("money", value.getMoney());
map.put("createDate", TimeUtil.toDate(value.getCreateTime()));
return map;
}
});
map.print();
env.execute("Sliding Window job");
}
}
基于数量的滚动窗口, 滑动计数窗口
统计分组后同个key内的数据满5条则触发一次统计 countWindow(5)
/**
* @author lixiang
* Count-Window滚动计数窗口
*/
public class FlinkWindow1Demo {
public static void main(String[] args) throws Exception {
//构建环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
DataStream<VideoOrderDO> ds = env.addSource(new VideoOrderSource());
KeyedStream<VideoOrderDO, String> keyedStream = ds.keyBy(new KeySelector<VideoOrderDO, String>() {
@Override
public String getKey(VideoOrderDO value) throws Exception {
return value.getTitle();
}
});
SingleOutputStreamOperator<Map<String, Object>> map = keyedStream.countWindow(5).sum("money").map(new MapFunction<VideoOrderDO, Map<String, Object>>() {
@Override
public Map<String, Object> map(VideoOrderDO value) throws Exception {
Map<String, Object> map = new HashMap<>();
map.put("title", value.getTitle());
map.put("money", value.getMoney());
map.put("createDate", TimeUtil.toDate(value.getCreateTime()));
return map;
}
});
map.print();
env.execute("Count Window job");
}
}
/**
* @author lixiang
* Count-Window滑动计数窗口
*/
public class FlinkWindow1Demo {
public static void main(String[] args) throws Exception {
//构建环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
DataStream<VideoOrderDO> ds = env.addSource(new VideoOrderSource());
KeyedStream<VideoOrderDO, String> keyedStream = ds.keyBy(new KeySelector<VideoOrderDO, String>() {
@Override
public String getKey(VideoOrderDO value) throws Exception {
return value.getTitle();
}
});
SingleOutputStreamOperator<Map<String, Object>> map = keyedStream.countWindow(5,2).sum("money").map(new MapFunction<VideoOrderDO, Map<String, Object>>() {
@Override
public Map<String, Object> map(VideoOrderDO value) throws Exception {
Map<String, Object> map = new HashMap<>();
map.put("title", value.getTitle());
map.put("money", value.getMoney());
map.put("createDate", TimeUtil.toDate(value.getCreateTime()));
return map;
}
});
map.print();
env.execute("Count Window job");
}
}
aggregate(agg函数,WindowFunction(){ })
AggregateFunction<IN, ACC, OUT>
IN是输入类型,ACC是中间聚合状态类型,OUT是输出类型,是聚合统计当前窗口的数据
/**
* @author lixiang
* Tumbling-Window滚动窗口
*/
public class FlinkWindow1Demo {
public static void main(String[] args) throws Exception {
//构建环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
DataStream<VideoOrderDO> ds = env.addSource(new VideoOrderSource());
WindowedStream<VideoOrderDO, String, TimeWindow> stream = ds.keyBy(new KeySelector<VideoOrderDO, String>() {
@Override
public String getKey(VideoOrderDO value) throws Exception {
return value.getTitle();
}
}).window(TumblingProcessingTimeWindows.of(Time.seconds(5)));
stream.aggregate(new AggregateFunction<VideoOrderDO, Map<String, Object>, Map<String, Object>>() {
//初始化累加器
@Override
public Map<String, Object> createAccumulator() {
return new HashMap<>();
}
//聚合方式
@Override
public Map<String, Object> add(VideoOrderDO value, Map<String, Object> accumulator) {
if (accumulator.size() == 0) {
accumulator.put("title", value.getTitle());
accumulator.put("money", value.getMoney());
accumulator.put("num", 1);
accumulator.put("createTime", value.getCreateTime());
} else {
accumulator.put("title", value.getTitle());
accumulator.put("money", value.getMoney() + Integer.parseInt(accumulator.get("money").toString()));
accumulator.put("num", 1 + Integer.parseInt(accumulator.get("num").toString()));
accumulator.put("createTime", value.getCreateTime());
}
return accumulator;
}
//返回结果
@Override
public Map<String, Object> getResult(Map<String, Object> accumulator) {
return accumulator;
}
//合并内容
@Override
public Map<String, Object> merge(Map<String, Object> a, Map<String, Object> b) {
return null;
}
}).print();
env.execute("Tumbling Window job");
}
}
apply(new WindowFunction(){ })
IN是输入类型,OUT是输出类型,KEY是分组类型,W是时间窗
WindowFunction<IN, OUT, KEY, W extends Window>
/**
* @author lixiang
* apply
*/
public class FlinkWindow2Demo {
public static void main(String[] args) throws Exception {
//构建环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
DataStream<VideoOrderDO> ds = env.addSource(new VideoOrderSource());
WindowedStream<VideoOrderDO, String, TimeWindow> stream = ds.keyBy(new KeySelector<VideoOrderDO, String>() {
@Override
public String getKey(VideoOrderDO value) throws Exception {
return value.getTitle();
}
}).window(TumblingProcessingTimeWindows.of(Time.seconds(5)));
stream.apply(new WindowFunction<VideoOrderDO, Map<String,Object>, String, TimeWindow>() {
@Override
public void apply(String key, TimeWindow timeWindow, Iterable<VideoOrderDO> iterable, Collector<Map<String, Object>> collector) throws Exception {
List<VideoOrderDO> list = IterableUtils.toStream(iterable).collect(Collectors.toList());
long sum = list.stream().collect(Collectors.summarizingInt(VideoOrderDO::getMoney)).getSum();
Map<String,Object> map = new HashMap<>();
map.put("sumMoney",sum);
map.put("title",key);
collector.collect(map);
}
}).print();
env.execute("apply Window job");
}
}
process(new ProcessWindowFunction(){})
IN是输入类型,OUT是输出类型,KEY是分组类型,W是时间窗
ProcessWindowFunction<IN, OUT, KEY, W extends Window>
/**
* @author lixiang
* process-Window滚动窗口
*/
public class FlinkWindow3Demo {
public static void main(String[] args) throws Exception {
//构建环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
DataStream<VideoOrderDO> ds = env.addSource(new VideoOrderSource());
WindowedStream<VideoOrderDO, String, TimeWindow> stream = ds.keyBy(new KeySelector<VideoOrderDO, String>() {
@Override
public String getKey(VideoOrderDO value) throws Exception {
return value.getTitle();
}
}).window(TumblingProcessingTimeWindows.of(Time.seconds(5)));
stream.process(new ProcessWindowFunction<VideoOrderDO, Map<String,Object>, String, TimeWindow>() {
@Override
public void process(String key, ProcessWindowFunction<VideoOrderDO, Map<String, Object>, String, TimeWindow>.Context context, Iterable<VideoOrderDO> iterable, Collector<Map<String, Object>> collector) throws Exception {
List<VideoOrderDO> list = IterableUtils.toStream(iterable).collect(Collectors.toList());
long sum = list.stream().collect(Collectors.summarizingInt(VideoOrderDO::getMoney)).getSum();
Map<String,Object> map = new HashMap<>();
map.put("sumMoney",sum);
map.put("title",key);
collector.collect(map);
}
}).print();
env.execute("process Window job");
}
}
窗口函数对比
aggregate(new AggregateFunction(){});
apply(new WindowFunction(){})
process(new ProcessWindowFunction(){}) //比WindowFunction功能强大
(1)基本概念
Watermark用于推迟窗口触发的时间。实现方式是:用当前窗口中最大的eventTime减去允许的延迟时间得到Watermark,再与窗口原始触发时间进行对比,当Watermark大于窗口原始触发时间时才触发窗口执行。我们知道,流处理从事件产生,到流经source,再到operator,中间是有一个过程和时间的。虽然大部分情况下,流到operator的数据都是按照事件产生的时间顺序来的,但也不排除由于网络、分布式等原因导致乱序的产生。所谓乱序,就是指Flink接收到的事件的先后顺序不是严格按照事件的Event Time顺序排列的。
那么此时出现一个问题:一旦出现乱序,如果只根据eventTime决定window的运行,我们不能明确数据是否全部到位,但又不能无限期地等下去,此时必须要有个机制来保证过了一个特定的时间后,必须触发window去进行计算,这个特殊的机制,就是Watermark。
(2)Watermark水位线介绍
由flink的某个operator操作生成后,就在整个程序中随event数据流转
With Periodic Watermarks(周期生成,可以定义一个最大允许乱序的时间,用的很多)
With Punctuated Watermarks(标点水位线,根据数据流中某些特殊标记事件来生成,相对少)
衡量数据是否乱序的时间,什么时候不用等早之前的数据
是一个全局时间戳,不是某一个key下的值
是一个特殊字段,单调递增的方式,主要是和数据本身的时间戳做比较
用来确定什么时候不再等待更早的数据了,可以触发窗口进行计算,忍耐是有限度的,给迟到的数据一些机会
注意
触发计算后,其他窗口内数据再到达也被丢弃
(1)时间工具类
public class TimeUtil {
/**
* 时间处理
* @param date
* @return
*/
public static String toDate(Date date){
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
ZoneId zoneId = ZoneId.systemDefault();
return formatter.format(date.toInstant().atZone(zoneId));
}
/**
* 字符串转日期类型
* @param time
* @return
*/
public static Date strToDate(String time){
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
LocalDateTime dateTime = LocalDateTime.parse(time, formatter);
return Date.from(dateTime.atZone(ZoneId.systemDefault()).toInstant());
}
/**
* 时间处理
* @param date
* @return
*/
public static String format(long date){
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
ZoneId zoneId = ZoneId.systemDefault();
return formatter.format(new Date(date).toInstant().atZone(zoneId));
}
}
(2)Flink入口函数
/**
* @author lixiang
*/
public class FlinkWaterDemo {
public static void main(String[] args) throws Exception {
//初始化环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//设置并行度
env.setParallelism(1);
//监听socket输入
DataStreamSource<String> source = env.socketTextStream("192.168.139.20", 8888);
//一对多转换,将输入的字符串转成Tuple类型
SingleOutputStreamOperator<Tuple3<String, String, Integer>> flatMap = source.flatMap(new FlatMapFunction<String, Tuple3<String, String, Integer>>() {
@Override
public void flatMap(String value, Collector<Tuple3<String, String, Integer>> out) throws Exception {
String[] split = value.split(",");
out.collect(new Tuple3<String,String,Integer>(split[0],split[1],Integer.parseInt(split[2])));
}
});
//设置watermark,官方文档直接拿来的,注意修改自己的时间参数
SingleOutputStreamOperator<Tuple3<String, String, Integer>> watermarks = flatMap.assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple3<String, String, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
.withTimestampAssigner((element, recordTimestamp) -> {
return TimeUtil.strToDate(element.f1).getTime();
}));
//根据标题进行分组
watermarks.keyBy(new KeySelector<Tuple3<String, String, Integer>, String>() {
@Override
public String getKey(Tuple3<String, String, Integer> value) throws Exception {
return value.f0;
}
}).window(TumblingEventTimeWindows.of(Time.seconds(10))).apply(new WindowFunction<Tuple3<String, String, Integer>, String, String, TimeWindow>() {
//滚动窗口,10s一统计,全窗口函数
@Override
public void apply(String key, TimeWindow timeWindow, Iterable<Tuple3<String, String, Integer>> iterable, Collector<String> collector) throws Exception {
List<String> eventTimeList = new ArrayList<>();
int total = 0;
for (Tuple3<String, String, Integer> order : iterable) {
eventTimeList.add(order.f1);
total = total + order.f2;
}
String outStr = "分组key:"+key+",总价:"+total+",窗口开始时间:"+TimeUtil.format(timeWindow.getStart())+",窗口结束时间:"+TimeUtil.format(timeWindow.getEnd())+",窗口所有事件时间:"+eventTimeList;
collector.collect(outStr);
}
}).print();
env.execute("watermark job");
}
}
(3)测试数据,nc -lk 8888监听8888端口,一条一条的输入
[root@flink ~]# nc -lk 8888
java,2022-11-11 23:12:07,10
java,2022-11-11 23:12:11,10
java,2022-11-11 23:12:08,10
mysql,2022-11-11 23:12:13,20
java,2022-11-11 23:12:13,10
java,2022-11-11 23:12:17,10
java,2022-11-11 23:12:09,10
java,2022-11-11 23:12:20,10
java,2022-11-11 23:12:22,10
java,2022-11-11 23:12:25,10
/**
* @author lixiang
*/
public class FlinkWaterDemo {
public static void main(String[] args) throws Exception {
//初始化环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//设置并行度
env.setParallelism(1);
//监听socket输入
DataStreamSource<String> source = env.socketTextStream("192.168.139.20", 8888);
//一对多转换,将输入的字符串转成Tuple类型
SingleOutputStreamOperator<Tuple3<String, String, Integer>> flatMap = source.flatMap(new FlatMapFunction<String, Tuple3<String, String, Integer>>() {
@Override
public void flatMap(String value, Collector<Tuple3<String, String, Integer>> out) throws Exception {
String[] split = value.split(",");
out.collect(new Tuple3<String,String,Integer>(split[0],split[1],Integer.parseInt(split[2])));
}
});
//设置watermark,官方文档直接拿来的,注意修改自己的时间参数
SingleOutputStreamOperator<Tuple3<String, String, Integer>> watermarks = flatMap.assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple3<String, String, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
.withTimestampAssigner((element, recordTimestamp) -> {
return TimeUtil.strToDate(element.f1).getTime();
}));
//根据标题进行分组
watermarks.keyBy(new KeySelector<Tuple3<String, String, Integer>, String>() {
@Override
public String getKey(Tuple3<String, String, Integer> value) throws Exception {
return value.f0;
}
}).window(TumblingEventTimeWindows.of(Time.seconds(10)))
//1min的容忍时间,即使时间段窗口被统计了,只要数据没有超过1min就可以再次被统计进去
.allowedLateness(Time.minutes(1))
.apply(new WindowFunction<Tuple3<String, String, Integer>, String, String, TimeWindow>() {
//滚动窗口,10s一统计,全窗口函数
@Override
public void apply(String key, TimeWindow timeWindow, Iterable<Tuple3<String, String, Integer>> iterable, Collector<String> collector) throws Exception {
List<String> eventTimeList = new ArrayList<>();
int total = 0;
for (Tuple3<String, String, Integer> order : iterable) {
eventTimeList.add(order.f1);
total = total + order.f2;
}
String outStr = "分组key:"+key+",总价:"+total+",窗口开始时间:"+TimeUtil.format(timeWindow.getStart())+",窗口结束时间:"+TimeUtil.format(timeWindow.getEnd())+",窗口所有事件时间:"+eventTimeList;
collector.collect(outStr);
}
}).print();
env.execute("watermark job");
}
}
/**
* @author lixiang
*/
public class FlinkWaterDemo {
public static void main(String[] args) throws Exception {
//初始化环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//设置并行度
env.setParallelism(1);
//监听socket输入
DataStreamSource<String> source = env.socketTextStream("192.168.139.20", 8888);
//一对多转换,将输入的字符串转成Tuple类型
SingleOutputStreamOperator<Tuple3<String, String, Integer>> flatMap = source.flatMap(new FlatMapFunction<String, Tuple3<String, String, Integer>>() {
@Override
public void flatMap(String value, Collector<Tuple3<String, String, Integer>> out) throws Exception {
String[] split = value.split(",");
out.collect(new Tuple3<String,String,Integer>(split[0],split[1],Integer.parseInt(split[2])));
}
});
//设置watermark,官方文档直接拿来的,注意修改自己的时间参数
SingleOutputStreamOperator<Tuple3<String, String, Integer>> watermarks = flatMap.assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple3<String, String, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
.withTimestampAssigner((element, recordTimestamp) -> {
return TimeUtil.strToDate(element.f1).getTime();
}));
//new 一个OutputTag Bean
OutputTag<Tuple3<String,String,Integer>> lateData = new OutputTag<Tuple3<String,String,Integer>>("lateData"){};
//根据标题进行分组
SingleOutputStreamOperator<String> operator = watermarks.keyBy(new KeySelector<Tuple3<String, String, Integer>, String>() {
@Override
public String getKey(Tuple3<String, String, Integer> value) throws Exception {
return value.f0;
}
}).window(TumblingEventTimeWindows.of(Time.seconds(10)))
//1min的容忍时间,即使时间段窗口被统计了,只要数据没有超过1min就可以再次被统计进去
.allowedLateness(Time.minutes(1))
//侧输入,最后的兜底数据
.sideOutputLateData(lateData)
.apply(new WindowFunction<Tuple3<String, String, Integer>, String, String, TimeWindow>() {
//滚动窗口,10s一统计,全窗口函数
@Override
public void apply(String key, TimeWindow timeWindow, Iterable<Tuple3<String, String, Integer>> iterable, Collector<String> collector) throws Exception {
List<String> eventTimeList = new ArrayList<>();
int total = 0;
for (Tuple3<String, String, Integer> order : iterable) {
eventTimeList.add(order.f1);
total = total + order.f2;
}
String outStr = "分组key:" + key + ",总价:" + total + ",窗口开始时间:" + TimeUtil.format(timeWindow.getStart()) + ",窗口结束时间:" + TimeUtil.format(timeWindow.getEnd()) + ",窗口所有事件时间:" + eventTimeList;
collector.collect(outStr);
}
});
operator.print();
//侧输出流数据
operator.getSideOutput(lateData).print();
env.execute("watermark job");
}
}
[root@flink ~]# nc -lk 8888
java,2022-11-11 23:12:07,10
java,2022-11-11 23:12:11,10
java,2022-11-11 23:12:08,10
mysql,2022-11-11 23:12:13,20
java,2022-11-11 23:12:13,10
java,2022-11-11 23:12:17,10
java,2022-11-11 23:12:09,10
java,2022-11-11 23:12:20,10
java,2022-11-11 23:14:22,10
java,2022-11-11 23:12:25,10 #设置一个超过1分钟的数据,测试
(1)如何保证在需要的窗口内获取指定的数据?数据有乱序延迟
(2)应用场景:实时监控平台
(3)总结Flink的机制
(4)版本弃用API
新接口 WatermarkStrategy、TimestampAssigner 和 WatermarkGenerator 对时间戳和 watermark 的抽象与分离很清晰,并且统一了周期性和标记形式的 watermark 生成方式。
新接口出现之前使用的是 AssignerWithPeriodicWatermarks 和 AssignerWithPunctuatedWatermarks,现在已经弃用了。
(1)什么是State状态
(2)有状态和无状态介绍
(3)状态管理分类
ManagedState
RawState
State数据结构(状态值可能存在内存、磁盘、DB或者其他分布式存储中)
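下面用一个简化示意(类名、状态名、字段均为举例)展示几种常见 Keyed State 的注册与读写方式,真实使用时同样需要先 keyBy 才能获取键控状态:
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;
/**
 * Keyed State 常见数据结构示意(仅演示 API 用法)
 */
public class KeyedStateSketch extends RichFlatMapFunction<Tuple2<String, Integer>, String> {
    private ValueState<Integer> valueState;       //单值状态
    private ListState<Integer> listState;         //列表状态
    private MapState<String, Integer> mapState;   //Map状态
    @Override
    public void open(Configuration parameters) {
        valueState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("value-state", Integer.class));
        listState = getRuntimeContext().getListState(
                new ListStateDescriptor<>("list-state", Integer.class));
        mapState = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("map-state", String.class, Integer.class));
    }
    @Override
    public void flatMap(Tuple2<String, Integer> value, Collector<String> out) throws Exception {
        valueState.update(value.f1);     //覆盖式更新单值
        listState.add(value.f1);         //往列表里追加
        mapState.put(value.f0, value.f1);//按key写入Map
        out.collect(value.f0 + " 当前值:" + valueState.value());
    }
}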
从Flink 1.13开始,社区重新设计了其公共状态后端类,以帮助用户更好地理解本地状态存储和检查点存储的分离。用户可以迁移现有应用程序以使用新 API,而不会丢失任何状态或一致性。
(1)Flink内置了以下这些开箱即用的state backends :
(2)State状态详解
MemoryStateBackend(内存,不推荐在生产场景使用)
FsStateBackend(文件系统上,本地文件系统、HDFS, 性能更好,常用)
RocksDBStateBackend (无需担心 OOM 风险,是大部分时候的选择)
(3)配置方式
方式一:可以在 flink-conf.yaml 中使用配置键 state.backend 配置默认的状态后端。
配置条目的可能值是hashmap (HashMapStateBackend)、rocksdb (EmbeddedRocksDBStateBackend)
或实现状态后端工厂StateBackendFactory的类的完全限定类名
#全局配置例子一
# The backend that will be used to store operator state checkpoints
state.backend: hashmap
# Optional, Flink will automatically default to JobManagerCheckpointStorage
# when no checkpoint directory is specified.
state.checkpoint-storage: jobmanager
#全局配置例子二
state.backend: rocksdb
state.checkpoints.dir: file:///checkpoint-dir/
# Optional, Flink will automatically default to FileSystemCheckpointStorage
# when a checkpoint directory is specified.
state.checkpoint-storage: filesystem
方式二:代码 单独job配置例子
//代码配置一
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStateBackend(new HashMapStateBackend());
env.getCheckpointConfig().setCheckpointStorage(new JobManagerCheckpointStorage());
//代码配置二
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStateBackend(new EmbeddedRocksDBStateBackend());
env.getCheckpointConfig().setCheckpointStorage("file:///checkpoint-dir");
//或者
env.getCheckpointConfig().setCheckpointStorage(new FileSystemCheckpointStorage("file:///checkpoint-dir"));
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-statebackend-rocksdb_${scala.version}</artifactId>
    <version>1.13.1</version>
</dependency>
sum()、maxBy() 等函数底层源码也是用ValueState进行状态存储
需求:使用ValueState实现maxBy功能,统计分组内订单金额最高的订单
编码实战
/**
* 使用valueState实现maxBy功能,统计分组内订单金额最高的订单
* @author lixiang
*/
public class FlinkStateDemo {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStream<String> ds = env.socketTextStream("192.168.139.20", 8888);
DataStream<Tuple3<String, String, Integer>> flatMapDS = ds.flatMap(new RichFlatMapFunction<String, Tuple3<String, String, Integer>>() {
@Override
public void flatMap(String value, Collector<Tuple3<String, String, Integer>> out) throws Exception {
String[] arr = value.split(",");
out.collect(Tuple3.of(arr[0], arr[1], Integer.parseInt(arr[2])));
}
});
//一定要key by后才可以使用键控状态ValueState
SingleOutputStreamOperator<Tuple2<String, Integer>> maxVideoOrder = flatMapDS.keyBy(new KeySelector<Tuple3<String,String,Integer>, String>() {
@Override
public String getKey(Tuple3<String, String, Integer> value) throws Exception {
return value.f0;
}
}).map(new RichMapFunction<Tuple3<String, String, Integer>, Tuple2<String, Integer>>() {
private ValueState<Integer> valueState = null;
@Override
public void open(Configuration parameters) throws Exception {
valueState = getRuntimeContext().getState(new ValueStateDescriptor<Integer>("total", Integer.class));
}
@Override
public Tuple2<String, Integer> map(Tuple3<String, String, Integer> tuple3) throws Exception {
// 取出State中的最大值
Integer stateMaxValue = valueState.value();
Integer currentValue = tuple3.f2;
if (stateMaxValue == null || currentValue > stateMaxValue) {
//更新状态,把当前的作为新的最大值存到状态中
valueState.update(currentValue);
return Tuple2.of(tuple3.f0, currentValue);
} else {
//历史值更大
return Tuple2.of(tuple3.f0, stateMaxValue);
}
}
});
maxVideoOrder.print();
env.execute("valueState job");
}
}
[root@flink ~]# nc -lk 8888
java,2022-11-11 23:12:07,10
java,2022-11-11 23:12:11,10
java,2022-11-11 23:12:08,30
mysql,2022-11-11 23:12:13,20
java,2022-11-11 23:12:13,10
什么是Checkpoint 检查点
开箱即用,Flink 捆绑了这些检查点存储类型:JobManagerCheckpointStorage、FileSystemCheckpointStorage
配置
//全局配置checkpoints
state.checkpoints.dir: hdfs:///checkpoints/
//作业单独配置checkpoints
env.getCheckpointConfig().setCheckpointStorage("hdfs:///checkpoints-data/");
//全局配置savepoint
state.savepoints.dir: hdfs:///flink/savepoints
Savepoint 与 Checkpoint 的不同之处
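Savepoint 一般由用户手动触发并长期保留,Checkpoint 则由框架周期性自动触发、主要用于故障恢复。下面的命令行操作仅为示意,其中的 jobId、jar 包名和保存目录都需要替换成实际值:
#手动触发一次savepoint,jobId可通过flink list或Web UI查看
./bin/flink savepoint <jobId> hdfs:///flink/savepoints
#从指定的savepoint恢复作业(savepoint-xxxx与jar包名仅为占位示例)
./bin/flink run -s hdfs:///flink/savepoints/savepoint-xxxx my-flink-job.jar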
端到端(end-to-end)状态一致性
数据一致性保证都是由流处理器实现的,也就是说都是在Flink流处理器内部保证的
在真实应用中,除了流处理器以外还包含了数据源(例如Kafka、Mysql)和输出到持久化系统(Kafka、Mysql、Hbase、CK)
端到端的一致性保证,是意味着结果的正确性贯穿了整个流处理应用的各个环节,每一个组件都要保证自己的一致性。
//两个检查点之间间隔时间,默认是0,单位毫秒
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
//Checkpoint过程中出现错误,是否让整体任务都失败,默认值为0,表示不容忍任何Checkpoint失败
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(5);
//Checkpoint是进行失败恢复,当一个 Flink 应用程序失败终止、人为取消等时,它的 Checkpoint 就会被清除
//可以配置不同策略进行操作
// DELETE_ON_CANCELLATION: 当作业取消时,Checkpoint 状态信息会被删除,因此取消任务后,不能从 Checkpoint 位置进行恢复任务
// RETAIN_ON_CANCELLATION(多): 当作业手动取消时,将会保留作业的 Checkpoint 状态信息,要手动清除该作业的 Checkpoint 状态信息
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
//Flink 默认提供 Exactly-Once 保证 State 的一致性,还提供了 Exactly-Once、At-Least-Once 两种模式,
// 设置checkpoint的模式为EXACTLY_ONCE,也是默认的,
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
//设置checkpoint的超时时间, 如果规定时间没完成则放弃,默认是10分钟
env.getCheckpointConfig().setCheckpointTimeout(60000);
//设置同一时刻有多少个checkpoint可以同时执行,默认为1就行,以避免占用太多正常数据处理资源
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
//设置了重启策略, 作业在失败后能自动恢复,失败后最多重启3次,每次重启间隔10s
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 10000));
(1)什么是FlinkCEP
(2)FlinkCEP用途
(3)FlinkCEP使用流程
(4)CEP并不包含在flink中,使用前需要自己导入
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-cep-scala_2.11</artifactId>
    <version>1.7.0</version>
</dependency>
(1)模式(Pattern):定义处理事件的规则
三种模式PatternAPI
近邻模式
严格近邻:期望所有匹配事件严格地一个接一个出现,中间没有任何不匹配的事件, API是.next()
宽松近邻:允许中间出现不匹配的事件,API是.followedBy()
非确定性宽松近邻:可以忽略已经匹配的条件,API是followedByAny()
指定时间约束:指定模式在多长时间内匹配有效,API是within
如果您不希望事件类型直接跟随另一个,notNext()
如果您不希望事件类型介于其他两种事件类型之间,notFollowedBy()
模式分类
单次模式:接收一次一个事件
循环模式:接收一个或多个事件
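下面是一个循环模式的简化示意(事件结构、次数阈值均为假设),用 times() 表示同一条件连续匹配 N 次,可与下文完整示例中 begin/next 的单次模式写法对照着看:
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.windowing.time.Time;
/**
 * 循环模式示意:5秒内同一条件连续匹配3次(这里以登录失败为例)
 */
public class LoopPatternSketch {
    public static Pattern<Tuple3<String, String, Integer>, Tuple3<String, String, Integer>> build() {
        return Pattern.<Tuple3<String, String, Integer>>begin("loginFail")
                .where(new SimpleCondition<Tuple3<String, String, Integer>>() {
                    @Override
                    public boolean filter(Tuple3<String, String, Integer> value) {
                        //-1表示登录失败
                        return value.f2 == -1;
                    }
                })
                //times(3)表示该条件需要匹配3次;oneOrMore()则表示一次或多次
                .times(3)
                .within(Time.seconds(5));
    }
}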
(2)其他参数
需求:同个账号,在5秒内连续登录失败2次,则认为存在恶意登录问题
数据格式 李祥,2022-11-11 12:01:01,-1
/**
* cep-demo
* @author lixiang
*/
public class FlinkCEPDemo {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStream<String> ds = env.socketTextStream("192.168.139.20",8888);
SingleOutputStreamOperator<Tuple3<String, String, Integer>> flatMapDS = ds.flatMap(new FlatMapFunction<String, Tuple3<String, String, Integer>>() {
@Override
public void flatMap(String value, Collector<Tuple3<String, String, Integer>> out) throws Exception {
String[] arr = value.split(",");
out.collect(Tuple3.of(arr[0], arr[1], Integer.parseInt(arr[2])));
}
});
SingleOutputStreamOperator<Tuple3<String, String, Integer>> watermarks = flatMapDS.assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple3<String, String, Integer>>forMonotonousTimestamps()
.withTimestampAssigner((event, timestamp) -> TimeUtil.strToDate(event.f1).getTime()));
KeyedStream<Tuple3<String, String, Integer>, String> keyedStream = watermarks.keyBy(new KeySelector<Tuple3<String, String, Integer>, String>() {
@Override
public String getKey(Tuple3<String, String, Integer> value) throws Exception {
return value.f0;
}
});
//定义模式
Pattern<Tuple3<String, String, Integer>, Tuple3<String, String, Integer>> pattern = Pattern.<Tuple3<String, String, Integer>>
begin("firstTimeLogin")
.where(new SimpleCondition<Tuple3<String, String, Integer>>() {
@Override
public boolean filter(Tuple3<String, String, Integer> value) throws Exception {
return value.f2 == -1;
}
})
.next("secondTimeLogin")
.where(new SimpleCondition<Tuple3<String, String, Integer>>() {
@Override
public boolean filter(Tuple3<String, String, Integer> value) throws Exception {
return value.f2 == -1;
}
}).within(Time.seconds(5));
//匹配检查
PatternStream<Tuple3<String, String, Integer>> patternStream = CEP.pattern(keyedStream, pattern);
SingleOutputStreamOperator<Tuple3<String, String, String>> select = patternStream.select(new PatternSelectFunction<Tuple3<String, String, Integer>, Tuple3<String, String, String>>() {
@Override
public Tuple3<String, String, String> select(Map<String, List<Tuple3<String, String, Integer>>> map) throws Exception {
Tuple3<String, String, Integer> firstLoginFail = map.get("firstTimeLogin").get(0);
Tuple3<String, String, Integer> secondLoginFail = map.get("secondTimeLogin").get(0);
return Tuple3.of(firstLoginFail.f0, firstLoginFail.f1, secondLoginFail.f1);
}
});
select.print("匹配结果");
env.execute("CEP job");
}
}
张三,2022-11-11 12:01:01,-1
李四,2022-11-11 12:01:10,-1
李四,2022-11-11 12:01:11,-1
张三,2022-11-11 12:01:13,-1
李四,2022-11-11 12:01:14,-1
李四,2022-11-11 12:01:15,1
张三,2022-11-11 12:01:16,-1
李四,2022-11-11 12:01:17,-1
张三,2022-11-11 12:01:20,1
Flink 部署方式是灵活的,主要区别在于对Flink计算时所需资源的管理方式不同
(1)安装JDK8环境
(1)上传jdk1.8安装包,解压到指定目录
tar -xvf jdk-8u181-linux-x64.tar.gz -C /usr/local/
(2)查看解压后的文件
[root@flink ~]# cd /usr/local
[root@flink local]# ls
bin etc flink-1.13.1 games include jdk1.8.0_181 lib lib64 libexec sbin share src
(3)jdk1.8.0_181重命名为jdk1.8
mv jdk1.8.0_181/ jdk1.8
(4)配置环境变量
vi /etc/profile
添加配置:
JAVA_HOME=/usr/local/jdk1.8
CLASSPATH=$JAVA_HOME/lib/
PATH=$PATH:$JAVA_HOME/bin
export PATH JAVA_HOME CLASSPATH
(5)刷新配置
source /etc/profile
(6)查看java环境
[root@flink ~]# java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
Please specify JAVA_HOME. Either in Flink config ./conf/flink-conf.yaml or as system-wide JAVA_HOME.
(2)准备flink环境
tar -xvf flink-1.13.1-bin-scala_2.12.tgz -C /usr/local/
调整配置文件:conf/flink-conf.yaml
#web ui 端口
rest.port: 8081
#调整jobmanager和taskmanager的大小,根据自己的机器进行调整
jobmanager.memory.process.size: 256m
taskmanager.memory.process.size: 256m
本地模式用到这两个脚本
start-cluster.sh
stop-cluster.sh
启动本地模式:./start-cluster.sh
注意这会可能会报错:
The derived from fraction jvm overhead memory (19.200mb (20132659 bytes)) is less than its min value 192.000mb (201326592 bytes), min value will be used instead
原因分析:
flink最少需要192M的内存才能启动;JDK1.8之后永久代被元空间(Metaspace)取代,元空间是根据内存大小自动分配的。如果元空间给的太小,flink将会启动不起来。
调整如下配置:conf/flink-conf.yaml
taskmanager.memory.process.size: 512m
taskmanager.memory.framework.heap.size: 64m
taskmanager.memory.framework.off-heap.size: 64m
taskmanager.memory.jvm-metaspace.size: 64m
taskmanager.memory.jvm-overhead.fraction: 0.2
taskmanager.memory.jvm-overhead.min: 16m
taskmanager.memory.jvm-overhead.max: 64m
taskmanager.memory.network.fraction: 0.1
taskmanager.memory.network.min: 1mb
taskmanager.memory.network.max: 256mb