This follows the official quick-setup guide; the Flink docs can be used directly:
https://nightlies.apache.org/flink/flink-docs-release-1.15/zh/docs/try-flink/flink-operations-playground/
The page is available in Chinese, which makes it easy to follow.
Linux: CentOS 7
Docker: 20.10.17
Docker Compose: v2.4.1
Flink: 1.15.0
Docker installation guide: https://www.runoob.com/docker/centos-docker-install.html
Docker Compose installation guide: https://www.runoob.com/docker/docker-compose.html
$ cd
$ mkdir flink
$ cd flink
$ vim docker-compose.yml
version: "2.1"
services:
  jobmanager:
    image: flink
    expose:
      - "6123"
    ports:
      - "8081:8081"
    command: jobmanager
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
    volumes:
      - /home/hadoop/flink/flink-docker/conf/job/flink-conf.yaml:/opt/flink/conf/flink-conf.yaml
    restart: always
  taskmanager:
    image: flink
    expose:
      - "6121"
      - "6122"
    depends_on:
      - jobmanager
    command: taskmanager
    links:
      - "jobmanager:jobmanager"
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
    volumes:
      - /home/hadoop/flink/flink-docker/conf/task/flink-conf.yaml:/opt/flink/conf/flink-conf.yaml
    restart: always
$ mkdir -p /home/hadoop/flink/flink-docker/conf/task/
$ mkdir -p /home/hadoop/flink/flink-docker/conf/job/
The full flink-conf.yaml is long and can be found online (or perhaps copied straight out of the container; see the sketch below).
Because Flink's default memory footprint was too large for this machine, the memory settings were adjusted slightly.
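A minimal sketch of both points: the default config can be copied out of the image instead of searched for online, and the two process-size entries below are the standard Flink 1.15 memory options (the 1024m values are illustrative, not the ones actually used here):
# copy the default flink-conf.yaml out of the image into the host directories mounted above
$ docker create --name flink-tmp flink
$ docker cp flink-tmp:/opt/flink/conf/flink-conf.yaml /home/hadoop/flink/flink-docker/conf/job/flink-conf.yaml
$ docker cp flink-tmp:/opt/flink/conf/flink-conf.yaml /home/hadoop/flink/flink-docker/conf/task/flink-conf.yaml
$ docker rm flink-tmp
# example memory entries to adjust in flink-conf.yaml (values are illustrative)
jobmanager.memory.process.size: 1024m
taskmanager.memory.process.size: 1024m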
# In the same directory as docker-compose.yml
# Start the Flink containers
$ docker-compose up -d
# The first run pulls the images, which takes a while
# It may appear to hang (network issues); Ctrl+C and run it again if so
# Stop the Flink containers
$ docker-compose down
# List the running containers
$ docker ps
4189a4252f47 flink "/docker-entrypoint.…" 25 hours ago Up 25 hours 6121-6123/tcp, 8081/tcp flink-taskmanager-1
a26070126d29 flink "/docker-entrypoint.…" 25 hours ago Up 25 hours 6123/tcp, 0.0.0.0:8081->8081/tcp, :::8081->8081/tcp flink-jobmanager-1
# List the downloaded images and their tags
$ docker images
# Check on the host whether port 8081 is listening
$ netstat -nultp | grep 8081
tcp 0 0 0.0.0.0:8081 0.0.0.0:* LISTEN 13917/docker-proxy
tcp6 0 0 :::8081 :::* LISTEN 13922/docker-proxy
# Access the service on port 8081
$ curl localhost:8081    # or open it in a browser (localhost locally, the server's public IP from outside)
With just a few Docker steps, the Flink environment is ready.
The Web UI can now be used to check runtime information (slots, memory, and so on) or to submit jobs for learning and testing.
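The same information is also exposed through Flink's REST API on the same port, which is handy for quick checks from a terminal:
# cluster overview: TaskManagers, slots, running jobs
$ curl localhost:8081/overview
# per-TaskManager details, including memory configuration
$ curl localhost:8081/taskmanagers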
Local development tools: IntelliJ IDEA, Java, Maven
https://nightlies.apache.org/flink/flink-docs-release-1.15/zh/docs/try-flink/datastream/
# Rough workflow:
# Open a command prompt
# cd into the working directory, e.g. workplace
# Run the mvn command from the official docs to generate a project
Then in IDEA: File -> Open -> select the generated project...
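For reference, the mvn command mentioned in the workflow is a Maven archetype invocation along these lines (coordinates as given in the Flink 1.15 DataStream walkthrough; check the linked page for the exact current values):
$ mvn archetype:generate \
    -DarchetypeGroupId=org.apache.flink \
    -DarchetypeArtifactId=flink-walkthrough-datastream-java \
    -DarchetypeVersion=1.15.0 \
    -DgroupId=frauddetection \
    -DartifactId=frauddetection \
    -Dversion=0.1 \
    -Dpackage=spendreport \
    -DinteractiveMode=false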
Following the steps on the Flink website, a template project can be created very easily.
Because the official example generates its data source in code and writes its output to the log, it can even be run as-is.
From here you can build your own Flink application on top of it, or continue with the official "fraud detection" walkthrough.
https://blog.csdn.net/liutao43/article/details/115522760
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // read the host/port of the socket source from the program arguments
        ParameterTool parameterTool = ParameterTool.fromArgs(args);
        String host = parameterTool.get("host");
        int port = parameterTool.getInt("port");
        DataStreamSource<String> stringDataSource = env.socketTextStream(host, port);

        // split each line on spaces, emit (word, 1) tuples, then count per word
        SingleOutputStreamOperator<Tuple2<String, Integer>> sum = stringDataSource
                .flatMap(new MyFlatMapper())
                .keyBy((Tuple2<String, Integer> value) -> value.f0)
                .sum(1).setParallelism(2);

        sum.print().setParallelism(1);
        env.execute();
    }

    public static class MyFlatMapper implements FlatMapFunction<String, Tuple2<String, Integer>> {
        private static final long serialVersionUID = 7883096705374505894L;

        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
            String[] words = value.split(" ");
            for (String word : words) {
                out.collect(new Tuple2<String, Integer>(word, 1));
            }
        }
    }
}
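One way to run this against the cluster above (the port, jar path, class package, and IP below are placeholders, not values from the original setup): start netcat on a host the TaskManagers can reach, then submit the packaged jar with the arguments that ParameterTool reads:
# on the machine acting as the socket source
$ nc -lk 9999
# submit the job from a Flink distribution (or upload the jar through the Web UI)
$ ./bin/flink run -c com.example.StreamWordCount target/my-flink-job.jar --host 192.168.1.100 --port 9999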
Although there were some problems viewing the results, a streaming job has now been written and run successfully (the Web UI shows the record counts as they flow through the job).
The job is simple, but it is enough to validate the project template and the Flink environment.
Later learning examples can be built on this by swapping out the source, the sink, and the processing logic.
Next, the output is switched to MySQL, which also solves the problem of not being able to see the results.
Source
Adapted from the netcat-to-print word-count example above.
Code
public static void main(String[] args) throws Exception {
    ParameterTool parameterTool = ParameterTool.fromArgs(args);
    String host = parameterTool.get("host");
    int port = parameterTool.getInt("port");
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource<String> stringDataSource = env.socketTextStream(host, port);
    // To test directly in the IDE, comment out the two lines above and uncomment the two below,
    // which replace the socket source with a fixed element:
    // StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // DataStreamSource<String> stringDataSource = env.fromElements("abc zdf zdf abc abc 123");

    SingleOutputStreamOperator<Tuple2<String, Integer>> sum = stringDataSource
            .flatMap(new MyFlatMapper())
            .keyBy((Tuple2<String, Integer> value) -> value.f0)
            .sum(1).setParallelism(2);

    sum.addSink(JdbcSink.sink(
            "insert into word_count values(?,?,?)",
            new JdbcStatementBuilder<Tuple2<String, Integer>>() {
                @Override
                public void accept(PreparedStatement preparedStatement, Tuple2<String, Integer> value) throws SQLException {
                    preparedStatement.setString(1, value.f0);
                    preparedStatement.setInt(2, value.f1);
                    // current time in milliseconds, used as the ts column
                    preparedStatement.setBigDecimal(3, BigDecimal.valueOf(Calendar.getInstance().getTimeInMillis()));
                }
            },
            new JdbcExecutionOptions.Builder().withBatchSize(1).build(),
            new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                    .withDriverName("com.mysql.jdbc.Driver")
                    .withUrl("jdbc:mysql://49.232.208.228:3306/flink?useSSL=false")
                    .withUsername("root")
                    .withPassword("root")
                    .build()
    ));

    env.execute();
}

public static class MyFlatMapper implements FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
        String[] words = value.split(" ");
        for (String word : words) {
            out.collect(new Tuple2<String, Integer>(word, 1));
        }
    }
}
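The target table is not shown in the post; judging from the insert statement and the result rows below, it looks roughly like this (column names and types are an assumption):
CREATE TABLE word_count (
    word  VARCHAR(255),
    count INT,
    ts    BIGINT  -- insert time in milliseconds, written via setBigDecimal above
);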
Because this is stream processing, the intermediate counts produced while the aggregation updates are written out as well, so a timestamp column was added.
word count ts
helloWorld 1 1657162091992
flinkTest 1 1657162099482
helloWorld 2 1657162148936
helloWorld 3 1657162158543
Finally the results are visible, and MySQL is a very commonly used sink anyway.
The plan for the next step is to look at custom sources and custom sinks, the Kafka source, and so on.
(The custom MySQL sink below was adapted from a blog post that I can no longer find.)
public class MySQLSink extends RichSinkFunction<WordCount> {
    PreparedStatement preparedStatement;
    private Connection connection;
    private ReentrantLock reentrantLock = new ReentrantLock();

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        // prepare the JDBC connection and statement
        buildPreparedStatement();
    }

    @Override
    public void close() throws Exception {
        super.close();
        try {
            if (null != preparedStatement) {
                preparedStatement.close();
                preparedStatement = null;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        try {
            if (null != connection) {
                connection.close();
                connection = null;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public void invoke(WordCount value, Context context) throws Exception {
        preparedStatement.setString(1, value.getWord());
        preparedStatement.setInt(2, value.getCount());
        preparedStatement.executeUpdate();
    }

    /**
     * Prepare the connection and preparedStatement.
     * Obtaining the MySQL connection is guarded for thread safety;
     * a ReentrantLock with a timeout is used instead of synchronized because
     * opening a database connection is a remote call with unpredictable latency.
     */
    private void buildPreparedStatement() {
        if (null == connection) {
            boolean hasLock = false;
            try {
                hasLock = reentrantLock.tryLock(10, TimeUnit.SECONDS);
                if (hasLock) {
                    Class.forName("com.mysql.cj.jdbc.Driver");
                    connection = DriverManager.getConnection("jdbc:mysql://49.232.208.228:3306/flink?serverTimezone=GMT&allowPublicKeyRetrieval=true&useSSL=false&characterEncoding=utf8", "root", "root");
                }
                if (null != connection) {
                    preparedStatement = connection.prepareStatement("insert into word_count (word, count) values (?, ?)");
                }
            } catch (Exception e) {
                // just printing the stack trace; be careful with this in production
                e.printStackTrace();
            } finally {
                if (hasLock) {
                    reentrantLock.unlock();
                }
            }
        }
    }
}
public class TestMySQLSink {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // parallelism 1
        env.setParallelism(1);

        List<WordCount> list = new ArrayList<>();
        list.add(new WordCount("aaa", 11));
        list.add(new WordCount("bbb", 12));
        list.add(new WordCount("ccc", 13));
        list.add(new WordCount("ddd", 14));
        list.add(new WordCount("eee", 15));
        list.add(new WordCount("fff", 16));

        env.fromCollection(list)
                .addSink(new MySQLSink());

        env.execute("sink demo : customize mysql obj");
    }
}
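Both classes above depend on a WordCount POJO that is not shown in the post; a minimal sketch matching the constructor and getters used above (field names assumed):
// Minimal WordCount POJO assumed by MySQLSink/TestMySQLSink (not from the original post)
public class WordCount {
    private String word;
    private int count;

    public WordCount() {
    }

    public WordCount(String word, int count) {
        this.word = word;
        this.count = count;
    }

    public String getWord() { return word; }
    public void setWord(String word) { this.word = word; }
    public int getCount() { return count; }
    public void setCount(int count) { this.count = count; }
}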
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // read from Kafka
    KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("49.232.208.228:9092")
            .setTopics("baby_basic2")
            .setGroupId("my-group")
            .setStartingOffsets(OffsetsInitializer.earliest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();
    // env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source").print();

    // parse each record, key by birth year, count per 3-minute processing-time window
    DataStream<Tuple3<Long, String, Integer>> data = env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source")
            .map(new MapFunction<String, Baby>() {
                @Override
                public Baby map(String s) throws Exception {
                    String[] value = s.split(",");
                    return new Baby(Long.parseLong(value[0]), Integer.parseInt(value[1]), Integer.parseInt(value[2]));
                }
            })
            .keyBy(k -> String.valueOf(k.birthday).substring(0, 4))
            .window(TumblingProcessingTimeWindows.of(Time.minutes(3)))
            .process(new ProcessWindowFunction<Baby, Tuple3<Long, String, Integer>, String, TimeWindow>() {
                @Override
                public void process(String s,
                                    Context context,
                                    Iterable<Baby> iterable,
                                    Collector<Tuple3<Long, String, Integer>> collector) throws Exception {
                    int sum = 0;
                    for (Baby f : iterable) {
                        sum++;
                    }
                    collector.collect(Tuple3.of(context.window().getEnd(), s, sum));
                }
            });

    data.print();
    data.addSink(JdbcSink.sink(
            "insert into baby_basic values(?,?,?)",
            new JdbcStatementBuilder<Tuple3<Long, String, Integer>>() {
                @Override
                public void accept(PreparedStatement preparedStatement, Tuple3<Long, String, Integer> value) throws SQLException {
                    preparedStatement.setLong(1, value.f0.longValue());
                    preparedStatement.setInt(2, Integer.parseInt(value.f1));
                    preparedStatement.setInt(3, value.f2);
                }
            },
            new JdbcExecutionOptions.Builder().withBatchSize(1).build(),
            new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                    .withDriverName("com.mysql.jdbc.Driver")
                    .withUrl("jdbc:mysql://49.232.208.228:3306/flink?useSSL=false")
                    .withUsername("root")
                    .withPassword("root")
                    .build()
    ));

    env.execute();
}
public static class Baby {
    private Long id;
    private int birthday;
    private int sex;

    public Baby(Long id, int birthday, int sex) {
        this.id = id;
        this.birthday = birthday;
        this.sex = sex;
    }

    // getters used by the v2 job below
    public Long getId() { return id; }
    public int getBirthday() { return birthday; }
    public int getSex() { return sex; }

    @Override
    public String toString() {
        return "Baby{" +
                "id=" + id +
                ", birthday=" + birthday +
                ", sex=" + sex +
                '}';
    }
}
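To push a few test records into the topic, the Kafka console producer works; the broker address and topic are the ones used in the code above, and the sample rows are made-up values in the id,birthday,sex format the map function parses:
$ bin/kafka-console-producer.sh --bootstrap-server 49.232.208.228:9092 --topic baby_basic2
>1,20120405,1
>2,20130612,0
>3,20120908,1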
A Kafka source with a MySQL sink is a very common real-time processing pattern. This example basically covers the requirement, and the packaged job ran on Flink for a full day after being uploaded.
Things to improve:
The processing logic has issues: try the reduce operator
Write the processing steps out more clearly
Adjust the method structure
Accept the key settings as program parameters
Rewritten from the Kafka source + MySQL sink example above.
public class TestKafkaToMySQL_v2 {
    private final KafkaSource<String> source;
    private final SinkFunction<Tuple3<Long, Integer, Integer>> sink;

    public TestKafkaToMySQL_v2(
            KafkaSource<String> source,
            SinkFunction<Tuple3<Long, Integer, Integer>> sink) {
        this.source = source;
        this.sink = sink;
    }

    // main: build the source and the sink, then start the processing method
    public static void main(String[] args) throws Exception {
        // Kafka source
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("49.232.208.228:9092")
                .setTopics("baby_basic3")
                .setGroupId("my-group-v2")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // MySQL sink
        SinkFunction<Tuple3<Long, Integer, Integer>> sink = JdbcSink.sink(
                "insert into baby_basic_v2 values(?,?,?)",
                new JdbcStatementBuilder<Tuple3<Long, Integer, Integer>>() {
                    @Override
                    public void accept(PreparedStatement preparedStatement, Tuple3<Long, Integer, Integer> value) throws SQLException {
                        preparedStatement.setLong(1, value.f0.longValue());
                        preparedStatement.setInt(2, value.f1);
                        preparedStatement.setInt(3, value.f2);
                    }
                },
                new JdbcExecutionOptions.Builder().withBatchSize(1).build(),
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withDriverName("com.mysql.cj.jdbc.Driver")
                        .withUrl("jdbc:mysql://49.232.208.228:3306/flink?useSSL=false")
                        .withUsername("root")
                        .withPassword("root")
                        .build()
        );

        TestKafkaToMySQL_v2 job =
                new TestKafkaToMySQL_v2(source, sink);
        job.execute();
    }

    // processing logic
    private void execute() throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> dataString = env.fromSource(this.source, WatermarkStrategy.noWatermarks(), "kafka source");
        DataStream<Baby> dataObject = dataString.map(new StringToObj());

        // keyBy, window, aggregate
        SingleOutputStreamOperator<Tuple3<Long, Integer, Integer>> result = dataObject
                .keyBy(k -> String.valueOf(k.getBirthday()).substring(0, 4))
                .window(TumblingProcessingTimeWindows.of(Time.minutes(3)))
                .aggregate(new MyAggregate());

        result.addSink(this.sink);
        env.execute("kafka to mysql aggregation");
    }

    // aggregation: accumulator is (year, count), result is (current millis, year, count)
    private static class MyAggregate
            implements AggregateFunction<Baby, Tuple2<Integer, Integer>, Tuple3<Long, Integer, Integer>> {
        @Override
        public Tuple2<Integer, Integer> createAccumulator() {
            return new Tuple2<>(0, 0);
        }

        @Override
        public Tuple2<Integer, Integer> add(Baby baby, Tuple2<Integer, Integer> acc) {
            return new Tuple2<>(baby.getBirthday() / 10000, acc.f1 + 1);
        }

        @Override
        public Tuple3<Long, Integer, Integer> getResult(Tuple2<Integer, Integer> acc) {
            return new Tuple3<>(System.currentTimeMillis(), acc.f0, acc.f1);
        }

        @Override
        public Tuple2<Integer, Integer> merge(Tuple2<Integer, Integer> acc, Tuple2<Integer, Integer> acc1) {
            // only called for merging (session) windows, which are not used here
            return null;
        }
    }

    // convert the input String into a Baby object
    public static class StringToObj implements MapFunction<String, Baby> {
        @Override
        public Baby map(String s) throws Exception {
            String[] value = s.split(",");
            return new Baby(Long.parseLong(value[0]), Integer.parseInt(value[1]), Integer.parseInt(value[2]));
        }
    }
}
The method structure has been reorganized a bit; this can be considered the final version of the Kafka-to-MySQL job. Later experiments with operators and processing steps can reuse it as a template.
Some thoughts on the requirement and the implementation:
The requirement itself is simple: count the stream and get the number of babies per year.
The hope all along was that MySQL would hold the final result, i.e. a statistic over the whole stream.
1. Windowing first (the approach used above):
Only the data inside each window is aggregated; it cannot be accumulated with data from before the window.
The overall answer can then be obtained by running one more query against the MySQL table.
This solution feels fairly reasonable:
the stream computes the per-window results in real time, and MySQL aggregates the overall result.
A simpler example of the same idea: the streaming job counts how many records arrive per minute, and a SQL query over the table adds up the grand total.
2. No window, using reduce:
The drawback of reduce is that its output must have the same type as its input (usually the merged result and the elements being merged have different structures).
The workaround is to create a wrapper object that contains both the merged result and the incoming element,
and then use map to convert it into the output format (see the sketch below).
3. stream.keyBy(...).process(new MyProcessFunction());
Also worth considering.
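A minimal sketch of option 2, with made-up names (YearCountAcc; Types comes from org.apache.flink.api.common.typeinfo): a wrapper type carries both the key and the running count, reduce folds the keyed stream into it, and a final map converts it to the Tuple3 format used by the v2 JDBC sink above.
// wrapper holding the element's key and the running result, so reduce can stay within one type
public static class YearCountAcc {
    public int year;
    public int count;

    public YearCountAcc() {
    }

    public YearCountAcc(int year, int count) {
        this.year = year;
        this.count = count;
    }
}

// inside the job, replacing the window + aggregate step:
SingleOutputStreamOperator<Tuple3<Long, Integer, Integer>> result = dataObject
        // turn each Baby into a wrapper with count 1
        .map(b -> new YearCountAcc(b.getBirthday() / 10000, 1))
        .keyBy(acc -> acc.year)
        // reduce can only emit the same type as its input, which is exactly why the wrapper exists
        .reduce((a, b) -> new YearCountAcc(a.year, a.count + b.count))
        // convert the wrapper into the Tuple3 expected by the sink
        .map(acc -> Tuple3.of(System.currentTimeMillis(), acc.year, acc.count))
        .returns(Types.TUPLE(Types.LONG, Types.INT, Types.INT));
result.addSink(this.sink);

Note that without a window, reduce emits an updated record for every incoming element, so the sink sees the running totals, much like the timestamped word_count rows earlier; the final per-year numbers would still come from a query over the MySQL table.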