Since Flink 1.12, with unified stream/batch execution, getExecutionEnvironment() automatically determines whether to return a local or a remote environment based on the context the program runs in.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
StreamExecutionEnvironment env = StreamExecutionEnvironment.createRemoteEnvironment(
"host", // 主机名
1234, //端口号
"path/to/jarFile.jar" // jobmanager 的jar包
);
This returns a cluster execution environment; the JAR containing the job must be specified.
Setting the execution mode:
① Command line
bin/flink run -Dexecution.runtime-mode=BATCH ……
② In code
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);
// hard-coding the mode here is inflexible; the command-line flag is preferred
③ When to use BATCH mode
Bounded streams (finite data sets)
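A minimal sketch of forcing BATCH mode in code on a bounded input (it reuses the input/clicks.txt file that appears later in this section):
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class BatchModeDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.BATCH); // only sensible for bounded input
        env.readTextFile("input/clicks.txt").print();   // a file source is a bounded stream
        env.execute();
    }
}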
The source is the concrete input end of the whole program.
Define an Event class whose shape makes it easy for Flink to parse and serialize.
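The Event class itself is not shown in these notes; a minimal sketch consistent with how it is used below, following Flink's POJO rules (public class, public no-arg constructor, public fields):
package com.shinho.chapter05_source;
import java.sql.Timestamp;
public class Event {
    public String user;    // user name
    public String url;     // url visited
    public Long timestamp; // event time in milliseconds

    public Event() {} // no-arg constructor required by Flink's POJO rules

    public Event(String user, String url, Long timestamp) {
        this.user = user;
        this.url = url;
        this.timestamp = timestamp;
    }

    @Override
    public String toString() {
        return "Event{user='" + user + "', url='" + url + "', timestamp=" + new Timestamp(timestamp) + "}";
    }
}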
package com.shinho.chapter05_source;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.util.ArrayList;
public class SourceTest {
public static void main(String[] args) throws Exception {
// create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
// 1. read from a file (bounded)
DataStreamSource<String> stream1 = env.readTextFile("input/clicks.txt");
// 2. read from a collection
ArrayList<Integer> nums = new ArrayList<>();
nums.add(2);
nums.add(5);
DataStreamSource<Integer> ds = env.fromCollection(nums);
ArrayList<Event> events = new ArrayList<>();
events.add(new Event("mary","/home",1000L));
events.add(new Event("bob","/prod",2000L));
DataStreamSource<Event> stream2 = env.fromCollection(events);
// 3. read directly from elements
DataStreamSource<Event> ele_ds = env.fromElements(new Event("bob", "/prod", 2000L));
// 4. socket text stream (a real stream, but unstable; for testing only)
DataStreamSource<String> stream4 = env.socketTextStream("node00", 7777);
// stream1.print("txt print");
// ds.print("collection print");
// stream2.print("event print");
// ele_ds.print("ele print");
stream4.print("socket stream");
env.execute();
}
}
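To feed the socket source, start a listener on node00 before submitting the job (assuming netcat is installed):
[root@node00 ~]# nc -lk 7777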
Kafka is a distributed message queue: a high-throughput, easily scalable messaging system whose transport model matches stream processing exactly, which makes Flink and Kafka a natural pair:
① Kafka: data collection and transport
② Flink: analysis and computation
Dependency (the _2.12 suffix is the Scala version the connector is built against):
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_2.12</artifactId>
<version>1.13.6</version>
</dependency>
Kafka installation:
Release archive: https://archive.apache.org/dist/
Install guide: https://www.cnblogs.com/ding2016/p/8282907.html
Start the broker:
[root@node00 bin]# kafka-server-start.sh -daemon ../config/server.properties
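Unless the broker auto-creates topics, create the clicks topic first (a sketch; on older Kafka versions replace --bootstrap-server node00:9092 with --zookeeper node00:2181):
[root@node00 bin]# kafka-topics.sh --create --bootstrap-server node00:9092 --topic clicks --partitions 1 --replication-factor 1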
Console producer:
[root@node00 bin]# kafka-console-producer.sh --broker-list node00:9092 --topic clicks
Consumer implemented in code (this snippet goes inside main; it needs java.util.Properties, org.apache.flink.api.common.serialization.SimpleStringSchema, and org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer):
// 5. read from Kafka
Properties properties = new Properties();
properties.setProperty("bootstrap.servers","node00:9092");
properties.setProperty("group.id","consumer.group");
properties.setProperty("key.deserializer","org.apache.kafka.common.serialization.StringDeserializer");
properties.setProperty("value.deserializer","org.apache.kafka.common.serialization.StringDeserializer");
properties.setProperty("auto.offset.reset","latest");
DataStreamSource<String> kafka_stream = env.addSource(new FlinkKafkaConsumer<String>("clicks", new SimpleStringSchema(), properties));
kafka_stream.print("kafka");
env.execute();
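With the job running, each line typed into the console producer above is consumed and printed with the kafka> prefix.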
Custom source
package com.shinho.chapter05_source;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import java.util.Calendar;
import java.util.Random;
public class ClickSource implements SourceFunction<Event> {
private volatile boolean running = true; // volatile: cancel() is called from a different thread
@Override
public void run(SourceContext<Event> sourceContext) throws Exception {
Random random = new Random();
String[] users = {"gyz","xxz","ztk"};
String[] urls = {"./home","./prod","./prod?id=100"};
while (running){
String user = users[random.nextInt(users.length)]; // pick a random user
String url = urls[random.nextInt(urls.length)]; // pick a random url
Long timestamp = Calendar.getInstance().getTimeInMillis(); // current time in ms
sourceContext.collect(new Event(user,url,timestamp));
Thread.sleep(1000L); // emit one event per second
}
}
@Override
public void cancel() {
running = false;
}
}
Main method:
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStreamSource<Event> ds = env.addSource(new ClickSource());
ds.print("CLICK SOURCE");
env.execute();
}
A plain SourceFunction always runs with parallelism 1; calling setParallelism() with a value greater than 1 on it throws an exception. For a parallel source, implement ParallelSourceFunction instead; it can then be scaled:
DataStreamSource<Integer> ds = env.addSource(new ParallelCustomSource()).setParallelism(2);
Implementation:
public static class ParallelCustomSource implements ParallelSourceFunction<Integer>{
private volatile boolean running = true; // volatile: cancel() is called from a different thread
private Random random = new Random();
@Override
public void run(SourceContext<Integer> sourceContext) throws Exception {
while (running){
// no sleep here, so this source emits random integers as fast as it can
sourceContext.collect(random.nextInt());
}
}
@Override
public void cancel() {
running = false;
}
}
Flink's supported data types:
Basic types: Java primitive types and their wrapper classes, plus Void, String, Date, ……
Composite types: tuples (the most flexible), Scala case classes, Row, and POJOs (Flink's own rules for custom classes, similar to the Java bean pattern)
POJO and type hints: with Java lambdas, generic type information is erased at compile time, so .returns(...) must declare the produced type explicitly:
FlatMapOperator<String, Tuple2<String, Long>> wordAndOne = dataSource.flatMap((String line, Collector<Tuple2<String, Long>> out) -> {
String[] words = line.split(" ");
// convert each word to a (word, 1) tuple
for (String word : words) {
out.collect(Tuple2.of(word, 1L));
}
}).returns(Types.TUPLE(Types.STRING, Types.LONG));
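An equivalent way to supply the same hint is an anonymous TypeHint subclass, which captures the generic type (a sketch; requires org.apache.flink.api.common.typeinfo.TypeHint):
// the anonymous subclass preserves Tuple2<String, Long> against erasure
.returns(new TypeHint<Tuple2<String, Long>>() {});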